Web scraping, or public web data harvesting, is playing an increasing role in private-sector decision-making.
Today, the alternative data industry is worth nearly $7 billion. Although many experts believe web scraping is still far from reaching its full potential, recent Oxylabs research indicates that over 52% of UK financial companies already use automated processes to gather data. Most of the research participants (63%) employ alternative data to gain competitive business insights.
Despite the active use of non-traditional data sources in business, the public sector and academia are still lagging behind. Legal obscurity and complex public procurement procedures may be the main constraints on the public sector; academic circles, however, enjoy much more freedom. Why, then, do so many students and researchers on university campuses have only a vague understanding of web scraping's possibilities and tools?
Web scraping for science
Analyzing big data from alternative sources can help test and validate existing hypotheses and formulate new ones. It offers a much broader and, in certain cases, less biased perspective than traditional data sources. However, if you search for information on web scraping for science, you'll quickly notice that it mainly concerns data scientists and rarely spills over into other fields.
Despite the lack of awareness, the possibilities of alternative web data analysis in social, economic, or psychological studies are endless. For example, the Bank of Japan has been actively employing alternative data to inform its monetary policy. It uses mobility data, such as the nighttime population in selected areas in Tokyo and recreation and retail trends based on credit card spending, to assess economic activity.
During the COVID-19 pandemic, virology and psychology studies also gained valuable insights from alternative web data: localized Google search trends could predict outbreaks more accurately than other measures, while scraped public Twitter data helped researchers understand the general public's attitudes toward and experiences of remote work. Other prominent examples of alternative data in scientific research include depression and personality studies based on public social media activity, studies of weight stigma in reader comments beneath articles on obesity, and more.
The benefits of web scraping are easily observed in marketing and ecommerce research. Scientists can automate the collection of prices of specific goods (e.g., electronics, housing, and food) to calculate a consumer price index. Marketing researchers can track the same products sold under different conditions (e.g., discounted prices) to estimate how such factors influence not-fully-rational consumer behavior.
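To make the price-tracking idea concrete, here is a minimal sketch in Python using only the standard library. The HTML snippet, the `class="price"` markup, and the base-period prices are all hypothetical; in a real project the HTML would come from an HTTP response, and the cleaning logic would need to be far more robust.

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collects text from elements whose class attribute is 'price'."""
    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            # Strip the currency symbol; real pages need sturdier cleaning.
            self.prices.append(float(data.strip().lstrip("$")))
            self._in_price = False

# A static snippet keeps the example self-contained; normally this
# would be the body of a fetched product-listing page.
html = """
<ul>
  <li><span class="price">$499.00</span></li>
  <li><span class="price">$12.50</span></li>
</ul>
"""

parser = PriceParser()
parser.feed(html)

def price_index(current, base):
    """Ratio of mean current prices to mean base-period prices, times 100."""
    return 100 * (sum(current) / len(current)) / (sum(base) / len(base))

base_prices = [450.00, 10.00]  # hypothetical base-period observations
print(round(price_index(parser.prices, base_prices), 1))  # prints 111.2
```

Repeating this collection on a schedule yields exactly the kind of longitudinal price series that manual copy-and-paste cannot realistically sustain.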
Finally, web scraping public data is essential to the studies of artificial intelligence (AI) and machine learning (ML). AI and ML studies are becoming very popular, and almost any large university offers AI- and ML-related study programs. The challenge that students in these programs often face is a lack of proper datasets to train AI/ML algorithms. Public data scraping knowledge would help AI and ML students build quality datasets for more efficient machine learning.
One field in which public web data gathering is unavoidable is investigative journalism and political research. These types of research critically depend on unbiased and niche data that is rarely available in its full complexity through traditional data sources.
Investigative journalists and political scientists can use scrapers to study a wide range of issues: from tracking the influence of lobbyists by investigating visitor logs from government buildings to monitoring prohibited political ads and extremist groups on public social media platforms and forums. It can be argued that web scraping is critical to solving social problems and, as such, to the functioning of the democratic state itself and the rule of law.
The awareness gap
Web scraping isn’t a panacea for all scientific ills. It would hardly help the physical or life sciences perform experiments, but it can open the Holy Grail of data for social, economic, political, and, in some cases, clinical studies. Automated big data gathering is the breakthrough many scientists have long been waiting for. However, it suffers from several misconceptions.
In social sciences, academics sometimes rely on experiments or survey data just because this type of evidence seems easier to collect than harvesting web data. Even if students try to find the necessary information online, without formal education on web scraping, they usually resort to manual data entry (the glorious copy-and-paste technique), which is time-consuming and error-prone.
Popular sources of academic research data are large databases owned by public organizations or governmental institutions and datasets provided by businesses. Unfortunately, the perceived ease of this method comes at a cost. Government data is collected slowly, can quickly become outdated, and rarely offers fresh insights, since the same data points get (over)analyzed by thousands of scientists. Data provided by private organizations might be biased. If the information is sensitive, the business might insist on seeing the study’s final results, often producing so-called outcome reporting bias.
Countless sources of free alternative data on the web open the possibility of conducting unique research that would otherwise be impossible. It’s like having an infinite dataset that can be updated with nearly any information. Although web scraping definitely requires specific knowledge, today’s data gathering solutions allow users to extract massive amounts of alternative data with only basic programming skills. They can return data in real time, making scientific predictions more accurate, whereas traditional data collection methods often have a significant time lag.
It’s important to note that there’s rarely a good reason, in terms of either time or resources, for academics to build their own data scrapers and parsers from scratch. Third-party vendors can easily handle proxy management, CAPTCHA solving, or building unique fingerprints and parsing pipelines, so that scientists can fully devote their time to data analysis and research.
The fear of legal obscurity
Web scraping is surrounded by various legal concerns that discourage some researchers from leveraging public big data in their studies. Since the industry is relatively new and open to various players, there have indeed been cases of unprofessional activity. However, any digital tool can be used for both positive and negative purposes.
There is nothing inherently unethical about web scraping as it simply automates activities that people would otherwise do manually. We all know the most famous web scraper – the Googlebot – and depend on it daily. Web scraping is also extensively used in ecommerce – for example, large flight comparison websites scrape thousands of airlines’ sites to gather public pricing data. Getting the best deal for a trip to NYC depends on public web data gathering technologies.
Since web scraping involves some risks, academics often choose either to ditch it altogether and return to traditional data sources, or to scrape here and there while hoping no one notices. The best way out of this obscurity is to consult a legal practitioner before embarking on a major data harvesting project. Answering the following questions may also help a researcher evaluate the possible risks:
- Is the public data accumulated from human subjects? If yes, can it be subject to privacy laws (e.g., GDPR)?
- Does the website provide an API?
- Is web crawling or scraping prohibited by the website’s Terms of Service?
- Is the website’s data explicitly copyrighted or subject to intellectual property rights?
- Is the website’s data paid (i.e., does access require a subscription)?
- Is the data you need locked behind a login?
- Does the project involve illegal or fraudulent use of the data?
- Have you thoroughly read the robots.txt file and adapted your scrapers accordingly?
- Can crawling and scraping cause material damage to the website or the server hosting the website?
- Can scraping or crawling significantly impact the quality of service (e.g., speed) of the targeted website?
To promote ethical data gathering practices and industry-wide standards, together with other prominent DaaS companies, Oxylabs created the Ethical Web Data Collection Initiative. The consortium aims to build trust around web scraping and educate a wider tech community about big data possibilities.
Project 4β for free web data
The awareness gap around web scraping is probably the single main reason why academia is not utilizing this technology. To fill the gap and help academics gather big data with web scraping tools, Oxylabs launched a pro bono initiative called “Project 4β”. The initiative aims to transfer the technological expertise that Oxylabs accumulated over the years and grant universities and NGOs free access to data scraping tools, supporting important research on big data. “Project 4β” is also a safe space for academics to discuss what actions are appropriate and ethical according to the legal precedents formed over the last 20 years.
Through “Project 4β”, Oxylabs has already engaged in partnerships with professors and students from the University of Michigan, Northwestern University, and CODE – University of Applied Sciences, sharing knowledge about ethical web scraping challenges. Some of the provided educational resources are now integrated into graduate courses.
Also, for the last few years, Oxylabs has been actively working on pushing the frontiers of web scraping technology through AI- and ML-powered solutions. To stimulate the sharing of know-how, the company established an AI and ML Advisory Board, including five prominent academic and industry leaders. More active collaboration with academics would unlock even broader web scraping possibilities to tackle important social challenges.
On a final note
Web scraping has yet to gain traction in the public eye and in academia. However, with the sheer volume of web data increasing exponentially every year, big data analysis will gradually become an inevitable part of scientific research. Just as it is now routine to teach SPSS basics even in social science departments, it should become normal to familiarize students with web scraping practices.
It’s true that web scraping involves certain risks and ethical considerations – but so do scientific experiments in the lab. Even though organizations should always consult legal professionals before scraping, there are industry best practices that, when followed, minimize most of the risks associated with public web data collection.
About the Author
Juras Juršėnas is COO at Oxylabs.io. Oxylabs’ mission is to ensure that every company, big or small, has access to publicly available big data, on the view that public data gathering is essential for any company’s success. Oxylabs treats its clients as partners, so that both parties profit as much as possible from the relationship. Clients choose Oxylabs for its high-quality proxies, which support market research, ad verification, brand protection, travel fare aggregation, SEO monitoring, pricing intelligence, and more.
Featured image: ©A_B_C