Web scraping is revolutionizing academic and professional research by enabling the collection of big data.
Advanced collection practices allow higher levels of data extraction at faster rates, enabling new research opportunities in healthcare, finance, ecology, politics, and economics.
The Digital Landscape Makes New Research Possible
New data sources from across the world are continuously being created as people increasingly conduct business, personal, and professional transactions online. As these sources expand, researchers are finding new opportunities to develop their research and obtain new insights.
Advanced insights can also lead to new questions, creating a cycle that drives further research and increases understanding of the subject matter. As a result, researchers improve their findings, derive increasingly accurate conclusions, and produce better solutions to problems affecting people, businesses, and governments.
How Researchers Obtain High-Quality Data
Legacy data sources include journals, purchased data sets, and information collected manually from the internet. Besides being resource-intensive, these methods typically require hours of tedious, error-prone manual entry into spreadsheets.
Today’s research landscape is vastly superior. Researchers now access a trove of online data covering nearly every subject. Examples include financial websites with historical stock information, public databases with clinical drug trials, and online marketplaces with detailed product and pricing information.
Modern data gathering methods enable researchers to extract that information at scale and automatically update their databases. For example, imagine an online resource with thousands of stocks, including historical pricing information, current news, and trading volumes. Web scraping makes it possible to make thousands of data requests from that website per second and deliver the information in a spreadsheet format that analysts can easily read.
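As a minimal sketch of that last step, the snippet below parses a price table out of raw HTML and emits it as CSV, the format most spreadsheet tools open directly. The HTML fragment and ticker symbols are made up for illustration; only Python's standard library is used.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical HTML fragment standing in for a financial site's price table.
HTML = """
<table id="prices">
  <tr><th>Ticker</th><th>Close</th><th>Volume</th></tr>
  <tr><td>AAA</td><td>101.50</td><td>12000</td></tr>
  <tr><td>BBB</td><td>87.25</td><td>34000</td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, row by row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
        elif tag == "tr":
            self._row = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self._row.append(data.strip())

parser = TableParser()
parser.feed(HTML)

# Write the extracted rows to CSV that any spreadsheet tool can open.
buf = io.StringIO()
csv.writer(buf).writerows(parser.rows)
print(buf.getvalue())
```

In a real pipeline, the `HTML` string would come from fetched pages, and the same parse-then-export loop would run on a schedule to keep the database current.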
How Web Scraping Works
Advanced web scraping requires the creation of scripts (or “bots”) written in a programming language like Python to crawl websites and extract data. Alternatively, smaller or personal data extraction projects can be executed using browser extensions that parse website HTML and export the information in a spreadsheet format.
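The crawling half of such a bot boils down to discovering links on each page and queueing them for later visits. A toy version, again with standard-library tools and a made-up page, might look like this:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collects href targets so a crawler can queue them for later visits."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url, self.links = base_url, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

# A hypothetical archive page with two report links.
PAGE = '<a href="/reports/2020">2020</a> <a href="/reports/2021">2021</a>'
parser = LinkParser("https://example.org/archive")
parser.feed(PAGE)
print(parser.links)
```

A production bot would fetch each discovered URL in turn (politely, respecting robots.txt and rate limits) and hand the page body to an extraction routine like the table parser above.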
Another alternative is a web scraping API that can be easily customized. Researchers opting for this solution can quickly extract information at scale and avoid many common process challenges, allowing them to focus on obtaining insights for research purposes.
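The general shape of that workflow can be sketched without committing to any particular vendor. The endpoint name, parameter names, and JSON envelope below are assumptions for illustration; real scraping APIs differ in naming, but the pattern (send a target URL, receive structured data) is similar.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint -- not a real service.
API_ENDPOINT = "https://scraper.example.com/v1/extract"

def build_request_url(target_url, render_js=False):
    """Compose the API call that asks the service to scrape `target_url`."""
    params = {"url": target_url, "render": "true" if render_js else "false"}
    return API_ENDPOINT + "?" + urlencode(params)

def parse_response(raw_json):
    """Pull the extracted records out of an (assumed) JSON envelope."""
    payload = json.loads(raw_json)
    return payload.get("results", [])

print(build_request_url("https://example.org/prices"))
# A canned response standing in for what such a service might return:
print(parse_response('{"results": [{"ticker": "AAA", "close": 101.5}]}'))
```

The appeal for researchers is that retries, proxy rotation, and JavaScript rendering are handled behind that one request, leaving only the `parse_response` side to maintain.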
9 Research Projects Enhanced By Web Scraping
Web scraping enables new research into economics, healthcare, ecology, and politics by allowing researchers to gather data from emerging online resources. Without automation, some of these projects would have required hundreds of hours of manual data collection, entry, and processing.
Some recent examples include:
Opioid-Related Death Tracking
Oxford researchers downloaded over 3,000 PDF documents to study opioid-related deaths in the United Kingdom. Web scraping made it possible to scale the project considerably and freed the team for other research tasks. “We could manually screen and save about 25 case reports every hour,” reads an article in Nature describing the project. “Now, our program can save more than 1,000 cases per hour while we work on other things, a 40-fold time saving.”
Automating data collection also opened up collaboration. By publishing the database and frequently re-running the program, researchers enriched the project by sharing findings with the academic community.
Tracking Clinical Trials
The Oxford researchers studying opioid deaths in the previous example also used web scraping to gather information from clinical-trial registries to further develop their published data set tracking primary-care prescribing in England.
GDP Predictions
Government entities typically announce gross domestic product (GDP) on a quarterly basis. Web scraping enables researchers to make GDP predictions more frequently.
GDP is calculated by adding consumption, investment, government spending, and net exports. Most of these components are higher-frequency metrics that can be scraped from online sources, allowing for the creation of models that predict GDP ahead of official announcements.
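The identity behind those models is simple enough to state in code. The figures below are made up purely to illustrate the arithmetic:

```python
def gdp_expenditure(consumption, investment, government, exports, imports):
    """GDP = C + I + G + (X - M), the standard expenditure identity."""
    return consumption + investment + government + (exports - imports)

# Illustrative (made-up) quarterly figures, in billions:
print(gdp_expenditure(consumption=14000, investment=3500,
                      government=3800, exports=2500, imports=3100))  # 20700
```

A nowcasting model replaces each argument with a scraped high-frequency proxy (card transactions for consumption, customs or shipping data for trade, and so on) and updates the estimate as new observations arrive.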
Central banks throughout the world currently leverage this method, including those of the United States, the European Union, South Africa, China, Brazil, and Japan.
Alternative Data Analysis
The Bank of Japan (BOJ) actively uses alternative data – information outside “official” government and corporate reports – to evaluate key economic sectors and develop policy. Recent applications include the collection of mobility data during COVID-19 that revealed pedestrian traffic, financial transactions, and airport visits.
Inflation Nowcasting
Researchers from Poland gathered food and non-alcoholic beverage prices from major online retailers and created a framework to estimate inflation rates in the near term (also called a “nowcast”). They demonstrated that accounting for online food prices in a simple, recursively optimized model effectively predicts inflation and even outperforms traditional approaches.
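A toy version of that recursive setup: at each step, a regression of inflation on scraped food-price growth is refit on all data observed so far, then used to nowcast the next period. The numbers are synthetic and the Polish study's actual model is richer; this only illustrates the expanding-window mechanics.

```python
def fit_ols(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

food_growth = [0.2, 0.4, 0.3, 0.5, 0.6]   # scraped online food-price changes
inflation   = [0.3, 0.5, 0.4, 0.6, 0.7]   # official inflation, same periods

# Recursive scheme: refit on an expanding window, nowcast the next point.
nowcasts = []
for t in range(2, len(food_growth)):
    a, b = fit_ols(food_growth[:t], inflation[:t])
    nowcasts.append(round(a + b * food_growth[t], 3))
print(nowcasts)
```

The payoff is timing: online prices are observed daily, so the nowcast can be refreshed long before the official inflation figure is published.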
Unemployment Insurance Claims
Unemployment reached all-time highs during COVID-19, underscoring the need to forecast jobless rates. In a recent paper, researchers explored which information sets and data structures from the spring of 2020 best predicted job losses in the United States and how they can be used to forecast unemployment insurance claims.
Ecological Research
Environmental researchers are extracting data from Google Trends, news articles, and social media to get insights into species occurrences, behaviors, traits, phenology, functional roles, and abiotic environmental features. Referred to as “iEcology”, this emerging research approach aims to quantify patterns and processes in the natural environment using digital data from public sources.
Political Campaign Research
Internet users are becoming increasingly vocal about political matters on social media networks and public forums. Political groups are leveraging this trend by scraping online sources to identify critical issues and using that data to formulate campaign content.
ESG Compliance Tracking
Environmental, Social, and Governance (ESG) investment guidelines are designed to address climate change concerns, greenhouse gas emissions, water management, and waste reduction. Investment managers and financial analysts can assess an entity’s adherence to these guidelines by scraping online databases containing ESG data.
Discover scraping solutions for your next research project
Publicly available online sources can be scraped in various ways depending on your project’s size and scope. Discover the best solution for your needs by reading our free guide: Choosing the Right Scraping Solution in 2023: Essentials You Need to Know.
About the Author
Aleksandras Šulženko is Product Owner at Oxylabs. Oxylabs’ mission is to ensure that every company, big or small, has access to publicly available big data. We believe public data gathering is essential to every company’s success, and we treat our clients as partners so that both sides benefit from the relationship. Clients choose us because we offer the highest-quality proxies, helping our customers with market research, ad verification, brand protection, travel fare aggregation, SEO monitoring, pricing intelligence, and more.
Featured image: ©GHart