Web Scraping

Web scraping is the automated retrieval of data from web pages. It is one of the foundational tools of the modern web; for instance, Google began as a web scraping system on a massive scale (such systems are typically called crawlers), designed to scrape the entire web. While Google has since grown beyond its crawler, web scraping remains an important way for many organizations, not just Google, to retrieve data automatically.
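To make the crawler idea concrete, here is a minimal sketch of one step a crawler repeats: pulling the links out of a page so they can be visited next. It uses only Python's standard library. The names `LinkCollector`, `extract_links`, and `SAMPLE_HTML` are hypothetical, and the page content is hard-coded for illustration; a real crawler would download pages over HTTP and would also need to respect each site's terms and robots.txt.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> found in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


# Hypothetical page content, hard-coded so the example runs offline.
SAMPLE_HTML = """
<html><body>
  <a href="/about">About</a>
  <a href="https://example.org/news">News</a>
  <p>No link here.</p>
</body></html>
"""


def extract_links(html, base_url):
    """Return the absolute URLs of all links in the given HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


links = extract_links(SAMPLE_HTML, "https://example.com/index.html")
print(links)
```

A crawler would add each returned URL to a queue of pages to fetch, then repeat the same extraction on every page it downloads.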

Common Applications

Common Industries

Many industries use web scraping to keep track of market changes, product prices, and more. For instance:

  • Retail: scraping product prices from a competitor's website to automatically keep track of their prices

  • Marketing/Advertising/Communication: companies like Syften monitor web communications using scraping systems and use that to help companies find customers talking about their products (or their competitors' products)

  • Social Media: one of the biggest targets for scraping, because public sentiment can be analyzed here (see Sentiment Analysis)
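The retail case above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the HTML snippet, the `product-name` and `product-price` class names, and the helpers `PriceParser` and `scrape_prices` are all hypothetical, and a real site will have its own markup that the parser would need to be adapted to.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Pair each product name with its price, based on assumed CSS classes."""

    def __init__(self):
        super().__init__()
        self._field = None   # "name" or "price" while inside a tagged element
        self._current = {}   # product currently being assembled
        self.products = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "product-name" in classes:
            self._field = "name"
        elif "product-price" in classes:
            self._field = "price"

    def handle_data(self, data):
        text = data.strip()
        if self._field is None or not text:
            return
        if self._field == "name":
            self._current["name"] = text
        else:
            # Turn a string such as "$1,049.50" into the number 1049.5.
            self._current["price"] = float(text.lstrip("$").replace(",", ""))
            self.products.append(self._current)
            self._current = {}
        self._field = None


# Hypothetical competitor page, hard-coded so the example runs offline.
SAMPLE_HTML = """
<ul>
  <li><span class="product-name">Kettle</span> <span class="product-price">$24.99</span></li>
  <li><span class="product-name">Toaster</span> <span class="product-price">$1,049.50</span></li>
</ul>
"""


def scrape_prices(html):
    """Return a list of {"name": ..., "price": ...} dicts found in the HTML."""
    parser = PriceParser()
    parser.feed(html)
    return parser.products


for product in scrape_prices(SAMPLE_HTML):
    print(product["name"], product["price"])
```

Running a scraper like this on a schedule and storing the results is one way to track how a competitor's prices change over time.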

Code Examples

All of the code examples are written in Python, unless otherwise noted.

Containers

These are code examples in the form of Jupyter notebooks running in a container that comes with all the data, libraries, and code you’ll need to run them. Click here to learn why you should be using containers, along with how to do so.
# Pull the container (only needs to be run once)
docker pull ghcr.io/thedatamine/starter-guides:web-scraping-intro

# Run the container
docker run -p 8888:8888 -it ghcr.io/thedatamine/starter-guides:web-scraping-intro

Need help implementing any of this code? Feel free to reach out to [email protected] and we can help!