STAT 39000: Project 5 — Spring 2021

Motivation: One of the best things about learning to scrape data is the many applications of the skill that may pop into your mind. In this project, we want to give you some flexibility to explore your own ideas, but at the same time, add a couple of important tools to your tool set. We hope that you’ve learned a lot in this series, and can think of creative ways to utilize your new skills.

Context: This is the last project in a series focused on scraping data. We have created a couple of very common scenarios that can be problematic when first learning to scrape data, and we want to show you how to get around them.

Scope: python, web scraping, etc.

Learning objectives
  • Use the requests package to scrape a web page.

  • Use the lxml/selenium package to filter and parse data from a scraped web page.

  • Learn how to step around header-based filtering.

  • Learn how to handle rate limiting.

Make sure to read about and use the template found here, as well as the important information about project submissions here.

Questions

Question 1

It is not uncommon to be blocked from scraping a website. Websites use a variety of strategies to do this, and in general they work well. Typically, if a company wants you to extract information from their website, they will make an API (application programming interface) available for you to use. One method (commonly paired with other methods) is blocking your request based on its headers. You can read about headers here. You can think of headers as some extra data that gives the server or client context. Here is a list of headers, and some more explanation.

Each header has a purpose. One common header is called the User-Agent header. A User-Agent looks something like:

User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:86.0) Gecko/20100101 Firefox/86.0

You can see headers if you open the console in Firefox or Chrome and load a website. It will look something like:

![](./images/headers01.png)

From the mozilla link, this header is a string that "lets servers and network peers identify the application, operating system, vendor, and/or version of the requesting user agent." Basically, if you are browsing the internet with a common browser, the server will know what you are using. In the provided example, we are using Firefox 86 from Mozilla, on a Mac running Mac OS 10.16 with an Intel processor.

When we send a request from a package like requests in Python, here is what the headers look like:

import requests

response = requests.get("https://project5-headers.tdm.wiki")
# inspect the headers that requests attached to our request
print(response.request.headers)

{'User-Agent': 'python-requests/2.25.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

As you can see, our User-Agent is python-requests/2.25.1. You will find that many websites block requests made with user agents like this. One such website is: project5-headers.tdm.wiki.
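
The usual way around this is to supply your own headers. requests accepts a dictionary of headers with each request, so a minimal sketch (using a placeholder URL and an example browser user agent string) might look like:

import requests

# a sketch: override the default python-requests User-Agent with a headers
# dictionary; the URL and User-Agent string here are just examples
my_headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:86.0) Gecko/20100101 Firefox/86.0"}
response = requests.get("https://example.com", headers=my_headers)
print(response.request.headers)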

Scrape project5-headers.tdm.wiki from Scholar and explain what happens. What is the response code, and what does that response code mean? Can you ascertain what you would be seeing (more or less) in a browser based on the text of the response (the actual HTML)? Read this section of the documentation for the requests package, and attempt to "trick" project5-headers.tdm.wiki into presenting you with the desired information. The desired information should look something like:

Hostname: c1de5faf1daa
IP: 127.0.0.1
IP: 172.18.0.4
RemoteAddr: 172.18.0.2:34520
GET / HTTP/1.1
Host: project5-headers.tdm.wiki
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:86.0) Gecko/20100101 Firefox/86.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Encoding: gzip
Accept-Language: en-US,en;q=0.5
Cdn-Loop: cloudflare
Cf-Connecting-Ip: 107.201.65.5
Cf-Ipcountry: US
Cf-Ray: 62289b90aa55f975-EWR
Cf-Request-Id: 084d3f8e740000f975e0038000000001
Cf-Visitor: {"scheme":"https"}
Cookie: __cfduid=d9df5daa57fae5a4e425173aaaaacbfc91613136177
Dnt: 1
Sec-Gpc: 1
Upgrade-Insecure-Requests: 1
X-Forwarded-For: 123.123.123.123
X-Forwarded-Host: project5-headers.tdm.wiki
X-Forwarded-Port: 443
X-Forwarded-Proto: https
X-Forwarded-Server: 6afe64faffaf
X-Real-Ip: 123.123.123.123
Items to submit
  • Python code used to solve the problem.

  • Response code received (a number), and an explanation of what that HTTP response code means.

  • What you would (probably) be seeing in a browser if you were blocked.

  • Python code used to "trick" the website into being scraped.

  • The content of the successfully scraped site.

Question 2

Open a browser and navigate to: project5-rate-limit.tdm.wiki/. While at first glance it will seem identical to project5-headers.tdm.wiki/, it is not. project5-rate-limit.tdm.wiki/ is rate limited based on IP address. Depending on when you are completing this project, this may or may not be obvious. If you refresh your browser fast enough, instead of receiving a bunch of information, you will receive text that says "Too Many Requests".
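
If you would rather observe this from Python, a quick sketch like the following can make the rate limit visible; rate-limited responses typically come back with HTTP status code 429 ("Too Many Requests"), though exactly when the limit kicks in depends on how quickly you make requests:

import requests

# fire several requests in quick succession and print what comes back; once
# the rate limit kicks in, the body typically changes to "Too Many Requests"
for _ in range(15):
    resp = requests.get("https://project5-rate-limit.tdm.wiki")
    print(resp.status_code, resp.text[:50].strip())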

The following function tries to scrape the Cf-Request-Id header, which will have a unique value for each request:

import requests
import lxml.html

def scrape_cf_request_id(url):
    # request the page and parse the HTML
    resp = requests.get(url)
    tree = lxml.html.fromstring(resp.text)
    # grab the text of the first <p> element and split it into lines
    content = tree.xpath("//p")[0].text.split('\n')
    # keep only the line containing Cf-Request-Id and return the value after the colon
    cfid = [l for l in content if 'Cf-Request-Id' in l][0].split()[1]
    return cfid

You can test it out:

scrape_cf_request_id("https://project5-rate-limit.tdm.wiki")

Write code to scrape 10 unique `Cf-Request-Id`s (in a loop), and save them to a list called `my_ids`. What happens when you run the code? This happens because the text we expect is not present; instead, the page contains the text "Too Many Requests". While normally this error would be something that makes more sense, like an HTTPError or a Timeout exception, it could be anything, depending on your code.

One solution that might come to mind is to "wait" between each iteration of the loop using time.sleep(). While this may work, it is not a robust solution: requests from other users at your IP address may count towards your rate limit and cause your function to fail, the required sleep time may change dynamically or even be manually adjusted to be longer, etc. The best way to handle this is to use something called exponential backoff.

In a nutshell, exponential backoff is a way to increase the wait time (exponentially) until an acceptable rate is found. backoff is an excellent package to do just that. Upon being triggered by a specified error or exception, backoff will wait to "try again" until a certain amount of time has passed. Upon receiving the same error or exception again, the time to wait will increase exponentially. Use backoff to modify the provided scrape_cf_request_id function to use exponential backoff when the error we alluded to occurs. Test out the modified function in a loop and print the resulting 10 `Cf-Request-Id`s.

backoff utilizes decorators. For those interested in learning about decorators, this is an excellent article.
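
For reference, here is a minimal sketch of that decorator syntax, adapted from the backoff documentation; for this question you would decorate scrape_cf_request_id instead, and list whichever exception your own code raises when the rate limit kicks in:

import backoff
import requests

# retry get_url with exponentially increasing waits whenever it raises a
# RequestException, giving up after 8 attempts
@backoff.on_exception(backoff.expo, requests.exceptions.RequestException, max_tries=8)
def get_url(url):
    return requests.get(url)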

Items to submit
  • Python code used to solve the problem.

  • What happens when you run the function 10 times in a row?

  • Fixed code that will work regardless of the rate limiting.

  • 10 unique `Cf-Request-Id`s printed.

Question 3

You now have a great set of tools to be able to scrape pretty much anything you want from the internet. Now all that is left to do is practice. Find a course appropriate website containing data you would like to scrape. Utilize the tools you’ve learned about to scrape at least 100 "units" of data. A "unit" is just a representation of what you are scraping. For example, a unit could be a tweet from Twitter, a basketball player’s statistics from sportsreference, a product from Amazon, a blog post from your favorite blogger, etc.

The hard requirements are:

  • Documented code with thorough comments explaining what the code does.

  • At least 100 "units" scraped.

  • The data must be from multiple web pages.

  • Write at least 1 function (with a docstring) to help you scrape (see the sketch below this list for one possible shape).

  • A clear explanation of what your scraper scrapes, challenges you encountered (if any) and how you overcame them, and a sample of your data printed out (for example a head of a pandas dataframe containing the data).
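
To make the docstring requirement concrete, a minimal sketch of what such a helper function could look like is below; the URL, XPath, and field name are placeholders, not a prescription for any particular site:

import requests
import lxml.html

def scrape_page(url):
    """
    Scrape one listing page and return a list of dicts, one per "unit".

    The XPath and field name below are placeholders -- adapt them to the
    site you choose to scrape.
    """
    resp = requests.get(url)
    tree = lxml.html.fromstring(resp.text)
    titles = tree.xpath("//h2/text()")
    return [{"title": title.strip()} for title in titles]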

Items to submit
  • Python code that scrapes 100 units of data (with thorough comments explaining what the code does).

  • The data must be from more than a single web page.

  • 1 or more functions (with docstrings) used to help you scrape/parse data.

  • Clear documentation and explanation of what your scraper scrapes, challenges you encountered (if any) and how you overcame them, and a sample of your data printed out (for example using the head of a dataframe containing the data).