TDM 20200: Web Scraping Project 1 — Spring 2025

Motivation: Happy New Year! In this project, we will get comfortable scraping some data from the internet in Python.

Context: In this project, we will use BeautifulSoup, documented here: pypi.org/project/beautifulsoup4/

Scope: Web Scraping in Python

Learning Objectives:
  • Learn the basics of web scraping using Beautiful Soup in Python

Make sure to read about and use the template found here, and to review the important information about project submissions here.

Questions

Question 1 (2 pts)

When scraping data from websites using Beautiful Soup, we first import requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup

Afterwards, we can use requests to scrape the data from a website, such as (for instance) The Data Mine staff directory:

myresponse = requests.get("https://datamine.purdue.edu/about/about-welcome/")

and then we can parse this content using BeautifulSoup:

mysoup = BeautifulSoup(myresponse.content, 'html.parser')

Afterwards, we can use select statements to extract elements from the website. For instance, the names of The Data Mine staff members are given as the text within the p tags with class attribute purdue-home-cta-grid__card-name.

We can extract the names of all 22 staff members at once by selecting the data in these p tags with class attribute purdue-home-cta-grid__card-name from the page, as follows:

mysoup.select('p[class = "purdue-home-cta-grid__card-name"]')

With a list comprehension, we can get these 22 names into a list:

[element.text for element in mysoup.select('p[class = "purdue-home-cta-grid__card-name"]')]

Now that you have the 22 staff members' names in a list, use a similar operation to extract the 22 staff members' job titles.

Just like Dr Ward in the video, you might need to use Command-U (Mac) or Control-U (Windows, Unix) in Firefox to inspect the page and find the attributes that you want.

Finally, make a Pandas data frame with 22 rows and 2 columns, namely, 1 row per staff member, with their name in the left column and their job title in the right column.
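
As a rough sketch of that last step (combining two lists into a data frame), the pattern looks like the following; the class attribute used for the job titles is only a placeholder, which you will need to replace with whatever you find by inspecting the page:

import pandas as pd

# the 22 staff names, extracted as above
mynames = [element.text for element in mysoup.select('p[class = "purdue-home-cta-grid__card-name"]')]

# placeholder: replace SOME-TITLE-CLASS with the class attribute you find for the job titles
mytitles = [element.text for element in mysoup.select('p[class = "SOME-TITLE-CLASS"]')]

# one row per staff member: name in the left column, job title in the right column
mydf = pd.DataFrame({'name': mynames, 'title': mytitles})
mydf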

Deliverables
  • Use Python to make a Pandas data frame with 22 rows and 2 columns, namely, 1 row per staff member, with their name in the left column and their job title in the right column.

  • Be sure to document your work from Question 1, using some comments and insights about your work.

Question 2 (2 pts)

The National Park Service homepage at www.nps.gov lists 56 states and territories. Their names are given as the text within the a tags with class attribute dropdown-item dropdown-state.

Extract the 56 names of the states and territories, and remove the whitespace from the text, using the strip() function.

Now that you have these 56 names, we can get the locations of the webpages devoted to each state and territory by extracting the href attribute from each tag. If your data is stored in element, then the href attribute can be retrieved as element['href']. Prepend the string 'https://www.nps.gov' to each of these href strings.

Finally, make a Pandas data frame with 56 rows and 2 columns, namely, 1 row per state or territory, with their name in the left column and a string displaying the URL for that state or territory in the right column.

For instance, the row for Indiana should have 'Indiana' in the left column and 'https://www.nps.gov/state/in/index.htm' in the right column.
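
One possible way to assemble this data frame is sketched below, using the class attribute described above; your own selector and column names may differ:

import requests
import pandas as pd
from bs4 import BeautifulSoup

myresponse = requests.get("https://www.nps.gov")
mysoup = BeautifulSoup(myresponse.content, 'html.parser')

# the a tags with class attribute "dropdown-item dropdown-state" hold the state and territory names
mytags = mysoup.select('a[class = "dropdown-item dropdown-state"]')

# strip the whitespace from each name, and prepend the site to each href
mynames = [element.text.strip() for element in mytags]
myurls = ['https://www.nps.gov' + element['href'] for element in mytags]

# one row per state or territory: name in the left column, URL in the right column
mydf = pd.DataFrame({'state': mynames, 'url': myurls})
mydf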

Deliverables
  • Use Python to make a Pandas data frame with 56 rows and 2 columns, namely, 1 row per state or territory, with their name in the left column and a string displaying the URL for that state or territory in the right column.

  • Be sure to document your work from Question 2, using some comments and insights about your work.

Question 3 (2 pts)

The demo website Books To Scrape does not have real prices for books. It is only a demonstration website, located at books.toscrape.com/

This website has numerous categories in the left-hand sidebar. The category names sit in a nested list: the sidebar's outer li tag contains a ul tag, and each li tag inside that ul contains an a tag. The names of the categories are the text within those a tags.

Extract the 50 category names as the text within the a tags, and remove the whitespace from the text, using the strip() function. Hint: 'Travel' should be the first category, and 'Crime' should be the last category.

Now that you have these 50 categories, we can get the locations of the webpages devoted to each category by extracting the href attribute from each tag. If your data is stored in element, then the href attribute can be retrieved as element['href']. Prepend the string 'https://books.toscrape.com/' to each of these href strings.

(As a very minor point for sharp readers: in Question 2, we prepended 'https://www.nps.gov' without an additional forward slash, because on the NPS website the slash was already included in the href attribute.)

Finally, make a Pandas data frame with 50 rows and 2 columns, namely, 1 row per category, with their name in the left column and a string displaying the URL for that category in the right column.

For instance, the row for Poetry should have 'Poetry' in the left column and 'https://books.toscrape.com/catalogue/category/books/poetry_23/index.html' in the right column.

Just like in Question 2, please be sure to make a data frame with 2 columns and (in this case) 50 rows, with 1 category and 1 URL per row.
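
A minimal sketch of this question follows; the nested CSS selector is only an assumption based on the structure described above, so verify it against the page source:

# using the same imports as above (requests, BeautifulSoup, and pandas as pd)
myresponse = requests.get("https://books.toscrape.com/")
mysoup = BeautifulSoup(myresponse.content, 'html.parser')

# the 50 category links sit in an inner ul nested inside an li of the sidebar list
mytags = mysoup.select('li ul li a')

# strip the whitespace from each category name, and prepend the site to each href
mycategories = [element.text.strip() for element in mytags]
myurls = ['https://books.toscrape.com/' + element['href'] for element in mytags]

# one row per category: name in the left column, URL in the right column
mydf = pd.DataFrame({'category': mycategories, 'url': myurls})
mydf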

Deliverables
  • Use Python to make a Pandas data frame with 50 rows and 2 columns, namely, 1 row per category, with their name in the left column and a string displaying the URL for that category in the right column.

  • Be sure to document your work from Question 3, using some comments and insights about your work.

Question 4 (2 pts)

For academic purposes only, we now extract a Peanuts comic from the internet. As many students know, Dr Ward loves the Woodstock character from the Peanuts comic strip. Although Woodstock first appeared on March 4, 1966, he was not named until June 22, 1970. We can extract the comic from June 22, 1970, as follows:

Load the comic at this website: www.gocomics.com/peanuts/1970/06/22

In Firefox, right-click on the comic (or Control-click on a Mac) and choose "Inspect" to examine the image. If we look at some of the HTML content for the picture, we will see:

<img class="img-fluid lazyloaded" srcset="https://assets.amuniversal.com/2181aa70f895013014ff001dd8b71c47 900w" data-srcset="https://assets.amuniversal.com/2181aa70f895013014ff001dd8b71c47 900w" sizes="
                       (min-width: 992px) 900px,
                       (min-width: 768px) 600px,
                       (min-width: 576px) 300px,
                       900px" width="100%" alt="Peanuts Comic Strip for June 22, 1970 " src="https://assets.amuniversal.com/2181aa70f895013014ff001dd8b71c47">

In particular, if we look for an img tag whose alt attribute has the value 'Peanuts Comic Strip for June 22, 1970 ', then we can extract the src attribute. Hint: On this website, it is necessary to include the space after the year in the string.

Verify that this URL contains the comic for the day that Woodstock got named: assets.amuniversal.com/2181aa70f895013014ff001dd8b71c47
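
A hedged sketch of this verification step is given below; note the trailing space in the alt value, and keep in mind that some websites serve different markup to scripts than to browsers, so the request and selector may need adjusting:

# using the same imports as above (requests and BeautifulSoup)
myresponse = requests.get("https://www.gocomics.com/peanuts/1970/06/22")
mysoup = BeautifulSoup(myresponse.content, 'html.parser')

# select the img tag whose alt attribute matches exactly, including the trailing space
myimage = mysoup.select('img[alt="Peanuts Comic Strip for June 22, 1970 "]')[0]

# this should print the assets.amuniversal.com URL given above
print(myimage['src'])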

Now load the Peanuts comic for two other days, and explain your steps. In particular, specify which two other days you explored, and give the location of the comic image for those two days, just like for June 22, 1970, the comic image is located here: assets.amuniversal.com/2181aa70f895013014ff001dd8b71c47

Deliverables
  • Verify that this URL contains the comic for the day that Woodstock got named: assets.amuniversal.com/2181aa70f895013014ff001dd8b71c47

  • For two additional days of your choice, give the days and the locations of the Peanuts comic image for those two days.

  • Be sure to document your work from Question 4, using some comments and insights about your work.

Question 5 (2 pts)

This website www.scrapethissite.com/pages/forms/ has data about hockey teams, which students can use to practice scraping tables.

We can view 100 rows of this data at a time, for instance, as follows: www.scrapethissite.com/pages/forms/?page_num=4&per_page=100, which gives the 4th page of the data. In other words, this page shows rows 301 through 400.

Indeed, there are only 582 rows altogether. By asking for 582 or more rows at a time, in this particular website, we can actually get all 582 rows at once, like this: www.scrapethissite.com/pages/forms/?per_page=600

(This is website dependent! Not every website will allow you to do this.)

Now we can extract the entire table from this website. First we need to import Pandas, and also StringIO from the io module:

import pandas as pd
from io import StringIO

Then, as in the previous two questions, we can extract the contents of the website as follows:

myresponse = requests.get("https://www.scrapethissite.com/pages/forms/?per_page=600")
mysoup = BeautifulSoup(myresponse.content, 'html.parser')

and then we can read the entire table, using StringIO and Pandas, as follows:

pd.read_html(StringIO(myresponse.text))[0]

which will show rows 0 through 4 and also rows 577 through 581.
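
If you want to display exactly those rows, one possibility (sketched below) is to store the data frame and then concatenate its head and tail:

# store the first (and only) table on the page into a data frame
mydf = pd.read_html(StringIO(myresponse.text))[0]

# rows 0 through 4, followed by rows 577 through 581
pd.concat([mydf.head(), mydf.tail()])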

Deliverables
  • Extract all 582 rows and 9 columns of the hockey data into a Pandas data frame. Display rows 0 through 4 and also rows 577 through 581.

  • Be sure to document your work from Question 5, using some comments and insights about your work.

Submitting your Work

Please make sure that you added comments for each question, which explain your thinking about your method of solving each question. Please also make sure that your work is your own work, and that any outside sources (people, internet pages, generative AI, etc.) are cited properly in the project template.

Congratulations! Assuming you’ve completed all the above questions, you’ve just finished your first project for TDM 20200! If you have any questions or issues regarding this project, please feel free to ask in seminar, over Piazza, or during office hours.

Prior to submitting your work, you need to put your work into the project template, re-run all of the code in your Jupyter notebook, and make sure that the results of running that code are visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.

Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points. We hope your first project with us went well, and we look forward to continuing to learn with you on future projects!!

Items to submit
  • firstname_lastname_project1.ipynb

It is necessary to document your work, with comments about each solution. All of your work needs to be your own, and any outside sources (people, internet pages, generative AI, etc.) must be cited properly in the project template.

You must double check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output, even though it may not.

Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.