TDM 20200: Web Scraping Project 2 — Spring 2025
Motivation: Now that we know how to use Beautiful Soup, we will learn comparable methods for web scraping with lxml. This project will look very similar to the previous project.
Context: In this project, we will use lxml, documented here: pypi.org/project/lxml/
Scope: Web Scraping in Python
Dataset(s)
In this project we will scrape data from the following websites: The Data Mine staff directory (datamine.purdue.edu), www.nps.gov, books.toscrape.com, and www.gocomics.com/peanuts.
Our scraping will be parallel to the examples from Project 1.
Questions
Question 1 (2 pts)
When scraping data from websites using lxml, we again import requests, but this time we also import from lxml, like this:
import requests
from lxml import html
Afterwards, just like in the previous project, we can use requests to scrape the data from a website, such as (for instance) The Data Mine staff directory:
myresponse = requests.get("https://datamine.purdue.edu/about/about-welcome/")
and then we can parse this content using html.fromstring (which comes from lxml) like this:
mytree = html.fromstring(myresponse.content)
When selecting the parts of the webpage where we want to extract elements, this time we will use an xpath to the location that we want in the webpage. The specification for an xpath looks a lot like the way that we specify a location in BeautifulSoup, but there are a few differences. We start the string for the location with //, which indicates that we want to look in the entire document. We also put an @ symbol (an at sign) before each attribute.
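For a concrete (and purely illustrative) look at this syntax, here is a tiny standalone snippet using made-up HTML rather than any of the project websites:
from lxml import html
# hypothetical HTML, just to illustrate the // and @ pieces of an xpath
snippet = html.fromstring('<div><p class="name">Alice</p><p class="name">Bob</p></div>')
# '//' searches the whole document; '@class' refers to the class attribute
print([p.text for p in snippet.xpath('//p[@class = "name"]')])  # prints ['Alice', 'Bob']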
We can extract the names of all 22 staff members at once, by selecting the data in p tags with class attribute purdue-home-cta-grid__card-name from the page. Now that we are using lxml instead of Beautiful Soup, we change this statement:
mysoup.select('p[class = "purdue-home-cta-grid__card-name"]')
to INSTEAD use this statement:
mytree.xpath('//p[@class = "purdue-home-cta-grid__card-name"]')
which gives us the 22 elements we wanted to find. Now, just like we did with Beautiful Soup, we can use a list comprehension to get these 22 names into a list. Our previous statement from Beautiful Soup
[element.text for element in mysoup.select('p[class = "purdue-home-cta-grid__card-name"]')]
will INSTEAD look like this with lxml:
[element.text for element in mytree.xpath('//p[@class = "purdue-home-cta-grid__card-name"]')]
Now that you have the 22 staff members' names in a list, use a similar operation in lxml (instead of Beautiful Soup) to also extract the 22 staff members' job titles.
Finally, make a Pandas data frame with 22 rows and 2 columns, namely, 1 row per staff member, with their name in the left column and their job title in the right column.
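As a hedged sketch, the full Question 1 workflow might look something like the following. Please note that the class attribute used below for the job titles (purdue-home-cta-grid__card-title) is only a guess for illustration; inspect the staff directory page yourself (for example, with your browser's developer tools) to find the actual class attribute for the titles.
import requests
import pandas as pd
from lxml import html
# fetch and parse the staff directory page
myresponse = requests.get("https://datamine.purdue.edu/about/about-welcome/")
mytree = html.fromstring(myresponse.content)
# staff names, exactly as described above
names = [element.text for element in mytree.xpath('//p[@class = "purdue-home-cta-grid__card-name"]')]
# job titles: the class attribute here is hypothetical; replace it with the one you find on the page
titles = [element.text for element in mytree.xpath('//p[@class = "purdue-home-cta-grid__card-title"]')]
# one row per staff member: name in the left column, job title in the right column
mydf = pd.DataFrame({"name": names, "title": titles})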
You should NOT import BeautifulSoup for this project. We are using lxml instead.
- Use lxml INSTEAD of Beautiful Soup to make a Pandas data frame with 22 rows and 2 columns, namely, 1 row per staff member, with their name in the left column and their job title in the right column. You should NOT import BeautifulSoup for this project.
- Be sure to document your work from Question 1, using some comments and insights about your work.
Question 2 (2 pts)
As in the previous project, we will extract the 56 names of the states and territories from the National Park Service homepage at www.nps.gov (again) using the fact that their names are given as the text after the a tags with class attribute dropdown-item dropdown-state. We will (as before) remove the whitespace from the text, using the strip() function.
To get the locations of the webpages devoted to each state and territory, since we are working with lxml, instead of using element['href'] to extract the href attribute from each element, we will use element.attrib['href'], and (as before) we will append the string 'https://www.nps.gov' to the front of each string.
Finally, make a Pandas data frame with 56 rows and 2 columns, namely, 1 row per state or territory, with their name in the left column and a string displaying the URL for that state or territory in the right column.
For instance, the row for Indiana should have 'Indiana' in the left column and 'https://www.nps.gov/state/in/index.htm' in the right column.
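As a hedged sketch (assuming, as in Project 1, that the state and territory links still carry the class attribute dropdown-item dropdown-state), the workflow might look something like this:
import requests
import pandas as pd
from lxml import html
# fetch and parse the National Park Service homepage
myresponse = requests.get("https://www.nps.gov")
mytree = html.fromstring(myresponse.content)
# select the 56 state/territory links by their class attribute
elements = mytree.xpath('//a[@class = "dropdown-item dropdown-state"]')
# the names are the (whitespace-padded) text of the a tags
names = [element.text.strip() for element in elements]
# with lxml, attributes live in element.attrib rather than element[...]
urls = ['https://www.nps.gov' + element.attrib['href'] for element in elements]
# one row per state or territory: name on the left, URL on the right
mydf = pd.DataFrame({"state": names, "url": urls})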
- Use lxml INSTEAD of Beautiful Soup to make a Pandas data frame with 56 rows and 2 columns, namely, 1 row per state or territory, with their name in the left column and a string displaying the URL for that state or territory in the right column.
- Be sure to document your work from Question 2, using some comments and insights about your work.
Question 3 (2 pts)
Just like in Project 1, Question 3, we would like you to extract from books.toscrape.com/ each of the 50 category types and the 50 URLs corresponding to each of these categories. In this way, you can make a data frame with 50 rows and 2 columns, namely, 1 row per category, with their name in the left column and a string displaying the URL for that category in the right column.
The names of the categories are given inside an li tag, then a ul tag, then an li tag, and then an a tag (that is, a nested set of li tags with a ul in between). The names of the categories are the text after the a tags.
Instead of using:
mysoup.select('li ul li a')
we will use an xpath like this:
mytree.xpath('//li/ul/li/a')
Extract the 50 category types as the text after the a tags, and remove the whitespace from the text, using the strip() function. Hint: 'Travel' should be the first category, and 'Crime' should be the last category.
Now that you have these 50 categories, we can get the locations of the webpages devoted to each category, by extracting the href attribute from each tag. If your data is stored in element, then (since we are using lxml) the href attribute can be retrieved as element.attrib['href']. Append the string 'https://books.toscrape.com/' to the front of each string.
(As a very minor point for sharp readers: In question 2, we appended 'https://www.nps.gov' without an additional forward slash, because in the NPS website, the slash was already in the href attribute.)
Finally, make a Pandas data frame with 50 rows and 2 columns, namely, 1 row per category, with their name in the left column and a string displaying the URL for that category in the right column.
For instance, the row for Poetry should have 'Poetry' in the left column and 'https://books.toscrape.com/catalogue/category/books/poetry_23/index.html' in the right column.
Just like in Question 2, please be sure to make a data frame with 2 columns and (in this case) 50 rows, with 1 book topic and 1 book URL per line.
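Here is a hedged sketch of how this might look with lxml; the strip() call is needed because the category names are padded with whitespace, as noted above:
import requests
import pandas as pd
from lxml import html
# fetch and parse the books.toscrape.com home page
myresponse = requests.get("https://books.toscrape.com/")
mytree = html.fromstring(myresponse.content)
# the 50 category links sit at //li/ul/li/a
elements = mytree.xpath('//li/ul/li/a')
# category names are the (heavily padded) text of the a tags
categories = [element.text.strip() for element in elements]
# the href values are relative here, so we prepend the site root (with the trailing slash, as discussed above)
urls = ['https://books.toscrape.com/' + element.attrib['href'] for element in elements]
# one row per category: name on the left, URL on the right
mydf = pd.DataFrame({"category": categories, "url": urls})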
- Use Python to make a Pandas data frame with 50 rows and 2 columns, namely, 1 row per category, with their name in the left column and a string displaying the URL for that category in the right column.
- Be sure to document your work from Question 3, using some comments and insights about your work.
Question 4 (2 pts)
For academic purposes only, we now extract the Peanuts comic strip for June 22, 1970 from the URL www.gocomics.com/peanuts/1970/06/22, but this time we use lxml instead of Beautiful Soup.
Instead of using:
mysoup.select('img[alt = "Peanuts Comic Strip for June 22, 1970 "]')
we will use an xpath like this:
mytree.xpath('//img[@alt = "Peanuts Comic Strip for June 22, 1970 "]')
In this way, using lxml instead of Beautiful Soup, we can extract the URL for the comic, for this particular day (June 22, 1970) that Woodstock got named: assets.amuniversal.com/2181aa70f895013014ff001dd8b71c47
Now use lxml to find and check the location of the Peanuts comic for two additional days, and explain your steps.
We want you to (please) extract the URL for the image of the Peanuts comic strip for June 22, 1970, and also for two additional days of your choice.
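As a hedged sketch (assuming the comic's image URL is stored in the img tag's src attribute, which is worth confirming in your browser's developer tools; note also that some websites expect extra request headers, so the plain request shown here may need adjusting):
import requests
from lxml import html
# fetch and parse the page for the June 22, 1970 Peanuts strip
myresponse = requests.get("https://www.gocomics.com/peanuts/1970/06/22")
mytree = html.fromstring(myresponse.content)
# select the img tag by its alt attribute, as described above
elements = mytree.xpath('//img[@alt = "Peanuts Comic Strip for June 22, 1970 "]')
# assumption: the comic's URL is in the src attribute of that img tag
print(elements[0].attrib['src'])
For the two additional days, the same pattern should work once the date in the URL and in the alt text are changed to match the chosen days.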
- Verify that this URL contains the comic for the day that Woodstock got named: assets.amuniversal.com/2181aa70f895013014ff001dd8b71c47
- For two additional days of your choice, give the days and the locations of the Peanuts comic image for those two days, using lxml instead of Beautiful Soup.
- Be sure to document your work from Question 4, using some comments and insights about your work.
Question 5 (2 pts)
Consider the website for Indiana parks from the National Park Service (this is the Indiana URL from Question 2): https://www.nps.gov/state/in/index.htm
There are four parks listed on this page.
First, scrape the types of all four parks (National Historical Park, National Park, National Historic Trail, National Memorial).
Next, scrape the names of all four parks (George Rogers Clark, Indiana Dunes, Lewis & Clark, Lincoln Boyhood).
Next, scrape the names of all four locations (Vincennes, IN; Porter, IN; Sixteen States: IA,ID,IL,IN,KS,KY,MO,MT,NE,ND,OH,OR,PA,SD,WA,WV; Lincoln City, IN).
Finally, make a Pandas data frame with 4 rows and 3 columns, namely, 1 row per park, with the type of park in the left column, the name of the park in the middle column, and the location of the park in the right column.
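Here is a hedged sketch of the overall shape of a solution. The tag names and class attribute values below are placeholders for illustration only; find the real ones by inspecting the Indiana page yourself (for example, with your browser's developer tools).
import requests
import pandas as pd
from lxml import html
# fetch and parse the Indiana page (URL as given in Question 2)
myresponse = requests.get("https://www.nps.gov/state/in/index.htm")
mytree = html.fromstring(myresponse.content)
# the tag names and class values here are placeholders; replace them with what you find on the page
types = [e.text.strip() for e in mytree.xpath('//h2[@class = "PARK-TYPE-CLASS"]')]
names = [e.text.strip() for e in mytree.xpath('//h3[@class = "PARK-NAME-CLASS"]')]
locations = [e.text.strip() for e in mytree.xpath('//h4[@class = "PARK-LOCATION-CLASS"]')]
# one row per park: type of park, name of park, location of park
mydf = pd.DataFrame({"type": types, "name": names, "location": locations})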
- Use Python to make a Pandas data frame with 4 rows and 3 columns, namely, 1 row per park, with the type of park in the left column, the name of the park in the middle column, and the location of the park in the right column.
- Be sure to document your work from Question 5, using some comments and insights about your work.
Submitting your Work
Please make sure that you added comments for each question, which explain your thinking about your method of solving each question. Please also make sure that your work is your own work, and that any outside sources (people, internet pages, generative AI, etc.) are cited properly in the project template.
Congratulations! Assuming you’ve completed all the above questions, you’ve just finished your second project for TDM 20200! You have a very good introduction to web scraping now, using Beautiful Soup (from Project 1) or lxml (from Project 2).
Prior to submitting your work, you need to put your work into the project template, and re-run all of the code in your Jupyter notebook and make sure that the results of running that code is visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.
Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points. We hope this project went well, and we look forward to continuing to learn with you on future projects!!
- firstname_lastname_project2.ipynb
It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any source that you used. Please make sure that your work is your own work, and that any outside sources (people, internet pages, generative AI, etc.) are cited properly in the project template. Please take the time to double check your work, and see the detailed instructions on how to double check your submission. You will not receive full credit if your submission does not follow these guidelines.