TDM 20200: Web Scraping Project 5 — Spring 2025

Motivation: We learn how to handle pages that differ in structure. In particular, we consider category pages that have a "next" page and category pages that do not.

Context: We will use lxml with XPath queries for web scraping in this project. We will first analyze example pages, and then build a powerful function to automate scraping pages.

Scope: Web Scraping in Python

Learning Objectives:
  • We continue to learn how to scrape data more systematically from the web

Make sure to read about, and use, the template found here, and review the important information about project submissions here.

Dataset(s)

In this project we will scrape data from the Books To Scrape website:

https://books.toscrape.com

Questions

Preface to the Project:

Recall that there are 50 categories of books on the Books To Scrape website. We already learned that we can get the URLs of these 50 category pages as follows:

import requests
from lxml import html

# parse the home page of the Books To Scrape website
mytree = html.fromstring(requests.get("https://books.toscrape.com").content)

# build the full URL for each of the 50 category links in the sidebar
myurls = ['https://books.toscrape.com/' + element.attrib['href'] for element in mytree.xpath('//li/ul/li/a')]

# keep a copy of the list; it is used again in Questions 2 and 3
mylistofurls = [element for element in myurls]
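
You can quickly verify that all 50 category URLs were captured:

len(mylistofurls)   # should be 50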

If you explore these 50 links, you will see that some of them are individual pages, for instance, consider:

https://books.toscrape.com/catalogue/category/books/travel_2/index.html

and some of the categories have multiple pages, for instance:

https://books.toscrape.com/catalogue/category/books/childrens_11/index.html

Whenever a category has multiple pages, we can safely change the index.html to page-1.html, and then the following pages are page-2.html and page-3.html and so on.
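
For example, using the childrens category above, here is how that replacement looks (page2url is an illustrative name):

# turn a category's index URL into the URL of its second page
myurl = 'https://books.toscrape.com/catalogue/category/books/childrens_11/index.html'
page2url = myurl.replace('index.html', 'page-2.html')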

BUT on the Books To Scrape website, the categories with just one page do not have that option. So it makes sense to handle these two types of pages separately.

If you get blocked while scraping in this project, you can always use import time and then time.sleep(2) to sleep in between scraping pages (for instance, this sleeps for 2 seconds between requests).

Question 1 (2 pts)

Write a function called myscrapingfunction that works as follows:

It takes a URL called myurl as input and returns a list of the URLs of all of the books in that URL's category.

For instance,

myscrapingfunction('https://books.toscrape.com/catalogue/category/books/travel_2/index.html') should return a list of 11 URLs like this:

['https://books.toscrape.com/catalogue/category/books/travel_2/../../../its-only-the-himalayas_981/index.html',
 'https://books.toscrape.com/catalogue/category/books/travel_2/../../../full-moon-over-noahs-ark-an-odyssey-to-mount-ararat-and-beyond_811/index.html',
 'https://books.toscrape.com/catalogue/category/books/travel_2/../../../see-america-a-celebration-of-our-national-parks-treasured-sites_732/index.html',
...
 'https://books.toscrape.com/catalogue/category/books/travel_2/../../../the-road-to-little-dribbling-adventures-of-an-american-in-britain-notes-from-a-small-island-2_277/index.html',
 'https://books.toscrape.com/catalogue/category/books/travel_2/../../../neither-here-nor-there-travels-in-europe_198/index.html',
 'https://books.toscrape.com/catalogue/category/books/travel_2/../../../1000-places-to-see-before-you-die_1/index.html']

and, as another example,

myscrapingfunction('https://books.toscrape.com/catalogue/category/books/childrens_11/index.html') should return a list of 29 URLs like this:

['https://books.toscrape.com/catalogue/category/books/childrens_11/../../../birdsong-a-story-in-pictures_975/index.html',
 'https://books.toscrape.com/catalogue/category/books/childrens_11/../../../the-bear-and-the-piano_967/index.html',
 'https://books.toscrape.com/catalogue/category/books/childrens_11/../../../the-secret-of-dreadwillow-carse_944/index.html',
...
 'https://books.toscrape.com/catalogue/category/books/childrens_11/../../../diary-of-a-minecraft-zombie-book-1-a-scare-of-a-dare-an-unofficial-minecraft-book_99/index.html',
 'https://books.toscrape.com/catalogue/category/books/childrens_11/../../../matilda_32/index.html',
 'https://books.toscrape.com/catalogue/category/books/childrens_11/../../../charlie-and-the-chocolate-factory-charlie-bucket-1_13/index.html']

Here is pseudocode for how Dr Ward defined his function myscrapingfunction with input myurl:

import requests

from lxml import html

Make an empty list called mytemplist.

Let mytree be defined for myurl just like in Project 4 (and earlier).

Create mystub from myurl by using a string replace of 'index.html' by '' (in other words, remove 'index.html' from myurl).

Extend mytemplist by the list [mystub + element.attrib['href'] for element in mytree.xpath('//h3/a')] (this basically gets the books from page 1 of the category corresponding to myurl).

Set mycounter to be 1.

While len([element for element in mytree.xpath('//li[@class="next"]/a')]) is strictly positive (in other words, while there are still "next" pages to handle), do the following:

Increment the value of mycounter by 1.

Create a variable mynewurl by concatenating together these four things:

  • mystub
  • 'page-'
  • mycounter (converted to a string)
  • '.html'

Create mytree from mynewurl,

and then extend mytemplist by the list [mystub + element.attrib['href'] for element in mytree.xpath('//h3/a')] (this basically gets the books from the page numbered mycounter of the category corresponding to myurl).

Now the while loop is over, and the function should return mytemplist.
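
For reference, here is one direct translation of the pseudocode above into Python (the time.sleep call is optional, per the advice in the preface):

import requests
import time
from lxml import html

def myscrapingfunction(myurl):
    mytemplist = []
    mytree = html.fromstring(requests.get(myurl).content)
    # remove 'index.html' so relative book links and page names can be appended
    mystub = myurl.replace('index.html', '')
    # get the books from page 1 of the category
    mytemplist.extend([mystub + element.attrib['href'] for element in mytree.xpath('//h3/a')])
    mycounter = 1
    # while the current page has a "next" link, fetch and scrape the following page
    while len([element for element in mytree.xpath('//li[@class="next"]/a')]) > 0:
        mycounter += 1
        mynewurl = mystub + 'page-' + str(mycounter) + '.html'
        time.sleep(2)  # optional pause between requests, per the preface
        mytree = html.fromstring(requests.get(mynewurl).content)
        mytemplist.extend([mystub + element.attrib['href'] for element in mytree.xpath('//h3/a')])
    return mytemplist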

Deliverables
  • Demonstrate that myscrapingfunction works by running it on each of the 10 example URLs given in the preface to the project.

  • Be sure to document your work from Question 1, using some comments and insights about your work.

Question 2 (2 pts)

Run the function myscrapingfunction on each element of mylistofurls.

You should get a list, and each element should itself be a list of the URLs for the books in the respective category.
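
A minimal way to do this, assuming mylistofurls from the preface and myscrapingfunction from Question 1 (myresults is an illustrative name):

# scrape every category; this makes one request per page, so it may take a little while
myresults = [myscrapingfunction(myurl) for myurl in mylistofurls]
myresults[:3]  # peek at the first few categories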

Deliverables
  • Show the results from running the function myscrapingfunction on each element of mylistofurls. It is not necessary to give all of the output; for instance, showing the first several elements of the resulting list is sufficient.

  • Be sure to document your work from Question 2, using some comments and insights about your work.

Question 3 (2 pts)

Create an empty list called mybiglist. Then use a list comprehension to extend mybiglist with each element of the list from Question 2. As a result, mybiglist should have the URLs for all 1000 books at the Books To Scrape website.
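
Here is a minimal sketch, assuming the Question 2 results are stored in a list called myresults (an illustrative name):

mybiglist = []
# a list comprehension that extends mybiglist with each category's book URLs
[mybiglist.extend(mylist) for mylist in myresults]
len(mybiglist)  # should be 1000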

Deliverables
  • Show the head and tail of mybiglist.

  • Verify that mybiglist has 1000 URLs.

  • Be sure to document your work from Question 3, using some comments and insights about your work.

Question 4 (2 pts)

Write a function that takes a category URL from Books To Scrape and lists all of the individual prices of the books in that category. Do not run a function on the individual book URLs! Instead, modify the two occurrences of mytemplist.extend from Question 1 so that you extract all of the prices from the category pages (one possible adaptation is sketched after the example output below). To get the prices of the books on a category page, search for p tags with @class="price_color".

For instance, when we run the function on:

https://books.toscrape.com/catalogue/category/books/childrens_11/index.html

it should print:

['£54.64',
 '£36.89',
 '£56.13',
 '£58.08',
 '£13.47',
 '£12.96',
 '£22.08',
 '£23.57',
 '£18.28',
 '£53.95',
 '£25.08',
 '£35.96',
 '£52.87',
 '£34.41',
 '£32.38',
 '£22.54',
 '£56.07',
 '£48.77',
 '£43.59',
 '£26.33',
 '£16.26',
 '£28.54',
 '£37.52',
 '£10.79',
 '£10.62',
 '£10.66',
 '£52.88',
 '£28.34',
 '£22.85']

or if you run it on:

it should print:

['£42.96',
 '£57.36',
 '£44.74',
 '£37.55',
 '£55.91',
 '£28.41',
 '£10.01',
 '£13.76',
 '£16.28',
 '£13.03',
 '£57.35',
 '£25.83',
 '£30.60',
 '£29.45']
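
Here is one possible adaptation of the Question 1 function, with the two mytemplist.extend lines modified as described (mypricefunction is an illustrative name):

def mypricefunction(myurl):
    mytemplist = []
    mytree = html.fromstring(requests.get(myurl).content)
    mystub = myurl.replace('index.html', '')
    # same structure as Question 1, but extract price text instead of book links
    mytemplist.extend([element.text for element in mytree.xpath('//p[@class="price_color"]')])
    mycounter = 1
    while len([element for element in mytree.xpath('//li[@class="next"]/a')]) > 0:
        mycounter += 1
        mynewurl = mystub + 'page-' + str(mycounter) + '.html'
        mytree = html.fromstring(requests.get(mynewurl).content)
        mytemplist.extend([element.text for element in mytree.xpath('//p[@class="price_color"]')])
    return mytemplist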
Deliverables
  • Demonstrate that your function works, by testing it on the categories suggested above.

  • Be sure to document your work from Question 4, using some comments and insights about your work.

Question 5 (2 pts)

Now add up the prices of all 1000 books. You will need to remove the British pound symbol from each price (you can use the string replace method for this). Verify that the total cost of all of the books on the Books To Scrape website is 35070.35 British pounds altogether.
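
A minimal sketch, assuming allprices is a flat list of all 1000 price strings, built from the per-category price lists in the same way that mybiglist was built in Question 3 (allprices is an illustrative name):

# strip the pound symbol, convert each price to a float, and sum
mytotal = sum([float(myprice.replace('£', '')) for myprice in allprices])
round(mytotal, 2)  # should be 35070.35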

Deliverables
  • Verify that the total cost of all of the books on the Books To Scrape website is 35070.35 British pounds altogether.

  • Be sure to document your work from Question 5, using some comments and insights about your work.

Submitting your Work

Please make sure that you added comments for each question that explain your thinking about your method of solving it. Please also make sure that your work is your own, and that any outside sources (people, internet pages, generative AI, etc.) are cited properly in the project template.

Congratulations! Assuming you’ve completed all the above questions, you are learning to apply your web scraping knowledge effectively!

Prior to submitting your work, you need to put your work into the project template and re-run all of the code in your Jupyter notebook, making sure that the results of running that code are visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.

Once you upload your submission to Gradescope, make sure that everything appears as you would expect, to ensure that you don't lose any points. We hope this project went well, and we look forward to continuing to learn with you on future projects!!

Items to submit
  • firstname_lastname_project5.ipynb

It is necessary to document your work, with comments about each solution. All of your work needs to be your own, and any outside sources (people, internet pages, generative AI, etc.) must be cited properly in the project template.

You must double-check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output, even though it may not.

Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.