TDM 20200: Web Scraping Project 5 — Spring 2025
Motivation: We learn how to handle pages with differences in style. In particular, we consider pages that have a next page and pages that do not have a next page.
Context: We will use lxml for web scraping in this project, with XPath queries. We will first analyze example pages, and then will build a powerful function to automate scraping pages.
Scope: Web Scraping in Python
Questions
Preface to the Project:
Recall that there are 50 types of pages on the Books To Scrape website. We already learned that we can get these 50 pages as follows:
import requests
from lxml import html
mytree = html.fromstring(requests.get("https://books.toscrape.com").content)
myurls = ['https://books.toscrape.com/' + element.attrib['href'] for element in mytree.xpath('//li/ul/li/a')]
mylistofurls = [element for element in myurls]
If you explore these 50 links, you will see that some of them are individual pages, for instance, consider:
and some of the categories have multiple pages, for instance:
Whenever a category has multiple pages, we can safely change the index.html
to page-1.html
and then the following pages are page-2.html
and page-3.html
and so on….
BUT on the Books To Scrape website, the categories with just one page do not have that option. So it makes sense to handle these two types of pages separately.
If you get blocked while scraping in this project, you can always use: |
Question 1 (2 pts)
Write a function called myscrapingfunction
that works as follows:
It takes a url called myurl
as input and returns a list of all of the books from that url’s category.
For instance,
myscrapingfunction('https://books.toscrape.com/catalogue/category/books/travel_2/index.html')
should return a list of 11 url’s like this:
['https://books.toscrape.com/catalogue/category/books/travel_2/../../../its-only-the-himalayas_981/index.html',
'https://books.toscrape.com/catalogue/category/books/travel_2/../../../full-moon-over-noahs-ark-an-odyssey-to-mount-ararat-and-beyond_811/index.html',
'https://books.toscrape.com/catalogue/category/books/travel_2/../../../see-america-a-celebration-of-our-national-parks-treasured-sites_732/index.html',
...
'https://books.toscrape.com/catalogue/category/books/travel_2/../../../the-road-to-little-dribbling-adventures-of-an-american-in-britain-notes-from-a-small-island-2_277/index.html',
'https://books.toscrape.com/catalogue/category/books/travel_2/../../../neither-here-nor-there-travels-in-europe_198/index.html',
'https://books.toscrape.com/catalogue/category/books/travel_2/../../../1000-places-to-see-before-you-die_1/index.html']
and, as another example,
myscrapingfunction('https://books.toscrape.com/catalogue/category/books/childrens_11/index.html')
should return a list of 29 url’s like this:
['https://books.toscrape.com/catalogue/category/books/childrens_11/../../../birdsong-a-story-in-pictures_975/index.html',
'https://books.toscrape.com/catalogue/category/books/childrens_11/../../../the-bear-and-the-piano_967/index.html',
'https://books.toscrape.com/catalogue/category/books/childrens_11/../../../the-secret-of-dreadwillow-carse_944/index.html',
...
'https://books.toscrape.com/catalogue/category/books/childrens_11/../../../diary-of-a-minecraft-zombie-book-1-a-scare-of-a-dare-an-unofficial-minecraft-book_99/index.html',
'https://books.toscrape.com/catalogue/category/books/childrens_11/../../../matilda_32/index.html',
'https://books.toscrape.com/catalogue/category/books/childrens_11/../../../charlie-and-the-chocolate-factory-charlie-bucket-1_13/index.html']
Here is pseudocode for how Dr Ward defined his function myscrapingfunction
with input myurl
:
import requests
from lxml import html
Make an empty list called mytemplist
.
Let mytree
be defined for myurl
just like in Project 4 (and earlier).
Create mystub
from myurl
by using a string replace of 'index.html'
by ''
(in other words, remove 'index.html'
from myurl
).
Extend mytemplist
by the list [mystub + element.attrib['href'] for element in mytree.xpath('//h3/a')]
(this basically gets the books from page 1 of the category corresponding to myurl
).
Set mycounter
to be 1.
While len([element for element in mytree.xpath('//li[@class="next"]/a')])
is strictly positive (in other words, while there are still "next" pages to handle, do the following:
Increment the value of mycounter
by 1.
Create a variable mynewurl
from concatenating together these four things:
mystub
'page-'
mycounter (converted to a string)
'.html'
Create mytree
from mynewurl
and then extend mytemplist
by the list [mystub + element.attrib['href'] for element in mytree.xpath('//h3/a')]
(this basically gets the books from the page numbered mycounter
of the category corresponding to myurl
)
Now the while
loop is over, and the function should return mytemplist
.
-
Demonstrate that
myscrapingfunction
works by running it on each of the 10 example URLs given in the preface to the project. -
Be sure to document your work from Question 1, using some comments and insights about your work.
Question 2 (2 pts)
Run the function myscrapingfunction
on each element of mylistofurls
.
You should get a list and each element should itself be a list of the URLs for the books in the respective categories.
-
Show the results from running the function
myscrapingfunction
on each element ofmylistofurls
. It is not necessary to give all of the output; for instance, the first several elements of the resulting list are sufficient to show. -
Be sure to document your work from Question 2, using some comments and insights about your work.
Question 3 (2 pts)
Create an empty list called mybiglist
. Then use list comprehension to extend mybiglist
for each element in the list from question 2. As a result, mybiglist
should have the URLs for all 1000 books at the Books To Scrape website.
-
Show the head and tail of
mybiglist
. -
Verify that
mybiglist
has 1000 URLs. -
Be sure to document your work from Question 3, using some comments and insights about your work.
Question 4 (2 pts)
Write a function that takes a category from Books To Scrape, and lists all of the individual prices of the books from that category. Do not run a function on the individual book URLs! Instead, modify the two occurrences of mytemplist.extend
from Question 1, so that you extract all of the prices from the category pages. To get the price of the books on a category page, search for a p
tag with @class="price_color"
.
For instance, when we run the function on:
it should print:
['£54.64',
'£36.89',
'£56.13',
'£58.08',
'£13.47',
'£12.96',
'£22.08',
'£23.57',
'£18.28',
'£53.95',
'£25.08',
'£35.96',
'£52.87',
'£34.41',
'£32.38',
'£22.54',
'£56.07',
'£48.77',
'£43.59',
'£26.33',
'£16.26',
'£28.54',
'£37.52',
'£10.79',
'£10.62',
'£10.66',
'£52.88',
'£28.34',
'£22.85']
or if you run it on:
it should print:
['£42.96',
'£57.36',
'£44.74',
'£37.55',
'£55.91',
'£28.41',
'£10.01',
'£13.76',
'£16.28',
'£13.03',
'£57.35',
'£25.83',
'£30.60',
'£29.45']
-
Demonstrate that your function works, by testing it on the categories suggested above.
-
Be sure to document your work from Question 4, using some comments and insights about your work.
Question 5 (2 pts)
Now add the prices of all 1000 books. You will need to remove the British pound symbol from each (you can use the replace
function for this). Verify that the total amount of the costs of all of the books from the Books To Scrape website is 35070.35 British pounds altogether.
-
Verify that the total amount of the costs of all of the books from the Books To Scrape website is 35070.35 British pounds altogether.
-
Be sure to document your work from Question 5, using some comments and insights about your work.
Submitting your Work
Please make sure that you added comments for each question, which explain your thinking about your method of solving each question. Please also make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template.
Congratulations! Assuming you’ve completed all the above questions, you are learning to apply your web scraping knowledge effectively!
Prior to submitting your work, you need to put your work into the project template, and re-run all of the code in your Jupyter notebook and make sure that the results of running that code is visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.
Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points. We hope your first project with us went well, and we look forward to continuing to learn with you on future projects!!
-
firstname_lastname_project5.ipynb
It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any source that you used. Please make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template. You must double check your Please take the time to double check your work. See here for instructions on how to double check this. You will not receive full credit if your |