TDM 20200: Project 5 — 2024

Motivation: We practice the skill of scraping information about books from a real publisher’s website. (We only do this for academic purposes and not for commercial purposes.)

Context: We use the Selenium skills that we learned in Project 3 to extract information about books from the publisher O’Reilly’s website.

Scope: Python, XML, Selenium

Learning Objectives
  • Extract XML content from a real-world website using Selenium.

Dataset(s)

The following questions will examine the No Starch Press books that are available from this website:

No Starch Press is one of the Publishers available from the dropdown menu on the left-hand-side of the website. At the moment, there are 350 books available from No Starch Press. (This might change during the project, if more books are published in the next week or two.)

The order of the books are dynamic. For this reason, if you look at the books in a browser on your computer, at the same time that you are scraping books from the O’Reilly website, the order of the books is likely to change.

We prepare to work with Selenium in Python as follows:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.firefox.options import Options

firefox_options = Options()
firefox_options.add_argument("--headless")
firefox_options.add_argument("--disable-extensions")
firefox_options.add_argument("--no-sandbox")
firefox_options.add_argument("--disable-dev-shm-usage")

driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()),options=firefox_options)

and we can load the pages, for instance, here is the second page, like this:

driver.get("https://www.oreilly.com/search/?publishers=No%20Starch%20Press&type=book&rows=100&page=2")

Each book entry is wrapped in the following (example) XML code. Some XML entries have been removed, so that there might be other siblings and/or children that are not shown here. This example is from the XML for the book Rust for Rustaceans by Jon Gjengset.

<article data-testid="card9781098129828" tabindex="0" data-animated="true" data-expanded="false" class="css-ataarr">
    <div class="css-155vutv">
        <div class="css-bkunnn"><span class="MuiTypography-root MuiTypography-label css-1updh7" data-testid="format-label-9781098129828"><span data-viz="srOnly">Format:&nbsp;</span>Book</span></div>
        <section data-testid="Highlights-9781098129828" class="css-1002htw">
            <a href="https://learning.oreilly.com/library/view/-/9781098129828/" data-testid="cover-link-9781098129828" tabindex="-1" class="css-in7hc5"><img src="/covers/urn:orm:book:9781098129828/160h/?format=webp" alt="Cover of Rust for Rustaceans" loading="lazy"></a>
            <a href="https://learning.oreilly.com/library/view/-/9781098129828/" data-testid="title-link-9781098129828" class="css-x6voe2"><h3 class="MuiTypography-root MuiTypography-h4 css-1dbnkcz">Rust for Rustaceans</h3></a>
            <div class="css-iqysc0">
                <div class="MuiTypography-root MuiTypography-linkSmall css-3e7u07" data-testid="search-card-authors-9781098129828">By&nbsp;<a class="MuiTypography-root MuiTypography-linkSmall css-1vjdbg5" href="/search?q=author:&quot;Jon Gjengset&quot;" data-testid="author-search-card-9781098129828-Jon Gjengset">Jon Gjengset</a></div>
                <div data-testid="publication-details-9781098129828" class="css-jxiiw"><a href="/publisher/ed0d603d-6753-43cb-8f84-e7fc52547d84" class="css-3020f2"><span data-viz="srOnly">Publisher: </span>No Starch Press</a><span class="css-11r5j1j">&nbsp;•&nbsp;</span><span>December 2021</span></div>
            </div>
            <footer class="css-oe0o7y"><div data-testid="ratings-9781098129828" class="css-5x0u4k"></div><span class="MuiTypography-root MuiTypography-linkSmall css-1irscvl" data-testid="PageCount-9781098129828">280 pages</span></footer>
        </section>
    </div>
</article>

It is necessary to give each page a few seconds to load. Otherwise, the query for a page might end up blank. Therefore, it is advisable to SLOWLY load one cell at a time, when you are checking your work, waiting a few seconds in between each cell. This allows the O’Reilly pages to load in the browser.

Dr Ward created 9 videos to help with this project.

Questions

Question 1 (2 points)

  1. Load the formats for all 350 entries into a list of length 350, and make sure that each entry says Format: Book.

  2. Load the publisher for all 350 entries into a list of length 350, and make sure that each entry says Publisher: No Starch Press.

For question 1a, you can use the XPath
//article/div/div/span
in Selenium and extract the text.

For question 1b, you can use the XPath
//article/div/section/div/div/a[contains(@href, '/publisher/')]
in Selenium and extract the text.

Question 2 (2 points)

From the first page of 100 entries:

  1. Extract a list of the 100 URLs

  2. Extract a list of the 100 titles

  3. Extract a list of the 100 authors (it is OK if the word "By" is included in each author result)

  4. Extract a list of the 100 dates

  5. Extract a list of the 100 pages

For the URLs, use XPath
//article/div/section/a[@tabindex='-1']
and extract the attribute href.

For the titles, use XPath
//article/div/section/a/h3
and extract the text.

For the authors, use XPath
//article/div/section/div/div[contains(@data-testid, 'search-card-authors')]
and extract the text.

For the dates, use XPath
//article/div/section/div/div[contains(@data-testid, 'publication-details')]/span[not(@class)]
and extract the text.

For the pages, use XPath
//article/div/section/footer
and extract the text.

Question 3 (2 points)

Extract the content from pages 2, 3, and 4 (i.e., from the next 250 entries), and add this content to the lists from question 2, so that you have altogether:

  1. A list of the 350 URLs

  2. A list of the 350 titles

  3. A list of the 350 authors (it is OK if the word "By" is included in each author result)

  4. A list of the 350 dates

  5. A list of the 350 pages

You might want to use a for loop, but if you do, it is worthwhile to import time and to time.sleep(10) after loading a new driver page, before extracting information from it. It is also worthwhile to extend the elements of one list onto another list.

Question 4 (2 points)

  1. For the list of pages, remove the phrase " pages" (including the space) and the remove the commas, and then convert from strings to integers.

  2. Now make a data frame of the URLs, titles, authors, dates, and (the new numeric) pages.

Question 5 (2 points)

  1. If you drop the duplicates from your data frame in Question 4b, you will likely not (yet) have 350 distinct No Starch Press books. Repeat the steps above, building (say) one or two more data frames, until you have all 350 distinct titles.

  2. Once you have all 350 distinct titles in a data frame, sort the results by the date column, and find which month-and-year pair had the largest number of pages written.

You should find that, in June 2021, there were a total of 3096 pages written, in these 7 books:

https://learning.oreilly.com/library/view/-/9781098128999/  How Cybersecurity Really Works                 By Sam Grubb             June 2021  216
https://learning.oreilly.com/library/view/-/9781098129019/  Deep Learning                                  By Andrew Glassner       June 2021  768
https://learning.oreilly.com/library/view/-/9781098129033/  Learn to Code by Solving Problems              By Daniel Zingaro        June 2021  336
https://learning.oreilly.com/library/view/-/9781098128982/  The Art of WebAssembly                         By Rick Battagline       June 2021  304
https://learning.oreilly.com/library/view/-/9781098128975/  Arduino Workshop, 2nd Edition                  By John Boxall           June 2021  440
https://learning.oreilly.com/library/view/-/9781098129002/  Hardcore Programming for Mechanical Engineers  By Angel Sola Orbaiceta  June 2021  600
https://learning.oreilly.com/library/view/-/9781098129026/  The Big Book of Small Python Projects          By Al Sweigart           June 2021  432

Project 04 Assignment Checklist

  • Jupyter Lab notebook with your code, comments and output for the assignment

    • firstname-lastname-project05.ipynb

  • Python file with code and comments for the assignment

    • firstname-lastname-project05.py

  • Submit files through Gradescope

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.