STAT 29000: Project 2 — Spring 2022

Motivation: Web scraping is the process of extracting content from the internet. Typically this goes hand-in-hand with parsing or processing the data. Depending on the task at hand, web scraping can be incredibly simple; with that being said, it can quickly become difficult. Typically, students find web scraping fun and empowering.

Context: In the previous project we gently introduced XML and XPath expressions. In this project, we will learn about web scraping, scrape data from a news site, and parse through our newly scraped data using XPath expressions.

Scope: Python, web scraping, XML

Learning Objectives
  • Review and summarize the differences between XML and HTML/CSV.

  • Use the requests package to scrape a web page.

  • Use the lxml package to filter and parse data from a scraped web page.

Make sure to read about and use the template found here, and review the important information about project submissions here.

Dataset(s)

You will be extracting your own data from online in this project — there is no provided dataset.

Questions

Question 1

The Washington Post is a very popular news site. Open a modern browser (preferably Firefox or Chrome), and navigate to www.washingtonpost.com.

Throughout this project, I will be referencing text and tools from Firefox. If you want the easiest experience, I’d recommend using Firefox for at least this project.

By the end of this project you will be able to scrape some data from this website! The first step is to explore the structure of the website.

To begin exploring the website structure, right-click on the webpage and select "View Page Source". This will pull up the HTML used to render the page.

Alternatively, if you want to focus on a single element on the web page, for example an article title, right-click on the title and select "Inspect". This will pull up an inspector that allows you to see portions of the HTML.

Click around on the website and explore the HTML however you like.

Open a few of the articles shown on the front page of the paper. Note how many of the articles start with some key information, such as category, article title, picture, picture caption, authors, and article datetime.

For example:

Figure 1. Article components

Copy and paste the header element that is nested one level deep in the main element. Include just the tag with its attributes; don’t include the elements nested within the header element.

Do the same for the article element that is nested one level deep in the main element (immediately after the header element).

Put the pasted elements in a markdown cell and surround the content with a markdown html code block.
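For example, the literal text of the markdown cell would look something like the following, where the tags' attribute names and values are placeholders for whatever you actually copied:

```html
<header class="placeholder-classes" id="placeholder-id">
<article class="placeholder-classes" data-placeholder="placeholder-value">
```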

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

In question (1) we copied two elements of an article. When scraping data from a website, it is important to continually consider the patterns in the structure. Specifically, it is important to consider whether the defining characteristics you use to parse the scraped data will continue to be in the same format for new data. What do I mean by defining characteristic? I mean some combination of tag, attribute, and content from which you can isolate the data of interest.

For example, given a link to a new Washington Post article, do you think you could isolate the article title by using the class attribute, class="b-l br-l mb-xxl-ns mt-xxs mt-md-l pr-lg-l col-8-lg mr-lg-l"? Maybe, or maybe not. It looks like those classes are used to control the size, font, and other presentational aspects of the article. In a different article those classes may change, or they may not be unique within the page (for example, if another element had the same set of classes on the same page).
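To make the contrast concrete, compare a fragile expression tied to presentational classes with one tied to a more semantic attribute. The data-qa attribute name below is a hypothetical example; confirm the real defining characteristics in the inspector.

A fragile expression:

//*[@class="b-l br-l mb-xxl-ns mt-xxs mt-md-l pr-lg-l col-8-lg mr-lg-l"]

A sturdier expression:

//h1[@data-qa="headline"]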

Write an XPath expression to isolate the article title, and another XPath expression to isolate the article summary or sub headline.

You do not need to test your XPath expressions yet; we will do that shortly.

Remember, the goal is to write each XPath expression in such a way that we can take any Washington Post article and extract the data we want.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Use the requests package to scrape the web page containing our article from questions (1) and (2). Use the lxml.html package and the xpath method to test out the XPath expressions you created in question (2). Use the expressions to extract the elements, then print the contents of each element (the text between the tags) to confirm that they work.
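If you are unsure where to start, here is a minimal sketch. The article_url value and the XPath expression are placeholders; substitute your article’s link and your own expressions from question (2).

Example sketch

import requests
import lxml.html

# Placeholder link; use the article from questions (1) and (2).
article_url = "https://www.washingtonpost.com/..."

# Some sites serve an error page without a browser-like User-Agent header.
response = requests.get(article_url, headers={"User-Agent": "Mozilla/5.0"})

# Parse the raw HTML into an element tree that supports the xpath method.
tree = lxml.html.fromstring(response.text)

# Hypothetical expression; substitute your own from question (2).
for element in tree.xpath('//h1[@data-qa="headline"]'):
    print(element.text_content())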

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Use your newfound knowledge of XPath expressions, lxml, and requests to write a function called get_article_links that scrapes the home page of The Washington Post and returns a list of 5 article links.

There are a variety of ways to do this; however, make sure your solution is repeatable and returns only article links.

The data-* attributes are particularly useful for this problem.
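One possible approach is sketched below, assuming a hypothetical data-qa attribute on the home page’s article links; inspect the page and substitute whichever data-* attribute actually marks them.

Example sketch

import requests
import lxml.html

def get_article_links():
    """Scrape The Washington Post home page and return 5 article links in a list."""
    response = requests.get("https://www.washingtonpost.com",
                            headers={"User-Agent": "Mozilla/5.0"})
    tree = lxml.html.fromstring(response.text)

    # Hypothetical attribute name; replace with the data-* attribute you find.
    links = tree.xpath('//a[@data-qa="card-title-link"]/@href')

    # Only return the first 5 links.
    return links[:5]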

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Write a function called get_article_info that accepts a link to an article as an argument, and prints the information in the following format:

Example output
Category: Health
Title: White House to distribute 400 million free N95 masks starting next week
Authors: Lena H. Sun, Dan Diamond
Time: Yesterday at 5:00 a.m. EST

Of course, the Time section may change. Just use whatever text they use in the article. It could be an actual date, or something like in the example, where it says "Yesterday…".

In a loop, test out the get_article_info function with the links that are returned by your get_article_links function.

For the authors part, you may find it very difficult to find a single, repeatable way to extract the authors. The reason is that the authors are listed twice within the article. Consider first finding both groups of authors, then using the first of the two resulting elements as a starting point from which to find the authors. You can begin an XPath expression with ".//" to search relative to the current element.

For the time part, you may find it difficult to print both the "Yesterday at" text and the actual time portion. Use a trick similar to the one you used for the authors: first find the node with the "Yesterday at" text, then use that node as a starting point to find the next node that contains the time info. If you are stumped, make a post in Piazza!
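Putting the hints together, here is a minimal sketch. Every XPath expression in it uses hypothetical attribute names; replace each with the defining characteristics you find in the inspector, and adjust the time logic if the date and time live in separate nodes.

Example sketch

import requests
import lxml.html

def get_article_info(link):
    """Print the category, title, authors, and time for the article at the given link."""
    response = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
    tree = lxml.html.fromstring(response.text)

    # Hypothetical expressions; substitute the attributes you find in the inspector.
    category = tree.xpath('//div[@data-qa="kicker"]')[0].text_content()
    title = tree.xpath('//h1[@data-qa="headline"]')[0].text_content()

    # The authors appear twice; take the first group, then search within
    # it using a relative ".//" expression.
    author_group = tree.xpath('//div[@data-qa="author-byline"]')[0]
    authors = [a.text_content() for a in author_group.xpath('.//a')]

    # Hypothetical attribute for the display date; if the "Yesterday at"
    # text and the time are split across nodes, find the first node and
    # search from it for the second.
    time_text = tree.xpath('//span[@data-qa="display-date"]')[0].text_content()

    print(f"Category: {category}")
    print(f"Title: {title}")
    print(f"Authors: {', '.join(authors)}")
    print(f"Time: {time_text}")

# Test get_article_info in a loop using get_article_links from question (4).
for link in get_article_links():
    get_article_info(link)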

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended that you download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.