TDM 20200: Project 1 - Web Scraping Introduction
Welcome back to the TDM course! This semester, we will cover more advanced topics, primarily focused on Python for TDM 202. We have a few notes to help make this semester smoother for you:
Project Objectives
This project introduces you to web scraping using Python and the lxml library. Web scraping is the process of programmatically extracting data from HTML documents. Before we can scrape live websites, we need to understand how to parse HTML and extract information from it.
If AI is used in any case, such as for debugging or research, we now require that you submit a link to the entire chat history. For example, if you used ChatGPT, there is a "Share" option in the conversation sidebar. Click on "Create Link" and add the shareable link as part of your citation. The project template in the Examples Book now has a "Link to AI Chat History" section; please include this in all your projects. If you did not use any AI tools, you may write "None". We allow using AI for learning purposes; however, all submitted materials (code, comments, and explanations) must be your own work and in your own words. No content or ideas should be directly applied or copy-pasted into your projects. Please refer to the-examples-book.com/projects/spring2026/syllabus#guidance-on-generative-ai. Failing to follow these guidelines is considered academic dishonesty.
Make sure to read about and use the template found here, and to review the important information about project submissions here.
Similar to the Fall 2025 semester, you may post your questions on the course Piazza page. Although the links are labeled "Fall 2025," the same links will continue to be used for Spring 2026. Below is the Piazza link for this lecture:
The projects are usually due on Wednesdays. You can see the schedule here: the-examples-book.com/projects/spring2026/20200/projects. Please do not wait until Wednesday to complete and submit your work! We strongly recommend starting your projects early in the week to avoid any last-minute issues that could cause you to miss the deadline.
Understanding HTML Structure
Before we can scrape websites, we need to understand how web pages are structured. HTML (HyperText Markup Language) is the standard language for creating web pages. HTML uses "tags" to structure content. Tags are enclosed in angle brackets like <tag> and typically come in pairs: an opening tag <tag> and a closing tag </tag>. A pair of tags is typically called an "element".
Here’s a simple example of HTML:
<html>
  <head>
    <title>My Web Page</title>
  </head>
  <body>
    <h1>Welcome</h1>
    <p>This is a paragraph with <strong>bold text</strong>.</p>
    <a href="https://example.com">Click here</a>
  </body>
</html>
Common HTML elements you’ll encounter:
- <html> - The root element of an HTML page
- <head> - Contains metadata about the page (not visible)
- <body> - Contains the visible content
- <div> - A container/division (used for layout)
- <p> - A paragraph
- <a> - A link (anchor tag)
- <h1>, <h2>, etc. - Headings of different sizes
- <span> - Inline text container
- <ul>, <ol> - Unordered and ordered lists
- <li> - List items
- <table> - Tables
- <tr> - Table rows
- <td> - Table cells
Tags can have "attributes" that provide additional information. For example, <a href="https://example.com"> has an href attribute that specifies the link destination.
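Later in this project we will read attributes programmatically with lxml. As a quick preview, here is a minimal sketch (the URL and id below are made-up values) of parsing a tag and reading its attributes:

from lxml import html

# Parse a tiny document and grab the single <a> element
doc = html.fromstring('<html><body><a href="https://example.com" id="home">Home</a></body></html>')
link = doc.xpath('//a')[0]

print(link.get('href'))  # https://example.com
print(link.get('id'))    # home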
Questions
Question 1 (2 points)
Let’s start with the basics. We will work with a simple HTML string and learn how to parse it using lxml. First, create the string and parse it:
from lxml import html

# A simple HTML string
html_string = """
<html>
  <head>
    <title>My First Web Page</title>
  </head>
  <body>
    <h1>Welcome to Web Scraping</h1>
    <p>This is a paragraph.</p>
    <p>This is another paragraph.</p>
  </body>
</html>
"""

# Parse the HTML string into a tree structure
tree = html.fromstring(html_string)

# Now we can extract elements from the tree
print("HTML parsed successfully!")
print(f"Type of tree: {type(tree)}")
The html.fromstring() function parses an HTML string and returns an element object representing the root of the document tree. Everything else in this project builds on querying that tree.
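As a quick sanity check, here is a minimal sketch (reusing the tree object from the cell above) showing that the parsed result is an ordinary Python object you can walk:

# The root element of the parsed tree
print(tree.tag)  # 'html'

# Iterating over an element yields its direct children
for child in tree:
    print(child.tag)  # 'head', then 'body'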
Now let’s extract some basic information from our HTML:
# Extract the title (it's inside a <title> tag)
title = tree.xpath('//title/text()')
print(f"Page title: {title[0] if title else 'Not found'}")

# Extract all paragraph text
paragraphs = tree.xpath('//p/text()')
print(f"\nFound {len(paragraphs)} paragraphs:")
for i, para in enumerate(paragraphs, 1):
    print(f"{i}. {para}")

# Extract the heading
heading = tree.xpath('//h1/text()')
print(f"\nHeading: {heading[0] if heading else 'Not found'}")
XPath is a language for selecting nodes in an XML/HTML document. The //tag pattern finds every matching element anywhere in the document, and appending /text() extracts the text content of those elements. Note that xpath() always returns a list, which is why the code above checks whether the list is non-empty before indexing into it.
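For contrast, here is a minimal sketch (reusing tree from above) of an absolute path, which starts at the root with a single / and must spell out every step, versus the // shorthand:

# Absolute path: every step from the root must be listed
print(tree.xpath('/html/head/title/text()'))  # ['My First Web Page']

# '//' searches anywhere in the document, regardless of depth
print(tree.xpath('//title/text()'))           # ['My First Web Page']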
Now try it yourself! Create your own HTML string and extract:
1. The page title
2. All paragraph text
3. The text content of the <h1> heading
1.1. Create an HTML string with at least a title, heading, and multiple paragraphs,
1.2. Parse the HTML using lxml.html.fromstring(),
1.3. Extract and display the page title,
1.4. Extract and display all paragraph text,
1.5. Extract and display the <h1> heading text.
Question 2 (2 points)
Let’s work with a slightly more complex HTML structure and learn about different XPath patterns:
from lxml import html

# HTML with more structure
html_string = """
<html>
  <body>
    <div class="quote">
      <span class="text">The only way to do great work is to love what you do.</span>
      <small class="author">Steve Jobs</small>
    </div>
    <div class="quote">
      <span class="text">Innovation distinguishes between a leader and a follower.</span>
      <small class="author">Steve Jobs</small>
    </div>
    <div class="quote">
      <span class="text">Stay hungry, stay foolish.</span>
      <small class="author">Steve Jobs</small>
    </div>
  </body>
</html>
"""

tree = html.fromstring(html_string)

# Different XPath patterns:
# //tag - Find all 'tag' elements anywhere in the document
# //tag[@attribute='value'] - Find elements with specific attribute value
# //tag/text() - Get text content of elements

# Find all quotes (they're in <span> tags with class="text")
quotes = tree.xpath('//span[@class="text"]/text()')
print(f"Found {len(quotes)} quotes:")
for i, quote in enumerate(quotes, 1):
    print(f"{i}. {quote}")

# Find all authors (they're in <small> tags with class="author")
authors = tree.xpath('//small[@class="author"]/text()')
print(f"\nFound {len(authors)} authors:")
for i, author in enumerate(authors, 1):
    print(f"{i}. {author}")
The [@class="text"] part is called a predicate: it filters the selected elements to those whose class attribute is exactly "text". Keep in mind that this is an exact string match on the whole attribute value.
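Because the match is exact, an element that carries several classes will not match an @class predicate for just one of them. A minimal standalone sketch of the difference (this snippet is illustrative, not part of the quotes example):

from lxml import html

snippet = html.fromstring('<div><span class="text large">Hi</span></div>')

# Exact match fails: the attribute value is "text large", not "text"
print(snippet.xpath('//span[@class="text"]/text()'))             # []

# contains() matches a substring of the attribute value instead
# (careful: it would also match a class like "textbox")
print(snippet.xpath('//span[contains(@class, "text")]/text()'))  # ['Hi']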
CSS selectors are another way to select elements. lxml supports CSS selectors through the cssselect method:
# Using CSS selectors (alternative to XPath)
# CSS selector syntax:
# tag - Select all 'tag' elements
# .classname - Select elements with class="classname"
# #id - Select element with id="id"
# tag.classname - Select 'tag' elements with class="classname"
quotes_css = tree.cssselect('span.text')
print(f"Found {len(quotes_css)} quotes using CSS selector:")
for i, quote_elem in enumerate(quotes_css, 1):
    print(f"{i}. {quote_elem.text_content()}")
CSS selectors can be more readable than XPath for simple selections. The text_content() method used above returns all of the text inside an element, including text nested in child elements, whereas XPath's text() only returns the text nodes that are direct children of the element.
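A minimal standalone sketch of that difference:

from lxml import html

elem = html.fromstring('<p>Hello <strong>world</strong>!</p>')

print(elem.text)            # 'Hello ' -- only the text before the first child element
print(elem.xpath('text()')) # ['Hello ', '!'] -- direct text nodes, skipping nested text
print(elem.text_content())  # 'Hello world!' -- includes text from nested children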
Now, create an HTML string with multiple quotes (like the example above) and extract:
1. All quote texts
2. All author names
3. Display them together (e.g., "Quote: [text] - Author: [name]").
2.1. Create an HTML string with at least 3 quotes, each in a div with class="quote",
2.2. Extract all quote texts using XPath,
2.3. Extract all author names using XPath,
2.4. Try extracting quotes using CSS selectors as well,
2.5. Display the results pairing each quote with its author.
Question 3 (2 points)
Sometimes elements are nested, and we need to extract related data together. Let’s learn about relative XPath (using .//):
from lxml import html

# HTML with nested structure
html_string = """
<html>
  <body>
    <div class="quote">
      <span class="text">The only way to do great work is to love what you do.</span>
      <small class="author">Steve Jobs</small>
      <div class="tags">
        <a class="tag">inspiration</a>
        <a class="tag">work</a>
      </div>
    </div>
    <div class="quote">
      <span class="text">Innovation distinguishes between a leader and a follower.</span>
      <small class="author">Steve Jobs</small>
      <div class="tags">
        <a class="tag">innovation</a>
        <a class="tag">leadership</a>
      </div>
    </div>
  </body>
</html>
"""

tree = html.fromstring(html_string)

# Find all quote containers (divs with class="quote")
quote_containers = tree.xpath('//div[@class="quote"]')
print(f"Found {len(quote_containers)} quote containers\n")

# Extract data from each container
for i, container in enumerate(quote_containers, 1):
    # Extract quote text from within this container
    # Notice the .// - the dot means "starting from the current element"
    quote_text = container.xpath('.//span[@class="text"]/text()')[0]

    # Extract author from within this container
    author = container.xpath('.//small[@class="author"]/text()')[0]

    # Extract tags from within this container
    tags = container.xpath('.//a[@class="tag"]/text()')

    print(f"Quote {i}:")
    print(f"  Text: {quote_text}")
    print(f"  Author: {author}")
    print(f"  Tags: {', '.join(tags)}")
    print()
Notice the .// at the start of each inner XPath expression. The leading dot means "start from the current element" instead of from the top of the document, so each query only looks inside the container at hand. Without the dot, // would search the entire document and return the matches from every container on every iteration.
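A minimal sketch of that pitfall, reusing the quote_containers from the example above:

for container in quote_containers:
    # '//' ignores the context element and searches the whole document...
    everything = container.xpath('//span[@class="text"]/text()')
    # ...while './/' stays inside the current container
    only_here = container.xpath('.//span[@class="text"]/text()')
    print(len(everything), len(only_here))  # prints "2 1" for each container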
Now, create an HTML string with nested quote containers and extract:
1. Each quote’s text, author, and tags together (as shown above),
2. Make sure you use relative XPath (.//) to search within each container.
3.1. Create an HTML string with at least 2 quote containers, each containing text, author, and tags,
3.2. Find all quote containers using XPath,
3.3. For each container, extract the quote text, author, and tags using relative XPath (.//),
3.4. Display the results showing all information for each quote,
3.5. Explain the difference between // and .// in XPath.
Question 4 (2 points)
Sometimes we need to extract attributes from elements, not just text. Common attributes include:
- href - Links (where the link goes)
- src - Images (image source URL)
- class - CSS class names
- id - Unique identifier
- data-* - Custom data attributes
from lxml import html

# HTML with links
html_string = """
<html>
  <body>
    <a href="/author/steve-jobs">Steve Jobs</a>
    <a href="/author/albert-einstein">Albert Einstein</a>
    <a href="/author/maya-angelou">Maya Angelou</a>
    <a href="/tag/inspiration">Inspiration</a>
    <a href="/tag/work">Work</a>
  </body>
</html>
"""

tree = html.fromstring(html_string)

# Extract href attributes from links
# Method 1: Using XPath with @attribute
author_links = tree.xpath('//a[contains(@href, "/author/")]/@href')
print("Author links (Method 1 - direct attribute extraction):")
for link in author_links:
    print(f"  {link}")

# Method 2: Get the element first, then access attributes
author_link_elements = tree.xpath('//a[contains(@href, "/author/")]')
print("\nAuthor links (Method 2 - get element then attribute):")
for elem in author_link_elements:
    print(f"  Text: {elem.text}")
    print(f"  Href: {elem.get('href')}")
    print()
The @href at the end of an XPath expression selects the attribute value itself rather than the element, and contains(@href, "/author/") filters for links whose href includes a given substring. Once you have an element, .get('href') reads the same attribute from Python.
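Two more attribute tricks worth knowing, shown in a minimal sketch that reuses the tree from above: every element exposes a dict-like .attrib mapping, and .get() accepts a default for attributes that are absent:

# Grab the first link element from the document above
link = tree.xpath('//a')[0]

print(link.attrib)                # dict-like view of all attributes
print(link.get('href'))           # '/author/steve-jobs'
print(link.get('id', 'missing'))  # .get() can take a default for absent attributes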
Let’s also learn about extracting data from tables, which is a common scraping task:
# Example: Scraping a simple HTML table
html_table = """
<table>
  <tr>
    <th>Name</th>
    <th>Age</th>
    <th>City</th>
  </tr>
  <tr>
    <td>Alice</td>
    <td>25</td>
    <td>New York</td>
  </tr>
  <tr>
    <td>Bob</td>
    <td>30</td>
    <td>London</td>
  </tr>
  <tr>
    <td>Charlie</td>
    <td>35</td>
    <td>Tokyo</td>
  </tr>
</table>
"""

tree = html.fromstring(html_table)

# Extract table headers
headers = tree.xpath('//th/text()')
print(f"Headers: {headers}")

# Extract all rows
rows = tree.xpath('//tr')
data = []
for row in rows[1:]:  # Skip header row (index 0)
    cells = row.xpath('.//td/text()')
    if cells:  # Only add non-empty rows
        data.append(cells)

print("\nTable data:")
for row in data:
    print(row)
Now, create HTML strings and practice extracting:
1. Links with their href attributes and text content,
2. A table with headers and multiple rows of data,
3. Create a dictionary mapping link text to their href values.
4.1. Create an HTML string with multiple links and extract both text and href attributes,
4.2. Create an HTML string with a table and extract headers and all row data,
4.3. Create a dictionary mapping link text to href values,
4.4. Display the table data in a readable format,
4.5. Show both methods of extracting attributes (direct @attribute and .get()).
Question 5 (2 points)
Let’s put it all together with a more complex example. We’ll work with HTML that has multiple types of elements:
from lxml import html

# More complex HTML structure
html_string = """
<html>
  <body>
    <div class="book">
      <h3><a title="The Great Gatsby">The Great Gatsby</a></h3>
      <p class="price">$12.99</p>
      <p class="availability">In stock</p>
      <p class="rating star-rating Four">Rating: 4 stars</p>
    </div>
    <div class="book">
      <h3><a title="1984">1984</a></h3>
      <p class="price">$10.99</p>
      <p class="availability">In stock</p>
      <p class="rating star-rating Five">Rating: 5 stars</p>
    </div>
    <div class="book">
      <h3><a title="To Kill a Mockingbird">To Kill a Mockingbird</a></h3>
      <p class="price">$11.99</p>
      <p class="availability">Out of stock</p>
      <p class="rating star-rating Three">Rating: 3 stars</p>
    </div>
  </body>
</html>
"""

tree = html.fromstring(html_string)

# Find all book containers
books = tree.xpath('//div[@class="book"]')
print(f"Found {len(books)} books\n")

# Extract information from each book
for i, book in enumerate(books, 1):
    # Book title (stored in title attribute of <a> tag)
    title_elem = book.xpath('.//h3/a')[0]
    title = title_elem.get('title')

    # Book price
    price = book.xpath('.//p[@class="price"]/text()')[0]

    # Availability
    availability = book.xpath('.//p[@class="availability"]/text()')[0]

    # Star rating (stored in class name like "star-rating Four")
    rating_elem = book.xpath('.//p[contains(@class, "star-rating")]')[0]
    rating_class = rating_elem.get('class')

    # Extract the rating word (Four, Five, Three, etc.)
    rating = rating_class.split()[-1] if rating_class else "Unknown"

    print(f"Book {i}:")
    print(f"  Title: {title}")
    print(f"  Price: {price}")
    print(f"  Availability: {availability}")
    print(f"  Rating: {rating}")
    print()
This example shows several important concepts: reading a value out of an attribute (the title attribute on the <a> tag), extracting text with text(), matching a partial class name with contains(), and post-processing an attribute string with ordinary Python (split()).
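One caution before you write your own version: xpath() returns a list, so indexing with [0], as the example does, raises an IndexError whenever a field is missing. A minimal sketch of a defensive pattern, using a hypothetical helper (first_or_default is not part of lxml), reusing the books from above:

def first_or_default(element, xpath_expr, default="N/A"):
    # Return the first XPath match, or a default if nothing matched
    results = element.xpath(xpath_expr)
    return results[0] if results else default

for book in books:
    price = first_or_default(book, './/p[@class="price"]/text()')
    publisher = first_or_default(book, './/p[@class="publisher"]/text()')  # no such field -> "N/A"
    print(price, publisher)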
Now, create a complex HTML string with multiple items (books, products, quotes, etc.) and extract:
1. Multiple pieces of information from each item,
2. Store the data in a list of dictionaries, where each dictionary represents one item,
3. Handle attributes, text content, and nested structures.
5.1. Create an HTML string with at least 3 items (books, products, quotes, etc.),
5.2. Extract multiple pieces of information from each item (at least 3 fields per item),
5.3. Store the data in a list of dictionaries,
5.4. Display the first few items in a readable format,
5.5. Print the total number of items found.
Submitting your Work
Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.
- firstname_lastname_TDM202_project1.ipynb
You must double check your submission after it is uploaded to Gradescope. You will not receive full credit if your notebook does not render properly or is missing your code, comments, or output.