TDM 20200: Project 3 - Web Scraping Introduction 3
Project Objectives
This project covers advanced web scraping techniques including rate limiting, handling dynamic content, working with headers and cookies, error handling, and ethical considerations. These skills are essential for scraping real-world websites responsibly and effectively.
If AI is used in any case, such as for debugging, research, etc., we now require that you submit a link to the entire chat history. For example, if you used ChatGPT, there is a “Share” option in the conversation sidebar. Click on “Create Link” and add the shareable link as part of your citation. The project template in the Examples Book now has a “Link to AI Chat History” section; please include this in all your projects. If you did not use any AI tools, you may write “None”. We allow using AI for learning purposes; however, all submitted materials (code, comments, and explanations) must be your own work and in your own words. No content or ideas should be copied or pasted directly into your projects. Please refer to the-examples-book.com/projects/spring2026/syllabus#guidance-on-generative-ai. Failing to follow these guidelines is considered academic dishonesty.
Questions
Question 1 (2 points)
Let’s start by learning about rate limiting with simple delays:
import requests
from lxml import html
import time

def scrape_with_delay(url, delay=1):
    """Scrape a URL with a delay between requests."""
    response = requests.get(url)
    time.sleep(delay)  # Wait before next request
    return response

# Example: Scraping multiple pages with delays
base_url = "https://quotes.toscrape.com"
pages_to_scrape = [
    f"{base_url}/page/1/",
    f"{base_url}/page/2/",
    f"{base_url}/page/3/"
]

all_quotes = []
for page_url in pages_to_scrape:
    print(f"Scraping {page_url}...")
    response = scrape_with_delay(page_url, delay=2)  # 2 second delay
    tree = html.fromstring(response.text)
    quotes = tree.xpath('//span[@class="text"]/text()')
    all_quotes.extend(quotes)
    print(f" Found {len(quotes)} quotes")

print(f"\nTotal quotes collected: {len(all_quotes)}")
The time.sleep(delay) call pauses execution for the given number of seconds after each request, which is the simplest form of rate limiting.
Now, implement a scraper that:
1. Scrapes multiple pages from quotes.toscrape.com/
2. Adds a 2-second delay between each request,
3. Tracks how long the scraping takes,
4. Calculates the average time per request.
1.1. Write a scraper with rate limiting (2-second delays).
1.2. Scrape at least 5 pages.
1.3. Track total scraping time.
1.4. Calculate and display average time per request (see the timing sketch below).
1.5. Explain why rate limiting is important.
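As a hint for tasks 1.3 and 1.4, the total elapsed time can be measured by recording time.time() before and after your scraping loop. The sketch below only shows the timing logic; the loop body and the num_requests placeholder are illustrative and should be replaced with your own scraper.

import time

start = time.time()
# ... your rate-limited scraping loop from above goes here ...
num_requests = 5                     # replace with however many pages you actually scraped
elapsed = time.time() - start

print(f"Total scraping time: {elapsed:.2f} seconds")
print(f"Average time per request: {elapsed / num_requests:.2f} seconds")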
Question 2 (2 points)
When a server is overloaded or you’re making requests too quickly, you might receive HTTP error codes like 429 (Too Many Requests) or 503 (Service Unavailable). Exponential backoff is a strategy where you wait progressively longer between retries when you encounter errors.
import requests
from lxml import html
import time
import random

def scrape_with_exponential_backoff(url, max_retries=5, initial_delay=1):
    """
    Scrape a URL with exponential backoff retry logic.

    If we get an error, wait longer before retrying:
    - First retry: wait 1 second
    - Second retry: wait 2 seconds
    - Third retry: wait 4 seconds
    - etc. (exponential: 2^retry_number)
    """
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)

            # Check for rate limiting errors
            if response.status_code == 429:
                wait_time = initial_delay * (2 ** attempt)
                print(f"Rate limited! Waiting {wait_time} seconds before retry {attempt + 1}...")
                time.sleep(wait_time)
                continue

            # Check for server errors
            if response.status_code >= 500:
                wait_time = initial_delay * (2 ** attempt)
                print(f"Server error {response.status_code}! Waiting {wait_time} seconds before retry {attempt + 1}...")
                time.sleep(wait_time)
                continue

            # Success!
            response.raise_for_status()  # Raises exception for 4xx/5xx errors
            return response

        except requests.exceptions.RequestException as e:
            if attempt == max_retries - 1:
                # Last attempt failed
                raise Exception(f"Failed after {max_retries} attempts: {e}")

            wait_time = initial_delay * (2 ** attempt)
            # Add some randomness to avoid thundering herd problem
            jitter = random.uniform(0, 0.1 * wait_time)
            total_wait = wait_time + jitter
            print(f"Error: {e}")
            print(f"Waiting {total_wait:.2f} seconds before retry {attempt + 1}...")
            time.sleep(total_wait)

    raise Exception("Max retries exceeded")

# Example usage
url = "https://quotes.toscrape.com/page/1/"
response = scrape_with_exponential_backoff(url)
tree = html.fromstring(response.text)
quotes = tree.xpath('//span[@class="text"]/text()')
print(f"Successfully scraped {len(quotes)} quotes")
Exponential backoff means the wait time doubles with each retry: 1s, 2s, 4s, 8s, etc. This gives the server time to recover. Adding "jitter" (randomness) prevents multiple clients from retrying at exactly the same time (the "thundering herd" problem).
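To make the schedule concrete, here is a purely illustrative sketch that prints the waits for five retries, assuming initial_delay=1 and up to 10% jitter (the same parameters used in the example above):

import random

initial_delay = 1
for attempt in range(5):
    base_wait = initial_delay * (2 ** attempt)      # 1, 2, 4, 8, 16 seconds
    jitter = random.uniform(0, 0.1 * base_wait)     # up to 10% extra, randomized
    print(f"Retry {attempt + 1}: wait roughly {base_wait + jitter:.2f} seconds")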
Now, implement a scraper with exponential backoff that:
1. Handles 429 (Too Many Requests) errors,
2. Handles 503 (Service Unavailable) errors,
3. Implements exponential backoff with jitter,
4. Logs retry attempts and wait times,
5. Scrapes multiple pages, handling errors gracefully.
2.1. Implement exponential backoff function.
2.2. Handle HTTP error codes 429 and 503.
2.3. Add jitter to avoid synchronized retries.
2.4. Test your function (you can simulate errors by making requests too quickly; see the sketch below for one way to test against an endpoint that always returns an error status).
2.5. Scrape multiple pages using your robust scraper.
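One way to exercise the retry path without hammering quotes.toscrape.com is to request a URL that always returns a fixed error status. The sketch below uses httpbin.org's /status/ endpoint for this; that service is external to the course materials, so treat it as one possible testing aid rather than a required tool, and reuse the scrape_with_exponential_backoff function defined in the example above.

# Minimal sketch: point the backoff function at an endpoint that always
# returns 503, so every attempt triggers the retry/backoff branch.
test_url = "https://httpbin.org/status/503"  # external test service (assumption)

try:
    scrape_with_exponential_backoff(test_url, max_retries=3, initial_delay=1)
except Exception as e:
    print(f"As expected, all retries failed: {e}")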
Question 3 (2 points)
Some websites check the "User-Agent" header to identify what type of browser or program is making the request. Some sites block requests that don’t have a proper User-Agent or that look like bots. We can customize our request headers to appear more like a regular browser:
import requests
from lxml import html

# Common User-Agent strings (these identify your "browser")
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
]

def scrape_with_headers(url, user_agent=None):
    """Scrape with custom headers."""
    headers = {
        'User-Agent': user_agent or user_agents[0],
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'keep-alive',
    }
    response = requests.get(url, headers=headers)
    return response

# Example: Scraping with custom headers
url = "https://quotes.toscrape.com/"
response = scrape_with_headers(url)
tree = html.fromstring(response.text)
quotes = tree.xpath('//span[@class="text"]/text()')
print(f"Scraped {len(quotes)} quotes with custom headers")
The User-Agent header tells the server what browser/client is making the request. Some websites block requests without a User-Agent or with suspicious User-Agents. Using a realistic User-Agent string can help avoid blocks. However, always respect robots.txt and terms of service regardless of headers.
Some websites also use cookies for session management. We can handle cookies using a session object:
import requests
from lxml import html

# Create a session to maintain cookies across requests
session = requests.Session()

# Set headers for the session
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
})

# First request - cookies are automatically saved
url1 = "https://quotes.toscrape.com/"
response1 = session.get(url1)
print(f"First request cookies: {session.cookies}")

# Subsequent requests automatically include cookies
url2 = "https://quotes.toscrape.com/page/2/"
response2 = session.get(url2)
print(f"Second request cookies: {session.cookies}")
Now, create a scraper that:
1. Uses custom User-Agent headers,
2. Maintains a session across multiple requests,
3. Rotates between different User-Agent strings for different requests,
4. Scrapes multiple pages while maintaining the session.
3.1. Implement a scraper with custom User-Agent headers.
3.2. Use a session object to maintain cookies.
3.3. Rotate User-Agent strings across requests (see the rotation sketch below).
3.4. Scrape multiple pages using the session.
3.5. Explain why custom headers might be necessary.
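For task 3.3, one simple pattern is to cycle through the user_agents list defined in the example above while reusing a single session so cookies persist. This is a minimal sketch under those assumptions, not the only way to do it; the page range and delay are illustrative.

import itertools
import time

import requests
from lxml import html

ua_cycle = itertools.cycle(user_agents)   # user_agents list from the example above
session = requests.Session()

for page in range(1, 4):
    url = f"https://quotes.toscrape.com/page/{page}/"
    session.headers.update({'User-Agent': next(ua_cycle)})   # rotate User-Agent per request
    response = session.get(url)
    quotes = html.fromstring(response.text).xpath('//span[@class="text"]/text()')
    print(f"Page {page}: {len(quotes)} quotes (UA: {session.headers['User-Agent'][:40]}...)")
    time.sleep(1)   # small courtesy delay between requests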
Question 4 (2 points)
The robots.txt file is a standard that websites use to tell scrapers which parts of the site they can and cannot access. It’s important to check and respect robots.txt:
import requests
from urllib.robotparser import RobotFileParser
from urllib.parse import urljoin, urlparse

def check_robots_txt(base_url, user_agent='*'):
    """Check robots.txt for a given URL."""
    robots_url = urljoin(base_url, '/robots.txt')

    try:
        response = requests.get(robots_url, timeout=5)
        if response.status_code == 200:
            rp = RobotFileParser()
            rp.set_url(robots_url)
            rp.read()

            # Check if we can access a specific path
            parsed_url = urlparse(base_url)
            path = parsed_url.path or '/'
            can_fetch = rp.can_fetch(user_agent, base_url)

            print(f"robots.txt found at: {robots_url}")
            print(f"Can fetch {base_url}? {can_fetch}")

            # Show crawl delay if specified
            crawl_delay = rp.crawl_delay(user_agent)
            if crawl_delay:
                print(f"Crawl delay specified: {crawl_delay} seconds")

            return rp, can_fetch
        else:
            print(f"robots.txt not found or inaccessible (status: {response.status_code})")
            return None, True  # If no robots.txt, assume allowed
    except Exception as e:
        print(f"Error checking robots.txt: {e}")
        return None, True  # If error, assume allowed (but be cautious)

# Example: Check robots.txt
base_url = "https://quotes.toscrape.com"
rp, allowed = check_robots_txt(base_url)

if allowed:
    print("\nProceeding with scraping...")
    # Your scraping code here
else:
    print("\nScraping not allowed by robots.txt!")
Always check robots.txt before you start scraping a website. Even when nothing technically stops you from ignoring it, respecting it (along with the site's terms of service) is part of scraping ethically.
Now, create a function that:
1. Checks robots.txt before scraping,
2. Respects crawl delays specified in robots.txt,
3. Only scrapes paths that are allowed,
4. Implements a scraper that checks multiple websites' robots.txt files.
Test with:
- quotes.toscrape.com/
- books.toscrape.com/
- www.google.com/ (to see a more complex robots.txt)
4.1. Write a function to check and parse robots.txt.
4.2. Respect crawl delays from robots.txt.
4.3. Check if specific paths are allowed before scraping.
4.4. Test your function on multiple websites (see the sketch below).
4.5. Explain why respecting robots.txt is important.
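As a starting point for task 4.4, here is a minimal sketch that reuses the check_robots_txt function from the example above to loop over the three test sites and report whether the root path is allowed and what crawl delay, if any, is declared. The loop structure is illustrative; extend it with your own scraping logic.

sites = [
    "https://quotes.toscrape.com/",
    "https://books.toscrape.com/",
    "https://www.google.com/",
]

for site in sites:
    print(f"\n=== {site} ===")
    rp, allowed = check_robots_txt(site)   # function defined in the example above
    delay = rp.crawl_delay('*') if rp else None
    print(f"Allowed: {allowed}, crawl delay: {delay if delay else 'not specified'}")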
Question 5 (2 points)
Let’s combine all these techniques into a robust, production-ready scraper. We’ll also add logging to track what’s happening:
import requests
from lxml import html
import time
import random
import logging
from urllib.robotparser import RobotFileParser
from urllib.parse import urljoin

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

class RobustScraper:
    """A robust web scraper with rate limiting, error handling, and robots.txt checking."""

    def __init__(self, base_url, user_agent=None, default_delay=1):
        self.base_url = base_url
        self.session = requests.Session()
        self.default_delay = default_delay
        self.robots_parser = None

        # Set headers
        headers = {
            'User-Agent': user_agent or 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        }
        self.session.headers.update(headers)

        # Check robots.txt
        self._check_robots_txt()

    def _check_robots_txt(self):
        """Check and parse robots.txt."""
        robots_url = urljoin(self.base_url, '/robots.txt')
        try:
            response = self.session.get(robots_url, timeout=5)
            if response.status_code == 200:
                self.robots_parser = RobotFileParser()
                self.robots_parser.set_url(robots_url)
                self.robots_parser.read()
                logger.info(f"Loaded robots.txt from {robots_url}")
        except Exception as e:
            logger.warning(f"Could not load robots.txt: {e}")

    def can_fetch(self, url):
        """Check if URL can be fetched according to robots.txt."""
        if self.robots_parser:
            return self.robots_parser.can_fetch(self.session.headers['User-Agent'], url)
        return True

    def get_crawl_delay(self):
        """Get crawl delay from robots.txt."""
        if self.robots_parser:
            delay = self.robots_parser.crawl_delay(self.session.headers['User-Agent'])
            if delay:
                return delay
        return self.default_delay

    def fetch(self, url, max_retries=3, backoff_base=1):
        """Fetch a URL with exponential backoff and error handling."""
        if not self.can_fetch(url):
            logger.warning(f"robots.txt disallows: {url}")
            return None

        for attempt in range(max_retries):
            try:
                # Respect crawl delay
                delay = self.get_crawl_delay()
                time.sleep(delay)

                response = self.session.get(url, timeout=10)

                # Handle rate limiting
                if response.status_code == 429:
                    wait_time = backoff_base * (2 ** attempt)
                    logger.warning(f"Rate limited. Waiting {wait_time}s...")
                    time.sleep(wait_time)
                    continue

                # Handle server errors
                if response.status_code >= 500:
                    wait_time = backoff_base * (2 ** attempt)
                    logger.warning(f"Server error {response.status_code}. Waiting {wait_time}s...")
                    time.sleep(wait_time)
                    continue

                response.raise_for_status()
                logger.info(f"Successfully fetched {url}")
                return response

            except requests.exceptions.RequestException as e:
                if attempt == max_retries - 1:
                    logger.error(f"Failed to fetch {url} after {max_retries} attempts: {e}")
                    return None
                wait_time = backoff_base * (2 ** attempt) + random.uniform(0, 0.1)
                logger.warning(f"Error fetching {url}: {e}. Retrying in {wait_time:.2f}s...")
                time.sleep(wait_time)

        return None

    def scrape_quotes(self, url):
        """Scrape quotes from a page."""
        response = self.fetch(url)
        if not response:
            return []

        tree = html.fromstring(response.text)
        quotes = []

        quote_containers = tree.xpath('//div[@class="quote"]')
        for container in quote_containers:
            text = container.xpath('.//span[@class="text"]/text()')
            author = container.xpath('.//small[@class="author"]/text()')
            if text and author:
                quotes.append({
                    'text': text[0],
                    'author': author[0]
                })

        logger.info(f"Scraped {len(quotes)} quotes from {url}")
        return quotes

# Example usage
scraper = RobustScraper("https://quotes.toscrape.com/")
quotes = scraper.scrape_quotes("https://quotes.toscrape.com/page/1/")

print(f"\nScraped {len(quotes)} quotes:")
for quote in quotes[:3]:
    print(f" - {quote['text'][:50]}... - {quote['author']}")
Now, create your own robust scraper class that:
1. Checks robots.txt before scraping,
2. Implements exponential backoff,
3. Handles errors gracefully,
4. Includes logging,
5. Respects crawl delays,
6. Uses custom headers.
Use it to scrape multiple pages from a website of your choice.
5.1. Create a RobustScraper class with all advanced features.
5.2. Implement robots.txt checking.
5.3. Add logging to track scraping activity.
5.4. Test your scraper on multiple pages.
5.5. Demonstrate error handling and retry logic.
5.6. Explain the ethical considerations of web scraping.
Submitting your Work
Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.
- firstname_lastname_project3.ipynb
It is necessary to document your work, with comments about each solution. All of your work needs to be your own, and any outside sources (people, internet pages, generative AI, etc.) must be cited properly in the project template. Please take the time to double check your work after submitting: you will not receive full credit if the notebook you submit to Gradescope does not show all of your solutions.