TDM 20200: Project 3 - Web Scraping Introduction 3

Project Objectives

This project covers advanced web scraping techniques including rate limiting, handling dynamic content, working with headers and cookies, error handling, and ethical considerations. These skills are essential for scraping real-world websites responsibly and effectively.

Learning Objectives
  • Understand rate limiting and why it’s important

  • Implement exponential backoff and other rate limiting strategies

  • Handle HTTP errors and retry logic

  • Work with request headers and user agents

  • Understand robots.txt and ethical scraping practices

  • Handle dynamic content and JavaScript-rendered pages

  • Implement robust error handling and logging

If AI is used in any capacity, such as for debugging, research, etc., we now require that you submit a link to the entire chat history. For example, if you used ChatGPT, there is a “Share” option in the conversation sidebar. Click on “Create Link” and add the shareable link as part of your citation.

The project template in the Examples Book now has a “Link to AI Chat History” section; please include this section in all of your projects. If you did not use any AI tools, you may write “None”.

We allow using AI for learning purposes; however, all submitted materials (code, comments, and explanations) must be your own work, written in your own words. No content or ideas should be copied and pasted directly into your projects. Please refer to the-examples-book.com/projects/spring2026/syllabus#guidance-on-generative-ai. Failing to follow these guidelines is considered academic dishonesty.

Questions

Question 1 (2 points)

Let’s start by learning about rate limiting with simple delays:

import requests
from lxml import html
import time

def scrape_with_delay(url, delay=1):
    """Scrape a URL with a delay between requests."""
    response = requests.get(url)
    time.sleep(delay)  # Wait before next request
    return response

# Example: Scraping multiple pages with delays
base_url = "https://quotes.toscrape.com"
pages_to_scrape = [
    f"{base_url}/page/1/",
    f"{base_url}/page/2/",
    f"{base_url}/page/3/"
]

all_quotes = []
for page_url in pages_to_scrape:
    print(f"Scraping {page_url}...")
    response = scrape_with_delay(page_url, delay=2)  # 2 second delay
    tree = html.fromstring(response.text)

    quotes = tree.xpath('//span[@class="text"]/text()')
    all_quotes.extend(quotes)

    print(f"  Found {len(quotes)} quotes")

print(f"\nTotal quotes collected: {len(all_quotes)}")

The time.sleep() function pauses execution for the specified number of seconds. A delay of 1-2 seconds between requests is usually reasonable for most websites. Adjust based on the website’s load and your needs.
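
If you also need to time your scraper, one option, sketched below, is to record a timestamp with time.perf_counter() before the loop and compute the elapsed and average times afterward (the page count and 2-second delay here are just placeholders):

import time
import requests

pages = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 4)]

start = time.perf_counter()  # high-resolution clock for measuring elapsed time
for url in pages:
    response = requests.get(url)
    print(f"{url}: status {response.status_code}")
    time.sleep(2)  # 2-second delay between requests

elapsed = time.perf_counter() - start
print(f"Total time: {elapsed:.2f} seconds")
print(f"Average time per request: {elapsed / len(pages):.2f} seconds")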

Now, implement a scraper that:
1. Scrapes at least 5 pages from quotes.toscrape.com,
2. Adds a 2-second delay between each request,
3. Tracks how long the scraping takes,
4. Calculates the average time per request.

Deliverables

1.1. Write a scraper with rate limiting (2-second delays).
1.2. Scrape at least 5 pages.
1.3. Track total scraping time.
1.4. Calculate and display average time per request.
1.5. Explain why rate limiting is important.

Question 2 (2 points)

When a server is overloaded or you’re making requests too quickly, you might receive HTTP error codes like 429 (Too Many Requests) or 503 (Service Unavailable). Exponential backoff is a strategy where you wait progressively longer between retries when you encounter errors.

import requests
from lxml import html
import time
import random

def scrape_with_exponential_backoff(url, max_retries=5, initial_delay=1):
    """
    Scrape a URL with exponential backoff retry logic.

    If we get an error, wait longer before retrying:
    - First retry: wait 1 second
    - Second retry: wait 2 seconds
    - Third retry: wait 4 seconds
    - etc. (exponential: 2^retry_number)
    """
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)

            # Check for rate limiting errors
            if response.status_code == 429:
                wait_time = initial_delay * (2 ** attempt)
                print(f"Rate limited! Waiting {wait_time} seconds before retry {attempt + 1}...")
                time.sleep(wait_time)
                continue

            # Check for server errors
            if response.status_code >= 500:
                wait_time = initial_delay * (2 ** attempt)
                print(f"Server error {response.status_code}! Waiting {wait_time} seconds before retry {attempt + 1}...")
                time.sleep(wait_time)
                continue

            # Raise for any remaining 4xx/5xx errors; otherwise the request succeeded
            response.raise_for_status()
            return response

        except requests.exceptions.RequestException as e:
            if attempt == max_retries - 1:
                # Last attempt failed
                raise Exception(f"Failed after {max_retries} attempts: {e}")

            wait_time = initial_delay * (2 ** attempt)
            # Add some randomness to avoid thundering herd problem
            jitter = random.uniform(0, 0.1 * wait_time)
            total_wait = wait_time + jitter

            print(f"Error: {e}")
            print(f"Waiting {total_wait:.2f} seconds before retry {attempt + 1}...")
            time.sleep(total_wait)

    raise Exception("Max retries exceeded")

# Example usage
url = "https://quotes.toscrape.com/page/1/"
response = scrape_with_exponential_backoff(url)
tree = html.fromstring(response.text)
quotes = tree.xpath('//span[@class="text"]/text()')
print(f"Successfully scraped {len(quotes)} quotes")

Exponential backoff means the wait time doubles with each retry: 1s, 2s, 4s, 8s, etc. This gives the server time to recover. Adding "jitter" (randomness) prevents multiple clients from retrying at exactly the same time (the "thundering herd" problem).
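
To see the schedule concretely, the short sketch below (the values are illustrative) prints the base wait and the jittered wait for each retry attempt. Some servers also include a Retry-After header with 429 or 503 responses; when it is present, it tells you exactly how long to wait before retrying.

import random

initial_delay = 1   # seconds before the first retry
max_retries = 5

for attempt in range(max_retries):
    base_wait = initial_delay * (2 ** attempt)    # 1, 2, 4, 8, 16 seconds
    jitter = random.uniform(0, 0.1 * base_wait)   # up to 10% extra, chosen at random
    print(f"Retry {attempt + 1}: base wait {base_wait}s, with jitter {base_wait + jitter:.2f}s")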

Now, implement a scraper with exponential backoff that:
1. Handles 429 (Too Many Requests) errors,
2. Handles 503 (Service Unavailable) errors,
3. Implements exponential backoff with jitter,
4. Logs retry attempts and wait times,
5. Scrapes multiple pages, handling errors gracefully.

Deliverables

2.1. Implement exponential backoff function.
2.2. Handle HTTP error codes 429 and 503.
2.3. Add jitter to avoid synchronized retries.
2.4. Test your function (you can simulate errors by making requests too quickly).
2.5. Scrape multiple pages using your robust scraper.

Question 3 (2 points)

Some websites check the "User-Agent" header to identify what type of browser or program is making the request, and may block requests that lack a proper User-Agent or that look like they come from a bot. We can customize our request headers to look more like a regular browser:

import requests
from lxml import html

# Common User-Agent strings (these identify your "browser")
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
]

def scrape_with_headers(url, user_agent=None):
    """Scrape with custom headers."""
    headers = {
        'User-Agent': user_agent or user_agents[0],
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'keep-alive',
    }

    response = requests.get(url, headers=headers)
    return response

# Example: Scraping with custom headers
url = "https://quotes.toscrape.com/"
response = scrape_with_headers(url)
tree = html.fromstring(response.text)
quotes = tree.xpath('//span[@class="text"]/text()')
print(f"Scraped {len(quotes)} quotes with custom headers")

The User-Agent header tells the server what browser/client is making the request. Some websites block requests without a User-Agent or with suspicious User-Agents. Using a realistic User-Agent string can help avoid blocks. However, always respect robots.txt and terms of service regardless of headers.
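
To see what the server sees, the short sketch below prints the default User-Agent that requests sends (something like python-requests/2.x; the exact version depends on your installation) and the headers actually transmitted when a custom User-Agent is supplied:

import requests

# Default User-Agent used by the requests library (e.g., "python-requests/2.x.x")
print(requests.utils.default_user_agent())

url = "https://quotes.toscrape.com/"

# Headers actually sent with a plain request
response = requests.get(url)
print(response.request.headers.get('User-Agent'))

# Headers actually sent when we supply a custom User-Agent
custom = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'}
response = requests.get(url, headers=custom)
print(response.request.headers.get('User-Agent'))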

Some websites also use cookies for session management. We can handle cookies using a session object:

import requests
from lxml import html

# Create a session to maintain cookies across requests
session = requests.Session()

# Set headers for the session
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
})

# First request - cookies are automatically saved
url1 = "https://quotes.toscrape.com/"
response1 = session.get(url1)
print(f"First request cookies: {session.cookies}")

# Subsequent requests automatically include cookies
url2 = "https://quotes.toscrape.com/page/2/"
response2 = session.get(url2)
print(f"Second request cookies: {session.cookies}")

Now, create a scraper that:
1. Uses custom User-Agent headers,
2. Maintains a session across multiple requests,
3. Rotates between different User-Agent strings for different requests,
4. Scrapes multiple pages while maintaining the session.

Deliverables

3.1. Implement a scraper with custom User-Agent headers.
3.2. Use a session object to maintain cookies.
3.3. Rotate User-Agent strings across requests.
3.4. Scrape multiple pages using the session.
3.5. Explain why custom headers might be necessary.

Question 4 (2 points)

The robots.txt file is a standard that websites use to tell scrapers which parts of the site they can and cannot access. It’s important to check and respect robots.txt:

import requests
from urllib.robotparser import RobotFileParser
from urllib.parse import urljoin

def check_robots_txt(base_url, user_agent='*'):
    """Check robots.txt for a given URL."""
    robots_url = urljoin(base_url, '/robots.txt')

    try:
        response = requests.get(robots_url, timeout=5)
        if response.status_code == 200:
            rp = RobotFileParser()
            # Reuse the robots.txt content we already downloaded instead of fetching it again
            rp.parse(response.text.splitlines())

            # Check whether the base URL itself is allowed for this user agent
            can_fetch = rp.can_fetch(user_agent, base_url)
            print(f"robots.txt found at: {robots_url}")
            print(f"Can fetch {base_url}? {can_fetch}")

            # Show crawl delay if specified
            crawl_delay = rp.crawl_delay(user_agent)
            if crawl_delay:
                print(f"Crawl delay specified: {crawl_delay} seconds")

            return rp, can_fetch
        else:
            print(f"robots.txt not found or inaccessible (status: {response.status_code})")
            return None, True  # If no robots.txt, assume allowed
    except Exception as e:
        print(f"Error checking robots.txt: {e}")
        return None, True  # If error, assume allowed (but be cautious)

# Example: Check robots.txt
base_url = "https://quotes.toscrape.com"
rp, allowed = check_robots_txt(base_url)

if allowed:
    print("\nProceeding with scraping...")
    # Your scraping code here
else:
    print("\nScraping not allowed by robots.txt!")

Always check robots.txt before scraping. It lives at the root of the site, for example example.com/robots.txt. The file specifies which user agents can access which paths, and may include crawl delays. Respecting robots.txt is both the ethical thing to do and helps you avoid getting blocked.
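
For illustration, the sketch below parses a small, made-up robots.txt directly with RobotFileParser.parse(), which accepts a list of lines instead of fetching a URL; the rules shown are hypothetical.

from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, for illustration only
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
]

rp = RobotFileParser()
rp.parse(robots_lines)  # parse rules from a list of lines instead of fetching a URL

print(rp.can_fetch("*", "https://example.com/page/1/"))    # True: not disallowed
print(rp.can_fetch("*", "https://example.com/private/x"))  # False: matches the Disallow rule
print(rp.crawl_delay("*"))                                 # 5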

Now, create a function that:
1. Checks robots.txt before scraping,
2. Respects crawl delays specified in robots.txt,
3. Only scrapes paths that are allowed,
4. Implements a scraper that checks multiple websites' robots.txt files.

Test with:
- quotes.toscrape.com/
- books.toscrape.com/
- www.google.com/ (to see a more complex robots.txt)

Deliverables

4.1. Write a function to check and parse robots.txt.
4.2. Respect crawl delays from robots.txt.
4.3. Check if specific paths are allowed before scraping.
4.4. Test your function on multiple websites.
4.5. Explain why respecting robots.txt is important.

Question 5 (2 points)

Let’s combine all these techniques into a robust, production-ready scraper. We’ll also add logging to track what’s happening:

import requests
from lxml import html
import time
import random
import logging
from urllib.robotparser import RobotFileParser
from urllib.parse import urljoin

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

class RobustScraper:
    """A robust web scraper with rate limiting, error handling, and robots.txt checking."""

    def __init__(self, base_url, user_agent=None, default_delay=1):
        self.base_url = base_url
        self.session = requests.Session()
        self.default_delay = default_delay
        self.robots_parser = None

        # Set headers
        headers = {
            'User-Agent': user_agent or 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        }
        self.session.headers.update(headers)

        # Check robots.txt
        self._check_robots_txt()

    def _check_robots_txt(self):
        """Check and parse robots.txt."""
        robots_url = urljoin(self.base_url, '/robots.txt')
        try:
            response = self.session.get(robots_url, timeout=5)
            if response.status_code == 200:
                self.robots_parser = RobotFileParser()
                # Parse the content we already downloaded rather than fetching it a second time
                self.robots_parser.parse(response.text.splitlines())
                logger.info(f"Loaded robots.txt from {robots_url}")
        except Exception as e:
            logger.warning(f"Could not load robots.txt: {e}")

    def can_fetch(self, url):
        """Check if URL can be fetched according to robots.txt."""
        if self.robots_parser:
            return self.robots_parser.can_fetch(self.session.headers['User-Agent'], url)
        return True

    def get_crawl_delay(self):
        """Get crawl delay from robots.txt."""
        if self.robots_parser:
            delay = self.robots_parser.crawl_delay(self.session.headers['User-Agent'])
            if delay:
                return delay
        return self.default_delay

    def fetch(self, url, max_retries=3, backoff_base=1):
        """Fetch a URL with exponential backoff and error handling."""
        if not self.can_fetch(url):
            logger.warning(f"robots.txt disallows: {url}")
            return None

        for attempt in range(max_retries):
            try:
                # Respect crawl delay
                delay = self.get_crawl_delay()
                time.sleep(delay)

                response = self.session.get(url, timeout=10)

                # Handle rate limiting
                if response.status_code == 429:
                    wait_time = backoff_base * (2 ** attempt)
                    logger.warning(f"Rate limited. Waiting {wait_time}s...")
                    time.sleep(wait_time)
                    continue

                # Handle server errors
                if response.status_code >= 500:
                    wait_time = backoff_base * (2 ** attempt)
                    logger.warning(f"Server error {response.status_code}. Waiting {wait_time}s...")
                    time.sleep(wait_time)
                    continue

                response.raise_for_status()
                logger.info(f"Successfully fetched {url}")
                return response

            except requests.exceptions.RequestException as e:
                if attempt == max_retries - 1:
                    logger.error(f"Failed to fetch {url} after {max_retries} attempts: {e}")
                    return None

                wait_time = backoff_base * (2 ** attempt) + random.uniform(0, 0.1)
                logger.warning(f"Error fetching {url}: {e}. Retrying in {wait_time:.2f}s...")
                time.sleep(wait_time)

        return None

    def scrape_quotes(self, url):
        """Scrape quotes from a page."""
        response = self.fetch(url)
        if not response:
            return []

        tree = html.fromstring(response.text)
        quotes = []

        quote_containers = tree.xpath('//div[@class="quote"]')
        for container in quote_containers:
            text = container.xpath('.//span[@class="text"]/text()')
            author = container.xpath('.//small[@class="author"]/text()')

            if text and author:
                quotes.append({
                    'text': text[0],
                    'author': author[0]
                })

        logger.info(f"Scraped {len(quotes)} quotes from {url}")
        return quotes

# Example usage
scraper = RobustScraper("https://quotes.toscrape.com/")
quotes = scraper.scrape_quotes("https://quotes.toscrape.com/page/1/")
print(f"\nScraped {len(quotes)} quotes:")
for quote in quotes[:3]:
    print(f"  - {quote['text'][:50]}... - {quote['author']}")

Now, create your own robust scraper class that:
1. Checks robots.txt before scraping,
2. Implements exponential backoff,
3. Handles errors gracefully,
4. Includes logging,
5. Respects crawl delays,
6. Uses custom headers.

Use it to scrape multiple pages from a website of your choice.

Deliverables

5.1. Create a RobustScraper class with all advanced features.
5.2. Implement robots.txt checking.
5.3. Add logging to track scraping activity.
5.4. Test your scraper on multiple pages.
5.5. Demonstrate error handling and retry logic.
5.6. Explain the ethical considerations of web scraping.

Submitting your Work

Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.

Items to submit
  • firstname_lastname_project3.ipynb

It is necessary to document your work, with comments about each solution. All of your work must be your own, and any outside sources (people, internet pages, generative AI, etc.) must be cited properly in the project template.

You must double check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file rendered properly and contains your code, markdown, and code output when in fact it does not.

Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.