TDM 30100: Project 10 — 2023

Motivation: Web scraping has always been a popular topic in The Data Mine, and it was one of the requested topics. For the remaining projects, we will be scraping housing data, and potentially doing some sqlite3, containerization, and analysis work as well.

Context: This is the first in a series of web scraping projects that incorporates a variety of skills we’ve touched on in previous Data Mine courses. For this first project, we will start slowly with a selenium review and a small scraping challenge.

Scope: selenium, Python, web scraping

Learning Objectives
  • Use selenium to interact with a web page prior to scraping.

  • Use selenium and xpath expressions to efficiently scrape targeted data.

Make sure to read about and use the template found here, and to review the important information about project submissions here.

Questions

Question 1 (2 pts)

The following code provides you with a template for configuring a Firefox selenium driver that will work on Anvil, as well as a straightforward example that demonstrates how to search web pages and elements using xpath expressions and simulate mouse clicks. Take a moment, run the code, and refresh your understanding.

import time
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

firefox_options = Options()
firefox_options.add_argument("--window-size=810,1080")
# Headless mode means no GUI
firefox_options.add_argument("--headless")
firefox_options.add_argument("--disable-extensions")
firefox_options.add_argument("--no-sandbox")
firefox_options.add_argument("--disable-dev-shm-usage")

driver = webdriver.Firefox(options=firefox_options)
# navigate to the webpage
driver.get("https://books.toscrape.com")

# full page source
print(driver.page_source)

# get the root html element
e = driver.find_element("xpath", "//html")

# print the html element
print(e.get_attribute("outerHTML"))

# find the 'Music' link on the homepage
link = e.find_element("xpath", "//a[contains(text(),'Music')]")
# click the link
link.click()
# delay the program to allow the page to load
time.sleep(5)

# get the new root html element
e = driver.find_element("xpath", "//html")
# print the html element
print(e.get_attribute("outerHTML"))
  1. Please use selenium to get and display the first book’s title and price on the Music books page.

  2. On the same page, find the book titled "How Music Works", click its link, and then scrape and print the book’s information: product description, UPC, and availability.

Take a look at the page source — do you think clicking the book link was needed in order to scrape that data? Why or why not?

You can find more information about XPath here: www.w3schools.com/xml/xpath_intro.asp
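As a refresher, here is one possible way to grab a card’s title and price with relative xpath expressions. The product_pod and price_color class names come from inspecting the books.toscrape.com page source and may need adjusting if the site changes.

# one possible approach: locate the first product card, then use
# relative xpath expressions to pull out the title and price
first_book = driver.find_element("xpath", "//article[@class='product_pod']")
title = first_book.find_element("xpath", ".//h3/a").get_attribute("title")
price = first_book.find_element("xpath", ".//p[@class='price_color']").text
print(title, price)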

Question 2 (6 pts)

Now, let us look into a popular housing market website. zillow.com has extremely rich data on homes for sale, homes for rent, and land lots.

Click around and explore the website a little bit. Note the following.

  1. Homes are typically listed on the right-hand side of the web page in a 21x2 set of "cards", for a total of 40 homes.

    At least in my experimentation, the last row only held 1 card, and there was 1 advertisement card, which I consider spam.

  2. If you want to search for homes for sale, you can use the following link: www.zillow.com/homes/for_sale/{search_term}_rb/, where search_term could be any hyphen-separated set of phrases. For example, to search Lafayette, IN, you could use: www.zillow.com/homes/for_sale/lafayette-in_rb

  3. If you want to search for homes for rent, you can use the following link: www.zillow.com/homes/for_rent/{search_term}_rb/, where search_term could be any hyphen-separated set of phrases. For example, to search Lafayette, IN, you could use: www.zillow.com/homes/for_rent/lafayette-in_rb

  4. If you load, for example, www.zillow.com/homes/for_rent/lafayette-in_rb and rapidly scroll down the right side of the screen where the "cards" are shown, it will take a fraction of a second for some of the cards to load. In fact, unless you scroll, those cards will not load, and if you were to parse the page contents, you would find that not all 40 cards have loaded. This general strategy of loading content as the user scrolls is called lazy loading.

Write a function called get_properties_info that, given a search_term (zipcode), will return a list of property information including zpid, price, number of bedrooms, number of bathrooms, and square footage (sqft). The function should both get all of the cards on a page and cycle through all of the pages of homes for the query.

The following was a good query that had only 2 pages of results.

properties_info = get_properties_info("47933")

You may want to include an internal helper function called _load_cards that accepts the driver and scrolls through the page slowly in order to load all of the cards.

This link will help! Conceptually, here is what we did; a rough sketch of this loop follows the list below.

  1. Get initial set of cards using xpath expressions.

  2. Use driver.execute_script('arguments[0].scrollIntoView();', cards[num_cards-1]) to scroll to the last card that was found in the DOM.

  3. Find cards again (now that more may have loaded after scrolling).

  4. If no more cards were loaded, exit.

  5. Update the number of cards we’ve loaded and repeat.
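Here is a minimal sketch of that loop, assuming a hypothetical cards_xpath argument; finding the real xpath expression for Zillow’s property cards is part of the exercise.

def _load_cards(driver, cards_xpath):
    # cards_xpath is a hypothetical placeholder; you will need to work
    # out the real xpath expression for the property cards yourself
    cards = driver.find_elements("xpath", cards_xpath)
    if not cards:
        return cards
    num_cards = len(cards)
    while True:
        # scroll the last card found into view to trigger lazy loading
        driver.execute_script('arguments[0].scrollIntoView();', cards[num_cards-1])
        time.sleep(5)
        # find the cards again, now that more may have loaded
        cards = driver.find_elements("xpath", cards_xpath)
        if len(cards) == num_cards:
            break  # no new cards loaded, so exit
        num_cards = len(cards)
    return cards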

Sleep 5 seconds using time.sleep(5) between every scroll or link click.

After getting the information from each page, use driver.delete_all_cookies() to clear the cookies and help avoid a captcha.

Instead of extracting the link from the "next page" button and navigating to it directly, use next_page.click() to click on the link. Otherwise, you may get a captcha.

Use something like:

with driver as d:
    d.get(blah)

This way, after exiting the with scope, the driver will be properly closed and quit, which will decrease the likelihood of you getting a captcha.

For our solution, we had a while True: loop in both the _load_cards function and the get_properties_info function, and used the break command inside an if statement to exit.
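Putting those hints together, here is a hedged skeleton of how get_properties_info could be structured. Every xpath below is a placeholder, and the card-parsing details are intentionally left blank; this is a sketch of the control flow, not a working solution.

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.common.exceptions import NoSuchElementException
import time

def get_properties_info(search_term):
    firefox_options = Options()
    firefox_options.add_argument("--headless")
    properties_info = []
    # the with block guarantees the driver is closed and quit on exit
    with webdriver.Firefox(options=firefox_options) as driver:
        driver.get(f"https://www.zillow.com/homes/for_sale/{search_term}_rb/")
        time.sleep(5)
        while True:
            cards = _load_cards(driver, "placeholder: xpath for the cards")
            for card in cards:
                # extract zpid, price, bedrooms, bathrooms, and sqft from
                # each card here and append to properties_info; the xpath
                # details are left for you to figure out
                pass
            # clear cookies after each page to help avoid captchas
            driver.delete_all_cookies()
            try:
                next_page = driver.find_element("xpath", "placeholder: xpath for the next page link")
            except NoSuchElementException:
                break  # no next page link, so this was the last page
            # click the link rather than driver.get to avoid captchas
            next_page.click()
            time.sleep(5)
    return properties_info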

Project 10 Assignment Checklist

  • Jupyter Lab notebook with your code, comments, and output for the assignment

    • firstname-lastname-project10.ipynb.

  • Submit files through Gradescope

Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.