STAT 29000: Project 3 — Spring 2022

Motivation: Web scraping takes practice, and it is important to work through a variety of common tasks in order to know how to handle those tasks when you next run into them. In this project, we will use a variety of scraping tools in order to scrape data from zillow.com.

Context: In the previous project, we got our first taste of actually scraping data from a website and using a parser to extract the information we were interested in. In this project, we will introduce some tasks that will require you to use a tool that lets you interact with a browser: selenium.

Scope: python, web scraping, selenium

Learning Objectives
  • Review and summarize the differences between XML and HTML/CSV.

  • Use the requests package to scrape a web page.

  • Use the lxml package to filter and parse data from a scraped web page.

  • Use selenium to interact with a browser in order to get a web page to a desired state for scraping.

Make sure to read about, and use, the template found here, and the important information about project submissions here.

Questions

Question 1

Pop open a browser and visit zillow.com. Many websites have a similar interface — a bold and centered search bar for a user to interact with.

First, in your browser, type 34474 into the search bar and press enter/return. There are two possible outcomes of this search, depending on the computer you are using and whether or not you’ve been browsing zillow. The first is your search results. The second is a page where the user is asked to select which type of listing they would like to see.

This second option may or may not consistently pop up. For this reason, we’ve included the relevant HTML below, for your convenience.

<div>
    <img alt="" src="https://www.zillowstatic.com/s3/homepage/static/interstitial_graphic.png" class="sc-14dvu6m-0 iYqEdo " width="262px" height="100px">
    <h3 id="interstitial-title" class="sc-14dvu6m-1 kvYidp ">What type of listings would you like to see?</h3>
    <ul class="sc-14dvu6m-2 gfkDFS listing-interstitial-buttons">
        <li><button class="StyledButton-c11n-8-48-0__sc-wpcbcc-0 bCYrmZ">For sale</button></li>
        <li><button class="StyledButton-c11n-8-48-0__sc-wpcbcc-0 bCYrmZ">For rent</button></li>
        <li><button class="StyledTextButton-c11n-8-48-0__sc-n1gfmh-0 jBjBRQ">Skip this question</button></li>
    </ul>
</div>

Remember that the value of an element is the text that is displayed between the tags. For example, the following element has "happy" as its value.

<div data-fun='definitely'>happy</div>

You can use XPath expressions to find elements by their value. For example, the following XPath expression will find all div elements with the value "happy".

//div[text()='happy']

Use selenium, and write Python code that first finds the search bar input element. Then, use selenium to emulate typing the zip code 34474 into the search bar followed by a press of the enter/return button.
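Something along these lines should work (a minimal sketch; the "..." XPath is a placeholder, and you will need to use the Inspector to find attributes that identify the search bar's input element).

import time
from selenium.webdriver.common.keys import Keys

# The "..." XPath is a placeholder; inspect zillow.com to find attributes
# that identify the search bar's input element.
search_bar = driver.find_element_by_xpath("...")
search_bar.send_keys("34474")       # emulate typing the zip code
search_bar.send_keys(Keys.RETURN)   # emulate pressing enter/return
time.sleep(5)                       # give the page time to respond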

Confirm your code works by printing the current URL of the page after the search has been performed. What happens? Well, it is likely that the URL is unchanged. Remember, the "For sale", "For rent", "Skip this question" page may pop up, and this page has the same URL. To confirm this, instead of printing the URL, print the HTML after the search.

To print the HTML of an element using selenium, you can use the following code.

element = driver.find_element_by_xpath("//some_xpath")
print(element.get_attribute("outerHTML"))

If you don’t know what HTML to expect, the html element is a safe bet.

element = driver.find_element_by_xpath("//html")
print(element.get_attribute("outerHTML"))

Of course, please only print a sample of the HTML — we don’t want to print it all — that would be a lot!
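For example, slicing the string is one easy way to print just a sample.

# Print only the first 500 characters of the page's HTML.
element = driver.find_element_by_xpath("//html")
print(element.get_attribute("outerHTML")[:500])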

Remember, in the background, selenium is actually launching a browser — just like a human would. Sometimes, you need to wait for a page to load before you can properly interact with it. It is highly recommended you use the time.sleep function to wait 5 seconds after a call to driver.get or element.send_keys.

One downside to selenium is that it has more boilerplate code than requests, for example. Please use the following code to instantiate your selenium driver on Brown.

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import uuid

firefox_options = Options()
firefox_options.add_argument("window-size=1920,1080")
# Headless mode means no GUI
firefox_options.add_argument("--headless")
firefox_options.add_argument("start-maximized")
firefox_options.add_argument("disable-infobars")
firefox_options.add_argument("--disable-extensions")
firefox_options.add_argument("--no-sandbox")
firefox_options.add_argument("--disable-dev-shm-usage")
firefox_options.add_argument('--disable-blink-features=AutomationControlled')

# Set the location of the executable Firefox program on Brown
firefox_options.binary_location = '/depot/datamine/bin/firefox/firefox'

profile = webdriver.FirefoxProfile()

profile.set_preference("dom.webdriver.enabled", False)
profile.set_preference('useAutomationExtension', False)
profile.update_preferences()

desired = DesiredCapabilities.FIREFOX

# Set the location of the executable geckodriver program on Brown
uu = uuid.uuid4()
driver = webdriver.Firefox(log_path=f"/tmp/{uu}", options=firefox_options, executable_path='/depot/datamine/bin/geckodriver', firefox_profile=profile, desired_capabilities=desired)

Please feel free to "reset" your driver (for example, if you’ve lost track of "where" it is or you aren’t getting results you expected) by running the following code, followed by the code shown above.

driver.quit()

# instantiate driver again
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

Okay, let’s go forward with the assumption that we will always see the "For sale", "For rent", and "Skip this question" page. We need our code to handle this situation and click the "Skip this question" button so we can get our search results!

Write Python code that uses selenium to find the "Skip this question" button and click it. Confirm your code works by printing the current URL of the page after the button has been clicked.

Don’t forget, it may be best to put a time.sleep(5) after the click() method call — before printing the current URL.
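A minimal sketch, using the text()-based XPath from question 1 and the button text from the HTML snippet shown there:

import time

# The button text comes from the HTML snippet in question 1.
skip_button = driver.find_element_by_xpath("//button[text()='Skip this question']")
skip_button.click()
time.sleep(5)              # wait for the next page to load
print(driver.current_url)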

Uh oh! If you did this correctly, it is likely that the URL is not quite right — something like: www.zillow.com/homes/rb/. By default, this URL will place the nearest city in the search bar — this is not what we wanted. On the bright side, we did notice (when doing this search manually) that the URL should look like: www.zillow.com/homes/34474_rb/ — we can just insert our zip code directly in the URL and that should work without any fuss, plus we save some page loads and clicks. Great!

If you are paying close attention, you will find that this is an inconsistency between using a browser manually and using selenium. selenium isn’t saving the same data (cookies and local storage) as your browser is, and therefore doesn’t "remember" the zip code you are searching for after that intermediate "For sale", "For rent", and "Skip this question" step. Luckily, modifying the URL works better anyway.

Test out (using selenium) that simply inserting the zip code in the URL works as intended. Finding the title element and printing its contents should quickly verify that it works.
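To navigate, something like the following should work (a minimal sketch, using the URL pattern noted above); the snippet after it shows how to print the title.

import time

driver.get("https://www.zillow.com/homes/34474_rb/")
time.sleep(5)  # give the results page time to load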

element = driver.find_element_by_xpath("//title")
print(element.get_attribute("outerHTML"))
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Okay great! Take some time to open www.zillow.com/homes/34474_rb/ in a browser and use the Inspector to figure out how the web page is structured. For now, let’s not worry about any of the filters. The main useful content is within the cards shown on the page. Price, number of beds, number of baths, square feet, address, etc. are all listed within each of the cards.

What non-li element contains the cards in their entirety? Use selenium and XPath expressions to extract those elements from the web page. Print the value of the id attribute for each of the cards. How many cards were there? (This could vary depending on when the data was scraped — that is okay.)

You can use the id attribute in combination with the starts-with XPath function to find these elements, because each id starts with the same 4-5 letter prefix.
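The usage of starts-with looks something like the following (a sketch only; the tag name and prefix here are placeholders, not the real answer).

# The tag name and id prefix below are placeholders; substitute the element
# type and prefix you observe in the Inspector.
cards = driver.find_elements_by_xpath("//sometag[starts-with(@id, 'someprefix')]")
for card in cards:
    print(card.get_attribute("id"))
print(len(cards))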

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Write code to calculate the mean price of the cards on the page, as well as the mean square footage. Print both values.

Uh oh! Once again, something is not working right. If you were to dig in, you’d find that only about 10 or so of the cards contain their data. This is because the cards are lazy-loaded, meaning you must scroll in order for the rest of the info to show up. You can verify this by scrolling very quickly in your browser: even if the page has been loaded for 10 seconds, the content at the bottom will take a moment to appear after a fast scroll.

To fix this problem — we need to scroll! Try the following code. Of course, fill in the find_elements_by_xpath method calls with the correct XPath expression (for both calls). You’ll notice that before we scroll, a card further down the page (for example, cards[30]) will not contain the data we are looking for, but after our scrolling it will! Super cool!

import time
from selenium.common.exceptions import StaleElementReferenceException

cards = driver.find_elements_by_xpath("...")
print(cards[30].get_attribute("outerHTML"))

# Let's load every 2 cards or so at a time
for idx, card in enumerate(cards):
    if idx % 2 == 0:
        try:
            driver.execute_script('arguments[0].scrollIntoView();', card)
            time.sleep(2)

        except StaleElementReferenceException:
            # every once in a while we will get a StaleElementReferenceException
            # because we are trying to access or scroll to an element that has changed.
            # this probably means we can skip it because the data has already loaded.
            continue

cards = driver.find_elements_by_xpath("...")
print(cards[30].get_attribute("outerHTML"))

Your project writer is mean. Of course, not every card contains a house — some of the listings are just land. Unfortunately, land doesn’t have a square footage listed on the website! Do something similar to the following to skip over those annoying plots of land (and don’t forget to fill in the XPaths).

from selenium.common.exceptions import NoSuchElementException
import sys
import re

prices = []
sq_ftgs = []
for card in cards:
    try:
        sqft = card.find_element_by_xpath("...").text
        sqft = re.sub('[^0-9.]', '', sqft)

        # if there isn't any sq footage skip the card entirely
        if sqft == '':
            continue

        price = card.find_element_by_xpath("...").text
        price = re.sub('[^0-9.]', '', price)

        # if there isn't any price skip the card entirely
        if price == '':
            continue

        sq_ftgs.append(float(sqft))
        prices.append(float(price))

    except NoSuchElementException:
        # verify that it is a plot of land, if not, panic
        is_lot = 'land' in card.find_element_by_xpath(".//ul[@class='list-card-details']/li[2]").text.lower()
        if not is_lot:
            print("NOT LAND")
            print(card.find_element_by_xpath(".//ul[@class='list-card-details']/li[2]").text)
            sys.exit(0)
        else:
            continue

print(sum(prices)/len(prices))
print(sum(sq_ftgs)/len(sq_ftgs))
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Update your code from question (4) to first filter the homes by the number of bedrooms and bathrooms. Let’s look at some bigger homes. Filter to get houses with 4+ bedrooms and 3+ bathrooms. Recalculate the mean price and square footage for said houses. Print the values.

To apply said filters, you will need to emulate 3 clicks. One to activate the menu of filters, another to select the number of bedrooms, and another to select the number of bathrooms. You should be able to use a combination of element type (div/button/span/etc.) and attributes to accomplish this.
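One possible shape for this code is sketched below; the "..." placeholders follow the same convention as in question 4, and you will need to fill them in with XPath expressions you find using the Inspector.

import time

driver.find_element_by_xpath("...").click()   # open the filter menu
time.sleep(2)
driver.find_element_by_xpath("...").click()   # select 4+ bedrooms
time.sleep(2)
driver.find_element_by_xpath("...").click()   # select 3+ bathrooms
time.sleep(2)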

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 6 (optional, 0 pts)

Package your code up into a function that lets you choose the zip code, number of bedrooms, and number of bathrooms. Experiment with the function for different combinations and print your results. If you really want to have some fun, create an interesting graphic to show your results.
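One possible skeleton is shown below (a sketch only; the function name, parameters, and return values are just one way to organize it).

def summarize_zillow(zip_code, min_beds, min_baths):
    """
    Hypothetical wrapper combining the steps from questions 2 through 5:
    navigate to www.zillow.com/homes/{zip_code}_rb/, apply the bedroom and
    bathroom filters, scroll so every card loads its data, and return the
    mean price and mean square footage.
    """
    ...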

Please make sure to double check that your submission is complete, and contains all of your code and output, before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.