TDM 20200: Project 3 — 2023

Motivation: Web scraping takes practice, and it is important to work through a variety of common tasks in order to know how to handle them when you next run into them. In this project, we will use a variety of scraping tools to scrape data from zillow.com.

Context: In the previous project, we got our first taste of actually scraping data from a website and using a parser to extract the information we were interested in. In this project, we will introduce some tasks that will require you to use selenium, a tool that lets you interact with a browser.

Scope: python, web scraping, selenium

Learning Objectives
  • Review and summarize the differences between XML and HTML/CSV.

  • Use the requests package to scrape a web page.

  • Use the lxml package to filter and parse data from a scraped web page.

  • Use selenium to interact with a browser in order to get a web page to a desired state for scraping.

Make sure to read about and use the template found here, as well as the important information about project submissions here.

Questions

Question 1

Pop open a browser and visit zillow.com. Many websites have a similar interface — a bold and centered search bar for a user to interact with.

First, in your browser, type 32607 into the search bar and press enter/return. There are two possible outcomes of this search, depending on the computer you are using and whether or not you’ve been browsing zillow. The first is your search results. The second is a page where the user is asked to select which type of listing they would like to see.

This second option may or may not consistently pop up. For this reason, we’ve included the relevant HTML below, for your convenience.

<div class="yui3-lightbox interstitial">
    <div class="yui3-lightbox-content lightbox_responsive">
        <a class="zsg-icon-x-thin lightbox-close" title="Close" tabindex="-1"></a>
        <div class="lightbox-body zsg-tooltip-viewport">
            <div>
                <img alt="" src="https://www.zillowstatic.com/s3/homepage/static/interstitial_graphic.png" class="sc-14dvu6m-0 iYqEdo " width="262px" height="100px">
                <h3 id="interstitial-title" class="sc-14dvu6m-1 kvYidp ">What type of listings would you like to see?</h3>
                <ul class="sc-14dvu6m-2 gfkDFS listing-interstitial-buttons">
                    <li>
                        <button class="StyledButton-c11n-8-69-2__sc-wpcbcc-0 lebuee">For sale</button>
                    </li>
                    <li>
                        <button class="StyledButton-c11n-8-69-2__sc-wpcbcc-0 lebuee">For rent</button>
                    </li>
                    <li>
                        <button class="StyledTextButton-c11n-8-69-2__sc-n1gfmh-0 iHnAnn">Skip this question</button>
                    </li>
                </ul>
            </div>
        </div>
    </div>
</div>

Remember that the value of an element is the text that is displayed between the tags. For example, the following element has "happy" as its value.

<div data-fun='definitely'>happy</div>

You can use XPath expressions to find elements by their value. For example, the following XPath expression will find all div elements with the value "happy".

//div[text()='happy']
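For example, a minimal sketch using selenium to find and print such elements (this assumes a driver instance, as shown later in this question):

elements = driver.find_elements("xpath", "//div[text()='happy']")
for element in elements:
    print(element.text)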

Using selenium, write Python code that first finds the search bar input element of the zillow.com home page. Then, use selenium to emulate typing the zip code 32607 into the search bar, followed by a press of the enter/return key.

Confirm your code works by printing the current URL of the page after the search has been performed. What happens?

To print the URL of the current page, use the following code.

print(driver.current_url)
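For reference, a sketch of the full search interaction might look like the following. The XPath used to locate the search input is an assumption (the exact attributes may differ), so confirm them with your browser’s Inspector. The time and Keys imports come from the boilerplate shown later in this question.

# the placeholder text below is a guess -- confirm it using the Inspector
search_bar = driver.find_element("xpath", "//input[contains(@placeholder, 'address')]")
search_bar.send_keys("32607")
search_bar.send_keys(Keys.RETURN)
time.sleep(5)
print(driver.current_url)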

Well, it is likely that the URL is unchanged. Remember, the "For sale", "For rent", "Skip this question" page may pop up, and this page has the same URL. To confirm this, instead of printing the URL, print the entirety of the HTML provided above. To do so, first use XPath expressions to isolate the outermost div element, then print the entire element, including all of the nested elements.

To print the HTML of an element using selenium, you can use the following code.

element = driver.find_element("xpath", "//some_xpath")
print(element.get_attribute("outerHTML"))

If you don’t know what HTML to expect, the html element is a safe bet.

element = driver.find_element("xpath", "//html")
print(element.get_attribute("outerHTML"))

Of course, in your final answer, please only print the isolated element — printing the entire page would be a lot!

You can use the class 'yui3-lightbox interstitial'.
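For example, a sketch using that class to isolate and print the element:

element = driver.find_element("xpath", "//div[@class='yui3-lightbox interstitial']")
print(element.get_attribute("outerHTML"))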

Remember, in the background, selenium is actually launching a browser — just like a human would. Sometimes, you need to wait for a page to load before you can properly interact with it. It is highly recommended you use the time.sleep function to wait 5 seconds after a call to driver.get or element.send_keys.

One downside to selenium is that it involves more boilerplate code than, for example, requests. Please use the following code to instantiate your selenium driver on Anvil.

import time
import uuid
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys

firefox_options = Options()
firefox_options.add_argument("--window-size=810,1080")
# Headless mode means no GUI
firefox_options.add_argument("--headless")
firefox_options.add_argument("--disable-extensions")
firefox_options.add_argument("--no-sandbox")
firefox_options.add_argument("--disable-dev-shm-usage")

# create driver
driver = webdriver.Firefox(log_path=f"/tmp/{uuid.uuid4()}", options=firefox_options)

# use driver here
# ...

# close the driver
driver.quit()

Please feel free to "reset" your driver (for example, if you’ve lost track of which page it is on, or you aren’t getting the results you expected) by running the following code, followed by the code shown above.

driver.quit()

# instantiate driver again
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

Okay, let’s go forward with the assumption that we will always see the "For sale", "For rent", and "Skip this question" page. We need our code to handle this situation and click the "Skip this question" button so we can get our search results!

Write Python code that uses selenium to find the "Skip this question" button and click it. Confirm your code works by printing the current URL of the page after the button has been clicked. Is the URL what you expected?

Don’t forget, it may be best to put a time.sleep(5) after the click() method call — before printing the current URL.
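As a sketch, building on the text()-based XPath shown in question 1 (the button text comes from the HTML snippet provided there):

skip_button = driver.find_element("xpath", "//button[text()='Skip this question']")
skip_button.click()
time.sleep(5)
print(driver.current_url)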

Uh oh! If you did this correctly, it is likely that the URL is not quite right — something like www.zillow.com/homes/rb/, or maybe a captcha page. Many websites employ a variety of anti-scraping techniques by default. On the bright side, we did notice (when doing this search manually) that the URL should look like www.zillow.com/homes/32607_rb/ — we can just insert our zip code directly into the URL, and that should work without any fuss, plus we save some page loads and clicks. Great! Alternatively, you could also narrow down the search to homes "For Sale" by using www.zillow.com/homes/for_sale/32607_rb/.

If you are paying close attention, you will notice that this is an inconsistency between using a browser manually and using selenium. selenium isn’t saving the same data (cookies and local storage) as your browser, and therefore doesn’t "remember" the zip code you searched for after that intermediate "For sale", "For rent", and "Skip this question" step. Luckily, modifying the URL works better anyway.

Test out (using selenium) that simply inserting the zip code into the URL works as intended. Finding the title element and printing its contents is a quick way to verify this. Bake this functionality into a function called print_title that takes a search term, search_term, and returns the contents of the title element.

element = driver.find_element("xpath", "//title")
print(element.get_attribute("outerHTML"))
# make sure this works
print_title("32607")
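One possible sketch of print_title itself, assuming the driver has already been instantiated and using the URL pattern from above:

def print_title(search_term):
    # a sketch -- assumes `driver` already exists in scope
    driver.get(f"https://www.zillow.com/homes/for_sale/{search_term}_rb/")
    time.sleep(5)  # wait for the page to load
    element = driver.find_element("xpath", "//title")
    contents = element.get_attribute("outerHTML")
    print(contents)
    return contents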
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Okay, great! Take your time to open a browser to www.zillow.com/homes/for_sale/32607_rb/ and use the Inspector to figure out how the web page is structured. For now, let’s not worry about any of the filters. The main useful content is within the cards shown on the page. Price, number of beds, number of baths, square feet, address, etc., are all listed within each of the cards.

What non-li element contains the cards in their entirety? Use selenium and XPath expressions to extract those elements from the web page. Print the value of the id attribute for all of the cards. How many cards were there? (This could vary depending on when the data was scraped — that is ok.)

You can use the id attribute in combination with the starts-with XPath function to find these elements, because each id starts with the same 4-5 letter prefix.

Some examples of how to use starts-with:

//div[starts-with(@id, 'card_')] # all divs with an id attribute that starts with 'card_'
//div[starts-with(text(), 'okay_')] # all divs with content that starts with 'okay_'
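As a sketch, putting starts-with to use with selenium might look like the following. The tag name and id prefix below match the expression used in question 4’s function, but double check them with the Inspector, since the page structure may change:

cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]")
for card in cards:
    print(card.get_attribute("id"))
print(len(cards))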
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

How many cards were there? For me, there were only about 8. Let’s verify things. Open a browser to www.zillow.com/homes/for_sale/32607_rb/ and scroll down to the bottom of the page. How many cards are there?

For me, there were significantly more than 8 — there were 40. Something is going on here, and that something is lazy loading. What this means is that the web page only loads the first 8 cards, then loads the rest of the cards as you scroll down the page. This is a common technique to reduce the initial load time of a web page. This is also the perfect scenario for us to really utilize the power of selenium. If we just use a package like requests, we are unable to scroll down the page and load the rest of the cards.

Check out the function below called load_all_cards that accepts the driver as an argument, and scrolls down the page until all of the cards have been loaded. Examine the function and explain (in a markdown cell) what it is doing. In addition, use the function in combination with your code from the previous question to print the id attribute for all of the cards. How many cards were there this time?

from selenium.common.exceptions import StaleElementReferenceException

def load_all_cards(driver):
    cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]")
    while True:
        try:
            num_cards = len(cards)
            driver.execute_script('arguments[0].scrollIntoView();', cards[num_cards-1])
            time.sleep(2)
            cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]")
            if num_cards == len(cards):
                break
            num_cards = len(cards)
        except StaleElementReferenceException:
            # every once in a while we will get a StaleElementReferenceException
            # because we are trying to access or scroll to an element that has changed.
            # this probably means we can skip it because the data has already loaded.
            continue
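For example, usage might look something like the following (a sketch, reusing the driver setup from question 1):

driver.get("https://www.zillow.com/homes/for_sale/32607_rb/")
time.sleep(5)
load_all_cards(driver)
cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]")
for card in cards:
    print(card.get_attribute("id"))
print(len(cards))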
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.