STAT 29000: Project 3 — Spring 2021

Motivation: Web scraping takes practice, and it is important to work through a variety of common tasks in order to know how to handle them the next time you run into them. In this project, we will use a variety of scraping tools to scrape data from trulia.com.

Context: In the previous project, we got our first taste of actually scraping data from a website and using a parser to extract the information we were interested in. In this project, we will introduce some tasks that will require you to use selenium, a tool that lets you interact with a browser.

Scope: python, web scraping, selenium

Learning objectives
  • Review and summarize the differences between XML and HTML/CSS.

  • Use the requests package to scrape a web page.

  • Use the lxml package to filter and parse data from a scraped web page.

  • Use selenium to interact with a browser in order to get a web page to a desired state for scraping.

Make sure to read about and use the template found here, and to review the important information about project submissions here.

Questions

Question 1

Visit trulia.com. Many websites have a similar interface, i.e. a bold and centered search bar for a user to interact with. Using selenium, write Python code that first finds the input element and then types "West Lafayette, IN" followed by an emulated "Enter/Return". Confirm your code works by printing the url after that process completes.

You will want to use time.sleep to pause for a few seconds after the search so that the updated url is returned.
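
Here is a minimal sketch of the whole flow, assuming selenium 3.x (matching the find_elements_by_* style referenced later in this project) with a headless Firefox driver; the xpath used for the search box is a guess and should be checked against the live page:

import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options

# set up a headless Firefox driver
firefox_options = Options()
firefox_options.add_argument("--headless")
driver = webdriver.Firefox(options=firefox_options)

driver.get("https://www.trulia.com")

# this xpath is an assumption -- inspect the page to find the real search input
search_input = driver.find_element_by_xpath("//input[@type='text']")
search_input.send_keys("West Lafayette, IN")
search_input.send_keys(Keys.RETURN)

time.sleep(10)  # pause so the updated url is available
print(driver.current_url)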


Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 2

Use your code from question (1) to test out the following queries:

  • West Lafayette, IN (City, State)

  • 47906 (Zip)

  • 4505 Kahala Ave, Honolulu, HI 96816 (Full address)

If you look closely, you will see that there are patterns in the url. For example, the following link would probably bring up homes in Crawfordsville, IN: trulia.com/IN/Crawfordsville. With that being said, if you only had a zip code, like 47933, it would not be easy to guess the corresponding url, www.trulia.com/IN/Crawfordsville/47933/, which is one reason the search bar is useful.
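
As a quick illustration of that pattern, here is a hedged sketch; the exact format is a guess extrapolated from the example above:

# guess the sales page url from the city/state pattern; spaces likely become underscores
city, state = "West Lafayette", "IN"
url = f"https://www.trulia.com/{state}/{city.replace(' ', '_')}/"
print(url)  # https://www.trulia.com/IN/West_Lafayette/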

If you used xpath expressions to complete question (1), use a different method to find the input element this time. If you used a different method, use xpath expressions this time.
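
For example, a non-xpath lookup might use a CSS selector instead. This sketch reuses the driver, time, and Keys imports from the question (1) sketch; the selector itself is a guess to confirm against the live page:

queries = ["West Lafayette, IN", "47906", "4505 Kahala Ave, Honolulu, HI 96816"]
for query in queries:
    driver.get("https://www.trulia.com")
    # locate the input with a CSS selector instead of an xpath expression
    search_input = driver.find_element_by_css_selector("input[type='text']")
    search_input.send_keys(query)
    search_input.send_keys(Keys.RETURN)
    time.sleep(10)
    print(driver.current_url)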

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 3

Let’s call the page after a city/state or zipcode search a "sales page". For example:

![](./images/trulia.png)

Use requests to scrape the entire page: www.trulia.com/IN/West_Lafayette/47906/. Use lxml.html to parse the page and get all of the img elements that make up the house pictures on the left side of the website.

Make sure you are actually scraping what you think you are scraping! Try printing your html to confirm it has the content you think it should have:

import requests
response = requests.get(...)
print(response.text)

Did the site ask whether you are human? Sometimes, if you add a header to your request, it won't ask. Let's pretend we are Firefox:

import requests

# a fuller User-Agent string better imitates a real Firefox browser
my_headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0'}
response = requests.get(..., headers=my_headers)

Okay, after all of that work you may have discovered that only a few images have actually been scraped. If you cycle through all of the img elements and try to print the value of the src attribute, this will be clear:

import lxml.html
tree = lxml.html.fromstring(response.text)
elements = tree.xpath("//img")
for element in elements:
    print(element.attrib.get("src"))

This is because the webpage is not immediately, completely loaded. This is common website behavior, used to make pages appear to load faster. If you pay close attention when you load www.trulia.com/IN/Crawfordsville/47933/ and quickly scroll down, you will see images slowly finishing rendering. To fix this, we need to use selenium (instead of requests and lxml.html) to behave like a human and scroll down the page prior to scraping it! Try using the following code to slowly scroll down the page before finding the elements:

import time

# driver setup and get the url
# Needed to get the window size set right and scroll in headless mode
myheight = driver.execute_script('return document.body.scrollHeight')
driver.set_window_size(1080, myheight + 100)

def scroll(driver, scroll_point):
    # scroll to the given point, then pause so the lazy-loaded images can render
    driver.execute_script(f'window.scrollTo(0, {scroll_point});')
    time.sleep(5)

scroll(driver, myheight*1/4)
scroll(driver, myheight*2/4)
scroll(driver, myheight*3/4)
scroll(driver, myheight*4/4)
# find_elements_by_*
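
Once the scrolling finishes, the img elements can be pulled straight from the driver. A minimal sketch, assuming the driver from the scrolling code above:

# collect every img element from the fully rendered page
elements = driver.find_elements_by_xpath("//img")
srcs = [element.get_attribute("src") for element in elements if element.get_attribute("src")]
print(len(srcs))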

At the time of writing there should be about 86 links to images of homes.

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 4

Write a function called avg_house_cost that accepts a zip code as an argument, and returns the average cost of the homes on the first page of results. To make this a more meaningful statistic, filter for "3+" beds before computing the average. Test avg_house_cost out on the zip code 47906 and print the average cost.

Use selenium to "click" on the "3+ beds" filter.

If you get an error that tells you the button is not clickable because it is covered by an li element, try clicking on the li element instead.

You will want to wait a solid 10-15 seconds for the sales page to load before trying to select or click on anything.

Your results may end up including prices for "Homes Near <ZIPCODE>". This is okay. Even better if you manage to remove those results. If you do choose to remove them, take a look at the data-testid attribute with the value search-result-list-container. Perhaps selecting only the children of the first such element will get the desired outcome.

You can use the following code to remove the non-numeric text from a string, and then convert to an integer:

import re
int(re.sub("[^0-9]", "", "removenon45454_numbers$"))
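
Putting the hints together, here is one possible shape for the function, building on the selenium setup from the earlier sketches. Every locator below (the xpath for the "3+ beds" filter and the property-price value for data-testid) is an assumption to verify against the live page:

import re
import time
from statistics import mean

def avg_house_cost(zipcode):
    # search for the zip code just like in question (1)
    driver.get("https://www.trulia.com")
    search_input = driver.find_element_by_css_selector("input[type='text']")
    search_input.send_keys(zipcode)
    search_input.send_keys(Keys.RETURN)
    time.sleep(15)  # wait a solid 10-15 seconds for the sales page to load

    # click the "3+ beds" filter; if the button is covered, click the enclosing li instead
    driver.find_element_by_xpath("//li[.//button[contains(., '3+')]]").click()
    time.sleep(15)

    # keep only the first search-result-list-container to skip "Homes Near <ZIPCODE>" results
    container = driver.find_elements_by_xpath("//*[@data-testid='search-result-list-container']")[0]

    # strip non-numeric characters from each price and average them
    prices = []
    for element in container.find_elements_by_xpath(".//div[@data-testid='property-price']"):
        prices.append(int(re.sub("[^0-9]", "", element.text)))
    return mean(prices)

print(avg_house_cost("47906"))
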
Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 5

Get creative. Either add an interesting feature to your function from (4), or use matplotlib to generate some sort of accompanying graphic with your output. Make sure to explain what your addition is and what it does.
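
For instance, a hedged sketch of the matplotlib route, assuming the avg_house_cost function from question (4) is defined and that the zip codes below are arbitrary choices:

import matplotlib.pyplot as plt

# compare average costs across a few zip codes
zipcodes = ["47906", "47933", "96816"]
costs = [avg_house_cost(z) for z in zipcodes]

plt.bar(zipcodes, costs)
plt.xlabel("zip code")
plt.ylabel("average cost of 3+ bed homes (USD)")
plt.title("Average home cost by zip code")
plt.savefig("avg_house_costs.png")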

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.