TDM 20200: Project 5 — 2023

Motivation: Learning to scrape data can take time. We want to make sure you get comfortable with it! For this reason, we will continue to scrape data from Zillow to answer various questions. This will allow you to continue to get familiar with the tools, without having to re-learn everything about the website of interest.

Context: This is the fourth project on web scraping, where we will continue to focus on honing our skills using selenium.

Scope: Python, web scraping, selenium, matplotlib/plotly

Learning Objectives
  • Review and summarize the differences between XML and HTML/CSV.

  • Use the requests package to scrape a web page.

  • Use the lxml package to filter and parse data from a scraped web page.

  • Use the beautifulsoup4 package to filter and parse data from a scraped web page.

  • Use selenium to interact with a browser in order to get a web page to a desired state for scraping.

Make sure to read about and use the template found here, and to review the important information about project submissions here.

Questions

Interested in being a TA? Please apply: purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE

Question 1

In the previous couple of projects, you’ve been utilizing selenium to scrape zpid information from Zillow. This begs the question — how can we use those values? For example, the following is a sample of data you may have received from one of your scraping functions.

example output
['zpid_58879763',
 'zpid_42717098',
 'zpid_54478236']

How can you use these? Well, what about just pasting the value into the URL, maybe: zillow.com/homes/for_sale/zpid_58879763/. What happens? Well, it doesn’t really work, unfortunately. However, if you browse Zillow a bit, you may realize that when you click on individual properties, the number and "zpid" part are reversed. Let’s test this out: zillow.com/homes/for_sale/58879763_zpid/. It works! This pulls up a webpage containing all of the details of the property! That is pretty cool.

Take your favorite version of the get_zpids function from the previous project and tweak it slightly so that, instead of returning a list of zpid values like:

example output
['zpid_58879763',
 'zpid_42717098',
 'zpid_54478236']

it returns a list of URLs like:

example output
['https://zillow.com/homes/for_sale/58879763_zpid/',
 'https://zillow.com/homes/for_sale/42717098_zpid/',
 'https://zillow.com/homes/for_sale/54478236_zpid/']
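
A minimal sketch of the tweak, assuming your current function collects strings like "zpid_58879763" into a list called zpids before returning (adjust the names to match your own code):

# helper that turns a scraped id like 'zpid_58879763' into a detail-page URL
def zpid_to_url(zpid):
    number = zpid.split("_")[-1]                       # '58879763'
    return f"https://zillow.com/homes/for_sale/{number}_zpid/"

# then, inside get_zpids, the final return might become:
# return [zpid_to_url(z) for z in zpids]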

Great! Now, you can search Zillow, and get a (hopefully) robust list of property links that match your search term!

# example usages
get_zpids("West Lafayette, IN")
get_zpids("47933")
get_zpids("apple")

Test it out on a search term or two (in your Jupyter Notebook) and make sure it behaves as you expect. Then open one of the resulting links to confirm it takes you to the property's details page.
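
For instance, a quick sanity check might look something like this; it assumes driver is the selenium webdriver you already configured in the previous projects:

# quick sanity check: print a few URLs, then load one and peek at the page title
urls = get_zpids("West Lafayette, IN")
print(urls[:3])

driver.get(urls[0])      # `driver` is the selenium webdriver you already set up
print(driver.title)      # the title should mention the property's address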

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

So far this semester, we have largely taken one of the fun parts of scraping away from you. Scraping is fun because different people have different interests, and it lets you assemble datasets around those interests, which you can then analyze. We will give you more freedom to do this in the next project, and a little more starting now.

Scrape any data you want from at least 50 Zillow detail pages that you find interesting, and put together a fun analysis. This does not need to be anything extensive (unless you want it to be), but it should be worthy of the 1 credit hour of work for the seminar.

Start with this question. Pose a hypothesis about the data on the Zillow detail pages that interests you. For example, you could hypothesize that home values have risen at a higher rate in one zip code than in another. Alternatively, you could pose a challenge. For example, you could set out to build a model that predicts the price of a home in X years based on the current price and other features.

Basically, just come up with something that you want to try and answer or do. Then, in a markdown cell, write it out in enough detail for someone to easily understand what you are trying to do.

The point of this project is not to have good results or accurate conclusions. The point is to think, explore the hard-earned data you scraped, and maybe take the opportunity to learn something new. We have all sorts of backgrounds in The Data Mine — you may feel comfortable building some sort of model, or maybe not at all! That’s okay! You can still do something cool with the data you scraped, or put together a completely anecdotal analysis with some neat visualizations.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Scrape the required data from the Zillow detail pages. This is the data you will use to try to answer your question or to build your model. From within your Jupyter Notebook, scrape the data and display a sample of it.
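
A rough sketch of one way this could look is below. It assumes driver is the selenium webdriver you already configured and urls is the list returned by your get_zpids function; the CSS selector is an assumption and will almost certainly need adjusting after you inspect a detail page in your browser's dev tools.

import time
import pandas as pd
from selenium.webdriver.common.by import By

# loop over detail pages and pull out whatever fields interest you
records = []
for url in urls:
    driver.get(url)
    time.sleep(2)                                   # give the page a moment to render
    try:
        price = driver.find_element(By.CSS_SELECTOR, "span[data-testid='price']").text
    except Exception:
        price = None                                # not every page exposes every field
    records.append({"url": url, "price": price})

df = pd.DataFrame(records)
df.head()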

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

For this question, if you chose to do something like build a model, build the model. It doesn’t need to be good or fit the data well; just do your best to create something you can feed your scraped data to and get a potentially cool result.

If you went another route, like trying to demonstrate a hypothesis, then do that. Use any tool you want for this, but make sure to include the code you used, and demonstrate how you are using the data you scraped to try to answer your question.

For both, please include a markdown cell that briefly summarizes what you did or tried to do.
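
If you do go the modeling route, a minimal sketch might look like the following. It assumes your scraped data lives in a pandas DataFrame df with numeric "sqft" and "price" columns; swap in whatever features you actually collected.

# a minimal modeling sketch -- df, 'sqft', and 'price' are assumptions; use your own data
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

model_df = df.dropna(subset=["sqft", "price"])
X = model_df[["sqft"]]
y = model_df["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print(f"R^2 on held-out data: {model.score(X_test, y_test):.3f}")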

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Finally, create at least 1 graphic to support your work. If you built a model, you could, for example, create a graphic that shows the historical price of a property and then uses your model to project the price X years into the future.

If you tried to demonstrate a hypothesis, you could, for example, create a graphic that compares the price of a property in one zip code to another and shows the difference in price over time.

For both, please include a markdown cell that briefly explains what conclusions, if any, you draw, and how you would try to improve your work if you had more time.
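
As a starting point, a simple matplotlib sketch is below; it assumes a DataFrame df with "price" and "zipcode" columns from your scrape, so adapt it to whatever you actually collected.

# a plotting sketch -- the 'price' and 'zipcode' columns are assumptions from Question 3
import matplotlib.pyplot as plt

df.boxplot(column="price", by="zipcode")
plt.title("Price distribution by zip code")
plt.suptitle("")                  # drop the automatic pandas super-title
plt.xlabel("Zip code")
plt.ylabel("Price ($)")
plt.show()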

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended that you download your submission after submitting it, to make sure that what you think you submitted is what was actually submitted.

In addition, please review our submission guidelines before submitting your project.