TDM 20200: Systematically extracting data from the web: Project 6 — Spring 2025

Motivation: Now that we are familiar with web scraping, we also want to ensure that we can systematically extract data from the internet and analyze it.

Context: We will get familiar with techniques for reproducibly working with data from the web, using workflows that can be well-documented, rather than doing everything by hand.

Scope: Data from the web in workflows

Learning Objectives:
  • We learn how to systematically and reproducibly use data from the internet.

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

In this project we will scrape data from the following websites:

Questions

Question 1 (2 pts)

Consider the Green Taxi Trip Records, stored in parquet format, on the New York City taxi website:

Use list comprehension to build a list (of length 11) with the data from January 2024 through November 2024. Each element of the list will be a Pandas dataframe.

The read_parquet function from Pandas should be helpful.

It might also be helpful to know how to make a string with leading zeros; several examples are given here: stackoverflow.com/questions/134934/display-number-with-leading-zeros

Use pd.concat as described here: pandas.pydata.org/docs/reference/api/pandas.concat.html (with the option ignore_index=True) to build a Pandas data frame from the list given above. Your data frame should have 606224 rows and 20 columns.

Deliverables
  • Build a data frame containing the green taxi cab data from January 2024 through November 2024

  • Be sure to document your work from Question 1, using some comments and insights about your work.

Question 2 (2 pts)

Read about how to extract a month from a date in Pandas, for instance, here: pandas.pydata.org/docs/reference/api/pandas.to_datetime.html or here: www.geeksforgeeks.org/get-month-from-date-in-pandas-python/ or wherever you prefer.

Then extract the average total_amount (i.e., the average cost) of a taxi cab ride for each of the months (based on the lpep_pickup_datetime). Display these average values of total_amount (by month) in a plot.

Even though you only are extracting data from 11 months, there seem to be a very small number of taxicab rides from December. There are 50000 to 60000 rides in each of the 11 months, and just 10 rides in December, and that is OK.

Deliverables
  • Plot the average values of total_amount (by month) in a plot.

  • Be sure to document your work from Question 2, using some comments and insights about your work.

Question 3 (2 pts)

Wrap your work from Project 2, Question 4, into a function that takes a month name, month two-digit string for the month (with leading zero, if needed), two-digit string for the day (again, with a leading zero, if needed), and four-digit string for the year as input, and it returns the location of the Peanuts cartoon for that day. For instance, if you test the function as follows:

myfunction('June', '06', '22', '1970')

it will return the string:

'https://assets.amuniversal.com/2181aa70f895013014ff001dd8b71c47'

Notice that the two-digit string version of the month needs to have a leading zero, like we learned earlier from: stackoverflow.com/questions/134934/display-number-with-leading-zeros

You might display each individual image like this:

from IPython.display import Image, display
display(Image(url='https://assets.amuniversal.com/2181aa70f895013014ff001dd8b71c47'))

You might find it easier to build the URL like this:

"https://www.gocomics.com/peanuts/" + myyearstring + "/" + mymonthnumstring + "/" + mydaynumstring

where we have

mymonth = "June"
mymonthnumstring = "06"
mydaynumstring = "22"
myyearstring = "1970"

You might also find it easier to build the string for the xpath like this:

'//img[@alt = "Peanuts Comic Strip for ' + mymonth + ' ' + mydaynumstring + ', ' + myyearstring + ' "]'

where, again, we have

mymonth = "June"
mymonthnumstring = "06"
mydaynumstring = "22"
myyearstring = "1970"
Deliverables
  • Create a function that extracts a Peanuts comic for a specific date.

  • Test the function on a specific date and demonstrate that it works.

  • Be sure to document your work from Question 3, using some comments and insights about your work.

Question 4 (2 pts)

Now we need to sleep in between requests. First load:

import time

and then add the line:

time.sleep(2)

to the end of your function (and be sure to re-define your function, i.e., to re-load your function).

Now test that your function works on a sequence of three days in a row, for instance June 6, 7, 8, or for instance, June 22, 23, 24. Be careful to make sure that your function works with leading zeros where appropriate.

Deliverables
  • Test the function on three specific dates and demonstrate that it works.

  • Be sure to document your work from Question 4, using some comments and insights about your work.

Question 5 (2 pts)

Now run your function on all of the comics in 1 full month of your choice. Print the comics in your Jupyter Lab notebook.

Deliverables
  • Now run your function on all of the comics in 1 full month of your choice. Print each of the comics from that month in your Jupyter Lab notebook.

  • Be sure to document your work from Question 5, using some comments and insights about your work.

Submitting your Work

Please make sure that you added comments for each question, which explain your thinking about your method of solving each question. Please also make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template.

Congratulations! Assuming you’ve completed all the above questions, you are learning to apply your web scraping knowledge effectively!

Prior to submitting your work, you need to put your work into the project template, and re-run all of the code in your Jupyter notebook and make sure that the results of running that code is visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.

Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points. We hope your first project with us went well, and we look forward to continuing to learn with you on future projects!!

Items to submit
  • firstname_lastname_project6.ipynb

It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any source that you used. Please make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template.

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output even though it may not.

Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.