TDM 20200: Mapping: Project 11 — Spring 2025
Dataset(s)
In this project we will use the following mapping data:
- /anvil/projects/tdm/data/taxi/yellow/
Questions
Load Pandas and Geopandas as follows:
import pandas as pd
import geopandas as gpd
Question 1 (2 pts)
Consider the yellow taxi cab data contained in the files from
/anvil/projects/tdm/data/taxi/yellow/yellow_tripdata_2009-01.csv
through
/anvil/projects/tdm/data/taxi/yellow/yellow_tripdata_2009-12.csv
Write a function called extractdate that accepts a 4-digit year (as a string), a 2-digit month (as a string, with a leading 0 if needed), and a 2-digit day (as a string, with a leading 0 if needed). For the given year and month, the function should read in only 3 columns from the corresponding yellow taxi cab data file: column 1 (which is the 2nd column, "Trip_Pickup_DateTime"), column 5 (which is the 6th column, "Start_Lon"), and column 6 (which is the 7th column, "Start_Lat").
For instance, extractdate("2009", "05", "29") should read in the 3 columns above from the file /anvil/projects/tdm/data/taxi/yellow/yellow_tripdata_2009-05.csv.
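One possible starting point for the column selection is sketched below. It assumes that the usecols argument of pd.read_csv is given 0-based column positions, and it adds a datadir parameter (not part of the assignment, purely for illustration) so that the function can be tried against a small test file.

```python
import os
import pandas as pd

# Default location of the data; the datadir parameter below is a
# hypothetical addition that makes the sketch easy to try elsewhere.
DATADIR = "/anvil/projects/tdm/data/taxi/yellow"

def extractdate(myyear, mymonth, myday, datadir=DATADIR):
    # Build the file name from the year and month,
    # e.g. yellow_tripdata_2009-05.csv
    myfile = os.path.join(datadir, f"yellow_tripdata_{myyear}-{mymonth}.csv")
    # Integer positions in usecols are 0-based, so [1, 5, 6] selects
    # the 2nd, 6th, and 7th columns of the file.
    myDF = pd.read_csv(myfile, usecols=[1, 5, 6])
    # myday is accepted but not used yet at this stage.
    return myDF
```

Reading only the 3 needed columns keeps memory use down, since each monthly file is large.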
- Create the function extractdate described above.
- Be sure to document your work from Question 1, using some comments and insights about your work.
Question 2 (2 pts)
Now revise the function extractdate so that it returns a small data frame containing only the taxi cab rides that started on the given year, month, and day passed to the function.
For instance, extractdate("2009", "05", "29") should return a data frame with 523947 rows and 3 columns, and extractdate("2009", "02", "14") should return a data frame with 539684 rows and 3 columns.
Hint: You might need to check whether the year is correct, e.g.,
pd.to_datetime(myDF['Trip_Pickup_DateTime']).dt.year == int(myyear)
and the month is correct, e.g.,
pd.to_datetime(myDF['Trip_Pickup_DateTime']).dt.month == int(mymonth)
and the day is correct, e.g.,
pd.to_datetime(myDF['Trip_Pickup_DateTime']).dt.day == int(mydate)
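The three checks in the hint can be combined into a single boolean mask. The sketch below is only one way to do this: keep_one_day is a hypothetical helper name, and the toy data frame stands in for the real monthly file. It parses the timestamps once and reuses the result for all three comparisons.

```python
import pandas as pd

def keep_one_day(myDF, myyear, mymonth, myday):
    # Hypothetical helper: parse the pickup timestamps a single time.
    pickup = pd.to_datetime(myDF["Trip_Pickup_DateTime"])
    # Combine the year, month, and day checks with elementwise "and".
    mask = (
        (pickup.dt.year == int(myyear))
        & (pickup.dt.month == int(mymonth))
        & (pickup.dt.day == int(myday))
    )
    return myDF[mask]

# Toy data (made up for illustration), with one ride on a different day.
toyDF = pd.DataFrame({
    "Trip_Pickup_DateTime": ["2009-05-29 08:15:00",
                             "2009-05-30 00:00:01",
                             "2009-05-29 23:59:59"],
    "Start_Lon": [-73.99, -73.98, -73.97],
    "Start_Lat": [40.75, 40.76, 40.77],
})
goodDF = keep_one_day(toyDF, "2009", "05", "29")
print(goodDF.shape)  # (2, 3)
```

Each comparison needs its own parentheses, because & binds more tightly than == in Python.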
- Revise the function extractdate as described above.
- Be sure to document your work from Question 2, using some comments and insights about your work.
Question 3 (2 pts)
Now add a line in your function to convert the "Start_Lon" and "Start_Lat" values into geopandas geometries, like this:
gdf = gpd.GeoDataFrame(goodDF, geometry=gpd.points_from_xy(goodDF.Start_Lon, goodDF.Start_Lat), crs="NAD83")
If you try to plot gdf like this: gdf.plot() it will look bad, because there are some erroneous latitude and longitude values, which we will remove in Question 4 (below).
- Continue to revise the function extractdate as described above.
- Be sure to document your work from Question 3, using some comments and insights about your work.
Question 4 (2 pts)
Now further refine your function, so that it keeps only the rides whose starting latitude and longitude values fall within the following ranges.
We only want to preserve the values for which the longitude is between -74.27 and -73.68 (inclusive, i.e., including these extreme values), and for which the latitude is between 40.49 and 40.92 (inclusive, i.e., including these extreme values).
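A minimal sketch of the inclusive range filter, using a toy data frame, is shown below. keep_in_bounds is a hypothetical helper name. Series.between includes both endpoints by default, which matches the "inclusive" requirement. Since a GeoDataFrame is a subclass of a pandas DataFrame, the same kind of boolean mask should also work on gdf.

```python
import pandas as pd

def keep_in_bounds(myDF):
    # Hypothetical helper: between() keeps both endpoints by default.
    lon_ok = myDF["Start_Lon"].between(-74.27, -73.68)
    lat_ok = myDF["Start_Lat"].between(40.49, 40.92)
    # Keep only the rows where both coordinates are in range.
    return myDF[lon_ok & lat_ok]

# Toy data (made up for illustration): row 1 has zero coordinates,
# row 2 sits exactly on the boundary, row 3 has a longitude out of range.
toyDF = pd.DataFrame({
    "Start_Lon": [-73.99, 0.0, -74.27, -73.50],
    "Start_Lat": [40.75, 0.0, 40.92, 40.70],
})
inboundsDF = keep_in_bounds(toyDF)
print(inboundsDF)
```

In this toy example, rows 0 and 2 are kept, and rows 1 and 3 are dropped.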
- Further revise the function extractdate as described above.
- Be sure to document your work from Question 4, using some comments and insights about your work.
Question 5 (2 pts)
Finally, instead of just using something like goodgdf.plot() to display the locations of the starting points of the cab rides, make sure that the points are drawn with small dots, for instance, using something like goodgdf.plot(markersize = 0.1).
Test your maps for May 29, 2009 using extractdate("2009", "05", "29") and also for February 14, 2009 using extractdate("2009", "02", "14"). The resulting maps will look very similar (showing the shape of New York City from these cab rides) but will have small differences.
- Plot the starting points of the cab rides on May 29, 2009, and on February 14, 2009.
- Be sure to document your work from Question 5, using some comments and insights about your work.
Submitting your Work
Please make sure that you added comments for each question, which explain your thinking about your method of solving each question. Please also make sure that your work is your own work, and that any outside sources (people, internet pages, generative AI, etc.) are cited properly in the project template.
Congratulations! Assuming you’ve completed all the above questions, you are learning to apply your mapping knowledge effectively!
Prior to submitting your work, you need to put your work into the project template, re-run all of the code in your Jupyter notebook, and make sure that the results of running that code are visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.
Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points. We hope this project went well, and we look forward to continuing to learn with you on future projects!
- firstname_lastname_project11.ipynb
It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any source that you used. Please take the time to double check your work. See here for instructions on how to double check this. You will not receive full credit if your