TDM 20200: Mapping: Project 13 — Spring 2025
Motivation: We will continue to compare the usage of Pandas and Polars.
Context: Polars provides faster data frames, but Polars is also newer than Pandas, so it cannot (yet) be integrated with everything else that we might want to accomplish in Python.
Scope: Comparing performances of technologies.
Dataset(s)
In this project we will use the following mapping data:
- /anvil/projects/tdm/data/flights/subset/2002.csv
Questions
For this project, it should be enough to use 3 or 4 cores.
Question 1 (2 pts)
Recreate Project 11, Question 1, using Polars instead of Pandas.
For instance, if you want to read in the May 2009 data, you can use:
import polars as pl
myDF = pl.read_csv("/anvil/projects/tdm/data/taxi/yellow/yellow_tripdata_2009-05.csv", columns=['Trip_Pickup_DateTime', 'Start_Lon', 'Start_Lat'])
- Create the function extractdate described in Project 11, Question 1, but this time use Polars instead of Pandas.
- Be sure to document your work from Question 1, using some comments and insights about your work.
Question 2 (2 pts)
Now recreate Project 11, Question 2, again using Polars instead of Pandas.
These 3 lines of Pandas:
newDF = myDF[(pd.to_datetime(myDF['Trip_Pickup_DateTime']).dt.year == int(myyear))
& (pd.to_datetime(myDF['Trip_Pickup_DateTime']).dt.month == int(mymonth))
& (pd.to_datetime(myDF['Trip_Pickup_DateTime']).dt.day == int(mydate))]
should look (instead) like this:
newDF = myDF.filter(myDF['Trip_Pickup_DateTime'].str.to_datetime("%Y-%m-%d %H:%M:%S").dt.year().eq(myyear)
& myDF['Trip_Pickup_DateTime'].str.to_datetime("%Y-%m-%d %H:%M:%S").dt.month().eq(mymonth)
& myDF['Trip_Pickup_DateTime'].str.to_datetime("%Y-%m-%d %H:%M:%S").dt.day().eq(mydate))
For instance, if you try for May 29, 2009, like this:
newDF = myDF.filter(myDF['Trip_Pickup_DateTime'].str.to_datetime("%Y-%m-%d %H:%M:%S").dt.year().eq(2009)
& myDF['Trip_Pickup_DateTime'].str.to_datetime("%Y-%m-%d %H:%M:%S").dt.month().eq(5)
& myDF['Trip_Pickup_DateTime'].str.to_datetime("%Y-%m-%d %H:%M:%S").dt.day().eq(29))
and you compare myDF.shape versus newDF.shape, then (hopefully) you will see that newDF is much smaller than myDF, corresponding to only May 29, 2009 (for newDF) as compared to all of the May 2009 data (in myDF).
- Revise the function extractdate described in Project 11, Question 2, but this time use Polars instead of Pandas.
- Be sure to document your work from Question 2, using some comments and insights about your work.
Question 3 (2 pts)
Remember that Project 10, Question 4, had some challenges using Pandas, especially when reading the data from election year 2020:
/anvil/projects/tdm/data/election/itcont2020.txt
Try this method instead:
import polars as pl
myDF = pl.read_csv("/anvil/projects/tdm/data/election/itcont2020.txt", has_header=False, separator='|', columns=[9,14], ignore_errors=True)
myDF = myDF.rename({"column_10": "STATE", "column_15": "TRANSACTION_AMT"})
myDF.group_by('STATE').agg(pl.sum('TRANSACTION_AMT')).sort('TRANSACTION_AMT').tail(10)
Now that you have this working in a much easier manner (using Polars instead of Pandas), please recreate Project 10, Question 4, using Polars instead of Pandas:
Determine the total amount of money given (in dollars) during election campaigns from 1980 to 2024, altogether.
- Determine the total amount of money given (in dollars) during election campaigns from 1980 to 2024, altogether.
- Be sure to document your work from Question 3, using some comments and insights about your work.
Question 4 (2 pts)
Pick a question that you solved earlier in the semester, and solve this problem again, but this time using Polars instead of Pandas.
- Pick a question that you solved earlier in the semester, and solve this problem again, but this time using Polars instead of Pandas.
- Be sure to document your work from Question 4, using some comments and insights about your work.
Question 5 (2 pts)
Again pick a question that you solved earlier in the semester, and solve this problem again, but this time using Polars instead of Pandas.
- Again pick a question that you solved earlier in the semester, and solve this problem again, but this time using Polars instead of Pandas.
- Be sure to document your work from Question 5, using some comments and insights about your work.
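Since the scope of this project is comparing performance, it may help to time your original solution against the new one. The following is a rough sketch using only the standard library; time_call, solution_a, and solution_b are hypothetical names, and the two placeholder functions should be replaced by your own Pandas and Polars solutions.

```python
import time

def time_call(func, repeats=3):
    """Return the best wall-clock time (in seconds) over `repeats` runs of func()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best

# Hypothetical placeholders: swap in your Pandas and Polars solutions.
def solution_a():
    return sum(range(100_000))

def solution_b():
    return sum(x for x in range(100_000))

print(f"solution_a: {time_call(solution_a):.4f} s")
print(f"solution_b: {time_call(solution_b):.4f} s")
```

Taking the best of several runs reduces the effect of one-off delays such as a cold file cache, although timings on a shared cluster will still vary from run to run.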
Submitting your Work
Please make sure that you added comments for each question, which explain your thinking about your method of solving each question. Please also make sure that your work is your own work, and that any outside sources (people, internet pages, generative AI, etc.) are cited properly in the project template.
Congratulations! Assuming you’ve completed all the above questions, you are learning to apply your Polars knowledge effectively!
Prior to submitting your work, you need to put your work into the project template, re-run all of the code in your Jupyter notebook, and make sure that the results of running that code are visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.
Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points. We hope your first project with us went well, and we look forward to continuing to learn with you on future projects!!
- firstname_lastname_project13.ipynb
It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any source that you used. Please make sure that your work is your own work, and that any outside sources (people, internet pages, generative AI, etc.) are cited properly in the project template. Please take the time to double check your work. See here for instructions on how to double check this. You will not receive full credit if your