TDM 20200: Mapping: Project 13 — Spring 2025

Motivation: We will continue to compare the usage of Pandas and Polars.

Context: Polars provides faster data frames, but Polars is also newer than Pandas, so it is not (yet) able to be integrated with everything else that we might want to accomplish in Python.

Scope: Comparing performances of technologies.

Learning Objectives:

We will continue to compare and contrast Pandas and Polars.

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

In this project we will use the following mapping data:

/anvil/projects/tdm/data/taxi/yellow/

Questions

For this project, it should be enough to use 3 or 4 cores.

Question 1 (2 pts)

Recreate Project 11, Question 1, using Polars instead of Pandas.

For instance, if you want to read in the May 2009 data, you can use:

import polars as pl
myDF = pl.read_csv("/anvil/projects/tdm/data/taxi/yellow/yellow_tripdata_2009-05.csv", columns=['Trip_Pickup_DateTime', 'Start_Lon', 'Start_Lat'])

Deliverables

Create the function extractdate described in Project 11, Question 1, but this time use Polars instead of Pandas.
Be sure to document your work from Question 1, using some comments and insights about your work.

Question 2 (2 pts)

Now recreate Project 11, Question 2, again using Polars instead of Pandas.

For example, this code from Pandas from Project 11:

newDF = myDF[pd.to_datetime(myDF['Trip_Pickup_DateTime']).dt.year == int(myyear) & pd.to_datetime(myDF['Trip_Pickup_DateTime']).dt.month == int(mymonth) & pd.to_datetime(myDF['Trip_Pickup_DateTime']).dt.day == int(mydate)]

should look (instead) something like this, when you use Polars:

newDF = myDF.filter(myDF['Trip_Pickup_DateTime'].str.to_datetime("%Y-%m-%d %H:%M:%S").dt.year().eq(myyear)
           & myDF['Trip_Pickup_DateTime'].str.to_datetime("%Y-%m-%d %H:%M:%S").dt.month().eq(mymonth)
           & myDF['Trip_Pickup_DateTime'].str.to_datetime("%Y-%m-%d %H:%M:%S").dt.day().eq(mydate))

For instance, if you try for May 29, 2009, like this:

newDF = myDF.filter(myDF['Trip_Pickup_DateTime'].str.to_datetime("%Y-%m-%d %H:%M:%S").dt.year().eq(2009)
           & myDF['Trip_Pickup_DateTime'].str.to_datetime("%Y-%m-%d %H:%M:%S").dt.month().eq(5)
           & myDF['Trip_Pickup_DateTime'].str.to_datetime("%Y-%m-%d %H:%M:%S").dt.day().eq(29))

and you compare myDF.shape versus newDF.shape

then (hopefully) you will see that newDF is much smaller than myDF, corresponding to only May 29, 2009 (for newDF) as compared to all of the May 2009 data (in myDF).

Now (please) go revise the function extractdate described in Project 11, Question 2, but this time use Polars instead of Pandas.

Deliverables

Revise the function extractdate described in Project 11, Question 2, but this time use Polars instead of Pandas.
Be sure to document your work from Question 2, using some comments and insights about your work.

Question 3 (2 pts)

Remember that Project 10, Question 4, had some challenges using Pandas, especially when reading the data from election year 2020:

/anvil/projects/tdm/data/election/itcont2020.txt

Try this method instead:

import polars as pl
myDF = pl.read_csv("/anvil/projects/tdm/data/election/itcont2020.txt", has_header=False, separator='|', columns=[9,14], ignore_errors=True)
myDF = myDF.rename({"column_10": "STATE", "column_15": "TRANSACTION_AMT"})
myDF.group_by('STATE').agg(pl.sum('TRANSACTION_AMT')).sort('TRANSACTION_AMT').tail(10)

Now that you have this working in a much easier manner (using Polars instead of Pandas), please recreate Project 10, Question 4 (over all of the election years, 1980 to 2024), using Polars instead of Pandas:

Determine the total amount of money given (in dollars) during election campaigns from 1980 to 2024, altogether.

Deliverables

Determine the total amount of money given (in dollars) during election campaigns from 1980 to 2024, altogether.
Be sure to document your work from Question 3, using some comments and insights about your work.

Question 4 (2 pts)

Pick a question that you solved earlier in the semester, and solve this problem again, but this time using Polars instead of Pandas.

Deliverables

Pick a question that you solved earlier in the semester, and solve this problem again, but this time using Polars instead of Pandas.
Be sure to document your work from Question 4, using some comments and insights about your work.

Question 5 (2 pts)

Again pick a question that you solved earlier in the semester, and solve this problem again, but this time using Polars instead of Pandas.

Deliverables

Again pick a question that you solved earlier in the semester, and solve this problem again, but this time using Polars instead of Pandas.
Be sure to document your work from Question 5, using some comments and insights about your work.

Submitting your Work

Please make sure that you added comments for each question, which explain your thinking about your method of solving each question. Please also make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template.

Congratulations! Assuming you’ve completed all the above questions, you are learning to apply your web scraping knowledge effectively!

Prior to submitting your work, you need to put your work into the project template, and re-run all of the code in your Jupyter notebook and make sure that the results of running that code is visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.

Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points. We hope your first project with us went well, and we look forward to continuing to learn with you on future projects!!

Items to submit

firstname_lastname_project13.ipynb

It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any source that you used. Please make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template.

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output even though it may not.

Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.