TDM 20200: Mapping: Project 12 — Spring 2025

Motivation: We will now compare the usage of Pandas and Polars.

Context: Polars provides faster data frames, but Polars is also newer than Pandas, so it is not (yet) able to be integrated with everything else that we might want to accomplish in Python.

Scope: Comparing performances of technologies.

Learning Objectives:

We will compare and contrast Pandas and Polars.

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

In this project we will use the following mapping data:

/anvil/projects/tdm/data/flights/subset/2002.csv

Questions

For this project, it should be enough to use 3 or 4 cores.

Question 1 (2 pts)

The first time that we load any data set into memory, it will take a little longer.

So please run the lines in Polars below, two times, because the first time will be extra slow, as we load the data off the disk and into memory.

import polars as pl
import time
start_time = time.time()
myDF = pl.read_csv("/anvil/projects/tdm/data/flights/subset/2002.csv", null_values="NA")
print("My program took", time.time() - start_time, "to run")

Now compare with:

import pandas as pd
import time
start_time = time.time()
myDF = pd.read_csv("/anvil/projects/tdm/data/flights/subset/2002.csv")
print("My program took", time.time() - start_time, "to run")

You can read about the timing mechanism here: stackoverflow.com/questions/12444004/how-long-does-my-python-application-take-to-run

Question: How much faster is Polars than Pandas, in loading this data set? (Polars should be approximately 5 times faster than Pandas.)

Deliverables

How much faster is Polars than Pandas, in loading this data set? (Polars should be approximately 5 times faster than Pandas.)
Be sure to document your work from Question 1, using some comments and insights about your work.

Question 2 (2 pts)

Now we calculate the time needed to get the average departure delays for flights with Origin at Indianapolis.

import polars as pl
# read in the 2002 flight data to a Polars data frame again
myDF = pl.read_csv("/anvil/projects/tdm/data/flights/subset/2002.csv", null_values="NA")

import time
start_time = time.time()
myresults = myDF.filter(pl.col("Origin") == "IND")['DepDelay'].mean()
print("My program took", time.time() - start_time, "to run")
print("The average delay for Indianapolis flights was " + str(myresults))

Now compare with:

import pandas as pd
# read in the 2002 flight data to a Pandas data frame again
myDF = pd.read_csv("/anvil/projects/tdm/data/flights/subset/2002.csv")

import time
start_time = time.time()
myresults = myDF[myDF['Origin'] == "IND"]['DepDelay'].mean()
print("My program took", time.time() - start_time, "to run")
print("The average delay for Indianapolis flights was " + str(myresults))

Question: How much faster is Polars than Pandas, in finding these average flight delays? (Polars should be approximately 5 to 10 times faster than Pandas.)

Deliverables

How much faster is Polars than Pandas, in loading this data set? (Polars should be approximately 5 times faster than Pandas.)
Be sure to document your work from Question 2, using some comments and insights about your work.

Question 3 (2 pts)

Now we calculate the time needed to get the average departure delays for all flights with Origin at any California airports. We need to merge our flight data with the airport data.

import polars as pl
# read in the 2002 flight data to a Polars data frame again, but only 3 columns this time:
myDF = pl.read_csv("/anvil/projects/tdm/data/flights/subset/2002.csv", null_values="NA").select(["Origin", "Dest", "DepDelay"])

# also read in the airports data to a Polars data frame as follows:
myairports = pl.read_csv("/anvil/projects/tdm/data/flights/subset/airports.csv")

import time
start_time = time.time()
newDF = myDF.join(myairports, left_on="Origin", right_on="iata")
myresults = newDF.filter(pl.col("state") == "CA")['DepDelay'].mean()
print("My program took", time.time() - start_time, "to run")
print("The average delay for all flights with Origin at any California airports was " + str(myresults))

Now compare with:

import pandas as pd
# read in the 2002 flight data to a Pandas data frame again, but only 3 columns this time:
myDF = pd.read_csv("/anvil/projects/tdm/data/flights/subset/2002.csv", usecols=["Origin", "Dest", "DepDelay"])

# also read in the airports data to a Pandas data frame as follows:
myairports = pd.read_csv("/anvil/projects/tdm/data/flights/subset/airports.csv")

import time
start_time = time.time()
newDF = myDF.merge(myairports, left_on='Origin', right_on='iata')
myresults = newDF[newDF['state'] == "CA"]['DepDelay'].mean()
print("My program took", time.time() - start_time, "to run")
print("The average delay for Indianapolis flights was " + str(myresults))

Question: How much faster is Polars than Pandas, in finding these average flight delays? (Polars should be approximately 2 to 3 times faster than Pandas.)

Deliverables

How much faster is Polars than Pandas, in merging these two data sets and finding the overall delays for flights from California? (It should be approximately 2 to 3 times faster.)
Be sure to document your work from Question 3, using some comments and insights about your work.

Question 4 (2 pts)

Compare two operations in Polars versus Pandas and test the timing. It can be on the same data set, or on different data sets.

Deliverables

Compare the timing or two operations in Polars versus Pandas and test the timing.
Be sure to document your work from Question 4, using some comments and insights about your work.

Question 5 (2 pts)

Again compare two operations in Polars versus Pandas and test the timing. It can be on the same data set, or on different data sets.

Deliverables

Compare the timing or two operations in Polars versus Pandas and test the timing.
Be sure to document your work from Question 5, using some comments and insights about your work.

Submitting your Work

Please make sure that you added comments for each question, which explain your thinking about your method of solving each question. Please also make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template.

Congratulations! Assuming you’ve completed all the above questions, you are learning to apply your web scraping knowledge effectively!

Prior to submitting your work, you need to put your work into the project template, and re-run all of the code in your Jupyter notebook and make sure that the results of running that code is visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.

Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points. We hope your first project with us went well, and we look forward to continuing to learn with you on future projects!!

Items to submit

firstname_lastname_project12.ipynb

It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any source that you used. Please make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template.

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output even though it may not.

Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.