TDM 10200: Python Project 10 — Spring 2025

Motivation: We can create functions and apply the functions to many data sets in an efficient way. We can extract information in a straightforward way.

Context: We learn how to apply functions using map and using list comprehensions.

Scope: Applying functions to data.

Learning Objectives:
  • Learn how to apply functions to large data sets and extract information.

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

This project will use the following datasets:

  • /anvil/projects/tdm/data/flights/subset/* (flights data)

  • /anvil/projects/tdm/data/election/itcont/* (election data)

  • /anvil/projects/tdm/data/taxi/yellow/* (yellow taxi cab data)

In this project, we walk students through these powerful techniques.

It should be enough to use 2 cores in your Jupyter Lab session for this project.

Some of the questions in this project take a few minutes to run.

Please remember to make comments (in full English sentences) about your understanding of each of the questions and solutions. This is especially important in this project, where we walk you through the solutions, but we still want you to write some explanations of your own!

Question 1 (2 pts)

We can calculate the number of flights starting from Indianapolis airport in 1990 as follows:

import pandas as pd
myDF = pd.read_csv("/anvil/projects/tdm/data/flights/subset/1990.csv", usecols=[16])
myvalue = myDF['Origin'].value_counts()["IND"]
print(myvalue)
del(myDF)

(We use the del at the end, so that we do not keep this data frame in memory, during the remainder of the project.)

Now we can replicate this work, using a function, as follows:

def myindyflights (myyear: int) -> int:
    """
    The myindyflights function takes a year as the input, and returns the number of flights departing from Indianapolis airport during that year.

    Args:
    myyear (str): This is a year as the input

    Returns:
    myvalue (int): This is the number of flights departing from Indianapolis airport during that year.
    """
    myDF = pd.read_csv("/anvil/projects/tdm/data/flights/subset/" + str(myyear) + ".csv", usecols=[16])
    myvalue = myDF['Origin'].value_counts()["IND"]
    return myvalue

and we can test that we get the same results:

import pandas as pd
myindyflights(1990)

Finally, we can use the map function to run this function on each year from 1987 to 2008.

import pandas as pd
myresults = list(map(myindyflights, range(1987,2009)))

which yields the number of flights starting from Indianapolis airport in each year from 1987 to 2008.

The total number of flights starting from Indianapolis altogether, during 1987 to 2008, is:

sum(myresults)

and the number of flights per year is:

import matplotlib.pyplot as plt
plt.barh(range(1987,2009), myresults)

The data sets cover October 1987 through April 2008. So the data sets for 1987 and 2008 are smaller than you might expect, and that is OK.

Deliverables
  • Plot the total number of flights starting from the Indianapolis airport during 1987 to 2008.

Question 2 (2 pts)

We replicate the work from Question 1, but this time, we keep track of all of the flights originating at every airport in the data set.

We make a function, very much like Question 1, but this time, we keep track of the full table of the counts of Origin airports, for all airports (not just for Indianapolis):

def myflights (myyear: int) -> int:
    """
    The myflights function takes a year as the input, and returns the number of flights departing from each airport during that year.

    Args:
    myyear (str): This is a year as the input

    Returns:
    myvalue (int): This is the number of flights departing from each airport during that year.
    """
    myDF = pd.read_csv("/anvil/projects/tdm/data/flights/subset/" + str(myyear) + ".csv", usecols=[16])
    myvalue = myDF['Origin'].value_counts()
    return myvalue

and we can test that this function works for the 1990 flights:

import pandas as pd
myflights(1990)

Finally, we can use the map function to run this function on each year from 1987 to 2008.

import pandas as pd
myresults = list(map(myflights, range(1987,2009)))

which yields the number of flights starting from each airport in each year, from 1987 to 2008.

Now we can add up the number of flights across all of the years, as follows:

pd.concat(myresults, axis=1).sum(axis=1)

and the number of flights starting at each of the top 10 airports during the years 1987 to 2008 is:

mycounts = pd.concat(myresults, axis=1).sum(axis=1).sort_values().tail(10)

import matplotlib.pyplot as plt
plt.barh(mycounts.index, mycounts)
Deliverables
  • Plot the total number of flights starting from each of the top 10 airports during 1987 to 2008.

Question 3 (2 pts)

Now we follow the methodology of Question 1, but this time we obtain the total amount of the donations from Indiana during federal election campaigns.

We can extract the total amount of the donations from Indiana during an election year as follows:

def myindydonations (myyear: int) -> int:
    """
    The myindydonations function takes a year as the input, and returns the amount of money donated from Indiana during that year.

    Args:
    myyear (str): This is a year as the input

    Returns:
    myvalue (int): This is the amount of money donated from Indiana during that year.
    """
    myDF = pd.read_csv("/anvil/projects/tdm/data/election/itcont" + str(myyear) + ".txt", header=None, sep='|', usecols=[9,14], encoding='Windows-1252')
    myDF.columns = ["STATE", "TRANSACTION_AMT"]
    myvalue = myDF.groupby('STATE')['TRANSACTION_AMT'].sum()["IN"]
    return myvalue

and we can test this function by discovering how much money was donated from Indiana during the 1990 election cycle:

import pandas as pd
myindydonations(1990)

Finally, we can use the map function to run this function on each election year (in other words, the even numbered years) from 1980 to 2018.

import pandas as pd
myresults = list(map(myindydonations, range(1980,2019,2)))

which yields the total amount of money donated from Indiana during each election cycle from 1980 to 2018.

The amount of money donated from Indiana per election cycle is:

import matplotlib.pyplot as plt
plt.barh(range(1980,2019,2), myresults)
Deliverables
  • Plot amount of money donated from Indiana per election cycle from 1980 to 2018.

Question 4 (2 pts)

Now we find the top 10 states according to the total amount of the donations from each state during the elections from 1980 to 2018.

We can extract the total amount of all the donations from all of the states during an election year as follows:

def mydonations (myyear: int) -> int:
    """
    The mydonations function takes a year as the input, and returns the amount of money donated from each state during that year.

    Args:
    myyear (str): This is a year as the input

    Returns:
    myvalue (int): This is the amount of money donated from each state during that year.
    """
    myDF = pd.read_csv("/anvil/projects/tdm/data/election/itcont" + str(myyear) + ".txt", header=None, sep='|', usecols=[9,14], encoding='Windows-1252')
    myDF.columns = ["STATE", "TRANSACTION_AMT"]
    myvalue = myDF.groupby('STATE')['TRANSACTION_AMT'].sum()
    return myvalue

and we can test this function by discovering how much money was donated from each state during the 1990 election cycle:

import pandas as pd
mydonations(1990)

Finally, we can use the map function to run this function on each election year (in other words, the even numbered years) from 1980 to 2018.

import pandas as pd
myresults = list(map(mydonations, range(1980,2019,2)))

which yields the total amount of money donated from each state during each election cycle from 1980 to 2018.

Now we can add up the amount of donations in each state, across all of the years, as follows:

pd.concat(myresults, axis=1).sum(axis=1)

and the total amount of donations from each of the top 10 states across all election years 1980 to 2018 is:

mycounts = pd.concat(myresults, axis=1).sum(axis=1).sort_values().tail(10)

import matplotlib.pyplot as plt
plt.barh(mycounts.index, mycounts)
Deliverables
  • Plot the amount of money donated from each of the top 10 states altogether during 1980 to 2018.

Question 5 (2 pts)

In this last question, we find the total amount of money spent on taxi cab rides in New York City on each day of 2018.

We first extract the total amount of the taxi cab rides per day of a given month as follows:

def myfares (mymonth: str) -> float:
    """
    The myfares function takes a 2-character month as the input (in quotation marks, with a leading 0 if needed), and returns the amount of money spent on each day during that month

    Args:
    mymonth (str): This is a 2-character month (as a string) as the input

    Returns:
    mytable (float): This is the amount of money (as a floating point number) spent on each day during that month
    """
    myDF = pd.read_csv("/anvil/projects/tdm/data/taxi/yellow/yellow_tripdata_2018-" + mymonth + ".csv", usecols=[1,16])
    myDF['mydate'] = pd.to_datetime(myDF['tpep_pickup_datetime']).dt.strftime("%Y-%m-%d")
    mytable = myDF.groupby('mydate')['total_amount'].sum()
    return mytable

and we can test this function by discovering how much money was spent on each day in January:

import pandas as pd
myfares("01")

Finally, we can use the map function to run this function on each month from "01" to "12".

import pandas as pd
mymonths = [str(i).zfill(2) for i in range(1,13)]
myresults = list(map(myfares, mymonths))

which yields the total amount of money spent on taxi cab rides each day.

Now we can add up the amounts spent per day (sometimes there is overlap from month to month), as follows:

mycounts = pd.concat(myresults, axis=1).sum(axis=1)
betterdates = mycounts[pd.to_datetime(mycounts.index).strftime("%Y") == "2018"]

and the total amount of money spent on taxi cab rides during each day in 2018 is can be plotted as:

import matplotlib.pyplot as plt
plt.plot(betterdates.index, betterdates)
Deliverables
  • Plot the total amount of money spent on taxi cab rides during each day in 2018.

Submitting your Work

Please make sure that you added comments for each question, which explain your thinking about your method of solving each question. Please also make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template.

If you have any questions or issues regarding this project, please feel free to ask in seminar, over Piazza, or during office hours.

Prior to submitting your work, you need to put your work into the project template, and re-run all of the code in your Jupyter notebook and make sure that the results of running that code is visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.

Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points.

Items to submit
  • firstname_lastname_project10.ipynb

It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any source that you used. Please make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template.

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output even though it may not.

Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.