TDM 10200: Project 7 — 2023

Motivation: Pandas allows us to work with data frames. The actions that we perform on data frames will sometimes remind us of similar actions that we have performed on data frames during the previous semester with R. For instance, we often want to extract information about one or more variables, sometimes grouping the data according to one variable and summarizing another variable within each of those groups.

Context: Unifying our understanding of Pandas and the ability to develop functions will allow us to systematically analyze data.

Scope: Pandas and functions

Make sure to read about and use the template found here, and to read the important information about project submissions here.

Dataset(s)

The questions below use the following dataset(s):

/anvil/projects/tdm/data/flights/subset/

Questions

ONE

Make sure to have 2 cores when you start your Jupyter Lab session.

For convenience, we will help you quickly get all of the data for all of the flights that depart or arrive at Indianapolis airport, as follows. Make a new cell in JupyterLab that has exactly this content (please copy and paste for accuracy):

%%bash
head -n1 /anvil/projects/tdm/data/flights/subset/1987.csv >~/INDflights.csv
grep -h ",IND," /anvil/projects/tdm/data/flights/subset/*.csv >>~/INDflights.csv

We want to use Pandas to read in the data frame, and it will have a lot of columns. So we set Pandas to display an unlimited number of columns.

import pandas as pd
pd.set_option('display.max_columns', None)

Afterwards, in a separate cell in JupyterLab, you can read in your data to a Pandas data frame like this:

myDF = pd.read_csv('~/INDflights.csv')

Do not worry if you get a DtypeWarning; it will not affect our work on this project.
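For the curious: the DtypeWarning usually appears because Pandas reads a large file in pieces and may guess different types for the same column in different pieces. If you would rather not see it, one option is to pass low_memory=False to pd.read_csv. The sketch below uses a tiny made-up CSV held in memory (not the real ~/INDflights.csv) just to show the call; the sample rows and the name demo_df are assumptions for illustration.

```python
import io
import pandas as pd

# Made-up sample standing in for ~/INDflights.csv (illustration only).
sample_csv = io.StringIO(
    "Year,Origin,Dest,CancellationCode\n"
    "1987,IND,ORD,\n"
    "2008,ORD,IND,A\n"
)

# low_memory=False asks pandas to process the file in one pass before
# inferring column types, which avoids mixed-type guesses (and the
# DtypeWarning) at the cost of using more memory.
demo_df = pd.read_csv(sample_csv, low_memory=False)
print(demo_df.dtypes)
```

On the real file this uses more memory, so ignoring the warning, as suggested above, is also perfectly fine.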

Now your data frame called myDF will contain all of the data for the flights (from October 1987 to April 2008) that depart or arrive at Indianapolis airport, which has 3-letter code IND.

The source files cover the years 1987 through 2008, so myDF should contain every flight with IND as the Origin or Dest airport.

  1. How many flights are there altogether in myDF? You can check this using myDF.shape.

  2. How many of the flights are departing from IND? (I.e., the Origin airport is IND.)

  3. How many of the flights are arriving at IND? (I.e., the Dest airport is IND.)
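As a hint for the kind of Pandas operations involved, here is a minimal sketch on a toy data frame. The rows are made up, and the names toy_df, total_flights, departing, and arriving are hypothetical; only the Origin and Dest column names come from the flights data. Comparing a column to a value gives a True/False series, and summing it counts the True entries.

```python
import pandas as pd

# Toy data frame (made-up rows) using the same Origin/Dest column names
# as the flights data.
toy_df = pd.DataFrame({
    "Origin": ["IND", "ORD", "IND", "ATL"],
    "Dest":   ["ORD", "IND", "ATL", "IND"],
})

# shape is a (rows, columns) tuple, so shape[0] is the number of rows.
total_flights = toy_df.shape[0]

# A comparison yields a boolean Series; sum() counts the True values.
departing = (toy_df["Origin"] == "IND").sum()
arriving = (toy_df["Dest"] == "IND").sum()

print(total_flights, departing, arriving)
```

The same pattern should carry over to myDF, although other approaches (for example, filtering first and then checking the shape of the result) work just as well.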

Items to submit
  • Code used to answer the question.

  • Result of code.

TWO

  1. For flights departing from 'IND' (i.e., with IND as the Origin), what are the 20 most popular destination airports (i.e., the 20 most popular Dest airports)?

  2. For flights departing from 'IND' (i.e., with IND as the Origin), what are the 5 most popular airlines (i.e., the 5 most popular `UniqueCarrier`s)?
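One possible approach, sketched on made-up data: filter to the rows of interest, then use value_counts() to tally a column and head(n) to keep the n most common values. The toy rows and the names toy_df, from_ind, top_dests, and top_carriers are assumptions for illustration; on the real data you would use 20 and 5 instead of the small numbers here.

```python
import pandas as pd

# Toy data frame (made-up rows) with the relevant column names.
toy_df = pd.DataFrame({
    "Origin": ["IND", "IND", "IND", "IND", "ORD"],
    "Dest": ["ORD", "ATL", "ORD", "DFW", "IND"],
    "UniqueCarrier": ["AA", "DL", "AA", "AA", "UA"],
})

# Keep only the flights that depart from IND.
from_ind = toy_df[toy_df["Origin"] == "IND"]

# value_counts() sorts from most to least common; head(n) keeps the top n.
top_dests = from_ind["Dest"].value_counts().head(2)
top_carriers = from_ind["UniqueCarrier"].value_counts().head(1)

print(top_dests)
print(top_carriers)
```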

Items to submit
  • Code used to answer the question.

  • Result of code.

THREE

  1. Wrap your work for question 2a into a function that takes two arguments, a data frame and a 3-letter airport code, and finds the 20 most popular destination airports for flights departing from that airport in that data frame.

  2. Wrap your work for question 2b into a function that takes two arguments, a data frame and a 3-letter airport code, and finds the 5 most popular airlines for flights departing from that airport in that data frame.
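To illustrate the general shape such functions might take, here is a rough sketch rather than the required solution. The function names top_values, top_destinations, and top_airlines, the default values of n, and the toy rows are all hypothetical. Because the two tasks differ only in the column and the number of results, this sketch puts the shared steps in one helper.

```python
import pandas as pd

def top_values(df, code, column, n):
    """Hypothetical helper: the n most common values of `column`
    among flights whose Origin equals `code`."""
    departing = df[df["Origin"] == code]
    return departing[column].value_counts().head(n)

def top_destinations(df, code, n=20):
    """Most popular Dest airports for flights departing from `code`."""
    return top_values(df, code, "Dest", n)

def top_airlines(df, code, n=5):
    """Most popular UniqueCarrier values for flights departing from `code`."""
    return top_values(df, code, "UniqueCarrier", n)

# Toy data frame (made-up rows) to try the functions on.
toy_df = pd.DataFrame({
    "Origin": ["IND", "IND", "IND", "ORD"],
    "Dest": ["ORD", "ORD", "ATL", "IND"],
    "UniqueCarrier": ["AA", "AA", "DL", "UA"],
})

print(top_destinations(toy_df, "IND"))
print(top_airlines(toy_df, "IND"))
```

Writing two fully separate functions, as the questions describe, is equally valid; the shared helper is just one design choice.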

Items to submit
  • Code used to answer the question.

  • Result of code.

FOUR

Test your functions from questions 3a and 3b on a couple of other airports. Hint: If we use huge airports, we likely will not have enough memory in Pandas and our kernel might crash. So we will consider some midsize airports for testing the functions. Test your functions from questions 3a and 3b on Jacksonville (JAX) and Buffalo (BUF).
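To test on another airport you first need a data frame for it. Repeating the bash approach from question 1 with a different code is one way. As an alternative that addresses the memory concern, here is a sketch that reads a CSV in chunks and keeps only the matching rows. The helper name read_airport_flights, its chunksize default, and the sample rows are assumptions for illustration; the sketch reads a single in-memory CSV, whereas the real data is spread across one file per year.

```python
import io
import pandas as pd

def read_airport_flights(source, code, chunksize=100_000):
    """Hypothetical helper: read `source` in chunks and keep only rows
    where Origin or Dest equals `code`."""
    kept = []
    # With chunksize set, read_csv yields data frames of at most that
    # many rows, so the whole file never sits in memory at once.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        mask = (chunk["Origin"] == code) | (chunk["Dest"] == code)
        kept.append(chunk[mask])
    return pd.concat(kept, ignore_index=True)

# Made-up sample standing in for one of the yearly flight files.
sample_csv = io.StringIO(
    "Year,Origin,Dest\n"
    "1990,JAX,ATL\n"
    "1990,IND,ORD\n"
    "1991,BUF,JAX\n"
    "1991,BUF,ORD\n"
)

jax_df = read_airport_flights(sample_csv, "JAX", chunksize=2)
print(jax_df)
```

For the real data, you would presumably call such a helper once per yearly file and concatenate the results before passing the data frame to your functions.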

Items to submit
  • Code used to answer the question.

  • Result of code.

TA applications for The Data Mine are currently being accepted. Please visit us here to apply!

Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.