TDM 10200: Project 10 — Spring 2024

Motivation: NumPy is the foundation that Pandas is built on. Mastering NumPy’s numerical operations will enrich your understanding for numerical operations and data analysis.

Context: Hopeful you have a solid foundation and understanding of data analysis in Python, and a good introduction to Pandas. In this project we will delve into NumPy, enhancing your skill set for high performance computing, in situations where you do not use Pandas.

Scope: Python, pandas, NumPy

Reading and Resources

Dataset

/anvil/projects/tdm/data/flights/2014.csv

You need to use 2 cores for your Jupyter Lab session for Project 10 this week.

You can use pd.set_option('display.max_columns', None) if you want to see all of the columns in a very wide data frame.

We added six new videos to help you with Project 10. BUT the example videos are about a data set with beer reviews. You need to (instead) work on the flight data given here: /anvil/projects/tdm/data/flights/2014.csv

Questions

Question 1 (2 points)

  1. The dataset is 2.5 G. For this project, we will only need the following columns, so let us create a DataFrame with only those columns with corresponding data types.

cols = [
    'DepDelay', 'ArrDelay', 'Distance',
    'CarrierDelay', 'WeatherDelay',
    'DepTime', 'ArrTime', 'Diverted', 'AirTime'
]

col_types = {
    'DepDelay': 'float64',
    'ArrDelay': 'float64',
    'Distance': 'float64',
    'CarrierDelay': 'float64',
    'WeatherDelay': 'float64',
    'DepTime': 'float64',
    'ArrTime': 'float64',
    'Diverted': 'int64',
    'AirTime': 'float64'
}
  • You may refer to pandas.read_csv to know more about read only specific columns

Question 2 (2 points)

  1. Use to_numpy() to create a numpy array called mydelays, containing the information from the column DepDelay.

  2. Display the shape and data type in mydelays.

  3. Use nan_to_num() to replace all null values in mydelays to 0.

  4. It can be helpful to know how to manipulate the values in an array! Find the average time in mydelays by calculating the numpy mean() of this array. Afterwards, add 15 minutes to all of the departure delay times stored in mydelays. Finally, use the numpy mean() method again, to calculate and display the average of the updated values in mydelays. How do these two averages compare?

  • The output should look something like this:

output
The average Departure Delay before adding 15 minutes is: .......

The average Departure Delay after adding 15 minutes is: .......

Question 3 (2 points)

  1. Calculate and display the maximum arrival delay and the minimum arrival delay.

  • The output should look something like this:

output
Max Arrival Delay: ...... minutes
Min Arrival Delay: ...... minutes

Question 4 (2 points)

The motivation for questions 4 and 5 is to compare the times needed for calculations in pandas vs. numpy.

In this question, first solve the following 3 questions using pandas (only).

  1. Create a data frame named delayed_flights that contains the information about the flights that satisfy the condition departure delay  60 minutes or arrival delay  60 minutes.

  2. Calculate the average distance for the flights that you found in question 4a, by taking a mean of the Distance column from the pandas data frame.

  3. Display the time needed to calculate the time used for the calculation.

  • You may import the time() library to calculate the time used, as follows:

import time
start_time = time.time()
....
#your program here
....
end_time = time.time()
print(f"Used time is {end_time - start_time}")

Question 5 (2 points)

Please using numpy methods to re-create your work from Question 4, as follows:

  1. Create 3 numpy arrays for the DepDelay, ArrDelay, and Distance data.

  2. Filter the numpy array with the Distance stored in it, so that you have only the Distances that satisfy the condition that 'departure delay > 60 minutes or arrival delay > 60 minutes'

  3. Use numpy mean() to calculate the average distances from question 5b. (Your solution should be the same as the average you obtained in question 4b.)

  4. How long does the program take to get the average?

  5. Please state your understanding of pandas vs. numpy from Question 4 and 5 in one or two sentences.

Project 10 Assignment Checklist

  • Jupyter Lab notebook with your code, comments and output for the assignment

    • firstname-lastname-project10.ipynb.

  • Python file with code and comments for the assignment

    • firstname-lastname-project10.py

  • Submit files through Gradescope

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.