Aggregate Functions with the Flight Dataset

This example is from TDM 102 Project 10 Spring 2024.

These example(s) depend on the database:

  • /anvil/projects/tdm/data/flights/2014.csv

Learn more about the dataset here.

1a. The dataset is 2.5 G. For this project, we will only need the following columns, so let us create a DataFrame with only those columns with corresponding data types.

import pandas as pd


pth="/anvil/projects/tdm/data/flights/2014.csv"

columns_to_read = [
    'DepDelay', 'ArrDelay', 'Distance',
    'CarrierDelay', 'WeatherDelay',
    'DepTime', 'ArrTime', 'Diverted', 'AirTime'
]


column_types = {
    'DepDelay': 'float64',
    'ArrDelay': 'float64',
    'Distance': 'float64',
    'CarrierDelay': 'float64',
    'WeatherDelay': 'float64',
    'DepTime': 'float64',
    'ArrTime': 'float64',
    'Diverted': 'int64',
    'AirTime': 'float64'
}

# Reading the CSV file with only the specified columns
df = pd.read_csv(pth, usecols=columns_to_read, dtype=column_types)

print(df.head())
DepTime DepDelay ArrTime ArrDelay Diverted AirTime Distance CarrierDelay WeatherDelay

935.0

-5.0

1051.0

-4.0

0

56.0

328.0

951.0

11.0

1115.0

20.0

0

54.0

328.0

11.0

0.0

1144.0

9.0

1302.0

2.0

0

57.0

328.0

1134.0

-1.0

1253.0

-7.0

0

53.0

328.0

1129.0

-6.0

1244.0

-16.0

0

52.0

328.0

2a. Use to_numpy() to create a numpy array called mydelays, containing the information from the column DepDelay.

import numpy as np
import pandas as pd



mydelays = df['DepDelay'].to_numpy()

2b. Display the shape and data type in mydelays.

print("Shape:", mydelays.shape, "Type:", mydelays.dtype)
Shape: (5819811,) Type: float64

2c. Use nan_to_num() to replace all null values in mydelays to 0.

mydelays = np.nan_to_num(mydelays, nan=0)

2d. It can be helpful to know how to manipulate the values in an array! Find the average time in mydelays by calculating the numpy mean() of this array. Afterwards, add 15 minutes to all of the departure delay times stored in mydelays. Finally, use the numpy mean() method again, to calculate and display the average of the updated values in mydelays.

import numpy as np
import pandas as pd

mydelays_u = mydelays + 15
mean_mydelays = np.mean(mydelays_u)
print("Mean Departure Delay:", mean_mydelays)
Mean Departure Delay: 25.417677824932802

3a. Calculate and display the maximum arrival delay and the minimum arrival delay.

import numpy as np
import pandas as pd

arrDelay = df['ArrDelay'].to_numpy()
arrDelay =np.nan_to_num(arrDelay,nan=0)
max_arr_delay = np.max(arrDelay)
min_arr_delay = np.min(arrDelay)
print("Max Arrival Delay:", max_arr_delay, "Min Arrival Delay:", min_arr_delay)
Max Arrival Delay: 2444.0 Min Arrival Delay: -112.0