# TDM 10200: Project 10 — Spring 2024

Motivation: NumPy is the foundation that Pandas is built on. Mastering NumPy’s numerical operations will enrich your understanding for numerical operations and data analysis.

Context: Hopeful you have a solid foundation and understanding of data analysis in Python, and a good introduction to Pandas. In this project we will delve into NumPy, enhancing your skill set for high performance computing, in situations where you do not use Pandas.

Scope: Python, pandas, NumPy

## Dataset

`/anvil/projects/tdm/data/flights/2014.csv`

 You need to use 2 cores for your Jupyter Lab session for Project 10 this week.
 You can use `pd.set_option('display.max_columns', None)` if you want to see all of the columns in a very wide data frame.
 We added six new videos to help you with Project 10. BUT the example videos are about a data set with beer reviews. You need to (instead) work on the flight data given here: `/anvil/projects/tdm/data/flights/2014.csv`

## Questions

### Question 1 (2 points)

1. The dataset is 2.5 G. For this project, we will only need the following columns, so let us create a DataFrame with only those columns with corresponding data types.

``````cols = [
'DepDelay', 'ArrDelay', 'Distance',
'CarrierDelay', 'WeatherDelay',
'DepTime', 'ArrTime', 'Diverted', 'AirTime'
]

col_types = {
'DepDelay': 'float64',
'ArrDelay': 'float64',
'Distance': 'float64',
'CarrierDelay': 'float64',
'WeatherDelay': 'float64',
'DepTime': 'float64',
'ArrTime': 'float64',
'Diverted': 'int64',
'AirTime': 'float64'
}``````

### Question 2 (2 points)

1. Use `to_numpy()` to create a numpy array called `mydelays`, containing the information from the column `DepDelay`.

2. Display the shape and data type in `mydelays`.

3. Use `nan_to_num()` to replace all null values in `mydelays` to 0.

4. It can be helpful to know how to manipulate the values in an array! Find the average time in `mydelays` by calculating the numpy `mean()` of this array. Afterwards, add 15 minutes to all of the departure delay times stored in `mydelays`. Finally, use the numpy `mean()` method again, to calculate and display the average of the updated values in `mydelays`. How do these two averages compare?

 The output should look something like this: output ```The average Departure Delay before adding 15 minutes is: ....... The average Departure Delay after adding 15 minutes is: .......```

### Question 3 (2 points)

1. Calculate and display the maximum arrival delay and the minimum arrival delay.

 The output should look something like this: output ```Max Arrival Delay: ...... minutes Min Arrival Delay: ...... minutes```

### Question 4 (2 points)

The motivation for questions 4 and 5 is to compare the times needed for calculations in pandas vs. numpy.

In this question, first solve the following 3 questions using pandas (only).

1. Create a data frame named `delayed_flights` that contains the information about the flights that satisfy the condition   60 minutes or arrival delay  60 minutes.

2. Calculate the average distance for the flights that you found in question 4a, by taking a mean of the `Distance` column from the pandas data frame.

3. Display the time needed to calculate the time used for the calculation.

 You may import the `time()` library to calculate the time used, as follows: ``````import time start_time = time.time() .... #your program here .... end_time = time.time() print(f"Used time is {end_time - start_time}")``````

### Question 5 (2 points)

Please using `numpy` methods to re-create your work from Question 4, as follows:

1. Create 3 numpy arrays for the `DepDelay`, `ArrDelay`, and `Distance` data.

2. Filter the numpy array with the `Distance` stored in it, so that you have only the Distances that satisfy the condition that 'departure delay > 60 minutes or arrival delay > 60 minutes'

3. Use numpy `mean()` to calculate the average distances from question 5b. (Your solution should be the same as the average you obtained in question 4b.)

4. How long does the program take to get the average?

5. Please state your understanding of pandas vs. numpy from Question 4 and 5 in one or two sentences.

Project 10 Assignment Checklist

• Jupyter Lab notebook with your code, comments and output for the assignment

• `firstname-lastname-project10.ipynb`.

• Python file with code and comments for the assignment

• `firstname-lastname-project10.py`