TDM 20200: Project 11 — 2024

Motivation: Machine learning and AI are huge buzzwords in industry. In this project, we will continue to learn more TensorFlow features.

Context: The purpose of these projects is to give you exposure to machine learning tools, some basic functionality, and a conceptual workflow to create and use a model without needing any special math or statistics background.

Scope: Python, tensorflow, scikit-learn, numpy

Dataset

/anvil/projects/tdm/data/flights/2014.csv
/anvil/projects/tdm/data/flights/2019.csv

Readings and Resources

Make sure to read about and use the template found here, and read the important information about project submissions here.
Pandas read_csv
scikit-learn documentation
scikit-learn tutorial
tensorflow tutorial
youtube for tensorflow
joblib dump() load()
metrics

You need to use 2 cores for your Jupyter Lab session for Project 11 this week.

You can use pd.set_option('display.max_columns', None) if you want to see all of the columns in a very wide data frame.

We added a video (below) to help you with Project 11. Note, however, that the example video uses a data set of beer reviews. You need to work instead with the flight data in /anvil/projects/tdm/data/flights/2014.csv and /anvil/projects/tdm/data/flights/2019.csv

Questions

Question 1 (2 points)

In the previous project, you created a tensorflow model with limited data. A meaningful tensorflow model needs a large amount of data, so your model may not work well! Nonetheless, we can still learn how to create (and use!) the model, and how to check its performance.

First, update your program from Project 10 so that it builds the model with more data. Please use nrows = 100000 from the data set 2014.csv. The test/training split should again be defined using:

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

Use epochs=10 when training the model.

In question 1, we are using 2014 data to train the model. In question 2, we will use 2019 data to test the model. Please name your variables in a way that will enable you and the graders to both understand which variables are which.
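The steps in Question 1 can be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not the Project 10 solution: the column names DepDelay, Distance, and ArrDelay, the helper train_delay_model, the use of StandardScaler, and the tiny two-layer network are all illustrative choices. Substitute the features, label, and model architecture from your own Project 10 code. The demo trains on synthetic rows so it runs anywhere; the commented read_csv line shows where the real 2014 rows would come in.

```python
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical column choices; use the features and label from your Project 10 code.
FEATURE_COLS = ["DepDelay", "Distance"]
LABEL_COL = "ArrDelay"


def train_delay_model(flights_df, epochs=10):
    """Train a small regression network on one year of flight rows (illustrative)."""
    # Rows with missing values (e.g. cancelled flights) cannot be used for training.
    data = flights_df[FEATURE_COLS + [LABEL_COL]].dropna()
    features = data[FEATURE_COLS].to_numpy(dtype="float32")
    labels = data[LABEL_COL].to_numpy(dtype="float32")

    # Same split as specified in the assignment.
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=42
    )

    # Assumption: features are standardized; fit the scaler on training rows only.
    scaler = StandardScaler().fit(X_train)

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(len(FEATURE_COLS),)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1),  # single output: predicted arrival delay
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    model.fit(scaler.transform(X_train), y_train, epochs=epochs, verbose=0)

    # evaluate() returns [loss, mae] because one extra metric was compiled in.
    _, test_mae = model.evaluate(scaler.transform(X_test), y_test, verbose=0)
    return model, scaler, test_mae


# On Anvil you would load the real rows, for example:
# flights_2014_100k = pd.read_csv("/anvil/projects/tdm/data/flights/2014.csv", nrows=100000)
# Synthetic stand-in rows so this sketch runs without the data file:
rng = np.random.default_rng(0)
dep_delay = rng.normal(10, 30, size=2000)
flights_2014_demo = pd.DataFrame({
    "DepDelay": dep_delay,
    "Distance": rng.uniform(100, 2500, size=2000),
    "ArrDelay": dep_delay + rng.normal(0, 5, size=2000),
})

model_2014_demo, scaler_2014_demo, test_mae_2014_demo = train_delay_model(flights_2014_demo)
print("held-out MAE on the synthetic 2014 stand-in:", test_mae_2014_demo)
```

Keeping the year and row count in each variable name (for example model_2014_100k and scaler_2014_100k) makes it easier to tell the 2014 training objects apart from the 2019 test data later.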

Question 2 (2 points)

Read in 100000 rows of data from the 2019.csv file.
Use the model from question 1 to predict the arrival delays for these rows, and save the predictions as predicted_arrival_delays_100k_2019 (or something similar).
Save the actual arrival delays as actual_arrival_delays_100k_2019 (or something similar).
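One way to produce the two arrays in Question 2 is sketched below, again under assumptions: the column names, the helper predict_arrival_delays, and the idea of a scaler fitted on the 2014 training data are hypothetical and should be replaced with whatever preprocessing your Question 1 code uses. The point the sketch tries to illustrate is that the 2019 rows go through exactly the same preprocessing as the 2014 rows (transform only, no refitting) before they reach model.predict. A tiny stand-in model is trained on synthetic rows so the example runs on its own.

```python
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import StandardScaler

# Hypothetical column choices; they must match what the model was trained on.
FEATURE_COLS = ["DepDelay", "Distance"]
LABEL_COL = "ArrDelay"


def predict_arrival_delays(model, scaler, flights_df):
    """Return (predicted, actual) arrival delays as two aligned 1-D arrays."""
    # Drop incomplete rows first so predictions and actual values stay aligned.
    data = flights_df[FEATURE_COLS + [LABEL_COL]].dropna()
    # Reuse the scaler fitted on the training year; do not refit it on test data.
    scaled = scaler.transform(data[FEATURE_COLS].to_numpy(dtype="float32"))
    predicted = model.predict(scaled, verbose=0).flatten()
    actual = data[LABEL_COL].to_numpy(dtype="float32")
    return predicted, actual


def make_demo_rows(n, seed):
    """Synthetic stand-in for flight rows (not real data)."""
    rng = np.random.default_rng(seed)
    dep_delay = rng.normal(10, 30, size=n)
    return pd.DataFrame({
        "DepDelay": dep_delay,
        "Distance": rng.uniform(100, 2500, size=n),
        "ArrDelay": dep_delay + rng.normal(0, 5, size=n),
    })


# Stand-ins for the Question 1 objects (model and scaler built from 2014 rows).
flights_2014_demo = make_demo_rows(1000, seed=1)
train_features = flights_2014_demo[FEATURE_COLS].to_numpy(dtype="float32")
train_labels = flights_2014_demo[LABEL_COL].to_numpy(dtype="float32")
scaler_2014_demo = StandardScaler().fit(train_features)
model_2014_demo = tf.keras.Sequential([
    tf.keras.Input(shape=(len(FEATURE_COLS),)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])
model_2014_demo.compile(optimizer="adam", loss="mse")
model_2014_demo.fit(scaler_2014_demo.transform(train_features), train_labels,
                    epochs=2, verbose=0)

# On Anvil: flights_2019_100k = pd.read_csv(".../flights/2019.csv", nrows=100000)
flights_2019_demo = make_demo_rows(500, seed=2)
predicted_arrival_delays_demo_2019, actual_arrival_delays_demo_2019 = (
    predict_arrival_delays(model_2014_demo, scaler_2014_demo, flights_2019_demo)
)
```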

Question 3 (2 points)

Solve questions 1 and 2 again, this time using 500000 rows from the 2014 data and 500000 rows from the 2019 data. Be sure to change all of your variable names accordingly.
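Since Question 3 repeats the same workflow with a different row count, one option is to make the row count a parameter rather than copying cells. The helper load_flight_rows and the column names below are hypothetical, and the demo reads from a small in-memory CSV so that it runs without the Anvil files.

```python
import io

import pandas as pd

# Hypothetical column choices; use the columns your model actually needs.
COLUMNS = ["DepDelay", "Distance", "ArrDelay"]


def load_flight_rows(source, nrows, columns=COLUMNS):
    """Read the first `nrows` rows of a flights CSV, keeping only `columns`."""
    # nrows limits how many data rows are parsed; usecols limits memory use.
    return pd.read_csv(source, nrows=nrows, usecols=columns)


# In-memory stand-in for a flights file (50 fake rows plus an unused column).
demo_csv = "DepDelay,Distance,ArrDelay,Origin\n" + "\n".join(
    f"{i},{100 + i},{i + 1},IND" for i in range(50)
)

# One data frame per row count, keyed by the count, so the names stay unambiguous.
rows_by_size = {
    n: load_flight_rows(io.StringIO(demo_csv), nrows=n) for n in (10, 40)
}
for n, frame in rows_by_size.items():
    print(n, frame.shape)

# On Anvil the same call might look like:
# flights_2014_500k = load_flight_rows("/anvil/projects/tdm/data/flights/2014.csv", 500000)
# flights_2019_500k = load_flight_rows("/anvil/projects/tdm/data/flights/2019.csv", 500000)
```

Note that rows with missing values may still need to be dropped after loading, so the number of usable rows can end up slightly below the requested count.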

Question 4 (2 points)

Use the data from question 2 (with 100000 rows of data) to study the predicted arrival delays for 2019 versus the actual arrival delays for 2019. Please comment on what you find.
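A few summary numbers can support the written comments. The sketch below uses functions from sklearn.metrics; the helper summarize_predictions and the four-element demo arrays are made up for illustration and are not results from the flight data. In your notebook you would pass actual_arrival_delays_100k_2019 and predicted_arrival_delays_100k_2019 instead.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def summarize_predictions(actual, predicted):
    """Return common regression metrics for predicted vs. actual delays."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return {
        # Average size of the error, in minutes.
        "mae": mean_absolute_error(actual, predicted),
        # Like MAE, but penalizes large misses more heavily.
        "rmse": float(np.sqrt(mean_squared_error(actual, predicted))),
        # Fraction of the variance in actual delays explained by the predictions.
        "r2": r2_score(actual, predicted),
    }


# Made-up demo values, only to show the call.
actual_demo = np.array([10.0, -5.0, 30.0, 0.0])
predicted_demo = np.array([12.0, -3.0, 25.0, 1.0])
summary_demo = summarize_predictions(actual_demo, predicted_demo)
print(summary_demo)
```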

Please be sure to explain what you learn and, most likely, to include some visualizations to justify your work.
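As one possible visualization (there are many reasonable choices), the sketch below draws a predicted-versus-actual scatter plot next to a histogram of the prediction errors. The function plot_predicted_vs_actual is a hypothetical helper, and the demo arrays are synthetic stand-ins for the 2019 arrays from Question 2.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_predicted_vs_actual(actual, predicted, title):
    """Scatter of predicted vs. actual delays, plus a histogram of the errors."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(11, 4))

    # Points on the red diagonal are perfect predictions.
    ax_scatter.scatter(actual, predicted, s=4, alpha=0.3)
    low = min(actual.min(), predicted.min())
    high = max(actual.max(), predicted.max())
    ax_scatter.plot([low, high], [low, high], color="red", linewidth=1)
    ax_scatter.set_xlabel("actual arrival delay (minutes)")
    ax_scatter.set_ylabel("predicted arrival delay (minutes)")

    # Errors centered near zero suggest little systematic over- or under-prediction.
    ax_hist.hist(predicted - actual, bins=50)
    ax_hist.set_xlabel("predicted minus actual (minutes)")
    ax_hist.set_ylabel("number of flights")

    fig.suptitle(title)
    fig.tight_layout()
    return fig


# Synthetic stand-in arrays, only to show the call.
rng = np.random.default_rng(3)
actual_demo = rng.normal(5, 30, size=1000)
predicted_demo = actual_demo + rng.normal(0, 8, size=1000)
fig_demo = plot_predicted_vs_actual(actual_demo, predicted_demo, "Demo: predicted vs. actual")
```

In Jupyter Lab the figure is displayed when the cell returns it or when plt.show() is called.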

Question 5 (2 points)

Now use the data from question 3 (with 500000 rows of data) to study the predicted arrival delays for 2019 versus the actual arrival delays for 2019. Please comment on what you find. Be sure to compare the effectiveness of using 100000 rows of data versus using 500000 rows of data.
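To compare the two models, it may help to put the same error measures side by side in one small table. The helper error_row is hypothetical, and the numbers below are made up purely to show the layout; they say nothing about which model actually performs better on the flight data.

```python
import numpy as np
import pandas as pd


def error_row(label, actual, predicted):
    """One table row of error measures for a given model."""
    errors = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    return {
        "training_rows": label,
        "mae": np.mean(np.abs(errors)),         # typical size of a miss
        "rmse": np.sqrt(np.mean(errors ** 2)),  # weights large misses more
        "bias": np.mean(errors),                # positive means over-prediction
    }


# Made-up stand-ins for the 2019 arrays from Questions 2 and 3.
actual_demo = np.array([10.0, -5.0, 30.0, 0.0])
predicted_100k_demo = np.array([14.0, -1.0, 22.0, 4.0])
predicted_500k_demo = np.array([12.0, -3.0, 25.0, 1.0])

comparison = pd.DataFrame([
    error_row("100k", actual_demo, predicted_100k_demo),
    error_row("500k", actual_demo, predicted_500k_demo),
]).set_index("training_rows")
print(comparison)
```

Because the 100000-row and 500000-row test sets contain different flights, a fair comparison might also evaluate both models on the same set of 2019 rows.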

Project 11 Assignment Checklist

Jupyter Lab notebook with your code, comments and outputs for the assignment
- firstname-lastname-project11.ipynb
Python file with code and comments for the assignment
- firstname-lastname-project11.py
Submit files through Gradescope

Please double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, we recommend downloading your submission after submitting it, to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.