TDM 10100: Project 13 - Topics Review

Project Objectives

Motivation: If you have made it this far, you hopefully now have a good introduction to working in R. This project mainly focuses on reviewing some of the many topics we have already covered throughout this semester.

Context: From reading in datasets to making interactive maps, we hope this project helps you realize just how much you have learned throughout this semester.

Scope: R, tapply, data cleaning, merging, mapping

Learning Objectives

Write your own function
Practice working with big datasets
Visually express data value distributions

Dataset

/anvil/projects/tdm/data/flights/subset/2006.csv
/anvil/projects/tdm/data/flights/subset/airports.csv

If you use AI in any way, such as for debugging or research, we now require that you submit a link to the entire chat history. For example, if you used ChatGPT, there is a “Share” option in the conversation sidebar. Click “Create Link” and add the shareable link as part of your citation.

The project template in the Examples Book now has a “Link to AI Chat History” section; please have this included in all your projects. If you did not use any AI tools, you may write “None”.

We allow using AI for learning purposes; however, all submitted materials (code, comments, and explanations) must be your own work and in your own words. No content or ideas should be directly applied or copied and pasted into your projects. Please refer to the-examples-book.com/projects/fall2025/syllabus#guidance-on-generative-ai. Failing to follow these guidelines is considered academic dishonesty.

2006 Flights

The flights data is huge, with files covering 1987 to 2023, and each year has a subset file that keeps the data a reasonable size to work with. The flights datasets provide numerous opportunities for data exploration - there are 7,141,922 rows from 2006 alone! These subsets record when each flight took place, along with details such as how long it took, its flight number, and more. There are, of course, empty or messy values, but there is so much data that they do not have much of an impact on what we will be doing.

There are 29 columns and millions of rows of data. Some of these columns include:

Month: numeric month values
DepDelay: flight departure delay time in minutes
Origin: abbreviation values for origin airport
Dest: abbreviation values for destination airport

Airports

This dataset accompanies the flights dataset. It is a great deal smaller than even just one of the flight year subsets, containing 3376 rows that give details for each of the airports across the United States.

There is far less going on in this dataset, which has just 7 columns:

iata: three-letter abbreviations for each airport name
airport: actual spelled out airport name
city: city in which each airport is
state: all 50 U.S. states, in two-letter abbreviation format
country: mostly containing USA values, with a few other countries listed
lat: latitude of each airport location
long: longitude of each airport location

Questions


Please use 4 cores for this project.

Question 1 (2 points)

Use the fread() function from the data.table library. It is designed for reading very large datasets and is typically much faster than read.csv().

For this question, read in the 2006 Flights dataset with the sole purpose of checking out what is generally contained within Flights data:

flights <- fread('/anvil/projects/tdm/data/flights/subset/2006.csv')

Print out the head(), and make note of a few columns that track useful numeric flights data.
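With more than 7 million rows, it can help to peek at a small piece of a file before loading all of it. The following is a minimal sketch, assuming the data.table package is installed; the tiny CSV it writes is made up for illustration and is not the real Flights file. It uses the nrows and select arguments of fread() to read only a few rows and columns.

```r
library(data.table)

# Write a tiny made-up CSV to a temporary file so the example is self-contained.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Month,DepDelay,Origin",
  "1,5,IND",
  "1,NA,ORD",
  "2,12,IND"
), tmp)

# Read only the first two data rows and only two of the columns.
peek <- fread(tmp, select = c("Month", "DepDelay"), nrows = 2)

print(peek)
print(dim(peek))
```

The same two arguments can be pointed at a real file path when you only want a quick look at its structure.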

Next, we will write a function avg_flights_values() that takes a year, a flight origin, and the name of a column containing numeric data. The function will read in the dataset for the specified year, keep the flights from the given origin, and find the average value of the chosen numeric column for each month of that year.

It is not always easy to build a function from nothing, so first work through the steps outside of the function. Take the rows of the 2006 Flights data for which the Origin is "IND", and save this subset as indy_flights.

Use tapply() to take the average DepDelay for each Month of indy_flights. Remove NA values by passing the na.rm = TRUE option through tapply(), and save the result as my_avg_values.

We’ve used tapply() in some projects before! In Project 7, we used tapply() similarly to find the average video counts for YouTube channels made within a certain year.

my_averages <- tapply(youtubers$video.count2, youtubers$started, mean, na.rm=TRUE)

Build the avg_flights_values() function. Test your function with some different inputs. For instance, if you run avg_flights_values(2006, "IND", "DepDelay"), the output should be something like:

1: 3.6426084203862
2: 5.175
3: 6.70757670632436
4: 5.9419542083199
5: 5.58160332162248
6: 9.58821645853346
7: 6.61897590361446
8: 4.49106872540115
9: 4.49279835390946
10: 7.67158308751229
11: 4.33443489755453
12: 8.79530423280423
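One piece that often causes trouble is using a column whose name arrives as a character string. The sketch below shows that pattern on a small made-up data frame; the helper name avg_by_month() and the toy values are hypothetical, and it deliberately leaves out reading the file and filtering by origin, which your own function still needs to do.

```r
# Toy data standing in for a slice of Flights; the values are made up.
toy <- data.frame(
  Month    = c(1, 1, 2, 2, 2),
  DepDelay = c(10, NA, 4, 6, NA),
  ArrDelay = c(0, 2, NA, 8, 10)
)

# Hypothetical helper: df[[column_name]] looks up a column by a name
# stored in a string, which df$column_name cannot do.
avg_by_month <- function(df, column_name) {
  tapply(df[[column_name]], df$Month, mean, na.rm = TRUE)
}

dep_avgs <- avg_by_month(toy, "DepDelay")
arr_avgs <- avg_by_month(toy, "ArrDelay")

print(dep_avgs)
print(arr_avgs)
```

Because the column is chosen by its name at run time, the same helper works for any numeric column without being rewritten.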

Deliverables

1.1 For which columns in the Flights dataset can you compute a per-month average?
1.2 Build the function avg_flights_values() and test it using input values different from those in the example.

Question 2 (2 points)

Read in the Airports dataset and display the first 6 rows:

airports <- read.csv('/anvil/projects/tdm/data/flights/subset/airports.csv')
head(airports)

This dataset contains information about each of the different airports in the United States. If you look through the iata column, you may recognize some of the three-letter abbreviations for the airport names that occur in the Flights datasets.

Filter the Airports dataset to create a dataframe that only contains information about the airports in the midwest_states:

midwest_states <- c("IL", "IN", "IA", "KS", "MI", "MN", "MO", "NE", "ND", "OH", "SD", "WI")
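Filtering by membership in a vector is usually done with the %in% operator. Here is a minimal base R sketch on a made-up data frame; the iata codes and the wanted_states vector are placeholders chosen for illustration, not real Airports data.

```r
# Made-up stand-in for the Airports data.
toy_airports <- data.frame(
  iata  = c("AAA", "BBB", "CCC", "DDD", "EEE"),
  state = c("IN", "IL", "NY", "IN", "TX"),
  stringsAsFactors = FALSE
)

# Placeholder vector of states to keep.
wanted_states <- c("IN", "IL")

# %in% returns TRUE for each row whose state appears in wanted_states.
toy_subset <- toy_airports[toy_airports$state %in% wanted_states, ]

# table() counts how many rows fall in each state.
state_counts <- table(toy_subset$state)

print(toy_subset)
print(state_counts)
```

The resulting table can be passed straight to a plotting function such as barplot().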

Create a plot to show how the airports are distributed throughout the Midwest. Then, for the northeast_states:

northeast_states <- c('CT', 'ME', 'MA', 'NH', 'NJ', 'NY', 'PA', 'RI', 'VT')

Plot to show how many airports are in each northeast state.

Display the two plots beside each other.

In base R, this can look like:

# set the plotting space for 1 row, 2 columns worth of plots
par(mfrow = c(1, 2))

plot(plot_number_1)

plot(plot_number_2)
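For a version of this layout that can be run directly, here is a small sketch using made-up counts; the names counts_a and counts_b and their values are placeholders, and barplot() is just one reasonable choice of plot type.

```r
# Made-up counts standing in for two regions.
counts_a <- c(AA = 3, BB = 1, CC = 2)
counts_b <- c(XX = 2, YY = 4)

# Split the plotting area into 1 row and 2 columns,
# keeping the previous settings so they can be restored.
old_par <- par(mfrow = c(1, 2))

barplot(counts_a, main = "Toy region A", ylab = "Count")
barplot(counts_b, main = "Toy region B", ylab = "Count")

# Restore the original single-plot layout.
par(old_par)
```

Restoring the saved settings keeps later plots in the notebook from being squeezed into the two-column layout.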

Deliverables

2.1 Make a table to show how many airports are in each Midwest state. Which state has the most airports?
2.2 Plot to show how the airports in the Midwest and northeast regions are distributed throughout the states

Question 3 (2 points)

In this question, we will be using the 2006 Flights and the Airports datasets.

There are a lot of columns in the Flights dataset. It can be helpful to adjust how many columns can be displayed with: options(repr.matrix.max.cols = some_big_number)

What is the dimension of the 2006 Flights dataset? Display the names of the columns.

The Airports dataset contains latitude and longitude information for each of the airports. However, there are 3376 unique entries in the iata column, and about 300 in each of the Origin and Dest columns of the Flights dataset.

Merge the two datasets using the Origin column of Flights and the iata column of Airports. Make sure you keep every row from the Flights data, and take only the rows from the Airports data that match a Flights entry. Save the merged dataset as flights_expanded:

# left_join(), rename(), and %>% come from the dplyr package
library(dplyr)

flights_expanded <- flights %>%
    left_join(airports, by = c("Origin" = "iata")) %>%
    rename(OriginLat = lat,
           OriginLong = long,
           OriginCity = city)

Notice that the code above renames the city, lat, and long columns that come from the Airports data. These values describe the airport in the Origin column, not the airport in the Dest column.
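To see what a left join keeps and drops, it can help to try it on a few rows. The sketch below assumes the dplyr package is installed and uses made-up toy tables; "ZZZ" is an invented code with no match in the toy airports table.

```r
library(dplyr)

# Toy flights: "ZZZ" has no matching airport row.
toy_flights <- data.frame(
  Origin   = c("IND", "ORD", "IND", "ZZZ"),
  DepDelay = c(5, 12, -3, 0),
  stringsAsFactors = FALSE
)

# Toy airports: "BOS" never appears as an Origin above.
toy_airports <- data.frame(
  iata = c("IND", "ORD", "BOS"),
  city = c("Indianapolis", "Chicago", "Boston"),
  stringsAsFactors = FALSE
)

toy_expanded <- toy_flights %>%
  left_join(toy_airports, by = c("Origin" = "iata")) %>%
  rename(OriginCity = city)

print(toy_expanded)
```

Every flight row survives the join, the unmatched "ZZZ" flight gets NA for OriginCity, and the unused "BOS" airport does not appear at all. Comparing nrow() before and after a join is a quick sanity check on the full data.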

Print the unique OriginCity names of flights_expanded.

Deliverables

3.1 Show the column names of the Flights dataset.
3.2 Merge the 2006 Flights dataset with the Airports dataset.
3.3 What are the origin cities of flights_expanded?

Question 4 (2 points)

For this question, we will work with maps in R again. This utilizes the following libraries:

library(sf)
library(leaflet)
library(htmltools)

Remember, resizing the map can behave strangely in the Firefox browser and requires the HTML formatting that we used in Project 12.

In the flights_expanded dataframe, select the Origin, OriginCity, OriginLat, and OriginLong columns, and subset them as origin_info.

In the Airports data, there was one entry per airport. But in flights_expanded, there is an entry for each flight. If we tried to plot the origin location of every flight, the map would be extremely laggy and far too overcrowded. Clean origin_info so that it contains one row per distinct origin airport:

origin_info <- flights_expanded %>%
    select(Origin, OriginCity, OriginLat, OriginLong)  %>%
    distinct()
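The effect of distinct() is easiest to see on a few rows. This is a minimal sketch on made-up data, assuming dplyr is installed; the toy_expanded table is a stand-in, not the real flights_expanded.

```r
library(dplyr)

# Toy per-flight table: the same origin appears on several rows.
toy_expanded <- data.frame(
  Origin     = c("IND", "IND", "ORD", "IND"),
  OriginCity = c("Indianapolis", "Indianapolis", "Chicago", "Indianapolis"),
  DepDelay   = c(5, -3, 12, 0),
  stringsAsFactors = FALSE
)

# After selecting only the airport columns, the repeated rows are
# identical, so distinct() collapses them to one row per airport.
toy_origin_info <- toy_expanded %>%
  select(Origin, OriginCity) %>%
  distinct()

print(toy_origin_info)
```

Note that distinct() only collapses rows that match in every selected column, which is why the per-flight DepDelay column is dropped first.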

We have some experience mapping from Project 12. Here, set the coordinates of points to be the OriginLong and OriginLat columns of origin_info. Plot these points on the map.
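As a reminder of the general shape of this workflow, here is a rough sketch on two made-up points. It assumes the sf and leaflet packages are installed and that Project 12 used st_as_sf() to build the points, which may differ from what you did; the codes, city names, and coordinates are arbitrary placeholders.

```r
library(sf)
library(leaflet)

# Two made-up origins with arbitrary coordinates.
toy_origins <- data.frame(
  Origin     = c("AAA", "BBB"),
  OriginCity = c("City A", "City B"),
  OriginLat  = c(40, 42),
  OriginLong = c(-86, -88),
  stringsAsFactors = FALSE
)

# Convert to an sf object; coords are given as longitude first, then latitude.
# crs = 4326 is the usual latitude/longitude coordinate system.
points <- st_as_sf(toy_origins, coords = c("OriginLong", "OriginLat"), crs = 4326)

# Build the map: a base tile layer plus one circle marker per point.
toy_map <- leaflet() %>%
  addTiles() %>%
  addCircleMarkers(data = points, radius = 4, color = "blue")

toy_map
```

A common mistake is swapping the order of longitude and latitude in coords, which places the points in the wrong part of the world.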

Deliverables

4.1 Subset and clean flights_expanded to make origin_info.
4.2 Map the locations for the origins of the flights that took place in 2006.

Question 5 (2 points)

The map currently shows the unique airport locations across the United States, but we can restrict it to only the airports in the Midwest. Filter origin_info to keep only the rows whose state value is in midwest_states, the same vector used to find the Midwest states in Question 2. Note that origin_info as built in Question 4 keeps only Origin, OriginCity, OriginLat, and OriginLong, so you will need to include the state column when you select columns from flights_expanded.

Make a map to show the airports for the flights that departed from the Midwest states.

Both the map of the Midwest origins and the map of origins across the entire country contain a lot of points. Add popups where you call addCircleMarkers(), so that when a dot is selected, it displays the city it represents.

addCircleMarkers(data = points, radius = [some_radius_value], color = [some_fun_color], popup = ~OriginCity)

Deliverables

5.1 Map to show the locations of the airports for the flights departing from Midwest states.
5.2 Add popups to the circle markers to display the name of each origin city.

Submitting your Work

Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.

Items to submit

firstname_lastname_project13.ipynb

You must double check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output even though it may not. Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.