TDM 10100: Project 8 - Dates and Date-Times

Project Objectives

Motivation: Working with date or date-time data can be tricky and frustrating. The lubridate library makes this a lot easier and even makes some things possible that are not available in base R.

Context: Lubridate is a very useful package from the tidyverse that can be used across many different contexts when working with real-world data.

Scope: R, dates, date-times, tapply

Learning Objectives
  • Learn about working with lubridate

  • Practice plotting useful functions

  • Convert and manipulate data values

Please use 4 Cores and 8GB RAM for this project.

Dataset

  • /anvil/projects/tdm/data/restaurant/orders.csv

  • /anvil/projects/tdm/data/flights/subset/1997.csv

If AI is used in any way, such as for debugging or research, we now require that you submit a link to the entire chat history. For example, if you used ChatGPT, there is a “Share” option in the conversation sidebar. Click on “Create Link” and add the shareable link as part of your citation.

The project template in the Examples Book now has a “Link to AI Chat History” section; please have this included in all your projects. If you did not use any AI tools, you may write “None”.

We allow using AI for learning purposes; however, all submitted materials (code, comments, and explanations) must be your own work and in your own words. No content or ideas should be directly applied or copy-pasted into your projects. Please refer to the-examples-book.com/projects/fall2025/syllabus#guidance-on-generative-ai. Failing to follow these guidelines is considered academic dishonesty.

Restaurant Orders

This dataset comes from a competition focused on the challenge of making restaurant recommendations to customers. It is a record of all orders made by the customers from the various vendors included in the competition. Each customer is not limited to a single order and can even order from multiple locations.


With data from about 35,000 customers, there is ample opportunity for data exploration. The orders are stored across 26 columns that focus on the details of each order, vendor, and customer combination. The columns we will focus on are item_count and created_at. item_count tracks the number of items within each order. created_at marks the year, month, day, hour, minute, and second the order was created in the system.

1997 Flights

The flights dataset is huge, with files from 1987 to 2023, each with a corresponding subset dataset to keep the data reasonable to work with. It provides numerous opportunities for data exploration, with 5,411,843 rows from 1997 alone. These subsets contain information about when each flight took place, as well as factors like how long each flight took, specifics like flight numbers, and more. There are, of course, empty or messy values, but there is so much data that this does not have much of an impact on what we will be doing.

There are 29 columns and millions of rows of data. Some of these columns include:

  • Year: year values (only 1997 in this subset)

  • Month: numeric month values

  • DayofMonth: day of the month (1 through 28, 29, 30, or 31, depending on the month)

  • DepTime: flight departure time in military-time shorthand (e.g., 1430 for 2:30 PM)

  • Origin: abbreviation values for origin airport

  • Distance: distance per flight in miles

Questions

Question 1 (2 points)

Read in the Restaurant Orders dataset as orders from /anvil/projects/tdm/data/restaurant/orders.csv. Many of the entries are blank when we first look at the head of the dataset. These could be unrecorded data, errors, or just instances where the columns did not apply to the customer. This is a fairly large dataset, and the columns we will be using are more consistent, so it is OK to ignore the gaps.

If we look at the created_at column, it shows date and time character data (if you read the data with read.csv()). When data is of class Date, it represents actual calendar dates (without time information) rather than just written entries. You can convert character strings into Date objects using as.Date().

While the default input format is "YYYY-MM-DD" or "YYYY/MM/DD", you can specify other formats using the format argument and format codes. Date objects are typically displayed as "YYYY-MM-DD", regardless of the input format.
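
For example, a quick sketch (the date strings here are made up for illustration):

as.Date("2018-03-14")                       # default "YYYY-MM-DD" works directly
as.Date("3/14/2018", format = "%m/%d/%Y")   # format codes handle other layouts

Both calls display as "2018-03-14".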

Overwrite the created_at column or make a new column containing the created_at entries as actual dates:

orders$created_at <- as.Date(orders$created_at)

In R, there is a package lubridate that is designed to make working with date and time data simpler and more efficient. This package is a part of the tidyverse, and provides us with methods to complete date and time related manipulations easier than with base R. To learn a bit about how to begin using lubridate, look through the introduction and cheatsheet.

Within lubridate, there are numerous functions provided that make manipulating date/time data a lot simpler. This includes but is not limited to:

  • wday() - day of the week

  • mday() - day of the month

  • yday() - day of the year

  • month() - numerical month

  • year() - year value

  • ymd() - dates of YYYY-MM-DD format

With the Date created_at column, make two new columns. Your first column, num_day_of_the_week, will contain the day of the week data for the numerical date values:

orders$num_day_of_the_week <- wday(orders$created_at)

Your second column, day_of_the_week, will contain the day of the week data with the correct labels applied to each day of the week value:

orders$day_of_the_week <- wday(orders$created_at, label=TRUE)

Show the table of each of these new columns, then make a table displaying their values against each other.
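
A minimal sketch of these tables, assuming the two columns created above:

table(orders$num_day_of_the_week)                           # counts per numeric day
table(orders$day_of_the_week)                               # counts per labeled day
table(orders$num_day_of_the_week, orders$day_of_the_week)   # the two against each other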

This comparison table does not tell us much, but it does help confirm that 1 = Sunday, 2 = Monday, and so forth. Besides, not everything you do has to serve a strict purpose; it can just be fun to display some of your data manipulations visually. You could even put this into a heatmap or some other visualization, but this simple table is OK, especially given that it really does not serve much practical purpose.

Deliverables

1.1 In your own words, what does the lubridate library do?
1.2 Proof that the days of the week values 1-7 follow the Sunday-Saturday rotation.
1.3 Date data is stored as the class Date. How does date-time data get stored?

Please refer to Dr. Ward’s video with the orders data for more practice. In this video, the data is read into R with the fread() function, which automatically reads the created_at column as dates (no need to convert it):

Question 2 (2 points)

Now, take the month values from created_at, and save them as a new month column (we used label=TRUE for the names of the months):

orders$month <- month(orders$created_at, label=TRUE)

Do the same for years for a new year column. From these new columns, check out how the data is distributed in their respective tables (table for month and table for year).
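
One way to sketch this, mirroring the month code above:

orders$year <- year(orders$created_at)
table(orders$month)
table(orders$year)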

There is a column item_count that tracks how many items were in each order. Use tapply() to see how the total counts of items in the orders were distributed across the months.
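
A sketch of the tapply() call, with na.rm=TRUE passed along to sum() (the variable name is illustrative):

items_by_month <- tapply(orders$item_count, orders$month, sum, na.rm=TRUE)
items_by_month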

You may see that some of the months still have NA as their value even after using na.rm=TRUE. To see why this is, search through the orders dataset for where the month column is equal to "Mar" (or March if you specified abbr = FALSE when creating month).

Save the result of your tapply() call to a variable, then create a barplot from it. This helps to visualize that there were actually no items ordered in March, April, or May.
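
Continuing the sketch above:

barplot(items_by_month)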

Use tapply() again, this time to show the total item count for each (year, month) pairing.

Make this into a barplot to help visualize how the orders were distributed throughout the years. Remember to use beside=TRUE and legend=TRUE (as well as your other customizations) to help this plot’s readability.
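
A sketch of both steps, assuming the month and year columns from earlier (the variable name is illustrative):

items_by_year_month <- tapply(orders$item_count, list(orders$year, orders$month), sum, na.rm=TRUE)
barplot(items_by_year_month, beside=TRUE, legend=TRUE)   # rows are years, so the legend labels the years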

Deliverables

2.1 What was the actual time span of orders (with items) in this dataset?
2.2 Why do March and April show NA values, while May does not?
2.3 Give a table showing how the orders were distributed throughout the months

In the following video, Dr. Ward shows more examples with orders data and identifies the top 5 vendors (vendor_id) with the highest number of orders over the years:

Question 3 (2 points)

Did you choose 4 Cores and 8GB RAM when starting the server?
This is crucial when reading the flights data.

Read in the 1997 Flights dataset as flights from /anvil/projects/tdm/data/flights/subset/1997.csv.

Another useful function in lubridate is ymd(), which is often used with paste() to combine three columns (a year column, a month column, and a day column) into one Date column containing YYYY-MM-DD values. Let’s make a new column full_dates that contains data in the format YYYY-MM-DD:

flights$full_dates <- ymd(paste(flights$Year, flights$Month, flights$DayofMonth, sep="-"))

This combining of columns that we’ve just done is actually the opposite of how we split up the created_at column in Question 1. Now that you know how to split and merge dates, you could continue to do so in an endless loop, splitting, merging, splitting again, …​
But for the rest of this project, let’s just set this aside.

In the DepTime column, there are values from 1 to 2400. But 2400 is not valid in POSIXct, a class used to store date and time information: the valid ranges are 00 - 23 for hours and 00 - 59 for minutes and seconds. So 2400 needs to be converted to 0 to represent midnight on the following day.
Use flights$DepTime[flights$DepTime == 2400] <- 0 to replace each 2400 entry in DepTime with 0.

To make the DepTime column display times in a more readable format instead of military shorthand, we need to convert its values. Divide the DepTime column by 100, apply the floor() function, and save the result as a new column depHour:

flights$depHour <- floor(flights$DepTime / 100)

Take the minutes that remain after removing the hours (the last two digits of DepTime) and save this as depMinute.
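
One way to sketch this is with the modulo operator (an assumption about the approach; the fractional part of DepTime / 100, times 100, gives the same result):

flights$depMinute <- flights$DepTime %% 100   # remainder after removing the hours, e.g. 1430 %% 100 is 30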

For the final part of this question, make a column date_times. You should use make_datetime(), and should include the Year, Month, DayofMonth, depHour, and depMinute columns. This new column will include date and time values for each flight’s departure, in a format like the created_at column from the orders dataset, only without seconds.
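
A sketch of the make_datetime() call, assuming the columns built above:

flights$date_times <- make_datetime(flights$Year, flights$Month, flights$DayofMonth, flights$depHour, flights$depMinute)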

Deliverables

3.1 date_times column containing the year, month, day, hour, and minute of each flight
3.2 depHour and depMinute columns that correctly represent the DepTime values
3.3 What does ymd from the ymd() function actually stand for and what does this function do?

Question 4 (2 points)

Make a dataframe bostonDF that contains only the values from flights that had an Origin of BOS.
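
For example, using subset() (one of several ways to do this):

bostonDF <- subset(flights, Origin == "BOS")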

Flights departing from Boston could have many different arrival locations. But one thing that is fair to guess is that the average flight distance across each of the different months of the year would be fairly similar. Start by looking at how many Boston flights there were per month. These values are relatively similar, and February, of course, has the fewest occurrences, given that it has the fewest possible flight days.

Use tapply() to show the average flight Distance across the different months for flights within the bostonDF. Save this as boston_distance.
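
A sketch, assuming the bostonDF from above:

table(bostonDF$Month)                                               # flights per month
boston_distance <- tapply(bostonDF$Distance, bostonDF$Month, mean, na.rm=TRUE)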

Plot boston_distance as a line plot with type='b'.

Create dataframes for the flights with Origin values of PHX, MDW, and SEA, respectively. Perform all of the steps you did with bostonDF for each of these, resulting in four plots total.

What if you wanted to compare the average flight distance across the months for the different origin locations? It is hard to do when the plots are separate, and are at different scales.

Plot boston_distance again, this time using specific y-limits of minimum 300, maximum 1200. To add on lines representing each of your other plots, use lines().
Your cell should contain:

  • plot(boston_distance, …​)

  • lines(phoenix_distance, …​)

  • lines(chicago_distance, …​)

  • lines(seattle_distance, …​)

The initial plotting with boston_distance sets up the space, and then each lines() adds the additional plot lines to the visualization space:

plot(boston_distance,
     type='b',
     col='blue',
     ylim=c(300, 1200),
     main='Average Distance From Boston, Phoenix, Chicago, and Seattle',
     xlab='Months',
     ylab='Average Distance')
lines(phoenix_distance, type='b', col='orange')
lines(chicago_distance, type='b', col='red')
lines(seattle_distance, type='b', col='green')

It is useful to be consistent and use the same color each time you map a specific location, e.g. Boston = blue, Phoenix = orange, and so on.

Deliverables

4.1 Compare the average flight distance in October for each of your four smaller dataframes
4.2 Five plots: one for each origin airport (BOS, PHX, MDW, and SEA), plus a fifth with them all combined
4.3 Between the four chosen airports, how do YOU explain the difference in average flight distance?

Question 5 (2 points)

In Question 4, we were looking at the average distance per month for each of four flight origins. Use tapply() here to find the total (sum) distance per month for each of those same four flight origins.
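
For example, the Boston version (repeat the same pattern for the other three origins):

boston_distance <- tapply(bostonDF$Distance, bostonDF$Month, sum, na.rm=TRUE)   # totals now, not averages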

These values will be a lot greater than the ones from Question 4, because those were averages, while these sum the total miles flown, which can run into the hundreds of thousands or even millions of miles.

Combine these four tapply() results in one plot. You can use the same plotting code from the previous question, but remember that in this case, the distances will come from the sum, not the average:

plot(boston_distance,
     type='b',
     col='blue',
     main='Total Distance From Boston, Phoenix, Chicago, Seattle',
     xlab='Months',
     ylab='Total Distance')
lines(phoenix_distance, type='b', col='orange')
lines(chicago_distance, type='b', col='red')
lines(seattle_distance, type='b', col='green')

This plot may not look quite right. The reason is that the plot space is created when plot() is called; in our case, the limits of the area are set to the min and max values from the Boston flights.

Find the max() and min() values from each of your total flight distances across all four locations, and set the y limits accordingly, matching above the highest max value and below the lowest min value.
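
One way to sketch this is with range(), which returns the min and max in one step (an assumption about the approach, not the only one):

ylims <- range(boston_distance, phoenix_distance, chicago_distance, seattle_distance, na.rm=TRUE)
# then pass ylim=ylims (padded slightly if you like) to plot()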

Deliverables

5.1 Use tapply to find the total distance per month for all four locations
5.2 What was the maximum distance across all of the four locations? What was the minimum?
5.3 A line plot correctly showing the total distance by month and location

Submitting your Work

Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.

Items to submit
  • firstname_lastname_project8.ipynb

You must double check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output, even though it may not. Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.