TDM 10100: Project 8 - Dates and Date-Times
Project Objectives
Motivation: Working with date or date-time data can be tricky and frustrating. The lubridate library makes this much easier and even enables operations that are not possible in base R.
Context: Lubridate is a very useful package from the tidyverse that can be used across many different contexts when working with real-world data.
Scope: R, dates, date-times, tapply
Please use 4 Cores and 8GB RAM for this project.
Dataset
- /anvil/projects/tdm/data/restaurant/orders.csv
- /anvil/projects/tdm/data/flights/subset/1997.csv
If AI is used in any capacity, such as for debugging, research, etc., we now require that you submit a link to the entire chat history. For example, if you used ChatGPT, there is a “Share” option in the conversation sidebar. Click on “Create Link” and add the shareable link as part of your citation. The project template in the Examples Book now has a “Link to AI Chat History” section; please include this in all your projects. If you did not use any AI tools, you may write “None”. We allow using AI for learning purposes; however, all submitted materials (code, comments, and explanations) must be your own work and in your own words. No content or ideas should be directly applied or copy-pasted into your projects. Please refer to the-examples-book.com/projects/fall2025/syllabus#guidance-on-generative-ai. Failing to follow these guidelines is considered academic dishonesty.
Restaurant Orders
This dataset comes from a competition focused on the challenge of making restaurant recommendations to customers. It is a record of all orders made by the customers from the various vendors that were included in the competition. Each customer is not limited to a single order and can even order from multiple locations.

With data from about 35,000 customers, there is ample opportunity for data exploration. The orders are stored across 26 columns that focus on the details of each order, vendor, and customer combination. The columns we will be focusing on are item_count and created_at. item_count tracks the number of items within each order. created_at marks the year, month, day, hour, minute, and second the order was created in the system.
1997 Flights
The flights dataset is huge, with files from 1987 to 2023, each with a corresponding subset dataset to make the data reasonable to work with. The flights dataset provides numerous opportunities for data exploration, with 5,411,843 rows from 1997 alone. These subsets contain information about when each flight took place, as well as different factors like how long each flight took, specifics like flight numbers, and more. There are, of course, empty or messy values, but there is so much data that this does not make too much of an impact for what we will be doing.
There are 29 columns and millions of rows of data. Some of these columns include:
- Year: year values (only containing 1997 for this subset)
- Month: numeric month values
- DayofMonth: day of the month (1 up to 28-31, depending on the month)
- DepTime: flight departure time in military-time shorthand
- Origin: abbreviation for the origin airport
- Distance: distance per flight in miles
Questions
Question 1 (2 points)
Read in the Restaurant Orders dataset as orders from /anvil/projects/tdm/data/restaurant/orders.csv. Many of the entries are blank when we first look at the head of the dataset. This could be unrecorded data, errors, or just instances where the columns did not apply to the customer. This is a fairly large dataset, and the columns we will be using are more consistent, so it is OK to ignore the gaps.
If we look at the created_at column, it shows date and time as character data (if you read the data with read.csv()). When data is of class Date, it represents actual calendar dates (without time information) rather than just written entries. You can convert character strings into Date objects using as.Date().
While the default input format is "YYYY-MM-DD" or "YYYY/MM/DD", you can specify other formats using the format argument and format codes. Date objects print as "YYYY-MM-DD", regardless of the input format.
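For example, a string written as month/day/year can be parsed by spelling out its format codes (the sample string here is just an illustration, not from the dataset):
as.Date("03/15/2019", format = "%m/%d/%Y")   # returns "2019-03-15"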
Overwrite the created_at column or make a new column containing the created_at entries as actual dates:
orders$created_at <- as.Date(orders$created_at)
In R, there is a package lubridate that is designed to make working with date and time data simpler and more efficient. This package is part of the tidyverse, and provides methods that make date- and time-related manipulations easier than in base R. To learn a bit about how to begin using lubridate, look through the introduction and cheatsheet.
Within lubridate, there are numerous functions that make manipulating date/time data a lot simpler. These include, but are not limited to, the following (a short demonstration follows the list):
- wday() - day of the week
- mday() - day of the month
- yday() - day of the year
- month() - numerical month
- year() - year value
- ymd() - parses dates in YYYY-MM-DD format
With the Date created_at column, make two new columns. Your first column, num_day_of_the_week, will contain the day of the week as a numeric value:
orders$num_day_of_the_week <- wday(orders$created_at)
Your second column, day_of_the_week, will contain the day of the week with the correct label applied to each numeric value:
orders$day_of_the_week <- wday(orders$created_at, label=TRUE)
Show the table of each of these new columns, then make a table displaying their values against each other.
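One way to produce those tables (a sketch, using the column names created above):
table(orders$num_day_of_the_week)
table(orders$day_of_the_week)
table(orders$num_day_of_the_week, orders$day_of_the_week)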
This comparison table does not tell us much, but it does help confirm that each numeric value 1-7 lines up with exactly one day label.
1.1 In your own words, what does the lubridate library do?
1.2 Proof that the day-of-the-week values 1-7 follow the Sunday-Saturday rotation.
1.3 Date data is stored as the class Date. How does date-time data get stored?
Please refer to Dr. Ward's video for more examples.
Question 2 (2 points)
Now, take the month values from created_at, and save them as a new month column (we used label=TRUE for the names of the months):
orders$month <- month(orders$created_at, label=TRUE)
Do the same for years, saving a new year column. From these new columns, check out how the data is distributed in their respective tables (one table for month and one for year).
There is a column item_count that tracks how many items were in each order. Use tapply() to see how the total counts of items in the orders were distributed across the months.
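A sketch of that call, assuming the month column created above (na.rm=TRUE is passed through to sum()):
tapply(orders$item_count, orders$month, sum, na.rm=TRUE)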
You may see that some of the months still have NA as their value even after using na.rm=TRUE. To see why this is, search through the orders dataset for where the month column is equal to "Mar" (or "March" if you specified abbr=FALSE when creating month).
Save your tapply() result to a variable, then create a barplot from it. This helps to visualize that there were actually no items ordered in March, April, or May.
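For example (monthly_items is just an illustrative name):
monthly_items <- tapply(orders$item_count, orders$month, sum, na.rm=TRUE)
barplot(monthly_items)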
Use tapply() again, this time to show the total item count for each (year, month) pairing.
Make this into a barplot to help visualize how the orders were distributed throughout the years. Remember to use beside=TRUE and legend=TRUE (as well as your other customizations) to help this plot's readability.
2.1 What was the actual time span of orders (with items) in this dataset?
2.2 Why do March and April show NA values? Why does May not?
2.3 A table showing how the orders were distributed throughout the months
In the following video, Dr. Ward shows more examples.
Question 3 (2 points)
Did you choose 4 Cores and 8GB RAM when starting the server? It is going to be crucial when reading the flights data.
Read in the 1997 Flights dataset as flights from /anvil/projects/tdm/data/flights/subset/1997.csv.
Another useful function in lubridate is ymd(), which is often used with paste() to easily combine three columns (a year column, month column, and day column) into one Date column containing values in YYYY-MM-DD form. Let's make a new column full_dates that contains data in the format YYYY-MM-DD:
flights$full_dates <- ymd(paste(flights$Year, flights$Month, flights$DayofMonth, sep="-"))
This combining of columns that we've just done is actually the opposite of how we split up the created_at column in Question 1. Now that you know how to split and merge dates, you could continue to do so in an endless loop: splitting, merging, splitting again, … But for the rest of this project, let's set this aside.
In the DepTime column, there are values from 1 to 2400. But 2400 is not valid in POSIXct, a class used to store date and time information: the valid range is 00-23 for hours and 00-59 for minutes and seconds. So 2400 is not valid, and needs to be converted to 0 to represent midnight on the following day.
Replace each 2400 entry in DepTime with 0:
flights$DepTime[flights$DepTime == 2400] <- 0
To make the DepTime column display times in a more readable format instead of military shorthand, we need to convert its values. Divide the DepTime column by 100, take the whole-number part with the floor() function, and save this as a new column depHour:
flights$depHour <- floor(flights$DepTime / 100)
Take the remainder (the last two digits) from dividing DepTime by 100 and save this as depMinute.
For the final part of this question, make a column date_times. You should use make_datetime(), and should include the Year, Month, DayofMonth, depHour, and depMinute columns. This new column will include date and time values for each flight's departure, formatted like the created_at column from the orders dataset, except that we do not include seconds here.
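A sketch of that call (make_datetime() comes from lubridate, and its first five arguments are year, month, day, hour, and min):
flights$date_times <- make_datetime(flights$Year, flights$Month, flights$DayofMonth, flights$depHour, flights$depMinute)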
3.1 date_times column containing the year, month, day, hour, and minute of each flight
3.2 depHour and depMinute columns that correctly represent the DepTime values
3.3 What does ymd in the ymd() function actually stand for, and what does this function do?
Question 4 (2 points)
Make a dataframe bostonDF that contains only the rows from flights that had an Origin of BOS.
Flights departing from Boston could have many different arrival locations. But one thing that is fair to guess is that the average flight distance across each of the different months of the year would be fairly similar. Start by looking at how many Boston flights there were per month. These values are relatively similar, and February, of course, has the fewest occurrences, given that it has the fewest possible flight days.
Use tapply() to show the average flight Distance across the different months for flights within bostonDF. Save this as boston_distance.
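A sketch of that call, passing mean as the summarizing function (na.rm=TRUE guards against any missing distances):
boston_distance <- tapply(bostonDF$Distance, bostonDF$Month, mean, na.rm=TRUE)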
Plot boston_distance as a line plot with type='b'.
Create dataframes for the flights with an Origin of PHX, MDW, and SEA, respectively. Perform all of the steps you did with bostonDF for each of these, resulting in four plots total.
What if you wanted to compare the average flight distance across the months for the different origin locations? It is hard to do when the plots are separate, and are at different scales.
Plot boston_distance again, this time using specific y-limits with a minimum of 300 and a maximum of 1200. To add lines representing each of your other plots, use lines().
Your cell should contain:
- plot(boston_distance, …)
- lines(phoenix_distance, …)
- lines(chicago_distance, …)
- lines(seattle_distance, …)
The initial plot() call with boston_distance sets up the plotting space, and then each lines() call adds an additional line to the visualization:
plot(boston_distance, type='b', col='blue', ylim=c(300, 1200),
     main='Average Distance From Boston, Phoenix, Chicago, and Seattle',
     xlab='Months', ylab='Average Distance')
lines(phoenix_distance, type='b', col='orange')
lines(chicago_distance, type='b', col='red')
lines(seattle_distance, type='b', col='green')
It is useful to be consistent and use one color each time you plot a specific location, e.g. blue for Boston in every plot.
4.1 Compare the average flight distance in October for each of your four smaller dataframes
4.2 Five plots - one for each origin airport (BOS, PHX, MDW, and SEA), plus a final plot with all of them combined
4.3 Between the four chosen airports, how do YOU explain the difference in average flight distance?
Question 5 (2 points)
In Question 4, we were looking at the average distance per month for each of four flight origins. Use tapply() here to find the total (sum) distance per month for each of those same four flight origins.
These values will be a lot greater than the ones from Question 4, because those were averages, while these sum the miles across every flight in a month.
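For Boston, that might look like the following (a sketch; repeat the pattern for the other three origins):
boston_distance <- tapply(bostonDF$Distance, bostonDF$Month, sum, na.rm=TRUE)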
Combine these four tapply() results in one plot. You can use the same plotting code from the previous question, but remember that in this case, the distances come from the sum, not the average:
plot(boston_distance, type='b', col='blue',
     main='Total Distance From Boston, Phoenix, Chicago, Seattle',
     xlab='Months', ylab='Total Distance')
lines(phoenix_distance, type='b', col='orange')
lines(chicago_distance, type='b', col='red')
lines(seattle_distance, type='b', col='green')
This plot may not look quite right. The reason is that the plot space is fixed when plot() is called; in our case, the limits of the area are set to the min and max values from the Boston flights.
Find the max() and min() values from each of your total flight distances across all four locations, and set the y-limits accordingly: just above the highest max value and just below the lowest min value.
5.1 Use tapply() to find the total distance per month for all four locations
5.2 What was the maximum total distance across all of the four locations? What was the minimum?
5.3 A line plot correctly showing the total distance by month and location
Submitting your Work
Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.
- firstname_lastname_project8.ipynb
You must double check your .ipynb after submitting it in Gradescope. You will not receive full credit if your .ipynb file does not show all of your code and output.