STAT 19000: Project 12 — Fall 2021

Motivation: In the previous project you were forced to do a little bit of date manipulation. Dates can be very difficult to work with, regardless of the language you are using. lubridate is a package within the famous tidyverse that greatly simplifies some of the most common tasks one needs to perform with date data.

Context: We’ve been reviewing topics learned this semester. In this project we will continue solving data-driven problems, wrangling data, and creating graphics. We will introduce a tidyverse package that adds great stand-alone value when working with dates.

Scope: R

Learning Objectives
  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • Utilize apply functions in order to solve a data-driven problem.

  • Gain proficiency using split, merge, and subset.

  • Demonstrate the ability to create basic graphs with default settings.

  • Demonstrate the ability to modify axes labels and titles.

  • Incorporate legends using legend().

  • Demonstrate the ability to customize a plot (color, shape/linetype).

  • Convert strings to dates, and format dates using the lubridate package.

Make sure to read about and use the template found here, and read the important information about project submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt

Questions

Question 1

For this project, when launching your Jupyter Lab instance, please select 5000 as the amount of memory to allocate.

Read the dataset into a dataframe called liquor.

We are interested in exploring time-related trends in Iowa liquor sales. What is the data type for the column Date?

Try to run the following code, to get the time between the first and second sale.

liquor$Date[1] - liquor$Date[2]

As you may have expected, we cannot use the standard operators (like + and -) on this type.
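
To see why the subtraction fails, it can help to reproduce the situation on a tiny example. The sketch below assumes the Date column was read in as character strings (confirm this with str() or class() on your own data); the data frame toy and its values are made up for illustration and are not taken from the dataset.

```r
# Hypothetical toy data; the real column may be formatted differently.
toy <- data.frame(Date = c("11/20/2015", "11/21/2015"),
                  stringsAsFactors = FALSE)

class(toy$Date)   # "character"
str(toy)

# Arithmetic on character strings is an error, so capture the message
# instead of letting it stop the script.
result <- tryCatch(
  toy$Date[1] - toy$Date[2],
  error = function(e) conditionMessage(e)
)
print(result)
```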

Create a new column named date that holds the Date column converted to the Date type, using the function as.Date().

From this point on, you will have 2 "date" columns: 1 called Date and 1 called date. Date will have the incorrect type for a date, and date will have the correct type.

This allows us to see different ways to work with the data.

You may need to define the date format in the as.Date() function using the argument format.
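
As a minimal sketch of how the format argument works, the example below converts two made-up strings. The month/day/year layout is an assumption for illustration only; inspect a few values of your own Date column to find the layout you actually need.

```r
# Hypothetical strings in month/day/year order; check your own data first.
x <- c("11/20/2015", "11/21/2015")

# %m = month, %d = day, %Y = four-digit year; the separators must match.
d <- as.Date(x, format = "%m/%d/%Y")

class(d)              # "Date"
d[1] - d[2]           # a difftime of -1 days
format(d[1], "%Y")    # "2015"

# A format that does not match the strings yields NA rather than an error.
as.Date(x, format = "%Y-%m-%d")
```

Because a mismatched format silently produces NA, checking a known value (such as the year of the first row) is a quick way to confirm the conversion worked.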

Try running the following code now.

liquor$date[1] - liquor$date[2]

Much better! This is just 1 reason why it is important to have the data in your dataframe be of the correct type.

Double check that the date got converted properly. The year for liquor$date[1] should be 2015.

Relevant topics: read.csv, fread, as.Date, str

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

Create two new columns in the dataset called year and month based on the Date column.

Which years are covered in this dataset regarding Iowa liquor sales? Do all years have all months represented?

Use the as.Date function again, then use format to extract only the information wanted. See an example below.

Update: It came to our attention that the substr method previously mentioned is much less memory efficient and will cause the kernel to crash. (If your project writer had taken the time to test both of his ideas, you wouldn’t have had this issue. Sorry!) Please use the as.Date method shown below.

myDate <- as.Date('2021-11-01')
day <- as.numeric(format(myDate,'%d'))
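
The same pattern extends to years and months. The sketch below uses a short made-up vector of dates (an assumption standing in for the converted date column) and shows one possible way to check which years and months appear.

```r
# Hypothetical dates standing in for the converted date column.
d <- as.Date(c("2014-01-15", "2014-03-02", "2015-01-30"))

year  <- as.numeric(format(d, "%Y"))   # four-digit year
month <- as.numeric(format(d, "%m"))   # month as a number, 1 to 12

sort(unique(year))     # which years appear
table(year, month)     # counts per year/month; a 0 marks a missing month
```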

Relevant topics: substr, as.numeric, format, unique, table

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

A useful package for dealing with dates is called lubridate. The package is part of the famous tidyverse suite of packages. Run the code below to load it.

library(lubridate)

Re-do questions 1 and 2 using the lubridate package. Make sure to name the columns differently, for example date_lb, year_lb and month_lb.
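
A minimal sketch of the lubridate equivalents, assuming the package is installed and that the strings are in month/day/year order (the toy vector x is made up; choose the parser that matches your own data):

```r
library(lubridate)

# Hypothetical strings in month/day/year order; pick the parser
# (mdy, dmy, ymd, ...) that matches the order in your own data.
x <- c("11/20/2015", "02/03/2014")

date_lb  <- mdy(x)          # parse to Date
year_lb  <- year(date_lb)   # numeric year
month_lb <- month(date_lb)  # numeric month, 1 to 12

date_lb[1] - date_lb[2]     # subtraction works, as with as.Date()
```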

Which method do you prefer for solving the questions? Why?

Relevant topics: Lubridate Cheat Sheet

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • Sentence explaining which method you prefer and why.

Question 4

Now that we have the columns year and month, let’s explore the data for time trends.

What is the average volume (gallons) of liquor sold per month? Which month has the lowest average volume? Does that surprise you?

You can change the labels in the x-axis to be months by having the argument xaxt in the plot function set as "n" (xaxt="n") and then having the following code at the end of your plot: axis(side=1, at=1:12, labels=month.abb).
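
The axis hint can be sketched on made-up data as follows. The data frame toy and its columns month and volume are hypothetical names chosen for illustration; substitute the corresponding columns from your own dataframe.

```r
# Hypothetical monthly data: two sales in each of the 12 months.
set.seed(1)
toy <- data.frame(month  = rep(1:12, each = 2),
                  volume = runif(24, min = 1, max = 10))

# Mean volume within each month; the result is named "1" through "12".
avg_by_month <- tapply(toy$volume, toy$month, mean)

# Suppress the default x-axis, then draw one labelled with month names.
plot(1:12, avg_by_month, type = "b", xaxt = "n",
     xlab = "Month", ylab = "Average volume (gallons)")
axis(side = 1, at = 1:12, labels = month.abb)

names(which.min(avg_by_month))   # month number with the lowest average
```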

Relevant topics: tapply, plot

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • 1-2 sentences describing your findings.

Question 5

Make a line plot for the average volume sold per month for the years of 2012 to 2015. Your plot should contain 4 lines, one for each year.

Make sure you specify a title, and label your axes.

Write 1-2 sentences analyzing your plot.

There are many ways to get an average per month. You can use for loops, the apply suite with your own function, subset, or tapply with a grouping that involves both year and month.
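
As one possible sketch of the tapply approach with two grouping variables, the example below builds made-up data and draws one line per year. The data frame toy, its column names, and the plotting choices are assumptions for illustration, not the required solution.

```r
# Hypothetical data: 3 sales per month for each of 2012 to 2015.
set.seed(1)
toy <- expand.grid(rep = 1:3, month = 1:12, year = 2012:2015)
toy$volume <- runif(nrow(toy), min = 1, max = 10)

# Passing a list of two factors gives a year-by-month matrix of means.
avg <- tapply(toy$volume, list(toy$year, toy$month), mean)
dim(avg)   # 4 rows (years) by 12 columns (months)

years <- rownames(avg)
cols  <- seq_along(years)

# Draw the first year, then add one line per remaining year.
plot(1:12, avg[1, ], type = "l", col = cols[1], xaxt = "n",
     ylim = range(avg), xlab = "Month",
     ylab = "Average volume (gallons)",
     main = "Average volume sold per month (toy data)")
axis(side = 1, at = 1:12, labels = month.abb)
for (i in cols[-1]) lines(1:12, avg[i, ], col = cols[i])
legend("topright", legend = years, col = cols, lty = 1)
```

Setting ylim to the range of the whole matrix keeps every year's line inside the plotting region, which a plot scaled to the first year alone would not guarantee.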

Relevant topics: plot, lines, subset, mean, sapply, tapply

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • 1-2 sentences analyzing your plot.

Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted.