# STAT 19000: Project 12 — Fall 2021

Motivation: In the previous project you were forced to do a little bit of date manipulation. Dates can be very difficult to work with, regardless of the language you are using. `lubridate` is a package within the famous tidyverse, that greatly simplifies some of the most common tasks one needs to perform with date data.

Context: We’ve been reviewing topics learned this semester. In this project we will continue solving data-driven problems, wrangling data, and creating graphics. We will introduce a tidyverse package that adds great stand-alone value when working with dates.

Scope: r

Learning Objectives
• Read and write basic (csv) data.

• Explain and demonstrate: positional, named, and logical indexing.

• Utilize apply functions in order to solve a data-driven problem.

• Gain proficiency using split, merge, and subset.

• Demonstrate the ability to create basic graphs with default settings.

• Demonstrate the ability to modify axes labels and titles.

• Incorporate legends using legend().

• Demonstrate the ability to customize a plot (color, shape/linetype).

• Convert strings to dates, and format dates using the `lubridate` package.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

## Dataset(s)

The following questions will use the following dataset(s):

• `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt`

## Questions

### Question 1

 For this project, when launching your Jupyter Lab instance, please select 5000 as the amount of memory to allocate.

Read the dataset into a dataframe called `liquor`.

We are interested in exploring time-related trends in Iowa liquor sales. What is the data type for the column `Date`?

Try to run the following code, to get the time between the first and second sale.

``liquor\$Date[1] - liquor\$Date[2]``

As you may have expected, we cannot use the standard operators (like + and -) on this type.

Create a new column named `date` to be the `Date` column but in date format using the function `as.Date()`.

 From this point in time on, you will have 2 "date" columns — 1 called `Date` and 1 called `date`. `Date` will be the incorrect type for a date, and `date` will be the correct type. This allows us to see different ways to work with the data.

You may need to define the date format in the `as.Date()` function using the argument `format`.

Try running the following code now.

``liquor\$date[1] - liquor\$date[2]``

Much better! This is just 1 reason why it is important to have the data in your dataframe be of the correct type.

 Double check that the date got converted properly. The year for `liquor\$date[1]` should be in 2015.

Relevant topics: `read.csv`, `fread`, `as.Date`, `str`

Items to submit
• Code used to solve this problem.

• Output from running the code.

### Question 2

Create two new columns in the dataset called `year` and `month` based on the `Date` column.

Which years are covered in this dataset regarding Iowa liquor sales? Do all years have all months represented?

Use the `as.Date` function again, and set the format to contain only the information wanted. See an example below.

 Update: It came to our attention that the `substr` method previously mentioned is much less memory efficient and will cause the kernel to crash (if your project writer took the time to test both ideas he had, you wouldn’t have had this issue (sorry)). Please use the `as.Date` method shown below.
``````myDate <- as.Date('2021-11-01')
day <- as.numeric(format(myDate,'%d'))``````

Relevant topics: `substr`, `as.numeric`, `format`, `unique`, `table`

Items to submit
• Code used to solve this problem.

• Output from running the code.

### Question 3

A useful package for dealing with dates is called `lubridate`. The package is part of the famous `tidyverse` suite of packages. Run the code below to load it.

``library(lubridate)``

Re-do questions 1 and 2 using the `lubridate` package. Make sure to name the columns differently, for example `date_lb`, `year_lb` and `month_lb`.

Do you have a preference for solving the questions? Why or why not?

Relevant topics: Lubridate Cheat Sheet

Items to submit
• Code used to solve this problem.

• Output from running the code.

• Sentence explaining which method you prefer and why.

### Question 4

Now that we have the columns `year` and `month`, let’s explore the data for time trends.

What is the average volume (gallons) of liquor sold per month? Which month has the lowest average volume? Does that surprise you?

 You can change the labels in the x-axis to be months by having the argument `xaxt` in the plot function set as "n" (`xaxt="n"`) and then having the following code at the end of your plot: `axis(side=1, at=1:12, labels=month.abb)`.

Relevant topics: `tapply`, `plot`

Items to submit
• Code used to solve this problem.

• Output from running the code.

• 1-2 sentences describing your findings.

### Question 5

Make a line plot for the average volume sold per month for the years of 2012 to 2015. Your plot should contain 4 lines, one for each year.

Make sure you specify a title, and label your axes.

Write 1-2 sentences analyzing your plot.

 There are many ways to get an average per month. You can use `for` loops, `apply` suite with your own function, `subset`, and `tapply` with a grouping that involves both year and month.

Relevant topics: `plot`, `line`, `subset`, `mean`, `sapply`, `tapply`

Items to submit
• Code used to solve this problem.

• Output from running the code.

• 1-2 sentences analyzing your plot.

 Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.