STAT 19000: Project 6 — Fall 2021

Motivation: tapply is a powerful function that allows us to group data, and perform calculations on that data in bulk. The "apply suite" of functions provide a fast way of performing operations that would normally require the use of loops. If you have any familiarity with SQL, it tapply is very similar to working with the GROUP BY clause — you first group your data using some rule, and then perform some operation for each newly created group.

Context: The past couple of projects have studied the use of loops and/or vectorized operations. In this project, we will introduce a function called tapply from the "apply suite" of functions in R.

Scope: r, tapply

Learning Objectives
  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • List the differences between lists, vectors, factors, and data.frames, and when to use each.

  • Demonstrate a working knowledge of control flow in r: if/else statements, while loops, etc.

  • Demonstrate using tapply to perform calculations on subsets of data.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/amazon/tracks.csv

  • /depot/datamine/data/amazon/tracks.db

Questions

Question 1

Load the tracks.csv file into an R data.frame called tracks. Immediately after loading the file, run the following.

str(tracks)
r

What happens?

The C in CSV is not true for this dataset! You’ll need to take advantage of the sep argument of read.csv to read in this dataset.

Once you’ve successfully read in the data, re-run the following.

str(tracks)
r
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

Great! tapply is a very cool, very powerful function in R.

First, let’s say that we wanted to see what the average duration (a column in the tracks data.frame) of songs were by each year (a column in the tracks data.frame). If you think about how you would approach solving this problem, there are a lot of components to keep track of!

  • You don’t know ahead of time how many different years are in the dataset.

  • You have to associate each sum of duration with a specific year.

  • Etc.

Its a lot of work!

In R, there is a really great library that allows us to run queries on an sqlite database and put the result directly into a dataframe. This would be the SQL and R solution to this problem.

library(RSQLite)

con <- dbConnect(SQLite(), dbname = "/depot/datamine/data/amazon/tracks.db")
myDF <- dbGetQuery(con, "SELECT year, AVG(duration) AS average_duration FROM songs GROUP BY year;")
head(myDF)
r

Use tapply to solve the same problem! Are your results the same? Print the first 5 results to make sure they are the same.

tapply can take a minute to get the hang of. I like to think about the first argument to tapply as the column of data we want to perform an operation on, the second argument to tapply as the column of data we want to group by, and the third argument as the operation (as a function, like sum, or median, or mean or sd, or var, etc.) we want to perform on the data.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Plot the results of question (2) with any appropriate plot that will highlight the duration of music by year, sequentially. What patterns do you see, if any?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Ha! Thats not so bad! What are the artist_name of the artists with the highest median duration of songs? Sort the results of the tapply function in descending order and print the first 5 results.

This may take a few minutes to run — this function is doing a lot and there are a lot of artists in this dataset!

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Explore the dataset and come up with a question you want to answer. Make sure tapply would be useful with your investigation, and use tapply to calculate something interesting for the dataset. Create one or more graphics as you are working on your question. Write 1-2 sentences reviewing your findings. It could be anything, and your findings do not need to be "good" or "bad", they can be boring (much like a lot of research findings)!

Items to submit
  • Question you want to answer.

  • Code used to solve this problem.

  • Output (including graphic(s)) from running the code.

  • 1-2 sentences reviewing your findings.

Question 6 (optional, 0 pts)

Use the following SQL and R code and take a crack at solving a problem (any problem) you want to do with R and SQL. You can use the following code to help. Create a cool graphic with the results!

library(RSQLite)

con <- dbConnect(SQLite(), dbname = "/depot/datamine/data/amazon/tracks.db")
myDF <- dbGetQuery(con, "SELECT year, AVG(duration) AS average_duration FROM songs GROUP BY year;")
myDF
r

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.