Think Summer 2020

Submission

Students need to submit an [RMarkdown] file with all of the required code and output by Wednesday, July 8th at 12:00 PM EST through Gradescope inside Brightspace.

You can find an Rmarkdown template which you can modify and use a starting point for your project here, and the resulting, compiled PDF here.

Motivation: SQL is an incredibly powerful tool that allows you to process and filter massive amounts of data — amounts of data where tools like spreadsheets start to fail. You can perform SQL queries directly within the R environment, and doing so allows you to quickly perform ad-hoc analyses.

Context: This project is specially designed for Purdue University’s Think Summer program, in conjunction with Purdue University’s integrative data science initiative, The Data Mine.

Scope: SQL, SQL in R

Learning Objectives
  • Demonstrate the ability to interact with popular database management systems within R.

  • Solve data-driven problems using a combination of SQL and R.

  • Use basic SQL commands: select, order by, limit, desc, asc, count, where, from.

  • Perform grouping and aggregate data using group by and the following functions: count, max, sum, avg, like, having.

You can find useful examples that walk you through relevant material in The Examples Book:

It is highly recommended to read through, search, and explore these examples to help solve problems in this project.

It is highly recommended that you use rstudio.scholar.rcac.purdue.edu. Simply click on the link and login using your Purdue account credentials. Use another system at your own risk.

Don’t forget the very useful documentation shortcut ?. To use, simply type ? in the console, followed by the name of the function you are interested in.

You can also look for package documentation by using help(package=PACKAGENAME), so for example, to see the documentation for the package ggplot2, we could run:

help(package=ggplot2)

Sometimes it can be helpful to see the source code of a defined function. A [function](www.tutorialspoint.com/r/r_functions.htm) is any chunk of organized code that is used to perform an operation. Source code is the underlying R or c or c++ code that is used to create the function. To see the source code of a defined function, type the function’s name without the (). For example, if we were curious about what the function Reduce does, we could run:

Reduce

Occasionally this will be less useful as the resulting code will be code that calls c code we can’t see. Other times it will allow you to understand the function better.

Dataset

The following questions will use the imdb database found in Scholar. The credentials to the database are:

Username: imdb_user

Password: movie$Rkool

This database has 6 tables, namely:

akas, crew, episodes, people, ratings, and titles.

To connect to the database from a terminal in Scholar, execute the following in a terminal:

mysql -u imdb_user -h scholar-db.rcac.purdue.edu -p

You will be asked for the password. Type the provided password and press enter. Note that it will look like nothing is being typed as you type, this is OK, you are indeed typing the password.

To connect to the database from Rstudio, open a browser and navigate to rstudio.scholar.rcac.purdue.edu, and login using your Purdue Career Account credentials.

To establish a connection with the MySQL database within Rstudio, run the following:

install.packages("RMariaDB")
library(RMariaDB)

host <- "scholar-db.rcac.purdue.edu"
user <- "imdb_user"
password <- "movie$Rkool"
database <- "imdb"

db <- dbConnect(RMariaDB::MariaDB(), host=host, db=database, user=user, password=password)

After running the code above, you should be successfully connected to the database. From here, you can either use the package RMariaDB to query our database:

result <- dbGetQuery(db, "SELECT * FROM titles LIMIT 5;")

Or you can execute SQL directly in an Rmarkdown file. For example, copy and paste the following code chunks in an RMarkdown file:

This code chunk initiates a connection to the database.

```{r}
install.packages("RMariaDB")
library(RMariaDB)

host <- "scholar-db.rcac.purdue.edu"
user <- "imdb_user"
password <- "movie$Rkool"
database <- "imdb"

db <- dbConnect(RMariaDB::MariaDB(), host=host, db=database, user=user, password=password)
```

This code chunk demonstrates how to run SQL queries from within R.

```{r}
result <- dbGetQuery(db, "SELECT * FROM titles LIMIT 5;")
```

This code chunk demonstrates how to use the SQL connection to run SQL queries directly within a code chunk.

```{sql, connection=db}
SELECT * FROM titles LIMIT 5;
```

Questions

Question 1

Explore the 6 tables. State an interesting fact (of your choice) that you find about at least one of the tables.

Relevant topics: [sql](#sql-examples), [sql in R](#sql-in-r-examples)

Items to submit
  • A sentence describing at least 1 interesting fact about at least one of the tables.

Question 2

Find the title_id, rating, and number of votes for all movies that received at least 2 million votes.

Use the ratings table.

Relevant topics: [sql](#sql-examples), [sql in R](#sql-in-r-examples)

Items to submit
  • SQL query used to solve this problem.

  • Output from running the SQL query.

Question 3

Now use the information you found, about the movies that received at least 2 million votes, to identify the titles of these movies, using the titles table.

You will probably recognize the names of these movies.

Relevant topics: [sql](#sql-examples), [sql in R](#sql-in-r-examples)

Items to submit
  • SQL query used to solve this problem.

  • Output from running the SQL query.

Question 4

Find the names, birth years, and death years, for all actors and actresses who lived more than 115 years.

You can use this clause in your SQL query:

WHERE died - born > 115

Relevant topics: [sql](#sql-examples), [sql in R](#sql-in-r-examples)

Items to submit
  • SQL query used to solve this problem.

  • Output from running the SQL query.

Question 5

In the titles table, the genres column specifies the genre of each movie. Use the COUNT function to find how many movies of each genre occur in the database.

You can use the same strategy from the SUM of transactions examples in the election database. Just use COUNT instead of SUM.

Relevant topics: [sql](#sql-examples), [sql in R](#sql-in-r-examples)

Items to submit
  • SQL query used to solve this problem.

Question 6

In the titles table, the premiered column specifies the year that a movie was premiered. Use the COUNT function to find how many movies premiered in each year in the database.

Relevant topics: [sql](#sql-examples), [sql in R](#sql-in-r-examples)

Items to submit
  • SQL query used to solve this problem.

Question 7

One movie has a strange premiere year. Which movie is this?

Relevant topics: [sql](#sql-examples), [sql in R](#sql-in-r-examples)

Items to submit
  • SQL query used to solve this problem.

  • Output from running the SQL query.

Question 8

Make a dotchart that shows how many movies premiered in each year since the year 2000.

Relevant topics: [sql](#sql-examples), [sql in R](#sql-in-r-examples)

Items to submit
  • SQL query used to gather the data used in the dotchart.

  • A dotchart that shows how many movies premiered in each year since the year 2000, in png or jpg/jpeg format.

Question 9

The title 'The Awakening' has been used very often! How many times has this been used as a title?

Relevant topics: [sql](#sql-examples), [sql in R](#sql-in-r-examples)

Items to submit
  • SQL query used to solve this problem.

  • Output from running the SQL query.

Question 10

Investigate all of the occurrences of these titles called 'The Awakening'. Find an interesting fact about the entries with these titles.

Relevant topics: [sql](#sql-examples), [sql in R](#sql-in-r-examples)

Items to submit
  • SQL query used to solve this problem.

  • Output from running the SQL query.

  • 1-2 sentences describing the interesting fact you found about the entries with these titles.