STAT 39000: Project 12 — Fall 2020

Motivation: Databases are comprised of many tables. It is imperative that we learn how to combine data from multiple tables using queries. To do so we perform joins! In this project we will explore learn about and practice using joins on a database containing bike trip information from the Bay Area Bike Share.

Context: We’ve introduced a variety of SQL commands that let you filter and extract information from a database in an systematic way. In this project we will introduce joins, a powerful method to combine data from different tables.

Scope: SQL, sqlite, joins

Learning objectives
  • Briefly explain the differences between left and inner join and demonstrate the ability to use the join statements to solve a data-driven problem.

  • Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING.

  • Showcase the ability to filter, alias, and write subqueries.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/bay_area_bike_share/bay_area_bike_share.db

A public sample of the data can be found here.

For this project all solutions should be done using SQL code chunks. To connect to the database, copy and paste the following before your solutions in your .Rmd:

````markdown

library(RSQLite)
bikeshare <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/bay_area_bike_share/bay_area_bike_share.db")

``

Each solution should then be placed in a code chunk like this:

````markdown

SELECT * FROM station LIMIT 5;

``

If you want to use a SQLite-specific function like .tables (or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command:

sqlite3 /class/datamine/data/bay_area_bike_share/bay_area_bike_share.db

From there, the SQLite-specific commands will function properly. They will not function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option eval=F like this:

````markdown

SELECT * FROM station LIMIT 5;

``

This will allow the code to be displayed without throwing an error.

There are a variety of ways to join data using SQL. With that being said, if you are able to understand and use a LEFT JOIN and INNER JOIN, you can perform all of the other types of joins (RIGHT JOIN, FULL OUTER JOIN).

Questions

Question 1

Aliases can be created for tables, fields, and even results of aggregate functions (like MIN, MAX, COUNT, AVG, etc.). In addition, you can combine fields using the sqlite concatenate operator || see here. Write a query that returns the first 5 records of information from the station table formatted in the following way:

(id) name @ (lat, long)

For example:

(84) Ryland Park @ (37.342725, -121.895617)

Here is a video about how to concatenate strings in SQLite.

Items to submit
  • SQL query used to solve this problem.

  • The first 5 records of information from the station table.

Question 2

There is a variety of interesting weather information in the weather table. Write a query that finds the average mean_temperature_f by zip_code. Which is on average the warmest zip_code?

Use aliases to format the result in the following way:

Zip Code|Avg Temperature
94041|61.3808219178082

Note that this is the output if you use sqlite in the terminal. While the output in your knitted pdf file may look different, you should name the columns accordingly.

Here is a video about GROUP BY, ORDER BY, DISTINCT, and COUNT

Items to submit
  • SQL query used to solve this problem.

  • The results of the query copy and pasted.

Question 3

From (2) we can see that there are only 5 zip_code`s with weather information. How many unique `zip_code`s do we have in the `trip table? Write a query that finds the number of unique zip_code`s in the `trip table. Write another query that lists the zip_code and count of the number of times the zip_code appears. If we had originally assumed that the zip_code was related to the location of the trip itself, we were wrong. Can you think of a likely explanation for the unexpected zip_code values in the trip table?

There could be missing values in zip_code. We want to avoid them in SQL queries, for now. You can learn more about the missing values (or NULL) in SQL here.

Items to submit
  • SQL queries used to solve this problem.

  • 1-2 sentences explainging what a possible explanation for the `zip_code`s could be.

Question 4

In (2) we wrote a query that finds the average mean_temperature_f by zip_code. What if we want to tack on our results in (2) to information from each row in the station table based on the zip_code? To do this, use an INNER JOIN. INNER JOIN combines tables based on specified fields, and returns only rows where there is a match in both the "left" and "right" tables.

Use the query from (2) as a sub query within your solution.

Here is a video about JOIN and LEFT JOIN.

Items to submit
  • SQL query used to solve this problem.

Question 5

In (3) we alluded to the fact that many zip_code in the trip table aren’t very consistent. Users can enter a zip code when using the app. This means that zip_code can be from anywhere in the world! With that being said, if the zip_code is one of the 5 zip_code for which we have weather data (from question 2), we can add that weather information to matching rows of the trip table. In (4) we used an INNER JOIN to append some weather information to each row in the station table. For this question, write a query that performs an INNER JOIN and appends weather data from the weather table to the trip data from the trip table. Limit your output to 5 lines.

Notice that the weather data has about 1 row of weather information for each date and each zip code. This means you may have to join your data based on multiple constraints instead of just 1 like in (4). In the trip table, you can use start_date for for the date information.

You will want to wrap your dates and datetimes in sqlite’s date function prior to comparison.

Items to submit
  • SQL query used to solve this problem.

  • First 5 lines of output.

Question 6

How many rows are in the result from (5) (when not limiting to 5 lines)? How many rows are in the trip table? As you can see, a large proportion of the data from the trip table did not match the data from the weather table, and therefore was removed from the result. What if we want to keep all of the data from the trip table and add on data from the weather table if we have a match? Write a query to accomplish this. How many rows are in the result?

Items to submit
  • SQL query used to find how many rows from the result in (5).

  • The number of rows in the result of (5).

  • SQL query to find how many rows are in the trip table.

  • The number of rows in the trip table.

  • SQL query to keep all of the data from the trip table and add on matching data from the weather table when available.

  • The number of rows in the result.