TDM 20100: Project 13 — SQL

Motivation: We have used three SQL databases directly, from the SQL prompt. Now we demonstrate how to make SQL calls from R, so that we can (for example) make plots related to our SQL queries.

Context: When we make a SQL call in R, the data is returned as an R database.

Scope: This project will synthesize what you have learned about how to make database calls from SQL and how to use R to visualize data from data frames.

Learning Objectives:

We will make SQL calls from R.

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

This project will use the following three databases:

/anvil/projects/tdm/data/lahman/lahman.db (Lahman baseball database)
/anvil/projects/tdm/data/movies_and_tv/imdb2024.db (Internet Movie DataBase (IMDB))
/anvil/scratch/x-mdw/newflightdatabase.db (the flight database that you built in Project 12)

Please change mdw to your username in the location of the flight database.

Questions

You can load data from a SQL query into R (using the seminar-r kernel in Jupyter Lab) as follows, for example:

library(RSQLite)
conn <- dbConnect(RSQLite::SQLite(), "/anvil/projects/tdm/data/lahman/lahman.db")
myDF <- dbGetQuery(conn, "SELECT * FROM batting LIMIT 5;")
head(myDF)

Once you have the connection to the database loaded, you can make more queries, without needing to re-run the dbConnect line. You can go directly to another dbGetQuery line, like this:

myDF <- dbGetQuery(conn, "SELECT * FROM pitching LIMIT 5;")
head(myDF)

If your kernel dies at any point, or if you start a new session, you will obviously need to go back and re-load your RSQLite library and also re-connect to the database, using the dbConnect command.

You can even make complex queries, for instance:

myDF <- dbGetQuery(conn, "SELECT * FROM batting as b JOIN people as p
                   ON b.playerID = p.playerID
                   WHERE p.nameFirst = 'Rickey'
                   AND p.nameLast = 'Henderson';")
head(myDF)

and we can plot the data, for instance, like this:

myDF <- dbGetQuery(conn, "SELECT b.R as myruns, b.yearID as myyears
                   FROM batting as b JOIN people as p
                   ON b.playerID = p.playerID
                   WHERE p.nameFirst = 'Rickey'
                   AND p.nameLast = 'Henderson'
		   GROUP BY b.yearID;")
plot(myDF$myyears, myDF$myruns)

Question 1 (2 pts)

Using the seminar-r kernel in Jupyter Lab, open a connection to the Lahman database using the dbConnect process that is outlined above.

Revisit your work from Project 8, Question 4, using the Lahman baseball database, but this time, make a dotchart, as follows:

Use the Batting table to find the top 5 players of all time, in terms of their total number of hits, in other words, according to SUM(H). Instead of printing the output, this time make a dotchart with 5 rows. Each row should show the playerID of each player and the total number of hits in each of their careers.

Deliverables

Make a dotchart with 5 rows for the top 5 players of all time, in terms of their total number of hits, SUM(H). Each row should show the playerID of each player and the total number of hits in each of their careers.

Question 2 (2 pts)

Revisit your work from Project 8, Question 5, using the Lahman baseball database, but this time, make a dotchart, as follows:

Consider the Schools table, group together the schools in each state. Find the number of schools in each group, using SELECT COUNT(*) as mycounts, state so that you see how many schools are in each state, and the state abbreviation too. Order your results according to the values of mycounts in descending order (which is denoted by DESC), in other words, the states with the most schools are printed first in your list.

In this way, by using LIMIT 5, you can make a dotchart that displays the 5 states with the most schools, and the number of schools in each state.

Deliverables

Make a dotchart that displays the 5 states with the most schools, and the number of schools in each state.

Question 3 (2 pts)

Revisit your work from Project 11, Question 2, using the IMDB Movies database, but this time, make a dotchart, as follows:

Join the ratings and the basics table, to find the 13 titles that each have more than 2 million ratings. Make a dotchart for these 13 titles, showing the primaryTitle and the number of ratings for each of these 13 titles.

Deliverables

Make a dotchart for these 13 titles, showing the primaryTitle and the number of ratings for each of these 13 titles.

Question 4 (2 pts)

Revisit your work from Project 11, Question 3, using the IMDB Movies database, but this time, make a plot, as follows:

Using the startYear values from the basics table, find the total number of entries in each startYear. Make a plot that shows the startYear on the x-axis and the number of entries from each startYear on the y-axis.
Now fix your plot from part (a), so that you only show the results in which myDF$startYear > 0.

Deliverables

Make a plot that shows the startYear on the x-axis and the number of entries from each startYear on the y-axis.
Now fix your plot from part (a), so that you only show the results in which myDF$startYear > 0.

Question 5 (2 pts)

Revisit your work from Project 12, Question 2, using the flights database that you built, but this time, make a dotchart, as follows:

Join the flights and the airports table, matching the Origin column to the iata column. Find the total number of flights in the database for each Origin airport that is located in Texas. Make a dotchart that shows, for each Origin airport in Texas, the total number of flights and the 3-letter Origin airport code.

There are 29 airports in Texas that should appear in your dotchart. It is OK to put the dotchart in any order that you like, i.e., in numerical order, alphabetical order, or any other order is OK!

Deliverables

Make a dotchart that shows, for each Origin airport in Texas, the total number of flights and the 3-letter Origin airport code.

Submitting your Work

Now we know how to leverage our knowledge of SQL when working in R!

Items to submit

firstname-lastname-project13.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.