TDM 20100: Project 13 — SQL
Motivation: We have used three SQL databases directly, from the SQL prompt. Now we demonstrate how to make SQL calls from R, so that we can (for example) make plots related to our SQL queries.
Context: When we make a SQL call in R, the data is returned as an R database.
Scope: This project will synthesize what you have learned about how to make database calls from SQL and how to use R to visualize data from data frames.
Dataset(s)
This project will use the following three databases:
-
/anvil/projects/tdm/data/lahman/lahman.db
(Lahman baseball database) -
/anvil/projects/tdm/data/movies_and_tv/imdb2024.db
(Internet Movie DataBase (IMDB)) -
/anvil/scratch/x-mdw/newflightdatabase.db
(the flight database that you built in Project 12)
Please change |
Questions
You can load data from a SQL query into R (using the seminar-r
kernel in Jupyter Lab) as follows, for example:
library(RSQLite)
conn <- dbConnect(RSQLite::SQLite(), "/anvil/projects/tdm/data/lahman/lahman.db")
myDF <- dbGetQuery(conn, "SELECT * FROM batting LIMIT 5;")
head(myDF)
Once you have the connection to the database loaded, you can make more queries, without needing to re-run the dbConnect
line. You can go directly to another dbGetQuery
line, like this:
myDF <- dbGetQuery(conn, "SELECT * FROM pitching LIMIT 5;")
head(myDF)
If your kernel dies at any point, or if you start a new session, you will obviously need to go back and re-load your RSQLite
library and also re-connect to the database, using the dbConnect
command.
You can even make complex queries, for instance:
myDF <- dbGetQuery(conn, "SELECT * FROM batting as b JOIN people as p
ON b.playerID = p.playerID
WHERE p.nameFirst = 'Rickey'
AND p.nameLast = 'Henderson';")
head(myDF)
and we can plot the data, for instance, like this:
myDF <- dbGetQuery(conn, "SELECT b.R as myruns, b.yearID as myyears
FROM batting as b JOIN people as p
ON b.playerID = p.playerID
WHERE p.nameFirst = 'Rickey'
AND p.nameLast = 'Henderson'
GROUP BY b.yearID;")
plot(myDF$myyears, myDF$myruns)
Question 1 (2 pts)
Using the seminar-r
kernel in Jupyter Lab, open a connection to the Lahman database using the dbConnect
process that is outlined above.
Revisit your work from Project 8, Question 4, using the Lahman baseball database, but this time, make a dotchart
, as follows:
Use the Batting table to find the top 5 players of all time, in terms of their total number of hits, in other words, according to SUM(H). Instead of printing the output, this time make a dotchart with 5 rows. Each row should show the playerID
of each player and the total number of hits in each of their careers.
Make a dotchart with 5 rows for the top 5 players of all time, in terms of their total number of hits, SUM(H)
. Each row should show the playerID
of each player and the total number of hits in each of their careers.
Question 2 (2 pts)
Revisit your work from Project 8, Question 5, using the Lahman baseball database, but this time, make a dotchart
, as follows:
Consider the Schools table, group together the schools in each state. Find the number of schools in each group, using SELECT COUNT(*) as mycounts, state
so that you see how many schools are in each state, and the state abbreviation too. Order your results according to the values of mycounts in descending order (which is denoted by DESC), in other words, the states with the most schools are printed first in your list.
In this way, by using LIMIT 5, you can make a dotchart that displays the 5 states with the most schools, and the number of schools in each state.
Make a dotchart that displays the 5 states with the most schools, and the number of schools in each state.
Question 3 (2 pts)
Revisit your work from Project 11, Question 2, using the IMDB Movies database, but this time, make a dotchart
, as follows:
Join the ratings and the basics table, to find the 13 titles that each have more than 2 million ratings. Make a dotchart for these 13 titles, showing the primaryTitle
and the number of ratings for each of these 13 titles.
Make a dotchart for these 13 titles, showing the primaryTitle
and the number of ratings for each of these 13 titles.
Question 4 (2 pts)
Revisit your work from Project 11, Question 3, using the IMDB Movies database, but this time, make a plot
, as follows:
-
Using the startYear values from the basics table, find the total number of entries in each startYear. Make a plot that shows the
startYear
on the x-axis and the number of entries from eachstartYear
on the y-axis. -
Now fix your plot from part (a), so that you only show the results in which
myDF$startYear > 0
.
-
Make a plot that shows the
startYear
on the x-axis and the number of entries from eachstartYear
on the y-axis. -
Now fix your plot from part (a), so that you only show the results in which
myDF$startYear > 0
.
Question 5 (2 pts)
Revisit your work from Project 12, Question 2, using the flights database that you built, but this time, make a dotchart
, as follows:
Join the flights
and the airports
table, matching the Origin
column to the iata
column. Find the total number of flights in the database for each Origin
airport that is located in Texas. Make a dotchart that shows, for each Origin
airport in Texas, the total number of flights and the 3-letter Origin
airport code.
There are 29 airports in Texas that should appear in your |
Make a dotchart that shows, for each Origin
airport in Texas, the total number of flights and the 3-letter Origin
airport code.
Submitting your Work
Now we know how to leverage our knowledge of SQL when working in R!
-
firstname-lastname-project13.ipynb
You must double check your You will not receive full credit if your |