Think Summer: Notes on how to import SQL results into R — 2022

Loading the R library for SQL, and loading the database

We need to run this at the start. If something goes wrong with our database queries, we can always come back and run these two lines again. Ideally, we should only need to run these once per session.

%%R
library(RSQLite)
conn <- dbConnect(RSQLite::SQLite(), "/anvil/projects/tdm/data/movies_and_tv/imdb.db")

Importing data from SQL to R

For example, we can import the entire titles database from SQL into R:

%%R
myDF <- dbGetQuery(conn, "SELECT * FROM titles;")

These are the first few lines of the result:

%%R
head(myDF)

We can see that we have more than 8 million rows, and 9 columns of data.

%%R
dim(myDF)

A dataframe in R, by the way, is a lot like an Excel spreadsheet or a SQL table, namely, it has columns of data, with one type of data in each column. The columns are usually long, in other words, there are usually many rows in the dataframe.

Getting familiar with the premiered years of the titles

Here are the first few entries in the premiered column:

%%R
head(myDF$premiered)

and here are the last few entries in the premiered column:

%%R
tail(myDF$premiered)

We can tabulate how many titles premiered in each year, using the table function in R. In the result, we have a row of years followed by a row of data (which is the number of titles per year). Then we have another row of years followed by a row of data, etc., etc. R always displays data from a table in this way, namely, by alternating a row of lables and a row of data. You can think about how things would look different (and easier) if your screen was really, really wide, and there were only two rows displayed, namely, the lables (which are the years themselves) and the data (which are the counts of the number of titles per year.

%%R
table(myDF$premiered)

It is easy to wrap that table into a plot, to display how the number of titles has greatly increased from year to year.

%%R
plot(table(myDF$premiered))

How many genres are in the titles table?

Here are the first few genres from the genres column:

%%R
head(myDF$genres)

We use the head so that we can keep the output relatively small and manageable. Now we can remove duplicates:

%%R
unique(head(myDF$genres))

and see how many unique values there are, in the head:

%%R
length(unique(head(myDF$genres)))

Now that this works well, we can remove the head restriction, and see that there are 2283 unique genres in the table altogether. Remember that each genre is actually a tuple of genres, for instance,

%%R
length(unique(myDF$genres))

Now we can focus on separating the genres into their individual genres. Remember that they are combined, using commas, in the format that we originally have. Here are the first few genres:

%%R
head(myDF$genres)

Now we split them according to the commas in each:

%%R
strsplit(head(myDF$genres), ",")

This will be new for many/most of you, but we can unlist them in R, so that they are not listed separately anymore, but instead, they are in one big vector.

%%R
unlist(strsplit(head(myDF$genres), ","))

and now we can use unique to see a list of the genres, removing any duplications:

%%R
unique(unlist(strsplit(head(myDF$genres), ",")))

Since this works on the head, we can remove the head now, and see the 29 such genres. Notice that the 21st such genre is missing, i.e., it is empty, so we do not know the genres for some of the movies.

%%R
unique(unlist(strsplit(myDF$genres, ",")))