Movies and TV
Source
The Internet Movie Database (IMDb) provides these data sets:
These data sets are refreshed on a daily basis.
Description of the Data
There are 7 data files provided by IMDb:
name.basics.tsv.gz
title.akas.tsv.gz
title.basics.tsv.gz
title.crew.tsv.gz
title.episode.tsv.gz
title.principals.tsv.gz
title.ratings.tsv.gz
The "Rotten Tomatoes movies and critic reviews dataset" is likely from here:
The Office dialogue likely is from here:
Transformations to the original data source
Kevin performed several transformation of the data.
He created comma-separated versions of the data from the IMDb tsv files:
akas.csv, crew.csv, episodes.csv, people.csv, ratings.csv, titles.csv
He also created a database file:
imdb.db
These two files are from the Rotten Tomatoes dataset:
rotten_tomatoes_movies.csv, rotten_tomatoes_reviews.csv
The dialogues from The Office are stored in:
the_office_dialogue.csv
Note that the source website mentions "55130 observations of 12 variables" but there seems to be 1 line in the csv file that stretches onto 3 lines (i.e., it has 2 extra lines). There is also the header line. Thus, the file the_office_dialogue.csv
has a total of 55133 lines.
We can download the IMDb data using:
wget datasets.imdbws.com/name.basics.tsv.gz
wget datasets.imdbws.com/title.akas.tsv.gz
wget datasets.imdbws.com/title.basics.tsv.gz
wget datasets.imdbws.com/title.crew.tsv.gz
wget datasets.imdbws.com/title.episode.tsv.gz
wget datasets.imdbws.com/title.principals.tsv.gz
wget datasets.imdbws.com/title.ratings.tsv.gz
and then we uncompress the data:
gunzip *.gz
and then remove the \N
values so that we have NULL values instead:
cat name.basics.tsv | sed 's.\\N..g' >name.tsv &
cat title.akas.tsv | sed 's.\\N..g' >akas.tsv &
cat title.basics.tsv | sed 's.\\N..g' >basics.tsv &
cat title.crew.tsv | sed 's.\\N..g' >crew.tsv &
cat title.episode.tsv | sed 's.\\N..g' >episode.tsv &
cat title.principals.tsv | sed 's.\\N..g' >principals.tsv &
cat title.ratings.tsv | sed 's.\\N..g' >ratings.tsv &
Then we create a new file in sqlite:
module load sqlite
sqlite3 imdb6Oct2024.db
We prepare to input the files:
.headers ON
.mode ascii
.separator "\t" "\n"
CREATE TABLE name(
"nconst" TEXT,
"primaryName" TEXT,
"birthYear" INTEGER,
"deathYear" INTEGER,
"primaryProfession" TEXT,
"knownForTitles" TEXT
);
CREATE TABLE akas(
"titleId" TEXT,
"ordering" INTEGER,
"title" TEXT,
"region" TEXT,
"language" TEXT,
"types" TEXT,
"attributes" TEXT,
"isOriginalTitle" INTEGER
);
CREATE TABLE basics(
"tconst" TEXT,
"titleType" TEXT,
"primaryTitle" TEXT,
"originalTitle" TEXT,
"isAdult" INTEGER,
"startYear" INTEGER,
"endYear" INTEGER,
"runtimeMinutes" INTEGER,
"genres" TEXT
);
CREATE TABLE crew(
"tconst" TEXT,
"directors" TEXT,
"writers" TEXT
);
CREATE TABLE episode(
"tconst" TEXT,
"parentTconst" TEXT,
"seasonNumber" INTEGER,
"episodeNumber" INTEGER
);
CREATE TABLE principals(
"tconst" TEXT,
"ordering" INTEGER,
"nconst" TEXT,
"category" TEXT,
"job" TEXT,
"characters" TEXT
);
CREATE TABLE ratings(
"tconst" TEXT,
"averageRating" NUMERIC,
"numVotes" INTEGER
);
Then we import the files:
.import --skip 1 name.tsv name
.import --skip 1 akas.tsv akas
.import --skip 1 basics.tsv basics
.import --skip 1 crew.tsv crew
.import --skip 1 episode.tsv episode
.import --skip 1 principals.tsv principals
.import --skip 1 ratings.tsv ratings
and, finally, we make an index for each of the tables:
CREATE INDEX ix_name_covering ON name(nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles);
CREATE INDEX ix_akas_covering ON akas(titleId,ordering,title,region,language,types,attributes,isOriginalTitle);
CREATE INDEX ix_basics_covering ON basics(tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres);
CREATE INDEX ix_crew_covering ON crew(tconst,directors,writers);
CREATE INDEX ix_episode_covering ON episode(tconst,parentTconst,seasonNumber,episodeNumber);
CREATE INDEX ix_principals_covering ON principals(tconst,ordering,nconst,category,job,characters);
CREATE INDEX ix_ratings_covering ON ratings(tconst,averageRating,numVotes);