Movies and TV

Source

The Internet Movie Database (IMDb) provides these data sets:

developer.imdb.com/non-commercial-datasets/

datasets.imdbws.com/

These data sets are refreshed on a daily basis.

Description of the Data

There are 7 data files provided by IMDb:

name.basics.tsv.gz
title.akas.tsv.gz
title.basics.tsv.gz
title.crew.tsv.gz
title.episode.tsv.gz
title.principals.tsv.gz
title.ratings.tsv.gz

The "Rotten Tomatoes movies and critic reviews dataset" is likely from here:

www.kaggle.com/datasets/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset

The Office dialogue likely is from here:

bradlindblad.github.io/schrute/reference/theoffice.html

github.com/bradlindblad/schrute

Transformations to the original data source

Kevin performed several transformation of the data.

He created comma-separated versions of the data from the IMDb tsv files:

akas.csv, crew.csv, episodes.csv, people.csv, ratings.csv, titles.csv

He also created a database file:

imdb.db

These two files are from the Rotten Tomatoes dataset:

rotten_tomatoes_movies.csv, rotten_tomatoes_reviews.csv

The dialogues from The Office are stored in:

the_office_dialogue.csv

Note that the source website mentions "55130 observations of 12 variables" but there seems to be 1 line in the csv file that stretches onto 3 lines (i.e., it has 2 extra lines). There is also the header line. Thus, the file the_office_dialogue.csv has a total of 55133 lines.

We can download the IMDb data using:

wget datasets.imdbws.com/name.basics.tsv.gz

wget datasets.imdbws.com/title.akas.tsv.gz

wget datasets.imdbws.com/title.basics.tsv.gz

wget datasets.imdbws.com/title.crew.tsv.gz

wget datasets.imdbws.com/title.episode.tsv.gz

wget datasets.imdbws.com/title.principals.tsv.gz

wget datasets.imdbws.com/title.ratings.tsv.gz

and then we uncompress the data:

gunzip *.gz

and then remove the \N values so that we have NULL values instead:

cat name.basics.tsv | sed 's.\\N..g' >name.tsv &

cat title.akas.tsv | sed 's.\\N..g' >akas.tsv &

cat title.basics.tsv | sed 's.\\N..g' >basics.tsv &

cat title.crew.tsv | sed 's.\\N..g' >crew.tsv &

cat title.episode.tsv | sed 's.\\N..g' >episode.tsv &

cat title.principals.tsv | sed 's.\\N..g' >principals.tsv &

cat title.ratings.tsv | sed 's.\\N..g' >ratings.tsv &

Then we create a new file in sqlite:

module load sqlite

sqlite3 imdb6Oct2024.db

We prepare to input the files:

.headers ON

.mode ascii

.separator "\t" "\n"

CREATE TABLE name(
  "nconst" TEXT,
  "primaryName" TEXT,
  "birthYear" INTEGER,
  "deathYear" INTEGER,
  "primaryProfession" TEXT,
  "knownForTitles" TEXT
);

CREATE TABLE akas(
  "titleId" TEXT,
  "ordering" INTEGER,
  "title" TEXT,
  "region" TEXT,
  "language" TEXT,
  "types" TEXT,
  "attributes" TEXT,
  "isOriginalTitle" INTEGER
);

CREATE TABLE basics(
  "tconst" TEXT,
  "titleType" TEXT,
  "primaryTitle" TEXT,
  "originalTitle" TEXT,
  "isAdult" INTEGER,
  "startYear" INTEGER,
  "endYear" INTEGER,
  "runtimeMinutes" INTEGER,
  "genres" TEXT
);

CREATE TABLE crew(
  "tconst" TEXT,
  "directors" TEXT,
  "writers" TEXT
);

CREATE TABLE episode(
  "tconst" TEXT,
  "parentTconst" TEXT,
  "seasonNumber" INTEGER,
  "episodeNumber" INTEGER
);

CREATE TABLE principals(
  "tconst" TEXT,
  "ordering" INTEGER,
  "nconst" TEXT,
  "category" TEXT,
  "job" TEXT,
  "characters" TEXT
);

CREATE TABLE ratings(
  "tconst" TEXT,
  "averageRating" NUMERIC,
  "numVotes" INTEGER
);

Then we import the files:

.import --skip 1 name.tsv name

.import --skip 1 akas.tsv akas

.import --skip 1 basics.tsv basics

.import --skip 1 crew.tsv crew

.import --skip 1 episode.tsv episode

.import --skip 1 principals.tsv principals

.import --skip 1 ratings.tsv ratings

and, finally, we make an index for each of the tables:

CREATE INDEX ix_name_covering ON name(nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles);

CREATE INDEX ix_akas_covering ON akas(titleId,ordering,title,region,language,types,attributes,isOriginalTitle);

CREATE INDEX ix_basics_covering ON basics(tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres);

CREATE INDEX ix_crew_covering ON crew(tconst,directors,writers);

CREATE INDEX ix_episode_covering ON episode(tconst,parentTconst,seasonNumber,episodeNumber);

CREATE INDEX ix_principals_covering ON principals(tconst,ordering,nconst,category,job,characters);

CREATE INDEX ix_ratings_covering ON ratings(tconst,averageRating,numVotes);