TDM 40100: Project 3 — 2022

Motivation: The ability to use SQL to query data in a relational database is an extremely useful skill. What is even more useful is the ability to build a sqlite3 database, design a schema, insert data, create indexes, etc. This series of projects is focused around SQL, sqlite3, with the opportunity to use other skills you’ve built throughout the previous years.

Context: In TDM 20100 (formerly STAT 29000), you had the opportunity to learn some basics of SQL, and likely worked (at least partially) with sqlite3 — a powerful database engine. In this project (and following projects), we will branch into SQL and sqlite3-specific topics and techniques that you haven’t yet had exposure to in The Data Mine.

Scope: sqlite3, lmod, SQL

Learning Objectives
  • Create your own sqlite3 database file.

  • Analyze a large dataset and formulate CREATE TABLE statements designed to store the data.

  • Run one or more queries to test out the end result.

  • Demonstrate the ability to normalize a series of database tables.

  • Wrangle and insert data into database.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/goodreads/goodreads_book_authors.json

  • /anvil/projects/tdm/data/goodreads/goodreads_book_series.json

  • /anvil/projects/tdm/data/goodreads/goodreads_books.json

  • /anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json

Questions

In case you skipped the previous project, let’s all get on the same page. You run the following code in a Jupyter notebook to create a sqlite3 database called my.db in your $HOME directory.

%%bash

rm $HOME/my.db
sqlite3 $HOME/my.db "CREATE TABLE reviews (
	user_id TEXT,
	book_id INTEGER,
	review_id TEXT,
	rating INTEGER,
	review_text TEXT,
    date_added TEXT,
    date_updated TEXT,
    read_at TEXT,
    started_at TEXT,
    n_votes INTEGER,
    n_comments INTEGER
);"

sqlite3 $HOME/my.db "CREATE TABLE books (
	isbn TEXT,
    text_reviews_count INTEGER,
    country_code TEXT,
    language_code TEXT,
    asin TEXT,
    is_ebook INTEGER,
    average_rating REAL,
    kindle_asin TEXT,
    description TEXT,
    format TEXT,
    link TEXT,
    publisher TEXT,
    num_pages INTEGER,
    publication_day INTEGER,
    isbn13 TEXT,
    publication_month INTEGER,
    edition_information TEXT,
    publication_year INTEGER,
    url TEXT,
    image_url TEXT,
    book_id TEXT,
    ratings_count INTEGER,
    work_id TEXT,
    title TEXT,
    title_without_series TEXT
);"

sqlite3 $HOME/my.db "CREATE TABLE authors (
	average_rating REAL,
    author_id INTEGER,
    text_reviews_count INTEGER,
    name TEXT,
    ratings_count INTEGER
);"

sqlite3 $HOME/my.db "CREATE TABLE series (
	numbered INTEGER,
    note TEXT,
    description TEXT,
    title TEXT,
    series_works_count INTEGER,
    series_id INTEGER,
    primary_work_count INTEGER
);"
%sql sqlite:////home/x-myalias/my.db
%%sql

SELECT * FROM reviews limit 5;
%%sql

SELECT * FROM books limit 5;
%%sql

SELECT * FROM authors limit 5;
%%sql

SELECT * FROM series limit 5;
%%bash

rm -rf $HOME/goodreads_samples
mkdir $HOME/goodreads_samples
cp /anvil/projects/tdm/data/goodreads/goodreads_book_authors.json $HOME/goodreads_samples/
cp /anvil/projects/tdm/data/goodreads/goodreads_book_series.json $HOME/goodreads_samples/
shuf -n 27450 /anvil/projects/tdm/data/goodreads/goodreads_books.json > $HOME/goodreads_samples/goodreads_books.json
shuf -n 98375 /anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json > $HOME/goodreads_samples/goodreads_reviews_dedup.json

Question 1

Update your original CREATE TABLE statement for the books table to include a field that will be used to store the actual book cover images from the image_url field in the books table. Call this new field book_cover. Which one of the sqlite types did you use?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

Check out a line of the goodreads_books.json data:

%%bash

head -n 1 $HOME/goodreads_samples/goodreads_books.json

Don’t have a goddreads_samples directory? Make sure you run the following.

%%bash

rm -rf $HOME/goodreads_samples
mkdir $HOME/goodreads_samples
cp /anvil/projects/tdm/data/goodreads/goodreads_book_authors.json $HOME/goodreads_samples/
cp /anvil/projects/tdm/data/goodreads/goodreads_book_series.json $HOME/goodreads_samples/
shuf -n 27450 /anvil/projects/tdm/data/goodreads/goodreads_books.json > $HOME/goodreads_samples/goodreads_books.json
shuf -n 98375 /anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json > $HOME/goodreads_samples/goodreads_reviews_dedup.json

Recall that in the previous project, we just ignored the following fields from the books table: series, similar_books, popular_shelves, and authors. We did this because those fields are more complicated to deal with.

Read this article on database normalization from Microsoft. We are going to do our best to normalize our tables with these previously ignored fields taken into consideration.

Let’s start by setting some practical naming conventions. Note that these are not critical by any stretch, but can help remove some guess work when navigating a database with many tables and ids.

  1. Every table’s primary key should be named id, unless it is a composite key. For example, instead of book_id in the books table, it would make sense to call that column id — "book" is implied from the table name.

  2. Every table’s foreign key should reference the id column of the foreign table and be named "foreign_table_name_id". For example, if we had a foreign key in the books table that referenced an author in the authors table, we should name that column author_id.

  3. Keep table names plural, when possible — for example, not the book table, but the books table.

  4. Link tables or junction tables should be named by the two tables which you are trying to represent the many-to-many relationship for. (We will go over this one specifically when needed, no worries)

Make the appropriate changes to the following CREATE TABLE statements that reflect these conventions as much as possible (for now).

%%bash

rm $HOME/my.db
sqlite3 $HOME/my.db "CREATE TABLE reviews (
	user_id TEXT,
	book_id INTEGER,
	review_id TEXT,
	rating INTEGER,
	review_text TEXT,
    date_added TEXT,
    date_updated TEXT,
    read_at TEXT,
    started_at TEXT,
    n_votes INTEGER,
    n_comments INTEGER
);"

sqlite3 $HOME/my.db "CREATE TABLE books (
	isbn TEXT,
    text_reviews_count INTEGER,
    country_code TEXT,
    language_code TEXT,
    asin TEXT,
    is_ebook INTEGER,
    average_rating REAL,
    kindle_asin TEXT,
    description TEXT,
    format TEXT,
    link TEXT,
    publisher TEXT,
    num_pages INTEGER,
    publication_day INTEGER,
    isbn13 TEXT,
    publication_month INTEGER,
    edition_information TEXT,
    publication_year INTEGER,
    url TEXT,
    image_url TEXT,
    book_id TEXT,
    ratings_count INTEGER,
    work_id TEXT,
    title TEXT,
    title_without_series TEXT
);"

sqlite3 $HOME/my.db "CREATE TABLE authors (
	average_rating REAL,
    author_id INTEGER,
    text_reviews_count INTEGER,
    name TEXT,
    ratings_count INTEGER
);"

sqlite3 $HOME/my.db "CREATE TABLE series (
	numbered INTEGER,
    note TEXT,
    description TEXT,
    title TEXT,
    series_works_count INTEGER,
    series_id INTEGER,
    primary_work_count INTEGER
);"
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

A book can have many authors, and an author can have many books. This is an example of a many-to-many relationship.

We already have a books table and an authors table. Create a junction or link table that effectively normalizes the authors field in the books table. Call this new table books_authors (see point 4 above — this is the naming convention we want).

Make sure to include your CREATE TABLE statement in your notebook.

There should be 4 columns in the authors_books table. A primary key field, two foreign key fields, and a regular data field that is a part of the original authors field data in the books table.

Make sure to properly apply the primary key and foreign key keywords.

Write a SQL query to find every book by author with id 12345. It doesn’t have to be perfect syntax, as long as the logic is correct. In addition, it won’t be runnable, that is okay.

You will need to use joins and our junction table to perform this query.

Copy, paste, and update your bash cell with the CREATE TABLE statements to implement these changes. In a markdown cell, write out your SQL query.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Assume that a series can have many books and a book can be a part of many series. Perform the same operations as the previous problem (except for the query).

What columns does the books_series table have?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

The remaining two fields that need to be dealt with are similar_books and popular_shelves. Choose at least one of the two and do your best to come up with a good solution for the way we store the data. We will give hints for both below.

For this question, please copy, paste, and update the bash cell with the CREATE TABLE statements. In addition, please include a markdown cell with a detailed explanation of why you chose your solution, and provide at least 1 example of a query that should work for your solution (like before, we are looking for logic, not syntax).

similar_books:

It is okay to have a link table that links rows from the same table!

There are always many ways to do the same thing. In our examples, we used link tables with their own id (primary key) in addition to multiple foreign keys. This provides the flexibility of later being able to add more fields to the link table, where it may even become useful all by itself.

There is, however, a technically better solution for a table that is simply a link table and nothing more. This would be where you have 2 columns, both foreign keys, and you create a composite primary key, or a primary key that is represented by the unique combination of both foreign keys. This ensures that links are only ever represented once. Feel free to experiment with this if you want!

popular_shelves:

You can create as many tables as you need.

After a bit of thinking, this one may not be too different than what you’ve already accomplished.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.