TDM 40100: Project 4 — 2022

Motivation: The ability to use SQL to query data in a relational database is an extremely useful skill. What is even more useful is the ability to build a sqlite3 database yourself: design a schema, insert data, create indexes, etc. This series of projects focuses on SQL and sqlite3, with the opportunity to use other skills you’ve built throughout the previous years.

Context: In TDM 20100 (formerly STAT 29000), you had the opportunity to learn some basics of SQL, and likely worked (at least partially) with sqlite3 — a powerful database engine. In this project (and following projects), we will branch into SQL and sqlite3-specific topics and techniques that you haven’t yet had exposure to in The Data Mine.

Scope: sqlite3, lmod, SQL

Learning Objectives
  • Create your own sqlite3 database file.

  • Analyze a large dataset and formulate CREATE TABLE statements designed to store the data.

  • Run one or more queries to test out the end result.

  • Demonstrate the ability to normalize a series of database tables.

  • Wrangle and insert data into the database.

Make sure to read about and use the template found here, and review the important information about project submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/goodreads/goodreads_book_authors.json

  • /anvil/projects/tdm/data/goodreads/goodreads_book_series.json

  • /anvil/projects/tdm/data/goodreads/goodreads_books.json

  • /anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json

Questions

This project is going to be a bit more open-ended. The goal is to take our dataset sample and write Python code to insert it into our sqlite3 database. There are a variety of ways this could be accomplished, and we will accept anything that works, subject to a few constraints.

In the next project, we will run some experiments that will time insertion, project the time it would take to insert all of the data, adjust database settings, and, ultimately, create a final product that we can feel good about.

Question 1

As mentioned earlier — the goal of this project is to insert the sample data into our database. Start by generating the sample data.

%%bash

# start fresh
rm -rf $HOME/goodreads_samples
mkdir $HOME/goodreads_samples

# the author and series datasets are copied in full
cp /anvil/projects/tdm/data/goodreads/goodreads_book_authors.json $HOME/goodreads_samples/
cp /anvil/projects/tdm/data/goodreads/goodreads_book_series.json $HOME/goodreads_samples/

# randomly sample lines from the two larger datasets
shuf -n 27450 /anvil/projects/tdm/data/goodreads/goodreads_books.json > $HOME/goodreads_samples/goodreads_books.json
shuf -n 98375 /anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json > $HOME/goodreads_samples/goodreads_reviews_dedup.json

In addition, go ahead and copy our empty database, which is ready for you to insert data into.

%%bash

# -f so this succeeds even if my.db doesn't exist yet
rm -f $HOME/my.db
cp /anvil/projects/tdm/data/goodreads/my.db $HOME

You can run this as many times as you need in order to get a fresh start.
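If you want to verify what you are starting with, a quick sketch like the following (it simply reads the schema out of the sqlite_master table) will print the CREATE TABLE statement for each table in your fresh copy:

import sqlite3
import os

# sanity check: list every table in the fresh copy of my.db along with its schema
conn = sqlite3.connect(os.path.expanduser("~/my.db"))
cursor = conn.cursor()
cursor.execute("SELECT name, sql FROM sqlite_master WHERE type = 'table';")
for name, sql in cursor.fetchall():
    print(f"-- {name}\n{sql}\n")
cursor.close()
conn.close()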

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

Write Python code that inserts the data into your database, subject to the following constraints.

  1. You should be able to fully recover the book_cover image from the database. This means you’ll need to handle scraping the image_url and converting the image to bytes before inserting it into the database (a minimal scraping sketch follows this list).

    Want some help writing the scraping code? Check out this 30100 question for more guidance.

  2. Your functions and code should ultimately operate on a single row of the datasets. For instance:

    import json
    
    # each line of the file is a single JSON object (one book)
    with open("/anvil/projects/tdm/data/goodreads/goodreads_books.json") as f:
        for line in f:
            print(line)
            parsed = json.loads(line)
            print(f"{parsed['isbn']=}")
            print(f"{parsed['num_pages']=}")
            break

    Here, you can see that we can take a single row and do something with it. Why do we want the code to work this way? It makes it easy to break our dataset into chunks and perform operations in parallel, if we so choose (and we will, but not in this project). A sketch of one possible row-level insert function also follows this list.
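For the scraping piece, all you fundamentally need is to fetch the raw bytes behind each image_url. Here is a minimal sketch using the requests library; the function name scrape_image_from_url matches the one used in the Question 3 example, but the details (timeout, error handling) are up to you:

import requests

def scrape_image_from_url(url: str) -> bytes:
    """Fetch an image and return its raw bytes, ready to store as a BLOB."""
    # fail loudly on 404s and other HTTP errors rather than inserting junk
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.content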
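To make the row-at-a-time constraint concrete, here is a minimal sketch of one possible shape for your code. The function name insert_author and the column names in the INSERT statement are hypothetical; match them to the actual schema in my.db.

import json
import sqlite3
import os

def insert_author(cursor, row: dict) -> None:
    """Insert a single parsed JSON row into a (hypothetical) authors table."""
    # column names here are illustrative -- check the real schema in my.db
    query = "INSERT INTO authors (author_id, name) VALUES (?, ?);"
    cursor.execute(query, (row["author_id"], row["name"]))

conn = sqlite3.connect(os.path.expanduser("~/my.db"))
cursor = conn.cursor()

# feed the function one row at a time -- this is what makes it easy to
# chunk the work later
with open(os.path.expanduser("~/goodreads_samples/goodreads_book_authors.json")) as f:
    for line in f:
        insert_author(cursor, json.loads(line))

conn.commit()
cursor.close()
conn.close()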

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Demonstrate that your database works by doing the following.

  1. Fully recover a book_cover and display it in your notebook.

    %%bash
    
    # -f so this succeeds even if test.db doesn't exist yet
    rm -f $HOME/test.db
    sqlite3 $HOME/test.db "CREATE TABLE test (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        my_blob BLOB
    );"

    Then, in a Python cell (scrape_image_from_url is the function you wrote for Question 2):

    import os
    import uuid
    import sqlite3
    
    url = 'https://images.gr-assets.com/books/1310220028m/5333265.jpg'
    my_bytes = scrape_image_from_url(url)
    
    # insert the image bytes as a BLOB
    conn = sqlite3.connect(os.path.expanduser('~/test.db'))
    cursor = conn.cursor()
    query = "INSERT INTO test (my_blob) VALUES (?);"
    cursor.execute(query, (my_bytes,))
    conn.commit()
    cursor.close()
    conn.close()
    
    # retrieve the bytes and write them back out to a .jpg
    conn = sqlite3.connect(os.path.expanduser('~/test.db'))
    cursor = conn.cursor()
    query = "SELECT * FROM test WHERE id = ?;"
    cursor.execute(query, (1,))
    record = cursor.fetchall()
    img = record[0][1]
    tmp_filename = str(uuid.uuid4())
    with open(f"{tmp_filename}.jpg", 'wb') as file:
        file.write(img)
    cursor.close()
    conn.close()
    
    from IPython import display
    display.Image(f"{tmp_filename}.jpg")
  2. Run a simple query to SELECT the first 5 rows of each table.

    First, load the connection in its own cell, replacing "my-username" with your Anvil username (for example, mine is x-kamstut):

    %sql sqlite:////home/my-username/my.db

    Then, in a separate cell for each table, replacing "tablename" with the table's name:

    %%sql
    
    SELECT * FROM tablename LIMIT 5;
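If you would rather do this from Python, a sketch like the following works too; it discovers the table names from sqlite_master (so you don't have to hard code them) and previews 5 rows from each:

import sqlite3
import os

conn = sqlite3.connect(os.path.expanduser("~/my.db"))
cursor = conn.cursor()

# discover the table names from the schema
cursor.execute("SELECT name FROM sqlite_master WHERE type = 'table';")
tables = [row[0] for row in cursor.fetchall()]

for table in tables:
    print(f"--- {table} ---")
    # table names can't be bound as parameters, so interpolate them directly;
    # this is safe here because the names come from sqlite_master itself
    cursor.execute(f"SELECT * FROM {table} LIMIT 5;")
    for row in cursor.fetchall():
        print(row)

cursor.close()
conn.close()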

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended that you download your submission after submitting, to verify that what you think you submitted is what was actually submitted.

In addition, please review our submission guidelines before submitting your project.