TDM 20100: Project 12 — 2023

Motivation: In the previous projects, you’ve gained experience writing all types of queries, touching on the majority of the main concepts. One critical concept that we haven’t yet done is creating your own database. While typically database administrators and engineers will typically be in charge of large production databases, it is likely that you may need to prop up a small development database for your own use at some point in time (and many of you have had to do so this year!). In this project, we will walk through all of the steps to prop up a simple sqlite database for one of our datasets.

Context: We will (mostly) be using the sqlite3 command line tool to interact with the database.

Scope: sql, sqlite, unix

Learning Objectives
  • Create a sqlite database schema.

  • Populate the database with data using INSERT statements.

  • Populate the database with data using the command line interface (CLI) for sqlite3.

  • Run queries on a database.

  • Create an index to speed up queries.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The project will use the following datasets:

  • /anvil/projects/tdm/data/restaurant/orders.csv

  • /anvil/projects/tdm/data/lahman/lahman.db

To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook to establish a connection with the database. For example

%sql sqlite:////anvil/projects/tdm/data/lahman/lahman.db

For every following cell where you want to run a SQL query, prepend %%sql to the top of the cell — just like we do for R or bash cells.

To prepare for this project, create a new Jupyter Notebook called firstname-lastname-project12.ipynb. You will put the text of your solutions in this notebook. Next, in Jupyter Lab, open a fresh terminal window. We will be able to run the sqlite3 command line tool from the terminal window.

Okay, once completed, the first step is schema creation. First, it is important to note. The goal of this project is to put the data in /anvil/projects/tdm/data/restaurant/orders.csv into a sqlite database

With that in mind, run the following (in your terminal) to get a sample of the data.

head /anvil/projects/tdm/data/restaurant/orders.csv

Review the output data. An SQL schema is a set of text or code that defines how the database is structured and how each piece of data is stored. In a lot of ways it is similar to how a data.frame has columns with different types — just more "set in stone" than the very easily changed data.frame.

Each database handles schemas slightly differently. In sqlite, the database will contain a single schema table that describes all included tables, indexes, triggers, views, etc. Specifically, each entry in the sqlite_schema table will contain the type, name, tbl_name, root page, and sql for the database object.

For sqlite, the "database object" could refer to a table, index, view, or trigger.

This detail is more than is needed for right now. If you are interested in learning more, the sqlite documentation is very good, and the relevant page to read about this is here.

For our purposes, when I refer to "schema", what I really mean is the set of commands that will build our tables, indexes, views, and triggers. sqlite makes it particularly easy to open up a sqlite database and get the exact commands to build the database from scratch without the data itself. For example, take a look at our lahman.db database by running the following in your terminal.

sqlite3 /anvil/projects/tdm/data/lahman/lahman.db

This will open the command line interface (CLI) for sqlite3. It will look similar to:

sqlite>

Type .schema to see the "schema" for the database.

Any command you run in the sqlite CLI that starts with a dot (.) is called a "dot command". A dot command is exclusive to sqlite and the same functionality cannot be expected to be available in other SQL tools like Postgresql, MariaDB, or MS SQL. You can list all of the dot commands by typing .help.

After running .schema, you should see a variety of legitimate SQL commands that will create the structure of your database without the data itself. This is an extremely useful self-documenting tool that is particularly useful.

So, now let’s study the sample of our orders.csv dataset to create a markdown list of key:value pairs for each column in the dataset. Each key should be the title of the column, and each value should be the type of data that is stored in that column.

Questions

Question 1 (2 pts)

  1. Create a markdown list of key:value pairs for each column in the orders.csv dataset. Each key should be the title of the column, and each value should be the type of data that is stored in that column.

For example, your solution might be given like this:

  • akeed_order_id: INTEGER

  • customer_id: TEXT

  • etc., etc.

where the value is one of the 5 "affinity types" (INTEGER, TEXT, BLOB, REAL, NUMERIC) in sqlite. See section "3.1.1" here.

We just showed akeed_order_id and customer_id to give examples about how the first two variables in the data set should be classified.

As a side note: Okay, you may be asking, "what is the difference between INTEGER, REAL, and NUMERIC?". Great question. In general (for other SQL RDBMSs), there are approximate numeric data types and exact numeric data types. What you are most familiar with is the approximate numeric data types. In R or Python for example, try running the following:

(3 - 2.9) <= 0.1
Output
FALSE
(3 - 2.9) <= 0.1
Output
False

Under the hood, the values are stored as a very close approximation of the real value. This small amount of error is referred to as floating point error. There are some instances where it is critical that values are stored as exact values (for example, in finance). In those cases, you would need to use special data types to handle it. In sqlite, this type is NUMERIC. So, for our example, store text as TEXT, numbers without decimal places as INTEGER, and numbers with decimal places as REAL — our example dataset doesn’t have a need for NUMERIC.

Question 2 (2 pts)

  1. Create a database named "orders.db" and a table named "orders" by following the instructions below

Let’s put together our CREATE TABLE statement that will create our table in the database.

See here for some good examples. Realize that the CREATE TABLE statement is not so different from any other query in SQL, and although it looks messy and complicated, it is not so bad. Name your table orders.

Once you’ve written your CREATE TABLE statement, create a new, empty database by running the following in a terminal: sqlite3 $HOME/orders.db. Copy and paste the CREATE TABLE statement into the sqlite CLI. Upon success, you should see the statement printed when running the dot command .schema. Fantastic! You can also verify that the table exists by running the dot command .tables.

Congratulations! To finish things off, please paste the CREATE TABLE statement into a markdown cell in your notebook. In addition, include a screenshot of your .schema output after your CREATE TABLE statement was run.

Question 3 (2 pts)

The next step in the project is to add the data! After all, it is a data base. You may get how to insert data into table from here

  1. Please populate the data from orders.csv into your orders table

  2. Connect to "orders.db" and run a query to get the first 5 rows from "orders" table.

You could programmatically generate a .sql file with the INSERT INTO statement, hook the database up with Python or R and insert the data that way, or you could use the wonderful dot commands sqlite like following:

.mode csv
.import --skip 1 /anvil/projects/tdm/data/restaurant/orders.csv orders

To connect to the database:

%sql sqlite:///$HOME/orders.db

To select data from the table:

%sql SELECT * FROM orders LIMIT 5

Question 4 (2 pts)

Woohoo! You’ve successfully created a database and populated it with data from a dataset — pretty cool! Connect to your database from inside a terminal.

sqlite3 $HOME/orders.db

Now, run the following dot command in order to time our queries: .timer on. This will print out the time it takes to run each query. For example, try the following:

SELECT * FROM orders LIMIT 5;

Cool! Time the following query.

SELECT * FROM orders ORDER BY created_at LIMIT 10;
Output
Run Time: real 0.021 user 0.000261 sys 0.004553

Running time is often critical, particularly during large-scale database searches. Let’s explore some techniques to enhance performance through the use of indexing in tables. You may get more information about index here: www.sqlitetutorial.net/sqlite-index/

  1. Create an index for column "created_at".

Project 12 Assignment Checklist

  • Jupyter Lab notebook with your code, comments and output for the assignment

    • firstname-lastname-project12.ipynb

  • Sql file 'orders.db' (this file should be approximately 22 MB)

  • Submit files through Gradescope

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.