0. Kick-off
Welcome!
Welcome to the half day short course titled Applied SQL for Data Analysis: A Practical Introduction to a Key Data Skill.
Agenda
Wednesday, November 12, 2025, 8:00 AM to noon
| Time | Minutes | Topic |
|---|---|---|
8:00 - 8:20 AM |
20 |
0. Kick off |
8:20 - 9:00 AM |
40 |
1. Basics of Querying |
9:00 - 9:40 AM |
40 |
2. Filtering and Sorting |
9:40 - 9:45 AM |
5 |
Break |
9:45 - 10:25 AM |
40 |
3. Aggregation and Grouping |
10:25 - 11:05 AM |
40 |
4. Joins: Combining Tables |
11:05 - 11:10 AM |
5 |
Break |
11:10 - 11:50AM |
40 |
5. SQL in R |
11:50 AM - 12:00 PM |
10 |
Wrap up |
On your own |
6. Real-World Querying Challenge |
About Structured Query Language (SQL)
Structured Query Language (SQL) is a language used for querying and manipulating data in a database and it provides a relational database management system. SQL allows users to efficiently create, read, update, and delete data in databases. It is widely used across applications, from small projects to large-scale industry systems. For data analysts, data scientists, and software developers, understanding SQL is essential for working effectively with databases.
We use SQLite, which does not require a server to run. The database is built into the application, and the app reads from and writes to the database files directly on disk. SQLite source code is public and free to everyone.
Datasets
JupyterLab on Anvil, a supercomputer supported by the National Science Foundation, will be used for this workshop.
IMDb
We will use the IMDb SQLite data located at: /anvil/projects/tdm/data/movies_and_tv/imdb.db[1] for all demo content.
The following is an illustration of the database to help you understand the data.
[1] There are 7 data files provided by IMDb:
name.basics.tsv.gz
title.akas.tsv.gz
title.basics.tsv.gz
title.crew.tsv.gz
title.episode.tsv.gz
title.principals.tsv.gz
title.ratings.tsv.gz
We will use the SQLite database format created from those tsv (Tab-Separated Values) files.
Lahman Baseball Database
We will use the Lahman Baseball Database. The SQLite database on Anvil is located at /anvil/projects/tdm/data/lahman/lahman.db.
We will use the Lahman data for all practice on your own content.
The following is an illustration of the database to help you understand the data.
Login in to Anvil
Let’s launch a Jupyter session on Anvil by navigating to https://notebook.anvilcloud.rcac.purdue.edu.
Use the ACCESS username that you were assigned (when you setup your account) and the ACCESS password that you chose. You will need to authenticate via DUO.
Now launch two notebooks. Since sqlite only references the most recent database, we recommend having two notebooks open:
-
Notebook 1 - use for the 'demo' sections which utilize the IMDb data. Title this file "demo.ipynb" or "IMDb.ipynb".
-
Notebook 2 - use for the 'practice on your own' sections which utilize the Lahman baseball data. Title this file "practice.ipynb" or "baseball.ipynb".