TDM 20100: Project 5 - SQL Introduction 1
Project Objectives
This project and the following projects will introduce you to SQL (specifically SQLite), which is a powerful language for managing and querying databases.
Dataset
-
/anvil/projects/tdm/data/lahman/lahman.db
Our page in The Examples Book about SQL (in general) is given here: the-examples-book.com/tools/sql
Questions
Question 1 (2 points)
Structured Query Language (SQL) is the standard language for managing and manipulating relational databases. It allows users to efficiently create, read, update, and delete data in databases. SQL is used in various applications, from small-scale projects to many large-scale industry applications. It is essential for data analysts, data scientists, and software developers to understand SQL to work with databases effectively. In the next few projects, you will learn the basics of SQL, including how to write queries, filter data, and perform data aggregation.
To start, you will need to ensure Anvil is setup correctly. For these projects, please make sure you have the seminar
kernel selected. Then, make a new code cell and run the following code to load the Lahman dataset into a queryable SQL database:
%sql sqlite:////anvil/projects/tdm/data/lahman/lahman.db
This will load the Lahman dataset into a SQLite database. You can then use SQL queries to interact with the data. You will need to re-run this code cell at the start of each jupyter session or if your kernel crashes/restarts. |
Once this is loaded, we can start writing SQL queries. To write SQL code in this environment, you will need to prefix a line of code with %sql
. Additionally, you can make the first line of a code cell %%sql
to run the entire cell as a SQL query.
There are many different SQL commands, each with their own use cases and best practices. For these project, we will focus on the following commands:
Command | Description | Example | Example Description |
---|---|---|---|
|
Used to select data from a database. |
|
Selects all columns from the table called |
|
Used to filter records based on a condition. |
|
Selects all columns from the table called |
|
Used to sort the result set in ascending or descending order. |
|
Selects all columns from the table called |
|
Used to specify the number of records to return. |
|
Selects all columns from the table called |
|
Used to group rows with the same values in specified columns into summary row(s). |
|
Selects the |
|
Used to combine rows from two or more tables based on a related column between them. |
|
Selects all columns from |
|
A directional command that may be used to create a new table from the results of a query. |
|
Selects all columns from |
|
Used to insert new records into a table. |
|
Inserts a new record into the |
|
Used to modify existing records in a table. |
|
Updates the |
|
Used to delete records from a table. |
|
Deletes records from the |
|
Used to delete a table or database. |
|
Deletes the |
|
Used to create a new table or database. |
|
Creates a new table called |
In SQL, and some other languages such as bash/batch, the |
That’s a lot of commands! The best part is that those are just the tip of the iceberg. SQL is a very powerful language with many more commands and features. You can find more information about SQL commands in the [SQLite documentation](www.sqlite.org/lang.html).
One thing you may notice when reading through the examples above is that SQL commands are practically plain English. This is one of the reasons why SQL is so popular and widely used, as it is very easy to both understand and write.
The first SQL query we will look at is exploring what tables are available in the Lahman dataset. To do this, we will use the sqlite_master
table, which is a special table in SQLite that contains information about all the tables in the database. You can run the following command to see all the tables in the Lahman dataset:
%%sql
SELECT name FROM sqlite_master WHERE type='table';
This will return a list of all the tables in the Lahman dataset. In the example above, you can see we used 3 SQL commands: SELECT
, FROM
, and WHERE
. The SELECT
command is used to specify the name
column, the FROM
command is used to specify the sqlite_master
table, and the WHERE
command is used to filter the results to only include rows where the type
column is equal to 'table'
. This means we are only selecting tables, not other database objects like indexes or views.
Typically, SQL databases will have many tables, each containing different types or groups of data. In the Lahman dataset, you will find tables containing information about baseball players, teams, games, and more. As you should be able to tell, tables are typically clearly named to indicate what data they contain. For example, the People
table contains information about baseball players, while the Teams
table contains information about baseball teams. Go ahead and count how many tables are in the Lahman dataset, remember this count for later.
Let’s take a look at the People
table to see what data it contains. Please create a new code cell and create your own SQL query to select all columns from the People
table, but limit the results to 5 records. Then, run this cell to see the results.
From this data, you should be able to answer the following questions about our dataset:
-
How many columns are in the
People
table? -
What is the name of the column that contains the player’s first name?
-
For a players deathYear, what is put in the column if the player is still alive?
-
What is Don Aase’s playerID?
1.1. Load the SQL database into a queryable SQLite database.
1.2. Run the SQL query to see all tables in the Lahman dataset.
1.3. Write a SQL query to select all columns from the People
table, limiting the results to 5 records.
1.4. Answer the questions (1, 2, 3 and 4 above) about the People
table based on the results of your SQL query.
Question 2 (2 points)
Now, let’s try using some slightly more advanced queries. We will continue to use the People
table, but we will use the WHERE
command to filter the results based on certain conditions.
First, please create a new code cell and write a SQL query to select all columns from the People
table where the player’s last name is `Sanders` as follows:
%%sql
SELECT * FROM People WHERE nameLast='Sanders'
You should see that there are 15 players with the last name `Sanders`.
Now, let’s see how many players have the last name Sanders
and were born before 1900. We will create a new code cell and write a SQL query to select all columns from the People
table where the player’s last name is "Sanders" and the birth year is before 1900. Then, run this cell to see the results:
%%sql
SELECT * FROM People WHERE nameLast='Sanders' AND birthYear<1900
You can use the |
After running this query, you should see that there are 4 players with the last name Sanders
who were born before 1900.
Now, write a SQL query to see how many players with the last name Sanders
were born before 1900 or after 1975.
Operations like For example, condition1 AND condition2 OR condition3 will be evaluated as (condition1 AND condition2) OR condition3. If you want to evaluate it as condition1 AND (condition2 OR condition3), you can simply use parentheses to specify the order of evaluation. |
After running this, you should see 6 players in our results.
2.1. Write a SQL query to select all columns from the People
table where the player’s last name is "Sanders".
2.2. Write a SQL query to select all columns from the People
table where the player’s last name is "Sanders" and the birth year is before 1900.
2.3. Write a SQL query to select all columns from the People
table where the player’s last name is "Sanders" and the birth year is before 1900 or after 1975.
Question 3 (2 points)
Something that may be useful to us is to sort the results of our queries. For example, we could sort players by their birth year, their weight, or even their last name. To do this, we can use the ORDER BY
command in our SQL queries. This command allows us to specify a column we want to sort by using the column name, and also if we want it sorted in ascending or descending order. By default, it will sort in ascending order, but you can specify DESC
to sort in descending order.
For example, the following query selects all players who were born before 1900, orders them by last name in descending order, and displays the first 10 results.
%%sql
SELECT * FROM People WHERE birthYear<1900 ORDER BY nameLast DESC LIMIT 10
For a starter, write a query that returns all players born after 1970, who are over 70 inches tall, ordered by their last name in descending order. You can use the People
table for this query.
You should typically always use the |
After you have that query working, copy this query and expand our limit to 20 players. What happens? If your query is correct, you should get an error message that looks like this:
(sqlite3.OperationalError) Could not decode to UTF-8 column 'nameLast' with text 'Zu�iga'
What’s happening here?
The Lahman dataset contains some players with non-ASCII characters in their names, such as "Zuñiga". SQLite and python are having a hard time decoding these characters, which is causing the error. To fix this, we can talk about the LIKE
operator in the WHERE
clause. This operator allows us to filter results based on a pattern, and it can be used to match non-ASCII characters as well by matching the � character. For example, you can match any Players with a non-ASCII character in their last name by using the following query:
%%sql
SELECT * FROM People WHERE nameLast LIKE '%�%'
Based on this knowledge, can you modify your previous query to exclude players with non-ASCII characters in their last name?
Hint: use the |
3.1. Write a SQL query to select all players born after 1970, who are over 70 inches tall, ordered by their last name in descending order, limiting the results to 10 records.
3.2. Write a SQL query to select all players born after 1970, who are over 70 inches tall, ordered by their last name in descending order, limiting the results to 20 records. (This will throw an error.)
3.3. Fix the error in the previous query by excluding players with non-ASCII characters in their last name.
Question 4 (2 points)
Another useful command in SQL is the GROUP BY
command. This command allows us to group rows together if they have the same values in specified columns. This is useful for aggregating data, such as counting the number of records that match a certain condition, or calculating metrics of a column such as the average or sum.
Similar to how the WHERE
command comes with operators such as AND
and OR
, the GROUP BY
command also comes with aggregation functions such as COUNT
, SUM
, AVG
, MIN
, and MAX
. These functions allow us to perform calculations on the grouped data.
For example, suppose we want to know the average height of players with the last name Sanders
.
We can use the WHERE
command to filter the results to only include players with the last name Sanders
, the GROUP BY
command to group the results by the last name, and the AVG
function to calculate the average height. The SQL query would look like this:
%%sql
SELECT *, AVG(height) AS avg_height FROM People WHERE nameLast = 'Sanders' GROUP BY nameLast;
Even though we use the wildcard to select all columns, we still need to specify which columns we want to aggregate. In this case, we are aggregating the |
Let’s try this out with a more complex query. Write a SQL query that returns the average weight of players grouped by their birth year, ordered by the average weight in descending order. You can use the People
table for this query. Be sure to limit the results to 10 records.
4.1. Write a SQL query to return the average weight of players grouped by their birth year, ordered by the average weight in descending order, limiting the results to 10 records.
Question 5 (2 points)
Another useful operation in SQL is the COUNT
function, which allows us to count the number of records that match a certain condition. Remember in Question 1 when you manually counted the number of tables in the Lahman dataset. We can instead use the COUNT
function to do this task for us. The COUNT
function can be used in conjunction with the SELECT
command to count the number of records that match a certain condition.
For example, if we want to count the number of players in the People
table,
we can use the following SQL query:
%%sql
SELECT COUNT(*) FROM People;
This will return a single number, which is the total number of records in the People
table. The *
wildcard is used to count all records, regardless of their values.
We could also use the WHERE
command to count the number of players with a certain condition. For example, if we want to count the number of players with the last name "Sanders", we can use the following SQL query:
%%sql
SELECT COUNT(*) FROM People WHERE nameLast = 'Sanders';
We can also use the COUNT
function in conjunction with the GROUP BY
command to count the number of records in each group. For example, if we want to count the number of players grouped by their birth year, we can use the following SQL query:
%%sql
SELECT birthYear, COUNT(*) FROM People GROUP BY birthYear LIMIT 10;
Now, time for you to try some yourself. Can you write an SQL query that counts the number of tables in the Lahman dataset? How about the number of players born before 1900 who are over 74 inches tall?
You should get 27 and 767, respectively, for the two queries above. |
5.1. Write a SQL query to count the number of tables in the Lahman dataset.
5.2. Write a SQL query to count the number of players born before 1900 who are over 74 inches tall.
Submitting your Work
Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.
-
firstname_lastname_project5.ipynb
You must double check your You will not receive full credit if your |