TDM 20100: Project 5 - SQL Introduction 1

Project Objectives

This project and the following projects will introduce you to SQL (specifically SQLite), which is a powerful language for managing and querying databases.

Learning Objectives
  • Learn the basics of SQLite syntax and structure.

  • Learn how to create and manipulate tables.

Dataset

  • /anvil/projects/tdm/data/lahman/lahman.db

Our page in The Examples Book about SQL (in general) is given here: the-examples-book.com/tools/sql

Questions

Question 1 (2 points)

Structured Query Language (SQL) is the standard language for managing and manipulating relational databases. It allows users to efficiently create, read, update, and delete data in databases. SQL is used in various applications, from small-scale projects to many large-scale industry applications. It is essential for data analysts, data scientists, and software developers to understand SQL to work with databases effectively. In the next few projects, you will learn the basics of SQL, including how to write queries, filter data, and perform data aggregation.

To start, you will need to ensure Anvil is setup correctly. For these projects, please make sure you have the seminar kernel selected. Then, make a new code cell and run the following code to load the Lahman dataset into a queryable SQL database:

%sql sqlite:////anvil/projects/tdm/data/lahman/lahman.db

This will load the Lahman dataset into a SQLite database. You can then use SQL queries to interact with the data. You will need to re-run this code cell at the start of each jupyter session or if your kernel crashes/restarts.

Once this is loaded, we can start writing SQL queries. To write SQL code in this environment, you will need to prefix a line of code with %sql. Additionally, you can make the first line of a code cell %%sql to run the entire cell as a SQL query.

There are many different SQL commands, each with their own use cases and best practices. For these project, we will focus on the following commands:

Command Description Example Example Description

SELECT

Used to select data from a database.

SELECT * FROM table

Selects all columns from the table called table

WHERE

Used to filter records based on a condition.

SELECT * FROM table WHERE column = value

Selects all columns from the table called table where the column equals value

ORDER BY

Used to sort the result set in ascending or descending order.

SELECT * FROM table ORDER BY column ASC

Selects all columns from the table called table and orders the results by column in ascending order

LIMIT

Used to specify the number of records to return.

SELECT * FROM table LIMIT 10

Selects all columns from the table called table and limits the results to 10 records

GROUP BY

Used to group rows with the same values in specified columns into summary row(s).

SELECT column, COUNT(*) FROM table GROUP BY column

Selects the column and counts the number of records in each group, grouping by the column

JOIN

Used to combine rows from two or more tables based on a related column between them.

SELECT * FROM table1 JOIN table2 ON table1.column = table2.column

Selects all columns from table1 and table2 where the column in table1 matches the column in table2

INTO

A directional command that may be used to create a new table from the results of a query.

SELECT * INTO new_table FROM table

Selects all columns from table and creates a new table called new_table with the results

INSERT

Used to insert new records into a table.

INSERT INTO table (column1, column2) VALUES (value1, value2)

Inserts a new record into the table with column1 set to value1 and column2 set to value2

UPDATE

Used to modify existing records in a table.

UPDATE table SET column = value WHERE condition

Updates the column in the table to value where the condition is met

DELETE

Used to delete records from a table.

DELETE FROM table WHERE condition

Deletes records from the table where the condition is met

DROP

Used to delete a table or database.

DROP TABLE table

Deletes the table from the database

CREATE

Used to create a new table or database.

CREATE TABLE table (column1 datatype, column2 datatype)

Creates a new table called table with column1 and column2 of specified datatypes

In SQL, and some other languages such as bash/batch, the * character is a wildcard that will match with anything. For example, SELECT * FROM table will select all columns from the table called table, no matter their names, datatypes, or values. This is very useful when you want to select all data from a table without having to specify each column individually.

That’s a lot of commands! The best part is that those are just the tip of the iceberg. SQL is a very powerful language with many more commands and features. You can find more information about SQL commands in the [SQLite documentation](www.sqlite.org/lang.html).

One thing you may notice when reading through the examples above is that SQL commands are practically plain English. This is one of the reasons why SQL is so popular and widely used, as it is very easy to both understand and write.

The first SQL query we will look at is exploring what tables are available in the Lahman dataset. To do this, we will use the sqlite_master table, which is a special table in SQLite that contains information about all the tables in the database. You can run the following command to see all the tables in the Lahman dataset:

%%sql
SELECT name FROM sqlite_master WHERE type='table';

This will return a list of all the tables in the Lahman dataset. In the example above, you can see we used 3 SQL commands: SELECT, FROM, and WHERE. The SELECT command is used to specify the name column, the FROM command is used to specify the sqlite_master table, and the WHERE command is used to filter the results to only include rows where the type column is equal to 'table'. This means we are only selecting tables, not other database objects like indexes or views.

Typically, SQL databases will have many tables, each containing different types or groups of data. In the Lahman dataset, you will find tables containing information about baseball players, teams, games, and more. As you should be able to tell, tables are typically clearly named to indicate what data they contain. For example, the People table contains information about baseball players, while the Teams table contains information about baseball teams. Go ahead and count how many tables are in the Lahman dataset, remember this count for later.

Let’s take a look at the People table to see what data it contains. Please create a new code cell and create your own SQL query to select all columns from the People table, but limit the results to 5 records. Then, run this cell to see the results.

From this data, you should be able to answer the following questions about our dataset:

  1. How many columns are in the People table?

  2. What is the name of the column that contains the player’s first name?

  3. For a players deathYear, what is put in the column if the player is still alive?

  4. What is Don Aase’s playerID?

Deliverables

1.1. Load the SQL database into a queryable SQLite database.

1.2. Run the SQL query to see all tables in the Lahman dataset.

1.3. Write a SQL query to select all columns from the People table, limiting the results to 5 records.

1.4. Answer the questions (1, 2, 3 and 4 above) about the People table based on the results of your SQL query.

Question 2 (2 points)

Now, let’s try using some slightly more advanced queries. We will continue to use the People table, but we will use the WHERE command to filter the results based on certain conditions.

First, please create a new code cell and write a SQL query to select all columns from the People table where the player’s last name is `Sanders` as follows:

%%sql
SELECT * FROM People WHERE nameLast='Sanders'

You should see that there are 15 players with the last name `Sanders`. Now, let’s see how many players have the last name Sanders and were born before 1900. We will create a new code cell and write a SQL query to select all columns from the People table where the player’s last name is "Sanders" and the birth year is before 1900. Then, run this cell to see the results:

%%sql
SELECT * FROM People WHERE nameLast='Sanders' AND birthYear<1900

You can use the AND operator to combine multiple conditions inside a WHERE clause. Simply put the operator between each condition you want to check. There are other operators you can use as well, such as OR to check if either condition is true, or NOT to negate a condition.

After running this query, you should see that there are 4 players with the last name Sanders who were born before 1900.

Now, write a SQL query to see how many players with the last name Sanders were born before 1900 or after 1975.

Operations like AND, OR, and NOT have an order of precedence, similar to PEMDAS in mathematics. This means that some operations will be evaluated before others. For example, AND has a higher precedence than OR, so if you use both in a query, the AND conditions will be evaluated first.

For example, condition1 AND condition2 OR condition3 will be evaluated as (condition1 AND condition2) OR condition3. If you want to evaluate it as condition1 AND (condition2 OR condition3), you can simply use parentheses to specify the order of evaluation.

After running this, you should see 6 players in our results.

Deliverables

2.1. Write a SQL query to select all columns from the People table where the player’s last name is "Sanders".

2.2. Write a SQL query to select all columns from the People table where the player’s last name is "Sanders" and the birth year is before 1900.

2.3. Write a SQL query to select all columns from the People table where the player’s last name is "Sanders" and the birth year is before 1900 or after 1975.

Question 3 (2 points)

Something that may be useful to us is to sort the results of our queries. For example, we could sort players by their birth year, their weight, or even their last name. To do this, we can use the ORDER BY command in our SQL queries. This command allows us to specify a column we want to sort by using the column name, and also if we want it sorted in ascending or descending order. By default, it will sort in ascending order, but you can specify DESC to sort in descending order.

For example, the following query selects all players who were born before 1900, orders them by last name in descending order, and displays the first 10 results.

%%sql
SELECT * FROM People WHERE birthYear<1900 ORDER BY nameLast DESC LIMIT 10

For a starter, write a query that returns all players born after 1970, who are over 70 inches tall, ordered by their last name in descending order. You can use the People table for this query.

You should typically always use the LIMIT command to limit the number of results returned by your query. This is especially important when working with large datasets, as it can help improve performance and reduce the amount of data you need to process. In this query, you can use LIMIT 10 to limit the results to 10 records.

After you have that query working, copy this query and expand our limit to 20 players. What happens? If your query is correct, you should get an error message that looks like this:

(sqlite3.OperationalError) Could not decode to UTF-8 column 'nameLast' with text 'Zu�iga'

What’s happening here?

The Lahman dataset contains some players with non-ASCII characters in their names, such as "Zuñiga". SQLite and python are having a hard time decoding these characters, which is causing the error. To fix this, we can talk about the LIKE operator in the WHERE clause. This operator allows us to filter results based on a pattern, and it can be used to match non-ASCII characters as well by matching the � character. For example, you can match any Players with a non-ASCII character in their last name by using the following query:

%%sql
SELECT * FROM People WHERE nameLast LIKE '%�%'

Based on this knowledge, can you modify your previous query to exclude players with non-ASCII characters in their last name?

Hint: use the NOT operator before the LIKE operator to exclude results that match the pattern.

Deliverables

3.1. Write a SQL query to select all players born after 1970, who are over 70 inches tall, ordered by their last name in descending order, limiting the results to 10 records.

3.2. Write a SQL query to select all players born after 1970, who are over 70 inches tall, ordered by their last name in descending order, limiting the results to 20 records. (This will throw an error.)

3.3. Fix the error in the previous query by excluding players with non-ASCII characters in their last name.

Question 4 (2 points)

Another useful command in SQL is the GROUP BY command. This command allows us to group rows together if they have the same values in specified columns. This is useful for aggregating data, such as counting the number of records that match a certain condition, or calculating metrics of a column such as the average or sum.

Similar to how the WHERE command comes with operators such as AND and OR, the GROUP BY command also comes with aggregation functions such as COUNT, SUM, AVG, MIN, and MAX. These functions allow us to perform calculations on the grouped data.

For example, suppose we want to know the average height of players with the last name Sanders. We can use the WHERE command to filter the results to only include players with the last name Sanders, the GROUP BY command to group the results by the last name, and the AVG function to calculate the average height. The SQL query would look like this:

%%sql
SELECT *, AVG(height) AS avg_height FROM People WHERE nameLast = 'Sanders' GROUP BY nameLast;

Even though we use the wildcard to select all columns, we still need to specify which columns we want to aggregate. In this case, we are aggregating the height column using the AVG function. Additionally, the AS keyword allows us to put the result of the aggregation into a new column called avg_height. This is useful for readability and understanding the results of our query.

Let’s try this out with a more complex query. Write a SQL query that returns the average weight of players grouped by their birth year, ordered by the average weight in descending order. You can use the People table for this query. Be sure to limit the results to 10 records.

Deliverables

4.1. Write a SQL query to return the average weight of players grouped by their birth year, ordered by the average weight in descending order, limiting the results to 10 records.

Question 5 (2 points)

Another useful operation in SQL is the COUNT function, which allows us to count the number of records that match a certain condition. Remember in Question 1 when you manually counted the number of tables in the Lahman dataset. We can instead use the COUNT function to do this task for us. The COUNT function can be used in conjunction with the SELECT command to count the number of records that match a certain condition. For example, if we want to count the number of players in the People table, we can use the following SQL query:

%%sql
SELECT COUNT(*) FROM People;

This will return a single number, which is the total number of records in the People table. The * wildcard is used to count all records, regardless of their values.

We could also use the WHERE command to count the number of players with a certain condition. For example, if we want to count the number of players with the last name "Sanders", we can use the following SQL query:

%%sql
SELECT COUNT(*) FROM People WHERE nameLast = 'Sanders';

We can also use the COUNT function in conjunction with the GROUP BY command to count the number of records in each group. For example, if we want to count the number of players grouped by their birth year, we can use the following SQL query:

%%sql
SELECT birthYear, COUNT(*) FROM People GROUP BY birthYear LIMIT 10;

Now, time for you to try some yourself. Can you write an SQL query that counts the number of tables in the Lahman dataset? How about the number of players born before 1900 who are over 74 inches tall?

You should get 27 and 767, respectively, for the two queries above.

Deliverables

5.1. Write a SQL query to count the number of tables in the Lahman dataset.
5.2. Write a SQL query to count the number of players born before 1900 who are over 74 inches tall.

Submitting your Work

Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.

Items to submit
  • firstname_lastname_project5.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output even though it may not. Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.