TDM 20100: Project 10 — 2023

Motivation: Being able to use results of queries as tables in new queries (also known as writing sub-queries), and calculating values like MIN, MAX, and AVG in aggregate are key skills to have in order to write more complex queries. In this project we will learn about aliasing, writing sub-queries, and calculating aggregate values.

Context: We are in the middle of a series of projects focused on working with databases and SQL. In this project we introduce aliasing, sub-queries, and calculating aggregate values!

Scope: SQL, SQL in R

Learning Objectives
  • Demonstrate the ability to interact with popular database management systems within R.

  • Solve data-driven problems using a combination of SQL and R.

  • Basic clauses: SELECT, ORDER BY, LIMIT, DESC, ASC, COUNT, WHERE, FROM, etc.

  • Showcase the ability to filter, alias, and write subqueries.

  • Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING. Explain when to use having, and when to use where.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

For this project, we will be using the lahman sqlite database. This database contains the data in the directory

  • /anvil/projects/tdm/data/lahman

You may get some more lahman database information from this youtube video youtube.com/watch?v=tS_-oTbsDzs

To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook to establish a connection with the database.

%sql sqlite:////anvil/projects/tdm/data/lahman/lahman.db

For every following cell where you want to run a SQL query, prepend %%sql to the top of the cell — just like we do for R or bash cells.

Questions

Question 1 (1 pt)

  1. Let’s say we are interested in the total number of baseball players for each year, from year 2018 to year 2022, respectively. Please write a query to count the total number of players in the appearances table (by year), and display these totals in, descending order, by year.

output

year num_of_players

2022 1495…​

2021 1798…​

2020 1857…​

2019 1870…​

2018 1918…​

Dr Ward just made up some representative numbers here; these are not the exact numbers!

  • In the query, give an alias to the yearID from the appearances table, so that yearID appears listed as year. Similarly, name the counting of distinct players as an alias called num_of_players. The alias is a great way to not only make the headers look good, but aliases can also be used to reduce the text in a query, by giving some intermediate results a shorter name. The following is the basic syntax of column alias. You may get more information from www.tutorialspoint.com/sqlite/sqlite_alias_syntax.htm [alias]

SELECT column_name AS alias_name FROM table_name WHERE [condition]

Question 2 (2 pts)

Now, let’s look into the teams table. The attendance column provides the total number of audiences that attended a team’s home games. We may say that a team is more popular if it has more attendance at its home games.

  1. Please find out what is the average attendance number for each team in the teams table, during games from 2022 (only). You should have one average attendance number per team.

  2. Now use a subquery to compute the average attendance across all teams and games. Then modify question 2a, to only include teams whose average attendance for the team is larger than the average across all teams and games. Using an alias, change the attendance column in your query to appear as "average_attendance".

The AVG function will be useful to calculate average attendance
We can achieve this using a subquery. A subquery is a query that is used to embed a query within another query.

Question 3 (1 pt)

If you answered question (2) correctly, you should find that team Los Angeles Dodgers, with team ID 'LAN`, had the highest average attendance. We can consider this team as the most popular team in 2022.

  1. Please calculate the winning percentage for this team in 2022, using the fields 'W' and 'L' from the teams table with the formula:

winning_percentage = W/(W+L)

Use the name winning_per for the resulting column.

Some of you might get a 0 in your output and wonder why the most popular baseball team had a 0 win percentage! What’s happening here? How can you fix this?

Question 4 (2 pts)

You now know 2 different applications of the AS keyword, and you also know how to use a query as a subquery. Great!

In the previous project, we were introduced to aggregate functions. We know we can use the WHERE clause to filter our results, but what if we wanted to filter our results based on an aggregated column?

  1. Update the query from question (3) to print all teams that have winning percentage from year 2012 to 2022 (inclusive) greater than 55%. You should get 3 teams. Display the results, by win percentage, in descending order.

See this article for more information on the HAVING and WHERE clauses.

Question 5 (2 pts)

  1. Now let’s look at allstarfull table. Please list all players who attended 20 or more All Star games. List the players in descending order, by the number of All Star games that they attended.

  2. Please explore the tables in the database and write a query about some information that you are interested in. Please make sure to use aliasing, a subquery, and at least one aggregate function.

    Project 10 Assignment Checklist
    • Jupyter notebook with your code, comments and output for questions 1 to 5

      • firstname-lastname-project10.ipynb

    • Submit files through Gradescope

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.