3. Aggregation and Grouping
Concepts
| Command | Description | Example | Example Description |
|---|---|---|---|
|
Used to group rows with the same values in specified columns into summary row(s). |
|
Selects the |
GROUP BY
Another useful command in SQL is the GROUP BY. This command allows us to group rows together if they have the same values in one or more specified columns. It is especially helpful when we want to summarize or aggregate data, for instance, counting the number of titles per genre or calculating the average rating for each year.
Similar to how the WHERE command can be combined with logical operators such as AND and OR, the GROUP BY command is often used together with aggregation functions such as COUNT, SUM, AVG, MIN, and MAX. These functions let us perform calculations on each group of data.
HAVING
HAVING is similar to the WHERE clause but instead of being applied to every row before grouping, it is applied on the aggregated data. A single query can have both a WHERE clause and a HAVING clause. The WHERE clause will filter the data before aggregation, and the HAVING clause will filter the data after aggregation.
Demo
For example, we want to find the number of rows for each type in titles table:
SELECT type, COUNT(*) FROM titles GROUP BY type;
Add average run time in minutes for each type:
SELECT type, COUNT(*), AVG(runtime_minutes) FROM titles GROUP BY type;
Or we only care about the movies:
SELECT type, COUNT(*) AS n, AVG(runtime_minutes)
FROM titles WHERE type = 'movie';
In another example, suppose we want to find the average runtime of movies in each genre. We can use the WHERE command to filter the results to only include movies (for instance, type = 'movie'), the GROUP BY command to group the data by genre, and the AVG function to calculate the average runtime for each group. The SQL query would look like this:
SELECT genres, AVG(runtime_minutes) AS avg_runtime
FROM titles
WHERE type = 'movie'
GROUP BY genres;
We can check only for 'Action':
SELECT AVG(runtime_minutes) AS avg_runtime
FROM titles
WHERE type = 'movie'
AND genres LIKE '%Action%';
Even when we use the wildcard (*) to select all columns, we still have to specify which column we want to apply the aggregation to. In this example, we use the AVG function to calculate the average of the runtime_minute column. The AS keyword then lets us assign a clear name, avg_runtime, to the resulting column, which helps make our query output easier to read and interpret.
Another example: find the number of titles for each language:
SELECT language, COUNT(*) as count FROM akas
GROUP BY language
ORDER BY count DESC;
Dr. Ward found all the title_id for Marvel Cinematic Universe (MCU) before, we use them for this example to find how many we have each year:
SELECT premiered AS year, COUNT(*) AS n FROM titles
WHERE title_id IN ('tt0371746', 'tt0800080', 'tt1228705', 'tt0800369', 'tt0458339', 'tt0848228', 'tt1300854', 'tt1981115', 'tt1843866', 'tt2015381', 'tt2395427', 'tt0478970', 'tt3498820', 'tt1211837', 'tt3896198', 'tt2250912', 'tt3501632', 'tt1825683', 'tt4154756', 'tt5095030', 'tt4154664', 'tt4154796', 'tt6320628', 'tt3480822', 'tt9032400', 'tt9376612', 'tt9419884', 'tt10648342', 'tt9114286') GROUP BY premiered;
Let’s modify this query to get the percentage of MCU movies released in each year. First find the total:
SELECT COUNT(*) FROM titles
WHERE title_id IN ('tt0371746', 'tt0800080', 'tt1228705', 'tt0800369', 'tt0458339', 'tt0848228', 'tt1300854', 'tt1981115', 'tt1843866', 'tt2015381', 'tt2395427', 'tt0478970', 'tt3498820', 'tt1211837', 'tt3896198', 'tt2250912', 'tt3501632', 'tt1825683', 'tt4154756', 'tt5095030', 'tt4154664', 'tt4154796', 'tt6320628', 'tt3480822', 'tt9032400', 'tt9376612', 'tt9419884', 'tt10648342', 'tt9114286');
When you divide the counts to total, you get zeros:
SELECT premiered AS year, COUNT(*)/29 AS n FROM titles
WHERE title_id IN ('tt0371746', 'tt0800080', 'tt1228705', 'tt0800369', 'tt0458339', 'tt0848228', 'tt1300854', 'tt1981115', 'tt1843866', 'tt2015381', 'tt2395427', 'tt0478970', 'tt3498820', 'tt1211837', 'tt3896198', 'tt2250912', 'tt3501632', 'tt1825683', 'tt4154756', 'tt5095030', 'tt4154664', 'tt4154796', 'tt6320628', 'tt3480822', 'tt9032400', 'tt9376612', 'tt9419884', 'tt10648342', 'tt9114286') GROUP BY premiered;
Why?
When we calculate a percentage using integer (numeric data without a decimal place) columns, SQLite performs integer division. That means it divides the numbers as integers and truncates any decimal part, instead of giving a decimal (floating-point) result. For example:
SELECT 5/2;
This returns the integer part (2) in SQLite, not 2.5, because both 5 and 2 are integers. We can fix it converting at least one of them to real number as following by :
SELECT CAST(5 AS REAL)/2 as result;
REAL is equivalent to a float or double, meaning a number that can include decimal values.
Lets fix it in the real data:
SELECT premiered AS year, CAST(COUNT(*) AS real)/29 AS n FROM titles
WHERE title_id IN ('tt0371746', 'tt0800080', 'tt1228705', 'tt0800369', 'tt0458339', 'tt0848228', 'tt1300854', 'tt1981115', 'tt1843866', 'tt2015381', 'tt2395427', 'tt0478970', 'tt3498820', 'tt1211837', 'tt3896198', 'tt2250912', 'tt3501632', 'tt1825683', 'tt4154756', 'tt5095030', 'tt4154664', 'tt4154796', 'tt6320628', 'tt3480822', 'tt9032400', 'tt9376612', 'tt9419884', 'tt10648342', 'tt9114286') GROUP BY premiered;
Lets continue to practise more by writing a query that returns the average number of words in the primary_title column, by year:
SELECT premiered, primary_title,
length(primary_title) - length(replace(primary_title, ' ', '')) + 1
FROM titles LIMIT 5;
Then, only for years where the average number of words in the primary_title is less than 3:
SELECT premiered,
AVG(length(primary_title) - length(replace(primary_title, ' ', '')) + 1) AS avg_title_length
FROM titles GROUP BY premiered
HAVING avg_title_length < 3;
Practice on your own
3.1. Write a SQL query to count the number of players born before 1900 who are over 74 inches tall.
3.2. Write a SQL query to return the average weight of players grouped by their birth year, ordered by the average weight in descending order, limiting the results to 10 records.