TDM 20100: Project 9 — 2022

Motivation: Although SQL syntax may still feel unnatural and foreign, with more practice it will start to make more sense. The ability to read and write SQL queries is a "bread-and-butter" skill for anyone working with data.

Context: We are in the second of a series of projects that focus on learning the basics of SQL. In this project, we will continue to harden our understanding of SQL syntax, and introduce common SQL functions like AVG, MIN, and MAX.

Scope: SQL, sqlite

Learning Objectives
  • Explain the advantages and disadvantages of using a database over a tool like a spreadsheet.

  • Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause.

  • Basic clauses: select, order by, limit, desc, asc, count, where, from, etc.

  • Utilize SQL functions like min, max, avg, sum, and count to solve data-driven problems.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/taxi/taxi_sample.db

Questions

Question 1

In previous projects, we used awk to parse through and summarize data. While awk is extremely convenient and can work well, but SQL is even better.

Write a query that will return the fare_amount, surcharge, tip_amount, and tolls_amount as a percentage of total_amount.

Make sure to limit the output to only 100 rows! Use the LIMIT clause to do this.

Use the sum aggregate function to calculate the totals, and division to figure out the percentages.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

Check out the payment_type column. Write a query that counts the number of each type of payment_type. The end result should print something like the following.

Output sample
payment_type, count
CASH, 123

You can use aliasing to control the output header names.

Write a query that sums the total_amount for payment_type of "CASH". What is the total amount of cash payments?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Write a query that gets the largest number of passengers in a single trip. How far was the trip? What was the total amount? Answer all of this in a single query.

Whoa, there must be some erroneous data in the database! Not too surprising. Write a query that explores this more, explain what your query does and how it helps you understand what is going on.

Make sure all queries limit output to only 100 rows.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Write a query that gets the average total_amount for each year in the database. Which year has the largest average total_amount? Use the pickup_datetime column to determine the year.

Read this page and look at the strftime function.

If you want the headers to be more descriptive, you can use aliases.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

What percent of data in our database has information on the location of pickup and dropoff? Examine the data, to see if there is a pattern to the rows with that information and without that information.

There is a distinct pattern. Pay attention to the date and time of the data.

Confirm your hypothesis with the original data set(s) (in /anvil/projects/tdm/data/taxi/yellow/*.csv), using bash. This doesn’t have to be anything more thorough than running a simple head command with a 1-2 sentence explanation.

Of course, there will probably be some erroneous data for the latitude and longitude columns. However, you could use the avg function on a latitude or longitude column, by year to maybe get a pattern.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.