STAT 29000: Project 10 — Fall 2021

Motivation: Although SQL syntax may still feel unnatural and foreign, with more practice it will start to make more sense. The ability to read and write SQL queries is a "bread-and-butter" skill for anyone working with data.

Context: We are in the second of a series of projects that focus on learning the basics of SQL. In this project, we will continue to harden our understanding of SQL syntax, and introduce common SQL functions like AVG, MIN, and MAX.

Scope: SQL, sqlite

Learning Objectives
  • Explain the advantages and disadvantages of using a database over a tool like a spreadsheet.

  • Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause.

  • Basic clauses: select, order by, limit, desc, asc, count, where, from, etc.

  • Utilize SQL functions like min, max, avg, sum, and count to solve data-driven problems.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/taxi/taxi_sample.db

Questions

Question 1

In project (8), you used bash tools, including awk, to parse through large amounts of yellow taxi data from /depot/datamine/data/taxi/. Of course, calculating things like the mean is not too difficult using awk, and awk is extremely fast and efficient, BUT SQL is better for some of the work we attempted to do in project (8).

Don’t take my word on it! We’ve placed a sample of 5 of the data files for the yellow taxi cab into an SQLite database called taxi_sample.db. This database contains, among other things, the yellow table (for yellow taxi cab data).

Write a query that will return the fare_amount, surcharge, tip_amount, and tolls_amount as a percentage of total_amount.

Now, take into consideration that this query will be evaluating these percentages for 5 of the data files, not just the first file or so. Wow, impressive!

Use the sum aggregate function to calculate the totals, and division to figure out the percentages.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

Check out the payment_type column. Write a query that counts the number of each type of payment_type. The end result should print something like the following.

Output sample
payment_type, count
CASH, 123

You can use aliasing to control the output header names.

Write a query that sums the total_amount for payment_type of "CASH". What is the total amount of cash payments?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Write a query that gets the largest number of passengers in a single trip. How far was the trip? What was the total amount? Answer all of this in a single query.

Whoa, there must be some erroneous data in the database! Not too surprising. Write a query that explores this more, explain what your query does and how it helps you understand what is going on.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Write a query that gets the average total_amount for each year in the database. Which year has the largest average total_amount? Use the pickup_datetime column to determine the year.

Read this page and look at the strftime function.

If you want the headers to be more descriptive, you can use aliases.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

What percent of data in our database has information on the location of pickup and dropoff? Examine the data, to see if there is a pattern to the rows with that information and without that information.

There is a distinct pattern. Pay attention to the date and time of the data.

Confirm your hypothesis with the original data set(s) (in /depot/datamine/data/taxi/yellow/*.csv), using bash. This doesn’t have to be anything more thorough than running a simple head command with a 1-2 sentence explanation.

Of course, there will probably be some erroneous data for the latitude and longitude columns. However, you could use the avg function on a latitude or longitude column, by year to maybe get a pattern.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.