# STAT 29000: Project 10 — Fall 2021

Motivation: Although SQL syntax may still feel unnatural and foreign, with more practice it will start to make more sense. The ability to read and write SQL queries is a "bread-and-butter" skill for anyone working with data.

Context: We are in the second of a series of projects that focus on learning the basics of SQL. In this project, we will continue to harden our understanding of SQL syntax, and introduce common SQL functions like `AVG`, `MIN`, and `MAX`.

Scope: SQL, sqlite

Learning Objectives

• Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause.

• Basic clauses: select, order by, limit, desc, asc, count, where, from, etc.

• Utilize SQL functions like min, max, avg, sum, and count to solve data-driven problems.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

## Dataset(s)

The following questions will use the following dataset(s):

• `/depot/datamine/data/taxi/taxi_sample.db`

## Questions

### Question 1

In project (8), you used bash tools, including `awk`, to parse through large amounts of yellow taxi data from `/depot/datamine/data/taxi/`. Of course, calculating things like the mean is not too difficult using `awk`, and `awk` is extremely fast and efficient, BUT SQL is better for some of the work we attempted to do in project (8).

Don’t take my word on it! We’ve placed a sample of 5 of the data files for the yellow taxi cab into an SQLite database called `taxi_sample.db`. This database contains, among other things, the `yellow` table (for yellow taxi cab data).

Write a query that will return the `fare_amount`, `surcharge`, `tip_amount`, and `tolls_amount` as a percentage of `total_amount`.

Now, take into consideration that this query will be evaluating these percentages for 5 of the data files, not just the first file or so. Wow, impressive!

 Use the `sum` aggregate function to calculate the totals, and division to figure out the percentages.
Items to submit
• Code used to solve this problem.

• Output from running the code.

### Question 2

Check out the `payment_type` column. Write a query that counts the number of each type of `payment_type`. The end result should print something like the following.

Output sample
```payment_type, count
CASH, 123```
 You can use aliasing to control the output header names.

Write a query that sums the `total_amount` for `payment_type` of "CASH". What is the total amount of cash payments?

Items to submit
• Code used to solve this problem.

• Output from running the code.

### Question 3

Write a query that gets the largest number of passengers in a single trip. How far was the trip? What was the total amount? Answer all of this in a single query.

Whoa, there must be some erroneous data in the database! Not too surprising. Write a query that explores this more, explain what your query does and how it helps you understand what is going on.

Items to submit
• Code used to solve this problem.

• Output from running the code.

### Question 4

Write a query that gets the average `total_amount` for each year in the database. Which year has the largest average `total_amount`? Use the `pickup_datetime` column to determine the year.

 If you want the headers to be more descriptive, you can use aliases.
Items to submit
• Code used to solve this problem.

• Output from running the code.

### Question 5

What percent of data in our database has information on the location of pickup and dropoff? Examine the data, to see if there is a pattern to the rows with that information and without that information.

 There is a distinct pattern. Pay attention to the date and time of the data.

Confirm your hypothesis with the original data set(s) (in `/depot/datamine/data/taxi/yellow/*.csv`), using bash. This doesn’t have to be anything more thorough than running a simple `head` command with a 1-2 sentence explanation.

 Of course, there will probably be some erroneous data for the latitude and longitude columns. However, you could use the `avg` function on a latitude or longitude column, by year to maybe get a pattern.
Items to submit
• Code used to solve this problem.

• Output from running the code.

 Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.