STAT 19000: Project 13 — Spring 2022

Motivation: We covered a lot this year! When dealing with data-driven projects, it is useful to explore the data and answer different questions to get a feel for it. There are always different ways one can go about this. Proper preparation prevents poor performance; in this project, we are going to practice using some of the skills you’ve learned and review topics and languages in a generic way.

Context: We are in the final stretch: the last two projects consist of an assortment of "random" questions that may involve various datasets (and languages/tools). We may even ask a question that requires a tool you haven’t used before, but don’t worry: if we do, we will provide you with extra guidance.

Scope: Python, R, bash, unix, computers

Learning Objectives
  • Use the cumulative knowledge you’ve built this semester to answer a variety of data-driven questions.

Make sure to read about, and use, the template found here, and the important information about project submissions here.

Dataset(s)

The questions below will use the following dataset(s):

  • /depot/datamine/data/flights/subset/2008.parquet

  • /depot/datamine/data/coco/unlabeled2017/000000000008.jpg

  • /depot/datamine/data/movies_and_tv/imdb.db

Questions

Answer the questions below using the language of your choice (R, Python, bash, awk, etc.). Don’t feel limited to one language; you can use different languages for different questions (one language may make a given task easier than another). If you are feeling bold, you can also try answering the questions using all of the languages!

Question 1

Read the file 2008.parquet into a pandas dataframe and convert the DepTime column to a datetime. Print the first 50 converted values. They should match the following.

results
0     13:43:00
1     11:25:00
2     20:09:00
3     09:03:00
4     14:23:00
5     20:24:00
6     17:53:00
7     06:22:00
8     19:44:00
9     14:53:00
10    20:30:00
11    07:08:00
12    17:49:00
13    12:17:00
14    09:54:00
15    17:58:00
16    22:10:00
17    07:40:00
18    10:11:00
19    16:12:00
20    11:24:00
21    08:24:00
22    21:12:00
23    06:41:00
24    17:13:00
25    14:14:00
26    19:15:00
27    09:29:00
28    12:14:00
29    13:18:00
30    17:35:00
31    09:04:00
32    15:55:00
33    08:07:00
34    09:26:00
35    15:44:00
36    19:31:00
37    22:06:00
38    06:58:00
39    12:54:00
40    13:43:00
41    00:13:00
42         NaT
43    08:40:00
44    20:58:00
45    10:35:00
46    15:29:00
47    17:13:00
48    06:29:00
49    13:15:00
Name: DepTime, dtype: object

The apply method from the pandas library can be useful here. You can also use the string method zfill to zero-pad a string.
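
For example, here is a minimal sketch using apply and zfill. It assumes DepTime is stored as a number like 1343.0 meaning 13:43 (verify this against the actual data).

import pandas as pd

df = pd.read_parquet("/depot/datamine/data/flights/subset/2008.parquet")

def to_time(value):
    # missing departure times should stay missing (NaT)
    if pd.isna(value):
        return pd.NaT
    # zero-pad to 4 digits so e.g. 13.0 becomes "0013" (00:13)
    text = str(int(value)).zfill(4)
    # note: some flight data encodes midnight as 2400, which would
    # need special handling before parsing with %H%M
    return pd.to_datetime(text, format="%H%M").time()

print(df["DepTime"].apply(to_time).head(50))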

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

Use the PIL package to loop through the pixels in this image, /depot/datamine/data/coco/unlabeled2017/000000000008.jpg, and use f-strings to report the percentage of pixels that are "mostly green", "mostly blue", and "mostly red". The output should look like the following.

results
red: 2.66%
green: 9.88%
blue: 87.46%

To view the image:

from IPython.display import Image
Image("/depot/datamine/data/coco/unlabeled2017/000000000008.jpg")
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

List the number of titles by year premiered in the imdb.db database. Don’t know SQL? That is 100% fine! Read the documentation here, and work with pandas dataframes.

Can’t figure out how to go through all of the data without the kernel crashing? That’s okay! If you don’t want to tackle that right now, it is okay to simply give the results for the first 10k movies:

sample of expected results for first 10k
 	type
premiered
1892 	3
1893 	1
1894 	6
1895 	19
1896 	104
1897 	37
1898 	45
1899 	47
1900 	82
1901 	35
1902 	36
1903 	57
1904 	21
1905 	32
1906 	41
1907 	49
1908 	157
1909 	306
1910 	362
1911 	508
1912 	600
1913 	978
1914 	1225
1915 	1465
1916 	1235
1917 	1200
1918 	1015
1919 	307
1920 	15
1921 	5
1922 	2
1925 	4
1936 	1

If you want to process the entire table of the database, great! The key is to use the chunksize argument. This returns an iterator (something you can loop over). If you set chunksize=10000, each iteration of your loop yields a dataframe with 10000 rows! Simply group by premiered and count the values in each chunk, then combine everything with pd.concat and sum (a sketch follows the sample below). The results (a sample, at least):

sample of results
premiered
1874.0     1.0
1877.0     1.0
1878.0     2.0
1881.0     1.0
1883.0     1.0
          ...
2024.0    66.0
2025.0    14.0
2026.0     9.0
2027.0     6.0
2028.0     3.0
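
Here is a sketch of that chunked approach, taking the titles table and premiered column from the SQL example below.

import sqlite3
import pandas as pd

con = sqlite3.connect("/depot/datamine/data/movies_and_tv/imdb.db")

# read the table 10000 rows at a time and count titles per year in each chunk
partial_counts = []
for chunk in pd.read_sql("SELECT premiered FROM titles", con, chunksize=10000):
    partial_counts.append(chunk.groupby("premiered").size())

con.close()

# combine the per-chunk counts, then sum them by year
print(pd.concat(partial_counts).groupby(level=0).sum())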

Want to use SQL? Okay! You can run SQL queries on this database from within a Jupyter Notebook. First, load the extension and connect in one cell:

%load_ext sql
%sql sqlite:////depot/datamine/data/movies_and_tv/imdb.db

Then run queries in a new cell, with %%sql as its first line:

%%sql

SELECT * FROM titles LIMIT 5;

This section will be helpful!

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Check out the following two datasets:

  • /depot/datamine/data/okcupid/filtered/users.csv

  • /depot/datamine/data/okcupid/filtered/questions.csv

How many men (as defined by the gender2 column) believe and don’t believe in ghosts? How about women?
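
One possible approach is sketched below. The delimiter of questions.csv, its text column name, and the ghost question’s identifier are all assumptions (hypothetical), so inspect both files before relying on any of them.

import pandas as pd

# hypothetical sketch -- delimiters and column names are guesses; check the files first
questions = pd.read_csv("/depot/datamine/data/okcupid/filtered/questions.csv", sep=";")
users = pd.read_csv("/depot/datamine/data/okcupid/filtered/users.csv")

# search the question text for "ghost" to find the matching question id,
# assuming the question text lives in a column named "text"
print(questions[questions["text"].str.contains("ghost", case=False, na=False)])

# suppose the question id turns out to be "q997" (hypothetical); then
# cross-tabulate the answers to that question against gender2
print(pd.crosstab(users["gender2"], users["q997"]))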

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Get the total dollar amount of liquor sold in the /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt dataset.

This dataset is about 3.5 GB in size, which is more than you will be able to load into a pandas dataframe in our Jupyter Notebooks. You’ll have to explore a different strategy to solve this!
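
One such strategy is to stream the file in chunks so it never sits in memory all at once. The sketch below assumes a ";" delimiter and a "Sale (Dollars)" column; verify both against the file’s header before using it.

import pandas as pd

path = "/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt"
total = 0.0

# read 100000 rows at a time so the 3.5 GB file never loads all at once
for chunk in pd.read_csv(path, sep=";", usecols=["Sale (Dollars)"], chunksize=100000):
    # strip a leading "$" if present, then add up the numeric values
    values = pd.to_numeric(chunk["Sale (Dollars)"].astype(str).str.lstrip("$"), errors="coerce")
    total += values.sum()

print(f"Total: ${total:,.2f}")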

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.