STAT 19000: Project 14 — Spring 2022

Motivation: We covered a lot this year! When dealing with data driven projects, it is useful to explore the data, and answer different questions to get a feel for it. There are always different ways one can go about this. Proper preparation prevents poor performance, in this project we are going to practice using some of the skills you’ve learned, and review topics and languages in a generic way.

Context: We are on the final stretch of two projects where there will be an assortment of "random" questions that may involve various datasets (and languages/tools). We may even ask a question that asks you to use a tool you haven’t used before — but don’t worry, if we do, we will provide you with extra guidance.

Scope: Python, R, bash, unix, computers

Learning Objectives
  • Use the cumulative knowledge you’ve built this semester to answer a variety of data-driven questions.

Make sure to read about, and use the template found here, and the important information about projects submissions here.


The following questions will use the following dataset(s):

  • /depot/datamine/data/airbnb/**/reviews.csv.gz

  • /depot/datamine/data/election/itcont2022.txt

  • /depot/datamine/data/death_records/DeathRecords.csv


Question 1

Scan through the reviews.csv.gz files in /depot/datamine/data/airbnb/* and find the 10 most common reviewer_name values.

The pathlib library will be particularly useful here.

In particular, check out the example(s) in the basic use section. The part of /*.py means roughly "every directory and subdirectory, recursively".

You can read csv files directly from .gz files using the argument compression="gzip".

The following is an example of one way you could sum the values of a dictionary.

d1 = {'a': 1, 'b': 2, 'c': 3}
d2 = {'c': 4, 'd': 5, 'a': 6}

result = {key: d1.get(key, 0) + d2.get(key, 0) for key in set(d1) | set(d2)}

Test your code on a few of the reviews.csv.gz files before running it for all files. Running it for all files will take around 3 or so minutes.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

After completing question 1, it is likely you have a solid understanding on how the data is organized. Add some logic to your code from question 1 to instead print the 5 most common names for each country.

If your $HOME country (haha) is in the list — do the names sound about right? What kind of bias does this data likely show?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Checkout the newest set of election data /depot/datamine/data/election/itcont2022.txt. Let’s say we are interested in all entries (rows) that have the word "purdue" in it (of course, this may include entries that don’t relate to Purdue University, but we are okay with that error).

This is around 5 GB of data, and only a small fraction of that has relevant information. In pandas, there is not an ergonomic way to check if a row of data has a string in it. This is where knowing how to use multiple tools will come in handy!

There is a tool called grep that can very quickly search large text files for certain text. We will learn more about grep (and other useful command line utilities) in STAT 29000. With that being said, why not figure out how to use grep to create a subset of data to read into pandas that is already filtered — it isn’t too bad!

Use grep to create a subset of data called my_election_data.txt. my_election_data.txt should contain only the rows that have the word "purdue" in it. my_election_data.txt should live in your $HOME directory: /home/purduealias/my_election_data.txt.

  1. Use grep to find only rows with the word "purdue" in them (case insensitive). Use redirection to save the output to $HOME/my_election_data.txt.

    You can use the -i flag to make your grep search case insensitive — this means that rows with "Purdue" or "purdue" or "PuRdUe" would be found.

    You can run grep from within Jupyter Notebooks using the %%bash magic. For example, the following would find the word "apple" in a dataset and create a new file called "my_new_file.csv" in my $HOME directory.

    grep "apple" /depot/datamine/data/yelp/data/json/yelp_academic_dataset_review.json > $HOME/my_new_file.csv

    In order to insert the header line into your newly created file, you can run the following sed command directly after your grep command.

  2. Use pandas to read in your newly created, much smaller dataset, $HOME/my_election_data.txt.

Finally, print the EMPLOYER, NAME, OCCUPATION, and TRANSACTION_AMT, for the top 10 donations (by size).

You may notice that each row represents a single donation. Group the data by the NAME column to get the total amount of donation per individual. What is the NAME of the top donor?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

What is the average age of death for individuals who were married, single, divorced, widowed, or unknown?

Further split the data by Sex — do the same patterns hold? Dig in a bit and notice that how we look at the data can make a very big difference!

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

It has been a fun year. We hope that you learned something new!

  • Write down 3 (or more) of your least favorite topics and/or projects from this past year (for STAT 19000).

  • Write down 3 (or more) of your favorite projects and/or topics you wish you were able to learn more about.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.