STAT 19000: Project 4 — Spring 2022

Motivation: Up until this point we’ve utilized bits and pieces of the pandas library to perform various tasks. In this project we will formally introduce pandas and numpy, and utilize their capabilities to solve data-driven problems.

Context: By now you’ll have had some limited exposure to pandas. This is the first in a three project series that covers some of the main components of both the numpy and pandas libraries. We will take a two project intermission to learn about functions, and then continue.

Scope: python, pandas

Learning Objectives
  • Distinguish the differences between numpy, pandas, DataFrames, Series, and ndarrays.

  • Use numpy, scipy, and pandas to solve a variety of data-driven problems.

  • Demonstrate the ability to read and write data of various formats using various packages.

  • View and access data inside DataFrames, Series, and ndarrays.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/stackoverflow/unprocessed/2021.csv

Questions

Question 1

The following is an example showing how to time creating a dictionary in two different ways.

from block_timer.timer import Timer

with Timer(title="Using dict to declare a dict") as t1:
    my_dict = dict()

with Timer(title="Using {} to declare a dict") as t2:
    my_dict = {}

# or if you need more fine-tuned values
print(t1.elapsed)
print(t2.elapsed)

There are a variety of ways to store, read, and write data. The most common is probably still csv data. csv data is simple, and easy to understand, however, it is a horrible format to read, write, and store. It is slow to read. It is slow to write. It takes up a lot of space.

Luckily, there are some other great options!

Check out the pandas documentation showing the various methods used to read and write data: pandas.pydata.org/docs/reference/io.html

Read in the 2021.csv file into a pandas DataFrame called my_df. Use the Timer to time writing my_df out to /scratch/brown/ALIAS/2021.csv, /scratch/brown/ALIAS/2021.parquet, and /scratch/brown/ALIAS/2021.feather.

Make sure to replace "ALIAS" with your purdue alias.

Use f-strings to print how much faster writing the parquet format was than the csv format, as a percentage.

Use f-strings to print how much faster writing the feather format was than the csv format, as a percentage.

You should now have 3 files in your $SCRATCH directory: 2021.csv, 2021.parquet, and 2021.feather.

Use the Timer to time reading in the 2021.csv, 2021.feather, and 2021.parquet files into pandas DataFrames called my_df.

Use f-strings to print how much faster reading the parquet format was than the csv format, as a percentage.

Use f-strings to print how much faster reading the feather format was than the csv format, as a percentage.

Round percentages to 1 decimal place. See here for examples on how to do this using f-strings.

Finally, how much space does each file take up? Use f-strings to print the size in MB.

There are a couple of options on how to get file size, here.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

If you haven’t already, please check out and walk through the 10 minute intro to pandas. It is a really great way to get started using pandas.

Also, check out this and this.

A method is a function that is associated with a particular class. For example, mean is a method of the pandas DataFrame object.

# myDF is an object of class DataFrame
# mean is a method of the DataFrame class
myDF.mean()

Typically, when using pandas, you will be working with either a DataFrame or a Series. The DataFrame class is what you would normally think of when you think about a data frame. A Series is essentially 1 column or row of data. In pandas, both Series and DataFrames have methods that perform various operations.

Use indexing and the value_counts method to get and print the count of Gender for survey respondents from Indiana.

Next, use the plot method to generate a plot. Use the rot option of the plot method to rotate the x-labels so they are displayed vertically.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Let’s figure out whether or not YearsCode is associated with ConvertedCompYearly. Get an array of unique values for the YearsCode column. As you will notice, there are some options that are not numeric values! In fact, when we read in the data, because of these values ("Less than 1 year", "More than 50 years", etc.), pandas was unable to choose an appropriate data type for that column of data, and set it to "Object". Use the following code to convert the column to a string.

my_df['YearsCode'] = my_df['YearsCode'].astype("str")

Great! Now that column contains strings. Use the replace method with regex=True to replace all non numeric values with nothing!

my_df["YearsCode"] = my_df['YearsCode'].replace("[^0-9]", "", regex=True)

Next, use the astype method to convert the column to "int64".

Finally, use the plot method to plot the YearsCode on the x-axis and ConvertedCompYearly on the y-axis. Use the kind argument to make it a "scatter" plot and set the logy=True, so large salaries don’t ruin our plot.

Write 1-2 sentences with any observations you may have.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Check out the LanguageHaveWorkedWith column. It contains a semi-colon separated list of languages that the respondent has worked with. Pretty cool.

How many times is each language listed? If you get stuck, refer to the hints below. What languages have you worked with from this list?

You can start by converting the column to strings.

my_df['LanguageHaveWorkedWith'] = my_df['LanguageHaveWorkedWith'].astype(str)

This function can be used to "flatten" a list of lists.

def flatten(t):
    return [item for sublist in t for item in sublist]

flatten([[1,2,3],[4,5,6]])
Output
[1, 2, 3, 4, 5, 6]

You can apply any of the Python string methods to an entire column of strings in pandas. For example, I could replace every instance of "hello" with nothing as follows.

myDF['some_column_of_strings'].str.replace("hello", "")

Check out the split string method.

You could use a dict to count each of the languages, or, since this is a pandas project, you could convert the list to a pandas Series and use the value_counts method!

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

pandas really helps out when it comes to working with data in Python. This is a really cool dataset, use your newfound skills to do a mini-analysis. Your mini-analysis should include 1 or more graphics, along with some interesting observation you made while exploring the data.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.