STAT 19000: Project 8 — Spring 2022

Motivation: Learning how to wrangle and clean up data using pandas is extremely useful. It takes lots of practice to start to feel comfortable.

Context: At this point in the semester, we have a solid grasp on the basics of Python, and are looking to build our skills using pandas by using pandas to solve data-driven problems.

Scope: Python, pandas

Learning Objectives
  • Distinguish the differences between numpy, pandas, DataFrames, Series, and ndarrays.

  • Demonstrate the ability to use pandas and the built in DataFrame and Series methods to perform some of the most common operations used when data wrangling.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/stackoverflow/unprocessed/2011.csv

Questions

Question 1

Take a look at the datasets in /depot/datamine/data/stackoverflow/unprocessed/. There are a variety of ways this dataset could be cleaned up. In this project, we will figure out how to clean these datasets up a bit, using pandas.

Read in 2011.csv. This is a comma-separated dataset. If you want a given csv file to be easy to parse through using a variety of tools, you should first make sure that the delimiter is a comma, and that there is exactly 1 less comma in each row than there is columns.

Print the columns of the dataset that have commas in their content. Which columns have commas in their content, and how many commas are in each column, total?

Results should look like:

Output
Which best describes the size of your company?: 821.0
Which of the following best describes your occupation?: 210.0
Unnamed: 18: 1940.0
...

Remember, you can use string methods on a column with string data using the .str attribute in pandas. See here for more information.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

It looks like there are a lot of commas in a lot of columns in our dataset. This could make it more difficult to parse this dataset than necessary.

For example, skip to question 3 and read the first paragraph. Not all analysis is done in R or Python!

Perform the same operations in question (1), but instead of looking for commas in the content of each column, look for semi-colons.

Given the fact that we want our dataset easy to parse, and given what we know about the usage of commas and semi-colons, what would you suggest we do to clean up this dataset, and why?

Hopefully, your answer was to convert instances of "internal" semi-colons to commas, so there are no remaining "internal" semi-colons, and only commas. This way, you can export the entire dataset to a dsv (delimiter-separated value) with semi-colons as the delimiter instead of a comma.

Then, convert all instances of semi-colons to commas.

Double check by re-running your code that checks for semi-colons, to make sure they no longer exist.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

You may have noticed some other low-hanging fruit that could be cleaned up. You will notice that some columns have "Unnamed: X" (where X is a number) and essentially either contain some value, or are empty. These columns represent a potential answer to the most previous column not named "Unnamed: X" (again, where X is a number).

Instead of having a separate column for each potential answer (as it currently is), instead, it would be much better to have a single column where each row could contain one of the categorical values that was originally shown in each of the "Unnamed: X" columns. Let’s do the following for each set of "Unnamed: X" columns.

  1. For each column, if the value is not pd.NA, then append it to a comma-separated list of values in the original question column.

    So, for example, given the following example, we would want the following result.

    Original question?; Unnamed: 1; Unnamed: 2; Unnamed: 3
    answerA;answerB;answerC;answerD
    answerA; NA; NA; answerD
    Original question?
    answerA,answerB,answerC,answerD
    answerA,answerD

    However, we would expect all such columns to be combined, for each set of columns where the potential answer is broken into multiple columns.

    Remove commas in the given columns prior to pasting them together with commas.

  2. After, and only after the columns have been combined, remove the "Unnamed: X" columns. That data is now redundant.

The original question column will be where the rest of the columns data is stored. Don’t forget to include the value in the original question column in the final list of answers.

You could use whether or not "unname" is in the column name to find and combine data as described above.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4 (optional, 0 pts)

In the previous questions, you were able to greatly simplify the dataset. This is great, however, let’s try and automate this process in case we were to ever receive a dataset like this, but with different column names and values. Assume things would be in the same format, so a question with multiple choice answers will have columns called "Unnamed: X", immediately following the column with the actual question.

Write a function called fix_columns that accepts a pandas DataFrame as an argument, changes all instances of semi colons to a comma within the "Unnamed: X" columns, and changes the column names as described above (including the eventual removal of the "Unnamed: X" columns).

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Calculate a breakdown of the column "Which languages are you proficient in?". Create a graphic using the plotting package of your choice, showing the number of people who are proficient in the top 10 named languages (in order of most to least). Create this graphic using the cleaned up 2011 data.

Remember, if you are using plotly, be sure to set renderer="jpg" so that your image appears in the notebook in Gradescope. If you notebook does not appear in Gradescope, you will not receive full credit.

  1. You can now use string methods on that column to get the languages.

  2. There is a special Counter dict that could be useful.

from collections import Counter

my_counter = Counter(['first', 'second', 'third', 'third', 'third'])
my_counter.update(['first', 'first', 'second'])
my_counter
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.