STAT 19000: Project 7 — Spring 2021

Motivation: There is one pretty major topic that we have yet to explore in Python — functions! A key component of writing efficient code is writing functions. Functions allow us to reuse coding steps instead of repeating them. If you find you are writing the same code over and over, a function may be a good way to reduce the number of lines of code.

Context: We are taking a small hiatus from our pandas and numpy focused series to learn about and write our own functions in Python!

Scope: python, functions, pandas

Learning objectives
  • Comprehend what a function is, and the components of a function in Python.

  • Differentiate between positional and keyword arguments.

Make sure to read about and use the template found here, and the important information about project submissions here.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/yelp/data/parquet

Questions

Question 1

You’ve been given a path to a folder for a dataset. Explore the files. Give a brief description of the files and what each file contains.

Take a look at the size of each of the files. If you are interested in experimenting, try using pandas' read_json function to read the yelp_academic_dataset_user.json file in the json folder: /class/datamine/data/yelp/data/json/yelp_academic_dataset_user.json. Even with the large amount of memory available to you, this should fail. To make it work, you would need to use the chunksize option to read the data in bit by bit, as sketched below. Now consider that the reviews.parquet file is 0.3 GB larger than yelp_academic_dataset_user.json, yet it can be read in with no problem. That is seriously impressive!
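If you want to try the chunked read, here is a minimal sketch. It assumes the json file is newline-delimited (hence lines=True) and that a chunk of 100,000 rows fits comfortably in memory — both are assumptions you may need to adjust:

import pandas as pd

# read the large json file 100,000 rows at a time instead of all at once
reader = pd.read_json(
    "/class/datamine/data/yelp/data/json/yelp_academic_dataset_user.json",
    lines=True,
    chunksize=100000,
)

for chunk in reader:
    print(chunk.shape) # each chunk is a regular pandas DataFrame
    break # peek at the first chunk only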

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

  • The name of each dataset and a brief summary of each dataset. No more than 1-2 sentences about each dataset.

Question 2

Read the businesses.parquet file into a pandas DataFrame called businesses. Take a look at the hours and attributes columns. If you look closely, you’ll observe that both columns contain a lot more than a single feature. In fact, the attributes column contains 39 features and the hours column contains 7!

len(businesses.loc[:, "attributes"].iloc[0].keys()) # 39
len(businesses.loc[:, "hours"].iloc[0].keys()) # 7

Let’s start by writing a simple function. Create a function called has_attributes that takes a business_id as an argument, and returns True if the business has any attributes and False otherwise. Test it with the following code:

print(has_attributes('f9NumwFMBDn751xgFiRbNA')) # True
print(has_attributes('XNoUzKckATkOD1hP6vghZg')) # False
print(has_attributes('Yzvjg0SayhoZgCljUJRF9Q')) # True
print(has_attributes('7uYJJpwORUbCirC1mz8n9Q')) # False

While this is useful for checking whether or not a single business has any attributes, if you wanted to apply the same check to the entire attributes column/Series, you would just use the notna method:

businesses.loc[:, "attributes"].notna()

Make sure your return value is of type bool. To check this:

type(True) # bool
type("True") # str
Items to submit
  • Python code used to solve the problem.

  • Output from running the provided "test" code.

Question 3

Take a look at the attributes of the first business:

businesses.loc[:, "attributes"].iloc[0]

What is the type of the value? Let’s assume the company you work for gets data formatted like businesses each week, but your boss wants the 39 features in attributes and the 7 features in hours to become their own columns. Write a function called fix_businesses_data that accepts an argument called data_path (of type str), the full path to a parquet file in the exact same format as businesses.parquet. In addition, fix_businesses_data should accept another argument called output_dir (of type str), the path to the directory where you want your "fixed" parquet file to be written. fix_businesses_data should return None.

The result of your function should be a new file called new_businesses.parquet saved in output_dir; the data in this file should no longer contain either the attributes or hours columns. Instead, each row should contain 39+7=46 new columns. Test your function out:

from pathlib import Path
my_username = "kamstut" # replace "kamstut" with YOUR username
fix_businesses_data(data_path="/class/datamine/data/yelp/data/parquet/businesses.parquet", output_dir=f"/scratch/scholar/{my_username}")
# see if output exists
p = Path(f"/scratch/scholar/{my_username}").glob('**/*')
files = [x for x in p if x.is_file()]
print(files)

Make sure that either /scratch/scholar/{my_username} or /scratch/scholar/{my_username}/ (with a trailing slash) will work as the output_dir argument. If you use the pathlib library, as shown in the provided function "skeleton" below, both will work automatically!

import pandas as pd
from pathlib import Path

def fix_businesses_data(data_path: str, output_dir: str) -> None:
    """
    fix_businesses_data accepts a parquet file that contains data in a
    specific format and "explodes" the attributes and hours columns into
    39+7=46 new columns.
    Args:
        data_path (str): Full path to a file in the same format as businesses.parquet.
        output_dir (str): Path to a directory where new_businesses.parquet should be output.
    """
    # read in original parquet file
    businesses = pd.read_parquet(data_path)

    # unnest the attributes column

    # unnest the hours column

    # output new file
    businesses.to_parquet(str(Path(f"{output_dir}").joinpath("new_businesses.parquet")))

    return None

Check out the code below; notice how pathlib handles whether or not we have the trailing /.

from pathlib import Path
print(Path("/class/datamine/data/").joinpath("my_file.txt"))
print(Path("/class/datamine/data").joinpath("my_file.txt"))

You can test out your function on /class/datamine/data/yelp/data/parquet/businesses_sample.parquet to save time while developing.

If we were using R and the tidyverse package, this sort of operation would be called "unnesting". You can read more about it here.

This stackoverflow post should be very useful! Specifically, run this code and take a look at the output:

businesses
businesses.loc[0:4, "attributes"].apply(pd.Series)

Notice that some rows have json, and others have None:

businesses.loc[0, "attributes"] # has json
businesses.loc[2, "attributes"] # has None

This method allows us to handle both cases: if a row has json, each key becomes a column with the corresponding value; if it has None, each new column simply gets a value of None.

Here is an example that shows you how to concatenate (combine) dataframes.
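
Putting these hints together, the "unnest the attributes column" step inside fix_businesses_data might look something like the following sketch (hours works the same way):

# expand each dict in the attributes column into its own set of columns
attributes_df = businesses.loc[:, "attributes"].apply(pd.Series)

# drop the nested column and attach the new columns in its place
businesses = pd.concat([businesses.drop(columns=["attributes"]), attributes_df], axis=1)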

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 4

That’s a pretty powerful function, and could definitely be useful. What if, instead of working on just our specifically formatted parquet file, we wrote a function that worked for any pandas DataFrame? Write a function called unnest that accepts a pandas DataFrame as an argument (let’s call this argument myDF), and a list of columns (let’s call this argument columns), and returns a DataFrame where the provided columns are unnested.

You may write unnest so that the resulting dataframe contains the original dataframe and the unnested columns, or you may return just the unnested columns — both will be accepted solutions.

The following should work:

businesses = pd.read_parquet("/class/datamine/data/yelp/data/parquet/businesses.parquet")
new_businesses_df = unnest(businesses, ["attributes", ])
new_businesses_df.shape # (209393, 39)
new_businesses_df.head()
new_businesses_df = unnest(businesses, ["attributes", "hours"])
new_businesses_df.shape # (209393, 46)
new_businesses_df.head()
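
As a point of comparison, a minimal sketch of unnest that returns just the unnested columns (the second of the two accepted behaviors) could look like this:

import pandas as pd

def unnest(myDF: pd.DataFrame, columns: list) -> pd.DataFrame:
    # expand each requested column of dicts into its own set of columns
    unnested = [myDF.loc[:, col].apply(pd.Series) for col in columns]
    # glue the expanded columns together side by side
    return pd.concat(unnested, axis=1)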
Items to submit
  • Python code used to solve the problem.

  • Output from running the provided code.

Question 5

Try out the code below. If a provided column isn’t already nested, the column name is ruined and the data is changed. If the column doesn’t exist at all, a KeyError is thrown. Modify your function from question (4) to skip a column if it doesn’t exist, and to skip a column if it isn’t nested. Let’s consider a column nested if the value of the column is a dict, and not nested otherwise.

businesses = pd.read_parquet("/class/datamine/data/yelp/data/parquet/businesses.parquet")
new_businesses_df = unnest(businesses, ["doesntexist",]) # KeyError
new_businesses_df = unnest(businesses, ["postal_code",]) # not nested

To test your code, run the following. The result should be a DataFrame where attributes has been unnested, and that is it.

businesses = pd.read_parquet("/class/datamine/data/yelp/data/parquet/businesses.parquet")
results = unnest(businesses, ["doesntexist", "postal_code", "attributes"])
results.shape # (209393, 39)
results.head()

To see if a variable is a dict, you could use type:

my_variable = {'key': 'value'}
type(my_variable) # dict
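
Building on the sketch from question (4), the hardened unnest might look like the following. It checks the first non-null value of each column, since (as we saw in question 3) some rows contain None rather than a dict:

import pandas as pd

def unnest(myDF: pd.DataFrame, columns: list) -> pd.DataFrame:
    unnested = []
    for col in columns:
        # skip columns that don't exist
        if col not in myDF.columns:
            continue
        # skip columns that aren't nested (their values aren't dicts)
        non_null = myDF.loc[:, col].dropna()
        if non_null.empty or not isinstance(non_null.iloc[0], dict):
            continue
        unnested.append(myDF.loc[:, col].apply(pd.Series))
    return pd.concat(unnested, axis=1)

Note that if every requested column gets skipped, pd.concat receives an empty list and raises a ValueError; deciding how to handle that case is left to you.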
Items to submit
  • Python code used to solve the problem.

  • Output from running the provided code.