STAT 19000: Project 8 — Spring 2021

Motivation: A key component to writing efficient code is writing functions. Functions allow us to repeat and reuse coding steps that we used previously, over and over again. If you find you are repeating code over and over, a function may be a good way to reduce lots of lines of code. There are some pretty powerful features of functions that we have yet to explore that aren’t necessarily present in R. In this project we will continue to learn about and harness the power of functions to solve data-driven problems.

Context: We are taking a small hiatus from our pandas and numpy focused series to learn about and write our own functions in Python!

Scope: python, functions, pandas

Learning objectives
  • Comprehend what a function is, and the components of a function in Python.

  • Differentiate between positional and keyword arguments.

  • Learn about packing and unpacking variables and arguments.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/yelp/data/parquet

Questions

Question 1

The company you work for is assigning you to the task of building out some functions for the new API they’ve built. Please load these two pandas DataFrames:

users = pd.read_parquet("/class/datamine/data/yelp/data/parquet/users.parquet")
reviews = pd.read_parquet("/class/datamine/data/yelp/data/parquet/reviews.parquet")

You do not need these four DataFrames in this project.

photos = pd.read_parquet("/class/datamine/data/yelp/data/parquet/photos.parquet")
businesses = pd.read_parquet("/class/datamine/data/yelp/data/parquet/businesses.parquet")
checkins = pd.read_parquet("/class/datamine/data/yelp/data/parquet/checkins.parquet")
tips = pd.read_parquet("/class/datamine/data/yelp/data/parquet/tips.parquet")

You would expect that friends may have a similar taste in restaurants or businesses. Write a function called get_friends_data that accepts a user_id as an argument, and returns a pandas DataFrame with the information in the users DataFrame for each friend of user_id. Look at the solutions from the previous project, as well as this page. Add type hints for your function. You should have a type hint for our argument, user_id, as well as a type hint for the returned data. In addition to type hints, make sure to document your function with a docstring.

Every function in the solutions for last week’s projects has a docstring. You can use this as a reference.

You should get the same number of friends for the following code:

print(get_friends_data("ntlvfPzc8eglqvk92iDIAw").shape) # (13,22)
print(get_friends_data("AY-laIws3S7YXNl_f_D6rQ").shape) # (1, 22)
print(get_friends_data("xvu8G900tezTzbbfqmTKvA").shape) # (193,22)

It is sufficient to just load the first of these three examples, when you Knit your project (to save time during Knitting).

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 2

Write a function called calculate_avg_business_stars that accepts a business_id and returns the average number of stars that business received in reviews. Like in question (1) make sure to add type hints and docstrings. In addition, add comments when and if they are necessary.

There is a really cool method that gives us the same "powers" that tapply gives us in R. Use the groupby method from pandas to calculate the average stars for all businesses. Index the result to confirm that your calculate_avg_business_stars function worked properly.

You should get the same average number of start value for the following code:

print(calculate_avg_business_stars("f9NumwFMBDn751xgFiRbNA")) # 3.1025641025641026
Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 3

Write a function called visualize_stars_over_time that accepts a business_id and returns a line plot that shows the average number of stars for each year the business has review data. Like in previous questions, make sure to add type hints and docstrings. In addition, add comments when (and if) necessary. You can test your function with some of these:

visualize_stars_over_time('RESDUcs7fIiihp38-d6_6g')
Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 4

Modify question (3), and add an argument called granularity that dictates whether the plot will show the average rating over years, or months. granularity should accept one of two strings: "years", or "months". By default, if granularity isn’t specified, it should be "years".

visualize_stars_over_time('RESDUcs7fIiihp38-d6_6g', "months")
Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 5

Modify question (4) to accept multiple business_id’s, and create a line for each id. Each of the following should work:

visualize_stars_over_time("RESDUcs7fIiihp38-d6_6g", "4JNXUYY8wbaaDmk3BPzlWw", "months")
visualize_stars_over_time("RESDUcs7fIiihp38-d6_6g", "4JNXUYY8wbaaDmk3BPzlWw", "K7lWdNUhCbcnEvI0NhGewg", "months")
visualize_stars_over_time("RESDUcs7fIiihp38-d6_6g", "4JNXUYY8wbaaDmk3BPzlWw", "K7lWdNUhCbcnEvI0NhGewg", granularity="years")

Use plt.show to decide when to show your complete plot and start anew.

It is sufficient to just load the first of these three examples, when you Knit your project (to save time during Knitting).

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 6

After some thought, your boss decided that using the function from question (5) would get pretty tedious when there are a lot of businesses to include in the plot. You disagree. You think there is a way to pass a list of `business_id`s without modifying your function, but rather how you pass the arguments to the function. Demonstrate how to do this with the list provided:

our_businesses = ["RESDUcs7fIiihp38-d6_6g", "4JNXUYY8wbaaDmk3BPzlWw", "K7lWdNUhCbcnEvI0NhGewg"]
# modify something below to make this work:
visualize_stars_over_time(our_businesses, granularity="years")

Google "python packing unpacking arguments".

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.