TDM 10200: Project 8 — 2023

Motivation: We are going to take a step back and work towards building a beer recommendation system. As you know already, a key part of analysis is being able to write functions.

Context: We will continue to introduce functions and practice doing so!

Scope: python, functions, pandas

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

-/anvil/projects/tdm/data/yelp/data/parquet/

Questions

Let’s first list all of the files in this folder .Helpful Hint

Details
ls /anvil/projects/tdm/data/yelp/data/parquet/

We want to load only two pandas data frames to use in this project. We will not be working with the other data frames. Don’t forget to import pandas!

users = pd.read_parquet("/anvil/projects/tdm/data/yelp/data/parquet/users.parquet")
reviews = pd.read_parquet("/anvil/projects/tdm/data/yelp/data/parquet/reviews.parquet")

You will likely need to use 10 cores for this project because the files for this week are very large.

ONE

Typically we can assume that we surround ourselves with people with similar interests and we would have aligning tastes in restaurants and businesses. Let’s check that by writing a function called get_friends_data with the user_id as an argument. This new function should return a pandas DataFrame with the information in the users DataFrame for each friend of user_id.

In this function we want to add type hints. We have added type hints in most of our functions so far this spring, but now we want to pay attention to them. These type hints provide a formal way to indicate the type of arguments passed to a function in python. You can learn more about python type hints here or this is another good site for information. Go ahead and use type hints into your function. Use one for the user_id and one for the returned data.

These are the three examples that I tested in the videos:

get_friends_data("ntlvfPzc8eglqvk92iDIAw")
get_friends_data("AY-laIws3S7YXNl_f_D6rQ")
get_friends_data("xvu8G900tezTzbbfqmTKvA")
Helpful Hint

a type hint for a string appears as str in our function

get_friends_data(myuserid: str)
Items to submit
  • Code used to answer the question.

  • Result of code.

TWO

Next, we need to write a function called calculate_avg_business_stars that accepts the business_id and returns the average number of stars that the business has received.

Doing this we can use the groupby() function. This is one of the most powerful functions for (more easily) working with data frames in python.

These are the three examples that I tested in the videos:

calculate_avg_business_stars('-MhfebM0QIsKt87iDN-FNw')
calculate_avg_business_stars('5JxlZaqCnk1MnbgRirs40Q')
calculate_avg_business_stars('faPVqws-x-5k2CQKDNtHxw')
Insider Information
  • groupby()- allows us group data according to categories and also can help us compile and summarize data easily.

Items to submit
  • Code used to answer the question.

  • Result of code.

THREE

Next up, we need to write a function called visualize_stars_over_time. It needs to accept a business_id as an argument. It should make a line plot that shows the average number of stars for each year that the business has reviews.

These are the three examples that I tested in the videos:

visualize_stars_over_time("-MhfebM0QIsKt87iDN-FNw")
visualize_stars_over_time("5JxlZaqCnk1MnbgRirs40Q")
visualize_stars_over_time("faPVqws-x-5k2CQKDNtHxw")
Helpful Hint

You will need to import matplotlib.pyplot as plt

FOUR

We are now going to add an argument to our function visualize_stars_over_time that we wrote in question 3. This argument is called granularity. Granularity should accept one of two strings, either "years" or "months" (don’t forget the double quotes!) that will help show the average rating over time.

These are the nine examples that I tested in the videos:

visualize_stars_over_time("-MhfebM0QIsKt87iDN-FNw")
visualize_stars_over_time("-MhfebM0QIsKt87iDN-FNw", "years")
visualize_stars_over_time("-MhfebM0QIsKt87iDN-FNw", "months")

visualize_stars_over_time("5JxlZaqCnk1MnbgRirs40Q")
visualize_stars_over_time("5JxlZaqCnk1MnbgRirs40Q", "years")
visualize_stars_over_time("5JxlZaqCnk1MnbgRirs40Q", "months")

visualize_stars_over_time("faPVqws-x-5k2CQKDNtHxw")
visualize_stars_over_time("faPVqws-x-5k2CQKDNtHxw", "years")
visualize_stars_over_time("faPVqws-x-5k2CQKDNtHxw", "months")
Insider Information

Granularity indicates how much data can be shown on a chart. It can expressed in units of time, it can be - "minute" - "hour" - "day" - "week" - "month" - "year".

Items to submit
  • Code used to answer the question

  • Result of the code

FIVE

Now we continue to modify the function visualize_stars_over_time that we were working on, from questions 3 and 4. We want to add the ability to accept multiple business_ids, and create a line for each id. (It is OK to remove the functionality about "granularity" from Project 4; just make the plots with the yearly summaries, and do not worry about the monthly granularity. You can just remove anything about the granularity.)

This is the example that I tested in the videos:

visualize_stars_over_time("-MhfebM0QIsKt87iDN-FNw", "5JxlZaqCnk1MnbgRirs40Q", "faPVqws-x-5k2CQKDNtHxw")
Items to submit
  • Code used to answer the question

  • Result of the code

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.