STAT 19000: Project 10 — Spring 2021

Motivation: We’ve covered a lot of material in a very short amount of time, and you now have many powerful tools at your disposal. Last semester, in project 14, we used our new skills to build a beer recommendation system, although it is pretty generous to call what we built a recommendation system. In the next couple of projects, we will use our Python skills to build a real beer recommendation system!

Context: This is the third project in a series of projects designed to familiarize you with the pandas and numpy packages. In this project, we build on the previous project to finalize our beer recommendation system.

Scope: python, numpy, pandas

Learning objectives
  • Distinguish between numpy, pandas, DataFrames, Series, and ndarrays.

  • Use numpy, scipy, and pandas to solve a variety of data-driven problems.

  • Demonstrate the ability to read and write data of various formats using various packages.

  • View and access data inside DataFrames, Series, and ndarrays.

Make sure to read about and use the template found here, and the important information about project submissions here.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/beer

Load the following datasets up and assume they are always available:

import pandas as pd

beers = pd.read_parquet("/class/datamine/data/beer/beers.parquet")
breweries = pd.read_parquet("/class/datamine/data/beer/breweries.parquet")
reviews = pd.read_parquet("/class/datamine/data/beer/reviews.parquet")

Project 09 Solution

Below is the solution for the previous project, as we’ll be building on its results and don’t want to leave anybody behind:

def prepare_data(myDF, min_num_reviews):
    # remove rows where the score is NA
    myDF = myDF.loc[myDF.loc[:, "score"].notna(), :]
    # get a list of usernames that have at least min_num_reviews
    usernames = myDF.loc[:, "username"].value_counts() >= min_num_reviews
    usernames = usernames.loc[usernames].index.values.tolist()
    # get a list of beer_ids that have at least min_num_reviews
    beerids = myDF.loc[:, "beer_id"].value_counts() >= min_num_reviews
    beerids = beerids.loc[beerids].index.values.tolist()
    # first remove all rows where the username has less than min_num_reviews
    myDF = myDF.loc[myDF.loc[:, "username"].isin(usernames), :]

    # remove rows where the beer_id has less than min_num_reviews
    myDF = myDF.loc[myDF.loc[:, "beer_id"].isin(beerids), :]

    return myDF

train = prepare_data(reviews, 1000)

def mutate_std_score(data: pd.DataFrame) -> pd.DataFrame:
    """
    mutate_std_score is a function to use with groupby and apply
    to create a new column containing the standardized score.
    Args:
        data (pd.DataFrame): A pandas DataFrame.
    Returns:
        pd.DataFrame: A modified pandas DataFrame.
    """
    data['standardized_score'] = (data['score'] - data['score'].mean())/data['score'].std()
    return data

train = train.groupby(["username"]).apply(mutate_std_score)

score_matrix = pd.pivot_table(train, values='standardized_score', index='username', columns='beer_id')
print(score_matrix.shape)
score_matrix.head()
score_matrix = score_matrix.fillna(score_matrix.mean(axis=0))
score_matrix.head()

Questions

Question 1

If you struggled with or did not do the previous project, or would like to start fresh, please see the solution to the previous project (provided above) and feel free to use it as your own.

Cosine similarity is a measure of similarity between two non-zero vectors, and it is used in a variety of ways in data science. Here is a pretty good article that tries to give some intuition into it. sklearn provides us with a function that calculates cosine similarity:

from sklearn.metrics.pairwise import cosine_similarity

Use the cosine_similarity function on our score_matrix. The result will be a numpy array. Use numpy’s fill_diagonal function to fill the diagonal with 0. Convert the array back to a pandas DataFrame, making sure to manually assign the index of the new DataFrame to be score_matrix.index. Lastly, manually assign the columns to be score_matrix.index as well. The end result should be a matrix with usernames on both the x and y axes, where each cell value represents how "close" one user is to another. Normally the values on the diagonal would be 1, because a user is 100% similar to themselves; to prevent this, we force the diagonal to be 0. Name the final result cosine_similarity_matrix.
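
A minimal sketch of these steps, using the score_matrix built in the Project 09 solution above, might look like the following:

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# pairwise cosine similarity between users (the rows of score_matrix)
similarities = cosine_similarity(score_matrix)

# zero out the diagonal so a user is never counted as their own closest match
np.fill_diagonal(similarities, 0)

# back to a DataFrame with usernames on both axes
cosine_similarity_matrix = pd.DataFrame(similarities, index=score_matrix.index, columns=score_matrix.index)
cosine_similarity_matrix.head()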

Items to submit
  • Python code used to solve the problem.

  • head of cosine_similarity_matrix.

Question 2

Write a function called get_knn that accepts the cosine_similarity_matrix, a username, and a value, k. The function get_knn should return a pandas Series or list containing the usernames of the k most similar users to the input username.

This may sound difficult, but it is not. It really only involves sorting some values and grabbing the first k.
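
One possible sketch, assuming the cosine_similarity_matrix built in Question 1:

def get_knn(similarity_matrix, username, k):
    # similarities between the given user and every other user,
    # sorted from most to least similar
    neighbors = similarity_matrix.loc[username].sort_values(ascending=False)
    # usernames of the k most similar users
    return neighbors.head(k).index.tolist()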

Test it on the following; we demonstrate the output if you return a list:

k_similar = get_knn(cosine_similarity_matrix, "2GOOFY", 4)
print(k_similar) # ['Phil-Fresh', 'mishi_d', 'SlightlyGrey', 'MI_beerdrinker']
Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 3

Let’s test get_knn to see if the results make sense. Pick a user and that user’s most similar user. First, get a DataFrame (let’s call it aux) containing just their reviews. The result should be a DataFrame that looks just like the reviews DataFrame, but contains only those two users’ reviews.

Next, look at aux. Wouldn’t it be nice to get a DataFrame where the beer_id is the row index, the first column contains the scores for the first user, and the second column contains the scores for the second user? Use the pivot_table method to accomplish this, and save the result as aux.

Lastly, use the dropna method to remove all rows where at least one of the users has an NA value. Sort the values in aux using the sort_values method. Take a look at the result and write 1-2 sentences explaining whether or not you think the users rated the beers similarly.

You could also create a scatter plot using the resulting DataFrame. If it is a good match, the plot should look roughly like a positively sloped line.
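
A minimal sketch of this question, assuming we filter the train DataFrame (filtering reviews works similarly) and reuse get_knn from Question 2; the user "2GOOFY" is just an example:

# pick a user and their single most similar user
user_a = "2GOOFY"
user_b = get_knn(cosine_similarity_matrix, user_a, 1)[0]

# reviews from just these two users
aux = train.loc[train["username"].isin([user_a, user_b]), :]

# beer_id as the row index, one column of scores per user
aux = pd.pivot_table(aux, values="score", index="beer_id", columns="username")

# keep only beers both users reviewed, then sort
aux = aux.dropna().sort_values(by=user_a)
print(aux)

# optional scatter plot; a good match should slope upward
aux.plot.scatter(x=user_a, y=user_b)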

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

  • 1-2 sentences explaining whether or not you think the users rated the beers similarly.

Question 4

We are so close, and things are looking good! The next step for our system is to write a function that finds recommendations for a given user. Write a function called recommend_beers that accepts four arguments: the train DataFrame, a username, a cosine_similarity_matrix, and k (how many neighbors to use). The function recommend_beers should return the top 5 recommendations.

Calculate the recommendations by following these steps (a sketch is provided after the list):

  1. Find the k nearest neighbors of the input username.

  2. Get a DataFrame with all of the reviews from train for every neighbor. Let’s call this aux.

  3. Get a list of all beer_id that the user with username has reviewed.

  4. Remove all beers from aux that have already been reviewed by the user with username.

  5. Group by beer_id and calculate the mean standardized_score.

  6. Sort the results in descending order, and return the top 5 beer_ids.
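
One possible sketch of these steps, assuming the get_knn function from Question 2:

def recommend_beers(train, username, cosine_similarity_matrix, k):
    # 1. find the k nearest neighbors of the input user
    neighbors = get_knn(cosine_similarity_matrix, username, k)

    # 2. all of the neighbors' reviews
    aux = train.loc[train["username"].isin(neighbors), :]

    # 3. beers the input user has already reviewed
    reviewed = train.loc[train["username"] == username, "beer_id"].tolist()

    # 4. remove beers the user has already reviewed
    aux = aux.loc[~aux["beer_id"].isin(reviewed), :]

    # 5. mean standardized_score per beer; 6. sorted descending, top 5 beer_ids
    mean_scores = aux.groupby("beer_id")["standardized_score"].mean()
    return mean_scores.sort_values(ascending=False).head(5).index.tolist()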

Test it on the following:

recommend_beers(train, "22Blue", cosine_similarity_matrix, 30) # [40057, 69522, 22172, 59672, 86487]
Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 5

(optional, 0 pts) Improve our recommendation system! Below are some suggestions; don’t feel limited by them:

  • Instead of returning a list of beer_id, return the beer info from the beers dataset (see the sketch after this list).

  • Remove all retired beers.

  • Somehow add a cool plot.

  • Etc.
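
As a starting point for the first two suggestions, a sketch might look like the following. The column names "id" and "retired" are assumptions about the beers dataset; adjust them to whatever columns the dataset actually contains.

recommendations = recommend_beers(train, "22Blue", cosine_similarity_matrix, 30)

# pull the full beer info for the recommended beer_ids
# ("id" is assumed to be the beer id column in the beers DataFrame)
recommended_beers = beers.loc[beers["id"].isin(recommendations), :]

# drop retired beers ("retired" is assumed to be a boolean column)
recommended_beers = recommended_beers.loc[~recommended_beers["retired"], :]
print(recommended_beers)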

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.