# STAT 19000: Project 10 — Spring 2021

Motivation: We’ve covered a lot of material in a very short amount of time. At this point in time, you have so many powerful tools at your disposal. Last semester in project 14 we used our new skills to build a beer recommendation system. It is pretty generous to call what we built a recommendation system. In the next couple of projects, we will use our Python skills to build a real beer recommendation system!

Context: This is the third project in a series designed to teach the `pandas` and `numpy` packages. In this project we build on the previous project to finalize our beer recommendation system.

Scope: python, numpy, pandas

Learning objectives
• Distinguish the differences between numpy, pandas, DataFrames, Series, and ndarrays.

• Use numpy, scipy, and pandas to solve a variety of data-driven problems.

• Demonstrate the ability to read and write data of various formats using various packages.

• View and access data inside DataFrames, Series, and ndarrays.

Make sure to read about and use the template found here, and read the important information about project submissions here.

## Dataset

The following questions will use the dataset found in Scholar:

`/class/datamine/data/beer`

Load the following datasets and assume they are always available:

```python
import pandas as pd

beers = pd.read_parquet("/class/datamine/data/beer/beers.parquet")
```

### Project 09 Solution

Below is the solution for the previous project, as we’ll be using its methods and don’t want to leave anybody behind:

```python
def prepare_data(myDF, min_num_reviews):
    # remove rows where score is na
    myDF = myDF.loc[myDF.loc[:, "score"].notna(), :]

    # get a list of usernames that have at least min_num_reviews
    usernames = myDF.loc[:, "username"].value_counts() >= min_num_reviews
    usernames = usernames.loc[usernames].index.values.tolist()

    # get a list of beer_ids that have at least min_num_reviews
    beerids = myDF.loc[:, "beer_id"].value_counts() >= min_num_reviews
    beerids = beerids.loc[beerids].index.values.tolist()

    # first remove all rows where the username has less than min_num_reviews
    myDF = myDF.loc[myDF.loc[:, "username"].isin(usernames), :]

    # remove rows where the beer_id has less than min_num_reviews
    myDF = myDF.loc[myDF.loc[:, "beer_id"].isin(beerids), :]

    return myDF

train = prepare_data(reviews, 1000)
```
```python
def mutate_std_score(data: pd.DataFrame) -> pd.DataFrame:
    """
    mutate_std_score is a function to use in conjunction with
    pd.apply and pd.groupby to create a new column that is
    the standardized score.
    Args:
        data (pd.DataFrame): A pandas DataFrame.
    Returns:
        pd.DataFrame: A modified pandas DataFrame.
    """
    data['standardized_score'] = (data['score'] - data['score'].mean())/data['score'].std()
    return data
```

```python
score_matrix = pd.pivot_table(train, values='standardized_score', index='username', columns='beer_id')
print(score_matrix.shape)
```

```python
score_matrix = score_matrix.fillna(score_matrix.mean(axis=0))
```

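To see how the pieces of the solution fit together, here is a minimal sketch on a tiny, made-up DataFrame (`toy` and `toy_matrix` are hypothetical names, not part of the project data). It assumes the scores are standardized per `username`, and it uses `transform` rather than `apply` only to keep the example short. It is an illustration under those assumptions, not the official solution.

```python
import pandas as pd

# Hypothetical toy reviews; the real data has many more rows and columns.
toy = pd.DataFrame({
    "username": ["ann", "ann", "ann", "bob", "bob", "cat", "cat"],
    "beer_id":  [1, 2, 3, 1, 2, 2, 3],
    "score":    [4.0, 3.0, 5.0, 2.0, 4.0, 3.5, 4.5],
})

# Standardize each user's scores (assumption: groups are per username).
grouped = toy.groupby("username")["score"]
toy["standardized_score"] = (
    toy["score"] - grouped.transform("mean")
) / grouped.transform("std")

# Rows are users, columns are beers, cells are standardized scores.
toy_matrix = pd.pivot_table(
    toy, values="standardized_score", index="username", columns="beer_id"
)

# Replace each missing cell with the mean of its beer column.
toy_matrix = toy_matrix.fillna(toy_matrix.mean(axis=0))
print(toy_matrix)
```

After the last step there are no missing values left, which matters because cosine similarity cannot be computed on rows that contain `NaN`.
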
## Questions

### Question 1

If you struggled with or did not do the previous project, or would like to start fresh, please see the solutions to the previous project (will be posted Saturday morning) and feel free to use them as your own.

Cosine similarity is a measure of similarity between two non-zero vectors. It is used in a variety of ways in data science. Here is a pretty good article that tries to give some intuition into it. `sklearn` provides us with a function that calculates cosine similarity:

``from sklearn.metrics.pairwise import cosine_similarity``

Use the `cosine_similarity` function on our `score_matrix`. The result will be a `numpy` array. Use the `fill_diagonal` function from `numpy` to fill the diagonal with 0. Convert the array back to a `pandas` DataFrame, and manually assign both the index and the columns of the new DataFrame to be `score_matrix.index`. The end result should be a matrix with usernames on both the x and y axes. Each cell value represents how "close" one user is to another. Normally the values on the diagonal would be 1 because every user is 100% similar to themselves. To prevent this, we force the diagonal to be 0. Name the final result `cosine_similarity_matrix`.
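
For intuition, the sketch below computes cosine similarity by hand with `numpy` on a made-up 3-user matrix (`toy_scores` and `toy_similarity` are hypothetical names). Each row is scaled to unit length, and the dot product of two unit rows is their cosine similarity. This is the same quantity `cosine_similarity` returns for each pair of rows. It is meant as an illustration, not a replacement for the `sklearn` function.

```python
import numpy as np
import pandas as pd

# Hypothetical score matrix: rows are users, columns are beers.
toy_scores = pd.DataFrame(
    [[1.0, 0.0, 1.0],
     [2.0, 0.0, 2.0],
     [0.0, 3.0, 0.0]],
    index=["ann", "bob", "cat"],
    columns=[101, 102, 103],
)

values = toy_scores.to_numpy()

# Scale every row to unit length, then take all pairwise dot products.
norms = np.linalg.norm(values, axis=1, keepdims=True)
unit_rows = values / norms
sim = unit_rows @ unit_rows.T

# np.fill_diagonal modifies the array in place and returns None.
np.fill_diagonal(sim, 0)

# Label both axes with the usernames.
toy_similarity = pd.DataFrame(
    sim, index=toy_scores.index, columns=toy_scores.index
)
print(toy_similarity)
```

Here `ann` and `bob` point in the same direction (one is just twice the other), so their similarity is 1, while `cat` shares no beers with either and gets 0.
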

Items to submit
• Python code used to solve the problem.

• `head` of `cosine_similarity_matrix`.

### Question 2

Write a function called `get_knn` that accepts the `cosine_similarity_matrix`, a `username`, and a value, `k`. The function `get_knn` should return a `pandas` Series or list containing the usernames of the `k` most similar users to the input `username`.

 This may sound difficult, but it is not. It really only involves sorting some values and grabbing the first `k`.
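
As a hint at the sorting idea, here is a minimal sketch on a made-up similarity matrix. The helper name `top_k_neighbors` and the toy data are hypothetical; your `get_knn` may look different.

```python
import pandas as pd

# Hypothetical similarity matrix with zeros on the diagonal.
users = ["ann", "bob", "cat", "dan"]
toy_similarity = pd.DataFrame(
    [[0.0, 0.9, 0.2, 0.5],
     [0.9, 0.0, 0.4, 0.1],
     [0.2, 0.4, 0.0, 0.7],
     [0.5, 0.1, 0.7, 0.0]],
    index=users,
    columns=users,
)

def top_k_neighbors(similarity: pd.DataFrame, username: str, k: int) -> list:
    """Return the k usernames with the largest similarity to `username`."""
    row = similarity.loc[username, :]
    # Largest similarity first, then keep the first k labels.
    return row.sort_values(ascending=False).head(k).index.tolist()

print(top_k_neighbors(toy_similarity, "ann", 2))  # ['bob', 'dan']
```

Because the diagonal is 0, a user is never returned as their own neighbor unless `k` is as large as the number of users.
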

Test it on the following; we demonstrate the output if you return a list:

```python
k_similar = get_knn(cosine_similarity_matrix, "2GOOFY", 4)
print(k_similar) # ['Phil-Fresh', 'mishi_d', 'SlightlyGrey', 'MI_beerdrinker']
```
Items to submit
• Python code used to solve the problem.

• Output from running your code.

### Question 3

Let’s test `get_knn` to see if the results make sense. Pick out a user, and the most similar other user. First, get a DataFrame (let’s call it `aux`) containing just their reviews. The result should be a DataFrame that looks just like the `reviews` DataFrame, but just contains your users' reviews.

Next, look at `aux`. Wouldn’t it be nice to get a DataFrame where the `beer_id` is the row index, the first column contains the scores for the first user, and the second column contains the scores for the second user? Use the `pivot_table` method to accomplish this, and save the result as `aux`.

Lastly, use the `dropna` method to remove all rows where at least one of the users has an `NA` value. Sort the values in `aux` using the `sort_values` method. Take a look at the result and write 1-2 sentences explaining whether or not you think the users rated the beers similarly.

You could also create a scatter plot using the resulting DataFrame. If it is a good match, the plot should look like a positively sloped line.
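
The filter, pivot, drop, and sort steps above can be sketched on a tiny, made-up DataFrame. The data and usernames are hypothetical, and pivoting on `score` (rather than `standardized_score`) is an assumption.

```python
import pandas as pd

# Hypothetical reviews for three users.
toy_reviews = pd.DataFrame({
    "username": ["ann", "ann", "ann", "bob", "bob", "bob", "cat"],
    "beer_id":  [1, 2, 3, 1, 2, 4, 1],
    "score":    [4.5, 3.0, 4.0, 4.0, 2.5, 5.0, 1.0],
})

# Keep only the reviews written by the two users being compared.
pair = ["ann", "bob"]
aux = toy_reviews.loc[toy_reviews["username"].isin(pair), :]

# One row per beer, one column per user.
aux = pd.pivot_table(aux, values="score", index="beer_id", columns="username")

# Keep beers that both users reviewed, ordered by the first user's score.
aux = aux.dropna().sort_values(by="ann")
print(aux)
```

Only beers 1 and 2 survive `dropna`, since those are the only beers both users reviewed. In this toy case both users rank beer 1 above beer 2.
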
Items to submit
• Python code used to solve the problem.

• Output from running your code.

• 1-2 sentences explaining whether or not you think the users rated the beers similarly.

### Question 4

We are so close, and things are looking good! The next step for our system is to write a function that finds recommendations for a given user. Write a function called `recommend_beers` that accepts four arguments: the `train` DataFrame, a `username`, a `cosine_similarity_matrix`, and `k` (how many neighbors to use). The function `recommend_beers` should return the top 5 recommendations.

Calculate the recommendations as follows:

1. Find the `k` nearest neighbors of the input `username`.

2. Get a DataFrame with all of the reviews from `train` for every neighbor. Let’s call this `aux`.

3. Get a list of all `beer_id` that the user with `username` has reviewed.

4. Remove all beers from `aux` that have already been reviewed by the user with `username`.

5. Group by `beer_id` and calculate the mean `standardized_score`.

6. Sort the results in descending order, and return the top 5 `beer_id`s.
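
Steps 2 through 6 can be sketched on made-up data as shown below. The helper `recommend_from_neighbors` is hypothetical: it takes the list of neighbors as an input instead of computing it (step 1), and the toy `standardized_score` values are invented for illustration.

```python
import pandas as pd

# Hypothetical training data with pre-computed standardized scores.
toy_train = pd.DataFrame({
    "username": ["ann", "ann", "bob", "bob", "bob", "cat", "cat", "cat", "dan"],
    "beer_id":  [1, 2, 1, 3, 4, 2, 3, 5, 6],
    "standardized_score": [0.5, -0.5, 0.2, 1.0, -1.0, 0.1, 0.6, 0.9, 2.0],
})

def recommend_from_neighbors(train, username, neighbors, n=5):
    """Return up to n beer_ids rated highest, on average, by the neighbors."""
    # Step 2: all reviews written by the neighbors.
    aux = train.loc[train["username"].isin(neighbors), :]
    # Step 3: beers the target user has already reviewed.
    seen = train.loc[train["username"] == username, "beer_id"].unique()
    # Step 4: drop the beers the target user already knows.
    aux = aux.loc[~aux["beer_id"].isin(seen), :]
    # Step 5: mean standardized score per beer.
    mean_scores = aux.groupby("beer_id")["standardized_score"].mean()
    # Step 6: best beers first, keep the top n.
    return mean_scores.sort_values(ascending=False).head(n).index.tolist()

print(recommend_from_neighbors(toy_train, "ann", ["bob", "cat"], n=2))  # [5, 3]
```

Beer 6 is never recommended here even though it has the highest score, because `dan` is not among the neighbors passed in.
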

Test it on the following:

``recommend_beers(train, "22Blue", cosine_similarity_matrix, 30) # [40057, 69522, 22172, 59672, 86487]``
Items to submit
• Python code used to solve the problem.

• Output from running your code.

### Question 5

(optional, 0 pts) Improve our recommendation system! Below are some suggestions, but don’t feel limited by them:

• Instead of returning a list of `beer_id`, return the beer info from the `beers` dataset.

• Remove all retired beers.

• Somehow add a cool plot.

• Etc.

Items to submit
• Python code used to solve the problem.

• Output from running your code.