# STAT 19000: Project 9 — Spring 2021

Motivation: We’ve covered a lot of material in a very short amount of time, and you now have many powerful tools at your disposal. Last semester, in project 14, we used our new skills to build a beer recommendation system, although it is pretty generous to call what we built a recommendation system. In the next couple of projects, we will use our Python skills to build a real beer recommendation system!

Context: At this point in the semester we have a solid grasp of Python basics, and are looking to strengthen our skills with the `pandas` and `numpy` packages by building a data-driven recommendation system for beers.

Scope: python, pandas, numpy

Learning objectives
• Distinguish the differences between numpy, pandas, DataFrames, Series, and ndarrays.

• Use numpy, scipy, and pandas to solve a variety of data-driven problems.

• Demonstrate the ability to read and write data of various formats using various packages.

• View and access data inside DataFrames, Series, and ndarrays.

Make sure to read about and use the template found here, and read the important information about project submissions here.

## Dataset

The following questions will use the dataset found in Scholar:

`/class/datamine/data/beer`

Load the following datasets up and assume they are always available:

```python
import pandas as pd

beers = pd.read_parquet("/class/datamine/data/beer/beers.parquet")
```

## Questions

### Question 1

Write a function called `prepare_data` that accepts an argument called `myDF` that is a `pandas` DataFrame. In addition, `prepare_data` should accept an argument called `min_num_reviews`, an integer representing the minimum number of reviews that both the user and the beer must have to be included in the data. The function `prepare_data` should return a `pandas` DataFrame with the following properties:

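First, remove all rows where `score`, `username`, or `beer_id` is missing, like this:

```python
myDF = myDF.loc[myDF.loc[:, "score"].notna(), :]
myDF = myDF.loc[myDF.loc[:, "username"].notna(), :]
myDF = myDF.loc[myDF.loc[:, "beer_id"].notna(), :]
# reset_index returns a new DataFrame, so assign the result back
myDF = myDF.reset_index(drop=True)
```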

Among the remaining rows, choose the rows of `myDF` that have a user (`username`) and a `beer_id` that each occur at least `min_num_reviews` times in `myDF`.

```python
train = prepare_data(reviews, 1000)
print(train.shape) # (952105, 10)
```
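To illustrate the filtering step on something small, here is a minimal sketch using a made-up DataFrame. The helper name `keep_frequent` and the toy data are hypothetical and not part of the project. The sketch counts occurrences once, on the rows that remain after removing missing values, which is one reasonable reading of the requirement.

```python
import pandas as pd

# Toy data, invented for illustration only (not the beer reviews).
toy = pd.DataFrame({
    "username": ["ann", "ann", "ann", "bob", "bob", "cat", None],
    "beer_id":  [1, 2, 1, 1, 2, 1, 2],
    "score":    [4.0, 3.5, None, 4.5, 2.0, 5.0, 3.0],
})

def keep_frequent(df, min_count):
    """Hypothetical helper: keep rows whose username and beer_id each
    appear at least `min_count` times among the complete rows."""
    # Drop rows that are missing any of the three key columns.
    df = df.dropna(subset=["score", "username", "beer_id"]).reset_index(drop=True)
    # Look up, for every row, how often its username / beer_id occurs.
    user_counts = df["username"].map(df["username"].value_counts())
    beer_counts = df["beer_id"].map(df["beer_id"].value_counts())
    keep = (user_counts >= min_count) & (beer_counts >= min_count)
    return df.loc[keep, :].reset_index(drop=True)

small = keep_frequent(toy, 2)
print(small)
print(small.shape)
```

In this toy data, the row for `cat` is dropped because that user has only one complete review, while `ann` and `bob` each keep two rows.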
We added two examples of how to do this with the election data (instead of the beer review data) in the book: cleaning and filtering data.

Items to submit
• Python code used to solve the problem.

• Output from running your code.

### Question 2

Run the function from question (1), using `train = prepare_data(reviews, 1000)`. The basis of our recommendation system will be to "match" a user to another user with similar taste in beer. Different users will have different means and variances in their scores, so if we are going to compare users' scores, we should standardize them. Update the `train` DataFrame with 1 additional column: `standardized_score`. To calculate the `standardized_score`, take each individual score, subtract the user’s average score, and divide the result by the standard deviation of that user’s scores.

In R, we have the following code:

```r
myDF <- data.frame(a=c(1,2,3,1,2,3), b=c(6,5,4,5,5,5), c=c(9,9,9,8,8,8))
myMean = tapply(myDF$b + myDF$c, myDF$a, mean)
myMeanDF = data.frame(a=as.numeric(names(myMean)), mean=myMean)
myDF = merge(myDF, myMeanDF, by='a')
```

Or you could also use a very handy package called tidyverse in R to do the same thing:

```r
library(tidyverse)
myDF <- data.frame(a=c(1,2,3,1,2,3), b=c(6,5,4,5,5,5), c=c(9,9,9,8,8,8))
myDF %>%
    group_by(a) %>%
    mutate(d=mean(b+c))
```

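In Python, the equivalent is less concise. One way to do it is with `groupby` and `apply`:

```python
import pandas as pd
myDF = pd.DataFrame({"a": [1,2,3,1,2,3], "b": [6,5,4,5,5,5], "c": [9,9,9,8,8,8]})

def summer(data):
    data['d'] = (data['b']+data['c']).mean()  # group mean of b + c
    return data

myDF = myDF.groupby(["a"], group_keys=False).apply(summer)
```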
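Another option for per-group calculations like this is `groupby` combined with `transform`, which returns a result aligned with the original rows. The following is a minimal sketch on made-up data, not the project solution. The toy DataFrame and its values are hypothetical. Note that the `pandas` `std` method uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "username": ["ann", "ann", "ann", "bob", "bob"],
    "score":    [1.0, 2.0, 3.0, 2.0, 4.0],
})

# Group the scores by user; transform broadcasts each group's
# statistic back to every row belonging to that group.
by_user = toy.groupby("username")["score"]
user_mean = by_user.transform("mean")
user_std = by_user.transform("std")  # sample std (ddof=1)

toy["standardized_score"] = (toy["score"] - user_mean) / user_std
print(toy)
```

Here `ann` has mean 2 and standard deviation 1, so her standardized scores are -1, 0, and 1.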

Now create the new `standardized_score` column in `train`: for each review, take the score, subtract that user’s average score, and divide by the standard deviation of that user’s scores. As it may take a minute or two to create this new column, feel free to test your code on a small sample of the `reviews` DataFrame first:

```python
import pandas as pd
```

Don’t forget about the `pandas` DataFrame `std` and `mean` methods.

If you are worried about getting `NA`s, do not worry. The only way we would get `NA`s would be if there is only a single review for the user (which we took care of by limiting to users with at least 1000 reviews), or if there is no variance in a user’s scores (which doesn’t happen).

We added an example about how to do this with the election data in the book: standardizing data example.

Items to submit
• Python code used to solve the problem.

• Output from running your code.

### Question 3

Use the `pivot_table` method from `pandas` to put your `train` data into "wide" format. What this means is that each row in the new DataFrame will be a `username`, and each column will be a `beer_id`. Each cell will contain the `standardized_score` for the given `username` and `beer_id` combination. Call the resulting DataFrame `score_matrix`.
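To make the "wide" format concrete, here is a small sketch on made-up data. The toy DataFrame and the name `wide` are hypothetical. By default, `pivot_table` averages the values if a `username` and `beer_id` pair appears more than once, and leaves `NaN` where a pair never appears.

```python
import pandas as pd

# Toy "long" data, invented for illustration: one row per review.
toy = pd.DataFrame({
    "username": ["ann", "ann", "bob", "cat"],
    "beer_id": [1, 2, 1, 2],
    "standardized_score": [0.5, -0.5, 1.0, -1.0],
})

# One row per username, one column per beer_id.
wide = toy.pivot_table(
    index="username",
    columns="beer_id",
    values="standardized_score",
)
print(wide)
print(wide.shape)
```

In this toy example `bob` never reviewed beer 2 and `cat` never reviewed beer 1, so those cells are missing.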

Items to submit
• Python code used to solve the problem.

• Output the `head` and `shape` of `score_matrix`.

### Question 4

The result from question (3) should be a sparse matrix (lots of missing data!). Let’s fill in the missing data. For now, for each `beer_id`, replace every missing value with the average score for that beer (that is, the mean of that beer’s column in `score_matrix`).
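As a hedged illustration of filling by column mean, the sketch below uses a tiny made-up matrix. The names `wide`, `beer_means`, and `filled` are hypothetical. When `fillna` is given a Series, `pandas` matches the Series index against the DataFrame's column labels, so each column is filled with its own mean.

```python
import numpy as np
import pandas as pd

# Toy wide matrix, invented for illustration: rows are users, columns are beers.
wide = pd.DataFrame(
    {1: [0.5, 1.0, np.nan], 2: [-0.5, np.nan, -1.0]},
    index=["ann", "bob", "cat"],
)

# Column means; missing values are skipped by default.
beer_means = wide.mean(axis=0)

# Fill each column's missing cells with that column's mean.
filled = wide.fillna(beer_means)
print(filled)
```

Here the missing cell for beer 1 becomes 0.75 and the missing cell for beer 2 becomes -0.75.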

The `fillna` method in `pandas` will be very helpful!

Items to submit
• Python code used to solve the problem.

• Output the `head` of `score_matrix`.

Congratulations! Next week, we will complete our recommendation system!