STAT 19000: Project 9 — Spring 2021

Motivation: We’ve covered a lot of material in a very short amount of time, and at this point you have many powerful tools at your disposal. Last semester, in project 14, we used our new skills to build a beer recommendation system (although it is pretty generous to call what we built a recommendation system). In the next couple of projects, we will use our Python skills to build a real beer recommendation system!

Context: At this point in the semester we have a solid grasp of Python basics, and we are looking to build on our skills with the pandas and numpy packages to create a data-driven recommendation system for beers.

Scope: python, pandas, numpy

Learning objectives
  • Distinguish between numpy, pandas, DataFrames, Series, and ndarrays.

  • Use numpy, scipy, and pandas to solve a variety of data-driven problems.

  • Demonstrate the ability to read and write data of various formats using various packages.

  • View and access data inside DataFrames, Series, and ndarrays.

Make sure to read about and use the template found here, as well as the important information about project submissions here.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/beer

Load the following datasets and assume they are always available:

import pandas as pd

beers = pd.read_parquet("/class/datamine/data/beer/beers.parquet")
breweries = pd.read_parquet("/class/datamine/data/beer/breweries.parquet")
reviews = pd.read_parquet("/class/datamine/data/beer/reviews.parquet")

Questions

Question 1

Write a function called prepare_data that accepts an argument called myDF that is a pandas DataFrame. In addition, prepare_data should accept an argument called min_num_reviews that is an integer representing the minimum number of reviews that the user and the beer must each have in order to be included in the data. The function prepare_data should return a pandas DataFrame with the following properties:

First remove all rows where score or username or beer_id is missing, like this:

    myDF = myDF.loc[myDF.loc[:, "score"].notna(), :]
    myDF = myDF.loc[myDF.loc[:, "username"].notna(), :]
    myDF = myDF.loc[myDF.loc[:, "beer_id"].notna(), :]
    myDF = myDF.reset_index(drop=True)

Among the remaining rows, choose the rows of myDF that have a user (username) and a beer_id that each occur at least min_num_reviews times in myDF.
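One possible way to handle this filtering step (just a sketch using value_counts and isin; the column names match the reviews data, but other approaches work equally well):

    # count how many reviews each username and each beer_id has
    user_counts = myDF.loc[:, "username"].value_counts()
    beer_counts = myDF.loc[:, "beer_id"].value_counts()

    # keep only the usernames and beer_ids that appear at least min_num_reviews times
    good_users = user_counts[user_counts >= min_num_reviews].index
    good_beers = beer_counts[beer_counts >= min_num_reviews].index

    # keep the rows where both the username and the beer_id made the cut
    myDF = myDF.loc[myDF.loc[:, "username"].isin(good_users) & myDF.loc[:, "beer_id"].isin(good_beers), :]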

train = prepare_data(reviews, 1000)
print(train.shape) # (952105, 10)

We added two examples of how to do this with the election data (instead of the beer review data) in the book: cleaning and filtering data

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 2

Run the function from question (1), using train = prepare_data(reviews, 1000). The basis of our recommendation system will be to "match" a user to another user with similar taste in beer. Different users will have different means and variances in their scores, so if we are going to compare users’ scores, we should standardize them. Update the train DataFrame with 1 additional column: standardized_score. To calculate the standardized_score, take each individual score, subtract the user’s average score, and divide the result by the standard deviation of the user’s scores.

In R, we have the following code:

myDF <- data.frame(a=c(1,2,3,1,2,3), b=c(6,5,4,5,5,5), c=c(9,9,9,8,8,8))
myMean = tapply(myDF$b + myDF$c, myDF$a, mean)
myMeanDF = data.frame(a=as.numeric(names(myMean)), mean=myMean)
myDF = merge(myDF, myMeanDF, by='a')

Or you could use the very handy tidyverse package in R to do the same thing:

library(tidyverse)
myDF <- data.frame(a=c(1,2,3,1,2,3), b=c(6,5,4,5,5,5), c=c(9,9,9,8,8,8))
myDF %>%
    group_by(a) %>%
    mutate(d=mean(b+c))

Unfortunately, there isn’t quite as tidy a way to do this in Python. One approach is to use groupby together with apply:

myDF = pd.DataFrame({"a": [1, 2, 3, 1, 2, 3], "b": [6, 5, 4, 5, 5, 5], "c": [9, 9, 9, 8, 8, 8]})

def summer(data):
    # within each group of a, store the mean of b + c in a new column d
    data['d'] = (data['b'] + data['c']).mean()
    return data

myDF = myDF.groupby(["a"]).apply(summer)

Create a new column standardized_score by taking each score, subtracting the user’s average score, and dividing by the user’s standard deviation. As it may take a minute or two to create this new column, feel free to test your code on a small sample of the reviews DataFrame first:

import pandas as pd
testDF = pd.read_parquet('/class/datamine/data/beer/reviews_sample.parquet')

Don’t forget about the pandas DataFrame std and mean methods.
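For example, following the same groupby and apply pattern shown above, one possible sketch (this assumes the username and score columns from the reviews data, and is not the only way to do it):

def standardize(data):
    # subtract the user's mean score and divide by the user's standard deviation
    data['standardized_score'] = (data['score'] - data['score'].mean()) / data['score'].std()
    return data

train = train.groupby(["username"]).apply(standardize)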

If you are worried about getting NAs, do not worry. The only ways we could get NAs would be if a user had only a single review (which we took care of by limiting to users with at least 1000 reviews), or if there were no variance in a user’s scores (which doesn’t happen in this data).
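If you want to double-check, a quick sanity check (assuming you named the new column standardized_score) is:

print(train["standardized_score"].isna().sum()) # should print 0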

We added an example about how to do this with the election data in the book: standardizing data example

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 3

Use the pivot_table method from pandas to put your train data into "wide" format: each row in the new DataFrame will correspond to a username, each column will correspond to a beer_id, and each cell will contain the standardized_score for that username and beer combination. Call the resulting DataFrame score_matrix.
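One possible call (a sketch; the argument values assume the column names created in the earlier questions):

score_matrix = train.pivot_table(index="username", columns="beer_id", values="standardized_score")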

Items to submit
  • Python code used to solve the problem.

  • Output the head and shape of score_matrix.

Question 4

The result from question (3) should be a sparse matrix (lots of missing data!). Let’s fill in the missing data. For now, fill in each missing value with the average score for that beer, i.e., the mean of that beer_id’s column in score_matrix.

The fillna method in pandas will be very helpful!
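For instance, one possible sketch (assuming the column means are the averages you want to fill with):

score_matrix = score_matrix.fillna(score_matrix.mean())

Passing a Series to fillna fills each column using the value whose label matches that column, so every missing cell in a beer’s column gets that beer’s average.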

Items to submit
  • Python code used to solve the problem.

  • Output the head of score_matrix.

Congratulations! Next week, we will complete our recommendation system!