TDM 10200: Project 13 — Spring 2023

Motivation: Removing 'stopwords' and creating 'WordClouds' to help us get a better feel for a large amount of text. These are first steps in enabling us to understand the sentiment of some text. Such techniques can be helpful when looking at text, for instance, at reviews.

Scope: python, nltk, wordcloud

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

When launching Juypter Notebook on Anvil you will need to use 4 cores.

The following questions will use the following dataset(s):

/anvil/projects/tdm/data/amazon/amazon_fine_food_reviews.csv

Helpful Hints

import pandas as pd
finefood = pd.read_csv("/anvil/projects/tdm/data/amazon/amazon_fine_food_reviews.csv")

ONE

Once you have read in the dataset, take a look at the .head() of the data.

Immediately we see two columns that might be interesting: HelpfulnessNumerator and HelpfulnessDenominator. What do you think those mean, and what would they (potentially) be used for?
What is the user id of review number 23789?
How many duplicate ProfileName values are there? (I am not asking for which values are duplicated but just the total number of duplicated ProfileName values; it is helpful to explain your answer for this one.)

Helpful Hint

df.columnname.duplicated().sum()

Items to submit

Code used to solve this problem.
Output from running the code.
Answer to questions a,b,c

TWO

Now we are going to focus on three more columns:

Score : customer’s product rating
Text : the full review written by the customer

We can see that the rating system is a numerical value in the range 0-5. A rating of 0 is the worst rating available and 5 is the best rating available. We want to start by getting a feel for the ratings, e.g., do we have more negative than positive reviews? The easiest way to see this is to plot the data.

What type of visualization did you choose to represent the score data?
Why did you choose it?
What do you notice about the results?

Insider Information

Common reasons we would want to use data visualizations is to (this is not an exhaustive list)

show change over time (bar charts, line charts, box plots)
compare a part to the whole (pie chart, stacked bar chart, stacked area charts)
we want to see how the data is distributed (bar chart, histogram, box plot, etc., you have freedom to choose and explore)
when we want to compare values amongst different groups (bar chart, dotplot, line chart, grouped bar chart)
when we are observing relationships variables (scatter plot, bubble chart, heatmaps)

Helpful Hint

#for a histogram
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.express as px

fig = px.histogram(df, x="columnname")
fig.update_traces(marker_color="turquoise",marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5)
fig.update_layout(title_text='Whateveryouwanttonameit')
fig.show()

#for a piechart
import matplotlib.pyplot as plt
rating_counts = df["columnname"].value_counts()
plt.pie(rating_counts, labels=rating_counts.index)
plt.title("whateveryouwantotnameit")
plt.show()

Items to submit

Code used to solve this problem.
Output from running the code.

THREE

In the Text column, you will see that there are a lot of commonly used words. Specifically, in English, we are talking about "the", "is", "and" etc. What we want to do is eliminate the unimportant words, so that we can focus on the words that will tell us more information.
An example: "There is a dog over by the creek, laying down by the tree" When we remove the stop words we will get:

There
dog
over
creek
laying
down
tree

This allows us to focus on information that can be used for classification or clustering. There are several Natural Language Processing libraries that you can use to remove stop words.

Stop words with NLTK
Stop words with Gensim
Stop words with SpaCy

Go ahead and remove the stop words from the column "Text" in our dataframe .

Helpful Hint

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

Insider Knowledge

A few resources to read up on about stop words:

Items to submit

Code used to solve this problem.
Output from running the code.
Answer to the question

FOUR

A Word Cloud is a way that we can represent text. The size of each word indicates its frequency/importance.

Now that we have removed the stop words, we can focus on finding the significant words, to see if they are more positive or negative in sentiment.

Create a wordcloud from the column "Text" that should have all the stop words taken out of it.
Are there any additional "stop words" or words that are unimportant to your analysis that you could take out (an example could be cant, gp, br, hef, etc)?
Take out those additional stop words and then create a new wordcloud. What do you notice?

Helpful Hint

from wordcloud import WordCloud

#ways to add new stop words to the list
.append()
.extend(newlist)

Items to submit

Code used to solve this problem.
Output from running the code.
Answer to the questions

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.