TDM 30100: Project 4 - Exploratory Data Analysis and Visualization

Project Objectives

In this project, you will work through a dataset of audio-based features for music on Spotify. Music and streaming-service data can be represented in many ways beyond raw sound or audio formats: as general statistics on users and on the available songs or services, and as song characteristics described by numerical values. Throughout the project, you will analyze these data by applying concepts from exploratory data analysis (EDA), visualization, and distributions, and see how computing relationships between variables can provide further insight.

Learning Objectives
  • Understand the purpose and steps of EDA and apply appropriate methods,

  • Visualize data and analyze distributions using seaborn and scikit-learn in Python,

  • Understand cosine similarity and apply the calculation.

Dataset

Questions

Question 1 Explore and Understand the Dataset (2 points)

This dataset gives us data on over 100,000 tracks across numerous genres, with differentiating attributes that describe music more accurately and meticulously than we could through hearing alone. We can generally refer to these numerical characteristics that describe and analyze signals as audio features. They are measurable traits that can be obtained in different ways. Extracting low-level features works directly on raw audio signals, using transformation methods that give us applicable spectral or statistical features. Deriving mid- or high-level audio features results in more perceptual, abstract information such as pitch, loudness, tempo, instrumentalness, acousticness, and so on.

Having these data is essential in signal processing and machine learning tasks, including systems for classification, detection, and generation. Later in this project we will explore cosine similarity, one of the fundamental concepts for getting started in machine learning. Without our dataset's numerical features, we would not be able to obtain the results that come from applying these mathematical concepts. And of course, a big part of data analysis is making sure we have the right information in the right format in the first place; otherwise, no matter how much computation we perform, our results may not be reliable or accurate. This is why we have EDA and data cleaning methods, which we will also explore as they relate to our dataset.

Music inherently has properties that can be described by math, though not always perfectly, and data like the ones in this dataset let us find ways to quantify artistic, expressive qualities. Having access to (or creating) analyzable data, and knowing the analysis tools, lets us gain better understanding in any subject, and even allows some seemingly opposing concepts to coexist in products and systems.

Deliverables
  • 1a. Load the csv into a pandas data frame and print the head to observe the first few rows. Show the output of the dimensions and the data types of the columns. Write a few sentences on your observations and initial thoughts about the dataset.

  • 1b. Find the number of missing values and where we have them.

  • 1c. nunique() in pandas returns the number of unique items in the specified column, while unique() returns the unique items themselves. Output how many unique genres are available in the dataframe, and also print what they are. A minimal sketch covering these steps follows this list.
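A minimal sketch of these first steps might look like the following; the file name spotify_tracks.csv and the genre column name track_genre are assumptions, so substitute the names used by your copy of the dataset.

import pandas as pd

df = pd.read_csv("spotify_tracks.csv")  # assumed file name

print(df.head())          # first few rows
print(df.shape)           # dimensions (rows, columns)
print(df.dtypes)          # data type of each column
print(df.isnull().sum())  # missing values per column

print(df['track_genre'].nunique())  # how many unique genres (assumed column name)
print(df['track_genre'].unique())   # what the genres are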

Question 2 Checking Duplicates (2 points)

The track_id column holds the Spotify ID for a specific song. We can check for overlaps in our dataset, which would suggest the same song appearing multiple times. However, we should also account for occasions where singers or producers create different versions of a song. So, we can double-check for exact duplicates, where separate row entries have matching values in every column.

We can check for duplicate rows in a dataframe by using duplicated(). The default, duplicated() or equivalently duplicated(keep="first"), marks all identical rows as True except for the first occurrence, which serves as the comparison point.

exact_duplicates = df.duplicated()
print("Exact duplicates:", exact_duplicates.sum())

The code above will output "Exact duplicates: 450".

exact_duplicates = df.duplicated(keep=False)
print("Exact Duplicates:", exact_duplicates.sum())

In this case, the code will output "Exact Duplicates: 894".

Why we might want to differentiate between these options:

  • In cases where we only care that one copy of a song remains, and the number of appearances doesn’t matter, we usually use keep='first', keep='last', or just the default call.

  • However, sometimes we might want percentages or exact counts to analyze or report on our dataset. Setting keep=False then gives us an overall view and ratio of the dataset's content. Similarly, we can get the exact number of times a song appears, as in the sketch below.
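For example, a short sketch of counting how many times each track appears (assuming the dataframe is named df, as above):

appearance_counts = df['track_id'].value_counts()  # appearances per Spotify ID
print(appearance_counts[appearance_counts > 1])    # only IDs that repeat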

Deliverables
  • 2a. Check for duplicates. Print the output for the number of duplicates checked by track_id only.

  • 2b. Print the output for the number of exact duplicates (where all column entries are equivalent). Try both duplicated() and duplicated(keep=False). What does the difference in the numbers potentially suggest?

  • 2c. As mentioned, music can have different versions under the same title, singer, etc. We will check this here. Use keep=False and groupby to find duplicates based on track_id, then find what they differ by. One possible sketch follows this list.
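One possible sketch for 2c; which columns actually differ will depend on the data:

# Keep every copy of any track_id that appears more than once.
dups = df[df.duplicated(subset='track_id', keep=False)]

# For each repeated ID, count distinct values per column; a column with
# more than one distinct value is something the versions differ by.
per_id = dups.groupby('track_id').nunique()
print(per_id.columns[(per_id > 1).any()].tolist())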

Question 3 Cleaning Data (2 points)

Once we have understood the dataframe we are working with, we can move onto cleaning and preparing our data. This step is a big part of Exploratory Data Analysis (EDA). EDA allows us to see features and contents of the data we are working with, find patterns and relationships, find outliers or parts we want to fix beforehand, and visualize correlations where necessary.

The first question, where we were simply observing and printing values, is also part of the beginning phase of EDA, and it is essential for being able to move forward.

Some fundamental steps we can take include:

  • Understand the Problem and the Data

  • Import and Inspect the Data

  • Handle Missing Values

  • Explore Data Characteristics

  • Perform Data Transformation

  • Visualize Data Relationships

  • Handle Outliers

  • Communicate Findings and Insights

For our dataset, we will handle duplicate data and missing values, and check data types.

Deliverables
  • 3a. Use drop_duplicates(keep='first') to remove duplicate rows from the dataset. Output the new dimensions.

  • 3b. Output the missing values for each column. Which columns have missing values, and how many?

  • 3c. After 3b, you should see that the columns with missing values have only one missing value each, so we can drop those rows. Drop the rows with the missing values and output the new shape.

Dropping Rows: It is common to drop rows with missing values when cleaning data; missing data can present issues such as bias, lack of representativeness, and negative effects on modelling. In our case, we were able to drop them since they were a very small portion of our data and most likely would not introduce bias or change future analysis. However, in general we need to be careful about when we can drop such rows, and when we cannot there are other methods for dealing with missing data. Some include substituting the mean or values derived from regressions, filling in the space with a constant such as 0, or using the last or next observed value depending on how the values are laid out, as sketched below.
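As an illustration of those alternatives in pandas (a sketch only, using two numeric columns from this dataset; which method fits depends on the data):

# Substitute the column mean for missing entries.
df['tempo'] = df['tempo'].fillna(df['tempo'].mean())

# Fill with a constant such as 0.
df['popularity'] = df['popularity'].fillna(0)

# Use the last or next observed value (sensible for ordered data).
df['popularity'] = df['popularity'].ffill()  # forward fill
df['popularity'] = df['popularity'].bfill()  # backward fill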

Question 4 Visualize and Understand the Distribution of our Data (2 points)

It is important to know how our data is distributed, while also checking for any outliers. One way to achieve this in pandas is by using describe(). This function returns descriptive statistics for the dataset, such as the mean, median, standard deviation, and more. Implementing this for our data can be done as below:

import numpy as np

stat = new_df[new_df.select_dtypes(include=np.number).columns].describe()
print(stat)

select_dtypes() has parameters include and exclude, allowing us to pick which data types we want to work with. In our case, we select only the numerical columns; describe() then provides the statistical summary for those columns.

Once you get the output, you will notice that features such as danceability, energy, and liveness are, by the way they are defined, distributed between 0 and 1.

Now let’s take a look at duration. It is on a much larger scale than the other variables, and from the numerical values alone it seems like we have extreme outliers. For example, the max value is 5.237295e+06 ms, which converts to about 87 minutes (5237295 / 60000 ≈ 87.3). Usually we would remove such extreme outliers; however, let’s first confirm what data corresponds to these values.

new_df.loc[new_df['duration_ms'].idxmax()]

Using loc allows us to obtain the entire row by its index label, and idxmax() returns the index corresponding to the maximum value (in this case, of duration_ms).

The output should look like:

track_id       3Cnz3Bu9Wcw8p3kiBTXTxp
track_name     Unity (Voyage Mix) Pt. 1
artists        Tale Of Us
duration_ms    5237295
Name: 73617, dtype: object

How much effect this has on the modelling or calculations we want to perform varies by case. In the next question, we will use cosine similarity to find similar songs. Since our goal is to use all the numeric data that reflects the characteristics of every existing type of music, since the method uses the angles between vectors for computation, and since we apply a scaling method before calling cosine_similarity(), we will keep our duration values. We will explain this further in the next part.

Additionally, visualization provides insight into the distribution and also makes it easier to identify relationships or behavior that are harder to spot from the numbers alone.

We can try this out using seaborn, a visualization library in Python. To plot histograms and KDE plots of the variables, we can proceed as follows:

import matplotlib.pyplot as plt
import seaborn as sns

numeric_col = new_df.select_dtypes(include=np.number).columns
plt.figure(figsize=(20, 15))
for i, col in enumerate(numeric_col[:16], 1):
  plt.subplot(4, 4, i)
  sns.histplot(data=new_df, x=col, kde=True)
  plt.title(col)
plt.tight_layout()
plt.show()

Setting kde=True overlays a KDE curve on our histogram, showing the smoothed distribution.

Deliverables
  • 4a. Use describe() to print descriptive statistics for the numerical columns only in our dataset. Explain what insights we can gain from this and your observations in a few sentences.

  • 4b. Find the row with the maximum duration_ms value and output these columns: track_id, track_name, artists, duration_ms

  • 4c. Try plotting the distribution of each numeric variable. Write 1-2 sentences to explain what it is showing us and any observations you have.

A KDE plot is one way to visualize a data distribution; it shows us the probability density function of a variable and is closely related to the histogram. The KDE is defined by:

$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$

where $K$ is the kernel function. There are multiple types that can be used, such as uniform, normal, parabolic, triangular, biweight, etc., each computing the probability density based on the distance $x - x_i$. $h$ is the bandwidth, which controls smoothing. We always need to make sure that smoothing is neither overdone nor underdone, since either can lead to loss of important information.
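To make the formula concrete, here is a small sketch that evaluates a KDE by hand with a normal (Gaussian) kernel; the sample values are arbitrary:

import numpy as np

def gaussian_kernel(u):
    # Standard normal density, one common choice of K.
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x_grid, samples, h):
    # (1 / (n * h)) * sum_i K((x - x_i) / h), evaluated at each grid point
    n = len(samples)
    return np.array([gaussian_kernel((x - samples) / h).sum() / (n * h)
                     for x in x_grid])

samples = np.array([0.2, 0.5, 0.55, 0.9])        # arbitrary data points
print(kde(np.linspace(0, 1, 5), samples, h=0.1)) # density at 5 grid points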

Question 5 Find Similar Songs (2 points)

Cosine similarity is a common method that measures the similarity between vectors. It is defined by:

$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$

It utilizes the angle between the vectors and does not consider their magnitudes. This way, we focus on the direction of the vectors and how similar they are. The calculation produces a value between -1 and 1, where 0 represents orthogonality (no correlation), -1 represents opposite vectors, and 1 represents identical directions.
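As a quick hand-worked example, take $A = (1, 0)$ and $B = (1, 1)$: $A \cdot B = 1$, $\|A\| = 1$, and $\|B\| = \sqrt{2}$, so $\cos(\theta) = 1/\sqrt{2} \approx 0.71$, meaning the two vectors point in similar but not identical directions.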

In this question, we will see a short example of this method by finding similar music in the dataset given a song. Scikit-learn provides an easy way to implement this using cosine_similarity.

We select the features we will include when computing cosine similarity; we want values that reflect the characteristics of songs, and here we pick the numeric columns.

print(new_df.select_dtypes(include=np.number).columns)
characteristics = ['popularity','duration_ms', 'danceability',
        'energy', 'key', 'loudness','speechiness', 'acousticness',
        'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature']

Scikit-learn provides cosine_similarity() for this computation, along with scaling utilities. StandardScaler() performs z-score normalization, which helps us deal with the varying scales of different columns and leads to better cosine similarity values.

from sklearn.preprocessing import StandardScaler

data = new_df[characteristics].values
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
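From here, one possible sketch for scoring every track against a single query track (row 0 is an arbitrary choice for illustration; positions in scaled_data line up with rows of new_df):

from sklearn.metrics.pairwise import cosine_similarity

# Similarity of track 0 against every track, as a 1-D array of scores.
scores = cosine_similarity(scaled_data[0].reshape(1, -1), scaled_data)[0]

# Positions of the most similar tracks, skipping the query track itself.
top_idx = scores.argsort()[::-1][1:11]
result = new_df.iloc[top_idx][['track_name', 'artists', 'track_id']].copy()
result['similarity'] = scores[top_idx]
print(result)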

We can also implement cosine similarity in numpy, simply by following the definition.

import numpy as np

a = scaled_data[0]  # feature vector of one song (row 0, for illustration)
b = scaled_data[1]  # feature vector of another song
dot_product = np.dot(a, b)
a_mag = np.linalg.norm(a)
b_mag = np.linalg.norm(b)
cos_sim = dot_product / (a_mag * b_mag)  # named cos_sim to avoid shadowing sklearn's cosine_similarity
Deliverables
  • 5a. Write 2-3 sentences to explain cosine similarity in your own words.

  • 5b. Write a function that uses cosine_similarity() to find similar songs. Output the top 10 and top 15 most similar songs. The output should include the song title, artist name, track id, and similarity score.

Submitting your Work

Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.

Items to submit
  • firstname_lastname_project4.ipynb

You must double-check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output, even though it may not. Please take the time to double-check your work. See here for instructions on how to do this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.