TDM 40100: Project 04 - Classifiers - Basics of Classification
Project Objectives
In this project, we will learn about the basics of classification. We will explore some of the most common classifiers, and their strengths and weaknesses.
Questions
Question 1 (2 points)
A classifier, as you may remember from Project 2, is a machine learning model that uses input features to classify data. Classifiers can be used to determine whether an email is spam, or what kind of flower a plant is. We can split classifiers into 2 major categories: binary classification and multi-class classification.
Binary classifiers are used when we want to classify between exactly 2 outcomes, such as testing whether a patient is sick or not. Multi-class classifiers are used when we want to classify more than 2 outcomes, such as a color or a type of flower.
Multi-label classifiers are a special case of multi-class classifiers, where multiple classes can be assigned to a single instance. For example, an image containing both a cat and a dog would be classified as both a cat and a dog. These are commonly found in image recognition problems.
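To make the distinction concrete, below is a minimal sketch of what the target labels might look like in each setting (all of the labels here are made up for illustration):

```python
# Binary classification: each email is exactly one of 2 classes.
y_binary = ["spam", "not spam", "not spam", "spam"]

# Multi-class classification: each flower is exactly one of several classes.
y_multiclass = ["setosa", "versicolor", "virginica", "setosa"]

# Multi-label classification: each image may belong to any number of classes.
y_multilabel = [["cat"], ["dog"], ["cat", "dog"], []]
```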
Pennsylvania State University has a great lesson on examples of classification problems. You can read about them in section 1.5 here. Please read through some of these examples, and then come up with your own real world examples of binary and multi-class classification problems.
- What is a real world example of binary classification?
- What is a real world example of multi-class classification?
- Is email spam classification (spam or not spam) a binary or multi-class classification problem?
- Is digit recognition (determining numerical digits that are handwritten) a binary or multi-class classification problem?
Question 2 (2 points)
There are many different classification models. In this course, we will go more in depth into the K-Nearest Neighbors (KNN) model, the Decision Tree model, and the Random Forest model. Each of these models has its own strengths and weaknesses, and is better suited for different types of data. There are many other classification models, and more methods are being developed all the time. We won’t go into detail about these models in this project, but it is important to know that they exist and behave differently.
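Although we won't study these models in depth yet, it is worth seeing that scikit-learn exposes them all through the same fit/predict interface, which makes it easy to experiment with several models on the same data. A minimal sketch (assuming scikit-learn is installed; the Iris dataset is used purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# All three classifiers share the same fit/predict interface,
# so swapping one model for another is a one-line change.
for model in (KNeighborsClassifier(), DecisionTreeClassifier(), RandomForestClassifier()):
    model.fit(X, y)
    print(type(model).__name__, model.predict(X[:3]))
```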
GeeksforGeeks has a great article on different classification models and their strengths and weaknesses. You can read about them here. Please read through this article and then answer the following questions.
- Can you name 3 other models that could be used for classification?
- Why is it important to understand the strengths and weaknesses of different classification models?
Question 3 (2 points)
There are many metrics that can be used to evaluate the performance of a classifier. Some of the most common metrics are accuracy, precision, recall, and F1 score.
In binary classification, there are 4 possible results from a classifier:
- True Positive: The classifier predicts the presence of a class when the class is actually present.
- True Negative: The classifier predicts the absence of a class when the class is actually absent.
- False Positive: The classifier predicts the presence of a class when the class is actually absent.
- False Negative: The classifier predicts the absence of a class when the class is actually present.
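These 4 counts are often summarized in a confusion matrix. As a quick sketch (assuming scikit-learn is available; the labels below are made up):

```python
from sklearn.metrics import confusion_matrix

actual    = ["Positive", "Positive", "Negative", "Negative", "Positive"]
predicted = ["Positive", "Negative", "Negative", "Positive", "Positive"]

# Rows are actual classes and columns are predicted classes, so with
# labels=["Positive", "Negative"] the matrix is [[TP, FN], [FP, TN]].
print(confusion_matrix(actual, predicted, labels=["Positive", "Negative"]))
```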
Accuracy is simply the percentage of correct predictions made by the model. As we learned in Project 3, our data is split into training and testing sets. We can calculate the accuracy of our model by comparing the predicted values on the testing set to the actual values in the testing set.
$ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} $
Precision is a metric that tells us how many of the predictions of a certain class were actually correct. It is calculated by dividing the number of true positives by the number of true positives plus the number of false positives.
$ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $
Recall is a metric that tells us how many of the actual instances of a certain class were predicted correctly. It is calculated by dividing the number of true positives by the number of true positives plus the number of false negatives.
$ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $
Finally, the F1 score is the harmonic mean of precision and recall. It is calculated as 2 times the product of precision and recall divided by the sum of precision and recall. This metric is useful when we want to balance precision and recall.
$ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $
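All 4 of these metrics are implemented in scikit-learn. A minimal sketch (this reuses the made-up labels from the confusion matrix sketch above, not the table below, so you can still work the exercise by hand):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

actual    = ["Positive", "Positive", "Negative", "Negative", "Positive"]
predicted = ["Positive", "Negative", "Negative", "Positive", "Positive"]

print("Accuracy: ", accuracy_score(actual, predicted))
print("Precision:", precision_score(actual, predicted, pos_label="Positive"))
print("Recall:   ", recall_score(actual, predicted, pos_label="Positive"))
print("F1 Score: ", f1_score(actual, predicted, pos_label="Positive"))
```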
Let’s try an example. Given the following table of predictions and actual values, calculate the accuracy, precision, recall, and F1 score.
| Actual Value | Predicted Value |
|---|---|
| Positive | Positive |
| Positive | Positive |
| Negative | Positive |
| Positive | Negative |
| Positive | Positive |
| Negative | Negative |
| Negative | Positive |
| Positive | Negative |
| Positive | Positive |
| Positive | Positive |
- Why is accuracy not always the best metric to evaluate the performance of a classifier?
- In your own words, what is the difference between precision and recall?
- Calculate the accuracy, precision, recall, and F1 score for the example above.
Question 4 (2 points)
There are many applications of classification in the real world. One common application is in the medical field, where classifiers can be used to predict whether a patient has a certain disease based on their symptoms. Another application is in the financial industry, where classifiers can be used to predict whether a transaction is fraudulent or not.
In more recent years, classifiers have been used in the field of image recognition. For example, classifiers can be used to determine whether an image contains a cat or a dog. More advanced classifiers, such as Haar cascades, can be used to detect faces in images by looking for patterns of light and dark pixels.
In these applications, there are often privacy concerns associated with the data being used. If a company wants to develop a classifier to predict whether a transaction is fraudulent, it may need access to the sensitive financial data of ordinary customers. More recently, generative image AIs have raised concerns about which images they were trained on, and whether the artists who created those images should have their work used to train such models.
Another issue to consider is bias within these datasets. If a model is trained on data biased towards a certain group, it may make incorrect predictions or reinforce existing biases. If a dataset contains a thousand images of cats and only 5 images of frogs, the classifier may be unable to accurately predict whether an image contains a frog, and may oftentimes incorrectly classify images as cats. Bias can also arise in the training itself: a model may wind up relying on a single feature to make predictions, creating bias towards that feature (think race, age, income, nationality, etc.).
There are many ways to address bias in classifiers. Typically, the best place to start is ensuring that the training data is diverse and representative of the real world. Collecting a large amount of data from a variety of sources helps to ensure that the data is not intrinsically biased. Regularization methods can be used to prevent the model from relying heavily on a single feature or a small number of features. Finally, fairness metrics and bias detection tools such as Google's "What-If" tool or IBM's "AI Fairness 360 (AIF360)" can be used post-training to detect and mitigate biases in the model.
- Can you think of any areas where there may be ethical concerns with using classifiers?
- Are there any image recognition applications that you interact with on a daily basis?
Question 5 (2 points)
Although classifiers are powerful tools, they are not without their limitations. One significant limitation is that classifiers rely heavily on the data they are trained with. If the training data is biased, incomplete, or not representative of the real world, the classifier may make incorrect predictions.
Class imbalance is a common problem in classification, where one class has significantly more instances than another. This can lead to classifiers that are biased towards the majority class and perform poorly on the minority class. For example, if a dataset contains 99% cats and 1% dogs, a classifier may simply not have enough data to learn how to classify dogs correctly, and may oftentimes incorrectly classify images as cats.
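One common mitigation is to resample the data so the classes are closer to balanced. Below is a sketch of random oversampling with pandas (the DataFrame and its 'label' column are hypothetical):

```python
import pandas as pd

# Hypothetical imbalanced dataset: 8 cats, 2 dogs.
df = pd.DataFrame({"label": ["cat"] * 8 + ["dog"] * 2})

# Random oversampling: sample the minority class with replacement
# until it matches the size of the majority class.
counts = df["label"].value_counts()
minority = df[df["label"] == counts.idxmin()]
oversampled = minority.sample(counts.max(), replace=True, random_state=42)
balanced_df = pd.concat([df[df["label"] == counts.idxmax()], oversampled])

print(balanced_df["label"].value_counts())
```

Alternatively, many scikit-learn classifiers accept a class_weight='balanced' argument, which penalizes mistakes on the minority class more heavily instead of resampling.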
An easy way to check our class balance is by creating a chart to visualize the distribution of classes in the dataset. To practice, please load the Iris dataset into a dataframe called iris_df. Then, run the below code to generate a pie chart displaying the class distribution.
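If you need a starting point for building iris_df, one possibility is to pull the dataset from scikit-learn (a sketch; your course files may instead provide a CSV, in which case pd.read_csv would be the route):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris dataset into a DataFrame and add a 'Species' column
# containing the class name for each row.
iris = load_iris(as_frame=True)
iris_df = iris.frame
iris_df["Species"] = iris.target_names[iris.target]
```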
```python
import matplotlib.pyplot as plt

# get the counts of each class in the species column
column_counts = iris_df['Species'].value_counts()

# graph the pie chart, labeling each slice with its percentage
column_counts.plot.pie(autopct='%1.1f%%')
plt.show()
```
Feature engineering is another important aspect of machine learning. Feature engineering is the process of manually selecting or transforming input features in the dataset that are most relevant to the problem at hand. The more irrelevant features a classifier has to work with, the more likely it is to make incorrect predictions.
A notable idea here is the Pareto Principle (aka the 80/20 rule): the observation that 80% of the effects can often be attributed to 20% of the causes. This idea can be observed in a myriad of situations and fields. In the context of our classification models, it suggests that roughly 20% of our features may be responsible for 80% of the predictive power of our model. By identifying which features are important, we can reduce our dataset's dimensionality and make our models significantly more efficient and interpretable.
One example of where features can be removed is in the case of multicollinearity. This occurs when a set of features is highly correlated (i.e., the data in them is redundant). Multicollinearity can contribute to overfitting, as the model cannot truly distinguish between the features. In this case, we can remove all but one of the correlated features, reducing our dataset's dimensionality while avoiding the problems of multicollinearity.
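A quick way to spot multicollinearity is to inspect a correlation matrix. Below is a sketch (the data and the 0.9 threshold are made up for illustration; 'petal_in' is just 'petal_cm' converted to inches, so the two columns are redundant):

```python
import pandas as pd

# Hypothetical measurements: petal_in duplicates petal_cm in different units.
df = pd.DataFrame({
    "petal_cm": [1.4, 4.7, 5.1, 1.3],
    "petal_in": [0.55, 1.85, 2.01, 0.51],
    "sepal_cm": [5.1, 4.6, 6.3, 5.8],
})

# Absolute pairwise correlations between features.
corr = df.corr().abs()

# Flag pairs of distinct features whose correlation exceeds 0.9.
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and corr.loc[a, b] > 0.9]
print("Highly correlated pairs:", high)
```

Once a redundant pair is flagged, dropping one of its columns (e.g., df.drop(columns=['petal_in'])) removes the redundancy.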
We previously looked at encoding categorical variables in Project 3. There are many different ways to encode categorical variables, and the best method depends on the type of data and the model being used. This is an example of feature engineering, as we are transforming the data to a more suitable form for the model.
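As a refresher, one common encoding approach is one-hot encoding, which pandas provides directly (the 'color' column here is made up):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["color"])
print(encoded)
```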
- Are the classes in the Iris dataset balanced?
- What are some ways to address class imbalance in a dataset?
- Why is feature engineering important in classification?
Question 6 (2 points)
An important aspect of classification (and machine learning) is the concept of generalization. Generalization is the model’s ability to make accurate predictions on new data. This is very important for deploying a model into the real world, as the model will certainly encounter new data that could be wildly different from the training data.
Within generalization, there are 2 common problems: underfitting and overfitting.
Underfitting is when the model is too simple or not flexible enough to capture the underlying patterns or relationships in the dataset. Imagine fitting a straight line of best fit to data that clearly follows a parabolic relationship: the model underfits the data and cannot make accurate predictions.
Overfitting is when the model has learned the training data too closely, including its noise, and is unable to generalize to new data. This often happens when the model has too many features, when the model is too complex, or when there is significant noise in the data.
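A simple way to check for overfitting is to compare a model's accuracy on the training set against its accuracy on the testing set; a large gap suggests the model has memorized the training data rather than learned general patterns. A sketch (assuming scikit-learn; an unconstrained decision tree is used because it overfits easily, though on a clean dataset like Iris the gap may be small):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# An unconstrained tree can grow until it fits the training data perfectly.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

print("Training accuracy:", model.score(X_train, y_train))
print("Testing accuracy: ", model.score(X_test, y_test))
```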
There are many ways to help ensure a model generalizes well. One common method is L1 and L2 regularization. Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function. L1 regularization adds the absolute values of the coefficients to the loss function, while L2 regularization adds the squares of the coefficients. This encourages the model to keep its coefficients small; L1 regularization can even drive some coefficients to exactly zero, essentially performing its own feature selection. Another technique, called dropout, can be used in neural networks. This method works by randomly selecting neurons to be "dropped out" of the network during training, encouraging the network to build a more robust understanding of the relationships between features so that it generalizes better to new data.
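In scikit-learn, logistic regression supports both penalties. A minimal sketch (the C parameter is the inverse of the regularization strength, so smaller values of C mean a stronger penalty; 0.1 is an arbitrary choice for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# L2 penalty: shrinks all coefficients toward zero.
l2_model = LogisticRegression(penalty="l2", C=0.1, max_iter=1000).fit(X, y)

# L1 penalty: can drive some coefficients exactly to zero.
# The default solver does not support L1, so liblinear is used here.
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)

print("L2 coefficients:\n", l2_model.coef_)
print("L1 coefficients:\n", l1_model.coef_)
```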
- In your own words, what are underfitting and overfitting?
- What is regularization and how does it help prevent overfitting?
- Can you think of any other ways to help ensure a model generalizes well?
Submitting your Work
- firstname_lastname_project4.ipynb
You must double check your submitted notebook. You will not receive full credit if your notebook does not show all of your code and its output.