TDM 40100: Project 06 - Classifiers - K-Nearest Neighbors (KNN) II

Project Objectives

In this project, we will learn about more advanced techniques for K-Nearest Neighbors (KNN), and continue building our KNN from scratch.

Learning Objectives
  • Learn about feature engineering

  • Learn about better ways to handle ties in KNN

Dataset

  • /anvil/projects/tdm/data/iris/Iris.csv

Questions

Question 1 (2 points)

In the previous project, we developed a KNN class that is able to classify new data points. If you completed the previous project, you should have a basic understanding of how a KNN works.

In this question, we will briefly recap last project’s code and concepts. Please run the following code to load the Iris dataset, scale the input features, and split the data into training and testing sets.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

df = pd.read_csv('/anvil/projects/tdm/data/iris/Iris.csv')
X = df.drop(['Species','Id'], axis=1)
y = df['Species']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

y_train = y_train.to_numpy()
y_test = y_test.to_numpy()

Then, load the KNN class. If you completed the previous project, please use your own KNN class. If you did not complete the previous project, please run the following code to define the KNN class.

class KNN:
    def __init__(self, X_train, y_train):
        self.features = X_train
        self.labels = y_train

    def train(self, X_train, y_train):
        self.features = X_train
        self.labels = y_train

    def euc_dist(self, point1, point2):
        # one-line Euclidean distance: square root of the sum of squared feature differences
        return sum((point1 - point2) ** 2) ** 0.5

    def classify(self, new_point, k=1):
        # sort (feature, label) pairs by their distance to the new point, keep the labels of the k closest,
        # then return the most common of those labels (a simple majority vote)
        nearest_labels = [x[1] for x in sorted(zip(self.features, self.labels), key=lambda x: self.euc_dist(x[0], new_point))[:k]]
        return max(set(nearest_labels), key=nearest_labels.count)

    def test(self, X_test, y_test, k=1):
        # short one-line approach: the fraction of test points whose predicted label matches the true label
        return sum([1 for p in zip(X_test, y_test) if self.classify(p[0], k=k) == p[1]]) / len(X_test)
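For reference, the following standalone snippet (with made-up labels, not part of the project code) shows what the majority-vote expression used in classify evaluates to when two labels are tied among the nearest neighbors; experimenting with it may help you with the deliverable questions below.

nearest_labels = ['Iris-setosa', 'Iris-versicolor', 'Iris-setosa', 'Iris-versicolor']

# with a 2-2 tie, max() returns whichever tied label it reaches first in the set's
# iteration order -- the distances of the neighbors play no role in breaking the tie
print(max(set(nearest_labels), key=nearest_labels.count))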

To review the concepts of the KNN algorithm, please answer the following questions.

Deliverables
  • What is the purpose of the train function in the KNN class?

  • How does a KNN pick which k neighbors to use when classifying a new point?

  • How does a KNN handle ties when classifying a new point?

Question 2 (2 points)

To review, the KNN works entirely by calculating the Euclidean distance between points in n-dimensional space. This means that scaling our input features is very important, as features with larger scales will have a larger impact on the distance calculation.

However, scaling all features uniformly may not always be the best approach. If we wanted to tell a red apple apart from a green apple, the most important feature would be the apple's color, so we would want to weight the color feature more heavily than, say, the size feature.

This concept of manually assigning importance to features is an example of feature engineering. We can use existing knowledge (or, oftentimes, intuition) to decide how much each feature should contribute to the model. Done well, this can greatly improve the model's performance and often leads to a more interpretable model.
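For intuition, here is a small standalone sketch (using numpy and made-up numbers, not the Iris data) showing how multiplying one feature by a larger weight makes that feature dominate the Euclidean distance. Note that the scaled_distance function you write below should also normalize the weights so they sum to 1; this sketch skips that step to keep the effect obvious.

import numpy as np

point1 = np.array([1.0, 1.0])
point2 = np.array([2.0, 3.0])

# unweighted distance: both features contribute according to their raw differences
print(np.sqrt(np.sum((point1 - point2) ** 2)))              # about 2.24

# weighting the second feature by 10 makes its difference dominate the distance
weights = np.array([1.0, 10.0])
print(np.sqrt(np.sum((weights * (point1 - point2)) ** 2)))  # about 20.02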

Let’s create a new function inside the KNN class that calculates the Euclidean distance between two points, but takes a list of weights that determines how important each feature is. This allows us to scale each feature's contribution to the distance based on its importance.

def scaled_distance(self, point1, point2, weights):
    # firstly, scale weights so they sum to 1 (each weight should be a fraction of the total 1)
    ''' YOUR CODE HERE '''

    # then, scale the 2 points by the weights (multiply each feature in the point by the corresponding weight)
    ''' YOUR CODE HERE '''

    # finally, calculate and return the euclidean distance between the 2 points (use the existing euc_dist function)
    ''' YOUR CODE HERE '''
    pass

To ensure your implementation is correct, run the below code to calculate the scaled distance between the first row of the training set and the second row of the testing set. The printed values should be 1.4718716551800963 for euc_dist and 0.15147470763642634 for scaled_distance.

knn = KNN(X_train, y_train)
print(knn.euc_dist(X_train[0], X_test[1]))
print(knn.scaled_distance(X_train[0], X_test[1], [1,1,1,10]))
Deliverables
  • Output of running the sample code to confirm correct implementation of the scaled_distance function

  • What does it mean that the distance decreased when we raised the weight of the last feature?

Question 3 (2 points)

Now that we have code to scale the distance between two points based on the importance of each feature, let’s write two functions inside the KNN class to classify a point using weights, and to test the model using weights.

These functions will be extremely similar to the existing classify and test functions, but will use the scaled_distance function instead of the euc_dist function.

def classify_weighted(self, new_point, k=1, weights=None):
    ''' If weights is None, run the existing classify function '''

    # now, write the classify function using the scaled_distance function
    ''' YOUR CODE HERE '''

def test_weighted(self, X_test, y_test, k=1, weights=None):
    ''' YOUR CODE TO TEST THE MODEL '''
    pass

To test that your functions work, please run the below code to calculate the accuracy of the model with different weights. Your accuracies should be 0.9666666666666667, 0.9666666666666667, and 0.8333333333333334 respectively.

knn = KNN(X_train, y_train)
print(knn.test_weighted(X_test, y_test, k=1, weights=[1,1,1,1]))
print(knn.test_weighted(X_test, y_test, k=1, weights=[1,1,1,10]))
print(knn.test_weighted(X_test, y_test, k=1, weights=[10,1,1,1]))
Deliverables
  • Accuracy of the model on the testing input features and output variables using the KNN algorithm with k=1 and weights=[1,1,1,1]

  • Accuracy of the model on the testing input features and output variables using the KNN algorithm with k=1 and weights=[1,1,1,10]

  • Accuracy of the model on the testing input features and output variables using the KNN algorithm with k=1 and weights=[10,1,1,1]

  • Does the accuracy of the model change when we change the weights? Why or why not?

Question 4 (2 points)

One potential limitation of the KNN is that we are simply selecting the class based on the majority of the k nearest neighbors. Suppose we attempt to classify some point with k=3. Suppose this results in finding 2 neighbors of class A and 1 neighbor of class B. In this case, the KNN would classify the point as class A. However, what if the 2 neighbors of class A are very far away from our new point, while the class B neighbor is extremely close? It would probably make more sense to classify the point as class B.

Additionally, suppose our dataset is unbalanced. We may have hundreds of examples of class A in our dataset, but only a few examples of class B. In this case, it is very likely that the KNN will classify points as class A, even if they are closer to class B neighbors.

To address this limitation, a common modification to the KNN is to weight the k nearest neighbors by their distance to the new point, so that closer neighbors have a larger influence on the classification than farther neighbors. Although this adds a little computation, it typically produces a more robust model.
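To make this concrete, here is a small hypothetical example (the labels and distances are invented) contrasting a plain majority vote with a smallest-average-distance rule:

# suppose the k = 3 nearest neighbors are two 'A' points far away and one 'B' point very close
neighbors = [('A', 2.0), ('A', 2.2), ('B', 0.3)]
labels = [label for label, _ in neighbors]

# a plain majority vote picks 'A' (two votes to one)
print(max(set(labels), key=labels.count))

# averaging the distances per class favors 'B' (0.3 versus 2.1)
averages = {label: sum(d for l, d in neighbors if l == label) / labels.count(label)
            for label in set(labels)}
print(min(averages, key=averages.get))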

Implement a new function inside the KNN class that classifies a new point using weighted neighbors. This function should work similarly to the classify function, but should return the class based on the average distance of each class, as opposed to a simple majority vote.

def classify_distance(self, new_point, k=1, weights=None):
    # follow the same approach as the classify function. however, for each nearest neighbor, we need to save both the label and the distance
    # nearest_labels = [(label, distance), ... k times]
    ''' YOUR CODE HERE '''

    # now, we need to select the class based on each distance, not just the label
    # we can find the average distance of each class and select the class with the smallest average distance
    ''' YOUR CODE HERE '''

It is recommended to use defaultdict from the collections module to create a dictionary whose default value is an empty list. This lets you append to a key's list without first checking whether the key exists.
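If you have not used defaultdict before, here is a brief standalone example of the suggested pattern (the labels and distances are made up):

from collections import defaultdict

distances_by_class = defaultdict(list)

# appending works immediately, even for keys that have not been seen before
distances_by_class['Iris-setosa'].append(0.4)
distances_by_class['Iris-setosa'].append(0.9)
distances_by_class['Iris-virginica'].append(1.2)

print(distances_by_class['Iris-setosa'])   # [0.4, 0.9]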

To test that your function works properly, we will classify a test point at different values of k. Run the below code to ensure that your function works properly. The output should be 'Iris-versicolor', 'Iris-versicolor', and 'Iris-virginica' respectively.

knn = KNN(X_train, y_train)
print(knn.classify_distance(X_test[8], k=5, weights=None))
print(knn.classify_distance(X_test[8], k=7, weights=None))
print(knn.classify_distance(X_test[8], k=9, weights=None))

If you print some debugging information inside the function, you should see that even though there are more 'Iris-versicolor' neighbors at k=9, the average distance of the 'Iris-virginica' neighbors is smaller, so that class is selected.

Deliverables
  • Output of classifying X_test[8] at k=5, 7, and 9

  • Explanation of why the classification changes when we change the k value

  • What do you think happens if we set k to the number of training points?

Question 5 (2 points)

In this project you have learned about feature engineering, feature importance scaling, and different ways to handle ties in KNN.

Based on what you have learned about KNNs, please answer the following questions.

Deliverables
  • What is the purpose of feature engineering in machine learning?

  • Why is it important to scale input features in KNN?

  • What are the advantages and disadvantages of the two approaches to handling ties in KNN?

  • What are limitations of the KNN algorithm?

Question 6 (2 points)

A change that may be beneficial is to use the distance-based weighting only when there is a tie in classification. This lets the model be more accurate when there is a tie, without changing the classification when there is not one. Please write a new classify function called classify_weighted_ties that uses the distance-based weighting only when there is a tie in classification.

def classify_weighted_ties(self, new_point, k=1, weights=None):
    # try to classify using the normal method
    ''' YOUR CODE HERE '''

    # if there is a tie in labels, classify using the weighted method
    ''' YOUR CODE HERE '''
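One possible way to detect a tie (a sketch of one approach, not the only one) is to count the neighbor labels with collections.Counter and compare the two highest counts:

from collections import Counter

# hypothetical vote list from the k nearest neighbors
nearest_labels = ['Iris-versicolor', 'Iris-virginica', 'Iris-versicolor', 'Iris-virginica']

counts = Counter(nearest_labels).most_common()
# a tie exists when at least two classes share the highest vote count
is_tie = len(counts) > 1 and counts[0][1] == counts[1][1]
print(is_tie)   # True for this example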

To check that your function works properly, run the below code as a test case.

knn = KNN(X_train, y_train)
print(knn.classify_weighted_ties(X_test[8], k=1, weights=None))
print(knn.classify_weighted_ties(X_test[8], k=2, weights=None))
print(knn.classify_weighted_ties(X_test[8], k=3, weights=None))
print(knn.classify_weighted_ties(X_test[8], k=4, weights=None))
Deliverables
  • Output of classifying X_test[8] at k=1,2,3,4

Question 7 (2 points)

Another modification that may be beneficial is checking whether there is a class imbalance in our dataset. If there is, we should use the distance-based weighting to select the class, because in the event of a class imbalance it is significantly more likely that our point will be classified as the majority class.

For this, we will write two new functions: get_class_distribution and classify_weighted_imbalance. The get_class_distribution function will return a dictionary mapping every class to its proportion in the training data (e.g., if 50% of the labels are Iris-virginica, the dictionary should contain 'Iris-virginica': 0.5). The classify_weighted_imbalance function will get the class distribution and, if it detects a class imbalance, use the distance-based weighting to classify the point. We can detect a class imbalance if the proportion of any class is more than some threshold times the proportion of any other class.
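As a quick illustration of the threshold check (with made-up proportions): if the training labels are 45% class A, 40% class B, and 15% class C, then with threshold=2 we have 0.45 > 2 * 0.15, so the data would be considered imbalanced.

# hypothetical class proportions and threshold
distribution = {'A': 0.45, 'B': 0.40, 'C': 0.15}
threshold = 2

# imbalanced if the largest proportion is more than threshold times the smallest
is_imbalanced = max(distribution.values()) > threshold * min(distribution.values())
print(is_imbalanced)   # True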

def get_class_distribution(self):
    # get each unique value from self.labels
    # count the number of times each unique value appears
    # return a dictionary of the class and its proportion in the training data
    ''' YOUR CODE HERE '''

def classify_weighted_imbalance(self, new_point, k=1, weights=None, threshold=2):
    # get the class distribution
    ''' YOUR CODE HERE '''

    # check if there is a class imbalance
    ''' YOUR CODE HERE '''
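If you are unsure where to start with get_class_distribution, one possible building block (a sketch, not the only approach) is numpy's np.unique with return_counts=True:

import numpy as np

# hypothetical label array
labels = np.array(['A', 'A', 'B', 'A', 'C'])

# np.unique returns each distinct class along with how many times it appears
classes, counts = np.unique(labels, return_counts=True)
for c, n in zip(classes, counts):
    print(c, n / len(labels))   # A 0.6, B 0.2, C 0.2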

To test this function, we will need to make a new dataset with a class imbalance. Run the below code to create a modified Iris dataframe that has a class imbalance (we will remove 25 examples of the 'Iris-versicolor' class).

iris_imbalanced = df.copy()

# remove the first 25 rows of the Iris-versicolor class
iris_imbalanced = iris_imbalanced.drop(iris_imbalanced[iris_imbalanced['Species'] == 'Iris-versicolor'].head(25).index)

X_imbalanced = iris_imbalanced.drop(['Species','Id'], axis=1)
y_imbalanced = iris_imbalanced['Species']

scaler = StandardScaler()
X_imbalanced_scaled = scaler.fit_transform(X_imbalanced)

X_train, X_test, y_train, y_test = train_test_split(X_imbalanced_scaled, y_imbalanced, test_size=0.2, random_state=15)

y_train = y_train.to_numpy()
y_test = y_test.to_numpy()

Then, we can test the functions by running the below code.

knn = KNN(X_train, y_train)
# check the class distribution
print(knn.get_class_distribution())

# classify the 9th row of the testing input features using the original classify function
print(knn.classify(X_test[8], k=5))

# classify the 9th row of the testing input features using the imbalance detection classify function
print(knn.classify_weighted_imbalance(X_test[8], k=5, weights=None, threshold=1.5))
Deliverables
  • Output of the class distribution

  • Output of classifying X_test[8] using the original classify function with k=5

  • Output of classifying X_test[8] using the KNN algorithm with k=5 and threshold=1.5

Submitting your Work

Items to submit
  • firstname_lastname_project6.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.