TDM 30100: Project 06 - Classifiers - K-Nearest Neighbors (KNN) II
Project Objectives
In this project, we will learn about more advanced techniques for K-Nearest Neighbors (KNN), and continue building our KNN from scratch.
Questions
Question 1 (2 points)
In the previous project, we developed a KNN class that is able to classify new data points. If you completed the previous project, you should have a basic understanding of how a KNN works.
In this question, we will briefly recap last project’s code and concepts. Please run the following code to load the Iris dataset, scale the input features, and split the data into training and testing sets.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
df = pd.read_csv('/anvil/projects/tdm/data/iris/Iris.csv')
X = df.drop(['Species','Id'], axis=1)
y = df['Species']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
y_train = y_train.to_numpy()
y_test = y_test.to_numpy()
Then, please run the following code to define the KNN class. If you completed the previous project, you may use your own KNN class instead.
class KNN:
    def __init__(self, X_train, y_train):
        self.features = X_train
        self.labels = y_train

    def train(self, X_train, y_train):
        self.features = X_train
        self.labels = y_train

    def euc_dist(self, point1, point2):
        # short 1-line approach
        return sum((point1 - point2) ** 2) ** 0.5

    def classify(self, new_point, k=1):
        # sort the (point, label) pairs by each point's distance to the new point, and keep the k nearest labels
        nearest_labels = [x[1] for x in sorted(zip(self.features, self.labels), key=lambda x: self.euc_dist(x[0], new_point))[:k]]
        return max(set(nearest_labels), key=nearest_labels.count)

    def test(self, X_test, y_test, k=1):
        # short 1-line approach, efficient list comprehension
        return sum([1 for p in zip(X_test, y_test) if self.classify(p[0], k=k) == p[1]]) / len(X_test)
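As a quick refresher, here is a minimal usage example (using the training and testing splits created above; the choice of k=3 is arbitrary):

knn = KNN(X_train, y_train)

# classify a single test point, then check the overall accuracy
print(knn.classify(X_test[0], k=3))
print(knn.test(X_test, y_test, k=3))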
To review the concepts of the KNN algorithm, please answer the following questions.
- What is the purpose of the train function in the KNN class?
- How does a KNN pick which k neighbors to use when classifying a new point?
- How does a KNN handle ties when classifying a new point?
Question 2 (2 points)
To review, the KNN works entirely by calculating the Euclidean distance between points in n-dimensional space. This means that scaling our input features is very important, as features with larger scales will have a larger impact on the distance calculation.
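To see why, consider a hypothetical two-feature example (made-up numbers) where one feature has a much larger scale than the other:

import numpy as np

# hypothetical points: feature 1 in centimeters, feature 2 in grams
a = np.array([5.0, 150.0])
b = np.array([6.0, 140.0])

# without scaling, the distance is dominated by the larger-scale feature (grams)
print(np.sqrt(np.sum((a - b) ** 2)))  # about 10.05, almost entirely from feature 2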
However, uniformly scaling our features may not always be the best approach. If we wanted to tell a red apple apart from a green apple, the most important feature would be the apple's color. Therefore, we would want to weight the color feature more heavily than the size feature.
This concept of manually assigning importance to features is an example of feature engineering. We can use existing knowledge (or, oftentimes, intuition) to determine how important each feature should be in the model. Done well, this can greatly improve our model's performance, and it often leads to a more interpretable model.
Let's create a new function inside the KNN class that calculates the Euclidean distance between two points, but takes a list of weights to determine how important each feature is. This lets us scale the distance between two points based on the importance of each feature.
def scaled_distance(self, point1, point2, weights):
    # firstly, scale weights so they sum to 1 (each weight should be a fraction of the total 1)
    ''' YOUR CODE HERE '''
    # then, scale the 2 points by the weights (multiply each feature in the point by the corresponding weight)
    ''' YOUR CODE HERE '''
    # finally, calculate and return the Euclidean distance between the 2 points (use the existing euc_dist function)
    ''' YOUR CODE HERE '''
    pass
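If you get stuck, here is a minimal sketch of one way to fill in this skeleton. It is one valid approach among several, and it assumes the points are NumPy arrays (as produced by StandardScaler above):

import numpy as np

def scaled_distance(self, point1, point2, weights):
    # scale the weights so they sum to 1
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    # multiply each feature by its corresponding weight, then reuse euc_dist
    return self.euc_dist(point1 * weights, point2 * weights)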
To ensure your implementation is correct, run the below code to calculate the scaled distance between the first row of the training set and the second row of the testing set. The printed values should be 1.4718716551800963 for euc_dist and 0.15147470763642634 for scaled_distance.
knn = KNN(X_train, y_train)
print(knn.euc_dist(X_train[0], X_test[1]))
print(knn.scaled_distance(X_train[0], X_test[1], [1,1,1,10]))
- Output of running the sample code to confirm correct implementation of the scaled_distance function
- What does it mean that the distance decreased when we raised the weight of the last feature?
Question 3 (2 points)
Now that we have code to scale the distance between two points based on the importance of each feature, let’s write two functions inside the KNN class to classify a point using weights, and to test the model using weights.
These functions will be extremely similar to the existing classify and test functions, but use the scaled_distance function instead of the euc_dist function.
def classify_weighted(self, new_point, k=1, weights=None):
    ''' If weights is None, run the existing classify function '''
    # now, write the classify function using the scaled_distance function
    ''' YOUR CODE HERE '''

def test_weighted(self, X_test, y_test, k=1, weights=None):
    ''' YOUR CODE TO TEST THE MODEL '''
    pass
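Again, here is one possible sketch (not the only correct solution), mirroring the structure of the existing classify and test functions:

def classify_weighted(self, new_point, k=1, weights=None):
    # with no weights, fall back to the unweighted classify function
    if weights is None:
        return self.classify(new_point, k=k)
    # same approach as classify, but sorting by the weighted distance
    nearest_labels = [x[1] for x in sorted(zip(self.features, self.labels),
                      key=lambda x: self.scaled_distance(x[0], new_point, weights))[:k]]
    return max(set(nearest_labels), key=nearest_labels.count)

def test_weighted(self, X_test, y_test, k=1, weights=None):
    # fraction of test points whose weighted classification matches the true label
    return sum(1 for p, label in zip(X_test, y_test)
               if self.classify_weighted(p, k=k, weights=weights) == label) / len(X_test)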
To test that your functions work, please run the below code to calculate the accuracy of the model with different weights. Your accuracies should be 0.9666666666666667, 0.9666666666666667, and 0.8333333333333334 respectively.
knn = KNN(X_train, y_train)
print(knn.test_weighted(X_test, y_test, k=1, weights=[1,1,1,1]))
print(knn.test_weighted(X_test, y_test, k=1, weights=[1,1,1,10]))
print(knn.test_weighted(X_test, y_test, k=1, weights=[10,1,1,1]))
- Accuracy of the model on the testing input features and output variables using the KNN algorithm with k=1 and weights=[1,1,1,1]
- Accuracy of the model on the testing input features and output variables using the KNN algorithm with k=1 and weights=[1,1,1,10]
- Accuracy of the model on the testing input features and output variables using the KNN algorithm with k=1 and weights=[10,1,1,1]
- Does the accuracy of the model change when we change the weights? Why or why not?
Question 4 (2 points)
One potential limitation of the KNN is that we simply select the class held by the majority of the k nearest neighbors. Suppose we attempt to classify some point with k=3, and this results in 2 neighbors of class A and 1 neighbor of class B. In this case, the KNN would classify the point as class A. However, what if the 2 class A neighbors are very far away from our new point, while the class B neighbor is extremely close? It would probably make more sense to classify the point as class B.
Additionally, suppose our dataset is unbalanced. We may have hundreds of examples of class A in our dataset, but only a few examples of class B. In this case, it is very likely that the KNN will classify points as class A, even if they are closer to class B neighbors.
To address this limitation, a common modification to the KNN is to weight the k-nearest neighbors based on their distance to the new point. This means that closer neighbors will have a larger impact on the classification than farther neighbors. Although this is more computationally expensive, it creates a much more robust model.
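To make the idea concrete, here is a small worked example with hypothetical (label, distance) pairs:

# hypothetical k=3 nearest neighbors as (label, distance) pairs
neighbors = [("A", 2.0), ("A", 2.2), ("B", 0.3)]

# compute the average distance per class
distances = {}
for label, d in neighbors:
    distances.setdefault(label, []).append(d)
averages = {label: sum(ds) / len(ds) for label, ds in distances.items()}

print(averages)                          # {'A': 2.1, 'B': 0.3}
print(min(averages, key=averages.get))   # 'B' wins, despite the 2-to-1 majority for 'A'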
Implement a new function inside the KNN class that classifies a new point using weighted neighbors. This function should work similarly to the classify function, but should return the class based on the average distance of each class, as opposed to a simple majority vote.
def classify_distance(self, new_point, k=1, weights=None):
    # follow the same approach as the classify function. however, for each nearest neighbor, we need to save both the label and the distance
    # nearest_labels = [(label, distance), ... k times]
    ''' YOUR CODE HERE '''
    # now, we need to select the class based on each distance, not just the label
    # we can find the average distance of each class and select the class with the smallest average distance
    ''' YOUR CODE HERE '''
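If you need a starting point, here is one possible sketch. It assumes the scaled distance should be used when weights are given and the plain Euclidean distance otherwise:

def classify_distance(self, new_point, k=1, weights=None):
    # choose the distance function based on whether weights were given
    if weights is None:
        dist = lambda p: self.euc_dist(p, new_point)
    else:
        dist = lambda p: self.scaled_distance(p, new_point, weights)
    # keep (label, distance) pairs for the k nearest neighbors
    nearest = sorted(((label, dist(p)) for p, label in zip(self.features, self.labels)),
                     key=lambda x: x[1])[:k]
    # average the distances per class, then pick the class with the smallest average
    per_class = {}
    for label, d in nearest:
        per_class.setdefault(label, []).append(d)
    averages = {label: sum(ds) / len(ds) for label, ds in per_class.items()}
    return min(averages, key=averages.get)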
To test your function, we will classify a single test point at different k values. Run the below code to ensure that your function works properly. The output should be 'Iris-versicolor', 'Iris-versicolor', and 'Iris-virginica', respectively.
knn = KNN(X_train, y_train)
print(knn.classify_distance(X_test[8], k=5, weights=None))
print(knn.classify_distance(X_test[8], k=7, weights=None))
print(knn.classify_distance(X_test[8], k=9, weights=None))
If you print some debugging information inside the function, you should see that even though there are more 'Iris-versicolor' neighbors at k=9, the average distance of the 'Iris-virginica' neighbors is smaller, so 'Iris-virginica' is selected.
- Classification test at k=5, 7, and 9
- Explanation of why the classification changes when we change the k value
- What do you think happens if we set k to the number of training points?
Question 5 (2 points)
In this project you have learned about feature engineering, feature importance scaling, and different ways to handle ties in KNN.
Based on what you have learned about KNNs, please answer the following questions.
- What is the purpose of feature engineering in machine learning?
- Why is it important to scale input features in KNN?
- What are the advantages and disadvantages of the two approaches to handling ties in KNN?
- What are limitations of the KNN algorithm?
Submitting your Work
- firstname_lastname_project6.ipynb

You must double check your submitted .ipynb file. You will not receive full credit if your .ipynb file does not run and display all of the requested output.