K-Nearest Neighbour (KNN) from scratch
Writing KNN functions from scratch to classify abalone into age classes (rings) based on their physical measurements.
TL;DR
Writing a KNN algorithm from scratch to classify abalone into age classes, reaching an accuracy of 27.614% (K=20, 70:30 train-test split). The low accuracy may be due to the imbalanced data set, as 17 of the 29 classes have fewer than a hundred data points, leaving insufficient data to train the model accurately. Data originates from the UCI Machine Learning Repository.
Refer to the k-nearest-neigbour GitHub repo for more details.
Introduction
A classification problem involves grouping similar data, or assigning data to specific categories, using a predetermined set of features. The output of a classification problem is a discrete value, represented by an integer.
In this project, we’ll attempt to classify the age of abalone into 15 classes (i.e., rings) using the following features.
Features | Data Type | Measures | Description |
---|---|---|---|
Sex | Nominal | M, F, I (Infant) | Nil |
Length | Continuous | mm | Longest shell measurement |
Diameter | Continuous | mm | Perpendicular to length |
Height | Continuous | mm | With meat in shell |
Whole weight | Continuous | grams | Whole abalone |
Shucked weight | Continuous | grams | Weight of meat |
Viscera weight | Continuous | grams | Gut weight (after bleeding) |
Shell weight | Continuous | grams | After being dried |
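For context, each record in the UCI file is one comma-separated line, with the nominal Sex code first and the ring count (the class label) last. A minimal parsing sketch with an illustrative record (values are for illustration only):

```python
# Illustrative record (order: Sex, Length, Diameter, Height, Whole weight,
# Shucked weight, Viscera weight, Shell weight, Rings)
line = "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15"
fields = line.split(",")
print(fields[0])          # 'M'   -> nominal Sex code
print(float(fields[-1]))  # 15.0  -> ring count, used as the class label
```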
K-Nearest Neighbours (KNN)
The KNN algorithm assumes that similar data points exist in close proximity. The neighbouring data points are used to predict the outcome of a new data point by taking the most common class among the neighbours for classification problems, or the mean of the neighbours for regression problems. K represents the number of neighbours used to make the prediction.
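As a small illustration of the prediction step (not part of the project code below), the majority vote and the regression-style mean over a hypothetical set of neighbour labels look like this:

```python
from collections import Counter

# Hypothetical labels of the K = 5 nearest neighbours of a query point
neighbour_labels = [9, 10, 9, 11, 9]

# Classification: majority vote among the neighbours
predicted_class = Counter(neighbour_labels).most_common(1)[0][0]
print(predicted_class)  # 9

# Regression variant: mean of the neighbours' values
predicted_value = sum(neighbour_labels) / len(neighbour_labels)
print(predicted_value)  # 9.6
```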
Normalising the data points when using KNN is important to ensure the distance is not dominated by features measured on larger scales, which could result in misclassifications. Because KNN is distance dependent, features with larger scales are effectively given more weight.
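A small illustration of why scaling matters (hypothetical values, not from the abalone data): when one feature spans a much larger range than another, it dominates the Euclidean distance unless both are min-max normalised.

```python
import numpy as np

# Two hypothetical points: feature 0 spans roughly 0-1, feature 1 spans roughly 0-1000
a = np.array([0.2, 150.0])
b = np.array([0.9, 160.0])

# Raw Euclidean distance is dominated by the large-scale feature
print(np.linalg.norm(a - b))  # ~10.02

# After min-max scaling both features to [0, 1] (assumed ranges 0-1 and 0-1000),
# the small-scale feature contributes comparably to the distance
a_scaled = np.array([0.2 / 1.0, 150.0 / 1000.0])
b_scaled = np.array([0.9 / 1.0, 160.0 / 1000.0])
print(np.linalg.norm(a_scaled - b_scaled))  # ~0.70
```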
Python code for the KNN algorithm, train-test split and k-fold partitioning:
```python
import numpy as np


def loadData(filename):
    # Load data from file into X
    X = []
    count = 0
    text_file = open(filename, "r")
    lines = text_file.readlines()
    for line in lines:
        X.append([])
        words = line.split(",")
        # Encode the first attribute (Sex: M/F/I) as a number,
        # then convert every value into a float
        for word in words:
            if word == 'M':
                word = 0.333
            if word == 'F':
                word = 0.666
            if word == 'I':
                word = 1
            X[count].append(float(word))
        count += 1
    return np.asarray(X)


def testNorm(X_norm):
    xMerged = np.copy(X_norm[0])
    # Merge datasets
    for i in range(len(X_norm) - 1):
        xMerged = np.concatenate((xMerged, X_norm[i + 1]))
    print(np.mean(xMerged, axis=0))
    print(np.sum(xMerged, axis=0))


def dataNorm(X):
    # Min-max normalise the first 8 columns (the features, excluding rings)
    for i in range(0, 8):
        col = X[:, i]
        max_val = col.max()
        min_val = col.min()
        X[:, i] = (col - min_val) / (max_val - min_val)
    return X


def splitTT(X_norm, percentTrain):
    # Split dataset into train and test sets based on the percentTrain specified
    # Randomly shuffle the data before splitting
    np.random.shuffle(X_norm)
    # Get the index to split the data and slice the dataset at that index
    index = round(len(X_norm) * percentTrain)
    X_train, X_test = X_norm[:index, :], X_norm[index:, :]
    X_split = [X_train, X_test]
    return X_split


def splitCV(X_norm, k):
    # Partition dataset into k folds based on k (i.e., number of partitions)
    # Randomly shuffle the data before partitioning
    np.random.shuffle(X_norm)
    X_split = []
    # Get the approximate size of each fold
    size = round(len(X_norm) / k)
    # Partition the data based on that size so the folds are roughly equal;
    # any remaining data points are left unused.
    for i in range(k):
        X_split.append(X_norm[:size, :])
        X_norm = X_norm[size:, :]
    # If data points are limited, the following (extra) code could be used to
    # distribute the remaining data points across the partitions:
    # if len(X_norm) != 0:
    #     i = 0
    #     for elem in X_norm:
    #         X_split[i] = list(X_split[i])
    #         X_split[i].append(elem)
    #         X_split[i] = np.array(X_split[i])
    #         i += 1
    return X_split


def knn(X_train, X_test, K):
    predictions = []
    # Generate a prediction for each data point in X_test
    for row in X_test:
        # Calculate the Euclidean distance to every data point in X_train
        distances = []
        for row_train in X_train:
            distance = 0
            for i in range(len(row) - 1):
                distance += (row[i] - row_train[i]) ** 2
            distance = np.sqrt(distance)
            distances.append((row_train, distance))
        # Sort the training points by distance to this test point
        distances.sort(key=lambda elem: elem[1])
        # Take the K nearest neighbours and predict the majority class
        neighbours = []
        for i in range(K):
            neighbours.append(distances[i][0])
        output_vals = [n[-1] for n in neighbours]
        prediction = max(set(output_vals), key=output_vals.count)
        predictions.append(prediction)
    # Calculate the accuracy (no. of accurate predictions / total predictions)
    accurate_pred = 0
    pred_count = len(predictions)
    actual = X_test[:, -1]
    for i in range(pred_count):
        if predictions[i] == actual[i]:
            accurate_pred += 1
    accuracy = (accurate_pred / pred_count) * 100.00
    return accuracy


# Example main for KNN with a train-and-test split and Euclidean distance
def knnMain(filename, percentTrain, k):
    # Load the data
    X = loadData(filename)
    # Normalise
    X_norm = dataNorm(X)
    # Data split: train-and-test
    X_split = splitTT(X_norm, percentTrain)
    # KNN with Euclidean distance
    accuracy = knn(X_split[0], X_split[1], k)
    return accuracy
```
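A brief usage note: assuming the UCI data file has been saved locally as abalone.data (the file name is an assumption), a single train-and-test run could look like this:

```python
# File name is an assumption; adjust the path to wherever the UCI file is saved
accuracy = knnMain("abalone.data", 0.7, 20)  # 70:30 split, K = 20
print("Accuracy: %.3f%%" % accuracy)
```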
Accuracy
Accuracy is calculated as the percentage of correct predictions. Values are rounded to 3 decimal places.
- The highest and lowest accuracies for the train-and-test splits are 27.614% (70:30 split, K=20) and 18.861% (50:50 split, K=1).
- The highest and lowest accuracies for k-fold cross-validation are 25.659% (15-fold, K=20) and 19.641% (5-fold, K=1).
Accuracy | Train-and-Test (0.7-0.3) | Train-and-Test (0.6-0.4) | Train-and-Test (0.5-0.5) | Cross-Validation (5-fold) | Cross-Validation (10-fold) | Cross-Validation (15-fold) |
---|---|---|---|---|---|---|
K=1 | 19.393% | 19.270% | 18.861% | 19.641% | 19.895% | 19.976% |
K=5 | 25.858% | 24.536% | 22.882% | 23.401% | 22.410% | 23.285% |
K=10 | 24.741% | 25.613% | 25.227% | 24.048% | 23.916% | 24.580% |
K=15 | 26.257% | 26.272% | 26.663% | 25.102% | 24.731% | 24.724% |
K=20 | 27.614% | 27.169% | 26.472% | 25.269% | 25.162% | 25.659% |
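One way the k-fold figures above could be produced is to average the knn accuracy across folds. The helper below is a minimal sketch (not from the original repo), assuming the loadData, dataNorm, splitCV and knn functions defined earlier:

```python
import numpy as np


def knnCVMain(filename, n_folds, k):
    # Load and normalise the data, then partition it into n_folds folds
    X_norm = dataNorm(loadData(filename))
    folds = splitCV(X_norm, n_folds)
    accuracies = []
    for i in range(len(folds)):
        # Use fold i as the test set and the remaining folds as the training set
        X_test = folds[i]
        X_train = np.concatenate([folds[j] for j in range(len(folds)) if j != i])
        accuracies.append(knn(X_train, X_test, k))
    # Report the mean accuracy across the folds
    return sum(accuracies) / len(accuracies)


# Example: 5-fold cross-validation with K = 20 (file name is an assumption)
# print(knnCVMain("abalone.data", 5, 20))
```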
Run Time
The computational time is measured using the `timeit` library. Values are rounded to 3 decimal places.
Computational Time | Train-and-Test (0.7-0.3) | Train-and-Test (0.6-0.4) | Train-and-Test (0.5-0.5) | Cross-Validation (5-fold) | Cross-Validation (10-fold) | Cross-Validation (15-fold) |
---|---|---|---|---|---|---|
K=1 | 11.969 seconds | 13.483 seconds | 14.147 seconds | 45.429 seconds | 50.753 seconds | 52.492 seconds |
K=5 | 12.136 seconds | 13.474 seconds | 14.070 seconds | 45.299 seconds | 50.434 seconds | 52.282 seconds |
K=10 | 11.931 seconds | 13.479 seconds | 13.972 seconds | 45.307 seconds | 50.532 seconds | 52.689 seconds |
K=15 | 11.906 seconds | 13.605 seconds | 13.977 seconds | 45.840 seconds | 50.737 seconds | 52.855 seconds |
K=20 | 11.936 seconds | 13.662 seconds | 13.992 seconds | 45.423 seconds | 50.601 seconds | 52.021 seconds |
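A minimal sketch of how a single configuration could be timed with `timeit`, assuming the knnMain function above and an abalone.data file in the working directory (the file name is an assumption):

```python
import timeit

# Time one train-and-test run (70:30 split, K = 20) a single time
elapsed = timeit.timeit(lambda: knnMain("abalone.data", 0.7, 20), number=1)
print("%.3f seconds" % elapsed)
```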