
Step-by-Step Guide To Implement Machine Learning I - KNN

20 May 2019 · CPOL · 2 min read
An easy-to-implement machine learning algorithm


Introduction

K-nearest neighbors (KNN) is a simple machine learning algorithm. Its principle is to calculate the distances between the test object and all objects in the training set, and then select the K training objects closest to the test object as its nearest-neighbor set. The predicted label is the majority label in that set, which can be expressed as:

y = \arg\max_{c_{j}} \sum_{x_{i}\in N_{k}(x)}{I(y_{i}=c_{j})}

where N_{k}(x) is the set of the K nearest neighbors of x, c_{j} ranges over the class labels, and I is the indicator function given by:

I(x)= \begin{cases} 1,& \text{if } x \text{ is true}\\ 0,& \text{otherwise} \end{cases}
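For example, if K = 5 and the labels of the five nearest neighbors are A, A, B, A, B, the sum is largest for class A, so the prediction is A. A minimal sketch of this voting rule (the labels here are made up for illustration):

Python
from collections import Counter

# labels of the K = 5 nearest neighbors (hypothetical values)
neighbor_labels = ['A', 'A', 'B', 'A', 'B']

# y = argmax over classes of the number of neighbors with that label
prediction = Counter(neighbor_labels).most_common(1)[0][0]
print(prediction)  # 'A'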

KNN Model

The KNN model consists of three parts: distance calculation, selection of K, and classification.

Distance Calculation

Distance is used to measure the similarity between two objects in the feature space. There are many ways to calculate the distance between two objects, such as the Euclidean distance, cosine distance, edit distance, and Manhattan distance. The simplest is the Euclidean distance, which is calculated as:

L_{2}(x_{i},x_{j})=\left( \sum_{l=1}^{n}{|x_{i}^{(l)}-x_{j}^{(l)}|^{2}}\right)^\frac{1}{2}

where n is the feature dimension.

Because features have different scales, a feature with larger values has a larger effect on the distance. Thus, all features need to be normalized. There are two common normalization methods. One is min-max normalization, which is given by:

x'=\frac{x-min(x)}{max(x)-min(x)}

Python
def Normalization(self, data):
    # get the max and min value of each column
    minValue = data.min(axis=0)
    maxValue = data.max(axis=0)
    diff = maxValue - minValue
    # normalization
    mindata = np.tile(minValue, (data.shape[0], 1))
    normdata = (data - mindata)/np.tile(diff, (data.shape[0], 1))
    return normdata
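
Note that NumPy broadcasting makes the np.tile calls optional; subtracting the one-dimensional minValue directly from the two-dimensional data gives the same result. A quick check of min-max normalization on a toy 3x2 feature matrix (a standalone sketch, not the class method above):

Python
import numpy as np

data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

minValue = data.min(axis=0)   # column-wise minimum: [1., 10.]
maxValue = data.max(axis=0)   # column-wise maximum: [3., 30.]
normdata = (data - minValue) / (maxValue - minValue)
print(normdata)
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]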

The other is z-score normalization, which is given by:

x'=\frac{x-\mu}{\sigma}

where \mu is the mean of x and \sigma is the standard deviation of x.

Python
def Standardization(self, data):
    # get the mean and the standard deviation of each column
    meanValue = data.mean(axis=0)
    stdValue = data.std(axis=0)
    standarddata = (data - np.tile(meanValue,
                   (data.shape[0], 1)))/np.tile(stdValue, (data.shape[0], 1))
    return standarddata

After normalization, the code for the Euclidean distance is shown below:

Python
train_num = train_data.shape[0]
# calculate the Euclidean distance between input_sample and every training sample
distances = np.tile(input_sample, (train_num, 1)) - train_data
distances = distances**2
distances = distances.sum(axis=1)
distances = distances**0.5
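
The same distances can also be computed more concisely with broadcasting and np.linalg.norm; a minimal sketch with made-up data (the names input_sample and train_data follow the snippets above):

Python
import numpy as np

# hypothetical toy data: 4 training samples with 2 features each
train_data = np.array([[0.0, 0.0],
                       [1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 1.0]])
input_sample = np.array([0.2, 0.1])

# broadcasting subtracts input_sample from every row of train_data;
# norm(..., axis=1) gives the Euclidean distance to each training sample
distances = np.linalg.norm(train_data - input_sample, axis=1)
print(distances)  # [0.2236... 0.8062... 0.9219... 1.2041...]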

Select K

If we select a small K, the model is learned from a small neighborhood, which leads to a small approximation error but a large estimation error. In short, a small K makes the model complex and prone to overfitting. On the contrary, if we select a large K, the model is learned from a large neighborhood, which leads to a large approximation error but a small estimation error. In short, a large K makes the model simpler, at the cost of more computation.
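
A common way to choose K in practice is cross-validation, which is also mentioned in the conclusion. A minimal sketch using scikit-learn (the iris dataset is only a stand-in; any feature matrix X and label vector y would do):

Python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# evaluate a range of candidate K values with 5-fold cross-validation
# and keep the one with the highest mean accuracy
scores = {}
for k in range(1, 16):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])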

Classify

After K is determined, we can use majority voting for classification, which means the minority is subordinate to the majority. The code is shown below:

Python
disIndex = distances.argsort()
labelCount = {}
for i in range(k):
    label = train_label[disIndex[i]]
    labelCount[label] = labelCount.get(label, 0) + 1
prediction = sorted(labelCount.items(), key=op.itemgetter(1), reverse=True)
label = prediction[0][0]

In the above code, we first use argsort() to get the indices ordered by distance, then count each kind of label among the first K samples; finally, labelCount is sorted to get the label with the most votes, which is the prediction for the test object. The whole prediction function is shown below:

Python
def calcuateDistance(self, input_sample, train_data, train_label, k):
    # requires: import numpy as np, import operator as op
    train_num = train_data.shape[0]
    # calculate the Euclidean distance between input_sample and every training sample
    distances = np.tile(input_sample, (train_num, 1)) - train_data
    distances = distances**2
    distances = distances.sum(axis=1)
    distances = distances**0.5

    # get the labels of the first k distances
    disIndex = distances.argsort()
    labelCount = {}
    for i in range(k):
        label = train_label[disIndex[i]]
        labelCount[label] = labelCount.get(label, 0) + 1

    # the label with the most votes is the prediction
    prediction = sorted(labelCount.items(), key=op.itemgetter(1), reverse=True)
    label = prediction[0][0]
    return label
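
A usage sketch of the prediction function on a toy dataset (the class name KNN that wraps the methods above is an assumption for illustration; the linked repository may organize the code differently):

Python
import numpy as np

# assumed: a class KNN that contains the methods shown in this article
knn = KNN()

# toy training set: two clusters with labels 0 and 1
train_data = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8]])
train_label = np.array([0, 0, 1, 1])

test_sample = np.array([0.9, 0.9])
print(knn.calcuateDistance(test_sample, train_data, train_label, 3))  # expected: 1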

Conclusion and Analysis

In this article, KNN is implemented with a linear scan over the training set. However, there are more efficient methods for KNN, such as the kd-tree. Moreover, cross-validation can be applied to find a more suitable K. Finally, let's compare our KNN with the KNN in sklearn; the detection performance is displayed below.

[Figure: comparison of detection accuracy and runtime between this article's KNN and the sklearn KNN]

From the figure, we can see that the KNN in this article is slightly better than the sklearn KNN in terms of accuracy, while the runtime is nearly the same.
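
For reference, the sklearn side of such a comparison could be run roughly as follows (a sketch only; the actual benchmark script and dataset are in the linked repository):

Python
import time
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def sklearn_knn_baseline(X_train, y_train, X_test, y_test, k=5):
    # fit and evaluate sklearn's KNN, timing the fit/predict step
    clf = KNeighborsClassifier(n_neighbors=k)
    start = time.time()
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    runtime = time.time() - start
    return accuracy_score(y_test, pred), runtime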

The related code and dataset for this article can be found in MachineLearning.

This article was originally posted at https://github.com/DandelionLau/MachineLearning

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Engineer
Germany
Ryuk is interested in Machine Learning/Signal Processing/VoIP.
