K-Nearest neighbor Algorithm

If you are here, I’m gonna assume you have already heard a lot about machine learning. We will start of our journey by starting with really simple and intuitive algorithm. It always begins with classification problems. Apples or Oranges? What is that digit? Is it a one or perhaps seven? Hence we shall start with classification algorithms.

The first algorithm in classification problems is k-nearest-neighbor algorithm. Imagine you are an alien and you land in silicon vally. You see faces from all races. You notice that some people are labeled as Chinese, some are Indians and some are Americans. A Korean guy walks up to you to say hi. How do you label him? Your best guess would be Chinese, right? Facial feature wise Chinese people are nearest neighbors to the Korean guy. k-nearest neighbor (kNN) does something similar. In following pen. I’ve used kNN to classify body type into thin, fat and healthy based on height and weight of the person.

classify a person into fat ,healthy and thin based on height and weight

Same algorithm can be used to classify many things. You only have to change training data. You could use this to classify movies into genres. Your parameters could be number of kisses, number of violent scenes, number of comic scenes,on scale of 1 to 10 how hard was it to predict the next scene? Based on these you could classify movies into romance, horror, comedy and suspense.

Consider a two sets of dots in the space, red dots and blue dots. If I were to give you a random point and ask you color of the point what would you do? You would say red if the point was closer to red cluster and blue if the point was near the blue cluster. That’s what we are going to do.

  • Measure the distance between unknown point and all the points
  • Sort the Nodes from minimum distance to maximum distance
  • Take k number of Nodes. If majority of them say blue, it’s blue

Let’s define a point in space

var Node = function(label,x,y){
this.label = label;
this.x = x;
this.y = y;
}

Create dummy data set. Let’s classify movies. Our training data will look something like this

// [genere, kisses,fights]var trainData = [
['Action',2,8],
['Romance',16,2]
['Action',1,18]
['Action',2,8]
['Romance',14,3]
['Romance',7,0]
['Romance',12,1]
['Action',2,7]
];

‘genere’ will be a string. Rest of the parameters would be numbers. Lets create nodes and store them in an array.

var Nodes = [];
for(var i=0;i<trainData.length;i++){
var l = trainData[i][0],
x = trainData[i][1],
y = trainData[i][2];
Nodes.push(new Node(l,x,y));
}

Now we will write a function to calculate distance between two points.

Node.prototype.distanceFrom = function(node){
var sqrDist = Math.pow(this.x - node.x,2)
+ Math.pow(this.y-node.y,2);
return Math.sqrt(sqrDist);
}

We just used the geometry formula

Now we are going to write function that actually is our machine learning algorithm. Remember I said, we take voting from k closest members. That is the gist of the algorithm. k nearest neighbors.

var k = 3;
function whichGenere(node){
var dists = [];
for(var i=0;i<Nodes.length;i++){
var d = node.distanceFrom(Nodes[i]);
dists.push({label:Nodes[i].label,d:d});
}
//sort the array
var sortedArr = dists.sort(function(a,b){
return a.d-b.d;
});

var votes = {'Action':0,'Romance':0};

for(var i=0;i<k;i++){
if(sortedArr[i].label=='Action')votes.Action++
if(sortedArr[i].label=='Romance')votes.Romance++
}

if(votes.Romance>votes.Action) return 'Romance';
else return 'Action';

}

In first for loop we are calculating distance of our unknown node to every other node. Then we are storing that distance in an array called dists. Then we just sort the array from min distance to max distance. In next for loop we iterate over first k elements, because they represent closest nodes. And then it’s just basic vote counting. I generally write my programs on jsbin and then fool around a bit. Now time to test our work. Try this in console.

// 5 kisses and 0 fights should output romance.
console.log(whichGenere(new Node('',5,0)));

Now try some higher values of k. The results become inaccurate. It’s like deciding who gets chemistry Nobel prize based on votes of layman. That should be decided only by excellent chemistry scientists. But you can not ask just one scientist because he can be wrong. kNN applies same principals. Guessing the right value of k is tricky business. But yes there are ways and guidelines that can help you guess better. This is pretty powerful stuff and the algorithm itself has been around for a long time. There are many cool things you can do with this. In this example we used only two parameters to simplify the situation. But the real power lies in classifications based on multiple parameters. In this example itself you can add parameters like

  • number of comic scenes
  • number of horror scenes
  • number of emotional scenes

and then you can add labels like horror, family drama,comedy.

Our example here is two dimensional. So you can always plot it on canvas or using d3.js whatever pleases you. It’s good to get a visualization of the algorithm. In my pen, I used d3.js

When you have a classification problem kNN is the first thing you can try. It’s simple yet powerful. In next post we will talk about limitations of kNN and hence the second algorithm. I would love to see some apps on codepen or may be jsBin, if you prefer that. Don’t forget to share them with me. Until next time, practice practice practice.

just a bug hunter and problem addict