Tuesday, January 3, 2012

Clustering - Silhouette, K-means & K-medoids

One of my friend sent out an email asking for suggestion on an algorithm to cluster data given a similarity matrix. And that led to a discussion and another friend forwarded a link to k-medoids algorithm and suggested using it instead of k-means because its more robust to noise etc. But I was still wondering how to find the right k and even if I find the right k is k-medoids better than k-means. I was going to leave it to my friend to figure it out but was too impatient to wait that long.

So, I decided to investigate more and kept reading the wikipedia pages and research papers pertitnent to the topic. Instead of just reading this time I wanted to challenge myself. So, I kept a But, instead of just reading it I wanted to challenge myself. So, I set a 5 hour deadline, to understand, formalize, implement, test, create a video, upload and finally write this blog post on it. I started a timer and boy I was in for a surprise... It took me 6 hours to get everything done :(

There are three advantages I found in doing this.
1) I learnt about the two methods and related work in a quick time, good enough to implement them and know their limitations.
2) A timer countdown helps you focus and ignore everything else instead of just saying I will do this today!
3) Most important, I wanted to see how fast I could transition from basic knowledge to the point of having a running code and beautiful outputs with a good report and finally a blog entry that will stay forever(or till blogspot's existence)

I started on a whiteboard with todo list(image below) and time estimates and within 4 hours got the code working after which I had to spend time on making the video, writing the report, uploading, creating the blog entry etc..

The youtube video of the final output

Report: pdf
Matlab code: code

1 comment:

doctorscholls00 said...

This was fun, try this in matlab:

kl=@(p,q) p*log(p/q)+(1-p)*log((1-p)/(1-q))

It'll plot the KL divergence against a fair coin for a range of p-values. Don't ask me why i did this, because i have no idea