# An intro to spectral clustering – complete with illustrative MATLAB toys

In this set of notes we’re going to learn about a clustering approach that throws out the K-means assumption that clusters fall in convex globular clusters and does something different: spectral clustering (we’ll cover just the vanilla version, improved and tweaked versions based on this can be found in standard papers on the subject). Spectral clustering clusters data points by treating them not as distinct convex globular clusters as in K-means but as connected subgraphs. This change in perspective allows separation of non-convex globular clusters – something K-means just can’t do. More importantly we’ll see that spectral clustering can ignore sparse interconnections between arbitrarily shaped clusters of data – a characteristic that makes it an especially useful graph-based clustering technique. You can find the notes below

And you can find a ZIP file with the MATLAB code used to generate figures for these notes here. The code consists of two toys that allow you to place an arbitrary number of points anywhere you want on a blank canvas and then compare various clustering algorithms – they’re hella fun to play with.

hello!

thank you, its “very easy” to understand tutorial. and also very good one.

can you please add the matlab function “datasample”. it seems to be missing.

thanks

G

G – datasample is a built in matlab function. See below!

http://www.mathworks.com/help/stats/datasample.html

Hey Jeremy, good read, easy to understand but when I tried to test some exaples get the following error:

“>> spectral_toy(2,1,1.5,3)

??? Undefined function or method ‘datasample’ for input arguments of type ‘double’.

Error in ==> spectral_toy>kmeans at 168

c = datasample(cand,K,’Replace’,false); % pick K random input pts as initial

centroids

Error in ==> spectral_toy at 117

[D,x] = kmeans(Y,K);”

MATLAB2009a Any idea on this? Thanks

Petar –

Glad to hear the writeup this was helpful!

The error you’re receiving is likely because the function “datasample” requires the Statistics toolbox in MATLAB which maybe you don’t have. However, that line just picks random points from those you’ve inputed (which are used to initialize the centroids in the k-means clustering algorithm), and a suitable replacement can be written that doesn’t require the datasample function.

For example, you can replace this line ( line 168) with

c = Y(randperm(size(Y,1),K),:);

to get the same result.

Does that help?

– Jeremy

Jeremy thanks for a really quick reply.

Statistics toolbox is installed and tried this modified line but then get:

Error using ==> randperm

Too many input arguments

Same thing on a PC with Matlab 2010

Weird – can you try running this updated version of spectral_toy (that uses randperm). Download at

https://drive.google.com/file/d/0B9LZEwqBZcp4UXlWVUNIdmJaTVE/edit?usp=sharing

Unfortunatelly same thing

Thanks again for trying, but I don’t won’t to bother you with this anymore. Will try to resolve it, myself

Regards

Petar

It seems that randperm until Matlab 2012 accepted only 1 parameter hence the error.

Ah – ok!

You can do the same thing using only one parameter with randperm by replacing the previous suggestion

c = Y(randperm(size(Y,1),K),:);

with these 3 lines to produce the same effect

samples = randperm(size(Y,1));

samples = samples(1:K);

c = Y(samples,:);

If you like, here is an updated version of the file with these 3 lines replacing the old suggestion, along with a test dataset

https://drive.google.com/file/d/0B9LZEwqBZcp4NmVNMTg4VXFka2c/edit?usp=sharing

You can uncomment the UI to place your own points, and comment out the data load in. With the test dataset, the file should produce an image like this one

https://drive.google.com/file/d/0B9LZEwqBZcp4U0lMSUhDX1A3YUE/edit?usp=sharing

Managed to get it working last night but mostly not getting expected results

[IMG]http://i.imgur.com/zl9zKFZ.png[/IMG]

And same with this last modified code you uploaded, results differ from yours

[IMG]http://i.imgur.com/ZC9IQnH.png[/IMG]

Progress!

Can you try running the last version with the fixed data set several times? Results of spectral clustering on the same data set can (and usually do) vary from try to try based on different initializations of k-means (and here it has a random initialization).

PS – another factor contributing to mis-clustering with this method, besides of course choosing the right number of clusters (which thankfully with this toy we can ignore) is choosing the right parameters for the graph (e.g., if choosing k-nearest neighbors what should k be?). Its not obvious to me beforehand how to make this choice, and in practice to my knowledge this must be tuned.

So the toy really alleviates only one of the two serious problems with spectral clustering (choosing the right number of clusters), but tuning the graph parameters correctly is still a serious issue.

This issue discussed further (for a few types of common graphs) in section 8 of this popular tutorial

http://www.kyb.mpg.de/fileadmin/user_upload/files/publications/attachments/Luxburg07_tutorial_4488%5b0%5d.pdf

Yes, progress :)

And you are correct, tried multiple runs with varying results. Thanks for the link to the Luxburg tutorial, will give it a read after my hayfever subsides a bit, can’t function at the moment.