An intro to spectral clustering – complete with illustrative MATLAB toys

In this set of notes we’re going to learn about a clustering approach that throws out the K-means assumption that clusters fall in convex globular clusters and does something different: spectral clustering (we’ll cover just the vanilla version, improved and tweaked versions based on this can be found in standard papers on the subject). Spectral clustering clusters data points by treating them not as distinct convex globular clusters as in K-means but as connected subgraphs. This change in perspective allows separation of non-convex globular clusters – something K-means just can’t do. More importantly we’ll see that spectral clustering can ignore sparse interconnections between arbitrarily shaped clusters of data – a characteristic that makes it an especially useful graph-based clustering technique.  You can find the notes below

Intro to spectral clustering

And you can find a ZIP file with the MATLAB code used to generate figures for these notes here.  The code consists of two toys that allow you to place an arbitrary number of points anywhere you want on a blank canvas and then compare various clustering algorithms – they’re hella fun to play with.

About these ads

13 comments

  1. gil

    hello!

    thank you, its “very easy” to understand tutorial. and also very good one.
    can you please add the matlab function “datasample”. it seems to be missing.

    thanks

    G

  2. Petar Mrvic

    Hey Jeremy, good read, easy to understand but when I tried to test some exaples get the following error:
    “>> spectral_toy(2,1,1.5,3)
    ??? Undefined function or method ‘datasample’ for input arguments of type ‘double’.
    Error in ==> spectral_toy>kmeans at 168
    c = datasample(cand,K,’Replace’,false); % pick K random input pts as initial
    centroids

    Error in ==> spectral_toy at 117
    [D,x] = kmeans(Y,K);”

    MATLAB2009a Any idea on this? Thanks

    • neonwatty

      Petar –

      Glad to hear the writeup this was helpful!

      The error you’re receiving is likely because the function “datasample” requires the Statistics toolbox in MATLAB which maybe you don’t have. However, that line just picks random points from those you’ve inputed (which are used to initialize the centroids in the k-means clustering algorithm), and a suitable replacement can be written that doesn’t require the datasample function.

      For example, you can replace this line ( line 168) with

      c = Y(randperm(size(Y,1),K),:);

      to get the same result.

      Does that help?

      – Jeremy

  3. Petar Mrvic

    Jeremy thanks for a really quick reply.
    Statistics toolbox is installed and tried this modified line but then get:
    Error using ==> randperm
    Too many input arguments
    Same thing on a PC with Matlab 2010

  4. neonwatty

    Ah – ok!

    You can do the same thing using only one parameter with randperm by replacing the previous suggestion

    c = Y(randperm(size(Y,1),K),:);

    with these 3 lines to produce the same effect

    samples = randperm(size(Y,1));
    samples = samples(1:K);
    c = Y(samples,:);

    If you like, here is an updated version of the file with these 3 lines replacing the old suggestion, along with a test dataset

    https://drive.google.com/file/d/0B9LZEwqBZcp4NmVNMTg4VXFka2c/edit?usp=sharing

    You can uncomment the UI to place your own points, and comment out the data load in. With the test dataset, the file should produce an image like this one

    https://drive.google.com/file/d/0B9LZEwqBZcp4U0lMSUhDX1A3YUE/edit?usp=sharing

  5. Petar Mrvic

    Managed to get it working last night but mostly not getting expected results
    [IMG]http://i.imgur.com/zl9zKFZ.png[/IMG]
    And same with this last modified code you uploaded, results differ from yours
    [IMG]http://i.imgur.com/ZC9IQnH.png[/IMG]

    • neonwatty

      Progress!

      Can you try running the last version with the fixed data set several times? Results of spectral clustering on the same data set can (and usually do) vary from try to try based on different initializations of k-means (and here it has a random initialization).

      • neonwatty

        PS – another factor contributing to mis-clustering with this method, besides of course choosing the right number of clusters (which thankfully with this toy we can ignore) is choosing the right parameters for the graph (e.g., if choosing k-nearest neighbors what should k be?). Its not obvious to me beforehand how to make this choice, and in practice to my knowledge this must be tuned.

        So the toy really alleviates only one of the two serious problems with spectral clustering (choosing the right number of clusters), but tuning the graph parameters correctly is still a serious issue.

        This issue discussed further (for a few types of common graphs) in section 8 of this popular tutorial

        http://www.kyb.mpg.de/fileadmin/user_upload/files/publications/attachments/Luxburg07_tutorial_4488%5b0%5d.pdf

  6. Petar Mrvic

    Yes, progress :)
    And you are correct, tried multiple runs with varying results. Thanks for the link to the Luxburg tutorial, will give it a read after my hayfever subsides a bit, can’t function at the moment.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s