Due to their wide applicability, sparse and low-rank modelling tools have quickly become some of the most important techniques for today’s researchers in signal, image, and video processing as well as machine learning. These tools have been applied across a wide range of topics in these fields, including compressive sensing, image inpainting, super-resolution, denoising, foreground/background separation, image and video compression, object recognition and classification, L-1 regularized regression, least absolute deviations, Robust Principal Component Analysis, and collaborative filtering.
In this tutorial we review some of the most useful optimization tools for tackling sparse and low-rank problems, including greedy methods for sparse recovery, smooth reformulation tricks, Accelerated Proximal Gradient Methods, and the Alternating Direction Method of Multipliers.
Part 1 – Motivating Problems and Applications
Part 2 – Quick and Painless Optimization Approaches
Part 3 – Accelerated Proximal Gradient Methods
Part 4 – Alternating Direction Method of Multipliers
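If you’d like a taste of the proximal gradient material before diving into Part 3, below is a minimal MATLAB sketch (not the tutorial’s own code) of the unaccelerated proximal gradient method, ISTA, applied to an L-1 regularized least squares problem; the problem dimensions and the value of lambda are purely illustrative.

    % Proximal gradient (ISTA) for min_x 0.5*||A*x - b||^2 + lambda*||x||_1
    rng(0);                                       % reproducible toy example
    A = randn(100, 300);                          % wide (underdetermined) system
    x_true = zeros(300, 1);
    x_true(randperm(300, 10)) = randn(10, 1);     % 10-sparse ground truth
    b = A*x_true;

    lambda = 0.1;                                 % regularization weight (illustrative)
    L = norm(A)^2;                                % Lipschitz constant of the gradient
    x = zeros(300, 1);
    for k = 1:500
        z = x - A'*(A*x - b)/L;                   % gradient step on the smooth part
        x = sign(z).*max(abs(z) - lambda/L, 0);   % soft-thresholding (prox of the L-1 term)
    end

Adding Nesterov-style momentum to this loop gives the accelerated variant covered in the tutorial.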
The typical way we measure ‘mastery’ of a technical term or concept in K-12 and college mathematical science courses is one-dimensional: can you regurgitate an equation related to this term or concept, and maybe reason out a small logical statement relating to this equation? While this is definitely a part of understanding a mathematical concept, in my personal experience it doesn’t correlate with real mastery of a subject, and sometimes it doesn’t truly enable you to understand the thing you’re studying.
The simple model below – reproduced by Reza Borhani from the paper “Improving Student Comprehension by Thinking About a Topic in Multiple Ways” by Mark Vondracek (the original paper can be found here) – is one of the best and most economical descriptions I’ve ever seen of how to teach and truly master a mathematical idea. It’s an amazing piece of pedagogical technology, complete and very portable (I keep a small business-card sized version in my wallet). Yes, you need to know the equation related to the term or concept you’re studying/teaching. But you also need to invest time in understanding the idea from other perspectives, all of which I think you’ll agree are valuable components of a full understanding of a technical term or concept.
So when you learn a new idea in the mathematical sciences, don’t be content with just memorizing an equation (even though that might be all that’s required of you in class). Be disciplined with yourself. Take a few minutes and make sure you’ve digested the idea from all six of these perspectives. You’ll not only be able to claim true mastery of the idea in the present, but you’ll retain the concept a lot longer and be better equipped to recognize situations where the idea can be deployed in your future work.
For more amazing ideas on how to teach technical subjects – especially physics – see Mark’s class blog located here.
Standard machine learning tasks can be quite costly to run on large, high dimensional datasets. It’s therefore hugely beneficial to first reduce the dimension of the dataset prior to analysis, since the same analysis can be performed at significantly reduced computational cost on lower dimensional data. Principal Component Analysis (PCA) is a popular approach for reducing the dimension of a dataset that has been applied in a tremendous number of fields, from image and signal processing, to machine learning and statistics, to sociology and economics. PCA works by first finding a basis of a particular low dimensional subspace, and then projecting the data onto the span of those basis vectors. The specific subspace sought in PCA is one that preserves important elements of the original structure of the data post projection.
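For the curious, here is a hedged MATLAB sketch of the basic PCA computation just described, using a random toy data matrix and k = 2 components as placeholder choices (the notes’ own code is linked further down).

    % PCA via the SVD of the centered data matrix X (one data point per column)
    rng(0);
    X = randn(10, 200);                      % toy dataset: 10 dimensions, 200 points
    k = 2;                                   % target (reduced) dimension

    mu = mean(X, 2);                         % mean of the data points
    Xc = X - repmat(mu, 1, size(X, 2));      % center the data
    [U, S, V] = svd(Xc, 'econ');             % principal components = columns of U
    C = U(:, 1:k);                           % basis for the k-dimensional subspace
    W = C'*Xc;                               % k-dimensional representation of the data
    X_approx = C*W + repmat(mu, 1, size(X, 2));   % low dimensional reconstruction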
You can find my notes below
Notes on PCA
You can find a ZIP file with the MATLAB code used to generate figures for these notes here. The code is a little toy that allows you to place an arbitrary number of points anywhere you want on a blank canvas and then run PCA on those points. A plot of your (centered) points with their principal components is returned. A second plot with your data projected onto the first principal component is also returned so you can compare how well the data structure is preserved post projection.
The Singular Value Decomposition (SVD) is an incredibly useful factorization of rectangular matrices, entirely analogous to the factorization of square symmetric matrices provided by the Spectral Theorem of Symmetric Matrices (see my notes on eigenvalues/vectors for more details). The motivation for the SVD is also completely analogous to the motivation for eigenvectors and eigenvalues we explored in those notes – I recommend looking back at them prior to reading the motivating problem for the SVD. However, unlike the eigen-factorization of a square symmetric matrix, the SVD can be applied to any rectangular matrix, hence its utility in applications is significantly broader. We’ll look at just one common example of the SVD applied to linear Least Squares regression, but describe in detail an enormous machine learning application of the SVD called Principal Component Analysis in the next set of notes.
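As a quick preview of the Least Squares application, here is a small MATLAB sketch (with arbitrary dimensions) showing how the SVD produces the least squares solution via the pseudoinverse.

    % Solving least squares min_x ||A*x - b|| via the SVD
    rng(0);
    A = randn(50, 5);                % tall (overdetermined) system
    b = randn(50, 1);

    [U, S, V] = svd(A, 'econ');      % the factorization A = U*S*V'
    x_svd = V*(S\(U'*b));            % pseudoinverse solution x = V*inv(S)*U'*b
    x_ref = A\b;                     % MATLAB's built-in least squares solve
    norm(x_svd - x_ref)              % the two agree up to roundoff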
You can find the SVD notes below
Notes on SVD
The concept of eigenvectors and eigenvalues is motivated by a natural desire to understand how a given square matrix A acts on vectors x (by the multiplication Ax) in the simplest possible terms. This naturally leads to the pursuit of a particular basis (of the input/output space of A) over which the multiplication decomposes as parsimoniously as possible. We’ll see in a later set of notes that the motivation for the Singular Value Decomposition (SVD) is essentially the same – only it deals with rectangular matrices. Because the SVD serves as the foundation of Principal Component Analysis – an incredibly useful machine learning tool we’ll learn about in yet a further set of notes – eigenvalues, eigenvectors, and the Spectral Theorem of Symmetric Matrices have far-reaching and very practical implications in machine learning.
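As a small numerical illustration of the Spectral Theorem, here is a MATLAB sketch that factorizes a (randomly generated, purely illustrative) symmetric matrix and verifies the ‘simplest possible terms’ interpretation above.

    % The Spectral Theorem in action: a symmetric A factors as A = V*D*V'
    rng(0);
    B = randn(5);
    A = (B + B')/2;                   % symmetrize to obtain a symmetric matrix

    [V, D] = eig(A);                  % columns of V are eigenvectors, D holds eigenvalues
    norm(A - V*D*V')                  % factorization error: essentially zero
    norm(V'*V - eye(5))               % the eigenvector basis is orthonormal
    norm(A*V(:,1) - D(1,1)*V(:,1))    % A acts on each eigenvector by simple scaling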
My notes on these topics can be found via the link below
Notes on eigenvalues, eigenvectors, and the Spectral Theorem of Symmetric Matrices
In this set of notes we’re going to learn about a clustering approach that throws out the K-means assumption that data fall into convex globular clusters and does something different: spectral clustering (we’ll cover just the vanilla version; improved and tweaked versions can be found in standard papers on the subject). Spectral clustering treats data points not as members of distinct convex globular clusters, as in K-means, but as nodes of connected subgraphs. This change in perspective allows the separation of non-convex clusters – something K-means just can’t do. More importantly, we’ll see that spectral clustering can ignore sparse interconnections between arbitrarily shaped clusters of data – a characteristic that makes it an especially useful graph-based clustering technique. You can find the notes below
Intro to spectral clustering
And you can find a ZIP file with the MATLAB code used to generate figures for these notes here. The code consists of two toys that allow you to place an arbitrary number of points anywhere you want on a blank canvas and then compare various clustering algorithms – they’re hella fun to play with.
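If you want a feel for the vanilla algorithm before opening the notes, here is a rough MATLAB sketch of spectral clustering on a toy two half-moons dataset; the similarity scale sigma is an illustrative choice, and the snippet assumes the Statistics Toolbox for pdist2 and kmeans.

    % Vanilla spectral clustering on two non-convex (half-moon) clusters
    rng(0);
    t = linspace(0, pi, 100)';
    X = [[cos(t), sin(t)]; [1 - cos(t), 0.5 - sin(t)]];   % two half-moons
    X = X + 0.05*randn(size(X));

    sigma = 0.2;                            % similarity scale (illustrative)
    W = exp(-pdist2(X, X).^2/(2*sigma^2));  % Gaussian similarity graph
    L = diag(sum(W, 2)) - W;                % (unnormalized) graph Laplacian
    [V, E] = eig(L);                        % full eigendecomposition (L is small here)
    [~, idx] = sort(diag(E));               % order eigenvalues from smallest up
    labels = kmeans(V(:, idx(1:2)), 2);     % k-means on the spectral embedding

    scatter(X(:,1), X(:,2), 30, labels, 'filled')   % K-means alone would fail here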
This is a short introduction to the Least Squares (LS) and Least Absolute Deviations (LAD) approaches to linear regression. We’ll see how to set up optimization models for both problems and use a simple simulated dataset to compare the methods. The comparison illustrates how LAD is robust to outliers (its L1 norm penalizes large residuals only linearly), whereas LS is not. A pdf of the notes may be downloaded via the link below
Application of L-1 norm modeling in machine learning – LS and LAD
A zip file containing annotated MATLAB code that generates the figures in this set of notes can be found here.
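For a taste of the comparison before downloading, below is a rough MATLAB sketch (not the notes’ code) fitting both models to a toy dataset with a single outlier; the LAD fit uses a simple iteratively reweighted least squares loop, which is just one of several ways to attack the L1 problem.

    % LS vs. LAD on a toy 1-d regression problem with one outlier
    rng(0);
    x = linspace(0, 1, 30)';
    y = 1 + 2*x + 0.1*randn(30, 1);         % true intercept 1, slope 2
    y(end) = 15;                            % a single large outlier
    X = [ones(30, 1), x];                   % design matrix with an intercept

    theta_ls = X\y;                         % least squares fit: dragged by the outlier

    theta = theta_ls;                       % LAD fit via iteratively reweighted LS
    for k = 1:50
        r = max(abs(y - X*theta), 1e-6);    % residual magnitudes, floored away from 0
        w = 1./r;                           % large residuals receive small weight
        theta = (X'*diag(w)*X)\(X'*diag(w)*y);
    end
    theta_lad = theta;                      % stays close to [1; 2] despite the outlier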
This is a short introduction to sparse modeling from a linear algebra perspective. We’ll see how to set up the basic optimization model for recovering the sparsest solution to an underdetermined system, along with a simulated recovery example comparing the sparsest solution of a given system with its smallest (minimum norm) solution.
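Below is a hedged MATLAB sketch of this kind of experiment (the problem sizes are illustrative): the sparsest solution is found by solving the standard L1 linear program with linprog from the Optimization Toolbox, and is compared against the smallest (minimum norm) solution.

    % Sparse recovery: min ||x||_1 subject to A*x = b, posed as a linear program
    rng(0);
    n = 100; m = 30;
    A = randn(m, n);                        % underdetermined system (m < n)
    x_true = zeros(n, 1);
    x_true(randperm(n, 4)) = randn(4, 1);   % 4-sparse ground truth
    b = A*x_true;

    % Split x = u - v with u, v >= 0, so that ||x||_1 = sum(u + v)
    f   = ones(2*n, 1);                     % objective: sum of the entries of u and v
    Aeq = [A, -A];                          % equality constraint A*(u - v) = b
    uv  = linprog(f, [], [], Aeq, b, zeros(2*n, 1), []);
    x_l1 = uv(1:n) - uv(n+1:end);           % the sparsest (L1) solution

    x_l2 = pinv(A)*b;                       % the smallest (minimum norm) solution
    [norm(x_l1 - x_true), norm(x_l2 - x_true)]   % L1 recovers x_true; L2 does not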
A pdf of the notes may be found below.
You can find annotated MATLAB code that produces the experiments and plots illustrated in the notes in a zip file located here.
You can find below some slides I put together for a review of the article “Enhancing Sparsity by Reweighted L-1 Minimization” by Candès, Wakin, and Boyd. The paper presents a very simple and effective technique for improving on standard basis pursuit.
Reweighted L1 lab talk
You can find the original article here – it’s definitely worth reading.
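For the impatient, here is a rough MATLAB sketch of the paper’s scheme (using linprog and illustrative problem sizes): each pass solves a weighted basis pursuit problem, then resets the weights inversely proportional to the magnitudes of the current solution.

    % Reweighted L1: iterate min sum(w.*|x|) s.t. A*x = b, with w = 1./(|x| + eps)
    rng(0);
    n = 100; m = 25;
    A = randn(m, n);
    x_true = zeros(n, 1);
    x_true(randperm(n, 8)) = randn(8, 1);   % 8-sparse ground truth
    b = A*x_true;

    w = ones(n, 1);                         % first pass = plain basis pursuit
    eps_w = 0.1;                            % stability parameter (illustrative)
    for pass = 1:4
        f  = [w; w];                        % weighted L1 objective via x = u - v
        uv = linprog(f, [], [], [A, -A], b, zeros(2*n, 1), []);
        x  = uv(1:n) - uv(n+1:end);
        w  = 1./(abs(x) + eps_w);           % small entries get penalized more heavily
    end
    norm(x - x_true)                        % typically improves on a single L1 pass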
NU ML is very excited to have NU EECS Prof. Doug Downey as our May speaker! A talk abstract may be found below. You can find out more about Prof. Downey’s work at his website: http://www.cs.northwestern.edu/~ddowney/
We will again meet in the Tech Building room L324 (EECS conference room) on Northwestern’s Evanston campus at 2145 Sheridan Rd. You can learn more about the NU Machine Learning meetup by visiting our website here.
Talk title: Scaling Semi-supervised Learning to the Web
Abstract: Semi-supervised machine learning (SSL) techniques attempt to learn classifiers from large amounts of unlabeled data, in addition to a smaller amount of hand-labeled data. SSL has been shown to improve accuracy when compared to standard supervised learners, which do not leverage unlabeled data. However, most existing SSL techniques have scalability limitations – they require multiple passes over the entirety of the unlabeled data for each new concept to be learned. This makes SSL difficult to apply to massive data sets like the Web, where new concepts of interest arise frequently. Requiring even a single pass over the Web for each new concept is intractable.
In this talk, I will describe new SSL techniques that scale to large data sets like the Web. I start by describing new methods for learning word representations from large text corpora using unsupervised Hidden Markov Models, and how the resulting representations can be utilized for Web Information Extraction. Then, I will describe new methods for scaling SSL Naïve Bayes text classifiers to massive text corpora using lexical statistics. Lastly, I will demonstrate two new prototype tools for Web search: the Atlasify exploratory search system, which automatically generates visualizations to help Web users investigate complex concepts; and Wikitables, a search engine for Wikipedia tables.