Slides from my review of article “Enhancing Sparsity by Reweighed L-1 Minimization” given to IVPL lab on 5/9/13

You can find some slides I put together for a review of the article ”Enhancing Sparsity by Reweighed L-1 Minimization” by Candès, Wakin, and Boyd.  A very simple and effective technique for improving standard lasso problems.

Rewighted L1 lab talk

You can find the original article here - its definitely worth reading.

NU Machine learning meetup May talk: EECS Prof. Doug Downey “Scaling Semi-supervised Learning to the Web”

NU ML is very excited to have NU EECS Prof. Doug Downey as our May speaker!  A talk abstract may be found below.  You can find out more about Prof. Downey’s work at his website: http://www.cs.northwestern.edu/~ddowney/

We will again meet in the Tech Building room L324 (EECS conference room) on Northwestern’s Evanston campus at 2145 Sheridan Rd.  You can learn more about NU Machine Learing meetup by visiting our website here.

Talk title: Scaling Semi-supervised Learning to the Web

Abstract: Semi-supervised machine learning (SSL) techniques attempt to learn classifiers from large amounts of unlabeled data, in addition to a smaller amount of hand-labeled data.  SSL has been shown to improve accuracy when compared to standard supervised learners, which do not leverage unlabeled data.  However, most existing SSL techniques have scalability limitations – they require multiple passes over the entirety of the unlabeled data for each new concept to be learned.  This makes SSL difficult to apply to massive data sets like the Web, where new concepts of interest arise frequently.  Requiring even a single pass over the Web for each new concept is intractable.

In this talk, I will describe new SSL techniques that scale to large data sets like the Web.  I start by describing new methods for learning word representations from large text corpora using unsupervised Hidden Markov Models, and how the resulting representations can be utilized for Web Information Extraction.  Then, I will describe new methods for scaling SSL Naïve Bayes text classifiers to massive text corpora using lexical statistics.  Lastly, I will demonstrate two new prototype tools for Web search: the Atlasify exploratory search system, which automatically generates visualizations to help Web users investigate complex concepts; and Wikitables, a search engine for Wikipedia tables.

I’m co-teaching a new course next winter: “Sparse and low-rank recovery problems in signal processing and machine learning”

I’m incredibly stoked to announce that in the winter quarter of the 2013-2014 academic year I will be co-teaching a brand new graduate course at NU titled

“Sparse and low rank recovery problems in signal processing and machine learning”

along with Aggelos Katsaggelos.  A tentative syllabus for the course may be found here.

The course is intended for researchers of all academic backgrounds interested in the field and will have a practical focus on applications in signal processing and machine learning as well as optimization methods for solving sparse and low rank problems.  The course is based on a manuscript I am currently writing with the help of Aggelos and Reza Borhani, a draft of which will be provided as notes to those enrolled.

NU Machine Learning Meetup’s Feb seminar “How to Compete on Kaggle and Win” next Tuesday, February 19 @ 5:30pm

How to Compete on Kaggle
and Win !
Halla Yang, Ph.D.
5:30 p.m., Tuesday, February 19th 2013
Graduate Student Commons (Room 250)
2122 Sheridan Road, Evanston.

Abstract: Kaggle is an innovative solution for statistical/analytics outsourcing. Companies, governments and researchers present datasets and problems – the world’s best data scientists then compete to produce the best solutions. At the end of a competition, the competition host pays prize money in exchange for the intellectual property behind the winning model.

I will provide an overview of Kaggle and talk about some of the tools and techniques I have used to win my first Kaggle competition. See http://www.kaggle.com/users/36141/halla#profile-results for a complete set of my competition results.

About the Speaker I hold a B.A. in physics from Harvard, as well as a Ph.D. in Business Economics from Harvard. After finishing my doctorate, I developed trading strategies at Goldman Sachs Asset Management in New York and managed portfolios at Arrowstreet Capital in Boston. When my wife accepted a tenure-track position at Northwestern, I resigned from my job to move to Evanston. My former employer required me to sit out for one year, and I decided to spend that year exploring a topic I knew little about but was keenly interested in – machine learning – via competing on Kaggle. Eight competitions later, I am currently ranked in the top 50 out of over 76,000 registered Kaggle users, and recently won my first competition, an invitation-only event sponsored by Pfizer. I have a Kaggle hoodie to boot. This March, when my non-compete is over, I’ll be joining a proprietary trading firm in Chicago.

Find out more about NU ML Meetup and RSVP at:
http://www.meetup.com/NU-Machine-Learning-Meetup

Brian Lange of Datascope Analytics to give Jan talk at NU Machine Learning Meetup!

On Tuesday Jan 22nd from 5:30-7pm in the Grad Student Commons Brian will discuss his recent work with Datascope Analytics, a great Chicago analytics startup founded by two Northwestern graduates!  You can find out more about Datascope Analytics at their site here.  And you can find the NU Machine Learning Meetup here.

Talk title and abstract:


Think Like a Designer How Smarter Process Leads to Smarter Analysis

I will talk about why data-first problem solving sometimes misses the mark, how machine learning and the design process go hand in hand, and what that means for changing the ways machine learning is applied. My presentation will also feature some demonstrations of Datascope Analytics projects where the harmonious pair of data and design resulted in some pretty cool tools- one for tracking food trucks, and the other for helping lawyers find digital evidence.

Brian’s Bio:

Brian Lange is a student in the Computer Engineering program at Northwestern University. He also spends co-op quarters working as a Data Scientist at Datascope Analytics, where he leads design process exercises and designs analyses, web interfaces and visualizations. He has contributed to projects for P&G, Thomson Reuters, and other well known companies and his work has been featured on FlowingData. While he’s not nerding out about typography and information density, he enjoys science and comedy podcasts, making beer, and listening to weird music.

Abstract of Nov 5 NU Machine Learning talk by Prof. Brenna Argall!!!

Abstract:

Demonstration-based learning is a powerful and practical technique to develop robot motion control behaviors, which can be further assisted by continuing to learn from experience after demonstration. The first part of this talk will provide a crash course in the area of Robot Learning from Demonstration (LfD). The second part will overview an approach developed in my research, that acquires a motion control policy from teacher demonstration and then adapts the learned policy with corrective feedback. Validation of this approach has been carried out on in two very different robot domains: mobility control for a wheeled robot, and manipulation control for a high degree-of-freedom humanoid. Corrective feedback has been found to improve the performance of a demonstrated behavior, as well as to enable its adaptation different motion control tasks.

Bio:

Brenna Argall is the June and Donald Brewer Junior Professor of Electrical Engineering and Computer Science at Northwestern University (NU). She also holds a Research Scientist position within the Sensory Motor Performance Program at the Rehabilitation Institute of Chicago (RIC), and is an assistant in the Department of Physical Medicine and Rehabilitation (PMR) at NU. Prior to joining Northwestern and RIC, she was a postdoctoral fellow (2009-2011) in the Learning Algorithms and Systems Laboratory at the École Polytechnique Fédérale de Lausanne (EPFL). Her Ph.D. in Robotics (2009) was received from the Robotics Institute at Carnegie Mellon University, as well as her M.S. in Robotics (2006) and B.S. in Mathematics (2002). Prior to graduate school, she held a Computational Biology position in the Laboratory of Brain and Cognition, at the National Institutes of Health (NIH). Her research interests lay at the intersection of robotics, machine learning and rehabilitation; in particular, on learning robot motion control from demonstration and then enriching behavior development with human feedback, with a focus on robotic devices that provides physical assistance.

 

Find the NU Machine Learning meetup online here: http://www.meetup.com/NU-Machine-Learning-Meetup/

 

My talk summary for Nov 13th Chicago ML Meetup!

Find the Chicago ML Meetup here!

Title: An introduction to Generalized K-means with image processing applications

Summary: In this era of digital Big Data, large scale collections of digital images are proliferating all over the place on and offline. For example

  • users on Facebook upload more than 200 million photos every day
  • in the medical imaging domain, over 68 million CT scans were performed in the US last year
  • IT giants are building enormous visual maps of the world from massive collections of street view images

This explosive growth in image data poses serious challenges in terms of both storage – that is, how do we more efficiently compress rapidly growing collections of images? – and search – that is, how can we more effectively sort through image databases? – and in each case the best solutions developed so far rely heavily on machine learning techniques. Generalized K-means (G-K-means), more commonly called Dictionary Learning, is one of the key machine learning tools researchers such as my self are using to attack the storage problem. While at a high level this technique is really just exactly what you’d expect it to be – a generalized version of the standard K-means where you assign a data point to multiple clusters instead of just one – algorithm-wise it falls into the bucket of modern sparse statistical methods (e.g. compressive sensing, the lasso) which have been mentioned at some previous meetups.

This talk will be a user-friendly introduction to G-K-means with a practical algorithmic and application focus. I’ll first review the standard K-means algorithm and its popular application to single image compression. I’ll then show how, viewing K-means as a sparse statistical method, you can easily derive the analogous G-K-means model along with a natural greedy algorithm for solving it. Finally I’ll show some cool applications of G-K-means to the processing of large databases of images, and discuss its application to the storage problem – that is, to large scale image collection compression.

Decision Theory and Chicken Bones

Fortune telling, its still kind of a thing people are still into these days. Shamans, psychics, priests, palm readers, magic 8 balls; in almost every patch of the globe you’ll find someone working the trade. Each type of diviner has his own distinct set of data he employs to predict a poor schmo’s future: taro cards, the orientation of the stars, hell some even use chicken bones. Typically what the guy with the chicken bones will do (according to my expert knowledge acquired through various sci-fi TV shows) is take a handful of ivories and toss them on the ground. Then combining some understanding of the chicken bone constellation he makes a prediction about how lucky the person will be, say in the next week.

I’m not a believer in this sort of thing, but lets suppose for the sake of this post that chicken-bone guy’s schtic really does work. That is just using the constellation of the bones for a particular customer he divines their future luck with some reasonable accuracy. If it was seriously legit, then I’d definitely want in on a piece of the action, I mean it seems like a pretty sweet gig. You get respect from your community, you have a reasonable cash flow situation going on, and the hours seem descent. But I have feeling that chicken-bone dudes don’t just go around telling everyone how their mojo works. So if I want to learn the tricks of the trade myself I’ll have to do it covertly.

But what exactly do I need extract from one of these guys? Well, first off I’d need data: in particular for a bunch of a shaman’s future customers I need to collect input data, i.e. something about the bone constellations, as well as and outupt data, which is some measure of how lucky each person was over the week following their divination. After finding my mark shaman, here’s how I’d collect that stuff

  1. Install a hidden camera in chicken-bone guy’s office, then take snapshots of the bones rolled for each future customer.
  2.  Install hidden cameras/wire tap chicken-bone guy’s future customers, then I can make a good guess about how lucky each person was afterword by studying the tapes.

Now lets say I’ve done all that, its cool I’ve seen plenty of Bond/Bourne movies, and I have all the info on a bunch of one shaman’s customers. Just having the data won’t allow me to emulate this dude, I still need to figure out how the he uses the placement of the bones to predict the luck of each schmo. More precisely, I need to know shaman’s divination function. How to go about doing that? Well luckily several tools from Statistics and Machine Learning were designed precisely to answer this sort of question.

Say the shaman uses p distinct bones for the divination process for all of his clients, and that I have observed (through my covert ops) N  schmos’ divinations. To make the manipulations ahead a bit easier to digest lets assume from here on in that the bones are 2-D, like in this pic where p=3

What information do you think the shaman is using from this image of the tossed bones? I think it makes sense that he wouldn’t care where the collection of bones as a whole landed (i.e. on what part of the table/floor he tosses them on), but only the orientations of each bone along with the relative distances between all the bones. That means that in the analysis of the spy pics I took for each client what needs to be recorded every bone is

  1. its angle of rotation (with respect to an arbitrary fixed position for that bone) and
  2. its distances to the other bones.

The measurements we would take for the first bone, for example, are illustrated below

Note then that all together the number of features extracted from
each customer’s bone image is d=p+\frac{p\left(p-1\right)}{2}:
since we take p angles (one per bone), along with \underset{i=1}{\overset{p-1}{\sum}}i=\frac{p\left(p-1\right)}{2}
distances all together between the bones (making sure we don’t double
count e.g. d_{12}=d_{21}).

For i=1...N lets denote by \mathbf{x}_{i}\in\mathbb{R}^{d} the vector of bone data and y_{i}\in\mathbb{R} the corresponding luck value for the i^{th} customer (say the more positive the value the luckier the person, the more negative the value the more unlucky). Further, although we can reasonably assume our sample input/output are instances of a random vector \mathbf{x} (for sure the bones data is random for each client) and random variable y (possible futures for the schmos range from the good to the not so good), we don’t know their joint distribution. Remember that I did get all the data via hidden cameras and wire taps, I just observed what happened for each person. So we have no other recourse than to assume that p\left(g\left(\mathbf{x}_{i}\right),y_{i}\right)=\frac{1}{N}, i.e. that every pair \left(g\left(\mathbf{x}_{i}\right),y_{i}\right) is equally likely to occur. Now, its natural to try to identify the shaman’s divination function g, which will give the best predicted output y for a given input data, by solving for the function which minimizes the Mean Squared Error (MSE) between g\left(\mathbf{x}\right) and y ala

\underset{g}{argmin}\,\mathbb{E}_{\left(g\left(\mathbf{x}\right),y\right)}\left[\left(y-g\left(\mathbf{x}\right)\right)^{2}\right]=\underset{g}{argmin}\frac{1}{N}\underset{i=1}{\overset{N}{\sum}}\left(y_{i}-g\left(\mathbf{x}_{i}\right)\right)^{2}

Now using iterated expectation and bit of rearranging you can show that the function minimizing this critera is the conditonal expectation

g\left(\mathbf{x}\right)=\mathbb{E}\left[y\,\vert\,\mathbf{x}\right]

(see previous post on expectations if one of those terms is unclear). Intuitively this is also a pretty good choice for a predictor function. So armed with the conditional expecation I can best copycat the shaman’s predictions for his previous clients. Furthermore, this will let me make good predictions for future customers of my own. I’ll just roll the bones (oh yeah, I also cloned his set of chicken bones), and plug that data in into the conditional expectation to divine a future luck.

But I’m not in fat city yet, because practically speaking the best predictor \mathbb{E}\left[y\,\vert\,\mathbf{x}\right] kind of sucks. It is the best function for the task for sure, but its not a nice and neat formula I can just plug and play with. Its messy, its an integral involving probabilities that I don’t have access too. So I have no choice but to refine the goal: because what I really need is a formula, not just a function, that best mimics the shaman’s mojo. And the simplest formula involving the data is, of course, a linear one.

So making another necessary simplfiying assumption, lets suppose (and this is the standard way of going about things) that the conditional expectation takes a linear form i.e. that

\mathbb{E}\left[y\,\vert\,\mathbf{x}\right]\approx\mathbf{\beta}^{T}\mathbf{x}+\beta_{0}

To make the notation simpler lets absorb the constant into \mathbf{x}, e.g. set \mathbf{x}=\left[1\,\vert\,\mathbf{x}\right] and rewrite this as \mathbb{E}\left[y\,\vert\,\mathbf{x}\right]\approx\mathbf{\beta}^{T}\mathbf{x} (now \mathbf{x} is a d+1 dimensional vector, instead of ddimensions like before).

The orignal problem to recover the divination function now reduces to solving for the unkown coefficients via

\underset{\beta}{argmin}\underset{i=1}{\overset{N}{\sum}}\left(y_{i}-\beta^{T}\mathbf{x}_{i}\right)^{2}

This problem, of minimizing MSE, is referred to as regression. Note that I’ve also removed that \frac{1}{N}  from the original form since it doesn’t change the solution at all.

Minimizing MSE, Calculus free

Instead of jumping right in with Calculus, which I invite you to do, I want to just use some geometric intuition to see if we can’t figure out what the best \beta should be.

Suppose for the moment that the input data is just one dimensional, then we can write down the MSE above as

\underset{i=1}{\overset{N}{\sum}}\left(y_{i}-\beta x_{i}\right)^{2}=\left(\mathbf{y}-\beta\mathbf{x}\right)^{T}\left(\mathbf{y}-\beta\mathbf{x}\right)=\Vert\mathbf{y}-\beta\mathbf{x}\Vert^{2}

where individual inputs/outputs have been stacked into vectors \mathbf{x}=\left[x_{1}...x_{N}\right] and \mathbf{y}=\left[y_{1}...y_{N}\right]. Now minimizing the MSE means solving

\underset{\beta}{argmin}\Vert\mathbf{y}-\beta\mathbf{x}\Vert^{2}

which is precisely the definition of the projection of the vector \mathbf{y} onto \mathbf{x} (denoted proj_{\mathbf{y}}\mathbf{x} in the pic below)! So the optimal coeffecint is given by \beta=\frac{\mathbf{x}^{T}\mathbf{y}}{\mathbf{x^{T}}\mathbf{x}}.

Notice here that, like the theme of the previous post about correaltion and inner products, by stacking the data into vectors and re-analyzing the scene we have quickly deduced a clear and simple geometric solution to our original problem.

Ok, back to our main scenario where we had d+1 dimensional imputs \mathbf{x}_{1}...\mathbf{x}_{N} – we can think by analogy to what we’ve just seen. Denoting by X the N\times\left(d+1\right) matrix whose i^{th} row is \mathbf{x}_{i}, we should expect that by minimizing the original MSE we’ll uncover the projection of \mathbf{y} onto the column space of X (which we’ll assume has full rank). And this is precisely what we get (by calculus too). Rewriting the original MSE we can see that we are again solving for a projection

\underset{i=1}{\overset{N}{\sum}}\left(y_{i}-\beta^{T}\mathbf{x}_{i}\right)^{2}=\left(\mathbf{y}-\beta^{T}X\right)^{T}\left(\mathbf{y}-\beta^{T}X\right)=\Vert\mathbf{y}-\beta^{T}X\Vert^{2}

which has the coefficient vector  \beta=\left(X^{T}X\right)^{-1}X^{T}\mathbf{y}

Grifting the shaman

At last, a formula that emulates the shaman’s prediction function – which is a linear approximation to the conditional expecation. Along with the deduced coefficients to this linear model, based on the data pilfered from the shaman’s previous N  cusotmers, I could safely open a copycat buisness. When a customer comes calling I’d roll the bones, take a snapshot of the results and have the constellation data automatically extracted from the image and fed into the formula. And then, voila, out pops the schmo’s destiny for the next week. Now that would be easy money.