Fortune telling: it’s still kind of a thing people are into these days. Shamans, psychics, priests, palm readers, magic 8 balls; in almost every patch of the globe you’ll find someone working the trade. Each type of diviner has his own distinct set of data he employs to predict a poor schmo’s future: tarot cards, the orientation of the stars, hell, some even use chicken bones. Typically what the guy with the chicken bones will do (according to my expert knowledge acquired through various sci-fi TV shows) is take a handful of ivories and toss them on the ground. Then, using some understanding of the chicken bone constellation, he makes a prediction about how lucky the person will be, say in the next week.
I’m not a believer in this sort of thing, but let’s suppose for the sake of this post that chicken-bone guy’s shtick really does work. That is, just using the constellation of the bones for a particular customer he divines their future luck with some reasonable accuracy. If it was seriously legit, then I’d definitely want in on a piece of the action; I mean, it seems like a pretty sweet gig. You get respect from your community, you have a reasonable cash flow situation going on, and the hours seem decent. But I have a feeling that chicken-bone dudes don’t just go around telling everyone how their mojo works. So if I want to learn the tricks of the trade myself I’ll have to do it covertly.
But what exactly do I need to extract from one of these guys? Well, first off I’d need data: in particular, for a bunch of a shaman’s future customers I need to collect input data, i.e. something about the bone constellations, as well as output data, which is some measure of how lucky each person was over the week following their divination. After finding my mark shaman, here’s how I’d collect that stuff:
- Install a hidden camera in chicken-bone guy’s office, then take snapshots of the bones rolled for each future customer.
- Install hidden cameras/wire tap chicken-bone guy’s future customers, then I can make a good guess about how lucky each person was afterward by studying the tapes.
Now let’s say I’ve done all that (it’s cool, I’ve seen plenty of Bond/Bourne movies) and I have all the info on a bunch of one shaman’s customers. Just having the data won’t allow me to emulate this dude; I still need to figure out how he uses the placement of the bones to predict the luck of each schmo. More precisely, I need to know the shaman’s divination function. How to go about doing that? Well, luckily several tools from Statistics and Machine Learning were designed precisely to answer this sort of question.
Say the shaman uses $P$ distinct bones for the divination process for all of his clients, and that I have observed (through my covert ops) $N$ schmos’ divinations. To make the manipulations ahead a bit easier to digest let’s assume from here on in that the bones are 2-D, like in this pic.

What information do you think the shaman is using from this image of the tossed bones? I think it makes sense that he wouldn’t care where the collection of bones as a whole landed (i.e. on what part of the table/floor he tosses them), but only about the orientation of each bone along with the relative distances between all the bones. That means that in the analysis of the spy pics I took for each client, what needs to be recorded for every bone is
- its angle of rotation (with respect to an arbitrary fixed position for that bone) and
- its distances to the other bones.
The measurements we would take for the first bone, for example, are illustrated below

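Note then that with $P$ bones the number of features extracted from each customer’s bone image is $P+\frac{P\left(P-1\right)}{2}=\frac{P\left(P+1\right)}{2}$: we take $P$ angles (one per bone), along with $\frac{P\left(P-1\right)}{2}$ distances all together between the bones (making sure we don’t double count, e.g. the distance from bone 1 to bone 2 is the same as the distance from bone 2 to bone 1).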
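To make the feature count concrete, here is a minimal sketch of the extraction step in Python with NumPy. The function name `bone_features` and the input format (an array of bone center coordinates plus an array of angles) are assumptions made for illustration; the post does not say how the snapshots are turned into numbers.

```python
import numpy as np


def bone_features(centers, angles):
    """Build the feature vector for one toss of P two-dimensional bones.

    centers: (P, 2) array of bone center coordinates (hypothetical input format)
    angles:  (P,) array of rotation angles in radians
    Returns the P angles followed by the P*(P-1)/2 distinct pairwise distances.
    """
    centers = np.asarray(centers, dtype=float)
    angles = np.asarray(angles, dtype=float)
    num_bones = len(angles)
    # Index pairs (i, j) with i < j, so each pair of bones is counted once.
    i_idx, j_idx = np.triu_indices(num_bones, k=1)
    distances = np.linalg.norm(centers[i_idx] - centers[j_idx], axis=1)
    return np.concatenate([angles, distances])


# Four made-up bones sitting on the corners of a 3-by-4 rectangle.
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
angles = np.array([0.1, 0.5, 1.2, 2.0])
features = bone_features(centers, angles)
print(features.shape)  # (10,), i.e. 4 angles + 6 distances
```

Because only distances between bones are recorded, sliding the whole constellation across the table leaves the features unchanged, which matches the idea that the shaman doesn't care where the bones landed as a group.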
For $i=1,\ldots,N$ let’s denote by $\mathbf{x}_{i}$ the vector of bone data and by $y_{i}$ the corresponding luck value for the $i^{th}$ customer (say the more positive the value the luckier the person, the more negative the value the unluckier). Further, although we can reasonably assume our sample inputs/outputs are instances of a random vector $\mathbf{x}$ (for sure the bones data is random for each client) and a random variable $y$ (possible futures for the schmos range from the good to the not so good), we don’t know their joint distribution. Remember that I got all the data via hidden cameras and wire taps; I just observed what happened for each person. So we have no other recourse than to assume that $p\left(\mathbf{x}_{i},y_{i}\right)=\frac{1}{N}$, i.e. that every pair $\left(\mathbf{x}_{i},y_{i}\right)$ is equally likely to occur. Now, it’s natural to try to identify the shaman’s divination function $g$, which will give the best predicted output $g\left(\mathbf{x}\right)$ for a given input, by solving for the function which minimizes the Mean Squared Error (MSE) between $g\left(\mathbf{x}\right)$ and $y$, à la
$$\underset{g}{\operatorname{argmin}}\,\mathbb{E}_{\left(g\left(\mathbf{x}\right),y\right)}\left[\left(y-g\left(\mathbf{x}\right)\right)^{2}\right]=\underset{g}{\operatorname{argmin}}\,\frac{1}{N}\sum_{i=1}^{N}\left(y_{i}-g\left(\mathbf{x}_{i}\right)\right)^{2}$$
Now using iterated expectation and a bit of rearranging you can show that the function minimizing this criterion is the conditional expectation
$$g\left(\mathbf{x}\right)=\mathbb{E}\left[y\,\vert\,\mathbf{x}\right]$$
(see previous post on expectations if one of those terms is unclear). Intuitively this is also a pretty good choice for a predictor function. So armed with the conditional expectation I can best copycat the shaman’s predictions for his previous clients. Furthermore, this will let me make good predictions for future customers of my own. I’ll just roll the bones (oh yeah, I also cloned his set of chicken bones), and plug that data into the conditional expectation to divine a future luck.
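If you want the "iterated expectation and rearranging" step spelled out, here is one standard way to see it. This is a sketch, and the shorthand $m\left(\mathbf{x}\right)$ is mine rather than notation used elsewhere in the post. Write $m\left(\mathbf{x}\right)=\mathbb{E}\left[y\,\vert\,\mathbf{x}\right]$, then add and subtract it inside the square:

$$
\begin{aligned}
\mathbb{E}\left[\left(y-g\left(\mathbf{x}\right)\right)^{2}\right] &= \mathbb{E}\left[\left(y-m\left(\mathbf{x}\right)+m\left(\mathbf{x}\right)-g\left(\mathbf{x}\right)\right)^{2}\right] \\
&= \mathbb{E}\left[\left(y-m\left(\mathbf{x}\right)\right)^{2}\right]+2\,\mathbb{E}\left[\left(y-m\left(\mathbf{x}\right)\right)\left(m\left(\mathbf{x}\right)-g\left(\mathbf{x}\right)\right)\right]+\mathbb{E}\left[\left(m\left(\mathbf{x}\right)-g\left(\mathbf{x}\right)\right)^{2}\right] \\
\mathbb{E}\left[\left(y-m\left(\mathbf{x}\right)\right)\left(m\left(\mathbf{x}\right)-g\left(\mathbf{x}\right)\right)\right] &= \mathbb{E}\left[\left(m\left(\mathbf{x}\right)-g\left(\mathbf{x}\right)\right)\,\mathbb{E}\left[y-m\left(\mathbf{x}\right)\,\vert\,\mathbf{x}\right]\right]=0 \\
\mathbb{E}\left[\left(y-g\left(\mathbf{x}\right)\right)^{2}\right] &= \mathbb{E}\left[\left(y-m\left(\mathbf{x}\right)\right)^{2}\right]+\mathbb{E}\left[\left(m\left(\mathbf{x}\right)-g\left(\mathbf{x}\right)\right)^{2}\right]
\end{aligned}
$$

The third line is where iterated expectation comes in: conditioning on $\mathbf{x}$ first kills the cross term, since $\mathbb{E}\left[y\,\vert\,\mathbf{x}\right]-m\left(\mathbf{x}\right)=0$. In the last line the first term doesn't involve $g$ and the second is never negative, so the MSE is smallest exactly when $g\left(\mathbf{x}\right)=m\left(\mathbf{x}\right)$.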
But I’m not in fat city yet, because practically speaking the best predictor $\mathbb{E}\left[y\,\vert\,\mathbf{x}\right]$ kind of sucks. It is the best function for the task for sure, but it’s not a nice and neat formula I can just plug and play with. It’s messy: it’s an integral involving probabilities that I don’t have access to. So I have no choice but to refine the goal, because what I really need is a formula, not just a function, that best mimics the shaman’s mojo. And the simplest formula involving the data is, of course, a linear one.
So making another necessary simplifying assumption, let’s suppose (and this is the standard way of going about things) that the conditional expectation takes a linear form, i.e. that
$$\mathbb{E}\left[y\,\vert\,\mathbf{x}\right]\approx\mathbf{\beta}^{T}\mathbf{x}+\beta_{0}$$
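To make the notation simpler let’s absorb the constant into $\mathbf{\beta}$, e.g. set $\mathbf{x}\leftarrow\left[1,\,\mathbf{x}^{T}\right]^{T}$ and $\mathbf{\beta}\leftarrow\left[\beta_{0},\,\mathbf{\beta}^{T}\right]^{T}$, and rewrite this as $\mathbb{E}\left[y\,\vert\,\mathbf{x}\right]\approx\mathbf{\beta}^{T}\mathbf{x}$ (now $\mathbf{\beta}$ is a $\frac{P\left(P+1\right)}{2}+1$ dimensional vector for our $P$ bones, instead of $\frac{P\left(P+1\right)}{2}$ dimensions like before).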
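The original problem of recovering the divination function now reduces to solving for the unknown coefficients via
$$\underset{\mathbf{\beta}}{\operatorname{argmin}}\,\sum_{i=1}^{N}\left(y_{i}-\mathbf{\beta}^{T}\mathbf{x}_{i}\right)^{2}$$
This problem, of minimizing MSE, is referred to as regression. Note that I’ve also removed that $\frac{1}{N}$ from the original form since it doesn’t change the solution at all.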
Minimizing MSE, Calculus free
Instead of jumping right in with Calculus, which I invite you to do, I want to just use some geometric intuition to see if we can’t figure out what the best $\mathbf{\beta}$ should be.
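Suppose for the moment that the input data is just one dimensional (with no constant term), then we can write down the MSE above as
$$\sum_{i=1}^{N}\left(y_{i}-\beta x_{i}\right)^{2}=\left\Vert \mathbf{y}-\beta\mathbf{x}\right\Vert ^{2}$$
where individual inputs/outputs have been stacked into vectors $\mathbf{x}=\left[x_{1},\ldots,x_{N}\right]^{T}$ and $\mathbf{y}=\left[y_{1},\ldots,y_{N}\right]^{T}$. Now minimizing the MSE means solving
$$\underset{\beta}{\operatorname{argmin}}\,\left\Vert \mathbf{y}-\beta\mathbf{x}\right\Vert ^{2}$$
which is precisely the definition of the projection of the vector $\mathbf{y}$ onto $\mathbf{x}$ (shown in the pic below)! So the optimal coefficient is given by $\beta^{\star}=\frac{\mathbf{x}^{T}\mathbf{y}}{\mathbf{x}^{T}\mathbf{x}}$.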

Notice here that, like the theme of the previous post about correlation and inner products, by stacking the data into vectors and re-analyzing the scene we have quickly deduced a clear and simple geometric solution to our original problem.
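As a quick sanity check of the one-dimensional picture, here is a small sketch with synthetic data (the slope of 2 and the noise level are made up for illustration). It computes the projection coefficient as the ratio of inner products, confirms that what's left over is perpendicular to the input vector, and compares against NumPy's least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)                       # stacked inputs
y = 2.0 * x + rng.normal(scale=0.1, size=50)  # stacked outputs (made-up slope)

# Projection coefficient: <x, y> / <x, x>.
beta = (x @ y) / (x @ x)
projection = beta * x
residual = y - projection


def squared_error(b):
    """Sum of squared errors for a candidate coefficient b."""
    return np.sum((y - b * x) ** 2)


# Geometric signature of a projection: the residual is orthogonal to x.
print(abs(residual @ x) < 1e-9)

# Nudging beta in either direction can only increase the squared error.
print(squared_error(beta) <= min(squared_error(beta - 0.01),
                                 squared_error(beta + 0.01)))

# NumPy's least-squares solver lands on the same coefficient.
beta_lstsq = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]
print(np.isclose(beta, beta_lstsq))
```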
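Ok, back to our main scenario where we had $\frac{P\left(P+1\right)}{2}+1$ dimensional inputs $\mathbf{x}_{i}$ (counting the absorbed constant): we can think by analogy to what we’ve just seen. Denoting by $\mathbf{X}$ the $N\times\left(\frac{P\left(P+1\right)}{2}+1\right)$ matrix whose $i^{th}$ row is $\mathbf{x}_{i}^{T}$, we should expect that by minimizing the original MSE we’ll uncover the projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$ (which we’ll assume has full column rank). And this is precisely what we get (by calculus too). Rewriting the original MSE we can see that we are again solving for a projection
$$\underset{\mathbf{\beta}}{\operatorname{argmin}}\,\sum_{i=1}^{N}\left(y_{i}-\mathbf{\beta}^{T}\mathbf{x}_{i}\right)^{2}=\underset{\mathbf{\beta}}{\operatorname{argmin}}\,\left\Vert \mathbf{y}-\mathbf{X}\mathbf{\beta}\right\Vert ^{2}$$
which has the coefficient vector
$$\mathbf{\beta}^{\star}=\left(\mathbf{X}^{T}\mathbf{X}\right)^{-1}\mathbf{X}^{T}\mathbf{y}$$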
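Here is a minimal sketch of the multi-dimensional solution on synthetic data. The sizes, the random inputs and the "true" coefficients are all made up for illustration. It prepends a 1 to each input to absorb the constant, solves the normal equations for the coefficient vector, and checks the projection property: the residual should be orthogonal to every column of the data matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
num_customers, num_features = 200, 10   # made-up sizes (10 = 4 bones' worth)
raw_inputs = rng.normal(size=(num_customers, num_features))
true_beta = rng.normal(size=num_features + 1)  # made-up coefficients

# Absorb the constant: prepend a 1 to every input row.
X = np.hstack([np.ones((num_customers, 1)), raw_inputs])
y = X @ true_beta + rng.normal(scale=0.05, size=num_customers)

# Solve (X^T X) beta = X^T y directly rather than forming the inverse,
# which is the usual numerically safer route to the same answer.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Projection check: the residual is orthogonal to the column space of X.
residual = y - X @ beta_hat
print(np.allclose(X.T @ residual, 0.0, atol=1e-8))

# Cross-check against NumPy's least-squares solver.
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta_hat, beta_lstsq))
```

Solving the linear system and multiplying by the explicit inverse give the same coefficients when the matrix has full column rank; the solve is just the more common choice in practice.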
Grifting the shaman
At last, a formula that emulates the shaman’s prediction function, which is a linear approximation to the conditional expectation. Along with the deduced coefficients of this linear model, based on the data pilfered from the shaman’s previous N customers, I could safely open a copycat business. When a customer comes calling I’d roll the bones, take a snapshot of the results and have the constellation data automatically extracted from the image and fed into the formula. And then, voila, out pops the schmo’s destiny for the next week. Now that would be easy money.
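To tie the pieces together, here is a rough end-to-end sketch of the copycat business on synthetic data. Everything about the data is invented: the `toss` function stands in for the hidden-camera snapshots, and the "secret" coefficients play the role of a shaman who happens to be exactly linear. Real spy data would of course be messier.

```python
import numpy as np


def bone_features(centers, angles):
    """P angles plus the P*(P-1)/2 distinct pairwise distances."""
    i_idx, j_idx = np.triu_indices(len(angles), k=1)
    distances = np.linalg.norm(centers[i_idx] - centers[j_idx], axis=1)
    return np.concatenate([angles, distances])


def toss(rng, num_bones):
    """Stand-in for the snapshot step: random bone centers and angles."""
    centers = rng.uniform(0.0, 10.0, size=(num_bones, 2))
    angles = rng.uniform(0.0, 2.0 * np.pi, size=num_bones)
    return centers, angles


rng = np.random.default_rng(2)
num_bones, num_customers = 4, 300
num_features = num_bones * (num_bones + 1) // 2

# Synthetic "spy data": a made-up linear shaman plus a little noise.
secret_beta = rng.normal(size=num_features + 1)
rows = [bone_features(*toss(rng, num_bones)) for _ in range(num_customers)]
X = np.hstack([np.ones((num_customers, 1)), np.array(rows)])
y = X @ secret_beta + rng.normal(scale=0.1, size=num_customers)

# Fit the copycat coefficients by least squares.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# A new customer walks in: roll, extract features, plug into the formula.
new_features = bone_features(*toss(rng, num_bones))
predicted_luck = beta_hat[0] + beta_hat[1:] @ new_features
print(predicted_luck)
```

In this toy setup the prediction lands close to what the made-up shaman would have said, but only because the synthetic shaman really is linear; with a real one, the fit is just the best linear stand-in for whatever he is doing.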