I've had a couple of emails recently about the excellent Stanford Machine Learning and AI online classes, so I thought I'd put up the odd post or two on some of the techniques they cover, and what they might look like in PHP.
The second lecture of the ML class jumps into a simple, but often powerful, technique: linear regression - or fitting a line to data (don't worry if you haven't watched it, this post hopefully makes sense on it's own). There are a lot of problems that fall under predicting these types of continuous values based on limited inputs - for example: given the air pressure, how much rain will there be, given the qualifying times, how quick will the fastest lap be in the race. By taking a bunch of existing data and fitting a line, we will be able to make a prediction easily - and often reasonably correctly.
We can define a line with 2 variables, or parameters: the intercept (where it crosses the axis) and the gradient (how much is moves in one dimension for a move of one in the other dimension). Because we're going to want to predict variables with that line, we'll write a function that defines it, which, in keeping with tradition, we'll refer to as our hypothesis function:
But which line is the best fit? Intuitively, it is the one closest to the data points we have - which we can measure by taking the square of the different between each predicted value and actual value - we square because then we get positive numbers no matter which way the difference goes:
So, our job is to find the line which minimises the squared error - this type of function is referred to the cost function. Lets take our functions above and take a first stab at fitting our line. We'll write a very simple algorithm which just tests moving each parameter up and down a little bit at a time, and sees if the score gets lower or higher. If it gets lower, we'll take those parameters and restart the process, until we don't see any improvement:
So for this data, we get the line y = 1 + 4x. If we plot our data and our line on a graph, we can see it's a pretty good fit:
This algorithm, while straightforward, has several problems. We're moving in steps of 0.25, so the best answer could be not in this range, and we don't cache any of our checks, so we'll always end up evaluating our previous position as a possible move, which is a waste of time. It's also kind of slow - surely we can do something a bit better?
Of course, we can, and what we can do is use the gradient descent algorithm. This relies on the fact that we are effectively describing a curve with our cost function (or more accurately a curved surface) and we can see how sloped the surface is at any point by taking the derivative, and use the direction of the slope to guide our improvement. In order to use that, we need a bit of calculus to take the partial derivatives for the two parameters, which makes our score function look like this:
With this new function, we're going to get a positive or
Truncated by Planet PHP, read more at the original (another 1573 bytes)