If you’ve never heard of Coursera, it’s a company that partners with universities to offer free online courses. The Machine Learning course is taught by Professor Andrew Ng from Stanford. It’s a well-known course, at least in online/programming circles. It runs for ten weeks, and we are entering the second week of lectures and assignments.
Intro to Machine Learning
The first thing covered in week one’s lectures is what exactly machine learning is. Ng acknowledges that there are many different definitions, but the first one he offers is this: “Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed.”
The first week’s lectures include example applications of machine learning. One involves a program which takes as input two microphone feeds (seen here at 0:38). Two people are in the room. One is counting aloud to ten in English, the other in Spanish. The machine learning algorithm is able to separate the overlapping English and Spanish into two separate audio tracks. It analyzes the sound data, looks for patterns, recognizes that one voice is distinct from the other, and separates the sounds accordingly.
After the introductory videos comes the main topic for the week, Single Variable Linear Regression. This is the same linear regression used in introductory Statistics courses. The idea is to take a set of data consisting of (x, y) pairs and find a “best fit” line that passes as closely as possible through all of the data points. Wikipedia has a good description of simple linear regression.
Single variable linear regression is easily solved in closed form. But Ng takes the opportunity to introduce gradient descent, which is an optimization algorithm used widely in machine learning and other fields.
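For reference, the closed-form solution is just slope = cov(x, y) / var(x) and intercept = mean(y) − slope · mean(x). Here is a minimal sketch in plain Python (the course itself uses Octave/Matlab):

```python
# Closed-form solution for single-variable linear regression:
#   slope = cov(x, y) / var(x)
#   intercept = mean(y) - slope * mean(x)
def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    slope = cov / var
    intercept = my - slope * mx
    return slope, intercept

# Points that lie exactly on y = 2x - 4 recover slope 2, intercept -4:
print(fit_line([0, 1, 2, 3], [-4, -2, 0, 2]))
```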
Gradient descent takes a function (the cost function for linear regression, in this case) and starts at a random point (the proposed values for the slope and intercept). At every iteration, it uses the gradient of the function to adjust its current position in the direction of greatest decrease. Because the cost function for linear regression is convex, as the algorithm steps in the downward direction it will eventually reach the global minimum. This point corresponds to the slope and intercept of the best fit line through the data.
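The procedure above can be sketched in a few lines of Python. This is my own illustrative version, not the course's Octave code; the learning rate and iteration count are arbitrary choices:

```python
# Gradient descent for single-variable linear regression: start from an
# initial (slope, intercept) guess and repeatedly step opposite the
# gradient of the mean-squared-error cost J = (1/2m) * sum(err^2).
def gradient_descent(xs, ys, alpha=0.01, iters=5000):
    m = len(xs)
    slope, intercept = 0.0, 0.0  # starting point; any guess works for a convex cost
    for _ in range(iters):
        # error of each prediction at the current (slope, intercept)
        errs = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # partial derivatives of the cost with respect to each parameter
        d_slope = sum(e * x for e, x in zip(errs, xs)) / m
        d_intercept = sum(errs) / m
        # step in the direction of greatest decrease
        slope -= alpha * d_slope
        intercept -= alpha * d_intercept
    return slope, intercept

print(gradient_descent([0, 1, 2, 3], [-4, -2, 0, 2]))  # approaches (2, -4)
```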
The online course includes homework assignments, which are programming projects using Octave or Matlab. To get a little more experience with Matlab, I wrote a simple script to produce a vector x (the input values) of random data points and a vector y, where each element of y = 2 * x - 4. Then I used the Matlab function awgn() to add noise to the data. This gave me a set of 50 data points:
The script then runs gradient descent to perform linear regression. After 1500 iterations, it produces the best fit line with a slope of 2.054 and intercept of -3.648. For comparison, Matlab’s polyfit() function returns the regression line having a slope of 2.055 and an intercept of -3.653. The line produced by gradient descent is plotted against the data here:
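For anyone without Matlab, here is a rough Python translation of the same experiment: generate 50 points on y = 2x - 4, add Gaussian noise (standing in for awgn()), then run 1500 iterations of gradient descent. The noise level, seed, and learning rate are my own choices, so the fitted numbers will differ from those above:

```python
import random

random.seed(1)

# 50 noisy points roughly on the line y = 2x - 4
xs = [random.uniform(0, 10) for _ in range(50)]
ys = [2 * x - 4 + random.gauss(0, 1) for x in xs]

# gradient descent on the mean-squared-error cost, 1500 iterations
slope, intercept = 0.0, 0.0
alpha, iters, m = 0.01, 1500, len(xs)
for _ in range(iters):
    errs = [slope * x + intercept - y for x, y in zip(xs, ys)]
    slope -= alpha * sum(e * x for e, x in zip(errs, xs)) / m
    intercept -= alpha * sum(errs) / m

print(slope, intercept)  # near 2 and -4, but not exact because of the noise
```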
Real World Uses (Sort of)
What I’ve learned so far has come in handy already. I was able to give a decent answer to someone’s question on Stack Overflow: What is the difference between linear regression and logistic regression?
The Stack Overflow question led me to finally investigate why we use the word regression to describe finding the best-fit line. I found this post that describes it pretty well (and it mixes in some baseball, too).