
# Linear Regression with Multiple Variables

Week two of Coursera’s Machine Learning class covers linear regression with multiple variables. With single variable regression, we can predict house prices based on an individual feature (or attribute), such as number of bathrooms. Multiple variable regression lets us predict house prices based on a combination of many features, such as floor area, number of bedrooms, number of bathrooms, size of the lot, and more. Implementing multiple variable linear regression in Matlab is this week’s homework assignment.
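Written out explicitly, the example data looks like this (the matrices below are reconstructed from the descriptions that follow, so treat them as an illustration):

$$X = \begin{bmatrix} 1 & 3 & 2 \\ 1 & 2 & 1 \\ 1 & 1 & 1 \end{bmatrix}, \qquad y = \begin{bmatrix} 140 \\ 100 \\ 75 \end{bmatrix}, \qquad \Theta = \begin{bmatrix} 10 \\ 20 \\ 10 \end{bmatrix}$$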

In this example, each row of $X$ is one training example. The first column of 1s is added to the data so that the model can include a constant (intercept) term. The second column lists the number of bedrooms, and the third column gives the number of bathrooms.

$y$ contains the training data’s outputs. In this example, these are house prices in \$1000s. We can interpret $X$ and $y$ as saying that our first house has 3 bedrooms and 2 bathrooms and sold for \$140,000; our second house has 2 bedrooms and 1 bathroom and sold for \$100,000; and our third house has 1 bedroom and 1 bathroom and sold for \$75,000.

$\Theta$ gives the parameters for our linear model. Each row can be viewed as a weight attached to a particular feature category. The second row, with its value of 20, can be thought of as the weight we give to a house’s number of bedrooms. 10, in the last row, can be viewed as the weight for the number of bathrooms in a house. The first row, 10, is always multiplied by the constant 1, so it gives us a starting point for our predictions: in our model, even a very poor house with 0 bedrooms and 0 bathrooms would have a price greater than 0.

To predict a house’s price, we multiply each number in $\Theta$ by the corresponding feature of our house. If our hypothetical house has 3 bedrooms and 3 bathrooms, we would calculate $10 \times 1 + 20 \times 3 + 10 \times 3 = 100$. Our predicted price of the house is \$100,000.

Another way to represent the same calculation, using matrix multiplication, is as follows:

$$\Theta^T x = \begin{bmatrix} 10 & 20 & 10 \end{bmatrix} \begin{bmatrix} 1 \\ 3 \\ 3 \end{bmatrix} = 100$$

The current values of $\Theta$ don’t give us a very accurate model for predicting house prices, but they give us a starting point to improve upon.

Gradient descent is an algorithm to iteratively improve our parameters $\Theta$. At each iteration we perform the following:

Update all $\Theta_j$ simultaneously:

$\Theta_j := \Theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\Theta(x^{(i)}) - y^{(i)})x_j^{(i)}$

Where $h_\Theta(x) = \Theta^T x$, $m$ is the number of training examples, and $\alpha$ is the learning rate, a small positive number.

($\Theta_j$ means the $j$th component of $\Theta$; in our example, $\Theta_1 = 20$. $x^{(i)}$ means the $i$th training example, so $x^{(1)}$ is our house with 3 bedrooms and 2 bathrooms. Somewhat confusingly, in the conventions of the course, indices for components start at 0 while indices for training examples start at 1.)

This rule is fairly simple to implement using a pair of nested loops. Andrew Ng, in the lectures for week two, emphasizes that Matlab (and Octave, an open source alternative) is highly optimized to perform matrix and vector calculations. For that reason, he recommends vectorizing the above updates; that is, instead of using loops, represent the above update rule as a single expression by using matrix operations.
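As a sketch of the loop-based approach (written in Python rather than Matlab, and using the example data from above; variable names are illustrative):

```python
# One gradient descent iteration with explicit loops over parameters (j)
# and training examples (i). Data matches the worked example in the text.
X = [[1, 3, 2],          # each row: [1, bedrooms, bathrooms]
     [1, 2, 1],
     [1, 1, 1]]
y = [140, 100, 75]       # prices in $1000s
theta = [10, 20, 10]     # initial parameters
alpha = 0.2              # learning rate
m = len(X)               # number of training examples

# Compute every gradient first so all theta_j are updated simultaneously.
new_theta = []
for j in range(len(theta)):
    grad = 0.0
    for i in range(m):
        h = sum(theta[k] * X[i][k] for k in range(len(theta)))  # h_theta(x^(i))
        grad += (h - y[i]) * X[i][j]
    new_theta.append(theta[j] - alpha / m * grad)
theta = new_theta
print(theta)  # ≈ [18.33, 37.67, 21.67] after one iteration
```

Note that the updates are collected in `new_theta` and applied at the end: reusing an already-updated $\Theta_0$ while computing $\Theta_1$ would change the result.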

The homework assignment includes a code framework that imports the training data. It follows the convention shown in the example above: each row vector is a training example, and each column corresponds to a feature. In the lectures, however, the gradient descent updates are written as operations on training examples that are column vectors. Reconciling the difference is easy for a non-vectorized update, but it took some trial and error to condense the entire operation into a single line of code. Stepping through a numeric example helps make clear what the matrix multiplications and transpositions are accomplishing.

Let’s examine a single iteration of the gradient descent algorithm and see how the updates are calculated for each component of $\Theta$. We will use a learning rate of 0.2.

First, we need to find out how much our current prediction differs from the training output. This is the $h_\Theta(x^{(i)}) - y^{(i)}$ term in the update rule. (Note that the training examples below are the transposes of the rows of our matrix $X$.)

Training Example 1

$$h_\Theta(x^{(1)}) - y^{(1)} = (10 \times 1 + 20 \times 3 + 10 \times 2) - 140 = 90 - 140 = -50$$

Training Example 2

$$h_\Theta(x^{(2)}) - y^{(2)} = (10 \times 1 + 20 \times 2 + 10 \times 1) - 100 = 60 - 100 = -40$$

Training Example 3

$$h_\Theta(x^{(3)}) - y^{(3)} = (10 \times 1 + 20 \times 1 + 10 \times 1) - 75 = 40 - 75 = -35$$

We plug these values into the update equation:

$$\Theta_0 := 10 - 0.2 \times \frac{1}{3}\left((-50)(1) + (-40)(1) + (-35)(1)\right) = 10 + \frac{25}{3} \approx 18.33$$

$$\Theta_1 := 20 - 0.2 \times \frac{1}{3}\left((-50)(3) + (-40)(2) + (-35)(1)\right) = 20 + \frac{53}{3} \approx 37.67$$

$$\Theta_2 := 10 - 0.2 \times \frac{1}{3}\left((-50)(2) + (-40)(1) + (-35)(1)\right) = 10 + \frac{35}{3} \approx 21.67$$

After one iteration, we have updated $\Theta$ to:

$$\Theta \approx \begin{bmatrix} 18.33 \\ 37.67 \\ 21.67 \end{bmatrix}$$

These parameters are not perfect, but if you calculate new predictions for the training examples, you will find the predicted house prices have moved closer to their target outputs.
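To see this, recompute the predictions with the updated parameters $\Theta \approx (18.33, 37.67, 21.67)$:

$$h_\Theta(x^{(1)}) \approx 18.33 + 37.67 \times 3 + 21.67 \times 2 \approx 174.67 \quad (\text{was } 90, \text{ target } 140)$$

$$h_\Theta(x^{(2)}) \approx 18.33 + 37.67 \times 2 + 21.67 \times 1 \approx 115.33 \quad (\text{was } 60, \text{ target } 100)$$

$$h_\Theta(x^{(3)}) \approx 18.33 + 37.67 \times 1 + 21.67 \times 1 \approx 77.67 \quad (\text{was } 40, \text{ target } 75)$$

Each prediction’s absolute error has shrunk: from 50, 40, and 35 down to about 34.67, 15.33, and 2.67.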

The above sequence of updates can be represented by the following single equation:

$\Theta := \Theta - \alpha \frac{1}{m} \times ((\Theta^TX^T - y^T) \times X)^T$

Plugging the same numbers as before into this equation:

$$\Theta := \begin{bmatrix} 10 \\ 20 \\ 10 \end{bmatrix} - 0.2 \times \frac{1}{3} \left( \left( \begin{bmatrix} 10 & 20 & 10 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \\ 3 & 2 & 1 \\ 2 & 1 & 1 \end{bmatrix} - \begin{bmatrix} 140 & 100 & 75 \end{bmatrix} \right) \begin{bmatrix} 1 & 3 & 2 \\ 1 & 2 & 1 \\ 1 & 1 & 1 \end{bmatrix} \right)^T$$

$$= \begin{bmatrix} 10 \\ 20 \\ 10 \end{bmatrix} - \frac{0.2}{3} \begin{bmatrix} -125 \\ -265 \\ -175 \end{bmatrix} \approx \begin{bmatrix} 18.33 \\ 37.67 \\ 21.67 \end{bmatrix}$$

We get exactly the same result, with one equation that is a single line of code in Matlab. This is enough to perform gradient descent on data sets with one variable or ten thousand variables.
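In Matlab that equation becomes something like the one-liner `theta = theta - (alpha/m) * ((theta' * X' - y') * X)';`. As a cross-check, here is the same vectorized step sketched in pure Python, with list comprehensions standing in for the matrix products:

```python
# One vectorized gradient descent step: Theta := Theta - alpha/m * ((Theta'X' - y')X)'
# (a Python sketch; in the assignment this is a single line of Matlab/Octave).
X = [[1, 3, 2], [1, 2, 1], [1, 1, 1]]   # rows are training examples
y = [140, 100, 75]
theta = [10, 20, 10]
alpha, m = 0.2, len(X)

# Theta^T X^T - y^T: a row vector of prediction errors, one per example.
errors = [sum(t * xj for t, xj in zip(theta, row)) - yi
          for row, yi in zip(X, y)]
# (errors * X)^T: multiply the error row vector by X, then transpose,
# giving one gradient component per parameter.
gradient = [sum(e * row[j] for e, row in zip(errors, X))
            for j in range(len(theta))]
theta = [t - alpha / m * g for t, g in zip(theta, gradient)]
print(theta)  # ≈ [18.33, 37.67, 21.67], matching the loop-based version
```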