Deriving the Normal Equation
Consider a linear model

$$y = X\theta$$

where

$X$ is an $m \times n$ matrix of real numbers, with $m$ as the number of samples (or rows) and $n$ as the number of features (or columns),
$\theta$ is an $n \times 1$ matrix (also called a vector) of coefficients $\theta_j$, and
$y$ is an $m \times 1$ matrix of target variables, one $y_i$ per $i$th sample.
The normal equation is a matrix equation that allows you to solve for $\theta$ above. By doing so, you can construct a linear regression model that allows you to predict the value of the target variable for any new samples. It is given by

$$\theta = \left(X^TX\right)^{-1}X^Ty.$$
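To make the equation concrete, here is a minimal NumPy sketch that applies it to synthetic data. The shapes and names (`m`, `n`, `X`, `theta_true`, `y`) are illustrative choices of mine, not anything mandated by the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 100, 3                                         # m samples, n features
X = rng.normal(size=(m, n))                           # design matrix, shape (m, n)
theta_true = np.array([[2.0], [-1.0], [0.5]])         # coefficient vector, shape (n, 1)
y = X @ theta_true + 0.01 * rng.normal(size=(m, 1))   # targets, shape (m, 1)

# Normal equation: theta = (X^T X)^{-1} X^T y.
# Solving the linear system is numerically preferable to forming the inverse explicitly.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)  # close to theta_true
```

Note that `np.linalg.solve` solves the system $X^TX\theta = X^Ty$ directly rather than inverting $X^TX$, which is the usual numerical practice.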
The goal of the normal equation is to provide you with $\theta$ so that the cost function

$$J(\theta) = \frac{1}{2}\left(X\theta - y\right)^T\left(X\theta - y\right)$$

is minimized.
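In code, the same cost is just half the squared norm of the residual; this small helper (the name `cost` is mine) is reused in the later sketches:

```python
def cost(theta, X, y):
    """J(theta) = 1/2 * (X @ theta - y)^T (X @ theta - y)."""
    r = X @ theta - y        # residual, shape (m, 1)
    return 0.5 * float(r.T @ r)
```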
You can find derivations of the normal equation on the internet, but most of them include a lot of handwaving and skip seemingly trivial, but actually tedious and crucial steps. I will derive it in this article.
Minimizing Using Vector Calculus
Recall from calculus that to minimize a function, you take its derivative and set it to zero. We will do the same for the cost function, but first we need to borrow some concepts from vector calculus.
The derivative of a function with respect to a vector is called a gradient, and is denoted by the symbol $\nabla$. We subscript the symbol with the variable we're differentiating with respect to. Thus, the derivative with respect to $\theta$ is simply $\nabla_\theta$.
Minimizing the cost function means taking its derivative with respect to $\theta$ and setting the result to zero:

$$\nabla_\theta J(\theta) = 0$$
We will use this equation to derive the normal equation.
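Before working through the algebra, it can be reassuring to check this condition numerically: at the $\theta$ given by the normal equation, a finite-difference estimate of the gradient of $J$ should vanish. The sketch below reuses `X`, `y`, and `cost` from the earlier snippets; `numerical_gradient` is a helper written just for this check:

```python
def numerical_gradient(f, theta, h=1e-6):
    """Central finite-difference estimate of the gradient of f at theta."""
    grad = np.zeros_like(theta)
    for j in range(theta.shape[0]):
        e = np.zeros_like(theta)
        e[j, 0] = h
        grad[j, 0] = (f(theta + e) - f(theta - e)) / (2 * h)
    return grad

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)                  # normal-equation solution
print(numerical_gradient(lambda t: cost(t, X, y), theta_hat))  # ~0 in every component
```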
Some Useful Theorems
In order to proceed with the derivation, we're going to need the help of some theorems. You can find these as well in any proper vector calculus textbook. Let $A$, $B$, and $C$ be matrices, and let $f$ be a function that maps a matrix to a real number. The two theorems we will use are the following:

$$\nabla_{A^T} f(A) = \left(\nabla_A f(A)\right)^T$$

$$\nabla_A \operatorname{tr}\left(ABA^TC\right) = CAB + C^TAB^T$$
where the trace of a matrix is just the sum of its diagonal elements.
If we set the matrix $C$ as the identity matrix in the second theorem above, we simply get

$$\nabla_A \operatorname{tr}\left(ABA^T\right) = AB + AB^T.$$

By setting $f(A) = \operatorname{tr}\left(ABA^T\right)$, we get a function to which the first theorem applies, due to the fact that the trace of a matrix is a function that maps from $\mathbb{R}^{m \times m}$ to $\mathbb{R}$, so $f$ maps the matrix $A$ to a real number. Taking the transpose of the last equation and noting that $\left(AB + AB^T\right)^T = B^TA^T + BA^T$, we get

$$\left(\nabla_A \operatorname{tr}\left(ABA^T\right)\right)^T = B^TA^T + BA^T.$$

But the left-hand side of the equation is just the first theorem: it equals $\nabla_{A^T} \operatorname{tr}\left(ABA^T\right)$. Thus, we have

$$\nabla_{A^T} \operatorname{tr}\left(ABA^T\right) = B^TA^T + BA^T.$$
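If you would rather not take this derived theorem on faith, it is easy to spot-check with finite differences. The sketch below assumes nothing beyond NumPy; the shapes of `A` and `B` are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))       # an arbitrary m x n matrix
B = rng.normal(size=(4, 4))       # an arbitrary n x n matrix (need not be symmetric)

def f(A):
    return np.trace(A @ B @ A.T)  # f(A) = tr(A B A^T), a real number

# Entry (i, j) of grad_{A^T} f is the partial derivative of f with respect to A[j, i].
h = 1e-6
num_grad = np.zeros((4, 3))
for i in range(4):
    for j in range(3):
        E = np.zeros_like(A)
        E[j, i] = h
        num_grad[i, j] = (f(A + E) - f(A - E)) / (2 * h)

analytic = B.T @ A.T + B @ A.T    # the derived theorem
print(np.allclose(num_grad, analytic, atol=1e-5))  # True
```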
We can finally proceed with our derivation.
All Together Now
Let’s take the gradient of $J(\theta)$ with respect to $\theta$:

$$\nabla_\theta J(\theta) = \nabla_\theta\, \frac{1}{2}\left(X\theta - y\right)^T\left(X\theta - y\right)$$

The term on the right side is a real number (a $1 \times 1$ matrix). This indicates that the whole expression being differentiated is some function that maps $\theta$ to a real number. If we use the fact that the trace of a real number is just itself, and that $\operatorname{tr}(A) = \operatorname{tr}\left(A^T\right)$, we get

$$\nabla_\theta J(\theta) = \frac{1}{2}\,\nabla_\theta \operatorname{tr}\left(\theta^TX^TX\theta - \theta^TX^Ty - y^TX\theta + y^Ty\right) = \frac{1}{2}\,\nabla_\theta\left(\operatorname{tr}\left(\theta^TX^TX\theta\right) - 2\operatorname{tr}\left(y^TX\theta\right)\right)$$
The $y^Ty$ term has vanished because it's treated as a constant when we're taking the gradient with respect to $\theta$.
We are now left with

$$\nabla_\theta J(\theta) = \frac{1}{2}\left(\nabla_\theta \operatorname{tr}\left(\theta^TX^TX\theta\right) - 2\,\nabla_\theta \operatorname{tr}\left(y^TX\theta\right)\right)$$
If we use our derived theorem above ($\nabla_{A^T} \operatorname{tr}\left(ABA^T\right) = B^TA^T + BA^T$) and substitute as follows,

$$A^T = \theta, \qquad B = X^TX,$$

so that $\operatorname{tr}\left(ABA^T\right) = \operatorname{tr}\left(\theta^TX^TX\theta\right)$ and $\nabla_{A^T} = \nabla_\theta$, and note that $X^TX$ is symmetric and that $\nabla_\theta \operatorname{tr}\left(y^TX\theta\right) = X^Ty$ (the trace here is the scalar $y^TX\theta = \sum_j \left(X^Ty\right)_j \theta_j$, which we can differentiate component by component),
we should get

$$\nabla_\theta J(\theta) = \frac{1}{2}\left(X^TX\theta + X^TX\theta - 2X^Ty\right) = X^TX\theta - X^Ty.$$
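As a quick numerical confirmation of this expression (reusing `X`, `y`, `cost`, and `numerical_gradient` from the earlier sketches), the analytic gradient matches a finite-difference gradient at an arbitrary $\theta$:

```python
theta_any = np.random.default_rng(2).normal(size=(X.shape[1], 1))
analytic_grad = X.T @ X @ theta_any - X.T @ y
numeric_grad = numerical_gradient(lambda t: cost(t, X, y), theta_any)
print(np.allclose(analytic_grad, numeric_grad, atol=1e-4))  # True
```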
Since $\nabla_\theta J(\theta) = 0$, we finally have

$$X^TX\theta - X^Ty = 0 \quad\Longrightarrow\quad X^TX\theta = X^Ty \quad\Longrightarrow\quad \theta = \left(X^TX\right)^{-1}X^Ty,$$
which is the normal equation. How cool is that! :)
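As a last sanity check (again with the synthetic `X` and `y` from the first sketch), the normal-equation solution agrees with what a dedicated least-squares solver such as `np.linalg.lstsq` returns:

```python
theta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta_normal_eq, theta_lstsq))  # True
```

For a full-rank $X$ the two give the same minimizer; in practice, QR- or SVD-based solvers like `lstsq` are preferred when $X^TX$ is ill-conditioned, but the closed form above is exactly what is being minimized either way.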