7.4. Least Squares Solutions#
7.4.1. Introduction#
In Chapter 2, especially Section 2.1, we studied linear systems. One way to write them down was as a matrix-vector equation \(A\vect{x} = \vect{b}\). We saw that a linear system could be either consistent or inconsistent. And if a system was inconsistent, that would then be the end of the story.
In this section we will reconsider the inconsistent situation and ask whether there is a vector \(\vect{x}\) that is, in a sense, the ‘best possible’ alternative to a solution.
One common situation where an inconsistent linear system arises quite naturally is fitting a line through a set of points. Suppose \(n \geq 3\) points \((x_1,y_1), \ldots, (x_n,y_n)\) in the plane are given. Which line \(\ell: y = ax + b\) best fits this set of points?
There are different ways to define what the best line is. For instance, we may mean the line that minimizes the sum of the (perpendicular) distances of the points to the line. From a purely geometric point of view that seems the most natural choice. Or we can take the line for which the sum of the vertical distances from the points to the line, i.e.,
is minimal. This approach makes sense in a physical context where typically the \(x\)-variable is an input variable over which a researcher has control, and \(y\) is some output variable that may be subject to fluctuations and/or uncertainties. See Figure 7.4.1 for both interpretations of the ‘best’ line.
Both are sensible ideas. However, turning either of these two ideas into an algorithm to find the best line is not as straightforward as the computations that arise when we put the question into the realm of linear algebra. There it will turn out to be the problem of an inconsistent linear system.
Ideally there are parameters \(a\) and \(b\) such that the line with the equation
contains all points \((x_i,y_i)\).
That means that all equations
will be satisfied simultaneously. This only happens if the matrix-vector equation
is consistent, which is generally not the case.
We will come back to this question in Subsection 7.4.5.
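As a quick numerical aside, the following sketch (with three made-up, non-collinear points) sets up this matrix-vector equation and confirms the inconsistency by comparing the rank of the coefficient matrix with that of the augmented matrix.

```python
import numpy as np

# Three hypothetical, non-collinear points (x_i, y_i); any such triple works.
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 4.0])

# The equations a*x_i + b = y_i in matrix-vector form A [a, b]^T = y.
A = np.column_stack([x, np.ones_like(x)])

# The system is consistent exactly when rank(A) == rank([A | y]).
rank_A = np.linalg.matrix_rank(A)
rank_Ab = np.linalg.matrix_rank(np.column_stack([A, y]))
print(rank_A, rank_Ab)   # 2 and 3: the system is inconsistent
```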
7.4.2. Least Squares Solutions#
Let \(A\) be an \(m \times n\) matrix with columns \(\vect{a}_1, \ldots, \vect{a}_n\).
We have seen (Section 2.4, Remark 2.4.1) that the linear system
is equivalent to the vector equation
If the linear system is inconsistent, that is, if
then we can try to find the best approximation of the vector \(\vect{b}\) by a vector in \(\Span{\vect{a}_1, \ldots , \vect{a}_n}\). This is the idea behind the following definition.
Let \(A\) be an \(m\times n\) matrix and \(\vect{b}\) a vector in \(\R^{m}\). A vector \(\hat{\vect{x}}\) is called a least squares solution of the linear system \(A\vect{x} = \vect{b}\) if for every \(\vect{x}\) in \(\R^n\) the inequality
holds.
The vector \( A\hat{\vect{x}} - \vect{b} \) is called the error vector, and the norm of this vector, i.e.,
is called the least squares error.
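In code, evaluating the least squares error of a candidate vector amounts to computing the norm of the error vector. A minimal sketch; the helper name and the small system are made up for illustration.

```python
import numpy as np

def least_squares_error(A, x, b):
    """Return the norm of the error vector A x - b."""
    return np.linalg.norm(A @ x - b)

# Small made-up system for illustration.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 2.0, 4.0])

print(least_squares_error(A, np.array([1.0, 2.0])))  # error of a trial vector
```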
Consider the linear system \(A\vect{x} = \vect{b}\), with
For the trial solution \(x_1 = 5, x_2 = 4, x_3 = 5\), the error vector and its norm are computed as
and
By definition \(A\hat{\vect{x}} = \hat{x}_1\vect{a}_1 + \ldots + \hat{x}_n\vect{a}_n\) is the best approximation of \(\vect{b}\) by vectors in \(\Span{\vect{a}_1, \ldots , \vect{a}_n}\).
By minimizing \(\norm{A\vect{x} - \vect{b}}\) we are in fact minimizing the sum of the squares of the errors. This explains the name least squares error.
From
we read off that a least squares solution yields a linear combination of the columns of \(A\) that has a minimal distance to \(\vect{b}\).
For a consistent linear system a least squares solution will be an actual solution, i.e. \(A\hat{\vect{x}} - \vect{b} = \vect{0}\). In this case the least squares error
will be zero.
In the situation where we want to fit a line \(y = ax + b\), we can take as ‘best’ parameters the least squares solution of the linear system as in Equation (7.4.1),
The interpretation of the ‘error vector’ \(A\vect{x} - \vect{b}\) then becomes
and the error \(\norm{A\vect{x} - \vect{b}}\) comes down to
The least squares solution \((\hat{a},\hat{b})\) minimizes this error, so in fact it minimizes the sum of the squares
The questions we will address are:
i. Does a least squares solution always exist?
ii. How can we compute the least squares solution(s)?
iii. If a least squares solution exists, is it unique?
The next proposition provides the answers to question i. and question iii.
For each linear system \(A\vect{x} = \vect{b}\), where \(A\) is an \(m \times n\) matrix and \(\vect{b}\) a vector in \(\R^m\), a least squares solution always exists. Moreover the least squares solution is unique if and only if the columns of \(A\) are linearly independent.
Proof of Proposition 7.4.1
In Remark 7.4.1 it was noted that a least squares solution corresponds to the vector in Col \(A\) that is closest to \(\vect{b}\).
The vector in Col \(A\) that is closest to \(\vect{b}\) is precisely the orthogonal projection of \(\vect{b}\) onto Col \(A\). (See Proposition 7.2.2.)
This projection, a linear combination of the columns of \(A\), always exists.
The coefficients of this linear combination then give a least squares solution.
Lastly, these coefficients are unique if and only if the columns of \(A\) are linearly independent.
Find the least squares solution of the linear system
According to Proposition 7.4.1 the least squares solution consists of the coefficients of the orthogonal projection of the vector \( \vect{b} = \left[ \begin{array}{c} 9 \\ 7 \\ 11 \end{array} \right] \) onto \(\Span{\vect{a}_1, \vect{a}_2} = \Span{\left[ \begin{array}{c} 1 \\ 2 \\ 3 \end{array} \right], \left[ \begin{array}{c} 1 \\ -2 \\ 1 \end{array} \right]}\).
In this first example we have chosen \(\vect{a}_1\) and \(\vect{a}_2\) to be orthogonal.
So by the projection formula for an orthogonal basis (Theorem 7.2.2, see side note), the projection is given by
And then the least squares solution is found to be \(\hat{\vect{x}} = \left[ \begin{array}{c} 4 \\ 1 \end{array} \right]\).
For this vector we find
Proposition 7.4.1 tells us this is the closest we can get.
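This computation is easy to verify numerically. The sketch below recomputes the projection coefficients from the orthogonality of \(\vect{a}_1\) and \(\vect{a}_2\) and cross-checks the result with NumPy's built-in least squares routine.

```python
import numpy as np

a1 = np.array([1.0, 2.0, 3.0])
a2 = np.array([1.0, -2.0, 1.0])
b  = np.array([9.0, 7.0, 11.0])
A  = np.column_stack([a1, a2])

# Projection coefficients for an orthogonal basis: c_i = (a_i . b) / (a_i . a_i).
c1 = a1 @ b / (a1 @ a1)
c2 = a2 @ b / (a2 @ a2)
print(c1, c2)                           # 4.0 and 1.0, i.e. x_hat = [4, 1]

# Cross-check with NumPy's least squares solver.
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x_hat)                            # [4. 1.]
print(np.linalg.norm(A @ x_hat - b))    # the least squares error, sqrt(21)
```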
In Example 7.4.2 the coefficients of the orthogonal projection were quickly found due to the fact that the vectors \(\vect{a}_1\) and \(\vect{a}_2\) were orthogonal. In Section 7.3 we saw how we can construct an orthogonal basis from an arbitrary basis. And then we can use the projection formula (7.2.1) to find the orthogonal projection. However, we will see that this is an unnecessary detour.
7.4.3. Normal Equations#
There is a direct way to find the coefficients of the orthogonal projection onto Col\( A\) if the columns are not orthogonal.
(Normal Equations)
Suppose \(A\) is an \(m \times n\) matrix and \(\vect{b}\) is a vector in \(\R^m\).
Then the system of linear equations
is always consistent. The equations in this system are called the normal equations.
Any solution \(\hat{\vect{x}}\) of the normal equations is a least squares solution.
Before having a look at the proof consider the following example.
We compute the least squares solution of the linear system
The normal equations lead to the augmented matrix
This can be row reduced to the echelon form
The least squares solution can be read off from the last column in this matrix.
The error vector and the least squares error are found to be
This is slightly better than the ‘trial solution’ in Example 7.4.1, where the norm of the error vector was found to be \(\sqrt{15}\).
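The normal-equations recipe is easy to automate. As a sketch, the hypothetical helper below solves \(A^TA\hat{\vect{x}} = A^T\vect{b}\) directly; it is applied here to the system of Example 7.4.2, whose entries appear in full above, but the steps are identical for any \(A\) and \(\vect{b}\) with \(A^TA\) invertible.

```python
import numpy as np

def solve_normal_equations(A, b):
    """Least squares solution via A^T A x = A^T b (assumes A^T A is invertible)."""
    return np.linalg.solve(A.T @ A, A.T @ b)

# System of Example 7.4.2.
A = np.array([[1.0,  1.0],
              [2.0, -2.0],
              [3.0,  1.0]])
b = np.array([9.0, 7.0, 11.0])

x_hat = solve_normal_equations(A, b)
print(x_hat)                            # [4. 1.]
print(np.linalg.norm(A @ x_hat - b))    # least squares error, here sqrt(21)
```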
In the proof, properties of the orthogonal projection are combined in a clever way. As usual, feel free to skip it.
Proof of Theorem 7.4.1
As usual we denote the columns of the \(m \times n\) matrix \(A\) by \(\vect{a}_1, \ldots, \vect{a}_n\).
From the section about orthogonal projections, we know that the orthogonal projection of \(\vect{b}\) onto the column space of \(A\) exists and is unique. (cf. Theorem 7.2.2.) This projection will be a vector of the form
for certain constants \(c_1, \ldots, c_n\).
By the definition of the orthogonal projection we have that \((\vect{b} - (c_1\vect{a}_1 + \ldots + c_n\vect{a}_n))\) lies in the orthogonal complement of Col\( A\), i.e.,
This leads to \(n\) linear equations
for the unknowns \(c_1, \ldots, c_n\).
These equations can be rewritten as
and further to
In matrix-vector form this becomes
This leads to the following very concise form.
So, to find the least squares solution(s) of the linear system \(A\vect{x} = \vect{b}\), we have to solve the normal equations
If \(\vect{c} = \left[ \begin{array}{c} c_1 \\ c_2 \\ \vdots \\ c_n \end{array} \right]\) is a least squares solution of the linear system \(A\vect{x} = \vect{b}\), then the orthogonal projection of \(\vect{b}\) onto Col \(A\) is given by
If the columns \(\vect{a}_1, \ldots, \vect{a}_n\) of \(A\) are linearly independent, the coefficients \(c_i\) are the coordinates with respect to the basis \(\{\vect{a}_1, \ldots, \vect{a}_n\}\), hence they are unique. Thus in that case the normal equations
must have a unique solution.
There is another way to see this, which follows from the next proposition.
Suppose \(A\) is an \(m \times n\) matrix. If the columns of \(A\) are linearly independent then the matrix \(A^TA\) is invertible.
Proof of Proposition 7.4.2
In fact, something stronger holds:
First if \(A\vect{x}= \vect{0}\), then clearly \(A^TA\vect{x} = A^T\vect{0} = \vect{0}\).
To prove the converse, suppose \(A^TA\vect{x} = \vect{0}\).
Then \(\vect{x}^TA^TA\vect{x} = \vect{x}^T\vect{0} = \vect{0}\) too.
Now realize that \(\vect{x}^TA^TA\vect{x} = (A\vect{x})^TA\vect{x} = \norm{A\vect{x}}^2\).
So \(A^TA\vect{x} = \vect{0}\) implies \(\norm{A\vect{x}}^2 = 0\), and that means that \(A\vect{x}\) must be the zero vector.
The equivalence (7.4.4) implies: if \(A\) has linearly independent columns, then \(A\vect{x} = \vect{0}\) has \(\vect{x}= \vect{0}\) as only solution, so \(A^TA\vect{x} = \vect{0}\) has \(\vect{x}= \vect{0}\) as only solution. This means that \(A^TA\) is invertible.
Prove the converse of Proposition 7.4.2.
For any \(m \times n\) matrix \(A\), if \(A^TA\) is invertible, then the columns of \(A\) must be linearly independent. (Note that the matrix \(A\) is not assumed to be square.)
Solution to Exercise 7.4.1
Suppose that \(A\) is an \(m \times n\) matrix for which \(A^TA\) is invertible. To prove that \(A\) has linearly independent columns we have to show that the equation
has only the trivial solution \(\vect{x} = \vect{0}\).
So suppose that \( A\vect{c} = \vect{0}\) for some vector \(\vect{c} \in \R^{n}\). Then a fortiori
The assumption that \(A^TA\) is invertible implies that indeed \(\vect{c} = \vect{0}\).
As stated, the least squares solution of a system \(A\vect{x} = \vect{b}\) consists of the coefficients \(c_i\) of the orthogonal projection
of \(\vect{b}\) onto the column space of \(A\).
For a matrix \(A\) with linearly independent columns, \(\vect{c}\) is unique, and given by
So for a matrix \(A\) with linearly independent columns the projection of a vector \(\vect{b}\) onto Col \(A\) is given by
We already knew how to find this projection if the columns are orthogonal. Using the normal equations we don’t need orthogonality.
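For a matrix \(A\) with linearly independent columns, the least squares solution and the projection can be written as \((A^TA)^{-1}A^T\vect{b}\) and \(A(A^TA)^{-1}A^T\vect{b}\). The sketch below checks this numerically for the data of Example 7.4.2.

```python
import numpy as np

A = np.array([[1.0,  1.0],
              [2.0, -2.0],
              [3.0,  1.0]])   # columns of Example 7.4.2 (linearly independent)
b = np.array([9.0, 7.0, 11.0])

AtA_inv = np.linalg.inv(A.T @ A)
x_hat = AtA_inv @ A.T @ b          # least squares solution
proj  = A @ x_hat                  # orthogonal projection of b onto Col A

print(x_hat)   # [4. 1.]
print(proj)    # [ 5.  6. 13.]

# The residual b - proj is orthogonal to the columns of A.
print(A.T @ (b - proj))   # approximately [0. 0.]
```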
Applying Theorem 7.4.1 to the earlier example (Example 7.4.2) shows that in the case of orthogonal vectors there is actually nothing new. This is illustrated in the next example.
Let us again find, but now using Theorem 7.4.1, the least squares solution of the linear system
The normal equations give the augmented matrix
Note that the orthogonality of the columns leads to a coefficient matrix \(A^TA\) that is diagonal, so the normal equations can be solved immediately.
This (again) yields the least squares solution \(\hat{\vect{x}} =\left[\begin{array}{c} 4 \\ 1 \end{array} \right]\).
The previous example can be generalized as follows.
If the columns \(\{\vect{a}_1, \ldots, \vect{a}_n\}\) of an \(m \times n\) matrix \(A\) form a set of non-zero, orthogonal vectors in \(\R^m\), then the orthogonal projection
of a vector \(\vect{b}\) in \(\R^m\) onto Col \(A\) is found by solving the normal equations
Since the columns are orthogonal, all products \(\vect{a}_i^{T}\vect{a}_j = \vect{a}_i\ip\vect{a}_j\) with \(i \neq j\) are zero.
Expressing the equation using inner products we find
This leads to the good old expressions \(c_i = \dfrac{\vect{a}_i\ip\vect{b}}{\vect{a}_i\ip\vect{a}_i} = \dfrac{\vect{b}\ip\vect{a}_i}{\vect{a}_i\ip\vect{a}_i}\).
As before (Theorem 7.2.2) the orthogonal projection becomes
Show that Formula (7.4.5) for a matrix \(A\) with linearly independent columns and QR decomposition \(A = QR\) (see Theorem 7.3.2) simplifies to
Also explain this simpler formula by interpreting the \(QR\) decomposition in a suitable way.
Solution to Exercise 7.4.2
This involves some elementary matrix operations.
Suppose \(A = QR\), where \(Q^TQ = I\), and \(R\) is an upper triangular matrix with a positive diagonal. So \(R\) is invertible.
Substitution of \(A=QR\) into (7.4.5)
gives
Using \(Q^TQ = I\) and \(\left[R^TR\right]^{-1} = R^{-1}(R^T)^{-1}\) this can be simplified further to
The interpretation is as follows. The columns \(\vect{q}_i\) of \(Q\) form an orthonormal basis for the column space of \(A\). So the orthogonal projection onto Col\((A)\) is the same as the orthogonal projection onto Col\((Q)\). For a matrix with orthonormal columns the projection formula
can be written in a very concise way as
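The QR variant is just as easy to try numerically. A sketch with NumPy, again for the system of Example 7.4.2; note that numpy.linalg.qr may return an \(R\) with negative diagonal entries, which does not affect the resulting least squares solution.

```python
import numpy as np

A = np.array([[1.0,  1.0],
              [2.0, -2.0],
              [3.0,  1.0]])
b = np.array([9.0, 7.0, 11.0])

Q, R = np.linalg.qr(A)                # reduced QR: Q is 3x2 with orthonormal columns

x_hat = np.linalg.solve(R, Q.T @ b)   # x_hat = R^{-1} Q^T b
proj  = Q @ (Q.T @ b)                 # projection onto Col A = Col Q

print(x_hat)   # [4. 1.]
print(proj)    # [ 5.  6. 13.]
```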
To conclude this section we will consider the situation where the matrix \(A\) has linearly dependent columns. Then the least squares solution is not unique.
Let us look at an example first.
We find the least squares solutions of the linear system \(A\vect{x} = \vect{b}\), where
Note that the columns of \(A\) are indeed linearly dependent.
For the least squares solution we have to solve the system with the augmented matrix
The augmented matrix can be row reduced to
From this we can read off all least squares solutions. They are given by
For this low-dimensional problem we can draw a picture.
The least squares solutions are depicted as the line \(\ell: \vect{x} = \hat{\vect{x}}_0 + c\left[\begin{array}{c} 2 \\ 1 \end{array}\right]\).
We can analyse Example 7.4.5 from a higher perspective. In the general solution of the normal equations
the ‘homogeneous’ part \(\vect{x}_H = c\left[\begin{array}{c} 2 \\ 1 \end{array}\right]\) is the null space of \(A^TA\). Because of the equivalence (7.4.4) this is equal to the null space of \(A\). Now from Section 7.1, Proposition 7.1.3, we know that
which is visualized in Figure 7.4.2.
Let us give one more example to illustrate matters.
We find the least squares solutions of the linear system \(A\vect{x} = \vect{b}\), where
The first column of \(A\) is the sum of the other two columns, so the columns of \(A\) are linearly dependent.
For the least squares solution we have to solve the system with augmented matrix
The augmented matrix can be row reduced to
From this we can read off all least squares solutions. They are given by
As in Example 7.4.5 the ‘homogeneous’ part \(\vect{x}_H = c\left[\begin{array}{c} 1 \\ -1 \\ -1 \end{array}\right]\) is the null space of \(A^TA\), which is equal to the null space of \(A\).
For instance, by taking \(c=0\) and \(c = -1\) we find the two least squares solutions
For both we find the least squares error
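With linearly dependent columns, general-purpose software still returns a least squares solution. The sketch below uses a small made-up rank-deficient matrix, in the spirit of Example 7.4.6, where the first column is the sum of the other two; NumPy's lstsq then returns the least squares solution of minimal norm, and every other least squares solution differs from it by a vector in the null space of \(A\).

```python
import numpy as np

# Made-up 3x3 matrix whose first column is the sum of the other two,
# so its columns are linearly dependent.
A = np.array([[2.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
b = np.array([1.0, 2.0, 3.0])

# lstsq handles the rank-deficient case; it returns the minimum-norm solution.
x_hat, _, rank, _ = np.linalg.lstsq(A, b, rcond=None)
print(rank)     # 2, confirming the rank deficiency
print(x_hat)

# Any vector in Nul A, e.g. [1, -1, -1], can be added without changing A x_hat.
v = np.array([1.0, -1.0, -1.0])
print(np.allclose(A @ (x_hat + v), A @ x_hat))   # True
```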
7.4.4. Grasple Exercises#
About the interpretation of a least squares solution.
Show/Hide Content
About the uniqueness of a least squares solution.
Show/Hide Content
About least squares solutions and normal equations.
Show/Hide Content
Finding the LS solution for 3x2 systems in three steps.
Show/Hide Content
LS solution + LS error for a 3x2 system.
Show/Hide Content
LS solution + LS error for a 4x4 system.
Show/Hide Content
LS solution + LS error for a 4x3 system.
Show/Hide Content
On the connection between orthogonal projections and least squares problems.
Show/Hide Content
LS solution + LS error for a 3x2 system.
Show/Hide Content
LS solutions + LS error for a 4x3 system.
Show/Hide Content
Finding the LS solution for a 4x3 system (involving quite some reduction work).
Show/Hide Content
Finding the LS solution for a 4x3 system (with some tricky reduction work).
Show/Hide Content
What is the quickest way to find the least squares solution?
Show/Hide Content
7.4.5. Linear Models#
In Section 7.4.1 we looked at ways to fit a line \(y = a + bx\) to a set of points \((x_i, y_i), i = 1, \ldots, n\) in the plane. In statistics this plays an important role in so-called regression models.
One way to define the best fitting line \(y = \hat{a}+\hat{b}x\) is to let \((\hat{a},\hat{b})\) be the least squares solution to the set of \(n\) linear equations
Note that in these equations the parameters \(a\) and \(b\) are the unknowns.
This line is sometimes referred to as the least squares line.
Suppose we want to find the least squares line for the set of four points
At first sight the line through the first and the last point, i.e.,
seems a good candidate. The points and the ‘first guess’ are depicted in Figure 7.4.3.
For this line the sum of the squares of the errors
which in this context are called residuals, becomes
To find the least squares line we consider the four equations in the form
Since the matrix has linearly independent columns, the normal equations, with augmented matrix
give a unique least squares solution, and it is \(\hat{a} = 1.6\), \(\hat{b} = 0.3\).
Figure 7.4.4 shows both lines.
For the line \(y = \hat{a} + \hat{b}x\) the sum of the squares of the residuals becomes
This is indeed slightly better than with the line found ‘at first sight’, where the sum was equal to \(0.125\).
We can even find a ready-made formula for the least squares line through \(n\) given points \((x_1,y_1), (x_2, y_2), \ldots, (x_n, y_n)\).
The coefficients of the least squares line \(y = \hat{a} + \hat{b}x\) for the set of points \((x_1,y_1), (x_2, y_2), \ldots, (x_n, y_n)\) are given by
Introducing the notation
the coefficients get the following more concise form,
We will derive the formula by diligently working through the normal equations.
In matrix-vector form the linear system we want to solve reads
Noting that
and
we arrive at the normal equations
If we multiply both sides of this equation by
we see the expressions (7.4.6) readily appearing.
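A sketch that carries out this computation numerically: it solves the normal equations for the least squares line through a few made-up points and compares the result with the standard closed-form expressions for the intercept and slope, which, up to notation, are what (7.4.6) amounts to.

```python
import numpy as np

# Made-up data points (x_i, y_i) for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.2, 2.9, 4.1])
n = len(x)

# Normal equations for y = a + b x with design matrix [1 | x].
A = np.column_stack([np.ones(n), x])
a_hat, b_hat = np.linalg.solve(A.T @ A, A.T @ y)

# Standard closed-form expressions for the same coefficients.
b_closed = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
a_closed = (np.sum(y) - b_closed * np.sum(x)) / n

print(a_hat, b_hat)          # from the normal equations
print(a_closed, b_closed)    # the same values from the closed form
```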
Fitting a line to a set of points may be looked upon as fitting a linear combination of the functions \(f_0(x) = 1\) and \(f_1(x) = x\).
Depending on the context we may also consider fitting a linear combination of a larger set of ‘basic’ functions.
For instance, when we use \(f_0(x) = 1\), \(f_1(x) = x\) and \(f_2(x) = x^2\), we are in fact looking for a quadratic function \(y = a_0 + a_1x + a_2x^2\) that best fits the set of points.
And we may even go beyond that. Then we get a more general so-called linear model.
(Linear Model)
Suppose \(k\) functions \(f_1(x), \ldots, f_k(x)\) and \(n\) points \((x_1,y_1), \ldots, (x_n,y_n)\) in the plane are given.
The linear model refers to the ‘best’ linear combination, in the least squares sense, of the form
That is, the linear combination that minimizes
the sum of the squares of the errors/residuals
The epithet linear refers to the fact that the parameters \(c_1, \ldots, c_k\) appear in the model in a linear way. The functions \(f_i\) that are used in the model definitely don’t have to be linear.
The parameters \(c_1,c_2,\ldots,c_k\) that minimize the sum in (7.4.7) coincide with the least squares solution of the linear system
In practice the points \((x_1,y_1), \ldots, (x_n,y_n)\) are often called the data, sometimes considered to be a sample from some space. The variable \(x\) is then considered the input variable and \(y\) the output variable. Lastly, the matrix in the expression on the left of Equation (7.4.8) is sometimes referred to as the design matrix.
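Computationally, fitting a linear model amounts to building the design matrix column by column from the chosen functions and solving the resulting least squares problem. A minimal sketch for the quadratic model with basis functions \(1\), \(x\) and \(x^2\) mentioned above; the helper name and the data points are made up for illustration.

```python
import numpy as np

def design_matrix(x, functions):
    """Design matrix whose columns are f(x) for each basis function f."""
    return np.column_stack([f(x) for f in functions])

# Quadratic model: y is approximated by c0 * 1 + c1 * x + c2 * x^2.
functions = [lambda x: np.ones_like(x), lambda x: x, lambda x: x**2]

# Made-up data points, roughly following y = x^2.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([ 4.1,  1.2, 0.1, 0.8, 4.2, 9.3])

A = design_matrix(x, functions)
c, *_ = np.linalg.lstsq(A, y, rcond=None)   # least squares coefficients
print(c)                                    # roughly [0, 0, 1] for this data
```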
Several generalizations are possible. We mention two.
-
The input may be multivariate. The set of data points then has the form
\[\begin{split} \begin{array}{c} (x_{11},x_{12},\ldots,x_{1k},y_1) \\ (x_{21},x_{22},\ldots,x_{2k},y_2)\\ \vdots \quad\quad \vdots \quad\quad \vdots\\ (x_{n1},x_{n2},\ldots,x_{nk},y_n), \end{array}\end{split} \tag{7.4.9}\]
and we want to find the linear combination
\[ \beta_1f_1(x_1, \ldots, x_{k}) + \ldots + \beta_{\ell}f_{\ell}(x_{1},\ldots, x_{k}) \]
that best fits these data.
For instance, to find the linear function
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k \]
that best fits the data in (7.4.9), we can take \(f_0(x_1,\ldots,x_k) = 1\) and \(f_i(x_1,\ldots,x_k) = x_i\), for \(i = 1, \ldots, k\).
In a least squares model we then look for the parameters \(\beta_1, \ldots, \beta_{\ell}\) that minimize
\[ \sum_{i=1}^{n} \big(y_i - \beta_1f_1(x_{i1},\ldots, x_{ik}) \,-\, \ldots \,-\, \beta_{\ell}f_{\ell}(x_{i1},\ldots, x_{ik})\big)^2. \tag{7.4.10}\]
-
In a weighted least squares model the terms in the sum (7.4.10) get different weights \(w_i\). When building a statistical model this may be desirable when some data give more ‘information’.
Then the expression we want to minimize is given by
\[ \sum_{i=1}^{n} \class{blue}{w_i} \big(y_i - \beta_1f_1(x_{i1},\ldots, x_{ik}) \,-\, \ldots \,-\, \beta_{\ell}f_{\ell}(x_{i1},\ldots, x_{ik})\big)^2. \]
This approach may be relevant if the variance of some of the observations seems smaller than the variance of others; a short computational sketch follows this list.
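A common way to compute a weighted least squares fit is to multiply each equation by the square root of its weight and then solve an ordinary least squares problem. A minimal sketch, with made-up data, made-up weights and a simple line model:

```python
import numpy as np

# Made-up data, design matrix for a line y = b0 + b1 x, and weights.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.1, 2.8, 4.3])
w = np.array([1.0, 1.0, 4.0, 4.0])   # larger weight = observation counts more

A = np.column_stack([np.ones_like(x), x])

# Multiply each row by sqrt(w_i) and solve an ordinary least squares problem.
sw = np.sqrt(w)
beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
print(beta)   # weighted least squares coefficients [b0, b1]
```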
To illustrate matters we present two examples.
(Fitting a plane)
Suppose \(n\) points \((x_i,y_i,z_i)\), \(i = 1, \ldots , n\) are given and we want to fit a plane through these. In other words, we want to find a linear combination
of the functions \(f_0(x,y) = 1\), \(f_1(x,y) = x\) and \(f_2(x,y) = y\) that fits these points best.
In the least squares sense we would then have to solve the normal equations for the linear system
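A sketch of this computation: with made-up points \((x_i, y_i, z_i)\) that lie approximately on a plane, the design matrix has columns \(1\), \(x_i\) and \(y_i\), and the normal equations give the coefficients of the fitted plane.

```python
import numpy as np

# Made-up points (x_i, y_i, z_i), roughly on the plane z = 1 + 2x - y.
pts = np.array([
    [0.0, 0.0, 1.1],
    [1.0, 0.0, 2.9],
    [0.0, 1.0, 0.1],
    [1.0, 1.0, 2.0],
    [2.0, 1.0, 4.1],
])
x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]

# Design matrix with columns f0 = 1, f1 = x, f2 = y.
A = np.column_stack([np.ones_like(x), x, y])

# Solve the normal equations for the coefficients of z = c0 + c1 x + c2 y.
c = np.linalg.solve(A.T @ A, A.T @ z)
print(c)   # approximately [1, 2, -1] for this data
```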
Suppose we have \(n\) data points \((x_i,y_i)\), \(i = 1,\ldots,n\), and taking physical conditions into account, we expect a relation of the form \( y = ax^r\) between them.
One way to go about finding suitable parameters \(a\) and \(r\) is to transform both \(x\) and \(y\) to a log scale by introducing the new variable
The relation between \(x\) and \(y\) then transforms to
We can find the least squares linear fit to the points \((\ln(x_i), z_i)\).
for the points
If we have found the least squares parameters \(\hat{\alpha}\) and \(\hat{\beta}\) for the transformed problem, at the end we have to ‘transform back’ to find the ‘best power fit’.
Figure 7.4.5 shows what’s going on. The first plot shows the points \((x_i,y_i)\), the second plot shows the points \((\ln(x_i), \ln(y_i))\), and the last two plots give the points with the two fits.
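The transformation trick is straightforward to carry out in code. A minimal sketch with made-up positive data that roughly follows a power law: fit \(\ln(y) = \alpha + \beta\ln(x)\) by least squares and transform back via \(a = e^{\hat{\alpha}}\), \(r = \hat{\beta}\).

```python
import numpy as np

# Made-up data roughly following y = 2 * x^1.5 (with small perturbations).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 5.5, 10.2, 16.3, 22.0])

# Transform to log scale: fit z = alpha + beta * u with u = ln(x), z = ln(y).
u, z = np.log(x), np.log(y)
A = np.column_stack([np.ones_like(u), u])
alpha_hat, beta_hat = np.linalg.lstsq(A, z, rcond=None)[0]

# Transform back: y = a * x^r with a = e^alpha, r = beta.
a, r = np.exp(alpha_hat), beta_hat
print(a, r)   # close to 2 and 1.5 for this data
```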
7.4.6. Grasple Exercises (for Linear Models)#
On the precise definition of the least squares line.
Show/Hide Content
Design matrix and input vector for the regression line through a set of points.
Show/Hide Content
Fitting a line through a set of points and computing the residual vector.
Show/Hide Content
Fitting a parabola to a set of points.
Show/Hide Content
Fitting a quadratic polynomial to a set of points.
Show/Hide Content
Design matrix to fit \(y = ax + bx^3\) to a set of points.
Show/Hide Content
Design matrix to fit \(y = c_1 e^x + c_2 \cos(x) + c_3 \sin(x)\) to a set of points.