7.4. Least Squares Solutions#

7.4.1. Introduction#

In Chapter 2, especially Section 2.1, we studied linear systems. One way to write them down was as a matrix-vector equation Ax=b. We saw that a linear system could be either consistent or inconsistent. And if a system was inconsistent, that would then be the end of the story.

In this section we will reconsider the inconsistent situation and ask ourselves the question whether there is a vector x that is in a sense the ‘best possible’ alternative to a solution.

One common situation where an inconsistent linear system arises quite naturally is fitting a line through a set of points. Suppose $n \geq 3$ points $(x_1,y_1), \ldots, (x_n,y_n)$ in the plane are given. Which line $\ell: y = a + bx$ best fits this set of points?

There are different ways to define what is the best line. For instance, we may mean the line that minimizes the sum of the (perpendicular) distances of the points to the line. From a purely geometric point of view that seems the most natural way. Or, we can take the line for which the sum of the vertical distances from the points to the line, i.e.,

$$S = \sum_{i=1}^{n} \bigl|y_i - (a + bx_i)\bigr|,$$

is minimal. This approach makes sense in a physical context where typically the $x$-variable is an input variable over which a researcher has control, and $y$ is some output variable which may be liable to fluctuations and/or uncertainties. See Figure 7.4.1 for both interpretations of ‘best’ line.

../_images/Fig-LeastSquares-BestLines.svg

Fig. 7.4.1 What is the best ‘best line’? Can you get the total distance in the left picture below 6.5? Can you get the total distance in the right picture below 9.0?#

Both are sensible ideas. However, turning either of these two ideas into an algorithm for finding the best line is not as straightforward as the computations that arise when we put the question into the realm of linear algebra. There it will turn out to be the problem of an inconsistent linear system.

Ideally there are parameters $a$ and $b$ such that the line with the equation

$$y = a + bx$$

contains all points $(x_i, y_i)$.

That means that all equations

$$\begin{cases} y_1 = a + bx_1 \\ y_2 = a + bx_2 \\ \quad\vdots \\ y_n = a + bx_n \end{cases}$$

are satisfied simultaneously. This only happens if the matrix-vector equation

$$\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \tag{7.4.1}$$

is consistent, which generally is not the case.

We will come back to this question in Subsection 7.4.5.

7.4.2. Least Squares Solutions#

Let $A$ be an $m \times n$ matrix with columns $\mathbf{a}_1, \ldots, \mathbf{a}_n$.
We have seen (Section 2.4, Remark 2.4.1) that the linear system

$$A\mathbf{x} = \mathbf{b}$$

is equivalent to the vector equation

$$x_1\mathbf{a}_1 + \cdots + x_n\mathbf{a}_n = \mathbf{b}.$$

What we can do if the linear system is inconsistent, thus if

$$\mathbf{b} \notin \operatorname{Span}\{\mathbf{a}_1, \ldots, \mathbf{a}_n\},$$

is try to find the best approximation of the vector $\mathbf{b}$ by a vector in $\operatorname{Span}\{\mathbf{a}_1, \ldots, \mathbf{a}_n\}$. This is the idea behind the following definition.

Definition 7.4.1

Let $A$ be an $m \times n$ matrix and $\mathbf{b}$ a vector in $\mathbb{R}^m$. A vector $\hat{\mathbf{x}}$ is called a least squares solution of the linear system $A\mathbf{x} = \mathbf{b}$ if for every $\mathbf{x}$ in $\mathbb{R}^n$ the inequality

$$\|A\hat{\mathbf{x}} - \mathbf{b}\| \leq \|A\mathbf{x} - \mathbf{b}\|$$

holds.

The vector $A\hat{\mathbf{x}} - \mathbf{b}$ is called the vector of the errors, and the norm of this vector, i.e.,

$$\|A\hat{\mathbf{x}} - \mathbf{b}\|,$$

is called the least squares error.

Example 7.4.1

Consider the linear system $A\mathbf{x} = \mathbf{b}$, with

$$A = \begin{bmatrix} 1 & 2 & 1 \\ 2 & 1 & 1 \\ 3 & 2 & 4 \\ 2 & 1 & 3 \end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix} 20 \\ 20 \\ 40 \\ 30 \end{bmatrix}.$$

For the trial solution $x_1 = 5$, $x_2 = 4$, $x_3 = 5$, the error vector and its norm are computed as

$$\mathbf{v} = \begin{bmatrix} 1 & 2 & 1 \\ 2 & 1 & 1 \\ 3 & 2 & 4 \\ 2 & 1 & 3 \end{bmatrix} \begin{bmatrix} 5 \\ 4 \\ 5 \end{bmatrix} - \begin{bmatrix} 20 \\ 20 \\ 40 \\ 30 \end{bmatrix} = \begin{bmatrix} 18 \\ 19 \\ 43 \\ 29 \end{bmatrix} - \begin{bmatrix} 20 \\ 20 \\ 40 \\ 30 \end{bmatrix} = \begin{bmatrix} -2 \\ -1 \\ 3 \\ -1 \end{bmatrix}$$

and

$$\|\mathbf{v}\| = \sqrt{15}.$$
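A quick numerical check of this computation, as a minimal NumPy sketch (assuming NumPy is available):

```python
import numpy as np

# Matrix and right-hand side of Example 7.4.1
A = np.array([[1, 2, 1],
              [2, 1, 1],
              [3, 2, 4],
              [2, 1, 3]], dtype=float)
b = np.array([20, 20, 40, 30], dtype=float)

# Trial solution x1 = 5, x2 = 4, x3 = 5
x_trial = np.array([5, 4, 5], dtype=float)

v = A @ x_trial - b           # error vector
print(v)                      # [-2. -1.  3. -1.]
print(np.linalg.norm(v))      # sqrt(15) ≈ 3.873
```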

Remark 7.4.1

By definition $A\hat{\mathbf{x}} = \hat{x}_1\mathbf{a}_1 + \cdots + \hat{x}_n\mathbf{a}_n$ is the best approximation of $\mathbf{b}$ by vectors in $\operatorname{Span}\{\mathbf{a}_1, \ldots, \mathbf{a}_n\}$.

By minimizing $\|A\mathbf{x} - \mathbf{b}\|$ we are in fact minimizing the sum of the squares of the errors. This explains the name least squares error.

From

$$A\hat{\mathbf{x}} - \mathbf{b} = \hat{x}_1\mathbf{a}_1 + \cdots + \hat{x}_n\mathbf{a}_n - \mathbf{b}$$

we read off that a least squares solution yields a linear combination of the columns of $A$ that has a minimal distance to $\mathbf{b}$.

For a consistent linear system a least squares solution will be an actual solution, i.e., $A\hat{\mathbf{x}} - \mathbf{b} = \mathbf{0}$. In this case the least squares error

$$\|A\hat{\mathbf{x}} - \mathbf{b}\| = \|\mathbf{b} - \mathbf{b}\| = 0$$

will be zero.

Remark 7.4.2

In the situation where we want to fit a line $y = a + bx$, we can take as ‘best’ parameters the least squares solution of the linear system as in Equation (7.4.1),

$$\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.$$

The interpretation of the ‘error vector’ $A\mathbf{x} - \mathbf{b}$ then becomes

$$\begin{bmatrix} a + bx_1 - y_1 \\ a + bx_2 - y_2 \\ \vdots \\ a + bx_n - y_n \end{bmatrix}$$

and the error $\|A\mathbf{x} - \mathbf{b}\|$ comes down to

$$\sqrt{(y_1 - (a + bx_1))^2 + \cdots + (y_n - (a + bx_n))^2}.$$

The least squares solution $(\hat{a}, \hat{b})$ minimizes this error, so in fact it minimizes the sum of the squares

$$\sum_{i=1}^{n} \bigl(y_i - (a + bx_i)\bigr)^2.$$

The questions we will address are

  1. Does a least squares solution always exist?

  2. How to compute the least squares solution(s)?

  3. If a least squares solution exists, is it unique?

The next proposition provides the answers to questions 1 and 3.

Proposition 7.4.1

For each linear system $A\mathbf{x} = \mathbf{b}$, where $A$ is an $m \times n$ matrix and $\mathbf{b}$ a vector in $\mathbb{R}^m$, a least squares solution always exists. Moreover, the least squares solution is unique if and only if the columns of $A$ are linearly independent.

Proof of Proposition 7.4.1

In Remark 7.4.1 it was noted that a least squares solution corresponds to the vector in Col A that is closest to b.

The vector in Col A that is closest to b is precisely the orthogonal projection of b onto Col A. (See Proposition 7.2.2.)

This projection, a linear combination of the columns of A, always exists.

The coefficients of this linear combination then give a least squares solution.

Lastly, these coefficients are unique if and only if the columns of A are linearly independent.

Example 7.4.2

Find the least squares solution of the linear system

$$\begin{cases} x_1 + x_2 = 9 \\ 2x_1 - 2x_2 = 7 \\ 3x_1 + x_2 = 11. \end{cases}$$

According to Proposition 7.4.1 the least squares solution consists of the coefficients of the orthogonal projection of the vector $\mathbf{b} = \begin{bmatrix} 9 \\ 7 \\ 11 \end{bmatrix}$ onto $\operatorname{Span}\{\mathbf{a}_1, \mathbf{a}_2\} = \operatorname{Span}\left\{\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \begin{bmatrix} 1 \\ -2 \\ 1 \end{bmatrix}\right\}$.

In this first example we have chosen $\mathbf{a}_1$ and $\mathbf{a}_2$ to be orthogonal.

So by the projection formula for an orthogonal basis (see Theorem 7.2.2, see side note), the projection is given by

$$\frac{\mathbf{b}\cdot\mathbf{a}_1}{\mathbf{a}_1\cdot\mathbf{a}_1}\,\mathbf{a}_1 + \frac{\mathbf{b}\cdot\mathbf{a}_2}{\mathbf{a}_2\cdot\mathbf{a}_2}\,\mathbf{a}_2 = \frac{56}{14}\,\mathbf{a}_1 + \frac{6}{6}\,\mathbf{a}_2 = 4\,\mathbf{a}_1 + 1\,\mathbf{a}_2.$$

And then the least squares solution is found to be $\hat{\mathbf{x}} = \begin{bmatrix} 4 \\ 1 \end{bmatrix}$.

For this vector we find

$$A\hat{\mathbf{x}} = \begin{bmatrix} 5 \\ 6 \\ 13 \end{bmatrix}, \quad \text{with} \quad \|A\hat{\mathbf{x}} - \mathbf{b}\| = \sqrt{(-4)^2 + (-1)^2 + 2^2} = \sqrt{21}.$$

Proposition 7.4.1 tells us this is the closest we can get.
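The projection coefficients and the least squares error can also be verified numerically; a minimal NumPy sketch (the call to `np.linalg.lstsq` solves the general least squares problem directly):

```python
import numpy as np

# Columns and right-hand side of Example 7.4.2
a1 = np.array([1, 2, 3], dtype=float)
a2 = np.array([1, -2, 1], dtype=float)
b  = np.array([9, 7, 11], dtype=float)

# Projection formula for an orthogonal basis
print((b @ a1) / (a1 @ a1), (b @ a2) / (a2 @ a2))   # 4.0 1.0

# The same answer from a general least squares routine
A = np.column_stack([a1, a2])
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x_hat)                            # [4. 1.]
print(np.linalg.norm(A @ x_hat - b))    # sqrt(21) ≈ 4.583
```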

In Example 7.4.2 the coefficients of the orthogonal projection were quickly found due to the fact that the vectors a1 and a2 were orthogonal. In Section 7.3 we saw how we can construct an orthogonal basis from an arbitrary basis. And then we can use the projection formula (7.2.1) to find the orthogonal projection. However, we will see that this is an unnecessary detour.

7.4.3. Normal Equations#

There is a direct way to find the coefficients of the orthogonal projection onto ColA if the columns are not orthogonal.

Theorem 7.4.1 (Normal Equations)

Suppose $A$ is an $m \times n$ matrix and $\mathbf{b}$ is a vector in $\mathbb{R}^m$.

Then the system of linear equations

$$A^TA\,\mathbf{x} = A^T\mathbf{b} \tag{7.4.2}$$

is always consistent. The equations in this system are called the normal equations.

Any solution $\hat{\mathbf{x}}$ of the normal equations is a least squares solution.

Before having a look at the proof consider the following example.

Example 7.4.3

We compute the least squares solution of the linear system

$$\begin{cases} x_1 + 2x_2 + x_3 = 20 \\ 2x_1 + x_2 + x_3 = 20 \\ 3x_1 + 2x_2 + 4x_3 = 40 \\ 2x_1 + x_2 + 3x_3 = 30. \end{cases}$$

The normal equations lead to the augmented matrix

$$[\,A^TA \,|\, A^T\mathbf{b}\,] = \left[\begin{array}{ccc|c} 18 & 12 & 21 & 240 \\ 12 & 10 & 14 & 170 \\ 21 & 14 & 27 & 290 \end{array}\right].$$

This can be row reduced to the echelon form

$$\left[\begin{array}{ccc|c} 1 & 0 & 0 & 16/3 \\ 0 & 1 & 0 & 5 \\ 0 & 0 & 1 & 4 \end{array}\right].$$

The least squares solution can be read off from the last column in this matrix:

$$\hat{\mathbf{x}} = \begin{bmatrix} 16/3 \\ 5 \\ 4 \end{bmatrix}.$$

The error vector and the least squares error are found to be

$$\mathbf{v} = A\hat{\mathbf{x}} - \mathbf{b} = \begin{bmatrix} -2/3 \\ -1/3 \\ 2 \\ -7/3 \end{bmatrix}, \quad \|\mathbf{v}\| = \sqrt{10}.$$

This is slightly better than the ‘trial solution’ in Example 7.4.1, where the norm of the error vector was found to be $\sqrt{15}$.
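A minimal NumPy sketch of this computation; forming and solving the normal equations is fine for small, well-conditioned problems, while `np.linalg.lstsq` is generally preferable numerically:

```python
import numpy as np

# System of Example 7.4.3
A = np.array([[1, 2, 1],
              [2, 1, 1],
              [3, 2, 4],
              [2, 1, 3]], dtype=float)
b = np.array([20, 20, 40, 30], dtype=float)

# Solve the normal equations A^T A x = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)
print(x_hat)                      # [5.333... 5. 4.], i.e. [16/3, 5, 4]

v = A @ x_hat - b                 # error vector
print(np.linalg.norm(v))          # sqrt(10) ≈ 3.162
```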

In the proof properties of the orthogonal projection are combined in a clever way. As usual, feel free to skip it.

If $\mathbf{c} = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix}$ is a least squares solution of the linear system $A\mathbf{x} = \mathbf{b}$, then the orthogonal projection of $\mathbf{b}$ onto $\operatorname{Col} A$ is given by

$$\operatorname{proj}_{\operatorname{Col} A}(\mathbf{b}) = c_1\mathbf{a}_1 + \cdots + c_n\mathbf{a}_n = A\mathbf{c}.$$

If the columns $\mathbf{a}_1, \ldots, \mathbf{a}_n$ of $A$ are linearly independent, the coefficients $c_i$ are the coordinates of this projection with respect to the basis $\{\mathbf{a}_1, \ldots, \mathbf{a}_n\}$, hence they are unique. Thus in that case the normal equations

$$A^TA\,\mathbf{x} = A^T\mathbf{b}$$

must have a unique solution.

There is another way to see this, which follows from the next proposition.

Proposition 7.4.2

Suppose $A$ is an $m \times n$ matrix. If the columns of $A$ are linearly independent, then the matrix $A^TA$ is invertible.

Proof of Proposition 7.4.2

In fact, something stronger holds:

$$A\mathbf{x} = \mathbf{0} \iff A^TA\mathbf{x} = \mathbf{0}. \tag{7.4.4}$$

First, if $A\mathbf{x} = \mathbf{0}$, then clearly $A^TA\mathbf{x} = A^T\mathbf{0} = \mathbf{0}$.

To prove the converse, suppose $A^TA\mathbf{x} = \mathbf{0}$.

Then $\mathbf{x}^TA^TA\mathbf{x} = \mathbf{x}^T\mathbf{0} = 0$ too.

Now realize that $\mathbf{x}^TA^TA\mathbf{x} = (A\mathbf{x})^TA\mathbf{x} = \|A\mathbf{x}\|^2$.

So $A^TA\mathbf{x} = \mathbf{0}$ implies $\|A\mathbf{x}\|^2 = 0$, and that means that $A\mathbf{x}$ must be the zero vector.

The equivalence (7.4.4) implies: if $A$ has linearly independent columns, then $A\mathbf{x} = \mathbf{0}$ has $\mathbf{x} = \mathbf{0}$ as its only solution, so $A^TA\mathbf{x} = \mathbf{0}$ also has $\mathbf{x} = \mathbf{0}$ as its only solution. Since $A^TA$ is a square matrix, this means that $A^TA$ is invertible.
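The statement can be illustrated numerically; a small sketch (the matrices are just examples):

```python
import numpy as np

# Independent columns: A^T A is invertible
A = np.array([[1, 2, 1],
              [2, 1, 1],
              [3, 2, 4],
              [2, 1, 3]], dtype=float)
print(np.linalg.matrix_rank(A))        # 3: columns independent
print(np.linalg.det(A.T @ A))          # ≈ 90, nonzero

# Dependent columns (third column = first + second): A^T A is singular
B = np.column_stack([A[:, 0], A[:, 1], A[:, 0] + A[:, 1]])
print(np.linalg.matrix_rank(B))        # 2: columns dependent
print(np.linalg.det(B.T @ B))          # (numerically) zero
```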

Exercise 7.4.1

Prove the converse of Proposition 7.4.2.

For any $m \times n$ matrix $A$: if $A^TA$ is invertible, then the columns of $A$ must be linearly independent. (Note that $A$ is not assumed to be a square matrix.)

Remark 7.4.3

As stated, the least squares solution of a system $A\mathbf{x} = \mathbf{b}$ consists of the coefficients $c_i$ of the orthogonal projection

$$\operatorname{proj}_{\operatorname{Col} A}(\mathbf{b}) = c_1\mathbf{a}_1 + c_2\mathbf{a}_2 + \cdots + c_n\mathbf{a}_n = A\mathbf{c}, \quad \mathbf{c} = \begin{bmatrix} c_1 \\ \vdots \\ c_n \end{bmatrix},$$

of $\mathbf{b}$ onto the column space of $A$.

For a matrix $A$ with linearly independent columns, $\mathbf{c}$ is unique and given by

$$\mathbf{c} = (A^TA)^{-1}A^T\mathbf{b}.$$

So for a matrix $A$ with linearly independent columns the projection of a vector $\mathbf{b}$ onto $\operatorname{Col} A$ is given by

$$\hat{\mathbf{b}} = \operatorname{proj}_{\operatorname{Col} A}(\mathbf{b}) = A(A^TA)^{-1}A^T\mathbf{b}. \tag{7.4.5}$$

We already knew how to find this projection if the columns are orthogonal. Using the normal equations we don’t need orthogonality.
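A minimal sketch of Formula (7.4.5), reusing the matrix of Example 7.4.2; in numerical practice one solves the normal equations (or uses `np.linalg.lstsq`) rather than forming the inverse explicitly:

```python
import numpy as np

# Projection of b onto Col A via (7.4.5); A is the matrix of Example 7.4.2
A = np.array([[1, 1],
              [2, -2],
              [3, 1]], dtype=float)
b = np.array([9, 7, 11], dtype=float)

P = A @ np.linalg.inv(A.T @ A) @ A.T     # projection matrix onto Col A
print(P @ b)                              # [ 5.  6. 13.]

# Equivalent, and numerically nicer: solve the normal equations
c = np.linalg.solve(A.T @ A, A.T @ b)
print(A @ c)                              # same projection
```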

Applying Theorem 7.4.1 to the earlier example (Example 7.4.2) shows that in the case of orthogonal columns there is actually nothing new. This is illustrated in the next example.

Example 7.4.4

Let us again find, but now using Theorem 7.4.1, the least squares solution of the linear system

$$\begin{cases} x_1 + x_2 = 9 \\ 2x_1 - 2x_2 = 7 \\ 3x_1 + x_2 = 11. \end{cases}$$

The normal equations give the augmented matrix

$$[\,A^TA \,|\, A^T\mathbf{b}\,] = \left[\begin{array}{cc|c} \mathbf{a}_1^T\mathbf{a}_1 & \mathbf{a}_1^T\mathbf{a}_2 & \mathbf{a}_1^T\mathbf{b} \\ \mathbf{a}_2^T\mathbf{a}_1 & \mathbf{a}_2^T\mathbf{a}_2 & \mathbf{a}_2^T\mathbf{b} \end{array}\right] = \left[\begin{array}{cc|c} 14 & 0 & 56 \\ 0 & 6 & 6 \end{array}\right].$$

Note that the orthogonality of the columns leads to a coefficient matrix $A^TA$ that is a diagonal matrix. The normal equations can be solved in one stroke.

This (again) yields the least squares solution $\hat{\mathbf{x}} = \begin{bmatrix} 4 \\ 1 \end{bmatrix}$.

The previous example can be generalized as follows.

If the columns $\{\mathbf{a}_1, \ldots, \mathbf{a}_n\}$ of an $m \times n$ matrix $A$ form a set of non-zero, orthogonal vectors in $\mathbb{R}^m$, then the orthogonal projection

$$c_1\mathbf{a}_1 + c_2\mathbf{a}_2 + \cdots + c_n\mathbf{a}_n$$

of a vector $\mathbf{b}$ in $\mathbb{R}^m$ onto $\operatorname{Col} A$ is found by solving the normal equations

$$\begin{bmatrix} \mathbf{a}_1^T\mathbf{a}_1 & \mathbf{a}_1^T\mathbf{a}_2 & \cdots & \mathbf{a}_1^T\mathbf{a}_n \\ \mathbf{a}_2^T\mathbf{a}_1 & \mathbf{a}_2^T\mathbf{a}_2 & \cdots & \mathbf{a}_2^T\mathbf{a}_n \\ \vdots & \vdots & & \vdots \\ \mathbf{a}_n^T\mathbf{a}_1 & \mathbf{a}_n^T\mathbf{a}_2 & \cdots & \mathbf{a}_n^T\mathbf{a}_n \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} \mathbf{a}_1^T\mathbf{b} \\ \mathbf{a}_2^T\mathbf{b} \\ \vdots \\ \mathbf{a}_n^T\mathbf{b} \end{bmatrix}.$$

Since the columns are orthogonal, all products $\mathbf{a}_i^T\mathbf{a}_j = \mathbf{a}_i\cdot\mathbf{a}_j$ with $i \neq j$ are zero.

Expressing the equation using inner products we find

$$\begin{bmatrix} \mathbf{a}_1\cdot\mathbf{a}_1 & 0 & \cdots & 0 \\ 0 & \mathbf{a}_2\cdot\mathbf{a}_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \mathbf{a}_n\cdot\mathbf{a}_n \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} \mathbf{a}_1\cdot\mathbf{b} \\ \mathbf{a}_2\cdot\mathbf{b} \\ \vdots \\ \mathbf{a}_n\cdot\mathbf{b} \end{bmatrix},$$

which leads to the good old expressions

$$c_i = \frac{\mathbf{a}_i\cdot\mathbf{b}}{\mathbf{a}_i\cdot\mathbf{a}_i} = \frac{\mathbf{b}\cdot\mathbf{a}_i}{\mathbf{a}_i\cdot\mathbf{a}_i}.$$

As before (Theorem 7.2.2) the orthogonal projection becomes

$$\operatorname{proj}(\mathbf{b}) = \frac{\mathbf{b}\cdot\mathbf{a}_1}{\mathbf{a}_1\cdot\mathbf{a}_1}\,\mathbf{a}_1 + \frac{\mathbf{b}\cdot\mathbf{a}_2}{\mathbf{a}_2\cdot\mathbf{a}_2}\,\mathbf{a}_2 + \cdots + \frac{\mathbf{b}\cdot\mathbf{a}_n}{\mathbf{a}_n\cdot\mathbf{a}_n}\,\mathbf{a}_n.$$

Exercise 7.4.2

Show that Formula (7.4.5) for a matrix A with linearly independent columns and QR decomposition A=QR (see Theorem 7.3.2) simplifies to

$$\hat{\mathbf{b}} = \operatorname{proj}_{\operatorname{Col} A}(\mathbf{b}) = QQ^T\mathbf{b}.$$

Also explain this simpler formula by interpreting the QR decomposition in a suitable way.

To conclude this section we will consider the situation where the matrix A has linearly dependent columns. Then the least squares solution is not unique.

Let us look at an example first.

Example 7.4.5

We find the least squares solutions of the linear system $A\mathbf{x} = \mathbf{b}$, where

$$A = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix} 2 \\ 4 \end{bmatrix}.$$

Note that the columns of $A$ are indeed linearly dependent.

For the least squares solution we have to solve the system with the augmented matrix

$$[\,A^TA \,|\, A^T\mathbf{b}\,] = \left[\begin{array}{cc|c} 5 & 10 & 10 \\ 10 & 20 & 20 \end{array}\right].$$

The augmented matrix can be row reduced to

$$\left[\begin{array}{cc|c} 1 & 2 & 2 \\ 0 & 0 & 0 \end{array}\right].$$

From this we can read off all least squares solutions. They are given by

$$\hat{\mathbf{x}} = \hat{\mathbf{x}}_0 + \mathbf{x}_H = \begin{bmatrix} 2 \\ 0 \end{bmatrix} + c\begin{bmatrix} -2 \\ 1 \end{bmatrix}, \quad c \in \mathbb{R}.$$

For this low-dimensional problem we can draw a picture.

../_images/Fig-LeastSquares-SmallestLS.svg

Fig. 7.4.2 Multiple least squares solutions.#

The least squares solutions are depicted as the line $\ell: \mathbf{x} = \hat{\mathbf{x}}_0 + c\begin{bmatrix} -2 \\ 1 \end{bmatrix}$.

We can analyse Example 7.4.5 from a higher perspective. In the general solution of the normal equations

$$\mathbf{x} = \hat{\mathbf{x}}_0 + c\begin{bmatrix} -2 \\ 1 \end{bmatrix},$$

the ‘homogeneous’ part $\mathbf{x}_H = c\begin{bmatrix} -2 \\ 1 \end{bmatrix}$ is the null space of $A^TA$. Because of the equivalence (7.4.4) this is equal to the null space of $A$. Now from Section 7.1, Proposition 7.1.3, we know that

$$(\operatorname{Nul} A)^{\perp} = \operatorname{Row} A = \operatorname{Span}\left\{\begin{bmatrix} 1 \\ 2 \end{bmatrix}\right\},$$

which is visualized in Figure 7.4.2.

Let us give one more example to illustrate matters.

Example 7.4.6

We find the least squares solutions of the linear system $A\mathbf{x} = \mathbf{b}$, where

$$A = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix} 1 \\ 3 \\ 2 \\ 4 \end{bmatrix}.$$

The first column of $A$ is the sum of the other two columns, so the columns of $A$ are linearly dependent.

For the least squares solution we have to solve the system with augmented matrix

$$[\,A^TA \,|\, A^T\mathbf{b}\,] = \left[\begin{array}{ccc|c} 4 & 2 & 2 & 10 \\ 2 & 2 & 0 & 4 \\ 2 & 0 & 2 & 6 \end{array}\right].$$

The augmented matrix can be row reduced to

$$\left[\begin{array}{ccc|c} 1 & 0 & 1 & 3 \\ 0 & 1 & -1 & -1 \\ 0 & 0 & 0 & 0 \end{array}\right].$$

From this we can read off all least squares solutions. They are given by

$$\hat{\mathbf{x}} = \hat{\mathbf{x}}_0 + \mathbf{x}_H = \begin{bmatrix} 3 \\ -1 \\ 0 \end{bmatrix} + c\begin{bmatrix} -1 \\ 1 \\ 1 \end{bmatrix}, \quad c \in \mathbb{R}.$$

As in Example 7.4.5 the ‘homogeneous’ part $\mathbf{x}_H = c\begin{bmatrix} -1 \\ 1 \\ 1 \end{bmatrix}$ is the null space of $A^TA$, which is equal to the null space of $A$.

For instance, by taking $c = 0$ and $c = 1$ we find the two least squares solutions

$$\hat{\mathbf{x}}_1 = \begin{bmatrix} 3 \\ -1 \\ 0 \end{bmatrix}, \quad \hat{\mathbf{x}}_2 = \begin{bmatrix} 2 \\ 0 \\ 1 \end{bmatrix}.$$

For both we find the least squares error

$$\|A\hat{\mathbf{x}}_i - \mathbf{b}\| = \left\| \begin{bmatrix} 2 \\ 2 \\ 3 \\ 3 \end{bmatrix} - \begin{bmatrix} 1 \\ 3 \\ 2 \\ 4 \end{bmatrix} \right\| = \sqrt{4} = 2.$$
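A small numerical illustration of the non-uniqueness; note that `np.linalg.lstsq` returns the least squares solution of smallest norm, which here corresponds to $c = 4/3$ in the family above:

```python
import numpy as np

# System of Example 7.4.6 (linearly dependent columns)
A = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)
b = np.array([1, 3, 2, 4], dtype=float)

# Two members of the family of least squares solutions
for x_hat in (np.array([3., -1., 0.]), np.array([2., 0., 1.])):
    print(x_hat, np.linalg.norm(A @ x_hat - b))   # both give error 2.0

# np.linalg.lstsq returns the least squares solution of minimal norm
x_min, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x_min)   # ≈ [1.667, 0.333, 1.333]
```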

7.4.4. Grasple Exercises#

Grasple Exercise 7.4.1

https://embed.grasple.com/exercises/9666528f-1e81-4273-b571-8dae64a7972c?id=91139

About the interpretation of a least squares solution.

Grasple Exercise 7.4.2

https://embed.grasple.com/exercises/c96c0359-9599-4fd4-b5b2-bd7f9d2da463?id=91141

About the uniqueness of a least squares solution.

Grasple Exercise 7.4.3

https://embed.grasple.com/exercises/6068f5ac-3eb3-40a1-8686-ddbf05f172b2?id=91165

About least squares solutions and normal equations.

Grasple Exercise 7.4.4

https://embed.grasple.com/exercises/a0878702-dcc0-4216-b2d6-0b1d7d1b046e?id=91142

Finding the LS solution for 3x2 systems in three steps.

Grasple Exercise 7.4.5

https://embed.grasple.com/exercises/6a0628e8-065d-4390-b0c4-8ff131761de4?id=91161

LS solution + LS error for a 3x2 system.

Grasple Exercise 7.4.6

https://embed.grasple.com/exercises/1d9a943a-b51f-48c9-99a9-691b80b8df60?id=91159

LS solution + LS error for a 4x4 system.

Grasple Exercise 7.4.7

https://embed.grasple.com/exercises/d7480f19-afdb-474d-8542-299fc21a1952?id=91908

LS solution + LS error for a 4x3 system.

Grasple Exercise 7.4.8

https://embed.grasple.com/exercises/c6a52270-1c43-46a2-9dda-f1ab8b366066?id=91146

On the connection between orthogonal projections and least squares problems.

Grasple Exercise 7.4.9

https://embed.grasple.com/exercises/743d744c-1bcb-460a-973e-3e693e86e20d?id=91157

LS solution + LS error for a 3x2 system.

Grasple Exercise 7.4.10

https://embed.grasple.com/exercises/b081f76a-0e03-48cc-b27b-afab51ac2c91?id=91155

LS solutions + LS error for a 4x3 system.

Grasple Exercise 7.4.11

https://embed.grasple.com/exercises/679d1581-08bc-416b-89bf-766faad9f118?id=91394

Finding the LS solution for a 4x3 system (involving quite some reduction work)

Grasple Exercise 7.4.12

https://embed.grasple.com/exercises/3d0a7884-09ee-4f89-a2e5-1c1476d7e2a3?id=91448

Finding the LS solution for a 4x3 system (with some tricky reduction work).

Grasple Exercise 7.4.13

https://embed.grasple.com/exercises/396fd7a0-10cf-4b49-bb9e-fbc4acc2a06a?id=91148

What is the quickest way to find the least squares solution?

7.4.5. Linear Models#

In Section 7.4.1 we looked at ways to fit a line $y = a + bx$ to a set of points $(x_i, y_i)$, $i = 1, \ldots, n$, in the plane. In statistics this plays an important role in so-called regression models.

One way to define the best fitting line $y = \hat{a} + \hat{b}x$ is to let $(\hat{a}, \hat{b})$ be the least squares solution of the set of $n$ linear equations

$$y_i = a + bx_i, \quad i = 1, \ldots, n.$$

Note that in these equations the parameters $a$ and $b$ are the unknowns.

This line is sometimes referred to as the least squares line.

Example 7.4.7

Suppose we want to find the least squares line for the set of four points

(1,2),(2,2),(4,3),(5,3).

At first sight the line through the first and the last point, i.e.,

$$y = \tfrac{1}{4}(x - 1) + 2 = 0.25x + 1.75,$$

seems a good candidate. The points and the ‘first guess’ are depicted in Figure 7.4.3.

../_images/Fig-LeastSquares-FirstGuess.svg

Fig. 7.4.3 First guess for best line.#

For this line the sum of the squares of the errors

$$y_i - (0.25x_i + 1.75),$$

which in this context are called residues, becomes

$$\sum_{i=1}^{4} \bigl(y_i - (0.25x_i + 1.75)\bigr)^2 = 0^2 + 0.25^2 + 0.25^2 + 0^2 = 0.125.$$

To find the least squares line we consider the four equations in the form

$$\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ 1 & x_4 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix}, \quad \text{i.e.,} \quad \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 4 \\ 1 & 5 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} 2 \\ 2 \\ 3 \\ 3 \end{bmatrix}.$$

Since the matrix has linearly independent columns, the normal equations, with augmented matrix

$$[\,A^TA \,|\, A^T\mathbf{b}\,] = \left[\begin{array}{cc|c} 4 & 12 & 10 \\ 12 & 46 & 33 \end{array}\right],$$

give a unique least squares solution, and it is $\hat{a} = 1.6$, $\hat{b} = 0.3$.

Figure 7.4.4 shows both lines.

../_images/Fig-LeastSquares-LSLine.svg

Fig. 7.4.4 Least squares line.#

For the line $y = \hat{a} + \hat{b}x$ the sum of the squares of the residues becomes

$$\bigl(2 - (1.6 + 0.3\cdot 1)\bigr)^2 + \bigl(2 - (1.6 + 0.3\cdot 2)\bigr)^2 + \bigl(3 - (1.6 + 0.3\cdot 4)\bigr)^2 + \bigl(3 - (1.6 + 0.3\cdot 5)\bigr)^2 = 0.1^2 + 0.2^2 + 0.2^2 + 0.1^2 = 0.1.$$

This is indeed slightly better than with the line found ‘at first sight’, where the sum was equal to $0.125$.
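The same numbers follow from a short NumPy computation; a minimal sketch:

```python
import numpy as np

# Points of Example 7.4.7
x = np.array([1, 2, 4, 5], dtype=float)
y = np.array([2, 2, 3, 3], dtype=float)

# Design matrix with columns [1, x_i]
A = np.column_stack([np.ones_like(x), x])

a_hat, b_hat = np.linalg.lstsq(A, y, rcond=None)[0]
print(a_hat, b_hat)                      # 1.6 0.3

residues = y - (a_hat + b_hat * x)
print(np.sum(residues**2))               # 0.1
```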

We can even find a ready-made formula for the least squares line through $n$ given points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.

Example 7.4.8

The coefficients of the least squares line $y = \hat{a} + \hat{b}x$ for the set of points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ are given by

$$\hat{a} = \frac{\left(\sum x_i^2\right)\left(\sum y_i\right) - \left(\sum x_i\right)\left(\sum x_iy_i\right)}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \qquad \hat{b} = \frac{n\sum x_iy_i - \left(\sum x_i\right)\left(\sum y_i\right)}{n\sum x_i^2 - \left(\sum x_i\right)^2}.$$

Introducing the notation

$$\Sigma_x = \sum x_i, \quad \Sigma_{xx} = \sum x_i^2, \quad \Sigma_y = \sum y_i, \quad \Sigma_{xy} = \sum x_iy_i, \quad \Sigma_{yy} = \sum y_i^2,$$

the coefficients get the following more concise form,

$$\hat{a} = \frac{\Sigma_{xx}\Sigma_y - \Sigma_x\Sigma_{xy}}{n\Sigma_{xx} - \Sigma_x^2}, \qquad \hat{b} = \frac{n\Sigma_{xy} - \Sigma_x\Sigma_y}{n\Sigma_{xx} - \Sigma_x^2}. \tag{7.4.6}$$

We will derive the formula by diligently working through the normal equations.

In matrix-vector form the linear system we want to solve reads

$$\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.$$

Note that

$$\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}^T \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} = \begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix} = \begin{bmatrix} n & \Sigma_x \\ \Sigma_x & \Sigma_{xx} \end{bmatrix}$$

and

$$\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}^T \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} \sum y_i \\ \sum x_iy_i \end{bmatrix} = \begin{bmatrix} \Sigma_y \\ \Sigma_{xy} \end{bmatrix}.$$

This leads to the normal equations

$$\begin{bmatrix} n & \Sigma_x \\ \Sigma_x & \Sigma_{xx} \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} \Sigma_y \\ \Sigma_{xy} \end{bmatrix}.$$

If we multiply both sides of this equation by

$$\begin{bmatrix} n & \Sigma_x \\ \Sigma_x & \Sigma_{xx} \end{bmatrix}^{-1} = \frac{1}{n\Sigma_{xx} - \Sigma_x^2} \begin{bmatrix} \Sigma_{xx} & -\Sigma_x \\ -\Sigma_x & n \end{bmatrix},$$

we see the expressions (7.4.6) readily appearing.
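Formula (7.4.6) translates directly into code; a minimal sketch (the helper name `least_squares_line` is just for illustration):

```python
import numpy as np

def least_squares_line(x, y):
    """Coefficients (a_hat, b_hat) of y = a + b*x, computed via (7.4.6)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    Sx, Sxx = x.sum(), (x**2).sum()
    Sy, Sxy = y.sum(), (x * y).sum()
    d = n * Sxx - Sx**2              # determinant of the 2x2 normal matrix
    return (Sxx * Sy - Sx * Sxy) / d, (n * Sxy - Sx * Sy) / d

# Points of Example 7.4.7
print(least_squares_line([1, 2, 4, 5], [2, 2, 3, 3]))   # (1.6, 0.3)
```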

Fitting a line to a set of points may be looked upon as fitting a linear combination of the functions $f_0(x) = 1$ and $f_1(x) = x$. Depending on the context we may also consider fitting a linear combination of a larger set of ‘basic’ functions.
For instance, when we use $f_0(x) = 1$, $f_1(x) = x$ and $f_2(x) = x^2$, we are in fact looking for a quadratic function $y = a_0 + a_1x + a_2x^2$ that best fits the set of points.
We may even go beyond that; then we get a more general so-called linear model.

Definition 7.4.2 (Linear Model)

Suppose $k$ functions $f_1(x), \ldots, f_k(x)$ and $n$ points $(x_1, y_1), \ldots, (x_n, y_n)$ in the plane are given.

The linear model refers to the ‘best’ linear combination, in the least squares sense, of the form

$$c_1f_1(x) + c_2f_2(x) + \cdots + c_kf_k(x).$$

That is, the linear combination that minimizes

$$\sum_{i=1}^{n} \bigl(y_i - (c_1f_1(x_i) + c_2f_2(x_i) + \cdots + c_kf_k(x_i))\bigr)^2, \tag{7.4.7}$$

the sum of the squares of the errors/residues

$$y_i - (c_1f_1(x_i) + c_2f_2(x_i) + \cdots + c_kf_k(x_i)).$$

The epithet linear refers to the fact that the parameters $c_1, \ldots, c_k$ appear in the model in a linear way. The functions $f_i$ that are used in the model definitely do not have to be linear.

Remark 7.4.4

The parameters $c_1, c_2, \ldots, c_k$ that minimize the sum in (7.4.7) coincide with the least squares solution of the linear system

$$\begin{bmatrix} f_1(x_1) & f_2(x_1) & \cdots & f_k(x_1) \\ f_1(x_2) & f_2(x_2) & \cdots & f_k(x_2) \\ \vdots & \vdots & & \vdots \\ f_1(x_n) & f_2(x_n) & \cdots & f_k(x_n) \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_k \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}. \tag{7.4.8}$$
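The system (7.4.8) is easy to set up in code for any choice of basis functions; a minimal sketch (the helper name `fit_linear_model` and the sample data are just for illustration):

```python
import numpy as np

def fit_linear_model(basis_funcs, x, y):
    """Least squares coefficients for y ≈ c1*f1(x) + ... + ck*fk(x)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    A = np.column_stack([f(x) for f in basis_funcs])   # design matrix of (7.4.8)
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs

# Fit a quadratic y = c1 + c2*x + c3*x^2 to a few sample points
x = [0, 1, 2, 3, 4]
y = [1.1, 1.9, 4.8, 9.2, 16.5]
funcs = [lambda t: np.ones_like(t), lambda t: t, lambda t: t**2]
print(fit_linear_model(funcs, x, y))
```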

Remark 7.4.5

In practice the points $(x_1, y_1), \ldots, (x_n, y_n)$ are often called the data, sometimes considered to be a sample from some space. The variable $x$ is then considered as the input variable and $y$ as the output variable. Lastly, the matrix in the expression on the left of Equation (7.4.8) is sometimes referred to as the design matrix.

Several generalizations are possible. We mention two.

  1. The input may be multivariate. The set of data points then has the form

     $$\begin{array}{c} (x_{11}, x_{12}, \ldots, x_{1k}, y_1) \\ (x_{21}, x_{22}, \ldots, x_{2k}, y_2) \\ \vdots \\ (x_{n1}, x_{n2}, \ldots, x_{nk}, y_n), \end{array} \tag{7.4.9}$$

     and we want to find the linear combination

     $$\beta_1 f_1(x_1, \ldots, x_k) + \cdots + \beta_\ell f_\ell(x_1, \ldots, x_k)$$

     that best fits these data.

     For instance, to find the linear function

     $$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_kx_k$$

     that best fits the data in (7.4.9), we can take $f_0(x_1, \ldots, x_k) = 1$ and $f_i(x_1, \ldots, x_k) = x_i$, for $i = 1, \ldots, k$.

     In a least squares model we then look for the parameters $\beta_1, \ldots, \beta_\ell$ that minimize

     $$\sum_{i=1}^{n} \bigl(y_i - \beta_1 f_1(x_{i1}, \ldots, x_{ik}) - \cdots - \beta_\ell f_\ell(x_{i1}, \ldots, x_{ik})\bigr)^2. \tag{7.4.10}$$

  2. In a weighted least squares model the terms in the sum (7.4.10) get different weights $w_i$. When building a statistical model this may be desirable when some data give more ‘information’.

     Then the expression we want to minimize is given by

     $$\sum_{i=1}^{n} w_i\bigl(y_i - \beta_1 f_1(x_{i1}, \ldots, x_{ik}) - \cdots - \beta_\ell f_\ell(x_{i1}, \ldots, x_{ik})\bigr)^2.$$

     This approach may be relevant if the variance of some of the observations seems smaller than the variance of the other observations.

To illustrate matters we present two examples.

Example 7.4.9 (Fitting a plane)

Suppose $n$ points $(x_i, y_i, z_i)$, $i = 1, \ldots, n$, are given and we want to fit a plane through these. In other words, we want to find a linear combination

$$L(x, y) = a + bx + cy$$

of the functions $f_0(x, y) = 1$, $f_1(x, y) = x$ and $f_2(x, y) = y$ that fits these points best.

In the least squares sense we would then have to solve the normal equations for the linear system

$$\begin{cases} a + bx_1 + cy_1 = z_1 \\ a + bx_2 + cy_2 = z_2 \\ \quad\vdots \\ a + bx_n + cy_n = z_n, \end{cases} \quad \text{i.e.,} \quad \begin{bmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & y_n \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \end{bmatrix}.$$
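A minimal sketch of such a plane fit (the data points are made up for illustration):

```python
import numpy as np

# Fit z = a + b*x + c*y to a handful of points
x = np.array([0.0, 1.0, 2.0, 1.0, 3.0])
y = np.array([0.0, 1.0, 0.0, 2.0, 1.0])
z = np.array([1.0, 3.1, 2.9, 4.2, 5.0])

# Design matrix with columns 1, x, y
A = np.column_stack([np.ones_like(x), x, y])

a, b, c = np.linalg.lstsq(A, z, rcond=None)[0]
print(a, b, c)
print(np.linalg.norm(A @ np.array([a, b, c]) - z))   # least squares error
```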

Example 7.4.10

Suppose we have $n$ data points $(x_i, y_i)$, $i = 1, \ldots, n$, and, taking physical conditions into account, we expect a relation of the form $y = ax^r$ between them.

One way to go about finding suitable parameters $a$ and $r$ is to transform both $x$ and $y$ to log-scale by introducing the new variable

$$z = \log(y) = \ln(y).$$

The relation between $x$ and $y$ then transforms to

$$z = \ln(y) = \ln(a) + r\ln(x).$$

So we can find the least squares linear fit

$$z = \alpha + \beta\ln(x)$$

for the points

$$(\ln(x_1), \ln(y_1)), \ldots, (\ln(x_n), \ln(y_n)).$$

If we have found the least squares parameters $\hat{\alpha}$ and $\hat{\beta}$ for the transformed problem, at the end we have to ‘transform back’ to find the ‘best power fit’:

$$\ln(y) = \hat{\alpha} + \hat{\beta}\ln(x) \quad \Longrightarrow \quad y = ax^r, \quad a = e^{\hat{\alpha}}, \quad r = \hat{\beta}.$$

Figure 7.4.5 shows what’s going on. The first plot shows the points (xi,yi), the second plot shows the points (ln(xi),ln(yi)), and the last two plots give the points with the two fits.

../_images/Fig-LeastSquares-PowerFit.svg

Fig. 7.4.5 Least squares fitting via logarithmic scale.#
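A minimal sketch of this log-log fitting procedure (the data are made up for illustration):

```python
import numpy as np

# Data that roughly follow y = 2 * x**1.5
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 5.5, 10.2, 16.3, 22.0])

# Least squares line ln(y) = alpha + beta * ln(x)
A = np.column_stack([np.ones_like(x), np.log(x)])
alpha, beta = np.linalg.lstsq(A, np.log(y), rcond=None)[0]

# Transform back: y = a * x**r with a = e^alpha, r = beta
a, r = np.exp(alpha), beta
print(a, r)    # close to 2 and 1.5
```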

7.4.6. Grasple Exercises (for Linear Models)#

Grasple Exercise 7.4.14

https://embed.grasple.com/exercises/6a2b9aeb-3c59-4b8f-8d7c-e51f651998fd?id=91883

On the precise definition of the least squares line.

Grasple Exercise 7.4.15

https://embed.grasple.com/exercises/254298e0-f091-4a9d-8e8a-31cc3cf11f16?id=91884

Design matrix and input vector for the regression line through a set of points.

Grasple Exercise 7.4.16

https://embed.grasple.com/exercises/dad8ca0a-ef17-4757-803c-26b8ae9804de?id=91886

Fitting a line through a set of points and computing the residual vector.

Grasple Exercise 7.4.17

https://embed.grasple.com/exercises/9ad7148d-ef39-46ef-8f14-20e97511655d?id=91889

Fitting a parabola to a set of points.

Grasple Exercise 7.4.18

https://embed.grasple.com/exercises/222ee704-85f5-470a-b48f-f88d9900a8d1?id=91890

Fitting a quadratic polynomial to a set of points.

Grasple Exercise 7.4.19

https://embed.grasple.com/exercises/ff6329bd-f5b3-41ce-828f-2086cf651181?id=91898

Design matrix to fit y=ax+bx3 to a set of points.

Grasple Exercise 7.4.20

https://embed.grasple.com/exercises/f75014bf-0e90-43e7-acf4-216cb38ffd11?id=91903

Design matrix to fit y=c1et+c2cos(x)+c3sin(x) to a set of points.