The Least Squares Problem
The least squares problem is one of the most important problems in numerical approximation. The idea is to choose the unknown parameters so that the sum of the squared differences between the predicted values and the observed values (the errors, or residuals) is minimized. In short, it can be viewed as solving an overdetermined system \(Ax=b\) as well as possible. In linear algebra there are three common ways to solve the least squares problem; they are introduced and compared below.
The normal equations
For an overdetermined system (\(m \times n\), with \(m>n\)) \(Ax=b\), such as: \[ \left(\begin{array}{ll} 1 & 4 \\ 1 & 2 \\ 1 & 3 \end{array}\right)\left(\begin{array}{l} x_{1} \\ x_{2} \end{array}\right)=\left(\begin{array}{l} 3 \\ 3 \\ 2 \end{array}\right) \Leftrightarrow\left(\begin{array}{l} 1 \\ 1 \\ 1 \end{array}\right) x_{1}+\left(\begin{array}{l} 4 \\ 2 \\ 3 \end{array}\right) x_{2}=\left(\begin{array}{l} 3 \\ 3 \\ 2 \end{array}\right) \] we cannot solve \(a_{1}x_{1}+a_{2}x_{2}=b\) exactly, because \(b\) cannot be reached by a linear combination of \(a_1\) and \(a_2\). An exact solution exists only when \(b \in C(A)\), which is very unlikely when \(m>n\).
Therefore, we look for the vector \(p\) closest to \(b\) in the plane spanned by \(a_1\) and \(a_2\), so that the error \(e=b-p\) is as small as possible, which happens exactly when \(e\) is orthogonal to \(a_1\) and \(a_2\), i.e. to \(C(A)\).
We then look for \(\hat{x}\) such that \(p=a_{1}\hat{x}_{1}+a_{2}\hat{x}_{2}\), where \(p\) is the projection of \(b\) onto \(C(A)\). \[ \begin{aligned} &p=a_{1} \hat{x}_{1}+a_{2} \hat{x}_{2}=A \hat{x} \\ &e=b-p=b-A \hat{x} \\ \end{aligned} \] Orthogonality yields \[ \begin{aligned} &\left\{\begin{array}{l} a_{1} \perp e \Leftrightarrow a_{1}^{T} e=0 \\ a_{2} \perp e \Leftrightarrow a_{2}^{T} e=0 \end{array} \Leftrightarrow A^{T} e=0 \Leftrightarrow A^{T}(b-A \hat{x})=0\right. \end{aligned} \]
This gives the normal equations \(A^{T}A\hat{x}=A^{T}b\). Solving them requires \(A^{T}A\) to be non-singular; it can be shown that \(\operatorname{rank}(A)=\operatorname{rank}(A^{T}A)\), so \(A\) must have linearly independent columns. The normal equations can be solved with a Cholesky decomposition (\(A^{T}A\) is then symmetric positive definite), but \(A^{T}A\) is often ill-conditioned.
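For the \(3 \times 2\) example above, forming and solving the normal equations by hand gives \[ A^{T} A=\left(\begin{array}{cc} 3 & 9 \\ 9 & 29 \end{array}\right), \quad A^{T} b=\left(\begin{array}{c} 8 \\ 24 \end{array}\right) \quad \Rightarrow \quad \hat{x}=\left(\begin{array}{c} 8 / 3 \\ 0 \end{array}\right), \quad p=A \hat{x}=\left(\begin{array}{c} 8 / 3 \\ 8 / 3 \\ 8 / 3 \end{array}\right), \quad e=b-p=\left(\begin{array}{c} 1 / 3 \\ 1 / 3 \\ -2 / 3 \end{array}\right), \] and one can verify that \(A^{T} e=0\).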
The normal equations require fewer operations than the other methods, but they fail when \(A\) is rank-deficient, and the condition number of \(A^{T}A\) is the square of that of \(A\), so the computed solution can lose accuracy.
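A minimal NumPy/SciPy sketch of this approach, applied to the example system above (the variable names and the use of scipy.linalg.cho_factor are my choices, not from the original post):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Overdetermined example from above: 3 equations, 2 unknowns.
A = np.array([[1.0, 4.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([3.0, 3.0, 2.0])

# Form the normal equations A^T A x = A^T b.
AtA = A.T @ A
Atb = A.T @ b

# A^T A is symmetric positive definite (A has full column rank),
# so it admits a Cholesky factorization.
c, low = cho_factor(AtA)
x_hat = cho_solve((c, low), Atb)

print(x_hat)                # approximately [2.6667, 0.0], i.e. (8/3, 0)
print(np.linalg.cond(AtA))  # cond(A^T A) = cond(A)^2, the numerical drawback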
QR decomposition
QR decomposition \(A=QR\) is another way to solve the least squares problem. The derivation is shown below. \[ \begin{gathered} A^{T} A x=A^{T} b \Rightarrow(Q R)^{T} Q R x=(Q R)^{T} b \Rightarrow \\ R^{T} Q^{T} Q R x=R^{T} Q^{T} b \Rightarrow R^{T} R x=R^{T} Q^{T} b \Rightarrow \\ R x=Q^{T} b \end{gathered} \] Here we used \(Q^{T}Q=I\) and cancelled \(R^{T}\), which is invertible when \(A\) has full column rank. Solving \(R x=Q^{T} b\) with backward substitution gives the least squares solution.
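A short sketch of the QR route on the same example matrix (again, names are illustrative, not from the post):

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[1.0, 4.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([3.0, 3.0, 2.0])

# Reduced (economy) QR: Q is 3x2 with orthonormal columns, R is 2x2 upper triangular.
Q, R = np.linalg.qr(A)

# Solve R x = Q^T b by backward substitution.
x_hat = solve_triangular(R, Q.T @ b)

print(x_hat)  # same least squares solution as before
```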
The merit of QR decomposition is that it works with \(A\) directly, so it avoids squaring the condition number the way the normal equations do. However, computing the QR decomposition is more expensive, and \(A\) still has to have full column rank so that \(R\) is invertible; otherwise \(Rx=Q^{T}b\) cannot be solved by backward substitution.
Pseudo-inverse
When \(A\) is a full-rank square matrix, \(Ax=b\) has the solution \(x=A^{-1}b\). But when \(A\) is not a full-rank square matrix, the equation may have no exact solution (or no unique one). We therefore look for the solution \(x^{\prime}=\arg \min_x \|A x-b\|=A^{+} b\), where \(A^{+}\) is the pseudo-inverse of \(A\).
Let the SVD of \(A\) be \[ A=U\left(\begin{array}{ll} S & 0 \\ 0 & 0 \end{array}\right) V^{T}, \] where \(U, V\) are both orthogonal matrices, and \(S\) is a diagonal matrix containing the (positive) singular values of \(A\) on its diagonal. Then the pseudo-inverse of \(A\) is the \(n \times m\) matrix defined as \[ A^{+}=V\left(\begin{array}{cc} S^{-1} & 0 \\ 0 & 0 \end{array}\right) U^{T} . \] Note that \(A^{+}\) has the same dimension as the transpose of \(A\).
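As an illustration, a short sketch of building \(A^{+}\) from the SVD and checking it against NumPy's built-in pinv (assuming the same example matrix as before):

```python
import numpy as np

A = np.array([[1.0, 4.0],
              [1.0, 2.0],
              [1.0, 3.0]])

# Reduced SVD: U is 3x2, s holds the singular values, Vt is 2x2.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Invert only the nonzero singular values; a tolerance guards against
# rank deficiency, where tiny singular values are treated as zero.
tol = max(A.shape) * np.finfo(float).eps * s.max()
s_inv = np.where(s > tol, 1.0 / s, 0.0)

A_pinv = Vt.T @ np.diag(s_inv) @ U.T   # A^+ = V S^{-1} U^T, an n x m matrix

print(np.allclose(A_pinv, np.linalg.pinv(A)))  # True
```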
If \(A\) is square and invertible, then \(A^{+}=A^{-1}\).
If \(A\) has full column rank, meaning \(\operatorname{rank}(A)=n \leq m\), so that \(A^{\top} A\) is non-singular, then \(A^{+}\) is a left inverse of \(A\), in the sense that \(A^{+} A=I_{n}\). We have the closed-form expression \(A^{+}=\left(A^{\top} A\right)^{-1} A^{\top}\).
If \(A\) has full row rank, meaning \(\operatorname{rank}(A)=m \leq n\), so that \(A A^{\top}\) is non-singular, then \(A^{+}\) is a right inverse of \(A\), in the sense that \(A A^{+}=I_{m}\). We have the closed-form expression \(A^{+}=A^{\top}\left(A A^{\top}\right)^{-1}\).
The solution to the least-squares problem \(\min _{x}\|A x-b\|_{2}\) with minimum norm is \(x^{*}=A^{+} b\).
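For example, applying \(x^{*}=A^{+}b\) to the overdetermined system from the beginning of the post reproduces the least squares solution found earlier; a quick check against NumPy's lstsq routine (a sketch, with illustrative names):

```python
import numpy as np

A = np.array([[1.0, 4.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([3.0, 3.0, 2.0])

# Minimum-norm least squares solution via the pseudo-inverse.
x_star = np.linalg.pinv(A) @ b

# Reference solution from NumPy's least squares routine (also SVD-based).
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x_star, x_ref)  # both approximately [2.6667, 0.0]
```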
The pseudo-inverse works in every case, even when \(A\) is rank-deficient, but it is also the most expensive approach because of the singular value decomposition it requires.