Working with a basis that is orthogonal but not orthonormal (i.e. not normalizing vectors to length 1) is often convenient, since some other normalization might be convenient in some applications (e.g. in hand calculations square roots are annoying). In this problem, you will see what happens if you don't normalize an orthogonal basis to unit length.
Suppose that we have a basis $v_1, v_2, \ldots, v_n \in \mathbb{R}^m$ for some subspace $S \subseteq \mathbb{R}^m$, which are orthogonal but not normalized to unit lengths. That is: $$ v_i^T v_j = \begin{cases} 0 & \mbox{if } i \ne j \\ s_i & \mbox{otherwise} \end{cases} $$ where $v_i^T v_i = \Vert v_i \Vert^2 = s_i > 0$ are the lengths² of the vectors.
If $V = \begin{pmatrix} v_1 & v_2 & \cdots & v_n \end{pmatrix}$ is the $m\times n$ matrix whose columsn are the basis vectors $v_i$, what is $V^T V$?
In terms of $V$ and/or the vectors $v_i$, write out the projection matrix $P$ onto $S$ (a) as product of matrices and (b) as a sum of rank-1 matrices.
Why is computing the projection $Pb$ of a vector $b$ much cheaper than computing $A(A^T A)^{-1}A^T b$ for an arbitray $m \times n$ full-column-rank matrix $A$? (Be quantitative. The latter requires $\sim mn + n^3$ operations; what does $Pb$ require using your expressions from the previous part?)
By definition of matrix multiplication, $V^TV$ is a $n \times n$ matrix whose row-$i$ column-$j$ entry is $v_i^Tv_j$. This is 0 if $i \neq j$ and $s_i$ if $i = j$. So $V^TV$ is the diagonal square $n \times n$ matrix with $i^{th}$ diagonal entry $s_i$: $$\boxed{V^TV = \begin{pmatrix} s_1 & & & \\ & s_2 & & \\ & & \ddots & \\ & & & s_n \end{pmatrix}}$$
(a) The matrix $V$ has columns forming a basis for $C(V) = S$, so our general formula for projections onto columns spaces applies and says that $P = V(V^TV)^{-1}V^T$. Note that the middle factor $(V^TV)^{-1}$ is actually easy to compute: it's just the $n \times n$ diagonal matrix with $i^{th}$ diagonal entry $s_i^{-1}$. (b) Very similar to class, the orthogonality of the basis means that you can just sum the rank-1 projections $v_iv_i^T/v_iv_i^T = s_i^{-1}v_iv_i^T$ (which doesn't work for a non-orthogonal basis!). There are several ways to see this. The simplest is perhaps to see that $Px = V(V^TV)^{-1}V^Tx$ first computes the dot products $v_i^T x$ (in $V^T x$), then divides each term by $s_i$ (from $(V^TV)^{-1}$), then sums the resulting coefficients times $v_i$ (in multiplying $V$), giving $Px = v_1 s_1^{-1}v_1^Tx + v_2 s_2^{-1}v_2^Tx + \cdots$, which is a sum of rank-1 projection matrices times $x$. (You can also derive this from the "columns times rows" perspective on matrix multiplication.) In summary, $$\boxed{P = V(V^TV)^{-1}V^T = V \begin{pmatrix} s_1^{-1} & & & \\ & s_2^{-1} & & \\ & & \ddots & \\ & & & s_n^{-1} \end{pmatrix} V^T} = s_1^{-1}v_1v_1^T + \cdots + s_n^{-1}v_nv_n^T .$$
Using the formula $P = s_1^{-1}v_1v_1^T + \cdots + s_n^{-1}v_nv_n^T$ is definitely faster because we can avoid inverting $A^TA$, which takes $\sim n^3$ operations. Specifically, bracketing right to left rather than right to left, just like on exam 1, we can write $$Pb = s_1^{-1}v_1(v_1^Tb) + \cdots + s_n^{-1}v_n(v_n^Tb).$$ To compute each $s_i^{-1}v_i(v_i^Tb)$ takes only $\sim m$ operations; it just involves a dot products and some scalar multiplications of vectors in $\mathbb{R}^m$. We have to do this $n$ times, so that's $\sim mn$ operations. We have to add the $n$ resulting vectors in $\mathbb{R}^m$, but that takes only $\sim mn$ operations as well. So we can perform the projection in $\boxed{\sim mn}$ operations instead of $\sim mn + n^3$ for a general $A$..
Often, in least-square fitting, one wants to weight each data point inversely with the (given) error estimates (variances) $\sigma_i^2 > 0$. Data points with more error should count for less in the fit.
That is, if we are trying to fit $y = Ax$ to "observations" $b \in \mathbb{R}^m$ with fit parameters $x \in \mathbb{R}^n$, then we try to minimize: $$ E = \sum_{i=1}^m \frac{(y_i - b_i)^2}{\sigma_i^2} $$
It turns out that this process corresponds to using a modified dot product "$\cdot_\sigma$" $$ y \cdot_\sigma z = \sum_{i=1}^m \frac{y_i z_i}{\sigma_i^2} = y^T W z $$ where $W$ is the $m \times m$ diagonal matrix $$ W = \begin{pmatrix} \sigma_1^{-2} & & & & \\ & \sigma_2^{-2} & & & \\ & & \sigma_3^{-2} & & \\ & & & \ddots & \\ & & & & \sigma_m ^{-2} \end{pmatrix} $$ We can also define the modified length $\Vert y \Vert_\sigma = \sqrt{y \cdot_\sigma y} = \sqrt{y^T W y}$.
Write the error $E$ from above as an expression in terms of $Ax$, $b$, and $\cdot_\sigma$ or $\Vert \cdots \Vert_\sigma$.
What is the projection matrix $P_\sigma$ that projects $b$ into a vector $p \in C(A)$ so that the "error" $b - p$ is orthogonal to $C(A)$ under our modified dot product? i.e. what is the new form of orthogonal projection?
Derive an equation, analogous (but not identical!) to the normal equations $A^TA\hat{x}=A^Tb$, for the $\hat{x}$ that minimizes $E$. (See e.g. the derivation of least-squares in class or in the book; the "by algebra" derivation in section 4.3 is particularly short.)
Outline the steps of a Gram-Schmidt process that converts a sequence of vectors $a_1,a_2,\ldots$ into vectors $q_1,q_2,\ldots$ that are orthonormal under our modified dot product. i.e. it converts a matrix $A$ into a matrix $Q$ with $Q^T W Q = I$ and $C(A)=C(Q)$.
Q'*W*Q
is nearly I).# our four vectors a1,a2,a3,a4 and our W matrix
A = [ -1 1 -5 -4
4 1 -6 -2
2 -3 -7 1
0 1 -6 -8
-7 1 0 5
-5 8 -5 -5 ]
a1 = A[:,1]
a2 = A[:,2]
a3 = A[:,3]
a4 = A[:,4]
W = diagm([1,2,3,4,5,6])
6×6 Array{Int64,2}: 1 0 0 0 0 0 0 2 0 0 0 0 0 0 3 0 0 0 0 0 0 4 0 0 0 0 0 0 5 0 0 0 0 0 0 6
# functions for our modified dot product and norm
mydot(x,y) = x'*W*y
mynorm(x) = sqrt(mydot(x,x))
mynorm (generic function with 1 method)
Straight from the definition of $\cdot_\sigma$ and $E$, we see $$E = \sum_{i=1}^m \frac{(y_i - b_i)^2}{\sigma_i^2} = (y - b)\cdot_\sigma (y - b) = \Vert y - b\Vert^2_\sigma = \boxed{\Vert Ax - b\Vert^2_\sigma}.$$
We approach projection of a vector $b$ exactly as in class, just changing the dot product: the derivation is almost the same! We want a projection $p \in C(A)$, so we must have $p = A\hat{x}$ for some $\hat{x}$. We want the error $e = b - p$ to be orthogonal to $C(A)$ in the new dot product, i.e. we want $A^T W e = 0$: this means that $a^T W e = a \cdot_\sigma e = 0$ for each column $a$ of $A$. Hence $A^T W (b - A\hat{x}) = 0$, giving us "normal equations" $A^T W A \hat{x} = A^T W b$ for $\hat{x} = (A^T W A)^{-1} A^T W b$. So, $p = A\hat{x} = \left[ A (A^T W A)^{-1} A^T W \right] b$, and the expression in square brackets $[\cdots]$ is our projection matrix $$\boxed{ P_\sigma = A (A^T W A)^{-1} A^T W }$$.
Exactly as in class and in the book, the orthogonal projection (in the new dot product) must be the closest point (in the new dot product), and we can just quote our solution from above for $\hat{x}$: $\boxed{A^T W A \hat{x} = A^T W b}$
q_1 = a_1 / \Vert a_1 \Vert_\sigma \ q_2 = \frac{a_2 - q_1 q_1^T W a_2}{\Vert \cdots \Vert_\sigma} \ q_3 = \frac{a_3 - q_1 q_1^T W a_3 - q_2 q_2^T W a_3}{\Vert \cdots \Vert_\sigma} \ q_4 = \frac{a_4 - q_1 q_1^T W a_4 - q_2 q_2^T W a_4 - q_3 q_3^T W a_4}{\Vert \cdots \Vert_\sigma} . $$
q4
.q1 = a1 / mynorm(a1) # here's the first q vector for you
q2 = a2 - q1*mydot(q1,a2)
q2 = q2 / mynorm(q2)
q3 = a3 - q1*mydot(q1,a3) - q2*mydot(q2,a3);
q3 = q3 / mynorm(q3);
q4 = a4 - q1*mydot(q1,a4) - q2*mydot(q2,a4) - q3*mydot(q3,a4);
q4 = q4 / mynorm(q4)
6-element Array{Float64,1}: -0.14749 0.337808 0.204478 -0.381611 0.0722411 0.0516127
Check that $Q^T W Q \approx I$:
Q = [q1 q2 q3 q4] # put your four vectors into a matrix
Q' * W * Q
4×4 Array{Float64,2}: 1.0 5.55112e-17 6.93889e-17 1.94289e-16 5.55112e-17 1.0 0.0 3.46945e-16 9.71445e-17 -1.38778e-17 1.0 2.77556e-17 1.94289e-16 3.60822e-16 2.77556e-17 1.0
Hooray, it worked! (The tiny off-diagonal values $< 10^{-15}$ are just roundoff errors, as we've seen many times in class.)
Suppose that we reverse the order of the columns of $A$, do Gram-Schmidt, and then reverse the order of the resulting basis to get back a matrix $Q$. Then $A = QS$ for some matrix $S$ — what is the expected pattern of nonzero entries in $S$, and why? (i.e. is $S$ upper-triangular or ...?)
The following Julia code tries this process for a random matrix $A$ to get $Q$. Give a formula for $S = \cdots$ and compute it in Julia to check your answer (or to give you a hint about what to look for):
# a randomly chosen 10×4 matrix with small integer entries
A = [ 6 -5 0 6
9 0 -8 6
4 -8 -7 0
-9 6 1 -1
7 3 -3 -4
1 4 4 6
8 4 0 9
7 0 -5 -5
-2 8 -8 -3
-1 7 -9 -1]
10×4 Array{Int64,2}: 6 -5 0 6 9 0 -8 6 4 -8 -7 0 -9 6 1 -1 7 3 -3 -4 1 4 4 6 8 4 0 9 7 0 -5 -5 -2 8 -8 -3 -1 7 -9 -1
Arev = flipdim(A,2) # A with the columns in reverse order
10×4 Array{Int64,2}: 6 0 -5 6 6 -8 0 9 0 -7 -8 4 -1 1 6 -9 -4 -3 3 7 6 4 4 1 9 0 4 8 -5 -5 0 7 -3 -8 8 -2 -1 -9 7 -1
Qrev = qr(Arev)[1] # the Q from QR factorizing Arev (equivalent to Gram–Schmidt)
Q = flipdim(Qrev,2) # reverse the columns to go back in the same order as A
10×4 Array{Float64,2}: 0.0430779 -0.289121 -0.0646181 -0.386494 0.051702 -0.0720731 -0.526039 -0.386494 -0.226465 -0.566517 -0.403743 0.0 -0.393246 0.375077 0.0684473 0.0644157 0.614837 0.139154 -0.129954 0.257663 0.0988272 0.305304 0.166092 -0.386494 0.301757 0.269268 -0.0969271 -0.579741 0.491389 -0.0691215 -0.23454 0.322078 -0.141188 0.392171 -0.429112 0.193247 -0.222028 0.325475 -0.508329 0.0644157
Reversing the order of the columns of a $m \times n$ matrix is achieved by right multiplication by the $n \times n$ matrix $J = [e_n | \cdots | e_1]$, i.e. the identity matrix with its columns in reverse order. So the matrix produced from $A$ by first reversing columns, then doing Gram-Schmidt, and then reversing columns again is given by $AJRJ$, with $R$ invertible and upper-triangular. So $S = JRJ$. For any upper-triangular matrix $U$, the matrix $JUJ$ is lower-triangular, because right-multiplication by $J$ reverses the order of columns and left multiplication reverses the order of rows, and if we start with an upper-triangular matrix $U$, reverse the order of its rows to form $JU$, and then reverse the order of columns to form $JUJ$, the result is manifestly lower-triangular. So $S$ is $\boxed{\text{lower-triangular}}$.
Alternatively, one can see that the whole process corresponds to doing Gram-Schmidt "from the back" of the matrix, i.e. first normalizing the last column, then adding some of the last column to the second to last, normalizing the second to last, ..., and each of these operations is given by right multiplication by a lower-triangular matrix (so the whole procedure, $S$, is as well).
Since the columns of $Q$ are orthonormal, we have $Q^TQ = I$. So $A = QS \implies \boxed{S = Q^T A}$. (This exact expression was derived in class for $A=QR$; nothing has changed.)
The Julia code is run below (notice that the entries above the diagonal are 0, up to rounding error).
S = Q'*A
4×4 Array{Float64,2}: 14.118 1.33227e-15 -4.44089e-16 -1.11022e-15 -6.18529 16.3597 8.88178e-16 -4.44089e-16 -9.14729 -3.14068 17.3377 -1.11022e-16 -7.7943 1.2239 -2.8987 -15.5242
(From Strang, section 4.4, problem 18.)
Find orthogonal vectors $q_1, q_2, q_3$ by Gram-Schmidt from $a_1, a_2, a_3$ given by: $$ a_1 = \begin{pmatrix} 1 \\ -1 \\ 0 \\ 0 \end{pmatrix}, \; a_2 = \begin{pmatrix} 0 \\ 1 \\ -1 \\ 0 \end{pmatrix}, \; a_3 = \begin{pmatrix} 0 \\ 0 \\ 1 \\ -1 \end{pmatrix}, $$ which are a basis for the vectors perpendicular to $d = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}$. If you form the $4\times 3$ matrix $Q = \begin{pmatrix} q_1 & q_2 & q_3 \end{pmatrix}$, then what is $Q^T d$?
We'll just do Gram-Schmidt without the normalizing, since the problem only asked for an orthogonal basis, and not orthonormal. (The notation is a bit misleading here, since usually "q" means an orthonormal basis. But if you found an orthonormal basis, that is fine too, since any orthonormal basis is of course orthogonal!)
We have $q_1 = a_1$. To find $q_2$, we need to subtract the projection of $a_2$ onto $a_1$. So, $$q_2 = a_2 - (q_1^Tq_1)^{-1}q_1q_1^Ta_2 = a_2 + (1/2)a_1 = \begin{pmatrix}1/2 \\ 1/2 \\ -1 \\ 0\end{pmatrix}.$$ We need to do the same for $a_3$, and we compute $$q_3 = a_3 - (q_1^Tq_1)^{-1}q_1q_1^Ta_3 - (q_2^Tq_2)^{-1}q_2q_2^Ta_3 = a_3 + (2/3)q_2 = \begin{pmatrix}1/3 \\ 1/3 \\ 1/3 \\ -1\end{pmatrix}.$$ In summary, $$\boxed{q_1 = \begin{pmatrix}1 \\ -1 \\ 0 \\ 0\end{pmatrix}, q_2 = \begin{pmatrix}1/2 \\ 1/2 \\ -1 \\ 0\end{pmatrix}, q_3 = \begin{pmatrix}1/3 \\ 1/3 \\ 1/3 \\ -1\end{pmatrix}}$$ By inspection, we can see they are each orthogonal to the rest and also to $d$, so we did it right.
Also the matrix $Q^Td$ will be the column vector recording the dot products of $q_1, q_2, q_3$ with $d$, and you were told that $d$ is orthogonal to $a_1,a_2,a_3$ and hence it must be orthogonal to $q_1,q_2,q_3$ as well (since they span the same subspace)! So, without any calculation, you should be able to see that $\boxed{Q^Td = 0}$.
As a check, let's do the calculation in Julia with the qr
function:
A = [ 1 0 0
-1 1 0
0 -1 1
0 0 -1]
Q, R = qr(A)
Q
4×3 Array{Float64,2}: -0.707107 -0.408248 -0.288675 0.707107 -0.408248 -0.288675 0.0 0.816497 -0.288675 0.0 0.0 0.866025
This is the $Q$ matrix with orthonormal columns, what we would have gotten if we had normalized the vectors. Let's rescale them to have the same lengths as above by dividing by the entries that are supposed to be $-1$:
Q = Q ./ [-Q[2,1] -Q[3,2] -Q[4,3]]
4×3 Array{Float64,2}: 1.0 0.5 0.333333 -1.0 0.5 0.333333 -0.0 -1.0 0.333333 -0.0 -0.0 -1.0
Hooray, this matches our answer! And we can also check $Q^T d$:
Q' * [1,1,1,1]
3-element Array{Float64,1}: -3.33067e-16 -1.11022e-16 -2.22045e-16
This is zero, as expected.
(From Strang, section 4.4, problem 37.)
We know that $P = Q^T Q$ is the projection onto $C(Q)$, where $Q$ is an $m \times n$ matrix with orthonormal columns. Now add another column $a$ to produce $A = \begin{pmatrix} Q & a \end{pmatrix}$. Gram-Schmidt on $A$ replaces $a$ by what vector $q$? (Give a formula in terms of $a$ and $Q$ and/or $P$.)
By definition, Gram–Schmidt will replace $a$ by $a - Pa$ and then normalize it: the preceding columns of $A$ are already orthonormal, so Gram–Schmidt will do nothing to them, whereas for the last column it will subtract the projections of the previous columns. As $P = QQ^T$, this could also be written $a - QQ^Ta$. After normalizing, this gives: $$\boxed{\frac{a - QQ^Ta}{\Vert a - QQ^Ta\Vert}}$$