# Necessary to run this if you intend to play with and run the code below
import numpy as np
import matplotlib.pyplot as plt
plt.ion()
from mpl_toolkits.mplot3d import Axes3D
import matplotlib
matplotlib.rcParams['figure.dpi'] = 150
Author: Brady Anderson
b.anderson@byu.edu
Linear Regression, or the Least Squares problem, manifests itself quite often in engineering work. First, an example:
A physical (real-world) engineering process needs to be modeled. Having access to the physical system, it is possible to apply known inputs to the process "plant" and measure the corresponding outputs (assuming some type of output measurement is possible with a sensor). Linear regression allows the plant parameters to be deduced from a collected set of inputs and outputs. These plant parameters become your model, whether for simulation or for an estimator.
The idea is to find linear coefficients for a set of linearly independent vectors such that the combination of these coefficients and vectors (the estimated data) comes as close as possible to a true dataset.
Linear Regression takes advantage of the Projection Theorem and the induced norms on $l_2$ and $L_2$ in order to compute the aforementioned parameters (or whatever the solution may be for your use case). The key fact to note is that, at the minimizing solution, the error between the true data and the estimate is orthogonal to the subspace spanned by the data vectors, and hence to the estimate itself.
Give a detailed discussion (i.e., equations galore) of the topic. The emphasis here is clarity for future students learning the topic.
First, a review of the Projection Theorem (Theorem 2.9 in Moon):

Let $S$ be a Hilbert space and let $V$ be a closed subspace of $S$. For any vector $x \in S$, there exists a unique vector $v_0 \in V$ closest to $x$; that is, $\|x - v_0\| \le \|x - v\|$ for all $v \in V$. Furthermore, the point $v_0$ is the minimizer of $\|x - v\|$ if and only if $x - v_0 \in V^\perp$.
If you are unfamiliar or uncomfortable with this definition, please see the section on projection operators. Also, section 2.14 of the book covers the theorem well.
Say you have a normed linear vector space $(S, \|\cdot\|)$ and a matrix of linearly independent vectors $T = [p_1, p_2, \ldots, p_m]$, with $V = \mathrm{span}(T) \subseteq S$. In linear regression, we seek the coefficients $c_i$ such that

$$\hat{x} = T\hat{c}$$

subject to the constraint that $\hat{c} = \arg\min_c \|x - Tc\|$.

This minimizes the total error between the true vector $x$ and the estimate $\hat{x}$:

$$\|e\| = \|x - \hat{x}\|.$$
When the $l_2$ or $L_2$ norm is used, it is geometrically observable that the error is minimized when the error is orthogonal to $V$.
Now, if $x \in V$, it is possible to have $\|e\| = 0$. In general, however (especially in engineering), this is not the case.
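As a quick numerical sanity check (this sketch is an addition and not part of the original notes), the snippet below finds the minimizing coefficients with np.linalg.lstsq and verifies that perturbing them can only increase the error norm:

# Added sketch: the least-squares coefficients give the smallest error norm;
# any perturbed coefficients can only do worse.
import numpy as np

rng = np.random.default_rng(0)
T = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])            # two linearly independent columns
x = np.array([1.0, 2.0, 2.0, 4.0])    # x does not lie exactly in span(T)

c_star, *_ = np.linalg.lstsq(T, x, rcond=None)   # least-squares coefficients
err_star = np.linalg.norm(x - T @ c_star)

for _ in range(5):
    c_pert = c_star + 0.1 * rng.standard_normal(2)
    assert np.linalg.norm(x - T @ c_pert) >= err_star
print(f'minimum error norm: {err_star:.4f}')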
When we use the induced norm for $\|e\|$, we can express the minimization in terms of the orthogonality condition given by the Projection Theorem: the minimum-norm error must be orthogonal to each vector $p_j$,

$$\left\langle x - \sum_{i=1}^{m} c_i p_i,\; p_j \right\rangle = 0, \qquad j = 1, 2, \ldots, m.$$

We can write these as what are known as the normal equations, with $m$ equations in $m$ unknowns:
$$\begin{bmatrix} \langle p_1, p_1 \rangle & \cdots & \langle p_m, p_1 \rangle \\ \vdots & \ddots & \vdots \\ \langle p_1, p_m \rangle & \cdots & \langle p_m, p_m \rangle \end{bmatrix} \begin{bmatrix} c_1 \\ \vdots \\ c_m \end{bmatrix} = \begin{bmatrix} \langle x, p_1 \rangle \\ \vdots \\ \langle x, p_m \rangle \end{bmatrix}$$

The left-hand matrix is known as the Grammian of the set of vectors composing the matrix $T$. It is usually denoted by the letter $R$:

$$Rc = p_x,$$

where $p_x$ is the cross-correlation vector.
Due to the properties of the inner product, and since $R_{ij} = \langle p_j, p_i \rangle$, the Grammian matrix is Hermitian symmetric: $R^H = R$. (The Hermitian, denoted by the $H$ superscript, means the conjugate transpose of a matrix.)
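As a quick check of this symmetry (an added sketch, not from the text), we can build a small Grammian from random complex vectors using the convention $\langle u, v \rangle = v^H u$ and confirm that $R^H = R$:

# Added sketch: verify that a Grammian built from complex vectors satisfies R^H = R.
import numpy as np

rng = np.random.default_rng(1)
p = [rng.standard_normal(4) + 1j * rng.standard_normal(4) for _ in range(3)]

# R_ij = <p_j, p_i> = p_i^H p_j ; np.vdot conjugates its first argument
R = np.array([[np.vdot(p[i], p[j]) for j in range(3)] for i in range(3)])
print(np.allclose(R, R.conj().T))   # True: the Grammian is Hermitian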
To solve the normal equations, $R$ must be invertible. Recall that positive-definite matrices are always invertible. This leads to Theorem 3.1:

A Grammian matrix $R$ is always positive semidefinite (that is, $x^H R x \ge 0$ for all $x \in \mathbb{C}^m$). It is positive definite if and only if the vectors $p_1, \ldots, p_m$ are linearly independent.
Proof:
Let $y = [y_1, \ldots, y_m]^T$ be an arbitrary vector. Then

$$y^H R y = \sum_{i=1}^{m} \sum_{j=1}^{m} \bar{y}_i y_j \langle p_j, p_i \rangle = \left\langle \sum_{j=1}^{m} y_j p_j,\; \sum_{i=1}^{m} y_i p_i \right\rangle = \left\| \sum_{j=1}^{m} y_j p_j \right\|^2 \ge 0.$$

Hence $R$ is positive semidefinite.
If $R$ is not positive definite, then there is a nonzero vector $y$ such that

$$y^H R y = 0,$$

so that

$$\sum_{i=1}^{m} y_i p_i = 0;$$

thus, the $p_i$ are linearly dependent.
Conversely, if $R$ is positive definite, then

$$y^H R y > 0$$

for all nonzero $y$, and by the first equation of this proof,

$$\sum_{i=1}^{m} y_i p_i \ne 0.$$

This means that the $p_i$ are linearly independent. QED
As a result of this theorem, the Grammian is invertible if the vectors $p_i$ are linearly independent.
As an extension, if the vectors $p_i$ are orthogonal, then the Grammian is diagonal, significantly reducing the computation required to find the coefficients. They are simply obtained by

$$c_j = \frac{\langle x, p_j \rangle}{\langle p_j, p_j \rangle}.$$
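A small added sketch (not part of the original notes) illustrating this shortcut for a pair of orthogonal vectors in $\mathbb{R}^3$:

# Added sketch: for orthogonal data vectors the Grammian is diagonal, so each
# coefficient is simply <x, p_j> / <p_j, p_j>.
import numpy as np

p1 = np.array([1.0,  1.0, 0.0])
p2 = np.array([1.0, -1.0, 0.0])       # orthogonal to p1
x  = np.array([3.0,  1.0, 2.0])

P = np.column_stack([p1, p2])
R = P.T @ P                           # Grammian: off-diagonal entries are zero
print(R)

c_diag = np.array([np.inner(x, p) / np.inner(p, p) for p in (p1, p2)])
c_full = np.linalg.solve(R, P.T @ x)  # solving the full normal equations
print(np.allclose(c_diag, c_full))    # True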
The orthogonality principle for least squares is now formalized in Theorem 3.2:

Let $p_1, \ldots, p_m$ be data vectors in a vector space $S$, and let $x$ be any vector in $S$. In the representation

$$x = \sum_{i=1}^{m} c_i p_i + e = \hat{x} + e,$$

the induced norm of the error $\|e\|$ is minimized when the error $e = x - \hat{x}$ is orthogonal to each of the data vectors,

$$\left\langle x - \sum_{i=1}^{m} c_i p_i,\; p_j \right\rangle = 0, \qquad j = 1, 2, \ldots, m.$$
In the case that $x \in \mathrm{span}(p_1, \ldots, p_m)$, the error is zero and hence is orthogonal to the data vectors. This case is therefore trivial and is excluded from what follows.
If $x \notin \mathrm{span}(p_1, \ldots, p_m)$, let $y$ be a fixed vector that is orthogonal to all of the data vectors,

$$\langle y, p_i \rangle = 0, \qquad i = 1, \ldots, m,$$

such that

$$x = \sum_{i=1}^{m} a_i p_i + y$$

for some set of coefficients $a_1, \ldots, a_m$. Let $e$ be a vector satisfying

$$x = \sum_{i=1}^{m} c_i p_i + e \qquad \text{(1)}$$

for some set of coefficients $c_1, \ldots, c_m$. Then, by the Cauchy-Schwarz inequality,
$$\|e\|^2 \|y\|^2 \ge |\langle e, y \rangle|^2 = \left| \langle x, y \rangle - \left\langle \sum_{i=1}^{m} c_i p_i,\; y \right\rangle \right|^2 = |\langle x, y \rangle|^2.$$

The lower bound is independent of the coefficients $c_i$, and hence no choice of coefficients can make the bound smaller. By the equality condition of the Cauchy-Schwarz inequality, the lower bound is achieved (implying the minimum $\|e\|$) when
$$e = \alpha y$$

for some scalar $\alpha$. Since $e$ must satisfy equation (1), it must be the case that

$$\alpha = 1, \qquad a_i = c_i, \qquad e = y;$$

hence the error is orthogonal to the data. QED
It is important to note that, because $\hat{x}$ is a linear combination of the data vectors, it is orthogonal to the error vector:

$$\langle \hat{x}, e \rangle = 0.$$

In other words, the residual vector (the errors at each data point) is orthogonal to the column space of the data matrix. See the demonstration below.
Provide some simple python code and examples that emphasize the basic concepts.
Note Figure 3.8(b), provided in the book.
It may be confusing that the minimized error appears as the vertical distances between the regressed line and the data points. The following example shows why these vertical distances do not contradict the earlier statements about orthogonal error minimization.
Let the data vector $y = [1, 4, 5]^T$, measured at times $t = [0, 1, 2]^T$, be approximated by the linear regression

$$y = a_1 t + a_0.$$

This may be expressed as

$$\begin{bmatrix} 1 \\ 4 \\ 5 \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 1 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} a_1 \\ a_0 \end{bmatrix}.$$

Here we have $p_1 = t$ and $p_2 = [1, 1, 1]^T$.
To compute the Grammian matrix $R$, we need the inner products of $p_1$ and $p_2$ with themselves and with each other.

We also compute the cross-correlation vector $p_x = [\langle y, p_1 \rangle, \langle y, p_2 \rangle]^T$.

Finally, we verify the linear independence of $p_1, p_2$ (thus guaranteeing invertibility of $R$) by asserting

$$\langle \hat{p}_1, \hat{p}_2 \rangle \ne 1,$$

where $\hat{p}_i$ is the unit vector in the direction of $p_i$.
Our $p$-vectors are in $\mathbb{R}^3$. When we say that the error is minimized by being orthogonal to the data vectors, we mean that the residual is orthogonal to the column space of the matrix built from the data; in our case, this is the matrix composed of $p_1$ and $p_2$. When this is drawn on the 2D plot of the data, it looks like vertical distances from the true data to the regressed line. But if we plot the data vectors in 3D (here they span a plane), we can see that the residual error is perpendicular to this plane, as shown by

$$\langle p_1, e \rangle = 0, \qquad \langle p_2, e \rangle = 0.$$
$y$ and $t$, as well as the regressed line, are plotted below.

In the 3D plot, you can see the plane spanned by $p_1$ and $p_2$, and that the estimate $\hat{y}$ lies in this plane. You can also see that the error vector is orthogonal to this plane created by the data vectors. Finally, if you add the error vector to the estimate, you recover the original, true data,

$$\hat{y} + e = y,$$

as seen by the fact that the yellow vector representing the true data lies out of the plane.

You can grab and rotate the plot with the mouse to see these facts.
# create time and measured data
t_vec = np.array([0,1,2])
y_vec = np.array([1,4,5])
# create data vectors
p_1 = t_vec
p_2 = np.array([1,1,1])
# create the 'A' matrix, or data matrix from data vectors
data_mat = np.array([p_1,p_2]).transpose()
# regression with least squares, compute the Grammian matrix, 'R'
reg_ls_R = np.array([[np.inner(p_1,p_1),np.inner(p_2,p_1)],[np.inner(p_1,p_2),np.inner(p_2,p_2)]])
# compute the cross-correlation vector
p_x = np.array([np.inner(y_vec,p_1),np.inner(y_vec,p_2)])
# compute the norms of the data vectors, to assert invertibility of R (Grammian)
p_1_unit = p_1 / np.linalg.norm(p_1)
p_2_unit = p_2 / np.linalg.norm(p_2)
assert(np.inner(p_1_unit,p_2_unit)!=1)
# compute the error-minimizing coefficients
a_vec = np.linalg.inv(reg_ls_R) @ p_x
# compute estimate data
y_hat = data_mat @ a_vec
# compute errors for each point in the data vector
e_vec = y_vec - y_hat
# show data vectors are orthogonal to error
test_1 = np.inner(p_1,e_vec)
test_2 = np.inner(p_2,e_vec)
print(f'inner product of p1 with e_vec = {test_1}')
print(f'inner product of p2 with e_vec = {test_2}')
print('If these equal 0, then error is orthogonal to data vectors.')
# fig1,(ax1,ax2) = plt.subplots(2,1)
fig1, ax1 = plt.subplots(1,1)
# plt.subplot(2,1,1)
plt.plot(t_vec,y_vec, 'o', label='original data' )
plt.plot(t_vec,y_hat, '-', label='regressed data' )
plt.plot([0,0],[y_vec[0],y_hat[0]], label='error[0]')
plt.plot([1,1],[y_vec[1],y_hat[1]], label='error[1]')
plt.plot([2,2],[y_vec[2],y_hat[2]], label='error[2]')
ax1.set_ylim(0,6)
plt.legend()
ax1.xaxis.set_major_locator(plt.MaxNLocator(5))
ax1.yaxis.set_major_locator(plt.MaxNLocator(7))
inner product of p1 with e_vec = -6.217248937900877e-15
inner product of p2 with e_vec = -4.440892098500626e-15
If these equal 0, then error is orthogonal to data vectors.
# create 3D plot to show data vectors and perpendicular error vector
# necessary to turn on interactivity on 3d plot
%matplotlib notebook
# create necessary elements for plotting surface
v1 = p_1_unit
v2 = p_2_unit
# the cross product is a vector normal to the plane
cp = np.cross(v1, v2)
a, b, c = cp
xx = np.linspace(-1, 7, 5, endpoint=True)
yy = np.linspace(-1, 7, 5, endpoint=True)
XX, YY = np.meshgrid(xx, yy)
ZZ = (-a * XX - b * YY) / c
# error plus y_hat vector
e_p_yhat = y_hat + e_vec
# print to screen
fig2 = plt.figure()
ax2 = fig2.add_subplot(111,projection='3d')
ax2.plot_surface(XX,YY,ZZ,alpha=0.2)
# and plot the point
ax2.plot([0,y_hat[0]] , [0,y_hat[1]] , [0,y_hat[2]], color='green', label='y_hat')
ax2.plot([0,p_1[0]] , [0,p_1[1]] , [0,p_1[2]], color='red', label='p_1')
ax2.plot([0,p_2[0]] , [0,p_2[1]] , [0,p_2[2]], color='red', label='p_2')
ax2.plot([y_hat[0],e_p_yhat[0]],[y_hat[1],e_p_yhat[1]],[y_hat[2],e_p_yhat[2]], color='blue', label='e_vec from y_hat')
ax2.plot([0,y_vec[0]],[0,y_vec[1]],[0,y_vec[2]], color='yellow', label='y_vec')
ax2.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05),
fancybox=True, shadow=True, ncol=5)
# plt.legend()
# set aspect ratio to equal, for proper viewing of orthogonality
# (equal aspect on 3D axes requires a reasonably recent matplotlib; older versions raise here)
try:
    ax2.set_aspect('equal')
except NotImplementedError:
    pass
ax2.set_xlim(-1,7)
ax2.set_ylim(-1,7)
ax2.set_zlim(-1,7)
Finally, we make one more observation, specific to the matrix case.
Recall that the Grammian matrix, $R$, is the left-hand matrix in the normal equations:

$$\begin{bmatrix} \langle p_1, p_1 \rangle & \cdots & \langle p_m, p_1 \rangle \\ \vdots & \ddots & \vdots \\ \langle p_1, p_m \rangle & \cdots & \langle p_m, p_m \rangle \end{bmatrix} \begin{bmatrix} c_1 \\ \vdots \\ c_m \end{bmatrix} = \begin{bmatrix} \langle x, p_1 \rangle \\ \vdots \\ \langle x, p_m \rangle \end{bmatrix}$$

If we denote our data matrix as $A$, note that $R$ may also be represented as $A^H A$. Additionally, the right-hand vector, $p_x$, may be represented as $A^H x$.

Also recall that if the columns of the data matrix $A$ are linearly independent, then $R$ is invertible. Hence,

$$Rc = p_x$$
$$\Rightarrow A^H A c = A^H x$$
$$\Rightarrow c = (A^H A)^{-1} A^H x.$$

Below we show that applying the Moore-Penrose pseudo-inverse, $(A^H A)^{-1} A^H$, is equivalent to the process above when your data can be expressed as a matrix (i.e., not in function space).
a_pseudo = np.linalg.inv(data_mat.transpose()@data_mat)@data_mat.transpose()@y_vec
err_coeffs = a_pseudo - a_vec
print(f'error between Grammian and pseudo-inverse method for coefficients: {err_coeffs}')
error between Grammian and pseudo-inverse method for coefficients: [-4.44089210e-16 -6.66133815e-16]
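NumPy also ships routines that return the same least-squares solution. The short check below is an addition (it assumes data_mat, y_vec, and a_vec from the cells above are still in scope):

# Added sketch: the same coefficients via NumPy's built-in pseudo-inverse and lstsq.
a_pinv = np.linalg.pinv(data_mat) @ y_vec
a_lstsq, *_ = np.linalg.lstsq(data_mat, y_vec, rcond=None)
print(np.allclose(a_pinv, a_vec), np.allclose(a_lstsq, a_vec))   # True True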
Provide a more sophisticated example showing one engineering example of the topic, complete with python code.
Given a mass-spring-damper system,

$$m\ddot{x} - b\dot{x} - kx = u,$$

with known inputs and outputs, estimate the system parameters $m$, $b$, $k$.

Note: Outputs are usually measured with instruments that introduce noise into the measurements. We call this an estimate because, with perfect measurements, you could retrieve the exact values. Because of the noise in the measurements and an imperfect actuator, noise ends up in the data you use to retrieve the parameters.
It is easiest to take advantage of the matrix form of the Moore-Penrose Pseudo-Inverse:
$$Ac = u$$
$$c = (A^H A)^{-1} A^H u$$

where, for our problem,

$$A = \begin{bmatrix} \ddot{x} & -\dot{x} & -x \end{bmatrix}.$$

Suggestion: Play with the standard-deviation multiplier sigma_mult for the noise, as well as t_fin, to see the limits of this method and how longer data-collection times help the estimation.
# PROBLEM SETUP
# create time vector
dt = 0.01
t_init = 0
t_fin = 10
t_lt = int((t_fin - t_init)/dt + 1)
tt = np.linspace(t_init, t_fin, t_lt)
# true system params
mass = 1
b_damp = 4
k_sp = 5
m_b_k_tr = np.array([mass,b_damp,k_sp])
# create true data, and true output (to guarantee solvability)
xx = 7 * np.sin(tt**2)
xx_d = 14*tt*np.cos(tt**2)
xx_dd = 14*np.cos(tt**2) - 28*tt**2*np.sin(tt**2)
A_tr_transpose = np.array([xx_dd,-xx_d,-xx])
A_tr = A_tr_transpose.transpose()
# uu = mass*xx_dd - b_damp*xx_d - k_sp*xx
uu = A_tr @ m_b_k_tr
# add noise to true data, to simulate real world measurements
# generate noise
sigma_mult = 0.5
noise_0 = sigma_mult * np.random.randn(t_lt)
noise_1 = sigma_mult * np.random.randn(t_lt)
noise_2 = sigma_mult * np.random.randn(t_lt)
noise_u = sigma_mult * 0.1 * np.random.randn(t_lt)
xx_noise = noise_0 + xx
xx_d_noise = noise_1 + xx_d
xx_dd_noise = noise_2 + xx_dd
uu_noise = noise_u + uu
# =====================================
# PROBLEM COMPUTATION
# assert data matrix vectors are pairwise linearly independent, by checking that
# the normalized inner products are not +/- 1 (i.e., no two columns are parallel)
def unit(v):
    return v / np.linalg.norm(v)
assert(abs(np.inner(unit(xx_dd_noise), unit(xx_d_noise))) != 1)
assert(abs(np.inner(unit(xx_dd_noise), unit(xx_noise))) != 1)
assert(abs(np.inner(unit(xx_d_noise), unit(xx_noise))) != 1)
# Now use noisy measurements, and estimate the system parameters, m, k, b
# create A matrix, or data vectors, as well as its transpose, for use of the Moore-Penrose Pseudo-Inverse
A_mat_transpose = np.array([xx_dd_noise,-xx_d_noise,-xx_noise])
A_mat = A_mat_transpose.transpose()
# make sure your coefficient order matches your A matrix, fix signs of coefficients
m_b_k_hat = np.linalg.inv(A_mat_transpose @ A_mat) @ A_mat_transpose @ uu_noise
# demonstrate that with perfect measurements, perfect results can be obtained
m_b_k_perfect = np.linalg.inv(A_tr_transpose @ A_tr) @ A_tr_transpose @ uu
mh,bh,kh = m_b_k_hat.tolist()
mp,bp,kp = m_b_k_perfect.tolist()
print(f'Estimated parameter reconstruction m, b, k: {mh:.4f}, {bh:.4f}, {kh:.4f}')
print(f'Perfect parameter reconstruction m, b, k: {mp:.4f}, {bp:.4f}, {kp:.4f}')
Estimated parameter reconstruction m, b, k: 1.0006, 3.9996, 4.8886
Perfect parameter reconstruction m, b, k: 1.0000, 4.0000, 5.0000
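Following the suggestion above, the added sketch below sweeps the noise level and re-estimates the parameters at each level. It reuses A_tr, uu, and t_lt from the setup cell, and the exact numbers will vary from run to run:

# Added sketch: sweep the noise multiplier and re-estimate m, b, k at each level.
for sweep_sigma in [0.1, 0.5, 2.0, 5.0]:
    A_noisy = A_tr + sweep_sigma * np.random.randn(t_lt, 3)
    u_noisy = uu + sweep_sigma * 0.1 * np.random.randn(t_lt)
    est = np.linalg.inv(A_noisy.T @ A_noisy) @ A_noisy.T @ u_noisy
    print(f'sigma = {sweep_sigma:4.1f} -> m, b, k ~ {est[0]:.3f}, {est[1]:.3f}, {est[2]:.3f}')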
Author: Jonathan Barrett
jab128@byu.edu
Suppose you have created a model that evaluates the probability that an ADS-B message will be decoded successfully, given a UAS density in UAS/km$^2$. You have collected data for UAS densities from nearly 0 UAS/km$^2$ up to 5 UAS/km$^2$. Not only is the data noisy from randomization, but the simulations are now too resource-intensive to continue running at higher densities.

You realize that you need to perform a least-squares linear regression to fit a line to the data and to estimate the probability of a successful decode at 10 UAS/km$^2$. Use the following CSV file and the Python template below: https://drive.google.com/file/d/1vBTGxYauh1itxIlzHmcZsaQXcVefohVv/view?usp=sharing

If your estimated decode probability seems wrong, you may want to use a weighted least-squares solution that follows the later trends in the data more closely.
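As a hint (added here; the problem statement above only names the technique), one common form of weighted least squares introduces a diagonal weighting matrix $W$ whose entries are larger for the samples you trust more, and solves

$$\hat{c} = (A^H W A)^{-1} A^H W y,$$

which reduces to the ordinary least-squares solution when $W = I$.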
import numpy as np
import matplotlib.pyplot as plt
num_samples = 10000
# Download CSV file from the link provided
data = np.genfromtxt("adsb_data.csv", delimiter=",")
x = np.linspace(0, 5, num_samples)
# Plot data
plt.figure(1)
plt.scatter(x, data)
# Insert code here #
plt.show()
# SOLUTION #
import numpy as np
import matplotlib.pyplot as plt
num_samples = 10000
# Download CSV file from the link provided
data = np.genfromtxt("adsb_data.csv", delimiter=",")
x = np.linspace(0, 5, num_samples)
# Plot data
plt.figure(1)
plt.scatter(x, data, s=2)
# Insert code here #
# Create A and y
A = np.zeros((num_samples, 2))
y = np.zeros(num_samples)
# Populate A and y (first column: UAS density x, second column: constant term)
for i in range(num_samples):
    A[i][0] = x[i]
    A[i][1] = 1.0
    y[i] = data[i]
# Solve for coefficients
h = np.linalg.inv(A.transpose() @ A) @ A.transpose() @ y
# Filter data
filtered_data = A @ h
# Plot filtered data, estimate successful decode probability at 10 UAS/km^2
plt.figure(2)
plt.scatter(x, filtered_data, s=2)
print("Estimated successful decode probability at 10 UAS/km^2:", h[0]*(10) + h[1])
# Create and populate the weighting matrix
# Weight the first 2000 samples much less
# (note: a full num_samples x num_samples diagonal matrix is memory-hungry;
#  storing the weights as a vector would also work)
W = np.zeros((num_samples, num_samples))
for i in range(num_samples):
    if i < 2000:
        W[i][i] = 1
    else:
        W[i][i] = 100
# Solve for weighted coefficients
h_weighted = np.linalg.inv(A.transpose() @ W @ A) @ A.transpose() @ W @ y
# Filter using weighted coefficients
weighted_filtered_data = A @ h_weighted
# Plot weighted filtered data, estimate successful decode probability at 10 UAS/km^2
plt.figure(3)
plt.scatter(x, weighted_filtered_data, s=2)
print("Estimated successful decode probability at 10 UAS/km^2, weighted:", h_weighted[0]*(10) + h_weighted[1])
plt.show()
Congratulations! You found coefficients to form a linear equation that best fits the data, but you can actually fit the data better by finding coefficients for polynomials of different orders. Re-write your program so that it can fit the data to any order polynomial that you choose.
Once you have done so, set the polynomial order to 10 and run the program. Does the curve seem to fit the data better? Why is that?
Check the approximations for the ADS-B successful decode probability at 10 UAS/km2. Do they seem better or worse than the approximations from the linear equation? Why is that?
Now set the polynomial order to 100 and run the program. Does the curve seem to fit the data well?
# SOLUTION #
import numpy as np
import matplotlib.pyplot as plt
num_samples = 10000
order_of_polynomial = 10
# Download CSV file from the link provided
data = np.genfromtxt("adsb_data.csv", delimiter=",")
x = np.linspace(0, 5, num_samples)
# Plot data
plt.figure(1)
plt.scatter(x, data, s=2)
# Insert code here #
# Create A and y
A = np.zeros((num_samples, order_of_polynomial + 1))
y = np.zeros(num_samples)
# Populate A and y (columns are x**order, ..., x, 1)
for i in range(num_samples):
    for j in range(order_of_polynomial + 1):
        A[i][j] = x[i]**(order_of_polynomial - j)
    y[i] = data[i]
# Solve for coefficients
h = np.linalg.inv(A.transpose() @ A) @ A.transpose() @ y
# Filter data
filtered_data = A @ h
# Plot filtered data
plt.figure(2)
plt.scatter(x, filtered_data, s=2)
# Estimate successful decode probability at 10 UAS/km^2
estimated_value = 0
for i in range(order_of_polynomial + 1):
    estimated_value += h[i]*10**(order_of_polynomial - i)
print("Estimated successful decode probability at 10 UAS/km^2:", estimated_value)
# Create and populate the weighting matrix
# Weight the first 2000 samples much less
W = np.zeros((num_samples, num_samples))
for i in range(num_samples):
    if i < 2000:
        W[i][i] = 1
    else:
        W[i][i] = 100
# Solve for weighted coefficients
h_weighted = np.linalg.inv(A.transpose() @ W @ A) @ A.transpose() @ W @ y
# Filter using weighted coefficients
weighted_filtered_data = A @ h_weighted
# Plot weighted filtered data
plt.figure(3)
plt.scatter(x, weighted_filtered_data, s=2)
# Estimate successful decode probability at 10 UAS/km^2
estimated_value_weighted = 0
for i in range(order_of_polynomial + 1):
    estimated_value_weighted += h_weighted[i]*10**(order_of_polynomial - i)
print("Estimated successful decode probability at 10 UAS/km^2, weighted:", estimated_value_weighted)
plt.show()
# EXPLANATIONS #
# With the polynomial order set to 10, the curve does seem to fit the data much better,
# especially for UAS densities below 2 UAS/km^2. This is because the higher-order terms
# give the fit enough freedom to follow the bends in the data.
# The estimates of the decode probability at 10 UAS/km^2 are now much worse, because the
# coefficients only try to fit the data we actually have; nothing constrains the polynomial
# to behave sensibly beyond the sampled range, so the extrapolation diverges.
# With the polynomial order set to 100, the curve no longer seems to match the data at all.
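# One way to see why the order-100 fit falls apart (an added note, not part of the
# original solution): the data matrix becomes numerically ill-conditioned as the
# polynomial order grows, and forming A^T A roughly squares that condition number,
# so the coefficients computed through np.linalg.inv lose their precision.
# A quick added sketch:
import numpy as np

x = np.linspace(0, 5, 10000)              # same density axis as above
for order in [1, 10, 20, 50, 100]:
    A = np.vander(x, order + 1)           # columns x**order, ..., x, 1
    print(f'order {order:3d}: cond(A) ~ {np.linalg.cond(A):.2e}')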