#!/usr/bin/env python
# coding: utf-8

# You can read an overview of this Numerical Linear Algebra course in [this blog post](http://www.fast.ai/2017/07/17/num-lin-alg/). The course was originally taught in the [University of San Francisco MS in Analytics](https://www.usfca.edu/arts-sciences/graduate-programs/analytics) graduate program. Course lecture videos are [available on YouTube](https://www.youtube.com/playlist?list=PLtmWHNX-gukIc92m1K0P6bIOnZb-mg0hY) (note that the notebook numbers and video numbers do not line up, since some notebooks took more than one video to cover).
# 
# You can ask questions about the course on [our fast.ai forums](http://forums.fast.ai/c/lin-alg).

# # 6. How to Implement Linear Regression

# In the previous lesson, we calculated the least squares linear regression for a diabetes dataset using scikit-learn's implementation. Today, we will look at how we could write our own implementation.

# In[2]:

from sklearn import datasets, linear_model, metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
import math, scipy, numpy as np
from scipy import linalg

# In[3]:

np.set_printoptions(precision=6)

# In[4]:

data = datasets.load_diabetes()

# In[5]:

feature_names=['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

# In[6]:

trn,test,y_trn,y_test = train_test_split(data.data, data.target, test_size=0.2)

# In[7]:

trn.shape, test.shape

# In[8]:

def regr_metrics(act, pred):
    # returns (RMSE, MAE)
    return (math.sqrt(metrics.mean_squared_error(act, pred)),
            metrics.mean_absolute_error(act, pred))

# ### How did sklearn do it?

# How is sklearn doing this? By checking [the source code](https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/linear_model/base.py#L417), you can see that in the dense case, it calls [scipy.linalg.lstsq](https://github.com/scipy/scipy/blob/v0.19.0/scipy/linalg/basic.py#L892-L1058), which calls a LAPACK method:
# 
#     Options are ``'gelsd'``, ``'gelsy'``, ``'gelss'``. Default
#     (``'gelsd'``) is a good choice.  However, ``'gelsy'`` can be slightly
#     faster on many problems.  ``'gelss'`` was used historically.  It is
#     generally slow but uses less memory.
# 
# - [gelsd](https://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mkl_lapack_examples/_gelsd.htm): uses SVD and a divide-and-conquer method
# - [gelsy](https://software.intel.com/en-us/node/521113): uses QR factorization
# - [gelss](https://software.intel.com/en-us/node/521114): uses SVD

# #### Scipy Sparse Least Squares

# We will not get into too much detail about the sparse version of least squares. Here is a bit of info if you are interested:

# [Scipy sparse lsqr](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.linalg.lsqr.html#id1) uses an iterative method called [Golub and Kahan bidiagonalization](https://web.stanford.edu/class/cme324/paige-saunders2.pdf).

# From the [scipy sparse lsqr source code](https://github.com/scipy/scipy/blob/v0.14.0/scipy/sparse/linalg/isolve/lsqr.py#L96):
# 
#     Preconditioning is another way to reduce the number of iterations. If it is
#     possible to solve a related system ``M*x = b`` efficiently, where M approximates
#     A in some helpful way (e.g. M - A has low rank or its elements are small relative
#     to those of A), LSQR may converge more rapidly on the system ``A*M(inverse)*z = b``,
#     after which x can be recovered by solving M*x = z.
# 
#     If A is symmetric, LSQR should not be used! Alternatives are the symmetric
#     conjugate-gradient method (cg) and/or SYMMLQ. SYMMLQ is an implementation of
#     symmetric cg that applies to any symmetric A and will converge more rapidly
#     than LSQR. If A is positive definite, there are other implementations of
#     symmetric cg that require slightly less work per iteration than SYMMLQ
#     (but will take the same number of iterations).
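# A minimal, self-contained sketch (my own addition, not from the original notebook) of the iterative sparse solver discussed above. `scipy.sparse.linalg.lsqr` solves a least squares problem without ever forming $A^TA$; the random sparse system below is purely illustrative.

# In[ ]:

from scipy import sparse
from scipy.sparse.linalg import lsqr

A_sp = sparse.random(1000, 20, density=0.1, format='csr', random_state=0)  # sparse 1000 x 20 system
x_true = np.random.randn(20)
rhs = A_sp @ x_true                 # noiseless right-hand side
x_lsqr = lsqr(A_sp, rhs)[0]         # first element of the returned tuple is the solution
np.linalg.norm(x_lsqr - x_true)     # should be small (lsqr iterates to its default tolerance)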
# ### linalg.lstsq

# The sklearn implementation handled adding a constant term for us (since the y-intercept is presumably not 0 for the line we are learning). We will need to do that by hand now:

# In[9]:

trn_int = np.c_[trn, np.ones(trn.shape[0])]
test_int = np.c_[test, np.ones(test.shape[0])]

# Since `linalg.lstsq` lets us specify which LAPACK routine we want to use, let's try them all and do some timing comparisons:

# In[10]:

get_ipython().run_line_magic('timeit', 'coef, _,_,_ = linalg.lstsq(trn_int, y_trn, lapack_driver="gelsd")')

# In[11]:

get_ipython().run_line_magic('timeit', 'coef, _,_,_ = linalg.lstsq(trn_int, y_trn, lapack_driver="gelsy")')

# In[12]:

get_ipython().run_line_magic('timeit', 'coef, _,_,_ = linalg.lstsq(trn_int, y_trn, lapack_driver="gelss")')

# ### Naive Solution

# Recall that we want to find $\hat{x}$ that minimizes:
# $$ \big\vert\big\vert Ax - b \big\vert\big\vert_2$$
# 
# Another way to think about this is that we are interested in the point in the subspace spanned by the columns of $A$ (called the *range of* $A$) that is closest to $b$. This is the projection of $b$ onto that subspace. Since $b - A\hat{x}$ must be perpendicular to the subspace spanned by $A$, we see that
# 
# $$A^T (b - A\hat{x}) = 0 $$
# 
# (we are using $A^T$ because we want to take the dot product of each column of $A$ with $b - A\hat{x}$).
# 
# This leads us to the *normal equations*, $A^TA x = A^T b$, whose solution is:
# $$ x = (A^TA)^{-1}A^T b $$

# In[13]:

def ls_naive(A, b):
     return np.linalg.inv(A.T @ A) @ A.T @ b

# In[14]:

get_ipython().run_line_magic('timeit', 'coeffs_naive = ls_naive(trn_int, y_trn)')

# In[15]:

coeffs_naive = ls_naive(trn_int, y_trn)
regr_metrics(y_test, test_int @ coeffs_naive)
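# A quick sanity check of the derivation above (an added check, not part of the original notebook): the residual $b - A\hat{x}$ should be numerically orthogonal to the columns of $A$, so the relative size of $A^T(b - A\hat{x})$ should be tiny.

# In[ ]:

# relative norm of A^T (b - A x_hat); should be near machine precision,
# scaled up somewhat by the conditioning of A^T A
np.linalg.norm(trn_int.T @ (y_trn - trn_int @ coeffs_naive)) / np.linalg.norm(trn_int.T @ y_trn)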
# ### Normal Equations (Cholesky)

# Normal equations:
# $$ A^TA x = A^T b $$
# 
# If $A$ has full rank, then $A^TA$ is a **square, hermitian positive definite** matrix. The standard way of solving such a system is *Cholesky factorization*, which finds an upper-triangular $R$ s.t. $A^TA = R^TR$.

# The following steps are based on Algorithm 11.1 from Trefethen:

# In[16]:

A = trn_int

# In[17]:

b = y_trn

# In[18]:

AtA = A.T @ A
Atb = A.T @ b

# **Warning:** Numpy and Scipy default to different upper/lower for Cholesky

# In[19]:

R = scipy.linalg.cholesky(AtA)

# In[196]:

np.set_printoptions(suppress=True, precision=4)
R

# check our factorization:

# In[20]:

np.linalg.norm(AtA - R.T @ R)

# $$ A^T A x = A^T b $$
# $$ R^T R x = A^T b $$
# $$ R^T w = A^T b $$
# $$ R x = w $$

# In[21]:

w = scipy.linalg.solve_triangular(R, Atb, lower=False, trans='T')

# It's always good to check that our result is what we expect it to be, in case we passed the wrong params, the function didn't return what we thought, or the docs are outdated:

# In[22]:

np.linalg.norm(R.T @ w - Atb)

# In[23]:

coeffs_chol = scipy.linalg.solve_triangular(R, w, lower=False)

# In[24]:

np.linalg.norm(R @ coeffs_chol - w)

# In[25]:

def ls_chol(A, b):
    R = scipy.linalg.cholesky(A.T @ A)
    w = scipy.linalg.solve_triangular(R, A.T @ b, trans='T')
    return scipy.linalg.solve_triangular(R, w)

# In[26]:

get_ipython().run_line_magic('timeit', 'coeffs_chol = ls_chol(trn_int, y_trn)')

# In[27]:

coeffs_chol = ls_chol(trn_int, y_trn)
regr_metrics(y_test, test_int @ coeffs_chol)

# ### QR Factorization

# $$ A x = b $$
# $$ A = Q R $$
# $$ Q R x = b $$
# $$ R x = Q^T b $$

# In[28]:

def ls_qr(A,b):
    Q, R = scipy.linalg.qr(A, mode='economic')
    return scipy.linalg.solve_triangular(R, Q.T @ b)

# In[29]:

get_ipython().run_line_magic('timeit', 'coeffs_qr = ls_qr(trn_int, y_trn)')

# In[30]:

coeffs_qr = ls_qr(trn_int, y_trn)
regr_metrics(y_test, test_int @ coeffs_qr)

# ### SVD

# $$ A x = b $$
# 
# $$ A = U \Sigma V^T $$
# 
# $$ \Sigma V^T x = U^T b $$
# 
# $$ \Sigma w = U^T b \quad \text{where } w = V^T x $$
# 
# $$ x = V w $$

# SVD gives the pseudo-inverse.

# In[253]:

def ls_svd(A,b):
    m, n = A.shape
    U, sigma, Vh = scipy.linalg.svd(A, full_matrices=False, lapack_driver='gesdd')
    w = (U.T @ b) / sigma
    return Vh.T @ w

# In[32]:

get_ipython().run_line_magic('timeit', 'coeffs_svd = ls_svd(trn_int, y_trn)')

# In[255]:

coeffs_svd = ls_svd(trn_int, y_trn)
regr_metrics(y_test, test_int @ coeffs_svd)

# ### Random Sketching Technique for Least Squares Regression

# [Linear Sketching](http://researcher.watson.ibm.com/researcher/files/us-dpwoodru/journal.pdf) (Woodruff)
# 
# 1. Sample an r x n random matrix S, r << n
# 2. Compute S A and S b
# 3. Find the exact solution x to the regression SA x = Sb
# 
# (A rough illustration follows below.)
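# A rough sketch of the idea above (my own illustration, not from the lesson; the Woodruff survey covers sketching matrices that are much cheaper to apply than the dense Gaussian used here). S compresses the rows of A before solving the smaller regression.

# In[ ]:

m, n = trn_int.shape
r = 50                                          # sketch size (assumed for illustration; needs r >= n)
S = np.random.randn(r, m) / np.sqrt(r)          # dense Gaussian sketching matrix
coeffs_sketch = ls_qr(S @ trn_int, S @ y_trn)   # solve the much smaller r x n regression exactly
regr_metrics(y_test, test_int @ coeffs_sketch)  # approximates the full least squares fit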
# ### Timing Comparison

# In[244]:

import timeit
import pandas as pd

# In[245]:

def scipylstq(A, b):
    return scipy.linalg.lstsq(A,b)[0]

# In[246]:

row_names = ['Normal Eqns- Naive',
             'Normal Eqns- Cholesky',
             'QR Factorization',
             'SVD',
             'Scipy lstsq']

name2func = {'Normal Eqns- Naive': 'ls_naive',
             'Normal Eqns- Cholesky': 'ls_chol',
             'QR Factorization': 'ls_qr',
             'SVD': 'ls_svd',
             'Scipy lstsq': 'scipylstq'}

# In[247]:

m_array = np.array([100, 1000, 10000])
n_array = np.array([20, 100, 1000])

# In[248]:

index = pd.MultiIndex.from_product([m_array, n_array], names=['# rows', '# cols'])

# In[249]:

pd.options.display.float_format = '{:,.6f}'.format
df = pd.DataFrame(index=row_names, columns=index)
df_error = pd.DataFrame(index=row_names, columns=index)

# In[256]:

# %%prun
for m in m_array:
    for n in n_array:
        if m >= n:
            x = np.random.uniform(-10,10,n)
            A = np.random.uniform(-40,40,[m,n])   # removed np.asfortranarray
            b = np.matmul(A, x) + np.random.normal(0,2,m)
            for name in row_names:
                fcn = name2func[name]
                t = timeit.timeit(fcn + '(A,b)', number=5, globals=globals())
                df.loc[name, (m,n)] = t
                coeffs = locals()[fcn](A, b)
                reg_met = regr_metrics(b, A @ coeffs)
                df_error.loc[name, (m,n)] = reg_met[0]

# In[257]:

df

# In[252]:

df_error

# In[618]:

store = pd.HDFStore('least_squares_results.h5')

# In[619]:

store['df'] = df

# ## Notes

# I used the magic %prun to profile my code.

# Alternative: least absolute deviation (L1 regression)
# - Less sensitive to outliers than least squares.
# - No closed form solution, but can be solved with linear programming.

# ## Conditioning & stability

# #### Condition Number

# The *condition number* is a measure of how much small changes to the input cause the output to change.
# 
# **Question**: Why do we care about behavior under small changes to the input in numerical linear algebra?

# The *relative condition number* is defined by
# 
# $$ \kappa = \sup_{\delta x} \frac{\|\delta f\|}{\| f(x) \|}\bigg/ \frac{\| \delta x \|}{\| x \|} $$
# 
# where $\delta x$ is infinitesimal.
# 
# According to Trefethen (pg. 91), a problem is *well-conditioned* if $\kappa$ is small (e.g. $1$, $10$, $10^2$) and *ill-conditioned* if $\kappa$ is large (e.g. $10^6$, $10^{16}$).

# **Conditioning**: perturbation behavior of a mathematical problem (e.g. least squares)
# 
# **Stability**: perturbation behavior of an algorithm used to solve that problem on a computer (e.g. least squares algorithms, Householder, back substitution, Gaussian elimination)

# #### Conditioning example

# The problem of computing eigenvalues of a non-symmetric matrix is often ill-conditioned.

# In[178]:

A = [[1, 1000], [0, 1]]
B = [[1, 1000], [0.001, 1]]

# In[179]:

wA, vrA = scipy.linalg.eig(A)
wB, vrB = scipy.linalg.eig(B)

# In[180]:

wA, wB

# #### Condition Number of a Matrix

# The product $\| A\| \|A^{-1} \|$ comes up so often it has its own name: the *condition number* of $A$. Note that normally we talk about the conditioning of problems, not matrices.
# 
# The *condition number* of $A$ relates to:
# - computing $b$ given $A$ and $x$ in $Ax = b$
# - computing $x$ given $A$ and $b$ in $Ax = b$

# ## Loose ends from last time

# ### Full vs Reduced Factorizations

# **SVD**

# Diagrams of the full vs reduced SVD from Trefethen (not shown here).

# **QR Factorization exists for ALL matrices**

# Just like with SVD, there are full and reduced versions of the QR factorization (Trefethen's diagrams not shown here); the shapes are compared in the sketch below.
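# A quick illustration (my addition, not from the original notebook) of full vs reduced factorizations on the training matrix: the reduced ("economic") versions keep only the first $n$ columns of $Q$ and $U$.

# In[ ]:

Q_full, R_full = scipy.linalg.qr(trn_int)                     # full:    Q is m x m, R is m x n
Q_red,  R_red  = scipy.linalg.qr(trn_int, mode='economic')    # reduced: Q is m x n, R is n x n
U_full, s, Vh  = scipy.linalg.svd(trn_int, full_matrices=True)
U_red,  s, Vh  = scipy.linalg.svd(trn_int, full_matrices=False)
Q_full.shape, Q_red.shape, U_full.shape, U_red.shape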
# ### Matrix Inversion is Unstable

# In[197]:

from scipy.linalg import hilbert

# In[229]:

n = 14
A = hilbert(n)
x = np.random.uniform(-10,10,n)
b = A @ x

# In[230]:

A_inv = np.linalg.inv(A)

# In[233]:

np.linalg.norm(np.eye(n) - A @ A_inv)

# In[231]:

np.linalg.cond(A)

# In[232]:

A @ A_inv

# In[237]:

row_names = ['Normal Eqns- Naive',
             'QR Factorization',
             'SVD',
             'Scipy lstsq']

name2func = {'Normal Eqns- Naive': 'ls_naive',
             'QR Factorization': 'ls_qr',
             'SVD': 'ls_svd',
             'Scipy lstsq': 'scipylstq'}

# In[238]:

pd.options.display.float_format = '{:,.9f}'.format
df = pd.DataFrame(index=row_names, columns=['Time', 'Error'])

# In[239]:

for name in row_names:
    fcn = name2func[name]
    t = timeit.timeit(fcn + '(A,b)', number=5, globals=globals())
    coeffs = locals()[fcn](A, b)
    df.loc[name, 'Time'] = t
    df.loc[name, 'Error'] = regr_metrics(b, A @ coeffs)[0]

# #### SVD is best here!

# DO NOT RERUN

# In[240]:

df

# **Another reason not to take the inverse:**
# 
# Even if $A$ is incredibly sparse, $A^{-1}$ is generally dense. For large matrices, $A^{-1}$ could be so dense as to not fit in memory.

# ## Runtime

# Matrix Inversion: $2n^3$
# 
# Matrix Multiplication: $n^3$
# 
# Cholesky: $\frac{1}{3}n^3$
# 
# QR, Gram Schmidt: $2mn^2$, $m\geq n$ (chapter 8 of Trefethen)
# 
# QR, Householder: $2mn^2 - \frac{2}{3}n^3$ (chapter 10 of Trefethen)
# 
# Solving a triangular system: $n^2$

# **Why Cholesky Factorization is Fast:** (diagram not shown here)
# 
# (source: [Stanford Convex Optimization: Numerical Linear Algebra Background Slides](http://stanford.edu/class/ee364a/lectures/num-lin-alg.pdf))

# ### A Case Where QR is the Best

# In[65]:

m=100
n=15
t=np.linspace(0, 1, m)

# In[66]:

# Vandermonde matrix
A=np.stack([t**i for i in range(n)], 1)

# In[67]:

b=np.exp(np.sin(4*t))

# This will turn out to normalize the solution to be 1
b /= 2006.787453080206

# In[68]:

from matplotlib import pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')

# In[69]:

plt.plot(t, b)

# Check that we get 1:

# In[58]:

1 - ls_qr(A, b)[14]

# Bad condition number:

# In[60]:

kappa = np.linalg.cond(A); kappa

# In[181]:

row_names = ['Normal Eqns- Naive',
             'QR Factorization',
             'SVD',
             'Scipy lstsq']

name2func = {'Normal Eqns- Naive': 'ls_naive',
             'QR Factorization': 'ls_qr',
             'SVD': 'ls_svd',
             'Scipy lstsq': 'scipylstq'}

# In[74]:

pd.options.display.float_format = '{:,.9f}'.format
df = pd.DataFrame(index=row_names, columns=['Time', 'Error'])

# In[75]:

for name in row_names:
    fcn = name2func[name]
    t = timeit.timeit(fcn + '(A,b)', number=5, globals=globals())
    coeffs = locals()[fcn](A, b)
    df.loc[name, 'Time'] = t
    df.loc[name, 'Error'] = np.abs(1 - coeffs[-1])

# In[76]:

df

# The solution for least squares via the normal equations is unstable in general, although stable for problems with small condition numbers.
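# One way to see why (an added note, not part of the original): forming $A^TA$ squares the 2-norm condition number, so the normal equations lose roughly twice as many digits of accuracy as a method that works with $A$ directly.

# In[ ]:

# condition number of the Vandermonde matrix above vs. that of A^T A (roughly kappa squared)
np.linalg.cond(A), np.linalg.cond(A.T @ A)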
# ### Low-rank

# In[258]:

m = 100
n = 10
x = np.random.uniform(-10,10,n)
A2 = np.random.uniform(-40,40, [m, int(n/2)])   # removed np.asfortranarray
A = np.hstack([A2, A2])

# In[259]:

A.shape, A2.shape

# In[260]:

b = A @ x + np.random.normal(0,1,m)

# In[263]:

row_names = ['Normal Eqns- Naive',
             'QR Factorization',
             'SVD',
             'Scipy lstsq']

name2func = {'Normal Eqns- Naive': 'ls_naive',
             'QR Factorization': 'ls_qr',
             'SVD': 'ls_svd',
             'Scipy lstsq': 'scipylstq'}

# In[264]:

pd.options.display.float_format = '{:,.9f}'.format
df = pd.DataFrame(index=row_names, columns=['Time', 'Error'])

# In[265]:

for name in row_names:
    fcn = name2func[name]
    t = timeit.timeit(fcn + '(A,b)', number=5, globals=globals())
    coeffs = locals()[fcn](A, b)
    df.loc[name, 'Time'] = t
    df.loc[name, 'Error'] = regr_metrics(b, A @ coeffs)[0]

# In[266]:

df

# ## Comparison

# Our results from above:

# In[257]:

df

# From Trefethen (page 84):
# 
# Normal equations/Cholesky is fastest when it works. Cholesky can only be used on symmetric, positive definite matrices. Also, normal equations/Cholesky is unstable for matrices with high condition numbers or that are low-rank.
# 
# Linear regression via QR has been recommended by numerical analysts as the standard method for years. It is natural, elegant, and good for "daily use".
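# A closing aside connecting the quote's low-rank point back to the example above (my addition, not from the notes): A was built with only n/2 independent columns, and the SVD-based pseudo-inverse returns the minimum-norm least squares solution, which copes with the rank deficiency gracefully.

# In[ ]:

# numerical rank of the low-rank A, and the fit from the minimum-norm solution
np.linalg.matrix_rank(A), regr_metrics(b, A @ (np.linalg.pinv(A) @ b))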