Linear Regression¶

In [1]:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:

# Synthetic data: 100 points where y equals x plus Gaussian noise (standard deviation 0.1)
X = np.random.rand(100)
y = X + 0.1 * np.random.randn(100)

In [3]:

plt.scatter(X, y);
plt.show()

This notebook follows the steps prescribed by Jake VanderPlas in his awesome text, the Python Data Science Handbook. He has kindly made all of his code available on GitHub as well.

Step 1. Choose a class of model.¶

In this case, we are using linear regression.
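
For reference, linear regression with a single feature fits a straight line by least squares. The symbols below ($w$ for the slope, $b$ for the intercept, $n$ for the number of samples) are notation chosen here for illustration, not taken from the notebook:

$$\hat{y} = w x + b, \qquad (w, b) = \arg\min_{w,\, b} \sum_{i=1}^{n} \left( y_i - w x_i - b \right)^2$$

After fitting, `model.coef_` holds $w$ and `model.intercept_` holds $b$.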

In [4]:

from sklearn.linear_model import LinearRegression

Step 2. Choose model hyperparameters.¶

In [5]:

# fit_intercept=True (the default) estimates an intercept term in addition to the slope
model = LinearRegression(fit_intercept=True)

Step 3. Arrange data into a features matrix and target vector¶

In [6]:

# scikit-learn expects a 2-D features matrix of shape (n_samples, n_features)
X = X.reshape(-1, 1)

In [7]:

X.shape

Out[7]:

(100, 1)
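
The reshape matters because scikit-learn estimators take the features as a two-dimensional array of shape (n_samples, n_features), while the target can stay one-dimensional. A minimal sketch with NumPy only, showing two equivalent ways to build a single-column features matrix (the seed and the names `x`, `X_a`, `X_b` are assumptions for illustration, not from the notebook):

```python
import numpy as np

# Illustrative data only; the seed is an assumption, not from the notebook.
rng = np.random.default_rng(0)
x = rng.random(100)        # 1-D array, shape (100,)

# Two equivalent ways to get a single-column features matrix.
X_a = x.reshape(-1, 1)     # -1 lets NumPy infer the number of rows
X_b = x[:, np.newaxis]     # add a new axis of length 1

print(x.shape, X_a.shape, X_b.shape)
print(np.array_equal(X_a, X_b))
```

Either form should work as the first argument to `model.fit`; the target `y` does not need reshaping.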

Step 4. Fit the model to your data.¶

In [8]:

model.fit(X, y)

Out[8]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [9]:

model.coef_

Out[9]:

array([ 0.97408915])

In [10]:

model.intercept_

Out[10]:

0.022535905418693603
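
The data were generated with slope 1 and intercept 0, so the fitted values are close to the truth. As a sanity check, the same kind of least-squares fit can be reproduced with NumPy alone. This is only a sketch: the seed and variable names are assumptions, so the numbers will not match the outputs above exactly.

```python
import numpy as np

# Regenerate data the same way as the notebook, but with a fixed seed
# (the seed is an assumption, added here for reproducibility).
rng = np.random.default_rng(0)
x = rng.random(100)
y = x + 0.1 * rng.standard_normal(100)

# Design matrix: the feature column plus a column of ones for the intercept.
A = np.column_stack([x, np.ones_like(x)])

# Ordinary least squares solution of A @ [slope, intercept] ~ y.
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

print(slope, intercept)
```

The slope should land near 1 and the intercept near 0, just as `model.coef_` and `model.intercept_` do.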

If you are statistically trained, you would normally dig into further diagnostics, such as checking the normality of the residuals and testing for autocorrelation. You may also want to evaluate the parameter estimates. Those are valid statistical modelling questions.

Machine learning, by contrast, focuses on prediction. You will not find this information in the scikit-learn package. Do take note of this key difference between statistics and machine learning.
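
If you do want these diagnostics, they can be computed by hand from the residuals. Below is a minimal sketch with NumPy only, assuming data generated the same way as above. The seed, the variable names, and the choice of the Durbin-Watson statistic as the autocorrelation check are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

# Same data-generating process as the notebook; the seed is an assumption.
rng = np.random.default_rng(0)
n = 100
x = rng.random(n)
y = x + 0.1 * rng.standard_normal(n)

# Least-squares fit with an intercept column.
A = np.column_stack([x, np.ones(n)])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ beta

# Estimate of the noise variance: RSS / (n - number of parameters).
sigma2 = residuals @ residuals / (n - 2)

# Standard errors from the diagonal of sigma2 * (A^T A)^{-1}.
cov = sigma2 * np.linalg.inv(A.T @ A)
se_slope, se_intercept = np.sqrt(np.diag(cov))

# Durbin-Watson statistic: values near 2 suggest little
# first-order autocorrelation in the residuals.
dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

print(beta, se_slope, se_intercept, dw)
```

The residuals could also be passed to a histogram or a Q-Q plot to inspect normality visually.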

Step 5. Predict labels for unknown data¶

In [11]:

# 50 evenly spaced points in [0, 1] (np.linspace defaults to num=50)
x_test = np.linspace(0, 1)
x_test

Out[11]:

array([ 0.        ,  0.02040816,  0.04081633,  0.06122449,  0.08163265,
        0.10204082,  0.12244898,  0.14285714,  0.16326531,  0.18367347,
        0.20408163,  0.2244898 ,  0.24489796,  0.26530612,  0.28571429,
        0.30612245,  0.32653061,  0.34693878,  0.36734694,  0.3877551 ,
        0.40816327,  0.42857143,  0.44897959,  0.46938776,  0.48979592,
        0.51020408,  0.53061224,  0.55102041,  0.57142857,  0.59183673,
        0.6122449 ,  0.63265306,  0.65306122,  0.67346939,  0.69387755,
        0.71428571,  0.73469388,  0.75510204,  0.7755102 ,  0.79591837,
        0.81632653,  0.83673469,  0.85714286,  0.87755102,  0.89795918,
        0.91836735,  0.93877551,  0.95918367,  0.97959184,  1.        ])

In [12]:

y_pred = model.predict(x_test.reshape(-1,1))
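
For a fitted linear regression, `predict` simply evaluates the fitted line: each input is multiplied by the coefficient and the intercept is added. A short sketch that reproduces this by hand, using the slope and intercept printed earlier (the names `coef`, `intercept` and `y_manual` are illustrative):

```python
import numpy as np

# Values copied from the outputs of model.coef_ and model.intercept_ above.
coef = 0.97408915
intercept = 0.022535905418693603

x_test = np.linspace(0, 1)              # 50 points by default
y_manual = coef * x_test + intercept    # the fitted line evaluated on the grid

print(y_manual[0], y_manual[-1])
```

Up to the rounding of the printed coefficient, `y_manual` should agree with `y_pred`.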

In [13]:

# Training data as points, with the fitted line drawn over the test grid
plt.scatter(X, y)
plt.plot(x_test, y_pred);
plt.show()