import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
X = np.random.rand(100)
y = X + 0.1 * np.random.randn(100)
plt.scatter(X, y);
plt.show()
This walkthrough follows the steps prescribed by Jake VanderPlas in his awesome text Python Data Science Handbook. He has kindly provided all his code on GitHub as well.
In this case we are using linear regression.
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
X = X.reshape(-1, 1)
X.shape
(100, 1)
model.fit(X, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
model.coef_
array([ 0.97408915])
model.intercept_
0.022535905418693603
If you are statistically trained, you would normally dig into further diagnostics, such as checking the residuals for normality and autocorrelation. You may also want to evaluate the estimated parameters as well. Those are valid statistical modelling questions.
Machine learning, in contrast, focuses on prediction. You will not find this kind of diagnostic information in the scikit-learn package. Do take note of this key difference between statistics and machine learning.
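The statistical checks mentioned above can still be done by hand if you want them. The following is only a minimal sketch using plain NumPy, not something scikit-learn provides: the seeded generator, the variable names, and the choice of Durbin-Watson and skewness as the diagnostics are all assumptions made for illustration, and the data is regenerated here so the block runs on its own.

```python
import numpy as np

# Synthetic data in the same spirit as the example above. The fixed seed is
# an assumption added here so the sketch is reproducible.
rng = np.random.default_rng(0)
x = rng.random(100)
y_obs = x + 0.1 * rng.standard_normal(100)

# Ordinary least squares with an intercept, via the design matrix [1, x].
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y_obs, rcond=None)
residuals = y_obs - A @ beta

# Standard errors of the intercept and slope from the usual OLS formula.
n, p = A.shape
sigma2 = residuals @ residuals / (n - p)
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))

# Durbin-Watson statistic: values near 2 suggest little autocorrelation.
durbin_watson = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Sample skewness of the residuals: values near 0 are consistent with symmetry.
skewness = np.mean(residuals ** 3) / np.mean(residuals ** 2) ** 1.5

print("intercept, slope:", beta)
print("standard errors:", std_err)
print("Durbin-Watson:", durbin_watson)
print("residual skewness:", skewness)
```

For a fuller set of diagnostics, a dedicated statistics package is usually a better fit than hand-rolled formulas like these.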
x_test = np.linspace(0, 1)
x_test
array([ 0. , 0.02040816, 0.04081633, 0.06122449, 0.08163265, 0.10204082, 0.12244898, 0.14285714, 0.16326531, 0.18367347, 0.20408163, 0.2244898 , 0.24489796, 0.26530612, 0.28571429, 0.30612245, 0.32653061, 0.34693878, 0.36734694, 0.3877551 , 0.40816327, 0.42857143, 0.44897959, 0.46938776, 0.48979592, 0.51020408, 0.53061224, 0.55102041, 0.57142857, 0.59183673, 0.6122449 , 0.63265306, 0.65306122, 0.67346939, 0.69387755, 0.71428571, 0.73469388, 0.75510204, 0.7755102 , 0.79591837, 0.81632653, 0.83673469, 0.85714286, 0.87755102, 0.89795918, 0.91836735, 0.93877551, 0.95918367, 0.97959184, 1. ])
y_pred = model.predict(x_test.reshape(-1,1))
plt.scatter(X, y)
plt.plot(x_test, y_pred);
plt.show()
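Since the focus here is on prediction, a natural next step is to measure how well the model predicts data it has not seen. The following is a hedged sketch rather than part of the original walkthrough: the 75/25 split, the fixed seeds, and the names `X_all`, `y_all`, and `holdout_model` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Regenerate comparable synthetic data; the seed is an assumption for reproducibility.
rng = np.random.default_rng(0)
X_all = rng.random((100, 1))
y_all = X_all[:, 0] + 0.1 * rng.standard_normal(100)

# Hold out a quarter of the points so the score reflects unseen data.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X_all, y_all, test_size=0.25, random_state=0
)

# Fit on the training portion only, then predict the held-out portion.
holdout_model = LinearRegression(fit_intercept=True)
holdout_model.fit(X_train, y_train)
y_holdout_pred = holdout_model.predict(X_holdout)

# Two common regression metrics: mean squared error and R^2.
mse = mean_squared_error(y_holdout, y_holdout_pred)
r2 = r2_score(y_holdout, y_holdout_pred)
print("held-out MSE:", mse)
print("held-out R^2:", r2)
```

Scoring on held-out points, rather than on the same points used for fitting, gives a more honest picture of predictive performance.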