We'll use the famous Boston house prices dataset. (Note: `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the snippet below requires an older scikit-learn version.)
import pandas as pd
from sklearn.datasets import load_boston
boston = load_boston()
df = pd.DataFrame(boston['data'], columns=boston['feature_names'])
df.head()
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |
dfX = df[['RM', 'AGE', 'DIS', 'LSTAT']]
X = dfX.values
y = boston['target']
We could explore the data with more plots first, but let's get straight to the machine learning!
First, we'll just use one feature:
import matplotlib.pyplot as plt

x = X[:, 0].reshape(-1, 1)  # RM (average number of rooms), as a 2-D column
plt.plot(x, y, 'o')
(Scatter plot of RM against median house price.)
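The `reshape(-1, 1)` above matters: scikit-learn expects a 2-D feature matrix even when there is only one feature, and the `-1` lets NumPy infer the number of rows. A quick illustration:

```python
import numpy as np

a = np.arange(5)           # shape (5,) - a 1-D vector
col = a.reshape(-1, 1)     # shape (5, 1) - a single-column matrix
print(a.shape, col.shape)  # (5,) (5, 1)
```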
import numpy as np
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x, y)
y_hat = reg.predict(x)
reg.coef_, reg.intercept_
(array([9.10210898]), -34.67062077643857)
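The slope (about 9.1) and intercept (about −34.7) are everything the fitted model stores: a prediction is just `coef_ * x + intercept_`. A small synthetic sketch (the names `xs`, `ys`, `lin` are ours, not from the dataset above) confirming that `predict` matches the manual formula:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 3x - 2 with no noise,
# so the fit should recover the slope and intercept exactly.
rng = np.random.default_rng(0)
xs = rng.uniform(0, 10, size=(50, 1))
ys = 3 * xs.ravel() - 2

lin = LinearRegression().fit(xs, ys)
# For a one-feature linear model, predict() is coef_ * x + intercept_
manual = xs.ravel() * lin.coef_[0] + lin.intercept_
print(np.allclose(manual, lin.predict(xs)))  # True
print(lin.coef_[0], lin.intercept_)          # ~3.0, ~-2.0
```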
model_x = np.arange(3, 10, 0.1).reshape(-1, 1)
model_y = reg.predict(model_x)
plt.plot(x, y, 'o')
plt.plot(model_x, model_y)
(Scatter plot with the fitted regression line overlaid.)
from sklearn.metrics import r2_score
r2_score(y, y_hat)
0.48352545599133423
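An R² of about 0.48 means the single RM feature explains roughly half the variance in prices. Under the hood, `r2_score` computes R² = 1 − SS_res/SS_tot; a hand-rolled version (with illustrative made-up values) matches sklearn's:

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

yt = np.array([3.0, -0.5, 2.0, 7.0])
yp = np.array([2.5, 0.0, 2.0, 8.0])
print(r2_manual(yt, yp), r2_score(yt, yp))  # both ~0.9486
```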