We load the dataset 'diabetes' using the sklearn load function:
from sklearn import datasets
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
The dataset consists of data and targets. For each example in the data, the target is the desired output:
X = diabetes.data
y = diabetes.target
print(X.shape)
print(y.shape)
(442, 10) (442,)
We want to split the data into a training set and a test set. We fit the linear model on the training set and then show that it also performs well on the test set.
Before splitting the data, we shuffle (mix) the examples, because in some datasets the examples are ordered.
If we did not shuffle, the training set and the test set could look very different, so a linear model fitted on the training set might not be valid on the test set. Now we shuffle:
from sklearn.utils import shuffle
X, y = shuffle(X, y, random_state=1)
print(X.shape)
print(y.shape)
(442, 10) (442,)
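The key property of the shuffle is that X and y are reordered with the same permutation, so every example keeps its own target. The following is a minimal sketch of that idea on made-up toy arrays (X_toy, y_toy and the variable names are illustrative, and this is not necessarily how sklearn.utils.shuffle is implemented internally):

```python
import numpy as np

# Toy data: row i of X_toy belongs with y_toy[i] (here y is always 10 * x)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([0.0, 10.0, 20.0, 30.0, 40.0])

# Draw ONE permutation of the row indices and apply it to both arrays
rng = np.random.RandomState(1)
perm = rng.permutation(len(X_toy))
X_shuffled = X_toy[perm]
y_shuffled = y_toy[perm]

# Every example still carries its own target after shuffling
pairs_intact = np.allclose(X_shuffled[:, 0] * 10, y_shuffled)
print(pairs_intact)
```

Shuffling the two arrays independently would break this pairing, which is why both are passed to shuffle in a single call.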
Each example has 10 columns (features) in total.
We want to work with one-dimensional data because it is simple to visualize. Therefore we select only one column, e.g. column 2, and fit a linear model on it:
# Use only one column from data
print(X.shape)
X = X[:, 2:3]
print(X.shape)
(442, 10) (442, 1)
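Note that the slice 2:3 is used instead of the plain index 2. scikit-learn estimators expect X to be two-dimensional, with shape (n_samples, n_features), and a slice keeps the column axis while an integer index drops it. A small sketch on a made-up toy array (X_toy is illustrative only):

```python
import numpy as np

X_toy = np.arange(12).reshape(4, 3)  # 4 examples, 3 columns

col_2d = X_toy[:, 2:3]  # slice: keeps the column axis
col_1d = X_toy[:, 2]    # integer index: drops the column axis

print(col_2d.shape)  # (4, 1)
print(col_1d.shape)  # (4,)
```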
Split the data into training/testing sets
train_set_size = 250
X_train = X[:train_set_size] # selects first 250 rows (examples) for train set
X_test = X[train_set_size:] # selects from row 250 until the last one for test set
print(X_train.shape)
print(X_test.shape)
(250, 1) (192, 1)
Split the targets into training/testing sets
y_train = y[:train_set_size] # selects first 250 rows (targets) for train set
y_test = y[train_set_size:] # selects from row 250 until the last one for test set
print(y_train.shape)
print(y_test.shape)
(250,) (192,)
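As an alternative to shuffling and slicing by hand, scikit-learn provides train_test_split, which shuffles and splits in one call. The sketch below assumes a scikit-learn version that has the sklearn.model_selection module; the variable names (X_one, y_all, X_tr, ...) are illustrative. It produces sets of the same sizes as above, but not necessarily the same rows, so the fitted numbers later in this text would differ.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes()
X_one = diabetes.data[:, 2:3]  # keep only column 2, as a 2-D array
y_all = diabetes.target

# Shuffle and split data and targets together in a single call
X_tr, X_te, y_tr, y_te = train_test_split(
    X_one, y_all, train_size=250, test_size=192, random_state=1)

print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```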
Now we can look at our training and test data. We can see that the examples show a roughly linear relation between the data and the target.
Therefore, we can use a linear model to make reasonable predictions for our examples.
import matplotlib.pyplot as plt
# Scatter plot of the training and the test examples
plt.scatter(X_train, y_train)
plt.scatter(X_test, y_test)
plt.xlabel('Data')
plt.ylabel('Target');
Create a linear regression object, which we use later to fit a linear model to the data
from sklearn import linear_model
regr = linear_model.LinearRegression()
Fit the model using the training set
regr.fit(X_train, y_train);
The fit gives us the coefficient (the slope) and the bias (the intercept)
print(regr.coef_)
print(regr.intercept_)
[ 865.04619508] 151.179169728
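With a single input column, ordinary least squares has a simple closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below illustrates this on made-up toy values (x_toy and y_toy are not taken from the diabetes dataset) and checks the result against np.polyfit:

```python
import numpy as np

# Made-up toy data, roughly on a line
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

x_mean, y_mean = x_toy.mean(), y_toy.mean()

# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
slope = np.sum((x_toy - x_mean) * (y_toy - y_mean)) / np.sum((x_toy - x_mean) ** 2)
# The fitted line passes through (mean_x, mean_y)
intercept = y_mean - slope * x_mean

# Cross-check against NumPy's degree-1 least-squares polynomial fit
ref_slope, ref_intercept = np.polyfit(x_toy, y_toy, deg=1)
matches = np.allclose([slope, intercept], [ref_slope, ref_intercept])
print(slope, intercept, matches)
```

Here slope and intercept play the roles of regr.coef_ and regr.intercept_ in the one-column case.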
Now we calculate the mean squared error on the training set and on the test set
import numpy as np
# The mean squared error
print("Training error: ", np.mean((regr.predict(X_train) - y_train) ** 2))
print("Test error: ", np.mean((regr.predict(X_test) - y_test) ** 2))
('Training error: ', 3800.1408249628944) ('Test error: ', 4047.2429967010571)
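A mean squared error is easier to interpret next to a simple baseline, for instance a model that always predicts the mean of the targets. The following is a sketch on made-up toy values; the helper mse and the arrays are illustrative and do not come from the diabetes data:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between predictions and targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

y_true = np.array([3.0, 5.0, 7.0])
y_model = np.array([2.0, 5.0, 9.0])                # predictions of some model
y_baseline = np.full_like(y_true, y_true.mean())   # always predict the mean

model_error = mse(y_true, y_model)        # (1 + 0 + 4) / 3
baseline_error = mse(y_true, y_baseline)  # (4 + 0 + 4) / 3
print(model_error, baseline_error)
```

A model whose error is clearly below the baseline error has learned something beyond the average target value.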
Now we want to plot the training data and targets (marked as dots).
The line represents the predictions of the linear model that we found:
# Plot dots, where each dot represents a data example and its corresponding target
plt.scatter(X_train, y_train, color='black')
# Plots the linear model
plt.plot(X_train, regr.predict(X_train), color='blue', linewidth=3);
plt.xlabel('Data')
plt.ylabel('Target')
We do the same with the test data, to check visually that the linear model also fits the test set:
# Plot dots, where each dot represents a data example and its corresponding target
plt.scatter(X_test, y_test, color='black')
# Plots the linear model
plt.plot(X_test, regr.predict(X_test), color='blue', linewidth=3);
plt.xlabel('Data')
plt.ylabel('Target');