Analyzing a Linear Regression Model in Python

Analyze a Simple Linear Model

Goal: Obtain a statistical summary of the linear regression line between the topsoil lead concentration (lead column, on the y-axis) and the topsoil cadmium concentration (cadmium column, on the x-axis).

In [1]:

### Previous steps necessary
# import packages
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
# import dataset
data = pd.read_csv("meuse.csv")
# build the model; scikit-learn expects a 2-D feature array, so the
# cadmium column is converted with .values and reshaped to one column
lr = LinearRegression().fit(data.cadmium.values.reshape((-1, 1)), data.lead)
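Beyond fitting, the first items in a statistical summary of a simple linear model are usually the slope and intercept of the line. The sketch below shows how they could be read from a fitted scikit-learn model. It does not use meuse.csv: the arrays `x` and `y` are made-up values chosen to lie exactly on y = 2x + 1, and the names `model`, `slope`, and `intercept` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up values standing in for the cadmium (x) and lead (y) columns.
# They are NOT taken from meuse.csv; they lie exactly on y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

# scikit-learn expects a 2-D feature matrix of shape (n_samples, n_features)
model = LinearRegression().fit(x.reshape(-1, 1), y)

slope = model.coef_[0]        # change in y per unit change in x
intercept = model.intercept_  # predicted y when x is 0
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

With the real data, the same two attributes on `lr` would give the estimated change in lead per unit of cadmium and the predicted lead at zero cadmium.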

R^2

R-squared measures how closely the data fit the regression line: it is the proportion of the variance in the response that the model explains.

In [7]:

print(lr.score(data.cadmium.values.reshape((-1, 1)), data.lead))

0.6383156080918473

Here, we can conclude that about 64% of the variance in the topsoil lead concentration is explained by the linear model on cadmium (lr).
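To make the "variance explained" reading concrete, R-squared can be computed by hand as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean of y. The sketch below checks that this matches the value returned by `score`. The data are made-up numbers, not values from meuse.csv, and the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up values for illustration only (not from meuse.csv)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.7])

X = x.reshape(-1, 1)  # 2-D feature matrix, one column
model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

ss_res = np.sum((y - y_hat) ** 2)      # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation of y around its mean
r2_manual = 1.0 - ss_res / ss_tot

r2_sklearn = model.score(X, y)         # scikit-learn's R-squared
print(r2_manual, r2_sklearn)
```

The two printed numbers should agree up to floating-point rounding, which is why a score of roughly 0.64 can be read as roughly 64% of the variance explained.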

For more advanced analyses, please refer to our later posts on advanced topics in linear regression modeling.