Goal: Obtain a statistical summary of the linear regression line between the topsoil lead concentration (lead column, y-axis) and the topsoil cadmium concentration (cadmium column, x-axis).
### Previous steps necessary
# import packages
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
# import dataset
data = pd.read_csv("meuse.csv")
# build the model
regression_model = LinearRegression()
# scikit-learn expects a 2-D feature array, so reshape the column to (n_samples, 1)
lr = regression_model.fit(data.cadmium.values.reshape((-1, 1)), data.lead)
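The reshape to `(-1, 1)` is needed because scikit-learn estimators expect the feature matrix as a 2-D array of shape (n_samples, n_features), while a single column is 1-D. A minimal sketch of the shape change, using NumPy only and made-up values that are not taken from meuse.csv:

```python
import numpy as np

# Made-up values standing in for a single feature column (an assumption, not real data)
feature = np.array([1.2, 3.4, 5.6, 7.8])

# -1 lets NumPy infer the number of rows; 1 fixes a single feature column
X = feature.reshape((-1, 1))

print(feature.shape)  # (4,)
print(X.shape)        # (4, 1)
```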
R-squared measures how closely the data fit the regression line.
print(lr.score(data.cadmium.values.reshape((-1, 1)), data.lead))
0.6383156080918473
Here, we can conclude that about 64% of the variance in lead concentration can be explained by cadmium concentration through the linear model lr.
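To make the meaning of this number concrete, R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below is illustrative only: the helper name `r_squared` and the toy arrays are assumptions (the values are made up, not taken from meuse.csv), and `np.polyfit` stands in for the fitted model.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained by the line
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Made-up toy values for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

slope, intercept = np.polyfit(x, y, 1)  # least-squares straight line
y_hat = slope * x + intercept
r2 = r_squared(y, y_hat)
print(r2)
```

Applied to the fitted model, `r_squared(data.lead, lr.predict(data.cadmium.values.reshape((-1, 1))))` should agree with the value returned by `lr.score`, since scikit-learn regressors report this same quantity.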
For more advanced material, please refer to our later posts on advanced topics in linear regression modeling.