In Python, a handy way to build a linear regression model is to use scikit-learn, a very popular machine learning package. It contains many built-in models, from the basic regression models covered in this post to the more complex models and methods covered in later posts. You may want to check the official guide.
# import packages
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
# import dataset
data = pd.read_csv("meuse.csv")
# View data
data.head()
| | x | y | cadmium | copper | lead | zinc | elev | dist | om | ffreq | soil | lime | landuse | dist.m |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 181072 | 333611 | 11.7 | 85 | 299 | 1022 | 7.909 | 0.001358 | 13.6 | 1 | 1 | 1 | Ah | 50 |
1 | 181025 | 333558 | 8.6 | 81 | 277 | 1141 | 6.983 | 0.012224 | 14.0 | 1 | 1 | 1 | Ah | 30 |
2 | 181165 | 333537 | 6.5 | 68 | 199 | 640 | 7.800 | 0.103029 | 13.0 | 1 | 1 | 1 | Ah | 150 |
3 | 181298 | 333484 | 2.6 | 81 | 116 | 257 | 7.655 | 0.190094 | 8.0 | 1 | 2 | 0 | Ga | 270 |
4 | 181307 | 333330 | 2.8 | 48 | 117 | 269 | 7.480 | 0.277090 | 8.7 | 1 | 2 | 0 | Ah | 380 |
Linear regression is one of the most traditional ways of examining the relationship between predictors and a response variable. As discussed in a previous post about the general idea of modeling and machine learning, one purpose of modeling is inference: understanding how the variables relate to each other.
Goal: examine the relationship between the topsoil lead concentration (the `lead` column, on the y-axis) and the topsoil cadmium concentration (the `cadmium` column, on the x-axis).
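Stated as a model, and assuming the usual simple linear regression form (the symbols $\beta_0$, $\beta_1$ and $\varepsilon_i$ are notation introduced here for illustration, not taken from the dataset), the goal amounts to estimating:

$$
\text{lead}_i = \beta_0 + \beta_1 \, \text{cadmium}_i + \varepsilon_i
$$

Here $\beta_0$ is the intercept, $\beta_1$ is the slope (the expected change in lead concentration per unit change in cadmium concentration), and $\varepsilon_i$ is the error term for observation $i$.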
Using the scikit-learn package, we have:
# Fit the model we created; .values gives the NumPy array, which supports reshape
regression_model = LinearRegression()
regression_model.fit(data.cadmium.values.reshape((-1, 1)), data.lead)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
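Once a model is fitted, its estimated slope and intercept can be read from the `coef_` and `intercept_` attributes. The following is a minimal sketch rather than the full analysis: it uses only the five rows shown in the `data.head()` output above, so the numbers it prints will differ from a fit on the whole dataset. The variable names (`cadmium`, `lead`, `model`) and the cadmium value of 5.0 used for prediction are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# First five rows of the cadmium and lead columns, copied from data.head() above.
# This is only a small sample for illustration, not the full meuse dataset.
cadmium = np.array([11.7, 8.6, 6.5, 2.6, 2.8])
lead = np.array([299, 277, 199, 116, 117])

# scikit-learn expects features with shape (n_samples, n_features)
X = cadmium.reshape(-1, 1)

model = LinearRegression().fit(X, lead)

slope = model.coef_[0]            # estimated change in lead per unit of cadmium
intercept = model.intercept_      # estimated lead when cadmium is 0
r_squared = model.score(X, lead)  # coefficient of determination on this sample

print(f"lead = {intercept:.2f} + {slope:.2f} * cadmium (R^2 = {r_squared:.3f})")

# Predict lead for a hypothetical cadmium value of 5.0 (input must also be 2-D)
predicted = model.predict(np.array([[5.0]]))
print(predicted)
```

Keeping a reference to the fitted estimator (here `model`) is what makes these attributes and `predict` available afterwards.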
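Please note that we have to reshape the `cadmium` column into a two-dimensional array, i.e. one column and as many rows as there are observations, because scikit-learn expects the features to have shape (n_samples, n_features). Calling reshape directly on a pandas Series is deprecated, so go through .values first. Please refer to our next several notes on how to visualize and analyze the simple linear regression model.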
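The reshaping step described above can be sketched as follows. This example builds a small stand-in DataFrame from the five cadmium values shown earlier (an assumption made so the snippet runs without `meuse.csv`) and compares the shapes before and after reshaping. Selecting the column with double brackets is shown as an alternative that yields a two-dimensional object directly.

```python
import numpy as np
import pandas as pd

# Stand-in for the real dataset: the five cadmium values shown by data.head()
data = pd.DataFrame({"cadmium": [11.7, 8.6, 6.5, 2.6, 2.8]})

# A single column is one-dimensional
one_dim = data["cadmium"].values

# reshape(-1, 1): one column, with the number of rows inferred from the data
two_dim = one_dim.reshape(-1, 1)

# Alternative: double brackets select a one-column DataFrame, which is already 2-D
as_frame = data[["cadmium"]]

print(one_dim.shape, two_dim.shape, as_frame.shape)
# → (5,) (5, 1) (5, 1)
```

Either `two_dim` or `as_frame` can be passed as the feature argument to `fit`.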