#!/usr/bin/env python
# coding: utf-8

# # Building Linear Regression Model in Python

# ### Import Packages and Data

# In Python, a very handy way of building a linear regression model is to use the popular machine learning package `scikit-learn`. This package contains many built-in models, from the basic regression models in this post to the more complex models and methods in later posts. You may want to check the [official guide](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).

# In[1]:


# import packages
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression


# In[2]:


# import dataset
data = pd.read_csv("meuse.csv")


# In[3]:


# View data
data.head()


# ### Build a Model using Simple Linear Regression

# Linear regression is one of the most traditional ways of examining the relationship between predictor variables and a response variable. As we discussed in [a previous post](https://oscrproject.wixsite.com/website/post/purpose-of-machine-learning-and-modeling-for-digital-humanities-and-social-sciences) about the general idea of modeling and machine learning, our purpose may be to infer the relationships among variables.
#
# Goal: examine the relationship between the topsoil lead concentration (`lead` column, as y-axis) and the topsoil cadmium concentration (`cadmium` column, as x-axis).
#
# Using the `scikit-learn` package, we have:

# In[8]:


# Create the model, then fit that same instance so the fitted
# coefficients remain available on `regression_model`.
regression_model = LinearRegression()
# A pandas Series has no `reshape` method, so convert to a NumPy array first.
regression_model.fit(data.cadmium.to_numpy().reshape(-1, 1), data.lead)


# Please note that we have to reshape the `cadmium` column to be two-dimensional, i.e. one column and as many rows as there are observations, because `fit` expects the predictors as a 2-D array. Please refer to our next several notes about how to visualize and analyze the simple linear regression model.
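
# The reshaping step described above can be sketched on a small example. This is only a minimal illustration, not the analysis of `meuse.csv`: the arrays `x` and `y` are hypothetical, synthetic stand-ins for the `cadmium` and `lead` columns, and the values are made up so that the fitted slope and intercept are easy to check by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical, synthetic stand-ins for the cadmium (x) and lead (y) columns.
# These numbers are invented for illustration and are NOT taken from meuse.csv.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 3.0 * x + 2.0  # points lying exactly on a line with slope 3, intercept 2

# scikit-learn expects predictors with shape (n_samples, n_features),
# so the 1-D array of 5 values becomes a 5-by-1 column.
X = x.reshape(-1, 1)

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]         # one coefficient per predictor column
intercept = model.intercept_   # fitted intercept term
r_squared = model.score(X, y)  # coefficient of determination on the same data

print(X.shape, slope, intercept, r_squared)
```

# With a pandas column, selecting with double brackets, e.g. `data[["cadmium"]]`, should also yield a two-dimensional input without an explicit reshape.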