#!/usr/bin/env python
# coding: utf-8

# # Building Linear Regression Model in Python

# ### Import Packages and Data

# In Python, a convenient way to build a linear regression model is the popular machine learning package `Scikit Learn`. It provides many built-in models, from the basic regression models covered in this post to the more complex models and methods covered in later posts. See the [official guide](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) for details.

# In[1]:


# import packages
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression


# In[2]:


# import dataset
data = pd.read_csv("meuse.csv")


# In[3]:


# View data
data.head()


# ### Build a Model using Simple Linear Regression

# Linear regression is one of the most traditional ways of examining the relationship between predictors and a response variable. As we discussed in [a previous post](https://oscrproject.wixsite.com/website/post/purpose-of-machine-learning-and-modeling-for-digital-humanities-and-social-sciences) about the general idea of modeling and machine learning, one purpose of modeling is inference about the relationships among variables.
# 
# Goal: examine the relationship between the topsoil lead concentration (`lead` column, as y-axis) and the topsoil cadmium concentration (`cadmium` column, as x-axis). 
# 
# Using the `Scikit Learn` package, we have:

# In[8]:


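# Fit on the instance we keep, so the fitted coefficients stay available.
regression_model = LinearRegression()
# scikit-learn expects a 2-D feature array of shape (n_samples, 1).
regression_model.fit(data["cadmium"].to_numpy().reshape(-1, 1), data["lead"])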


# Note that the `cadmium` column must be reshaped into a two-dimensional array, i.e. one column and as many rows as there are observations, because `fit` expects a 2-D feature array. The next several notes show how to visualize and analyze the simple linear regression model.
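
# The reshaping step can be checked on a small example. The cell below is a minimal sketch, not part of the original analysis: it uses made-up stand-in data (`x_demo`, `y_demo`, both hypothetical names) instead of `meuse.csv`, with `y_demo` constructed to lie exactly on the line y = 2x + 1, so the fitted slope and intercept should come out close to 2 and 1.

# In[ ]:


import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data (NOT taken from meuse.csv).
x_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_demo = 2.0 * x_demo + 1.0

# A 1-D array of shape (4,) becomes a 2-D array of shape (4, 1):
# four rows (observations) and one column (feature).
X_demo = x_demo.reshape(-1, 1)

# Fit the model and read the estimated parameters from the fitted instance.
demo_model = LinearRegression().fit(X_demo, y_demo)
slope_demo = demo_model.coef_[0]        # one coefficient per feature column
intercept_demo = demo_model.intercept_  # estimated intercept term

print(X_demo.shape, slope_demo, intercept_demo)


# The same attributes, `coef_` and `intercept_`, are available on `regression_model` once it has been fitted on the `cadmium` and `lead` columns.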