Today we will learn linear regression, which is one of the simplest models in machine learning.
In previous lab practice, we learned how to import data from UCI machine learning dataset repository.
Download winequality-red.csv from the UCI Machine Learning Repository on Kaggle, unzip it, and put it in the same directory as this notebook.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
Check the CSV preview on Kaggle. We will use the first three columns: fixed acidity, volatile acidity, and citric acid; we are interested in how each pair of these quantities is related.
wine_data = pd.read_csv('winequality-red.csv')
plt.scatter(wine_data['citric acid'], wine_data['fixed acidity'], alpha=0.2)
# alpha makes the dots a little transparent
plt.show()
We would like to fit a line to this data, i.e., based on this information, what is the most likely linear relationship between the fixed acidity and the citric acid concentration of wine?
Since we are doing a linear model: if $x$ is the citric acid concentration, then the fixed acidity $y$ should be
$$ y = w x + b.$$

So we are looking for the $w$ (weight) and $b$ (bias) that fit the line as well as possible to the data. What does that mean, though? It means we want to minimize the error that our linear model $y = wx + b$ makes when predicting the fixed acidity from the citric acid concentration on our existing data. On a data pair $(x_i, y_i)$, the model guesses $w x_i + b$, while the actual answer is $y_i$. The total squared error (called the loss) is:
$$L(w,b) = \sum_{i=1}^{N} \Big((w x_i + b) - y_i\Big)^2$$
where $\{(x_i, y_i)\}$ are our (citric acid, fixed acidity) pairs.
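As a minimal sketch of this loss (using a small synthetic data set rather than the wine data), we can compute it directly:

```python
import numpy as np

def squared_loss(w, b, x, y):
    """Total squared error of the line y = w*x + b on the data (x, y)."""
    residuals = (w * x + b) - y
    return np.sum(residuals ** 2)

# toy example: three points that lie exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 5.0])
print(squared_loss(2.0, 1.0, x, y))  # perfect fit -> 0.0
print(squared_loss(1.0, 0.0, x, y))  # a worse line -> 14.0
```

Note that the data `x`, `y` enter only as fixed inputs; the loss varies as we change `w` and `b`.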
We want to minimize this squared error function above.

Remark: the loss function is a function of $w$ and $b$, not of $x$ and $y$!

We can set the gradient to zero: write down the partial derivatives and solve the resulting linear equations for $w$ and $b$. $$\frac{\partial}{\partial b} L(w,b) = 2 \sum\limits_{i=1}^N \big(w x_i + b - y_i\big) = 0$$
$$\frac{\partial}{\partial w} L(w,b) = 2 \sum\limits_{i=1}^N \big(w x_i + b - y_i\big) \cdot x_i = 0$$

Simplifying the first equation is straightforward: $$wX + Nb - Y = 0, \quad \text{where} \quad X = \sum\limits_{i=1}^N x_i, \quad Y = \sum\limits_{i=1}^N y_i .$$ Simplifying the second equation: $$ w\sum\limits_{i=1}^N x_i^2 + b\sum\limits_{i=1}^N x_i - \sum\limits_{i=1}^N x_i y_i = wA + bX - C = 0, \quad \text{where} \quad A = \sum\limits_{i=1}^N x_i^2, \quad C = \sum\limits_{i=1}^N x_i y_i. $$ Solving this linear system yields: $$ w = \frac{XY - NC}{X^2 - NA}, \quad \text{ and } \quad b = \frac{AY - CX}{NA - X^2}. $$ Digging deeper to simplify: $$ w =\frac{\sum_{i=1}^{N}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum_{i=1}^{N}(x_{i}-\bar{x})^{2}} \quad \text{ and } \quad b = \bar{y} - w\bar{x}. $$
This is called a closed-form solution because we get straight to the answer, with no gradient descent or approximation. It amounts to solving the normal equations for the least-squares problem.
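The same least-squares problem can also be solved in matrix form: stack a column of ones next to $x$ to form the design matrix, so the unknowns $[w, b]$ appear as a linear system. A sketch on synthetic data (the true slope 2.5 and intercept 0.7 here are made up for the demo), using `np.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 2.5 * x + 0.7 + rng.normal(0, 0.05, 100)  # a noisy line

# design matrix [x, 1]: each row is one data point
A = np.column_stack([x, np.ones_like(x)])

# least-squares solution of A @ [w, b] ~= y
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(w, b)  # close to the true 2.5 and 0.7
```

This matrix view generalizes directly to more than one input feature, which the scalar formulas above do not.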
x = wine_data['citric acid']
y = wine_data['fixed acidity']
N = len(x)
X = np.sum(x)
A = np.sum(x * x) # sum of the squares
C = np.sum(x * y) # sum of x_i * y_i
Y = np.sum(y)
w = (X*Y - N*C) / (X**2- N*A)
b = (A*Y - C*X) / (N*A - X**2)
# more statistical representation
x_bar = np.mean(x)
y_bar = np.mean(y)
w = np.sum( (x-x_bar) * (y-y_bar) )/ np.sum( (x-x_bar)**2 )
b = y_bar - w*x_bar
XX = np.linspace(0,1,200)
YY = w * XX + b
plt.scatter(wine_data['citric acid'], wine_data['fixed acidity'], alpha=0.1)
plt.plot(XX,YY,color='red',linewidth = 4, alpha=0.4)
plt.show()
For data analysis, Exploratory Data Analysis (EDA) can be our first step. EDA helps us visualize the distribution of each variable and the correlations between variables before we commit to a model.
import seaborn as sns
sns.set()
sns.histplot(wine_data['citric acid'], kde=True)  # distplot is deprecated in recent seaborn versions
plt.figure(figsize=(12,8))
sns.heatmap(wine_data.corr(),annot=False)
No need to be a hero every time. We can use scikit-learn's LinearRegression() class. We train the parameters using the fit method, and we apply the learned function using the predict method.
from sklearn import linear_model
# model
acid_regression = linear_model.LinearRegression()
# training data
X_train = wine_data[['citric acid']]  # double brackets: sklearn expects a 2-D feature array
y_train = wine_data['fixed acidity']
# train/fit
acid_regression.fit(X_train, y_train)
# testing/CV
X_test = np.linspace(0, 1, 200).reshape(-1, 1)  # reshape into a column vector for predict
y_pred = acid_regression.predict(X_test)
# visualize
plt.scatter(X_train, y_train, alpha=0.1)
plt.plot(X_test,y_pred, color='red',linewidth = 4, alpha=0.4)
plt.show()
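As a sanity check, the learned parameters live in the fitted model's `coef_` and `intercept_` attributes, and they should agree with the closed-form formulas derived earlier. A sketch on synthetic data (the slope 3.0 and intercept 1.0 are made up for the demo):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200)
y = 3.0 * x + 1.0 + rng.normal(0, 0.1, 200)

model = LinearRegression().fit(x.reshape(-1, 1), y)

# closed-form solution from the derivation above
w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - w * x.mean()

print(model.coef_[0], w)    # the two slopes agree
print(model.intercept_, b)  # and so do the intercepts
```

Both routes solve the same least-squares problem, so the results match up to floating-point precision.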
Exercise: run a regression of volatile acidity against fixed acidity using both the explicit formula and scikit-learn's LinearRegression() class. Then apply scikit-learn's linear regression model to total sulfur dioxide vs pH in winequality-red.csv. We can first make a scatter plot to get a visual cue of whether these two quantities are related.