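Let's get serious about data analysis! Whether we are doing regression, machine learning, or deep learning, the goal is basically to learn a function. The process is the same now as it will be later on: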
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
x = np.linspace(0, 5, 50)
y = 1.2*x + 0.8
Plot it.
plt.scatter(x, y)
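The rough idea is that we take a real-world problem, turn it into a function, and assume a nice function lies behind the data. But real-world data is rarely this clean. In statistics, we assume the data has the form
$$f(x) + \varepsilon(x)$$
that is, there is always a noise term.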
y = 1.2*x + 0.8 + 0.5*np.random.randn(50)
plt.scatter(x, y)
plt.plot(x, 1.2*x + 0.8, 'b')
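To make the noise model a bit more concrete, here is a small sketch that builds the same noisy data and then checks that what is left after subtracting f(x) is just the noise. The fixed seed and the names rng, f, eps, and residual are assumptions added for reproducibility; the text itself uses np.random.randn without a seed.

```python
import numpy as np

# Assumption: a seeded generator so the numbers are reproducible.
rng = np.random.default_rng(0)

x = np.linspace(0, 5, 50)
f = 1.2 * x + 0.8                      # the "nice" function f(x)
eps = 0.5 * rng.standard_normal(50)    # the noise term, standard deviation 0.5
y = f + eps                            # what we actually observe

# Subtracting f(x) from the observations leaves only the noise.
residual = y - f
print("mean of noise:", residual.mean())
print("std of noise: ", residual.std())
```

With only 50 points the printed mean and standard deviation will be close to 0 and 0.5, but not exactly equal to them.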
Many packages can do linear regression, but here we use LinearRegression from sklearn to do, well, linear regression.
from sklearn.linear_model import LinearRegression
regr = LinearRegression()
Note that our original x has the form
$$[x_1, x_2, \ldots, x_{50}]$$
but sklearn expects a 2-D array with one row per sample, so what we need now is
$$[[x_1], [x_2], \ldots, [x_{50}]]$$
x
array([0. , 0.10204082, 0.20408163, 0.30612245, 0.40816327, 0.51020408, 0.6122449 , 0.71428571, 0.81632653, 0.91836735, 1.02040816, 1.12244898, 1.2244898 , 1.32653061, 1.42857143, 1.53061224, 1.63265306, 1.73469388, 1.83673469, 1.93877551, 2.04081633, 2.14285714, 2.24489796, 2.34693878, 2.44897959, 2.55102041, 2.65306122, 2.75510204, 2.85714286, 2.95918367, 3.06122449, 3.16326531, 3.26530612, 3.36734694, 3.46938776, 3.57142857, 3.67346939, 3.7755102 , 3.87755102, 3.97959184, 4.08163265, 4.18367347, 4.28571429, 4.3877551 , 4.48979592, 4.59183673, 4.69387755, 4.79591837, 4.89795918, 5. ])
X = x.reshape(len(x), 1)
regr.fit(X, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
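The shape change behind the reshape call above is easy to check directly. The sketch below also shows two equivalent spellings; the names X_alt1 and X_alt2 are only for illustration.

```python
import numpy as np

x = np.linspace(0, 5, 50)
print(x.shape)               # (50,)   a flat, 1-D array

X = x.reshape(len(x), 1)     # the same call used in the text
print(X.shape)               # (50, 1) one row per sample, one column per feature

# Equivalent spellings: -1 asks NumPy to infer the number of rows.
X_alt1 = x.reshape(-1, 1)
X_alt2 = x[:, np.newaxis]
```

Which spelling to use is a matter of taste; all three give the same array.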
Y = regr.predict(X)
regr.predict([[1.3]])
array([2.30079505])
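It can help to see that predict is doing nothing mysterious here. A fitted LinearRegression stores the slope in coef_ and the intercept in intercept_, so a prediction is just w*x + b. The sketch below rebuilds the data with a fixed seed (an assumption, so its numbers will differ from the output above) and compares a manual prediction with predict.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumption: seeded noise for reproducibility; the text uses unseeded noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)
y = 1.2 * x + 0.8 + 0.5 * rng.standard_normal(50)

X = x.reshape(-1, 1)
regr = LinearRegression().fit(X, y)   # fit returns the fitted model itself

w = regr.coef_[0]       # learned slope, should land near 1.2
b = regr.intercept_     # learned intercept, should land near 0.8
print("slope:", w, "intercept:", b)

# A prediction at x = 1.3, once by hand and once through predict.
manual = w * 1.3 + b
predicted = regr.predict([[1.3]])[0]
print(manual, predicted)
```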
plt.scatter(x, y)
plt.plot(x, Y, 'r' )
plt.plot(x, 1.2*x + 0.8, 'b')
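The red line is an ordinary least-squares fit, so the same slope and intercept can be found with plain NumPy. This is a minimal sketch, assuming seeded noise; the design matrix A and the names w_pf and b_pf are not from the text.

```python
import numpy as np

# Assumption: seeded noise for reproducibility.
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)
y = 1.2 * x + 0.8 + 0.5 * rng.standard_normal(50)

# Design matrix: one column for x, one column of ones for the intercept.
A = np.column_stack([x, np.ones_like(x)])

# Solve min ||A @ [w, b] - y||^2 for the slope w and intercept b.
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)

# np.polyfit with degree 1 solves the same least-squares problem.
w_pf, b_pf = np.polyfit(x, y, 1)
print(w, b)
print(w_pf, b_pf)
```

Both routes should agree up to floating-point error, and both should land near the true values 1.2 and 0.8.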
x = np.linspace(0, 5, 200)
y = 1.2*x + 0.8 + 0.5*np.random.randn(200)
plt.scatter(x,y)
Give 70% of the original x and y to the training data and 30% to the testing data.
from sklearn.model_selection import train_test_split
When we "train" the function, the only data available is the following.
X = x.reshape(len(x), 1)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 9487)
Remember, we are now training on only 70% of the data.
plt.scatter(x_train, y_train)
plt.scatter(x_test, y_test)
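A quick way to confirm the 70/30 split is to look at the shapes. The sketch below repeats the split with seeded noise (an assumption) and also shows that the same random_state reproduces the same split; x_train2 and x_test2 are names used only for that check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumption: seeded noise for reproducibility.
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 200)
y = 1.2 * x + 0.8 + 0.5 * rng.standard_normal(200)
X = x.reshape(-1, 1)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=9487)

print(x_train.shape, x_test.shape)   # (140, 1) (60, 1)

# Using the same random_state again gives exactly the same split.
x_train2, x_test2, _, _ = train_test_split(
    X, y, test_size=0.3, random_state=9487)
print(np.array_equal(x_train, x_train2))   # True
```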
regr = LinearRegression()
regr.fit(x_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Y_pred = regr.predict(x_test)
plt.scatter(x_test, y_test)
plt.plot(x_test, Y_pred, 'r')
plt.plot(x, 1.2*x+0.8, 'b')
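The plot shows the fit on the testing data by eye. To put a number on it, one option is the mean squared error on the test set, or the R^2 value that LinearRegression.score returns. This is a sketch with seeded noise (an assumption), so the exact values depend on that seed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumption: seeded noise for reproducibility.
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 200)
y = 1.2 * x + 0.8 + 0.5 * rng.standard_normal(200)
X = x.reshape(-1, 1)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=9487)

# Train on the 70%, then evaluate only on the held-out 30%.
regr = LinearRegression().fit(x_train, y_train)
Y_pred = regr.predict(x_test)

mse = mean_squared_error(y_test, Y_pred)   # average squared error on test data
r2 = regr.score(x_test, y_test)            # R^2 on test data, 1.0 is a perfect fit
print("test MSE:", mse)
print("test R^2:", r2)
```

Since the noise has standard deviation 0.5, a test MSE somewhere around 0.25 is about as good as any model can do on this data.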
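In fact, nonlinear functions can also be handled with "linear regression". Say our function is originally
$$f(x) = b + w_0 x + w_1 x^2$$
which is obviously not linear in $x$. But we can set $X = x$ and $Y = x^2$, and the expression becomes
$$f(X, Y) = b + w_0 X + w_1 Y,$$
which is instantly a linear function!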
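The substitution above can be tried out in a few lines. The sketch below uses made-up coefficients (b = 1.0, w0 = -2.0, w1 = 0.5 are purely hypothetical) and no noise, treats x and x^2 as two separate features, and recovers the coefficients with a linear least-squares solve.

```python
import numpy as np

# Hypothetical coefficients, chosen only for this illustration.
b, w0, w1 = 1.0, -2.0, 0.5

x = np.linspace(0, 5, 50)
f = b + w0 * x + w1 * x**2     # nonlinear in x

# The trick: treat X = x and Y = x^2 as two separate features.
X_new = x
Y_new = x**2

# In terms of (1, X, Y) the model is linear, so least squares applies.
features = np.column_stack([np.ones_like(x), X_new, Y_new])
coef, *_ = np.linalg.lstsq(features, f, rcond=None)
print(coef)   # approximately [b, w0, w1]
```

The model is nonlinear in x but linear in the weights, and that is all linear regression needs.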
Here we use a nonlinear function to generate fake data:
$$f(x) = \sin(3.2x) + 0.8x$$
As before, we add some noise.
x = np.linspace(0, 5, 100)
y = np.sin(3.2*x) + 0.8*x + 0.3*np.random.randn(100)
plt.scatter(x, y)
X = x.reshape(len(x), 1)
regr_lin = LinearRegression()
regr_lin.fit(X, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Y_lin = regr_lin.predict(X)
plt.scatter(x, y)
plt.plot(x, regr_lin.predict(X), 'r')
As expected, the fit is way off. What can we do?
Let's fit with a degree-6 polynomial.
X_poly = np.array([[k, k**2, k**3, k**4, k**5, k**6] for k in x])
regr_poly = LinearRegression()
regr_poly.fit(X_poly, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Y_poly = regr_poly.predict(X_poly)
plt.scatter(x, y)
plt.plot(x, Y_poly, 'r')
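The list comprehension that builds X_poly works fine, but the same matrix can be built without a Python loop over the samples. Here are two vectorized alternatives, as a sketch; X_poly_vec and X_poly_vander are names introduced only for this comparison.

```python
import numpy as np

x = np.linspace(0, 5, 100)

# The list comprehension from the text: one row per sample.
X_poly = np.array([[k, k**2, k**3, k**4, k**5, k**6] for k in x])

# Vectorized: build one column per power, then stack the columns.
X_poly_vec = np.column_stack([x**d for d in range(1, 7)])

# np.vander with increasing=True gives columns 1, x, ..., x^6;
# drop the constant column to match the layout above.
X_poly_vander = np.vander(x, 7, increasing=True)[:, 1:]

print(X_poly.shape)   # (100, 6)
```

The constant column is dropped because LinearRegression already fits an intercept by default.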
def RBF(x, center, sigma=0.3):
    # Gaussian radial basis function: a bump of width sigma around center
    k = np.exp(-(x - center)**2/(2*sigma**2))
    return k
Let's plot it and take a look!
t = np.linspace(-5, 5, 200)
plt.plot(t, RBF(t, 2.5))
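A few spot values give a feel for the bump in the plot. This sketch repeats the RBF definition so it runs on its own; at_center, one_sigma, and far_away are illustrative names.

```python
import numpy as np

def RBF(x, center, sigma=0.3):
    # Gaussian bump: 1 at the center, decaying with distance from it
    return np.exp(-(x - center)**2 / (2 * sigma**2))

at_center = RBF(2.5, 2.5)         # peak of the bump, equal to 1
one_sigma = RBF(2.5 + 0.3, 2.5)   # one sigma away, equal to exp(-1/2)
far_away = RBF(0.0, 2.5)          # many sigmas away, essentially 0

print(at_center, one_sigma, far_away)
```

With sigma = 0.3, each RBF only "sees" points within roughly one unit of its center, which is why several centers are needed to cover the interval from 0 to 5.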
Pick 5 RBFs with different centers, that is, set
$$f(x) = w_1 \varphi_1(x) + w_2 \varphi_2(x) + w_3 \varphi_3(x) + w_4 \varphi_4(x) + w_5 \varphi_5(x)$$
X_rbf = np.array([[RBF(k, 0.5),
                   RBF(k, 1.5),
                   RBF(k, 2.5),
                   RBF(k, 3.5),
                   RBF(k, 4.5)] for k in x])
regr_rbf = LinearRegression()
regr_rbf.fit(X_rbf, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Y_rbf = regr_rbf.predict(X_rbf)
plt.scatter(x, y)
plt.plot(x, Y_rbf, 'r')
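The RBF feature matrix can also be built in one shot with NumPy broadcasting, which scales more easily if you want more centers. This is a sketch; the centers array and the name X_rbf_vec are assumptions for illustration.

```python
import numpy as np

def RBF(x, center, sigma=0.3):
    return np.exp(-(x - center)**2 / (2 * sigma**2))

x = np.linspace(0, 5, 100)
centers = np.array([0.5, 1.5, 2.5, 3.5, 4.5])

# Loop version, equivalent to the one in the text.
X_rbf = np.array([[RBF(k, c) for c in centers] for k in x])

# Broadcasting version: (100, 1) against (1, 5) gives a (100, 5) matrix,
# with entry [i, j] equal to RBF(x[i], centers[j]).
X_rbf_vec = RBF(x[:, np.newaxis], centers[np.newaxis, :])

print(X_rbf_vec.shape)   # (100, 5)
```

One detail worth noticing: the formula for f(x) has no constant term, but the printed fit_intercept=True shows that LinearRegression also fits an intercept on top of the five weights.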
plt.scatter(x,y)
plt.plot(x, Y_lin, label='linear')
plt.plot(x, Y_poly, label='polynomial')
plt.plot(x, Y_rbf, label='rbf')
plt.legend()
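The three curves can also be compared with numbers rather than by eye. The sketch below refits the three models with a plain least-squares solve (including an intercept, like LinearRegression does by default) and prints the training mean squared error of each. The helper fit_mse, the seeded noise, and the centers array are all assumptions for illustration.

```python
import numpy as np

def RBF(x, center, sigma=0.3):
    return np.exp(-(x - center)**2 / (2 * sigma**2))

def fit_mse(features, y):
    """Least-squares fit with an intercept; return the training MSE."""
    A = np.column_stack([np.ones(len(y)), features])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.mean((A @ coef - y)**2)

# Assumption: seeded noise for reproducibility.
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 100)
y = np.sin(3.2 * x) + 0.8 * x + 0.3 * rng.standard_normal(100)

lin = x.reshape(-1, 1)
poly = np.column_stack([x**d for d in range(1, 7)])
centers = np.array([0.5, 1.5, 2.5, 3.5, 4.5])
rbf = RBF(x[:, np.newaxis], centers)

mse = {name: fit_mse(F, y)
       for name, F in [("linear", lin), ("polynomial", poly), ("rbf", rbf)]}
for name, value in mse.items():
    print(name, value)
```

Keep in mind that this is error on the training data. The polynomial model contains the linear one as a special case, so its training error can never be higher; a fairer comparison would hold out a test set as in the earlier split.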