Definition of the Hypothesis Function
Given $m$ training examples in total (i.e., $1 \le i \le m$), each an $n$-vector $x^i = \{x^i_1, x^i_2, ..., x^i_n\}$ with $n$ attributes ($n$ predictor variables),
and a real value $y^i$ (the outcome variable) associated with each vector $x^i$,
the Hypothesis Function $h_{\theta}(x)$ for an arbitrary $n$-vector $x = \{x_1, x_2, ..., x_n\}$ is defined as $$h_{\theta}(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n$$
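In vector form this is an inner product with a leading 1 supplying the intercept; a minimal sketch with hypothetical $\theta$ values (not taken from the text):

```python
import numpy as np

def h(theta, x):
    # theta: (n+1)-vector (theta_0, theta_1, ..., theta_n)
    # x: n-vector of attribute values; a leading 1 supplies the intercept term
    return np.dot(theta, np.concatenate(([1.0], x)))

theta = np.array([10.0, 10.0])  # hypothetical: theta_0 = 10, theta_1 = 10
x = np.array([34.0])
print(h(theta, x))              # 10 + 10 * 34 = 350.0
```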
A Mathematical Model for Finding the Coefficient Vector $\theta$
from scipy import stats
from pandas import Series, DataFrame
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy
%matplotlib inline
data = {
'Temperature': [26, 27, 28, 29, 30, 31, 32, 33],
'Number of Sells': [270, 280, 290, 300, 310, 320, 330, 340]
}
df = pd.DataFrame(data)
df
| | Number of Sells | Temperature |
|---|---|---|
| 0 | 270 | 26 |
| 1 | 280 | 27 |
| 2 | 290 | 28 |
| 3 | 300 | 29 |
| 4 | 310 | 30 |
| 5 | 320 | 31 |
| 6 | 330 | 32 |
| 7 | 340 | 33 |
df['Temperature'].values
array([26, 27, 28, 29, 30, 31, 32, 33])
df['Number of Sells'].values
array([270, 280, 290, 300, 310, 320, 330, 340])
df.plot(kind="scatter", x="Temperature", y="Number of Sells")
<matplotlib.axes._subplots.AxesSubplot at 0x11672dd90>
slope, intercept, r_value, p_value, std_err = stats.linregress(df['Temperature'].values, df['Number of Sells'].values)
format = "%40s: %12.10f"
print format % ("slope", slope)
print format % ("intercept", intercept)
print format % ("r_value (Correlation Coefficient)", r_value)
print format % ("r-squared (Coefficient of Determination)", r_value**2)
print format % ("p_value (Hypothesis Testing)", p_value)
print format % ("std_err (Standard Error)", std_err)
slope: 10.0000000000
intercept: 10.0000000000
r_value (Correlation Coefficient): 1.0000000000
r-squared (Coefficient of Determination): 1.0000000000
p_value (Hypothesis Testing): 0.0000000000
std_err (Standard Error): 0.0000000000
Regression equation: $y = intercept + slope \times x$
Question: what is the expected number of air conditioner sales when the temperature is 34? $10 + 10 \times 34 = 350$
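The prediction above can be reproduced with `stats.linregress`; a sketch assuming the same perfectly linear toy data:

```python
from scipy import stats

temps = [26, 27, 28, 29, 30, 31, 32, 33]
sells = [270, 280, 290, 300, 310, 320, 330, 340]
slope, intercept, r_value, p_value, std_err = stats.linregress(temps, sells)
# Plug the temperature 34 into the fitted line
prediction = intercept + slope * 34
print(prediction)  # 350.0
```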
r-value (Pearson correlation coefficient): $$r = \frac{\mathrm{cov}(x, y)}{\sigma_{x}\,\sigma_{y}}$$
where $\mathrm{cov}(x, y)$ is the sample covariance of $x$ and $y$, and $\sigma_x$, $\sigma_y$ are their sample standard deviations.
Meaning of the r-value
def cov(a, b):
    if len(a) != len(b):
        return
    a_mean = np.mean(a)
    b_mean = np.mean(b)
    total = 0
    for i in range(0, len(a)):
        total += ((a[i] - a_mean) * (b[i] - b_mean))
    return total / (len(a) - 1)
a = np.cov(df['Temperature'].values, df['Number of Sells'].values, ddof = 1)[0][1]
print a
b = np.std(df['Temperature'].values, ddof = 1)
print b
c = np.std(df['Number of Sells'].values, ddof = 1)
print c
print a / (b * c)
60.0
2.44948974278
24.4948974278
1.0
np.corrcoef(df['Temperature'].values, df['Number of Sells'].values, ddof = 1)[0][1]
1.0
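Note that `ddof` does not affect the correlation coefficient: the same $(n-1)$ vs. $n$ divisor appears in both the covariance and the standard deviations and cancels in the ratio (the `ddof` argument of `np.corrcoef` is in fact deprecated in NumPy for this reason). A quick check on the toy data:

```python
import numpy as np

temps = np.array([26, 27, 28, 29, 30, 31, 32, 33])
sells = np.array([270, 280, 290, 300, 310, 320, 330, 340])
# Compute r both ways: manually via cov/std with ddof=1, and via np.corrcoef
r_manual = np.cov(temps, sells, ddof=1)[0][1] / (np.std(temps, ddof=1) * np.std(sells, ddof=1))
r_corrcoef = np.corrcoef(temps, sells)[0][1]
print(r_manual)    # 1.0 for perfectly linear data
print(r_corrcoef)  # same value, no ddof needed
```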
URL
D: death rate
A1 ~ A15: data on various factors thought to influence the death rate
References
The death rate is to be represented as a function of other variables.
There are 60 rows of data. The data includes:
import urllib2
import json
path = 'https://raw.githubusercontent.com/bluebibi/LINK_ML_BIG_DATA/master/death_rate.csv'
raw_csv = urllib2.urlopen(path)
df = pd.read_csv(raw_csv)
df.head()
| | I | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 | D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 36 | 27 | 71 | 8.1 | 3.34 | 11.4 | 81.5 | 3243 | 8.8 | 42.6 | 11.7 | 21 | 15 | 59 | 59 | 0.921 |
| 1 | 2 | 35 | 23 | 72 | 11.1 | 3.14 | 11.0 | 78.8 | 4281 | 3.6 | 50.7 | 14.4 | 8 | 10 | 39 | 57 | 0.997 |
| 2 | 3 | 44 | 29 | 74 | 10.4 | 3.21 | 9.8 | 81.6 | 4260 | 0.8 | 39.4 | 12.4 | 6 | 6 | 33 | 54 | 0.962 |
| 3 | 4 | 47 | 45 | 79 | 6.5 | 3.41 | 11.1 | 77.5 | 3125 | 27.1 | 50.2 | 20.6 | 18 | 8 | 24 | 56 | 0.982 |
| 4 | 5 | 43 | 35 | 77 | 7.6 | 3.44 | 9.6 | 84.6 | 6441 | 24.4 | 43.7 | 14.3 | 43 | 38 | 206 | 55 | 1.071 |
print df['A1'].values
print df['D'].values
[36 35 44 47 43 53 43 45 36 36 52 33 40 35 37 35 36 15 31 30 31 31 42 43 46 39 35 43 11 30 50 60 30 25 45 46 54 42 42 36 37 42 41 44 32 34 10 18 13 35 45 38 31 40 41 28 45 45 42 38] [ 0.921 0.997 0.962 0.982 1.071 1.03 0.934 0.899 1.001 0.912 1.017 1.024 0.97 0.985 0.958 0.86 0.936 0.871 0.959 0.941 0.891 0.871 0.971 0.887 0.952 0.968 0.919 0.844 0.861 0.989 1.006 0.861 0.929 0.857 0.961 0.923 1.113 0.994 1.015 0.991 0.893 0.938 0.946 1.025 0.874 0.953 0.839 0.911 0.79 0.899 0.904 0.95 0.972 0.912 0.967 0.823 1.003 0.895 0.911 0.954]
corr_dic = {}
for i in range(1, 16):
    corr_dic[i] = np.corrcoef(df['A' + str(i)].values, df['D'].values, ddof = 1)[0][1]
print corr_dic
print
sorted_corr_dic = sorted(corr_dic.items(), key=lambda x: x[1], reverse=True)
print sorted_corr_dic
{1: 0.51058981544585491, 2: -0.058248264329082554, 3: 0.279026441842534, 4: -0.17474932763602924, 5: 0.3586407523344754, 6: -0.51252452972931994, 7: -0.42771304952186134, 8: 0.26418567700899515, 9: 0.64413803239793543, 10: -0.28529506633050106, 11: 0.41125099134506982, 12: -0.17826806806438281, 13: -0.081368918173860549, 14: 0.42579785580212309, 15: -0.05410071395571768}

[(9, 0.64413803239793543), (1, 0.51058981544585491), (14, 0.42579785580212309), (11, 0.41125099134506982), (5, 0.3586407523344754), (3, 0.279026441842534), (8, 0.26418567700899515), (15, -0.05410071395571768), (2, -0.058248264329082554), (13, -0.081368918173860549), (4, -0.17474932763602924), (12, -0.17826806806438281), (10, -0.28529506633050106), (7, -0.42771304952186134), (6, -0.51252452972931994)]
df_sub = df[['A9','A1','A6', 'D']]
df_sub.head()
| | A9 | A1 | A6 | D |
|---|---|---|---|---|
| 0 | 8.8 | 36 | 11.4 | 0.921 |
| 1 | 3.6 | 35 | 11.0 | 0.997 |
| 2 | 0.8 | 44 | 9.8 | 0.962 |
| 3 | 27.1 | 47 | 11.1 | 0.982 |
| 4 | 24.4 | 43 | 9.6 | 1.071 |
fig = plt.figure(figsize=(17, 6))
ax1 = fig.add_subplot(131)
ax1.scatter(df_sub['A9'], df_sub['D'])
ax1.set_title("Death Rate vs. Size of the nonwhite population")
ax2 = fig.add_subplot(132)
ax2.scatter(df_sub['A1'], df_sub['D'])
ax2.set_title("Death Rate vs. Average annual precipitation")
ax3 = fig.add_subplot(133)
ax3.scatter(df_sub['A6'], df_sub['D'])
ax3.set_title("Death Rate vs. Number of years of schooling for persons over 22")
<matplotlib.text.Text at 0x119f47950>
slope, intercept, r_value, p_value, std_err = stats.linregress(df_sub['A9'].values, df_sub['D'].values)
format = "%40s: %12.10f"
print format % ("slope", slope)
print format % ("intercept", intercept)
print format % ("r_value (Correlation Coefficient)", r_value)
print format % ("r-squared (Coefficient of Determination)", r_value**2)
print format % ("p_value (Hypothesis Testing)", p_value)
print format % ("std_err (Standard Error)", std_err)
slope: 0.0044945782
intercept: 0.8865010410
r_value (Correlation Coefficient): 0.6441380324
r-squared (Coefficient of Determination): 0.4149138048
p_value (Hypothesis Testing): 0.0000000281
std_err (Standard Error): 0.0007008191
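With this fit, a single-predictor estimate of the death rate is `intercept + slope * A9`; a sketch for A9 = 20 (a hypothetical input, with the coefficients copied from the printed output above):

```python
# Coefficients copied from the linregress output above
slope, intercept = 0.0044945782, 0.8865010410
a9 = 20.0  # hypothetical: percentage of nonwhite population
pred = intercept + slope * a9
print(round(pred, 4))  # 0.9764
```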
predicator_analysis = {}
for i in range(1, 16):
    predicator_analysis[i] = Series(np.empty(6), index=['slope', 'intercept', 'r_value', 'r_squared', 'p_value', 'std_err'])
    predicator_analysis[i][0],\
    predicator_analysis[i][1],\
    predicator_analysis[i][2],\
    predicator_analysis[i][4],\
    predicator_analysis[i][5] = stats.linregress(df['A' + str(i)].values, df['D'].values)
    predicator_analysis[i][3] = predicator_analysis[i][2] ** 2
format1 = "%3s %15s %15s %15s %15s %15s %15s"
format2 = "%3d %15f %15f %15f %15f %15f %15f"
print format1 % ('No.', 'slope', 'intercept', 'r_value', 'r_squared', 'p_value', 'std_err')
for i in range(1, 16):
    lst = [i]
    for j in range(6):
        lst.append(predicator_analysis[i][j])
    print format2 % tuple(lst)
No. | slope | intercept | r_value | r_squared | p_value | std_err
---|---|---|---|---|---|---
1 | 0.003183 | 0.820937 | 0.510590 | 0.260702 | 0.000031 | 0.000704
2 | -0.000303 | 0.950407 | -0.058248 | 0.003393 | 0.658436 | 0.000681
3 | 0.003644 | 0.668059 | 0.279026 | 0.077856 | 0.030855 | 0.001647
4 | -0.007426 | 1.005207 | -0.174749 | 0.030537 | 0.181736 | 0.005494
5 | 0.165038 | 0.401320 | 0.358641 | 0.128623 | 0.004895 | 0.056404
6 | -0.037738 | 1.353973 | -0.512525 | 0.262681 | 0.000028 | 0.008302
7 | -0.005178 | 1.358817 | -0.427713 | 0.182938 | 0.000653 | 0.001437
8 | 0.000011 | 0.896037 | 0.264186 | 0.069794 | 0.041378 | 0.000005
9 | 0.004495 | 0.886501 | 0.644138 | 0.414914 | 0.000000 | 0.000701
10 | -0.003838 | 1.116685 | -0.285295 | 0.081393 | 0.027137 | 0.001693
11 | 0.006153 | 0.851430 | 0.411251 | 0.169127 | 0.001098 | 0.001791
12 | -0.000121 | 0.944433 | -0.178268 | 0.031780 | 0.172960 | 0.000087
13 | -0.000109 | 0.942326 | -0.081369 | 0.006621 | 0.536543 | 0.000176
14 | 0.000418 | 0.917388 | 0.425798 | 0.181304 | 0.000694 | 0.000117
15 | -0.000617 | 0.975347 | -0.054101 | 0.002927 | 0.681404 | 0.001495
fig = plt.figure(figsize=(17, 6))
ax1 = fig.add_subplot(131)
ax1.scatter(df_sub['A9'], df_sub['D'])
line_plot_x1 = np.linspace(df_sub['A9'].min(), df_sub['A9'].max(), 10)
slope, intercept, r_value, p_value, std_err = stats.linregress(df_sub['A9'].values, df_sub['D'].values)
ax1.plot(line_plot_x1, intercept + slope * line_plot_x1)
ax2 = fig.add_subplot(132)
ax2.scatter(df_sub['A1'], df_sub['D'])
line_plot_x2 = np.linspace(df_sub['A1'].min(), df_sub['A1'].max(), 10)
slope, intercept, r_value, p_value, std_err = stats.linregress(df_sub['A1'].values, df_sub['D'].values)
ax2.plot(line_plot_x2, intercept + slope * line_plot_x2)
ax3 = fig.add_subplot(133)
ax3.scatter(df_sub['A6'], df_sub['D'])
line_plot_x3 = np.linspace(df_sub['A6'].min(), df_sub['A6'].max(), 10)
slope, intercept, r_value, p_value, std_err = stats.linregress(df_sub['A6'].values, df_sub['D'].values)
ax3.plot(line_plot_x3, intercept + slope * line_plot_x3)
[<matplotlib.lines.Line2D at 0x11b0be310>]
from sklearn import linear_model
regr = linear_model.LinearRegression()
df[['A9', 'A1']].head()
A9 | A1 | |
---|---|---|
0 | 8.8 | 36 |
1 | 3.6 | 35 |
2 | 0.8 | 44 |
3 | 27.1 | 47 |
4 | 24.4 | 43 |
X = zip(df['A9'], df['A1'])
print X
y = df['D'].values
[(8.8000000000000007, 36), (3.6000000000000001, 35), (0.80000000000000004, 44), (27.100000000000001, 47), (24.399999999999999, 43), (38.5, 53), (3.5, 43), (5.2999999999999998, 45), (8.0999999999999996, 36), (6.7000000000000002, 36), (22.199999999999999, 52), (16.300000000000001, 33), (13.0, 40), (14.699999999999999, 35), (13.1, 37), (14.800000000000001, 35), (12.4, 36), (4.7000000000000002, 15), (15.800000000000001, 31), (13.1, 30), (11.5, 31), (5.0999999999999996, 31), (22.699999999999999, 42), (7.2000000000000002, 43), (21.0, 46), (15.6, 39), (12.6, 35), (2.8999999999999999, 43), (7.7999999999999998, 11), (13.1, 30), (36.700000000000003, 50), (13.6, 60), (5.7999999999999998, 30), (2.0, 25), (21.0, 45), (8.8000000000000007, 46), (31.399999999999999, 54), (11.300000000000001, 42), (17.5, 42), (8.0999999999999996, 36), (3.6000000000000001, 37), (2.2000000000000002, 42), (2.7000000000000002, 41), (28.600000000000001, 44), (5.0, 32), (17.199999999999999, 34), (5.9000000000000004, 10), (13.699999999999999, 18), (3.0, 13), (5.7000000000000002, 35), (3.3999999999999999, 45), (3.7999999999999998, 38), (9.5, 31), (2.5, 40), (25.899999999999999, 41), (7.5, 28), (12.1, 45), (1.0, 45), (4.7999999999999998, 42), (11.699999999999999, 38)]
regr = regr.fit(X, y)
print 'Coefficients:', regr.coef_
print 'Intercept:', regr.intercept_
Coefficients: [ 0.00364445 0.00183603] Intercept: 0.827988572515
test_x = [36, 12]
# predict expects a 2-D array (one row per sample), so wrap test_x in a list
print regr.predict([test_x])
print 0.8280 + 0.0036 * test_x[0] + 0.0018 * test_x[1]
[ 0.98122101]
0.9792
# Plot outputs
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(17, 7))
ax1 = fig.add_subplot(121, projection='3d')
ax1.scatter(df['A9'], df['A1'], df_sub['D'])
ax2 = fig.add_subplot(122, projection='3d')
ax2.scatter(df['A9'], df['A1'], df_sub['D'])
# create x,y
xx, yy = np.meshgrid(range(int(df['A9'].min()), int(df['A9'].max())), range(int(df['A1'].min()), int(df['A1'].max())))
# calculate corresponding z using the (rounded) fitted coefficients from above
z = 0.8280 + 0.0036 * xx + 0.0018 * yy
ax2.plot_surface(xx, yy, z, rstride=1, cstride=1, linewidth=0, color="yellow", shade=False)
<mpl_toolkits.mplot3d.art3d.Poly3DCollection at 0x124c9e510>
import tensorflow as tf
import numpy as np
# Fill in 100 fake data points with NumPy random numbers.
x_data = np.float32(np.random.rand(2, 100))
# The training labels (target values) are produced by the formula below (W = [0.1, 0.2], b = 0.3).
y_data = np.dot([0.100, 0.200], x_data) + 0.300
print type(x_data), x_data.shape
print type(y_data), y_data.shape
<type 'numpy.ndarray'> (2, 100) <type 'numpy.ndarray'> (100,)
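Since the fake labels are an exact linear function of the inputs, the same parameters can be recovered in closed form with `np.linalg.lstsq`; a sketch (the random seed is a hypothetical choice for reproducibility):

```python
import numpy as np

np.random.seed(0)  # hypothetical seed so the run is reproducible
x_data = np.float32(np.random.rand(2, 100))
y_data = np.dot([0.100, 0.200], x_data) + 0.300

# Design matrix: one row per sample, plus a column of ones for the bias b
A = np.column_stack([x_data.T, np.ones(100)])
coef, residuals, rank, sv = np.linalg.lstsq(A, y_data, rcond=None)
print(coef)  # approximately [0.1, 0.2, 0.3]
```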
x_data = df[['A9', 'A1']].T
y_data = df['D']
x_data = x_data.as_matrix().astype('float32')
y_data = y_data.as_matrix().astype('float32')
print type(x_data), x_data.shape
print type(y_data), y_data.shape
<type 'numpy.ndarray'> (2, 60) <type 'numpy.ndarray'> (60,)
# b is initialized to 0
b = tf.Variable(tf.zeros([1]))
# W is a 1x2 weight variable
W = tf.Variable(tf.zeros([1, 2]))
y = tf.matmul(W, x_data) + b
print W.get_shape()
print y.get_shape()
(1, 2) (1, 60)
# Define the loss function
loss = tf.reduce_mean(tf.square(y - y_data))
# Minimize the loss with gradient descent (0.0005 is the learning rate)
optimizer = tf.train.GradientDescentOptimizer(0.0005)
# Define the training operation
train = optimizer.minimize(loss)
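`tf.reduce_mean(tf.square(y - y_data))` is the mean squared error, $\frac{1}{m}\sum_{i=1}^{m}(\hat{y}^i - y^i)^2$; a NumPy equivalent on hypothetical arrays:

```python
import numpy as np

y_pred = np.array([0.9, 1.0, 1.1])  # hypothetical predictions
y_true = np.array([1.0, 1.0, 1.0])  # hypothetical targets
mse = np.mean(np.square(y_pred - y_true))
print(mse)  # (0.01 + 0.00 + 0.01) / 3
```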
# Initialize all variables.
init = tf.initialize_all_variables()
# Start a session
sess = tf.Session()
sess.run(init)
# Train for 200,000 steps.
for step in xrange(0, 200001):
    sess.run(train)
    if step % 10000 == 0:
        print step, sess.run(W), sess.run(b)
0 [[ 0.011511 0.0354317]] [ 0.00093987]
10000 [[ 0.00280043 0.01287403]] [ 0.39743593]
20000 [[ 0.00320522 0.00758022]] [ 0.60392851]
30000 [[ 0.00341587 0.00482532]] [ 0.7113871]
40000 [[ 0.00352549 0.00339164]] [ 0.76730978]
50000 [[ 0.00358254 0.00264559]] [ 0.79641074]
60000 [[ 0.00361222 0.00225743]] [ 0.81155115]
70000 [[ 0.00362768 0.00205532]] [ 0.819435]
80000 [[ 0.00363569 0.00195047]] [ 0.82352459]
90000 [[ 0.00363992 0.00189511]] [ 0.82568407]
100000 [[ 0.00364203 0.00186773]] [ 0.82675219]
110000 [[ 0.00364319 0.00185245]] [ 0.82734823]
120000 [[ 0.00364356 0.0018477 ]] [ 0.82753348]
130000 [[ 0.00364356 0.0018477 ]] [ 0.82753348]
140000 [[ 0.00364356 0.0018477 ]] [ 0.82753348]
150000 [[ 0.00364356 0.0018477 ]] [ 0.82753348]
160000 [[ 0.00364356 0.0018477 ]] [ 0.82753348]
170000 [[ 0.00364356 0.0018477 ]] [ 0.82753348]
180000 [[ 0.00364356 0.0018477 ]] [ 0.82753348]
190000 [[ 0.00364356 0.0018477 ]] [ 0.82753348]
200000 [[ 0.00364356 0.0018477 ]] [ 0.82753348]
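The same gradient-descent loop can be written in plain NumPy, which makes the update rule explicit; a minimal sketch on synthetic data (the seed, learning rate, and step count are assumptions, not values from the text):

```python
import numpy as np

np.random.seed(1)  # hypothetical seed for reproducibility
x = np.random.rand(2, 100)       # 2 features x 100 samples
y = np.dot([0.1, 0.2], x) + 0.3  # true W = [0.1, 0.2], b = 0.3

W = np.zeros(2)
b = 0.0
lr = 0.3  # assumed learning rate
for step in range(10000):
    err = W.dot(x) + b - y
    # Gradients of the mean squared error with respect to W and b
    W -= lr * 2.0 * x.dot(err) / len(y)
    b -= lr * 2.0 * err.mean()
print(W, b)  # approaches [0.1, 0.2] and 0.3
```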