
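3 Clustering

Supervised learning algorithms

Build a model using input data together with output values (or labels).

Unsupervised learning
Clustering
K-means algorithm: an algorithm that automatically groups similar data points together so that each group is distinct from the others.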

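3.1 Basic data structure: tensors

Tensor: a multidimensional data array of dynamic size. Its elements have a static data type such as boolean, string, or a numeric type.
TensorFlow and Python data types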

TensorFlow type	Python type	Description
DT_FLOAT	tf.float32	32-bit floating point
DT_INT16	tf.int16	16-bit integer
DT_INT32	tf.int32	32-bit integer
DT_INT64	tf.int64	64-bit integer
DT_STRING	tf.string	string
DT_BOOL	tf.bool	boolean

Rank: the number of dimensions of a tensor's array
Shape, rank, and dimension number

Shape	Rank	Dimension number
[]	0	0-D
[D0]	1	1-D
...	...	...
[D0, D1, ..., Dn-1]	n	n-D
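
The relationship between shape and rank in the table above can be checked on small arrays. This is a minimal sketch that uses NumPy as a stand-in for TensorFlow tensors (an assumption made so the example runs without a session); the variable names are illustrative only.

```python
import numpy as np

scalar = np.array(7)            # shape (), rank 0
vector = np.array([1, 2, 3])    # shape (3,), rank 1
matrix = np.zeros((2000, 2))    # shape (2000, 2), rank 2

for name, t in [("scalar", scalar), ("vector", vector), ("matrix", matrix)]:
    # ndim plays the role of the rank; shape lists the size of each dimension.
    print(name, "rank:", t.ndim, "shape:", t.shape)
```

The rank is always the length of the shape, which is why a scalar with shape [] has rank 0.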

Key tensor transformation functions

https://www.tensorflow.org/versions/master/api_docs/python/array_ops.html

In [1]:

import numpy as np
import tensorflow as tf
# points is a 2000 x 2 array.
points = [[0,0] for x in range(2000)]
vectors = tf.constant(points)
expanded_vectors = tf.expand_dims(vectors, 0)

with tf.Session() as sess:
    print(sess.run(tf.rank(vectors)), vectors.get_shape())
    print(sess.run(tf.rank(expanded_vectors)), expanded_vectors.get_shape())

2 (2000, 2)
3 (1, 2000, 2)

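Code explanation

The new dimension D0 of expanded_vectors has size 1.

tf.expand_dims does not let you specify a size for the new dimension; it is always 1.

Shape broadcasting

TensorFlow supports operations between tensors of different shapes when certain conditions are met.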
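
As a rough illustration of those conditions, the sketch below uses NumPy, whose broadcasting rules TensorFlow follows closely (that equivalence is assumed here, not taken from this notebook). Two dimensions are compatible when they are equal or when one of them is 1.

```python
import numpy as np

a = np.ones((1, 5, 2))
b = np.ones((4, 1, 2))

# Each pair of dimensions is either equal or contains a 1, so the shapes broadcast.
result_shape = np.broadcast(a, b).shape
print(result_shape)

# Sizes 3 and 2 in the last dimension are neither equal nor 1, so this fails.
try:
    np.ones((5, 3)) + np.ones((5, 2))
    compatible = True
except ValueError:
    compatible = False
print(compatible)
```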

3.2 Data storage in TensorFlow

3.2.1 Data files

https://www.tensorflow.org/versions/master/how_tos/reading_data/index.html

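3.2.2 Variables and constants

Creating constants with tf.constant()

Function	Description
tf.zeros_like	Creates a tensor with all elements initialized to 0.
tf.ones_like	Creates a tensor with all elements initialized to 1.
tf.fill	Creates a tensor with all elements initialized to a given scalar value.
tf.constant	Creates a constant tensor from the values given as arguments.

Creating variables with tf.Variable()

Function	Description
tf.random_normal	Creates a tensor of random numbers drawn from a normal distribution.
tf.truncated_normal	Creates a tensor of random numbers drawn from a normal distribution, discarding values whose magnitude is more than 2 standard deviations from the mean.
tf.random_uniform	Creates a tensor of random numbers drawn from a uniform distribution.
tf.random_shuffle	Shuffles the elements of a tensor along its first dimension.
tf.set_random_seed	Sets the random seed.

To use variables, they must be initialized with the following function after the graph has been built and before run is called:

tf.initialize_all_variables()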

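3.2.3 Supplying data from Python code

placeholder: a 'symbolic' variable used to change data while the program is running

tf.placeholder()

The element data type and the tensor shape can be given as parameters.

feed_dict

When calling session.run() or Tensor.eval(),

data can be passed in by mapping placeholders to values in the feed_dict parameter.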

In [2]:

import numpy as np
import tensorflow as tf

a = tf.placeholder("float")
b = tf.placeholder("float")

y = tf.mul(a, b)

with tf.Session() as sess:
    print(sess.run(y, feed_dict={a: 3, b: 3}))

9.0

Code explanation

When sess.run() is called, values for the two tensors a and b are passed through the feed_dict parameter.

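3.3 K-means algorithm

K-means algorithm: groups the given data into a specified number (K) of clusters.

Centroids: the result of the algorithm, a set of K points. Each data point belongs to exactly one of the K clusters.

Every data point in a cluster is closer to its own cluster's centroid than to any other centroid.

Algorithmic technique

Minimizing the error function directly is computationally expensive (it is an NP-hard problem).

The iterative refinement technique usually converges (to a local optimum) within a few iterations.

Steps

Initial step (step 0): choose an initial set of K centroids.

Assignment step (step 1): assign each data point to the nearest cluster.

Update step (step 2): compute a new centroid for each group.

Repeat steps 1 and 2 in a loop until the algorithm is considered to have converged, that is, until the cluster assignments no longer change.

Because the result depends on how the initial centroids were chosen, the algorithm is usually run several times with different initial centroids.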
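
The three steps can be sketched in plain NumPy before moving to TensorFlow. This is a minimal sketch rather than the notebook's implementation: the function name kmeans, its parameters, and the choice to keep the old centroid when a cluster ends up empty are all assumptions made for illustration.

```python
import numpy as np

def kmeans(points, k, num_steps=100, seed=0):
    """Hypothetical helper: iterative-refinement K-means on an (n, 2) float array."""
    rng = np.random.RandomState(seed)
    # Step 0: pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    assignments = None
    for _ in range(num_steps):
        # Step 1: squared distance from every centroid to every point, shape (k, n).
        diff = points[np.newaxis, :, :] - centroids[:, np.newaxis, :]
        new_assignments = (diff ** 2).sum(axis=2).argmin(axis=0)
        # Converged: no point changed cluster since the previous iteration.
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Step 2: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = points[assignments == c]
            if len(members) > 0:  # assumption: keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    return centroids, assignments

points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centroids, assignments = kmeans(points, k=2)
print(assignments)
```

On this toy input the two points near the origin end up in one cluster and the two points near (5, 5) in the other; which cluster gets index 0 depends on the random initial choice.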

In [3]:

import numpy as np

num_points = 2000
vectors_set = []

for i in range(num_points):
    if np.random.random() > 0.5:
        vectors_set.append([np.random.normal(0.0, 0.9),
            np.random.normal(0.0, 0.9)])
    else:
        vectors_set.append([np.random.normal(3.0, 0.5),
            np.random.normal(1.0, 0.5)])

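Code explanation

Generate 2000 random points in the coordinate plane using two normal distributions.

Roughly half of the points follow a normal distribution with mean 0 and standard deviation 0.9 in x, and mean 0 and standard deviation 0.9 in y.

The remaining half follow a normal distribution with mean 3 and standard deviation 0.5 in x, and mean 1 and standard deviation 0.5 in y.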

In [4]:

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"x": [v[0] for v in vectors_set],
                   "y": [v[1] for v in vectors_set]})
sns.lmplot("x", "y", data=df, fit_reg=False, size=6)
plt.show()

# The 2000 randomly generated points

In [5]:

import tensorflow as tf

vectors = tf.constant(vectors_set)
k = 4
centroids = tf.Variable(tf.slice(tf.random_shuffle(vectors), [0,0], [k,-1]))

expanded_vectors = tf.expand_dims(vectors, 0)
expanded_centroids = tf.expand_dims(centroids, 1)

assignments = tf.argmin(tf.reduce_sum(tf.square(tf.sub(expanded_vectors, expanded_centroids)), 2), 0)

means = tf.concat(0, [tf.reduce_mean(tf.gather(vectors,
    tf.reshape(tf.where(tf.equal(assignments, c)), [1,-1])), 
    reduction_indices=[1]) for c in range(k)])

update_centroids = tf.assign(centroids, means)

init_op = tf.initialize_all_variables()

sess = tf.Session()
sess.run(init_op)

for step in range(100):
    _, centroid_values, assignment_values = sess.run([update_centroids, centroids, assignments])
    #_ = sess.run(update_centroids)
    #centroids_values = sess.run(centroids)
    #assignment_values = sess.run(assignments)

#print(centroid_values)

In [6]:

data = {"x": [], "y": [], "cluster": []}

for i in range(len(assignment_values)):
    data["x"].append(vectors_set[i][0])
    data["y"].append(vectors_set[i][1])
    data["cluster"].append(assignment_values[i])
    
df = pd.DataFrame(data)
sns.lmplot("x", "y", data=df, fit_reg=False, size=6, hue="cluster", legend=False)
plt.show()

3.4 New groups

In [7]:

# vectors = tf.constant(vectors_set)
print(vectors)
print(sess.run(vectors))
vectors.get_shape()

Tensor("Const_1:0", shape=(2000, 2), dtype=float32)
[[ 3.07339978  0.18406528]
 [ 3.62437248  1.95335352]
 [ 3.0714941   0.40908372]
 ..., 
 [ 3.50883102  1.01763129]
 [ 3.23108721  0.61184025]
 [ 3.22827411  0.72328085]]

Out[7]:

TensorShape([Dimension(2000), Dimension(2)])

Code explanation

All of the data is moved into a constant tensor.

Print the contents and shape of the constant tensor.

vectors is a 2000(D0) X 2(D1) matrix.

In [8]:

print(sess.run(tf.slice(vectors, [0,0], [k,-1])))

[[ 3.07339978  0.18406528]
 [ 3.62437248  1.95335352]
 [ 3.0714941   0.40908372]
 [ 2.05173659  0.95611084]]

Code explanation

Slice the first k rows (here 4) from vectors.

tf.slice(input_, begin, size, name=None)

https://www.tensorflow.org/api_docs/python/array_ops/slicing_and_joining#slice
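
The begin/size convention of tf.slice differs from Python's start/stop slicing, so a small NumPy sketch may help. The helper slice_like is hypothetical (it is not a TensorFlow or NumPy function) and only mimics the convention that a size of -1 means "everything up to the end of that dimension".

```python
import numpy as np

def slice_like(array, begin, size):
    """Hypothetical helper: begin/size slicing, where size -1 means 'to the end'."""
    index = tuple(
        slice(b, None if s == -1 else b + s)
        for b, s in zip(begin, size)
    )
    return array[index]

data = np.arange(12).reshape(6, 2)
# Start at row 0, column 0; take 4 rows and all remaining columns.
first_rows = slice_like(data, [0, 0], [4, -1])
print(first_rows.shape)
```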

In [9]:

# k = 4
# centroids = tf.Variable(tf.slice(tf.random_shuffle(vectors), [0,0], [k,-1]))
print(centroids)
print(sess.run(centroids))
print(centroids.get_shape())

<tensorflow.python.ops.variables.Variable object at 0x7fa6983f2fd0>
[[ 1.0014025   0.41117752]
 [-0.0312904  -0.88871944]
 [ 3.01476979  0.99944139]
 [-0.7339651   0.49133039]]
(4, 2)

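Code explanation

The input data is shuffled randomly with the function below, and the first k rows (here 4) are sliced off and used as the centroids.

Print the contents and shape of the centroids variable tensor.

centroids is a 4(D0) X 2(D1) matrix.

tf.random_shuffle(value, seed=None, name=None)

Randomly shuffles a tensor along its first dimension.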

https://www.tensorflow.org/api_docs/python/constant_op/random_tensors#random_shuffle

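Squared Euclidean distance: used to compute the distance from each point to each centroid and find the nearest centroid.

It is suitable only for comparing distances against each other.

https://en.wikipedia.org/wiki/Euclidean_distance#Squared_Euclidean_distance

$$d^2(\text{vector}, \text{centroid}) = (\text{vector}_x - \text{centroid}_x)^2 + (\text{vector}_y - \text{centroid}_y)^2$$

tf.sub(vectors, centroids) is the code intended to compute the differences in the formula above.

However, the D0 sizes differ (2000 for vectors, 4 for centroids), so the two tensors cannot be subtracted directly.

tf.expand_dims is therefore used to add a dimension to each tensor, making both three-dimensional so that the subtraction can be carried out.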
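
To make the formula concrete, here is the squared distance worked out in plain Python for a single pair of points. The coordinates are the first row of vectors and the first row of centroids as printed in the cell outputs of this notebook; the variable names are illustrative.

```python
vector = (3.07339978, 0.18406528)    # first row of vectors
centroid = (1.0014025, 0.41117752)   # first row of centroids

dx = vector[0] - centroid[0]
dy = vector[1] - centroid[1]

# Squared Euclidean distance: no square root is taken, because only the
# ordering of distances matters when picking the nearest centroid.
d2 = dx ** 2 + dy ** 2
print(d2)
```

The result is approximately 4.3448, which agrees with the first entry of the distances tensor printed later in the notebook.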

In [10]:

#expanded_vectors = tf.expand_dims(vectors, 0)

print(expanded_vectors)
print(sess.run(expanded_vectors))
print(expanded_vectors.get_shape())

Tensor("ExpandDims_1:0", shape=(1, 2000, 2), dtype=float32)
[[[ 3.07339978  0.18406528]
  [ 3.62437248  1.95335352]
  [ 3.0714941   0.40908372]
  ..., 
  [ 3.50883102  1.01763129]
  [ 3.23108721  0.61184025]
  [ 3.22827411  0.72328085]]]
(1, 2000, 2)

Code explanation

expanded_vectors = tf.expand_dims(vectors, 0)

Inserts a new first dimension (D0) into the vectors tensor to create expanded_vectors.

expanded_vectors is a 1(D0) X 2000(D1) X 2(D2) tensor.

vectors = [[ 3.07339978 0.18406528], ..., [ 3.22827411 0.72328085]]

expanded_vectors = [[[ 3.07339978 0.18406528], ..., [ 3.22827411 0.72328085]]]

tf.expand_dims(input, axis=None, name=None, dim=None)

https://www.tensorflow.org/api_docs/python/array_ops/shapes_and_shaping#expand_dims

Inserts an axis of size 1 into the tensor's shape.
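
The effect of the axis argument is easiest to see by comparing shapes. The sketch below uses np.expand_dims as a stand-in for tf.expand_dims (assumed here to behave the same way for these cases).

```python
import numpy as np

points = np.zeros((2000, 2))

front = np.expand_dims(points, 0)    # new axis at position 0 -> (1, 2000, 2)
middle = np.expand_dims(points, 1)   # new axis at position 1 -> (2000, 1, 2)

print(front.shape, middle.shape)
```

In both cases the data is unchanged; only an extra axis of size 1 is inserted at the requested position.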

In [11]:

#expanded_centroids = tf.expand_dims(centroids, 1)
print(expanded_centroids)
print(sess.run(expanded_centroids))
print(expanded_centroids.get_shape())

Tensor("ExpandDims_2:0", shape=(4, 1, 2), dtype=float32)
[[[ 1.0014025   0.41117752]]

 [[-0.0312904  -0.88871944]]

 [[ 3.01476979  0.99944139]]

 [[-0.7339651   0.49133039]]]
(4, 1, 2)

Code explanation

expanded_centroids = tf.expand_dims(centroids, 1)

Inserts a new second dimension (D1) into the centroids tensor to create expanded_centroids.

expanded_centroids is a 4(D0) X 1(D1) X 2(D2) tensor.

centroids = [[ 1.0014025 0.41117752], [-0.0312904 -0.88871944], [ 3.01476979 0.99944139], [-0.7339651 0.49133039]]

expanded_centroids = [[[ 1.0014025 0.41117752]], [[-0.0312904 -0.88871944]], [[ 3.01476979 0.99944139]], [[-0.7339651 0.49133039]]]

In [12]:

#assignments = tf.argmin(tf.reduce_sum(tf.square(tf.sub(expanded_vectors, expanded_centroids)), 2), 0)
diff = tf.sub(expanded_vectors, expanded_centroids)
print(sess.run(diff))
print(diff.get_shape())

[[[  2.07199717e+00  -2.27112234e-01]
  [  2.62297010e+00   1.54217601e+00]
  [  2.07009172e+00  -2.09379196e-03]
  ..., 
  [  2.50742865e+00   6.06453776e-01]
  [  2.22968483e+00   2.00662732e-01]
  [  2.22687149e+00   3.12103331e-01]]

 [[  3.10469007e+00   1.07278466e+00]
  [  3.65566278e+00   2.84207296e+00]
  [  3.10278440e+00   1.29780316e+00]
  ..., 
  [  3.54012132e+00   1.90635073e+00]
  [  3.26237750e+00   1.50055969e+00]
  [  3.25956440e+00   1.61200023e+00]]

 [[  5.86299896e-02  -8.15376103e-01]
  [  6.09602690e-01   9.53912139e-01]
  [  5.67243099e-02  -5.90357661e-01]
  ..., 
  [  4.94061232e-01   1.81899071e-02]
  [  2.16317415e-01  -3.87601137e-01]
  [  2.13504314e-01  -2.76160538e-01]]

 [[  3.80736494e+00  -3.07265103e-01]
  [  4.35833740e+00   1.46202314e+00]
  [  3.80545926e+00  -8.22466612e-02]
  ..., 
  [  4.24279594e+00   5.26300907e-01]
  [  3.96505237e+00   1.20509863e-01]
  [  3.96223927e+00   2.31950462e-01]]]
(4, 2000, 2)

Code explanation

TensorFlow's broadcasting feature

https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html

Through broadcasting, tf.sub can work out how to subtract the elements of the two tensors.

Dimension D0: expanded_vectors has size 1 and expanded_centroids has size 4, so D0 is stretched to 4 for the computation.

Dimension D1: expanded_vectors has size 2000 and expanded_centroids has size 1, so D1 is stretched to 2000 for the computation.
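
The same stretching can be reproduced on a toy input with NumPy, which is assumed here to broadcast the same way as tf.sub. The three points and two centroids are made-up values chosen so the result is easy to verify by hand.

```python
import numpy as np

vectors = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])   # 3 points
centroids = np.array([[0.0, 0.0], [4.0, 4.0]])              # 2 centroids

expanded_vectors = vectors[np.newaxis, :, :]        # shape (1, 3, 2)
expanded_centroids = centroids[:, np.newaxis, :]    # shape (2, 1, 2)

# D0 is stretched from 1 to 2 and D1 from 1 to 3, giving every point/centroid pair.
diff = expanded_vectors - expanded_centroids        # shape (2, 3, 2)
print(diff.shape)
print(diff[1, 0])   # point 0 minus centroid 1
```

Entry diff[c, i] holds the difference between point i and centroid c, which is exactly the layout the later reduce_sum and argmin steps rely on.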

In [13]:

sqr = tf.square(diff)
print(sess.run(sqr))
print(sqr.get_shape())

[[[  4.29317236e+00   5.15799671e-02]
  [  6.87997198e+00   2.37830687e+00]
  [  4.28527975e+00   4.38396455e-06]
  ..., 
  [  6.28719854e+00   3.67786169e-01]
  [  4.97149467e+00   4.02655303e-02]
  [  4.95895672e+00   9.74084884e-02]]

 [[  9.63910007e+00   1.15086699e+00]
  [  1.33638706e+01   8.07737827e+00]
  [  9.62727070e+00   1.68429303e+00]
  ..., 
  [  1.25324593e+01   3.63417315e+00]
  [  1.06431074e+01   2.25167942e+00]
  [  1.06247597e+01   2.59854484e+00]]

 [[  3.43747577e-03   6.64838195e-01]
  [  3.71615440e-01   9.09948349e-01]
  [  3.21764732e-03   3.48522156e-01]
  ..., 
  [  2.44096503e-01   3.30872717e-04]
  [  4.67932224e-02   1.50234640e-01]
  [  4.55840938e-02   7.62646422e-02]]

 [[  1.44960279e+01   9.44118425e-02]
  [  1.89951057e+01   2.13751173e+00]
  [  1.44815207e+01   6.76451344e-03]
  ..., 
  [  1.80013180e+01   2.76992649e-01]
  [  1.57216406e+01   1.45226270e-02]
  [  1.56993399e+01   5.38010150e-02]]]
(4, 2000, 2)

Code explanation

The sqr tensor holds the squares of the values in the diff tensor.

tf.square(x, name=None)

https://www.tensorflow.org/api_docs/python/math_ops/basic_math_functions#square

Computes the square of x element-wise.

In [14]:

distances = tf.reduce_sum(sqr, 2)
print(sess.run(distances))
print(distances.get_shape())

[[  4.34475231   9.25827885   4.28528404 ...,   6.65498447   5.01176023
    5.05636501]
 [ 10.78996658  21.44124985  11.31156349 ...,  16.1666317   12.89478683
   13.22330475]
 [  0.66827565   1.28156376   0.35173979 ...,   0.24442737   0.19702786
    0.12184873]
 [ 14.5904398   21.13261795  14.48828506 ...,  18.27831078  15.73616314
   15.75314045]]
(4, 2000)

Code explanation

The dimension given as the parameter (D2) has been reduced away.

tf.reduce_sum(input_tensor, axis=None, keep_dims=False, name=None, reduction_indices=None)

https://www.tensorflow.org/api_docs/python/math_ops/reduction#reduce_sum

Computes the sum of elements across dimensions of a tensor.

TensorFlow provides math operations that reduce the dimensions of a tensor.

https://www.tensorflow.org/api_docs/python/math_ops/reduction
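
A reduction along one axis can be sketched with NumPy's sum, used here as a stand-in for tf.reduce_sum. The input values are made up; the point is that summing over axis 2 adds the squared x and y differences and removes that axis.

```python
import numpy as np

# Squared differences for 2 centroids x 2 points x (x, y), shape (2, 2, 2).
sqr = np.array([[[1.0, 4.0], [9.0, 16.0]],
                [[0.0, 1.0], [4.0, 9.0]]])

# Summing over axis 2 collapses the (x, y) pair into one squared distance.
distances = sqr.sum(axis=2)   # shape (2, 2)
print(distances)
```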

In [15]:

assignments = tf.argmin(distances, 0)
print(sess.run(assignments))
print(assignments.get_shape())

[2 2 2 ..., 2 2 2]
(2000,)

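Code explanation

tf.argmin returns the index of the smallest value along the given dimension (here D0, which indexes the centroids),

so assignments holds, for each data point, the index of its nearest centroid.

assignments has shape 2000(D0).

tf.argmin(input, dimension, name=None)

https://www.tensorflow.org/versions/r0.10/api_docs/python/math_ops/sequence_comparison_and_indexing#argmin

Returns the index of the smallest value along the given dimension of the tensor.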
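
A small NumPy sketch of the same idea, with np.argmin standing in for tf.argmin and made-up distances for 2 centroids and 3 points:

```python
import numpy as np

# Rows are centroids, columns are points.
distances = np.array([[5.0, 25.0, 2.0],
                      [1.0, 13.0, 8.0]])

# For each column (point), find the row (centroid) with the smallest distance.
assignments = distances.argmin(axis=0)
print(assignments)
```

Points 0 and 1 are closest to centroid 1, and point 2 is closest to centroid 0.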

3.5 Computing the new centroids

In every iteration the points are regrouped, and the new centroid of each group is recomputed.

In [16]:

#means = tf.concat(0, [tf.reduce_mean(tf.gather(vectors,
#    tf.reshape(tf.where(tf.equal(assignments, c)), [1,-1])), 
#    reduction_indices=[1]) for c in range(k)])

#for c in range(k):
c = 0
equal = tf.equal(assignments, c)
print(equal)
print(sess.run(equal))
print(equal.get_shape())

Tensor("Equal_4:0", shape=(2000,), dtype=bool)
[False False False ..., False False False]
(2000,)

Code explanation

The equal function builds a boolean tensor that marks with True each position in the assignments tensor that matches a given cluster (c).

equal has shape 2000(D0).

tf.equal(x, y, name=None)

https://www.tensorflow.org/api_docs/python/control_flow_ops/comparison_operators#equal

Returns the truth value of (x == y) element-wise. Supports broadcasting.

In [17]:

where = tf.where(equal)
print(where)
print(str(sess.run(where)).replace('\n',''))
print(where.get_shape())

Tensor("Where_4:0", shape=(?, 1), dtype=int64)
[[  25] [  33] [  44] [  48] [  49] [  59] [  61] [  68] [  73] [  82] [  93] [ 100] [ 103] [ 104] [ 106] [ 114] [ 117] [ 129] [ 134] [ 136] [ 143] [ 146] [ 169] [ 177] [ 178] [ 191] [ 203] [ 204] [ 205] [ 207] [ 208] [ 210] [ 214] [ 215] [ 218] [ 219] [ 227] [ 242] [ 243] [ 247] [ 254] [ 258] [ 259] [ 264] [ 267] [ 271] [ 281] [ 285] [ 288] [ 295] [ 313] [ 316] [ 334] [ 335] [ 340] [ 341] [ 350] [ 360] [ 388] [ 400] [ 401] [ 410] [ 414] [ 418] [ 432] [ 436] [ 442] [ 446] [ 448] [ 450] [ 452] [ 468] [ 475] [ 478] [ 484] [ 491] [ 505] [ 511] [ 521] [ 523] [ 526] [ 527] [ 531] [ 539] [ 543] [ 545] [ 551] [ 560] [ 561] [ 571] [ 573] [ 576] [ 579] [ 580] [ 604] [ 605] [ 612] [ 615] [ 618] [ 622] [ 645] [ 656] [ 662] [ 664] [ 682] [ 683] [ 688] [ 693] [ 697] [ 701] [ 702] [ 715] [ 720] [ 724] [ 730] [ 742] [ 750] [ 756] [ 759] [ 785] [ 790] [ 802] [ 808] [ 809] [ 827] [ 833] [ 836] [ 840] [ 841] [ 843] [ 846] [ 853] [ 854] [ 864] [ 879] [ 881] [ 883] [ 886] [ 888] [ 900] [ 901] [ 910] [ 913] [ 916] [ 922] [ 928] [ 937] [ 940] [ 944] [ 945] [ 948] [ 949] [ 956] [ 971] [ 972] [ 981] [ 987] [1017] [1028] [1033] [1046] [1048] [1053] [1062] [1066] [1071] [1073] [1076] [1077] [1083] [1084] [1087] [1093] [1094] [1110] [1146] [1147] [1153] [1154] [1171] [1182] [1184] [1191] [1197] [1209] [1210] [1212] [1238] [1251] [1253] [1266] [1267] [1271] [1273] [1280] [1283] [1291] [1294] [1309] [1312] [1326] [1335] [1339] [1344] [1358] [1360] [1363] [1366] [1378] [1383] [1384] [1388] [1395] [1398] [1400] [1420] [1426] [1429] [1431] [1434] [1439] [1443] [1447] [1452] [1455] [1457] [1461] [1464] [1468] [1469] [1471] [1476] [1480] [1482] [1483] [1484] [1494] [1495] [1505] [1524] [1529] [1532] [1534] [1543] [1548] [1550] [1554] [1560] [1570] [1577] [1579] [1581] [1596] [1601] [1602] [1612] [1627] [1630] [1633] [1655] [1675] [1677] [1681] [1684] [1686] [1705] [1710] [1714] [1715] [1720] [1730] [1731] [1732] [1733] [1734] [1744] [1748] [1750] [1761] [1767] [1772] [1776] [1790] [1797] [1802] 
[1804] [1805] [1809] [1812] [1814] [1817] [1823] [1824] [1830] [1836] [1846] [1854] [1855] [1859] [1882] [1895] [1896] [1902] [1904] [1926] [1930] [1933] [1944] [1952] [1954] [1960] [1962] [1963] [1968] [1985] [1988]]
(?, 1)

Code explanation

The where function builds a tensor whose values are the positions marked True in the equal tensor passed as its argument.

where has shape ?(D0) X 1(D1).

tf.where(condition, x=None, y=None, name=None)

https://www.tensorflow.org/api_docs/python/math_ops/sequence_comparison_and_indexing#where

Returns elements from x or y depending on the condition.

If both x and y are None, this operation returns the positions of the true elements of the condition.
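
The single-argument form can be imitated with np.argwhere, which is used here as an assumed NumPy counterpart. The assignments array is a made-up example.

```python
import numpy as np

assignments = np.array([2, 0, 1, 0, 2])

equal = assignments == 0       # boolean mask for cluster 0
where = np.argwhere(equal)     # positions of the True entries, shape (n_true, 1)

print(where.shape, where.ravel())
```

The number of rows depends on how many entries are True, which is why the static shape in the TensorFlow output above shows ? for D0.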

In [18]:

reshape = tf.reshape(where, [1,-1])
print(reshape)
print(sess.run(reshape))
print(reshape.get_shape())

Tensor("Reshape_4:0", shape=(1, ?), dtype=int64)
[[  25   33   44   48   49   59   61   68   73   82   93  100  103  104
   106  114  117  129  134  136  143  146  169  177  178  191  203  204
   205  207  208  210  214  215  218  219  227  242  243  247  254  258
   259  264  267  271  281  285  288  295  313  316  334  335  340  341
   350  360  388  400  401  410  414  418  432  436  442  446  448  450
   452  468  475  478  484  491  505  511  521  523  526  527  531  539
   543  545  551  560  561  571  573  576  579  580  604  605  612  615
   618  622  645  656  662  664  682  683  688  693  697  701  702  715
   720  724  730  742  750  756  759  785  790  802  808  809  827  833
   836  840  841  843  846  853  854  864  879  881  883  886  888  900
   901  910  913  916  922  928  937  940  944  945  948  949  956  971
   972  981  987 1017 1028 1033 1046 1048 1053 1062 1066 1071 1073 1076
  1077 1083 1084 1087 1093 1094 1110 1146 1147 1153 1154 1171 1182 1184
  1191 1197 1209 1210 1212 1238 1251 1253 1266 1267 1271 1273 1280 1283
  1291 1294 1309 1312 1326 1335 1339 1344 1358 1360 1363 1366 1378 1383
  1384 1388 1395 1398 1400 1420 1426 1429 1431 1434 1439 1443 1447 1452
  1455 1457 1461 1464 1468 1469 1471 1476 1480 1482 1483 1484 1494 1495
  1505 1524 1529 1532 1534 1543 1548 1550 1554 1560 1570 1577 1579 1581
  1596 1601 1602 1612 1627 1630 1633 1655 1675 1677 1681 1684 1686 1705
  1710 1714 1715 1720 1730 1731 1732 1733 1734 1744 1748 1750 1761 1767
  1772 1776 1790 1797 1802 1804 1805 1809 1812 1814 1817 1823 1824 1830
  1836 1846 1854 1855 1859 1882 1895 1896 1902 1904 1926 1930 1933 1944
  1952 1954 1960 1962 1963 1968 1985 1988]]
(1, ?)

Code explanation

The reshape function builds a 1 X ? tensor made up of the indices of the points in the vectors tensor that belong to cluster c.

reshape has shape 1(D0) X ?(D1).

tf.reshape(tensor, shape, name=None)

https://www.tensorflow.org/api_docs/python/array_ops/shapes_and_shaping#reshape

Changes the shape of a tensor.

Example: [1, -1] means one row, with the number of columns inferred so that all remaining elements fit.
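
The role of -1 can be seen with NumPy's reshape, assumed here to infer the missing dimension the same way. The index values are the first three from the output above.

```python
import numpy as np

where = np.array([[25], [33], [44]])   # shape (3, 1): one index per row

# One row; -1 lets the column count be inferred from the number of elements.
reshaped = where.reshape(1, -1)        # shape (1, 3)
print(reshaped.shape, reshaped)
```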

In [19]:

gather = tf.gather(vectors, reshape)
print(gather)
gather_str = str(sess.run(gather)).replace('\n','')
print(gather_str[:gather_str.find("]")+1] + ", ..., " + gather_str[gather_str.rfind("["):])
print(gather.get_shape())

Tensor("Gather_4:0", shape=(1, ?, 2), dtype=float32)
[[[ 2.244349   -0.62279582], ..., [ 1.7031877  -0.69932729]]]
(1, ?, 2)

Code explanation

The gather function builds a tensor that collects the coordinates of the points in vectors that make up cluster c (reshape holds their indices into vectors).

gather has shape 1(D0) X ?(D1) X 2(D2).

tf.gather(params, indices, validate_indices=None, name=None)

https://www.tensorflow.org/api_docs/python/array_ops/slicing_and_joining#gather

Gathers slices of params according to indices.
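
The shape of the gathered result follows from the shape of the indices. In the sketch below np.take along axis 0 stands in for tf.gather (an assumption), and the data is made up.

```python
import numpy as np

vectors = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
indices = np.array([[1, 3]])                    # shape (1, 2)

# Result shape is indices.shape + vectors.shape[1:], here (1, 2, 2).
gathered = np.take(vectors, indices, axis=0)
print(gathered.shape)
print(gathered)
```

Because the indices were reshaped to 1 X ?, the gathered tensor keeps that leading dimension of size 1.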

In [20]:

#reduce_mean = tf.reduce_mean(gather, reduction_indices=[1])
reduce_mean = tf.reduce_mean(gather, 1)
print(reduce_mean)
print(sess.run(reduce_mean))
print(reduce_mean.get_shape())

Tensor("Mean_4:0", shape=(1, 2), dtype=float32)
[[ 1.0014025   0.41117752]]
(1, 2)

Code explanation

The reduce_mean function builds a 1 X 2 tensor holding the mean (over dimension D1) of all points that belong to cluster c.

reduce_mean has shape 1(D0) X 2(D1).

tf.reduce_mean(input_tensor, axis=None, keep_dims=False, name=None, reduction_indices=None)

https://www.tensorflow.org/api_docs/python/math_ops/reduction#reduce_mean

Computes the mean of elements along the dimensions of a tensor.
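
Averaging over the point axis can be sketched with NumPy's mean, used here as a stand-in for tf.reduce_mean. The two points are made up.

```python
import numpy as np

# One cluster containing two points, shape (1, 2, 2).
gathered = np.array([[[1.0, 1.0], [3.0, 5.0]]])

# Averaging over axis 1 (the points) leaves one (x, y) pair: the new centroid.
mean = gathered.mean(axis=1)   # shape (1, 2)
print(mean)
```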

In [21]:

means_list = [tf.reduce_mean(tf.gather(vectors,
    tf.reshape(tf.where(tf.equal(assignments, c)), [1,-1])), 
    reduction_indices=[1]) for c in range(k)]
print(means_list)
print(sess.run(means_list))

[<tf.Tensor 'Mean_5:0' shape=(1, 2) dtype=float32>, <tf.Tensor 'Mean_6:0' shape=(1, 2) dtype=float32>, <tf.Tensor 'Mean_7:0' shape=(1, 2) dtype=float32>, <tf.Tensor 'Mean_8:0' shape=(1, 2) dtype=float32>]
[array([[ 1.0014025 ,  0.41117752]], dtype=float32), array([[-0.0312904 , -0.88871944]], dtype=float32), array([[ 3.01476979,  0.99944139]], dtype=float32), array([[-0.7339651 ,  0.49133039]], dtype=float32)]

Code explanation

means_list is a list of 4 tensors, each of shape 1(D0) X 2(D1).

In [22]:

means = tf.concat(0, means_list)
print(means)
print(sess.run(means))
print(means.get_shape())

Tensor("concat_1:0", shape=(4, 2), dtype=float32)
[[ 1.0014025   0.41117752]
 [-0.0312904  -0.88871944]
 [ 3.01476979  0.99944139]
 [-0.7339651   0.49133039]]
(4, 2)

Code explanation

The concat function builds a 4 X 2 tensor holding, for each of the 4 clusters, the mean (over dimension D1) of all its points.

means has shape 4(D0) X 2(D1).

tf.concat(concat_dim, values, name='concat')

https://www.tensorflow.org/api_docs/python/array_ops/slicing_and_joining#concat

Concatenates tensors along the concat_dim dimension.
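
Joining per-cluster means into one matrix can be sketched with np.concatenate, assumed here as the NumPy counterpart. Note that NumPy takes the list first and the axis second, the reverse of the tf.concat signature shown above. The values are made up.

```python
import numpy as np

# Two per-cluster means, each of shape (1, 2).
means_list = [np.array([[1.0, 2.0]]), np.array([[3.0, 4.0]])]

# Stacking along axis 0 gives one row per cluster.
means = np.concatenate(means_list, axis=0)   # shape (2, 2)
print(means)
```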

TensorFlow API documentation

https://www.tensorflow.org/versions/master/api_docs/

3.6 Running the graph

In [23]:

update_centroids = tf.assign(centroids, means)

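Code explanation

Assigns the value of the means tensor to centroids.

This is needed so that, each time run() is executed, the updated centroid values are used in the next iteration of the loop.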

In [24]:

init_op = tf.initialize_all_variables()

Code explanation

Also create the operation that initializes all variables before the data graph is run.

tf.initialize_all_variables(*args, **kwargs)

https://www.tensorflow.org/api_docs/python/state_ops/exporting_and_importing_meta_graphs#initialize_all_variables

This function is deprecated and is scheduled for removal after March 2, 2017. Use the following function instead.

tf.global_variables_initializer()

https://www.tensorflow.org/api_docs/python/state_ops/variable_helper_functions#global_variables_initializer

Returns an operation that initializes global variables.

In [25]:

sess = tf.Session()
sess.run(init_op)

for step in range(100):
    _, centroid_values, assignment_values = sess.run([update_centroids, centroids, assignments])

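Code explanation

Run the data graph. In each iteration the centroids are updated and every point is reassigned to a cluster.

The three fetches (update_centroids, centroids, assignments) are evaluated in a single run() call, and their results are returned in the order listed.

The values of the three corresponding tensors are returned as NumPy arrays.

The value returned for update_centroids is not needed, so it is discarded by assigning it to _ (underscore).

(By convention, Python programmers use _ for results they intend to discard.)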

In [26]:

print(centroid_values)

[[ 0.04245283 -0.93935204]
 [ 2.99757552  0.99233991]
 [ 0.7706247   0.45893538]
 [-0.86792088  0.4262071 ]]

Example code

https://github.com/jorditorresBCN/LibroTensorFlow/blob/master/kmeans.py