Chapter 13 – Loading and Preprocessing Data with TensorFlow
This notebook contains all the sample code and exercise solutions for Chapter 13.
First, let's import a few modules, make sure Matplotlib plots figures inline, and prepare a function to save the figures. We also check that Python 3.5 or later is installed (Python 2.x may work, but it is deprecated, so we strongly recommend using Python 3), as well as Scikit-Learn ≥ 0.20 and TensorFlow ≥ 2.0.
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)
# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"
try:
# %tensorflow_version only exists in Colab.
%tensorflow_version 2.x
%pip install -q -U tfx
print("패키지 호환 에러는 무시해도 괜찮습니다.")
except Exception:
pass
# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"
# Common imports
import numpy as np
import os
# To make this notebook's output stable across runs
np.random.seed(42)
# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "data"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)
def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
print("그림 저장:", fig_id)
if tight_layout:
plt.tight_layout()
plt.savefig(path, format=fig_extension, dpi=resolution)
You can safely ignore package incompatibility errors.
X = tf.range(10)
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset
<TensorSliceDataset shapes: (), types: tf.int32>
This is equivalent to:
dataset = tf.data.Dataset.range(10)
for item in dataset:
print(item)
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)
dataset = dataset.repeat(3).batch(7)
for item in dataset:
print(item)
tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int64)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int64)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int64)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int64)
tf.Tensor([8 9], shape=(2,), dtype=int64)
dataset = dataset.map(lambda x: x * 2)
for item in dataset:
print(item)
tf.Tensor([ 0 2 4 6 8 10 12], shape=(7,), dtype=int64)
tf.Tensor([14 16 18 0 2 4 6], shape=(7,), dtype=int64)
tf.Tensor([ 8 10 12 14 16 18 0], shape=(7,), dtype=int64)
tf.Tensor([ 2 4 6 8 10 12 14], shape=(7,), dtype=int64)
tf.Tensor([16 18], shape=(2,), dtype=int64)
#dataset = dataset.apply(tf.data.experimental.unbatch()) # Now deprecated
dataset = dataset.unbatch()
dataset = dataset.filter(lambda x: x < 10) # keep only items < 10
for item in dataset.take(3):
print(item)
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.random.set_seed(42)
dataset = tf.data.Dataset.range(10).repeat(3)
dataset = dataset.shuffle(buffer_size=3, seed=42).batch(7)
for item in dataset:
print(item)
tf.Tensor([1 3 0 4 2 5 6], shape=(7,), dtype=int64)
tf.Tensor([8 7 1 0 3 2 5], shape=(7,), dtype=int64)
tf.Tensor([4 6 9 8 9 7 0], shape=(7,), dtype=int64)
tf.Tensor([3 1 4 5 2 8 7], shape=(7,), dtype=int64)
tf.Tensor([6 9], shape=(2,), dtype=int64)
Let's load and prepare the California housing dataset. We first load it, then split it into a training set, a validation set, and a test set, and finally we scale it:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
housing.data, housing.target.reshape(-1, 1), random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
X_train_full, y_train_full, random_state=42)
scaler = StandardScaler()
scaler.fit(X_train)
X_mean = scaler.mean_
X_std = scaler.scale_
For a very large dataset that does not fit in memory, you will typically want to split it into many files first, then have TensorFlow read these files in parallel. To demonstrate this, let's split the housing dataset into 20 CSV files:
def save_to_multiple_csv_files(data, name_prefix, header=None, n_parts=10):
housing_dir = os.path.join("datasets", "housing")
os.makedirs(housing_dir, exist_ok=True)
path_format = os.path.join(housing_dir, "my_{}_{:02d}.csv")
filepaths = []
m = len(data)
for file_idx, row_indices in enumerate(np.array_split(np.arange(m), n_parts)):
part_csv = path_format.format(name_prefix, file_idx)
filepaths.append(part_csv)
with open(part_csv, "wt", encoding="utf-8") as f:
if header is not None:
f.write(header)
f.write("\n")
for row_idx in row_indices:
f.write(",".join([repr(col) for col in data[row_idx]]))
f.write("\n")
return filepaths
train_data = np.c_[X_train, y_train]
valid_data = np.c_[X_valid, y_valid]
test_data = np.c_[X_test, y_test]
header_cols = housing.feature_names + ["MedianHouseValue"]
header = ",".join(header_cols)
train_filepaths = save_to_multiple_csv_files(train_data, "train", header, n_parts=20)
valid_filepaths = save_to_multiple_csv_files(valid_data, "valid", header, n_parts=10)
test_filepaths = save_to_multiple_csv_files(test_data, "test", header, n_parts=10)
Good. Now let's print a few lines from one of these CSV files:
import pandas as pd
pd.read_csv(train_filepaths[0]).head()
 | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedianHouseValue |
---|---|---|---|---|---|---|---|---|---|
0 | 3.5214 | 15.0 | 3.049945 | 1.106548 | 1447.0 | 1.605993 | 37.63 | -122.43 | 1.442 |
1 | 5.3275 | 5.0 | 6.490060 | 0.991054 | 3464.0 | 3.443340 | 33.69 | -117.39 | 1.687 |
2 | 3.1000 | 29.0 | 7.542373 | 1.591525 | 1328.0 | 2.250847 | 38.44 | -122.98 | 1.621 |
3 | 7.1736 | 12.0 | 6.289003 | 0.997442 | 1054.0 | 2.695652 | 33.55 | -117.70 | 2.621 |
4 | 2.0549 | 13.0 | 5.312457 | 1.085092 | 3297.0 | 2.244384 | 33.93 | -116.93 | 0.956 |
Or read as a plain text file:
with open(train_filepaths[0]) as f:
for i in range(5):
print(f.readline(), end="")
MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedianHouseValue
3.5214,15.0,3.0499445061043287,1.106548279689234,1447.0,1.6059933407325193,37.63,-122.43,1.442
5.3275,5.0,6.490059642147117,0.9910536779324056,3464.0,3.4433399602385686,33.69,-117.39,1.687
3.1,29.0,7.5423728813559325,1.5915254237288134,1328.0,2.2508474576271187,38.44,-122.98,1.621
7.1736,12.0,6.289002557544757,0.9974424552429667,1054.0,2.6956521739130435,33.55,-117.7,2.621
train_filepaths
['datasets/housing/my_train_00.csv', 'datasets/housing/my_train_01.csv', 'datasets/housing/my_train_02.csv', 'datasets/housing/my_train_03.csv', 'datasets/housing/my_train_04.csv', 'datasets/housing/my_train_05.csv', 'datasets/housing/my_train_06.csv', 'datasets/housing/my_train_07.csv', 'datasets/housing/my_train_08.csv', 'datasets/housing/my_train_09.csv', 'datasets/housing/my_train_10.csv', 'datasets/housing/my_train_11.csv', 'datasets/housing/my_train_12.csv', 'datasets/housing/my_train_13.csv', 'datasets/housing/my_train_14.csv', 'datasets/housing/my_train_15.csv', 'datasets/housing/my_train_16.csv', 'datasets/housing/my_train_17.csv', 'datasets/housing/my_train_18.csv', 'datasets/housing/my_train_19.csv']
filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed=42)
for filepath in filepath_dataset:
print(filepath)
tf.Tensor(b'datasets/housing/my_train_15.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_08.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_03.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_01.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_10.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_05.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_19.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_16.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_02.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_09.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_00.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_07.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_12.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_04.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_17.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_11.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_14.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_18.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_06.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_13.csv', shape=(), dtype=string)
n_readers = 5
dataset = filepath_dataset.interleave(
lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
cycle_length=n_readers)
for line in dataset.take(5):
print(line.numpy())
b'4.6477,38.0,5.03728813559322,0.911864406779661,745.0,2.5254237288135593,32.64,-117.07,1.504'
b'8.72,44.0,6.163179916317992,1.0460251046025104,668.0,2.794979079497908,34.2,-118.18,4.159'
b'3.8456,35.0,5.461346633416459,0.9576059850374065,1154.0,2.8778054862842892,37.96,-122.05,1.598'
b'3.3456,37.0,4.514084507042254,0.9084507042253521,458.0,3.2253521126760565,36.67,-121.7,2.526'
b'3.6875,44.0,4.524475524475524,0.993006993006993,457.0,3.195804195804196,34.04,-118.15,1.625'
Notice that the 4 in the fourth field is interpreted as a string:
record_defaults=[0, np.nan, tf.constant(np.nan, dtype=tf.float64), "Hello", tf.constant([])]
parsed_fields = tf.io.decode_csv('1,2,3,4,5', record_defaults)
parsed_fields
[<tf.Tensor: shape=(), dtype=int32, numpy=1>, <tf.Tensor: shape=(), dtype=float32, numpy=2.0>, <tf.Tensor: shape=(), dtype=float64, numpy=3.0>, <tf.Tensor: shape=(), dtype=string, numpy=b'4'>, <tf.Tensor: shape=(), dtype=float32, numpy=5.0>]
Missing values are replaced with the default values we provided:
parsed_fields = tf.io.decode_csv(',,,,5', record_defaults)
parsed_fields
[<tf.Tensor: shape=(), dtype=int32, numpy=0>, <tf.Tensor: shape=(), dtype=float32, numpy=nan>, <tf.Tensor: shape=(), dtype=float64, numpy=nan>, <tf.Tensor: shape=(), dtype=string, numpy=b'Hello'>, <tf.Tensor: shape=(), dtype=float32, numpy=5.0>]
The fifth field is compulsory (because we set its default value to tf.constant([])), so an exception is raised if we do not provide a value for it:
try:
parsed_fields = tf.io.decode_csv(',,,,', record_defaults)
except tf.errors.InvalidArgumentError as ex:
print(ex)
Field 4 is required but missing in record 0! [Op:DecodeCSV]
The number of fields must match exactly the number of fields in record_defaults:
try:
parsed_fields = tf.io.decode_csv('1,2,3,4,5,6,7', record_defaults)
except tf.errors.InvalidArgumentError as ex:
print(ex)
Expect 5 fields but have 7 in record 0 [Op:DecodeCSV]
n_inputs = 8 # X_train.shape[-1]
@tf.function
def preprocess(line):
defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
fields = tf.io.decode_csv(line, record_defaults=defs)
x = tf.stack(fields[:-1])
y = tf.stack(fields[-1:])
return (x - X_mean) / X_std, y
preprocess(b'4.2083,44.0,5.3232,0.9171,846.0,2.3370,37.47,-122.2,2.782')
(<tf.Tensor: shape=(8,), dtype=float32, numpy= array([ 0.16579157, 1.216324 , -0.05204565, -0.39215982, -0.5277444 , -0.2633488 , 0.8543046 , -1.3072058 ], dtype=float32)>, <tf.Tensor: shape=(1,), dtype=float32, numpy=array([2.782], dtype=float32)>)
def csv_reader_dataset(filepaths, repeat=1, n_readers=5,
n_read_threads=None, shuffle_buffer_size=10000,
n_parse_threads=5, batch_size=32):
dataset = tf.data.Dataset.list_files(filepaths).repeat(repeat)
dataset = dataset.interleave(
lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
cycle_length=n_readers, num_parallel_calls=n_read_threads)
dataset = dataset.shuffle(shuffle_buffer_size)
dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)
dataset = dataset.batch(batch_size)
return dataset.prefetch(1)
tf.random.set_seed(42)
train_set = csv_reader_dataset(train_filepaths, batch_size=3)
for X_batch, y_batch in train_set.take(2):
print("X =", X_batch)
print("y =", y_batch)
print()
X = tf.Tensor( [[ 0.5804519 -0.20762321 0.05616303 -0.15191229 0.01343246 0.00604472 1.2525111 -1.3671792 ] [ 5.818099 1.8491895 1.1784915 0.28173092 -1.2496178 -0.3571987 0.7231292 -1.0023477 ] [-0.9253566 0.5834586 -0.7807257 -0.28213993 -0.36530012 0.27389365 -0.76194876 0.72684526]], shape=(3, 8), dtype=float32) y = tf.Tensor( [[1.752] [1.313] [1.535]], shape=(3, 1), dtype=float32) X = tf.Tensor( [[-0.8324941 0.6625668 -0.20741376 -0.18699841 -0.14536144 0.09635526 0.9807942 -0.67250353] [-0.62183803 0.5834586 -0.19862501 -0.3500319 -1.1437552 -0.3363751 1.107282 -0.8674123 ] [ 0.8683102 0.02970133 0.3427381 -0.29872298 0.7124906 0.28026953 -0.72915536 0.86178064]], shape=(3, 8), dtype=float32) y = tf.Tensor( [[0.919] [1.028] [2.182]], shape=(3, 1), dtype=float32)
train_set = csv_reader_dataset(train_filepaths, repeat=None)
valid_set = csv_reader_dataset(valid_filepaths)
test_set = csv_reader_dataset(test_filepaths)
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)
model = keras.models.Sequential([
keras.layers.Dense(30, activation="relu", input_shape=X_train.shape[1:]),
keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer=keras.optimizers.SGD(learning_rate=1e-3))
batch_size = 32
model.fit(train_set, steps_per_epoch=len(X_train) // batch_size, epochs=10,
validation_data=valid_set)
Epoch 1/10 362/362 [==============================] - 2s 3ms/step - loss: 1.4679 - val_loss: 21.5124 Epoch 2/10 362/362 [==============================] - 1s 3ms/step - loss: 0.8735 - val_loss: 0.6648 Epoch 3/10 362/362 [==============================] - 1s 3ms/step - loss: 0.6317 - val_loss: 0.6196 Epoch 4/10 362/362 [==============================] - 1s 3ms/step - loss: 0.5933 - val_loss: 0.5669 Epoch 5/10 362/362 [==============================] - 1s 3ms/step - loss: 0.5629 - val_loss: 0.5402 Epoch 6/10 362/362 [==============================] - 1s 3ms/step - loss: 0.5693 - val_loss: 0.5209 Epoch 7/10 362/362 [==============================] - 1s 3ms/step - loss: 0.5231 - val_loss: 0.6130 Epoch 8/10 362/362 [==============================] - 1s 3ms/step - loss: 0.5074 - val_loss: 0.4818 Epoch 9/10 362/362 [==============================] - 1s 3ms/step - loss: 0.4963 - val_loss: 0.4904 Epoch 10/10 362/362 [==============================] - 1s 3ms/step - loss: 0.5023 - val_loss: 0.4585
<tensorflow.python.keras.callbacks.History at 0x7fb90e831550>
model.evaluate(test_set, steps=len(X_test) // batch_size)
161/161 [==============================] - 0s 2ms/step - loss: 0.4788
0.4787752032279968
new_set = test_set.map(lambda X, y: X) # we could instead just pass test_set, Keras would ignore the labels
X_new = X_test
model.predict(new_set, steps=len(X_new) // batch_size)
array([[2.3576405], [2.255291 ], [1.4437605], ..., [0.5654392], [3.9442453], [1.0232248]], dtype=float32)
optimizer = keras.optimizers.Nadam(learning_rate=0.01)
loss_fn = keras.losses.mean_squared_error
n_epochs = 5
batch_size = 32
n_steps_per_epoch = len(X_train) // batch_size
total_steps = n_epochs * n_steps_per_epoch
global_step = 0
for X_batch, y_batch in train_set.take(total_steps):
global_step += 1
print("\rGlobal step {}/{}".format(global_step, total_steps), end="")
with tf.GradientTape() as tape:
y_pred = model(X_batch)
main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
loss = tf.add_n([main_loss] + model.losses)
gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
Global step 1810/1810
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)
optimizer = keras.optimizers.Nadam(learning_rate=0.01)
loss_fn = keras.losses.mean_squared_error
@tf.function
def train(model, n_epochs, batch_size=32,
n_readers=5, n_read_threads=5, shuffle_buffer_size=10000, n_parse_threads=5):
train_set = csv_reader_dataset(train_filepaths, repeat=n_epochs, n_readers=n_readers,
n_read_threads=n_read_threads, shuffle_buffer_size=shuffle_buffer_size,
n_parse_threads=n_parse_threads, batch_size=batch_size)
for X_batch, y_batch in train_set:
with tf.GradientTape() as tape:
y_pred = model(X_batch)
main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
loss = tf.add_n([main_loss] + model.losses)
gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
train(model, 5)
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)
optimizer = keras.optimizers.Nadam(learning_rate=0.01)
loss_fn = keras.losses.mean_squared_error
@tf.function
def train(model, n_epochs, batch_size=32,
n_readers=5, n_read_threads=5, shuffle_buffer_size=10000, n_parse_threads=5):
train_set = csv_reader_dataset(train_filepaths, repeat=n_epochs, n_readers=n_readers,
n_read_threads=n_read_threads, shuffle_buffer_size=shuffle_buffer_size,
n_parse_threads=n_parse_threads, batch_size=batch_size)
n_steps_per_epoch = len(X_train) // batch_size
total_steps = n_epochs * n_steps_per_epoch
global_step = 0
for X_batch, y_batch in train_set.take(total_steps):
global_step += 1
if tf.equal(global_step % 100, 0):
tf.print("\rGlobal step", global_step, "/", total_steps)
with tf.GradientTape() as tape:
y_pred = model(X_batch)
main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
loss = tf.add_n([main_loss] + model.losses)
gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
train(model, 5)
Global step 100 / 1810 Global step 200 / 1810 Global step 300 / 1810 Global step 400 / 1810 Global step 500 / 1810 Global step 600 / 1810 Global step 700 / 1810 Global step 800 / 1810 Global step 900 / 1810 Global step 1000 / 1810 Global step 1100 / 1810 Global step 1200 / 1810 Global step 1300 / 1810 Global step 1400 / 1810 Global step 1500 / 1810 Global step 1600 / 1810 Global step 1700 / 1810 Global step 1800 / 1810
Here is a short description of each method in the Dataset class:
for m in dir(tf.data.Dataset):
if not (m.startswith("_") or m.endswith("_")):
func = getattr(tf.data.Dataset, m)
if hasattr(func, "__doc__"):
print("● {:21s}{}".format(m + "()", func.__doc__.split("\n")[0]))
● apply() Applies a transformation function to this dataset.
● as_numpy_iterator() Returns an iterator which converts all elements of the dataset to numpy.
● batch() Combines consecutive elements of this dataset into batches.
● cache() Caches the elements in this dataset.
● cardinality() Returns the cardinality of the dataset, if known.
● concatenate() Creates a `Dataset` by concatenating the given dataset with this dataset.
● element_spec() The type specification of an element of this dataset.
● enumerate() Enumerates the elements of this dataset.
● filter() Filters this dataset according to `predicate`.
● flat_map() Maps `map_func` across this dataset and flattens the result.
● from_generator() Creates a `Dataset` whose elements are generated by `generator`. (deprecated arguments)
● from_tensor_slices() Creates a `Dataset` whose elements are slices of the given tensors.
● from_tensors() Creates a `Dataset` with a single element, comprising the given tensors.
● interleave() Maps `map_func` across this dataset, and interleaves the results.
● list_files() A dataset of all files matching one or more glob patterns.
● map() Maps `map_func` across the elements of this dataset.
● options() Returns the options for this dataset and its inputs.
● padded_batch() Combines consecutive elements of this dataset into padded batches.
● prefetch() Creates a `Dataset` that prefetches elements from this dataset.
● range() Creates a `Dataset` of a step-separated range of values.
● reduce() Reduces the input dataset to a single element.
● repeat() Repeats this dataset so each original value is seen `count` times.
● shard() Creates a `Dataset` that includes only 1/`num_shards` of this dataset.
● shuffle() Randomly shuffles the elements of this dataset.
● skip() Creates a `Dataset` that skips `count` elements from this dataset.
● take() Creates a `Dataset` with at most `count` elements from this dataset.
● unbatch() Splits elements of a dataset into multiple elements.
● window() Combines (nests of) input elements into a dataset of (nests of) windows.
● with_options() Returns a new `tf.data.Dataset` with the given options set.
● zip() Creates a `Dataset` by zipping together the given datasets.
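Two of these methods, window() and flat_map(), are not demonstrated elsewhere in this chapter, so here is a minimal sketch (not from the original notebook) showing how they can be combined to build sliding windows of consecutive items:
w_dataset = tf.data.Dataset.range(10)
w_dataset = w_dataset.window(5, shift=1, drop_remainder=True)   # nested datasets of 5 consecutive items
w_dataset = w_dataset.flat_map(lambda window: window.batch(5))  # flatten each window into a single tensor
for item in w_dataset.take(3):
    print(item.numpy())  # [0 1 2 3 4], then [1 2 3 4 5], then [2 3 4 5 6]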
The TFRecord binary format
A TFRecord file is simply a list of binary records. You can create one using tf.io.TFRecordWriter:
with tf.io.TFRecordWriter("my_data.tfrecord") as f:
f.write(b"This is the first record")
f.write(b"And this is the second record")
And you can read it using a tf.data.TFRecordDataset:
filepaths = ["my_data.tfrecord"]
dataset = tf.data.TFRecordDataset(filepaths)
for item in dataset:
print(item)
tf.Tensor(b'This is the first record', shape=(), dtype=string)
tf.Tensor(b'And this is the second record', shape=(), dtype=string)
One TFRecordDataset can read multiple TFRecord files. By default it reads them one at a time, but if you set num_parallel_reads=3, for example, it will read 3 files in parallel and interleave their records:
filepaths = ["my_test_{}.tfrecord".format(i) for i in range(5)]
for i, filepath in enumerate(filepaths):
with tf.io.TFRecordWriter(filepath) as f:
for j in range(3):
f.write("File {} record {}".format(i, j).encode("utf-8"))
dataset = tf.data.TFRecordDataset(filepaths, num_parallel_reads=3)
for item in dataset:
print(item)
tf.Tensor(b'File 0 record 0', shape=(), dtype=string)
tf.Tensor(b'File 1 record 0', shape=(), dtype=string)
tf.Tensor(b'File 2 record 0', shape=(), dtype=string)
tf.Tensor(b'File 0 record 1', shape=(), dtype=string)
tf.Tensor(b'File 1 record 1', shape=(), dtype=string)
tf.Tensor(b'File 2 record 1', shape=(), dtype=string)
tf.Tensor(b'File 0 record 2', shape=(), dtype=string)
tf.Tensor(b'File 1 record 2', shape=(), dtype=string)
tf.Tensor(b'File 2 record 2', shape=(), dtype=string)
tf.Tensor(b'File 3 record 0', shape=(), dtype=string)
tf.Tensor(b'File 4 record 0', shape=(), dtype=string)
tf.Tensor(b'File 3 record 1', shape=(), dtype=string)
tf.Tensor(b'File 4 record 1', shape=(), dtype=string)
tf.Tensor(b'File 3 record 2', shape=(), dtype=string)
tf.Tensor(b'File 4 record 2', shape=(), dtype=string)
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter("my_compressed.tfrecord", options) as f:
f.write(b"This is the first record")
f.write(b"And this is the second record")
dataset = tf.data.TFRecordDataset(["my_compressed.tfrecord"],
compression_type="GZIP")
for item in dataset:
print(item)
tf.Tensor(b'This is the first record', shape=(), dtype=string)
tf.Tensor(b'And this is the second record', shape=(), dtype=string)
For this section you need to install protocol buffers. In general you will not have to install them yourself when using TensorFlow, since it comes with functions to create and parse protocol buffers of type tf.train.Example, and these are usually sufficient. In this section, however, we will briefly build our own protocol buffer, so we need the protocol buffer compiler (protoc): we will use it to compile the protobuf definition into a Python module that we can use in our code.
First, let's write a simple protobuf definition:
%%writefile person.proto
syntax = "proto3";
message Person {
string name = 1;
int32 id = 2;
repeated string email = 3;
}
Overwriting person.proto
Let's compile this definition (the --descriptor_set_out and --include_imports options are only required for the tf.io.decode_proto() example below):
!protoc person.proto --python_out=. --descriptor_set_out=person.desc --include_imports
!ls person*
person.desc person_pb2.py person.proto
from person_pb2 import Person
person = Person(name="Al", id=123, email=["a@b.com"]) # create a Person
print(person) # print the Person
name: "Al" id: 123 email: "a@b.com"
person.name # read a field
'Al'
person.name = "Alice" # modify a field
person.email[0] # repeated fields can be accessed like arrays
'a@b.com'
person.email.append("c@d.com") # add an email address
s = person.SerializeToString() # serialize the object to a byte string
s
b'\n\x05Alice\x10{\x1a\x07a@b.com\x1a\x07c@d.com'
person2 = Person() # create a new Person
person2.ParseFromString(s) # parse the byte string (27 bytes long)
27
person == person2 # now they are equal
True
In rare cases, you may need to parse a custom protocol buffer (like the one we just created) in TensorFlow. For this you can use the tf.io.decode_proto() function:
person_tf = tf.io.decode_proto(
bytes=s,
message_type="Person",
field_names=["name", "id", "email"],
output_types=[tf.string, tf.int32, tf.string],
descriptor_source="person.desc")
person_tf.values
[<tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Alice'], dtype=object)>, <tf.Tensor: shape=(1,), dtype=int32, numpy=array([123], dtype=int32)>, <tf.Tensor: shape=(2,), dtype=string, numpy=array([b'a@b.com', b'c@d.com'], dtype=object)>]
See the tf.io.decode_proto() documentation for more details.
Here is the definition of the tf.train.Example protocol buffer:
syntax = "proto3";
message BytesList { repeated bytes value = 1; }
message FloatList { repeated float value = 1 [packed = true]; }
message Int64List { repeated int64 value = 1 [packed = true]; }
message Feature {
oneof kind {
BytesList bytes_list = 1;
FloatList float_list = 2;
Int64List int64_list = 3;
}
};
message Features { map<string, Feature> feature = 1; };
message Example { Features features = 1; };
Warning: in TensorFlow 2.0 and 2.1, a bug prevented from tensorflow.train import X, so we write X = tf.train.X instead. See https://github.com/tensorflow/tensorflow/issues/33289 for more details.
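If you do run into that bug, the workaround mentioned in the warning simply aliases the classes instead of importing them; a sketch equivalent to the imports below:
# Workaround for the TF 2.0/2.1 import bug: alias the tf.train classes directly
BytesList = tf.train.BytesList
FloatList = tf.train.FloatList
Int64List = tf.train.Int64List
Feature = tf.train.Feature
Features = tf.train.Features
Example = tf.train.Example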
from tensorflow.train import BytesList, FloatList, Int64List
from tensorflow.train import Feature, Features, Example
person_example = Example(
features=Features(
feature={
"name": Feature(bytes_list=BytesList(value=[b"Alice"])),
"id": Feature(int64_list=Int64List(value=[123])),
"emails": Feature(bytes_list=BytesList(value=[b"a@b.com", b"c@d.com"]))
}))
with tf.io.TFRecordWriter("my_contacts.tfrecord") as f:
f.write(person_example.SerializeToString())
feature_description = {
"name": tf.io.FixedLenFeature([], tf.string, default_value=""),
"id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
"emails": tf.io.VarLenFeature(tf.string),
}
for serialized_example in tf.data.TFRecordDataset(["my_contacts.tfrecord"]):
parsed_example = tf.io.parse_single_example(serialized_example,
feature_description)
parsed_example
{'emails': <tensorflow.python.framework.sparse_tensor.SparseTensor at 0x7fb9109be650>, 'id': <tf.Tensor: shape=(), dtype=int64, numpy=123>, 'name': <tf.Tensor: shape=(), dtype=string, numpy=b'Alice'>}
parsed_example["emails"].values[0]
<tf.Tensor: shape=(), dtype=string, numpy=b'a@b.com'>
tf.sparse.to_dense(parsed_example["emails"], default_value=b"")
<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'a@b.com', b'c@d.com'], dtype=object)>
parsed_example["emails"].values
<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'a@b.com', b'c@d.com'], dtype=object)>
from sklearn.datasets import load_sample_images
img = load_sample_images()["images"][0]
plt.imshow(img)
plt.axis("off")
plt.title("Original Image")
plt.show()
data = tf.io.encode_jpeg(img)
example_with_image = Example(features=Features(feature={
"image": Feature(bytes_list=BytesList(value=[data.numpy()]))}))
serialized_example = example_with_image.SerializeToString()
# then save to TFRecord
feature_description = { "image": tf.io.VarLenFeature(tf.string) }
example_with_image = tf.io.parse_single_example(serialized_example, feature_description)
decoded_img = tf.io.decode_jpeg(example_with_image["image"].values[0])
Or you can use decode_image(), which supports BMP, GIF, JPEG, and PNG formats:
decoded_img = tf.io.decode_image(example_with_image["image"].values[0])
plt.imshow(decoded_img)
plt.title("Decoded Image")
plt.axis("off")
plt.show()
Tensors can easily be serialized and parsed using tf.io.serialize_tensor() and tf.io.parse_tensor():
t = tf.constant([[0., 1.], [2., 3.], [4., 5.]])
s = tf.io.serialize_tensor(t)
s
<tf.Tensor: shape=(), dtype=string, numpy=b'\x08\x01\x12\x08\x12\x02\x08\x03\x12\x02\x08\x02"\x18\x00\x00\x00\x00\x00\x00\x80?\x00\x00\x00@\x00\x00@@\x00\x00\x80@\x00\x00\xa0@'>
tf.io.parse_tensor(s, out_type=tf.float32)
<tf.Tensor: shape=(3, 2), dtype=float32, numpy= array([[0., 1.], [2., 3.], [4., 5.]], dtype=float32)>
serialized_sparse = tf.io.serialize_sparse(parsed_example["emails"])
serialized_sparse
<tf.Tensor: shape=(3,), dtype=string, numpy= array([b'\x08\t\x12\x08\x12\x02\x08\x02\x12\x02\x08\x01"\x10\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00', b'\x08\x07\x12\x04\x12\x02\x08\x02"\x10\x07\x07a@b.comc@d.com', b'\x08\t\x12\x04\x12\x02\x08\x01"\x08\x02\x00\x00\x00\x00\x00\x00\x00'], dtype=object)>
BytesList(value=serialized_sparse.numpy())
value: "\010\t\022\010\022\002\010\002\022\002\010\001\"\020\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000" value: "\010\007\022\004\022\002\010\002\"\020\007\007a@b.comc@d.com" value: "\010\t\022\004\022\002\010\001\"\010\002\000\000\000\000\000\000\000"
dataset = tf.data.TFRecordDataset(["my_contacts.tfrecord"]).batch(10)
for serialized_examples in dataset:
parsed_examples = tf.io.parse_example(serialized_examples,
feature_description)
parsed_examples
{'image': <tensorflow.python.framework.sparse_tensor.SparseTensor at 0x7fb90c85b2d0>}
Handling Sequential Data Using SequenceExample
syntax = "proto3";
message FeatureList { repeated Feature feature = 1; };
message FeatureLists { map<string, FeatureList> feature_list = 1; };
message SequenceExample {
Features context = 1;
FeatureLists feature_lists = 2;
};
Warning: in TensorFlow 2.0 and 2.1, a bug prevented from tensorflow.train import X, so we write X = tf.train.X instead. See https://github.com/tensorflow/tensorflow/issues/33289 for more details.
from tensorflow.train import FeatureList, FeatureLists, SequenceExample
context = Features(feature={
"author_id": Feature(int64_list=Int64List(value=[123])),
"title": Feature(bytes_list=BytesList(value=[b"A", b"desert", b"place", b"."])),
"pub_date": Feature(int64_list=Int64List(value=[1623, 12, 25]))
})
content = [["When", "shall", "we", "three", "meet", "again", "?"],
["In", "thunder", ",", "lightning", ",", "or", "in", "rain", "?"]]
comments = [["When", "the", "hurlyburly", "'s", "done", "."],
["When", "the", "battle", "'s", "lost", "and", "won", "."]]
def words_to_feature(words):
return Feature(bytes_list=BytesList(value=[word.encode("utf-8")
for word in words]))
content_features = [words_to_feature(sentence) for sentence in content]
comments_features = [words_to_feature(comment) for comment in comments]
sequence_example = SequenceExample(
context=context,
feature_lists=FeatureLists(feature_list={
"content": FeatureList(feature=content_features),
"comments": FeatureList(feature=comments_features)
}))
sequence_example
context { feature { key: "author_id" value { int64_list { value: 123 } } } feature { key: "pub_date" value { int64_list { value: 1623 value: 12 value: 25 } } } feature { key: "title" value { bytes_list { value: "A" value: "desert" value: "place" value: "." } } } } feature_lists { feature_list { key: "comments" value { feature { bytes_list { value: "When" value: "the" value: "hurlyburly" value: "\'s" value: "done" value: "." } } feature { bytes_list { value: "When" value: "the" value: "battle" value: "\'s" value: "lost" value: "and" value: "won" value: "." } } } } feature_list { key: "content" value { feature { bytes_list { value: "When" value: "shall" value: "we" value: "three" value: "meet" value: "again" value: "?" } } feature { bytes_list { value: "In" value: "thunder" value: "," value: "lightning" value: "," value: "or" value: "in" value: "rain" value: "?" } } } } }
serialized_sequence_example = sequence_example.SerializeToString()
context_feature_descriptions = {
"author_id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
"title": tf.io.VarLenFeature(tf.string),
"pub_date": tf.io.FixedLenFeature([3], tf.int64, default_value=[0, 0, 0]),
}
sequence_feature_descriptions = {
"content": tf.io.VarLenFeature(tf.string),
"comments": tf.io.VarLenFeature(tf.string),
}
parsed_context, parsed_feature_lists = tf.io.parse_single_sequence_example(
serialized_sequence_example, context_feature_descriptions,
sequence_feature_descriptions)
parsed_context
{'author_id': <tf.Tensor: shape=(), dtype=int64, numpy=123>, 'pub_date': <tf.Tensor: shape=(3,), dtype=int64, numpy=array([1623, 12, 25])>, 'title': <tensorflow.python.framework.sparse_tensor.SparseTensor at 0x7fb910db4b10>}
parsed_context["title"].values
<tf.Tensor: shape=(4,), dtype=string, numpy=array([b'A', b'desert', b'place', b'.'], dtype=object)>
parsed_feature_lists
{'comments': <tensorflow.python.framework.sparse_tensor.SparseTensor at 0x7fb910db4b50>, 'content': <tensorflow.python.framework.sparse_tensor.SparseTensor at 0x7fb910db48d0>}
print(tf.RaggedTensor.from_sparse(parsed_feature_lists["content"]))
<tf.RaggedTensor [[b'When', b'shall', b'we', b'three', b'meet', b'again', b'?'], [b'In', b'thunder', b',', b'lightning', b',', b'or', b'in', b'rain', b'?']]>
Let's use the variant of the California housing dataset that we used in Chapter 2, since it contains categorical features and missing values:
import os
import tarfile
import urllib.request
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/rickiepark/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
os.makedirs(housing_path, exist_ok=True)
tgz_path = os.path.join(housing_path, "housing.tgz")
urllib.request.urlretrieve(housing_url, tgz_path)
housing_tgz = tarfile.open(tgz_path)
housing_tgz.extractall(path=housing_path)
housing_tgz.close()
fetch_housing_data()
import pandas as pd
def load_housing_data(housing_path=HOUSING_PATH):
csv_path = os.path.join(housing_path, "housing.csv")
return pd.read_csv(csv_path)
housing = load_housing_data()
housing.head()
 | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
housing_median_age = tf.feature_column.numeric_column("housing_median_age")
age_mean, age_std = X_mean[1], X_std[1] # The housing median age is at column index 1
housing_median_age = tf.feature_column.numeric_column(
"housing_median_age", normalizer_fn=lambda x: (x - age_mean) / age_std)
median_income = tf.feature_column.numeric_column("median_income")
bucketized_income = tf.feature_column.bucketized_column(
median_income, boundaries=[1.5, 3., 4.5, 6.])
bucketized_income
BucketizedColumn(source_column=NumericColumn(key='median_income', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), boundaries=(1.5, 3.0, 4.5, 6.0))
ocean_prox_vocab = ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']
ocean_proximity = tf.feature_column.categorical_column_with_vocabulary_list(
"ocean_proximity", ocean_prox_vocab)
ocean_proximity
VocabularyListCategoricalColumn(key='ocean_proximity', vocabulary_list=('<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'), dtype=tf.string, default_value=-1, num_oov_buckets=0)
# Just an example, it's not used later on
city_hash = tf.feature_column.categorical_column_with_hash_bucket(
"city", hash_bucket_size=1000)
city_hash
HashedCategoricalColumn(key='city', hash_bucket_size=1000, dtype=tf.string)
bucketized_age = tf.feature_column.bucketized_column(
housing_median_age, boundaries=[-1., -0.5, 0., 0.5, 1.]) # age was scaled
age_and_ocean_proximity = tf.feature_column.crossed_column(
[bucketized_age, ocean_proximity], hash_bucket_size=100)
latitude = tf.feature_column.numeric_column("latitude")
longitude = tf.feature_column.numeric_column("longitude")
bucketized_latitude = tf.feature_column.bucketized_column(
latitude, boundaries=list(np.linspace(32., 42., 20 - 1)))
bucketized_longitude = tf.feature_column.bucketized_column(
longitude, boundaries=list(np.linspace(-125., -114., 20 - 1)))
location = tf.feature_column.crossed_column(
[bucketized_latitude, bucketized_longitude], hash_bucket_size=1000)
ocean_proximity_one_hot = tf.feature_column.indicator_column(ocean_proximity)
ocean_proximity_embed = tf.feature_column.embedding_column(ocean_proximity,
dimension=2)
Parsing with feature_column
median_house_value = tf.feature_column.numeric_column("median_house_value")
columns = [housing_median_age, median_house_value]
feature_descriptions = tf.feature_column.make_parse_example_spec(columns)
feature_descriptions
{'housing_median_age': FixedLenFeature(shape=(1,), dtype=tf.float32, default_value=None), 'median_house_value': FixedLenFeature(shape=(1,), dtype=tf.float32, default_value=None)}
with tf.io.TFRecordWriter("my_data_with_features.tfrecords") as f:
for x, y in zip(X_train[:, 1:2], y_train):
example = Example(features=Features(feature={
"housing_median_age": Feature(float_list=FloatList(value=[x])),
"median_house_value": Feature(float_list=FloatList(value=[y]))
}))
f.write(example.SerializeToString())
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)
def parse_examples(serialized_examples):
examples = tf.io.parse_example(serialized_examples, feature_descriptions)
targets = examples.pop("median_house_value") # separate the targets
return examples, targets
batch_size = 32
dataset = tf.data.TFRecordDataset(["my_data_with_features.tfrecords"])
dataset = dataset.repeat().shuffle(10000).batch(batch_size).map(parse_examples)
Warning: the DenseFeatures layer currently does not work with the Functional API, see TF issue #27416. Hopefully this will be resolved before the final release of TF 2.0.
columns_without_target = columns[:-1]
model = keras.models.Sequential([
keras.layers.DenseFeatures(feature_columns=columns_without_target),
keras.layers.Dense(1)
])
model.compile(loss="mse",
optimizer=keras.optimizers.SGD(learning_rate=1e-3),
metrics=["accuracy"])
model.fit(dataset, steps_per_epoch=len(X_train) // batch_size, epochs=5)
Epoch 1/5 WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor, but we receive a <class 'dict'> input: {'housing_median_age': <tf.Tensor 'IteratorGetNext:0' shape=(None, 1) dtype=float32>} Consider rewriting this model with the Functional API.
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor, but we receive a <class 'dict'> input: {'housing_median_age': <tf.Tensor 'IteratorGetNext:0' shape=(None, 1) dtype=float32>} Consider rewriting this model with the Functional API. 362/362 [==============================] - 1s 2ms/step - loss: 3.7619 - accuracy: 0.0016 Epoch 2/5 362/362 [==============================] - 0s 1ms/step - loss: 1.9311 - accuracy: 0.0027 Epoch 3/5 362/362 [==============================] - 0s 1ms/step - loss: 1.4434 - accuracy: 0.0026 Epoch 4/5 362/362 [==============================] - 0s 1ms/step - loss: 1.3579 - accuracy: 0.0030 Epoch 5/5 362/362 [==============================] - 0s 1ms/step - loss: 1.3473 - accuracy: 0.0038
<tensorflow.python.keras.callbacks.History at 0x7fb90c8900d0>
some_columns = [ocean_proximity_embed, bucketized_income]
dense_features = keras.layers.DenseFeatures(some_columns)
dense_features({
"ocean_proximity": [["NEAR OCEAN"], ["INLAND"], ["INLAND"]],
"median_income": [[3.], [7.2], [1.]]
})
<tf.Tensor: shape=(3, 7), dtype=float32, numpy= array([[ 0. , 0. , 1. , 0. , 0. , -0.14504611, 0.7563394 ], [ 0. , 0. , 0. , 0. , 1. , -1.1119912 , 0.56957847], [ 1. , 0. , 0. , 0. , 0. , -1.1119912 , 0.56957847]], dtype=float32)>
try:
import tensorflow_transform as tft
def preprocess(inputs): # inputs is a batch of input features
median_age = inputs["housing_median_age"]
ocean_proximity = inputs["ocean_proximity"]
standardized_age = tft.scale_to_z_score(median_age - tft.mean(median_age))
ocean_proximity_id = tft.compute_and_apply_vocabulary(ocean_proximity)
return {
"standardized_median_age": standardized_age,
"ocean_proximity_id": ocean_proximity_id
}
except ImportError:
print("TF Transform is not installed. Try running: pip3 install -U tensorflow-transform")
import tensorflow_datasets as tfds
datasets = tfds.load(name="mnist")
mnist_train, mnist_test = datasets["train"], datasets["test"]
print(tfds.list_builders())
['abstract_reasoning', 'accentdb', 'aeslc', 'aflw2k3d', 'ag_news_subset', 'ai2_arc', 'ai2_arc_with_ir', 'amazon_us_reviews', 'anli', 'arc', 'bair_robot_pushing_small', 'bccd', 'beans', 'big_patent', 'bigearthnet', 'billsum', 'binarized_mnist', 'binary_alpha_digits', 'blimp', 'bool_q', 'c4', 'caltech101', 'caltech_birds2010', 'caltech_birds2011', 'cars196', 'cassava', 'cats_vs_dogs', 'celeb_a', 'celeb_a_hq', 'cfq', 'chexpert', 'cifar10', 'cifar100', 'cifar10_1', 'cifar10_corrupted', 'citrus_leaves', 'cityscapes', 'civil_comments', 'clevr', 'clic', 'clinc_oos', 'cmaterdb', 'cnn_dailymail', 'coco', 'coco_captions', 'coil100', 'colorectal_histology', 'colorectal_histology_large', 'common_voice', 'coqa', 'cos_e', 'cosmos_qa', 'covid19sum', 'crema_d', 'curated_breast_imaging_ddsm', 'cycle_gan', 'deep_weeds', 'definite_pronoun_resolution', 'dementiabank', 'diabetic_retinopathy_detection', 'div2k', 'dmlab', 'downsampled_imagenet', 'dsprites', 'dtd', 'duke_ultrasound', 'emnist', 'eraser_multi_rc', 'esnli', 'eurosat', 'fashion_mnist', 'flic', 'flores', 'food101', 'forest_fires', 'fuss', 'gap', 'geirhos_conflict_stimuli', 'genomics_ood', 'german_credit_numeric', 'gigaword', 'glue', 'goemotions', 'gpt3', 'groove', 'gtzan', 'gtzan_music_speech', 'hellaswag', 'higgs', 'horses_or_humans', 'i_naturalist2017', 'imagenet2012', 'imagenet2012_corrupted', 'imagenet2012_real', 'imagenet2012_subset', 'imagenet_a', 'imagenet_r', 'imagenet_resized', 'imagenet_v2', 'imagenette', 'imagewang', 'imdb_reviews', 'irc_disentanglement', 'iris', 'kitti', 'kmnist', 'lfw', 'librispeech', 'librispeech_lm', 'libritts', 'ljspeech', 'lm1b', 'lost_and_found', 'lsun', 'malaria', 'math_dataset', 'mctaco', 'mnist', 'mnist_corrupted', 'movie_lens', 'movie_rationales', 'movielens', 'moving_mnist', 'multi_news', 'multi_nli', 'multi_nli_mismatch', 'natural_questions', 'natural_questions_open', 'newsroom', 'nsynth', 'nyu_depth_v2', 'omniglot', 'open_images_challenge2019_detection', 'open_images_v4', 'openbookqa', 'opinion_abstracts', 'opinosis', 'opus', 'oxford_flowers102', 'oxford_iiit_pet', 'para_crawl', 'patch_camelyon', 'paws_wiki', 'paws_x_wiki', 'pet_finder', 'pg19', 'places365_small', 'plant_leaves', 'plant_village', 'plantae_k', 'qa4mre', 'qasc', 'quickdraw_bitmap', 'radon', 'reddit', 'reddit_disentanglement', 'reddit_tifu', 'resisc45', 'robonet', 'rock_paper_scissors', 'rock_you', 'salient_span_wikipedia', 'samsum', 'savee', 'scan', 'scene_parse150', 'scicite', 'scientific_papers', 'sentiment140', 'shapes3d', 'smallnorb', 'snli', 'so2sat', 'speech_commands', 'spoken_digit', 'squad', 'stanford_dogs', 'stanford_online_products', 'starcraft_video', 'stl10', 'sun397', 'super_glue', 'svhn_cropped', 'ted_hrlr_translate', 'ted_multi_translate', 'tedlium', 'tf_flowers', 'the300w_lp', 'tiny_shakespeare', 'titanic', 'trec', 'trivia_qa', 'tydi_qa', 'uc_merced', 'ucf101', 'vctk', 'vgg_face2', 'visual_domain_decathlon', 'voc', 'voxceleb', 'voxforge', 'waymo_open_dataset', 'web_questions', 'wider_face', 'wiki40b', 'wikihow', 'wikipedia', 'wikipedia_toxicity_subtypes', 'wine_quality', 'winogrande', 'wmt14_translate', 'wmt15_translate', 'wmt16_translate', 'wmt17_translate', 'wmt18_translate', 'wmt19_translate', 'wmt_t2t_translate', 'wmt_translate', 'wordnet', 'xnli', 'xquad', 'xsum', 'yelp_polarity_reviews', 'yes_no']
plt.figure(figsize=(6,3))
mnist_train = mnist_train.repeat(5).batch(32).prefetch(1)
for item in mnist_train:
images = item["image"]
labels = item["label"]
for index in range(5):
plt.subplot(1, 5, index + 1)
image = images[index, ..., 0]
label = labels[index].numpy()
plt.imshow(image, cmap="binary")
plt.title(label)
plt.axis("off")
break # just showing part of the first batch
datasets = tfds.load(name="mnist")
mnist_train, mnist_test = datasets["train"], datasets["test"]
mnist_train = mnist_train.repeat(5).batch(32)
mnist_train = mnist_train.map(lambda items: (items["image"], items["label"]))
mnist_train = mnist_train.prefetch(1)
for images, labels in mnist_train.take(1):
print(images.shape)
print(labels.numpy())
(32, 28, 28, 1) [4 1 0 7 8 1 2 7 1 6 6 4 7 7 3 3 7 9 9 1 0 6 6 9 9 4 8 9 4 7 3 3]
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)
datasets = tfds.load(name="mnist", batch_size=32, as_supervised=True)
mnist_train = datasets["train"].repeat().prefetch(1)
model = keras.models.Sequential([
keras.layers.Flatten(input_shape=[28, 28, 1]),
keras.layers.Lambda(lambda images: tf.cast(images, tf.float32)),
keras.layers.Dense(10, activation="softmax")])
model.compile(loss="sparse_categorical_crossentropy",
optimizer=keras.optimizers.SGD(learning_rate=1e-3),
metrics=["accuracy"])
model.fit(mnist_train, steps_per_epoch=60000 // 32, epochs=5)
Epoch 1/5
1875/1875 [==============================] - 8s 4ms/step - loss: 32.3357 - accuracy: 0.8419 Epoch 2/5 1875/1875 [==============================] - 3s 1ms/step - loss: 25.9449 - accuracy: 0.8681 Epoch 3/5 1875/1875 [==============================] - 3s 2ms/step - loss: 24.5985 - accuracy: 0.8736 Epoch 4/5 1875/1875 [==============================] - 3s 1ms/step - loss: 24.5770 - accuracy: 0.8748 Epoch 5/5 1875/1875 [==============================] - 3s 1ms/step - loss: 24.1375 - accuracy: 0.8776
<tensorflow.python.keras.callbacks.History at 0x7fb9112b1f50>
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)
import tensorflow_hub as hub
hub_layer = hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim50/2",
output_shape=[50], input_shape=[], dtype=tf.string)
model = keras.Sequential()
model.add(hub_layer)
model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.summary()
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= keras_layer (KerasLayer) (None, 50) 48190600 _________________________________________________________________ dense (Dense) (None, 16) 816 _________________________________________________________________ dense_1 (Dense) (None, 1) 17 ================================================================= Total params: 48,191,433 Trainable params: 833 Non-trainable params: 48,190,600 _________________________________________________________________
sentences = tf.constant(["It was a great movie", "The actors were amazing"])
embeddings = hub_layer(sentences)
embeddings
<tf.Tensor: shape=(2, 50), dtype=float32, numpy= array([[ 7.45939985e-02, 2.76720114e-02, 9.38646123e-02, 1.25124469e-01, 5.40293928e-04, -1.09435350e-01, 1.34755149e-01, -9.57818255e-02, -1.85177118e-01, -1.69703495e-02, 1.75612606e-02, -9.06603858e-02, 1.12110220e-01, 1.04646273e-01, 3.87700424e-02, -7.71859884e-02, -3.12189370e-01, 6.99466765e-02, -4.88970093e-02, -2.99049795e-01, 1.31183028e-01, -2.12630898e-01, 6.96169436e-02, 1.63592950e-01, 1.05169769e-02, 7.79720694e-02, -2.55230188e-01, -1.80790052e-01, 2.93739915e-01, 1.62875261e-02, -2.80566931e-01, 1.60284728e-01, 9.87277832e-03, 8.44555616e-04, 8.39456245e-02, 3.24002892e-01, 1.53253034e-01, -3.01048346e-02, 8.94618109e-02, -2.39153411e-02, -1.50188789e-01, -1.81733668e-02, -1.20483577e-01, 1.32937476e-01, -3.35325629e-01, -1.46504581e-01, -1.25251599e-02, -1.64428815e-01, -7.00765476e-02, 3.60923223e-02], [-1.56998575e-01, 4.24599349e-02, -5.57703003e-02, -8.08446854e-03, 1.23733155e-01, 3.89427543e-02, -4.37901802e-02, -1.86987907e-01, -2.29341656e-01, -1.27766818e-01, 3.83025259e-02, -1.07057482e-01, -6.11584112e-02, 2.49654502e-01, -1.39712945e-01, -3.91289443e-02, -1.35873526e-01, -3.58613044e-01, 2.53462754e-02, -1.58370987e-01, -1.38350084e-01, -3.90771806e-01, -6.63642734e-02, -3.24838236e-02, -2.20453963e-02, -1.68282315e-01, -7.40613639e-02, -2.49074101e-02, 2.46460736e-01, 9.87201929e-05, -1.85390845e-01, -4.92824614e-02, 1.09015472e-01, -9.54203904e-02, -1.60352528e-01, -2.59811729e-02, 1.13778859e-01, -2.09578887e-01, 2.18261331e-01, -3.11211571e-02, -6.12562597e-02, -8.66057724e-02, -1.10762455e-01, -5.73977083e-03, -1.08923554e-01, -1.72919363e-01, 1.00515485e-01, -5.64153939e-02, -4.97694984e-02, -1.07776590e-01]], dtype=float32)>
For the solutions to exercises 1 to 9, see Appendix A.
Exercise: Load the Fashion MNIST dataset (introduced in Chapter 10); split it into a training set, a validation set, and a test set; shuffle the training set; and save each dataset to multiple TFRecord files. Each record should be a serialized Example protobuf with two features: the serialized image (use tf.io.serialize_tensor() to serialize each image), and the label. Note: for large images, you could use tf.io.encode_jpeg() instead; this would save a lot of space, but it would lose a bit of image quality.
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)
train_set = tf.data.Dataset.from_tensor_slices((X_train, y_train)).shuffle(len(X_train))
valid_set = tf.data.Dataset.from_tensor_slices((X_valid, y_valid))
test_set = tf.data.Dataset.from_tensor_slices((X_test, y_test))
def create_example(image, label):
image_data = tf.io.serialize_tensor(image)
#image_data = tf.io.encode_jpeg(image[..., np.newaxis])
return Example(
features=Features(
feature={
"image": Feature(bytes_list=BytesList(value=[image_data.numpy()])),
"label": Feature(int64_list=Int64List(value=[label])),
}))
for image, label in valid_set.take(1):
print(create_example(image, label))
features { feature { key: "image" value { bytes_list { value: "\010\004\022\010\022\002\010\034\022\002\010\034\"\220\006\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\001\000\000\rI\000\000\001\004\000\000\000\000\001\001\000\000\000\000\000\000\000\000\000\000\000\000\000\003\000$\210\177>6\000\000\000\001\003\004\000\000\003\000\000\000\000\000\000\000\000\000\000\000\000\006\000f\314\260\206\220{\027\000\000\000\000\014\n\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\233\354\317\262k\234\241m@\027M\202H\017\000\000\000\000\000\000\000\000\000\000\000\001\000E\317\337\332\330\330\243\177yz\222\215X\254B\000\000\000\000\000\000\000\000\000\001\001\001\000\310\350\350\351\345\337\337\327\325\244\177{\304\345\000\000\000\000\000\000\000\000\000\000\000\000\000\000\267\341\330\337\344\353\343\340\336\340\335\337\365\255\000\000\000\000\000\000\000\000\000\000\000\000\000\000\301\344\332\325\306\264\324\322\323\325\337\334\363\312\000\000\000\000\000\000\000\000\000\000\001\003\000\014\333\334\324\332\300\251\343\320\332\340\324\342\305\3214\000\000\000\000\000\000\000\000\000\000\006\000c\364\336\334\332\313\306\335\327\325\336\334\365w\2478\000\000\000\000\000\000\000\000\000\004\000\0007\354\344\346\344\360\350\325\332\337\352\331\331\321\\\000\000\000\001\004\006\007\002\000\000\000\000\000\355\342\331\337\336\333\336\335\330\337\345\327\332\377M\000\000\003\000\000\000\000\000\000\000>\221\314\344\317\325\335\332\320\323\332\340\337\333\327\340\364\237\000\000\000\000\000\022,Rk\275\344\334\336\331\342\310\315\323\346\340\352\260\274\372\370\351\356\327\000\0009\273\320\340\335\340\320\314\326\320\321\310\237\365\301\316\337\377\377\335\352\335\323\334\350\366\000\003\312\344\340\335\323\323\326\315\315\315\334\360P\226\377\345\335\274\232\277\322\314\321\336\344\341\000b\351\306\322\336\345\345\352\371\334\302\327\331\361AIju\250\333\335\327\331\337\337\340\345\035K\314\324\314\301\315\323\341\330\271\305\316\306\325\360\303\343\365\357\337\332\324\321\336\334\335\346C0\313\267\302\325\305\271\276\302\300\312\326\333\335\334\354\341\330\307\316\272\265\261\254\265\315\316s\000z\333\301\263\253\267\304\314\322\325\317\323\322\310\304\302\277\303\277\306\300\260\234\247\261\322\\\000\000J\275\324\277\257\254\257\265\271\274\275\274\301\306\314\321\322\322\323\274\274\302\300\330\252\000\002\000\000\000B\310\336\355\357\362\366\363\364\335\334\301\277\263\266\266\265\260\246\250c:\000\000\000\000\000\000\000\000\000(=,H)#\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000" } } } feature { key: "label" value { int64_list { value: 9 } } } }
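As mentioned in the exercise note (and in the commented-out line above), you could use tf.io.encode_jpeg() instead of tf.io.serialize_tensor() to save space at the cost of some image quality. Here is a minimal sketch of that variant, assuming 28×28 uint8 grayscale images; it is not used in the rest of this solution:
def create_example_jpeg(image, label):
    # JPEG encoding needs a channels dimension: (28, 28) -> (28, 28, 1)
    image_data = tf.io.encode_jpeg(image[..., np.newaxis])
    return Example(
        features=Features(
            feature={
                "image": Feature(bytes_list=BytesList(value=[image_data.numpy()])),
                "label": Feature(int64_list=Int64List(value=[label])),
            }))
# When parsing such records, use tf.io.decode_jpeg() instead of tf.io.parse_tensor().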
다음 함수는 주어진 데이터셋을 일련의 TFRecord 파일로 저장합니다. 이 예제는 라운드-로빈 방식으로 파일에 저장합니다. 이를 위해 dataset.enumerate() 메서드로 모든 샘플을 순회하고 저장할 파일을 결정하기 위해 index % n_shards를 계산합니다. 표준 contextlib.ExitStack 클래스를 사용해 쓰는 동안 I/O 에러의 발생 여부에 상관없이 모든 writer가 적절히 닫히도록 보장합니다.
from contextlib import ExitStack

def write_tfrecords(name, dataset, n_shards=10):
    paths = ["{}.tfrecord-{:05d}-of-{:05d}".format(name, index, n_shards)
             for index in range(n_shards)]
    with ExitStack() as stack:
        writers = [stack.enter_context(tf.io.TFRecordWriter(path))
                   for path in paths]
        for index, (image, label) in dataset.enumerate():
            shard = index % n_shards
            example = create_example(image, label)
            writers[shard].write(example.SerializeToString())
    return paths
train_filepaths = write_tfrecords("my_fashion_mnist.train", train_set)
valid_filepaths = write_tfrecords("my_fashion_mnist.valid", valid_set)
test_filepaths = write_tfrecords("my_fashion_mnist.test", test_set)
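참고로 저장된 샤드를 다시 읽을 때 파일 수준에서도 셔플링하고 싶다면 tf.data.Dataset.list_files()와 interleave()를 조합할 수 있습니다. 아래는 이런 방식을 보여주는 간단한 스케치입니다(파일 패턴은 위 write_tfrecords()가 만든 이름을 가정한 것입니다):
# 샤드 파일 목록을 셔플링한 다음 여러 파일에서 번갈아 레코드를 읽습니다
filepath_dataset = tf.data.Dataset.list_files(
    "my_fashion_mnist.train.tfrecord*", seed=42)
shuffled_records = filepath_dataset.interleave(
    lambda filepath: tf.data.TFRecordDataset(filepath),
    cycle_length=5, num_parallel_calls=tf.data.experimental.AUTOTUNE)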
문제: tf.data로 각 세트를 위한 효율적인 데이터셋을 만듭니다. 마지막으로 이 데이터셋으로 입력 특성을 표준화하는 전처리 층을 포함한 케라스 모델을 훈련합니다. 텐서보드로 프로파일 데이터를 시각화하여 가능한 한 입력 파이프라인을 효율적으로 만들어보세요.
def preprocess(tfrecord):
    feature_descriptions = {
        "image": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "label": tf.io.FixedLenFeature([], tf.int64, default_value=-1)
    }
    example = tf.io.parse_single_example(tfrecord, feature_descriptions)
    image = tf.io.parse_tensor(example["image"], out_type=tf.uint8)
    # image = tf.io.decode_jpeg(example["image"])
    image = tf.reshape(image, shape=[28, 28])
    return image, example["label"]
def mnist_dataset(filepaths, n_read_threads=5, shuffle_buffer_size=None,
                  n_parse_threads=5, batch_size=32, cache=True):
    dataset = tf.data.TFRecordDataset(filepaths,
                                      num_parallel_reads=n_read_threads)
    if cache:
        dataset = dataset.cache()
    if shuffle_buffer_size:
        dataset = dataset.shuffle(shuffle_buffer_size)
    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(1)
train_set = mnist_dataset(train_filepaths, shuffle_buffer_size=60000)
valid_set = mnist_dataset(valid_filepaths)
test_set = mnist_dataset(test_filepaths)
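파이프라인 속도를 간단히 가늠해 보고 싶다면 한 에포크를 순회하는 시간을 직접 재볼 수 있습니다(표준 time 모듈만 사용하는 간단한 스케치입니다):
import time

start = time.perf_counter()
for X, y in train_set:
    pass  # 훈련 없이 데이터 적재 속도만 측정합니다
print("1 에포크 순회: {:.1f}초".format(time.perf_counter() - start))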
for X, y in train_set.take(1):
    for i in range(5):
        plt.subplot(1, 5, i + 1)
        plt.imshow(X[i].numpy(), cmap="binary")
        plt.axis("off")
        plt.title(str(y[i].numpy()))
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)
class Standardization(keras.layers.Layer):
    def adapt(self, data_sample):
        self.means_ = np.mean(data_sample, axis=0, keepdims=True)
        self.stds_ = np.std(data_sample, axis=0, keepdims=True)
    def call(self, inputs):
        return (inputs - self.means_) / (self.stds_ + keras.backend.epsilon())
standardization = Standardization(input_shape=[28, 28])
# or perhaps soon:
#standardization = keras.layers.Normalization()
sample_image_batches = train_set.take(100).map(lambda image, label: image)
sample_images = np.concatenate(list(sample_image_batches.as_numpy_iterator()),
                               axis=0).astype(np.float32)
standardization.adapt(sample_images)
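참고로 사용하는 케라스 버전에 내장 Normalization 층이 있다면(텐서플로 2.6 이상에서는 keras.layers.Normalization, 그 이전 일부 버전에서는 keras.layers.experimental.preprocessing.Normalization이라는 이름을 가정합니다) 위의 사용자 정의 층 대신 쓸 수 있습니다. 다음은 간단한 스케치입니다:
# 가정: 내장 Normalization 층을 제공하는 케라스 버전
norm_layer = keras.layers.Normalization(axis=None, input_shape=[28, 28])
norm_layer.adapt(sample_images)  # 샘플 데이터에서 평균과 분산을 계산합니다
# 이후 모델의 첫 번째 층으로 standardization 대신 norm_layer를 넣으면 됩니다.
# (axis=None은 전체 픽셀에 대한 스칼라 평균/분산을 사용하므로, 픽셀별 통계를
#  사용하는 위의 사용자 정의 층과 완전히 동일하지는 않습니다.)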
model = keras.models.Sequential([
    standardization,
    keras.layers.Flatten(),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="nadam", metrics=["accuracy"])
from datetime import datetime
logs = os.path.join(os.curdir, "my_logs",
                    "run_" + datetime.now().strftime("%Y%m%d_%H%M%S"))
tensorboard_cb = tf.keras.callbacks.TensorBoard(
    log_dir=logs, histogram_freq=1, profile_batch=10)
model.fit(train_set, epochs=5, validation_data=valid_set,
          callbacks=[tensorboard_cb])
Epoch 1/5
1719/1719 [==============================] - 10s 6ms/step - loss: 570.1453 - accuracy: 0.8415 - val_loss: 149.1351 - val_accuracy: 0.8670
Epoch 2/5
1719/1719 [==============================] - 9s 5ms/step - loss: 639.2134 - accuracy: 0.8783 - val_loss: 282.9144 - val_accuracy: 0.8746
Epoch 3/5
1719/1719 [==============================] - 9s 5ms/step - loss: 98.1983 - accuracy: 0.8907 - val_loss: 0.3328 - val_accuracy: 0.8808
Epoch 4/5
1719/1719 [==============================] - 8s 5ms/step - loss: 437.1278 - accuracy: 0.9007 - val_loss: 151.8366 - val_accuracy: 0.8798
Epoch 5/5
1719/1719 [==============================] - 9s 5ms/step - loss: 198.3806 - accuracy: 0.9077 - val_loss: 87.7543 - val_accuracy: 0.8816
<tensorflow.python.keras.callbacks.History at 0x7fb9119e7850>
경고: 텐서보드의 프로파일링 탭은 텐서플로 2.2 이상에서 동작합니다. 또한 tensorboard_plugin_profile이 설치되어 있어야 합니다(필요하면 주피터를 재시작하세요).
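플러그인이 없다면 다음처럼 설치할 수 있습니다(pip 패키지 이름은 tensorboard-plugin-profile로 가정합니다):
# 주피터/코랩 셀에서 프로파일 플러그인 설치
%pip install -q -U tensorboard-plugin-profile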
%load_ext tensorboard
%tensorboard --logdir=./my_logs --port=6006
Reusing TensorBoard on port 6006 (pid 702), started 1:15:59 ago. (Use '!kill 702' to kill it.)
문제: 이 연습문제에서 데이터셋을 다운로드 및 분할하고 tf.data.Dataset 객체를 만들어 데이터를 적재하고 효율적으로 전처리하겠습니다. 그다음 Embedding 층을 포함한 이진 분류 모델을 만들고 훈련시킵니다.
문제: 인터넷 영화 데이터베이스의 영화 리뷰 50,000개를 담은 영화 리뷰 데이터셋을 다운로드합니다. 이 데이터는 train과 test라는 두 개의 디렉터리로 구성되어 있습니다. 각 디렉터리에는 12,500개의 긍정 리뷰를 담은 pos 서브디렉터리와 12,500개의 부정 리뷰를 담은 neg 서브디렉터리가 있습니다. 리뷰는 각각 별도의 텍스트 파일에 저장되어 있습니다. (전처리된 BOW를 포함해) 다른 파일과 디렉터리가 있지만 이 연습문제에서는 무시합니다.
from pathlib import Path
DOWNLOAD_ROOT = "http://ai.stanford.edu/~amaas/data/sentiment/"
FILENAME = "aclImdb_v1.tar.gz"
filepath = keras.utils.get_file(FILENAME, DOWNLOAD_ROOT + FILENAME, extract=True)
path = Path(filepath).parent / "aclImdb"
path
PosixPath('/root/.keras/datasets/aclImdb')
for name, subdirs, files in os.walk(path):
    indent = len(Path(name).parts) - len(path.parts)
    print("    " * indent + Path(name).parts[-1] + os.sep)
    for index, filename in enumerate(sorted(files)):
        if index == 3:
            print("    " * (indent + 1) + "...")
            break
        print("    " * (indent + 1) + filename)
aclImdb/
    README
    imdb.vocab
    imdbEr.txt
    test/
        labeledBow.feat
        urls_neg.txt
        urls_pos.txt
        neg/
            0_2.txt
            10000_4.txt
            10001_1.txt
            ...
        pos/
            0_10.txt
            10000_7.txt
            10001_9.txt
            ...
    train/
        labeledBow.feat
        unsupBow.feat
        urls_neg.txt
        ...
        neg/
            0_3.txt
            10000_4.txt
            10001_4.txt
            ...
        pos/
            0_9.txt
            10000_8.txt
            10001_10.txt
            ...
        unsup/
            0_0.txt
            10000_0.txt
            10001_0.txt
            ...
def review_paths(dirpath):
    return [str(path) for path in dirpath.glob("*.txt")]
train_pos = review_paths(path / "train" / "pos")
train_neg = review_paths(path / "train" / "neg")
test_valid_pos = review_paths(path / "test" / "pos")
test_valid_neg = review_paths(path / "test" / "neg")
len(train_pos), len(train_neg), len(test_valid_pos), len(test_valid_neg)
(12500, 12500, 12500, 12500)
문제: 테스트 세트를 검증 세트(15,000개)와 테스트 세트(10,000개)로 나눕니다.
np.random.shuffle(test_valid_pos)
np.random.shuffle(test_valid_neg)  # 부정 리뷰 파일 목록도 함께 섞어 분할이 치우치지 않게 합니다

test_pos = test_valid_pos[:5000]
test_neg = test_valid_neg[:5000]
valid_pos = test_valid_pos[5000:]
valid_neg = test_valid_neg[5000:]
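분할 크기가 의도대로인지 간단히 확인해볼 수 있습니다(검증 세트 15,000개, 테스트 세트 10,000개가 나와야 합니다):
# 분할 크기 확인: 7,500 + 7,500 = 15,000(검증), 5,000 + 5,000 = 10,000(테스트)
print(len(valid_pos) + len(valid_neg), len(test_pos) + len(test_neg))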
문제: tf.data를 사용해 각 세트에 대한 효율적인 데이터셋을 만듭니다.
이 데이터셋을 메모리에 적재할 수 있으므로 파이썬 코드와 tf.data.Dataset.from_tensor_slices()를 사용해 모든 데이터를 적재합니다:
def imdb_dataset(filepaths_positive, filepaths_negative):
    reviews = []
    labels = []
    for filepaths, label in ((filepaths_negative, 0), (filepaths_positive, 1)):
        for filepath in filepaths:
            with open(filepath) as review_file:
                reviews.append(review_file.read())
            labels.append(label)
    return tf.data.Dataset.from_tensor_slices(
        (tf.constant(reviews), tf.constant(labels)))
for X, y in imdb_dataset(train_pos, train_neg).take(3):
    print(X)
    print(y)
    print()
tf.Tensor(b'Positively awful George Sanders vehicle where he goes from being a thief to police czar.<br /><br />While Sanders was an excellent character actor, he was certainly no leading man and this film proves it.<br /><br />It is absolutely beyond stupidity. Gene Lockhart did provide some comic relief until a moment of anger led him to fire his gun with tragedy resulting.<br /><br />Sadly, George Sanders and co-star Carol Landis committed suicide in real life. After making a film as deplorable as this, it is not shocking.<br /><br />The usual appealing Signe Hasso is really nothing here.', shape=(), dtype=string)
tf.Tensor(0, shape=(), dtype=int32)

tf.Tensor(b"Ooof! This one was a stinker. It does not fall 'somewhere in between Star Wars and Thriller', thats for sure. In all actuality, it falls somewhere between the cracks of a Wham! video and Captain EO, only with not as big of a budget, and a lot more close ups of ugly teenagers crying. Simon Le Bon preens front and center, while the rest of the band gamely tries to hide the fact that they stole their whole career from Roxy Music's last 3 albums. Brief clips from Barbarella add nothing. Avoid at all costs. (However, I liked the part when they played 'Hungry Like The Wolf' but why was there a tiger lurking in the audience changing into a woman painted with tiger stripes? I mean, they aren't singing 'Eye of the Tiger' or 'Hungry like the Tiger' it's a Wolf! Whatever.) A DVD of Duran Duran's '80s videos is probably worth a look for nostalgia's sake", shape=(), dtype=string)
tf.Tensor(0, shape=(), dtype=int32)

tf.Tensor(b"It's rare that I feel a need to write a review on this site, but this film is very deserving because of how poorly it was created, and how bias its product was.<br /><br />I felt a distinct attempt on the part of the film-makers to display the Palestinian family as boorish and untrustworthy. We hear them discuss the sadness that they feel from oppression, yet the film is shot and arranged in a way that we feel the politically oppressed population is the Jewish Israeli population. We see no evidence that parallels the position of the Palestinian teenager. We only hear from other Palestinians in prison. I understand restrictions are in place, but the political nature of the restrictions are designed to prevent peace.<br /><br />I came out of the film feeling that the mother of the victim was selfish in her mourning and completely closed minded due to her side of the fence, so to speak. She continued to be unwilling to see the hurt of the bomber's parents, and her angry and closed-minded words caused the final meeting to spiral out of control. It is more realistic, in my mind, to see the Israeli mindset to be a root of the problem; ignored pleas for understanding and freedom, ignored requests for acknowledgment for the process by which the Jewish population acquired the land.<br /><br />I have given this a two because of these selfish weaknesses of the mother, which normally would be admirable in a documentary, however in the light of the lack of impartiality, it all seems exploitative. Also for the poor edits, lack of background in the actual instance, and finally the lack of proper representation of the Palestinian side. Ultimately, it is a poor documentary and a poor film. I acknowledge this is partially the result of the political situation, but am obliged to note the flaws in direction regardless of the heart-wrenching and sad subject matter.", shape=(), dtype=string)
tf.Tensor(0, shape=(), dtype=int32)
%timeit -r1 for X, y in imdb_dataset(train_pos, train_neg).repeat(10): pass
1 loop, best of 1: 23.8 s per loop
이 데이터셋을 적재하고 10회 반복하는 데 약 24초가 걸립니다.
하지만 이 데이터셋이 메모리에 맞지 않는다고 가정하고 좀 더 재미있는 것을 만들어 보죠. 다행히 각 리뷰는 한 줄로 되어 있기 때문에(줄바꿈 대신 <br />를 사용합니다) TextLineDataset을 사용해 리뷰를 읽을 수 있습니다. 그렇지 않으면 입력 파일을 전처리해야 합니다(예를 들어, TFRecord로 바꿉니다). 매우 큰 데이터셋의 경우 아파치 빔(Apache Beam) 같은 도구를 사용하는 것이 합리적입니다.
def imdb_dataset(filepaths_positive, filepaths_negative, n_read_threads=5):
    dataset_neg = tf.data.TextLineDataset(filepaths_negative,
                                          num_parallel_reads=n_read_threads)
    dataset_neg = dataset_neg.map(lambda review: (review, 0))
    dataset_pos = tf.data.TextLineDataset(filepaths_positive,
                                          num_parallel_reads=n_read_threads)
    dataset_pos = dataset_pos.map(lambda review: (review, 1))
    return tf.data.Dataset.concatenate(dataset_pos, dataset_neg)
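concatenate()는 모든 긍정 리뷰 뒤에 부정 리뷰를 이어 붙이므로 훈련 전에 큰 버퍼로 셔플링해야 합니다. 두 클래스를 샘플링 단계에서부터 섞고 싶다면 tf.data.experimental.sample_from_datasets를 사용하는 다음과 같은 변형도 생각해볼 수 있습니다(balanced_imdb_dataset이라는 이름은 예시일 뿐이며, 이 연습문제의 해법은 위의 shuffle() 방식을 사용합니다):
# 가정: 두 레이블링된 데이터셋에서 50:50 확률로 샘플을 뽑는 스케치
def balanced_imdb_dataset(filepaths_positive, filepaths_negative, n_read_threads=5):
    dataset_neg = tf.data.TextLineDataset(filepaths_negative,
                                          num_parallel_reads=n_read_threads)
    dataset_neg = dataset_neg.map(lambda review: (review, 0))
    dataset_pos = tf.data.TextLineDataset(filepaths_positive,
                                          num_parallel_reads=n_read_threads)
    dataset_pos = dataset_pos.map(lambda review: (review, 1))
    # 두 데이터셋에서 번갈아 샘플을 뽑아 레이블이 자연스럽게 섞이게 합니다
    return tf.data.experimental.sample_from_datasets(
        [dataset_pos, dataset_neg], weights=[0.5, 0.5], seed=42)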
%timeit -r1 for X, y in imdb_dataset(train_pos, train_neg).repeat(10): pass
1 loop, best of 1: 2min 22s per loop
이 데이터셋을 10회 반복하는 데 2분 22초 정도 걸립니다. 데이터셋이 RAM에 캐싱되지 않고 에포크마다 다시 로드되기 때문에 매우 느립니다. .repeat(10) 전에 .cache()를 추가하면 이전만큼 빨라지는 것을 확인할 수 있습니다.
%timeit -r1 for X, y in imdb_dataset(train_pos, train_neg).cache().repeat(10): pass
1 loop, best of 1: 29.2 s per loop
batch_size = 32
train_set = imdb_dataset(train_pos, train_neg).shuffle(25000).batch(batch_size).prefetch(1)
valid_set = imdb_dataset(valid_pos, valid_neg).batch(batch_size).prefetch(1)
test_set = imdb_dataset(test_pos, test_neg).batch(batch_size).prefetch(1)
문제: 리뷰를 전처리하기 위해 TextVectorization 층을 사용한 이진 분류 모델을 만드세요. TextVectorization 층을 아직 사용할 수 없다면 (또는 도전을 좋아한다면) 사용자 정의 전처리 층을 만들어보세요. tf.strings 패키지에 있는 함수를 사용할 수 있습니다. 예를 들어 lower()로 소문자로 만들거나 regex_replace()로 구두점을 공백으로 바꾸고 split()으로 공백을 기준으로 단어를 나눌 수 있습니다. 룩업 테이블을 사용해 단어 인덱스를 출력하세요. adapt() 메서드로 미리 층을 적응시켜야 합니다.
먼저 리뷰를 전처리하는 함수를 만듭니다. 이 함수는 리뷰를 300자로 자르고 소문자로 변환합니다. 그다음 <br />와 글자가 아닌 모든 문자를 공백으로 바꾸고, 리뷰를 단어로 분할한 다음, 마지막으로 각 리뷰가 n_words개의 토큰이 되도록 패딩하거나 잘라냅니다:
def preprocess(X_batch, n_words=50):
    shape = tf.shape(X_batch) * tf.constant([1, 0]) + tf.constant([0, n_words])
    Z = tf.strings.substr(X_batch, 0, 300)
    Z = tf.strings.lower(Z)
    Z = tf.strings.regex_replace(Z, b"<br\\s*/?>", b" ")
    Z = tf.strings.regex_replace(Z, b"[^a-z]", b" ")
    Z = tf.strings.split(Z)
    return Z.to_tensor(shape=shape, default_value=b"<pad>")
X_example = tf.constant(["It's a great, great movie! I loved it.", "It was terrible, run away!!!"])
preprocess(X_example)
<tf.Tensor: shape=(2, 50), dtype=string, numpy= array([[b'it', b's', b'a', b'great', b'great', b'movie', b'i', b'loved', b'it', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'it', b'was', b'terrible', b'run', b'away', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>']], dtype=object)>
이제 preprocess() 함수의 출력과 동일한 포맷의 데이터 샘플을 입력받는 두 번째 유틸리티 함수를 만듭니다. 이 함수는 가장 빈번한 단어 max_size개로 된 리스트를 출력하며, 패딩 토큰을 리스트의 첫 번째 항목으로 둡니다.
from collections import Counter

def get_vocabulary(data_sample, max_size=1000):
    preprocessed_reviews = preprocess(data_sample).numpy()
    counter = Counter()
    for words in preprocessed_reviews:
        for word in words:
            if word != b"<pad>":
                counter[word] += 1
    return [b"<pad>"] + [word for word, count in counter.most_common(max_size)]
get_vocabulary(X_example)
[b'<pad>', b'it', b'great', b's', b'a', b'movie', b'i', b'loved', b'was', b'terrible', b'run', b'away']
이제 TextVectorization 층을 만들 준비가 되었습니다. 이 층의 생성자는 단순히 하이퍼파라미터(max_vocabulary_size와 n_oov_buckets)를 저장하는 역할만 수행합니다. adapt() 메서드는 get_vocabulary() 함수를 사용해 어휘 사전을 계산한 다음 StaticVocabularyTable을 만듭니다(16장에서 자세히 설명합니다). call() 메서드는 각 리뷰를 전처리해 패딩된 단어 리스트를 얻은 다음, StaticVocabularyTable을 사용해 어휘 사전에 있는 단어의 인덱스를 조회합니다:
class TextVectorization(keras.layers.Layer):
    def __init__(self, max_vocabulary_size=1000, n_oov_buckets=100, dtype=tf.string, **kwargs):
        super().__init__(dtype=dtype, **kwargs)
        self.max_vocabulary_size = max_vocabulary_size
        self.n_oov_buckets = n_oov_buckets

    def adapt(self, data_sample):
        self.vocab = get_vocabulary(data_sample, self.max_vocabulary_size)
        words = tf.constant(self.vocab)
        word_ids = tf.range(len(self.vocab), dtype=tf.int64)
        vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
        self.table = tf.lookup.StaticVocabularyTable(vocab_init, self.n_oov_buckets)

    def call(self, inputs):
        preprocessed_inputs = preprocess(inputs)
        return self.table.lookup(preprocessed_inputs)
앞서 정의한 X_example로 테스트해 보죠:
text_vectorization = TextVectorization()
text_vectorization.adapt(X_example)
text_vectorization(X_example)
<tf.Tensor: shape=(2, 50), dtype=int64, numpy= array([[ 1, 3, 4, 2, 2, 5, 6, 7, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [ 1, 8, 9, 10, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])>
좋습니다! 여기에서 볼 수 있듯이 각 리뷰는 정제되고 토큰화되었습니다. 각 단어는 어휘 사전의 인덱스로 인코딩됩니다(0은 <pad> 토큰입니다).
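인코딩 결과를 직접 확인하고 싶다면 층의 vocab 속성으로 ID를 다시 단어로 되돌려 볼 수 있습니다(간단한 확인용 스케치입니다):
# 첫 번째 리뷰의 처음 9개 ID를 다시 단어로 변환해 봅니다
ids = text_vectorization(X_example)[0, :9].numpy()
print([text_vectorization.vocab[i] for i in ids])
# [b'it', b's', b'a', b'great', b'great', b'movie', b'i', b'loved', b'it']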
이제 또 다른 TextVectorization 층을 만들고 전체 IMDB 훈련 세트에 적응시켜 보겠습니다(훈련 세트가 메모리에 맞지 않으면 train_set.take(500)처럼 일부 데이터만 사용할 수 있습니다):
max_vocabulary_size = 1000
n_oov_buckets = 100
sample_review_batches = train_set.map(lambda review, label: review)
sample_reviews = np.concatenate(list(sample_review_batches.as_numpy_iterator()),
                                axis=0)
text_vectorization = TextVectorization(max_vocabulary_size, n_oov_buckets,
                                       input_shape=[])
text_vectorization.adapt(sample_reviews)
동일하게 X_example로 실행해 보죠. 어휘 사전이 크기 때문에 단어의 ID가 큽니다:
text_vectorization(X_example)
<tf.Tensor: shape=(2, 50), dtype=int64, numpy= array([[ 9, 14, 2, 64, 64, 12, 5, 256, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [ 9, 13, 269, 531, 334, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])>
좋습니다. 그럼 어휘 사전에서 처음 10개 단어를 확인해 보죠:
text_vectorization.vocab[:10]
[b'<pad>', b'the', b'a', b'of', b'and', b'i', b'to', b'is', b'this', b'it']
이 단어가 리뷰에서 가장 많이 등장하는 단어입니다.
이제 모델을 만들려면 단어 ID를 어떤 식으로든 인코딩해야 합니다. 한 가지 방법은 BoW(bag of words)입니다. 어휘 사전에 있는 각 단어에 대해 리뷰에 그 단어가 등장하는 횟수를 카운트합니다. 예를 들면 다음과 같습니다:
simple_example = tf.constant([[1, 3, 1, 0, 0], [2, 2, 0, 0, 0]])
tf.reduce_sum(tf.one_hot(simple_example, 4), axis=1)
<tf.Tensor: shape=(2, 4), dtype=float32, numpy= array([[2., 2., 0., 1.], [3., 0., 2., 0.]], dtype=float32)>
첫 번째 리뷰에는 단어 0이 두 번 등장하고, 단어 1도 두 번, 단어 2는 0번, 단어 3은 한 번 등장합니다. 따라서 BoW 표현은 [2, 2, 0, 1]입니다. 비슷하게 두 번째 리뷰에는 단어 0이 세 번, 단어 1이 0번 등장하는 식입니다. 이 로직을 간단한 사용자 정의 층으로 구현해서 테스트해 보겠습니다. 단어 0은 <pad> 토큰에 해당하므로 카운트하지 않겠습니다.
class BagOfWords(keras.layers.Layer):
    def __init__(self, n_tokens, dtype=tf.int32, **kwargs):
        super().__init__(dtype=dtype, **kwargs)
        self.n_tokens = n_tokens
    def call(self, inputs):
        one_hot = tf.one_hot(inputs, self.n_tokens)
        return tf.reduce_sum(one_hot, axis=1)[:, 1:]
테스트해 보죠:
bag_of_words = BagOfWords(n_tokens=4)
bag_of_words(simple_example)
<tf.Tensor: shape=(2, 3), dtype=float32, numpy= array([[2., 0., 1.], [0., 2., 0.]], dtype=float32)>
잘 동작하네요! 이제 훈련 세트의 어휘 사전 크기를 지정한 BagOfWords 층을 만듭니다:
n_tokens = max_vocabulary_size + n_oov_buckets + 1 # add 1 for <pad>
bag_of_words = BagOfWords(n_tokens)
이제 모델을 훈련할 차례입니다!
model = keras.models.Sequential([
    text_vectorization,
    bag_of_words,
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
model.fit(train_set, epochs=5, validation_data=valid_set)
Epoch 1/5
782/782 [==============================] - 31s 21ms/step - loss: 0.5428 - accuracy: 0.7189 - val_loss: 0.5081 - val_accuracy: 0.7399
Epoch 2/5
782/782 [==============================] - 21s 23ms/step - loss: 0.4719 - accuracy: 0.7692 - val_loss: 0.5037 - val_accuracy: 0.7473
Epoch 3/5
782/782 [==============================] - 20s 21ms/step - loss: 0.4205 - accuracy: 0.8057 - val_loss: 0.5102 - val_accuracy: 0.7431
Epoch 4/5
782/782 [==============================] - 20s 21ms/step - loss: 0.3504 - accuracy: 0.8502 - val_loss: 0.5317 - val_accuracy: 0.7397
Epoch 5/5
782/782 [==============================] - 20s 21ms/step - loss: 0.2690 - accuracy: 0.9022 - val_loss: 0.5723 - val_accuracy: 0.7355
<tensorflow.python.keras.callbacks.History at 0x7fb9014bde10>
첫 번째 에포크에서 검증 세트에 대해 약 74% 정확도를 얻었습니다. 하지만 그 이후로는 더 진전이 없습니다. 16장에서 이를 더 개선해 보겠습니다. 지금은 tf.data와 케라스 전처리 층으로 효율적인 전처리를 수행하는 것에만 초점을 맞추었습니다.
문제: Embedding 층을 추가하고 단어 개수의 제곱근을 곱하여 리뷰마다 평균 임베딩을 계산하세요(16장 참조). 이제 스케일이 조정된 이 평균 임베딩을 모델의 다음 부분으로 전달할 수 있습니다.
각 리뷰의 평균 임베딩을 계산하고 리뷰에 있는 단어 개수의 제곱근을 곱하는 간단한 함수를 정의합니다. 각 문장에 대해서 이 함수는 $M \times \sqrt{N}$을 계산합니다. 여기에서 $M$은 (패딩 토큰을 제외한) 문장에 있는 모든 단어 임베딩의 평균이고, $N$은 (패딩 토큰을 제외한) 문장에 있는 단어의 개수입니다. $M$을 $\dfrac{S}{N}$로 다시 쓸 수 있습니다. 여기에서 $S$는 모든 단어 임베딩의 합입니다(패딩 토큰은 0 벡터이므로 합에 패딩 토큰을 포함했는지 여부는 문제가 되지 않습니다). 따라서 이 함수는 $M \times \sqrt{N} = \dfrac{S}{N} \times \sqrt{N} = \dfrac{S}{\sqrt{N} \times \sqrt{N}} \times \sqrt{N} = \dfrac{S}{\sqrt{N}}$를 반환해야 합니다.
이를 구현한 함수는 다음과 같습니다:
def compute_mean_embedding(inputs):
    not_pad = tf.math.count_nonzero(inputs, axis=-1)
    n_words = tf.math.count_nonzero(not_pad, axis=-1, keepdims=True)
    sqrt_n_words = tf.math.sqrt(tf.cast(n_words, tf.float32))
    return tf.reduce_sum(inputs, axis=1) / sqrt_n_words

another_example = tf.constant([[[1., 2., 3.], [4., 5., 0.], [0., 0., 0.]],
                               [[6., 0., 0.], [0., 0., 0.], [0., 0., 0.]]])
compute_mean_embedding(another_example)
<tf.Tensor: shape=(2, 3), dtype=float32, numpy= array([[3.535534 , 4.9497476, 2.1213205], [6. , 0. , 0. ]], dtype=float32)>
결과가 올바른지 확인해 보죠. 첫 번째 리뷰에는 2개의 단어가 있습니다(마지막 토큰은 <pad> 토큰을 나타내는 0 벡터입니다). 이 두 단어의 평균 임베딩을 계산하고 그 결과에 2의 제곱근을 곱해 보겠습니다:
tf.reduce_mean(another_example[0:1, :2], axis=1) * tf.sqrt(2.)
<tf.Tensor: shape=(1, 3), dtype=float32, numpy=array([[3.535534 , 4.9497476, 2.1213202]], dtype=float32)>
좋습니다. 두 번째 리뷰를 확인해 보죠. 이 리뷰는 하나의 단어만 가지고 있습니다(두 개의 패딩 토큰은 무시합니다):
tf.reduce_mean(another_example[1:2, :1], axis=1) * tf.sqrt(1.)
<tf.Tensor: shape=(1, 3), dtype=float32, numpy=array([[6., 0., 0.]], dtype=float32)>
완벽하군요. 이제 최종 모델을 훈련할 차례입니다. 이전과 동일하지만 BagOfWords 층을 Embedding 층과 compute_mean_embedding을 호출하는 Lambda 층으로 바꿉니다:
embedding_size = 20

model = keras.models.Sequential([
    text_vectorization,
    keras.layers.Embedding(input_dim=n_tokens,
                           output_dim=embedding_size,
                           mask_zero=True),  # <pad> tokens => zero vectors
    keras.layers.Lambda(compute_mean_embedding),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
문제: 모델을 훈련하고 얼마의 정확도가 나오는지 확인해보세요. 가능한 한 훈련 속도를 빠르게 하기 위해 파이프라인을 최적화해보세요.
model.compile(loss="binary_crossentropy", optimizer="nadam", metrics=["accuracy"])
model.fit(train_set, epochs=5, validation_data=valid_set)
Epoch 1/5
782/782 [==============================] - 15s 12ms/step - loss: 0.5545 - accuracy: 0.7062 - val_loss: 0.5130 - val_accuracy: 0.7380
Epoch 2/5
782/782 [==============================] - 11s 9ms/step - loss: 0.4953 - accuracy: 0.7540 - val_loss: 0.5057 - val_accuracy: 0.7453
Epoch 3/5
782/782 [==============================] - 25s 15ms/step - loss: 0.4845 - accuracy: 0.7594 - val_loss: 0.5057 - val_accuracy: 0.7412
Epoch 4/5
782/782 [==============================] - 11s 10ms/step - loss: 0.4764 - accuracy: 0.7627 - val_loss: 0.5086 - val_accuracy: 0.7413
Epoch 5/5
782/782 [==============================] - 30s 9ms/step - loss: 0.4698 - accuracy: 0.7638 - val_loss: 0.5074 - val_accuracy: 0.7374
<tensorflow.python.keras.callbacks.History at 0x7fb8fe0cb790>
임베딩을 사용해서 더 나아지지 않았습니다(16장에서 이를 개선해 보겠습니다). 파이프라인은 충분히 빨라 보입니다(앞서 최적화했습니다).
문제: tfds.load("imdb_reviews")와 같이 TFDS를 사용해 동일한 데이터셋을 간단하게 적재해보세요.
import tensorflow_datasets as tfds

datasets = tfds.load(name="imdb_reviews")
train_set, test_set = datasets["train"], datasets["test"]
for example in train_set.take(1):
    print(example["text"])
    print(example["label"])
tf.Tensor(b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.", shape=(), dtype=string) tf.Tensor(0, shape=(), dtype=int64)