This kernel was put together from the references below. I am a beginner just starting out on Kaggle, and my goal is to build a tutorial that is easier and better suited to beginners. For more details, please see the following links.
First we import the libraries we need. Each group is described by the comment above it.
# Building the Keras model
from keras.models import Sequential
from keras.utils import np_utils
from keras.layers.core import Dense, Activation, Dropout
from keras.layers import Conv2D, MaxPooling2D, Flatten
# Splitting a validation set from the training set
from sklearn.model_selection import train_test_split
# Data analysis
import pandas as pd
import numpy as np
We load the data with read_csv. The first column of train.csv holds the labels; the remaining 784 columns are the pixel values.
train = pd.read_csv('../input/train.csv')
labels = train.iloc[:,0].values.astype('int32')
X_train = (train.iloc[:,1:].values).astype('float32')
X_test = (pd.read_csv('../input/test.csv').values).astype('float32')
Each label is converted into a one-hot (categorical) vector with to_categorical.
y_train = np_utils.to_categorical(labels)
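Under the hood, to_categorical just builds one-hot vectors. A plain-NumPy sketch of the same idea (the helper name to_one_hot is ours, for illustration only):

```python
import numpy as np

def to_one_hot(labels, num_classes=10):
    """Plain-NumPy equivalent of one-hot encoding for digit labels 0-9."""
    one_hot = np.zeros((len(labels), num_classes), dtype='float32')
    one_hot[np.arange(len(labels)), labels] = 1.0
    return one_hot

y = to_one_hot(np.array([3, 0, 9]))
print(y.shape)  # (3, 10), one row per label with a single 1 in it
```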
Each image is 28 by 28 pixels with a single grayscale channel, so we reshape accordingly.
X_train = X_train.reshape((-1,28,28,1))
X_test = X_test.reshape((-1,28,28,1))
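The -1 lets NumPy infer the batch dimension from the data. A quick sanity check on dummy data (five fake images instead of the real 42,000 rows):

```python
import numpy as np

# Five dummy flattened images, 784 = 28 * 28 pixels each.
flat = np.zeros((5, 784), dtype='float32')

# -1 infers the batch size; the trailing 1 is the grayscale channel.
images = flat.reshape((-1, 28, 28, 1))
print(images.shape)  # (5, 28, 28, 1)
```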
Now let's split a validation set off from the training data.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.1)
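With test_size = 0.1 and 42,000 training rows, the split works out to 37,800 training and 4,200 validation samples. A quick back-of-the-envelope check:

```python
n_total = 42000                      # rows in train.csv
test_size = 0.1
n_val = round(n_total * test_size)   # rows held out for validation
n_train = n_total - n_val            # rows left for training
print(n_train, n_val)  # 37800 4200
```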
Scale and center the data; 255 could be used in place of np.max. Because the validation set was split off just above, it has to be rescaled the same way as the training set, and the centering value is the training-set mean.
scale = np.max(X_train)
X_train /= scale
X_val /= scale
X_test /= scale
mean = np.mean(X_train)
X_train -= mean
X_val -= mean
X_test -= mean
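A toy sketch of the scale-then-center steps on three pixel values, using the mean for centering:

```python
import numpy as np

pixels = np.array([0., 128., 255.], dtype='float32')

scale = pixels.max()   # 255 for 8-bit grayscale
pixels /= scale        # values now in [0, 1]
mean = pixels.mean()
pixels -= mean         # roughly zero-centered
```

Keeping the inputs in a small, roughly symmetric range like this tends to make gradient-based training more stable.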
Now we stack up the model; with Sequential you simply keep adding layers. Five kinds of layers are used (Conv2D, MaxPooling2D, Dropout, Flatten, Dense), along with two activation functions (relu and softmax).
model = Sequential()
model.add(Conv2D(32,(3,3), activation='relu', input_shape=(28,28,1)))
model.add(Conv2D(32,(3,3), activation='relu'))
model.add(MaxPooling2D((2,2)))
model.add(Dropout(0.25))
model.add(Conv2D(64,(3,3), activation='relu'))
model.add(Conv2D(64,(3,3), activation='relu'))
model.add(MaxPooling2D((2,2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(10, activation='softmax'))
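With 'valid' padding (the Conv2D default), each 3x3 convolution shrinks the feature map by 2 pixels per dimension, and each 2x2 max-pooling halves it. A small hand calculation of the spatial sizes flowing through the network above:

```python
def conv_out(size, kernel=3):
    # 'valid' padding, stride 1: output shrinks by kernel - 1
    return size - kernel + 1

def pool_out(size, pool=2):
    # non-overlapping 2x2 pooling halves the spatial size
    return size // pool

s = 28
s = conv_out(s)   # after Conv2D(32): 26
s = conv_out(s)   # after Conv2D(32): 24
s = pool_out(s)   # after MaxPooling2D: 12
s = conv_out(s)   # after Conv2D(64): 10
s = conv_out(s)   # after Conv2D(64): 8
s = pool_out(s)   # after MaxPooling2D: 4
flat_units = s * s * 64
print(flat_units)  # 1024 inputs feed the Dense(256) layer
```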
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
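categorical_crossentropy compares the one-hot target with the softmax output. A minimal NumPy sketch of the loss for a single prediction:

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # clip to avoid log(0), then average the per-sample cross-entropy
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

y_true = np.array([[0., 1., 0.]])     # true class is index 1
y_pred = np.array([[0.1, 0.8, 0.1]])  # softmax output
loss = categorical_crossentropy(y_true, y_pred)  # -log(0.8) ≈ 0.223
```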
history = model.fit(X_train, y_train, epochs=20, batch_size=64, validation_data=(X_val, y_val))
print("Training score : ", model.evaluate(X_train, y_train))
print("Validation score : ", model.evaluate(X_val, y_val))
Train on 37800 samples, validate on 4200 samples
Epoch  1/20 - loss: 0.1875 - acc: 0.9415 - val_loss: 0.4303 - val_acc: 0.9731
Epoch  2/20 - loss: 0.0588 - acc: 0.9814 - val_loss: 0.1957 - val_acc: 0.9879
Epoch  3/20 - loss: 0.0422 - acc: 0.9866 - val_loss: 0.2300 - val_acc: 0.9850
Epoch  4/20 - loss: 0.0349 - acc: 0.9894 - val_loss: 0.2859 - val_acc: 0.9821
Epoch  5/20 - loss: 0.0302 - acc: 0.9901 - val_loss: 0.3112 - val_acc: 0.9805
Epoch  6/20 - loss: 0.0268 - acc: 0.9917 - val_loss: 0.2147 - val_acc: 0.9867
Epoch  7/20 - loss: 0.0244 - acc: 0.9924 - val_loss: 0.1748 - val_acc: 0.9890
Epoch  8/20 - loss: 0.0227 - acc: 0.9929 - val_loss: 0.2626 - val_acc: 0.9833
Epoch  9/20 - loss: 0.0208 - acc: 0.9936 - val_loss: 0.1935 - val_acc: 0.9879
Epoch 10/20 - loss: 0.0181 - acc: 0.9947 - val_loss: 0.2167 - val_acc: 0.9862
Epoch 11/20 - loss: 0.0174 - acc: 0.9948 - val_loss: 0.3185 - val_acc: 0.9802
Epoch 12/20 - loss: 0.0171 - acc: 0.9951 - val_loss: 0.6266 - val_acc: 0.9610
Epoch 13/20 - loss: 0.0167 - acc: 0.9949 - val_loss: 0.9953 - val_acc: 0.9381
Epoch 14/20 - loss: 0.0157 - acc: 0.9957 - val_loss: 0.5718 - val_acc: 0.9645
Epoch 15/20 - loss: 0.0147 - acc: 0.9958 - val_loss: 0.3147 - val_acc: 0.9805
Epoch 16/20 - loss: 0.0140 - acc: 0.9958 - val_loss: 0.2955 - val_acc: 0.9817
Epoch 17/20 - loss: 0.0141 - acc: 0.9965 - val_loss: 0.3046 - val_acc: 0.9807
Epoch 18/20 - loss: 0.0128 - acc: 0.9965 - val_loss: 0.5262 - val_acc: 0.9671
Epoch 19/20 - loss: 0.0126 - acc: 0.9965 - val_loss: 0.9796 - val_acc: 0.9390
Epoch 20/20 - loss: 0.0126 - acc: 0.9967 - val_loss: 1.7778 - val_acc: 0.8895
Training score :  [0.004546692839156409, 0.9988624338624339]
Validation score :  [1.7777700751168393, 0.8895238095238095]
The fit function returns a History object whose history attribute holds four lists: acc, val_acc, loss, and val_loss.
We use matplotlib to plot them, one figure for the loss and one for the accuracy.
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
Overfitting seems to set in fairly early, whether because the validation set is small or because the network is fairly deep. We will ignore that for now and submit.
The way people build the final submission CSV varies. François Chollet wraps it in a function, so the submission file is produced the same way here.
preds = model.predict_classes(X_test, verbose=0)
def write_preds(preds, fname):
    pd.DataFrame({"ImageId": list(range(1, len(preds) + 1)), "Label": preds}).to_csv(fname, index=False, header=True)
write_preds(preds, "keras-mnist.csv")
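The submission format itself is just two columns, ImageId and Label. A standard-library sketch of the same file layout (dummy predictions, no pandas):

```python
import csv
import io

preds = [2, 0, 9]  # dummy predicted digits

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["ImageId", "Label"])
for i, p in enumerate(preds, start=1):
    writer.writerow([i, p])

csv_text = buf.getvalue()
print(csv_text.splitlines()[0])  # ImageId,Label
```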