For this simple Keras classification example we will use the Wisconsin breast cancer dataset. We will upload the file breast-cancer-wisconsin.data from the local drive. Other ways to upload files to a Google Colab notebook can be found here.
from google.colab import files
uploaded = files.upload()
Saving breast-cancer-wisconsin.data to breast-cancer-wisconsin.data
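If the file lives in Google Drive instead, the drive can be mounted and the file read in place. A minimal sketch, using a hypothetical path that you would adjust to wherever the file is actually stored:

```python
import pandas as pd
from google.colab import drive

drive.mount('/content/drive')  # prompts for authorization on first use
# Hypothetical path; change it to wherever breast-cancer-wisconsin.data lives.
data = pd.read_csv('/content/drive/MyDrive/breast-cancer-wisconsin.data',
                   header=None, na_values='?')
```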
Now let's read the dataset.
import pandas as pd
import io
data = pd.read_csv(io.BytesIO(uploaded['breast-cancer-wisconsin.data']),
header=None, na_values='?')
# Dataset is now stored in a pandas DataFrame
Let's check that it was uploaded and parsed correctly.
data.head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1.0 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10.0 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2.0 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4.0 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1.0 | 3 | 1 | 1 | 2 |
We will drop the first column, which holds the sample id and contributes nothing to the predictive ability of the model, and we will also drop the samples with missing values, so that we are left with a pristine dataset. Then we will print the summary statistics of the dataset.
data = data.drop(0, axis=1).dropna()
data.describe()
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 683.000000 | 683.000000 | 683.000000 | 683.000000 | 683.000000 | 683.000000 | 683.000000 | 683.000000 | 683.000000 | 683.000000 |
| mean | 4.442167 | 3.150805 | 3.215227 | 2.830161 | 3.234261 | 3.544656 | 3.445095 | 2.869693 | 1.603221 | 2.699854 |
| std | 2.820761 | 3.065145 | 2.988581 | 2.864562 | 2.223085 | 3.643857 | 2.449697 | 3.052666 | 1.732674 | 0.954592 |
| min | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 |
| 25% | 2.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 1.000000 | 2.000000 | 1.000000 | 1.000000 | 2.000000 |
| 50% | 4.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 1.000000 | 2.000000 |
| 75% | 6.000000 | 5.000000 | 5.000000 | 4.000000 | 4.000000 | 6.000000 | 5.000000 | 4.000000 | 1.000000 | 4.000000 |
| max | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 4.000000 |
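Since this is a binary classification task, it is also worth a quick look at the class balance. With header=None the class column is labeled 10, so for example:

```python
# Count the samples in each class (2 = benign, 4 = malignant).
print(data[10].value_counts())
```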
We will normalize the nine attributes to the (0, 1] range by dividing them by 10, since each one takes integer values from 1 to 10, as the summary above shows.
data.iloc[:,0:9] = data.iloc[:,0:9].div(10)
data.head()
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.5 | 0.1 | 0.1 | 0.1 | 0.2 | 0.1 | 0.3 | 0.1 | 0.1 | 2 |
| 1 | 0.5 | 0.4 | 0.4 | 0.5 | 0.7 | 1.0 | 0.3 | 0.2 | 0.1 | 2 |
| 2 | 0.3 | 0.1 | 0.1 | 0.1 | 0.2 | 0.2 | 0.3 | 0.1 | 0.1 | 2 |
| 3 | 0.6 | 0.8 | 0.8 | 0.1 | 0.3 | 0.4 | 0.3 | 0.7 | 0.1 | 2 |
| 4 | 0.4 | 0.1 | 0.1 | 0.3 | 0.2 | 0.1 | 0.3 | 0.1 | 0.1 | 2 |
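Dividing by 10 only works because every attribute here is an integer between 1 and 10. For features with less convenient ranges, scikit-learn's MinMaxScaler is a more general alternative that learns the minimum and maximum from the data (note that it maps the observed minimum to 0 rather than 0.1):

```python
from sklearn.preprocessing import MinMaxScaler

# Alternative to the division above: min-max scaling learned from the data
# instead of assumed from the codebook.
scaler = MinMaxScaler()
data.iloc[:, 0:9] = scaler.fit_transform(data.iloc[:, 0:9])
```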
Now let's try to predict whether the cancer is benign (2) or malignant (4), based on the nine available features. First we will do an 80-20% train-test split, keeping the 20% to test the model at the end. We will also one-hot-encode the class variable.
from sklearn.utils import shuffle
data = shuffle(data)
X = data.iloc[:, 0:9]  # all nine features
y = data.iloc[:, 9].replace({2: 0, 4: 1})  # class labels: benign 2 -> 0, malignant 4 -> 1
from tensorflow.keras.utils import to_categorical
encoded_y = to_categorical(y)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, encoded_y,
test_size=0.2,
random_state=1234)
print(X_train[:5])
print(y_train[:5])
       1    2    3    4    5    6    7    8
344  0.7  0.6  0.4  0.8  1.0  1.0  0.9  0.5
22   0.3  0.1  0.1  0.1  0.2  0.1  0.2  0.1
124  0.5  0.4  0.6  0.7  0.9  0.7  0.8  1.0
239  1.0  0.4  0.3  0.2  0.3  1.0  0.5  0.3
552  0.3  0.2  0.2  0.2  0.2  0.1  0.4  0.2
[[0. 1.]
 [1. 0.]
 [0. 1.]
 [0. 1.]
 [1. 0.]]
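One optional refinement, not used in this run: passing stratify to train_test_split keeps the benign-to-malignant ratio identical in the train and test sets, which matters more as class imbalance grows. A sketch:

```python
# Variant: a stratified 80-20 split on the 0/1 class labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, encoded_y, test_size=0.2, random_state=1234, stratify=y)
```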
import tensorflow as tf
import logging
logger = tf.get_logger()
logger.setLevel(logging.ERROR)
We will build a simple feedforward neural network with one hidden layer of 20 nodes.
l0 = tf.keras.layers.Dense(units=20, input_shape=(9,),
                           activation='relu')  # hidden layer, one input per feature
l1 = tf.keras.layers.Dense(units=2, activation='sigmoid')  # one output unit per class
model = tf.keras.Sequential([l0, l1])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=100, validation_split=0.1)
print('Finished training')
Train on 491 samples, validate on 55 samples
Epoch 1/100
491/491 [==============================] - 0s 419us/sample - loss: 0.6856 - acc: 0.4919 - val_loss: 0.6642 - val_acc: 0.5091
Epoch 2/100
491/491 [==============================] - 0s 55us/sample - loss: 0.6624 - acc: 0.5662 - val_loss: 0.6394 - val_acc: 0.6909
Epoch 3/100
491/491 [==============================] - 0s 55us/sample - loss: 0.6394 - acc: 0.7566 - val_loss: 0.6147 - val_acc: 0.8455
...
Epoch 100/100
491/491 [==============================] - 0s 56us/sample - loss: 0.0811 - acc: 0.9705 - val_loss: 0.1504 - val_acc: 0.9636
Finished training
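Training for exactly 100 epochs is an arbitrary choice, and the validation loss above barely improves after epoch 40 or so. A common refinement, not used in the run above, is to stop automatically once the validation loss plateaus:

```python
# Sketch: halt training when val_loss has not improved for 10 epochs
# and roll back to the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                              patience=10,
                                              restore_best_weights=True)
history = model.fit(X_train, y_train, epochs=100,
                    validation_split=0.1, callbacks=[early_stop])
```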
Now let's plot the training loss over the epochs.
import matplotlib.pyplot as plt
plt.xlabel('Epoch Number')
plt.ylabel("Loss Magnitude")
plt.plot(history.history['loss'])
The plot shows the loss falling steeply over the first twenty or so epochs and then flattening out.
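Because we trained with validation_split, history also records the validation loss, and plotting both curves together makes overfitting easier to spot:

```python
plt.xlabel('Epoch Number')
plt.ylabel('Loss Magnitude')
plt.plot(history.history['loss'], label='training loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.legend()
plt.show()
```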
eval_model = model.evaluate(X_train, y_train)
eval_model
546/546 [==============================] - 0s 43us/sample - loss: 0.0879 - acc: 0.9689
[0.08786988179216455, 0.96886444]
The training accuracy is 96.9%.
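The same call on the held-out split gives a quick test-set estimate; note that Keras computes this accuracy element-wise over the two one-hot outputs, so it can differ slightly from the confusion-matrix accuracy computed below:

```python
model.evaluate(X_test, y_test)
```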
import numpy as np
from sklearn.metrics import confusion_matrix
# Predict class probabilities on the test set and pick the most likely class
y_pred_probs = model.predict(X_test)
y_pred = np.argmax(y_pred_probs, axis=1)
# Invert the one-hot encoding of the test labels to get 0/1 classes back
y_true = np.argmax(y_test, axis=1)
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix")
print(cm)
print("Accuracy is: {0:.2f}%".format(100 * (cm[0, 0] + cm[1, 1]) / cm.sum()))
Confusion Matrix
[[80  1]
 [ 2 54]]
Accuracy is: 97.81%
The accuracy on the unseen test set is 97.81%, with only 1 false positive and 2 false negatives.
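In a medical setting false negatives are the costly errors, so per-class precision and recall are worth reporting as well; scikit-learn's classification_report prints them in one call:

```python
from sklearn.metrics import classification_report

# Per-class precision, recall and F1 (class 0 = benign, class 1 = malignant).
print(classification_report(y_true, y_pred,
                            target_names=['benign', 'malignant']))
```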