Credits: Forked from deep-learning-keras-tensorflow by Valerio Maggio
%matplotlib inline
import numpy as np
import pandas as pd
import theano
import theano.tensor as T
import matplotlib.pyplot as plt
import keras
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils
from sklearn.cross_validation import train_test_split
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.models import Sequential
from keras.layers import Dense, Activation
Using Theano backend.
For this section we will use the Kaggle Otto Group Product Classification challenge. If you want to follow along, get the data from Kaggle: https://www.kaggle.com/c/otto-group-product-classification-challenge/data
The Otto Group is one of the world's biggest e-commerce companies. A consistent analysis of the performance of products is crucial; however, due to our diverse global infrastructure, many identical products get classified differently. For this competition, we have provided a dataset with 93 features for more than 200,000 products. The objective is to build a predictive model which is able to distinguish between our main product categories. Each row corresponds to a single product. There are a total of 93 numerical features, which represent counts of different events. All features have been obfuscated and will not be defined any further.
def load_data(path, train=True):
    """Load data from a CSV file.

    Parameters
    ----------
    path: str
        The path to the CSV file
    train: bool (default True)
        Decide whether or not data are *training data*.
        If True, some random shuffling is applied.

    Return
    ------
    X: numpy.ndarray
        The data as a multi-dimensional array of floats
    labels or ids: numpy.ndarray
        The class labels (if `train=True`), otherwise a vector of ids
        for each sample
    """
    df = pd.read_csv(path)
    X = df.values.copy()
    if train:
        np.random.shuffle(X)  # https://youtu.be/uyUXoap67N8
        X, labels = X[:, 1:-1].astype(np.float32), X[:, -1]
        return X, labels
    else:
        X, ids = X[:, 1:].astype(np.float32), X[:, 0].astype(str)
        return X, ids
def preprocess_data(X, scaler=None):
    """Preprocess input data by standardising features:
    remove the mean and scale to unit variance."""
    if not scaler:
        scaler = StandardScaler()
        scaler.fit(X)
    X = scaler.transform(X)
    return X, scaler
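Concretely, StandardScaler learns the per-feature mean $\mu_j$ and standard deviation $\sigma_j$ from the data it is fitted on, and then applies the usual z-score transform:

$$x'_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}$$

Returning the fitted scaler lets us reuse the *training* statistics when transforming the test data (as done below), instead of re-fitting a new scaler on the test set.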
def preprocess_labels(labels, encoder=None, categorical=True):
    """Encode labels with values between 0 and `n_classes - 1`."""
    if not encoder:
        encoder = LabelEncoder()
        encoder.fit(labels)
    y = encoder.transform(labels).astype(np.int32)
    if categorical:
        y = np_utils.to_categorical(y)
    return y, encoder
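As a quick sanity check, here is a toy illustration of what preprocess_labels produces (the three labels below are made up for the example; the real dataset has targets Class_1 through Class_9):
# Toy example: LabelEncoder maps labels to integers in alphabetical order
# (Class_1 -> 0, Class_2 -> 1), then to_categorical turns them into one-hot rows.
toy_labels = np.array(['Class_2', 'Class_1', 'Class_2'])
toy_y, toy_encoder = preprocess_labels(toy_labels)
# toy_y is now:
# [[ 0.  1.]
#  [ 1.  0.]
#  [ 0.  1.]]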
print("Loading data...")
X, labels = load_data('train.csv', train=True)
X, scaler = preprocess_data(X)
Y, encoder = preprocess_labels(labels)
X_test, ids = load_data('test.csv', train=False)
X_test, ids = X_test[:1000], ids[:1000]
# Peek at the first (raw) test sample
print(X_test[:1])
X_test, _ = preprocess_data(X_test, scaler)
nb_classes = Y.shape[1]
print(nb_classes, 'classes')
dims = X.shape[1]
print(dims, 'dims')
Loading data...
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 3. 0. 0. 0. 3. 2. 1. 0. 0. 0. 0. 0. 0. 0. 5. 3. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 3. 0. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 11. 1. 20. 0. 0. 0. 0. 0.]]
(9L, 'classes')
(93L, 'dims')
Now let's create and train a logistic regression model.
Keras is a minimalist, highly modular neural networks library, written in Python and capable of running on top of either TensorFlow or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research. ref: https://keras.io/
Keras (κέρας) means horn in Greek. It is a reference to a literary image from ancient Greek and Latin literature, first found in the Odyssey, where dream spirits (Oneiroi, singular Oneiros) are divided between those who deceive men with false visions, who arrive to Earth through a gate of ivory, and those who announce a future that will come to pass, who arrive through a gate of horn. It's a play on the words κέρας (horn) / κραίνω (fulfill), and ἐλέφας (ivory) / ἐλεφαίρομαι (deceive).
Keras was initially developed as part of the research effort of project ONEIROS (Open-ended Neuro-Electronic Intelligent Robot Operating System).
"Oneiroi are beyond our unravelling --who can be sure what tale they tell? Not all that men look for comes to pass. Two gates there are that give passage to fleeting Oneiroi; one is made of horn, one of ivory. The Oneiroi that pass through sawn ivory are deceitful, bearing a message that will not be fulfilled; those that come out through polished horn have truth behind them, to be accomplished for men who see them." Homer, Odyssey 19. 562 ff (Shewring translation).
dims = X.shape[1]
print(dims, 'dims')
print("Building model...")
nb_classes = Y.shape[1]
print(nb_classes, 'classes')
model = Sequential()
model.add(Dense(nb_classes, input_shape=(dims,)))
model.add(Activation('softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy')
model.fit(X, Y)
(93L, 'dims')
Building model...
(9L, 'classes')
Epoch 1/10
61878/61878 [==============================] - 1s - loss: 1.0574
Epoch 2/10
61878/61878 [==============================] - 1s - loss: 0.7730
Epoch 3/10
61878/61878 [==============================] - 1s - loss: 0.7297
Epoch 4/10
61878/61878 [==============================] - 1s - loss: 0.7080
Epoch 5/10
61878/61878 [==============================] - 1s - loss: 0.6948
Epoch 6/10
61878/61878 [==============================] - 1s - loss: 0.6854
Epoch 7/10
61878/61878 [==============================] - 1s - loss: 0.6787
Epoch 8/10
61878/61878 [==============================] - 1s - loss: 0.6734
Epoch 9/10
61878/61878 [==============================] - 1s - loss: 0.6691
Epoch 10/10
61878/61878 [==============================] - 1s - loss: 0.6657
<keras.callbacks.History at 0x23d330f0>
Simplicity is pretty impressive, right? :)
Now let's understand what we just did:
The core data structure of Keras is a model, a way to organize layers. The main type of model is the Sequential model, a linear stack of layers.
What we did here is stack a Fully Connected (Dense) layer of trainable weights from the input to the output, and an Activation layer on top of it.
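In equation form, this single-layer model computes plain multinomial logistic regression (here $x \in \mathbb{R}^{93}$ is one row of features, and $W$, $b$ are the Dense layer's trainable weight matrix and bias):

$$\hat{y} = \mathrm{softmax}(Wx + b), \qquad \mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{9} e^{z_k}}$$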
from keras.layers.core import Dense
Dense(output_dim, init='glorot_uniform', activation='linear',
weights=None, W_regularizer=None, b_regularizer=None,
activity_regularizer=None, W_constraint=None,
b_constraint=None, bias=True, input_dim=None)
from keras.layers.core import Activation
Activation(activation)
If you need to, you can further configure your optimizer. A core principle of Keras is to make things reasonably simple, while allowing the user to be fully in control when they need to (the ultimate control being the easy extensibility of the source code). Here we used SGD (stochastic gradient descent) as an optimization algorithm for our trainable weights.
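For example, instead of the string 'sgd' you can pass an optimizer instance with explicit hyper-parameters. A minimal sketch (the learning rate and momentum values are illustrative only):
from keras.optimizers import SGD

# Explicitly configured stochastic gradient descent
sgd = SGD(lr=0.01, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd, loss='categorical_crossentropy')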
What we did here is nice; however, in the real world it is not usable because of overfitting. Let's try to address it with a held-out validation set and early stopping.
In overfitting, a statistical model describes random error or noise instead of the underlying relationship. Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data.
To avoid overfitting, we will first split our data into a training set and a test set, and evaluate our model on the test set. Next, we will use two of Keras's callbacks: EarlyStopping and ModelCheckpoint.
X, X_test, Y, Y_test = train_test_split(X, Y, test_size=0.15, random_state=42)
fBestModel = 'best_model.h5'
early_stop = EarlyStopping(monitor='val_loss', patience=4, verbose=1)
best_model = ModelCheckpoint(fBestModel, verbose=0, save_best_only=True)
model.fit(X, Y, validation_data=(X_test, Y_test), nb_epoch=20,
          batch_size=128, verbose=True,
          callbacks=[best_model, early_stop])
Train on 19835 samples, validate on 3501 samples
Epoch 1/20
19835/19835 [==============================] - 0s - loss: 0.6391 - val_loss: 0.6680
Epoch 2/20
19835/19835 [==============================] - 0s - loss: 0.6386 - val_loss: 0.6689
Epoch 3/20
19835/19835 [==============================] - 0s - loss: 0.6384 - val_loss: 0.6695
Epoch 4/20
19835/19835 [==============================] - 0s - loss: 0.6381 - val_loss: 0.6702
Epoch 5/20
19835/19835 [==============================] - 0s - loss: 0.6378 - val_loss: 0.6709
Epoch 6/20
19328/19835 [============================>.] - ETA: 0s - loss: 0.6380
Epoch 00005: early stopping
19835/19835 [==============================] - 0s - loss: 0.6375 - val_loss: 0.6716
<keras.callbacks.History at 0x1d7245f8>
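Once training stops, ModelCheckpoint has written the weights of the best epoch (lowest val_loss) to best_model.h5, while the in-memory model holds the last-epoch weights. A minimal sketch of checking the held-out loss (how you reload the checkpoint depends on your Keras version, e.g. model.load_weights or keras.models.load_model):
# Optionally restore the best checkpoint first (depending on Keras version):
# model.load_weights(fBestModel)

# Evaluate on the held-out split
test_loss = model.evaluate(X_test, Y_test, batch_size=128, verbose=0)
print('held-out loss:', test_loss)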
So, how hard can it be to build a Multi-Layer Perceptron with Keras? It is basically the same: just add more layers!
model = Sequential()
model.add(Dense(100, input_shape=(dims,)))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy')
model.fit(X, Y)
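Note that, as written above, the hidden Dense layer has no nonlinearity, so the stack still computes a linear map followed by softmax. A minimal sketch of a genuine MLP adds an activation after the hidden layer (the 'relu' choice and the width of 100 are illustrative):
model = Sequential()
model.add(Dense(100, input_shape=(dims,)))
model.add(Activation('relu'))  # nonlinearity between the two Dense layers
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy')
model.fit(X, Y)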
Your Turn!
Take a couple of minutes and try to optimize the number of layers and the number of parameters in the layers to get the best results.
model = Sequential()
model.add(Dense(100, input_shape=(dims,)))
# ...
# ...
# Play with it! Add as many layers as you want! Try to get better results.
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy')
model.fit(X, Y)
Building a question answering system, an image classification model, a Neural Turing Machine, a word2vec embedder or any other model is just as fast. The ideas behind deep learning are simple, so why should their implementation be painful?
Much has been studied about the depth of neural nets. It has been proven mathematically[1] and empirically that convolutional neural networks benefit from depth!
[1] - On the Expressive Power of Deep Learning: A Tensor Analysis - Cohen, et al 2015
One much-quoted theorem about neural networks states that:
The universal approximation theorem states[1] that a feed-forward network with a single hidden layer containing a finite number of neurons (i.e., a multilayer perceptron) can approximate continuous functions on compact subsets of $\mathbb{R}^n$, under mild assumptions on the activation function. The theorem thus states that simple neural networks can represent a wide variety of interesting functions when given appropriate parameters; however, it does not touch upon the algorithmic learnability of those parameters.
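Formally (sketching the standard statement, with $\sigma$ a non-constant, bounded, continuous activation function): for every continuous $f$ on a compact $K \subset \mathbb{R}^n$ and every $\varepsilon > 0$ there exist $N$, $v_i, b_i \in \mathbb{R}$ and $w_i \in \mathbb{R}^n$ such that

$$\sup_{x \in K} \left| f(x) - \sum_{i=1}^{N} v_i\, \sigma(w_i^{\top} x + b_i) \right| < \varepsilon$$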
[1] - Approximation Capabilities of Multilayer Feedforward Networks - Kurt Hornik 1991