I have noticed that a model is small before training but becomes much bigger after training. This can be observed from the file size when the model is saved to disk.
You can look at the question here: https://stackoverflow.com/q/57058178/2593810
import tensorflow as tf
from tensorflow import keras as kr
import numpy as np
import os
tf.__version__
'2.0.0'
def build_model():
    model = kr.Sequential([
        kr.layers.Dense(1000, 'relu', input_shape=(500,)),
        kr.layers.Dense(1000, 'relu'),
        kr.layers.Dense(1, 'sigmoid')
    ])
    model.compile('adam', 'binary_crossentropy', ['acc'])
    return model
def print_model_size(filename):
    print(f"{filename} size: {os.path.getsize(filename) / 1024 / 1024:.3f} MiB")
model_a = build_model()
model_a.summary()
fn = 'model_a.h5'
model_a.save(fn)
print_model_size(fn)
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense (Dense) (None, 1000) 501000 _________________________________________________________________ dense_1 (Dense) (None, 1000) 1001000 _________________________________________________________________ dense_2 (Dense) (None, 1) 1001 ================================================================= Total params: 1,503,001 Trainable params: 1,503,001 Non-trainable params: 0 _________________________________________________________________ model_a.h5 size: 5.748 MiB
def create_y(x):
    return (x[:, [100, 200, 300, 400]].sum(1) > 2).astype('float32')
x_train = np.random.random((10000, 500))
x_test = np.random.random((2000, 500))
y_train = create_y(x_train)
y_test = create_y(x_test)
model_a.fit(x_train, y_train, validation_split=0.1, epochs=100, callbacks=[kr.callbacks.EarlyStopping(patience=5, restore_best_weights=True)])
Train on 9000 samples, validate on 1000 samples
Epoch 1/100
9000/9000 [==============================] - 5s 512us/sample - loss: 0.5887 - acc: 0.6672 - val_loss: 0.2988 - val_acc: 0.8920
Epoch 2/100
9000/9000 [==============================] - 3s 379us/sample - loss: 0.2613 - acc: 0.8850 - val_loss: 0.1835 - val_acc: 0.9270
Epoch 3/100
9000/9000 [==============================] - 3s 379us/sample - loss: 0.1957 - acc: 0.9153 - val_loss: 0.1744 - val_acc: 0.9220
Epoch 4/100
9000/9000 [==============================] - 4s 395us/sample - loss: 0.1417 - acc: 0.9413 - val_loss: 0.1274 - val_acc: 0.9470
Epoch 5/100
9000/9000 [==============================] - 3s 372us/sample - loss: 0.1284 - acc: 0.9464 - val_loss: 0.1563 - val_acc: 0.9230
Epoch 6/100
9000/9000 [==============================] - 3s 364us/sample - loss: 0.1577 - acc: 0.9313 - val_loss: 0.1393 - val_acc: 0.9390
Epoch 7/100
9000/9000 [==============================] - 3s 363us/sample - loss: 0.1401 - acc: 0.9403 - val_loss: 0.1368 - val_acc: 0.9360
Epoch 8/100
9000/9000 [==============================] - 3s 358us/sample - loss: 0.1217 - acc: 0.9476 - val_loss: 0.2715 - val_acc: 0.8800
Epoch 9/100
9000/9000 [==============================] - 3s 360us/sample - loss: 0.1355 - acc: 0.9427 - val_loss: 0.1444 - val_acc: 0.9390
<tensorflow.python.keras.callbacks.History at 0x29b1e866080>
fn = 'model_a_trained.h5'
model_a.save(fn)
print_model_size(fn)
model_a_trained.h5 size: 17.232 MiB
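The extra ~11.5 MiB is roughly what Adam's per-weight state accounts for. Here is a back-of-envelope check, assuming float32 weights (4 bytes per value):

n_params = model_a.count_params()           # 1,503,001
weights_mib = n_params * 4 / 1024 / 1024    # ~5.73 MiB of raw float32 weights
print(f"weights only:         {weights_mib:.3f} MiB")
# Adam keeps two slot tensors (m and v) per trainable weight,
# so the optimizer state roughly triples the payload.
print(f"weights + Adam slots: {weights_mib * 3:.3f} MiB")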
# copy weights of model A to model B
model_b = build_model()
model_b.set_weights(model_a.get_weights())
fn = 'model_b.h5'
model_b.save(fn)
print_model_size(fn)
model_b.h5 size: 5.748 MiB
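Note that get_weights()/set_weights() copies only the layers' kernels and biases, not the optimizer state, which is why model_b.h5 matches the untrained size:

# 6 arrays: a kernel and a bias for each of the three Dense layers
print(len(model_a.get_weights()))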
load_model = kr.models.load_model
model_a = load_model('model_a_trained.h5')
model_b = load_model('model_b.h5')
print(model_a.evaluate(x_train, y_train, verbose=0))
print(model_a.evaluate(x_test, y_test, verbose=0))
[0.0855224913239479, 0.974]
[0.12154238364100456, 0.9475]
print(model_b.evaluate(x_train, y_train, verbose=0))
print(model_b.evaluate(x_test, y_test, verbose=0))
[0.0855224913239479, 0.974]
[0.12154238364100456, 0.9475]
You will see that both model_a and model_b give the same accuracy, yet their disk space consumption is dramatically different.
This is because .fit() populates the optimizer with training state that is not needed for prediction, and model.save() writes it alongside the weights by default. In this case the extra data is the Adam optimizer's moment estimates (two additional tensors per trainable weight), which is why the file roughly triples in size. The overhead varies from optimizer to optimizer; with plain SGD it would be negligible, since SGD keeps no per-weight state.
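As a rough comparison (model_c and its filename are made up here for illustration), training the same architecture with plain SGD should keep the saved file close to the weights-only size:

model_c = build_model()
model_c.compile('sgd', 'binary_crossentropy', ['acc'])  # plain SGD: no per-weight slots
model_c.fit(x_train, y_train, epochs=1, verbose=0)
fn = 'model_c_sgd.h5'
model_c.save(fn)
print_model_size(fn)  # expected to stay around ~5.7 MiB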
So if you don't plan to train the model any further, you should save it with include_optimizer=False to reduce disk space consumption.
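For example (the filename below is just for illustration), this should bring the trained model's file back down to roughly the weights-only size:

fn = 'model_a_trained_no_opt.h5'
model_a.save(fn, include_optimizer=False)  # drop the Adam state
print_model_size(fn)  # expected: roughly 5.7 MiB again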