This guide will look at how we can do half precision training with Apex and Torchbearer.
Note: This example requires a GPU, so it's easiest to run as a Colab notebook, where you can enable a free GPU with
Runtime → Change runtime type → Hardware Accelerator: GPU
First, we install Apex by cloning the repository and pip installing it with the C++ and CUDA extensions enabled.
try:
    import apex
except Exception:
    !git clone https://github.com/NVIDIA/apex.git
    %cd apex
    !pip install --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
    %cd ..
We're going to test on MNIST, so we now create a small toy model. It includes a batch norm layer, since batch norm sometimes has problems with half precision training. We also create the data loaders for the MNIST dataset and define the loss.
from torch import nn, optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(784, 100)
        self.relu = nn.ReLU()
        self.bn = nn.BatchNorm1d(100)
        self.net2 = nn.Linear(100, 10)

    def forward(self, x):
        return self.net2(self.bn(self.relu(self.net1(x))))


trainset = datasets.MNIST('./data/mnist', train=True, download=True,
                          transform=transforms.Compose([
                              transforms.ToTensor(),
                              transforms.Normalize((0.1307,), (0.3081,))
                          ]))
train_loader = DataLoader(trainset, batch_size=128)

test_ds = datasets.MNIST('./data/mnist', train=False,
                         transform=transforms.Compose([
                             transforms.ToTensor(),
                             transforms.Normalize((0.1307,), (0.3081,))
                         ]))
test_loader = DataLoader(test_ds, batch_size=128)

loss_fn = nn.CrossEntropyLoss()
We now need to initialise the model and optimiser with Apex's automatic mixed precision (amp) module. Here we've selected the highest opt_level, O3, which is pure half precision (FP16) training. The Apex documentation has more detail on the optimisation levels and on the other initialisation arguments.
We have two things to note here:
First, we have to send the model to the GPU ourselves instead of letting Torchbearer do it automatically, because Apex expects a GPU model.
Second, if we were running this on a real problem, it would usually be a good idea to keep the batch norm layers in full precision (keep_batchnorm_fp32=True), since this reduces instability in training. Training with some parts of the model in full precision like this is known as mixed precision training.
from apex import amp

model = ToyModel()
optimizer = optim.SGD(model.parameters(), lr=0.001)
model, optimizer = amp.initialize(model.cuda(), optimizer,
                                  opt_level='O3',
                                  keep_batchnorm_fp32=False,
                                  )
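To see why reductions such as batch norm statistics are often kept in full precision, the sketch below simulates half precision accumulation using only Python's standard library. The `to_fp16` and `fp16_sum` helpers are hypothetical names introduced for this illustration; they round values to the nearest float16 via `struct` and are not part of Apex or PyTorch. Real GPU kernels may accumulate differently, so treat this as an illustration of the number format rather than of what Apex does internally.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]


def fp16_sum(values):
    """Sum values, rounding the running total to float16 after every addition."""
    total = 0.0
    for v in values:
        total = to_fp16(total + to_fp16(v))
    return total


# Stand-in for a batch of activations (illustrative data only)
activations = [1.0] * 4096

full_sum = sum(activations)       # accumulated in double precision
half_sum = fp16_sum(activations)  # stalls once adding 1 no longer changes the float16 total

print(full_sum / len(activations))  # mean from the full precision sum
print(half_sum / len(activations))  # mean from the half precision sum
```

The half precision total stops growing at 2048, because float16 cannot represent 2049, so the mean computed from it is half the true value.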
Now let's do some training. Compared to a normal trial, the only thing we need to change is to use the Apex closure, which performs Apex's loss scaling. If we didn't want loss scaling, we wouldn't even need to do this.
try:
    import torchbearer
except Exception:
    !pip install torchbearer
    # Alternatively
    # pip install git+https://github.com/pytorchbearer/torchbearer
    import torchbearer

# apex_closure is not imported anywhere in the original listing;
# it is assumed here to be importable from torchbearer.bases
from torchbearer.bases import apex_closure


@torchbearer.callbacks.on_sample
@torchbearer.callbacks.on_sample_validation
def flatten(state):
    # Flatten each image in the batch into a vector for the linear layers
    state[torchbearer.X] = state[torchbearer.X].view(state[torchbearer.X].shape[0], -1)


trial = torchbearer.Trial(model, optimizer, loss_fn, metrics=['loss', 'acc'], callbacks=[flatten])
trial.with_closure(apex_closure())
trial.with_train_generator(train_loader).with_val_generator(test_loader).cuda()
_ = trial.run(10, verbose=2)
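Loss scaling exists because small gradient values underflow to zero in half precision. The following sketch illustrates the idea using only the standard library. The `to_fp16` helper and the scale factor of 1024 are choices made for this illustration; they are not Apex's API or its defaults, and the real closure handles the scaling for us.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]


tiny_grad = 1e-8  # a gradient value below float16's smallest representable magnitude
scale = 1024.0    # illustrative loss scale (an assumption, not an Apex default)

# Without scaling, the gradient is lost when stored in half precision.
unscaled = to_fp16(tiny_grad)

# Scaling the loss scales every gradient by the same factor (chain rule),
# which lifts the value into float16's representable range.
scaled = to_fp16(tiny_grad * scale)

# Before the optimiser step, the gradient is divided by the scale again
# in higher precision, recovering a close approximation of the true value.
recovered = scaled / scale

print(unscaled)
print(recovered)
```

Without scaling, the stored gradient is exactly zero and the corresponding weight would never update; with scaling, the recovered value is within about one percent of the original.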
We can quickly run some full precision training by using optimisation level O0. On a problem this simple, with a tiny model and easy dataset, we don't expect much improvement over half precision training (if any), but it's nice to compare anyway.
from apex import amp

model = ToyModel()
optimizer = optim.SGD(model.parameters(), lr=0.001)
model, optimizer = amp.initialize(model.cuda(), optimizer,
                                  opt_level='O0',
                                  )

try:
    import torchbearer
except Exception:
    !pip install torchbearer
    import torchbearer


@torchbearer.callbacks.on_sample
@torchbearer.callbacks.on_sample_validation
def flatten(state):
    state[torchbearer.X] = state[torchbearer.X].view(state[torchbearer.X].shape[0], -1)


trial = torchbearer.Trial(model, optimizer, loss_fn, metrics=['loss', 'acc'], callbacks=[flatten])
trial.with_train_generator(train_loader).with_val_generator(test_loader).cuda()
_ = trial.run(10, verbose=2)
We expect Apex distributed training to work, but we have not yet been able to test it. Please let us know if you encounter any issues.