Abstract: This lecture looks at the changes in hardware that enabled neural networks to become efficient, and at how neural network models are deployed on hardware.
if torch.cuda.is_available():
    our_custom_net.cuda()
# OR
device = torch.device('cuda:0')
our_custom_net.to(device)
# Remember to do the same for all inputs to the network
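For example (an illustrative sketch; the dataloader and variable names are ours, not from the lecture's code):

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
our_custom_net.to(device)
for images, labels in trainloader:                   # hypothetical dataloader
    images, labels = images.to(device), labels.to(device)
    outputs = our_custom_net(images)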
Accessing RAM is 3 to 4 orders of magnitude slower than executing a MAC (multiply-accumulate) operation.
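One way to see the gap indirectly (an illustrative sketch; absolute numbers depend entirely on the machine) is to compare the achieved FLOP/s of a compute-bound matmul, which reuses operands from cache, against a memory-bound elementwise add, which must stream every operand from RAM:

import time
import torch

n = 4096
a, b = torch.randn(n, n), torch.randn(n, n)

start = time.time()
c = a @ b                  # ~2*n^3 FLOPs, compute-bound: operands are reused from cache
matmul_flops = 2 * n**3 / (time.time() - start)

start = time.time()
d = a + b                  # n^2 FLOPs, memory-bound: every operand streams from RAM
add_flops = n**2 / (time.time() - start)

print(f'matmul: {matmul_flops / 1e9:.1f} GFLOP/s, add: {add_flops / 1e9:.1f} GFLOP/s')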
Processor comparison based on memory and bandwidth¶
The CPU has a faster I/O bus than the GPU, but lower bandwidth: the CPU can fetch small pieces of data very fast, while the GPU fetches them more slowly but in bulk.
The GPU has more lower-level memory than the CPU. Even though each individual thread and thread block has less memory than the CPU threads and cores do, there are so many more threads in the GPU that, taken as a whole, they have much more lower-level memory. This is memory inversion.
The case for parallelism - Moore's law is slowing down¶
Moore's law fuelled the prosperity of the past 50 years.
$$\theta^{s+1}_{l,i} = \theta^{s}_{l,i} - \frac{r}{B} \sum_{b=1}^{B} g^{s}_{l,i,b}$$

where $\theta^{s}_{l,i}$ is the value of the $i$-th parameter at layer $l$ at step $s$ of the training process; $r$ is the learning rate; $B$ is the batch size; and $g^{s}_{l,i,b}$ is the gradient at training step $s$ coming from the $b$-th training example for the update of the $i$-th parameter at layer $l$.
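Written out in plain PyTorch, the update looks as follows (a minimal sketch; the toy layer and shapes are ours, and a mean-reduced loss makes .grad hold the batch-averaged gradient):

import torch

r, B = 0.1, 4                                    # learning rate and batch size
theta = torch.randn(3, 3, requires_grad=True)    # parameters of one toy layer
x, y = torch.randn(B, 3), torch.randn(B, 3)      # a batch of B examples

loss = ((x @ theta - y) ** 2).sum(dim=1).mean()  # mean over the batch
loss.backward()                                  # theta.grad = (1/B) * sum_b g_b

with torch.no_grad():
    theta -= r * theta.grad                      # theta^{s+1} = theta^s - (r/B) sum_b g_b
    theta.grad.zero_()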
DL parallelism: parallelize backprop through an example¶
The matrix multiplications in the forward and backward passes can be parallelized (a minimal sketch follows this list):
Fast inference is unthinkable without parallel matrix
multiplication.
Frequent synchronization is needed - at each layer the parallel
threads need to sync up.
Overpowered CPU threads are scrambling to juggle the many nodes /
channels they need to compute.
The CPU is slowed down considerably by the fact that it needs to
access its own L3 cache many more times than the GPU would, due to
its lower memory access bandwidth.
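As a toy illustration of the splitting described above (our own sketch, not the lecture's code): each "thread group" computes a column slice of one layer's matmul, and concatenating the slices reproduces the full result.

import torch

x = torch.randn(64, 128)                         # a batch of activations
W = torch.randn(128, 256)                        # one layer's weights

full = x @ W                                     # single-threaded reference
blocks = [x @ Wb for Wb in W.chunk(4, dim=1)]    # 4 groups, each computes a column slice
assert torch.allclose(full, torch.cat(blocks, dim=1), atol=1e-5)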
print("CPU training code")print("CPU training of the above-defined model short example of how long it takes")our_custom_net.cpu()start=time()train(lenet,MNIST_trainloader)print(f'CPU took {time()-start:.2f} seconds')
CPU training code
CPU training of the above-defined model - short example of how long it takes
Epoch 1, iter 469, loss 1.980: : 469it [00:02, 181.77it/s]
Epoch 2, iter 469, loss 0.932: : 469it [00:02, 182.58it/s]
CPU took 5.22 seconds
The model and one batch just fit in the memory of the GPU we chose.
In the best-case scenario, the GPU runs through the training examples in a batch in parallel.
For most GPUs, however, the computation is sequential when their memory is not big enough to hold the entire batch of training examples.
Parallelize each layer computation between its cores
Groups of several cores are assigned to separate network layers
/ channels. Cores in the group need not be physically close to
each other.
Parallelize matrix multiplication
The matrix multiplication needed to compute a given node / channel is split between the threads in the group that was assigned to it. Each thread computes a separate sector of the input.
GPU cores are engaged at all times as they push the training examples through the network together, stepping sequentially from layer to layer.
All threads need to sync up at the end of each layer's computation so that their outputs can become the inputs to the next layer; a toy illustration with a thread barrier follows.
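A minimal sketch of this per-layer synchronization (our own illustration using Python threads and NumPy; a real GPU does this in hardware across thousands of threads):

import threading
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))                 # a small batch of inputs
W1 = rng.standard_normal((16, 32))               # layer-1 weights
W2 = rng.standard_normal((32, 10))               # layer-2 weights

h = np.empty((8, 32))                            # layer-1 output, filled in slices
barrier = threading.Barrier(2)                   # sync point between the two "thread groups"

def worker(cols):
    h[:, cols] = x @ W1[:, cols]                 # compute this group's slice of layer 1
    barrier.wait()                               # wait until layer 1 is complete everywhere

threads = [threading.Thread(target=worker, args=(s,))
           for s in (slice(0, 16), slice(16, 32))]
for t in threads:
    t.start()
for t in threads:
    t.join()
out = h @ W2                                     # layer 2 consumes the synchronized output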
print("GPU training")print("GPU training of the same example as in CPU")lenet.cuda()batch_size=512gpu_trainloader=make_MNIST_loader(batch_size=batch_size)start=time()gpu_train(lenet,gpu_trainloader)print(f'GPU took {time()-start:.2f} seconds')
GPU training
GPU training of the same example as in CPU
Epoch 1, iter 118, iter loss 0.786: : 118it [00:02, 52.62it/s]
Epoch 2, iter 118, iter loss 0.760: : 118it [00:02, 57.48it/s]
GPU took 4.37 seconds
print("multi-GPU training")print("GPU training of the same example as in single GPU but with two GPUs")our_custom_net_dp=lenetour_custom_net_dp.cuda()our_custom_net_dp=nn.DataParallel(our_custom_net_dp,device_ids=[0,1])batch_size=1024multigpu_trainloader=make_MNIST_loader(batch_size=batch_size)start=time()gpu_train(our_custom_net_dp,multigpu_trainloader)print(f'2 GPUs took {time()-start:.2f} seconds')
multi-GPU training
GPU training of the same example as in single GPU but with two GPUs
Epoch 1, iter 59, iter loss 0.745: : 59it [00:02, 21.24it/s]
Epoch 2, iter 59, iter loss 0.736: : 59it [00:01, 31.70it/s]
2 GPUs took 4.72 seconds
DL training and inference do not take place solely on the
accelerator.
The accelerator accelerates the gradient computations and
updates.
The CPU still needs to load the data (model, training set) and save the model (checkpointing).
The accelerator starves when it waits idly for its inputs, for example because of a slow CPU, I/O buses, or storage interface (SATA, SSD, NVMe).
print("starving GPUs")print("show in-code what starving GPU looks like")# Deliberately slow down data flow into the gpu # Do you have any suggestions how to do this in a more realistic way than just to force waiting?print('Using only 1 worker for the dataloader, the time the GPU takes increases.')lenet.cuda()batch_size=64gpu_trainloader=make_MNIST_loader(batch_size=batch_size,num_workers=1)start=time()gpu_train(lenet,gpu_trainloader)print(f'GPU took {time()-start:.2f} seconds')
starving GPUs
show in-code what starving GPU looks like
Using only 1 worker for the dataloader, the time the GPU takes increases.
Epoch 1, iter 938, iter loss 0.699: : 938it [00:04, 214.02it/s]
Epoch 2, iter 938, iter loss 0.619: : 938it [00:04, 208.96it/s]
GPU took 8.92 seconds
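A common mitigation (an illustrative sketch using the standard torch.utils.data.DataLoader arguments; the dataset variable is assumed): give the loader more worker processes and pinned host memory, and overlap the host-to-GPU copies with compute.

import torch

# 'dataset' is assumed to exist (e.g. the MNIST training set used above)
loader = torch.utils.data.DataLoader(dataset, batch_size=64,
                                     num_workers=4, pin_memory=True)
for x, y in loader:
    x = x.cuda(non_blocking=True)   # asynchronous copy out of pinned host memory
    y = y.cuda(non_blocking=True)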
print("profiling demo")print("in-house DL training resource profiling code & output - based on the above model and training loop")#for both of the below produce one figure for inference and one for training#MACs profiling - first slide; show as piechardlenet.cpu()profile_ops(lenet,shape=(1,1,28,28))
profiling demo
in-house DL training resource profiling code & output - based on the above model and training loop
Operation OPS
------------------------------------- -------
LeNet/Conv2d[conv1]/onnx::Conv 89856
LeNet/ReLU[relu1]/onnx::Relu 6912
LeNet/MaxPool2d[pool1]/onnx::MaxPool 2592
LeNet/Conv2d[conv2]/onnx::Conv 154624
LeNet/ReLU[relu2]/onnx::Relu 2048
LeNet/MaxPool2d[pool2]/onnx::MaxPool 768
LeNet/Linear[fc1]/onnx::Gemm 30720
LeNet/ReLU[relu3]/onnx::Relu 240
LeNet/Linear[fc2]/onnx::Gemm 7200
LeNet/ReLU[relu4]/onnx::Relu 120
LeNet/Linear[fc3]/onnx::Gemm 600
LeNet/ReLU[relu5]/onnx::Relu 20
------------------------------------- -------
Input size: (1, 1, 28, 28)
295,700 FLOPs or approx. 0.00 GFLOPs
Working set - the collection of all elements needed for executing a given DL layer:
Input and output activations
Parameters (weights & biases)
print("working set profiling")# compute the per-layer required memory:# memory to load weights, to load inputs, to save oputputs# visualize as a per-layer bar chart, each bar consists of three sections - the inputs, outputs, weightsprofile_layer_mem(lenet)
print("exceeding RAM+Swap demo")print("exceeding working set experiment - see the latency spike over a couple of bytes of working set")# sample* a training speed of a model whose layer working sets just first in the memory# bump up layer dimensions which are far from reaching the RAM limit - see that the effect on latency is limited# bump up the layer(s) that are at the RAM limit - observe the latency spike rapidly# add profiling graphs for each of the cases, print out latency numbers.# *train for an epoch or two, give the latency & give a reasonable estimate of how long would the full training take (assuming X epochs)estimate_training_for(LeNet,1000)
exceeding RAM+Swap demo
exceeding working set experiment - see the latency spike over a couple of bytes of working set
Using 128 hidden nodes took 2.42 seconds, training for 1000 epochs would take ~2423.7449169158936s
Using 256 hidden nodes took 2.31 seconds, training for 1000 epochs would take ~2311.570882797241s
Using 512 hidden nodes took 2.38 seconds, training for 1000 epochs would take ~2383.8846683502197s
Using 1024 hidden nodes took 2.56 seconds, training for 1000 epochs would take ~2559.4213008880615s
Using 2048 hidden nodes took 3.10 seconds, training for 1000 epochs would take ~3098.113536834717s
Using 4096 hidden nodes took 7.20 seconds, training for 1000 epochs would take ~7196.521997451782s
Using 6144 hidden nodes took 13.21 seconds, training for 1000 epochs would take ~13207.558155059814s
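estimate_training_for is another in-house helper; a plausible sketch of it (the constructor keyword, the hidden-size sweep, and the time-one-run-and-multiply extrapolation are all our assumptions):

from time import time

def estimate_training_for(model_cls, n_epochs,
                          hidden_sizes=(128, 256, 512, 1024, 2048, 4096, 6144)):
    for h in hidden_sizes:
        model = model_cls(hidden=h)              # assumes the model takes a hidden-width argument
        loader = make_MNIST_loader(batch_size=64)
        start = time()
        train(model, loader)                     # one short timed run
        elapsed = time() - start
        print(f'Using {h} hidden nodes took {elapsed:.2f} seconds, '
              f'training for {n_epochs} epochs would take ~{elapsed * n_epochs}s')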
print("OOM - massive images")print("show in-code how this can hapen - say massive images; maybe show error message")# How could we do this without affecting the recording process?print('Loading too many images at once causes errors.')lenet.cuda()batch_size=6000gpu_trainloader=make_MNIST_loader(batch_size=batch_size,num_workers=1)start=time()gpu_train(lenet,gpu_trainloader)print(f'GPU took {time()-start:.2f} seconds')
OOM - massive images
show in-code how this can happen - say massive images; maybe show error message
Loading too many images at once causes errors.
Epoch 1, iter 10, iter loss 0.596: : 10it [00:03, 2.78it/s]
Epoch 2, iter 2, iter loss 0.592: : 2it [00:01, 1.69it/s]
Alex Krizhevsky used two GTX 580 GPUs, each with 3GB of memory.
Theoretical AlexNet (without mid-way split) working set profiling:
print("profile AlexNet layers - show memory requirements")print("per-layer profiling of AlexNet - connects to the preceding slide")fromtorchvision.modelsimportalexnetasnetanet=net()profile_layer_alexnet(anet)
profile AlexNet layers - show memory requirements
per-layer profiling of AlexNet - connects to the preceding slide
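A quick sanity check on why the split was needed (our own arithmetic; torchvision's alexnet has roughly 61 M parameters, and gradients, optimizer state, and activations for a large batch multiply the footprint several times over):

from torchvision.models import alexnet

net = alexnet()
n_params = sum(p.numel() for p in net.parameters())
print(f'{n_params / 1e6:.1f} M parameters, {n_params * 4 / 1e9:.2f} GB in fp32')
# ~61.1 M parameters, ~0.24 GB in fp32 - already a noticeable slice of a 3 GB GTX 580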
The software and hardware lottery describes the success of a piece of software or hardware resulting not from its universal superiority, but rather from its fit to the broader hardware and software ecosystem.