#!/usr/bin/env python
# coding: utf-8

# Deep Learning Models -- A collection of various deep learning architectures, models, and tips for TensorFlow and PyTorch in Jupyter Notebooks.
# - Author: Sebastian Raschka
# - GitHub Repository: https://github.com/rasbt/deeplearning-models

# In[1]:


get_ipython().run_line_magic('load_ext', 'watermark')
get_ipython().run_line_magic('watermark', "-a 'Sebastian Raschka' -v -p torch")


# # Gradient Checkpointing Demo (Network-in-Network trained on CIFAR-10)
# 
# Why do we care about gradient checkpointing? It can lower the memory requirements of deep neural networks quite substantially, allowing us to work with larger architectures within the memory limitations of conventional GPUs. There is no free lunch here, though: as a trade-off for the lower memory requirements, additional computations are carried out, which can prolong training. However, when GPU memory is a limiting factor that we cannot circumvent even by lowering the batch size, gradient checkpointing is a great and easy option for making things work!
# 
# Below is a brief summary of how gradient checkpointing works. For more details, please see the excellent explanations in [1] and [2].
# 
# ## Vanilla Backpropagation
# 
# In vanilla backpropagation (the standard version of backpropagation), the required memory grows linearly with the number of layers *n* in the neural network. This is because all nodes from the forward pass are kept in memory until all their dependent child nodes have been processed.
# 
# ![](figures/gradient-checkpointing-1.png)
# 
# ## Low-memory Backpropagation
# 
# In the low-memory version of backpropagation, the forward pass is recomputed at each step, which makes it more memory-efficient than vanilla backpropagation by trading memory for additional computation. Whereas vanilla backpropagation processes *n* nodes (one per layer), the low-memory version processes on the order of $n^2$ nodes.
# 
# ![](figures/gradient-checkpointing-2.png)
# 
# ## Gradient Checkpointing
# 
# The gradient checkpointing method is a compromise between vanilla backpropagation and low-memory backpropagation: nodes are recomputed more often than in vanilla backpropagation but not as often as in the low-memory version. In gradient checkpointing, we designate certain nodes as checkpoints; these are kept in memory and serve as a basis for recomputing the other nodes. The optimal choice is to designate every $\sqrt{n}$-th node as a checkpoint node. Consequently, the memory requirement increases only by a factor of $\sqrt{n}$ compared to the low-memory version of backpropagation.
# 
# As stated in [3], with gradient checkpointing we can implement models that are 4x to 10x larger than architectures that would usually fit into GPU memory.
# 
# ![](figures/gradient-checkpointing-3.png)
# 
# ## Gradient Checkpointing in PyTorch
# 
# PyTorch allows us to use gradient checkpointing very conveniently. In this notebook, we only use checkpointing for sequential models. However, it is also possible (and not much more complicated) to use checkpointing for non-sequential models; I recommend checking out the tutorial in [3] for more details.
# 
# A great performance benchmark and write-up is available at [4], showing the difference in memory consumption between a baseline ResNet-18 and one enhanced with gradient checkpointing.
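# For a first impression of the API, here is a minimal sketch (an illustrative addition to this write-up, not part of the original notebook; `block1`, `block2`, and the tensor shapes are made up for the example). `torch.utils.checkpoint.checkpoint` wraps a callable so that its intermediate activations are recomputed during the backward pass instead of being stored:


import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block1 = nn.Sequential(nn.Linear(128, 128), nn.ReLU())
block2 = nn.Sequential(nn.Linear(128, 128), nn.ReLU())

# the input must require gradients so that the backward pass
# propagates through the recomputed (checkpointed) segment
x = torch.randn(32, 128, requires_grad=True)

h = checkpoint(block1, x)  # block1's activations are not stored
out = block2(h)            # block2 runs normally (activations stored)
out.sum().backward()       # block1's forward pass is recomputed here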
# References
# -----------
# 
# [1] Saving memory using gradient-checkpointing: https://github.com/cybertronai/gradient-checkpointing
# 
# [2] Fitting larger networks into memory: https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9
# 
# [3] Trading compute for memory in PyTorch models using Checkpointing: https://github.com/prigoyal/pytorch_memonger/blob/master/tutorial/Checkpointing_for_PyTorch_models.ipynb
# 
# [4] Deep Learning Memory Usage and Pytorch Optimization Tricks: https://www.sicara.ai/blog/2019-28-10-deep-learning-memory-usage-and-pytorch-optimization-tricks

# ### Network Architecture

# For this demo, I am using a simple Network-in-Network (NiN) architecture for the sake of code readability. Note that the deeper the architecture, the larger the potential gain from gradient checkpointing.
# 
# The CNN architecture is based on
# 
# - Lin, Min, Qiang Chen, and Shuicheng Yan. "Network In Network." arXiv preprint arXiv:1312.4400 (2013).

# # Part 1: Setup and Baseline (No Gradient Checkpointing)

# ## Imports

# In[2]:


import os
import time
import random

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.utils.data.dataset import Subset
from torchvision import datasets
from torchvision import transforms

import matplotlib.pyplot as plt
from PIL import Image

if torch.cuda.is_available():
    torch.backends.cudnn.deterministic = True


# ## Model Settings

# #### Setting a random seed

# I recommend using a function like the following one prior to creating the dataset loaders and initializing the model. It ensures that the data is shuffled in the same manner when you rerun this notebook and that the model receives the same initial random weights:

# In[3]:


def set_all_seeds(seed):
    os.environ["PL_GLOBAL_SEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


# #### Setting cuDNN and PyTorch algorithmic behavior to deterministic

# Similar to the `set_all_seeds` function above, I recommend setting the behavior of PyTorch and cuDNN to deterministic (this is particularly relevant when using GPUs). We can define a function for that as well:

# In[4]:


def set_deterministic():
    if torch.cuda.is_available():
        torch.backends.cudnn.benchmark = False
        torch.backends.cudnn.deterministic = True
    torch.set_deterministic(True)  # in newer PyTorch versions: torch.use_deterministic_algorithms(True)


# In[5]:


##########################
### SETTINGS
##########################

# Device
CUDA_DEVICE_NUM = 2  # change as appropriate
DEVICE = torch.device('cuda:%d' % CUDA_DEVICE_NUM if torch.cuda.is_available() else 'cpu')
print('Device:', DEVICE)

# Hyperparameters
RANDOM_SEED = 1
LEARNING_RATE = 0.0001
BATCH_SIZE = 256
NUM_EPOCHS = 40

# Architecture
NUM_CLASSES = 10

set_all_seeds(RANDOM_SEED)

# Deterministic behavior not yet supported by AdaptiveAvgPool2d
#set_deterministic()
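# As a quick sanity check (an illustrative addition to this write-up, not part of the original notebook), resetting the seed should make random draws reproducible:


set_all_seeds(RANDOM_SEED)
draw_1 = torch.randn(3)

set_all_seeds(RANDOM_SEED)
draw_2 = torch.randn(3)

assert torch.equal(draw_1, draw_2)  # identical draws under identical seeds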
# #### Import utility functions

# In[6]:


import sys

sys.path.insert(0, "..")  # to include ../helper_evaluate.py etc.

from helper_evaluate import compute_accuracy
from helper_data import get_dataloaders_cifar10
from helper_train import train_classifier_simple_v1


# ## Dataset

# In[7]:


### Set random seed ###
set_all_seeds(RANDOM_SEED)

##########################
### Dataset
##########################

train_loader, valid_loader, test_loader = get_dataloaders_cifar10(
    batch_size=BATCH_SIZE,
    num_workers=2,
    validation_fraction=0.1)


# In[8]:


# Checking the dataset
print('Training Set:\n')
for images, labels in train_loader:
    print('Image batch dimensions:', images.size())
    print('Image label dimensions:', labels.size())
    print(labels[:10])
    break

# Checking the dataset
print('\nValidation Set:')
for images, labels in valid_loader:
    print('Image batch dimensions:', images.size())
    print('Image label dimensions:', labels.size())
    print(labels[:10])
    break

# Checking the dataset
print('\nTesting Set:')
for images, labels in test_loader:
    print('Image batch dimensions:', images.size())
    print('Image label dimensions:', labels.size())
    print(labels[:10])
    break


# ## Model

# This is the basic NiN model, without gradient checkpointing, for reference.

# In[9]:


##########################
### MODEL
##########################

class NiN(nn.Module):

    def __init__(self, num_classes):
        super(NiN, self).__init__()
        self.num_classes = num_classes
        self.classifier = nn.Sequential(
            nn.Conv2d(3, 192, kernel_size=5, stride=1, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(192, 160, kernel_size=1, stride=1, padding=0),
            nn.ReLU(inplace=True),
            nn.Conv2d(160, 96, kernel_size=1, stride=1, padding=0),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            nn.Dropout(0.5),

            nn.Conv2d(96, 192, kernel_size=5, stride=1, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(192, 192, kernel_size=1, stride=1, padding=0),
            nn.ReLU(inplace=True),
            nn.Conv2d(192, 192, kernel_size=1, stride=1, padding=0),
            nn.ReLU(inplace=True),
            nn.AvgPool2d(kernel_size=3, stride=2, padding=1),
            nn.Dropout(0.5),

            nn.Conv2d(192, 192, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(192, 192, kernel_size=1, stride=1, padding=0),
            nn.ReLU(inplace=True),
            nn.Conv2d(192, 10, kernel_size=1, stride=1, padding=0),  # 10 output channels = number of CIFAR-10 classes
            nn.ReLU(inplace=True),
            nn.AvgPool2d(kernel_size=8, stride=1, padding=0),
        )

    def forward(self, x):
        x = self.classifier(x)
        logits = x.view(x.size(0), self.num_classes)
        #probas = torch.softmax(logits, dim=1)
        return logits


# In[10]:


set_all_seeds(RANDOM_SEED)

model = NiN(NUM_CLASSES)
model.to(DEVICE)

optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)


# ## Training

# In[11]:


import tracemalloc

tracemalloc.start()

# train for only 2 epochs, which is sufficient for this memory-profiling demo
log_dict = train_classifier_simple_v1(num_epochs=2, model=model,
                                      optimizer=optimizer, device=DEVICE,
                                      train_loader=train_loader,
                                      valid_loader=valid_loader,
                                      logging_interval=50)

current, peak = tracemalloc.get_traced_memory()  # sizes are reported in bytes
print(f"Current memory usage: {current}, peak: {peak}")
tracemalloc.stop()


# In[12]:


### Delete model and free memory
model.cpu()
del model


# # Part 2: Modified NiN with Gradient Checkpointing

# The changes we have to make to the NiN code are highlighted below. Note that this example uses only 1 segment in `checkpoint_sequential`. Generally, a lower number of segments improves memory efficiency but makes the computational performance worse, since more values need to be recomputed. For this architecture, however, I found that `segments=1` represents a good trade-off. The toy sketch below illustrates how different segment counts can be compared.
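# Before modifying the NiN model, here is a self-contained toy sketch (an illustrative addition, not part of the original notebook; the layer stack and sizes are made up, and a CUDA device is assumed) showing how one could compare the peak GPU memory for different segment counts:


from torch.utils.checkpoint import checkpoint_sequential

if torch.cuda.is_available():
    toy_layers = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
                                 for _ in range(16)]).to(DEVICE)
    toy_input = torch.randn(256, 1024, device=DEVICE, requires_grad=True)

    for num_segments in (2, 4, 8):
        toy_layers.zero_grad(set_to_none=True)
        toy_input.grad = None
        torch.cuda.reset_peak_memory_stats(DEVICE)
        out = checkpoint_sequential(toy_layers, num_segments, toy_input)
        out.sum().backward()  # triggers recomputation of the checkpointed segments
        peak_mb = torch.cuda.max_memory_allocated(DEVICE) / 1024**2
        print(f'segments={num_segments}: peak GPU memory {peak_mb:.1f} MB')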
# In[13]:


##########################
### MODEL
##########################

###### NEW ####################################################
from torch.utils.checkpoint import checkpoint_sequential
###############################################################


class NiN(nn.Module):

    def __init__(self, num_classes):
        super(NiN, self).__init__()
        self.num_classes = num_classes
        self.classifier = nn.Sequential(
            nn.Conv2d(3, 192, kernel_size=5, stride=1, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(192, 160, kernel_size=1, stride=1, padding=0),
            nn.ReLU(inplace=True),
            nn.Conv2d(160, 96, kernel_size=1, stride=1, padding=0),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            nn.Dropout(0.5),

            nn.Conv2d(96, 192, kernel_size=5, stride=1, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(192, 192, kernel_size=1, stride=1, padding=0),
            nn.ReLU(inplace=True),
            nn.Conv2d(192, 192, kernel_size=1, stride=1, padding=0),
            nn.ReLU(inplace=True),
            nn.AvgPool2d(kernel_size=3, stride=2, padding=1),
            nn.Dropout(0.5),

            nn.Conv2d(192, 192, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(192, 192, kernel_size=1, stride=1, padding=0),
            nn.ReLU(inplace=True),
            nn.Conv2d(192, 10, kernel_size=1, stride=1, padding=0),
            nn.ReLU(inplace=True),
            nn.AvgPool2d(kernel_size=8, stride=1, padding=0),
        )

        ###### NEW ####################################################
        # collect the layers of the Sequential container as a list,
        # which is the format checkpoint_sequential expects
        self.classifier_modules = [module for k, module in self.classifier._modules.items()]
        ###############################################################

    def forward(self, x):
        ###### NEW ####################################################
        # the input has to require gradients; otherwise, the backward
        # pass is not propagated through the checkpointed segment and
        # the parameters inside it would not receive gradients
        x.requires_grad = True
        x = checkpoint_sequential(functions=self.classifier_modules,
                                  segments=1,
                                  input=x)
        ###############################################################
        x = x.view(x.size(0), self.num_classes)
        #probas = torch.softmax(x, dim=1)
        return x


# In[14]:


set_all_seeds(RANDOM_SEED)

model = NiN(NUM_CLASSES)
model.to(DEVICE)

optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)


# In[15]:


tracemalloc.start()

log_dict = train_classifier_simple_v1(num_epochs=2, model=model,
                                      optimizer=optimizer, device=DEVICE,
                                      train_loader=train_loader,
                                      valid_loader=valid_loader,
                                      logging_interval=50)

current, peak = tracemalloc.get_traced_memory()  # sizes are reported in bytes
print(f"Current memory usage: {current}, peak: {peak}")
tracemalloc.stop()


# # Conclusion

# We can see that gradient checkpointing lowers the peak memory usage by approximately 22%, while the runtime becomes only about 14% worse.
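# One caveat worth noting (an addition to this write-up): `tracemalloc` traces allocations made through Python's memory allocator on the host and does not capture GPU memory. As a complementary, GPU-side measurement, one could query PyTorch's CUDA memory statistics around the training call, for example:


if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats(DEVICE)
    # ... place the training call from In[15] here ...
    peak_gpu_mb = torch.cuda.max_memory_allocated(DEVICE) / 1024**2
    print(f'Peak GPU memory: {peak_gpu_mb:.1f} MB')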