Deep Learning Models -- A collection of various deep learning architectures, models, and tips for TensorFlow and PyTorch in Jupyter Notebooks.

Generating Validation Set Splits

Many datasets ship with only training and test splits, without a validation split. Tuning and evaluating a model repeatedly on a separate validation set is recommended so that we do not overfit to the test set.

We sometimes want to rotate the validation set, or to merge the training and validation sets at a later stage to obtain more training data. In those cases a fixed, separately stored validation set is inconvenient, and it is easier to split a validation portion off the training set if and when we need it.
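As a rough illustration of what "rotating" the validation set could look like, the sketch below computes index sets for a chosen fold. The helper name `rotating_split` and its parameters are hypothetical and not part of PyTorch; the dataset size of 60000 matches the MNIST training set.

```python
import torch


def rotating_split(num_examples, num_valid, fold):
    """Return (train_indices, valid_indices) for one fold.

    Hypothetical helper: fold 0 holds out examples [0, num_valid),
    fold 1 holds out [num_valid, 2 * num_valid), and so on.
    """
    start = fold * num_valid
    stop = start + num_valid
    if stop > num_examples:
        raise ValueError('fold exceeds the dataset size')
    all_indices = torch.arange(num_examples)
    valid_indices = all_indices[start:stop]
    # Everything outside the held-out block is used for training
    train_indices = torch.cat((all_indices[:start], all_indices[stop:]))
    return train_indices, valid_indices


train_idx, valid_idx = rotating_split(num_examples=60000, num_valid=1000, fold=2)
print(len(train_idx), len(valid_idx))             # 59000 1000
print(valid_idx[0].item(), valid_idx[-1].item())  # 2000 2999
```

Choosing a different `fold` moves the held-out block, and using all indices again later corresponds to merging the two subsets back together.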

Suppose we load the MNIST dataset as follows -- note that there is no validation set pre-specified for MNIST, and the same is true for CIFAR-10/100.

A Typical Dataset (here: MNIST)

In [1]:
import torch
from torchvision import datasets
from torchvision import transforms
from torch.utils.data import DataLoader

BATCH_SIZE = 64
In [2]:
##########################
### MNIST DATASET
##########################

# Note transforms.ToTensor() scales input images
# to 0-1 range
train_dataset = datasets.MNIST(root='data', 
                               train=True, 
                               transform=transforms.ToTensor(),
                               download=True)

test_dataset = datasets.MNIST(root='data', 
                              train=False, 
                              transform=transforms.ToTensor())


train_loader = DataLoader(dataset=train_dataset, 
                          batch_size=BATCH_SIZE,
                          num_workers=4,
                          shuffle=True)

test_loader = DataLoader(dataset=test_dataset, 
                         batch_size=BATCH_SIZE,
                         num_workers=4,
                         shuffle=False)

# Checking the dataset
for images, labels in train_loader:  
    print('Image batch dimensions:', images.shape)
    print('Image label dimensions:', labels.shape)
    break
Image batch dimensions: torch.Size([64, 1, 28, 28])
Image label dimensions: torch.Size([64])
In [3]:
print(f'Total number of training examples: {len(train_dataset)}')
Total number of training examples: 60000

Subset Method

Most of the time, a convenient way to split a training set into a training subset and a validation subset is the Subset class. However, note that both subsets then share the transform of the underlying dataset, so we have to use the same transform for the training and validation subsets (which may not be desired in all cases; for instance, if we want to perform random cropping or rotation for training set augmentation).

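Concretely, we will reserve the first 1000 training examples for validation and use the remaining 59000 examples for the new training set. Note that Subset itself does not shuffle anything; it only restricts the dataset to the given indices. The shuffling of the training subset prior to each epoch comes from setting shuffle=True in the DataLoader.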
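To make clear where the shuffling comes from, here is a small sketch on a toy dataset (a 10-example `TensorDataset` standing in for MNIST, which is an assumption made purely for illustration). Indexing a `Subset` follows the order of the index list, and the random order only appears once a `DataLoader` with `shuffle=True` is placed on top.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Toy stand-in for MNIST: 10 examples whose label equals their index
features = torch.arange(10, dtype=torch.float32).unsqueeze(1)
labels = torch.arange(10)
toy_dataset = TensorDataset(features, labels)

toy_valid = Subset(toy_dataset, list(range(0, 3)))
toy_train = Subset(toy_dataset, list(range(3, 10)))

# A Subset returns examples in the order of its index list
ordered_labels = [toy_train[i][1].item() for i in range(len(toy_train))]
print(ordered_labels)  # [3, 4, 5, 6, 7, 8, 9]

# The DataLoader adds the shuffling; the set of examples stays the same
toy_loader = DataLoader(toy_train, batch_size=7, shuffle=True)
batch_labels = next(iter(toy_loader))[1]
print(sorted(batch_labels.tolist()))  # [3, 4, 5, 6, 7, 8, 9]
```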

In [4]:
from torch.utils.data import Subset
In [5]:
valid_indices = torch.arange(0, 1000)
train_indices = torch.arange(1000, 60000)


train_and_valid = datasets.MNIST(root='data', 
                                 train=True, 
                                 transform=transforms.ToTensor(),
                                 download=True)

train_dataset = Subset(train_and_valid, train_indices)
valid_dataset = Subset(train_and_valid, valid_indices)
In [6]:
train_loader = DataLoader(dataset=train_dataset, 
                          batch_size=BATCH_SIZE,
                          num_workers=4,
                          shuffle=True)

valid_loader = DataLoader(dataset=valid_dataset, 
                          batch_size=BATCH_SIZE,
                          num_workers=4,
                          shuffle=False)
In [7]:
# Checking the dataset
for images, labels in train_loader:  
    print('Image batch dimensions:', images.shape)
    print('Image label dimensions:', labels.shape)
    break
Image batch dimensions: torch.Size([64, 1, 28, 28])
Image label dimensions: torch.Size([64])
In [8]:
# Check that shuffling works properly
# i.e., label indices should be in random order.
# Also, the label order should be different in the second
# epoch.

for images, labels in train_loader:  
    pass
print(labels[:10])

for images, labels in train_loader:  
    pass
print(labels[:10])
tensor([1, 7, 2, 4, 7, 7, 8, 4, 0, 5])
tensor([5, 5, 6, 4, 2, 3, 8, 0, 7, 5])
In [9]:
# Check that shuffling works properly.
# i.e., label indices should be in random order.
# Via the fixed random seed, both epochs should return
# the same label sequence.

torch.manual_seed(123)
for images, labels in train_loader:  
    pass
print(labels[:10])

torch.manual_seed(123)
for images, labels in train_loader:  
    pass
print(labels[:10])
tensor([1, 0, 3, 7, 0, 7, 5, 6, 8, 3])
tensor([1, 0, 3, 7, 0, 7, 5, 6, 8, 3])
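Resetting the global seed before each epoch, as above, also affects every other random operation in the program. A possible alternative is to give the loader its own random number generator. The sketch below assumes a PyTorch version in which `DataLoader` accepts a `generator` argument, and it uses a toy `TensorDataset` instead of MNIST; the names `shuffle_rng` and `epoch_order` are made up for this example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

toy_dataset = TensorDataset(torch.arange(8))

# Dedicated generator that only controls the shuffling of this loader
# (assumption: the installed PyTorch supports DataLoader(generator=...))
shuffle_rng = torch.Generator()
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True,
                        generator=shuffle_rng)


def epoch_order(loader):
    # Collect the example values in the order the loader yields them
    return [value.item() for (batch,) in loader for value in batch]


shuffle_rng.manual_seed(123)
first_epoch = epoch_order(toy_loader)

shuffle_rng.manual_seed(123)
second_epoch = epoch_order(toy_loader)

print(first_epoch == second_epoch)  # True
```

Without the second `manual_seed` call, the generator state would simply advance and the two epochs would generally differ.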

SubsetRandomSampler Method

Compared to the Subset method, the SubsetRandomSampler is a more convenient solution if we want to assign different transforms to the training and validation subsets. Similar to the Subset example, we will use the first 1000 examples for the validation set and the remaining 59000 examples for training. Because a SubsetRandomSampler draws its indices in random order, the validation examples are shuffled as well.
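The following sketch shows on a toy dataset what a `SubsetRandomSampler` does: in each epoch it yields every index from its list exactly once, in random order, and never an index outside the list. The 10-example `TensorDataset` is an assumption for illustration only.

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

toy_dataset = TensorDataset(torch.arange(10))

toy_valid_sampler = SubsetRandomSampler(list(range(0, 3)))
toy_train_sampler = SubsetRandomSampler(list(range(3, 10)))

# Iterating over the sampler yields a random permutation of its indices
sampled_indices = sorted(int(i) for i in toy_train_sampler)
print(sampled_indices)  # [3, 4, 5, 6, 7, 8, 9]

# The loader sees the full dataset but only draws the sampler's indices
toy_loader = DataLoader(toy_dataset, batch_size=4, sampler=toy_train_sampler)
seen_values = sorted(value.item() for (batch,) in toy_loader for value in batch)
print(seen_values)  # [3, 4, 5, 6, 7, 8, 9]
```

Since the sampler only selects indices, the same index lists can be combined with differently configured copies of the dataset, which is what makes separate transforms possible.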

In [10]:
from torch.utils.data import SubsetRandomSampler
In [11]:
train_indices = torch.arange(1000, 60000)
valid_indices = torch.arange(0, 1000)


train_sampler = SubsetRandomSampler(train_indices)
valid_sampler = SubsetRandomSampler(valid_indices)


training_transform = transforms.Compose([transforms.Resize((32, 32)),
                                         transforms.RandomCrop((28, 28)),
                                         transforms.ToTensor()])

valid_transform = transforms.Compose([transforms.Resize((32, 32)),
                                      transforms.CenterCrop((28, 28)),
                                      transforms.ToTensor()])



train_dataset = datasets.MNIST(root='data', 
                               train=True, 
                               transform=training_transform,
                               download=True)

# note that this is the same dataset as "train_dataset" above
# however, we can now choose a different transform method
valid_dataset = datasets.MNIST(root='data', 
                               train=True, 
                               transform=valid_transform,
                               download=False)

test_dataset = datasets.MNIST(root='data', 
                              train=False, 
                              transform=valid_transform,
                              download=False)

train_loader = DataLoader(train_dataset,
                          batch_size=BATCH_SIZE,
                          num_workers=4,
                          sampler=train_sampler)

valid_loader = DataLoader(valid_dataset,
                          batch_size=BATCH_SIZE,
                          num_workers=4,
                          sampler=valid_sampler)

test_loader = DataLoader(dataset=test_dataset, 
                         batch_size=BATCH_SIZE,
                         num_workers=4,
                         shuffle=False)
In [12]:
# Checking the dataset
for images, labels in train_loader:  
    print('Image batch dimensions:', images.shape)
    print('Image label dimensions:', labels.shape)
    break
Image batch dimensions: torch.Size([64, 1, 28, 28])
Image label dimensions: torch.Size([64])
In [13]:
# Check that shuffling works properly
# i.e., label indices should be in random order.
# Also, the label order should be different in the second
# epoch.

for images, labels in train_loader:  
    pass
print(labels[:10])

for images, labels in train_loader:  
    pass
print(labels[:10])
tensor([5, 7, 4, 9, 1, 7, 4, 1, 6, 7])
tensor([8, 2, 0, 7, 1, 3, 2, 6, 0, 4])
In [14]:
# Check that shuffling works properly.
# i.e., label indices should be in random order.
# Via the fixed random seed, both epochs should return
# the same label sequence.

torch.manual_seed(123)
for images, labels in train_loader:  
    pass
print(labels[:10])

torch.manual_seed(123)
for images, labels in train_loader:  
    pass
print(labels[:10])
tensor([1, 0, 3, 7, 0, 7, 5, 6, 8, 3])
tensor([1, 0, 3, 7, 0, 7, 5, 6, 8, 3])
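Because the training and validation loaders in this approach both wrap the same underlying MNIST training data, the split relies entirely on the two index sets not overlapping. A quick sanity check along the following lines may be worthwhile; `check_split` is a hypothetical helper written for this example, not a PyTorch function.

```python
import torch


def check_split(train_indices, valid_indices, num_examples):
    """Return (number of overlapping indices, number of unused indices)."""
    train_set = set(train_indices.tolist())
    valid_set = set(valid_indices.tolist())
    overlap = train_set & valid_set
    unused = set(range(num_examples)) - train_set - valid_set
    return len(overlap), len(unused)


# Same index ranges as in the examples above
num_overlap, num_unused = check_split(train_indices=torch.arange(1000, 60000),
                                      valid_indices=torch.arange(0, 1000),
                                      num_examples=60000)
print(num_overlap, num_unused)  # 0 0
```

A nonzero overlap would mean that validation examples leak into training, and a nonzero unused count would mean that some training data is never seen.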