卷积神经网络 Convolutional Neural Networks

本课程中,我们将学习基本的卷积神经网络(CNN),并将其应用到自然语言处理(NLP)任务中。CNN 传统上多用于图像处理,相关教程也很多,但我们专注于把它用在文本数据上,同样能取得惊人的效果。

In this lesson we will learn the basics of Convolutional Neural Networks (CNNs) applied to text for natural language processing (NLP) tasks. CNNs are traditionally used on images and there are plenty of tutorials that cover this, but we're going to focus on using CNNs on text data, which yields amazing results.

总览 Overview

下图来自这篇论文,展示了如何对句子中的词进行一维卷积计算。

The diagram below from this paper shows how 1D convolution is applied to the words in a sentence.

  • 目的: 检测输入数据的空间结构。Detect spatial substructure from input data.
  • 优点:
    • 权重数量小(共享)Small number of weights (shared)
    • 可并行的 Parallelizable
    • 检测空间结构(特征提取器)Detects spatial substructures (feature extractors)
    • 可通过滤波器解释 Interpretable via filters
    • 可用于图像、文本、时间序列等 Used for images/text/time-series, etc.
  • 缺点:
    • 很多超参数(核大小,步幅等)Many hyperparameters (kernel size, strides, etc.)
    • 输入宽度必须相同(图像维度,文本长度等)Inputs have to be of same width (image dimensions, text length, etc.)
  • 其他:
    • 很多深度 CNN 架构在不断更新以追求 SOTA 性能 Lots of deep CNN architectures constantly updated for SOTA performance

滤波器 Filters

CNN 的核心是滤波器(也叫权重、核等),它在输入上滑动(卷积),提取相关特征。滤波器是随机初始化的,但会学习从输入中提取有助于优化目标的有意义特征。我们将以一种不寻常的方式介绍 CNN:完全聚焦于把它应用到二维文本数据上。每个输入由若干词组成,我们用独热编码(one-hot encoding)向量表示每个词,从而得到二维输入。这里的直觉是:每个滤波器代表一种特征,我们会把同一个滤波器应用到其他输入上以捕获相同的特征,这就是参数共享。

At the core of CNNs are filters (weights, kernels, etc.) which convolve (slide) across our input to extract relevant features. The filters are initialized randomly but learn to pick up meaningful features from the input that aid in optimizing for the objective. We're going to teach CNNs in an unorthodox way where we focus entirely on applying them to 2D text data. Each input is composed of words and we will represent each word as a one-hot encoded vector, which gives us our 2D input. The intuition here is that each filter represents a feature and we will use this filter on other inputs to capture the same feature. This is known as parameter sharing.
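
To make the 2D one-hot representation concrete before we jump into PyTorch, here is a minimal sketch (ours, not from the original notebook; the toy vocabulary and tokens are made up) of how a short token sequence becomes the matrix a 1D conv will slide over:

import torch

# Toy vocabulary and token sequence (hypothetical, for illustration only)
vocab = {"<UNK>": 0, "deep": 1, "learning": 2, "is": 3, "fun": 4}
tokens = ["deep", "learning", "is", "fun"]

# One row per token, one column per vocabulary entry
x = torch.zeros(len(tokens), len(vocab))
for position, token in enumerate(tokens):
    x[position, vocab.get(token, vocab["<UNK>"])] = 1.0

print(x.shape)  # torch.Size([4, 5]) -> (sequence length, vocab size)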

In [0]:
# Loading PyTorch library
!pip3 install torch
Requirement already satisfied: torch in /usr/local/lib/python3.6/dist-packages (1.0.0)
In [0]:
import torch
import torch.nn as nn

我们的输入是一批二维文本数据。设定输入有64个样本,其中每个样本有8个词,每个词用一个长度为10的数组表示(词汇量为10的独热编码)。这样输入的维度为 (64, 8, 10)。PyTorch 的 CNN 模块倾向于把输入的通道维度(本例中是独热编码向量的维度)放在第2个位置,所以我们的输入维度是 (64, 10, 8)。

Our inputs are a batch of 2D text data. Let's make an input with 64 samples, where each sample has 8 words and each word is represented by an array of 10 values (one-hot encoded with a vocab size of 10). This gives our inputs the size (64, 8, 10). The PyTorch CNN modules prefer the channel dim (the one-hot vector dim in our case) to be in the second position, so our inputs are of shape (64, 10, 8).

In [0]:
# Assume all our inputs have the same # of words
batch_size = 64
sequence_size = 8 # words per input
one_hot_size = 10 # vocab size (num_input_channels)
x = torch.randn(batch_size, one_hot_size, sequence_size)
print("Size: {}".format(x.shape))
Size: torch.Size([64, 10, 8])

我们想用滤波器对这个输入做卷积。为了简单,我们只使用5个大小为 (1, 2) 的滤波器,其深度与通道数(one_hot_size)相同。这样滤波器的维度是 (5, 2, 10)。但如上所述,PyTorch 的 CNN 模块倾向于把通道维度放在第2个位置,所以滤波器的维度是 (5, 10, 2)。

We want to convolve on this input using filters. For simplicity, we will use just 5 filters, each of size (1, 2), with the same depth as the number of channels (one_hot_size). This gives our filters a shape of (5, 2, 10), but recall that PyTorch CNN modules prefer the channel dim (the one-hot vector dim in our case) to be in the second position, so the filters are of shape (5, 10, 2).

In [0]:
# Create filters for a conv layer
out_channels = 5 # of filters
kernel_size = 2 # filter size of 2
conv1 = nn.Conv1d(in_channels=one_hot_size, out_channels=out_channels, kernel_size=kernel_size)
print("Size: {}".format(conv1.weight.shape))
print("Filter size: {}".format(conv1.kernel_size[0]))
print("Padding: {}".format(conv1.padding[0]))
print("Stride: {}".format(conv1.stride[0]))
Size: torch.Size([5, 10, 2])
Filter size: 2
Padding: 0
Stride: 1

我们在输入上应用这些滤波器,得到 (64, 5, 7) 的输出。其中 64 是批大小,5 是通道维度(因为使用了5个滤波器),7 是卷积输出的宽度,因为:

$\frac{W - F + 2P}{S} + 1 = \frac{8 - 2 + 2(0)}{1} + 1 = 7$

其中:

  • W:输入宽度
  • F:滤波器大小
  • P:填充
  • S:步幅

When we apply these filters on our inputs, we receive an output of shape (64, 5, 7). We get 64 for the batch size, 5 for the channel dim (because we used 5 filters) and 7 for the conv output width because:

$\frac{W - F + 2P}{S} + 1 = \frac{8 - 2 + 2(0)}{1} + 1 = 7$

where:

  • W: width of each input
  • F: filter size
  • P: padding
  • S: stride
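
To make the formula easy to reuse, here is a tiny helper (a sketch we add here, not part of the original notebook) that computes the output width; with W=8, F=2, P=0, S=1 it returns 7, matching the conv output below.

# Output width of a conv layer: floor((W - F + 2P) / S) + 1
def conv_output_width(W, F, P=0, S=1):
    return (W - F + 2 * P) // S + 1

print(conv_output_width(W=8, F=2, P=0, S=1))  # 7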

In [0]:
# Convolve using filters
conv_output = conv1(x)
print("Size: {}".format(conv_output.shape))
Size: torch.Size([64, 5, 7])

池化 Pooling

对输入进行卷积计算的结果是一个特征映射(feature map)。由于卷积运算存在重叠,特征映射中会有很多冗余信息。池化是一种把高维特征映射汇总为低维特征映射的方式,以简化下游计算。池化操作可以是在一定感受野内取最大值、平均值等。

The result of convolving filters on an input is a feature map. Due to the nature of convolution and overlaps, our feature map will have lots of redundant information. Pooling is a way to summarize a high-dimensional feature map into a lower dimensional one for simplified downstream computation. The pooling operation can be the max value, average, etc. in a certain receptive field.

In [0]:
# Max pooling
kernel_size = 2
pool1 = nn.MaxPool1d(kernel_size=kernel_size, stride=2, padding=0)
pool_output = pool1(conv_output)
print("Size: {}".format(pool_output.shape))
Size: torch.Size([64, 5, 3])

$\lfloor\frac{W-F}{S}\rfloor + 1 = \lfloor\frac{7-2}{2}\rfloor + 1 = \lfloor 2.5 \rfloor + 1 = 3$
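
As a side note (a sketch we add, not from the original notebook): if the pooling kernel spans the entire feature-map length, each channel collapses to a single value. This "global max pooling over time" is what the text model below does with F.max_pool1d.

# Global max pooling: the kernel covers the full length of the feature map
global_pool = nn.MaxPool1d(kernel_size=conv_output.shape[2])
print(global_pool(conv_output).squeeze(2).shape)  # torch.Size([64, 5])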

CNN 文本处理 CNNs on text

我们将在文本数据上使用卷积神经网络,通常是对文本的字符级表示进行卷积,以捕获有意义的 n-gram。

你可以很容易地把这套配置用于时间序列数据,或者与其他网络结合使用。对于文本数据,我们会创建不同核大小的滤波器:(1,2)、(1,3) 和 (1,4),它们充当不同 n-gram 大小的特征选择器。我们把各个输出拼接起来,输入到一个全连接层进行类别预测。本例中,我们对单词中的字母应用一维卷积;在 embeddings notebook 中,我们将对句子中的词应用一维卷积。

词嵌入(word embeddings):捕获相邻词元之间的时序相关性,使得相似的词有相似的表示。例如 "New Jersey" 与 "NJ"、"Garden State" 彼此接近。

字符嵌入(char embeddings):在字符级别为词创建表示。例如 "toy" 和 "toys" 彼此接近。

We're going to use convolutional neural networks on text data, which typically involves convolving on the character level representation of the text to capture meaningful n-grams.

You can easily use this set up for time series data or combine it with other networks. For text data, we will create filters of varying kernel sizes (1,2), (1,3), and (1,4) which act as feature selectors of varying n-gram sizes. The outputs are concatenated and fed into a fully-connected layer for class predictions. In our example, we will be applying 1D convolutions on the letters in a word. In the embeddings notebook, we will apply 1D convolutions on the words in a sentence.

Word embeddings: capture the temporal correlations among adjacent tokens so that similar words have similar representations. Ex. "New Jersey" is close to "NJ" is close to "Garden State", etc.

Char embeddings: create representations that map words at a character level. Ex. "toy" and "toys" will be close to each other.
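
Here is a minimal, self-contained sketch (ours, not from the original notebook; the toy dimensions are assumed) of the multi-kernel idea described above: filters of sizes 2, 3, and 4 act as bigram/trigram/4-gram detectors, each feature map is globally max-pooled, and the pooled vectors are concatenated before a fully-connected layer would take over.

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, num_filters, seq_len = 10, 4, 8   # toy dimensions (assumed)
x = torch.randn(64, vocab_size, seq_len)      # dummy batch in (N, C, L) layout

convs = nn.ModuleList([nn.Conv1d(vocab_size, num_filters, kernel_size=k)
                       for k in [2, 3, 4]])

z = [F.relu(conv(x)) for conv in convs]                    # n-gram feature maps
z = [F.max_pool1d(zz, zz.size(2)).squeeze(2) for zz in z]  # global max pool over time
z = torch.cat(z, dim=1)                                    # (64, 3 * num_filters)
print(z.shape)  # torch.Size([64, 12])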

配置 Set up

In [0]:
import os
from argparse import Namespace
import collections
import copy
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import torch
In [0]:
# Set Numpy and PyTorch seeds
def set_seeds(seed, cuda):
    np.random.seed(seed)
    torch.manual_seed(seed)
    if cuda:
        torch.cuda.manual_seed_all(seed)
        
# Creating directories
def create_dirs(dirpath):
    if not os.path.exists(dirpath):
        os.makedirs(dirpath)
In [0]:
# Arguments
args = Namespace(
    seed=1234,
    cuda=False,
    shuffle=True,
    data_file="names.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="names",
    train_size=0.7,
    val_size=0.15,
    test_size=0.15,
    num_epochs=20,
    early_stopping_criteria=5,
    learning_rate=1e-3,
    batch_size=64,
    num_filters=100,
    dropout_p=0.1,
)

# Set seeds
set_seeds(seed=args.seed, cuda=args.cuda)

# Create save dir
create_dirs(args.save_dir)

# Expand filepaths
args.vectorizer_file = os.path.join(args.save_dir, args.vectorizer_file)
args.model_state_file = os.path.join(args.save_dir, args.model_state_file)

# Check CUDA
if not torch.cuda.is_available():
    args.cuda = False
args.device = torch.device("cuda" if args.cuda else "cpu")
print("Using CUDA: {}".format(args.cuda))
Using CUDA: False

数据 Data

In [0]:
import re
import urllib
In [0]:
# Upload data from GitHub to notebook's local drive
url = "https://raw.githubusercontent.com/GokuMohandas/practicalAI/master/data/surnames.csv"
response = urllib.request.urlopen(url)
html = response.read()
with open(args.data_file, 'wb') as fp:
    fp.write(html)
In [0]:
# Raw data
df = pd.read_csv(args.data_file, header=0)
df.head()
Out[0]:
surname nationality
0 Woodford English
1 Coté French
2 Kore English
3 Koury Arabic
4 Lebzak Russian
In [0]:
# Split by nationality
by_nationality = collections.defaultdict(list)
for _, row in df.iterrows():
    by_nationality[row.nationality].append(row.to_dict())
for nationality in by_nationality:
    print ("{0}: {1}".format(nationality, len(by_nationality[nationality])))
English: 2972
French: 229
Arabic: 1603
Russian: 2373
Japanese: 775
Chinese: 220
Italian: 600
Czech: 414
Irish: 183
German: 576
Greek: 156
Spanish: 258
Polish: 120
Dutch: 236
Vietnamese: 58
Korean: 77
Portuguese: 55
Scottish: 75
In [0]:
# Create split data
final_list = []
for _, item_list in sorted(by_nationality.items()):
    if args.shuffle:
        np.random.shuffle(item_list)
    n = len(item_list)
    n_train = int(args.train_size*n)
    n_val = int(args.val_size*n)
    n_test = int(args.test_size*n)

  # Give data point a split attribute
    for item in item_list[:n_train]:
        item['split'] = 'train'
    for item in item_list[n_train:n_train+n_val]:
        item['split'] = 'val'
    for item in item_list[n_train+n_val:]:
        item['split'] = 'test'  

    # Add to final list
    final_list.extend(item_list)
In [0]:
# df with split datasets
split_df = pd.DataFrame(final_list)
split_df["split"].value_counts()
Out[0]:
train    7680
test     1660
val      1640
Name: split, dtype: int64
In [0]:
# Preprocessing
def preprocess_text(text):
    text = ' '.join(word.lower() for word in text.split(" "))
    text = re.sub(r"([.,!?])", r" \1 ", text)
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    return text
    
split_df.surname = split_df.surname.apply(preprocess_text)
split_df.head()
Out[0]:
nationality split surname
0 Arabic train bishara
1 Arabic train nahas
2 Arabic train ghanem
3 Arabic train tannous
4 Arabic train mikhail

词汇表 Vocabulary

In [0]:
class Vocabulary(object):
    def __init__(self, token_to_idx=None, add_unk=True, unk_token="<UNK>"):

        # Token to index
        if token_to_idx is None:
            token_to_idx = {}
        self.token_to_idx = token_to_idx

        # Index to token
        self.idx_to_token = {idx: token \
                             for token, idx in self.token_to_idx.items()}
        
        # Add unknown token
        self.add_unk = add_unk
        self.unk_token = unk_token
        if self.add_unk:
            self.unk_index = self.add_token(self.unk_token)

    def to_serializable(self):
        return {'token_to_idx': self.token_to_idx,
                'add_unk': self.add_unk, 'unk_token': self.unk_token}

    @classmethod
    def from_serializable(cls, contents):
        return cls(**contents)

    def add_token(self, token):
        if token in self.token_to_idx:
            index = self.token_to_idx[token]
        else:
            index = len(self.token_to_idx)
            self.token_to_idx[token] = index
            self.idx_to_token[index] = token
        return index

    def add_tokens(self, tokens):
        return [self.add_token(token) for token in tokens]

    def lookup_token(self, token):
        if self.add_unk:
            index = self.token_to_idx.get(token, self.unk_index)
        else:
            index =  self.token_to_idx[token]
        return index

    def lookup_index(self, index):
        if index not in self.idx_to_token:
            raise KeyError("the index (%d) is not in the Vocabulary" % index)
        return self.idx_to_token[index]

    def __str__(self):
        return "<Vocabulary(size=%d)>" % len(self)

    def __len__(self):
        return len(self.token_to_idx)
In [0]:
# Vocabulary instance
nationality_vocab = Vocabulary(add_unk=False)
for index, row in df.iterrows():
    nationality_vocab.add_token(row.nationality)
print (nationality_vocab) # __str__
index = nationality_vocab.lookup_token("English")
print (index)
print (nationality_vocab.lookup_index(index))
<Vocabulary(size=18)>
0
English

向量化 Vectorizer

In [0]:
class SurnameVectorizer(object):
    def __init__(self, surname_vocab, nationality_vocab):
        self.surname_vocab = surname_vocab
        self.nationality_vocab = nationality_vocab

    def vectorize(self, surname):
        one_hot_matrix_size = (len(surname), len(self.surname_vocab))
        one_hot_matrix = np.zeros(one_hot_matrix_size, dtype=np.float32)
                               
        for position_index, character in enumerate(surname):
            character_index = self.surname_vocab.lookup_token(character)
            one_hot_matrix[position_index][character_index] = 1
        
        return one_hot_matrix
    
    def unvectorize(self, one_hot_matrix):
        len_name = len(one_hot_matrix)
        indices = np.zeros(len_name)
        for i in range(len_name):
            indices[i] = np.where(one_hot_matrix[i]==1)[0][0]
        surname = [self.surname_vocab.lookup_index(index) for index in indices]
        return surname

    @classmethod
    def from_dataframe(cls, df):
        surname_vocab = Vocabulary(add_unk=True)
        nationality_vocab = Vocabulary(add_unk=False)

        # Create vocabularies
        for index, row in df.iterrows():
            for letter in row.surname: # char-level tokenization
                surname_vocab.add_token(letter)
            nationality_vocab.add_token(row.nationality)
        return cls(surname_vocab, nationality_vocab)

    @classmethod
    def from_serializable(cls, contents):
        surname_vocab = Vocabulary.from_serializable(contents['surname_vocab'])
        nationality_vocab =  Vocabulary.from_serializable(contents['nationality_vocab'])
        return cls(surname_vocab, nationality_vocab)

    def to_serializable(self):
        return {'surname_vocab': self.surname_vocab.to_serializable(),
                'nationality_vocab': self.nationality_vocab.to_serializable()}
In [0]:
# Vectorizer instance
vectorizer = SurnameVectorizer.from_dataframe(split_df)
print (vectorizer.surname_vocab)
print (vectorizer.nationality_vocab)
vectorized_surname = vectorizer.vectorize(preprocess_text("goku"))
print (np.shape(vectorized_surname))
print (vectorized_surname)
print (vectorizer.unvectorize(vectorized_surname))
<Vocabulary(size=28)>
<Vocabulary(size=18)>
(4, 28)
[[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0.]]
['g', 'o', 'k', 'u']

Note: Unlike the bagged one-hot encoding method in the MLP notebook, we are able to preserve the semantic structure of the surnames. We are able to use one-hot encoding here because we are using characters, but when we process text with large vocabularies, this method simply can't scale. We'll explore embedding based methods in subsequent notebooks.

数据集 Dataset

In [0]:
from torch.utils.data import Dataset, DataLoader
In [0]:
class SurnameDataset(Dataset):
    def __init__(self, df, vectorizer):
        self.df = df
        self.vectorizer = vectorizer

        # Data splits
        self.train_df = self.df[self.df.split=='train']
        self.train_size = len(self.train_df)
        self.val_df = self.df[self.df.split=='val']
        self.val_size = len(self.val_df)
        self.test_df = self.df[self.df.split=='test']
        self.test_size = len(self.test_df)
        self.lookup_dict = {'train': (self.train_df, self.train_size), 
                            'val': (self.val_df, self.val_size),
                            'test': (self.test_df, self.test_size)}
        self.set_split('train')

        # Class weights (for imbalances)
        class_counts = df.nationality.value_counts().to_dict()
        def sort_key(item):
            return self.vectorizer.nationality_vocab.lookup_token(item[0])
        sorted_counts = sorted(class_counts.items(), key=sort_key)
        frequencies = [count for _, count in sorted_counts]
        self.class_weights = 1.0 / torch.tensor(frequencies, dtype=torch.float32)

    @classmethod
    def load_dataset_and_make_vectorizer(cls, df):
        train_df = df[df.split=='train']
        return cls(df, SurnameVectorizer.from_dataframe(train_df))

    @classmethod
    def load_dataset_and_load_vectorizer(cls, df, vectorizer_filepath):
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(df, vectorizer)

    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        with open(vectorizer_filepath) as fp:
            return SurnameVectorizer.from_serializable(json.load(fp))

    def save_vectorizer(self, vectorizer_filepath):
        with open(vectorizer_filepath, "w") as fp:
            json.dump(self.vectorizer.to_serializable(), fp)

    def set_split(self, split="train"):
        self.target_split = split
        self.target_df, self.target_size = self.lookup_dict[split]

    def __str__(self):
        return "<Dataset(split={0}, size={1})".format(
            self.target_split, self.target_size)

    def __len__(self):
        return self.target_size

    def __getitem__(self, index):
        row = self.target_df.iloc[index]
        surname_vector = self.vectorizer.vectorize(row.surname)
        nationality_index = self.vectorizer.nationality_vocab.lookup_token(row.nationality)
        return {'surname': surname_vector, 'nationality': nationality_index}

    def get_num_batches(self, batch_size):
        return len(self) // batch_size

    def generate_batches(self, batch_size, collate_fn, shuffle=True, 
                         drop_last=True, device="cpu"):
        dataloader = DataLoader(dataset=self, batch_size=batch_size,
                                collate_fn=collate_fn, shuffle=shuffle, 
                                drop_last=drop_last)
        for data_dict in dataloader:
            out_data_dict = {}
            for name, tensor in data_dict.items():
                out_data_dict[name] = data_dict[name].to(device)
            yield out_data_dict
In [0]:
# Dataset instance
dataset = SurnameDataset.load_dataset_and_make_vectorizer(split_df)
print (dataset) # __str__
print (np.shape(dataset[5]['surname'])) # __getitem__
print (dataset.class_weights)
<Dataset(split=train, size=7680)
(6, 28)
tensor([0.0006, 0.0045, 0.0024, 0.0042, 0.0003, 0.0044, 0.0017, 0.0064, 0.0055,
        0.0017, 0.0013, 0.0130, 0.0083, 0.0182, 0.0004, 0.0133, 0.0039, 0.0172])

模型 Model

In [0]:
import torch.nn as nn
import torch.nn.functional as F
In [0]:
class SurnameModel(nn.Module):
    def __init__(self, num_input_channels, num_output_channels, num_classes, dropout_p):
        super(SurnameModel, self).__init__()
        
        # Conv weights
        self.conv = nn.ModuleList([nn.Conv1d(num_input_channels, num_output_channels, 
                                             kernel_size=f) for f in [2,3,4]])
        self.dropout = nn.Dropout(dropout_p)
       
        # FC weights
        self.fc1 = nn.Linear(num_output_channels*3, num_classes)

    def forward(self, x, channel_first=False, apply_softmax=False):
        
        # Rearrange input so num_input_channels is in dim 1 (N, C, L)
        if not channel_first:
            x = x.transpose(1, 2)
            
        # Conv outputs
        z = [conv(x) for conv in self.conv]
        z = [F.max_pool1d(zz, zz.size(2)).squeeze(2) for zz in z]
        z = [F.relu(zz) for zz in z]
        
        # Concat conv outputs
        z = torch.cat(z, 1)
        z = self.dropout(z)

        # FC layer
        y_pred = self.fc1(z)
        
        if apply_softmax:
            y_pred = F.softmax(y_pred, dim=1)
        return y_pred

训练 Training

填充(Padding):同一批次中的所有输入必须具有相同的形状。我们的向量器把词元转换成向量形式,但在同一个批次中,输入的长度可能各不相同。解决方法是找出该批次中最长的输入,把其他输入填充到相同长度。通常,批次中较短的输入用零向量填充。

我们通过 Trainer 类中的 pad_seq 函数实现这一点,它由 collate_fn 调用,而 collate_fn 会传给 Dataset 类中的 generate_batches 函数。本质上,批次生成器把样本汇集成一个批次,我们用 collate_fn 找出其中最长的输入,并把批次中的其他输入填充到相同长度,从而得到统一的输入形状。

Padding: the inputs in a particular batch must all have the same shape. Our vectorizer converts the tokens into vectorized form but in a particular batch, we can have inputs of various sizes. The solution is to determine the longest input in a particular batch and pad all the other inputs to match that length. Usually, the smaller inputs in the batch are padded with zero vectors.

We do this using the pad_seq function in the Trainer class, which is invoked by the collate_fn that is passed to the generate_batches function in the Dataset class. Essentially, the batch generator collects samples into a batch and we use the collate_fn to determine the largest input and pad all the other inputs in the batch to get a uniform input shape.
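
The sketch below (ours, not part of the original notebook; the toy vocab size and sequences are made up) shows the padding idea in isolation: every one-hot matrix in a batch gets zero rows appended until it matches the longest sequence.

import numpy as np

vocab_size = 5                                # toy vocab size (assumed)
batch = [np.eye(vocab_size)[[1, 2]],          # one-hot "surname" of length 2
         np.eye(vocab_size)[[3, 0, 4]]]       # one-hot "surname" of length 3
max_seq_len = max(len(seq) for seq in batch)  # 3

padded = []
for seq in batch:
    vector = np.zeros((max_seq_len, vocab_size), dtype=np.float32)
    vector[:len(seq)] = seq                   # copy the real rows; remaining rows stay zero
    padded.append(vector)

print(np.stack(padded).shape)  # (2, 3, 5) -> uniform shape across the batch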

In [0]:
import torch.optim as optim
In [0]:
class Trainer(object):
    def __init__(self, dataset, model, model_state_file, save_dir, device, shuffle, 
               num_epochs, batch_size, learning_rate, early_stopping_criteria):
        self.dataset = dataset
        self.class_weights = dataset.class_weights.to(device)
        self.model = model.to(device)
        self.save_dir = save_dir
        self.device = device
        self.shuffle = shuffle
        self.num_epochs = num_epochs
        self.batch_size = batch_size
        self.loss_func = nn.CrossEntropyLoss(self.class_weights)
        self.optimizer = optim.Adam(self.model.parameters(), lr=learning_rate)
        self.scheduler = optim.lr_scheduler.ReduceLROnPlateau(
            optimizer=self.optimizer, mode='min', factor=0.5, patience=1)
        self.train_state = {
            'done_training': False,
            'stop_early': False, 
            'early_stopping_step': 0,
            'early_stopping_best_val': 1e8,
            'early_stopping_criteria': early_stopping_criteria,
            'learning_rate': learning_rate,
            'epoch_index': 0,
            'train_loss': [],
            'train_acc': [],
            'val_loss': [],
            'val_acc': [],
            'test_loss': -1,
            'test_acc': -1,
            'model_filename': model_state_file}
    
    def update_train_state(self):

        # Verbose
        print ("[EPOCH]: {0} | [LR]: {1} | [TRAIN LOSS]: {2:.2f} | [TRAIN ACC]: {3:.1f}% | [VAL LOSS]: {4:.2f} | [VAL ACC]: {5:.1f}%".format(
          self.train_state['epoch_index'], self.train_state['learning_rate'], 
            self.train_state['train_loss'][-1], self.train_state['train_acc'][-1], 
            self.train_state['val_loss'][-1], self.train_state['val_acc'][-1]))

        # Save one model at least
        if self.train_state['epoch_index'] == 0:
            torch.save(self.model.state_dict(), self.train_state['model_filename'])
            self.train_state['stop_early'] = False

        # Save model if performance improved
        elif self.train_state['epoch_index'] >= 1:
            loss_tm1, loss_t = self.train_state['val_loss'][-2:]

            # If loss worsened
            if loss_t >= self.train_state['early_stopping_best_val']:
                # Update step
                self.train_state['early_stopping_step'] += 1

            # Loss decreased
            else:
                # Save the best model
                if loss_t < self.train_state['early_stopping_best_val']:
                    torch.save(self.model.state_dict(), self.train_state['model_filename'])

                # Reset early stopping step
                self.train_state['early_stopping_step'] = 0

            # Stop early ?
            self.train_state['stop_early'] = self.train_state['early_stopping_step'] \
              >= self.train_state['early_stopping_criteria']
        return self.train_state
  
    def compute_accuracy(self, y_pred, y_target):
        _, y_pred_indices = y_pred.max(dim=1)
        n_correct = torch.eq(y_pred_indices, y_target).sum().item()
        return n_correct / len(y_pred_indices) * 100
    
    def pad_seq(self, seq, length):
        vector = np.zeros((length, len(self.dataset.vectorizer.surname_vocab)),
                          dtype=np.int64)
        for i in range(len(seq)):
            vector[i] = seq[i]
        return vector
    
    def collate_fn(self, batch):
        
        # Make a deep copy
        batch_copy = copy.deepcopy(batch)
        processed_batch = {"surname": [], "nationality": []}
        
        # Get max sequence length
        max_seq_len = max([len(sample["surname"]) for sample in batch_copy])
        
        # Pad
        for i, sample in enumerate(batch_copy):
            seq = sample["surname"]
            nationality = sample["nationality"]
            padded_seq = self.pad_seq(seq, max_seq_len)
            processed_batch["surname"].append(padded_seq)
            processed_batch["nationality"].append(nationality)
            
        # Convert to appropriate tensor types
        processed_batch["surname"] = torch.FloatTensor(
            processed_batch["surname"]) # need float for conv operations
        processed_batch["nationality"] = torch.LongTensor(
            processed_batch["nationality"])
        
        return processed_batch    
  
    def run_train_loop(self):
        for epoch_index in range(self.num_epochs):
            self.train_state['epoch_index'] = epoch_index
      
            # Iterate over train dataset

            # initialize batch generator, set loss and acc to 0, set train mode on
            self.dataset.set_split('train')
            batch_generator = self.dataset.generate_batches(
                batch_size=self.batch_size, collate_fn=self.collate_fn,
                shuffle=self.shuffle, device=self.device)
            running_loss = 0.0
            running_acc = 0.0
            self.model.train()

            for batch_index, batch_dict in enumerate(batch_generator):
                # zero the gradients
                self.optimizer.zero_grad()

                # compute the output
                y_pred = self.model(batch_dict['surname'])

                # compute the loss
                loss = self.loss_func(y_pred, batch_dict['nationality'])
                loss_t = loss.item()
                running_loss += (loss_t - running_loss) / (batch_index + 1)

                # compute gradients using loss
                loss.backward()

                # use optimizer to take a gradient step
                self.optimizer.step()
                
                # compute the accuracy
                acc_t = self.compute_accuracy(y_pred, batch_dict['nationality'])
                running_acc += (acc_t - running_acc) / (batch_index + 1)

            self.train_state['train_loss'].append(running_loss)
            self.train_state['train_acc'].append(running_acc)

            # Iterate over val dataset

            # initialize batch generator, set loss and acc to 0; set eval mode on
            self.dataset.set_split('val')
            batch_generator = self.dataset.generate_batches(
                batch_size=self.batch_size, collate_fn=self.collate_fn, 
                shuffle=self.shuffle, device=self.device)
            running_loss = 0.
            running_acc = 0.
            self.model.eval()

            for batch_index, batch_dict in enumerate(batch_generator):

                # compute the output
                y_pred =  self.model(batch_dict['surname'])

                # compute the loss
                loss = self.loss_func(y_pred, batch_dict['nationality'])
                loss_t = loss.to("cpu").item()
                running_loss += (loss_t - running_loss) / (batch_index + 1)

                # compute the accuracy
                acc_t = self.compute_accuracy(y_pred, batch_dict['nationality'])
                running_acc += (acc_t - running_acc) / (batch_index + 1)

            self.train_state['val_loss'].append(running_loss)
            self.train_state['val_acc'].append(running_acc)

            self.train_state = self.update_train_state()
            self.scheduler.step(self.train_state['val_loss'][-1])
            if self.train_state['stop_early']:
                break
          
    def run_test_loop(self):
        # initialize batch generator, set loss and acc to 0; set eval mode on
        self.dataset.set_split('test')
        batch_generator = self.dataset.generate_batches(
            batch_size=self.batch_size, collate_fn=self.collate_fn,
            shuffle=self.shuffle, device=self.device)
        running_loss = 0.0
        running_acc = 0.0
        self.model.eval()

        for batch_index, batch_dict in enumerate(batch_generator):
            # compute the output
            y_pred =  self.model(batch_dict['surname'])

            # compute the loss
            loss = self.loss_func(y_pred, batch_dict['nationality'])
            loss_t = loss.item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)

            # compute the accuracy
            acc_t = self.compute_accuracy(y_pred, batch_dict['nationality'])
            running_acc += (acc_t - running_acc) / (batch_index + 1)

        self.train_state['test_loss'] = running_loss
        self.train_state['test_acc'] = running_acc
    
    def plot_performance(self):
        # Figure size
        plt.figure(figsize=(15,5))

        # Plot Loss
        plt.subplot(1, 2, 1)
        plt.title("Loss")
        plt.plot(trainer.train_state["train_loss"], label="train")
        plt.plot(trainer.train_state["val_loss"], label="val")
        plt.legend(loc='upper right')

        # Plot Accuracy
        plt.subplot(1, 2, 2)
        plt.title("Accuracy")
        plt.plot(trainer.train_state["train_acc"], label="train")
        plt.plot(trainer.train_state["val_acc"], label="val")
        plt.legend(loc='lower right')

        # Save figure
        plt.savefig(os.path.join(self.save_dir, "performance.png"))

        # Show plots
        plt.show()
    
    def save_train_state(self):
        self.train_state["done_training"] = True
        with open(os.path.join(self.save_dir, "train_state.json"), "w") as fp:
            json.dump(self.train_state, fp)
In [0]:
# Initialization
dataset = SurnameDataset.load_dataset_and_make_vectorizer(split_df)
dataset.save_vectorizer(args.vectorizer_file)
vectorizer = dataset.vectorizer
model = SurnameModel(num_input_channels=len(vectorizer.surname_vocab),
                     num_output_channels=args.num_filters,
                     num_classes=len(vectorizer.nationality_vocab),
                     dropout_p=args.dropout_p)
print (model.named_modules)
<bound method Module.named_modules of SurnameModel(
  (conv): ModuleList(
    (0): Conv1d(28, 100, kernel_size=(2,), stride=(1,))
    (1): Conv1d(28, 100, kernel_size=(3,), stride=(1,))
    (2): Conv1d(28, 100, kernel_size=(4,), stride=(1,))
  )
  (dropout): Dropout(p=0.1)
  (fc1): Linear(in_features=300, out_features=18, bias=True)
)>
In [0]:
# Train
trainer = Trainer(dataset=dataset, model=model, 
                  model_state_file=args.model_state_file, 
                  save_dir=args.save_dir, device=args.device,
                  shuffle=args.shuffle, num_epochs=args.num_epochs, 
                  batch_size=args.batch_size, learning_rate=args.learning_rate, 
                  early_stopping_criteria=args.early_stopping_criteria)
trainer.run_train_loop()
[EPOCH]: 00 | [LR]: 0.001 | [TRAIN LOSS]: 2.82 | [TRAIN ACC]: 20.2% | [VAL LOSS]: 2.71 | [VAL ACC]: 37.1%
[EPOCH]: 01 | [LR]: 0.001 | [TRAIN LOSS]: 2.54 | [TRAIN ACC]: 43.9% | [VAL LOSS]: 2.40 | [VAL ACC]: 49.8%
[EPOCH]: 02 | [LR]: 0.001 | [TRAIN LOSS]: 2.18 | [TRAIN ACC]: 48.7% | [VAL LOSS]: 2.12 | [VAL ACC]: 48.2%
[EPOCH]: 03 | [LR]: 0.001 | [TRAIN LOSS]: 1.88 | [TRAIN ACC]: 51.1% | [VAL LOSS]: 1.90 | [VAL ACC]: 50.6%
[EPOCH]: 04 | [LR]: 0.001 | [TRAIN LOSS]: 1.67 | [TRAIN ACC]: 54.8% | [VAL LOSS]: 1.74 | [VAL ACC]: 50.6%
[EPOCH]: 05 | [LR]: 0.001 | [TRAIN LOSS]: 1.52 | [TRAIN ACC]: 57.4% | [VAL LOSS]: 1.65 | [VAL ACC]: 55.8%
[EPOCH]: 06 | [LR]: 0.001 | [TRAIN LOSS]: 1.37 | [TRAIN ACC]: 60.0% | [VAL LOSS]: 1.60 | [VAL ACC]: 60.4%
[EPOCH]: 07 | [LR]: 0.001 | [TRAIN LOSS]: 1.28 | [TRAIN ACC]: 62.2% | [VAL LOSS]: 1.53 | [VAL ACC]: 59.8%
[EPOCH]: 08 | [LR]: 0.001 | [TRAIN LOSS]: 1.20 | [TRAIN ACC]: 63.0% | [VAL LOSS]: 1.49 | [VAL ACC]: 58.5%
[EPOCH]: 09 | [LR]: 0.001 | [TRAIN LOSS]: 1.12 | [TRAIN ACC]: 64.4% | [VAL LOSS]: 1.43 | [VAL ACC]: 61.9%
[EPOCH]: 10 | [LR]: 0.001 | [TRAIN LOSS]: 1.05 | [TRAIN ACC]: 65.8% | [VAL LOSS]: 1.40 | [VAL ACC]: 62.7%
[EPOCH]: 11 | [LR]: 0.001 | [TRAIN LOSS]: 1.00 | [TRAIN ACC]: 66.3% | [VAL LOSS]: 1.38 | [VAL ACC]: 63.4%
[EPOCH]: 12 | [LR]: 0.001 | [TRAIN LOSS]: 0.95 | [TRAIN ACC]: 67.2% | [VAL LOSS]: 1.34 | [VAL ACC]: 64.7%
[EPOCH]: 13 | [LR]: 0.001 | [TRAIN LOSS]: 0.90 | [TRAIN ACC]: 69.1% | [VAL LOSS]: 1.35 | [VAL ACC]: 63.9%
[EPOCH]: 14 | [LR]: 0.001 | [TRAIN LOSS]: 0.86 | [TRAIN ACC]: 69.0% | [VAL LOSS]: 1.31 | [VAL ACC]: 67.8%
[EPOCH]: 15 | [LR]: 0.001 | [TRAIN LOSS]: 0.82 | [TRAIN ACC]: 70.8% | [VAL LOSS]: 1.32 | [VAL ACC]: 65.1%
[EPOCH]: 16 | [LR]: 0.001 | [TRAIN LOSS]: 0.78 | [TRAIN ACC]: 71.0% | [VAL LOSS]: 1.31 | [VAL ACC]: 66.1%
[EPOCH]: 17 | [LR]: 0.001 | [TRAIN LOSS]: 0.74 | [TRAIN ACC]: 72.2% | [VAL LOSS]: 1.28 | [VAL ACC]: 69.2%
[EPOCH]: 18 | [LR]: 0.001 | [TRAIN LOSS]: 0.72 | [TRAIN ACC]: 72.6% | [VAL LOSS]: 1.30 | [VAL ACC]: 68.1%
[EPOCH]: 19 | [LR]: 0.001 | [TRAIN LOSS]: 0.71 | [TRAIN ACC]: 72.7% | [VAL LOSS]: 1.30 | [VAL ACC]: 69.8%
In [0]:
# Plot performance
trainer.plot_performance()
In [0]:
# Test performance
trainer.run_test_loop()
print("Test loss: {0:.2f}".format(trainer.train_state['test_loss']))
print("Test Accuracy: {0:.1f}%".format(trainer.train_state['test_acc']))
Test loss: 1.30
Test Accuracy: 68.2%
In [0]:
# Save all results
trainer.save_train_state()

预测 Inference

In [0]:
class Inference(object):
    def __init__(self, model, vectorizer, device="cpu"):
        self.model = model.to(device)
        self.vectorizer = vectorizer
        self.device = device
  
    def predict_nationality(self, dataset):
        # Batch generator
        batch_generator = dataset.generate_batches(
            batch_size=len(dataset), shuffle=False, device=self.device)
        self.model.eval()
        
        # Predict
        for batch_index, batch_dict in enumerate(batch_generator):
            # compute the output
            y_pred =  self.model(batch_dict['surname'], apply_softmax=True)

            # Top k nationalities
            y_prob, indices = torch.topk(y_pred, k=len(self.vectorizer.nationality_vocab))
            probabilities = y_prob.detach().to('cpu').numpy()[0]
            indices = indices.detach().to('cpu').numpy()[0]

            results = []
            for probability, index in zip(probabilities, indices):
                nationality = self.vectorizer.nationality_vocab.lookup_index(index)
                results.append({'nationality': nationality, 'probability': probability})

        return results
In [0]:
# Load vectorizer
with open(args.vectorizer_file) as fp:
    vectorizer = SurnameVectorizer.from_serializable(json.load(fp))
In [0]:
# Load the model
model = SurnameModel(num_input_channels=len(vectorizer.surname_vocab),
                     num_output_channels=args.num_filters,
                     num_classes=len(vectorizer.nationality_vocab),
                     dropout_p=args.dropout_p)
model.load_state_dict(torch.load(args.model_state_file))
print (model.named_modules)
<bound method Module.named_modules of SurnameModel(
  (conv): ModuleList(
    (0): Conv1d(28, 100, kernel_size=(2,), stride=(1,))
    (1): Conv1d(28, 100, kernel_size=(3,), stride=(1,))
    (2): Conv1d(28, 100, kernel_size=(4,), stride=(1,))
  )
  (dropout): Dropout(p=0.1)
  (fc1): Linear(in_features=300, out_features=18, bias=True)
)>
In [0]:
# Initialize
inference = Inference(model=model, vectorizer=vectorizer, device=args.device)
In [0]:
class InferenceDataset(Dataset):
    def __init__(self, df, vectorizer):
        self.df = df
        self.vectorizer = vectorizer
        self.target_size = len(self.df)

    def __str__(self):
        return "<Dataset(size={1})>".format(self.target_size)

    def __len__(self):
        return self.target_size

    def __getitem__(self, index):
        row = self.df.iloc[index]
        surname_vector = self.vectorizer.vectorize(row.surname)
        return {'surname': surname_vector}

    def get_num_batches(self, batch_size):
        return len(self) // batch_size

    def generate_batches(self, batch_size, shuffle=True, drop_last=False, device="cpu"):
        dataloader = DataLoader(dataset=self, batch_size=batch_size, 
                                shuffle=shuffle, drop_last=drop_last)
        for data_dict in dataloader:
            out_data_dict = {}
            for name, tensor in data_dict.items():
                out_data_dict[name] = data_dict[name].to(device)
            yield out_data_dict
In [0]:
# Inference
surname = input("Enter a surname to classify: ")
infer_df = pd.DataFrame([surname], columns=['surname'])
infer_df.surname = infer_df.surname.apply(preprocess_text)
infer_dataset = InferenceDataset(infer_df, vectorizer)
results = inference.predict_nationality(dataset=infer_dataset)
results
Enter a surname to classify: Goku
Out[0]:
[{'nationality': 'Japanese', 'probability': 0.91458},
 {'nationality': 'Russian', 'probability': 0.025362423},
 {'nationality': 'Czech', 'probability': 0.021216955},
 {'nationality': 'Greek', 'probability': 0.011900755},
 {'nationality': 'Arabic', 'probability': 0.010394401},
 {'nationality': 'Korean', 'probability': 0.006069167},
 {'nationality': 'Polish', 'probability': 0.004974973},
 {'nationality': 'Vietnamese', 'probability': 0.0015066856},
 {'nationality': 'English', 'probability': 0.00087393716},
 {'nationality': 'Chinese', 'probability': 0.0006946102},
 {'nationality': 'Irish', 'probability': 0.00063219015},
 {'nationality': 'Dutch', 'probability': 0.0006308032},
 {'nationality': 'French', 'probability': 0.00042536092},
 {'nationality': 'German', 'probability': 0.00037614783},
 {'nationality': 'Scottish', 'probability': 0.0001570268},
 {'nationality': 'Portuguese', 'probability': 0.0001270036},
 {'nationality': 'Italian', 'probability': 4.4490163e-05},
 {'nationality': 'Spanish', 'probability': 3.3140543e-05}]

批量归一化 Batch normalization

尽管我们对输入做了标准化(零均值、单位方差)以帮助收敛,但在训练过程中,输入经过不同的层和非线性变换后,其分布也在变化。这被称为内部协变量偏移(internal covariate shift),它会减慢训练速度,并要求我们使用更小的学习率。解决方法是批量归一化(batchnorm),把归一化作为模型架构的一部分。这使我们可以使用更大的学习率,更快地获得更好的性能。

$ BN = \frac{a - \mu_{x}}{\sqrt{\sigma^2_{x} + \epsilon}} * \gamma + \beta $

其中:

  • $a$ = 激活值 | $\in \mathbb{R}^{N \times H}$($N$ 是样本数,$H$ 是隐藏层维度)
  • $\mu_{x}$ = 每个隐藏单元的均值 | $\in \mathbb{R}^{1 \times H}$
  • $\sigma^2_{x}$ = 每个隐藏单元的方差 | $\in \mathbb{R}^{1 \times H}$
  • $\epsilon$ = 噪声
  • $\gamma$ = 缩放参数(可学习参数)
  • $\beta$ = 移位参数(可学习参数)

Even though we standardized our inputs to have zero mean and unit variance to aid with convergence, our inputs change during training as they go through the different layers and nonlinearities. This is known as internal covariate shift and it slows down training and requires us to use smaller learning rates. The solution is batch normalization (batchnorm) which makes normalization a part of the model's architecture. This allows us to use much higher learning rates and get better performance, faster.

$ BN = \frac{a - \mu_{x}}{\sqrt{\sigma^2_{x} + \epsilon}} * \gamma + \beta $

where:

  • $a$ = activation | $\in \mathbb{R}^{N \times H}$ ($N$ is the number of samples, $H$ is the hidden dim)
  • $\mu_{x}$ = mean of each hidden unit | $\in \mathbb{R}^{1 \times H}$
  • $\sigma^2_{x}$ = variance of each hidden unit | $\in \mathbb{R}^{1 \times H}$
  • $\epsilon$ = noise
  • $\gamma$ = scale parameter (learned parameter)
  • $\beta$ = shift parameter (learned parameter)

但是,让激活值在非线性操作之前具有零均值和单位方差是什么意思呢?这并不是说整个激活矩阵都具有这个属性,而是批量归一化作用在隐藏维度(本例中是 num_output_channels)上。也就是说,每个隐藏单元的均值和方差是用该批次中所有样本计算的。另外,训练时 batchnorm 使用当前批次激活值计算出的均值和方差;而测试时,由于样本量可能有偏,模型会使用训练过程中保存的总体均值和方差。PyTorch 的 BatchNorm 类会自动处理这一切。

But what does it mean for our activations to have zero mean and unit variance before the nonlinearity operation? It doesn't mean that the entire activation matrix has this property; instead, batchnorm is applied on the hidden (num_output_channels in our case) dimension. So each hidden unit's mean and variance are calculated using all samples across the batch. Also, batchnorm uses the calculated mean and variance of the activations in the batch during training. However, during test, the sample size could be skewed, so the model uses the saved population mean and variance from training. PyTorch's BatchNorm class takes care of all of this for us automatically.
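
As a quick check (a sketch we add, not from the original notebook; the shapes are toy values) of what "normalized over the hidden dimension" means, the manual per-channel computation below matches nn.BatchNorm1d in training mode; calling bn.eval() would switch it to the running statistics accumulated during training.

import torch
import torch.nn as nn

a = torch.randn(64, 100, 7)      # (N, C, L) activations with toy shapes
bn = nn.BatchNorm1d(100)         # gamma=1, beta=0 at initialization
bn.train()

# Per-channel statistics over all samples and positions in the batch
flat = a.transpose(0, 1).reshape(100, -1)              # (C, N*L)
mean = flat.mean(dim=1).view(1, 100, 1)
var = flat.var(dim=1, unbiased=False).view(1, 100, 1)  # biased variance, as used in training
manual = (a - mean) / torch.sqrt(var + bn.eps)

print(torch.allclose(bn(a), manual, atol=1e-5))  # True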

In [0]:
# Model with batch normalization
class SurnameModel_BN(nn.Module):
    def __init__(self, num_input_channels, num_output_channels, num_classes, dropout_p):
        super(SurnameModel_BN, self).__init__()
        
        # Conv weights
        self.conv = nn.ModuleList([nn.Conv1d(num_input_channels, num_output_channels, 
                                             kernel_size=f) for f in [2,3,4]])
        self.conv_bn = nn.ModuleList([nn.BatchNorm1d(num_output_channels) # define batchnorms
                                      for i in range(3)])
        self.dropout = nn.Dropout(dropout_p)
       
        # FC weights
        self.fc1 = nn.Linear(num_output_channels*3, num_classes)

    def forward(self, x, channel_first=False, apply_softmax=False):
        
        # Rearrange input so num_input_channels is in dim 1 (N, C, L)
        if not channel_first:
            x = x.transpose(1, 2)
            
        # Conv outputs
        z = [F.relu(conv_bn(conv(x))) for conv, conv_bn in zip(self.conv, self.conv_bn)]
        z = [F.max_pool1d(zz, zz.size(2)).squeeze(2) for zz in z]
        
        # Concat conv outputs
        z = torch.cat(z, 1)
        z = self.dropout(z)

        # FC layer
        y_pred = self.fc1(z)
        
        if apply_softmax:
            y_pred = F.softmax(y_pred, dim=1)
        return y_pred
In [0]:
# Initialization
dataset = SurnameDataset.load_dataset_and_make_vectorizer(split_df)
dataset.save_vectorizer(args.vectorizer_file)
vectorizer = dataset.vectorizer
model = SurnameModel_BN(num_input_channels=len(vectorizer.surname_vocab),
                        num_output_channels=args.num_filters,
                        num_classes=len(vectorizer.nationality_vocab),
                        dropout_p=args.dropout_p)
print (model.named_modules)
<bound method Module.named_modules of SurnameModel_BN(
  (conv): ModuleList(
    (0): Conv1d(28, 100, kernel_size=(2,), stride=(1,))
    (1): Conv1d(28, 100, kernel_size=(3,), stride=(1,))
    (2): Conv1d(28, 100, kernel_size=(4,), stride=(1,))
  )
  (conv_bn): ModuleList(
    (0): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (1): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (dropout): Dropout(p=0.1)
  (fc1): Linear(in_features=300, out_features=18, bias=True)
)>

You can train this model with batch normalization and you'll notice that the validation results improve by ~2-5%.

In [0]:
# Train
trainer = Trainer(dataset=dataset, model=model, 
                  model_state_file=args.model_state_file, 
                  save_dir=args.save_dir, device=args.device,
                  shuffle=args.shuffle, num_epochs=args.num_epochs, 
                  batch_size=args.batch_size, learning_rate=args.learning_rate, 
                  early_stopping_criteria=args.early_stopping_criteria)
trainer.run_train_loop()
[EPOCH]: 00 | [LR]: 0.001 | [TRAIN LOSS]: 2.63 | [TRAIN ACC]: 24.7% | [VAL LOSS]: 2.31 | [VAL ACC]: 42.2%
[EPOCH]: 01 | [LR]: 0.001 | [TRAIN LOSS]: 2.01 | [TRAIN ACC]: 43.1% | [VAL LOSS]: 1.95 | [VAL ACC]: 51.7%
[EPOCH]: 02 | [LR]: 0.001 | [TRAIN LOSS]: 1.63 | [TRAIN ACC]: 51.1% | [VAL LOSS]: 1.70 | [VAL ACC]: 58.9%
[EPOCH]: 03 | [LR]: 0.001 | [TRAIN LOSS]: 1.37 | [TRAIN ACC]: 58.1% | [VAL LOSS]: 1.56 | [VAL ACC]: 55.4%
[EPOCH]: 04 | [LR]: 0.001 | [TRAIN LOSS]: 1.22 | [TRAIN ACC]: 59.4% | [VAL LOSS]: 1.43 | [VAL ACC]: 57.8%
[EPOCH]: 05 | [LR]: 0.001 | [TRAIN LOSS]: 1.08 | [TRAIN ACC]: 63.2% | [VAL LOSS]: 1.47 | [VAL ACC]: 61.3%
[EPOCH]: 06 | [LR]: 0.001 | [TRAIN LOSS]: 0.97 | [TRAIN ACC]: 65.8% | [VAL LOSS]: 1.32 | [VAL ACC]: 61.5%
[EPOCH]: 07 | [LR]: 0.001 | [TRAIN LOSS]: 0.89 | [TRAIN ACC]: 67.9% | [VAL LOSS]: 1.31 | [VAL ACC]: 67.4%
[EPOCH]: 08 | [LR]: 0.001 | [TRAIN LOSS]: 0.81 | [TRAIN ACC]: 69.6% | [VAL LOSS]: 1.38 | [VAL ACC]: 66.8%
[EPOCH]: 09 | [LR]: 0.001 | [TRAIN LOSS]: 0.77 | [TRAIN ACC]: 70.8% | [VAL LOSS]: 1.32 | [VAL ACC]: 65.3%
[EPOCH]: 10 | [LR]: 0.001 | [TRAIN LOSS]: 0.66 | [TRAIN ACC]: 72.5% | [VAL LOSS]: 1.29 | [VAL ACC]: 66.8%
[EPOCH]: 11 | [LR]: 0.001 | [TRAIN LOSS]: 0.63 | [TRAIN ACC]: 74.0% | [VAL LOSS]: 1.28 | [VAL ACC]: 65.2%
[EPOCH]: 12 | [LR]: 0.001 | [TRAIN LOSS]: 0.60 | [TRAIN ACC]: 74.4% | [VAL LOSS]: 1.26 | [VAL ACC]: 68.4%
[EPOCH]: 13 | [LR]: 0.001 | [TRAIN LOSS]: 0.61 | [TRAIN ACC]: 74.4% | [VAL LOSS]: 1.33 | [VAL ACC]: 70.4%
[EPOCH]: 14 | [LR]: 0.001 | [TRAIN LOSS]: 0.58 | [TRAIN ACC]: 75.8% | [VAL LOSS]: 1.27 | [VAL ACC]: 69.2%
[EPOCH]: 15 | [LR]: 0.001 | [TRAIN LOSS]: 0.53 | [TRAIN ACC]: 76.8% | [VAL LOSS]: 1.27 | [VAL ACC]: 69.9%
[EPOCH]: 16 | [LR]: 0.001 | [TRAIN LOSS]: 0.53 | [TRAIN ACC]: 77.4% | [VAL LOSS]: 1.30 | [VAL ACC]: 71.2%
[EPOCH]: 17 | [LR]: 0.001 | [TRAIN LOSS]: 0.53 | [TRAIN ACC]: 77.6% | [VAL LOSS]: 1.26 | [VAL ACC]: 69.5%
[EPOCH]: 18 | [LR]: 0.001 | [TRAIN LOSS]: 0.50 | [TRAIN ACC]: 77.2% | [VAL LOSS]: 1.23 | [VAL ACC]: 70.5%
[EPOCH]: 19 | [LR]: 0.001 | [TRAIN LOSS]: 0.51 | [TRAIN ACC]: 77.8% | [VAL LOSS]: 1.27 | [VAL ACC]: 70.5%
In [0]:
# Plot performance
trainer.plot_performance()
In [0]:
# Test performance
trainer.run_test_loop()
print("Test loss: {0:.2f}".format(trainer.train_state['test_loss']))
print("Test Accuracy: {0:.1f}%".format(trainer.train_state['test_acc']))
Test loss: 1.24
Test Accuracy: 70.8%

待完善 TODO

  • 图像分类例子
  • 分割
  • 深度 CNN 架构
  • 小的 3X3 滤波器
  • 填充和步幅的细节(控制接收域,使每个像素作为滤波器中心)
  • 网络中的网络(1x1 卷积)
  • 残差连接/残差块
  • 可解释性(哪些 n-gram 被激活)
  • image classification example
  • segmentation
  • deep CNN architectures
  • small 3X3 filters
  • details on padding and stride (control receptive field, make every pixel the center of the filter, etc.)
  • network-in-network (1x1 conv)
  • residual connections / residual block
  • interpretability (which n-grams fire)