from fastai.gen_doc.nbdoc import *
from fastai.text import *
from fastai.text.models import *
The text.models module fully implements the AWD-LSTM from Stephen Merity et al. The main idea of the article is to use an RNN with dropout everywhere, but in an intelligent way. This dropout differs from the usual one, which is why you'll see an RNNDropout module: we zero things, as is usual in dropout, but we always zero the same elements along the sequence dimension (which is the first dimension in PyTorch). This ensures consistency when updating the hidden state through whole sentences or articles.
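This being given, there are five different dropouts in the AWD-LSTM: the embedding dropout (embed_p), the input dropout (input_p), the weight dropout (weight_p), the hidden dropout (hidden_p) and the output dropout (output_p).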
show_doc(get_language_model)
get_language_model(vocab_sz:int, emb_sz:int, n_hid:int, n_layers:int, pad_token:int, tie_weights:bool=True, qrnn:bool=False, bias:bool=True, bidir:bool=False, output_p:float=0.4, hidden_p:float=0.2, input_p:float=0.6, embed_p:float=0.1, weight_p:float=0.5) → Module
Create a full AWD-LSTM.
The model has a first embedding of size vocab_sz by emb_sz, a hidden size of n_hid, and n_layers RNN layers that can be bidirectional if bidir is True. The last RNN has an output size of emb_sz so that the decoder can share its weights with the encoder if tie_weights is True. The decoder is a Linear layer with or without bias. If qrnn is set to True, we use QRNN cells instead of LSTMs. pad_token is the token used for padding.
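The weight tying described above can be sketched in plain PyTorch. This is a minimal illustration, not the fastai implementation: the sizes are arbitrary, and the names encoder, decoder and hidden are assumptions made for the example. It only shows why the last RNN layer must output emb_sz features for the sharing to be possible.

```python
import torch
import torch.nn as nn

# Illustrative sizes only (assumptions, not fastai defaults).
vocab_sz, emb_sz, pad_token = 100, 8, 1

encoder = nn.Embedding(vocab_sz, emb_sz, padding_idx=pad_token)
decoder = nn.Linear(emb_sz, vocab_sz, bias=True)

# Both weights have shape (vocab_sz, emb_sz), so the decoder can reuse
# the embedding matrix instead of learning its own.
decoder.weight = encoder.weight

# Stand-in for the output of the last RNN layer: (seq_len, bs, emb_sz).
hidden = torch.randn(5, 3, emb_sz)
logits = decoder(hidden)  # (seq_len, bs, vocab_sz)
```

Because the two modules now hold the same Parameter object, a gradient step on one updates the other.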
embed_p is used for the embedding dropout, input_p is used for the input dropout, weight_p is used for the weight dropout, hidden_p is used for the hidden dropout and output_p is used for the output dropout.
Note that the model returns a list of three things, the actual output being the first, the two others being the intermediate hidden states before and after dropout (used by the RNNTrainer). Most loss functions expect one output, so you should use a Callback to remove the other two if you're not using RNNTrainer.
show_doc(get_rnn_classifier)
get_rnn_classifier(bptt:int, max_seq:int, vocab_sz:int, emb_sz:int, n_hid:int, n_layers:int, pad_token:int, layers:Collection[int], drops:Collection[float], bidir:bool=False, qrnn:bool=False, hidden_p:float=0.2, input_p:float=0.6, embed_p:float=0.1, weight_p:float=0.5) → Module
Create a RNN classifier model.
This model uses an encoder taken from an AWD-LSTM with arguments vocab_sz, emb_sz, n_hid, n_layers, bidir, qrnn, pad_token and the dropout parameters. This encoder is fed the sequence in successive chunks of size bptt, and we only keep the last max_seq outputs for the pooling layers.
The decoder uses a concatenation of the last output, a MaxPooling of all the outputs and an AveragePooling of all the outputs. It then uses a list of BatchNorm, Dropout, Linear, ReLU blocks (with no ReLU in the last one), using a first layer size of 3*emb_sz then following the numbers in layers. The dropout probabilities are read from drops.
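The decoder described above can be sketched as follows. This is a hedged illustration rather than the fastai code: concat_pool_sketch and head_sketch are hypothetical helper names, the encoder outputs are assumed to have shape (seq_len, bs, emb_sz), and the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

def concat_pool_sketch(outputs):
    # outputs is assumed to be (seq_len, bs, emb_sz).
    last = outputs[-1]                 # last time step: (bs, emb_sz)
    max_pool = outputs.max(dim=0)[0]   # max over the sequence: (bs, emb_sz)
    avg_pool = outputs.mean(dim=0)     # mean over the sequence: (bs, emb_sz)
    return torch.cat([last, max_pool, avg_pool], dim=1)  # (bs, 3*emb_sz)

def head_sketch(sizes, drops):
    # One BatchNorm/Dropout/Linear block per consecutive pair of sizes,
    # with a ReLU after every block except the last one.
    blocks = []
    for i, (n_in, n_out, p) in enumerate(zip(sizes[:-1], sizes[1:], drops)):
        blocks += [nn.BatchNorm1d(n_in), nn.Dropout(p), nn.Linear(n_in, n_out)]
        if i < len(sizes) - 2:
            blocks.append(nn.ReLU())
    return nn.Sequential(*blocks)

emb_sz, bs, seq_len = 4, 2, 6
outputs = torch.randn(seq_len, bs, emb_sz)
pooled = concat_pool_sketch(outputs)

# First size is 3*emb_sz, the following ones play the role of `layers`.
head = head_sketch([3 * emb_sz, 50, 2], [0.2, 0.1]).eval()
preds = head(pooled)  # (bs, 2)
```

The factor of 3 in the first layer size comes directly from concatenating three vectors of size emb_sz.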
Note that the model returns a list of three things, the actual output being the first, the two others being the intermediate hidden states before and after dropout (used by the RNNTrainer). Most loss functions expect one output, so you should use a Callback to remove the other two if you're not using RNNTrainer.
On top of the PyTorch or the fastai layers, the language models use some custom layers specific to NLP.
show_doc(EmbeddingDropout, title_level=3)
Each row of the embedding matrix has a probability embed_p of being replaced by zeros while the others are rescaled accordingly.
enc = nn.Embedding(100, 7, padding_idx=1)
enc_dp = EmbeddingDropout(enc, 0.5)
tst_input = torch.randint(0,100,(8,))
enc_dp(tst_input)
tensor([[ 0.5721, -2.2245, -3.1669, -0.3286, -1.3392, -1.3890, 1.3677], [ 1.9181, 0.8162, 0.0547, -1.1909, 1.8688, -1.0324, 2.9438], [-1.1319, 0.4245, 6.3649, -2.0573, -0.0647, -0.1660, -0.8208], [-0.0000, 0.0000, 0.0000, -0.0000, 0.0000, 0.0000, -0.0000], [-0.0000, 0.0000, 0.0000, -0.0000, 0.0000, 0.0000, -0.0000], [ 1.9181, 0.8162, 0.0547, -1.1909, 1.8688, -1.0324, 2.9438], [-0.0000, 0.0000, 0.0000, -0.0000, 0.0000, 0.0000, -0.0000], [ 0.4246, 1.7266, -0.3707, 2.8732, -1.4541, 0.6501, 3.0350]], grad_fn=<EmbeddingBackward>)
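To make the row-wise behaviour concrete, here is a minimal sketch of the idea, assuming a plain nn.Embedding. The function name embedding_dropout_sketch is hypothetical and this is not presented as the library's implementation: it draws one Bernoulli value per vocabulary row, rescales the kept rows by 1/(1-embed_p), and looks the words up in the masked matrix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def embedding_dropout_sketch(emb, words, embed_p):
    w = emb.weight
    # One mask value per row of the embedding matrix: shape (vocab_sz, 1).
    mask = w.new_empty((w.size(0), 1)).bernoulli_(1 - embed_p).div_(1 - embed_p)
    # Broadcasting zeroes whole rows and rescales the surviving ones.
    return F.embedding(words, w * mask, emb.padding_idx)

torch.manual_seed(0)
emb = nn.Embedding(10, 4, padding_idx=1)
words = torch.tensor([0, 2, 2, 5])
out = embedding_dropout_sketch(emb, words, 0.5)
```

Since the mask is drawn per row of the matrix and not per position in the input, a word that appears twice in the batch is either dropped both times or kept both times, as in the output shown above.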
show_doc(RNNDropout, title_level=3)
dp = RNNDropout(0.3)
tst_input = torch.randn(3,3,7)
tst_input, dp(tst_input)
(tensor([[[-1.3750, 0.0598, 0.5507, -0.1219, -1.4071, 0.5813, 0.9757], [-0.2612, -2.2168, -0.3012, -0.4310, -1.3489, 0.9916, 1.1717], [-1.7778, -0.7739, -2.2230, 0.5438, -0.2032, 0.7374, 1.1300]], [[-1.9824, -1.6155, -0.1078, -2.2462, -0.5045, -0.5635, 0.5041], [ 0.3810, 0.7194, 0.7611, 0.9812, 1.0620, 0.9317, 0.3176], [-1.8882, -0.0156, -1.4240, -0.0359, 0.6856, 0.0072, -0.6026]], [[-0.3039, -0.5425, -1.2921, -1.1725, -0.2109, 0.2727, -0.6178], [ 1.5460, 0.5858, -0.3476, -0.5885, -0.5179, 0.1737, -0.1857], [-0.1227, 0.1517, 0.1305, -0.4547, -0.8123, 0.0917, 0.1694]]]), tensor([[[-1.9642, 0.0000, 0.7867, -0.0000, -2.0101, 0.0000, 1.3939], [-0.3732, -0.0000, -0.4303, -0.0000, -1.9269, 0.0000, 1.6738], [-2.5398, -0.0000, -3.1757, 0.0000, -0.2903, 0.0000, 1.6143]], [[-2.8320, -0.0000, -0.0000, -3.2089, -0.0000, -0.8050, 0.7201], [ 0.5443, 0.0000, 0.0000, 1.4017, 0.0000, 1.3310, 0.4538], [-2.6975, -0.0000, -0.0000, -0.0513, 0.0000, 0.0104, -0.8609]], [[-0.0000, -0.7749, -1.8458, -1.6750, -0.3014, 0.0000, -0.0000], [ 0.0000, 0.8369, -0.4966, -0.8407, -0.7398, 0.0000, -0.0000], [-0.0000, 0.2167, 0.1864, -0.6496, -1.1605, 0.0000, 0.0000]]]))
show_doc(WeightDropout, title_level=3)
Applies dropout of probability weight_p to the layers in layer_names of module in training mode. A copy of those weights is kept so that the dropout mask can change at every batch.
module = nn.LSTM(5, 2)
dp_module = WeightDropout(module, 0.4)
getattr(dp_module.module, 'weight_hh_l0')
Parameter containing: tensor([[ 0.0712, -0.6369], [-0.3654, 0.4196], [-0.6829, 0.6955], [ 0.6683, -0.4114], [ 0.5502, -0.1464], [-0.2557, -0.4861], [-0.4205, -0.2314], [ 0.4531, 0.3012]], requires_grad=True)
It's at the beginning of a forward pass that the dropout is applied to the weights.
tst_input = torch.randn(4,20,5)
h = (torch.zeros(1,20,2), torch.zeros(1,20,2))
x,h = dp_module(tst_input,h)
getattr(dp_module.module, 'weight_hh_l0')
tensor([[ 0.1186, -1.0615], [-0.0000, 0.6993], [-1.1382, 1.1591], [ 1.1138, -0.6856], [ 0.0000, -0.2439], [-0.0000, -0.0000], [-0.0000, -0.3857], [ 0.7551, 0.5020]], grad_fn=<MulBackward0>)
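The same mechanism can be illustrated on a simpler layer. The sketch below is an assumption-laden stand-in, not the WeightDropout class: WeightDropLinearSketch is a hypothetical module that applies the idea to a linear layer instead of an LSTM. It keeps a raw copy of the weights and derives a freshly dropped-out version at the start of each forward pass.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightDropLinearSketch(nn.Module):
    "Hypothetical linear layer with dropout applied to its weights."
    def __init__(self, n_in, n_out, weight_p):
        super().__init__()
        # The raw weights are the ones that get trained and are never zeroed in place.
        self.weight_raw = nn.Parameter(torch.randn(n_out, n_in) * 0.1)
        self.bias = nn.Parameter(torch.zeros(n_out))
        self.weight_p = weight_p

    def forward(self, x):
        # A new mask is drawn on every call in training mode; eval mode is a no-op.
        w = F.dropout(self.weight_raw, p=self.weight_p, training=self.training)
        return F.linear(x, w, self.bias)

torch.manual_seed(0)
layer = WeightDropLinearSketch(5, 2, weight_p=0.4)
x = torch.randn(3, 5)
raw_before = layer.weight_raw.detach().clone()

layer.train()
y_train = layer(x)  # uses a dropped-out copy of the weights
layer.eval()
y_eval = layer(x)   # uses the raw weights unchanged
```

Keeping the raw weights separate is what lets the mask change from batch to batch without losing the values that were zeroed.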
show_doc(SequentialRNN, title_level=3)
class SequentialRNN(args) :: Sequential
A sequential module that passes the reset call to its children.
show_doc(SequentialRNN.reset)
reset()
Call the reset function of self.children (if they have one).
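A minimal sketch of this forwarding behaviour, under the assumption that a child signals support simply by having a reset attribute. SequentialRNNSketch and CountingLayer are hypothetical names used for illustration only.

```python
import torch.nn as nn

class SequentialRNNSketch(nn.Sequential):
    "Sequential container that forwards reset() to children that define it."
    def reset(self):
        for child in self.children():
            if hasattr(child, 'reset'):
                child.reset()

class CountingLayer(nn.Module):
    "Toy layer that counts how many times reset() was called."
    def __init__(self):
        super().__init__()
        self.n_resets = 0
    def forward(self, x):
        return x
    def reset(self):
        self.n_resets += 1

# nn.Linear has no reset method, so it is skipped.
model = SequentialRNNSketch(CountingLayer(), nn.Linear(3, 3), CountingLayer())
model.reset()
```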
show_doc(dropout_mask)
dropout_mask(x:Tensor, sz:Collection[int], p:float)
Return a dropout mask of the same type as x, size sz, with probability p to cancel an element.
tst_input = torch.randn(3,3,7)
dropout_mask(tst_input, (3,7), 0.3)
tensor([[1.4286, 0.0000, 1.4286, 1.4286, 1.4286, 1.4286, 1.4286], [0.0000, 1.4286, 1.4286, 1.4286, 0.0000, 1.4286, 0.0000], [0.0000, 1.4286, 1.4286, 0.0000, 1.4286, 1.4286, 1.4286]])
Such a mask is then expanded in the sequence length dimension and multiplied by the input to do an RNNDropout.
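The two steps can be sketched together. This is a minimal illustration, not the library code: dropout_mask_sketch and rnn_dropout_sketch are hypothetical names, and the input is assumed to be laid out as (seq_len, bs, n_feat) so that the mask is shared along dimension 0.

```python
import torch

def dropout_mask_sketch(x, sz, p):
    # Keep each element with probability 1-p and rescale the kept ones
    # by 1/(1-p) so the expected value is unchanged.
    return x.new_empty(sz).bernoulli_(1 - p).div_(1 - p)

def rnn_dropout_sketch(x, p):
    # x is assumed to be (seq_len, bs, n_feat). The mask has a singleton
    # sequence dimension, so broadcasting reuses it at every time step.
    mask = dropout_mask_sketch(x, (1, x.size(1), x.size(2)), p)
    return x * mask

torch.manual_seed(0)
x = torch.ones(4, 2, 5)
out = rnn_dropout_sketch(x, 0.3)
```

With an input of ones, every time step of out is identical: the same (batch, feature) positions are zeroed throughout the sequence, and the kept ones equal 1/0.7.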
show_doc(RNNCore, title_level=3)
class RNNCore(vocab_sz:int, emb_sz:int, n_hid:int, n_layers:int, pad_token:int, bidir:bool=False, hidden_p:float=0.2, input_p:float=0.6, embed_p:float=0.1, weight_p:float=0.5, qrnn:bool=False) :: Module
AWD-LSTM/QRNN inspired by https://arxiv.org/abs/1708.02182.
This is the encoder of the model, with an embedding layer of vocab_sz by emb_sz, a hidden size of n_hid and n_layers layers. pad_token is passed to the Embedding. If bidir is True, the model is bidirectional. If qrnn is True, we use QRNN cells instead of LSTMs. Dropouts are embed_p, input_p, weight_p and hidden_p.
show_doc(RNNCore.reset)
show_doc(LinearDecoder, title_level=3)
Create the decoder to go on top of an RNNCore encoder and create a language model. n_hid is the dimension of the last hidden state of the encoder, n_out the size of the output. Dropout of output_p is applied. If a tie_encoder is passed, it will be used for the weights of the linear layer, which will have a bias or not depending on bias.
show_doc(MultiBatchRNNCore, title_level=3)
Text is passed by chunks of sequence length bptt and only the last max_seq outputs are kept for the next layer. args and kwargs are passed to the RNNCore.
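The chunking logic can be sketched as below. Several things here are assumptions made for illustration: multibatch_sketch is a hypothetical helper, the input is taken to be (seq_len, bs), a stateless nn.Embedding stands in for the real encoder (which carries hidden state from one chunk to the next), and the final trim to exactly max_seq steps is a simplification.

```python
import torch
import torch.nn as nn

def multibatch_sketch(encoder, inp, bptt, max_seq):
    # inp is assumed to be (seq_len, bs).
    sl = inp.size(0)
    kept = []
    for i in range(0, sl, bptt):
        out = encoder(inp[i:i + bptt])   # every chunk goes through the encoder
        # Only store chunks that overlap the last max_seq positions.
        if i + bptt > sl - max_seq:
            kept.append(out)
    # Concatenate along the sequence dimension and trim to max_seq steps.
    return torch.cat(kept, dim=0)[-max_seq:]

torch.manual_seed(0)
encoder = nn.Embedding(50, 4)            # stand-in encoder, no hidden state
inp = torch.randint(0, 50, (25, 3))
out = multibatch_sketch(encoder, inp, bptt=10, max_seq=12)
```

Discarding early chunks as soon as they are processed keeps memory bounded for long documents, while the encoder still sees the whole text.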
show_doc(MultiBatchRNNCore.concat)
concat(arrs:Collection[Tensor]) → Tensor
Concatenate the arrs along the batch dimension.
show_doc(PoolingLinearClassifier, title_level=3)
The last output, MaxPooling of all the outputs and AvgPooling of all the outputs are concatenated, then blocks of bn_drop_lin are stacked, according to the values in layers and drops.
show_doc(PoolingLinearClassifier.pool)
The input tensor x (of batch size bs) is pooled along the sequence dimension, giving one vector per item in the batch. is_max decides whether we do a MaxPooling (True) or an AvgPooling (False).
show_doc(WeightDropout.forward)
forward(args:ArgStar)
Defines the computation performed at every call. Should be overridden by all subclasses.
Note: although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
show_doc(RNNCore.forward)
forward(input:LongTensor) → Tuple[Tensor, Tensor]
show_doc(EmbeddingDropout.forward)
forward(words:LongTensor, scale:Optional[float]=None) → Tensor
show_doc(RNNDropout.forward)
forward(x:Tensor) → Tensor
show_doc(WeightDropout.reset)
reset()
show_doc(PoolingLinearClassifier.forward)
forward(input:Tuple[Tensor, Tensor]) → Tuple[Tensor, Tensor, Tensor]
show_doc(MultiBatchRNNCore.forward)
forward(input:LongTensor) → Tuple[Tensor, Tensor]
show_doc(LinearDecoder.forward)
forward(input:Tuple[Tensor, Tensor]) → Tuple[Tensor, Tensor, Tensor]