# SCat Research and Developement¶

The goal of this notebook is to use a recurrent neural network (GRU or LSTM) to learn to extract keyword in a text. This model will be used to extract arguments in a natural language request.

Each word will be submitted to a recurrent neural network unit, each unit take a word vector as the input and return some feature about the word being an "argument", here a city. These feature will be send to a single neuron that will tranform them into a probability.

To achieve this, the text input has to be tokenized and each word as to be transformed into a vector.

Usually, the word vector need to be calculated using something like GloVe or Word2Vec butfor simplicity reasons we will use a simple 1-hot encoding.

In [1]:
import torch

tokenizer = lambda text: text.split()

def vectorizer(tokens):
vocabulary = list(set(tokens))
embedding = dict()

for word_index, word in enumerate(vocabulary):
word_vec = torch.zeros(len(vocabulary))
word_vec[word_index] = 1
embedding[word] = word_vec

return embedding


Now let's write the model, I will use a simple GRU cell as the encoder and a raw non-recurrent layer activated by softmax as the decoder.

In [2]:
from torch import nn

class SCatCell(nn.Module):
def __init__(self, input_size, hidden_size, output_size):
super(SCatCell, self).__init__()
self.hidden_size = hidden_size

self.encoder = nn.GRUCell(input_size, hidden_size)
self.decoder = nn.Linear(hidden_size, output_size)

def forward(self, input, hidden):
next_hidden_state = self.encoder(input, hidden)
output = self.decoder(next_hidden_state.view(self.hidden_size))

return next_hidden_state, output

def init_hidden(self):
return Variable(torch.zeros(1, self.hidden_size))


Now we just need to prepare a simple dataset and train the model !

The dataset is a simple collection of only six weather question, the goal is for the network to identify the city in the sentence like in a supervised classifcation problem.

In [3]:
train_x = [
"What is the weather like in Paris ?",
"What kind of weather will it do in London ?",
"Give me the weather forecast in Berlin please .",
"Tell me the forecast in New York !",
"Give me the weather in San Francisco ...",
"I want the forecast in Dublin ."
]
train_y = [
('Paris'),
('London'),
('Berlin'),
('New', 'York'),
('San', 'Francisco'),
('Dublin')
]


Finally, train the model !

In [4]:
from torch import optim

learning_rate = 0.001
n_epoch = 1000

embeddings = vectorizer(tokenizer(' '.join(train_x + ['Los', 'Angeles']))) # add a city not in the training set for testing

model = SCatCell(len(embeddings), 10, 1)
loss = nn.MSELoss()

for epoch in range(n_epoch):
for s_x, s_y in zip([tokenizer(t) for t in train_x], train_y):
hidden_state = model.init_hidden()

ys = Variable(torch.FloatTensor([0]))
for word in s_x:
word_vec = Variable(embeddings[word].view(1, len(embeddings)))
word_y = Variable(torch.FloatTensor([int(word in s_y) / len(s_y)]))

hidden_state, pred = model(word_vec, hidden_state)

ys = torch.cat((ys, word_y), 0)
preds = torch.cat((preds, pred), 0)

error = loss(preds, ys)
error.backward()
optimizer.step()

if epoch % 100 == 0:
print("Epoch {} - Loss: {}".format(epoch, round(float(error), 4)))

Epoch 0 - Loss: 0.0117
Epoch 100 - Loss: 0.0007
Epoch 200 - Loss: 0.0003
Epoch 300 - Loss: 0.0002
Epoch 400 - Loss: 0.0001
Epoch 500 - Loss: 0.0
Epoch 600 - Loss: 0.0
Epoch 700 - Loss: 0.0
Epoch 800 - Loss: 0.0
Epoch 900 - Loss: 0.0


Learning seems a bit too easy but, anyway, let's check a training sample !

In [5]:
x = "Give me the forecast in Los Angeles"
s_x = tokenizer(x)
hidden_state = model.init_hidden()
preds = []

for word in s_x:
word_vec = Variable(embeddings[word].view(1, len(embeddings)))
hidden_state, pred = model(word_vec, hidden_state)
preds.append(float(pred))

print(word, float(pred))

Give 7.063150405883789e-06
me 4.3682754039764404e-05
the 4.016607999801636e-05
forecast 0.0023935437202453613
in 0.08705766499042511
Los 0.18460620939731598
Angeles 0.42705148458480835

In [6]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure()
plt.bar(range(len(preds)), preds)

Out[6]:
<Container object of 7 artists>

Results are quite encouraging, now it's time to define a... let's say a SCat Argument Determination Algorithm to transform network output into words. We could transform prediction into probabilities using softmax, this transformation will help the model to be stable. A good way to identify an argument to a non-argument is by getting words that are the most higher to the mean so we could substract the mean to each probailities.

In [7]:
from torch.nn import functional as F

preds = Variable(torch.FloatTensor(preds))
preds = F.softmax(preds)
preds = preds.data

plt.figure()
plt.bar(range(len(preds)), [float(el) for el in preds])

preds = preds - preds.mean()
preds = [float(el) for el in preds]

plt.figure()
plt.bar(range(len(preds)), preds)

/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py:4: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
after removing the cwd from sys.path.

Out[7]:
<Container object of 7 artists>

Using this method, we have cancelled small values, they are forced to be negative. We could just need to select positive values.

In [8]:
selected_words = []

for index, pred in enumerate(preds):
if pred > 0:
selected_words.append(s_x[index])

print(selected_words)

['Los', 'Angeles']


It work just fine ! Tried with many other example and the result are quite robust for the size of the dataset.