The data was scraped directly from Wikipedia (link here). I then cleaned and analyzed the data; click here to see that analysis.
This notebook takes a word-based approach to nickname generation. We will go through the preprocessing required to create affix embeddings, then tokenize, encode, decode, and finally train a Keras LSTM neural network.
I also walked through this process using a character-based approach; however, the results were hilariously bad.
# data manip
import pandas as pd
import numpy as np
# model building
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import LambdaCallback
# load model
from tensorflow import keras
raw = pd.read_csv('data/cleaned.nicknames.csv')
raw.head()
| | fake name | real name | len fake | len real | category | notes | count |
|---|---|---|---|---|---|---|---|
| 0 | dumbo | randolph tex alles | 1 | 3 | domestic political figures | director of the united states secret service | 1 |
| 1 | wheres hunter | hunter biden | 2 | 2 | domestic political figures | american lawyer and lobbyist who is the second... | 1 |
| 2 | 1% joe | joe biden | 2 | 2 | domestic political figures | 47th vice president of the united states; form... | 1 |
| 3 | basement joe | joe biden | 3 | 2 | domestic political figures | 47th vice president of the united states; form... | 1 |
| 4 | beijing joe | joe biden | 3 | 2 | domestic political figures | 47th vice president of the united states; form... | 1 |
Here is the data we are working with. We have the nicknames (fake name) and the corresponding real name of the individual Trump gave each nickname to. We also have a few other columns, but those will not matter for this task.
Next we need to separate each word and add a tag. We will be adding a few created tags:
Real names:
"joe biden"
will become "joe <name1> biden <name2>"
Nicknames:
"basement joe hidin"
will become "basement <prefix> <name1> <name> hidin <suffix>"
A <prefix> tag is added to every word before the real-name substitution, and a <suffix> tag to every word after it. The substitution itself gets a <name> tag. If the nickname shares no words with the real name, a <nope> tag is added to each word instead.
These modifications matter because the generated <nameX> tags let us substitute words from the user's input directly into the generated output. For example, if a user inputs "Joe Biden" and the generated name follows "<prefix> <name1>", the generation algorithm can substitute the user's <name1> with joe, so all we need to predict is the <prefix> token (the model still needs to predict which name index to pull). One fun quirk is that the predicted suffix tag is not always a true suffix (i.e., a set of embeddings can be predicted as "<suffix> <prefix>", etc.). This setup helps with training, as we only need to predict the length of the nickname, then the sequence of tags that follows, and finally the tokens that convert each tag into a real word. For example, if we predict a length-4 name as the embeddings "<prefix> <prefix> <name> <suffix>", we can generate the best real word for each category, so the generated nickname ends up with two prefix tokens, a name tag, and a suffix token; we then replace the name tag with the user's corresponding name token. This helps generalize the model and keeps the outputs quite random, which gives us more interesting and varied generated nicknames.
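The substitution step described above can be sketched as follows. This is a standalone illustration, not the notebook's actual code: `fill_tags` and the candidate word lists are hypothetical, and the real pipeline samples prefix/suffix words from the model's learned probabilities rather than from fixed lists.

```python
import random

def fill_tags(tag_sequence, user_name,
              prefixes=('sleepy', 'crooked'), suffixes=('hidin',)):
    """Replace each tag in a generated sequence with a concrete word.

    <nameN> tags are filled directly from the user's input;
    <prefix>/<suffix>/<nope> tags are drawn from candidate word lists
    here (the model predicts these in the real pipeline).
    """
    name_words = user_name.split(' ')
    out = []
    for tag in tag_sequence.split(' '):
        if tag.startswith('<name'):
            # pull the Nth word of the user's real name
            n = int(tag[5:-1]) - 1
            out.append(name_words[n])
        elif tag == '<prefix>':
            out.append(random.choice(prefixes))
        else:  # '<suffix>' or '<nope>'
            out.append(random.choice(suffixes))
    return ' '.join(out)

print(fill_tags('<prefix> <name1>', 'joe biden'))
```

Note this sketch only handles indexed `<nameN>` tags; the paired `<name>` marker tags from the tokenizer are omitted for simplicity.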
def tokenize(realname, nickname):
    '''tokenizes and adds affix embeddings to the real name and nickname, then returns a tuple of both names fully tokenized and embedded as strings'''
    # get a dictionary mapping each real-name word to its corresponding token, ie input
    real2token = {word: f'<name{X+1}>' for X, word in enumerate(realname.split(' '))}
    # convert dictionary to word-tokenized groups and join into a single string
    real_tokenized = ' '.join([f'{word} {real2token[word]}' for word in real2token])
    # change nickname into a single string with tokenization and substitution
    # grab names to substitute
    subs = [sub for sub in realname.split(' ') if sub in nickname]
    if len(subs) == 0:
        # then there are no substitutions; every token is <nope>
        nick_tokenized = ' '.join([f'{word} <nope>' for word in nickname.split(' ')])
        return (real_tokenized, nick_tokenized)
    substituted = ' '.join([word if word not in subs else f'{real2token[word]}' for word in nickname.split(' ')])
    tokens = ['<prefix>', '<suffix>', '<name>']
    index = 0
    tokenized = []
    for word in substituted.split(' '):
        if 'name' in word:
            index = 2
        tokenized.append(f'{word} {tokens[index]}')
        if index == 2:
            index = 1
    nick_tokenized = ' '.join(tokenized)
    return (real_tokenized, nick_tokenized)
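As a quick sanity check of the tagging logic, here is what the function produces for a sample pair. The definition is restated below so the example is self-contained and runnable on its own; it mirrors the function above.

```python
def tokenize(realname, nickname):
    """Tag a (real name, nickname) pair as described above."""
    real2token = {w: f'<name{i+1}>' for i, w in enumerate(realname.split(' '))}
    real_tokenized = ' '.join(f'{w} {real2token[w]}' for w in real2token)
    # words of the real name that appear in the nickname get substituted
    subs = [s for s in realname.split(' ') if s in nickname]
    if not subs:
        # no overlap with the real name: every word gets <nope>
        return real_tokenized, ' '.join(f'{w} <nope>' for w in nickname.split(' '))
    substituted = ' '.join(w if w not in subs else real2token[w]
                           for w in nickname.split(' '))
    tokens = ['<prefix>', '<suffix>', '<name>']
    index, tokenized = 0, []
    for word in substituted.split(' '):
        if 'name' in word:
            index = 2          # this position is a name substitution
        tokenized.append(f'{word} {tokens[index]}')
        if index == 2:
            index = 1          # everything after a name is a suffix
    return real_tokenized, ' '.join(tokenized)

print(tokenize('joe biden', 'basement joe'))
# → ('joe <name1> biden <name2>', 'basement <prefix> <name1> <name>')
```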
tokenized_names = [tokenize(realname, nickname) for i, nickname, realname in raw[['fake name', 'real name']].itertuples()]
print('real name | nickname')
tokenized_names[10:15]
real name | nickname
[('joe <name1> biden <name2>', 'sleepy <prefix> creepy <prefix> <name1> <name>'), ('joe <name1> biden <name2>', 'slow <prefix> <name1> <name>'), ('joe <name1> biden <name2>', '<name1> <name> hiden <suffix>'), ('joe <name1> biden <name2>', 'obiden <prefix>'), ('michael <name1> bloomberg <name2>', 'little <prefix> <name1> <name> <name2> <name>')]
Okay, now that everything is tokenized, we can start to vectorize it!
To vectorize this we need to define our vocabulary, then create a matrix for each name. Each matrix has the maximum token length as rows and the total vocabulary size as columns. Our resulting input is a list of matrices, one per name: for each token position (row), we place a 1 in the column whose index corresponds to that token in the vocabulary, i.e. a one-hot encoding per row.
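As a minimal illustration of this one-hot scheme (using a tiny toy vocabulary, not the notebook's actual `token2i`):

```python
import numpy as np

# toy vocabulary and a tokenized name (hypothetical, for illustration only)
vocab = ['<prefix>', '<name1>', 'basement', 'joe']
token2i = {t: i for i, t in enumerate(vocab)}
name = 'basement <prefix> <name1>'.split(' ')

max_tokens, token_dimensions = 4, len(vocab)
matrix = np.zeros((max_tokens, token_dimensions))
for row, token in enumerate(name):
    matrix[row, token2i[token]] = 1  # 1 at (position, vocab index)

print(matrix)
# [[0. 0. 1. 0.]
#  [1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 0. 0.]]
```

The last row is all zeros because the name is shorter than `max_tokens`; unused rows act as padding.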
##### get probabilities of name length #####
name_matrix = raw.groupby(['len real', 'len fake']).agg({'count': 'count'}).reset_index()
name_matrix = name_matrix[name_matrix['len real'] == 2]
p_gen_len = name_matrix[['len fake']]
p_gen_len['p'] = name_matrix['count'] / name_matrix['count'].sum()
##### get vocab ######
# flatten paired list to get all names
flatten = [name for pair in tokenized_names for name in pair]
# make sure the created tokens go first in the vocab dictionaries
flatten[:0] = ['<prefix>', '<suffix>', '<name>', '<nope>', '<name1>', '<name2>', '<name3>', '<name4>', '<name5>', '<name6>']
###### get dictionaries #########
# dictionary mapping each nickname to its real name
nick2real = {nickname: realname for (realname, nickname) in tokenized_names}
# dictionary of unique vocab tokens
uni_tokens = {token: 0 for name in flatten for token in name.split(' ')}
# dictionary for token to index
token2i = {token: i for i, token in enumerate(uni_tokens)}
# dictionary for index to token
i2token = {token2i[token]: token for token in token2i}
# dictionary for index to real name token
i2name = {token2i[word]: word for pair in tokenized_names for word in pair[0].split(' ') if '<' not in word}
######### math ##########
# find total number of nicknames
n = len(tokenized_names)
# find max number of tokens, ie rows in matrix
max_tokens = max([len(name.split(' ')) for name in flatten])
# find total number of tokens, ie columns in matrix
token_dimensions = len(i2token)
##### get matrices ########
# set up vectors for output = nicknames
output = np.zeros((n, max_tokens, token_dimensions))
# set up vectors for label = real names
input = np.zeros((n, max_tokens, token_dimensions))
#### vectorize names #######
for i, nickname in enumerate(nick2real):
    # input and output assignment
    for row, token in enumerate(nickname.split(' ')):
        output[i, row, token2i[token]] = 1
        input[i, row, token2i[token]] = 1
print(output[1])
[[0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
Awesome, here we can see a vectorized name! It doesn't look like much because the dimensions are very large, but each non-padding row has a single 1 in the column of that row's token. For example, the second row of this matrix has its 1 in column 0, the index of the <prefix> token, which tells us the first word is a prefix.
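To read one of these matrices back, the encoding can be reversed with an argmax over the columns. This is a sketch using the same `i2token` idea on a toy vocabulary (not the notebook's full vocabulary):

```python
import numpy as np

vocab = ['<prefix>', '<name1>', 'basement', 'joe']  # toy vocabulary
i2token = dict(enumerate(vocab))

# build a small one-hot matrix: 'basement <prefix> <name1>' plus one padding row
matrix = np.zeros((4, len(vocab)))
matrix[0, 2] = 1  # 'basement'
matrix[1, 0] = 1  # '<prefix>'
matrix[2, 1] = 1  # '<name1>'

# all-zero rows are padding; argmax over columns recovers each token index
decoded = [i2token[int(col)]
           for row, col in enumerate(matrix.argmax(axis=1))
           if matrix[row].sum() > 0]
print(decoded)  # → ['basement', '<prefix>', '<name1>']
```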
From here we need to build a few functions for the model!
Since this isn't a traditional training setup, we need to evaluate our model by actually generating nicknames. To do this, I will build a generate function and generate a few names at every nth epoch.
This generation function acts as our decoder. In essence, I use a few logical rules to ensure the quality of the model's output, since we had too little training data and vocabulary to build good distributions: sample a nickname length, predict an affix tag for each position (rejecting invalid sequences such as more name tags than the user supplied names, e.g. "<prefix> <name1> <name1>"), and then predict a concrete token for each tag.
def generate_name(model, input):
    '''generates a nickname based on length of word and input given'''
    word_vec = np.zeros((1, max_tokens, token_dimensions))
    # sample a nickname length from the empirical length distribution
    p_len = p_gen_len['p']
    gen_len = p_gen_len.loc[np.random.choice(range(len(p_len)), p=p_len), 'len fake']
    # get a dictionary mapping each created token to the corresponding input word
    token2input = {f'<name{X+1}>': word for X, word in enumerate(input.split(' '))}
    # convert dictionary to word-tokenized groups and join into a single string
    input_tokenized = ' '.join([f'{token2input[token]} {token}' for token in token2input])
    nName = 0  # tracks the number of name affixes generated so far
    affixs = []
    for index in range(gen_len):
        index = (index + 1) * 2 - 1
        # pull probabilities for each affix
        p_affix = list(model.predict(word_vec)[0, index])[0:4]
        # normalize probabilities
        p_affix_norm = p_affix / np.sum(p_affix)
        # guess an affix
        guess = np.random.choice(range(len(p_affix_norm)), p=p_affix_norm)
        affix = i2token[guess]
        # make sure names are not chosen more times than the user provided
        if 'name' in affix:
            if nName < len(token2input):
                nName += 1
            else:
                while 'name' in affix:
                    guess = np.random.choice(range(len(p_affix_norm)), p=p_affix_norm)
                    affix = i2token[guess]
        # assign value to generated name vector
        word_vec[0, index, guess] = 1
        affixs.append(affix)

    def predict_token(index, affix):
        '''generates a token based on the given affix'''
        if 'name' in affix:
            # pull probabilities for all names: index 4 to 4 + given amount (max index 9)
            p_name = list(model.predict(word_vec)[0, index])[4:4 + len(token2input)]
            p_name_norm = p_name / np.sum(p_name)
            # generate a name token
            guess = np.random.choice(range(4, len(p_name_norm) + 4), p=p_name_norm)
            # prevent repeat name tokens
            while word_vec[0, index - 2, guess] == 1.0:
                guess = np.random.choice(range(4, len(p_name_norm) + 4), p=p_name_norm)
            token = i2token[guess]
            word = token2input[token]
        else:
            # affix is prefix, suffix, or nope
            # grab probabilities of all non-affix, non-name tokens: index 10 to the end
            p_tokens = list(model.predict(word_vec)[0, index])[10:]
            p_tokens_norm = p_tokens / np.sum(p_tokens)
            # generate a guess from probabilities
            guess = np.random.choice(range(10, len(p_tokens_norm) + 10), p=p_tokens_norm)
            # prevent name vocab from being chosen and prevent repeat tokens
            while guess in i2name or word_vec[0, index - 2, guess] == 1.0:
                guess = np.random.choice(range(10, len(p_tokens_norm) + 10), p=p_tokens_norm)
            word = i2token[guess]
        word_vec[0, index, guess] = 1
        return word

    tokens = [predict_token(i * 2, affix) for i, affix in enumerate(affixs)]
    return " ".join(tokens)
def generate_name_loop(epoch, _):
    '''tells the model when to generate names during training'''
    if epoch % (nEpochs//5) == 0 or epoch == nEpochs-1:
        print(f'Nicknames generated on epoch {epoch}')
        for i, name in enumerate(['karen', 'mitch mcconnel', 'dwayne the rock johnson']):
            print(f'{name} | {generate_name(model, name)}')
        print('-------------')
Awesome, now that those functions are done, let's move on to building and training the model. I am going to use a single LSTM layer followed by a softmax output layer, trained with the Adam optimizer and categorical cross-entropy loss.
Let's create and train our model!
# model attributes
nBatch = 60     # samples per training batch
nEpochs = 1000  # number of epochs to train
nUnits = 300    # number of units in the LSTM layer
model = Sequential()
# add LSTM layer to model
model.add(LSTM(nUnits, input_shape=(max_tokens, token_dimensions), return_sequences=True))
# add model attributes
model.add(Dense(token_dimensions, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
# generate names on epochs
name_generator = LambdaCallback(on_epoch_end = generate_name_loop)
model.fit(output, input, batch_size=nBatch, epochs=nEpochs, callbacks=[name_generator], verbose=0)
model.save(f"models/word/model.bs{nBatch}.e{nEpochs}.nL{nUnits}.output")
Nicknames generated on epoch 0
karen | karen lockheed cutie flailer
mitch mcconnel | sour pakistani tropics
dwayne the rock johnson | flailer dwayne morning
-------------
Nicknames generated on epoch 200
karen | schitt guy edge michigan
mitch mcconnel | lying original half
dwayne the rock johnson | pakistani favorite
-------------
Nicknames generated on epoch 400
karen | sick puppy man michigan dumbo
mitch mcconnel | wacky woman edge flunkie cbs
dwayne the rock johnson | johnson original canada
-------------
Nicknames generated on epoch 600
karen | 0% karen congresswoman half
mitch mcconnel | 41 from
dwayne the rock johnson | goofball woman guy
-------------
Nicknames generated on epoch 800
karen | dicky karen
mitch mcconnel | high mitch mcconnel
dwayne the rock johnson | sick dwayne the
-------------
Nicknames generated on epoch 999
karen | half karen
mitch mcconnel | wheres mitch mcconnel
dwayne the rock johnson | michigan dwayne the
-------------
INFO:tensorflow:Assets written to: models/word/model.bs60.e1000.nL300.output\assets
Now that the model is trained, we can get a sense of how name generation improved at each epoch. However, we want a broader look at how well the model does with other inputs and more generations per name.
Since there is no objective metric for these nicknames, we will just need to look them over manually and decide which models to keep.
model = keras.models.load_model(f"models/word/model.bs{nBatch}.e{nEpochs}.nL{nUnits}.output")
names = 'karen,Link,Donald trump,Joe Biden,mitch mcconnel,alexandria ocasio cortez,dwayne the rock johnson'.split(',')
# model = keras.models.load_model(f"models/word/keep/model.bs30.e500.nL50.output")
print('model is loaded')
print('-----------------')
print(f'model.bs{nBatch}.e{nEpochs}.nL{nUnits}.output')
for name in names:
for i in range(5):
print(f'{name} | {generate_name(model, name)}')
print('-----------------------')
model is loaded
-----------------
karen | texas karen moonbeam
karen | gov karen wannabe
karen | american karen bozo canada wannabe
karen | karen apple canada
karen | sneaky karen crazy
-----------------------
Link | china Link moonbeam
Link | shady creepy Link
Link | cheatin Link frankenstein sham wannabe
Link | obiden Link
Link | husband Link tropics
-----------------------
Donald trump | sick Donald trump lockheed
Donald trump | and Donald trump
Donald trump | leakin Donald
Donald trump | pakistani Donald trump bozo
Donald trump | wheres Donald
-----------------------
Joe Biden | guy Joe Biden flakey
Joe Biden | sham Joe
Joe Biden | no Biden Joe moonbeam
Joe Biden | rocket Biden Joe
Joe Biden | and professor flakey 44
-----------------------
mitch mcconnel | congresswoman mitch mcconnel
mitch mcconnel | chin mitch mcconnel
mitch mcconnel | low energy crazy
mitch mcconnel | wheres mitch
mitch mcconnel | slippery puppy tropics professor puppy
-----------------------
alexandria ocasio cortez | little cortez
alexandria ocasio cortez | sick ocasio alexandria
alexandria ocasio cortez | lying
alexandria ocasio cortez | cryin alexandria
alexandria ocasio cortez | psycho alexandria ocasio
-----------------------
dwayne the rock johnson | slippery dwayne the dwayne
dwayne the rock johnson | little dwayne moonbeam the sham
dwayne the rock johnson | a dwayne
dwayne the rock johnson | fat dwayne the
dwayne the rock johnson | deranged dwayne the rock
-----------------------
The results are pretty good! Nothing spectacular, but that is part of the charm. This is a fairly unconventional method for text generation, especially with such a tiny dataset. Without a quantifiable metric, I would give this model a solid 7/10 on nickname generation, but a 9/10 on creativity.
Let's save the model to later import into the Elixir web application.
Run the cell below if we want to save the model into our keep folder.
# use this to save the model into the keep folder
model.save(f"models/word/keep/model.bs{nBatch}.e{nEpochs}.nL{nUnits}.output")
INFO:tensorflow:Assets written to: models/word/keep/model.bs60.e1000.nL300.output\assets