The data was scraped directly from Wikipedia (link here). I then cleaned and analyzed the data; click here to see that analysis.
This notebook takes a word-based approach to nickname generation. We will go through the preprocessing required to create affix embeddings, then tokenize, encode, decode, and finally train a Keras LSTM neural network.
I also walked through this process using a character-based approach; however, the results were hilariously bad.
# data manip
import pandas as pd
import numpy as np
# model building
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import LambdaCallback
# load model
from tensorflow import keras
raw = pd.read_csv('data/cleaned.nicknames.csv')
raw.head()
| | fake name | real name | len fake | len real | category | notes | count |
|---|---|---|---|---|---|---|---|
| 0 | dumbo | randolph tex alles | 1 | 3 | domestic political figures | director of the united states secret service | 1 |
| 1 | wheres hunter | hunter biden | 2 | 2 | domestic political figures | american lawyer and lobbyist who is the second... | 1 |
| 2 | 1% joe | joe biden | 2 | 2 | domestic political figures | 47th vice president of the united states; form... | 1 |
| 3 | basement joe | joe biden | 3 | 2 | domestic political figures | 47th vice president of the united states; form... | 1 |
| 4 | beijing joe | joe biden | 3 | 2 | domestic political figures | 47th vice president of the united states; form... | 1 |
Here is the data we are working with. We have the nicknames (fake name) and the corresponding real name of the individual Trump gave each nickname to. We also have a few other columns, but those will not matter for this task.
Next we need to separate each word and add a tag. We will be adding a few created tags:
Real names:
"joe biden"
will become "joe <name1> biden <name2>"
Nicknames:
"basement joe hidin"
will become "basement <prefix> <name1> <name> hidin <suffix>"
A <prefix> tag is added to every word before the real-name substitution, and a <suffix> tag to every word after it. The substitution itself gets a <name> tag. If the nickname shares no words with the real name, a <nope> tag is added to each word instead.
These modifications matter because the generated <nameX> tags let us substitute words from the user's input directly into the generated output. For example, if a user inputs "Joe Biden" and the generated name follows "<prefix> <name1>", the generation algorithm can substitute the user's <name1> with joe, so all we need to predict is the <prefix> token (the model still needs to predict which name index to pull). One fun quirk is that the predicted suffix tag is not always a true suffix (i.e., a set of embeddings can be predicted as "<suffix> <prefix>", etc.). This setup helps with training, as we only need to predict the length of the nickname, then the sequence of tags that follows, and finally the tokens that convert each tag into a real word. For example, if we predict a length-4 name as the embeddings "<prefix> <prefix> <name> <suffix>", we can generate the best real word for each category, so the generated nickname ends up with two prefix tokens, a name tag, and a suffix token; we then replace the name tag with the user's corresponding name token. This helps generalize the model and keeps the outputs quite random, which gives us more interesting and varied generated nicknames.
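The substitution step described above can be sketched as follows. This is a standalone illustration, not the notebook's actual code: `fill_tags` and the candidate word lists are hypothetical, and the real pipeline samples prefix/suffix words from the model's learned probabilities rather than from fixed lists.

```python
import random

def fill_tags(tag_sequence, user_name,
              prefixes=('sleepy', 'crooked'), suffixes=('hidin',)):
    """Replace each tag in a generated sequence with a concrete word.

    <nameN> tags are filled directly from the user's input;
    <prefix>/<suffix>/<nope> tags are drawn from candidate word lists
    here (the model predicts these in the real pipeline).
    """
    name_words = user_name.split(' ')
    out = []
    for tag in tag_sequence.split(' '):
        if tag.startswith('<name'):
            # pull the Nth word of the user's real name
            n = int(tag[5:-1]) - 1
            out.append(name_words[n])
        elif tag == '<prefix>':
            out.append(random.choice(prefixes))
        else:  # '<suffix>' or '<nope>'
            out.append(random.choice(suffixes))
    return ' '.join(out)

print(fill_tags('<prefix> <name1>', 'joe biden'))
```

Note this sketch only handles indexed `<nameN>` tags; the paired `<name>` marker tags from the tokenizer are omitted for simplicity.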
def tokenize(realname, nickname):
    '''tokenizes and adds affix embeddings to the real name and nickname, then returns a tuple of both names fully tokenized and embedded as strings'''
    # get a dictionary mapping each real-name word to its corresponding token, ie input
    real2token = {word: f'<name{X+1}>' for X, word in enumerate(realname.split(' '))}
    # convert dictionary to word-tokenized groups and join into a single string
    real_tokenized = ' '.join([f'{word} {real2token[word]}' for word in real2token])
    # change nickname into a single string with tokenization and substitution
    # grab names to substitute
    subs = [sub for sub in realname.split(' ') if sub in nickname]
    if len(subs) == 0:
        # then there are no substitutions; every token is <nope>
        nick_tokenized = ' '.join([f'{word} <nope>' for word in nickname.split(' ')])
        return (real_tokenized, nick_tokenized)
    substituted = ' '.join([word if word not in subs else f'{real2token[word]}' for word in nickname.split(' ')])
    tokens = ['<prefix>', '<suffix>', '<name>']
    index = 0
    tokenized = []
    for word in substituted.split(' '):
        if 'name' in word:
            index = 2
        tokenized.append(f'{word} {tokens[index]}')
        if index == 2:
            index = 1
    nick_tokenized = ' '.join(tokenized)
    return (real_tokenized, nick_tokenized)
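As a quick sanity check of the tagging logic, here is what the function produces for a sample pair. The definition is restated below so the example is self-contained and runnable on its own; it mirrors the function above.

```python
def tokenize(realname, nickname):
    """Tag a (real name, nickname) pair as described above."""
    real2token = {w: f'<name{i+1}>' for i, w in enumerate(realname.split(' '))}
    real_tokenized = ' '.join(f'{w} {real2token[w]}' for w in real2token)
    # words of the real name that appear in the nickname get substituted
    subs = [s for s in realname.split(' ') if s in nickname]
    if not subs:
        # no overlap with the real name: every word gets <nope>
        return real_tokenized, ' '.join(f'{w} <nope>' for w in nickname.split(' '))
    substituted = ' '.join(w if w not in subs else real2token[w]
                           for w in nickname.split(' '))
    tokens = ['<prefix>', '<suffix>', '<name>']
    index, tokenized = 0, []
    for word in substituted.split(' '):
        if 'name' in word:
            index = 2          # this position is a name substitution
        tokenized.append(f'{word} {tokens[index]}')
        if index == 2:
            index = 1          # everything after a name is a suffix
    return real_tokenized, ' '.join(tokenized)

print(tokenize('joe biden', 'basement joe'))
# → ('joe <name1> biden <name2>', 'basement <prefix> <name1> <name>')
```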
tokenized_names = [tokenize(realname, nickname) for i, nickname, realname in raw[['fake name', 'real name']].itertuples()]
print('real name | nickname')
tokenized_names[10:15]
real name | nickname
[('joe <name1> biden <name2>', 'sleepy <prefix> creepy <prefix> <name1> <name>'), ('joe <name1> biden <name2>', 'slow <prefix> <name1> <name>'), ('joe <name1> biden <name2>', '<name1> <name> hiden <suffix>'), ('joe <name1> biden <name2>', 'obiden <prefix>'), ('michael <name1> bloomberg <name2>', 'little <prefix> <name1> <name> <name2> <name>')]
Okay, now that everything is tokenized, we can start to vectorize it!
To vectorize this we need to define our vocabulary, then create a matrix for each name. Each matrix has the maximum token length as rows and the total vocabulary size as columns. Our resulting input is a list of matrices, one per name: for each token position (row), we place a 1 in the column whose index corresponds to that token in the vocabulary, i.e. a one-hot encoding per row.
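As a minimal illustration of this one-hot scheme (using a tiny toy vocabulary, not the notebook's actual `token2i`):

```python
import numpy as np

# toy vocabulary and a tokenized name (hypothetical, for illustration only)
vocab = ['<prefix>', '<name1>', 'basement', 'joe']
token2i = {t: i for i, t in enumerate(vocab)}
name = 'basement <prefix> <name1>'.split(' ')

max_tokens, token_dimensions = 4, len(vocab)
matrix = np.zeros((max_tokens, token_dimensions))
for row, token in enumerate(name):
    matrix[row, token2i[token]] = 1  # 1 at (position, vocab index)

print(matrix)
# [[0. 0. 1. 0.]
#  [1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 0. 0.]]
```

The last row is all zeros because the name is shorter than `max_tokens`; unused rows act as padding.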
##### get probabilities of name length #####
name_matrix = raw.groupby(['len real', 'len fake']).agg({'count': 'count'}).reset_index()
name_matrix = name_matrix[name_matrix['len real'] == 2]
p_gen_len = name_matrix[['len fake']]
p_gen_len['p'] = name_matrix['count'] / name_matrix['count'].sum()
##### get vocab ######
# flatten paired list to get all names
flatten = [name for pair in tokenized_names for name in pair]
# make sure the created tokens go first in the vocab dictionaries
flatten[:0] = ['<prefix>', '<suffix>', '<name>', '<nope>', '<name1>', '<name2>', '<name3>', '<name4>', '<name5>', '<name6>']
###### get dictionaries #########
# dictionary mapping each nickname to its real name
nick2real = {nickname: realname for (realname, nickname) in tokenized_names}
# dictionary of unique vocab tokens
uni_tokens = {token: 0 for name in flatten for token in name.split(' ')}
# dictionary for token to index
token2i = {token: i for i, token in enumerate(uni_tokens)}
# dictionary for index to token
i2token = {token2i[token]: token for token in token2i}
# dictionary for index to real name token
i2name = {token2i[word]: word for pair in tokenized_names for word in pair[0].split(' ') if '<' not in word}
######### math ##########
# find total number of nicknames
n = len(tokenized_names)
# find max number of tokens, ie rows in matrix
max_tokens = max([len(name.split(' ')) for name in flatten])
# find total number of tokens, ie columns in matrix
token_dimensions = len(i2token)
##### get matrices ########
# set up vectors for output = nicknames
output = np.zeros((n, max_tokens, token_dimensions))
# set up vectors for label = real names
input = np.zeros((n, max_tokens, token_dimensions))
#### vectorize names #######
for i, nickname in enumerate(nick2real):
    # input and output assignment
    for row, token in enumerate(nickname.split(' ')):
        output[i, row, token2i[token]] = 1
        input[i, row, token2i[token]] = 1
print(output[1])
[[0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
Awesome, here we can see a vectorized name! It doesn't look like much because the dimensions are very large, but each non-padding row has a single 1 in the column of that row's token. For example, the second row of this matrix has its 1 in column 0, the index of the <prefix> token, which tells us the first word is a prefix.
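To read one of these matrices back, the encoding can be reversed with an argmax over the columns. This is a sketch using the same `i2token` idea on a toy vocabulary (not the notebook's full vocabulary):

```python
import numpy as np

vocab = ['<prefix>', '<name1>', 'basement', 'joe']  # toy vocabulary
i2token = dict(enumerate(vocab))

# build a small one-hot matrix: 'basement <prefix> <name1>' plus one padding row
matrix = np.zeros((4, len(vocab)))
matrix[0, 2] = 1  # 'basement'
matrix[1, 0] = 1  # '<prefix>'
matrix[2, 1] = 1  # '<name1>'

# all-zero rows are padding; argmax over columns recovers each token index
decoded = [i2token[int(col)]
           for row, col in enumerate(matrix.argmax(axis=1))
           if matrix[row].sum() > 0]
print(decoded)  # → ['basement', '<prefix>', '<name1>']
```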
From here we need to build a few functions for the model!
Since this isn't a traditional training setup, we need to evaluate our model by actually generating nicknames. To do this, I will build a generate function and generate a few names at every nth epoch.
This generation function acts as our decoder. In essence, I use a few logical rules to ensure the quality of the model's output, since we had too little training data and vocabulary to build good distributions: sample a nickname length, predict an affix tag for each position (rejecting invalid sequences such as more name tags than the user supplied names, e.g. "<prefix> <name1> <name1>"), and then predict a concrete token for each tag.
def generate_name(model, input):
    '''generates a nickname based on length of word and input given'''
    word_vec = np.zeros((1, max_tokens, token_dimensions))
    # sample a nickname length from the empirical length distribution
    p_len = p_gen_len['p']
    gen_len = p_gen_len.loc[np.random.choice(range(len(p_len)), p=p_len), 'len fake']
    # get a dictionary mapping each created token to the corresponding input word
    token2input = {f'<name{X+1}>': word for X, word in enumerate(input.split(' '))}
    # convert dictionary to word-tokenized groups and join into a single string
    input_tokenized = ' '.join([f'{token2input[token]} {token}' for token in token2input])
    nName = 0  # tracks the number of name affixes generated so far
    affixs = []
    for index in range(gen_len):
        index = (index + 1) * 2 - 1
        # pull probabilities for each affix
        p_affix = list(model.predict(word_vec)[0, index])[0:4]
        # normalize probabilities
        p_affix_norm = p_affix / np.sum(p_affix)
        # guess an affix
        guess = np.random.choice(range(len(p_affix_norm)), p=p_affix_norm)
        affix = i2token[guess]
        # make sure names are not chosen more times than the user provided
        if 'name' in affix:
            if nName < len(token2input):
                nName += 1
            else:
                while 'name' in affix:
                    guess = np.random.choice(range(len(p_affix_norm)), p=p_affix_norm)
                    affix = i2token[guess]
        # assign value to generated name vector
        word_vec[0, index, guess] = 1
        affixs.append(affix)

    def predict_token(index, affix):
        '''generates a token based on the given affix'''
        if 'name' in affix:
            # pull probabilities for all names: index 4 to 4 + given amount (max index 9)
            p_name = list(model.predict(word_vec)[0, index])[4:4 + len(token2input)]
            p_name_norm = p_name / np.sum(p_name)
            # generate a name token
            guess = np.random.choice(range(4, len(p_name_norm) + 4), p=p_name_norm)
            # prevent repeat name tokens
            while word_vec[0, index - 2, guess] == 1.0:
                guess = np.random.choice(range(4, len(p_name_norm) + 4), p=p_name_norm)
            token = i2token[guess]
            word = token2input[token]
        else:
            # affix is prefix, suffix, or nope
            # grab probabilities of all non-affix, non-name tokens: index 10 to the end
            p_tokens = list(model.predict(word_vec)[0, index])[10:]
            p_tokens_norm = p_tokens / np.sum(p_tokens)
            # generate a guess from probabilities
            guess = np.random.choice(range(10, len(p_tokens_norm) + 10), p=p_tokens_norm)
            # prevent name vocab from being chosen and prevent repeat tokens
            while guess in i2name or word_vec[0, index - 2, guess] == 1.0:
                guess = np.random.choice(range(10, len(p_tokens_norm) + 10), p=p_tokens_norm)
            word = i2token[guess]
        word_vec[0, index, guess] = 1
        return word

    tokens = [predict_token(i * 2, affix) for i, affix in enumerate(affixs)]
    return " ".join(tokens)
def generate_name_loop(epoch, _):
    '''tells the model when to generate names during training'''
    if epoch % (nEpochs//5) == 0 or epoch == nEpochs-1:
        print(f'Nicknames generated on epoch {epoch}')
        for i, name in enumerate(['karen', 'mitch mcconnel', 'dwayne the rock johnson']):
            print(f'{name} | {generate_name(model, name)}')
        print('-------------')
Awesome, now that those functions are done, let's move on to building and training the model. I am going to use a single LSTM layer followed by a softmax output layer, trained with the Adam optimizer and categorical cross-entropy loss.
Let's create and train our model!
# model attributes
nBatch = 60     # samples per training batch
nEpochs = 1000  # number of epochs to train
nUnits = 300    # number of units in the LSTM layer
model = Sequential()
# add LSTM layer to model
model.add(LSTM(nUnits, input_shape=(max_tokens, token_dimensions), return_sequences=True))
# add model attributes
model.add(Dense(token_dimensions, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
# generate names on epochs
name_generator = LambdaCallback(on_epoch_end = generate_name_loop)
model.fit(output, input, batch_size=nBatch, epochs=nEpochs, callbacks=[name_generator], verbose=0)
model.save(f"models/word/model.bs{nBatch}.e{nEpochs}.nL{nUnits}.output")
Nicknames generated on epoch 0
karen | karen lockheed cutie flailer
mitch mcconnel | sour pakistani tropics
dwayne the rock johnson | flailer dwayne morning
-------------
Nicknames generated on epoch 200
karen | schitt guy edge michigan
mitch mcconnel | lying original half
dwayne the rock johnson | pakistani favorite
-------------
Nicknames generated on epoch 400
karen | sick puppy man michigan dumbo
mitch mcconnel | wacky woman edge flunkie cbs
dwayne the rock johnson | johnson original canada
-------------
Nicknames generated on epoch 600
karen | 0% karen congresswoman half
mitch mcconnel | 41 from
dwayne the rock johnson | goofball woman guy
-------------
Nicknames generated on epoch 800
karen | dicky karen
mitch mcconnel | high mitch mcconnel
dwayne the rock johnson | sick dwayne the
-------------
Nicknames generated on epoch 999
karen | half karen
mitch mcconnel | wheres mitch mcconnel
dwayne the rock johnson | michigan dwayne the
-------------
INFO:tensorflow:Assets written to: models/word/model.bs60.e1000.nL300.output\assets
Now that the model is trained, we can get a sense of how name generation improved at each epoch. However, we want a broader look at how well the model does with other inputs and more generations per name.
Since there is no objective metric for these nicknames, we will just need to look them over manually and decide which models to keep.
model = keras.models.load_model(f"models/word/model.bs{nBatch}.e{nEpochs}.nL{nUnits}.output")
names = 'karen,Link,Donald trump,Joe Biden,mitch mcconnel,alexandria ocasio cortez,dwayne the rock johnson'.split(',')
# model = keras.models.load_model(f"models/word/keep/model.bs30.e500.nL50.output")
print('model is loaded')
print('-----------------')
print(f'model.bs{nBatch}.e{nEpochs}.nL{nUnits}.output')
for name in names:
for i in range(5):
print(f'{name} | {generate_name(model, name)}')
print('-----------------------')
model is loaded
-----------------
karen | texas karen moonbeam
karen | gov karen wannabe
karen | american karen bozo canada wannabe
karen | karen apple canada
karen | sneaky karen crazy
-----------------------
Link | china Link moonbeam
Link | shady creepy Link
Link | cheatin Link frankenstein sham wannabe
Link | obiden Link
Link | husband Link tropics
-----------------------
Donald trump | sick Donald trump lockheed
Donald trump | and Donald trump
Donald trump | leakin Donald
Donald trump | pakistani Donald trump bozo
Donald trump | wheres Donald
-----------------------
Joe Biden | guy Joe Biden flakey
Joe Biden | sham Joe
Joe Biden | no Biden Joe moonbeam
Joe Biden | rocket Biden Joe
Joe Biden | and professor flakey 44
-----------------------
mitch mcconnel | congresswoman mitch mcconnel
mitch mcconnel | chin mitch mcconnel
mitch mcconnel | low energy crazy
mitch mcconnel | wheres mitch
mitch mcconnel | slippery puppy tropics professor puppy
-----------------------
alexandria ocasio cortez | little cortez
alexandria ocasio cortez | sick ocasio alexandria
alexandria ocasio cortez | lying
alexandria ocasio cortez | cryin alexandria
alexandria ocasio cortez | psycho alexandria ocasio
-----------------------
dwayne the rock johnson | slippery dwayne the dwayne
dwayne the rock johnson | little dwayne moonbeam the sham
dwayne the rock johnson | a dwayne
dwayne the rock johnson | fat dwayne the
dwayne the rock johnson | deranged dwayne the rock
-----------------------
The results are pretty good! Nothing spectacular, but that is part of the charm. This is a fairly unconventional method for text generation, especially with such a tiny dataset. Without a quantifiable metric, I would give this model a solid 7/10 on nickname generation, but a 9/10 on creativity.
Let's save the model to later import into the Elixir web application.
Run the cell below if we want to save the model into our keep folder.
# use this to save the model into the keep folder
model.save(f"models/word/keep/model.bs{nBatch}.e{nEpochs}.nL{nUnits}.output")
INFO:tensorflow:Assets written to: models/word/keep/model.bs60.e1000.nL300.output\assets