NLP tools for Twitter

The first portion:

Consists of scripts that connect to Twitter's API, pull the last 3,200 tweets from any given account, and store them in a CSV with a bit of metadata.

The second portion:

Consists of functions I've developed to clean and parse these tweets into a format suited for analysis and for use in deep learning / machine learning applications.

The third portion:

Borrows from the embedding and word2vec lessons in Udacity's Deep Learning Nanodegree. In this portion, we build an embedding network with TensorFlow that 'learns' the relationships between words in the corpus of tweets from a particular user. This is more of a thought experiment to see whether we can develop a better understanding of a particular user's worldview through the way they express themselves on Twitter.

The fourth portion:

Will be dedicated to building a seq2seq network to generate new tweets in the style of the target user's tweets.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import tensorflow as tf
import tweepy 
import time
import csv
import pandas as pd
import re
from collections import Counter
from nltk.corpus import stopwords
import random

Section 1:

Get the API keys set up and test to make sure things are running smoothly

In [3]:
consumer_key = 'consumer_key'
consumer_secret = 'consumer_secret'

access_token = 'access_token'
access_secret = 'access_secret'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

public_tweets = api.home_timeline()
tweet_bucket = [tweet.text for tweet in public_tweets]

print(tweet_bucket[:5])
['Google Home now has a screen — and Spotify https://t.co/4ub1nxZO4l https://t.co/ILe6bRVGA0', 'Google will let you search LinkedIn, Monster, Facebook, and more job sites all at once https://t.co/DKT0cvRxgY https://t.co/SqndE62r3p', 'McCaw does best Iguodala imitation in Game 2 win https://t.co/D4UlXfKBxK https://t.co/OSUuOS7N5h', "'In the right place at the right time:' Tow truck driver helps secure SUV dangling off I-5 https://t.co/tEQRbCXN3j… https://t.co/A5UAa1Cc5E", '[Podcast] What *is* content marketing? https://t.co/qS75MwN3cm #growth']

Next step:

Use Tweepy API to pull tweets and associated metadata from Trump's profile.

In [34]:
# Testing and checking to make sure we're creating the appropriate class instance
user = api.get_user('realDonaldTrump')

print(user.screen_name)
print(user.followers_count)
realDonaldTrump
29664751
In [5]:
def get_all_tweets(screen_name):
    # Borrowed and optimized for python 3 from https://drive.google.com/file/d/0Bw1LIIbSl0xuNnJ0N1ppSkRjQjQ/view
    
    # Authenticate with keys and establish connection
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    api = tweepy.API(auth)
    
    # List to hold tweets
    allTweets = []
    
    new_tweets = api.user_timeline(screen_name = screen_name, count=200)
    allTweets.extend(new_tweets)
    
    oldest = allTweets[-1].id - 1
    
    while len(new_tweets) > 0:
        # All subsequent requests use the max_id param to prevent duplicates
        new_tweets = api.user_timeline(screen_name = screen_name, count=200, max_id=oldest)
        
        # Save most recent tweets
        allTweets.extend(new_tweets)
        
        # Update the id of the oldest tweet less one
        oldest = allTweets[-1].id - 1

        print('{} tweets downloaded so far'.format(len(allTweets)))
    
    # Build list with tweet text and some metadata
    # Encoding the tweet text to UTF-8 bytes was making this really buggy (lots of stray characters), so the text is kept as a plain string
    outtweets = [[tweet.id_str, tweet.created_at, tweet.text] for tweet in allTweets]

    # Write to csv
    with open('{}s_tweets.csv'.format(screen_name), 'w') as f:
        writer = csv.writer(f)
        writer.writerow(["id","created_at","text"])
        writer.writerows(outtweets)

    # The with block closes the file automatically
    print("Done")
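If you plan to pull histories for several accounts back to back, the 3,200-tweet pull can run into Twitter's rate limits. A minimal tweak (a sketch, relying on Tweepy's wait_on_rate_limit flag) is to let Tweepy sleep and retry automatically:

# Sketch: block and retry automatically when the rate limit is hit
api = tweepy.API(auth, wait_on_rate_limit=True)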
In [6]:
get_all_tweets('realdonaldtrump')
400 tweets downloaded so far
600 tweets downloaded so far
800 tweets downloaded so far
999 tweets downloaded so far
1199 tweets downloaded so far
1399 tweets downloaded so far
1599 tweets downloaded so far
1799 tweets downloaded so far
1999 tweets downloaded so far
2199 tweets downloaded so far
2399 tweets downloaded so far
2599 tweets downloaded so far
2799 tweets downloaded so far
2999 tweets downloaded so far
3199 tweets downloaded so far
3210 tweets downloaded so far
3210 tweets downloaded so far
Done
In [7]:
# Read csv into dataframe for manipulation, testing column to run tests and maintain
# integrity of the original text column. Also makes it easy to spot differences.
tweetFrame = pd.read_csv('realdonaldtrumps_tweets.csv', encoding='utf-8')

tweetFrame['testing'] = tweetFrame['text']

tweetFrame.head()
Out[7]:
id created_at text testing
0 864565791076872192 2017-05-16 19:38:42 It was a great honor to welcome the President ... It was a great honor to welcome the President ...
1 864511331029921792 2017-05-16 16:02:17 'U.S. Industrial Production Surged in April' h... 'U.S. Industrial Production Surged in April' h...
2 864452996129853444 2017-05-16 12:10:29 I have been asking Director Comey & others... I have been asking Director Comey & others...
3 864438529472049152 2017-05-16 11:13:00 ...to terrorism and airline flight safety. Hum... ...to terrorism and airline flight safety. Hum...
4 864436162567471104 2017-05-16 11:03:36 As President I wanted to share with Russia (at... As President I wanted to share with Russia (at...

Section 2:

Functions to clean and parse a corpus of tweets for use in machine learning / deep learning applications

In [9]:
import re

def cleanTweets(tweetList):
    """
    Takes: list of Tweets
    Cleans tweets so that only words, @mentions, and hashtags are left. 
    Apostrophes in contractions are kept, but other apostrophes are removed.
    Returns: a list of clean tweets for use in other applications
    """
    p = re.compile('^b'
                   "|'(?!(?<! ')[uvts])"
                   '|"'
                   "|^\\\\*"
                   "|^\\n"
                   "|\\n+"
                   "|LIVE?.*\w"
                   "|Watch?.*\w"
                   "|https?:\/\/.*[\r\n]*"
                   "|http?\/\/.*[\r\n]*")
    
    cleaned = []
    
    for i in range(len(tweetList)):
        cleaned.append(p.sub(' ', tweetList[i]).replace('\\n', '').strip())
        
        # Change number after % to print status at different intervals
        if i % 500 == 0:
            print('Tweets Cleaned: {}'.format(i))
            
    return cleaned
In [10]:
# Reading the tweets into a list before cleaning is much faster
# than running the tweet cleaner over the dataframe
tweetBucket = [tweet for tweet in tweetFrame['text']]
In [11]:
cleanBucket = cleanTweets(tweetBucket)
Tweets Cleaned: 0
Tweets Cleaned: 500
Tweets Cleaned: 1000
Tweets Cleaned: 1500
Tweets Cleaned: 2000
Tweets Cleaned: 2500
Tweets Cleaned: 3000
In [17]:
print(tweetBucket[5])
print(cleanBucket[5])
#PeaceOfficersMemorialDay and
#PoliceWeek Proclamation: https://t.co/o4IXVfZuHw https://t.co/UMJ6hklx4a
#PeaceOfficersMemorialDay and #PoliceWeek Proclamation:
In [18]:
def tokenizeThatIsh(wordList):
    """
    Takes: list of words
    Replaces any instance of the token key found in the list of words
    with its corresponding token found in token values. This will also
    add one space on each side of the token as a buffer.
    Returns: tokenized list of words, token_values set object
    """
    
    
    token_keys = ['.', ',', '"', ';', '!', '?', '(', ')', '--', '\n', '&amp', '...', '-', ':', '…']
    token_values = ['||period||', '||comma||', '||quotation||', '||semicolon||', '||exclamation||',
                '||question||', '||lparentheses||', '||rparentheses||', '||dash||', '||return||', '||ampersand||',
                   '||ellipses||', '||hyphen||', '||colon||', '||ellipsestoo||']
    
    # Build the mapping from the parallel lists first; zipping the two sets
    # would pair keys and values in arbitrary order
    tokenDict = dict(zip(token_keys, token_values))
    token_values = set(token_values)
    
    tokenized = []
    
    for word in wordList:
        for key, value in tokenDict.items():
            word = word.replace(key, ' {} '.format(value))
        tokenized.append(word)
        
    return tokenized, token_values
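
A quick usage check on a toy input (a sketch; the double spaces come from the one-space buffer added on each side of a token):

print(tokenizeThatIsh(["Great, thanks!"])[0])
# ['Great ||comma||  thanks ||exclamation|| ']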
    
In [35]:
def parseCorpus(tweetList, tokenized=True):
    """
    Takes: List of tweets, flag for tokenization
    Iterates through all words in corpus to remove stopwords and adds lower-case
    words into a list to use in an embedding layer. If tokenized, also removes the tokens
    to return a list of words without punctuation to help with training and visualization.
    Returns: list of words, lookup tables for embedding, and some basic stats
    """
    
    words = []
    counts = []
    word_len = []
    stop_word_counter = []
    
    stop = set(stopwords.words('english'))
    
    # Lower case, multi-dimensional tokenized list of tweets
    if tokenized:
        bucket, token_values = tokenizeThatIsh(tweetList)
    else:
        bucket = tweetList
    
    for i, _ in enumerate(bucket):
        temp = bucket[i].split()
        for word in temp:
            if word.lower() not in stop:
                if tokenized and word.lower() not in token_values:
                    words.append(word.lower())
                elif not tokenized:
                    words.append(word.lower())
            else:
                stop_word_counter.append(word.lower())
                
            word_len.append(len(word))
            
        count = len(temp)
        counts.append(count)
        
    wordset = set(words)
    avgCount = sum(counts) // len(counts)
    total_count = sum(counts)
    avg_word_length = sum(word_len) // len(word_len)
    
    vocab_to_int = {c: i for i, c in enumerate(wordset)}
    int_to_vocab = dict(enumerate(wordset))
    
    print('Total words in corpus: {}'.format(total_count))
    print('Total words without stop-words: {}'.format(len(words)))
    print('Number of stop-words removed: {}'.format(len(stop_word_counter)))
    print('Number of unique tokens / words in corpus: {}'.format(len(wordset)))
    print('Average tokens per tweet: {}'.format(avgCount))
    print('Average characters per word: {}'.format(avg_word_length))
    
    return words, vocab_to_int, int_to_vocab, total_count

Checking stats

For tokenized and untokenized versions.

The theory here is that tokenization will reduce the number of unique words and make it easier for the stop word filter to work its magic.
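To see this on a toy example (a sketch using the tokenizeThatIsh helper above): without tokenization, 'great', 'great.', and 'great,' would each count as a distinct word; once the punctuation is split out into tokens, only one unique word remains.

# Sketch: three surface forms collapse to a single vocabulary entry after tokenization
demo, demo_tokens = tokenizeThatIsh(['great.', 'great', 'great,'])
vocab = {w.lower() for tweet in demo for w in tweet.split() if w.lower() not in demo_tokens}
print(vocab)  # {'great'}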

In [36]:
parsed_words, v2int, int2v, total_count = parseCorpus(cleanBucket, tokenized=True)
Total words in corpus: 66040
Total words without stop-words: 33109
Number of stop-words removed: 22315
Number of unique tokens / words in corpus: 6320
Average tokens per tweet: 20
Average characters per word: 6
In [37]:
parsed_words, v2int, int2v, total_count = parseCorpus(cleanBucket, tokenized=False)
Total words in corpus: 55334
Total words without stop-words: 34340
Number of stop-words removed: 20994
Number of unique tokens / words in corpus: 9741
Average tokens per tweet: 17
Average characters per word: 5

Subsampling

The idea here is to probabilistically discard words that occur very frequently, since they carry little information. If we discard some of them, we can remove some of the noise from our data, train faster, and have better representations. This process is called subsampling by Mikolov. For each word $w_i$ in the training set, we'll discard it with probability given by

$$ P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}} $$

where $t$ is a threshold parameter and $f(w_i)$ is the frequency of word $w_i$ in the total dataset.
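For intuition, a quick worked example of the formula (a sketch; the threshold here matches the 1e-3 used later in finalPrep): a word making up 1% of the corpus is dropped roughly 68% of the time, while any word at or below the threshold frequency is never dropped.

def drop_prob(freq, threshold=1e-3):
    # P(w_i) = 1 - sqrt(t / f(w_i)), clipped at 0 for words rarer than the threshold
    return max(0.0, 1 - np.sqrt(threshold / freq))

print(drop_prob(0.01))    # ~0.68 for a very common word (1% of the corpus)
print(drop_prob(0.0005))  # 0.0 -> rare words are always kept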

In [24]:
def subSample(wordList, threshold=1e-6):
    """
    Takes: list of individual words, threshold
    Evaluates frequency of occurrence for each word in list and
    measures against a drop probability to filter out words that
    don't occur regularly and improve efficiency for downstream operations
    """
    # Create counter for word occurrences
    word_counts = Counter(wordList)
    
    # Dictionary of words mapped to their frequency of occurrence
    freqs = {word: float(count/total_count) for word, count in word_counts.items()}
    
    # Drop probability for each word
    p_drop = {word: 1 - np.sqrt(threshold/freqs[word]) for word in wordList}
    
    # Cleaned list of words given through Mikolov subsampling process
    train_words = [word for word in wordList if p_drop[word] < random.random()]
    
    return train_words
    
    
In [25]:
# Final function to parse Tweet corpus and filter into training array
def finalPrep(tweetList):
    """
    Takes: list of tweets
    Runs a few of my helper functions
    Returns: flattened list of training words, their integer representations, and the int2v dictionary
    """
    
    # Get parsed_words, vocab_to_int, int_to_vocab, total_count
    parsed_words, v2int, int2v, total_count = parseCorpus(tweetList, tokenized=True)
    
    # Subsampling to probabilistically drop very frequent words
    train_words = subSample(parsed_words, threshold=1e-3)
    
    # Int representation of words to pass through network
    int_words = [v2int[word] for word in train_words]
    
    print('Number of words after subsampling: {}'.format(len(train_words)))
    return train_words, int_words, int2v
    
    
In [26]:
train_words, int_words, int2v = finalPrep(cleanBucket)
Total words in corpus: 66040
Total words without stop-words: 33109
Number of stop-words removed: 22315
Number of unique tokens / words in corpus: 6320
Average tokens per tweet: 20
Average characters per word: 6
Number of words after subsampling: 30378

Now that the tweets have been cleaned

The next section borrows and expands from the embedding lessons found in Udacity's DL Nanodegree.

The goal is to create a network that 'learns' relationships between words in a corpus of text. In this particular example, training the network on a corpus of tweets should provide some insight into that user's worldview as expressed through the syntax and content of their tweets.

In [27]:
def get_target(words, idx, window_size=5):
    ''' Get a list of words in a window around an index. 
    Samples from a random distribution so that words further from the index
    are sampled less regularly, resulting in less weight being given to terms
    that are further away from the index'''
    
    R = np.random.randint(1, window_size+1)
    start = idx - R if (idx - R) > 0 else 0
    stop = idx + R

    target_words = set(words[start:idx] + words[idx+1:stop+1])
    
    return list(target_words)
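A quick usage example (a sketch; the window radius is sampled at random, so the output varies between calls but always comes from within window_size of the index):

toy = ['make', 'america', 'great', 'again', 'maga', 'covfefe']
print(get_target(toy, idx=2, window_size=2))
# e.g. ['america', 'again'] or ['again', 'maga', 'make', 'america']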
In [28]:
def get_batches(words, batch_size, window_size=5):
    ''' Create a generator of word batches as a tuple (inputs, targets) '''
    
    n_batches = len(words)//batch_size
    
    # only full batches
    words = words[:n_batches*batch_size]
    
    for idx in range(0, len(words), batch_size):
        x, y = [], []
        batch = words[idx:idx+batch_size]
        for ii in range(len(batch)):
            batch_x = batch[ii]
            batch_y = get_target(batch, ii, window_size)
            y.extend(batch_y)
            x.extend([batch_x]*len(batch_y))
        
        yield x, y
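
And a usage example for the batch generator (a sketch; each input id is repeated once per target sampled around it, so x and y always have the same length):

batches = get_batches(int_words, batch_size=128, window_size=5)
x, y = next(batches)
print(len(x) == len(y))  # True
print(x[:3], y[:3])      # first few (input, target) id pairs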
    
In [29]:
train_graph = tf.Graph()
with train_graph.as_default():
    inputs = tf.placeholder(tf.int32, [None], name='inputs')
    labels = tf.placeholder(tf.int32, [None, None], name='labels')
In [30]:
print(len(int_words))
print(len(int2v))
30378
6320
In [31]:
n_vocab = len(int2v)
n_embedding = 400 # Number of embedding features 
with train_graph.as_default():
    embedding = tf.Variable(tf.random_uniform((n_vocab, n_embedding), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs)
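Conceptually, tf.nn.embedding_lookup is just row indexing into the embedding matrix; a NumPy analogy (a sketch with made-up word ids):

# Sketch: looking up embeddings is selecting rows of the (n_vocab, n_embedding) matrix
embedding_np = np.random.uniform(-1, 1, size=(n_vocab, n_embedding))
word_ids = [5, 42, 7]                # hypothetical input ids
print(embedding_np[word_ids].shape)  # (3, 400)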
In [32]:
# Number of negative labels to sample
n_sampled = 150
lr = 0.003
with train_graph.as_default():
    softmax_w = tf.Variable(tf.truncated_normal((n_vocab, n_embedding), stddev=0.1))
    softmax_b = tf.Variable(tf.zeros(n_vocab))
    
    # Calculate the loss using negative sampling
    loss = tf.nn.sampled_softmax_loss(softmax_w, softmax_b, 
                                      labels, embed,
                                      n_sampled, n_vocab)
    
    cost = tf.reduce_mean(loss)
    optimizer = tf.train.AdamOptimizer(lr).minimize(cost)
In [33]:
with train_graph.as_default():
    ## Borrowed from Thushan Ganegedara's implementation
    ## https://github.com/thushv89/udacity_deeplearning_complete/blob/master/5_word2vec.ipynb
    
    valid_size = 20 # Random set of words to evaluate similarity on.
    valid_window = 100
    
    # pick valid_size//2 samples from each of the ranges (0,100) and (1000,1100); lower ids imply more frequent words
    valid_examples = np.array(random.sample(range(valid_window), valid_size//2))
    valid_examples = np.append(valid_examples, 
                               random.sample(range(1000,1000+valid_window), valid_size//2))

    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
    
    # We use the cosine distance:
    norm = tf.sqrt(tf.reduce_sum(tf.square(embedding), 1, keep_dims=True))
    normalized_embedding = embedding / norm
    valid_embedding = tf.nn.embedding_lookup(normalized_embedding, valid_dataset)
    similarity = tf.matmul(valid_embedding, tf.transpose(normalized_embedding))
In [33]:
# If the checkpoints directory doesn't exist:
!mkdir checkpoints
mkdir: checkpoints: File exists
In [34]:
epochs = 40
batch_size = 128
window_size = 5

with train_graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=train_graph) as sess:
    iteration = 1
    loss = 0
    sess.run(tf.global_variables_initializer())

    for e in range(1, epochs+1):
        batches = get_batches(int_words, batch_size, window_size)
        start = time.time()
        for x, y in batches:
            
            feed = {inputs: x,
                    labels: np.array(y)[:, None]}
            
            train_loss, _ = sess.run([cost, optimizer], feed_dict=feed)
            
            loss += train_loss
            
            if iteration % 100 == 0: 
                end = time.time()
                print("Epoch {}/{}".format(e, epochs),
                      "Iteration: {}".format(iteration),
                      "Avg. Training loss: {:.4f}".format(loss/100),
                      "{:.4f} sec/batch".format((end-start)/100))
                loss = 0
                start = time.time()
            
            if iteration % 1000 == 0:
                # note that this is expensive (~20% slowdown if computed every 500 steps)
                sim = similarity.eval()
                for i in range(valid_size):
                    valid_word = int2v[valid_examples[i]]
                    top_k = 8 # number of nearest neighbors
                    nearest = (-sim[i, :]).argsort()[1:top_k+1]
                    log = 'Nearest to %s:' % valid_word
                    for k in range(top_k):
                        close_word = int2v[nearest[k]]
                        log = '%s %s,' % (log, close_word)
                    print(log)
            
            iteration += 1
            
    save_path = saver.save(sess, "checkpoints/text8.ckpt")
    embed_mat = sess.run(normalized_embedding)
Epoch 1/40 Iteration: 100 Avg. Training loss: 4.6379 0.0398 sec/batch
Epoch 1/40 Iteration: 200 Avg. Training loss: 4.4665 0.0374 sec/batch
Epoch 2/40 Iteration: 300 Avg. Training loss: 3.8966 0.0242 sec/batch
Epoch 2/40 Iteration: 400 Avg. Training loss: 3.3847 0.0373 sec/batch
Epoch 3/40 Iteration: 500 Avg. Training loss: 3.1680 0.0096 sec/batch
Epoch 3/40 Iteration: 600 Avg. Training loss: 2.7987 0.0373 sec/batch
Epoch 3/40 Iteration: 700 Avg. Training loss: 2.7371 0.0374 sec/batch
Epoch 4/40 Iteration: 800 Avg. Training loss: 2.4832 0.0331 sec/batch
Epoch 4/40 Iteration: 900 Avg. Training loss: 2.3504 0.0377 sec/batch
Epoch 5/40 Iteration: 1000 Avg. Training loss: 2.2746 0.0196 sec/batch
Nearest to killing: commercial, rumors, adviser, clapper, #anthonyweiner, favors, reach, going,
Nearest to sneak: changer, soars, @scottwrasmussen, sweeping, openings, figures, @foxnation, horrors,
Nearest to w/: lima, veteran, force, stupor, raising, regarding, extraordinary, drawn,
Nearest to inaccurately: knocking, @tzard000, shortly, yesterday, packed, contract, w/fbi, sources,
Nearest to remarkably: common, arabella, always, conceived, small, announced, amendment, 185,
Nearest to states: ignoring, partnership, prosecuted, trump's, ashamed, voice, seems, bubba,
Nearest to phrase: @anncoulter, daytona, telepromter, @reploubarletta, bombshell, motor, jinping, july,
Nearest to traitor: #ahca, goi, future, 2004, information, 11pm, deflect, conde,
Nearest to ben: boynton, tariffs, containing, exhausted, #wattersworld, @lunsfordwhitney, @joebowman12, rumors,
Nearest to capable: eleven, syrian, wherever, augustine, @darrellissa, hacks, grandstand, walking,
Nearest to pulling: nieto, magnificent, number, slog, compared, due, @atensnut, misrepresented,
Nearest to romney: 1pme, ad, here’s, legal, @buiidthewall, gov, deflect, air,
Nearest to erratic: wil, accomplishments, @jlund04, “open, answered, column, anger, i've,
Nearest to gee: price, she’ll, doesn't, safe, disney, lopez, gave, opens,
Nearest to rolling: munich, briefing, charged, involved, cnn, truth, 8pm, saucier's,
Nearest to atlanta: waters, estes, pueblo, frankly, extraordinary, hacking, #trump2016, anna,
Nearest to 823: disgusting, hypocrites, beauty, reason, 92, auto, softball, alive,
Nearest to shinzo: trump/russia, hacks, secretly, alternatives, others, protect, abe, shouldn't,
Nearest to ireland: w, alongside, @maidaa17, patriots, 226th, 24, goofy, georgia,
Nearest to #bigleag: media, trashes, held, convention, #voterregistrationday, heroes, neither, goofy,
Epoch 5/40 Iteration: 1100 Avg. Training loss: 2.2202 0.0379 sec/batch
Epoch 6/40 Iteration: 1200 Avg. Training loss: 2.1163 0.0056 sec/batch
Epoch 6/40 Iteration: 1300 Avg. Training loss: 2.0226 0.0377 sec/batch
Epoch 6/40 Iteration: 1400 Avg. Training loss: 2.0526 0.0377 sec/batch
Epoch 7/40 Iteration: 1500 Avg. Training loss: 2.0093 0.0308 sec/batch
Epoch 7/40 Iteration: 1600 Avg. Training loss: 1.9200 0.0376 sec/batch
Epoch 8/40 Iteration: 1700 Avg. Training loss: 1.8833 0.0158 sec/batch
Epoch 8/40 Iteration: 1800 Avg. Training loss: 1.8630 0.0401 sec/batch
Epoch 9/40 Iteration: 1900 Avg. Training loss: 1.8683 0.0016 sec/batch
Epoch 9/40 Iteration: 2000 Avg. Training loss: 1.7969 0.0405 sec/batch
Nearest to killing: commercial, 500%, clapper, going, bad, reach, @ainsleyearhardt, favors,
Nearest to sneak: changer, soars, @scottwrasmussen, sweeping, #trumppence16, openings, regards, figures,
Nearest to w/: lima, raising, force, turnout, tel, courage, stupor, spoken,
Nearest to inaccurately: knocking, @tzard000, yesterday, shortly, nasty, w/fbi, defrauded, control,
Nearest to remarkably: announced, common, facing, small, funds, amendment, #votetrump2016, always,
Nearest to states: egypt, ignoring, partnership, @pastormarkburns, subpoenaed, ashamed, voice, prosecuted,
Nearest to phrase: @anncoulter, bombshell, daytona, telepromter, speaker, exclusive, praising, hatred,
Nearest to traitor: goi, deflect, #ahca, future, sex, conde, 2004, information,
Nearest to ben: carson, containing, boynton, tariffs, electricity, @joebowman12, fury, exhausted,
Nearest to capable: eleven, wherever, syrian, suspending, augustine, hacks, walking, developing,
Nearest to pulling: number, slog, nieto, ppl, magnificent, due, compared, misrepresented,
Nearest to romney: 1pme, legal, @buiidthewall, air, cousin, sass, ad, hershey,
Nearest to erratic: wil, answered, violent, @jlund04, yt, column, #60minutes, “open,
Nearest to gee: reporter, price, safe, gave, disney, she’ll, lopez, @american32,
Nearest to rolling: munich, thunder, briefing, cnn, involved, trump=competence, meet, kill,
Nearest to atlanta: extraordinary, pueblo, #trump2016, anna, waters, estes, businessman, says,
Nearest to 823: disgusting, reason, beauty, insecure, alive, atlantic, hypocrites, bemoan,
Nearest to shinzo: abe, trump/russia, hacks, others, protect, secretly, afternoon, alternatives,
Nearest to ireland: @whitehouse, w, @senatemajldr, kvirikashvili, sex, 226th, @maidaa17, goofy,
Nearest to #bigleag: media, trashes, held, convention, ten, heroes, thursdays, #voterregistrationday,
Epoch 9/40 Iteration: 2100 Avg. Training loss: 1.8277 0.0384 sec/batch
Epoch 10/40 Iteration: 2200 Avg. Training loss: 1.7841 0.0259 sec/batch
Epoch 10/40 Iteration: 2300 Avg. Training loss: 1.8151 0.0384 sec/batch
Epoch 11/40 Iteration: 2400 Avg. Training loss: 1.7692 0.0114 sec/batch
Epoch 11/40 Iteration: 2500 Avg. Training loss: 1.7990 0.0394 sec/batch
Epoch 11/40 Iteration: 2600 Avg. Training loss: 1.7320 0.0384 sec/batch
Epoch 12/40 Iteration: 2700 Avg. Training loss: 1.7216 0.0344 sec/batch
Epoch 12/40 Iteration: 2800 Avg. Training loss: 1.7428 0.0374 sec/batch
Epoch 13/40 Iteration: 2900 Avg. Training loss: 1.6819 0.0207 sec/batch
Epoch 13/40 Iteration: 3000 Avg. Training loss: 1.6919 0.0373 sec/batch
Nearest to killing: commercial, clapper, sweep, 500%, bad, housing, favors, @ainsleyearhardt,
Nearest to sneak: pearl, soars, @scottwrasmussen, changer, gave, figures, sweeping, @foxnation,
Nearest to w/: lima, raising, stupor, force, spoken, courage, effort, extraordinary,
Nearest to inaccurately: nasty, knocking, @tzard000, shortly, yesterday, certain, control, w/fbi,
Nearest to remarkably: announced, opinion, always, small, lawfare, 185, common, facing,
Nearest to states: voice, partnership, @pastormarkburns, burying, #defendthesecond, egypt, guard, subpoenaed,
Nearest to phrase: @anncoulter, praising, daytona, bombshell, justice, exclusive, telepromter, july,
Nearest to traitor: goi, manning, ungrateful, #ahca, deflect, sex, conde, hearing,
Nearest to ben: carson, hud, containing, seriously, boynton, electricity, exhausted, tariffs,
Nearest to capable: eleven, developing, suspending, wherever, syrian, augustine, hacks, walking,
Nearest to pulling: number, slog, ppl, nieto, due, magnificent, compared, misrepresented,
Nearest to romney: 1pme, ad, sass, deflect, #orprimary, legal, air, susan,
Nearest to erratic: violent, wil, answered, #60minutes, journal, accomplishments, @jlund04, describing,
Nearest to gee: reporter, safe, @thebrodyfile, gave, 45pm, difference, price, ✔️,
Nearest to rolling: thunder, munich, @scottwrasmussen, perry, kill, trump=competence, involved, briefing,
Nearest to atlanta: extraordinary, #trump2016, pueblo, hacking, 11pm, noon, michigan, estes,
Nearest to 823: disgusting, deleted, insecure, reason, bemoan, hypocrites, improvements, alive,
Nearest to shinzo: abe, trump/russia, hacks, household, afternoon, protect, @realericjallen, alternatives,
Nearest to ireland: w, @whitehouse, visit, patriots, @senatemajldr, alongside, kvirikashvili, @maidaa17,
Nearest to #bigleag: trashes, media, heroes, releasing, @chernuna, ten, votes, thursdays,
Epoch 14/40 Iteration: 3100 Avg. Training loss: 1.6579 0.0072 sec/batch
Epoch 14/40 Iteration: 3200 Avg. Training loss: 1.6417 0.0397 sec/batch
Epoch 14/40 Iteration: 3300 Avg. Training loss: 1.7179 0.0377 sec/batch
Epoch 15/40 Iteration: 3400 Avg. Training loss: 1.6280 0.0303 sec/batch
Epoch 15/40 Iteration: 3500 Avg. Training loss: 1.6485 0.0386 sec/batch
Epoch 16/40 Iteration: 3600 Avg. Training loss: 1.6104 0.0172 sec/batch
Epoch 16/40 Iteration: 3700 Avg. Training loss: 1.6341 0.0373 sec/batch
Epoch 17/40 Iteration: 3800 Avg. Training loss: 1.6339 0.0030 sec/batch
Epoch 17/40 Iteration: 3900 Avg. Training loss: 1.5689 0.0370 sec/batch
Epoch 17/40 Iteration: 4000 Avg. Training loss: 1.6277 0.0370 sec/batch
Nearest to killing: commercial, 500%, clapper, development, bad, importantly, going, sweep,
Nearest to sneak: pearl, soars, sweeping, @scottwrasmussen, changer, @foxnation, gave, experienced,
Nearest to w/: lima, raising, courage, turnout, force, stupor, effort, #ushcclegi,
Nearest to inaccurately: nasty, knocking, @tzard000, certain, yesterday, control, shortly, w/fbi,
Nearest to remarkably: opinion, always, lawfare, announced, small, common, describing, amendment,
Nearest to states: subpoenaed, egypt, @pastormarkburns, partnership, guard, anniversary, nh, voice,
Nearest to phrase: @anncoulter, bombshell, praising, justice, exclusive, hatred, enhances, july,
Nearest to traitor: ungrateful, goi, manning, deflect, sex, #ahca, hearing, information,
Nearest to ben: carson, hud, seriously, electricity, containing, head, boynton, exhausted,
Nearest to capable: developing, eleven, suspending, syrian, stages, fantastic, augustine, wherever,
Nearest to pulling: number, slog, insurer, due, ppl, events, nieto, magnificent,
Nearest to romney: sass, ad, #teamtrump, 1pme, deflect, cousin, susan, @mainegop,
Nearest to erratic: violent, describing, wil, elect, excoriates, #60minutes, @jlund04, accomplishments,
Nearest to gee: reporter, @wolfstopper, difference, safe, @thebrodyfile, @djd_thunder, 45pm, strength,
Nearest to rolling: thunder, @scottwrasmussen, munich, kill, charged, cnn, affection, bus,
Nearest to atlanta: extraordinary, #trump2016, pueblo, ceo, endorsed, louisiana, screening, initiated,
Nearest to 823: insecure, staff, disgusting, deleted, reason, bemoan, @katiepavlich, unions,
Nearest to shinzo: abe, trump/russia, afternoon, prime, household, hacks, minister, others,
Nearest to ireland: visit, @whitehouse, kvirikashvili, patriots, w, @senatemajldr, $19, georgia,
Nearest to #bigleag: media, trashes, @chernuna, thursdays, heroes, votes, neither, ten,
Epoch 18/40 Iteration: 4100 Avg. Training loss: 1.5341 0.0284 sec/batch
Epoch 18/40 Iteration: 4200 Avg. Training loss: 1.6129 0.0374 sec/batch
Epoch 19/40 Iteration: 4300 Avg. Training loss: 1.6014 0.0128 sec/batch
Epoch 19/40 Iteration: 4400 Avg. Training loss: 1.5633 0.0379 sec/batch
Epoch 19/40 Iteration: 4500 Avg. Training loss: 1.5610 0.0404 sec/batch
Epoch 20/40 Iteration: 4600 Avg. Training loss: 1.5096 0.0384 sec/batch
Epoch 20/40 Iteration: 4700 Avg. Training loss: 1.5949 0.0392 sec/batch
Epoch 21/40 Iteration: 4800 Avg. Training loss: 1.4916 0.0230 sec/batch
Epoch 21/40 Iteration: 4900 Avg. Training loss: 1.5113 0.0392 sec/batch
Epoch 22/40 Iteration: 5000 Avg. Training loss: 1.5693 0.0089 sec/batch
Nearest to killing: commercial, 500%, clapper, african, wrecked, bad, son, sweep,
Nearest to sneak: pearl, sweeping, @scottwrasmussen, changer, soars, experienced, tulsa, gave,
Nearest to w/: lima, courage, commerce, turnout, took, force, raising, stimulate,
Nearest to inaccurately: nasty, certain, control, knocking, yesterday, @tzard000, cancelled, packed,
Nearest to remarkably: opinion, lawfare, always, announced, small, @sara_wejesa, impossible, common,
Nearest to states: subpoenaed, egypt, registered, @pastormarkburns, guard, pick, vital, trump's,
Nearest to phrase: @anncoulter, bombshell, praising, incorrectly, july, daytona, enhances, justice,
Nearest to traitor: ungrateful, goi, manning, deflect, sex, #ahca, hearing, trump,
Nearest to ben: carson, hud, seriously, electricity, exhausted, n, tariffs, containing,
Nearest to capable: developing, eleven, stages, syrian, learn, suspending, weapon, augustine,
Nearest to pulling: number, insurer, slog, ppl, due, magnificent, events, nieto,
Nearest to romney: sass, #teamtrump, ad, deflect, @mainegop, 1pme, legal, cousin,
Nearest to erratic: violent, wil, describing, elect, excoriates, answered, @jlund04, #60minutes,
Nearest to gee: @wolfstopper, reporter, @djd_thunder, difference, gave, conferences, safe, strength,
Nearest to rolling: thunder, @scottwrasmussen, munich, affection, @frankylamouche, bus, kill, donald's,
Nearest to atlanta: extraordinary, pueblo, endorsed, #trump2016, car, ceo, screening, myanmar,
Nearest to 823: staff, disgusting, insecure, deleted, reason, unions, bemoan, hypocrites,
Nearest to shinzo: abe, trump/russia, minister, prime, hosting, japanese, others, afternoon,
Nearest to ireland: @whitehouse, visit, @senatemajldr, patriots, kvirikashvili, w, magnificent, alongside,
Nearest to #bigleag: thursdays, trashes, votes, heroes, ten, neither, releasing, media,
Epoch 22/40 Iteration: 5100 Avg. Training loss: 1.5504 0.0393 sec/batch
Epoch 22/40 Iteration: 5200 Avg. Training loss: 1.5652 0.0389 sec/batch
Epoch 23/40 Iteration: 5300 Avg. Training loss: 1.5657 0.0329 sec/batch
Epoch 23/40 Iteration: 5400 Avg. Training loss: 1.5422 0.0377 sec/batch
Epoch 24/40 Iteration: 5500 Avg. Training loss: 1.4839 0.0184 sec/batch
Epoch 24/40 Iteration: 5600 Avg. Training loss: 1.5468 0.0374 sec/batch
Epoch 25/40 Iteration: 5700 Avg. Training loss: 1.4885 0.0048 sec/batch
Epoch 25/40 Iteration: 5800 Avg. Training loss: 1.5309 0.0375 sec/batch
Epoch 25/40 Iteration: 5900 Avg. Training loss: 1.5099 0.0382 sec/batch
Epoch 26/40 Iteration: 6000 Avg. Training loss: 1.4690 0.0288 sec/batch
Nearest to killing: commercial, 500%, clapper, consequences, african, development, bad, wrecked,
Nearest to sneak: pearl, sweeping, @scottwrasmussen, experienced, values, gave, tulsa, soars,
Nearest to w/: lima, courage, effort, commerce, stimulate, turnout, took, force,
Nearest to inaccurately: nasty, certain, yesterday, @tzard000, cancelled, control, knocking, w/fbi,
Nearest to remarkably: opinion, lawfare, announced, @sara_wejesa, impossible, absolutely, common, funds,
Nearest to states: subpoenaed, egypt, @pastormarkburns, registered, united, nh, guard, republican,
Nearest to phrase: @anncoulter, bombshell, hatred, praising, enhances, july, incorrectly, self,
Nearest to traitor: ungrateful, manning, goi, chelsea, sex, deflect, #ahca, 2004,
Nearest to ben: carson, hud, seriously, electricity, head, @cbs, n, containing,
Nearest to capable: developing, eleven, stages, suspending, weapon, augustine, learn, words,
Nearest to pulling: number, insurer, slog, ppl, due, events, magnificent, nieto,
Nearest to romney: sass, mitt, choked, #orprimary, @mainegop, #teamtrump, congratulate, wrong,
Nearest to erratic: violent, describing, wil, answered, elect, excoriates, @jlund04, undertones,
Nearest to gee: @wolfstopper, reporter, @djd_thunder, conferences, difference, win, safe, gave,
Nearest to rolling: thunder, @scottwrasmussen, affection, munich, @frankylamouche, donald's, kill, bus,
Nearest to atlanta: extraordinary, #trump2016, endorsed, ceo, screening, pueblo, wins, tribute,
Nearest to 823: staff, disgusting, insecure, deleted, unions, reason, @katiepavlich, bemoan,
Nearest to shinzo: abe, minister, japanese, hosting, prime, trump/russia, @realericjallen, golf,
Nearest to ireland: visit, w, @whitehouse, patriots, kvirikashvili, docs, alongside, @senatemajldr,
Nearest to #bigleag: thursdays, heroes, @kellyannepolls, employee, @chernuna, decent, ten, son,
Epoch 26/40 Iteration: 6100 Avg. Training loss: 1.4949 0.0376 sec/batch
Epoch 27/40 Iteration: 6200 Avg. Training loss: 1.4883 0.0139 sec/batch
Epoch 27/40 Iteration: 6300 Avg. Training loss: 1.5217 0.0388 sec/batch
Epoch 28/40 Iteration: 6400 Avg. Training loss: 1.4986 0.0004 sec/batch
Epoch 28/40 Iteration: 6500 Avg. Training loss: 1.4418 0.0387 sec/batch
Epoch 28/40 Iteration: 6600 Avg. Training loss: 1.5146 0.0388 sec/batch
Epoch 29/40 Iteration: 6700 Avg. Training loss: 1.4673 0.0242 sec/batch
Epoch 29/40 Iteration: 6800 Avg. Training loss: 1.5247 0.0404 sec/batch
Epoch 30/40 Iteration: 6900 Avg. Training loss: 1.5224 0.0107 sec/batch
Epoch 30/40 Iteration: 7000 Avg. Training loss: 1.4445 0.0416 sec/batch
Nearest to killing: commercial, 500%, consequences, african, son, housing, development, wrecked,
Nearest to sneak: pearl, sweeping, values, @scottwrasmussen, tulsa, experienced, changer, soars,
Nearest to w/: lima, courage, commerce, #ushcclegi, stimulate, roundtable, effort, turnout,
Nearest to inaccurately: nasty, certain, cancelled, yesterday, @tzard000, control, knocking, w/fbi,
Nearest to remarkably: opinion, lawfare, announced, @sara_wejesa, impossible, common, funds, #votetrump2016,
Nearest to states: subpoenaed, pick, registered, republican, egypt, elevation, nh, earthquake,
Nearest to phrase: @anncoulter, bombshell, winners, july, hatred, praising, enhances, @svhlevi,
Nearest to traitor: ungrateful, goi, chelsea, manning, sex, #ahca, deflect, heading,
Nearest to ben: carson, hud, seriously, electricity, i've, head, exhausted, considering,
Nearest to capable: developing, stages, weapon, eleven, suspending, learn, syrian, hacks,
Nearest to pulling: insurer, number, slog, ppl, events, due, magnificent, nieto,
Nearest to romney: sass, mitt, #teamtrump, congratulate, wrong, wishes, @mainegop, ‘widespread,
Nearest to erratic: violent, describing, excoriates, wil, elect, @jlund04, answered, obsolete,
Nearest to gee: @wolfstopper, reporter, @djd_thunder, conferences, difference, 45pm, thing, gave,
Nearest to rolling: thunder, @scottwrasmussen, affection, donald's, munich, perry, @frankylamouche, #trumpdallas,
Nearest to atlanta: extraordinary, #trump2016, screening, ceo, myanmar, noon, endorsed, tribute,
Nearest to 823: staff, deleted, reason, unions, disgusting, roseanne, @katiepavlich, hypocrites,
Nearest to shinzo: abe, japanese, minister, prime, golf, hosting, trump/russia, afternoon,
Nearest to ireland: @whitehouse, kvirikashvili, w, visit, patriots, docs, @endakennytd, @senatemajldr,
Nearest to #bigleag: @kellyannepolls, decent, heroes, employee, @chernuna, thursdays, irredeemable, ten,
Epoch 30/40 Iteration: 7100 Avg. Training loss: 1.4605 0.0405 sec/batch
Epoch 31/40 Iteration: 7200 Avg. Training loss: 1.5054 0.0360 sec/batch
Epoch 31/40 Iteration: 7300 Avg. Training loss: 1.3990 0.0404 sec/batch
Epoch 32/40 Iteration: 7400 Avg. Training loss: 1.4607 0.0207 sec/batch
Epoch 32/40 Iteration: 7500 Avg. Training loss: 1.4838 0.0417 sec/batch
Epoch 33/40 Iteration: 7600 Avg. Training loss: 1.4748 0.0072 sec/batch
Epoch 33/40 Iteration: 7700 Avg. Training loss: 1.4440 0.0443 sec/batch
Epoch 33/40 Iteration: 7800 Avg. Training loss: 1.4778 0.0385 sec/batch
Epoch 34/40 Iteration: 7900 Avg. Training loss: 1.4260 0.0302 sec/batch
Epoch 34/40 Iteration: 8000 Avg. Training loss: 1.4627 0.0382 sec/batch
Nearest to killing: commercial, african, wrecked, housing, @gordonsr1052, 500%, sweep, development,
Nearest to sneak: pearl, sweeping, @scottwrasmussen, tulsa, values, changer, experienced, soars,
Nearest to w/: lima, courage, #ushcclegi, effort, 48, commerce, tel, @greta,
Nearest to inaccurately: nasty, certain, cancelled, meantime, @tzard000, control, yesterday, w/fbi,
Nearest to remarkably: lawfare, opinion, always, @sara_wejesa, announced, impossible, denver, common,
Nearest to states: subpoenaed, guard, egypt, pick, united, elevation, @pastormarkburns, drone,
Nearest to phrase: @anncoulter, #followthemoney, bombshell, magnificent, telepromter, enhances, winners, july,
Nearest to traitor: ungrateful, goi, chelsea, manning, sex, deflect, plenty, #ahca,
Nearest to ben: carson, hud, seriously, i've, electricity, exhausted, considering, 762,
Nearest to capable: developing, stages, weapon, eleven, nuclear, suspending, hacks, wherever,
Nearest to pulling: insurer, number, slog, events, due, ppl, copying, although,
Nearest to romney: sass, mitt, ad, #teamtrump, choked, wrong, congratulate, boards,
Nearest to erratic: violent, describing, excoriates, wil, elect, obsolete, answered, @jlund04,
Nearest to gee: @wolfstopper, @djd_thunder, reporter, safe, conferences, difference, thing, 45pm,
Nearest to rolling: thunder, @scottwrasmussen, affection, perry, munich, @mariaernandez3b, #trumpdallas, arena,
Nearest to atlanta: extraordinary, screening, noon, #trump2016, ceo, wins, pueblo, myanmar,
Nearest to 823: staff, deleted, disgusting, unions, roseanne, hypocrites, reason, @katiepavlich,
Nearest to shinzo: abe, japanese, prime, hosting, afternoon, minister, golf, trump/russia,
Nearest to ireland: kvirikashvili, @whitehouse, patriots, @endakennytd, visit, #ushcclegi, w, capital,
Nearest to #bigleag: @kellyannepolls, decent, thursdays, irredeemable, @foxnation, employee, heroes, releasing,
Epoch 35/40 Iteration: 8100 Avg. Training loss: 1.4717 0.0156 sec/batch
Epoch 35/40 Iteration: 8200 Avg. Training loss: 1.4515 0.0374 sec/batch
Epoch 36/40 Iteration: 8300 Avg. Training loss: 1.4629 0.0020 sec/batch
Epoch 36/40 Iteration: 8400 Avg. Training loss: 1.4546 0.0414 sec/batch
Epoch 36/40 Iteration: 8500 Avg. Training loss: 1.4744 0.0396 sec/batch
Epoch 37/40 Iteration: 8600 Avg. Training loss: 1.4323 0.0259 sec/batch
Epoch 37/40 Iteration: 8700 Avg. Training loss: 1.4146 0.0385 sec/batch
Epoch 38/40 Iteration: 8800 Avg. Training loss: 1.4639 0.0123 sec/batch
Epoch 38/40 Iteration: 8900 Avg. Training loss: 1.4185 0.0434 sec/batch
Epoch 38/40 Iteration: 9000 Avg. Training loss: 1.4173 0.0414 sec/batch
Nearest to killing: wrecked, commercial, african, events, housing, 500%, son, @gordonsr1052,
Nearest to sneak: pearl, gave, sweeping, harbor, tulsa, @scottwrasmussen, regards, experienced,
Nearest to w/: courage, lima, 48, @liliantintori, low, stimulate, nonsense, employee,
Nearest to inaccurately: nasty, meantime, cancelled, control, certain, tone, @tzard000, yesterday,
Nearest to remarkably: lawfare, opinion, always, @sara_wejesa, announced, absolutely, impossible, denver,
Nearest to states: subpoenaed, guard, egypt, pick, united, earthquake, bubba, drone,
Nearest to phrase: #followthemoney, @anncoulter, bombshell, back, magnificent, @billclinton, telepromter, churchill,
Nearest to traitor: ungrateful, manning, goi, chelsea, sex, trump, deflect, heading,
Nearest to ben: carson, hud, seriously, i've, electricity, head, considering, exhausted,
Nearest to capable: developing, stages, weapon, hacks, nuclear, 38%, ricketts, parts,
Nearest to pulling: insurer, number, slog, ppl, due, persistent, events, recently,
Nearest to romney: sass, mitt, choked, wrong, ad, graham, congratulate, senator,
Nearest to erratic: violent, describing, excoriates, wil, answered, elect, obsolete, @jlund04,
Nearest to gee: @wolfstopper, @djd_thunder, reporter, conferences, difference, thing, safe, 45pm,
Nearest to rolling: thunder, @scottwrasmussen, affection, perry, @mariaernandez3b, munich, gathering, #trumpdallas,
Nearest to atlanta: extraordinary, screening, #trump2016, noon, says, 50k, tribute, scared,
Nearest to 823: staff, deleted, unions, insecure, disgusting, @katiepavlich, reason, roseanne,
Nearest to shinzo: abe, japanese, prime, minister, hosting, others, afternoon, golf,
Nearest to ireland: kvirikashvili, visit, @whitehouse, @endakennytd, resort, capital, patriots, docs,
Nearest to #bigleag: decent, @kellyannepolls, employee, heroes, releasing, irredeemable, @foxnation, thursdays,
Epoch 39/40 Iteration: 9100 Avg. Training loss: 1.4055 0.0380 sec/batch
Epoch 39/40 Iteration: 9200 Avg. Training loss: 1.4424 0.0404 sec/batch
Epoch 40/40 Iteration: 9300 Avg. Training loss: 1.4239 0.0226 sec/batch
Epoch 40/40 Iteration: 9400 Avg. Training loss: 1.4116 0.0413 sec/batch
In [35]:
with train_graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=train_graph) as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    embed_mat = sess.run(embedding)
In [36]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
In [37]:
viz_words = 150
tsne = TSNE()
embed_tsne = tsne.fit_transform(embed_mat[:viz_words, :])
In [38]:
fig, ax = plt.subplots(figsize=(27, 27))
for idx in range(viz_words):
    plt.scatter(*embed_tsne[idx, :], color='steelblue')
    plt.annotate(int2v[idx], (embed_tsne[idx, 0], embed_tsne[idx, 1]), alpha=0.5)
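
With the trained embed_mat in hand, we can also query nearest neighbors for any word directly, rather than only the random validation words printed during training. A minimal sketch (the nearest_words helper below is hypothetical, not part of the pipeline above, and assumes the int2v lookup and embed_mat defined earlier):

def nearest_words(word, embed_mat, int2v, top_k=8):
    # Reverse lookup built from int2v so the indices line up with the embedding rows
    v2i = {w: i for i, w in int2v.items()}
    # Cosine similarity between the query word's vector and every row
    vectors = embed_mat / np.linalg.norm(embed_mat, axis=1, keepdims=True)
    sims = vectors @ vectors[v2i[word]]
    nearest = (-sims).argsort()[1:top_k + 1]  # skip the word itself
    return [int2v[i] for i in nearest]

print(nearest_words('states', embed_mat, int2v))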

Thoughts

  1. After optimizing the data pipeline, the loss from learning the embeddings seems to settle around ~1.45
  2. To improve on this, the first and easiest fix would be to gather more tweets to work with
  3. Another thought would be to include an option in the regex filter to remove @mentions, since they appear quite frequently (see the sketch after this list). In the same vein, it might make sense to pull tweets posted from iPhone and Android separately, since the different tweet authors could be confusing the algorithm.
  4. One other approach would be to use a seq2seq architecture to learn the relationships between words. Since we are only using a long, flat list of individual words, and each tweet is ~17 words long, it's possible that context windows span separate tweets, which would certainly confuse the algorithm. Something to look into for sure.
  5. Regardless of the relatively high loss, the process is efficient enough to scale to larger data sets in the future. For now, the next steps will be to build the seq2seq architecture for tweet generation using the core of what I've built above.
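Regarding point 3, stripping @mentions only takes one extra substitution (a sketch, not wired into cleanTweets above):

# Sketch: drop @mentions after the main cleaning pass
no_mentions = [re.sub(r'@\w+', '', tweet).strip() for tweet in cleanBucket]
print(no_mentions[5])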