Problem statement: How does Trump’s interlocutor change his choice of word?
Hypothesis: Trump’s choice of words depends on who he is speaking to, i.e. if he mentions someone in a tweet or references an external url.
# First import all required dependencies
import pandas as pd
import datetime as dt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, RandomizedSearchCV
from sklearn.metrics import f1_score, plot_precision_recall_curve, r2_score
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeRegressor, plot_tree
from imblearn.over_sampling import SMOTE
from statsmodels.tsa.seasonal import seasonal_decompose
import scipy as sp
from scipy import stats
import re
import ast
import tldextract
from textblob import TextBlob
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
plt.style.use('fivethirtyeight')
pd.options.mode.chained_assignment = None # this suppresses setting with copy warnings
# Read in the data
hate = pd.read_csv('hate_speech_dataset.csv')
pd.set_option('display.max_colwidth',150)
hate.head()
text | hateful | |
---|---|---|
0 | 1. i joined gab to remind myself how retarded jew haters are. You wouldn't be typing on your abacus without them you retard.\n | 1 |
1 | 1. This is what the left is really scared of. https://redd.it/9rfkts\n2. \tThat literally looks like a monkey. Why are we supposed to pretend i... | 1 |
2 | 1. It makes you an asshole.\n2. \tGive it to a soldier who has defend it , then try to burn it . Show me what a badass you really are. The flag ... | 1 |
3 | 1. So they manage to provide a whole lot of data, except a critical little nugget. Gay men are well over half the patients who present at STI cl... | 1 |
4 | 1. Hi there, i,m Keith, i hope you are doing well, i would be very honored to get to know you if you would be interested? @PugLife\n2. \tFuck off\... | 1 |
This dataset is quite skewed in that there are many more instances of non-hateful speech than there are of hateful speech.
So we need to check that we have enough samples first.
To do this I will use the following from Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications. :
Find the proportion p of positive cases and negative cases. Take the smaller of the two.
Find the number of independent variables k.
Let the minimum number of cases be N=10kp. The minimum should always be set to at least 100.
# Find the proportion
hate.hateful.value_counts(normalize=True)
1 0.891422 0 0.108578 Name: hateful, dtype: float64
So p=.108578
Later I'll determine that k=3 - text, text length, sentiment value). See heading 2.1.4 of this notebook for the specifics.
This means that N=30.108578 which gives N=276
# And thankfully that will be fine since we have a lot more than 276 samples
hate.hateful.value_counts()
1 15016 0 1829 Name: hateful, dtype: int64
# Looks like there are quite a few unwanted characters
# So will need to tidy that up
def clean(text):
"""
This function removes the following uwanted strings from input text:
1., 2., 3. if they occurs at the start of a string.
Any url patterns.
'\n', '\t' characters.
'@username' handles.
:param text: string, input text.
:return clean_text: string, input text with above characteristics removed.
"""
# Define patterns
# The url regex is very basic but works for this dataset since the urls are automatically generated by reddit
# So they consequently all have the same pattern
newline_and_tab_pattern = r'[\n\t]'
numbers_patterns = r'[(1.)(2.)(3.)]'
url_patterns = r'http.[^\s]+'
username_handles = r'@[^\s]+'
# Remove newlines and tabs
# Then remove regex patterns from text
new_text_step_1 = re.sub(newline_and_tab_pattern, '', text)
new_text_step_2 = re.sub(numbers_patterns, '', new_text_step_1)
new_text_step_3 = re.sub(url_patterns, '', new_text_step_2)
final_text = re.sub(username_handles, '', new_text_step_3)
return final_text
hate.text = hate.text.apply(clean)
hate.head()
text | hateful | |
---|---|---|
0 | i joined gab to remind myself how retarded jew haters are You wouldn't be typing on your abacus without them you retard | 1 |
1 | This is what the left is really scared of That literally looks like a monkey Why are we supposed to pretend it’s a person bc it’s wearing a r... | 1 |
2 | It makes you an asshole Give it to a soldier who has defend it , then try to burn it Show me what a badass you really are The flag is helpless... | 1 |
3 | So they manage to provide a whole lot of data, except a critical little nugget Gay men are well over half the patients who present at STI clini... | 1 |
4 | Hi there, i,m Keith, i hope you are doing well, i would be very honored to get to know you if you would be interested? Fuck off wow, what a rude... | 1 |
# I'm now going to create a few extra features that should be useful in predicting hate
# Create a column for the length of each post
hate['post_length'] = hate.text.apply(lambda x: len(x))
# Define a function that accepts text and returns the polarity.
def detect_sentiment(text):
return TextBlob(text).sentiment.polarity
# Now apply this to create a sentiment column
hate['sentiment'] = hate.text.apply(detect_sentiment)
hate.head()
text | hateful | post_length | sentiment | |
---|---|---|---|---|
0 | i joined gab to remind myself how retarded jew haters are You wouldn't be typing on your abacus without them you retard | 1 | 120 | -0.850000 |
1 | This is what the left is really scared of That literally looks like a monkey Why are we supposed to pretend it’s a person bc it’s wearing a r... | 1 | 164 | -0.045000 |
2 | It makes you an asshole Give it to a soldier who has defend it , then try to burn it Show me what a badass you really are The flag is helpless... | 1 | 357 | -0.181250 |
3 | So they manage to provide a whole lot of data, except a critical little nugget Gay men are well over half the patients who present at STI clini... | 1 | 378 | -0.029687 |
4 | Hi there, i,m Keith, i hope you are doing well, i would be very honored to get to know you if you would be interested? Fuck off wow, what a rude... | 1 | 152 | -0.030000 |
# Will check if this worked
# See a positive score
hate[hate.sentiment == 1].text.sample()
9796 Just what I needed The perfect response to that boomer cunt #TheBoomerPlague Name: text, dtype: object
# And a negative score
hate[hate.sentiment == -1].text.sample()
3716 Thats p/c for nasty cunt !!! Name: text, dtype: object
# Looks more or less like this worked, there's a lot of extreme language in here so it's probably quite rough
# To test if it's really helped at all, let's consider a boxplot of sentiment grouped by hateful or not
hate.boxplot(column='sentiment', by='hateful', showfliers=False, figsize=(12,6));
As the above graph illustrates, this doesn't seem to add too much although the distribution is quite different for hateful text.
So, I'll leave it in for now and then tweak it later when I've got a prototype model up and running and am moving onto training it.
# First create a feature matrix and target variable
X = hate[['text', 'post_length', 'sentiment']]
y = hate.hateful
# Then train test split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)
# Define a function to lemmatize the text
def split_into_lemmas(text):
"""
Lemmatizes input text.
:param text: string, text to be lemmatized.
:return lemmatized_text: list, the lemmas in each word.
"""
text = str(text).lower()
words = TextBlob(text).words
lemmatized_text = [word.lemmatize() for word in words]
return lemmatized_text
# Use this with a TfidfVectorizer to create a document text model
vect = TfidfVectorizer(analyzer=split_into_lemmas, stop_words="english")
X_train_dtm = vect.fit_transform(X_train.text)
X_test_dtm = vect.transform(X_test.text)
# As we'll discover later, I'll be using a multinomial Naive Bayes model for this problem
# Unfortunately this model doesn't accept negative inputs so we'll need to scale the data to address this
# Use a min max scaler to scale all the values between 0 and 1
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_X_train = scaler.fit_transform(X_train.drop('text', axis=1)) # removing text now it's vectorized
scaled_X_test = scaler.transform(X_test.drop('text', axis=1)) # removing text now it's vectorized
# And lastly to add features to the document-term matrix to make everything ready for modelling
X_train_features_matrix = sp.sparse.csr_matrix(scaled_X_train).astype(float)
X_train_combined_matrix = sp.sparse.hstack((X_train_dtm, X_train_features_matrix))
X_test_features_matrix = sp.sparse.csr_matrix(scaled_X_test).astype(float)
X_test_combined_matrix = sp.sparse.hstack((X_test_dtm, X_test_features_matrix))
# Now I wil use SMOTE to randomly oversample the data to try and fix this class imbalance problem
# I got better results than using balanced class weights with the logistic regression model doing this
# Hence taking this approach
oversample = SMOTE()
X_train_rs, y_train_rs = oversample.fit_resample(X_train_combined_matrix, y_train)
At first glance, it looks like the most important metric for this model is classification accuracy. In the scope of this project, I'm using this model to classify text as hate or not in order to subsequently determine which features affect Trump's hatefulness.
False positives and false negatives aren't too significant since I'm using the model for investigative purposes rather than to make a real world decision such as whether to approve a loan application or not.
However, precision-recall is going to be a bit more relevant than accuracy here due to an imbalanced class distribution in the dataset - there are many more hateful scores than there are non-hateful scores.
Briefly, precision is a measure of result relevancy, while recall measures how many truly relevant results are returned.
More formally:
Precision (P) is defined as the number of true positives (Tp) over the number of true positives plus the number of false positives.
P=TpTp+FpRecall (R) is defined as the number of true positives (Tp) over the number of true positives plus the number of false negatives (Fn).
R=TpTp+FnThese quantities are also related to the (F1) score, which is defined as the harmonic mean of precision and recall - the harmonic mean is used here since it's more appropriate for measuring rates, which is what precision and recall both measure. See this article for more information on harmonic means and the F1 score.
F1=2P⋅RP+RSo, I will use two metrics to assess model performance in this notebook - Root Mean Squared Error (RMSE) to measure model accuracy and an F1 Score to measure Precision-Recall.
I will start with the null hypothesis before moving on to try a simple logistic regression model and then a multinomial naive bayes model.
I'm going to start with multinomial naive bayes since it's likely better than logistic regression as well as being fast and simple to use.
So the null hypothesis is 89% - that's what the most frequent class was in the intial dataset.
def evaluate_text_classification_model(X_train, X_test, y_train, y_test,
*, model):
"""
A function to evaluate model performance across my desired metrics and plot relevant outputs.
:params X_train, X_test, y_train, y_test: feature matrices and target variable arrays.
:param model: various datatypes, the model to test.
:params k_min, k_max: int, min/max k values for cross validation.
"""
# First measure model using negative MSE
# Cross validate
score = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
print("The negative MSE is {}".format(score.mean()))
# Now using the F1 Score
score = cross_val_score(model, X_train, y_train, cv=5, scoring='f1')
print("The F1 score is {}".format(score.mean()))
# Now plot precision vs recall
disp = plot_precision_recall_curve(model, X_test, y_test)
disp.ax_.set_title('Precision-Recall curve')
# Call function
evaluate_text_classification_model(X_train_rs,
X_test_combined_matrix,
y_train_rs,
y_test,
model=logreg1)
The negative MSE is -0.08826002237359398 The F1 score is 0.91051040410233
These scores look quite good but I'm still very concerned about the null hypothesis problem.
So I'm going to see what happens if I apply this to some of the test data.
X_test['predictions'] = y_pred
# Let's test some hateful predictions
for row in X_test[['text', 'predictions']].loc[X_test.predictions == 1].sample(5).values:
print(row)
[" Shut the fuck up about democrats you retarded boomer cuckservatives If you still give a shit about the kike controlled us govt, you're doing it wrong" 1] [" how many jews have pagans burned at the stake ? How many whites have christians burnt at the stake? well technically none a whole heap of heretics but sort of no christians whatsoever ex christians sure torched loads of them4 So none of the white heretics & white pagans you guys burnt at the stake, & massacred were white 5 Bit strange that he/she claims no Christians too A whole heap of the people burned alive and hanged for witchcraft during the Protestant Reformation were Catholics6 lol, we're talking about CI's, thicker then even regular christians CI's are some of the most retarded idiots on the planet, all they do is lie & bullshit, thats their entire strategy after their entire evidence for CI got nuked & Rekt! in the CQ They got Rekt! so hard, they dont even admit theyre christian identerians anymore, top kek 7 Yeah bro, they got rekt!8 And these are the people on here that rag on white Christian's, while pretending to be in tune with their pagan faggot viking cuck god" 1] [' Everyone make your own nigger haiku for the new topic! Redneck? That baby is in a watermelon That baby must be an albino African 4 5 get my grape drank ho it be time for the ballgame what you trippin fo?6 😹😹 grape drank ' 1] [" His name was Brandon Arndt He helped his black elderly neighbor only to be killed for his good deeds Perhaps no one ever gave him this sage advice, avoid the groid\xa0 That's in their nigger blood, to kill the White Folk They kill their own kind too and that just might be a good thing because those are the worst of their race We can rid ourselves of that bad seed through attrition" 1] [' CNN’s John Berman Fires Back At Trump After The President Criticizes The ‘Mainstream Media’ On Twitter Who gives a shit what that faggot says? Glad his security clearance was revoked Eat Dirt Brennan' 1]
# Let's test some non hateful predictions
for row in X_test[['text', 'predictions']].loc[X_test.predictions == 0].sample(5).values:
print(row)
[" Hi and welcome to this episode of Reddit cooking show Preheat the oven to 5 Fahrenheit We'll start of by using the r/TumblrInAction as a base and then layer other ingredients on top Add a dash of r/thathappened spice it up with r/iamverybadass and you've got yourself this shit lasagna of a post Sounds tasty!" 0] [" But black people can't be racist! That would mean they're held to the same standards as everyone else? [removed] Ok, Cletus the redneck You kind of actually got racist Like, a lot " 0] [" Yeah sorry but video games are more fun than invisible ball Whenever I see the donate toys or clothes to kids overseas, I always think about giving it to just one kid, so they get everything and then see how long it takes until the town/village turns against them I'd send over one console per village Stage an Enter the Dragon style fight to the death tournament, winner gets the console Film the tournament and sell as pay-per-view Give a percentage of the proceeds to the village children Pros Gives the kids something to work towards Raises money Reduces population and so there's more resources left for others 4 Kids get to play some cool games 5 Tournament is interesting content for viewers and better than most shows on TV now anyway Cons Kids have to fight each other to the death There might not be electricity to plug in the console so it could be wasted on these kids anyway That's 5 pros vs cons Could potentially spice things up by having animals armed with weapons be tag team partners for the kids" 0] [' She’s not wrong, it was fueled by sexism The blatant sexism displayed by those involved in the production and by their defenders was indefensible and doubtlessly created more enemies of the film than fans I have a strong suspicion that the reason 99% of the "comedy" in the movie is terrible improv is because Paul Feig was afraid of actually giving the female actors direction because he didn\'t want to seem sexist He just pointed a camera at them and let them riff because actually telling them what they should be doing would be mansplaining Which is kinda sexist if you think about it They\'re professional actors and you\'re their director Direct them Either that or he\'s just a shit director Although he did direct some episodes of The Office that I like, but that could just come down to a strong writing staff and a solid cast I saw a review of the Ghostbusters reboot which included a detailed analysis of the controversies surrounding it and gave some insights about its creation According to the reviewer which is backed by excerpts from interviews with the cast or [this] article Paul Feig has a really bizarre fetish for improvisations He deliberately includes them in every movie he directs The trouble is that you have to have people who are awesome at improv and be careful about that I don\'t say improv is bad - it is bad when it is overdone, and when people don\'t actually know how to do it well, yet some of [the best scenes] in movies as we know them were made possible because of improvisations I think Feig is just holding a proverbial hammer in his hand and sees only nails everywhere4 A good example is ricky gervais, or steve carell They rarely do the same line twice but it\'s always in-step with what\'s happening5 That\'s the thing good improv takes the general idea of what should be happening and and goes from there Feigbusters improv was incoherent rambling without any sense of meaning6 Damn right There\'s a universe of difference between steve carell coming up with an unexpected banger; unasked and mcarthy being forced to improv 7 What\'s funny is that there wasn\'t much improv in The Office, according to the cast Other than the kiss between michael and oscar, which the reactions were 00% genuine 8 Improv can work when you have people that are legitimately funny, and not just “comedic actors” in your film Best in Show - one of my favorite movies, not just comedies, of all time - is damn near pure improvisation by all of the actors Christopher Guest and Eugene Levy wrote a quick outline and sort of gave each actor their character, and it was mostly on the actors to figure out how that character worked and go from there What they turned out was amazing But that movie is filled to the brim with hilarious people, Ghostbusters was not None of the 4 principle actresses are innately funny Sure, they can read some funny dialogue and overact to make it work for the SNL crowd, but that’s it They’re *acting* funny, they’re not actually funny There is a difference Point being, improv can absolutely carry a movie when done with the right people, both in front of and behind the camera Ghostbusters Femme Edition just didn’t have the juice 9 I agree with you about McCarthy and Jones They seemed the weakest by far in the movie Switch them out with real improv people like Amy Poehler, Niecy Nash, or even Tina Fay 0 See, I would have never guessed that Best in Show was nearly pure improv Then again, there is a subtlety to the movie and lots of the bits, like in Spinal Tap, are just quick little pieces of dialogue I wonder if he relied on it too much here, because I think only Kate McKinnon has the chops and timing to shine in improv I don\'t think that\'s strictly the case It isn\'t good improvisational skills that are required, tho they may help What\'s required is a talented actor who really understands their character, the story, the scene and the genre Although not comedy, one example I can give is Margot Robbie and Leo DiCaprio is Wolf of Wall Street In one scene, she loses her temper with him and slaps him The slap was improvised and when Robbie profusely apologised afterwards, DiCaprio assured her she did a fantastic job and did what felt right for the character Another example, also DiCaprio and also drama, in a scene in Django Unchained he actually cuts his hand on a broken glass, and without missing a beat, continues in character and wraps his hand That made the final edit Now I know comedy is different from drama, I\'m not claiming they\'re the same But there\'s a difference between improv as they did in GB06 with Paul Fieg, and improv \\*in character\\* and in context You can\'t make a good film the way he did, just letting the camera roll and the actors do whatever They need a script and to let the actors do it their way, perhaps a few takes to try different things with the director directing them as needed The trouble comes when the director doesn\'t have a clear enough vision for the final product The Hobbit trilogy also famously suffered for similar reasons I understand what you mean Yes, I completely agree4 What gets me is, Feig was pretty good in Heavyweights He was a fun character in that I still cant believe it\'s the same person that acted in that 5 That was a long, long time ago6 > Paul Feig was afraid of actually giving the female actors direction because he didn\'t want to seem sexist I think this is super unlikely Feig has directed women plenty before He directed some of the women in this movie before, to great results Bridesmaids The problem is that neither he nor the studio suits *understood* Ghostbusters and why it was so well regarded in the first place They thought they could just take their own sensibilities, slap a Slimer into it, and call it a day Obviously, that didn\'t work7 They just plain didn\'t respect it is where the real trouble lies They could have honored the series by giving the old characters a proper send off But, nope They wanted to steal the name and pretend the first one didn\'t happen' 0] [" Ha! Just saw that stat on tv, turned to the Mrs and said don't become a statistic I posted this joke on r/jokes I don’t think we’re welcome there lol I appreciate u bro 😂 👍 it's always a touching moment when you meet a like minded cunt I'm tearing up4 Haha I’m just happy downvotes don’t take away from accounts karma or my account wouldve been deleted now for sure lol5 😂" 0]
# Now that's done will delete the predictions columns
X_test.drop('predictions', axis=1, inplace=True)
The above actually looks pretty good; so combined with the precision/recall scores this seems OK.
# Repeat the above steps for this model
nb = MultinomialNB()
nb.fit(X_train_rs, y_train_rs)
evaluate_text_classification_model(X_train_rs, X_test_combined_matrix, y_train_rs, y_test,
model=nb)
The negative MSE is -0.19331830125081564 The F1 score is 0.7650917937399067
So from the above measure it looks like logistic regression actually performs slightly better than both the null accuracy and the multinomial naive bayes - simplest is best!
This is probably due to the fact that naive bayes assumes features are conditionally independent, which likely doesn't apply to this dataset - i.e. whether something is hate or not is very likely to depend on non-independent combinations of words.
To take an example from a above, if 'jews' and 'hate' are in a sentence then this combination is likely to cause 'yid'.
Now we've established that logistic regression is well suited to this problem, I'm going to see if I can create a better vectorizer and also randomly oversample the data to see if that helps the class imbalance.
# Create a new vectorizer with different parameters
vect2 = TfidfVectorizer(analyzer=split_into_lemmas, stop_words='english', ngram_range=(1, 2), min_df=.05,
max_df=.3, max_features=100)
# Train test split the data again using the same split and random state
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)
X_train_dtm = vect2.fit_transform(X_train['text'])
X_test_dtm = vect2.transform(X_test['text'])
# And lastly to add features to the document-term matrix to make everything ready for modelling
X_train_features_matrix = sp.sparse.csr_matrix(X_train.drop('text', axis=1)).astype(float)
X_train_combined_matrix = sp.sparse.hstack((X_train_dtm, X_train_features_matrix))
# Repeat for testing set.
X_test_features_matrix = sp.sparse.csr_matrix(X_test.drop('text', axis=1)).astype(float)
X_test_combined_matrix = sp.sparse.hstack((X_test_dtm, X_test_features_matrix))
# Resample the data again
X_train_rs_2, y_train_rs_2 = oversample.fit_resample(X_train_combined_matrix, y_train)
# Instantiate and fit model; then plot precision recall curve
logreg2 = LogisticRegression(max_iter=1000)
logreg2.fit(X_train_rs_2, y_train_rs_2)
disp = plot_precision_recall_curve(logreg2, X_test_combined_matrix, y_test)
disp.ax_.set_title('Precision-Recall curve')
Text(0.5, 1.0, 'Precision-Recall curve')
That looks very marginally better although not by enough to justify the extra work
# I'm now going to create some out of sample data to really test if this works
# Sorry for my language :-S
# Create new out of sample data
test_hate_speech = ["Die you stupid fucking evil cunt",
"I really like eating cookies out of the cookie jar",
"Go to hell you stupid fat bastard"]
# Prepare data the same way as before
test_df = pd.DataFrame({'text': test_hate_speech})
test_df.text = test_df.text.apply(clean)
test_df['post_length'] = test_df.text.apply(lambda x: len(x))
test_df['sentiment'] = test_df.text.apply(detect_sentiment)
test_df.head(3)
text | post_length | sentiment | |
---|---|---|---|
0 | Die you stupid fucking evil cunt | 32 | -0.9 |
1 | I really like eating cookies out of the cookie jar | 50 | 0.2 |
2 | Go to hell you stupid fat bastard | 33 | -0.8 |
# Now time to predict some probabilities for this
test_dtm = vect.transform(test_df.text)
scaled_test_features = scaler.fit_transform(test_df.drop('text', axis=1))
test_feature_matrix = sp.sparse.csr_matrix(scaled_test_features).astype(float)
test_combined_matrix = sp.sparse.hstack((test_dtm, test_feature_matrix))
test_scores = pd.DataFrame(logreg1.predict_proba(test_combined_matrix), columns=['Not Hate Speech Probability',
'Hate Speech Probability'])
test_scores['original_text'] = test_df.text
test_scores
Not Hate Speech Probability | Hate Speech Probability | original_text | |
---|---|---|---|
0 | 0.019495 | 0.980505 | Die you stupid fucking evil cunt |
1 | 0.964331 | 0.035669 | I really like eating cookies out of the cookie jar |
2 | 0.010542 | 0.989458 | Go to hell you stupid fat bastard |
Hooray! This seems to be working :-)
# Now read in the Trump data
trump = pd.read_csv('Trump_tweets_more_info.csv', index_col=0)
trump.sample(5)
id | text | isRetweet | isDeleted | device | favorites | retweets | date | urls | reply_count | quote_count | |
---|---|---|---|---|---|---|---|---|---|---|---|
37792 | 788468036541874176 | Thank you Colorado Springs. If I’m elected President I am going to keep Radical Islamic Terrorists out of our count... https://t.co/N74UK73RLK | f | f | Twitter for iPhone | 26542 | 10577 | 2016-10-18 19:53:23 | https://twitter.com/realDonaldTrump/status/788468036541874176/photo/1 | 3388 | 1335 |
13391 | 260487521430040576 | Everybody is asking about my announcement this Wednesday concerning Barack Obama---just wait and see! | f | f | Twitter Web Client | 169 | 544 | 2012-10-22 21:07:19 | NO DATA | 522 | 7 |
33588 | 607700233011757056 | """@SEETEK_AU: Watch, listen, and learn. You can’t know it all yourself. Anyone who thinks they do is destined for mediocrity.― Donald Trump" | f | f | Twitter for Android | 141 | 101 | 2015-06-08 00:06:40 | NO DATA | 27 | 5 |
48687 | 1189146892292313088 | A great book by a great guy. Get it now! https://t.co/hwFtbpbIO0 | f | f | Twitter for iPhone | 34003 | 7706 | 2019-10-29 11:48:07 | https://twitter.com/dbongino/status/1188636122764718080 | 2126 | 235 |
20113 | 338060998793646080 | @DannyZuker, are you ready for the deal? | f | f | Twitter for Android | 23 | 33 | 2013-05-24 22:36:37 | NO DATA | 25 | 1 |
# Replace no data as nan
trump.urls.replace('NO DATA', np.nan, inplace=True)
# The id column is unecessary so I can remove that
trump.drop('id', axis=1, inplace=True)
# Now going to convert the date column to a pandas datetime object
trump['date'] = pd.to_datetime(trump.date)
# Now I will factorize the columns with categorical variables so they are numerically encoded
trump.device = pd.factorize(trump.device)[0]
trump.isRetweet = pd.factorize(trump.isRetweet)[0]
trump.isDeleted = pd.factorize(trump.isDeleted)[0]
# The urls one is quite a lot fiddlier
# First I need to extract the root domain from the url
# i.e. convert https://twitter.com/realDonaldTrump to 'twitter'
# This is because I'm interested in traffic sources, i.e. the above url should be encoded the same as the below:
# https://twitter.com/JoeBiden
# The point I'm interested in is that they both represent traffic from twitter
def get_root_domain(urls):
"""
This converts an input list of urls into the root domain of a url.
:param urls: string, contains list of urls encoded as string OR one url as string, np.nan in case of no data.
:return domain: string, the root domain(s) OR np.nan in case of no data.
"""
# First return no data if appropriate
if isinstance(urls, type(np.nan)):
return urls
else:
# Convert string representations of lists into lists
# If the string is a list
try:
urls = ast.literal_eval(urls)
extracted_domains = []
for url in urls:
ext = tldextract.extract(url)
extracted_domains.append(ext.domain)
domain = ",".join(extracted_domains)
return domain
# This is a bit cheeky but since I created this data column I know exceptions will mean
# that the string isn't a list so there's only one url
except Exception:
pass
ext = tldextract.extract(urls)
domain = ext.domain
return domain
# Apply the function the urls column
trump['domains'] = trump.urls.apply(get_root_domain)
# Let's check this actually worked
trump.domains.value_counts()
twitter 6302 bit 1378 tl 369 pscp 339 instagram 272 ... createaforum 1 trumpgolfireland 1 coloradosun,twitter 1 chn 1 eventbrite,eventbrite,twitter 1 Name: domains, Length: 917, dtype: int64
trump.loc[trump.domains=='tl'].sample(3)[['urls', 'domains']]
urls | domains | |
---|---|---|
15211 | http://tl.gd/gv14j8 | tl |
15181 | http://tl.gd/h25ov4 | tl |
15539 | http://tl.gd/g1vo61 | tl |
trump.loc[trump.domains=='pscp'].sample(3)[['urls', 'domains']]
urls | domains | |
---|---|---|
44790 | https://www.pscp.tv/w/bolO8zFvTlFsTFJub1dwUXd8MU93eFdXd09aTUF4UUC3oF_Q3pkR0oyuvTU2e1ScZovXxPvSHzypF5VgXz8v?t=1s | pscp |
45529 | https://www.pscp.tv/w/bjOY3TFvTlFsTFJub1dwUXd8MU93eFdXcmJRTkR4UYead0duJSwkVRgKsYSG96dk-GyQ5jmjgK36wWdE84ez?t=1s | pscp |
51988 | https://www.pscp.tv/w/b8Zu2zFvTlFsTFJub1dwUXd8MWRqR1hwRUV3TW9HWkSLS2I9C4IQU-7hm2yX1sz1KKWAL8CxUWxH6mrM2goh?t=43s | pscp |
# Boom shakalaka that is looking good.
# Right, now moving on...
# First drop the urls column since we no longer need it
trump.drop('urls', axis=1, inplace=True)
# There are too many values to encode all of these numerically
trump.domains.value_counts()
twitter 6302 bit 1378 tl 369 pscp 339 instagram 272 ... createaforum 1 trumpgolfireland 1 coloradosun,twitter 1 chn 1 eventbrite,eventbrite,twitter 1 Name: domains, Length: 917, dtype: int64
# So I'm going to filter out for domains that appear less than ten times and encode those as Nan
indices_for_values_less_than_ten = np.where(trump.domains.value_counts().values < 10)[0].tolist()
domains_to_encode_as_nan = trump.domains.value_counts()[indices_for_values_less_than_ten].index.to_list()
trump['domains_filtered'] = trump.domains.apply(lambda x: np.nan if x in domains_to_encode_as_nan else x)
trump.domains_filtered.value_counts()
twitter 6302 bit 1378 tl 369 pscp 339 instagram 272 ... bostonherald 10 nydn 10 es 10 facebook,twitter,twitter,twitter,twitter 10 spectator 10 Name: domains_filtered, Length: 104, dtype: int64
# Now I will encode these
trump_dummies_1 = trump.domains_filtered.str.get_dummies(sep=',')
trump_dummies_1.sample(3)
ArmyForTrump | DonaldJTrump | Vote | abcn | amzn | apne | bit | bloom | bloomberg | bongino | ... | washingtonexaminer | washingtonpost | washingtontimes | wh | whitehouse | winred | wsj | yhoo | youtu | youtube | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
35526 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
35142 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
49016 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 rows × 88 columns
# The other thing I want to do now is extract twitter handles
# This way I can see who Trump is talking too
def handle_extraction(tweet):
"""
Function to extract a twitter handle from text.
:param tweet: string, input data.
:return output: string, comma separated list of handles.
"""
username_handles = r'\B@\w*' # nb slightly different to simpler above handle for regex
result = re.findall(username_handles, tweet)
# If only one handle in tweet
if result:
if len(result) == 1:
output = result[0]
return output
#If multiple handles
if len(result) > 1:
output = ",".join(result)
return output
trump['handles'] = trump.text.apply(handle_extraction)
# Check this actually worked
trump.handles.value_counts()
@realDonaldTrump 708 @BarackObama 541 @WhiteHouse 372 @foxandfriends 316 @FoxNews 222 ... @unicef 1 @ray_chipendo 1 @debragarrett,@usminority 1 @CLewandowski_,@realdonaldtrump 1 @Molly_Stew,@realDonaldTrump 1 Name: handles, Length: 18782, dtype: int64
# Encode as before, removing handles that appear less than 10 times
indices_for_values_less_than_ten = np.where(trump.handles.value_counts().values < 10)[0].tolist()
domains_to_encode_as_nan = trump.handles.value_counts()[indices_for_values_less_than_ten].index.to_list()
trump['handles_filtered'] = trump.handles.apply(lambda x: np.nan if x in domains_to_encode_as_nan else x)
trump.handles_filtered.value_counts()
@realDonaldTrump 708 @BarackObama 541 @WhiteHouse 372 @foxandfriends 316 @FoxNews 222 ... @kyleraccio,@realDonaldTrump 10 @_Snurk,@realDonaldTrump 10 @TrumpTurnberry 10 @ArsenioHall 10 @jacknicklaus 10 Name: handles_filtered, Length: 222, dtype: int64
# Lots of self-referential tweets, interesting...
# Encoding now...
trump_dummies_2 = trump.handles_filtered.str.get_dummies(sep=',')
trump_dummies_2.sample(3)
@ | @60Minutes | @ABC | @ACTBrigitte | @AGSchneiderman | @AP | @AbeShinzo | @AlanDersh | @AlexSalmond | @AmSpec | ... | @politico | @realDonaldTrump | @seanhannity | @senatemajldr | @tedcruz | @thebradfordfile | @thehill | @trish_regan | @washingtonpost | @yankees | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
17997 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4054 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
31927 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 rows × 203 columns
# Remove the blank handle one
trump_dummies_2.drop('@', axis=1, inplace=True)
# Now add the dummy variables to the main dataset and remove unecessary columns
trump.drop(['domains', 'handles', 'handles_filtered', 'domains_filtered'], axis=1, inplace=True)
trump_final = pd.concat([trump, trump_dummies_1, trump_dummies_2], axis=1)
# Now I need to clean up some of the noise in the text before vectorizing it
# Looks like there were quite a few unwanted characters
# So will need to tidy that up
def clean_again(text):
"""
This function removes the following uwanted strings from input text:
urls, hashtags, &, RT :, newlines, tabs, twitter handles.
:param text: string, input text.
:return clean_text: string, input text with above characteristics removed.
"""
# Define patterns
# The url regex is very basic but works for this dataset since the urls are automatically generated by reddit
# So they consequently all have the same pattern
newline_and_tab__and_hashtag_and_exclamation_mark_pattern = r'[\n\t#"!:]'
url_patterns = r'http.*[^\s]+'
username_handles = r'\B@\w*'
# Remove phrases
text_1 = text.replace('RT', '')
text_2 = text_1.replace('&', '')
# Remove regex patterns from text
new_text_step_1 = re.sub(newline_and_tab__and_hashtag_and_exclamation_mark_pattern, '', text_2)
new_text_step_2 = re.sub(username_handles, '', new_text_step_1)
final_text = re.sub(url_patterns, '', new_text_step_2)
return final_text
trump_final.text = trump.text.apply(clean_again)
# Remove blank strings if they have been created by the above
trump_final.text.replace('', np.nan, inplace=True)
trump_final.dropna(subset=['text'], inplace=True)
# In order for the logistic regression model created earlier to work, the feature matrix will need to be the same
# So that means creating the same features
trump_final['post_length'] = trump_final.text.apply(lambda x: len(x))
# Now apply this to create a sentiment column
trump_final['sentiment'] = trump_final.text.apply(detect_sentiment)
# Now it's time to vectorize the text and add a hatefulness score
# To do this I'll use the logistic regression model created earlier and the same vectorizer
trump_dtm = vect.transform(trump_final.text)
# Now combine with the relevant features
trump_scaled_features = scaler.fit_transform(trump_final[['post_length', 'sentiment']])
trump_feature_matrix = sp.sparse.csr_matrix(trump_scaled_features).astype(float)
trump_combined_matrix = sp.sparse.hstack((trump_dtm, trump_feature_matrix))
# Now load in the model saved earlier
# And use this to classify the probability that Trump's speech is hateful or not
# Note - I expect most of it won't be compared to the sample hate data
# But I'm more looking for trends of hateful or not
test_scores = pd.DataFrame(logreg1.predict_proba(trump_combined_matrix),
columns=['Not Hate Speech Probability', 'Hate Speech Probability'])
test_scores['original_text'] = trump_final.text
# Not sure why but this has some Nan values and blank strings missed so removing those
test_scores.original_text.replace(' ', np.nan, inplace=True)
test_scores.dropna(inplace=True)
test_scores.loc[test_scores['Hate Speech Probability'] > .7].sample(10)
Not Hate Speech Probability | Hate Speech Probability | original_text | |
---|---|---|---|
47146 | 0.289630 | 0.710370 | Democrats are far more concerned with Illegal Immigrants than they are with our great Military or Safety at our dangerous Southern Border. They co... |
15117 | 0.240332 | 0.759668 | awesome Thsnks |
28296 | 0.267540 | 0.732460 | If you would buy the team that would be awesome Just an extra win for my squad) |
16096 | 0.288126 | 0.711874 | This is great How awesome of Mr. Trump Thank you. |
29691 | 0.250606 | 0.749394 | This very expensive GLOBAL WARMING bullshit has got to stop. Our planet is freezing, record low temps,and our GW scientists are stuck in ice |
17562 | 0.239279 | 0.760721 | “MSNBC'S TOURÉ HAS EPIC RACE-BAITING MELTDOWN ON CNN” |
35341 | 0.174289 | 0.825711 | Hi Katie, let's get Donald Trump in the WH. He's the man to get this country back in order |
18682 | 0.060416 | 0.939584 | I feel like is the only person on my tl that has common sense when it comes to the future of our country |
24633 | 0.079892 | 0.920108 | it's by far my favorite Mac Miller song. Can't beat Donald Trump |
48469 | 0.181754 | 0.818246 | 95% Approval Rating in the Republican Party. Thank you |
test_scores.loc[test_scores['Hate Speech Probability'] < .3].sample(10)
Not Hate Speech Probability | Hate Speech Probability | original_text | |
---|---|---|---|
47801 | 0.878038 | 0.121962 | Do not believe any article or story you read or see that uses “anonymous sources” having to do with trade or any other subject. Only accept inform... |
47806 | 0.778821 | 0.221179 | Nancy Pelosi just had a nervous fit. She hates that we will soon have 182 great new judges and sooo much more. Stock Market and employment records... |
46237 | 0.738021 | 0.261979 | This is my 500th. Day in Office and we have accomplished a lot - many believe more than any President in his first 500 days. Massive Tax , Regulat... |
5953 | 0.955222 | 0.044778 | 15 DAYS TO SLOW THE SPREAD |
44276 | 0.899759 | 0.100241 | Today in the East Room of the , it was my true privilege to award seven extraordinary Americans with the Presidential Medal of Freedom... |
531 | 0.830135 | 0.169865 | Why isn’t Biden corruption trending number one on Twitter? Biggest world story, and nowhere to be found. There is no”trend”, only negative stories... |
6466 | 0.972319 | 0.027681 | Amy Coney Barrett is an outstanding judge and an even better person. I commend President Trump on another exceptional pi… |
20351 | 0.852116 | 0.147884 | True and thanks. |
49392 | 0.925036 | 0.074964 | The first so-called second hand information “Whistleblower” got my phone conversation almost completely wrong, so now word is they are going to th... |
46129 | 0.759820 | 0.240180 | The World has taken a big step back from potential Nuclear catastrophe No more rocket launches, nuclear testing or research The hostages are back ... |
# Let's check the distribution to see if it looks alright
test_scores['Hate Speech Probability'].hist(bins=200);
Overall, this looks like it's not actually picking up hate speech since Trump doesn't use language like in the previous dataset - say swear words, racial slurs.
I think what is taking place here is data drift - the dataset I trained the hate speech classifier on is significantly different to the dataset of Trump's tweets.
Recall however that I'm not trying to measure hatefulness in Trump - my model is clearly insufficient for that - but rather trends in his word choice, which this model does seem to be picking up.
Specifically, it looks like the model is picking up more divisive areas that would typically lead to hate speech, such as immigration, global warming, swearing; these are typically signs of more nationalist rhetoric. I've checked this by making lots of manual tests of the data - this is definitely something that could be fine tuned in further iterations of this model beyond a prototype.
So this metric will serve as a measure of what makes Trump's speech divisive, i.e. what might lead or be more likely to lead to hate speech; that's what the next stage of this notebook will explore.
I'm now going to try and build a new model to see which features make Trump's speech more or less divisive. The point of doing this is really investigative - I'm interested in seeing who in his network makes him more or less divisive.
So interpretability is therefore very important since it's the main point of this whole exercise. Additionally, I'm not too worried about false negatives/positives since there is no action being taken after modelling.
As you can see from the distribution of the target variable above, there isn't a class imbalance so I am mostly concerned with accuracy here.
I will therefore use R2 as a metric since I'm interested in how the model performs specifically on the data in this set alone since I'm trying to understand the relationships between features and the target variable.
An R2 score measures the proportion of variance in the target variable - in this case, the divisiveness of Trump's speech - that is predictable from the independent variable.
It is defined as below:
R2=1−RSSTSSRSS = the sum of squares of the residuals TSS = total sum of all squares
Using this metric, an R2 score of 0 is the baseline for model performance - this is what a model that predicts ˉy will give.
The above considerations all suggest that a regression tree would be a good fit for this; there's no class imbalance and the results are highly interpretable.
Furthermore, I have a lot of features and large amount of data too.
So I will use a regression tree for this as below.
# Select feature matrix and target variables
# Removing text since I'm interested in how the other people Trump communicates with affect his language
# So text is effectively the target variable in a different form
X = trump_final[[x for x in trump_final.columns if not x in ['divisiveness_score', 'text']]]
y = trump_final["divisiveness_score"]
# Some of the y 0 values have wrongly encoded as NaN so will fix that here
y.replace(np.nan, 0, inplace=True)
# Encoded time column to numeric
X['date']=X['date'].map(dt.datetime.toordinal)
# First train test split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)
# Then instantiate a decision tree regressor
treereg = DecisionTreeRegressor(random_state=1, max_depth=3)
# Fit model
treereg.fit(X_train, y_train)
# Evaluate
scores = cross_val_score(treereg, X_test, y_test, cv=14, scoring='r2')
scores.mean()
0.13757747676863255
That's a little bit better than the baseline, hopefully I can improve on that though.
# That wasn't very good so I will try and tune some hyperparameters to improve it
param_grid = {
'max_depth': list(range(1, 15)),
"min_samples_split": list(range(2,6)),
"min_samples_leaf": list(range(1, 6)),
"max_features": ['auto', 'sqrt', 'log2']
}
tree_reg_optimal = RandomizedSearchCV(treereg, param_grid, random_state=42, scoring='r2')
search = tree_reg_optimal.fit(X_train, y_train)
search.best_params_
{'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'auto', 'max_depth': 5}
# Now let's try a tuned regression
treereg2 = DecisionTreeRegressor(random_state=1,
max_depth=5,
min_samples_split=2,
min_samples_leaf=2,
max_features='auto')
# Fit model
treereg2.fit(X_train, y_train)
# Evaluate
scores = cross_val_score(treereg2, X_test, y_test, cv=14, scoring='r2')
scores.mean()
0.18333373669044883
This is quite low generally speaking but I think fine for my purposes - it's better than the baseline and predicting human behaviour is going to be hard.
What I'm looking for, as stated previously, is trends; so I this model should still be capable of interpreting feature relationships to the target variable.
# Access the model's feature importance
feature_importance = pd.DataFrame({'feature': X.columns.to_list(),
'importance': treereg2.feature_importances_}).sort_values(by='importance',
ascending=False)
feature_importance.loc[feature_importance.importance > 0]
feature | importance | |
---|---|---|
5 | date | 0.934353 |
298 | post_length | 0.063586 |
4 | retweets | 0.001050 |
3 | favorites | 0.001011 |
So this looks like date is overwhelmingly important, which is quite interesting; that may be due to the fact that the dataset starts before Trump became the President and was less political.
The only interlocutor here is twimg, which is twitter's own image hosting site. So that doesn't actually suggest the interlocutor is important, it does suggest that posts with images are more likely to contain divisive messaging .
Now I will see if I can find any correlation between the interlocutor features - i.e. seeing if a decision tree regressor will work using only the Twitter handle and url information.
# Remove other features
cols_to_drop = ['isRetweet', 'isDeleted', 'device', 'favorites', 'retweets', 'date', 'reply_count',
'quote_count', 'sentiment', 'post_length', 'text', 'divisiveness_score']
# Create new feature matrix and target variable
X_new = trump_final[[x for x in trump_final.columns if not x in cols_to_drop]]
y_new = trump_final["divisiveness_score"]
# Test baseline r2_score
X_new['prediction'] = y_new.median()
print(r2_score(y_new, X_new.prediction))
X_new.drop('prediction', axis=1, inplace=True)
-0.004172031322973169
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X_new, y_new, random_state=42, test_size=0.3)
# Instantiate new model
treereg3 = DecisionTreeRegressor(random_state=1, max_depth=3)
# Fit model
treereg3.fit(X_train, y_train)
# Evaluate
scores = cross_val_score(treereg3, X_test, y_test, cv=14, scoring='r2')
scores.mean()
0.006789968440402948
# So that doesn't really work - no correlation
# Let's see if that changes with hyperparameter tuning
tree_reg_optimal_2 = RandomizedSearchCV(treereg3, param_grid, random_state=42, scoring='r2')
search = tree_reg_optimal_2.fit(X_train, y_train)
search.best_params_
{'min_samples_split': 4, 'min_samples_leaf': 5, 'max_features': 'auto', 'max_depth': 10}
# Now score this tuned model
treereg4 = DecisionTreeRegressor(random_state=1,
max_depth=10,
min_samples_split=4,
min_samples_leaf=5,
max_features='auto')
# Fit model
treereg4.fit(X_train, y_train)
# Evaluate
scores = cross_val_score(treereg4, X_test, y_test, cv=14, scoring='r2')
scores.mean()
0.011374888952897098
That's still actually quite a lot worse than the previous one, so I'll stick to the second model. I'm now going to visualize the decision tree it made to see if that confirms what I thought from the feature importance scores - that the date is way more important than other considerations.
# Plot tree to depth 3 - it isn't that readable past this
fig, ax = plt.subplots(1,1, figsize=(14,6))
plot_tree(treereg2, feature_names=X.columns.to_list(), max_depth=3, fontsize=12, filled=True);
In conclusion, it looks like Trump's divisive language is actually not affected by his interlocutor.
From the feature importance metrics above, and as is visible from the tree diagram above, it is in fact most correlated with date; none of the other features really generated a strong correlation at all.
Let's see how the divisiveness score looks when plotted by date.
# Create a feature matix of just the date and divisiveness score for time series plotting
trump_plot = trump_final[['date', 'divisiveness_score']]
trump_plot.index = trump_plot.date
trump_plot.drop('date', axis=1, inplace=True)
# Use seasonal decompose to see how this has changed over a 3 month period
decomposition = seasonal_decompose(trump_final.divisiveness_score, period=3)
fig = plt.figure()
fig = decomposition.plot()
fig.set_size_inches(12, 10)
<Figure size 432x288 with 0 Axes>
# This is very noisy so I will try and use rolling statistics to smooth this out
# Here I'm resampling by quarterly frequency, summing this so it's cumulative per quarter
# And then calculating an exponentially weighted mean to smooth out some of the noise
trump_plot.divisiveness_score.resample('Q').sum().ewm(span=2).mean().plot(legend=True,
label='Trump Tweet Divisiveness');
This seems intuitively plausible - it broadly tracks Trump's increasing political profile - 2011 speech at the Conservative Political Action Conference, increasing campainging until election in 2016 and then skyrocketing values in 2020 with a comparatively quiet period in some of the interm years of his presidency.
Even though the hypothesis was false, this project nonetheless leads to quite a few usual conclusions:
# Here is a very brief illustration of how that might be achieved
# Select a particularly divisive area from the graph above
# Then hone in exclusively on the features that denote his interlocutors - domains and twitter handles
divisive_period = trump_final.loc[(trump_final.date >= '2012-09') & (trump_final.date <= '2013-09')]
cols_to_remove = ['text', 'isRetweet', 'isDeleted', 'device', 'favorites', 'retweets','date', 'reply_count',
'quote_count', 'post_length', 'sentiment', 'divisiveness_score']
divisive_period_target = divisive_period.drop(cols_to_remove, axis=1).sum()
# Get top 15 interlocutors in this time period
top_15 = divisive_period_target.sort_values(ascending=False).nlargest(15)
# Plot results
plt.bar(top_15.index, top_15)
plt.title("Trump's Top 15 Interlocutors; 2012-09 to 2013-09")
plt.ylabel('Domain or handle citations')
plt.xticks(rotation=90);
Bit here means bitly which is not very interesting, but it's intriguing to see the other handles that are quite high up - like Lord Sugar and Danny Zuker. Tl also is the shortened version of tweetlonger urls, which suggests Trump wrote lots of longer tweets to get his message across.