Project Overview¶

Problem statement: How does Trump’s interlocutor change his choice of word?

Hypothesis: Trump’s choice of words depends on who he is speaking to, i.e. if he mentions someone in a tweet or references an external url.

Technical Work¶

In [153]:

# First import all required dependencies

import pandas as pd
import datetime as dt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, RandomizedSearchCV
from sklearn.metrics import f1_score, plot_precision_recall_curve, r2_score
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeRegressor, plot_tree
from imblearn.over_sampling import SMOTE
from statsmodels.tsa.seasonal import seasonal_decompose
import scipy as sp
from scipy import stats
import re
import ast
import tldextract
from textblob import TextBlob
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
plt.style.use('fivethirtyeight')
pd.options.mode.chained_assignment = None  # this suppresses setting with copy warnings

Hate Speech Classification¶

In [2]:

# Read in the data

hate = pd.read_csv('hate_speech_dataset.csv')
pd.set_option('display.max_colwidth',150)
hate.head()

Out[2]:

	text	hateful
0	1. i joined gab to remind myself how retarded jew haters are. You wouldn't be typing on your abacus without them you retard.\n	1
1	1. This is what the left is really scared of. https://redd.it/9rfkts\n2. \tThat literally looks like a monkey. Why are we supposed to pretend i...	1
2	1. It makes you an asshole.\n2. \tGive it to a soldier who has defend it , then try to burn it . Show me what a badass you really are. The flag ...	1
3	1. So they manage to provide a whole lot of data, except a critical little nugget. Gay men are well over half the patients who present at STI cl...	1
4	1. Hi there, i,m Keith, i hope you are doing well, i would be very honored to get to know you if you would be interested? @PugLife\n2. \tFuck off\...	1

Sample size¶

This dataset is quite skewed in that there are many more instances of non-hateful speech than there are of hateful speech.

So we need to check that we have enough samples first.

To do this I will use the following from Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications. :

Find the proportion $p$ of positive cases and negative cases. Take the smaller of the two.
Find the number of independent variables $k$ .
Let the minimum number of cases be $N = \frac{10k}{p}$ . The minimum should always be set to at least $100$ .

In [3]:

# Find the proportion
hate.hateful.value_counts(normalize=True)

Out[3]:

1    0.891422
0    0.108578
Name: hateful, dtype: float64

So $p = .108578$

Later I'll determine that $k=3$ - text, text length, sentiment value). See heading 2.1.4 of this notebook for the specifics.

This means that $N = \frac{30}{.108578}$ which gives $N = 276$

In [4]:

# And thankfully that will be fine since we have a lot more than 276 samples

hate.hateful.value_counts()

Out[4]:

1    15016
0     1829
Name: hateful, dtype: int64

Clean data¶

In [5]:

# Looks like there are quite a few unwanted characters 
# So will need to tidy that up

def clean(text):
    """
    This function removes the following uwanted strings from input text:
    1., 2., 3. if they occurs at the start of a string.
    Any url patterns.
    '\n', '\t' characters.
    '@username' handles.
    :param text: string, input text.
    :return clean_text: string, input text with above characteristics removed.
    """
    
    # Define patterns
    # The url regex is very basic but works for this dataset since the urls are automatically generated by reddit
    # So they consequently all have the same pattern
    newline_and_tab_pattern = r'[\n\t]'
    numbers_patterns = r'[(1.)(2.)(3.)]'
    url_patterns = r'http.[^\s]+'
    username_handles = r'@[^\s]+'
    
    # Remove newlines and tabs
    # Then remove regex patterns from text
    new_text_step_1 = re.sub(newline_and_tab_pattern, '', text)
    new_text_step_2 = re.sub(numbers_patterns, '', new_text_step_1)
    new_text_step_3 = re.sub(url_patterns, '', new_text_step_2)
    final_text = re.sub(username_handles, '', new_text_step_3)
    
    return final_text

hate.text = hate.text.apply(clean)
hate.head()

Out[5]:

	text	hateful
0	i joined gab to remind myself how retarded jew haters are You wouldn't be typing on your abacus without them you retard	1
1	This is what the left is really scared of That literally looks like a monkey Why are we supposed to pretend it’s a person bc it’s wearing a r...	1
2	It makes you an asshole Give it to a soldier who has defend it , then try to burn it Show me what a badass you really are The flag is helpless...	1
3	So they manage to provide a whole lot of data, except a critical little nugget Gay men are well over half the patients who present at STI clini...	1
4	Hi there, i,m Keith, i hope you are doing well, i would be very honored to get to know you if you would be interested? Fuck off wow, what a rude...	1

Feature Engineering¶

In [6]:

# I'm now going to create a few extra features that should be useful in predicting hate
# Create a column for the length of each post
hate['post_length'] = hate.text.apply(lambda x: len(x))

# Define a function that accepts text and returns the polarity.
def detect_sentiment(text):
    return TextBlob(text).sentiment.polarity

# Now apply this to create a sentiment column
hate['sentiment'] = hate.text.apply(detect_sentiment)
hate.head()

Out[6]:

	text	hateful	post_length	sentiment
0	i joined gab to remind myself how retarded jew haters are You wouldn't be typing on your abacus without them you retard	1	120	-0.850000
1	This is what the left is really scared of That literally looks like a monkey Why are we supposed to pretend it’s a person bc it’s wearing a r...	1	164	-0.045000
2	It makes you an asshole Give it to a soldier who has defend it , then try to burn it Show me what a badass you really are The flag is helpless...	1	357	-0.181250
3	So they manage to provide a whole lot of data, except a critical little nugget Gay men are well over half the patients who present at STI clini...	1	378	-0.029687
4	Hi there, i,m Keith, i hope you are doing well, i would be very honored to get to know you if you would be interested? Fuck off wow, what a rude...	1	152	-0.030000

In [7]:

# Will check if this worked
# See a positive score
hate[hate.sentiment == 1].text.sample()

Out[7]:

9796     Just what I needed The perfect response to that boomer cunt  #TheBoomerPlague
Name: text, dtype: object

In [8]:

# And a negative score
hate[hate.sentiment == -1].text.sample()

Out[8]:

3716      Thats p/c for nasty cunt !!!
Name: text, dtype: object

In [9]:

# Looks more or less like this worked, there's a lot of extreme language in here so it's probably quite rough
# To test if it's really  helped at all, let's consider  a boxplot of sentiment grouped by hateful or not
hate.boxplot(column='sentiment', by='hateful', showfliers=False, figsize=(12,6));

As the above graph illustrates, this doesn't seem to add too much although the distribution is quite different for hateful text.

So, I'll leave it in for now and then tweak it later when I've got a prototype model up and running and am moving onto training it.

Pre-Process Data¶

In [10]:

# First create a feature matrix and target variable
X = hate[['text', 'post_length', 'sentiment']]
y = hate.hateful


# Then train test split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)

In [11]:

# Define a function to lemmatize the text

def split_into_lemmas(text):
    """
    Lemmatizes input text.
    :param text: string, text to be lemmatized.
    :return lemmatized_text: list, the lemmas in each word.
    """
    text = str(text).lower()
    words = TextBlob(text).words
    lemmatized_text = [word.lemmatize() for word in words]
    return lemmatized_text

In [12]:

# Use this with a TfidfVectorizer to create a document text model

vect = TfidfVectorizer(analyzer=split_into_lemmas, stop_words="english")
X_train_dtm = vect.fit_transform(X_train.text)
X_test_dtm = vect.transform(X_test.text)

In [13]:

# As we'll discover later, I'll be using a multinomial Naive Bayes model for this problem
# Unfortunately this model doesn't accept negative inputs so we'll need to scale the data to address this

# Use a min max scaler to scale all the values between 0 and 1
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_X_train = scaler.fit_transform(X_train.drop('text', axis=1))  # removing text now it's vectorized
scaled_X_test = scaler.transform(X_test.drop('text', axis=1))  # removing text now it's vectorized

In [14]:

# And lastly to add features to the document-term matrix to make everything ready for modelling

X_train_features_matrix = sp.sparse.csr_matrix(scaled_X_train).astype(float)
X_train_combined_matrix = sp.sparse.hstack((X_train_dtm, X_train_features_matrix))
X_test_features_matrix = sp.sparse.csr_matrix(scaled_X_test).astype(float)
X_test_combined_matrix = sp.sparse.hstack((X_test_dtm, X_test_features_matrix))

In [15]:

# Now I wil use SMOTE to randomly oversample the data to try and fix this class imbalance problem
# I got better results than using balanced class weights with the logistic regression model doing this
# Hence taking this approach

oversample = SMOTE()
X_train_rs, y_train_rs = oversample.fit_resample(X_train_combined_matrix, y_train)

Relevant metrics¶

At first glance, it looks like the most important metric for this model is classification accuracy. In the scope of this project, I'm using this model to classify text as hate or not in order to subsequently determine which features affect Trump's hatefulness.

False positives and false negatives aren't too significant since I'm using the model for investigative purposes rather than to make a real world decision such as whether to approve a loan application or not.

However, precision-recall is going to be a bit more relevant than accuracy here due to an imbalanced class distribution in the dataset - there are many more hateful scores than there are non-hateful scores.

Briefly, precision is a measure of result relevancy, while recall measures how many truly relevant results are returned.

More formally:

Precision ( $P$ ) is defined as the number of true positives ( $T_{p}$ ) over the number of true positives plus the number of false positives.

$P = \frac{T_{p}}{T_{p} + F_{p}}$

Recall ( $R$ ) is defined as the number of true positives ( $T_{p}$ ) over the number of true positives plus the number of false negatives ( $F_{n}$ ).

$R = \frac{T_{p}}{T_{p} + F_{n}}$

These quantities are also related to the ( $F1$ ) score, which is defined as the harmonic mean of precision and recall - the harmonic mean is used here since it's more appropriate for measuring rates, which is what precision and recall both measure. See this article for more information on harmonic means and the F1 score.

$F1 = 2 \frac{P \centerdot R}{P + R}$

So, I will use two metrics to assess model performance in this notebook - Root Mean Squared Error (RMSE) to measure model accuracy and an F1 Score to measure Precision-Recall.

Model selection¶

I will start with the null hypothesis before moving on to try a simple logistic regression model and then a multinomial naive bayes model.

I'm going to start with multinomial naive bayes since it's likely better than logistic regression as well as being fast and simple to use.

Logistic Regression¶

So the null hypothesis is 89% - that's what the most frequent class was in the intial dataset.

In [17]:

def evaluate_text_classification_model(X_train, X_test, y_train, y_test, 
                                       *, model):
    """
    A function to evaluate model performance across my desired metrics and plot relevant outputs.
    :params X_train, X_test, y_train, y_test: feature matrices and target variable arrays.
    :param model: various datatypes, the model to test.
    :params k_min, k_max: int, min/max k values for cross validation.
    """
    
    # First measure model using negative MSE 
    # Cross validate
    score = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    print("The negative MSE is {}".format(score.mean()))

    # Now using the F1 Score
    score = cross_val_score(model, X_train, y_train, cv=5, scoring='f1')
    print("The F1 score is {}".format(score.mean()))
    
    # Now plot precision vs recall
    disp = plot_precision_recall_curve(model, X_test, y_test)
    disp.ax_.set_title('Precision-Recall curve')

In [18]:

# Call function
evaluate_text_classification_model(X_train_rs, 
                                   X_test_combined_matrix, 
                                   y_train_rs, 
                                   y_test,
                                   model=logreg1)

The negative MSE is -0.08826002237359398
The F1 score is 0.91051040410233

These scores look quite good but I'm still very concerned about the null hypothesis problem.

So I'm going to see what happens if I apply this to some of the test data.

In [19]:

X_test['predictions'] = y_pred

# Let's test some hateful predictions
for row in X_test[['text', 'predictions']].loc[X_test.predictions == 1].sample(5).values:
    print(row)

[" Shut the fuck up about democrats you retarded boomer cuckservatives If you still give a shit about the kike controlled us govt, you're doing it wrong"
 1]
[" how many jews have pagans burned at the stake ?  How many whites have christians burnt at the stake? well technically none  a whole heap of heretics but  sort of no christians whatsoever  ex christians sure torched loads of them4 So none of the white heretics & white pagans you guys burnt at the stake, & massacred were white 5 Bit strange that he/she claims no Christians too A whole heap of the people burned alive and hanged for witchcraft during the Protestant Reformation were Catholics6 lol, we're talking about CI's, thicker then even regular christians    CI's are some of the most retarded idiots on the planet, all they do is lie & bullshit, thats their entire strategy after their entire evidence for CI got nuked & Rekt! in the CQ   They got Rekt! so hard, they dont even admit theyre christian identerians anymore, top kek 7 Yeah bro, they got rekt!8 And these are the people on here that rag on white Christian's, while pretending to be in tune with their pagan faggot viking cuck god"
 1]
[' Everyone make your own nigger haiku for the new topic!   Redneck? That baby is in a watermelon That baby must be an albino African 4 5 get my grape drank ho   it be time for the ballgame   what you trippin fo?6 😹😹 grape drank '
 1]
[" His name was Brandon Arndt He helped his black elderly neighbor only to be killed for his good deeds Perhaps no one ever gave him this sage advice, avoid the groid\xa0   That's in their nigger blood, to kill the White Folk They kill their own kind too and that just might be a good thing because those are the worst of their race We can rid ourselves of that bad seed through attrition"
 1]
[' CNN’s John Berman Fires Back At Trump After The President Criticizes The ‘Mainstream Media’ On Twitter      Who gives a shit what that faggot says? Glad his security clearance was revoked Eat Dirt Brennan'
 1]

In [20]:

# Let's test some non hateful predictions
for row in X_test[['text', 'predictions']].loc[X_test.predictions == 0].sample(5).values:
    print(row)

[" Hi and welcome to this episode of Reddit cooking show Preheat the oven to 5 Fahrenheit We'll start of by using the r/TumblrInAction as a base and then layer other ingredients on top Add a dash of r/thathappened spice it up with r/iamverybadass and you've got yourself this shit lasagna of a post Sounds tasty!"
 0]
[" But black people can't be racist! That would mean they're held to the same standards as everyone else? [removed] Ok, Cletus the redneck You kind of actually got racist Like, a lot "
 0]
[" Yeah sorry but video games are more fun than invisible ball Whenever I see the donate toys or clothes to kids overseas, I always think about giving it to just one kid, so they get everything and then see how long it takes until the town/village turns against them I'd send over one console per village Stage an Enter the Dragon style fight to the death tournament, winner gets the console Film the tournament and sell as pay-per-view Give a percentage of the proceeds to the village children   Pros   Gives the kids something to work towards  Raises money  Reduces population and so there's more resources left for others 4 Kids get to play some cool games 5 Tournament is interesting content for viewers and better than most shows on TV now anyway  Cons   Kids have to fight each other to the death  There might not be electricity to plug in the console so it could be wasted on these kids anyway  That's 5 pros vs  cons Could potentially spice things up by having animals armed with weapons be tag team partners for the kids"
 0]
[' She’s not wrong, it was fueled by sexism  The blatant sexism displayed by those involved in the production and by their defenders was indefensible and doubtlessly created more enemies of the film than fans I have a strong suspicion that the reason 99% of the "comedy" in the movie is terrible improv is because Paul Feig was afraid of actually giving the female actors direction because he didn\'t want to seem sexist He just pointed a camera at them and let them riff because actually telling them what they should be doing would be mansplaining Which is kinda sexist if you think about it They\'re professional actors and you\'re their director Direct them   Either that or he\'s just a shit director Although he did direct some episodes of The Office that I like, but that could just come down to a strong writing staff and a solid cast  I saw a review of the Ghostbusters reboot which included a detailed analysis of the controversies surrounding it and gave some insights about its creation According to the reviewer which is backed by excerpts from interviews with the cast or [this] article Paul Feig has a really bizarre fetish for improvisations He deliberately includes them in every movie he directs The trouble is that you have to have people who are awesome at improv and be careful about that  I don\'t say improv is bad - it is bad when it is overdone, and when people don\'t actually know how to do it well, yet  some of [the best scenes] in movies as we know them were made possible because of improvisations  I think Feig is just holding a proverbial hammer in his hand and sees only nails everywhere4 A good example is ricky gervais, or steve carell They rarely do the same line twice but it\'s always in-step with what\'s happening5 That\'s the thing good improv takes the general idea of what should be happening and and goes from there Feigbusters improv was incoherent rambling without any sense of meaning6 Damn right There\'s a universe of difference between steve carell coming up with an unexpected banger; unasked and mcarthy being forced to improv   7 What\'s funny is that there wasn\'t much improv in The Office, according to the cast Other than the kiss between michael and oscar, which the reactions were 00% genuine 8 Improv can work when you have people that are legitimately funny, and not just “comedic actors” in your film Best in Show - one of my favorite movies, not just comedies, of all time - is damn near pure improvisation by all of the actors Christopher Guest and Eugene Levy wrote a quick outline and sort of gave each actor their character, and it was mostly on the actors to figure out how that character worked and go from there What they turned out was amazing But that movie is filled to the brim with hilarious people, Ghostbusters was not None of the 4 principle actresses are innately funny Sure, they can read some funny dialogue and overact to make it work for the SNL crowd, but that’s it They’re *acting* funny, they’re not actually funny There is a difference   Point being, improv can absolutely carry a movie when done with the right people, both in front of and behind the camera Ghostbusters Femme Edition just didn’t have the juice 9 I agree with you about McCarthy and Jones They seemed the weakest by far in the movie  Switch them out with real improv people like Amy Poehler, Niecy Nash, or even Tina Fay 0 See, I would have never guessed that Best in Show was nearly pure improv Then again, there is a subtlety to the movie and lots of the bits, like in Spinal Tap, are just quick little pieces of dialogue  I wonder if he relied on it too much here, because I think only Kate McKinnon has the chops and timing to shine in improv  I don\'t think that\'s strictly the case It isn\'t good improvisational skills that are required, tho they may help What\'s required is a talented actor who really understands their character, the story, the scene and the genre  Although not comedy, one example I can give is Margot Robbie and Leo DiCaprio is Wolf of Wall Street In one scene, she loses her temper with him and slaps him The slap was improvised and when Robbie profusely apologised afterwards, DiCaprio assured her she did a fantastic job and did what felt right for the character  Another example, also DiCaprio and also drama, in a scene in Django Unchained he actually cuts his hand on a broken glass, and without missing a beat, continues in character and wraps his hand That made the final edit  Now I know comedy is different from drama, I\'m not claiming they\'re the same But there\'s a difference between improv as they did in GB06 with Paul Fieg, and improv \\*in character\\* and in context You can\'t make a good film the way he did, just letting the camera roll and the actors do whatever They need a script and to let the actors do it their way, perhaps a few takes to try different things with the director directing them as needed  The trouble comes when the director doesn\'t have a clear enough vision for the final product The Hobbit trilogy also famously suffered for similar reasons I understand what you mean Yes, I completely agree4 What gets me is, Feig was pretty good in Heavyweights  He was a fun character in that I still cant believe it\'s the same person that acted in that 5 That was a long, long time ago6 > Paul Feig was afraid of actually giving the female actors direction because he didn\'t want to seem sexist  I think this is super unlikely Feig has directed women plenty before He directed some of the women in this movie before, to great results Bridesmaids  The problem is that neither he nor the studio suits *understood* Ghostbusters and why it was so well regarded in the first place They thought they could just take their own sensibilities, slap a Slimer into it, and call it a day Obviously, that didn\'t work7 They just plain didn\'t respect it is where the real trouble lies They could have honored the series by giving the old characters a proper send off But, nope They wanted to steal the name and pretend the first one didn\'t happen'
 0]
[" Ha! Just saw that stat on tv, turned to the Mrs and said don't become a statistic I posted this joke on r/jokes I don’t think we’re welcome there lol  I appreciate u bro 😂 👍 it's always a touching moment when you meet a like minded cunt I'm tearing up4 Haha I’m just happy downvotes don’t take away from accounts karma or my account wouldve been deleted now for sure lol5 😂"
 0]

In [21]:

# Now that's done will delete the predictions columns

X_test.drop('predictions', axis=1, inplace=True)

The above actually looks pretty good; so combined with the precision/recall scores this seems OK.

Multinomial Naive Bayes¶

In [22]:

# Repeat the above steps for this model

nb = MultinomialNB()
nb.fit(X_train_rs, y_train_rs)
evaluate_text_classification_model(X_train_rs, X_test_combined_matrix, y_train_rs, y_test,
                                  model=nb)

The negative MSE is -0.19331830125081564
The F1 score is 0.7650917937399067

So from the above measure it looks like logistic regression actually performs slightly better than both the null accuracy and the multinomial naive bayes - simplest is best!

This is probably due to the fact that naive bayes assumes features are conditionally independent, which likely doesn't apply to this dataset - i.e. whether something is hate or not is very likely to depend on non-independent combinations of words.

To take an example from a above, if 'jews' and 'hate' are in a sentence then this combination is likely to cause 'yid'.

Model Tuning¶

Now we've established that logistic regression is well suited to this problem, I'm going to see if I can create a better vectorizer and also randomly oversample the data to see if that helps the class imbalance.

In [23]:

# Create a new vectorizer with different parameters
vect2 = TfidfVectorizer(analyzer=split_into_lemmas, stop_words='english', ngram_range=(1, 2), min_df=.05, 
                      max_df=.3, max_features=100)

# Train test split the data again using the same split and random state
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)
X_train_dtm = vect2.fit_transform(X_train['text'])
X_test_dtm = vect2.transform(X_test['text'])

# And lastly to add features to the document-term matrix to make everything ready for modelling
X_train_features_matrix = sp.sparse.csr_matrix(X_train.drop('text', axis=1)).astype(float)
X_train_combined_matrix = sp.sparse.hstack((X_train_dtm, X_train_features_matrix))

# Repeat for testing set.
X_test_features_matrix = sp.sparse.csr_matrix(X_test.drop('text', axis=1)).astype(float)
X_test_combined_matrix = sp.sparse.hstack((X_test_dtm, X_test_features_matrix))

# Resample the data again
X_train_rs_2, y_train_rs_2 = oversample.fit_resample(X_train_combined_matrix, y_train)

# Instantiate and fit model; then plot precision recall curve
logreg2 = LogisticRegression(max_iter=1000)
logreg2.fit(X_train_rs_2, y_train_rs_2)
disp = plot_precision_recall_curve(logreg2, X_test_combined_matrix, y_test)
disp.ax_.set_title('Precision-Recall curve')

Out[23]:

Text(0.5, 1.0, 'Precision-Recall curve')

That looks very marginally better although not by enough to justify the extra work

In [25]:

# I'm now going to create some out of sample data to really test if this works
# Sorry for my language :-S

# Create new out of sample data
test_hate_speech = ["Die you stupid fucking evil cunt",
                   "I really like eating cookies out of the cookie jar",
                   "Go to hell you stupid fat bastard"]

# Prepare data the same way as before
test_df = pd.DataFrame({'text': test_hate_speech})
test_df.text = test_df.text.apply(clean)
test_df['post_length'] = test_df.text.apply(lambda x: len(x))
test_df['sentiment'] = test_df.text.apply(detect_sentiment)

test_df.head(3)

Out[25]:

	text	post_length	sentiment
0	Die you stupid fucking evil cunt	32	-0.9
1	I really like eating cookies out of the cookie jar	50	0.2
2	Go to hell you stupid fat bastard	33	-0.8

In [26]:

# Now time to predict some probabilities for this
test_dtm = vect.transform(test_df.text)

scaled_test_features = scaler.fit_transform(test_df.drop('text', axis=1))
test_feature_matrix = sp.sparse.csr_matrix(scaled_test_features).astype(float)
test_combined_matrix = sp.sparse.hstack((test_dtm, test_feature_matrix))

test_scores = pd.DataFrame(logreg1.predict_proba(test_combined_matrix), columns=['Not Hate Speech Probability',
                                                                                 'Hate Speech Probability'])
test_scores['original_text'] = test_df.text
test_scores

Out[26]:

	Not Hate Speech Probability	Hate Speech Probability	original_text
0	0.019495	0.980505	Die you stupid fucking evil cunt
1	0.964331	0.035669	I really like eating cookies out of the cookie jar
2	0.010542	0.989458	Go to hell you stupid fat bastard

Hooray! This seems to be working :-)

Divisive Language in Trump's Twitter Network¶

In [27]:

# Now read in the Trump data
trump = pd.read_csv('Trump_tweets_more_info.csv', index_col=0)
trump.sample(5)

Out[27]:

	id	text	isRetweet	isDeleted	device	favorites	retweets	date	urls	reply_count	quote_count
37792	788468036541874176	Thank you Colorado Springs. If I’m elected President I am going to keep Radical Islamic Terrorists out of our count... https://t.co/N74UK73RLK	f	f	Twitter for iPhone	26542	10577	2016-10-18 19:53:23	https://twitter.com/realDonaldTrump/status/788468036541874176/photo/1	3388	1335
13391	260487521430040576	Everybody is asking about my announcement this Wednesday concerning Barack Obama---just wait and see!	f	f	Twitter Web Client	169	544	2012-10-22 21:07:19	NO DATA	522	7
33588	607700233011757056	"""@SEETEK_AU: Watch, listen, and learn. You can’t know it all yourself. Anyone who thinks they do is destined for mediocrity.― Donald Trump"	f	f	Twitter for Android	141	101	2015-06-08 00:06:40	NO DATA	27	5
48687	1189146892292313088	A great book by a great guy. Get it now! https://t.co/hwFtbpbIO0	f	f	Twitter for iPhone	34003	7706	2019-10-29 11:48:07	https://twitter.com/dbongino/status/1188636122764718080	2126	235
20113	338060998793646080	@DannyZuker, are you ready for the deal?	f	f	Twitter for Android	23	33	2013-05-24 22:36:37	NO DATA	25	1

In [28]:

# Replace no data as nan
trump.urls.replace('NO DATA', np.nan, inplace=True)

In [29]:

# The id column is unecessary so I can remove that
trump.drop('id', axis=1, inplace=True)

In [30]:

# Now going to convert the date column to a pandas datetime object
trump['date'] = pd.to_datetime(trump.date)

# Now I will factorize the columns with categorical variables so they are numerically encoded
trump.device = pd.factorize(trump.device)[0]
trump.isRetweet = pd.factorize(trump.isRetweet)[0]
trump.isDeleted = pd.factorize(trump.isDeleted)[0]

In [31]:

# The urls one is quite a lot fiddlier
# First I need to extract the root domain from the url
# i.e. convert https://twitter.com/realDonaldTrump to 'twitter'
# This is because I'm interested in traffic sources, i.e. the above url should be encoded the same as the below:
# https://twitter.com/JoeBiden
# The point I'm interested in is that they both represent traffic from twitter

def get_root_domain(urls):
    """
    This converts an input list of urls into the root domain of a url.
    :param urls: string, contains list of urls encoded as string OR one url as string, np.nan in case of no data.
    :return domain: string, the root domain(s) OR np.nan in case of no data.
    """
    
    # First return no data if appropriate
    if isinstance(urls, type(np.nan)):
        return urls
    else:
        # Convert string representations of lists into lists
        # If the string is a list
        try:
            urls = ast.literal_eval(urls)
            extracted_domains = []
            for url in urls:
                ext = tldextract.extract(url)
                extracted_domains.append(ext.domain)
            domain = ",".join(extracted_domains)
            return domain
        # This is a bit cheeky but since I created this data column I know exceptions will mean
        # that the string isn't a list so there's only one url
        except Exception:
            pass
            ext = tldextract.extract(urls)
            domain = ext.domain
            return domain

# Apply the function the urls column
trump['domains'] = trump.urls.apply(get_root_domain)

In [32]:

# Let's check this actually worked
trump.domains.value_counts()

Out[32]:

twitter                          6302
bit                              1378
tl                                369
pscp                              339
instagram                         272
                                 ... 
createaforum                        1
trumpgolfireland                    1
coloradosun,twitter                 1
chn                                 1
eventbrite,eventbrite,twitter       1
Name: domains, Length: 917, dtype: int64

In [33]:

trump.loc[trump.domains=='tl'].sample(3)[['urls', 'domains']]

Out[33]:

	urls	domains
15211	http://tl.gd/gv14j8	tl
15181	http://tl.gd/h25ov4	tl
15539	http://tl.gd/g1vo61	tl

In [34]:

trump.loc[trump.domains=='pscp'].sample(3)[['urls', 'domains']]

Out[34]:

	urls	domains
44790	https://www.pscp.tv/w/bolO8zFvTlFsTFJub1dwUXd8MU93eFdXd09aTUF4UUC3oF_Q3pkR0oyuvTU2e1ScZovXxPvSHzypF5VgXz8v?t=1s	pscp
45529	https://www.pscp.tv/w/bjOY3TFvTlFsTFJub1dwUXd8MU93eFdXcmJRTkR4UYead0duJSwkVRgKsYSG96dk-GyQ5jmjgK36wWdE84ez?t=1s	pscp
51988	https://www.pscp.tv/w/b8Zu2zFvTlFsTFJub1dwUXd8MWRqR1hwRUV3TW9HWkSLS2I9C4IQU-7hm2yX1sz1KKWAL8CxUWxH6mrM2goh?t=43s	pscp

In [35]:

# Boom shakalaka that is looking good.
# Right, now moving on...
# First drop the urls column since we no longer need it

trump.drop('urls', axis=1, inplace=True)

In [36]:

# There are too many values to encode all of these numerically

trump.domains.value_counts()

Out[36]:

twitter                          6302
bit                              1378
tl                                369
pscp                              339
instagram                         272
                                 ... 
createaforum                        1
trumpgolfireland                    1
coloradosun,twitter                 1
chn                                 1
eventbrite,eventbrite,twitter       1
Name: domains, Length: 917, dtype: int64

In [37]:

# So I'm going to filter out for domains that appear less than ten times and encode those as Nan

indices_for_values_less_than_ten = np.where(trump.domains.value_counts().values < 10)[0].tolist()
domains_to_encode_as_nan = trump.domains.value_counts()[indices_for_values_less_than_ten].index.to_list()

trump['domains_filtered'] = trump.domains.apply(lambda x: np.nan if x in domains_to_encode_as_nan else x)
trump.domains_filtered.value_counts()

Out[37]:

twitter                                     6302
bit                                         1378
tl                                           369
pscp                                         339
instagram                                    272
                                            ... 
bostonherald                                  10
nydn                                          10
es                                            10
facebook,twitter,twitter,twitter,twitter      10
spectator                                     10
Name: domains_filtered, Length: 104, dtype: int64

In [38]:

# Now I will encode these

trump_dummies_1 = trump.domains_filtered.str.get_dummies(sep=',')
trump_dummies_1.sample(3)

Out[38]:

	ArmyForTrump	DonaldJTrump	Vote	abcn	amzn	apne	bit	bloom	bloomberg	bongino	...	washingtonexaminer	washingtonpost	washingtontimes	wh	whitehouse	winred	wsj	yhoo	youtu	youtube
35526	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
35142	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
49016	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

3 rows × 88 columns

Feature Engineering¶

Once you've added the hatefulness score you have a target variable
However before doing that you need to recreate the same features as in the previous model
So now you should look at correlation and repeat some of the EDA
Won't I actually want to drop the text once I've got a hatefulness score? YES because I'm not interested in how the text affects hatefulness, the text is the target variable. Nice!

In [39]:

# The other thing I want to do now is extract twitter handles
# This way I can see who Trump is talking too

def handle_extraction(tweet):
    """
    Function to extract a twitter handle from text.
    :param tweet: string, input data.
    :return output: string, comma separated list of handles.
    """
    
    username_handles = r'\B@\w*' # nb slightly different to simpler above handle for regex
    result = re.findall(username_handles, tweet)
    
    # If only one handle in tweet
    if result:
        if len(result) == 1:
            output = result[0]
            return output
        
        #If multiple handles
        if len(result) > 1:
            output = ",".join(result)
        return output

trump['handles'] = trump.text.apply(handle_extraction)

In [40]:

# Check this actually worked
trump.handles.value_counts()

Out[40]:

@realDonaldTrump                   708
@BarackObama                       541
@WhiteHouse                        372
@foxandfriends                     316
@FoxNews                           222
                                  ... 
@unicef                              1
@ray_chipendo                        1
@debragarrett,@usminority            1
@CLewandowski_,@realdonaldtrump      1
@Molly_Stew,@realDonaldTrump         1
Name: handles, Length: 18782, dtype: int64

In [41]:

# Encode as before, removing handles that appear less than 10 times

indices_for_values_less_than_ten = np.where(trump.handles.value_counts().values < 10)[0].tolist()
domains_to_encode_as_nan = trump.handles.value_counts()[indices_for_values_less_than_ten].index.to_list()

trump['handles_filtered'] = trump.handles.apply(lambda x: np.nan if x in domains_to_encode_as_nan else x)
trump.handles_filtered.value_counts()

Out[41]:

@realDonaldTrump                708
@BarackObama                    541
@WhiteHouse                     372
@foxandfriends                  316
@FoxNews                        222
                               ... 
@kyleraccio,@realDonaldTrump     10
@_Snurk,@realDonaldTrump         10
@TrumpTurnberry                  10
@ArsenioHall                     10
@jacknicklaus                    10
Name: handles_filtered, Length: 222, dtype: int64

In [42]:

# Lots of self-referential tweets, interesting...
# Encoding now...

trump_dummies_2 = trump.handles_filtered.str.get_dummies(sep=',')
trump_dummies_2.sample(3)

Out[42]:

	@	@60Minutes	@ABC	@ACTBrigitte	@AGSchneiderman	@AP	@AbeShinzo	@AlanDersh	@AlexSalmond	@AmSpec	...	@politico	@realDonaldTrump	@seanhannity	@senatemajldr	@tedcruz	@thebradfordfile	@thehill	@trish_regan	@washingtonpost	@yankees
17997	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
4054	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
31927	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

3 rows × 203 columns

In [43]:

# Remove the blank handle one
trump_dummies_2.drop('@', axis=1, inplace=True)

In [44]:

# Now add the dummy variables to the main dataset and remove unecessary columns

trump.drop(['domains', 'handles', 'handles_filtered', 'domains_filtered'], axis=1, inplace=True)
trump_final = pd.concat([trump, trump_dummies_1, trump_dummies_2], axis=1)      

In [45]:

# Now I need to clean up some of the noise in the text before vectorizing it
# Looks like there were quite a few unwanted characters 
# So will need to tidy that up

def clean_again(text):
    """
    This function removes the following uwanted strings from input text:
    urls, hashtags, &amp, RT :, newlines, tabs, twitter handles.
    :param text: string, input text.
    :return clean_text: string, input text with above characteristics removed.
    """
    
    # Define patterns
    # The url regex is very basic but works for this dataset since the urls are automatically generated by reddit
    # So they consequently all have the same pattern
    newline_and_tab__and_hashtag_and_exclamation_mark_pattern = r'[\n\t#"!:]'
    url_patterns = r'http.*[^\s]+'
    username_handles = r'\B@\w*'
    
    # Remove phrases
    text_1 = text.replace('RT', '')
    text_2 = text_1.replace('&amp', '')

    # Remove regex patterns from text
    new_text_step_1 = re.sub(newline_and_tab__and_hashtag_and_exclamation_mark_pattern, '', text_2)
    new_text_step_2 = re.sub(username_handles, '', new_text_step_1)
    final_text = re.sub(url_patterns, '', new_text_step_2)
    
    return final_text

trump_final.text = trump.text.apply(clean_again)

In [46]:

# Remove blank strings if they have been created by the above
trump_final.text.replace('', np.nan, inplace=True)
trump_final.dropna(subset=['text'], inplace=True)

In [47]:

# In order for the logistic regression model created earlier to work, the feature matrix will need to be the same
# So that means creating the same features

trump_final['post_length'] = trump_final.text.apply(lambda x: len(x))

# Now apply this to create a sentiment column
trump_final['sentiment'] = trump_final.text.apply(detect_sentiment)

In [48]:

# Now it's time to vectorize the text and add a hatefulness score
# To do this I'll use the logistic regression model created earlier and the same vectorizer

trump_dtm = vect.transform(trump_final.text)

In [49]:

# Now combine with the relevant features

trump_scaled_features = scaler.fit_transform(trump_final[['post_length', 'sentiment']])
trump_feature_matrix = sp.sparse.csr_matrix(trump_scaled_features).astype(float)
trump_combined_matrix = sp.sparse.hstack((trump_dtm, trump_feature_matrix))

In [114]:

# Now load in the model saved earlier
# And use this to classify the probability that Trump's speech is hateful or not
# Note - I expect most of it won't be compared to the sample hate data 
# But I'm more looking for trends of hateful or not

test_scores = pd.DataFrame(logreg1.predict_proba(trump_combined_matrix), 
                                 columns=['Not Hate Speech Probability', 'Hate Speech Probability'])
test_scores['original_text']  = trump_final.text

# Not sure why but this has some Nan values and blank strings missed so removing those
test_scores.original_text.replace('  ', np.nan, inplace=True)
test_scores.dropna(inplace=True)
test_scores.loc[test_scores['Hate Speech Probability'] > .7].sample(10)

Out[114]:

	Not Hate Speech Probability	Hate Speech Probability	original_text
47146	0.289630	0.710370	Democrats are far more concerned with Illegal Immigrants than they are with our great Military or Safety at our dangerous Southern Border. They co...
15117	0.240332	0.759668	awesome Thsnks
28296	0.267540	0.732460	If you would buy the team that would be awesome Just an extra win for my squad)
16096	0.288126	0.711874	This is great How awesome of Mr. Trump Thank you.
29691	0.250606	0.749394	This very expensive GLOBAL WARMING bullshit has got to stop. Our planet is freezing, record low temps,and our GW scientists are stuck in ice
17562	0.239279	0.760721	“MSNBC'S TOURÉ HAS EPIC RACE-BAITING MELTDOWN ON CNN”
35341	0.174289	0.825711	Hi Katie, let's get Donald Trump in the WH. He's the man to get this country back in order
18682	0.060416	0.939584	I feel like is the only person on my tl that has common sense when it comes to the future of our country
24633	0.079892	0.920108	it's by far my favorite Mac Miller song. Can't beat Donald Trump
48469	0.181754	0.818246	95% Approval Rating in the Republican Party. Thank you

In [72]:

test_scores.loc[test_scores['Hate Speech Probability'] < .3].sample(10)

Out[72]:

	Not Hate Speech Probability	Hate Speech Probability	original_text
47801	0.878038	0.121962	Do not believe any article or story you read or see that uses “anonymous sources” having to do with trade or any other subject. Only accept inform...
47806	0.778821	0.221179	Nancy Pelosi just had a nervous fit. She hates that we will soon have 182 great new judges and sooo much more. Stock Market and employment records...
46237	0.738021	0.261979	This is my 500th. Day in Office and we have accomplished a lot - many believe more than any President in his first 500 days. Massive Tax , Regulat...
5953	0.955222	0.044778	15 DAYS TO SLOW THE SPREAD
44276	0.899759	0.100241	Today in the East Room of the , it was my true privilege to award seven extraordinary Americans with the Presidential Medal of Freedom...
531	0.830135	0.169865	Why isn’t Biden corruption trending number one on Twitter? Biggest world story, and nowhere to be found. There is no”trend”, only negative stories...
6466	0.972319	0.027681	Amy Coney Barrett is an outstanding judge and an even better person. I commend President Trump on another exceptional pi…
20351	0.852116	0.147884	True and thanks.
49392	0.925036	0.074964	The first so-called second hand information “Whistleblower” got my phone conversation almost completely wrong, so now word is they are going to th...
46129	0.759820	0.240180	The World has taken a big step back from potential Nuclear catastrophe No more rocket launches, nuclear testing or research The hostages are back ...

In [52]:

# Let's check the distribution to see if it looks alright

test_scores['Hate Speech Probability'].hist(bins=200);

Overall, this looks like it's not actually picking up hate speech since Trump doesn't use language like in the previous dataset - say swear words, racial slurs.

I think what is taking place here is data drift - the dataset I trained the hate speech classifier on is significantly different to the dataset of Trump's tweets.

Recall however that I'm not trying to measure hatefulness in Trump - my model is clearly insufficient for that - but rather trends in his word choice, which this model does seem to be picking up.

Specifically, it looks like the model is picking up more divisive areas that would typically lead to hate speech, such as immigration, global warming, swearing; these are typically signs of more nationalist rhetoric. I've checked this by making lots of manual tests of the data - this is definitely something that could be fine tuned in further iterations of this model beyond a prototype.

So this metric will serve as a measure of what makes Trump's speech divisive, i.e. what might lead or be more likely to lead to hate speech; that's what the next stage of this notebook will explore.

Relevant metrics¶

I'm now going to try and build a new model to see which features make Trump's speech more or less divisive. The point of doing this is really investigative - I'm interested in seeing who in his network makes him more or less divisive.

So interpretability is therefore very important since it's the main point of this whole exercise. Additionally, I'm not too worried about false negatives/positives since there is no action being taken after modelling.

As you can see from the distribution of the target variable above, there isn't a class imbalance so I am mostly concerned with accuracy here.

I will therefore use R2 as a metric since I'm interested in how the model performs specifically on the data in this set alone since I'm trying to understand the relationships between features and the target variable.

An R2 score measures the proportion of variance in the target variable - in this case, the divisiveness of Trump's speech - that is predictable from the independent variable.

It is defined as below:

$R^2 = 1 - \frac{RSS}{TSS}$

$RSS$ = the sum of squares of the residuals $TSS$ = total sum of all squares

Using this metric, an $R^2$ score of 0 is the baseline for model performance - this is what a model that predicts $\bar{y}$ will give.

Model selection¶

The above considerations all suggest that a regression tree would be a good fit for this; there's no class imbalance and the results are highly interpretable.

Furthermore, I have a lot of features and large amount of data too.

So I will use a regression tree for this as below.

In [54]:

# Select feature matrix and target variables
# Removing text since I'm interested in how the other people Trump communicates with affect his language
# So text is effectively the target variable in a different form 

X = trump_final[[x for x in trump_final.columns if not x in ['divisiveness_score', 'text']]]
y = trump_final["divisiveness_score"]

In [55]:

# Some of the y 0 values have wrongly encoded as NaN so will fix that here

y.replace(np.nan, 0, inplace=True)

In [56]:

# Encoded time column to numeric

X['date']=X['date'].map(dt.datetime.toordinal)

In [57]:

# First train test split the data

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)

# Then instantiate a decision tree regressor
treereg = DecisionTreeRegressor(random_state=1, max_depth=3)

# Fit model
treereg.fit(X_train, y_train)

# Evaluate
scores = cross_val_score(treereg, X_test, y_test, cv=14, scoring='r2')
scores.mean()

Out[57]:

0.13757747676863255

That's a little bit better than the baseline, hopefully I can improve on that though.

In [58]:

# That wasn't very good so I will try and tune some hyperparameters to improve it

param_grid = {
    'max_depth': list(range(1, 15)),
    "min_samples_split": list(range(2,6)),
    "min_samples_leaf": list(range(1, 6)),
    "max_features": ['auto', 'sqrt', 'log2']
}
tree_reg_optimal = RandomizedSearchCV(treereg, param_grid, random_state=42, scoring='r2')
search = tree_reg_optimal.fit(X_train, y_train)
search.best_params_

Out[58]:

{'min_samples_split': 2,
 'min_samples_leaf': 2,
 'max_features': 'auto',
 'max_depth': 5}

In [59]:

# Now let's try a tuned regression
treereg2 = DecisionTreeRegressor(random_state=1,
                                max_depth=5,
                                min_samples_split=2,
                                min_samples_leaf=2,
                                max_features='auto')

# Fit model
treereg2.fit(X_train, y_train)

# Evaluate
scores = cross_val_score(treereg2, X_test, y_test, cv=14, scoring='r2')
scores.mean()

Out[59]:

0.18333373669044883

This is quite low generally speaking but I think fine for my purposes - it's better than the baseline and predicting human behaviour is going to be hard.

What I'm looking for, as stated previously, is trends; so I this model should still be capable of interpreting feature relationships to the target variable.

Feature Importance¶

In [60]:

# Access the model's feature importance 

feature_importance = pd.DataFrame({'feature': X.columns.to_list(), 
                                   'importance': treereg2.feature_importances_}).sort_values(by='importance', 
                                                                                             ascending=False)
feature_importance.loc[feature_importance.importance > 0]

Out[60]:

	feature	importance
5	date	0.934353
298	post_length	0.063586
4	retweets	0.001050
3	favorites	0.001011

So this looks like date is overwhelmingly important, which is quite interesting; that may be due to the fact that the dataset starts before Trump became the President and was less political.

The only interlocutor here is twimg, which is twitter's own image hosting site. So that doesn't actually suggest the interlocutor is important, it does suggest that posts with images are more likely to contain divisive messaging .

Now I will see if I can find any correlation between the interlocutor features - i.e. seeing if a decision tree regressor will work using only the Twitter handle and url information.

In [61]:

# Remove other features

cols_to_drop = ['isRetweet', 'isDeleted', 'device', 'favorites', 'retweets', 'date', 'reply_count',
               'quote_count', 'sentiment', 'post_length', 'text', 'divisiveness_score']

# Create new feature matrix and target variable
X_new = trump_final[[x for x in trump_final.columns if not x in cols_to_drop]]
y_new = trump_final["divisiveness_score"]

# Test baseline r2_score
X_new['prediction'] = y_new.median()
print(r2_score(y_new, X_new.prediction))
X_new.drop('prediction', axis=1, inplace=True)

-0.004172031322973169

In [62]:

# Train test split

X_train, X_test, y_train, y_test = train_test_split(X_new, y_new, random_state=42, test_size=0.3)

# Instantiate new model
treereg3 = DecisionTreeRegressor(random_state=1, max_depth=3)

# Fit model
treereg3.fit(X_train, y_train)

# Evaluate
scores = cross_val_score(treereg3, X_test, y_test, cv=14, scoring='r2')
scores.mean()

Out[62]:

0.006789968440402948

In [63]:

# So that doesn't really work - no correlation
# Let's see if that changes with hyperparameter tuning

tree_reg_optimal_2 = RandomizedSearchCV(treereg3, param_grid, random_state=42, scoring='r2')
search = tree_reg_optimal_2.fit(X_train, y_train)
search.best_params_

Out[63]:

{'min_samples_split': 4,
 'min_samples_leaf': 5,
 'max_features': 'auto',
 'max_depth': 10}

In [64]:

# Now score this tuned model

treereg4 = DecisionTreeRegressor(random_state=1, 
                                 max_depth=10, 
                                 min_samples_split=4, 
                                 min_samples_leaf=5,
                                 max_features='auto')

# Fit model
treereg4.fit(X_train, y_train)

# Evaluate
scores = cross_val_score(treereg4, X_test, y_test, cv=14, scoring='r2')
scores.mean()

Out[64]:

0.011374888952897098

That's still actually quite a lot worse than the previous one, so I'll stick to the second model. I'm now going to visualize the decision tree it made to see if that confirms what I thought from the feature importance scores - that the date is way more important than other considerations.

In [65]:

# Plot tree to depth 3 - it isn't that readable past this

fig, ax = plt.subplots(1,1, figsize=(14,6)) 
plot_tree(treereg2, feature_names=X.columns.to_list(), max_depth=3, fontsize=12, filled=True);

Hypothesis Assessment¶

In conclusion, it looks like Trump's divisive language is actually not affected by his interlocutor.

From the feature importance metrics above, and as is visible from the tree diagram above, it is in fact most correlated with date; none of the other features really generated a strong correlation at all.

Let's see how the divisiveness score looks when plotted by date.

In [66]:

# Create a feature matix of just the date and divisiveness score for time series plotting

trump_plot = trump_final[['date', 'divisiveness_score']]
trump_plot.index = trump_plot.date
trump_plot.drop('date', axis=1, inplace=True)

In [67]:

# Use seasonal decompose to see how this has changed over a 3 month period

decomposition = seasonal_decompose(trump_final.divisiveness_score, period=3)  
fig = plt.figure()  
fig = decomposition.plot()  
fig.set_size_inches(12, 10)

<Figure size 432x288 with 0 Axes>

In [68]:

# This is very noisy so I will try and use rolling statistics to smooth this out
# Here I'm resampling by quarterly frequency, summing this so it's cumulative per quarter
# And then calculating an exponentially weighted mean to smooth out some of the noise

trump_plot.divisiveness_score.resample('Q').sum().ewm(span=2).mean().plot(legend=True, 
                                                                          label='Trump Tweet Divisiveness');

This seems intuitively plausible - it broadly tracks Trump's increasing political profile - 2011 speech at the Conservative Political Action Conference, increasing campainging until election in 2016 and then skyrocketing values in 2020 with a comparatively quiet period in some of the interm years of his presidency.

Summary¶

Even though the hypothesis was false, this project nonetheless leads to quite a few usual conclusions:

It's fairly straightforward to predict hate speech with relative success when it's very far over the line and not very contextual, i.e. when it uses racial slurs rather than being generally divisive.
The hate speech classifier in this project is very much a prototype and could be significantly improved with less imbalanced data and a dataset closer to Trump's usage on Twitter.
Such a model also appears to have some use in picking out trends in language like that of President Trump which is very divisive but avoids using explicit swear words or racial slurs.
The divisiveness of Trump's language does not correlate well with who he communicates with on Twitter; rather it appears much more related to the surrounding context of his own political career.
However, the approach used in this project is quite useful to map out the contours of this context - i.e. when it is more or less divisive - which could provide a way to then see who he talks to the most in those periods. While this information is not correlated with the divisiveness of Trump's speech, it does allow monitoring of how Trump's divisive speech spreads on the Twitter platform. It achieves this by providing a way to hone in on who Trump's network of interlocutors are when he is at his most divisive.
The success of such a model over time could be measured by using time series modelling to predict out of sample data, i.e. by modelling how Trump's speech appears in the next electoral period and then comparing this to the actual data after the election has taken place.

In [154]:

# Here is a very brief illustration of how that might be achieved
# Select a particularly divisive area from the graph above
# Then hone in exclusively on the features that denote his interlocutors - domains and twitter handles 

divisive_period = trump_final.loc[(trump_final.date >= '2012-09') & (trump_final.date  <= '2013-09')]
cols_to_remove = ['text', 'isRetweet', 'isDeleted', 'device', 'favorites', 'retweets','date', 'reply_count', 
                  'quote_count', 'post_length', 'sentiment', 'divisiveness_score']
divisive_period_target = divisive_period.drop(cols_to_remove, axis=1).sum()

# Get top 15 interlocutors in this time period
top_15 = divisive_period_target.sort_values(ascending=False).nlargest(15)

# Plot results
plt.bar(top_15.index, top_15)
plt.title("Trump's Top 15 Interlocutors; 2012-09 to 2013-09")
plt.ylabel('Domain or handle citations')
plt.xticks(rotation=90);

Bit here means bitly which is not very interesting, but it's intriguing to see the other handles that are quite high up - like Lord Sugar and Danny Zuker. Tl also is the shortened version of tweetlonger urls, which suggests Trump wrote lots of longer tweets to get his message across.