by Talha Oz and Halil Bisgin
We presented this work at the 4th International Workshop on Social Web for Disaster Management (SWDM'16), co-located with CIKM 2016; here is the paper.
The Flint water crisis is a story of government failure at all levels. By studying microblog posts about it, we examine how citizens assign responsibility and blame online for such a man-made disaster. We form hypotheses based on social scientific theories in disaster research and then operationalize them on unobtrusive, observational social media data. In particular, we investigate the following phenomena: the source of blame; partisan predisposition; the concerned geographies; and the contagion of complaining.
This paper adds to the sociology of disasters research by exploiting a new, rarely used data source (the social web), and by employing new computational methods (such as sentiment analysis and retrospective cohort study design) on this new form of data. In this regard, this work should be seen as the first step toward drawing more challenging inferences on the sociology of disasters from "big social data".
# Read the raw JSON data and save it to Flint.pkl once;
# afterwards, read the pickle instead of the raw JSON files.
# This code block is here just to show how we created the pickle (.pkl) file.
import pandas as pd
import json
from glob import glob
from datetime import datetime
tw = []
for f in glob("data/TweetCollection/*.json"):
    with open(f, 'r',encoding='utf-8') as fin:
        for line in fin:
            a = json.loads(line)
            tw.append({'id':a['id_str'],
                       'created_at':datetime.strptime(a['created_at'],'%a %b %d %H:%M:%S +0000 %Y'),
                       'hashtagged':any(['flintwatercrisis' in h['text'].lower() for h in a['entities']['hashtags']]),
                       'screen_name':a['user']['screen_name'],
                       'location':a['user']['location'],
                       'followers':a['user']['followers_count'],
                       'verified':bool(a['user']['verified']),
                       'text':a['text']})
df = pd.DataFrame(tw).set_index('id').drop_duplicates()
#df.to_pickle('data/Flint.pkl')
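As a minimal illustration of the per-record extraction above, the sketch below applies the same field mapping to a single hand-made record. The record is invented for illustration (it only mimics the layout the loader assumes), and `tweet_to_row` is a hypothetical helper name that does not appear in the original notebook.

```python
import json
from datetime import datetime

# Timestamp layout used by the loader above; it assumes a literal "+0000" (UTC) offset.
TWITTER_TIME_FORMAT = '%a %b %d %H:%M:%S +0000 %Y'

def tweet_to_row(a):
    """Flatten one decoded tweet into the columns kept in the DataFrame."""
    return {
        'id': a['id_str'],
        'created_at': datetime.strptime(a['created_at'], TWITTER_TIME_FORMAT),
        'hashtagged': any('flintwatercrisis' in h['text'].lower()
                          for h in a['entities']['hashtags']),
        'screen_name': a['user']['screen_name'],
        'location': a['user']['location'],
        'followers': a['user']['followers_count'],
        'verified': bool(a['user']['verified']),
        'text': a['text'],
    }

# Invented sample line, shaped like one line of the raw JSON files.
sample_line = json.dumps({
    'id_str': '1',
    'created_at': 'Fri Jan 15 21:00:24 +0000 2016',
    'entities': {'hashtags': [{'text': 'FlintWaterCrisis'}]},
    'user': {'screen_name': 'example_user', 'location': None,
             'followers_count': 10, 'verified': False},
    'text': 'An example tweet #FlintWaterCrisis',
})
row = tweet_to_row(json.loads(sample_line))
print(row['created_at'], row['hashtagged'])
```

The hashtag test is a substring match on the lower-cased tag, so variants such as `#FlintWaterCrisisNow` would also count as hashtagged.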
import pandas as pd
import numpy as np
pd.set_option('max_colwidth',200)
df = pd.read_pickle('../data/Flint.pkl')
from utilities.geocoder import Geocoder
gc = Geocoder('utilities/geodata/state_abbr_file', 'utilities/geodata/city_file')
df['latlon'] = df.location.str.strip().apply(gc.geocode)
from IPython.display import HTML
HTML(df.head().to_html(index=False)) #what the data looks like
created_at | followers | hashtagged | location | screen_name | text | verified | latlon |
---|---|---|---|---|---|---|---|
2016-01-15 21:00:24 | 265 | True | Sugar Land, Texas | zachsciba | RT @TheDailyShow: #FlintWaterCrisis could have been prevented by an easy $100/day solution. https://t.co/4Jf7oH20EX https://t.co/7fLogvuwrx | False | (29.599580, -95.614089) |
2016-01-15 21:00:07 | 968 | True | None | scootey | You can thank the Republican party for this #Michigan #FlintWaterCrisis #GOP #Uniteblue https://t.co/wK7IFvkk8k | False | None |
2016-01-15 21:00:30 | 189 | True | s. pasadena,ca | steve1204 | RT @TheDailyShow: #FlintWaterCrisis could have been prevented by an easy $100/day solution. https://t.co/4Jf7oH20EX https://t.co/7fLogvuwrx | False | (34.112958, -118.155778) |
2016-01-15 21:00:09 | 8053 | True | Lansing, Michigan | ProgressMich | Snyder still won’t say when he knew about #FlintWaterCrisis. Protest with us on Tuesday to demand answers: https://t.co/aRfLc99QUy #MISOTS | False | (42.717585, -84.554916) |
2016-01-15 21:00:35 | 7 | True | None | marcgilbert77 | RT @TheDailyShow: #FlintWaterCrisis could have been prevented by an easy $100/day solution. https://t.co/4Jf7oH20EX https://t.co/7fLogvuwrx | False | None |
g = df.groupby('text').size().reset_index()
g.columns = ['text','cnt']
g = g.sort_values('cnt',ascending=False)
print('total tw:',len(df),'\nunique tw:',len(g))
g.head() #most popular tweets
total tw: 664775 unique tw: 344384
index | text | cnt |
---|---|---|
274256 | RT @xoShakarra: Friendly reminder that it STILL takes one hour and 23 gallons of water to take a bath in Flint. #FlintWaterCrisis https://t… | 7093 |
202990 | RT @BernieSanders: How do we have so much money to go to war in Iraq but somehow not enough money to provide clean drinking water to Flint?… | 5825 |
261826 | RT @markmobility: #FlintWaterCrisis \n- 99,000 residents\n- 57% Black\n- 40% Poor\n- 9,000 kids with lead poisoning\nFlint HOSPITAL Water: https… | 4710 |
265544 | RT @opinionatedcxnt: Saw this on Tumblr & it made me cringe. The Flint crisis is a horrific nightmare\nhttps://t.co/j6sT5c5p3O | 2354 |
204672 | RT @BuzzFeedVideo: People See What Flint Water Looks Like\nhttps://t.co/3fV2EZFz21 | 1950 |
# The original dates are in UTC/GMT; convert them to US Eastern time.
# Also, as given in footnote #4, report the missing dates.
import pytz
eastern = pytz.timezone('US/Eastern')
df.created_at = df.created_at.dt.tz_localize(pytz.utc).dt.tz_convert(eastern)
# group tweets by day
day = df.groupby(df.created_at.dt.strftime('%m-%d'))['created_at'].count()
# print missing date intervals in our dataset
days = day.index.tolist()
for i in range(len(days)-1):
    m1,d1 = days[i].split('-')
    m2,d2 = days[i+1].split('-')
    if m1 == m2:
        if int(d1) == int(d2) - 1:
            continue
    else:
        if d2 == '01':
            continue
    print('('+days[i]+','+days[i+1]+')',end=' ')
(01-22,01-25) (02-13,02-15) (02-16,02-20) (02-20,02-29) (02-29,03-03) (04-27,05-04) (05-06,05-08) (05-12,05-26)
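The string comparison in the loop above appears to skip any pair whose second date is the first of a month, so a gap that ends on the 1st would go unreported. A minimal alternative sketch, assuming the observed days are available as `datetime.date` objects, compares real dates instead; `missing_intervals` and the sample dates are hypothetical and only for illustration.

```python
from datetime import date, timedelta

def missing_intervals(days):
    """Return (last_seen, next_seen) pairs that bracket at least one missing day."""
    days = sorted(set(days))
    one_day = timedelta(days=1)
    return [(a, b) for a, b in zip(days, days[1:]) if b - a > one_day]

# Invented sample of observed days (2016 is a leap year, so 02-29 -> 03-01 is contiguous).
observed = [date(2016, 1, 21), date(2016, 1, 22), date(2016, 1, 25),
            date(2016, 2, 29), date(2016, 3, 1)]
for a, b in missing_intervals(observed):
    print('(' + a.strftime('%m-%d') + ',' + b.strftime('%m-%d') + ')', end=' ')
```

Working with dates rather than `'%m-%d'` strings also handles month lengths and leap years without special cases.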
#Figure 1
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
import seaborn as sns
matplotlib.style.use('fivethirtyeight')
matplotlib.style.use('ggplot')
plt.rcParams['axes.facecolor']='w'
plt.rcParams['savefig.facecolor']='w'
matplotlib.rcParams['font.size'] = 14
#plot daily activity
ax = day.plot(kind="bar",figsize=(18, 4)) #,title='#FlintWaterCrisis Activity on Twitter'
#ax.set_xlabel('Days After Flint Became a Federal State of Emergency on 2016-01-16', fontsize=14)
ax.set_ylabel('Tweets in the 1% sample', fontsize=14)
for label in ax.xaxis.get_ticklabels()[::2]:
    label.set_visible(False)
ax.annotate('Federal State of Emergency', xy=(0, 31000))
ax.annotate('Gov. Rick Snyder holds a news conf.\n'\
'Groups file a federal lawsuit', xy=(12, 41000),ha='center')
ax.annotate('First Flint hearing in Congress\n'\
'Hillary visits Flint', xy=(22, 9000),ha='center')
ax.annotate('GOP debate in Detroit,MI\nRubio defends MI governor', xy=(31, 23000),ha='center')
ax.annotate('DEM debate in Flint,MI\nBoth candidates call on Snyder to resign', xy=(36, 45000),ha='center')
ax.annotate('MI primaries\nfor both parties', xy=(40, 33000),ha='center')
ax.annotate('Gov Snyder & EPA admin McCarthy\ntestify before Congress', xy=(47, 11000),ha='center')
ax.annotate('A local\'s complaining tweet goes viral\nGov Snyder asks the lawsuit be dismissed', xy=(69, 9000),ha='center')
ax.annotate('Obama visits Flint', xy=(88, 10000),ha='center')
ax.set_xlim([-1, 93])
ax.set_xlabel('');
ax.get_figure().savefig('../figs/daily.pdf',dpi=150,bbox_inches='tight')
#Figure 2
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
import seaborn as sns
#matplotlib.style.use('fivethirtyeight')
matplotlib.style.use('ggplot')
plt.rcParams['axes.facecolor']='w'
plt.rcParams['savefig.facecolor']='w'
matplotlib.rcParams['font.size'] = 14
frames = []
for i in range(5):
    r = pd.read_csv('../data/training/Flint'+str(i+1)+'_train.csv')
    r['rater'] = i
    frames.append(r)
l = pd.concat(frames) #DataFrame.append was removed in pandas 2.0
l = l.fillna('missing')
l['label'] = l.c.replace({',.*':'','missing':10},regex=True).astype(int) #removes multiple labels
#get pairwise kappas
from itertools import combinations
from statsmodels.stats.inter_rater import fleiss_kappa
from statsmodels.stats.inter_rater import aggregate_raters
kappa = []
for r1,r2 in combinations(range(5), 2):
    rr = l[l.rater==r1].merge(l[l.rater==r2],on='text')[['label_x','label_y']]
    k = fleiss_kappa(aggregate_raters(rr,n_cat=11)[0])
    kappa.append(('r'+str(r1),'r'+str(r2),k))
    kappa.append(('r'+str(r2),'r'+str(r1),k)) #(r2,r1,k)
a = pd.DataFrame(kappa).pivot(index=0,columns=1,values=2) #pairwise inter-rater fleiss-kappa
a.index.name = None
a.columns.name = None
plt.figure(num=None, figsize=(6, 4), facecolor='w', edgecolor='w')
labels = ['No blame','MI Governor','POTUS','Flint Mayor',
'EPA','Emergency M.','Republicans','Democrats','Government','Other indiv.', 'Unsure']
cnt = [len(l[l.c.str.contains(str(i))]) for i in range(10)] #count of each label
cnt.append(len(l[l.c.str.contains('missing')]))
ax = plt.subplot()
ax.margins(0, 0)
colors = '#777777 #E24A33 #348ABD #348ABD #348ABD #E24A33 #E24A33 #348ABD #FBC15E #8EBA42 #FFB5B8'.split()
#[color['color'] for color in list(plt.rcParams['axes.prop_cycle'])]
ax.barh(range(len(cnt)),cnt,tick_label=labels,align='center',color=colors)
#ax.set(xlabel='Manually coded tweets'); #title='Attribution of Blame/Responsibility',
#ax.grid(color='grey', linestyle='dotted', linewidth=0.5)
plt.axes([.4, .33, .55, .55])
sns.heatmap(a,annot=True,vmin=0,vmax=1,cmap='RdBu_r',annot_kws={'size':12})
#ax.get_figure().savefig('../figs/coders.pdf',dpi=150,bbox_inches='tight')
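To make the pairwise agreement numbers in the heatmap easier to interpret, here is a minimal pure-Python sketch of Fleiss' kappa for the special case used above, where each item is coded by exactly two raters. It is not the statsmodels implementation; `fleiss_kappa_two_raters` is a hypothetical helper and the label lists are invented.

```python
from collections import Counter

def fleiss_kappa_two_raters(labels_a, labels_b):
    """Fleiss' kappa for items that were each coded by exactly two raters."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError('need two equally long, non-empty label lists')
    n_items = len(labels_a)
    # Observed agreement: with two raters an item scores 1 if they agree, else 0.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n_items
    # Expected agreement from the pooled category proportions of both raters.
    pooled = Counter(labels_a) + Counter(labels_b)
    p_expected = sum((c / (2 * n_items)) ** 2 for c in pooled.values())
    if p_expected == 1:
        raise ValueError('kappa is undefined when only one category is used')
    return (p_observed - p_expected) / (1 - p_expected)

# Invented codes for four tweets: the raters agree on three of them.
rater_1 = [1, 1, 0, 0]
rater_2 = [1, 0, 0, 0]
kappa = fleiss_kappa_two_raters(rater_1, rater_2)
print(round(kappa, 4))  # 0.4667
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which is why the heatmap above is scaled from 0 to 1.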
#Table 1
df1 = pd.read_csv('../data/us-city-populations.csv',usecols=['CityST','2000','2010','LAT','LON','County_Name'])
df2 = pd.read_csv('../data/city_file.csv',dtype={'lat':str,'lon':str})
df2['CityST'] = df2.city + ', ' + df2.state
cities = df1.merge(df2, on = 'CityST', how = 'inner')
cities['latlon'] = cities[['lat','lon']].apply(tuple, axis=1)
cnt = pd.DataFrame(df.groupby(by='latlon').size().reset_index().rename(columns={0:'cnt'}))
cities = cities.merge(cnt,on='latlon',how='inner').rename(columns={'2010':'cpop','County_Name':'county'})
cities = cities[cities.cnt>=3]
cities.loc[cities.cpop.isnull(),'cpop'] = cities[cities.cpop.isnull()]['2000']
cities = cities.sort_values('cnt',ascending=False).reset_index().drop(columns=['index','LAT','LON','2000'])
cities.cpop = cities.cpop.astype(int)
#cities.to_csv('data/cities.csv',index=False)
cities.head(10) #tweet counts without normalization
fil = cities[cities.cpop>88].copy()
fil['normalized'] = fil.cnt * 1000 / fil.cpop
fil = fil[fil.normalized>=1]
fil.sort_values('normalized',ascending=False).head(10) #normalized
city10 = fil.sort_values('normalized',ascending=False).head(10).reset_index()
city10 = city10.rename(columns={'CityST':'Cities'})
cofil = fil.groupby(['county','state']).sum()
cofil.normalized = cofil.cnt / np.sqrt(cofil.cpop)
county10 = cofil.sort_values(by='normalized',ascending=False).head(10).reset_index()
county10['Counties'] = county10.county +', '+county10.state
cc = pd.concat([city10.Cities,county10.Counties],axis=1)
cc.index += 1
print(cc.to_latex())
\begin{tabular}{lll}
\toprule
{} & Cities & Counties \\
\midrule
1 & Flint, MI & Genesee, MI \\
2 & Gaylord, MI & Dist Columbia, DC \\
3 & Grand Blanc, MI & Otsego, MI \\
4 & Mount Morris, MI & Wayne, MI \\
5 & Bloomfield Hills, MI & Ingham, MI \\
6 & Lansing, MI & Washtenaw, MI \\
7 & Sedona, AZ & Multiple, GA \\
8 & Davison, MI & Kent, MI \\
9 & Traverse City, MI & Coconino, AZ \\
10 & Ann Arbor, MI & Cook, IL \\
\bottomrule
\end{tabular}
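The two rankings in Table 1 rest on different normalizations in the code above: cities are ranked by tweets per 1,000 residents, while counties are ranked by the tweet count divided by the square root of the summed population of their matched cities. The sketch below restates both formulas on invented numbers; the function names are hypothetical.

```python
from math import sqrt

def tweets_per_thousand(tweet_count, population):
    """City-level rate: geocoded tweets per 1,000 residents."""
    return tweet_count * 1000 / population

def tweets_per_sqrt_population(tweet_count, population):
    """County-level score: tweet count divided by the square root of population."""
    return tweet_count / sqrt(population)

# Invented counts and populations, for illustration only.
print(tweets_per_thousand(150, 100000))        # 1.5
print(tweets_per_sqrt_population(150, 10000))  # 1.5
```

The square-root form penalizes large populations less than a per-capita rate does, which may explain why populous counties such as Cook, IL still appear in the county column.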
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
snyder = df.text.str.contains('governor|nyder|onetoughnerd',case=False)
EM = df.text.str.contains('mgr|manager|Darnell|Earley|Kurtz',case=False)
mayor = df.text.str.contains('Dayne|Walling|ayor',case=False)
print([len(df[x]) for x in [snyder,EM,mayor]]) #Footnote 10.
[97577, 6028, 11609]
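The three masks above are plain case-insensitive substring searches. The following sketch reproduces them with the standard `re` module on a few invented tweets, mainly to show how a tweet can match several officials at once and how a broad keyword can produce false positives; `mentioned` is a hypothetical helper.

```python
import re

# Same patterns as above; re.IGNORECASE mirrors case=False.
PATTERNS = {
    'snyder': re.compile('governor|nyder|onetoughnerd', re.IGNORECASE),
    'EM': re.compile('mgr|manager|Darnell|Earley|Kurtz', re.IGNORECASE),
    'mayor': re.compile('Dayne|Walling|ayor', re.IGNORECASE),
}

def mentioned(text):
    """Return the set of officials whose keywords occur anywhere in the text."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}

# Invented example tweets.
print(mentioned('Gov. Snyder must resign'))
print(mentioned('The mayor and the emergency manager knew'))
print(mentioned('My campaign manager loves Flint'))  # matches 'EM': a false positive
```

Tweets that match more than one official are ambiguous targets for sentiment scoring, which is presumably why the later cells keep only tweets that mention the governor or the mayor exclusively.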
from matplotlib import animation,font_manager
import matplotlib.pyplot as plt
from html import unescape
import os
plt.rcParams['savefig.dpi']=150
plt.rcParams['animation.html'] = 'html5'
fig, ax = plt.subplots(figsize=(6, 1))
ax.set_axis_off()
plt.subplots_adjust(left=0.1, right=0.9, top=0.9, bottom=0.1)
prop = font_manager.FontProperties(fname='Quivira.otf') # 'Symbola.ttf'
text = ax.text(.5, .5, '', fontsize=11, va='center', ha='center', wrap=True, fontproperties = prop)
txt = list(g.head(30).text) #g is a pandas dataframe
def animate(i):
    text.set_text('('+str(i+1)+') '+unescape(txt[i]))
    return (text,)
anim = animation.FuncAnimation(fig, animate, frames=len(txt), interval=2000, blit=True)
anim.save('top30.mp4') #matplotlib can save as mp4, but not as gif yet.
os.system("convert -delay 200 top30.mp4 top30.gif") #imagemagick's convert
anim #eye candy for the presentation :-)
# Figure 3 (new)
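frames = []
for i in range(5):
    r = pd.read_csv('../data/training/Flint'+str(i+1)+'_train.csv')
    r['rater'] = i
    frames.append(r)
c = pd.concat(frames) #DataFrame.append was removed in pandas 2.0
c = c.dropna()
print(len(c[c.c.str.contains('6')]),len(c[c.c.str.contains('7')]))
r = df.screen_name[df.text.isin(c[c.c.str.contains('6')].text)] #users who posted tweets coded 6 (Republicans)
d = df.screen_name[df.text.isin(c[c.c.str.contains('7')].text)] #users who posted tweets coded 7 (Democrats)
# snyderonly and mayoronly are built in the cell that follows this one; run that cell first
m = mayoronly[mayoronly.cmpnd!=0]
s = snyderonly[snyderonly.cmpnd!=0]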
#matplotlib.style.use('fivethirtyeight')
matplotlib.style.use('ggplot')
plt.rcParams['axes.facecolor']='w'
plt.rcParams['savefig.facecolor']='w'
matplotlib.rcParams['xtick.labelsize'] = 16
matplotlib.rcParams['ytick.labelsize'] = 16
matplotlib.rcParams['axes.titlesize'] = 18
co = {'color':'black'}
ma = {'color':'black','linestyle':'-'}
boxprops = dict(linestyle='-', color='black')
f, ax = plt.subplots(1, 2, sharey=True,figsize=(8,3))
titles = ['Governor','Mayor']
for i,a in enumerate([s,m]):
    bp = ax[i].boxplot([a[a.screen_name.isin(d)].cmpnd,a[a.screen_name.isin(r)].cmpnd], patch_artist=True,
                       whiskerprops=co,capprops=co,medianprops=ma,boxprops=boxprops,labels=['Blaming R','Blaming D'])
    for box, color in zip(bp['boxes'], ['#348ABD','#E24A33']):
        box.set_color('black')
        box.set_facecolor(color)
    ax[i].set_title(titles[i],y=.9)
    ax[i].yaxis.grid(True, linestyle='-', which='major', color='lightgrey', alpha=0.5)
ax[0].set_ylabel('Sentiment score',fontsize=18);
f.savefig('../figs/box-partisanship.pdf',dpi=150,bbox_inches='tight')
62 24
snyderonly = df[snyder&~mayor&~EM].copy()
mayoronly=df[mayor&~snyder&~EM].copy()
a = pd.DataFrame(list(snyderonly.text.apply(sid.polarity_scores)))
snyderonly = pd.concat([snyderonly.reset_index(),a.rename(columns={'compound':'cmpnd'})],axis=1)
a = pd.DataFrame(list(mayoronly.text.apply(sid.polarity_scores)))
mayoronly = pd.concat([mayoronly.reset_index(),a.rename(columns={'compound':'cmpnd'})],axis=1)
from scipy.stats import ks_2samp
from math import sqrt
c_a = 1.95 #coefficient c_a is 1.36 for alpha 0.05 and 1.95 for alpha 0.001
for i,a in enumerate([s,m]):
    print(ks_2samp(a[a.screen_name.isin(d)].cmpnd,a[a.screen_name.isin(r)].cmpnd))
    n1 = len(a[a.screen_name.isin(d)])
    n2 = len(a[a.screen_name.isin(r)])
    print('Critical value D_a:',c_a*sqrt((n1+n2)/(n1*n2)))
Ks_2sampResult(statistic=0.36016301579215487, pvalue=3.4967160958746161e-06)
0.27806837264879597
Ks_2sampResult(statistic=0.33333333333333337, pvalue=0.71191004965601257)
1.0148121747397396
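In the output above, the governor subset has a KS statistic of about 0.360, which exceeds its critical value of about 0.278, whereas the mayor subset (0.333 against 1.015) does not. The hard-coded coefficient `c_a` comes from the standard large-sample approximation for the two-sample test; the sketch below derives it from the significance level instead of looking it up. The function names and the sample sizes in the last line are hypothetical.

```python
from math import log, sqrt

def c_alpha(alpha):
    """Coefficient c(alpha) of the large-sample two-sample KS critical value."""
    return sqrt(-0.5 * log(alpha / 2))

def ks_critical_value(n1, n2, alpha):
    """Reject equality of distributions at level alpha if D exceeds this value."""
    return c_alpha(alpha) * sqrt((n1 + n2) / (n1 * n2))

print(round(c_alpha(0.05), 2), round(c_alpha(0.001), 2))  # 1.36 1.95
# Hypothetical sample sizes, not the ones behind the output above.
print(ks_critical_value(100, 100, 0.05))
```

Because the critical value shrinks with larger samples, the very small mayor subset yields a threshold above 1, which no KS statistic can reach.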
#now the contagion "experiment"
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
import seaborn as sns
#matplotlib.style.use('fivethirtyeight')
matplotlib.style.use('ggplot')
plt.rcParams['axes.facecolor']='w'
plt.rcParams['savefig.facecolor']='w'
plt.rcParams['savefig.dpi']=227 #DPI of my 13.3 MacBook Pro Retina
f = df[df.latlon == gc.geocode("Flint, MI")].groupby('screen_name').size()
f.sort_values(ascending=False).plot(ylim=(0,30),linestyle="None",marker='.',figsize=(10,5))
f2 = list(f[(f>2)&(f<20)].index.values)
print(len(f2)) #Natural selection: Flinters who tweeted 2<x<20 times
262
import twitter as t
from functools import partial
import sys, time
#friends = {}
#auth = t.oauth.OAuth("", "", "", "")
#twitter_api = t.Twitter(auth=auth)
# friends, twitter_api, get_friends and j are defined elsewhere (not shown here)
for i,u in enumerate(f2):
    print(i+j,'trying',u)
    try:
        friends[u] = get_friends(u,twitter_api)
    except Exception as e:
        print(e)
        continue
trying EzE_2o11_ trying FGCofC "['id_str' 'screen_name' 'name' 'location' 'description' 'created_at'\n 'friends_count' 'followers_count' 'statuses_count' 'favourites_count'] not in index" trying FYINation
Encountered 404 Error (Not Found)
trying FaithGiddings2 trying FiggaDaKID trying FitLifeAmber trying Flint4Bernie trying FlintCoalition trying FlintDDA trying FlintFWDProject trying FlintHandmade trying FlintHorrorCon trying FlintLocal432 trying FlintMI1 trying FlintPoliceOps
Encountered 429 Error (Rate Limit Exceeded) Retrying in 15 minutes...ZzZ...
trying FlintRoguesRFC
...ZzZ...Awake now and trying again.
trying FlintStitch trying FlintWaterDoc trying FlintWaterPrjct trying Flint_NC trying Flintrainbowmom trying FlintstoneQue10 "['id_str' 'screen_name' 'name' 'location' 'description' 'created_at'\n 'friends_count' 'followers_count' 'statuses_count' 'favourites_count'] not in index" trying ForgeFlint
Encountered 404 Error (Not Found)
trying FreeChoppa trying GCHD_MI trying GEARup2LEAD trying GREGJOSLIN trying Gnicole15 Twitter sent status 404 for URL: 1.1/users/lookup.json using parameters: (OAuth parameters redacted) details: {'errors': [{'message': 'No user matches for specified terms.', 'code': 17}]} trying GrFlintHealthCo trying Hashtag_Flint
Encountered 429 Error (Rate Limit Exceeded) Retrying in 15 minutes...ZzZ...
trying HipHopMarauder
...ZzZ...Awake now and trying again.
trying HoodRichCain "['id_str' 'screen_name' 'name' 'location' 'description' 'created_at'\n 'friends_count' 'followers_count' 'statuses_count' 'favourites_count'] not in index" trying HurleyMedical
Encountered 404 Error (Not Found)
trying ILikeFootball7 "['id_str' 'screen_name' 'name' 'location' 'description' 'created_at'\n 'friends_count' 'followers_count' 'statuses_count' 'favourites_count'] not in index" trying ITJBrown
Encountered 404 Error (Not Found)
trying JM3_4_INT "['id_str' 'screen_name' 'name' 'location' 'description' 'created_at'\n 'friends_count' 'followers_count' 'statuses_count' 'favourites_count'] not in index" trying JackTriple91
Encountered 404 Error (Not Found)
trying JakeCarah trying JamirWorld_ trying JayeMonet trying Jaylen_22_ trying JessycaMathews trying Jmaddy31 "['id_str' 'screen_name' 'name' 'location' 'description' 'created_at'\n 'friends_count' 'followers_count' 'statuses_count' 'favourites_count'] not in index" trying JoeBoo3
Encountered 404 Error (Not Found)
trying JonConnorMusic
Encountered 429 Error (Rate Limit Exceeded) Retrying in 15 minutes...ZzZ...
trying Kasey_Posa
...ZzZ...Awake now and trying again.
trying KearsleyAD trying KennediDiane trying KetteringU trying KidsPriority trying Kiki2720 trying KodaPayne trying LGilkey3 trying LHonestAvery trying LWVFA trying Leonard_Solano trying Leonspencer1 trying Live4Gr8ness trying Lucci2x_ trying MDOC_FOA_R6
Encountered 429 Error (Rate Limit Exceeded) Retrying in 15 minutes...ZzZ...
trying MIGlutenFreeGal
...ZzZ...Awake now and trying again.
trying MMLakers trying MVincent810 trying MakMichigan trying MarseilleAllen trying MattF810 trying McLovinsTwin trying MindonGlory "['id_str' 'screen_name' 'name' 'location' 'description' 'created_at'\n 'friends_count' 'followers_count' 'statuses_count' 'favourites_count'] not in index" trying MistaFLintastic
Encountered 404 Error (Not Found)
trying MonaHannaA trying MoneyBall_Sam "['id_str' 'screen_name' 'name' 'location' 'description' 'created_at'\n 'friends_count' 'followers_count' 'statuses_count' 'favourites_count'] not in index" trying MoneyMusicHeart
Encountered 404 Error (Not Found)
trying Motivated_Icon trying MrAllinger trying NeuvooFlint "['id_str' 'screen_name' 'name' 'location' 'description' 'created_at'\n 'friends_count' 'followers_count' 'statuses_count' 'favourites_count'] not in index" trying NotSafeToDrink
Encountered 401 Error (Not Authorized) Encountered 429 Error (Rate Limit Exceeded) Retrying in 15 minutes...ZzZ... ...ZzZ...Awake now and trying again.
trying NotYourAvgSarah trying ProjectFWC trying QGroce trying RandyConat trying ReactionDJA trying Region6PTAC trying ReinvestFlint trying Rev__Church_Boy trying Rhas_Dukes trying RickThompsonTCC trying Rob810 trying RobinInFlint trying RotatinMy_Tires trying SamandFamilyy
Encountered 429 Error (Rate Limit Exceeded) Retrying in 15 minutes...ZzZ...
trying SantinoGuerra
...ZzZ...Awake now and trying again.
trying Shemy trying Simply_Jonny13 trying SkytzoBeatz trying SloanMuseum trying SlopeTastic trying Spectacle_tv trying Stoneywoney23 trying Supernova1177 trying TGreene32 trying THFtweets trying TeacherBeard trying TeamRevelationM trying TeeTee45thST trying ThaLadiesMan
Encountered 429 Error (Rate Limit Exceeded) Retrying in 15 minutes...ZzZ...
trying TheMarcusJones
...ZzZ...Awake now and trying again.
trying The_X_Ray trying ThisIsSkyy trying ThomasJean trying TidesOfTheSun trying TreTaylorMusic trying TrendingFandG "['id_str' 'screen_name' 'name' 'location' 'description' 'created_at'\n 'friends_count' 'followers_count' 'statuses_count' 'favourites_count'] not in index" trying Trillustrator
Encountered 404 Error (Not Found)
trying Tye_Mf_Allen trying UMFlint trying UWGeneseeCo trying UncleDepri trying VideoManJamal trying VoodooHoney trying WCRZFM
Encountered 429 Error (Rate Limit Exceeded) Retrying in 15 minutes...ZzZ...
trying WEYIKyle
...ZzZ...Awake now and trying again. Encountered 404 Error (Not Found)
"['id_str' 'screen_name' 'name' 'location' 'description' 'created_at'\n 'friends_count' 'followers_count' 'statuses_count' 'favourites_count'] not in index" trying WeAreFlint trying Z927WDZZ trying _AfterSummer trying _HighlightReel2 trying _ImFvmousHoe trying _JustSwangin trying _RAMONYEA trying _RJack1_ trying __DKA trying __JayBenz "['id_str' 'screen_name' 'name' 'location' 'description' 'created_at'\n 'friends_count' 'followers_count' 'statuses_count' 'favourites_count'] not in index" trying __mere02
Encountered 404 Error (Not Found)
trying _itsreallyher_ trying _jayyd0ll "['id_str' 'screen_name' 'name' 'location' 'description' 'created_at'\n 'friends_count' 'followers_count' 'statuses_count' 'favourites_count'] not in index" trying _ruizadrianna
Encountered 404 Error (Not Found) Encountered 429 Error (Rate Limit Exceeded) Retrying in 15 minutes...ZzZ...
trying _tiffmonique
...ZzZ...Awake now and trying again.
trying agerald68 trying ajl18_ trying alliwant_isyou "['id_str' 'screen_name' 'name' 'location' 'description' 'created_at'\n 'friends_count' 'followers_count' 'statuses_count' 'favourites_count'] not in index" trying alynb76
Encountered 401 Error (Not Authorized)
trying andrewspeaight trying arlrbrtsn trying banana1015radio trying bikermom2005 trying br14n_70 trying cacavaliers trying callmetavis trying car_rocky trying cdayflint trying cityofflintpr "['id_str' 'screen_name' 'name' 'location' 'description' 'created_at'\n 'friends_count' 'followers_count' 'statuses_count' 'favourites_count'] not in index" trying deadlocke96
Encountered 401 Error (Not Authorized) Encountered 429 Error (Rate Limit Exceeded) Retrying in 15 minutes...ZzZ... ...ZzZ...Awake now and trying again.
trying dessi_WORLD trying ernestoalaniz trying fieticeira "['id_str' 'screen_name' 'name' 'location' 'description' 'created_at'\n 'friends_count' 'followers_count' 'statuses_count' 'favourites_count'] not in index" trying firsttrinitymbc
Encountered 404 Error (Not Found)
trying fixflintfirst1 trying flintcitychurch trying flintlibrary trying flintsnewstalk trying flinttownboy trying fuck__Yu trying get_it_cocoa "['id_str' 'screen_name' 'name' 'location' 'description' 'created_at'\n 'friends_count' 'followers_count' 'statuses_count' 'favourites_count'] not in index" trying gggouten
Encountered 401 Error (Not Authorized)
trying goodboi021 trying gr8_bambino
Encountered 429 Error (Rate Limit Exceeded) Retrying in 15 minutes...ZzZ...
trying graysam13
...ZzZ...Awake now and trying again.
trying grigg_nancy trying grimringler trying ididsocnu trying itsA_Zthang trying itstravis trying jackfrizza trying janathanrobinso trying jaymarcellus trying jem181818 "['id_str' 'screen_name' 'name' 'location' 'description' 'created_at'\n 'friends_count' 'followers_count' 'statuses_count' 'favourites_count'] not in index" trying jesssssicab
Encountered 404 Error (Not Found)
trying kevinastarnes trying keysasmith trying kjchristian Twitter sent status 404 for URL: 1.1/users/lookup.json using parameters: (OAuth parameters redacted) details: {'errors': [{'message': 'No user matches for specified terms.', 'code': 17}]} trying kurtneiswender
Encountered 429 Error (Rate Limit Exceeded) Retrying in 15 minutes...ZzZ...
trying ladytdiva1187
...ZzZ...Awake now and trying again.
trying laylaymitchell_ trying leftinflint trying magmagcity trying mariano411 trying megisomeso trying mikenstephanie trying mona_haydar trying mottdean trying najladw trying nienie_strangep trying novaprime79 trying orochiburenso trying overall3171 trying p85rice
Encountered 429 Error (Rate Limit Exceeded) Retrying in 15 minutes...ZzZ...
trying pageiv
...ZzZ...Awake now and trying again.
trying patandaj trying peoplevssnyder trying planetofboom trying rappolee trying ronfonger trying rubosuave "['id_str' 'screen_name' 'name' 'location' 'description' 'created_at'\n 'friends_count' 'followers_count' 'statuses_count' 'favourites_count'] not in index" trying seonthompson
Encountered 404 Error (Not Found)
trying sexynacole trying sharrington2016 trying simply1m3 "['id_str' 'screen_name' 'name' 'location' 'description' 'created_at'\n 'friends_count' 'followers_count' 'statuses_count' 'favourites_count'] not in index" trying smhodges4
Encountered 401 Error (Not Authorized)
trying standupflint trying stevemintline trying swaydalyricist
Encountered 429 Error (Rate Limit Exceeded) Retrying in 15 minutes...ZzZ...
trying tammy_loren
...ZzZ...Awake now and trying again.
trying tdgalbraith trying teenagev0w trying tenacitybrewing trying theTOMTOMSmusic trying whitingflint trying xojassx trying yrnmg trying zachizz1476
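The log above shows the crawl pausing for 15 minutes whenever the API answers with a 429 (rate limit exceeded) and then resuming. The helper that does this is not shown in the notebook, so the following is only a sketch of that wait-and-retry pattern under stated assumptions: `RateLimitError`, `call_with_retry`, and `fake_fetch` are hypothetical names, and the sleep function is injectable so the example runs instantly.

```python
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 response from the API client."""

def call_with_retry(fetch, user, wait_seconds=15 * 60, max_attempts=3, sleep=time.sleep):
    """Call fetch(user), sleeping and retrying whenever the rate limit is hit."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(user)
        except RateLimitError:
            if attempt == max_attempts:
                raise
            sleep(wait_seconds)  # wait for the rate-limit window to reset

# Demonstration with a fake fetcher that is rate limited on its first call.
calls = []
def fake_fetch(user):
    calls.append(user)
    if len(calls) == 1:
        raise RateLimitError('429')
    return ['friend_of_' + user]

naps = []  # records requested sleep durations instead of actually sleeping
result = call_with_retry(fake_fetch, 'example_user', sleep=naps.append)
print(result, naps)
```

Errors that will not resolve by waiting, such as the 404 (deleted account) and 401 (protected account) responses in the log, are better skipped than retried.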
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
ff = pd.concat(friends.values(), keys=friends.keys(),names=['flinter'])
#ff.to_csv('../data/ff.csv')
fr = ff.screen_name.unique()
sents = pd.DataFrame(list(df[df.screen_name.isin(fr)].text.apply(sid.polarity_scores)))['compound']
fs = pd.concat([df[df.screen_name.isin(fr)].screen_name.reset_index(), sents],axis=1,ignore_index=True)
fs.columns = ['twid','screen_name','cmpnd']
ffdf = pd.read_csv('../data/ffdf.csv')
ffdf.columns = ['screen_name'] + list(ffdf.columns[1:])
#ffdf.head() #fsent is missing
print(len(ffdf[ffdf.usent>0]),len(ffdf[ffdf.usent<0]))
101 115
udf = {}
for u in ffdf.index.values:
    usent = pd.DataFrame(list(df[df.screen_name == u].text.apply(sid.polarity_scores)))['compound']
    udf[u] = {'utwcnt':len(usent),'usent':usent.mean()}
usdf = pd.DataFrame.from_dict(udf,orient='index')
usdf
screen_name | utwcnt | usent |
---|---|---|
1Goal1Passion | 7 | 0.096586 |
1_namillionme37 | 10 | -0.215550 |
810DIRTVILLE | 4 | 0.000000 |
AD16Gaming | 6 | 0.128550 |
AdamBiggers81 | 8 | 0.315200 |
AliciaRoose | 6 | 0.078383 |
AllahnaSteve | 3 | -0.467833 |
AmandaEmeryNews | 19 | -0.415800 |
Americans4Flint | 5 | 0.265980 |
AmourNiyy_ | 4 | 0.411850 |
AmylHovey | 8 | 0.001463 |
AnthonyWJRT | 14 | -0.036036 |
AqueousEye | 6 | 0.552383 |
Area72ENT | 5 | 0.362420 |
AsmoovMF | 5 | -0.255680 |
DJJayBig | 5 | 0.123320 |
DaHoodsOrnament | 12 | -0.212075 |
DannyDeadhack | 9 | 0.338244 |
DerekDohrman | 3 | -0.026333 |
DesireeDuell | 5 | 0.478120 |
Domanyce_Da_1 | 16 | 0.069812 |
DortEventCenter | 5 | 0.519600 |
DoubtingTomFYI | 14 | 0.057836 |
DylanLuna1931 | 10 | 0.049650 |
EmilyDoerr | 7 | 0.379657 |
EricRob229 | 5 | -0.169580 |
EzE_2o11_ | 3 | 0.031433 |
FYINation | 10 | -0.040650 |
FaithGiddings2 | 5 | -0.126800 |
FiggaDaKID | 14 | -0.187557 |
... | ... | ... |
mona_haydar | 7 | -0.082043 |
mottdean | 4 | -0.037375 |
najladw | 4 | 0.075150 |
nienie_strangep | 4 | -0.157050 |
novaprime79 | 13 | 0.234992 |
orochiburenso | 5 | -0.125580 |
overall3171 | 8 | -0.020588 |
p85rice | 11 | -0.166100 |
pageiv | 3 | 0.277000 |
patandaj | 13 | -0.118046 |
peoplevssnyder | 4 | 0.199250 |
planetofboom | 19 | 0.049321 |
rappolee | 9 | -0.013789 |
ronfonger | 16 | -0.342650 |
seonthompson | 3 | 0.743100 |
sexynacole | 6 | 0.711350 |
sharrington2016 | 3 | 0.320367 |
smhodges4 | 9 | -0.195844 |
standupflint | 12 | 0.140400 |
stevemintline | 8 | 0.102813 |
swaydalyricist | 5 | 0.178440 |
tammy_loren | 19 | 0.023889 |
tdgalbraith | 5 | 0.000000 |
teenagev0w | 3 | -0.290533 |
tenacitybrewing | 3 | -0.058700 |
theTOMTOMSmusic | 4 | 0.281325 |
whitingflint | 4 | 0.463100 |
xojassx | 6 | -0.029650 |
yrnmg | 3 | 0.323100 |
zachizz1476 | 5 | 0.066800 |
223 rows × 2 columns
fdf = {}
for k,v in friends.items():
    fsk = fs[fs.screen_name.isin(v.screen_name)]
    fsent = fsk.cmpnd.mean()
    fdf[k] = {'ftwcnt':len(fsk),'dfcnt':len(fsk.screen_name.unique()),'tfcnt':len(v),'fsent':fsent}
ffdf = pd.DataFrame.from_dict(fdf,orient='index')
ffdf = ffdf.join(usdf)
#ffdf.to_csv('../data/ffdf.csv')
ffdf
screen_name | fsent | tfcnt | ftwcnt | dfcnt | utwcnt | usent |
---|---|---|---|---|---|---|
1Goal1Passion | -0.118901 | 505 | 2330 | 124 | 7 | 0.096586 |
1_namillionme37 | -0.154041 | 634 | 458 | 37 | 10 | -0.215550 |
810DIRTVILLE | -0.050515 | 987 | 2607 | 86 | 4 | 0.000000 |
AD16Gaming | 0.000000 | 53 | 1 | 1 | 6 | 0.128550 |
AdamBiggers81 | -0.082383 | 1327 | 1219 | 154 | 8 | 0.315200 |
AliciaRoose | -0.096432 | 918 | 319 | 44 | 6 | 0.078383 |
AllahnaSteve | -0.081462 | 70 | 50 | 19 | 3 | -0.467833 |
AmandaEmeryNews | -0.103015 | 468 | 3682 | 197 | 19 | -0.415800 |
Americans4Flint | -0.097908 | 354 | 3374 | 90 | 5 | 0.265980 |
AmourNiyy_ | -0.043152 | 265 | 148 | 41 | 4 | 0.411850 |
AmylHovey | -0.073058 | 422 | 2684 | 166 | 8 | 0.001463 |
AnthonyWJRT | -0.127652 | 260 | 2542 | 94 | 14 | -0.036036 |
AqueousEye | -0.116753 | 85 | 514 | 24 | 6 | 0.552383 |
Area72ENT | -0.137371 | 343 | 214 | 39 | 5 | 0.362420 |
AsmoovMF | -0.068034 | 191 | 282 | 57 | 5 | -0.255680 |
DJJayBig | -0.045407 | 4900 | 943 | 343 | 5 | 0.123320 |
DaHoodsOrnament | -0.081913 | 3396 | 12086 | 157 | 12 | -0.212075 |
DannyDeadhack | -0.060800 | 1078 | 1591 | 146 | 9 | 0.338244 |
DerekDohrman | -0.097318 | 335 | 579 | 48 | 3 | -0.026333 |
DesireeDuell | -0.103413 | 2000 | 21260 | 364 | 5 | 0.478120 |
Domanyce_Da_1 | -0.072672 | 1090 | 521 | 98 | 16 | 0.069812 |
DortEventCenter | -0.070818 | 511 | 2093 | 127 | 5 | 0.519600 |
DoubtingTomFYI | -0.189894 | 1107 | 6066 | 454 | 14 | 0.057836 |
DylanLuna1931 | -0.068016 | 766 | 1890 | 139 | 10 | 0.049650 |
EmilyDoerr | -0.067301 | 1139 | 3705 | 263 | 7 | 0.379657 |
EricRob229 | -0.171226 | 205 | 325 | 18 | 5 | -0.169580 |
EzE_2o11_ | -0.109453 | 446 | 327 | 23 | 3 | 0.031433 |
FYINation | -0.128731 | 51 | 193 | 25 | 10 | -0.040650 |
FaithGiddings2 | -0.092667 | 246 | 1734 | 60 | 5 | -0.126800 |
FiggaDaKID | -0.055347 | 752 | 600 | 112 | 14 | -0.187557 |
... | ... | ... | ... | ... | ... | ... |
mona_haydar | -0.093202 | 414 | 1482 | 107 | 7 | -0.082043 |
mottdean | -0.097917 | 363 | 669 | 34 | 4 | -0.037375 |
najladw | -0.121085 | 904 | 1378 | 117 | 4 | 0.075150 |
nienie_strangep | -0.222324 | 419 | 715 | 67 | 4 | -0.157050 |
novaprime79 | -0.291021 | 63 | 62 | 8 | 13 | 0.234992 |
orochiburenso | -0.020238 | 344 | 8 | 7 | 5 | -0.125580 |
overall3171 | -0.168117 | 448 | 407 | 65 | 8 | -0.020588 |
p85rice | -0.085477 | 1108 | 305 | 60 | 11 | -0.166100 |
pageiv | -0.114743 | 613 | 3945 | 150 | 3 | 0.277000 |
patandaj | -0.061967 | 944 | 803 | 85 | 13 | -0.118046 |
peoplevssnyder | -0.029003 | 34 | 749 | 17 | 4 | 0.199250 |
planetofboom | -0.155146 | 161 | 383 | 23 | 19 | 0.049321 |
rappolee | -0.170460 | 432 | 283 | 36 | 9 | -0.013789 |
ronfonger | -0.098598 | 876 | 6638 | 263 | 16 | -0.342650 |
seonthompson | -0.140791 | 1381 | 811 | 109 | 3 | 0.743100 |
sexynacole | -0.083383 | 243 | 116 | 23 | 6 | 0.711350 |
sharrington2016 | -0.059061 | 104 | 1031 | 33 | 3 | 0.320367 |
smhodges4 | -0.105832 | 1143 | 6310 | 269 | 9 | -0.195844 |
standupflint | -0.086806 | 442 | 8344 | 285 | 12 | 0.140400 |
stevemintline | -0.088749 | 270 | 2122 | 104 | 8 | 0.102813 |
swaydalyricist | -0.137180 | 822 | 86 | 22 | 5 | 0.178440 |
tammy_loren | -0.074358 | 257 | 4216 | 113 | 19 | 0.023889 |
tdgalbraith | -0.113630 | 62 | 326 | 24 | 5 | 0.000000 |
teenagev0w | -0.064114 | 748 | 450 | 30 | 3 | -0.290533 |
tenacitybrewing | -0.061528 | 1202 | 2968 | 258 | 3 | -0.058700 |
theTOMTOMSmusic | -0.007381 | 595 | 296 | 59 | 4 | 0.281325 |
whitingflint | -0.040434 | 436 | 3165 | 151 | 4 | 0.463100 |
xojassx | -0.068348 | 1269 | 972 | 334 | 6 | -0.029650 |
yrnmg | -0.097200 | 1146 | 736 | 132 | 3 | 0.323100 |
zachizz1476 | 0.139836 | 27 | 25 | 5 | 5 | 0.066800 |
223 rows × 6 columns
len(fr),df[df.screen_name.isin(fr)].screen_name.nunique()
(122953, 8339)
print(len(ffdf[ffdf.usent>0]),len(ffdf[ffdf.usent<0]),len(ffdf[ffdf.fsent>0]),len(ffdf[ffdf.fsent<0]))
101 115 15 206
ffdf.corr(method='pearson') #.loc['fsent','usent'] = .16
| | fsent | tfcnt | ftwcnt | dfcnt | utwcnt | usent |
|---|---|---|---|---|---|---|
| fsent | 1.000000 | -0.041997 | -0.011399 | 0.044034 | -0.128727 | 0.161092 |
| tfcnt | -0.041997 | 1.000000 | 0.150289 | 0.588166 | 0.023542 | -0.076660 |
| ftwcnt | -0.011399 | 0.150289 | 1.000000 | 0.487535 | 0.185268 | -0.001390 |
| dfcnt | 0.044034 | 0.588166 | 0.487535 | 1.000000 | 0.094961 | 0.030336 |
| utwcnt | -0.128727 | 0.023542 | 0.185268 | 0.094961 | 1.000000 | -0.108274 |
| usent | 0.161092 | -0.076660 | -0.001390 | 0.030336 | -0.108274 | 1.000000 |
ffdf[ffdf.usent>0].fsent.mean(),ffdf[ffdf.usent<0].fsent.mean()
(-0.074941076829484934, -0.10539607280382769)
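The two means above come from splitting users by the sign of their own mean sentiment (`usent`) and averaging their friends' sentiment (`fsent`) within each group. A minimal sketch of that split, using made-up `(usent, fsent)` pairs rather than the real `ffdf` data (the function name and the records are hypothetical):

```python
from statistics import mean

# Hypothetical (usent, fsent) pairs: a user's own mean sentiment and the
# mean sentiment of that user's friends. These are NOT values from the dataset.
users = [(-0.40, -0.12), (-0.10, -0.09), (0.25, -0.05), (0.30, -0.08), (-0.20, -0.11)]

def friends_sentiment_by_group(records):
    """Return (cohort_mean, control_mean) of friends' sentiment.

    cohort  = users whose own sentiment is negative (usent < 0)
    control = users whose own sentiment is positive (usent > 0)
    Users with usent == 0 belong to neither group.
    """
    cohort = [f for u, f in records if u < 0]
    control = [f for u, f in records if u > 0]
    return mean(cohort), mean(control)

cohort_mean, control_mean = friends_sentiment_by_group(users)
print(cohort_mean, control_mean)
```

In this toy data, as in the notebook output, the friends of the negative group have the lower mean; the size of the gap says nothing about significance, which is what the KS test is for.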
colorm = dict(boxes='lightgreen', whiskers='black', medians='black', caps='black')
#ax=compare[['followers','population']].plot(kind='box', patch_artist=True, showfliers=False)
boxprops = dict(linestyle='-', color='black')
matplotlib.style.use('ggplot')
plt.rcParams['axes.facecolor']='w'
plt.rcParams['savefig.facecolor']='w'
matplotlib.rcParams['xtick.labelsize'] = 20
matplotlib.rcParams['ytick.labelsize'] = 18
matplotlib.rcParams['axes.titlesize'] = 14
co = {'color':'black'}
ma = {'color':'black','linestyle':'-'}
plt.figure(figsize=(9,3))
cohort = ffdf[ffdf.usent<0].fsent #ffdf[ffdf.fsent<0].usent
control= ffdf[ffdf.usent>0].fsent.dropna() #ffdf[ffdf.fsent>0].usent
print(cohort.mean(),control.mean())
bp = plt.boxplot([cohort,control],patch_artist=True, showfliers=False,
whiskerprops=co,capprops=co,medianprops=ma,boxprops=boxprops,labels=['Friends of the cohort','Friends of the control'])
ax = plt.gca()
for patch, color in zip(bp['boxes'], ['magenta','lightgreen']):
    patch.set_facecolor(color)
ax.set_ylabel('Sentiment score',fontsize=22)
ax.set_ylim(-.23,.08)
#plt.yticks(np.arange(-.6, .6, .1))
ax.yaxis.grid(True, linestyle='-', which='major', color='lightgrey', alpha=0.5)
ax.get_figure().savefig('../figs/contagion-exp2.pdf', bbox_inches='tight')
-0.105396072804 -0.0749410768295
from scipy.stats import ks_2samp
from math import sqrt
c_a = 1.36 #coefficient c_a is 1.36 for alpha 0.05 and 1.95 for alpha 0.001
print(ks_2samp(ffdf[ffdf.fsent<0].usent,ffdf[ffdf.fsent>0].usent))
n1 = len(ffdf[ffdf.fsent<0])
n2 = len(ffdf[ffdf.fsent>0])
print('Critical value D_a (ks statistic (D) should be greater than this):',c_a*sqrt((n1+n2)/(n1*n2)))
#that is the case for 95% confidence level: https://daithiocrualaoich.github.io/kolmogorov_smirnov/
Ks_2sampResult(statistic=0.37184466019417473, pvalue=0.030545168312102647) Critical value D_a (ks statistic (D) should be greater than this): 0.3637104720012413
from scipy.stats import ks_2samp
from math import sqrt
c_a = 1.36 #coefficient c_a is 1.36 for alpha 0.05 and 1.95 for alpha 0.001
print(ks_2samp(ffdf[ffdf.usent<0].fsent,ffdf[ffdf.usent>0].fsent))
n1 = len(ffdf[ffdf.usent<0])
n2 = len(ffdf[ffdf.usent>0])
print('Critical value D_a (ks statistic (D) should be greater than this):',c_a*sqrt((n1+n2)/(n1*n2)))
#that is the case for 95% confidence level: https://daithiocrualaoich.github.io/kolmogorov_smirnov/
Ks_2sampResult(statistic=0.20619888075764092, pvalue=0.01743249763074449) Critical value D_a (ks statistic (D) should be greater than this): 0.1854625286897552
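The constant `c_a` used in both cells is the rounded value of the large-sample coefficient c(α) = sqrt(-ln(α/2)/2). As a sketch, the critical value can be computed for any α instead of hard-coding 1.36; the helper name `ks_critical_value` is an assumption for illustration, and the group sizes are the ones printed earlier (115 users with `usent<0`, 101 with `usent>0`):

```python
from math import log, sqrt

def ks_critical_value(n1, n2, alpha=0.05):
    """Large-sample critical value D_alpha for the two-sample KS test."""
    c_alpha = sqrt(-0.5 * log(alpha / 2))  # about 1.36 for alpha = 0.05
    return c_alpha * sqrt((n1 + n2) / (n1 * n2))

# Group sizes reported above: 115 users with usent < 0, 101 with usent > 0.
d_crit = ks_critical_value(115, 101)
print(d_crit)

# The null hypothesis is rejected at level alpha when the KS statistic D exceeds d_crit.
print(0.20619888075764092 > d_crit)
```

The unrounded coefficient gives a critical value marginally below the 0.1855 obtained with 1.36, so the conclusion for this comparison is unchanged.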
from functools import partial
def get_friends(screen_name,twitter_api,limit=5000):
    # friends/ids returns the accounts that `screen_name` follows, up to 5000 ids per page
    get_friends_ids = partial(make_twitter_request,twitter_api.friends.ids, count=5000)
    ids = []
    cursor = -1
    while cursor != 0:
        response = get_friends_ids(screen_name=screen_name, cursor=cursor)
        if response is not None:
            ids += response['ids']
            cursor = response['next_cursor']
        # print('Fetched {0} total {1} ids for {2}. next_cursor: {3}'.format(
        #     len(ids), label, screen_name, cursor))
        if len(ids) >= limit or response is None:
            break
    return ids_to_snames(twitter_api,ids[:limit],screen_name=screen_name)
def ids_to_snames(twitter_api,fids,screen_name='tozcss'):
    # users/lookup accepts at most 100 user ids per request, so look them up in batches
    get_snames = partial(make_twitter_request,twitter_api.users.lookup)
    resp = []
    for i in range(1+(len(fids)-1)//100):
        batch = get_snames(user_id=fids[100*i:100*(i+1)])
        if batch is not None: # make_twitter_request returns None on 401 and 404 errors
            resp.extend(batch)
    header = ['id_str','screen_name', 'name', 'location', 'description', 'created_at',
              'friends_count','followers_count','statuses_count','favourites_count']
    df = pd.DataFrame.from_dict(resp)[header].set_index('id_str')
    return df
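The lookup above walks the id list in slices of 100 using `range(1+(len(fids)-1)//100)`. The same batching can be sketched with a small generator; `batches` and `fake_ids` are hypothetical names used only for illustration:

```python
def batches(items, size=100):
    """Yield consecutive slices of `items` holding at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

fake_ids = list(range(250))  # hypothetical user ids, not real accounts
sizes = [len(b) for b in batches(fake_ids)]
print(sizes)

# Same number of requests as the index arithmetic used in ids_to_snames.
print(len(sizes) == 1 + (len(fake_ids) - 1) // 100)
```

Stepping `range` by the batch size avoids the off-by-one reasoning in the `1+(n-1)//100` form and yields no batches for an empty list.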
import sys
import time
from urllib.error import URLError
from http.client import BadStatusLine
import twitter as t # assumed alias: the code below refers to t.api.TwitterHTTPError
def make_twitter_request(twitter_api_func, max_errors=10, *args, **kw):
    # A nested helper function that handles common HTTPErrors. Return an updated
    # value for wait_period if the problem is a 500 level error. Block until the
    # rate limit is reset if it's a rate limiting issue (429 error). Returns None
    # for 401 and 404 errors, which requires special handling by the caller.
    def handle_twitter_http_error(e, wait_period=2, sleep_when_rate_limited=True):
        if wait_period > 3600: # Seconds
            print('Too many retries. Quitting.',file=sys.stderr)
            raise e
        # See https://dev.twitter.com/docs/error-codes-responses for common codes
        if e.e.code == 401:
            print('Encountered 401 Error (Not Authorized)',file=sys.stderr)
            return None
        elif e.e.code == 404:
            print('Encountered 404 Error (Not Found)',file=sys.stderr)
            return None
        elif e.e.code == 429:
            print('Encountered 429 Error (Rate Limit Exceeded)',file=sys.stderr)
            if sleep_when_rate_limited:
                print("Retrying in 15 minutes...ZzZ...",file=sys.stderr)
                sys.stderr.flush()
                time.sleep(60*15 + 5)
                print('...ZzZ...Awake now and trying again.',file=sys.stderr)
                return 2
            else:
                raise e # Caller must handle the rate limiting issue
        elif e.e.code in (500, 502, 503, 504):
            print('Encountered',e.e.code,'Error. Retrying in',wait_period,'seconds',file=sys.stderr)
            time.sleep(wait_period)
            wait_period *= 1.5
            return wait_period
        else:
            raise e
    # End of nested helper function
    wait_period = 2
    error_count = 0
    while True:
        try:
            return twitter_api_func(*args, **kw)
        except t.api.TwitterHTTPError as e:
            error_count = 0
            wait_period = handle_twitter_http_error(e, wait_period)
            if wait_period is None:
                return
        except URLError as e:
            error_count += 1
            time.sleep(wait_period)
            wait_period *= 1.5
            print("URLError encountered. Continuing.",file=sys.stderr)
            if error_count > max_errors:
                print("Too many consecutive errors...bailing out.",file=sys.stderr)
                raise
        except BadStatusLine as e:
            error_count += 1
            time.sleep(wait_period)
            wait_period *= 1.5
            print("BadStatusLine encountered. Continuing.",file=sys.stderr)
            if error_count > max_errors:
                print("Too many consecutive errors...bailing out.",file=sys.stderr)
                raise
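The request wrapper above retries transient failures while multiplying the wait time by 1.5 after each one. A stripped-down sketch of that retry-with-backoff pattern, independent of the Twitter client, is shown below; `call_with_backoff`, `flaky`, and the injected `sleep` argument are hypothetical names, and the error check is simplified relative to the function above:

```python
import time

def call_with_backoff(func, max_errors=10, wait=2.0, factor=1.5,
                      sleep=time.sleep, retry_on=(ConnectionError,)):
    """Call func() until it succeeds, waiting longer after each failure."""
    errors = 0
    while True:
        try:
            return func()
        except retry_on:
            errors += 1
            if errors > max_errors:
                raise          # give up after too many consecutive failures
            sleep(wait)        # pause before the next attempt
            wait *= factor     # grow the pause geometrically

# A stand-in for a network call that fails twice, then succeeds.
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError('temporary failure')
    return 'ok'

waits = []  # record the requested pauses instead of actually sleeping
result = call_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

Passing the sleep function as a parameter keeps the example fast and makes the backoff schedule easy to inspect.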
import subprocess #the table that went into the presentation
template = r'''\documentclass[preview]{{standalone}}
\usepackage{{booktabs}}
\usepackage[vcentering,dvips]{{geometry}}
\geometry{{total={{3.05in}}}}
\begin{{document}}
{}
\end{{document}}
'''
filename="../figs/concerned_geo.tex"
with open(filename, 'w') as f:
    f.write(template.format(cc.to_latex()))
subprocess.call(['pdflatex', filename],cwd=r'../figs');
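The template above doubles every literal LaTeX brace because `str.format` treats single braces as replacement fields. A small sketch of that escaping, with a hypothetical one-line body in place of `cc.to_latex()`:

```python
# Doubled braces survive formatting as single braces; the lone {} is the slot.
template = r'''\begin{{document}}
{}
\end{{document}}
'''

body = r'\textbf{hello}'  # hypothetical stand-in for the table produced by to_latex()
rendered = template.format(body)
print(rendered)
```

Braces inside the substituted text are inserted verbatim, so the table markup itself needs no escaping.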
# Figure 4 of version1
compare = pd.read_table('../data/popVSfollower1000.txt',header=0, sep="\t")
colorm = dict(boxes='lightgreen', whiskers='black', medians='black', caps='black')
#ax=compare[['followers','population']].plot(kind='box', patch_artist=True, showfliers=False)
boxprops = dict(linestyle='-', color='black')
matplotlib.style.use('ggplot')
plt.rcParams['axes.facecolor']='w'
plt.rcParams['savefig.facecolor']='w'
matplotlib.rcParams['xtick.labelsize'] = 20
matplotlib.rcParams['ytick.labelsize'] = 18
matplotlib.rcParams['axes.titlesize'] = 14
co = {'color':'black'}
ma = {'color':'black','linestyle':'-'}
plt.figure(figsize=(9,3))
bp = plt.boxplot([compare.followers,compare.population],patch_artist=True, showfliers=False,
whiskerprops=co,capprops=co,medianprops=ma,boxprops=boxprops,labels=['mayor','governor'])
ax = plt.gca()
for patch, color in zip(bp['boxes'], ['magenta','lightgreen']):
    patch.set_facecolor(color)
ax.xaxis.set_ticklabels(['Cohort','Control'])
ax.set_ylabel('Sentiment score',fontsize=22)
#ax.set_ylim([-.23, -.12])
ax.yaxis.grid(True, linestyle='-', which='major', color='lightgrey', alpha=0.5)
ax.get_figure().savefig('../figs/contagion-exp.pdf', bbox_inches='tight')
# Figure 3
mayonly_avgsent=pd.DataFrame(mayoronly.groupby(['screen_name'],as_index=False).mean()['sent'])
snyderonly_avgsent=pd.DataFrame(snyderonly.groupby(['screen_name'],as_index=False).mean()['sent'])
print(mayonly_avgsent.sent.mean())
print(snyderonly_avgsent.sent.mean())
# fig, axes = plt.subplots(nrows=1, ncols=2, sharey=True)
colorm = dict(boxes='magenta', whiskers='black', medians='black', caps='black')
colorg = dict(boxes='lightgreen', whiskers='black', medians='black', caps='black')
#matplotlib.style.use('fivethirtyeight')
matplotlib.style.use('ggplot')
plt.rcParams['axes.facecolor']='w'
plt.rcParams['savefig.facecolor']='w'
matplotlib.rcParams['font.size'] = 14
#plt.figure(num=None, figsize=(12, 8), facecolor='w', edgecolor='w')
c = {'color':'black'}
m = {'color':'black','linestyle':'-'}
boxprops = dict(linestyle='-', color='black')
bp = plt.boxplot([mayonly_avgsent,snyderonly_avgsent], patch_artist=True,
whiskerprops=c,capprops=c,medianprops=m,boxprops=boxprops,labels=['Mayor','Governor'])
for patch, color in zip(bp['boxes'], ['#348ABD','#E24A33']):
    patch.set_facecolor(color)
ax = plt.gca()
ax.set_ylabel('Sentiment score')
ax.yaxis.grid(True, linestyle='-', which='major', color='lightgrey', alpha=0.5)
ax.get_figure().savefig('../figs/box-mayor-gov.pdf',dpi=150,bbox_inches='tight')
-0.122567031844 -0.312806395394
# Figure 4
mayonly_avgsent=pd.DataFrame(mayoronly.groupby(['screen_name'],as_index=False)['sent'].mean())
snyderonly_avgsent=pd.DataFrame(snyderonly.groupby(['screen_name'],as_index=False)['sent'].mean())
pro_may_avgent = mayonly_avgsent[mayonly_avgsent.sent>0].screen_name.unique()
comment_both = snyderonly_avgsent[snyderonly_avgsent.screen_name.isin(pro_may_avgent)].screen_name.unique()
# parenthesise the comparison: `&` binds more tightly than `!=`
ax=mayonly_avgsent[mayonly_avgsent.screen_name.isin(comment_both) & (mayonly_avgsent.sent!=0)].sent.plot(kind='density', xlim=(-1,1),color='#348ABD')
snyderonly_avgsent[snyderonly_avgsent.screen_name.isin(comment_both) & (snyderonly_avgsent.sent!=0)].sent.plot(kind='density', ax=ax, xlim=(-1,1), color = '#E24A33')
ax.legend(['Mayor','Governor'],loc=2)
ax.set_xlabel('Sentiment score')
ax.get_figure().savefig('../figs/pro_mayors_gov.pdf', bbox_inches='tight')
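When a membership mask is combined with a comparison such as `!= 0` using `&`, operator precedence matters: in Python `&` binds more tightly than comparison operators, so the comparison should be parenthesised. A plain-Python sketch with scalar stand-ins for one row (the values are hypothetical):

```python
in_both = True  # stands in for one element of screen_name.isin(comment_both)
sent = 2        # stands in for a nonzero sentiment value (an int, for illustration)

unparenthesised = in_both & sent != 0  # parsed as (in_both & sent) != 0
parenthesised = in_both & (sent != 0)  # the intended "in both AND nonzero"

print(unparenthesised, parenthesised)
```

The same parsing rule applies element-wise to pandas Series, which is why filter conditions are normally written as `(a) & (b)`.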
plt.figure(num=None, figsize=(12, 6), facecolor='w', edgecolor='w')
fil.cnt.plot(loglog=True,linestyle='',marker='.')
fil.cpop.plot(loglog=True,linestyle='',marker='.')
#fil.normalized.plot(loglog=True,linestyle='',marker='.')
plt.legend(['tweet count','city population']);
ax = fil.plot.scatter(x='cnt',y='cpop',figsize=(12,5))
fil[['cnt','cpop','CityST']].apply(lambda x: ax.text(*x),axis=1);
plt.xlim(0,20000);
plt.ylim(0,3000000);
#compare the sentiments of blame and no-blame tweets
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
nobl = pd.DataFrame(list(l[l.c=='1'].text.apply(sid.polarity_scores)))['compound']
blame = pd.DataFrame(list(l[l.c!='2'].text.apply(sid.polarity_scores)))['compound']
nobl.plot.density()
blame.plot.density()
plt.legend(['no blame','blame'])
plt.xlim(-1,1);
plt.gcf().set_size_inches(6,2)
#plt.gcf().savefig('../figs/blame-sentiment.pdf',dpi=150,bbox_inches='tight')
blamers = [df[df.text.isin(l[l.c.str.contains(str(i))].text.unique())].screen_name.unique() for i in range(10)]
blamers.append(df[df.text.isin(l[l.c == 'missing'].text.unique())].screen_name.unique())
blamers = pd.DataFrame(blamers).transpose()
blamers.columns = labels
blamers.apply(pd.Series.nunique) #number of (unique) blamers
from collections import defaultdict
from itertools import permutations
r = defaultdict(dict)
for a1,a2 in permutations(range(1,10), 2):
    s1 = set(df[df.screen_name.isin(blamers[labels[a1]])].text.unique())
    s2 = set(df[df.screen_name.isin(blamers[labels[a2]])].text.unique())
    r[labels[a1]][labels[a2]] = len(s1 & s2) / len(s2)
plt.figure(num=None, figsize=(12, 8), facecolor='w', edgecolor='w')
z = pd.DataFrame.from_dict(r) #what percentage of blamers of x also blame y
sns.heatmap(z,annot=True,vmin=0,vmax=1,cmap='RdBu_r',annot_kws={'size':12});
#import nltk
#nltk.download('wordnet')
from nltk.corpus import wordnet as wn
from collections import defaultdict
synonyms = defaultdict(set)
words = 'blame fault responsible fail resign jail prison sentence accountable liable cause accuse treason poison'
for w in words.split():
    for synset in wn.synsets(w):
        synonyms[w].update([lemma.name() for lemma in synset.lemmas()])
from pprint import pprint
pprint(dict(synonyms),indent=2,width=300)
{ 'accountable': {'accountable'}, 'accuse': {'accuse', 'criminate', 'charge', 'incriminate', 'impeach'}, 'blame': {'goddamn', 'darned', 'damn', 'incrimination', 'goddamned', 'inculpation', 'fault', 'damned', 'blamed', 'charge', 'goddam', 'find_fault', 'deuced', 'blame', 'blasted', 'infernal', 'rap', 'pick', 'blessed'}, 'cause': {'causa', 'effort', 'movement', 'cause', 'suit', 'causal_agent', 'do', 'make', 'stimulate', 'drive', 'crusade', 'reason', 'campaign', 'case', 'lawsuit', 'induce', 'get', 'causal_agency', 'have', 'grounds'}, 'fail': {'give_way', 'give_out', 'run_out', 'miscarry', 'go_wrong', 'die', 'fail', 'flush_it', 'bomb', 'break', 'neglect', 'conk_out', 'go_bad', 'flunk', 'betray', 'go', 'break_down'}, 'fault': {'shift', 'error', 'faulting', 'mistake', 'geological_fault', 'defect', 'break', 'fault', 'flaw', 'demerit', 'fracture', 'blame'}, 'jail': {'lag', 'poky', 'jail', 'jailhouse', 'incarcerate', 'remand', 'clink', 'slammer', 'imprison', 'immure', 'pokey', 'jug', 'put_behind_bars', 'gaol', 'put_away'}, 'liable': {'liable', 'unresistant', 'nonresistant', 'nonimmune', 'apt'}, 'poison': {'poisonous_substance', 'toxicant', 'poison', 'envenom'}, 'prison': {'prison', 'prison_house'}, 'resign': {'free', 'vacate', 'give_up', 'submit', 'release', 'quit', 'leave_office', 'resign', 'step_down', 'relinquish', 'renounce', 'reconcile'}, 'responsible': {'creditworthy', 'responsible_for', 'responsible'}, 'sentence': {'sentence', 'time', 'doom', 'conviction', 'condemn', 'judgment_of_conviction', 'condemnation', 'prison_term'}, 'treason': {'perfidy', 'subversiveness', 'treason', 'betrayal', 'lese_majesty', 'high_treason', 'treachery', 'traitorousness'}}
# raw strings keep the regex escapes (\s) intact without invalid-escape warnings
blame_words = {'responsible': r'account responsible blame accus[ie] \sliab \scause',
               'fault': 'fault error mistake flaw',
               'reason': 'ignor negl[ie] accident discriminat intention ideology decisi',
               'sentenced': 'arrest convict jail jug bars prison sentence',
               'betrayed': 'betray traitor treason',
               'resign': r'resign quit remove.+office leave.+office step\sdown',
               'poison': 'poison'}
# one row per category, listing its blame words (regex fragments)
bw = pd.DataFrame.from_dict(blame_words, orient='index').rename(columns={0:'blame words per category'})
print(bw.to_latex())
\begin{tabular}{ll}
\toprule
{} & blame words per category \\
\midrule
fault & fault error mistake flaw \\
poison & poison \\
betrayed & betray traitor treason \\
responsible & account responsible blame accus[ie] \textbackslashsliab \textbackslashscause \\
sentenced & arrest convict jail jug bars prison sentence \\
resign & resign quit remove.+office leave.+office step\textbackslashs down \\
reason & ignor negl[ie] accident discriminat intention ideology decisi \\
\bottomrule
\end{tabular}
blame_words = {'responsible': r'account responsible blame accus[ie] \sliab \scause',
               'fault': 'fault error mistake flaw',
               'reason': 'ignor negl[ie] accident discriminat intention ideology decision',
               'sentenced': 'arrest convict jail jug bars prison sentence',
               'betrayed': 'betray traitor treason',
               'resign': r'resign quit remove.+office leave.+office step\sdown',
               'poison': 'poison'}
blame_tw = {} #unique text
blame_rt = {} #rt matters
total = set()
for k,v in blame_words.items():
    # index labels of all tweets matching any blame word of this category
    indices = set.union(*[set(df[df.text.str.contains(w,case=False)].index) for w in v.split()])
    blame_tw[k] = df.loc[list(indices)].text.nunique() # .loc does not accept a set
    blame_rt[k] = len(indices)
    total.update(indices)
blame_rt['total'] = len(total)
blame_tw['total'] = df.loc[list(total)].text.nunique()
pd.DataFrame([blame_rt,blame_tw],index=['tweets in the dataset (RTs count)','# of tweets w/ unique text'])
| | betrayed | fault | poison | reason | resign | responsible | sentenced | total |
|---|---|---|---|---|---|---|---|---|
| tweets in the dataset (RTs count) | 400 | 6047 | 51513 | 9233 | 17224 | 20961 | 20809 | 113340 |
| # of tweets w/ unique text | 292 | 2851 | 19338 | 4288 | 7795 | 11886 | 9663 | 50700 |
l = [v.split() for v in blame_words.values()]
blame_filter = '|'.join([item for sublist in l for item in sublist])
blames = df[df.text.str.contains(blame_filter,case=False)].copy()
blames = blames.replace({'\r': ' ','\n': ' '}, regex=True)
# group by tweet text
grouped = blames.groupby('text').size()
g = grouped.reset_index().rename(columns={0:'RT'})
g = g.sort_values('RT',ascending=False)
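The filter above joins every blame word into one alternation pattern and then counts how often each matching text occurs, so retweets of the same text add up. A self-contained sketch of the same idea with the standard library; the shortened word lists and the four-plus-one tweets are hypothetical:

```python
import re
from collections import Counter

# Shortened, hypothetical subset of the blame-word categories.
blame_words = {'fault': 'fault error mistake flaw',
               'resign': r'resign quit step\sdown'}

# One case-insensitive pattern that matches any blame word.
blame_filter = '|'.join(w for v in blame_words.values() for w in v.split())
pattern = re.compile(blame_filter, re.IGNORECASE)

# Made-up tweets; the first two share the same text, like a retweet.
tweets = ['RT @a: Snyder must RESIGN',
          'RT @a: Snyder must RESIGN',
          'Whose fault is this?',
          'Donate bottled water',
          'Time to step down']

blames = [t for t in tweets if pattern.search(t)]
rt_counts = Counter(blames).most_common()  # (text, count), most repeated first

print(len(blames), len(set(blames)))
print(rt_counts[0])
```

The gap between the number of matching tweets and the number of distinct texts mirrors the two rows of the blame-category table (RTs counted versus unique text).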
"""
sample = g.sample(n=2000,random_state=3).sort_values('RT',ascending=False).copy()
s1 = sample.sample(n=200,random_state=5)
s2 = sample.sample(n=200,random_state=7)
s3 = sample.sample(n=200,random_state=9)
s4 = sample.sample(n=200,random_state=11)
s5 = sample.sample(n=200,random_state=13)
s6 = sample.sample(n=200,random_state=15)
from itertools import combinations
for p,q in combinations(range(1,5),2):
p = 's'+str(p)
q = 's'+str(q)
print('|'+p+'|','∩','|'+q+'|','=',len(set(eval(p).index)&set(eval(q).index)))
s1.to_csv('data/s1.csv',index=False)
s2.to_csv('data/s2.csv',index=False)
s3.to_csv('data/s3.csv',index=False)
s4.to_csv('data/s4.csv',index=False)
s5.to_csv('data/s5.csv',index=False)
s6.to_csv('data/s6.csv',index=False)
"""
snyder = blames.text.str.contains('gov|nyder|onetoughnerd|bern',case=False)
em = blames.text.str.contains('mgr|manager|Darnell|Earley|Kurtz',case=False)
mayor = blames.text.str.contains('Dayne|Walling|ayor',case=False)
obama = blames.text.str.contains('obama|POTUS',case=False)
obama = obama & ~blames.text.str.contains('pledge|announc|nyder|governor',case=False)
epa = blames.text.str.contains(r'\sEPA\s',case=False)
republic = blames.text.str.contains('republic',case=False)
democrat = blames.text.str.contains('democrat',case=False)
def perform(fun, *args):
    return fun(*args)
def meetmin(x,y):
    # meet/min overlap (in percent) between the users behind two boolean masks
    x = blames[x].screen_name
    y = blames[y].screen_name
    return 100 * len(set(x) & set(y)) / min(len(set(x)), len(set(y)))
scores = []
s = snyder& ~(epa|obama|mayor|em)
for y in [s,epa,obama,mayor,em]:
    for f in [meetmin]:
        scores.append({'EM':perform(f,em,y),
                       'Mayor':perform(f,mayor,y),
                       'President':perform(f,obama,y),
                       'EPA':perform(f,epa,y),
                       'Snyder':perform(f,s,y)})
pd.DataFrame(scores,index=['Also blame Snyder','Also blame EPA','Also blame President','Also blame Mayor','Also blame EM'])
| | EM | EPA | Mayor | President | Snyder |
|---|---|---|---|---|---|
| Also blame Snyder | 33.592881 | 18.976198 | 20.074349 | 30.438675 | 100.000000 |
| Also blame EPA | 7.730812 | 100.000000 | 8.550186 | 15.219338 | 18.976198 |
| Also blame President | 6.893465 | 15.219338 | 6.195787 | 100.000000 | 30.438675 |
| Also blame Mayor | 9.293680 | 8.550186 | 100.000000 | 6.195787 | 20.074349 |
| Also blame EM | 100.000000 | 7.730812 | 9.293680 | 6.893465 | 33.592881 |
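The scores in this table are meet/min overlaps: the number of users common to two groups divided by the size of the smaller group, in percent. A minimal sketch on plain sets (the user ids are hypothetical, and this version takes the sets directly rather than boolean masks over `blames`):

```python
def meetmin(x, y):
    """Meet/min overlap of two collections, as a percentage of the smaller one."""
    x, y = set(x), set(y)
    return 100 * len(x & y) / min(len(x), len(y))

# Hypothetical groups of users blaming each actor.
snyder_blamers = {'u1', 'u2', 'u3', 'u4'}
epa_blamers = {'u3', 'u4', 'u5', 'u6', 'u7', 'u8'}

print(meetmin(snyder_blamers, epa_blamers))
print(meetmin(epa_blamers, snyder_blamers))
```

Because the measure is symmetric and any group overlaps fully with itself, the table is symmetric with 100 on the diagonal.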
We filtered our dataset using blame words and manually labeled a one percent sample of the tweets according to whom the blame is attributed.
Reviewers also indicated whether they were confused or unsure about whom a tweet assigns blame to.
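import pandas as pd
# read each rater's labeled file and tag its rows with the rater index
frames = []
for i in range(5):
    r = pd.read_csv('data/training/Flint'+str(i+1)+'_train.csv')
    r['rater'] = i
    frames.append(r)
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames the same way
l = pd.concat(frames).fillna('missing')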
# mpl_toolkits.axes_grid has been superseded by mpl_toolkits.axes_grid1
from mpl_toolkits.axes_grid1.inset_locator import inset_axes
inset_axes = inset_axes(parent_axes,
                        width="30%", # width = 30% of parent_bbox
                        height=1., # height : 1 inch
                        loc=3)
19.1 2.5079872408
df = pd.read_csv('data/us-city-populations.csv',usecols=['CityST','2010','LAT','LON'])
df2 = pd.read_csv('data/city_file.csv',dtype={'lat':str,'lon':str})
df2['CityST'] = df2.city + ', ' + df2.state
merged = df.merge(df2, on = 'CityST', how = 'inner')
merged['latlon'] = merged[['lat','lon']].apply(tuple, axis=1)
merged.head()
| | CityST | 2010 | LAT | LON | city | state | lat | lon | latlon |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Anchorage, AK | 291826.0 | 61.177549 | -149.274354 | Anchorage | AK | 61.191900 | -149.762097 | (61.191900, -149.762097) |
| 1 | Barrow, AK | 4212.0 | 71.254083 | -156.798949 | Barrow | AK | 71.300371 | -156.735840 | (71.300371, -156.735840) |
| 2 | Bethel, AK | 6080.0 | 60.792913 | -161.793405 | Bethel | AK | 60.789724 | -161.779332 | (60.789724, -161.779332) |
| 3 | Fairbanks, AK | 31535.0 | 64.836531 | -147.651745 | Fairbanks | AK | 64.838092 | -147.726378 | (64.838092, -147.726378) |
| 4 | Homer, AK | 5003.0 | 59.639985 | -151.511234 | Homer | AK | 59.643059 | -151.525900 | (59.643059, -151.525900) |
pos = {}
neg = {}
mean = {}
for g in ('s','epa','obama','mayor','em'):
    pos[g]=len(blames[(blames.sp>0) & eval(g)].text.unique())
    neg[g]=len(blames[(blames.sp<0) & eval(g)].text.unique())
    mean[g] = blames[(blames.sp<0) & eval(g)].sp.mean()
pd.DataFrame([pos,neg,mean],index=['pos tw unique','neg tw unique','mean'])
| | em | epa | mayor | obama | s |
|---|---|---|---|---|---|
| pos tw unique | 137.000000 | 309.000000 | 102.000000 | 183.000000 | 1454.000000 |
| neg tw unique | 94.000000 | 245.000000 | 120.000000 | 320.000000 | 2266.000000 |
| mean | -0.295925 | -0.366463 | -0.160341 | -0.344728 | -0.342608 |
with pd.option_context('display.max_colwidth', 114):
print(blames[epa].text[:30].to_string(index=False))
id RT @NolanHack: Navajo Blame EPA Inaction For Suicides\n\n#FlintWaterCrisis \nhttps://t.co/wgPjnIGRIg #StopNati... MidwestViews: #FlintWaterCrisis Politics as usual BernieSanders & Dems want onetoughnerd resign but not ep... @billmaher\nAfter the Flint disaster, you may accept the fact that EPA never implemented the CWA due to a faul... @MaddowBlog\nAfter the Flint disaster,you may accept the fact that EPA never implemented the CWA due to a faul... @chrislhayes\nAfter Flint disaster, you may accept the fact that EPA never implemented the CWA due to a faulty... RT @VickiMasterson2: @owillis Interesting how Fournier was strangely quiet about Flint until he found an angle... RT @VickiMasterson2: @owillis Interesting how Fournier was strangely quiet about Flint until he found an angle... RT @VickiMasterson2: @owillis Interesting how Fournier was strangely quiet about Flint until he found an angle... Government #EPA has failed #FlintWaterCrisis Obama is a failure. EPA knew!!!!! RT @VickiMasterson2: @owillis Interesting how Fournier was strangely quiet about Flint until he found an angle... @meredithshiner\nAfter Flint disaster,you may accept the fact that EPA never implemented the CWA due to a faul... @sganim\nAfter the Flint disaster, you may accept the fact that EPA never implemented the CWA due to a faulty ... @janelleNBC\nAfter Flint disaster, you may accept the fact that EPA never implemented the CWA due to a faulty ... @CaseyWianCNN\nAfter Flint disaster, you may accept the fact that EPA never implemented the CWA due to a fault... @cnnsara\nAfter the Flint disaster, you may accept the fact that EPA never implemented the CWA due to a faulty... The rats will somehow blame the EPA & @potus later. EM and Gov knowingly poisoned people for $$ #FlintWate... @maddow If it's fair to blame Snyder for the Flint River, then it's fair to blame Obama for what his EPA did h... 
from wordcloud import WordCloud, STOPWORDS
from PIL import Image
import numpy as np
import calendar
# load the mask image with PIL (scipy.misc.imread is no longer available in SciPy)
mask = np.array(Image.open("twitter_mask.png"))
wc = WordCloud(mask=mask,background_color='white',stopwords=STOPWORDS,width=2200,height=1400).generate(words)
plt.figure().suptitle(calendar.month_name[month]+', 2016')
plt.axis('off')
plt.imshow(wc)
plt.savefig('figs/wc_'+calendar.month_name[month]+'.png', dpi=300, bbox_inches='tight')