This is the second in a series of notebooks designed to show you how to analyze social media data. For demonstration purposes we are looking at tweets sent by the CSR-related Twitter accounts -- accounts related to ethics, equality, the environment, etc. -- of Fortune 200 firms in 2013. We assume you have already downloaded the data and completed the steps in Chapter 1. In this second notebook I will show you how to conduct various account-level (organization-level) analyses of the Twitter data. Essentially, we will take the tweet-level data and aggregate them to the account level.
First, we will import several necessary Python packages and set some options for viewing the data. As with Chapter 1, we will be using the Python Data Analysis Library, or PANDAS, extensively for our data manipulations.
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series
#Set PANDAS to show all columns in DataFrame
pd.set_option('display.max_columns', None)
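As a quick sanity check, you can confirm the display option took effect with pd.get_option. This is a minimal sketch; None simply means "no limit on the number of columns shown":

```python
import pandas as pd

# Show every column when displaying a DataFrame (None = no limit)
pd.set_option('display.max_columns', None)

# Confirm the option took effect
print(pd.get_option('display.max_columns'))
```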
We'll be producing some figures at the end of this tutorial so we need to import various graphing capabilities. The default Matplotlib library is solid.
import matplotlib
import matplotlib.pyplot as plt
print matplotlib.__version__
1.3.1
#NECESSARY FOR XTICKS OPTION, ETC.
from pylab import *
One of the great innovations of the IPython Notebook is the ability to see output and graphics "inline," that is, on the same page and immediately below each line of code. To enable this feature for graphics we run the following line.
%matplotlib inline
We will be using Seaborn to help pretty up the default Matplotlib graphics. Seaborn does not come installed with Anaconda Python, so you will have to open a terminal and run pip install seaborn.
import seaborn as sns
print sns.__version__
0.5.1
In Chapter 1 we deleted tweets from one unneeded Twitter account and also omitted several unnecessary columns (variables). We then saved, or "pickled," the updated dataframe. Let's now open this saved file. As we can see in the operations below this dataframe contains 54 variables for 32,330 tweets.
df = pd.read_pickle('CSR tweets - 2013 by 41 accounts.pkl')
print len(df)
df.head(2)
32330
rowid | query | tweet_id_str | inserted_date | language | coordinates | retweeted_status | created_at | month | year | content | from_user_screen_name | from_user_id | from_user_followers_count | from_user_friends_count | from_user_listed_count | from_user_favourites_count | from_user_statuses_count | from_user_description | from_user_location | from_user_created_at | retweet_count | favorite_count | entities_urls | entities_urls_count | entities_hashtags | entities_hashtags_count | entities_mentions | entities_mentions_count | in_reply_to_screen_name | in_reply_to_status_id | source | entities_expanded_urls | entities_media_count | media_expanded_url | media_url | media_type | video_link | photo_link | twitpic | num_characters | num_words | retweeted_user | retweeted_user_description | retweeted_user_screen_name | retweeted_user_followers_count | retweeted_user_listed_count | retweeted_user_statuses_count | retweeted_user_location | retweeted_tweet_created_at | Fortune_2012_rank | Company | CSR_sustainability | specific_project_initiative_area | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 67340 | humanavitality | 306897327585652736 | 2014-03-09 13:46:50.222857 | en | NaN | NaN | 2013-02-27 22:43:19.000000 | 2 | 2013 | @louloushive (Tweet 2) We encourage other empl... | humanavitality | 274041023 | 2859 | 440 | 38 | 25 | 1766 | This is the official Twitter account for Human... | NaN | Tue Mar 29 16:23:02 +0000 2011 | 0 | 0 | NaN | 0 | NaN | 0 | louloushive | 1 | louloushive | 306218267737989120.00 | web | NaN | nan | NaN | NaN | NaN | 0 | 0 | 0 | 121 | 19 | nan | NaN | NaN | nan | nan | nan | NaN | NaN | 79 | Humana | 0 | 1 |
1 | 39454 | FundacionPfizer | 308616393706844160 | 2014-03-09 13:38:20.679967 | es | NaN | NaN | 2013-03-04 16:34:17.000000 | 3 | 2013 | ¿Sabes por qué la #vacuna contra la #neumonía ... | FundacionPfizer | 188384056 | 2464 | 597 | 50 | 11 | 2400 | Noticias sobre Responsabilidad Social y Fundac... | México | Wed Sep 08 16:14:11 +0000 2010 | 1 | 0 | NaN | 0 | vacuna, neumonía | 2 | NaN | 0 | NaN | nan | web | NaN | nan | NaN | NaN | NaN | 0 | 0 | 0 | 138 | 20 | nan | NaN | NaN | nan | nan | nan | NaN | NaN | 40 | Pfizer | 0 | 1 |
2 rows × 54 columns
List all the columns in the DataFrame
df.columns
Index([u'rowid', u'query', u'tweet_id_str', u'inserted_date', u'language', u'coordinates', u'retweeted_status', u'created_at', u'month', u'year', u'content', u'from_user_screen_name', u'from_user_id', u'from_user_followers_count', u'from_user_friends_count', u'from_user_listed_count', u'from_user_favourites_count', u'from_user_statuses_count', u'from_user_description', u'from_user_location', u'from_user_created_at', u'retweet_count', u'favorite_count', u'entities_urls', u'entities_urls_count', u'entities_hashtags', u'entities_hashtags_count', u'entities_mentions', u'entities_mentions_count', u'in_reply_to_screen_name', u'in_reply_to_status_id', u'source', u'entities_expanded_urls', u'entities_media_count', u'media_expanded_url', u'media_url', u'media_type', u'video_link', u'photo_link', u'twitpic', u'num_characters', u'num_words', u'retweeted_user', u'retweeted_user_description', u'retweeted_user_screen_name', u'retweeted_user_followers_count', u'retweeted_user_listed_count', u'retweeted_user_statuses_count', u'retweeted_user_location', u'retweeted_tweet_created_at', u'Fortune_2012_rank', u'Company', u'CSR_sustainability', u'specific_project_initiative_area'], dtype='object')
We can use the len function again here to see how many columns there are in the dataframe: 54.
len(df.columns)
54
In Chapter 1 we used the unique function to first list all of the unique Twitter accounts represented in our collection of tweets and then provide a count of those accounts. There are 41 separate accounts.
pd.unique(df.from_user_screen_name.ravel())
array(['humanavitality', 'FundacionPfizer', 'Intelinvolved', 'CiscoCSR', 'IBMSmartCities', 'citizenIBM', 'BofA_Community', 'PG_CSDW', 'mathmovesu', 'PPGIdeascapes', 'gehealthy', 'msftcitizenship', 'CiscoEDU', 'DuPont_ability', 'Dell4Good', 'verizongiving', 'DellEDU', 'SprintGreenNews', 'TeachingMoney', 'WalmartGreen', 'ecomagination', 'WalmartAction', 'Microsoft_Green', 'Comcastdreambig', 'DE_Youtility', 'hpglobalcitizen', 'CitizenDisney', 'AmgenFoundation', 'ATTAspire', 'AlcoaFoundation', 'nikebetterworld', 'HoneywellBuild', 'googlestudents', '3M_FoodSafety', 'HeartRescue', 'EnviroSears', 'ClickToEmpower', 'GreenIBM', 'PromesaPepsiCo', 'FordDriveGreen', 'JNJStories'], dtype=object)
Note again how we can use len to find out how many accounts are in the array.
len(pd.unique(df.from_user_screen_name.ravel()))
41
So we know there are 41 accounts in our dataset. But how many companies are controlling these accounts? To answer this question, we can run unique for the Company variable. The result? There are actually only 34 unique companies represented in the dataset. Several of the companies have multiple CSR-related Twitter accounts. We will find out exactly which ones soon.
len(pd.unique(df.Company.ravel()))
34
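To see how these counting operations behave on a small scale, here is a toy example with made-up screen names and companies (not from our dataset). Series.nunique is a convenient shortcut for len(pd.unique(...)):

```python
import pandas as pd

# Toy tweet-level data: four tweets from three accounts owned by two companies
toy = pd.DataFrame({
    'from_user_screen_name': ['acctA', 'acctB', 'acctA', 'acctC'],
    'Company': ['Alpha', 'Beta', 'Alpha', 'Alpha'],
})

print(len(toy))                                # number of tweets (rows)
print(toy['from_user_screen_name'].nunique())  # number of unique accounts
print(toy['Company'].nunique())                # number of unique companies
```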
Now let's turn to our account-level analyses. To do this, we need to convert our tweet-level dataset -- a dataset where each row is dedicated to one of the 32,330 tweets -- to an account-level dataset. This process is called aggregation and the output will be a new dataframe with 41 rows -- one per Twitter account.
We will perform a simple aggregation here -- we are only interested in a count of how many tweets each account sent over the course of 2013. Our first step is to write a function that spells out which variables (columns) from our dataframe we wish to keep and how to aggregate them. Specifically, the following function first produces a variable called Number_of_tweets, a count of the number of tweets sent; we base this on the "content" column, but we could have chosen others. Second, we ask the new dataframe to provide a column called Company. We use the min (minimum) function here, but that is just a convenience: every tweet sent by a given account carries the same company name, so taking the "minimum" simply returns that one name; we could just as easily have returned the "maximum" value for Company. Similarly, we will include the Description the company has provided for its Twitter account.
def f(x):
return Series(dict(Number_of_tweets = x['content'].count(),
Company=x['Company'].min(),
Description=x['from_user_description'].min(),
))
In the following block of code we will now apply the above function to our dataframe. Note the groupby command. This is how we aggregate the data. It is an invaluable function, so I'll provide more details. We are asking PANDAS to create a new dataframe, called account_count, and asking that this new dataframe be based on aggregating our original dataframe by the from_user_screen_name column, applying the function f we wrote above. In other words, we are going to our original dataframe of 32,330 tweets, and aggregating or collapsing that data based on who sent the tweets. We are thus converting our tweet-level dataset with 32,330 rows into an account-level dataset with 41 rows. As you can see in the output, there are 41 observations in this new dataframe -- one per account -- and for each account there is a column showing the name of the Fortune 200 company, the description it created for its Twitter account, and the number of tweets sent over the course of 2013.
account_count = df.groupby('from_user_screen_name').apply(f)
print len(account_count)
account_count
41
Company | Description | Number_of_tweets | |
---|---|---|---|
from_user_screen_name | |||
3M_FoodSafety | 3M | 3M Food Safety | global manufacturer of innova... | 325 |
ATTAspire | AT&T | Inspiring students to achieve their dreams. | 336 |
AlcoaFoundation | Alcoa | Since 1952, we’ve invested $570 million to imp... | 1053 |
AmgenFoundation | Amgen | The Amgen Foundation seeks to advance science ... | 135 |
BofA_Community | Bank of America Corp. | We’re connecting local communities to the peop... | 1722 |
CiscoCSR | Cisco Systems | Sharing stories about how Cisco and our partne... | 2511 |
CiscoEDU | Cisco Systems | Tweets on the Cisco Connected Learning Experie... | 932 |
CitizenDisney | Walt Disney | We believe in the power of stories, families a... | 211 |
ClickToEmpower | Allstate | Click to Empower is a Web initiative of The Al... | 149 |
Comcastdreambig | Comcast | Comcast empowers communities across the countr... | 581 |
DE_Youtility | Duke Energy | This profile will retire soon, so head on over... | 300 |
Dell4Good | Dell | News from Dell corp. responsibility team: sust... | 829 |
DellEDU | Dell | Connecting with students, educators, school ad... | 1108 |
DuPont_ability | DuPont | DuPont sustainability news: clean energy, sola... | 1509 |
EnviroSears | Sears Holdings | Discussing sustainability topics at Sears Hold... | 53 |
FordDriveGreen | Ford Motor | Ford is committed to affordable, sustainable p... | 27 |
FundacionPfizer | Pfizer | Noticias sobre Responsabilidad Social y Fundac... | 421 |
GreenIBM | International Business Machines | Official Twitter account for IBM Big Green Inn... | 81 |
HeartRescue | Medtronic | A collaborative effort supported by the Medtro... | 322 |
HoneywellBuild | Honeywell International | Honeywell Building Solutions installs, integra... | 242 |
IBMSmartCities | International Business Machines | Official IBM Smarter Cities account. Managed b... | 1570 |
Intelinvolved | Intel | Connecting & enriching lives worldwide to crea... | 1524 |
JNJStories | Johnson & Johnson | We're tweeting about social good over on @JNJC... | 9 |
Microsoft_Green | Microsoft | The official Twitter account for Microsoft's E... | 436 |
PG_CSDW | Procter & Gamble | News from P&G's Children’s Safe Drinking Water... | 187 |
PPGIdeascapes | PPG Industries | PPG Ideascapes provides sustainable building p... | 160 |
PromesaPepsiCo | PepsiCo | PepsiCo's commitment to the Hispanic community... | 7 |
SprintGreenNews | Sprint Nextel | News and Information about Sprint's sustainabi... | 201 |
TeachingMoney | Capital One Financial | Teaching Money is a new initiative created to ... | 187 |
WalmartAction | Wal-Mart Stores | Our Community Action Network works to improve ... | 1915 |
WalmartGreen | Wal-Mart Stores | Together, we can create a sustainable world & ... | 1434 |
citizenIBM | International Business Machines | This official Citizen IBM Twitter feed is admi... | 1039 |
ecomagination | General Electric | A forum for fresh thinking and conversation ab... | 594 |
gehealthy | General Electric | A shared commitment to creating better health ... | 2461 |
googlestudents | Google news and updates especially for student... | 211 | |
hpglobalcitizen | Hewlett-Packard | HP Living Progress is focused to create a bett... | 470 |
humanavitality | Humana | This is the official Twitter account for Human... | 762 |
mathmovesu | Raytheon | We represent the centerpiece citizenship initi... | 1146 |
msftcitizenship | Microsoft | Sharing stories about how Microsoft and our pa... | 2493 |
nikebetterworld | Nike | A better world needs the world’s biggest team.... | 153 |
verizongiving | Verizon Communications | We are focused on using technology to solve cr... | 2524 |
41 rows × 3 columns
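The same groupby-and-apply pattern can be seen end to end on a toy dataframe (made-up accounts and tweets, for illustration only). Three tweet-level rows collapse into two account-level rows:

```python
import pandas as pd
from pandas import Series

# Same aggregation function as in the tutorial, minus the Description column
def f(x):
    return Series(dict(Number_of_tweets=x['content'].count(),
                       Company=x['Company'].min()))

# Toy tweet-level data: two tweets from acctA, one from acctB
toy = pd.DataFrame({
    'from_user_screen_name': ['acctA', 'acctA', 'acctB'],
    'Company': ['Alpha', 'Alpha', 'Beta'],
    'content': ['tweet 1', 'tweet 2', 'tweet 3'],
})

# Collapse tweet-level rows into one row per account
agg = toy.groupby('from_user_screen_name').apply(f)
print(agg)
```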
Now that we have our account-level dataframe, we can check which companies are managing multiple Twitter accounts. Note how we again use the groupby function; this time we aggregate by the Company column. Don't worry too much about understanding every detail of this block of code. For now, just note that we are creating a series (think of it as a list, not a dataset) holding the number of rows each company has in the account-level dataframe, converting that series to a dataframe, and then dropping the cases with only one row. You can use this block of code as a template for similar "duplicate checks" you wish to perform on your own data.
What do we find? Five of the companies run two Twitter accounts each, and one (IBM) runs three. There are thus 41 separate CSR-related Twitter accounts run by 34 companies.
counts = account_count.groupby('Company').size() #create a series of number of rows per company
df2 = pd.DataFrame(counts, columns = ['size']) #convert series to a dataframe
df2 = df2[df2['size']>1] #keep only those cases with more than one row in account-level dataframe
df2 #show dataframe
size | |
---|---|
Company | |
Cisco Systems | 2 |
Dell | 2 |
General Electric | 2 |
International Business Machines | 3 |
Microsoft | 2 |
Wal-Mart Stores | 2 |
6 rows × 1 columns
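Here is the same duplicate-check pattern on a toy account-level frame (made-up account and company names), stripped to its essentials -- count rows per company, then keep only the companies appearing more than once:

```python
import pandas as pd

# Toy account-level frame: one row per account, indexed by screen name
accounts = pd.DataFrame({'Company': ['Alpha', 'Alpha', 'Beta', 'Gamma']},
                        index=['a1', 'a2', 'b1', 'g1'])

counts = accounts.groupby('Company').size()  # accounts per company
dupes = counts[counts > 1]                   # companies with more than one account
print(dupes)
```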
We now have the first meaningful pieces of information for our report. Let's output account_count to a CSV file. This will give us a file with the four columns shown above: from_user_screen_name, Company, Description, and Number_of_tweets. This table will make for a useful appendix to our research paper.
account_count.to_csv('Number of Tweets per Account.csv')
PANDAS can output to a number of different formats. I most commonly use to_csv, to_excel, to_json, and to_pickle.
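A handy trick for quick inspection: calling to_csv with no filename returns the CSV text as a string instead of writing a file (the two-row dataframe below is made up for illustration):

```python
import pandas as pd

df_small = pd.DataFrame({'Company': ['Alpha', 'Beta'],
                         'Number_of_tweets': [10, 5]})

# With no path argument, to_csv returns the CSV text as a string
csv_text = df_small.to_csv(index=False)
print(csv_text)
```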
We will now do a simple bar graph of the number of tweets sent by each account. First, we will sort the dataframe.
To access the help for any function we can use "?"
DataFrame.sort?
Let's sort the data in descending order by number of tweets
account_count = account_count.sort(['Number_of_tweets'], ascending=False)
account_count
Company | Description | Number_of_tweets | |
---|---|---|---|
from_user_screen_name | |||
verizongiving | Verizon Communications | We are focused on using technology to solve cr... | 2524 |
CiscoCSR | Cisco Systems | Sharing stories about how Cisco and our partne... | 2511 |
msftcitizenship | Microsoft | Sharing stories about how Microsoft and our pa... | 2493 |
gehealthy | General Electric | A shared commitment to creating better health ... | 2461 |
WalmartAction | Wal-Mart Stores | Our Community Action Network works to improve ... | 1915 |
BofA_Community | Bank of America Corp. | We’re connecting local communities to the peop... | 1722 |
IBMSmartCities | International Business Machines | Official IBM Smarter Cities account. Managed b... | 1570 |
Intelinvolved | Intel | Connecting & enriching lives worldwide to crea... | 1524 |
DuPont_ability | DuPont | DuPont sustainability news: clean energy, sola... | 1509 |
WalmartGreen | Wal-Mart Stores | Together, we can create a sustainable world & ... | 1434 |
mathmovesu | Raytheon | We represent the centerpiece citizenship initi... | 1146 |
DellEDU | Dell | Connecting with students, educators, school ad... | 1108 |
AlcoaFoundation | Alcoa | Since 1952, we’ve invested $570 million to imp... | 1053 |
citizenIBM | International Business Machines | This official Citizen IBM Twitter feed is admi... | 1039 |
CiscoEDU | Cisco Systems | Tweets on the Cisco Connected Learning Experie... | 932 |
Dell4Good | Dell | News from Dell corp. responsibility team: sust... | 829 |
humanavitality | Humana | This is the official Twitter account for Human... | 762 |
ecomagination | General Electric | A forum for fresh thinking and conversation ab... | 594 |
Comcastdreambig | Comcast | Comcast empowers communities across the countr... | 581 |
hpglobalcitizen | Hewlett-Packard | HP Living Progress is focused to create a bett... | 470 |
Microsoft_Green | Microsoft | The official Twitter account for Microsoft's E... | 436 |
FundacionPfizer | Pfizer | Noticias sobre Responsabilidad Social y Fundac... | 421 |
ATTAspire | AT&T | Inspiring students to achieve their dreams. | 336 |
3M_FoodSafety | 3M | 3M Food Safety | global manufacturer of innova... | 325 |
HeartRescue | Medtronic | A collaborative effort supported by the Medtro... | 322 |
DE_Youtility | Duke Energy | This profile will retire soon, so head on over... | 300 |
HoneywellBuild | Honeywell International | Honeywell Building Solutions installs, integra... | 242 |
googlestudents | Google news and updates especially for student... | 211 | |
CitizenDisney | Walt Disney | We believe in the power of stories, families a... | 211 |
SprintGreenNews | Sprint Nextel | News and Information about Sprint's sustainabi... | 201 |
PG_CSDW | Procter & Gamble | News from P&G's Children’s Safe Drinking Water... | 187 |
TeachingMoney | Capital One Financial | Teaching Money is a new initiative created to ... | 187 |
PPGIdeascapes | PPG Industries | PPG Ideascapes provides sustainable building p... | 160 |
nikebetterworld | Nike | A better world needs the world’s biggest team.... | 153 |
ClickToEmpower | Allstate | Click to Empower is a Web initiative of The Al... | 149 |
AmgenFoundation | Amgen | The Amgen Foundation seeks to advance science ... | 135 |
GreenIBM | International Business Machines | Official Twitter account for IBM Big Green Inn... | 81 |
EnviroSears | Sears Holdings | Discussing sustainability topics at Sears Hold... | 53 |
FordDriveGreen | Ford Motor | Ford is committed to affordable, sustainable p... | 27 |
JNJStories | Johnson & Johnson | We're tweeting about social good over on @JNJC... | 9 |
PromesaPepsiCo | PepsiCo | PepsiCo's commitment to the Hispanic community... | 7 |
41 rows × 3 columns
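Note that in later pandas releases DataFrame.sort was deprecated and eventually removed in favor of sort_values. If the sort call above fails for you, the modern equivalent looks like this (toy data, using three counts from the table above):

```python
import pandas as pd

toy = pd.DataFrame({'Number_of_tweets': [27, 2524, 421]},
                   index=['FordDriveGreen', 'verizongiving', 'FundacionPfizer'])

# sort_values replaces the older DataFrame.sort method
toy_sorted = toy.sort_values('Number_of_tweets', ascending=False)
print(toy_sorted.index.tolist())
```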
We'll also make the default plot larger.
plt.rcParams['figure.figsize'] = (20, 10)
PANDAS can produce plots quite easily, as you can see in the following example. We're asking PANDAS to plot a bar graph based on the Number_of_tweets column in our account-level dataframe.
account_count['Number_of_tweets'].plot(kind='bar')
<matplotlib.axes.AxesSubplot at 0x12ad45950>
We can also try a horizontal bar graph
account_count['Number_of_tweets'].plot(kind='barh')
<matplotlib.axes.AxesSubplot at 0x11efb5110>
Typically the above default graphs have a few things we'd like to tweak. Learning the ins and outs of all the possible modifications takes time, so don't worry about learning them all now. Instead, I'd recommend using the following examples as a template for your own data and then learning new options as you need them.
In the code block below I modify the transparency of the graph (using "alpha"), change the font size and rotation of the x-axis tick labels, add/change the y-axis and x-axis headings, make them bold, and add some extra spacing. I then save the output as a .png file that you can insert into your Word or LaTeX file. I have left the title out of these figures -- I recommend adding titles later in Word/LaTeX.
account_plot = account_count['Number_of_tweets'].plot(kind='bar', alpha=0.75)
xticks(fontsize = 10,rotation=60, ha ="right")
account_plot.set_xlabel('Twitter Account Name', weight='bold', labelpad=30)
account_plot.set_ylabel('# Tweets', weight='bold', labelpad=25) #ADD PADDING BETWEEN Y-AXIS AND ITS LABEL
savefig('account counts - bar graph.png', bbox_inches='tight', dpi=300, format='png')
And here's a modified version of the horizontal bar graph.
account_plot = account_count['Number_of_tweets'].plot(kind='barh', alpha=0.75)
xticks(fontsize=12)
account_plot.set_ylabel('Twitter Account Name', weight='bold', labelpad=20)
account_plot.set_xlabel('# Tweets', weight='bold', labelpad=25) #ADD PADDING BETWEEN X-AXIS AND ITS LABEL
savefig('account counts - horizontal bar graph.png', bbox_inches='tight', dpi=300, format='png')
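If you want to run this plotting-and-saving routine outside a notebook, here is a minimal, self-contained sketch using toy data (three made-up counts); the Agg backend lets it render without a display attached:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend: render with no display attached
import matplotlib.pyplot as plt
import pandas as pd

# Toy counts standing in for the real account-level dataframe
counts = pd.Series([2524, 1915, 421],
                   index=['verizongiving', 'WalmartAction', 'FundacionPfizer'])

ax = counts.plot(kind='bar', alpha=0.75)
plt.xticks(fontsize=10, rotation=60, ha='right')
ax.set_xlabel('Twitter Account Name', weight='bold', labelpad=30)
ax.set_ylabel('# Tweets', weight='bold', labelpad=25)
plt.savefig('toy account counts.png', bbox_inches='tight', dpi=100)
plt.close()
```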
OK, we have covered a few important steps in this tutorial. We now have an account-level dataset, and used it to output a CSV-formatted table showing the number of tweets sent by each of the 41 Twitter accounts. We'll likely use this table as an appendix in our research report. We have also generated a bar graph representation of these data, which we could also include. Finally, we know which of the Fortune 200 companies are managing multiple CSR-related Twitter accounts. I hope you've found this tutorial helpful. In the next tutorial we will cover how to aggregate the data by time.
For more Notebooks as well as additional Python and Big Data tutorials, please visit http://social-metrics.org or follow me on Twitter @gregorysaxton