This is the second in a series of notebooks designed to show you how to analyze social media data. For demonstration purposes we are looking at tweets sent in 2013 by the CSR-related Twitter accounts -- accounts devoted to ethics, equality, the environment, etc. -- of Fortune 200 firms. We assume you have already downloaded the data and completed the steps in Chapter 1. In this second notebook I will show you how to conduct various account-level (organization-level) analyses of the Twitter data. Essentially, we will take the tweet-level data and aggregate it to the account level.

Chapter 2: Analyze Twitter Data at the Account (Organization) Level

First, we will import several necessary Python packages and set some options for viewing the data. As with Chapter 1, we will be using the Python Data Analysis Library, or PANDAS, extensively for our data manipulations.

Import packages and set viewing options

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series
In [2]:
#Set PANDAS to show all columns in DataFrame
pd.set_option('display.max_columns', None)

Import graphing packages

We'll be producing some figures at the end of this tutorial so we need to import various graphing capabilities. The default Matplotlib library is solid.

In [46]:
import matplotlib
import matplotlib.pyplot as plt
print matplotlib.__version__
1.3.1
In [39]:
#NECESSARY FOR XTICKS OPTION, ETC.
from pylab import *

One of the great innovations of the IPython Notebook is the ability to see output and graphics "inline" -- that is, on the same page, immediately below each line of code. To enable this feature for graphics, we run the following line.

In [43]:
%matplotlib inline  

We will be using Seaborn to help pretty up the default Matplotlib graphics. Seaborn does not come installed with Anaconda Python, so you will have to open a terminal and run pip install seaborn.

In [45]:
import seaborn as sns
print sns.__version__
0.5.1

Read in data

In Chapter 1 we deleted tweets from one unneeded Twitter account and also omitted several unnecessary columns (variables). We then saved, or "pickled," the updated dataframe. Let's now open this saved file. As we can see in the operations below this dataframe contains 54 variables for 32,330 tweets.

In [5]:
df = pd.read_pickle('CSR tweets - 2013 by 41 accounts.pkl')
print len(df)
df.head(2)
32330
Out[5]:
rowid query tweet_id_str inserted_date language coordinates retweeted_status created_at month year content from_user_screen_name from_user_id from_user_followers_count from_user_friends_count from_user_listed_count from_user_favourites_count from_user_statuses_count from_user_description from_user_location from_user_created_at retweet_count favorite_count entities_urls entities_urls_count entities_hashtags entities_hashtags_count entities_mentions entities_mentions_count in_reply_to_screen_name in_reply_to_status_id source entities_expanded_urls entities_media_count media_expanded_url media_url media_type video_link photo_link twitpic num_characters num_words retweeted_user retweeted_user_description retweeted_user_screen_name retweeted_user_followers_count retweeted_user_listed_count retweeted_user_statuses_count retweeted_user_location retweeted_tweet_created_at Fortune_2012_rank Company CSR_sustainability specific_project_initiative_area
0 67340 humanavitality 306897327585652736 2014-03-09 13:46:50.222857 en NaN NaN 2013-02-27 22:43:19.000000 2 2013 @louloushive (Tweet 2) We encourage other empl... humanavitality 274041023 2859 440 38 25 1766 This is the official Twitter account for Human... NaN Tue Mar 29 16:23:02 +0000 2011 0 0 NaN 0 NaN 0 louloushive 1 louloushive 306218267737989120.00 web NaN nan NaN NaN NaN 0 0 0 121 19 nan NaN NaN nan nan nan NaN NaN 79 Humana 0 1
1 39454 FundacionPfizer 308616393706844160 2014-03-09 13:38:20.679967 es NaN NaN 2013-03-04 16:34:17.000000 3 2013 ¿Sabes por qué la #vacuna contra la #neumonía ... FundacionPfizer 188384056 2464 597 50 11 2400 Noticias sobre Responsabilidad Social y Fundac... México Wed Sep 08 16:14:11 +0000 2010 1 0 NaN 0 vacuna, neumonía 2 NaN 0 NaN nan web NaN nan NaN NaN NaN 0 0 0 138 20 nan NaN NaN nan nan nan NaN NaN 40 Pfizer 0 1

2 rows × 54 columns


List all the columns in the DataFrame

In [6]:
df.columns
Out[6]:
Index([u'rowid', u'query', u'tweet_id_str', u'inserted_date', u'language', u'coordinates', u'retweeted_status', u'created_at', u'month', u'year', u'content', u'from_user_screen_name', u'from_user_id', u'from_user_followers_count', u'from_user_friends_count', u'from_user_listed_count', u'from_user_favourites_count', u'from_user_statuses_count', u'from_user_description', u'from_user_location', u'from_user_created_at', u'retweet_count', u'favorite_count', u'entities_urls', u'entities_urls_count', u'entities_hashtags', u'entities_hashtags_count', u'entities_mentions', u'entities_mentions_count', u'in_reply_to_screen_name', u'in_reply_to_status_id', u'source', u'entities_expanded_urls', u'entities_media_count', u'media_expanded_url', u'media_url', u'media_type', u'video_link', u'photo_link', u'twitpic', u'num_characters', u'num_words', u'retweeted_user', u'retweeted_user_description', u'retweeted_user_screen_name', u'retweeted_user_followers_count', u'retweeted_user_listed_count', u'retweeted_user_statuses_count', u'retweeted_user_location', u'retweeted_tweet_created_at', u'Fortune_2012_rank', u'Company', u'CSR_sustainability', u'specific_project_initiative_area'], dtype='object')


We can use the len function again here to see how many columns there are in the dataframe: 54.

In [7]:
len(df.columns)
Out[7]:
54

Number of Accounts in DF

In Chapter 1 we used the unique function to first list all of the unique Twitter accounts represented in our collection of tweets and then provide a count of those accounts. There are 41 separate accounts.

In [8]:
pd.unique(df.from_user_screen_name.ravel())
Out[8]:
array(['humanavitality', 'FundacionPfizer', 'Intelinvolved', 'CiscoCSR',
       'IBMSmartCities', 'citizenIBM', 'BofA_Community', 'PG_CSDW',
       'mathmovesu', 'PPGIdeascapes', 'gehealthy', 'msftcitizenship',
       'CiscoEDU', 'DuPont_ability', 'Dell4Good', 'verizongiving',
       'DellEDU', 'SprintGreenNews', 'TeachingMoney', 'WalmartGreen',
       'ecomagination', 'WalmartAction', 'Microsoft_Green',
       'Comcastdreambig', 'DE_Youtility', 'hpglobalcitizen',
       'CitizenDisney', 'AmgenFoundation', 'ATTAspire', 'AlcoaFoundation',
       'nikebetterworld', 'HoneywellBuild', 'googlestudents',
       '3M_FoodSafety', 'HeartRescue', 'EnviroSears', 'ClickToEmpower',
       'GreenIBM', 'PromesaPepsiCo', 'FordDriveGreen', 'JNJStories'], dtype=object)


Note again how we can use len to find out how many accounts are in the array.

In [9]:
len(pd.unique(df.from_user_screen_name.ravel()))
Out[9]:
41
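
As an aside, PANDAS also offers nunique, which returns the same count of distinct values in a single call. Here is a minimal sketch on a toy stand-in for our data (the account names are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the tweet-level dataframe (hypothetical account names)
toy = pd.DataFrame({'from_user_screen_name': ['acctA', 'acctB', 'acctA', 'acctC']})

# nunique() counts distinct values directly -- equivalent to len(pd.unique(...))
print(toy['from_user_screen_name'].nunique())
```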


So we know there are 41 accounts in our dataset. But how many companies are controlling these accounts? To answer this question, we can run unique for the Company variable. The result? There are actually only 34 unique companies represented in the dataset. Several of the companies have multiple CSR-related Twitter accounts. We will find out exactly which ones soon.

In [11]:
len(pd.unique(df.Company.ravel()))
Out[11]:
34

Number of Tweets per Account

Now let's turn to our account-level analyses. To do this, we need to convert our tweet-level dataset -- a dataset where each row is dedicated to one of the 32,330 tweets -- to an account-level dataset. This process is called aggregation and the output will be a new dataframe with 41 rows -- one per Twitter account.

We will perform a simple aggregation here -- we are only interested in a count of how many tweets each account sent over the course of 2013. Our first step is to write a function that spells out which variables (columns) from our dataframe we wish to keep and/or aggregate. Specifically, the following function first produces a variable called Number_of_tweets, a count of the number of tweets sent; we base this on the "content" column, but we could have chosen others. Second, we ask the new dataframe to provide a column called Company. We use the min (minimum) function here, but that is just a convenience: every tweet sent by a given account carries the same company name, so min simply returns that (identical) value, and max would work equally well. We include the Description the company has provided for its Twitter account in the same way.

In [115]:
def f(x):
    return Series(dict(Number_of_tweets = x['content'].count(),
                       Company = x['Company'].min(),
                       Description = x['from_user_description'].min(),
                       ))


In the following block of code we apply the above function to our dataframe. Note the groupby command: this is how we aggregate the data, and it is an invaluable function, so I'll provide more detail. We are asking PANDAS to create a new dataframe, called account_count, built by grouping our original dataframe by the from_user_screen_name column and applying the function f we wrote above. In other words, we take our original dataframe of 32,330 tweets and collapse it according to who sent each tweet, converting the tweet-level dataset with 32,330 rows into an account-level dataset with 41 rows. As you can see in the output, there are 41 observations in this new dataframe -- one per account -- and for each account there is a column showing the name of the Fortune 200 company, the description it created for its Twitter account, and the number of tweets sent over the course of 2013.

In [116]:
account_count = df.groupby('from_user_screen_name').apply(f) 
print len(account_count)
account_count
41
Out[116]:
Company Description Number_of_tweets
from_user_screen_name
3M_FoodSafety 3M 3M Food Safety | global manufacturer of innova... 325
ATTAspire AT&T Inspiring students to achieve their dreams. 336
AlcoaFoundation Alcoa Since 1952, we’ve invested $570 million to imp... 1053
AmgenFoundation Amgen The Amgen Foundation seeks to advance science ... 135
BofA_Community Bank of America Corp. We’re connecting local communities to the peop... 1722
CiscoCSR Cisco Systems Sharing stories about how Cisco and our partne... 2511
CiscoEDU Cisco Systems Tweets on the Cisco Connected Learning Experie... 932
CitizenDisney Walt Disney We believe in the power of stories, families a... 211
ClickToEmpower Allstate Click to Empower is a Web initiative of The Al... 149
Comcastdreambig Comcast Comcast empowers communities across the countr... 581
DE_Youtility Duke Energy This profile will retire soon, so head on over... 300
Dell4Good Dell News from Dell corp. responsibility team: sust... 829
DellEDU Dell Connecting with students, educators, school ad... 1108
DuPont_ability DuPont DuPont sustainability news: clean energy, sola... 1509
EnviroSears Sears Holdings Discussing sustainability topics at Sears Hold... 53
FordDriveGreen Ford Motor Ford is committed to affordable, sustainable p... 27
FundacionPfizer Pfizer Noticias sobre Responsabilidad Social y Fundac... 421
GreenIBM International Business Machines Official Twitter account for IBM Big Green Inn... 81
HeartRescue Medtronic A collaborative effort supported by the Medtro... 322
HoneywellBuild Honeywell International Honeywell Building Solutions installs, integra... 242
IBMSmartCities International Business Machines Official IBM Smarter Cities account. Managed b... 1570
Intelinvolved Intel Connecting & enriching lives worldwide to crea... 1524
JNJStories Johnson & Johnson We're tweeting about social good over on @JNJC... 9
Microsoft_Green Microsoft The official Twitter account for Microsoft's E... 436
PG_CSDW Procter & Gamble News from P&G's Children’s Safe Drinking Water... 187
PPGIdeascapes PPG Industries PPG Ideascapes provides sustainable building p... 160
PromesaPepsiCo PepsiCo PepsiCo's commitment to the Hispanic community... 7
SprintGreenNews Sprint Nextel News and Information about Sprint's sustainabi... 201
TeachingMoney Capital One Financial Teaching Money is a new initiative created to ... 187
WalmartAction Wal-Mart Stores Our Community Action Network works to improve ... 1915
WalmartGreen Wal-Mart Stores Together, we can create a sustainable world & ... 1434
citizenIBM International Business Machines This official Citizen IBM Twitter feed is admi... 1039
ecomagination General Electric A forum for fresh thinking and conversation ab... 594
gehealthy General Electric A shared commitment to creating better health ... 2461
googlestudents Google Google news and updates especially for student... 211
hpglobalcitizen Hewlett-Packard HP Living Progress is focused to create a bett... 470
humanavitality Humana This is the official Twitter account for Human... 762
mathmovesu Raytheon We represent the centerpiece citizenship initi... 1146
msftcitizenship Microsoft Sharing stories about how Microsoft and our pa... 2493
nikebetterworld Nike A better world needs the world’s biggest team.... 153
verizongiving Verizon Communications We are focused on using technology to solve cr... 2524

41 rows × 3 columns
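
The custom function plus apply works well, and the same aggregation can also be written more compactly with groupby's agg method. Here is a minimal sketch on a hypothetical mini-dataframe (the account names, company names, and tweet texts are made up for illustration):

```python
import pandas as pd

# Hypothetical mini version of the tweet-level data
df_small = pd.DataFrame({
    'from_user_screen_name': ['acctA', 'acctA', 'acctB'],
    'Company': ['Alpha Corp', 'Alpha Corp', 'Beta Inc'],
    'from_user_description': ['CSR news', 'CSR news', 'Green updates'],
    'content': ['tweet 1', 'tweet 2', 'tweet 3'],
})

# One aggregation rule per output column -- the same logic as the function f above
agg_rules = {'content': 'count', 'Company': 'min', 'from_user_description': 'min'}
summary = df_small.groupby('from_user_screen_name').agg(agg_rules)
summary = summary.rename(columns={'content': 'Number_of_tweets',
                                  'from_user_description': 'Description'})
print(summary)
```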


Now that we have our account-level dataframe, we can check which companies manage multiple Twitter accounts. Note how we again use the groupby function, this time aggregating on the Company column. Don't worry too much about understanding all of the details of this block of code. For now, just understand that we are creating a series (think of it as a list, not a dataset) holding the number of rows each company has in the account-level dataframe, converting that series to a dataframe, and then dropping those cases that have only one row. You can use this block of code as a template for similar duplicate checks on your own data.

What do we find? Five of the companies have two Twitter accounts, and one (IBM) has three. There are thus 41 separate CSR-related Twitter accounts run by 34 companies.

In [32]:
counts = account_count.groupby('Company').size()  #create a series of number of rows per company
df2 = pd.DataFrame(counts, columns = ['size'])    #convert series to a dataframe
df2 = df2[df2['size']>1]                           #keep only those cases with more than one row in account-level dataframe
df2                                               #show dataframe
Out[32]:
size
Company
Cisco Systems 2
Dell 2
General Electric 2
International Business Machines 3
Microsoft 2
Wal-Mart Stores 2

6 rows × 1 columns
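
An alternative way to run the same duplicate check is value_counts, which tallies and filters in two short lines. A sketch on hypothetical account and company names:

```python
import pandas as pd

# Hypothetical account-level data: one row per account, with its parent Company
accounts = pd.DataFrame({'Company': ['Alpha Corp', 'Alpha Corp', 'Beta Inc', 'Gamma Ltd']},
                        index=['acctA1', 'acctA2', 'acctB', 'acctC'])

# value_counts() tallies rows per company; keep only companies with more than one account
counts = accounts['Company'].value_counts()
multi = counts[counts > 1]
print(multi)
```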

Output account-level data to CSV file

We now have the first meaningful pieces of information for our report. Let's output account_count to a CSV file. This will give us a file with the four columns shown above: from_user_screen_name, Company, Description, and Number_of_tweets. This table will make for a useful appendix to our research paper.

In [33]:
account_count.to_csv('Number of Tweets per Account.csv')

PANDAS can output to a number of different formats. I most commonly use to_csv, to_excel, to_json, and to_pickle.
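
For instance, here is a quick sketch of a few of those export methods on a small hypothetical table (to_excel additionally requires an Excel writer library such as openpyxl, so it is omitted here; the file names are arbitrary):

```python
import pandas as pd

# Small hypothetical table to demonstrate the export methods
table = pd.DataFrame({'Number_of_tweets': [325, 336]},
                     index=pd.Index(['3M_FoodSafety', 'ATTAspire'],
                                    name='from_user_screen_name'))

table.to_csv('example.csv')      # plain-text; opens in Excel or any editor
table.to_json('example.json')    # handy for web applications
table.to_pickle('example.pkl')   # binary format for fast reuse in PANDAS

# Round-trip check: the pickle preserves the dataframe exactly
restored = pd.read_pickle('example.pkl')
print(restored.equals(table))
```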

Graphing our account-level data

We will now do a simple bar graph of the number of tweets sent by each account. First, we will sort the dataframe.

To access the help for any function we can use "?"

In [64]:
DataFrame.sort?

Let's sort the data in descending order by number of tweets. (Note: in newer versions of PANDAS, sort has been replaced by sort_values.)

In [117]:
account_count = account_count.sort(['Number_of_tweets'], ascending=False)
account_count
Out[117]:
Company Description Number_of_tweets
from_user_screen_name
verizongiving Verizon Communications We are focused on using technology to solve cr... 2524
CiscoCSR Cisco Systems Sharing stories about how Cisco and our partne... 2511
msftcitizenship Microsoft Sharing stories about how Microsoft and our pa... 2493
gehealthy General Electric A shared commitment to creating better health ... 2461
WalmartAction Wal-Mart Stores Our Community Action Network works to improve ... 1915
BofA_Community Bank of America Corp. We’re connecting local communities to the peop... 1722
IBMSmartCities International Business Machines Official IBM Smarter Cities account. Managed b... 1570
Intelinvolved Intel Connecting & enriching lives worldwide to crea... 1524
DuPont_ability DuPont DuPont sustainability news: clean energy, sola... 1509
WalmartGreen Wal-Mart Stores Together, we can create a sustainable world & ... 1434
mathmovesu Raytheon We represent the centerpiece citizenship initi... 1146
DellEDU Dell Connecting with students, educators, school ad... 1108
AlcoaFoundation Alcoa Since 1952, we’ve invested $570 million to imp... 1053
citizenIBM International Business Machines This official Citizen IBM Twitter feed is admi... 1039
CiscoEDU Cisco Systems Tweets on the Cisco Connected Learning Experie... 932
Dell4Good Dell News from Dell corp. responsibility team: sust... 829
humanavitality Humana This is the official Twitter account for Human... 762
ecomagination General Electric A forum for fresh thinking and conversation ab... 594
Comcastdreambig Comcast Comcast empowers communities across the countr... 581
hpglobalcitizen Hewlett-Packard HP Living Progress is focused to create a bett... 470
Microsoft_Green Microsoft The official Twitter account for Microsoft's E... 436
FundacionPfizer Pfizer Noticias sobre Responsabilidad Social y Fundac... 421
ATTAspire AT&T Inspiring students to achieve their dreams. 336
3M_FoodSafety 3M 3M Food Safety | global manufacturer of innova... 325
HeartRescue Medtronic A collaborative effort supported by the Medtro... 322
DE_Youtility Duke Energy This profile will retire soon, so head on over... 300
HoneywellBuild Honeywell International Honeywell Building Solutions installs, integra... 242
googlestudents Google Google news and updates especially for student... 211
CitizenDisney Walt Disney We believe in the power of stories, families a... 211
SprintGreenNews Sprint Nextel News and Information about Sprint's sustainabi... 201
PG_CSDW Procter & Gamble News from P&G's Children’s Safe Drinking Water... 187
TeachingMoney Capital One Financial Teaching Money is a new initiative created to ... 187
PPGIdeascapes PPG Industries PPG Ideascapes provides sustainable building p... 160
nikebetterworld Nike A better world needs the world’s biggest team.... 153
ClickToEmpower Allstate Click to Empower is a Web initiative of The Al... 149
AmgenFoundation Amgen The Amgen Foundation seeks to advance science ... 135
GreenIBM International Business Machines Official Twitter account for IBM Big Green Inn... 81
EnviroSears Sears Holdings Discussing sustainability topics at Sears Hold... 53
FordDriveGreen Ford Motor Ford is committed to affordable, sustainable p... 27
JNJStories Johnson & Johnson We're tweeting about social good over on @JNJC... 9
PromesaPepsiCo PepsiCo PepsiCo's commitment to the Hispanic community... 7

41 rows × 3 columns


We'll also make the default plot larger.

In [156]:
plt.rcParams['figure.figsize'] = (20, 10)


PANDAS can produce plots quite easily, as you can see in the following example. We're asking PANDAS to plot a bar graph based on the Number_of_tweets column in our account-level dataframe.

In [160]:
account_count['Number_of_tweets'].plot(kind='bar')
Out[160]:
<matplotlib.axes.AxesSubplot at 0x12ad45950>


We can also try a horizontal bar graph

In [161]:
account_count['Number_of_tweets'].plot(kind='barh')
Out[161]:
<matplotlib.axes.AxesSubplot at 0x11efb5110>


Typically the above default graphs have a few things we'd like to tweak. Learning the ins and outs of all the possible modifications takes time, so don't worry about learning them all now. Instead, I'd recommend using the following examples as a template for your own data and then learning new options as you need them.

In the code block below I modify the transparency of the graph (using "alpha"), change the font size and rotation of the x-axis tick labels, add/change the y-axis and x-axis headings, make them bold, and add some extra spacing. I then save the output as a .png file that you can insert into your Word or LaTeX file. I have left the title out of these figures -- I recommend adding it later in Word/LaTeX.

In [162]:
account_plot = account_count['Number_of_tweets'].plot(kind='bar', alpha=0.75)
xticks(fontsize = 10,rotation=60, ha ="right")
account_plot.set_xlabel('Twitter Account Name', weight='bold', labelpad=30)   
account_plot.set_ylabel('# Tweets', weight='bold', labelpad=25) #ADD PADDING TO RIGHT OF Y-AXIS LABEL
savefig('account counts - bar graph.png', bbox_inches='tight', dpi=300, format='png')


And here's a modified version of the horizontal bar graph.

In [158]:
account_plot = account_count['Number_of_tweets'].plot(kind='barh', alpha=0.75)
xticks(fontsize = 12) #,rotation=60, ha ="right")
account_plot.set_ylabel('Twitter Account Name', weight='bold', labelpad=20)   
account_plot.set_xlabel('# Tweets', weight='bold', labelpad=25) #ADD PADDING TO RIGHT OF Y-AXIS LABEL
savefig('account counts - horizontal bar graph.png', bbox_inches='tight', dpi=300, format='png')


OK, we have covered a few important steps in this tutorial. We created an account-level dataset and used it to output a CSV-formatted table showing the number of tweets sent by each of the 41 Twitter accounts; we will likely use this table as an appendix in our research report. We also generated bar graph representations of these data, which we could include as well. Finally, we know which of the Fortune 200 companies are managing multiple CSR-related Twitter accounts. I hope you've found this tutorial helpful. In the next tutorial we will cover how to aggregate the data by time.


For more Notebooks as well as additional Python and Big Data tutorials, please visit http://social-metrics.org or follow me on Twitter @gregorysaxton