This is the second in a series of notebooks designed to show you how to analyze social media data. For demonstration purposes we are looking at tweets sent by the CSR-related Twitter accounts -- accounts related to ethics, equality, the environment, etc. -- of Fortune 200 firms in 2013. We assume you have already downloaded the data and completed the steps in Chapter 1. In this second notebook I will show you how to conduct various account-level (organization-level) analyses of the Twitter data. Essentially, we will take the tweet-level data and aggregate them to the account level.
First, we will import several necessary Python packages and set some options for viewing the data. As with Chapter 1, we will be using the Python Data Analysis Library, or PANDAS, extensively for our data manipulations.
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series
#Set PANDAS to show all columns in DataFrame
pd.set_option('display.max_columns', None)
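As a quick sanity check, you can confirm the display option took effect with pd.get_option. This is a minimal sketch; None simply means "no limit on the number of columns shown":

```python
import pandas as pd

# Show every column when displaying a DataFrame (None = no limit)
pd.set_option('display.max_columns', None)

# Confirm the option took effect
print(pd.get_option('display.max_columns'))
```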
We'll be producing some figures at the end of this tutorial so we need to import various graphing capabilities. The default Matplotlib library is solid.
import matplotlib
import matplotlib.pyplot as plt
print matplotlib.__version__
1.3.1
#NECESSARY FOR XTICKS OPTION, ETC.
from pylab import *
One of the great innovations of the IPython Notebook is the ability to see output and graphics "inline," that is, on the same page and immediately below each line of code. To enable this feature for graphics we run the following line.
%matplotlib inline
We will be using Seaborn to help pretty up the default Matplotlib graphics. Seaborn does not come installed with Anaconda Python, so you will have to open a terminal and run pip install seaborn.
import seaborn as sns
print sns.__version__
0.5.1
In Chapter 1 we deleted tweets from one unneeded Twitter account and also omitted several unnecessary columns (variables). We then saved, or "pickled," the updated dataframe. Let's now open this saved file. As we can see in the operations below this dataframe contains 54 variables for 32,330 tweets.
df = pd.read_pickle('CSR tweets - 2013 by 41 accounts.pkl')
print len(df)
df.head(2)
32330
rowid | query | tweet_id_str | inserted_date | language | coordinates | retweeted_status | created_at | month | year | content | from_user_screen_name | from_user_id | from_user_followers_count | from_user_friends_count | from_user_listed_count | from_user_favourites_count | from_user_statuses_count | from_user_description | from_user_location | from_user_created_at | retweet_count | favorite_count | entities_urls | entities_urls_count | entities_hashtags | entities_hashtags_count | entities_mentions | entities_mentions_count | in_reply_to_screen_name | in_reply_to_status_id | source | entities_expanded_urls | entities_media_count | media_expanded_url | media_url | media_type | video_link | photo_link | twitpic | num_characters | num_words | retweeted_user | retweeted_user_description | retweeted_user_screen_name | retweeted_user_followers_count | retweeted_user_listed_count | retweeted_user_statuses_count | retweeted_user_location | retweeted_tweet_created_at | Fortune_2012_rank | Company | CSR_sustainability | specific_project_initiative_area | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 67340 | humanavitality | 306897327585652736 | 2014-03-09 13:46:50.222857 | en | NaN | NaN | 2013-02-27 22:43:19.000000 | 2 | 2013 | @louloushive (Tweet 2) We encourage other empl... | humanavitality | 274041023 | 2859 | 440 | 38 | 25 | 1766 | This is the official Twitter account for Human... | NaN | Tue Mar 29 16:23:02 +0000 2011 | 0 | 0 | NaN | 0 | NaN | 0 | louloushive | 1 | louloushive | 306218267737989120.00 | web | NaN | nan | NaN | NaN | NaN | 0 | 0 | 0 | 121 | 19 | nan | NaN | NaN | nan | nan | nan | NaN | NaN | 79 | Humana | 0 | 1 |
1 | 39454 | FundacionPfizer | 308616393706844160 | 2014-03-09 13:38:20.679967 | es | NaN | NaN | 2013-03-04 16:34:17.000000 | 3 | 2013 | ¿Sabes por qué la #vacuna contra la #neumonía ... | FundacionPfizer | 188384056 | 2464 | 597 | 50 | 11 | 2400 | Noticias sobre Responsabilidad Social y Fundac... | México | Wed Sep 08 16:14:11 +0000 2010 | 1 | 0 | NaN | 0 | vacuna, neumonía | 2 | NaN | 0 | NaN | nan | web | NaN | nan | NaN | NaN | NaN | 0 | 0 | 0 | 138 | 20 | nan | NaN | NaN | nan | nan | nan | NaN | NaN | 40 | Pfizer | 0 | 1 |
2 rows × 54 columns
List all the columns in the DataFrame
df.columns
Index([u'rowid', u'query', u'tweet_id_str', u'inserted_date', u'language', u'coordinates', u'retweeted_status', u'created_at', u'month', u'year', u'content', u'from_user_screen_name', u'from_user_id', u'from_user_followers_count', u'from_user_friends_count', u'from_user_listed_count', u'from_user_favourites_count', u'from_user_statuses_count', u'from_user_description', u'from_user_location', u'from_user_created_at', u'retweet_count', u'favorite_count', u'entities_urls', u'entities_urls_count', u'entities_hashtags', u'entities_hashtags_count', u'entities_mentions', u'entities_mentions_count', u'in_reply_to_screen_name', u'in_reply_to_status_id', u'source', u'entities_expanded_urls', u'entities_media_count', u'media_expanded_url', u'media_url', u'media_type', u'video_link', u'photo_link', u'twitpic', u'num_characters', u'num_words', u'retweeted_user', u'retweeted_user_description', u'retweeted_user_screen_name', u'retweeted_user_followers_count', u'retweeted_user_listed_count', u'retweeted_user_statuses_count', u'retweeted_user_location', u'retweeted_tweet_created_at', u'Fortune_2012_rank', u'Company', u'CSR_sustainability', u'specific_project_initiative_area'], dtype='object')
We can use the len function again here to see how many columns there are in the dataframe: 54.
len(df.columns)
54
In Chapter 1 we used the unique function to first list all of the unique Twitter accounts represented in our collection of tweets and then provide a count of those accounts. There are 41 separate accounts.
pd.unique(df.from_user_screen_name.ravel())
array(['humanavitality', 'FundacionPfizer', 'Intelinvolved', 'CiscoCSR', 'IBMSmartCities', 'citizenIBM', 'BofA_Community', 'PG_CSDW', 'mathmovesu', 'PPGIdeascapes', 'gehealthy', 'msftcitizenship', 'CiscoEDU', 'DuPont_ability', 'Dell4Good', 'verizongiving', 'DellEDU', 'SprintGreenNews', 'TeachingMoney', 'WalmartGreen', 'ecomagination', 'WalmartAction', 'Microsoft_Green', 'Comcastdreambig', 'DE_Youtility', 'hpglobalcitizen', 'CitizenDisney', 'AmgenFoundation', 'ATTAspire', 'AlcoaFoundation', 'nikebetterworld', 'HoneywellBuild', 'googlestudents', '3M_FoodSafety', 'HeartRescue', 'EnviroSears', 'ClickToEmpower', 'GreenIBM', 'PromesaPepsiCo', 'FordDriveGreen', 'JNJStories'], dtype=object)
Note again how we can use len to find out how many accounts are in the array.
len(pd.unique(df.from_user_screen_name.ravel()))
41
So we know there are 41 accounts in our dataset. But how many companies are controlling these accounts? To answer this question, we can run unique for the Company variable. The result? There are actually only 34 unique companies represented in the dataset. Several of the companies have multiple CSR-related Twitter accounts. We will find out exactly which ones soon.
len(pd.unique(df.Company.ravel()))
34
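To see how these counting operations behave on a small scale, here is a toy example with made-up screen names and companies (not from our dataset). Series.nunique is a convenient shortcut for len(pd.unique(...)):

```python
import pandas as pd

# Toy tweet-level data: four tweets from three accounts owned by two companies
toy = pd.DataFrame({
    'from_user_screen_name': ['acctA', 'acctB', 'acctA', 'acctC'],
    'Company': ['Alpha', 'Beta', 'Alpha', 'Alpha'],
})

print(len(toy))                                # number of tweets (rows)
print(toy['from_user_screen_name'].nunique())  # number of unique accounts
print(toy['Company'].nunique())                # number of unique companies
```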
Now let's turn to our account-level analyses. To do this, we need to convert our tweet-level dataset -- a dataset where each row is dedicated to one of the 32,330 tweets -- to an account-level dataset. This process is called aggregation and the output will be a new dataframe with 41 rows -- one per Twitter account.
We will perform a simple aggregation here -- we are only interested in a count of how many tweets each account sent over the course of 2013. Our first step is to write a function that spells out which variables (columns) from our dataframe we wish to keep and how to aggregate them. Specifically, the following function first produces a variable called Number_of_tweets, a count of the number of tweets sent; we base this on the "content" column, but we could have chosen others. Second, we ask the new dataframe to provide a column called Company. We use the min (minimum) function here, but that is just a convenience: every tweet sent by a given account carries the same company name, so taking the "minimum" simply returns that one name; we could just as easily have returned the "maximum" value for Company. Similarly, we will include the Description the company has provided for its Twitter account.
def f(x):
return Series(dict(Number_of_tweets = x['content'].count(),
Company=x['Company'].min(),
Description=x['from_user_description'].min(),
))
In the following block of code we will now apply the above function to our dataframe. Note the groupby command. This is how we aggregate the data. It is an invaluable function, so I'll provide more details. We are asking PANDAS to create a new dataframe, called account_count, and asking that this new dataframe be based on aggregating our original dataframe by the from_user_screen_name column, applying the function f we wrote above. In other words, we are going to our original dataframe of 32,330 tweets, and aggregating or collapsing that data based on who sent the tweets. We are thus converting our tweet-level dataset with 32,330 rows into an account-level dataset with 41 rows. As you can see in the output, there are 41 observations in this new dataframe -- one per account -- and for each account there is a column showing the name of the Fortune 200 company, the description it created for its Twitter account, and the number of tweets sent over the course of 2013.
account_count = df.groupby('from_user_screen_name').apply(f)
print len(account_count)
account_count
41
Company | Description | Number_of_tweets | |
---|---|---|---|
from_user_screen_name | |||
3M_FoodSafety | 3M | 3M Food Safety | global manufacturer of innova... | 325 |
ATTAspire | AT&T | Inspiring students to achieve their dreams. | 336 |
AlcoaFoundation | Alcoa | Since 1952, we’ve invested $570 million to imp... | 1053 |
AmgenFoundation | Amgen | The Amgen Foundation seeks to advance science ... | 135 |
BofA_Community | Bank of America Corp. | We’re connecting local communities to the peop... | 1722 |
CiscoCSR | Cisco Systems | Sharing stories about how Cisco and our partne... | 2511 |
CiscoEDU | Cisco Systems | Tweets on the Cisco Connected Learning Experie... | 932 |
CitizenDisney | Walt Disney | We believe in the power of stories, families a... | 211 |
ClickToEmpower | Allstate | Click to Empower is a Web initiative of The Al... | 149 |
Comcastdreambig | Comcast | Comcast empowers communities across the countr... | 581 |
DE_Youtility | Duke Energy | This profile will retire soon, so head on over... | 300 |
Dell4Good | Dell | News from Dell corp. responsibility team: sust... | 829 |
DellEDU | Dell | Connecting with students, educators, school ad... | 1108 |
DuPont_ability | DuPont | DuPont sustainability news: clean energy, sola... | 1509 |
EnviroSears | Sears Holdings | Discussing sustainability topics at Sears Hold... | 53 |
FordDriveGreen | Ford Motor | Ford is committed to affordable, sustainable p... | 27 |
FundacionPfizer | Pfizer | Noticias sobre Responsabilidad Social y Fundac... | 421 |
GreenIBM | International Business Machines | Official Twitter account for IBM Big Green Inn... | 81 |
HeartRescue | Medtronic | A collaborative effort supported by the Medtro... | 322 |
HoneywellBuild | Honeywell International | Honeywell Building Solutions installs, integra... | 242 |
IBMSmartCities | International Business Machines | Official IBM Smarter Cities account. Managed b... | 1570 |
Intelinvolved | Intel | Connecting & enriching lives worldwide to crea... | 1524 |
JNJStories | Johnson & Johnson | We're tweeting about social good over on @JNJC... | 9 |
Microsoft_Green | Microsoft | The official Twitter account for Microsoft's E... | 436 |
PG_CSDW | Procter & Gamble | News from P&G's Children’s Safe Drinking Water... | 187 |
PPGIdeascapes | PPG Industries | PPG Ideascapes provides sustainable building p... | 160 |
PromesaPepsiCo | PepsiCo | PepsiCo's commitment to the Hispanic community... | 7 |
SprintGreenNews | Sprint Nextel | News and Information about Sprint's sustainabi... | 201 |
TeachingMoney | Capital One Financial | Teaching Money is a new initiative created to ... | 187 |
WalmartAction | Wal-Mart Stores | Our Community Action Network works to improve ... | 1915 |
WalmartGreen | Wal-Mart Stores | Together, we can create a sustainable world & ... | 1434 |
citizenIBM | International Business Machines | This official Citizen IBM Twitter feed is admi... | 1039 |
ecomagination | General Electric | A forum for fresh thinking and conversation ab... | 594 |
gehealthy | General Electric | A shared commitment to creating better health ... | 2461 |
googlestudents | Google news and updates especially for student... | 211 | |
hpglobalcitizen | Hewlett-Packard | HP Living Progress is focused to create a bett... | 470 |
humanavitality | Humana | This is the official Twitter account for Human... | 762 |
mathmovesu | Raytheon | We represent the centerpiece citizenship initi... | 1146 |
msftcitizenship | Microsoft | Sharing stories about how Microsoft and our pa... | 2493 |
nikebetterworld | Nike | A better world needs the world’s biggest team.... | 153 |
verizongiving | Verizon Communications | We are focused on using technology to solve cr... | 2524 |
41 rows × 3 columns
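The same groupby-and-apply pattern can be seen end to end on a toy dataframe (made-up accounts and tweets, for illustration only). Three tweet-level rows collapse into two account-level rows:

```python
import pandas as pd
from pandas import Series

# Same aggregation function as in the tutorial, minus the Description column
def f(x):
    return Series(dict(Number_of_tweets=x['content'].count(),
                       Company=x['Company'].min()))

# Toy tweet-level data: two tweets from acctA, one from acctB
toy = pd.DataFrame({
    'from_user_screen_name': ['acctA', 'acctA', 'acctB'],
    'Company': ['Alpha', 'Alpha', 'Beta'],
    'content': ['tweet 1', 'tweet 2', 'tweet 3'],
})

# Collapse tweet-level rows into one row per account
agg = toy.groupby('from_user_screen_name').apply(f)
print(agg)
```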
Now that we have our account-level dataframe, we can check which companies are managing multiple Twitter accounts. Note how we again use the groupby function; this time we aggregate by the Company column. Don't worry too much about understanding every detail of this block of code. For now, just note that we are creating a series (think of it as a list, not a dataset) holding the number of rows each company has in the account-level dataframe, converting that series to a dataframe, and then dropping the cases with only one row. You can use this block of code as a template for similar "duplicate checks" you wish to perform on your own data.
What do we find? Five of the companies run two Twitter accounts each, and one (IBM) runs three. There are thus 41 separate CSR-related Twitter accounts run by 34 companies.
counts = account_count.groupby('Company').size() #create a series of number of rows per company
df2 = pd.DataFrame(counts, columns = ['size']) #convert series to a dataframe
df2 = df2[df2['size']>1] #keep only those cases with more than one row in account-level dataframe
df2 #show dataframe
size | |
---|---|
Company | |
Cisco Systems | 2 |
Dell | 2 |
General Electric | 2 |
International Business Machines | 3 |
Microsoft | 2 |
Wal-Mart Stores | 2 |
6 rows × 1 columns
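Here is the same duplicate-check pattern on a toy account-level frame (made-up account and company names), stripped to its essentials -- count rows per company, then keep only the companies appearing more than once:

```python
import pandas as pd

# Toy account-level frame: one row per account, indexed by screen name
accounts = pd.DataFrame({'Company': ['Alpha', 'Alpha', 'Beta', 'Gamma']},
                        index=['a1', 'a2', 'b1', 'g1'])

counts = accounts.groupby('Company').size()  # accounts per company
dupes = counts[counts > 1]                   # companies with more than one account
print(dupes)
```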
We now have the first meaningful pieces of information for our report. Let's output account_count to a CSV file. This will give us a file with the four columns shown above: from_user_screen_name, Company, Description, and Number_of_tweets. This table will make for a useful appendix to our research paper.
account_count.to_csv('Number of Tweets per Account.csv')
PANDAS can output to a number of different formats. I most commonly use to_csv, to_excel, to_json, and to_pickle.
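A handy trick for quick inspection: calling to_csv with no filename returns the CSV text as a string instead of writing a file (the two-row dataframe below is made up for illustration):

```python
import pandas as pd

df_small = pd.DataFrame({'Company': ['Alpha', 'Beta'],
                         'Number_of_tweets': [10, 5]})

# With no path argument, to_csv returns the CSV text as a string
csv_text = df_small.to_csv(index=False)
print(csv_text)
```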
We will now do a simple bar graph of the number of tweets sent by each account. First, we will sort the dataframe.
To access the help for any function we can use "?"
DataFrame.sort?
Let's sort the data in descending order by number of tweets
account_count = account_count.sort(['Number_of_tweets'], ascending=False)
account_count
Company | Description | Number_of_tweets | |
---|---|---|---|
from_user_screen_name | |||
verizongiving | Verizon Communications | We are focused on using technology to solve cr... | 2524 |
CiscoCSR | Cisco Systems | Sharing stories about how Cisco and our partne... | 2511 |
msftcitizenship | Microsoft | Sharing stories about how Microsoft and our pa... | 2493 |
gehealthy | General Electric | A shared commitment to creating better health ... | 2461 |
WalmartAction | Wal-Mart Stores | Our Community Action Network works to improve ... | 1915 |
BofA_Community | Bank of America Corp. | We’re connecting local communities to the peop... | 1722 |
IBMSmartCities | International Business Machines | Official IBM Smarter Cities account. Managed b... | 1570 |
Intelinvolved | Intel | Connecting & enriching lives worldwide to crea... | 1524 |
DuPont_ability | DuPont | DuPont sustainability news: clean energy, sola... | 1509 |
WalmartGreen | Wal-Mart Stores | Together, we can create a sustainable world & ... | 1434 |
mathmovesu | Raytheon | We represent the centerpiece citizenship initi... | 1146 |
DellEDU | Dell | Connecting with students, educators, school ad... | 1108 |
AlcoaFoundation | Alcoa | Since 1952, we’ve invested $570 million to imp... | 1053 |
citizenIBM | International Business Machines | This official Citizen IBM Twitter feed is admi... | 1039 |
CiscoEDU | Cisco Systems | Tweets on the Cisco Connected Learning Experie... | 932 |
Dell4Good | Dell | News from Dell corp. responsibility team: sust... | 829 |
humanavitality | Humana | This is the official Twitter account for Human... | 762 |
ecomagination | General Electric | A forum for fresh thinking and conversation ab... | 594 |
Comcastdreambig | Comcast | Comcast empowers communities across the countr... | 581 |
hpglobalcitizen | Hewlett-Packard | HP Living Progress is focused to create a bett... | 470 |
Microsoft_Green | Microsoft | The official Twitter account for Microsoft's E... | 436 |
FundacionPfizer | Pfizer | Noticias sobre Responsabilidad Social y Fundac... | 421 |
ATTAspire | AT&T | Inspiring students to achieve their dreams. | 336 |
3M_FoodSafety | 3M | 3M Food Safety | global manufacturer of innova... | 325 |
HeartRescue | Medtronic | A collaborative effort supported by the Medtro... | 322 |
DE_Youtility | Duke Energy | This profile will retire soon, so head on over... | 300 |
HoneywellBuild | Honeywell International | Honeywell Building Solutions installs, integra... | 242 |
googlestudents | Google news and updates especially for student... | 211 | |
CitizenDisney | Walt Disney | We believe in the power of stories, families a... | 211 |
SprintGreenNews | Sprint Nextel | News and Information about Sprint's sustainabi... | 201 |
PG_CSDW | Procter & Gamble | News from P&G's Children’s Safe Drinking Water... | 187 |
TeachingMoney | Capital One Financial | Teaching Money is a new initiative created to ... | 187 |
PPGIdeascapes | PPG Industries | PPG Ideascapes provides sustainable building p... | 160 |
nikebetterworld | Nike | A better world needs the world’s biggest team.... | 153 |
ClickToEmpower | Allstate | Click to Empower is a Web initiative of The Al... | 149 |
AmgenFoundation | Amgen | The Amgen Foundation seeks to advance science ... | 135 |
GreenIBM | International Business Machines | Official Twitter account for IBM Big Green Inn... | 81 |
EnviroSears | Sears Holdings | Discussing sustainability topics at Sears Hold... | 53 |
FordDriveGreen | Ford Motor | Ford is committed to affordable, sustainable p... | 27 |
JNJStories | Johnson & Johnson | We're tweeting about social good over on @JNJC... | 9 |
PromesaPepsiCo | PepsiCo | PepsiCo's commitment to the Hispanic community... | 7 |
41 rows × 3 columns
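Note that in later pandas releases DataFrame.sort was deprecated and eventually removed in favor of sort_values. If the sort call above fails for you, the modern equivalent looks like this (toy data, using three counts from the table above):

```python
import pandas as pd

toy = pd.DataFrame({'Number_of_tweets': [27, 2524, 421]},
                   index=['FordDriveGreen', 'verizongiving', 'FundacionPfizer'])

# sort_values replaces the older DataFrame.sort method
toy_sorted = toy.sort_values('Number_of_tweets', ascending=False)
print(toy_sorted.index.tolist())
```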
We'll also make the default plot larger.
plt.rcParams['figure.figsize'] = (20, 10)
PANDAS can produce plots quite easily, as you can see in the following example. We're asking PANDAS to plot a bar graph based on the Number_of_tweets column in our account-level dataframe.
account_count['Number_of_tweets'].plot(kind='bar')
<matplotlib.axes.AxesSubplot at 0x12ad45950>
We can also try a horizontal bar graph
account_count['Number_of_tweets'].plot(kind='barh')
<matplotlib.axes.AxesSubplot at 0x11efb5110>
Typically the above default graphs have a few things we'd like to tweak. Learning the ins and outs of all the possible modifications takes time, so don't worry about learning them all now. Instead, I'd recommend using the following examples as a template for your own data and then learning new options as you need them.
In the code block below I modify the transparency of the graph (using "alpha"), change the font size and rotation of the x-axis tick labels, add/change the y-axis and x-axis headings, make them bold, and add some extra spacing. I then save the output as a .png file that you can insert into your Word or LaTeX file. I have left the title out of these figures -- I recommend adding titles later in Word/LaTeX.
account_plot = account_count['Number_of_tweets'].plot(kind='bar', alpha=0.75)
xticks(fontsize = 10,rotation=60, ha ="right")
account_plot.set_xlabel('Twitter Account Name', weight='bold', labelpad=30)
account_plot.set_ylabel('# Tweets', weight='bold', labelpad=25) #ADD PADDING BETWEEN Y-AXIS AND ITS LABEL
savefig('account counts - bar graph.png', bbox_inches='tight', dpi=300, format='png')
And here's a modified version of the horizontal bar graph.
account_plot = account_count['Number_of_tweets'].plot(kind='barh', alpha=0.75)
xticks(fontsize=12)
account_plot.set_ylabel('Twitter Account Name', weight='bold', labelpad=20)
account_plot.set_xlabel('# Tweets', weight='bold', labelpad=25) #ADD PADDING BETWEEN X-AXIS AND ITS LABEL
savefig('account counts - horizontal bar graph.png', bbox_inches='tight', dpi=300, format='png')
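If you want to run this plotting-and-saving routine outside a notebook, here is a minimal, self-contained sketch using toy data (three made-up counts); the Agg backend lets it render without a display attached:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend: render with no display attached
import matplotlib.pyplot as plt
import pandas as pd

# Toy counts standing in for the real account-level dataframe
counts = pd.Series([2524, 1915, 421],
                   index=['verizongiving', 'WalmartAction', 'FundacionPfizer'])

ax = counts.plot(kind='bar', alpha=0.75)
plt.xticks(fontsize=10, rotation=60, ha='right')
ax.set_xlabel('Twitter Account Name', weight='bold', labelpad=30)
ax.set_ylabel('# Tweets', weight='bold', labelpad=25)
plt.savefig('toy account counts.png', bbox_inches='tight', dpi=100)
plt.close()
```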
OK, we have covered a few important steps in this tutorial. We now have an account-level dataset, and used it to output a CSV-formatted table showing the number of tweets sent by each of the 41 Twitter accounts. We'll likely use this table as an appendix in our research report. We have also generated a bar graph representation of these data, which we could also include. Finally, we know which of the Fortune 200 companies are managing multiple CSR-related Twitter accounts. I hope you've found this tutorial helpful. In the next tutorial we will cover how to aggregate the data by time.
For more Notebooks as well as additional Python and Big Data tutorials, please visit http://social-metrics.org or follow me on Twitter @gregorysaxton