This is the first in a series of notebooks designed to show you how to analyze social media data. We assume you have already downloaded the data and are now ready to begin examining it. In this first notebook I will show you how to set up your ipython working environment and import the Twitter data we have downloaded.

Chapter 1: Set up iPython, Import Twitter Data and Select Cases


First, we will import several necessary Python packages. We will be using the Python Data Analysis Library, or PANDAS, extensively for our data manipulations. It is invaluable for analyzing datasets.

Import packages

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series


PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include this line at the top of all my notebooks.

In [2]:
#Set PANDAS to show all columns in DataFrame
pd.set_option('display.max_columns', None)


We can check which version of various packages we're using. You can see I'm running PANDAS 0.13 here.

In [86]:
print pd.__version__
0.13.1

Read in data

PANDAS can read in data from a variety of different data types. If you've followed some of my earlier tutorials you have downloaded tweets into an SQLite database, then converted to a CSV file. That's what we have here. We have a set of tweets by Fortune 200 firms. So, in the following three lines we'll first import the CSV file and assign it to the name 'df' -- short for 'dataframe', the PANDAS name for a dataset. Second, we'll use the len function to see how many rows (tweets) there are in the dataset; there are 34,097 tweets in total. Finally, we will use the head function to show the first two rows of the dataset.

In [67]:
df = pd.read_csv('CSR_user_timeline_2013.csv', sep=',', low_memory=False)
print len(df)
df.head(2)
34097
Out[67]:
rowid query tweet_id tweet_id_str inserted_date truncated language possibly_sensitive coordinates retweeted_status withheld_in_countries withheld_scope created_at_text created_at month year content from_user_screen_name from_user_id from_user_followers_count from_user_friends_count from_user_listed_count from_user_favourites_count from_user_statuses_count from_user_description from_user_location from_user_created_at retweet_count favorite_count entities_urls entities_urls_count entities_hashtags entities_hashtags_count entities_mentions entities_mentions_count in_reply_to_screen_name in_reply_to_status_id source entities_expanded_urls entities_media_count media_expanded_url media_url media_type video_link photo_link twitpic num_characters num_words retweeted_user retweeted_user_description retweeted_user_screen_name retweeted_user_followers_count retweeted_user_listed_count retweeted_user_statuses_count retweeted_user_location retweeted_tweet_created_at Fortune_2012_rank Company CSR_sustainability specific_project_initiative_area
0 67340 humanavitality 306897327585652736 306897327585652736 2014-03-09 13:46:50.222857 0 en NaN NaN NaN NaN NaN Wed Feb 27 22:43:19 +0000 2013 2013-02-27 22:43:19.000000 2 2013 @louloushive (Tweet 2) We encourage other empl... humanavitality 274041023 2859 440 38 25 1766 This is the official Twitter account for Human... NaN Tue Mar 29 16:23:02 +0000 2011 0 0 NaN 0 NaN 0 louloushive 1 louloushive 3.062183e+17 web NaN NaN NaN NaN NaN 0 0 0 121 19 NaN NaN NaN NaN NaN NaN NaN NaN 79 Humana 0 1
1 39454 FundacionPfizer 308616393706844160 308616393706844160 2014-03-09 13:38:20.679967 0 es NaN NaN NaN NaN NaN Mon Mar 04 16:34:17 +0000 2013 2013-03-04 16:34:17.000000 3 2013 ¿Sabes por qué la #vacuna contra la #neumonía ... FundacionPfizer 188384056 2464 597 50 11 2400 Noticias sobre Responsabilidad Social y Fundac... México Wed Sep 08 16:14:11 +0000 2010 1 0 NaN 0 vacuna, neumonía 2 NaN 0 NaN NaN web NaN NaN NaN NaN NaN 0 0 0 138 20 NaN NaN NaN NaN NaN NaN NaN NaN 40 Pfizer 0 1

2 rows × 60 columns


List all the columns in the DataFrame

In [68]:
df.columns
Out[68]:
Index([u'rowid', u'query', u'tweet_id', u'tweet_id_str', u'inserted_date', u'truncated', u'language', u'possibly_sensitive', u'coordinates', u'retweeted_status', u'withheld_in_countries', u'withheld_scope', u'created_at_text', u'created_at', u'month', u'year', u'content', u'from_user_screen_name', u'from_user_id', u'from_user_followers_count', u'from_user_friends_count', u'from_user_listed_count', u'from_user_favourites_count', u'from_user_statuses_count', u'from_user_description', u'from_user_location', u'from_user_created_at', u'retweet_count', u'favorite_count', u'entities_urls', u'entities_urls_count', u'entities_hashtags', u'entities_hashtags_count', u'entities_mentions', u'entities_mentions_count', u'in_reply_to_screen_name', u'in_reply_to_status_id', u'source', u'entities_expanded_urls', u'entities_media_count', u'media_expanded_url', u'media_url', u'media_type', u'video_link', u'photo_link', u'twitpic', u'num_characters', u'num_words', u'retweeted_user', u'retweeted_user_description', u'retweeted_user_screen_name', u'retweeted_user_followers_count', u'retweeted_user_listed_count', u'retweeted_user_statuses_count', u'retweeted_user_location', u'retweeted_tweet_created_at', u'Fortune_2012_rank', u'Company', u'CSR_sustainability', u'specific_project_initiative_area'], dtype='object')


We can use the len function again here to see how many columns there are in the dataframe: 60.

In [69]:
len(df.columns)
Out[69]:
60


We should also inspect the format for our columns. We can see that some are integers, some are 'float' (can have a decimal), and some are 'objects' (text). If you have a identifying text variable that has accidentally been imported as a float, for instance, that could cause problems down the road, so you should fix it before continuing.

In [85]:
df.dtypes
Out[85]:
rowid                                 int64
query                                object
tweet_id_str                          int64
inserted_date                        object
language                             object
coordinates                          object
retweeted_status                     object
created_at                           object
month                                 int64
year                                  int64
content                              object
from_user_screen_name                object
from_user_id                          int64
from_user_followers_count             int64
from_user_friends_count               int64
from_user_listed_count                int64
from_user_favourites_count            int64
from_user_statuses_count              int64
from_user_description                object
from_user_location                   object
from_user_created_at                 object
retweet_count                         int64
favorite_count                        int64
entities_urls                        object
entities_urls_count                   int64
entities_hashtags                    object
entities_hashtags_count               int64
entities_mentions                    object
entities_mentions_count               int64
in_reply_to_screen_name              object
in_reply_to_status_id               float64
source                               object
entities_expanded_urls               object
entities_media_count                float64
media_expanded_url                   object
media_url                            object
media_type                           object
video_link                            int64
photo_link                            int64
twitpic                               int64
num_characters                        int64
num_words                             int64
retweeted_user                      float64
retweeted_user_description           object
retweeted_user_screen_name           object
retweeted_user_followers_count      float64
retweeted_user_listed_count         float64
retweeted_user_statuses_count       float64
retweeted_user_location              object
retweeted_tweet_created_at           object
Fortune_2012_rank                     int64
Company                              object
CSR_sustainability                    int64
specific_project_initiative_area      int64
Length: 54, dtype: object

Remove Unneeded Columns

Every researcher will have different preferences about keeping or deleting unneeded columns. You might want to delete variables to make your dataset cleaner, you might want to save memory, or you might want a smaller dataset for some specific analyses. In any case, we can use the drop command to delete individual columns. Let's drop six that are not needed here. The first two have duplicate columns in another format, while the latter four all have zero variation (all are blank).

In [70]:
df = df.drop('created_at_text',1)
df = df.drop('tweet_id',1)
df = df.drop('withheld_in_countries',1)
df = df.drop('withheld_scope',1)
df = df.drop('truncated',1)
df = df.drop('possibly_sensitive',1)


There are now 54 columns in the dataframe.

In [71]:
len(df.columns)
Out[71]:
54
In [72]:
df.head(2)
Out[72]:
rowid query tweet_id_str inserted_date language coordinates retweeted_status created_at month year content from_user_screen_name from_user_id from_user_followers_count from_user_friends_count from_user_listed_count from_user_favourites_count from_user_statuses_count from_user_description from_user_location from_user_created_at retweet_count favorite_count entities_urls entities_urls_count entities_hashtags entities_hashtags_count entities_mentions entities_mentions_count in_reply_to_screen_name in_reply_to_status_id source entities_expanded_urls entities_media_count media_expanded_url media_url media_type video_link photo_link twitpic num_characters num_words retweeted_user retweeted_user_description retweeted_user_screen_name retweeted_user_followers_count retweeted_user_listed_count retweeted_user_statuses_count retweeted_user_location retweeted_tweet_created_at Fortune_2012_rank Company CSR_sustainability specific_project_initiative_area
0 67340 humanavitality 306897327585652736 2014-03-09 13:46:50.222857 en NaN NaN 2013-02-27 22:43:19.000000 2 2013 @louloushive (Tweet 2) We encourage other empl... humanavitality 274041023 2859 440 38 25 1766 This is the official Twitter account for Human... NaN Tue Mar 29 16:23:02 +0000 2011 0 0 NaN 0 NaN 0 louloushive 1 louloushive 3.062183e+17 web NaN NaN NaN NaN NaN 0 0 0 121 19 NaN NaN NaN NaN NaN NaN NaN NaN 79 Humana 0 1
1 39454 FundacionPfizer 308616393706844160 2014-03-09 13:38:20.679967 es NaN NaN 2013-03-04 16:34:17.000000 3 2013 ¿Sabes por qué la #vacuna contra la #neumonía ... FundacionPfizer 188384056 2464 597 50 11 2400 Noticias sobre Responsabilidad Social y Fundac... México Wed Sep 08 16:14:11 +0000 2010 1 0 NaN 0 vacuna, neumonía 2 NaN 0 NaN NaN web NaN NaN NaN NaN NaN 0 0 0 138 20 NaN NaN NaN NaN NaN NaN NaN NaN 40 Pfizer 0 1

2 rows × 54 columns


If you have only a few columns to delete you can use the drop command as shown above. On the other hand, if you only want to keep a few columns, you can create a new version of the dataframe with only those columns you like. Note that the double square brackets -- "[[...]]" -- in PANDAS forms a dataframe representation. In the following example, I am creating a new dataframe with only three variables. You can see that this new dataframe has the same number of tweets but fewer columns (variables).

In [73]:
df2 = df[['created_at', 'from_user_screen_name', 'retweet_count']]
print len(df2)
df2.head(2)
34097
Out[73]:
created_at from_user_screen_name retweet_count
0 2013-02-27 22:43:19.000000 humanavitality 0
1 2013-03-04 16:34:17.000000 FundacionPfizer 1

2 rows × 3 columns

View Twitter Accounts Represented in DF

We can use the unique function to find how many unique Twitter accounts are represented in the dataset. First, I'll show you what unique function does -- it creates an array of all the screen_names of the Twitter accounts.

In [74]:
pd.unique(df.from_user_screen_name.ravel())
Out[74]:
array(['humanavitality', 'FundacionPfizer', 'Intelinvolved', 'CiscoCSR',
       'IBMSmartCities', 'citizenIBM', 'BofA_Community', 'PG_CSDW',
       'mathmovesu', 'PPGIdeascapes', 'gehealthy', 'msftcitizenship',
       'TICalculators', 'CiscoEDU', 'DuPont_ability', 'Dell4Good',
       'verizongiving', 'DellEDU', 'SprintGreenNews', 'TeachingMoney',
       'WalmartGreen', 'ecomagination', 'WalmartAction', 'Microsoft_Green',
       'Comcastdreambig', 'DE_Youtility', 'hpglobalcitizen',
       'CitizenDisney', 'AmgenFoundation', 'ATTAspire', 'AlcoaFoundation',
       'nikebetterworld', 'HoneywellBuild', 'googlestudents',
       '3M_FoodSafety', 'HeartRescue', 'EnviroSears', 'ClickToEmpower',
       'GreenIBM', 'PromesaPepsiCo', 'FordDriveGreen', 'JNJStories'], dtype=object)


Note again how we can use len to find out how many accounts are in the array: 42

In [75]:
len(pd.unique(df.from_user_screen_name.ravel()))
Out[75]:
42

Remove Tweets from One Specific Account

We want to get rid of all tweets by TICalculators from the dataframe. Unlike the other 41 Twitter accounts in the dataset, this account is not a CSR-related account. First, we can use the len function combined with a dataframe query to count the number of tweets that are not sent by TICalculators: 32,300

In [76]:
len(df[df['from_user_screen_name'] != 'TICalculators'])
Out[76]:
32330


We should then also check how many tweets are sent by TICalculators: 1,767

In [77]:
len(df[df['from_user_screen_name'] == 'TICalculators'])
Out[77]:
1767


We can use Python to do "math." Let's use this to show whether the two numbers returned in the above steps add up to the total number of tweets in our dataframe. They do. While this may seem like an unnecessary step, it is always critical to perform such basic data checks in order to avert unexpected data disasters.

In [78]:
1767 + 32330
Out[78]:
34097


We can also do this another way

In [79]:
(1767 + 32330) - len(df)
Out[79]:
0


Or even

In [80]:
len(df[df['from_user_screen_name'] != 'TICalculators']) + len(df[df['from_user_screen_name'] == 'TICalculators']) - len(df)
Out[80]:
0

Remove Tweets by TICalculators

In the next block of code we will create a new version of our dataframe, this time limiting it to only those tweets that are not sent by TICalculators. As we can see, there are now 32,300 tweets in this dataframe, the same number as we calculated above.

In [81]:
df = df[df['from_user_screen_name'] != 'TICalculators']
print len(df)
df.head(2)
32330
Out[81]:
rowid query tweet_id_str inserted_date language coordinates retweeted_status created_at month year content from_user_screen_name from_user_id from_user_followers_count from_user_friends_count from_user_listed_count from_user_favourites_count from_user_statuses_count from_user_description from_user_location from_user_created_at retweet_count favorite_count entities_urls entities_urls_count entities_hashtags entities_hashtags_count entities_mentions entities_mentions_count in_reply_to_screen_name in_reply_to_status_id source entities_expanded_urls entities_media_count media_expanded_url media_url media_type video_link photo_link twitpic num_characters num_words retweeted_user retweeted_user_description retweeted_user_screen_name retweeted_user_followers_count retweeted_user_listed_count retweeted_user_statuses_count retweeted_user_location retweeted_tweet_created_at Fortune_2012_rank Company CSR_sustainability specific_project_initiative_area
0 67340 humanavitality 306897327585652736 2014-03-09 13:46:50.222857 en NaN NaN 2013-02-27 22:43:19.000000 2 2013 @louloushive (Tweet 2) We encourage other empl... humanavitality 274041023 2859 440 38 25 1766 This is the official Twitter account for Human... NaN Tue Mar 29 16:23:02 +0000 2011 0 0 NaN 0 NaN 0 louloushive 1 louloushive 3.062183e+17 web NaN NaN NaN NaN NaN 0 0 0 121 19 NaN NaN NaN NaN NaN NaN NaN NaN 79 Humana 0 1
1 39454 FundacionPfizer 308616393706844160 2014-03-09 13:38:20.679967 es NaN NaN 2013-03-04 16:34:17.000000 3 2013 ¿Sabes por qué la #vacuna contra la #neumonía ... FundacionPfizer 188384056 2464 597 50 11 2400 Noticias sobre Responsabilidad Social y Fundac... México Wed Sep 08 16:14:11 +0000 2010 1 0 NaN 0 vacuna, neumonía 2 NaN 0 NaN NaN web NaN NaN NaN NaN NaN 0 0 0 138 20 NaN NaN NaN NaN NaN NaN NaN NaN 40 Pfizer 0 1

2 rows × 54 columns


Now let's check again for all the unique accounts in the dataframe -- as you can see, TICalculators is gone and there are now 41 accounts.

In [82]:
pd.unique(df.from_user_screen_name.ravel())
Out[82]:
array(['humanavitality', 'FundacionPfizer', 'Intelinvolved', 'CiscoCSR',
       'IBMSmartCities', 'citizenIBM', 'BofA_Community', 'PG_CSDW',
       'mathmovesu', 'PPGIdeascapes', 'gehealthy', 'msftcitizenship',
       'CiscoEDU', 'DuPont_ability', 'Dell4Good', 'verizongiving',
       'DellEDU', 'SprintGreenNews', 'TeachingMoney', 'WalmartGreen',
       'ecomagination', 'WalmartAction', 'Microsoft_Green',
       'Comcastdreambig', 'DE_Youtility', 'hpglobalcitizen',
       'CitizenDisney', 'AmgenFoundation', 'ATTAspire', 'AlcoaFoundation',
       'nikebetterworld', 'HoneywellBuild', 'googlestudents',
       '3M_FoodSafety', 'HeartRescue', 'EnviroSears', 'ClickToEmpower',
       'GreenIBM', 'PromesaPepsiCo', 'FordDriveGreen', 'JNJStories'], dtype=object)
In [83]:
len(pd.unique(df.from_user_screen_name.ravel()))
Out[83]:
41


We now have our new dataframe without TICalculators. We are left with 32,330 tweets sent by 41 companies over the course of 2013.

Save New DataFrame

We will now save the dataframe in PANDAS' native format. It's called 'pickling' a file, so we'll give it the typical 'pkl' extension.

In [87]:
df.to_pickle('CSR tweets - 2013 by 41 accounts.pkl')


For more Notebooks as well as additional Python and Big Data tutorials, please visit http://social-metrics.org or follow me on Twitter @gregorysaxton