This is the third in a series of notebooks designed to show you how to analyze social media data. For demonstration purposes we are looking at tweets sent in 2013 by the CSR-related Twitter accounts -- accounts related to ethics, equality, the environment, etc. -- of Fortune 200 firms. We assume you have already downloaded the data and completed the steps in Chapter 1 and Chapter 2. In this third notebook I will show you how to conduct various temporal analyses of the Twitter data. Essentially, we will be taking the tweet-level data and aggregating it by time period.

Chapter 3: Analyze Twitter Data by Time Period

First, we will import several necessary Python packages and set some options for viewing the data. As with Chapter 1 and Chapter 2, we will be using the Python Data Analysis Library, or PANDAS, extensively for our data manipulations.

Import packages and set viewing options

In [1]:

import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

In [2]:

#Set PANDAS to show all columns in DataFrame
pd.set_option('display.max_columns', None)

I'm using version 0.16.2 of PANDAS.

In [3]:

pd.__version__

Out[3]:

'0.16.2'

We'll use the calendar package for one of our temporal manipulations.

In [39]:

import calendar

Import graphing packages

We'll be producing some figures at the end of this tutorial so we need to import various graphing capabilities. The default Matplotlib library is solid.

In [4]:

import matplotlib
print(matplotlib.__version__)

1.4.3

In [5]:

import matplotlib.pyplot as plt

In [6]:

#NECESSARY FOR XTICKS OPTION, ETC.
from pylab import *

One of the great innovations of the IPython notebook is the ability to see output and graphics "inline," that is, on the same page and immediately below each block of code. To enable this feature for graphics we run the following line.

In [7]:

%matplotlib inline

We will be using Seaborn to help pretty up the default Matplotlib graphics. Seaborn does not come installed with Anaconda Python so you will have to open up a terminal and run pip install seaborn.

In [8]:

import seaborn as sns
print(sns.__version__)

0.6.0

The following line will set the default plots to be bigger.

In [9]:

plt.rcParams['figure.figsize'] = (15, 5)

Version 1.4 of matplotlib enables specific plotting styles. Let's check which ones are already imported so we can play around with them later.

In [10]:

import matplotlib as mpl

In [11]:

mpl.style.available

Out[11]:

[u'dark_background', u'bmh', u'grayscale', u'ggplot', u'fivethirtyeight']

Read in data

In Chapter 1 we deleted tweets from one unneeded Twitter account and also omitted several unnecessary columns (variables). We then saved, or "pickled," the updated dataframe. Let's now open this saved file. As we can see in the operations below this dataframe contains 54 variables for 32,330 tweets.

In [12]:

df = pd.read_pickle('CSR tweets - 2013 by 41 accounts.pkl')
print(len(df))
df.head(2)

Out[12]:

	rowid	query	tweet_id_str	inserted_date	language	coordinates	retweeted_status	created_at	month	year	content	from_user_screen_name	from_user_id	from_user_followers_count	from_user_friends_count	from_user_listed_count	from_user_favourites_count	from_user_statuses_count	from_user_description	from_user_location	from_user_created_at	retweet_count	favorite_count	entities_urls	entities_urls_count	entities_hashtags	entities_hashtags_count	entities_mentions	entities_mentions_count	in_reply_to_screen_name	in_reply_to_status_id	source	entities_expanded_urls	entities_media_count	media_expanded_url	media_url	media_type	video_link	photo_link	twitpic	num_characters	num_words	retweeted_user	retweeted_user_description	retweeted_user_screen_name	retweeted_user_followers_count	retweeted_user_listed_count	retweeted_user_statuses_count	retweeted_user_location	retweeted_tweet_created_at	Fortune_2012_rank	Company	CSR_sustainability	specific_project_initiative_area
0	67340	humanavitality	306897327585652736	2014-03-09 13:46:50.222857	en	NaN	NaN	2013-02-27 22:43:19.000000	2	2013	@louloushive (Tweet 2) We encourage other empl...	humanavitality	274041023	2859	440	38	25	1766	This is the official Twitter account for Human...	NaN	Tue Mar 29 16:23:02 +0000 2011	0	0	NaN	0	NaN	0	louloushive	1	louloushive	3.062183e+17	web	NaN	NaN	NaN	NaN	NaN	0	0	0	121	19	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	79	Humana	0	1
1	39454	FundacionPfizer	308616393706844160	2014-03-09 13:38:20.679967	es	NaN	NaN	2013-03-04 16:34:17.000000	3	2013	¿Sabes por qué la #vacuna contra la #neumonía ...	FundacionPfizer	188384056	2464	597	50	11	2400	Noticias sobre Responsabilidad Social y Fundac...	México	Wed Sep 08 16:14:11 +0000 2010	1	0	NaN	0	vacuna, neumonía	2	NaN	0	NaN	NaN	web	NaN	NaN	NaN	NaN	NaN	0	0	0	138	20	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	40	Pfizer	0	1

List all the columns in the DataFrame

In [13]:

df.columns

Out[13]:

Index([u'rowid', u'query', u'tweet_id_str', u'inserted_date', u'language',
       u'coordinates', u'retweeted_status', u'created_at', u'month', u'year',
       u'content', u'from_user_screen_name', u'from_user_id',
       u'from_user_followers_count', u'from_user_friends_count',
       u'from_user_listed_count', u'from_user_favourites_count',
       u'from_user_statuses_count', u'from_user_description',
       u'from_user_location', u'from_user_created_at', u'retweet_count',
       u'favorite_count', u'entities_urls', u'entities_urls_count',
       u'entities_hashtags', u'entities_hashtags_count', u'entities_mentions',
       u'entities_mentions_count', u'in_reply_to_screen_name',
       u'in_reply_to_status_id', u'source', u'entities_expanded_urls',
       u'entities_media_count', u'media_expanded_url', u'media_url',
       u'media_type', u'video_link', u'photo_link', u'twitpic',
       u'num_characters', u'num_words', u'retweeted_user',
       u'retweeted_user_description', u'retweeted_user_screen_name',
       u'retweeted_user_followers_count', u'retweeted_user_listed_count',
       u'retweeted_user_statuses_count', u'retweeted_user_location',
       u'retweeted_tweet_created_at', u'Fortune_2012_rank', u'Company',
       u'CSR_sustainability', u'specific_project_initiative_area'],
      dtype='object')

Refresher: We can use the len function again here to see how many columns there are in the dataframe: 54.

In [14]:

len(df.columns)

Out[14]:

54

And we can see what type of variable each column is -- an integer (int64), a numerical float variable (float64), or a text variable (object).

In [15]:

df.dtypes

Out[15]:

rowid                                 int64
query                                object
tweet_id_str                          int64
inserted_date                        object
language                             object
coordinates                          object
retweeted_status                     object
created_at                           object
month                                 int64
year                                  int64
content                              object
from_user_screen_name                object
from_user_id                          int64
from_user_followers_count             int64
from_user_friends_count               int64
from_user_listed_count                int64
from_user_favourites_count            int64
from_user_statuses_count              int64
from_user_description                object
from_user_location                   object
from_user_created_at                 object
retweet_count                         int64
favorite_count                        int64
entities_urls                        object
entities_urls_count                   int64
entities_hashtags                    object
entities_hashtags_count               int64
entities_mentions                    object
entities_mentions_count               int64
in_reply_to_screen_name              object
in_reply_to_status_id               float64
source                               object
entities_expanded_urls               object
entities_media_count                float64
media_expanded_url                   object
media_url                            object
media_type                           object
video_link                            int64
photo_link                            int64
twitpic                               int64
num_characters                        int64
num_words                             int64
retweeted_user                      float64
retweeted_user_description           object
retweeted_user_screen_name           object
retweeted_user_followers_count      float64
retweeted_user_listed_count         float64
retweeted_user_statuses_count       float64
retweeted_user_location              object
retweeted_tweet_created_at           object
Fortune_2012_rank                     int64
Company                              object
CSR_sustainability                    int64
specific_project_initiative_area      int64
dtype: object

Convert created_at to time variable

To work with time, we first have to have a variable in our dataframe that indicates time. We will use the created_at column, which represents the time at which the tweet was created. Below we first check the current types of the first 8 columns (created_at is still a text object) and then convert this variable from text format to Python's datetime format.

In [16]:

df.dtypes[:8]

Out[16]:

rowid                int64
query               object
tweet_id_str         int64
inserted_date       object
language            object
coordinates         object
retweeted_status    object
created_at          object
dtype: object

In [17]:

df['created_at'] = pd.to_datetime(df['created_at'])

Let's take another look at the column types. As before, we'll just look at the first 8 columns, using Python's slicing capabilities (the '[:8]' slice returns everything up to and including the 8th column; this is a useful tool for other applications as well). We see that the created_at column is now a datetime64 object. We will now be able to manipulate the data -- sorting, indexing, aggregating, and the like -- based on time.

In [18]:

df.dtypes[:8]

Out[18]:

rowid                        int64
query                       object
tweet_id_str                 int64
inserted_date               object
language                    object
coordinates                 object
retweeted_status            object
created_at          datetime64[ns]
dtype: object
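If you want to try the conversion without loading the full dataset, here is a minimal sketch on a toy dataframe. The dataframe `toy` and its two rows are made up for illustration (the timestamps simply mirror the format of created_at); only `pd.to_datetime` and the `.dt` accessor are standard PANDAS.

```python
import pandas as pd

# Hypothetical two-row stand-in for the tweet data
toy = pd.DataFrame({'created_at': ['2013-02-27 22:43:19.000000',
                                   '2013-03-04 16:34:17.000000'],
                    'content': ['first tweet', 'second tweet']})

# Convert the text column to a datetime column, as in the cell above
toy['created_at'] = pd.to_datetime(toy['created_at'])
print(toy.dtypes)

# Once converted, the .dt accessor exposes the individual time components
months = toy['created_at'].dt.month.tolist()
print(months)
```

The `.dt` accessor works on a column; once the column becomes the index (next section), the same components are available directly on the index.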

Set the Index

One thing you'll have to frequently do in PANDAS is set an index to your dataframe. For the non-programmer this can be a bit difficult to wrap your head around. You might think of the index as a tool for organizing or categorizing your data. For instance, will you organize your data alphabetically, by organization, by time, by location, or by whether it includes a photo? Each of these would require a different index variable. What we are going to do here is set the index to be our created_at variable. This will allow us to manipulate the data by time.

In [19]:

df = df.set_index(['created_at'])
df.head(2)

Out[19]:

	rowid	query	tweet_id_str	inserted_date	language	coordinates	retweeted_status	month	year	content	from_user_screen_name	from_user_id	from_user_followers_count	from_user_friends_count	from_user_listed_count	from_user_favourites_count	from_user_statuses_count	from_user_description	from_user_location	from_user_created_at	retweet_count	favorite_count	entities_urls	entities_urls_count	entities_hashtags	entities_hashtags_count	entities_mentions	entities_mentions_count	in_reply_to_screen_name	in_reply_to_status_id	source	entities_expanded_urls	entities_media_count	media_expanded_url	media_url	media_type	video_link	photo_link	twitpic	num_characters	num_words	retweeted_user	retweeted_user_description	retweeted_user_screen_name	retweeted_user_followers_count	retweeted_user_listed_count	retweeted_user_statuses_count	retweeted_user_location	retweeted_tweet_created_at	Fortune_2012_rank	Company	CSR_sustainability	specific_project_initiative_area
created_at
2013-02-27 22:43:19	67340	humanavitality	306897327585652736	2014-03-09 13:46:50.222857	en	NaN	NaN	2	2013	@louloushive (Tweet 2) We encourage other empl...	humanavitality	274041023	2859	440	38	25	1766	This is the official Twitter account for Human...	NaN	Tue Mar 29 16:23:02 +0000 2011	0	0	NaN	0	NaN	0	louloushive	1	louloushive	3.062183e+17	web	NaN	NaN	NaN	NaN	NaN	0	0	0	121	19	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	79	Humana	0	1
2013-03-04 16:34:17	39454	FundacionPfizer	308616393706844160	2014-03-09 13:38:20.679967	es	NaN	NaN	3	2013	¿Sabes por qué la #vacuna contra la #neumonía ...	FundacionPfizer	188384056	2464	597	50	11	2400	Noticias sobre Responsabilidad Social y Fundac...	México	Wed Sep 08 16:14:11 +0000 2010	1	0	NaN	0	vacuna, neumonía	2	NaN	0	NaN	NaN	web	NaN	NaN	NaN	NaN	NaN	0	0	0	138	20	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	40	Pfizer	0	1

Look at the far-left column in bold above. That's our index column, and it's no longer 0,1,2 but rather our created_at variable. We have effectively told PANDAS that each row (i.e., each tweet) is ready to be indexed according to the values of the created_at column, which, thanks to our earlier conversion, is a datetime variable.
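To make the idea of a time index more concrete, here is a small sketch on made-up data. The dataframe `toy` and its three rows are hypothetical; the point is only to show what a datetime index lets you ask for.

```python
import pandas as pd

# Hypothetical three-row stand-in for the tweet data
toy = pd.DataFrame({'created_at': pd.to_datetime(['2013-02-27 22:43:19',
                                                  '2013-03-04 16:34:17',
                                                  '2013-03-04 09:00:00']),
                    'content': ['a', 'b', 'c']})
toy = toy.set_index('created_at')

# The index is now a DatetimeIndex, so its time components are attributes
print(type(toy.index).__name__)
weekdays = list(toy.index.weekday)   # 0 = Monday ... 6 = Sunday
hours = list(toy.index.hour)

# We can also filter rows by a time component, e.g. keep only March tweets
march = toy[toy.index.month == 3]
print(len(march))
```

These are the same attributes (date, weekday, hour, month, and so on) that we will group on in the rest of this chapter.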

Generate and Plot Number of Tweets over Different Time Periods

Recall that in Chapter 2 we took our tweet-level dataframe and converted it to an account-level dataframe by aggregating the tweets based on the account variable. We did this by first writing an aggregation function and second by applying that function with a groupby command. We're going to do something similar here except that we'll be aggregating by time; differently put, we will be "collapsing" the data by time rather than by Twitter account.

As in the last chapter, our first step is to create a function that spells out which variables (columns) from our dataframe we wish to keep and/or aggregate. Specifically, the following function produces a variable called Number_of_tweets that is a count of the number of tweets sent; we are basing this on the "content" column but we could have chosen others. This is the same function written in Chapter 2, except that we don't need the second and third variables.

In [20]:

def f(x):
    #COUNT THE NON-MISSING VALUES IN THE 'content' COLUMN FOR EACH GROUP
    return Series(dict(Number_of_tweets = x['content'].count()))

Generate and Plot Daily Counts

First, let's analyze the data by day of the year. To do this, we need to convert our tweet-level dataset -- a dataset where each row is dedicated to one of the 32,330 tweets -- to a daily dataset. This process is called aggregation and the output will be a new dataframe with 365 rows -- one per day.

In the following block of code we will now apply the above function to our dataframe. Note the groupby command again -- same as in Chapter 2. This is how we aggregate the data. We are asking PANDAS to create a new dataframe, called daily_count, and asking that this new dataframe be based on aggregating our original dataframe by the index column, applying the function f we wrote above. In other words, we are going to our original dataframe of 32,330 tweets, and aggregating or collapsing that data based on the day of the year it was sent. We are thus converting our tweet-level dataset with 32,330 rows into a daily dataset with 365 rows. As you can see in the output, there are 365 observations in this new dataframe -- one per day.

Notice also that we are accessing the "date" element of our index variable (see index.date). Here you'll start to see the power of Python's datetime variables -- by specifying ".date", we are telling PANDAS we want to access only the particular day/month/year combination included in the created_at column.

In [21]:

daily_count = df.groupby(df.index.date).apply(f)
print(len(daily_count))
daily_count.head(5)

Out[21]:

	Number_of_tweets
2013-01-01	24
2013-01-02	71
2013-01-03	92
2013-01-04	94
2013-01-05	38
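Because our function only produces a single count, the same daily aggregation can also be written without a custom function. The following is a sketch of that alternative on made-up data, not the approach used in this notebook; the dataframe `toy` and its four rows are hypothetical.

```python
import pandas as pd

# Hypothetical tweet-level data: two tweets on Jan 1, one on Jan 2, one on Jan 4
toy = pd.DataFrame(
    {'content': ['a', 'b', 'c', 'd']},
    index=pd.to_datetime(['2013-01-01 08:00', '2013-01-01 17:30',
                          '2013-01-02 12:00', '2013-01-04 09:15']))

# Group on the date part of the index and count non-missing 'content' values
per_day = toy.groupby(toy.index.date)['content'].count()
per_day.name = 'Number_of_tweets'
print(per_day)

# January 3 has no tweets, so it gets no row at all in the result
n_days = len(per_day)
print(n_days)
```

Note the last point: groupby only returns rows for dates that actually occur in the data, which is why checking the row count (365 in our case) is worthwhile.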

You'll see above that the index column (in bold) is the date. Let's give a name to this index and then inspect the first five rows in the dataframe.

In [22]:

daily_count.index.name = 'date'
daily_count.head(5)

Out[22]:

	Number_of_tweets
date
2013-01-01	24
2013-01-02	71
2013-01-03	92
2013-01-04	94
2013-01-05	38

It's always good to inspect your data to check that it worked as expected. We already know there are 365 rows, which is a good sign. Let's now look at the last five rows of the dataframe.

In [23]:

daily_count.tail(5)

Out[23]:

	Number_of_tweets
date
2013-12-27	35
2013-12-28	13
2013-12-29	8
2013-12-30	25
2013-12-31	234

OK, that's exactly what we were expecting, too. Let's run two more lines of code to see what the earliest and latest dates are in the dataset.

In [24]:

daily_count.index.min()

Out[24]:

datetime.date(2013, 1, 1)

In [25]:

daily_count.index.max()

Out[25]:

datetime.date(2013, 12, 31)
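With a different dataset you might not be so lucky as to have a tweet on every single day. Here is one hedged way to check for gaps and fill them with zeros. The series `toy_counts` is made up for illustration and uses a datetime index; with our daily_count dataframe you would first need to convert its index of date objects (for instance with pd.to_datetime).

```python
import pandas as pd

# Hypothetical daily counts with a gap: nothing recorded for January 3
toy_counts = pd.Series([2, 1, 1],
                       index=pd.to_datetime(['2013-01-01', '2013-01-02', '2013-01-04']),
                       name='Number_of_tweets')

# Build the complete calendar for the period we expect to cover
full_range = pd.date_range('2013-01-01', '2013-01-04', freq='D')

# Days in the calendar that are absent from our counts
missing = full_range.difference(toy_counts.index)
print(missing)

# Add the missing days back in with a count of zero
filled = toy_counts.reindex(full_range, fill_value=0)
print(filled)
```

Filling the gaps matters mostly for plotting and averaging: a day with zero tweets is information, whereas a missing row is silently skipped.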

Perfect. We're all set. Now let's plot it. If you recall from Chapter 2, we are using the matplotlib graphics package and making the plots prettier by applying the Seaborn package's tweaks to matplotlib. PANDAS makes it easy to produce fine plots of your data, though typically the default graphs have a few things we'd like to tweak. Learning the ins and outs of all the possible modifications takes time, so don't worry about learning them all now. Instead, I'd recommend using the following examples as a template for your own data and then learning new options as you need them.

In the code block below I modify the transparency of the graph (using "alpha"), change the font size and the rotation of the x-axis tick labels, add/change the y-axis and x-axis headings, make them bold, and add some extra spacing. I then save the output as a .png file that you can insert into your Word or LaTeX file. I have left the title out of these figures -- I recommend adding titles later in Word/LaTeX.

One final note: I have added comments to the code below. In Python, anything after the pound sign ('#') is considered a comment. It's good coding practice.

In [26]:

daily_plot = daily_count['Number_of_tweets'].plot(kind='line', lw=1, alpha=0.75, legend=True, x_compat=True)

daily_plot.set_xlabel('Month', weight='bold', labelpad=15)    #SET X-AXIS LABEL; ADD PADDING TO TOP OF LABEL
daily_plot.set_ylabel('# Tweets (Messages)', weight='bold', labelpad=15) #SET Y-AXIS LABEL; ADD PADDING TO RIGHT OF LABEL

xticks(fontsize = 9, rotation = -30, ha ="left")  #SET FONT PROPERTIES OF X-AXIS TICK LABELS
yticks(fontsize = 9)                              #SET FONT PROPERTIES OF Y-AXIS TICK LABELS

#http://matplotlib.org/users/legend_guide.html
#http://nbviewer.ipython.org/gist/olgabot/5357268  ### LIST OF OPTIONS
#legend(fontsize='x-small',loc=2,labelspacing=0.1, frameon=False)#.draggable()
daily_plot.legend_ = None
daily_plot.tick_params(axis='x', pad=5) #SET PADDING ABOVE X-AXIS LABELS
#Set x axis label on top of plot, set label text --> https://datasciencelab.wordpress.com/2013/12/21/beautiful-plots-with-pandas-and-matplotlib/
#daily_plot.xaxis.set_label_position('top')

savefig('daily counts.png', bbox_inches='tight', dpi=300, format='png')   #SAVE PLOT IN PNG FORMAT

Output daily data to CSV file

As shown in Chapter 1, it is simple to save any of the temporally aggregated dataframes we create in this chapter. For example, let's output daily_count to a CSV file. This will give us a file with 365 rows and two columns: date and Number_of_tweets.

In [77]:

daily_count.to_csv('Number of Tweets per Day.csv')

PANDAS can output to a number of different formats. I most commonly use to_csv, to_excel, to_json, and to_pickle.
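If you later read the CSV file back in, the dates come back as plain text unless you ask PANDAS to parse them. The sketch below shows a round trip on made-up data; it writes to an in-memory buffer rather than a file so that it is self-contained, and the dataframe `toy_counts` is hypothetical.

```python
import io
import pandas as pd

# Hypothetical daily counts with a named date index
toy_counts = pd.DataFrame({'Number_of_tweets': [24, 71, 92]},
                          index=pd.to_datetime(['2013-01-01', '2013-01-02', '2013-01-03']))
toy_counts.index.name = 'date'

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
toy_counts.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)

# Read it back, using the 'date' column as the index and parsing it as dates
restored = pd.read_csv(io.StringIO(csv_text), index_col='date', parse_dates=True)
print(restored.index)
```

With a real file you would pass the filename to read_csv in place of the buffer.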

Generate and Plot Day-of-the-Week Tweets

Now let's create a dataframe with a count of the number of tweets per day of the week. We can apply the same function f and use the same index variable, created_at. The only difference is that we will access the "weekday" element of the index rather than the "date" element. It bears repeating that we converted created_at to a Python datetime variable and, fortunately for us, this type of variable has a number of different attributes we can access, including second, minute, hour, day, weekday, month, and year.

In the following block of code we will now apply our aggregating function f to our dataframe df and use the groupby command again -- only this time we are aggregating on the weekday attribute of our datetime variable. We are thus asking PANDAS to create a new dataframe, called weekday_count, and asking that this new dataframe be based on aggregating our original dataframe by index.weekday, applying the function f we wrote above. In other words, we are going to our original dataframe of 32,330 tweets, and aggregating or collapsing that data based on the day of the week it was sent. We are thus converting our tweet-level dataset with 32,330 rows into a day-of-the-week dataset with 7 rows.

In [27]:

weekday_count = df.groupby(df.index.weekday).apply(f)
print(len(weekday_count))
weekday_count

Out[27]:

	Number_of_tweets
0	5306
1	6467
2	6715
3	6108
4	5264
5	1513
6	957

In the datetime variable, '0' is Monday and '6' is Sunday. Let's add another column to our new dataframe with the names of the days of the week. The first line creates a Python list with 7 elements, the second line adds a new column "day" to the dataframe and fills it with the values in the list (called "days") we created in the first line. The third line displays the updated dataframe.

In [28]:

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekday_count['day'] = days
weekday_count

Out[28]:

	Number_of_tweets	day
0	5306	Monday
1	6467	Tuesday
2	6715	Wednesday
3	6108	Thursday
4	5264	Friday
5	1513	Saturday
6	957	Sunday
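Rather than typing the day names by hand, you can also let Python's standard calendar package generate them. This is a small sketch of that alternative; it assumes the default (English) locale, since calendar.day_name follows the current locale setting, and the timestamp is just an example.

```python
import calendar
import datetime

# Day names in the same order as the weekday numbering: Monday first
days = list(calendar.day_name)
print(days)

# Look up the name for a single timestamp via its weekday number
tweet_time = datetime.datetime(2013, 2, 27, 22, 43, 19)
day_label = days[tweet_time.weekday()]
print(day_label)
```

Because the list is ordered Monday through Sunday, it can be assigned to the "day" column exactly as the hand-typed list was.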

Let's plot these data. We'll use a bar graph here.

One change you'll notice is that we are not using the index column for our x-axis labels but rather the labels in our new "day" column. The second line in the block of code below takes care of this.

In [29]:

day_of_week_plot = weekday_count['Number_of_tweets'].plot(kind='bar')
xticks(np.arange(7), weekday_count['day'], rotation = 0, fontsize = 9) #, ha ="left") 

###IF WE DON'T WANT TO CREATE ANOTHER COLUMN IN DATAFRAME WE CAN SET CUSTOM LABELS
#days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
#xticks(np.arange(7), days, rotation = 0,fontsize = 9) #, ha ="left") 

savefig('day-of-week counts.png', bbox_inches='tight', dpi=300, format='png')

We see that, not surprisingly, there are many fewer tweets sent on the weekend over the course of 2013.

Hour-of-Day Counts

We might also be interested in the hour of the day that the tweets are sent. We only need to access our index's hour attribute to accomplish this.

In [30]:

hourly_count = df.groupby(df.index.hour).apply(f)
print(len(hourly_count))
hourly_count

Out[30]:

	Number_of_tweets
0	1020
1	743
2	420
3	250
4	189
5	238
6	144
7	93
8	117
9	99
10	183
11	267
12	663
13	1713
14	2888
15	3155
16	3140
17	2925
18	2944
19	3114
20	2937
21	2362
22	1612
23	1114
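One thing to keep in mind when interpreting hourly counts is the time zone of the timestamps. The sketch below assumes the created_at values are stored in UTC, which is how the Twitter API reports times but is only an assumption about this particular dataset, and converts them to US Eastern time before extracting the hour. The dataframe `toy` is made up, and the target zone is just an example.

```python
import pandas as pd

# Hypothetical tweets with timestamps ASSUMED to be in UTC
idx = pd.to_datetime(['2013-02-27 22:43:19', '2013-03-04 16:34:17'])
toy = pd.DataFrame({'content': ['a', 'b']}, index=idx)

# Attach the UTC time zone, then convert to an example local zone
local_index = toy.index.tz_localize('UTC').tz_convert('America/New_York')

utc_hours = list(toy.index.hour)      # hours as stored
local_hours = list(local_index.hour)  # hours on the East Coast clock
print(utc_hours)
print(local_hours)
```

If the assumption holds, grouping on the converted hours instead of the stored hours would shift the whole distribution by several hours.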

Now let's plot the data. First let's try the default plot. As you'll see, it's a line plot with ticks every five hours.

In [31]:

hourly_plot = hourly_count['Number_of_tweets'].plot()

We can show ticks for each hour by adding a line of code.

In [33]:

hourly_plot = hourly_count['Number_of_tweets'].plot()
xticks(np.arange(24), rotation = 0,fontsize = 9) #, ha ="left") 

Out[33]:

([<matplotlib.axis.XTick at 0x10d90c210>,
  <matplotlib.axis.XTick at 0x10d21a750>,
  <matplotlib.axis.XTick at 0x10c157910>,
  <matplotlib.axis.XTick at 0x10c157b10>,
  <matplotlib.axis.XTick at 0x10e9ee1d0>,
  <matplotlib.axis.XTick at 0x10cff9090>,
  <matplotlib.axis.XTick at 0x10d250f50>,
  <matplotlib.axis.XTick at 0x10ceec110>,
  <matplotlib.axis.XTick at 0x10d29d710>,
  <matplotlib.axis.XTick at 0x10d9c8c50>,
  <matplotlib.axis.XTick at 0x10e9154d0>,
  <matplotlib.axis.XTick at 0x10c157c90>,
  <matplotlib.axis.XTick at 0x10d599990>,
  <matplotlib.axis.XTick at 0x10ceee810>,
  <matplotlib.axis.XTick at 0x10d007590>,
  <matplotlib.axis.XTick at 0x10cf88d90>,
  <matplotlib.axis.XTick at 0x10cf88410>,
  <matplotlib.axis.XTick at 0x10d1c4150>,
  <matplotlib.axis.XTick at 0x10d71f490>,
  <matplotlib.axis.XTick at 0x10d71fc10>,
  <matplotlib.axis.XTick at 0x10d03d3d0>,
  <matplotlib.axis.XTick at 0x10d03db50>,
  <matplotlib.axis.XTick at 0x10d065310>,
  <matplotlib.axis.XTick at 0x10d065a90>],
 <a list of 24 Text xticklabel objects>)

However, it is showing the hours in the "pythonic" way -- from 0 to 23. We can fix that by adding a list with the hours 1-24 and then invoking that list in our xticks command, so that the tick labeled 1 marks the first hour of the day (00:00 to 00:59). The plot below also adds labels for the x and y axes and saves the output in PNG format.

In [34]:

hourly_plot = hourly_count['Number_of_tweets'].plot(kind='line')
hours = list(range(1,25))                                                #GENERATE LIST FROM 1 TO 24
xticks(np.arange(24), hours, rotation = 0,fontsize = 9)                  #USE THE CUSTOM TICKS

hourly_plot.set_xlabel('Hour of the Day', weight='bold', labelpad=15)     #SET X-AXIS LABEL, ADD PADDING TO TOP OF X-AXIS LABEL
hourly_plot.set_ylabel('# Tweets (Messages)', weight='bold', labelpad=15) #SET Y-AXIS LABEL, ADD PADDING TO RIGHT OF Y-AXIS LABEL

xticks(fontsize = 9, rotation = 0, ha= "center")                          #SET FONT SIZE FOR X-AXIS TICK LABELS
yticks(fontsize = 9)                                                      #SET FONT SIZE FOR Y-AXIS TICK LABELS
hourly_plot.tick_params(axis='x', pad=5)                                  #SET PADDING ABOVE X-AXIS LABELS

hourly_plot.legend_ = None                                                #TURN OFF LEGEND

savefig('hourly counts - line graph.png', bbox_inches='tight', dpi=300, format='png')

It's then easy to copy the above code block and change it to a bar graph by changing kind='line' to kind='bar' in the first line.

In [35]:

hourly_plot = hourly_count['Number_of_tweets'].plot(kind='bar')
hours = list(range(1,25))                                                 #GENERATE LIST FROM 1 TO 24
xticks(np.arange(24), hours, rotation = 0,fontsize = 9)                   #USE THE CUSTOM TICKS

hourly_plot.set_xlabel('Hour of the Day', weight='bold', labelpad=15)     #SET X-AXIS LABEL, ADD PADDING TO TOP OF X-AXIS LABEL
hourly_plot.set_ylabel('# Tweets (Messages)', weight='bold', labelpad=15) #SET Y-AXIS LABEL, ADD PADDING TO RIGHT OF Y-AXIS LABEL

xticks(fontsize = 9, rotation = 0, ha= "center")                          #SET FONT SIZE FOR X-AXIS TICK LABELS
yticks(fontsize = 9)                                                      #SET FONT SIZE FOR Y-AXIS TICK LABELS
hourly_plot.tick_params(axis='x', pad=5)                                  #SET PADDING ABOVE X-AXIS LABELS

hourly_plot.legend_ = None                                                #TURN OFF LEGEND

savefig('hourly counts - bar graph.png', bbox_inches='tight', dpi=300, format='png')

Generate Monthly Tweet Count

Generating a count by month follows the same process.

In [36]:

monthly_count = df.groupby(df.index.month).apply(f)
print(len(monthly_count))
monthly_count

Out[36]:

	Number_of_tweets
1	3203
2	3056
3	2973
4	3162
5	2784
6	2366
7	2314
8	2314
9	2485
10	3207
11	2382
12	2084

Using basically the same code as above, we can plot a bar graph of these data. The one change is the second line of code -- here we use the calendar package to help generate a list of the months of the year. Python has a ton of such specialized packages to help save time.

In [41]:

monthly_plot = monthly_count['Number_of_tweets'].plot(kind='bar')
months = list(calendar.month_name[1:])                                    #GENERATE LIST OF MONTHS
xticks(np.arange(12), months, rotation = 0,fontsize = 9)                  #USE THE CUSTOM TICKS

monthly_plot.set_xlabel('Month of the Year', weight='bold', labelpad=15)  #SET X-AXIS LABEL, ADD PADDING TO TOP OF X-AXIS LABEL
monthly_plot.set_ylabel('# Tweets (Messages)', weight='bold', labelpad=15) #SET Y-AXIS LABEL, ADD PADDING TO RIGHT OF Y-AXIS LABEL

xticks(fontsize = 9, rotation = 0, ha= "center")                          #SET FONT SIZE FOR X-AXIS TICK LABELS
yticks(fontsize = 9)                                                      #SET FONT SIZE FOR Y-AXIS TICK LABELS
monthly_plot.tick_params(axis='x', pad=5)                                 #SET PADDING ABOVE X-AXIS LABELS

monthly_plot.legend_ = None                                               #TURN OFF LEGEND

savefig('monthly counts - bar graph.png', bbox_inches='tight', dpi=300, format='png')
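In case the calendar trick in the second line is unclear, here is what it produces on its own. The sketch assumes the default (English) locale; the abbreviated names are an optional variation that can be handy when the full names crowd the x-axis.

```python
import calendar

# month_name has an empty string in position 0, so we slice from 1 onward
months = list(calendar.month_name[1:])
print(months)

# Three-letter abbreviations, useful when tick labels get crowded
short_months = list(calendar.month_abbr[1:])
print(short_months)
```

Either list has 12 elements and can be passed to xticks in place of the other.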

Getting Ridiculous: Number of Tweets per Minute

In case you ever wanted to, you can also access the minute at which the tweets were posted. Note that the minute attribute refers to the minute of the hour, not the minute of the day. That is, the n is 60.

In [42]:

minute_count = df.groupby(df.index.minute).apply(f)
print(len(minute_count))
minute_count.head()

Out[42]:

	Number_of_tweets
0	3126
1	634
2	600
3	557
4	471

I'll just use the default plot here. There appear to be spikes at the 1st minute (minute 0 in the table above) and the 30th minute of each hour, probably due to automatic scheduling of tweets.

In [239]:

minute_count.plot()

Out[239]:

<matplotlib.axes._subplots.AxesSubplot at 0x11ed68910>
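If you did want the minute of the day rather than the minute of the hour, you can combine the hour and minute attributes yourself. This is a sketch on made-up data, not something computed elsewhere in this notebook; the dataframe `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical tweets: two share the same clock time (16:34) on different days
idx = pd.to_datetime(['2013-02-27 00:00:30', '2013-02-27 00:30:00',
                      '2013-03-04 16:34:17', '2013-03-05 16:34:59'])
toy = pd.DataFrame({'content': list('abcd')}, index=idx)

# Minute of the day runs from 0 (00:00) to 1439 (23:59)
minute_of_day = toy.index.hour * 60 + toy.index.minute

# Group on the combined value and count tweets
per_minute_of_day = toy.groupby(minute_of_day)['content'].count()
print(per_minute_of_day)
```

On the full dataset this would give up to 1,440 rows instead of 60.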

Super Ridiculous: Number of Tweets per Second

Note that the second attribute refers to the second of the minute, so the n is 60.

In [44]:

second_count = df.groupby(df.index.second).apply(f)
print(len(second_count))
second_count.head()

Out[44]:

	Number_of_tweets
0	1227
1	1626
2	1720
3	1141
4	920

Not much useful information here. There are spikes at the first, second, and third seconds of the minute, probably also due to automated scheduling of the tweets.

In [46]:

second_count.plot()

Out[46]:

<matplotlib.axes._subplots.AxesSubplot at 0x10fd35b90>

Alternative Plotting Styles

I like the Seaborn styles shown above. However, you might want to play around with some of the other styles available to you. First, let's try the plot with the 'default' mpl_style. For each style, I'll show you a line plot (# of tweets per day of the year) and a bar graph (# of tweets per day of the week).

In [52]:

pd.set_option('display.mpl_style', 'default') # Make the graphs a bit prettier

In [53]:

plt.rcParams['figure.figsize'] = (15, 5)

In [54]:

daily_plot = daily_count['Number_of_tweets'].plot(kind='line', lw=1, alpha=0.75, legend=True, x_compat=True)

daily_plot.set_xlabel('Month', weight='bold', labelpad=15)    #SET X-AXIS LABEL; ADD PADDING TO TOP OF LABEL
daily_plot.set_ylabel('# Tweets (Messages)', weight='bold', labelpad=15) #SET Y-AXIS LABEL; ADD PADDING TO RIGHT OF LABEL

xticks(fontsize = 9, rotation = -30, ha ="left")  #SET FONT PROPERTIES OF X-AXIS TICK LABELS
yticks(fontsize = 9)                              #SET FONT PROPERTIES OF Y-AXIS TICK LABELS
daily_plot.legend_ = None
daily_plot.tick_params(axis='x', pad=5)

In [55]:

day_of_week_plot = weekday_count['Number_of_tweets'].plot(kind='bar')
xticks(np.arange(7), weekday_count['day'], rotation = 0, fontsize = 9)

Out[55]:

([<matplotlib.axis.XTick at 0x115cc3e90>,
  <matplotlib.axis.XTick at 0x10d71f650>,
  <matplotlib.axis.XTick at 0x116f1bb50>,
  <matplotlib.axis.XTick at 0x116eaa090>,
  <matplotlib.axis.XTick at 0x116eaa810>,
  <matplotlib.axis.XTick at 0x116eaaf90>,
  <matplotlib.axis.XTick at 0x116eb3750>],
 <a list of 7 Text xticklabel objects>)

Try Matplotlib's built-in styles

Matplotlib 1.4 also comes with five different built-in styles. Let's try them all out.

In [50]:

mpl.style.available

Out[50]:

[u'dark_background', u'bmh', u'grayscale', u'ggplot', u'fivethirtyeight']

First, let's run it in dark_background style. We only have to run the following line of code and then all subsequent plots will be run in this style.

In [60]:

mpl.style.use('dark_background')
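If you would rather try a style on a single plot without changing every later plot, matplotlib also offers a style context manager. The sketch below is written as a standalone script: it selects the non-interactive Agg backend so it runs without a display (leave that line out inside a notebook), and the five plotted values are just the first five daily counts shown earlier.

```python
import matplotlib
matplotlib.use('Agg')   # non-interactive backend for a standalone script; omit in a notebook
import matplotlib.pyplot as plt

default_facecolor = plt.rcParams['axes.facecolor']

# The style applies only to code inside the "with" block
with plt.style.context('ggplot'):
    inside_facecolor = plt.rcParams['axes.facecolor']
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3, 4, 5], [24, 71, 92, 94, 38])
    ax.set_ylabel('# Tweets (Messages)')
    plt.close(fig)

# Outside the block the previous settings are back in place
restored_facecolor = plt.rcParams['axes.facecolor']
print(restored_facecolor == default_facecolor)
```

This is handy for comparing styles side by side, since you do not have to reset anything between plots.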

In [61]:

daily_plot = daily_count['Number_of_tweets'].plot(kind='line', lw=1, alpha=0.75, legend=True, x_compat=True)
daily_plot.set_xlabel('Month', weight='bold', labelpad=15)    #SET X-AXIS LABEL; ADD PADDING TO TOP OF LABEL
daily_plot.set_ylabel('# Tweets (Messages)', weight='bold', labelpad=15) #SET Y-AXIS LABEL; ADD PADDING TO RIGHT OF LABEL
xticks(fontsize = 9, rotation = -30, ha ="left")  #SET FONT PROPERTIES OF X-AXIS TICK LABELS
yticks(fontsize = 9)                              #SET FONT PROPERTIES OF Y-AXIS TICK LABELS
daily_plot.legend_ = None
daily_plot.tick_params(axis='x', pad=5)

In [62]:

day_of_week_plot = weekday_count['Number_of_tweets'].plot(kind='bar')
xticks(np.arange(7), weekday_count['day'], rotation = 0, fontsize = 9)

Out[62]:

([<matplotlib.axis.XTick at 0x11748cd90>,
  <matplotlib.axis.XTick at 0x117a58810>,
  <matplotlib.axis.XTick at 0x117e1ba10>,
  <matplotlib.axis.XTick at 0x117e42f10>,
  <matplotlib.axis.XTick at 0x117e856d0>,
  <matplotlib.axis.XTick at 0x117e85e50>,
  <matplotlib.axis.XTick at 0x117e8e610>],
 <a list of 7 Text xticklabel objects>)

Now let's run it in bmh style.

In [63]:

mpl.style.use('bmh')

In [64]:

daily_plot = daily_count['Number_of_tweets'].plot(kind='line', lw=1, alpha=0.75, legend=True, x_compat=True)
daily_plot.set_xlabel('Month', weight='bold', labelpad=15)    #SET X-AXIS LABEL; ADD PADDING TO TOP OF LABEL
daily_plot.set_ylabel('# Tweets (Messages)', weight='bold', labelpad=15) #SET Y-AXIS LABEL; ADD PADDING TO RIGHT OF LABEL
xticks(fontsize = 9, rotation = -30, ha ="left")  #SET FONT PROPERTIES OF X-AXIS TICK LABELS
yticks(fontsize = 9)                              #SET FONT PROPERTIES OF Y-AXIS TICK LABELS
daily_plot.legend_ = None
daily_plot.tick_params(axis='x', pad=5)

In [65]:

day_of_week_plot = weekday_count['Number_of_tweets'].plot(kind='bar')
xticks(np.arange(7), weekday_count['day'], rotation = 0, fontsize = 9)

Out[65]:

([<matplotlib.axis.XTick at 0x115c86790>,
  <matplotlib.axis.XTick at 0x115c86190>,
  <matplotlib.axis.XTick at 0x1145a2e10>,
  <matplotlib.axis.XTick at 0x115caf810>,
  <matplotlib.axis.XTick at 0x115cb4610>,
  <matplotlib.axis.XTick at 0x115cb4d90>,
  <matplotlib.axis.XTick at 0x116e40050>],
 <a list of 7 Text xticklabel objects>)

Now let's run it in grayscale style.

In [66]:

mpl.style.use('grayscale')

In [67]:

daily_plot = daily_count['Number_of_tweets'].plot(kind='line', lw=1, alpha=0.75, legend=True, x_compat=True)
daily_plot.set_xlabel('Month', weight='bold', labelpad=15)    #SET X-AXIS LABEL; ADD PADDING TO TOP OF LABEL
daily_plot.set_ylabel('# Tweets (Messages)', weight='bold', labelpad=15) #SET Y-AXIS LABEL; ADD PADDING TO RIGHT OF LABEL
xticks(fontsize = 9, rotation = -30, ha ="left")  #SET FONT PROPERTIES OF X-AXIS TICK LABELS
yticks(fontsize = 9)                              #SET FONT PROPERTIES OF Y-AXIS TICK LABELS
daily_plot.legend_ = None
daily_plot.tick_params(axis='x', pad=5)

In [68]:

day_of_week_plot = weekday_count['Number_of_tweets'].plot(kind='bar')
xticks(np.arange(7), weekday_count['day'], rotation = 0, fontsize = 9)

Out[68]:

([<matplotlib.axis.XTick at 0x117bfa650>,
  <matplotlib.axis.XTick at 0x10f3d7b90>,
  <matplotlib.axis.XTick at 0x10d0cc610>,
  <matplotlib.axis.XTick at 0x10fb9bc50>,
  <matplotlib.axis.XTick at 0x112f9db10>,
  <matplotlib.axis.XTick at 0x10e151210>,
  <matplotlib.axis.XTick at 0x10fd1a390>],
 <a list of 7 Text xticklabel objects>)

And here's ggplot style -- patterned after the popular R plotting package.

In [69]:

mpl.style.use('ggplot')

In [70]:

daily_plot = daily_count['Number_of_tweets'].plot(kind='line', lw=1, alpha=0.75, legend=True, x_compat=True)
daily_plot.set_xlabel('Month', weight='bold', labelpad=15)    #SET X-AXIS LABEL; ADD PADDING TO TOP OF LABEL
daily_plot.set_ylabel('# Tweets (Messages)', weight='bold', labelpad=15) #SET Y-AXIS LABEL; ADD PADDING TO RIGHT OF LABEL
xticks(fontsize = 9, rotation = -30, ha ="left")  #SET FONT PROPERTIES OF X-AXIS TICK LABELS
yticks(fontsize = 9)                              #SET FONT PROPERTIES OF Y-AXIS TICK LABELS
daily_plot.legend_ = None
daily_plot.tick_params(axis='x', pad=5)

In [71]:

day_of_week_plot = weekday_count['Number_of_tweets'].plot(kind='bar')
xticks(np.arange(7), weekday_count['day'], rotation = 0, fontsize = 9)

Out[71]:

([<matplotlib.axis.XTick at 0x10d007210>,
  <matplotlib.axis.XTick at 0x10d007650>,
  <matplotlib.axis.XTick at 0x117ef5e50>,
  <matplotlib.axis.XTick at 0x10fd05390>,
  <matplotlib.axis.XTick at 0x10fd05b10>,
  <matplotlib.axis.XTick at 0x10fd0c2d0>,
  <matplotlib.axis.XTick at 0x10fd0ca50>],
 <a list of 7 Text xticklabel objects>)

Finally, let's run it in fivethirtyeight style, named after Nate Silver's statistics site, http://fivethirtyeight.com/

In [72]:

mpl.style.use('fivethirtyeight')

In [73]:

daily_plot = daily_count['Number_of_tweets'].plot(kind='line', lw=1, alpha=0.75, legend=True, x_compat=True)
daily_plot.set_xlabel('Month', weight='bold', labelpad=15)    #SET X-AXIS LABEL; ADD PADDING TO TOP OF LABEL
daily_plot.set_ylabel('# Tweets (Messages)', weight='bold', labelpad=15) #SET Y-AXIS LABEL; ADD PADDING TO RIGHT OF LABEL
xticks(fontsize = 9, rotation = -30, ha ="left")  #SET FONT PROPERTIES OF X-AXIS TICK LABELS
yticks(fontsize = 9)                              #SET FONT PROPERTIES OF Y-AXIS TICK LABELS
daily_plot.legend_ = None
daily_plot.tick_params(axis='x', pad=5)

In [74]:

day_of_week_plot = weekday_count['Number_of_tweets'].plot(kind='bar')
xticks(np.arange(7), weekday_count['day'], rotation = 0, fontsize = 9)

Out[74]:

([<matplotlib.axis.XTick at 0x117f16d10>,
  <matplotlib.axis.XTick at 0x10daf81d0>,
  <matplotlib.axis.XTick at 0x10dc095d0>,
  <matplotlib.axis.XTick at 0x10dc30ad0>,
  <matplotlib.axis.XTick at 0x10dc3c290>,
  <matplotlib.axis.XTick at 0x10dc3ca10>,
  <matplotlib.axis.XTick at 0x10dc461d0>],
 <a list of 7 Text xticklabel objects>)
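
The cells above repeat the same plotting code once per style. One way to shorten this, sketched here under a few assumptions, is to wrap the plot in a function and loop over the style names, using `plt.style.context` so each style applies only to its own figure. The function `plot_daily` and the `demo_daily` data are hypothetical stand-ins, not part of the original notebook.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for daily_count: one invented value per day of 2013
dates = pd.date_range('2013-01-01', periods=365, freq='D')
demo_daily = pd.DataFrame({'Number_of_tweets': np.arange(365) % 50}, index=dates)

def plot_daily(df, style_name):
    """Draw the daily line plot under one style and return its Axes."""
    # The style is active only inside this "with" block
    with plt.style.context(style_name):
        fig, ax = plt.subplots(figsize=(15, 5))
        df['Number_of_tweets'].plot(ax=ax, kind='line', lw=1, alpha=0.75, x_compat=True)
        ax.set_title(style_name)
        ax.set_xlabel('Month', weight='bold', labelpad=15)
        ax.set_ylabel('# Tweets (Messages)', weight='bold', labelpad=15)
        ax.tick_params(axis='x', pad=5)
    return ax

# Keep only the styles that this Matplotlib installation actually provides
candidates = ['dark_background', 'bmh', 'grayscale', 'ggplot', 'fivethirtyeight']
styles_to_try = [s for s in candidates if s in plt.style.available]

# One figure per style, keyed by style name
axes_by_style = {s: plot_daily(demo_daily, s) for s in styles_to_try}
```

Since the style is scoped to the `with` block, plots created afterwards fall back to whatever style was active before the loop.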

OK, we have covered a few important steps in this tutorial. We have created a number of datasets that aggregate the 41 accounts' 2013 tweets by different time periods -- by second, minute, hour, day, month, and date. Along the way, you have been introduced to PANDAS' powerful indexing capabilities. We have also explored additional options and styles for plotting the data. In the next tutorial we will cover various analyses of the hashtags included in the collection of tweets.

For more Notebooks as well as additional Python and Big Data tutorials, please visit http://social-metrics.org or follow me on Twitter @gregorysaxton.