In this notebook I will show you how to run descriptive statistics for your dataset and save the output. The desired end product is a CSV table of key summary statistics -- count, mean, std. dev., min. and max -- for the variables in your dataset.

Import packages

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series
import statsmodels #FOR NEXT STEP -- RUNNING REGRESSIONS 
import statsmodels.api as sm
import statsmodels.formula.api as smf   #FOR USING 'R'-STYLE FORMULAS FOR REGRESSIONS


PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include this line at the top of all my notebooks.

In [2]:
#Set PANDAS to show all columns in DataFrame
pd.set_option('display.max_columns', None)


I like suppressing scientific notation in my numbers. So, if you'd rather see "0.48" than "4.800000e-01", then run the following line. Note that this does not change the actual values. For outputting to CSV we'll have to run some additional code later on.

In [3]:
pd.set_option('display.float_format', lambda x: '%.2f' % x)


I'm running PANDAS 0.13 here.

In [4]:
print pd.__version__
print statsmodels.__version__
0.13.1
0.6.1

Read in dataframe

I'm using data on a sample of 1,500 Facebook statuses of a sample of US-based health organizations

In [81]:
df = pd.read_excel('fb.xls', 'Sheet1', header=0)
print len(df)
df.head(2)
1500
Out[81]:
No. id feed_id org_name FB_org_id location link message_id org_id status_id status_link content published_date date_inserted last_comment type status_type video_source picture_link link_name link_caption link_description num_mentions mentions num_likes like_count comment_count share_count hashtag_count hashtags mentions_count actions application properties message_output time_since_post_days like_count_7days comment_count_7days share_count_7days time_since_post_14days like_count_14days comment_count_14days share_count_14days feed_organization images urls_count urls_count_true mentions_scrape source mission/non-mission mission_focus I-C-A subcode source_External source_Internal mission_N mission_Y mission_focus_Building capacity mission_focus_Patient advocacy mission_focus_Prevention ICA_A ICA_C ICA_I ICA_subcode_AIDS-related day ICA_subcode_Awareness ICA_subcode_Complementary support ICA_subcode_Cover/profile photo ICA_subcode_Dialogue ICA_subcode_Donation ICA_subcode_Event info ICA_subcode_Event promotion ICA_subcode_Event update ICA_subcode_Get tested ICA_subcode_HIV/AIDS info/news ICA_subcode_Idea promotion ICA_subcode_LGBT/Transgender information ICA_subcode_Lobbying ICA_subcode_Media action ICA_subcode_Medication ICA_subcode_National holiday/Holiday ICA_subcode_Organizational news/announcement ICA_subcode_Other ICA_subcode_Patient stories ICA_subcode_Recognition ICA_subcode_Research/survey ICA_subcode_Viewing action ICA_subcode_Volunteer type_event type_link type_music type_photo type_status type_video status_type_added_photos status_type_added_video status_type_created_event status_type_created_note status_type_mobile_status_update status_type_published_story status_type_shared_story video_dummy picture_dummy Org_ID Org_Name Total_Revenue Assets followers_count talking_about_count were_here_count Log_of_Assets Log_of_Total_Revenue Log_of_Followers likes_binary comment_binary share_binary
0 1118 74 1 AIDS Healthcare Foundation 34661411150 {'location': {'city': 'Los Angeles', 'zip': '9... https://www.facebook.com/AIDShealth/photos/a.1... 34661411150_10152940344646151 34661411150 10152940344646100 https://www.facebook.com/34661411150/posts/101... 'TIS THE SEASON: We know you've heard us say, ... 2014-12-18T05:19:30+0000 2015-03-09 00:00:39 2014-12-20T13:33:45+0000 photo added_photos NaN https://scontent.xx.fbcdn.net/hphotos-xpf1/v/t... NaN NaN NaN 1.00 Art Hearts Fashion nan nan nan nan 1 UseACondom nan [{'link': 'https://www.facebook.com/3466141115... NaN NaN {'picture': 'https://scontent.xx.fbcdn.net/hph... nan nan nan nan 81.00 101 3 6 AIDS HEALTHCARE FOUNDATION nan nan nan nan Internal Y Building capacity A Event promotion 0 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 AIDS HEALTHCARE FOUNDATION 209986539 227735541 721419 896 3891 19.24 19.16 13.49 1 1 1
1 1262 79 1 AIDS Healthcare Foundation 34661411150 NaN http://www.frontiersmedia.com/frontiers-blog/2... 34661411150_10152926215086151 34661411150 10152926215086100 https://www.facebook.com/34661411150/posts/101... AHF mourns the tragic loss of our friend and c... 2014-12-11T18:02:10+0000 2015-03-09 00:00:40 2014-12-14T16:41:29+0000 link shared_story NaN https://fbexternal-a.akamaihd.net/safe_image.p... Dana Miller—Producer, AIDS Advocate, Frontie... frontiersmedia.com “Dana was legendary in gay Los Angeles and H... 0.00 NaN nan nan nan nan 1 DanaMiller nan [{'link': 'https://www.facebook.com/3466141115... NaN NaN {'picture': 'https://fbexternal-a.akamaihd.net... nan nan nan nan 88.00 78 9 4 AIDS HEALTHCARE FOUNDATION nan nan nan nan External Y Building capacity C Recognition 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 AIDS HEALTHCARE FOUNDATION 209986539 227735541 721419 896 3891 19.24 19.16 13.49 1 1 1

2 rows × 115 columns


List all the columns in the DataFrame

In [82]:
df.columns
Out[82]:
Index([u'No.', u'id', u'feed_id', u'org_name', u'FB_org_id', u'location', u'link', u'message_id', u'org_id', u'status_id', u'status_link', u'content', u'published_date', u'date_inserted', u'last_comment', u'type', u'status_type', u'video_source', u'picture_link', u'link_name', u'link_caption', u'link_description', u'num_mentions', u'mentions', u'num_likes', u'like_count', u'comment_count', u'share_count', u'hashtag_count', u'hashtags', u'mentions_count', u'actions', u'application', u'properties', u'message_output', u'time_since_post_days', u'like_count_7days', u'comment_count_7days', u'share_count_7days', u'time_since_post_14days', u'like_count_14days', u'comment_count_14days', u'share_count_14days', u'feed_organization', u'images', u'urls_count', u'urls_count_true', u'mentions_scrape', u'source', u'mission/non-mission', u'mission_focus', u'I-C-A', u'subcode', u'source_External', u'source_Internal', u'mission_N', u'mission_Y', u'mission_focus_Building capacity', u'mission_focus_Patient advocacy', u'mission_focus_Prevention', u'ICA_A', u'ICA_C', u'ICA_I', u'ICA_subcode_AIDS-related day', u'ICA_subcode_Awareness', u'ICA_subcode_Complementary support', u'ICA_subcode_Cover/profile photo', u'ICA_subcode_Dialogue', u'ICA_subcode_Donation', u'ICA_subcode_Event info', u'ICA_subcode_Event promotion', u'ICA_subcode_Event update', u'ICA_subcode_Get tested', u'ICA_subcode_HIV/AIDS info/news', u'ICA_subcode_Idea promotion', u'ICA_subcode_LGBT/Transgender information', u'ICA_subcode_Lobbying', u'ICA_subcode_Media action', u'ICA_subcode_Medication', u'ICA_subcode_National holiday/Holiday', u'ICA_subcode_Organizational news/announcement', u'ICA_subcode_Other', u'ICA_subcode_Patient stories', u'ICA_subcode_Recognition', u'ICA_subcode_Research/survey', u'ICA_subcode_Viewing action', u'ICA_subcode_Volunteer', u'type_event', u'type_link', u'type_music', u'type_photo', u'type_status', u'type_video', u'status_type_added_photos', u'status_type_added_video', u'status_type_created_event', u'status_type_created_note', u'status_type_mobile_status_update', u'status_type_published_story', u'status_type_shared_story', ...], dtype='object')


You might not want to include all of your variables in the summary statistics table. When you're dealing with a dataset with a lot of columns, I find the easiest way is to output the column names to a list, copy and paste the output into a text editor, do a replace all command to delete the all the u's before each column name, and then paste back into iPython and create your desired sub-set.

In [56]:
df.columns.tolist()
Out[56]:
[u'No.',
 u'id',
 u'feed_id',
 u'org_name',
 u'FB_org_id',
 u'location',
 u'link',
 u'message_id',
 u'org_id',
 u'status_id',
 u'status_link',
 u'content',
 u'published_date',
 u'date_inserted',
 u'last_comment',
 u'type',
 u'status_type',
 u'video_source',
 u'picture_link',
 u'link_name',
 u'link_caption',
 u'link_description',
 u'num_mentions',
 u'mentions',
 u'num_likes',
 u'like_count',
 u'comment_count',
 u'share_count',
 u'hashtag_count',
 u'hashtags',
 u'mentions_count',
 u'actions',
 u'application',
 u'properties',
 u'message_output',
 u'time_since_post_days',
 u'like_count_7days',
 u'comment_count_7days',
 u'share_count_7days',
 u'time_since_post_14days',
 u'like_count_14days',
 u'comment_count_14days',
 u'share_count_14days',
 u'feed_organization',
 u'images',
 u'urls_count',
 u'urls_count_true',
 u'mentions_scrape',
 u'source',
 u'mission/non-mission',
 u'mission_focus',
 u'I-C-A',
 u'subcode',
 u'source_External',
 u'source_Internal',
 u'mission_N',
 u'mission_Y',
 u'mission_focus_Building capacity',
 u'mission_focus_Patient advocacy',
 u'mission_focus_Prevention',
 u'ICA_A',
 u'ICA_C',
 u'ICA_I',
 u'ICA_subcode_AIDS-related day',
 u'ICA_subcode_Awareness',
 u'ICA_subcode_Complementary support',
 u'ICA_subcode_Cover/profile photo',
 u'ICA_subcode_Dialogue',
 u'ICA_subcode_Donation',
 u'ICA_subcode_Event info',
 u'ICA_subcode_Event promotion',
 u'ICA_subcode_Event update',
 u'ICA_subcode_Get tested',
 u'ICA_subcode_HIV/AIDS info/news',
 u'ICA_subcode_Idea promotion',
 u'ICA_subcode_LGBT/Transgender information',
 u'ICA_subcode_Lobbying',
 u'ICA_subcode_Media action',
 u'ICA_subcode_Medication',
 u'ICA_subcode_National holiday/Holiday',
 u'ICA_subcode_Organizational news/announcement',
 u'ICA_subcode_Other',
 u'ICA_subcode_Patient stories',
 u'ICA_subcode_Recognition',
 u'ICA_subcode_Research/survey',
 u'ICA_subcode_Viewing action',
 u'ICA_subcode_Volunteer',
 u'type_event',
 u'type_link',
 u'type_music',
 u'type_photo',
 u'type_status',
 u'type_video',
 u'status_type_added_photos',
 u'status_type_added_video',
 u'status_type_created_event',
 u'status_type_created_note',
 u'status_type_mobile_status_update',
 u'status_type_published_story',
 u'status_type_shared_story',
 u'video_dummy',
 u'picture_dummy',
 u'Org_ID',
 u'Org_Name',
 u'Total_Revenue',
 u'Assets',
 u'followers_count',
 u'talking_about_count',
 u'were_here_count',
 u'Log_of_Assets',
 u'Log_of_Total_Revenue',
 u'Log_of_Followers',
 u'likes_binary',
 u'comment_binary',
 u'share_binary']


I've copy and pasted the above output into TextWrangler, omitted all the 'u's, and selected the columns I want. I will now limit the dataframe to just those columns I want.

In [84]:
df = df[['hashtag_count', 'like_count_14days', 'comment_count_14days', 'share_count_14days',
 'source', 'source_External', 'video_dummy', 'picture_dummy',
 'Total_Revenue', 'Log_of_Total_Revenue', 'followers_count',  'Log_of_Followers']]


As you can see, you now have a dataframe with only 12 columns (variables)

In [85]:
print len(df) 
print len(df.columns)
df.head(2)
1500
12
Out[85]:
hashtag_count like_count_14days comment_count_14days share_count_14days source source_External video_dummy picture_dummy Total_Revenue Log_of_Total_Revenue followers_count Log_of_Followers
0 1 101 3 6 Internal 0 0 1 209986539 19.16 721419 13.49
1 1 78 9 4 External 1 0 1 209986539 19.16 721419 13.49

2 rows × 12 columns

Generate Summary Statistics

This is the basic way to produce summary statistics for all variables in your dataframe

In [86]:
df.describe()
Out[86]:
hashtag_count like_count_14days comment_count_14days share_count_14days source_External video_dummy picture_dummy Total_Revenue Log_of_Total_Revenue followers_count Log_of_Followers
count 1500.00 1500.00 1500.00 1500.00 1500.00 1500.00 1500.00 1500.00 1500.00 1500.00 1500.00
mean 0.48 30.59 0.59 3.11 0.42 0.03 0.88 11486117.19 15.36 20150.46 7.79
std 1.12 355.34 2.62 14.95 0.49 0.18 0.33 31061197.86 1.12 107022.51 1.34
min 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1015497.00 13.83 45.00 3.81
25% 0.00 2.00 0.00 0.00 0.00 0.00 1.00 1954081.00 14.49 1089.50 6.99
50% 0.00 5.00 0.00 0.00 0.00 0.00 1.00 4179577.00 15.25 2134.00 7.67
75% 0.00 12.00 0.00 1.00 1.00 0.00 1.00 10633968.00 16.18 3481.00 8.16
max 10.00 11004.00 59.00 219.00 1.00 1.00 1.00 209986539.00 19.16 721419.00 13.49

8 rows × 11 columns


If you'd like to see the help for the describe function

In [ ]:
DataFrame.describe?


Use the dir function to get an alphabetical listing of valid names (attributes) in an object.

In [93]:
dir(df.describe())


CHANGE TO TWO DECIMALS (n.b. - This step is not necessary if you have run the display.float_format command earlier)

In [88]:
np.round(df.describe(), 2)
Out[88]:
hashtag_count like_count_14days comment_count_14days share_count_14days source_External video_dummy picture_dummy Total_Revenue Log_of_Total_Revenue followers_count Log_of_Followers
count 1500.00 1500.00 1500.00 1500.00 1500.00 1500.00 1500.00 1500.00 1500.00 1500.00 1500.00
mean 0.48 30.59 0.59 3.11 0.42 0.03 0.88 11486117.19 15.36 20150.46 7.79
std 1.12 355.34 2.62 14.95 0.49 0.18 0.33 31061197.86 1.12 107022.51 1.34
min 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1015497.00 13.83 45.00 3.81
25% 0.00 2.00 0.00 0.00 0.00 0.00 1.00 1954081.00 14.49 1089.50 6.99
50% 0.00 5.00 0.00 0.00 0.00 0.00 1.00 4179577.00 15.25 2134.00 7.67
75% 0.00 12.00 0.00 1.00 1.00 0.00 1.00 10633968.00 16.18 3481.00 8.16
max 10.00 11004.00 59.00 219.00 1.00 1.00 1.00 209986539.00 19.16 721419.00 13.49

8 rows × 11 columns


NOW LET'S TRANSPOSE THE OUTPUT -- necessary for a more typical social scientific presentation of the data

In [89]:
np.round(df.describe(), 2).T
Out[89]:
count mean std min 25% 50% 75% max
hashtag_count 1500.00 0.48 1.12 0.00 0.00 0.00 0.00 10.00
like_count_14days 1500.00 30.59 355.34 0.00 2.00 5.00 12.00 11004.00
comment_count_14days 1500.00 0.59 2.62 0.00 0.00 0.00 0.00 59.00
share_count_14days 1500.00 3.11 14.95 0.00 0.00 0.00 1.00 219.00
source_External 1500.00 0.42 0.49 0.00 0.00 0.00 1.00 1.00
video_dummy 1500.00 0.03 0.18 0.00 0.00 0.00 0.00 1.00
picture_dummy 1500.00 0.88 0.33 0.00 1.00 1.00 1.00 1.00
Total_Revenue 1500.00 11486117.19 31061197.86 1015497.00 1954081.00 4179577.00 10633968.00 209986539.00
Log_of_Total_Revenue 1500.00 15.36 1.12 13.83 14.49 15.25 16.18 19.16
followers_count 1500.00 20150.46 107022.51 45.00 1089.50 2134.00 3481.00 721419.00
Log_of_Followers 1500.00 7.79 1.34 3.81 6.99 7.67 8.16 13.49

11 rows × 8 columns


We won't typically want the percentile columns in a social scientific publication. In version 0.16 of PANDAS, you can use 'percentiles=None' with the describe command to omit the percentiles. In PANDAS 0.13 we can instead select only those columns we want, then output to CSV

In [90]:
np.round(df.describe(), 2).T[['count','mean', 'std', 'min', 'max']]
Out[90]:
count mean std min max
hashtag_count 1500.00 0.48 1.12 0.00 10.00
like_count_14days 1500.00 30.59 355.34 0.00 11004.00
comment_count_14days 1500.00 0.59 2.62 0.00 59.00
share_count_14days 1500.00 3.11 14.95 0.00 219.00
source_External 1500.00 0.42 0.49 0.00 1.00
video_dummy 1500.00 0.03 0.18 0.00 1.00
picture_dummy 1500.00 0.88 0.33 0.00 1.00
Total_Revenue 1500.00 11486117.19 31061197.86 1015497.00 209986539.00
Log_of_Total_Revenue 1500.00 15.36 1.12 13.83 19.16
followers_count 1500.00 20150.46 107022.51 45.00 721419.00
Log_of_Followers 1500.00 7.79 1.34 3.81 13.49

11 rows × 5 columns

In [92]:
#ALTERNATIVE WAY OF WRITING
np.round(df.describe(), 2).transpose()
Out[92]:
count mean std min 25% 50% 75% max
hashtag_count 1500.00 0.48 1.12 0.00 0.00 0.00 0.00 10.00
like_count_14days 1500.00 30.59 355.34 0.00 2.00 5.00 12.00 11004.00
comment_count_14days 1500.00 0.59 2.62 0.00 0.00 0.00 0.00 59.00
share_count_14days 1500.00 3.11 14.95 0.00 0.00 0.00 1.00 219.00
source_External 1500.00 0.42 0.49 0.00 0.00 0.00 1.00 1.00
video_dummy 1500.00 0.03 0.18 0.00 0.00 0.00 0.00 1.00
picture_dummy 1500.00 0.88 0.33 0.00 1.00 1.00 1.00 1.00
Total_Revenue 1500.00 11486117.19 31061197.86 1015497.00 1954081.00 4179577.00 10633968.00 209986539.00
Log_of_Total_Revenue 1500.00 15.36 1.12 13.83 14.49 15.25 16.18 19.16
followers_count 1500.00 20150.46 107022.51 45.00 1089.50 2134.00 3481.00 721419.00
Log_of_Followers 1500.00 7.79 1.34 3.81 6.99 7.67 8.16 13.49

11 rows × 8 columns

SAVE THE OUTPUT OF THE TABLE AS A CSV FILE


We can use the above commands and output to CSV

In [41]:
#WITH FOUR DECIMAL PLACES (DEFAULT)
df.describe().transpose().to_csv('summary stats.csv', sep=',')


For a typical social scientific publication, we would not need the percentile columns. In version 0.16 of PANDAS, you can use 'percentiles=None' with the describe command to omit the percentiles. In PANDAS 0.13 we can instead select only those columns we want, then output to CSV

In [42]:
df.describe().transpose()[['count','mean', 'std', 'min', 'max']].to_csv('summary stats.csv', sep=',')


The problem with the above output is that more than 2 decimal places are showing. If you want only two, then run the following version.

In [43]:
#WITH TWO DECIMAL PLACES
np.round(df.describe(), 2).T[['count','mean', 'std', 'min', 'max']].to_csv('summary stats.csv', sep=',')


Now you have a CSV file containing the columns you'll need for a typical Summary Statistics or Descriptive Statistics table for a submission to a social science journal. You likely won't want all of the columns in the final table, so I would probably open up the CSV file in Excel, delete unwanted variables, then copy and paste into Word. At that point you just need some formatting for aesthetics. If you do want to select which specific variables to include, you can specify the columns like this.

In [55]:
cols = ['hashtag_count','like_count_14days']
np.round(df[cols].describe(), 2).T[['count','mean', 'std', 'min', 'max']].to_csv('summary stats (partial).csv', sep=',')


For more Notebooks as well as additional Python and Big Data tutorials, please visit http://social-metrics.org or follow me on Twitter @gregorysaxton