In this notebook I will show you how to run descriptive statistics for your dataset and save the output. The desired end product is a CSV table of key summary statistics -- count, mean, std. dev., min. and max -- for the variables in your dataset.
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series
import statsmodels #FOR NEXT STEP -- RUNNING REGRESSIONS
import statsmodels.api as sm
import statsmodels.formula.api as smf #FOR USING 'R'-STYLE FORMULAS FOR REGRESSIONS
PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include this line at the top of all my notebooks.
#Set PANDAS to show all columns in DataFrame
pd.set_option('display.max_columns', None)
I like suppressing scientific notation in my numbers. So, if you'd rather see "0.48" than "4.800000e-01", then run the following line. Note that this does not change the actual values. For outputting to CSV we'll have to run some additional code later on.
pd.set_option('display.float_format', lambda x: '%.2f' % x)
I'm running PANDAS 0.13 here.
print pd.__version__
print statsmodels.__version__
0.13.1 0.6.1
I'm using data on a sample of 1,500 Facebook statuses of a sample of US-based health organizations
df = pd.read_excel('fb.xls', 'Sheet1', header=0)
print len(df)
df.head(2)
1500
No. | id | feed_id | org_name | FB_org_id | location | link | message_id | org_id | status_id | status_link | content | published_date | date_inserted | last_comment | type | status_type | video_source | picture_link | link_name | link_caption | link_description | num_mentions | mentions | num_likes | like_count | comment_count | share_count | hashtag_count | hashtags | mentions_count | actions | application | properties | message_output | time_since_post_days | like_count_7days | comment_count_7days | share_count_7days | time_since_post_14days | like_count_14days | comment_count_14days | share_count_14days | feed_organization | images | urls_count | urls_count_true | mentions_scrape | source | mission/non-mission | mission_focus | I-C-A | subcode | source_External | source_Internal | mission_N | mission_Y | mission_focus_Building capacity | mission_focus_Patient advocacy | mission_focus_Prevention | ICA_A | ICA_C | ICA_I | ICA_subcode_AIDS-related day | ICA_subcode_Awareness | ICA_subcode_Complementary support | ICA_subcode_Cover/profile photo | ICA_subcode_Dialogue | ICA_subcode_Donation | ICA_subcode_Event info | ICA_subcode_Event promotion | ICA_subcode_Event update | ICA_subcode_Get tested | ICA_subcode_HIV/AIDS info/news | ICA_subcode_Idea promotion | ICA_subcode_LGBT/Transgender information | ICA_subcode_Lobbying | ICA_subcode_Media action | ICA_subcode_Medication | ICA_subcode_National holiday/Holiday | ICA_subcode_Organizational news/announcement | ICA_subcode_Other | ICA_subcode_Patient stories | ICA_subcode_Recognition | ICA_subcode_Research/survey | ICA_subcode_Viewing action | ICA_subcode_Volunteer | type_event | type_link | type_music | type_photo | type_status | type_video | status_type_added_photos | status_type_added_video | status_type_created_event | status_type_created_note | status_type_mobile_status_update | status_type_published_story | status_type_shared_story | video_dummy | picture_dummy | Org_ID | Org_Name | Total_Revenue | Assets | followers_count | talking_about_count | were_here_count | Log_of_Assets | Log_of_Total_Revenue | Log_of_Followers | likes_binary | comment_binary | share_binary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1118 | 74 | 1 | AIDS Healthcare Foundation | 34661411150 | {'location': {'city': 'Los Angeles', 'zip': '9... | https://www.facebook.com/AIDShealth/photos/a.1... | 34661411150_10152940344646151 | 34661411150 | 10152940344646100 | https://www.facebook.com/34661411150/posts/101... | 'TIS THE SEASON: We know you've heard us say, ... | 2014-12-18T05:19:30+0000 | 2015-03-09 00:00:39 | 2014-12-20T13:33:45+0000 | photo | added_photos | NaN | https://scontent.xx.fbcdn.net/hphotos-xpf1/v/t... | NaN | NaN | NaN | 1.00 | Art Hearts Fashion | nan | nan | nan | nan | 1 | UseACondom | nan | [{'link': 'https://www.facebook.com/3466141115... | NaN | NaN | {'picture': 'https://scontent.xx.fbcdn.net/hph... | nan | nan | nan | nan | 81.00 | 101 | 3 | 6 | AIDS HEALTHCARE FOUNDATION | nan | nan | nan | nan | Internal | Y | Building capacity | A | Event promotion | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | AIDS HEALTHCARE FOUNDATION | 209986539 | 227735541 | 721419 | 896 | 3891 | 19.24 | 19.16 | 13.49 | 1 | 1 | 1 |
1 | 1262 | 79 | 1 | AIDS Healthcare Foundation | 34661411150 | NaN | http://www.frontiersmedia.com/frontiers-blog/2... | 34661411150_10152926215086151 | 34661411150 | 10152926215086100 | https://www.facebook.com/34661411150/posts/101... | AHF mourns the tragic loss of our friend and c... | 2014-12-11T18:02:10+0000 | 2015-03-09 00:00:40 | 2014-12-14T16:41:29+0000 | link | shared_story | NaN | https://fbexternal-a.akamaihd.net/safe_image.p... | Dana Miller—Producer, AIDS Advocate, Frontie... | frontiersmedia.com | “Dana was legendary in gay Los Angeles and H... | 0.00 | NaN | nan | nan | nan | nan | 1 | DanaMiller | nan | [{'link': 'https://www.facebook.com/3466141115... | NaN | NaN | {'picture': 'https://fbexternal-a.akamaihd.net... | nan | nan | nan | nan | 88.00 | 78 | 9 | 4 | AIDS HEALTHCARE FOUNDATION | nan | nan | nan | nan | External | Y | Building capacity | C | Recognition | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | AIDS HEALTHCARE FOUNDATION | 209986539 | 227735541 | 721419 | 896 | 3891 | 19.24 | 19.16 | 13.49 | 1 | 1 | 1 |
2 rows × 115 columns
List all the columns in the DataFrame
df.columns
Index([u'No.', u'id', u'feed_id', u'org_name', u'FB_org_id', u'location', u'link', u'message_id', u'org_id', u'status_id', u'status_link', u'content', u'published_date', u'date_inserted', u'last_comment', u'type', u'status_type', u'video_source', u'picture_link', u'link_name', u'link_caption', u'link_description', u'num_mentions', u'mentions', u'num_likes', u'like_count', u'comment_count', u'share_count', u'hashtag_count', u'hashtags', u'mentions_count', u'actions', u'application', u'properties', u'message_output', u'time_since_post_days', u'like_count_7days', u'comment_count_7days', u'share_count_7days', u'time_since_post_14days', u'like_count_14days', u'comment_count_14days', u'share_count_14days', u'feed_organization', u'images', u'urls_count', u'urls_count_true', u'mentions_scrape', u'source', u'mission/non-mission', u'mission_focus', u'I-C-A', u'subcode', u'source_External', u'source_Internal', u'mission_N', u'mission_Y', u'mission_focus_Building capacity', u'mission_focus_Patient advocacy', u'mission_focus_Prevention', u'ICA_A', u'ICA_C', u'ICA_I', u'ICA_subcode_AIDS-related day', u'ICA_subcode_Awareness', u'ICA_subcode_Complementary support', u'ICA_subcode_Cover/profile photo', u'ICA_subcode_Dialogue', u'ICA_subcode_Donation', u'ICA_subcode_Event info', u'ICA_subcode_Event promotion', u'ICA_subcode_Event update', u'ICA_subcode_Get tested', u'ICA_subcode_HIV/AIDS info/news', u'ICA_subcode_Idea promotion', u'ICA_subcode_LGBT/Transgender information', u'ICA_subcode_Lobbying', u'ICA_subcode_Media action', u'ICA_subcode_Medication', u'ICA_subcode_National holiday/Holiday', u'ICA_subcode_Organizational news/announcement', u'ICA_subcode_Other', u'ICA_subcode_Patient stories', u'ICA_subcode_Recognition', u'ICA_subcode_Research/survey', u'ICA_subcode_Viewing action', u'ICA_subcode_Volunteer', u'type_event', u'type_link', u'type_music', u'type_photo', u'type_status', u'type_video', u'status_type_added_photos', u'status_type_added_video', u'status_type_created_event', u'status_type_created_note', u'status_type_mobile_status_update', u'status_type_published_story', u'status_type_shared_story', ...], dtype='object')
You might not want to include all of your variables in the summary statistics table. When you're dealing with a dataset with a lot of columns, I find the easiest way is to output the column names to a list, copy and paste the output into a text editor, do a replace all command to delete the all the u's before each column name, and then paste back into iPython and create your desired sub-set.
df.columns.tolist()
[u'No.', u'id', u'feed_id', u'org_name', u'FB_org_id', u'location', u'link', u'message_id', u'org_id', u'status_id', u'status_link', u'content', u'published_date', u'date_inserted', u'last_comment', u'type', u'status_type', u'video_source', u'picture_link', u'link_name', u'link_caption', u'link_description', u'num_mentions', u'mentions', u'num_likes', u'like_count', u'comment_count', u'share_count', u'hashtag_count', u'hashtags', u'mentions_count', u'actions', u'application', u'properties', u'message_output', u'time_since_post_days', u'like_count_7days', u'comment_count_7days', u'share_count_7days', u'time_since_post_14days', u'like_count_14days', u'comment_count_14days', u'share_count_14days', u'feed_organization', u'images', u'urls_count', u'urls_count_true', u'mentions_scrape', u'source', u'mission/non-mission', u'mission_focus', u'I-C-A', u'subcode', u'source_External', u'source_Internal', u'mission_N', u'mission_Y', u'mission_focus_Building capacity', u'mission_focus_Patient advocacy', u'mission_focus_Prevention', u'ICA_A', u'ICA_C', u'ICA_I', u'ICA_subcode_AIDS-related day', u'ICA_subcode_Awareness', u'ICA_subcode_Complementary support', u'ICA_subcode_Cover/profile photo', u'ICA_subcode_Dialogue', u'ICA_subcode_Donation', u'ICA_subcode_Event info', u'ICA_subcode_Event promotion', u'ICA_subcode_Event update', u'ICA_subcode_Get tested', u'ICA_subcode_HIV/AIDS info/news', u'ICA_subcode_Idea promotion', u'ICA_subcode_LGBT/Transgender information', u'ICA_subcode_Lobbying', u'ICA_subcode_Media action', u'ICA_subcode_Medication', u'ICA_subcode_National holiday/Holiday', u'ICA_subcode_Organizational news/announcement', u'ICA_subcode_Other', u'ICA_subcode_Patient stories', u'ICA_subcode_Recognition', u'ICA_subcode_Research/survey', u'ICA_subcode_Viewing action', u'ICA_subcode_Volunteer', u'type_event', u'type_link', u'type_music', u'type_photo', u'type_status', u'type_video', u'status_type_added_photos', u'status_type_added_video', u'status_type_created_event', u'status_type_created_note', u'status_type_mobile_status_update', u'status_type_published_story', u'status_type_shared_story', u'video_dummy', u'picture_dummy', u'Org_ID', u'Org_Name', u'Total_Revenue', u'Assets', u'followers_count', u'talking_about_count', u'were_here_count', u'Log_of_Assets', u'Log_of_Total_Revenue', u'Log_of_Followers', u'likes_binary', u'comment_binary', u'share_binary']
I've copy and pasted the above output into TextWrangler, omitted all the 'u's, and selected the columns I want. I will now limit the dataframe to just those columns I want.
df = df[['hashtag_count', 'like_count_14days', 'comment_count_14days', 'share_count_14days',
'source', 'source_External', 'video_dummy', 'picture_dummy',
'Total_Revenue', 'Log_of_Total_Revenue', 'followers_count', 'Log_of_Followers']]
As you can see, you now have a dataframe with only 12 columns (variables)
print len(df)
print len(df.columns)
df.head(2)
1500 12
hashtag_count | like_count_14days | comment_count_14days | share_count_14days | source | source_External | video_dummy | picture_dummy | Total_Revenue | Log_of_Total_Revenue | followers_count | Log_of_Followers | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 101 | 3 | 6 | Internal | 0 | 0 | 1 | 209986539 | 19.16 | 721419 | 13.49 |
1 | 1 | 78 | 9 | 4 | External | 1 | 0 | 1 | 209986539 | 19.16 | 721419 | 13.49 |
2 rows × 12 columns
This is the basic way to produce summary statistics for all variables in your dataframe
df.describe()
hashtag_count | like_count_14days | comment_count_14days | share_count_14days | source_External | video_dummy | picture_dummy | Total_Revenue | Log_of_Total_Revenue | followers_count | Log_of_Followers | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 1500.00 | 1500.00 | 1500.00 | 1500.00 | 1500.00 | 1500.00 | 1500.00 | 1500.00 | 1500.00 | 1500.00 | 1500.00 |
mean | 0.48 | 30.59 | 0.59 | 3.11 | 0.42 | 0.03 | 0.88 | 11486117.19 | 15.36 | 20150.46 | 7.79 |
std | 1.12 | 355.34 | 2.62 | 14.95 | 0.49 | 0.18 | 0.33 | 31061197.86 | 1.12 | 107022.51 | 1.34 |
min | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1015497.00 | 13.83 | 45.00 | 3.81 |
25% | 0.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1954081.00 | 14.49 | 1089.50 | 6.99 |
50% | 0.00 | 5.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 4179577.00 | 15.25 | 2134.00 | 7.67 |
75% | 0.00 | 12.00 | 0.00 | 1.00 | 1.00 | 0.00 | 1.00 | 10633968.00 | 16.18 | 3481.00 | 8.16 |
max | 10.00 | 11004.00 | 59.00 | 219.00 | 1.00 | 1.00 | 1.00 | 209986539.00 | 19.16 | 721419.00 | 13.49 |
8 rows × 11 columns
If you'd like to see the help for the describe function
DataFrame.describe?
Use the dir function to get an alphabetical listing of valid names (attributes) in an object.
dir(df.describe())
CHANGE TO TWO DECIMALS (n.b. - This step is not necessary if you have run the display.float_format command earlier)
np.round(df.describe(), 2)
hashtag_count | like_count_14days | comment_count_14days | share_count_14days | source_External | video_dummy | picture_dummy | Total_Revenue | Log_of_Total_Revenue | followers_count | Log_of_Followers | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 1500.00 | 1500.00 | 1500.00 | 1500.00 | 1500.00 | 1500.00 | 1500.00 | 1500.00 | 1500.00 | 1500.00 | 1500.00 |
mean | 0.48 | 30.59 | 0.59 | 3.11 | 0.42 | 0.03 | 0.88 | 11486117.19 | 15.36 | 20150.46 | 7.79 |
std | 1.12 | 355.34 | 2.62 | 14.95 | 0.49 | 0.18 | 0.33 | 31061197.86 | 1.12 | 107022.51 | 1.34 |
min | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1015497.00 | 13.83 | 45.00 | 3.81 |
25% | 0.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1954081.00 | 14.49 | 1089.50 | 6.99 |
50% | 0.00 | 5.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 4179577.00 | 15.25 | 2134.00 | 7.67 |
75% | 0.00 | 12.00 | 0.00 | 1.00 | 1.00 | 0.00 | 1.00 | 10633968.00 | 16.18 | 3481.00 | 8.16 |
max | 10.00 | 11004.00 | 59.00 | 219.00 | 1.00 | 1.00 | 1.00 | 209986539.00 | 19.16 | 721419.00 | 13.49 |
8 rows × 11 columns
NOW LET'S TRANSPOSE THE OUTPUT -- necessary for a more typical social scientific presentation of the data
np.round(df.describe(), 2).T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
hashtag_count | 1500.00 | 0.48 | 1.12 | 0.00 | 0.00 | 0.00 | 0.00 | 10.00 |
like_count_14days | 1500.00 | 30.59 | 355.34 | 0.00 | 2.00 | 5.00 | 12.00 | 11004.00 |
comment_count_14days | 1500.00 | 0.59 | 2.62 | 0.00 | 0.00 | 0.00 | 0.00 | 59.00 |
share_count_14days | 1500.00 | 3.11 | 14.95 | 0.00 | 0.00 | 0.00 | 1.00 | 219.00 |
source_External | 1500.00 | 0.42 | 0.49 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
video_dummy | 1500.00 | 0.03 | 0.18 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
picture_dummy | 1500.00 | 0.88 | 0.33 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 |
Total_Revenue | 1500.00 | 11486117.19 | 31061197.86 | 1015497.00 | 1954081.00 | 4179577.00 | 10633968.00 | 209986539.00 |
Log_of_Total_Revenue | 1500.00 | 15.36 | 1.12 | 13.83 | 14.49 | 15.25 | 16.18 | 19.16 |
followers_count | 1500.00 | 20150.46 | 107022.51 | 45.00 | 1089.50 | 2134.00 | 3481.00 | 721419.00 |
Log_of_Followers | 1500.00 | 7.79 | 1.34 | 3.81 | 6.99 | 7.67 | 8.16 | 13.49 |
11 rows × 8 columns
We won't typically want the percentile columns in a social scientific publication. In version 0.16 of PANDAS, you can use 'percentiles=None' with the describe command to omit the percentiles. In PANDAS 0.13 we can instead select only those columns we want, then output to CSV
np.round(df.describe(), 2).T[['count','mean', 'std', 'min', 'max']]
count | mean | std | min | max | |
---|---|---|---|---|---|
hashtag_count | 1500.00 | 0.48 | 1.12 | 0.00 | 10.00 |
like_count_14days | 1500.00 | 30.59 | 355.34 | 0.00 | 11004.00 |
comment_count_14days | 1500.00 | 0.59 | 2.62 | 0.00 | 59.00 |
share_count_14days | 1500.00 | 3.11 | 14.95 | 0.00 | 219.00 |
source_External | 1500.00 | 0.42 | 0.49 | 0.00 | 1.00 |
video_dummy | 1500.00 | 0.03 | 0.18 | 0.00 | 1.00 |
picture_dummy | 1500.00 | 0.88 | 0.33 | 0.00 | 1.00 |
Total_Revenue | 1500.00 | 11486117.19 | 31061197.86 | 1015497.00 | 209986539.00 |
Log_of_Total_Revenue | 1500.00 | 15.36 | 1.12 | 13.83 | 19.16 |
followers_count | 1500.00 | 20150.46 | 107022.51 | 45.00 | 721419.00 |
Log_of_Followers | 1500.00 | 7.79 | 1.34 | 3.81 | 13.49 |
11 rows × 5 columns
#ALTERNATIVE WAY OF WRITING
np.round(df.describe(), 2).transpose()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
hashtag_count | 1500.00 | 0.48 | 1.12 | 0.00 | 0.00 | 0.00 | 0.00 | 10.00 |
like_count_14days | 1500.00 | 30.59 | 355.34 | 0.00 | 2.00 | 5.00 | 12.00 | 11004.00 |
comment_count_14days | 1500.00 | 0.59 | 2.62 | 0.00 | 0.00 | 0.00 | 0.00 | 59.00 |
share_count_14days | 1500.00 | 3.11 | 14.95 | 0.00 | 0.00 | 0.00 | 1.00 | 219.00 |
source_External | 1500.00 | 0.42 | 0.49 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
video_dummy | 1500.00 | 0.03 | 0.18 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
picture_dummy | 1500.00 | 0.88 | 0.33 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 |
Total_Revenue | 1500.00 | 11486117.19 | 31061197.86 | 1015497.00 | 1954081.00 | 4179577.00 | 10633968.00 | 209986539.00 |
Log_of_Total_Revenue | 1500.00 | 15.36 | 1.12 | 13.83 | 14.49 | 15.25 | 16.18 | 19.16 |
followers_count | 1500.00 | 20150.46 | 107022.51 | 45.00 | 1089.50 | 2134.00 | 3481.00 | 721419.00 |
Log_of_Followers | 1500.00 | 7.79 | 1.34 | 3.81 | 6.99 | 7.67 | 8.16 | 13.49 |
11 rows × 8 columns
We can use the above commands and output to CSV
#WITH FOUR DECIMAL PLACES (DEFAULT)
df.describe().transpose().to_csv('summary stats.csv', sep=',')
For a typical social scientific publication, we would not need the percentile columns. In version 0.16 of PANDAS, you can use 'percentiles=None' with the describe command to omit the percentiles. In PANDAS 0.13 we can instead select only those columns we want, then output to CSV
df.describe().transpose()[['count','mean', 'std', 'min', 'max']].to_csv('summary stats.csv', sep=',')
The problem with the above output is that more than 2 decimal places are showing. If you want only two, then run the following version.
#WITH TWO DECIMAL PLACES
np.round(df.describe(), 2).T[['count','mean', 'std', 'min', 'max']].to_csv('summary stats.csv', sep=',')
Now you have a CSV file containing the columns you'll need for a typical Summary Statistics or Descriptive Statistics table for a submission to a social science journal. You likely won't want all of the columns in the final table, so I would probably open up the CSV file in Excel, delete unwanted variables, then copy and paste into Word. At that point you just need some formatting for aesthetics. If you do want to select which specific variables to include, you can specify the columns like this.
cols = ['hashtag_count','like_count_14days']
np.round(df[cols].describe(), 2).T[['count','mean', 'std', 'min', 'max']].to_csv('summary stats (partial).csv', sep=',')
For more Notebooks as well as additional Python and Big Data tutorials, please visit http://social-metrics.org or follow me on Twitter @gregorysaxton