This is the sixth in a series of notebooks designed to show you how to analyze social media data. For demonstration purposes we are looking at tweets sent by CSR-related Twitter accounts -- accounts related to ethics, equality, the environment, etc. -- of Fortune 200 firms in 2013. I assume you have already downloaded the data and have completed the steps taken in Chapter 1, Chapter 2, Chapter 3, Chapter 4, and Chapter 5. In this notebook I will show you how to run and then save for output your descriptive statistics. The desired end product is a CSV table of key summary statistics -- count, mean, std. dev., min., and max. -- for the variables in your dataset.

Also known as descriptive statistics, summary statistics are crucial for helping readers understand the nature of your data, especially for helping convey the range and dispersion of your data. A summary statistics table is mandatory in some journals and some disciplines whenever one is presenting statistical analyses of quantitative data. Below is an example of a summary statistics table from an article Chao Guo and I published last year on 150 nonprofit advocacy organizations' use of Twitter:

As you can see, the table shows the count, mean, std. dev., and minimum and maximum values for each quantitative variable in our analyses.

In [59]:
from IPython.display import Image
Image(width=750, filename='Descriptive Statistics Table.png') 
Out[59]:


Chapter 6: Producing a Summary Statistics Table

As per normal, we will first import several necessary Python packages and set some options for viewing the data. As with prior chapters, we will be using the Python Data Analysis Library, or PANDAS, extensively for our data manipulations.

Import packages and set viewing options

In [2]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series
In [3]:
#Set PANDAS to show all columns in DataFrame
pd.set_option('display.max_columns', None)

I'm using version 0.16.2 of PANDAS

In [5]:
pd.__version__
Out[5]:
'0.16.2'


I like suppressing scientific notation in my numbers. So, if you'd rather see "0.48" than "4.800000e-01", then run the following line. Note that this does not change the actual values. For outputting to CSV we'll have to run some additional code later on.

In [17]:
pd.set_option('display.float_format', lambda x: '%.2f' % x)
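
To convince yourself that the display option only changes what is printed, here is a minimal sketch on a made-up value (the Series `s` is hypothetical and not part of our dataset). It should run on recent versions of PANDAS as well as the one used here, though I have only reasoned through it, not tested it on every version.

```python
import pandas as pd

# Toy value standing in for a real statistic (hypothetical data).
s = pd.Series([0.48123456])

pd.set_option('display.float_format', lambda x: '%.2f' % x)
shown = repr(s)       # what the notebook would display
stored = s.iloc[0]    # what is actually held in memory

# Restore the default so this demonstration leaves no trace.
pd.reset_option('display.float_format')

print(shown)
print(stored)
```

The displayed text is shortened to two decimals while the stored value keeps all of its digits. If you run this sketch in the notebook itself, re-run the set_option line afterwards, because the reset at the end undoes it.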

Read in dataframe

In Chapter 4 we created a version of the dataframe that omitted all tweets that were retweets, allowing us to focus only on original messages sent by the 41 Twitter accounts. In Chapter 5 we then added 6 new variables to this dataset. Let's now open this saved file. As we can see in the operations below this dataframe contains 60 variables for 26,257 tweets.

In [32]:
df = pd.read_pickle('Original 2013 CSR Tweets with 3 binary variables.pkl')
print "# of variables in dataframe:", len(df.columns)
print  "# of tweets in dataframe:", len(df)
df.head(2)
# of variables in dataframe: 60
# of tweets in dataframe: 26257
Out[32]:
rowid query tweet_id_str inserted_date language coordinates retweeted_status created_at month year content from_user_screen_name from_user_id from_user_followers_count from_user_friends_count from_user_listed_count from_user_favourites_count from_user_statuses_count from_user_description from_user_location from_user_created_at retweet_count favorite_count entities_urls entities_urls_count entities_hashtags entities_hashtags_count entities_mentions entities_mentions_count in_reply_to_screen_name in_reply_to_status_id source entities_expanded_urls entities_media_count media_expanded_url media_url media_type video_link photo_link twitpic num_characters num_words retweeted_user retweeted_user_description retweeted_user_screen_name retweeted_user_followers_count retweeted_user_listed_count retweeted_user_statuses_count retweeted_user_location retweeted_tweet_created_at Fortune_2012_rank Company CSR_sustainability specific_project_initiative_area English RTs_binary favorites_binary hashtags_binary mentions_binary URLs_binary
0 67340 humanavitality 306897327585652736 2014-03-09 13:46:50.222857 en NaN NaN 2013-02-27 22:43:19.000000 2 2013 @louloushive (Tweet 2) We encourage other empl... humanavitality 274041023 2859 440 38 25 1766 This is the official Twitter account for Human... NaN Tue Mar 29 16:23:02 +0000 2011 0 0 NaN 0 NaN 0 louloushive 1 louloushive 306218267737989120.00 web NaN nan NaN NaN NaN 0 0 0 121 19 nan NaN NaN nan nan nan NaN NaN 79 Humana 0 1 1.00 0 0 0 1 0
1 39454 FundacionPfizer 308616393706844160 2014-03-09 13:38:20.679967 es NaN NaN 2013-03-04 16:34:17.000000 3 2013 ¿Sabes por qué la #vacuna contra la #neumonía ... FundacionPfizer 188384056 2464 597 50 11 2400 Noticias sobre Responsabilidad Social y Fundac... México Wed Sep 08 16:14:11 +0000 2010 1 0 NaN 0 vacuna, neumonía 2 NaN 0 NaN nan web NaN nan NaN NaN NaN 0 0 0 138 20 nan NaN NaN nan nan nan NaN NaN 40 Pfizer 0 1 0.00 1 0 1 0 0


List all the columns in the DataFrame

In [8]:
df.columns
Out[8]:
Index([u'rowid', u'query', u'tweet_id_str', u'inserted_date', u'language',
       u'coordinates', u'retweeted_status', u'created_at', u'month', u'year',
       u'content', u'from_user_screen_name', u'from_user_id',
       u'from_user_followers_count', u'from_user_friends_count',
       u'from_user_listed_count', u'from_user_favourites_count',
       u'from_user_statuses_count', u'from_user_description',
       u'from_user_location', u'from_user_created_at', u'retweet_count',
       u'favorite_count', u'entities_urls', u'entities_urls_count',
       u'entities_hashtags', u'entities_hashtags_count', u'entities_mentions',
       u'entities_mentions_count', u'in_reply_to_screen_name',
       u'in_reply_to_status_id', u'source', u'entities_expanded_urls',
       u'entities_media_count', u'media_expanded_url', u'media_url',
       u'media_type', u'video_link', u'photo_link', u'twitpic',
       u'num_characters', u'num_words', u'retweeted_user',
       u'retweeted_user_description', u'retweeted_user_screen_name',
       u'retweeted_user_followers_count', u'retweeted_user_listed_count',
       u'retweeted_user_statuses_count', u'retweeted_user_location',
       u'retweeted_tweet_created_at', u'Fortune_2012_rank', u'Company',
       u'CSR_sustainability', u'specific_project_initiative_area', u'English',
       u'RTs_binary', u'favorites_binary', u'hashtags_binary',
       u'mentions_binary', u'URLs_binary'],
      dtype='object')

Create Sub-Set of DataFrame with only Desired Variables

You might not want to include all of your variables in the summary statistics table. When you're dealing with a dataset with a lot of columns, I find the easiest way is to output the column names to a list, copy and paste the output into another cell, then delete the columns you don't want.

In [33]:
print df.columns.tolist()
['rowid', 'query', 'tweet_id_str', 'inserted_date', 'language', 'coordinates', 'retweeted_status', 'created_at', 'month', 'year', 'content', 'from_user_screen_name', 'from_user_id', 'from_user_followers_count', 'from_user_friends_count', 'from_user_listed_count', 'from_user_favourites_count', 'from_user_statuses_count', 'from_user_description', 'from_user_location', 'from_user_created_at', 'retweet_count', 'favorite_count', 'entities_urls', 'entities_urls_count', 'entities_hashtags', 'entities_hashtags_count', 'entities_mentions', 'entities_mentions_count', 'in_reply_to_screen_name', 'in_reply_to_status_id', 'source', 'entities_expanded_urls', 'entities_media_count', 'media_expanded_url', 'media_url', 'media_type', 'video_link', 'photo_link', 'twitpic', 'num_characters', 'num_words', 'retweeted_user', 'retweeted_user_description', 'retweeted_user_screen_name', 'retweeted_user_followers_count', 'retweeted_user_listed_count', 'retweeted_user_statuses_count', 'retweeted_user_location', 'retweeted_tweet_created_at', 'Fortune_2012_rank', 'Company', 'CSR_sustainability', 'specific_project_initiative_area', 'English', 'RTs_binary', 'favorites_binary', 'hashtags_binary', 'mentions_binary', 'URLs_binary']


I've copied and pasted the above output into the cell below and kept only a subset of the columns. Note the use of the single square brackets above to denote a list of column names but the double square brackets below. In PANDAS, passing a list of column names inside the indexing brackets (hence the double brackets) returns a dataframe containing just those columns; in the following line I am thus saying I want my dataframe df to be limited to the columns listed on the right-hand side of the assignment.

In [34]:
df = df[['content','from_user_screen_name','from_user_followers_count','from_user_listed_count','from_user_statuses_count','retweet_count','favorite_count','entities_urls_count','entities_hashtags_count','entities_mentions_count',
 'num_characters','Company', 'English','RTs_binary','favorites_binary','hashtags_binary','mentions_binary',
 'URLs_binary']]
print "# of variables in dataframe:", len(df.columns)
print  "# of tweets in dataframe:", len(df)
df.head(2)
# of variables in dataframe: 18
# of tweets in dataframe: 26257
Out[34]:
content from_user_screen_name from_user_followers_count from_user_listed_count from_user_statuses_count retweet_count favorite_count entities_urls_count entities_hashtags_count entities_mentions_count num_characters Company English RTs_binary favorites_binary hashtags_binary mentions_binary URLs_binary
0 @louloushive (Tweet 2) We encourage other empl... humanavitality 2859 38 1766 0 0 0 0 1 121 Humana 1.00 0 0 0 1 0
1 ¿Sabes por qué la #vacuna contra la #neumonía ... FundacionPfizer 2464 50 2400 1 0 0 2 0 138 Pfizer 0.00 1 0 1 0 0


As you can see above, we now have a dataframe with only 18 variables.
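
If the single-versus-double bracket distinction is still fuzzy, the following sketch on a tiny made-up dataframe (`toy` is hypothetical) shows what each form returns.

```python
import pandas as pd

# Hypothetical three-row dataframe, just for illustration.
toy = pd.DataFrame({'retweet_count': [0, 1, 5],
                    'Company': ['Humana', 'Pfizer', 'Humana']})

one_col = toy['retweet_count']     # a single name returns a Series
sub_df = toy[['retweet_count']]    # a list of names returns a DataFrame

print(type(one_col).__name__)
print(type(sub_df).__name__)
```

The second form is the one we want here, since the describe function should be applied to a dataframe of several columns.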

Generate Summary Statistics

The describe function is the basic way to produce summary statistics for all the numeric variables in your dataframe.

In [35]:
df.describe()
Out[35]:
from_user_followers_count from_user_listed_count from_user_statuses_count retweet_count favorite_count entities_urls_count entities_hashtags_count entities_mentions_count num_characters English RTs_binary favorites_binary hashtags_binary mentions_binary URLs_binary
count 26257.00 26257.00 26257.00 26257.00 26257.00 26257.00 26257.00 26257.00 26257.00 26257.00 26257.00 26257.00 26257.00 26257.00 26257.00
mean 48397.00 533.32 5190.04 3.83 1.52 0.63 1.01 0.96 111.11 0.97 0.55 0.33 0.63 0.61 0.62
std 103199.48 684.03 4004.43 42.94 15.28 0.50 1.01 1.14 29.89 0.17 0.50 0.47 0.48 0.49 0.49
min 58.00 8.00 9.00 0.00 0.00 0.00 0.00 0.00 9.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 4426.00 209.00 2311.00 0.00 0.00 0.00 0.00 0.00 94.00 1.00 0.00 0.00 0.00 0.00 0.00
50% 6401.00 253.00 4231.00 1.00 0.00 1.00 1.00 1.00 122.00 1.00 1.00 0.00 1.00 1.00 1.00
75% 37586.00 748.00 6324.00 2.00 1.00 1.00 2.00 1.00 135.00 1.00 1.00 1.00 1.00 1.00 1.00
max 424892.00 4605.00 16594.00 3719.00 1150.00 4.00 9.00 11.00 159.00 1.00 1.00 1.00 1.00 1.00 1.00


If you'd like to see the help for the describe function use the question mark.

In [49]:
DataFrame.describe?


Use the dir function to get an alphabetical listing of valid names (attributes) in an object.

In [38]:
print dir(df.describe())
['English', 'RTs_binary', 'T', 'URLs_binary', '_AXIS_ALIASES', '_AXIS_IALIASES', '_AXIS_LEN', '_AXIS_NAMES', '_AXIS_NUMBERS', '_AXIS_ORDERS', '_AXIS_REVERSED', '_AXIS_SLICEMAP', '__abs__', '__add__', '__and__', '__array__', '__array_wrap__', '__bool__', '__bytes__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__div__', '__doc__', '__eq__', '__finalize__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__idiv__', '__imul__', '__init__', '__invert__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__le__', '__len__', '__lt__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pow__', '__radd__', '__rand__', '__rdiv__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmod__', '__rmul__', '__ror__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__unicode__', '__weakref__', '__xor__', '_accessors', '_add_numeric_operations', '_agg_by_level', '_align_frame', '_align_series', '_apply_broadcast', '_apply_empty_result', '_apply_raw', '_apply_standard', '_at', '_auto_consolidate', '_box_col_values', '_box_item_values', '_check_inplace_setting', '_check_is_chained_assignment_possible', '_check_setitem_copy', '_clear_item_cache', '_combine_const', '_combine_frame', '_combine_match_columns', '_combine_match_index', '_combine_series', '_combine_series_infer', '_compare_frame', '_compare_frame_evaluate', '_consolidate_inplace', '_construct_axes_dict', '_construct_axes_dict_for_slice', '_construct_axes_dict_from', '_construct_axes_from_arguments', '_constructor', '_constructor_expanddim', '_constructor_sliced', '_count_level', '_create_indexer', '_dir_additions', '_dir_deletions', '_ensure_valid_index', '_expand_axes', '_flex_compare_frame', '_from_arrays', 
'_from_axes', '_get_agg_axis', '_get_axis', '_get_axis_name', '_get_axis_number', '_get_axis_resolvers', '_get_block_manager_axis', '_get_bool_data', '_get_cacher', '_get_index_resolvers', '_get_item_cache', '_get_numeric_data', '_get_values', '_getitem_array', '_getitem_column', '_getitem_frame', '_getitem_multilevel', '_getitem_slice', '_iat', '_iget_item_cache', '_iloc', '_indexed_same', '_info_axis', '_info_axis_name', '_info_axis_number', '_info_repr', '_init_dict', '_init_mgr', '_init_ndarray', '_internal_names', '_internal_names_set', '_is_cached', '_is_datelike_mixed_type', '_is_mixed_type', '_is_numeric_mixed_type', '_is_view', '_ix', '_ixs', '_join_compat', '_loc', '_maybe_cache_changed', '_maybe_update_cacher', '_metadata', '_needs_reindex_multi', '_protect_consolidate', '_reduce', '_reindex_axes', '_reindex_axis', '_reindex_columns', '_reindex_index', '_reindex_multi', '_reindex_with_indexers', '_repr_fits_horizontal_', '_repr_fits_vertical_', '_repr_html_', '_reset_cache', '_sanitize_column', '_series', '_set_as_cached', '_set_axis', '_set_is_copy', '_set_item', '_setitem_array', '_setitem_frame', '_setitem_slice', '_setup_axes', '_slice', '_stat_axis', '_stat_axis_name', '_stat_axis_number', '_typ', '_unpickle_frame_compat', '_unpickle_matrix_compat', '_update_inplace', '_validate_dtype', '_xs', 'abs', 'add', 'add_prefix', 'add_suffix', 'align', 'all', 'any', 'append', 'apply', 'applymap', 'as_blocks', 'as_matrix', 'asfreq', 'assign', 'astype', 'at', 'at_time', 'axes', 'between_time', 'bfill', 'blocks', 'bool', 'boxplot', 'clip', 'clip_lower', 'clip_upper', 'columns', 'combine', 'combineAdd', 'combineMult', 'combine_first', 'compound', 'consolidate', 'convert_objects', 'copy', 'corr', 'corrwith', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'div', 'divide', 'dot', 'drop', 'drop_duplicates', 'dropna', 'dtypes', 'duplicated', 'empty', 'entities_hashtags_count', 'entities_mentions_count', 'entities_urls_count', 'eq', 
'equals', 'eval', 'favorite_count', 'favorites_binary', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'floordiv', 'from_csv', 'from_dict', 'from_items', 'from_records', 'from_user_followers_count', 'from_user_listed_count', 'from_user_statuses_count', 'ftypes', 'ge', 'get', 'get_dtype_counts', 'get_ftype_counts', 'get_value', 'get_values', 'groupby', 'gt', 'hashtags_binary', 'head', 'hist', 'iat', 'icol', 'idxmax', 'idxmin', 'iget_value', 'iloc', 'index', 'info', 'insert', 'interpolate', 'irow', 'is_copy', 'isin', 'isnull', 'iteritems', 'iterkv', 'iterrows', 'itertuples', 'ix', 'join', 'keys', 'kurt', 'kurtosis', 'last', 'last_valid_index', 'le', 'load', 'loc', 'lookup', 'lt', 'mad', 'mask', 'max', 'mean', 'median', 'memory_usage', 'mentions_binary', 'merge', 'min', 'mod', 'mode', 'mul', 'multiply', 'ndim', 'ne', 'notnull', 'num_characters', 'pct_change', 'pipe', 'pivot', 'pivot_table', 'plot', 'pop', 'pow', 'prod', 'product', 'quantile', 'query', 'radd', 'rank', 'rdiv', 'reindex', 'reindex_axis', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'replace', 'resample', 'reset_index', 'retweet_count', 'rfloordiv', 'rmod', 'rmul', 'rpow', 'rsub', 'rtruediv', 'sample', 'save', 'select', 'select_dtypes', 'sem', 'set_axis', 'set_index', 'set_value', 'shape', 'shift', 'size', 'skew', 'slice_shift', 'sort', 'sort_index', 'sortlevel', 'squeeze', 'stack', 'std', 'sub', 'subtract', 'sum', 'swapaxes', 'swaplevel', 'tail', 'take', 'to_clipboard', 'to_csv', 'to_dense', 'to_dict', 'to_excel', 'to_gbq', 'to_hdf', 'to_html', 'to_json', 'to_latex', 'to_msgpack', 'to_panel', 'to_period', 'to_pickle', 'to_records', 'to_sparse', 'to_sql', 'to_stata', 'to_string', 'to_timestamp', 'to_wide', 'transpose', 'truediv', 'truncate', 'tshift', 'tz_convert', 'tz_localize', 'unstack', 'update', 'values', 'var', 'where', 'xs']
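
The listing above is long because it includes internal names that start with an underscore. A small sketch for keeping only the public names, using a made-up dataframe (`toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({'retweet_count': [0, 1, 5]})
stats = toy.describe()

# Keep only names that do not start with an underscore.
public_names = [name for name in dir(stats) if not name.startswith('_')]

print(public_names[:10])
```

This makes it easier to spot methods such as to_csv and transpose, which we use later in this chapter.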


CHANGE TO TWO DECIMALS (n.b. - This step is not necessary if you have run the display.float_format command earlier)

In [39]:
np.round(df.describe(), 2)
Out[39]:
from_user_followers_count from_user_listed_count from_user_statuses_count retweet_count favorite_count entities_urls_count entities_hashtags_count entities_mentions_count num_characters English RTs_binary favorites_binary hashtags_binary mentions_binary URLs_binary
count 26257.00 26257.00 26257.00 26257.00 26257.00 26257.00 26257.00 26257.00 26257.00 26257.00 26257.00 26257.00 26257.00 26257.00 26257.00
mean 48397.00 533.32 5190.04 3.83 1.52 0.63 1.01 0.96 111.11 0.97 0.55 0.33 0.63 0.61 0.62
std 103199.48 684.03 4004.43 42.94 15.28 0.50 1.01 1.14 29.89 0.17 0.50 0.47 0.48 0.49 0.49
min 58.00 8.00 9.00 0.00 0.00 0.00 0.00 0.00 9.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 4426.00 209.00 2311.00 0.00 0.00 0.00 0.00 0.00 94.00 1.00 0.00 0.00 0.00 0.00 0.00
50% 6401.00 253.00 4231.00 1.00 0.00 1.00 1.00 1.00 122.00 1.00 1.00 0.00 1.00 1.00 1.00
75% 37586.00 748.00 6324.00 2.00 1.00 1.00 2.00 1.00 135.00 1.00 1.00 1.00 1.00 1.00 1.00
max 424892.00 4605.00 16594.00 3719.00 1150.00 4.00 9.00 11.00 159.00 1.00 1.00 1.00 1.00 1.00 1.00
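
As an aside, newer versions of PANDAS (0.17 and later, so not the 0.16.2 used in this notebook) also offer a round method on the dataframe itself. The sketch below, on made-up data, assumes such a newer version and checks that the two approaches agree.

```python
import numpy as np
import pandas as pd

# Hypothetical data, just for illustration.
toy = pd.DataFrame({'retweet_count': [0, 1, 5, 12],
                    'favorite_count': [0, 0, 2, 7]})

rounded_np = np.round(toy.describe(), 2)     # the approach used above
rounded_method = toy.describe().round(2)     # assumes pandas >= 0.17

print(rounded_np.equals(rounded_method))
```

Unlike the display.float_format option, both of these change the values themselves, which matters once we write the table to a file.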


NOW LET'S TRANSPOSE THE OUTPUT -- necessary for a more typical social scientific presentation of the data. Note how only 15 variables are shown. These are our numerical variables. The categorical variables content, from_user_screen_name, and Company are not shown.

In [62]:
np.round(df.describe(), 2).T

#ALTERNATIVE WAY OF WRITING ABOVE
#np.round(df.describe(), 2).transpose()
Out[62]:
                              count      mean        std    min      25%      50%       75%        max
from_user_followers_count  26257.00  48397.00  103199.48  58.00  4426.00  6401.00  37586.00  424892.00
from_user_listed_count     26257.00    533.32     684.03   8.00   209.00   253.00    748.00    4605.00
from_user_statuses_count   26257.00   5190.04    4004.43   9.00  2311.00  4231.00   6324.00   16594.00
retweet_count              26257.00      3.83      42.94   0.00     0.00     1.00      2.00    3719.00
favorite_count             26257.00      1.52      15.28   0.00     0.00     0.00      1.00    1150.00
entities_urls_count        26257.00      0.63       0.50   0.00     0.00     1.00      1.00       4.00
entities_hashtags_count    26257.00      1.01       1.01   0.00     0.00     1.00      2.00       9.00
entities_mentions_count    26257.00      0.96       1.14   0.00     0.00     1.00      1.00      11.00
num_characters             26257.00    111.11      29.89   9.00    94.00   122.00    135.00     159.00
English                    26257.00      0.97       0.17   0.00     1.00     1.00      1.00       1.00
RTs_binary                 26257.00      0.55       0.50   0.00     0.00     1.00      1.00       1.00
favorites_binary           26257.00      0.33       0.47   0.00     0.00     0.00      1.00       1.00
hashtags_binary            26257.00      0.63       0.48   0.00     0.00     1.00      1.00       1.00
mentions_binary            26257.00      0.61       0.49   0.00     0.00     1.00      1.00       1.00
URLs_binary                26257.00      0.62       0.49   0.00     0.00     1.00      1.00       1.00
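
To see why the text columns drop out, here is a minimal sketch on a made-up dataframe with two text and two numeric columns (`toy` is hypothetical). After transposing, the variables become the rows and the statistics become the columns.

```python
import pandas as pd

# Hypothetical mix of text and numeric columns.
toy = pd.DataFrame({'content': ['a tweet', 'another tweet', 'a third'],
                    'Company': ['Humana', 'Pfizer', 'Humana'],
                    'retweet_count': [0, 1, 5],
                    'RTs_binary': [0, 1, 1]})

table = toy.describe().T

print(list(table.index))      # variables that made it into the table
print(list(table.columns))    # statistics, now running across the top
```

Only retweet_count and RTs_binary should appear in the index, mirroring what happens with content, from_user_screen_name, and Company in our real data.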


We won't typically want the percentile columns in a social scientific publication. Supposedly, in version 0.16 of PANDAS, you can use 'percentiles=None' with the describe command to omit the percentiles. In version 0.16 as well as earlier versions of PANDAS we can alternatively select only those columns we want, then output to CSV.

In [41]:
np.round(df.describe(), 2).T[['count','mean', 'std', 'min', 'max']]
Out[41]:
                              count      mean        std    min        max
from_user_followers_count  26257.00  48397.00  103199.48  58.00  424892.00
from_user_listed_count     26257.00    533.32     684.03   8.00    4605.00
from_user_statuses_count   26257.00   5190.04    4004.43   9.00   16594.00
retweet_count              26257.00      3.83      42.94   0.00    3719.00
favorite_count             26257.00      1.52      15.28   0.00    1150.00
entities_urls_count        26257.00      0.63       0.50   0.00       4.00
entities_hashtags_count    26257.00      1.01       1.01   0.00       9.00
entities_mentions_count    26257.00      0.96       1.14   0.00      11.00
num_characters             26257.00    111.11      29.89   9.00     159.00
English                    26257.00      0.97       0.17   0.00       1.00
RTs_binary                 26257.00      0.55       0.50   0.00       1.00
favorites_binary           26257.00      0.33       0.47   0.00       1.00
hashtags_binary            26257.00      0.63       0.48   0.00       1.00
mentions_binary            26257.00      0.61       0.49   0.00       1.00
URLs_binary                26257.00      0.62       0.49   0.00       1.00
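
Another route to the same five columns, if you are on a newer version of PANDAS than the one used here, is to ask for exactly the statistics you want with the agg method. This is only a sketch: it assumes pandas 0.20 or later (where DataFrame.agg is available), a dataframe containing only numeric columns, and made-up data.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric-only dataframe.
toy = pd.DataFrame({'retweet_count': [0, 1, 5, 12],
                    'RTs_binary': [0, 1, 1, 1]})
stats = ['count', 'mean', 'std', 'min', 'max']

via_describe = toy.describe().T[stats]    # the approach used above
via_agg = toy.agg(stats).T                # assumes pandas >= 0.20

print(np.allclose(via_describe.values, via_agg.values))
```

Selecting columns from the describe output, as in the cell above, has the advantage of working on older versions and of silently skipping text columns.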


Save the Output of the Table as a CSV File

Once you get more comfortable with Python and PANDAS you can combine your commands. For instance, we can simultaneously run our summary statistics and output the results to a CSV file.

In [43]:
#WITH FULL FLOATING-POINT PRECISION (DEFAULT)
df.describe().transpose().to_csv('summary stats.csv', sep=',')


For a typical social scientific publication, we would not need the percentile columns. We can instead select only those columns we want, then output to CSV.

In [51]:
df.describe().transpose()[['count','mean', 'std', 'min', 'max']].to_csv('summary stats.csv', sep=',')


The problem with the above output is that more than 2 decimal places are showing. If you want only two, then run the following version.

In [52]:
#WITH TWO DECIMAL PLACES
np.round(df.describe(), 2).T[['count','mean', 'std', 'min', 'max']].to_csv('summary stats.csv', sep=',')
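
An alternative to rounding the values first is the float_format argument of to_csv, which controls how numbers are written out. The sketch below uses made-up data and writes to an in-memory buffer rather than to 'summary stats.csv', purely so that the result can be printed; it is written with Python 3 in mind.

```python
import io
import pandas as pd

# Hypothetical data, just for illustration.
toy = pd.DataFrame({'retweet_count': [0, 1, 5, 12],
                    'favorite_count': [0, 0, 2, 7]})
table = toy.describe().T[['count', 'mean', 'std', 'min', 'max']]

buffer = io.StringIO()    # in-memory stand-in for a file on disk
table.to_csv(buffer, float_format='%.2f')
csv_text = buffer.getvalue()

print(csv_text)
```

One difference from np.round is that every number is written with exactly two decimals, so a value such as 4.5 appears as 4.50.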


Now you have a CSV file containing the columns you'll need for a typical Summary Statistics or Descriptive Statistics table for a submission to a social science journal. You likely won't want all of the variables in the final table, so I would probably open up the CSV file in Excel, delete unwanted variables, then copy and paste into Word. At that point you just need some formatting for aesthetics. If you would rather select the specific variables to include before exporting, you can specify the columns like this.

In [54]:
cols = ['retweet_count','RTs_binary']
np.round(df[cols].describe(), 2).T[['count','mean', 'std', 'min', 'max']].to_csv('summary stats (partial).csv', sep=',')
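
If you find yourself repeating this chain of commands, you could wrap it in a small function. The helper below is my own hypothetical convenience (summary_table is not a PANDAS function), sketched on made-up data.

```python
import numpy as np
import pandas as pd

STATS = ['count', 'mean', 'std', 'min', 'max']

def summary_table(frame, cols=None, decimals=2):
    """Return a rounded, transposed describe() table for the chosen columns."""
    if cols is not None:
        frame = frame[cols]
    return np.round(frame.describe(), decimals).T[STATS]

# Hypothetical data, just for illustration.
toy = pd.DataFrame({'retweet_count': [0, 1, 5, 12],
                    'favorite_count': [0, 0, 2, 7],
                    'RTs_binary': [0, 1, 1, 1]})

partial = summary_table(toy, cols=['retweet_count', 'RTs_binary'])
print(partial)
```

The result is an ordinary dataframe, so you can follow it with .to_csv('summary stats (partial).csv') exactly as in the cell above.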

Outputting to LaTeX

In some disciplines (e.g., Political Science, Engineering, Computer Science, Accounting, Finance, Economics) it is common to use LaTeX rather than Word. PANDAS has excellent LaTeX capabilities. For instance, the first of the following three cells shows how to output to a *.tex file rather than CSV, while the second shows what the LaTeX code looks like. The third displays an image of what the table looks like once it's rendered in TeXShop.

In [64]:
np.round(df.describe(), 2).T[['count','mean', 'std', 'min', 'max']].to_latex('summary stats.tex')
In [67]:
print np.round(df.describe(), 2).T[['count','mean', 'std', 'min', 'max']].to_latex()
\begin{tabular}{lrrrrr}
\toprule
{} &    count &     mean &       std &   min &       max \\
\midrule
from\_user\_followers\_count & 26257.00 & 48397.00 & 103199.48 & 58.00 & 424892.00 \\
from\_user\_listed\_count    & 26257.00 &   533.32 &    684.03 &  8.00 &   4605.00 \\
from\_user\_statuses\_count  & 26257.00 &  5190.04 &   4004.43 &  9.00 &  16594.00 \\
retweet\_count             & 26257.00 &     3.83 &     42.94 &  0.00 &   3719.00 \\
favorite\_count            & 26257.00 &     1.52 &     15.28 &  0.00 &   1150.00 \\
entities\_urls\_count       & 26257.00 &     0.63 &      0.50 &  0.00 &      4.00 \\
entities\_hashtags\_count   & 26257.00 &     1.01 &      1.01 &  0.00 &      9.00 \\
entities\_mentions\_count   & 26257.00 &     0.96 &      1.14 &  0.00 &     11.00 \\
num\_characters            & 26257.00 &   111.11 &     29.89 &  9.00 &    159.00 \\
English                   & 26257.00 &     0.97 &      0.17 &  0.00 &      1.00 \\
RTs\_binary                & 26257.00 &     0.55 &      0.50 &  0.00 &      1.00 \\
favorites\_binary          & 26257.00 &     0.33 &      0.47 &  0.00 &      1.00 \\
hashtags\_binary           & 26257.00 &     0.63 &      0.48 &  0.00 &      1.00 \\
mentions\_binary           & 26257.00 &     0.61 &      0.49 &  0.00 &      1.00 \\
URLs\_binary               & 26257.00 &     0.62 &      0.49 &  0.00 &      1.00 \\
\bottomrule
\end{tabular}
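
The row labels in the LaTeX output above are the raw column names. If you would prefer readable labels, one option is to rename the index of the table before exporting it. The sketch below uses made-up data and hypothetical label wording; the same renamed table could be passed to to_csv or to_latex.

```python
import numpy as np
import pandas as pd

# Hypothetical data, just for illustration.
toy = pd.DataFrame({'retweet_count': [0, 1, 5, 12],
                    'RTs_binary': [0, 1, 1, 1]})

# Hypothetical display labels; substitute whatever wording your journal expects.
labels = {'retweet_count': 'Number of retweets',
          'RTs_binary': 'Retweeted (0/1)'}

table = np.round(toy.describe(), 2).T[['count', 'mean', 'std', 'min', 'max']]
labeled = table.rename(index=labels)

print(labeled)
# labeled.to_latex('summary stats.tex') would then write the relabeled table.
```

Any variable missing from the labels dictionary simply keeps its original name.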

In [70]:
Image(width=600, filename='Descriptive Statistics Table (LaTeX).png') 
Out[70]:


In this tutorial we have covered how to generate a summary statistics table in preparation for further analyses and for submitting your work to scholarly outlets. In the following tutorials I'll introduce you to how to analyze audience reaction to the companies' tweets as well as how to test your hypotheses using logistic regression.

For more Notebooks as well as additional Python and Big Data tutorials, please visit http://social-metrics.org or follow me on Twitter @gregorysaxton