#!/usr/bin/env python
# coding: utf-8

# # Notebook Tasks
# **_Possible Samples for Statistical Tests_**:
# - Given the above, there are a number of possible tests, summarized in the table below:

# | IV: SOX Policies | DV: Donor Advisory | N | Notes | TO DO |
# |---|---|---|---|---|
# | 2011 | 2016 | 4,857 | 47 donor advisories on these organizations; associational test (we don't know when the SOX policies were added); also, DV is 'current donor advisory' | ready to run |
# | 2011 | 2012-2016 | 4,857 | 47 2016 advisories plus probably another dozen or so advisories over the 2012-2015 period; associational test as above, but adds in donor advisories that were put in place then dropped between 2012 and 2015 | some minor work creating this new DV but not very burdensome |
# | 2011 | 2011 | 5,439 | 39 donor advisories; pure cross-sectional test | Download the '2011' 990 data (SOX policies + controls) for the 39 orgs with a 2011 donor advisory; a few hours of work to download and enter the data |
# | 2016 | 2016 | 8,304 | 328 donor advisories; pure cross-sectional test | ready to run |
# | change 2011-2016 | 2016 | 4,857 | Divide the 4,857 orgs into three groups: i) those with no SOX policies in 2011 and still no SOX policies in 2016; ii) those with SOX policies in 2011 and 2016; and iii) those with no SOX policies in 2011 but SOX policies in 2016. Create dummy variables for each group and see whether those in group iii) do better than i) or ii). This is a relatively low-cost 'pre-post' test | moderate amount of work to create the new dummies (see the sketch below) but not too burdensome |
# | change 2011-2016 | 2012-2016 | TBD | Similar to the above option, but would need to take a sample of organizations in group iii) and go through their 990s to find out exactly when they added the SOX policies | Resource-intensive 990 searches |
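# *For the fifth test, a minimal sketch of the three group dummies. It assumes the merged dataset `df` and hypothetical 0/1 columns `SOX_2011` and `SOX_2016` indicating whether an org reported any SOX policies in each year; the actual column names may differ.*

# In[ ]:
# Sketch only: group dummies for the 'pre-post' test. 'SOX_2011' and 'SOX_2016' are
# placeholder names for 0/1 any-SOX-policy indicators in each year.
df['group_i'] = ((df['SOX_2011'] == 0) & (df['SOX_2016'] == 0)).astype(int)    # no SOX policies in either year
df['group_ii'] = ((df['SOX_2011'] == 1) & (df['SOX_2016'] == 1)).astype(int)   # SOX policies in both years
df['group_iii'] = ((df['SOX_2011'] == 0) & (df['SOX_2016'] == 1)).astype(int)  # adopted SOX policies between 2011 and 2016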

# **_Notes from Meeting with Dan:_**
# - Do not do the 3rd or 6th test -- the benefit is not worth the cost
# - The 1st and 2nd tests can be robustness analyses
# - Focus on the 4th and 5th tests
# - Control variables:
#     - Size: total revenues is best (probably logged)
#         - will need 2011 and 2016 versions for the 4th and 5th tests
#     - efficiency ratio
#     - age (from BMF)
#     - complexity (could be a good control from Erica's paper)
#     - fixed effects:
#         - state
#         - category
#             - I need to scrape the category dummies for the new orgs in the 2016 database (see the scraping sketch after these notes)
#             - CN does not include that information in the ratings area, but it is included on the webpage in the 'breadcrumbs' area
# - The focus of our paper is on SOX policies; if an org has SOX policies it probably has other governance policies, and these would be highly correlated. So, we will leave the other governance variables out of one version of the 4th and 5th tests, and then try to include them in another set. The best candidates are:
#     - *independent board* --> related to Erica's *independence of key actors* concept
#     - *board review of 990* and *audited financials* --> both related to Erica's *board monitoring* concept
#     - we could include other governance variables as needed
# - We are focusing on non-health, non-university organizations; by focusing on a more donor-focused sample (CN), we are differentiating this work from previous studies.
# - To differentiate from Erica's *JBE* paper, we should use the SOI data to see how many of the donor advisories are due to 'non-material diversions'.
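# *A hedged sketch of the category scrape mentioned above. The URL argument and the breadcrumb CSS selector are assumptions about Charity Navigator's page layout and would need to be checked against the live site before use.*

# In[ ]:
# Sketch only (not run): pull the category from the breadcrumb trail of a CN org page.
import requests
from bs4 import BeautifulSoup

def get_category(org_url):
    soup = BeautifulSoup(requests.get(org_url).text, 'lxml')
    crumbs = soup.select('.breadcrumb a')   # assumed CSS class for the breadcrumbs area
    return crumbs[1].get_text(strip=True) if len(crumbs) > 1 else None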

# **_To Do (beyond notes listed in table above):_**
# - For all of the above tests, we need to decide on controls, then find/merge/create any not currently in the dataset (see the merge sketch below for the BMF-based age control)
# - Run a selection model?
# - Code the *type* of advisory? Maybe save for a future study
# - There are 53 orgs on the CN 'Watchlist' -- we probably don't need to look at these, but it's a possible future move.
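# *A minimal sketch of pulling the age control from the BMF, as noted above. The file name is a placeholder, and the assumption that `RULING_2015_BMF` is a YYYYMM integer would need to be verified; the column name itself appears in the BMF column list later in this notebook.*

# In[ ]:
# Sketch only: derive org age from the BMF ruling date and merge on EIN.
import pandas as pd

bmf = pd.read_csv('BMF_2015.csv')                     # placeholder filename
bmf['age'] = 2016 - (bmf['RULING_2015_BMF'] // 100)   # assumes a YYYYMM integer ruling date
df = df.merge(bmf[['EIN', 'age']], on='EIN', how='left')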
# **_Notes on 2011 data:_**
# - Only 47 of the 329 current donor advisories are on orgs that were rated in 2011
# - Number of 2011 orgs (n=5,349) missing from the 2016 ratings: 582
# - Number of 2016 orgs (n=8,304) not in the 2011 ratings: 3,447 (see the set-logic sketch below)
# - When I scraped the current ratings in 2011 there were 39 blank rows. Specifically, I checked the following spreadsheet: *Charity Navigator - current ratings, October 18, 2011 (WITH UPDATES FOR DONOR ADVISORY ORGS).xlsx* -- 39 rows were blank for all ratings information, so I checked them against the historical ratings on the CN website. (So far) all of those rows were either 1) dropped from CN, 2) had a donor advisory, or 3) still have a donor advisory. I have 5,439 orgs in the 2011 database; 39 of them seem to have had donor advisories on them at that time. So, the 2011 sample is the 5,400 orgs that did not have an advisory on them at the time. This conforms with the *n* of 5,400 in the above logit.
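# *A small sketch of the set logic behind the overlap counts above, assuming `df_2011` and `df_2016` frames for the two ratings snapshots (placeholder names) that each carry an `org_id` column.*

# In[ ]:
# Sketch only: 2011/2016 overlap counts via set differences on org_id.
ids_2011 = set(df_2011['org_id'])
ids_2016 = set(df_2016['org_id'])
print(len(ids_2011 - ids_2016))   # 2011 orgs missing from the 2016 ratings
print(len(ids_2016 - ids_2011))   # 2016 orgs not in the 2011 ratings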
# ### Import Packages
# First, we will import several necessary Python packages. We will be using the Python Data Analysis Library, or PANDAS, extensively for our data manipulations. It is invaluable for analyzing datasets.

# Import of basic elements of PANDAS and numpy

# In[3]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series
# We can check which versions of the various packages we're using. You can see I'm running PANDAS 0.17 here.

# In[4]:
print pd.__version__
# PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns, so I typically include these lines at the top of all my notebooks.

# In[5]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', 500)

# ## Read in Data
# Let's read in the merged historical/current/2011 dataset we created in the last notebook. First we'll change the working directory.

# In[6]:
cd '/Users/gregorysaxton/Google Drive/SOX'

# ## Logit Tests

# In[37]:
df_2011 = pd.read_pickle('Tests 1-2 data.pkl')
print len(df_2011.columns)
print len(df_2011)
df_2011.head(1)

# In[38]:
print df_2011.columns.tolist()

# In[39]:
print df_2011['donor_advisory_2016'].value_counts()
print df_2011['donor_advisory_2011_to_2016'].value_counts()

# In[40]:
#DVs = ['donor_advisory',
DVs = ['donor_advisory_2016', 'donor_advisory_2011_to_2016']
indicators = ['org_id', 'EIN', 'FYE', 'Form 990 FYE', 'ratings_system', '2011_data', '2016_data']
IVs = ['conflict_of_interest_policy_v2', 'records_retention_policy_v2', 'whistleblower_policy_v2',
       'SOX_policies', 'SOX_policies_binary', 'SOX_policies_all_binary']
controls = ['program_efficiency', 'complexity', 'age', 'total_revenue_logged', 'category', 'state']
fixed_effects = ['category_Animals', 'category_Arts, Culture, Humanities', 'category_Community Development',
                 'category_Education', 'category_Environment', 'category_Health', 'category_Human Services',
                 'category_Human and Civil Rights', 'category_International', 'category_Religion',
                 'category_Research and Public Policy']
SOI_check = ['tot_rev']
merge_cols = ['_merge_v1', '_merge_v2', '_merge_v3', '_merge_v4', '_merge_47', '_merge_efile'] #+ SOI_check
logit_cols = DVs + indicators + IVs + controls + fixed_effects
print logit_cols

# In[41]:
df_2011[df_2011.duplicated()]

# In[36]:
#df_2011.to_excel('df_2011.xls')

# ### Test Logit

# In[13]:
import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf   #FOR USING 'R'-STYLE FORMULAS FOR REGRESSIONS
print statsmodels.__version__

# In[52]:
#IVs = '%s + ' % IV
#DV = '%s ~ ' % DV
IVs = 'SOX_policies '
#IVs = 'SOX_policies_binary'
#DV = 'advisory ~ '
#DV = 'donor_advisory_2016 ~ '
DV = 'donor_advisory_2011_to_2016 ~ '
controls = '+ total_revenue_logged + program_efficiency + age + complexity_2011 + C(category)'
#admin_expense_percent + leader_comp_percent + budget_surplus
logit_formula = DV + IVs + controls
print logit_formula
#globals()["mod%s" % model_num] = smf.logit(formula=logit_formula, data=df).fit()
#print globals()["mod%s" % model_num].summary()
#print model_num.summary()
#print '\n', "Chi-squared value:", globals()["mod%s" % model_num].llr, '\n'   #TO GET THE CHI-SQUARED VALUE

# In[22]:
print df_2011.columns.tolist()

# In[24]:
logit_cols1 = ['donor_advisory_2016', 'donor_advisory_2011_to_2016', 'SOX_policies', 'program_efficiency',
               'complexity', 'age', 'total_revenue_logged', 'category', 'state']
len(df_2011[logit_cols1].dropna())

# In[19]:
len(df_2011.dropna())

# In[53]:
logit = smf.logit(formula=logit_formula, data=df_2011).fit()
logit.summary()

# In[29]:
df_2011['donor_advisory_2016'] = df_2011['donor_advisory_2016'].astype('int')
df_2011['donor_advisory_2011_to_2016'] = df_2011['donor_advisory_2011_to_2016'].astype('int')

# In[30]:
df_2011.dtypes

# ### Write Function for Logits

# In[ ]:
#def new_logit(IV,model_num):
def new_logit_clustered(data, DV, columns, FE, model_num):
    #IVs = '%s + ' % IV
    #DV = 'RTs_binary ~ '
    DV = '%s ~ ' % DV
    #controls = 'from_user_followers_count + time_on_twitter_days + CSR_sustainability + URLs_binary + photo'
    IVs = ' + '.join(columns)
    FE = '%s ' % FE
    logit_formula = DV + IVs + FE
    print logit_formula
    # Cluster the standard errors on state, consistent with the clustered logits below.
    # (The original hard-coded 'firm_from_user_screen_name' grouping was left over from a
    # different project and would fail on this dataset.)
    globals()["mod%s" % model_num] = smf.logit(formula=logit_formula, data=data).fit(
        cov_type='cluster', cov_kwds={'groups': data['state']})
    print globals()["mod%s" % model_num].summary()
    #print model_num.summary()
    print '\n', "Chi-squared value:", globals()["mod%s" % model_num].llr, '\n'   #TO GET THE CHI-SQUARED VALUE
    #print '\n', "Pseudo R-squared:", globals()["mod%s" % model_num].prsquared   #TO GET THE PSEUDO-R-SQUARED

# In[1089]:
logit_variables = ['2011_data', 'donor_advisory_2016', 'SOX_policies', 'total_revenue_logged',
                   'program_efficiency', 'age', 'complexity', 'state', 'category']
df_2011 = df[logit_variables]
df_2011 = df_2011[df_2011['2011_data']==1]
print len(df_2011)
len(df_2011.dropna())

# In[1100]:
df[['2011_data', 'donor_advisory_2016', 'SOX_policies', 'total_revenue_logged', 'program_efficiency',
    'age', 'complexity', 'complexity_2011', 'state', 'category']].describe().T

# In[1158]:
logit = smf.logit(formula=logit_formula, data=df[df['2011_data']==1]).fit()
logit.summary()

# ### Test with standard errors clustered on state

# In[47]:
logit_formula

# In[54]:
cols1 = ['donor_advisory_2011_to_2016', 'SOX_policies', 'total_revenue_logged', 'program_efficiency',
         'age', 'complexity_2011', 'category', 'state']

# In[55]:
logit = smf.logit(formula=logit_formula, data=df_2011[cols1].dropna()).fit(
    cov_type='cluster', cov_kwds={'groups': df_2011[cols1].dropna()['state']})
logit.summary()
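# *A possible call of the `new_logit_clustered` helper defined above, as a sketch: it assumes the cluster-on-state fix noted in the function's comments and reuses the `cols1` column list from the previous cell.*

# In[ ]:
# Sketch only: run the clustered logit via the helper; the fitted model lands in globals()['mod1'].
model_data = df_2011[cols1].dropna()
new_logit_clustered(model_data, 'donor_advisory_2011_to_2016',
                    ['SOX_policies', 'total_revenue_logged', 'program_efficiency',
                     'age', 'complexity_2011'],
                    '+ C(category)', 1)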
# # Below this I've played around with creating a '2011' dataset --> and run a couple of logits

# Here are the two variables that serve as indicators of the '2016' and '2011' CN data. **_These 8,304 and 4,863 rows will serve as the base for conducting the logit regressions_**. The variable-creation steps below are designed around these two subsets of the data.

# In[373]:
print len(df[df['latest_entry']=='True'])
print len(df[df['2011 data']==1])

# In[397]:
cols_2011 = [col for col in list(df) if col.endswith('_2011')]
print cols_2011

# In[398]:
df[cols_2011].dtypes

# In[379]:
df[cols_2011].describe().T

# In[380]:
from __future__ import division

# #### Program Efficiency Ratio
# efficiency = ProgExp/TotExp

# In[386]:
df[df['org_id']=='12123']['program_expense_2011']/df[df['org_id']=='12123']['total_functional_expense_2011']

# In[388]:
df[df['org_id']=='12123'][cols_2011]

# In[395]:
df[df['2011 data']==1]['program_expense_2011'][:5]/df[df['2011 data']==1]['total_functional_expense_2011'][:5]
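# *The cells above preview the ratio for a single org and a five-row slice. Below is a sketch of the vectorized version for all rows, leaving the ratio missing when total expenses are zero or missing; the column name `program_efficiency_2011` is my label for the sketch, not necessarily the one used elsewhere.*

# In[ ]:
# Sketch only: vectorized 2011 efficiency ratio; rows with zero or missing total expenses stay NaN.
df['program_efficiency_2011'] = np.where(
    df['total_functional_expense_2011'] > 0,
    df['program_expense_2011'] / df['total_functional_expense_2011'],
    np.nan)
df['program_efficiency_2011'].describe()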
# ### Complexity
# Number of revenue sources (Donations, Government Grants, Program Service Revenues)

# In[407]:
print len(df[df['2011 data']==1])
print len(df[df['govt_revenue_2011'].notnull()])
df[df['govt_revenue_2011'].notnull()]['govt_revenue_2011'].value_counts()

# In[409]:
df[df['govt_revenue_2011'].notnull()]['govt_revenue_2011'][5:12]

# In[410]:
df['govt_revenue_2011_binary'] = np.nan
df['govt_revenue_2011_binary'] = np.where(
    df['govt_revenue_2011'] == 'Note: This organization receives $0 in government support.', 0,
    df['govt_revenue_2011_binary'])
df['govt_revenue_2011_binary'] = np.where(
    df['govt_revenue_2011'] == 'GOVERNMENT SUPPORT MUST BE RECEIVED', 1,
    df['govt_revenue_2011_binary'])
df['govt_revenue_2011_binary'].value_counts()

# In[420]:
print len(df[df['2011 data']==1])
print len(df[df['primary_revenue_2011'].notnull()])
print df[df['primary_revenue_2011'].notnull()]['primary_revenue_2011'][:5]
df[df['primary_revenue_2011'].notnull()]['primary_revenue_2011'].value_counts().sum()

# In[427]:
df[df['other_revenue_2011']<1000]['other_revenue_2011']

# In[429]:
len(df[df['other_revenue_2011']==0])

# In[419]:
print len(df[df['2011 data']==1])
print len(df[df['other_revenue_2011'].notnull()])
print df[df['other_revenue_2011'].notnull()]['other_revenue_2011'][:5]
df[df['other_revenue_2011'].notnull()]['other_revenue_2011'].value_counts().sum()

# In[431]:
print len(df[df['other_revenue_2011'].notnull()])
df['other_revenue_2011_binary'] = np.where(df['other_revenue_2011']>0, 1, df['other_revenue_2011'])
df['other_revenue_2011_binary'] = np.where(df['other_revenue_2011']<0, 1, df['other_revenue_2011_binary'])
print len(df[df['other_revenue_2011_binary'].notnull()])
df['other_revenue_2011_binary'].value_counts()

# In[434]:
df['complexity_2011'] = 1 + df['other_revenue_2011_binary'] + df['govt_revenue_2011_binary']
print len(df[df['complexity_2011'].notnull()])
df['complexity_2011'].value_counts()

# In[ ]:
cols_2011 = [col for col in list(df) if col.endswith('_2011')]
print cols_2011

# ### Test Logits

# In[1132]:
import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf   #FOR USING 'R'-STYLE FORMULAS FOR REGRESSIONS
print statsmodels.__version__

# In[455]:
df['SOX_policies_2011'] = df['conflict_of_interest_policy_2011'] + ' ' + df['whistleblower_policy_2011'] + ' ' + df['records_retention_policy_2011']
df['SOX_policies_2011'].value_counts()

# In[456]:
df['SOX_policies_2011'] = df['SOX_policies_2011'].str.count('yes')
df['SOX_policies_2011'].value_counts()
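# *One caveat on the construction above: string concatenation returns NaN whenever any one of the three policy answers is missing, so those orgs drop out of the count. A hedged alternative sketch that counts each 'yes' separately and treats a missing answer as zero (whether that is the right call is an analysis decision); the `_alt` name is only to avoid overwriting the variable above.*

# In[ ]:
# Sketch only: count 'yes' answers per policy column; missing answers count as 0 here.
policy_cols_2011 = ['conflict_of_interest_policy_2011', 'whistleblower_policy_2011',
                    'records_retention_policy_2011']
df['SOX_policies_2011_alt'] = sum(df[c].str.count('yes').fillna(0) for c in policy_cols_2011)
df['SOX_policies_2011_alt'].value_counts()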
# In[491]:
df['total_revenue_2011_logged'] = np.log(df['total_revenue_2011'])
df[['total_revenue_2011_logged', 'total_revenue_2011']].describe().T

# In[ ]:
#df['advisory'] = df['']   # incomplete leftover; the 2016 advisory DV is created from 'current_donor_advisory' below

# In[545]:
#IVs = '%s + ' % IV
#DV = '%s ~ ' % DV
IVs = 'SOX_policies_2011 '
#DV = 'advisory ~ '
DV = 'donor_advisory_2016 ~ '
controls = '+ total_revenue_2011_logged + program_expense_percent_2011 + age + complexity_2011 + C(category)'
#admin_expense_percent + leader_comp_percent + budget_surplus
logit_formula = DV + IVs + controls
print logit_formula

# In[532]:
df.to_csv('df.csv', encoding='utf-8')

# In[535]:
logit_variables = ['donor_advisory_2016', 'SOX_policies_2011', 'total_revenue_2011_logged',
                   'program_expense_percent_2011', 'age', 'complexity_2011', 'state_2011', 'category']
df_2011 = df[logit_variables]
print len(df_2011)
len(df_2011.dropna())

# In[519]:
#IVs = '%s + ' % IV
#DV = '%s ~ ' % DV
IVs = 'SOX_policies_2011 '
#DV = 'advisory ~ '
DV = 'donor_advisory_2016 ~ '
controls = '+ total_revenue_2011_logged + program_expense_percent_2011 + age + complexity_2011'
#admin_expense_percent + leader_comp_percent + budget_surplus
logit_formula = DV + IVs + controls
print logit_formula

# In[536]:
logit = smf.logit(formula=logit_formula, data=df_2011).fit()
logit.summary()

# In[ ]:
#def new_logit(IV,model_num):
def new_logit_clustered(data, DV, columns, FE, model_num):
    #IVs = '%s + ' % IV
    #DV = 'RTs_binary ~ '
    DV = '%s ~ ' % DV
    IVs = ' + '.join(columns)
    FE = '%s ' % FE
    logit_formula = DV + IVs + FE
    print logit_formula
    # Cluster on the 2011 state variable (see the note on the earlier copy of this function;
    # the hard-coded 'firm_from_user_screen_name' grouping was a leftover from a different project).
    globals()["mod%s" % model_num] = smf.logit(formula=logit_formula, data=data).fit(
        cov_type='cluster', cov_kwds={'groups': data['state_2011']})
    print globals()["mod%s" % model_num].summary()
    #print model_num.summary()
    print '\n', "Chi-squared value:", globals()["mod%s" % model_num].llr, '\n'   #TO GET THE CHI-SQUARED VALUE
    #print '\n', "Pseudo R-squared:", globals()["mod%s" % model_num].prsquared   #TO GET THE PSEUDO-R-SQUARED

# In[543]:
logit = smf.logit(formula=logit_formula, data=df).fit()
logit.summary()

# In[524]:
len(df[df['2011 data']==1])

# In[523]:
df[df['2011 data']==1][['SOX_policies_2011', 'total_revenue_2011_logged', 'program_expense_percent_2011',
                        'age', 'complexity_2011']].describe().T

# In[508]:
len(df[df['state_2011'].notnull()])

# In[509]:
df['state_2011'].value_counts()

# In[516]:
len(df[(df['2011 data']==1) & (df['state_2011'].isnull())])

# In[540]:
df_2011.dropna().describe().T

# In[546]:
logit = smf.logit(formula=logit_formula, data=df_2011.dropna()).fit(
    cov_type='cluster', cov_kwds={'groups': df_2011.dropna()['state_2011']})
logit.summary()

# In[547]:
logit = smf.logit(formula=logit_formula, data=df).fit()
logit.summary()

# In[549]:
df_2011.dropna()['category'].value_counts()

# In[513]:
logit = smf.logit(formula=logit_formula, data=df[df['2011 data']==1]).fit(
    cov_type='cluster', cov_kwds={'groups': df[df['2011 data']==1]['state_2011']})
logit.summary()

# In[526]:
logit = smf.logit(formula=logit_formula, data=df[df['total_revenue_2011_logged'].notnull()]).fit(
    cov_type='cluster', cov_kwds={'groups': df[df['total_revenue_2011_logged'].notnull()]['state_2011']})
logit.summary()

# In[500]:
print len(df[(df['2011 data']==1) & (df['donor_advisory_2016'].notnull())])
print len(df[(df['2011 data']==1) & (df['donor_advisory_2016']==0)])
print len(df[(df['2011 data']==1) & (df['donor_advisory_2016']==1)])

# In[468]:
print cols_2011

# In[ ]:
BMF_columns = ['NEW ROW', 'NAME_2015_BMF', 'STREET_2015_BMF', 'CITY_2015_BMF', 'STATE_2015_BMF',
               'ZIP_2015_BMF', 'RULING_2015_BMF', 'ACTIVITY_2015_BMF', 'TAX_PERIOD_2015_BMF',
               'ASSET_AMT_2015_BMF', 'INCOME_AMT_2015_BMF', 'REVENUE_AMT_2015_BMF', 'NTEE_CD_2015_BMF',
               '2015 BMF', 'ruledate_2004_BMF', 'name_MSTRALL', 'state_MSTRALL', 'NTEE1_MSTRALL',
               'nteecc_MSTRALL', 'zip_MSTRALL', 'fips_MSTRALL', 'taxper_MSTRALL', 'income_MSTRALL',
               'F990REV_MSTRALL', 'assets_MSTRALL', 'ruledate_MSTRALL', 'deductcd_MSTRALL',
               'accper_MSTRALL', 'rule_date_v1', 'taxpd']

# In[ ]:
#df['']   # incomplete fragment left in place
df[df['SOX_policies_2011'].notnull()]

# In[475]:
print len(df[df['current_donor_advisory']==1])
DA_2016 = df[df['current_donor_advisory']==1]['org_id'].tolist()
print len(DA_2016), len(set(DA_2016))
DA_2016[:3]

# In[486]:
df['donor_advisory_2016'] = np.nan
df['donor_advisory_2016'] = np.where(df['org_id'].isin(DA_2016), 1, 0)
df['donor_advisory_2016'].value_counts()

# In[473]:
df[df['SOX_policies_2011'].notnull()][['name', 'org_id', 'FYE', '2011 data',
                                       #'current_or_past_donor_advisory',
                                       'current_donor_advisory', 'past_donor_advisory',
                                       'SOX_policies_2011']][:24]

# In[447]:
#df['advisory'] = np.where(~df['advisory text - current advisory'].isnull(), 1,0)
#df['advisory'].value_counts()

# In[458]:
#df[(df['2011 data']==1) & (df['past_donor_advisory']==1)][:5]

# In[ ]:
controls = ['total_revenue_2011', 'program_expense_percent_2011', ]
len(df[(df['2011 data']==1) & (df['total_revenue_2011'].isnull())])#[controls]

# In[411]:
df[df['2011 data']==1][:2]

# In[396]:
print df.columns.tolist()
# ### SOI data

# In[ ]:
SOI_columns = ['taxpd', 'yr_frmtn', 'pt1_num_vtng_gvrn_bdy_mems', 'pt1_num_ind_vtng_mems',
               'num_vtng_gvrn_bdy_mems', 'num_ind_vtng_mems', 'tot_num_empls', 'tot_num_vlntrs',
               'contri_grnts_cy', 'prog_srvc_rev_cy', 'invst_incm_cy', 'oth_rev_cy',
               'grnts_and_smlr_amts_cy', 'tot_prof_fndrsng_exp_cy', 'tot_fndrsng_exp_cy',
               'pt1_tot_asts_eoy', 'aud_fincl_stmts', 'mtrl_divrsn_or_misuse', 'cnflct_int_plcy',
               'whistleblower_plcy', 'doc_retention_plcy', 'federated_campaigns', 'memshp_dues',
               'rltd_orgs', 'govt_grnts', 'all_oth_contri', 'nncsh_contri', 'tot_contri', 'psr_tot',
               'inv_incm_tot_rev', 'bonds_tot_rev', 'roylrev_tot_rev', 'net_rent_tot_rev',
               'gain_or_loss_sec', 'gain_or_loss_oth', 'oth_rev_tot', 'tot_rev', 'mgmt_srvc_fee_tot',
               'fee_for_srvc_leg_tot', 'fee_for_srvc_acct_tot', 'fee_for_srvc_lbby_tot',
               'fee_for_srvc_prof_tot', 'fee_for_srvc_invst_tot', 'fee_for_srvc_oth_tot',
               'fs_audited', 'audit_committee', 'vlntr_hrs', 'NAME_SOI']
len(SOI_columns)
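# *Per the meeting notes, the SOI data should let us check how many advisories involve 'non-material diversions'. A hedged sketch of that check, assuming an SOI extract file with an `EIN` key (the file name is a placeholder).*

# In[ ]:
# Sketch only: merge the SOI extract on EIN and tabulate the material-diversion flag
# for orgs under a 2016 donor advisory.
soi = pd.read_csv('SOI_extract.csv', usecols=['EIN'] + SOI_columns)   # placeholder file name
merged = df.merge(soi, on='EIN', how='left')
print(merged[merged['donor_advisory_2016']==1]['mtrl_divrsn_or_misuse'].value_counts(dropna=False))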