Notebook Tasks¶

**_Possible Samples for Statistical Tests_**: - Given the above, there are a number of possible tests:

IV: SOX Policies	DV: Donor Advisory	N	Notes	TO DO
2011	2016	4,857	47 donor advisories on these organizations; associational test (we don't know when the SOX policies were added); also, DV is 'current donor advisory'	ready to run
2011	2012-2016	4,857	47 2016 advisories plus probably another dozen or so advisories over the 2012-2015 period; associational test as above, but adds in donor advisories that were put in place then dropped between 2012 and 2015.	some minor work creating this new DV but not very burdensome
2011	2011	5,439	39 donor advisories; pure cross-sectional test	Download the '2011' 990 data (SOX policies + controls) for the 39 orgs with a 2011 donor advisory; a few hours work to download and enter the data
2016	2016	8,304	328 donor advisories; pure cross-sectional test	ready to run
change 2011-2016	2016	4,857	'Divide 4,857 orgs into three groups: i) those with no SOX policies in 2011 and still no SOX policies in 2016; ii) those with SOX policies in 2011 and 2016; and iii) those with no SOX policies in 2011 but SOX policies in 2016. Create dummy variables for each group and see whether those in group iii) do better than i) or ii). This is a relatively low cost 'pre-post' test.	moderate amount of work to create the new dummies but not too burdensome
change 2011-2016	2012-2016	TBD	Similar to above option, but would need to take a sample of organizations in group iii) and go through their 990s to find out exactly when they added the SOX policies	Resource-intensive 990 searches

Notes from Meeting with Dan:

Do not do 3rd or 6th test -- benefit not worth the cost
1st and 2nd tests can be robustness analyses
Focus on 4th and 5th tests
Control variables:
- Size: total revenues best (probably logged)
  - will need 2011 and 2016 versions for the 4th and 5th tests
- efficiency ratio
- age (from BMF)
- complexity (could be a good control from Erica's paper)
- fixed effects:
  - state
  - category
    - I need to scrape the category dummies for the new orgs in the 2016 database
      - CN does not include that information in the ratings area, but it is included on the webpage in the 'breadcrumbs' area
- The focus of our paper is on SOC policies; if an org has SOX policies it probably has other governance policies, and these would be highly correlated. So, we will leave the other governance variables out of one version of the 4th and 5th tests, and then try to include them in another set. The best candidates are:
  - independent board --> related to Erica's *independence of key actors" concept
  - board review of 990 and audited financials --> both related to Erica's board monitoring concept
  - we could include other governance variables as needed.
We are focusing on non-health, non-university organizations; by focusing more on a donor-focused sample (CN), we are differentiating the work from previous studies.
To differentiate from Erica's JBE paper, we should use the SOI data to see how many of the donor advisories are because of 'non-material diversions'.

To Do (beyond notes listed in table above):

For all above tests, we need to decide on controls, then find/merge/create any not currently in dataset
Run a selection model?
Code the type of advisory? Maybe save for future study
There are 53 orgs on the CN 'Watchlist' -- we probably don't need to look at these but it's a possible future move.

**_Notes on 2011 data:_** - Only 47 of 329 current donor advisories are on orgs that were rated in 2011 - Number of 2011 orgs (n=5,349) missing from 2016 ratings: 582 - Number of 2016 orgs (n=8,304) not in 2011 ratings: 3,447 - In 2011 when I scraped the current ratings there are 39 blank rows. Specifically, I checked the following spreadsheet: *Charity Navigator - current ratings, October 18, 2011 (WITH UPDATES FOR DONOR ADVISORY ORGS).xlsx* -- 39 rows were blank for all ratings information, so I checked against the historical ratings on the CN website. (So far) all rows were either 1) dropped from CN, 2) had a donor advisory, or 3) still have a donor advisory. I have 5,439 orgs in the 2011 database. 39 seem to have had donor advisories on them at that time. So, the 2011 sample is the 5,400 orgs that did not have an advisory on them at the time. This conforms with the *n* of 5,400 in the above logit.

Import Packages¶

First, we will import several necessary Python packages. We will be using the Python Data Analysis Library, or PANDAS, extensively for our data manipulations. It is invaluable for analyzing datasets.

Import of basic elements of PANDAS and numpy

In [3]:

import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

We can check which version of various packages we're using. You can see I'm running PANDAS 0.17 here.

In [4]:

print pd.__version__

0.18.1

PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include this line at the top of all my notebooks.

In [5]:

#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', 500)

Read in Data¶

Let's read in the merged historical/current/2011 dataset we created in the last notebook. First we'll change the working directory.

In [6]:

cd '/Users/gregorysaxton/Google Drive/SOX'

/Users/gregorysaxton/Google Drive/SOX

Logit Tests¶

In [37]:

df_2011 = pd.read_pickle('Tests 1-2 data.pkl')
print len(df_2011.columns)
print len(df_2011)
df_2011.head(1)

35
4863

Out[37]:

	donor_advisory	donor_advisory_2016	donor_advisory_2011_to_2016	org_id	EIN	FYE	Form 990 FYE	ratings_system	2011_data	2016_data	conflict_of_interest_policy_v2	records_retention_policy_v2	whistleblower_policy_v2	SOX_policies	SOX_policies_binary	SOX_policies_all_binary	program_efficiency	complexity	complexity_2011	age	total_revenue_logged	category	state	tot_rev	category_Animals	category_Arts, Culture, Humanities	category_Community Development	category_Education	category_Environment	category_Health	category_Human Services	category_Human and Civil Rights	category_International	category_Religion	category_Research and Public Policy
50715	0.0	0.0	0.0	5954	010202467	FY2009	2009-12	CN 2.0	1.0	0.0	1.0	1.0	1.0	3.0	1.0	1.0	0.788895	0.0	3.0	62.0	15.947563	Research and Public Policy	ME	8432154.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0

In [38]:

print df_2011.columns.tolist()

['donor_advisory', 'donor_advisory_2016', 'donor_advisory_2011_to_2016', 'org_id', 'EIN', 'FYE', 'Form 990 FYE', 'ratings_system', '2011_data', '2016_data', 'conflict_of_interest_policy_v2', 'records_retention_policy_v2', 'whistleblower_policy_v2', 'SOX_policies', 'SOX_policies_binary', 'SOX_policies_all_binary', 'program_efficiency', 'complexity', 'complexity_2011', 'age', 'total_revenue_logged', 'category', 'state', 'tot_rev', 'category_Animals', 'category_Arts, Culture, Humanities', 'category_Community Development', 'category_Education', 'category_Environment', 'category_Health', 'category_Human Services', 'category_Human and Civil Rights', 'category_International', 'category_Religion', 'category_Research and Public Policy']

In [39]:

print df_2011['donor_advisory_2016'].value_counts()
print df_2011['donor_advisory_2011_to_2016'].value_counts()

0.0    4816
1.0      47
Name: donor_advisory_2016, dtype: int64
0.0    4755
1.0     108
Name: donor_advisory_2011_to_2016, dtype: int64

In [40]:

#DVs = ['donor_advisory', 
DVs = ['donor_advisory_2016', 'donor_advisory_2011_to_2016']
indicators = ['org_id', 'EIN', 'FYE', 'Form 990 FYE', 'ratings_system', '2011_data', '2016_data']
IVs = ['conflict_of_interest_policy_v2', 'records_retention_policy_v2', 'whistleblower_policy_v2',
       'SOX_policies', 'SOX_policies_binary', 'SOX_policies_all_binary']
controls = ['program_efficiency', 'complexity', 'age', 'total_revenue_logged', 'category', 'state']
fixed_effects = ['category_Animals', 'category_Arts, Culture, Humanities', 'category_Community Development', 
                 'category_Education', 'category_Environment', 'category_Health', 'category_Human Services', 
                 'category_Human and Civil Rights', 'category_International', 'category_Religion', 
                 'category_Research and Public Policy']
SOI_check = ['tot_rev']

merge_cols = ['_merge_v1', '_merge_v2', '_merge_v3', '_merge_v4', '_merge_47', '_merge_efile']

#+ SOI_check
logit_cols = DVs + indicators + IVs + controls  + fixed_effects
print logit_cols

['donor_advisory_2016', 'donor_advisory_2011_to_2016', 'org_id', 'EIN', 'FYE', 'Form 990 FYE', 'ratings_system', '2011_data', '2016_data', 'conflict_of_interest_policy_v2', 'records_retention_policy_v2', 'whistleblower_policy_v2', 'SOX_policies', 'SOX_policies_binary', 'SOX_policies_all_binary', 'program_efficiency', 'complexity', 'age', 'total_revenue_logged', 'category', 'state', 'category_Animals', 'category_Arts, Culture, Humanities', 'category_Community Development', 'category_Education', 'category_Environment', 'category_Health', 'category_Human Services', 'category_Human and Civil Rights', 'category_International', 'category_Religion', 'category_Research and Public Policy']

In [41]:

df_2011[df_2011.duplicated()]

Out[41]:

	donor_advisory	donor_advisory_2016	donor_advisory_2011_to_2016	org_id	EIN	FYE	Form 990 FYE	ratings_system	2011_data	2016_data	conflict_of_interest_policy_v2	records_retention_policy_v2	whistleblower_policy_v2	SOX_policies	SOX_policies_binary	SOX_policies_all_binary	program_efficiency	complexity	complexity_2011	age	total_revenue_logged	category	state	tot_rev	category_Animals	category_Arts, Culture, Humanities	category_Community Development	category_Education	category_Environment	category_Health	category_Human Services	category_Human and Civil Rights	category_International	category_Religion	category_Research and Public Policy

In [36]:

#df_2011.to_excel('df_2011.xls')

Test Logit¶

In [13]:

import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf   #FOR USING 'R'-STYLE FORMULAS FOR REGRESSIONS
print statsmodels.__version__

0.6.1

In [52]:

#IVs = '%s + ' % IV
#DV = '%s ~ ' % DV  
IVs = 'SOX_policies '
#IVs = 'SOX_policies_binary'
#DV = 'advisory ~ '
#DV = 'donor_advisory_2016 ~ '
DV = 'donor_advisory_2011_to_2016 ~ '
controls = '+ total_revenue_logged +  program_efficiency + age + complexity_2011 + C(category)'

#admin_expense_percent + leader_comp_percent + budget_surplus
logit_formula = DV+IVs+controls
print logit_formula
#globals()["mod%s" % model_num] = smf.logit(formula=logit_formula, data=df).fit()   
#print globals()["mod%s" % model_num].summary()
# #print model_num.summary()
#print '\n', "Chi-squared value:", globals()["mod%s" % model_num].llr, '\n' #TO GET THE CHI-SQUARED VALUE

donor_advisory_2011_to_2016 ~ SOX_policies + total_revenue_logged +  program_efficiency + age + complexity_2011 + C(category)

In [22]:

print df_2011.columns.tolist()

['donor_advisory', 'donor_advisory_2016', 'donor_advisory_2011_to_2016', 'org_id', 'EIN', 'FYE', 'Form 990 FYE', 'ratings_system', '2011_data', '2016_data', 'conflict_of_interest_policy_v2', 'records_retention_policy_v2', 'whistleblower_policy_v2', 'SOX_policies', 'SOX_policies_binary', 'SOX_policies_all_binary', 'program_efficiency', 'complexity', 'age', 'total_revenue_logged', 'category', 'state', 'tot_rev', 'category_Animals', 'category_Arts, Culture, Humanities', 'category_Community Development', 'category_Education', 'category_Environment', 'category_Health', 'category_Human Services', 'category_Human and Civil Rights', 'category_International', 'category_Religion', 'category_Research and Public Policy']

In [24]:

logit_cols1 = ['donor_advisory_2016', 'donor_advisory_2011_to_2016', 
               'SOX_policies', 'program_efficiency', 'complexity', 'age', 'total_revenue_logged', 'category', 'state']
len(df_2011[logit_cols1].dropna())

Out[24]:

In [19]:

len(df_2011.dropna())

Out[19]:

In [53]:

logit = smf.logit(formula=logit_formula, data=df_2011).fit() 
logit.summary()

Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.075203
         Iterations: 35

//anaconda/lib/python2.7/site-packages/statsmodels/base/model.py:466: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)

Out[53]:

Logit Regression Results
Dep. Variable:	donor_advisory_2011_to_2016	No. Observations:	4833
Model:	Logit	Df Residuals:	4817
Method:	MLE	Df Model:	15
Date:	Tue, 06 Sep 2016	Pseudo R-squ.:	0.08962
Time:	12:35:26	Log-Likelihood:	-363.45
converged:	False	LL-Null:	-399.24
		LLR p-value:	2.351e-09

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	-2.6828	1.438	-1.865	0.062	-5.502 0.136
C(category)[T.Arts, Culture, Humanities]	-0.5078	0.515	-0.986	0.324	-1.517 0.502
C(category)[T.Community Development]	-0.3612	0.575	-0.629	0.530	-1.487 0.765
C(category)[T.Education]	-1.2152	0.794	-1.531	0.126	-2.771 0.340
C(category)[T.Environment]	-18.8719	4856.123	-0.004	0.997	-9536.698 9498.954
C(category)[T.Health]	-1.0195	0.548	-1.862	0.063	-2.093 0.054
C(category)[T.Human Services]	-0.1790	0.415	-0.431	0.666	-0.993 0.635
C(category)[T.Human and Civil Rights]	-0.6823	0.687	-0.993	0.321	-2.029 0.664
C(category)[T.International]	-0.3676	0.511	-0.719	0.472	-1.369 0.634
C(category)[T.Religion]	0.3141	0.449	0.700	0.484	-0.566 1.194
C(category)[T.Research and Public Policy]	-0.5298	0.796	-0.665	0.506	-2.090 1.031
SOX_policies	-0.4256	0.110	-3.876	0.000	-0.641 -0.210
total_revenue_logged	0.2806	0.101	2.785	0.005	0.083 0.478
program_efficiency	-3.0563	0.767	-3.982	0.000	-4.561 -1.552
age	-0.0065	0.007	-0.981	0.327	-0.019 0.006
complexity_2011	-0.7391	0.274	-2.700	0.007	-1.275 -0.203

In [29]:

df_2011['donor_advisory_2016'] = df_2011['donor_advisory_2016'].astype('int')
df_2011['donor_advisory_2011_to_2016'] = df_2011['donor_advisory_2011_to_2016'].astype('int')

In [30]:

df_2011.dtypes

Out[30]:

donor_advisory                         float64
donor_advisory_2016                      int64
donor_advisory_2011_to_2016              int64
org_id                                  object
EIN                                     object
FYE                                     object
Form 990 FYE                            object
ratings_system                          object
2011_data                              float64
2016_data                              float64
conflict_of_interest_policy_v2         float64
records_retention_policy_v2            float64
whistleblower_policy_v2                float64
SOX_policies                           float64
SOX_policies_binary                    float64
SOX_policies_all_binary                float64
program_efficiency                     float64
complexity                             float64
age                                    float64
total_revenue_logged                   float64
category                                object
state                                   object
tot_rev                                float64
category_Animals                       float64
category_Arts, Culture, Humanities     float64
category_Community Development         float64
category_Education                     float64
category_Environment                   float64
category_Health                        float64
category_Human Services                float64
category_Human and Civil Rights        float64
category_International                 float64
category_Religion                      float64
category_Research and Public Policy    float64
dtype: object

Write Function for Logits¶

In [ ]:

#def new_logit(IV,model_num):
def new_logit_clustered(data, DV, columns, FE, model_num):
    #IVs = '%s + ' % IV
    #DV = 'RTs_binary ~ '
    DV = '%s ~ ' % DV  
    #controls = 'from_user_followers_count + time_on_twitter_days + CSR_sustainability +  \
    #         URLs_binary + photo'
    IVs = ' + '.join(columns)
    FE = '%s ' % FE
    logit_formula = DV+IVs+FE
    print logit_formula
    globals()["mod%s" % model_num] = smf.logit(formula=logit_formula, data=data).fit(cov_type='cluster',
                                                        cov_kwds={'groups': df['firm_from_user_screen_name']})   
    print globals()["mod%s" % model_num].summary()
    #print model_num.summary()
    print '\n', "Chi-squared value:", globals()["mod%s" % model_num].llr, '\n' #TO GET THE CHI-SQUARED VALUE
    #print '\n', "Pseudo R-squared:", globals()["mod%s" % model_num].prsquared #TO GET THE PSEUDO-R-SQUARED

In [1157]:

donor_advisory_2011_to_2016 ~ SOX_policies + total_revenue_logged +  program_efficiency + age + complexity_2011 + C(category)

In [1089]:

logit_variables = ['2011_data', 'donor_advisory_2016', 'SOX_policies', 'total_revenue_logged', 
                   'program_efficiency', 'age', 'complexity', 'state', 'category']
df_2011 = df[logit_variables]
df_2011 = df_2011[df_2011['2011_data']==1]
print len(df_2011)
len(df_2011.dropna())

Out[1089]:

In [1100]:

df[['2011_data', 'donor_advisory_2016', 'SOX_policies', 'total_revenue_logged', 
                   'program_efficiency', 'age', 'complexity', 'complexity_2011', 'state', 'category']].describe().T

Out[1100]:

	count	mean	std	min	25%	50%	75%	max
2011_data	84958	0.057240	0.232302	0.000000	0.000000	0.000000	0.000000	1.000000
donor_advisory_2016	84958	0.004332	0.065672	0.000000	0.000000	0.000000	0.000000	1.000000
SOX_policies	21894	2.724582	0.689867	0.000000	3.000000	3.000000	3.000000	3.000000
total_revenue_logged	21825	15.911470	1.458552	11.616123	14.781636	15.706608	16.864409	22.042788
program_efficiency	21894	0.805400	0.103635	0.000000	0.756568	0.817758	0.871105	1.010186
age	83830	39.508147	19.310175	0.000000	24.000000	35.000000	52.000000	108.000000
complexity	84958	0.373031	1.220945	0.000000	0.000000	0.000000	0.000000	8.000000
complexity_2011	4833	2.466791	0.514468	1.000000	2.000000	2.000000	3.000000	3.000000

In [1158]:

logit = smf.logit(formula=logit_formula, data=df[df['2011_data']==1]).fit() 
logit.summary()

Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.075575
         Iterations: 35

//anaconda/lib/python2.7/site-packages/statsmodels/base/model.py:466: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)

Out[1158]:

Logit Regression Results
Dep. Variable:	donor_advisory_2011_to_2016	No. Observations:	4808
Model:	Logit	Df Residuals:	4792
Method:	MLE	Df Model:	15
Date:	Fri, 02 Sep 2016	Pseudo R-squ.:	0.08892
Time:	12:11:05	Log-Likelihood:	-363.36
converged:	False	LL-Null:	-398.83
		LLR p-value:	3.049e-09

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	-2.6649	1.444	-1.845	0.065	-5.495 0.166
C(category)[T.Arts, Culture, Humanities]	-0.5071	0.515	-0.985	0.325	-1.516 0.502
C(category)[T.Community Development]	-0.3595	0.575	-0.626	0.532	-1.486 0.767
C(category)[T.Education]	-1.2142	0.794	-1.530	0.126	-2.770 0.341
C(category)[T.Environment]	-20.2785	9810.033	-0.002	0.998	-1.92e+04 1.92e+04
C(category)[T.Health]	-1.0181	0.548	-1.859	0.063	-2.092 0.055
C(category)[T.Human Services]	-0.1780	0.415	-0.429	0.668	-0.992 0.636
C(category)[T.Human and Civil Rights]	-0.6780	0.687	-0.987	0.324	-2.025 0.669
C(category)[T.International]	-0.3664	0.511	-0.717	0.474	-1.368 0.636
C(category)[T.Religion]	0.3200	0.449	0.712	0.476	-0.560 1.200
C(category)[T.Research and Public Policy]	-0.5291	0.796	-0.665	0.506	-2.090 1.031
SOX_policies	-0.4264	0.110	-3.882	0.000	-0.642 -0.211
total_revenue_logged	0.2790	0.101	2.753	0.006	0.080 0.478
program_efficiency	-3.0518	0.768	-3.976	0.000	-4.556 -1.547
age	-0.0064	0.007	-0.977	0.329	-0.019 0.006
complexity_2011	-0.7374	0.274	-2.695	0.007	-1.274 -0.201

Test with standard errors clustered on state¶

In [47]:

logit_formula

Out[47]:

'donor_advisory_2016 ~ SOX_policies + total_revenue_logged +  program_efficiency + age + complexity_2011 + C(category)'

In [54]:

cols1 = ['donor_advisory_2011_to_2016', 'SOX_policies', 'total_revenue_logged', 'program_efficiency', 'age',
         'complexity_2011', 'category', 'state']

In [55]:

logit = smf.logit(formula=logit_formula, data=df_2011[cols1].dropna()).fit(cov_type='cluster', 
                                                    cov_kwds={'groups': df_2011[cols1].dropna()['state']}) 
logit.summary()

Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.075203
         Iterations: 35

//anaconda/lib/python2.7/site-packages/statsmodels/base/model.py:466: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)

Out[55]:

Logit Regression Results
Dep. Variable:	donor_advisory_2011_to_2016	No. Observations:	4833
Model:	Logit	Df Residuals:	4817
Method:	MLE	Df Model:	15
Date:	Tue, 06 Sep 2016	Pseudo R-squ.:	0.08962
Time:	12:35:43	Log-Likelihood:	-363.45
converged:	False	LL-Null:	-399.24
		LLR p-value:	2.351e-09

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	-2.6828	1.038	-2.585	0.010	-4.717 -0.649
C(category)[T.Arts, Culture, Humanities]	-0.5078	0.465	-1.092	0.275	-1.419 0.404
C(category)[T.Community Development]	-0.3612	0.700	-0.516	0.606	-1.733 1.010
C(category)[T.Education]	-1.2152	0.683	-1.779	0.075	-2.554 0.123
C(category)[T.Environment]	-18.8719	0.433	-43.552	0.000	-19.721 -18.023
C(category)[T.Health]	-1.0195	0.520	-1.959	0.050	-2.039 0.000
C(category)[T.Human Services]	-0.1790	0.370	-0.484	0.628	-0.904 0.546
C(category)[T.Human and Civil Rights]	-0.6823	0.483	-1.412	0.158	-1.629 0.265
C(category)[T.International]	-0.3676	0.487	-0.754	0.451	-1.323 0.588
C(category)[T.Religion]	0.3141	0.470	0.668	0.504	-0.607 1.235
C(category)[T.Research and Public Policy]	-0.5298	0.526	-1.007	0.314	-1.561 0.501
SOX_policies	-0.4256	0.093	-4.588	0.000	-0.607 -0.244
total_revenue_logged	0.2806	0.074	3.776	0.000	0.135 0.426
program_efficiency	-3.0563	0.816	-3.744	0.000	-4.656 -1.456
age	-0.0065	0.005	-1.199	0.231	-0.017 0.004
complexity_2011	-0.7391	0.263	-2.814	0.005	-1.254 -0.224

In [ ]:

Below this I've played around with creating a '2011' dataset --> and run a couple of logits¶

Here are the two variables that serve as indicators of '2016' and '2011' CN data. These 8,304 and 4,863 rows will serve as the base for conducting the logit regressions. Design variable creation solutions are these subsets of data.

In [373]:

print len(df[df['latest_entry']=='True'])
print len(df[df['2011 data']==1])

8304
4863

In [ ]:

- Control variables:
  - Size: total revenues best (probably logged)
    - will need 2011 and 2016 versions for the 4th and 5th tests
  - efficiency ratio
  - complexity (could be a good control from Erica's paper)
  - fixed effects:
    - state
    - category
      - I need to scrape the category dummies for the new orgs in the 2016 database
        - CN does not include that information in the ratings area, but it is included on the webpage in the 'breadcrumbs' area
  - The focus of our paper is on SOX policies; if an org has SOX policies it probably has other governance policies, and these would be highly correlated. So, we will leave the other governance variables out of one version of the 4th and 5th tests, and then try to include them in another set. The best candidates are:
    - *independent board* --> related to Erica's *independence of key actors" concept
    - *board review of 990* and *audited financials* --> both related to Erica's *board monitoring* concept
    - we could include other governance variables as needed.
- We are focusing on non-health, non-university organizations; by focusing more on a donor-focused sample (CN), we are differentiating the work from previous studies.
- To differentiate from Erica's *JBE* paper, we should use the SOI data to see how many of the donor advisories are because of 'non-material diversions'.

In [397]:

cols_2011 = [col for col in list(df) if col.endswith('_2011')]
print cols_2011

['charity_name_2011', 'category_2011', 'city_2011', 'state_2011', 'cause_2011', 'tag_line_2011', 'url_2011', 'ein_2011', 'fye_2011', 'overall_rating_2011', 'overall_rating_star_2011', 'efficiency_rating_2011', 'AT_rating_2011', 'financial_rating_star_2011', 'AT_rating_star_2011', 'program_expense_percent_2011', 'admin_expense_percent_2011', 'fund_expense_percent_2011', 'fund_efficiency_2011', 'primary_revenue_growth_2011', 'program_expense_growth_2011', 'working_capital_ratio_2011', 'independent_board_2011', 'no_material_division_2011', 'audited_financials_2011', 'no_loans_related_2011', 'documents_minutes_2011', 'form_990_2011', 'conflict_of_interest_policy_2011', 'whistleblower_policy_2011', 'records_retention_policy_2011', 'CEO_listed_2011', 'process_CEO_compensation_2011', 'no_board_compensation_2011', 'donor_privacy_policy_2011', 'board_listed_2011', 'audited_financials_web_2011', 'form_990_web_2011', 'staff_listed_2011', 'primary_revenue_2011', 'other_revenue_2011', 'total_revenue_2011', 'govt_revenue_2011', 'program_expense_2011', 'admin_expense_2011', 'fund_expense_2011', 'total_functional_expense_2011', 'affiliate_payments_2011', 'budget_surplus_2011', 'net_assets_2011', 'leader_comp_2011', 'leader_comp_percent_2011', 'email_2011', 'website_2011']

In [398]:

df[cols_2011].dtypes

Out[398]:

charity_name_2011                    object
category_2011                        object
city_2011                            object
state_2011                           object
cause_2011                           object
tag_line_2011                        object
url_2011                             object
ein_2011                             object
fye_2011                             object
overall_rating_2011                 float64
overall_rating_star_2011            float64
efficiency_rating_2011              float64
AT_rating_2011                      float64
financial_rating_star_2011          float64
AT_rating_star_2011                 float64
program_expense_percent_2011        float64
admin_expense_percent_2011          float64
fund_expense_percent_2011           float64
fund_efficiency_2011                float64
primary_revenue_growth_2011         float64
program_expense_growth_2011         float64
working_capital_ratio_2011          float64
independent_board_2011               object
no_material_division_2011            object
audited_financials_2011              object
no_loans_related_2011                object
documents_minutes_2011               object
form_990_2011                        object
conflict_of_interest_policy_2011     object
whistleblower_policy_2011            object
records_retention_policy_2011        object
CEO_listed_2011                      object
process_CEO_compensation_2011        object
no_board_compensation_2011           object
donor_privacy_policy_2011            object
board_listed_2011                    object
audited_financials_web_2011          object
form_990_web_2011                    object
staff_listed_2011                    object
primary_revenue_2011                float64
other_revenue_2011                  float64
total_revenue_2011                  float64
govt_revenue_2011                    object
program_expense_2011                float64
admin_expense_2011                  float64
fund_expense_2011                   float64
total_functional_expense_2011       float64
affiliate_payments_2011             float64
budget_surplus_2011                 float64
net_assets_2011                     float64
leader_comp_2011                    float64
leader_comp_percent_2011            float64
email_2011                           object
website_2011                         object
dtype: object

In [379]:

df[cols_2011].describe().T

Out[379]:

	count	mean	std	min	25%	50%	75%	max
overall_rating_2011	4833	53.547455	8.933975e+00	8.040000e+00	48.73	54.61	59.97	6.996000e+01
overall_rating_star_2011	4833	2.869646	8.916785e-01	0.000000e+00	2.00	3.00	3.00	4.000000e+00
efficiency_rating_2011	4833	53.430780	1.048261e+01	0.000000e+00	46.53	54.88	61.73	6.997000e+01
AT_rating_2011	4833	56.446100	1.152050e+01	-2.000000e+00	52.00	59.00	63.00	7.000000e+01
financial_rating_star_2011	4833	2.850403	1.007304e+00	0.000000e+00	2.00	3.00	4.00	4.000000e+00
AT_rating_star_2011	4833	3.128078	1.001005e+00	0.000000e+00	3.00	3.00	4.00	4.000000e+00
program_expense_percent_2011	4833	80.416325	1.055343e+01	2.200000e+00	75.50	81.60	87.00	9.970000e+01
admin_expense_percent_2011	4833	10.320608	6.665639e+00	-5.000000e-01	5.80	9.20	13.20	6.790000e+01
fund_expense_percent_2011	4833	9.110180	8.056593e+00	0.000000e+00	4.10	7.50	11.90	9.070000e+01
fund_efficiency_2011	4833	0.129679	5.190376e-01	0.000000e+00	0.05	0.10	0.16	3.533000e+01
primary_revenue_growth_2011	4833	4.403580	1.554988e+01	-7.960000e+01	-3.50	2.60	10.20	2.452000e+02
program_expense_growth_2011	4833	6.314918	2.073699e+01	-5.320000e+01	-1.10	4.10	11.10	1.007400e+03
working_capital_ratio_2011	4833	1.788438	2.455707e+00	-3.260000e+00	0.44	1.01	2.20	5.842000e+01
primary_revenue_2011	4833	16525465.992551	7.313324e+07	1.498700e+04	1989838.00	4430280.00	11346855.00	3.502077e+09
other_revenue_2011	4833	651129.463066	6.235616e+06	-6.612359e+07	5922.00	99129.00	443689.00	2.392543e+08
total_revenue_2011	4833	17176106.777985	7.570407e+07	-4.263887e+07	2103386.00	4673878.00	11721565.00	3.587775e+09
program_expense_2011	4833	14784010.616387	6.451777e+07	2.848300e+04	1694422.00	3808132.00	9557716.00	3.091879e+09
admin_expense_2011	4833	1477281.476929	7.805211e+06	-8.585600e+04	188863.00	412603.00	1010869.00	4.323913e+08
fund_expense_2011	4833	1133717.688599	5.382822e+06	0.000000e+00	160486.00	359811.00	830250.00	2.231224e+08
total_functional_expense_2011	4833	17395155.368922	7.292829e+07	1.507310e+05	2188637.00	4769351.00	11766482.00	3.354177e+09
affiliate_payments_2011	4833	60756.626733	1.812179e+06	-4.059500e+05	0.00	0.00	0.00	1.235951e+08
budget_surplus_2011	4833	-219048.159321	9.895040e+06	-3.008552e+08	-477470.00	1652.00	452604.00	2.335980e+08
net_assets_2011	4833	33372071.914339	1.651812e+08	-1.691832e+07	2013666.00	6004379.00	19206628.00	7.002755e+09
leader_comp_2011	4657	153519.535753	1.313950e+05	0.000000e+00	80808.00	126250.00	191203.00	2.257910e+06
leader_comp_percent_2011	4657	3.166133	3.015173e+00	0.000000e+00	1.00	2.33	4.44	3.090000e+01

In [380]:

from __future__ import division

Program Efficiency Ratio¶

efficiency = ProgExp/TotExp

In [386]:

df[df['org_id']=='12123']['program_expense_2011']/df[df['org_id']=='12123']['total_functional_expense_2011']

Out[386]:

34         NaN
35         NaN
36         NaN
37         NaN
38         NaN
39    0.824939
40         NaN
41         NaN
42         NaN
dtype: float64

In [388]:

df[df['org_id']=='12123'][cols_2011]

Out[388]:

	charity_name_2011	category_2011	city_2011	state_2011	cause_2011	tag_line_2011	url_2011	ein_2011	fye_2011	overall_rating_2011	overall_rating_star_2011	efficiency_rating_2011	AT_rating_2011	financial_rating_star_2011	AT_rating_star_2011	program_expense_percent_2011	admin_expense_percent_2011	fund_expense_percent_2011	fund_efficiency_2011	primary_revenue_growth_2011	program_expense_growth_2011	working_capital_ratio_2011	independent_board_2011	no_material_division_2011	audited_financials_2011	no_loans_related_2011	documents_minutes_2011	form_990_2011	conflict_of_interest_policy_2011	whistleblower_policy_2011	records_retention_policy_2011	CEO_listed_2011	process_CEO_compensation_2011	no_board_compensation_2011	donor_privacy_policy_2011	board_listed_2011	audited_financials_web_2011	form_990_web_2011	staff_listed_2011	primary_revenue_2011	other_revenue_2011	total_revenue_2011	govt_revenue_2011	program_expense_2011	admin_expense_2011	fund_expense_2011	total_functional_expense_2011	affiliate_payments_2011	budget_surplus_2011	net_assets_2011	leader_comp_2011	leader_comp_percent_2011	email_2011	website_2011
34	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
35	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
36	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
37	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
38	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
39	100 Club of Arizona	Human Services	Phoenix	AZ	Multipurpose Human Service Organizations	Supporting families of public safety	http://www.charitynavigator.org/index.cfm?bay=search.summary&orgid=12123	23-7172077	12/2009	63.84	4	66.58	62	4	4	82.4	14.4	3	0.03	6.2	6.3	1.06	yes	yes	yes	yes	yes	yes	yes	yes	NO	yes	yes	yes	yes	yes	NO	yes	yes	1212051	-227543	984508	Note: This organization receives $0 in government support.	1019191	178385	37899	1235475	0	-250967	1316781	122623	9.92	info@100club.org	http://www.100club.org
40	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
41	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
42	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

In [395]:

df[df['2011 data']==1]['program_expense_2011'][:5]/df[df['2011 data']==1]['total_functional_expense_2011'][:5]

Out[395]:

9          NaN
21    0.797448
39    0.824939
52    0.854655
63    0.786945
dtype: float64

Complexity¶

Number of revenue sources (Donations, Government Grants, Program Service Revenues)

In [407]:

print len(df[df['2011 data']==1])
print len(df[df['govt_revenue_2011'].notnull()])
df[df['govt_revenue_2011'].notnull()]['govt_revenue_2011'].value_counts()

4863
4833

Out[407]:

Note: This organization receives $0 in government support.    2531
GOVERNMENT SUPPORT MUST BE RECEIVED                           2302
Name: govt_revenue_2011, dtype: int64

In [409]:

df[df['govt_revenue_2011'].notnull()]['govt_revenue_2011'][5:12]

Out[409]:

161                           GOVERNMENT SUPPORT MUST BE RECEIVED
185    Note: This organization receives $0 in government support.
208                           GOVERNMENT SUPPORT MUST BE RECEIVED
228    Note: This organization receives $0 in government support.
244    Note: This organization receives $0 in government support.
255    Note: This organization receives $0 in government support.
276    Note: This organization receives $0 in government support.
Name: govt_revenue_2011, dtype: object

In [410]:

df['govt_revenue_2011_binary'] = np.nan
df['govt_revenue_2011_binary'] = np.where(
    df['govt_revenue_2011'] == 'Note: This organization receives $0 in government support.', 0,
    df['govt_revenue_2011_binary'])
df['govt_revenue_2011_binary'] = np.where(
    df['govt_revenue_2011'] == 'GOVERNMENT SUPPORT MUST BE RECEIVED', 1, 
    df['govt_revenue_2011_binary'])
df['govt_revenue_2011_binary'].value_counts()

Out[410]:

0    2531
1    2302
Name: govt_revenue_2011_binary, dtype: int64

In [420]:

print len(df[df['2011 data']==1])
print len(df[df['primary_revenue_2011'].notnull()])
print df[df['primary_revenue_2011'].notnull()]['primary_revenue_2011'][:5]
df[df['primary_revenue_2011'].notnull()]['primary_revenue_2011'].value_counts().sum()

4863
4833
21     3914222
39     1212051
52      762512
63     1140158
148    1375169
Name: primary_revenue_2011, dtype: float64

Out[420]:

In [427]:

df[df['other_revenue_2011']<1000]['other_revenue_2011']

Out[427]:

39        -227543
63         -44041
312       -572243
362         -2666
385         -7531
477      -4653961
558       -176202
907      -3543041
942      -1726139
1161      -831757
1317        -7494
1352      -651915
1372      -288151
1390            0
1604     -7067577
1674      -271688
1689    -11111670
1705       -46280
1927     -1702433
1955        -6955
1982       -77563
2013      -388287
2027       -76537
2099        -1543
2196      -162612
2249        -2860
2352      -113036
2436          678
2597        -7227
2663      -456651
           ...   
82598    -1308477
82683           0
82914     -212856
82971    -1358658
82987      -29502
83033    -1003245
83098     -356419
83115   -10947728
83366         991
83423      -18112
83590     -173857
83632       -1810
83713    -4529558
83746           0
83758         251
83841     -364815
83873     -313582
83898           0
83910         300
83911           0
83912          56
83913          28
83917       -4437
83920       -3318
83922         263
83925         748
83927           0
83928          30
83932     -228354
83940    -6663315
Name: other_revenue_2011, dtype: float64

In [429]:

len(df[df['other_revenue_2011']==0])

Out[429]:

In [419]:

print len(df[df['2011 data']==1])
print len(df[df['other_revenue_2011'].notnull()])
print df[df['other_revenue_2011'].notnull()]['other_revenue_2011'][:5]
df[df['other_revenue_2011'].notnull()]['other_revenue_2011'].value_counts().sum()

4863
4833
21     216503
39    -227543
52      21340
63     -44041
148      5061
Name: other_revenue_2011, dtype: float64

Out[419]:

In [431]:

print len(df[df['other_revenue_2011'].notnull()])
df['other_revenue_2011_binary'] = np.where(df['other_revenue_2011']>0, 1, df['other_revenue_2011'] )
df['other_revenue_2011_binary'] = np.where(df['other_revenue_2011']<0, 1, df['other_revenue_2011_binary'] )
print len(df[df['other_revenue_2011_binary'].notnull()])
df['other_revenue_2011_binary'].value_counts()

4833
4833

Out[431]:

1    4787
0      46
Name: other_revenue_2011_binary, dtype: int64

In [434]:

df['complexity_2011'] = 1 + df['other_revenue_2011_binary'] + df['govt_revenue_2011_binary']
print len(df[df['complexity_2011'].notnull()])
df['complexity_2011'].value_counts()

Out[434]:

2    2501
3    2294
1      38
Name: complexity_2011, dtype: int64

In [ ]:

cols_2011 = [col for col in list(df) if col.endswith('_2011')]
print cols_2011

In [ ]:

Test Logits¶

In [1132]:

import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf   #FOR USING 'R'-STYLE FORMULAS FOR REGRESSIONS
print statsmodels.__version__

0.6.1

In [455]:

df['SOX_policies_2011'] = df['conflict_of_interest_policy_2011'] + ' ' + df['whistleblower_policy_2011'] + ' ' + df['records_retention_policy_2011']
df['SOX_policies_2011'].value_counts()

Out[455]:

yes yes yes    3548
yes NO NO       416
yes yes NO      284
yes NO yes      265
NO NO NO        255
NO NO yes        30
NO yes yes       21
NO yes NO        14
Name: SOX_policies_2011, dtype: int64

In [456]:

df['SOX_policies_2011'] = df['SOX_policies_2011'].str.count('yes')
df['SOX_policies_2011'].value_counts()

Out[456]:

3    3548
2     570
1     460
0     255
Name: SOX_policies_2011, dtype: int64

In [491]:

df['total_revenue_2011_logged'] = np.log(df['total_revenue_2011'])
df[['total_revenue_2011_logged', 'total_revenue_2011']].describe().T

Out[491]:

	count	mean	std	min	25%	50%	75%	max
total_revenue_2011_logged	4811	15.530981	1.282845	12.586466	14.569243	15.36338	16.27977	2.200080e+01
total_revenue_2011	4833	17176106.777985	75704074.259743	-42638874.000000	2103386.000000	4673878.00000	11721565.00000	3.587775e+09

In [ ]:

df['advisory'] = df['']

In [545]:

#IVs = '%s + ' % IV
#DV = '%s ~ ' % DV  
IVs = 'SOX_policies_2011 '
#DV = 'advisory ~ '
DV = 'donor_advisory_2016 ~ '
controls = '+ total_revenue_2011_logged +  program_expense_percent_2011 + age +  \
            complexity_2011 + C(category)'

#admin_expense_percent + leader_comp_percent + budget_surplus
logit_formula = DV+IVs+controls
print logit_formula
#globals()["mod%s" % model_num] = smf.logit(formula=logit_formula, data=df).fit()   
#print globals()["mod%s" % model_num].summary()
# #print model_num.summary()
#print '\n', "Chi-squared value:", globals()["mod%s" % model_num].llr, '\n' #TO GET THE CHI-SQUARED VALUE

donor_advisory_2016 ~ SOX_policies_2011 + total_revenue_2011_logged +  program_expense_percent_2011 + age +              complexity_2011 + C(category)

In [532]:

df.to_csv('df.csv', encoding='utf-8')

In [535]:

logit_variables = ['donor_advisory_2016', 'SOX_policies_2011', 'total_revenue_2011_logged', 
                   'program_expense_percent_2011', 'age', 'complexity_2011', 'state_2011', 'category']
df_2011 = df[logit_variables]
print len(df_2011)
len(df_2011.dropna())

Out[535]:

In [519]:

#IVs = '%s + ' % IV
#DV = '%s ~ ' % DV  
IVs = 'SOX_policies_2011 '
#DV = 'advisory ~ '
DV = 'donor_advisory_2016 ~ '
controls = '+ total_revenue_2011_logged +  program_expense_percent_2011 + age +  \
            complexity_2011'

#admin_expense_percent + leader_comp_percent + budget_surplus
logit_formula = DV+IVs+controls
print logit_formula
#globals()["mod%s" % model_num] = smf.logit(formula=logit_formula, data=df).fit()   
#print globals()["mod%s" % model_num].summary()
# #print model_num.summary()
#print '\n', "Chi-squared value:", globals()["mod%s" % model_num].llr, '\n' #TO GET THE CHI-SQUARED VALUE

donor_advisory_2016 ~ SOX_policies_2011 + total_revenue_2011_logged +  program_expense_percent_2011 + age +              complexity_2011

In [536]:

logit = smf.logit(formula=logit_formula, data=df_2011).fit() 
logit.summary()

Optimization terminated successfully.
         Current function value: 0.044916
         Iterations 10

Out[536]:

Logit Regression Results
Dep. Variable:	donor_advisory_2016	No. Observations:	4808
Model:	Logit	Df Residuals:	4802
Method:	MLE	Df Model:	5
Date:	Wed, 31 Aug 2016	Pseudo R-squ.:	0.08557
Time:	21:43:38	Log-Likelihood:	-215.96
converged:	True	LL-Null:	-236.17
		LLR p-value:	1.229e-07

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	-3.0296	1.925	-1.574	0.115	-6.802 0.743
SOX_policies_2011	-0.4992	0.145	-3.448	0.001	-0.783 -0.215
total_revenue_2011_logged	0.3113	0.138	2.260	0.024	0.041 0.581
program_expense_percent_2011	-0.0285	0.009	-3.068	0.002	-0.047 -0.010
age	-0.0104	0.009	-1.130	0.258	-0.028 0.008
complexity_2011	-1.2189	0.358	-3.402	0.001	-1.921 -0.517

In [ ]:

#def new_logit(IV,model_num):
def new_logit_clustered(data, DV, columns, FE, model_num):
    #IVs = '%s + ' % IV
    #DV = 'RTs_binary ~ '
    DV = '%s ~ ' % DV  
    #controls = 'from_user_followers_count + time_on_twitter_days + CSR_sustainability +  \
    #         URLs_binary + photo'
    IVs = ' + '.join(columns)
    FE = '%s ' % FE
    logit_formula = DV+IVs+FE
    print logit_formula
    globals()["mod%s" % model_num] = smf.logit(formula=logit_formula, data=data).fit(cov_type='cluster',
                                                        cov_kwds={'groups': df['firm_from_user_screen_name']})   
    print globals()["mod%s" % model_num].summary()
    #print model_num.summary()
    print '\n', "Chi-squared value:", globals()["mod%s" % model_num].llr, '\n' #TO GET THE CHI-SQUARED VALUE
    #print '\n', "Pseudo R-squared:", globals()["mod%s" % model_num].prsquared #TO GET THE PSEUDO-R-SQUARED

In [543]:

logit = smf.logit(formula=logit_formula, data=df).fit() 
logit.summary()

Optimization terminated successfully.
         Current function value: 0.044916
         Iterations 10

Out[543]:

Logit Regression Results
Dep. Variable:	donor_advisory_2016	No. Observations:	4808
Model:	Logit	Df Residuals:	4802
Method:	MLE	Df Model:	5
Date:	Wed, 31 Aug 2016	Pseudo R-squ.:	0.08557
Time:	21:46:12	Log-Likelihood:	-215.96
converged:	True	LL-Null:	-236.17
		LLR p-value:	1.229e-07

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	-3.0296	1.925	-1.574	0.115	-6.802 0.743
SOX_policies_2011	-0.4992	0.145	-3.448	0.001	-0.783 -0.215
total_revenue_2011_logged	0.3113	0.138	2.260	0.024	0.041 0.581
program_expense_percent_2011	-0.0285	0.009	-3.068	0.002	-0.047 -0.010
age	-0.0104	0.009	-1.130	0.258	-0.028 0.008
complexity_2011	-1.2189	0.358	-3.402	0.001	-1.921 -0.517

In [524]:

len(df[df['2011 data']==1])

Out[524]:

In [523]:

df[df['2011 data']==1][['SOX_policies_2011', 'total_revenue_2011_logged', 'program_expense_percent_2011', 
    'age', 'complexity_2011']].describe().T

Out[523]:

	count	mean	std	min	25%	50%	75%	max
SOX_policies_2011	4833	2.533416	0.869466	0.000000	2.000000	3.00000	3.00000	3.000000
total_revenue_2011_logged	4811	15.530981	1.282845	12.586466	14.569243	15.36338	16.27977	22.000798
program_expense_percent_2011	4833	80.416325	10.553429	2.200000	75.500000	81.60000	87.00000	99.700000
age	4860	40.051029	19.240216	0.000000	25.000000	35.00000	52.00000	108.000000
complexity_2011	4833	2.466791	0.514468	1.000000	2.000000	2.00000	3.00000	3.000000

In [508]:

len(df[df['state_2011'].notnull()])

Out[508]:

In [509]:

df['state_2011'].value_counts()

Out[509]:

NY    672
CA    649
DC    333
FL    283
TX    239
VA    187
MA    183
IL    172
PA    169
OH    127
CO    126
GA    114
WA    110
MD    108
MI    107
MN     94
NC     93
MO     92
NJ     86
TN     80
OR     79
AZ     75
CT     71
WI     64
IN     52
KY     37
SC     35
NE     33
LA     29
OK     29
ME     27
UT     26
AL     26
KS     24
NM     24
IA     23
MT     22
HI     20
NH     17
NV     17
RI     15
MS     15
VT     14
AR     14
DE     13
WY      9
AK      8
ID      7
SD      6
WV      5
ND      2
PR      1
Name: state_2011, dtype: int64

In [516]:

len(df[ (df['2011 data']==1) & (df['state_2011'].isnull())])

Out[516]:

In [ ]:

In [540]:

df_2011.dropna().describe().T

Out[540]:

	count	mean	std	min	25%	50%	75%	max
donor_advisory_2016	4808	0.008527	0.091959	0.000000	0.000000	0.000000	0.0000	1.000000
SOX_policies_2011	4808	2.534110	0.869042	0.000000	2.000000	3.000000	3.0000	3.000000
total_revenue_2011_logged	4808	15.532027	1.282551	12.586466	14.569824	15.364199	16.2804	22.000798
program_expense_percent_2011	4808	80.427787	10.558554	2.200000	75.500000	81.600000	87.0000	99.700000
age	4808	40.016015	19.198956	0.000000	25.000000	35.000000	52.0000	108.000000
complexity_2011	4808	2.466722	0.514543	1.000000	2.000000	2.000000	3.0000	3.000000

In [546]:

logit = smf.logit(formula=logit_formula, data=df_2011.dropna()).fit(cov_type='cluster', 
                                                    cov_kwds={'groups': df_2011.dropna()['state_2011']}) 
logit.summary()

Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.043361
         Iterations: 35

//anaconda/lib/python2.7/site-packages/statsmodels/base/model.py:466: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)

Out[546]:

Logit Regression Results
Dep. Variable:	donor_advisory_2016	No. Observations:	4808
Model:	Logit	Df Residuals:	4792
Method:	MLE	Df Model:	15
Date:	Wed, 31 Aug 2016	Pseudo R-squ.:	0.1172
Time:	21:48:06	Log-Likelihood:	-208.48
converged:	False	LL-Null:	-236.17
		LLR p-value:	1.546e-06

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	-3.4848	1.884	-1.849	0.064	-7.178 0.208
C(category)[T.Arts, Culture, Humanities]	-0.7432	0.965	-0.770	0.441	-2.635 1.149
C(category)[T.Community Development]	0.2960	0.856	0.346	0.730	-1.382 1.974
C(category)[T.Education]	-0.1108	0.901	-0.123	0.902	-1.877 1.655
C(category)[T.Environment]	-17.5510	0.614	-28.598	0.000	-18.754 -16.348
C(category)[T.Health]	-0.1414	0.744	-0.190	0.849	-1.600 1.317
C(category)[T.Human Services]	-0.2230	0.760	-0.293	0.769	-1.713 1.268
C(category)[T.Human and Civil Rights]	-0.0320	0.774	-0.041	0.967	-1.550 1.486
C(category)[T.International]	-0.0034	0.766	-0.004	0.996	-1.505 1.498
C(category)[T.Religion]	1.1258	0.708	1.589	0.112	-0.263 2.514
C(category)[T.Research and Public Policy]	0.5617	0.799	0.703	0.482	-1.004 2.127
SOX_policies_2011	-0.4728	0.124	-3.805	0.000	-0.716 -0.229
total_revenue_2011_logged	0.2978	0.126	2.356	0.018	0.050 0.546
program_expense_percent_2011	-0.0330	0.011	-2.998	0.003	-0.055 -0.011
age	-0.0104	0.007	-1.499	0.134	-0.024 0.003
complexity_2011	-0.8049	0.465	-1.730	0.084	-1.717 0.107

In [547]:

logit = smf.logit(formula=logit_formula, data=df).fit() 
logit.summary()

Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.043361
         Iterations: 35

//anaconda/lib/python2.7/site-packages/statsmodels/base/model.py:466: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)

Out[547]:

Logit Regression Results
Dep. Variable:	donor_advisory_2016	No. Observations:	4808
Model:	Logit	Df Residuals:	4792
Method:	MLE	Df Model:	15
Date:	Wed, 31 Aug 2016	Pseudo R-squ.:	0.1172
Time:	21:48:36	Log-Likelihood:	-208.48
converged:	False	LL-Null:	-236.17
		LLR p-value:	1.546e-06

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	-3.4848	2.010	-1.734	0.083	-7.424 0.454
C(category)[T.Arts, Culture, Humanities]	-0.7432	0.940	-0.790	0.429	-2.586 1.100
C(category)[T.Community Development]	0.2960	0.835	0.354	0.723	-1.341 1.933
C(category)[T.Education]	-0.1108	0.929	-0.119	0.905	-1.931 1.709
C(category)[T.Environment]	-17.5510	4367.028	-0.004	0.997	-8576.769 8541.667
C(category)[T.Health]	-0.1414	0.755	-0.187	0.851	-1.620 1.338
C(category)[T.Human Services]	-0.2230	0.707	-0.315	0.753	-1.609 1.163
C(category)[T.Human and Civil Rights]	-0.0320	0.934	-0.034	0.973	-1.863 1.799
C(category)[T.International]	-0.0034	0.790	-0.004	0.997	-1.552 1.545
C(category)[T.Religion]	1.1258	0.673	1.672	0.095	-0.194 2.445
C(category)[T.Research and Public Policy]	0.5617	0.930	0.604	0.546	-1.260 2.384
SOX_policies_2011	-0.4728	0.146	-3.240	0.001	-0.759 -0.187
total_revenue_2011_logged	0.2978	0.139	2.150	0.032	0.026 0.569
program_expense_percent_2011	-0.0330	0.010	-3.336	0.001	-0.052 -0.014
age	-0.0104	0.010	-1.081	0.280	-0.029 0.008
complexity_2011	-0.8049	0.394	-2.045	0.041	-1.576 -0.034

In [549]:

df_2011.dropna()['category'].value_counts()

Out[549]:

Human Services                1188
Arts, Culture, Humanities      670
Health                         574
International                  428
Community Development          381
Animals                        372
Environment                    317
Religion                       295
Education                      280
Human and Civil Rights         181
Research and Public Policy     122
Name: category, dtype: int64

In [513]:

logit = smf.logit(formula=logit_formula, data=df[df['2011 data']==1]).fit(cov_type='cluster', 
                                                    cov_kwds={'groups': df[df['2011 data']]['state_2011']}) 
logit.summary()

Optimization terminated successfully.
         Current function value: 0.044916
         Iterations 10

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-513-f1b037f5d426> in <module>()
      1 logit = smf.logit(formula=logit_formula, data=df[df['2011 data']==1]).fit(cov_type='cluster', 
----> 2                                                     cov_kwds={'groups': df[df['2011 data']==1]['state_2011']}) 
      3 logit.summary()

//anaconda/lib/python2.7/site-packages/statsmodels/discrete/discrete_model.pyc in fit(self, start_params, method, maxiter, full_output, disp, callback, **kwargs)
   1374         bnryfit = super(Logit, self).fit(start_params=start_params,
   1375                 method=method, maxiter=maxiter, full_output=full_output,
-> 1376                 disp=disp, callback=callback, **kwargs)
   1377 
   1378         discretefit = LogitResults(self, bnryfit)

//anaconda/lib/python2.7/site-packages/statsmodels/discrete/discrete_model.pyc in fit(self, start_params, method, maxiter, full_output, disp, callback, **kwargs)
    201         mlefit = super(DiscreteModel, self).fit(start_params=start_params,
    202                 method=method, maxiter=maxiter, full_output=full_output,
--> 203                 disp=disp, callback=callback, **kwargs)
    204 
    205         return mlefit # up to subclasses to wrap results

//anaconda/lib/python2.7/site-packages/statsmodels/base/model.pyc in fit(self, start_params, method, maxiter, full_output, disp, fargs, callback, retall, skip_hessian, **kwargs)
    455         #print('kwds inLikelihoodModel.fit', kwds)
    456         #TODO: add Hessian approximation and change the above if needed
--> 457         mlefit = LikelihoodModelResults(self, xopt, Hinv, scale=1., **kwds)
    458 
    459         #TODO: hardcode scale?

//anaconda/lib/python2.7/site-packages/statsmodels/base/model.pyc in __init__(self, model, params, normalized_cov_params, scale, **kwargs)
    940                 # TODO: we shouldn't need use_t in get_robustcov_results
    941                 get_robustcov_results(self, cov_type=cov_type, use_self=True,
--> 942                                            use_t=use_t, **cov_kwds)
    943 
    944 

//anaconda/lib/python2.7/site-packages/statsmodels/base/covtype.pyc in get_robustcov_results(self, cov_type, use_t, **kwds)
    193                 self.n_groups = n_groups = len(np.unique(groups))
    194             res.cov_params_default = sw.cov_cluster(self, groups,
--> 195                                              use_correction=use_correction)
    196 
    197         elif groups.ndim == 2:

//anaconda/lib/python2.7/site-packages/statsmodels/stats/sandwich_covariance.pyc in cov_cluster(results, group, use_correction)
    535         clusters = np.unique(group)
    536 
--> 537     scale = S_crosssection(xu, group)
    538 
    539     nobs, k_params = xu.shape

//anaconda/lib/python2.7/site-packages/statsmodels/stats/sandwich_covariance.pyc in S_crosssection(x, group)
    489 
    490     '''
--> 491     x_group_sums = group_sums(x, group).T  #TODO: why transposed
    492 
    493     return S_white_simple(x_group_sums)

//anaconda/lib/python2.7/site-packages/statsmodels/stats/sandwich_covariance.pyc in group_sums(x, group)
    435     #TODO: transpose return in group_sum, need test coverage first
    436     return np.array([np.bincount(group, weights=x[:, col])
--> 437                             for col in range(x.shape[1])])
    438 
    439 

ValueError: The weights and list don't have the same length.

In [526]:

logit = smf.logit(formula=logit_formula, data=df[df['total_revenue_2011_logged'].notnull()]).fit(cov_type='cluster', 
                                    cov_kwds={'groups': df[df['total_revenue_2011_logged'].notnull()]['state_2011']}) 
logit.summary()

Optimization terminated successfully.
         Current function value: 0.044916
         Iterations 10

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-526-17812cb557f5> in <module>()
      1 logit = smf.logit(formula=logit_formula, data=df[df['total_revenue_2011_logged'].notnull()]).fit(cov_type='cluster', 
----> 2                                                     cov_kwds={'groups': df[df['total_revenue_2011_logged'].notnull()]['state_2011']}) 
      3 logit.summary()

//anaconda/lib/python2.7/site-packages/statsmodels/discrete/discrete_model.pyc in fit(self, start_params, method, maxiter, full_output, disp, callback, **kwargs)
   1374         bnryfit = super(Logit, self).fit(start_params=start_params,
   1375                 method=method, maxiter=maxiter, full_output=full_output,
-> 1376                 disp=disp, callback=callback, **kwargs)
   1377 
   1378         discretefit = LogitResults(self, bnryfit)

//anaconda/lib/python2.7/site-packages/statsmodels/discrete/discrete_model.pyc in fit(self, start_params, method, maxiter, full_output, disp, callback, **kwargs)
    201         mlefit = super(DiscreteModel, self).fit(start_params=start_params,
    202                 method=method, maxiter=maxiter, full_output=full_output,
--> 203                 disp=disp, callback=callback, **kwargs)
    204 
    205         return mlefit # up to subclasses to wrap results

//anaconda/lib/python2.7/site-packages/statsmodels/base/model.pyc in fit(self, start_params, method, maxiter, full_output, disp, fargs, callback, retall, skip_hessian, **kwargs)
    455         #print('kwds inLikelihoodModel.fit', kwds)
    456         #TODO: add Hessian approximation and change the above if needed
--> 457         mlefit = LikelihoodModelResults(self, xopt, Hinv, scale=1., **kwds)
    458 
    459         #TODO: hardcode scale?

//anaconda/lib/python2.7/site-packages/statsmodels/base/model.pyc in __init__(self, model, params, normalized_cov_params, scale, **kwargs)
    940                 # TODO: we shouldn't need use_t in get_robustcov_results
    941                 get_robustcov_results(self, cov_type=cov_type, use_self=True,
--> 942                                            use_t=use_t, **cov_kwds)
    943 
    944 

//anaconda/lib/python2.7/site-packages/statsmodels/base/covtype.pyc in get_robustcov_results(self, cov_type, use_t, **kwds)
    193                 self.n_groups = n_groups = len(np.unique(groups))
    194             res.cov_params_default = sw.cov_cluster(self, groups,
--> 195                                              use_correction=use_correction)
    196 
    197         elif groups.ndim == 2:

//anaconda/lib/python2.7/site-packages/statsmodels/stats/sandwich_covariance.pyc in cov_cluster(results, group, use_correction)
    535         clusters = np.unique(group)
    536 
--> 537     scale = S_crosssection(xu, group)
    538 
    539     nobs, k_params = xu.shape

//anaconda/lib/python2.7/site-packages/statsmodels/stats/sandwich_covariance.pyc in S_crosssection(x, group)
    489 
    490     '''
--> 491     x_group_sums = group_sums(x, group).T  #TODO: why transposed
    492 
    493     return S_white_simple(x_group_sums)

//anaconda/lib/python2.7/site-packages/statsmodels/stats/sandwich_covariance.pyc in group_sums(x, group)
    435     #TODO: transpose return in group_sum, need test coverage first
    436     return np.array([np.bincount(group, weights=x[:, col])
--> 437                             for col in range(x.shape[1])])
    438 
    439 

ValueError: The weights and list don't have the same length.

In [500]:

print len(df[(df['2011 data']==1) & (df['donor_advisory_2016'].notnull())])
print len(df[(df['2011 data']==1) & (df['donor_advisory_2016']==0)])
print len(df[(df['2011 data']==1) & (df['donor_advisory_2016']==1)])

4863
4816
47

In [468]:

print cols_2011

['charity_name_2011', 'category_2011', 'city_2011', 'state_2011', 'cause_2011', 'tag_line_2011', 'url_2011', 'ein_2011', 'fye_2011', 'overall_rating_2011', 'overall_rating_star_2011', 'efficiency_rating_2011', 'AT_rating_2011', 'financial_rating_star_2011', 'AT_rating_star_2011', 'program_expense_percent_2011', 'admin_expense_percent_2011', 'fund_expense_percent_2011', 'fund_efficiency_2011', 'primary_revenue_growth_2011', 'program_expense_growth_2011', 'working_capital_ratio_2011', 'independent_board_2011', 'no_material_division_2011', 'audited_financials_2011', 'no_loans_related_2011', 'documents_minutes_2011', 'form_990_2011', 'conflict_of_interest_policy_2011', 'whistleblower_policy_2011', 'records_retention_policy_2011', 'CEO_listed_2011', 'process_CEO_compensation_2011', 'no_board_compensation_2011', 'donor_privacy_policy_2011', 'board_listed_2011', 'audited_financials_web_2011', 'form_990_web_2011', 'staff_listed_2011', 'primary_revenue_2011', 'other_revenue_2011', 'total_revenue_2011', 'govt_revenue_2011', 'program_expense_2011', 'admin_expense_2011', 'fund_expense_2011', 'total_functional_expense_2011', 'affiliate_payments_2011', 'budget_surplus_2011', 'net_assets_2011', 'leader_comp_2011', 'leader_comp_percent_2011', 'email_2011', 'website_2011']

In [ ]:

BMF_columns = ['NEW ROW', 'NAME_2015_BMF', 'STREET_2015_BMF', 'CITY_2015_BMF', 'STATE_2015_BMF', 'ZIP_2015_BMF', 
'RULING_2015_BMF', 'ACTIVITY_2015_BMF', 'TAX_PERIOD_2015_BMF', 'ASSET_AMT_2015_BMF', 'INCOME_AMT_2015_BMF', 
'REVENUE_AMT_2015_BMF', 'NTEE_CD_2015_BMF', '2015 BMF', 'ruledate_2004_BMF', 'name_MSTRALL', 'state_MSTRALL', 
'NTEE1_MSTRALL', 'nteecc_MSTRALL', 'zip_MSTRALL', 'fips_MSTRALL', 'taxper_MSTRALL', 'income_MSTRALL', 
'F990REV_MSTRALL', 'assets_MSTRALL', 'ruledate_MSTRALL', 'deductcd_MSTRALL', 'accper_MSTRALL', 
'rule_date_v1', 'taxpd']

In [ ]:

df['']df[(df['SOX_policies_2011'].notnull())]

In [475]:

print len(df[df['current_donor_advisory']==1])
DA_2016 = df[df['current_donor_advisory']==1]['org_id'].tolist()
print len(DA_2016), len(set(DA_2016))
DA_2016[:3]

321
321 321

Out[475]:

['16722', '14954', '16155']

In [486]:

df['donor_advisory_2016'] = np.nan
df['donor_advisory_2016'] = np.where(  df['org_id'].isin(DA_2016), 1,0
                                       )
df['donor_advisory_2016'].value_counts()

Out[486]:

0    84590
1      368
Name: donor_advisory_2016, dtype: int64

In [473]:

df[(df['SOX_policies_2011'].notnull())][['name', 'org_id', 'FYE', '2011 data', #'current_or_past_donor_advisory',
                                 'current_donor_advisory', 'past_donor_advisory', 'SOX_policies_2011']][:24]

Out[473]:

	name	org_id	FYE	2011 data	SOX_policies_2011
21	10,000 Degrees	6466	FY2009	1	3
39	100 Club of Arizona	12123	FY2009	1	2
52	1000 Friends of Florida	10092	FY2008	1	0
63	1000 Friends of Oregon	8770	FY2010	1	3
148	4 Paws for Ability	13055	FY2009	1	3
161	The 92nd Street Y	4792	FY2010	1	3
185	A Better Chance	6082	FY2010	1	3
208	A Contemporary Theatre	3634	FY2009	1	3
228	A Kid Again	9239	FY2009	1	3
244	A Noise Within	10176	FY2010	1	0
255	A Place Called Home	8040	FY2010	1	3
276	A.J. Muste Memorial Institute	6096	FY2009	1	3
292	AAA Foundation for Traffic Safety	8302	FY2009	1	3
312	Aaron Diamond AIDS Research Center	4991	FY2010	1	3
328	AARP Foundation	3205	FY2009	1	3
337	AAUW - American Association of University Women	3240	FY2010	1	3
362	Abilities United	7940	FY2010	1	3
373	The Ability Experience	7632	FY2010	1	3
385	Abode Services	9182	FY2010	1	3
399	The Abraham Fund Initiatives	9371	FY2009	1	3
426	Abused Deaf Women's Advocacy Services	12762	FY2009	1	3
441	Academy of Achievement	5705	FY2009	1	1
458	Academy of American Poets	9256	FY2009	1	3
477	The Academy of Natural Sciences of Drexel University	3209	FY2009	1	3

In [447]:

#df['advisory'] = np.where(~df['advisory text - current advisory'].isnull(), 1,0)
#df['advisory'].value_counts()

Out[447]:

0    84637
1      321
Name: advisory, dtype: int64

In [458]:

#df[(df['2011 data']==1) & (df['past_donor_advisory']==1)][:5]

In [483]:

  File "<unknown>", line 1
    2016 _donor_advisory
                       ^
SyntaxError: invalid syntax

In [ ]:

controls = ['total_revenue_2011', 'program_expense_percent_2011', 
            ]
len(df[(df['2011 data']==1) & (df['total_revenue_2011'].isnull())])#[controls]

In [411]:

df[df['2011 data']==1][:2]

Out[411]:

	org_id	EIN	org_url	name	category	category-full	Date Published	Form 990 FYE	Form 990 FYE, v2	FYE	Earliest Rating Publication Date	ratings_system	Overall Score	Overall Rating	advisory text - current advisory	advisory text - past advisory	current_or_past_donor_advisory	current_donor_advisory	past_donor_advisory	latest_entry	current_ratings_url	ein_2016	Publication_date_and_FY_2016	Publication Date_2016	FYE_2016	donor_alert_2016	overall_rating_2016	efficiency_rating_rating_2016	AT_rating_2016	overall_rating_star_2016	financial_rating_star_2016	AT_rating_star_2016	program_expense_percent_2016	admin_expense_percent_2016	fund_expense_percent_2016	fund_efficiency_2016	working_capital_ratio_2016	program_expense_growth_2016	liabilities_to_assets_2016	independent_board_2016	no_material_division_2016	audited_financials_2016	no_loans_related_2016	documents_minutes_2016	form_990_2016	conflict_of_interest_policy_2016	whistleblower_policy_2016	records_retention_policy_2016	CEO_listed_2016	process_CEO_compensation_2016	no_board_compensation_2016	donor_privacy_policy_2016	board_listed_2016	audited_financials_web_2016	form_990_web_2016	staff_listed_2016	contributions_gifts_grants_2016	federated_campaigns_2016	membership_dues_2016	fundraising_events_2016	related_organizations_2016	government_grants_2016	total_contributions_2016	program_service_revenue_2016	total_primary_revenue_2016	other_revenue_2016	total_revenue_2016	program_expenses_2016	administrative_expenses_2016	fundraising_expenses_2016	total_functional_expenses_2016	payments_to_affiliates_2016	excess_or_deficit_2016	net_assets_2016	comp_2016	cp_2016	mission_2016	2011 data	charity_name_2011	category_2011	city_2011	state_2011	cause_2011	tag_line_2011	url_2011	ein_2011	fye_2011	overall_rating_2011	overall_rating_2011_plus_30	overall_rating_2011_plus_30_v2	overall_rating_star_2011	overall_rating_star_2011_text	efficiency_rating_2011	AT_rating_2011	financial_rating_star_2011	AT_rating_star_2011	program_expense_percent_2011	admin_expense_percent_2011	fund_expense_percent_2011	fund_efficiency_2011	primary_revenue_growth_2011	program_expense_growth_2011	working_capital_ratio_2011	independent_board_2011	no_material_division_2011	audited_financials_2011	no_loans_related_2011	documents_minutes_2011	form_990_2011	conflict_of_interest_policy_2011	whistleblower_policy_2011	records_retention_policy_2011	CEO_listed_2011	process_CEO_compensation_2011	no_board_compensation_2011	donor_privacy_policy_2011	board_listed_2011	audited_financials_web_2011	form_990_web_2011	staff_listed_2011	primary_revenue_2011	other_revenue_2011	total_revenue_2011	govt_revenue_2011	program_expense_2011	admin_expense_2011	fund_expense_2011	total_functional_expense_2011	affiliate_payments_2011	budget_surplus_2011	net_assets_2011	leader_comp_2011	leader_comp_percent_2011	email_2011	website_2011	2016 Advisory - Date Posted	2016 Advisory - Charity Name	2016 Advisory - advisory_url	2016 Advisory - advisory	_merge_v1	to_be_merged	NEW ROW	NAME_2015_BMF	STREET_2015_BMF	CITY_2015_BMF	STATE_2015_BMF	ZIP_2015_BMF	RULING_2015_BMF	ACTIVITY_2015_BMF	TAX_PERIOD_2015_BMF	ASSET_AMT_2015_BMF	INCOME_AMT_2015_BMF	REVENUE_AMT_2015_BMF	NTEE_CD_2015_BMF	2015 BMF	ruledate_2004_BMF	name_MSTRALL	state_MSTRALL	NTEE1_MSTRALL	nteecc_MSTRALL	zip_MSTRALL	fips_MSTRALL	taxper_MSTRALL	income_MSTRALL	F990REV_MSTRALL	assets_MSTRALL	ruledate_MSTRALL	deductcd_MSTRALL	accper_MSTRALL	rule_date_v1	taxpd	NAME_SOI	yr_frmtn	pt1_num_vtng_gvrn_bdy_mems	pt1_num_ind_vtng_mems	num_vtng_gvrn_bdy_mems	num_ind_vtng_mems	tot_num_empls	tot_num_vlntrs	contri_grnts_cy	prog_srvc_rev_cy	invst_incm_cy	oth_rev_cy	grnts_and_smlr_amts_cy	tot_prof_fndrsng_exp_cy	tot_fndrsng_exp_cy	pt1_tot_asts_eoy	aud_fincl_stmts	mtrl_divrsn_or_misuse	cnflct_int_plcy	whistleblower_plcy	doc_retention_plcy	federated_campaigns	memshp_dues	rltd_orgs	govt_grnts	all_oth_contri	nncsh_contri	tot_contri	psr_tot	inv_incm_tot_rev	bonds_tot_rev	roylrev_tot_rev	net_rent_tot_rev	gain_or_loss_sec	gain_or_loss_oth	oth_rev_tot	tot_rev	mgmt_srvc_fee_tot	fee_for_srvc_leg_tot	fee_for_srvc_acct_tot	fee_for_srvc_lbby_tot	fee_for_srvc_prof_tot	fee_for_srvc_invst_tot	fee_for_srvc_oth_tot	fs_audited	audit_committee	vlntr_hrs	_merge	rule_date	ruledate_2004_BMF_v2	ruledate_MSTRALL_v2	yr_frmtn_v2	age	category_Animals	category_Arts, Culture, Humanities	category_Community Development	category_Education	category_Environment	category_Health	category_Human Services	category_Human and Civil Rights	category_International	category_Religion	category_Research and Public Policy	govt_revenue_2011_binary
9	10166	043314346	http://www.charitynavigator.org/index.cfm?bay=search.summary&orgid=10166	Angel Flight Northeast	Health	Health : Patient and Family Support	2011-01-05 00:00:00	2009-12	2009-12-01	FY2009	2005-12-01	CN 1.0	--	Donor Advisory	NaN	This donor advisory was published on Wednesday, January 5, 2011.In accordance with our.policy for removing Donor Advisories., Charity Navigator removed the Donor Advisory for Angel Flight Northeast on March 1, 2012 because the Donor Advisory had been in place for more than a year (since January 5, 2011) and because the issue that prompted the Donor Advisory has been resolved..Charity Navigator had published a Donor Advisory for this charity because we became aware of the following informati...	1	0	1	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1	Angel Flight Northeast	Health	North Andover	MA	Patient and Family Support	Providing free flights so children and adults can access medical care since 1996	http://www.charitynavigator.org/index.cfm?bay=search.summary&orgid=10166	04-3314346	NaN	NaN	--	--	NaN	Missing - Apparent Donor Advisory	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	left_only	1	NaN	ANGEL FLIGHT OF NEW ENGLAND INC	LAWRENCE MUNICIPAL AIRPORT	NORTH ANDOVER	MA	01845-0000	200812	994179000	201312	869310	896259	3877845	E99	1	199608	ANGEL FLIGHT NEW ENG	MA	E	E87	01867-1110	25017	200012	539450	520862	318758	199608	1	12	1996	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	left_only	2008	1996	1996	nan	8	0	0	0	0	0	1	0	0	0	0	0	NaN
21	6466	953667812	http://www.charitynavigator.org/index.cfm?bay=search.summary&orgid=6466	10,000 Degrees	Education	Education : Scholarship and Financial Support	2011-09-20 00:00:00	2009-06	2009-06-01	FY2009	2003-09-01	CN 2.0	85.33	3 stars	NaN	NaN	0	0	0	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1	10,000 Degrees	Education	San Rafael	CA	Other Education Programs and Services	Creating College Graduates Who Change the World	http://www.charitynavigator.org/index.cfm?bay=search.summary&orgid=6466	95-3667812	06/2009	55.33	85.33	85.33	3	3 stars	52.42	59	3	3	79.7	7.3	12.8	0.11	3.4	0	0.67	yes	yes	yes	yes	yes	yes	yes	yes	yes	yes	yes	yes	NO	yes	NO	NO	yes	3914222	216503	4130725	Note: This organization receives $0 in government support.	2813532	260007	454629	3528168	0	602557	3389166	154300	4.37	info@10000degrees.org	http://www.10000degrees.org	NaN	NaN	NaN	NaN	both	0	NaN	10000 DEGREES	1650 LOS GAMOS SUITE 110	SAN RAFAEL	CA	94903-1838	198105	40000000	201506	8611662	7760209	7627694	B11	1	198211	MARIN EDUC FND	CA	B	B20	94901-2920	06041	200106	3958011	3958011	1958251	198211	1	06	1982	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	left_only	1981	1982	1982	nan	35	0	0	0	1	0	0	0	0	0	0	0	0

In [ ]:

In [396]:

print df.columns.tolist()

['org_id', 'EIN', 'org_url', 'name', 'category', 'category-full', 'Date Published', 'Form 990 FYE', 'Form 990 FYE, v2', 'FYE', 'Earliest Rating Publication Date', 'ratings_system', 'Overall Score', 'Overall Rating', 'advisory text - current advisory', 'advisory text - past advisory', 'current_or_past_donor_advisory', 'current_donor_advisory', 'past_donor_advisory', 'latest_entry', 'current_ratings_url', 'ein_2016', 'Publication_date_and_FY_2016', 'Publication Date_2016', 'FYE_2016', 'donor_alert_2016', 'overall_rating_2016', 'efficiency_rating_rating_2016', 'AT_rating_2016', 'overall_rating_star_2016', 'financial_rating_star_2016', 'AT_rating_star_2016', 'program_expense_percent_2016', 'admin_expense_percent_2016', 'fund_expense_percent_2016', 'fund_efficiency_2016', 'working_capital_ratio_2016', 'program_expense_growth_2016', 'liabilities_to_assets_2016', 'independent_board_2016', 'no_material_division_2016', 'audited_financials_2016', 'no_loans_related_2016', 'documents_minutes_2016', 'form_990_2016', 'conflict_of_interest_policy_2016', 'whistleblower_policy_2016', 'records_retention_policy_2016', 'CEO_listed_2016', 'process_CEO_compensation_2016', 'no_board_compensation_2016', 'donor_privacy_policy_2016', 'board_listed_2016', 'audited_financials_web_2016', 'form_990_web_2016', 'staff_listed_2016', 'contributions_gifts_grants_2016', 'federated_campaigns_2016', 'membership_dues_2016', 'fundraising_events_2016', 'related_organizations_2016', 'government_grants_2016', 'total_contributions_2016', 'program_service_revenue_2016', 'total_primary_revenue_2016', 'other_revenue_2016', 'total_revenue_2016', 'program_expenses_2016', 'administrative_expenses_2016', 'fundraising_expenses_2016', 'total_functional_expenses_2016', 'payments_to_affiliates_2016', 'excess_or_deficit_2016', 'net_assets_2016', 'comp_2016', 'cp_2016', 'mission_2016', '2011 data', 'charity_name_2011', 'category_2011', 'city_2011', 'state_2011', 'cause_2011', 'tag_line_2011', 'url_2011', 'ein_2011', 'fye_2011', 'overall_rating_2011', 'overall_rating_2011_plus_30', 'overall_rating_2011_plus_30_v2', 'overall_rating_star_2011', 'overall_rating_star_2011_text', 'efficiency_rating_2011', 'AT_rating_2011', 'financial_rating_star_2011', 'AT_rating_star_2011', 'program_expense_percent_2011', 'admin_expense_percent_2011', 'fund_expense_percent_2011', 'fund_efficiency_2011', 'primary_revenue_growth_2011', 'program_expense_growth_2011', 'working_capital_ratio_2011', 'independent_board_2011', 'no_material_division_2011', 'audited_financials_2011', 'no_loans_related_2011', 'documents_minutes_2011', 'form_990_2011', 'conflict_of_interest_policy_2011', 'whistleblower_policy_2011', 'records_retention_policy_2011', 'CEO_listed_2011', 'process_CEO_compensation_2011', 'no_board_compensation_2011', 'donor_privacy_policy_2011', 'board_listed_2011', 'audited_financials_web_2011', 'form_990_web_2011', 'staff_listed_2011', 'primary_revenue_2011', 'other_revenue_2011', 'total_revenue_2011', 'govt_revenue_2011', 'program_expense_2011', 'admin_expense_2011', 'fund_expense_2011', 'total_functional_expense_2011', 'affiliate_payments_2011', 'budget_surplus_2011', 'net_assets_2011', 'leader_comp_2011', 'leader_comp_percent_2011', 'email_2011', 'website_2011', '2016 Advisory - Date Posted', '2016 Advisory - Charity Name', '2016 Advisory - advisory_url', '2016 Advisory - advisory', '_merge_v1', 'to_be_merged', u'NEW ROW', 'NAME_2015_BMF', 'STREET_2015_BMF', 'CITY_2015_BMF', 'STATE_2015_BMF', 'ZIP_2015_BMF', 'RULING_2015_BMF', 'ACTIVITY_2015_BMF', 'TAX_PERIOD_2015_BMF', 'ASSET_AMT_2015_BMF', 'INCOME_AMT_2015_BMF', 'REVENUE_AMT_2015_BMF', 'NTEE_CD_2015_BMF', '2015 BMF', 'ruledate_2004_BMF', 'name_MSTRALL', 'state_MSTRALL', 'NTEE1_MSTRALL', 'nteecc_MSTRALL', 'zip_MSTRALL', 'fips_MSTRALL', 'taxper_MSTRALL', 'income_MSTRALL', 'F990REV_MSTRALL', 'assets_MSTRALL', 'ruledate_MSTRALL', 'deductcd_MSTRALL', 'accper_MSTRALL', 'rule_date_v1', 'taxpd', 'NAME_SOI', 'yr_frmtn', 'pt1_num_vtng_gvrn_bdy_mems', 'pt1_num_ind_vtng_mems', 'num_vtng_gvrn_bdy_mems', 'num_ind_vtng_mems', 'tot_num_empls', 'tot_num_vlntrs', 'contri_grnts_cy', 'prog_srvc_rev_cy', 'invst_incm_cy', 'oth_rev_cy', 'grnts_and_smlr_amts_cy', 'tot_prof_fndrsng_exp_cy', 'tot_fndrsng_exp_cy', 'pt1_tot_asts_eoy', 'aud_fincl_stmts', 'mtrl_divrsn_or_misuse', 'cnflct_int_plcy', 'whistleblower_plcy', 'doc_retention_plcy', 'federated_campaigns', 'memshp_dues', 'rltd_orgs', 'govt_grnts', 'all_oth_contri', 'nncsh_contri', 'tot_contri', 'psr_tot', 'inv_incm_tot_rev', 'bonds_tot_rev', 'roylrev_tot_rev', 'net_rent_tot_rev', 'gain_or_loss_sec', 'gain_or_loss_oth', 'oth_rev_tot', 'tot_rev', 'mgmt_srvc_fee_tot', 'fee_for_srvc_leg_tot', 'fee_for_srvc_acct_tot', 'fee_for_srvc_lbby_tot', 'fee_for_srvc_prof_tot', 'fee_for_srvc_invst_tot', 'fee_for_srvc_oth_tot', 'fs_audited', 'audit_committee', 'vlntr_hrs', '_merge', 'rule_date', 'ruledate_2004_BMF_v2', 'ruledate_MSTRALL_v2', 'yr_frmtn_v2', 'age', 'category_Animals', 'category_Arts, Culture, Humanities', 'category_Community Development', 'category_Education', 'category_Environment', 'category_Health', 'category_Human Services', 'category_Human and Civil Rights', 'category_International', 'category_Religion', 'category_Research and Public Policy']

SOI data¶

In [ ]:

SOI_columns = ['taxpd', 'yr_frmtn', 'pt1_num_vtng_gvrn_bdy_mems', 'pt1_num_ind_vtng_mems', 'num_vtng_gvrn_bdy_mems', 
               'num_ind_vtng_mems', 'tot_num_empls', 'tot_num_vlntrs', 'contri_grnts_cy', 'prog_srvc_rev_cy', 
               'invst_incm_cy', 'oth_rev_cy', 'grnts_and_smlr_amts_cy', 'tot_prof_fndrsng_exp_cy', 
               'tot_fndrsng_exp_cy', 'pt1_tot_asts_eoy', 'aud_fincl_stmts', 'mtrl_divrsn_or_misuse', 
               'cnflct_int_plcy', 'whistleblower_plcy', 'doc_retention_plcy', 'federated_campaigns', 'memshp_dues', 
               'rltd_orgs', 'govt_grnts', 'all_oth_contri', 'nncsh_contri', 'tot_contri', 'psr_tot', 
               'inv_incm_tot_rev', 'bonds_tot_rev', 'roylrev_tot_rev', 'net_rent_tot_rev', 'gain_or_loss_sec', 
               'gain_or_loss_oth', 'oth_rev_tot', 'tot_rev', 'mgmt_srvc_fee_tot', 'fee_for_srvc_leg_tot', 
               'fee_for_srvc_acct_tot', 'fee_for_srvc_lbby_tot', 'fee_for_srvc_prof_tot', 'fee_for_srvc_invst_tot', 
               'fee_for_srvc_oth_tot', 'fs_audited', 'audit_committee', 'vlntr_hrs', 'NAME_SOI']
len(SOI_columns)