#!/usr/bin/env python
# coding: utf-8

# # Notebook Tasks

# In this notebook you'll see code used to conduct the following steps:
# - Read in the merged current/historical/2011 dataset with all BMF and SOI data:
#   - **_merged data with EIN clean-up, SOI data, and 2015, 2008, and 2004 BMF data.pkl_**
#   - There are 84,958 rows in this dataset, including some years for which only SOI data are included.
#
# **_Notes:_**
# - Of the 8,304 organizations in the dataset, 4,857 were rated in 2011.
#   - 582 of the 5,439 2011 organizations have been dropped by CN, leaving 4,857.
# - The dataset is organized in an **_org/FY/ratings system_** format --
#   - i.e., one row per organization for each fiscal year, with multiple rows per org/FY when there has been a ratings system change from *CN 1.0* to *CN 2.0* to *CN 2.1*.
# - Some organizations will have more than one entry per fiscal year -- even for the same ratings system (CN 2.1, CN 2.0, or CN 1.0); multiple ratings per year are triggered by amended 990s, etc.; e.g., https://www.charitynavigator.org/index.cfm?bay=search.history&orgid=10166
# - The baseline for the dataset is data gathered from each organization's *Historical Ratings* page -- you'll see the same number of rows, and the same data in each row, as on those pages; e.g., https://www.charitynavigator.org/index.cfm?bay=search.history&orgid=10166
# - What I've done is to merge each organization's current *Rating Profile* into this *org/FY/ratings system* dataset, as seen here: https://www.charitynavigator.org/index.cfm?bay=search.summary&orgid=10166
# - I have merged/appended in two rows of data for each organization:
#   - **_current ratings_** -- data scraped from the *Rating Profile* in August 2016.
#   - **_2011 data_** -- data scraped from the *Rating Profile* in October 2011.
# - So, for *SOX policy data*, we have two years of data -- 2011 and 2016. For all years, we have what is on the *Historical Ratings* page -- namely, the overall numerical score, the overall star rating, and whether the organization was under a donor advisory that year.
# **_Possible Samples for Statistical Tests_**:
# Given the above, there are a number of possible tests:

# | Test | IV: SOX Policies | DV: Donor Advisory | N | Notes | To Do |
# |---|---|---|---|---|---|
# | 1 | 2011 | 2016 | 4,857 | 47 donor advisories on these organizations; associational test (we don't know when the SOX policies were added); also, the DV is 'current donor advisory'. | Ready to run. |
# | 2 | 2011 | 2012-2016 | 4,857 | 47 2016 advisories plus probably another dozen or so advisories over the 2012-2015 period; associational test as above, but adds in donor advisories that were put in place and then dropped between 2012 and 2015. | Some minor work creating this new DV, but not very burdensome. |
# | 3 | 2011 | 2011 | 5,439 | 39 donor advisories; pure cross-sectional test. | Download the '2011' 990 data (SOX policies + controls) for the 39 orgs with a 2011 donor advisory; a few hours' work to download and enter the data. |
# | 4 | 2016 | 2016 | 8,304 | 328 donor advisories; pure cross-sectional test. | Ready to run. |
# | 5 | Change, 2011-2016 | 2016 | 4,857 | Divide the 4,857 orgs into three groups: i) those with no SOX policies in 2011 and still no SOX policies in 2016; ii) those with SOX policies in 2011 and 2016; and iii) those with no SOX policies in 2011 but SOX policies in 2016. Create dummy variables for each group and see whether those in group iii) do better than i) or ii). This is a relatively low-cost 'pre-post' test (see the sketch below this table). | Moderate amount of work to create the new dummies, but not too burdensome. |
# | 6 | Change, 2011-2016 | 2012-2016 | TBD | Similar to the above option, but we would need to take a sample of organizations in group iii) and go through their 990s to find out exactly when they added the SOX policies. | Resource-intensive 990 searches. |
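# **_Aside:_** a minimal sketch of how the three groups for test 5 could be constructed. The column names *SOX_2011* and *SOX_2016* are hypothetical stand-ins for whatever org-level binary SOX indicators get built later in this notebook:

# In[ ]:

import pandas as pd

# Hypothetical org-level indicators: 1 if the org reported any SOX policy that year.
orgs = pd.DataFrame({'org_id': ['a', 'b', 'c'],
                     'SOX_2011': [0, 1, 0],
                     'SOX_2016': [0, 1, 1]})
# Group i: no SOX policies in 2011 and still none in 2016.
orgs['group_i'] = ((orgs['SOX_2011']==0) & (orgs['SOX_2016']==0)).astype(int)
# Group ii: SOX policies in both 2011 and 2016.
orgs['group_ii'] = ((orgs['SOX_2011']==1) & (orgs['SOX_2016']==1)).astype(int)
# Group iii: no SOX policies in 2011 but SOX policies in 2016 (the adopters).
orgs['group_iii'] = ((orgs['SOX_2011']==0) & (orgs['SOX_2016']==1)).astype(int)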

# **_Notes from Meeting with Dan:_**
# - Do not do the 3rd or 6th test -- the benefit is not worth the cost
# - The 1st and 2nd tests can be robustness analyses
# - Focus on the 4th and 5th tests (a rough sketch of the implied logit follows these notes)
# - Control variables:
#   - Size: total revenues best (probably logged)
#     - will need 2011 and 2016 versions for the 4th and 5th tests
#   - efficiency ratio
#   - age (from BMF)
#   - complexity (could be a good control from Erica's paper)
#   - fixed effects:
#     - state
#     - category
#       - I need to scrape the category dummies for the new orgs in the 2016 database
#       - CN does not include that information in the ratings area, but it is included on the webpage in the 'breadcrumbs' area
# - The focus of our paper is on SOX policies; if an org has SOX policies it probably has other governance policies, and these would be highly correlated. So, we will leave the other governance variables out of one version of the 4th and 5th tests, and then try to include them in another set. The best candidates are:
#   - *independent board* --> related to Erica's *independence of key actors* concept
#   - *board review of 990* and *audited financials* --> both related to Erica's *board monitoring* concept
#   - we could include other governance variables as needed
# - We are focusing on non-health, non-university organizations; by focusing on a more donor-focused sample (CN), we differentiate this work from previous studies.
# - To differentiate from Erica's *JBE* paper, we should use the SOI data to see how many of the donor advisories are because of 'non-material diversions'.
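# **_Aside:_** a rough sketch of the test-4 logit implied by these notes. The DV and controls (*donor_advisory_2016*, *SOX_policies_binary*, *total_revenue_logged*, *program_efficiency*, *age*, *complexity*) are only constructed later in this notebook, and `df_2016` is a hypothetical one-row-per-organization slice, so treat this as a placeholder rather than a final specification:

# In[ ]:

import statsmodels.formula.api as smf

# df_2016 is assumed to hold one 2016 observation per organization.
logit_model = smf.logit('donor_advisory_2016 ~ SOX_policies_binary + total_revenue_logged'
                        ' + program_efficiency + age + complexity + C(category) + C(state)',
                        data=df_2016).fit()
print logit_model.summary()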

# **_To Do (beyond notes listed in the table above):_**
# - For all of the above tests, we need to decide on controls, then find/merge/create any not currently in the dataset
# - Run a selection model?
# - Code the *type* of advisory? Maybe save for a future study
# - There are 53 orgs on the CN 'Watchlist' -- we probably don't need to look at these, but it's a possible future move.
# **_Notes on 2011 data:_**
# - Only 47 of the 329 current donor advisories are on orgs that were rated in 2011
# - Number of 2011 orgs (n=5,439) missing from the 2016 ratings: 582
# - Number of 2016 orgs (n=8,304) not in the 2011 ratings: 3,447
# - When I scraped the current ratings in October 2011, there were 39 blank rows. Specifically, I checked the following spreadsheet: *Charity Navigator - current ratings, October 18, 2011 (WITH UPDATES FOR DONOR ADVISORY ORGS).xlsx* -- 39 rows were blank for all ratings information, so I checked them against the historical ratings on the CN website. (So far) all such rows were either 1) dropped from CN, 2) had a donor advisory, or 3) still have a donor advisory. I have 5,439 orgs in the 2011 database, and 39 seem to have had donor advisories on them at that time. So, the 2011 sample is the 5,400 orgs that did not have an advisory on them at the time. This conforms with the *n* of 5,400 in the above logit.
# ### Import Packages

# First, we will import several necessary Python packages. We will be using the Python Data Analysis Library, or PANDAS, extensively for our data manipulations. It is invaluable for analyzing datasets.

# Import the basic elements of PANDAS and numpy

# In[2]:

import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series
# We can check which version of various packages we're using. You can see I'm running PANDAS 0.17 here.

# In[3]:

print pd.__version__
# PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include these lines at the top of all my notebooks.

# In[4]:

#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', 500)

# ## Read in Data
# Let's read in the merged historical/current/2011 dataset we created in the last notebook. First we'll change the working directory.

# In[5]:

cd '/Users/gregorysaxton/Google Drive/SOX'

# #### Historical Ratings

# In[6]:

df = pd.read_pickle('merged data with EIN clean-up, SOI data, and 2015, 2008, and 2004 BMF data.pkl')
print "Number of columns:", len(df.columns)
print "Number of observations:", len(df)
df.head(1)

# ### Show columns
# The variables are organized as follows. First come the organization identifiers -- *org_id* and *EIN*. These are followed by variables indicating the FY and the date the ratings were posted. Then come variables indicating the existence of a donor advisory, then all of the '2016' *Rating Profile* variables (variable names ending in '_2016'), and then all of the '2011' variables. After that come the BMF data and the SOI data.

# In[12]:

print df.columns.tolist()
# **_Note_**: Only 9,951 of the original rows have SOI data; another 1,013 rows are *SOI data only*, but these could be useful for additional tests, as well as for filling in the blanks with, for instance, SOX data for orgs with current donor advisories.

# In[7]:

df['_merge'].value_counts()

# Check how many rows have 2011 data.
# Yes, there are 4,857 unique *org_ids* with 2011 data.

# In[11]:

#print len(df[df['2011 data']==1])
#print len(set(df[df['2011 data']==1]['org_id'].tolist()))

# ### FYE
# - We have *BMF* data for 2004, 2008, and 2015
# - We have *SOI* data for 2008 through 2013
# - We have *CN* data for 2002 through 2016
#
# These observations span a broad range of fiscal years, going back as early as FY2000.

# In[16]:

df['FYE'].value_counts()

# ## Time-Invariant Controls
# Age, State, Category

# ### Age
# The ruling date values have already been incorporated into *rule_date* in the prior notebook. After that we were missing the *rule_date* information for 73 observations (that number is higher now given the 1,013 rows with only SOI data). All of those 73 are organizations with a current donor advisory. Let's double-check that there are no more BMF ruling date values to incorporate, and then merge in SOI *yr_frmtn* values.

# In[45]:

print df['RULING_2015_BMF'].value_counts().sum()
df['RULING_2015_BMF'][:3]

# In[23]:

print df['ruledate_2004_BMF'].value_counts().sum()
df['ruledate_2004_BMF'][:3]

# In[29]:

print df['ruledate_MSTRALL'].value_counts().sum()
df['ruledate_MSTRALL'][:3]

# In[38]:

print df['yr_frmtn'].value_counts().sum()
df[df['yr_frmtn'].notnull()]['yr_frmtn'][:3]

# In[47]:

print df['rule_date'].value_counts().sum()
df['rule_date'][:3]
# Check for missing ruling date information in the SOI and BMF data.

# In[46]:

print len(df[df['rule_date'].isnull()])
print len(df[(df['rule_date'].isnull()) & df['yr_frmtn'].notnull()])
print len(df[(df['rule_date'].isnull()) & df['ruledate_MSTRALL'].notnull()])
print len(df[(df['rule_date'].isnull()) & df['ruledate_2004_BMF'].notnull()])
print len(df[(df['rule_date'].isnull()) & df['RULING_2015_BMF'].notnull()])

# In[8]:

print df['yr_frmtn'].dtype
print df['rule_date'].dtype

# In[9]:

print len(df[df['rule_date'].isnull()])
df['rule_date'] = np.where( ( df['rule_date'].isnull() & df['yr_frmtn'].notnull() ),
                           df['yr_frmtn'].astype('str'),
                           df['rule_date'] )
print len(df[df['rule_date'].isnull()])

# In[10]:

df[df['rule_date'].notnull()]['rule_date'][:2]
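# **_Aside:_** the fill pattern above -- keep existing values, plug holes from another column -- can also be written with PANDAS' `combine_first`, which takes values from the caller and fills gaps from its argument. A toy sketch:

# In[ ]:

a = pd.Series([np.nan, '1995', np.nan])
b = pd.Series(['1980', '1990', np.nan])
print a.combine_first(b)  # yields '1980', '1995', NaN -- a's values win, b fills the gaps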
# 67 rows have a *rule_date* of 0.0. We need to delete those values.

# In[13]:

print len(df[df['rule_date']=='0.0'])
#df[df['rule_date']=='0.0'][['age', 'rule_date', 'name']][:2]
df[df['rule_date']=='0.0'][['rule_date', 'name']][:2]

# In[14]:

print df['rule_date'].value_counts().sum()
df['rule_date'] = np.where(df['rule_date']=='0.0', np.nan, df['rule_date'])
print len(df[df['rule_date']=='0.0'])
print df['rule_date'].value_counts().sum()
# Same with 5 rows where the value is the empty string ''.

# In[15]:

print df['rule_date'].value_counts().sum()
df['rule_date'] = np.where(df['rule_date']=='', np.nan, df['rule_date'])
print len(df[df['rule_date']==''])  # check the empty strings are gone
print df['rule_date'].value_counts().sum()
# Some values have decimals, so let's restrict the variable to its first four characters.

# In[16]:

df['rule_date'] = df['rule_date'].str[:4]
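# **_Aside:_** slicing to four characters works here; a slightly stricter alternative is to extract an explicit four-digit year and let everything else (e.g. '0.0', '') fall out as NaN in one step. A sketch on toy values, not applied to *df*:

# In[ ]:

s = pd.Series(['1996.0', '0.0', '', '2006'])
print s.str.extract(r'((?:19|20)\d{2})')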
# Given the above deletions, let's see if we can grab a few more ruling dates.

# In[17]:

print len(df[df['rule_date'].isnull()])
df['rule_date'] = np.where( ( df['rule_date'].isnull() & df['yr_frmtn'].notnull() ),
                           df['yr_frmtn'].astype('str'),
                           df['rule_date'] )
print len(df[df['rule_date'].isnull()])

# In[18]:

df.dtypes[160:180]

# In[99]:

#df[df['rule_date']=='1996.0'][['rule_date']]

# In[19]:

print df['rule_date'].value_counts().sum()
print len(df[df['rule_date']=='0000'])
df['rule_date'] = np.where(df['rule_date']=='0000', np.nan, df['rule_date'])
print len(df[df['rule_date']=='0000'])
print df['rule_date'].value_counts().sum()

# In[133]:

for index, row in df.iterrows():
    if pd.notnull(row['rule_date']):
        df.ix[index, 'age'] = 2016 - int(row['rule_date'])
    else:
        pass

# In[134]:

df['age'].describe()

# In[135]:

print df['rule_date'].value_counts().sum()
print df['age'].value_counts().sum()
#df['age'].value_counts()

# In[137]:

import matplotlib.pyplot as plt
import seaborn as sns
from pylab import *
get_ipython().run_line_magic('matplotlib', 'inline')
#df['age'].plot(kind='bar')
df[df['age'].notnull()]['age'].plot.hist(by=None, bins=100)

# In[124]:

df['age'].describe()

# In[141]:

df[df['age']>160][['EIN', 'name', 'yr_frmtn', 'FYE', 'age', 'rule_date']]

# In[144]:

df.set_value(84113, 'rule_date', 1946)
df.set_value(84433, 'rule_date', 1946)
df.set_value(84113, 'age', 2016-1946)
df.set_value(84433, 'age', 2016-1946)
df[df['EIN']=='520715244'][['EIN', 'name', 'yr_frmtn', 'FYE', 'age', 'rule_date']]

# In[146]:

df.set_value(84377, 'rule_date', 1944)
df.set_value(84377, 'age', 2016-1944)
df[df['EIN']=='350868211'][['EIN', 'name', 'yr_frmtn', 'FYE', 'age', 'rule_date']]

# In[147]:

df[df['age'].notnull()]['age'].plot.hist(by=None, bins=100)

# In[155]:

df[df['age']>100][['EIN', 'name', 'yr_frmtn', 'FYE', 'age', 'rule_date']][:10]
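# **_Aside:_** the `iterrows` loop above works, but the same computation can be vectorized. A sketch that mirrors (rather than replaces) the loop: `pd.to_numeric` turns non-numeric *rule_date* values into NaN, and the subtraction propagates NaN into *age* automatically.

# In[ ]:

df['age'] = 2016 - pd.to_numeric(df['rule_date'], errors='coerce')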
The problem is with the 2004 BMF dates. # In[153]: df[df['EIN']=='731116884'][['EIN', 'name', 'yr_frmtn', 'FYE', 'age', 'rule_date', 'RULING_2015_BMF', 'ruledate_2004_BMF', 'ruledate_MSTRALL']] # In[152]: df[df['EIN']=='742684333'][['EIN', 'name', 'yr_frmtn', 'FYE', 'age', 'rule_date', 'RULING_2015_BMF', 'ruledate_2004_BMF', 'ruledate_MSTRALL']] # In[156]: df[df['EIN']=='480891418'][['EIN', 'name', 'yr_frmtn', 'FYE', 'age', 'rule_date', 'RULING_2015_BMF', 'ruledate_2004_BMF', 'ruledate_MSTRALL']] # # Start Over - BMF 2004 and MSTRALL values for ruling date were wrong in above examples # In[182]: #df['rule_date_v2'] = df['rule_date'] # In[20]: df = pd.read_pickle('merged data with EIN clean-up, SOI data, and 2015, 2008, and 2004 BMF data.pkl') print "Number of columns:", len(df.columns) print "Number of observations:", len(df) df.head(1) # In[21]: updated_ruledate = pd.read_excel('91 orgs missing BMF data_DGN.xls') print len(updated_ruledate) updated_ruledate[:3] # In[22]: print len(updated_ruledate[updated_ruledate['Ruling Year'].notnull()]) updated_ruledate = updated_ruledate[updated_ruledate['Ruling Year'].notnull()] updated_ruledate = updated_ruledate[['org_id', 'Ruling Year']] updated_ruledate.columns = ['org_id', 'rule_date'] updated_ruledate['org_id'] = updated_ruledate['org_id'].astype('str') print len(updated_ruledate) updated_ruledate # In[23]: print len(df.columns) print len(df) print len(pd.merge(df, updated_ruledate, left_on='org_id', right_on='org_id', how='left')) df = pd.merge(df, updated_ruledate, left_on='org_id', right_on='org_id', how='left') print len(df.columns) print len(df) # In[24]: df.rename(columns={'rule_date_x':'rule_date_v1'}, inplace=True) df.rename(columns={'rule_date_y':'rule_date'}, inplace=True) # In[25]: print len(df[df['rule_date_v1'].isnull()]) print len(df[df['rule_date'].isnull()]) # In[26]: df['rule_date'].value_counts().sum() # In[27]: df[df['name']=='Kids Charity of Tampa Bay'][['org_id', 'name', 'EIN']] # In[28]: print df['rule_date'].value_counts().sum() df['rule_date'] = np.where(df['EIN']=='550900271', '2006', df['rule_date']) df['rule_date'].value_counts() # In[29]: print len(df[df['rule_date']=='nan']), len(df[df['rule_date'].isnull()]) df['rule_date'] = np.where(df['rule_date']=='nan', np.nan, df['rule_date'] ) print df['rule_date'].value_counts().sum() print len(df[df['rule_date'].isnull()]) # In[30]: print df['rule_date'].dtype df['rule_date'].value_counts() # In[31]: df['rule_date'] = df['rule_date'].str[:4] df['rule_date'].value_counts() # In[32]: print len(df[df['rule_date'].isnull()]) print len(df[df['rule_date'].notnull()]) df['rule_date'] = np.where( ( df['rule_date'].isnull() & df['RULING_2015_BMF'].notnull() ), df['RULING_2015_BMF'].astype('str').str[:4], df['rule_date'] ) print len(df[df['rule_date'].isnull()]) print len(df[df['rule_date'].notnull()]) # In[33]: df['rule_date'].value_counts().sum() # In[34]: df[['EIN', 'name', 'yr_frmtn', 'FYE', 'rule_date', 'RULING_2015_BMF', 'ruledate_2004_BMF', 'ruledate_MSTRALL']].dtypes # In[35]: df['ruledate_2004_BMF_v2'] = df['ruledate_2004_BMF'].astype('str').str[:4] df['ruledate_MSTRALL_v2'] = df['ruledate_MSTRALL'].str[:4] df['yr_frmtn_v2'] = df['yr_frmtn'].astype('str').str[:4] # In[36]: df['ruledate_2004_BMF_v2'] = np.where(df['ruledate_2004_BMF_v2']=='nan', np.nan, df['ruledate_2004_BMF_v2'] ) # In[37]: df[df['rule_date'].isnull() & df['ruledate_2004_BMF'].notnull()][['EIN', 'name', 'rule_date', 'yr_frmtn', 'yr_frmtn_v2', 'RULING_2015_BMF', 'ruledate_2004_BMF_v2', 
'ruledate_MSTRALL_v2']][4:22] # In[38]: df['ruledate_2004_BMF_v2'] = df['ruledate_2004_BMF_v2'].astype('float')#.dtype # In[39]: df[(df['rule_date'].isnull()) & (df['ruledate_2004_BMF'].notnull()) & (df['ruledate_2004_BMF_v2'].notnull()) & (df['ruledate_2004_BMF_v2']<1970)][['EIN', 'name', 'rule_date', 'yr_frmtn', 'yr_frmtn_v2', 'RULING_2015_BMF', 'ruledate_2004_BMF_v2', 'ruledate_MSTRALL_v2']] # In[40]: print df['ruledate_2004_BMF_v2'].value_counts().sum() df['ruledate_2004_BMF_v2'] = np.where(df['EIN']=='942719901', 1980, df['ruledate_2004_BMF_v2']) print df['ruledate_2004_BMF_v2'].value_counts().sum() # In[41]: df[(df['rule_date'].isnull()) & (df['ruledate_2004_BMF'].notnull()) & (df['ruledate_2004_BMF_v2'].notnull()) & (df['ruledate_2004_BMF_v2']<1970)][['EIN', 'name', 'rule_date', 'yr_frmtn', 'yr_frmtn_v2', 'RULING_2015_BMF', 'ruledate_2004_BMF_v2', 'ruledate_MSTRALL_v2']] # ##### Quick save # In[286]: df.to_pickle('quick save - merged with age fixes.pkl') # #### Now merge in 2004 BMF rule date values # In[42]: print len(df[df['rule_date'].isnull()]) df['rule_date'] = np.where( ( df['rule_date'].isnull() & df['ruledate_2004_BMF_v2'].notnull() ), df['ruledate_2004_BMF_v2'].astype('str').str[:4], df['rule_date'] ) print len(df[df['rule_date'].isnull()]) # In[43]: print len(df[(df['rule_date'].isnull())]) print len(df[df['rule_date'].notnull()]) print len(df[(df['rule_date'].isnull()) & df['yr_frmtn'].notnull()]) print len(df[(df['rule_date'].isnull()) & df['ruledate_MSTRALL'].notnull()]) # ##### Merge in MSTRALL values # In[44]: df[df['rule_date'].isnull() & df['ruledate_MSTRALL_v2'].notnull()][['EIN', 'name', 'rule_date', 'yr_frmtn', 'yr_frmtn_v2', 'RULING_2015_BMF', 'ruledate_2004_BMF_v2', 'ruledate_MSTRALL_v2']] # In[45]: print len(df[df['rule_date'].isnull()]) df['rule_date'] = np.where( ( df['rule_date'].isnull() & df['ruledate_MSTRALL_v2'].notnull() ), df['ruledate_MSTRALL_v2'].str[:4], df['rule_date'] ) print len(df[df['rule_date'].isnull()]) # ##### Quick save # In[317]: df.to_pickle('quick save - merged with age fixes.pkl') # ##### No need to try with SOI variable # NOTE: All 985 cases where the *yr_frmtn* is available but *rule_date* is missing are 'right_only' merges ('SOI only' data). 
# In[46]: print len(df[(df['rule_date'].isnull()) & (df['yr_frmtn'].notnull())]) print len(df[(df['rule_date'].isnull()) & (df['yr_frmtn'].notnull()) & (df['yr_frmtn']<1955)]) print len(df[(df['rule_date'].isnull()) & (df['yr_frmtn'].notnull()) & (df['_merge']=='right_only')]) print len(df[(df['rule_date'].isnull()) & (df['yr_frmtn'].notnull()) & (df['yr_frmtn']<1955) & (df['_merge']=='right_only')]) # In[47]: df[(df['rule_date'].isnull()) & (df['yr_frmtn'].notnull()) & (df['yr_frmtn']<1940) ][['EIN', 'name', 'rule_date', 'yr_frmtn', 'yr_frmtn_v2', 'RULING_2015_BMF', 'ruledate_2004_BMF_v2', 'ruledate_MSTRALL_v2', '_merge']][:2] # In[48]: print df['rule_date'].value_counts().sum() df['rule_date'] = np.where(df['EIN']=='010211478', '1959', df['rule_date']) df['rule_date'] = np.where(df['EIN']=='010212541', '1942', df['rule_date']) print df['rule_date'].value_counts().sum() # In[49]: #df[df['EIN']=='010211478'][['EIN', 'name', # 'rule_date', 'yr_frmtn', 'yr_frmtn_v2', # 'RULING_2015_BMF', 'ruledate_2004_BMF_v2', 'ruledate_MSTRALL_v2']] # #### Create *Age* variable # In[50]: print df['rule_date'].value_counts().sum() print len(df[df['rule_date']=='0.0']) print len(df[df['rule_date']=='']) print len(df[df['rule_date']=='0000']) # In[51]: print df['rule_date'].value_counts().sum() df['rule_date'] = np.where(df['rule_date']=='0.0', np.nan, df['rule_date']) print len(df[df['rule_date']=='0.0']) print df['rule_date'].value_counts().sum() # In[ ]: #print df['rule_date'].value_counts().sum() #df['rule_date'] = np.where(df['rule_date']=='', np.nan, df['rule_date']) #print len(df[df['rule_date']=='0.0']) #print df['rule_date'].value_counts().sum() # In[ ]: #print df['rule_date'].value_counts().sum() #print len(df[df['rule_date']=='0000']) #df['rule_date'] = np.where(df['rule_date']=='0000', np.nan, df['rule_date']) #print len(df[df['rule_date']=='0000']) #print df['rule_date'].value_counts().sum() # In[348]: #df['age'] = np.nan # In[52]: for index, row in df.iterrows(): if pd.notnull(row['rule_date']): df.ix[index, 'age'] = 2016 - int(row['rule_date']) else: pass # In[350]: print df['age'].value_counts().sum() # In[351]: df[df['age'].notnull()]['age'].plot.hist(by=None, bins=100) # In[355]: print len(df[df['age']>100]) df[df['age']>100][['EIN', 'name', 'age', 'rule_date', 'yr_frmtn_v2', 'RULING_2015_BMF', 'ruledate_2004_BMF_v2', 'ruledate_MSTRALL_v2', '_merge']] # In[238]: #df.to_pickle('quick save - merged with age fixes.pkl') # In[53]: print len(df[(df['age'].isnull() & (df['_merge']=='right_only'))]) print len(df[(df['age'].isnull() & (df['_merge']!='right_only'))]) print len(df[(df['age'].isnull() & (df['_merge']=='left_only'))]) print len(df[(df['age'].isnull() & (df['_merge']=='both'))]) # ##### Save DF # In[361]: df.to_pickle('quick save - merged with age fixes.pkl') # ### Category # There are 11 categories here. 
# In[54]:

print df['category'].value_counts().sum()
df['category'].value_counts()

# ##### Create category dummy variables

# In[55]:

pd.get_dummies(df['category'], prefix='category').head(5)

# In[56]:

df = pd.concat([df, pd.get_dummies(df['category'], prefix='category')], axis=1)

# In[57]:

print df.columns.tolist()

# ## Time-Variant Controls
# - Control variables:
#   - Size: total revenues best (probably logged)
#     - will need 2011 and 2016 versions for the 4th and 5th tests
#   - efficiency ratio
#   - complexity (could be a good control from Erica's paper)
# - The focus of our paper is on SOX policies; if an org has SOX policies it probably has other governance policies, and these would be highly correlated. So, we will leave the other governance variables out of one version of the 4th and 5th tests, and then try to include them in another set. The best candidates are:
#   - *independent board* --> related to Erica's *independence of key actors* concept
#   - *board review of 990* and *audited financials* --> both related to Erica's *board monitoring* concept
#   - we could include other governance variables as needed
# - We are focusing on non-health, non-university organizations; by focusing on a more donor-focused sample (CN), we differentiate this work from previous studies.
# - To differentiate from Erica's *JBE* paper, we should use the SOI data to see how many of the donor advisories are because of 'non-material diversions'.

# ### Size - Logged Total Revenues

# In[58]:

total_revenue_columns = ['org_id', 'EIN', 'FYE', 'latest_entry', '2011 data', 'Overall Rating',
                         'total_revenue_2016', 'total_revenue_2011', 'tot_rev',
                         'TAX_PERIOD_2015_BMF', 'INCOME_AMT_2015_BMF', 'REVENUE_AMT_2015_BMF',
                         'taxper_MSTRALL', 'F990REV_MSTRALL']
df[total_revenue_columns][:2]
#df[df['EIN']=='020503776'][total_revenue_columns]
# Create a combined *total_revenue* column. First I will make the 2016 variable a float variable.

# In[59]:

import re
df['total_revenue_2016'] = df['total_revenue_2016'].replace(
    '[\$,)]', '', regex=True).replace(
    '[(]', '-', regex=True).astype(float)
df['total_revenue_2016'][:25]
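# **_Aside:_** the same dollar-string clean-up recurs for several other '2016' columns below, so a small helper (hypothetical -- it is not used elsewhere in this notebook) would keep it in one place:

# In[ ]:

def clean_currency(series):
    """Strip $ signs and commas, turn (parentheses) into negatives, and cast to float."""
    return series.replace('[\$,)]', '', regex=True).replace('[(]', '-', regex=True).astype(float)

# e.g.: df['total_revenue_2016'] = clean_currency(df['total_revenue_2016'])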
# Make the combined variable *total_revenue*. Start by setting it equal to the 2016 value, then overwrite with the 2011 value wherever one exists, and finally, where it is still missing for a given FY and the SOI data are available, fill in the value of the SOI variable *tot_rev*.

# In[60]:

df['total_revenue'] = np.nan
print len(df[df['total_revenue'].notnull()])
df['total_revenue'] = df['total_revenue_2016']
print len(df[df['total_revenue'].notnull()])
df['total_revenue'] = np.where(df['total_revenue_2011'].notnull(),
                               df['total_revenue_2011'],
                               df['total_revenue'])
print len(df[df['total_revenue'].notnull()])
df['total_revenue'] = np.where( ( df['total_revenue'].isnull() & df['tot_rev'].notnull()),
                               df['tot_rev'],
                               df['total_revenue'])
print len(df[df['total_revenue'].notnull()])
Let's take a look and check that the variable is correct. First I will **_sort the dataframe._** # In[61]: df.sort_values(by=['org_id', 'FYE', 'ratings_system', 'latest_entry'], ascending=[1, 0, 0, 0])[['org_id', 'EIN', 'FYE', 'ratings_system', 'latest_entry', '2011 data', 'Overall Rating', 'total_revenue_2016', 'total_revenue_2011', 'total_revenue', 'tot_rev', 'TAX_PERIOD_2015_BMF', 'REVENUE_AMT_2015_BMF', 'taxper_MSTRALL', 'F990REV_MSTRALL']][45:55] # ##### Create logged version # In[62]: print len(df[df['total_revenue']==0]) print len(df[df['total_revenue']<0]) df['total_revenue_no_neg'] = df['total_revenue'] df['total_revenue_no_neg'] = np.where(df['total_revenue_no_neg']<=0, 1, df['total_revenue_no_neg']) print len(df[df['total_revenue_no_neg']==0]) print len(df[df['total_revenue_no_neg']<0]) # In[63]: df['total_revenue_logged'] = np.log(df['total_revenue_no_neg']) df.sort_values(by=['org_id', 'FYE', 'ratings_system', 'latest_entry'], ascending=[1, 0, 0, 0])[['org_id', 'EIN', 'FYE', 'ratings_system', 'latest_entry', #'2011 data', 'Overall Rating', 'total_revenue', 'total_revenue_logged', 'total_revenue_2016', 'total_revenue_2011', 'tot_rev', 'TAX_PERIOD_2015_BMF', 'REVENUE_AMT_2015_BMF', 'taxper_MSTRALL', 'F990REV_MSTRALL']][:20] # In[1235]: #df[df['total_revenue']<0][total_revenue_columns][:5] # ### NOTE: The *total_revenue* variable is now all set. If we need to fill in additional values, we could add the BMF data to the column. But let's only do that if necessary. # # ### NOTE: The 69 observations with negative *total_revenue* are missing from the logged version *total_revenue_logged* # In[64]: total_revenue_columns = ['org_id', 'EIN', 'FYE', 'ratings_system', 'latest_entry', '2011 data', 'Overall Rating', 'total_revenue', 'total_revenue_logged', 'total_revenue_2016', 'total_revenue_2011', 'tot_rev', 'TAX_PERIOD_2015_BMF', 'REVENUE_AMT_2015_BMF', 'taxper_MSTRALL', 'F990REV_MSTRALL'] df[total_revenue_columns].describe().T # ##### Save DF # In[725]: print len(df) df.to_pickle('Merged dataset with with Age, Category dummies, and Total Revenues.pkl') # ### STATE # Create new *state* variable then add in values for the three variables below successively, as I did with *total_revenue* above. # In[65]: [col for col in list(df) if 'state' in col.lower()] # In[66]: print len(df[df['state_2011'].isnull()]) print len(df[df['STATE_2015_BMF'].isnull()]) print len(df[df['state_MSTRALL'].isnull()]) # In[67]: df[df['state_2011'].notnull()]['state_2011'][:2] # In[68]: df[df['STATE_2015_BMF'].notnull()]['STATE_2015_BMF'][:2] # In[69]: df[df['state_MSTRALL'].notnull()]['state_MSTRALL'][:2] # In[70]: df['state'] = np.nan print len(df[df['state'].notnull()]) df['state'] = df['state_2011'] print len(df[df['state'].notnull()]) df['state'] = np.where( ( df['state'].isnull() & df['STATE_2015_BMF'].notnull()), df['STATE_2015_BMF'], df['state']) print len(df[df['state'].notnull()]) df['state'] = np.where( ( df['state'].isnull() & df['state_MSTRALL'].notnull()), df['state_MSTRALL'], df['state']) print len(df[df['state'].notnull()]) # In[71]: df['state'].value_counts() #
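# **_Aside:_** since *state* will enter the models as a fixed effect, it can be dummied the same way *category* was above; a sketch (the dummies are not added to *df* here):

# In[ ]:

state_dummies = pd.get_dummies(df['state'], prefix='state')
print state_dummies.columns.tolist()[:5]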
# ##### Save DF

# In[ ]:

print len(df)
df.to_pickle('Merged dataset with with Age, Category dummies, and Total Revenues.pkl')

# In[76]:

#df = pd.read_pickle('Merged dataset with with Age, Category dummies, and Total Revenues.pkl')

# ## EFFICIENCY
# The CN-reported efficiency variable, *program_expense_percent_2016*, is an average over 3 years.

# In[77]:

df[df['org_id']=='10166']['program_expense_percent_2016'][:3]
Let's make one out of the original columns. First we'll need to change the variables to floats. # In[78]: df[['program_expenses_2016', 'total_functional_expenses_2016']].dtypes # In[79]: df[df['org_id']=='10166']['program_expenses_2016'][:2] # In[80]: df[df['org_id']=='10166']['total_functional_expenses_2016'][:2] # In[81]: df['program_expenses_2016'] = df['program_expenses_2016'].replace( '[\$,)]', '', regex=True ).replace( '[(]','-', regex=True ).astype(float) df['program_expenses_2016'][:2] # ##### Read in and re-merge SOI data # I did not keep the two columns needed for efficiency so I'll re-merge those in. # In[82]: SOI_data_valid_EINS = pd.read_pickle('combined SOI file 2008 to 2013 for CN EINs, v2.pkl') print len(SOI_data_valid_EINS.columns) print len(SOI_data_valid_EINS) SOI_data_valid_EINS = SOI_data_valid_EINS[['EIN', 'FYE', 'tot_func_expns_prg_srvcs', 'tot_func_expns_tot']] print len(SOI_data_valid_EINS.columns) print len(SOI_data_valid_EINS) SOI_data_valid_EINS.head(1) # In[83]: print len(df.columns) print len(df) print len(pd.merge(df, SOI_data_valid_EINS, left_on=['EIN','FYE'], right_on=['EIN','FYE'], how='left').columns) print len(pd.merge(df, SOI_data_valid_EINS, left_on=['EIN','FYE'], right_on=['EIN','FYE'], how='left')) # In[759]: # In[84]: SOI_data_valid_EINS['FYE'] = 'FY' + SOI_data_valid_EINS['FYE'] SOI_data_valid_EINS['FYE'][:2] # In[ ]: # In[85]: print len(df.columns) print len(df) print len(pd.merge(df, SOI_data_valid_EINS, left_on=['EIN','FYE'], right_on=['EIN','FYE'], how='left').columns) df = pd.merge(df, SOI_data_valid_EINS, left_on=['EIN','FYE'], right_on=['EIN','FYE'], how='left', indicator=True) print len(df.columns) print len(df) # In[762]: df['_merge'].value_counts() # ##### Create combined efficiency variables # In[777]: efficiency_columns = ['org_id', 'FYE', 'program_expense_percent_2016', 'program_expenses_2016', 'total_functional_expenses_2016', 'program_expense_2011', 'total_functional_expense_2011', 'tot_func_expns_prg_srvcs', 'tot_func_expns_tot'] df[efficiency_columns][15:22] # In[779]: df['program_expenses'] = np.nan print len(df[df['program_expenses'].notnull()]) df['program_expenses'] = df['program_expenses_2016'] print len(df[df['program_expenses'].notnull()]) df['program_expenses'] = np.where( (df['program_expenses'].isnull() & df['program_expense_2011'].notnull()), df['program_expense_2011'], df['program_expenses']) print len(df[df['program_expenses'].notnull()]) df['program_expenses'] = np.where( ( df['program_expenses'].isnull() & df['tot_func_expns_prg_srvcs'].notnull()), df['tot_func_expns_prg_srvcs'], df['program_expenses']) print len(df[df['program_expenses'].notnull()]) # In[792]: efficiency_columns = ['org_id', 'FYE', 'latest_entry', '2011 data', 'program_expenses', 'program_expenses_2016', 'tot_func_expns_prg_srvcs', 'program_expense_2011', 'total_functional_expense_2011', 'total_functional_expenses_2016', 'tot_func_expns_tot'] #'program_expense_percent_2016', df[efficiency_columns][148:160] # In[793]: df['total_expenses'] = np.nan print len(df[df['total_expenses'].notnull()]) df['total_expenses'] = df['total_functional_expenses_2016'] print len(df[df['total_expenses'].notnull()]) df['total_expenses'] = np.where( (df['total_expenses'].isnull() & df['total_functional_expense_2011'].notnull()), df['total_functional_expense_2011'], df['total_expenses']) print len(df[df['total_expenses'].notnull()]) df['total_expenses'] = np.where( ( df['total_expenses'].isnull() & df['tot_func_expns_tot'].notnull()), df['tot_func_expns_tot'], 
df['total_expenses']) print len(df[df['total_expenses'].notnull()]) # In[795]: efficiency_columns = ['org_id', 'FYE', 'latest_entry', '2011 data', 'total_expenses', 'total_functional_expenses_2016', 'total_functional_expense_2011', 'tot_func_expns_tot', 'program_expenses', 'program_expenses_2016', 'tot_func_expns_prg_srvcs', 'program_expense_2011', ] #'program_expense_percent_2016', df[efficiency_columns][148:160] # ##### Create combined version of efficiency variable # In[796]: df['program_efficiency'] = df['program_expenses']/df['total_expenses'] # In[797]: efficiency_columns = ['org_id', 'FYE', 'latest_entry', '2011 data', 'program_efficiency', 'program_expenses', 'total_expenses', 'total_functional_expenses_2016', 'total_functional_expense_2011', 'tot_func_expns_tot', 'program_expenses_2016', 'tot_func_expns_prg_srvcs', 'program_expense_2011', ] #'program_expense_percent_2016', df[efficiency_columns][148:160] # In[799]: df[efficiency_columns].describe().T # ##### Save DF # In[800]: print len(df) df.to_pickle('Merged dataset with with Age, State, Category dummies, Total Revenues, and Efficiency.pkl') # ## COMPLEXITY # In[826]: complexity_columns_SOI = ['FYE', 'contri_grnts_cy', 'federated_campaigns', 'memshp_dues', 'fndrsng_events', 'rltd_orgs', 'govt_grnts', 'prog_srvc_rev_cy', 'invst_incm_cy', 'oth_rev_cy'] ##### NOTE: 'invst_incm_cy' + 'other_rev_cy' MIGHT BE 'OTHER REVENUE' FOR CN # In[819]: complexity_columns = ['FYE', 'contributions_gifts_grants_2016', 'federated_campaigns_2016', 'membership_dues_2016', 'fundraising_events_2016', 'related_organizations_2016', 'government_grants_2016', 'program_service_revenue_2016', 'other_revenue_2016'] #'total_contributions_2016', 'total_primary_revenue_2016', 'total_revenue_2016', df[(df['EIN']=='362606232')&(df['latest_entry']=='True')][complexity_columns] # In[820]: SOI_2013[SOI_2013['EIN']=='362606232'][complexity_columns_SOI] # In[801]: df['contributions_gifts_grants_2016'] = df['contributions_gifts_grants_2016'].replace( '[\$,)]', '', regex=True ).replace( '[(]','-', regex=True ).astype(float) df['federated_campaigns_2016'] = df['federated_campaigns_2016'].replace( '[\$,)]', '', regex=True ).replace( '[(]','-', regex=True ).astype(float) df['membership_dues_2016'] = df['membership_dues_2016'].replace( '[\$,)]', '', regex=True ).replace( '[(]','-', regex=True ).astype(float) df['fundraising_events_2016'] = df['fundraising_events_2016'].replace( '[\$,)]', '', regex=True ).replace( '[(]','-', regex=True ).astype(float) df['related_organizations_2016'] = df['related_organizations_2016'].replace( '[\$,)]', '', regex=True ).replace( '[(]','-', regex=True ).astype(float) df['government_grants_2016'] = df['government_grants_2016'].replace( '[\$,)]', '', regex=True ).replace( '[(]','-', regex=True ).astype(float) df['program_service_revenue_2016'] = df['program_service_revenue_2016'].replace( '[\$,)]', '', regex=True ).replace( '[(]','-', regex=True ).astype(float) df['other_revenue_2016'] = df['other_revenue_2016'].replace( '[\$,)]', '', regex=True ).replace( '[(]','-', regex=True ).astype(float) # In[802]: df[(df['EIN']=='362606232')&(df['latest_entry']=='True')][complexity_columns] # In[803]: SOI_2013[SOI_2013['EIN']=='362606232'][complexity_columns_SOI] #
# I didn't keep one of the needed SOI variables, so re-merge it in.

# In[810]:

SOI_data_valid_EINS = pd.read_pickle('SOI_data_valid_EINS.pkl')
print len(SOI_data_valid_EINS.columns)
print len(SOI_data_valid_EINS)
SOI_data_valid_EINS = SOI_data_valid_EINS[['EIN', 'FYE', 'fndrsng_events']]
print len(SOI_data_valid_EINS.columns)
print len(SOI_data_valid_EINS)
SOI_data_valid_EINS.head(1)

# In[811]:

print len(df.columns)
print len(df)
print len(pd.merge(df, SOI_data_valid_EINS, left_on=['EIN','FYE'], right_on=['EIN','FYE'],
                   how='left').columns)
print len(pd.merge(df, SOI_data_valid_EINS, left_on=['EIN','FYE'], right_on=['EIN','FYE'],
                   how='left'))

# In[812]:

df.rename(columns={'_merge':'_merge_v3'}, inplace=True)

# In[813]:

SOI_data_valid_EINS['FYE'] = 'FY' + SOI_data_valid_EINS['FYE']
SOI_data_valid_EINS['FYE'][:2]

# In[814]:

print len(df.columns)
print len(df)
print len(pd.merge(df, SOI_data_valid_EINS, left_on=['EIN','FYE'], right_on=['EIN','FYE'],
                   how='left').columns)
df = pd.merge(df, SOI_data_valid_EINS, left_on=['EIN','FYE'], right_on=['EIN','FYE'],
              how='left', indicator=True)
print len(df.columns)
print len(df)

# In[815]:

df['_merge'].value_counts()

# In[824]:

df[(df['EIN']=='362606232')&(df['latest_entry']=='True')][complexity_columns]

# In[827]:

df[(df['EIN']=='362606232')&(df['latest_entry']=='True')][complexity_columns_SOI]

# In[828]:

print complexity_columns_SOI

# In[829]:

df['other_revenue_SOI'] = df['invst_incm_cy'] + df['oth_rev_cy']
complexity_columns_SOI = complexity_columns_SOI + ['other_revenue_SOI']
df[(df['EIN']=='362606232')&(df['latest_entry']=='True')][complexity_columns_SOI]

# In[830]:

df[complexity_columns].astype(bool).sum(axis=1)[:5]

# In[836]:

df[complexity_columns][:5]

# In[839]:

complexity_columns.remove('FYE')
print complexity_columns

# #### Create *complexity_2016*
# http://stackoverflow.com/questions/23663623/pandas-conditional-count-across-row

# In[841]:

df['complexity_2016'] = (df[complexity_columns] > 0).sum(1)
df['complexity_2016'][:3]

# #### Create *complexity_SOI*

# In[843]:

complexity_columns_SOI.remove('FYE')
complexity_columns_SOI.remove('invst_incm_cy')
complexity_columns_SOI.remove('oth_rev_cy')

# In[844]:

print complexity_columns_SOI

# In[845]:

df['complexity_SOI'] = (df[complexity_columns_SOI] > 0).sum(1)
df['complexity_SOI'][:3]
Create combined *complexity* variable. # In[847]: df['complexity'] = np.nan print len(df[df['complexity'].notnull()]) df['complexity'] = df['complexity_2016'] print len(df[df['complexity'].notnull()]) df['complexity'] = np.where( (df['complexity'].isnull() & df['complexity_SOI'].notnull()), df['complexity_SOI'], df['complexity']) print len(df[df['complexity'].notnull()]) # In[848]: len(df[df['complexity']>0]) # In[849]: print len(df) df.to_pickle('Merged dataset with with Age, State, Category dummies, Total Revenues, Efficiency, Complexity.pkl') # In[1227]: df['complexity'].describe() # ## SOX POLICIES # In[853]: SOX_columns = ['conflict_of_interest_policy_2011', 'whistleblower_policy_2011', 'records_retention_policy_2011', 'conflict_of_interest_policy_2016', 'whistleblower_policy_2016', 'records_retention_policy_2016', 'cnflct_int_plcy', 'whistleblower_plcy', 'doc_retention_plcy' ] df[SOX_columns][:5] # In[925]: df['conflict_of_interest_policy'] = np.nan print len(df[df['conflict_of_interest_policy'].notnull()]) df['conflict_of_interest_policy'] = df['conflict_of_interest_policy_2016'] print len(df[df['conflict_of_interest_policy'].notnull()]) df['conflict_of_interest_policy'] = np.where( (df['conflict_of_interest_policy'].isnull() & df['conflict_of_interest_policy_2011'].notnull()), df['conflict_of_interest_policy_2011'], df['conflict_of_interest_policy']) print len(df[df['conflict_of_interest_policy'].notnull()]) df['conflict_of_interest_policy'] = np.where( ( df['conflict_of_interest_policy'].isnull() & df['cnflct_int_plcy'].notnull()), df['cnflct_int_plcy'], df['conflict_of_interest_policy']) print len(df[df['conflict_of_interest_policy'].notnull()]) # In[855]: df['whistleblower_policy'] = np.nan print len(df[df['whistleblower_policy'].notnull()]) df['whistleblower_policy'] = df['whistleblower_policy_2016'] print len(df[df['whistleblower_policy'].notnull()]) df['whistleblower_policy'] = np.where( (df['whistleblower_policy'].isnull() & df['whistleblower_policy_2011'].notnull()), df['whistleblower_policy_2011'], df['whistleblower_policy']) print len(df[df['whistleblower_policy'].notnull()]) df['whistleblower_policy'] = np.where( ( df['whistleblower_policy'].isnull() & df['whistleblower_plcy'].notnull()), df['whistleblower_plcy'], df['whistleblower_policy']) print len(df[df['whistleblower_policy'].notnull()]) # In[856]: df['records_retention_policy'] = np.nan print len(df[df['records_retention_policy'].notnull()]) df['records_retention_policy'] = df['records_retention_policy_2016'] print len(df[df['records_retention_policy'].notnull()]) df['records_retention_policy'] = np.where( (df['records_retention_policy'].isnull() & df['records_retention_policy_2011'].notnull()), df['records_retention_policy_2011'], df['records_retention_policy']) print len(df[df['records_retention_policy'].notnull()]) df['records_retention_policy'] = np.where( ( df['records_retention_policy'].isnull() & df['doc_retention_plcy'].notnull()), df['doc_retention_plcy'], df['records_retention_policy']) print len(df[df['records_retention_policy'].notnull()]) # In[878]: SOX_columns = ['conflict_of_interest_policy', 'whistleblower_policy', 'records_retention_policy', 'conflict_of_interest_policy_2016', 'whistleblower_policy_2016', 'records_retention_policy_2016', 'conflict_of_interest_policy_2011', 'whistleblower_policy_2011', 'records_retention_policy_2011', 'cnflct_int_plcy', 'whistleblower_plcy', 'doc_retention_plcy' ] df[SOX_columns][172:190] # In[934]: 
df[df['conflict_of_interest_policy_2016'].notnull()]['conflict_of_interest_policy_2016'][:5] #
# For the 2016 data I inserted a *list* instead of a string (among other things, this meant I could not view frequencies). Let me fix that here.

# In[932]:

for index, row in df.iterrows():
    #print row['conflict_of_interest_policy'], type(row['conflict_of_interest_policy'])
    if type(row['conflict_of_interest_policy'])==list:
        #print 'yes', type(str(row['conflict_of_interest_policy'][0])), str(row['conflict_of_interest_policy'][0])
        df.ix[index, 'conflict_of_interest_policy'] = str(row['conflict_of_interest_policy'][0])
    if type(row['whistleblower_policy'])==list:
        #print 'yes', type(str(row['whistleblower_policy'][0])), str(row['whistleblower_policy'][0])
        df.ix[index, 'whistleblower_policy'] = str(row['whistleblower_policy'][0])
    if type(row['records_retention_policy'])==list:
        try:
            #print 'yes', type(str(row['records_retention_policy'][0])), str(row['records_retention_policy'][0])
            df.ix[index, 'records_retention_policy'] = str(row['records_retention_policy'][0])
        except:
            #print index
            pass

# In[948]:

for index, row in df.iterrows():
    if type(row['records_retention_policy'])==list:
        try:
            print 'yes', index, type(str(row['records_retention_policy'][0])), str(row['records_retention_policy'])
            df.ix[index, 'records_retention_policy'] = np.nan
        except:
            #print index
            pass

# In[950]:

df[37603:37605][['conflict_of_interest_policy', 'whistleblower_policy', 'records_retention_policy',
                 'conflict_of_interest_policy_2016', 'whistleblower_policy_2016', 'records_retention_policy_2016',
                 'conflict_of_interest_policy_2011', 'whistleblower_policy_2011', 'records_retention_policy_2011',
                 'cnflct_int_plcy', 'whistleblower_plcy', 'doc_retention_plcy']]

# In[986]:

df.set_value(37604, 'records_retention_policy', '_gfx_/icons/checked.gif')

# In[987]:

df[37603:37605][['conflict_of_interest_policy', 'whistleblower_policy', 'records_retention_policy',
                 'conflict_of_interest_policy_2016', 'whistleblower_policy_2016', 'records_retention_policy_2016',
                 'conflict_of_interest_policy_2011', 'whistleblower_policy_2011', 'records_retention_policy_2011',
                 'cnflct_int_plcy', 'whistleblower_plcy', 'doc_retention_plcy']]

# In[988]:

df['conflict_of_interest_policy'].value_counts()

# In[989]:

df['conflict_of_interest_policy_v2'] = np.nan
df['conflict_of_interest_policy_v2'] = np.where(df['conflict_of_interest_policy']=='_gfx_/icons/checked.gif', 1,
                                                df['conflict_of_interest_policy_v2'])
df['conflict_of_interest_policy_v2'] = np.where(df['conflict_of_interest_policy']=='_gfx_/icons/checkboxX.gif', 0,
                                                df['conflict_of_interest_policy_v2'])
df['conflict_of_interest_policy_v2'] = np.where(df['conflict_of_interest_policy']=='Y', 1,
                                                df['conflict_of_interest_policy_v2'])
df['conflict_of_interest_policy_v2'] = np.where(df['conflict_of_interest_policy']=='N', 0,
                                                df['conflict_of_interest_policy_v2'])
df['conflict_of_interest_policy_v2'] = np.where(df['conflict_of_interest_policy']=='yes', 1,
                                                df['conflict_of_interest_policy_v2'])
df['conflict_of_interest_policy_v2'] = np.where(df['conflict_of_interest_policy']=='NO', 0,
                                                df['conflict_of_interest_policy_v2'])

# In[990]:

print 8802+7779+4513
print 320+276+204
df['conflict_of_interest_policy_v2'].value_counts()

# ##### Now fix records_retention_policy

# In[991]:

df['records_retention_policy'].value_counts()

# In[992]:

df['records_retention_policy_v2'] = np.nan
df['records_retention_policy_v2'] = np.where(df['records_retention_policy']=='_gfx_/icons/checked.gif', 1,
                                             df['records_retention_policy_v2'])
df['records_retention_policy_v2'] = np.where(df['records_retention_policy']=='_gfx_/icons/checkboxX.gif', 0,
                                             df['records_retention_policy_v2'])
df['records_retention_policy_v2'] = np.where(df['records_retention_policy']=='Y', 1,
                                             df['records_retention_policy_v2'])
df['records_retention_policy_v2'] = np.where(df['records_retention_policy']=='N', 0,
                                             df['records_retention_policy_v2'])
df['records_retention_policy_v2'] = np.where(df['records_retention_policy']=='yes', 1,
                                             df['records_retention_policy_v2'])
df['records_retention_policy_v2'] = np.where(df['records_retention_policy']=='NO', 0,
                                             df['records_retention_policy_v2'])

# In[993]:

print 8138+7246+3864
print 969+940+736
df['records_retention_policy_v2'].value_counts()

# ##### Now fix whistleblower_policy

# In[994]:

df['whistleblower_policy'].value_counts()

# In[995]:

df['whistleblower_policy_v2'] = np.nan
df['whistleblower_policy_v2'] = np.where(df['whistleblower_policy']=='_gfx_/icons/checked.gif', 1,
                                         df['whistleblower_policy_v2'])
df['whistleblower_policy_v2'] = np.where(df['whistleblower_policy']=='_gfx_/icons/checkboxX.gif', 0,
                                         df['whistleblower_policy_v2'])
df['whistleblower_policy_v2'] = np.where(df['whistleblower_policy']=='Y', 1,
                                         df['whistleblower_policy_v2'])
df['whistleblower_policy_v2'] = np.where(df['whistleblower_policy']=='N', 0,
                                         df['whistleblower_policy_v2'])
df['whistleblower_policy_v2'] = np.where(df['whistleblower_policy']=='yes', 1,
                                         df['whistleblower_policy_v2'])
df['whistleblower_policy_v2'] = np.where(df['whistleblower_policy']=='NO', 0,
                                         df['whistleblower_policy_v2'])

# In[996]:

print 8145+7297+3867
print 966+933+686
df['whistleblower_policy_v2'].value_counts()

# In[997]:

SOX_columns = ['conflict_of_interest_policy', 'conflict_of_interest_policy_v2',
               'whistleblower_policy', 'whistleblower_policy_v2',
               'records_retention_policy', 'records_retention_policy_v2',
               'conflict_of_interest_policy_2016', 'whistleblower_policy_2016', 'records_retention_policy_2016',
               'conflict_of_interest_policy_2011', 'whistleblower_policy_2011', 'records_retention_policy_2011',
               'cnflct_int_plcy', 'whistleblower_plcy', 'doc_retention_plcy']
df[SOX_columns][172:190]

# #### Create combined *SOX_policy* variable

# In[998]:

SOX_columns = ['conflict_of_interest_policy_v2', 'whistleblower_policy_v2', 'records_retention_policy_v2']
#df['SOX_policies'] = (df[SOX_columns] > 0).sum(1)
df['SOX_policies'] = np.where(df['conflict_of_interest_policy_v2'].notnull(),
                              (df[SOX_columns] > 0).sum(1),
                              np.nan)
df['SOX_policies'][:3]

# In[999]:

SOX_columns = ['SOX_policies',
               'conflict_of_interest_policy', 'conflict_of_interest_policy_v2',
               'whistleblower_policy', 'whistleblower_policy_v2',
               'records_retention_policy', 'records_retention_policy_v2',
               'conflict_of_interest_policy_2016', 'whistleblower_policy_2016', 'records_retention_policy_2016',
               'conflict_of_interest_policy_2011', 'whistleblower_policy_2011', 'records_retention_policy_2011',
               'cnflct_int_plcy', 'whistleblower_plcy', 'doc_retention_plcy']
df[SOX_columns][172:190]

# In[1005]:

print df['SOX_policies'].value_counts().sum()
df['SOX_policies'].value_counts()

# In[1006]:

df['SOX_policies_binary'] = df['SOX_policies']
df['SOX_policies_binary'] = np.where(df['SOX_policies_binary']>=1, 1, df['SOX_policies'])
print df['SOX_policies_binary'].value_counts().sum()
df['SOX_policies_binary'].value_counts()

# In[1001]:

SOX_columns = ['conflict_of_interest_policy_v2', 'whistleblower_policy_v2', 'records_retention_policy_v2']
print df[SOX_columns[0]].value_counts().sum()
print SOX_columns[0]
df[SOX_columns[0]].value_counts()

# In[1002]:

print SOX_columns[1]
print df[SOX_columns[1]].value_counts().sum()
df[SOX_columns[1]].value_counts()

# In[1003]:

print SOX_columns[2]
print df[SOX_columns[2]].value_counts().sum()
df[SOX_columns[2]].value_counts()
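# **_Aside:_** the repeated `np.where` recodes above can be collapsed into a single `map` per column; strings outside the mapping become NaN automatically. A sketch of the same recode (not run against *df* here):

# In[ ]:

recode = {'_gfx_/icons/checked.gif': 1, '_gfx_/icons/checkboxX.gif': 0,
          'Y': 1, 'N': 0, 'yes': 1, 'NO': 0}
#df['whistleblower_policy_v2'] = df['whistleblower_policy'].map(recode)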
Create binary version for *all three policies* # In[1149]: df['SOX_policies_all_binary'] = np.nan df['SOX_policies_all_binary'] = df['SOX_policies'] df['SOX_policies_all_binary'] = np.where( ((df['SOX_policies_all_binary']==1) | (df['SOX_policies_all_binary']==2)), 0, df['SOX_policies_all_binary']) df['SOX_policies_all_binary'] = np.where(df['SOX_policies_all_binary']==3, 1, df['SOX_policies_all_binary']) df['SOX_policies_all_binary'].value_counts() # In[1150]: print df['SOX_policies'].value_counts(), '\n' print df['SOX_policies_binary'].value_counts(), '\n' print df['SOX_policies_all_binary'].value_counts() # ##### Save DF # In[1152]: print len(df) df.to_pickle('Merged dataset with Age, State, Category dummies, Total Revenues, Efficiency, Complexity, SOX.pkl') # #### '2016 data' Indicator # In[1010]: df['2016_data'] = np.where(df['latest_entry']=='True', 1,0) print len(df), df['2016_data'].value_counts().sum() df['2016_data'].value_counts() # In[1011]: pd.crosstab(df['2016_data'], df['latest_entry']) # In[1014]: df.rename(columns={'2011 data':'2011_data'}, inplace=True) print df['2011_data'].value_counts(), '\n' df['2011_data'] = np.where(df['2011_data']==1, 1,0) print df['2011_data'].value_counts() # ##### Save DF # In[1034]: print len(df) df.to_pickle('Merged dataset with Age, State, Category dummies, Total Revenues, Efficiency, Complexity, SOX.pkl') # ## DONOR ADVISORY # In[ ]: df['advisory text - current advisory']= df['advisory text - current advisory'].str.strip() # In[1036]: advisory_columns = ['org_id', 'Date Published', 'FYE', 'Overall Rating', 'advisory text - current advisory', 'advisory text - past advisory' ] df[advisory_columns][:10] # In[1038]: df['Advisory Text'] = df['advisory text - current advisory'] df['Advisory Text'] = np.where( (df['Advisory Text'].isnull() & df['advisory text - past advisory'].notnull()), df['advisory text - past advisory'], df['Advisory Text']) advisory_columns = ['org_id', 'Date Published', 'FYE', 'Overall Rating', 'Advisory Text', 'advisory text - current advisory', 'advisory text - past advisory' ] df[advisory_columns][:10] # In[ ]: df['donor_advisory'] = df['Overall Rating'].str.contains('advisory', case=False) #df['donor_advisory'] = df['donor_advisory'].convert_objects(convert_numeric=True) #OLD CODE df['donor_advisory'] = pd.to_numeric(df['donor_advisory']) df['Advisory Text'] = df['advisory text - current advisory'] # In[1052]: advisory_columns = ['org_id', 'FYE', '2016_data', 'Overall Rating', 'donor_advisory', 'current_donor_advisory', 'Advisory Text', ] df[advisory_columns][:10] # In[1049]: #df = df.drop('2016_donor_advisory', 1) #df = df.drop('donor_advisory_2016', 1) # In[1057]: len(df[df['donor_advisory'].isnull()]) # In[1054]: df[advisory_columns].dtypes # #### Create *2016_donor_advisory* variable # We need this for '2011' test # In[1062]: print len(df[(df['2016_data']==1) & (df['donor_advisory']==1)]) print len(df[(df['2016_data']==1) & (df['donor_advisory']==1)]['org_id'].tolist()) advisories_2016 = df[(df['2016_data']==1) & (df['donor_advisory']==1)]['org_id'].tolist() print len(advisories_2016) print len(set(advisories_2016)) # In[1064]: df['2016_donor_advisory'] = np.nan df['2016_donor_advisory'] = np.where( df['org_id'].isin(advisories_2016), 1, 0) print df['2016_donor_advisory'].value_counts(), '\n' advisory_columns = ['org_id', 'FYE', '2016_data', 'Overall Rating', '2016_donor_advisory', 'donor_advisory', 'current_donor_advisory', 'Advisory Text' ] df[advisory_columns][:10] #
Verify data # In[1065]: len(df[(df['2011_data']==1) & (df['2016_donor_advisory']==1)]) # #### Post-2011 Donor Advisory # In[1076]: print len(set(df[df['donor_advisory']==1]['org_id'].tolist())) advisories_2011 = set(df[df['donor_advisory']==1]['org_id'].tolist()) print len(advisories_2011) print len(set(advisories_2011)) # In[1077]: df['2011_to_2016_donor_advisory'] = np.nan df['2011_to_2016_donor_advisory'] = np.where( df['org_id'].isin(advisories_2011), 1, 0) print df['2011_to_2016_donor_advisory'].value_counts(), '\n' advisory_columns = ['org_id', 'FYE', '2016_data', 'Overall Rating', '2016_donor_advisory', '2011_to_2016_donor_advisory', 'donor_advisory', 'current_donor_advisory', 'Advisory Text' ] df[advisory_columns][:10] # ##### Save DF # In[1078]: print len(df) df.to_pickle('Merged dataset with Age, State, Category dummies, Total Revenues, Efficiency, Complexity, SOX, Donor Advisory.pkl') # # Material Diversion # In[1117]: df['mtrl_divrsn_or_misuse'].value_counts() # In[1118]: df['donor_advisory'].value_counts() # In[1116]: pd.crosstab(df['donor_advisory'], df['mtrl_divrsn_or_misuse']) # In[1126]: pd.crosstab(df[df['2011_data']==1]['donor_advisory'], df[df['2011_data']==1]['no_material_division_2011']) #
# Fix the 2016 version.

# In[1124]:

for index, row in df.iterrows():
    #print row['no_material_division_2016'], type(row['no_material_division_2016'])
    if type(row['no_material_division_2016'])==list:
        #print 'yes', type(str(row['no_material_division_2016'][0])), str(row['no_material_division_2016'][0])
        df.ix[index, 'no_material_division_2016'] = str(row['no_material_division_2016'][0])

# In[1130]:

pd.crosstab(df['donor_advisory_2011_to_2016'], df['no_material_division_2016'])

# In[1128]:

pd.crosstab(df[df['2016_data']==1]['donor_advisory'], df[df['2016_data']==1]['no_material_division_2016'])

# In[1113]:

pd.crosstab(df['donor_advisory_2011_to_2016'], df['mtrl_divrsn_or_misuse'])

# In[1115]:

pd.crosstab(df['past_donor_advisory'], df['mtrl_divrsn_or_misuse'])

# In[1159]:

print df.columns.tolist()

# # Variables

# In[1082]:

df.rename(columns={'2016_donor_advisory':'donor_advisory_2016'}, inplace=True)
df.rename(columns={'2011_to_2016_donor_advisory':'donor_advisory_2011_to_2016'}, inplace=True)

# In[1189]:

DVs = ['donor_advisory', 'donor_advisory_2016', 'donor_advisory_2011_to_2016',
       'conflict_of_interest_policy_v2', 'records_retention_policy_v2', 'whistleblower_policy_v2']
indicators = ['org_id', 'EIN', 'FYE', 'Form 990 FYE', 'ratings_system', '2011_data', '2016_data']
IVs = ['SOX_policies', 'SOX_policies_binary', 'SOX_policies_all_binary']
controls = ['program_efficiency', 'complexity', 'age', 'total_revenue_logged', 'category', 'state']
fixed_effects = ['category_Animals', 'category_Arts, Culture, Humanities', 'category_Community Development',
                 'category_Education', 'category_Environment', 'category_Health', 'category_Human Services',
                 'category_Human and Civil Rights', 'category_International', 'category_Religion',
                 'category_Research and Public Policy']
SOI_check = ['tot_rev']

# ##### Save DF

# In[1160]:

print len(df)
df.to_pickle('Merged dataset with Age, State, Category dummies, Total Revenues, Efficiency, Complexity, SOX, Donor Advisory.pkl')

# ### Output version of dataset with only 2011 rows and logit columns

# In[1190]:

cols = DVs + indicators + IVs + controls + SOI_check + fixed_effects
print cols

# In[1178]:

print len(df[df['2011_data']==1]['org_id'].tolist())
print len(set(df[df['2011_data']==1]['org_id'].tolist()))
org_ids_2011 = list(set(df[df['2011_data']==1]['org_id'].tolist()))
print len(org_ids_2011)
org_ids_2011[:5]

# In[1180]:

len(df[(df['org_id'].isin(org_ids_2011))])

# In[1181]:

len(df[(df['org_id'].isin(org_ids_2011))]['org_id'].tolist())
len(set(df[(df['org_id'].isin(org_ids_2011))]['org_id'].tolist()))
Save as Excel file. # In[1179]: df[(df['org_id'].isin(org_ids_2011))][cols].to_excel('2011 dataset.xlsx') # # Create dummies for Test #5 # 'Divide 4,857 orgs into three groups: i) those with no SOX policies in 2011 and still no SOX policies in 2016; ii) those with SOX policies in 2011 and 2016; and iii) those with no SOX policies in 2011 but SOX policies in 2016. Create dummy variables for each group and see whether those in group iii) do better than i) or ii). This is a relatively low cost 'pre-post' test. # # # Try SOX_policies in 2016 - SOX_policies in 2011 # # # --> THIS COULD BE A PROBLEM FOR THOSE WITH A 2016 DONOR ADVISORY - WE DON'T HAVE THEIR 990 DETAILS --> PERHAPS ASK DAN TO GET THOSE? IN ANY CASE, THEY WILL LIKELY HAVE TO BE DOWNLOADED (OTHERWISE, WE ONLY HAVE THEIR '2011' SOX policy data, not the 2016 # # # --> E.G., WE NEED 2016 SOX DATA FOR ORG_ID 10087 # # # DESCRIPTIVE DATA -- SHOW HOW MANY GOT OR ADDED SOX POLICIES FROM 2011 TO 2016; ALSO DO THIS SINCE 2008 FOR THE AVAILABLE 'SOI' ORGANIZATIONS --> PERHAPS JUST FOR OUR SAMPLE PLUS FOR THE ENTIRE SOI DATASET # In[1187]: print len(df[(df['org_id'].isin(org_ids_2011))]) print len(df[(df['org_id'].isin(org_ids_2011)) & (df['2016_data']==1)]) print len(df[(df['org_id'].isin(org_ids_2011)) & (df['2016_data']==1) & (df['donor_advisory']==1)]) df[(df['org_id'].isin(org_ids_2011)) & (df['2016_data']==1) & (df['donor_advisory']==1)][cols].to_excel('47 missing SOX.xls') df[(df['org_id'].isin(org_ids_2011)) & (df['2016_data']==1) & (df['donor_advisory']==1)][cols][:5] # In[1191]: df[(df['org_id'].isin(org_ids_2011)) & (df['donor_advisory_2016']==1)][cols].to_excel('47 missing SOX v2.xls') # ## Merge in hand-coded data for 47 orgs # In[1237]: df[(df['org_id'].isin(org_ids_2011)) & (df['2016_data']==1) & (df['donor_advisory']==1)][cols][:5] # In[1297]: missing_47 = pd.read_excel('47 missing SOX_updated.xls') missing_47['EIN'] = missing_47['EIN'].astype('str') missing_47['FYE'] = 'FY' + missing_47['FYE'].astype('str') print len(missing_47) missing_47.head() # In[1298]: missing_47['program_efficiency'] = missing_47['program_expense']/missing_47['total_expense'] missing_47.head() # In[1299]: len(missing_47[missing_47['total_revenue']<=0]) # In[1300]: missing_47['total_revenue_logged'] = np.log(missing_47['total_revenue']) # In[1301]: missing_47.describe().T # In[1302]: SOX_columns_47 = ['conflict_of_interest', 'whistleblower', 'records_retention'] missing_47['SOX_policies'] = np.where(missing_47['conflict_of_interest'].notnull(), (missing_47[SOX_columns_47] > 0).sum(1), np.nan) missing_47['SOX_policies'][:3] # In[1303]: SOX_columns_47 = ['SOX_policies'] + SOX_columns_47 missing_47[SOX_columns_47][:8] # In[1304]: missing_47['SOX_policies_binary'] = missing_47['SOX_policies'] missing_47['SOX_policies_binary'] = np.where(missing_47['SOX_policies_binary']>=1, 1, missing_47['SOX_policies']) print missing_47['SOX_policies_binary'].value_counts().sum() missing_47['SOX_policies_binary'].value_counts() # In[1305]: missing_47['SOX_policies_all_binary'] = np.nan missing_47['SOX_policies_all_binary'] = missing_47['SOX_policies'] missing_47['SOX_policies_all_binary'] = np.where( ((missing_47['SOX_policies_all_binary']==1) | (missing_47['SOX_policies_all_binary']==2)), 0, missing_47['SOX_policies_all_binary']) missing_47['SOX_policies_all_binary'] = np.where(missing_47['SOX_policies_all_binary']==3, 1, missing_47['SOX_policies_all_binary']) print missing_47['SOX_policies_all_binary'].value_counts() # In[1307]: print 
missing_47['SOX_policies'].value_counts(), '\n' print missing_47['SOX_policies_binary'].value_counts(), '\n' print missing_47['SOX_policies_all_binary'].value_counts() # In[1308]: SOX_columns_47 = ['SOX_policies_binary', 'SOX_policies_all_binary'] + SOX_columns_47 missing_47[SOX_columns_47][:8] # In[1309]: print missing_47.columns.tolist() # In[1310]: missing_47 = missing_47[['org_id', 'EIN', 'FYE', 'conflict_of_interest', 'records_retention', 'whistleblower', 'SOX_policies', 'SOX_policies_all_binary', 'SOX_policies_binary', 'total_revenue', 'total_revenue_logged', 'program_expense', 'total_expense', 'program_efficiency', 'complexity']] #'EIN', 'donor_advisory', missing_47[:3] # In[1311]: missing_47['org_id'] = missing_47['org_id'].astype('str') #
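 For reference, the two binary recodes above reduce to single np.where calls that keep missing values missing -- a sketch, equivalent to the In[1304]/In[1305] cells:

# In[ ]:

# NaN stays NaN; otherwise 1 if the org has at least one SOX policy.
missing_47['SOX_policies_binary'] = np.where(missing_47['SOX_policies'].isnull(), np.nan,
                                             (missing_47['SOX_policies'] >= 1).astype(int))
# NaN stays NaN; otherwise 1 only if the org has all three SOX policies.
missing_47['SOX_policies_all_binary'] = np.where(missing_47['SOX_policies'].isnull(), np.nan,
                                                 (missing_47['SOX_policies'] == 3).astype(int))

#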
Rename columns. I checked that all 47 rows will come in as new rows; I give the incoming (non-key) columns a '_47' suffix so that the merge keys stay 'org_id' and 'FYE' and the new values can be copied into the existing columns afterwards.

# In[1313]:

missing_47.columns = ['org_id', 'EIN_47', 'FYE', 'conflict_of_interest_policy_47',
                      'records_retention_policy_47', 'whistleblower_policy_47', 'SOX_policies_47',
                      'SOX_policies_all_binary_47', 'SOX_policies_binary_47', 'tot_rev_47',
                      'total_revenue_logged_47', 'program_expenses_47', 'total_expenses_47',
                      'program_efficiency_47', 'complexity_47']
missing_47[:3]

#
Save DF. # In[1314]: missing_47.to_pickle('missing_47.pkl') # ##### Merge into main dataframe # In[1315]: df.rename(columns={'_merge':'_merge_v4'}, inplace=True) # In[1316]: #df.to_pickle('df.pkl') #df = pd.read_pickle('df.pkl') print len(df.columns) print len(df) # In[1318]: print len(df.columns) print len(df) print len(pd.merge(df, missing_47, left_on=['org_id','FYE'], right_on=['org_id','FYE'], how='left').columns) print len(pd.merge(df, missing_47, left_on=['org_id','FYE'], right_on=['org_id','FYE'], how='left')) print len(pd.merge(df, missing_47, left_on=['org_id','FYE'], right_on=['org_id','FYE'], how='outer').columns) print len(pd.merge(df, missing_47, left_on=['org_id','FYE'], right_on=['org_id','FYE'], how='outer')) df = pd.merge(df, missing_47, left_on=['org_id','FYE'], right_on=['org_id','FYE'], how='outer', indicator=True) print len(df.columns) print len(df) # In[1319]: df['_merge'].value_counts() # In[1320]: print df.columns.tolist() # In[1321]: df[df['_merge']=='right_only'][:2] # In[1322]: df.rename(columns={'_merge':'_merge_47'}, inplace=True) # ##### Replace values # In[ ]: print len(df[df['EIN'].notnull()]) df['EIN'] = np.where( ( (df['EIN'].isnull()) & (df['EIN_47'].notnull()) ), df['EIN_47'], df['EIN']) print len(df[df['EIN'].notnull()]) # In[ ]: print len(df[df['conflict_of_interest_policy_v2'].notnull()]) df['conflict_of_interest_policy_v2'] = np.where( ( (df['conflict_of_interest_policy_v2'].isnull()) & (df['conflict_of_interest_policy_47'].notnull()) ), df['conflict_of_interest_policy_47'], df['conflict_of_interest_policy_v2']) print len(df[df['conflict_of_interest_policy_v2'].notnull()]) # In[1328]: print len(df[df['records_retention_policy_v2'].notnull()]) df['records_retention_policy_v2'] = np.where( ( (df['records_retention_policy_v2'].isnull()) & (df['records_retention_policy_47'].notnull()) ), df['records_retention_policy_47'], df['records_retention_policy_v2']) print len(df[df['records_retention_policy_v2'].notnull()]) # In[1340]: print len(df[df['whistleblower_policy_v2'].notnull()]) df['whistleblower_policy_v2'] = np.where( ( (df['whistleblower_policy_v2'].isnull()) & (df['whistleblower_policy_47'].notnull()) ), df['whistleblower_policy_47'], df['whistleblower_policy_v2']) print len(df[df['whistleblower_policy_v2'].notnull()]) # In[1330]: print len(df[df['SOX_policies'].notnull()]) df['SOX_policies'] = np.where( ( (df['SOX_policies'].isnull()) & (df['SOX_policies_47'].notnull()) ), df['SOX_policies_47'], df['SOX_policies']) print len(df[df['SOX_policies'].notnull()]) # In[1331]: print len(df[df['SOX_policies_all_binary'].notnull()]) df['SOX_policies_all_binary'] = np.where( ( (df['SOX_policies_all_binary'].isnull()) & (df['SOX_policies_all_binary_47'].notnull()) ), df['SOX_policies_all_binary_47'], df['SOX_policies_all_binary']) print len(df[df['SOX_policies_all_binary'].notnull()]) # In[1332]: print len(df[df['SOX_policies_binary'].notnull()]) df['SOX_policies_binary'] = np.where( ( (df['SOX_policies_binary'].isnull()) & (df['SOX_policies_binary_47'].notnull()) ), df['SOX_policies_binary_47'], df['SOX_policies_binary']) print len(df[df['SOX_policies_binary'].notnull()]) # In[1333]: print len(df[df['tot_rev'].notnull()]) df['tot_rev'] = np.where( ( (df['tot_rev'].isnull()) & (df['tot_rev_47'].notnull()) ), df['tot_rev_47'], df['tot_rev']) print len(df[df['tot_rev'].notnull()]) # In[1334]: print len(df[df['total_revenue_logged'].notnull()]) df['total_revenue_logged'] = np.where( ( (df['total_revenue_logged'].isnull()) & 
(df['total_revenue_logged_47'].notnull()) ), df['total_revenue_logged_47'], df['total_revenue_logged']) print len(df[df['total_revenue_logged'].notnull()]) # In[1335]: print len(df[df['program_expenses'].notnull()]) df['program_expenses'] = np.where( ( (df['program_expenses'].isnull()) & (df['program_expenses_47'].notnull()) ), df['program_expenses_47'], df['program_expenses']) print len(df[df['program_expenses'].notnull()]) # In[1336]: print len(df[df['total_expenses'].notnull()]) df['total_expenses'] = np.where( ( (df['total_expenses'].isnull()) & (df['total_expenses_47'].notnull()) ), df['total_expenses_47'], df['total_expenses']) print len(df[df['total_expenses'].notnull()]) # In[1337]: print len(df[df['program_efficiency'].notnull()]) df['program_efficiency'] = np.where( ( (df['program_efficiency'].isnull()) & (df['program_efficiency_47'].notnull()) ), df['program_efficiency_47'], df['program_efficiency']) print len(df[df['program_efficiency'].notnull()]) # In[1338]: print len(df[df['complexity'].notnull()]) df['complexity'] = np.where( ( (df['complexity'].isnull()) & (df['complexity_47'].notnull()) ), df['complexity_47'], df['complexity']) print len(df[df['complexity'].notnull()]) # In[1341]: df[df['_merge_47']=='right_only'][:2] # In[1344]: print cols # In[1351]: cols = ['org_id', 'EIN', 'FYE', 'Form 990 FYE', 'ratings_system', '2011_data', '2016_data', 'donor_advisory', 'donor_advisory_2016', 'donor_advisory_2011_to_2016', 'conflict_of_interest_policy_v2', 'records_retention_policy_v2', 'whistleblower_policy_v2', 'SOX_policies', 'SOX_policies_binary', 'SOX_policies_all_binary', 'program_efficiency', 'complexity', 'age', 'total_revenue_logged', 'tot_rev', 'state', 'category', 'category_Animals', 'category_Arts, Culture, Humanities', 'category_Community Development', 'category_Education', 'category_Environment', 'category_Health', 'category_Human Services', 'category_Human and Civil Rights', 'category_International', 'category_Religion', 'category_Research and Public Policy', ] # In[1352]: df[df['_merge_47']=='right_only'][cols] # In[1353]: df[cols].describe().T # ##### Save DF # In[1355]: print len(df) df.to_pickle('Merged CN dataset with Age, State, Category, Total Revenues, Efficiency, Complexity, SOX, Donor Advisory (with added 990 data).pkl') # ## Merge in e-file data # NOTE: THERE WAS A STRANGE NUMPY IMPORT ERROR WHEN I ATTEMPTED TO READ IN THE *.PKL VERSION # In[1389]: #dfe = pd.read_pickle('efile 990s.pkl') #dfe = pd.read_excel('e-file 990s for 2016 donor advisory organizations, v4 (key columns only).xls', # dtype={'EIN': object}) dfe = pd.read_csv('e-file 990s for 2016 donor advisory organizations, v4 (key columns only).csv', dtype={'EIN': object}) print len(dfe.columns) print len(dfe) dfe[:2] # In[1393]: dfe.describe().T # In[1392]: dfe.dtypes # In[1371]: #dfe['EIN'] = dfe['EIN'].astype('str') # In[1394]: print dfe.columns.tolist() # In[1395]: dfe.columns = ['EIN', 'OrganizationName_efile', 'URL_efile', 'SubmittedOn_efile', 'TaxPeriod_efile', 'FYE', 'whistleblower_policy_efile', 'conflict_of_interest_policy_efile', 'records_retention_policy_efile', 'SOX_policies_efile', 'SOX_policies_binary_efile', 'SOX_policies_all_binary_efile', 'tot_rev_efile', 'tot_rev_no_neg_efile', 'total_revenue_logged_efile', 'total_expenses_efile', 'program_expenses_efile', 'program_efficiency_efile', 'complexity_efile'] dfe[:3] # In[1396]: dfe = dfe[['EIN', 'FYE', 'OrganizationName_efile', 'URL_efile', 'SubmittedOn_efile', 'TaxPeriod_efile', 'whistleblower_policy_efile', 
'conflict_of_interest_policy_efile', 'records_retention_policy_efile', 'SOX_policies_efile', 'SOX_policies_binary_efile', 'SOX_policies_all_binary_efile', 'tot_rev_efile', 'tot_rev_no_neg_efile', 'total_revenue_logged_efile', 'program_expenses_efile', 'total_expenses_efile', 'program_efficiency_efile', 'complexity_efile']] dfe[:2] # In[1397]: dfe.to_pickle('dfe.pkl') # In[1398]: len(dfe) # ##### Merge into main dataframe # In[1380]: #df.to_pickle('df.pkl') # In[1399]: df[cols][:2] # In[1400]: dfe[:2] # In[1403]: print len(df.columns) print len(df) print len(pd.merge(df, dfe, left_on=['EIN','FYE'], right_on=['EIN','FYE'], how='left').columns) print len(pd.merge(df, dfe, left_on=['EIN','FYE'], right_on=['EIN','FYE'], how='left')) print len(pd.merge(df, dfe, left_on=['EIN','FYE'], right_on=['EIN','FYE'], how='outer').columns) print len(pd.merge(df, dfe, left_on=['EIN','FYE'], right_on=['EIN','FYE'], how='outer')) df = pd.merge(df, dfe, left_on=['EIN','FYE'], right_on=['EIN','FYE'], how='outer', indicator=True) print len(df.columns) print len(df) # In[1404]: df.rename(columns={'_merge':'_merge_efile'}, inplace=True) df['_merge_efile'].value_counts() #
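 The fill-in pattern used for the *_47 columns above -- and repeated below for the *_efile columns -- is the same np.where for every column, so it can be wrapped in a small helper. A sketch (fill_from_suffix is my name, not one used elsewhere in this notebook):

# In[ ]:

def fill_from_suffix(frame, base_cols, suffix):
    # For each base column, fill missing values from its suffixed twin.
    for col in base_cols:
        twin = col + suffix
        frame[col] = np.where(frame[col].isnull() & frame[twin].notnull(),
                              frame[twin], frame[col])
    return frame

# e.g.: df = fill_from_suffix(df, ['SOX_policies', 'tot_rev', 'complexity'], '_efile')

#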
Save DF # In[ ]: print len(df) df.to_pickle('Merged CN dataset with Age, State, Category, Total Revenues, Efficiency, Complexity, SOX, Donor Advisory (with added 990 data).pkl') # In[1442]: print cols efile_cols = ['whistleblower_policy_efile', 'conflict_of_interest_policy_efile', 'records_retention_policy_efile', 'SOX_policies_efile', 'SOX_policies_binary_efile', 'SOX_policies_all_binary_efile', 'tot_rev_efile', 'tot_rev_no_neg_efile', 'total_revenue_logged_efile', 'program_expenses_efile', 'total_expenses_efile', 'program_efficiency_efile', 'complexity_efile'] cols2 = cols+efile_cols+['_merge_efile', 'latest_entry'] print cols2 # In[1412]: df[df['_merge_efile']=='right_only'][cols2][:3] # In[1414]: df[df['_merge_efile']=='both'][cols2][:3] #
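 Before a merge like the one above, it's worth confirming the right-hand keys are unique; newer pandas versions can also enforce this directly with pd.merge(..., validate='one_to_one'), which raises MergeError on duplicate keys. A quick check on dfe (a sketch):

# In[ ]:

# Number of duplicated EIN/FYE key pairs in the e-file data (should be 0 for a clean 1:1 merge).
print dfe.duplicated(['EIN', 'FYE']).sum()

#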
Sort DF # In[1443]: #df[df['org_id']=='16648'][cols2] #
Fix 4 EIN values that the sort command showed to be wrong. # In[1437]: #df.set_value(66124, 'EIN', '042453412') #df.set_value(44363, 'EIN', '202440544') #df.set_value(35434, 'EIN', '364762261') #df.set_value(11784, 'EIN', np.nan) # In[1444]: df.sort_values(by=['EIN', 'latest_entry', 'FYE', 'ratings_system'], ascending=[1, 0, 0, 0])[cols2][:16] # In[1447]: df['SOX_policies'].describe() # In[1452]: df[df['org_id']=='3916'][cols2] #
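 For reference, the set_value calls above (commented out once applied) can be written with plain .loc indexing, which does the same scalar assignment -- a sketch with the same row labels and EINs:

# In[ ]:

#df.loc[66124, 'EIN'] = '042453412'
#df.loc[44363, 'EIN'] = '202440544'
#df.loc[35434, 'EIN'] = '364762261'
#df.loc[11784, 'EIN'] = np.nan

#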
65 EINs are missing. See email exchanges with Dan. # In[1424]: print len(df[df['EIN'].isnull()]) print len(df[df['EIN'].notnull()]) # #### Replace Values # In[1464]: print len(df[df['conflict_of_interest_policy'].notnull()]) df['conflict_of_interest_policy'] = np.where( ( (df['conflict_of_interest_policy'].isnull()) & (df['conflict_of_interest_policy_efile'].notnull()) ), df['conflict_of_interest_policy_efile'], df['conflict_of_interest_policy']) print len(df[df['conflict_of_interest_policy'].notnull()]) # In[1537]: print len(df[df['conflict_of_interest_policy_v2'].notnull()]) df['conflict_of_interest_policy_v2'] = np.where( ( (df['conflict_of_interest_policy_v2'].isnull()) & (df['conflict_of_interest_policy_efile'].notnull()) ), df['conflict_of_interest_policy_efile'], df['conflict_of_interest_policy_v2']) print len(df[df['conflict_of_interest_policy_v2'].notnull()]) # In[1465]: print len(df[df['records_retention_policy'].notnull()]) df['records_retention_policy'] = np.where( ( (df['records_retention_policy'].isnull()) & (df['records_retention_policy_efile'].notnull()) ), df['records_retention_policy_efile'], df['records_retention_policy']) print len(df[df['records_retention_policy'].notnull()]) # In[1538]: print len(df[df['records_retention_policy_v2'].notnull()]) df['records_retention_policy_v2'] = np.where( ( (df['records_retention_policy_v2'].isnull()) & (df['records_retention_policy_efile'].notnull()) ), df['records_retention_policy_efile'], df['records_retention_policy_v2']) print len(df[df['records_retention_policy_v2'].notnull()]) # In[1466]: print len(df[df['whistleblower_policy'].notnull()]) df['whistleblower_policy'] = np.where( ( (df['whistleblower_policy'].isnull()) & (df['whistleblower_policy_efile'].notnull()) ), df['whistleblower_policy_efile'], df['whistleblower_policy']) print len(df[df['whistleblower_policy'].notnull()]) # In[1539]: print len(df[df['whistleblower_policy_v2'].notnull()]) df['whistleblower_policy_v2'] = np.where( ( (df['whistleblower_policy_v2'].isnull()) & (df['whistleblower_policy_efile'].notnull()) ), df['whistleblower_policy_efile'], df['whistleblower_policy_v2']) print len(df[df['whistleblower_policy_v2'].notnull()]) # In[1467]: print len(df[df['SOX_policies'].notnull()]) df['SOX_policies'] = np.where( ( (df['SOX_policies'].isnull()) & (df['SOX_policies_efile'].notnull()) ), df['SOX_policies_efile'], df['SOX_policies']) print len(df[df['SOX_policies'].notnull()]) # In[1468]: print len(df[df['SOX_policies_all_binary'].notnull()]) df['SOX_policies_all_binary'] = np.where( ( (df['SOX_policies_all_binary'].isnull()) & (df['SOX_policies_all_binary_efile'].notnull()) ), df['SOX_policies_all_binary_efile'], df['SOX_policies_all_binary']) print len(df[df['SOX_policies_all_binary'].notnull()]) # In[1469]: print len(df[df['SOX_policies_binary'].notnull()]) df['SOX_policies_binary'] = np.where( ( (df['SOX_policies_binary'].isnull()) & (df['SOX_policies_binary_efile'].notnull()) ), df['SOX_policies_binary_efile'], df['SOX_policies_binary']) print len(df[df['SOX_policies_binary'].notnull()]) # In[1470]: print len(df[df['tot_rev'].notnull()]) df['tot_rev'] = np.where( ( (df['tot_rev'].isnull()) & (df['tot_rev_efile'].notnull()) ), df['tot_rev_efile'], df['tot_rev']) print len(df[df['tot_rev'].notnull()]) # In[1471]: print len(df[df['total_revenue_logged'].notnull()]) df['total_revenue_logged'] = np.where( ( (df['total_revenue_logged'].isnull()) & (df['total_revenue_logged_efile'].notnull()) ), df['total_revenue_logged_efile'], df['total_revenue_logged']) print 
len(df[df['total_revenue_logged'].notnull()]) # In[1472]: print len(df[df['program_expenses'].notnull()]) df['program_expenses'] = np.where( ( (df['program_expenses'].isnull()) & (df['program_expenses_efile'].notnull()) ), df['program_expenses_efile'], df['program_expenses']) print len(df[df['program_expenses'].notnull()]) # In[1473]: print len(df[df['total_expenses'].notnull()]) df['total_expenses'] = np.where( ( (df['total_expenses'].isnull()) & (df['total_expenses_efile'].notnull()) ), df['total_expenses_efile'], df['total_expenses']) print len(df[df['total_expenses'].notnull()]) # In[1474]: print len(df[df['program_efficiency'].notnull()]) df['program_efficiency'] = np.where( ( (df['program_efficiency'].isnull()) & (df['program_efficiency_efile'].notnull()) ), df['program_efficiency_efile'], df['program_efficiency']) print len(df[df['program_efficiency'].notnull()]) # In[1475]: print len(df[df['complexity'].notnull()]) df['complexity'] = np.where( ( (df['complexity'].isnull()) & (df['complexity_efile'].notnull()) ), df['complexity_efile'], df['complexity']) print len(df[df['complexity'].notnull()]) # In[1476]: df[cols].describe().T # In[1477]: [x for x in list(df) if '_merge' in x] # In[1495]: # ##### Sort DF # In[1488]: # In[1494]: df[['org_id', 'EIN', 'FYE', '2011_data', '2016_data', 'SOX_policies', 'total_revenue_logged', 'tot_rev', '_merge_v4', '_merge_47', '_merge_efile']][40:80] # ##### Save DF # In[1487]: #df = pd.read_pickle('Merged CN dataset with Age, State, Category, Total Revenues, Efficiency, Complexity, SOX, Donor Advisory (with added 990 data).pkl') print len(df) df.to_pickle('Merged CN dataset with Age, State, Category, Total Revenues, Efficiency, Complexity, SOX, Donor Advisory (with added 990 data).pkl') # ## Dataset checks for Test 1 and Test 2 # In[ ]: df[cols] # In[1498]: len(df[df['2011_data']==1]) # In[1504]: df[df['2011_data']==1][cols][:5] # In[1502]: df[df['2011_data']==1]['donor_advisory_2016'].value_counts() # In[1503]: df[df['2011_data']==1]['donor_advisory_2011_to_2016'].value_counts() # In[1500]: 5439-4863 # In[1497]: cols = DVs + indicators + IVs + controls + SOI_check df[df['2011_data']==1][cols].describe().T # ##### Fix Age based on Guidestar Values and website searches # In[1505]: df[(df['2011_data']==1) & (df['age'].isnull())][cols] # In[1509]: print len(df[df['age'].notnull()]) df['age'] = np.where(df['org_id']=='6108', 22, df['age']) print len(df[df['age'].notnull()]) # In[1516]: print len(df[df['age'].notnull()]) df['age'] = np.where(df['org_id']=='6951', 37, df['age']) print len(df[df['age'].notnull()]) # In[1517]: print len(df[df['age'].notnull()]) df['age'] = np.where(df['org_id']=='7972', 45, df['age']) print len(df[df['age'].notnull()]) # In[1515]: df[df['EIN']=='520941367'][['FYE', 'org_id', 'age', 'name', 'SOX_policies', 'donor_advisory']] # In[1524]: cols = DVs + indicators + IVs + controls + SOI_check + ['complexity_2011'] df[df['2011_data']==1][cols].describe().T # ## Save 2011 Dataset # In[18]: DVs = ['donor_advisory', 'donor_advisory_2016', 'donor_advisory_2011_to_2016'] indicators = ['org_id', 'EIN', 'FYE', 'Form 990 FYE', 'ratings_system', '2011_data', '2016_data'] IVs = ['conflict_of_interest_policy_v2', 'records_retention_policy_v2', 'whistleblower_policy_v2', 'SOX_policies', 'SOX_policies_binary', 'SOX_policies_all_binary'] controls = ['program_efficiency', 'complexity', 'complexity_2011', 'age', 'total_revenue_logged', 'category', 'state'] fixed_effects = ['category_Animals', 'category_Arts, Culture, Humanities', 
'category_Community Development', 'category_Education', 'category_Environment', 'category_Health', 'category_Human Services', 'category_Human and Civil Rights', 'category_International', 'category_Religion', 'category_Research and Public Policy'] SOI_check = ['tot_rev'] merge_cols = ['_merge_v1', '_merge_v2', '_merge_v3', '_merge_v4', '_merge_47', '_merge_efile'] logit_cols = DVs + indicators + IVs + controls + SOI_check + fixed_effects print logit_cols # In[1526]: print len(df[df['2011_data']==1][logit_cols]) df_2011 = df[df['2011_data']==1][logit_cols] print len(df_2011.columns) print len(df_2011) df_2011[:2] # In[1522]: print df_2011['donor_advisory_2016'].value_counts() print df_2011['donor_advisory_2011_to_2016'].value_counts() # In[1527]: df_2011.to_pickle('Tests 1-2 data.pkl') df_2011.to_excel('Tests 1-2 data.xls') # ### Fix *complexity* values for obs with 2016 donor advisory # All are zeros. That is not correct. # In[1614]: print len(df[(df['2016_data']==1) & (df['donor_advisory']==1)]) print df[(df['2016_data']==1) & (df['donor_advisory']==1)]['complexity'].value_counts() df[(df['2016_data']==1) & (df['donor_advisory']==1)][logit_cols][:2] # In[1615]: df[(df['2016_data']==1) & (df['donor_advisory']==1)][logit_cols].describe().T # In[1620]: print len(df) print df['complexity'].value_counts().sum() # In[1621]: print len(df[df['complexity'].isnull()]) df['complexity'] = np.where( ( (df['2016_data']==1) & (df['donor_advisory']==1)), np.nan, df['complexity']) print len(df[df['complexity'].isnull()]) # # Check what the 2016 donor advisory orgs are missing # In[1622]: print len(df[(df['donor_advisory_2016']==1) & (df['2016_data']==1)]) df[(df['donor_advisory_2016']==1) & (df['2016_data']==1)][logit_cols][:2] # In[99]: advisory_2016 = df[(df['donor_advisory_2016']==1) & (df['2016_data']==1)]['EIN'].tolist() print len(advisory_2016) print len(set(advisory_2016)) advisory_2016[:5] # In[1624]: df[df['EIN'].isin(advisory_2016)][logit_cols].to_excel('2016 advisory orgs.xls') # In[100]: len(df[df['EIN'].isin(advisory_2016)]) # In[101]: pd.crosstab(df[df['EIN'].isin(advisory_2016)]['2016_data'], df[df['EIN'].isin(advisory_2016)]['latest_entry']) #
Here's the normal way I'd do a groupby -- but it won't work for getting 'first' or 'last' entries. (Illustrative code from a different project on firm tweets.)

# In[ ]:

def f(x):
    return pd.Series(dict(Number_of_Public_Reply_Messages = x['reply_message'].sum(),
                          Number_of_RTs = x['retweeted_status_dummy'].sum(),
                          Number_of_tweets = x['content'].count(),
                          Avg_number_lists = x['from_user_listed_count'].mean(),
                          #rts = x['retweeted_status_dummy'].value_counts().max(), #TO GET MAXIMUM VALUE --> OTHERWISE IT WILL GENERATE VARIABLE WITH LIST OF VALUES '[1429, 450]'
                          ))

# In[ ]:

firm_day_count = df_2014.groupby([df_2014.index.date,'ticker']).apply(f)

#
We could also do something like this (pseudo code).

# In[ ]:

g = df.groupby('key')
g['D'].agg({'result1' : "sum", 'result2' : "mean"})

#
This is one way to do a groupby command and grab first values. But, if all of the aggregations are 'first', then there's a shortcut (see below).

# In[1553]:

def func_first(s, row):
    # Group the passed-in subset (not the full df) by 'row' and take the
    # first value of each variable of interest.
    a = s.groupby(row).agg({ 'SOX_policies':{'SOX_policies_first':"first"}})
    b = s.groupby(row).agg({ 'FYE':{'FYE_first':"first"}})
    c = s.groupby(row).agg({ 'EIN':{'EIN_first':"first"}})
    d = s.groupby(row).agg({ 'org_id':{'org_id_first':"first"}})
    s1 = pd.concat([a,b,c,d], axis=1)
    s1.columns = s1.columns.droplevel()
    return s1

# In[1575]:

first_data = func_first(df[df['EIN'].isin(advisory_2016)], df[df['EIN'].isin(advisory_2016)]['EIN'])
print len(first_data[first_data['SOX_policies_first'].notnull()])
print len(first_data[first_data['SOX_policies_first'].isnull()])
first_data[:10]

#
This command doesn't work for our purpose: .nth(0) returns each group's first row as-is (NaNs included), whereas we want the first non-missing value of each column -- which is what .agg('first') gives us below.

# In[1574]:

print len(df[df['EIN'].isin(advisory_2016)][logit_cols].groupby('EIN').nth(0))
print df[df['EIN'].isin(advisory_2016)][logit_cols].groupby('EIN').nth(0)['SOX_policies'].value_counts()
df[df['EIN'].isin(advisory_2016)][logit_cols].groupby('EIN').nth(0)[:5]

#
This version works and is a better shortcut. # In[ ]: #df = pd.read_pickle('Merged CN dataset with Age, State, Category, Total Revenues, Efficiency, Complexity, SOX, Donor Advisory (with added 990 data).pkl') # In[105]: ''' DVs = ['donor_advisory', 'donor_advisory_2016', 'donor_advisory_2011_to_2016'] indicators = ['org_id', 'EIN', 'FYE', 'Form 990 FYE', 'ratings_system', '2011_data', '2016_data'] IVs = ['conflict_of_interest_policy_v2', 'records_retention_policy_v2', 'whistleblower_policy_v2', 'SOX_policies', 'SOX_policies_binary', 'SOX_policies_all_binary'] controls = ['program_efficiency', 'complexity', 'complexity_2011', 'age', 'total_revenue_logged', 'category', 'state'] fixed_effects = ['category_Animals', 'category_Arts, Culture, Humanities', 'category_Community Development', 'category_Education', 'category_Environment', 'category_Health', 'category_Human Services', 'category_Human and Civil Rights', 'category_International', 'category_Religion', 'category_Research and Public Policy'] SOI_check = ['tot_rev'] extra = ['total_revenue'] merge_cols = ['_merge_v1', '_merge_v2', '_merge_v3', '_merge_v4', '_merge_47', '_merge_efile'] logit_cols = DVs + indicators + IVs + controls + SOI_check + extra #+ fixed_effects print logit_cols ''' # In[106]: #df[logit_cols][:1] # In[107]: print len(df[df['EIN'].isin(advisory_2016)][logit_cols].groupby('EIN').agg('first')) print df[df['EIN'].isin(advisory_2016)][logit_cols].groupby('EIN').agg('first')['SOX_policies'].value_counts() print 77+31+30+19 df[df['EIN'].isin(advisory_2016)][logit_cols].groupby('EIN').agg('first')[2:10] # In[108]: first_data_2016_advisories = df[df['EIN'].isin(advisory_2016)][logit_cols].groupby('EIN').agg('first') print len(first_data_2016_advisories[first_data_2016_advisories['SOX_policies'].notnull()]) print len(first_data_2016_advisories[first_data_2016_advisories['SOX_policies'].isnull()]) first_data_2016_advisories[:5] # In[109]: first_data_2016_advisories = first_data_2016_advisories.reset_index() # In[110]: print first_data_2016_advisories.columns.tolist() # ### Export 2016 data columns for Test 4 # In[111]: print len(df[df['2016_data']==1][logit_cols]) df_2016 = df[df['2016_data']==1][logit_cols] print len(df_2016.columns) print len(df_2016) df_2016[:2] # In[112]: print df_2016['donor_advisory'].value_counts(), '\n' print df_2016['donor_advisory_2016'].value_counts(), '\n' # In[113]: print df_2016.columns.tolist() # In[114]: print first_data_2016_advisories.columns.tolist() # In[115]: first_data_2016_advisories.describe().T # In[116]: print set(first_data_2016_advisories.columns.tolist()) - set(df_2016.columns.tolist()) print set(df_2016.columns.tolist()) - set(first_data_2016_advisories.columns.tolist()) # In[117]: print len(df_2016[~df_2016['EIN'].isin(advisory_2016)]) print len(df_2016[df_2016['EIN'].isin(advisory_2016)]) print len(df_2016[df_2016['EIN'].isin(advisory_2016)]) + len(df_2016[~df_2016['EIN'].isin(advisory_2016)]) df_2016_mod = df_2016[~df_2016['EIN'].isin(advisory_2016)] print len(df_2016_mod) df_2016_mod[:2] # In[118]: print len(df_2016_mod.append(first_data_2016_advisories)) print 8238-7983 print len(df_2016_mod) print len(df_2016_mod.columns) df_2016_mod = df_2016_mod.append(first_data_2016_advisories) print len(df_2016_mod) print len(df_2016_mod.columns) # In[119]: print len(df_2016_mod[df_2016_mod['EIN'].isin(advisory_2016)]) df_2016_mod[df_2016_mod['EIN'].isin(advisory_2016)].to_excel('df_2016_mod_partial.xls') # In[120]: df_2016_mod.describe().T # # SIDEBAR -- CREATING UNLOGGED REVENUES 
VARIABLE FOR SUMMARY STATS TABLE (I HAD TO ADD IN UNLOGGED REVENUE COLUMNS TO *logit_cols*)

# In[122]:

df_2016_mod['revitup'] = np.nan
print len(df_2016_mod[df_2016_mod['revitup'].notnull()])
df_2016_mod['revitup'] = np.where( ( (df_2016_mod['revitup'].isnull()) & (df_2016_mod['total_revenue'].notnull()) ),
                                   df_2016_mod['total_revenue'], df_2016_mod['revitup'])
print len(df_2016_mod[df_2016_mod['revitup'].notnull()])
df_2016_mod['revitup'] = np.where( ( (df_2016_mod['revitup'].isnull()) & (df_2016_mod['tot_rev'].notnull()) ),
                                   df_2016_mod['tot_rev'], df_2016_mod['revitup'])
print len(df_2016_mod[df_2016_mod['revitup'].notnull()])

# In[123]:

df_2016_mod.describe().T

# In[125]:

pd.set_option('display.float_format', lambda x: '%.3f' % x)

# In[126]:

df_2016_mod['revitup'].describe().T

# # BACK TO REGULARLY SCHEDULED PROGRAMMING

# In[1653]:

df_2016_mod[df_2016_mod['EIN'].isin(advisory_2016)].describe().T

# ##### Save DFs

# In[1654]:

df.to_pickle('df.pkl')
df_2016.to_pickle('2016 - Test 4 data.pkl')
df_2016.to_excel('2016 - Test 4 data.xls')
df_2016_mod.to_pickle('Test 4 data.pkl')
df_2016_mod.to_excel('Test 4 data.xls')

# # Test 5 Prep

# In[8]:

df = pd.read_pickle('df.pkl')
print "Number of columns:", len(df.columns)
print "Number of observations:", len(df)
df.head(1)

# In[11]:

print len(df[df['2011_data']==1]['EIN'].tolist())
orgs_2011 = list(set(df[df['2011_data']==1]['EIN'].tolist()))
print len(orgs_2011)
print len(set(orgs_2011))

# In[16]:

print len(df[(df['EIN'].isin(orgs_2011)) & (df['donor_advisory']==1) & (df['2016_data']==1)])
advisory_orgs_2011 = df[(df['EIN'].isin(orgs_2011)) & (df['donor_advisory']==1) & (df['2016_data']==1)]['EIN'].tolist()
print len(advisory_orgs_2011)
print len(set(advisory_orgs_2011))
advisory_orgs_2011[:5]

# ##### Sort DF

# In[23]:

#df = df.sort_values(by=['EIN', 'latest_entry', 'FYE', 'ratings_system'], ascending=[1, 0, 0, 0])

# In[45]:

#df['year'] = df['FYE'].str[2:]
#print df['year'][:3], '\n'
#print df['year'].value_counts()

# In[44]:

#df['year'] = np.where(df['year']=='rrent', 9999, df['year'])
#print df['year'].value_counts()

# In[41]:

df = df.sort_values(by=['EIN', '2016_data', 'FYE', 'ratings_system'], ascending=[1, 0, 0, 0])

# In[42]:

#logit_cols2 = ['year'] + logit_cols

# ##### Create list of 2016 advisory orgs for the '2011' orgs -- grab 'first' values for each variable

# In[43]:

print len(df[df['EIN'].isin(advisory_orgs_2011)])
df[df['EIN'].isin(advisory_orgs_2011)][logit_cols].to_excel('2011 orgs with 2016 advisory.xls')
df[df['EIN'].isin(advisory_orgs_2011)][logit_cols][:15]

# In[47]:

print len(df[df['EIN'].isin(advisory_orgs_2011)][logit_cols].groupby('EIN').agg('first'))
print df[df['EIN'].isin(advisory_orgs_2011)][logit_cols].groupby('EIN').agg('first')['SOX_policies'].value_counts()
print 26+11+7+3
df[df['EIN'].isin(advisory_orgs_2011)][logit_cols].groupby('EIN').agg('first')[2:10]

# In[48]:

first_data_2016_advisories_2011_orgs = df[df['EIN'].isin(advisory_orgs_2011)][logit_cols].groupby('EIN').agg('first')
print len(first_data_2016_advisories_2011_orgs[first_data_2016_advisories_2011_orgs['SOX_policies'].notnull()])
print len(first_data_2016_advisories_2011_orgs[first_data_2016_advisories_2011_orgs['SOX_policies'].isnull()])
first_data_2016_advisories_2011_orgs[:5]

# In[49]:

first_data_2016_advisories_2011_orgs = first_data_2016_advisories_2011_orgs.reset_index()

# In[50]:

print first_data_2016_advisories_2011_orgs.columns.tolist()

# ### Export data and columns for Test 5 obs

# In[55]:

print len(orgs_2011)
print orgs_2011[:5]
# In[51]:

print len(df[df['EIN'].isin(orgs_2011)])
df_2011_orgs = df[df['EIN'].isin(orgs_2011)][logit_cols]
print len(df_2011_orgs.columns)
print len(df_2011_orgs)
df_2011_orgs[:2]

# In[56]:

print len(set(df_2011_orgs['EIN'].tolist()))

# ##### NOTE: There are 94 advisories in 2016 because the 47 orgs are counted twice -- once with FYE 'current' and once with the actual FYE as coded by Dan.

# In[62]:

print df_2011_orgs['donor_advisory'].value_counts(), '\n'
print df_2011_orgs['donor_advisory_2016'].value_counts(), '\n'

# In[64]:

print len(first_data_2016_advisories_2011_orgs)
first_data_2016_advisories_2011_orgs['donor_advisory_2016'].value_counts()

# In[60]:

print first_data_2016_advisories_2011_orgs.columns.tolist()

# In[65]:

first_data_2016_advisories_2011_orgs.describe().T

# In[66]:

print set(first_data_2016_advisories_2011_orgs.columns.tolist()) - set(df_2011_orgs.columns.tolist())
print set(df_2011_orgs.columns.tolist()) - set(first_data_2016_advisories_2011_orgs.columns.tolist())

# ##### Get rid of *latest_entry* for each 2016 donor advisory org -- so, we're getting rid of 47 rows

# In[78]:

print len(df_2011_orgs[~df_2011_orgs['EIN'].isin(advisory_orgs_2011)])
print len(df_2011_orgs[df_2011_orgs['EIN'].isin(advisory_orgs_2011)])
print len(df_2011_orgs[~(df_2011_orgs['2016_data']==1)])
print len(df_2011_orgs[df_2011_orgs['2016_data']==1])
print 64993+4857
print len(df_2011_orgs)
print len( df_2011_orgs[(df_2011_orgs['2016_data']==1) & (df_2011_orgs['EIN'].isin(advisory_orgs_2011))])
print len( df_2011_orgs[~((df_2011_orgs['2016_data']==1) & (df_2011_orgs['EIN'].isin(advisory_orgs_2011)))])
df_2011_orgs_mod = df_2011_orgs[~((df_2011_orgs['2016_data']==1) & (df_2011_orgs['EIN'].isin(advisory_orgs_2011)))]
print len(df_2011_orgs_mod)
df_2011_orgs_mod[:2]

# In[80]:

print len(df_2011_orgs_mod.append(first_data_2016_advisories_2011_orgs))
print 8238-7983
print len(df_2011_orgs_mod)
print len(df_2011_orgs_mod.columns)
df_2011_orgs_mod = df_2011_orgs_mod.append(first_data_2016_advisories_2011_orgs)
print len(df_2011_orgs_mod)
print len(df_2011_orgs_mod.columns)

# In[81]:

print len(df_2011_orgs_mod[df_2011_orgs_mod['EIN'].isin(advisory_orgs_2011)])
df_2011_orgs_mod[df_2011_orgs_mod['EIN'].isin(advisory_orgs_2011)].to_excel('df_2011_orgs_mod_partial.xls')

# In[82]:

df_2011_orgs_mod.describe().T

# In[83]:

df_2011_orgs_mod[df_2011_orgs_mod['EIN'].isin(advisory_orgs_2011)].describe().T

# ##### Save DF

# In[85]:

df.to_pickle('df.pkl')
first_data_2016_advisories_2011_orgs.to_pickle('first_data_2016_advisories_2011_orgs.pkl')
df_2011_orgs_mod.to_pickle('Test 4 data.pkl')
df_2011_orgs_mod.to_excel('Test 4 data.xlsx')

# ##### Inspect DF

# In[91]:

df_2011_orgs_mod = df_2011_orgs_mod.sort_values(by=['EIN', '2016_data', 'FYE', 'ratings_system'], ascending=[1, 0, 0, 0])

# In[98]:

print len(df_2011_orgs_mod.columns)
print len(df_2011_orgs_mod)
df_2011_orgs_mod[:25]

# In[93]:

df_2011_orgs_mod['donor_advisory_2016'].value_counts()

# In[94]:

print df_2011_orgs_mod.columns.tolist()

# In[95]:

df_2011_orgs_mod[df_2011_orgs_mod['EIN'].isin(advisory_orgs_2011)].to_excel('data for 47 orgs.xls')

# #### Create Change Variables

# In[ ]:

#df_2011_orgs_mod['NEW_SOX'] = np.where()  # placeholder -- the change variables are built below

# In[102]:

print len(df_2011_orgs_mod[df_2011_orgs_mod['2011_data']==1])
print len(df_2011_orgs_mod[(df_2011_orgs_mod['2011_data']==1) & (df_2011_orgs_mod['SOX_policies'].notnull())])
print len(df_2011_orgs_mod[df_2011_orgs_mod['2016_data']==1])
print len(df_2011_orgs_mod[(df_2011_orgs_mod['2016_data']==1) & (df_2011_orgs_mod['SOX_policies'].notnull())])

# In[104]:
df_2011_orgs_mod[df_2011_orgs_mod['2011_data']==1].describe().T # In[105]: df_2011_orgs_mod[df_2011_orgs_mod['2016_data']==1].describe().T # ##### Delete all rows except the *2011_data* and *2016_data* rows # In[113]: df_2011_orgs_mod[:1] # In[115]: #df_2011_orgs_mod['2011_data'] = df_2011_orgs_mod['2011_data'].astype('int') #df_2011_orgs_mod['2016_data'] = df_2011_orgs_mod['2016_data'].astype('int') # In[133]: print len(df_2011_orgs_mod) print len(df_2011_orgs_mod[(df_2011_orgs_mod['2011_data']==1) | (df_2011_orgs_mod['2016_data']==1)]) print len(df_2011_orgs_mod[df_2011_orgs_mod['2011_data']==1]) print len(df_2011_orgs_mod[df_2011_orgs_mod['2016_data']==1]) print len(df_2011_orgs_mod[(df_2011_orgs_mod['2011_data']==1) & (df_2011_orgs_mod['2016_data']==1)]) #
Save version with all rows # In[174]: #df_2011_orgs_mod = pd.read_pickle('df_2011_orgs_mod_v1.pkl') df_2011_orgs_mod.to_pickle('df_2011_orgs_mod_v1.pkl') # In[175]: df_2011_orgs_mod = df_2011_orgs_mod[(df_2011_orgs_mod['2011_data']==1) | (df_2011_orgs_mod['2016_data']==1)] print len(df_2011_orgs_mod) # In[176]: print df_2011_orgs_mod['2011_data'].dtype print df_2011_orgs_mod['2011_data'].value_counts(), '\n' print df_2011_orgs_mod['2016_data'].dtype print df_2011_orgs_mod['2016_data'].value_counts() # ##### Create new variables # # merged_firm_day['Number of Ticker Mentions [t-1]'] = merged_firm_day['Number of Ticker Mentions'].unstack().shift(1).stack() #
Get rid of duplicates first. # In[178]: df_2011_orgs_mod[df_2011_orgs_mod.duplicated(['EIN', 'FYE'])] # In[179]: df_2011_orgs_mod[df_2011_orgs_mod['EIN'].isin(['131624041','316027287','330068583','341787585', '510082499','521219783'])]#[:6] # In[183]: df_2011_orgs_mod.index # In[188]: print len(df_2011_orgs_mod) df_2011_orgs_mod = df_2011_orgs_mod.drop(37354) df_2011_orgs_mod = df_2011_orgs_mod.drop(22391) df_2011_orgs_mod = df_2011_orgs_mod.drop(22839) df_2011_orgs_mod = df_2011_orgs_mod.drop(74544) df_2011_orgs_mod = df_2011_orgs_mod.drop(22575) df_2011_orgs_mod = df_2011_orgs_mod.drop(28224) #df_2011_orgs_mod = df_2011_orgs_mod.drop([37354,22391,22839,74544,22575,28224]) #[756, 3171, 3202, 3381, 4726, 5045]) print len(df_2011_orgs_mod) # In[189]: df_2011_orgs_mod[df_2011_orgs_mod.duplicated(['EIN', 'FYE'])] # In[238]: #df_2011_orgs_mod.set_index(['FYE', 'EIN'], inplace=True) # In[239]: #df_2011_orgs_mod = df_2011_orgs_mod.set_index(['EIN']) # In[230]: df_2011_orgs_mod = df_2011_orgs_mod.reset_index() # In[224]: #df_2011_orgs_mod = df_2011_orgs_mod.set_index(['EIN', '2016_data']) # ##### This is the right way to set the index to get it to work! # In[264]: df_2011_orgs_mod = df_2011_orgs_mod.set_index(['2016_data', 'EIN']) # In[240]: df_2011_orgs_mod[:2] # In[241]: df_2011_orgs_mod.index[0] # ##### Create lagged variables # In[259]: df_2011_orgs_mod['SOX_policies [t-1]'] = df_2011_orgs_mod['SOX_policies'].unstack().shift(1).stack() df_2011_orgs_mod['SOX_policies_binary [t-1]'] = df_2011_orgs_mod['SOX_policies_binary'].unstack().shift(1).stack() df_2011_orgs_mod['SOX_policies_all_binary [t-1]'] = df_2011_orgs_mod['SOX_policies_all_binary'].unstack().shift(1).stack() df_2011_orgs_mod['whistleblower_policy_v2 [t-1]'] = df_2011_orgs_mod['whistleblower_policy_v2'].unstack().shift(1).stack() df_2011_orgs_mod['records_retention_policy_v2 [t-1]'] = df_2011_orgs_mod['records_retention_policy_v2'].unstack().shift(1).stack() df_2011_orgs_mod['conflict_of_interest_policy_v2 [t-1]'] = df_2011_orgs_mod['conflict_of_interest_policy_v2'].unstack().shift(1).stack() # In[260]: df_2011_orgs_mod[['SOX_policies', 'SOX_policies [t-1]']].tail(4) # In[265]: df_2011_orgs_mod['complexity_2011 [t-1]'] = df_2011_orgs_mod['complexity_2011'].unstack().shift(1).stack() df_2011_orgs_mod['total_revenue_logged [t-1]'] = df_2011_orgs_mod['total_revenue_logged'].unstack().shift(1).stack() df_2011_orgs_mod['program_efficiency [t-1]'] = df_2011_orgs_mod['program_efficiency'].unstack().shift(1).stack() # In[266]: df_2011_orgs_mod[:2] # In[267]: df_2011_orgs_mod[df_2011_orgs_mod['2011_data']==1]['SOX_policies'].describe() # In[268]: df_2011_orgs_mod['SOX_policies'].describe() # ##### 25 of the 4,857 orgs are missing 2011 SOX data and Total Revenues, etc. # In[256]: 9689-4832 # In[269]: df_2011_orgs_mod.describe().T # In[250]: print len(df_2011_orgs_mod[(df_2011_orgs_mod['2011_data']==1) & (df_2011_orgs_mod['donor_advisory']==1)]) #['SOX_policies'].describe() # In[253]: df_2011_orgs_mod['donor_advisory_2016'].value_counts() # In[270]: df_2011_orgs_mod = df_2011_orgs_mod.reset_index() #
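 To make the lag mechanics above concrete, here's the same unstack/shift/stack pattern on a toy two-period panel (a sketch with made-up EINs, using pandas as pd like the rest of the notebook): with a (period, EIN) index, unstack() puts periods in rows and EINs in columns, shift(1) pushes each EIN's period-0 value down to period 1, and stack() restores the long shape (period-0 lags come back as missing).

# In[ ]:

toy = pd.DataFrame({'period': [0, 0, 1, 1],
                    'EIN': ['111', '222', '111', '222'],
                    'SOX': [0, 2, 3, 2]}).set_index(['period', 'EIN'])
toy['SOX [t-1]'] = toy['SOX'].unstack().shift(1).stack()
print toy

#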
Save DF # In[275]: #df_2011_orgs_mod.to_pickle('df_2011_orgs_mod.pkl') df_2011_orgs_mod = pd.read_pickle('df_2011_orgs_mod.pkl') # ##### Re-arrange and rename columns # In[276]: print df_2011_orgs_mod.columns.tolist() # In[281]: #'index', 'tot_rev', df_2011_orgs_mod = df_2011_orgs_mod[['EIN', 'org_id', 'FYE', '2011_data', '2016_data', 'Form 990 FYE', 'ratings_system', 'donor_advisory', 'donor_advisory_2011_to_2016', 'donor_advisory_2016', 'SOX_policies', 'SOX_policies [t-1]', 'SOX_policies_binary', 'SOX_policies_binary [t-1]', 'SOX_policies_all_binary', 'SOX_policies_all_binary [t-1]', 'conflict_of_interest_policy_v2', 'conflict_of_interest_policy_v2 [t-1]', 'whistleblower_policy_v2', 'whistleblower_policy_v2 [t-1]', 'records_retention_policy_v2', 'records_retention_policy_v2 [t-1]', 'program_efficiency', 'program_efficiency [t-1]', 'total_revenue_logged', 'total_revenue_logged [t-1]', 'complexity', 'complexity_2011', 'complexity_2011 [t-1]', 'age', 'state', 'category', 'category_Animals', 'category_Arts, Culture, Humanities', 'category_Community Development', 'category_Education', 'category_Environment', 'category_Health', 'category_Human Services', 'category_Human and Civil Rights', 'category_International', 'category_Religion', 'category_Research and Public Policy' ]] df_2011_orgs_mod[:4] # In[280]: df_2011_orgs_mod[df_2011_orgs_mod['2016_data']==1].describe().T # In[279]: df_2011_orgs_mod[df_2011_orgs_mod['2011_data']==1].describe().T #
Save DF # In[282]: df_2011_orgs_mod.to_pickle('df_2011_orgs_mod (2011 and 2016 rows).pkl') # ##### Limit data to 2016 # In[284]: print len(df_2011_orgs_mod[df_2011_orgs_mod['2016_data']==1]) df_2011_orgs_mod = df_2011_orgs_mod[df_2011_orgs_mod['2016_data']==1] print len(df_2011_orgs_mod) # In[285]: df_2011_orgs_mod.describe().T #
Some of the orgs are missing *only* complexity_2011 values. # In[287]: print len(df_2011_orgs_mod[df_2011_orgs_mod['complexity_2011 [t-1]'].isnull()]) df_2011_orgs_mod[df_2011_orgs_mod['complexity_2011 [t-1]'].isnull()][:5] # In[291]: print len(df_2011_orgs_mod[df_2011_orgs_mod['complexity_2011 [t-1]'].isnull()]) df_2011_orgs_mod['complexity_2011 [t-1]'] = np.where(df_2011_orgs_mod['complexity_2011 [t-1]'].isnull(), df_2011_orgs_mod['complexity'], df_2011_orgs_mod['complexity_2011 [t-1]']) print len(df_2011_orgs_mod[df_2011_orgs_mod['complexity_2011 [t-1]'].isnull()]) # In[292]: df_2011_orgs_mod.describe().T # In[293]: print len(df_2011_orgs_mod[df_2011_orgs_mod['donor_advisory']==1]) print len(df_2011_orgs_mod[df_2011_orgs_mod['donor_advisory_2011_to_2016']==1]) print len(df_2011_orgs_mod[df_2011_orgs_mod['donor_advisory_2016']==1]) # In[294]: #df_2011_orgs_mod[df_2011_orgs_mod['donor_advisory']==1][:5] # In[295]: print df_2011_orgs_mod.columns.tolist() # In[296]: df_2011_orgs_mod[:1] # In[297]: #'2011_data', '2016_data', 'FYE', 'Form 990 FYE', 'ratings_system', 'donor_advisory', #'complexity', 'complexity_2011', #'program_efficiency', 'total_revenue_logged', df_2011_orgs_mod = df_2011_orgs_mod[['EIN', 'org_id', 'donor_advisory_2011_to_2016', 'donor_advisory_2016', 'SOX_policies', 'SOX_policies [t-1]', 'SOX_policies_binary', 'SOX_policies_binary [t-1]', 'SOX_policies_all_binary', 'SOX_policies_all_binary [t-1]', 'conflict_of_interest_policy_v2', 'conflict_of_interest_policy_v2 [t-1]', 'whistleblower_policy_v2', 'whistleblower_policy_v2 [t-1]', 'records_retention_policy_v2', 'records_retention_policy_v2 [t-1]', 'program_efficiency [t-1]', 'total_revenue_logged [t-1]', 'complexity_2011 [t-1]', 'age', 'state', 'category', 'category_Animals', 'category_Arts, Culture, Humanities', 'category_Community Development', 'category_Education', 'category_Environment', 'category_Health', 'category_Human Services', 'category_Human and Civil Rights', 'category_International', 'category_Religion', 'category_Research and Public Policy' ]] df_2011_orgs_mod[:4] # In[298]: df_2011_orgs_mod.describe().T # ##### Save DF # In[300]: df_2011_orgs_mod.to_pickle('df_2011_orgs_mod_v3 (single year combined).pkl') # ### Generate Change Variables # In[301]: print df_2011_orgs_mod.columns.tolist() # ##### Number of New SOX Policies Added # In[303]: df_2011_orgs_mod['number_of_SOX_policies_added'] = df_2011_orgs_mod['SOX_policies'] - \ df_2011_orgs_mod['SOX_policies [t-1]'] df_2011_orgs_mod['number_of_SOX_policies_added'].value_counts() # In[304]: df_2011_orgs_mod[df_2011_orgs_mod['number_of_SOX_policies_added']<0] # In[311]: df[df['EIN']=='116101487'][['EIN', 'org_id', 'FYE', '2011_data', '2016_data', 'SOX_policies', 'whistleblower_policy', 'records_retention_policy', 'conflict_of_interest_policy']] # In[312]: print df_2011_orgs_mod['number_of_SOX_policies_added'].value_counts(), '\n' df_2011_orgs_mod['number_of_SOX_policies_added'] = np.where(df_2011_orgs_mod['number_of_SOX_policies_added']<0, 0, df_2011_orgs_mod['number_of_SOX_policies_added']) print df_2011_orgs_mod['number_of_SOX_policies_added'].value_counts() # In[343]: df_2011_orgs_mod['any_SOX_policies_added'] = np.nan df_2011_orgs_mod['any_SOX_policies_added'] = np.where(df_2011_orgs_mod['number_of_SOX_policies_added']>0, 1, 0) print df_2011_orgs_mod['any_SOX_policies_added'].value_counts(), '\n' pd.crosstab(df_2011_orgs_mod['any_SOX_policies_added'], df_2011_orgs_mod['number_of_SOX_policies_added']) # In[314]: df_2011_orgs_mod['always_no_SOX'] = np.where( ( 
(df_2011_orgs_mod['SOX_policies_binary']==0) &
                                              (df_2011_orgs_mod['SOX_policies_binary [t-1]']==0)), 1, 0)
print df_2011_orgs_mod['always_no_SOX'].value_counts(), '\n'
#df_2011_orgs_mod['any_SOX_policies_added'] = np.where(df_2011_orgs_mod['any_SOX_policies_added']<0,
#                                                      0, df_2011_orgs_mod['any_SOX_policies_added'])
#print df_2011_orgs_mod['any_SOX_policies_added'].value_counts()

#
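 The three Test #5 groups -- never-SOX, always-SOX, and added-SOX (the latter two are built in the cells below) -- can also be generated in a single pass with np.select. A sketch using the same binary columns ('SOX_group' is my label, not a column used elsewhere in this notebook):

# In[ ]:

conditions = [
    (df_2011_orgs_mod['SOX_policies_binary [t-1]']==0) & (df_2011_orgs_mod['SOX_policies_binary']==0),
    (df_2011_orgs_mod['SOX_policies_binary [t-1]']==1) & (df_2011_orgs_mod['SOX_policies_binary']==1),
    (df_2011_orgs_mod['SOX_policies_binary [t-1]']==0) & (df_2011_orgs_mod['SOX_policies_binary']==1),
]
df_2011_orgs_mod['SOX_group'] = np.select(conditions, ['never', 'always', 'added'], default='other')
print df_2011_orgs_mod['SOX_group'].value_counts()

#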
There were 255 orgs with zero SOX policies in 2011; this dropped to 62 in 2016 (so, 193 orgs added SOX policies).

# In[315]:

pd.crosstab(df_2011_orgs_mod['always_no_SOX'], df_2011_orgs_mod['SOX_policies_binary'])

# In[316]:

pd.crosstab(df_2011_orgs_mod['always_no_SOX'], df_2011_orgs_mod['SOX_policies_binary [t-1]'])

# In[319]:

pd.crosstab(df_2011_orgs_mod['always_no_SOX'], df_2011_orgs_mod['donor_advisory_2016'])

# In[320]:

df_2011_orgs_mod['always_SOX'] = np.where( ( (df_2011_orgs_mod['SOX_policies_binary']==1) &
                                             (df_2011_orgs_mod['SOX_policies_binary [t-1]']==1)), 1, 0)
print df_2011_orgs_mod['always_SOX'].value_counts(), '\n'

# In[321]:

pd.crosstab(df_2011_orgs_mod['always_no_SOX'], df_2011_orgs_mod['always_SOX'])

# In[323]:

print 62+4572+223
print len(df_2011_orgs_mod)

# In[344]:

df_2011_orgs_mod['added_SOX_previously_none'] = np.where( (df_2011_orgs_mod['SOX_policies_binary']>
                                                           df_2011_orgs_mod['SOX_policies_binary [t-1]']), 1, 0)
print df_2011_orgs_mod['added_SOX_previously_none'].value_counts(), '\n'

# In[346]:

#df_2011_orgs_mod = df_2011_orgs_mod.drop('added_SOX', 1)

# In[328]:

pd.crosstab(df_2011_orgs_mod['always_no_SOX'], df_2011_orgs_mod['donor_advisory_2016'])

# In[330]:

from __future__ import division

# In[334]:

print 11/62
print 0/193
print 31/4541

# In[347]:

pd.crosstab(df_2011_orgs_mod['added_SOX_previously_none'], df_2011_orgs_mod['donor_advisory_2016'])

# In[333]:

pd.crosstab(df_2011_orgs_mod['always_SOX'], df_2011_orgs_mod['donor_advisory_2016'])

# In[352]:

df_2011_orgs_mod['from_zero_to_3_SOX'] = np.where( ((df_2011_orgs_mod['SOX_policies_binary [t-1]']==0) &
                                                    (df_2011_orgs_mod['number_of_SOX_policies_added']==3)), 1, 0)
print df_2011_orgs_mod['from_zero_to_3_SOX'].value_counts(), '\n'

# In[353]:

pd.crosstab(df_2011_orgs_mod['from_zero_to_3_SOX'], df_2011_orgs_mod['number_of_SOX_policies_added'])

# In[357]:

pd.crosstab(df_2011_orgs_mod['from_zero_to_3_SOX'], df_2011_orgs_mod['donor_advisory_2016'])

# In[358]:

pd.crosstab(df_2011_orgs_mod['added_SOX_previously_none'], df_2011_orgs_mod['donor_advisory_2016'])

# In[361]:

pd.crosstab(df_2011_orgs_mod['always_no_SOX'], df_2011_orgs_mod['donor_advisory_2016'])

# In[359]:

pd.crosstab(df_2011_orgs_mod['always_SOX'], df_2011_orgs_mod['donor_advisory_2016'])

# In[ ]:

#'SOX_policies_binary', 'SOX_policies_binary [t-1]',
#'SOX_policies_all_binary', 'SOX_policies_all_binary [t-1]',
#'conflict_of_interest_policy_v2', 'conflict_of_interest_policy_v2 [t-1]',
#'whistleblower_policy_v2', 'whistleblower_policy_v2 [t-1]',
#'records_retention_policy_v2', 'records_retention_policy_v2 [t-1]',

# In[336]:

print len(df_2011_orgs_mod)

# In[354]:

df_2011_orgs_mod.describe().T

# In[355]:

print df_2011_orgs_mod.columns.tolist()

# In[356]:

df_2011_orgs_mod.to_pickle('Test 5 data.pkl')
df_2011_orgs_mod.to_excel('Test 5 data.xls')

# In[363]:

df.to_pickle('Final Merged CN Dataset (85,401 obs).pkl')

# In[ ]:

# # Miscellaneous Code Below This

# ### NOTE: Newly added 990 rows need org_id plus age and category and state added

# In[1207]:

for index, row in df[(df['org_id'].isin(org_ids_2011)) & (df['2016_data']==1) & (df['donor_advisory']==1)][:2].iterrows():
    #url = 'http://990s.foundationcenter.org/990_pdf_archive/043/043314346/043314346_201312_990.pdf'
    EIN = row['EIN']
    if row['Form 990 FYE']!='current':
        fye = row['Form 990 FYE'].replace('_', '')
    else:
        fye = '201412'
    URL_extension = EIN + '_' + fye
    print EIN, fye
    url = 'http://990s.foundationcenter.org/990_pdf_archive/%s/%s/%s_990.pdf' % (EIN[:3], EIN, URL_extension)
    print url

# In[ ]:

# In[ ]:

# ### Save Lists of EINs to download

# In[1218]:

import json

print len(df[(df['2016_data']==1) & (df['donor_advisory']==1)])
print len(set(df[(df['2016_data']==1) & (df['donor_advisory']==1)]['EIN']))
print len(list(set(df[(df['2016_data']==1) & (df['donor_advisory']==1)]['EIN'].tolist())))
advisories_2016 = list(set(df[(df['2016_data']==1) & (df['donor_advisory']==1)]['EIN'].tolist()))
print len(advisories_2016)
f = open('2016 donor advisory EINs.json', 'w')
json.dump(advisories_2016, f)
f.close()
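# ##### Quick read-back check

# In[ ]:

# A sketch: reload the saved EIN list and confirm the count matches what we wrote out.
import json

with open('2016 donor advisory EINs.json') as f:
    eins_check = json.load(f)
print len(eins_check), len(advisories_2016)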