IGM Experts Panel: Individual Characteristics

This is a second post exploring the data that I collected from the IGM Experts Panel, which surveys a group of leading economists on a variety of policy questions. A CSV of all the data is available here, as are separate datasets of the questions and responses.

One of the interesting things that Gordon and Dahl looked at in their 2012 paper [1, PDF] was how individual characteristics of economists might influence their responses. They compiled information on each economist including the institution of study, graduation year, current university, field of specialization, gender and NBER classification. Ten economists have been added to the group since 2012, so I did my best to gather this information from their CVs and the NBER database. The final CSV of individual characteristics is available here.

Below, I join this dataset of individual characteristics with the responses data from an earlier post, and look at some of the relationships graphically. I don't have background in economics or a very good understanding of regression analysis, so I stick to plotting trends rather than claiming statistical significance.

Loading the Data

Below, I load the data and process it as described in an earlier post. The main result is that I calculate a distance_median column, which is the distance from each response and the median response.

In [8]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib

#pd.set_option('max_colwidth', 30)
pd.set_option('max_colwidth', 400)
matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)

df_responses = pd.read_csv('output_all.csv')
In [9]:
r_list = ['Strongly Disagree', 'Disagree', 'Uncertain', 'Agree', 'Strongly Agree']

def indicator(x):
	if x in r_list:
		return r_list.index(x)
		return None

df_responses['vote_num'] = df_responses['vote'].apply(indicator)
df_responses['median_num'] = df_responses['median_vote'].apply(indicator)
df_responses['vote_distance'] = abs(df_responses['median_num'] - df_responses['vote_num'])

# Construct a continuous column, incorporating confidence into vote_num
# Divide by 11 so 10 confidence of agree > 0 confidence of strongly agree
df_responses['incr_votenum'] = df_responses['vote_num'] + df_responses['confidence'] / 11.0

# Median incr_votenum for each question:
df_responses['median_incrvotenum'] = df_responses.groupby(

# Calculate distance from median for each econ vote, less biased by outliers.  
df_responses['distance_median'] = abs(df_responses['median_incrvotenum'] - \

(8402, 18)

Loading, Parsing the Individual Variables

Next, I load and do some processing of the individual variables, getting them into the correct format to join with my responses dataset. I also load in the individual characteristics for the ten new economist, and concatenate those to the original individual characteristics from Gordon and Dahl.

The result of the join is a dataset with 8402 rows and 26 columns.

In [12]:
df_indvars = pd.read_csv('individual-vars.csv')

df_indvars['name'] = df_indvars['name'].str.split(', ').map(lambda row: row[1] + ' ' + row[0])

institutions = ['Berkeley', 'Chicago', 'Harvard', 'MIT', 'Princeton', 'Stanford', 'Yale']

def inst_ind(element):
    return institutions[int(element) - 1]

df_indvars['phdfrom'] = df_indvars['phdfrom'].apply(inst_ind)
df_indvars['currentuniv'] = df_indvars['currentuniv'].apply(inst_ind)

def age_cohort(element):
    cur_year = 116
    cohort = cur_year - int(element)
    if cohort <= 15:
        return '0 - 15 Years'
    elif cohort >= 30:
        return '30+ Years'
        return '15 - 30 Years'

df_indvars['cohort'] = df_indvars['phdyear'].apply(age_cohort)

ifields = {8:'Industrial Org', 7:'Public Finance', 6:'Labor', 5:'Finance', 4:'Macro', 3:'International'}

def ifield_ind(element):
    return ifields[int(element)]

df_indvars['ifield'] = df_indvars['ifield'].apply(ifield_ind)
# Their coding error?  Alesina is listed as MAC in appendix, but has ifield of 3?
# df_indvars[]  change it?

# Load individual variables from 10 new economists
# Self coded based on personal websites, NBER information
# http://www.nber.org/programs/
df_indvarsnew = pd.read_csv('individual_vars_new.csv')

# Add cohort column
df_indvarsnew['cohort'] = df_indvarsnew['phdyear'].apply(age_cohort)

# Concat both old and new individual vars datasets
df_indvarsall = pd.concat([df_indvars, df_indvarsnew], ignore_index=True)

#df_indvarsall.to_csv('indvars_2016.csv', encoding='utf-8', index=False)


# Gordon Online Appendix and Data:
# http://econweb.ucsd.edu/~gdahl/views-among-economists-code.html
# http://econweb.ucsd.edu/~gdahl/papers/views-among-economists-online-appendix.pdf
# Notes: PhD From and Current University categories are BER=Berkeley; CHI=Chicago, Rochester; HAR=Harvard, 
#     Cambridge, LSE, Wisconsin; MIT=MIT, Oxford; PRI=Princeton; STA=Stanford; YAL=Yale.
#     Field categories are defined by primary NBER affiliation: MAC=macro (EFG, ME, POL); INT=international (IFM, ITI);
#         FIN=finance (AP, CF); LAB=labor (LS, ED, AG, DAE, DEV); PF=public finance (PF, EEE); 
#         IO=industrial organization (IO, LE). Three panel members are not in the NBER; 
#         Ray Fair and James Stock are assigned to MAC, Eric Maskin is assigned to FIN. 
#         Female is an indicator equal to 1 for women. Wash is an indicator for experience serving in Washington.
name phdyear phdfrom currentuniv ifield gender washington nber cohort
0 Daron Acemoglu 92 Harvard MIT Labor 1 0 LS 15 - 30 Years
1 Alberto Alesina 86 Harvard Harvard International 1 0 POL 30+ Years
2 Joseph Altonji 81 Princeton Yale Labor 1 0 LS 30+ Years
3 Alan Auerbach 78 Harvard Berkeley Public Finance 1 1 PE 30+ Years
4 David Autor 99 Harvard MIT Labor 1 0 LS 15 - 30 Years
In [13]:
# Inner join, on name column
df_all = pd.merge(df_responses, df_indvarsall, on=['name'], how='inner')

print df_all.shape
# Remove middle initials for Brunnermeier, Kaplan, to get right shape of (8402, 26)
(8402, 26)

Confidence by Education

One of the interesting things Gordon and Dahl found was that economists that were educated at MIT and the University of Chicago seemed to be more confident. I find less evidence of this in the newer data, although this is just a boxplot, not a regression analysis.

In [14]:
df_all.boxplot(column='confidence', by='phdfrom', whis=[5.0,95.0])

df_all.groupby('phdfrom').agg({'confidence':{'mean': 'mean', 'median':'median',
                                            'std': 'std', 'count':'count'}}).sort_values(
std count median mean
LSE 1.964802 93 6 6.290323
MIT 2.494196 2414 6 6.063380
Princeton 2.404093 591 7 6.000000
Illinois 2.839723 112 6 5.910714
Stanford 2.430876 959 6 5.851929
Harvard 2.384803 2082 6 5.850144
Yale 2.399019 358 5 5.410615
Chicago 2.538058 387 6 5.395349
Berkeley 2.310260 28 5 5.178571

Distance from Median by Education

Another interesting thing to look at is whether any institutions produce graduates with views further from the median view. There definitely are differences in the means and medians below, but I don't think they would rise to the level of significance because the standard deviations overlap.

It's also important to note that the vote counts by institution vary from 2414 (MIT) to 28 (Berkeley), so these responses don't represent the institution as a whole.

In [8]:
df_all.boxplot(column='distance_median', by='phdfrom', whis=[5.0,95.0])

df_all.hist(column='distance_median', by='phdfrom', bins=20, figsize=(12,12))

df_all.groupby('phdfrom').agg({'distance_median':{'mean': 'mean', 'median':'median',
                                            'std': 'std', 'count':'count'}}).sort_values(

#Only 28 votes from Berkeley
#len(df_all[(df_all['phdfrom'] == 'Berkeley') & (df_all['vote'].isin(r_list))])
std count median mean
MIT 0.559621 2414 0.545455 0.655363
Harvard 0.580161 2082 0.454545 0.638023
Chicago 0.524551 387 0.454545 0.633427
LSE 0.489293 93 0.545455 0.621701
Stanford 0.559780 959 0.409091 0.621149
Yale 0.522308 358 0.409091 0.574022
Princeton 0.501498 591 0.363636 0.556145
Illinois 0.523816 112 0.363636 0.551542
Berkeley 0.536900 28 0.272727 0.431818

Washington Insiders

Spending time in Washington doesn't seem to have much of an effect on confidence. Those that haven't spent time in Washington maybe have a little bit longer tail when it comes to controversial responses, but overall both groups are very similar.

In [18]:
df_all.boxplot(column='confidence', by='washington', whis=[5.0,95.0])

df_all.boxplot(column='distance_median', by='washington', whis=[5.0,95.0])
<matplotlib.axes._subplots.AxesSubplot at 0x107292790>

How do responses differ by field of study?

Gordon and Dahl also used the NBER classification of each economist to look at how responses changed by field of study. I do something similar below, looking at confidence and distance_median grouped by field of study. The Finance and Public Economics groups seem a little more confident. The International and Labor Economics groups seem to have a little higher distance_median, but it's unlikely any of this rises to the level of significance.

In [16]:
# http://econweb.ucsd.edu/~gdahl/papers/views-among-economists-online-appendix.pdf
# ifield is grouped nber classifications: 
#     Field categories are defined by primary NBER affiliation: MAC=macro (EFG, ME, POL); INT=international (IFM, ITI);
#         FIN=finance (AP, CF); LAB=labor (LS, ED, AG, DAE, DEV); PF=public finance (PF, EEE); 
#         IO=industrial organization (IO, LE). Three panel members are not in the NBER; 
#         Ray Fair and James Stock are assigned to MAC, Eric Maskin is assigned to FIN. 

df_all.boxplot(column='confidence', by='ifield', rot=90, whis=[5.0,95.0])

df_all.boxplot(column='distance_median', by='ifield', rot=90, whis=[5.0,95.0])
<matplotlib.axes._subplots.AxesSubplot at 0x1039ccd50>

Responses by Age Cohort

One final thing to do is look at these characteristics by age cohort, defined as years since their PhD. Confidence doesn't seem to be too different by age cohort, but it does seem like there are more outlier responses in the older groups. Perhaps the older economists have earned the right to say controversial things?

In [19]:
df_all.boxplot(column='confidence', by='cohort', whis=[5.0,95.0])

df_all.boxplot(column='distance_median', by='cohort', whis=[5.0,95.0])

df_all.hist(column='distance_median', by='cohort', bins=20)

df_all.groupby('cohort').agg({'distance_median':{'mean': 'mean', 'median':'median',
                                            'std': 'std', 'count':'count'},
                             'confidence': {'mean': 'mean', 'median':'median',
                                            'std': 'std'}}).sort_values(by=('distance_median','mean'),
distance_median confidence
std count median mean std median mean
30+ Years 0.561107 3230 0.454545 0.633971 2.413021 6 6.013622
15 - 30 Years 0.554432 3510 0.454545 0.632673 2.503507 6 5.789744
0 - 15 Years 0.524708 284 0.272727 0.523528 2.111850 6 5.795775