The aim of this notebook is to provide an end-to-end pipeline that learns an EgocentricSBM
for the UK using egocentric surveys from Understanding Society, and the UK Census data.
In this section we look at the microdata available for UK Census 2011. In particular, the England-Wales dataset which has 2.8 million samples. We wish to look at key Blau dimensions of interest, so that we can align our own survey's design along with the census. Also, using microdata gives us the full testing distribution of the population, instead of one that is factorized over the Blau dimenions.
Different kinds of microdata have been made available (w/o a special license) corresponding to different geographical granularity. We consider the dataset at the local authority level (also referred to as boroughs in London, and council areas in Scotland). The dataset is available at a grouped LA level, wherein a few LAs can be grouped such that each grouped-LA contains roughly 120,000 people.
First, we write a function to import any census microdata file.
%matplotlib inline
import numpy as np
def import_data(name=('2011', 'ew'), memory_map=True, datapath='./census/'):
import pandas as pd
datadict = {(2001, 'isar'):'UKDA-7205-tab/tab/01uklicind-20061002.tab',
(2001, 'sam'):'UKDA-7207-tab/tab/01uklicsam-20070301.tab',
(2011, 'ew'):'UKDA-7682-tab/tab/recodev12.tab',
(2011, 'n'):'UKDA-7770-tab/tab/ni_safeguarded_la.tab',
(2011, 's'):'UKDA-7835-tab/tab/safeguarded_grouped_la.tab'
}
geogdict = {(2001, 'isar'):'region',
(2001, 'sam'):'lacode',
(2011, 'ew'):'la_group',
(2011, 'n'):'LA_CODE_14',
(2011, 's'):'COUNCIL_AREA_GROUP'
}
datapath += datadict[name]
df = pd.read_csv(datapath, memory_map=memory_map, sep='\t')
df.rename(columns={geogdict[name]:'geocode'}, inplace=True)
return df
data_2011_ew = import_data((2011, 'ew'))
print(data_2011_ew.shape)
(2848149, 121)
We previously took a look at the Understanding Society data, and many of its variables of interest. In this notebook, we try to align USoc with the Census data, so that we can estimate the social access statistic for LAs. First, we import USoc data for wave 3, which has information on people's local social networks.
def import_usoc_data(category='ukhls', wave=3, survey='indresp', datapath='./usoc/'):
datadict = {'ukhls':'UKDA-6614-tab/tab/ukhls_w',
'bhps':'UKDA-6614-tab/tab/bhps_w',
'nurse':'UKDA-7251-tab/tab/',
'ip':'UKDA-6849-tab/tab/'}
from string import ascii_lowercase
prefix = {'bhps':str(wave)+'/', 'ukhls':str(wave)+'/', 'nurse':'', 'ip':''}
suffix = {'bhps':'', 'ukhls':'', 'nurse':'_ns', 'ip':'_ip'}
wave_num2sym = {'bhps': dict(zip(range(1,19), ['b'+s+'_' for s in ascii_lowercase[0:18]])),
'ukhls': dict(zip(range(1,8), [s+'_' for s in ascii_lowercase[0:7]])),
'ip': dict(zip(range(1,11), [s+'_' for s in ascii_lowercase[0:10]])),
'nurse': {2:'b',3:'c'}}
if isinstance(wave, int): wave_sym = wave_num2sym[category][wave]
else: wave_sym = wave
datapath = datapath + datadict[category] + prefix[category] + wave_sym + survey + suffix[category] + '.tab'
from pandas import read_csv
return {'cat':category, 'wav':wave, 'sym':wave_sym, 'svy':survey, 'tab':read_csv(datapath, sep='\t')}
data_wave3 = import_usoc_data(wave=3)
Next, we right functions that will extract and align the USoc and Census datasets.
def extract_usoc(data):
import numpy as np
from pandas import concat
if data['cat']!='ukhls':
print('please use a ukhls dataset')
return
var = {'sex_dv':'sex',
'netsx':'sex_nom',
'age_dv':'age',
'simage':'age_prop',
'netag':'age_nom',
'racel_dv':'ethnicity',
'simrace':'ethnicity_prop',
'netet':'ethnicity_nom',
'qfhigh_dv':'education',
'simeduc':'education_prop',
'jbstat':'employment',
'simjob':'employment_prop',
'netjb':'employment_nom',
'fimngrs_dv':'income',
'siminc':'income_prop',
'simarea':'distance_prop',
'netlv':'distance_nom',
'simfam':'family_prop',
'closenum':'num_friends'
}
var_encoding = {'sex':{1:'male',
2:'female',
(-10, -9, -8, -7, -2, -1, 0):np.nan},
'ethnicity':{1:'british-white',
(2, 3, 4):'other-white',
(5, 6, 14, 15, 16):'black',
(7, 8, 12, 13):'other-asian',
(9, 10, 11):'south-asian',
#11:'bangladeshi',
#13:'other-asian',
(17, 97):'other-ethnicity',
(-10, -9, -8, -7, -2, -1):np.nan
},
'education':{(1, 2, 3, 4, 5, 6):'level-4',
(7, 8, 9):'level-3',
(10, 11, 12, 13):'level-2',
(14, 15, 16):'level-1',
96:'level-0',
(-10, -9, -8, -7, -2, -1):np.nan
},
'employment':{(1, 2, 5, 10, 11, 97):'employed',
3:'unemployed',
4:'retired',
6:'housework',
(7, 9):'student',
(8, -10, -9, -8, -7, -2, -1):np.nan
},
'employment_nom':{(1, 2):'employed',
3:'unemployed',
4:'student',
5:'housework',
6:'retired',
(-10, -9, -8, -7, -2, -1):np.nan
},
'distance_nom':{(1, 2):'similar',
(3, 4, 5):'dissimilar',
(-10, -9, -8, -7, -2, -1):np.nan
}
}
def expand_var(var_dict, prefix='', num_nom=3):
new_dict = dict()
for key, value in var_dict.items():
if key[:3]=='net':
for i in range(1, num_nom+1): new_dict[prefix+key+'_'+str(i)] = value+'_'+str(i)
else: new_dict[prefix+key] = value
return new_dict
def expand_encoding(encoding_dict, atts=[], num_nom=3):
new_dict = dict()
for key, value in encoding_dict.items():
new_d = dict()
for k, v in value.items():
if not hasattr(k, '__iter__'): k = (k,)
for i in k: new_d[i] = v
if key[-4:]=='_nom':
for i in range(1, num_nom+1):
new_dict[key+'_'+str(i)] = new_d
else:
new_dict[key] = new_d
if key in atts and key+'_nom' not in encoding_dict:
for i in range(1, num_nom+1):
new_dict[key+'_nom_'+str(i)] = new_d
return new_dict
def define_prop(x):
if x==1: return 1
elif x==2: return 3/4
elif x==3: return 1/2
elif x==4: return 1/4
else: return np.nan
def derive_prop(df, atts=[], num_nom=3):
if not hasattr(atts, '__iter__'): atts = [atts]
dfs = list()
col = list()
self_atts = set(df)
for att in set(atts):
if att in self_atts:
idx = df[att].notnull()
friends = df[['_'.join([att, 'nom', str(i)]) for i in range(1, num_nom+1)]][idx]
dfs.append((friends.eq(df[att][idx], axis=0) & friends.notna()).sum(axis=1)/(friends.notna().sum(axis=1)))
else:
friends = df[['_'.join([att, 'nom', str(i)]) for i in range(1, num_nom+1)]]
dfs.append(friends.eq('similar').sum(axis=1)/(friends.notna().sum(axis=1)))
col.append(att+'_prop_dv')
dfs = concat(dfs, axis=1)
dfs.columns = col
return dfs
def classify_age(x, granularity=20):
if np.isnan(x) or x < 16: return np.nan
else: return min(x, 99)//granularity*granularity
def classify_age_static(x): #http://www.pgagroup.com/standardized-survey-classifications.html
if np.isnan(x) or x < 16: return np.nan
elif 16 <= x < 25: return 20.0
elif 25 <= x < 35: return 30.0
elif 35 <= x < 45: return 40.0
elif 45 <= x < 55: return 50.0
elif 55 <= x < 65: return 60.0
else: return 70.0
def classify_income(x):
if np.isnan(x): return np.nan
elif x <= 584: return 'DE'
elif x <= 1342: return 'C2'
elif x <= 2115: return 'C1'
else: return 'AB'
if data['sym']=='x': prefix = ''
else: prefix = data['sym']
var = expand_var(var, prefix)
out = data['tab'].loc[:,var.keys()]
out.index = data['tab']['pidp']
out.rename(columns=var, inplace=True)
var_encoding = expand_encoding(var_encoding, atts=['sex', 'ethnicity'])
out.replace(to_replace=var_encoding, inplace=True)
out.loc[out['employment']=='unemployed','employment_prop'] = 1-out.loc[out['employment']=='unemployed','employment_prop']
out['age_raw'] = out['age']
#out['age'] = out['age'].apply(classify_age)
out['age'] = out['age'].apply(classify_age_static)
out['income_raw'] = out['income']
out['income'] = out['income'].apply(classify_income)
for i in range(1, 4):
out['age_raw_nom_'+str(i)] = out['age_nom_'+str(i)]
#out['age_nom_'+str(i)] = out['age_nom_'+str(i)].apply(classify_age)
out['age_nom_'+str(i)] = out['age_nom_'+str(i)].apply(classify_age_static)
for var_name in list(out):
if var_name[-5:]=='_prop': out[var_name] = out[var_name].apply(define_prop)
out[out._get_numeric_data()<0] = np.nan
out = concat([out, derive_prop(out, atts=['sex', 'age', 'ethnicity', 'employment', 'distance'])], axis=1)
out.dropna(axis=1, how='all', inplace=True)
return out
data_wave3_atts = extract_usoc(data_wave3)
def extract_census(data):
import numpy as np
var = {'sex':'sex',
'ageh':'age',
'ethnicityew':'ethnicity',
'hlqupuk11':'education',
'ecopuk11':'employment',
'scgpuk11c':'income',
'religionew':'religion',
'mainglangg':'language',
'geocode':'geocode'
}
var_encoding = {'sex':{1:'male',
2:'female',
-9:np.nan
},
'ethnicity':{1:'british-white',
(2, 3):'other-white',
(4, 11, 12):'black',
(5, 9, 10):'other-asian',
(6, 7, 8):'south-asian',
#8:'bangladeshi',
#10:'other-asian',
13:'other-ethnicity',
-9:np.nan
},
'education':{10:'level-0',
11:'level-1',
(12, 13, 16):'level-2',
14:'level-3',
15:'level-4',
-9:np.nan
},
'employment':{(1, 2, 3, 4, 7):'employed',
(5, 6):'unemployed',
(8, 9, 11):'student',
10:'retired',
12:'housework',
(13, 14, -9):np.nan
},
'income':{1:'AB',
2:'C1',
3:'C2',
4:'DE',
-9:np.nan
},
'religion':{1:'atheist',
2:'christian',
3:'buddhist',
4:'hindu',
5:'jewish',
6:'muslim',
7:'sikh',
8:'other-religion',
(9, -9):np.nan
},
'language':{1:'english/welsh',
2:'polish',
3:'punjabi',
4:'urdu',
5:'bengali',
6:'gujarati',
7:'arabic',
8:'french',
9:'other-language',
-9:np.nan
}
}
def expand_encoding(encoding_dict):
new_dict = dict()
for key, value in encoding_dict.items():
new_d = dict()
for k, v in value.items():
if not hasattr(k, '__iter__'): k = (k,)
for i in k: new_d[i] = v
new_dict[key] = new_d
return new_dict
def decode_age(x):
return (x-0.5)*5
def classify_age(x, granularity=20):
if np.isnan(x) or x < 4: return np.nan
else: return ((x-0.5)*5)//granularity*granularity
def classify_age_static(x): #http://www.pgagroup.com/standardized-survey-classifications.html
if np.isnan(x) or x < 16: return np.nan
elif 16 <= x < 25: return 20.0
elif 25 <= x < 35: return 30.0
elif 35 <= x < 45: return 40.0
elif 45 <= x < 55: return 50.0
elif 55 <= x < 65: return 60.0
else: return 70.0
out = data.loc[:,var.keys()]
out.rename(columns=var, inplace=True)
var_encoding = expand_encoding(var_encoding)
out.replace(to_replace=var_encoding, inplace=True)
out['age_raw'] = out['age'].apply(decode_age)
#out['age'] = out['age'].apply(classify_age)
out['age'] = out['age_raw'].apply(classify_age_static)
out[out._get_numeric_data()<0] = np.nan
out.dropna(axis=1, how='all', inplace=True)
return out
data_c2011_atts = extract_census(data_2011_ew)
Now that we have parsed and aligned USoc-Census data, we are ready to learn the SBM. $\rho$s are derived from prop-type questions for education
and income
, and from nominee-type questions using nom2prop()
for sex
, age
, ethnicity
and employment
. The $\omega$s are derived from the response of the question "how many close friends do you have?" (discounting for family). And $\pi$s are derived from simple counts, not from USoc but the census, since the latter is more representative.
Given that the expected number of friends in an SBM follows Poisson distribution, and the number of friends within the same community follow a Binomial distribution, we can place appropriate conjugate prior distributions on $\omega$, $\rho$ and $\pi$ for every community. Namely, if there are $k$ communities, then $\forall i \in \{1,2,\dots k\}$:
$$\omega_i\sim\mathrm{Gamma}(\alpha_\omega,\beta_\omega)$$$$\rho_i\sim\mathrm{Beta}(\alpha_\rho,\beta_\rho)$$$$\boldsymbol\pi\sim\mathrm{Dirichlet}(\alpha_\pi, k)$$Upon observing appropriate data for these parameters, we can update the priors appropriately. Let us observe dataset $\mathcal{D}$: $n_i$ people in community $i$, with number of friends $f_{i1},f_{i2},\dots,f_{in_i}$ within $i$ and $g_{i1},g_{i2},\dots,g_{in_i}$ outside of $i$. Then posterior distributions are given by:
$$\omega_i|\mathcal{D} \sim\mathrm{Gamma}\left(\alpha_\omega + \sum_{j=1}^{n_i}{(f_{ij}+g_{ij})},\beta_\omega + n_i\right)$$$$\rho|\mathcal{D} \sim\mathrm{Beta}\left(\alpha_\rho + \sum_{j=1}^{n_i}{f_{ij}},\beta_\rho + \sum_{j=1}^{n_i}{g_{ij}}\right)$$$$\boldsymbol\pi|\mathcal{D} \sim\mathrm{Dirichlet}\left(\{\alpha_\pi+n_1, \alpha_\pi+n_2, \dots \alpha_\pi+n_k\}, k \right)$$We can use the mean/mode of the posteriors as the parameter to feed into the SBM, while the variance of the posteriors can inform us about the errors of the model. We demonstrate with a simple example below.
def posterior(priors={'pi':0.1, 'rho':(1.0, 1.0), 'omega':(3.0, 1.0)}, data={}, d=2, tol=1e-2, plot=False):
from scipy.stats import beta, dirichlet, gamma
import numpy as np
rv, x = dict(), dict()
data = data.copy()
posteriors = dict()
if data:
if 'pi' in data:
d = len(data['pi'])
posteriors['pi'] = dirichlet(priors['pi'] + data['pi'])
if 'rho' in data:
a = data['rho'][0]
b = data['rho'][1] - data['rho'][0]
posteriors['rho'] = beta(priors['rho'][0] + a, priors['rho'][1] + b)
if 'omega' in data:
posteriors['omega'] = gamma(priors['omega'][0] + data['omega'].sum(), scale=1/(priors['omega'][1] + len(data['omega'])))
#plot
if plot:
import matplotlib.pyplot as plt
rv['pi'] = dirichlet(priors['pi']*np.ones(d))
rv['rho'] = beta(priors['rho'][0], priors['rho'][1])
rv['omega'] = gamma(priors['omega'][0], scale=1/priors['omega'][1])
x['rho'] = np.linspace(tol, 1-tol, 100)
max_omega = 3*priors['omega'][0]/priors['omega'][1]
if 'omega' in posteriors: max_omega = max(max_omega, 3*posteriors['omega'].mean())
x['omega'] = np.linspace(0, max_omega, 100)
for i in x:
plt.plot(x[i], rv[i].pdf(x[i]), label='prior of $\%s$'%i)
if i in posteriors: plt.plot(x[i], posteriors[i].pdf(x[i]), label='posterior of $\%s$'%i)
plt.legend()
plt.show()
if posteriors: return posteriors
else: return rv
def generate_data(params={'pi':[0.5, 0.5], 'rho':0.5, 'omega':5}, n=100):
from scipy.stats import multinomial, poisson, binom
return {'pi': multinomial.rvs(n=n, p=params['pi']),
'rho': (binom.rvs(n=n, p=params['rho']), n),
'omega': poisson.rvs(params['omega'], size=n)}
data = generate_data({'pi':[0.2, 0.3, 0.5], 'rho':0.8, 'omega':5})
param_posteriors = posterior(data=data, plot=True)
for k in param_posteriors: print(k, 'mean', param_posteriors[k].mean(), 'var', param_posteriors[k].var())
pi mean [0.2003988 0.33000997 0.46959123] var [0.00158183 0.00218266 0.00245879] rho mean 0.8333333333333334 var 0.001348435814455232 omega mean 4.891089108910891 var 0.04842662484070189
We thus recover the original parameters. We now do the same with actual data from USoc, learning parameters for every Blau dimensions separately.
def nom2prop(df, atts=['sex', 'age', 'ethnicity', 'employment', 'distance'], group=True, num_nom=3):
if not hasattr(atts, '__iter__') or isinstance(atts, str): atts = [atts]
dvs = dict()
self_atts = set(df)
for att in set(atts):
if att in self_atts:
if group:
dvs[att] = dict()
for name, data in df.groupby(att):
idx = data[att].notnull()
friends = data[['_'.join([att, 'nom', str(i)]) for i in range(1, num_nom+1)]][idx]
#dvs[att][name] = (friends.eq(data[att][idx], axis=0) & friends.notna()).sum(axis=1).sum()/friends.notna().sum(axis=1).sum()
rho_post = posterior(data={'rho':((friends.eq(data[att][idx], axis=0) & friends.notna()).sum(axis=1).sum(), friends.notna().sum(axis=1).sum())})
dvs[att][name] = (rho_post['rho'].mean(), rho_post['rho'].var())
else:
idx = df[att].notnull()
friends = df[['_'.join([att, 'nom', str(i)]) for i in range(1, num_nom+1)]][idx]
#dvs[att] = (friends.eq(df[att][idx], axis=0) & friends.notna()).sum(axis=1).sum()/friends.notna().sum(axis=1).sum()
dvs[att] = (friends.eq(df[att][idx], axis=0) & friends.notna()).sum(axis=1).sum()/friends.notna().sum(axis=1).sum()
else:
friends = df[['_'.join([att, 'nom', str(i)]) for i in range(1, num_nom+1)]]
dvs[att] = friends.eq('similar').sum(axis=1).sum()/friends.notna().sum(axis=1).sum()
return dvs
def generate_sbm_params(df, df_pi=None, prop=['education', 'income'], prop_dv=[], prop_dv_global=['sex', 'age', 'ethnicity', 'employment'], ignore_family=True, plot=False, save='', fmt='.pdf', fontsize=6):
if df_pi is None: df_pi = df
all_atts = list(df)
prop = dict([(i, i+'_prop') for i in prop if i in all_atts and i+'_prop' in all_atts])
prop_dv = dict([(i, i+'_prop_dv') for i in prop_dv if i not in prop and i in all_atts and i+'_prop_dv' in all_atts])
print('using prop for dims:', set(prop.keys()))
print('using prop-dv for dims:', set(prop_dv.keys()))
prop.update(prop_dv)
prop_dv_global = set([i for i in prop_dv_global if i not in prop and i in all_atts and i+'_nom_1' in all_atts])
print('using prop-dv-global for dims:', prop_dv_global)
prop_dv_global = nom2prop(df, prop_dv_global)
dimnames = []
pis = []
rhos = []
omegas = []
#meanomegas = []
for dim in list(prop.keys())+list(prop_dv_global.keys()):
df_given_dim = df.groupby(dim)
names = []
pi = []
rho = []
omega = []
#tot = 0
for name, data in df_given_dim:
names.append(name)
pi.append(df_pi[df_pi[dim]==name].size)
if dim in prop:
rho_data = data[prop[dim]].values
rho_data = rho_data[~np.isnan(rho_data)]
rho_post = posterior(data={'rho':(rho_data.sum(), len(rho_data))})
rho.append((rho_post['rho'].mean(), rho_post['rho'].var()))
else: rho.append(prop_dv_global[dim][name])
if ignore_family: om_data = data['num_friends']*(1-data['family_prop']).values
else: om_data = data['num_friends'].values
om_data = om_data[~np.isnan(om_data)]
om_post = posterior(data={'omega':om_data})
omega.append((om_post['omega'].mean(), om_post['omega'].var()))
#tot += pi[-1]
#pi = [p/tot for p in pi]
pi_post = posterior(data={'pi':np.array(pi)})
dimnames.append((dim, names))
pis.append((pi_post['pi'].mean(), pi_post['pi'].var()))
rhos.append(rho)
omegas.append(omega)
#meanomegas.append(sum([(o*p) for o,p in zip(omega, pi)]))
if plot:
import matplotlib.pyplot as plt
plt.figure(dpi=120)
for i in range(len(dimnames)):
for j in range(len(dimnames[i][1])): plt.errorbar(i, pis[i][0][j], yerr=pis[i][1][j], fmt='o', label=dimnames[i][1][j])
plt.title('pi')
plt.xticks(np.arange(len(dimnames)), [d[0] for d in dimnames])
plt.legend(bbox_to_anchor=(1.02,1.0), loc='upper left', prop={'size':fontsize})
if save: plt.savefig(save+'/param_pi'+fmt, bbox_inches='tight')
plt.show()
plt.figure(dpi=120)
for i in range(len(dimnames)):
for j in range(len(dimnames[i][1])): plt.errorbar(i, rhos[i][j][0], yerr=rhos[i][j][1], fmt='o', label=dimnames[i][1][j])
plt.title('rho')
plt.xticks(np.arange(len(dimnames)), [d[0] for d in dimnames])
plt.legend(bbox_to_anchor=(1.02,1.0), loc='upper left', prop={'size':fontsize})
if save: plt.savefig(save+'/param_rho'+fmt, bbox_inches='tight')
plt.show()
plt.figure(dpi=120)
for i in range(len(dimnames)):
for j in range(len(dimnames[i][1])): plt.errorbar(i, omegas[i][j][0], yerr=omegas[i][j][1], fmt='o', label=dimnames[i][1][j])
plt.title('omega')
plt.xticks(np.arange(len(dimnames)), [d[0] for d in dimnames])
plt.legend(bbox_to_anchor=(1.02,1.0), loc='upper left', prop={'size':fontsize})
if save: plt.savefig(save+'/param_omega'+fmt, bbox_inches='tight')
plt.show()
#mo = sum(meanomegas)/len(meanomegas)
#print('mean omega', mo)
#for i in range(len(omegas)):
# for j in range(len(omegas[i])):
# omegas[i][j] *= mo/meanomegas[i] #a *small* correction to ensure mean omega is same across all dimensions
return {'shape':[len(i) for i in pis], 'name':dimnames, 'pi':pis, 'rho':rhos, 'omega':omegas}
sbm_params = generate_sbm_params(data_wave3_atts, data_c2011_atts, plot=True)
using prop for dims: {'education', 'income'} using prop-dv for dims: set() using prop-dv-global for dims: {'age', 'ethnicity', 'sex', 'employment'}
We note the following:
So we can now simply import the EgocentricSBM
class and explore the learnt SBM. (See the egosbm notebook for more details on this class.)
We also note that at least 3 of these dimensions are ordinal: age, education, and income. For these dimensions, the EDA assumption (the community discriminate against all other communities equally) may not hold, but an ordinal smooth kernel which favours people "closer" on the dimension may enjoy a higher propensity than those further away. We can infer the said exponential kernel even from proportion-type data: "what proportion of your friends are within a window $w$ of you along dimension X?" (See the cont2cat notebook for more details.)
def func_theta(theta, window=0, rho=0):
import numpy as np
def f(x, y=0): return (x**(2-2*y)*(2*x**y-1)**2 + x**(2*y)*(2*x**(1-y)-1)**2 + 2*x*(1-2*x))/(1-x)**2
def g(x, y=0): return (1-x)**0.5/(1+(x**(1-y)-x**y)/(2*(x**(y-1)-1)))
a = np.exp(-theta)
b = np.exp(theta)
logterm = f(a, window)
arctanh = g(b, window)
return window + (1/(2*theta))*np.log(logterm) + ((b**window-1)/(2*theta*(1-b)**0.5))*np.log((1+arctanh)/(1-arctanh)) - rho
def plot_contour(func, domain=((0,1),(0,1)), args=False, n=1000, aspect='auto', xlabel='', ylabel='', zlabel='', colorbar=True, refresh=False, levels=None, save='', fmt='.pdf'):
import numpy as np
import matplotlib.pyplot as plt
if isinstance(domain, dict):
x = np.linspace(domain['x'][0], domain['x'][1], n)
y = np.linspace(domain['y'][0], domain['y'][1], n)
else:
x = np.linspace(domain[0][0], domain[0][1], n)
y = np.linspace(domain[1][0], domain[1][1], n)
xx, yy = np.meshgrid(x, y, sparse=True)
if args: z = func(xx, yy, args)
else: z = func(xx, yy)
h = plt.axes(aspect=aspect)
if levels is None: plt.contourf(x, y, z)
else: plt.contourf(x, y, z, levels=levels)
if xlabel: plt.xlabel(xlabel)
if ylabel: plt.ylabel(ylabel)
if colorbar:
if zlabel: plt.colorbar(label=zlabel)
else: plt.colorbar()
if save: plt.savefig(save+xlabel+'_vs_'+ylabel+'_vs_'+zlabel+fmt, bbox_inches = 'tight')
elif refresh: plt.show()
Below, we plot how $\rho$ (proportion of friends within $w$) varies as the homophily is made stronger (more negative $\theta$) or as the window of query $w$ increases.
plot_contour(func_theta, ((-10,0),(0,0.5)), xlabel='theta', ylabel='window', zlabel='rho', levels=np.linspace(0, 1, 11))
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:3: RuntimeWarning: invalid value encountered in true_divide This is separate from the ipykernel package so we can avoid doing imports until C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:4: RuntimeWarning: invalid value encountered in true_divide after removing the cwd from sys.path. C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:9: RuntimeWarning: divide by zero encountered in true_divide if __name__ == '__main__': C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:9: RuntimeWarning: invalid value encountered in true_divide if __name__ == '__main__':
The contour plot makes it clear that $\rho$ monotonically increases with the window $w$ (for a given $\theta$) and with homophily $-\theta$ (for a given $w$). Now if we know $w$ and $\rho$, we can solve for $\theta$. However, USoc survey does not ask questions at a quantitative window $w$, instead going for "what proportion of your friends have a similar X as you?"
However, because we are using nominee type questions converted to prop-type questions for age, and because we know the distribution of people along income and education, we can assume people too treat others as being "similar" based on these categories. Under this assumption, we can use the exponential kernel characterization to infer a $\theta$ that satisfies that $\rho$ for the category-windows we have determined to be appropriate.
For a quick analysis, we will assume equiproportioned population across our dimensions of interest.
def solve_theta_window(rho, window, init=-1):
from scipy.optimize import fsolve as solve
if hasattr(rho, '__iter__'): return [solve_theta(r, window, init) for r in rho]
assert(window<=0.5)
assert(init<0)
assert(0<rho<1)
return solve(func_theta, init, (window, rho))[0]
def solve_theta_self(rho, boundaries=None, init=-1):
from scipy.optimize import fsolve as solve
from numpy import exp, log
def f_rho_self(c, a, b, val=0.):
try:
if all(a==b): return a-a
except: pass
logterm = (exp(2*c*(1-b))-2*exp(c*(1-b))+exp(c))/(exp(2*c*(1-a))-2*exp(c*(1-a))+exp(c))
arctanh = ((exp(c*a)-exp(c*b))*(1-exp(c))**0.5)/(exp(c)+exp(c*(a+b))-exp(c*a)-exp(c*b))
return exp(-c*a) + (exp(-c*a)-exp(-c*(1-b)))/(2*c*(b-a))*log(logterm) - (exp(-c*a)+exp(-c*(1-b))-2)/(2*c*(1-exp(c))**0.5*(b-a))*log((1+arctanh)/(1-arctanh)) - val
if boundaries is None: boundaries = np.linspace(0, 1, len(rho)+1)
assert(boundaries[0]==0)
assert(boundaries[-1]==1)
assert(len(rho)==len(boundaries)-1)
solutions = []
for i in range(len(rho)): solutions.append(solve(f_rho_self, init, (boundaries[i], boundaries[i+1], rho[i]))[0])
return solutions
def func_rho(theta, k=4, boundaries=None):
from numpy import exp, log
def f_rho_self(c, a, b):
try:
if all(a==b): return a-a
except: pass
logterm = (exp(2*c*(1-b))-2*exp(c*(1-b))+exp(c))/(exp(2*c*(1-a))-2*exp(c*(1-a))+exp(c))
arctanh = ((exp(c*a)-exp(c*b))*(1-exp(c))**0.5)/(exp(c)+exp(c*(a+b))-exp(c*a)-exp(c*b))
return exp(-c*a) + (exp(-c*a)-exp(-c*(1-b)))/(2*c*(b-a))*log(logterm) - (exp(-c*a)+exp(-c*(1-b))-2)/(2*c*(1-exp(c))**0.5*(b-a))*log((1+arctanh)/(1-arctanh))
def f_rho_low(c, a, b, p, q):
try:
if all(a==b): return a/a
elif all(p==q): return p-p
except: pass
logterm = (exp(2*c*(1-b))-2*exp(c*(1-b))+exp(c))/(exp(2*c*(1-a))-2*exp(c*(1-a))+exp(c))
arctanh = ((exp(c*a)-exp(c*b))*(1-exp(c))**0.5)/(exp(c)+exp(c*(a+b))-exp(c*a)-exp(c*b))
return (exp(-c*q)-exp(-c*p))/(2*c*(b-a))*(-2*c*(b-a) - log(logterm) + 1/(1-exp(c))**0.5*log((1+arctanh)/(1-arctanh)))
def f_rho_high(c, a, b, p, q):
try:
if all(a==b): return a/a
elif all(p==q): return p-p
except: pass
logterm = (exp(2*c*(1-b))-2*exp(c*(1-b))+exp(c))/(exp(2*c*(1-a))-2*exp(c*(1-a))+exp(c))
arctanh = ((exp(c*a)-exp(c*b))*(1-exp(c))**0.5)/(exp(c)+exp(c*(a+b))-exp(c*a)-exp(c*b))
return (exp(-c*(1-p))-exp(-c*(1-q)))/(2*c*(b-a))*(log(logterm)+1/(1-exp(c))**0.5*log((1+arctanh)/(1-arctanh)))
if isinstance(theta, int) or isinstance(theta, float):
if boundaries is None: theta = theta*np.ones(k)
else: theta = theta*np.ones(len(boundaries)-1)
if boundaries is None: boundaries = np.linspace(0, 1, len(theta)+1)
assert(boundaries[0]==0)
assert(boundaries[-1]==1)
assert(len(theta)==len(boundaries)-1)
rho_mat = []
for i in range(len(boundaries)-1):
rhos = []
a, b = boundaries[i], boundaries[i+1]
for j in range(i):
p, q = boundaries[j], boundaries[j+1]
rhos.append(f_rho_low(theta[i], a, b, p, q))
rhos.append(f_rho_self(theta[i], a, b))
for j in range(i+1, len(boundaries)-1):
p, q = boundaries[j], boundaries[j+1]
rhos.append(f_rho_high(theta[i], a, b, p, q))
rho_mat.append(rhos)
return np.array(rho_mat)
def plot_matrix(matrix, labels=False, title='', save='', fmt='.pdf'):
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(dpi=120)
ax = sns.heatmap(matrix, square=True, cmap='Spectral', xticklabels=labels, yticklabels=labels)
if title: plt.title(title)
if save: plt.savefig(save+title+fmt, bbox_inches='tight')
else: plt.show()
Now we use the $\rho$s inferred above from USoc, and apply the exponential kernel setup to infer $\theta$s for every block of the Blau dimension, and correspondingly the proportion of friends a community will have across the entire Blau dimension. Let's do this for age first.
rho_age = [i[0] for i in sbm_params['rho'][2]]
theta_age = solve_theta_self(rho_age)
rho_all_age = func_rho(theta_age)
print('rho age:', rho_age)
print('theta age (inferred):', theta_age)
print('rho age (inferred):', np.diag(rho_all_age))
plot_matrix(rho_all_age, sbm_params['name'][2][1], title='$\\rho$ for age')
rho age: [0.8795737122557726, 0.7168223720843903, 0.6294147582697202, 0.5571218949795619, 0.4888313982836951, 0.6322679175302717] theta age (inferred): [-25.43643367737479, -20.217316373274414, -14.774885140955782, -11.462509542541532, -8.173592046656891, -8.036897889800269] rho age (inferred): [0.8795737 0.71682237 0.62941476 0.55712189 0.4888314 0.63226792]
Things to note:
We repeat this analysis for income groups (or employment-type groups).
rho_inc = [i[0] for i in sbm_params['rho'][1]]
theta_inc = solve_theta_self(rho_inc)
rho_all_inc = func_rho(theta_inc)
print('rho inc:', rho_inc)
print('theta inc (inferred):', theta_inc)
print('rho inc (inferred):', np.diag(rho_all_inc))
plot_matrix(rho_all_inc, sbm_params['name'][1][1], title='$\\rho$ for income')
rho inc: [0.5988417554491252, 0.6181683899556869, 0.6073896518960007, 0.5743704401035538] theta inc (inferred): [-4.659429871873838, -9.060005101766219, -8.680419488906448, -4.243070605874064] rho inc (inferred): [0.59884176 0.61816839 0.60738965 0.57437044]
Again, although the $\rho$s are all quite comparable, the bulk is much more homophilous than the edges. This adds more credulity to the exponential kernel assumption, over the EDA one.
rho_edu = [i[0] for i in sbm_params['rho'][0]]
theta_edu = solve_theta_self(rho_edu)
rho_all_edu = func_rho(theta_edu)
print('rho edu:', rho_edu)
print('theta edu (inferred):', theta_edu)
print('rho edu (inferred):', np.diag(rho_all_edu))
plot_matrix(rho_all_edu, sbm_params['name'][0][1], title='$\\rho$ for education')
rho edu: [0.7873970345963757, 0.7829450309428467, 0.7601093801789858, 0.7342526109660574, 0.7125576986482031] theta edu (inferred): [-12.31575224190963, -22.701746543767943, -20.49290465808315, -18.145701618263384, -8.983709612312476] rho edu (inferred): [0.78739703 0.78294503 0.76010938 0.73425261 0.7125577 ]
Additionally, once we have inferred $\theta$, it leaves us free to change the number of blocks in our analysis, allowing us to move from discrete to near-continuous settings, as evidenced by the plots below. We assume the same $\theta$ for all blocks.
for theta in [-0.5, -2, -8]:
for k in [4, 16, 64]:
plot_matrix(func_rho(theta, k), title='$\\theta$: '+str(theta)+', k: '+str(k))
Again, the edge effects are much more visible as we increase the number of blocks. Another place where this becomes apparent is in the asymmetry of the matrix--- proportion of friends in a certain block will not be identically reciprocated by another block, even if they have the same $\theta$, owing to the edge effects in ordinal dimensions.