The Method of Reflection (MOR) is a algorithm first coming out of Macroeconomics, that ranks nodes in a bi-partite network. This notebook should hopefully help you implement the method of reflection in python. To be precise, it is the modified algorithm that is proposed by Caldarelli et al., which solves some problems with the original Hidalgo-Hausmann (HH) algorithm doi:10.1073/pnas.0900943106. The main problem with (HH) is that all values converge to a single fixed point after sufficiently many iterations. The Caldarelli version solves this by adding a new term to the recursive equation - what they call a biased random walker (function G). doi: 10.1371/journal.pone.0047278. I hadn't seen any open-source implementations of this algorithm, so I thought I'd share my naïve approach.
import datetime
import pandas as pd
import numpy as np
from collections import defaultdict
import scipy.stats as ss
from scipy.optimize import fmin as scipyfmin
import operator
import re
import json
#if you are using ipython and want to see things inline.
%pylab inline
Populating the interactive namespace from numpy and matplotlib
We want to model a bi-partite network. Since we are using numpy
and want to operate on a matrix we will use a numpy.matrix
, but we also may want to retain the Unique IDs associated with each node, so we'll need to keep dicts
of those as they relate to the matrix indices. Therefore for each bi-partite network at least three files are needed. In my case I analysed a network which was a Wikipedia Category, so my two node-types are users and articles. In your case they may be different, but I've kept the nomenclature here for simplicity - my own simplicity 8^). You can borrow my sample data from https://github.com/notconfusing/wiki_econ_capability/tree/master/savedata
M.npy
- an numpy matrix M - adjacently matrix of the network.user_dict.json
- a mapping (json dictionary)between unique key (in my case Wikipedia user name) to M row index.article_dict
- a mapping (json dictionary) between unique key (in my case the name of the article) M column index.If you plan to calibrate your model against some exogenous metrics you will need to provide two more files - the rankings from the exogenous data.
user_exogenous_ranks.json
- a mapping (json dictionary) between the same keys of the user_dict
to their exogenous ranks (how well another metric ranks Wikipedia users).article_exogenous_ranks.json
- a mapping (json dictionary) between the same keys of the article_dict
to their exogenous ranks (how well another metric ranks Wikipedia articles).I put everything belonging to one network into a single folder and then use this load_files
method which unpacks all the object and gives you a dict of all the objects.
def load_files(folder):
M = np.load(folder+'M.npy')
user_dict = json.load(open(folder+'user_dict.json', 'r'))
article_dict = json.load(open(folder+'article_dict.json', 'r'))
user_exogenous_ranks = json.load(open(folder+'user_exogenous_ranks.json', 'r'))
article_exogenous_ranks = json.load(open(folder+'article_exogenous_ranks.json', 'r'))
return {'M':M,
'user_dict':user_dict,
'article_dict':article_dict,
'user_exogenous_ranks':user_exogenous_ranks,
'article_exogenous_ranks':article_exogenous_ranks}
For instance, let's create the feminist_data
dict, which holds data from Category:Feminist_writers . I am snapshotting data, so that's why below you see two dates: when the snapshot was taken, and the latest date to be considered in the snapshot. So we are getting the full data from February 18th 2014.
feminist_data = load_files('savedata/Category:Feminist_writers/2014-02-18/2014-02-18/')
Now we define two operations on $M$. The HH technique and those coming after use a binary input matrix, I also got better results, with a binary matrix. So we can normalise all our non-zero data to 1
. Also we may want to take a look at it an see if it has a triangular shape, since (MOR) assumes a triangular matrix.
def make_bin_matrix(M):
#this returns the a matrix with entry True where the original was nonzero, and zero otherwise.
M[M>0] = 1.0
return M
def M_test_triangular(M):
user_edits_sum = M.sum(axis=1)
article_edits_sum = M.sum(axis=0)
user_edits_order = user_edits_sum.argsort()
article_edits_order = article_edits_sum.argsort()
M_sorted = M[user_edits_order,:]
M_sorted_sorted = M_sorted[:,article_edits_order]
M_bin = make_bin_matrix(M_sorted_sorted)
plt.figure(figsize=(10,10))
imshow(M_sorted_sorted, cmap=plt.cm.bone, interpolation='nearest')
bin_M = make_bin_matrix(feminist_data['M'])
M_test_triangular(bin_M)
Ok now let's move onto our third and fourth operations on $M$ - getting some numeric rankings out of the matrix. Let us refresh ourselves on the equations from the literature. (Note variables c and p, countries and products, translate to editors and article respectively):
These are an $w_{c}$ editor-vector which is the sums of articles edited by each editor. Or the article-vector $w_{p}$, which is the sum of editors contributing to each article.
\begin{cases} w_{c}^{(0)} = \sum_{p=1}^{N_{p}} M \equiv k_c\\[7pt] w_{p}^{(0)} = \sum_{c=1}^{N_{c}} M \equiv k_p \end{cases}The first order $w^{1}_c$ is the sum of the articles touched, but weighted by the Zeroth order article-vector (and the $G$ term). So if you've edited better articles that counts. And $w^{1}_c$ is the sum of editors touching, but weighted by the Zeroth order editor-vector (and $G$). So if you're touched by better editors that's also being considered.
Beyond the first order interpretation for the higher orders is difficult.
\begin{cases} w^{(n+1)}_c (\alpha,\beta) = \sum_{p=1}^{N_p} G_{cp}(\beta) \,w^{(n)}_p (\alpha,\beta)\\[7pt] w^{(n+1)}_p (\alpha,\beta) = \sum_{c=1}^{N_c} G_{pc}(\alpha) \, w^{(n)}_c (\alpha,\beta)\\ \end{cases}Depending on $\alpha$ and $\beta$ we non-linearly weight based on the Zeroth order iterations.
\begin{cases} G_{cp}(\beta) = \frac{M_{cp} k_{c}^{-\beta}}{\sum_{c' = 1}^{N_c} M_{c'p} k_{c'}^{-\beta}}\\[10pt] G_{pc}(\alpha) = \frac{M_{cp} k_{p}^{-\alpha}}{\sum_{p' = 1}^{N_p} M_{cp'} k_{p'}^{-\alpha}}.\\ \end{cases}And now we implement the mathematics in python. Hopefully I got this right, it hasn't been independently verified. Additionally I implement $w$ as a generator
, so you can go on for many iterations without chewing up too much memory. There is also a stream function that allows you get a specific iteration. And lastly a find_convergence
function, that checks to see if the rankings haven't shifted for two consecutive iterations.
def Gcp_denominateur(M, p, k_c, beta):
M_p = M[:,p]
k_c_beta = k_c ** (-1 * beta)
return np.dot(M_p, k_c_beta)
def Gpc_denominateur(M, c, k_p, alpha):
M_c = M[c,:]
k_p_alpha = k_p ** (-1 * alpha)
return np.dot(M_c, k_p_alpha)
def make_G_hat(M, alpha=1, beta=1):
'''G hat is Markov chain of length 2
Gcp is a matrix to go from contries to product and then
Gpc is a matrix to go from products to ccountries'''
k_c = M.sum(axis=1) #aka k_c summing over the rows
k_p = M.sum(axis=0) #aka k_p summering over the columns
G_cp = np.zeros(shape=M.shape)
#Gcp_beta
for [c, p], val in np.ndenumerate(M):
numerateur = (M[c,p]) * (k_c[c] ** ((-1) * beta))
denominateur = Gcp_denominateur(M, p, k_c, beta)
G_cp[c,p] = numerateur / float(denominateur)
G_pc = np.zeros(shape=M.T.shape)
#Gpc_alpha
for [p, c], val in np.ndenumerate(M.T):
numerateur = (M.T[p,c]) * (k_p[p] ** ((-1) * alpha))
denominateur = Gpc_denominateur(M, c, k_p, alpha)
G_pc[p,c] = numerateur / float(denominateur)
return {'G_cp': G_cp, "G_pc" : G_pc}
def w_generator(M, alpha, beta):
#this cannot return the zeroeth iteration
G_hat = make_G_hat(M, alpha, beta)
G_cp = G_hat['G_cp']
G_pc = G_hat['G_pc']
#
fitness_0 = np.sum(M,1)
ubiquity_0 = np.sum(M,0)
fitness_next = fitness_0
ubiquity_next = ubiquity_0
i = 0
while True:
fitness_prev = fitness_next
ubiquity_prev = ubiquity_next
i += 1
fitness_next = np.sum( G_cp*ubiquity_prev, axis=1 )
ubiquity_next = np.sum( G_pc* fitness_prev, axis=1 )
yield {'iteration':i, 'fitness': fitness_next, 'ubiquity': ubiquity_next}
def w_stream(M, i, alpha, beta):
"""gets the i'th iteration of reflections of M,
but in a memory safe way so we can calculate many generations"""
if i < 0:
raise ValueError
for j in w_generator(M, alpha, beta):
if j[0] == i:
return {'fitness': j[1], 'ubiquity': j[2]}
break
def find_convergence(M, alpha, beta, fit_or_ubiq, do_plot=False,):
'''finds the convergence point (or gives up after 1000 iterations)'''
if fit_or_ubiq == 'fitness':
Mshape = M.shape[0]
elif fit_or_ubiq == 'ubiquity':
Mshape = M.shape[1]
rankings = list()
scores = list()
prev_rankdata = np.zeros(Mshape)
iteration = 0
for stream_data in w_generator(M, alpha, beta):
iteration = stream_data['iteration']
data = stream_data[fit_or_ubiq]
rankdata = data.argsort().argsort()
#test for convergence
if np.equal(rankdata,prev_rankdata).all():
break
if iteration == 1000:
break
else:
rankings.append(rankdata)
scores.append(data)
prev_rankdata = rankdata
if do_plot:
plt.figure(figsize=(iteration/10, Mshape / 20))
plt.xlabel('Iteration')
plt.ylabel('Rank, higher is better')
plt.title('Rank Evolution')
p = semilogx(range(1,iteration), rankings, '-,', alpha=0.5)
return {fit_or_ubiq:scores[-1], 'iteration':iteration}
We also know from Caldarelli et al. that there is an analytic formulation to the recursive procedure. So if you want to save some (a lot) processing and just know the end result we can use:
And again in python:
def w_star_analytic(M, alpha, beta, w_star_type):
k_c = M.sum(axis=1) #aka k_c summing over the rows
k_p = M.sum(axis=0) #aka k_p summering over the columns
A = 1
B = 1
def Gcp_denominateur(M, p, k_c, beta):
M_p = M[:,p]
k_c_beta = k_c ** (-1 * beta)
return np.dot(M_p, k_c_beta)
def Gpc_denominateur(M, c, k_p, alpha):
M_c = M[c,:]
k_p_alpha = k_p ** (-1 * alpha)
return np.dot(M_c, k_p_alpha)
if w_star_type == 'w_star_c':
w_star_c = np.zeros(shape=M.shape[0])
for c in range(M.shape[0]):
summand = Gpc_denominateur(M, c, k_p, alpha)
k_beta = (k_c[c] ** (-1 * beta))
w_star_c[c] = A * summand * k_beta
return w_star_c
elif w_star_type == 'w_star_p':
w_star_p = np.zeros(shape=M.shape[1])
for p in range(M.shape[1]):
summand = Gcp_denominateur(M, p, k_c, beta)
k_alpha = (k_p[p] ** (-1 * alpha))
w_star_p[p] = B * summand * k_alpha
return w_star_p
We will run our algorithms on our data. The output of both the iterative and the analytic solutions are a list of scores. So in order to know who was the best, we afterwards identify (this is why we need the ID-mapping) and sort the identified list. I use pandas
here to simply life, but I've also done it in pure-python if you're not familiar with pandas
. Also I arbitrarily use $(\alpha, \beta) = (0,0)$.
#purer python
#score
w_scores = w_star_analytic(M=feminist_data['M'], alpha=0.5, beta=0.5, w_star_type='w_star_c')
#identify
w_ranks = {name: w_scores[pos] for name, pos in feminist_data['user_dict'].iteritems() }
#sort
w_ranks_sorted = sorted(w_ranks.iteritems(), key=operator.itemgetter(1))
#or use pandas
w_scores_df = pd.DataFrame.from_dict(w_ranks, orient='index')
w_scores_df.columns = ['w_score']
w_scores_df.sort(columns=['w_score'], ascending=False).head()
w_score | |
---|---|
Dsp13 | 2.230387 |
Bearcat | 2.107264 |
Johnpacklambert | 2.047463 |
Solar-Wind | 1.952138 |
Treybien | 1.850515 |
5 rows × 1 columns
Well done users Dsp13 and Bearcat. If you look up these user's on English Wikipedia, you can see at a glance they are accomplished editors - so this is also a good sanity check. (Isn't there a Mogwai video called Bearcat? Oh, no it's Batcat \m/ never mind let's move on.)
Let's take the long way home, and check that the shortcut actually takes us to the right place. We use the iterative method with the same data, until we find convergence. Also I make a plot here of the ranks of the users after each iteration, so we can track them. So each line you see is the history of user's rise to Glory, or their slow decline to forgotten irrelevance (or none of those phenomenon). Actually if you see a user going up its because the value of the articles edited is increasing. And likewise if a user is losing standing, its because they edited a lot of articles, but were of poor quality.
convergence = find_convergence(M=feminist_data['M'], alpha=0.5, beta=0.5, fit_or_ubiq='fitness', do_plot=True)