Generate Bidding Lists for Reviewers

11th June 2014 Neil D. Lawrence

This notebook loads in the TPMS scores and keyword similarities and allocates the highest similarity score matches to reviewers for bidding on.

In [ ]:
import cmtutils
import os
import pandas as pd
import re
import numpy as np

First things first, we need to get all the current information out of CMT: external matching scores, conflict information and keyword overlap. We do this from Assignments & Conflicts > *** > Automatic Assignment Wizard, where *** is either reviewers or meta-reviewers. Proceed through the wizard, putting in some values, then at the end click on Export Data for Custom Assignment. To set things up for bidding you will need to select Subject Areas (Paper and Meta-Reviewer), Toronto Paper Matching System and Conflicts. To set things up for the final allocation you also need the bids.

In [ ]:
# First we load in the external matching scores.
filename = '2014-06-19_externalMatchingScores.tsv'
filename = os.path.join(cmtutils.cmt_data_directory, filename)
affinity = pd.read_csv(filename, delimiter='\t', index_col='PaperID', na_values=['N/A']).fillna(0)
# Scale affinities to lie between 0 and 1.
affinity -= affinity.values.min()
affinity /= affinity.values.max()

Paper Subject Areas

Now load in paper subject areas and group them by the Paper ID. This file is downloaded from:

https://cmt.research.microsoft.com/NIPS2014/Protected/Chair/ManageAssignmentsExport.aspx?data=subjectareas&view=cs&format=excel

In [ ]:
# Now we load in paper subject areas
filename = '2014-06-13_paperSubjectAreas.xls'
data = cmtutils.xl_read(filename=os.path.join(cmtutils.cmt_data_directory, filename), index='Selected Subject Area', dataframe=True, worksheet_number=1)
paper_subject = data.items.groupby(by=['Paper ID'])

Reviewer Subject Areas

Load in reviewer (or meta reviewer) subject areas and group them by email. This file is downloaded from:

https://cmt.research.microsoft.com/NIPS2014/Protected/Chair/ManageAssignmentsExport.aspx?data=subjectareas&view=cr&format=excel

In [ ]:
# Now we load in (meta-)reviewer subject areas
filename = '2014-06-13_reviewerSubjectAreas.xls'
data = cmtutils.xl_read(filename=os.path.join(cmtutils.cmt_data_directory, filename), index='Selected Subject Area', dataframe=True, worksheet_number=1)
reviewer_subject = data.items.groupby(by=['Email'])

Possible Assignments and Conflicts

The possible assignments are derived from the conflicts: for each paper they list the people it could be assigned to. This file is downloaded from:

https://cmt.research.microsoft.com/NIPS2014/Protected/Chair/ManageAssignmentsExport.aspx?data=possibleassignments&view=cs&format=tab&excludemetareviewer=1

Conflicts is downloaded from:

https://cmt.research.microsoft.com/NIPS2014/Protected/Chair/ManageAssignmentsExport.aspx?data=conflicts&view=cs&format=tab
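
The conflicts file is tab-separated with one paper per line: the paper ID first, then the emails of the conflicted reviewers, which is what the parsing code below assumes (the values shown here are made up for illustration):

1	reviewer.one@example.com	reviewer.two@example.com
2	reviewer.three@example.com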

In [ ]:
if True: # Read from the TSV format CMT provides.
    filename = 'Conflicts.txt'
    with open(os.path.join(cmtutils.cmt_data_directory, filename)) as fin:
        rows = (line.strip().split('\t') for line in fin)
        conflicts_groups = {row[0]: row[1:] for row in rows}
    papers = conflicts_groups.keys()
    # Invert the mapping so conflicts can also be looked up by reviewer.
    conflicts_by_reviewer = {}
    for paper in papers:
        for reviewer in conflicts_groups[paper]:
            conflicts_by_reviewer.setdefault(reviewer, []).append(paper)
    conflicts_file = True
else:
    # And finally we load in 'possible assignments'.
    filename = '2014-06-13_possibleAssignmentsByPaper.xls'
    data = cmtutils.xl_read(filename=os.path.join(cmtutils.cmt_data_directory, filename), index='Paper ID', dataframe=True)
    possible_assignments = data.items
    regex = re.compile(r'\(([^)]*)\)')
    papers = possible_assignments.index
    conflicts_file = False

Compute a simple similarity based on subject overlap: the number of overlapping keywords divided by the product of the square root of the number of reviewer keywords and the square root of the number of paper keywords. 'None of the above' is removed as a term if it is present.
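
In symbols, writing $K_p$ for the paper's keyword set and $K_r$ for the reviewer's (both with 'None of the above' excluded), this is

$$\mathrm{sim}(p, r) = \frac{|K_p \cap K_r|}{\sqrt{|K_p|}\,\sqrt{|K_r|}}.$$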

This actually turns out not to be a very sensible way of doing it. I was only just getting used to pandas when I wrote this. There's a more sensible (much faster) way of getting these similarities out in the reviewer calibration notebook; a vectorized sketch follows the next cell.

In [ ]:
subject_sim = pd.DataFrame(np.zeros((len(paper_subject.groups), len(reviewer_subject.groups))), 
                           index=paper_subject.groups, columns=reviewer_subject.groups)
for paper in paper_subject.groups:        
    set_paper = set(paper_subject.groups[paper]) - set(['None of the above'])
    for reviewer in reviewer_subject.groups:
        set_reviewer = set(reviewer_subject.groups[reviewer]) - set(['None of the above'])
        if len(set_paper)>0 and len(set_reviewer)>0:
            norm = np.sqrt(len(set_paper))*np.sqrt(len(set_reviewer))
        else:
            norm = 1. # don't normalise if the vector is all zeros!
        subject_sim.loc[paper, reviewer] = len(set_reviewer & set_paper)/norm
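
As an aside, the whole similarity matrix can be computed with a single matrix product. Here is a vectorized sketch along those lines (an illustration, not the actual code from the calibration notebook; the indicator matrices P and R are names introduced here):

In [ ]:
# Collect the full list of subject areas, dropping 'None of the above'.
all_subjects = set()
for areas in paper_subject.groups.values():
    all_subjects.update(areas)
for areas in reviewer_subject.groups.values():
    all_subjects.update(areas)
all_subjects.discard('None of the above')
subjects = sorted(all_subjects)

# 0/1 indicator matrices: papers x subjects and reviewers x subjects.
P = pd.DataFrame([[float(s in set(paper_subject.groups[p])) for s in subjects]
                  for p in paper_subject.groups],
                 index=list(paper_subject.groups), columns=subjects)
R = pd.DataFrame([[float(s in set(reviewer_subject.groups[r])) for s in subjects]
                  for r in reviewer_subject.groups],
                 index=list(reviewer_subject.groups), columns=subjects)

# Divide each row by the square root of its keyword count (all-zero rows
# are left alone), then one matrix product gives every paper-reviewer score.
P = P.div(np.sqrt(P.sum(axis=1)).replace(0., 1.), axis=0)
R = R.div(np.sqrt(R.sum(axis=1)).replace(0., 1.), axis=0)
subject_sim_fast = P.dot(R.T)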

Weight an $\alpha$ portion of the affinities and a $1-\alpha$ portion of the keyword similarities.
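
With the rescaled TPMS affinities $a$ and keyword similarities $k$, the combined score used below is

$$s = \alpha\, a + (1-\alpha)\, k, \qquad \alpha = 0.5.$$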

Allocate to Top 40 High Scoring Reviewers

A little bit of background is needed here. At the time the code was written Corinna and I were struggling to get CMT to perform an allocation. It was across the weekend, so there was no support, and it turned out the scale of NIPS 2014 had broken a few different things. This caused me to start writing paper allocation code, within the space of a few days, without much knowledge of the literature. This first piece of code simply allocates each paper to the 40 highest scoring reviewers. It is superseded by the code that follows, which ranks the entire score matrix and starts by allocating the highest scoring paper-reviewer pair.

In [ ]:
alpha = 0.5
assignment = {}
all_reviewers = affinity.columns
for reviewer in all_reviewers:
    assignment[reviewer] = []
assignment_paper = {}

all_scores = alpha*affinity + (1-alpha)*subject_sim
min_vals = all_scores.min()
max_vals = all_scores.max()
normalise_scores = True

for paper_str in papers:
    paper = int(paper_str)
    if conflicts_file:
        reviewers = list(set(all_reviewers) - set(conflicts_groups[paper_str]))
    else:
        reviewers = regex.findall(possible_assignments['Assigned Meta-Reviewers'][paper])
        assert len(reviewers) == int(possible_assignments['Number of Meta-Reviewers'][paper])
    scores = (1-alpha)*subject_sim.loc[paper][reviewers]
    if paper in affinity.index:
        scores += alpha*affinity.loc[paper][reviewers]
    else:
        print("Warning: paper", paper, "not found in TPMS scores.")

    if normalise_scores:
        scores -= min_vals[reviewers]
        scores /= (max_vals-min_vals)[reviewers]
    # Allocate the paper to its 40 highest scoring reviewers.
    scores = scores.sort_values(ascending=False)
    assignment_paper[paper] = scores[:40].index
    for reviewer in assignment_paper[paper]:
        assignment[reviewer].append(paper)

Use this code if you loaded in the conflicts file (i.e. took the first branch above).

In [ ]:
all_scores = (alpha*affinity + (1-alpha)*subject_sim)
min_vals = all_scores.min()
max_vals = all_scores.max()
normalise_scores = True
In [ ]:
# Identify conflicts by setting their scores to -1.
rank_scores = all_scores.copy()
for paper in conflicts_groups:
    rank_scores.loc[int(paper), conflicts_groups[paper]] = -1.

Ranking All Scores

After some thought, this next approach was preferred: all scores are taken and ranked, and papers are then allocated starting from the most similar paper-reviewer pair and working downwards.

In [ ]:
# Melt the score matrix into a long list of (paper, reviewer, score) triples.
score_vec = rank_scores.reset_index()
score_vec = pd.melt(score_vec, id_vars=['index'])
# Dropping scores below 0.1 also removes the conflicts (set to -1 above).
score_vec = score_vec[score_vec.value > 0.1]
score_vec = score_vec[pd.notnull(score_vec.value)]
score_vec.columns = ['PaperID', 'Email', 'Score']
score_vec = score_vec.sort_values(by='Score', ascending=False)
In [ ]:
paper_number_assigned = {}
reviewer_number_assigned = {}
max_number_paper = 17
max_number_reviewer = 25
assignment_paper = {}
assignment_reviewer = {}

# Work down the ranked list, assigning each pair unless the paper or
# the reviewer has already reached its cap.
for idx in score_vec.index:
    paper = str(score_vec['PaperID'][idx])
    if paper_number_assigned.get(paper, 0) >= max_number_paper:
        continue
    reviewer = str(score_vec['Email'][idx])
    if reviewer_number_assigned.get(reviewer, 0) >= max_number_reviewer:
        continue

    assignment_paper.setdefault(paper, []).append(reviewer)
    assignment_reviewer.setdefault(reviewer, []).append(paper)
    paper_number_assigned[paper] = paper_number_assigned.get(paper, 0) + 1
    reviewer_number_assigned[reviewer] = reviewer_number_assigned.get(reviewer, 0) + 1

Find Reviewers with Fewer than 25 Papers

Now reviewers who haven't got a full allocation of 25 papers to rank are allocated a top-up number of papers. In later runs of the allocation algorithm, papers were allocated in batches (each reviewer allocated up to 5, then up to 10, then up to 20 and so on) to balance things a little more; a sketch of that idea is given below. But at this early stage, to get the bidding going, allocation was done in this 'top up' style. Due to the problems with CMT and the allocation steps being unforeseen, we felt quite a lot of time pressure at this point.
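
For reference, here is a minimal sketch of that batched idea (written after the fact for illustration; the function and its arguments are not from the 2014 code):

In [ ]:
def batched_greedy_allocation(scores, paper_cap=17, reviewer_caps=(5, 10, 15, 20, 25)):
    """Greedy allocation in stages: each stage raises the per-reviewer cap,
    so the best matches are spread across all reviewers before any single
    reviewer fills up. scores is a papers x reviewers DataFrame."""
    ranked = scores.stack().sort_values(ascending=False)
    assigned = set()
    reviewer_load = {}
    allocation = {}
    for cap in reviewer_caps:
        for (paper, reviewer), score in ranked.items():
            # Skip conflicts (set to -1) and pairs already assigned.
            if score <= 0 or (paper, reviewer) in assigned:
                continue
            if reviewer_load.get(reviewer, 0) >= cap:
                continue
            if len(allocation.get(paper, [])) >= paper_cap:
                continue
            assigned.add((paper, reviewer))
            reviewer_load[reviewer] = reviewer_load.get(reviewer, 0) + 1
            allocation.setdefault(paper, []).append(reviewer)
    return allocation

allocation = batched_greedy_allocation(rank_scores)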

In [ ]:
all_papers = affinity.index
min_papers = 25
additional_papers = {}
additional_reviewers = {}
for reviewer in affinity.columns:
    if reviewer in reviewer_number_assigned:
        num_papers = reviewer_number_assigned[reviewer]
        if num_papers < min_papers:
            required_papers = min_papers - num_papers
        elif reviewer == '[email protected]':
            required_papers = 25
        else:
            continue
    else:
        required_papers = min_papers

    # Conflict lists store paper ids as strings; the affinity index is integer.
    conflicted = set(int(p) for p in conflicts_by_reviewer.get(reviewer, []))
    papers = list(set(all_papers) - conflicted)
    scores = alpha*affinity.loc[papers][reviewer]
    if reviewer in subject_sim.columns:
        scores += (1-alpha)*subject_sim.loc[papers][reviewer]
    else:
        print("Warning: reviewer", reviewer, "not found in subject similarities.")

    scores = scores.sort_values(ascending=False)
    additional_reviewers[reviewer] = scores[:required_papers].index
    for paper in additional_reviewers[reviewer]:
        additional_papers.setdefault(str(paper), []).append(reviewer)

This bit of code writes the allocation for sharing with Corinna, just for hand checking to ensure that something sensible is going on.

In [ ]:
with open(os.path.join(cmtutils.cmt_data_directory, 'reviewer_bidding_allocation.txt'), 'w') as f:
    for reviewer in assignment_reviewer:
        f.write('Reviewer ' + reviewer + '\n')
        f.write('\n')
        for paper in assignment_reviewer[reviewer]:
            f.write(str(paper) + " " + "https://cmt.research.microsoft.com/NIPS2014/Protected/Chair/ViewSubmissionDetails.aspx?paperId=" + str(paper) + '\n')
        f.write('\n')

This code was for writing the export file for CMT to load in the bidding allocation.

In [ ]:
with open(os.path.join(cmtutils.cmt_data_directory, 'reviewer_bidding_allocation.tsv'), 'w') as f:
    for reviewer in assignment_reviewer:
        for paper in assignment_reviewer[reviewer]:
            f.write(', '.join([reviewer, str(paper)]) + '\n')

This code is similar, but uses the CMT XML format which they find easier to load in.

In [ ]:
with open(os.path.join(cmtutils.cmt_data_directory, 'reviewer_assignments.xml'), 'w') as f:
    f.write('<assignments>\n')
    for paper in assignment_paper:
        f.write('  <submission submissionId="' + paper + '">\n')
        for reviewer in assignment_paper[paper]:
            f.write('    <reviewer email="' + reviewer + '"/>\n')
        f.write('  </submission>\n')
    f.write('</assignments>\n')
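
With hypothetical values (paper 1234 and reviewer.one@example.com are made up), the file this writes looks like:

<assignments>
  <submission submissionId="1234">
    <reviewer email="reviewer.one@example.com"/>
  </submission>
</assignments>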
In [ ]:
with open(os.path.join(cmtutils.cmt_data_directory, 'additional_reviewer_bidding_allocation.tsv'), 'w') as f:
    for reviewer in additional_reviewers:
        for paper in additional_reviewers[reviewer]:
            f.write(', '.join([reviewer, str(paper)]) + '\n')
In [ ]:
# Semicolon-separated list of the topped-up reviewers.
print(';'.join(additional_reviewers))
In [ ]:
# Number of distinct reviewers that have been allocated papers.
len(set(additional_reviewers) | set(assignment_reviewer))

Finding Extra Reviewers

Some reviewers complained that they weren't seeing enough papers in their area. Most of these reviewers had many secondary subject areas. The similarity measure used above (chosen mainly for speed) did not weight the primary subject area differently from the secondary ones, which meant reviewers with many secondary areas were getting a lot of papers outside their core area. In this next section of code we add additional papers to reviewers for bidding; a sketch of a primary-weighted similarity is given below.
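
For illustration, here is a minimal sketch of a similarity that weights primary subject areas more heavily than secondary ones (the function, its arguments and the weight of 2 are assumptions introduced here, not part of the 2014 code):

In [ ]:
def weighted_subject_similarity(paper_keys, reviewer_primary, reviewer_secondary, primary_weight=2.):
    """Overlap score that counts a reviewer's primary subject areas more
    heavily than the secondary ones. All arguments are sets of keywords."""
    weights = {key: 1. for key in reviewer_secondary}
    weights.update({key: primary_weight for key in reviewer_primary})
    overlap = sum(weights.get(key, 0.) for key in paper_keys)
    # Cosine-style normalisation: sqrt of the paper's keyword count times
    # the norm of the reviewer's weight vector.
    norm = np.sqrt(len(paper_keys))*np.sqrt(sum(w*w for w in weights.values()))
    return overlap/norm if norm > 0 else 0.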

In [ ]:
alpha = 0.5
all_papers = affinity.index
additional_papers = {}
additional_reviewers = {}

for reviewer in reviewers:
    required_papers = 25
    # Conflict lists store paper ids as strings; the affinity index is integer.
    conflicted = set(int(p) for p in conflicts_by_reviewer.get(reviewer, []))
    papers = list(set(all_papers) - conflicted)
    scores = (1-alpha)*subject_sim.loc[papers][reviewer]
    if reviewer in affinity.columns:
        scores += alpha*affinity.loc[papers][reviewer]
    else:
        print("Warning: reviewer", reviewer, "not found in TPMS scores.")

    scores = scores.sort_values(ascending=False)
    additional_reviewers[reviewer] = scores[:required_papers].index
    for paper in additional_reviewers[reviewer]:
        additional_papers.setdefault(str(paper), []).append(reviewer)
In [ ]:
with open(os.path.join(cmtutils.cmt_data_directory, 'further_additional_reviewer_bidding_allocation.tsv'), 'w') as f:
    for reviewer in additional_reviewers:
        for paper in additional_reviewers[reviewer]:
            f.write(', '.join([reviewer, str(paper)]) + '\n')