This notebook imports review scores from CMT for analysis. In particular it does some quality checks on correlation, and on review length. It was first used for exploring the status of the reviews and what areas may need to be addressed. This led to the creation of Attention Reports which were emailed to area chairs and highlighted particular issues. Those attention reports can be found
import cmtutils
import sqlite3
import pandas as pd
import numpy as np
import pylab as plt

%matplotlib inline
plt.rcParams['figure.figsize'] = (10., 8.)
plt.rcParams['font.size'] = 16

# Load the paper list exported from CMT.
filename = '2014-07-23_paper_list.xls'
papers = cmtutils.cmt_papers_read(filename=filename)

# Load the reviews exported from CMT.
filename = '2014-07-24_reviews.xls'
reviews = cmtutils.cmt_reviews_read(filename=filename)
For NIPS 2014 we experimented with duplicate papers: we pushed papers through the system twice, exposing them to different subsets of the reviewers. The first thing we'll look at is the duplicate papers. Firstly we identify them by matching on title.
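duplicate_list = []
for ID, title in papers.papers.Title.items():
    # Only IDs above 1779 (excluding 1949) are treated as duplicate submissions.
    if int(ID) > 1779 and int(ID) != 1949:
        # Match the title literally (regex=False) so that punctuation in a
        # title is not interpreted as a regular expression.
        matches = papers.papers['Title'].str.contains(title.strip(), regex=False)
        pair = list(papers.papers[matches].index)
        pair.sort(key=int)
        duplicate_list.append(pair)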
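To illustrate the title matching on a small scale, here is a minimal sketch using an invented three-row table (the titles and IDs are hypothetical, not from the CMT export). By default pandas `str.contains` interprets its argument as a regular expression, so the sketch passes `regex=False` to match the title as a literal string.

```python
import pandas as pd

# Hypothetical paper table indexed by string IDs; the titles are invented.
toy_papers = pd.DataFrame(
    {'Title': ['Deep Nets (Revisited)',
               'Sparse Coding',
               'Deep Nets (Revisited) ']},
    index=['12', '40', '1801'])

# Take the title of the presumed duplicate and strip stray whitespace.
title = toy_papers.loc['1801', 'Title'].strip()

# Literal substring match; the parentheses are not treated as a regex group.
matches = toy_papers[toy_papers['Title'].str.contains(title, regex=False)]

# Sort the matching IDs numerically so the original paper comes first.
pair = sorted(matches.index, key=int)
print(pair)
```

Sorting with `key=int` matters because the IDs are strings: a plain string sort would place '1801' before '12'.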
Next we compute the correlation coefficients for the duplicated papers for the average impact and quality scores.
quality = []
impact = []
confidence = []
for duplicate_pair in duplicate_list:
    # Mean score of the first copy against mean score of the second copy.
    quality.append([np.mean(reviews.reviews.loc[duplicate_pair[0]].Quality),
                    np.mean(reviews.reviews.loc[duplicate_pair[1]].Quality)])
    impact.append([np.mean(reviews.reviews.loc[duplicate_pair[0]].Impact),
                   np.mean(reviews.reviews.loc[duplicate_pair[1]].Impact)])
    confidence.append([np.mean(reviews.reviews.loc[duplicate_pair[0]].Conf),
                       np.mean(reviews.reviews.loc[duplicate_pair[1]].Conf)])
quality = np.array(quality)
impact = np.array(impact)
confidence = np.array(confidence)

# np.corrcoef expects one variable per row, hence the transpose.
quality_cor = np.corrcoef(quality.T)[0, 1]
impact_cor = np.corrcoef(impact.T)[0, 1]
confidence_cor = np.corrcoef(confidence.T)[0, 1]
print("Quality correlation: ", quality_cor)
print("Impact correlation: ", impact_cor)
print("Confidence correlation: ", confidence_cor)
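As a sanity check on what the correlation coefficient measures, the following sketch compares `np.corrcoef` with the Pearson formula written out by hand. The scores are invented for illustration and are not taken from the NIPS reviews.

```python
import numpy as np

# Hypothetical mean scores for four duplicated papers: column 0 is the
# first copy, column 1 is the second copy. These numbers are invented.
toy_scores = np.array([[5.0, 6.0],
                       [7.0, 6.5],
                       [4.0, 4.5],
                       [8.0, 7.0]])

# np.corrcoef treats each row as a variable, so transpose first.
corr_matrix = np.corrcoef(toy_scores.T)
toy_cor = corr_matrix[0, 1]

# The same quantity from the definition of the Pearson correlation:
# covariance divided by the product of the standard deviations.
x, y = toy_scores[:, 0], toy_scores[:, 1]
dx, dy = x - x.mean(), y - y.mean()
manual_cor = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

print(toy_cor, manual_cor)
```

The off-diagonal entry `[0, 1]` is the correlation between the two copies; the diagonal entries are always 1.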
To visualize the quality score correlation we plot the group 1 papers against the group 2 papers. Here we add a small amount of jitter to help visualize points that would otherwise fall on the same position.
# Jitter needs one random offset per paper, so use the number of rows.
plt.plot(quality[:, 0] + np.random.randn(quality.shape[0])*0.06125,
         quality[:, 1] + np.random.randn(quality.shape[0])*0.06125, 'rx')
plt.title('Quality Correlation: ' + str(quality_cor))
plt.plot(impact[:, 0] + np.random.randn(impact.shape[0])*0.06125,
         impact[:, 1] + np.random.randn(impact.shape[0])*0.06125, 'rx')
plt.title('Impact Correlation: ' + str(impact_cor))
plt.plot(confidence[:, 0] + np.random.randn(confidence.shape[0])*0.06125,
         confidence[:, 1] + np.random.randn(confidence.shape[0])*0.06125, 'rx')
plt.title('Confidence Correlation: ' + str(confidence_cor))
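The jitter used in these plots can be factored into a small helper. This is only a sketch: the function name `jitter` and its `rng` argument are hypothetical and not part of the original notebook, and the default scale simply reuses the 0.06125 from the plots.

```python
import numpy as np

def jitter(values, scale=0.06125, rng=None):
    """Return a copy of values with zero-mean Gaussian noise added."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, dtype=float)
    # One independent noise sample per entry, scaled to stay small
    # relative to the spacing between distinct mean scores.
    return values + rng.standard_normal(values.shape) * scale

# Three identical scores would overlap exactly without jitter.
rng = np.random.default_rng(0)
scores = np.array([5.0, 5.0, 5.0, 6.5])
jittered = jitter(scores, rng=rng)
print(jittered)
```

Passing a seeded generator makes the plot reproducible; the jitter only affects the display and should not be applied before computing correlations.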
The main score people are familiar with for NIPS is the quality score, which ranges from 1 to 10. A paper scoring 6 or above is recommended for acceptance. Here we plot a histogram of individual review quality scores for the whole conference.
reviews.reviews.Quality.hist(bins=10, range=(1, 10))
plt.title('Reviewer Scores')
Another question of interest is the typical 'span' of quality scores, i.e. how consistent the reviews are. We could also measure this by variance, but span (the maximum score minus the minimum) gives a sensible integer value. If a paper has a very large span, it may need attention from the area chair. First we need to compute this for each paper.
span = []
for paper in sorted(set(reviews.reviews.index), key=int):
    # All quality scores given to this paper.
    paper_quality = reviews.reviews.loc[paper].Quality
    span.append(np.max(paper_quality) - np.min(paper_quality))
span = np.array(span)

# One bin per integer span value.
plt.hist(span, bins=int(np.max(span) - np.min(span) + 1),
         range=(np.min(span), np.max(span) + 1))
plt.title('Span of Reviewer Scores per Paper')
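The per-paper span can also be computed without an explicit loop by grouping on the index. The sketch below assumes, as the loop above does, that the review table is indexed by paper ID with one row per review. The toy data and the threshold of 4 are invented for illustration.

```python
import pandas as pd

# Hypothetical review table: one row per review, indexed by paper ID.
toy_reviews = pd.DataFrame(
    {'Quality': [6, 7, 3, 8, 5, 5]},
    index=pd.Index(['1', '1', '2', '2', '3', '3'], name='ID'))

# Group the reviews of each paper and take max minus min.
by_paper = toy_reviews.groupby(level=0)['Quality']
toy_span = by_paper.max() - by_paper.min()

# Flag papers whose reviewers disagree strongly (threshold is an assumption).
large_span = toy_span[toy_span >= 4]
print(toy_span)
print(list(large_span.index))
```

Papers in `large_span` are the ones an area chair might want to look at first.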
Impact score is either 1 or 2. This was an innovation introduced by Zoubin and Max to try and help identify papers that were likely to have an influence (independently of their technical qualities). Let's see how often a paper is scored as high impact.
reviews.reviews.Impact.hist(bins=2, range=(1, 2))
plt.title('Histogram of Impact')
Confidence scores vary from 1 to 5. Low confidence scores may indicate an inappropriate reviewer allocation, but just as often they'll indicate a poorly written paper or a paper that isn't really appropriate for the NIPS community. By investigating confidence scores we can identify papers that may need attention.
reviews.reviews.Conf.hist(bins=5, range=(1, 5))
plt.title('Histogram of Confidence')
Another concern could be very short reviews. Let's have a look at review length for the conference reviews.
comment_len = []
for paper, comments in reviews.reviews.Comments.items():
    # Length of each review in characters.
    comment_len.append(len(comments))
comment_len = np.array(comment_len)
plt.hist(comment_len, bins=20)
This script goes through reviews and prints those which are less than 200 characters long. These may need addressing by alerting the area chair or prompting the reviewer to provide more detailed comments.
for index, review in reviews.reviews.loc[(comment_len < 200)].iterrows():
    print(index, review['Email'], review['Comments'])
#plt.plot(comment_len, reviews.reviews.Quality, 'rx')
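A vectorized variant of the short-review check is sketched below. It uses the pandas string accessor instead of a Python loop and treats a missing comment as an empty one, which is an assumption about how such reviews should be handled. The table contents, including the e-mail addresses, are invented.

```python
import pandas as pd

# Hypothetical review table; column names follow the notebook,
# the rows themselves are invented.
toy = pd.DataFrame(
    {'Email': ['a@example.com', 'b@example.com', 'c@example.com'],
     'Comments': ['Good paper.', 'x' * 250, None]},
    index=['10', '11', '12'])

# Character count per review; a missing comment counts as length 0.
lengths = toy['Comments'].fillna('').str.len()

# Reviews shorter than 200 characters, the cutoff used above.
short = toy[lengths < 200]
for index, review in short.iterrows():
    print(index, review['Email'], review['Comments'])
```

Keeping `lengths` as a Series aligned with the review index also makes it easy to look up unusually long reviews later.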
comments_len = pd.Series(comment_len, index=reviews.reviews.index)
There is one paper with a comment length of over 12000 characters. That's quite long. Let's also have a look at that.
comments_len.hist(bins=20)
reviews.reviews.loc[comments_len > 12000][['Quality', 'Impact', 'Email']]