Find Duplicate Users

18th June 2014 Alan Saul

This script loads all users from an Excel sheet exported from CMT (Users And Roles > View All Users and Role > Export list to Excel), using the custom-built cmtutils library.

It then finds users with the same first and last name. Duplicate accounts are easily created in CMT by co-authors. We need to ensure that these conflicting accounts have their correct domain names listed. The aim should be to merge these accounts, or delete the surplus ones.

We also need to take particular care over users who have submitted on one account and are reviewing/meta-reviewing on another, as they may end up reviewing their own papers!

It then finds users who really don't belong in the system (not a chair, associate chair, reviewer, external reviewer, meta-reviewer or proceedings editor, and who have also not submitted a paper this year); these could be deleted.

In [39]:
import cmtutils
import pandas as pd
data_dir = cmtutils.cmt_data_directory
cmt=cmtutils.cmt_reviewers_read(filename="2014-10-26_full_user_list.xls")

Get the list of all users and booleanise their categorical Yes/No fields.

In [40]:
users = cmt.reviewers

#Booleanise the variables
d = {'Yes': True, 'No': False}
for col in ['IsAssociateChair', 'IsAuthor', 'IsChair', 'IsExternalReviewer','ConflictDomainsNotEnteredForSubmittedPapers',
            'IsMetaReviewer', 'IsProceedingsEditor','IsReviewer', 'IsSubmissionOwner']:
    users[col] = users[col].map(d)
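In isolation, the mapping step works like this (a minimal sketch on a toy frame; the single column here is just illustrative):

```python
import pandas as pd

# Toy frame with the same Yes/No convention as the CMT export.
df = pd.DataFrame({'IsReviewer': ['Yes', 'No', 'Yes']})
d = {'Yes': True, 'No': False}
df['IsReviewer'] = df['IsReviewer'].map(d)
print(df['IsReviewer'].tolist())  # [True, False, True]
```

Note that `map` silently turns any value not in the dictionary (e.g. an empty cell) into `NaN`, so it is worth checking the columns only contain Yes/No first.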

Update the Submitters List

The Submissions column only contains the user who submitted each paper. However, we can copy and paste the emails of all authors from CMT by starting an email to all authors: go to Submissions > View Active Papers > Email > All Authors and copy the recipient list, saving it as submitted_email_list.txt. We can then tell from that list whether a user has submitted.

In [41]:
import re
import os
users_tmp = users  # keep a reference to the original frame for the sanity check below
with open(os.path.join(cmtutils.cmt_data_directory, 'submitted_email_list.txt')) as f:
    file_str = f.read().lower()
    matches = re.findall(r'<(.+?)>', file_str) 

submitted_users = users.set_index('Email').loc[matches]
submitted_users.index.name = 'Email'
user_by_email = users.set_index('Email')

print "Found {} real submission owners on list of authors from email list, \
and {} listed as have submitted in User and Roles list".format(len(submitted_users), users['IsSubmissionOwner'].sum())

print "That is a mismatch of {}".format(abs(users['IsSubmissionOwner'].sum() - len(submitted_users)))

#Remove all "submitted" for users in list, then re-add them from the email list
users['IsSubmissionOwner'] = False
submitted_users['IsSubmissionOwner'] = True
#Make email the primary key
users = users.set_index('Email')
users.loc[matches,'IsSubmissionOwner'] = True

#Sanity check
assert len(users) == len(users_tmp)
assert users['IsSubmissionOwner'].sum() == len(submitted_users)

print "Users now has {} submission authors".format(users['IsSubmissionOwner'].sum())
Found 3596 real submission owners on list of authors from email list, and 3780 listed as have submitted in User and Roles list
That is a mismatch of 184
Users now has 3596 submission authors
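The reconciliation above can be sketched end-to-end on toy data (names and addresses here are made up; lowercasing the recipient string first is what makes the extracted addresses match the Email index):

```python
import re
import pandas as pd

# Hypothetical recipient string in the "Name <email>" format CMT produces.
recipients = "Ada Lovelace <Ada@example.com>; Alan Turing <turing@example.org>"
matches = re.findall(r'<(.+?)>', recipients.lower())

users = pd.DataFrame({'Email': ['ada@example.com', 'turing@example.org',
                                'carol@example.net'],
                      'IsSubmissionOwner': [False, True, True]})
users = users.set_index('Email')
# Reset the flag, then re-mark only the addresses found in the email list.
users['IsSubmissionOwner'] = False
users.loc[matches, 'IsSubmissionOwner'] = True
print(users['IsSubmissionOwner'].sum())  # 2
```

Carol's stale flag is cleared because the reset-then-re-mark order treats the email list as the ground truth.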
In [42]:
users.IsSubmissionOwner.sum()
Out[42]:
3596

Add a column combining first name and last name for use as a key.

In [43]:
users['Name'] = users['FirstName'] + users['LastName']
users.sort_index(inplace=True)
cols = users.columns.tolist()
cols = cols[-1:] + cols[:-1]
users = users[cols]

Find Duplicated Users

Get dataframes of users with no role (we might want to delete them in the future), and of users with duplicate "FirstName LastName" combinations.

In [44]:
#Roleless
no_roles = users[~users.IsAssociateChair & ~users.IsChair & ~users.IsExternalReviewer 
                 & ~users.IsMetaReviewer & ~users.IsProceedingsEditor 
                 & ~users.IsReviewer & ~users.IsSubmissionOwner]

#Get duplicates with roles
roles = users[users.IsAssociateChair | users.IsChair | users.IsExternalReviewer | users.IsMetaReviewer | users.IsProceedingsEditor 
              | users.IsReviewer | users.IsSubmissionOwner]
NameCol = roles["Name"]
#duplicated() marks only the second and later occurrences of a name; isin()
#then selects every row whose name is duplicated (including the first
#occurrence), before sorting by name
dupes_with_roles = roles[NameCol.isin(NameCol[NameCol.duplicated()])].sort("Name")
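The duplicated/isin trick is easiest to see on a toy frame (made-up names; note the notebook's `.sort("Name")` is the pre-0.20 pandas spelling of today's `sort_values`):

```python
import pandas as pd

# Toy frame: two accounts share the name "AdaLovelace".
roles = pd.DataFrame({'Name': ['AdaLovelace', 'AlanTuring', 'AdaLovelace'],
                      'Email': ['ada@a.com', 'alan@b.com', 'ada@c.com']})
name_col = roles['Name']
# duplicated() alone would miss the first 'AdaLovelace' row; feeding the
# duplicated names back through isin() recovers every row of each name.
dupes = roles[name_col.isin(name_col[name_col.duplicated()])].sort_values('Name')
print(dupes['Email'].tolist())  # ['ada@a.com', 'ada@c.com']
```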

Filter out all those who don't combine being a submitter with another role across their accounts. That is, duplicates are okay as long as they are just submitters and not also reviewers; likewise duplicate reviewers are fine as long as they are not also submitting.

In [45]:
#Group Names together
grouped = dupes_with_roles.groupby('Name')
other_roles = ['IsAssociateChair', 'IsChair', 'IsExternalReviewer', 'IsMetaReviewer', 'IsProceedingsEditor','IsReviewer']

#Filter
submitter_and_reviewer = grouped.filter(lambda x: (x['IsSubmissionOwner'].sum() > 0) 
                                        and (sum(x[other_roles].sum()) > 0)
                                       )
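The groupby filter keeps exactly the dangerous name groups; a minimal sketch with made-up names and a single "other role" column:

```python
import pandas as pd

# Toy duplicates: name 'A' submits on one account and reviews on another;
# name 'B' only ever submits.
dupes = pd.DataFrame({'Name': ['A', 'A', 'B', 'B'],
                      'IsSubmissionOwner': [True, False, True, True],
                      'IsReviewer': [False, True, False, False]})
other_roles = ['IsReviewer']
# Keep a name's group only if, across its accounts, someone submitted AND
# someone holds another role.
flagged = dupes.groupby('Name').filter(
    lambda x: (x['IsSubmissionOwner'].sum() > 0)
              and (x[other_roles].sum().sum() > 0))
print(sorted(flagged['Name'].unique()))  # ['A']
```

`filter` returns the original rows of every group the lambda accepts, so both of A's accounts survive while B's harmless duplicates are dropped.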

Save the list to a csv file for later processing.

In [46]:
submitter_and_reviewer_reindex = submitter_and_reviewer.reindex_axis(['FirstName', 'LastName', 'Institute', 'IsReviewer', 'IsSubmissionOwner', 'IsAssociateChair', 'IsChair', 'IsExternalReviewer', 'IsMetaReviewer', 'IsProceedingsEditor', 'LastLoginDate'], axis=1)
submitter_and_reviewer_reindex.to_csv(os.path.join(data_dir, "submitters_and_reviewers_dupes.csv"), encoding='utf-8')