Analyzing CS trends from ICML and NIPS Research Papers

Introduction

Machine Learning (ML) is one of the most talked-about areas of data analytics and sits at the heart of modern Artificial Intelligence (AI). As researchers race to beat the state of the art, the major research conferences have seen a significant increase in paper submissions in recent years. This raises several interesting questions: Is there a trend we can spot in these conference papers? Has there been a shift of focus across the various tracks in machine learning?

Setting up the environment

In [1]:
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import seaborn as sns
from sklearn.cluster import KMeans, MiniBatchKMeans
from wordcloud import WordCloud
In [2]:
%matplotlib inline
sns.set(color_codes=True)
pd.set_option('max_colwidth', 500)

The pre-processed dataset

The International Conference on Machine Learning (ICML) and the Conference on Neural Information Processing Systems (NIPS) are two flagship machine learning conferences; they are among the largest by attendance and attract researchers from all areas of machine learning.

The dataset consists of ICML and NIPS papers from 2014 to 2016. The JSON files were crawled by @evandrix.
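
Each file is assumed to be in JSON Lines format, i.e. one JSON object per line. A hypothetical record is sketched below; the field names match those used by the cleaning code later on, while the values are invented for illustration.

# Hypothetical example of one line from icml.json / nips.json (values invented).
# The fields match those used by clean_df() below: _id, title, abstract, authors, date.
example_line = '{"_id": "example16", "title": "An Example Paper", "abstract": "A short abstract ...", "authors": ["First Author", "Second Author"], "date": 2016}'
json.loads(example_line)  # json is imported in the setup cell above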

Reading in the JSON files

In [3]:
def jsonlines_to_df(file_path):
    # Each line in the file is a standalone JSON object (JSON Lines format),
    # so we parse line by line and build a DataFrame from the resulting records.
    with open(file_path) as f:
        return pd.DataFrame(json.loads(line) for line in f)
In [4]:
icml = jsonlines_to_df('./data/icml.json')
nips = jsonlines_to_df('./data/nips.json')

As always, we have to clean and sanitize the dataset before we proceed to mine insights from it.

Cleaning the dataset

In [5]:
def clean_df(df, name):
    # Keep only the columns we need, in a fixed order.
    ordered_cols = ['_id', 'title', 'abstract', 'authors', 'date']
    df = df.loc[:, ordered_cols].copy()
    df['_id'] = df['_id'].astype(str)
    df['year'] = df['date'].astype(int)  # the crawled 'date' field holds the publication year
    df['_id'] = ('%s_' % name) + df['_id']  # prefix IDs with the conference name, e.g. 'icml_' or 'nips_'
    df['authors_str'] = df['authors'].apply(lambda x: ', '.join(x))  # flatten the author list into one string
    df.drop('date', axis=1, inplace=True)
    return df[['_id', 'title', 'abstract', 'authors', 'authors_str', 'year']]
In [6]:
icml = clean_df(icml, 'icml').set_index('_id').sort_values('year', ascending=False)
nips = clean_df(nips, 'nips').set_index('_id').sort_values('year', ascending=False)

A preview of the cleaned dataset

In [7]:
icml.head(1)
Out[7]:
title abstract authors authors_str year
_id
icml_yenb16 PD-Sparse : A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification We consider Multiclass and Multilabel classification with extremely large number of classes, of which only few are labeled to each instance. In such setting, standard methods that have training, prediction cost linear to the number of classes become intractable. State-of-the-art methods thus aim to reduce the complexity by exploiting correlation between labels under assumption that the similarity between labels can be captured by structures such as low-rank matrix or balanced tree. However, ... [Ian En-Hsu Yen, Inderjit Dhillon, Kai Zhong, Pradeep Ravikumar, Xiangru Huang] Ian En-Hsu Yen, Inderjit Dhillon, Kai Zhong, Pradeep Ravikumar, Xiangru Huang 2016

Describing the dataset

Each paper is indexed by its id and carries five fields: title, abstract, authors, authors_str and year. A total of 902 ICML papers and 1,382 NIPS papers from 2014 to 2016 were crawled.

ICML count

In [8]:
icml.shape
Out[8]:
(902, 5)

NIPS count

In [9]:
nips.shape
Out[9]:
(1382, 5)

Putting the papers from the two conferences together, we observe an increase in accepted papers from approximately 700 in 2014 to about 900 in 2016, with a total of 2,284 accepted papers over the three years.

Combining ICML and NIPS papers from 2014 to 2016

In [10]:
papers = pd.concat([icml, nips], axis=0).sort_values('year', ascending=False)
In [11]:
papers.shape
Out[11]:
(2284, 5)
In [12]:
papers.dtypes
Out[12]:
title          object
abstract       object
authors        object
authors_str    object
year            int64
dtype: object
In [13]:
papers.head(3)
Out[13]:
title abstract authors authors_str year
_id
icml_yenb16 PD-Sparse : A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification We consider Multiclass and Multilabel classification with extremely large number of classes, of which only few are labeled to each instance. In such setting, standard methods that have training, prediction cost linear to the number of classes become intractable. State-of-the-art methods thus aim to reduce the complexity by exploiting correlation between labels under assumption that the similarity between labels can be captured by structures such as low-rank matrix or balanced tree. However, ... [Ian En-Hsu Yen, Inderjit Dhillon, Kai Zhong, Pradeep Ravikumar, Xiangru Huang] Ian En-Hsu Yen, Inderjit Dhillon, Kai Zhong, Pradeep Ravikumar, Xiangru Huang 2016
nips_6042 Learning to Communicate with Deep Multi-Agent Reinforcement Learning We consider the problem of multiple agents sensing and acting in environments with the goal of maximising their shared utility. In these environments, agents must learn communication protocols in order to share information that is needed to solve the tasks. By embracing deep neural networks, we are able to demonstrate end-to-end learning of protocols in complex environments inspired by communication riddles and multi-agent computer vision problems with partial observability. We propose two a... [Jakob Foerster, Nando de Freitas, Shimon Whiteson, Yannis M. Assael] Jakob Foerster, Nando de Freitas, Shimon Whiteson, Yannis M. Assael 2016
nips_6037 Sub-sampled Newton Methods with Non-uniform Sampling We consider the problem of finding the minimizer of a convex function $F: \mathbb R^d \rightarrow \mathbb R$ of the form $F(w) \defeq \sum_{i=1}^n f_i(w) + R(w)$ where a low-rank factorization of $\nabla^2 f_i(w)$ is readily available.We consider the regime where $n \gg d$. We propose randomized Newton-type algorithms that exploit \textit{non-uniform} sub-sampling of $\{\nabla^2 f_i(w)\}_{i=1}^{n}$, as well as inexact updates, as means to reduce the computational complexity, and are applicab... [Christopher Ré, Farbod Roosta-Khorasani, Jiyan Yang, Michael W. Mahoney, Peng Xu] Christopher Ré, Farbod Roosta-Khorasani, Jiyan Yang, Michael W. Mahoney, Peng Xu 2016
In [14]:
s = papers['year'].value_counts().sort_index()
i = s.index; i.name = 'Year'

ax = pd.DataFrame(s.values, index=i, columns=['counts']).plot(kind='bar', rot=0, legend=False, title='Total Papers Each Year')
ax.set_ylabel("No. of Papers")
fig = ax.get_figure()
fig.savefig('./distributions/total_papers_each_year.png')

Before we proceed with further analysis of the emerging trends and topics, we create another field that combines the three columns "title", "abstract" and "authors_str". We will mainly use this field for the analysis that follows.

Further pre-processing

Using regex to restructure/clean certain text

In [15]:
papers['abstract'] = papers.abstract.str.replace(r'\\\w+{(.*?)}', r'\1')  # Unwraps LaTeX commands, e.g. \textit{foo} -> foo
papers['abstract'] = papers.abstract.str.replace(r'{(.*?)}', r'\1')  # Removes surrounding { or }
papers['abstract'] = papers.abstract.str.replace(r'\\\w+\b', r'')  # Removes bare commands: \it, \defeq, etc.
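
To see what these substitutions do, here is a quick sanity check on a made-up LaTeX-flavoured string (not part of the original notebook); it simply mirrors the three replacements above.

# Hypothetical sanity check: the string below is invented for illustration.
sample = pd.Series([r'We use \textit{non-uniform} sampling of $\nabla^2 f_i(w)$.'])
sample = sample.str.replace(r'\\\w+{(.*?)}', r'\1')
sample = sample.str.replace(r'{(.*?)}', r'\1')
sample = sample.str.replace(r'\\\w+\b', r'')
print(sample[0])  # the LaTeX commands are stripped; braced arguments are kept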

Combining columns "title", "abstract", and "authors_str" to form the new column "text"

In [16]:
papers['text'] = papers['title'] + ' ' + papers['abstract'] + ' ' + papers['authors_str']
In [17]:
papers['text'][0]
Out[17]:
'PD-Sparse : A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification We consider Multiclass and Multilabel classification with extremely large number of classes, of which only few are labeled to each instance. In such setting, standard methods that have training, prediction cost linear to the number of classes become intractable. State-of-the-art methods thus aim to reduce the complexity by exploiting correlation between labels under assumption that the similarity between labels can be captured by structures such as low-rank matrix or balanced tree. However, as the diversity of labels increases in the feature space, structural assumption can be easily violated, which leads to degrade in the testing performance. In this work, we show that a margin-maximizing loss with l1 penalty, in case of Extreme Classification, yields extremely sparse solution both in primal and in dual without sacrificing the expressive power of predictor. We thus propose a Fully-Corrective Block-Coordinate Frank-Wolfe (FC-BCFW) algorithm that exploits both primal and dual sparsity to achieve a complexity sublinear to the number of primal and dual variables. A bi-stochastic search method is proposed to further improve the efficiency. In our experiments on both Multiclass and Multilabel problems, the proposed method achieves significant higher accuracy than existing approaches of Extreme Classification with very competitive training and prediction time. Ian En-Hsu Yen, Inderjit Dhillon, Kai Zhong, Pradeep Ravikumar, Xiangru Huang'
In [18]:
papers.shape
Out[18]:
(2284, 6)

Start of Analysis

We will use several unsupervised methods, namely TF-IDF vectorization, PCA followed by t-SNE, and clustering, to mine insights. Visualizations are also provided to better explore current trends in machine learning research.

TF-IDF Vectorizer (X_df) [Exploration]

The TF-IDF vectorizer converts a collection of raw documents into a matrix of TF-IDF features. As the abstracts typically contain stop words that would add noise to our analysis, we remove them first; the resulting vocabulary of unigrams, bigrams and trigrams contains 16,431 distinct features.
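
As a small, hypothetical illustration (not part of the original notebook) of what the vectorizer produces, consider two invented toy documents:

# Toy illustration of TfidfVectorizer on two invented documents.
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ['deep networks for image classification',
            'sparse methods for extreme classification']
toy_vect = TfidfVectorizer()
toy_X = toy_vect.fit_transform(toy_docs)
print(toy_vect.get_feature_names())  # vocabulary learned from the toy corpus
print(toy_X.toarray().round(2))      # one row per document, one column per term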

In [19]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
In [20]:
stopwords = ['a', 'able', 'about', 'above', 'abroad', 'according', 'accordingly', 'across', 'actually', 'adj', 'after', 'afterwards', 'again', 'against', 'ago', 'ahead', "ain't", 'all', 'allow', 'allows', 'almost', 'alone', 'along', 'alongside', 'already', 'also', 'although', 'always', 'am', 'amid', 'amidst', 'among', 'amongst', 'an', 'and', 'another', 'any', 'anybody', 'anyhow', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apart', 'appear', 'appreciate', 'appropriate', 'are', "aren't", 'around', 'as', "a's", 'aside', 'ask', 'asking', 'associated', 'at', 'available', 'away', 'awfully', 'b', 'back', 'backward', 'backwards', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'begin', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'best', 'better', 'between', 'beyond', 'both', 'brief', 'but', 'by', 'c', 'came', 'can', 'cannot', 'cant', "can't", 'caption', 'cause', 'causes', 'certain', 'certainly', 'changes', 'clearly', "c'mon", 'co', 'co.', 'com', 'come', 'comes', 'concerning', 'consequently', 'consider', 'considering', 'contain', 'containing', 'contains', 'corresponding', 'could', "couldn't", 'course', "c's", 'currently', 'd', 'dare', "daren't", 'definitely', 'described', 'despite', 'did', "didn't", 'different', 'directly', 'do', 'does', "doesn't", 'doing', 'done', "don't", 'down', 'downwards', 'during', 'e', 'each', 'edu', 'eg', 'eight', 'eighty', 'either', 'else', 'elsewhere', 'end', 'ending', 'enough', 'entirely', 'especially', 'et', 'etc', 'even', 'ever', 'evermore', 'every', 'everybody', 'everyone', 'everything', 'everywhere', 'ex', 'exactly', 'example', 'except', 'f', 'fairly', 'far', 'farther', 'few', 'fewer', 'fifth', 'first', 'five', 'followed', 'following', 'follows', 'for', 'forever', 'former', 'formerly', 'forth', 'forward', 'found', 'four', 'from', 'further', 'furthermore', 'g', 'get', 'gets', 'getting', 'given', 'gives', 'go', 'goes', 'going', 'gone', 'got', 'gotten', 'greetings', 'h', 'had', "hadn't", 'half', 'happens', 'hardly', 'has', "hasn't", 'have', "haven't", 'having', 'he', "he'd", "he'll", 'hello', 'help', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', "here's", 'hereupon', 'hers', 'herself', "he's", 'hi', 'him', 'himself', 'his', 'hither', 'hopefully', 'how', 'howbeit', 'however', 'hundred', 'i', "i'd", 'ie', 'if', 'ignored', "i'll", "i'm", 'immediate', 'in', 'inasmuch', 'inc', 'inc.', 'indeed', 'indicate', 'indicated', 'indicates', 'inner', 'inside', 'insofar', 'instead', 'into', 'inward', 'is', "isn't", 'it', "it'd", "it'll", 'its', "it's", 'itself', "i've", 'j', 'just', 'k', 'keep', 'keeps', 'kept', 'know', 'known', 'knows', 'l', 'last', 'lately', 'later', 'latter', 'latterly', 'least', 'less', 'lest', 'let', "let's", 'like', 'liked', 'likely', 'likewise', 'little', 'look', 'looking', 'looks', 'low', 'lower', 'ltd', 'm', 'made', 'mainly', 'make', 'makes', 'many', 'may', 'maybe', "mayn't", 'me', 'mean', 'meantime', 'meanwhile', 'merely', 'might', "mightn't", 'mine', 'minus', 'miss', 'more', 'moreover', 'most', 'mostly', 'mr', 'mrs', 'much', 'must', "mustn't", 'my', 'myself', 'n', 'name', 'namely', 'nd', 'near', 'nearly', 'necessary', 'need', "needn't", 'needs', 'neither', 'never', 'neverf', 'neverless', 'nevertheless', 'new', 'next', 'nine', 'ninety', 'no', 'nobody', 'non', 'none', 'nonetheless', 'noone', 'no-one', 'nor', 'normally', 'not', 'nothing', 'notwithstanding', 'novel', 'now', 'nowhere', 'o', 'obviously', 'of', 'off', 'often', 'oh', 'ok', 'okay', 'old', 'on', 'once', 'one', 'ones', 
"one's", 'only', 'onto', 'opposite', 'or', 'other', 'others', 'otherwise', 'ought', "oughtn't", 'our', 'ours', 'ourselves', 'out', 'outside', 'over', 'overall', 'own', 'p', 'particular', 'particularly', 'past', 'per', 'perhaps', 'placed', 'please', 'plus', 'possible', 'presumably', 'probably', 'provided', 'provides', 'q', 'que', 'quite', 'qv', 'r', 'rather', 'rd', 're', 'really', 'reasonably', 'recent', 'recently', 'regarding', 'regardless', 'regards', 'relatively', 'respectively', 'right', 'round', 's', 'said', 'same', 'saw', 'say', 'saying', 'says', 'second', 'secondly', 'see', 'seeing', 'seem', 'seemed', 'seeming', 'seems', 'seen', 'self', 'selves', 'sensible', 'sent', 'serious', 'seriously', 'seven', 'several', 'shall', "shan't", 'she', "she'd", "she'll", "she's", 'should', "shouldn't", 'since', 'six', 'so', 'some', 'somebody', 'someday', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhat', 'somewhere', 'soon', 'sorry', 'specified', 'specify', 'specifying', 'still', 'sub', 'such', 'sup', 'sure', 't', 'take', 'taken', 'taking', 'tell', 'tends', 'th', 'than', 'thank', 'thanks', 'thanx', 'that', "that'll", 'thats', "that's", "that've", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', "there'd", 'therefore', 'therein', "there'll", "there're", 'theres', "there's", 'thereupon', "there've", 'these', 'they', "they'd", "they'll", "they're", "they've", 'thing', 'things', 'think', 'third', 'thirty', 'this', 'thorough', 'thoroughly', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'till', 'to', 'together', 'too', 'took', 'toward', 'towards', 'tried', 'tries', 'truly', 'try', 'trying', "t's", 'twice', 'two', 'u', 'un', 'under', 'underneath', 'undoing', 'unfortunately', 'unless', 'unlike', 'unlikely', 'until', 'unto', 'up', 'upon', 'upwards', 'us', 'use', 'used', 'useful', 'uses', 'using', 'usually', 'v', 'value', 'various', 'versus', 'very', 'via', 'viz', 'vs', 'w', 'want', 'wants', 'was', "wasn't", 'way', 'we', "we'd", 'welcome', 'well', "we'll", 'went', 'were', "we're", "weren't", "we've", 'what', 'whatever', "what'll", "what's", "what've", 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', "where's", 'whereupon', 'wherever', 'whether', 'which', 'whichever', 'while', 'whilst', 'whither', 'who', "who'd", 'whoever', 'whole', "who'll", 'whom', 'whomever', "who's", 'whose', 'why', 'will', 'willing', 'wish', 'with', 'within', 'without', 'wonder', "won't", 'would', "wouldn't", 'x', 'y', 'yes', 'yet', 'you', "you'd", "you'll", 'your', "you're", 'yours', 'yourself', 'yourselves', "you've", 'z', 'zero']
In [21]:
vect = TfidfVectorizer(min_df=3, max_df=0.8, ngram_range=(1, 3), token_pattern=r'\b[A-Za-z]{3,}\b', stop_words=stopwords)
X = vect.fit_transform(papers['text'])

X_df = pd.DataFrame(X.toarray(), index=papers.index, columns=vect.get_feature_names())
print(X_df.shape)
(2284, 16431)
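
One simple sanity check (not in the original notebook) is to look at the highest-weighted terms for a single paper, e.g. the PD-Sparse paper previewed earlier:

# Top TF-IDF terms for one paper; 'icml_yenb16' is the PD-Sparse paper shown in the preview above.
X_df.loc['icml_yenb16'].nlargest(10)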

PCA + T-SNE (X_tsne_df) [Exploration]

t-distributed stochastic neighbor embedding (t-SNE) is a machine learning algorithm for dimensionality reduction developed by Geoffrey Hinton and Laurens van der Maaten.

It is a nonlinear dimensionality reduction technique that is particularly well-suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot (that you will see later).

However, as each document vector is very long (16,431 dimensions), we first perform principal component analysis (PCA) to reduce the dimensionality before applying t-SNE.

In [22]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
In [23]:
tsne = TSNE(n_components=2, random_state=0)

if X_df.shape[1] <= 10000:
    X_tsne = tsne.fit_transform(X.toarray())
else:
    # PCA components cannot exceed min(n_samples, n_features), so cap the target dimensionality.
    n_components_pca = min(10000, min(X.shape))
    X_tsne = tsne.fit_transform(PCA(n_components=n_components_pca, random_state=0).fit_transform(X.toarray()))
In [24]:
X_tsne_df = pd.DataFrame(X_tsne, index=papers.index, columns=['x', 'y'])
X_tsne_df.head(3)
Out[24]:
x y
_id
icml_yenb16 8.867819 8.180670
nips_6042 -3.378937 -8.173583
nips_6037 5.635920 6.437424

MiniBatchKmeans Clustering

We proceed to perform clustering to uncover the different tracks in machine learning. We also use visual aids such as scatter plots and word clouds to summarize and observe the emerging trends and changes in the field.

In [25]:
def get_clusters(X, index, n_cluster, batch_size=200, series_name='clusters'):
    # Fit MiniBatchKMeans on the TF-IDF matrix and return each paper's cluster label.
    model = MiniBatchKMeans(n_clusters=n_cluster, batch_size=batch_size, random_state=0).fit(X)
    return pd.Series(model.labels_, index=index, name=series_name)

Visualizing with Scatter Plot

I played around with k between 2 and 31 (inclusive). The clusters make more sense when k is smaller than 10; eventually, I settled on k = 7. A quick way to guide this choice is sketched below.
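
The choice above was made by eyeballing the scatter plots, but one can also plot the clustering inertia for a range of k and look for an "elbow" where adding more clusters stops paying off. This sketch is not part of the original notebook.

# Hypothetical elbow-method sketch: plot MiniBatchKMeans inertia for a range of k.
inertias = []
k_values = range(2, 16)
for k in k_values:
    model = MiniBatchKMeans(n_clusters=k, batch_size=200, random_state=0).fit(X)
    inertias.append(model.inertia_)

plt.plot(list(k_values), inertias, marker='o')
plt.xlabel('k (number of clusters)')
plt.ylabel('Inertia')
plt.title('Elbow curve for MiniBatchKMeans')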

In [26]:
size = 4
aspect = 1.0
palette = 'Set1'
save_dir = './scatterplots'

for k in range(7, 8): # Original: range(2, 16):

    clusters = get_clusters(X, papers.index, n_cluster=k)
    plot_data = pd.concat([X_tsne_df, clusters, papers.year], axis=1)
    plot_data.rename(columns={'year': 'Year', 'clusters': 'Cluster'}, inplace=True)
    
    sns_plot_1 = sns.lmplot('x', 'y', 
                            row='Cluster',
                            col='Year',
                            hue="Cluster",
                            data=plot_data, 
                            markers="o", # http://matplotlib.org/api/markers_api.html#matplotlib.markers.MarkerStyle
                            scatter_kws={"s": 20, "alpha": 0.4},
                            palette=palette, size=size, aspect=aspect, fit_reg=False)
    
    sns_plot_1.savefig('%s/n%d_indiv_clusters.png' % (save_dir, k))
    
    sns_plot_2 = sns.lmplot('x', 'y', 
                            col='Year',
                            hue="Cluster",
                            data=plot_data, 
                            markers="o", # http://matplotlib.org/api/markers_api.html#matplotlib.markers.MarkerStyle
                            scatter_kws={"s": 20, "alpha": 0.4},
                            palette=palette, size=size, aspect=aspect, fit_reg=False)
    
    sns_plot_2.savefig('%s/n%d_all_clusters.png' % (save_dir, k))
    
    sns_plot_3 = sns.lmplot('x', 'y', 
                            hue="Cluster",
                            data=plot_data, 
                            markers="o", # http://matplotlib.org/api/markers_api.html#matplotlib.markers.MarkerStyle
                            scatter_kws={"s": 80, "alpha": 0.4},
                            palette=palette, size=size*3, aspect=aspect, fit_reg=False)

    sns_plot_3.savefig('%s/n%d_all_time.png' % (save_dir, k))    

Visualizing with Bar Plot

We can adopt another perspective and visualize with a bar plot (again with k = 7).

Here, we observe the changes in each of the 7 clusters over the years, in terms of absolute numbers and also in terms of percentages.

As we can see, the percentage of papers from cluster 1 (Deep Neural Networks) shows the most significant increase from 2014 to 2016. Most clusters see their share of papers decrease, with cluster 6 (Latent Markov Models) decreasing the most. It is not difficult to see that many other areas of machine learning have slowly started making way for the rise of deep neural networks. (A sketch of how the percentage shares can be computed appears after the plotting code below.)

In [27]:
n_cluster = 7
In [28]:
clusters = get_clusters(X, papers.index, n_cluster=n_cluster)
clusters.name = 'Clusters'
clusters.head(5)
Out[28]:
_id
icml_yenb16    2
nips_6042      4
nips_6037      2
nips_6038      4
nips_6039      1
Name: Clusters, dtype: int32
In [29]:
new_papers = pd.concat([papers, clusters], axis=1).loc[:, ['title', 'year', 'Clusters']]
new_papers.rename(columns={'title': 'Title', 'year': 'Year'}, inplace=True)
new_papers.head(3)
Out[29]:
Title Year Clusters
_id
icml_yenb16 PD-Sparse : A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification 2016 2
nips_6042 Learning to Communicate with Deep Multi-Agent Reinforcement Learning 2016 4
nips_6037 Sub-sampled Newton Methods with Non-uniform Sampling 2016 2
In [30]:
sns.set_context('poster')
ax = sns.countplot(x='Clusters', hue='Year', data=new_papers, palette='Set1')
ax.set_title('Changes in Clusters over Time')
ax.set(xlabel='Clusters', ylabel='Count')
fig = ax.get_figure()
fig.savefig('./distributions/dist_years_over_clusters.png')
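
The countplot above shows absolute counts per cluster and year. For the percentage shares discussed earlier, one possible sketch (not part of the original notebook) uses a normalized crosstab:

# Percentage of each year's papers that falls into each cluster (each row sums to 100).
shares = pd.crosstab(new_papers['Year'], new_papers['Clusters'], normalize='index') * 100
shares.round(1)

# A stacked bar chart of the same shares.
ax = shares.plot(kind='bar', stacked=True, rot=0, colormap='Set1')
ax.set_ylabel('% of papers in each cluster')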