SC-2 - Can We Distinguish "Close" Model Variants?

In this second experiment in seriation classification, the question is whether we can distinguish variants of the same basic structural model which differ in their details. In this case, I look at "lineage" models with four variants: lineages that split early and evolve for a longer period, split late and have more time as a single unified lineage, lineages that coalesce early and evolve for a longer period, and late coalescence with a longer period of separate evolution. These models are clearly interesting from an archaeological perspective, but since distinguishing them visually in seriations may rely upon additional information to orient things, they may be topologically equivalent. Thus, my expectation in this experiment is very low classification performance, close to chance.

I use the same approach as sc-1: a gradient boosted classifier with the Laplacian eigenvalues as features.

In [1]:
import numpy as np
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cPickle as pickle
from copy import deepcopy
from sklearn.utils import shuffle

%matplotlib inline
plt.style.use("fivethirtyeight")
sns.set()
Couldn't import dot_parser, loading of dot files will not be possible.
In [2]:
all_graphs = pickle.load(open("train-cont-graphs.pkl",'r'))
all_labels = pickle.load(open("train-cont-labels.pkl",'r'))

In addition to needing a train/test split, we need to ensure reasonable class balance. A simple approach to this is simply to shuffle both lists before taking a random sample without replacement.

In [3]:
def train_test_split(graph_list, label_list, test_fraction=0.20):
    """
    Randomly splits a set of graphs and labels into training and testing data sets.  We need a custom function
    because the dataset isn't a numeric matrix, but a list of NetworkX Graph objects.  In case there is class 
    structure (i.e., we filled the arrays first with instances of one class, then another class...) we consistently
    shuffle both lists.
    """
    
    graph_list, label_list = shuffle(graph_list, label_list)
    
    rand_ix = np.random.random_integers(0, len(graph_list), size=int(len(graph_list) * test_fraction))
    print "random indices: %s" % rand_ix
    
    test_graphs = []
    test_labels = []
    
    train_graphs = []
    train_labels = []
    
    # first copy the chosen test values, without deleting anything since that would alter the indices
    for ix in rand_ix:
        test_graphs.append(graph_list[ix])
        test_labels.append(label_list[ix])
        
    # now copy the indices that are NOT in the test index list
    for ix in range(0, len(graph_list)):
        if ix in rand_ix:
            continue
        train_graphs.append(graph_list[ix])
        train_labels.append(label_list[ix])
        
    return (train_graphs, train_labels, test_graphs, test_labels)

Feature Engineering

The goal here is to construct a standard training and test data matrix of numeric values, which will contain the sorted Laplacian eigenvalues of the graphs in each data set. One feature will thus represent the largest eigenvalue for each graph, a second feature will represent the second largest eigenvalue, and so on.

We do not necessarily assume that all of the graphs have the same number of vertices, although if there are marked differences, we would need to handle missing data for those graphs which had many fewer eigenvalues (or restrict our slice of the spectrum to the smallest number of eigenvalues present).

In [4]:
train_graphs, train_labels, test_graphs, test_labels = train_test_split(all_graphs, all_labels, test_fraction=0.1)
print "train size: %s" % len(train_graphs)
print "test size: %s" % len(test_graphs)
random indices: [164 188 191  40  48  11  87 119 105 171 156  47 164 158 185  50 139  32
 101 158]
train size: 182
test size: 20
In [5]:
def graphs_to_eigenvalue_matrix(graph_list, num_eigenvalues = None):
    """
    Given a list of NetworkX graphs, returns a numeric matrix where rows represent graphs, 
    and columns represent the reverse sorted eigenvalues of the Laplacian matrix for each graph,
    possibly trimmed to only use the num_eigenvalues largest values.  If num_eigenvalues is 
    unspecified, all eigenvalues are used.
    """    
    # we either use all of the eigenvalues, or the number requested (and zero-pad if needed)
    if num_eigenvalues is None:
        ev_used = n
    else:
        ev_used = num_eigenvalues
    
    data_mat = np.zeros((len(graph_list),ev_used))
    
    for ix in range(0, len(graph_list)):
        spectrum = sorted(nx.spectrum.laplacian_spectrum(graph_list[ix], weight=None), reverse=True)
        # if the spectrum is shorter than the number of eigenvalues used (due to multiplicity), zero pad the result
        if len(spectrum) < ev_used:
            spectrum = np.lib.pad(spectrum, (0,ev_used-len(spectrum)), 'constant', constant_values=(0,0))
        data_mat[ix,:] = spectrum[0:ev_used]
        
    return data_mat
    

Classifier

We're going to be using a gradient boosted classifier, which has some of best accuracy of any of the standard classifier methods. Ultimately we'll figure out the best hyperparameters using cross-validation, but first we just want to see whether the approach gets us anywhere in the right ballpark -- remember, we can 80% accuracy with just eigenvalue distance, so we have to be in that neighborhood or higher to be worth the effort of switching to a more complex model.

In [6]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
In [7]:
train_matrix = graphs_to_eigenvalue_matrix(train_graphs, num_eigenvalues=20)
test_matrix = graphs_to_eigenvalue_matrix(test_graphs, num_eigenvalues=20)
print train_matrix.shape
print test_matrix.shape
(182, 20)
(20, 20)
In [8]:
clf = GradientBoostingClassifier(n_estimators = 250)
clf.fit(train_matrix, train_labels)
Out[8]:
GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',
              max_depth=3, max_features=None, max_leaf_nodes=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=250,
              random_state=None, subsample=1.0, verbose=0,
              warm_start=False)
In [9]:
pred_label = clf.predict(test_matrix)
In [10]:
cm = confusion_matrix(test_labels, pred_label)
cmdf = pd.DataFrame(cm)
cmdf.columns = map(lambda x: 'predicted {}'.format(x), cmdf.columns)
cmdf.index = map(lambda x: 'actual {}'.format(x), cmdf.index)

print cmdf
print classification_report(test_labels, pred_label)
print "Accuracy on test: %0.3f" % accuracy_score(test_labels, pred_label)
          predicted 0  predicted 1  predicted 2  predicted 3
actual 0            3            2            0            0
actual 1            1            1            0            0
actual 2            0            0            2            2
actual 3            0            0            4            5
             precision    recall  f1-score   support

          0       0.75      0.60      0.67         5
          1       0.33      0.50      0.40         2
          2       0.33      0.50      0.40         4
          3       0.71      0.56      0.63         9

avg / total       0.61      0.55      0.57        20

Accuracy on test: 0.550

Overall, the accuracy is low, but interestingly, there is a pattern. We never mistake seriations which have an "early" event from those with a "late" event, but we have trouble telling a early split from an early coalescence, and trouble telling a late split from a late coalescence. This is a slightly weird result, actually.

Finding Optimal Hyperparameters

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
In [12]:
pipeline = Pipeline([
        ('clf', GradientBoostingClassifier())
    ])

params = {
     'clf__learning_rate': [5.0,2.0,1.0, 0.75, 0.5, 0.25, 0.1, 0.05, 0.01, 0.005],
     'clf__n_estimators': [10,25,50,100,250,500,1000]
}

grid_search = GridSearchCV(pipeline, params, cv=5, n_jobs = -1, verbose = 1)
In [13]:
grid_search.fit(train_matrix, train_labels)
Fitting 5 folds for each of 70 candidates, totalling 350 fits
[Parallel(n_jobs=-1)]: Done   1 jobs       | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done  50 jobs       | elapsed:    4.8s
[Parallel(n_jobs=-1)]: Done 200 jobs       | elapsed:   23.5s
[Parallel(n_jobs=-1)]: Done 350 out of 350 | elapsed:   48.6s finished
Out[13]:
GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('clf', GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',
              max_depth=3, max_features=None, max_leaf_nodes=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              random_state=None, subsample=1.0, verbose=0,
              warm_start=False))]),
       fit_params={}, iid=True, loss_func=None, n_jobs=-1,
       param_grid={'clf__learning_rate': [5.0, 2.0, 1.0, 0.75, 0.5, 0.25, 0.1, 0.05, 0.01, 0.005], 'clf__n_estimators': [10, 25, 50, 100, 250, 500, 1000]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=1)
In [14]:
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters:")
best_params = grid_search.best_estimator_.get_params()
for param in sorted(params.keys()):
    print("param: %s: %r" % (param, best_params[param]))
Best score: 0.593
Best parameters:
param: clf__learning_rate: 1.0
param: clf__n_estimators: 50
In [15]:
pred_label = grid_search.predict(test_matrix)
In [16]:
cm = confusion_matrix(test_labels, pred_label)
cmdf = pd.DataFrame(cm)
cmdf.columns = map(lambda x: 'predicted {}'.format(x), cmdf.columns)
cmdf.index = map(lambda x: 'actual {}'.format(x), cmdf.index)

print cmdf
print classification_report(test_labels, pred_label)
print "Accuracy on test: %0.3f" % accuracy_score(test_labels, pred_label)
          predicted 0  predicted 1  predicted 2  predicted 3
actual 0            3            2            0            0
actual 1            1            1            0            0
actual 2            0            0            3            1
actual 3            0            0            3            6
             precision    recall  f1-score   support

          0       0.75      0.60      0.67         5
          1       0.33      0.50      0.40         2
          2       0.50      0.75      0.60         4
          3       0.86      0.67      0.75         9

avg / total       0.71      0.65      0.66        20

Accuracy on test: 0.650

Observations

The results are actually more encouraging than I expected, but also stranger in terms of the ability to tell early lineage events from late lineage events, but to have significant difficulty telling splitting from coalescence. If anything would be distinguishable, I'd have thought early split/late coalescence or early coalescence/late split might be distinguishable. I need to look at more of the actual seriation graphs and see why this might be the case.