#!/usr/bin/env python
# coding: utf-8

# # Table of Contents #

# 1  SC-2 - Can We Distinguish "Close" Model Variants?
# 1.1  Feature Engineering
# 1.2  Classifier
# 1.3  Finding Optimal Hyperparameters
# 1.4  Observations
# # SC-2 - Can We Distinguish "Close" Model Variants? #
#
# In this second experiment in seriation classification, the question is whether we can
# distinguish variants of the same basic structural model which differ only in their details.
# Here I look at "lineage" models with four variants: lineages that split early and then
# evolve separately for a long period, lineages that split late and spend more time as a
# single unified lineage, lineages that coalesce early and then evolve jointly for a long
# period, and lineages that coalesce late after a longer period of separate evolution.
# These models are clearly interesting from an archaeological perspective, but since
# distinguishing them visually in seriations may rely upon additional information to orient
# the diagrams, they may be topologically equivalent.  My expectation for this experiment
# is therefore very low classification performance, close to chance.
#
# I use the same approach as SC-1: a gradient boosted classifier with the Laplacian
# eigenvalues as features.

# In[1]:

import numpy as np
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cPickle as pickle
from copy import deepcopy
from sklearn.utils import shuffle

get_ipython().run_line_magic('matplotlib', 'inline')
plt.style.use("fivethirtyeight")
sns.set()


# In[2]:

all_graphs = pickle.load(open("train-cont-graphs.pkl", 'r'))
all_labels = pickle.load(open("train-cont-labels.pkl", 'r'))


# In addition to needing a train/test split, we need to ensure reasonable class balance.
# A simple approach is to shuffle both lists consistently before taking a random sample
# without replacement.

# In[3]:

def train_test_split(graph_list, label_list, test_fraction=0.20):
    """
    Randomly splits a set of graphs and labels into training and testing data sets.
    We need a custom function because the dataset isn't a numeric matrix, but a list of
    NetworkX Graph objects.  In case there is class structure (i.e., we filled the arrays
    first with instances of one class, then another class...) we shuffle both lists
    consistently before sampling.
    """
    graph_list, label_list = shuffle(graph_list, label_list)

    # sample test indices without replacement so no graph appears twice in the test set
    rand_ix = np.random.choice(len(graph_list), size=int(len(graph_list) * test_fraction), replace=False)
    print "random indices: %s" % rand_ix

    test_graphs = []
    test_labels = []
    train_graphs = []
    train_labels = []

    # first copy the chosen test values, without deleting anything since that would alter the indices
    for ix in rand_ix:
        test_graphs.append(graph_list[ix])
        test_labels.append(label_list[ix])

    # now copy the indices that are NOT in the test index list
    for ix in range(0, len(graph_list)):
        if ix in rand_ix:
            continue
        train_graphs.append(graph_list[ix])
        train_labels.append(label_list[ix])

    return (train_graphs, train_labels, test_graphs, test_labels)


# ## Feature Engineering ##

# The goal here is to construct a standard training and test data matrix of numeric values,
# which will contain the sorted Laplacian eigenvalues of the graphs in each data set.  One
# feature will thus represent the largest eigenvalue for each graph, a second feature will
# represent the second largest eigenvalue, and so on.
#
# We do not necessarily assume that all of the graphs have the same number of vertices,
# although if there are marked differences, we would need to handle missing data for those
# graphs which have fewer eigenvalues (or restrict our slice of the spectrum to the smallest
# number of eigenvalues present).
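# A quick check of the vertex-count distribution tells us whether padding or trimming will
# matter when we fix the number of eigenvalues below.  This is a minimal sketch added for
# illustration, assuming `all_graphs` has been loaded as above.

# In[ ]:

# look at the spread of graph sizes before choosing num_eigenvalues
graph_sizes = [g.number_of_nodes() for g in all_graphs]
print "vertex counts -- min: %d  max: %d  mean: %0.1f" % (min(graph_sizes), max(graph_sizes), np.mean(graph_sizes))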
# In[4]:

train_graphs, train_labels, test_graphs, test_labels = train_test_split(all_graphs, all_labels, test_fraction=0.1)
print "train size: %s" % len(train_graphs)
print "test size: %s" % len(test_graphs)


# In[5]:

def graphs_to_eigenvalue_matrix(graph_list, num_eigenvalues=None):
    """
    Given a list of NetworkX graphs, returns a numeric matrix where rows represent graphs,
    and columns represent the reverse sorted eigenvalues of the Laplacian matrix for each
    graph, possibly trimmed to only use the num_eigenvalues largest values.  If
    num_eigenvalues is unspecified, enough columns are used to hold the full spectrum of
    the largest graph.
    """
    # we either use the number of eigenvalues requested, or enough columns for the
    # largest graph's full spectrum (zero-padding shorter spectra as needed)
    if num_eigenvalues is None:
        ev_used = max([g.number_of_nodes() for g in graph_list])
    else:
        ev_used = num_eigenvalues

    data_mat = np.zeros((len(graph_list), ev_used))

    for ix in range(0, len(graph_list)):
        spectrum = sorted(nx.spectrum.laplacian_spectrum(graph_list[ix], weight=None), reverse=True)

        # if the spectrum is shorter than the number of eigenvalues used (due to multiplicity), zero pad the result
        if len(spectrum) < ev_used:
            spectrum = np.lib.pad(spectrum, (0, ev_used - len(spectrum)), 'constant', constant_values=(0, 0))

        data_mat[ix, :] = spectrum[0:ev_used]

    return data_mat


# ## Classifier ##

# We're going to use a gradient boosted classifier, which has some of the best accuracy of
# any of the standard classifier methods.  Ultimately we'll find the best hyperparameters
# using cross-validation, but first we just want to see whether the approach gets us into
# the right ballpark -- remember, we can get roughly 80% accuracy with just eigenvalue
# distance, so we have to be in that neighborhood or higher to be worth the effort of
# switching to a more complex model.

# In[6]:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


# In[7]:

train_matrix = graphs_to_eigenvalue_matrix(train_graphs, num_eigenvalues=20)
test_matrix = graphs_to_eigenvalue_matrix(test_graphs, num_eigenvalues=20)

print train_matrix.shape
print test_matrix.shape


# In[8]:

clf = GradientBoostingClassifier(n_estimators=250)
clf.fit(train_matrix, train_labels)


# In[9]:

pred_label = clf.predict(test_matrix)


# In[10]:

cm = confusion_matrix(test_labels, pred_label)
cmdf = pd.DataFrame(cm)
cmdf.columns = map(lambda x: 'predicted {}'.format(x), cmdf.columns)
cmdf.index = map(lambda x: 'actual {}'.format(x), cmdf.index)

print cmdf
print classification_report(test_labels, pred_label)
print "Accuracy on test: %0.3f" % accuracy_score(test_labels, pred_label)


# Overall, the accuracy is low, but interestingly, there is a pattern.  We never mistake
# seriations with an "early" event for those with a "late" event, but we have trouble
# telling an early split from an early coalescence, and trouble telling a late split from
# a late coalescence.  This is a slightly weird result, actually.
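# The confusion pattern is easier to see as a heatmap.  This is a minimal sketch added for
# illustration; it assumes the `cm` matrix and fitted `clf` from the cells above, and uses
# the seaborn import from the top of the notebook.

# In[ ]:

# heatmap of the confusion matrix computed above; clf.classes_ supplies the label order
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', xticklabels=clf.classes_, yticklabels=clf.classes_)
plt.xlabel("predicted class")
plt.ylabel("actual class")
plt.title("Confusion matrix: lineage model variants")
plt.show()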
# ## Finding Optimal Hyperparameters ##

# In[11]:

from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV


# In[12]:

pipeline = Pipeline([
    ('clf', GradientBoostingClassifier())
])

params = {
    'clf__learning_rate': [5.0, 2.0, 1.0, 0.75, 0.5, 0.25, 0.1, 0.05, 0.01, 0.005],
    'clf__n_estimators': [10, 25, 50, 100, 250, 500, 1000]
}

grid_search = GridSearchCV(pipeline, params, cv=5, n_jobs=-1, verbose=1)


# In[13]:

grid_search.fit(train_matrix, train_labels)


# In[14]:

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters:")
best_params = grid_search.best_estimator_.get_params()
for param in sorted(params.keys()):
    print("param: %s: %r" % (param, best_params[param]))


# In[15]:

pred_label = grid_search.predict(test_matrix)


# In[16]:

cm = confusion_matrix(test_labels, pred_label)
cmdf = pd.DataFrame(cm)
cmdf.columns = map(lambda x: 'predicted {}'.format(x), cmdf.columns)
cmdf.index = map(lambda x: 'actual {}'.format(x), cmdf.index)

print cmdf
print classification_report(test_labels, pred_label)
print "Accuracy on test: %0.3f" % accuracy_score(test_labels, pred_label)


# ## Observations ##

# The results are more encouraging than I expected, but also stranger: the classifier can
# tell early lineage events from late ones, yet has significant difficulty telling splitting
# from coalescence.  If anything were going to be distinguishable, I would have expected it
# to be early splits from late coalescences, or early coalescences from late splits.  I need
# to look at more of the actual seriation graphs to see why this might be the case.
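# As a starting point for that follow-up, a minimal sketch (added for illustration) that
# draws one example seriation graph per class so the topologies can be compared by eye.  It
# assumes `all_graphs` and `all_labels` are the parallel lists loaded at the top of the
# notebook, with `all_labels` a plain Python list.

# In[ ]:

# one example graph per class label, drawn with a spring layout
classes = sorted(set(all_labels))
fig, axes = plt.subplots(1, len(classes), figsize=(4 * len(classes), 4))
for ax, cls in zip(axes, classes):
    example = all_graphs[all_labels.index(cls)]   # first graph carrying this label
    nx.draw_spring(example, ax=ax, node_size=30, with_labels=False)
    ax.set_title(str(cls))
plt.show()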