Created by: Colin Kälin and Lewis Tunstall, December 2019
Blog post: https://bit.ly/2Qj82FF
Summary: This notebook provides a walkthrough on creating topological features for machine learning with giotto-learn.
The dataset comes from the Predicting Molecular Properties competition on Kaggle. The goal is to predict the coupling strength between atoms in molecules. To download the data files run the following cell:
# uncomment and run to download data
# !wget https://storage.googleapis.com/l2f-open-models/molecule-bond-prediction/data.zip; unzip -o data.zip; rm data.zip
Here is a brief overview of the pipeline used in this notebook:
At the end of this walkthrough, you should obtain the results shown in the figure below, which highlight that, for this example, the combination of topological and non-topological features outperforms a model based purely on non-topological features.
%load_ext autoreload
%autoreload 2
# General imports
from itertools import product
import os, random, sys
sys.path.append("src/")
# Other
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action="ignore", category=FutureWarning)
# Import for data handling
import numpy as np
import pandas as pd
import networkx as nx
import pickle
from data.dataset_utils import get_selected_structures
# Machine Learning imports
from gtda.homology import VietorisRipsPersistence
from models.model import cv_model
import gtda.diagrams as diag
# Feature creation
from data.dataset_utils import create_non_TDA_pickle
from features.features import *
# Plotting imports
import seaborn as sns
import matplotlib.pyplot as plt
from visualization.molecule_plotting import plot_molecule
from visualization.plotting import plot_diagram, plot_betti_curves, plot_molecule_types
from visualization.plotting_results import *
from plotly.offline import init_notebook_mode, iplot, plot
import plotly.graph_objs as gobj
For this demo we have prepared a dataset that contains the 100 largest molecules from the training set, along with a variety of non-topological features inspired by this Kaggle kernel.
with open("data/processed/largest_100_molecules.pkl", "rb") as f:
    X, y, molecule_selection, structures, molecule_list = pickle.load(f)
The features are contained in the X dataframe
X.head()
X.shape
where each row corresponds to an atom pair; there are 12,609 such pairs in our dataset. The target variable (the scalar coupling constant) is a continuous quantity contained in y:
y.head()
Metadata about the molecules is contained in the structures dataframe
structures.head()
while molecule_selection and molecule_list contain IDs for each molecule.
For these 100 molecules we have five different bond types: 1JHC, 2JHC, 3JHC, 2JHH and 3JHH. The plot below shows the different bond strengths related to the different types.
plot_molecule_types(molecule_selection)
To show how features generated via topological data analysis (TDA) can be used to improve machine learning models, we will combine topological features with conventional ones.
For the conventional features we have taken those from this Kaggle kernel, which include geometric quantities such as the distance of a given atom to the mean $(x,y,z)$ coordinates -- see the kernel for a detailed explanation of each feature. In case you want to recreate the data on your own, you can use the create_non_TDA_features() function in the features.py script in this repository.
For the purposes of this notebook we can treat these features as given, since the goal is to focus on the creation of topological features.
In general, there are two different ways to generate topological features from data: treat the molecule either as a point cloud or as a graph.
In order to use TDA we start with a point cloud. In our case this is just the molecule where the $(x,y,z)$ coordinates of the atoms are given with respect to the mean.
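As an aside, centring a point cloud on its mean is a one-liner with NumPy. The sketch below uses made-up coordinates (not the real data) to illustrate the idea behind the x_new, y_new, z_new columns:

```python
import numpy as np

# Hypothetical coordinates for a three-atom molecule (not the real data)
coords = np.array([
    [0.0, 0.0, 0.0],
    [1.1, 0.0, 0.0],
    [0.0, 1.1, 0.0],
])

# Subtract the mean position so the new coordinates are centred on the origin,
# analogous to the x_new, y_new, z_new columns used below
centred = coords - coords.mean(axis=0)
```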
selected_structures = get_selected_structures(molecule_selection); selected_structures.head()
Now we can take this point cloud to find some interesting structures. In order to do this, we want to create a persistence diagram using the Vietoris-Rips filtration algorithm.
homology_dimensions = [0, 1, 2]
persistence = VietorisRipsPersistence(
    metric="euclidean", homology_dimensions=homology_dimensions, n_jobs=1
)

persistence_diagrams = []
for molecule in molecule_selection:
    point_cloud = selected_structures[selected_structures["molecule_name"] == molecule][
        ["x_new", "y_new", "z_new"]
    ].values
    # add a leading sample axis: shape (1, n_atoms, 3)
    point_cloud = point_cloud.reshape((1, point_cloud.shape[0], point_cloud.shape[1]))
    persistence_diagrams.append(persistence.fit_transform(point_cloud))
Now we are ready to plot the persistence diagram for a given molecule. In order to learn more about how to interpret this diagram, see this review article.
You can try out one of the 100 molecules by choosing the molecule_idx
variable (between 0 and 99):
molecule_idx = 72
persistence_fig = plot_diagram(persistence_diagrams[molecule_idx][0])
iplot(persistence_fig)
From these diagrams, a variety of features can be extracted to feed our regression model downstream. Since each persistence diagram is a NumPy array of birth-death-dimension triples, we can code up simple functions to manipulate the data. For instance, we can calculate the average lifetime of points as follows:
def average_lifetime(X_scaled, homology_dim):
    """Compute the average lifetime of points in a persistence diagram.

    Parameters
    ----------
    X_scaled : ndarray
        Scaled persistence diagrams.
    homology_dim : int
        Dimension of the homology to consider.

    Returns
    -------
    avg_lifetime_list : list
        Average lifetime for each persistence diagram.
    """
    avg_lifetime_list = []
    for i in range(X_scaled.shape[0]):
        persistence_table = pd.DataFrame(
            X_scaled[i], columns=["birth", "death", "homology"]
        )
        persistence_table["lifetime"] = (
            persistence_table["death"] - persistence_table["birth"]
        )
        avg_lifetime_list.append(
            persistence_table[persistence_table["homology"] == homology_dim][
                "lifetime"
            ].mean()
        )
    return avg_lifetime_list
Feature-generating functions like these can be found in the features.py
module of this repo. Let's import the average_lifetime()
function to see how it works:
from features.features import average_lifetime
for homology_dim in homology_dimensions:
    print(
        f"Homology dimension = {homology_dim} | Average lifetime = {average_lifetime(persistence_diagrams[molecule_idx], homology_dim)}"
    )
In other words, for each molecule we can assign three new features: the average lifetime in each homology dimension. Another interesting feature is the number of "relevant" holes, which counts the cyan-coloured points (H1) that lie further than some predefined threshold from the birth = death diagonal (the dashed line in the diagram):
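To make the idea concrete, here is an illustrative stand-in for such a counting function (this is not the repository's num_relevant_holes, just a minimal sketch that counts points of one homology dimension whose lifetime exceeds a threshold theta):

```python
import numpy as np

def count_long_lived_points(diagram, homology_dim, theta=0.55):
    # diagram: array of (birth, death, homology dimension) triples
    mask = diagram[:, 2] == homology_dim
    lifetimes = diagram[mask, 1] - diagram[mask, 0]
    # a point is "relevant" if it sits further than theta from the diagonal
    return int(np.sum(lifetimes > theta))
```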
molecule_idx = 72
molecule_fig = plot_molecule(molecule_selection[molecule_idx], structures)
iplot(molecule_fig)
# Depending on the threshold 'theta' the function is more or less susceptible to noise
print('Number of relevant holes:', num_relevant_holes(persistence_diagrams[molecule_idx], 1, theta=0.55)[0])
Similar to the persistence diagram, a plot of the Betti curves can be used to extract features. It focuses on how the number of holes changes with increasing radius $\varepsilon$ in the filtration:
molecule_idx = 72
betti_curves = diag.BettiCurve()
betti_curves.fit(persistence_diagrams[molecule_idx])
X_betti_curves = betti_curves.transform(persistence_diagrams[molecule_idx])
betti_fig = plot_betti_curves(X_betti_curves[0])
iplot(betti_fig)
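For intuition, the Betti number at a single filtration value $\varepsilon$ is just the count of points in the persistence diagram that are "alive" at $\varepsilon$. A hand-rolled sketch (independent of giotto's BettiCurve, using our own hypothetical function name) could be:

```python
import numpy as np

def betti_number_at(diagram, homology_dim, epsilon):
    # diagram: array of (birth, death, homology dimension) triples
    mask = diagram[:, 2] == homology_dim
    births = diagram[mask, 0]
    deaths = diagram[mask, 1]
    # a feature is alive at epsilon if it has been born and has not yet died
    return int(np.sum((births <= epsilon) & (epsilon < deaths)))
```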
In a similar manner to persistence diagrams, we can also derive features from the Betti curves. One natural feature is the area under the curve, which, implemented in code, might look something like:
def area_under_Betti_curve(X_betti_curves, homology_dim):
    """Compute the area under the Betti curve for a given homology dimension.

    Parameters
    ----------
    X_betti_curves : ndarray, shape (n_samples, n_homology_dimensions, n_values)
        Betti curves.
    homology_dim : int
        Homology dimension to consider; must be contained in the persistence diagram.

    Returns
    -------
    area : list, shape (n_samples,)
        Areas under the Betti curve for the given homology dimension.
    """
    area = []
    for n in range(X_betti_curves.shape[0]):
        area.append(np.trapz(X_betti_curves[n, homology_dim], dx=1))
    return area
To see it in action for a single molecule, we can import it as follows:
from features.features import area_under_Betti_curve
for homology_dim in homology_dimensions:
    print(
        f"Homology dimension = {homology_dim} | Area under Betti curve = {area_under_Betti_curve(X_betti_curves, homology_dim)}"
    )
Altogether, we created a wide variety of topological features, including:
We provide the pre-calculated set of features, which can be accessed as follows:
with open('data/processed/tda_features_cloud.pkl', 'rb') as f:
    features_dict_cloud = pickle.load(f)
As the features are molecule-specific (we have calculated one feature value per molecule, not per atom pair), we have to assign the correct value to each sample in the dataset.
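Conceptually the assignment is just a lookup from molecule name to feature value, broadcast over the atom-pair rows. A toy sketch (with hypothetical molecule names and values, standing in for what attach_tda_features does for every feature in the dictionary) is:

```python
import pandas as pd

# Toy atom-pair table and a hypothetical molecule-level feature
pairs = pd.DataFrame({"molecule_name": ["mol_a", "mol_a", "mol_b"]})
avg_lifetime_h1 = {"mol_a": 0.42, "mol_b": 0.17}

# Broadcast the per-molecule value onto every atom pair of that molecule
pairs["avg_lifetime_H1"] = pairs["molecule_name"].map(avg_lifetime_h1)
```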
list(features_dict_cloud.keys())
Next we attach these topological features to X:
attach_tda_features(X, features_dict_cloud, molecule_list)
X.head()
We can also take another approach and think about the molecule as a graph instead of as a point cloud. In order to do this, one has to define a distance matrix that indicates the distance between two points in the molecule. Then we can take the same steps as before: calculate the persistence diagram and extract features.
You should know: in order for giotto-learn's VietorisRipsPersistence() class to interpret the matrix we pass as a distance matrix rather than as a point cloud, it is important to set the metric parameter to "precomputed". This can be seen in the computing_persistence_diagram() function.
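As a sketch of the input shape involved, a pairwise distance matrix for one molecule can be built with NumPy and would then be passed in with metric="precomputed" (the persistence call itself is left as a comment, since the heavy computation is loaded from disk below; the coordinates are made up):

```python
import numpy as np

# Hypothetical coordinates for a three-atom molecule
coords = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

# Pairwise Euclidean distances, shape (n_atoms, n_atoms)
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

# giotto expects a leading sample axis: shape (1, n_atoms, n_atoms)
dist = dist[None, :, :]

# persistence = VietorisRipsPersistence(metric="precomputed")
# diagrams = persistence.fit_transform(dist)
```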
Calculating the persistence diagrams and the features takes a while, so we load them from a pickle file. In case you want to recompute the diagrams and the features, set the recalculate_graph_pd variable to True.
pers_diag_list_graph = get_graph_persistence_diagrams(molecule_selection, structures,
recalculate_graph_pd=False)
Let's load the pre-calculated features as a dictionary:
with open('data/processed/tda_features_graph.pkl', 'rb') as f:
    features_dict_graph = pickle.load(f)
And again we add the newly created features to the dataset.
attach_tda_features(X, features_dict_graph, molecule_list, suffix='_graph')
At this point we have created some features and would like to test them. We train an XGBRegressor model and use 5-fold cross-validation. Let's start with the non-topological features and later compare the results if we add topological ones.
The score is calculated as follows (see also here):
$$\text{score} = \frac{1}{T}\sum_{t=1}^{T}\log \left( \frac{1}{n_t}\sum_{i=1}^{n_t}|y_i-\hat{y}_i| \right)$$

where $T$ is the number of bond types, $n_t$ is the number of atom pairs of type $t$, and $y_i$ and $\hat{y}_i$ are the true and predicted scalar coupling constants.
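Written out in code, the metric might look like the following sketch (group_mean_log_mae is our own hypothetical name, not a function from the repository):

```python
import numpy as np
import pandas as pd

def group_mean_log_mae(y_true, y_pred, types):
    # Mean absolute error within each bond type, then the mean of the logs
    df = pd.DataFrame({
        "error": np.abs(np.asarray(y_true) - np.asarray(y_pred)),
        "type": types,
    })
    return df.groupby("type")["error"].mean().apply(np.log).mean()
```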
non_tda_columns = ['atom_index_0', 'atom_index_1', 'type', 'type_0', 'type_1', 'atom_0',
'x_0', 'y_0', 'z_0', 'atom_1', 'x_1', 'y_1', 'z_1', 'dist', 'dist_x',
'dist_y', 'dist_z', 'dist_to_type_mean', 'dist_to_type_0_mean',
'dist_to_type_1_mean', 'molecule_dist_mean_x', 'molecule_dist_std_x',
'molecule_dist_skew_x', 'molecule_dist_kurt_x', 'molecule_dist_mean_y',
'molecule_dist_std_y', 'molecule_dist_skew_y', 'molecule_dist_kurt_y',
'meanx', 'meany', 'meanz', 'dist_0tomean', 'dist_1tomean', 'meanxH',
'meanyH', 'meanzH', 'dist_0tomeanH', 'dist_1tomeanH', 'meanxC',
'meanyC', 'meanzC', 'dist_0tomeanC', 'dist_1tomeanC', 'meanxN',
'meanyN', 'meanzN', 'dist_0tomeanN', 'dist_1tomeanN', 'meanxO',
'meanyO', 'meanzO', 'dist_0tomeanO', 'dist_1tomeanO', 'meanxF',
'meanyF', 'meanzF', 'dist_0tomeanF', 'dist_1tomeanF', 'atom_count',
'atom_0l', 'x_0l', 'y_0l', 'z_0l', 'atom_0r', 'x_0r', 'y_0r', 'z_0r',
'dist_0l', 'dist_0r', 'atom_1l', 'x_1l', 'y_1l', 'z_1l', 'atom_1r',
'x_1r', 'y_1r', 'z_1r', 'dist_1l', 'dist_1r']
results_mean_1, results_details_1 = cv_model(X, y, non_tda_columns)
results_mean_1.append(np.mean(results_mean_1))
What changes if we include the TDA features?
results_mean_2, results_details_2 = cv_model(X, y, X.columns)
results_mean_2.append(np.mean(results_mean_2))
Let's plot the scores and see what improvement we get with TDA. Don't forget that, due to the logarithm in the formula above, a lower score is actually better (i.e. -0.5 is better than -0.3, for example).
create_summary_df(np.array([results_mean_1, results_mean_2]).T)
fig = plot_results(create_summary_df(np.array([results_mean_1, results_mean_2]).T))
iplot(fig)
r1 = np.array(results_mean_1)[:-1]
r2 = np.array(results_mean_2)[:-1]
relative_change = (r2 - r1) / r1 * 100

print('Mean improvement:', np.round(np.mean(relative_change), 2), '%')
print('Max. improvement:', np.round(np.max(relative_change), 2), '%')
print('Min. improvement:', np.round(np.min(relative_change), 2), '%')
It turns out that the topological features improved the score for all bond types, although the improvement was larger for some types than for others.