Topological feature extraction using VietorisRipsPersistence and PersistenceEntropy

In this notebook, we showcase the ease of use of one of the core components of giotto-tda: VietorisRipsPersistence, along with vectorization methods. We first list steps in a typical, topological-feature extraction routine and then show to encapsulate them with a standard scikit-learn–like pipeline.

If you are looking at a static version of this notebook and would like to run its contents, head over to GitHub and download the source.

License: AGPLv3

Generate data

Let's begin by generating 3D point clouds of spheres and tori, along with a label of 0 (1) for each sphere (torus). We also add noise to each point cloud, whose effect is to displace the points sampling the surfaces by a random amount in a random direction. Note: You will need the auxiliary module to run this cell. You can change the second argument of generate_point_clouds to obtain a finer or coarser sampling, or tune the level of noise via the third.

In [ ]:
from data.generate_datasets import make_point_clouds
n_samples_per_class = 10
point_clouds, labels = make_point_clouds(n_samples_per_class, 10, 0.1)
print(f"There are {point_clouds.shape[0]} point clouds in {point_clouds.shape[2]} dimensions, "
      f"each with {point_clouds.shape[1]} points.")

Calculate persistent homology

Instantiate a VietorisRipsPersistence transformer and calculate so-called persistence diagrams for this collection of point clouds.

In [ ]:
from gtda.homology import VietorisRipsPersistence

VR = VietorisRipsPersistence(homology_dimensions=[0, 1, 2])  # Parameter explained in the text
diagrams = VR.fit_transform(point_clouds)

Important note: VietorisRipsPersistence, and all other "persistent homology" transformers in gtda.homology, expect input in the form of a 3D array or, in some cases, a list of 2D arrays. For each entry in the input (here, for each point cloud in point_clouds) they compute a topological summary which is also a 2D array, and then stack all these summaries into a single output 3D array. So, in our case, diagrams[i] represents the topology of point_clouds[i]. diagrams[i] is interpreted as follows:

  • Each row is a triplet describing a single topological feature found in point_clouds[i].
  • The first and second entries (respectively) in the triplet denote the values of the "filtration parameter" at which the feature appears or disappears respectively. They are referred to as the "birth" and "death" values of the feature (respectively). The meaning of "filtration parameter" depends on the specific transformer, but in the case of VietorisRipsPersistence on point clouds it has the interpretation of a length scale.
  • A topological feature can be a connected component, 1D hole/loop, 2D cavity, or more generally $d$-dimensional "void" which exists in the data at scales between its birth and death values. The integer $d$ is the homology dimension (or degree) of the feature and is stored as the third entry in the triplet. In this example, the shapes should have 2D cavities so we explicitly tell VietorisRipsPersistence to look for these by using the homology_dimensions parameter!

If we make one scatter plot per available homology dimension, and plot births and deaths as x- and y-coordinates of points in 2D, we end up with a 2D representation of diagrams[i], and the reason why it is called a persistence diagram:

In [ ]:
from gtda.plotting import plot_diagram

i = 0

The notebook Plotting in giotto-tda delves deeper into the plotting functions and class methods available in giotto-tda.

Extract features

Instantiate a PersistenceEntropy transformer and extract scalar features from the persistence diagrams.

In [ ]:
from gtda.diagrams import PersistenceEntropy

PE = PersistenceEntropy()
features = PE.fit_transform(diagrams)

Use the new features in a standard classifier

Leverage the compatibility with scikit-learn to perform a train-test split and score the features.

In [ ]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(features, labels)
model = RandomForestClassifier(), y_train)
model.score(X_valid, y_valid)

Encapsulate the steps above in a pipeline

  1. Define an end-to-end pipeline by chaining transformers from giotto-tda with scikit-learn ones
  2. Train-test split the input point cloud data and labels.
  3. Fir the pipeline on the training data.
  4. Score the fitted pipeline on the test data.
In [ ]:
from sklearn.pipeline import make_pipeline

steps = [VietorisRipsPersistence(homology_dimensions=[0, 1, 2]),
pipeline = make_pipeline(*steps)

pcs_train, pcs_valid, labels_train, labels_valid = train_test_split(point_clouds, labels), labels_train)

pipeline.score(pcs_valid, labels_valid)