Created by: Davide Burba, December 2019
In this notebook, we apply the Mapper algorithm to point cloud data by using giotto-tda, an open-source Python library for topological data analysis.
%load_ext autoreload
%autoreload 2
# data wrangling
import pandas as pd
import numpy as np
# tda magic
from gtda.mapper.filter import Projection
from gtda.mapper.cover import (
OneDimensionalCover,
CubicalCover
)
from gtda.mapper.pipeline import make_mapper_pipeline
from gtda.mapper.visualization import (
plot_static_mapper_graph,
plot_interactive_mapper_graph
)
# data viz
import plotly.graph_objects as go
from matplotlib.colors import to_rgba_array, to_rgba
# ml tools
from sklearn.cluster import DBSCAN
import warnings
warnings.filterwarnings("ignore")
The starting point for most Mapper analyses is a point cloud in an $n$-dimensional Euclidean space. For our purposes we'll use a dataset containing 20,000 points sampled from a Santa Claus shape in 3-dimensions.
Note: The dataset has been sampled from a mesh object available here through the usage of CloudCompare.
data = pd.read_csv("data.csv")
data.head()
As we can see, the data consists of 3-dimensional coordinates $ v_i = (x_i, y_i, z_i)$ for each point, along with a label for it's colour. Let's have a look at this 3-dimensional point cloud:
fig = go.Figure(
data=[
go.Scatter3d(
x=data.x,
y=data.y,
z=data.z,
mode="markers",
marker=dict(size=1, color=data.color,),
)
]
)
fig.show()
Let's consider an $n$-dimensional point cloud, i.e. a dataset with $n$ numerical features. The Mapper algorithm follows three main steps:
The topological graph is then composed of:
Even if its formulation could seem a bit intimidating, the best way to intuitively understand how the Mapper algorithm works is to use it and "play" with its parameters. The API of giotto-tda is scikit-learn compatible and provides a very convenient way to fit the Mapper through a pipeline object.
We will apply the Mapper algorithm to the Santa Claus point cloud (the so-called "Santa Cloud"). Our goal is to get a minimal graph which preserves the topological properties of the original figure.
First, let's define an array of coordinates to act as the point cloud to analyse:
coords = data[["x","y","z"]].values
The easiest way to produce a meaningful graph from a dataset is to use the identity as filter function.
Formally, we map each vector of coordinates $\vec{v}_i$ to itself via $f: \vec{v} \to \vec{v}$. Since the mapped values lie in $\mathbb{R}^3$, we use a 3-dimensional cover.
def plot_mapper_graph_from_pipeline(pipe):
""" A helper function, that plots the graph
resulting from applying the pipeline `pipe`
on the data.
Parameters
-----------
pipe :
"""
# generate topological graph from point cloud
graph = pipeline.fit_transform(coords)
# get cluster member indices
node_elements = graph["node_metadata"]["node_elements"]
# configure choice of layout (x,z values)
layout = np.array([np.mean(coords[el], axis=0)[[0,2]] for el in node_elements])
# define node coloring
node_colors = np.array([data.loc[el, "color"].value_counts().index[0] for el in node_elements])
plotly_kwargs = {
'node_trace_marker_colorscale': None,
'node_trace_marker_showscale': False
}
# initialise and display figure
fig = plot_static_mapper_graph(pipeline, coords,
layout, plotly_kwargs=plotly_kwargs)
fig.update_traces(patch={'hoverlabel_bgcolor': node_colors,
'marker_color': node_colors})
fig.show(config={'scrollZoom':True})
return fig
# configure Mapper pipeline
pipeline = make_mapper_pipeline(
filter_func=None,
cover=CubicalCover(),
clusterer=DBSCAN(),
)
# plot the graph arising from applying the pipeline
fig = plot_mapper_graph_from_pipeline(pipeline)
We can see that the Mapper output preserved the geometrical structure of Santa Claus. However, using the identity as filter function could be prohibitive in high dimensional spaces!
Let's try to build a more realistic example by using a 2D filter projection.
We map each vector of coordinates $\vec{v}_i = (x_i, y_i, z_i)$ to the $x$ and $z$ axis via $f: \vec{v} \to (x_i, z_i)$. Since the mapped values lie in $\mathbb{R}^2$, we use a 2-dimensional cover.
# configure Mapper pipeline
pipeline = make_mapper_pipeline(
filter_func=Projection(columns=[0,2]),
cover=CubicalCover(),
clusterer=DBSCAN(),
)
# plot the graph arising from applying the pipeline
fig = plot_mapper_graph_from_pipeline(pipeline)
We can see that we lost information on the Santa Claus depth, but we can clearly distinguish the figure anyway. Let's try to get an even more minimal graph with a 1-dimensional projection.
Let use the "height" of the Santa cloud as a filter function, i.e. we map each vector of coordinates $\vec{v}_i = (x_i, y_i, z_i)$ to the $z$-axis via $f: \vec{v} \to z$. Since the mapped values lie in $\mathbb{R}$, we use a 1-dimensional cover.
# configure Mapper pipeline
pipeline = make_mapper_pipeline(
filter_func=Projection(columns=2),
cover=OneDimensionalCover(),
clusterer=DBSCAN(),
)
# plot the graph arising from applying the pipeline
fig = plot_mapper_graph_from_pipeline(pipeline)
The graph is not very satisfactory as we cannot distinguish any part of Santa Claus' body. Since the graph corresponds to a line of connected points, it means that the clustering algorithm always returned a unique cluster for each interval in the cover.
Let's try to use more restrictive clustering parameters.
# configure Mapper pipeline
pipeline = make_mapper_pipeline(
filter_func=Projection(columns=2),
cover=OneDimensionalCover(),
clusterer=DBSCAN(eps=0.06),
)
# plot the graph arising from applying the pipeline
fig = plot_mapper_graph_from_pipeline(pipeline)
We can now distinguish the two legs, but if you zoom in you can see that Santa now has 4 arms!
We need to increase the overlapping fraction of the intervals.
# configure Mapper pipeline
pipeline = make_mapper_pipeline(
filter_func=Projection(columns=2),
cover=OneDimensionalCover(overlap_frac=0.35),
clusterer=DBSCAN(eps=0.06),
)
# plot the graph arising from applying the pipeline
fig = plot_mapper_graph_from_pipeline(pipeline)
Now we can clearly distinguish the legs, the arms and the head.
Let's see what happens if we increase the number of intervals in the cover to 20 (default is 10).
# configure Mapper pipeline
pipeline = make_mapper_pipeline(
filter_func=Projection(columns=2),
cover=OneDimensionalCover(n_intervals=20, overlap_frac=0.35),
clusterer=DBSCAN(eps=0.06),
)
# plot the graph arising from applying the pipeline
fig = plot_mapper_graph_from_pipeline(pipeline)
If we increase the number of bins in the cover to 20, we see the appearance of a node representing the hat and a branch separating the chest from the beard
As we saw above, it takes several steps of interation to obtain a topological Mapper graph that accurately captures the underlying structure of the point cloud. To help speedup the iterations, giotto-tda provides the option to dynamically change the Mapper parameters and see how this affects the output!
# configure Mapper pipeline
pipeline = make_mapper_pipeline(
filter_func=Projection(columns=2),
cover=OneDimensionalCover(),
clusterer=DBSCAN(),
)
# generate figure
plot_interactive_mapper_graph(pipeline, coords)