In this notebook we explore a few of the core features included in giotto-tda's implementation of the Mapper algorithm.
%load_ext autoreload
%autoreload 2
# data wrangling
import numpy as np
import pandas as pd
# data viz
import plotly.graph_objects as go
# tda magic
from gtda.mapper import (
CubicalCover,
make_mapper_pipeline,
Projection,
plot_static_mapper_graph,
plot_interactive_mapper_graph
)
from gtda.mapper.utils.visualization import set_node_sizeref
# ml tools
from sklearn import datasets
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
As a simple example, let's generate a two-dimensional point cloud of two concentric circles. The goal will be to examine how Mapper can be used to generate a topological graph that captures the salient features of the data.
data, _ = datasets.make_circles(n_samples=5000, noise=0.05, factor=0.3, random_state=42)
fig = go.Figure(
data=go.Scatter(x=data[:, 0], y=data[:, 1], mode="markers"),
layout={"autosize": False},
)
fig.show()
Given a dataset ${\cal D}$ of points $x \in \mathbb{R}^n$, the basic steps behind Mapper are as follows:

1. Map ${\cal D}$ to a lower-dimensional space using a filter function $f: \mathbb{R}^n \to \mathbb{R}^m$. Common choices for the filter function include projection onto one or more axes, or dimensionality-reduction techniques such as PCA.
2. Construct a cover of the filter values, typically in the form of a set of overlapping intervals of constant length.
3. For each interval in the cover, cluster the points in its preimage into sets $C_1, \ldots, C_k$.
4. Construct the topological graph whose nodes are the cluster sets, with an edge between two nodes whenever their clusters share points in common.
# giotto-tda filter functions and covers (placeholder class names, for illustration)
from gtda.mapper.filter import FilterFunctionName
from gtda.mapper.cover import CoverName
# scikit-learn method
from sklearn.cluster import ClusteringAlgorithm
# giotto-tda method
from gtda.mapper.cluster import FirstSimpleGap
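To make the four steps concrete, here is a minimal, hand-rolled sketch of the algorithm — not giotto-tda's actual implementation, just an illustration assuming a one-dimensional projection filter, a uniform interval cover, and DBSCAN clustering:

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import DBSCAN

def mapper_sketch(X, n_intervals=10, overlap_frac=0.3, eps=0.1):
    # Step 1: filter function - here, projection onto the first coordinate
    f = X[:, 0]
    # Step 2: cover the filter range with overlapping intervals
    f_min, f_max = f.min(), f.max()
    width = (f_max - f_min) / n_intervals
    pad = width * overlap_frac
    starts = f_min + width * np.arange(n_intervals)
    intervals = [(s - pad, s + width + pad) for s in starts]
    # Step 3: cluster the preimage of each interval
    nodes = []  # each node is the set of data indices in one cluster
    for lo, hi in intervals:
        idx = np.where((f >= lo) & (f <= hi))[0]
        if len(idx) == 0:
            continue
        labels = DBSCAN(eps=eps).fit_predict(X[idx])
        for lab in set(labels) - {-1}:  # -1 marks DBSCAN noise points
            nodes.append(set(idx[labels == lab]))
    # Step 4: connect nodes whose clusters share at least one point
    edges = [(i, j)
             for i in range(len(nodes))
             for j in range(i + 1, len(nodes))
             if nodes[i] & nodes[j]]
    return nodes, edges
```

The real pipeline below generalises each of these choices: any transformer as the filter, any cover scheme, and any clusterer.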
These four steps are implemented in the MapperPipeline object, which mimics the Pipeline class from scikit-learn. We provide a convenience function make_mapper_pipeline() that allows you to pass the choice of filter function, cover, and clustering algorithm as arguments. For example, to project our data onto the $x$- and $y$-axes, we could set up the pipeline as follows:
# define filter function - can be any scikit-learn transformer
filter_func = Projection(columns=[0, 1])
# define cover
cover = CubicalCover(n_intervals=10, overlap_frac=0.3)
# choose clustering algorithm - default is DBSCAN
clusterer = DBSCAN()
# configure parallelism of clustering step
n_jobs = 1
# initialise pipeline
pipe = make_mapper_pipeline(
filter_func=filter_func,
cover=cover,
clusterer=clusterer,
verbose=False,
n_jobs=n_jobs,
)
With the Mapper pipeline at hand, it is now a simple matter to visualise the resulting graph. To warm up, let's examine it in two dimensions using the default arguments of giotto-tda's plotting function:
fig = plot_static_mapper_graph(pipe, data)
# display figure
fig.show(config={"scrollZoom": True})
From the figure we can see that we have captured the salient topological features of our underlying data, namely two holes!
By default, the nodes of the Mapper graph are coloured by the mean value of the points that belong to a given node. However, in this example it is more instructive to colour by the $x$- and $y$-axes. This can be achieved by setting color_by_columns_dropdown=True, which calculates the colouring for each column in the input data array. At the same time, let's configure the choice of colorscale:
plotly_kwargs = {
'node_trace_marker_colorscale':'Blues'
}
fig = plot_static_mapper_graph(pipe, data, color_by_columns_dropdown=True, plotly_kwargs=plotly_kwargs)
# display figure
fig.show(config={"scrollZoom": True})
In the dropdown menu, the entry color_variable
refers to a user-defined quantity to colour by; by default it is the average value of the points in each node. In general, one can configure this quantity to be an array, a scikit-learn transformer, or a list of indices to select from the data. For example, colouring by a PCA component can be implemented as follows:
# initialise estimator to color graph by
pca = PCA(n_components=1).fit(data)
fig = plot_static_mapper_graph(pipe, data, color_by_columns_dropdown=True, color_variable=pca)
# display figure
fig.show(config={"scrollZoom": True})
It is also possible to feed plot_static_mapper_graph()
a pandas DataFrame:
data_df = pd.DataFrame(data, columns=['x', 'y']); data_df.head()
Before plotting we need to update the Mapper pipeline to know about the projection onto the column names. This can be achieved using the set_params()
method as follows:
pipe.set_params(filter_func=Projection(columns=['x', 'y']));
fig = plot_static_mapper_graph(pipe, data_df, color_by_columns_dropdown=True)
# display figure
fig.show(config={"scrollZoom": True})
By default, plot_static_mapper_graph()
uses the Kamada–Kawai algorithm for the layout; however any of the layout algorithms defined in python-igraph are supported (see here for a list of possible layouts). For example, we can switch to the Fruchterman–Reingold layout as follows:
# reset back to the column-index projection for numpy arrays
pipe.set_params(filter_func=Projection(columns=[0,1]));
fig = plot_static_mapper_graph(pipe, data, layout='fruchterman_reingold', color_by_columns_dropdown=True)
# display figure
fig.show(config={"scrollZoom": True})
It is also possible to visualise the Mapper graph in three dimensions by configuring the layout_dim
argument:
fig = plot_static_mapper_graph(pipe, data, layout_dim=3, color_by_columns_dropdown=True)
# display figure
fig.show(config={"scrollZoom": True})
Behind the scenes of plot_static_mapper_graph()
is a MapperPipeline
object pipe
that can be used like a typical scikit-learn estimator. For example, to extract the underlying graph data structure we can do the following:
graph = pipe.fit_transform(data)
The resulting graph is a python-igraph object containing metadata stored in the form of dictionaries. We can access this data as follows:
graph['node_metadata'].keys()
Here node_id
is a globally unique node identifier used to construct the graph, while pullback_set_label
and partial_cluster_label
refer to the interval and cluster sets described above. The node_elements
refers to the indices of our original data that belong to each node. For example, to find which points belong to the first node of the graph we can access the desired data as follows:
node_id, node_elements = graph['node_metadata']['node_id'], graph['node_metadata']['node_elements']
print('Node Id: {}, \nNode elements: {}, \nData points: {}'.format(node_id[0], node_elements[0], data[node_elements[0]]))
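Because node_elements stores raw indices into the data array, node-level statistics reduce to a numpy indexing step. The following self-contained sketch (the data and node_elements arrays below are made up for illustration, standing in for the objects from the cells above) computes the mean position of the points in each node:

```python
import numpy as np

# toy stand-ins for the `data` array and `node_elements` from the cells above
data = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0]])
node_elements = [np.array([0, 1]), np.array([2, 3])]

# mean position of the points belonging to each node
node_means = [data[idx].mean(axis=0) for idx in node_elements]
```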
The node_elements
are handy for situations where we want to customise the graph, e.g. by scaling the node sizes. In this example, we use the utility function set_node_sizeref()
and pass its return value as a plotly argument:
# configure scale for node sizes
plotly_kwargs = {
"node_trace_marker_sizeref": set_node_sizeref(node_elements, node_scale=30)
}
fig = plot_static_mapper_graph(pipe, data, layout_dim=3, color_by_columns_dropdown=True, plotly_kwargs=plotly_kwargs)
# display figure
fig.show(config={"scrollZoom": True})
The resulting graph is much easier to decipher with the enlarged node scaling!
In some cases, the list of filter functions provided in filter.py
or scikit-learn may not be sufficient for the task at hand. In such cases, one can pass any callable to the pipeline that acts row-wise on the input data. For example, we can project by taking the sum of the $(x,y)$ coordinates as follows:
filter_func = np.sum
pipe = make_mapper_pipeline(
filter_func=filter_func,
cover=cover,
clusterer=clusterer,
verbose=True,
n_jobs=n_jobs,
)
fig = plot_static_mapper_graph(pipe, data, plotly_kwargs=None)
# display figure
fig.show(config={"scrollZoom": True})
In general, any callable (i.e. a function) that operates row-wise can be passed. For example, we can filter by the ratio of the $x$- and $y$-coordinates as follows:
def calculate_xy_ratio(row):
return row[0] / row[1]
pipe = make_mapper_pipeline(
filter_func=calculate_xy_ratio,
cover=cover,
clusterer=clusterer,
verbose=True,
n_jobs=n_jobs,
)
fig = plot_static_mapper_graph(pipe, data, plotly_kwargs=None)
# display figure
fig.show(config={"scrollZoom": True})
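One caveat with the ratio filter above: it divides by the $y$-coordinate, which is numerically unstable for points near the $x$-axis. A bounded alternative (our own suggestion, not part of the original example) is the polar angle of each point:

```python
import numpy as np

def calculate_angle(row):
    # polar angle of the point: bounded in (-pi, pi] and well-defined
    # even when row[1] (the y-coordinate) is zero
    return np.arctan2(row[1], row[0])
```

This can be passed as filter_func to make_mapper_pipeline() in place of calculate_xy_ratio.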
In general, building useful Mapper graphs requires some iteration over the various parameters of the cover and clustering algorithm. To simplify that process, giotto-tda provides an interactive figure that can be configured in real time. If invalid parameters are selected, the Show logs checkbox can be used to see what went wrong.
pipe = make_mapper_pipeline()
# generate interactive plot
plot_interactive_mapper_graph(pipe, data, color_by_columns_dropdown=True)