The Mapper Algorithm: Topological Summaries of High-Dimensional Data
Published:
The Core Idea
Standard clustering gives a partition: each point is assigned to exactly one cluster. But topology asks about shape, not partition. The Mapper algorithm builds a topological summary by:
- Projecting data onto a low-dimensional “filter space.”
- Clustering the projected data locally.
- Connecting local clusters globally.
The Algorithm
Input: Point cloud \(X \subseteq \mathbb{R}^n\), filter function \(f: X \to \mathbb{R}^d\) (typically \(d=1\) or \(2\)), cover \(\{U_i\}\) of the range of \(f\).
Algorithm:
- For each cover element \(U_i\), compute \(X_i = f^{-1}(U_i) = \{x \in X : f(x) \in U_i\}\).
- Cluster each \(X_i\) using any clustering algorithm (e.g., single-linkage, DBSCAN).
- Each cluster \(C_{ij}\) becomes a node in the output graph.
- Add an edge between \(C_{ij}\) and \(C_{kl}\) if \(C_{ij} \cap C_{kl} \neq \emptyset\) (overlap between clusters in adjacent cover elements).
Choice of Filter Function
The filter \(f\) determines what “shape” Mapper reveals:
- Density estimate \(\hat{\rho}(x)\): reveals flares (high-density cores connected to low-density tails).
- PCA/SVD projection: finds elongated structure, branching.
- Eccentricity \(e(x) = \max_{y \in X} d(x,y)\): finds loop structure (points far from everything else).
- Distance to a measure: robust version of density.
- Domain-specific functions: gene expression levels, neural activation norms, task outputs.
Relation to the Reeb Graph
The Reeb graph of \(f: M \to \mathbb{R}\) on a manifold \(M\) is the quotient space that contracts each connected component of \(f^{-1}(t)\) to a point. Mapper approximates the Reeb graph from a finite sample.
Theorem: Under mild conditions (fine enough cover, consistent clustering), Mapper convergences to the Reeb graph as the sample density increases and the cover resolution grows.
Landmark Applications
Cancer genomics (Nicolau et al. 2011): Mapper on breast cancer gene expression data (with density filter) revealed a previously unknown subgroup of patients with 100% survival rate — a “flare” in the data shape invisible to clustering.
Neural network analysis: Mapper on activation patterns across layers reveals branching structure that corresponds to decision boundaries.
Materials science: Mapper on atomic configuration spaces reveals transition pathways between crystal structures.
Parameters and Stability
Mapper has three parameters: filter \(f\), cover resolution \(r\) (number of intervals), and overlap \(p\)%:
- Resolution: too low → single giant cluster; too high → disconnected fragments.
- Overlap: higher overlap = more edges = more topological features detected.
- Stability: small changes in \(f\) produce bounded changes in the Mapper graph (in an appropriate metric), provided the clustering is stable.
References
- G. Singh, F. Mémoli, G. Carlsson, “Topological Methods for the Analysis of High Dimensional Data Sets,” SPBG 2007.
- M. Nicolau, A. Levine, G. Carlsson, “Topology Based Data Analysis Identifies a Subgroup of Breast Cancers with a Unique Mutational Profile and Excellent Survival,” PNAS 2011.
- KeplerMapper: kepler-mapper.scikit-tda.org
