The objective of this blog is to document my understanding of the Mapper algorithm1 introduced by Singh et al. along with a review of available statistical literature on the Mapper. The textbook by Prof. Tamal Dey2 is an extremely helpful source for understanding the topological motivation behind Mapper. In particular, chapters 7 and 9 of the textbook focus on Mapper and its required background.
Let $\mathcal{X}$ be a finitely triangulable topological space endowed with the topology $\tau_{\mathcal{X}}$. Assume that $f:\mathcal{X} \longrightarrow \mathbb{R}$ is a continuous function we want to study. If our intention is to obtain a topological summary, then one obvious option is to obtain barcodes3:
- Let $X_t = \lbrace f^{-1}(-\infty, x_t] \rbrace$ for $t \in \lbrace 0, \cdots, m \rbrace$ and $-\infty < x_0 < \cdots < x_m < \infty$ be a finite sequence of sub-level sets of $f$.
- This induces a sequence of homology groups (at different dimensions) connected by inclusion homomorphisms that are studied to obtain barcodes as the topological summary.
Another approach is to look at the Reeb graph ($\mathcal{R}_f(\mathcal{X})$) associated with the continuous function $f$. When $f$ is “nice”, the Reeb graph of $f$ provides a one-dimensional topological summary of the function $f$. Compared to barcodes, there is a loss of homological information with an increased ease in computing the Reeb graph of $f$. The statistical version of the Reeb graph for a point cloud data $X \in \mathcal{X}$ is called Mapper which helps us obtain a low dimensional combinatorial representation of the data.
Reeb graphs
$(\mathcal{X}, \tau_{\mathcal{X}})$ is a topological space and $f: \mathcal{X} \longrightarrow \mathbb{R}$ is a continuous function. The overall idea behind Mapper is to look at the connected components of the level sets $f^{-1}(a)$ as $a$ varies in $(-\infty, \infty)$. An illustration of how to compute the Reeb graph corresponding to $f$ that denotes the height function from the baseline is given below:
An object of genus two is on the left while its corresponding Reeb graph w.r.t. the height function is on the right. Note that the points lying on the object that intersect with (blue) lines parallel to the baseline correspond to points with the same function value i.e. they are contours of $f$. The number of connected components at each such contour line is counted. A node exists at a point where the number of such connected components changes. Intuitively, we are “crushing” all the points of a contour line within a connected component into a single point on the Reeb graph. This point may or may not be a node. The resultant graph obtained this way is called the Reeb graph associated with f.
The intuition provided above can be made rigorous with the idea of quotient topology which “crushes points together” by identifying points using an equivalence relation. For $a \in \mathbb{R}$, define the level set at $a$ as
\[f^{-1}(a) = \lbrace x \in \mathcal{X} | f(x) = a \rbrace \subset \mathcal{X} \text{ for } a \in \mathbb{R}\]Clearly, $f^{-1}(a) \subseteq \mathcal{X}$ which induces a subspace topology, $\tau_{f^{-1}(a)}$, on $f^{-1}(a)$:
\[\tau_{f^{-1}(a)} = \lbrace f^{-1}(a) \cap U : U \in \tau_{\mathcal{X}} \rbrace\]Define an equivalence relation $\thicksim$ on $\mathcal{X}$ as: for $x, y \in \mathcal{X}$, we say $x \thicksim y$ iff
- $f(x) = f(y) = \alpha$ for $\alpha \in \mathbb{R}$
- $x,y$ belong to the same connected component of $f^{-1}(\alpha)$ Then Reeb graph $R_f$ of $f:\mathcal{X} \longrightarrow \mathbb{R}$ is the quotient space $\mathcal{X}/\thicksim$.
Let $\Phi: \mathcal{X} \longrightarrow R_f$ be the quotient map defined as $x \mapsto [x]$. Since $f: \mathcal{X} \longrightarrow \mathbb{R}$ is continuous, the quotient map induces a continuous map $\tilde{f}: R_f \longrightarrow \mathbb{R}$ defined as $[x] \mapsto f(z)$ for $z \in \Phi^{-1}(x)$ such that $f = \tilde{f} \circ \Phi$. To put it simply, every Reeb graph corresponds to a continuous function $\tilde{f}$ which associates a real number to every node of the Reeb graph.
If $f$ is “nice”, then it can be proved that the corresponding Reeb graph is a finite $1$-dimensional regular CW-complex. In other words, it is a graph. Morse functions on compact smooth manifold and piecewise linear functions on compact polyhedrons are examples of such “nice” functions. Hence under such niceness conditions on $f$, we can view the Reeb graph $R_f$ as the underlying space of a $1$-dimensional cell complex that has a continuous function $\tilde{f}$ defined on it.
Merge Tree of a function $f: \mathcal{X} \longrightarrow \mathbb{R}$ is obtained by working with sub-level sets of the form $f^{-1}(-\infty, x]$ for an increasing sequence of $x$ values. Similarly Split Tree of $f$ is obtained by working with super-level sets of the form $f^{-1}[x, \infty)$ for decreasing $x$ values. Therefore, both the Merge and Split trees of $f$ can be defined as quotient spaces for appropriately defined equivalence classes.
Mapper
Mapper is the statistical version of the Reeb graph. Assume as usual that $f:\mathcal{X} \longrightarrow \mathbb{R}$ is a continuous map. We have access to a dataset $X \subseteq \mathcal{X}$ containing $n$ points $x_1, \cdots, x_n$. Range on $f$ when we have access to $X$ alone is based on $f(x_1), \cdots, f(x_n)$. This restricted range will be denoted by $f(X)$.
Mapper, $M_f$ | Reeb graph, $R_f$ |
---|---|
$f$ is function of interest but we have access to $f(x) \quad \forall x \in X$ | $f$ is the function of interest and we have access to $f(x) \quad \forall x \in \mathcal{X}$ |
Range $f(X) = \bigcup_{j=1}^S I_j$ is defined based on two parameters chosen by user | Range $\mathbb{R} = \bigcup_{a \in \mathbb{R}} \lbrace a \rbrace$ |
Within each pre-image $f^{-1}(I_j)$ apply clustering algorithm and shrink each cluster into a node | Within each level set $f^{-1}(a)$ “shrink” each connected component into a single point on graph |
- The function $f$ used in the mapper is called the filter function. The mapper $M_f$ associated with the dataset $X$ is obtained by taking the nerve of the clusters in the pre-images $\lbrace f^{-1} (I_j)\rbrace_{j=1}^S$.
- Therefore $S$ sets a lower bound on the number of clusters and hence the number of nodes in the final combinatorial representation.
- The intervals $I_1, \cdots, I_S$ are chosen such that at most two intervals intersect. This ensures that the final combinatorial representation is a graph.
- The two parameters based on which intervals $I_1, \cdots, I_S$ are chosen are called gain and resolution parameters denoted by $g$ and $r$ respectively. The gain parameter controls the proportion of overlap between consecutive intervals $I_j, I_{j+1}$ while the resolution parameter controls the length of intervals $I_j$.
-
Singh, G., Memoli, F., & Carlsson, G. (2007). Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. In Eurographics Symposium on Point-Based Graphics. The Eurographics Association. ↩
-
Dey, T., & Wang, Y. (2022). Computational Topology for Data Analysis. Cambridge: Cambridge University Press. ↩
-
Ghrist, R. (2007). Barcodes: The persistent topology of data. In Bulletin of the American Mathematical Society (Vol. 45, Issue 01, pp. 61–76). American Mathematical Society (AMS). ↩
-
Carrière, M., Michel, B., & Oudot, S. (2017). Statistical Analysis and Parameter Selection for Mapper, Journal of Machine Learning Research, 19(12):1−39, 2018. ↩
-
Carrière, M., Oudot, S. Structure and Stability of the One-Dimensional Mapper. Found Comput Math 18, 1333–1396 (2018) ↩