optimal transport aims to find the lowest-cost way of moving points from the first distribution
to the second one. When the transportation cost is a metric, the optimal transport distance
is also called the 1-Wasserstein distance (Villani, 2009).
Optimal transport is applied in many areas such as machine learning (Solomon et al.,
2014; Frogner et al., 2015; Montavon et al., 2016; Kolouri et al., 2017; Arjovsky et al., 2017;
Genevay et al., 2016; Lee and Raginsky, 2018), statistics (El Moselhy and Marzouk, 2012;
Reich, 2013; Panaretos and Zemel, 2016), and computer graphics (Dominitz and Tannen-
baum, 2009; Rubner et al., 2000; Rabin et al., 2011; Lavenant et al., 2018; Solomon et al.,
2015). Since the optimal transport distance is computationally expensive (cubic in the num-
ber of points), several methods have emerged to efficiently approximate optimal transport
distance. One of the most popular methods is the Sinkhorn distance, which uses entropic
regularization to compute an approximation of optimal transport in quadratic time (Cuturi,
2013).
Another approach relies on approximating the original metric with a tree metric (Evans
and Matsen, 2012; Le et al., 2019a; Indyk and Thaper, 2003; Takezawa et al., 2021; Yamada
et al., 2022). While this approach yields a coarser approximation of the optimal transport, it
has linear time complexity with respect to the size of the metric space. The classic Quadtree
is one of the most widely used tree approximations methods. It recursively partitions a metric
space into four quadrants to construct internal nodes and represents each element of the
original discrete metric space as a leaf node. Quadtree Wasserstein distance approximates
the true optimal transport distance with a logarithmic distortion (Indyk and Thaper, 2003).
Flowtree (Backurs et al., 2019) and sliced-tree Wasserstein (Le et al., 2019b) are Quadtree-
based methods designed to improve over the 1-Wasserstein distance approximation of the
Quadtree algorithm. A main drawback of Quadtree-based methods is that they require a
Euclidean embedding of the original discrete metric space. In contrast, clustertree is a tree-
based approximation that does not require a Euclidean embedding of the original discrete
metric space (Le et al., 2019b). Given a fixed Quadtree or clustertree topology, Yamada
et al. (2022) propose a state-of-the-art method based on optimizing the weights on the tree
to best approximate Wasserstein distances but does not change the topology of the input
tree.
Our goal is to find the tree topology and weight that closely approximates the Wasserstein
distance in the original metric space. To achieve this, we introduce a projected gradient
descent procedure over the space of ultrametrics to find a tree that approximates the original
Wasserstein distances. As constraining the problem to tree metrics is challenging in general,
we instead use ultrametrics — a subfamily of tree metrics — as a proxy for tree weights and
structure.
Problem Statement (informal) Let Xbe a point set in a metric space and W1(·,·) be
the optimal transport distance. We want to learn an ultrametric on Xsuch that ultrametric
optimal transport Wuapproximates W1. To achieve this, we cast this as a regression problem
over ultrametric trees.
The key advantage of optimizing via ultrametrics is that we can project any (semi)metric
to the ultrametrics; see Section 3. This projection allows us to optimize in the space of ultra-
metrics via a projected gradient descent-type procedure. Our contributions are as follows.
1. We define a quadratic cost function to measure the discrepancy between the true and
ultrametric optimal transport distances. This cost does not require the point positions a
2