Optimal transport inference
I feel the earth mover under my feet, I feel the ψ tumbling down, I feel my heart start to trembling, whenever you’re around (my barycentre)
March 16, 2021 — May 3, 2023
Doing inference where the probability metric measuring the discrepancy between a target distribution and the implied inferential distribution is an optimal-transport one. Frequently intractable, but neat when we can get it. Sometimes we might get there by estimating (gradients of) an actual OT loss, or even the transport maps that imply that loss.
Placeholder/grab bag.
TODO: should we break this into discrete-state and continuous-state cases? Machinery looks different.
1 NNs
Wasserstein GANs and OT-GANs (Salimans et al. 2018) are argued to perform approximate optimal transport inference, indirectly.
2 Surprise connection to matrix factorisation
Non-negative matrix factorisation via OT is a thing, e.g. in topic modeling (Huynh, Zhao, and Phung 2020; Zhao et al. 2020).
3 Via Fisher distance
See e.g. (J. H. Huggins et al. 2018b, 2018a) for a particular Bayes posterior approximation using OT and Fisher distance.
4 Minibatched
Daniel Daza in Approximating Wasserstein distances with PyTorch touches upon Fatras et al. (2020):
Optimal transport distances are powerful tools to compare probability distributions and have found many applications in machine learning. Yet their algorithmic complexity prevents their direct use on large scale datasets. To overcome this challenge, practitioners compute these distances on minibatches i.e., they average the outcome of several smaller optimal transport problems. We propose in this paper an analysis of this practice, which effects are not well understood so far. We notably argue that it is equivalent to an implicit regularization of the original problem, with appealing properties such as unbiased estimators, gradients and a concentration bound around the expectation, but also with defects such as loss of distance property.
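A minimal sketch of the minibatch estimator using POT (introduced in the Tools section below): average exact OT costs over random minibatch pairs. The batch size, number of batches, and the toy Gaussian clouds here are arbitrary illustrative choices, not anything prescribed by Fatras et al.

```python
import numpy as np
import ot  # POT, see the Tools section below

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 2))        # samples from the first measure
Y = rng.normal(size=(10_000, 2)) + 1.0  # samples from the second measure

def minibatch_ot(X, Y, batch_size=64, n_batches=200):
    """Average exact OT costs over random minibatch pairs."""
    a = np.full(batch_size, 1.0 / batch_size)  # uniform minibatch weights
    costs = []
    for _ in range(n_batches):
        xb = X[rng.choice(len(X), batch_size, replace=False)]
        yb = Y[rng.choice(len(Y), batch_size, replace=False)]
        M = ot.dist(xb, yb)             # squared Euclidean cost matrix
        costs.append(ot.emd2(a, a, M))  # exact OT cost on this minibatch
    return np.mean(costs)

print(minibatch_ot(X, Y))  # estimate of the minibatch OT loss, not the full W2^2
```

As the quoted abstract warns, the averaged quantity is not itself a distance between the two empirical measures; it behaves like an implicitly regularized surrogate for the full OT cost.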
5 Linearized embedding
Noted in Bai et al. (2023) via Cheng-Soon Ong:
Comparing K (probability) measures requires the pairwise calculation of transport-based distances, which, despite the significant recent computational speed-ups, remains to be relatively expensive. To address this problem, W. Wang et al. (2013) proposed the Linear Optimal Transport (LOT) framework, which linearizes the 2-Wasserstein distance utilising its weak Riemannian structure. In short, the probability measures are embedded into the tangent space at a fixed reference measure (e.g., the measures’ Wasserstein barycenter) through a logarithmic map. The Euclidean distances between the embedded measures then approximate the 2-Wasserstein distance between the probability measures. The LOT framework is computationally attractive as it only requires the computation of one optimal transport problem per input measure, reducing the otherwise quadratic cost to linear. Moreover, the framework provides theoretical guarantees on convexifying certain sets of probability measures […], which is critical in supervised and unsupervised learning from sets of probability measures.
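A rough sketch of the LOT idea using POT, under stated assumptions: I use an arbitrary reference point cloud in place of a true Wasserstein barycentre, and the barycentric projection of a discrete coupling as a stand-in for the logarithmic map. The helper `lot_embed` is hypothetical, an illustration rather than Wang et al.'s exact construction.

```python
import numpy as np
import ot  # POT

rng = np.random.default_rng(0)
n_ref = 50
ref = rng.normal(size=(n_ref, 2))   # reference cloud (stand-in for a barycentre)
a = np.full(n_ref, 1.0 / n_ref)     # uniform weights on the reference

def lot_embed(pts):
    """One OT solve per input measure, then barycentric projection:
    the displacement field T(ref) - ref is the tangent-space embedding."""
    b = np.full(len(pts), 1.0 / len(pts))
    G = ot.emd(a, b, ot.dist(ref, pts))  # optimal coupling, reference -> measure
    T = (G @ pts) / a[:, None]           # barycentric projection, approx. Monge map
    return T - ref

measures = [rng.normal(size=(80, 2)) + s for s in (0.0, 0.5, 2.0)]
emb = [lot_embed(m) for m in measures]

def lot_dist(u, v):
    # Weighted L2 distance in the tangent space approximates W2.
    return np.sqrt(np.sum(a[:, None] * (u - v) ** 2))

print(lot_dist(emb[0], emb[1]), lot_dist(emb[0], emb[2]))
```

Only one OT problem is solved per measure; all pairwise comparisons are then cheap Euclidean ones in the embedding space, which is the promised linear-versus-quadratic saving.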
6 Tools
6.1 OTT
Optimal Transport Tools (OTT) (Cuturi et al. 2022), a toolbox for all things Wasserstein (documentation):
The goal of OTT is to provide sturdy, versatile and efficient optimal transport solvers, taking advantage of JAX features, such as JIT, auto-vectorization and implicit differentiation.
A typical OT problem has two ingredients: a pair of weight vectors `a` and `b` (one for each measure), with a ground cost matrix that is either directly given, or derived as the pairwise evaluation of a cost function on pairs of points taken from two measures. The main design choice in OTT comes from encapsulating the cost in a `Geometry` object, and [bundling] it with a few useful operations (notably kernel applications). The most common geometry is that of two clouds of vectors compared with the squared Euclidean distance, as illustrated in the example below:
```python
import jax
import jax.numpy as jnp
from ott.tools import transport

# Sample two point clouds and their weights.
rngs = jax.random.split(jax.random.PRNGKey(0), 4)
n, m, d = 12, 14, 2
x = jax.random.normal(rngs[0], (n, d)) + 1
y = jax.random.uniform(rngs[1], (m, d))
a = jax.random.uniform(rngs[2], (n,))
b = jax.random.uniform(rngs[3], (m,))
a, b = a / jnp.sum(a), b / jnp.sum(b)

# Compute the coupling via the Sinkhorn algorithm.
ot = transport.solve(x, y, a=a, b=b)
P = ot.matrix
```
The call to `sinkhorn` above works out the optimal transport solution by storing its output. The transport matrix can be instantiated using those optimal solutions and the `Geometry` again. That transport matrix links each point from the first point cloud to one or more points from the second, as illustrated below. To be more precise, the `sinkhorn` algorithm operates on the `Geometry`, taking into account weights `a` and `b`, to solve the OT problem, produce a named tuple that contains two optimal dual potentials `f` and `g` (vectors of the same size as `a` and `b`), the objective `reg_ot_cost`, a log of the `errors` of the algorithm as it converges, and a `converged` flag.
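Continuing the example above, a sketch of reading off those outputs. The field names (`f`, `g`, `reg_ot_cost`, `errors`, `converged`) come from the quoted documentation, but exactly where they live on the returned object varies across OTT versions, so treat the access path here as an assumption to check against the current docs.

```python
# Sketch only: field names taken from the quoted docs; the access path
# may differ between OTT versions (e.g. via a nested Sinkhorn output).
print(ot.reg_ot_cost)   # regularized OT objective
f, g = ot.f, ot.g       # optimal dual potentials, sized like a and b
print(ot.converged)     # did the Sinkhorn iterations converge?
```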
6.2 POT
POT: Python Optimal Transport (Rémi Flamary et al. 2021)
This open source Python library provides several solvers for optimization problems related to Optimal Transport for signal, image processing and machine learning.
Website and documentation: https://PythonOT.github.io/
Source Code (MIT): https://github.com/PythonOT/POT
POT provides the following generic OT solvers (links to examples):
- OT Network Simplex solver for the linear program / Earth Mover's Distance.
- Conditional gradient and Generalized conditional gradient for regularized OT.
- Entropic regularization OT solver with Sinkhorn Knopp Algorithm, stabilized version, greedy Sinkhorn and Screening Sinkhorn.
- Bregman projections for Wasserstein barycenter, convolutional barycenter and unmixing.
- Sinkhorn divergence and entropic regularization OT from empirical data.
- Debiased Sinkhorn barycenters (Sinkhorn divergence barycenters).
- Smooth optimal transport solvers (dual and semi-dual) for KL and squared L2 regularizations.
- Weak OT solver between empirical distributions
- Non regularized Wasserstein barycenters with LP solver (only small scale).
- Gromov-Wasserstein distances and GW barycenters (exact and regularized), differentiable using gradients from Graph Dictionary Learning
- Fused-Gromov-Wasserstein distances solver and FGW barycenters
- Stochastic solver and differentiable losses for Large-scale Optimal Transport (semi-dual problem and dual problem)
- Sampled solver of Gromov Wasserstein for large-scale problem with any loss functions
- Non regularized free support Wasserstein barycenters.
- One dimensional Unbalanced OT with KL relaxation and barycenter. Also exact unbalanced OT with KL and quadratic regularization, and the regularization path of UOT.
- Partial Wasserstein and Gromov-Wasserstein (exact and entropic formulations).
- Sliced Wasserstein and Max-sliced Wasserstein that can be used for gradient flows.
- Graph Dictionary Learning solvers.
- Several backends for easy use of POT with Pytorch/jax/Numpy/Cupy/Tensorflow arrays.
POT provides the following Machine Learning related solvers:
- Optimal transport for domain adaptation with group lasso regularization, Laplacian regularization and semi supervised setting.
- Linear OT mapping and Joint OT mapping estimation.
- Wasserstein Discriminant Analysis (requires autograd + pymanopt).
- JCPOT algorithm for multi-source domain adaptation with target shift.
Some other examples are available in the documentation.
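As a quick orientation to the API, a minimal sketch computing an exact OT cost and its entropically regularized (Sinkhorn) counterpart between two small point clouds; the point clouds and regularization strength are arbitrary.

```python
import numpy as np
import ot

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
y = rng.normal(size=(60, 2)) + 1.0
a = np.full(50, 1.0 / 50)   # uniform weights on x
b = np.full(60, 1.0 / 60)   # uniform weights on y
M = ot.dist(x, y)           # squared Euclidean cost matrix

exact_cost = ot.emd2(a, b, M)                   # exact LP / network simplex
sinkhorn_cost = ot.sinkhorn2(a, b, M, reg=0.1)  # entropic regularization
plan = ot.emd(a, b, M)                          # the optimal coupling itself
```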
6.3 GeomLoss
The GeomLoss library provides efficient GPU implementations for:
- Kernel norms (also known as Maximum Mean Discrepancies).
- Hausdorff divergences, which are positive definite generalizations of the Chamfer-ICP loss and are analogous to log-likelihoods of Gaussian Mixture Models.
- Debiased Sinkhorn divergences, which are affordable yet positive and definite approximations of Optimal Transport (Wasserstein) distances.
It is hosted on GitHub and distributed under the permissive MIT license.
GeomLoss functions are available through the custom PyTorch layers `SamplesLoss`, `ImagesLoss` and `VolumesLoss`, which allow you to work with weighted point clouds (of any dimension), density maps and volumetric segmentation masks.
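A minimal example of the point-cloud case with `SamplesLoss`, here the debiased Sinkhorn divergence; the tensor shapes and `blur` value are arbitrary choices for illustration.

```python
import torch
from geomloss import SamplesLoss

# Debiased Sinkhorn divergence between two (uniformly weighted) point clouds.
loss = SamplesLoss(loss="sinkhorn", p=2, blur=0.05)
x = torch.randn(100, 3, requires_grad=True)
y = torch.randn(120, 3)
L = loss(x, y)   # scalar divergence
L.backward()     # differentiable: gradients flow back to the point positions x
```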
7 Incoming
Rigollet and Weed (2018):
We give a statistical interpretation of entropic optimal transport by showing that performing maximum-likelihood estimation for Gaussian deconvolution corresponds to calculating a projection with respect to the entropic optimal transport distance.
Thomas Viehmann's An efficient implementation of the Sinkhorn algorithm for the GPU is a PyTorch CUDA extension (Viehmann 2019).
Marco Cuturi's course notes on OT include a 400-page slide deck.