
is known as the permutation invariance property of neural networks. Moreover, vanilla averaging is
not naturally designed to work when using local models with different architectures (e.g., different
widths). In order to address these challenges, Singh & Jaggi (2019) proposed to first find the best
alignment between the neurons (weights) of different networks by using optimal transport (OT) Villani (2008); Santambrogio (2015); Peyré & Cuturi (2018) and then to carry out a vanilla averaging
step. In Liu et al. (2022), the authors formulate the model fusion as a graph matching problem,
which utilizes the second-order similarity of model weights to align neurons. Other approaches,
like those proposed in Yurochkin et al. (2019); Wang et al. (2020), interpret nodes of local models
as random permutations of latent “global nodes” modeled according to a Beta-Bernoulli process
prior Thibaux & Jordan (2007). By using “global nodes”, nodes from different input NNs can be
embedded into a common space where comparisons and aggregation are meaningful. Most works
in the literature discussing the fusion problem have mainly focused on the aggregation of fully con-
nected (FC) neural networks and CNNs, but have not, for the most part, explored other kinds of
architectures like RNNs and LSTMs. One exception to this general state of the art is the work Wang
et al. (2020), which considers the fusion of RNNs by ignoring hidden-to-hidden weights during the
neurons’ matching, thus discarding some useful information in the pre-trained RNNs. For more
references on the fusion problem, see Appendix A.1.
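To make the permutation invariance discussed above concrete, the following toy sketch (our own illustration in NumPy, using the POT library, assumed available as `ot`; it is not the code of Singh & Jaggi (2019), although it follows the same align-then-average idea) permutes the hidden units of a one-hidden-layer network, shows that vanilla averaging of the original and permuted copies changes the network function, and recovers the alignment with an OT coupling between the neurons' incoming weight vectors.

```python
# Toy sketch (illustrative only): permutation invariance of a one-hidden-layer network
# and OT-based neuron alignment before averaging. Requires numpy and POT (pip install pot).
import numpy as np
import ot  # Python Optimal Transport

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 5, 8, 3

# Model 1: x -> W2 @ relu(W1 @ x)
W1 = rng.normal(size=(d_hid, d_in))
W2 = rng.normal(size=(d_out, d_hid))
relu = lambda z: np.maximum(z, 0.0)
f = lambda x, A, B: B @ relu(A @ x)

# Model 2: the same function with hidden units permuted by a random permutation.
perm = rng.permutation(d_hid)
W1_p = W1[perm, :]   # permute rows of the first layer
W2_p = W2[:, perm]   # permute columns of the second layer accordingly

x = rng.normal(size=d_in)
print(np.allclose(f(x, W1, W2), f(x, W1_p, W2_p)))  # True: identical functions

# Vanilla (coordinate-wise) averaging ignores the permutation and changes the function.
print(np.allclose(f(x, W1, W2), f(x, (W1 + W1_p) / 2, (W2 + W2_p) / 2)))  # typically False

# OT alignment: couple the neurons of the two models via the cost between their
# incoming weight vectors, then average in the aligned coordinates.
a = np.full(d_hid, 1.0 / d_hid)   # uniform mass on model-1 neurons
b = np.full(d_hid, 1.0 / d_hid)   # uniform mass on model-2 neurons
M = ot.dist(W1, W1_p)             # pairwise squared-Euclidean costs
T = ot.emd(a, b, M)               # optimal coupling (here a rescaled permutation matrix)
W1_aligned = d_hid * T @ W1_p     # push model-2 neurons onto model-1's ordering
W2_aligned = d_hid * W2_p @ T.T
print(np.allclose(f(x, W1, W2),
                  f(x, (W1 + W1_aligned) / 2, (W2 + W2_aligned) / 2)))  # True in this toy case
```

When the two models are genuinely different (rather than exact permutations of one another, as in this toy case), the coupling is no longer an exact permutation, which is precisely the setting addressed by the fusion algorithms discussed below.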
A different line of research that has attracted considerable attention in the past few years is the quest
for a comprehensive understanding of the loss landscape of deep neural networks, a fundamental
component in studying the optimization and generalization properties of NNs Li et al. (2018); Mei
et al. (2018); Neyshabur et al. (2017); Nguyen et al. (2018); Izmailov et al. (2018). Due to the over-parameterization, scale, and permutation invariance properties of neural networks, the loss land-
scapes of DNNs have many local minima Keskar et al. (2016); Zhang et al. (2021). Different works
have asked, and answered affirmatively, the question of whether different local minima found by SGD can be connected by paths along which the loss increases only slightly Garipov et al. (2018); Draxler et al. (2018).
This phenomenon is often referred to as mode connectivity Garipov et al. (2018) and the loss in-
crease along paths between two models is often referred to as (energy) barrier Draxler et al. (2018).
It has been observed that low-barrier paths are non-linear, i.e., linear interpolation of two different
models will not usually produce a neural network with small loss. These observations suggest that,
from the perspective of local structure properties of loss landscapes, different SGD solutions belong
to different (well-separated) basins Neyshabur et al. (2020). However, the recent work of Entezari et al. (2021) has conjectured that local minima found by SGD do end up lying in the same basin of the
loss landscape after a proper permutation of weights is applied to one of the models. The question
of how to find these desired permutations remains in general elusive.
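To fix ideas, the barrier along the linear path between two models can be estimated by evaluating the loss on a grid of interpolation coefficients. The sketch below is a minimal illustration (not the evaluation code used in our experiments); `loss_fn` is a placeholder for the empirical loss of a model given its flattened parameters, and we use the convention that measures the excess of the loss over the linear interpolation of the endpoint losses.

```python
# Minimal sketch (illustrative only): estimate the energy barrier along the linear
# interpolation between two parameter vectors theta_a and theta_b.
# `loss_fn` is a placeholder for the empirical loss of a model with the given flat parameters.
import numpy as np

def linear_barrier(theta_a, theta_b, loss_fn, num_points=25):
    """Maximum excess of the loss on the segment over the line joining the endpoint losses."""
    ts = np.linspace(0.0, 1.0, num_points)
    losses = np.array([loss_fn((1.0 - t) * theta_a + t * theta_b) for t in ts])
    baseline = (1.0 - ts) * losses[0] + ts * losses[-1]  # linear interpolation of endpoint losses
    return float(np.max(losses - baseline))

# Toy usage: a "loss" with two symmetric minima; the straight path between them has a high barrier.
loss_fn = lambda w: float(min(np.sum((w - 1.0) ** 2), np.sum((w + 1.0) ** 2)))
theta_a, theta_b = np.full(10, 1.0), np.full(10, -1.0)
print(linear_barrier(theta_a, theta_b, loss_fn))  # > 0: the two minima lie in separate basins
```

Under the conjecture of Entezari et al. (2021), applying a suitable permutation to one of the models before interpolating should make this quantity small.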
The purpose of this paper is twofold. On one hand, we present a large family of barycenter-based
fusion algorithms that can be used to aggregate models within the families of fully connected NNs,
CNNs, ResNets, RNNs and LSTMs. The most general family of fusion algorithms that we intro-
duce relies on the concept of Gromov-Wasserstein barycenter (GWB), which allows us to use the
information in hidden-to-hidden layers in RNNs and LSTMs in contrast to previous approaches in
the literature like that proposed in Wang et al. (2020). In order to motivate the GWB based fusion
algorithm for RNNs and LSTMs, we first discuss a Wasserstein barycenter (WB) based fusion algo-
rithm for fully connected, CNN, and ResNet models which follows closely the OT fusion algorithm
from Singh & Jaggi (2019). By creating a link between the NN model fusion problem and the prob-
lem of computing Wasserstein (or Gromov-Wasserstein) barycenters, our aim is to exploit the many
tools that have been developed in the last decade for the computation of WB (or GWB) (see Appendix A.2 for references) and to leverage the mathematical structure of OT problems. Using
our framework, we are able to fuse models with different architectures and build target models with
arbitrarily specified dimensions (at least in terms of width). On the other hand, through several nu-
merical experiments in a variety of settings (architectures and datasets), we provide new evidence
backing certain aspects of the conjecture put forward in Entezari et al. (2021) about the local struc-
ture of NNs’ loss landscapes. Indeed, we find that there exist sparse couplings between different
models that can map different local minima found by SGD into basins that are only separated by
low energy barriers. These sparse couplings, which can be thought of as approximations to actual
permutations, are obtained using our fusion algorithms, which, surprisingly, only use training data to
set the values of some hyperparameters. We explore this conjecture in imaging and natural language
processing (NLP) tasks and provide visualizations of our findings. Consider, for example, Figure 1
(left), which is the visualization of fusing two FC NNs independently trained on the MNIST dataset.
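For concreteness, the fusion of a single fully connected layer via a Wasserstein barycenter can be sketched as follows (a rough illustration using the free-support barycenter solver of the POT library, under simplifying assumptions such as uniform neuron masses and ignored biases; it is not our actual implementation).

```python
# Rough sketch (illustrative only): fuse one fully connected layer from two models by
# computing a free-support Wasserstein barycenter of their neurons, viewed as point
# clouds of incoming-weight vectors. Requires numpy and POT (pip install pot).
import numpy as np
import ot

rng = np.random.default_rng(0)
d_in, n1, n2, n_target = 20, 32, 48, 40     # the two layers may have different widths

W_a = rng.normal(size=(n1, d_in))           # neurons of model 1 (rows = incoming weights)
W_b = rng.normal(size=(n2, d_in))           # neurons of model 2

measures_locations = [W_a, W_b]
measures_weights = [np.full(n1, 1.0 / n1), np.full(n2, 1.0 / n2)]  # uniform neuron masses
X_init = rng.normal(size=(n_target, d_in))  # initial support of the fused layer

# The barycenter support gives the fused layer, whose width (n_target) can be chosen
# freely, i.e., it is not tied to the width of either input model.
W_fused = ot.lp.free_support_barycenter(measures_locations, measures_weights, X_init)
print(W_fused.shape)  # (n_target, d_in)
```

For hidden-to-hidden weights in RNNs and LSTMs, the analogous role is played by a Gromov-Wasserstein barycenter of the layers' intra-layer similarity structure (e.g., via ot.gromov.gromov_barycenters in POT).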
In Figure 1 (left), we can observe that the basins where model 1 and permuted model 2 (i.e., model 2 after multiplying