WASSERSTEIN BARYCENTER-BASED MODEL FUSION AND LINEAR MODE CONNECTIVITY OF NEURAL NETWORKS
Aditya Kumar Akash
Department of Computer Science
University of Wisconsin-Madison
aakash@wisc.edu
Sixu Li
Department of Statistics
University of Wisconsin-Madison
sli739@wisc.edu
Nicolás García Trillos
Department of Statistics
University of Wisconsin-Madison
garciatrillo@wisc.edu
ABSTRACT
Based on the concepts of Wasserstein barycenter (WB) and Gromov-Wasserstein
barycenter (GWB), we propose a unified mathematical framework for neural net-
work (NN) model fusion and utilize it to reveal new insights about the linear mode
connectivity of SGD solutions. In our framework, the fusion occurs in a layer-wise
manner and builds on an interpretation of a node in a network as a function of the
layer preceding it. The versatility of our mathematical framework allows us to
talk about model fusion and linear mode connectivity for a broad class of NNs,
including fully connected NN, CNN, ResNet, RNN, and LSTM, in each case ex-
ploiting the specific structure of the network architecture. We present extensive
numerical experiments to: 1) illustrate the strengths of our approach in relation to
other model fusion methodologies and 2) from a certain perspective, provide new
empirical evidence for recent conjectures which say that two local minima found
by gradient-based methods end up lying on the same basin of the loss landscape
after a proper permutation of weights is applied to one of the models. 1
1 INTRODUCTION
The increasing use of edge devices like mobile phones, tablets, and vehicles, along with the sophisti-
cation in sensors present in them (e.g. cameras, GPS, and accelerometers), has led to the generation
of an enormous amount of data. However, data privacy concerns, communication costs, bandwidth
limits, and time sensitivity prevent the gathering of local data from edge devices into one single
centralized location. These obstacles have motivated the design and development of federated learn-
ing strategies which are aimed at pooling information from locally trained neural networks (NNs)
with the objective of building strong centralized models without relying on the collection of local
data McMahan et al. (2017); Kairouz et al. (2019). Due to these considerations, the problem of
NN fusion, i.e. combining multiple models which were trained differently into a single model, is a
fundamental task in federated learning.
A standard fusion method for aggregating models with the same architecture is FedAvg McMahan
et al. (2017), which involves element-wise averaging of the parameters of local models. This is also
known as vanilla averaging Singh & Jaggi (2019). Although easily implementable, vanilla averaging
performs poorly when fusing models whose weights do not have a one-to-one correspondence. This
happens because even when models are trained on the same dataset it is possible to obtain models
that differ only by a permutation of weights Wang et al. (2020); Yurochkin et al. (2019); this feature
Now at Google
Equal Contribution
1Code is available at: https://github.com/SixuLi/WBFusionAndLMC
is known as the permutation invariance property of neural networks. Moreover, vanilla averaging is
not naturally designed to work when using local models with different architectures (e.g., different
widths). In order to address these challenges, Singh & Jaggi (2019) proposed to first find the best
alignment between the neurons (weights) of different networks by using optimal transport (OT)
Villani (2008); Santambrogio (2015); Peyré & Cuturi (2018) and then carry out a vanilla averaging
step. In Liu et al. (2022), the authors formulate model fusion as a graph matching problem,
which utilizes the second-order similarity of model weights to align neurons. Other approaches,
like those proposed in Yurochkin et al. (2019); Wang et al. (2020), interpret nodes of local models
as random permutations of latent “global nodes” modeled according to a Beta-Bernoulli process
prior Thibaux & Jordan (2007). By using “global nodes”, nodes from different input NNs can be
embedded into a common space where comparisons and aggregation are meaningful. Most works
in the literature discussing the fusion problem have mainly focused on the aggregation of fully con-
nected (FC) neural networks and CNNs, but have not, for the most part, explored other kinds of
architectures like RNNs and LSTMs. One exception to this general state of the art is the work Wang
et al. (2020), which considers the fusion of RNNs by ignoring hidden-to-hidden weights during the
neurons’ matching, thus discarding some useful information in the pre-trained RNNs. For more
references on the fusion problem, see Appendix A.1.
A different line of research that has attracted considerable attention in the past few years is the quest
for a comprehensive understanding of the loss landscape of deep neural networks, a fundamental
component in studying the optimization and generalization properties of NNs Li et al. (2018); Mei
et al. (2018); Neyshabur et al. (2017); Nguyen et al. (2018); Izmailov et al. (2018). Due to over-
parameterization, scale, and permutation invariance properties of neural networks, the loss land-
scapes of DNNs have many local minima Keskar et al. (2016); Zhang et al. (2021). Different works
have asked and answered affirmatively the question of whether there exist paths of small-increasing
loss connecting different local minima found by SGD Garipov et al. (2018); Draxler et al. (2018).
This phenomenon is often referred to as mode connectivity Garipov et al. (2018) and the loss in-
crease along paths between two models is often referred to as (energy) barrier Draxler et al. (2018).
It has been observed that low-barrier paths are non-linear, i.e., linear interpolation of two different
models will not usually produce a neural network with small loss. These observations suggest that,
from the perspective of local structure properties of loss landscapes, different SGD solutions belong
to different (well-separated) basins Neyshabur et al. (2020). However, recent work Entezari et al.
(2021) has conjectured that local minima found by SGD do end up lying on the same basin of the
loss landscape after a proper permutation of weights is applied to one of the models. The question
of how to find these desired permutations remains in general elusive.
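To make the notions of linear interpolation between models and the resulting (energy) barrier concrete, the following is a minimal PyTorch-style sketch. It is our own illustration, not code from any of the cited works: `loss_on_dataset` is a user-supplied function returning a float loss, the two state dicts are assumed to come from models with identical architectures, and the barrier convention used here (loss above the linear interpolation of the endpoint losses) is one common choice.

```python
import numpy as np
import torch

def linear_path_losses(model, state_dict_1, state_dict_2, loss_on_dataset, num_points=21):
    """Evaluate the loss along theta(t) = (1 - t) * theta_1 + t * theta_2 for t in [0, 1]."""
    ts = np.linspace(0.0, 1.0, num_points)
    losses = []
    for t in ts:
        # Element-wise interpolation of all parameters (same keys in both state dicts).
        interpolated = {
            k: (1.0 - float(t)) * state_dict_1[k] + float(t) * state_dict_2[k]
            for k in state_dict_1
        }
        model.load_state_dict(interpolated)
        with torch.no_grad():
            losses.append(float(loss_on_dataset(model)))  # user-supplied evaluation
    return ts, np.array(losses)

def energy_barrier(losses):
    """Barrier: largest loss excess along the path over the endpoints' linear baseline
    (one common convention; other works use the maximum over the endpoints instead)."""
    baseline = np.linspace(losses[0], losses[-1], len(losses))
    return float(np.max(losses - baseline))
```

A low barrier along this linear path between model 1 and a suitably permuted model 2 is exactly the kind of evidence discussed for the conjecture above.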
The purpose of this paper is twofold. On one hand, we present a large family of barycenter-based
fusion algorithms that can be used to aggregate models within the families of fully connected NNs,
CNNs, ResNets, RNNs and LSTMs. The most general family of fusion algorithms that we intro-
duce relies on the concept of Gromov-Wasserstein barycenter (GWB), which allows us to use the
information in hidden-to-hidden layers in RNNs and LSTMs in contrast to previous approaches in
the literature like that proposed in Wang et al. (2020). In order to motivate the GWB based fusion
algorithm for RNNs and LSTMs, we first discuss a Wasserstein barycenter (WB) based fusion algo-
rithm for fully connected, CNN, and ResNet models which follows closely the OT fusion algorithm
from Singh & Jaggi (2019). By creating a link between the NN model fusion problem and the prob-
lem of computing Wasserstein (or Gromov-Wasserstein) barycenters, our aim is to exploit the many
tools that have been developed in the last decade for the computation of WB (or GWB) (see
Appendix A.2 for references) and to leverage the mathematical structure of OT problems. Using
our framework, we are able to fuse models with different architectures and build target models with
arbitrary specified dimensions (at least in terms of width). On the other hand, through several nu-
merical experiments in a variety of settings (architectures and datasets), we provide new evidence
backing certain aspects of the conjecture put forward in Entezari et al. (2021) about the local struc-
ture of NNs’ loss landscapes. Indeed, we find that there exist sparse couplings between different
models that can map different local minima found by SGD into basins that are only separated by
low energy barriers. These sparse couplings, which can be thought of as approximations to actual
permutations, are obtained using our fusion algorithms, which, surprisingly, only use training data to
set the values of some hyperparameters. We explore this conjecture in imaging and natural language
processing (NLP) tasks and provide visualizations of our findings. Consider, for example, Figure 1
(left), which is the visualization of fusing two FC NNs independently trained on the MNIST dataset.
We can observe that the basins where model 1 and permuted model 2 (i.e. model 2 after multiplying
its weights by the coupling obtained by our fusion algorithm) land are close to each other and are
only separated by low energy barriers.
[Figure 1 (left): contour plot of test error; markers indicate Base model 1, Base model 2, Permuted model 2, and the Fused model; contour levels range from 1.7 to >8.]
Figure 1: Left: The test error surface of FC NNs trained on MNIST. The permuted model 2 is
model 2 after multiplying its weights by the coupling obtained by our fusion algorithm. Right: The
illustration of our interpretation of FC NNs. Following our definitions, node v := (γ_2, w), where
γ_2 is a probability measure on layer N_2 and w : N_2 → R is the weight function corresponding to
node v. For example, the scalar w(z_2) is the weight between nodes v and z_2, and we use w_2 as the
shorthand notation for w(z_2).
Our main contributions can then be summarized as follows: (a) we formulate the network model
fusion problem as a series of Wasserstein (Gromov-Wasserstein) barycenter problems, bridging in
this way the NN fusion problem with computational OT; (b) we empirically demonstrate that our
framework is highly effective at fusing different types of networks, including RNNs and LSTMs; (c)
we visualize the result of our fusion algorithm when aggregating two neural networks in a 2D-plane.
By doing this we not only provide some illustrations of how our fusion algorithms perform, but also
present empirical evidence for the conjecture made in Entezari et al. (2021), casting light over the
loss landscape of a variety of neural networks.
At the time of completing this work, we became aware of two very recent preprints which also
explore the conjecture made in Entezari et al. (2021) empirically. In particular, Ainsworth et al.
(2022) demonstrates that there is zero-barrier LMC (after permutation) between two independently
trained NNs (including ResNet) provided the width of layers is large enough. In Benzing et al.
(2022), the conjecture is explored for FC NNs, finding that the average of two randomly initialized
models using the permutation revealed through training gives a non-trivial NN. Compared to our
work, neither of these two works explores this conjecture for recurrent NNs; we highlight that our
GWB fusion method is of particular relevance for this aim. To the best of our knowledge, we thus
provide the first-ever exploration of the conjecture posited in Entezari et al. (2021) for NLP tasks.
1.1 NOTATION
We first introduce some basic notation and briefly review a few relevant concepts from OT. The simplex
of histograms with n bins is denoted by Σ_n := {a ∈ R^n_+ : ∑_i a_i = 1}. The set of couplings between
histograms a ∈ Σ_{n_1} and b ∈ Σ_{n_2} is denoted by Γ(a, b) := {Π ∈ R^{n_1 × n_2}_+ : Π 1_{n_2} = a, Π^T 1_{n_1} = b},
where 1_n := (1, . . . , 1)^T ∈ R^n. For any 4-way tensor L = (L_{ijkl})_{i,j,k,l} and matrix Π = (π_{ij})_{i,j},
we define the tensor-matrix multiplication of L and Π as the matrix L ⊗ Π := (∑_{k,l} L_{ijkl} π_{kl})_{i,j}.
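As an illustration of this tensor-matrix multiplication, here is a minimal NumPy sketch (the shapes are arbitrary choices for the example, not values used in the paper):

```python
import numpy as np

# 4-way tensor L with entries L[i, j, k, l] and a matrix Pi with entries Pi[k, l].
n1, n2 = 3, 4  # arbitrary sizes for this illustration
L = np.random.rand(n1, n2, n1, n2)
Pi = np.random.rand(n1, n2)
Pi /= Pi.sum()  # normalize so Pi resembles a coupling matrix

# (L ⊗ Pi)[i, j] = sum_{k, l} L[i, j, k, l] * Pi[k, l]
L_otimes_Pi = np.einsum('ijkl,kl->ij', L, Pi)
print(L_otimes_Pi.shape)  # (3, 4)
```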
1.2 OPTIMAL TRANSPORT AND WASSERSTEIN BARYCENTERS
Let X be an arbitrary topological space and let c : X × X → [0, ∞) be a cost function assumed to
satisfy c(x, x) = 0 for every x. We denote by M_1^+(X) the space of (Borel) probability measures
on X. For {x_i}_{i=1}^{n_1}, {y_j}_{j=1}^{n_2} ⊆ X, define discrete measures μ = ∑_{i=1}^{n_1} a_i δ_{x_i} and ν = ∑_{j=1}^{n_2} b_j δ_{y_j}
in M_1^+(X), where a ∈ Σ_{n_1}, b ∈ Σ_{n_2}, and δ_x denotes the Dirac delta measure at x ∈ X. The
Wasserstein “distance” between μ and ν, relative to the cost c, is defined as

    W(μ, ν) := inf_{Π ∈ Γ(μ, ν)} ⟨C, Π⟩,    (1)

where C := (c(x_i, y_j))_{i,j} is the “cost” matrix between {x_i}_i, {y_j}_j ⊆ X, Π := (π_{ij})_{i,j} ∈ Γ(μ, ν)
is the coupling matrix between μ and ν, and ⟨A, B⟩ := tr(A^T B) is the Frobenius inner product.
Let {γ_i}_{i=1}^{n} ⊆ M_1^+(X) be a collection of discrete probability measures. The Wasserstein barycenter
problem (WBP) Agueh & Carlier (2011) associated with these measures reads

    min_{γ ∈ M_1^+(X)}  (1/n) ∑_{i=1}^{n} W(γ, γ_i).    (2)

A minimizer of this problem is called a Wasserstein barycenter (WB) of the measures {γ_i}_{i=1}^{n} and
can be understood as an average of the input measures. In the sequel we will use the concept of WB
to define fusion algorithms for FC NN, CNN, and ResNet. For RNN and LSTM the fusion reduces
to solving a series of Gromov-Wasserstein barycenter-like problems (see the review of GWBP in
the Appendix).
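For concreteness, here is a minimal sketch (our own illustration, not the paper's code) of how the quantities in equations 1 and 2 can be computed for discrete measures using the POT library; the point clouds, grid, and regularization level are arbitrary placeholders, and the barycenter solver shown is the entropically regularized (fixed-support) one rather than an exact solver.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)

# --- Wasserstein "distance" between two discrete measures (equation 1) ---
x = rng.normal(size=(5, 2))            # support points {x_i} of mu in R^2
y = rng.normal(size=(7, 2)) + 1.0      # support points {y_j} of nu
a = np.full(5, 1.0 / 5)                # histogram a in the simplex
b = np.full(7, 1.0 / 7)                # histogram b in the simplex
C = ot.dist(x, y)                      # cost matrix C[i, j] = c(x_i, y_j) (squared Euclidean)
Pi = ot.emd(a, b, C)                   # optimal coupling Pi in Gamma(a, b)
W = float(np.sum(C * Pi))              # <C, Pi>

# --- Fixed-support Wasserstein barycenter of two histograms (equation 2) ---
grid = np.linspace(0.0, 1.0, 50).reshape(-1, 1)      # common support
M = ot.dist(grid, grid)
M /= M.max()
A = np.stack([np.exp(-(grid[:, 0] - m) ** 2 / 0.01) for m in (0.3, 0.7)], axis=1)
A /= A.sum(axis=0, keepdims=True)                    # columns are the input histograms
barycenter = ot.bregman.barycenter(A, M, reg=1e-2)   # entropically regularized WB
```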
2 WASSERSTEIN BARYCENTER BASED FUSION
In this section, we discuss our layer-wise fusion algorithm based on the concept of WB. First we
introduce the necessary interpretations of nodes and layers of NNs in Section 2.1. Next in Section
2.2, we describe how to compare layers and nodes across different NNs so as to make sense of
aggregating models through WB. Finally we present our fusion algorithm in Section 2.3.
2.1 NESTED DEFINITION OF FULLY CONNECTED NN
For a fully connected network N, we use v to index the nodes in its l-th layer N_l. Let γ_l denote a
probability measure on the l-th layer defined as the weighted sum of Dirac delta measures over the
nodes in that layer, i.e.,

    γ_l := (1 / |N_l|) ∑_{v ∈ N_l} δ_v  ∈ M_1^+(N_l).    (3)

We interpret a node v from the l-th layer as an element in N_l that couples a function on the domain
N_{l-1} (previous layer) with a probability measure. In particular, the node v is interpreted as v :=
(γ_{l-1}, w), where γ_{l-1} is a measure on the previous layer N_{l-1} and w represents the weights between
the node v and the nodes in the previous layer N_{l-1}. These weights can be interpreted as a function w :
N_{l-1} → R, and we use the notation w_q to denote the value of the function w evaluated at the q-th node
in the previous layer N_{l-1}. For the first layer, i.e. l = 1, the nodes simply represent placeholders for
the input features. The above interpretation is illustrated in Figure 1 (right). This interpretation of
associating nodes with a function of the previous layer allows us to later define a “distance” between nodes
in different NNs (see Section 2.2) and is motivated by TL^p spaces and distances García Trillos &
Slepčev (2015); Thorpe et al. (2017), which are designed for comparing signals with different domains
(see more details in Appendix C.1).
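To make this interpretation concrete, the following is a small illustrative sketch (our own hypothetical representation, not the authors' implementation) of a layer as a discrete measure over its nodes together with the incoming weight functions:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Layer:
    """Layer N_l of a fully connected NN, viewed as a discrete measure over its nodes.

    Each node v is the pair (gamma_{l-1}, w): the measure on the previous layer plus the
    incoming weight vector w, seen as a function w : N_{l-1} -> R.
    """
    weights: np.ndarray   # shape (k_l, k_{l-1}); row j is the weight function w of node j
    masses: np.ndarray    # histogram gamma_l over the k_l nodes (uniform, as in equation 3)

def layer_from_weight_matrix(W: np.ndarray) -> Layer:
    # Uniform measure gamma_l = (1/|N_l|) sum_v delta_v, as in equation 3.
    k_l = W.shape[0]
    return Layer(weights=W, masses=np.full(k_l, 1.0 / k_l))
```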
2.2 COST FUNCTIONS FOR COMPARING LAYERS AND NODES
Having introduced our interpretations of NNs, we now define the cost functions for comparing layers
and nodes which will be used to aggregate models through WB. Consider the l-th layers N_l and N'_l
of two NNs N and N' respectively. We use the Wasserstein distance between the measures γ_l and γ'_l
over N_l and N'_l respectively to define a distance between the layers:

    d_μ(γ_l, γ'_l) := W(γ_l, γ'_l) = inf_{Π_l ∈ Γ(γ_l, γ'_l)} ⟨C_l, Π_l⟩,    (4)

where the matrix Π_l = [π_{l,jg}]_{j,g} is a coupling between the measures γ_l and γ'_l, and C_l is the cost matrix
given by C_l := (c_l(v, v'))_{v,v'}, where c_l is a cost function between nodes on the l-th layers.
Following our inductive interpretation of NNs, the cost function c_l can also be defined inductively.
Consider nodes v and v' from the l-th layer of NNs N and N' respectively. For the first layer l = 1,
we pick a natural candidate for the cost function, namely c_1(v, v') := 1_{v ≠ v'}, a reasonable choice given
that all networks have the same input layer. For l ≥ 2, recall our interpretation of nodes v =
(γ_{l-1}, w), v' = (γ'_{l-1}, w'), where γ_{l-1} and γ'_{l-1} denote the respective measures associated with the
previous layer l-1, and w, w' denote the respective weight functions for nodes v and v'. Since the
domains of the weight functions w and w' are layers in different NNs, it is not clear how to compare
them directly. However, in the TL^p interpretation, after finding a suitable coupling between the support
measures γ_{l-1} and γ'_{l-1}, one can couple the functions w and w' and use a direct L^2-comparison.

Motivated by computational and methodological considerations, we use a slight modification of
the TL^p distance and decouple the problem for the measures from the weights. Specifically, we
define c_l(v, v') := d_μ(γ_{l-1}, γ'_{l-1}) + d_W(w, w'), where d_μ is the Wasserstein distance (as defined
in equation 4) between the measures γ_{l-1} and γ'_{l-1} from layers l-1, and d_W is defined using the
optimal coupling of the weight functions' support measures, i.e.,

    d_W(w, w') := ∑_{q,s} (w_q - w'_s)^2 π*_{l-1,qs} =: ⟨L(w, w'), Π*_{l-1}⟩,    (5)

where L(w, w') := ((w_q - w'_s)^2)_{q,s} and Π*_{l-1} = (π*_{l-1,qs})_{q,s} is the optimal coupling between
γ_{l-1} and γ'_{l-1}. Note that d_μ(γ_{l-1}, γ'_{l-1}) is a fixed constant when comparing any two nodes on the
l-th layers N_l and N'_l. For simplicity, we let c_l(v, v') = d_W(w, w') in what follows, and the
information of the support measures γ_{l-1} and γ'_{l-1} is implicitly included in their optimal coupling
Π*_{l-1}. Here we have omitted bias terms to ease the exposition of our framework, but a natural
implementation that accounts for bias terms can be obtained by simply concatenating them with the
weight matrix.

We set Π*_1 equal to the identity matrix normalized by the size of the input layer, given that this is
a solution to equation 4 when the cost c_1 is defined as c_1(v, ṽ) := 1_{v ≠ ṽ}. Other choices of the cost
function c_l are possible, e.g. the activation-based cost function proposed in Singh & Jaggi (2019).
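Since Π*_{l-1} is fixed when comparing nodes on the l-th layers, the entire cost matrix of pairwise node costs d_W can be computed in vectorized form by expanding the square in equation 5. A minimal NumPy sketch (our own illustration; the function name and layer shapes are hypothetical):

```python
import numpy as np

def pairwise_dW(W, W_prime, Pi_prev):
    """Pairwise node costs d_W(w, w') of equation (5) for two layers.

    W:        (k_l, k_{l-1})   incoming weight matrix of layer N_l (rows = weight functions w)
    W_prime:  (k'_l, k'_{l-1}) incoming weight matrix of layer N'_l
    Pi_prev:  (k_{l-1}, k'_{l-1}) optimal coupling Pi*_{l-1} between gamma_{l-1} and gamma'_{l-1}

    Returns C with C[j, g] = sum_{q,s} (W[j, q] - W_prime[g, s])**2 * Pi_prev[q, s].
    """
    # Expand the square: sum_{q,s} (w_q^2 - 2 w_q w'_s + w'_s^2) * pi_{qs}
    a = Pi_prev.sum(axis=1)            # marginal over the nodes of N_{l-1}
    b = Pi_prev.sum(axis=0)            # marginal over the nodes of N'_{l-1}
    sq = (W ** 2) @ a                  # shape (k_l,)
    sq_prime = (W_prime ** 2) @ b      # shape (k'_l,)
    cross = W @ Pi_prev @ W_prime.T    # shape (k_l, k'_l)
    return sq[:, None] + sq_prime[None, :] - 2.0 * cross
```

The resulting matrix plays the role of the cost matrix C_l in equation 4 (and of C^i_l in the fusion problem below) and can be passed directly to an OT solver.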
2.3 FUSION ALGORITHM
In the following we consider n input FC NNs N^1, . . . , N^n. We use N^i_l to denote the l-th layer of
the i-th network N^i and k^i_l to denote the number of nodes in that layer, i.e. k^i_l = |N^i_l|. Let γ^i_l
be the probability measure on layer N^i_l, defined similarly to equation 3, with the support points
being the nodes in that layer. We denote the target model (i.e. the desired fusion output) by N^new and
use k_2, . . . , k_m to denote the sizes of its layers N^new_2, . . . , N^new_m, respectively. We assume that all
networks, including the target model, have the same input layer and the same number of layers m.
Based on the discussion in Sections 2.1 and 2.2, we now describe an inductive construction of the
layers of the target network N^new by fusing all n input NNs. First, N^new_1 is set to be equal to N^1_1: this
is the base case of the inductive construction and simply means that we set the input layer of N^new
to be the same as that of the other models; we also set γ_1 := γ^1_1. Next, assuming that the fusion has
been completed for layers 1 to l-1 (l ≥ 2), we consider the fusion of the l-th layer. For simplicity
of notation, we drop the index l while referring to nodes and their corresponding weights
in this layer. In particular, we use v^i_g and w^i_g to denote the nodes in layer N^i_l and their corresponding
weights. To carry out the fusion of the l-th layer of the input models, we aggregate their corresponding
measures by finding a WB, which provides us with a sensible “average” l-th layer for the target
model. Hence, we consider the following WBP over γ^1_l, . . . , γ^n_l:

    min_{γ_l, {Π^i_l}_i}  (1/n) ∑_{i=1}^{n} W(γ_l, γ^i_l) := (1/n) ∑_{i=1}^{n} ⟨C^i_l, Π^i_l⟩   s.t.   γ_l = (1/k_l) ∑_{j=1}^{k_l} δ_{v_j},   v_j = (γ_{l-1}, w_j).    (6)

Here the measure γ_l is the candidate l-th layer “average” of the input models and is forced to take a
specific form (notice that we have fixed the size of its support and the masses assigned to its support
points). Nodes v_j in the support of γ_l are set to take the form v_j = (γ_{l-1}, w_j), i.e. the measure γ_{l-1}
obtained when fusing the (l-1)-th layers is the first coordinate in all the v_j. This connects the current
layer of the target model with its previous layer. As done for the input models, w_j is interpreted
as a function from the (l-1)-th layer into the reals, and represents the actual weight vector from
the (l-1)-th layer to the j-th node in the l-th layer of the new model. C^i_l := (c_l(v_j, v^i_g))_{j,g} are the
cost matrices corresponding to the WBP in equation 6, where c_l is a cost function between nodes on the
l-th layers (see Section 2.2). Let W_l and W^i_l be the weight function matrices of the l-th layer
of the target model N^new and input model N^i respectively (e.g. W_l := (w_1, . . . , w_{k_l})^T) and define