WASSERSTEIN BARYCENTER-BASED MODEL FUSION AND LINEAR MODE CONNECTIVITY OF NEURAL NETWORKS
Aditya Kumar Akash
Department of Computer Science
University of Wisconsin-Madison
aakash@wisc.edu
Sixu Li
Department of Statistics
University of Wisconsin-Madison
sli739@wisc.edu
Nicolás García Trillos
Department of Statistics
University of Wisconsin-Madison
garciatrillo@wisc.edu
ABSTRACT
Based on the concepts of Wasserstein barycenter (WB) and Gromov-Wasserstein
barycenter (GWB), we propose a unified mathematical framework for neural net-
work (NN) model fusion and utilize it to reveal new insights about the linear mode
connectivity of SGD solutions. In our framework, the fusion occurs in a layer-wise
manner and builds on an interpretation of a node in a network as a function of the
layer preceding it. The versatility of our mathematical framework allows us to
talk about model fusion and linear mode connectivity for a broad class of NNs,
including fully connected NN, CNN, ResNet, RNN, and LSTM, in each case ex-
ploiting the specific structure of the network architecture. We present extensive
numerical experiments to: 1) illustrate the strengths of our approach in relation to
other model fusion methodologies and 2) from a certain perspective, provide new
empirical evidence for recent conjectures which say that two local minima found
by gradient-based methods end up lying on the same basin of the loss landscape
after a proper permutation of weights is applied to one of the models. 1
1 INTRODUCTION
The increasing use of edge devices like mobile phones, tablets, and vehicles, along with the sophisti-
cation in sensors present in them (e.g. cameras, GPS, and accelerometers), has led to the generation
of an enormous amount of data. However, data privacy concerns, communication costs, bandwidth
limits, and time sensitivity prevent the gathering of local data from edge devices into one single
centralized location. These obstacles have motivated the design and development of federated learn-
ing strategies which are aimed at pooling information from locally trained neural networks (NNs)
with the objective of building strong centralized models without relying on the collection of local
data McMahan et al. (2017); Kairouz et al. (2019). Due to these considerations, the problem of
NN fusion, i.e. combining multiple models which were trained differently into a single model, is a
fundamental task in federated learning.
A standard fusion method for aggregating models with the same architecture is FedAvg McMahan
et al. (2017), which involves element-wise averaging of the parameters of local models. This is also
known as vanilla averaging Singh & Jaggi (2019). Although easily implementable, vanilla averaging
performs poorly when fusing models whose weights do not have a one-to-one correspondence. This
happens because even when models are trained on the same dataset it is possible to obtain models
that differ only by a permutation of weights Wang et al. (2020); Yurochkin et al. (2019); this feature
Now at Google
Equal Contribution
1Code is available at: https://github.com/SixuLi/WBFusionAndLMC
is known as the permutation invariance property of neural networks. Moreover, vanilla averaging is
not naturally designed to work when using local models with different architectures (e.g., different
widths). In order to address these challenges, Singh & Jaggi (2019) proposed to first find the best
alignment between the neurons (weights) of different networks by using optimal transport (OT)
Villani (2008); Santambrogio (2015); Peyré & Cuturi (2018) and then carry out a vanilla averaging
step. In Liu et al. (2022), the authors formulate model fusion as a graph matching problem,
which utilizes the second-order similarity of model weights to align neurons. Other approaches,
like those proposed in Yurochkin et al. (2019); Wang et al. (2020), interpret nodes of local models
as random permutations of latent “global nodes” modeled according to a Beta-Bernoulli process
prior Thibaux & Jordan (2007). By using “global nodes”, nodes from different input NNs can be
embedded into a common space where comparisons and aggregation are meaningful. Most works
in the literature discussing the fusion problem have mainly focused on the aggregation of fully con-
nected (FC) neural networks and CNNs, but have not, for the most part, explored other kinds of
architectures like RNNs and LSTMs. One exception to this general state of the art is the work Wang
et al. (2020), which considers the fusion of RNNs by ignoring hidden-to-hidden weights during the
neurons’ matching, thus discarding some useful information in the pre-trained RNNs. For more
references on the fusion problem, see Appendix A.1.
A different line of research that has attracted considerable attention in the past few years is the quest
for a comprehensive understanding of the loss landscape of deep neural networks, a fundamental
component in studying the optimization and generalization properties of NNs Li et al. (2018); Mei
et al. (2018); Neyshabur et al. (2017); Nguyen et al. (2018); Izmailov et al. (2018). Due to over-
parameterization, scale, and permutation invariance properties of neural networks, the loss land-
scapes of DNNs have many local minima Keskar et al. (2016); Zhang et al. (2021). Different works
have asked and answered affirmatively the question of whether there exist paths of small-increasing
loss connecting different local minima found by SGD Garipov et al. (2018); Draxler et al. (2018).
This phenomenon is often referred to as mode connectivity Garipov et al. (2018) and the loss in-
crease along paths between two models is often referred to as (energy) barrier Draxler et al. (2018).
It has been observed that low-barrier paths are non-linear, i.e., linear interpolation of two different
models will not usually produce a neural network with small loss. These observations suggest that,
from the perspective of local structure properties of loss landscapes, different SGD solutions belong
to different (well-separated) basins Neyshabur et al. (2020). However, recent work Entezari et al.
(2021) has conjectured that local minima found by SGD do end up lying on the same basin of the
loss landscape after a proper permutation of weights is applied to one of the models. The question
of how to find these desired permutations remains in general elusive.
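To make the notions of linear interpolation between models and the resulting (energy) barrier concrete, the following is a minimal PyTorch-style sketch. It is our own illustration, not code from any of the cited works: `loss_on_dataset` is a user-supplied function returning a float loss, the two state dicts are assumed to come from models with identical architectures, and the barrier convention used here (loss above the linear interpolation of the endpoint losses) is one common choice.

```python
import numpy as np
import torch

def linear_path_losses(model, state_dict_1, state_dict_2, loss_on_dataset, num_points=21):
    """Evaluate the loss along theta(t) = (1 - t) * theta_1 + t * theta_2 for t in [0, 1]."""
    ts = np.linspace(0.0, 1.0, num_points)
    losses = []
    for t in ts:
        # Element-wise interpolation of all parameters (same keys in both state dicts).
        interpolated = {
            k: (1.0 - float(t)) * state_dict_1[k] + float(t) * state_dict_2[k]
            for k in state_dict_1
        }
        model.load_state_dict(interpolated)
        with torch.no_grad():
            losses.append(float(loss_on_dataset(model)))  # user-supplied evaluation
    return ts, np.array(losses)

def energy_barrier(losses):
    """Barrier: largest loss excess along the path over the endpoints' linear baseline
    (one common convention; other works use the maximum over the endpoints instead)."""
    baseline = np.linspace(losses[0], losses[-1], len(losses))
    return float(np.max(losses - baseline))
```

A low barrier along this linear path between model 1 and a suitably permuted model 2 is exactly the kind of evidence discussed for the conjecture above.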
The purpose of this paper is twofold. On one hand, we present a large family of barycenter-based
fusion algorithms that can be used to aggregate models within the families of fully connected NNs,
CNNs, ResNets, RNNs and LSTMs. The most general family of fusion algorithms that we intro-
duce relies on the concept of Gromov-Wasserstein barycenter (GWB), which allows us to use the
information in hidden-to-hidden layers in RNNs and LSTMs in contrast to previous approaches in
the literature like that proposed in Wang et al. (2020). In order to motivate the GWB based fusion
algorithm for RNNs and LSTMs, we first discuss a Wasserstein barycenter (WB) based fusion algo-
rithm for fully connected, CNN, and ResNet models which follows closely the OT fusion algorithm
from Singh & Jaggi (2019). By creating a link between the NN model fusion problem and the prob-
lem of computing Wasserstein (or Gromov-Wasserstein) barycenters, our aim is to exploit the many
tools that have been developed in the last decade for the computation of WB (or GWB) (see
Appendix A.2 for references) and to leverage the mathematical structure of OT problems. Using
our framework, we are able to fuse models with different architectures and build target models with
arbitrary specified dimensions (at least in terms of width). On the other hand, through several nu-
merical experiments in a variety of settings (architectures and datasets), we provide new evidence
backing certain aspects of the conjecture put forward in Entezari et al. (2021) about the local struc-
ture of NNs’ loss landscapes. Indeed, we find that there exist sparse couplings between different
models that can map different local minima found by SGD into basins that are only separated by
low energy barriers. These sparse couplings, which can be thought of as approximations to actual
permutations, are obtained using our fusion algorithms, which, surprisingly, only use training data to
set the values of some hyperparameters. We explore this conjecture in imaging and natural language
processing (NLP) tasks and provide visualizations of our findings. Consider, for example, Figure 1
(left), which is the visualization of fusing two FC NNs independently trained on the MNIST dataset.
We can observe that the basins where model 1 and permuted model 2 (i.e. model 2 after multiplying
its weights by the coupling obtained by our fusion algorithm) land are close to each other and are
only separated by low energy barriers.
[Figure 1 (left): contour plot of test error; markers indicate Base model 1, Base model 2, Permuted model 2, and the Fused model; contour levels range from 1.7 to >8.]
Figure 1: Left: The test error surface of FC NNs trained on MNIST. The permuted model 2 is
model 2 after multiplying its weights by the coupling obtained by our fusion algorithm. Right: The
illustration of our interpretation of FC NNs. Following our definitions, node v := (γ_2, w), where
γ_2 is a probability measure on layer N_2 and w : N_2 → R is the weight function corresponding to
node v. For example, the scalar w(z_2) is the weight between nodes v and z_2, and we use w_2 as the
shorthand notation for w(z_2).
Our main contributions can then be summarized as follows: (a) we formulate the network model
fusion problem as a series of Wasserstein (Gromov-Wasserstein) barycenter problems, bridging in
this way the NN fusion problem with computational OT; (b) we empirically demonstrate that our
framework is highly effective at fusing different types of networks, including RNNs and LSTMs; (c)
we visualize the result of our fusion algorithm when aggregating two neural networks in a 2D-plane.
By doing this we not only provide some illustrations of how our fusion algorithms perform, but also
present empirical evidence for the conjecture made in Entezari et al. (2021), casting light over the
loss landscape of a variety of neural networks.
At the time of completing this work, we became aware of two very recent preprints which also
explore the conjecture made in Entezari et al. (2021) empirically. In particular, Ainsworth et al.
(2022) demonstrates that there is zero-barrier LMC (after permutation) between two independently
trained NNs (including ResNet) provided the width of layers is large enough. In Benzing et al.
(2022), the conjecture is explored for FC NNs, finding that the average of two randomly initialized
models using the permutation revealed through training gives a non-trivial NN. Compared to our
work, neither of these two works explores this conjecture for recurrent NNs; we highlight that our
GWB fusion method is of particular relevance for this aim. To the best of our knowledge, we thus
provide the first-ever exploration of the conjecture posited in Entezari et al. (2021) for NLP tasks.
1.1 NOTATION
We first introduce some basic notation and briefly review a few relevant concepts from OT. The simplex
of histograms with n bins is denoted by Σ_n := {a ∈ R^n_+ : ∑_i a_i = 1}. The set of couplings between
histograms a ∈ Σ_{n_1} and b ∈ Σ_{n_2} is denoted by Γ(a, b) := {Π ∈ R^{n_1 × n_2}_+ : Π 1_{n_2} = a, Π^T 1_{n_1} = b},
where 1_n := (1, . . . , 1)^T ∈ R^n. For any 4-way tensor L = (L_{ijkl})_{i,j,k,l} and matrix Π = (π_{ij})_{i,j},
we define the tensor-matrix multiplication of L and Π as the matrix L ⊗ Π := (∑_{k,l} L_{ijkl} π_{kl})_{i,j}.
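As an illustration of this tensor-matrix multiplication, here is a minimal NumPy sketch (the shapes are arbitrary choices for the example, not values used in the paper):

```python
import numpy as np

# 4-way tensor L with entries L[i, j, k, l] and a matrix Pi with entries Pi[k, l].
n1, n2 = 3, 4  # arbitrary sizes for this illustration
L = np.random.rand(n1, n2, n1, n2)
Pi = np.random.rand(n1, n2)
Pi /= Pi.sum()  # normalize so Pi resembles a coupling matrix

# (L ⊗ Pi)[i, j] = sum_{k, l} L[i, j, k, l] * Pi[k, l]
L_otimes_Pi = np.einsum('ijkl,kl->ij', L, Pi)
print(L_otimes_Pi.shape)  # (3, 4)
```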
1.2 OPTIMAL TRANSPORT AND WASSERSTEIN BARYCENTERS
Let X be an arbitrary topological space and let c : X × X → [0, ∞) be a cost function assumed to
satisfy c(x, x) = 0 for every x. We denote by M_1^+(X) the space of (Borel) probability measures
on X. For {x_i}_{i=1}^{n_1}, {y_j}_{j=1}^{n_2} ⊆ X, define discrete measures μ = ∑_{i=1}^{n_1} a_i δ_{x_i} and ν = ∑_{j=1}^{n_2} b_j δ_{y_j}
in M_1^+(X), where a ∈ Σ_{n_1}, b ∈ Σ_{n_2}, and δ_x denotes the Dirac delta measure at x ∈ X. The
Wasserstein “distance” between μ and ν, relative to the cost c, is defined as

    W(μ, ν) := inf_{Π ∈ Γ(μ, ν)} ⟨C, Π⟩,    (1)

where C := (c(x_i, y_j))_{i,j} is the “cost” matrix between {x_i}_i, {y_j}_j ⊆ X, Π := (π_{ij})_{i,j} ∈ Γ(μ, ν)
is the coupling matrix between μ and ν, and ⟨A, B⟩ := tr(A^T B) is the Frobenius inner product.
Let {γ_i}_{i=1}^{n} ⊆ M_1^+(X) be a collection of discrete probability measures. The Wasserstein barycenter
problem (WBP) Agueh & Carlier (2011) associated with these measures reads

    min_{γ ∈ M_1^+(X)}  (1/n) ∑_{i=1}^{n} W(γ, γ_i).    (2)

A minimizer of this problem is called a Wasserstein barycenter (WB) of the measures {γ_i}_{i=1}^{n} and
can be understood as an average of the input measures. In the sequel we will use the concept of WB
to define fusion algorithms for FC NN, CNN, and ResNet. For RNN and LSTM the fusion reduces
to solving a series of Gromov-Wasserstein barycenter-like problems (see the review of GWBP in
the Appendix).
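For concreteness, here is a minimal sketch (our own illustration, not the paper's code) of how the quantities in equations 1 and 2 can be computed for discrete measures using the POT library; the point clouds, grid, and regularization level are arbitrary placeholders, and the barycenter solver shown is the entropically regularized (fixed-support) one rather than an exact solver.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)

# --- Wasserstein "distance" between two discrete measures (equation 1) ---
x = rng.normal(size=(5, 2))            # support points {x_i} of mu in R^2
y = rng.normal(size=(7, 2)) + 1.0      # support points {y_j} of nu
a = np.full(5, 1.0 / 5)                # histogram a in the simplex
b = np.full(7, 1.0 / 7)                # histogram b in the simplex
C = ot.dist(x, y)                      # cost matrix C[i, j] = c(x_i, y_j) (squared Euclidean)
Pi = ot.emd(a, b, C)                   # optimal coupling Pi in Gamma(a, b)
W = float(np.sum(C * Pi))              # <C, Pi>

# --- Fixed-support Wasserstein barycenter of two histograms (equation 2) ---
grid = np.linspace(0.0, 1.0, 50).reshape(-1, 1)      # common support
M = ot.dist(grid, grid)
M /= M.max()
A = np.stack([np.exp(-(grid[:, 0] - m) ** 2 / 0.01) for m in (0.3, 0.7)], axis=1)
A /= A.sum(axis=0, keepdims=True)                    # columns are the input histograms
barycenter = ot.bregman.barycenter(A, M, reg=1e-2)   # entropically regularized WB
```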
2 WASSERSTEIN BARYCENTER BASED FUSION
In this section, we discuss our layer-wise fusion algorithm based on the concept of WB. First we
introduce the necessary interpretations of nodes and layers of NNs in Section 2.1. Next in Section
2.2, we describe how to compare layers and nodes across different NNs so as to make sense of
aggregating models through WB. Finally we present our fusion algorithm in Section 2.3.
2.1 NESTED DEFINITION OF FULLY CONNECTED NN
For a fully connected network N, we use v to index the nodes in its l-th layer N_l. Let γ_l denote a
probability measure on the l-th layer defined as the weighted sum of Dirac delta measures over the
nodes in that layer, i.e.,

    γ_l := (1 / |N_l|) ∑_{v ∈ N_l} δ_v  ∈ M_1^+(N_l).    (3)

We interpret a node v from the l-th layer as an element in N_l that couples a function on the domain
N_{l-1} (previous layer) with a probability measure. In particular, the node v is interpreted as v :=
(γ_{l-1}, w), where γ_{l-1} is a measure on the previous layer N_{l-1} and w represents the weights between
the node v and the nodes in the previous layer N_{l-1}. These weights can be interpreted as a function w :
N_{l-1} → R, and we use the notation w_q to denote the value of the function w evaluated at the q-th node
in the previous layer N_{l-1}. For the first layer, i.e. l = 1, the nodes simply represent placeholders for
the input features. The above interpretation is illustrated in Figure 1 (right). This interpretation of
associating nodes with a function of the previous layer allows us to later define a “distance” between nodes
in different NNs (see Section 2.2) and is motivated by TL^p spaces and distances García Trillos &
Slepčev (2015); Thorpe et al. (2017), which are designed for comparing signals with different domains
(see more details in Appendix C.1).
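To make this interpretation concrete, the following is a small illustrative sketch (our own hypothetical representation, not the authors' implementation) of a layer as a discrete measure over its nodes together with the incoming weight functions:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Layer:
    """Layer N_l of a fully connected NN, viewed as a discrete measure over its nodes.

    Each node v is the pair (gamma_{l-1}, w): the measure on the previous layer plus the
    incoming weight vector w, seen as a function w : N_{l-1} -> R.
    """
    weights: np.ndarray   # shape (k_l, k_{l-1}); row j is the weight function w of node j
    masses: np.ndarray    # histogram gamma_l over the k_l nodes (uniform, as in equation 3)

def layer_from_weight_matrix(W: np.ndarray) -> Layer:
    # Uniform measure gamma_l = (1/|N_l|) sum_v delta_v, as in equation 3.
    k_l = W.shape[0]
    return Layer(weights=W, masses=np.full(k_l, 1.0 / k_l))
```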
2.2 COST FUNCTIONS FOR COMPARING LAYERS AND NODES
Having introduced our interpretations of NNs, we now define the cost functions for comparing layers
and nodes which will be used to aggregate models through WB. Consider the l-th layers N_l and N'_l
of two NNs N and N' respectively. We use the Wasserstein distance between the measures γ_l and γ'_l
over N_l and N'_l respectively to define a distance between the layers:

    d_μ(γ_l, γ'_l) := W(γ_l, γ'_l) = inf_{Π_l ∈ Γ(γ_l, γ'_l)} ⟨C_l, Π_l⟩,    (4)

where the matrix Π_l = [π_{l,jg}]_{j,g} is a coupling between the measures γ_l and γ'_l, and C_l is the cost matrix
given by C_l := (c_l(v, v'))_{v,v'}, where c_l is a cost function between nodes on the l-th layers.
Following our inductive interpretation of NNs, the cost function c_l can also be defined inductively.
Consider nodes v and v' from the l-th layer of NNs N and N' respectively. For the first layer l = 1,
we pick a natural candidate for the cost function, namely c_1(v, v') := 1_{v ≠ v'}, a reasonable choice given
that all networks have the same input layer. For l ≥ 2, recall our interpretation of nodes v =
(γ_{l-1}, w), v' = (γ'_{l-1}, w'), where γ_{l-1} and γ'_{l-1} denote the respective measures associated with the
previous layer l-1, and w, w' denote the respective weight functions for nodes v and v'. Since the
domains of the weight functions w and w' are layers in different NNs, it is not clear how to compare
them directly. However, in the TL^p interpretation, after finding a suitable coupling between the support
measures γ_{l-1} and γ'_{l-1}, one can couple the functions w and w' and use a direct L^2-comparison.

Motivated by computational and methodological considerations, we use a slight modification of
the TL^p distance and decouple the problem for the measures from the weights. Specifically, we
define c_l(v, v') := d_μ(γ_{l-1}, γ'_{l-1}) + d_W(w, w'), where d_μ is the Wasserstein distance (as defined
in equation 4) between the measures γ_{l-1} and γ'_{l-1} from layers l-1, and d_W is defined using the
optimal coupling of the weight functions' support measures, i.e.,

    d_W(w, w') := ∑_{q,s} (w_q - w'_s)^2 π*_{l-1,qs} =: ⟨L(w, w'), Π*_{l-1}⟩,    (5)

where L(w, w') := ((w_q - w'_s)^2)_{q,s} and Π*_{l-1} = (π*_{l-1,qs})_{q,s} is the optimal coupling between
γ_{l-1} and γ'_{l-1}. Note that d_μ(γ_{l-1}, γ'_{l-1}) is a fixed constant when comparing any two nodes on the
l-th layers N_l and N'_l. For simplicity, we let c_l(v, v') = d_W(w, w') in what follows, and the
information of the support measures γ_{l-1} and γ'_{l-1} is implicitly included in their optimal coupling
Π*_{l-1}. Here we have omitted bias terms to ease the exposition of our framework, but a natural
implementation that accounts for bias terms can be obtained by simply concatenating them with the
weight matrix.

We set Π*_1 equal to the identity matrix normalized by the size of the input layer, given that this is
a solution to equation 4 when the cost c_1 is defined as c_1(v, ṽ) := 1_{v ≠ ṽ}. Other choices of the cost
function c_l are possible, e.g. the activation-based cost function proposed in Singh & Jaggi (2019).
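Since Π*_{l-1} is fixed when comparing nodes on the l-th layers, the entire cost matrix of pairwise node costs d_W can be computed in vectorized form by expanding the square in equation 5. A minimal NumPy sketch (our own illustration; the function name and layer shapes are hypothetical):

```python
import numpy as np

def pairwise_dW(W, W_prime, Pi_prev):
    """Pairwise node costs d_W(w, w') of equation (5) for two layers.

    W:        (k_l, k_{l-1})   incoming weight matrix of layer N_l (rows = weight functions w)
    W_prime:  (k'_l, k'_{l-1}) incoming weight matrix of layer N'_l
    Pi_prev:  (k_{l-1}, k'_{l-1}) optimal coupling Pi*_{l-1} between gamma_{l-1} and gamma'_{l-1}

    Returns C with C[j, g] = sum_{q,s} (W[j, q] - W_prime[g, s])**2 * Pi_prev[q, s].
    """
    # Expand the square: sum_{q,s} (w_q^2 - 2 w_q w'_s + w'_s^2) * pi_{qs}
    a = Pi_prev.sum(axis=1)            # marginal over the nodes of N_{l-1}
    b = Pi_prev.sum(axis=0)            # marginal over the nodes of N'_{l-1}
    sq = (W ** 2) @ a                  # shape (k_l,)
    sq_prime = (W_prime ** 2) @ b      # shape (k'_l,)
    cross = W @ Pi_prev @ W_prime.T    # shape (k_l, k'_l)
    return sq[:, None] + sq_prime[None, :] - 2.0 * cross
```

The resulting matrix plays the role of the cost matrix C_l in equation 4 (and of C^i_l in the fusion problem below) and can be passed directly to an OT solver.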
2.3 FUSION ALGORITHM
In the following we consider n input FC NNs N^1, . . . , N^n. We use N^i_l to denote the l-th layer of
the i-th network N^i and k^i_l to denote the number of nodes in that layer, i.e. k^i_l = |N^i_l|. Let γ^i_l
be the probability measure on layer N^i_l, defined similarly to equation 3, with the support points
being the nodes in that layer. We denote the target model (i.e. the desired fusion output) by N^new and
use k_2, . . . , k_m to denote the sizes of its layers N^new_2, . . . , N^new_m, respectively. We assume that all
networks, including the target model, have the same input layer and the same number of layers m.
Based on the discussion in Sections 2.1 and 2.2, we now describe an inductive construction of the
layers of the target network N^new by fusing all n input NNs. First, N^new_1 is set to be equal to N^1_1: this
is the base case of the inductive construction and simply means that we set the input layer of N^new
to be the same as that of the other models; we also set γ_1 := γ^1_1. Next, assuming that the fusion has
been completed for layers 1 to l-1 (l ≥ 2), we consider the fusion of the l-th layer. For simplicity
of notation, we drop the index l while referring to nodes and their corresponding weights
in this layer. In particular, we use v^i_g and w^i_g to denote the nodes in layer N^i_l and their corresponding
weights. To carry out the fusion of the l-th layer of the input models, we aggregate their corresponding
measures by finding a WB, which provides us with a sensible “average” l-th layer for the target
model. Hence, we consider the following WBP over γ^1_l, . . . , γ^n_l:

    min_{γ_l, {Π^i_l}_i}  (1/n) ∑_{i=1}^{n} W(γ_l, γ^i_l) := (1/n) ∑_{i=1}^{n} ⟨C^i_l, Π^i_l⟩   s.t.   γ_l = (1/k_l) ∑_{j=1}^{k_l} δ_{v_j},   v_j = (γ_{l-1}, w_j).    (6)

Here the measure γ_l is the candidate l-th layer “average” of the input models and is forced to take a
specific form (notice that we have fixed the size of its support and the masses assigned to its support
points). Nodes v_j in the support of γ_l are set to take the form v_j = (γ_{l-1}, w_j), i.e. the measure γ_{l-1}
obtained when fusing the (l-1)-th layers is the first coordinate in all the v_j. This connects the current
layer of the target model with its previous layer. As done for the input models, w_j is interpreted
as a function from the (l-1)-th layer into the reals, and represents the actual weight vector from
the (l-1)-th layer to the j-th node in the l-th layer of the new model. C^i_l := (c_l(v_j, v^i_g))_{j,g} are the
cost matrices corresponding to the WBP in equation 6, where c_l is a cost function between nodes on the
l-th layers (see Section 2.2). Let W_l and W^i_l be the weight function matrices of the l-th layer
of the target model N^new and input model N^i respectively (e.g. W_l := (w_1, . . . , w_{k_l})^T) and define