Investigating Neuron Disturbing in Fusing Heterogeneous
Neural Networks
Biao Zhang
School of Mathematical Sciences
Fudan University
Shanghai, China
zhangb20@fudan.edu.cn
Shuqin Zhang∗ (corresponding author)
School of Mathematical Sciences
Fudan University
Shanghai, China
zhangs@fudan.edu.cn
ABSTRACT
Fusing deep learning models trained on separately located clients into a global model in a one-shot communication round is a straightforward implementation of federated learning. Although current model fusion methods have been shown experimentally valid for fusing neural networks with almost identical architectures, they are rarely analyzed theoretically. In this paper, we reveal the phenomenon of neuron disturbing, in which neurons from heterogeneous local models interfere with each other. We give detailed explanations from a Bayesian viewpoint, combining the data heterogeneity among clients with properties of neural networks. Furthermore, to validate our findings, we propose an experimental method that excludes neuron disturbing and fuses neural networks by adaptively selecting a local model, called AMS, to execute the prediction according to the input. The experiments demonstrate that AMS is more robust to data heterogeneity than general model fusion and ensemble methods. This implies the necessity of considering neuron disturbing in model fusion. Besides, AMS can fuse models with varying architectures as an experimental algorithm, and we also list several possible extensions of AMS for future work.
Keywords Federated Learning · Model Fusion · Data Heterogeneity · Neural Network
1 Introduction
As concerns about privacy protection have grown in recent years, federated learning algorithms [1, 2], in which the model is learned without data transmission among clients, have developed rapidly. Communication costs between clients and the server, and client heterogeneity, are two key challenges in federated training. Regarding communication frequency, the number of communication rounds within a given period must increase to alleviate the discrepancy among the models learned on the clients. Classic algorithms such as FedAvg [1] and FedProx [3] focus on training a model via multiple communication rounds of parameter exchange, which results in high privacy risk and communication cost. Different from this paradigm, model fusion methods are designed for fusing the local
models from clients into a global model [4–15] in a one-shot manner based only on the weights of the local models.
Earlier model fusion studies [5, 8, 9] point out that directly averaging parameters is not sound because of the permutation invariance of weights: the channels of different networks are, in general, randomly permuted with respect to one another. Thus, many algorithms formulate the fusion problem as an alignment problem, solved via linear assignment [5, 8], optimal transport [9, 16], or graph matching [14]. These studies assume that all local neural networks share the same architecture, though a varying number of neurons in each layer is allowed. Besides, a fundamental implicit assumption shared by them is that, if the probability measure of the parameters in the fused global model well approximates the probability measure of the neurons in the local models under a specific sampling scheme, then the performance of those local models is also maintained (the neuron approximation assumption, NA assumption). This assumption underlies the feasibility of model fusion. Nevertheless, the rationality of this prior knowledge under federated learning settings has not been thoroughly investigated. In this paper, we reveal that when the data heterogeneity and the discrepancy among the model optimization procedures of the clients are large, the above assumption generally does not hold because of neuron disturbing, in which neurons extracted from heterogeneous clients (data distributions on clients are non-IID and model optimization procedures differ) disturb each other and harm the performance of the fused model. We present a basic analysis of neuron disturbing from a Bayesian view.

Furthermore, inspired by a phenomenon among heterogeneous models that we call "absolute confidence of neural networks", we propose model fusion via adaptively selecting a local model (AMS). As an experimental algorithm, AMS shows robustly better performance in fusing multilayer perceptrons (MLPs) and convolutional neural networks (CNNs) trained on datasets with two kinds of data partitions. The performance of the other methods, including the ensemble method, FedAvg, and PFNM, declines rapidly when the severity of data heterogeneity reaches some point. These results verify the existence of neuron disturbing and indicate the necessity of handling it when developing model fusion methods. Besides, in light of computational complexity, we also list possible extensions of AMS for real-world applications not limited to federated learning.
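To make the selection idea concrete, below is a minimal PyTorch sketch. The per-input criterion shown here (pick the local model with the largest maximum softmax probability) is an assumption made for illustration; the exact rule used by AMS is specified later in the paper.

```python
import torch
import torch.nn.functional as F

def ams_predict(x, local_models):
    """Predict a single input x (shape (1, input_dim)) with the local model
    that is most confident on it.

    Assumption: 'most confident' means the largest maximum softmax
    probability; this is an illustrative choice, not the paper's exact rule.
    """
    best_conf, best_pred = -1.0, None
    for model in local_models:
        model.eval()
        with torch.no_grad():
            probs = F.softmax(model(x), dim=-1)   # class probabilities
        conf, pred = probs.max(dim=-1)            # confidence and predicted label
        if conf.item() > best_conf:
            best_conf, best_pred = conf.item(), pred
    return best_pred, best_conf
```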
2 Related Work
2.1 Federated Learning
Federated learning aims to learn a shared global model from data distributed on edge clients without data transmission. FedAvg [1] is the initial aggregation method, in which the parameters of local models trained on the clients' data are averaged coordinate-wise. Follow-up studies [3, 17–19] tackle the client drift mitigation issue [17], in which the local optima lie far away from each other when the global model is optimized with different local objectives, and the average of the resultant client updates then moves away from the true global optimum. Data-sharing methods including […]. As for the aggregation scheme, many methods require ideal assumptions such as Lipschitz continuity [3, 17, 20–23] and convexity [20, 21, 23]. Different from these methods, model fusion, which learns a unified model from heterogeneous pre-trained local models, provides a viable approach to FL involving deep neural networks.
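For concreteness, a minimal sketch of the coordinate-wise averaging performed by FedAvg, assuming PyTorch state dictionaries with identical keys and weighting by the clients' local sample counts:

```python
import torch

def fedavg(state_dicts, sample_counts):
    """Coordinate-wise weighted average of local model parameters (FedAvg-style)."""
    total = float(sum(sample_counts))
    avg = {}
    for key in state_dicts[0]:
        avg[key] = sum(
            sd[key].float() * (n / total)
            for sd, n in zip(state_dicts, sample_counts)
        )
    return avg

# usage sketch:
# global_model.load_state_dict(fedavg([m.state_dict() for m in local_models], counts))
```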
2.2 Ensembling methods
Ensemble methods [24–28] combine the outputs of different models to improve prediction performance. However, this kind of approach requires maintaining all of the models and thus becomes infeasible when computational resources are limited, as in many applications. In the prior study [5], the performance of the ensemble method is viewed as the upper extreme of aggregation when limited to a single communication round. In this paper, however, we analyze neuron disturbing in combination with the uniform ensemble method.
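A minimal sketch of the uniform (equally weighted) output ensemble referred to here, assuming classification models that return logits:

```python
import torch
import torch.nn.functional as F

def uniform_ensemble_predict(x, local_models):
    """Average the softmax outputs of all local models with equal weights
    and return the predicted class indices."""
    with torch.no_grad():
        probs = [F.softmax(m(x), dim=-1) for m in local_models]
    mean_probs = torch.stack(probs).mean(dim=0)
    return mean_probs.argmax(dim=-1)
```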
2.3 Model Fusion
Model fusion methods can be broadly divided into two categories. One category is knowledge distillation [29–31], where the key idea is to employ the knowledge of pre-trained teacher neural networks (the local models) to learn a student neural network (the global model). In [11], the authors propose ensemble distillation for model fusion, training the global model on unlabeled data against the outputs of the local models. In [13], the authors sample higher-quality global models and combine them via a Bayesian model. Moreover, a data-free knowledge distillation approach [12] is proposed in which the global model learns a generator to assemble local information. Methods based on distillation are generally computationally complex and may violate privacy protection, because the distillation process for the global model needs either extra proxy datasets or generators. The other category is parameter matching, where the key idea is to match the parameters, which carry an inherent permutation invariance, across different local models before aggregating them together. In [9], the authors use optimal transport, minimizing a transportation cost matrix to align neurons across different neural networks (NNs). Some work [10] optimizes the assignments between global and local components under a KL divergence through variational inference. Liu et al. [14] formulate parameter matching as a graph matching problem and solve it with a corresponding solver. Yurochkin et al. [6] develop a Bayesian nonparametric meta-model to learn shared global structure among local parameters. The meta-model treats the local parameters as noisy realizations of global parameters and formally characterizes the generative process through the Beta-Bernoulli process (BBP) [32]. This meta-model has been successfully extended to different applications [5–8]. Follow-up algorithms formulate the fusion problem as optimal transport [9, 16] and graph matching [14]. However, the above methods rarely investigate the feasibility of model fusion itself, and they are limited to fusing neural networks with identical depth. Some of these algorithms, such as OTfusion [9, 16], rely on multiple communication rounds.
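As an illustration of the parameter-matching idea, the sketch below aligns the neurons of one layer of two local MLPs with the Hungarian algorithm. The Euclidean cost between incoming weight vectors is an illustrative choice only; OTfusion, PFNM, and the graph-matching method each use their own, more elaborate, costs and probabilistic models.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_layer(w_a, w_b):
    """Permute the rows (neurons) of w_b to best match w_a.

    w_a, w_b: (n_neurons, fan_in) weight matrices of the same layer in two
    local models. The cost is the pairwise Euclidean distance between
    incoming weight vectors (an illustrative choice).
    """
    cost = np.linalg.norm(w_a[:, None, :] - w_b[None, :, :], axis=-1)
    _, col_ind = linear_sum_assignment(cost)
    # Note: the columns of the *next* layer of model b must be permuted by
    # col_ind as well, so the network's function is unchanged.
    return w_b[col_ind], col_ind

# After alignment, coordinate-wise averaging of w_a and the permuted w_b at
# least combines corresponding neurons rather than arbitrarily permuted ones.
```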
2.4 Data Heterogeneity
In real-world scenarios, the models on different clients are often trained with heterogeneous datasets, i.e., the data distributions across clients are non-IID. There are several categories of non-identical client distributions, including covariate shift, prior probability shift, concept shift, and unbalancedness [33]. Most previous empirical work on synthetic non-IID datasets [1, 3, 6, 9, 11, 14, 16–19] has focused on label distribution skew, i.e., prior probability shift, where a non-IID dataset is formed by partitioning an existing IID dataset based on the labels. In this paper, we focus on data heterogeneity resulting from non-IID label distributions.
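For concreteness, one standard way to synthesize such label-skew partitions is a Dirichlet split of an IID dataset, where the concentration parameter alpha controls the severity of the skew. The sketch below is this common construction and is not claimed to be the exact partition protocol used in our experiments.

```python
import numpy as np

def dirichlet_label_partition(labels, num_clients, alpha=0.5, seed=0):
    """Partition sample indices across clients by drawing per-class client
    proportions from Dirichlet(alpha); small alpha => severe label skew."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx_c = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        splits = (np.cumsum(proportions)[:-1] * len(idx_c)).astype(int)
        for client_id, part in enumerate(np.split(idx_c, splits)):
            client_indices[client_id].extend(part.tolist())
    return client_indices
```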
3 Preliminaries
Suppose there are $J$ training data sets $D_j$, $j = 1, 2, \ldots, J$, which are sampled in a non-IID way from a data set $D = \{(X_i, Y_i)\}$, $i = 1, 2, \ldots, N$, where $X_i \in \mathbb{R}^I$ and $Y_i \in \mathbb{R}^C$ is the one-hot coding of the class label. The non-IID setting implies that the label distributions of the different data sets are quite different, i.e., heterogeneous. The total number of samples is $N := \sum_{j=1}^{J} |D_j|$, where $|D_j| = N_j$ for $j = 1, 2, \ldots, J$. Besides, we denote by $\tilde{D}_j$ the test data set corresponding to $D_j$, $j = 1, 2, \ldots, J$; each test data set is assumed to be sampled from the same distribution as its corresponding training data set. The union of all the test data sets is $\tilde{D}$, and the total number of samples in this test data set is $\tilde{N} := \sum_{j=1}^{J} |\tilde{D}_j|$, where $|\tilde{D}_j| = \tilde{N}_j$ for $j = 1, 2, \ldots, J$.

Without loss of generality, we suppose that $J$ multilayer perceptrons (MLPs) with two hidden layers are trained with the data sets $D_j$, $j = 1, 2, \ldots, J$, respectively. For the $j$-th MLP, let $(W^{(j,0)} \in \mathbb{R}^{I \times L_{j,0}}, b^{(j,0)} \in \mathbb{R}^{L_{j,0}})$, $(W^{(j,1)} \in \mathbb{R}^{L_{j,0} \times L_{j,1}}, b^{(j,1)} \in \mathbb{R}^{L_{j,1}})$, and $(W^{(j,2)} \in \mathbb{R}^{L_{j,1} \times C}, b^{(j,2)} \in \mathbb{R}^{C})$ be the weight and bias pairs of the two hidden layers and the softmax layer, respectively. Thus, the $j$-th MLP is
$$F_j(X) = \mathrm{softmax}\!\left(W^{(j,2)} \sigma\!\left(W^{(j,1)} \sigma\!\left(W^{(j,0)} X + b^{(j,0)}\right) + b^{(j,1)}\right) + b^{(j,2)}\right), \quad j = 1, 2, \ldots, J,$$
where $\sigma$ is a nonlinear activation function such as ReLU [34]. For simplicity, in our theoretical analysis the bias is neglected by default in this paper via the following augmentation:
$$\tilde{W}^{(j,k)} \tilde{a}^{(j,k)} = \begin{pmatrix} W^{(j,k)} & b^{(j,k)} \end{pmatrix} \begin{pmatrix} a^{(j,k)} \\ 1 \end{pmatrix} = W^{(j,k)} a^{(j,k)} + b^{(j,k)},$$
where $a^{(j,k)}$ is the input of the $k$-th layer in the $j$-th MLP. Therefore, we only consider MLPs without bias, that is,
$$F_j(X) = \mathrm{softmax}\!\left(W^{(j,2)} \sigma\!\left(W^{(j,1)} \sigma\!\left(W^{(j,0)} X\right)\right)\right), \quad j = 1, 2, \ldots, J. \tag{1}$$
The task of fusing neural networks is to learn a global neural network with weights $\theta^{(0)} \in \mathbb{R}^{I \times L_0}$, $\theta^{(1)} \in \mathbb{R}^{L_0 \times L_1}$, $\theta^{(2)} \in \mathbb{R}^{L_1 \times C}$.
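As a reading aid, a minimal NumPy sketch of the bias-free two-hidden-layer MLP of Eq. (1) and the bias-folding augmentation described above. It uses the convention that weight matrices act on column vectors from the left (shapes $(L_0, I)$, $(L_1, L_0)$, $(C, L_1)$), which transposes the dimensions as written above.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - z.max(axis=0, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def mlp_forward(x, W0, W1, W2):
    """Bias-free two-hidden-layer MLP of Eq. (1): softmax(W2 σ(W1 σ(W0 x)))."""
    return softmax(W2 @ relu(W1 @ relu(W0 @ x)))

def fold_bias(W, b):
    """Augment (W, b) into a single matrix so that [W b] @ [a; 1] = W @ a + b."""
    return np.hstack([W, b.reshape(-1, 1)])
```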
4 Heterogeneous Neuron Disturbing
In previous model fusion methods such as PFNM [5] and OTfusion [9], the authors implicitly assume that, for each layer, if the probability measure of the neurons in the local models is well approximated by that of the fused global model under a specific sampling scheme, then the performance of those local models is also maintained. However, whether this assumption holds for neural networks has not been investigated. In the following, we give a simple counter-example to the neuron approximation assumption and reveal that neurons from heterogeneous models disturb each other with a severity that depends on the data heterogeneity; the neural network architecture then has to be adjusted to make the above assumption hold.
4.1 Neuron Disturbing from Optimization Unbalance
We first consider a simple binary classification problem on a 2D simulated data set. As shown in Figure 1(a), the training and test samples are randomly drawn from the selected region. The region is bounded by $x_2 = -x_1 + 1$, $x_1 \in [-2, 0)$; $x_2 = x_1 + 1$, $x_1 \in [0, 2]$; $x_2 = -x_1 - 1$, $x_1 \in [-2, 0)$; and $x_2 = x_1 - 1$, $x_1 \in [0, 2]$. We define a decision boundary $C_l$ for data labelling, i.e., $x_2 = -x_1$, $x_1 \in$ […]