Investigating Neuron Disturbing in Fusing Heterogeneous
Neural Networks
Biao Zhang
School of Mathematical Sciences
Fudan University
Shanghai, China
zhangb20@fudan.edu.cn
Shuqin Zhang∗ (corresponding author)
School of Mathematical Sciences
Fudan University
Shanghai, China
zhangs@fudan.edu.cn
ABSTRACT
Fusing deep learning models trained on separately located clients into a global model in a one-shot communication round is a straightforward implementation of federated learning. Although current model fusion methods have been shown experimentally valid for fusing neural networks with almost identical architectures, they are rarely analyzed theoretically. In this paper, we reveal the phenomenon of neuron disturbing, in which neurons from heterogeneous local models interfere with each other. We give detailed explanations from a Bayesian viewpoint, combining the data heterogeneity among clients with properties of neural networks. Furthermore, to validate our findings, we propose an experimental method that excludes neuron disturbing and fuses neural networks by adaptively selecting a local model, called AMS, to execute the prediction according to the input. The experiments demonstrate that AMS is more robust to data heterogeneity than general model fusion and ensemble methods. This implies the necessity of considering neuron disturbing in model fusion. Besides, AMS can fuse models with varying architectures as an experimental algorithm, and we also list several possible extensions of AMS for future work.
Keywords Federated Learning · Model Fusion · Data Heterogeneity · Neural Network
1 Introduction
As concerns about privacy protection have grown in recent years, federated learning algorithms [1, 2], in which the model is learned without data transmission among clients, have developed rapidly. Communication costs between clients and the server, and client heterogeneity, are two key challenges in federated training. Regarding communication frequency, the number of communication rounds within a given period must increase to alleviate the discrepancy among the models learned on the clients. Classic algorithms such as FedAvg [1] and FedProx [3] focus on training a model via multiple communication rounds of parameter exchange, which results in high privacy risk and communication cost. Different from this paradigm, model fusion methods are designed for fusing the local
models from clients into a global model [4–15] in a one-shot manner based only on the weights of the local models.
Earlier model fusion studies [5, 8, 9] point out that directly averaging parameters is not sound because of the permutation invariance of weights: the channels of different networks are, in general, randomly permuted with respect to one another. Thus, many algorithms formulate the fusion problem as an alignment problem, solved via linear assignment [5, 8], optimal transport [9, 16], or graph matching [14]. These studies assume that all local neural networks share the same architecture, though a varying number of neurons in each layer is allowed. Besides, a fundamental implicit assumption shared by them is that, if the probability measure of the parameters in the fused global model well approximates the probability measure of the neurons in the local models under a specific sampling scheme, then the performance of those local models is also maintained (the neuron approximation assumption, NA assumption). This assumption underlies the feasibility of model fusion. Nevertheless, the rationality of this prior knowledge under federated learning settings has not been thoroughly investigated. In this paper, we reveal that when the data heterogeneity and the discrepancy among the model optimization procedures of the clients are large, the above assumption generally does not hold because of neuron disturbing, in which neurons extracted from heterogeneous clients (data distributions on clients are non-IID and model optimization procedures differ) disturb each other and harm the performance of the fused model. We present a basic analysis of neuron disturbing from a Bayesian view.

Furthermore, inspired by a phenomenon among heterogeneous models that we call "absolute confidence of neural networks", we propose model fusion via adaptively selecting a local model (AMS). As an experimental algorithm, AMS shows robustly better performance in fusing multilayer perceptrons (MLPs) and convolutional neural networks (CNNs) trained on datasets with two kinds of data partitions. The performance of the other methods, including the ensemble method, FedAvg, and PFNM, declines rapidly when the severity of data heterogeneity reaches some point. These results verify the existence of neuron disturbing and indicate the necessity of handling it when developing model fusion methods. Besides, in light of computational complexity, we also list possible extensions of AMS for real-world applications not limited to federated learning.
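To make the selection idea concrete, below is a minimal PyTorch sketch. The per-input criterion shown here (pick the local model with the largest maximum softmax probability) is an assumption made for illustration; the exact rule used by AMS is specified later in the paper.

```python
import torch
import torch.nn.functional as F

def ams_predict(x, local_models):
    """Predict a single input x (shape (1, input_dim)) with the local model
    that is most confident on it.

    Assumption: 'most confident' means the largest maximum softmax
    probability; this is an illustrative choice, not the paper's exact rule.
    """
    best_conf, best_pred = -1.0, None
    for model in local_models:
        model.eval()
        with torch.no_grad():
            probs = F.softmax(model(x), dim=-1)   # class probabilities
        conf, pred = probs.max(dim=-1)            # confidence and predicted label
        if conf.item() > best_conf:
            best_conf, best_pred = conf.item(), pred
    return best_pred, best_conf
```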
2 Related Work
2.1 Federated Learning
Federated learning aims to learn a shared global model from data distributed on edge clients without data transmission. FedAvg [1] is the initial aggregation method, in which the parameters of local models trained on the clients' data are averaged coordinate-wise. Follow-up studies [3, 17–19] tackle the client drift mitigation issue [17], in which the local optima lie far away from each other when the global model is optimized with different local objectives, and the average of the resultant client updates then moves away from the true global optimum. Data-sharing methods including […]. As for the aggregation scheme, many methods require ideal assumptions such as Lipschitz continuity [3, 17, 20–23] and convexity [20, 21, 23]. Different from these methods, model fusion, which learns a unified model from heterogeneous pre-trained local models, provides a viable approach to FL involving deep neural networks.
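For concreteness, a minimal sketch of the coordinate-wise averaging performed by FedAvg, assuming PyTorch state dictionaries with identical keys and weighting by the clients' local sample counts:

```python
import torch

def fedavg(state_dicts, sample_counts):
    """Coordinate-wise weighted average of local model parameters (FedAvg-style)."""
    total = float(sum(sample_counts))
    avg = {}
    for key in state_dicts[0]:
        avg[key] = sum(
            sd[key].float() * (n / total)
            for sd, n in zip(state_dicts, sample_counts)
        )
    return avg

# usage sketch:
# global_model.load_state_dict(fedavg([m.state_dict() for m in local_models], counts))
```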
2.2 Ensembling methods
Ensemble methods [24–28] combine the outputs of different models to improve prediction performance. However, this kind of approach requires maintaining all of the models and thus becomes infeasible when computational resources are limited, as in many applications. In the prior study [5], the performance of the ensemble method is viewed as the upper extreme of aggregation when limited to a single communication round. In this paper, however, we analyze neuron disturbing in combination with the uniform ensemble method.
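A minimal sketch of the uniform (equally weighted) output ensemble referred to here, assuming classification models that return logits:

```python
import torch
import torch.nn.functional as F

def uniform_ensemble_predict(x, local_models):
    """Average the softmax outputs of all local models with equal weights
    and return the predicted class indices."""
    with torch.no_grad():
        probs = [F.softmax(m(x), dim=-1) for m in local_models]
    mean_probs = torch.stack(probs).mean(dim=0)
    return mean_probs.argmax(dim=-1)
```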
2.3 Model Fusion
Model fusion methods can be broadly divided into two categories. One category is knowledge distillation [29–31], where the key idea is to employ the knowledge of pre-trained teacher neural networks (the local models) to learn a student neural network (the global model). In [11], the authors propose ensemble distillation for model fusion, training the global model on unlabeled data against the outputs of the local models. In [13], the authors sample higher-quality global models and combine them via a Bayesian model. Moreover, a data-free knowledge distillation approach [12] is proposed in which the global model learns a generator to assemble local information. Methods based on distillation are generally computationally complex and may violate privacy protection, because the distillation process for the global model needs either extra proxy datasets or generators. The other category is parameter matching, where the key idea is to match the parameters, which carry an inherent permutation invariance, across different local models before aggregating them together. In [9], the authors use optimal transport, minimizing a transportation cost matrix to align neurons across different neural networks (NNs). Some work [10] optimizes the assignments between global and local components under a KL divergence through variational inference. Liu et al. [14] formulate parameter matching as a graph matching problem and solve it with a corresponding solver. Yurochkin et al. [6] develop a Bayesian nonparametric meta-model to learn shared global structure among local parameters. The meta-model treats the local parameters as noisy realizations of global parameters and formally characterizes the generative process through the Beta-Bernoulli process (BBP) [32]. This meta-model has been successfully extended to different applications [5–8]. Follow-up algorithms formulate the fusion problem as optimal transport [9, 16] and graph matching [14]. However, the above methods rarely investigate the feasibility of model fusion itself, and they are limited to fusing neural networks with identical depth. Some of these algorithms, such as OTfusion [9, 16], rely on multiple communication rounds.
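As an illustration of the parameter-matching idea, the sketch below aligns the neurons of one layer of two local MLPs with the Hungarian algorithm. The Euclidean cost between incoming weight vectors is an illustrative choice only; OTfusion, PFNM, and the graph-matching method each use their own, more elaborate, costs and probabilistic models.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_layer(w_a, w_b):
    """Permute the rows (neurons) of w_b to best match w_a.

    w_a, w_b: (n_neurons, fan_in) weight matrices of the same layer in two
    local models. The cost is the pairwise Euclidean distance between
    incoming weight vectors (an illustrative choice).
    """
    cost = np.linalg.norm(w_a[:, None, :] - w_b[None, :, :], axis=-1)
    _, col_ind = linear_sum_assignment(cost)
    # Note: the columns of the *next* layer of model b must be permuted by
    # col_ind as well, so the network's function is unchanged.
    return w_b[col_ind], col_ind

# After alignment, coordinate-wise averaging of w_a and the permuted w_b at
# least combines corresponding neurons rather than arbitrarily permuted ones.
```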
2.4 Data Heterogeneity
In real-world scenarios, the models on different clients are often trained with heterogeneous datasets, i.e., the data distributions across clients are non-IID. There are several categories of non-identical client distributions, including covariate shift, prior probability shift, concept shift, and unbalancedness [33]. Most previous empirical work on synthetic non-IID datasets [1, 3, 6, 9, 11, 14, 16–19] has focused on label distribution skew, i.e., prior probability shift, where a non-IID dataset is formed by partitioning an existing IID dataset based on the labels. In this paper, we focus on data heterogeneity resulting from non-IID label distributions.
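For concreteness, one standard way to synthesize such label-skew partitions is a Dirichlet split of an IID dataset, where the concentration parameter alpha controls the severity of the skew. The sketch below is this common construction and is not claimed to be the exact partition protocol used in our experiments.

```python
import numpy as np

def dirichlet_label_partition(labels, num_clients, alpha=0.5, seed=0):
    """Partition sample indices across clients by drawing per-class client
    proportions from Dirichlet(alpha); small alpha => severe label skew."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx_c = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        splits = (np.cumsum(proportions)[:-1] * len(idx_c)).astype(int)
        for client_id, part in enumerate(np.split(idx_c, splits)):
            client_indices[client_id].extend(part.tolist())
    return client_indices
```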
3 Preliminaries
Suppose there are $J$ training data sets $D_j$, $j = 1, 2, \ldots, J$, which are sampled in a non-IID way from a data set $D = \{(X_i, Y_i)\}$, $i = 1, 2, \ldots, N$, where $X_i \in \mathbb{R}^I$ and $Y_i \in \mathbb{R}^C$ is the one-hot coding of the class label. The non-IID setting implies that the label distributions of the different data sets are quite different, i.e., heterogeneous. The total number of samples is $N := \sum_{j=1}^{J} |D_j|$, where $|D_j| = N_j$ for $j = 1, 2, \ldots, J$. Besides, we denote by $\tilde{D}_j$ the test data set corresponding to $D_j$, $j = 1, 2, \ldots, J$; each test data set is assumed to be sampled from the same distribution as its corresponding training data set. The union of all the test data sets is $\tilde{D}$, and the total number of samples in this test data set is $\tilde{N} := \sum_{j=1}^{J} |\tilde{D}_j|$, where $|\tilde{D}_j| = \tilde{N}_j$ for $j = 1, 2, \ldots, J$.

Without loss of generality, we suppose that $J$ multilayer perceptrons (MLPs) with two hidden layers are trained with the data sets $D_j$, $j = 1, 2, \ldots, J$, respectively. For the $j$-th MLP, let $(W^{(j,0)} \in \mathbb{R}^{I \times L_{j,0}}, b^{(j,0)} \in \mathbb{R}^{L_{j,0}})$, $(W^{(j,1)} \in \mathbb{R}^{L_{j,0} \times L_{j,1}}, b^{(j,1)} \in \mathbb{R}^{L_{j,1}})$, and $(W^{(j,2)} \in \mathbb{R}^{L_{j,1} \times C}, b^{(j,2)} \in \mathbb{R}^{C})$ be the weight and bias pairs of the two hidden layers and the softmax layer, respectively. Thus, the $j$-th MLP is
$$F_j(X) = \mathrm{softmax}\!\left(W^{(j,2)} \sigma\!\left(W^{(j,1)} \sigma\!\left(W^{(j,0)} X + b^{(j,0)}\right) + b^{(j,1)}\right) + b^{(j,2)}\right), \quad j = 1, 2, \ldots, J,$$
where $\sigma$ is a nonlinear activation function such as ReLU [34]. For simplicity, in our theoretical analysis the bias is neglected by default in this paper via the following augmentation:
$$\tilde{W}^{(j,k)} \tilde{a}^{(j,k)} = \begin{pmatrix} W^{(j,k)} & b^{(j,k)} \end{pmatrix} \begin{pmatrix} a^{(j,k)} \\ 1 \end{pmatrix} = W^{(j,k)} a^{(j,k)} + b^{(j,k)},$$
where $a^{(j,k)}$ is the input of the $k$-th layer in the $j$-th MLP. Therefore, we only consider MLPs without bias, that is,
$$F_j(X) = \mathrm{softmax}\!\left(W^{(j,2)} \sigma\!\left(W^{(j,1)} \sigma\!\left(W^{(j,0)} X\right)\right)\right), \quad j = 1, 2, \ldots, J. \tag{1}$$
The task of fusing neural networks is to learn a global neural network with weights $\theta^{(0)} \in \mathbb{R}^{I \times L_0}$, $\theta^{(1)} \in \mathbb{R}^{L_0 \times L_1}$, $\theta^{(2)} \in \mathbb{R}^{L_1 \times C}$.
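As a reading aid, a minimal NumPy sketch of the bias-free two-hidden-layer MLP of Eq. (1) and the bias-folding augmentation described above. It uses the convention that weight matrices act on column vectors from the left (shapes $(L_0, I)$, $(L_1, L_0)$, $(C, L_1)$), which transposes the dimensions as written above.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - z.max(axis=0, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def mlp_forward(x, W0, W1, W2):
    """Bias-free two-hidden-layer MLP of Eq. (1): softmax(W2 σ(W1 σ(W0 x)))."""
    return softmax(W2 @ relu(W1 @ relu(W0 @ x)))

def fold_bias(W, b):
    """Augment (W, b) into a single matrix so that [W b] @ [a; 1] = W @ a + b."""
    return np.hstack([W, b.reshape(-1, 1)])
```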
4 Heterogeneous Neuron Disturbing
In previous model fusion methods such as PFNM [5] and OTfusion [9], the authors implicitly assume that, for each layer, if the probability measure of the neurons in the local models is well approximated by that of the fused global model under a specific sampling scheme, then the performance of those local models is also maintained. However, whether this assumption holds for neural networks has not been investigated. In the following, we give a simple counter-example to the neuron approximation assumption and reveal that neurons from heterogeneous models disturb each other with a severity that depends on the data heterogeneity; the neural network architecture then has to be adjusted to make the above assumption hold.
4.1 Neuron Disturbing from Optimization Unbalance
We first consider a simple binary classification problem on a 2D simulated data set. As shown in Figure 1(a), the training and test samples are randomly drawn from the selected region. The region is bounded by $x_2 = -x_1 + 1$, $x_1 \in [-2, 0)$; $x_2 = x_1 + 1$, $x_1 \in [0, 2]$; $x_2 = -x_1 - 1$, $x_1 \in [-2, 0)$; and $x_2 = x_1 - 1$, $x_1 \in [0, 2]$. We define a decision boundary $C_l$ for data labelling, i.e., $x_2 = -x_1$, $x_1 \in$ […]