main discrepancy between the server domain and each client domain increases, the performance of the aggregated model on the client side decreases. 2) Client-to-Clients (C2C) discrepancy: there is a knowledge conflict among the client models. Because the teacher models are trained on different client domains, a sample from the server domain is likely to receive different predictions from different teachers, and these conflicting signals can severely interfere with one another. These two factors suggest that if we properly account for the similarity between the server domain and each client domain, and assign the most appropriate teacher models to each server-domain sample, the upper bound on the generalization error can be reduced.
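As background, the classical single-source bound from domain adaptation (Ben-David et al. 2010), on which analyses of this kind build, has the form
\[
\epsilon_T(h) \;\le\; \epsilon_S(h) \;+\; \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) \;+\; \lambda,
\qquad
\lambda = \min_{h' \in \mathcal{H}} \big[\epsilon_S(h') + \epsilon_T(h')\big],
\]
where \(\epsilon_S\) and \(\epsilon_T\) are the risks on the source and target domains, \(d_{\mathcal{H}\Delta\mathcal{H}}\) measures the divergence between the two domains, and \(\lambda\) is the risk of the best joint hypothesis. Intuitively, the S2C discrepancy plays the role of the divergence between the server domain and a client domain, while the C2C discrepancy has no single-teacher analogue and arises only when multiple teachers are combined; the exact multi-teacher statement we use is given in Section 3.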
Motivated by our analysis, we propose an adaptive knowledge aggregation method, Domain Discrepancy Aware Distillation, called FedD3A. During distillation, we assign each server-side sample an independent weight for every teacher, based on how similar that sample is to the corresponding client domain, which reduces the knowledge conflict between teacher models. In each round, each client extracts features of its local data with the backbone of the global model and then computes the subspace projection matrix of its local feature space. The server collects these projection matrices and measures similarity by the angles between server-sample features and each local feature subspace, without ever accessing client data or features (a minimal sketch of this computation is given after the contribution list below). Overall, our contributions are as follows:
• By analyzing the generalization error of the aggregated model in the client domains, we identify two possible reasons why distillation-based model aggregation degrades when there is a discrepancy between the unlabeled server data and the client domain data, namely the Server-to-Client (S2C) discrepancy and the Client-to-Clients (C2C) discrepancy.
• Motivated by this analysis, we propose FedD3A, an aggregation method based on domain discrepancy aware distillation, which further exploits the potential of abundant unlabeled data on the server.
• We validate our method with extensive experiments on several datasets. The results show that, compared with the baselines, our method achieves significant improvements in both the cross-silo and cross-device FL settings.
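To make the subspace-angle idea concrete, the following is a minimal PyTorch-style sketch. It assumes the local subspace basis is obtained from an SVD of the locally extracted features and that the per-sample teacher weight is the cosine between a server feature and each client subspace; the function names, the choice of rank, and the normalization are illustrative assumptions, not the exact construction used in FedD3A.

```python
import torch

def local_projection_matrix(features: torch.Tensor, rank: int) -> torch.Tensor:
    """Client side: projection matrix onto the local feature subspace.

    features: (n_samples, feat_dim) features extracted with the shared backbone.
    Returns the (feat_dim, feat_dim) projection onto the top-`rank` right
    singular subspace. Only this matrix is sent to the server.
    """
    _, _, Vh = torch.linalg.svd(features, full_matrices=False)
    U = Vh[:rank].T                      # (feat_dim, rank) orthonormal basis
    return U @ U.T                       # P = U U^T

def teacher_weights(server_feats: torch.Tensor, proj_mats) -> torch.Tensor:
    """Server side: per-sample weight for each client (teacher) model.

    The weight of client k for a sample x is the cosine of the angle between
    the feature of x and client k's subspace, i.e. ||P_k f|| / ||f||.
    """
    feat_norm = server_feats.norm(dim=1).clamp_min(1e-12)            # (batch,)
    cosines = torch.stack(
        [(server_feats @ P).norm(dim=1) / feat_norm for P in proj_mats], dim=1
    )                                                                # (batch, K)
    return cosines / cosines.sum(dim=1, keepdim=True)                # normalize per sample
```

Note that only the projection matrices leave the clients, which is consistent with the privacy property described above.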
2 Related Work
Federated Learning.
FedAvg (McMahan et al. 2017) introduces federated averaging. In each communication round, a group of clients is randomly selected, the current global model is sent to these clients for local training, and the locally trained models are collected by the server. The aggregated model is obtained by averaging the clients' model parameters. Several works (Li et al. 2019; Sahu et al. 2018) analyze the convergence of FedAvg and show that its performance degrades when the client datasets are non-iid. A large body of research addresses this non-iid problem. One line of work (Li et al. 2020a; Karimireddy et al. 2020; Reddi et al. 2020; Wang et al. 2020b,a; Singh and Jaggi 2020; Su, Li, and Xue 2022) attempts to improve the fitting ability of the global aggregated model. Another line (Dinh, Tran, and Nguyen 2020; Fallah, Mokhtari, and Ozdaglar 2020; Hanzely et al. 2020; Li et al. 2021) pursues personalized federated learning (pFL), in which clients may train different models with different parameters.
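As a reference point for the aggregation schemes discussed in this paper, a minimal sketch of the FedAvg parameter-averaging step might look as follows (the variable names are hypothetical, and dtype handling of integer buffers such as BN counters is glossed over):

```python
import copy

def fedavg_aggregate(client_states, client_sizes):
    """Parameter-level aggregation as in FedAvg (McMahan et al. 2017).

    client_states: list of state_dicts returned by the selected clients.
    client_sizes:  number of local training samples per client, used as weights.
    """
    total = float(sum(client_sizes))
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(
            (n / total) * state[key].float()
            for state, n in zip(client_states, client_sizes)
        )
    return global_state
```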
Knowledge Distillation.
Knowledge distillation (Hinton, Vinyals, and Dean 2015) is a knowledge transfer approach originally proposed for model compression. Typically, a larger model serves as the teacher, and its knowledge is transferred by training a smaller student model to match the teacher's outputs. KD techniques are mainly divided into logits-based distillation (Hinton, Vinyals, and Dean 2015; Li et al. 2017), feature-based distillation (Romero et al. 2014; Huang and Wang 2017; Yim et al. 2017), and relation-based distillation (Park et al. 2019; Liu et al. 2019; Tung and Mori 2019). Multi-teacher distillation has also been studied: Du et al. (2020) use multi-objective optimization to keep the student's gradient close to those of all teachers, while Shen, He, and Xue (2019) use adversarial learning to force the student to learn intermediate features similar to those of multiple teachers. Other studies (Yin et al. 2020; Lopes, Fenu, and Starner 2017; Chawla et al. 2021) consider data-free distillation, which replaces real data with pseudo samples generated from the teacher model.
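For concreteness, logits-based distillation typically minimizes a temperature-softened KL divergence between teacher and student predictions. The sketch below follows the standard formulation of Hinton, Vinyals, and Dean (2015); the function name and the default temperature are illustrative choices:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T: float = 2.0):
    """Logits-based distillation loss.

    The student matches the teacher's temperature-softened class distribution;
    the T**2 factor keeps gradient magnitudes comparable across temperatures.
    """
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T ** 2)
```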
Federated Learning with Knowledge Distillation.
There are several ways of applying KD in FL. 1) The first is to perform distillation on the clients (Yao et al. 2021; Wu et al. 2022; Zhu, Hong, and Zhou 2021), treating the global aggregated model as the teacher. 2) In (Gong et al. 2021, 2022; Sun and Lyu 2020), the server and all clients share a public unlabeled dataset, and the predictions of the different models on this dataset are exchanged among the parties to perform distillation; this is often used for personalized federated learning. 3) The third way is to use the client models directly as teachers. Lin et al. (2020) and Guha, Talwalkar, and Smith (2019) assume the server holds unlabeled data and use an ensemble of client models as the teacher, with the averaged teacher output defining the distillation loss. Sturluson et al. (2021) propose distilling from median-based scores instead of the averaged teacher logits. Data-free methods have also been applied on the server: Zhang et al. (2021) and Zhang and Yuan (2021) learn a generative model from the ensemble of client models. However, generating high-quality pseudo samples relies on the running mean and variance stored in the BN layers of the client models, which may leak private information (Yin et al. 2020, 2021). Data-free methods also incur a large computational cost, and there is no evidence that they apply to tasks beyond image classification. Nevertheless, data-free methods are orthogonal to FedD3A, and their pseudo samples can be used as our training data.
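FedD3A builds on the third family, server-side ensemble distillation. The sketch below shows one such step in the spirit of (Lin et al. 2020), with uniform averaging over teachers; FedD3A instead replaces the uniform average with the per-sample, discrepancy-aware teacher weights introduced in Section 3. Function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def ensemble_distill_step(student, teachers, unlabeled_batch, optimizer, T=2.0):
    """One server-side distillation step with the client models as teachers.

    The (frozen) client models' averaged soft predictions on a batch of
    unlabeled server data supervise the aggregated (student) model.
    """
    with torch.no_grad():
        teacher_probs = torch.stack(
            [F.softmax(t(unlabeled_batch) / T, dim=1) for t in teachers]
        ).mean(dim=0)                                    # uniform teacher average
    student_log_probs = F.log_softmax(student(unlabeled_batch) / T, dim=1)
    loss = F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (T ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```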
3 Proposed Method
Notations and Analysis
Following the domain adaptation literature (Ben-David et al. 2010), in our analysis we consider the binary classifica-