VFLens: Co-design the Modeling Process for Efficient Vertical
Federated Learning via Visualization
Yun Tian
School of Information Science and
Technology, ShanghaiTech University
Shanghai, China
tianyun@shanghaitech.edu.cn
He Wang
School of Information Science and
Technology, ShanghaiTech University
Shanghai, China
wanghe1@shanghaitech.edu.cn
Laixin Xie
School of Information Science and
Technology, ShanghaiTech University
Shanghai, China
xielx@shanghaitech.edu.cn
Xiaojuan Ma
Department of Computer Science and
Engineering, The Hong Kong
University of Science and Technology
Hong Kong, China
mxj@cse.ust.hk
Quan Li
School of Information Science and
Technology, ShanghaiTech University
Shanghai, China
liquan@shanghaitech.edu.cn
ABSTRACT
As a decentralized training approach, federated learning enables
multiple organizations to jointly train a model without exposing
their private data. This work investigates vertical federated learn-
ing (VFL) to address scenarios where collaborating organizations
have the same set of users but with different features, and only one
party holds the labels. While VFL shows good performance, prac-
titioners often face uncertainty when preparing non-transparent,
internal/external features and samples for the VFL training phase.
Moreover, to balance prediction accuracy against the resource consumption
of model inference, practitioners need to know which
subset of prediction instances genuinely requires invoking the VFL
model for inference. To this end, we co-design the VFL modeling
process by proposing an interactive real-time visualization system,
VFLens, to help practitioners with feature engineering, sample se-
lection, and inference. A usage scenario, a quantitative experiment,
and expert feedback suggest that VFLens helps practitioners boost
VFL efficiency at a lower cost with sufficient confidence.
CCS CONCEPTS
• Human-centered computing → Visualization; Human computer interaction (HCI); Interaction design.
KEYWORDS
Federated Learning, Visual Analytics, Feature Interpretation, Sam-
ple Selection
The corresponding author.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
Chinese CHI 2022, October 22–23, 2022, Guangzhou, China and Online, China
©2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9869-5/22/10. . . $15.00
https://doi.org/10.1145/3565698.3565765
ACM Reference Format:
Yun Tian, He Wang, Laixin Xie, Xiaojuan Ma, and Quan Li. 2022. VFLens:
Co-design the Modeling Process for Efficient Vertical Federated Learning via
Visualization. In The Tenth International Symposium of Chinese CHI (Chinese
CHI 2022), October 22–23, 2022, Guangzhou, China and Online, China. ACM,
New York, NY, USA, 15 pages. https://doi.org/10.1145/3565698.3565765
1 INTRODUCTION
There is hope that all industries will benefit from big data-driven
artificial intelligence (AI), especially after the huge success of AlphaGo.
However, with the exception of a few industries, most fields
lack sufficient data, or data of sufficient quality, to support the construction
of a reliable and robust machine learning (ML) model. At
the same time, companies are reluctant to share or aggregate their
valuable data in a centralized manner due to industry competition
and privacy and security concerns, leaving the data often existing
as a set of isolated data silos. As a viable decentralized solution that
can potentially break down barriers between data sources while
preserving privacy and security, federated learning (FL) enables
users to collaboratively learn an ML model while keeping all data
that may contain private information on their local device [5, 52].
Depending on how the data is partitioned between parties and
on the application scenario, FL can be divided into two main categories,
namely horizontal FL (HFL) and vertical FL (VFL) [52]. The focus
of this study is VFL, also known as feature-based FL. VFL can be
applied to situations where two datasets have considerable overlap
in sample IDs but differ in feature space [11]. A typical example
of VFL is a collaboration between an e-commerce retail company
and a financial institution in the same city. Their customer sets
may each contain the majority of residents in the area; therefore, the
intersection of their customer spaces is huge. However, since the
financial institution records its customers' income, spending behavior,
and credit rating, while the e-commerce retailer retains its
customers' browsing and purchasing history, their feature spaces
are quite different. In this case, VFL allows both parties to train
a joint ML model for product purchase prediction based on customer
and product information under privacy-preserving security
conditions (Figure 1).
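The vertical partitioning described above can be illustrated with a small sketch. Everything in it is hypothetical: the party names, feature names, and values are invented for illustration, and the plain set intersection stands in for the privacy-preserving ID alignment that a real VFL system would use (raw features are never pooled in one place in practice).

```python
# Toy illustration of vertically partitioned data.
# All names and values are hypothetical; a real VFL deployment aligns IDs with a
# privacy-preserving protocol instead of the plain set intersection used here.

# Financial institution: holds its own features AND the label.
bank = {
    1: {"income": 5200, "credit_rating": 0.81, "label": 1},
    2: {"income": 3100, "credit_rating": 0.55, "label": 0},
    3: {"income": 7800, "credit_rating": 0.92, "label": 1},
    4: {"income": 2600, "credit_rating": 0.40, "label": 0},
    5: {"income": 4400, "credit_rating": 0.67, "label": 1},
}

# E-commerce retailer: holds different features and no label.
shop = {
    1: {"pages_viewed": 34, "purchases": 3},
    2: {"pages_viewed": 12, "purchases": 0},
    3: {"pages_viewed": 58, "purchases": 7},
    4: {"pages_viewed": 5, "purchases": 1},
    6: {"pages_viewed": 21, "purchases": 2},
}


def overlapping_ids(party_a, party_b):
    """Return the sorted sample IDs that both parties hold."""
    return sorted(party_a.keys() & party_b.keys())


def feature_names(party, exclude=("label",)):
    """Return the sorted feature names of a party, skipping excluded keys."""
    names = set()
    for record in party.values():
        names.update(key for key in record if key not in exclude)
    return sorted(names)


shared_ids = overlapping_ids(bank, shop)
bank_features = feature_names(bank)
shop_features = feature_names(shop)
common_features = sorted(set(bank_features) & set(shop_features))

print("overlapping sample IDs:", shared_ids)
print("bank features:", bank_features)
print("shop features:", shop_features)
print("features held by both parties:", common_features)
```

In this toy setting the two parties share most sample IDs but no features, which is the situation VFL targets; only the overlapping IDs would be usable for joint training.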
arXiv:2210.00472v1 [cs.HC] 2 Oct 2022
Figure 1: Suppose two parties, i.e., a local financial institution (a) and an e-commerce retail company (b), want to co-build an ML
model for product purchase prediction. Only the financial institution has the label Y: loan or not, and neither party wants to
expose its features X. The two parties have overlapping sample IDs, i.e., user id: 1–4. The goal is to build a joint model
that protects privacy and performs better than a model trained on either party's data alone.
Although VFL has shown good performance in scenarios such
as financial risk management [13, 51, 60], healthcare [44], and
e-commerce ad recommendation [50], real-world practitioners have
encountered the following challenges when trying to use VFL in
their application domains [24]:
1) Uncertainty in sample selection for training.
Traditional VFL practitioners mainly use two
methods to prepare for the VFL training phase. First, when little
labeled data is available, they may use all the overlapping
labeled samples to train a joint VFL model for convenience.
However, it is well known that FL trains much more slowly
than a local model because of its data encryption
and communication mechanisms, so using
all overlapping labeled samples for VFL model training can
lead to much longer training time. Second, when a large amount of
data is available for training, they may select some labeled
samples for training and evaluate the model with general
metrics such as accuracy, loss, and mAP. The classical
ML adage “garbage in, garbage out” [20] indicates
that the samples used for prediction should be a high-quality
match for the specific jointly trained VFL model. Although the
role of interactive data iteration in ML is emphasized and domain
experts acknowledge that “data samples need to have appropriate
signals for the model to be useful” [20], there is little support for
fine-tuning training data samples in VFL scenarios. Both
practices rely heavily on feedback from VFL model performance for
further evaluation, which is sometimes too time-consuming and
expensive. Things get worse when communication
during VFL training is unstable, as domain experts
have to rerun the model training phase repeatedly by trial
and error. Therefore, an intuitive interaction mechanism for evolving
sample data, one that allows domain experts to compare the data
characteristics and performance of different training sample sets
in a VFL scenario, is necessary.
2) Non-transparent feature selection and assessment.
Successful ML applications require an
iterative process to create models that provide the desired per-
formance. One of the key processes involves feature engineering
and in this study, we focus on feature selection and assessment.
However, unlike traditional centralized ML modeling or HFL in
which all data features are available and easy to assess, in VFL,
participants update only their internal feature parameters during
training, and external features from other parties are not visible to
them due to the design of privacy-preserving mechanisms, which
poses unique challenges for internal/external feature selection and
assessment. That is, in addition to selecting the necessary internal
features or transforming the original internal features into other
powerful alternatives, practitioners are exploring how to assess
external features from other parties [39, 45]. However, the lack of
comprehensive consideration of the contribution of internal and
external features while protecting privacy still undermines the use
of VFL in production.
3) Costly and time-consuming inference.
The inference phase of VFL modeling requires online coordination
between two (or more) parties to accomplish the inference task,
which inevitably poses a challenge to computational resources and
raises costs. According to our collaborating domain experts, the
cost and deployment efficiency of federated modeling are issues
that require rational planning for practical applications, and the
use of homomorphic encryption in VFL can lead to a significant
reduction in the computational speed and information transfer speed
of federated modeling compared to centralized ML modeling [22].
To solve this problem, in addition to optimizing the computational
modeling process, another intuitive approach is to reduce the over-
all data volume. That is, not all samples to be predicted need to be
truly predicted with the help of external features from other parties.
For example, samples whose labels practitioners can already predict with relatively
high confidence (e.g., because their features are clearly poor)
need not be sent to the online VFL model
and can be safely skipped. Thus, a way to
visually help domain experts distinguish samples with different
levels of confidence in their labels is a desirable capability for real-world
VFL deployments.
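The idea of invoking the VFL model only for uncertain instances can be sketched as a simple confidence gate. This is a minimal illustration under stated assumptions, not the selection procedure used by VFLens (which relies on visual inspection by domain experts): it assumes that the label-holding party has a local model that outputs a probability per instance, and the function name `route_predictions` and the thresholds `low` and `high` are hypothetical.

```python
# Minimal sketch of confidence-gated inference (hypothetical names and thresholds).
# Assumption: the label-holding party has a local model producing a probability
# for each instance; only uncertain instances are forwarded to the costly VFL model.


def route_predictions(local_probs, low=0.1, high=0.9):
    """Split instances into locally decided ones and ones that need the VFL model.

    local_probs maps an instance ID to the local model's probability of the
    positive class. Instances with probability >= high are labeled 1 locally,
    those with probability <= low are labeled 0 locally, and the rest are
    returned as needing joint (VFL) inference.
    """
    local_decisions = {}
    needs_vfl = []
    for instance_id, prob in local_probs.items():
        if prob >= high:
            local_decisions[instance_id] = 1
        elif prob <= low:
            local_decisions[instance_id] = 0
        else:
            needs_vfl.append(instance_id)
    return local_decisions, needs_vfl


local_probs = {"u1": 0.97, "u2": 0.55, "u3": 0.03, "u4": 0.40}
local_decisions, needs_vfl = route_predictions(local_probs)

print("decided locally:", local_decisions)
print("forwarded to the VFL model:", needs_vfl)
```

Tightening the thresholds sends more instances to the VFL model (higher cost, potentially higher accuracy); loosening them does the opposite, which is the accuracy versus resource trade-off discussed above.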
In this study, we co-design the modeling process to help VFL
practitioners improve the efficiency of VFL modeling from the
perspective of visualization. We first conduct an observational study
of the current practices of collaborating domain experts to identify
their main needs and concerns regarding VFL applications. Then,
we streamline the analysis pipeline of feature and sample spaces and
propose an interactive visualization system called VFLens. VFLens
helps domain experts participate interactively in feature selection,
assessment, and sample data iteration before the VFL
model training phase, in feature interpretation after the VFL model
training phase, and in data sample selection during the VFL model
inference phase. A case study and expert feedback confirm the
efficacy of VFLens. Our main contributions are summarized below.
• We describe the problem in the VFL context from the perspective
of feature and sample space through an observational
study and in-depth discussions of design requirements with
VFL domain experts.
• We co-design the VFL modeling process to support domain
experts in participating interactively in the data iteration, feature
selection and assessment, and sample prediction processes.
To the best of our knowledge, VFLens is the first such
effort in the VFL scenario.
• We evaluate VFLens through a usage scenario, a quantitative
experiment, and expert interviews.
2 RELATED WORK
The literature that overlaps with this work can be categorized into four
groups, namely, federated learning, visualizations for federated learning,
feature selection and assessment, and sample selection in machine
learning.
2.1 Federated Learning
Federated learning was first proposed by Google; it avoids
transmitting data by distributing model training to each
mobile terminal [5]. Later, they released the first commercial FL
application, GBoard [17], which uses a recurrent neural language
model to predict the next word in a keyboard application. GBoard
lets each mobile device train its copy of the same distributed
ML model on local data. The global model can be
updated by averaging the model parameters collected over all local
models. Along the same lines, many studies have reshaped
different ML models into a federated framework, including decision
trees [31, 58], linear/logistic regression [32, 36], and neural
networks [46, 56]. These works are categorized as HFL because
the clients share the same feature space but differ in the sample
space. Unlike HFL, VFL is applicable to scenarios where we have
many overlapping instances but few overlapping features [51]. For
example, an insurance company and an online retailer in a local
city have many overlapping users, but each has its own feature
space. VFL “merges” features and uses homomorphic encryption to
protect the data privacy of the participating parties, and requires
a more sophisticated mechanism to decompose the loss function
of each party. This study focuses on VFL, which “virtually aggregates”
different features to compute training losses and gradients in
a privacy-preserving manner and jointly builds an ML model [11]
with data from both parties.
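The parameter averaging mentioned above for the horizontal setting can be sketched in a few lines. This is an illustrative sketch only: the function name `average_parameters` is hypothetical, parameters are represented as flat lists of floats, and the optional per-client weights (for example, local sample counts) are an assumption that the text above does not specify. It does not cover VFL, where parties hold different features and exchange encrypted intermediate results instead.

```python
# Illustrative sketch of averaging local model parameters into a global model.
# Hypothetical helper; weighting clients (e.g., by sample count) is an assumption.


def average_parameters(client_params, client_weights=None):
    """Return the (weighted) element-wise mean of the clients' parameter vectors."""
    if not client_params:
        raise ValueError("at least one client is required")
    dim = len(client_params[0])
    if any(len(params) != dim for params in client_params):
        raise ValueError("all clients must report the same number of parameters")
    if client_weights is None:
        client_weights = [1.0] * len(client_params)  # plain unweighted average
    if len(client_weights) != len(client_params):
        raise ValueError("one weight per client is required")
    total = sum(client_weights)
    return [
        sum(w * params[i] for w, params in zip(client_weights, client_params)) / total
        for i in range(dim)
    ]


clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
global_params = average_parameters(clients)
weighted_params = average_parameters(clients, client_weights=[1, 1, 2])

print("unweighted global parameters:", global_params)
print("weighted global parameters:", weighted_params)
```

In each communication round, the server would send the averaged parameters back to the clients, which continue training locally from that shared starting point.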
2.2 Visualizations for Federated Learning
Researchers from academia and industry are using visualizations to
demonstrate, explain, and monitor the process of federated learning.
For example, in industry, Lenovo has simulated the industrial
revolution in factories by demonstrating the process of horizontal
federated learning to predict the internal pressure of hardware [38].
Similarly, Cloudera Fast Forward Labs released an interactive
simulation prototype, Turbofan Tycoon, which takes advantage of
visualization to examine the federated model and predict when a
turbofan will fail [35]. FATEBoard¹ utilizes dashboard visualizations
to display modeling logs, metrics, and evaluation results, including
information on data sets, job status, computational plots, and model
output [12]. While FATEBoard can help domain experts understand
the ranking of features and the performance of models, it does
not support detailed and interactive inspection of the sample and
feature spaces. In academia, Wei et al. [47] developed
a game to demonstrate the superiority of HFL and built a
visualization prototype to help understand the operation of HFL.
However, this work assumes that client-side data can be seen
by the server side. Li et al. [30] proposed HFLens, which strictly
follows a data privacy-preserving design and supports comparative
visual interpretation at the overview, communication round, and
client instance levels. HFLens facilitates the investigation of the
overall HFL process involving all clients, the correlation analysis
of client information in one or different communication rounds,
the identification of potential anomalies, and the evaluation of the
contribution of each HFL client. However, the pain point for VFL
is not anomaly detection, as it is for HFLens, because VFL
generally involves fewer data collaborators than HFL, and the
collaborators are partners with common interests. In this work, we
do not focus on the operational process of FL, but rather improve
the efficiency of VFL modeling by involving domain experts in the
sample and feature space.
2.3 Feature Selection and Assessment
There is a large amount of existing work related to feature selection
[4, 6], which has two main difficulties. First, a large number
of features are used in the process of building machine learning
models; however, if several features are linearly correlated with
each other, many of them will be redundant, which adds additional
computational effort and leads to more complex parameters. Second,
common feature analysis methods use feature correlation metrics,
but correlation metrics cannot measure nonlinear relationships.
Isabelle et al. [15] performed a survey of automatic feature selection
methods. The authors abstracted the core problem of feature
selection, which is to find a minimal subset of features from a large
number of features. The authors also argued that there are many
options for feature selection and that there is no one universal and
unique solution. There are other types of feature selection methods,
such as wrappers [25], which iteratively eliminate features by
regression or classification models to find the ideal subset of features.
There are also metric-based methods [2, 14], where users pick the
¹ https://fate.fedai.org/