VFLens: Co-design the Modeling Process for Efficient Vertical Federated Learning via Visualization

Yun Tian

School of Information Science and

Technology, ShanghaiTech University

Shanghai, China

tianyun@shanghaitech.edu.cn

He Wang

School of Information Science and

Technology, ShanghaiTech University

Shanghai, China

wanghe1@shanghaitech.edu.cn

Laixin Xie

School of Information Science and

Technology, ShanghaiTech University

Shanghai, China

xielx@shanghaitech.edu.cn

Xiaojuan Ma

Department of Computer Science and

Engineering, The Hong Kong

University of Science and Technology

Hong Kong, China

mxj@cse.ust.hk

Quan Li∗

School of Information Science and

Technology, ShanghaiTech University

Shanghai, China

liquan@shanghaitech.edu.cn

ABSTRACT

As a decentralized training approach, federated learning enables multiple organizations to jointly train a model without exposing their private data. This work investigates vertical federated learning (VFL) to address scenarios where collaborating organizations have the same set of users but different features, and only one party holds the labels. While VFL shows good performance, practitioners often face uncertainty when preparing non-transparent internal/external features and samples for the VFL training phase. Moreover, to balance prediction accuracy against the resource consumption of model inference, practitioners need to know which subset of prediction instances genuinely requires invoking the VFL model for inference. To this end, we co-design the VFL modeling process by proposing an interactive real-time visualization system, VFLens, to help practitioners with feature engineering, sample selection, and inference. A usage scenario, a quantitative experiment, and expert feedback suggest that VFLens helps practitioners boost VFL efficiency at a lower cost with sufficient confidence.

CCS CONCEPTS

• Human-centered computing → Visualization; Human computer interaction (HCI); Interaction design.

KEYWORDS

Federated Learning, Visual Analytics, Feature Interpretation, Sample Selection

∗The corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Chinese CHI 2022, October 22–23, 2022, Guangzhou, China and Online, China

ACM ISBN 978-1-4503-9869-5/22/10...$15.00

https://doi.org/10.1145/3565698.3565765

ACM Reference Format:

Yun Tian, He Wang, Laixin Xie, Xiaojuan Ma, and Quan Li. 2022. VFLens: Co-design the Modeling Process for Efficient Vertical Federated Learning via Visualization. In The Tenth International Symposium of Chinese CHI (Chinese CHI 2022), October 22–23, 2022, Guangzhou, China and Online, China. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3565698.3565765

1 INTRODUCTION

There is hope that all industries will benefit from big data-driven artificial intelligence (AI), especially after the huge success of AlphaGo. However, with the exception of a few industries, most fields lack sufficient data, or have data quality issues, that prevent the construction of a reliable and robust machine learning (ML) model. At the same time, companies are reluctant to share or aggregate their valuable data in a centralized manner due to industry competition and privacy and security concerns, so the data often exists as a set of isolated data silos. As a viable decentralized solution that can potentially break down barriers between data sources while preserving privacy and security, federated learning (FL) enables users to collaboratively learn an ML model while keeping all data that may contain private information on their local devices.

Depending on how the data is partitioned between parties and on the application scenario, FL can be divided into two main categories, namely horizontal FL (HFL) and vertical FL (VFL). The focus of this study is VFL, also known as feature-based FL. VFL can be applied to situations where two datasets have considerable overlap in sample IDs but differ in feature space. A typical example of VFL is a collaboration between an e-commerce retail company and a financial institution in the same city. Their customer sets may contain the majority of residents in the area; therefore, the intersection of their customer spaces is huge. However, since the financial institution records its customers' income, spending behavior, and credit rating, while the e-commerce retailer retains its customers' browsing and purchasing history, their feature spaces are quite different. In this case, VFL allows both parties to train a joint ML model for product purchase prediction based on customer and product information under privacy-preserving security conditions (Figure 1).
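The data partition described above (shared sample IDs, disjoint feature spaces) can be illustrated with a minimal sketch. The user IDs, feature names, and values below are hypothetical toy data, and the plain set intersection is only a stand-in: a real VFL deployment would typically align IDs with a privacy-preserving protocol rather than by exchanging them in the clear.

```python
# Hypothetical toy data: each party keys its private feature table by user ID.
bank = {            # financial institution (in the example, also the label holder)
    "u1": {"income": 5200, "credit_rating": 0.81},
    "u2": {"income": 3100, "credit_rating": 0.55},
    "u3": {"income": 7400, "credit_rating": 0.92},
}
retailer = {        # e-commerce retail company
    "u2": {"items_viewed": 40, "orders": 3},
    "u3": {"items_viewed": 12, "orders": 1},
    "u4": {"items_viewed": 77, "orders": 9},
}


def overlapping_ids(party_a, party_b):
    """Return the sorted user IDs present in both parties' tables."""
    return sorted(party_a.keys() & party_b.keys())


def feature_names(party):
    """Collect the set of feature names used by one party."""
    return {name for row in party.values() for name in row}


shared = overlapping_ids(bank, retailer)
print(shared)                                           # ['u2', 'u3']
print(feature_names(bank) & feature_names(retailer))    # set()
```

Only the overlapping users (`u2`, `u3` here) can take part in joint training, while each party keeps contributing its own, non-overlapping feature columns.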

arXiv:2210.00472v1 [cs.HC] 2 Oct 2022

Figure 1: Suppose two parties, i.e., a local financial institution (a) and an e-commerce retail company (b), want to co-build an ML model for product purchase prediction. Only the financial institution has the label Y (loan or not), and neither party wants to expose its features X. The two parties have overlapping sample IDs, i.e., user IDs. The goal is to establish a joint model under privacy-preserving conditions whose performance is better than that of a model built on either party's data alone.

Although VFL has shown good performance in scenarios such as financial risk management, healthcare, and e-commerce ad recommendation, real-world practitioners have encountered the following challenges when trying to use VFL in their application domains.

1) Uncertainty in sample selection for training. Traditional VFL practitioners mainly use two methods to prepare for the VFL training phase. First, when little labeled data is available, they may use all the overlapping labeled samples to train a joint VFL model for convenience. However, it is well known that FL trains much more slowly than a local model due to the design of its data encryption and communication mechanisms. In some cases, using all overlapping labeled samples for VFL model training can lead to much longer training times. Second, when the amount of data available for training is large, they may select some labeled data samples for training and evaluate the model with general metrics such as accuracy, loss, and mAP. The classical ML saying “garbage in, garbage out” indicates that the samples used for prediction should be a high-quality match for their specific jointly trained VFL model. Although the role of interactive data iteration in ML is emphasized and domain experts acknowledge that “data samples need to have appropriate signals for the model to be useful”, there is little support for fine-tuning training data samples in VFL scenarios. Both practices rely heavily on feedback from VFL model performance for further evaluation, which is sometimes too time-consuming and expensive. Things get worse when communication during VFL training is unstable, as domain experts have to rerun the model training phase repeatedly by trial and error. Therefore, an intuitive interaction mechanism for evolving sample data, one that allows domain experts to compare the data characteristics and performance of different training sample sets in a VFL scenario, is necessary.

2) Non-transparent feature selection and assessment. Successful ML applications require an iterative process to create models that provide the desired performance. One of the key processes is feature engineering, and in this study we focus on feature selection and assessment. Unlike traditional centralized ML modeling or HFL, in which all data features are available and easy to assess, in VFL participants update only their internal feature parameters during training, and external features from other parties are not visible to them due to the design of privacy-preserving mechanisms. This poses unique challenges for internal/external feature selection and assessment. That is, in addition to selecting the necessary internal features or transforming the original internal features into more powerful alternatives, practitioners are exploring how to assess external features from other parties. However, the lack of a comprehensive way to consider the contribution of internal and external features while protecting privacy still undermines the use of VFL in production.

3) Costly and time-consuming inference. The inference phase of VFL modeling requires online coordination between two (or more) parties to accomplish the inference task, which inevitably strains computational resources and raises costs. According to our collaborating domain experts, the cost and deployment efficiency of federated modeling require rational planning in practical applications, and the use of homomorphic encryption in VFL can significantly reduce the computational speed and information transfer speed of federated modeling compared to centralized ML modeling. To solve this problem, in addition to optimizing the computational modeling process, another intuitive approach is to reduce the overall data volume. That is, not all samples to be predicted truly need external features from other parties. For example, samples whose labels practitioners are already relatively confident about, e.g., because the sample features are poor and the samples can be safely ignored, do not need to be predicted by invoking an online trained VFL model. Thus, visually helping domain experts distinguish samples with different confidence in their labels is a desirable capability for real-world VFL deployments.
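The idea of invoking the VFL model only for uncertain instances can be sketched as a simple routing rule. This is an illustration under stated assumptions, not the mechanism VFLens implements (which supports this decision visually): the function name `split_by_confidence`, the use of a local model's predicted probability as the confidence signal, and the cutoffs `low` and `high` are all hypothetical.

```python
def split_by_confidence(local_probs, low=0.1, high=0.9):
    """Partition instance IDs by a local model's predicted probability.

    local_probs maps instance ID -> probability of the positive class from a
    model trained only on internal features. low/high are hypothetical cutoffs.
    """
    local_only, needs_vfl = [], []
    for instance_id, p in local_probs.items():
        if p <= low or p >= high:
            local_only.append(instance_id)   # confident: keep the local prediction
        else:
            needs_vfl.append(instance_id)    # uncertain: invoke the joint VFL model
    return local_only, needs_vfl


probs = {"a": 0.03, "b": 0.48, "c": 0.95, "d": 0.71}
local_only, needs_vfl = split_by_confidence(probs)
print(local_only)  # ['a', 'c']
print(needs_vfl)   # ['b', 'd']
```

Widening the band between `low` and `high` sends more instances to the costly joint model; narrowing it saves resources at the risk of more local mispredictions.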

In this study, we co-design the modeling process to help VFL practitioners improve the efficiency of VFL modeling from the perspective of visualization. We first conduct an observational study of the current practices of collaborating domain experts to identify their main needs and concerns regarding VFL applications. Then, we streamline the analysis pipeline of feature and sample spaces and propose an interactive visualization system called VFLens. VFLens helps domain experts interactively participate in feature selection, assessment, and sample data iteration before the VFL model training phase, in feature interpretation after the VFL model training phase, and in data sample selection during the VFL model inference phase. A case study and expert feedback confirm the efficacy of VFLens. Our main contributions are summarized below.

• We describe the problem in the VFL context from the perspective of feature and sample space through an observational study and in-depth discussions of design requirements with VFL domain experts.

• We co-design the VFL modeling process to support domain experts in interactively participating in the data iteration, feature selection and assessment, and sample prediction processes. To the best of our knowledge, VFLens is the first such effort in the VFL scenario.

• We evaluate VFLens through a usage scenario, a quantitative experiment, and expert interviews.

2 RELATED WORK

The literature that overlaps with this work can be categorized into four groups, namely, federated learning, visualizations for federated learning, feature selection and assessment, and sample selection in machine learning.

2.1 Federated Learning

Federated learning was first proposed by Google; it prevents data from being transmitted by distributing model training to each mobile terminal. Later, Google released the first commercial FL application, GBoard, which uses a recurrent neural language model to predict the next word in a keyboard application. GBoard allows each local mobile device to train the same distributed ML model using local data. The global model can be updated by averaging the model parameters collected over all local models. Along the same lines, many studies have reshaped different ML models into a federated framework, including decision trees, linear/logistic regression, and neural networks. These works are categorized as HFL because the clients share the same feature space but differ in the sample space. Unlike HFL, VFL is applicable to scenarios where there are many overlapping instances but few overlapping features. For example, an insurance company and an online retailer in a local city have many overlapping users, but each has its own feature space. VFL “merges” features and uses homomorphic encryption to protect the data privacy of the participating parties, and requires a more sophisticated mechanism to decompose the loss function of each party. This study focuses on VFL, which “virtually aggregates” different features to compute training losses and gradients in a privacy-preserving manner and jointly builds an ML model with data from both parties.
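To make the "virtual aggregation" of features concrete, the following sketch shows one way a logistic-regression loss can be decomposed across two parties. It is a simplified illustration, not the protocol used in this paper: homomorphic encryption and any coordinating party are deliberately omitted, and all feature values, weights, and names are hypothetical.

```python
import math


def partial_score(weights, features):
    """One party's contribution to the logit, using only its own features."""
    return sum(w * x for w, x in zip(weights, features))


def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical sample: party A holds two features and the label, party B one.
x_a, w_a = [1.0, 2.0], [0.5, -0.25]
x_b, w_b = [4.0], [0.1]
label = 1.0

# Each party computes its partial score locally; only scores are combined,
# so raw features never leave their owner. (In a real system the exchanged
# values would be encrypted; they are plain floats here.)
z = partial_score(w_a, x_a) + partial_score(w_b, x_b)
prediction = sigmoid(z)

# The label holder forms the residual of the cross-entropy loss; each party's
# gradient then needs only that residual and its own features.
residual = prediction - label
grad_a = [residual * x for x in x_a]
grad_b = [residual * x for x in x_b]
```

The point of the decomposition is that the logit is a sum of per-party terms, so the joint gradient splits into blocks that each party can compute from a shared residual and its private feature columns.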

2.2 Visualizations for Federated Learning

Researchers from academia and industry are using visualizations to demonstrate, explain, and monitor the process of federated learning. For example, in industry, Lenovo has simulated the industrial revolution in factories by demonstrating the process of horizontal federated learning to predict the internal pressure of hardware. Similarly, Cloudera Fast Forward Labs released an interactive simulation prototype, Turbofan Tycoon, which takes advantage of visualization to examine the federated model and predict when a turbofan will fail. FATEBoard¹ utilizes dashboard visualizations to display modeling logs, metrics, and evaluation results, including information on data sets, job status, computational plots, and model output. While FATEBoard can help domain experts understand the ranking of features and the performance of models, it does not support detailed and interactive inspection of the sample and feature spaces. In academia, Wei et al. developed a game to demonstrate the superiority of HFL and built a visualization prototype to help understand the operation of HFL. However, this work assumes that client-side data can be seen by the server side. Li et al. proposed HFLens, which strictly follows a data privacy-preserving design and supports comparative visual interpretation at the overview, communication round, and client instance levels. HFLens facilitates the investigation of the overall HFL process involving all clients, the correlation analysis of client information in one or different communication rounds, the identification of potential anomalies, and the evaluation of the contribution of each HFL client. However, the pain point for VFL is not anomaly detection, as it is for HFLens, because VFL generally involves fewer data collaborators than HFL, and the collaborators are partners with common interests. In this work, we do not focus on the operational process of FL, but rather improve the efficiency of VFL modeling by involving domain experts in the sample and feature space.

2.3 Feature Selection and Assessment

There is a large amount of existing work related to feature selection, which faces two main difficulties. First, a large number of features are used in the process of building machine learning models; however, if several features are linearly correlated with each other, many of them will be redundant, which adds computational effort and leads to more complex parameters. Second, common feature analysis methods use feature correlation metrics, but correlation metrics cannot measure nonlinear relationships. Isabelle et al. performed a survey of automatic feature selection methods. The authors abstracted the core problem of feature selection, which is to find a minimal subset of features from a large number of features. The authors also argued that there are many options for feature selection and that there is no single universal solution. There are other types of feature selection methods, such as wrappers, which iteratively eliminate features using regression or classification models to find the ideal subset of features. There are also metric-based methods, where users pick the

1 https://fate.fedai.org/
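The two difficulties named in Section 2.3 (linearly correlated features are redundant, and correlation metrics miss nonlinear relationships) can be illustrated with a small sketch. It assumes the Pearson coefficient as the correlation metric, which the text does not specify, and uses hypothetical toy data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear = [2 * v + 1 for v in x]     # redundant: a linear function of x
quadratic = [v * v for v in x]      # fully determined by x, but nonlinearly

print(pearson(x, linear))      # 1.0
print(pearson(x, quadratic))   # 0.0
```

The linear copy is flagged as perfectly correlated and could be dropped as redundant, whereas the quadratic feature scores zero even though it carries no information beyond `x`, which is why a purely correlation-based screen can be misleading.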
