Outsourcing Training without Uploading Data
via Efficient Collaborative Open-Source Sampling
Junyuan Hong
Michigan State University
hongju12@msu.edu
Lingjuan Lyu
Sony AI
lingjuan.lv@sony.com
Jiayu Zhou
Michigan State University
jiayuz@msu.edu
Michael Spranger
Sony AI
michael.spranger@sony.com
Abstract
As deep learning blooms with growing demand for computation and data resources, outsourcing model training to a powerful cloud server becomes an attractive alternative to training at a low-power and cost-effective end device. Traditional outsourcing requires uploading device data to the cloud server, which can be infeasible in many real-world applications due to the often sensitive nature of the collected data and the limited communication bandwidth. To tackle these challenges, we propose to leverage widely available open-source data, which is a massive dataset collected from public and heterogeneous sources (e.g., Internet images). We develop a novel strategy called Efficient Collaborative Open-source Sampling (ECOS) to construct a proximal proxy dataset from open-source data for cloud training, in lieu of client data. ECOS probes open-source data on the cloud server to sense the distribution of client data via a communication- and computation-efficient sampling process, which only communicates a few compressed public features and client scalar responses. Extensive empirical studies show that the proposed ECOS improves the quality of automated client labeling, model compression, and label outsourcing when applied in various learning scenarios.
1 Introduction
Nowadays, powerful machine learning services are essential in many devices that support our daily routines. Delivering such services is typically done through client devices that are power-efficient and thus very restricted in computing capacity. The client devices can collect data through built-in sensors and make predictions with machine learning models. However, their stringent computing power often makes local training prohibitive, especially for high-capacity deep models. One widely adopted solution is to outsource the cumbersome training to cloud servers equipped with massive computational power, using machine-learning-as-a-service (MLaaS). Amazon SageMaker [29], Google ML Engine [6], and Microsoft Azure ML Studio [4] are among the most successful industrial adoptions, where users upload training data to designated cloud storage, and the optimized machine learning engines then handle the training. One major challenge of the outsourcing solution in many applications is that the local data collected are sensitive and protected by regulations, therefore prohibiting data sharing. Notable examples include the General Data Protection Regulation (GDPR) [1] and the Health Insurance Portability and Accountability Act (HIPAA) [2].
Work done during internship at Sony AI. Corresponding to: Lingjuan Lyu.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.12575v1 [cs.LG] 23 Oct 2022
Figure 1: Illustration of the proposed ECOS framework. Instead of uploading local data for cloud training, ECOS downloads the centroids of clustered open-source features to efficiently sense the client distribution, where the client counts the local neighbor samples of each centroid as its coverage score. Based on the scores of the centroids, the server adaptively samples proximal and diverse data for training a transferable model on the cloud.
On the other hand, recent years witnessed a surging amount of general-purpose and massive datasets authorized for public use, such as ImageNet [15], CelebA [31], and MIMIC [26]. Moreover, many task-specific datasets used by local clients can be well considered as biased subsets of these large public datasets [41, 28]. Therefore, the availability of these datasets allows us to use them to model confidential local data, facilitating training outsourcing without directly sharing the local data. One approach is to use the private client dataset to craft pseudo labels for a public dataset in a confidential manner [56, 40], assuming that the public and local data are independently and identically distributed (iid). In addition, Alon et al. showed that iid public data can strongly supplement client learning, which greatly reduces the private sample complexity [3]. However, the iid assumption can often be too strong for general-purpose open-source datasets, since they are usually collected from heterogeneous sources with distributional biases from varying environments. For example, an online search for 'digits' yields digit images ranging from handwriting scans and photos to artwork.
In this paper, we relax the iid assumption in training outsourcing and instead consider the availability of an open-source dataset. We first study the gap between iid data and heterogeneous open-source data in training outsourcing, and show the low sample efficiency of open-source data: to effectively train a model from open-source data that transfers to the client data, more open-source samples need to be communicated than would be required with iid data. The main reason behind such low sample efficiency is that the open-source data inadvertently includes out-of-distribution (OoD) samples, which poison the training and significantly degrade accuracy on the target (client) data distribution [5].
We propose a novel framework called Efficient Collaborative Open-source Sampling (ECOS) to
tackle this challenge, which filters the open-source dataset through an efficient collaboration between
the client and server and does not require client data to be shared. During the collaboration, the server
sends compressed representative features (centroids) of the open-source dataset to the client. The
client then identifies and excludes OoD centroids and returns their privately computed categorical
scores to the server. The server then adaptively and diversely decompresses the neighbors of the
selected centroids. The main idea is illustrated in Fig. 1.
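To make the collaboration concrete, the following is a minimal sketch of the ECOS message flow under simplifying assumptions: features are plain numpy arrays, the client scores each centroid by counting local samples within a fixed radius, and the server samples clusters in proportion to the returned scores. The function names (`cloud_compress`, `client_coverage_scores`, `cloud_select`) and the radius-based rule are illustrative, not the paper's exact procedure, which additionally privatizes the responses and enforces diversity in the server-side selection.

```python
import numpy as np
from sklearn.cluster import KMeans

# --- Cloud side: compress open-source features into R centroids (the query set) ---
def cloud_compress(open_source_features, R=100, seed=0):
    """Cluster open-source features and return R centroids plus cluster assignments."""
    km = KMeans(n_clusters=R, random_state=seed, n_init=10).fit(open_source_features)
    return km.cluster_centers_, km.labels_

# --- Client side: score each centroid by how many local samples it covers ---
def client_coverage_scores(client_features, centroids, radius):
    """For every centroid, count client samples within `radius` of it (coverage score)."""
    d = np.linalg.norm(client_features[:, None, :] - centroids[None, :, :], axis=-1)
    nearest = d.argmin(axis=1)            # each client sample votes for its nearest centroid
    within = d.min(axis=1) <= radius      # ignore samples far from every centroid
    return np.bincount(nearest[within], minlength=len(centroids))  # only R scalars go back

# --- Cloud side: decompress the selected centroids into training samples ---
def cloud_select(labels, scores, budget, seed=0):
    """Sample open-source points from clusters in proportion to their coverage scores."""
    rng = np.random.default_rng(seed)
    probs = scores / max(scores.sum(), 1)
    picked = []
    for r, p in enumerate(probs):
        members = np.where(labels == r)[0]
        k = min(len(members), int(round(p * budget)))
        picked.extend(rng.choice(members, size=k, replace=False))
    return np.asarray(picked)
```

Only the R compressed centroids travel to the client and only R scalar scores travel back, which matches the communication pattern described in the abstract.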
Our major contributions are summarized as follows:
New problem and insight: Motivated by the strong demand for efficient and confidential outsourcing, we consider using public data in place of the client data. However, the impact of the heterogeneous sources of such public data, namely open-source data, is rarely studied in existing works. Our empirical study reveals the challenges caused by such heterogeneity.
New sampling paradigm: We propose a new unified sampling paradigm, where the server sends only a few query data to the client and requests only a few responses that efficiently and privately guide cloud sampling for various learning settings on open-source data. To the best of our knowledge, our method enables efficient cloud outsourcing under the most practical assumption of open-source public data, without accessing raw client data or executing cumbersome local training.
Compelling results: In three practical learning scenarios, our method improves the model accuracy with pseudo, manual, or pre-trained supervision. Besides, our method shows competitive efficiency in terms of both communication and computation.
2 Related Work
There is a series of efforts studying how to leverage the data and computation resources on the cloud to assist client model training, especially when client data cannot be shared [53, 48]. We categorize them as follows: 1) Feature sharing: Methods like group knowledge transfer [21], split learning [47], and domain adaptation [18, 17] transfer edge knowledge by communicating features extracted by networks. To provide a theoretical guarantee of privacy protection, [37] proposed an advanced information-removal technique to disentangle sensitive attributes from shared features. In the notion of rigorous privacy definitions, Liu et al. leveraged public data to assist private information release [30]. Earlier, data encryption was used for outsourcing, which however is too computation-intensive for a client and less applicable for large-scale data and deep networks [12, 27]. Federated Learning (FL) [34] considers the same constraint on data sharing but allocates the burdens of training [23] and communication [57] to clients, and opens up a series of challenges on privacy [11], security [33, 10], and knowledge transfer [24]. 2) Private labeling: PATE and its variants were proposed to generate client-approximated labels for unlabeled public data, on which a model can be trained [39, 40]. Without training multiple models on clients, Private kNN is a more efficient alternative, which explores the private neighborhood of public images for labeling [56]. These approaches are based on the strong assumption that the available public data are iid with the local data. This paper considers a more practical yet challenging setting where public data come from multiple agnostic sources with heterogeneous features.
Sampling from public data has been explored in the centralized setting. For example, Xu et al. [51] used a few target-domain samples as a seed dataset to filter open-domain datasets by positive-unlabeled learning [32]. Yan et al. [52] used a model to find proxy datasets from multiple candidate datasets. In self-supervised contrastive learning, model-aware K-center (MAK) uses a model pre-trained on the seed dataset to find desired-class samples from an open-world dataset [25]. Though these methods provide effective sampling, they are less applicable when the seed dataset resides at the low-energy edge, because the private seed data at the edge cannot be shared with the cloud for filtering and the edge device is incapable of computation-intensive training. To address these challenges, we develop a new sampling strategy requiring only lightweight computation at the edge.
3 Outsourcing Model Training With Open-Source Data
3.1 Problem Setting and Challenges
Motivated in Section 1, we aim to outsource the training process from computation-constrained devices to a powerful cloud server, where a proxy public dataset without privacy concerns is used in place of the client dataset for cloud training. One solution is (private) client labeling by k-nearest-neighbors (kNN) [56], where the client and cloud server privately communicate pseudo-labels for a public dataset, and the server trains a classifier on the labeled and unlabeled samples in a semi-supervised manner. The success of this strategy depends on the key assumption that the public data on the cloud and the private data on the client are iid, which is rather strong in practice and thus excludes many real-world applications. In this work, we make the more realistic assumption that the public dataset is available as open-source data.
Table 1: Test accuracy (%) with different client domains (columns). Cloud data are either identically distributed as the client data (ID) or include more data from 5 distinct domains (ID+OoD), without overlapping samples. We first label a number of randomly selected cloud examples (i.e., the sampling budget) privately using client data [56], and then train a classifier to recognize digit images. The privacy cost (shown as ε next to each accuracy) is accounted for in the notion of differential privacy. Larger budgets imply more privacy and communication costs. More results on different settings are enclosed in Appendix B.3.
Cloud Data | Sampling Budget | MNIST Acc (%) | SVHN Acc (%) | USPS Acc (%) | SynthDigits Acc (%) | MNIST-M Acc (%) | Average Acc (%)
---------- | --------------- | ------------- | ------------ | ------------ | ------------------- | --------------- | ---------------
ID         | 1000  | 84.3±2.4 (ε=4.48) | 51.6±1.4 (ε=4.08) | 87.1±0.5 (ε=4.51) | 73.2±1.5 (ε=4.57) | 55.5±1.0 (ε=4.46) | 70.4 (ε=4.42)
ID+OoD     | 1000  | 78.0±3.5 (ε=4.30) | 40.6±1.6 (ε=3.75) | 82.2±2.7 (ε=4.32) | 62.1±1.6 (ε=4.41) | 49.1±1.0 (ε=4.27) | 62.4 (ε=4.21)
ID+OoD     | 8000  | 82.2±4.1 (ε=5.89) | 47.9±1.8 (ε=5.89) | 85.4±0.5 (ε=5.89) | 64.4±3.6 (ε=5.89) | 53.3±2.2 (ε=5.89) | 66.6 (ε=5.89)
ID+OoD     | 16000 | 82.6±1.4 (ε=7.17) | 48.5±1.7 (ε=7.17) | 86.7±1.9 (ε=7.17) | 67.5±2.3 (ε=7.17) | 52.0±3.0 (ε=7.17) | 67.4 (ε=7.17)
ID+OoD     | 32000 | 84.1±1.6 (ε=9.32) | 49.4±0.2 (ε=9.32) | 86.8±2.0 (ε=9.32) | 68.5±0.1 (ε=9.32) | 53.0±2.7 (ε=9.32) | 68.4 (ε=9.32)
An open-source dataset consists of biased features from multiple heterogeneous sources (feature domains), and therefore includes not only in-distribution (ID) samples similar to the client data but also multi-domain OoD samples.
The immediate question is how the OoD samples affect the outsourced training. In Table 1, we empirically study this problem using a 5-domain dataset, Digits, where 50% of one domain is used on the client and the remaining 50%, together with the other 4 domains, serves as the public dataset on the cloud. To conduct the cloud training, we leverage the client data to generate pseudo labels for the unlabeled public samples, following [56]. It turns out that the presence of OoD samples on the cloud greatly degrades the model accuracy. The inherent reason for the degradation is that the distributional shift of the data [43] compromises the transferability of the model to the client data [50].
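For context, the labeling step in this study follows private kNN [56]; below is a minimal, hedged sketch of that idea, using a Gaussian-noised vote histogram over the k nearest client samples. The noise scale `sigma`, the neighborhood size `k`, and the function name are illustrative and do not reproduce the exact mechanism or privacy accounting of the cited work.

```python
import numpy as np

def private_knn_label(public_feature, client_features, client_labels,
                      num_classes, k=50, sigma=1.0, rng=None):
    """Pseudo-label one public sample from the client's k nearest neighbors,
    releasing only a noisy vote histogram (in the spirit of private kNN [56])."""
    rng = np.random.default_rng() if rng is None else rng
    dists = np.linalg.norm(client_features - public_feature[None, :], axis=1)
    neighbors = np.argsort(dists)[:k]                     # k nearest private samples
    votes = np.bincount(client_labels[neighbors], minlength=num_classes).astype(float)
    votes += rng.normal(0.0, sigma, size=num_classes)     # noise bounds the privacy leakage
    return int(votes.argmax())
```

The cloud then trains a classifier on the pseudo-labeled (and remaining unlabeled) public samples, which is the pipeline whose ID versus ID+OoD behavior Table 1 reports.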
Problem formulation by sampling principles. Given a client dataset $D_p$ and an open-source dataset $D_q$, the goal of open-source sampling is to find a proper subset $S$ from $D_q$ whose distribution matches $D_p$. In [25], Model-Aware K-center (MAK) formulated the sampling as a principled optimization:

$$\min_{S \subset D_q} \; \Delta(S, D_p) + H(S \cup D_p; D_q), \qquad (1)$$

where $\Delta(S, D_p) := \mathbb{E}_{x' \sim D_p}\left[\min_{x \in S} \|\phi(x) - \phi(x')\|_2\right]$ measures proximity as the set difference between $S$ and $D_p$ using a feature extractor $\phi$, and the latter $H(S \cup D_p; D_q) := \max_{x' \in D_q} \min_{x \in S \cup D_p} \|\phi(x) - \phi(x')\|_2$ measures diversity by contrasting $S \cup D_p$ against $D_q$ (supposing $D_q$ is the most diverse set).² Solving Eq. (1) exactly is an NP-hard, intractable problem [13], and MAK provides an approximate solution by a coordinate-wise greedy strategy. It first pre-trains the model representations on $D_p$ and finds a large candidate set with the best proximity to the extracted features. Then, it incrementally selects the most diverse samples from the candidate set until the sampling budget is used up.
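To make the greedy strategy concrete, the following is a minimal sketch of a MAK-style procedure: candidates are first filtered by proximity to the seed (client) features, then a k-center greedy (farthest-point) pass selects a diverse subset. The function name, the `proximity_keep` fraction, and the other parameters are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def mak_style_sampling(candidate_feats, seed_feats, budget, proximity_keep=0.2):
    """Approximate Eq. (1): keep the most proximal candidates, then pick a
    diverse subset with k-center greedy (farthest-point) selection."""
    # Proximity: distance from each candidate to its nearest seed (client) feature.
    d_seed = np.linalg.norm(candidate_feats[:, None, :] - seed_feats[None, :, :],
                            axis=-1).min(axis=1)
    keep = np.argsort(d_seed)[: int(len(candidate_feats) * proximity_keep)]
    cands = candidate_feats[keep]

    # Diversity: greedily add the candidate farthest from everything selected so far;
    # the seed features count as already selected, mirroring H(S ∪ Dp; Dq).
    selected = []
    cover = np.linalg.norm(cands[:, None, :] - seed_feats[None, :, :], axis=-1).min(axis=1)
    for _ in range(min(budget, len(cands))):
        i = int(cover.argmax())                  # farthest still-uncovered candidate
        selected.append(keep[i])
        d_new = np.linalg.norm(cands - cands[i], axis=1)
        cover = np.minimum(cover, d_new)         # update covering distances
    return np.asarray(selected)
```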
Though MAK is successful in the centralized setting, it is not applicable when $D_p$ is isolated from the cloud's open-source data and resides on a resource-constrained client, for two reasons: 1) Communication inefficiency. Uploading client data may result in privacy leakage; sending public data to the client is a direct alternative, but the cost can be prohibitive. 2) Computation inefficiency. Pre-training a model on $D_p$, or proximal sampling (which computes the distances between paired samples from $D_q$ and $D_p$), induces unaffordable computation overheads for the low-energy client.
3.2 Proposed Solution: Efficient Collaborative Open-Source Sampling (ECOS)
To address the above challenges, we design a new strategy that 1) uses compressed queries to reduce the communication and computation overhead, and 2) uses a novel principled objective to effectively sample from open-source data based on the client's responses to the compressed queries.
Construct a communication-efficient and informative query set $\hat{\Phi}_q$ at the cloud. Let $d$ be the number of pixels of an image; the communication overhead of transmitting $D_q$ to the client is then $O(d|D_q|)$. For communication efficiency, we optimize the following two factors:

i) Data dimension $d$. First, we transmit extracted features $\Phi_q = \{\phi(x) \mid x \in D_q\}$ instead of images to reduce the communication overhead to $O(d_e|D_q|)$, where $d_e$ is a much smaller embedding dimension. For an accurate estimation of the distance $\Delta$, a pre-defined discriminative feature space is essential without extra training on the client. Depending on resources, one may consider hand-crafted features such as HOG [14], or an off-the-shelf pre-trained model such as a ResNet pre-trained on ImageNet.
ii) Data size $|D_q|$. Even with compression, sending all data for querying is inefficient due to the huge size of the open-source data $|D_q|$. Meanwhile, too many queries would incur unacceptable privacy costs for the client. Since querying similar samples yields redundant information, we propose to reduce such redundancy by selecting informative samples. We use the classic clustering method KMeans [20] to compress similar samples by clustering them, and collect the $R$ mean vectors, or centroids, into $\hat{\Phi}_q = \{c_r\}_{r=1}^{R}$. We denote $R$ as the compression size and $\hat{D}_q$ as the set of original samples corresponding to $\hat{\Phi}_q$. (A code sketch of both steps is given below.)
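A minimal sketch of the cloud-side query construction in steps i) and ii), assuming a torchvision ResNet-18 with its classifier head removed as the off-the-shelf feature extractor and scikit-learn KMeans for the compression; the backbone choice, the embedding dimension $d_e = 512$, and $R = 500$ are illustrative rather than prescribed by the paper.

```python
import numpy as np
import torch
import torchvision
from sklearn.cluster import KMeans

def build_query_set(open_source_loader, R=500, device="cpu"):
    """Extract compact features for the open-source data (step i) and
    compress them into R centroids to send to the client (step ii)."""
    # i) Off-the-shelf extractor: ResNet-18 with its classifier removed, so each
    #    image is reduced from d pixels to a d_e = 512-dimensional embedding.
    backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
    backbone.fc = torch.nn.Identity()
    backbone.eval().to(device)

    feats = []
    with torch.no_grad():
        for images, _ in open_source_loader:   # loader yields ImageNet-normalized tensors
            feats.append(backbone(images.to(device)).cpu().numpy())
    feats = np.concatenate(feats)              # shape: (|Dq|, 512)

    # ii) Compress the |Dq| embeddings into R centroids; only these R vectors
    #     are downloaded by the client instead of the whole open-source dataset.
    km = KMeans(n_clusters=R, n_init=10, random_state=0).fit(feats)
    return km.cluster_centers_, km.labels_, feats
```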
New sampling objective. We note that sending the compact set $\hat{\Phi}_q$ in place of $D_q$ prohibits the client from optimizing $\Delta(S, D_p)$ in Eq. (1) over $S \subset D_q$. Instead, we sample a set of centroids $\hat{S} \subset \hat{\Phi}_q$ and decompress them by the cluster assignment into corresponding original samples with rich features
² Note that we use the L2-norm distance instead of normalized cosine similarity in $\Delta(S, D_p)$, in contrast to MAK, since normalized cosine similarity is not essential if the feature space is not trained under the cosine metric. We also omit the tailedness objective, which is irrelevant in our context.