
biased features from multiple heterogeneous sources (feature domains), and therefore includes not only in-distribution (ID) samples similar to the client data but also multi-domain OoD samples. The immediate question is how the OoD samples affect the outsourced training. In Table 1, we empirically study the problem using a 5-domain dataset, Digits, where 50% of one domain is used on the client and the remaining 50%, together with the other 4 domains, serves as the public dataset on the cloud. To conduct the cloud training, we leverage the client data to generate pseudo labels for the unlabeled public samples, following [56]. It turns out that the presence of OoD samples in the cloud greatly degrades the model accuracy. The inherent reason for the degradation is that the distributional shift of data [43] compromises the transferability of the model to the client data [50].
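As a rough illustration of the pseudo-labeling setup above (not the exact protocol of [56]), the cloud can label the public samples with a classifier already fitted on the client data; the function and loader names below are hypothetical:

```python
import torch

def pseudo_label_public_data(client_model, public_loader, device="cpu"):
    """Assign pseudo labels to unlabeled public (cloud) samples using a
    classifier trained on the labeled client data, as in the Table 1 setup."""
    client_model.eval()
    pseudo_labels = []
    with torch.no_grad():
        for images in public_loader:  # unlabeled public batches
            logits = client_model(images.to(device))
            pseudo_labels.append(logits.argmax(dim=1).cpu())
    return torch.cat(pseudo_labels)
```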
Problem formulation by sampling principles. Given a client dataset $D_p$ and an open-source dataset $D_q$, the goal of open-source sampling is to find a proper subset $S$ of $D_q$ whose distribution matches $D_p$. In [25], Model-Aware K-center (MAK) formulated the sampling as a principled optimization:
$$\min_{S \subseteq D_q} \; \Delta(S, D_p) - H(S \cup D_p; D_q), \qquad (1)$$
where $\Delta(S, D_p) := \mathbb{E}_{x' \in D_p}\big[\min_{x \in S} \|\phi(x) - \phi(x')\|_2\big]$ measures proximity as the set difference between $S$ and $D_p$ using a feature extractor $\phi$, and the latter $H(S \cup D_p; D_q) := \max_{x' \in D_q} \min_{x \in S \cup D_p} \|\phi(x) - \phi(x')\|_2$ measures diversity by contrasting $S \cup D_p$ against $D_q$ (supposing $D_q$ is the most diverse set)². Solving Eq. (1) exactly is an NP-hard problem that is intractable [13], and MAK provides an approximate solution by a coordinate-wise greedy strategy. It first pre-trains the model representations on $D_p$ and finds a large candidate set with the best proximity to the extracted features. Then, it incrementally selects the most diverse samples from the candidate set until the sampling budget is used up.
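A minimal sketch of this proximity-then-diversity greedy strategy (assuming features are already extracted as NumPy arrays; the candidate-set size and budget are hypothetical parameters, and MAK's contrastive pre-training and normalized similarity are omitted) could look as follows:

```python
import numpy as np

def mak_greedy_sampling(feat_q, feat_p, candidate_size, budget):
    """Approximate MAK: keep the open-source points closest to the client
    features (proximity), then greedily pick the most distant remaining
    candidates (diversity, k-center style)."""
    # Proximity: distance of each open-source feature to its nearest client feature.
    dists = np.linalg.norm(feat_q[:, None, :] - feat_p[None, :, :], axis=-1)
    candidates = np.argsort(dists.min(axis=1))[:candidate_size]

    # Diversity: greedy k-center over the candidate set, seeded by the client set.
    selected, centers = [], feat_p.copy()
    for _ in range(budget):
        cand_feat = feat_q[candidates]
        d_to_centers = np.linalg.norm(
            cand_feat[:, None, :] - centers[None, :, :], axis=-1).min(axis=1)
        pick = candidates[np.argmax(d_to_centers)]
        selected.append(pick)
        centers = np.vstack([centers, feat_q[pick][None, :]])
        candidates = candidates[candidates != pick]
    return np.array(selected)
```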
Though MAK is successful in the central setting, it is not applicable when $D_p$ is isolated from the cloud open-source data and resides on a resource-constrained client, for two reasons: 1) Communication inefficiency. Uploading client data may result in privacy leakage; sending public data to the client is a direct alternative, but the cost can be prohibitive. 2) Computation inefficiency. Pre-training a model on $D_p$ or proximal sampling (which computes the distances between paired samples from $D_q$ and $D_p$) induces unaffordable computation overheads for the low-energy client.
3.2 Proposed Solution: Efficient Collaborative Open-Source Sampling (ECOS)
To address the above challenges, we design a new strategy that 1) uses compressed queries to reduce the communication and computation overhead, and 2) uses a novel principled objective to effectively sample from open-source data based on the client's responses to the compressed queries.
Construct a communication-efficient and informative query set $\hat{\Phi}_q$ at the cloud. Let $d$ be the number of pixels of an image; the communication overhead of transmitting $D_q$ to the client is then $O(d|D_q|)$. For communication efficiency, we optimize the following two factors:
i) Data dimension $d$. First, we transmit extracted features $\Phi_q = \{\phi(x) \mid x \in D_q\}$ instead of images, reducing the communication overhead to $O(d_e|D_q|)$, where $d_e$ is a much smaller embedding dimension. To estimate the distance $\Delta$ accurately, a discriminative feature space that is pre-defined without extra training on the client is essential. Depending on resources, one may consider hand-crafted features such as HOG [14], or an off-the-shelf pre-trained model such as a ResNet pre-trained on ImageNet.
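For instance, a minimal sketch of the off-the-shelf option (assuming torchvision's ImageNet-pretrained ResNet-18 as the frozen extractor $\phi$, which yields $d_e = 512$) could be:

```python
import torch
import torch.nn as nn
from torchvision import models

# Frozen ImageNet-pretrained ResNet-18 with the classifier head removed,
# so each image is mapped to a 512-dimensional embedding (d_e = 512).
backbone = models.resnet18(pretrained=True)
extractor = nn.Sequential(*list(backbone.children())[:-1]).eval()

@torch.no_grad()
def extract_features(images):
    """Map a batch of images (B, 3, H, W) to embeddings (B, 512)."""
    return extractor(images).flatten(start_dim=1)
```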
ii) Data size $|D_q|$. Even with compression, sending all data for querying is inefficient due to the huge size of open-source data $|D_q|$. Meanwhile, too many queries would impose unacceptable privacy costs on the client. As querying similar samples leads to redundant information, we propose to reduce such redundancy by selecting informative samples. We use the classic clustering method KMeans [20] to compress similar samples by clustering them, and collect the $R$ mean vectors or centroids into $\hat{\Phi}_q = \{c_r\}_{r=1}^{R}$. We denote $R$ as the compression size and $\hat{D}_q$ as the set of original samples corresponding to $\hat{\Phi}_q$.
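A minimal sketch of this compression step (assuming scikit-learn's KMeans and a feature matrix feat_q of shape $(|D_q|, d_e)$; the compression size R is a hypothetical parameter) could be:

```python
import numpy as np
from sklearn.cluster import KMeans

def compress_queries(feat_q, R, seed=0):
    """Cluster the cloud features into R groups; the centroids form the
    compact query set, and the labels record which original samples each
    centroid stands for (used later for decompression)."""
    km = KMeans(n_clusters=R, random_state=seed).fit(feat_q)
    query_set = km.cluster_centers_   # \hat{Phi}_q, shape (R, d_e)
    assignment = km.labels_           # cluster index of every sample in D_q
    return query_set, assignment
```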
New sampling objective. We note that sending the compact set $\hat{\Phi}_q$ in place of $D_q$ prevents the client from optimizing $\Delta(S, D_p)$ in Eq. (1) over $S \subseteq D_q$. Instead, we sample a set of centroids $\hat{S} \subseteq \hat{\Phi}_q$ and decompress them by the cluster assignment into corresponding original samples with rich features
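Decompression by cluster assignment can be sketched as follows (assuming the assignment array returned by the hypothetical compress_queries above and the indices of the centroids selected by the client):

```python
import numpy as np

def decompress_selection(selected_centroids, assignment):
    """Map client-selected centroid indices back to the indices of the
    original open-source samples that belong to those clusters."""
    mask = np.isin(assignment, selected_centroids)
    return np.flatnonzero(mask)  # indices into D_q
```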
²Note that we use the $L_2$-norm distance instead of normalized cosine similarity in $\Delta(S, D_p)$, in contrast to MAK, since normalized cosine similarity is not essential if the feature space is not trained under the cosine metric. We also omit the tailedness objective, which is irrelevant in our context.