
biased features from multiple heterogeneous sources (feature domains), and therefore includes not only in-distribution (ID) samples similar to the client data but also multi-domain OoD samples. The immediate question is how the OoD samples affect the outsourced training. In Table 1, we empirically study the problem using a 5-domain dataset, Digits, where 50% of one domain is used on the client and the remaining 50%, together with the other 4 domains, serves as the public dataset on the cloud. To conduct the cloud training, we leverage the client data to generate pseudo labels for the unlabeled public samples, following [56]. It turns out that the presence of OoD samples in the cloud greatly degrades the model accuracy. The inherent reason for the degradation is that the distributional shift of data [43] compromises the transferability of the model to the client data [50].
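As a rough illustration of the pseudo-labeling setup above (not the exact protocol of [56]), the cloud can label the public samples with a classifier already fitted on the client data; the function and loader names below are hypothetical:

```python
import torch

def pseudo_label_public_data(client_model, public_loader, device="cpu"):
    """Assign pseudo labels to unlabeled public (cloud) samples using a
    classifier trained on the labeled client data, as in the Table 1 setup."""
    client_model.eval()
    pseudo_labels = []
    with torch.no_grad():
        for images in public_loader:  # unlabeled public batches
            logits = client_model(images.to(device))
            pseudo_labels.append(logits.argmax(dim=1).cpu())
    return torch.cat(pseudo_labels)
```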
Problem formulation by sampling principles. Given a client dataset $D_p$ and an open-source dataset $D_q$, the goal of open-source sampling is to find a proper subset $S$ of $D_q$ whose distribution matches $D_p$. In [25], Model-Aware K-center (MAK) formulated the sampling as a principled optimization:
$$\min_{S \subseteq D_q} \; \Delta(S, D_p) - H(S \cup D_p; D_q), \qquad (1)$$
where $\Delta(S, D_p) := \mathbb{E}_{x' \in D_p}\big[\min_{x \in S} \|\phi(x) - \phi(x')\|_2\big]$ measures proximity as the set difference between $S$ and $D_p$ using a feature extractor $\phi$, and the latter $H(S \cup D_p; D_q) := \max_{x' \in D_q} \min_{x \in S \cup D_p} \|\phi(x) - \phi(x')\|_2$ measures diversity by contrasting $S \cup D_p$ against $D_q$ (supposing $D_q$ is the most diverse set)². Solving Eq. (1) exactly is an NP-hard problem that is intractable [13], and MAK provides an approximate solution by a coordinate-wise greedy strategy. It first pre-trains the model representations on $D_p$ and finds a large candidate set with the best proximity to the extracted features. Then, it incrementally selects the most diverse samples from the candidate set until the sampling budget is used up.
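A minimal sketch of this proximity-then-diversity greedy strategy (assuming features are already extracted as NumPy arrays; the candidate-set size and budget are hypothetical parameters, and MAK's contrastive pre-training and normalized similarity are omitted) could look as follows:

```python
import numpy as np

def mak_greedy_sampling(feat_q, feat_p, candidate_size, budget):
    """Approximate MAK: keep the open-source points closest to the client
    features (proximity), then greedily pick the most distant remaining
    candidates (diversity, k-center style)."""
    # Proximity: distance of each open-source feature to its nearest client feature.
    dists = np.linalg.norm(feat_q[:, None, :] - feat_p[None, :, :], axis=-1)
    candidates = np.argsort(dists.min(axis=1))[:candidate_size]

    # Diversity: greedy k-center over the candidate set, seeded by the client set.
    selected, centers = [], feat_p.copy()
    for _ in range(budget):
        cand_feat = feat_q[candidates]
        d_to_centers = np.linalg.norm(
            cand_feat[:, None, :] - centers[None, :, :], axis=-1).min(axis=1)
        pick = candidates[np.argmax(d_to_centers)]
        selected.append(pick)
        centers = np.vstack([centers, feat_q[pick][None, :]])
        candidates = candidates[candidates != pick]
    return np.array(selected)
```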
Though MAK is successful in the central setting, it is not applicable when $D_p$ is isolated from the cloud open-source data and resides on a resource-constrained client, for two reasons: 1) Communication inefficiency. Uploading client data may result in privacy leakage; sending public data to the client is a direct alternative, but the cost can be prohibitive. 2) Computation inefficiency. Pre-training a model on $D_p$ or proximal sampling (which computes the distances between paired samples from $D_q$ and $D_p$) induces unaffordable computation overheads for the low-energy client.
3.2 Proposed Solution: Efficient Collaborative Open-Source Sampling (ECOS)
To address the above challenges, we design a new strategy that 1) uses compressed queries to reduce the communication and computation overhead, and 2) uses a novel principled objective to effectively sample from open-source data based on the client's responses to the compressed queries.
Construct a communication-efficient and informative query set $\hat{\Phi}_q$ at the cloud. Let $d$ be the number of pixels of an image; the communication overhead of transmitting $D_q$ to the client is then $O(d|D_q|)$. For communication efficiency, we optimize the following two factors:
i) Data dimension $d$. First, we transmit extracted features $\Phi_q = \{\phi(x) \mid x \in D_q\}$ instead of images, reducing the communication overhead to $O(d_e|D_q|)$, where $d_e$ is a much smaller embedding dimension. To estimate the distance $\Delta$ accurately, a discriminative feature space that is pre-defined without extra training on the client is essential. Depending on resources, one may consider hand-crafted features such as HOG [14], or an off-the-shelf pre-trained model such as a ResNet pre-trained on ImageNet.
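For instance, a minimal sketch of the off-the-shelf option (assuming torchvision's ImageNet-pretrained ResNet-18 as the frozen extractor $\phi$, which yields $d_e = 512$) could be:

```python
import torch
import torch.nn as nn
from torchvision import models

# Frozen ImageNet-pretrained ResNet-18 with the classifier head removed,
# so each image is mapped to a 512-dimensional embedding (d_e = 512).
backbone = models.resnet18(pretrained=True)
extractor = nn.Sequential(*list(backbone.children())[:-1]).eval()

@torch.no_grad()
def extract_features(images):
    """Map a batch of images (B, 3, H, W) to embeddings (B, 512)."""
    return extractor(images).flatten(start_dim=1)
```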
ii) Data size $|D_q|$. Even with compression, sending all data for querying is inefficient due to the huge size of open-source data $|D_q|$. Meanwhile, too many queries would impose unacceptable privacy costs on the client. As querying similar samples leads to redundant information, we propose to reduce such redundancy by selecting informative samples. We use the classic clustering method KMeans [20] to compress similar samples by clustering them, and collect the $R$ mean vectors or centroids into $\hat{\Phi}_q = \{c_r\}_{r=1}^{R}$. We denote $R$ as the compression size and $\hat{D}_q$ as the set of original samples corresponding to $\hat{\Phi}_q$.
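A minimal sketch of this compression step (assuming scikit-learn's KMeans and a feature matrix feat_q of shape $(|D_q|, d_e)$; the compression size R is a hypothetical parameter) could be:

```python
import numpy as np
from sklearn.cluster import KMeans

def compress_queries(feat_q, R, seed=0):
    """Cluster the cloud features into R groups; the centroids form the
    compact query set, and the labels record which original samples each
    centroid stands for (used later for decompression)."""
    km = KMeans(n_clusters=R, random_state=seed).fit(feat_q)
    query_set = km.cluster_centers_   # \hat{Phi}_q, shape (R, d_e)
    assignment = km.labels_           # cluster index of every sample in D_q
    return query_set, assignment
```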
New sampling objective. We note that sending the compact set $\hat{\Phi}_q$ in place of $D_q$ prevents the client from optimizing $\Delta(S, D_p)$ in Eq. (1) over $S \subseteq D_q$. Instead, we sample a set of centroids $\hat{S} \subseteq \hat{\Phi}_q$ and decompress them by the cluster assignment into corresponding original samples with rich features
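Decompression by cluster assignment can be sketched as follows (assuming the assignment array returned by the hypothetical compress_queries above and the indices of the centroids selected by the client):

```python
import numpy as np

def decompress_selection(selected_centroids, assignment):
    """Map client-selected centroid indices back to the indices of the
    original open-source samples that belong to those clusters."""
    mask = np.isin(assignment, selected_centroids)
    return np.flatnonzero(mask)  # indices into D_q
```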
²Note that we use the $L_2$-norm distance instead of normalized cosine similarity in $\Delta(S, D_p)$, in contrast to MAK, since normalized cosine similarity is not essential if the feature space is not trained under the cosine metric. We also omit the tailedness objective, which is irrelevant in our context.