Dataset Distillation via Factorization
Songhua Liu Kai Wang Xingyi Yang Jingwen Ye Xinchao Wang
National University of Singapore
{songhua.liu,e0823044,xyang}@u.nus.edu, {jingweny,xinchao}@nus.edu.sg
Abstract
In this paper, we study dataset distillation (DD) from a novel perspective and
introduce a dataset factorization approach, termed HaBa, which is a plug-and-play
strategy portable to any existing DD baseline. Unlike conventional DD approaches
that aim to produce distilled and representative samples, HaBa explores decomposing
a dataset into two components: data Hallucination networks and Bases, where the
latter are fed into the former to reconstruct image samples. The flexible combinations
between bases and hallucination networks equip the distilled data with an exponential
gain in informativeness, which largely increases the representation capability of
distilled datasets. To further improve the data efficiency of the compression results,
we introduce a pair of adversarial contrastive constraints on the resultant hallucination
networks and bases, which increase the diversity of generated images and inject more
discriminative information into the factorization. Extensive comparisons and experiments
demonstrate that our method yields significant improvement on downstream classification
tasks compared with previous state-of-the-art methods, while reducing the total number
of compressed parameters by up to 65%. Moreover, distilled datasets produced by our
approach also achieve ~10% higher accuracy than baseline methods in cross-architecture
generalization. Our code is available here.
1 Introduction
The success of deep models on a variety of vision tasks, such as image classification [26, 11, 38],
object detection [37, 36], and semantic segmentation [43, 56, 29], is largely attributed to the huge
amount of data used for training and to various pre-trained models [57]. However, the sheer amount
of data introduces significant obstacles for storage, transmission, and data pre-processing. Besides,
publishing raw data inevitably brings about privacy or copyright issues in practice [44, 10]. To
alleviate these problems, Wang et al. [52] pioneered the research of dataset distillation (DD), which
distills a large dataset into a synthetic one with only a limited number of samples, so that the effort of
training downstream models can be largely reduced compared with using the original dataset. This
facilitates a series of applications like continual learning [41, 40, 54, 31] and black-box
optimization [7]. Due to the significant practical value of DD, many endeavours have been made in this
area [62, 60, 61, 51, 24, 6, 63] to design novel supervision signals for training the synthetic datasets and
to further improve their performance.
Nevertheless, there is a potential drawback in conventional settings of DD: each synthetic sample is
largely treated independently, ignoring the coherence and relationships between different instances.
As such, the information embraced by each sample, despite being distilled, is by nature limited.
Using the synthetic samples to train downstream models therefore inevitably leads to a loss of
dataset information. Moreover, the few distilled samples are incompatible with the enormous number
of parameters in a deep model and may incur the risk of overfitting.
To verify these potential issues, we conduct a pre-experiment on the CIFAR10 dataset with 10 synthetic
images per class, using MTT [6], the current SOTA solution for DD, as the baseline.
Figure 1: Intuition of our hallucinator-basis factorization for dataset distillation.
In addition to the baseline setting, we also incorporate the checkpoint synthetic datasets saved every
100 DD iterations in the convergent stage to train the downstream model. Since the synthetic images
are still being fine-tuned during this stage, the multiple checkpoints can be viewed as related but
different, which may somewhat increase the diversity. As a result, this yields an overall lower test
loss and hence better final results in downstream training, as shown by the blue and green curves in
Fig. 2, which indicates that current DD solutions can potentially be improved by leveraging
sample-wise relationships to diversify the distilled data. Nevertheless, simply involving more data
samples also increases the memory overhead. This fact motivates us to ask: is it possible to encode
shared relationships in a dataset implicitly, instead of storing samples directly, so as to avoid such
additional storage costs?
Figure 2: Visualization of test loss using synthetic datasets generated by MTT, MTT with multiple checkpoints, and ours.
We show in this paper that it can indeed be made possible by reformulating the DD task as a
factorization problem. As shown in Fig. 1, we propose a novel perspective dubbed HaBa, which
factorizes a dataset into two compositions: data Hallucination networks and Bases. A data
hallucination network, or hallucinator, can take any basis as input and output the corresponding
hallucinated image. Supervised by the training objective of DD, a set of hallucinators can synthesize
multiple samples from a common basis and is optimized to explicitly extract effective relationships
among different samples in the original dataset. In this way, the information of
$|\mathcal{H}| \times |\mathcal{B}|$ images can be covered by a factorization result with
$|\mathcal{H}|$ hallucinators and $|\mathcal{B}|$ bases via arbitrary pair-wise combination, which
improves the data efficiency of traditional DD exponentially. As shown by the yellow curve in
Fig. 2, with the same storage budget, our strategy achieves better test performance than the MTT
baseline.
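As an illustrative count (the numbers are chosen for exposition and not taken from the experiments): with $|\mathcal{H}| = 5$ hallucinators and $|\mathcal{B}| = 100$ bases, pair-wise composition can produce up to $5 \times 100 = 500$ distinct training images, while only the 5 small networks and 100 bases need to be stored.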
To further increase the informativeness of the factorized results, we introduce a pair of adversarial
contrastive constraints to promote sample-wise diversity. The goal of HaBa is to minimize the
correlation among images composed by different hallucinators from a common basis, while an adversary
tries to maximize it. Such an adversarial scheme, in turn, enforces the hallucinators to produce
diversified images and increases the amount of useful information.
Notably, HaBa is a versatile strategy that can be built upon existing DD baselines, since it is compatible
with any training objective that measures the similarity between downstream performances, as shown
in Fig. 1. We conduct extensive experiments to demonstrate the advantages of the proposed method
over baseline ones. In all benchmarks and comparisons, HaBa produces significant and consistent
improvement in training downstream models, while reducing the total number of compressed
parameters by up to 65%. Furthermore, it demonstrates strong cross-architecture generalization
ability, with accuracy improvement higher than 10%. Our contributions are summarized as follows:
• We study dataset factorization, a novel perspective to explore dataset distillation, and propose a novel approach termed HaBa for hallucinator-basis factorization.
• We present a pair of adversarial contrastive objectives to further increase the data diversity and information capacity.
• HaBa is a plug-and-play scheme compatible with all existing training objectives of DD and can yield significant and consistent improvement over the state of the art.
2 Related Works
The goal of dataset distillation (DD) is to optimize a smaller synthetic dataset such that it can take the
place of the original one for training downstream tasks. This differs from coreset selection
[1, 8, 15, 42, 48], another branch of dataset compression, which directly selects samples from the raw
dataset. In this section, we provide a detailed review of previous methods in DD.
Motivated by knowledge distillation [18, 14, 59, 58], which aims at model compression, Wang et al. [52]
introduce the concept of dataset distillation for dataset compression. The idea is to optimize the
synthetic images so that they minimize the loss functions of downstream tasks, which involves a bilevel
optimization algorithm [13]. Following this routine, several works further consider learnable labels
beyond samples [4, 46]. Subsequently, Zhao et al. [62] and several follow-up approaches [60, 28]
consider matching the gradients of a downstream model produced by synthetic samples and real images,
which improves the performance significantly. Most recently, Cazenavette et al. [6] argue that
single-iteration gradient matching may lead to inferior performance due to error accumulation across
multiple steps and thereby propose to match the long-range training dynamics of an expert trained on
the original dataset. As an alternative way to profile the training effects produced by different sets,
Nguyen et al. [33, 34] introduce a kernel ridge-regression approach based on the Neural Tangent
Kernel (NTK) of infinitely wide convolutional networks [20].
Apart from matching training effects, there are also methods that match data distributions between the
original and synthetic datasets. For instance, Zhao et al. [61] propose a simple but effective Maximum
Mean Discrepancy (MMD) constraint for DD, which does not involve training downstream models and
thus enjoys superior training efficiency. Wang et al. [51] propose CAFE, which explicitly attempts to
align the synthetic and real distributions in the feature space of a downstream network.
The above-mentioned methods are dedicated to exploring suitable training objectives and pipelines for
DD. However, few works concern improving the data efficiency of the distilled samples. Although
Zhao et al. [60] propose differentiable siamese augmentation (DSA) to enrich the training data, the
augmentation operations used, e.g., crop, flip, scale, and rotation, cannot encode any information about
the target datasets. In this paper, we study the task from a factorization perspective, factorizing a
dataset into two different compositions: data hallucination networks and bases. Both parts carry
important knowledge of the raw dataset. For downstream training, hallucinators and bases can be
combined in arbitrary pairs, i.e., any basis can be sent to any hallucinator, to create a training sample.
The idea of factorization improves the diversity of distilled training datasets significantly, without
introducing additional storage costs. It is also a versatile strategy compatible with all aforementioned
DD methods, which will be demonstrated in the experiment part.
Concurrent Works on Efficient Distilled Dataset Parameterization: As a concurrent work, Kim et al. [24]
propose IDC for efficient synthetic data parameterization. It reveals that storing only down-sampled
versions of synthetic images and conducting bilinear upsampling during downstream training does not
hurt the performance much. Thus, given the same storage budget, IDC can store 4x the number of
2x-down-sampled synthetic images compared with the baseline. Both IDC and HaBa are dedicated to
improving the data efficiency of synthetic parameters. Interestingly, according to the definition of our
hallucinator-basis factorization, IDC can in fact be treated as a special case of HaBa, where the
hallucinator is a parameter-free upsampling function and each basis has a smaller spatial size; a sketch
of this view is given below. Nevertheless, the main focuses of IDC and HaBa are different, and they are
in fact two orthogonal techniques that can readily join forces to enhance the baseline performance, as
discussed in Sec. 4.2.
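To make the "special case" argument concrete, the following minimal sketch (an illustration of the analogy only, not IDC's actual implementation; the class name `UpsampleHallucinator` is ours) expresses bilinear upsampling as a parameter-free hallucinator:

```python
import torch.nn as nn

class UpsampleHallucinator(nn.Module):
    """IDC viewed through the HaBa lens: a parameter-free 'hallucinator' that
    bilinearly upsamples a spatially smaller basis to full resolution."""

    def __init__(self, scale_factor=2):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale_factor, mode="bilinear",
                              align_corners=False)

    def forward(self, x_hat):
        # x_hat: (batch, c, h/2, w/2) down-sampled synthetic image (the "basis")
        return self.up(x_hat)  # full-resolution training image
```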
3 Methods
In this section, we elaborate our proposed method HaBa for dataset distillation (DD). Assume that there
is an original dataset $\mathcal{T}=\{(x_i, y_i)\}_{i=1}^{|\mathcal{T}|}$ with $|\mathcal{T}|$ pairs of a
training sample $x_i$ and the corresponding label $y_i$. DD targets a synthetic dataset
$\mathcal{S}=\{(\hat{x}_i, \hat{y}_i)\}_{i=1}^{|\mathcal{S}|}$ with $|\mathcal{S}| \ll |\mathcal{T}|$ and
expects that a model trained on $\mathcal{S}$ can have performance similar to that of a model trained
on $\mathcal{T}$.
Figure 3: Left: Overall pipeline of the proposed hallucinator-basis factorization. $\mathcal{B}$, $\mathcal{H}$, and $\mathcal{T}$ denote the sets of bases, hallucinators, and original data, respectively. Adv. denotes an adversary model. We adopt a batch size of 2 here for clarity; Right: Architecture of a hallucinator in detail.
Traditional DD methods treat each synthetic sample independently and ignore the inner relationships
between different samples within a dataset, which results in poor data/information efficiency. To
address this drawback, we study DD from a novel perspective and redefine it as a hallucinator-basis
factorization problem:
$$\mathcal{S}=\{H_{\theta_j}\}_{j=1}^{|\mathcal{H}|} \cup \{(\hat{x}_i,\hat{y}_i)\}_{i=1}^{|\mathcal{B}|}, \quad (1)$$
where there are $|\mathcal{H}|$ hallucination networks and $|\mathcal{B}|$ bases. The $j$-th hallucinator
is parameterized by $\theta_j$ and denoted by $H_{\theta_j}$ for $1 \le j \le |\mathcal{H}|$. For
downstream training, a training data pair $(\tilde{x}_{ij},\tilde{y}_{ij})$ is created online by sending
the $i$-th basis, for any $1 \le i \le |\mathcal{B}|$, to the $j$-th hallucinator, for any
$1 \le j \le |\mathcal{H}|$, i.e., $\tilde{x}_{ij}=H_{\theta_j}(\hat{x}_i)$. In this paper, the label
$\tilde{y}_{ij}$ is simply taken as $\hat{y}_i$.
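To make the online composition concrete, here is a minimal sketch assuming a PyTorch-style setup; the helper name `compose_batch` and the tensor layout are illustrative choices of ours, not the paper's implementation:

```python
import random
import torch

def compose_batch(bases, labels, hallucinators, batch_size):
    """Compose training pairs by pairing random bases with random hallucinators.

    bases:         tensor of shape (|B|, c0, h0, w0), the learnable basis images
    labels:        tensor of shape (|B|,), labels inherited by composed samples
    hallucinators: list of |H| nn.Module hallucination networks
    """
    idx = torch.randint(0, bases.size(0), (batch_size,))       # sample basis indices i
    images, targets = [], []
    for i in idx.tolist():
        j = random.randrange(len(hallucinators))                # sample hallucinator index j
        x_ij = hallucinators[j](bases[i].unsqueeze(0))          # x~_ij = H_j(x^_i)
        images.append(x_ij)
        targets.append(labels[i])                               # y~_ij = y^_i
    return torch.cat(images, dim=0), torch.stack(targets)
```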
An overview of our method is shown in Fig. 3 (Left). To go deeper into the technical details, we start
with an introduction of our basis and data hallucination network in Sec. 3.1. Then, we propose an
adversarial contrastive constraint to increase data diversity in Sec. 3.2. Finally, we present the whole
training pipeline of the hallucinator-basis factorization for DD in Sec. 3.3.
3.1 Basis and Hallucinator
Basis: Typically, for an image classification dataset $\mathcal{T}=\{(x_i, y_i)\}_{i=1}^{|\mathcal{T}|}$,
$x_i \in \mathbb{R}^{h \times w \times c}$ and $y_i \in \{0, 1, ..., C-1\}$ for each
$1 \le i \le |\mathcal{T}|$, where each $x_i$ is a $c$-channel image with a resolution of $h \times w$,
and $C$ is the total number of classes. In previous DD methods, the format/shape of synthetic data pairs
$(\hat{x}, \hat{y})$ has to be kept the same as that of the real data, so as to ensure consistency between
the input and output formats at training and test time for downstream models. By contrast, since
hallucinator networks are capable of spatial-wise and channel-wise transformation, the shape of each
$\hat{x}_i$, $1 \le i \le |\mathcal{B}|$, denoted as $h' \times w' \times c'$, is not necessarily the same
as that of the original samples and is thus more flexible. For a classification problem, we do not modify
the label space in this paper for simplicity and maintain the categorical format.
Hallucinator: Given a basis $\hat{x} \in \mathbb{R}^{h' \times w' \times c'}$, a data hallucination
network aims to create a new image $\tilde{x} \in \mathbb{R}^{h \times w \times c}$ based on $\hat{x}$,
which can be viewed as a conditional image generation problem. Inspired by image style transfer
[22, 19, 21, 30], a typical conditional image generation problem, we devise an
encoder-transformation-decoder architecture for hallucinators, as shown in Fig. 3 (Right). Specifically,
the encoder, denoted as $enc$, is composed of CNN blocks that non-linearly map an input $\hat{x}$ to a
feature space $\mathbb{R}^{h'' \times w'' \times c''}$. Then, an affine transformation with scale
$\sigma$ and shift $\mu$ is conducted on the derived feature, where $\sigma$ and $\mu$ are treated as
network parameters in this paper. At last, the decoder $dec$, with a CNN architecture symmetric to
$enc$, projects the transformed feature back to the image space. Formally, this process can be written as:
$$\hat{f}=enc(\hat{x}), \quad \tilde{f}=\sigma \times \hat{f}+\mu, \quad \tilde{x}=dec(\tilde{f}), \quad (2)$$
where the multiplication is an element-wise operation. There are $|\mathcal{H}|$ hallucinators in the
whole factorization pipeline, and each is trained to implicitly encode some sample-wise relations in its
network parameters.
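For illustration, a minimal encoder-affine-decoder hallucinator in the spirit of Eq. 2 could look as follows (a sketch assuming PyTorch; the block depth and channel widths are placeholders, not the configuration used in the paper):

```python
import torch
import torch.nn as nn

class Hallucinator(nn.Module):
    """Encoder -> learnable affine transform -> decoder, following Eq. (2)."""

    def __init__(self, in_channels=3, feat_channels=16):
        super().__init__()
        # Encoder: CNN blocks mapping the basis to a feature space
        self.enc = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Per-channel affine parameters sigma (scale) and mu (shift),
        # treated as learnable network parameters
        self.sigma = nn.Parameter(torch.ones(1, feat_channels, 1, 1))
        self.mu = nn.Parameter(torch.zeros(1, feat_channels, 1, 1))
        # Decoder: symmetric CNN blocks projecting back to image space
        self.dec = nn.Sequential(
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, in_channels, 3, padding=1),
        )

    def forward(self, x_hat):
        f_hat = self.enc(x_hat)                  # f^ = enc(x^)
        f_tilde = self.sigma * f_hat + self.mu   # f~ = sigma * f^ + mu (element-wise)
        return self.dec(f_tilde)                 # x~ = dec(f~)
```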
3.2 Adversarial Contrastive Constraint
Ideally, the knowledge encoded by different hallucinators should be as different/orthogonal as possible
so that each individual brings the most benefit. To instantiate such regularization, let us consider two
composed images $\tilde{x}_{ij}$ and $\tilde{x}_{ik}$ from two different hallucinators $H_{\theta_j}$
and $H_{\theta_k}$ but a common basis $\hat{x}_i$. The divergence between $\tilde{x}_{ij}$ and
$\tilde{x}_{ik}$ is expected to be large. To measure the divergence, a feature extractor is required to map
an input image to a feature space, and how to train such a feature extractor to find an appropriate feature
space is of great importance.
In this paper, we formalize the training of hallucinators and the feature extractor as a min-max game in a
self-consistent manner: the feature extractor aims to minimize the divergence between $\tilde{x}_{ij}$
and $\tilde{x}_{ik}$, while the hallucinators, as well as the bases, are optimized to maximize it, so that
the two players reinforce each other. Specifically, the feature extractor, denoted as $F$ and
parameterized by $\psi$, is typically a CNN for the downstream task, and we adopt the features at the
last hidden layer before the output layer, denoted as $F_{-1}(\tilde{x}_{ij})$ and
$F_{-1}(\tilde{x}_{ik})$. $F$ is optimized to maximize the correlation between the two feature vectors,
which can be quantified by the metric of mutual information (MI). Inspired by the lower bound of
MI [49], the objective for $F$ to minimize the divergence is given in the following contrastive form:
$$\mathcal{L}_{con.} = -\frac{1}{|\mathcal{H}|^2}\frac{1}{|\mathcal{B}|}\sum_{\substack{1 \le j,k \le |\mathcal{H}|, \\ j \neq k}}\sum_{i=1}^{|\mathcal{B}|} \log \frac{\exp\{F_{-1}^{\top}(\tilde{x}_{ij})F_{-1}(\tilde{x}_{ik})/\tau\}}{\sum_{u=1}^{|\mathcal{B}|}\exp\{F_{-1}^{\top}(\tilde{x}_{ij})F_{-1}(\tilde{x}_{uk})/\tau\}}, \quad (3)$$
where $\tau$ is a scalar temperature coefficient. For the classification problem, we can alternatively
adopt the supervised form of the contrastive loss $\mathcal{L}_{con.}$, where samples $\tilde{x}_{uk}$
with the same class label as $\tilde{x}_{ij}$ are also taken as positives in Eq. 3. The supervised
contrastive loss helps increase the correlation of samples from the same class [23] for a more
reasonable feature representation.
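For a single hallucinator pair $(j, k)$, the contrastive term reduces to an InfoNCE-style cross-entropy over bases. A minimal sketch, assuming PyTorch and feature matrices of shape $(|\mathcal{B}|, d)$; the function name and default temperature are illustrative:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(feat_j, feat_k, tau=0.1):
    """One (j, k) term of Eq. 3: features of images composed from the same bases
    by two different hallucinators j and k.

    feat_j, feat_k: (|B|, d) tensors holding F_{-1}(x~_ij) and F_{-1}(x~_ik).
    """
    logits = feat_j @ feat_k.t() / tau                 # (|B|, |B|) pairwise similarities
    targets = torch.arange(feat_j.size(0), device=feat_j.device)
    # Row i's positive is column i (same basis); other bases act as negatives,
    # so the cross-entropy equals the negative log-softmax of the positive pair.
    return F.cross_entropy(logits, targets)
```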
In addition, the feature space is expected to reflect task-specific properties for a meaningful
representation. Thus, we also incorporate the task loss $\mathcal{L}_{task}$, e.g., the cross-entropy loss
in classification tasks, over the synthetic dataset as a supervision signal for $F$. In this way, the overall
training objective for $F$ is defined as:
$$\min_{\psi} \mathcal{L}_{F}=\lambda_{con.}\mathcal{L}_{con.}+\lambda_{task}\mathcal{L}_{task}, \quad (4)$$
where $\lambda_{con.}$ and $\lambda_{task}$ are hyper-parameters controlling the weight of each term.

$F$ acts as an adversary that minimizes the divergence between $\tilde{x}_{ij}$ and $\tilde{x}_{ik}$,
while the synthetic dataset is expected to maximize it to increase data diversity. To this end, the
similarity between $F_{-1}(\tilde{x}_{ij})$ and $F_{-1}(\tilde{x}_{ik})$ becomes a loss term for the
hallucinator-basis factorization. In this paper, we adopt the cosine similarity, and the objective
$\mathcal{L}_{cos.}$ is given by:
$$\mathcal{L}_{cos.} = \frac{1}{|\mathcal{H}|^2}\frac{1}{|\mathcal{B}|}\sum_{\substack{1 \le j,k \le |\mathcal{H}|, \\ j \neq k}}\sum_{i=1}^{|\mathcal{B}|} \frac{F_{-1}^{\top}(\tilde{x}_{ij})F_{-1}(\tilde{x}_{ik})}{\|F_{-1}(\tilde{x}_{ij})\|_2\,\|F_{-1}(\tilde{x}_{ik})\|_2}. \quad (5)$$
During training, the feature extractor and the factorized components are updated alternately to play
this min-max game.
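For reference, a minimal sketch of this similarity penalty for a single hallucinator pair $(j, k)$, assuming PyTorch and feature tensors of shape $(|\mathcal{B}|, d)$; the function name is illustrative:

```python
import torch.nn.functional as F

def cosine_diversity_loss(feat_j, feat_k):
    """Mean cosine similarity between features of images that share a basis but
    come from two different hallucinators (one (j, k) term of Eq. 5).
    Minimizing it pushes the composed images apart in feature space."""
    return F.cosine_similarity(feat_j, feat_k, dim=1).mean()
```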
3.3 Factorization Training Pipeline
Following previous paradigms [62, 61, 6, 51], the synthetic dataset $\mathcal{S}$ is updated by an
iterative algorithm. In each iteration, we randomly sample a batch of hallucinators and bases and
conduct pair-wise combinations. The composed images are evaluated by the dataset distillation
objective $\mathcal{L}_{DD}$ and the similarity metric in Eq. 5:
$$\min_{\mathcal{S}} \mathcal{L}_{\mathcal{S}}=\lambda_{DD}\mathcal{L}_{DD}+\lambda_{cos.}\mathcal{L}_{cos.}, \quad (6)$$
where the hyper-parameters $\lambda_{DD}$ and $\lambda_{cos.}$ balance the two loss terms.
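As a rough illustration of one update of $\mathcal{S}$ under Eq. 6 (a minimal sketch only: `dd_loss` stands for whichever $\mathcal{L}_{DD}$ is plugged in, `cosine_diversity_loss` is the helper sketched at the end of Sec. 3.2, and the `features` method of the adversary as well as the hyper-parameter values are our own assumptions):

```python
import torch

def update_synthetic_dataset(bases, labels, hallucinators, adversary, dd_loss,
                             optimizer_S, lambda_dd=1.0, lambda_cos=0.1):
    """One gradient step on S = {hallucinators} U {bases, labels} (Eq. 6).
    The adversary F is updated in a separate, alternating step using Eq. 4."""
    optimizer_S.zero_grad()
    # Compose the same bases with two different hallucinators j != k.
    j, k = torch.randperm(len(hallucinators))[:2].tolist()
    x_j, x_k = hallucinators[j](bases), hallucinators[k](bases)
    images = torch.cat([x_j, x_k], dim=0)
    targets = torch.cat([labels, labels], dim=0)

    feats_j = adversary.features(x_j)   # F_{-1}(x~_ij), assumed feature hook
    feats_k = adversary.features(x_k)   # F_{-1}(x~_ik)
    loss = lambda_dd * dd_loss(images, targets) \
         + lambda_cos * cosine_diversity_loss(feats_j, feats_k)
    loss.backward()
    optimizer_S.step()                  # updates bases and hallucinator parameters
```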
Notably, the hallucinator-basis factorization is compatible with a variety of configurations of
$\mathcal{L}_{DD}$ from previous arts, which makes it a versatile and effective strategy for DD. In this
paper, we adopt the trajectory matching loss of Cazenavette et al. [6] as $\mathcal{L}_{DD}$ by default
thanks to its superior performance. The basic idea is to update a downstream model from a cached
checkpoint $\phi^{*}_{t}$ at iteration $t$, using the synthetic dataset $\mathcal{S}$ for $N$ times and
the real dataset $\mathcal{T}$ for $M$ times