• HaBa is a plug-and-play scheme compatible with all existing training objectives of DD and can yield significant and consistent improvements over the state of the art.
2 Related Works
The goal of dataset distillation (DD) is to optimize a smaller synthetic dataset such that it can take the place of the original one for training downstream tasks. This differs from coreset selection [1, 8, 15, 42, 48], another branch of dataset compression, which directly selects samples from the raw dataset.
In this section, we provide a detailed review of previous methods in DD.
Motivated by knowledge distillation [18, 14, 59, 58], which aims at model compression, Wang et al. [52] introduce the concept of dataset distillation for dataset compression. The idea is to optimize the synthetic images so that they minimize the loss functions of downstream tasks, which involves a bilevel optimization algorithm [13]. Following this routine, several works further consider learnable labels beyond samples [4, 46]. Subsequently, Zhao et al. [62] and several follow-up approaches [60, 28] match the gradients of a downstream model produced by synthetic samples and real images, which improves the performance significantly. Most recently, Cazenavette et al. [6] argue that single-iteration gradient matching may lead to inferior performance due to error accumulation across multiple steps and thereby propose to match the long-range training dynamics of an expert trained on the original dataset. As an alternative way to profile the training effects produced by different sets, Nguyen et al. [33, 34] introduce a kernel ridge-regression approach based on the Neural Tangent Kernel (NTK) of infinitely wide convolutional networks [20].
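To make the gradient-matching idea concrete, a minimal PyTorch-style sketch is given below. The network `net`, the loss `criterion`, and the layer-wise cosine distance are illustrative assumptions for a single matching step, not the exact formulation of [62, 60, 28].

```python
import torch
import torch.nn.functional as F

def gradient_matching_loss(net, criterion, x_real, y_real, x_syn, y_syn):
    """Distance between gradients induced by a real batch and a synthetic batch.

    A minimal single-step sketch; the layer-wise cosine distance is one common
    choice, not necessarily the exact criterion of the cited works.
    """
    params = [p for p in net.parameters() if p.requires_grad]

    # Gradients from real data serve only as fixed targets.
    g_real = torch.autograd.grad(criterion(net(x_real), y_real), params)
    g_real = [g.detach() for g in g_real]

    # Keep the graph so the loss can be back-propagated to the synthetic data.
    g_syn = torch.autograd.grad(criterion(net(x_syn), y_syn), params,
                                create_graph=True)

    loss = 0.0
    for gr, gs in zip(g_real, g_syn):
        loss = loss + (1.0 - F.cosine_similarity(gr.flatten(), gs.flatten(), dim=0))
    return loss
```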
Apart from matching training effects, there are also methods that match the data distributions of the original and synthetic datasets. For instance, Zhao et al. [61] propose a simple but effective Maximum Mean Discrepancy (MMD) constraint for DD, which does not involve training downstream models and thus enjoys superior training efficiency. Wang et al. [51] propose CAFE, which explicitly aligns the synthetic and real distributions in the feature space of a downstream network.
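As a minimal sketch of this distribution-matching idea, the snippet below assumes a feature extractor `embed` and uses a simple squared difference of mean embeddings (an empirical linear-kernel MMD); it is illustrative rather than the exact objective of [61].

```python
import torch

def distribution_matching_loss(embed, x_real, x_syn):
    """Squared distance between mean embeddings of real and synthetic samples;
    a sketch of distribution matching, not the cited method's exact loss."""
    f_real = embed(x_real).mean(dim=0).detach()  # real statistics are fixed targets
    f_syn = embed(x_syn).mean(dim=0)
    return ((f_real - f_syn) ** 2).sum()
```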
The above methods are dedicated to exploring suitable training objectives and pipelines for DD. However, few works concern improving the data efficiency of distilled samples. Although Zhao et al. [60] propose differentiable Siamese augmentation (DSA) to enrich the training data, the augmentation operations it uses, e.g., crop, flip, scale, and rotation, cannot encode any information about the target datasets. In this paper, we study the task from a factorization perspective and factorize a dataset into two compositions: data hallucination networks and bases. Both parts carry important knowledge of the raw dataset. For downstream training, hallucinators and bases can be combined in arbitrary pairs, i.e., any basis can be sent to any hallucinator to create a training sample. This factorization significantly improves the diversity of the distilled training dataset without introducing additional storage cost. It is also a versatile strategy compatible with all aforementioned DD methods, as will be demonstrated in the experiments.
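To make the factorization concrete, the following sketch shows how an arbitrary (hallucinator, basis) pair could be combined into a training sample. The hallucinator architecture, tensor shapes, and counts are placeholders for illustration, not the exact design used in this paper.

```python
import torch
import torch.nn as nn

class Hallucinator(nn.Module):
    """A toy convolutional hallucinator mapping a basis image to a training image.
    The architecture here is only a placeholder."""
    def __init__(self, channels=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, channels, 3, padding=1),
        )

    def forward(self, basis):
        return self.net(basis)

# |H| hallucinators and |B| bases can form |H| x |B| distinct training samples.
hallucinators = nn.ModuleList([Hallucinator() for _ in range(5)])
bases = torch.randn(10, 3, 32, 32)  # stand-in for learnable bases

h_idx = torch.randint(len(hallucinators), (1,)).item()
b_idx = torch.randint(len(bases), (1,)).item()
sample = hallucinators[h_idx](bases[b_idx:b_idx + 1])  # one synthesized training image
```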
Concurrent Works on Efficient Distilled Dataset Parameterization: As a concurrent work, Kim et al. [24] propose IDC for efficient synthetic data parameterization. They reveal that storing only down-sampled versions of the synthetic images and performing bilinear upsampling during downstream training barely hurts the performance. Thus, given the same storage budget, IDC can store 4× the number of 2× down-sampled synthetic images compared with the baseline. Both IDC and the HaBa scheme in this paper are dedicated to improving the data efficiency of synthetic parameters. Interestingly, according to the definition of our hallucinator-basis factorization, IDC can in fact be treated as a special case of HaBa, where the hallucinator is a parameter-free upsampling function and each basis has a smaller spatial size. Nevertheless, the main focuses of IDC and HaBa are different, and they are in fact two orthogonal techniques that can readily join forces to enhance the baseline performance, as discussed in Sec. 4.2.
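For reference, the storage-and-recovery trick described above can be sketched as follows, with illustrative resolutions and shapes:

```python
import torch
import torch.nn.functional as F

full_res = 32                                   # e.g., CIFAR-scale images
x_syn = torch.randn(10, 3, full_res, full_res)  # stand-in for learned synthetic images

# Store a 2x down-sampled copy: 4x fewer pixels, so 4x more images fit in the budget.
x_stored = F.interpolate(x_syn, scale_factor=0.5,
                         mode='bilinear', align_corners=False)

# Recover full-resolution images on the fly during downstream training.
x_recovered = F.interpolate(x_stored, size=(full_res, full_res),
                            mode='bilinear', align_corners=False)
```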
3 Methods
In this section, we elaborate on our proposed method HaBa for dataset distillation (DD). Assume that there is an original dataset $\mathcal{T}=\{(x_i, y_i)\}_{i=1}^{|\mathcal{T}|}$ with $|\mathcal{T}|$ pairs of a training sample $x_i$ and its corresponding label $y_i$. DD targets a synthetic dataset $\mathcal{S}=\{(\hat{x}_i, \hat{y}_i)\}_{i=1}^{|\mathcal{S}|}$ with $|\mathcal{S}| \ll |\mathcal{T}|$ and expects that a model trained on $\mathcal{S}$ achieves performance similar to that of a model trained on $\mathcal{T}$.
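This expectation is commonly formalized as a bilevel objective; a sketch consistent with the notation above is
\[
\mathcal{S}^{*} = \arg\min_{\mathcal{S}} \; \mathbb{E}_{(x, y) \sim \mathcal{T}} \big[ \ell\big( f_{\theta^{\mathcal{S}}}(x),\, y \big) \big],
\quad \text{s.t.} \quad
\theta^{\mathcal{S}} = \arg\min_{\theta} \; \mathbb{E}_{(\hat{x}, \hat{y}) \sim \mathcal{S}} \big[ \ell\big( f_{\theta}(\hat{x}),\, \hat{y} \big) \big],
\]
where $f_{\theta}$ denotes a downstream model with parameters $\theta$ and $\ell$ is a task loss; the concrete loss and the matching surrogate used to optimize it vary across the DD methods reviewed in Sec. 2.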