Dataset Distillation via Factorization
Songhua Liu Kai Wang Xingyi Yang Jingwen Ye Xinchao Wang
National University of Singapore
{songhua.liu,e0823044,xyang}@u.nus.edu, {jingweny,xinchao}@nus.edu.sg
Abstract
In this paper, we study dataset distillation (DD) from a novel perspective and
introduce a dataset factorization approach, termed HaBa, which is a plug-and-play
strategy portable to any existing DD baseline. Unlike conventional DD approaches
that aim to produce distilled and representative samples, HaBa explores decomposing
a dataset into two components: data Hallucination networks and Bases, where the
latter are fed into the former to reconstruct image samples. The flexible combinations
between bases and hallucination networks equip the distilled data with an exponential
gain in informativeness, which largely increases the representation capability of
distilled datasets. To further improve the data efficiency of the compression results,
we introduce a pair of adversarial contrastive constraints on the resultant hallucination
networks and bases, which increase the diversity of generated images and inject more
discriminative information into the factorization. Extensive comparisons and experiments
demonstrate that our method yields significant improvement on downstream classification
tasks compared with previous state-of-the-art methods, while reducing the total number
of compressed parameters by up to 65%. Moreover, distilled datasets produced by our
approach also achieve ~10% higher accuracy than baseline methods in cross-architecture
generalization. Our code is available here.
1 Introduction
The success of deep models on a variety of vision tasks, such as image classification [26, 11, 38],
object detection [37, 36], and semantic segmentation [43, 56, 29], is largely attributed to the huge
amount of data used for training and to various pre-trained models [57]. However, the sheer amount
of data introduces significant obstacles for storage, transmission, and data pre-processing. Besides,
publishing raw data inevitably brings about privacy or copyright issues in practice [44, 10]. To
alleviate these problems, Wang et al. [52] pioneered the research of dataset distillation (DD), which
distills a large dataset into a synthetic one with only a limited number of samples, so that the effort of
training downstream models can be largely reduced compared with using the original dataset. This
facilitates a series of applications like continual learning [41, 40, 54, 31] and black-box
optimization [7]. Due to the significant practical value of DD, many endeavours have been made in this
area [62, 60, 61, 51, 24, 6, 63] to design novel supervision signals for training the synthetic datasets and
to further improve their performance.
Nevertheless, there is a potential drawback in conventional settings of DD: each synthetic sample is
largely treated independently, ignoring the coherence and relationships between different instances.
As such, the information embraced by each sample, despite being distilled, is by nature limited.
Using the synthetic samples to train downstream models therefore inevitably leads to a loss of
dataset information. Moreover, the few distilled samples are incompatible with the enormous number
of parameters in a deep model and may incur the risk of overfitting.
To verify these potential issues, we conduct a pre-experiment on the CIFAR10 dataset with 10 synthetic
images per class, using MTT [6], the current SOTA solution for DD, as the baseline.
Figure 1: Intuition of our hallucinator-basis factorization for dataset distillation.
In addition to the baseline setting, we also incorporate the checkpoint synthetic datasets saved every
100 DD iterations in the convergent stage to train the downstream model. Since the synthetic images
are still being fine-tuned during this stage, the multiple checkpoints can be viewed as related but
different, which may somewhat increase the diversity. As a result, this yields an overall lower test
loss and hence better final results in downstream training, as shown by the blue and green curves in
Fig. 2, which indicates that current DD solutions can potentially be improved by leveraging
sample-wise relationships to diversify the distilled data. Nevertheless, simply involving more data
samples also increases the memory overhead. This fact motivates us to ask: is it possible to encode
shared relationships in a dataset implicitly, instead of storing samples directly, so as to avoid such
additional storage costs?
Figure 2: Visualization of test loss using synthetic datasets generated by MTT, MTT with multiple checkpoints, and ours.
We show in this paper that it can indeed be made possible by reformulating the DD task as a
factorization problem. As shown in Fig. 1, we propose a novel perspective dubbed HaBa, which
factorizes a dataset into two compositions: data Hallucination networks and Bases. A data
hallucination network, or hallucinator, can take any basis as input and output the corresponding
hallucinated image. Supervised by the training objective of DD, a set of hallucinators can synthesize
multiple samples from a common basis and is optimized to explicitly extract effective relationships
among different samples in the original dataset. In this way, the information of
$|\mathcal{H}| \times |\mathcal{B}|$ images can be covered by a factorization result with
$|\mathcal{H}|$ hallucinators and $|\mathcal{B}|$ bases via arbitrary pair-wise combination, which
improves the data efficiency of traditional DD exponentially. As shown by the yellow curve in
Fig. 2, with the same storage budget, our strategy achieves better test performance than the MTT
baseline.
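As an illustrative count (the numbers are chosen for exposition and not taken from the experiments): with $|\mathcal{H}| = 5$ hallucinators and $|\mathcal{B}| = 100$ bases, pair-wise composition can produce up to $5 \times 100 = 500$ distinct training images, while only the 5 small networks and 100 bases need to be stored.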
To further increase the informativeness of the factorized results, we introduce a pair of adversarial
contrastive constraints to promote sample-wise diversity. The goal of HaBa is to minimize the
correlation among images composed by different hallucinators from a common basis, while an adversary
tries to maximize it. Such an adversarial scheme, in turn, enforces the hallucinators to produce
diversified images and increases the amount of useful information.
Notably, HaBa is a versatile strategy that can be built upon existing DD baselines, since it is compatible
with any training objective that measures the similarity between downstream performances, as shown
in Fig. 1. We conduct extensive experiments to demonstrate the advantages of the proposed method
over baseline ones. In all benchmarks and comparisons, HaBa produces significant and consistent
improvement in training downstream models, while reducing the total number of compressed
parameters by up to 65%. Furthermore, it demonstrates strong cross-architecture generalization
ability, with accuracy improvement higher than 10%. Our contributions are summarized as follows:
• We study dataset factorization, a novel perspective to explore dataset distillation, and propose a novel approach termed HaBa for hallucinator-basis factorization.
• We present a pair of adversarial contrastive objectives to further increase the data diversity and information capacity.
• HaBa is a plug-and-play scheme compatible with all existing training objectives of DD and can yield significant and consistent improvement over the state of the art.
2 Related Works
The goal of dataset distillation (DD) is to optimize a smaller synthetic dataset such that it can take the
place of the original one for training downstream tasks. This differs from coreset selection
[1, 8, 15, 42, 48], another branch of dataset compression, which directly selects samples from the raw
dataset. In this section, we provide a detailed review of previous methods in DD.
Motivated by knowledge distillation [18, 14, 59, 58], which aims at model compression, Wang et al. [52]
introduce the concept of dataset distillation for dataset compression. The idea is to optimize the
synthetic images so that they minimize the loss functions of downstream tasks, which involves a bilevel
optimization algorithm [13]. Following this routine, several works further consider learnable labels
beyond samples [4, 46]. Subsequently, Zhao et al. [62] and several follow-up approaches [60, 28]
consider matching the gradients of a downstream model produced by synthetic samples and real images,
which improves the performance significantly. Most recently, Cazenavette et al. [6] argue that
single-iteration gradient matching may lead to inferior performance due to error accumulation across
multiple steps and thereby propose to match the long-range training dynamics of an expert trained on
the original dataset. As an alternative way to profile the training effects produced by different sets,
Nguyen et al. [33, 34] introduce a kernel ridge-regression approach based on the Neural Tangent
Kernel (NTK) of infinitely wide convolutional networks [20].
Apart from matching training effects, there are also methods that match data distributions between the
original and synthetic datasets. For instance, Zhao et al. [61] propose a simple but effective Maximum
Mean Discrepancy (MMD) constraint for DD, which does not involve training downstream models and
thus enjoys superior training efficiency. Wang et al. [51] propose CAFE, which explicitly attempts to
align the synthetic and real distributions in the feature space of a downstream network.
The above-mentioned methods are dedicated to exploring suitable training objectives and pipelines for
DD. However, few works concern improving the data efficiency of the distilled samples. Although
Zhao et al. [60] propose differentiable siamese augmentation (DSA) to enrich the training data, the
augmentation operations used, e.g., crop, flip, scale, and rotation, cannot encode any information about
the target datasets. In this paper, we study the task from a factorization perspective, factorizing a
dataset into two different compositions: data hallucination networks and bases. Both parts carry
important knowledge of the raw dataset. For downstream training, hallucinators and bases can be
combined in arbitrary pairs, i.e., any basis can be sent to any hallucinator, to create a training sample.
The idea of factorization improves the diversity of distilled training datasets significantly, without
introducing additional storage costs. It is also a versatile strategy compatible with all aforementioned
DD methods, which will be demonstrated in the experiment part.
Concurrent Works on Efficient Distilled Dataset Parameterization: As a concurrent work, Kim et al. [24]
propose IDC for efficient synthetic data parameterization. It reveals that storing only down-sampled
versions of synthetic images and conducting bilinear upsampling during downstream training does not
hurt the performance much. Thus, given the same storage budget, IDC can store 4x the number of
2x-down-sampled synthetic images compared with the baseline. Both IDC and HaBa are dedicated to
improving the data efficiency of synthetic parameters. Interestingly, according to the definition of our
hallucinator-basis factorization, IDC can in fact be treated as a special case of HaBa, where the
hallucinator is a parameter-free upsampling function and each basis has a smaller spatial size; a sketch
of this view is given below. Nevertheless, the main focuses of IDC and HaBa are different, and they are
in fact two orthogonal techniques that can readily join forces to enhance the baseline performance, as
discussed in Sec. 4.2.
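To make the "special case" argument concrete, the following minimal sketch (an illustration of the analogy only, not IDC's actual implementation; the class name `UpsampleHallucinator` is ours) expresses bilinear upsampling as a parameter-free hallucinator:

```python
import torch.nn as nn

class UpsampleHallucinator(nn.Module):
    """IDC viewed through the HaBa lens: a parameter-free 'hallucinator' that
    bilinearly upsamples a spatially smaller basis to full resolution."""

    def __init__(self, scale_factor=2):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale_factor, mode="bilinear",
                              align_corners=False)

    def forward(self, x_hat):
        # x_hat: (batch, c, h/2, w/2) down-sampled synthetic image (the "basis")
        return self.up(x_hat)  # full-resolution training image
```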
3 Methods
In this section, we elaborate our proposed method HaBa for dataset distillation (DD). Assume that there
is an original dataset $\mathcal{T}=\{(x_i, y_i)\}_{i=1}^{|\mathcal{T}|}$ with $|\mathcal{T}|$ pairs of a
training sample $x_i$ and the corresponding label $y_i$. DD targets a synthetic dataset
$\mathcal{S}=\{(\hat{x}_i, \hat{y}_i)\}_{i=1}^{|\mathcal{S}|}$ with $|\mathcal{S}| \ll |\mathcal{T}|$ and
expects that a model trained on $\mathcal{S}$ can have performance similar to that of a model trained
on $\mathcal{T}$.
Figure 3: Left: Overall pipeline of the proposed hallucinator-basis factorization. $\mathcal{B}$, $\mathcal{H}$, and $\mathcal{T}$ denote the sets of bases, hallucinators, and original data, respectively. Adv. denotes an adversary model. We adopt a batch size of 2 here for clarity; Right: Architecture of a hallucinator in detail.
Traditional DD methods treat each synthetic sample independently and ignore the inner relationships
between different samples within a dataset, which results in poor data/information efficiency. To
address this drawback, we study DD from a novel perspective and redefine it as a hallucinator-basis
factorization problem:
$$\mathcal{S}=\{H_{\theta_j}\}_{j=1}^{|\mathcal{H}|} \cup \{(\hat{x}_i,\hat{y}_i)\}_{i=1}^{|\mathcal{B}|}, \quad (1)$$
where there are $|\mathcal{H}|$ hallucination networks and $|\mathcal{B}|$ bases. The $j$-th hallucinator
is parameterized by $\theta_j$ and denoted by $H_{\theta_j}$ for $1 \le j \le |\mathcal{H}|$. For
downstream training, a training data pair $(\tilde{x}_{ij},\tilde{y}_{ij})$ is created online by sending
the $i$-th basis, for any $1 \le i \le |\mathcal{B}|$, to the $j$-th hallucinator, for any
$1 \le j \le |\mathcal{H}|$, i.e., $\tilde{x}_{ij}=H_{\theta_j}(\hat{x}_i)$. In this paper, the label
$\tilde{y}_{ij}$ is simply taken as $\hat{y}_i$.
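To make the online composition concrete, here is a minimal sketch assuming a PyTorch-style setup; the helper name `compose_batch` and the tensor layout are illustrative choices of ours, not the paper's implementation:

```python
import random
import torch

def compose_batch(bases, labels, hallucinators, batch_size):
    """Compose training pairs by pairing random bases with random hallucinators.

    bases:         tensor of shape (|B|, c0, h0, w0), the learnable basis images
    labels:        tensor of shape (|B|,), labels inherited by composed samples
    hallucinators: list of |H| nn.Module hallucination networks
    """
    idx = torch.randint(0, bases.size(0), (batch_size,))       # sample basis indices i
    images, targets = [], []
    for i in idx.tolist():
        j = random.randrange(len(hallucinators))                # sample hallucinator index j
        x_ij = hallucinators[j](bases[i].unsqueeze(0))          # x~_ij = H_j(x^_i)
        images.append(x_ij)
        targets.append(labels[i])                               # y~_ij = y^_i
    return torch.cat(images, dim=0), torch.stack(targets)
```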
An overview of our method is shown in Fig. 3 (Left). To go deeper into the technical details, we start
with an introduction of our basis and data hallucination network in Sec. 3.1. Then, we propose an
adversarial contrastive constraint to increase data diversity in Sec. 3.2. Finally, we present the whole
training pipeline of the hallucinator-basis factorization for DD in Sec. 3.3.
3.1 Basis and Hallucinator
Basis: Typically, for an image classification dataset $\mathcal{T}=\{(x_i, y_i)\}_{i=1}^{|\mathcal{T}|}$,
$x_i \in \mathbb{R}^{h \times w \times c}$ and $y_i \in \{0, 1, ..., C-1\}$ for each
$1 \le i \le |\mathcal{T}|$, where each $x_i$ is a $c$-channel image with a resolution of $h \times w$,
and $C$ is the total number of classes. In previous DD methods, the format/shape of synthetic data pairs
$(\hat{x}, \hat{y})$ has to be kept the same as that of the real data, so as to ensure consistency between
the input and output formats at training and test time for downstream models. By contrast, since
hallucinator networks are capable of spatial-wise and channel-wise transformation, the shape of each
$\hat{x}_i$, $1 \le i \le |\mathcal{B}|$, denoted as $h' \times w' \times c'$, is not necessarily the same
as that of the original samples and is thus more flexible. For a classification problem, we do not modify
the label space in this paper for simplicity and maintain the categorical format.
Hallucinator: Given a basis $\hat{x} \in \mathbb{R}^{h' \times w' \times c'}$, a data hallucination
network aims to create a new image $\tilde{x} \in \mathbb{R}^{h \times w \times c}$ based on $\hat{x}$,
which can be viewed as a conditional image generation problem. Inspired by image style transfer
[22, 19, 21, 30], a typical conditional image generation problem, we devise an
encoder-transformation-decoder architecture for hallucinators, as shown in Fig. 3 (Right). Specifically,
the encoder, denoted as $enc$, is composed of CNN blocks that non-linearly map an input $\hat{x}$ to a
feature space $\mathbb{R}^{h'' \times w'' \times c''}$. Then, an affine transformation with scale
$\sigma$ and shift $\mu$ is conducted on the derived feature, where $\sigma$ and $\mu$ are treated as
network parameters in this paper. At last, the decoder $dec$, with a CNN architecture symmetric to
$enc$, projects the transformed feature back to the image space. Formally, this process can be written as:
$$\hat{f}=enc(\hat{x}), \quad \tilde{f}=\sigma \times \hat{f}+\mu, \quad \tilde{x}=dec(\tilde{f}), \quad (2)$$
where the multiplication is an element-wise operation. There are $|\mathcal{H}|$ hallucinators in the
whole factorization pipeline, and each is trained to implicitly encode some sample-wise relations in its
network parameters.
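For illustration, a minimal encoder-affine-decoder hallucinator in the spirit of Eq. 2 could look as follows (a sketch assuming PyTorch; the block depth and channel widths are placeholders, not the configuration used in the paper):

```python
import torch
import torch.nn as nn

class Hallucinator(nn.Module):
    """Encoder -> learnable affine transform -> decoder, following Eq. (2)."""

    def __init__(self, in_channels=3, feat_channels=16):
        super().__init__()
        # Encoder: CNN blocks mapping the basis to a feature space
        self.enc = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Per-channel affine parameters sigma (scale) and mu (shift),
        # treated as learnable network parameters
        self.sigma = nn.Parameter(torch.ones(1, feat_channels, 1, 1))
        self.mu = nn.Parameter(torch.zeros(1, feat_channels, 1, 1))
        # Decoder: symmetric CNN blocks projecting back to image space
        self.dec = nn.Sequential(
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, in_channels, 3, padding=1),
        )

    def forward(self, x_hat):
        f_hat = self.enc(x_hat)                  # f^ = enc(x^)
        f_tilde = self.sigma * f_hat + self.mu   # f~ = sigma * f^ + mu (element-wise)
        return self.dec(f_tilde)                 # x~ = dec(f~)
```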
3.2 Adversarial Contrastive Constraint
Ideally, the knowledge encoded by different hallucinators should be as different/orthogonal as possible
so that each individual brings the most benefit. To instantiate such regularization, let us consider two
composed images $\tilde{x}_{ij}$ and $\tilde{x}_{ik}$ from two different hallucinators $H_{\theta_j}$
and $H_{\theta_k}$ but a common basis $\hat{x}_i$. The divergence between $\tilde{x}_{ij}$ and
$\tilde{x}_{ik}$ is expected to be large. To measure the divergence, a feature extractor is required to map
an input image to a feature space, and how to train such a feature extractor to find an appropriate feature
space is of great importance.
In this paper, we formalize the training of hallucinators and the feature extractor as a min-max game in a
self-consistent manner: the feature extractor aims to minimize the divergence between $\tilde{x}_{ij}$
and $\tilde{x}_{ik}$, while the hallucinators, as well as the bases, are optimized to maximize it, so that
the two players reinforce each other. Specifically, the feature extractor, denoted as $F$ and
parameterized by $\psi$, is typically a CNN for the downstream task, and we adopt the features at the
last hidden layer before the output layer, denoted as $F_{-1}(\tilde{x}_{ij})$ and
$F_{-1}(\tilde{x}_{ik})$. $F$ is optimized to maximize the correlation between the two feature vectors,
which can be quantified by the metric of mutual information (MI). Inspired by the lower bound of
MI [49], the objective for $F$ to minimize the divergence is given in the following contrastive form:
$$\mathcal{L}_{con.} = -\frac{1}{|\mathcal{H}|^2}\frac{1}{|\mathcal{B}|}\sum_{\substack{1 \le j,k \le |\mathcal{H}|, \\ j \neq k}}\sum_{i=1}^{|\mathcal{B}|} \log \frac{\exp\{F_{-1}^{\top}(\tilde{x}_{ij})F_{-1}(\tilde{x}_{ik})/\tau\}}{\sum_{u=1}^{|\mathcal{B}|}\exp\{F_{-1}^{\top}(\tilde{x}_{ij})F_{-1}(\tilde{x}_{uk})/\tau\}}, \quad (3)$$
where $\tau$ is a scalar temperature coefficient. For the classification problem, we can alternatively
adopt the supervised form of the contrastive loss $\mathcal{L}_{con.}$, where samples $\tilde{x}_{uk}$
with the same class label as $\tilde{x}_{ij}$ are also taken as positives in Eq. 3. The supervised
contrastive loss helps increase the correlation of samples from the same class [23] for a more
reasonable feature representation.
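For a single hallucinator pair $(j, k)$, the contrastive term reduces to an InfoNCE-style cross-entropy over bases. A minimal sketch, assuming PyTorch and feature matrices of shape $(|\mathcal{B}|, d)$; the function name and default temperature are illustrative:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(feat_j, feat_k, tau=0.1):
    """One (j, k) term of Eq. 3: features of images composed from the same bases
    by two different hallucinators j and k.

    feat_j, feat_k: (|B|, d) tensors holding F_{-1}(x~_ij) and F_{-1}(x~_ik).
    """
    logits = feat_j @ feat_k.t() / tau                 # (|B|, |B|) pairwise similarities
    targets = torch.arange(feat_j.size(0), device=feat_j.device)
    # Row i's positive is column i (same basis); other bases act as negatives,
    # so the cross-entropy equals the negative log-softmax of the positive pair.
    return F.cross_entropy(logits, targets)
```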
In addition, the feature space is expected to reflect task-specific properties for a meaningful
representation. Thus, we also incorporate the task loss $\mathcal{L}_{task}$, e.g., the cross-entropy loss
in classification tasks, over the synthetic dataset as a supervision signal for $F$. In this way, the overall
training objective for $F$ is defined as:
$$\min_{\psi} \mathcal{L}_{F}=\lambda_{con.}\mathcal{L}_{con.}+\lambda_{task}\mathcal{L}_{task}, \quad (4)$$
where $\lambda_{con.}$ and $\lambda_{task}$ are hyper-parameters controlling the weight of each term.

$F$ acts as an adversary that minimizes the divergence between $\tilde{x}_{ij}$ and $\tilde{x}_{ik}$,
while the synthetic dataset is expected to maximize it to increase data diversity. To this end, the
similarity between $F_{-1}(\tilde{x}_{ij})$ and $F_{-1}(\tilde{x}_{ik})$ becomes a loss term for the
hallucinator-basis factorization. In this paper, we adopt the cosine similarity, and the objective
$\mathcal{L}_{cos.}$ is given by:
$$\mathcal{L}_{cos.} = \frac{1}{|\mathcal{H}|^2}\frac{1}{|\mathcal{B}|}\sum_{\substack{1 \le j,k \le |\mathcal{H}|, \\ j \neq k}}\sum_{i=1}^{|\mathcal{B}|} \frac{F_{-1}^{\top}(\tilde{x}_{ij})F_{-1}(\tilde{x}_{ik})}{\|F_{-1}(\tilde{x}_{ij})\|_2\,\|F_{-1}(\tilde{x}_{ik})\|_2}. \quad (5)$$
During training, the feature extractor and the factorized components are updated alternately to play
this min-max game.
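For reference, a minimal sketch of this similarity penalty for a single hallucinator pair $(j, k)$, assuming PyTorch and feature tensors of shape $(|\mathcal{B}|, d)$; the function name is illustrative:

```python
import torch.nn.functional as F

def cosine_diversity_loss(feat_j, feat_k):
    """Mean cosine similarity between features of images that share a basis but
    come from two different hallucinators (one (j, k) term of Eq. 5).
    Minimizing it pushes the composed images apart in feature space."""
    return F.cosine_similarity(feat_j, feat_k, dim=1).mean()
```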
3.3 Factorization Training Pipeline
Following previous paradigms [62, 61, 6, 51], the synthetic dataset $\mathcal{S}$ is updated by an
iterative algorithm. In each iteration, we randomly sample a batch of hallucinators and bases and
conduct pair-wise combinations. The composed images are evaluated by the dataset distillation
objective $\mathcal{L}_{DD}$ and the similarity metric in Eq. 5:
$$\min_{\mathcal{S}} \mathcal{L}_{\mathcal{S}}=\lambda_{DD}\mathcal{L}_{DD}+\lambda_{cos.}\mathcal{L}_{cos.}, \quad (6)$$
where the hyper-parameters $\lambda_{DD}$ and $\lambda_{cos.}$ balance the two loss terms.
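As a rough illustration of one update of $\mathcal{S}$ under Eq. 6 (a minimal sketch only: `dd_loss` stands for whichever $\mathcal{L}_{DD}$ is plugged in, `cosine_diversity_loss` is the helper sketched at the end of Sec. 3.2, and the `features` method of the adversary as well as the hyper-parameter values are our own assumptions):

```python
import torch

def update_synthetic_dataset(bases, labels, hallucinators, adversary, dd_loss,
                             optimizer_S, lambda_dd=1.0, lambda_cos=0.1):
    """One gradient step on S = {hallucinators} U {bases, labels} (Eq. 6).
    The adversary F is updated in a separate, alternating step using Eq. 4."""
    optimizer_S.zero_grad()
    # Compose the same bases with two different hallucinators j != k.
    j, k = torch.randperm(len(hallucinators))[:2].tolist()
    x_j, x_k = hallucinators[j](bases), hallucinators[k](bases)
    images = torch.cat([x_j, x_k], dim=0)
    targets = torch.cat([labels, labels], dim=0)

    feats_j = adversary.features(x_j)   # F_{-1}(x~_ij), assumed feature hook
    feats_k = adversary.features(x_k)   # F_{-1}(x~_ik)
    loss = lambda_dd * dd_loss(images, targets) \
         + lambda_cos * cosine_diversity_loss(feats_j, feats_k)
    loss.backward()
    optimizer_S.step()                  # updates bases and hallucinator parameters
```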
Notably, the hallucinator-basis factorization is compatible with a variety of configurations of
$\mathcal{L}_{DD}$ from previous arts, which makes it a versatile and effective strategy for DD. In this
paper, we adopt the trajectory matching loss of Cazenavette et al. [6] as $\mathcal{L}_{DD}$ by default
thanks to its superior performance. The basic idea is to update a downstream model from a cached
checkpoint $\phi^{*}_{t}$ at iteration $t$, using the synthetic dataset $\mathcal{S}$ for $N$ times and
the real dataset $\mathcal{T}$ for $M$ times