OCD: Learning to Overfit with Conditional Diffusion Models
Shahar Lutati 1  Lior Wolf 1
Abstract
We present a dynamic model in which the weights are conditioned on an input sample x and are learned to match those that would be obtained by finetuning a base model on x and its label y. This mapping between an input sample and network weights is approximated by a denoising diffusion model. The diffusion model we employ focuses on modifying a single layer of the base model and is conditioned on the input, activations, and output of this layer. Since the diffusion model is stochastic in nature, multiple initializations generate different networks, forming an ensemble, which leads to further improvements. Our experiments demonstrate the wide applicability of the method for image classification, 3D reconstruction, tabular data, speech separation, and natural language processing. Our code is attached as supplementary material.
1. Introduction
Here is a simple local algorithm: For each testing pattern,
(1) select the few training examples located in the vicinity
of the testing pattern, (2) train a neural network with only
these few examples, and (3) apply the resulting network to
the testing pattern.
Bottou & Vapnik (1992)
Thirty years after the local learning method in the epigraph was introduced, it can be modernized in a few ways. First, instead of training a neural network from scratch on a handful of samples, the method can finetune, with the same samples, a base model that is pre-trained on the entire training set. The empirical success of transfer learning methods (Han et al., 2021) suggests that this would lead to an improvement.
1 Blavatnik School of Computer Science, Tel Aviv University. Correspondence to: Shahar Lutati <shahar761@gmail.com>, Lior Wolf <wolf@cs.tau.ac.il>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
Second, instead of retraining a neural network each time, we can learn to predict the weights of the locally-trained neural network for each set of input samples. This idea utilizes a dynamic, input-dependent architecture, also known as a hypernetwork (Ha et al., 2016).

Third, we can take the approach to an extreme and consider local regions that contain a single sample. During training, we finetune the base model for each training sample separately. In this process, which we call "overfitting", we train on each specific sample s = (x, y) from the training set, starting with the weights of the base model and obtaining a model f_{θ_s}. We then learn a model g that maps between x (without the label) and the shift in the weights of f_{θ_s} from those of the base model. Given a test sample x, we apply the learned mapping g to it, obtain model weights, and apply the resulting model to x.
The overfitted models are expected to be similar to the base model, since the samples we overfit are part of the training set of the base model. As a result, it is likely that a diffusion process would be able to generate the weights of the fine-tuned networks. Recently, diffusion models, such as DDPM (Ho et al., 2020) and DDIM (Song et al., 2020), were shown to be highly successful in generating perceptual samples (Dhariwal & Nichol, 2021b; Kong et al., 2021). Here, we employ such models as hypernetworks, i.e., as a means for conditionally generating network weights.
In order to make the diffusion models suitable for predicting network weights, we make three adjustments. First, we automatically select a specific layer of the neural model and modify only this layer. This considerably reduces the size of the generated data and, in our experience, is sufficient for supporting the overfitting effect. Second, we condition the diffusion process on the input of the selected layer, its activations, and its output. Third, since the diffusion process assumes a unit variance scale (Ho et al., 2020), we learn the scale of the weight modification separately.
Similarly to other diffusion processes, our hypernetwork is initialized with normal noise and different initializations lead to slightly different results. Using this feature of the diffusion model, we generate multiple models from the same instance and use the resulting ensemble technique to further improve the prediction accuracy. Our method is
widely applicable, and we evaluate it across five very different domains: image classification, image synthesis, regression in tabular data, speech separation, and few-shot NLP. In all cases, the results obtained by our method improve upon the baseline model to which our method is applied. Whenever the baseline model is close to the state of the art, the leap in performance sets new state-of-the-art results.
2. Related Work
Local learning approaches perform inference with models that are focused on training samples in the vicinity of each test sample. This way, the predictions are based on what are believed to be the most relevant data points. K-nearest neighbors, for example, is a local learning method. Bottou & Vapnik (1992) have presented a simple algorithm for adjusting the capacity of the learned model locally, and discuss the advantages of such models for learning with uneven data distributions. Alpaydin & Jordan (1996) combine multiple local perceptrons in either a cooperative or a discriminative manner, and Zhang et al. (2006) combine multiple local support vector machines. These and other similar contributions rely on local neighborhoods containing multiple samples. The one-shot similarity kernel of Wolf et al. (2009) contrasts a single test sample with many training samples, but it does not finetune a model based on a single sample, as we do.
More recently, Wang et al. (2021) employ local learning to perform single-sample domain adaptation (including robustness to corruption). The adaptation is performed through an optimization process that minimizes the entropy of the prediction provided for each test sample. Our method does not require any test-time optimization and focuses (on the training samples) on improving the accuracy of the ground truth label rather than label-agnostic confidence.
Alet et al. (2021) propose a method called Tailoring that applies, like our method, meta-learning to local learning. The approach is based on applying unsupervised learning on a dataset that is created by augmenting the test sample, in a way that is related to the adaptive instance normalization of Huang & Belongie (2017). Our method does not employ any such augmentation and is based on supervised finetuning on a single sample.
Tailoring was tested on synthetic datasets with very specific structures, in a very specific unsupervised setting of CIFAR-10. Additionally, it was tested as a defense against adversarial samples, with results that fell short of the state of the art in this field. Since the empirical success obtained by Tailoring so far is limited and since there is no published code, it is not used as a baseline in our experiments.
As far as we can ascertain, all existing local learning contributions are very different from our work. No other contribution overfits samples of the training set, trains a hypernetwork for local learning, or builds a hypernetwork based on diffusion models.
Hypernetworks (Ha et al., 2016) are neural models that generate the weights of a second primary network, which performs the actual prediction task. Since the inferred weights are multiplied by the activations of the primary network, hypernetworks are a form of multiplicative interactions (Jayakumar et al., 2020), and extend layer-specific dynamic networks, which have been used to adapt neural models to the properties of the input sample (Klein et al., 2015; Riegler et al., 2015).
Hypernetworks benefit from the knowledge-sharing ability of the weight-generating network and are therefore suited for meta-learning tasks, including few-shot learning (Bertinetto et al., 2016), continual learning (von Oswald et al., 2020), and model personalization (Shamsian et al., 2021). When there is a need to repeatedly train similar networks, predicting the weights can be more efficient than backpropagation. Hypernetworks have, therefore, been used for neural architecture search (Brock et al., 2018; Zhang et al., 2019) and hyperparameter selection (Lorraine & Duvenaud, 2018).
MEND by Mitchell et al. (2021) explores the problem of model editing for large language models, in which the model's parameters are updated after training to incorporate new data. In our work, the goal is to predict the label of the new sample and not to update the model. Unlike MEND, our method does not employ the label of the new sample.
Diffusion models  Many of the recent generative models for images (Ho et al., 2022; Chen et al., 2020; Dhariwal & Nichol, 2021a) and speech (Kong et al., 2021; Chen et al., 2020) are based on a degenerate form of the Fokker-Planck equation. Sohl-Dickstein et al. (2015) showed that complicated distributions could be learned using a simple diffusion process. The Denoising Diffusion Probabilistic Model (DDPM) of Ho et al. (2020) extends the framework and presents high-quality image synthesis. Song et al. (2020) sped up the inference time by an order of magnitude using implicit sampling with their DDIM method. Watson et al. (2021) propose a dynamic programming algorithm to find an efficient denoising schedule, and San-Roman et al. (2021) apply a learned scaling adjustment to noise scheduling. Luhman & Luhman (2021) combined knowledge distillation with DDPMs.
The iterative nature of the denoising generation scheme creates an opportunity to steer the process, by considering the gradients of additional loss terms. The Iterative Latent Variable Refinement (ILVR) method (Choi et al., 2021)
does so for images by directing the generated image toward a low-resolution template. A similar technique was subsequently employed for voice modification (Levkovitch et al., 2022). Direct conditioning is also possible: Saharia et al. (2022) generate photo-realistic text-to-image scenes by conditioning a diffusion model on text embeddings; Amit et al. (2021) repeatedly condition on the input image to obtain image segmentation. In voice generation, the mel-spectrogram can be used as an additional input to the denoising network (Chen et al., 2020; Kong et al., 2021; Liu et al., 2021a), as can the input text for a text-to-speech diffusion model (Popov et al., 2021). The conditioning we employ is of the direct type.
3. Method
Our method is based on a modified diffusion process. Denote the training dataset as S = {(x_i, y_i)}_{i=1}^{n}, where x_i are the data points in the dataset S, and y_i are the associated labels. First, a base model, f_θ(x) = f(x, θ), is trained over the entire dataset, S, where θ are the learned weights of the model when trained over the entire dataset.
Next, for every training sample s ∈ S we run fine-tuning based on that single sample to obtain the overfitted parameters (function) as θ_s (f_{θ_s}).

\[
\theta_s = \theta + \arg\min_{\Delta} \, \mathcal{L}\big(f(x_s, \theta + \Delta),\, y_s\big)\,, \tag{1}
\]

where L is the loss function that is minimized in the training of the base model, x_s, y_s are the data point and label of sample s, and Δ is the weight difference obtained when finetuning the model. Finetuning is performed with three gradient descent iterations, and, as shown in our runtime analysis, is typically much less computationally demanding than the training of the base network.
The meta-learning problem we consider is the one of learning a model g, which maps x (the input domain of sample s), and potentially multiple latent representations of x in the context of f_θ, collectively denoted as I(x), to a vector of weight differences, such that

\[
g(x, I(x)) = \theta_s - \theta \tag{2}
\]

where θ are the base model's parameters trained over S, and g(x, I(x)) is a mapping function that maps the input, i.e., the x part of s, and multiple latent representations of it, I(x), to the desired shift in the model parameters.
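At test time, g replaces the finetuning procedure entirely. The sketch below shows how such a mapping could be applied to a new sample, assuming the hypernetwork outputs the shift of the selected layer only; `collect_conditioning` is a hypothetical helper that gathers I(x) (a sketch of it follows Eq. (4)), and the in-place add/restore pattern is just one possible way to apply the predicted shift.

```python
import torch

@torch.no_grad()
def predict_with_shift(base_model, g, x, layer_name):
    """Apply the learned mapping g to a test sample x (Eq. 2) and run the
    adapted model; the base weights are restored afterwards."""
    i_l, a_l, f_out = collect_conditioning(base_model, x, layer_name)  # I(x)
    delta = g(x, (i_l, a_l, f_out))                   # predicted theta_s - theta (flat)
    layer = dict(base_model.named_modules())[layer_name]
    layer.weight.add_(delta.view_as(layer.weight))    # temporarily "overfit" the layer
    y_hat = base_model(x)
    layer.weight.sub_(delta.view_as(layer.weight))    # restore the base model
    return y_hat
```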
Layer selection  Current deep neural networks can have millions or even billions of parameters. Thus, learning to modify all network parameters can be a prohibitive task. Therefore, we opt to modify, via function g, a single layer of f_θ.

To select this layer, we follow Lutati & Wolf (2021) and choose the layer that presents the maximal entropy of the loss, when fixing the samples (x, y) ∈ S and perturbing the layer's parameters.
The layer that yields the highest entropy score when perturbed is a natural candidate for a fine-tuning algorithm, since a large entropy reflects that, near the layer's base parameters, the surface imposed by the loss function has a large variance. Thus, a small perturbation could result in a dramatic change in loss when the perturbation is in the right direction.
Denote the perturbed weights, in which only layer L is perturbed, as θ_L. The score used for selection is

\[
\mathrm{Score} = \frac{1}{|S|} \sum_{(x,y)\in S} \mathrm{Entropy}_{\theta_L}\big(\mathcal{L}(f(x, \theta_L), y)\big)\,, \tag{3}
\]

where L is the loss objective on which the function f_θ is trained, and the entropy is computed over multiple draws of θ_L. Since sampling does not involve a backpropagation computation, the process is not computationally demanding, and 10,000 samples are used for each training sample s = (x, y).
The entropy, for each sample s, is computed by fitting a Gaussian Kernel Density Estimation (GKDE) (Silverman, 1986) to the obtained empirical distribution of the loss function. The layer that obtains the highest mean entropy is selected.
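The selection criterion lends itself to a direct implementation. The sketch below is one way to estimate Eq. (3) in PyTorch with SciPy's Gaussian KDE; the perturbation scale `sigma`, the reduced number of draws, and the Monte-Carlo entropy estimate are assumptions made for brevity, not the paper's exact procedure.

```python
import numpy as np
import torch
from scipy.stats import gaussian_kde

@torch.no_grad()
def layer_entropy_score(base_model, criterion, loader, layer_name,
                        n_draws=100, sigma=0.01):
    """Mean entropy of the loss when only `layer_name` is perturbed (Eq. 3)."""
    layer = dict(base_model.named_modules())[layer_name]
    scores = []
    for x, y in loader:                               # each (x, y) in S
        losses = []
        for _ in range(n_draws):
            noise = sigma * torch.randn_like(layer.weight)
            layer.weight.add_(noise)                  # perturb only this layer
            losses.append(criterion(base_model(x), y).item())
            layer.weight.sub_(noise)                  # restore the base weights
        kde = gaussian_kde(np.array(losses))          # GKDE fit to the loss values
        entropy = -np.mean(np.log(kde(np.array(losses)) + 1e-12))
        scores.append(entropy)
    return float(np.mean(scores))
```

The score is computed once per candidate layer, and the layer with the highest value is the one handed to the diffusion model.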
The conditioning signal  The latent representation, I(x), has three components. Given a selected layer, L, we denote the input to this layer, when passing a sample x to f(x, θ), as i_L(x), and the activation of this layer as a_L(x). We also use the output of the base function, f_θ(x). I(x) is given by the tuple

\[
I(x) = \left[\, i_L(x),\ a_L(x),\ f_\theta(x) \,\right] \tag{4}
\]
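In a PyTorch model, the three components of I(x) can be gathered in a single forward pass with a forward hook. The helper below is a sketch under the assumption that the selected layer is addressable by name; whether a_L(x) refers to the layer's output before or after its nonlinearity depends on the architecture, and here the hook simply captures the layer's input and output.

```python
import torch

def collect_conditioning(base_model, x, layer_name):
    """Return I(x) = [i_L(x), a_L(x), f_theta(x)] for the selected layer (Eq. 4)."""
    captured = {}

    def hook(module, inputs, output):
        captured["i_L"] = inputs[0].detach()          # input to the selected layer
        captured["a_L"] = output.detach()             # the layer's activation

    layer = dict(base_model.named_modules())[layer_name]
    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        f_out = base_model(x)                         # f_theta(x), the base prediction
    handle.remove()
    return captured["i_L"], captured["a_L"], f_out
```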
3.1. Diffusion Process
The goal of the diffusion process is to reconstruct Δ, the difference between the fine-tuned weights, θ_s, and the base model weights, θ. The process starts with a random Δ_T, with the same dimensions as θ_s.

\[
\Delta_T \sim \mathcal{N}(0, 1) \tag{5}
\]

Next, it iterates with Δ_t, where t is decreasing and is the diffusion step, and returns Δ_0. After the appropriate diffusion steps, Δ_0 should be as close as possible to Δ for a given point s.
The diffusion error estimation network, ε, is a function of the current estimation, Δ_t, the latent representation tuple, I(x), and the diffusion timestep, t. The last is encoded through a positional encoding network (Vaswani et al., 2017), PE. All inputs, except for Δ_t, are combined into one vector: e = PE(t) + E_i(i_L) + E_a(a_L) + E_o(f_θ(x)),