OCD: Learning to Overfit with Conditional Diffusion Models
Shahar Lutati 1  Lior Wolf 1
Abstract
We present a dynamic model in which the weights are conditioned on an input sample x and are learned to match those that would be obtained by finetuning a base model on x and its label y. This mapping between an input sample and network weights is approximated by a denoising diffusion model. The diffusion model we employ focuses on modifying a single layer of the base model and is conditioned on the input, activations, and output of this layer. Since the diffusion model is stochastic in nature, multiple initializations generate different networks, forming an ensemble, which leads to further improvements. Our experiments demonstrate the wide applicability of the method for image classification, 3D reconstruction, tabular data, speech separation, and natural language processing. Our code is attached as supplementary material.
1. Introduction
Here is a simple local algorithm: For each testing pattern,
(1) select the few training examples located in the vicinity
of the testing pattern, (2) train a neural network with only
these few examples, and (3) apply the resulting network to
the testing pattern.
Bottou & Vapnik (1992)
Thirty years after the local learning method in the epigraph was introduced, it can be modernized in a few ways. First, instead of training a neural network from scratch on a handful of samples, the method can finetune, with the same samples, a base model that is pre-trained on the entire training set. The empirical success of transfer learning methods (Han et al., 2021) suggests that this would lead to an improvement.
1 Blavatnik School of Computer Science, Tel Aviv University. Correspondence to: Shahar Lutati <shahar761@gmail.com>, Lior Wolf <wolf@cs.tau.ac.il>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
Second, instead of retraining a neural network each time, we can learn to predict the weights of the locally-trained neural network for each set of input samples. This idea utilizes a dynamic, input-dependent architecture, also known as a hypernetwork (Ha et al., 2016).

Third, we can take the approach to an extreme and consider local regions that contain a single sample. During training, we finetune the base model for each training sample separately. In this process, which we call "overfitting", we train on each specific sample s = (x, y) from the training set, starting with the weights of the base model and obtaining a model f_{θ_s}. We then learn a model g that maps between x (without the label) and the shift in the weights of f_{θ_s} from those of the base model. Given a test sample x, we apply the learned mapping g to it, obtain model weights, and apply the resulting model to x.
The overfitted models are expected to be similar to the base model, since the samples we overfit are part of the training set of the base model. As a result, it is likely that a diffusion process would be able to generate the weights of the fine-tuned networks. Recently, diffusion models, such as DDPM (Ho et al., 2020) and DDIM (Song et al., 2020), were shown to be highly successful in generating perceptual samples (Dhariwal & Nichol, 2021b; Kong et al., 2021). Here, we employ such models as hypernetworks, i.e., as a means for conditionally generating network weights.
In order to make the diffusion models suitable for predicting network weights, we make three adjustments. First, we automatically select a specific layer of the neural model and modify only this layer. This considerably reduces the size of the generated data and, in our experience, is sufficient for supporting the overfitting effect. Second, we condition the diffusion process on the input of the selected layer, its activations, and its output. Third, since the diffusion process assumes a unit variance scale (Ho et al., 2020), we learn the scale of the weight modification separately.
Similarly to other diffusion processes, our hypernetwork is initialized with normal noise and different initializations lead to slightly different results. Using this feature of the diffusion model, we generate multiple models from the same instance and use the resulting ensemble technique to further improve the prediction accuracy. Our method is
widely applicable, and we evaluate it across five very different domains: image classification, image synthesis, regression in tabular data, speech separation, and few-shot NLP. In all cases, the results obtained by our method improve upon the baseline model to which our method is applied. Whenever the baseline model is close to the state of the art, the leap in performance sets new state-of-the-art results.
2. Related Work
Local learning approaches perform inference with models that are focused on training samples in the vicinity of each test sample. This way, the predictions are based on what are believed to be the most relevant data points. K-nearest neighbors, for example, is a local learning method. Bottou & Vapnik (1992) have presented a simple algorithm for adjusting the capacity of the learned model locally, and discuss the advantages of such models for learning with uneven data distributions. Alpaydin & Jordan (1996) combine multiple local perceptrons in either a cooperative or a discriminative manner, and Zhang et al. (2006) combine multiple local support vector machines. These and other similar contributions rely on local neighborhoods containing multiple samples. The one-shot similarity kernel of Wolf et al. (2009) contrasts a single test sample with many training samples, but it does not finetune a model based on a single sample, as we do.
More recently, Wang et al. (2021) employ local learning to perform single-sample domain adaptation (including robustness to corruption). The adaptation is performed through an optimization process that minimizes the entropy of the prediction provided for each test sample. Our method does not require any test-time optimization and focuses (on the training samples) on improving the accuracy of the ground truth label rather than label-agnostic confidence.
Alet et al. (2021) propose a method called Tailoring that applies, like our method, meta-learning to local learning. The approach is based on applying unsupervised learning on a dataset that is created by augmenting the test sample, in a way that is related to the adaptive instance normalization of Huang & Belongie (2017). Our method does not employ any such augmentation and is based on supervised finetuning on a single sample.
Tailoring was tested on synthetic datasets with very specific structures, in a very specific unsupervised setting of CIFAR-10. Additionally, it was tested as a defense against adversarial samples, with results that fell short of the state of the art in this field. Since the empirical success obtained by Tailoring so far is limited and since there is no published code, it is not used as a baseline in our experiments.
As far as we can ascertain, all existing local learning contributions are very different from our work. No other contribution overfits samples of the training set, trains a hypernetwork for local learning, or builds a hypernetwork based on diffusion models.
Hypernetworks (Ha et al., 2016) are neural models that generate the weights of a second primary network, which performs the actual prediction task. Since the inferred weights are multiplied by the activations of the primary network, hypernetworks are a form of multiplicative interactions (Jayakumar et al., 2020), and extend layer-specific dynamic networks, which have been used to adapt neural models to the properties of the input sample (Klein et al., 2015; Riegler et al., 2015).
Hypernetworks benefit from the knowledge-sharing ability of the weight-generating network and are therefore suited for meta-learning tasks, including few-shot learning (Bertinetto et al., 2016), continual learning (von Oswald et al., 2020), and model personalization (Shamsian et al., 2021). When there is a need to repeatedly train similar networks, predicting the weights can be more efficient than backpropagation. Hypernetworks have, therefore, been used for neural architecture search (Brock et al., 2018; Zhang et al., 2019) and hyperparameter selection (Lorraine & Duvenaud, 2018).
MEND by Mitchell et al. (2021) explores the problem of model editing for large language models, in which the model's parameters are updated after training to incorporate new data. In our work, the goal is to predict the label of the new sample and not to update the model. Unlike MEND, our method does not employ the label of the new sample.
Diffusion models  Many of the recent generative models for images (Ho et al., 2022; Chen et al., 2020; Dhariwal & Nichol, 2021a) and speech (Kong et al., 2021; Chen et al., 2020) are based on a degenerate form of the Fokker-Planck equation. Sohl-Dickstein et al. (2015) showed that complicated distributions could be learned using a simple diffusion process. The Denoising Diffusion Probabilistic Model (DDPM) of Ho et al. (2020) extends the framework and presents high-quality image synthesis. Song et al. (2020) sped up the inference time by an order of magnitude using implicit sampling with their DDIM method. Watson et al. (2021) propose a dynamic programming algorithm to find an efficient denoising schedule, and San-Roman et al. (2021) apply a learned scaling adjustment to noise scheduling. Luhman & Luhman (2021) combined knowledge distillation with DDPMs.
The iterative nature of the denoising generation scheme creates an opportunity to steer the process, by considering the gradients of additional loss terms. The Iterative Latent Variable Refinement (ILVR) method (Choi et al., 2021)
does so for images by directing the generated image toward a low-resolution template. A similar technique was subsequently employed for voice modification (Levkovitch et al., 2022). Direct conditioning is also possible: Saharia et al. (2022) generate photo-realistic text-to-image scenes by conditioning a diffusion model on text embeddings; Amit et al. (2021) repeatedly condition on the input image to obtain image segmentation. In voice generation, the mel-spectrogram can be used as an additional input to the denoising network (Chen et al., 2020; Kong et al., 2021; Liu et al., 2021a), as can the input text for a text-to-speech diffusion model (Popov et al., 2021). The conditioning we employ is of the direct type.
3. Method
Our method is based on a modified diffusion process. Denote the training dataset as S = {(x_i, y_i)}_{i=1}^{n}, where x_i are the data points in the dataset S, and y_i are the associated labels. First, a base model, f_θ(x) = f(x, θ), is trained over the entire dataset, S, where θ are the learned weights of the model when trained over the entire dataset.
Next, for every training sample s ∈ S we run fine-tuning based on that single sample to obtain the overfitted parameters (function) as θ_s (f_{θ_s}).

\[
\theta_s = \theta + \arg\min_{\Delta} \, \mathcal{L}\big(f(x_s, \theta + \Delta),\, y_s\big)\,, \tag{1}
\]

where L is the loss function that is minimized in the training of the base model, x_s, y_s are the data point and label of sample s, and Δ is the weight difference obtained when finetuning the model. Finetuning is performed with three gradient descent iterations, and, as shown in our runtime analysis, is typically much less computationally demanding than the training of the base network.
The meta-learning problem we consider is the one of learning a model g, which maps x (the input domain of sample s), and potentially multiple latent representations of x in the context of f_θ, collectively denoted as I(x), to a vector of weight differences, such that

\[
g(x, I(x)) = \theta_s - \theta \tag{2}
\]

where θ are the base model's parameters trained over S, and g(x, I(x)) is a mapping function that maps the input, i.e., the x part of s, and multiple latent representations of it, I(x), to the desired shift in the model parameters.
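At test time, g replaces the finetuning procedure entirely. The sketch below shows how such a mapping could be applied to a new sample, assuming the hypernetwork outputs the shift of the selected layer only; `collect_conditioning` is a hypothetical helper that gathers I(x) (a sketch of it follows Eq. (4)), and the in-place add/restore pattern is just one possible way to apply the predicted shift.

```python
import torch

@torch.no_grad()
def predict_with_shift(base_model, g, x, layer_name):
    """Apply the learned mapping g to a test sample x (Eq. 2) and run the
    adapted model; the base weights are restored afterwards."""
    i_l, a_l, f_out = collect_conditioning(base_model, x, layer_name)  # I(x)
    delta = g(x, (i_l, a_l, f_out))                   # predicted theta_s - theta (flat)
    layer = dict(base_model.named_modules())[layer_name]
    layer.weight.add_(delta.view_as(layer.weight))    # temporarily "overfit" the layer
    y_hat = base_model(x)
    layer.weight.sub_(delta.view_as(layer.weight))    # restore the base model
    return y_hat
```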
Layer selection  Current deep neural networks can have millions or even billions of parameters. Thus, learning to modify all network parameters can be a prohibitive task. Therefore, we opt to modify, via function g, a single layer of f_θ.

To select this layer, we follow Lutati & Wolf (2021) and choose the layer that presents the maximal entropy of the loss, when fixing the samples (x, y) ∈ S and perturbing the layer's parameters.
The layer that yields the highest entropy score when perturbed is a natural candidate for a fine-tuning algorithm, since a large entropy reflects that, near the layer's base parameters, the surface imposed by the loss function has a large variance. Thus, a small perturbation could result in a dramatic change in loss when the perturbation is in the right direction.
Denote the perturbed weights, in which only layer L is perturbed, as θ_L. The score used for selection is

\[
\mathrm{Score} = \frac{1}{|S|} \sum_{(x,y)\in S} \mathrm{Entropy}_{\theta_L}\big(\mathcal{L}(f(x, \theta_L), y)\big)\,, \tag{3}
\]

where L is the loss objective on which the function f_θ is trained, and the entropy is computed over multiple draws of θ_L. Since sampling does not involve a backpropagation computation, the process is not computationally demanding, and 10,000 samples are used for each training sample s = (x, y).
The entropy, for each sample s, is computed by fitting a Gaussian Kernel Density Estimation (GKDE) (Silverman, 1986) to the obtained empirical distribution of the loss function. The layer that obtains the highest mean entropy is selected.
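The selection criterion lends itself to a direct implementation. The sketch below is one way to estimate Eq. (3) in PyTorch with SciPy's Gaussian KDE; the perturbation scale `sigma`, the reduced number of draws, and the Monte-Carlo entropy estimate are assumptions made for brevity, not the paper's exact procedure.

```python
import numpy as np
import torch
from scipy.stats import gaussian_kde

@torch.no_grad()
def layer_entropy_score(base_model, criterion, loader, layer_name,
                        n_draws=100, sigma=0.01):
    """Mean entropy of the loss when only `layer_name` is perturbed (Eq. 3)."""
    layer = dict(base_model.named_modules())[layer_name]
    scores = []
    for x, y in loader:                               # each (x, y) in S
        losses = []
        for _ in range(n_draws):
            noise = sigma * torch.randn_like(layer.weight)
            layer.weight.add_(noise)                  # perturb only this layer
            losses.append(criterion(base_model(x), y).item())
            layer.weight.sub_(noise)                  # restore the base weights
        kde = gaussian_kde(np.array(losses))          # GKDE fit to the loss values
        entropy = -np.mean(np.log(kde(np.array(losses)) + 1e-12))
        scores.append(entropy)
    return float(np.mean(scores))
```

The score is computed once per candidate layer, and the layer with the highest value is the one handed to the diffusion model.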
The conditioning signal  The latent representation, I(x), has three components. Given a selected layer, L, we denote the input to this layer, when passing a sample x to f(x, θ), as i_L(x), and the activation of this layer as a_L(x). We also use the output of the base function, f_θ(x). I(x) is given by the tuple

\[
I(x) = \left[\, i_L(x),\ a_L(x),\ f_\theta(x) \,\right] \tag{4}
\]
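In a PyTorch model, the three components of I(x) can be gathered in a single forward pass with a forward hook. The helper below is a sketch under the assumption that the selected layer is addressable by name; whether a_L(x) refers to the layer's output before or after its nonlinearity depends on the architecture, and here the hook simply captures the layer's input and output.

```python
import torch

def collect_conditioning(base_model, x, layer_name):
    """Return I(x) = [i_L(x), a_L(x), f_theta(x)] for the selected layer (Eq. 4)."""
    captured = {}

    def hook(module, inputs, output):
        captured["i_L"] = inputs[0].detach()          # input to the selected layer
        captured["a_L"] = output.detach()             # the layer's activation

    layer = dict(base_model.named_modules())[layer_name]
    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        f_out = base_model(x)                         # f_theta(x), the base prediction
    handle.remove()
    return captured["i_L"], captured["a_L"], f_out
```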
3.1. Diffusion Process
The goal of the diffusion process is to reconstruct Δ, the difference between the fine-tuned weights, θ_s, and the base model weights, θ. The process starts with a random Δ_T, with the same dimensions as θ_s.

\[
\Delta_T \sim \mathcal{N}(0, 1) \tag{5}
\]

Next, it iterates with Δ_t, where t is decreasing and is the diffusion step, and returns Δ_0. After the appropriate diffusion steps, Δ_0 should be as close as possible to Δ for a given point s.
The diffusion error estimation network, ε, is a function of the current estimation, Δ_t, the latent representation tuple, I(x), and the diffusion timestep, t. The last is encoded through a positional encoding network (Vaswani et al., 2017), PE. All inputs, except for Δ_t, are combined into one vector: e = PE(t) + E_i(i_L) + E_a(a_L) + E_o(f_θ(x)),