
widely applicable, and we evaluate it across five very different domains: image
classification, image synthesis, regression in tabular data, speech separation,
and few-shot NLP. In all cases, the results obtained by our method improve upon
the baseline model to which it is applied. Whenever the baseline model is close
to the state of the art, the leap in performance sets new state-of-the-art results.
2. Related Work
Local learning approaches perform inference with models that are focused on
training samples in the vicinity of each test sample. This way, the predictions
are based on what are believed to be the most relevant data points. K-nearest
neighbors, for example, is a local learning method. Bottou & Vapnik (1992)
present a simple algorithm for adjusting the capacity of the learned model
locally and discuss the advantages of such models for learning with uneven data
distributions. Alpaydin & Jordan (1996) combine multiple local perceptrons in
either a cooperative or a discriminative manner, and Zhang et al. (2006) combine
multiple local support vector machines. These and other similar contributions
rely on local neighborhoods containing multiple samples. The one-shot similarity
kernel of Wolf et al. (2009) contrasts a single test sample with many training
samples, but it does not finetune a model based on a single sample, as we do.
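As a minimal illustration of this family of methods (a generic sketch, not the
algorithm of any specific work cited above), one can refit a small model on the
k training samples nearest to each test point:

# Minimal sketch of local learning (illustrative only): fit a separate
# classifier on the k nearest training samples of each test point and
# predict with that local model.
import numpy as np
from sklearn.linear_model import LogisticRegression

def local_predict(X_train, y_train, X_test, k=25):
    preds = []
    for x in X_test:
        # indices of the k training samples closest to x (Euclidean distance)
        idx = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
        classes = np.unique(y_train[idx])
        if len(classes) == 1:  # pure neighborhood: no local fit is needed
            preds.append(classes[0])
            continue
        local_model = LogisticRegression(max_iter=200)
        local_model.fit(X_train[idx], y_train[idx])
        preds.append(local_model.predict(x[None, :])[0])
    return np.array(preds)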
More recently, Wang et al. (2021) employ local learning to perform single-sample
domain adaptation (including robustness to corruption). The adaptation is
performed through an optimization process that minimizes the entropy of the
prediction provided for each test sample. Our method does not require any
test-time optimization and, on the training samples, focuses on improving
accuracy with respect to the ground-truth label rather than optimizing a
label-agnostic confidence measure.
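For concreteness, such entropy-based test-time adaptation can be sketched as
follows (a schematic PyTorch version in the spirit of Wang et al. (2021); in
practice the update is typically restricted to a subset of the parameters, such
as the normalization layers, and our method does not use this mechanism):

# Schematic test-time adaptation by entropy minimization: the pretrained
# classifier `model` is adapted on the unlabeled test batch `x` before
# the prediction is read out.
import torch
import torch.nn.functional as F

def adapt_and_predict(model, x, steps=1, lr=1e-3):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        probs = F.softmax(model(x), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
        optimizer.zero_grad()
        entropy.backward()  # descend on the prediction entropy
        optimizer.step()
    with torch.no_grad():
        return model(x).argmax(dim=-1)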
Alet et al. (2021) propose a method called Tailoring that, like our method,
applies meta-learning to local learning. The approach is based on applying
unsupervised learning on a dataset that is created by augmenting the test
sample, in a way that is related to the adaptive instance normalization of
Huang & Belongie (2017). Our method does not employ any such augmentation and
is based on supervised finetuning on a single sample.
Tailoring was tested on synthetic datasets with very specific structures and in
a specific unsupervised setting of CIFAR-10. Additionally, it was tested as a
defense against adversarial samples, with results that fell short of the state
of the art in this field. Since the empirical success obtained by Tailoring so
far is limited and since there is no published code, it is not used as a
baseline in our experiments.
As far as we can ascertain, all existing local learning contributions are very
different from our work. No other contribution overfits samples of the training
set, trains a hypernetwork for local learning, or builds a hypernetwork based on
diffusion models.
Hypernetworks (Ha et al., 2016) are neural models that generate the weights of
a second, primary network, which performs the actual prediction task. Since the
inferred weights are multiplied by the activations of the primary network,
hypernetworks are a form of multiplicative interactions (Jayakumar et al.,
2020), and extend layer-specific dynamic networks, which have been used to
adapt neural models to the properties of the input sample (Klein et al., 2015;
Riegler et al., 2015).
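A minimal sketch of the construction (generic, and not the architecture of any
specific cited work or of our method): a small hypernetwork g maps a
conditioning vector z to the weight matrix and bias of a linear primary layer,
which is then applied to the input activations.

# Minimal hypernetwork sketch: g generates the parameters of a linear
# primary layer from a conditioning vector z (unbatched z, for simplicity).
import torch
import torch.nn as nn

class HyperLinear(nn.Module):
    def __init__(self, z_dim, in_dim, out_dim):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        self.g = nn.Sequential(  # hypernetwork: z -> flattened (W, b)
            nn.Linear(z_dim, 64), nn.ReLU(),
            nn.Linear(64, in_dim * out_dim + out_dim),
        )

    def forward(self, z, x):
        params = self.g(z)
        W = params[: self.in_dim * self.out_dim].view(self.out_dim, self.in_dim)
        b = params[self.in_dim * self.out_dim:]
        # the generated weights multiply the activations of the primary network
        return x @ W.t() + b

Since the output is x @ W(z)^T, the conditioning signal enters multiplicatively,
which is the link to multiplicative interactions mentioned above.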
Hypernetworks benefit from the knowledge-sharing ability of the
weight-generating network and are therefore suited for meta-learning tasks,
including few-shot learning (Bertinetto et al., 2016), continual learning
(von Oswald et al., 2020), and model personalization (Shamsian et al., 2021).
When there is a need to repeatedly train similar networks, predicting the
weights can be more efficient than backpropagation. Hypernetworks have,
therefore, been used for neural architecture search (Brock et al., 2018;
Zhang et al., 2019) and hyperparameter selection (Lorraine & Duvenaud, 2018).
MEND by Mitchell et al. (2021) explores the problem of model editing for large
language models, in which the model's parameters are updated after training to
incorporate new data. In our work, the goal is to predict the label of the new
sample and not to update the model. Unlike MEND, our method does not employ the
label of the new sample.
Diffusion models Many of the recent generative models for images (Ho et al.,
2022; Chen et al., 2020; Dhariwal & Nichol, 2021a) and speech (Kong et al.,
2021; Chen et al., 2020) are based on a degenerate form of the Fokker-Planck
equation. Sohl-Dickstein et al. (2015) showed that complicated distributions
could be learned using a simple diffusion process. The Denoising Diffusion
Probabilistic Model (DDPM) of Ho et al. (2020) extends the framework and
presents high-quality image synthesis. Song et al. (2020) sped up the inference
time by an order of magnitude using implicit sampling with their DDIM method.
Watson et al. (2021) propose a dynamic programming algorithm to find an
efficient denoising schedule, and San-Roman et al. (2021) apply a learned
scaling adjustment to the noise scheduling. Luhman & Luhman (2021) combine
knowledge distillation with DDPMs.
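As a brief reminder of the formulation underlying these works (the standard
DDPM notation of Ho et al. (2020); the later variants modify the noise schedule
or the sampler rather than this objective), the forward process gradually
corrupts a sample x_0 with Gaussian noise,

q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I),
\qquad
q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I),
\quad \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s),

and the denoising network \epsilon_\theta is trained with the simplified objective

\mathbb{E}_{x_0,\, \epsilon \sim \mathcal{N}(0, I),\, t}
\left[ \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0
+ \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t\right) \right\|^2 \right].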
The iterative nature of the denoising generation scheme creates an opportunity
to steer the process by considering the gradients of additional loss terms. The
Iterative Latent Variable Refinement (ILVR) method of Choi et al. (2021)