Preprint
FROM POINTS TO FUNCTIONS:
INFINITE-DIMENSIONAL REPRESENTATIONS IN
DIFFUSION MODELS
Sarthak Mittal1,2, Guillaume Lajoie1,2, Stefan Bauer5, Arash Mehrjou3,4
1Mila, 2Université de Montréal, 3MPI-IS, 4ETH Zurich, 5KTH Stockholm
ABSTRACT
Diffusion-based generative models learn to iteratively transfer unstructured noise
to a complex target distribution as opposed to Generative Adversarial Networks
(GANs) or the decoder of Variational Autoencoders (VAEs) which produce sam-
ples from the target distribution in a single step. Thus, in diffusion models ev-
ery sample is naturally connected to a random trajectory which is a solution to a
learned stochastic differential equation (SDE). Generative models are only con-
cerned with the final state of this trajectory that delivers samples from the desired
distribution. Abstreiter et al. (2021) showed that these stochastic trajectories can
be seen as continuous filters that wash out information along the way. Conse-
quently, it is reasonable to ask if there is an intermediate time step at which the
preserved information is optimal for a given downstream task. In this work, we
show that a combination of information content from different time steps gives a
strictly better representation for the downstream task. We introduce attention- and recurrence-based modules that "learn to mix" the information content of various time steps such that the resulting representation leads to superior performance on downstream tasks.1
1 INTRODUCTION
A lot of the progress in Machine Learning hinges on learning good representations of the data, whether in a supervised or unsupervised fashion. In the absence of label information, learning a good representation is often guided by reconstruction of the input, as is the case with autoencoders and generative models like variational autoencoders (Vincent et al., 2010; Kingma & Welling, 2013; Rezende et al., 2014), or by some notion of invariance to certain transformations, as in Contrastive Learning and similar approaches (Chen et al., 2020b;d; Grill et al., 2020). In this work, we analyze a novel way of representation learning, introduced in Abstreiter et al. (2021), which uses a denoising objective with diffusion-based models to obtain unbounded representations.
Diffusion-based models (Sohl-Dickstein et al., 2015; Song et al., 2020; 2021; Sajjadi et al., 2018; Niu et al., 2020; Cai et al., 2020; Chen et al., 2020a; Saremi et al., 2018; Dhariwal & Nichol, 2021; Luhman & Luhman, 2021; Ho et al., 2021; Mehrjou et al., 2017) are generative models that leverage step-wise perturbations to the samples of the data distribution (e.g., CIFAR10), modeled via a Stochastic Differential Equation (SDE), until convergence to an unstructured distribution (e.g., $\mathcal{N}(0, I)$) called, in this context, the prior distribution. In contrast to this diffusion process, a "score model" is learned to approximate the reverse process that iteratively converges to the data distribution starting from the prior distribution. Beyond the generative modelling capacity of score-based models, we instead use the additionally encoded representations to perform inference tasks, such as classification.
In this work, we revisit the formulation provided by Abstreiter et al. (2021); Preechakul et al. (2022), which augments such diffusion-based systems with an encoder for representation learning that can be used for downstream tasks. In particular, we look at the infinite-dimensional representation learning methodology from Abstreiter et al. (2021) and perform a deeper dive into
Corresponding author: sarthmit@gmail.com
1Open-sourced implementation is available at https://github.com/sarthmit/traj drl
arXiv:2210.13774v1 [cs.LG] 25 Oct 2022
Figure 1: Downstream performance of single-point-based representations (MLP) and full-trajectory-based representations (RNN and Tsf) on different datasets for both types of learned encoders, probabilistic (VDRL) and deterministic (DRL), using a 64-dimensional latent space (Top) and a 128-dimensional latent space (Bottom).
(a) the benefits of utilizing the trajectory, or multiple points on it, as opposed to choosing just a single point, and (b) the kind of information encoded at different points. Using trained attention mechanisms over diffusion trajectories, we ask about the similarities and differences of representations across the diffusion process. Do they encode certain interpretable features at different points, or is it redundant to look at the whole trajectory?
Our findings can be summarized as follows:
• We propose using the trajectory-based representation combined with sequential architectures like Recurrent Neural Networks (RNNs) and Transformers to perform downstream predictions using multiple points, as this leads to better performance than finding the one best point on the trajectory for downstream predictions (Abstreiter et al., 2021).
• We analyze the representations obtained at different parts of the trajectory through Mutual Information and attention-based relevance to downstream tasks, to showcase the differences in information contained along the trajectory.
• We also provide insights into the benefits of using more points on the trajectory, with saturating benefits as the discretization becomes finer. We further show that finer discretizations lead to even larger performance benefits when the latent space is severely restricted, e.g., just a 2-dimensional output from the encoder.
2 BEYOND FIXED REPRESENTATIONS
We first outline how diffusion-based representation learning systems are trained. Given some example $x_0 \in \mathbb{R}^d$ which is sampled from the target distribution $p_0$, the diffusion process constructs the trajectory $(x_t)_{t \in [0,1]}$ through the application of an SDE. In this work, we consider the Variance Exploding SDE (Song et al., 2021) for this diffusion process, defined as
$$\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w := \sqrt{\frac{\mathrm{d}[\sigma^2(t)]}{\mathrm{d}t}}\,\mathrm{d}w \qquad (1)$$
(a) VDRL | CIFAR10 (b) VDRL | CIFAR100 (c) VDRL | Mini-ImageNet (d) DRL | CIFAR10 (e) DRL | CIFAR100 (f) DRL | Mini-ImageNet
Figure 2: Normalized Mutual Information between different points on the trajectory. Cell $(i, j)$ demonstrates the normalized mutual information, estimated with the MINE algorithm, between the representations at times $t = i$ and $t = j$.
where $w$ is the standard Wiener process and $\sigma^2(\cdot)$ the noise variance of the diffusion process. This leads to a closed-form distribution of $x_t$ conditional on $x_0$ as $p_{0t}(x_t \mid x_0) = \mathcal{N}(x_t;\, x_0,\, [\sigma^2(t) - \sigma^2(0)]\,I)$. Given this diffusion process modeled through the Variance Exploding SDE, the reverse SDE takes a similar form but requires knowledge of the score function, i.e. $\nabla_x \log p_t(x)$ for all $t \in [0,1]$. A common way to obtain this score function is through the Explicit Score Matching (Hyvärinen & Dayan, 2005) objective,
$$\mathbb{E}_{x_t}\!\left[\left\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t) \right\|^2\right] \qquad (2)$$
which suffers from one drawback: the ground-truth score function is not available. To solve this problem, Denoising Score Matching (Vincent, 2011) was proposed,
$$\mathbb{E}_{x_0}\!\left[\mathbb{E}_{x_t \mid x_0}\!\left[\left\| s_\theta(x_t, t) - \nabla_{x_t} \log p_{0t}(x_t \mid x_0) \right\|^2\right]\right] \qquad (3)$$
where the term $\log p_{0t}(x_t \mid x_0)$ is available due to its closed-form structure. Given that the above objective cannot be reduced to 0, Abstreiter et al. (2021) propose the objective
$$\mathbb{E}_{x_0}\!\left[\mathbb{E}_{x_t \mid x_0}\!\left[\left\| s_\theta(x_t, E_\phi(x_0, t), t) - \nabla_{x_t} \log p_{0t}(x_t \mid x_0) \right\|^2\right]\right] \qquad (4)$$
where the additional input $E_\phi(x_0, t)$ to the score function is obtained from a learned encoder. It provides information about the unperturbed sample that might be useful for denoising data at time step $t$ in the diffusion process. Training this system can reduce the objective to 0, thereby providing an incentive for the encoder $E_\phi(\cdot, t)$ to learn meaningful representations for each time $t$. From this, we obtain a trajectory-based representation $(E_\phi(x_0, t))_{t \in [0,1]}$ for each sample $x_0$, as opposed to the finite-sized representations obtained from typical Autoencoder (Bengio et al., 2013; Vinyals et al., 2016; Kingma & Welling, 2013; Rezende et al., 2014) and Contrastive Learning (Chen et al., 2020c; Grill et al., 2020; Caron et al., 2021; Bromley et al., 1993; Chen & He, 2020) approaches.
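To make the training of the encoder-conditioned score model concrete, the following is a minimal PyTorch-style sketch of a single stochastic estimate of the objective in Eq. (4) under the Variance Exploding perturbation kernel. The modules `score_model` and `encoder`, the geometric noise schedule `sigma`, and the $\sigma^2(t)$ loss weighting are illustrative assumptions, not the released implementation.

```python
import torch

def sigma(t, sigma_min=0.01, sigma_max=50.0):
    # Assumed geometric noise schedule for the Variance Exploding SDE;
    # the text only specifies sigma^2(t) abstractly.
    return sigma_min * (sigma_max / sigma_min) ** t

def drl_training_loss(score_model, encoder, x0):
    """Monte-Carlo estimate of Eq. (4) for a batch x0 of clean samples.

    score_model(x_t, z, t) -> predicted score, same shape as x_t
    encoder(x_0, t)        -> representation E_phi(x_0, t)
    Both modules are hypothetical stand-ins for the learned networks.
    """
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)                 # t ~ Uniform[0, 1]
    # Std of p_0t(x_t | x_0); the small sigma^2(0) term is dropped for brevity.
    std = sigma(t).view(b, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    xt = x0 + std * noise                               # sample x_t | x_0
    z = encoder(x0, t)                                  # E_phi(x_0, t)
    # For a Gaussian kernel, grad_{x_t} log p_0t(x_t | x_0) = -(x_t - x_0) / sigma^2(t).
    target = -(xt - x0) / std ** 2
    pred = score_model(xt, z, t)
    # sigma^2(t) weighting (a common convention) keeps the loss scale uniform in t.
    return ((std * (pred - target)) ** 2).flatten(1).sum(dim=1).mean()
```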
Following the setup in Abstreiter et al. (2021), we consider two different versions of the encoder $E_\phi(\cdot,\cdot)$: (a) the VDRL setup, where the output of $E_\phi(\cdot,\cdot)$ represents a distribution from which a sample is drawn, and the distribution is regularized using a KL-divergence term with the standard Normal distribution $\mathcal{N}(0, I)$, and (b) the DRL setup, where the output of the encoder is deterministic and regularized using an $L_1$ distance metric to be as close to 0 as possible. In all our experiments, we see that the trends hold not only across multiple seeds but also across these two types of encoders, substantiating the statistical significance of the trends.
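As a rough sketch of the two regularizers described above, assuming the encoder's raw outputs are a mean/log-variance pair in the VDRL case and a single code vector in the DRL case (variable names and the relative weighting of these terms are illustrative assumptions):

```python
import torch

def vdrl_regularizer(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ): pulls the encoder distribution
    # towards the standard Normal, as in the VDRL setup.
    return 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=-1).mean()

def vdrl_sample(mu, logvar):
    # Reparameterized sample from the encoder distribution, used as E_phi(x_0, t).
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

def drl_regularizer(z):
    # L1 distance to 0: keeps the deterministic code close to the origin,
    # as in the DRL setup.
    return z.abs().sum(dim=-1).mean()
```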
(a) VDRL | CIFAR10 (b) VDRL | CIFAR100 (c) VDRL | Mini-ImageNet (d) DRL | CIFAR10 (e) DRL | CIFAR100 (f) DRL | Mini-ImageNet
Figure 3: Attention scores provided to different points on the trajectories, which are obtained from diffusion-based representation learning systems with probabilistic encoders (VDRL; top row) and with deterministic encoders (DRL; bottom row) across the following datasets: (i) Left: CIFAR10, (ii) Middle: CIFAR100, and (iii) Right: Mini-ImageNet.
It is important to note that our goal here is strictly representation learning, and thus we use the obtained representations for downstream (multitask) image classification. This should not be confused with generative modelling: the provided mechanism augments a generative model for representation learning, but is not a generative model on its own. Since this representation learning paradigm can be augmented with a time-conditioned encoder model, it naturally extends to trajectory-based (unbounded) representations, in contrast to typical bounded representation learning models like Autoencoders. Thus, this representation learning paradigm constructs a functional map from the input space to a curve/trajectory in $\mathbb{R}^d$, where we refer to $d$ as the dimensionality of this encoded space.
2.1 INFINITE-DIMENSIONAL REPRESENTATION OF FINITE-DIMENSIONAL DATA
Normally in autoencoders or other static representation learning methods, the input data $x_0 \in \mathbb{R}^d$ is mapped to a single point $z \in \mathbb{R}^c$ in the code space. However, our proposed algorithm learns a richer representation where the input $x_0$ is mapped to a curve in $\mathbb{R}^c$ instead of a single point through the encoder $E_\phi(\cdot, t)$. Hence, the learned code is produced by the map $x_0 \mapsto (E_\phi(x_0, t))_{t \in [0,1]}$, where the infinite-dimensional object $(E_\phi(x_0, t))_{t \in [0,1]}$ is the encoding for $x_0$.
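This "points to functions" view can be made explicit in code: the representation of a sample is not a vector but a callable over $t$ that downstream models may query at any resolution. A minimal sketch, assuming a trained, time-conditioned `encoder` (hypothetical name):

```python
def encode_as_function(encoder, x0):
    """Return the code of x0 as a function t -> E_phi(x0, t).

    The infinite-dimensional representation is the curve traced out as t
    varies over [0, 1]; callers decide where and how finely to evaluate it.
    """
    def code(t):
        return encoder(x0, t)
    return code
```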
The learned code is at least as good as static codes in terms of separation induced among the codes.
Consider two input samples $x_0$ and $x_0'$; then we have
$$\left\| E_\phi(x_0, 0) - E_\phi(x_0', 0) \right\| \;\leq\; \sup_{t \in [0,1]} \left\| E_\phi(x_0, t) - E_\phi(x_0', t) \right\| \qquad (5)$$
which implies that the downstream task can at least recover the separation provided by finite-
dimensional codes from the infinite-dimensional code by looking for the maximum separation along
the representation trajectory.
A downstream task can leverage this rich encoding in various ways. Consider the classification task where we want to find a mapping $f: \mathbb{R}^d \to \{0,1\}$ from input data to the label space. Instead of giving $x_0$ as the input to $f$, we define $f: \mathcal{H} \to \{0,1\}$, where the input to the classifier is the whole trajectory $(E_\phi(x_0, t))_{t \in [0,1]}$. Thus, the classifier can use RNN and Transformer models to make use of the information content of the entire trajectory.
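Below is an illustrative sketch of such a trajectory-consuming classifier: a GRU that aggregates a discretized trajectory into a single hidden state before a linear classification head. Layer sizes and the choice of a GRU are placeholder assumptions, not the exact downstream architectures used in the experiments.

```python
import torch
import torch.nn as nn

class TrajectoryClassifier(nn.Module):
    """Classify a sample from its discretized representation trajectory.

    Input: traj of shape [batch, num_steps, code_dim], where
    traj[:, k] = E_phi(x_0, t_k) for a grid of times t_k in [0, 1].
    """
    def __init__(self, code_dim=128, hidden_dim=256, num_classes=10):
        super().__init__()
        self.rnn = nn.GRU(code_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, traj):
        _, h_n = self.rnn(traj)      # aggregate information along the trajectory
        return self.head(h_n[-1])    # class logits from the final hidden state
```

A Transformer-based alternative would replace the GRU with self-attention over the time steps (plus an encoding of $t$), which additionally yields the per-time-step attention scores analyzed later (Figure 3).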
3 EXPERIMENTS
We first train two kinds of diffusion-based generative models as outlined in Abstreiter et al. (2021), based on probabilistic (VDRL) and deterministic (DRL) encoders, respectively. After training, the
(a) VDRL | Background Color (b) VDRL | Foreground Color (c) VDRL | Location (d) VDRL | Object Shape (e) DRL | Background Color (f) DRL | Foreground Color (g) DRL | Location (h) DRL | Object Shape
Figure 4: Attention score profiles for different tasks under the Synthetic dataset, when using the VDRL framework (top) and the DRL framework (bottom). The scores reveal that almost all points in the trajectory store similar amounts of information about the background color, while the latter part of the trajectory encodes more information about the foreground object. In particular, information about the location is most heavily found near the end of the trajectory.
encoder model is kept fixed. For all our downstream experiments, we use this trained encoder to obtain the trajectory-based representation for each of the samples. While the trajectories lie in the continuous domain $[0,1]$, we sample them at regular intervals of length 0.1, unless specified otherwise. This leads to a discretization of the trajectory, which is then used for the various analyses outlined below. Further, we set the dimensionality of the latent space, that is, the output of the encoder, to 128 unless otherwise specified. Additional details about the architectures used, the optimization strategy and other implementation details can be found in Appendix A.
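Concretely, the discretization described above amounts to evaluating the frozen encoder on a grid of 11 time points and stacking the results into a `[num_samples, 11, 128]` feature tensor for the downstream models. The sketch below is an illustration under that setup; the `encoder` callable and its calling convention are assumptions.

```python
import torch

@torch.no_grad()
def discretize_trajectories(encoder, x0_batch, step=0.1):
    # Time grid 0.0, 0.1, ..., 1.0 (11 points for step = 0.1).
    num_steps = int(round(1.0 / step)) + 1
    ts = torch.linspace(0.0, 1.0, num_steps, device=x0_batch.device)
    codes = []
    for t in ts:
        t_batch = t.expand(x0_batch.shape[0])      # same t for every sample in the batch
        codes.append(encoder(x0_batch, t_batch))   # E_phi(x_0, t), shape [B, code_dim]
    return torch.stack(codes, dim=1)               # shape [B, num_steps, code_dim]
```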
3.1 DOWNSTREAM PERFORMANCE REVEALS BENEFITS OF TRAJECTORY INFORMATION
To understand the benefits of utilizing the trajectory-based representations, we train standard Multi-Layer Perceptron (MLP) models at different points on the trajectory and compare them with Recurrent Neural Network (RNN) (Hochreiter & Schmidhuber, 1997; Cho et al., 2014) and Transformer (Vaswani et al., 2017) based models that are able to aggregate information from different parts of the trajectory.
We evaluate the MLP-, RNN- and Transformer-based downstream models on diffusion systems with both probabilistic encoders (VDRL) and non-probabilistic ones (DRL). In Figure 1, we see the performance of these different setups for the following datasets: CIFAR10 (Krizhevsky et al., a), CIFAR100 (Krizhevsky et al., b) and Mini-ImageNet (Vinyals et al., 2016). Note that in contrast to the MLP implementations, the RNN and Transformer use the entire trajectory, and the obtained performance is plotted across all time points for visual comparison. We typically see that RNN- and Transformer-based models perform better than even the peaks obtained by the MLP systems. This shows that there is no single point on the trajectory that encapsulates all the information necessary for optimal classification, and thus utilizing the whole trajectory as opposed to individual points leads to improvements in performance.
We further perform this analysis for different dimensionalities of the latent space, that is, when the trajectory representation is embedded in a 64-dimensional Euclidean space (Figure 1: Top) and when it is embedded in a 128-dimensional Euclidean space (Figure 1: Bottom). We see similar trends across the two settings, thus highlighting consistent benefits when using a discretization of the whole trajectory.
3.2 MUTUAL INFORMATION REVEALS DIFFERENCES ALONG THE TRAJECTORY
In an effort to understand whether different parts of the trajectory-based representation actually contain different types of information about the sample, we evaluate the mutual information between the representations at various points in the trajectory. We use the MINE algorithm (Belghazi et al., 2018) to estimate the mutual information between the representations at any two different points in the trajectory. Through this algorithm, we compute and analyze a normalized version of the