Preprint
FROM POINTS TO FUNCTIONS:
INFINITE-DIMENSIONAL REPRESENTATIONS IN
DIFFUSION MODELS
Sarthak Mittal1,2, Guillaume Lajoie1,2, Stefan Bauer5, Arash Mehrjou3,4
1Mila, 2Université de Montréal, 3MPI-IS, 4ETH Zurich, 5KTH Stockholm
ABSTRACT
Diffusion-based generative models learn to iteratively transfer unstructured noise
to a complex target distribution as opposed to Generative Adversarial Networks
(GANs) or the decoder of Variational Autoencoders (VAEs) which produce sam-
ples from the target distribution in a single step. Thus, in diffusion models ev-
ery sample is naturally connected to a random trajectory which is a solution to a
learned stochastic differential equation (SDE). Generative models are only con-
cerned with the final state of this trajectory that delivers samples from the desired
distribution. Abstreiter et al. (2021) showed that these stochastic trajectories can
be seen as continuous filters that wash out information along the way. Conse-
quently, it is reasonable to ask if there is an intermediate time step at which the
preserved information is optimal for a given downstream task. In this work, we
show that a combination of information content from different time steps gives a
strictly better representation for the downstream task. We introduce attention- and recurrence-based modules that "learn to mix" the information content of various time steps such that the resulting representation leads to superior performance on downstream tasks.1
1 INTRODUCTION
A lot of the progress in Machine Learning hinges on learning good representations of the data, whether in a supervised or unsupervised fashion. In the absence of label information, learning a good representation is often guided by reconstruction of the input, as is the case with autoencoders and generative models like variational autoencoders (Vincent et al., 2010; Kingma & Welling, 2013; Rezende et al., 2014), or by some notion of invariance to certain transformations, as in Contrastive Learning and similar approaches (Chen et al., 2020b;d; Grill et al., 2020). In this work, we analyze a novel way of representation learning, introduced in Abstreiter et al. (2021), which uses a denoising objective with diffusion-based models to obtain unbounded representations.
Diffusion-based models (Sohl-Dickstein et al., 2015; Song et al., 2020; 2021; Sajjadi et al., 2018; Niu et al., 2020; Cai et al., 2020; Chen et al., 2020a; Saremi et al., 2018; Dhariwal & Nichol, 2021; Luhman & Luhman, 2021; Ho et al., 2021; Mehrjou et al., 2017) are generative models that leverage step-wise perturbations to the samples of the data distribution (e.g., CIFAR10), modeled via a Stochastic Differential Equation (SDE), until convergence to an unstructured distribution (e.g., $\mathcal{N}(0, I)$) called, in this context, the prior distribution. In contrast to this diffusion process, a "score model" is learned to approximate the reverse process that iteratively converges to the data distribution starting from the prior distribution. Beyond the generative modelling capacity of score-based models, we instead use the additionally encoded representations to perform inference tasks, such as classification.
In this work, we revisit the formulation provided by Abstreiter et al. (2021); Preechakul et al. (2022), which augments such diffusion-based systems with an encoder for representation learning that can be used for downstream tasks. In particular, we look at the infinite-dimensional representation learning methodology from Abstreiter et al. (2021) and perform a deeper dive into
Corresponding author: sarthmit@gmail.com
1Open-sourced implementation is available at https://github.com/sarthmit/traj drl
arXiv:2210.13774v1 [cs.LG] 25 Oct 2022
Figure 1: Downstream performance of single-point-based representations (MLP) and full-trajectory-based representations (RNN and Tsf) on different datasets for both types of learned encoders, probabilistic (VDRL) and deterministic (DRL), using a 64-dimensional latent space (Top) and a 128-dimensional latent space (Bottom).
(a) the benefits of utilizing the trajectory, or multiple points on it, as opposed to choosing just a single point, and (b) the kind of information encoded at different points. Using trained attention mechanisms over diffusion trajectories, we ask about the similarities and differences of representations across the diffusion process. Do they encode certain interpretable features at different points, or is it redundant to look at the whole trajectory?
Our findings can be summarized as follows:
• We propose using the trajectory-based representation combined with sequential architectures like Recurrent Neural Networks (RNNs) and Transformers to perform downstream predictions using multiple points, as this leads to better performance than finding the one best point on the trajectory for downstream predictions (Abstreiter et al., 2021).
• We analyze the representations obtained at different parts of the trajectory through Mutual Information and attention-based relevance to downstream tasks, to showcase the differences in information contained along the trajectory.
• We also provide insights into the benefits of using more points on the trajectory, with saturating benefits as the discretization becomes finer. We further show that finer discretizations lead to even larger performance benefits when the latent space is severely restricted, e.g., just a 2-dimensional output from the encoder.
2 BEYOND FIXED REPRESENTATIONS
We first outline how diffusion-based representation learning systems are trained. Given some example $x_0 \in \mathbb{R}^d$ which is sampled from the target distribution $p_0$, the diffusion process constructs the trajectory $(x_t)_{t \in [0,1]}$ through the application of an SDE. In this work, we consider the Variance Exploding SDE (Song et al., 2021) for this diffusion process, defined as
$$\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w := \sqrt{\frac{\mathrm{d}[\sigma^2(t)]}{\mathrm{d}t}}\,\mathrm{d}w \qquad (1)$$
(a) VDRL | CIFAR10 (b) VDRL | CIFAR100 (c) VDRL | Mini-ImageNet (d) DRL | CIFAR10 (e) DRL | CIFAR100 (f) DRL | Mini-ImageNet
Figure 2: Normalized Mutual Information between different points on the trajectory. Cell $(i, j)$ demonstrates the normalized mutual information, estimated with the MINE algorithm, between the representations at times $t = i$ and $t = j$.
where $w$ is the standard Wiener process and $\sigma^2(\cdot)$ the noise variance of the diffusion process. This leads to a closed-form distribution of $x_t$ conditional on $x_0$ as $p_{0t}(x_t \mid x_0) = \mathcal{N}(x_t;\, x_0,\, [\sigma^2(t) - \sigma^2(0)]\,I)$. Given this diffusion process modeled through the Variance Exploding SDE, the reverse SDE takes a similar form but requires knowledge of the score function, i.e. $\nabla_x \log p_t(x)$ for all $t \in [0,1]$. A common way to obtain this score function is through the Explicit Score Matching (Hyvärinen & Dayan, 2005) objective,
$$\mathbb{E}_{x_t}\!\left[\left\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t) \right\|^2\right] \qquad (2)$$
which suffers from one drawback: the ground-truth score function is not available. To solve this problem, Denoising Score Matching (Vincent, 2011) was proposed,
$$\mathbb{E}_{x_0}\!\left[\mathbb{E}_{x_t \mid x_0}\!\left[\left\| s_\theta(x_t, t) - \nabla_{x_t} \log p_{0t}(x_t \mid x_0) \right\|^2\right]\right] \qquad (3)$$
where the term $\log p_{0t}(x_t \mid x_0)$ is available due to its closed-form structure. Given that the above objective cannot be reduced to 0, Abstreiter et al. (2021) propose the objective
$$\mathbb{E}_{x_0}\!\left[\mathbb{E}_{x_t \mid x_0}\!\left[\left\| s_\theta(x_t, E_\phi(x_0, t), t) - \nabla_{x_t} \log p_{0t}(x_t \mid x_0) \right\|^2\right]\right] \qquad (4)$$
where the additional input $E_\phi(x_0, t)$ to the score function is obtained from a learned encoder. It provides information about the unperturbed sample that might be useful for denoising data at time step $t$ in the diffusion process. Training this system can reduce the objective to 0, thereby providing an incentive for the encoder $E_\phi(\cdot, t)$ to learn meaningful representations for each time $t$. From this, we obtain a trajectory-based representation $(E_\phi(x_0, t))_{t \in [0,1]}$ for each sample $x_0$, as opposed to the finite-sized representations obtained from typical Autoencoder (Bengio et al., 2013; Vinyals et al., 2016; Kingma & Welling, 2013; Rezende et al., 2014) and Contrastive Learning (Chen et al., 2020c; Grill et al., 2020; Caron et al., 2021; Bromley et al., 1993; Chen & He, 2020) approaches.
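To make the training of the encoder-conditioned score model concrete, the following is a minimal PyTorch-style sketch of a single stochastic estimate of the objective in Eq. (4) under the Variance Exploding perturbation kernel. The modules `score_model` and `encoder`, the geometric noise schedule `sigma`, and the $\sigma^2(t)$ loss weighting are illustrative assumptions, not the released implementation.

```python
import torch

def sigma(t, sigma_min=0.01, sigma_max=50.0):
    # Assumed geometric noise schedule for the Variance Exploding SDE;
    # the text only specifies sigma^2(t) abstractly.
    return sigma_min * (sigma_max / sigma_min) ** t

def drl_training_loss(score_model, encoder, x0):
    """Monte-Carlo estimate of Eq. (4) for a batch x0 of clean samples.

    score_model(x_t, z, t) -> predicted score, same shape as x_t
    encoder(x_0, t)        -> representation E_phi(x_0, t)
    Both modules are hypothetical stand-ins for the learned networks.
    """
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)                 # t ~ Uniform[0, 1]
    # Std of p_0t(x_t | x_0); the small sigma^2(0) term is dropped for brevity.
    std = sigma(t).view(b, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    xt = x0 + std * noise                               # sample x_t | x_0
    z = encoder(x0, t)                                  # E_phi(x_0, t)
    # For a Gaussian kernel, grad_{x_t} log p_0t(x_t | x_0) = -(x_t - x_0) / sigma^2(t).
    target = -(xt - x0) / std ** 2
    pred = score_model(xt, z, t)
    # sigma^2(t) weighting (a common convention) keeps the loss scale uniform in t.
    return ((std * (pred - target)) ** 2).flatten(1).sum(dim=1).mean()
```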
Following the setup in Abstreiter et al. (2021), we consider two different versions of the encoder $E_\phi(\cdot,\cdot)$: (a) the VDRL setup, where the output of $E_\phi(\cdot,\cdot)$ represents a distribution from which a sample is drawn, and the distribution is regularized using a KL-divergence term with the standard Normal distribution $\mathcal{N}(0, I)$, and (b) the DRL setup, where the output of the encoder is deterministic and regularized using an $L_1$ distance metric to be as close to 0 as possible. In all our experiments, we see that the trends hold not only across multiple seeds but also across these two types of encoders, substantiating the statistical significance of the trends.
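As a rough sketch of the two regularizers described above, assuming the encoder's raw outputs are a mean/log-variance pair in the VDRL case and a single code vector in the DRL case (variable names and the relative weighting of these terms are illustrative assumptions):

```python
import torch

def vdrl_regularizer(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ): pulls the encoder distribution
    # towards the standard Normal, as in the VDRL setup.
    return 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=-1).mean()

def vdrl_sample(mu, logvar):
    # Reparameterized sample from the encoder distribution, used as E_phi(x_0, t).
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

def drl_regularizer(z):
    # L1 distance to 0: keeps the deterministic code close to the origin,
    # as in the DRL setup.
    return z.abs().sum(dim=-1).mean()
```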
(a) VDRL | CIFAR10 (b) VDRL | CIFAR100 (c) VDRL | Mini-ImageNet (d) DRL | CIFAR10 (e) DRL | CIFAR100 (f) DRL | Mini-ImageNet
Figure 3: Attention scores provided to different points on the trajectories, which are obtained from diffusion-based representation learning systems with probabilistic encoders (VDRL; top row) and with deterministic encoders (DRL; bottom row) across the following datasets: (i) Left: CIFAR10, (ii) Middle: CIFAR100, and (iii) Right: Mini-ImageNet.
It is important to note that our goal here is strictly representation learning, and thus we use the obtained representations for downstream (multitask) image classification. This should not be confused with generative modelling: the provided mechanism augments a generative model for representation learning, but is not a generative model on its own. Since this representation learning paradigm can be augmented with a time-conditioned encoder model, it naturally extends to trajectory-based (unbounded) representations, in contrast to typical bounded representation learning models like Autoencoders. Thus, this representation learning paradigm constructs a functional map from the input space to a curve/trajectory in $\mathbb{R}^d$, where we refer to $d$ as the dimensionality of this encoded space.
2.1 INFINITE-DIMENSIONAL REPRESENTATION OF FINITE-DIMENSIONAL DATA
Normally in autoencoders or other static representation learning methods, the input data $x_0 \in \mathbb{R}^d$ is mapped to a single point $z \in \mathbb{R}^c$ in the code space. However, our proposed algorithm learns a richer representation where the input $x_0$ is mapped to a curve in $\mathbb{R}^c$ instead of a single point through the encoder $E_\phi(\cdot, t)$. Hence, the learned code is produced by the map $x_0 \mapsto (E_\phi(x_0, t))_{t \in [0,1]}$, where the infinite-dimensional object $(E_\phi(x_0, t))_{t \in [0,1]}$ is the encoding for $x_0$.
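This "points to functions" view can be made explicit in code: the representation of a sample is not a vector but a callable over $t$ that downstream models may query at any resolution. A minimal sketch, assuming a trained, time-conditioned `encoder` (hypothetical name):

```python
def encode_as_function(encoder, x0):
    """Return the code of x0 as a function t -> E_phi(x0, t).

    The infinite-dimensional representation is the curve traced out as t
    varies over [0, 1]; callers decide where and how finely to evaluate it.
    """
    def code(t):
        return encoder(x0, t)
    return code
```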
The learned code is at least as good as static codes in terms of separation induced among the codes.
Consider two input samples $x_0$ and $x_0'$; then we have
$$\left\| E_\phi(x_0, 0) - E_\phi(x_0', 0) \right\| \;\leq\; \sup_{t \in [0,1]} \left\| E_\phi(x_0, t) - E_\phi(x_0', t) \right\| \qquad (5)$$
which implies that the downstream task can at least recover the separation provided by finite-
dimensional codes from the infinite-dimensional code by looking for the maximum separation along
the representation trajectory.
A downstream task can leverage this rich encoding in various ways. Consider the classification task where we want to find a mapping $f: \mathbb{R}^d \to \{0,1\}$ from input data to the label space. Instead of giving $x_0$ as the input to $f$, we define $f: \mathcal{H} \to \{0,1\}$, where the input to the classifier is the whole trajectory $(E_\phi(x_0, t))_{t \in [0,1]}$. Thus, the classifier can use RNN and Transformer models to make use of the information content of the entire trajectory.
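Below is an illustrative sketch of such a trajectory-consuming classifier: a GRU that aggregates a discretized trajectory into a single hidden state before a linear classification head. Layer sizes and the choice of a GRU are placeholder assumptions, not the exact downstream architectures used in the experiments.

```python
import torch
import torch.nn as nn

class TrajectoryClassifier(nn.Module):
    """Classify a sample from its discretized representation trajectory.

    Input: traj of shape [batch, num_steps, code_dim], where
    traj[:, k] = E_phi(x_0, t_k) for a grid of times t_k in [0, 1].
    """
    def __init__(self, code_dim=128, hidden_dim=256, num_classes=10):
        super().__init__()
        self.rnn = nn.GRU(code_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, traj):
        _, h_n = self.rnn(traj)      # aggregate information along the trajectory
        return self.head(h_n[-1])    # class logits from the final hidden state
```

A Transformer-based alternative would replace the GRU with self-attention over the time steps (plus an encoding of $t$), which additionally yields the per-time-step attention scores analyzed later (Figure 3).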
3 EXPERIMENTS
We first train two kinds of diffusion-based generative models as outlined in Abstreiter et al. (2021), based on probabilistic (VDRL) and deterministic (DRL) encoders, respectively. After training, the
(a) VDRL | Background Color (b) VDRL | Foreground Color (c) VDRL | Location (d) VDRL | Object Shape (e) DRL | Background Color (f) DRL | Foreground Color (g) DRL | Location (h) DRL | Object Shape
Figure 4: Attention score profiles for different tasks under the Synthetic dataset, when using the VDRL framework (top) and the DRL framework (bottom). The scores reveal that almost all points in the trajectory store similar amounts of information about the background color, while the latter part of the trajectory encodes more information about the foreground object. In particular, information about the location is most heavily found near the end of the trajectory.
encoder model is kept fixed. For all our downstream experiments, we use this trained encoder to obtain the trajectory-based representation for each of the samples. While the trajectories lie in the continuous domain $[0,1]$, we sample them at regular intervals of length 0.1, unless specified otherwise. This leads to a discretization of the trajectory, which is then used for the various analyses outlined below. Further, we set the dimensionality of the latent space, that is, the output of the encoder, to 128 unless otherwise specified. Additional details about the architectures used, the optimization strategy and other implementation details can be found in Appendix A.
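Concretely, the discretization described above amounts to evaluating the frozen encoder on a grid of 11 time points and stacking the results into a `[num_samples, 11, 128]` feature tensor for the downstream models. The sketch below is an illustration under that setup; the `encoder` callable and its calling convention are assumptions.

```python
import torch

@torch.no_grad()
def discretize_trajectories(encoder, x0_batch, step=0.1):
    # Time grid 0.0, 0.1, ..., 1.0 (11 points for step = 0.1).
    num_steps = int(round(1.0 / step)) + 1
    ts = torch.linspace(0.0, 1.0, num_steps, device=x0_batch.device)
    codes = []
    for t in ts:
        t_batch = t.expand(x0_batch.shape[0])      # same t for every sample in the batch
        codes.append(encoder(x0_batch, t_batch))   # E_phi(x_0, t), shape [B, code_dim]
    return torch.stack(codes, dim=1)               # shape [B, num_steps, code_dim]
```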
3.1 DOWNSTREAM PERFORMANCE REVEALS BENEFITS OF TRAJECTORY INFORMATION
To understand the benefits of utilizing the trajectory-based representations, we train standard Multi-Layer Perceptron (MLP) models at different points on the trajectory and compare them with Recurrent Neural Network (RNN) (Hochreiter & Schmidhuber, 1997; Cho et al., 2014) and Transformer (Vaswani et al., 2017) based models that are able to aggregate information from different parts of the trajectory.
We evaluate the MLP-, RNN- and Transformer-based downstream models on diffusion systems with both probabilistic encoders (VDRL) and non-probabilistic ones (DRL). In Figure 1, we see the performance of these different setups for the following datasets: CIFAR10 (Krizhevsky et al., a), CIFAR100 (Krizhevsky et al., b) and Mini-ImageNet (Vinyals et al., 2016). Note that in contrast to the MLP implementations, the RNN and Transformer use the entire trajectory, and the obtained performance is plotted across all time points for visual comparison. We typically see that RNN- and Transformer-based models perform better than even the peaks obtained by the MLP systems. This shows that there is no single point on the trajectory that encapsulates all the information necessary for optimal classification, and thus utilizing the whole trajectory as opposed to individual points leads to improvements in performance.
We further perform this analysis for different dimensionalities of the latent space, that is, when the trajectory representation is embedded in a 64-dimensional Euclidean space (Figure 1: Top) and when it is embedded in a 128-dimensional Euclidean space (Figure 1: Bottom). We see similar trends across the two settings, thus highlighting consistent benefits when using a discretization of the whole trajectory.
3.2 MUTUAL INFORMATION REVEALS DIFFERENCES ALONG THE TRAJECTORY
In an effort to understand whether different parts of the trajectory-based representation actually contain different types of information about the sample, we evaluate the mutual information between the representations at various points in the trajectory. We use the MINE algorithm (Belghazi et al., 2018) to estimate the mutual information between the representations at any two different points in the trajectory. Through this algorithm, we compute and analyze a normalized version of the