Dynamic Latent Separation for Deep Learning
Yi-Lin Tuan¹, Zih-Yun Chiu², William Yang Wang¹
Abstract
A core problem in machine learning is to learn expressive latent variables for model prediction on complex data that involves multiple sub-components, in a flexible and interpretable fashion. Here, we develop an approach that improves expressiveness, provides partial interpretation, and is not restricted to specific applications. The key idea is to dynamically distance data samples in the latent space and thus enhance the output diversity. Our dynamic latent separation method, inspired by atomic physics, relies on the jointly learned structures of each data sample, which also reveal the importance of each sub-component for distinguishing data samples. This approach, atom modeling, requires no supervision of the latent space and allows us to learn extra, partially interpretable representations besides the original goal of a model. We empirically demonstrate that the algorithm also enhances the performance of small- to large-scale models in various classification and generation problems.
1. Introduction
Deep neural networks with multiple hidden layers are trained to be expressive models that learn complicated relationships between their inputs and outputs (Srivastava et al., 2014). Among various data types, data samples that consist of many sub-units, such as images and texts, can require models to be more expressive to account for nuanced differences among sub-units. This demand drives the development of large-scale and complex model architectures (Vaswani et al., 2017), which bring drawbacks such as compromised model interpretability (Ribeiro et al., 2016; Bastani et al., 2017; Rudin, 2019; Jain & Wallace, 2019).
Various algorithms improve model expressiveness without advancing model architectures. For instance, contrastive learning improves classification expressiveness (Dosovitskiy et al., 2014; Chen et al., 2020) by pushing apart latent features from different classes. Vector quantization tackles the expressiveness of autoencoders (Van Den Oord et al., 2017) by learning discrete representations using a preset codebook. As a separate effort, post-hoc methods or models designed to follow a self-explaining protocol (Alvarez-Melis & Jaakkola, 2018) reveal some of the underlying reasons for model behaviors. While these methods show promising results in their bundled applications, their transferability and usefulness to other applications are not yet certain. Meanwhile, generalizable training algorithms that can simultaneously improve expressiveness and uncover partial explanations remain underexplored.

¹Department of Computer Science, University of California, Santa Barbara. ²Department of Electrical and Computer Engineering, University of California, San Diego. Correspondence to: Yi-Lin Tuan <ytuan@cs.ucsb.edu>. Work in progress.
We present a novel algorithm that simultaneously improves model expressiveness, provides an interpretation of sub-component importance, and is generalizable to multiple applications. Our method, atom modeling, first maps the latent representation of each sub-component in a data sample to a learnable token importance and then dynamically distances data samples based on token importance, using a loss function inspired by the Coulomb force (Coulomb, 1785). After training, token importance reveals which sub-components in a data sample contribute to its semantic meaning and are key to distinguishing it from other data samples. The dynamic separation between data samples encourages a model to predict diverse outputs, thus boosting expressiveness.
This method can be viewed as connecting sub-component importance and inter-sample relationships to elevate the impact of local details. A similar observation can be found in atomic physics, where the equilibrium distance between atoms, the fundamental particles that form all matter in nature, depends on the structure of the sub-atomic particles in each atom (Brown, 2009; Halliday et al., 2013). In addition, applying atom modeling in a neural network also amounts to regularizing the representation space to preserve each data sample's uniqueness. Finally, atom modeling promotes expressiveness using a loss function with no latent supervision, enabling it to be flexibly applied to different applications.
We demonstrate the utility of atom modeling objective functions by training or finetuning convolutional neural networks, generative adversarial networks, and transformers on Gaussian mixtures, natural texts (CoLA, Poem), and natural images (MNIST, CIFAR10, CelebA-HQ, Oxford-IIIT Pets, Oxford-Flowers102, ImageNet-1K). Our experiments demonstrate that atom modeling outperforms baselines, provides an interpretation of how each sub-unit affects learning, and shows how atom modeling alters inter-sample relationships.

Figure 1. Illustration of an atom modeling use case. Consider a model $f_\theta = o(\mathcal{E}(z))$: data samples are transformed into the latent space, and their latent representations are distanced using atom modeling in conjunction with the training criterion for the output $y$. The colors labeled on each image in the latent space show the learned token importance, which indicates which parts are more crucial for identifying data samples.
2. Related Work
Atom modeling can be interpreted as a way of learning representations by spacing data samples. The idea of keeping a distance among data samples has previously been used in manifold learning (Tenenbaum et al., 2000; Saul & Roweis, 2003; Cayton, 2005; Lin & Zha, 2008), graph representations (Perozzi et al., 2014; Grover & Leskovec, 2016; Hamilton et al., 2017), kernel tricks (Müller et al., 2001; Keerthi & Lin, 2003; Hofmann et al., 2008), and contrastive learning (Weinberger & Saul, 2009; Gutmann & Hyvärinen, 2010; Sohn, 2016; Oord et al., 2018; Chen et al., 2020), where representations are trained to fit a predefined inter-sample relationship. For instance, contrastive learning, the method most related to ours, requires preset negative and positive pairs in order to push away opposing pairs while bringing together the positive pairs. These methods necessitate prior knowledge of inter-sample relationships.
Since atom modeling can be seen as a technique that discretizes representations within a continuous space, it is natural to consider its discrete-space counterpart: the vector-quantized variational autoencoder (Van Den Oord et al., 2017; Razavi et al., 2019; Esser et al., 2021), which maps encoder outputs to an additional codebook with a preset number of codes as its form of discretization. This approach succeeds in reconstruction but is not easy to generalize to other models. In comparison, atom modeling promotes separation within the original continuous embedding space, freeing the restrictions of a preset number of codes and of the autoencoder architecture.
While atom modeling leverages fine-grained component importance to determine the balanced distances between data samples, it also provides model-agnostic partial interpretation that is orthogonal to model-dependent self-interpretable designs (Alvarez-Melis & Jaakkola, 2018) and post-hoc explanation methods (Ribeiro et al., 2016). This work, however, does not focus on explanation; it demonstrates interpretation as an outgrowth of atom modeling.
3. Method
Our goal is to define a flexible method that makes a model more expressive for data samples with multiple sub-units, such as images and texts, and does not need latent space supervision. We say a model is expressive if it can accommodate various distinct outputs for different inputs.
We define a model in a general form:

$$y = f_\theta(z), \quad z \sim \mathcal{D}, \tag{1}$$

where $\mathcal{D}$ is the data distribution or a random noise distribution. We can easily fit models for practical applications, such as generation or classification, into this form: $y$ is often a real vector $\mathbf{y}$ for generation and a probability distribution $P(Y)$ for classification. If $f_\theta$ is expressive, different $z$ are more likely to give different $y$ or $P(Y)$. This distinction is desirable for promoting diversity in generation models (Razavi et al., 2019) and encouraging entropy in classification models (Dubey et al., 2018).
To achieve the goal of distinct outputs, we first write a model in its composite form:

$$f_\theta(\cdot) = o(\mathcal{E}(\cdot)), \tag{2}$$

where the latent function $\mathcal{E}(\cdot) \in \mathbb{R}^{N \times h}$ gives the latent representation of an input, and $o(\cdot)$ outputs the result given the latent representation. $N$ is the number of sub-components, and $h$ is the dimension of the latent space. An intuitive way to increase the probability that $f_\theta(z_A)$ differs from $f_\theta(z_B)$ is to let $\mathcal{E}(z_A)$ be distant from $\mathcal{E}(z_B)$. Here, we show the properties of the output function $o(\cdot)$ that lead to this concurrent increase.
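As a concrete instance of this composite form, the following is a minimal sketch (assuming PyTorch; the embedding encoder, mean pooling, and linear head are our illustrative choices, not prescribed by the paper) of a model $f_\theta(z) = o(\mathcal{E}(z))$ whose latent function produces $N \times h$ token representations:

```python
import torch
import torch.nn as nn

class CompositeModel(nn.Module):
    """f_theta(z) = o(E(z)) as in Equation 2 (illustrative sketch)."""
    def __init__(self, vocab_size: int, h: int, num_classes: int):
        super().__init__()
        self.E = nn.Embedding(vocab_size, h)   # latent function E(.): z -> R^{N x h}
        self.o = nn.Linear(h, num_classes)     # output function o(.)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        latents = self.E(z)                    # (batch, N, h): one row per sub-component
        pooled = latents.mean(dim=1)           # collapse the N tokens before the head
        return self.o(pooled)                  # y, e.g. class logits
```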
Figure 2. $\mathcal{L}_A$ with varied atomic structure similarity $k$ (curves shown from $k = 10$ down to $k = 0.1$, i.e., toward more similar structures). The distance attaining the minimum loss depends on the intra-sample structures: as the structures become more similar (decaying $k$), the minimum-loss distance becomes larger. At the same time, the distance cannot be zero.
Lemma 1. For a $G$-Lipschitz function $o(\cdot)$ whose inverse is $K$-Lipschitz, the output space distance is bounded as:

$$\frac{1}{K}\,\|v - u\| \le \|o(v) - o(u)\| \le G\,\|v - u\|, \tag{3}$$

where $v$ and $u$ are any vectors in the latent space.

Equation 3 indicates that if the latent distance increases, the bounds on the output distance also increase. All proofs in this paper are in Appendix A.
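As a quick numerical check of Lemma 1 (our own sketch, not from the paper): for an invertible linear map $o(v) = Wv$, $G$ is the largest singular value of $W$ and $K$ is the reciprocal of the smallest, so both bounds of Equation 3 can be verified directly:

```python
import numpy as np

rng = np.random.default_rng(0)
h = 8
W = rng.normal(size=(h, h))              # invertible (almost surely) linear o(v) = W v
s = np.linalg.svd(W, compute_uv=False)
G, K = s.max(), 1.0 / s.min()            # o is G-Lipschitz; o^{-1} is K-Lipschitz

v, u = rng.normal(size=h), rng.normal(size=h)
out_dist = np.linalg.norm(W @ v - W @ u)     # ||o(v) - o(u)||
lat_dist = np.linalg.norm(v - u)             # ||v - u||

assert lat_dist / K <= out_dist + 1e-9       # lower bound of Equation 3
assert out_dist <= G * lat_dist + 1e-9       # upper bound of Equation 3
print(f"{lat_dist / K:.3f} <= {out_dist:.3f} <= {G * lat_dist:.3f}")
```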
The next challenge is how, in a general case without latent space supervision, we should set apart the latent variables produced by $\mathcal{E}(\cdot)$. We propose to dynamically distance latent representations by separating the currently close variables and neglecting the already distant ones. Whether the variables are close or distant depends on their intra-sample structures. This leads us to first map a latent variable to a new embedding space by a learnable mapping function $A(\cdot)$ such that:

$$\{(q_i, \mathbf{p}_i)\}_{i=1}^{N} = A(\mathcal{E}(z)), \tag{4}$$

where $q_i \in \mathbb{R}$ is the importance score of the $i$-th row (token) in $\mathcal{E}(z)$, and $\mathbf{p}_i \in \mathbb{R}^h$ is the position of the same token in the new space.
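A minimal sketch of one possible parameterization of $A(\cdot)$ (assuming PyTorch; the linear heads anticipate the $Q$ and $P$ maps of Equations 6 and 12 below, and their architecture is our assumption):

```python
import torch
import torch.nn as nn

class AtomMapping(nn.Module):
    """A(.): token latents E(z) in R^{N x h} -> importance scores q_i and
    positions p_i in a new embedding space (Equation 4). Illustrative sketch."""
    def __init__(self, h: int):
        super().__init__()
        self.Q = nn.Linear(h, 1)   # Q(.): R^h -> R, unnormalized importance (Eq. 6)
        self.P = nn.Linear(h, h)   # P(.): R^h -> R^h, token position (Eq. 12)

    def forward(self, latents: torch.Tensor):
        q = 2 * torch.sigmoid(self.Q(latents)).squeeze(-1) - 1   # q_i in [-1, 1]
        p = self.P(latents)                                      # p_i in R^h
        return q, p
```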
Then, we propose a dynamic distancing loss function:

$$\mathcal{L}_A = \mathbb{E}_{z_A, z_B \sim \mathcal{D}} \sum_{i \in A,\, j \in B} \frac{q_i^A\, q_j^B}{d\!\left(q_i^A, q_j^B, \mathbf{p}_i^A, \mathbf{p}_j^B\right)}, \tag{5}$$

where $d(q_i^A, q_j^B, \mathbf{p}_i^A, \mathbf{p}_j^B) \in \mathbb{R}$ is a distance between the $i$-th and $j$-th tokens in $z_A$ and $z_B$ and is derived from their intra-sample structures. We also use $A$ and $B$ as the index sets $\{1, \cdots, N_A\}$ and $\{1, \cdots, N_B\}$.
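A sketch of the inner sum of Equation 5 for one sampled pair (assuming PyTorch; the pairwise distances $d_{ij}$ are taken as given here, since their definition follows in Section 3.2):

```python
import torch

def atomic_pair_loss(q_A: torch.Tensor, q_B: torch.Tensor,
                     d: torch.Tensor) -> torch.Tensor:
    """Sum over i in A, j in B of q_i^A q_j^B / d_ij (Equation 5).
    q_A: (N_A,) and q_B: (N_B,) token importances; d: (N_A, N_B) distances.
    L_A is the expectation of this quantity over sampled pairs (z_A, z_B)."""
    charge = q_A[:, None] * q_B[None, :]   # Coulomb-style products q_i q_j
    return (charge / d).sum()
```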
By minimizing $\mathcal{L}_A$ in Equation 5, the optimal distance between $\mathcal{E}(z_A)$ and $\mathcal{E}(z_B)$ cannot be $0$. That is, our proposed atomic loss forces $\mathcal{E}(z_A)$ and $\mathcal{E}(z_B)$ to be apart. In addition, the optimal distances are not identical for different data pairs, and these optimal values depend on each data sample's intra-sample structure. Figure 2 shows examples of the atomic loss function and optimal distances.

Table 1. Interpretation of token importance $q_i$.

                              Attributes to meaning?
                              Yes     No
  Distinguishes data?   Yes   +1      -1
                        No     0      Not considered
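To see concretely why the minimum-loss distance in Figure 2 is strictly positive, consider a stylized one-dimensional reading of Equations 5 and 8 (our own toy example; the charge values and the radii offset are made up): repulsive charge interacts at distance $\bar{d}$, attractive charge at $\bar{d}$ plus a radii offset, and the loss is scanned over $\bar{d}$.

```python
import numpy as np

# Toy pair: aggregate repulsive charge, aggregate attractive charge, and the
# radii offset (r_A + r_B) / 2 applied to attracting pairs (cf. Equation 8).
q_rep, q_att, r = 1.0, -3.0, 0.5

d_bar = np.linspace(0.01, 5.0, 500)              # candidate inter-sample distances
loss = q_rep / d_bar + q_att / (d_bar + r)       # stylized 1-D version of Equation 5

d_opt = d_bar[np.argmin(loss)]
print(f"minimum-loss distance: {d_opt:.2f}")     # finite and strictly > 0
```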
3.1. Token Importance

In Equation 5, $q_i^A \in \mathbb{R}$ is a learnable importance score of a token in a data sample $A$. Given a latent representation of $A$, $\mathcal{E}(z_A) = [\mathbf{e}_1^A\ \mathbf{e}_2^A\ \ldots\ \mathbf{e}_{N_A}^A] \in \mathbb{R}^{N_A \times h}$, we define the token importance as:

$$q_i^A = 2\,\sigma\!\left(Q(\mathbf{e}_i^A)\right) - 1 \in [-1, 1], \tag{6}$$

where $\sigma(\cdot)$ is the sigmoid function, and $Q(\cdot): \mathbb{R}^h \mapsto \mathbb{R}$ maps the original $h$-dimensional latent variable to an unnormalized importance score. We rescale the score to $[-1, 1]$ because it is a simple way to obtain the three types of the product $q_i q_j$ needed in Equation 5: polarity (negative), likeness (positive), and no effect (zero). Since $q_i q_j$ contributes to $\mathcal{L}_A$ only when $q_i$ is nonzero, one role of token importance is to ask whether the $i$-th token in a data sample makes it distinguishable from other data samples. As shown in Table 1, a token importance of $+1$ or $-1$ helps distinguish data, while $0$ does not.
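A small numeric illustration of Equation 6 and Table 1 (our own sketch): extreme unnormalized scores map close to $\pm 1$ and a zero score maps to $0$, so the products $q_i q_j$ realize the three interaction types.

```python
import torch

def token_importance(scores: torch.Tensor) -> torch.Tensor:
    """Equation 6: rescale unnormalized scores Q(e_i) to q_i in [-1, 1]."""
    return 2 * torch.sigmoid(scores) - 1

q = token_importance(torch.tensor([8.0, -8.0, 0.0]))  # approximately [+1, -1, 0]
print(q[0] * q[0])   # likeness (positive product): the pair is pushed apart
print(q[0] * q[1])   # polarity (negative product): the pair is drawn together
print(q[2] * q[0])   # no effect (zero product): the term vanishes from L_A
```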
3.2. Atomic Distance

We define the atomic distance between data samples $A$ and $B$ by:

$$\bar{d}_{AB} = \|\mu_A - \mu_B\|_p, \tag{7}$$

and the distances among their $i$-th and $j$-th tokens, $d(q_i^A, q_j^B, \mathbf{p}_i^A, \mathbf{p}_j^B) \in \mathbb{R}$ in Equation 5, or $d_{ij}$ for brevity, by:

$$d_{ij} = \bar{d}_{AB} + \frac{r_A + r_B}{2}\,\mathrm{step}\!\left(-q_i^A q_j^B\right), \quad \forall i \in A,\ j \in B. \tag{8}$$

Here $\mu_A$ and $\mu_B$ are respectively an average of the most crucial tokens of $A$ and $B$, $r_A$ and $r_B$ are the token deviations within a data sample, and $\mathrm{step}(\cdot)$ is the step function. We formally define them as:

$$\mu_A = \frac{1}{N_A} \sum_{i \in A} m_i^A\, \mathbf{p}_i^A, \tag{9}$$

$$r_A = \frac{1}{N_A} \sum_{i \in A} (1 - m_i^A)\, \|\mathbf{p}_i^A - \mu_A\|_p, \tag{10}$$

$$m_i^A = 1 - \max(q_i^A, 0) \in [0, 1], \tag{11}$$

$$\mathbf{p}_i^A = P(\mathbf{e}_i^A) \in \mathbb{R}^h, \tag{12}$$
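Putting Equations 7–12 together, the following is a minimal sketch of the atomic distance and the resulting pair loss (assuming PyTorch, $p = 2$ for the norm, and a hard indicator for $\mathrm{step}(\cdot)$; a smooth surrogate would be needed wherever gradients through the step are required, and these choices are our assumptions, not specified in the excerpt):

```python
import torch

def atomic_distance(q_A, p_A, q_B, p_B, p_norm: float = 2.0):
    """Pairwise atomic distances d_ij between samples A and B (Equations 7-12).
    q_*: (N,) token importances in [-1, 1]; p_*: (N, h) token positions."""
    m_A = 1 - q_A.clamp(min=0)                           # Eq. 11: m_i = 1 - max(q_i, 0)
    m_B = 1 - q_B.clamp(min=0)
    mu_A = (m_A[:, None] * p_A).mean(dim=0)              # Eq. 9: weighted token average
    mu_B = (m_B[:, None] * p_B).mean(dim=0)
    r_A = ((1 - m_A) * (p_A - mu_A).norm(p=p_norm, dim=1)).mean()   # Eq. 10
    r_B = ((1 - m_B) * (p_B - mu_B).norm(p=p_norm, dim=1)).mean()
    d_bar = (mu_A - mu_B).norm(p=p_norm)                 # Eq. 7: inter-sample distance
    attract = (q_A[:, None] * q_B[None, :] < 0).float()  # step(-q_i q_j) as an indicator
    return d_bar + 0.5 * (r_A + r_B) * attract           # Eq. 8: (N_A, N_B) matrix

def atomic_loss(q_A, p_A, q_B, p_B):
    """One sampled pair's contribution to L_A (Equation 5)."""
    d = atomic_distance(q_A, p_A, q_B, p_B)
    return (q_A[:, None] * q_B[None, :] / d).sum()
```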