Dynamic Latent Separation for Deep Learning
Yi-Lin Tuan¹, Zih-Yun Chiu², William Yang Wang¹
Abstract
A core problem in machine learning is to learn expressive latent variables for model prediction on complex data that involves multiple sub-components, in a flexible and interpretable fashion. Here, we develop an approach that improves expressiveness, provides partial interpretation, and is not restricted to specific applications. The key idea is to dynamically distance data samples in the latent space and thus enhance the output diversity. Our dynamic latent separation method, inspired by atomic physics, relies on the jointly learned structures of each data sample, which also reveal the importance of each sub-component for distinguishing data samples. This approach, atom modeling, requires no supervision of the latent space and allows us to learn extra, partially interpretable representations besides the original goal of a model. We empirically demonstrate that the algorithm also enhances the performance of small- to large-scale models in various classification and generation problems.
1. Introduction
Deep neural networks with multiple hidden layers are trained to be expressive models that learn complicated relationships between their inputs and outputs (Srivastava et al., 2014). Among various data types, data samples that consist of many sub-units, such as images and texts, can require models to be more expressive to account for nuanced differences among sub-units. This demand drives the development of large-scale and complex model architectures (Vaswani et al., 2017), which bring drawbacks such as compromised model interpretability (Ribeiro et al., 2016; Bastani et al., 2017; Rudin, 2019; Jain & Wallace, 2019).
Various algorithms improve model expressiveness without advancing model architectures. For instance, contrastive learning improves classification expressiveness (Dosovitskiy et al., 2014; Chen et al., 2020) by pushing apart latent features from different classes. Vector quantization tackles the expressiveness of autoencoders (Van Den Oord et al., 2017) by learning discrete representations using a preset codebook. As a separate effort, post-hoc methods or models designed to follow a self-explaining protocol (Alvarez-Melis & Jaakkola, 2018) reveal some of the underlying reasons for model behaviors. While these methods show promising results in their bundled applications, their transferability and usefulness to other applications are not yet certain. Meanwhile, generalizable training algorithms that can simultaneously improve expressiveness and uncover partial explanations remain underexplored.

¹Department of Computer Science, University of California, Santa Barbara. ²Department of Electrical and Computer Engineering, University of California, San Diego. Correspondence to: Yi-Lin Tuan <ytuan@cs.ucsb.edu>. Work in progress.
We present a novel algorithm that simultaneously improves model expressiveness, provides an interpretation of sub-component importance, and is generalizable to multiple applications. Our method, atom modeling, first maps the latent representation of each sub-component in a data sample to a learnable token importance and then dynamically distances data samples based on token importance, using a loss function inspired by the Coulomb force (Coulomb, 1785). After training, token importance reveals which sub-components in a data sample contribute to its semantic meaning and are key to distinguishing it from other data samples. The dynamic separation between data samples encourages a model to predict diverse outputs, thus boosting expressiveness.
This method can be viewed as connecting sub-component importance and inter-sample relationships to elevate the impact of local details. A similar observation can be found in atomic physics, where the equilibrium distance between atoms, the fundamental particles that form all matter in nature, depends on the structure of the sub-atomic particles in each atom (Brown, 2009; Halliday et al., 2013). In addition, applying atom modeling in a neural network also amounts to regularizing the representation space to preserve each data sample's uniqueness. Finally, atom modeling promotes expressiveness using a loss function with no latent supervision, enabling it to be flexibly applied to different applications.
We demonstrate the utility of atom modeling objective functions by training or finetuning convolutional neural networks, generative adversarial networks, and transformers on Gaussian mixtures, natural texts (CoLA, Poem), and natural images (MNIST, CIFAR10, CelebA-HQ, Oxford-IIIT Pets, Oxford-Flowers102, ImageNet-1K). Our experiments demonstrate that atom modeling outperforms baselines, provides an interpretation of how each sub-unit affects learning, and shows how atom modeling alters inter-sample relationships.

Figure 1. Illustration of an atom modeling use case. Consider a model $f_\theta = o(\mathcal{E}(z))$: data samples are transformed into the latent space, and their latent representations are distanced using atom modeling in conjunction with the training criterion for the output $y$. The colors labeled on each image in the latent space show the learned token importance, which indicates which parts are more crucial for identifying data samples.
2. Related Work
Atom modeling can be interpreted as a way of learning representations by spacing data samples. The idea of keeping a distance among data samples has previously been used in manifold learning (Tenenbaum et al., 2000; Saul & Roweis, 2003; Cayton, 2005; Lin & Zha, 2008), graph representations (Perozzi et al., 2014; Grover & Leskovec, 2016; Hamilton et al., 2017), kernel tricks (Müller et al., 2001; Keerthi & Lin, 2003; Hofmann et al., 2008), and contrastive learning (Weinberger & Saul, 2009; Gutmann & Hyvärinen, 2010; Sohn, 2016; Oord et al., 2018; Chen et al., 2020), where representations are trained to fit a predefined inter-sample relationship. For instance, contrastive learning, the method most related to ours, requires preset negative and positive pairs in order to push away opposing pairs while bringing together the positive pairs. These methods necessitate prior knowledge of inter-sample relationships.
Since atom modeling can be seen as a technique that discretizes representations within a continuous space, it is natural to consider its discrete-space counterpart: the vector-quantized variational autoencoder (Van Den Oord et al., 2017; Razavi et al., 2019; Esser et al., 2021), which maps encoder outputs to an additional codebook with a preset number of codes as its form of discretization. This approach succeeds in reconstruction but is not easy to generalize to other models. In comparison, atom modeling promotes separation within the original continuous embedding space, freeing the restrictions of a preset number of codes and of the autoencoder architecture.
While atom modeling leverages fine-grained component importance to determine the balanced distances between data samples, it also provides model-agnostic partial interpretation that is orthogonal to model-dependent self-interpretable designs (Alvarez-Melis & Jaakkola, 2018) and post-hoc explanation methods (Ribeiro et al., 2016). This work, however, does not focus on explanation; it demonstrates interpretation as an outgrowth of atom modeling.
3. Method
Our goal is to define a flexible method that makes a model more expressive for data samples with multiple sub-units, such as images and texts, and does not need latent space supervision. We say a model is expressive if it can accommodate various distinct outputs for different inputs.
We define a model in a general form:

$$y = f_\theta(z), \quad z \sim \mathcal{D}, \tag{1}$$

where $\mathcal{D}$ is the data distribution or a random noise distribution. We can easily fit models for practical applications, such as generation or classification, into this form: $y$ is often a real vector $\mathbf{y}$ for generation and a probability distribution $P(Y)$ for classification. If $f_\theta$ is expressive, different $z$ are more likely to give different $y$ or $P(Y)$. This distinction is desirable for promoting diversity in generation models (Razavi et al., 2019) and encouraging entropy in classification models (Dubey et al., 2018).
To achieve the goal of distinct outputs, we first write a model in its composite form:

$$f_\theta(\cdot) = o(\mathcal{E}(\cdot)), \tag{2}$$

where the latent function $\mathcal{E}(\cdot) \in \mathbb{R}^{N \times h}$ gives the latent representation of an input, and $o(\cdot)$ outputs the result given the latent representation. $N$ is the number of sub-components, and $h$ is the dimension of the latent space. An intuitive way to increase the probability that $f_\theta(z_A)$ differs from $f_\theta(z_B)$ is to let $\mathcal{E}(z_A)$ be distant from $\mathcal{E}(z_B)$. Here, we show the properties of the output function $o(\cdot)$ that lead to this concurrent increase.
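As a concrete instance of this composite form, the following is a minimal sketch (assuming PyTorch; the embedding encoder, mean pooling, and linear head are our illustrative choices, not prescribed by the paper) of a model $f_\theta(z) = o(\mathcal{E}(z))$ whose latent function produces $N \times h$ token representations:

```python
import torch
import torch.nn as nn

class CompositeModel(nn.Module):
    """f_theta(z) = o(E(z)) as in Equation 2 (illustrative sketch)."""
    def __init__(self, vocab_size: int, h: int, num_classes: int):
        super().__init__()
        self.E = nn.Embedding(vocab_size, h)   # latent function E(.): z -> R^{N x h}
        self.o = nn.Linear(h, num_classes)     # output function o(.)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        latents = self.E(z)                    # (batch, N, h): one row per sub-component
        pooled = latents.mean(dim=1)           # collapse the N tokens before the head
        return self.o(pooled)                  # y, e.g. class logits
```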
Figure 2. $\mathcal{L}_A$ with varied atomic structure similarity $k$ (curves shown from $k = 10$ down to $k = 0.1$, i.e., toward more similar structures). The distance attaining the minimum loss depends on the intra-sample structures: as the structures become more similar (decaying $k$), the minimum-loss distance becomes larger. At the same time, the distance cannot be zero.
Lemma 1. For a $G$-Lipschitz function $o(\cdot)$ whose inverse is $K$-Lipschitz, the output space distance is bounded as:

$$\frac{1}{K}\,\|v - u\| \le \|o(v) - o(u)\| \le G\,\|v - u\|, \tag{3}$$

where $v$ and $u$ are any vectors in the latent space.

Equation 3 indicates that if the latent distance increases, the bounds on the output distance also increase. All proofs in this paper are in Appendix A.
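As a quick numerical check of Lemma 1 (our own sketch, not from the paper): for an invertible linear map $o(v) = Wv$, $G$ is the largest singular value of $W$ and $K$ is the reciprocal of the smallest, so both bounds of Equation 3 can be verified directly:

```python
import numpy as np

rng = np.random.default_rng(0)
h = 8
W = rng.normal(size=(h, h))              # invertible (almost surely) linear o(v) = W v
s = np.linalg.svd(W, compute_uv=False)
G, K = s.max(), 1.0 / s.min()            # o is G-Lipschitz; o^{-1} is K-Lipschitz

v, u = rng.normal(size=h), rng.normal(size=h)
out_dist = np.linalg.norm(W @ v - W @ u)     # ||o(v) - o(u)||
lat_dist = np.linalg.norm(v - u)             # ||v - u||

assert lat_dist / K <= out_dist + 1e-9       # lower bound of Equation 3
assert out_dist <= G * lat_dist + 1e-9       # upper bound of Equation 3
print(f"{lat_dist / K:.3f} <= {out_dist:.3f} <= {G * lat_dist:.3f}")
```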
The next challenge is how, in a general case without latent space supervision, we should set apart the latent variables produced by $\mathcal{E}(\cdot)$. We propose to dynamically distance latent representations by separating the currently close variables and neglecting the already distant ones. Whether the variables are close or distant depends on their intra-sample structures. This leads us to first map a latent variable to a new embedding space by a learnable mapping function $A(\cdot)$ such that:

$$\{(q_i, \mathbf{p}_i)\}_{i=1}^{N} = A(\mathcal{E}(z)), \tag{4}$$

where $q_i \in \mathbb{R}$ is the importance score of the $i$-th row (token) in $\mathcal{E}(z)$, and $\mathbf{p}_i \in \mathbb{R}^h$ is the position of the same token in the new space.
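A minimal sketch of one possible parameterization of $A(\cdot)$ (assuming PyTorch; the linear heads anticipate the $Q$ and $P$ maps of Equations 6 and 12 below, and their architecture is our assumption):

```python
import torch
import torch.nn as nn

class AtomMapping(nn.Module):
    """A(.): token latents E(z) in R^{N x h} -> importance scores q_i and
    positions p_i in a new embedding space (Equation 4). Illustrative sketch."""
    def __init__(self, h: int):
        super().__init__()
        self.Q = nn.Linear(h, 1)   # Q(.): R^h -> R, unnormalized importance (Eq. 6)
        self.P = nn.Linear(h, h)   # P(.): R^h -> R^h, token position (Eq. 12)

    def forward(self, latents: torch.Tensor):
        q = 2 * torch.sigmoid(self.Q(latents)).squeeze(-1) - 1   # q_i in [-1, 1]
        p = self.P(latents)                                      # p_i in R^h
        return q, p
```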
Then, we propose a dynamic distancing loss function:

$$\mathcal{L}_A = \mathbb{E}_{z_A, z_B \sim \mathcal{D}} \sum_{i \in A,\, j \in B} \frac{q_i^A\, q_j^B}{d\!\left(q_i^A, q_j^B, \mathbf{p}_i^A, \mathbf{p}_j^B\right)}, \tag{5}$$

where $d(q_i^A, q_j^B, \mathbf{p}_i^A, \mathbf{p}_j^B) \in \mathbb{R}$ is a distance between the $i$-th and $j$-th tokens in $z_A$ and $z_B$ and is derived from their intra-sample structures. We also use $A$ and $B$ as the index sets $\{1, \cdots, N_A\}$ and $\{1, \cdots, N_B\}$.
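A sketch of the inner sum of Equation 5 for one sampled pair (assuming PyTorch; the pairwise distances $d_{ij}$ are taken as given here, since their definition follows in Section 3.2):

```python
import torch

def atomic_pair_loss(q_A: torch.Tensor, q_B: torch.Tensor,
                     d: torch.Tensor) -> torch.Tensor:
    """Sum over i in A, j in B of q_i^A q_j^B / d_ij (Equation 5).
    q_A: (N_A,) and q_B: (N_B,) token importances; d: (N_A, N_B) distances.
    L_A is the expectation of this quantity over sampled pairs (z_A, z_B)."""
    charge = q_A[:, None] * q_B[None, :]   # Coulomb-style products q_i q_j
    return (charge / d).sum()
```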
By minimizing $\mathcal{L}_A$ in Equation 5, the optimal distance between $\mathcal{E}(z_A)$ and $\mathcal{E}(z_B)$ cannot be $0$. That is, our proposed atomic loss forces $\mathcal{E}(z_A)$ and $\mathcal{E}(z_B)$ to be apart. In addition, the optimal distances are not identical for different data pairs, and these optimal values depend on each data sample's intra-sample structure. Figure 2 shows examples of the atomic loss function and optimal distances.

Table 1. Interpretation of token importance $q_i$.

                              Attributes to meaning?
                              Yes     No
  Distinguishes data?   Yes   +1      -1
                        No     0      Not considered
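To see concretely why the minimum-loss distance in Figure 2 is strictly positive, consider a stylized one-dimensional reading of Equations 5 and 8 (our own toy example; the charge values and the radii offset are made up): repulsive charge interacts at distance $\bar{d}$, attractive charge at $\bar{d}$ plus a radii offset, and the loss is scanned over $\bar{d}$.

```python
import numpy as np

# Toy pair: aggregate repulsive charge, aggregate attractive charge, and the
# radii offset (r_A + r_B) / 2 applied to attracting pairs (cf. Equation 8).
q_rep, q_att, r = 1.0, -3.0, 0.5

d_bar = np.linspace(0.01, 5.0, 500)              # candidate inter-sample distances
loss = q_rep / d_bar + q_att / (d_bar + r)       # stylized 1-D version of Equation 5

d_opt = d_bar[np.argmin(loss)]
print(f"minimum-loss distance: {d_opt:.2f}")     # finite and strictly > 0
```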
3.1. Token Importance

In Equation 5, $q_i^A \in \mathbb{R}$ is a learnable importance score of a token in a data sample $A$. Given a latent representation of $A$, $\mathcal{E}(z_A) = [\mathbf{e}_1^A\ \mathbf{e}_2^A\ \ldots\ \mathbf{e}_{N_A}^A] \in \mathbb{R}^{N_A \times h}$, we define the token importance as:

$$q_i^A = 2\,\sigma\!\left(Q(\mathbf{e}_i^A)\right) - 1 \in [-1, 1], \tag{6}$$

where $\sigma(\cdot)$ is the sigmoid function, and $Q(\cdot): \mathbb{R}^h \mapsto \mathbb{R}$ maps the original $h$-dimensional latent variable to an unnormalized importance score. We rescale the score to $[-1, 1]$ because it is a simple way to obtain the three types of the product $q_i q_j$ needed in Equation 5: polarity (negative), likeness (positive), and no effect (zero). Since $q_i q_j$ contributes to $\mathcal{L}_A$ only when $q_i$ is nonzero, one role of token importance is to ask whether the $i$-th token in a data sample makes it distinguishable from other data samples. As shown in Table 1, a token importance of $+1$ or $-1$ helps distinguish data, while $0$ does not.
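A small numeric illustration of Equation 6 and Table 1 (our own sketch): extreme unnormalized scores map close to $\pm 1$ and a zero score maps to $0$, so the products $q_i q_j$ realize the three interaction types.

```python
import torch

def token_importance(scores: torch.Tensor) -> torch.Tensor:
    """Equation 6: rescale unnormalized scores Q(e_i) to q_i in [-1, 1]."""
    return 2 * torch.sigmoid(scores) - 1

q = token_importance(torch.tensor([8.0, -8.0, 0.0]))  # approximately [+1, -1, 0]
print(q[0] * q[0])   # likeness (positive product): the pair is pushed apart
print(q[0] * q[1])   # polarity (negative product): the pair is drawn together
print(q[2] * q[0])   # no effect (zero product): the term vanishes from L_A
```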
3.2. Atomic Distance

We define the atomic distance between data samples $A$ and $B$ by:

$$\bar{d}_{AB} = \|\mu_A - \mu_B\|_p, \tag{7}$$

and the distances among their $i$-th and $j$-th tokens, $d(q_i^A, q_j^B, \mathbf{p}_i^A, \mathbf{p}_j^B) \in \mathbb{R}$ in Equation 5, or $d_{ij}$ for brevity, by:

$$d_{ij} = \bar{d}_{AB} + \frac{r_A + r_B}{2}\,\mathrm{step}\!\left(-q_i^A q_j^B\right), \quad \forall i \in A,\ j \in B. \tag{8}$$

Here $\mu_A$ and $\mu_B$ are respectively an average of the most crucial tokens of $A$ and $B$, $r_A$ and $r_B$ are the token deviations within a data sample, and $\mathrm{step}(\cdot)$ is the step function. We formally define them as:

$$\mu_A = \frac{1}{N_A} \sum_{i \in A} m_i^A\, \mathbf{p}_i^A, \tag{9}$$

$$r_A = \frac{1}{N_A} \sum_{i \in A} (1 - m_i^A)\, \|\mathbf{p}_i^A - \mu_A\|_p, \tag{10}$$

$$m_i^A = 1 - \max(q_i^A, 0) \in [0, 1], \tag{11}$$

$$\mathbf{p}_i^A = P(\mathbf{e}_i^A) \in \mathbb{R}^h, \tag{12}$$
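Putting Equations 7–12 together, the following is a minimal sketch of the atomic distance and the resulting pair loss (assuming PyTorch, $p = 2$ for the norm, and a hard indicator for $\mathrm{step}(\cdot)$; a smooth surrogate would be needed wherever gradients through the step are required, and these choices are our assumptions, not specified in the excerpt):

```python
import torch

def atomic_distance(q_A, p_A, q_B, p_B, p_norm: float = 2.0):
    """Pairwise atomic distances d_ij between samples A and B (Equations 7-12).
    q_*: (N,) token importances in [-1, 1]; p_*: (N, h) token positions."""
    m_A = 1 - q_A.clamp(min=0)                           # Eq. 11: m_i = 1 - max(q_i, 0)
    m_B = 1 - q_B.clamp(min=0)
    mu_A = (m_A[:, None] * p_A).mean(dim=0)              # Eq. 9: weighted token average
    mu_B = (m_B[:, None] * p_B).mean(dim=0)
    r_A = ((1 - m_A) * (p_A - mu_A).norm(p=p_norm, dim=1)).mean()   # Eq. 10
    r_B = ((1 - m_B) * (p_B - mu_B).norm(p=p_norm, dim=1)).mean()
    d_bar = (mu_A - mu_B).norm(p=p_norm)                 # Eq. 7: inter-sample distance
    attract = (q_A[:, None] * q_B[None, :] < 0).float()  # step(-q_i q_j) as an indicator
    return d_bar + 0.5 * (r_A + r_B) * attract           # Eq. 8: (N_A, N_B) matrix

def atomic_loss(q_A, p_A, q_B, p_B):
    """One sampled pair's contribution to L_A (Equation 5)."""
    d = atomic_distance(q_A, p_A, q_B, p_B)
    return (q_A[:, None] * q_B[None, :] / d).sum()
```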