Compositional Generalization in Unsupervised
Compositional Representation Learning:
A Study on Disentanglement and Emergent Language
Zhenlin Xu Marc Niethammer Colin Raffel
Department of Computer Science
University of North Carolina at Chapel Hill
{zhenlinx, mn, craffel}@cs.unc.edu
Abstract
Deep learning models struggle with compositional generalization, i.e. the ability
to recognize or generate novel combinations of observed elementary concepts. In
hopes of enabling compositional generalization, various unsupervised learning algorithms
have been proposed with inductive biases that aim to induce compositional
structure in learned representations (e.g. disentangled representation and emergent
language learning). In this work, we evaluate these unsupervised learning algorithms
in terms of how well they enable compositional generalization. Specifically,
our evaluation protocol focuses on whether or not it is easy to train a simple model
on top of the learned representation that generalizes to new combinations of compositional
factors. We systematically study three unsupervised representation learning
algorithms, namely β-VAE, β-TCVAE, and emergent language (EL) autoencoders, on
two datasets that allow directly testing compositional generalization. We find that
directly using the bottleneck representation with simple models and few labels may
lead to worse generalization than using representations from layers before or after
the learned representation itself. In addition, we find that the previously proposed
metrics for evaluating the levels of compositionality are not correlated with the
actual compositional generalization in our framework. Surprisingly, we find that
increasing pressure to produce a disentangled representation (e.g. increasing β in the
β-VAE) produces representations with worse generalization, while representations
from EL models show strong compositional generalization. Motivated
by this observation, we further investigate the advantages of using EL to induce
compositional structure in unsupervised representation learning, finding that it
shows consistently stronger generalization than disentanglement models, especially
when using less unlabeled data for unsupervised learning and fewer labels for
downstream tasks. Taken together, our results shed new light on the compositional
generalization behavior of different unsupervised learning algorithms with a
new setting to rigorously test this behavior, and suggest the potential benefits of
developing EL learning algorithms for more generalizable representations.
1 Introduction
A human's ability to recognize or generate novel combinations of seen elementary concepts, also
known as compositional generalization, is desirable for building general artificial intelligence (AI)
systems [22, 16, 3]. The Recognition-By-Components theory by Biederman [3] influenced the
early development of computer vision models that are inherently compositional, e.g., hierarchical
features [13, 14] and part-based models [37, 38]. However, modern deep learning systems still
struggle with this key capability of human intelligence [33]. A few works studied specific spatial and
object-wise compositionality [42, 30] or more general compositionality in the space of pre-defined
attributes [45, 39] or the semantics of human language descriptions [49, 44].
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.00482v2 [cs.LG] 5 Oct 2022
Humans often express complex meaning in a compositional manner: we combine elementary
representations to describe observations. For example, an object with simple geometry is described
by separate and independent properties such as color (red, blue, ...), position (left, right, close, far
away, ...), and shape (circle, triangle, ...). Therefore, compositional representations are thought to be
helpful or even essential for achieving compositional generalization [22, 15, 3, 28]. However, because
there are exponentially many possible combinations of a given set of elementary concepts, we
need to deal with this combinatorial explosion for real-world visual observations. It is unrealistic to
annotate enough data to learn the fine-grained compositionality. Therefore, unsupervised compositional
representation learning is appealing because it does not require comprehensive labeling. However,
unsupervised representation learning heavily relies on the design of an effective inductive bias (e.g.
on the representation formulation) to induce the emergence of compositional representations.
A widely explored representation formulation with explicit compositionality is disentanglement.
The most common formulation of disentanglement is that the generative factors of observations
should be encoded into different factors of low-dimensional representations, and a change of a
single factor in an observation leads to a change in a single factor of the representation.
State-of-the-art unsupervised disentanglement models [20, 24, 27, 7, 34] are largely built on top of
variational generative models [25]. To measure the level of disentanglement, various quantitative
metrics have been proposed that are defined based on the statistical relations between the learned
representations and ground truth factors with an emphasis on the separation of factors. A summary of
disentanglement metrics and methods is provided in [31]. Another representation learning approach
with an inductive bias towards compositionality is emergent language learning. Natural language
allows us to describe novel composite concepts by combining expressions of their elementary concepts
according to grammar. Therefore, linguists have been interested in studying the compositionality
of discrete codes evolving during multi-agent communication when agents learn to complete a task
cooperatively [9, 29, 19, 40]. Compositionality metrics with language structure assumptions, e.g.
topographic similarity [4], were used to evaluate the learned language. However, for the above two
types of methods, relatively few studies [36, 41, 6, 1] have directly evaluated how well the learned
representations generalize to novel combinations on downstream tasks, which is the main motivation
for compositional representation learning in the first place.
In this work, we study the compositional generalization performance of representations learned
from unsupervised learning algorithms with two types of inductive biases for compositionality:
disentanglement and emergent languages (EL). Instead of measuring the compositionality and
disentanglement metrics defined based on various assumptions, we directly measure the generalization
performance on novel input combinations with a two-stage protocol. Specifically, with a dataset
divided into train and test sets such that the test set contains only novel combinations of concepts that
never appear in the train set, we first learn an unsupervised representation model from unlabeled
images in the train set. With very few labeled samples from the train set and the frozen unsupervised
representation model, we then train simple (e.g. linear) models on top of the learned representations to
predict the ground truth value of each generative factor of the dataset, and evaluate these simple models
on the test set. These choices are aligned with common practices in recent deep representation
learning works, for example, self-supervised representation learning [10] and semi-supervised
learning with generative models [26]. Unlike previous studies (e.g. [6, 36]) that measure
unsupervised learning performance (e.g. image reconstruction), we evaluate the performance on
downstream tasks. We also emphasize how easily we can obtain downstream task models with the
learned representation, e.g. when using very few labels and simple linear models. These designs
highlight the generalization performance of the unsupervised learning stage, in contrast to a setup
that uses many more or even all labeled samples of the train set in the downstream task learning stage
[10, 41] or performs unsupervised learning on the entire dataset [31]. More importantly, we study
not only the compositional generalization of intermediate representations at the model bottleneck,
e.g. where the disentangled latent variables are formulated, but also the representations from layers
before or after it.
With the above evaluation protocol for compositional generalization, we explore selected unsupervised
learning algorithms by varying (1) the hyperparameters of each algorithm that may control the levels
of compositionality; (2) image datasets and the amount of data for both unsupervised and supervised
learning stages; and (3) design choices of EL learning. First, we find that, compared to the
low-dimensional latent variables from the model bottleneck, representations from the layers before or
after the model bottleneck enable better compositional generalization. Second, we find that attaining
higher scores on previously proposed compositionality/disentanglement metrics does not always
correlate with better generalization performance. Finally, the representations learned from EL models
show stronger generalization performance than disentanglement models. These findings reveal the
divergence between the efforts to achieve better compositionality/disentanglement metric scores and
the initial motivation to obtain better generalization performance. To the best of our knowledge, this is
the first study to compare representations in disentanglement models and emergent language learning
through the lens of compositional generalization with a unified evaluation setup. We also discuss
the advantage of the emergent language representation format versus disentanglement and connect it
with recent related research in representation learning.
Figure 1: Architectures of disentanglement models (left) and emergent language models (right).
2 Unsupervised Learning with Compositional Representation Inductive Bias
In this section, we introduce more details on unsupervised learning algorithms with two different
compositionality-seeking inductive biases on the representation formulation: disentangled
representations in Section 2.1 and emergent languages in Section 2.2.
2.1 Learning Disentangled Representations
The concept of disentanglement assumes that high-dimensional observations of the real world $x$
can be represented by low-dimensional latent variables $z$, where each dimension of $z$ encodes
an independent factor of variation in $x$. For unsupervised disentanglement learning algorithms, we
select the popular β-VAE and β-TCVAE methods, which both modify the evidence lower bound
(ELBO) objective in the variational autoencoder (VAE).
β-VAE uses a hyperparameter β for the Kullback-Leibler (KL) regularization term of the vanilla VAE
loss to control the bandwidth of the VAE bottleneck:
$$\mathbb{E}_{p(x)}\left[\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta\,\mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)\right], \tag{1}$$
where $p(z)$ is the assumed prior distribution of the latent variables, the approximate posterior
$q_\phi(z|x)$ is parameterized by a neural network (encoder) with parameters $\phi$, and the conditional
likelihood $p_\theta(x|z)$ is parameterized by a decoder with parameters $\theta$. Setting β = 1 recovers
the standard VAE objective.
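As a rough illustration of Eq. (1), the sketch below evaluates the β-weighted objective for a batch, assuming a diagonal Gaussian encoder and a standard normal prior so that the KL term has a closed form. The function names, the NumPy implementation, and the choice of passing the reconstruction log-likelihood in as an array are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def beta_vae_objective(log_px_given_z, mu, logvar, beta=1.0):
    """Batch estimate of Eq. (1): reconstruction term minus beta times the KL term.

    log_px_given_z: shape (batch,), log p(x|z) for one sampled z per input
                    (hypothetical input; any decoder likelihood could supply it).
    mu, logvar:     shape (batch, latent_dim), encoder outputs.
    """
    kl = gaussian_kl_to_standard_normal(mu, logvar)
    return np.mean(log_px_given_z - beta * kl)

# Larger beta penalizes the same KL more strongly.
mu = np.array([[1.0, 0.0]])
logvar = np.zeros((1, 2))
recon = np.array([-10.0])
print(beta_vae_objective(recon, mu, logvar, beta=1.0))  # -10.5
print(beta_vae_objective(recon, mu, logvar, beta=4.0))  # -12.0
```

The objective is written as a quantity to maximize; a training loop would typically minimize its negative.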
β-TCVAE further decomposes the KL term in Eq. (1) into mutual information, total correlation, and
dimension-wise KL terms, and penalizes the total correlation with the hyperparameter β:
$$\mathbb{E}_{q(z|x)p(x)}[\log p_\theta(x|z)] - \alpha I_q(x;z) - \beta\,\mathrm{KL}\Big(q_\phi(z)\,\Big\|\,\prod_j q_\phi(z_j)\Big) - \gamma \sum_j \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big), \tag{2}$$
where $I_q(x;z)$, $\mathrm{KL}(q_\phi(z)\,\|\,\prod_j q_\phi(z_j))$, and $\sum_j \mathrm{KL}(q(z_j)\,\|\,p(z_j))$ are the mutual
information term, the total correlation term, and the dimension-wise KL term, respectively. The
proposed β-TCVAE uses α = γ = 1 and tunes β only.
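To build intuition for what the total correlation term in Eq. (2) measures, the sketch below computes it in closed form for the special case where the aggregate posterior is a multivariate Gaussian. This is only an illustrative assumption: the aggregate posterior of a trained model is generally not Gaussian, and β-TCVAE estimates the term from minibatches rather than with this formula. The function name is hypothetical.

```python
import numpy as np

def gaussian_total_correlation(cov):
    """Total correlation KL(q(z) || prod_j q(z_j)) for q(z) = N(mu, cov).

    For a Gaussian this equals 0.5 * (sum_j log cov_jj - log det cov) and
    does not depend on the mean.
    """
    cov = np.asarray(cov, dtype=float)
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("cov must be positive definite")
    return 0.5 * (np.sum(np.log(np.diag(cov))) - logdet)

# Independent dimensions give zero total correlation;
# correlated dimensions give a strictly positive value.
print(gaussian_total_correlation(np.eye(2)))
print(gaussian_total_correlation([[1.0, 0.8], [0.8, 1.0]]))
```

The term vanishes exactly when the latent dimensions are statistically independent, which is why penalizing it with a larger β pushes toward factorized codes.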
2.2 Learning Emergent Language
An alternative compositional representation learning method is emergent language (EL) learning,
which aims to learn a representation that mimics the properties of natural language. The emergent
language consists of sequences of discrete symbols from a vocabulary. Since the model combines
discrete symbols in the vocabulary to represent complex semantics in observations, one expects that
meaningful compositionality might naturally emerge in communication between multiple agents
using the emergent language to solve tasks. We consider the typical speaker-listener model
(two-agent communication) for EL learning, as shown in Fig. 1. We apply EL learning to the image
reconstruction task to be consistent with the reconstruction objective used by variational
autoencoder-based models. Note that in our setting the terminology “speaker-listener” is equivalent
to the more common “encoder-decoder” terminology. The task is as follows.
1. The speaker receives an input $x$ and encodes it as a message $m = \{m_1, m_2, \ldots\}$, a sequence
of discrete symbols from the vocabulary $V = \{c_1, c_2, \ldots, c_{n_V}\}$ of size $n_V$. The maximum
length of $m$ is $n_{msg}$.
2. The listener model receives the message $m$ and outputs $\hat{x}$, aiming to accurately reconstruct
the encoder input $x$.
In this work, we use a speaker and a listener that are both hybrids of a convolutional neural network
and an LSTM recurrent neural network. The flattened convolutional embedding of the input image,
EncConv($x$), is used as the initial cell state of an LSTM encoding module (EncLSTM). EncLSTM
generates a discrete distribution $q(m_t|x)$ over $V$ at each time step $t$ autoregressively (with the
embedding of the discrete token sampled in the previous step $t-1$ as input):
$$q(m_t|x) = \mathrm{EncLSTM}\big(m_{t-1} \mid \mathrm{EncConv}(x), \mathrm{emb}(m_1), \ldots, \mathrm{emb}(m_{t-2})\big), \tag{3}$$
where emb(·) is the learnable layer that projects a discrete token into a high-dimensional embedding.
We use the Gumbel-Softmax [23, 32] to sample from the discrete distribution $q(m_t|x)$ and the
“straight-through” (ST) gradient estimator [2] for quantization (from a soft distribution to a one-hot
vector). This allows us to estimate the gradients through the discrete sampling process:
$$m_t = \mathrm{ST}\big(\mathrm{GumbelSoftmax}(q(m_t|x))\big). \tag{4}$$
To allow the message to be of variable length, which better mimics natural language, we set one token
in $V$ to be the end-of-sequence (EOS) token that indicates the message end.
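The forward pass of the sampling step in Eq. (4) can be sketched as follows. This is a minimal NumPy illustration under assumed names and an assumed temperature parameter; it covers only the forward computation (a relaxed Gumbel-Softmax sample followed by one-hot quantization), because NumPy has no automatic differentiation. In an autodiff framework the straight-through step is commonly written as `hard + soft - stop_gradient(soft)`, so that the forward value is one-hot while gradients flow through the soft sample.

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature, rng):
    """Draw a relaxed (soft) one-hot sample from the categorical defined by logits."""
    u = np.clip(rng.uniform(size=logits.shape), 1e-10, 1.0 - 1e-10)
    gumbel_noise = -np.log(-np.log(u))
    y = (logits + gumbel_noise) / temperature
    y = y - y.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

def quantize_one_hot(soft):
    """Forward value of the straight-through step: one-hot vector at the argmax."""
    hard = np.zeros_like(soft)
    idx = soft.argmax(axis=-1)
    np.put_along_axis(hard, idx[..., None], 1.0, axis=-1)
    return hard

rng = np.random.default_rng(0)
logits = np.log(np.array([[0.7, 0.2, 0.1]]))  # hypothetical q(m_t | x) over 3 tokens
soft = gumbel_softmax_sample(logits, temperature=1.0, rng=rng)
token = quantize_one_hot(soft)  # one-hot symbol m_t passed on to the next step
```

Lower temperatures make the soft sample closer to one-hot, at the cost of higher-variance gradients.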
When the listener decodes the message $m$, it first maps each discrete symbol into an embedding
based on a learnable embedding layer and uses a decoding LSTM layer (DecLSTM) to process the
sequence of embeddings recurrently:
$$\mathrm{Emb}_t(m) = \mathrm{DecLSTM}\big(\mathrm{emb}(m_t) \mid \mathrm{emb}(m_1), \mathrm{emb}(m_2), \ldots, \mathrm{emb}(m_{t-1})\big). \tag{5}$$
We use the output of DecLSTM at the ending step $T$ as the embedding of the message, which is the
input of a convolutional decoder for image reconstruction:
$$\mathrm{Emb}(m) = \mathrm{Emb}_T(m), \quad \text{where } T = \min\big(n_{msg}, \operatorname{argmin}_i \{m_i = \mathrm{EOS},\ i \in \{1..N\}\}\big). \tag{6}$$
Finally, the convolutional decoder (DecConv) reconstructs the input by:
$$\hat{x} = \mathrm{DecConv}(\mathrm{Emb}(m)). \tag{7}$$
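The ending step $T$ in Eq. (6) is simply the position of the first EOS token, capped at the maximum message length. A minimal sketch, assuming messages are given as lists of integer token ids and that the id used for EOS (here 0) is an arbitrary choice for this example:

```python
def effective_length(message, eos, n_msg):
    """Return T = min(n_msg, position of the first EOS), using 1-based positions.

    If no EOS token occurs within the first n_msg symbols, T equals n_msg.
    """
    for position, token in enumerate(message[:n_msg], start=1):
        if token == eos:
            return position
    return n_msg

print(effective_length([3, 5, 0, 2, 0], eos=0, n_msg=5))  # 3
print(effective_length([3, 5, 1, 2, 4], eos=0, n_msg=5))  # 5
```

The listener would then read out the DecLSTM state at step $T$ and ignore any symbols after it.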
3 Experimental Design
3.1 Datasets
We are interested in whether representations can generalize to novel combinations of seen concepts.
Therefore, we need datasets that provide ground-truth labels of elementary concepts for (1) creating
train/test splits and (2) downstream task evaluation. We consider datasets with $n_{gen}$ independent
generative factors $F = \{f_1, \ldots, f_{n_{gen}}\}$, where the space of factor $f_i$ is $S_i$. For example, if
$f_1$ is color, then $S_1$ could be {yellow, red, blue, ...}. The data space $D$ is defined by the Cartesian
product of the spaces of each factor, and therefore the cardinality of the dataset is
$|D| = \prod_{i=1}^{n_{gen}} |S_i|$.
In our study, we used two public image datasets studied in the disentanglement literature: dSprites [35]
and MPI3D [17]. The dSprites dataset contains images of 2D shapes generated from 5 factors
$F$ = {shape, scale, rotation, x position, y position}. To avoid label ambiguity due to the rotational
symmetry of the square and ellipse shapes, we limit the range of orientations to [0, π/2). The
cardinalities of the factor spaces are then {3, 6, 10, 32, 32} respectively, which gives
$|D| = 3 \cdot 6 \cdot 10 \cdot 32 \cdot 32 = 184{,}320$.
MPI3D is a set of 3D datasets synthesized or recorded in a controlled environment with an object held
by a robotic arm. In our evaluation, the challenging real-world version (MPI3D-Real) is used. It has 7
factors $F$ = {object-color(6), object-shape(6), object-size(2), camera-height(3), background-color(3),
horizontal-axis(40), vertical-axis(40)}, with the corresponding cardinalities of the space in parentheses,
leading to a total of 1,036,800 images.
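The dataset sizes follow directly from multiplying the factor cardinalities listed above. A quick check, using dictionary keys that are only illustrative labels for the factors:

```python
from math import prod

# Factor cardinalities as listed in the text (dSprites with orientations in [0, pi/2)).
dsprites = {"shape": 3, "scale": 6, "rotation": 10, "x_position": 32, "y_position": 32}
mpi3d_real = {
    "object_color": 6, "object_shape": 6, "object_size": 2, "camera_height": 3,
    "background_color": 3, "horizontal_axis": 40, "vertical_axis": 40,
}

def dataset_size(factor_sizes):
    """Cardinality of the Cartesian product of the factor spaces."""
    return prod(factor_sizes.values())

print(dataset_size(dsprites))    # 184320
print(dataset_size(mpi3d_real))  # 1036800
```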
3.2 Compositional Generalization Evaluation Protocol
Our evaluation protocol emphasizes the compositional generalization that models can achieve on
downstream tasks. Our objective is to measure how easily an unsupervised representation learning
method can produce compositional generalization using a simple model on top of the learned
representation. (1) Data splits. We first split a dataset into train/test sets randomly while ensuring
that all samples in the test set are novel combinations of elementary factors seen in the train set. (2)
Unsupervised representation learning. We learn unsupervised representations from the unlabeled
images of the train set of size $N_{train}$ with a selected algorithm. (3) Learning for downstream tasks.
Then, we freeze the learned representation model and use $N_{label}$ labeled samples from the train
set ($N_{label} \ll N_{train}$) to train a simple classifier/regressor to predict the ground truth value for
each factor $f_i$ of the dataset. (4) Testing generalization performance. Lastly, we test the performance
of the downstream task models on novel combinations of seen values of elementary factors.
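Step (1) of the protocol can be sketched as below. This is one possible way to build such a split, not necessarily the procedure used by the authors: it enumerates every factor combination, shuffles them, and retries until every individual factor value occurs at least once in the train set, so that each test sample is an unseen combination of seen values. The function name, the retry strategy, and the toy factor sizes are assumptions.

```python
import itertools
import random

def compositional_split(factor_sizes, train_fraction, seed=0, max_tries=100):
    """Split all factor combinations into disjoint train/test sets.

    Every test combination is absent from train, while every single factor
    value appears in at least one train combination.
    """
    combos = list(itertools.product(*(range(n) for n in factor_sizes)))
    n_train = int(round(train_fraction * len(combos)))
    rng = random.Random(seed)
    for _ in range(max_tries):
        rng.shuffle(combos)
        train, test = combos[:n_train], combos[n_train:]
        seen = [set(c[i] for c in train) for i in range(len(factor_sizes))]
        if all(len(values) == n for values, n in zip(seen, factor_sizes)):
            return train, test
    raise RuntimeError("could not cover all factor values; increase train_fraction")

# Toy example with three factors of sizes 3, 4 and 5 and a 30% train share.
train, test = compositional_split([3, 4, 5], train_fraction=0.3)
```

Because each combination occurs once in the full Cartesian product, any disjoint split already guarantees that test combinations are novel; the coverage check handles the requirement that their elementary values were seen.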
Readout model. We argue that it is important to use simple readout models and a limited number
of labeled training samples to evaluate downstream tasks. Otherwise, the performance gain from the
downstream task learning stage is mixed with that of the unsupervised learning stage. For example, if
the unsupervised representation model is an identity mapping, we can still get great performance with
a powerful readout model and enough labeled samples. In our main article, we use linear models for
downstream tasks. However, perfectly disentangled but linearly inseparable representations may still
show poor performance with a linear readout model. Sanity-checking experiments (discussed
in Appendix B) show that the linear readout model can still generalize well on nonlinear oracle
representations, possibly due to the limited value range of attributes in our datasets. In addition, extra
results using a non-linear readout model (Gradient Boosting Trees) are in Appendix C, and the main
observations are consistent. The linear models we use to predict the values of the generative factors are
ridge regression, using the $R^2$ score as the evaluation metric, and logistic regression, using
classification accuracy as the evaluation metric. Since the $R^2$ score can be negative while $R^2 = 0$
indicates random guessing, we clip all negative $R^2$ scores to zero.
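A linear readout with the clipped $R^2$ metric can be sketched as follows. The closed-form ridge solver, the regularization strength `alpha`, and the synthetic features standing in for frozen representations are all assumptions for this example; the paper does not specify its implementation here.

```python
import numpy as np

def _add_bias(features):
    """Append a constant column so the model has an intercept."""
    return np.hstack([features, np.ones((features.shape[0], 1))])

def fit_ridge(features, targets, alpha=1.0):
    """Closed-form ridge regression; the intercept is not regularized."""
    X = _add_bias(features)
    penalty = alpha * np.eye(X.shape[1])
    penalty[-1, -1] = 0.0
    return np.linalg.solve(X.T @ X + penalty, X.T @ targets)

def predict(weights, features):
    return _add_bias(features) @ weights

def clipped_r2(y_true, y_pred):
    """R^2 score with negative values clipped to zero."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return max(0.0, 1.0 - ss_res / ss_tot)

# Synthetic stand-in for frozen representations and one generative factor.
rng = np.random.default_rng(0)
z_train, z_test = rng.normal(size=(50, 3)), rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
w = fit_ridge(z_train, z_train @ true_w + 0.3, alpha=1e-3)
score = clipped_r2(z_test @ true_w + 0.3, predict(w, z_test))
```

A categorical factor would use a logistic regression readout scored by accuracy in the same way.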
Representation mode. In unsupervised disentanglement learning, low-dimensional latent variables
with explicit disentanglement regularization are used as the representation for downstream tasks, e.g.
the mean values of Gaussian distributions of the latent variables in the VAE. If the disentangled latent
variables each represent a single ground truth factor, only a simple mapping between factors and the
corresponding variables must be learned to achieve good performance. However, it is questionable
whether the learned latent variables disentangle in the assumed structure and therefore improve
generalization. On the other hand, the simple linear models in our evaluation protocol may not be
capable of processing the latent variables (discrete messages) in emergent language (EL) learning,
because we would not expect discrete sequential latent variables to be easily linearly separable.
Therefore, we also evaluate the intermediate features of the layers before or after the model bottleneck.
Specifically, in addition to the latent variables ($z_{latent}$), we also use the features immediately after
the convolutional encoder ($z_{post}$) or before the convolutional decoder ($z_{pre}$) as representations of
images, as shown in Fig. 1. We evaluate the use of $z_{post}$ and $z_{pre}$ in both disentanglement and
emergent language models.
3.3 Implementation Details
Data splits. For the main experiments, we use a 1:9 train/test split. The train set size (10%) is smaller
than what common machine learning setups and previous studies have used [36, 41]. However,
considering that the number of possible combinations increases exponentially for real data with an
increased number of generative factors, using fewer training samples even at the unsupervised
learning stage can help to obtain more meaningful conclusions for real-world scenarios.
Model architectures. Fig. 1 shows the architectures for disentanglement and emergent language
learning models. The encoder and decoder in disentanglement learning models are symmetric
architectures with a convolutional network and a multi-layer perceptron (MLP) similar to the design
in [5]. We scale up the size of the model by doubling the width of all layers, which improves the