Compositional Generalization in Unsupervised Compositional Representation Learning A Study on Disentanglement and Emergent Language


Compositional Generalization in Unsupervised

Compositional Representation Learning:

A Study on Disentanglement and Emergent Language

Zhenlin Xu Marc Niethammer Colin Raffel

Department of Computer Science

University of North Carolina at Chapel Hill

{zhenlinx, mn, craffel}@cs.unc.edu

Abstract

Deep learning models struggle with compositional generalization, i.e. the ability to recognize or generate novel combinations of observed elementary concepts. In hopes of enabling compositional generalization, various unsupervised learning algorithms have been proposed with inductive biases that aim to induce compositional structure in learned representations (e.g. disentangled representation and emergent language learning). In this work, we evaluate these unsupervised learning algorithms in terms of how well they enable compositional generalization. Specifically, our evaluation protocol focuses on whether or not it is easy to train a simple model on top of the learned representation that generalizes to new combinations of compositional factors. We systematically study three unsupervised representation learning algorithms, namely β-VAE, β-TCVAE, and emergent language (EL) autoencoders, on two datasets that allow directly testing compositional generalization. We find that directly using the bottleneck representation with simple models and few labels may lead to worse generalization than using representations from layers before or after the learned representation itself. In addition, we find that the previously proposed metrics for evaluating the levels of compositionality are not correlated with the actual compositional generalization in our framework. Surprisingly, we find that increasing pressure to produce a disentangled representation (e.g. increasing β in the β-VAE) produces representations with worse generalization, while representations from EL models show strong compositional generalization. Motivated by this observation, we further investigate the advantages of using EL to induce compositional structure in unsupervised representation learning, finding that it shows consistently stronger generalization than disentanglement models, especially when using less unlabeled data for unsupervised learning and fewer labels for downstream tasks. Taken together, our results shed new light onto the compositional generalization behavior of different unsupervised learning algorithms with a new setting to rigorously test this behavior, and suggest the potential benefits of developing EL learning algorithms for more generalizable representations.

1 Introduction

A human's ability to recognize or generate novel combinations of seen elementary concepts, also known as compositional generalization, is desirable for building general artificial intelligence (AI) systems [ ]. The Recognition-By-Components theory by Biederman [ ] influenced the early development of computer vision models that are inherently compositional, e.g., hierarchical features [ ] and part-based models [ ]. However, modern deep learning systems still struggle with this key capability of human intelligence [ ]. A few works studied specific spatial and object-wise compositionality [ ] or more general compositionality in the space of pre-defined attributes [45, 39] or the semantics of human language descriptions [49, 44].

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

arXiv:2210.00482v2 [cs.LG] 5 Oct 2022

Humans often express complex meaning in a compositional manner: we combine elementary representations to describe observations. For example, an object with simple geometry is described by separate and independent properties such as color (red, blue, ...), position (left, right, close, far away, ...), and shape (circle, triangle, ...). Therefore, compositional representations are thought to be helpful or even essential to achieve compositional generalization [ ]. However, considering that there are exponentially many possible combinations of a given set of elementary concepts, we need to deal with this combinatorial explosion for real-world visual observations. It is unrealistic to annotate enough data to learn the fine-grained compositionality. Therefore, unsupervised compositional representation learning is appealing because it does not require comprehensive labeling. However, unsupervised representation learning heavily relies on the design of an effective inductive bias (e.g. on the representation formulation) to induce the emergence of compositional representations.

A widely explored representation formulation with explicit compositionality is disentanglement. The most common formulation of disentanglement is that the generative factors of observations should be encoded into different factors of low-dimensional representations, and a change of a single factor in an observation leads to a change in a single factor of the representation. State-of-the-art unsupervised disentanglement models [ ] are largely built on top of variational generative models [ ]. To measure the level of disentanglement, various quantitative metrics have been proposed that are defined based on the statistical relations between the learned representations and ground truth factors with an emphasis on the separation of factors. A summary of disentanglement metrics and methods is provided in [ ]. Another representation learning approach with an inductive bias towards compositionality is emergent language learning. Natural language allows us to describe novel composite concepts by combining expressions of their elementary concepts according to grammar. Therefore, linguists have been interested in studying the compositionality of discrete codes evolving during multi-agent communication when agents learn to complete a task cooperatively [ ]. Compositionality metrics with language structure assumptions, e.g. topographic similarity [ ], were used to evaluate the learned language. However, for the above two types of methods, relatively few studies [ ] have directly evaluated how well the learned representations generalize to novel combinations on downstream tasks, which is the main motivation for compositional representation learning in the first place.

In this work, we study the compositional generalization performance of representations learned from unsupervised learning algorithms with two types of inductive biases for compositionality: disentanglement and emergent languages (EL). Instead of measuring the compositionality and disentanglement metrics defined based on various assumptions, we directly measure the generalization performance on novel input combinations with a two-stage protocol. Specifically, with a dataset divided into train and test sets ensuring that the test set contains novel combinations of concepts that never appear in the train set, we first learn an unsupervised representation model from unlabeled images in the train set. With very few labeled samples from the train set and the frozen unsupervised representation model, we train simple (e.g. linear) models on top of learned representations to predict the ground truth value for each generative factor of the dataset and evaluate these simple models on the test set. These choices are aligned with common practices in recent deep representation learning works, for example, self-supervised representation learning [ ] and semi-supervised learning with generative models [ ]. Different from previous studies (e.g. [ ]) that measure unsupervised learning performance (e.g. image reconstruction), we evaluate the performance on downstream tasks. We also emphasize how easily we can obtain downstream task models with the learned representation, e.g. when using very few labels and simple linear models. These designs highlight the generalization performance of the unsupervised learning stage, different from a setup that uses many more or even all labeled samples of the train set in the downstream task learning stage [ ] or performs unsupervised learning on the entire dataset [ ]. More importantly, we study not only the compositional generalization of intermediate representations at the model bottleneck, e.g. where the disentangled latent variables are formulated, but also the representations from layers before or after it.

With the above evaluation protocol for compositional generalization, we explore selected unsupervised learning algorithms by varying (1) the hyperparameters of each algorithm that may control the levels of compositionality; (2) image datasets and the amount of data for both unsupervised and supervised learning stages; and (3) design choices of EL learning. First, we find that, compared to the low-dimensional latent variables from the model bottleneck, representations from the layers before or after the model bottleneck enable better compositional generalization. Second, we find that attaining higher scores on previously proposed compositionality/disentanglement metrics does not always correlate with better generalization performance. Finally, the representations learned from EL models show stronger generalization performance than disentanglement models. These findings reveal the divergence between the efforts to achieve better compositionality / disentanglement metric scores and the initial motivation to obtain better generalization performance. To the best of our knowledge, this is the first study to compare representations in disentanglement models and emergent language learning through the lens of compositional generalization with a unified evaluation setup. We also discuss the advantage of the emergent language representation format versus disentanglement and connect it with recent related research in representation learning.

Figure 1: Architectures of disentanglement models (left) and emergent language models (right).

2 Unsupervised Learning with Compositional Representation Inductive Bias

In this section, we introduce more details on unsupervised learning algorithms with two different

compositionality-seeking inductive biases on the representation formulation: disentangled representa-

tions in Section 2.1 and emergent languages in Section 2.2.

2.1 Learning Disentangled Representations

The concept of disentanglement assumes that high-dimensional observations $x$ of the real world can be represented by low-dimensional latent variables $z$, where each dimension of $z$ encodes independent factors of variations in $x$. For unsupervised disentanglement learning algorithms, we select the popular β-VAE and β-TCVAE methods which both modify the evidence lower bound (ELBO) objective in the variational autoencoder (VAE). β-VAE uses a hyperparameter $\beta$ for the Kullback-Leibler (KL) regularization term of the vanilla VAE loss to control the bandwidth of the VAE bottleneck:

$$\mathbb{E}_{p(x)}\Big[\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta\,\mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)\Big], \tag{1}$$

where $p(z)$ is the assumed prior distribution of the latent variables, the approximate posterior $q_\phi(z|x)$ is parameterized by a neural network (encoder) whose parameters are $\phi$, and the likelihood $p_\theta(x|z)$ is parameterized by a decoder whose parameters are $\theta$. Setting $\beta = 1$ recovers the vanilla VAE loss.
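To make the role of $\beta$ in Eq. (1) concrete, the following is a minimal sketch of the per-sample objective. It is an illustration rather than the authors' implementation, and it assumes a diagonal Gaussian encoder, a standard normal prior (so the KL term has a closed form), and a reconstruction log-likelihood that has already been computed elsewhere. The function names `gaussian_kl` and `beta_vae_objective` are hypothetical.

```python
import math


def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions.

    Assumption: diagonal Gaussian posterior and standard normal prior.
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def beta_vae_objective(recon_log_lik, mu, log_var, beta=1.0):
    """Per-sample value of the bracketed term in Eq. (1), which is maximized.

    recon_log_lik is a (hypothetical) precomputed estimate of log p(x|z).
    """
    return recon_log_lik - beta * gaussian_kl(mu, log_var)


# Example: a two-dimensional latent with unit variance in each dimension.
mu, log_var = [0.5, -1.0], [0.0, 0.0]
print(gaussian_kl(mu, log_var))                           # KL penalty
print(beta_vae_objective(-10.0, mu, log_var, beta=1.0))   # vanilla VAE
print(beta_vae_objective(-10.0, mu, log_var, beta=4.0))   # stronger bottleneck pressure
```

Increasing `beta` scales only the KL penalty, which is how the hyperparameter tightens the bottleneck without changing the reconstruction term.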

β-TCVAE further decomposes the KL term in Eq. (1) into mutual information, total correlation, and dimension-wise KL terms, and penalizes the total correlation with the hyperparameter $\beta$:

$$\mathbb{E}_{q(z|x)p(x)}[\log p_\theta(x|z)] - \alpha I_q(x;z) - \beta\,\mathrm{KL}\Big(q_\phi(z)\,\Big\|\,\prod_j q_\phi(z_j)\Big) - \gamma \sum_j \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big), \tag{2}$$

where $I_q(x;z)$, $\mathrm{KL}(q_\phi(z)\,\|\,\prod_j q_\phi(z_j))$, and $\sum_j \mathrm{KL}(q(z_j)\,\|\,p(z_j))$ are the mutual information term, total correlation term, and dimension-wise KL term, respectively. The proposed β-TCVAE uses $\alpha = \gamma = 1$ and tunes $\beta$ only.
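For reference, the decomposition behind Eq. (2) can be written out explicitly. The identity below is the standard one and assumes a factorized prior $p(z) = \prod_j p(z_j)$; it is included as a reading aid and is not reproduced from this paper.

```latex
\mathbb{E}_{p(x)}\big[\mathrm{KL}\big(q(z|x)\,\|\,p(z)\big)\big]
  = \underbrace{I_q(x;z)}_{\text{mutual information}}
  + \underbrace{\mathrm{KL}\Big(q(z)\,\Big\|\,\prod_j q(z_j)\Big)}_{\text{total correlation}}
  + \underbrace{\sum_j \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big)}_{\text{dimension-wise KL}}
```

Under this identity, setting $\alpha = \beta = \gamma = 1$ in Eq. (2) recovers Eq. (1) with $\beta = 1$, i.e. the vanilla VAE objective, while $\beta > 1$ with $\alpha = \gamma = 1$ adds extra weight to the total correlation term alone.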

2.2 Learning Emergent Language

An alternative compositional representation learning method is emergent language (EL) learning,

which aims to learn a representation that mimics the properties of natural language. The emergent

language consists of sequences of discrete symbols from a vocabulary. Since the model combines

discrete symbols in the vocabulary to represent complex semantics in observations, one expects that

meaningful compositionality might naturally emerge in communication between multiple agents

using the emergent language to solve tasks. We consider the typical speaker-listener model (two-

agent communication) for EL learning, as shown in Fig. 1. We apply EL learning to the image

reconstruction task to be consistent with the reconstruction objective used by variational auto-encoder-

based models. Note that in our setting the terminology “speaker-listener” is equivalent to the more

common “encoder-decoder” terminology. The task is as follows.

The speaker receives an input $x$ and encodes it as a message $m = \{m_1, m_2, ...\}$, a sequence of discrete symbols from the vocabulary $V = \{c_1, c_2, ..., c_{n_V}\}$ of size $n_V$. The maximum length of $m$ is $n_{msg}$.

The listener model receives the message $m$ and outputs $\hat{x}$, aiming to accurately reconstruct the encoder input $x$.

In this work, we use a speaker and a listener that are both hybrids of a convolutional neural network and an LSTM recurrent neural network. The flattened convolutional embedding of the input image, $\mathrm{EncConv}(x)$, is used as the initial cell state of an LSTM encoding module (EncLSTM). EncLSTM generates a discrete distribution $q(m_t|x)$ over $V$ at each time step $t$ autoregressively (with the embedding of the discrete token sampled in the previous step $t-1$ as input):

$$q(m_t|x) = \mathrm{EncLSTM}\big(m_{t-1} \mid \mathrm{EncConv}(x), \mathrm{emb}(m_1), \dots, \mathrm{emb}(m_{t-2})\big), \tag{3}$$

where $\mathrm{emb}(\cdot)$ is the learnable layer that projects a discrete token into a high-dimensional embedding. We use the Gumbel-Softmax [ ] to sample from the discrete distribution $q(m_t|x)$ and the "straight-through" (ST) gradient estimator [ ] for quantization (from a soft distribution to a one-hot vector). This allows us to estimate the gradients from the discrete sampling process:

$$m_t = \mathrm{ST}(\mathrm{GumbelSoftmax}(q(m_t|x))). \tag{4}$$

To allow the message to be of variable length, which better mimics natural language, we set one token in $V$ to be the end-of-sequence (EOS) token that indicates the message end.
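The sampling step in Eq. (4) can be sketched in plain Python as follows. This is a forward-pass illustration only, not the authors' code: the function name `gumbel_softmax_st`, the temperature parameter, and the example logits are assumptions, and no gradients are computed here. In an automatic differentiation framework, the straight-through estimator is commonly written as `hard + soft - stop_gradient(soft)`, so the forward value is the one-hot vector while gradients follow the soft sample.

```python
import math
import random


def gumbel_softmax_st(logits, temperature=1.0, rng=random):
    """Return (soft, hard): a relaxed sample and its one-hot quantization.

    Assumption: `logits` are unnormalized log-probabilities over the vocabulary.
    """
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)

    # Numerically stable softmax over the perturbed logits.
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]

    # Quantization: one-hot vector at the argmax of the soft sample.
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return soft, hard


soft, hard = gumbel_softmax_st([2.0, 0.5, -1.0], temperature=0.5, rng=random.Random(0))
print(soft, hard)
```

Lower temperatures push the soft sample closer to the one-hot vector, at the cost of higher-variance gradients in a real training setup.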

When the listener decodes the message $m$, it first maps each discrete symbol into an embedding based on a learnable embedding layer and uses a decoding LSTM layer (DecLSTM) to process the sequence of embeddings recurrently:

$$\mathrm{Emb}_t(m) = \mathrm{DecLSTM}\big(\mathrm{emb}(m_t) \mid \mathrm{emb}(m_1), \mathrm{emb}(m_2), \dots, \mathrm{emb}(m_{t-1})\big). \tag{5}$$

We use the output of DecLSTM at the ending step $T$ as the embedding of the message to be the input of a convolutional decoder for image reconstruction:

$$\mathrm{Emb}(m) = \mathrm{Emb}_T(m), \quad \text{where } T = \min\big(n_{msg}, \operatorname{argmin}_{i}\{m_i == \mathrm{EOS},\ i \in \{1..N\}\}\big). \tag{6}$$
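The ending step $T$ in Eq. (6) amounts to finding the first EOS position and capping it at the maximum message length. A small sketch under stated assumptions: positions are 1-based as in the equation, tokens are integers, and a message with no EOS token is read in full (the text does not spell out this last case). The helper name `ending_step` is hypothetical.

```python
def ending_step(message, eos, n_msg):
    """Return T: the 1-based position of the first EOS token, capped at n_msg.

    Assumption: if no EOS occurs within the first n_msg tokens, T = n_msg.
    """
    for i, token in enumerate(message[:n_msg], start=1):
        if token == eos:
            return i
    return n_msg


# Token 0 plays the role of EOS in this illustration.
print(ending_step([3, 1, 0, 2, 0], eos=0, n_msg=5))  # first EOS at position 3
print(ending_step([3, 1, 2, 2, 1], eos=0, n_msg=5))  # no EOS, so the cap applies
```

Tokens after the first EOS are ignored, which is what lets messages have variable effective length under a fixed maximum.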

Finally, the convolutional decoder (DecConv) reconstructs the input by:

$$\hat{x} = \mathrm{DecConv}(\mathrm{Emb}(m)). \tag{7}$$

3 Experimental Design

3.1 Datasets

We are interested in whether representations can generalize to novel combinations of seen concepts.

Therefore, we need datasets that provide ground-truth labels of elementary concepts for (1) creating

train/test splits and (2) downstream task evaluation. We consider datasets with $n_{gen}$ independent generative factors $\{f_1, ..., f_{n_{gen}}\}$ where the space of factor $f_i$ is $S_i$. For example, if $f_i$ is color, then $S_i$ could be {yellow, red, blue, ...}. The data space $D$ is defined by the Cartesian product of the spaces of each factor, and therefore the cardinality of the dataset is $|D| = \prod_{i}^{n_{gen}} |S_i|$.

In our study, we used two public image datasets studied in the disentanglement literature: dSprites [ ] and MPI3D [ ]. The dSprites dataset contains images of 2D shapes generated from 5 factors {shape, scale, rotation, x and y position}. To avoid label ambiguity due to the rotational symmetry of the square and ellipse shapes, we limit the range of orientations to be within $[0, \pi/2)$. Then the cardinality of each factor's space is {3, 6, 10, 32, 32} respectively, which makes $|D| = 3 \times 6 \times 10 \times 32 \times 32 = 184{,}320$.

MPI3D is a set of 3D datasets synthesized or recorded in a controlled environment with an object held

by a robotic arm. In our evaluation, the challenging real-world version (MPI3D-Real) is used. It has 7

factors

= {object-color(6), object-shape(6), object-size(2), camera-height(3), background-color(3),

horizontal-axis(40), vertical-axis(40)} with the corresponding cardinalities of the space in parentheses,

leading to a total of 1,036,800 images.
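The dataset sizes follow directly from multiplying the per-factor cardinalities listed above. A short check, with dictionary keys chosen here purely for readability:

```python
import math

# Per-factor cardinalities as stated in the text (keys are illustrative labels).
dsprites = {"shape": 3, "scale": 6, "rotation": 10, "x_position": 32, "y_position": 32}
mpi3d_real = {
    "object_color": 6,
    "object_shape": 6,
    "object_size": 2,
    "camera_height": 3,
    "background_color": 3,
    "horizontal_axis": 40,
    "vertical_axis": 40,
}


def cardinality(factor_sizes):
    """Size of the Cartesian product of all factor spaces."""
    return math.prod(factor_sizes.values())


print(cardinality(dsprites))
print(cardinality(mpi3d_real))
```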

3.2 Compositional Generalization Evaluation Protocol

Our evaluation protocol emphasizes the compositional generalization that models can achieve on

downstream tasks. Our objective is to measure how easily an unsupervised representation learning

method can produce compositional generalization using a simple model on top of the learned

representation. (1) Data splits. We ﬁrst split a dataset into train/test sets randomly while ensuring

that all samples in the test-set are novel combinations of elementary factors seen in the train-set. (2)

Unsupervised representation learning. We learn unsupervised representations from the unlabeled

images of the train-set of size $N_{train}$ with a selected algorithm. (3) Learning for downstream tasks. Then, we freeze the learned representation model and use $N_{label}$ labeled samples from the train set ($N_{label} \ll N_{train}$) to train a simple classifier / regressor to predict the ground truth value for each factor $f_i$ of the dataset. (4) Testing generalization performance. Lastly, we test the performance

of the downstream task models on novel combinations of seen values of elementary factors.
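Step (1) of the protocol can be sketched as follows. This is one possible way to build such a split, not necessarily the procedure used in the paper: factor combinations are shuffled, a fraction is taken as the train set, and the test set keeps only the remaining combinations whose individual factor values all occur somewhere in the train set. The function name `compositional_split` and its parameters are hypothetical.

```python
import itertools
import random


def compositional_split(factor_sizes, train_fraction=0.1, seed=0):
    """Split all factor combinations into train and test sets.

    Every test combination is absent from the train set, but each of its
    individual factor values appears in at least one train combination.
    """
    rng = random.Random(seed)
    combos = list(itertools.product(*(range(n) for n in factor_sizes)))
    rng.shuffle(combos)

    n_train = max(1, int(round(train_fraction * len(combos))))
    train, rest = combos[:n_train], combos[n_train:]

    # Values of each factor that were observed during training.
    seen = [{c[i] for c in train} for i in range(len(factor_sizes))]
    test = [c for c in rest if all(c[i] in seen[i] for i in range(len(c)))]
    return train, test


train, test = compositional_split([3, 4, 5], train_fraction=0.2, seed=0)
print(len(train), len(test))
```

Because the train and test sets share no combination, any downstream accuracy on the test set measures generalization to unseen combinations rather than memorization.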

Readout model.

We argue that it is important to use simple read-out models and a limited number

of labeled training samples to evaluate downstream tasks. Otherwise, the performance gain from the

downstream task learning stage is mixed with that of the unsupervised learning stage. For example, if

the unsupervised representation model is an identity mapping, we can still get great performance with

a powerful read-out model and enough labeled samples. In our main article, we use linear models for

downstream tasks. However, perfectly disentangled but linearly inseparable representations may still

show poor performance with a linear readout model. Sanity-checking experiments (discussed in Appendix B) nonetheless indicate that the linear readout model can still generalize well on nonlinear oracle representations, possibly due to the limited value range of attributes in our datasets. In addition, extra results using a

non-linear readout model (Gradient Boosting Trees) are in Appendix C and the main observations

are consistent. The linear models we use to predict the values of the generative factors are ridge regression, using the $R^2$ score as the evaluation metric, and logistic regression, using classification accuracy as the evaluation metric. Since the $R^2$ score can be negative while $R^2 = 0$ indicates random guessing, we clip all negative $R^2$ scores to zero.
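The clipping rule can be illustrated with a small, dependency-free computation of the coefficient of determination. This is a sketch for intuition, assuming non-constant targets; the helper names are hypothetical. Note that a predictor that always outputs the mean of the targets scores exactly zero, and anything worse than that baseline scores below zero before clipping.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumption: y_true is not constant, so SS_tot is non-zero.
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def clipped_r2(y_true, y_pred):
    """R^2 with negative values clipped to zero, as in the evaluation protocol."""
    return max(0.0, r2_score(y_true, y_pred))


y = [1.0, 2.0, 3.0]
print(clipped_r2(y, [1.0, 2.0, 3.0]))  # perfect predictions
print(clipped_r2(y, [2.0, 2.0, 2.0]))  # always predicting the mean
print(clipped_r2(y, [3.0, 2.0, 1.0]))  # worse than the mean baseline, clipped
```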

Representation Mode.

In unsupervised disentanglement learning, low-dimensional latent variables

with explicit disentanglement regularization are used as the representation for downstream tasks, e.g.

the mean values of Gaussian distributions of the latent variables in the VAE. If the disentangled latent

variables each represent a single ground truth factor, only a simple mapping between factors and the

corresponding variables must be learned to achieve good performance. However, it is questionable if

the learned latent variables disentangle in the assumed structure and therefore improve generalization.

On the other hand, the simple linear models in our evaluation protocol may not be capable of processing

the latent variables (discrete messages) in emergent language (EL) learning because we would not

expect discrete sequential latent variables to be easily linearly separable. Therefore, we also evaluate

the intermediate features of the layers before or after the model bottleneck. Specifically, in addition to the latent variables ($z_{latent}$), we also use the features immediately after the convolutional encoder ($z_{post}$) or before the convolutional decoder ($z_{pre}$) as representations of images, shown in Fig. 1. We evaluate the use of $z_{post}$ and $z_{pre}$ in both disentanglement and emergent language models.

3.3 Implementation details.

Data Splits

For the main experiments, we use a 1:9 train/test split. The train set size (10%) is smaller

than what common machine learning setups and previous studies have used [

]. However,

considering the number of possible combinations increases exponentially for real data with an

increased number of generative factors, using fewer training samples even at the unsupervised

learning stage can help to obtain more meaningful conclusions for real-world scenarios.

Model architectures.

Fig. 1 shows the architectures for disentanglement and emergent language

learning models. The encoder and decoder in disentanglement learning models are symmetric

architectures with a convolutional network and a multi-layer perceptron (MLP) similar to the design

in [

]. We scale up the size of the model by doubling the width of all layers, which improves the
