object-wise compositionality [42, 30] or more general compositionality in the space of pre-defined attributes [45, 39] or the semantics of human language descriptions [49, 44].
Humans often express complex meaning in a compositional manner: we combine elementary representations to describe observations. For example, an object with simple geometry is described by separate and independent properties such as color (red, blue, ...), position (left, right, close, far away, ...), and shape (circle, triangle, ...). Compositional representations are therefore thought to be helpful, or even essential, for achieving compositional generalization [22, 15, 3, 28]. However, a given set of elementary concepts admits exponentially many possible combinations, so this combinatorial explosion must be dealt with for real-world visual observations, and it is unrealistic to annotate enough data to learn such fine-grained compositionality. Unsupervised compositional representation learning is therefore appealing because it does not require comprehensive labeling. However, unsupervised representation learning relies heavily on the design of an effective inductive bias (e.g. on the representation formulation) to induce the emergence of compositional representations.
A widely explored representation formulation with explicit compositionality is disentanglement.
The most common formulation of disentanglement is that the generative factors of observations
should be encoded into different factors of low-dimensional representations, and a change of a
single factor in an observation leads to a change in a single factor of the representation. State-of-the-art unsupervised disentanglement models [20, 24, 27, 7, 34] are largely built on top of variational generative models [25]. To measure the level of disentanglement, various quantitative
metrics have been proposed that are defined based on the statistical relations between the learned
representations and ground truth factors with an emphasis on the separation of factors. A summary of
disentanglement metrics and methods is provided in [31]. Another representation learning approach
with an inductive bias towards compositionality is emergent language learning. Natural language
allows us to describe novel composite concepts by combining expressions of their elementary concepts
according to grammar. Therefore, linguists have been interested in studying the compositionality
of discrete codes evolving during multi-agent communication when agents learn to complete a task
cooperatively [9, 29, 19, 40]. Compositionality metrics with language structure assumptions, e.g. topographic similarity [4], have been used to evaluate the learned language. However, for the above two types of methods, relatively few studies [36, 41, 6, 1] have directly evaluated how well the learned
representations generalize to novel combinations on downstream tasks, which is the main motivation
for compositional representation learning in the first place.
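As a concrete illustration, topographic similarity is the Spearman correlation between pairwise distances in the meaning (attribute) space and in the message space. The sketch below is a minimal stdlib-only implementation, not code from any cited work; it assumes meanings are attribute tuples compared with Hamming distance and messages are symbol sequences compared with edit distance:

```python
from itertools import combinations

def edit_distance(a, b):
    # Levenshtein distance between two symbol sequences.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[m][n]

def _ranks(xs):
    # Average ranks (ties share the mean rank), as Spearman requires.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def topographic_similarity(meanings, messages):
    # Correlate meaning-space distances with message-space distances
    # over all pairs of examples.
    pairs = list(combinations(range(len(meanings)), 2))
    md = [sum(x != y for x, y in zip(meanings[i], meanings[j])) for i, j in pairs]
    sd = [edit_distance(messages[i], messages[j]) for i, j in pairs]
    return spearman(md, sd)
```

For a perfectly compositional language, where each message symbol encodes exactly one attribute, meaning-space and message-space distances coincide and the score is 1.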
In this work, we study the compositional generalization performance of representations learned
from unsupervised learning algorithms with two types of inductive biases for compositionality:
disentanglement and emergent languages (EL). Instead of measuring the compositionality and
disentanglement metrics defined based on various assumptions, we directly measure the generalization
performance on novel input combinations with a two-stage protocol. Specifically, with a dataset
divided into train and test sets ensuring that the test set contains novel combinations of concepts that
never appear in the train set, we first learn an unsupervised representation model from unlabeled
images in the train set. With very few labeled samples from the train set and the frozen unsupervised
representation model, we train simple (e.g. linear) models on top of learned representations to predict
the ground truth value for each generative factor of the dataset and evaluate these simple models
on the test set. These choices are aligned with common practices in recent deep representation
learning works, for example, self-supervised representation learning [10] and semi-supervised learning with generative models [26]. Different from previous studies (e.g. [6, 36]) that measure
unsupervised learning performance (e.g. image reconstruction), we evaluate the performance on
downstream tasks. We also emphasize how easily we can obtain downstream task models with the
learned representation, e.g. when using very few labels and simple linear models. These designs
highlight the generalization performance of the unsupervised learning stage, different from a setup
that uses many more or even all labeled samples of the train set in the downstream task learning stage
[10, 41] or performs unsupervised learning on the entire dataset [31]. More importantly, we study
]. More importantly, we study
not only the compositional generalization of intermediate representations at the model bottleneck,
e.g. where the disentangled latent variables are formulated, but also the representations from layers
before or after it.
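The second stage of this protocol can be sketched as follows; the function below is an illustrative NumPy sketch, not the authors' code. Given representations from a frozen unsupervised encoder, it fits one linear readout per generative factor via least squares against one-hot targets and reports accuracy on test images whose factor combinations never appeared during training:

```python
import numpy as np

def linear_probe_accuracy(z_train, y_train, z_test, y_test):
    """Fit a linear classifier on frozen representations (least squares
    against one-hot targets) and report accuracy on held-out inputs."""
    classes = np.unique(y_train)
    onehot = (y_train[:, None] == classes[None, :]).astype(float)
    # Append a column of ones so the readout has a bias term.
    A = np.hstack([z_train, np.ones((len(z_train), 1))])
    W, *_ = np.linalg.lstsq(A, onehot, rcond=None)
    scores = np.hstack([z_test, np.ones((len(z_test), 1))]) @ W
    return (classes[scores.argmax(axis=1)] == y_test).mean()
```

In the protocol described above, one such probe would be trained per generative factor (color, shape, position, ...) using very few labeled train-set samples, with the train/test split constructed so that the test set contains only novel factor combinations.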
With the above evaluation protocol for compositional generalization, we explore selected unsupervised
learning algorithms by varying (1) the hyperparameters of each algorithm that may control the levels
of compositionality; (2) image datasets and the amount of data for both unsupervised and supervised
learning stages; and (3) design choices of EL learning. First, we find that, compared to the low-