
Emergence of Graphical Sensory-Motor Communication
less elements (phonemes). This corresponded to the first
level of compositionality within the notion of duality of patterning
(Hockett & Hockett, 1960). Yet, these works did not
consider referential games and did not study agents’ ability
to compose meaningful words to denote referents, i.e. they
did not address the second level of the duality of patterning.
One of the goals of emergent communication research is
to develop machines that can interact with humans. As
a result, a variety of referential game approaches ensure
that the emergent language is as close as possible to natural language.
This can be achieved by adding a supervised image caption-
ing objective to encourage agents to use natural language
in order to solve their communicative tasks (Havrylov &
Titov, 2017; Lazaridou et al., 2017). Other methods use
constraints such as memory restrictions (Kottur et al., 2017)
to act as an information bottleneck to increase interpretabil-
ity and compositionality. While we purposefully chose a
graphical sensory-motor system to ease the visualization of
the emerging language, we do not inject prior knowledge
or pressures to facilitate the emergence of an iconic lan-
guage. Our produced utterances are completely arbitrary.
This fundamentally differentiates our work from that of Mihai &
Hare (2021b), who train agents to communicate via sketches
replicating the visual referents they name. Note also that
their drawing setup does not include dynamical motor prim-
itives and utterances are directly optimized in image space.
They, moreover, allow gradients to back-propagate from
listener to speaker while we use a decentralized approach.
Finally, they do not consider contrastive learning. To our
knowledge, CURVES is the first contrastive deep-learning
algorithm successfully applied to a referential game.
There is a large body of work exploring the factors that
promote compositionality in emerging languages (Kottur
et al., 2017; Li & Bowling, 2019; Rodríguez Luna et al.,
2020; Ren et al., 2020; Chaabouni et al., 2020; Gupta et al.,
2020). In this context, a crucial question is how to measure
compositionality in the first place (Mu & Goodman, 2021). To
this end, Choi et al. (2018) propose to measure communicative
performance on unseen compositions of known
objects as a way to evaluate compositionality. However, it
has been shown that good performance on this test may
be achieved without leveraging any actual compositionality
in language (Andreas, 2019; Chaabouni et al., 2020). Thus,
others instead compute topographic similarity (Brighton &
Kirby, 2006), measuring the correlation between distances
in the utterance space (distances between signs) and distances
in the referent space (such as the cosine similarity between
the embeddings of objects) (Lazaridou et al., 2018). In this
paper we propose to do both and study 1) the generalization
to unseen combinations of abstract features and 2) topographic
measures based on the Hausdorff distances between
utterances denoting compositions and utterances denoting
isolated features.
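As an illustration of the two ingredients mentioned above, the following minimal sketch computes a symmetric Hausdorff distance between utterances and a generic topographic similarity as a Spearman rank correlation between pairwise distances. It assumes that an utterance can be represented as a non-empty set of 2D points; the function names and the choice of Spearman correlation are illustrative assumptions, not the exact measures used in this paper.

```python
import math
from itertools import combinations

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty sets of 2D points."""
    def directed(p, q):
        # Largest distance from a point of p to its nearest neighbour in q.
        return max(min(math.dist(x, y) for y in q) for x in p)
    return max(directed(a, b), directed(b, a))

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation (undefined if either input is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def topographic_similarity(referents, utterances, ref_dist, utt_dist):
    """Spearman correlation between referent and utterance pairwise distances."""
    pairs = list(combinations(range(len(referents)), 2))
    d_ref = [ref_dist(referents[i], referents[j]) for i, j in pairs]
    d_utt = [utt_dist(utterances[i], utterances[j]) for i, j in pairs]
    return pearson(ranks(d_ref), ranks(d_utt))
```

A value close to 1 indicates that referents that are far apart tend to be named by utterances that are far apart; any referent distance (for instance a cosine distance between embeddings) can be passed as `ref_dist`.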
Contributions. This paper introduces:
• The Graphical Referential Game (GREG): a variation
  of the referential language game to study the formation
  of signs from a graphical sensory-motor system.
• CURVES: an algorithmic solution to GREG, consisting
  of a contrastive multimodal encoder coupled with a
  generative model enabling the emergence of a graphical
  language.
• A study of CURVES's generalization performance on
  compositions of features never seen during training, in
  a simplified control setting and a more perceptually
  challenging one.
• A complementary analysis of the structure of the
  emerging graphical language, measuring lexicon coherence
  and compositionality scores derived from the
  Hausdorff distance.
2. Problem Definition
Graphical referential game. We consider a group of two
agents playing a fixed number of referential games, each
time alternating their roles (speaker or listener). During
a game, we first present a context $R$ of $n$ objects, called
referents, to a speaker $S$ and a listener $L$. At the beginning
of each game, the target $r^\star \in R$ is assigned to the speaker.
Given this target referent $r^\star$, $S$ produces an utterance ($u$) to
designate it. Based on the produced utterance $u$, $L$ selects
a referent ($\hat{r}$) in $R$. The game outcome $o$ is a success if the
selected referent ($\hat{r}$) matches the target $r^\star$.
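The game protocol described above can be sketched as a simple loop. This is only an illustration under stated assumptions: `EchoAgent` is a hypothetical stand-in whose utterance is a copy of the target (not a learned graphical sign), referents are represented as tuples, and all names are introduced here for the example.

```python
import random

class EchoAgent:
    """Hypothetical stand-in agent: its utterance is a copy of the target."""

    def speak(self, target):
        return tuple(target)

    def listen(self, utterance, context):
        # Select the index of the referent that matches the utterance.
        return context.index(utterance)

def play_games(agents, contexts, rng):
    """Play one game per context, alternating speaker and listener roles.

    Returns a list of (speaker_index, success) pairs.
    """
    results = []
    for g, context in enumerate(contexts):
        speaker_idx = g % 2
        speaker, listener = agents[speaker_idx], agents[1 - speaker_idx]
        target_idx = rng.randrange(len(context))    # target assigned to the speaker
        utterance = speaker.speak(context[target_idx])
        chosen_idx = listener.listen(utterance, context)
        results.append((speaker_idx, chosen_idx == target_idx))
    return results

context = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
results = play_games([EchoAgent(), EchoAgent()], [context] * 4, random.Random(0))
```

In an actual setup, the two `speak` and `listen` methods would be replaced by the agents' learned production and selection procedures.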
Referents. Referents are compositions of orthogonal vector
features (one-hot vectors). Given a set of $m$ orthogonal
features $F_m$, we define the set of all possible referents as
$R_m = \{\sum_{f \in S} f \mid S \subseteq F_m\}$. The subset
of referents made of exactly $k$ features is thus
$R_m^k = \{\sum_{f \in S} f \mid S \subseteq F_m, |S| = k\}$.
In our experiments, we fix $m = 5$.
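To make the definition concrete, the following sketch enumerates referents as sums of one-hot feature vectors, which are simply binary vectors with one active entry per selected feature. The function names are illustrative. Taken literally, the definition of the full set also admits the empty subset (the all-zero vector), which the sketch includes.

```python
from itertools import combinations

def referents_with_k_features(m, k):
    """Referents that are sums of exactly k distinct one-hot features out of m."""
    return [
        tuple(1 if i in subset else 0 for i in range(m))
        for subset in combinations(range(m), k)
    ]

def all_referents(m):
    """All referents: one per subset of the m features (empty subset included)."""
    return [r for k in range(m + 1) for r in referents_with_k_features(m, k)]

m = 5
pairs = referents_with_k_features(m, 2)   # C(5, 2) = 10 two-feature referents
everything = all_referents(m)             # 2**5 = 32 referents in total
```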
From these orthogonal referents, we propose to generate
objects made of digit images sampled from the MNIST
dataset (LeCun et al., 1998). More precisely, we define the
stochastic mapping $\Phi : R_m \to \tilde{R}_m$ that maps each feature
$f \in F_m$ to a digit class in the MNIST dataset. For each
feature in a referent, we sample a random instance from
the corresponding class and randomly place it on a $4 \times 4$
grid such that no two digits overlap. Note that the listener and
speaker can perceive different realizations of $\Phi$; in this case,
we say that they see different perspectives of the referents.
More precisely, the speaker perceives the context $R$ as
$\tilde{R}_S$ and its target $r^\star$ as $r^\star_S$. Similarly, the
listener perceives the context $R$ as $\tilde{R}_L$ and selects a
referent $\hat{r}$ from it.