Contrastive Multimodal Learning for Emergence of Graphical Sensory-Motor Communication
Tristan Karch *1,2, Yoann Lemesle *3, Romain Laroche 4, Clément Moulin-Frier 1,2, Pierre-Yves Oudeyer 1,2
Abstract
In this paper, we investigate whether artificial
agents can develop a shared language in an eco-
logical setting where communication relies on a
sensory-motor channel. To this end, we introduce
the Graphical Referential Game (GREG) where
a speaker must produce a graphical utterance to
name a visual referent object while a listener has
to select the corresponding object among distrac-
tor referents, given the delivered message. The
utterances are drawing images produced using dy-
namical motor primitives combined with a sketch-
ing library. To tackle GREG we present CURVES: a
multimodal contrastive deep learning mechanism
that represents the energy (alignment) between
named referents and utterances generated through
gradient ascent on the learned energy landscape.
We demonstrate that CURVES not only succeeds
at solving the GREG but also enables agents to
self-organize a language that generalizes to fea-
ture compositions never seen during training. In
addition to evaluating the communication perfor-
mance of our approach, we also explore the struc-
ture of the emerging language. Specifically, we
show that the resulting language forms a coher-
ent lexicon shared between agents and that basic
compositional rules on the graphical productions
could not explain the compositional generaliza-
tion.
1. Introduction
Understanding the emergence and evolution of human languages is a significant challenge that has involved many fields, from linguistics to developmental cognitive sciences (Christiansen & Kirby, 2003). Computational experimental semiotics (Galantucci & Garrod, 2011) has seen some success in modeling the formation of communication systems in populations of artificial agents (Cangelosi & Parisi, 2002; Kirby et al., 2014). More specifically, Language Game models (Steels & Loetzsch, 2012) have been used to show how a population of agents can self-organize a culturally shared lexicon without centralized coordination. Given the recent successes of artificial neural networks in solving complex tasks such as image classification (Krizhevsky et al., 2012; He et al., 2015; 2016; Dosovitskiy et al., 2021) and natural language understanding (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020), many works have leveraged them to study the emergence of communication in groups of agents (Lazaridou & Baroni, 2020), mainly using multi-agent deep reinforcement learning and language games (Nguyen et al., 2020; Mordatch & Abbeel, 2018; Lazaridou et al., 2018; Portelance et al., 2021; Chaabouni et al., 2021). These advances have made it possible to scale up language game models to environments where linguistic conventions are jointly learned with visual representations of raw image perception, as well as to environments where emergent communication is used as a tool to achieve joint cooperative tasks (Barde et al., 2022).

* Equal contribution. 1 Inria, Flowers Team. 2 Université de Bordeaux. 3 Université Paris-Dauphine-PSL. 4 Microsoft Research, Montreal. Correspondence to: Tristan Karch <tristan.karch@inria.fr>.
Pre-print, Under review
So far, most of these methods have considered only ideal-
ized symbolic communication channels based on discrete
tokens (Lazaridou et al.,2017;Mordatch & Abbeel,2018;
Chaabouni et al.,2021) or fixed-size sequences of word to-
kens (Havrylov & Titov,2017;Portelance et al.,2021). This
predefined means of communication is motivated by lan-
guage’s discrete and compositional nature. But how can this
specific structure emerge during vocalization or drawing,
for instance? Although fundamental in the investigation of
the origin of language (Dessalles,2000;Cheney & Seyfarth,
2005;Oller et al.,2019), this question seems to be neglected
by recent approaches to Language Games (Moulin-Frier &
Oudeyer,2020). We, therefore, propose to study how com-
munication could emerge between agents producing and
perceiving continuous signals with a constrained sensory-
motor system.
Such continuous constrained systems have been used in the cognitive science literature as models of sign production to study the self-organization of speech in artificial systems (de Boer, 2000; Oudeyer, 2006; Moulin-Frier et al., 2015). In this paper, we focus on a drawing sensory-motor system producing graphical signs. The sensory-motor system is made of Dynamical Motor Primitives (DMPs) (Schaal, 2006) combined with a sketching system (Mihai & Hare, 2021a) enabling the conversion of motor commands into images. Drawing systems have the advantage of producing 2D trajectories interpretable by humans while preserving the non-linear properties of speech models, which were shown to ease the discretization of the produced signals (Stevens, 1989; Moulin-Frier et al., 2015). We introduce the Graphical Referential Game: a variation of the original referential game, where a Speaker agent (top of Figure 1) has to produce a graphical utterance given a single target referent while a Listener agent (bottom of Figure 1) has to select an element among a context made of several referents, given the produced utterance (agents alternate their roles). In this setting, we first investigate whether a population of agents can converge on an efficient communication protocol to solve the graphical language game. Then, we evaluate the coherence and compositional properties of the emergent language, since compositionality is one of the main characteristics of human languages.

Figure 1. The Graphical Referential Game: During the game, the speaker’s goal is to produce a motor command $c$ that will yield an utterance $u$ in order to denote a referent $r_S$ sampled from a context $\tilde{R}_S$. Following this step, the listener needs to interpret the utterance in order to guess the referent it denotes among a context $\tilde{R}_L$. The game is a success if the listener and the speaker agree on the referent ($r_L \equiv r_S$).

arXiv:2210.06468v2 [cs.AI] 14 Feb 2023
Early language game implementations (Steels,1995;2001)
achieve communication convergence by using contrastive
methods to update association tables between object ref-
erents and utterances. While recent works use deep learn-
ing methods to target high-dimensional signals they do not
explore contrastive approaches. Instead, they model inter-
actions as a multi-agent reinforcement learning problem
where utterances are actions, and agents are optimized with
policy gradients, using the outcomes of the games as the
reward signal (Lazaridou et al.,2017). In the meantime, re-
cent models leveraging contrastive multimodal mechanisms
such as CLIP (Radford et al.,2021) have achieved impres-
sive results in modeling associations between images and
texts. Combined with efficient generative methods (Ramesh
et al.,2021), they can compose textual elements that are
reflected in image form as the composition of their asso-
ciated visual concepts. Inspired by these techniques, we
propose CURVES: Contrastive Utterance-Referent associa-
tiVE Scoring, an algorithmic solution to the graphical ref-
erential game. CURVES relies on two mechanisms: 1) The
contrastive learning of an energy landscape representing the
alignment between utterances and referents and 2) the gen-
eration of utterances that maximize the energy for a given
target referent. We evaluate CURVES in two instantiations
of the graphical referential game: one with symbolic ref-
erents encoded by one-hot vectors and another with visual
referents derived from the multiple MNIST digits (LeCun
et al.,1998). We show that CURVES converges to a shared
graphical language that enables a population of agents not
only to name complex visual referents but also to name new
referent compositions that were never encountered during
training.
Scope.
The idea of using a sensory-motor system to study
the emergence of forms of combinatoriality in language
dates back to methods investigating the origins of digital vo-
calization systems (de Boer,2000;Oudeyer,2005;Zuidema
& De Boer,2009). Such studies were conducted in the con-
text of imitation games at the level of phonemes to observe
the formation of speech utterances (syllables, words) that
were systematically composed from lower-level meaningless
elements (phonemes). This corresponded to the first
level of compositionality within the notion of duality of pat-
terning (Hockett & Hockett,1960). Yet, these works did not
consider referential games and did not study agents’ ability
to compose meaningful words to denote referents, i.e. they
did not address the second level of the duality of patterning.
One of the goals of emergent communication research is
to develop machines that can interact with humans. As
a result, a variety of referential game approaches ensure
that the emergent language is as close as possible to natural language.
This can be achieved by adding a supervised image caption-
ing objective to encourage agents to use natural language
in order to solve their communicative tasks (Havrylov &
Titov,2017;Lazaridou et al.,2017). Other methods use
constraints such as memory restrictions (Kottur et al.,2017)
to act as an information bottleneck to increase interpretabil-
ity and compositionality. While we purposefully chose a
graphical sensory-motor system to ease the visualization of
the emerging language, we do not inject prior knowledge
or pressures to facilitate the emergence of an iconic lan-
guage. Our produced utterances are completely arbitrary.
This fundamentally differentiates our work from Mihai &
Hare (2021b) that trains agents to communicate via sketches
replicating the visual referents they name. Note also that
their drawing setup does not include dynamical motor prim-
itives and utterances are directly optimized in image space.
They, moreover, allow gradients to back-propagate from
listener to speaker while we use a decentralized approach.
Finally, they do not consider contrastive learning. To our
knowledge, CURVES is the first contrastive deep-learning
algorithm successfully applied to a referential game.
There is a large body of work exploring the factors that
promote compositionality in emerging languages (Kottur
et al., 2017; Li & Bowling, 2019; Rodríguez Luna et al.,
2020; Ren et al., 2020; Chaabouni et al., 2020; Gupta et al.,
2020). In this context, a crucial question is how to actually
measure it in the first place (Mu & Goodman,2021). To
this end, Choi et al. (2018) propose to measure communicative
performances on unseen compositions of known
objects as a way to evaluate compositionality. However, it
has been shown that a good performance in this test may
be achieved without leveraging any actual compositionality
in language (Andreas,2019;Chaabouni et al.,2020). Thus,
others instead compute topographic similarities (Brighton &
Kirby,2006), measuring the correlation between distances
in the utterance space (distance between signs) and distances
in the referents space (such as the cosine similarity between
the embeddings of objects) (Lazaridou et al.,2018). In this
paper we propose to do both and study 1) the generalization
to unseen combinations of abstract features and 2) topo-
graphic measures based on the Hausdorff distances between
utterances denoting compositions and utterances denoting
isolated features.
Contributions. This paper introduces:
• The Graphical Referential Game (GREG): a variation of the referential language game to study the formation of signs from a graphical sensory-motor system.
• CURVES: an algorithmic solution to GREG, consisting of a contrastive multimodal encoder coupled with a generative model enabling the emergence of a graphical language.
• A study of CURVES’s generalization performances on compositions of features never seen during training, in a simplified control setting and in a more perceptually challenging one.
• A complementary analysis of the structure of the emerging graphical language, measuring lexicon coherence and compositionality scores derived from the Hausdorff distance.
2. Problem Definition
Graphical referential game. We consider a group of two agents playing a fixed number of referential games, each time alternating their roles (speaker or listener). During a game, we first present a context $R$ of $n$ objects, called referents, to a speaker $S$ and a listener $L$. At the beginning of each game, the target $r^\star \in R$ is assigned to the speaker. Given this target referent $r^\star$, $S$ produces an utterance $u$ to designate it. Based on the produced utterance $u$, $L$ selects a referent $\hat{r}$ in $R$. The game outcome $o$ is a success if the selected referent $\hat{r}$ matches the target $r^\star$.
Referents. Referents are compositions of orthogonal vector features (one-hot vectors). Given a set of $m$ orthogonal features $\mathcal{F}_m$, we define the set of all possible referents as $\mathcal{R}_m = \{\sum_{f \in S} f \mid S \subseteq \mathcal{F}_m\}$. The subset of referents made of exactly $k$ features is thus $\mathcal{R}^k_m = \{\sum_{f \in S} f \mid S \subseteq \mathcal{F}_m, |S| = k\}$. In our experiments, we fix $m = 5$.
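As an illustration of the referent sets defined above, the following minimal sketch enumerates the referents made of exactly $k$ features by summing one-hot vectors. The function name `referents_with_k_features` is hypothetical and chosen only for this example; it is not taken from the authors' code.

```python
from itertools import combinations

import numpy as np


def referents_with_k_features(m: int, k: int) -> np.ndarray:
    """Return every sum of exactly k distinct one-hot features of size m."""
    features = np.eye(m, dtype=int)  # row i is the one-hot feature f_i
    rows = [
        features[list(subset)].sum(axis=0)
        for subset in combinations(range(m), k)
    ]
    return np.array(rows).reshape(-1, m)


# With m = 5: single-feature referents and two-feature compositions.
train = referents_with_k_features(5, 1)
test = referents_with_k_features(5, 2)
print(train.shape, test.shape)
```

With $m = 5$ there are 5 single-feature referents and 10 two-feature compositions, one per unordered pair of features.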
From these orthogonal referents, we propose to generate objects made of digit images sampled from the MNIST dataset (LeCun et al., 1998). More precisely, we define the stochastic mapping $\Phi : \mathcal{R}_m \to \tilde{\mathcal{R}}_m$ that maps each feature $f \in \mathcal{F}_m$ to a digit class in the MNIST dataset. For each feature in a referent, we sample a random instance from the corresponding class and randomly place it on a $4 \times 4$ grid such that no digits overlap. Note that the listener and speaker can perceive different realizations of $\Phi$; in this case, we say that they see different perspectives of the referents. More precisely, the speaker perceives the context $R$ as $\tilde{R}_S$ and its target $r^\star$ as $r^\star_S$. Similarly, the listener perceives the context $R$ as $\tilde{R}_L$ and selects a referent $\hat{r}$ among it.
[Figure 2: panel (a) DMP followed by Sketch Lib.; panel (b) referent transformation]
Figure 2. (a) Sketching sensory-motor system: The sensory-motor system imitates a robotic arm drawing a sketch on a 2D plane. DMPs first convert a continuous command $c$ into a sequence of coordinates $T$. This trajectory is then rendered as a $52 \times 52$ graphical utterance thanks to a differentiable sketching library. (b) Referent transformation: An example of a one-hot context $R$ being transformed into two contexts $\tilde{R}_S$ and $\tilde{R}_L$ by the stochastic transformation $\Phi$. The two contexts are different perspectives of the same objects.
We use this formalism to instantiate three settings of the Graphical Referential Game (GREG):
• one-hot: where referents are one-hot vectors $r \in \mathcal{R}_m$.
• visual-shared: where referents are MNIST digits $r \in \tilde{\mathcal{R}}_m$ and agents share the same perspective: $\tilde{R}_S = \tilde{R}_L$.
• visual-unshared: where referents are MNIST digits $r \in \tilde{\mathcal{R}}_m$ and agents have different perspectives of referents in their contexts: $\tilde{R}_S \neq \tilde{R}_L$.
Sensory-motor drawing system. Utterances are produced by a sensory-motor system $M : \mathbb{R}^m \to \mathcal{U} \subset \mathbb{R}^{D \times D}$ mimicking an arm drawing sketches, as displayed in Figure 2(a). The arm motion is derived from Dynamical Motor Primitives (DMPs) (Schaal, 2006). The DMP is parametrized by a command vector $c \in \mathbb{R}^{20}$. It converts $c$ into a 2-dimensional drawing trajectory $T$ made of 10 coordinates, $T = \{v_i\}_{i=0,\dots,9}$. This trajectory is then fed to a Differentiable Sketching model (Mihai & Hare, 2021a) generating a $D \times D$ image (in our implementation, $D = 52$). See Suppl. Section A.1 for additional implementation details of the sensory-motor drawing system.
Objectives. In this study, we aim to answer the three following questions:
1. What are agents’ communicative performances in the GREG? Are agents able to solve the game? Are they able to generalize to compositional referents?
2. Are the emergent signs coherent? Do agents produce the same utterances to denote the same referents?
3. Are the emergent signs compositional? Are there compositional rules in the production of signs naming compositional referents?¹

Are agents able to solve the GREG? To answer the first question, we will monitor the communicative performance of agents on both training and testing referents. The training referents consist of a single feature, $\mathcal{R}_{\text{train}} = \mathcal{R}^1_5$, while the testing referents consist of two features, $\mathcal{R}_{\text{test}} = \mathcal{R}^2_5$. For visual examples of compositional referents, see Suppl. Section A.2.

¹ Note that the ability to perform compositional generalization (question 1) and the presence of compositional structure in utterances (question 3) are two separate investigations.
Are the emergent signs coherent? To measure coherence we propose to use a similarity measure based on the Hausdorff distance. The Hausdorff distance is known to capture geometric features of trajectories, in particular, their shape (Besse et al., 2015). The Hausdorff distance $d_H$ is the maximum distance from any coordinate in a trajectory to the closest coordinate in the other: $d_H(T_1, T_2) = \max\{\sup_{v \in T_1} d(v, T_2), \sup_{v' \in T_2} d(T_1, v')\}$. In particular, we compute the following metrics.
• Agent Coherence (A-coherence): For a given referent $r$ with the same perspective for all agents, measure the mean pairwise similarity between each agent’s utterance.
• Perspective Coherence (P-coherence): For a given agent and a given referent $r$, measure the mean pairwise similarity between utterances produced from different perspectives.
• Referent Coherence (R-coherence): For a given agent, measure the mean pairwise similarity between utterances produced for different referents.
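The Hausdorff distance between two drawing trajectories can be computed as in the sketch below, assuming Euclidean distance between coordinates and trajectories stored as arrays of shape (number of points, 2). The helper `mean_pairwise_distance` is a hypothetical illustration of a mean pairwise statistic; how the distance is converted into the similarity used by the coherence metrics is not specified in this section, so that step is left out.

```python
from itertools import combinations

import numpy as np


def hausdorff(t1: np.ndarray, t2: np.ndarray) -> float:
    """Symmetric Hausdorff distance between two trajectories of 2D points."""
    # pairwise[i, j] is the Euclidean distance between t1[i] and t2[j].
    pairwise = np.linalg.norm(t1[:, None, :] - t2[None, :, :], axis=-1)
    d12 = pairwise.min(axis=1).max()  # farthest point of t1 from the set t2
    d21 = pairwise.min(axis=0).max()  # farthest point of t2 from the set t1
    return float(max(d12, d21))


def mean_pairwise_distance(trajectories) -> float:
    """Average Hausdorff distance over all unordered pairs of trajectories."""
    pairs = combinations(trajectories, 2)
    return float(np.mean([hausdorff(a, b) for a, b in pairs]))


t1 = np.array([[0.0, 0.0], [1.0, 0.0]])
t2 = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 3.0]])
print(hausdorff(t1, t2))
```

In this example every point of `t1` also belongs to `t2`, so the distance is driven by the extra point of `t2`, which lies 3 units away from its nearest neighbor in `t1`.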
Are the emergent signs compositional? To measure the compositionality of the utterances, we introduce a topographic score $\rho$ based on the Hausdorff distance. $\rho$ quantifies how an utterance denoting a compositional referent made of features $i$ and $j$, $u(r_{ij})$, is actually closer to the utterances denoting isolated features $u(r_i)$ or $u(r_j)$ than the utterances naming other compositional referents ($u(r_{xy})$, $x \neq i$, $y \neq j$). For a detailed derivation of the metric $\rho$, see Suppl. Section A.3.
3. CURVES - Contrastive Utterance-Referent associatiVE Scoring
CURVES is an energy-based approach that relies on two mechanisms:
1. The contrastive learning of an energy landscape $E(r, u)$, defined as the cosine similarity between utterance and referent embeddings.
2. The generation of an utterance that maximizes the energy for a given target referent $r^\star_S$.
Agent modules and interactions. Each agent $A \in \{A_1, A_2\}$ perceives utterances and referents using two distinct CNN encoders $f_A$ (for referents) and $g_A$ (for utterances)². $f_A$ and $g_A$ map referents and utterances into a shared $d$-dimensional latent space: $f_A(\cdot, \theta_{f_A}) : \mathcal{R}_m \to \mathbb{R}^d$ and $g_A(\cdot, \theta_{g_A}) : \mathcal{U} \to \mathbb{R}^d$ such that $z_{rA} = f_A(r)$ and $z_{uA} = g_A(u)$, as displayed in Figure 3(a). The agent then computes the energy landscape as: $E_A(r, u) = \cos(f_A(r), g_A(u))$.
A given referential game unfolds as follows. Agents have randomly attributed roles, for instance, $A_1$ is the speaker ($A_1 \equiv S$) and $A_2$ is the listener ($A_2 \equiv L$). The speaker is given a context $\tilde{R}_S$ and a target referent perceived as $r^\star_S$ to produce an utterance $\hat{u}$ intending to approach the utterance $u^\star$ that maximizes $E_S(r^\star_S, u)$. The listener observes $\hat{u}$ and selects the referent $\hat{r}$ in context $\tilde{R}_L$ that maximizes $E_L(r, \hat{u})$:
$$\hat{u} \approx u^\star = \operatorname*{argmax}_{u \in \mathcal{U}} E_S(r^\star_S, u), \qquad \hat{r} = \operatorname*{argmax}_{r \in \tilde{R}_L} E_L(r, \hat{u}) \tag{1}$$
The outcome of the game is then $o = \mathbb{1}_{[\hat{r} = r^\star]} - b$, where $b$ is a baseline parameter representing the mean success across previous games.
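The listener's side of a game can be sketched as follows, assuming the referent and utterance embeddings have already been computed by the encoders (here they are plain arrays). The function names are hypothetical; only the cosine energy, the argmax selection, and the baseline-corrected outcome come from the text.

```python
import numpy as np


def energy_matrix(z_refs: np.ndarray, z_utts: np.ndarray) -> np.ndarray:
    """Cosine similarity between each referent and each utterance embedding."""
    zr = z_refs / np.linalg.norm(z_refs, axis=1, keepdims=True)
    zu = z_utts / np.linalg.norm(z_utts, axis=1, keepdims=True)
    return zr @ zu.T  # entry (i, j) is E(r_i, u_j)


def listener_choice(z_context: np.ndarray, z_utt: np.ndarray) -> int:
    """Index of the context referent with the highest energy for one utterance."""
    energies = energy_matrix(z_context, z_utt[None, :])[:, 0]
    return int(np.argmax(energies))


def outcome(chosen: int, target: int, baseline: float) -> float:
    """Success indicator minus the running baseline b."""
    return float(chosen == target) - baseline


# Toy embeddings (assumed values): three referents and one utterance in R^3.
z_context = np.eye(3)
z_utt = np.array([0.1, 0.9, 0.2])
choice = listener_choice(z_context, z_utt)
print(choice, outcome(choice, target=1, baseline=0.25))
```

Subtracting the baseline makes the outcome positive for a success and negative for a failure, which is what lets the speaker's update strengthen or weaken an association.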
Contrastive representation learning in referential games. For a given context $R$, agents are randomly assigned their roles and play $n = |R|$ games. During these $n$ games, roles are fixed and the speaker agent successively selects each referent of the context $\tilde{R}_S$ as the target $r^\star_S$. During interactions, the speaker collects data $\{(r^i_S, u_i, o_i)\}_{i=1,\dots,n}$ while the listener observes $\{(u_i, r^i_L)\}_{i=1,\dots,n}$. From the collected data, each agent can compute the square cosine similarity matrix $\Sigma_A$ whose elements are $(\Sigma_A)_{i,j} = E_A(r^i_A, u_j)$, as shown in Figure 3(b). Contrastive updates are then performed using the objective $J_A$ that applies Cross Entropy (CE) on the $i$-th row and $i$-th column of $\Sigma_A$:
$$J_A(\Sigma_A, i) = \frac{CE((\Sigma_A)_{i,1:n}, e_i) + CE((\Sigma_A)_{1:n,i}, e_i)}{2} \tag{2}$$
$e_i$ being a one-hot vector of size $n$ with value 1 at index $i$. Depending on the role of the agent, $J_A$ is instantiated either as $J_S$ (speaker) or $J_L$ (listener). Thus, the speaker updates its representation using the outcomes $o_i$ of the games (reinforcing the successful associations while decreasing the unsuccessful ones):
$$\underset{\theta_{f_S}, \theta_{g_S}}{\text{minimize}} \; \sum_{i=1}^{n} o_i J_S(\Sigma_S, i) \tag{3}$$

² When referents are one-hot vectors, $f_A$ is a fully-connected network. Parameters for both encoders are given in Suppl. Table 4.
On the other hand, the listener needs to make sure that the selection matches the speaker’s referent (Steels, 2015) and hence always increases associations (no matter the games’ outcomes):
$$\underset{\theta_{f_L}, \theta_{g_L}}{\text{minimize}} \; \sum_{i=1}^{n} J_L(\Sigma_L, i) \tag{4}$$
Note that in Eq. 4, $r^i_L$ is the target referent perceived by the listener. This means that, at the end of the game, the speaker indicates the referent (as perceived by the listener) that they named. This retroactive pointing mechanism was employed in both early language game implementations (Steels & Kaplan, 1999) and more recent ones (Chaabouni et al., 2020; Portelance et al., 2021).
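The speaker and listener objectives can be written down in a few lines once the similarity matrix is available. The sketch below assumes that cross entropy is taken between a softmax over the raw cosine similarities and the one-hot target, with no temperature; the text does not state these details, so they are assumptions, and the function names are hypothetical. Only the loss values are computed, not the parameter updates.

```python
import numpy as np


def cross_entropy(logits: np.ndarray, index: int) -> float:
    """CE between softmax(logits) and the one-hot vector with a 1 at `index`."""
    shifted = logits - logits.max()  # shift for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[index])


def contrastive_loss(sigma: np.ndarray, i: int) -> float:
    """Average CE over the i-th row and the i-th column of sigma."""
    return 0.5 * (cross_entropy(sigma[i, :], i) + cross_entropy(sigma[:, i], i))


def speaker_objective(sigma: np.ndarray, outcomes) -> float:
    """Outcome-weighted sum of the contrastive losses over the n games."""
    return sum(o * contrastive_loss(sigma, i) for i, o in enumerate(outcomes))


def listener_objective(sigma: np.ndarray) -> float:
    """Unweighted sum: the listener always reinforces the indicated referent."""
    return sum(contrastive_loss(sigma, i) for i in range(sigma.shape[0]))


sigma = np.eye(3)  # toy similarity matrix with perfectly aligned pairs
print(listener_objective(sigma), speaker_objective(sigma, [0.5, 0.5, -0.5]))
```

A negative outcome flips the sign of the corresponding term, so minimizing the speaker objective pushes that referent and utterance apart instead of together.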
Speaker’s utterance optimization. We distinguish two utterance generation strategies:
• The descriptive generation: in which the speaker agent only considers the target referent $r^\star_S$ to produce an utterance that maximizes the cosine similarity between the embeddings of $r^\star_S$ and an utterance $u = M(c)$ produced by our sensory-motor system from motor command $c$. Since $M$ is fully differentiable, we inject the sensory-motor constraint in Equation 1 and seek the optimal motor command $c^\star$ using gradient ascent:
$$c^\star = \operatorname*{argmax}_{c \in \mathbb{R}^p} E(r^\star_S, M(c)) \tag{5}$$
• The discriminative generation: in which the speaker also perceives the context $\tilde{R}_S$ during production. This is achieved by finding the motor command that minimizes the cross entropy given a target referent $r^\star_S$ and its context $\tilde{R}_S$:
$$c^\star = \operatorname*{argmin}_{c \in \mathbb{R}^p} CE(\sigma_S, e_{r^\star_S}) \tag{6}$$
where $\sigma_S$ is the vector with coordinates $\sigma_{S_i} = [E(r_i, M(c))]_{r_i \in \tilde{R}_S}$ and $e_{r^\star_S}$ is the one-hot vector of size $|\tilde{R}_S|$ with value 1 at the position of $r^\star_S$ in $\tilde{R}_S$. This discriminative generation process is only used at test time when investigating CURVES’s generalization capabilities.
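The descriptive generation can be illustrated with a toy gradient ascent on the cosine energy. This is a sketch under strong assumptions and not the authors' pipeline: a random linear map `W` stands in for the whole chain from motor command to utterance embedding (DMP, sketching renderer, and utterance encoder), the gradient is derived by hand for that linear case, and a simple step-halving rule (an addition of this example) keeps the energy from decreasing.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def cosine_grad(z: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Gradient of cos(z, y) with respect to y."""
    nz, ny = np.linalg.norm(z), np.linalg.norm(y)
    return z / (nz * ny) - (z @ y) * y / (nz * ny ** 3)


def descriptive_generation(z_ref, W, c0, steps=200, lr=0.5):
    """Gradient ascent on E(c) = cos(z_ref, W @ c), halving lr on failed steps."""
    c, energy = c0.copy(), cosine(z_ref, W @ c0)
    for _ in range(steps):
        grad_c = W.T @ cosine_grad(z_ref, W @ c)  # chain rule through W
        candidate = c + lr * grad_c
        candidate_energy = cosine(z_ref, W @ candidate)
        if candidate_energy > energy:
            c, energy = candidate, candidate_energy  # accept the ascent step
        else:
            lr *= 0.5  # step too large: shrink it and retry
    return c, energy


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 20))   # hypothetical stand-in for g(M(c)), c in R^20
z_ref = rng.normal(size=8)     # hypothetical target referent embedding
c0 = rng.normal(size=20)       # random initial motor command
e0 = cosine(z_ref, W @ c0)
c_star, e_star = descriptive_generation(z_ref, W, c0)
print(e0, e_star)
```

The discriminative variant would replace the energy with the negative cross entropy over the energies of all referents in the speaker's context, keeping the same ascent loop.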
4. Experiments and Results
This section focuses first on CURVES’s communicative performances
when agents interact in GREG (question 1 of