Contrastive Multimodal Learning for Emergence of Graphical Sensory-Motor Communication

Tristan Karch * 1 2, Yoann Lemesle * 3, Romain Laroche 4, Clément Moulin-Frier 1 2, Pierre-Yves Oudeyer 1 2

Abstract

In this paper, we investigate whether artificial agents can develop a shared language in an ecological setting where communication relies on a sensory-motor channel. To this end, we introduce the Graphical Referential Game (GREG), where a speaker must produce a graphical utterance to name a visual referent object while a listener has to select the corresponding object among distractor referents, given the delivered message. The utterances are drawing images produced using dynamical motor primitives combined with a sketching library. To tackle GREG, we present CURVES: a multimodal contrastive deep learning mechanism that represents the energy (alignment) between named referents and utterances generated through gradient ascent on the learned energy landscape. We demonstrate that CURVES not only succeeds at solving the GREG but also enables agents to self-organize a language that generalizes to feature compositions never seen during training. In addition to evaluating the communication performance of our approach, we also explore the structure of the emerging language. Specifically, we show that the resulting language forms a coherent lexicon shared between agents and that basic compositional rules on the graphical productions could not explain the compositional generalization.

1. Introduction

Understanding the emergence and evolution of human languages is a significant challenge that has involved many fields, from linguistics to developmental cognitive sciences (Christiansen & Kirby, 2003). Computational experimental semiotics (Galantucci & Garrod, 2011) has seen some success in modeling the formation of communication systems in populations of artificial agents (Cangelosi & Parisi, 2002; Kirby et al., 2014). More specifically, Language Game models (Steels & Loetzsch, 2012) have been used to show how a population of agents can self-organize a culturally shared lexicon without centralized coordination. Given the recent successes of artificial neural networks in solving complex tasks such as image classification (Krizhevsky et al., 2012; He et al., 2015; 2016; Dosovitskiy et al., 2021) and natural language understanding (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020), many works have leveraged them to study the emergence of communication in groups of agents (Lazaridou & Baroni, 2020), mainly using multi-agent deep reinforcement learning and language games (Nguyen et al., 2020; Mordatch & Abbeel, 2018; Lazaridou et al., 2018; Portelance et al., 2021; Chaabouni et al., 2021). These advances have made it possible to scale up language game models to environments where linguistic conventions are jointly learned with visual representations of raw image perception, as well as to environments where emergent communication is used as a tool to achieve joint cooperative tasks (Barde et al., 2022).

* Equal contribution. 1 Inria, Flowers Team. 2 Université de Bordeaux. 3 Université Paris-Dauphine-PSL. 4 Microsoft Research, Montreal. Correspondence to: Tristan Karch <tristan.karch@inria.fr>.

Pre-print, Under review

So far, most of these methods have considered only idealized symbolic communication channels based on discrete tokens (Lazaridou et al., 2017; Mordatch & Abbeel, 2018; Chaabouni et al., 2021) or fixed-size sequences of word tokens (Havrylov & Titov, 2017; Portelance et al., 2021). This predefined means of communication is motivated by language's discrete and compositional nature. But how can this specific structure emerge during vocalization or drawing, for instance? Although fundamental in the investigation of the origin of language (Dessalles, 2000; Cheney & Seyfarth, 2005; Oller et al., 2019), this question seems to be neglected by recent approaches to Language Games (Moulin-Frier & Oudeyer, 2020). We therefore propose to study how communication could emerge between agents producing and perceiving continuous signals with a constrained sensory-motor system.

Such continuous constrained systems have been used in the cognitive science literature as models of sign production to study the self-organization of speech in artificial systems (de Boer, 2000; Oudeyer, 2006; Moulin-Frier et al., 2015). In this paper, we focus on a drawing sensory-motor system producing graphical signs. The sensory-motor system is made of Dynamical Motor Primitives (DMPs) (Schaal, 2006) combined with a sketching system (Mihai & Hare, 2021a) enabling the conversion of motor commands into images. Drawing systems have the advantage of producing 2D trajectories interpretable by humans while preserving the non-linear properties of speech models, which were shown to ease the discretization of the produced signals (Stevens, 1989; Moulin-Frier et al., 2015). We introduce the Graphical Referential Game: a variation of the original referential game, where a Speaker agent (top of Figure 1) has to produce a graphical utterance given a single target referent while a Listener agent (bottom of Figure 1) has to select an element among a context made of several referents, given the produced utterance (agents alternate their roles). In this setting, we first investigate whether a population of agents can converge on an efficient communication protocol to solve the graphical language game. Then, we evaluate the coherence and compositional properties of the emergent language, since compositionality is one of the main characteristics of human languages.

Figure 1. The Graphical Referential Game: During the game, the speaker's goal is to produce a motor command $c$ that will yield an utterance $u$ in order to denote a referent $r_S$ sampled from a context $\tilde{R}_S$. Following this step, the listener needs to interpret the utterance in order to guess the referent it denotes among a context $\tilde{R}_L$. The game is a success if the listener and the speaker agree on the referent ($r_L \equiv r_S$).

arXiv:2210.06468v2 [cs.AI] 14 Feb 2023

Early language game implementations (Steels, 1995; 2001) achieve communication convergence by using contrastive methods to update association tables between object referents and utterances. While recent works use deep learning methods to target high-dimensional signals, they do not explore contrastive approaches. Instead, they model interactions as a multi-agent reinforcement learning problem where utterances are actions, and agents are optimized with policy gradients, using the outcomes of the games as the reward signal (Lazaridou et al., 2017). In the meantime, recent models leveraging contrastive multimodal mechanisms such as CLIP (Radford et al., 2021) have achieved impressive results in modeling associations between images and texts. Combined with efficient generative methods (Ramesh et al., 2021), they can compose textual elements that are reflected in image form as the composition of their associated visual concepts. Inspired by these techniques, we propose CURVES: Contrastive Utterance-Referent associatiVE Scoring, an algorithmic solution to the graphical referential game. CURVES relies on two mechanisms: 1) the contrastive learning of an energy landscape representing the alignment between utterances and referents and 2) the generation of utterances that maximize the energy for a given target referent. We evaluate CURVES in two instantiations of the graphical referential game: one with symbolic referents encoded by one-hot vectors and another with visual referents derived from multiple MNIST digits (LeCun et al., 1998). We show that CURVES converges to a shared graphical language that enables a population of agents not only to name complex visual referents but also to name new referent compositions that were never encountered during training.

Scope. The idea of using a sensory-motor system to study the emergence of forms of combinatoriality in language dates back to methods investigating the origins of digital vocalization systems (de Boer, 2000; Oudeyer, 2005; Zuidema & De Boer, 2009). Such studies were conducted in the context of imitation games at the level of phonemes to observe the formation of speech utterances (syllables, words) that were systematically composed from lower-level meaningless elements (phonemes). This corresponded to the first level of compositionality within the notion of duality of patterning (Hockett & Hockett, 1960). Yet, these works did not consider referential games and did not study agents' ability to compose meaningful words to denote referents, i.e., they did not address the second level of the duality of patterning.

One of the goals of emergent communication research is to develop machines that can interact with humans. As a result, a variety of referential game approaches ensure that the emergent language is as close as possible to natural language. This can be achieved by adding a supervised image captioning objective to encourage agents to use natural language in order to solve their communicative tasks (Havrylov & Titov, 2017; Lazaridou et al., 2017). Other methods use constraints such as memory restrictions (Kottur et al., 2017) that act as an information bottleneck to increase interpretability and compositionality. While we purposefully chose a graphical sensory-motor system to ease the visualization of the emerging language, we do not inject prior knowledge or pressures to facilitate the emergence of an iconic language. Our produced utterances are completely arbitrary. This fundamentally differentiates our work from Mihai & Hare (2021b), who train agents to communicate via sketches replicating the visual referents they name. Note also that their drawing setup does not include dynamical motor primitives and that utterances are directly optimized in image space. Moreover, they allow gradients to back-propagate from listener to speaker, while we use a decentralized approach. Finally, they do not consider contrastive learning. To our knowledge, CURVES is the first contrastive deep-learning algorithm successfully applied to a referential game.

There is a large body of work exploring the factors that promote compositionality in emerging languages (Kottur et al., 2017; Li & Bowling, 2019; Rodríguez Luna et al., 2020; Ren et al., 2020; Chaabouni et al., 2020; Gupta et al., 2020). In this context, a crucial question is how to actually measure it in the first place (Mu & Goodman, 2021). To this end, Choi et al. (2018) propose to measure communicative performances on unseen compositions of known objects as a way to evaluate compositionality. However, it has been shown that a good performance in this test may be achieved without leveraging any actual compositionality in language (Andreas, 2019; Chaabouni et al., 2020). Thus, others instead compute topographic similarities (Brighton & Kirby, 2006), measuring the correlation between distances in the utterance space (distance between signs) and distances in the referent space (such as the cosine similarity between the embeddings of objects) (Lazaridou et al., 2018). In this paper, we propose to do both and study 1) the generalization to unseen combinations of abstract features and 2) topographic measures based on the Hausdorff distances between utterances denoting compositions and utterances denoting isolated features.

Contributions. This paper introduces:

• The Graphical Referential Game (GREG): a variation of the referential language game to study the formation of signs from a graphical sensory-motor system.

• CURVES: an algorithmic solution to GREG, consisting of a contrastive multimodal encoder coupled with a generative model enabling the emergence of a graphical language.

• A study of CURVES's generalization performances on compositions of features never seen during training in a simplified control setting and a more perceptually challenging one.

• A complementary analysis of the structure of the emerging graphical language measuring lexicon coherence and compositionality scores derived from the Hausdorff distance.

2. Problem Definition

Graphical referential game. We consider a group of two agents playing a fixed number of referential games, each time alternating their roles (speaker or listener). During a game, we first present a context $R$ of $n$ objects, called referents, to a speaker $S$ and a listener $L$. At the beginning of each game, the target $r^\star \in R$ is assigned to the speaker. Given this target referent, $S$ produces an utterance ($u$) to designate it. Based on the produced utterance, $L$ selects a referent ($\hat{r}$) in $R$. The game outcome $o$ is a success if the selected referent ($\hat{r}$) matches the target $r^\star$.

Referents. Referents are compositions of orthogonal vector features (one-hot vectors). Given a set of $m$ orthogonal features $F_m$, we define the set of all possible referents as $R_m = \{\sum_{f \in S} f \mid S \subseteq F_m\}$. The subset of referents made of exactly $k$ features is thus $R_m^k = \{\sum_{f \in S} f \mid S \subseteq F_m, |S| = k\}$. In our experiments, we fix $m = 5$.
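The set definitions above can be made concrete with a minimal sketch that enumerates referents made of exactly $k$ features by summing one-hot vectors. The helper names (`one_hot`, `referents_with_k_features`) are illustrative assumptions and are not taken from the original implementation.

```python
from itertools import combinations


def one_hot(index, m):
    """Return the one-hot feature vector of size m with a 1 at `index`."""
    return tuple(1 if i == index else 0 for i in range(m))


def referents_with_k_features(m, k):
    """Enumerate R_m^k: sums of exactly k distinct one-hot features."""
    features = [one_hot(i, m) for i in range(m)]
    referents = []
    for subset in combinations(features, k):
        # Summing orthogonal one-hot vectors yields a k-hot vector.
        referents.append(tuple(sum(column) for column in zip(*subset)))
    return referents


m = 5  # value fixed in the experiments
single_feature_referents = referents_with_k_features(m, 1)
two_feature_referents = referents_with_k_features(m, 2)
```

With $m = 5$, this enumeration gives 5 single-feature referents and 10 two-feature referents.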

From these orthogonal referents, we propose to generate objects made of digit images sampled from the MNIST dataset (LeCun et al., 1998). More precisely, we define the stochastic mapping $\Phi : R_m \to \tilde{R}_m$ that maps each feature $f \in F_m$ to a digit class in the MNIST dataset. For each feature in a referent, we sample a random instance from the corresponding class and randomly place it on a $4 \times 4$ grid such that no digits overlap. Note that the listener and speaker can perceive different realizations of $\Phi$; in this case, we say that they see different perspectives of the referents. More precisely, the speaker perceives the context $R$ as $\tilde{R}_S$ and its target $r^\star$ as $r^\star_S$. Similarly, the listener perceives the context $R$ as $\tilde{R}_L$ and selects a referent $\hat{r}$ among it.

Figure 2. (a) Sketching sensory-motor system: The sensory-motor system imitates a robotic arm drawing a sketch on a 2D plane. DMPs first convert a continuous command $c$ into a sequence of coordinates $T$. This trajectory is then rendered as a $52 \times 52$ graphical utterance $u$ thanks to a differentiable sketching library. (b) Referent transformation: An example of a one-hot context $R$ being transformed into two contexts $\tilde{R}_S$ and $\tilde{R}_L$ by the stochastic transformation $\Phi$. The two contexts are different perspectives of the same objects.

We use this formalism to instantiate three settings of the Graphical Referential Game (GREG):

• one-hot: where referents are one-hot vectors $r \in R_m$.

• visual-shared: where referents are MNIST digits $r \in \tilde{R}_m$ and agents share the same perspective: $\tilde{R}_S = \tilde{R}_L$.

• visual-unshared: where referents are MNIST digits $r \in \tilde{R}_m$ and agents have different perspectives of referents in their contexts: $\tilde{R}_S \neq \tilde{R}_L$.

Sensory-motor drawing system. Utterances are produced by a sensory-motor system $M : \mathbb{R}^p \to U \subset \mathbb{R}^{D \times D}$ mimicking an arm drawing sketches, as displayed in Figure 2(a). The arm motion is derived from Dynamical Motor Primitives (DMPs) (Schaal, 2006). The DMP is parametrized by a command vector $c \in \mathbb{R}^{20}$. It converts $c$ into a 2-dimensional drawing trajectory $T$ made of 10 coordinates, $T = \{v_i\}_{i=0,\dots,9}$. This trajectory is then fed to a Differentiable Sketching model (Mihai & Hare, 2021a) generating a $D \times D$ image (in our implementation, $D = 52$). See Suppl. Section A.1 for additional implementation details of the sensory-motor drawing system.

Objectives. In this study, we aim to answer the three following questions:

1. What are agents' communicative performances in the GREG? Are agents able to solve the game? Are they able to generalize to compositional referents?

2. Are the emergent signs coherent? Do agents produce the same utterances to denote the same referents?

3. Are the emergent signs compositional? Are there compositional rules in the production of signs naming compositional referents? [1]

[1] Note that the ability to perform compositional generalization (question 1) and the presence of compositional structure in utterances (question 3) are two separate investigations.

Are agents able to solve the GREG? To answer the first question, we will monitor the communicative performance of agents on both training and testing referents. The training referents consist of a single feature, $R_{\text{train}} = R_m^1$, while the testing referents consist of two features, $R_{\text{test}} = R_m^2$. For visual examples of compositional referents, see Suppl. Section A.2.

Are the emergent signs coherent? To measure coherence, we propose to use a similarity measure based on the Hausdorff distance. The Hausdorff distance is known to capture geometric features of trajectories, in particular their shape (Besse et al., 2015). The Hausdorff distance $d_H$ is the maximum distance from any coordinate in a trajectory to the closest coordinate in the other: $d_H(T_1, T_2) = \max\{\sup_{v \in T_1} d(v, T_2), \sup_{v' \in T_2} d(T_1, v')\}$. In particular, we compute the following metrics.

• Agent Coherence (A-coherence): For a given referent with the same perspective for all agents, measure the mean pairwise similarity between the agents' utterances.

• Perspective Coherence (P-coherence): For a given agent and a given referent $r$, measure the mean pairwise similarity between utterances produced from different perspectives.

• Referent Coherence (R-coherence): For a given agent, measure the mean pairwise similarity between utterances produced for different referents.
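The Hausdorff distance and the "mean pairwise" aggregation used by the three coherence metrics can be sketched as follows for trajectories given as lists of 2D points. This is a minimal illustration under stated assumptions: the function names are hypothetical, and since this section does not specify how the distance is turned into a similarity, the sketch reports raw distances (lower means more similar).

```python
import math
from itertools import combinations


def distance_to_trajectory(point, trajectory):
    """Distance from a point to the closest coordinate of a trajectory."""
    return min(math.dist(point, other) for other in trajectory)


def hausdorff(t1, t2):
    """Symmetric Hausdorff distance between two finite trajectories."""
    forward = max(distance_to_trajectory(v, t2) for v in t1)
    backward = max(distance_to_trajectory(v, t1) for v in t2)
    return max(forward, backward)


def mean_pairwise_hausdorff(trajectories):
    """Average Hausdorff distance over all unordered pairs (needs >= 2)."""
    pairs = list(combinations(trajectories, 2))
    return sum(hausdorff(a, b) for a, b in pairs) / len(pairs)


straight = [(0.0, 0.0), (1.0, 0.0)]
hooked = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]
print(hausdorff(straight, hooked))
```

For A-coherence, the list passed to `mean_pairwise_hausdorff` would hold one trajectory per agent for the same referent; for P-coherence, one trajectory per perspective; for R-coherence, one trajectory per referent.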

Are the emergent signs compositional? To measure the compositionality of the utterances, we introduce a topographic score $\rho$ based on the Hausdorff distance. $\rho$ quantifies how much an utterance denoting a compositional referent made of features $i$ and $j$ ($u(r_{ij})$) is actually closer to the utterances denoting the isolated features ($u(r_i)$ and $u(r_j)$) than the utterances naming other compositional referents ($u(r_{xy})$, $x \neq i$, $y \neq j$). For a detailed derivation of the metric $\rho$, see Suppl. Section A.3.

3. CURVES - Contrastive Utterance-Referent associatiVE Scoring

CURVES is an energy-based approach that relies on two mechanisms:

1. The contrastive learning of an energy landscape $E(r, u)$, defined as the cosine similarity between utterance and referent embeddings.

2. The generation of an utterance that maximizes the energy for a given target referent $r^\star$.

Agent modules and interactions. Each agent $A \in \{A_1, A_2\}$ perceives utterances and referents using two distinct CNN encoders, $f_A$ (for referents) and $g_A$ (for utterances) [2]. $f_A$ and $g_A$ map referents and utterances into a shared $d$-dimensional latent space: $f_A(\cdot, \theta_{f_A}) : R_m \to \mathbb{R}^d$ and $g_A(\cdot, \theta_{g_A}) : U \to \mathbb{R}^d$, such that $z_{r_A} = f_A(r)$ and $z_{u_A} = g_A(u)$, as displayed in Figure 3(a). The agent then computes the energy landscape as $E_A(r, u) = \cos(f_A(r), g_A(u))$.
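The energy is a plain cosine similarity between two embeddings. The sketch below assumes that the encoder outputs are already available as lists of floats; `cosine_energy` is an illustrative name, and the encoders themselves are not modeled.

```python
import math


def cosine_energy(z_r, z_u):
    """Cosine similarity between a referent embedding and an utterance embedding."""
    dot = sum(a * b for a, b in zip(z_r, z_u))
    norm_r = math.sqrt(sum(a * a for a in z_r))
    norm_u = math.sqrt(sum(b * b for b in z_u))
    return dot / (norm_r * norm_u)


# Stand-ins for f_A(r) and g_A(u); real embeddings would come from the encoders.
aligned = cosine_energy([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_energy([1.0, 0.0], [0.0, 1.0])
opposed = cosine_energy([1.0, 2.0], [-2.0, -4.0])
```

Because the cosine is invariant to the scale of either embedding, the energy always lies in $[-1, 1]$, with 1 for perfectly aligned embeddings and -1 for opposed ones.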

A given referential game unfolds as follows. Agents have randomly attributed roles: for instance, $A_1$ is the speaker ($A_1 \leftarrow S$) and $A_2$ is the listener ($A_2 \leftarrow L$). The speaker is given a context $\tilde{R}_S$ and a target referent perceived as $r^\star_S$ to produce an utterance $\hat{u}$ intending to approach the utterance $u^\star$ that maximizes $E_S(r^\star_S, u)$. The listener observes $\hat{u}$ and selects the referent $\hat{r}$ in context $\tilde{R}_L$ that maximizes $E_L(r, \hat{u})$:

$$
\begin{cases}
\hat{u} \approx u^\star = \operatorname*{argmax}_{u \in U} E_S(r^\star_S, u) \\
\hat{r} = \operatorname*{argmax}_{r \in \tilde{R}_L} E_L(r, \hat{u})
\end{cases}
\tag{1}
$$

The outcome of the game is then $o = \mathbb{1}[\hat{r} = r^\star] - b$, where $b$ is a baseline parameter representing the mean success across previous games.
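The listener's choice and the baselined outcome can be sketched as below. The listener's energies for each referent of its context are assumed to be precomputed, and the baseline is assumed to be a plain running mean of past successes; the text only says it represents the mean success across previous games, so the exact form is an assumption, as are all names.

```python
def listener_select(energies):
    """Index of the context referent with the highest energy E_L(r, u_hat)."""
    return max(range(len(energies)), key=energies.__getitem__)


class SuccessBaseline:
    """Running mean of past successes (an assumed form of the baseline b)."""

    def __init__(self):
        self.successes = 0.0
        self.games = 0

    def value(self):
        # Before any game has been played, the baseline is taken to be 0.
        return self.successes / self.games if self.games else 0.0

    def update(self, success):
        self.successes += success
        self.games += 1


def play_outcome(selected_index, target_index, baseline):
    """Return o = 1[selected == target] - b, then fold the game into b."""
    success = 1.0 if selected_index == target_index else 0.0
    outcome = success - baseline.value()
    baseline.update(success)
    return outcome
```

Subtracting the baseline centers the outcome, so a success after a long streak of successes yields a small positive value while a failure yields a clearly negative one.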

Contrastive representation learning in referential games. For a given context $R$, agents are randomly assigned their roles and play $n = |R|$ games. During these games, roles are fixed and the speaker agent successively selects each referent of the context $R$ as the target $r^\star$. During interactions, the speaker collects data $\{(r^i_S, u_i, o_i)\}_{i=1,\dots,n}$ while the listener observes $\{(u_i, r^i_L)\}_{i=1,\dots,n}$. From the collected data, each agent can compute the square ($n \times n$) cosine similarity matrix $\Sigma_A$ whose elements are $(\Sigma_A)_{i,j} = E_A(r^i_A, u_j)$, as shown in Figure 3(b). Contrastive updates are then performed using the objective $J_A$ that applies Cross Entropy (CE) on the $i$-th row and $i$-th column of $\Sigma_A$:

$$J_A(\Sigma_A, i) = \frac{\mathrm{CE}((\Sigma_A)_{i,1:n}, e_i) + \mathrm{CE}((\Sigma_A)_{1:n,i}, e_i)}{2} \tag{2}$$

with $e_i$ being a one-hot vector of size $n$ with value 1 at index $i$. Depending on the role of the agent, $r^i_A$ is instantiated as either $r^i_S$ (speaker) or $r^i_L$ (listener). Thus, the speaker updates its representation using the outcomes $o_i$ of the games (reinforcing the successful associations while decreasing the unsuccessful ones):

$$\underset{\theta_{f_S}, \theta_{g_S}}{\text{minimize}} \; \sum_{i=1}^{n} o_i J_S(\Sigma_S, i) \tag{3}$$

[2] When referents are one-hot vectors, $f_A$ is a fully-connected network. Parameters for both encoders are given in Suppl. Table 4.

On the other hand, the listener needs to make sure that the selection matches the speaker's referent (Steels, 2015) and hence always increases associations (no matter the games' outcomes):

$$\underset{\theta_{f_L}, \theta_{g_L}}{\text{minimize}} \; \sum_{i=1}^{n} J_L(\Sigma_L, i) \tag{4}$$

Note that in Eq. 4, $r^i_L$ is the target referent perceived by the listener. This means that, at the end of the game, the speaker indicates the referent (as perceived by the listener) that they named. This retroactive pointing mechanism was employed in both early language game implementations (Steels & Kaplan, 1999) and more recent ones (Chaabouni et al., 2020; Portelance et al., 2021).
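The objectives of Equations 2 to 4 can be sketched on a similarity matrix given as a list of rows. The sketch assumes that the cosine similarities are fed directly to the cross entropy as logits (no temperature or scaling is mentioned in this section), and it only evaluates the losses: the gradient step on the encoder parameters is not shown. All function names are illustrative.

```python
import math


def cross_entropy(logits, target_index):
    """Cross entropy between softmax(logits) and a one-hot target."""
    peak = max(logits)  # subtracted for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target_index]


def contrastive_objective(sim, i):
    """J(sim, i): mean of the CE on the i-th row and the i-th column."""
    row = sim[i]
    column = [sim[j][i] for j in range(len(sim))]
    return 0.5 * (cross_entropy(row, i) + cross_entropy(column, i))


def speaker_loss(sim, outcomes):
    """Outcome-weighted sum: successes are reinforced, failures weakened."""
    return sum(o * contrastive_objective(sim, i) for i, o in enumerate(outcomes))


def listener_loss(sim):
    """Unweighted sum: the listener always strengthens the association."""
    return sum(contrastive_objective(sim, i) for i in range(len(sim)))


well_aligned = [[1.0, 0.0], [0.0, 1.0]]
misaligned = [[0.0, 1.0], [1.0, 0.0]]
```

A matrix whose diagonal dominates, as in `well_aligned`, gives a lower listener loss than `misaligned`, which is the behavior the contrastive update pushes toward.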

Speaker's utterance optimization. We distinguish two utterance generation strategies:

• The descriptive generation: in which the speaker agent only considers the target referent $r^\star_S$ to produce an utterance that maximizes the cosine similarity between the embeddings of $r^\star_S$ and an utterance produced by our sensory-motor system, $u = M(c)$, from motor command $c$. Since $M$ is fully differentiable, we inject the sensory-motor constraint in Equation 1 and seek the optimal motor command $c^\star$ using gradient ascent:

$$c^\star = \operatorname*{argmax}_{c \in \mathbb{R}^p} E(r^\star_S, M(c)) \tag{5}$$

• The discriminative generation: in which the speaker also perceives the context $\tilde{R}_S$ during production. This is achieved by finding the motor command that minimizes the cross entropy given a target referent $r^\star_S$ and its context $\tilde{R}_S$:

$$c^\star = \operatorname*{argmin}_{c \in \mathbb{R}^p} \mathrm{CE}(\sigma_S, e_{r^\star_S}) \tag{6}$$

where $\sigma_S$ is the vector with coordinates $\sigma_{S_i} = [E(r_i, M(c))]_{r_i \in \tilde{R}_S}$ and $e_{r^\star_S}$ is the one-hot vector of size $|\tilde{R}_S|$ with value 1 at the position of $r^\star_S$. This discriminative generation process is only used at test time when investigating CURVES's generalization capabilities.
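The descriptive generation of Equation 5 can be illustrated with a toy gradient ascent. Everything below is a stand-in chosen for illustration: `toy_motor_system` replaces the DMP and sketching pipeline with an element-wise tanh, the utterance encoder is taken to be the identity, and gradients are estimated by central finite differences rather than by back-propagating through a differentiable $M$.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def toy_motor_system(c):
    """Hypothetical smooth stand-in for M (not the DMP + sketching pipeline)."""
    return [math.tanh(x) for x in c]


def energy(target_embedding, c):
    """E(r*, M(c)) with an identity utterance encoder (an assumption)."""
    return cosine(target_embedding, toy_motor_system(c))


def numerical_gradient(f, c, eps=1e-5):
    """Central finite-difference estimate of the gradient of f at c."""
    grad = []
    for k in range(len(c)):
        up = c[:k] + [c[k] + eps] + c[k + 1:]
        down = c[:k] + [c[k] - eps] + c[k + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def descriptive_generation(target_embedding, c0, steps=200, lr=0.1):
    """Gradient ascent on the energy with respect to the motor command."""
    c = list(c0)
    for _ in range(steps):
        grad = numerical_gradient(lambda x: energy(target_embedding, x), c)
        c = [ck + lr * gk for ck, gk in zip(c, grad)]  # ascent step
    return c


target = [1.0, 0.0, -1.0]
c_start = [0.1, 0.2, 0.3]
c_opt = descriptive_generation(target, c_start)
```

The discriminative variant of Equation 6 would follow the same loop, with the energy replaced by the negative cross entropy computed over the energies of all referents in the speaker's context.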

4. Experiments and Results

This section focuses first on CURVES's communicative performances when agents interact in GREG (question 1 of
