Evaluation
Agent evaluation should focus on achieving goals rather than imitating procedures.
2 Related Work
Instruction Following
A plethora of works have been introduced for instruction following without dialogue (Chen and Mooney, 2011; Matuszek et al., 2012); an agent is expected to perform a task given a language instruction at the beginning and visual inputs at every time step. Representative tasks are Vision-and-Language Navigation (Anderson et al., 2018; Fried et al., 2018; Zhu et al., 2020) and instruction following (IF) (Shridhar et al., 2020; Singh et al., 2020), which demands both navigation and manipulation. Popular methods rely on imitation learning (Pashevich et al., 2021; Singh et al., 2020) and modularly trained components (Blukis et al., 2021; Min et al., 2021), e.g., for mapping and depth.
Dialogue Instruction Following
Instruction Following with Dialogue (She et al., 2014) has mostly addressed navigation. Thomason et al. (2020) and Suhr et al. (2019) built navigation agents that ground human-human dialogues, while Chi et al. (2019) and Nguyen and Daumé III (2019) showed that obtaining clarification via simulated interactions can improve navigation. Manipulation introduces the need to ground query utterances that involve more complex reasoning than in navigation-only scenarios (Tellex et al., 2013); for example, the agent may hear that the object of interest (e.g., an “apple”) is inside “the third cabinet to the right of the fridge.”
Imitation Learning vs. Higher Semantics
While behavior cloning (BC) is a popular method for training IF agents, it assumes that expert demonstrations are optimal (Zhang et al., 2021; Wu et al., 2019). TEACh demonstrations are more “ecologically valid” (de Vries et al., 2020) but correspondingly suboptimal, frequently containing mistakes and unnecessary actions. Popular methods for dealing with suboptimal demonstrations rely on annotated scoring labels or rankings of demonstration quality (Wu et al., 2019; Brown et al., 2019). Such additional annotations are not available in existing IF and DIF benchmarks. In this work, we empirically demonstrate the effect of noisy demonstrations on an Episodic Transformer agent trained with BC for DIF.
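Because this trade-off is central to our analysis, the following is a minimal PyTorch-style sketch of the BC objective, assuming a discrete low-level action space; the architecture and names are illustrative, not ET. BC maximizes the likelihood of every demonstrated action, which is exactly why noisy demonstrations hurt: mistakes and unnecessary actions receive the same weight as useful ones.

import torch
import torch.nn as nn

class BCAgent(nn.Module):
    """Illustrative policy network; not the actual ET architecture."""
    def __init__(self, obs_dim: int, num_actions: int, hidden: int = 256):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.policy(obs)  # logits over low-level actions

def bc_step(agent, optimizer, obs, expert_actions):
    """One gradient step: cross-entropy against the demonstrated actions.

    Every expert action is treated as ground truth, so suboptimal steps
    in the demonstration are imitated just as strongly as optimal ones.
    """
    logits = agent(obs)
    loss = nn.functional.cross_entropy(logits, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()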
3 Tasks
TEACh focuses on two tasks: Execution from Dialogue History (EDH) and Trajectory from Dialogue (TfD). Despite what the name implies, EDH is an evaluation over partial dialogues (e.g., execution begins at an intermediate state $S_t$ and continues to $S_T$). TfD starts an agent at the initial state $S_0$ and asks for complete task completion given the full dialogue.
In both settings, the agent (driver) completes household tasks conditioned on text, past egocentric RGB observations, and the current view. An instance of a dialogue takes the form of a command (Prepare coffee in a clean mug. Mugs are in the microwave.), the agent's response (How many do I need?), and the commander's answer (One), together with a sequence of RGB frames and actions that the agent performed during the dialogue. As in this example, the agent has to achieve multiple subtasks (e.g., find a mug in the microwave, clean it in the sink, turn on the coffee machine) to succeed.
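Concretely, a single instance bundles the dialogue turns with the driver's interaction history. The sketch below is a hypothetical schema for illustration only; the field names and action labels are our own and do not match the released TEACh data format.

# Hypothetical schema for a dialogue-instruction-following instance
# (illustrative only; not the actual TEACh file layout).
example_instance = {
    "dialogue": [
        {"speaker": "Commander",
         "utterance": "Prepare coffee in a clean mug. "
                      "Mugs are in the microwave."},
        {"speaker": "Driver", "utterance": "How many do I need?"},
        {"speaker": "Commander", "utterance": "One"},
    ],
    # Low-level actions the human driver took during the dialogue.
    "driver_actions": ["Forward", "Turn Right", "Open", "Pickup", ...],
    # Egocentric RGB frames aligned with the actions above.
    "image_frames": ["frame_0000.jpg", "frame_0001.jpg", ...],
}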
In TfD, the full dialogue history is given, and the agent succeeds if it completes the full task itself (e.g., make coffee). In EDH, the dialogue history is partitioned into “sessions” (e.g., Fig. 1): the action/vision/dialogue history up to the first utterance of the commander (Prepare ∼ microwave.) forms the first session, and everything after it the second. In EDH evaluation, the agent takes one session as input and predicts actions until the next session. An agent succeeds if it realizes all state changes (e.g., Mug: picked up) that the human annotator performed. Succinctly, TfD measures the full dialogue while EDH evaluates subsequences.
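Under this protocol, EDH success reduces to checking that every object-state change the annotator achieved also holds after the agent's actions. The following is a minimal sketch of that check, assuming a simple (object, property) -> value state representation; it is not the official TEACh evaluator, which also reports finer-grained goal-condition metrics.

# Sketch of an EDH-style success check (assumed state representation).
def edh_success(expected_changes: dict, final_state: dict) -> bool:
    """True iff every annotator state change holds in the agent's final state.

    expected_changes: e.g. {("Mug", "isPickedUp"): True,
                            ("Mug", "isDirty"): False}
    final_state: the same (object, property) -> value mapping after the
                 agent's predicted actions have been executed.
    """
    return all(final_state.get(key) == value
               for key, value in expected_changes.items())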
4 Models
TEACh is an important new task for the community.
We analyze the provided baseline (ET), retrofit the ALFRED FILM model, and request outputs from the authors of Symbiote on the EDH leaderboard.
ET is a transformer-based direct sequence imitation approach that produces low-level actions conditioned on the accumulated visual and linguistic context. In contrast, FILM consists of four submodules: semantic mapping, language processing, semantic policy, and deterministic policy.
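To make the modular contrast concrete, the sketch below shows how FILM-style components might compose at inference time; the interfaces are our simplification for exposition, not the released implementation.

# Simplified sketch of a FILM-style modular loop (our abstraction;
# the real pipeline is considerably more involved).
def modular_step(dialogue_history, rgb_frame, agent_pose,
                 language_processor, semantic_mapper,
                 semantic_policy, deterministic_policy):
    # 1. Language processing: dialogue -> sequence of (subtask, object) pairs.
    subtasks = language_processor(dialogue_history)
    # 2. Semantic mapping: egocentric RGB -> top-down semantic map update.
    semantic_map = semantic_mapper.update(rgb_frame, agent_pose)
    # 3. Semantic policy: predict where the current goal object likely is.
    goal_location = semantic_policy(semantic_map, subtasks[0])
    # 4. Deterministic policy: plan low-level actions toward that location.
    return deterministic_policy.plan(semantic_map, goal_location)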
For the adaptation, we refactored the original FILM code to the TEACh API, retrained the learned components of the semantic mapping module to account for the change in camera height and horizon, and retrained/rewrote the language processing module to take a dialogue history as input. The language processing (LP) module of FILM maps an instruction