Don’t Copy the Teacher:
Data and Model Challenges in Embodied Dialogue
So Yeon Min1, Hao Zhu2, Ruslan Salakhutdinov1, Yonatan Bisk2
1Machine Learning and 2Language Technologies, Carnegie Mellon University
{soyeonm,hzhu2,rsalakhu,ybisk}@andrew.cmu.edu
Abstract
Embodied dialogue instruction following requires an agent to complete a complex sequence of tasks from a natural language exchange. The recent introduction of benchmarks (Padmakumar et al., 2022) raises the question of how best to train and evaluate models for this multi-turn, multi-agent, long-horizon task. This paper contributes to that conversation by arguing that imitation learning (IL) and related low-level metrics are misleading: they do not align with the goals of embodied dialogue research and may hinder progress.
We provide empirical comparisons of metrics, analyze three models, and make suggestions for how the field might best progress. First, we observe that models trained with IL take spurious actions during evaluation. Second, we find that existing models fail to ground query utterances, which are essential for task completion. Third, we argue evaluation should focus on higher-level semantic goals.1
1 Introduction
Dialogue is key to how humans collaborate; through dialogue, we query information, confirm our understanding, or banter in a friendly manner. Since communication helps us work more efficiently and successfully, it is only natural to imbue collaborative agents with this same ability. Most work has focused on grounded dialogues for embodied navigation (Thomason et al., 2020; Chi et al., 2019; Roman et al., 2020) or limited interaction (Suhr et al., 2019), which are narrower domains than the larger instruction following literature (Tellex et al., 2011, 2020; Shridhar et al., 2020; Blukis et al., 2018, 2021; Min et al., 2021).
The first step towards engaging in a dialogue is being able to understand and learn from it. Picture a child watching their parents with the goal of learning by imitation. They witness instructions, clarifications, mistakes, and banter. This begs the question: what should one learn from noisy natural dialogues?
1 Code to be released at https://github.com/soyeonm/TEACh_FILM
Unlike in alinguistic tasks, where modeling humans has recently proved helpful for search strategies (Deitke et al., 2022), we focus on language-based tasks that require learning lexical-visual-action correspondences. We discuss and compare three paradigms: Instruction Following (IF), actions from Entire Dialogue History (EDH), and Trajectory from Dialogue (TfD). The novel TEACh dataset (Padmakumar et al., 2021) proposes EDH as the primary metric and uses the Episodic Transformer (ET) (Pashevich et al., 2021) trained with behavior cloning as its baseline. We also include comparisons to the EDH-competitive Symbiote2 system, and we adapt FILM (Min et al., 2021), a recent method for general IF, to dialogue instruction following (DIF) on TEACh. FILM and Symbiote belong to a different family of models, focusing on abstract planning trained at a higher semantic level than behavior cloning. This approach appears crucial for generalization and TfD evaluations.
Most importantly, we analyze the human behaviors in TEACh and their effect on ET, Symbiote, and FILM, as representatives of existing model classes. From our findings, we suggest there are three major challenges the community must tackle to move forward in the nascent field of dialogue-based instruction following:
Recognizing mistakes
Behavior cloning encourages replication of low-level errors, but not high-level intentions. Agents should learn to construe the high-level intentions of demonstrations and to deviate from demonstration errors.
Grounding queries
No approaches correctly ground "queries" requesting information.
2 Model outputs provided by correspondence with the team.
arXiv:2210.04443v2 [cs.LG] 11 Oct 2022
Evaluation
Agent evaluation should focus on achieving goals rather than imitating procedures.
2 Related Work
Instruction Following
A plethora of works has been introduced for instruction following without dialogue (Chen and Mooney, 2011; Matuszek et al., 2012); an agent is expected to perform a task given a language instruction at the beginning and visual inputs at every time step. Representative tasks are Vision-and-Language Navigation (Anderson et al., 2018; Fried et al., 2018; Zhu et al., 2020) and instruction following (IF) (Shridhar et al., 2020; Singh et al., 2020), which demands both navigation and manipulation. Popular methods rely on imitation learning (Pashevich et al., 2021; Singh et al., 2020) and modularly trained components (Blukis et al., 2021; Min et al., 2021), e.g., for mapping and depth.
Dialogue Instruction Following
Instruction following with dialogue (She et al., 2014) has mostly addressed navigation. Thomason et al. (2020) and Suhr et al. (2019) built navigation agents that ground human-human dialogues, while Chi et al. (2019) and Nguyen and Daumé III (2019) showed that obtaining clarification via simulated interactions can improve navigation. Manipulation introduces grounding query utterances that involve more complex reasoning than in navigation-only scenarios (Tellex et al., 2013); for example, the agent may hear that the object of interest (e.g., "apple") is inside "the third cabinet to the right of the fridge."
Imitation Learning vs. Higher Semantics
While behavior cloning (BC) is a popular method for training IF agents, it assumes that the expert demonstration is optimal (Zhang et al., 2021; Wu et al., 2019). TEACh demonstrations are more "ecologically valid" (de Vries et al., 2020) but correspondingly suboptimal, frequently containing mistakes and unnecessary actions. Popular methods for dealing with suboptimal demonstrations involve annotated scoring labels or rankings for the quality of demonstrations (Wu et al., 2019; Brown et al., 2019). Such additional annotations are not available in existing IF and DIF benchmarks. In this work, we empirically demonstrate the effect of noisy demonstrations on an Episodic Transformer trained with BC for DIF.
3 Tasks
TEACh focuses on two tasks: Entire Dialogue History (EDH) and Trajectory from Dialogue (TfD). Despite what the name implies, EDH is an evaluation over partial dialogues (e.g., execution begins from an intermediate state S_t and runs to S_T). TfD starts an agent at S_0 and asks for complete task completion provided the full dialogue.
In both settings, the agent (driver) completes household tasks conditioned on the dialogue text and egocentric RGB observations of the current view. An instance of a dialogue will take the form of a command: Prepare coffee in a clean mug. Mugs are in the microwave., the agent's response How many do I need?, and the commander's answer: One, together with a sequence of RGB frames and actions that the agent performed during the dialogue. As in this example, the agent has to achieve multiple subtasks (e.g., find the mug in the microwave, clean the mug in the sink, turn on the coffee machine) to succeed.
In TfD, the full dialogue history is given, and the agent succeeds if it completes the full task itself (e.g., make coffee). In EDH, the dialogue history is partitioned into "sessions" (e.g., Fig. 1): the action/vision/dialogue history up to the first utterance of the commander is the first session, and everything after it is the second. In EDH evaluation, the agent takes one session as input and predicts actions until the next session. An agent succeeds if it realizes all state changes (e.g., Mug: picked up) that the human annotator performed. Succinctly, TfD measures the full dialogue while EDH evaluates subsequences.
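The EDH success criterion above can be sketched as a set comparison over object-state changes. The data structures here are illustrative placeholders, not the actual TEACh API: each change is modeled as an (object, property, value) tuple, and an episode succeeds only if the agent realizes every change the human annotator produced.

```python
# Hypothetical sketch of EDH-style success checking. An episode
# succeeds iff the agent achieves every object-state change from
# the reference (annotator) session; the partial-credit ratio
# mirrors a goal-condition success rate.

def edh_success(reference_changes, agent_changes):
    """Return (success, goal-condition success rate).

    Each change is an (object, property, value) tuple,
    e.g. ("Mug", "isPickedUp", True).
    """
    reference = set(reference_changes)
    achieved = set(agent_changes)
    satisfied = reference & achieved
    gc_rate = len(satisfied) / len(reference) if reference else 1.0
    return reference <= achieved, gc_rate

# Example: the annotator picked up and cleaned the mug,
# but the agent only picked it up.
ref = [("Mug", "isPickedUp", True), ("Mug", "isClean", True)]
agt = [("Mug", "isPickedUp", True)]
print(edh_success(ref, agt))  # -> (False, 0.5)
```

Note that extra actions by the agent do not hurt under this criterion; only missing reference state changes cause failure, which is exactly why the metric tolerates spurious behavior.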
4 Models
TEACh is an important new task for the community.
We analyze the provided baseline (ET), retrofit the
ALFRED FILM model, and requested outputs from
the authors of Symbioteon the EDH leaderboard.
ET is a transformer for direct sequence imitation
approach, that produces low-level actions condi-
tioned on the accumulated visual and linguistic
contexts. In contrast, FILM consists of four sub-
modules - semantic mapping, language processing,
semantic policy, and deterministic policy modules.
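The modular decomposition described above can be sketched as a simple agent skeleton. The class and method names below are hypothetical placeholders for illustration, not FILM's real interfaces: the point is that language is parsed once into high-level subtasks, while perception, goal selection, and low-level control are separate, independently trained components.

```python
# Illustrative skeleton of a FILM-style modular agent. Each
# submodule is an injected component; none of these interfaces
# are FILM's actual classes.

class ModularAgent:
    def __init__(self, language_processor, semantic_mapper,
                 semantic_policy, deterministic_policy):
        self.lp = language_processor          # dialogue -> subtask list
        self.mapper = semantic_mapper         # RGB obs -> semantic map
        self.semantic_policy = semantic_policy        # map + subtask -> goal
        self.deterministic_policy = deterministic_policy  # goal -> action

    def reset(self, dialogue_history):
        # Parse the dialogue once into high-level subtasks,
        # e.g. [("Mug", "PickUp"), ("Mug", "Clean"), ...]
        self.subtasks = self.lp.parse(dialogue_history)

    def act(self, rgb_observation):
        # Update the top-down semantic map, pick a goal location for
        # the current subtask, then plan a low-level action toward it.
        semantic_map = self.mapper.update(rgb_observation)
        goal = self.semantic_policy.predict(semantic_map, self.subtasks[0])
        return self.deterministic_policy.plan(semantic_map, goal)
```

Because planning happens at the subtask level rather than over raw action sequences, such an agent never imitates an annotator's low-level mistakes, which is consistent with the paper's argument for higher-level semantics.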
For the adaptation, we refactored the original FILM code to the TEACh API, retrained the learned components of the semantic mapping module for the change in height and camera horizon, and retrained/rewrote the language processing module to take a dialogue history as input. The language processing (LP) module of FILM maps an instruction