Evaluation
Agent evaluation should focus on achieving goals rather than imitating procedures.
2 Related Work
Instruction Following
A plethora of works have been introduced for instruction following without dialogue (Chen and Mooney, 2011; Matuszek et al., 2012); an agent is expected to perform a task given a language instruction at the beginning and visual inputs at every time step. Representative tasks are Vision-and-Language Navigation (Anderson et al., 2018; Fried et al., 2018; Zhu et al., 2020) and instruction following (IF) (Shridhar et al., 2020; Singh et al., 2020), which demands both navigation and manipulation. Popular methods rely on imitation learning (Pashevich et al., 2021; Singh et al., 2020) and modularly trained components (Blukis et al., 2021; Min et al., 2021), e.g., for mapping and depth.
Dialogue Instruction Following
Instruction Following with Dialogue (She et al., 2014) has mostly addressed navigation. Thomason et al. (2020) and Suhr et al. (2019) built navigation agents that ground human-human dialogues, while Chi et al. (2019) and Nguyen and Daumé III (2019) showed that obtaining clarification via simulated interactions can improve navigation. Manipulation introduces the need to ground query utterances that involve more complex reasoning than in navigation-only scenarios (Tellex et al., 2013); for example, the agent may hear that the object of interest (e.g., an “apple”) is inside “the third cabinet to the right of the fridge.”
Imitation Learning vs. Higher Semantics
While behavior cloning (BC) is a popular method for training IF agents, it assumes that expert demonstrations are optimal (Zhang et al., 2021; Wu et al., 2019). TEACh demonstrations are more “ecologically valid” (de Vries et al., 2020) but correspondingly suboptimal, frequently containing mistakes and unnecessary actions. Popular methods for dealing with suboptimal demonstrations rely on annotated scoring labels or rankings of demonstration quality (Wu et al., 2019; Brown et al., 2019). Such additional annotations are not available in existing IF and DIF benchmarks. In this work, we empirically demonstrate the effect of noisy demonstrations on an Episodic Transformer agent trained with BC for DIF.
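Because this trade-off is central to our analysis, the following is a minimal PyTorch-style sketch of the BC objective, assuming a discrete low-level action space; the architecture and names are illustrative, not ET. BC maximizes the likelihood of every demonstrated action, which is exactly why noisy demonstrations hurt: mistakes and unnecessary actions receive the same weight as useful ones.

import torch
import torch.nn as nn

class BCAgent(nn.Module):
    """Illustrative policy network; not the actual ET architecture."""
    def __init__(self, obs_dim: int, num_actions: int, hidden: int = 256):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.policy(obs)  # logits over low-level actions

def bc_step(agent, optimizer, obs, expert_actions):
    """One gradient step: cross-entropy against the demonstrated actions.

    Every expert action is treated as ground truth, so suboptimal steps
    in the demonstration are imitated just as strongly as optimal ones.
    """
    logits = agent(obs)
    loss = nn.functional.cross_entropy(logits, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()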
3 Tasks
TEACh focuses on two tasks: Execution from Dialogue History (EDH) and Trajectory from Dialogue (TfD). Despite what the name implies, EDH is an evaluation over partial dialogues (e.g., execution begins at an intermediate state $S_t$ and continues to $S_T$). TfD starts an agent at the initial state $S_0$ and asks for complete task completion given the full dialogue.
In both settings, the agent (driver) completes household tasks conditioned on text, past egocentric RGB observations, and the current view. An instance of a dialogue takes the form of a command (Prepare coffee in a clean mug. Mugs are in the microwave.), the agent's response (How many do I need?), and the commander's answer (One), together with a sequence of RGB frames and actions that the agent performed during the dialogue. As in this example, the agent has to achieve multiple subtasks (e.g., find a mug in the microwave, clean it in the sink, turn on the coffee machine) to succeed.
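Concretely, a single instance bundles the dialogue turns with the driver's interaction history. The sketch below is a hypothetical schema for illustration only; the field names and action labels are our own and do not match the released TEACh data format.

# Hypothetical schema for a dialogue-instruction-following instance
# (illustrative only; not the actual TEACh file layout).
example_instance = {
    "dialogue": [
        {"speaker": "Commander",
         "utterance": "Prepare coffee in a clean mug. "
                      "Mugs are in the microwave."},
        {"speaker": "Driver", "utterance": "How many do I need?"},
        {"speaker": "Commander", "utterance": "One"},
    ],
    # Low-level actions the human driver took during the dialogue.
    "driver_actions": ["Forward", "Turn Right", "Open", "Pickup", ...],
    # Egocentric RGB frames aligned with the actions above.
    "image_frames": ["frame_0000.jpg", "frame_0001.jpg", ...],
}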
In TfD, the full dialogue history is given, and the agent succeeds if it completes the full task itself (e.g., make coffee). In EDH, the dialogue history is partitioned into “sessions” (e.g., Fig. 1): the action/vision/dialogue history up to the first utterance of the commander (Prepare ∼ microwave.) forms the first session, and everything after it the second. In EDH evaluation, the agent takes one session as input and predicts actions until the next session. An agent succeeds if it realizes all state changes (e.g., Mug: picked up) that the human annotator performed. Succinctly, TfD measures the full dialogue while EDH evaluates subsequences.
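Under this protocol, EDH success reduces to checking that every object-state change the annotator achieved also holds after the agent's actions. The following is a minimal sketch of that check, assuming a simple (object, property) -> value state representation; it is not the official TEACh evaluator, which also reports finer-grained goal-condition metrics.

# Sketch of an EDH-style success check (assumed state representation).
def edh_success(expected_changes: dict, final_state: dict) -> bool:
    """True iff every annotator state change holds in the agent's final state.

    expected_changes: e.g. {("Mug", "isPickedUp"): True,
                            ("Mug", "isDirty"): False}
    final_state: the same (object, property) -> value mapping after the
                 agent's predicted actions have been executed.
    """
    return all(final_state.get(key) == value
               for key, value in expected_changes.items())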
4 Models
TEACh is an important new task for the community.
We analyze the provided baseline (ET), retrofit the ALFRED FILM model, and request outputs from the authors of Symbiote on the EDH leaderboard.
ET is a transformer-based direct sequence imitation approach that produces low-level actions conditioned on the accumulated visual and linguistic context. In contrast, FILM consists of four submodules: semantic mapping, language processing, semantic policy, and deterministic policy.
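To make the modular contrast concrete, the sketch below shows how FILM-style components might compose at inference time; the interfaces are our simplification for exposition, not the released implementation.

# Simplified sketch of a FILM-style modular loop (our abstraction;
# the real pipeline is considerably more involved).
def modular_step(dialogue_history, rgb_frame, agent_pose,
                 language_processor, semantic_mapper,
                 semantic_policy, deterministic_policy):
    # 1. Language processing: dialogue -> sequence of (subtask, object) pairs.
    subtasks = language_processor(dialogue_history)
    # 2. Semantic mapping: egocentric RGB -> top-down semantic map update.
    semantic_map = semantic_mapper.update(rgb_frame, agent_pose)
    # 3. Semantic policy: predict where the current goal object likely is.
    goal_location = semantic_policy(semantic_map, subtasks[0])
    # 4. Deterministic policy: plan low-level actions toward that location.
    return deterministic_policy.plan(semantic_map, goal_location)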
For the adaptation, we refactored the original FILM code to the TEACh API, retrained the learned components of the semantic mapping module to account for the change in camera height and horizon, and retrained/rewrote the language processing module to take a dialogue history as input. The language processing (LP) module of FILM maps an instruction