DOROTHIE: Spoken Dialogue for Handling Unexpected Situations
in Interactive Autonomous Driving Agents
Ziqiao Ma1, Ben VanDerPloeg1, Cristian-Paul Bara1, Yidong Huang1,
Eui-In Kim1, Felix Gervits2, Matthew Marge2, Joyce Chai1
1University of Michigan 2U.S. Army Research Laboratory
{marstin,bensvdp,cpbara,owenhji,euiink,chaijy}@umich.edu
{felix.gervits,matthew.r.marge}.civ@army.mil
Abstract
In the real world, autonomous driving agents navigate in highly dynamic environments full of unexpected situations where pre-trained models are unreliable. In these situations, what is immediately available to vehicles is often only human operators. Empowering autonomous driving agents with the ability to navigate in a continuous and dynamic environment and to communicate with humans through sensorimotor-grounded dialogue becomes critical. To this end, we introduce Dialogue On the ROad To Handle Irregular Events (DOROTHIE), a novel interactive simulation platform that enables the creation of unexpected situations on the fly to support empirical studies on situated communication with autonomous driving agents. Based on this platform, we created Situated Dialogue Navigation (SDN), a navigation benchmark of 183 trials with a total of 8415 utterances, around 18.7 hours of control streams, and 2.9 hours of trimmed audio. SDN is developed to evaluate the agent's ability to predict dialogue moves from humans as well as to generate its own dialogue moves and physical navigation actions. We further developed a transformer-based baseline model for these SDN tasks. Our empirical results indicate that language-guided navigation in a highly dynamic environment is an extremely difficult task for end-to-end models. These results provide insight towards future work on robust autonomous driving agents.1
Equal contribution.
Work done prior to joining Amazon Alexa AI.
1 The DOROTHIE platform, SDN benchmark, and code for the baseline model are available at https://github.com/sled-group/DOROTHIE
1 Introduction
In embodied agents such as autonomous vehicles (AVs), highly dynamic environments often lead to unexpected situations, such as challenging environmental conditions (e.g., caused by weather, light, or obstacles), the influence of other agents, and change of the original goals. In these situations, the agent's
pre-trained models or existing knowledge may not be adequate or reliable to make an appropriate decision. What is immediately available to help the agent is often only human partners (Ramachandran et al., 2013). As they are not programmers who can readily change the code in the field, approaches that enable natural communication and collaboration between humans and autonomy become critical (Spiliotopoulos et al., 2001; Weng et al., 2016). Although recent years have seen an increasing amount of work on natural language communication with robots, especially the many benchmarks developed for navigation by instruction following (Roh et al., 2020; Vasudevan et al., 2021; Shridhar et al., 2020; Padmakumar et al., 2022), little work has studied language communication under unexpected situations, particularly in the context of AVs.
To address this limitation, we have developed Dialogue On the ROad To Handle Irregular Events (DOROTHIE), an interactive simulation platform built upon the CARLA simulator (Dosovitskiy et al., 2017) to specifically target unexpected situations. The DOROTHIE simulator supports Wizard-of-Oz (WoZ) studies through a novel duo-wizard setup: a collaborative wizard (Co-Wizard) that collaborates with the human to accomplish the tasks, and an adversarial wizard (Ad-Wizard) that generates unexpected situations (e.g., creating road obstacles, changing weather conditions, adding/changing goals) on the fly. Using DOROTHIE, we collected the Situated Dialogue Navigation (SDN) dataset of 183 trials between a Co-Wizard and human subjects collaboratively resolving unexpected situations and completing navigation tasks through spoken dialogue.
The SDN dataset contains multi-faceted and time-synchronized information (e.g., a first-person view of the environment, speech input from the human, discrete actions, continuous trajectories, and control signals) as well as fine-grained annotation of dialogue phenomena at multiple levels.
Name | Domain | Env. Fidelity | Env. Continuity | Comm. Turn | Comm. Form | Lang. Gran. | Control Gran. | Lang. Coll. | Demo. Coll. | Modalities | Replan. | Adp. | Nav. | Man. | Action Space
SDN (Ours) | Outdoors | Sim | C | M | Freeform | H & L | H & L | H | H | LVMS | ✓ | ✓ | ✓ | - | D & C
CDNLI (Roh et al., 2020) | Outdoors | Sim | C | M | Multi Inst | L | H & L | H+T | P | LVM | - | ✓ | ✓ | - | D & C
LCSD (Sriram et al., 2019) | Outdoors | Sim | C | S | Multi Inst | L | H | H | P | LVM | - | - | ✓ | - | D
TtW (De Vries et al., 2018) | Outdoors | Pano | D | M | Freeform | H & L | H | H | H | LVM | - | - | ✓ | - | D
Talk2Nav (Vasudevan et al., 2021) | Outdoors | Pano | D | S | Multi Inst | L | H | H | P | LVM | - | - | ✓ | - | D
TouchDown (Chen et al., 2019) | Outdoors | Pano | D | S | Multi Inst | L | H | H | P | LVM | - | - | ✓ | - | D
Street Nav (Hermann et al., 2020) | Outdoors | Pano | D | M | Multi Inst | L | H | T | P | LVM | - | - | ✓ | - | D
Map2Seq (Schumann and Riezler, 2021) | Outdoors | Pano | D | S | Multi Inst | L | H | H | P | LM | - | - | ✓ | - | D
RUN (Paz-Argaman and Tsarfaty, 2019) | Outdoors | Pano | D | S | Multi Inst | L | H | H | H | LM | - | - | ✓ | - | D
TEACh (Padmakumar et al., 2022) | Indoors | Sim | C | M | Freeform | H & L | H | H | H | LV | - | ✓ | ✓ | ✓ | D
DialFRED (Gao et al., 2022) | Indoors | Sim | C | M | Restricted | H & L | H | H+T | P | LV | - | ✓ | ✓ | ✓ | D
ALFRED (Shridhar et al., 2020) | Indoors | Sim | C | S | Multi Inst | H & L | H | H | P | LV | - | ✓ | ✓ | ✓ | D
HANNA (Nguyen and Daumé III, 2019) | Indoors | Pano | D | M | Multi Inst | H & L | H | H | P | LV | - | ✓ | ✓ | - | D
RobotSlang (Banerjee et al., 2020) | Indoors | Phy | C | M | Freeform | H & L | H | H | P | LV | - | - | ✓ | - | D
TtT and WtW (Ilyevsky et al., 2021) | Indoors | Phy | C | S | Restricted | H & L | H | H | P | LM | - | - | ✓ | - | D
Robo-VLN (Irshad et al., 2021) | Indoors | Pano | C | S | Multi Inst | L | H & L | H | P | LV | - | - | ✓ | - | C
VLN-CE (Krantz et al., 2020) | Indoors | Pano | C | S | Multi Inst | L | H | H | P | LV | - | - | ✓ | - | D
CVDN (Thomason et al., 2020) | Indoors | Pano | D | M | Restricted | L | H | H | H | LV | - | - | ✓ | - | D
R2R (Anderson et al., 2018) | Indoors | Pano | D | S | Multi Inst | L | H | H | P | LV | - | - | ✓ | - | D
Table 1: Comparison of language-conditioned task completion settings in terms of Environment Fidelity (Simulated, Panoramic, Physical), Environment Continuity (Discrete, Continuous), Turns of Communication (Single, Multiple), Communication Form (Freeform Dialogue, Restricted Dialogue, Multiple Instructions), Language Granularity (High: Goal, Low: Step/Movement), Control Granularity (High: Action, Low: Control), Language Collection (Human, Templated), Demonstration Collection (Human, Planner), Modalities (Language, Vision, Map, Speech), Instruction Type (Replanning, Adaptation, Navigation, Manipulation), and Action Space (Discrete, Continuous).
SDN challenges autonomous driving agents to navigate in continuous and dynamic environments, engage in situated communication with humans, and handle unexpected events on the fly. As an initial step, we developed the Temporally-Ordered Task Oriented Transformer (TOTO), a transformer-based baseline model for three tasks: (1) predicting dialogue moves from human utterances; (2) generating dialogue moves in response to humans; and (3) generating navigation actions towards the goal. We present our empirical results and discuss key challenges and opportunities.
To the best of our knowledge, this is the first effort on language communication under unexpected situations in autonomous vehicles. Our contributions are the following: (1) a novel, high-fidelity simulation platform, DOROTHIE, that can be used to create unexpected situations on the fly during human-agent communication; (2) a fine-grained benchmark, SDN, for continuous, dynamic, interactive navigation with sensorimotor-grounded dialogue; and (3) a transformer-based model for action prediction and decision-making that serves as a baseline for future development.
2 Related Work
Our work is most closely related to language-conditioned navigation tasks (Anderson et al., 1991; MacMahon et al., 2006; Paz-Argaman and Tsarfaty, 2019), and particularly to recent work on embodied agents that learn to navigate by following language instructions (Gu et al., 2022). Table 1 summarizes the comparison between our work and previous work. Below we highlight some key differences.
Replanning in Unexpected Situations. Most simulated environments assume that only the tasked agent can change the state of the world through navigation and/or manipulation. In outdoor settings, the agent operates in a highly dynamic environment where unexpected changes to the world can often occur due to, e.g., walking pedestrians, moving vehicles, lighting, and weather conditions. While previous studies have explored misleading (Roh et al., 2020) or perturbed (Lin et al., 2021) instructions, no prior work has looked into how language instructions can help agents adapt in these unexpected situations. To our knowledge, SDN is the first dataset where language is used to assist agents in replanning their goals, paths, and trajectories.
Free-Form Communication. Most prior work adopts either simple instruction following (Chen et al., 2019; Shridhar et al., 2020; Vasudevan et al., 2021) or restricted QA dialogue (Chai et al., 2018; Thomason et al., 2020; Gao et al., 2022) that only allows the agent to ask for help. Except for some recent work in human-robot dialogue (She and Chai, 2017; De Vries et al., 2018; Banerjee et al., 2020; Padmakumar et al., 2022), few efforts have supported fully free-form communication where agents can ask, propose, explain, and negotiate under ambiguity or confusion. To the best of our knowledge, SDN is the first benchmark to enable navigation in autonomous driving agents conditioned on free-form spoken dialogue.
[Figure 1 graphic: the Participant and the Co-Wizard exchange spoken utterances through speech recognizers and a dialogue interface; the Co-Wizard issues physical moves and an intended goal and trajectory to a local planner inside the simulated CARLA environment, which renders a first-person camera view and complete/partial aerial views. The Ad-Wizard, following a storyboard, injects task changes via a message interface and environment changes via an environment controller. The example exchange shows the participant instructing "Turn right and stop by Seven Eleven," the Co-Wizard acknowledging, an Ad-Wizard message ("This is Annabel, I've bought drinks for us already. Just come directly to my place."), the Co-Wizard noting it does not know where Annabel's house is, and the participant redirecting the agent ("Turn left and go to Annabel's house. ... You first make a left to Duffield Avenue.").]
Figure 1: An overview of the DOROTHIE design. We extend the traditional Wizard-of-Oz framework by introducing a pair of Wizards: Co-Wizard and Ad-Wizard. A human Participant is given a storyboard and is instructed to communicate with an autonomous vehicle to complete a set of tasks. The Co-Wizard controls the agent's behaviors and communicates with the human. The Ad-Wizard creates unexpected situations on the fly. The human and the Co-Wizard need to collaborate with each other to resolve these unexpected situations.
Continuous Navigation. In discrete navigation, agents take discrete actions, e.g., teleporting in a pre-defined grid world (De Vries et al., 2018) or on a navigation graph with sparsely sampled panoramas at each node (Chen et al., 2019; Vasudevan et al., 2021). More recently, researchers proposed a continuous navigation setting (Krantz et al., 2020; Hong et al., 2022) by converting discrete paths on navigation graphs into trajectories. Unfortunately, these agents are still limited to a discrete action space such as forward 0.25m. This becomes unnatural in outdoor settings because the default behaviour of outdoor driving agents (e.g., autonomous vehicles) is lane-following rather than staying still. We instead follow the settings of mobile robot navigation (Roh et al., 2020; Irshad et al., 2021), where the agents are controlled through a continuous action space with physics such as throttle and steering, leading to continuous control signals with long-range trajectories.
3 Dialogue On the ROad To Handle
Irregular Events (DOROTHIE) Simulator
Motivated by the wide availability of software simulations for autonomous vehicles (Rosique et al., 2019), we set up our experiment in CARLA (Dosovitskiy et al., 2017), a driving simulator for autonomous vehicles. We developed a novel framework, Dialogue On the ROad To Handle Irregular Events (DOROTHIE), shown in Figure 1, to study situated communication under unexpected situations based on the Wizard-of-Oz (WoZ) paradigm (Riek, 2012; Kawaguchi et al., 2004; Hansen et al., 2005; Eric et al., 2017). In WoZ, a human participant is typically instructed to interact with an autonomous agent to complete a set of tasks. The agent's behaviors, however, are controlled by a human "wizard" (i.e., a researcher). One important novelty of our framework is that it extends the traditional WoZ approach by introducing a pair of wizards. In our duo-wizard setup, a Co-Wizard controls the agent's behaviors and carries out language communication with the human participant to jointly achieve a goal, while an Ad-Wizard creates unexpected situations on the fly. The Co-Wizard and the participant need to resolve these unexpected situations as they arise.
3.1 Interface for Co-Wizard Activities
We found in pilot studies that a low-level, free-form controller is not desirable due to the poor quality of demonstrated trajectories and the high cognitive load on the Co-Wizard. In line with prior work (Roh et al., 2020; Codevilla et al., 2018; Mueller et al., 2018), we developed from these pilot studies a set of high-level physical actions for the Co-Wizard to control the vehicle. Each action is mapped to a rule-based local trajectory planner that generates a list of waypoints for the vehicle to drive through. The continuous control (steering, throttle, brake) of the vehicle is performed by a PID controller.
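To make this control stack concrete, the following is a minimal sketch of the two pieces just described: a PID loop over the speed error that yields throttle and brake, and a simple proportional steering rule toward the next waypoint. The gains, error terms, and helper names are illustrative assumptions, not the controller shipped with DOROTHIE.

```python
import math

class PIDLongitudinalController:
    """PID over the speed error -> throttle/brake commands (illustrative gains)."""
    def __init__(self, kp=1.0, ki=0.05, kd=0.1, dt=0.1):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, target_speed, current_speed):
        error = target_speed - current_speed
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        u = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Positive control effort maps to throttle, negative to brake.
        return max(0.0, min(1.0, u)), max(0.0, min(1.0, -u))

def steering_toward(waypoint_xy, vehicle_xy, vehicle_yaw_rad, k=1.0):
    """Proportional steering on the heading error to the next planned waypoint."""
    dx = waypoint_xy[0] - vehicle_xy[0]
    dy = waypoint_xy[1] - vehicle_xy[1]
    heading_error = math.atan2(dy, dx) - vehicle_yaw_rad
    # Wrap to [-pi, pi] so the error is the shortest rotation.
    heading_error = math.atan2(math.sin(heading_error), math.cos(heading_error))
    return max(-1.0, min(1.0, k * heading_error))

# Example: cruise toward a waypoint 10 m ahead at roughly 30 km/h (8.33 m/s).
pid = PIDLongitudinalController()
throttle, brake = pid.step(target_speed=8.33, current_speed=6.0)
steer = steering_toward((10.0, 2.0), (0.0, 0.0), vehicle_yaw_rad=0.0)
```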
In a complex navigation task with multiple subgoals, belief tracking over plans, goals, task status, and knowledge becomes crucial (Ma et al., 2012; Misu et al., 2014). Besides controlling the vehicle and communicating with the participant, the Co-Wizard also annotates intended actions (referred to as mental actions) during and after the interaction, e.g., by noting down the navigation plan by clicking junctions on the intended trajectory from the current position to the destination. The set of physical and mental actions is described in Table 2, and more implementation details are available in Appendix A.6.
Physical Actions | Args | Description
LaneFollow | - | Default behaviour; follow the current lane.
LaneSwitch | Angle (Rotation) | Switch to a neighboring lane.
JTurn | Angle (Rotation) | Turn to a connecting road at a junction.
UTurn | - | Make a U-turn to the opposite direction.
Stop | - | Brake the vehicle manually.
Start | - | Start the vehicle manually.
SpeedChange | Speed (±5) | Change the desired cruise speed by 5 km/h.
LightChange | Light State (On/Off) | Change the front light state.
Mental Actions | Args | Description
PlanUpdate | List[Junction ID] | Indicate the intended trajectory towards a destination.
GoalUpdate | List[Landmark] | Indicate the current goal as an intended landmark.
StatusUpdate | Tuple[Landmark, Status] | Indicate a change in task status.
KnowledgeUpdate | x, y | Guess the location of an unknown landmark.
Other | - | Other belief state updates.
Table 2: The space of primitive physical actions and
mental actions of the Co-Wizard.
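For a programmatic view of this action space, the sketch below encodes the primitives from Table 2 as simple Python types. The class and field names are our own illustration and may not match the schema used in the released SDN annotations.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple

class PhysicalAction(Enum):
    LANE_FOLLOW = "LaneFollow"
    LANE_SWITCH = "LaneSwitch"
    J_TURN = "JTurn"
    U_TURN = "UTurn"
    STOP = "Stop"
    START = "Start"
    SPEED_CHANGE = "SpeedChange"
    LIGHT_CHANGE = "LightChange"

@dataclass
class PhysicalActionEvent:
    action: PhysicalAction
    angle: Optional[float] = None      # rotation argument of LaneSwitch / JTurn
    speed_delta: Optional[int] = None  # +/-5 km/h for SpeedChange
    light_on: Optional[bool] = None    # light state for LightChange

@dataclass
class MentalActionEvent:
    kind: str                                            # PlanUpdate, GoalUpdate, StatusUpdate, ...
    junctions: List[int] = field(default_factory=list)   # PlanUpdate: intended junction sequence
    landmark: Optional[str] = None                       # GoalUpdate / StatusUpdate
    status: Optional[str] = None                         # StatusUpdate
    guess_xy: Optional[Tuple[float, float]] = None       # KnowledgeUpdate: guessed location

# Example: the Co-Wizard decides to turn right at the next junction.
turn = PhysicalActionEvent(PhysicalAction.J_TURN, angle=90.0)
```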
3.2 Interface for Ad-Wizard Activities
The Ad-Wizard is able to introduce environmental
exceptions and task exceptions.
Environmental Exceptions: Triggered by changes to the environment. These include direct environmental changes, which challenge the vehicle's perceptual processing and motivate participants to request adaptations without changing the plan or goal (e.g., driving slowly in foggy weather or turning the headlights on at night). Environmental exceptions can also be introduced by creating roadblocks, which motivate new plans by blocking the original ones (see the code sketch at the end of this subsection).
Task Exceptions: Brought about by changing the tasks specified in the storyboard, i.e., deleting, adding, or changing a landmark to visit. The Ad-Wizard sends a message to prompt the participant in the message interface with appropriate context, and modifies the task interface that specifies the landmarks to visit. Since the Co-Wizard does not have a task interface, the participant needs to communicate with the Co-Wizard in natural language to convey the status of a subgoal, especially when a change of the current subgoal is indicated by the Ad-Wizard.
The rich dynamics of the environment and tasks in DOROTHIE create uncertainty and ambiguity, which require the Co-Wizard to actively initiate conversation with the human partner and find a way to handle these unexpected situations collaboratively. More illustrated details of the Ad-Wizard interface are available in Figure 10 in Appendix A.7.
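As an illustration of how an environmental exception might be injected programmatically, the snippet below uses the public CARLA Python client to degrade visibility on the fly. It assumes a CARLA server running locally; the parameter values and the helper name are our own, and the actual Ad-Wizard interface may wrap such calls differently.

```python
import carla

def inject_fog_and_rain(host="localhost", port=2000, fog=60.0, rain=50.0):
    """Connect to a running CARLA server and worsen the weather mid-session."""
    client = carla.Client(host, port)
    client.set_timeout(5.0)
    world = client.get_world()

    weather = world.get_weather()   # start from the current conditions
    weather.fog_density = fog       # 0-100; heavier fog reduces visibility
    weather.precipitation = rain    # 0-100; rain intensity
    weather.wetness = rain          # wet road surface to match the rain
    world.set_weather(weather)

if __name__ == "__main__":
    inject_fog_and_rain()
```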
3.3 Data Collection
Using DOROTHIE, we recruited 40 naïve human subjects as participants for data collection. Each subject went through an average of 4.5 sessions. In each session, the subject was given a storyboard that required the agent to visit two to six landmarks/destinations. Storyboards were generated across four different towns, with all task templates, landmark locations, street names, and departure locations randomly shuffled. While shown the map, the Co-Wizard (an experimenter) did not have access to some of the destinations, e.g., the location of a friend's house or of a person to pick up. Such knowledge disparities motivate rich situated communication and challenge the agent to understand language instructions of different granularity. As the Co-Wizard and the human subject communicated with each other to achieve the goal, the Ad-Wizard (another experimenter) was tasked to create different types of unexpected events relevant to the current goal. The knowledge disparity and unexpected events together drive the communication. Details of the task setups are available in Appendix A.4.
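As a rough sketch of how such randomized storyboards could be generated, the snippet below shuffles towns, landmarks, and departure points, and hides a subset of destinations from the Co-Wizard to create the knowledge disparity described above. All names and the exact structure are invented for illustration and are not the templates used in the actual study.

```python
import random

TOWNS = ["Town01", "Town02", "Town03", "Town05"]            # town IDs, per Section 4.2
LANDMARKS = ["Seven Eleven", "KFC", "Shell", "Ikea",
             "Annabel's house", "the bank", "the school"]   # hypothetical landmark pool

def make_storyboard(seed=None, min_goals=2, max_goals=6):
    rng = random.Random(seed)
    town = rng.choice(TOWNS)
    n_goals = rng.randint(min_goals, max_goals)
    goals = rng.sample(LANDMARKS, n_goals)
    departure = rng.choice([lm for lm in LANDMARKS if lm not in goals])
    # Hide some destinations from the Co-Wizard to force the participant
    # to explain them in natural language during the session.
    hidden_from_wizard = rng.sample(goals, k=max(1, n_goals // 3))
    return {"town": town, "departure": departure,
            "goals": goals, "hidden_from_wizard": hidden_from_wizard}

if __name__ == "__main__":
    print(make_storyboard(seed=0))
```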
4 Situated Dialogue Navigation (SDN)
Our data collection effort has led to Situated Dialogue Navigation (SDN), a fine-grained outdoor navigation benchmark. Each session was replayed at 10 FPS following prior work (Roh et al., 2020) to obtain multi-faceted and time-synchronized information, e.g., a first-person view of the environment, speech input from the participant, discrete actions, a continuous trajectory, and control signals.
[Figure 2 graphic: an utterance is first classified as an initiation or a response. Initiations include Instruct (a command), Explain (a statement), queries for domain information (QueryYN for simple yes/no queries, QueryW for complex wh-queries), Check (confirm an inferable piece of domain information), and Align (confirm communication status). Responses include Acknowledge (indicate successful communication), Confuse (indicate confusion), Confirm (repeat the requested information), Clarify (reply with amplified information), the simple replies ReplyY, ReplyN, and ReplyU, and the complex wh-reply ReplyW.]
Figure 2: The coding scheme of dialogue moves as a decision tree. The leaf nodes of the decision tree specify the set of dialogue moves we used for annotation.
The benchmark also includes dialogue structure annotation, which we analyzed for dialogue behaviors.
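Because each session is replayed at a fixed 10 FPS, aligning an utterance with the other streams reduces to index arithmetic over timestamps. The helper below is a minimal sketch under that assumption, with hypothetical field names; it is not the benchmark's actual loading code.

```python
def frames_for_utterance(start_s, end_s, fps=10):
    """Indices of replay frames (at `fps` frames per second) that overlap an
    utterance, given its start/end time in seconds from the session start."""
    first = round(start_s * fps)
    last = round(end_s * fps)
    return list(range(first, last + 1))

def controls_during(utterance, control_stream, fps=10):
    """Slice the per-frame control records that co-occur with an utterance.
    `utterance` is assumed to carry 'start' and 'end' fields in seconds;
    `control_stream` is assumed to be a list indexed by frame number."""
    idx = frames_for_utterance(utterance["start"], utterance["end"], fps)
    return [control_stream[i] for i in idx if i < len(control_stream)]

# Example: an instruction spoken from 12.0 s to 14.5 s overlaps frames 120..145.
assert frames_for_utterance(12.0, 14.5) == list(range(120, 146))
```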
4.1 Dialogue Structure Annotation
Following prior work in human-robot dialogue (Marge et al., 2017; Traum et al., 2018; Marge et al., 2020) and dialogue discourse processing (Sinclair et al., 1975; Grosz and Sidner, 1986; Clark, 1996), we annotate each dialogue session using four levels of linguistic units:
Transaction Units (TUs): Sub-dialogues that start when a task is initiated and end when it is completed, interrupted, or abandoned.
Exchange Units (EUs): Sequences of dialogue moves towards common ground. These start with an initiating utterance that has a purpose (e.g., a question) and end when the expectations are fulfilled or abandoned (e.g., an answer).
Dialogue Moves: Sub-categories of dialogue acts that drive conversation and update domain-specific information state within an exchange.
Dialogue Slots: Parameters that further determine the semantics of dialogue moves, including Action, Street, Landmark, Status, and Object.
We follow the coding scheme of Carletta et al. (1997) to represent dialogue moves as a decision tree, with slight modifications to adjust to our domain, as presented in Figure 2. The 14 dialogue moves, together with Irrelevant, specify the space of conversational action in the human-vehicle dialogue. We present an example dialogue with annotations in Figure 4, with more samples available in Appendix B.4.
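To make the annotation hierarchy concrete, the sketch below encodes the 14 dialogue moves (plus Irrelevant) and nests slots, moves, exchanges, and transactions as simple Python types. The field names are ours and may differ from the released annotation schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List

class DialogueMove(Enum):
    # Initiating moves
    INSTRUCT = "Instruct"
    EXPLAIN = "Explain"
    QUERY_YN = "QueryYN"
    QUERY_W = "QueryW"
    CHECK = "Check"
    ALIGN = "Align"
    # Responding moves
    ACKNOWLEDGE = "Acknowledge"
    CONFUSE = "Confuse"
    CONFIRM = "Confirm"
    CLARIFY = "Clarify"
    REPLY_Y = "ReplyY"
    REPLY_N = "ReplyN"
    REPLY_U = "ReplyU"
    REPLY_W = "ReplyW"
    # Catch-all for utterances outside the scheme
    IRRELEVANT = "Irrelevant"

@dataclass
class AnnotatedMove:
    speaker: str                                          # "human" or "agent"
    move: DialogueMove
    slots: Dict[str, str] = field(default_factory=dict)   # Action, Street, Landmark, Status, Object

@dataclass
class ExchangeUnit:                                       # EU: moves toward common ground
    moves: List[AnnotatedMove] = field(default_factory=list)

@dataclass
class TransactionUnit:                                    # TU: sub-dialogue around one task
    exchanges: List[ExchangeUnit] = field(default_factory=list)

# Example: the human instructs a right turn onto a named street.
eu = ExchangeUnit([AnnotatedMove("human", DialogueMove.INSTRUCT,
                                 {"Action": "JTurn", "Street": "Duffield Avenue"})])
```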
4.2 Data Statistics
The dataset is split into training, validation, and test sets, and defines seen (Towns 1, 3, 5) and unseen (Town 2) sub-folds for validation and test. The SDN dataset captures rich dialogue behaviors between the human and the agent to collaboratively resolve unexpected situations and achieve joint goals. Figure 3a shows some basic statistics.
Metric | Value
Control Stream | 18.7 h
Trimmed Audio | 2.9 h
# Utterances | 8415
# Words | 50398
Vocabulary | 1373
# Transactions | 578
# Exchanges | 4089
# Dialogue Moves | 11623
# Slot Values | 8618
# Physical Actions | 9448
Fold (Split) | # Sessions
Train | 123
Val (Seen) | 14
Val (Unseen) | 6
Test (Seen) | 25
Test (Unseen) | 15
(a) Dataset statistics and split information.
(b) The distribution of dialogue moves and slots per TU.
Figure 3: Dataset description.
Figure 3b shows the frequencies of dialogue
moves and slots taken by the human and the agent
respectively. Not surprisingly, due to the nature of
the joint tasks, the human mostly instructs and the
agent constantly provides acknowledgement and
confirmation. Both the human and the agent ask
questions and give answers. The agent appears to
provide more explanation about its own behaviors
and decisions.
4.3 Dialogue Behaviors
The SDN dataset also demonstrates interesting and unique behaviors between partners in handling unexpected situations. In particular, we investigate the