DOROTHIE: Spoken Dialogue for Handling Unexpected Situations
in Interactive Autonomous Driving Agents
Ziqiao Ma1, Ben VanDerPloeg1, Cristian-Paul Bara1, Yidong Huang1,
Eui-In Kim1, Felix Gervits2, Matthew Marge2, Joyce Chai1
1University of Michigan 2U.S. Army Research Laboratory
{marstin,bensvdp,cpbara,owenhji,euiink,chaijy}@umich.edu
{felix.gervits,matthew.r.marge}.civ@army.mil
Abstract
In the real world, autonomous driving agents navigate in highly dynamic environments full of unexpected situations where pre-trained models are unreliable. In these situations, what is immediately available to vehicles is often only human operators. Empowering autonomous driving agents with the ability to navigate in a continuous and dynamic environment and to communicate with humans through sensorimotor-grounded dialogue becomes critical. To this end, we introduce Dialogue On the ROad To Handle Irregular Events (DOROTHIE), a novel interactive simulation platform that enables the creation of unexpected situations on the fly to support empirical studies on situated communication with autonomous driving agents. Based on this platform, we created Situated Dialogue Navigation (SDN), a navigation benchmark of 183 trials with a total of 8415 utterances, around 18.7 hours of control streams, and 2.9 hours of trimmed audio. SDN is developed to evaluate the agent's ability to predict dialogue moves from humans as well as to generate its own dialogue moves and physical navigation actions. We further developed a transformer-based baseline model for these SDN tasks. Our empirical results indicate that language-guided navigation in a highly dynamic environment is an extremely difficult task for end-to-end models. These results provide insight towards future work on robust autonomous driving agents.1
Equal contribution.
Work done prior to joining Amazon Alexa AI.
1 The DOROTHIE platform, SDN benchmark, and code for the baseline model are available at https://github.com/sled-group/DOROTHIE
1 Introduction
In embodied agents such as autonomous vehicles (AVs), highly dynamic environments often lead to unexpected situations, such as challenging environmental conditions (e.g., caused by weather, light, or obstacles), the influence of other agents, and change of the original goals. In these situations, the agent's
pre-trained models or existing knowledge may not be adequate or reliable to make an appropriate decision. What is immediately available to help the agent is often only human partners (Ramachandran et al., 2013). As they are not programmers who can readily change the code in the field, approaches that enable natural communication and collaboration between humans and autonomy become critical (Spiliotopoulos et al., 2001; Weng et al., 2016). Although recent years have seen an increasing amount of work on natural language communication with robots, especially the many benchmarks developed for navigation by instruction following (Roh et al., 2020; Vasudevan et al., 2021; Shridhar et al., 2020; Padmakumar et al., 2022), little work has studied language communication under unexpected situations, particularly in the context of AVs.
To address this limitation, we have developed Dialogue On the ROad To Handle Irregular Events (DOROTHIE), an interactive simulation platform built upon the CARLA simulator (Dosovitskiy et al., 2017) to specifically target unexpected situations. The DOROTHIE simulator supports Wizard-of-Oz (WoZ) studies through a novel duo-wizard setup: a collaborative wizard (Co-Wizard) that collaborates with the human to accomplish the tasks, and an adversarial wizard (Ad-Wizard) that generates unexpected situations (e.g., creating road obstacles, changing weather conditions, adding/changing goals) on the fly. Using DOROTHIE, we collected the Situated Dialogue Navigation (SDN) dataset of 183 trials between a Co-Wizard and human subjects collaboratively resolving unexpected situations and completing navigation tasks through spoken dialogue.
The SDN dataset contains multi-faceted and time-synchronized information (e.g., a first-person view of the environment, speech input from the human, discrete actions, continuous trajectories, and control signals) as well as fine-grained annotation of dialogue phenomena at multiple levels.
Name | Domain | Env. Fidelity | Env. Continuity | Comm. Turn | Comm. Form | Lang. Gran. | Control Gran. | Lang. Coll. | Demo. Coll. | Modalities | Replan. | Adp. | Nav. | Man. | Action Space
SDN (Ours) | Outdoors | Sim | C | M | Freeform | H & L | H & L | H | H | LVMS | ✓ | ✓ | ✓ | - | D & C
CDNLI (Roh et al., 2020) | Outdoors | Sim | C | M | Multi Inst | L | H & L | H+T | P | LVM | - | ✓ | ✓ | - | D & C
LCSD (Sriram et al., 2019) | Outdoors | Sim | C | S | Multi Inst | L | H | H | P | LVM | - | - | ✓ | - | D
TtW (De Vries et al., 2018) | Outdoors | Pano | D | M | Freeform | H & L | H | H | H | LVM | - | - | ✓ | - | D
Talk2Nav (Vasudevan et al., 2021) | Outdoors | Pano | D | S | Multi Inst | L | H | H | P | LVM | - | - | ✓ | - | D
TouchDown (Chen et al., 2019) | Outdoors | Pano | D | S | Multi Inst | L | H | H | P | LVM | - | - | ✓ | - | D
Street Nav (Hermann et al., 2020) | Outdoors | Pano | D | M | Multi Inst | L | H | T | P | LVM | - | - | ✓ | - | D
Map2Seq (Schumann and Riezler, 2021) | Outdoors | Pano | D | S | Multi Inst | L | H | H | P | LM | - | - | ✓ | - | D
RUN (Paz-Argaman and Tsarfaty, 2019) | Outdoors | Pano | D | S | Multi Inst | L | H | H | H | LM | - | - | ✓ | - | D
TEACh (Padmakumar et al., 2022) | Indoors | Sim | C | M | Freeform | H & L | H | H | H | LV | - | ✓ | ✓ | ✓ | D
DialFRED (Gao et al., 2022) | Indoors | Sim | C | M | Restricted | H & L | H | H+T | P | LV | - | ✓ | ✓ | ✓ | D
ALFRED (Shridhar et al., 2020) | Indoors | Sim | C | S | Multi Inst | H & L | H | H | P | LV | - | ✓ | ✓ | ✓ | D
HANNA (Nguyen and Daumé III, 2019) | Indoors | Pano | D | M | Multi Inst | H & L | H | H | P | LV | - | ✓ | ✓ | - | D
RobotSlang (Banerjee et al., 2020) | Indoors | Phy | C | M | Freeform | H & L | H | H | P | LV | - | - | ✓ | - | D
TtT and WtW (Ilyevsky et al., 2021) | Indoors | Phy | C | S | Restricted | H & L | H | H | P | LM | - | - | ✓ | - | D
Robo-VLN (Irshad et al., 2021) | Indoors | Pano | C | S | Multi Inst | L | H & L | H | P | LV | - | - | ✓ | - | C
VLN-CE (Krantz et al., 2020) | Indoors | Pano | C | S | Multi Inst | L | H | H | P | LV | - | - | ✓ | - | D
CVDN (Thomason et al., 2020) | Indoors | Pano | D | M | Restricted | L | H | H | H | LV | - | - | ✓ | - | D
R2R (Anderson et al., 2018) | Indoors | Pano | D | S | Multi Inst | L | H | H | P | LV | - | - | ✓ | - | D
Table 1: Comparison of language-conditioned task completion settings in terms of Environment Fidelity (Simulated, Panoramic, Physical), Environment Continuity (Discrete, Continuous), Turns of Communication (Single, Multiple), Communication Form (Freeform Dialogue, Restricted Dialogue, Multiple Instructions), Language Granularity (High: Goal, Low: Step/Movement), Control Granularity (High: Action, Low: Control), Language Collection (Human, Templated), Demonstration Collection (Human, Planner), Modalities (Language, Vision, Map, Speech), Instruction Type (Replanning, Adaptation, Navigation, Manipulation), and Action Space (Discrete, Continuous).
SDN challenges autonomous driving agents to navigate in continuous and dynamic environments, engage in situated communication with humans, and handle unexpected events on the fly. As an initial step, we developed the Temporally-Ordered Task Oriented Transformer (TOTO), a transformer-based baseline model for three tasks: (1) predicting dialogue moves from human utterances; (2) generating dialogue moves in response to humans; and (3) generating navigation actions towards the goal. We present our empirical results and discuss key challenges and opportunities.
To the best of our knowledge, this is the first effort on language communication under unexpected situations in autonomous vehicles. Our contributions are the following: (1) a novel, high-fidelity simulation platform, DOROTHIE, that can be used to create unexpected situations on the fly during human-agent communication; (2) a fine-grained benchmark, SDN, for continuous, dynamic, interactive navigation with sensorimotor-grounded dialogue; and (3) a transformer-based model for action prediction and decision-making that serves as a baseline for future development.
2 Related Work
Our work is most closely related to language-conditioned navigation tasks (Anderson et al., 1991; MacMahon et al., 2006; Paz-Argaman and Tsarfaty, 2019), and particularly to recent work on embodied agents that learn to navigate by following language instructions (Gu et al., 2022). Table 1 summarizes the comparison between our work and previous work. Below we highlight some key differences.
Replanning in Unexpected Situations. Most simulated environments assume that only the tasked agent can change the state of the world through navigation and/or manipulation. In outdoor settings, the agent operates in a highly dynamic environment where unexpected changes to the world can often occur due to, e.g., walking pedestrians, moving vehicles, lighting, and weather conditions. While previous studies have explored misleading (Roh et al., 2020) or perturbed (Lin et al., 2021) instructions, no prior work has looked into how language instructions can help agents adapt in these unexpected situations. To our knowledge, SDN is the first dataset where language is used to assist agents in replanning their goals, paths, and trajectories.
Free-Form Communication. Most prior work adopts either simple instruction following (Chen et al., 2019; Shridhar et al., 2020; Vasudevan et al., 2021) or restricted QA dialogue (Chai et al., 2018; Thomason et al., 2020; Gao et al., 2022) that only allows the agent to ask for help. Except for some recent work in human-robot dialogue (She and Chai, 2017; De Vries et al., 2018; Banerjee et al., 2020; Padmakumar et al., 2022), few efforts have supported fully free-form communication where agents can ask, propose, explain, and negotiate under ambiguity or confusion. To the best of our knowledge, SDN is the first benchmark to enable navigation in autonomous driving agents conditioned on free-form spoken dialogue.
[Figure 1 graphic: the Participant and the Co-Wizard exchange spoken utterances through speech recognizers and a dialogue interface; the Co-Wizard issues physical moves and an intended goal and trajectory to a local planner inside the simulated CARLA environment, which renders a first-person camera view and complete/partial aerial views. The Ad-Wizard, following a storyboard, injects task changes via a message interface and environment changes via an environment controller. The example exchange shows the participant instructing "Turn right and stop by Seven Eleven," the Co-Wizard acknowledging, an Ad-Wizard message ("This is Annabel, I've bought drinks for us already. Just come directly to my place."), the Co-Wizard noting it does not know where Annabel's house is, and the participant redirecting the agent ("Turn left and go to Annabel's house. ... You first make a left to Duffield Avenue.").]
Figure 1: An overview of the DOROTHIE design. We extend the traditional Wizard-of-Oz framework by introducing a pair of Wizards: Co-Wizard and Ad-Wizard. A human Participant is given a storyboard and is instructed to communicate with an autonomous vehicle to complete a set of tasks. The Co-Wizard controls the agent's behaviors and communicates with the human. The Ad-Wizard creates unexpected situations on the fly. The human and the Co-Wizard need to collaborate with each other to resolve these unexpected situations.
Continuous Navigation. In discrete navigation, agents take discrete actions, e.g., teleporting in a pre-defined grid world (De Vries et al., 2018) or on a navigation graph with sparsely sampled panoramas at each node (Chen et al., 2019; Vasudevan et al., 2021). More recently, researchers proposed a continuous navigation setting (Krantz et al., 2020; Hong et al., 2022) by converting discrete paths on navigation graphs into trajectories. Unfortunately, these agents are still limited to a discrete action space such as forward 0.25m. This becomes unnatural in outdoor settings because the default behaviour of outdoor driving agents (e.g., autonomous vehicles) is lane-following rather than staying still. We instead follow the settings of mobile robot navigation (Roh et al., 2020; Irshad et al., 2021), where the agents are controlled through a continuous action space with physics such as throttle and steering, leading to continuous control signals with long-range trajectories.
3 Dialogue On the ROad To Handle
Irregular Events (DOROTHIE) Simulator
Motivated by the wide availability of software simulations for autonomous vehicles (Rosique et al., 2019), we set up our experiment in CARLA (Dosovitskiy et al., 2017), a driving simulator for autonomous vehicles. We developed a novel framework, Dialogue On the ROad To Handle Irregular Events (DOROTHIE), shown in Figure 1, to study situated communication under unexpected situations based on the Wizard-of-Oz (WoZ) paradigm (Riek, 2012; Kawaguchi et al., 2004; Hansen et al., 2005; Eric et al., 2017). In WoZ, a human participant is typically instructed to interact with an autonomous agent to complete a set of tasks. The agent's behaviors, however, are controlled by a human "wizard" (i.e., a researcher). One important novelty of our framework is that it extends the traditional WoZ approach by introducing a pair of wizards. In our duo-wizard setup, a Co-Wizard controls the agent's behaviors and carries out language communication with the human participant to jointly achieve a goal, while an Ad-Wizard creates unexpected situations on the fly. The Co-Wizard and the participant need to resolve these unexpected situations as they arise.
3.1 Interface for Co-Wizard Activities
We found in pilot studies that a low-level, free-form controller is not desirable due to the poor quality of demonstrated trajectories and the high cognitive load on the Co-Wizard. In line with prior work (Roh et al., 2020; Codevilla et al., 2018; Mueller et al., 2018), we developed from these pilot studies a set of high-level physical actions for the Co-Wizard to control the vehicle. Each action is mapped to a rule-based local trajectory planner that generates a list of waypoints for the vehicle to drive through. The continuous control (steering, throttle, brake) of the vehicle is performed by a PID controller.
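To make this control stack concrete, the following is a minimal sketch of the two pieces just described: a PID loop over the speed error that yields throttle and brake, and a simple proportional steering rule toward the next waypoint. The gains, error terms, and helper names are illustrative assumptions, not the controller shipped with DOROTHIE.

```python
import math

class PIDLongitudinalController:
    """PID over the speed error -> throttle/brake commands (illustrative gains)."""
    def __init__(self, kp=1.0, ki=0.05, kd=0.1, dt=0.1):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, target_speed, current_speed):
        error = target_speed - current_speed
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        u = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Positive control effort maps to throttle, negative to brake.
        return max(0.0, min(1.0, u)), max(0.0, min(1.0, -u))

def steering_toward(waypoint_xy, vehicle_xy, vehicle_yaw_rad, k=1.0):
    """Proportional steering on the heading error to the next planned waypoint."""
    dx = waypoint_xy[0] - vehicle_xy[0]
    dy = waypoint_xy[1] - vehicle_xy[1]
    heading_error = math.atan2(dy, dx) - vehicle_yaw_rad
    # Wrap to [-pi, pi] so the error is the shortest rotation.
    heading_error = math.atan2(math.sin(heading_error), math.cos(heading_error))
    return max(-1.0, min(1.0, k * heading_error))

# Example: cruise toward a waypoint 10 m ahead at roughly 30 km/h (8.33 m/s).
pid = PIDLongitudinalController()
throttle, brake = pid.step(target_speed=8.33, current_speed=6.0)
steer = steering_toward((10.0, 2.0), (0.0, 0.0), vehicle_yaw_rad=0.0)
```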
In a complex navigation task with multiple subgoals, belief tracking over plans, goals, task status, and knowledge becomes crucial (Ma et al., 2012; Misu et al., 2014). Besides controlling the vehicle and communicating with the participant, the Co-Wizard also annotates intended actions (referred to as mental actions) during and after the interaction, e.g., by noting down the navigation plan by clicking junctions on the intended trajectory from the current position to the destination. The set of physical and mental actions is described in Table 2, and more implementation details are available in Appendix A.6.
Physical Actions | Args | Description
LaneFollow | - | Default behaviour; follow the current lane.
LaneSwitch | Angle (Rotation) | Switch to a neighboring lane.
JTurn | Angle (Rotation) | Turn to a connecting road at a junction.
UTurn | - | Make a U-turn to the opposite direction.
Stop | - | Brake the vehicle manually.
Start | - | Start the vehicle manually.
SpeedChange | Speed (±5) | Change the desired cruise speed by 5 km/h.
LightChange | Light State (On/Off) | Change the front light state.
Mental Actions | Args | Description
PlanUpdate | List[Junction ID] | Indicate the intended trajectory towards a destination.
GoalUpdate | List[Landmark] | Indicate the current goal as an intended landmark.
StatusUpdate | Tuple[Landmark, Status] | Indicate a change in task status.
KnowledgeUpdate | x, y | Guess the location of an unknown landmark.
Other | - | Other belief state updates.
Table 2: The space of primitive physical actions and
mental actions of the Co-Wizard.
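For a programmatic view of this action space, the sketch below encodes the primitives from Table 2 as simple Python types. The class and field names are our own illustration and may not match the schema used in the released SDN annotations.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple

class PhysicalAction(Enum):
    LANE_FOLLOW = "LaneFollow"
    LANE_SWITCH = "LaneSwitch"
    J_TURN = "JTurn"
    U_TURN = "UTurn"
    STOP = "Stop"
    START = "Start"
    SPEED_CHANGE = "SpeedChange"
    LIGHT_CHANGE = "LightChange"

@dataclass
class PhysicalActionEvent:
    action: PhysicalAction
    angle: Optional[float] = None      # rotation argument of LaneSwitch / JTurn
    speed_delta: Optional[int] = None  # +/-5 km/h for SpeedChange
    light_on: Optional[bool] = None    # light state for LightChange

@dataclass
class MentalActionEvent:
    kind: str                                            # PlanUpdate, GoalUpdate, StatusUpdate, ...
    junctions: List[int] = field(default_factory=list)   # PlanUpdate: intended junction sequence
    landmark: Optional[str] = None                       # GoalUpdate / StatusUpdate
    status: Optional[str] = None                         # StatusUpdate
    guess_xy: Optional[Tuple[float, float]] = None       # KnowledgeUpdate: guessed location

# Example: the Co-Wizard decides to turn right at the next junction.
turn = PhysicalActionEvent(PhysicalAction.J_TURN, angle=90.0)
```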
3.2 Interface for Ad-Wizard Activities
The Ad-Wizard is able to introduce environmental
exceptions and task exceptions.
Environmental Exceptions: Triggered by changes to the environment. These include direct environmental changes, which challenge the vehicle's perceptual processing and motivate participants to request adaptations without changing the plan or goal (e.g., driving slowly in foggy weather or turning the headlights on at night). Environmental exceptions can also be introduced by creating roadblocks, which motivate new plans by blocking the original ones (see the code sketch at the end of this subsection).
Task Exceptions: Brought about by changing the tasks specified in the storyboard, i.e., deleting, adding, or changing a landmark to visit. The Ad-Wizard sends a message to prompt the participant in the message interface with appropriate context, and modifies the task interface that specifies the landmarks to visit. Since the Co-Wizard does not have a task interface, the participant needs to communicate with the Co-Wizard in natural language to convey the status of a subgoal, especially when a change of the current subgoal is indicated by the Ad-Wizard.
The rich dynamics of the environment and tasks in DOROTHIE create uncertainty and ambiguity, which require the Co-Wizard to actively initiate conversation with the human partner and find a way to handle these unexpected situations collaboratively. More illustrated details of the Ad-Wizard interface are available in Figure 10 in Appendix A.7.
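As an illustration of how an environmental exception might be injected programmatically, the snippet below uses the public CARLA Python client to degrade visibility on the fly. It assumes a CARLA server running locally; the parameter values and the helper name are our own, and the actual Ad-Wizard interface may wrap such calls differently.

```python
import carla

def inject_fog_and_rain(host="localhost", port=2000, fog=60.0, rain=50.0):
    """Connect to a running CARLA server and worsen the weather mid-session."""
    client = carla.Client(host, port)
    client.set_timeout(5.0)
    world = client.get_world()

    weather = world.get_weather()   # start from the current conditions
    weather.fog_density = fog       # 0-100; heavier fog reduces visibility
    weather.precipitation = rain    # 0-100; rain intensity
    weather.wetness = rain          # wet road surface to match the rain
    world.set_weather(weather)

if __name__ == "__main__":
    inject_fog_and_rain()
```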
3.3 Data Collection
Using DOROTHIE, we recruited 40 naïve human subjects as participants for data collection. Each subject went through an average of 4.5 sessions. In each session, the subject was given a storyboard that required the agent to visit two to six landmarks/destinations. Storyboards were generated across four different towns, with all task templates, landmark locations, street names, and departure locations randomly shuffled. While shown the map, the Co-Wizard (an experimenter) did not have access to some of the destinations, e.g., the location of a friend's house or of a person to pick up. Such knowledge disparities motivate rich situated communication and challenge the agent to understand language instructions of different granularity. As the Co-Wizard and the human subject communicated with each other to achieve the goal, the Ad-Wizard (another experimenter) was tasked to create different types of unexpected events relevant to the current goal. The knowledge disparity and unexpected events together drive the communication. Details of the task setups are available in Appendix A.4.
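As a rough sketch of how such randomized storyboards could be generated, the snippet below shuffles towns, landmarks, and departure points, and hides a subset of destinations from the Co-Wizard to create the knowledge disparity described above. All names and the exact structure are invented for illustration and are not the templates used in the actual study.

```python
import random

TOWNS = ["Town01", "Town02", "Town03", "Town05"]            # town IDs, per Section 4.2
LANDMARKS = ["Seven Eleven", "KFC", "Shell", "Ikea",
             "Annabel's house", "the bank", "the school"]   # hypothetical landmark pool

def make_storyboard(seed=None, min_goals=2, max_goals=6):
    rng = random.Random(seed)
    town = rng.choice(TOWNS)
    n_goals = rng.randint(min_goals, max_goals)
    goals = rng.sample(LANDMARKS, n_goals)
    departure = rng.choice([lm for lm in LANDMARKS if lm not in goals])
    # Hide some destinations from the Co-Wizard to force the participant
    # to explain them in natural language during the session.
    hidden_from_wizard = rng.sample(goals, k=max(1, n_goals // 3))
    return {"town": town, "departure": departure,
            "goals": goals, "hidden_from_wizard": hidden_from_wizard}

if __name__ == "__main__":
    print(make_storyboard(seed=0))
```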
4 Situated Dialogue Navigation (SDN)
Our data collection effort has led to Situated Dialogue Navigation (SDN), a fine-grained outdoor navigation benchmark. Each session was replayed at 10 FPS following prior work (Roh et al., 2020) to obtain multi-faceted and time-synchronized information, e.g., a first-person view of the environment, speech input from the participant, discrete actions, a continuous trajectory, and control signals.
[Figure 2 graphic: an utterance is first classified as an initiation or a response. Initiations include Instruct (a command), Explain (a statement), queries for domain information (QueryYN for simple yes/no queries, QueryW for complex wh-queries), Check (confirm an inferable piece of domain information), and Align (confirm communication status). Responses include Acknowledge (indicate successful communication), Confuse (indicate confusion), Confirm (repeat the requested information), Clarify (reply with amplified information), the simple replies ReplyY, ReplyN, and ReplyU, and the complex wh-reply ReplyW.]
Figure 2: The coding scheme of dialogue moves as a decision tree. The leaf nodes of the decision tree specify the set of dialogue moves we used for annotation.
The benchmark also includes dialogue structure annotation, which we analyzed for dialogue behaviors.
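Because each session is replayed at a fixed 10 FPS, aligning an utterance with the other streams reduces to index arithmetic over timestamps. The helper below is a minimal sketch under that assumption, with hypothetical field names; it is not the benchmark's actual loading code.

```python
def frames_for_utterance(start_s, end_s, fps=10):
    """Indices of replay frames (at `fps` frames per second) that overlap an
    utterance, given its start/end time in seconds from the session start."""
    first = round(start_s * fps)
    last = round(end_s * fps)
    return list(range(first, last + 1))

def controls_during(utterance, control_stream, fps=10):
    """Slice the per-frame control records that co-occur with an utterance.
    `utterance` is assumed to carry 'start' and 'end' fields in seconds;
    `control_stream` is assumed to be a list indexed by frame number."""
    idx = frames_for_utterance(utterance["start"], utterance["end"], fps)
    return [control_stream[i] for i in idx if i < len(control_stream)]

# Example: an instruction spoken from 12.0 s to 14.5 s overlaps frames 120..145.
assert frames_for_utterance(12.0, 14.5) == list(range(120, 146))
```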
4.1 Dialogue Structure Annotation
Following prior work in human-robot dialogue (Marge et al., 2017; Traum et al., 2018; Marge et al., 2020) and dialogue discourse processing (Sinclair et al., 1975; Grosz and Sidner, 1986; Clark, 1996), we annotate each dialogue session using four levels of linguistic units:
Transaction Units (TUs): Sub-dialogues that start when a task is initiated and end when it is completed, interrupted, or abandoned.
Exchange Units (EUs): Sequences of dialogue moves towards common ground. These start with an initiating utterance that has a purpose (e.g., a question) and end when the expectations are fulfilled or abandoned (e.g., an answer).
Dialogue Moves: Sub-categories of dialogue acts that drive conversation and update domain-specific information state within an exchange.
Dialogue Slots: Parameters that further determine the semantics of dialogue moves, including Action, Street, Landmark, Status, and Object.
We follow the coding scheme of Carletta et al. (1997) to represent dialogue moves as a decision tree, with slight modifications to adjust to our domain, as presented in Figure 2. The 14 dialogue moves, together with Irrelevant, specify the space of conversational action in the human-vehicle dialogue. We present an example dialogue with annotations in Figure 4, with more samples available in Appendix B.4.
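To make the annotation hierarchy concrete, the sketch below encodes the 14 dialogue moves (plus Irrelevant) and nests slots, moves, exchanges, and transactions as simple Python types. The field names are ours and may differ from the released annotation schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List

class DialogueMove(Enum):
    # Initiating moves
    INSTRUCT = "Instruct"
    EXPLAIN = "Explain"
    QUERY_YN = "QueryYN"
    QUERY_W = "QueryW"
    CHECK = "Check"
    ALIGN = "Align"
    # Responding moves
    ACKNOWLEDGE = "Acknowledge"
    CONFUSE = "Confuse"
    CONFIRM = "Confirm"
    CLARIFY = "Clarify"
    REPLY_Y = "ReplyY"
    REPLY_N = "ReplyN"
    REPLY_U = "ReplyU"
    REPLY_W = "ReplyW"
    # Catch-all for utterances outside the scheme
    IRRELEVANT = "Irrelevant"

@dataclass
class AnnotatedMove:
    speaker: str                                          # "human" or "agent"
    move: DialogueMove
    slots: Dict[str, str] = field(default_factory=dict)   # Action, Street, Landmark, Status, Object

@dataclass
class ExchangeUnit:                                       # EU: moves toward common ground
    moves: List[AnnotatedMove] = field(default_factory=list)

@dataclass
class TransactionUnit:                                    # TU: sub-dialogue around one task
    exchanges: List[ExchangeUnit] = field(default_factory=list)

# Example: the human instructs a right turn onto a named street.
eu = ExchangeUnit([AnnotatedMove("human", DialogueMove.INSTRUCT,
                                 {"Action": "JTurn", "Street": "Duffield Avenue"})])
```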
4.2 Data Statistics
The dataset is split into training, validation, and test sets, and defines seen (Towns 1, 3, 5) and unseen (Town 2) sub-folds for validation and test. The SDN dataset captures rich dialogue behaviors between the human and the agent to collaboratively resolve unexpected situations and achieve joint goals. Figure 3a shows some basic statistics.
Metric | Value
Control Stream | 18.7 h
Trimmed Audio | 2.9 h
# Utterances | 8415
# Words | 50398
Vocabulary | 1373
# Transactions | 578
# Exchanges | 4089
# Dialogue Moves | 11623
# Slot Values | 8618
# Physical Actions | 9448
Fold (Split) | # Sessions
Train | 123
Val (Seen) | 14
Val (Unseen) | 6
Test (Seen) | 25
Test (Unseen) | 15
(a) Dataset statistics and split information.
(b) The distribution of dialogue moves and slots per TU.
Figure 3: Dataset description.
Figure 3b shows the frequencies of dialogue
moves and slots taken by the human and the agent
respectively. Not surprisingly, due to the nature of
the joint tasks, the human mostly instructs and the
agent constantly provides acknowledgement and
confirmation. Both the human and the agent ask
questions and give answers. The agent appears to
provide more explanation about its own behaviors
and decisions.
4.3 Dialogue Behaviors
The SDN dataset also demonstrates interesting and unique behaviors between partners in handling unexpected situations. In particular, we investigate the