
Annotated Dialogue
User: I am looking for a dining place that serves Indian food.
System: There are 5 Indian restaurants. What price range do you prefer?
User: Expensive would be great.

Traditional Evaluation
User: I am looking for a dining place that serves Indian food.
System: There are 5 Indian restaurants. What area of town would you like to dine in?
User: Expensive would be great.

Interactive Evaluation
User: I am looking for an Indian restaurant in the expensive price range.
System: There are 5 Indian restaurants. Is there a particular area you would like to dine in?
User: No, I don't care. I would like to book a table for 6 people at 13:15 on Friday.

Figure 1: Illustration of interactions between users and systems. Traditional evaluation may face a policy mismatch between the utterances marked in red (in the Traditional Evaluation panel, the system asks about the area while the pre-annotated user utterance answers about the price range).
tions with the generated responses instead of using static user utterances that have been annotated in advance. Therefore, during evaluation, user utterances respond to the generated responses, which avoids the mismatch between stale user utterances and generated responses. However, in interactive evaluations the quality of the generated text cannot be measured with traditional BLEU, since no oracle references are available. To better evaluate the performance of dialogue systems, we introduce two automatic scores that assess response quality at both the sentence level and the session level: the sentence-level score measures sentence fluency, and the session-level score measures the coherence between turns within a dialogue session. These scores can also be applied to traditional evaluation methods and to the annotated dataset as a meta-evaluation, allowing us to examine the importance of using user simulators to construct interactive evaluations.
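As a concrete illustration only (a minimal sketch, not the exact formulation used in this work), such sentence-level and session-level scores can be instantiated with a GPT-2 language model: fluency as the negative per-token log-likelihood of a response on its own, and coherence as the negative log-likelihood of each turn conditioned on its dialogue history. The choice of scorer and the normalization below are our assumptions for illustration.

# Minimal sketch of sentence-level (fluency) and session-level (coherence)
# scoring with a GPT-2 language model; the concrete scorers and
# normalization in the paper may differ.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_score(response: str) -> float:
    """Fluency proxy: negative mean token NLL of the response alone."""
    ids = tokenizer(response, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL per token
    return -loss.item()

def session_score(turns: list) -> float:
    """Coherence proxy: negative mean NLL of each turn given its history."""
    history_ids = None
    nlls = []
    for turn in turns:
        turn_ids = tokenizer(turn + tokenizer.eos_token,
                             return_tensors="pt").input_ids
        if history_ids is None:
            ids, labels = turn_ids, turn_ids.clone()
        else:
            ids = torch.cat([history_ids, turn_ids], dim=1)
            labels = ids.clone()
            labels[:, : history_ids.size(1)] = -100  # score only the current turn
        with torch.no_grad():
            nlls.append(model(ids, labels=labels).loss.item())
        history_ids = ids
    return -sum(nlls) / len(nlls)

Higher values indicate more fluent sentences and more coherent sessions; because both quantities are reference-free, they remain computable in the interactive setting where no oracle responses exist.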
We conduct experiments on the MultiWOZ dataset (Budzianowski et al., 2018) with pre-trained models, using our proposed simulator and scores to run interactive evaluations. Experimental results show that interactive evaluation can achieve over 98% inform and success rates, indicating that the bottleneck of TOD performance is the lack of proper evaluation methods. The proposed scores show that our simulator helps achieve promising evaluation results within the interactive evaluation framework. We also examine RL-based models and find, using the proposed scores, that RL methods may hurt response quality in order to achieve high success rates.
In summary, our contributions are:
(A) We construct an evaluation framework that avoids the policy mismatch problem in TOD.
(B) We build a strong user simulator for TOD systems that can be used in both TOD training and evaluation.
(C) Experimental results show the importance of our proposed simulator and evaluation framework and provide hints for future TOD system development; our code is publicly available.
2 Related Work
2.1 Task-Oriented Dialogue Systems
Task-oriented dialogue systems aim to accomplish users' goals, such as booking hotels or flights (Wen et al., 2017; Eric et al., 2017). With the widespread use of pre-trained models (Qiu et al., 2020), end-to-end TOD systems built on pre-trained models have become increasingly popular: Hosseini-Asl et al. (2020) fine-tune all subtasks of TOD with multi-task learning on a single pre-trained model. Yang et al. (2021) encode the results of intermediate subtasks, such as belief states and system actions, into the dialogue history to boost response generation. Su et al. (2021) and He et al. (2021) use additional dialogue corpora to further pre-train the language model and then fine-tune it on the MultiWOZ dataset. Lee (2021) introduces an auxiliary task based on T5 models (Raffel et al., 2020) and achieves state-of-the-art performance without further pre-training.
2.2 Automatic Evaluations
Recent trends leverage neural models to automatically evaluate generated text from different perspectives. Automatic evaluation methods can target specific aspects of specific tasks, such as factuality checking in text summarization (Kryscinski et al., 2020), learned metrics that improve over BLEU in machine translation (Sellam et al., 2020), and coherence in dialogue systems (Tao et al., 2018; Pang et al., 2020). With pre-trained models, the quality of text generation can be measured by evaluation methods such as BERTScore (Zhang et al., 2020a) and BARTScore (Yuan et al., 2021). With properly designed neural scores, the performance of dialogue systems can be evaluated more accurately.
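As an illustration of such reference-based metrics (not a metric proposed in this work), a generated response can be scored against an annotated reference with the public bert_score package; the example utterances below are taken from Figure 1 and are only for demonstration.

# Illustrative use of BERTScore on a dialogue response pair.
from bert_score import score

candidates = ["There are 5 Indian restaurants. Is there a particular area you would like to dine in?"]
references = ["There are 5 Indian restaurants. What area of town would you like to dine in?"]

# Returns precision, recall, and F1 tensors, one entry per candidate-reference pair.
P, R, F1 = score(candidates, references, lang="en", rescale_with_baseline=True)
print(f"BERTScore F1: {F1.mean().item():.4f}")

Note that metrics of this kind still require oracle references, which is precisely what interactive evaluation lacks; this motivates the reference-free scores introduced above.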
2.3 User Simulators
User simulators are designed to simulate users’ be-
haviors in dialogue interactions, including rule-
based simulators (Lee et al.,2019) and model-
based simulators (Takanobu et al.,2020;Tseng