Is MultiWOZ a Solved Task?
An Interactive TOD Evaluation Framework with User Simulator
Qinyuan Cheng1, Linyang Li1, Guofeng Quan1, Feng Gao2, Xiaofeng Mou2 and Xipeng Qiu1†
1School of Computer Science, Fudan University
2AI Innovation Center, Midea Group Co., Ltd.
{chengqy21, gfquan21}@m.fudan.edu.cn {linyangli19, xpqiu}@fudan.edu.cn
{gaofeng14, mouxf}@midea.com
Abstract
Task-Oriented Dialogue (TOD) systems are drawing more and more attention in recent studies. Current methods focus on constructing pre-trained models or fine-tuning strategies, while the evaluation of TOD is limited by a policy mismatch problem: during evaluation, the user utterances are taken from the annotated dataset, although these utterances should respond to the previous system responses, which can have many valid alternatives besides the annotated texts. Therefore, in this work, we propose an interactive evaluation framework for TOD. We first build a goal-oriented user simulator based on pre-trained models and then use the user simulator to interact with the dialogue system to generate dialogues. Besides, we introduce a sentence-level and a session-level score to measure sentence fluency and session coherence in the interactive evaluation. Experimental results show that RL-based TOD systems trained with our proposed user simulator can achieve nearly 98% inform and success rates in the interactive evaluation of MultiWOZ, and that the proposed scores measure response quality beyond the inform and success rates. We hope that our work will encourage simulator-based interactive evaluations in the TOD task.1
1 Introduction
Building intelligent dialogue systems has become a trend in natural language processing applications, especially with the help of powerful pre-trained models. Specifically, task-oriented dialogue (TOD) systems (Zhang et al., 2020b) are designed to help users with scenarios such as booking hotels or flights. These TOD systems (Wen et al., 2017; Zhong et al., 2018; Chen et al., 2019) usually first recognize the user's intents and then generate corresponding responses based on an external database containing booking information. Therefore, the key factor in TOD is the interaction between users and dialogue systems.

1https://github.com/xiami2019/User-Simulator
*These authors contributed equally to this work.
†Corresponding author.
However, the traditional evaluation process for TOD systems uses the annotated user utterances in multi-turn dialogue sessions regardless of the responses the dialogue system generates, as illustrated in Figure 1. In real-world dialogues, by contrast, user utterances are coherent with the responses from the other speaker (the service provider). Therefore, in TOD evaluation, using annotated utterances without interaction with the dialogue system causes a policy mismatch, which weakens the soundness of the evaluation results. The mismatch can hurt the evaluation process since some responses may be correct and coherent but follow a different policy from the annotated responses. Also, incoherent dialogue histories will affect response generation. With current state-of-the-art models achieving similar performance, it is natural to consider that the bottleneck in the performance of current TOD systems is not the model capability but the evaluation strategy. Since incorporating human interactions during evaluation is costly, a feasible alternative is to build an automatic interactive evaluation framework that can solve the policy mismatch problem.
In this paper, we propose a complete interactive evaluation framework to evaluate the TOD system. We first build a strong dialogue user simulator based on pre-trained models, and we use the proposed simulator to deploy interactive evaluations.

In simulator learning, we introduce a goal-guided user utterance generation model based on sequence-to-sequence pre-trained models. Then we use reinforcement learning to train both the user simulator and the dialogue system to boost interaction performance.
Annotated Dialogue
User: I am looking for a dining place that serves Indian food.
System: There are 5 Indian restaurants. What price range do you prefer?
User: Expensive would be great.

Traditional Evaluation
User: I am looking for a dining place that serves Indian food.
System: There are 5 Indian restaurants. What area of town would you like to dine in?
User: Expensive would be great.

Interactive Evaluation
User: I am looking for an Indian restaurant in the expensive price range.
System: There are 5 Indian restaurants. Is there a particular area you would like to dine in?
User: No, I don't care. I would like to book a table for 6 people at 13:15 on Friday.

Figure 1: Illustration of interactions between users and systems. Traditional evaluation might face a policy mismatch between the utterances annotated in red.

In interactive evaluations, we use the simulator to generate user utterances through interaction with the generated responses, instead of using static user utterances that have been annotated in advance. Therefore, during evaluation, user utterances respond to the generated responses, which avoids the mismatch between stale user utterances and generated responses. Further, in interactive evaluations, the quality of the generated texts cannot be measured by traditional BLEU, since BLEU cannot be calculated without oracle texts. To better evaluate the performance of dialogue systems, we introduce two automatic scores to evaluate response quality at both the sentence level and the session level. The sentence-level score evaluates sentence fluency, and the session-level score evaluates the coherence between turns in a dialogue session. These proposed scores can also be used in traditional evaluation methods and on the annotated dataset itself as a meta-evaluation, to explore the importance of using user simulators to construct interactive evaluations.
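To make the interaction procedure concrete, the following is a minimal Python sketch of the interactive evaluation loop. The simulator and system interfaces, the turn cap, and the score callables are illustrative assumptions for this sketch, not the exact interfaces of our released code.

from typing import Callable, Dict, List, Tuple

Turn = Tuple[str, str]  # (speaker, utterance)

MAX_TURNS = 20  # assumed cap on session length; not a value taken from the paper


def interactive_session(simulator, system, goal) -> List[Turn]:
    """Generate one dialogue by letting the user simulator and the TOD system
    talk to each other, instead of replaying annotated user utterances."""
    history: List[Turn] = []
    simulator.reset(goal)   # initialize goal states from the pre-defined user goal
    system.reset()
    for _ in range(MAX_TURNS):
        user_utt = simulator.generate(history)            # conditioned on goal states + history
        history.append(("user", user_utt))
        sys_resp = system.respond(history)                # belief state -> DB query -> response
        history.append(("system", sys_resp))
        simulator.update_goal_state(user_utt, sys_resp)   # drop goals finished in this turn
        if simulator.is_finished():                       # goals exhausted, or 'bye'/'thank' produced
            break
    return history


def score_sessions(histories: List[List[Turn]],
                   sentence_score: Callable[[str], float],
                   session_score: Callable[[List[Turn]], float]) -> Dict[str, float]:
    """Aggregate the two proposed automatic scores over generated sessions."""
    sentences = [utt for history in histories for _, utt in history]
    return {
        "sentence_level": sum(sentence_score(s) for s in sentences) / len(sentences),
        "session_level": sum(session_score(h) for h in histories) / len(histories),
    }

Inform and success rates are then computed on the generated sessions, while the two scores above replace BLEU, which has no oracle reference in this setting.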
We conduct experiments on the MultiWOZ dataset (Budzianowski et al., 2018) based on pre-trained models and use our proposed simulator and scores to run interactive evaluations. Experimental results show that interactive evaluations can achieve over 98% inform and success rates, indicating that the bottleneck of TOD performance is the lack of proper evaluation methods. The proposed scores show that our proposed simulator can help achieve promising evaluation results in the interactive evaluation framework. Also, we explore the performance of RL-based models, and using the proposed scores we find that RL methods might hurt response quality in order to achieve high success rates.
Therefore, we summarize our contributions as follows:

(A) We construct an evaluation framework that avoids the policy mismatch problem in TOD.

(B) We build a strong user simulator for TOD systems that can be used in TOD training and evaluation.

(C) Experimental results show the importance of using our proposed simulator and evaluation framework and provide hints for future TOD system development, with publicly available code.
2 Related Work
2.1 Task-Oriented Dialogue Systems
Task-oriented dialogue systems aim to achieve users' goals such as booking hotels or flights (Wen et al., 2017; Eric et al., 2017). With the widespread use of pre-trained models (Qiu et al., 2020), end-to-end TOD systems based on pre-trained models have become increasingly popular: Hosseini-Asl et al. (2020) fine-tunes all subtasks of TOD using multi-task learning based on a single pre-trained model. Yang et al. (2021) encodes the results of intermediate subtasks, such as belief states and system actions, into the dialogue history to boost response generation. Su et al. (2021) and He et al. (2021) use additional dialogue corpora to further pre-train the language model and then fine-tune it on the MultiWOZ dataset. Lee (2021) introduces an auxiliary task based on T5 models (Raffel et al., 2020) and achieves state-of-the-art performance without further pre-training.
2.2 Automatic Evaluations
Recent work leverages neural models to automatically evaluate generated texts from different perspectives. Automatic evaluation methods can help evaluate certain aspects of certain tasks, such as factuality checking in text summarization (Kryscinski et al., 2020), stronger BLEU-style scoring in machine translation (Sellam et al., 2020), and coherence in dialogue systems (Tao et al., 2018; Pang et al., 2020). With pre-trained models, the quality of text generation can be measured by evaluation methods such as BERTScore (Zhang et al., 2020a) and BARTScore (Yuan et al., 2021). With properly designed neural model scores, the performance of dialogue systems can be evaluated more accurately.
2.3 User Simulators
User simulators are designed to simulate users' behaviors in dialogue interactions, and include rule-based simulators (Lee et al., 2019) and model-based simulators (Takanobu et al., 2020; Tseng et al., 2021). Usually, user simulators are introduced along with reinforcement learning strategies to enhance dialogue policy modeling (Li et al., 2016; Shi et al., 2019), which can help the model learn better policies not included in the annotated data. Takanobu et al. (2020) treats the model-based simulator as a dialogue agent like the dialogue system and formulates TOD as a multi-agent policy learning problem. Tseng et al. (2021) focuses on using reinforcement learning to jointly train the simulator and the dialogue system to boost the domain adaptation capability of the model.
3 Interactive Evaluation Framework
In our proposed interactive evaluation framework, we first build a goal-state guided user simulator to model user policies and generate high-quality user utterances. Then we construct the interactive evaluation framework and introduce two scores to evaluate the interactive inference results.
3.1 User Simulator Construction
A user simulator generates user utterances for interaction with the dialogue system. Similar to the dialogue system, the user simulator also considers dialogue histories and generates utterances via a sequence-to-sequence text generation framework. We propose a goal-state guided simulator that controls user utterance generation based on goal-state tracking. Further, we adopt reinforcement learning methods to boost the interaction performance between our proposed goal-state guided simulator and dialogue systems.
3.1.1 Goal-State Guided Simulator
We introduce a goal-state guided simulator that generates user utterances based on sequence-to-sequence pre-trained models. The basic idea is to use pre-defined user goals as initial goal states and to track the goal states based on user and system actions, which is similar to belief state tracking. Figure 2 illustrates the interaction process between the user simulator and the dialogue system. We first add the current goal states at the front of the user simulator inputs. In addition, the user simulator encodes the previous dialogue history, including user utterances and dialogue system responses. The user simulator predicts the user actions and then obtains the finished goals by combining both user actions and dialogue system actions. By removing the finished goals from the current goal states, we obtain the unfinished goals, and the user simulator generates user utterances based on these unfinished goals in the next turn. When the user simulator has finished all required goals, the unfinished goal slot is empty and the user simulator ceases to generate utterances. Besides, we add two additional termination signals for the user simulator: when the dialogue session exceeds a certain number of turns and the goal states still cannot be fully finished, or when the user simulator or the dialogue system generates a definite closing action such as 'bye' or 'thank', the user simulator terminates the dialogue session.

GOAL-STATE:
[restaurant]
 -[inform]
  -[food]-[Indian]
  -[pricerange]-[expensive]
 -[book]
  -[people]-[6]
  -[day]-[Friday]
  -[time]-[13:15]

User Simulator: I am looking for an Indian restaurant in the expensive price range.
System: There are 5 Indian restaurants in the city. Is there a particular area you would like to dine in?
Finished GOAL: [restaurant] [inform] [food] [pricerange]
Unfinished GOAL: [restaurant] [book] [time] [day] [people]

User Simulator: No, I don't care. I would like to book a table for 6 people at 13:15 on Friday.
System: I have booked you a table at Nusha. Your reference number is 021... Is there anything else I can help you with?
Finished GOAL: [restaurant] [book] [time] [day] [people]
Unfinished GOAL: None (FINISHED)

Figure 2: Illustration of the goal-state guided simulator interaction process, including goal-state tracking and utterance generation.
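As a concrete illustration of the goal-state update shown in Figure 2, the Python sketch below removes finished goals from the current goal state after each exchange and checks the termination conditions. The dictionary-based goal representation, the (domain, intent, slot) action format, and the default turn cap are simplifying assumptions for this sketch rather than the exact data structures in our code.

from typing import Dict, List, Set, Tuple

# A goal state maps (domain, intent) to the slots still to be fulfilled, e.g.
# {("restaurant", "inform"): {"food", "pricerange"},
#  ("restaurant", "book"): {"people", "day", "time"}}
GoalState = Dict[Tuple[str, str], Set[str]]
Action = Tuple[str, str, str]  # (domain, intent, slot), e.g. ("restaurant", "inform", "food")


def update_goal_state(goal_state: GoalState,
                      user_actions: List[Action],
                      system_actions: List[Action]) -> GoalState:
    """Remove the goals finished in this turn, combining user and system actions."""
    finished = set(user_actions) | set(system_actions)
    remaining: GoalState = {}
    for (domain, intent), slots in goal_state.items():
        left = {slot for slot in slots if (domain, intent, slot) not in finished}
        if left:
            remaining[(domain, intent)] = left
    return remaining


def should_terminate(goal_state: GoalState,
                     last_actions: List[Action],
                     turn: int,
                     max_turns: int = 20) -> bool:
    """Stop when no unfinished goals remain, a closing action ('bye'/'thank')
    is produced, or the session exceeds the allowed number of turns."""
    closing = any(intent in ("bye", "thank") for _, intent, _ in last_actions)
    return not goal_state or closing or turn >= max_turns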
3.1.2 Simulator Training
The training process of the user simulator includes sequence-to-sequence supervised learning and reinforcement learning.

In supervised learning, the user simulator encodes the goal states at the front of the input text and considers the entire dialogue history, including the texts of both user utterances and system responses. The generated texts include the current user actions and user utterances. Therefore, the entire training pro-
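To illustrate the supervised training format described above, the snippet below sketches how one training instance for the user simulator might be serialized for a sequence-to-sequence model. The bracketed special tokens and separators are illustrative assumptions; the exact tokenization in our released code may differ.

from typing import List, Tuple


def build_simulator_example(goal_state_text: str,
                            history: List[Tuple[str, str]],
                            user_actions_text: str,
                            user_utterance: str) -> Tuple[str, str]:
    """Serialize one supervised instance: goal states plus the full dialogue
    history as the source, current user actions plus utterance as the target."""
    history_text = " ".join(
        f"<sos_{speaker}> {utterance} <eos_{speaker}>" for speaker, utterance in history
    )
    source = f"<sos_goal> {goal_state_text} <eos_goal> {history_text}"
    target = (f"<sos_action> {user_actions_text} <eos_action> "
              f"<sos_user> {user_utterance} <eos_user>")
    return source, target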