
Annotated Dialogue
User: I am looking for a dining place that serves Indian food.
System: There are 5 Indian restaurants. What price range do you prefer?
User: Expensive would be great.

Traditional Evaluation
User: I am looking for a dining place that serves Indian food.
System: There are 5 Indian restaurants. What area of town would you like to dine in?
User: Expensive would be great.

Interactive Evaluation
User: I am looking for an Indian restaurant in the expensive price range.
System: There are 5 Indian restaurants. Is there a particular area you would like to dine in?
User: No, I don't care. I would like to book a table for 6 people at 13:15 on Friday.

Figure 1: Illustration of interactions between users and systems. Traditional evaluation may face a policy mismatch between the utterances marked in red (in the Traditional Evaluation panel, the system asks about the area while the pre-annotated user utterance answers about the price range).
tions with the generated responses instead of using static user utterances that have been annotated in advance. Therefore, during evaluation, user utterances respond to the generated responses, which avoids the mismatch between stale user utterances and generated responses. However, in interactive evaluations the quality of the generated text cannot be measured with traditional BLEU, since no oracle references are available. To better evaluate the performance of dialogue systems, we introduce two automatic scores that assess response quality at both the sentence level and the session level: the sentence-level score measures sentence fluency, and the session-level score measures the coherence between turns within a dialogue session. These scores can also be applied to traditional evaluation methods and to the annotated dataset as a meta-evaluation, allowing us to examine the importance of using user simulators to construct interactive evaluations.
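As a concrete illustration only (a minimal sketch, not the exact formulation used in this work), such sentence-level and session-level scores can be instantiated with a GPT-2 language model: fluency as the negative per-token log-likelihood of a response on its own, and coherence as the negative log-likelihood of each turn conditioned on its dialogue history. The choice of scorer and the normalization below are our assumptions for illustration.

# Minimal sketch of sentence-level (fluency) and session-level (coherence)
# scoring with a GPT-2 language model; the concrete scorers and
# normalization in the paper may differ.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_score(response: str) -> float:
    """Fluency proxy: negative mean token NLL of the response alone."""
    ids = tokenizer(response, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL per token
    return -loss.item()

def session_score(turns: list) -> float:
    """Coherence proxy: negative mean NLL of each turn given its history."""
    history_ids = None
    nlls = []
    for turn in turns:
        turn_ids = tokenizer(turn + tokenizer.eos_token,
                             return_tensors="pt").input_ids
        if history_ids is None:
            ids, labels = turn_ids, turn_ids.clone()
        else:
            ids = torch.cat([history_ids, turn_ids], dim=1)
            labels = ids.clone()
            labels[:, : history_ids.size(1)] = -100  # score only the current turn
        with torch.no_grad():
            nlls.append(model(ids, labels=labels).loss.item())
        history_ids = ids
    return -sum(nlls) / len(nlls)

Higher values indicate more fluent sentences and more coherent sessions; because both quantities are reference-free, they remain computable in the interactive setting where no oracle responses exist.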
We conduct experiments on the MultiWOZ dataset (Budzianowski et al., 2018) with pre-trained models, using our proposed simulator and scores to run interactive evaluations. Experimental results show that interactive evaluation can achieve over 98% inform and success rates, indicating that the bottleneck of TOD performance is the lack of proper evaluation methods. The proposed scores show that our simulator helps achieve promising evaluation results within the interactive evaluation framework. We also examine RL-based models and find, using the proposed scores, that RL methods may hurt response quality in order to achieve high success rates.
In summary, our contributions are:
(A) We construct an evaluation framework that avoids the policy mismatch problem in TOD.
(B) We build a strong user simulator for TOD systems that can be used in both TOD training and evaluation.
(C) Experimental results show the importance of our proposed simulator and evaluation framework and provide hints for future TOD system development; our code is publicly available.
2 Related Work
2.1 Task-Oriented Dialogue Systems
Task-oriented dialogue systems aim to accomplish users' goals, such as booking hotels or flights (Wen et al., 2017; Eric et al., 2017). With the widespread use of pre-trained models (Qiu et al., 2020), end-to-end TOD systems built on pre-trained models have become increasingly popular: Hosseini-Asl et al. (2020) fine-tune all subtasks of TOD with multi-task learning on a single pre-trained model. Yang et al. (2021) encode the results of intermediate subtasks, such as belief states and system actions, into the dialogue history to boost response generation. Su et al. (2021) and He et al. (2021) use additional dialogue corpora to further pre-train the language model and then fine-tune it on the MultiWOZ dataset. Lee (2021) introduces an auxiliary task based on T5 models (Raffel et al., 2020) and achieves state-of-the-art performance without further pre-training.
2.2 Automatic Evaluations
Recent trends leverage neural models to automatically evaluate generated text from different perspectives. Automatic evaluation methods can target specific aspects of specific tasks, such as factuality checking in text summarization (Kryscinski et al., 2020), learned metrics that improve over BLEU in machine translation (Sellam et al., 2020), and coherence in dialogue systems (Tao et al., 2018; Pang et al., 2020). With pre-trained models, the quality of text generation can be measured by evaluation methods such as BERTScore (Zhang et al., 2020a) and BARTScore (Yuan et al., 2021). With properly designed neural scores, the performance of dialogue systems can be evaluated more accurately.
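As an illustration of such reference-based metrics (not a metric proposed in this work), a generated response can be scored against an annotated reference with the public bert_score package; the example utterances below are taken from Figure 1 and are only for demonstration.

# Illustrative use of BERTScore on a dialogue response pair.
from bert_score import score

candidates = ["There are 5 Indian restaurants. Is there a particular area you would like to dine in?"]
references = ["There are 5 Indian restaurants. What area of town would you like to dine in?"]

# Returns precision, recall, and F1 tensors, one entry per candidate-reference pair.
P, R, F1 = score(candidates, references, lang="en", rescale_with_baseline=True)
print(f"BERTScore F1: {F1.mean().item():.4f}")

Note that metrics of this kind still require oracle references, which is precisely what interactive evaluation lacks; this motivates the reference-free scores introduced above.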
2.3 User Simulators
User simulators are designed to simulate users’ be-
haviors in dialogue interactions, including rule-
based simulators (Lee et al.,2019) and model-
based simulators (Takanobu et al.,2020;Tseng