FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation
Chen Zhang, Luis Fernando D'Haro, Qiquan Zhang, Thomas Friedrichs, Haizhou Li
National University of Singapore · Robert Bosch (SEA), Singapore · Universidad Politécnica de Madrid, Spain · Kriston AI Lab, China · The Chinese University of Hong Kong, Shenzhen, China
chen_zhang@u.nus.edu
Abstract
Recent model-based reference-free metrics for open-domain dialogue evaluation exhibit promising correlations with human judgment.¹ However, they either perform turn-level evaluation or look at a single dialogue quality dimension. One would expect a good evaluation metric to assess multiple quality dimensions at the dialogue level. To this end, we propose a multi-dimensional dialogue-level metric, which consists of three sub-metrics, each targeting a specific dimension. The sub-metrics are trained with novel self-supervised objectives and exhibit strong correlations with human judgment for their respective dimensions. Moreover, we explore two approaches to combining the sub-metrics: metric ensemble and multitask learning. Both approaches yield a holistic metric that significantly outperforms the individual sub-metrics. Compared to the existing state-of-the-art metric, the combined metrics achieve around 16% relative improvement on average across three high-quality dialogue-level evaluation benchmarks.

¹ As shown in Yeh et al. (2021), most reference-free metrics achieve around 0.3 to 0.6 Spearman correlation on various turn-level benchmarks. However, on dialogue-level benchmarks, most metrics perform poorly (< 0.2 Spearman correlation).
1 Introduction
In the study of generative dialogue systems, we heavily rely on some reference-based static metrics, such as BLEU (Papineni et al., 2002), to measure improvements during system development and to compare various model variants. These metrics still need improvement due to their poor correlations with human judgment (Liu et al., 2016) and poor interpretability (Mehri and Eskenazi, 2020b).
Recently, model-based reference-free metrics (Yeh et al., 2021) represent one of the ways to address the limitations of static reference-based metrics. Although such metrics exhibit promising correlations with human evaluation, most of them (Tao et al., 2018; Ghazarian et al., 2019; Huang et al., 2020; Sinha et al., 2020; Mehri and Eskenazi, 2020b; Phy et al., 2020; Pang et al., 2020; Zhang et al., 2021c) target turn-level evaluation, i.e., they focus on single-response quality, such as contextual relevance and naturalness. When evaluating a multi-turn human-chatbot dialogue, turn-level metrics do not model the dialogue in its totality, but frame it as a set of context-response pairs. They assign a score to every chatbot response in the dialogue. Hence, an aggregation strategy, such as taking the average of all the response-level scores, is required to derive a single dialogue-level metric score. Both prior works (Zhang et al., 2021a; Yeh et al., 2021) and our experimental results in §5 suggest that such an approach yields sub-optimal dialogue-level evaluation. The reason may be that turn-level metrics do not model the dependency among utterances within multi-turn interactions, making it difficult for them to spot errors that are only obvious after observing the entire conversation (Ghandeharioun et al., 2019; Ghazarian et al., 2022).
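To make this aggregation baseline concrete, the following sketch (an illustration, not taken from any released metric implementation) averages turn-level scores over a dialogue; the `turn_level_metric` callable and the assumption that the chatbot speaks on every other turn are placeholders.

```python
# Sketch of the aggregation baseline described above: a turn-level metric
# scores each context-response pair, and the dialogue-level score is the mean.
# `turn_level_metric` is a hypothetical stand-in for any existing turn-level metric.
from statistics import mean
from typing import Callable, List

def dialogue_score_by_aggregation(
    utterances: List[str],
    turn_level_metric: Callable[[List[str], str], float],
) -> float:
    """Score chatbot responses against their preceding context and average."""
    scores = []
    for i in range(1, len(utterances), 2):  # assume the chatbot speaks on odd turns
        context, response = utterances[:i], utterances[i]
        scores.append(turn_level_metric(context, response))
    return mean(scores) if scores else 0.0
```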
There are some metrics that perform multi-turn evaluation. However, they focus only on a single dimension, such as coherence or overall impression (Mesgar et al., 2020; Zhang et al., 2021a; Li et al., 2021; Ghazarian et al., 2022). When evaluating a dialogue, they assign a single score to quantify one aspect of dialogue quality. As pointed out in Mehri et al. (2022), dialogue quality is inherently multi-faceted. By breaking down the quality of the dialogue into multiple fine-grained dimensions, we may provide a more interpretable and descriptive dialogue evaluation. With such an interpretable metric, dialogue researchers know exactly which aspect of the dialogue system to improve.
To this end, we propose a multi-dimensional metric, dubbed FineD-Eval², which consists of specialized sub-metrics. Each sub-metric targets a specific fine-grained dimension and all sub-metrics are trained in a self-supervised manner without reliance on any human annotations.

² https://github.com/e0397123/FineD-Eval
To develop FineD-Eval, our first step is to identify the dimensions for metric design. It is a well-known phenomenon that human judges do not provide completely independent assessments for various fine-grained dimensions. For instance, Sai et al. (2021) analyze the human ratings with respect to (w.r.t.) different fine-grained dimensions on four text generation tasks and observe moderate correlations for most dimension pairs. Intuitively, we want to select dimensions that are less correlated such that our metric can holistically capture dialogue quality from different perspectives. The selection process is guided by an analysis of fine-grained human ratings from dialogue-level evaluation data (§2). Through this analysis, we cluster the dimensions into relatively independent dimension groups and then select representative dimensions from the different groups.
Next, we propose dimension-specific strategies for training the sub-metrics (§3.3). The sub-metrics, which target the representative dimensions, can also be applied to evaluate other dimensions in their respective dimension groups. Furthermore, both Yeh et al. (2021) and Zhang et al. (2021d) highlight that the combination of different metrics leads to better correlations with human evaluation than individual specialized metrics. We are thus motivated to explore how to combine the sub-metrics into a unified one. Specifically, both metric ensemble and multitask learning (Caruana, 1997) are examined (§3.4).
Finally, in the experiments (§5), we demonstrate that (1) the sub-metrics highly correlate with human judgment for their target dimensions; (2) the scores assigned by FineD-Eval are more interpretable than those of existing metrics; and (3) with either metric ensemble or multitask learning, FineD-Eval significantly outperforms existing state-of-the-art metrics as well as the individual sub-metrics on three high-quality dialogue-level evaluation benchmarks.
2 Analysis of Human Evaluation Data
2.1 Grouping of the Dimensions
In this section, we analyze the human ratings of FED (Mehri and Eskenazi, 2020a), a high-quality dialogue-level evaluation benchmark. Each dialogue in FED is annotated by five human judges for 11 different quality dimensions³, as shown in the axis labels of Figure 1. We choose FED for our analysis because the dataset covers the most comprehensive list of dialogue quality dimensions. In addition, the human annotation quality of FED is high, as evidenced by the strong inter-annotator agreements w.r.t. different dimensions⁴.

Figure 1: Spearman correlations of dimension pairs on FED.

Group | Quality Dimensions
------|------------------------------------------
Coh   | Coherence, Understanding
Lik   | Likability, Flexibility, Informativeness
Top   | Topic Depth, Diversity, Informativeness
Con   | Consistency
Inq   | Inquisitiveness
Err   | Error Recovery

Table 1: Grouping of the dimensions. We adopt the first three letters of the representative dimension within each group as the corresponding group name.

³ The detailed definitions of all dimensions are presented in Table 11 of the Appendix.
⁴ Above 0.75 in terms of Spearman correlation for all dimensions except consistency, which is 0.562.
Figure 1 presents the Spearman correlations of different dimension pairs on FED. We can observe that all dimensions are interdependent, with correlations ranging from 0.38 to 0.88. Based on their extent of interdependence, we cluster the 10 dimensions (excluding the "Overall" category) into six groups, as shown in Table 1. We adopt the first three letters of the representative dimension within each group as the corresponding group name. The representative dimension in each group is chosen based on criteria discussed in §2.2.
A dimension is treated as an independent group if it does not correlate strongly (< 0.75) with any of the other dimensions. Hence, consistency, inquisitiveness, and error recovery can be perceived as three independent dimension groups: Con, Inq, and Err, respectively. The remaining dimensions are more or less correlated with each other. We cluster these seven dimensions into three groups, Coh, Lik, and Top, based on the following four observations: (1) coherence strongly correlates with understanding (0.83); (2) the likability-flexibility and likability-informativeness correlations are both 0.82; (3) the correlation between topic depth and informativeness is as high as 0.84; and (4) diversity only strongly correlates with topic depth (0.8).
The categorization may not be perfect, as Coh, Lik, and Top are not completely independent of each other. For example, informativeness appears in both group Lik and group Top. A possible explanation is that humans generally like knowledgeable chatbots that can discuss different topics in depth rather than those that generate dull responses (See et al., 2019; Roller et al., 2021). To improve the categorization, future work may conduct a similar analysis on large-scale dialogue-level human annotations.
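Such a correlation analysis can be reproduced with a few lines of code. The sketch below is illustrative only: it assumes the averaged per-dialogue dimension ratings are available as columns of a pandas DataFrame, and it reuses the 0.75 threshold from the grouping discussion above.

```python
# Sketch of the dimension-correlation analysis in this section (illustration).
# `ratings` is assumed to be a DataFrame with one row per dialogue and one
# column per quality dimension, holding the averaged human Likert ratings.
import pandas as pd

def strongly_correlated_pairs(ratings: pd.DataFrame, threshold: float = 0.75):
    """Return dimension pairs whose Spearman correlation exceeds `threshold`."""
    corr = ratings.corr(method="spearman")
    dims = list(corr.columns)
    pairs = []
    for i, a in enumerate(dims):
        for b in dims[i + 1:]:
            if corr.loc[a, b] > threshold:
                pairs.append((a, b, round(float(corr.loc[a, b]), 2)))
    return pairs

# Dimensions that never appear in any strongly correlated pair (e.g.,
# consistency, inquisitiveness, error recovery on FED) form their own groups.
```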
2.2 Dimension Selection
As mentioned in §1, we want to identify fine-grained dimensions that are less similar. Hence, we select only one dimension from each group and avoid those that are shared between two different groups. In addition, to further reduce the complexity of FineD-Eval, we implement the following rules to narrow down the selection to only three fine-grained dimensions.
First, only dimensions that highly correlate with the "Overall" category (> 0.75) are considered. The intuition is that a high correlation with "Overall" indicates that the fine-grained dimension has more influence on human annotators' overall impression of a dialogue. Second, we filter out dimensions with low inter-annotator agreement (< 0.6)⁵, because low inter-annotator agreement may suggest that the dimension is complex to evaluate and that human annotators have different understandings of it (Mehri et al., 2022). Lastly, we choose dimensions based on how often they are marked as "N/A" (not applicable) by the human judges. A high frequency indicates that the dimension is not generally applicable in different contexts. Most dimensions do not contain an "N/A" rating, except "Error recovery", which is marked as "N/A" 25% of the time.

⁵ Only consistency has an inter-annotator agreement below 0.6.
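The three rules amount to a simple filter over per-dimension statistics. The sketch below is illustrative; the per-dimension statistics and the N/A cutoff value are assumptions, while the 0.75 and 0.6 thresholds follow the text.

```python
# Sketch of the dimension-selection rules (illustration). `stats` maps each
# dimension to precomputed values; the N/A cutoff is an assumption, the other
# thresholds follow the text.
def select_dimensions(stats: dict,
                      min_overall_corr: float = 0.75,
                      min_agreement: float = 0.6,
                      max_na_rate: float = 0.25) -> list:
    selected = []
    for dim, s in stats.items():
        if (s["corr_with_overall"] > min_overall_corr       # rule 1
                and s["inter_annotator_agreement"] >= min_agreement  # rule 2
                and s["na_rate"] < max_na_rate):             # rule 3
            selected.append(dim)
    return selected
```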
Based on the rules, we choose the following three dimensions: coherence, likability, and topic depth. In addition to the rules, we choose these dimensions because they are also widely studied in open-domain dialogue systems. Researchers spend a significant amount of effort on developing coherent, engaging, and knowledgeable chatbots (Adiwardana et al., 2020; Hedayatnia et al., 2020; Shuster et al., 2021). Designing meaningful metrics along these three dimensions can therefore benefit current open-domain dialogue research. Though other dimensions, such as consistency (Nie et al., 2021), inquisitiveness (See et al., 2019), and long-term memory (Xu et al., 2022), are equally important, their evaluation deserves a thorough study of its own. Hence, we leave them for future work.
3 Methodology
3.1 Problem Formulation
We formally define the dialogue-level evaluation task. Suppose that we have a dialogue evaluation dataset, $D$, which contains $n$ human-chatbot dialogues, $D = \{d_1, d_2, \ldots, d_j, \ldots, d_n\}$. Each dialogue $d_j$ is annotated by several human judges for a set of quality dimensions, $Q$. Each human judge provides a rating to $d_j$ for each individual dimension, $q \in Q$. We use $r^q_{d_j}$ to denote the average Likert rating provided by all human annotators to $d_j$ for $q$.

Our goal is to learn dimension-specific metrics, $M_q(d_j) \rightarrow s^q_{d_j}$, where $s^q_{d_j}$ is the metric score reflecting how good $d_j$ is for dimension $q$ as perceived by $M_q$. To assess the performance of $M_q$ on $D$, the correlation, denoted as $\rho_q$, between $S_q = \{s^q_{d_1}, \ldots, s^q_{d_j}, \ldots, s^q_{d_n}\}$ and $R_q = \{r^q_{d_1}, \ldots, r^q_{d_j}, \ldots, r^q_{d_n}\}$ is calculated. A higher $\rho_q$ indicates better performance of $M_q$ on $D$.
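In practice, $\rho_q$ can be computed with an off-the-shelf routine. A minimal sketch, assuming SciPy's Spearman implementation and lists holding the metric scores ($S_q$) and averaged human ratings ($R_q$):

```python
# Sketch of computing rho_q using SciPy (illustration).
from scipy.stats import spearmanr

def evaluate_metric(metric_scores, human_ratings):
    """metric_scores ~ S_q, human_ratings ~ R_q.
    Returns Spearman's rho and the associated p-value."""
    rho, p_value = spearmanr(metric_scores, human_ratings)
    return rho, p_value

# Example with toy numbers:
# rho, p = evaluate_metric([0.2, 0.8, 0.5], [1.5, 4.0, 3.0])
```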
3.2 General Framework
We propose a multi-dimensional dialogue-level metric, FineD-Eval, which is a combination of three specialized sub-metrics, $M_q$, where $q \in \{\text{coherence}, \text{likability}, \text{topic depth}\}$. We explore two approaches for combining the sub-metrics: metric ensemble and multitask learning. Metric ensemble is a late fusion approach whereby the predictions made by the sub-metrics are combined. Multitask learning, on the other hand, is an early fusion approach whereby the sub-metrics share a common text encoder while having different output layers. Details of both approaches are discussed in §3.4. Here, we focus on the details of $M_q$.
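As a rough illustration of the late-fusion idea (the actual combination schemes are specified in §3.4, not here), the sketch below simply averages the three sub-metric scores; the plain average is an assumption, not the paper's prescribed ensemble.

```python
# Illustrative late-fusion ensemble: average the sub-metric scores.
# The simple average is an assumption; see §3.4 for the actual schemes.
from typing import Callable, Dict

def ensemble_score(dialogue: str,
                   sub_metrics: Dict[str, Callable[[str], float]]) -> float:
    """sub_metrics maps dimension names (coherence, likability, topic depth)
    to scoring functions M_q; returns an averaged dialogue-level score."""
    scores = [metric(dialogue) for metric in sub_metrics.values()]
    return sum(scores) / len(scores)
```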
To train $M_q$, we formulate a preference learning approach (Fürnkranz and Hüllermeier, 2011). Given a pair of dimension-specific positive and negative training dialogue samples, ($d^+_{tr}$, $d^-_{tr}$), $M_q$ learns to predict a higher score for $d^+_{tr}$ than for $d^-_{tr}$. The strategies for constructing ($d^+_{tr}$, $d^-_{tr}$) are outlined in §3.3. During training, a mini-batch is formed with two types of data instances⁶: (1) ($d^+_{tr}$, $d^-_{tr}$) with label $y = 1$; and (2) ($d^-_{tr}$, $d^+_{tr}$) with label $y = -1$. $M_q$ outputs two scalar values, $s^q_{d^+_{tr}}$ and $s^q_{d^-_{tr}}$, that correspond to $d^+_{tr}$ and $d^-_{tr}$ respectively. The following margin ranking loss is adopted to train the model:

$$\mathcal{L}_q = \max\big(0,\, -y \cdot (x^q_1 - x^q_2) + 0.1\big) \qquad (1)$$

where $(x^q_1, x^q_2, y)$ can be either ($s^q_{d^+_{tr}}$, $s^q_{d^-_{tr}}$, 1) or ($s^q_{d^-_{tr}}$, $s^q_{d^+_{tr}}$, -1).

⁶ This formulation is to avoid the model relying on the positions of the dialogues to make predictions.
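Equation (1) is the standard margin ranking loss with a margin of 0.1, which is available directly in PyTorch. A minimal sketch with toy score values:

```python
# Sketch of the training objective in Eq. (1) using PyTorch's built-in
# margin ranking loss with margin = 0.1 (illustration).
import torch
import torch.nn as nn

ranking_loss = nn.MarginRankingLoss(margin=0.1)

# x1, x2: scores for the first and second dialogue in each pair;
# y = +1 when the first dialogue should be ranked higher, -1 otherwise.
x1 = torch.tensor([0.8, 0.3])   # e.g., (s of d+, s of d-) for instance type (1)
x2 = torch.tensor([0.4, 0.7])   #       (s of d-, s of d+) for instance type (2)
y = torch.tensor([1.0, -1.0])

loss = ranking_loss(x1, x2, y)  # max(0, -y * (x1 - x2) + 0.1), averaged
```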
The pairwise ranking formulation is motivated by previous works on dialogue evaluation (Mesgar et al., 2020; Huang et al., 2020; Gao et al., 2020; Zhang et al., 2021a). Compared to direct assessment approaches (Zhang et al., 2021c; Ghazarian et al., 2022), the main advantage of pairwise ranking is that the model can implicitly learn the features that distinguish the good dialogues from the bad ones based on a large quantity of dialogue pairs for a specific quality dimension.
The network architecture of $M_q$ is straightforward. RoBERTa-base (Liu et al., 2019) is adopted as the text encoder, $T$, which maps ($d^+_{tr}$, $d^-_{tr}$) to dense representations ($H^+_{tr}$, $H^-_{tr}$). Both $d^+_{tr}$ and $d^-_{tr}$ are formulated as a token sequence with the special token "</UTT>" to delimit different utterances. Next, ($H^+_{tr}$, $H^-_{tr}$) are converted into vector representations ($h^+_{tr}$, $h^-_{tr}$) with average pooling. Through a linear layer with output size 1 and a sigmoid activation function, $h^+_{tr}$ and $h^-_{tr}$ are transformed into scalar values $s^q_{d^+_{tr}}$ and $s^q_{d^-_{tr}}$ respectively. During inference, given $d_j \in D$, the scalar value $s^q_{d_j}$ output by $M_q$ is the corresponding metric score.
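A compact sketch of the described architecture, assuming HuggingFace Transformers and PyTorch; details not stated in the text (e.g., truncation length, how the delimiter token is registered) are assumptions.

```python
# Sketch of the sub-metric M_q: RoBERTa-base encoder, average pooling over
# token representations, then a linear layer with a sigmoid producing a
# scalar score (illustration, not the released implementation).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SubMetric(nn.Module):
    def __init__(self, model_name: str = "roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.scorer = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # average pooling
        return torch.sigmoid(self.scorer(pooled)).squeeze(-1)   # score in (0, 1)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
# Register the utterance delimiter as a special token (assumed handling).
tokenizer.add_special_tokens({"additional_special_tokens": ["</UTT>"]})
model = SubMetric()
model.encoder.resize_token_embeddings(len(tokenizer))

dialogue = " </UTT> ".join(["Hi, how are you?", "I am good, thanks!"])
inputs = tokenizer(dialogue, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(inputs["input_ids"], inputs["attention_mask"])
```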
3.3 Dimension-Specific Sampling Strategies
In this section, we discuss different strategies to obtain dimension-specific training dialogue pairs. All ($d^+_{tr}$, $d^-_{tr}$) samples are automatically constructed from human-human dialogue datasets without reliance on human annotations.
Coherence (Coh). We consider two strategies for coherence. The first is utterance order shuffling, whereby dialogues from existing human-human dialogue corpora (Li et al., 2017; Dinan et al., 2020) are treated as $d^+_{tr}$. To obtain $d^-_{tr}$, we randomly permute the order of utterances in $d^+_{tr}$. This strategy has been widely adopted in previous dialogue coherence studies (Cervone et al., 2018; Mesgar et al., 2020; Zhang et al., 2021a).
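A minimal sketch of the utterance order shuffling strategy; the guard against sampling the identity permutation is an added safeguard, not described in the text.

```python
# Sketch of utterance order shuffling (illustration): the original dialogue
# is the positive sample, a random permutation of its utterances the negative.
import random
from typing import List, Tuple

def shuffle_pair(dialogue: List[str]) -> Tuple[List[str], List[str]]:
    """Return (d_plus, d_minus) for the coherence sub-metric."""
    assert len(dialogue) > 1, "need at least two utterances to permute"
    d_plus = list(dialogue)
    d_minus = list(dialogue)
    while d_minus == d_plus:          # avoid sampling the identity permutation
        random.shuffle(d_minus)
    return d_plus, d_minus
```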
The second strategy, question-answer (QA) relevance scoring, is motivated by the Gricean maxims (Grice, 1975), whereby effective communication involves being relevant, i.e., one should provide information that is relevant to the current exchange. A natural and logical flow of conversation often involves asking and answering questions, which is a form of information exchange. Humans usually prefer answers that are straight to the point rather than those that are vague and off-topic. Concretely, we select dialogues in existing dialogue corpora⁷ that contain more than 4 utterances and at least one question-answer pair. Next, we use a pretrained BERT-based QA evaluator from HuggingFace⁸ to score each QA pair within a dialogue. The evaluator provides a relevance score between 0 and 1 (the higher, the better). Then, we average the relevance scores of all QA pairs within the dialogue to derive the dialogue-level QA relevance score. Finally, two thresholds, ($\tau^{rel}_{low}$, $\tau^{rel}_{high}$), are chosen: dialogues with scores lower than $\tau^{rel}_{low}$ are considered $d^-_{tr}$, and those with scores higher than $\tau^{rel}_{high}$ are considered $d^+_{tr}$. ($\tau^{rel}_{low}$, $\tau^{rel}_{high}$) are heuristically determined to ensure sufficient data in both the positive and negative classes.
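A sketch of the QA relevance labeling step. The `score_qa_pair` callable is a placeholder for the pretrained QA evaluator referenced in footnote 8 (its exact interface is not described here), and the threshold values shown are illustrative; the paper chooses them heuristically to balance the two classes.

```python
# Sketch of QA relevance labeling (illustration). `score_qa_pair` stands in
# for the QA evaluator and is assumed to return a relevance score in [0, 1].
# Threshold values are placeholders for tau_low and tau_high.
from statistics import mean
from typing import Callable, List, Optional, Tuple

def label_dialogue(qa_pairs: List[Tuple[str, str]],
                   score_qa_pair: Callable[[str, str], float],
                   tau_low: float = 0.3,
                   tau_high: float = 0.7) -> Optional[str]:
    """Average per-pair relevance and map the result to a training label.
    Assumes the dialogue contains at least one QA pair, per the selection step."""
    dialogue_score = mean(score_qa_pair(q, a) for q, a in qa_pairs)
    if dialogue_score < tau_low:
        return "negative"   # d_minus
    if dialogue_score > tau_high:
        return "positive"   # d_plus
    return None             # discarded: neither clearly good nor clearly bad
```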
Likability (Lik). Two strategies are applied to construct $d^+_{tr}$ and $d^-_{tr}$ for likability. The first strategy, contradiction scoring, is motivated by the similarity attraction effect (Byrne et al., 1968; Nass and Lee, 2001). During human-human interaction, people tend to favour others who share similar opinions or preferences with them. On the contrary, conveying contradictory opinions or information may lead to disagreement and user dissatisfaction.
⁷ We hypothesize that even in human-human dialogue corpora, there are answers that are vague and off-topic due to the presence of low-quality crowd-source workers.

⁸ https://huggingface.co/iarfmoose/bert-base-cased-qa-evaluator