FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation
Chen Zhang, Luis Fernando D'Haro, Qiquan Zhang, Thomas Friedrichs, Haizhou Li
National University of Singapore · Robert Bosch (SEA), Singapore · Universidad Politécnica de Madrid, Spain · Kriston AI Lab, China · The Chinese University of Hong Kong, Shenzhen, China
chen_zhang@u.nus.edu
Abstract
Recent model-based reference-free metrics for open-domain dialogue evaluation exhibit promising correlations with human judgment.¹ However, they either perform turn-level evaluation or look at a single dialogue quality dimension. One would expect a good evaluation metric to assess multiple quality dimensions at the dialogue level. To this end, we propose a multi-dimensional dialogue-level metric, which consists of three sub-metrics, each targeting a specific dimension. The sub-metrics are trained with novel self-supervised objectives and exhibit strong correlations with human judgment for their respective dimensions. Moreover, we explore two approaches to combining the sub-metrics: metric ensemble and multitask learning. Both approaches yield a holistic metric that significantly outperforms the individual sub-metrics. Compared to the existing state-of-the-art metric, the combined metrics achieve around 16% relative improvement on average across three high-quality dialogue-level evaluation benchmarks.

¹ As shown in Yeh et al. (2021), most reference-free metrics achieve around 0.3 to 0.6 Spearman correlation on various turn-level benchmarks. However, on dialogue-level benchmarks, most metrics perform poorly (< 0.2 Spearman correlation).
1 Introduction
In the study of generative dialogue systems, we heavily rely on some reference-based static metrics, such as BLEU (Papineni et al., 2002), to measure improvements during system development and to compare various model variants. These metrics still need improvement due to their poor correlations with human judgment (Liu et al., 2016) and poor interpretability (Mehri and Eskenazi, 2020b).
Recently, model-based reference-free metrics (Yeh et al., 2021) represent one of the ways to address the limitations of static reference-based metrics. Although such metrics exhibit promising correlations with human evaluation, most of them (Tao et al., 2018; Ghazarian et al., 2019; Huang et al., 2020; Sinha et al., 2020; Mehri and Eskenazi, 2020b; Phy et al., 2020; Pang et al., 2020; Zhang et al., 2021c) target turn-level evaluation, i.e., they focus on single-response quality, such as contextual relevance and naturalness. When evaluating a multi-turn human-chatbot dialogue, turn-level metrics do not model the dialogue in its totality, but frame it as a set of context-response pairs. They assign a score to every chatbot response in the dialogue. Hence, an aggregation strategy, such as taking the average of all the response-level scores, is required to derive a single dialogue-level metric score. Both prior works (Zhang et al., 2021a; Yeh et al., 2021) and our experimental results in §5 suggest that such an approach yields sub-optimal dialogue-level evaluation. The reason may be that turn-level metrics do not model the dependency among utterances within multi-turn interactions, making it difficult for them to spot errors that are only obvious after observing the entire conversation (Ghandeharioun et al., 2019; Ghazarian et al., 2022).
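To make this aggregation baseline concrete, the following sketch (an illustration, not taken from any released metric implementation) averages turn-level scores over a dialogue; the `turn_level_metric` callable and the assumption that the chatbot speaks on every other turn are placeholders.

```python
# Sketch of the aggregation baseline described above: a turn-level metric
# scores each context-response pair, and the dialogue-level score is the mean.
# `turn_level_metric` is a hypothetical stand-in for any existing turn-level metric.
from statistics import mean
from typing import Callable, List

def dialogue_score_by_aggregation(
    utterances: List[str],
    turn_level_metric: Callable[[List[str], str], float],
) -> float:
    """Score chatbot responses against their preceding context and average."""
    scores = []
    for i in range(1, len(utterances), 2):  # assume the chatbot speaks on odd turns
        context, response = utterances[:i], utterances[i]
        scores.append(turn_level_metric(context, response))
    return mean(scores) if scores else 0.0
```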
There are some metrics that perform multi-turn evaluation. However, they focus only on a single dimension, such as coherence or overall impression (Mesgar et al., 2020; Zhang et al., 2021a; Li et al., 2021; Ghazarian et al., 2022). When evaluating a dialogue, they assign a single score to quantify one aspect of dialogue quality. As pointed out in Mehri et al. (2022), dialogue quality is inherently multi-faceted. By breaking down the quality of the dialogue into multiple fine-grained dimensions, we may provide a more interpretable and descriptive dialogue evaluation. With such an interpretable metric, dialogue researchers know exactly which aspect of the dialogue system to improve.
To this end, we propose a multi-dimensional metric, dubbed FineD-Eval², which consists of specialized sub-metrics. Each sub-metric targets a specific fine-grained dimension and all sub-metrics are trained in a self-supervised manner without reliance on any human annotations.

² https://github.com/e0397123/FineD-Eval
To develop FineD-Eval, our first step is to identify the dimensions for metric design. It is a well-known phenomenon that human judges do not provide completely independent assessments for various fine-grained dimensions. For instance, Sai et al. (2021) analyze the human ratings with respect to (w.r.t.) different fine-grained dimensions on four text generation tasks and observe moderate correlations for most dimension pairs. Intuitively, we want to select dimensions that are less correlated such that our metric can holistically capture dialogue quality from different perspectives. The selection process is guided by an analysis of fine-grained human ratings from dialogue-level evaluation data (§2). Through this analysis, we cluster the dimensions into relatively independent dimension groups and then select representative dimensions from the different groups.
Next, we propose dimension-specific strategies for training the sub-metrics (§3.3). The sub-metrics, which target the representative dimensions, can also be applied to evaluate other dimensions in their respective dimension groups. Furthermore, both Yeh et al. (2021) and Zhang et al. (2021d) highlight that the combination of different metrics leads to better correlations with human evaluation than individual specialized metrics. We are thus motivated to explore how to combine the sub-metrics into a unified one. Specifically, both metric ensemble and multitask learning (Caruana, 1997) are examined (§3.4).
Finally, in the experiments (§5), we demonstrate that (1) the sub-metrics highly correlate with human judgment for their target dimensions; (2) the scores assigned by FineD-Eval are more interpretable than those of existing metrics; and (3) with either metric ensemble or multitask learning, FineD-Eval significantly outperforms existing state-of-the-art metrics as well as the individual sub-metrics on three high-quality dialogue-level evaluation benchmarks.
2 Analysis of Human Evaluation Data
2.1 Grouping of the Dimensions
In this section, we analyze the human ratings of FED (Mehri and Eskenazi, 2020a), a high-quality dialogue-level evaluation benchmark. Each dialogue in FED is annotated by five human judges for 11 different quality dimensions³, as shown in the axis labels of Figure 1. We choose FED for our analysis because the dataset covers the most comprehensive list of dialogue quality dimensions. In addition, the human annotation quality of FED is high, as evidenced by the strong inter-annotator agreements w.r.t. different dimensions⁴.

Figure 1: Spearman correlations of dimension pairs on FED.

Group | Quality Dimensions
------|------------------------------------------
Coh   | Coherence, Understanding
Lik   | Likability, Flexibility, Informativeness
Top   | Topic Depth, Diversity, Informativeness
Con   | Consistency
Inq   | Inquisitiveness
Err   | Error Recovery

Table 1: Grouping of the dimensions. We adopt the first three letters of the representative dimension within each group as the corresponding group name.

³ The detailed definitions of all dimensions are presented in Table 11 of the Appendix.
⁴ Above 0.75 in terms of Spearman correlation for all dimensions except consistency, which is 0.562.
Figure 1 presents the Spearman correlations of different dimension pairs on FED. We can observe that all dimensions are interdependent, with correlations ranging from 0.38 to 0.88. Based on their extent of interdependence, we cluster the 10 dimensions (excluding the "Overall" category) into six groups, as shown in Table 1. We adopt the first three letters of the representative dimension within each group as the corresponding group name. The representative dimension in each group is chosen based on criteria discussed in §2.2.
A dimension is treated as an independent group if it does not correlate strongly (< 0.75) with any of the other dimensions. Hence, consistency, inquisitiveness, and error recovery can be perceived as three independent dimension groups: Con, Inq, and Err, respectively. The remaining dimensions are more or less correlated with each other. We cluster these seven dimensions into three groups, Coh, Lik, and Top, based on the following four observations: (1) coherence strongly correlates with understanding (0.83); (2) the likability-flexibility and likability-informativeness correlations are both 0.82; (3) the correlation between topic depth and informativeness is as high as 0.84; and (4) diversity only strongly correlates with topic depth (0.8).
The categorization may not be perfect, as Coh, Lik, and Top are not completely independent of each other. For example, informativeness appears in both group Lik and group Top. A possible explanation is that humans generally like knowledgeable chatbots that can discuss different topics in depth rather than those that generate dull responses (See et al., 2019; Roller et al., 2021). To improve the categorization, future work may conduct a similar analysis on large-scale dialogue-level human annotations.
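Such a correlation analysis can be reproduced with a few lines of code. The sketch below is illustrative only: it assumes the averaged per-dialogue dimension ratings are available as columns of a pandas DataFrame, and it reuses the 0.75 threshold from the grouping discussion above.

```python
# Sketch of the dimension-correlation analysis in this section (illustration).
# `ratings` is assumed to be a DataFrame with one row per dialogue and one
# column per quality dimension, holding the averaged human Likert ratings.
import pandas as pd

def strongly_correlated_pairs(ratings: pd.DataFrame, threshold: float = 0.75):
    """Return dimension pairs whose Spearman correlation exceeds `threshold`."""
    corr = ratings.corr(method="spearman")
    dims = list(corr.columns)
    pairs = []
    for i, a in enumerate(dims):
        for b in dims[i + 1:]:
            if corr.loc[a, b] > threshold:
                pairs.append((a, b, round(float(corr.loc[a, b]), 2)))
    return pairs

# Dimensions that never appear in any strongly correlated pair (e.g.,
# consistency, inquisitiveness, error recovery on FED) form their own groups.
```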
2.2 Dimension Selection
As mentioned in §1, we want to identify fine-grained dimensions that are less similar. Hence, we select only one dimension from each group and avoid those that are shared between two different groups. In addition, to further reduce the complexity of FineD-Eval, we implement the following rules to narrow down the selection to only three fine-grained dimensions.
First, only dimensions that highly correlate with the "Overall" category (> 0.75) are considered. The intuition is that a high correlation with "Overall" indicates that the fine-grained dimension has more influence on human annotators' overall impression of a dialogue. Second, we filter out dimensions with low inter-annotator agreement (< 0.6)⁵, because low inter-annotator agreement may suggest that the dimension is complex to evaluate and that human annotators have different understandings of it (Mehri et al., 2022). Lastly, we choose dimensions based on how often they are marked as "N/A" (not applicable) by the human judges. A high frequency indicates that the dimension is not generally applicable in different contexts. Most dimensions do not contain an "N/A" rating, except "Error recovery", which is marked as "N/A" 25% of the time.

⁵ Only consistency has an inter-annotator agreement below 0.6.
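The three rules amount to a simple filter over per-dimension statistics. The sketch below is illustrative; the per-dimension statistics and the N/A cutoff value are assumptions, while the 0.75 and 0.6 thresholds follow the text.

```python
# Sketch of the dimension-selection rules (illustration). `stats` maps each
# dimension to precomputed values; the N/A cutoff is an assumption, the other
# thresholds follow the text.
def select_dimensions(stats: dict,
                      min_overall_corr: float = 0.75,
                      min_agreement: float = 0.6,
                      max_na_rate: float = 0.25) -> list:
    selected = []
    for dim, s in stats.items():
        if (s["corr_with_overall"] > min_overall_corr       # rule 1
                and s["inter_annotator_agreement"] >= min_agreement  # rule 2
                and s["na_rate"] < max_na_rate):             # rule 3
            selected.append(dim)
    return selected
```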
Based on the rules, we choose the following three dimensions: coherence, likability, and topic depth. In addition to the rules, we choose these dimensions because they are also widely studied in open-domain dialogue systems. Researchers spend a significant amount of effort on developing coherent, engaging, and knowledgeable chatbots (Adiwardana et al., 2020; Hedayatnia et al., 2020; Shuster et al., 2021). Designing meaningful metrics along these three dimensions can therefore benefit current open-domain dialogue research. Though other dimensions, such as consistency (Nie et al., 2021), inquisitiveness (See et al., 2019), and long-term memory (Xu et al., 2022), are equally important, their evaluation deserves a thorough study of its own. Hence, we leave them for future work.
3 Methodology
3.1 Problem Formulation
We formally define the dialogue-level evaluation task. Suppose that we have a dialogue evaluation dataset, $D$, which contains $n$ human-chatbot dialogues, $D = \{d_1, d_2, \ldots, d_j, \ldots, d_n\}$. Each dialogue $d_j$ is annotated by several human judges for a set of quality dimensions, $Q$. Each human judge provides a rating to $d_j$ for each individual dimension, $q \in Q$. We use $r^q_{d_j}$ to denote the average Likert rating provided by all human annotators to $d_j$ for $q$.

Our goal is to learn dimension-specific metrics, $M_q(d_j) \rightarrow s^q_{d_j}$, where $s^q_{d_j}$ is the metric score reflecting how good $d_j$ is for dimension $q$ as perceived by $M_q$. To assess the performance of $M_q$ on $D$, the correlation, denoted as $\rho_q$, between $S_q = \{s^q_{d_1}, \ldots, s^q_{d_j}, \ldots, s^q_{d_n}\}$ and $R_q = \{r^q_{d_1}, \ldots, r^q_{d_j}, \ldots, r^q_{d_n}\}$ is calculated. A higher $\rho_q$ indicates better performance of $M_q$ on $D$.
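In practice, $\rho_q$ can be computed with an off-the-shelf routine. A minimal sketch, assuming SciPy's Spearman implementation and lists holding the metric scores ($S_q$) and averaged human ratings ($R_q$):

```python
# Sketch of computing rho_q using SciPy (illustration).
from scipy.stats import spearmanr

def evaluate_metric(metric_scores, human_ratings):
    """metric_scores ~ S_q, human_ratings ~ R_q.
    Returns Spearman's rho and the associated p-value."""
    rho, p_value = spearmanr(metric_scores, human_ratings)
    return rho, p_value

# Example with toy numbers:
# rho, p = evaluate_metric([0.2, 0.8, 0.5], [1.5, 4.0, 3.0])
```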
3.2 General Framework
We propose a multi-dimensional dialogue-level metric, FineD-Eval, which is a combination of three specialized sub-metrics, $M_q$, where $q \in \{\text{coherence}, \text{likability}, \text{topic depth}\}$. We explore two approaches for combining the sub-metrics: metric ensemble and multitask learning. Metric ensemble is a late fusion approach whereby the predictions made by the sub-metrics are combined. Multitask learning, on the other hand, is an early fusion approach whereby the sub-metrics share a common text encoder while having different output layers. Details of both approaches are discussed in §3.4. Here, we focus on the details of $M_q$.
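As a rough illustration of the late-fusion idea (the actual combination schemes are specified in §3.4, not here), the sketch below simply averages the three sub-metric scores; the plain average is an assumption, not the paper's prescribed ensemble.

```python
# Illustrative late-fusion ensemble: average the sub-metric scores.
# The simple average is an assumption; see §3.4 for the actual schemes.
from typing import Callable, Dict

def ensemble_score(dialogue: str,
                   sub_metrics: Dict[str, Callable[[str], float]]) -> float:
    """sub_metrics maps dimension names (coherence, likability, topic depth)
    to scoring functions M_q; returns an averaged dialogue-level score."""
    scores = [metric(dialogue) for metric in sub_metrics.values()]
    return sum(scores) / len(scores)
```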
To train $M_q$, we formulate a preference learning approach (Fürnkranz and Hüllermeier, 2011). Given a pair of dimension-specific positive and negative training dialogue samples, ($d^+_{tr}$, $d^-_{tr}$), $M_q$ learns to predict a higher score for $d^+_{tr}$ than for $d^-_{tr}$. The strategies for constructing ($d^+_{tr}$, $d^-_{tr}$) are outlined in §3.3. During training, a mini-batch is formed with two types of data instances⁶: (1) ($d^+_{tr}$, $d^-_{tr}$) with label $y = 1$; and (2) ($d^-_{tr}$, $d^+_{tr}$) with label $y = -1$. $M_q$ outputs two scalar values, $s^q_{d^+_{tr}}$ and $s^q_{d^-_{tr}}$, that correspond to $d^+_{tr}$ and $d^-_{tr}$ respectively. The following margin ranking loss is adopted to train the model:

$$\mathcal{L}_q = \max\big(0,\, -y \cdot (x^q_1 - x^q_2) + 0.1\big) \qquad (1)$$

where $(x^q_1, x^q_2, y)$ can be either ($s^q_{d^+_{tr}}$, $s^q_{d^-_{tr}}$, 1) or ($s^q_{d^-_{tr}}$, $s^q_{d^+_{tr}}$, -1).

⁶ This formulation is to avoid the model relying on the positions of the dialogues to make predictions.
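Equation (1) is the standard margin ranking loss with a margin of 0.1, which is available directly in PyTorch. A minimal sketch with toy score values:

```python
# Sketch of the training objective in Eq. (1) using PyTorch's built-in
# margin ranking loss with margin = 0.1 (illustration).
import torch
import torch.nn as nn

ranking_loss = nn.MarginRankingLoss(margin=0.1)

# x1, x2: scores for the first and second dialogue in each pair;
# y = +1 when the first dialogue should be ranked higher, -1 otherwise.
x1 = torch.tensor([0.8, 0.3])   # e.g., (s of d+, s of d-) for instance type (1)
x2 = torch.tensor([0.4, 0.7])   #       (s of d-, s of d+) for instance type (2)
y = torch.tensor([1.0, -1.0])

loss = ranking_loss(x1, x2, y)  # max(0, -y * (x1 - x2) + 0.1), averaged
```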
The pairwise ranking formulation is motivated by previous works on dialogue evaluation (Mesgar et al., 2020; Huang et al., 2020; Gao et al., 2020; Zhang et al., 2021a). Compared to direct assessment approaches (Zhang et al., 2021c; Ghazarian et al., 2022), the main advantage of pairwise ranking is that the model can implicitly learn the features that distinguish the good dialogues from the bad ones based on a large quantity of dialogue pairs for a specific quality dimension.
The network architecture of $M_q$ is straightforward. RoBERTa-base (Liu et al., 2019) is adopted as the text encoder, $T$, which maps ($d^+_{tr}$, $d^-_{tr}$) to dense representations ($H^+_{tr}$, $H^-_{tr}$). Both $d^+_{tr}$ and $d^-_{tr}$ are formulated as a token sequence with the special token "</UTT>" to delimit different utterances. Next, ($H^+_{tr}$, $H^-_{tr}$) are converted into vector representations ($h^+_{tr}$, $h^-_{tr}$) with average pooling. Through a linear layer with output size 1 and a sigmoid activation function, $h^+_{tr}$ and $h^-_{tr}$ are transformed into scalar values $s^q_{d^+_{tr}}$ and $s^q_{d^-_{tr}}$ respectively. During inference, given $d_j \in D$, the scalar value $s^q_{d_j}$ output by $M_q$ is the corresponding metric score.
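A compact sketch of the described architecture, assuming HuggingFace Transformers and PyTorch; details not stated in the text (e.g., truncation length, how the delimiter token is registered) are assumptions.

```python
# Sketch of the sub-metric M_q: RoBERTa-base encoder, average pooling over
# token representations, then a linear layer with a sigmoid producing a
# scalar score (illustration, not the released implementation).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SubMetric(nn.Module):
    def __init__(self, model_name: str = "roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.scorer = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # average pooling
        return torch.sigmoid(self.scorer(pooled)).squeeze(-1)   # score in (0, 1)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
# Register the utterance delimiter as a special token (assumed handling).
tokenizer.add_special_tokens({"additional_special_tokens": ["</UTT>"]})
model = SubMetric()
model.encoder.resize_token_embeddings(len(tokenizer))

dialogue = " </UTT> ".join(["Hi, how are you?", "I am good, thanks!"])
inputs = tokenizer(dialogue, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(inputs["input_ids"], inputs["attention_mask"])
```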
3.3 Dimension-Specific Sampling Strategies
In this section, we discuss different strategies to obtain dimension-specific training dialogue pairs. All ($d^+_{tr}$, $d^-_{tr}$) samples are automatically constructed from human-human dialogue datasets without reliance on human annotations.
Coherence (Coh). We consider two strategies for coherence. The first is utterance order shuffling, whereby dialogues from existing human-human dialogue corpora (Li et al., 2017; Dinan et al., 2020) are treated as $d^+_{tr}$. To obtain $d^-_{tr}$, we randomly permute the order of utterances in $d^+_{tr}$. This strategy has been widely adopted in previous dialogue coherence studies (Cervone et al., 2018; Mesgar et al., 2020; Zhang et al., 2021a).
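A minimal sketch of the utterance order shuffling strategy; the guard against sampling the identity permutation is an added safeguard, not described in the text.

```python
# Sketch of utterance order shuffling (illustration): the original dialogue
# is the positive sample, a random permutation of its utterances the negative.
import random
from typing import List, Tuple

def shuffle_pair(dialogue: List[str]) -> Tuple[List[str], List[str]]:
    """Return (d_plus, d_minus) for the coherence sub-metric."""
    assert len(dialogue) > 1, "need at least two utterances to permute"
    d_plus = list(dialogue)
    d_minus = list(dialogue)
    while d_minus == d_plus:          # avoid sampling the identity permutation
        random.shuffle(d_minus)
    return d_plus, d_minus
```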
The second strategy, question-answer (QA) relevance scoring, is motivated by the Gricean maxims (Grice, 1975), whereby effective communication involves being relevant, i.e., one should provide information that is relevant to the current exchange. A natural and logical flow of conversation often involves asking and answering questions, which is a form of information exchange. Humans usually prefer answers that are straight to the point rather than those that are vague and off-topic. Concretely, we select dialogues in existing dialogue corpora⁷ that contain more than 4 utterances and at least one question-answer pair. Next, we use a pretrained BERT-based QA evaluator from HuggingFace⁸ to score each QA pair within a dialogue. The evaluator provides a relevance score between 0 and 1 (the higher, the better). Then, we average the relevance scores of all QA pairs within the dialogue to derive the dialogue-level QA relevance score. Finally, two thresholds, ($\tau^{rel}_{low}$, $\tau^{rel}_{high}$), are chosen: dialogues with scores lower than $\tau^{rel}_{low}$ are considered $d^-_{tr}$, and those with scores higher than $\tau^{rel}_{high}$ are considered $d^+_{tr}$. ($\tau^{rel}_{low}$, $\tau^{rel}_{high}$) are heuristically determined to ensure sufficient data in both the positive and negative classes.
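A sketch of the QA relevance labeling step. The `score_qa_pair` callable is a placeholder for the pretrained QA evaluator referenced in footnote 8 (its exact interface is not described here), and the threshold values shown are illustrative; the paper chooses them heuristically to balance the two classes.

```python
# Sketch of QA relevance labeling (illustration). `score_qa_pair` stands in
# for the QA evaluator and is assumed to return a relevance score in [0, 1].
# Threshold values are placeholders for tau_low and tau_high.
from statistics import mean
from typing import Callable, List, Optional, Tuple

def label_dialogue(qa_pairs: List[Tuple[str, str]],
                   score_qa_pair: Callable[[str, str], float],
                   tau_low: float = 0.3,
                   tau_high: float = 0.7) -> Optional[str]:
    """Average per-pair relevance and map the result to a training label.
    Assumes the dialogue contains at least one QA pair, per the selection step."""
    dialogue_score = mean(score_qa_pair(q, a) for q, a in qa_pairs)
    if dialogue_score < tau_low:
        return "negative"   # d_minus
    if dialogue_score > tau_high:
        return "positive"   # d_plus
    return None             # discarded: neither clearly good nor clearly bad
```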
Likability (Lik). Two strategies are applied to construct $d^+_{tr}$ and $d^-_{tr}$ for likability. The first strategy, contradiction scoring, is motivated by the similarity attraction effect (Byrne et al., 1968; Nass and Lee, 2001). During human-human interaction, people tend to favour others who share similar opinions or preferences with them. On the contrary, conveying contradictory opinions or information may lead to disagreement and user dissatisfaction.
⁷ We hypothesize that even in human-human dialogue corpora, there are answers that are vague and off-topic due to the presence of low-quality crowd-source workers.

⁸ https://huggingface.co/iarfmoose/bert-base-cased-qa-evaluator