ENDEX: Evaluation of Dialogue Engagingness at Scale

Guangxuan Xu¹, Ruibo Liu², Fabrice Harel-Canada¹, Nischal Reddy Chandra¹, Nanyun Peng¹
¹University of California, Los Angeles   ²Dartmouth College
{gxu21, violetpeng}@cs.ucla.edu
Abstract

We propose ENDEX, the first human-reaction-based model to evaluate dialogue engagingness. ENDEX is trained on the 80k-sample Reddit Engagement Dataset (RED), curated using a novel distant-supervision framework. Engagingness is a key measure that captures the high-level quality of AI dialogue systems and closely reflects actual user experience. However, data shortage, together with the abstract and extensive definition of engagingness, makes it challenging to develop an automatic metric. Our work departs from mainstream approaches that use synthetic negative examples to train binary classifiers and instead proposes a solution using distant supervision from human-reaction feedback. To support the soundness of our ENDEX metric, we offer a theoretical foundation for engagement, an extensive ablation study, and empirical evidence of high correlation on five engagingness-related datasets.¹
1 Introduction

Many modern generative language models are trained to maximize a likelihood objective, but this paradigm tends to assign high probability to generic responses such as "I don't know." (Li et al., 2016). Prior research has established that people prefer to converse with interesting, creative, and informative agents (See et al., 2019), all concepts broadly related to the notion of engagingness. Furthermore, engagingness is recognized as a key evaluation metric for the quality of dialogue systems (Zhang et al., 2018; Ghazarian et al., 2020). For example, FAIR's ParlAI (Miller et al., 2017) incorporated engagingness as the default testing metric in the Blenderbot system (Roller et al., 2021); dialogue data challenges such as ConvAI2 (Dinan et al., 2019) and the Amazon Alexa Prize², as well as ensemble metrics like FED (Mehri and Eskenazi, 2020), all measure engagingness to benchmark dialogue quality.
However, the current evaluation of engagingness still primarily relies on expensive human annotation rather than off-the-shelf automatic tools, due to several theoretical and technical challenges. First, unlike more well-characterized properties such as fluency, the definition of engagingness is significantly more abstract and multi-dimensional (See et al., 2019), requiring well-tuned quality metrics for each sub-dimension to aggregate into a final score. Second, what qualifies as engaging is open-ended, and many different answers may embody the concept (Ghazarian et al., 2020); therefore, reference-based metrics that require a unique ground truth, such as BLEURT (Sellam et al., 2020) and BERTScore (Zhang et al., 2020), cannot apply. Third, there is an acute shortage of large-scale, high-quality data annotated for engagingness.

¹ The off-the-shelf ENDEX model and the RED dataset are available at https://github.com/gxxu-ml/EnDex.
² https://www.amazon.science/alexa-prize

Figure 1: Example of an online post with scores for emotional engagement (EE), attentional engagement (AE), and behavioral engagement (BE) in blue, representing the three dimensions of human engagement; reply engagement (RE) in red; and the aggregated ENDEX score in green. We apply a z-score to the ENDEX score and pick a hyper-parameter threshold to cluster posts into positive and negative samples.
arXiv:2210.12362v1 [cs.CL] 22 Oct 2022

Ghazarian et al. (2020) jump-started efforts to automatically measure dialogue engagement: they fine-tuned a BERT-based model (Devlin et al., 2019) on the ConvAI2 and DailyDialog (Li et al., 2017) datasets to predict an engagingness score. However, fine-tuning on small supervised datasets can easily lead to overfitting and generalization problems. Another high-performing engagingness metric, USL-H (Phy et al., 2020), assumes a positive set and generates synthetic negative samples to train the model. However, credible positive samples are not always available, and synthetic negative samples may not be challenging enough to further advance classifier performance.
In light of the above challenges, we propose ENDEX, a novel metric trained with distantly supervised data to predict turn-level dialogue engagingness (Figure 1). ENDEX requires neither human annotations nor direct disentanglement of engagingness. Instead, we leverage observed user reactions to posts as distant signals to model engagingness, which marks a departure from the mainstream approach of training on synthetic negative samples (Lan et al., 2020; Ghazarian et al., 2022; Tao et al., 2018; Sato et al., 2020). ENDEX trains on real conversations sourced from Reddit that are automatically annotated as positive and negative examples with our framework. The resulting novel dataset, named RED (Reddit Engagement Dataset), contains over 80k labelled samples. The ENDEX framework derives its theoretical underpinning from relevant HCI work and shows superior performance on five benchmark datasets.
2 EnDex Metric

Engagingness is not only a linguistic concept useful for dialogue systems; it also manifests across multiple modalities and is extensively leveraged to benchmark gaming and online-learning experiences (Silpasuwanchai et al., 2016; Chen et al., 2005; Mcmahan, 2003; Schoenau-Fog, 2011). Our work is inspired by the HCI study of human engagement (Ma, 2018), which decomposes engagingness into three major dimensions: attentional engagement (e.g., clicks and scrolls), behavioral engagement (e.g., facial expressions), and emotional engagement (e.g., heart rate).
The ENDEX metric follows the same intuition: we can infer the engagingness of a text by analyzing human reactions to it, for which there is abundant data on social media. The ENDEX metric learns from our distantly supervised RED dataset, which measures dialogue engagement along four dimensions as shown in Figure 1: three dimensions correspond to the original human-engagement definition, and one distinct reply engagement dimension is added for the dialogue-specific task.
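To make the four dimensions concrete, the sketch below derives toy raw signals for each from a post's metadata. All field names and proxies here (vote counts for attention, awards for behavior, an `is_emotional` flag on replies) are illustrative assumptions only; the actual RED pipeline relies on learned classifiers and thread-level statistics described in the following sections.

```python
def raw_engagement_signals(post):
    """Toy per-dimension signals for one post.

    Illustrative assumptions, not the paper's pipeline:
    - EE: fraction of replies flagged as emotional
    - AE: total votes cast (a proxy for attention)
    - BE: number of awards (a proxy for effortful behavior)
    - RE: number of replies
    """
    replies = post["replies"]
    n = max(len(replies), 1)  # avoid division by zero for reply-less posts
    ee = sum(r["is_emotional"] for r in replies) / n
    ae = post["upvotes"] + post["downvotes"]
    be = post["awards"]
    re = len(replies)
    return {"EE": ee, "AE": ae, "BE": be, "RE": re}
```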
                Engaging        Non-engaging
# of samples    40,162          40,162
Emotional       .605 ± .273     .152 ± .120
Attentional     .759 ± .127     .203 ± .100
Behavioral      .659 ± .274     .318 ± .285
Reply           .718 ± .154     .354 ± .980
ENDEX           .709 ± .048     .259 ± .033

Table 1: The RED dataset has two classes, engaging and non-engaging, clustered by applying a z-score to the ENDEX score. This table shows the mean and standard deviation of the sub-dimension scores for both classes; the last row displays the distribution of the overall ENDEX score.
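As a minimal sketch of how sub-dimension scores such as those in Table 1 might be combined, the function below averages the four pre-normalized dimensions with equal weights. The equal weighting is an assumption for illustration; the paper's actual aggregation formula (Section 2.4) may weight the dimensions differently.

```python
def endex_score(ee, ae, be, re):
    """Aggregate the four engagement sub-dimensions into one score.

    Each input is assumed to be pre-normalized to [0, 1]; the
    equal-weight average is an illustrative choice, not the
    paper's exact formula.
    """
    return (ee + ae + be + re) / 4.0
```

For example, plugging in the mean sub-dimension scores of the engaging class from Table 1 yields a value in the same range as the reported overall ENDEX mean.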
2.1 Reddit Engagement Dataset (RED)

We curate the Reddit Engagement Dataset (RED), a distant-supervision set of 80k single-turn conversations. We source RED from Reddit, sampling from 43 popular subreddits and processing a total of 5 million posts, filtering out data that was non-conversational, toxic, or whose popularity could not be ascertained; the resulting data distribution of RED is shown in Table 1. The following sections explain the procedure for automatically annotating ENDEX scores and clustering samples into positive and negative sets.
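The three-way filtering step can be sketched as follows. The post fields (`num_comments`, `score_hidden`, `body`) follow common Reddit data-dump conventions, and the toxicity check is a stand-in for whatever classifier the actual pipeline used; treat all of it as an assumption for illustration.

```python
def looks_toxic(text):
    # Placeholder: a real pipeline would call a toxicity classifier here.
    # The blocklist entries are dummy tokens, not a real word list.
    return any(w in text.lower() for w in {"slur1", "slur2"})

def keep_post(post):
    """Apply the three filters described above to one Reddit post."""
    if post.get("num_comments", 0) == 0:   # non-conversational: no replies
        return False
    if looks_toxic(post["body"]):          # toxic content
        return False
    if post.get("score_hidden", False):    # popularity cannot be ascertained
        return False
    return True
```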
We also curated a RED test set with 150 human-annotated samples obtained from a split disjoint from RED. The inter-annotator agreement is 0.34 Fleiss' kappa, indicating fair agreement, which reflects the challenge of determining engagingness.
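For readers who want to reproduce an agreement figure like the 0.34 reported above, the following is the standard Fleiss' kappa computation over an item-by-category count matrix (applied here to toy data, not the RED test set).

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a list of rows, one per item, where each row
    counts how many annotators assigned each category to that item.
    Every item must be rated by the same number of annotators."""
    n_items = len(table)
    n_raters = sum(table[0])
    # Mean per-item observed agreement.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ) / n_items
    # Chance agreement from the marginal category proportions.
    n_cats = len(table[0])
    totals = [sum(row[j] for row in table) for j in range(n_cats)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)
```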
2.2 Distantly-Supervised Engagingness Scores

We use distant supervision to assign each sample in RED an ENDEX score, which is the aggregate of four engagement dimensions. Section 2.2 discusses the intuition for each engagingness dimension; Section 2.3 explains how to adjust the raw score by thread popularity; Section 2.4 lays out the formula to normalize and aggregate the sub-dimensions into the overall engagingness score; and Section 2.5 explains sampling with z-scores to convert the task into binary classification.
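The z-score clustering step can be sketched as follows: standardize the aggregated scores, then threshold the standardized values to obtain positive and negative samples. The threshold `tau` stands in for the paper's hyper-parameter; its value here is an assumption, and dropping near-threshold posts as ambiguous is likewise an illustrative choice.

```python
import statistics

def zscore_labels(scores, tau=0.5):
    """Standardize raw engagingness scores and cluster them into
    positive (1, engaging) and negative (0, non-engaging) samples.
    Posts with |z| below tau are treated as ambiguous (None)."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)
    labels = []
    for s in scores:
        z = (s - mean) / std
        if z >= tau:
            labels.append(1)      # engaging
        elif z <= -tau:
            labels.append(0)      # non-engaging
        else:
            labels.append(None)   # ambiguous, filtered out
    return labels
```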
Emotional Engagement (EE): Emotional connection is a key sign of human engagement (Savin-Baden et al., 2014), and we model EE using a multi-class emotion classifier (Demszky et al., 2020) on post replies. If a post receives