ENDEX: Evaluation of Dialogue Engagingness at Scale

Guangxuan Xu¹, Ruibo Liu², Fabrice Harel-Canada¹, Nischal Reddy Chandra¹, Nanyun Peng¹
¹University of California, Los Angeles   ²Dartmouth College
{gxu21, violetpeng}@cs.ucla.edu
Abstract

We propose ENDEX, the first human-reaction-based model to evaluate dialogue engagingness. ENDEX is trained on the 80k-sample Reddit Engagement Dataset (RED), curated using a novel distant-supervision framework. Engagingness is a key measure that captures the high-level quality of AI dialogue systems and closely reflects actual user experience. However, data shortage, together with the abstract and extensive definition of engagingness, makes it challenging to develop an automatic metric. Our work departs from mainstream approaches that use synthetic negative examples to train binary classifiers and instead proposes a solution using distant supervision from human-reaction feedback. To support the soundness of our ENDEX metric, we offer a theoretical foundation for engagement, an extensive ablation study, and empirical evidence of high correlation on five engagingness-related datasets.¹
1 Introduction

Many modern generative language models are trained to maximize a likelihood objective, but this paradigm tends to assign high probability to generic responses such as "I don't know." (Li et al., 2016). Prior research has established that people prefer to converse with interesting, creative, and informative agents (See et al., 2019), all concepts broadly related to the notion of engagingness. Furthermore, engagingness is recognized as a key evaluation metric for the quality of dialogue systems (Zhang et al., 2018; Ghazarian et al., 2020). For example, FAIR's ParlAI (Miller et al., 2017) incorporated engagingness as the default testing metric in the Blenderbot system (Roller et al., 2021); dialogue data challenges such as ConvAI2 (Dinan et al., 2019) and the Amazon Alexa Prize², as well as ensemble metrics like FED (Mehri and Eskenazi, 2020), all measure engagingness to benchmark dialogue quality.
However, the current evaluation of engagingness still primarily relies on expensive human annotation rather than off-the-shelf automatic tools, due to several theoretical and technical challenges. First, unlike more well-characterized properties such as fluency, the definition of engagingness is significantly more abstract and multi-dimensional (See et al., 2019), requiring well-tuned quality metrics for each sub-dimension to aggregate into a final score. Second, what qualifies as engaging is open-ended, and many different answers may embody the concept (Ghazarian et al., 2020); therefore, reference-based metrics that require a unique ground truth, such as BLEURT (Sellam et al., 2020) and BERTScore (Zhang et al., 2020), cannot apply. Third, there is an acute shortage of large-scale, high-quality data annotated for engagingness.

¹ The off-the-shelf ENDEX model and the RED dataset are available at https://github.com/gxxu-ml/EnDex.
² https://www.amazon.science/alexa-prize

Figure 1: Example of an online post with scores for emotional engagement (EE), attentional engagement (AE), and behavioral engagement (BE) in blue, representing the three dimensions of human engagement; reply engagement (RE) in red; and the aggregated ENDEX score in green. We apply a z-score to the ENDEX score and pick a hyper-parameter threshold to cluster posts into positive and negative samples.
arXiv:2210.12362v1 [cs.CL] 22 Oct 2022

Ghazarian et al. (2020) jump-started efforts to automatically measure dialogue engagement: they fine-tuned a BERT-based model (Devlin et al., 2019) on the ConvAI2 and DailyDialog (Li et al., 2017) datasets to predict an engagingness score. However, fine-tuning on small supervised datasets can easily lead to overfitting and generalization problems. Another high-performing engagingness metric, USL-H (Phy et al., 2020), assumes a positive set and generates synthetic negative samples to train the model. However, credible positive samples are not always available, and synthetic negative samples may not be challenging enough to further advance classifier performance.
In light of the above challenges, we propose ENDEX, a novel metric trained with distantly supervised data to predict turn-level dialogue engagingness (Figure 1). ENDEX requires neither human annotations nor direct disentanglement of engagingness. Instead, we leverage observed user reactions to posts as distant signals to model engagingness, which marks a departure from the mainstream approach of training on synthetic negative samples (Lan et al., 2020; Ghazarian et al., 2022; Tao et al., 2018; Sato et al., 2020). ENDEX trains on real conversations sourced from Reddit that are automatically annotated as positive and negative examples with our framework. The resulting novel dataset, named RED (Reddit Engagement Dataset), contains over 80k labelled samples. The ENDEX framework derives its theoretical underpinning from relevant HCI work and shows superior performance on five benchmark datasets.
2 EnDex Metric

Engagingness is not only a linguistic concept useful for dialogue systems; it also manifests across multiple modalities and is extensively leveraged to benchmark gaming and online-learning experiences (Silpasuwanchai et al., 2016; Chen et al., 2005; Mcmahan, 2003; Schoenau-Fog, 2011). Our work is inspired by the HCI study of human engagement (Ma, 2018), which decomposes engagingness into three major dimensions: attentional engagement (e.g., clicks and scrolls), behavioral engagement (e.g., facial expressions), and emotional engagement (e.g., heart rate).
The ENDEX metric follows the same intuition: we can infer the engagingness of a text by analyzing human reactions to it, for which there is abundant data on social media. The ENDEX metric learns from our distantly supervised RED dataset, which measures dialogue engagement along four dimensions as shown in Figure 1: three dimensions correspond to the original human-engagement definition, and one distinct reply engagement dimension is added for the dialogue-specific task.
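To make the four dimensions concrete, the sketch below derives toy raw signals for each from a post's metadata. All field names and proxies here (vote counts for attention, awards for behavior, an `is_emotional` flag on replies) are illustrative assumptions only; the actual RED pipeline relies on learned classifiers and thread-level statistics described in the following sections.

```python
def raw_engagement_signals(post):
    """Toy per-dimension signals for one post.

    Illustrative assumptions, not the paper's pipeline:
    - EE: fraction of replies flagged as emotional
    - AE: total votes cast (a proxy for attention)
    - BE: number of awards (a proxy for effortful behavior)
    - RE: number of replies
    """
    replies = post["replies"]
    n = max(len(replies), 1)  # avoid division by zero for reply-less posts
    ee = sum(r["is_emotional"] for r in replies) / n
    ae = post["upvotes"] + post["downvotes"]
    be = post["awards"]
    re = len(replies)
    return {"EE": ee, "AE": ae, "BE": be, "RE": re}
```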
                Engaging        Non-engaging
# of samples    40,162          40,162
Emotional       .605 ± .273     .152 ± .120
Attentional     .759 ± .127     .203 ± .100
Behavioral      .659 ± .274     .318 ± .285
Reply           .718 ± .154     .354 ± .980
ENDEX           .709 ± .048     .259 ± .033

Table 1: The RED dataset has two classes, engaging and non-engaging, clustered by applying a z-score to the ENDEX score. This table shows the mean and standard deviation of the sub-dimension scores for both classes; the last row displays the distribution of the overall ENDEX score.
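As a minimal sketch of how sub-dimension scores such as those in Table 1 might be combined, the function below averages the four pre-normalized dimensions with equal weights. The equal weighting is an assumption for illustration; the paper's actual aggregation formula (Section 2.4) may weight the dimensions differently.

```python
def endex_score(ee, ae, be, re):
    """Aggregate the four engagement sub-dimensions into one score.

    Each input is assumed to be pre-normalized to [0, 1]; the
    equal-weight average is an illustrative choice, not the
    paper's exact formula.
    """
    return (ee + ae + be + re) / 4.0
```

For example, plugging in the mean sub-dimension scores of the engaging class from Table 1 yields a value in the same range as the reported overall ENDEX mean.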
2.1 Reddit Engagement Dataset (RED)

We curate the Reddit Engagement Dataset (RED), a distant-supervision set of 80k single-turn conversations. We source RED from Reddit, sampling from 43 popular subreddits and processing a total of 5 million posts, filtering out data that was non-conversational, toxic, or whose popularity could not be ascertained; the resulting data distribution of RED is shown in Table 1. The following sections explain the procedure for automatically annotating ENDEX scores and clustering samples into positive and negative sets.
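The three-way filtering step can be sketched as follows. The post fields (`num_comments`, `score_hidden`, `body`) follow common Reddit data-dump conventions, and the toxicity check is a stand-in for whatever classifier the actual pipeline used; treat all of it as an assumption for illustration.

```python
def looks_toxic(text):
    # Placeholder: a real pipeline would call a toxicity classifier here.
    # The blocklist entries are dummy tokens, not a real word list.
    return any(w in text.lower() for w in {"slur1", "slur2"})

def keep_post(post):
    """Apply the three filters described above to one Reddit post."""
    if post.get("num_comments", 0) == 0:   # non-conversational: no replies
        return False
    if looks_toxic(post["body"]):          # toxic content
        return False
    if post.get("score_hidden", False):    # popularity cannot be ascertained
        return False
    return True
```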
We also curated a RED test set with 150 human-annotated samples obtained from a split disjoint from RED. The inter-annotator agreement is 0.34 Fleiss' kappa, indicating fair agreement, which reflects the challenge of determining engagingness.
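For readers who want to reproduce an agreement figure like the 0.34 reported above, the following is the standard Fleiss' kappa computation over an item-by-category count matrix (applied here to toy data, not the RED test set).

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a list of rows, one per item, where each row
    counts how many annotators assigned each category to that item.
    Every item must be rated by the same number of annotators."""
    n_items = len(table)
    n_raters = sum(table[0])
    # Mean per-item observed agreement.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ) / n_items
    # Chance agreement from the marginal category proportions.
    n_cats = len(table[0])
    totals = [sum(row[j] for row in table) for j in range(n_cats)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)
```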
2.2 Distantly-Supervised Engagingness Scores

We use distant supervision to assign each sample in RED an ENDEX score, which is the aggregate of four engagement dimensions. Section 2.2 discusses the intuition for each engagingness dimension; Section 2.3 explains how to adjust the raw score by thread popularity; Section 2.4 lays out the formula to normalize and aggregate the sub-dimensions into the overall engagingness score; and Section 2.5 explains sampling with z-scores to convert the task into binary classification.
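The z-score clustering step can be sketched as follows: standardize the aggregated scores, then threshold the standardized values to obtain positive and negative samples. The threshold `tau` stands in for the paper's hyper-parameter; its value here is an assumption, and dropping near-threshold posts as ambiguous is likewise an illustrative choice.

```python
import statistics

def zscore_labels(scores, tau=0.5):
    """Standardize raw engagingness scores and cluster them into
    positive (1, engaging) and negative (0, non-engaging) samples.
    Posts with |z| below tau are treated as ambiguous (None)."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)
    labels = []
    for s in scores:
        z = (s - mean) / std
        if z >= tau:
            labels.append(1)      # engaging
        elif z <= -tau:
            labels.append(0)      # non-engaging
        else:
            labels.append(None)   # ambiguous, filtered out
    return labels
```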
Emotional Engagement (EE): Emotional connection is a key sign of human engagement (Savin-Baden et al., 2014), and we model EE using a multi-class emotion classifier (Demszky et al., 2020) on post replies. If a post receives