Measuring and Improving Semantic Diversity of Dialogue Generation
Seungju Han Beomsu Kim Buru Chang
Hyperconnect
wade3han@snu.ac.kr, {beomsu.kim,buru.chang}@hpcnt.com
Abstract
Response diversity has become an important criterion for evaluating the quality of open-domain dialogue generation models. However, current evaluation metrics for response diversity often fail to capture the semantic diversity of generated responses, as they mainly consider lexical aspects of the generated responses. In this paper, we introduce a new automatic evaluation metric to measure the semantic diversity of generated responses. Through human evaluation, we demonstrate that our proposed metric captures human judgments on response diversity better than existing lexical-level diversity metrics. Furthermore, motivated by analyzing an existing dialogue dataset, we propose a simple yet effective learning method that improves the semantic diversity of generated responses. Our learning method weights training samples based on the semantic distribution of the training set. We show that our learning method improves response diversity and coherency better than other baseline methods through automatic and human evaluation.
1 Introduction
Open-domain dialogue generation (Sordoni et al., 2015; Bordes et al., 2017) has greatly progressed with the development of large-scale pretrained language models (Radford et al., 2019; Roller et al., 2021) in the last decade. However, although dialogue generation models can produce fluent responses, they are also known for frequently generating dull and uninformative generic responses (e.g., "I don't know"), degrading their engagingness (Serban et al., 2016; Li et al., 2016a). To alleviate this problem, many studies (Zhao et al., 2017; Li et al., 2017a; Zhang et al., 2018) have been conducted to enhance the diversity of generated responses, and response diversity has become an important criterion for evaluating the quality of generated responses.

Figure 1: An illustration of our proposed Sem-Ent that measures semantic diversity based on the semantic distribution of generated responses. (In the figure, responses generated by Model A and Model B for the test-set contexts are mapped into a semantic latent space; the resulting semantic distributions yield a Sem-Ent of 1.3 for Model A versus 0.6 for Model B.)
The current evaluation protocol has relied on lexical-level evaluation metrics such as Distinct-n (Dist-n) (Li et al., 2016a) and Entropy-n (Ent-n) (Serban et al., 2017) to measure the diversity of generated responses. However, according to recent studies (Tevet and Berant, 2021; Stasaski and Hearst, 2022), these lexical-level evaluation metrics often fail to capture semantic diversity, since responses including similar words can have different semantics and responses with different words can have similar semantics (Yarats and Lewis, 2018).
In this paper, we propose Sem-Ent (Semantic-Entropy), a new automatic evaluation metric for measuring the semantic diversity of generated responses. Sem-Ent first maps generated responses into a semantic latent space using a pretrained language model (e.g., DialoGPT (Zhang et al., 2020) or BERT (Devlin et al., 2019)). Then, the metric measures the semantic diversity of the generated responses by calculating, based on entropy, how evenly the responses are distributed in the semantic latent space, as shown in Figure 1. Through human evaluation, we demonstrate that Sem-Ent is more highly correlated with human judgments on response diversity than existing lexical-level evaluation metrics.
Furthermore, we propose a simple yet effective learning method for dialogue generation models that improves the semantic diversity of generated responses. We observe that the semantic distribution of responses in a dialogue dataset is highly imbalanced, leading dialogue generation models to produce semantically less diverse responses. To address this problem, our proposed method, DRESS (Diversifying RESponses Semantically), learns more from responses with rare semantics and less from responses with frequent semantics. As a result, dialogue generation models produce more semantically diverse responses. Experiments on two benchmark datasets demonstrate that DRESS achieves better semantic diversity than state-of-the-art baseline methods, along with a gain in response coherency. Interestingly, DRESS also achieves better performance on lexical-level diversity metrics than the baselines, even though it focuses only on improving semantic diversity. Moreover, human evaluation results show the effectiveness of DRESS, which outperforms all baseline methods in the appropriateness and informativeness of generated responses.
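The exact weighting scheme of DRESS is defined later in the paper; the following is only a rough, hypothetical sketch of the idea described above, in which each training sample is weighted inversely to the frequency of its semantic cluster so that responses with rare semantics contribute more to the loss. The function name, the exponent alpha, and the normalization are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def semantic_sample_weights(cluster_ids, k, alpha=1.0):
    """Hypothetical inverse-cluster-frequency weights; NOT the paper's exact DRESS formulation.

    cluster_ids: semantic cluster index of each training response (e.g., from k-means over e(r)).
    Responses whose semantic cluster is rare in the training set receive larger weights.
    """
    freq = np.bincount(cluster_ids, minlength=k) / len(cluster_ids)   # empirical cluster probabilities
    weights = (1.0 / np.maximum(freq[cluster_ids], 1e-12)) ** alpha   # rarer cluster -> larger weight
    return weights / weights.mean()                                   # normalize to mean 1

# The per-sample negative log-likelihoods of the dialogue model could then be combined as
#   loss = (semantic_sample_weights(cluster_ids, k) * per_sample_nll).mean()
```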
Our Contributions: (1) A new automatic evaluation metric for measuring semantic diversity (Sem-Ent), which is highly correlated with human judgments on response diversity. (2) A simple yet effective learning method for dialogue generation models (DRESS) that improves the semantic diversity of generated responses. (3) Experiments on two benchmark datasets showing that DRESS outperforms the baseline methods in both semantic diversity and lexical-level diversity. (4) An implementation of Sem-Ent, which will be released, contributing to the community of open-domain dialogue generation.
2 Related Work
2.1 Enhancing Response Diversity
Since generating dull and uninformative responses is a well-known and important problem in open-domain dialogue (Vinyals and Le, 2015; Li et al., 2016a), numerous methods have been proposed to address this issue. The maximum mutual information objective function is utilized to penalize generic responses and improve the diversity of generated responses (Li et al., 2016a,c; Zhang et al., 2018, 2020). Another line of work improves diversity by modeling the one-to-many relationship of open-domain dialogue using latent variables to generate multiple and diverse responses (Serban et al., 2017; Zhao et al., 2017; Bao et al., 2020a,b; Chen et al., 2019; Zhang et al., 2019; Gao et al., 2019). Some methods selectively penalize frequent responses by removing them from the training set (Csáky et al., 2019) or applying negative training to frequent responses (He and Glass, 2020). Using different decoding algorithms can also improve response diversity; Li et al. (2016b) and Vijayakumar et al. (2018) directly modify the beam search algorithm to promote response diversity. Sampling-based decoding algorithms such as top-k sampling (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2019) are also known to improve the diversity of generated responses. Wang et al. (2021) diversify responses by adaptively modifying the target token distribution with a lightweight decoder to prevent the model from being over-confident.
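For concreteness, the snippet below shows how sampling-based decoding is typically invoked with Hugging Face Transformers. It uses DialoGPT and illustrative hyperparameters rather than the specific models and settings studied in the cited works; it is a minimal sketch, not a reproduction of any of them.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium").eval()

context = "I just got back from a hiking trip."
inputs = tokenizer(context + tokenizer.eos_token, return_tensors="pt")

with torch.no_grad():
    # Sampling-based decoding: top-k (Fan et al., 2018) and nucleus/top-p (Holtzman et al., 2019)
    # typically yield more diverse responses than greedy decoding or plain beam search.
    out = model.generate(
        **inputs,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        max_new_tokens=40,
        pad_token_id=tokenizer.eos_token_id,
    )

# Keep only the newly generated tokens as the response.
response = tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
```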
2.2 Metrics for Capturing Response Diversity
Response diversity metrics for open-domain dialogue generation models can mainly be categorized into two groups. Referenced metrics (Zhao et al., 2017; Gao et al., 2019) use the reference responses provided by human annotators to capture response diversity by computing a recall value based on various similarity metrics such as BLEU and embedding similarity. On the other hand, unreferenced metrics measure response diversity without using reference responses written by human annotators. Unreferenced metrics are more widely adopted than referenced metrics because they can measure response diversity even in the absence of reference responses. Dist-n (Li et al., 2016a) measures response diversity as the fraction of distinct n-grams over all possible n-grams in the generated responses. The Ent-n metric (Serban et al., 2017; Zhang et al., 2018) improves on Dist-n by taking the frequency differences of n-grams into account. Low-Frequency (LF) (Li et al., 2019) calculates the frequency of low-frequency words in generated responses as the response diversity.
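As a minimal sketch of these lexical-level metrics (whitespace tokenization and this particular Ent-n formulation are simplifying assumptions; the implementations in the cited papers may differ in detail):

```python
from collections import Counter
import numpy as np

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def dist_n(responses, n):
    """Dist-n: fraction of distinct n-grams among all n-grams in the generated responses."""
    all_ngrams = [g for r in responses for g in ngrams(r.split(), n)]
    return len(set(all_ngrams)) / max(len(all_ngrams), 1)

def ent_n(responses, n):
    """Ent-n: entropy of the n-gram frequency distribution (one common formulation)."""
    counts = Counter(g for r in responses for g in ngrams(r.split(), n))
    total = sum(counts.values())
    probs = np.array([c / total for c in counts.values()])
    return float(-(probs * np.log(probs)).sum())
```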
Semantic diversity. Recently, several studies have focused on the semantic diversity of generated responses. Tevet and Berant (2021) release the McDiv benchmark to evaluate semantic diversity metrics, and Stasaski and Hearst (2022) propose a new semantic diversity metric, natural language inference (NLI) diversity, leveraging pretrained NLI models (Bowman et al., 2015).
The major difference between our Sem-Ent and NLI diversity is that NLI diversity can only capture the semantic diversity of generated responses for a single context, while Sem-Ent measures the overall semantic diversity of generated responses across the multiple contexts of the test set. This is an important distinction, since the latter provides insight into how well generated responses vary depending on which context is provided as input, while the former cannot. To see the shortcoming of NLI diversity more clearly, consider the following example: suppose that, given a context $c_a$ as input, a dialogue generation model generates responses $\{r_{a,1}, r_{a,2}, \cdots\}$ that are "semantically diverse" according to NLI diversity. Now, further suppose that, given another context $c_b$, the model generates responses $\{r_{b,1}, r_{b,2}, \cdots\}$ that are also "semantically diverse" among themselves but appear similar to the responses $\{r_{a,1}, r_{a,2}, \cdots\}$ produced for the context $c_a$. In such a case, NLI diversity cannot capture the fact that the generated responses $\{r_{a,1}, \cdots\}$ and $\{r_{b,1}, \cdots\}$ for the contexts $c_a$ and $c_b$ are semantically similar despite the contexts being different; Sem-Ent can, because it measures the semantic diversity of generated responses over a set of different contexts from the test set. In Section 3, we describe our proposed semantic diversity metric, Sem-Ent, in detail.
3 Measuring Semantic Diversity
3.1 Sem-Ent
Let $\mathcal{D} = \{(c_i, r_i)\}_{i=1}^{m}$ denote a training set consisting of $m$ dialogues, where $c_i$ and $r_i$ denote the context and its response of the $i$-th dialogue, respectively. Dialogue generation is the task of generating a response $r$ for a given context $c$.

We are motivated by recent empirical observations that responses can be clustered by the semantic similarity between them (Ko et al., 2020; Gao et al., 2020). Following Csáky et al. (2019) and Pillutla et al. (2021), we cluster the responses in $\mathcal{D}$ using a pretrained language model; here, we select DialoGPT (Zhang et al., 2020) as the language model. Each response $r_i \in \mathcal{D}$ is turned into a semantic representation $e(r_i)$ by the language model, and then $k$ semantic clusters are formed from the semantic representations by the $k$-means algorithm (Lloyd, 1982). Let $\mathcal{C}$ denote the set of the obtained $k$ semantic clusters.
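A minimal sketch of this clustering step is shown below, assuming DialoGPT-medium as the encoder and mean pooling of the final hidden states as the representation $e(r)$; the pooling choice and hyperparameters are assumptions, since this section does not specify them.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.cluster import KMeans

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModel.from_pretrained("microsoft/DialoGPT-medium").eval()

@torch.no_grad()
def embed(response: str) -> torch.Tensor:
    """Semantic representation e(r): mean-pooled final hidden states (an assumed pooling choice)."""
    inputs = tokenizer(response, return_tensors="pt", truncation=True, max_length=128)
    hidden = model(**inputs).last_hidden_state          # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

def cluster_responses(responses, k=20, seed=0):
    """Embed the training responses and form the k semantic clusters C with k-means."""
    X = torch.stack([embed(r) for r in responses]).numpy()
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(X)
    return km                                             # km.predict() plays the role of phi_C below
```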
Consider a test set $\tilde{\mathcal{D}} = \{(\tilde{c}_i, \tilde{r}_i)\}_{i=1}^{n}$ consisting of $n$ dialogues. During evaluation, a dialogue generation model $M$ generates responses $R_M = \{r_i^M\}_{i=1}^{n}$ for the contexts $\{\tilde{c}_i\}_{i=1}^{n} \in \tilde{\mathcal{D}}$, respectively. To compute semantic diversity, Sem-Ent requires a semantic distribution $P(R_M)$, but there is no direct way to obtain the exact distribution. Thus, we approximate the semantic distribution $P(R_M)$ as $\tilde{P}(R_M) = [\tilde{p}(1); \cdots; \tilde{p}(k)]$ using the semantic clusters $\mathcal{C}$ as follows:

$$\tilde{p}(j) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\left[\phi_{\mathcal{C}}(e(r_i^M)) = j\right], \qquad (1)$$

where $\phi_{\mathcal{C}}(r) \in \{1, \cdots, k\}$ is a cluster mapping function that returns the cluster index of a response $r$ from $\mathcal{C}$, and $\tilde{p}(j)$ is the probability of the $j$-th cluster, indicating how many generated responses are assigned to the $j$-th semantic cluster.
Sem-Ent is the entropy of $\tilde{P}(R_M)$, which approximates the semantic distribution $P(R_M)$:

$$\text{Sem-Ent}(R_M) = -\sum_{j=1}^{k} \tilde{p}(j) \cdot \log \tilde{p}(j). \qquad (2)$$

The interpretation of Sem-Ent is straightforward: Sem-Ent gets lower as the semantic distribution becomes more imbalanced, i.e., when models generate responses belonging to only a few specific semantic clusters. Conversely, Sem-Ent reaches its highest value of $\log k$ when the generated responses are uniformly distributed across the semantic clusters.
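Continuing the clustering sketch above, Sem-Ent can be computed roughly as follows; the fitted k-means model's predict() acts as the cluster mapping function $\phi_{\mathcal{C}}$, and `embed` is the pooling function assumed earlier.

```python
import numpy as np
import torch

def sem_ent(generated_responses, km, k=20):
    """Entropy of the semantic-cluster distribution of responses generated for the test contexts."""
    X = torch.stack([embed(r) for r in generated_responses]).numpy()  # e(r_i^M); see sketch above
    labels = km.predict(X)                                            # phi_C(e(r_i^M)) in Eq. (1)
    p = np.bincount(labels, minlength=k) / len(labels)                # Eq. (1): cluster probabilities
    nz = p[p > 0]                                                     # skip empty clusters (0 log 0 = 0)
    return float(-(nz * np.log(nz)).sum())                            # Eq. (2); maximum value is log(k)
```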
3.2 Correlation with Human Judgment
We conduct a human evaluation to demonstrate that
Sem-Ent successfully captures human judgments
on response diversity.
Experimental Setup. We borrow the pairwise experimental setup of Pillutla et al. (2021) for analyzing the correlation between diversity metrics and human judgments. Our evaluation is based on the observation that the degree of response diversity varies depending on the type of generative model and decoding algorithm (Holtzman et al., 2019; Tevet and Berant, 2021). From this, we first prepare eight different response generation settings from two generation models (Blender-90M (Roller et al., 2021) and BART-large (Lewis et al., 2020)) and four decoding algorithms (greedy, beam, top-k sampling, and nucleus sampling). We then obtain 28 pairs of generation settings from the eight response generation settings.

Metric         | Correlation | Dist-3         | Ent-3         | LF             | MAUVE         | Sem-Ent
Diversity/BT   | Pearson     | 0.348 (0.399)  | 0.702 (0.052) | -0.232 (0.580) | 0.134 (0.750) | 0.810 (0.015)
               | Spearman    | 0.381 (0.352)  | 0.667 (0.071) |  0.000 (1.000) | 0.547 (0.160) | 0.762 (0.028)
Interesting/BT | Pearson     | 0.261 (0.533)  | 0.671 (0.068) | -0.260 (0.533) | 0.098 (0.817) | 0.789 (0.020)
               | Spearman    | 0.381 (0.352)  | 0.714 (0.047) |  0.048 (0.911) | 0.523 (0.182) | 0.667 (0.020)

Table 1: Correlation of diversity metrics with human judgments on response diversity. BT denotes the Bradley-Terry score for the pairwise human evaluation, and the value inside the parentheses indicates the p-value. We set the number of semantic clusters to k=20. Evaluation results with different n for Dist-n and Ent-n are reported in Appendix A.3.
For each pair, we randomly choose ten contexts from the test set of the DailyDialog dataset (Li et al., 2017b) and generate two response sets for the ten contexts using the two generation settings, respectively. Human annotators are asked to select which response set is better on two criteria, diversity and interestingness, using a 5-point Likert scale. We obtain 25 pairwise annotations for each pair of response generation settings. These annotation results are converted into a score for each response generation setting using the Bradley-Terry model (Marden, 1996). Under the Bradley-Terry model, given parameters $\theta_1, \cdots, \theta_n$, the probability of the outcome $i \succ j$ for two items $i$ and $j$ is calculated as $p(i \succ j) = e^{\theta_i} / (e^{\theta_i} + e^{\theta_j})$. For more details about the Bradley-Terry model, please refer to the choix manual (https://github.com/lucasmaystre/choix).
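A minimal sketch of fitting Bradley-Terry scores from such pairwise judgments is shown below, using the standard minorization-maximization updates (Hunter, 2004). Converting the 5-point Likert annotations into win/loss pairs (e.g., treating each annotation that prefers one setting as a win for it, and ignoring ties) is an assumption here, not a detail given in the paper; the choix library referenced above provides estimators for such models.

```python
import numpy as np

def bradley_terry(pairwise_wins, n_items, n_iter=200):
    """Fit Bradley-Terry parameters theta from a list of (winner, loser) index pairs.

    Under the model, p(i beats j) = exp(theta_i) / (exp(theta_i) + exp(theta_j)).
    Assumes every item wins at least once; the scale of theta is fixed by normalization.
    """
    wins = np.zeros(n_items)                       # total wins per item
    counts = np.zeros((n_items, n_items))          # number of comparisons per pair
    for winner, loser in pairwise_wins:
        wins[winner] += 1
        counts[winner, loser] += 1
        counts[loser, winner] += 1
    strength = np.ones(n_items)                    # strength_i = exp(theta_i)
    for _ in range(n_iter):
        denom = counts / (strength[:, None] + strength[None, :])
        np.fill_diagonal(denom, 0.0)
        strength = wins / denom.sum(axis=1)        # MM update
        strength /= strength.sum()                 # Bradley-Terry is identifiable only up to a constant
    return np.log(strength)
```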
We measure the correlation between the Bradley-
Terry score and diversity metrics to check how each
metric correlates with the human judgments on
each criterion. More details about human evalua-
tion are included in Appendix A.
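For example, with the eight generation settings, the correlations reported in Table 1 can be computed as follows; scipy is assumed, and the per-setting scores below are illustrative placeholders rather than the paper's numbers.

```python
from scipy.stats import pearsonr, spearmanr

# Per-setting Bradley-Terry scores and metric values (illustrative numbers only).
bt_scores     = [0.21, 0.05, -0.10, 0.30, -0.25, 0.12, -0.18, -0.15]
metric_scores = [2.41, 2.10, 1.85, 2.60, 1.60, 2.20, 1.75, 1.80]   # e.g., Sem-Ent per setting

pearson_r, pearson_p = pearsonr(bt_scores, metric_scores)
spearman_r, spearman_p = spearmanr(bt_scores, metric_scores)
```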
Baseline Metrics. We compare Sem-Ent with existing lexical-level response diversity metrics: Dist-n (Li et al., 2016a), Ent-n (Serban et al., 2017; Zhang et al., 2018), and LF (Li et al., 2019). We also include the recently proposed MAUVE (Pillutla et al., 2021) as a baseline metric. MAUVE shares some properties with Sem-Ent in that it evaluates the distributional properties of generated responses using semantic latent representations. However, it is designed to measure the divergence of generated responses from human responses rather than to directly measure response diversity; we compare Sem-Ent to MAUVE to verify that Sem-Ent is more suitable for measuring response diversity in open-domain dialogue generation. Note that we do not include NLI diversity as a baseline because it is incompatible with our human evaluation, which measures overall semantic diversity.
Results. Table 1 shows the correlation between the human judgments and the diversity metrics in terms of Pearson and Spearman rank correlation. Sem-Ent shows the highest Pearson and Spearman rank correlation with human judgments on response diversity, outperforming the other evaluation metrics by a significant margin. In particular, Dist-n, the most commonly used metric for response diversity, shows a much lower correlation (0.348) than Sem-Ent (0.810). These results support that Sem-Ent is a good surrogate for estimating human judgments on response diversity and strongly suggest that analyzing the semantic diversity of generated responses is crucial for capturing human perception of response diversity. Moreover, MAUVE shows a lower correlation with human judgments on response diversity. This result implies that generated responses whose representations are similar to those of human responses (i.e., high MAUVE scores) are not always semantically diverse, since human responses are also often generic (Csáky et al., 2019) (further analyzed in Section 4.1).
We also observe that Sem-Ent shows a high correlation with human judgments on interestingness (See et al., 2019), which asks annotators how interesting or boring they found the generated responses. Sem-Ent has a correlation similar to Ent-n and a substantially higher correlation than Dist-n, LF, and MAUVE. We believe that semantically diverse responses can improve the interestingness of dialogue generation models and that Sem-Ent can partly capture human judgments on response interestingness.
Robustness of Sem-Ent to the Choice of Configuration. Sem-Ent could be affected by changes in the configuration used for calculating the score,