Measuring and Improving Semantic Diversity of Dialogue Generation
Seungju Han Beomsu Kim Buru Chang
Hyperconnect
wade3han@snu.ac.kr, {beomsu.kim,buru.chang}@hpcnt.com
Abstract
Response diversity has become an important criterion for evaluating the quality of open-domain dialogue generation models. However, current evaluation metrics for response diversity often fail to capture the semantic diversity of generated responses, as they mainly consider lexical aspects of the generated responses. In this paper, we introduce a new automatic evaluation metric to measure the semantic diversity of generated responses. Through human evaluation, we demonstrate that our proposed metric captures human judgments on response diversity better than existing lexical-level diversity metrics. Furthermore, motivated by analyzing an existing dialogue dataset, we propose a simple yet effective learning method that improves the semantic diversity of generated responses. Our learning method weights training samples based on the semantic distribution of the training set. We show that our learning method improves response diversity and coherency better than other baseline methods through automatic and human evaluation.
1 Introduction
Open-domain dialogue generation (Sordoni et al., 2015; Bordes et al., 2017) has greatly progressed with the development of large-scale pretrained language models (Radford et al., 2019; Roller et al., 2021) in the last decade. However, although dialogue generation models can produce fluent responses, they are also known for frequently generating dull and uninformative generic responses (e.g., "I don't know"), degrading their engagingness (Serban et al., 2016; Li et al., 2016a). To alleviate this problem, many studies (Zhao et al., 2017; Li et al., 2017a; Zhang et al., 2018) have been conducted to enhance the diversity of generated responses, and response diversity has become an important criterion for evaluating the quality of generated responses.

Figure 1: An illustration of our proposed Sem-Ent that measures semantic diversity based on the semantic distribution of generated responses. (In the figure, responses generated by Model A and Model B for the test-set contexts are mapped into a semantic latent space; the resulting semantic distributions yield a Sem-Ent of 1.3 for Model A versus 0.6 for Model B.)
The current evaluation protocol has relied on lexical-level evaluation metrics such as Distinct-n (Dist-n) (Li et al., 2016a) and Entropy-n (Ent-n) (Serban et al., 2017) to measure the diversity of generated responses. However, according to recent studies (Tevet and Berant, 2021; Stasaski and Hearst, 2022), these lexical-level evaluation metrics often fail to capture semantic diversity, since responses including similar words can have different semantics and responses with different words can have similar semantics (Yarats and Lewis, 2018).
In this paper, we propose Sem-Ent (Semantic-Entropy), a new automatic evaluation metric for measuring the semantic diversity of generated responses. Sem-Ent first maps generated responses into a semantic latent space using a pretrained language model (e.g., DialoGPT (Zhang et al., 2020) or BERT (Devlin et al., 2019)). Then, the metric measures the semantic diversity of the generated responses by calculating, based on entropy, how evenly the responses are distributed in the semantic latent space, as shown in Figure 1. Through human evaluation, we demonstrate that Sem-Ent is more highly correlated with human judgments on response diversity than existing lexical-level evaluation metrics.
Furthermore, we propose a simple yet effective learning method for dialogue generation models that improves the semantic diversity of generated responses. We observe that the semantic distribution of responses in a dialogue dataset is highly imbalanced, leading dialogue generation models to produce semantically less diverse responses. To address this problem, our proposed method, DRESS (Diversifying RESponses Semantically), learns more from responses with rare semantics and less from responses with frequent semantics. As a result, dialogue generation models produce more semantically diverse responses. Experiments on two benchmark datasets demonstrate that DRESS achieves better semantic diversity than state-of-the-art baseline methods, along with a gain in response coherency. Interestingly, DRESS also achieves better performance on lexical-level diversity metrics than the baselines, even though it focuses only on improving semantic diversity. Moreover, human evaluation results show the effectiveness of DRESS, which outperforms all baseline methods in the appropriateness and informativeness of generated responses.
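The exact weighting scheme of DRESS is defined later in the paper; the following is only a rough, hypothetical sketch of the idea described above, in which each training sample is weighted inversely to the frequency of its semantic cluster so that responses with rare semantics contribute more to the loss. The function name, the exponent alpha, and the normalization are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def semantic_sample_weights(cluster_ids, k, alpha=1.0):
    """Hypothetical inverse-cluster-frequency weights; NOT the paper's exact DRESS formulation.

    cluster_ids: semantic cluster index of each training response (e.g., from k-means over e(r)).
    Responses whose semantic cluster is rare in the training set receive larger weights.
    """
    freq = np.bincount(cluster_ids, minlength=k) / len(cluster_ids)   # empirical cluster probabilities
    weights = (1.0 / np.maximum(freq[cluster_ids], 1e-12)) ** alpha   # rarer cluster -> larger weight
    return weights / weights.mean()                                   # normalize to mean 1

# The per-sample negative log-likelihoods of the dialogue model could then be combined as
#   loss = (semantic_sample_weights(cluster_ids, k) * per_sample_nll).mean()
```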
Our Contributions: (1) A new automatic evaluation metric for measuring semantic diversity (Sem-Ent), which is highly correlated with human judgments on response diversity. (2) A simple yet effective learning method for dialogue generation models (DRESS) that improves the semantic diversity of generated responses. (3) Experiments on two benchmark datasets showing that DRESS outperforms the baseline methods in both semantic diversity and lexical-level diversity. (4) An implementation of Sem-Ent, which will be released, contributing to the community of open-domain dialogue generation.
2 Related Work
2.1 Enhancing Response Diversity
Since generating dull and uninformative responses is a well-known and important problem in open-domain dialogue (Vinyals and Le, 2015; Li et al., 2016a), numerous methods have been proposed to address this issue. The maximum mutual information objective function is utilized to penalize generic responses and improve the diversity of generated responses (Li et al., 2016a,c; Zhang et al., 2018, 2020). Another line of work improves diversity by modeling the one-to-many relationship of open-domain dialogue using latent variables to generate multiple and diverse responses (Serban et al., 2017; Zhao et al., 2017; Bao et al., 2020a,b; Chen et al., 2019; Zhang et al., 2019; Gao et al., 2019). Some methods selectively penalize frequent responses by removing them from the training set (Csáky et al., 2019) or applying negative training to frequent responses (He and Glass, 2020). Using different decoding algorithms can also improve response diversity; Li et al. (2016b) and Vijayakumar et al. (2018) directly modify the beam search algorithm to promote response diversity. Sampling-based decoding algorithms such as top-k sampling (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2019) are also known to improve the diversity of generated responses. Wang et al. (2021) diversify responses by adaptively modifying the target token distribution with a lightweight decoder to prevent the model from being over-confident.
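For concreteness, the snippet below shows how sampling-based decoding is typically invoked with Hugging Face Transformers. It uses DialoGPT and illustrative hyperparameters rather than the specific models and settings studied in the cited works; it is a minimal sketch, not a reproduction of any of them.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium").eval()

context = "I just got back from a hiking trip."
inputs = tokenizer(context + tokenizer.eos_token, return_tensors="pt")

with torch.no_grad():
    # Sampling-based decoding: top-k (Fan et al., 2018) and nucleus/top-p (Holtzman et al., 2019)
    # typically yield more diverse responses than greedy decoding or plain beam search.
    out = model.generate(
        **inputs,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        max_new_tokens=40,
        pad_token_id=tokenizer.eos_token_id,
    )

# Keep only the newly generated tokens as the response.
response = tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
```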
2.2 Metrics for Capturing Response Diversity
Response diversity metrics for open-domain dialogue generation models can mainly be categorized into two groups. Referenced metrics (Zhao et al., 2017; Gao et al., 2019) use the reference responses provided by human annotators to capture response diversity by computing a recall value based on various similarity metrics such as BLEU and embedding similarity. On the other hand, unreferenced metrics measure response diversity without using reference responses written by human annotators. Unreferenced metrics are more widely adopted than referenced metrics because they can measure response diversity even in the absence of reference responses. Dist-n (Li et al., 2016a) measures response diversity as the fraction of distinct n-grams over all possible n-grams in the generated responses. The Ent-n metric (Serban et al., 2017; Zhang et al., 2018) improves on Dist-n by taking the frequency differences of n-grams into account. Low-Frequency (LF) (Li et al., 2019) calculates the frequency of low-frequency words in generated responses as the response diversity.
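As a minimal sketch of these lexical-level metrics (whitespace tokenization and this particular Ent-n formulation are simplifying assumptions; the implementations in the cited papers may differ in detail):

```python
from collections import Counter
import numpy as np

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def dist_n(responses, n):
    """Dist-n: fraction of distinct n-grams among all n-grams in the generated responses."""
    all_ngrams = [g for r in responses for g in ngrams(r.split(), n)]
    return len(set(all_ngrams)) / max(len(all_ngrams), 1)

def ent_n(responses, n):
    """Ent-n: entropy of the n-gram frequency distribution (one common formulation)."""
    counts = Counter(g for r in responses for g in ngrams(r.split(), n))
    total = sum(counts.values())
    probs = np.array([c / total for c in counts.values()])
    return float(-(probs * np.log(probs)).sum())
```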
Semantic diversity. Recently, several studies have focused on the semantic diversity of generated responses. Tevet and Berant (2021) release the McDiv benchmark to evaluate semantic diversity metrics, and Stasaski and Hearst (2022) propose a new semantic diversity metric, natural language inference (NLI) diversity, leveraging pretrained NLI models (Bowman et al., 2015).
The major difference between our Sem-Ent and NLI diversity is that NLI diversity can only capture the semantic diversity of generated responses for a single context, while Sem-Ent measures the overall semantic diversity of generated responses across the multiple contexts of the test set. This is an important distinction, since the latter provides insight into how well generated responses vary depending on which context is provided as input, while the former cannot. To see the shortcoming of NLI diversity more clearly, consider the following example: suppose that, given a context $c_a$ as input, a dialogue generation model generates responses $\{r_{a,1}, r_{a,2}, \cdots\}$ that are "semantically diverse" according to NLI diversity. Now, further suppose that, given another context $c_b$, the model generates responses $\{r_{b,1}, r_{b,2}, \cdots\}$ that are also "semantically diverse" among themselves but appear similar to the responses $\{r_{a,1}, r_{a,2}, \cdots\}$ produced for the context $c_a$. In such a case, NLI diversity cannot capture the fact that the generated responses $\{r_{a,1}, \cdots\}$ and $\{r_{b,1}, \cdots\}$ for the contexts $c_a$ and $c_b$ are semantically similar despite the contexts being different; Sem-Ent can, because it measures the semantic diversity of generated responses over a set of different contexts from the test set. In Section 3, we describe our proposed semantic diversity metric, Sem-Ent, in detail.
3 Measuring Semantic Diversity
3.1 Sem-Ent
Let $\mathcal{D} = \{(c_i, r_i)\}_{i=1}^{m}$ denote a training set consisting of $m$ dialogues, where $c_i$ and $r_i$ denote the context and its response of the $i$-th dialogue, respectively. Dialogue generation is the task of generating a response $r$ for a given context $c$.

We are motivated by recent empirical observations that responses can be clustered by the semantic similarity between them (Ko et al., 2020; Gao et al., 2020). Following Csáky et al. (2019) and Pillutla et al. (2021), we cluster the responses in $\mathcal{D}$ using a pretrained language model; here, we select DialoGPT (Zhang et al., 2020) as the language model. Each response $r_i \in \mathcal{D}$ is turned into a semantic representation $e(r_i)$ by the language model, and then $k$ semantic clusters are formed from the semantic representations by the $k$-means algorithm (Lloyd, 1982). Let $\mathcal{C}$ denote the set of the obtained $k$ semantic clusters.
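A minimal sketch of this clustering step is shown below, assuming DialoGPT-medium as the encoder and mean pooling of the final hidden states as the representation $e(r)$; the pooling choice and hyperparameters are assumptions, since this section does not specify them.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.cluster import KMeans

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModel.from_pretrained("microsoft/DialoGPT-medium").eval()

@torch.no_grad()
def embed(response: str) -> torch.Tensor:
    """Semantic representation e(r): mean-pooled final hidden states (an assumed pooling choice)."""
    inputs = tokenizer(response, return_tensors="pt", truncation=True, max_length=128)
    hidden = model(**inputs).last_hidden_state          # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

def cluster_responses(responses, k=20, seed=0):
    """Embed the training responses and form the k semantic clusters C with k-means."""
    X = torch.stack([embed(r) for r in responses]).numpy()
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(X)
    return km                                             # km.predict() plays the role of phi_C below
```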
Consider a test set $\tilde{\mathcal{D}} = \{(\tilde{c}_i, \tilde{r}_i)\}_{i=1}^{n}$ consisting of $n$ dialogues. During evaluation, a dialogue generation model $M$ generates responses $R_M = \{r_i^M\}_{i=1}^{n}$ for the contexts $\{\tilde{c}_i\}_{i=1}^{n} \in \tilde{\mathcal{D}}$, respectively. To compute semantic diversity, Sem-Ent requires a semantic distribution $P(R_M)$, but there is no direct way to obtain the exact distribution. Thus, we approximate the semantic distribution $P(R_M)$ as $\tilde{P}(R_M) = [\tilde{p}(1); \cdots; \tilde{p}(k)]$ using the semantic clusters $\mathcal{C}$ as follows:

$$\tilde{p}(j) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\left[\phi_{\mathcal{C}}(e(r_i^M)) = j\right], \qquad (1)$$

where $\phi_{\mathcal{C}}(r) \in \{1, \cdots, k\}$ is a cluster mapping function that returns the cluster index of a response $r$ from $\mathcal{C}$, and $\tilde{p}(j)$ is the probability of the $j$-th cluster, indicating how many generated responses are assigned to the $j$-th semantic cluster.
Sem-Ent is the entropy of $\tilde{P}(R_M)$, which approximates the semantic distribution $P(R_M)$:

$$\text{Sem-Ent}(R_M) = -\sum_{j=1}^{k} \tilde{p}(j) \cdot \log \tilde{p}(j). \qquad (2)$$

The interpretation of Sem-Ent is straightforward: Sem-Ent gets lower as the semantic distribution becomes more imbalanced, i.e., when models generate responses belonging to only a few specific semantic clusters. Conversely, Sem-Ent reaches its highest value of $\log k$ when the generated responses are uniformly distributed across the semantic clusters.
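Continuing the clustering sketch above, Sem-Ent can be computed roughly as follows; the fitted k-means model's predict() acts as the cluster mapping function $\phi_{\mathcal{C}}$, and `embed` is the pooling function assumed earlier.

```python
import numpy as np
import torch

def sem_ent(generated_responses, km, k=20):
    """Entropy of the semantic-cluster distribution of responses generated for the test contexts."""
    X = torch.stack([embed(r) for r in generated_responses]).numpy()  # e(r_i^M); see sketch above
    labels = km.predict(X)                                            # phi_C(e(r_i^M)) in Eq. (1)
    p = np.bincount(labels, minlength=k) / len(labels)                # Eq. (1): cluster probabilities
    nz = p[p > 0]                                                     # skip empty clusters (0 log 0 = 0)
    return float(-(nz * np.log(nz)).sum())                            # Eq. (2); maximum value is log(k)
```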
3.2 Correlation with Human Judgment
We conduct a human evaluation to demonstrate that
Sem-Ent successfully captures human judgments
on response diversity.
Experimental Setup. We borrow the pairwise experimental setup of Pillutla et al. (2021) for analyzing the correlation between diversity metrics and human judgments. Our evaluation is based on the observation that the degree of response diversity varies depending on the type of generative model and decoding algorithm (Holtzman et al., 2019; Tevet and Berant, 2021). From this, we first prepare eight different response generation settings from two generation models (Blender-90M (Roller et al., 2021) and BART-large (Lewis et al., 2020)) and four decoding algorithms (greedy, beam, top-k sampling, and nucleus sampling). We then obtain 28 pairs of generation settings from the eight response generation settings.

Metric         | Correlation | Dist-3         | Ent-3         | LF             | MAUVE         | Sem-Ent
Diversity/BT   | Pearson     | 0.348 (0.399)  | 0.702 (0.052) | -0.232 (0.580) | 0.134 (0.750) | 0.810 (0.015)
               | Spearman    | 0.381 (0.352)  | 0.667 (0.071) |  0.000 (1.000) | 0.547 (0.160) | 0.762 (0.028)
Interesting/BT | Pearson     | 0.261 (0.533)  | 0.671 (0.068) | -0.260 (0.533) | 0.098 (0.817) | 0.789 (0.020)
               | Spearman    | 0.381 (0.352)  | 0.714 (0.047) |  0.048 (0.911) | 0.523 (0.182) | 0.667 (0.020)

Table 1: Correlation of diversity metrics with human judgments on response diversity. BT denotes the Bradley-Terry score for the pairwise human evaluation, and the value inside the parentheses indicates the p-value. We set the number of semantic clusters to k=20. Evaluation results with different n for Dist-n and Ent-n are reported in Appendix A.3.
For each pair, we randomly choose ten contexts from the test set of the DailyDialog dataset (Li et al., 2017b) and generate two response sets for the ten contexts using the two generation settings, respectively. Human annotators are asked to select which response set is better on two criteria, diversity and interestingness, using a 5-point Likert scale. We obtain 25 pairwise annotations for each pair of response generation settings. These annotation results are converted into a score for each response generation setting using the Bradley-Terry model (Marden, 1996). Under the Bradley-Terry model, given parameters $\theta_1, \cdots, \theta_n$, the probability of the outcome $i \succ j$ for two items $i$ and $j$ is calculated as $p(i \succ j) = e^{\theta_i} / (e^{\theta_i} + e^{\theta_j})$. For more details about the Bradley-Terry model, please refer to the choix manual (https://github.com/lucasmaystre/choix).
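A minimal sketch of fitting Bradley-Terry scores from such pairwise judgments is shown below, using the standard minorization-maximization updates (Hunter, 2004). Converting the 5-point Likert annotations into win/loss pairs (e.g., treating each annotation that prefers one setting as a win for it, and ignoring ties) is an assumption here, not a detail given in the paper; the choix library referenced above provides estimators for such models.

```python
import numpy as np

def bradley_terry(pairwise_wins, n_items, n_iter=200):
    """Fit Bradley-Terry parameters theta from a list of (winner, loser) index pairs.

    Under the model, p(i beats j) = exp(theta_i) / (exp(theta_i) + exp(theta_j)).
    Assumes every item wins at least once; the scale of theta is fixed by normalization.
    """
    wins = np.zeros(n_items)                       # total wins per item
    counts = np.zeros((n_items, n_items))          # number of comparisons per pair
    for winner, loser in pairwise_wins:
        wins[winner] += 1
        counts[winner, loser] += 1
        counts[loser, winner] += 1
    strength = np.ones(n_items)                    # strength_i = exp(theta_i)
    for _ in range(n_iter):
        denom = counts / (strength[:, None] + strength[None, :])
        np.fill_diagonal(denom, 0.0)
        strength = wins / denom.sum(axis=1)        # MM update
        strength /= strength.sum()                 # Bradley-Terry is identifiable only up to a constant
    return np.log(strength)
```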
We measure the correlation between the Bradley-
Terry score and diversity metrics to check how each
metric correlates with the human judgments on
each criterion. More details about human evalua-
tion are included in Appendix A.
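For example, with the eight generation settings, the correlations reported in Table 1 can be computed as follows; scipy is assumed, and the per-setting scores below are illustrative placeholders rather than the paper's numbers.

```python
from scipy.stats import pearsonr, spearmanr

# Per-setting Bradley-Terry scores and metric values (illustrative numbers only).
bt_scores     = [0.21, 0.05, -0.10, 0.30, -0.25, 0.12, -0.18, -0.15]
metric_scores = [2.41, 2.10, 1.85, 2.60, 1.60, 2.20, 1.75, 1.80]   # e.g., Sem-Ent per setting

pearson_r, pearson_p = pearsonr(bt_scores, metric_scores)
spearman_r, spearman_p = spearmanr(bt_scores, metric_scores)
```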
Baseline Metrics. We compare Sem-Ent with existing lexical-level response diversity metrics: Dist-n (Li et al., 2016a), Ent-n (Serban et al., 2017; Zhang et al., 2018), and LF (Li et al., 2019). We also include the recently proposed MAUVE (Pillutla et al., 2021) as a baseline metric. MAUVE shares some properties with Sem-Ent in that it evaluates the distributional properties of generated responses using semantic latent representations. However, it is designed to measure the divergence of generated responses from human responses rather than to directly measure response diversity; we compare Sem-Ent to MAUVE to verify that Sem-Ent is more suitable for measuring response diversity in open-domain dialogue generation. Note that we do not include NLI diversity as a baseline because it is incompatible with our human evaluation, which measures overall semantic diversity.
Results. Table 1 shows the correlation between the human judgments and the diversity metrics in terms of Pearson and Spearman rank correlation. Sem-Ent shows the highest Pearson and Spearman rank correlation with human judgments on response diversity, outperforming the other evaluation metrics by a significant margin. In particular, Dist-n, the most commonly used metric for response diversity, shows a much lower correlation (0.348) than Sem-Ent (0.810). These results support that Sem-Ent is a good surrogate for estimating human judgments on response diversity and strongly suggest that analyzing the semantic diversity of generated responses is crucial for capturing human perception of response diversity. Moreover, MAUVE shows a lower correlation with human judgments on response diversity. This result implies that generated responses whose representations are similar to those of human responses (i.e., high MAUVE scores) are not always semantically diverse, since human responses are also often generic (Csáky et al., 2019) (further analyzed in Section 4.1).
We also observe that Sem-Ent shows a high correlation with human judgments on interestingness (See et al., 2019), which asks annotators how interesting or boring they found the generated responses. Sem-Ent has a correlation similar to Ent-n and a substantially higher correlation than Dist-n, LF, and MAUVE. We believe that semantically diverse responses can improve the interestingness of dialogue generation models and that Sem-Ent can partly capture human judgments on response interestingness.
Robustness of Sem-Ent to the Choice of Configuration. Sem-Ent could be affected by changes in the configuration used for calculating the score,