
sity of generated responses by calculating, via entropy, how evenly the responses are distributed in a semantic latent space, as shown in Figure 1. Through human evaluation, we demonstrate that Sem-Ent correlates more strongly with human judgments of response diversity than existing lexical-level evaluation metrics.
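While the formal definition of Sem-Ent is given later in the paper, the general idea can be sketched as follows. The sentence encoder, the k-means clustering step, and the cluster count below are all illustrative assumptions for this sketch, not the paper's exact procedure:

```python
import numpy as np
from sklearn.cluster import KMeans

def sem_ent(embeddings: np.ndarray, n_clusters: int = 20) -> float:
    """Entropy of how generated responses spread over semantic clusters.

    `embeddings` holds one sentence-embedding row per generated response.
    Higher entropy = responses spread more evenly across semantic space.
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    counts = np.bincount(labels, minlength=n_clusters)
    probs = counts / counts.sum()
    probs = probs[probs > 0]  # drop empty clusters (0 * log 0 = 0)
    return float(-(probs * np.log(probs)).sum())
```

A set of responses concentrated in a few clusters yields low entropy, while responses spread uniformly over all clusters yield the maximum value of log(n_clusters).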
Furthermore, we propose a simple yet effective learning method for dialogue generation models that improves the semantic diversity of generated responses. We observe that the semantic distribution of responses in a dialogue dataset is highly imbalanced, leading dialogue generation models to produce semantically less diverse responses.
To address this problem, our proposed method, DRESS (Diversifying RESponses Semantically), learns more about responses with rare semantics and less about responses with frequent semantics (a minimal sketch of this reweighting idea follows this paragraph). As a result, dialogue generation models can produce more semantically diverse responses. Experiments on two benchmark datasets demonstrate that DRESS achieves better semantic diversity than state-of-the-art baseline methods, along with gains in response coherence. Interestingly, DRESS also outperforms the baselines on lexical-level diversity metrics, even though it focuses only on improving semantic diversity. Moreover, human evaluation shows the effectiveness of DRESS: it outperforms all baseline methods in the appropriateness and informativeness of generated responses.
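The reweighting idea referenced above can be sketched as a frequency-weighted training loss. The inverse-frequency weighting scheme and the assumption that each response's semantic-cluster frequency is precomputed are illustrative choices here, not DRESS's exact formulation:

```python
import torch
import torch.nn.functional as F

def frequency_weighted_nll(logits: torch.Tensor,
                           targets: torch.Tensor,
                           cluster_freq: torch.Tensor) -> torch.Tensor:
    """Up-weight responses with rare semantics, down-weight frequent ones.

    logits:       (batch, seq_len, vocab) decoder outputs
    targets:      (batch, seq_len) gold response token ids
    cluster_freq: (batch,) empirical training-set frequency of each
                  response's semantic cluster (assumed given)
    """
    # Per-token NLL, averaged into one loss value per response.
    nll = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    per_response = nll.mean(dim=1)
    # Inverse-frequency weights, renormalized over the batch
    # (one plausible weighting scheme, not the paper's exact one).
    weights = 1.0 / cluster_freq
    weights = weights / weights.sum() * len(weights)
    return (weights * per_response).mean()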
Our Contributions: (1) A new automatic evaluation metric for measuring semantic diversity (Sem-Ent), which is highly correlated with human judgments of response diversity. (2) A simple yet effective learning method for dialogue generation models (DRESS) that improves the semantic diversity of generated responses. (3) Experiments on two benchmark datasets, showing that DRESS outperforms the baseline methods in both semantic and lexical-level diversity. (4) An implementation of Sem-Ent, to be released for the open-domain dialogue generation community.
2 Related Work
2.1 Enhancing Response Diversity
Since generating dull and uninformative responses is a well-known and important problem in open-domain dialogue (Vinyals and Le, 2015; Li et al., 2016a), numerous methods have been proposed to address this issue. The maximum mutual information objective function is utilized to penalize generic responses and improve the diversity of generated responses (Li et al., 2016a,c; Zhang et al., 2018, 2020). Another line of work improves diversity by modeling the one-to-many relationship of open-domain dialogue, using latent variables to generate multiple and diverse responses (Serban et al., 2017; Zhao et al., 2017; Bao et al., 2020a,b; Chen et al., 2019; Zhang et al., 2019; Gao et al., 2019). Some methods selectively penalize frequent responses by removing them from the training set (Csáky et al., 2019) or by applying negative training to them (He and Glass, 2020). Different decoding algorithms can also improve response diversity: Li et al. (2016b) and Vijayakumar et al. (2018) directly modify the beam search algorithm to promote diverse responses, and sampling-based decoding algorithms such as top-k sampling (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2019) are also known to improve the diversity of generated responses. Wang et al. (2021) diversify responses by adaptively modifying the target token distribution with a lightweight decoder to prevent the model from becoming over-confident.
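As a concrete illustration of one such sampling-based decoding scheme, a single nucleus (top-p) sampling step over a next-token distribution might look like the sketch below; the threshold value of 0.9 is an assumption, and this is a minimal single-step version rather than a full decoding loop:

```python
import torch

def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    """Sample the next token from the smallest set of tokens whose
    cumulative probability exceeds p (Holtzman et al., 2019)."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep tokens up to and including the first one whose cumulative
    # mass crosses p; discard the long low-probability tail.
    cutoff = int(torch.searchsorted(cumulative, p).item()) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(kept, num_samples=1)
    return int(sorted_ids[choice].item())
```

Truncating the tail before sampling is what distinguishes nucleus sampling from pure sampling: low-probability, often incoherent tokens can never be drawn, while the remaining mass still allows diverse continuations.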
2.2 Metrics for Capturing Response Diversity
Response diversity metrics for open-domain dialogue generation models can mainly be categorized into two groups. Referenced metrics (Zhao et al., 2017; Gao et al., 2019) use reference responses provided by human annotators to capture response diversity, computing a recall value based on similarity metrics such as BLEU and embedding similarity. Unreferenced metrics, on the other hand, measure response diversity without using human-written reference responses, and are therefore more widely adopted, since they apply even when reference responses are unavailable. Dist-n (Li et al., 2016a) measures response diversity as the fraction of distinct n-grams among all n-grams in the generated responses. The Ent-n metric (Serban et al., 2017; Zhang et al., 2018) improves on Dist-n by taking the frequency differences of n-grams into account. Low-Frequency (LF) (Li et al., 2019) measures response diversity as the frequency of low-frequency words in the generated responses.
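To make the difference between the two n-gram metrics concrete, a minimal sketch of Dist-n and Ent-n over a list of generated responses (assuming simple whitespace tokenization, which the original papers do not mandate):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def dist_n(responses, n=2):
    """Dist-n: distinct n-grams / total n-grams (Li et al., 2016a)."""
    all_ngrams = [g for r in responses for g in ngrams(r.split(), n)]
    return len(set(all_ngrams)) / max(len(all_ngrams), 1)

def ent_n(responses, n=2):
    """Ent-n: entropy of the n-gram frequency distribution, so that
    heavy reuse of a few n-grams is penalized even when many distinct
    n-grams exist (Serban et al., 2017; Zhang et al., 2018)."""
    counts = Counter(g for r in responses for g in ngrams(r.split(), n))
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())
```

Both metrics operate purely on surface forms, which is exactly the limitation that motivates measuring diversity at the semantic level.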
Semantic diversity.
Recently, several studies have
focused on the semantic diversity of generated re-