DialoGen: Generalized Long-Range Context Representation
for Dialogue Systems
Suvodip Dey1, Maunendra Sankar Desarkar1, Asif Ekbal2, P.K. Srijith1
1Indian Institute of Technology Hyderabad, India
2Indian Institute of Technology Patna, India
suvodip15@gmail.com, maunendra@cse.iith.ac.in, asif@iitp.ac.in, srijith@iith.ac.in
Abstract
Long-range context modeling is crucial to both dialogue understanding and generation. The most popular method for dialogue context representation is to concatenate the last-k utterances in chronological order. However, this method may not be ideal for conversations containing long-range dependencies, i.e., when there is a need to look beyond the last-k utterances to generate a meaningful response. In this work, we propose DialoGen, a novel encoder-decoder based framework for dialogue generation with a generalized context representation that can look beyond the last-k utterances. The main idea of our approach is to identify and utilize the most relevant historical utterances instead of the last-k, which also enables a compact representation of dialogue history with fewer tokens. We study the effectiveness of our proposed method on both dialogue generation (open-domain) and understanding (DST). Even with a compact context representation, DialoGen performs comparably to the state-of-the-art models on the open-domain DailyDialog dataset. We observe a similar behavior on the DST task of the MultiWOZ dataset when the proposed context representation is applied to existing DST models. We also discuss the generalizability and interpretability of DialoGen and show that the relevance score of previous utterances agrees well with human cognition.
1 Introduction
One of the key challenges in dialogue systems is modeling long-range context (Yan et al., 2022). Human conversations can be lengthy and may contain long-range dependencies among turns. While having a conversation, we often refer back to names, topics, or other information that was mentioned long before the current dialogue turn. For example, Table 1 shows an open-domain conversation from the DailyDialog (Li et al., 2017) dataset. We can observe that in Turn 11, "it" refers to the word "hats", which is mentioned only once in the first turn. Understanding such long-range dependencies is critical for long-range context modeling, which can be beneficial for both dialogue generation and understanding.

Turn  Utterance
1     Oh, so many kinds of winter hats.
2     What is your favorite color, miss?
3     Red.
4     Here you are. It's very attractive.
5     May I try it on?
6     Go ahead.
7     Is there a mirror around here?
8     Right over there.
9     Does it suit me?
10    Yes, you look very nice.
11    How much is it?
Table 1: A sample conversation from DailyDialog
The main challenge of dialogue context modeling comes from the fact that conversations can be arbitrarily long and complex in nature. To encode arbitrarily long conversations, researchers started adapting a hierarchical recurrent encoder framework that contains an utterance-level and a dialogue-level encoder (Sordoni et al., 2015a). However, this approach cannot fully leverage the benefits of the utterance-level features (discussed in Section 2.2). After the evolution of Transformers (Vaswani et al., 2017), the most popular approach to context modeling is to concatenate the historical utterances and use a transformer decoder (or encoder-decoder) model to generate the response. As the sequence length of a transformer is limited, people generally use only the last-k utterances according to the memory limit. Despite its simplicity, this method has produced state-of-the-art results for almost all kinds of dialogue-related tasks (Zhang et al., 2020; Heck et al., 2020; Kim et al., 2020a). Since the existing dialogue datasets have a scarcity of long-range dependencies among turns, looking only at the last-k turns is enough to produce good aggregate-level performance. Although this phenomenon of relying only on recent turns can be observed in short and simple real-world conversations, the same cannot be said for more complex scenarios.

arXiv:2210.06282v4 [cs.CL] 3 Oct 2023
In this work, we propose DialoGen¹, an open-domain Dialogue system with a Generalized context representation strategy. The primary objective of DialoGen is to enrich dialogue context modeling by addressing long-range dependencies such that arbitrarily long conversations can be handled in an easy and interpretable way. The central idea of our approach is to find the relevant historical utterances along with a vector representation of the entire context that can guide the generation of a meaningful response. The main contributions of our work are as follows:

• We propose DialoGen, a novel dialogue generation framework with a generalized representation for long-range dialogue context.
• The proposed context representation method can handle arbitrarily long conversations and works even when the context for the current turn might have been presented much earlier in the conversation. The relevance scores over all the previous turns help to understand the long-range dependencies among dialogue turns, which enhances the generalization and interpretability of the context representation.
• DialoGen achieves comparable performance to state-of-the-art models on dialogue generation and understanding, even with its short and compact representation of dialogue history.
• A detailed discussion on the generalizability and interpretability of the proposed approach, along with a psycholinguistic perspective.
2 Background and Related Works
The existing neural network approaches for context modeling can be broadly categorized into two classes: Concatenation-based and Hierarchical.
2.1 Concatenation-based Encoding
In this approach, historical utterances are concatenated to represent the context. In the pre-Transformer era, the concatenation-based encoding strategy was the go-to method to train an RNN-based encoder-decoder (Bahdanau et al., 2015) for dialogue generation (Sordoni et al., 2015b). A major issue with this approach is that the concatenated utterances can be very long, depending on the conversation. Moreover, modeling long-range dependencies with an RNN/LSTM is difficult. This is why researchers started switching to hierarchical encoders (Section 2.2) to handle long conversations. However, concatenation-based encoding again came to the forefront after the emergence of the Transformer architecture (Vaswani et al., 2017). Most Transformer-based dialogue models concatenate previous utterances and finetune the decoder on a language modeling task (Wolf et al., 2019; Zhang et al., 2020; Bao et al., 2020; Li et al., 2021; Chen et al., 2022) to achieve state-of-the-art results on various dialogue datasets. Note that Transformers have a limit on the maximum sequence length. This is why all these dialogue models can only take the last-k previous utterances as input based on a pre-defined maximum sequence length. Hence, they are not able to look beyond the last-k turns and thereby cannot capture very long-range dependencies among dialogue turns. There are variations of the Transformer (like Big-Bird (Zaheer et al., 2020), Poolingformer (Zhang et al., 2021), etc.) that reduce the computational complexity of the self-attention operation from $O(n^2)$ to $O(n)$, enabling longer sequence lengths. However, looking at more context does not necessarily solve the problem of long-range dependencies, as there might still exist dependencies beyond the context that can fit within the maximum allowed sequence length.

¹Code is available at github.com/SuvodipDey/DialoGen
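The last-k truncation described above can be sketched as follows. This is an illustrative toy example, not the paper's code: the function name and the whitespace tokenizer are assumptions, and real models would count subword tokens against the model's maximum sequence length.

```python
# Illustrative sketch: building a last-k dialogue context under a fixed
# token budget, as concatenation-based models do. Utterances outside the
# budget are silently dropped, which is exactly how long-range
# dependencies get lost.

def build_last_k_context(history, k, max_tokens, tokenize=str.split):
    """Keep at most the last k utterances, trimming oldest-first,
    so the total token count stays within max_tokens."""
    kept = []
    total = 0
    for utt in reversed(history[-k:]):  # walk backwards from the newest turn
        n = len(tokenize(utt))
        if total + n > max_tokens:
            break
        kept.append(utt)
        total += n
    return list(reversed(kept))  # restore chronological order

history = [
    "Oh, so many kinds of winter hats.",
    "What is your favorite color, miss?",
    "Red.",
    "May I try it on?",
    "Go ahead.",
    "Does it suit me?",
]
ctx = build_last_k_context(history, k=4, max_tokens=12)
# The first turn, the only one mentioning "hats", never enters the window.
assert "hats" not in " ".join(ctx)
```

With k = 4 and a 12-token budget, the antecedent "hats" from Turn 1 is unreachable no matter how the budget is spent, mirroring the Table 1 example.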
2.2 Hierarchical Encoding
In this strategy, the encoding of arbitrarily long conversations is achieved through a hierarchical encoder. Each utterance is first encoded using an utterance-level encoder. The encoded utterances are then fed to a dialogue-level encoder to get the final context representation. As discussed in Section 2.1, a vanilla RNN-based encoder-decoder architecture cannot handle long conversations. To address this issue, researchers started adopting hierarchical recurrent encoders where two separate RNNs/LSTMs are employed as the utterance-level and dialogue-level encoders. Models like HRED (Sordoni et al., 2015a) and VHRED (Serban et al., 2017) fall under this category. There are a few works that use BERT as an utterance-level encoder (Kim et al., 2020a). Li et al. (2019) proposed an Incremental Transformer for hierarchical recurrent encoding. DialogBERT (Gu et al., 2021) uses two separate BERT (Devlin et al., 2019) encoders to realize hierarchical encoding. Although DialogBERT can handle lengthy conversations, the number of turns is theoretically limited by the maximum sequence length of BERT. The main advantage of hierarchical encoding is its ease of encoding long conversations. However, these models depend only on the final context vector for response generation. Not considering word/token-level features can fail to capture the complex correlation between all the words in the input and output sequences, which may be required for dialogue generation. Moreover, in real-world conversations, we often reuse words/phrases from past utterances in our replies, for which word/token-level features are important. The decoders of concatenation-based methods attend to all the context tokens during response generation. This is why most of the state-of-the-art results are reported using concatenation-based encoding. Hierarchical encoding-based models like HRAN (Xing et al., 2018) and ReCoSa (Zhang et al., 2019) try to address this issue by additionally considering attention over utterance-level words/tokens. But doing so makes the dialogue generation dependent on the context length, which again brings back some of the limitations discussed in Section 2.1.

Figure 1: Architecture of DialoGen
3 Methodology
In this section, we describe our proposed dialogue generation framework, DialoGen. Let $D = \{u_1, u_2, u_3, \ldots\}$ be a multi-turn conversation where $u_i$ represents the utterance at turn $i$. The objective of dialogue generation is to generate $u_{t+1}$ given $D_t$, i.e., $\{u_1, u_2, u_3, \ldots, u_t\}$. The main idea of our approach is to combine the advantages of both concatenation-based and hierarchical encodings and provide a generalized context representation for dialogue systems that is adaptive to long-range dependencies. The framework is based on an encoder-decoder architecture, as shown in Fig. 1.
3.1 Encoder
The DialoGen encoder is basically a hierarchical recurrent encoder with a few added elements. At a given turn $t$, the encoder first predicts the encoding of the next response. This predicted encoding ($\hat{b}_{t+1}$) is then used to find a relevance score ($\alpha^{(t)}$) for all the previous utterances. Finally, $\alpha^{(t)}$ is used to compute a vector representation ($X_t$) of the entire context such that the prediction of the ground-truth words/tokens is maximized.
Hierarchical Encoding: We use BERT (Devlin et al., 2019) and a GRU (Gated Recurrent Unit) (Cho et al., 2014) as our utterance-level and dialogue-level encoders, respectively. At each turn $t$, the utterance-level encoder ($f_\phi$) takes $u_t$ as input and outputs $b_t$. Here, $f_\phi$ is defined as the mean of all the tokens of the second-to-last layer of the BERT model. The utterance-level encoding is then passed to the stacked GRU ($g_\psi$) with $l$ layers to generate the contextual representation $e_t$. The procedure for obtaining the contextual representation can be summarized as

$$b_t = f_\phi(u_t) \in \mathbb{R}^d \quad (1)$$
$$e_t, h_t = g_\psi(b_t, h_{t-1}) \quad (2)$$

where $d$ is the dimension of the BERT embedding, $e_t \in \mathbb{R}^d$ is the output of the GRU, and $h_t \in \mathbb{R}^{l \times d}$ is the GRU hidden state. The initial hidden state $h_0$ is set to a zero matrix.
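Eqs. (1)-(2) can be sketched as a toy implementation. Note the hedges: the BERT encoder $f_\phi$ is replaced here by a small embedding layer with mean pooling, and the dimensions, class names, and vocabulary size are all illustrative assumptions, not the paper's configuration.

```python
# Toy hierarchical encoder mirroring Eqs. (1)-(2): a stand-in utterance
# encoder (f_phi) feeding a stacked dialogue-level GRU (g_psi).
import torch
import torch.nn as nn

d, l = 16, 2  # toy embedding dimension and number of GRU layers


class HierEncoder(nn.Module):
    def __init__(self, vocab_size=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)  # stand-in for BERT
        self.gru = nn.GRU(d, d, num_layers=l, batch_first=True)

    def encode_utterance(self, token_ids):
        # f_phi: mean over token vectors, mimicking mean-pooled
        # second-to-last-layer BERT states; returns b_t in R^d (Eq. 1)
        return self.embed(token_ids).mean(dim=0)

    def forward(self, utterances, h=None):
        # g_psi: feed each b_t into the stacked GRU; h_0 defaults to zeros
        for token_ids in utterances:
            b_t = self.encode_utterance(token_ids).reshape(1, 1, d)
            e_t, h = self.gru(b_t, h)  # Eq. (2)
        return e_t.reshape(d), h.reshape(l, d)  # e_t in R^d, h_t in R^{l x d}


enc = HierEncoder()
utts = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]  # two toy turns
e_t, h_t = enc(utts)
assert e_t.shape == (d,) and h_t.shape == (l, d)
```

Processing utterances one at a time through the GRU is what lets the encoder handle arbitrarily many turns with a fixed-size state, unlike concatenation-based truncation.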
Next Utterance Prediction: After hierarchical encoding, we predict the encoding of the next utterance as $\hat{b}_{t+1} = \text{FNN}_1(e_t)$, where $\text{FNN}_1$ is a feed-forward neural network.
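A hedged sketch of this prediction step and the downstream relevance scoring and context aggregation: here $\text{FNN}_1$ is taken to be a single linear layer, and $\alpha^{(t)}$ is assumed to be a softmax over dot-product similarities between the predicted encoding and the past utterance encodings; the exact scoring function is not specified in this excerpt, so these choices and all names and dimensions are illustrative.

```python
# Sketch of next-utterance prediction, relevance scoring alpha(t),
# and the weighted context vector X_t (assumed formulation).
import torch
import torch.nn as nn

torch.manual_seed(0)
d, t = 16, 5                # toy embedding size and number of past turns

fnn1 = nn.Linear(d, d)      # stand-in for FNN1 (assumed single linear layer)
e_t = torch.randn(d)        # dialogue-level encoding from the GRU
B = torch.randn(t, d)       # past utterance encodings b_1 .. b_t

# Predicted encoding of the next response: b_hat_{t+1} = FNN1(e_t)
b_next = fnn1(e_t)

# Assumed relevance scoring: softmax over dot-product similarity between
# the predicted encoding and each past utterance encoding
alpha = torch.softmax(B @ b_next, dim=0)

# Context vector X_t as an alpha-weighted combination of past encodings
X_t = alpha @ B

assert torch.isclose(alpha.sum(), torch.tensor(1.0), atol=1e-5)
assert X_t.shape == (d,)
```

Because `alpha` spans all previous turns rather than a last-k window, a turn from far back in the conversation can dominate the context vector when it is the most relevant, which is the behavior DialoGen is designed for.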