
observed in short and simple real-world conversations, the same cannot be said for more complex scenarios.
In this work, we propose DialoGen¹, an open-domain Dialogue system with a Generalized context representation strategy. The primary objective of DialoGen is to enrich dialogue context modeling by addressing long-range dependencies, so that arbitrarily long conversations can be handled in an easy and interpretable way. The central idea of our approach is to identify the relevant historical utterances, along with a vector representation of the entire context, that can guide the generation of a meaningful response. The main contributions of our work are as follows:

¹Code is available at github.com/SuvodipDey/DialoGen
• We propose DialoGen, a novel dialogue generation framework with a generalized representation for long-range dialogue context.
• The proposed context representation method can handle arbitrarily long conversations and works even when the context for the current turn was presented much earlier in the conversation. The relevance scores over all previous turns (illustrated in the sketch after this list) help to uncover the long-range dependencies among dialogue turns, which enhances the generalization and interpretability of the context representation.
• DialoGen achieves performance comparable to state-of-the-art models on dialogue generation and understanding, even with its short and compact representation of the dialogue history.
• We provide a detailed discussion of the generalizability and interpretability of the proposed approach, along with a psycholinguistic perspective.
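DialoGen's actual scoring and encoding functions are defined later in the paper; purely as an illustration of the central idea, the sketch below (all names are hypothetical, and cosine similarity is an assumed stand-in scorer) rates every previous turn against the current query and compresses the whole history into one fixed-size vector:

```python
import numpy as np

def relevance_context(turn_embs: np.ndarray, query_emb: np.ndarray):
    """Score every previous turn against the current query and build
    one fixed-size context vector as a relevance-weighted sum.

    turn_embs: (T, d) embeddings of the T previous utterances
    query_emb: (d,)   embedding of the current utterance
    """
    # Cosine similarity between the query and each historical turn
    # (an assumed scorer, not DialoGen's actual scoring function).
    sims = turn_embs @ query_emb
    sims = sims / (np.linalg.norm(turn_embs, axis=1)
                   * np.linalg.norm(query_emb) + 1e-9)
    # Softmax over all previous turns gives interpretable relevance
    # scores, including for turns far back in the conversation.
    scores = np.exp(sims - sims.max())
    scores = scores / scores.sum()
    # The weighted sum is fixed-size regardless of dialogue length.
    return scores, scores @ turn_embs
```

Because the context vector has a fixed size, downstream generation cost does not grow with conversation length, and the per-turn scores can be inspected directly.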
2 Background and Related Work
The existing neural network approaches for context modeling can be broadly categorized into two classes: Concatenation-based and Hierarchical.
2.1 Concatenation-based Encoding
In this approach, historical utterances are concatenated to represent the context. In the pre-Transformer era, concatenation-based encoding was the go-to method for training an RNN-based encoder-decoder (Bahdanau et al., 2015) for dialogue generation (Sordoni et al., 2015b). A major issue with this approach is that the concatenated utterances can be very long, depending on the conversation. Moreover, modeling long-range dependencies with an RNN/LSTM is difficult. This is why researchers switched to hierarchical encoders (Section 2.2) to handle long conversations. However, concatenation-based encoding returned to the forefront after the emergence of the Transformer architecture (Vaswani et al., 2017). Most Transformer-based dialogue models concatenate previous utterances and finetune the decoder on a language modeling task (Wolf et al., 2019; Zhang et al., 2020; Bao et al., 2020; Li et al., 2021; Chen et al., 2022) to achieve state-of-the-art results on various dialogue datasets. Note that Transformers have a limit on the maximum sequence length. Consequently, these dialogue models can only take the last k previous utterances as input, based on a pre-defined maximum sequence length. Hence, they cannot look beyond the last k turns and thereby cannot capture very long-range dependencies among dialogue turns.
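To make this truncation concrete, the minimal sketch below (hypothetical helper names; `tokenize` stands in for any subword tokenizer) keeps only the most recent utterances that fit in a fixed token budget; anything older, and any dependency on it, is silently dropped:

```python
def build_context(history: list[str], tokenize, max_tokens: int,
                  sep: str = " <eos> ") -> str:
    """Concatenate dialogue history right-to-left, keeping only the
    most recent utterances that fit in the model's token budget."""
    kept = []
    budget = max_tokens
    for utt in reversed(history):           # newest utterance first
        cost = len(tokenize(sep + utt))     # approximate token cost
        if cost > budget:
            break                           # older turns are dropped
        kept.append(utt)
        budget -= cost
    return sep.join(reversed(kept))
```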
There are variations of the Transformer (like Big-Bird (Zaheer et al., 2020), Poolingformer (Zhang et al., 2021), etc.) that reduce the computational complexity of the self-attention operation from $O(n^2)$ to $O(n)$, enabling longer sequence lengths. However, looking at more context does not necessarily solve the problem of long-range dependencies, as dependencies may still exist beyond the longest context that fits within the maximum allowed sequence length.
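For intuition about where the savings come from, the sketch below builds a sliding-window attention mask, one of the patterns Big-Bird combines with global and random attention (this is only one ingredient of such models, not their full mechanism):

```python
import numpy as np

def sliding_window_mask(n: int, w: int) -> np.ndarray:
    """Boolean attention mask where position i may attend only to
    positions j with |i - j| <= w. Each row has at most 2*w + 1
    allowed entries, so attention cost is O(n*w) instead of O(n^2)."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

mask = sliding_window_mask(n=8, w=2)   # 8 tokens, window of 2
```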
2.2 Hierarchical Encoding
In this strategy, the encoding of arbitrarily long conversations is achieved through a hierarchical encoder. Each utterance is first encoded using an utterance-level encoder. The encoded utterances are then fed to a dialogue-level encoder to obtain the final context representation.
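A minimal sketch of this two-level scheme, in the spirit of HRED, is shown below (the dimensions and layer choices are ours for illustration, not those of any specific model):

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Two-level context encoder in the spirit of HRED: one GRU
    encodes tokens within each utterance, a second GRU runs over
    the resulting utterance vectors (toy sizes, no attention)."""

    def __init__(self, vocab_size: int, d: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d)
        self.utt_rnn = nn.GRU(d, d, batch_first=True)  # utterance level
        self.dlg_rnn = nn.GRU(d, d, batch_first=True)  # dialogue level

    def forward(self, turns: list) -> torch.Tensor:
        # Encode each utterance (a 1-D tensor of token ids) into one vector.
        utt_vecs = [self.utt_rnn(self.emb(t.unsqueeze(0)))[1][-1]
                    for t in turns]                     # each (1, d)
        # The dialogue-level GRU's final state is the context representation.
        seq = torch.stack(utt_vecs, dim=1)              # (1, T, d)
        return self.dlg_rnn(seq)[1][-1]                 # (1, d)
```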
As discussed in Section 2.1, vanilla RNN-based encoder-decoder architectures cannot handle long conversations. To address this issue, researchers adopted hierarchical recurrent encoders, where two separate RNNs/LSTMs are employed as the utterance-level and dialogue-level encoders. Models like HRED (Sordoni et al., 2015a) and VHRED (Serban et al., 2017) fall under this category. A few works use BERT as an utterance-level encoder (Kim et al., 2020a). Li et al. (2019) proposed an Incremental Transformer for hierarchical recurrent encoding. DialogBERT (Gu et al., 2021) uses two