
observed in short and simple real-world conversations, the same cannot be said for more complex scenarios.
In this work, we propose DialoGen¹, an open-domain Dialogue system with a Generalized context representation strategy. The primary objective of DialoGen is to enrich dialogue context modeling by addressing long-range dependencies, so that arbitrarily long conversations can be handled in an easy and interpretable way. The central idea of our approach is to identify the relevant historical utterances, along with a vector representation of the entire context, that can guide the generation of a meaningful response. The main contributions of our work are as follows:

¹Code is available at github.com/SuvodipDey/DialoGen
• We propose DialoGen, a novel dialogue generation framework with a generalized representation for long-range dialogue context.
• The proposed context representation method can handle arbitrarily long conversations and works even when the context for the current turn was presented much earlier in the conversation. The relevance scores over all previous turns (illustrated in the sketch after this list) help to uncover the long-range dependencies among dialogue turns, which enhances the generalization and interpretability of the context representation.
• DialoGen achieves performance comparable to state-of-the-art models on dialogue generation and understanding, even with its short and compact representation of the dialogue history.
• We provide a detailed discussion of the generalizability and interpretability of the proposed approach, along with a psycholinguistic perspective.
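DialoGen's actual scoring and encoding functions are defined later in the paper; purely as an illustration of the central idea, the sketch below (all names are hypothetical, and cosine similarity is an assumed stand-in scorer) rates every previous turn against the current query and compresses the whole history into one fixed-size vector:

```python
import numpy as np

def relevance_context(turn_embs: np.ndarray, query_emb: np.ndarray):
    """Score every previous turn against the current query and build
    one fixed-size context vector as a relevance-weighted sum.

    turn_embs: (T, d) embeddings of the T previous utterances
    query_emb: (d,)   embedding of the current utterance
    """
    # Cosine similarity between the query and each historical turn
    # (an assumed scorer, not DialoGen's actual scoring function).
    sims = turn_embs @ query_emb
    sims = sims / (np.linalg.norm(turn_embs, axis=1)
                   * np.linalg.norm(query_emb) + 1e-9)
    # Softmax over all previous turns gives interpretable relevance
    # scores, including for turns far back in the conversation.
    scores = np.exp(sims - sims.max())
    scores = scores / scores.sum()
    # The weighted sum is fixed-size regardless of dialogue length.
    return scores, scores @ turn_embs
```

Because the context vector has a fixed size, downstream generation cost does not grow with conversation length, and the per-turn scores can be inspected directly.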
2 Background and Related Work
The existing neural network approaches for context modeling can be broadly categorized into two classes: Concatenation-based and Hierarchical.
2.1 Concatenation-based Encoding
In this approach, historical utterances are concatenated to represent the context. In the pre-Transformer era, concatenation-based encoding was the go-to method for training an RNN-based encoder-decoder (Bahdanau et al., 2015) for dialogue generation (Sordoni et al., 2015b). A major issue with this approach is that the concatenated utterances can be very long, depending on the conversation. Moreover, modeling long-range dependencies with an RNN/LSTM is difficult. This is why researchers switched to hierarchical encoders (Section 2.2) to handle long conversations. However, concatenation-based encoding returned to the forefront after the emergence of the Transformer architecture (Vaswani et al., 2017). Most Transformer-based dialogue models concatenate previous utterances and finetune the decoder on a language modeling task (Wolf et al., 2019; Zhang et al., 2020; Bao et al., 2020; Li et al., 2021; Chen et al., 2022) to achieve state-of-the-art results on various dialogue datasets. Note that Transformers have a limit on the maximum sequence length. Consequently, these dialogue models can only take the last k previous utterances as input, based on a pre-defined maximum sequence length. Hence, they cannot look beyond the last k turns and thereby cannot capture very long-range dependencies among dialogue turns.
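To make this truncation concrete, the minimal sketch below (hypothetical helper names; `tokenize` stands in for any subword tokenizer) keeps only the most recent utterances that fit in a fixed token budget; anything older, and any dependency on it, is silently dropped:

```python
def build_context(history: list[str], tokenize, max_tokens: int,
                  sep: str = " <eos> ") -> str:
    """Concatenate dialogue history right-to-left, keeping only the
    most recent utterances that fit in the model's token budget."""
    kept = []
    budget = max_tokens
    for utt in reversed(history):           # newest utterance first
        cost = len(tokenize(sep + utt))     # approximate token cost
        if cost > budget:
            break                           # older turns are dropped
        kept.append(utt)
        budget -= cost
    return sep.join(reversed(kept))
```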
There are variations of the Transformer (like Big-Bird (Zaheer et al., 2020), Poolingformer (Zhang et al., 2021), etc.) that reduce the computational complexity of the self-attention operation from $O(n^2)$ to $O(n)$, enabling longer sequence lengths. However, looking at more context does not necessarily solve the problem of long-range dependencies, as dependencies may still exist beyond the longest context that fits within the maximum allowed sequence length.
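For intuition about where the savings come from, the sketch below builds a sliding-window attention mask, one of the patterns Big-Bird combines with global and random attention (this is only one ingredient of such models, not their full mechanism):

```python
import numpy as np

def sliding_window_mask(n: int, w: int) -> np.ndarray:
    """Boolean attention mask where position i may attend only to
    positions j with |i - j| <= w. Each row has at most 2*w + 1
    allowed entries, so attention cost is O(n*w) instead of O(n^2)."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

mask = sliding_window_mask(n=8, w=2)   # 8 tokens, window of 2
```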
2.2 Hierarchical Encoding
In this strategy, the encoding of arbitrarily long conversations is achieved through a hierarchical encoder. Each utterance is first encoded using an utterance-level encoder. The encoded utterances are then fed to a dialogue-level encoder to obtain the final context representation.
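A minimal sketch of this two-level scheme, in the spirit of HRED, is shown below (the dimensions and layer choices are ours for illustration, not those of any specific model):

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Two-level context encoder in the spirit of HRED: one GRU
    encodes tokens within each utterance, a second GRU runs over
    the resulting utterance vectors (toy sizes, no attention)."""

    def __init__(self, vocab_size: int, d: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d)
        self.utt_rnn = nn.GRU(d, d, batch_first=True)  # utterance level
        self.dlg_rnn = nn.GRU(d, d, batch_first=True)  # dialogue level

    def forward(self, turns: list) -> torch.Tensor:
        # Encode each utterance (a 1-D tensor of token ids) into one vector.
        utt_vecs = [self.utt_rnn(self.emb(t.unsqueeze(0)))[1][-1]
                    for t in turns]                     # each (1, d)
        # The dialogue-level GRU's final state is the context representation.
        seq = torch.stack(utt_vecs, dim=1)              # (1, T, d)
        return self.dlg_rnn(seq)[1][-1]                 # (1, d)
```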
As discussed in Section 2.1, vanilla RNN-based encoder-decoder architectures cannot handle long conversations. To address this issue, researchers adopted hierarchical recurrent encoders, where two separate RNNs/LSTMs are employed as the utterance-level and dialogue-level encoders. Models like HRED (Sordoni et al., 2015a) and VHRED (Serban et al., 2017) fall under this category. A few works use BERT as an utterance-level encoder (Kim et al., 2020a). Li et al. (2019) proposed an Incremental Transformer for hierarchical recurrent encoding. DialogBERT (Gu et al., 2021) uses two