is that pairwise relation prediction might not
capture enough contextual information, since the
connection between two utterances often depends
on their context (Liu et al., 2020). Moreover, focusing
on pairwise relations yields a short-sighted, local
view. To mitigate this, some methods introduce an
additional conversation loss (Li et al., 2020b, 2022)
or a session classifier (Liu et al., 2021) to group
utterances of the same session together. Others
leverage relational graph convolutional networks
(Ma et al., 2022) or masking mechanisms in
Transformers (Zhu et al., 2020). More directly,
end-to-end methods (Tan et al., 2019; Liu et al., 2020)
capture the contextual information contained in
detached sessions and calculate the matching degree
between a session and an utterance. However, many
of these methods operate in an online manner that
considers only the preceding context, which may lead
to biased session representations, introduce noisy
utterances into sessions, and consequently
accumulate errors.
Meanwhile, most of these methods rely heavily
on human-annotated session labels or reply-to
relations, which are expensive to obtain in practice.
Although there have been a few attempts to tackle
this issue, a more general framework that can handle
both supervised and unsupervised learning has yet
to emerge. For example, Liu et al. (2021) design
a deep co-training scheme with a message-pair
classifier and a session classifier; however, it
requires various heuristic data augmentation
procedures to perform well. Chi and Rudnicky (2021)
propose a zero-shot disentanglement solution based
on a related response selection task, but it relies
on a closely related dataset drawn from the same
Ubuntu IRC source within DSTC8.
Recently, contrastive learning (Hadsell et al.,
2006) has benefited a wide range of machine
learning tasks by enabling unsupervised
representation learning. Substantial performance
gains have been reported in computer vision (He
et al., 2020; Chen et al., 2020) and NLP
(Yan et al., 2021; Gao et al., 2021). The underlying
intuition is that a good representation should
identify semantically close neighbors while
distinguishing them from non-neighbors. Likewise,
in multi-party conversation, utterances in the same
session should semantically resemble each other
while remaining far apart from utterances in other
sessions. Instead of relying on handcrafted features
such as speaker, mention, and time difference,
contrastive learning thus offers a way to
automatically learn discriminative representations.
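As a concrete illustration, the neighbor-identification objective above is commonly instantiated as an InfoNCE-style loss. The following is a minimal NumPy sketch; the function name, shapes, and temperature value are our own illustrative choices, not the exact formulation used in this work:

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Pull each anchor toward its positive and away from negatives.

    anchor, positive: (B, d) arrays; negatives: (N, d) array.
    """
    def norm(x):
        # Unit-normalize so dot products become cosine similarities.
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a, p, n = norm(anchor), norm(positive), norm(negatives)
    pos = np.sum(a * p, axis=-1) / temperature       # (B,) positive similarity
    neg = a @ n.T / temperature                      # (B, N) negative similarities
    logits = np.concatenate([pos[:, None], neg], axis=1)
    # Cross-entropy with the positive always at index 0.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[:, 0].mean()
```

The loss is near zero when the anchor matches its positive and is orthogonal to the negatives, and grows when a negative is more similar to the anchor than the positive is.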
In this work, we design a Bi-level Contrastive
Learning scheme (Bi-CL) to learn discriminative
representations of tangled multi-party dialogue
utterances. It not only learns utterance-level
differences across sessions but, more importantly,
encodes session-level structures discovered by
clustering into the learned embedding space.
Specifically, we introduce session prototypes to
represent each session, capturing global dialogue
structure, and encourage each utterance to move
closer to its assigned prototype. Since the
prototypes can be estimated by clustering the
utterance representations, the scheme also supports
unsupervised conversation disentanglement under an
Expectation-Maximization framework. We evaluate
the proposed model in both supervised and
unsupervised settings across several public
datasets, achieving new state-of-the-art results in
both.
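To make the prototype idea concrete, one possible sketch of the E-step (estimating session prototypes by clustering utterance embeddings) and the resulting prototype-level contrastive signal is shown below. This is an illustrative simplification using plain k-means; all function names, hyperparameters, and the choice of clustering algorithm are our own assumptions, not the paper's exact procedure:

```python
import numpy as np

def estimate_prototypes(embs, k, iters=10, seed=0):
    """E-step: cluster (N, d) utterance embeddings into k sessions."""
    rng = np.random.default_rng(seed)
    protos = embs[rng.choice(len(embs), size=k, replace=False)]
    for _ in range(iters):
        # Assign each utterance to its nearest prototype.
        dists = np.linalg.norm(embs[:, None] - protos[None], axis=-1)
        assign = np.argmin(dists, axis=1)
        # Re-estimate each prototype as the mean of its assigned utterances.
        for c in range(k):
            if np.any(assign == c):
                protos[c] = embs[assign == c].mean(axis=0)
    return protos, assign

def prototype_contrastive_loss(embs, protos, assign, temperature=0.5):
    """M-step signal: each utterance should score highest against its own
    session prototype under a softmax over all prototypes."""
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    sims = norm(embs) @ norm(protos).T / temperature   # (N, k)
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(embs)), assign].mean()
```

Alternating these two steps mirrors the EM view: clustering fixes the prototype assignments, and minimizing the loss sharpens the embedding space around those prototypes before the next clustering round.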
Our contributions are summarized as follows:
• We design a bi-level contrastive learning
scheme to learn better utterance-level and
session-level representations for disentanglement.
• We delve into the nature of conversations to
harvest evidence that supports disentangling
dialogues without any supervision.
• Experiments show that the proposed Bi-CL
model significantly outperforms several
state-of-the-art models in both supervised and
unsupervised settings across datasets.
2 Related Work
2.1 Conversation Disentanglement
Previous methods for conversation disentanglement
are mostly supervised and can be coarsely organized
into two lines: (1) two-step methods, which first
predict pairwise relations among utterances and
then disentangle them with a clustering algorithm;
and (2) end-to-end approaches, which directly
assign utterances to different sessions.
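For intuition, the generic two-step pipeline can be sketched as follows. Here `pair_score` stands in for any learned pairwise-relation model, and the greedy best-antecedent linking with a threshold is one common clustering choice, not a specific published system:

```python
def disentangle_two_step(pair_score, utterances, threshold=0.5):
    """Step 1: score each utterance against all earlier ones.
    Step 2: link it to its best-scoring antecedent (or start a new
    session) and read sessions off the resulting forest (union-find)."""
    parent = list(range(len(utterances)))

    def find(i):
        # Path-halving union-find lookup.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for j in range(1, len(utterances)):
        scores = [pair_score(utterances[i], utterances[j]) for i in range(j)]
        best = max(range(j), key=lambda i: scores[i])
        if scores[best] >= threshold:   # below threshold: open a new session
            parent[find(j)] = find(best)

    # Group utterance indices by their session root.
    sessions = {}
    for idx in range(len(utterances)):
        sessions.setdefault(find(idx), []).append(idx)
    return list(sessions.values())
```

The quality of the resulting sessions hinges almost entirely on `pair_score`, which is why most work in this line concentrates on the first step.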
The majority of efforts follow the two-step
pipeline, with great attention devoted to the first
step. Early works relied heavily on handcrafted
features to represent utterances for pairwise
relation prediction. For example, Elsner and
Charniak (2008, 2010) used speaker, time, mentions,
shared word count, etc. to train a linear classifier
for utterance-pair coherence. More recent works
have trained neural classifiers. For
instance, Mehri and Carenini (2017) and Guo et al.