Conversation Disentanglement with Bi-Level Contrastive Learning
Chengyu Huang
National University of Singapore
e0376956@u.nus.edu
Zheng Zhang
Tsinghua University
zhangz.goal@gmail.com
Hao Fei
National University of Singapore
haofei37@nus.edu.sg
Lizi Liao
Singapore Management University
lzliao@smu.edu.sg
Abstract
Conversation disentanglement aims to group utterances into detached sessions, which is a fundamental task in processing multi-party conversations. Existing methods have two main drawbacks. First, they overemphasize pairwise utterance relations but pay inadequate attention to modeling the utterance-to-context relation. Second, a huge amount of human-annotated data is required for training, which is expensive to obtain in practice. To address these issues, we propose a general disentanglement model based on bi-level contrastive learning. It brings utterances in the same session closer while encouraging each utterance to be near its clustered session prototypes in the representation space. Unlike existing approaches, our model works in both supervised settings with labeled data and unsupervised settings when no such data is available. The proposed method achieves new state-of-the-art results in both settings across several public datasets.
1 Introduction
Multi-party conversations generally involve three or more speakers in a single dialogue, in which the speakers' utterances are interleaved and multiple topics may be discussed concurrently (Aoki et al., 2006). This makes it inconvenient for dialogue participants to digest the utterances and respond to a particular topic thread. Conversation disentanglement is the task of separating these entangled utterances into detached sessions, which is a prerequisite of many important downstream tasks such as dialogue information extraction (Fei et al., 2022a,b), state tracking (Zhang et al., 2019; Wu et al., 2022), response generation (Liao et al., 2018, 2021b; Ye et al., 2022a,b), and response ranking (Elsner and Charniak, 2008; Lowe et al., 2017).
[Figure 1: An example piece of conversation from the Ubuntu IRC corpus, in which two sessions are interleaved: one about capturing a media stream (lolcat, usr13, dr_willis, OttScorp) and one about using GNOME 3 instead of Unity on Ubuntu 12.04 (Gremuchnik, OttScorp, ArNezT, L1nuxRules). There are distribution patterns at both the utterance level and the session level.]
There has been substantial work on the conversation disentanglement task. Most of it emphasizes the pairwise relation between utterances in a two-step manner: predicting the relationship between utterance pairs as the first step, followed by clustering utterances into sessions as the second. In the first step, early works (Elsner and Charniak, 2008, 2010) utilize handmade features and discourse cues to predict whether two utterances belong to the same session or whether there is a reply-to relation. The development of deep learning inspired the use of neural networks such as LSTMs or CNNs to learn abstract features of utterances in training (Mehri and Carenini, 2017; Jiang et al., 2018). More recently, a number of methods show that BERT in combination with handcrafted features or heuristics remains a strong baseline (Li et al., 2020b; Zhu et al., 2021; Ma et al., 2022). In the second step, the most popular clustering methods use a greedy approach to group utterances by adding pairs (Wang and Oard, 2009; Zhu et al., 2020). There are also variations incorporating a voting mechanism (Kummerfeld et al., 2019), bipartite graph matching (Zhu et al., 2021), or additional tracking models (Wang et al., 2020).

An obvious drawback of such a two-step approach
is that the pairwise relation prediction might not capture enough contextual information, as the connection between two utterances depends on the context in many cases (Liu et al., 2020). Also, focusing on pairwise relations leads to a short-sighted local view. To mitigate this, some methods introduce an additional conversation loss (Li et al., 2020b, 2022) or a session classifier (Liu et al., 2021) to group utterances in the same session together. We also see methods leveraging relational graph convolutional networks (Ma et al., 2022) or masking mechanisms in Transformers (Zhu et al., 2020). More directly, end-to-end methods (Tan et al., 2019; Liu et al., 2020) capture the context information contained in detached sessions and calculate the matching degree between a session and an utterance. However, many such methods operate in an online manner that only considers the preceding context. This may lead to biased session representations, introduce noisy utterances into sessions, and consequently accumulate errors.
Meanwhile, most of these methods rely heavily upon human-annotated session labels or reply-to relations, which are expensive to obtain in practice. Although there have been a few attempts to tackle this issue, a more general framework that can handle both supervised and unsupervised learning is yet to come. For example, Liu et al. (2021) design a deep co-training scheme with a message-pair classifier and a session classifier; however, various data augmentation procedures based on heuristics are required for good performance. Chi and Rudnicky (2021) propose a zero-shot disentanglement solution based on a related response selection task; still, it relies on a closely related dataset that comes from the same Ubuntu IRC source inside DSTC8.
Recently, contrastive learning (Hadsell et al., 2006) has benefited a large number of machine learning tasks by introducing unsupervised representation learning, with substantial performance gains reported in computer vision (He et al., 2020; Chen et al., 2020) and NLP (Yan et al., 2021; Gao et al., 2021). The underlying belief is that a good representation should identify semantically close neighbors while distinguishing them from non-neighbors. Intuitively, in a multi-party conversation, utterances in the same session should semantically resemble each other while staying far apart from utterances in other sessions. Contrastive learning thus offers a way to learn discriminative representations automatically, instead of relying on handcrafted features such as speaker, mentions, and time differences.
In this work, we design a Bi-level Contrastive Learning scheme (Bi-CL) to learn discriminative representations of entangled multi-party dialogue utterances. It not only learns utterance-level differences across sessions but, more importantly, encodes session-level structures discovered by clustering into the learned embedding space. Specifically, we introduce session prototypes to represent each session for capturing global dialogue structure, and encourage each utterance to be closer to its assigned prototype. Since the prototypes can be estimated by performing clustering on the utterance representations, the model also supports unsupervised conversation disentanglement under an Expectation-Maximization framework. We evaluate the proposed model under both supervised and unsupervised settings across several public datasets, and it achieves new state-of-the-art results on both.
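To give a concrete flavor of the bi-level objective, the following is a minimal PyTorch sketch of the two loss terms described above. The temperatures, the use of mean-pooled session prototypes as cluster centers, and all names here are simplifying assumptions for illustration, not the exact formulation.

```python
import torch
import torch.nn.functional as F

def bi_level_contrastive_loss(embs, session_ids, tau_u=0.5, tau_p=0.5):
    """Illustrative bi-level contrastive loss (assumed form).

    embs:        (n, d) utterance embeddings
    session_ids: (n,) session id per utterance -- gold labels in the
                 supervised setting, cluster assignments otherwise
    """
    embs = F.normalize(embs, dim=-1)
    n, device = embs.size(0), embs.device

    # Utterance level: pull utterances of the same session together,
    # push apart utterances of different sessions.
    sim = embs @ embs.t() / tau_u                            # (n, n)
    self_mask = torch.eye(n, dtype=torch.bool, device=device)
    pos_mask = (session_ids.unsqueeze(0) == session_ids.unsqueeze(1)) & ~self_mask
    logits = sim.masked_fill(self_mask, float('-inf'))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)          # avoid -inf * 0 = NaN
    loss_u = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)

    # Session level: pull each utterance toward its session prototype
    # (here simply the mean embedding of the session's utterances).
    sessions = session_ids.unique()                          # sorted unique ids
    protos = torch.stack([embs[session_ids == s].mean(0) for s in sessions])
    protos = F.normalize(protos, dim=-1)                     # (k, d)
    proto_logits = embs @ protos.t() / tau_p                 # (n, k)
    targets = torch.searchsorted(sessions, session_ids)      # own prototype index
    loss_p = F.cross_entropy(proto_logits, targets, reduction='none')

    return (loss_u + loss_p).mean()
```

The utterance-level term is a supervised-contrastive loss over in-conversation pairs, while the prototype term treats each utterance's own session prototype as the positive against all other prototypes.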
The contributions are summarized as follows:

• We design a bi-level contrastive learning scheme to learn better utterance-level and session-level representations for disentanglement.
• We delve into the nature of conversations to harvest evidence that supports our model in disentangling dialogues without any supervision.
• Experiments show that the proposed Bi-CL model significantly outperforms several state-of-the-art models in both the supervised and unsupervised settings across datasets.
2 Related Work
2.1 Conversation Disentanglement
Previous methods for conversation disentanglement mostly operate in a supervised fashion and can be coarsely organized into two lines: (1) two-step methods, which first obtain the pairwise relations among utterances and then disentangle them with a clustering algorithm; and (2) end-to-end approaches, which directly assign utterances to different sessions.
The majority of efforts follow the two-step pipeline, with great attention devoted to the first step. Early works rely heavily on handcrafted features to represent the utterances for pairwise relation prediction. For example, Elsner and Charniak (2008, 2010) used the speaker, time, mentions, shared word count, etc. to train a linear classifier for utterance pair coherence. More recent works utilize neural networks to train classifiers. For instance, Mehri and Carenini (2017) and Guo et al. (2018) leveraged LSTMs to predict either same-session or reply-to probabilities between utterances, while Jiang et al. (2018) combined the output of a hierarchical CNN on utterances with other features to capture the interactions. More recently, Gu et al. (2020) and Li et al. (2020b) used BERT to learn the similarity score within a fixed-length context window. For the second step, there has also been progress in exploring optimal clustering algorithms. Greedy decoding has been a popular choice (Elsner and Charniak, 2010; Jiang et al., 2018). There are also works that train a separate classifier to assign utterances to threads (Mehri and Carenini, 2017) or design advanced algorithms like bipartite graph matching (Zhu et al., 2021).
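To make the second step concrete, below is a minimal sketch of one common form of greedy decoding: each utterance joins the session containing its highest-scoring earlier utterance, or starts a new session when no score clears a threshold. The `pair_score` function stands in for any trained pairwise classifier, and comparing only against each session's most recent utterance is a simplifying assumption of this sketch.

```python
def greedy_disentangle(utterances, pair_score, threshold=0.5):
    """Greedy clustering over pairwise same-session scores (illustrative).

    pair_score(u_i, u_j) -> float is assumed to come from a trained
    pairwise classifier; threshold is a tuning hyperparameter.
    """
    sessions = []                      # each session: a list of utterance indices
    for i, u in enumerate(utterances):
        best_score, best_session = threshold, None
        for s in sessions:
            # score the new utterance against the session's latest member
            score = pair_score(utterances[s[-1]], u)
            if score > best_score:
                best_score, best_session = score, s
        if best_session is None:
            sessions.append([i])       # no session is similar enough: start a new one
        else:
            best_session.append(i)
    return sessions
```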
On the downside, the pairwise relations, which are typically predicted without considering enough session context, are local and may not reflect how utterances interact in reality; the subsequent clustering step may therefore be undermined. This motivates end-to-end solutions that assign the target utterance at each time step with respect to the existing threads or preceding utterances (Liu et al., 2020). Similarly, Yu and Joty (2020) used attention to capture utterance interactions and gradually assigned each utterance to its replied-to parent with a pointer module. However, such an online manner not only limits the scope of session context but also leads to error accumulation.
There are also studies that work in an unsupervised fashion to avoid the reliance on human annotation. For example, Liu et al. (2021) designed both a message-pair classifier and a session classifier to form a co-training algorithm. Chi and Rudnicky (2021) proposed to train a closely related response selection model for zero-shot disentanglement. The former needs pseudo-labeled data to warm up the training, while the latter gains from training data of the same source. More importantly, a general framework that can handle both supervised and unsupervised learning is yet to come. In our work, we aim to build such a flexible model.
2.2 Contrastive Learning
Contrastive learning learns effective representations by pulling semantically close neighbors together and pushing apart non-neighbors (Hadsell et al., 2006). Recent advances are largely driven by instance discrimination tasks. For example, in computer vision, such methods consist of two key components: image transformation and a contrastive loss. The former aims to generate multiple representations of the same image, by data augmentation (Ye et al., 2019; Chen et al., 2020), patch perturbation (Misra and Maaten, 2020), or momentum features (He et al., 2020), while the latter aims to bring samples from the same instance closer and separate samples from different instances. In natural language processing, contrastive learning has also been widely applied, for example in language model pre-training (Yan et al., 2021; Gao et al., 2021).
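A representative instantiation of this objective is the InfoNCE loss; as an illustration (not necessarily the exact variant used in the cited works), for an anchor representation $z_i$ with positive $z_i^+$, $N$ in-batch candidates $\{z_j\}$, cosine similarity $\mathrm{sim}(\cdot,\cdot)$, and temperature $\tau$:

$$\mathcal{L}_i = -\log \frac{\exp\!\big(\mathrm{sim}(z_i, z_i^{+})/\tau\big)}{\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)}$$

Minimizing this loss maximizes agreement with the positive relative to all candidates in the denominator.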
Despite their improved performance, these instance discrimination methods share a common weakness: the representation is not encouraged to encode the global semantic structure of the data (Caron et al., 2020). This is because two samples are treated as a negative pair as long as they come from different instances, regardless of their semantic similarity (Li et al., 2020a). Hence, there are methods that conduct contrastive learning at the instance and cluster levels simultaneously (Li et al., 2021; Shen et al., 2021). Likewise, we emphasize leveraging bi-level contrastive objectives to learn better utterance-level and session-level representations.
3 Method
This section presents the definition of the conversation disentanglement task, followed by the details of our model. We start from the supervised setting for a clear view and then gradually extend to the unsupervised setting.
3.1 Task Formulation
Given a multi-party conversation history with $n$ utterances $U = \{u_1, u_2, \ldots, u_n\}$ in chronological order, our goal is to disentangle them into detached sessions $S = \{s_1, s_2, \ldots, s_k\}$, where each $s_i$ is a non-empty subset of $U$ and $S$ is a partition of $U$. Each utterance includes the identity of a speaker and the message sent by that user.
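As a minimal illustration of the structures involved (the speaker names and messages below are invented, not drawn from any dataset):

```python
# A toy multi-party history U in chronological order: (speaker, message) pairs.
U = [
    ("alice", "how do I record a stream?"),              # u1
    ("bob",   "does 12.04 ship GNOME 3?"),               # u2
    ("carol", "alice: try vlc, it captures streams"),    # u3
    ("dave",  "bob: yes, you can install any desktop"),  # u4
]

# A valid disentanglement S: a partition of U's indices into k = 2
# detached sessions, each non-empty and non-overlapping.
S = [{0, 2}, {1, 3}]
```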
The task has popularly been formulated as a reply-to relation identification problem: finding the parent utterance for every $u_i \in U$. It has also been modeled as sequentially assigning each $u_i$ to an already detached session in $S$ or creating a new session in $S$. Here, instead of separating local pair modeling from global cluster modeling, we opt to learn more discriminative representations for utterances so as to push them into different sessions.
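In the unsupervised setting, the session assignments needed by the contrastive objective are not given. One plausible way to realize the Expectation-Maximization loop mentioned in the introduction is to alternate clustering of utterance embeddings (E-step: estimate prototypes and pseudo-session assignments) with contrastive updates of the encoder (M-step). The sketch below assumes k-means as the clustering method, an `encoder` that maps utterances to embeddings, and the `bi_level_contrastive_loss` sketch from the introduction; all are illustrative assumptions.

```python
import torch
from sklearn.cluster import KMeans

def em_disentangle_step(encoder, utterances, k, optimizer):
    """One illustrative EM-style round for the unsupervised setting.

    E-step: cluster current utterance embeddings into k pseudo-sessions
            (k-means here is an assumption; the text only states that
            prototypes are estimated by clustering).
    M-step: update the encoder with the bi-level contrastive loss,
            treating cluster ids as session labels.
    """
    with torch.no_grad():
        embs = encoder(utterances)                        # (n, d), frozen pass
    pseudo_ids = KMeans(n_clusters=k, n_init=10).fit_predict(embs.cpu().numpy())
    pseudo_ids = torch.as_tensor(pseudo_ids, device=embs.device)

    embs = encoder(utterances)                            # second pass with grad
    loss = bi_level_contrastive_loss(embs, pseudo_ids)    # sketch from Section 1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return pseudo_ids
```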