is that pairwise relation prediction might not
capture enough contextual information, since the
connection between two utterances often depends
on their context (Liu et al., 2020). Moreover, focusing
on pairwise relations yields a short-sighted, local
view. To mitigate this, some methods introduce an
additional conversation loss (Li et al., 2020b, 2022)
or a session classifier (Liu et al., 2021) to group
utterances of the same session together. Others
leverage relational graph convolutional networks
(Ma et al., 2022) or masking mechanisms in
Transformers (Zhu et al., 2020). More directly,
end-to-end methods (Tan et al., 2019; Liu et al., 2020)
capture the contextual information contained in
detached sessions and calculate the matching degree
between a session and an utterance. However, many
of these methods operate in an online manner that
considers only the preceding context, which may lead
to biased session representations, introduce noisy
utterances into sessions, and consequently
accumulate errors.
Meanwhile, most of these methods rely heavily
on human-annotated session labels or reply-to
relations, which are expensive to obtain in practice.
Although there have been a few attempts to tackle
this issue, a more general framework that can handle
both supervised and unsupervised learning has yet
to emerge. For example, Liu et al. (2021) design
a deep co-training scheme with a message-pair
classifier and a session classifier; however, it
requires various heuristic data augmentation
procedures to perform well. Chi and Rudnicky (2021)
propose a zero-shot disentanglement solution based
on a related response selection task, but it relies
on a closely related dataset drawn from the same
Ubuntu IRC source within DSTC8.
Recently, contrastive learning (Hadsell et al.,
2006) has benefited a wide range of machine
learning tasks by enabling unsupervised
representation learning. Substantial performance
gains have been reported in computer vision (He
et al., 2020; Chen et al., 2020) and NLP
(Yan et al., 2021; Gao et al., 2021). The underlying
intuition is that a good representation should
identify semantically close neighbors while
distinguishing them from non-neighbors. Likewise,
in multi-party conversation, utterances in the same
session should semantically resemble each other
while remaining far apart from utterances in other
sessions. Instead of relying on handcrafted features
such as speaker, mention, and time difference,
contrastive learning thus offers a way to
automatically learn discriminative representations.
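As a concrete illustration, the neighbor-identification objective above is commonly instantiated as an InfoNCE-style loss. The following is a minimal NumPy sketch; the function name, shapes, and temperature value are our own illustrative choices, not the exact formulation used in this work:

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Pull each anchor toward its positive and away from negatives.

    anchor, positive: (B, d) arrays; negatives: (N, d) array.
    """
    def norm(x):
        # Unit-normalize so dot products become cosine similarities.
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a, p, n = norm(anchor), norm(positive), norm(negatives)
    pos = np.sum(a * p, axis=-1) / temperature       # (B,) positive similarity
    neg = a @ n.T / temperature                      # (B, N) negative similarities
    logits = np.concatenate([pos[:, None], neg], axis=1)
    # Cross-entropy with the positive always at index 0.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[:, 0].mean()
```

The loss is near zero when the anchor matches its positive and is orthogonal to the negatives, and grows when a negative is more similar to the anchor than the positive is.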
In this work, we design a Bi-level Contrastive
Learning scheme (Bi-CL) to learn discriminative
representations of tangled multi-party dialogue
utterances. It not only learns utterance-level
differences across sessions but, more importantly,
encodes session-level structures discovered by
clustering into the learned embedding space.
Specifically, we introduce session prototypes to
represent each session, capturing global dialogue
structure, and encourage each utterance to move
closer to its assigned prototype. Since the
prototypes can be estimated by clustering the
utterance representations, the scheme also supports
unsupervised conversation disentanglement under an
Expectation-Maximization framework. We evaluate
the proposed model in both supervised and
unsupervised settings across several public
datasets, achieving new state-of-the-art results in
both.
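To make the prototype idea concrete, one possible sketch of the E-step (estimating session prototypes by clustering utterance embeddings) and the resulting prototype-level contrastive signal is shown below. This is an illustrative simplification using plain k-means; all function names, hyperparameters, and the choice of clustering algorithm are our own assumptions, not the paper's exact procedure:

```python
import numpy as np

def estimate_prototypes(embs, k, iters=10, seed=0):
    """E-step: cluster (N, d) utterance embeddings into k sessions."""
    rng = np.random.default_rng(seed)
    protos = embs[rng.choice(len(embs), size=k, replace=False)]
    for _ in range(iters):
        # Assign each utterance to its nearest prototype.
        dists = np.linalg.norm(embs[:, None] - protos[None], axis=-1)
        assign = np.argmin(dists, axis=1)
        # Re-estimate each prototype as the mean of its assigned utterances.
        for c in range(k):
            if np.any(assign == c):
                protos[c] = embs[assign == c].mean(axis=0)
    return protos, assign

def prototype_contrastive_loss(embs, protos, assign, temperature=0.5):
    """M-step signal: each utterance should score highest against its own
    session prototype under a softmax over all prototypes."""
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    sims = norm(embs) @ norm(protos).T / temperature   # (N, k)
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(embs)), assign].mean()
```

Alternating these two steps mirrors the EM view: clustering fixes the prototype assignments, and minimizing the loss sharpens the embedding space around those prototypes before the next clustering round.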
Our contributions are summarized as follows:
• We design a bi-level contrastive learning
scheme to learn better utterance-level and
session-level representations for disentanglement.
• We delve into the nature of conversations to
harvest evidence that supports disentangling
dialogues without any supervision.
• Experiments show that the proposed Bi-CL
model significantly outperforms several
state-of-the-art models in both supervised and
unsupervised settings across datasets.
2 Related Work
2.1 Conversation Disentanglement
Previous methods for conversation disentanglement
are mostly supervised and can be coarsely organized
into two lines: (1) two-step methods, which first
predict pairwise relations among utterances and
then disentangle them with a clustering algorithm;
and (2) end-to-end approaches, which directly
assign utterances to different sessions.
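For intuition, the generic two-step pipeline can be sketched as follows. Here `pair_score` stands in for any learned pairwise-relation model, and the greedy best-antecedent linking with a threshold is one common clustering choice, not a specific published system:

```python
def disentangle_two_step(pair_score, utterances, threshold=0.5):
    """Step 1: score each utterance against all earlier ones.
    Step 2: link it to its best-scoring antecedent (or start a new
    session) and read sessions off the resulting forest (union-find)."""
    parent = list(range(len(utterances)))

    def find(i):
        # Path-halving union-find lookup.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for j in range(1, len(utterances)):
        scores = [pair_score(utterances[i], utterances[j]) for i in range(j)]
        best = max(range(j), key=lambda i: scores[i])
        if scores[best] >= threshold:   # below threshold: open a new session
            parent[find(j)] = find(best)

    # Group utterance indices by their session root.
    sessions = {}
    for idx in range(len(utterances)):
        sessions.setdefault(find(idx), []).append(idx)
    return list(sessions.values())
```

The quality of the resulting sessions hinges almost entirely on `pair_score`, which is why most work in this line concentrates on the first step.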
The majority of efforts follow the two-step
pipeline, with great attention devoted to the first
step. Early works relied heavily on handcrafted
features to represent utterances for pairwise
relation prediction. For example, Elsner and
Charniak (2008, 2010) used speaker, time, mentions,
shared word count, etc. to train a linear classifier
for utterance-pair coherence. More recent works
have trained neural classifiers. For
instance, Mehri and Carenini (2017) and Guo et al.