color. Although these patterns appear in only
three utterances, they convey the key conversational
semantics (e.g., topics) and are more important
than the other parts (e.g., greetings and chit-chat).
We argue that capturing and leveraging them is one
of the keys to learning high-quality unsupervised
dialogue embeddings.
In this work, we propose dial2vec, a self-guided
contrastive learning approach to solve the proposed
task. Dial2vec considers a dialogue as an informa-
tion exchange process between the two interlocu-
tors and learns embeddings for both interlocutors
with the help of each other. Specifically, dial2vec
first encodes a dialogue through a PLM and assigns
each interlocutor a self-representation by
masking the non-corresponding positions in the
encoding outputs. Then it calculates a matching
matrix via the token-level dot-product operation
between the two self-representations, obtaining a
cross-representation for each interlocutor. Finally,
the two cross-representations are leveraged as guid-
ance to help the two self-representations gradually
learn the interlocutor-level interaction-aware infor-
mation and eliminate the interaction-free informa-
tion during the training procedure.
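The tensor flow described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the PLM encodings are stand-ins, and the softmax-weighted combination used to form the cross-representations is an assumption about how the matching matrix guides each interlocutor.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dial2vec_flow(hidden, speaker_ids):
    """Sketch of dial2vec's representation flow.

    hidden:      (T, d) token encodings from a PLM.
    speaker_ids: (T,) 0/1 array marking which interlocutor
                 produced each token.
    """
    mask_a = (speaker_ids == 0)[:, None].astype(hidden.dtype)
    mask_b = (speaker_ids == 1)[:, None].astype(hidden.dtype)
    # Self-representations: zero out the other interlocutor's positions.
    self_a, self_b = hidden * mask_a, hidden * mask_b
    # Token-level matching matrix between the two self-representations.
    match = self_a @ self_b.T                        # (T, T)
    # Cross-representations: each side re-expressed through a weighted
    # combination of the other side's tokens (weighting is an assumption).
    cross_a = softmax(match, axis=1) @ self_b        # guidance for A
    cross_b = softmax(match.T, axis=1) @ self_a      # guidance for B
    return self_a, self_b, cross_a, cross_b
```

During training, the cross-representations serve as contrastive guidance pulling each self-representation toward interaction-aware content.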
To verify our model, we build a comprehensive
benchmark comprising a total of 98,879 dialogues
by introducing six widely-used dialogue datasets,
including BiTOD (Lin et al., 2021), Doc2dial (Feng
et al., 2020), MetalWOZ (Lee et al., 2019), MultiWOZ
(Eric et al., 2019), Self-dialogue (Fainberg
et al., 2018), and SGD (Rastogi et al., 2020). Each
dataset consists of thousands of dialogues, where
each dialogue is provided with a domain label (e.g.,
hotel booking and movie). We leverage these la-
bels and design three evaluation tasks: domain
categorization, semantic relatedness, and dialogue
retrieval. We categorize them as intrinsic or
extrinsic tasks according to their focus.
Experimental results on this benchmark show
that dial2vec outperforms the baselines by a substantial
margin. Compared with the strongest baseline,
dial2vec achieves average absolute improvements of
8.7, 9.0, and 13.8 points in purity, Spearman's
correlation, and mean average precision (MAP)
on the three tasks, respectively.
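The paper does not spell out the metric computations, but purity and MAP have standard definitions; a minimal sketch of both, assuming clusterings over dialogue embeddings and 0/1 relevance rankings for retrieval:

```python
import numpy as np

def purity(cluster_ids, labels):
    """Clustering purity: each cluster votes its majority domain label;
    purity is the fraction of dialogues covered by those majorities."""
    cluster_ids, labels = np.asarray(cluster_ids), np.asarray(labels)
    majority_total = 0
    for c in np.unique(cluster_ids):
        # Count the most common domain label within this cluster.
        _, counts = np.unique(labels[cluster_ids == c], return_counts=True)
        majority_total += counts.max()
    return majority_total / len(labels)

def mean_average_precision(ranked_relevance):
    """MAP over queries; each entry is a 0/1 relevance list ordered by
    the ranking that the dialogue embeddings induce for that query."""
    aps = []
    for rel in ranked_relevance:
        hits, precisions = 0, []
        for rank, r in enumerate(rel, start=1):
            if r:
                hits += 1
                precisions.append(hits / rank)
        aps.append(sum(precisions) / max(hits, 1))
    return sum(aps) / len(aps)
```

For example, `purity([0, 0, 1, 1], ["a", "a", "a", "b"])` gives 0.75, since cluster 1 splits its vote between two domains.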
We also conduct experiments with the single inter-
locutor’s embeddings, their aggregation strategies,
and the overall dialogue embedding distributions
to study how dial2vec achieves such strong performance.
The results demonstrate that dial2vec
learns both informative and discriminative embed-
dings for the two interlocutors and achieves the
best performance when combining them through
the proposed interlocutor-level pooling aggregation
strategy.
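The interlocutor-level pooling idea can be sketched as follows. This is an illustrative reading, not the paper's exact operator: mean pooling within each interlocutor's tokens and averaging the two resulting vectors are both assumptions.

```python
import numpy as np

def interlocutor_pooling(hidden, speaker_ids):
    """Pool token embeddings per interlocutor, then combine.

    hidden:      (T, d) token embeddings for one dialogue.
    speaker_ids: (T,) 0/1 array of token ownership.
    """
    hidden, speaker_ids = np.asarray(hidden), np.asarray(speaker_ids)
    # Pool each interlocutor's tokens separately, preserving the
    # per-speaker information a flat average over all tokens would blur.
    emb_a = hidden[speaker_ids == 0].mean(axis=0)   # interlocutor A
    emb_b = hidden[speaker_ids == 1].mean(axis=0)   # interlocutor B
    # Combine the two interlocutor embeddings (averaging is assumed).
    return (emb_a + emb_b) / 2.0                    # dialogue embedding
```

The design point is that pooling per interlocutor before combining keeps each speaker's embedding discriminative, rather than letting one verbose speaker dominate a single flat average.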
2 Related Work
2.1 Text Embedding
Text embedding aims to encode a piece of text into
a distributed vector that represents its semantics.
Early works (Bengio et al., 2003; Mikolov
et al., 2013; Pennington et al., 2014) learn unsupervised
word embeddings by exploiting word-level
co-occurrence information in the skip-gram
or CBOW tasks. Recently, Devlin et al. (2018);
Liu et al. (2019a); Yang et al. (2019); Raffel et al.
(2020) pre-train deep Transformers (Vaswani et al.,
2017) with a series of pretext tasks, setting a new
state of the art on the GLUE benchmark (Wang
et al., 2018) and exhibiting strong potential
for producing general text embeddings. Along
this line, Gao et al. (2021); Yan et al. (2021); Liu
et al. (2021); Chuang et al. (2022); Nishikawa et al.
(2022); Zhou et al. (2022); Klein and Nabi (2022)
fine-tune the PLMs with contrastive learning ob-
jectives, achieving remarkable improvements in
learning unsupervised sentence embeddings. Luo
et al. (2021) introduce a data-augmentation-based
contrastive learning approach for learning document
embeddings, achieving superior performance
over word2vec-based approaches (Le and Mikolov,
2014; Chen, 2017).
For dialogue embedding, the above approaches
are generally unsatisfactory, as they typically ob-
tain dialogue embeddings by averaging the pre-
trained word or sentence embeddings, ignoring the
interlocutor-level conversational interactions. Al-
though conversational-PLMs pre-trained with dia-
logue data can solve this problem to some extent
(Wu et al., 2020a; Bao et al., 2020; Roller et al.,
2021), they mainly focus on learning end-to-end
models which are not sufficient for our task. As a
comparison, we study how to produce high-quality
dialogue embeddings by fully exploiting the con-
versational information.
2.2 Contrastive Learning
Contrastive learning is an emerging self-supervised
learning method that can improve the representation
capability of PLMs in both the pre-training and
fine-tuning stages. Wu et al. (2020b); Meng et al.