Dial2vec: Self-Guided Contrastive Learning
of Unsupervised Dialogue Embeddings
Che Liu, Rui Wang, Junfeng Jiang, Yongbin Li
, Fei Huang
DAMO Academy, Alibaba Group
{liuche.lc,wr224079,jiangjunfeng.jjf,shuide.lyb,f.huang}@alibaba-inc.com
Abstract
In this paper, we introduce the task of learning
unsupervised dialogue embeddings. Trivial
approaches such as combining pre-trained
word or sentence embeddings and encoding
through pre-trained language models (PLMs)
have been shown to be feasible for this
task. However, these approaches typically
ignore the conversational interactions between
interlocutors, resulting in poor performance.
To address this issue, we propose a self-
guided contrastive learning approach named
dial2vec. Dial2vec considers a dialogue as an
information exchange process. It captures the
conversational interaction patterns between
interlocutors and leverages them to guide the
learning of the embeddings corresponding
to each interlocutor. The dialogue embed-
ding is obtained by an aggregation of the
embeddings from all interlocutors. To verify
our approach, we establish a comprehensive
benchmark consisting of six widely-used
dialogue datasets. We consider three evalua-
tion tasks: domain categorization, semantic
relatedness, and dialogue retrieval. Dial2vec
achieves on average 8.7, 9.0, and 13.8 points
absolute improvements in terms of purity,
Spearman’s correlation, and mean average
precision (MAP) over the strongest baseline
on the three tasks respectively. Further
analysis shows that dial2vec obtains infor-
mative and discriminative embeddings for
both interlocutors under the guidance of the
conversational interactions and achieves the
best performance when aggregating them
through the interlocutor-level pooling strategy.
All codes and data are publicly available at
https://github.com/AlibabaResearch/DAMO-
ConvAI/tree/main/dial2vec.
1 Introduction
Dialogue embedding, as a critical prerequisite of
semantically understanding a dialogue, has been
∗Corresponding author.
a central issue in dialogue-related research such
as dialogue clustering (Shi et al.,2018;Lv et al.,
2021), conversational sentiment analysis (Wang
et al.,2020;Lv et al.,2021), context-dependent
text-to-SQL (Hui et al.,2021;Wang et al.,2022),
and dialogue summarization (Liu et al.,2019b;Liu
and Chen,2021). Trivial unsupervised approaches
generally encode dialogues by combining their pre-
trained word or sentence embeddings (Pennington
et al.,2014;Reimers and Gurevych,2019) or us-
ing PLMs (Wu et al.,2020a;Bao et al.,2020;He
et al.,2022a,b,c). However, such methods are not
specifically designed for dialogues and thus fail
to adequately capture the key conversational infor-
mation. In this paper, we formally introduce the
task of learning unsupervised dialogue embeddings,
which aims to learn dialogue embeddings that can
well reflect conversational semantics without any
additional manual annotations.
Previous studies have extensively demonstrated
the importance of encoding token-level interac-
tions for learning semantic textual embeddings.
However, for dialogue embedding, encoding
interlocutor-level interactions is also essential but
is overlooked in trivial approaches. Figure 1
shows an example.

    I want to go out and do something.

    Perhaps you want to go to see some live
    music or to a sports event? Do you know
    what city you want to go?

    Maybe a concert? I love jazz. If
    possible, I'd like a concert in Napa.

    There is a concert called Acoustic
    Alchemy at Blue Note Napa.

    Sounds great. That's just
    what I was looking for.

    Have a nice day.

Figure 1: A dialogue from the SGD dataset.

arXiv:2210.15332v1 [cs.CL] 27 Oct 2022

We highlight the significant interaction
patterns between the interlocutors in red. As we
can see, although these patterns only
appear in three utterances, they highly represent the
key conversational semantics (e.g., topics) and are
more important than the other parts (e.g., greetings
and chit-chats). We hold that capturing and leverag-
ing them is one of the keys to learning high-quality
unsupervised dialogue embeddings.
In this work, we propose dial2vec, a self-guided
contrastive learning approach to solve the proposed
task. Dial2vec considers a dialogue as an informa-
tion exchange process between the two interlocu-
tors and learns embeddings for both interlocutors
with the help of each other. Specifically, dial2vec
firstly encodes a dialogue through a PLM and as-
signs each interlocutor a self-representation by
masking the non-corresponding positions in the
encoding outputs. Then it calculates a matching
matrix via the token-level dot-product operation
between the two self-representations, obtaining a
cross-representation for each interlocutor. Finally,
the two cross-representations are leveraged as guid-
ance to help the two self-representations gradually
learn the interlocutor-level interaction-aware infor-
mation and eliminate the interaction-free informa-
tion during the training procedure.
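The masking and matching-matrix steps above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's exact formulation: the function names are hypothetical, and the softmax normalization used to turn the matching matrix into mixing weights over the other interlocutor's tokens is our assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def guidance_step(H, is_p1):
    """Sketch of one dial2vec forward pass.

    H     : (n, d) token-level outputs of the PLM for one dialogue.
    is_p1 : (n,) boolean mask, True where a token was uttered by p1.
    """
    m1 = is_p1[:, None].astype(H.dtype)
    m2 = 1.0 - m1
    Z1, Z2 = H * m1, H * m2        # self-representations (zeros at the
                                   # non-corresponding positions)
    A = Z1 @ Z2.T                  # token-level matching matrix
    C1 = softmax(A, axis=1) @ Z2   # cross-representation guiding p1
    C2 = softmax(A.T, axis=1) @ Z1 # cross-representation guiding p2
    return Z1, Z2, C1, C2
```

During training, each self-representation would be pulled toward (positive samples) or pushed away from (negative samples) its cross-representation.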
To verify our model, we build a comprehensive
benchmark comprising a total of 98,879 dialogues
by introducing six widely-used dialogue datasets,
including BiTOD (Lin et al.,2021), Doc2dial (Feng
et al.,2020), MetalWOZ (Lee et al.,2019), Multi-
WOZ (Eric et al.,2019), Self-dialogue (Fainberg
et al.,2018), and SGD (Rastogi et al.,2020). Each
dataset consists of thousands of dialogues, where
each dialogue is provided with a domain label (e.g.,
hotel booking and movie). We leverage these la-
bels and design three evaluation tasks: domain
categorization, semantic relatedness, and dialogue
retrieval. We categorize them into intrinsic and
extrinsic tasks according to their different focus.
Experimental results on this benchmark show
that dial2vec outperforms the baselines by a sub-
stantial margin. Compared with the strongest base-
line, dial2vec achieves on average 8.7, 9.0, and
13.8 points absolute improvements in terms of pu-
rity, Spearman’s correlation, and mean average
precision (MAP) on the three tasks respectively.
We also conduct experiments with the single inter-
locutor’s embeddings, their aggregation strategies,
and the overall dialogue embedding distributions
to study how dial2vec achieves such advanced per-
formance. The results demonstrate that dial2vec
learns both informative and discriminative embed-
dings for the two interlocutors and achieves the
best performance when combining them through
the proposed interlocutor-level pooling aggregation
strategy.
2 Related Work
2.1 Text Embedding
Text embedding aims to encode a piece of text into
a distributed vector that could represent its seman-
tics. Early works (Bengio et al.,2003;Mikolov
et al.,2013;Pennington et al.,2014) learn unsu-
pervised word embeddings by making use of word-
level co-occurrence information in the skip-gram
or CBOW tasks. Recently, Devlin et al. (2018);
Liu et al. (2019a); Yang et al. (2019); Raffel et al.
(2020) pre-train deep transformer (Vaswani et al.,
2017) with a series of pretext tasks, setting a new
state-of-the-art across the GLUE benchmark (Wang
et al.,2018) as well as exhibiting a strong poten-
tial in producing general text embeddings. Along
this line, Gao et al. (2021); Yan et al. (2021); Liu
et al. (2021); Chuang et al. (2022); Nishikawa et al.
(2022); Zhou et al. (2022); Klein and Nabi (2022)
fine-tune the PLMs with contrastive learning ob-
jectives, achieving remarkable improvements in
learning unsupervised sentence embeddings. Luo
et al. (2021) introduce a data augmentation-based
contrastive learning approach in learning docu-
ment embeddings, achieving superior performance
over word2vec-based approaches (Le and Mikolov,
2014;Chen,2017).
For dialogue embedding, the above approaches
are generally unsatisfactory, as they typically ob-
tain dialogue embeddings by averaging the pre-
trained word or sentence embeddings, ignoring the
interlocutor-level conversational interactions. Al-
though conversational-PLMs pre-trained with dia-
logue data can solve this problem to some extent
(Wu et al.,2020a;Bao et al.,2020;Roller et al.,
2021), they mainly focus on learning end-to-end
models which are not sufficient for our task. As a
comparison, we study how to produce high-quality
dialogue embeddings by fully exploiting the con-
versational information.
2.2 Contrastive Learning
Contrastive learning is an emerging self-supervised
learning method which can improve the represen-
tation capability of PLMs in both pre-training and
fine-tuning stages. Wu et al. (2020b); Meng et al.
Figure 2: Architecture of dial2vec. Firstly, it encodes a dialogue through a PLM and assigns each interlocutor a
self-representation through a masking layer (highlighted with yellow). Hollow circles in each self-representation
represent zero embeddings. Then two matching matrices are calculated through the dot-product multiplication,
based on which two cross-representations are generated. Each cross-representation and its corresponding self-
representation are complementary in the token sequence dimension. Finally, the cosine distance between them
will be minimized or maximized according to whether the training sample is positive or negative.
(2021); Giorgi et al. (2020) introduce the token-
level and sentence-level contrastive learning tasks
by correcting corrupted texts to encourage PLMs
to learn noise-invariant representations. Zhang
et al. (2022) propose phrase-guided and tree-guided
contrastive learning objectives to inject syntactic
knowledge into PLMs. Kim et al. (2021) propose
a self-guided learning objective through which a
PLM fine-tunes itself under the guidance of its dif-
ferent layers. Inspired by these works, we propose
to leverage the interlocutor-level conversational in-
teractions to guide the learning of dialogue embed-
dings in an unsupervised learning manner.
3 Proposed Approach
In this section, we take a two-party dialogue as
an example to describe how dial2vec works. It is
worth mentioning that dial2vec can be extended
to the multi-party version through the OVR (one
vs. the rest) scheme with no modification of the
architecture.
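The OVR extension can be illustrated with a small helper: each interlocutor in turn plays the role of p1, and all remaining interlocutors are collapsed into p2. This is a sketch under our own naming; the paper does not prescribe a particular data layout.

```python
def ovr_masks(speaker_ids):
    """One-vs-rest masks for a multi-party dialogue (sketch).

    speaker_ids: per-token (or per-turn) speaker labels.
    Returns, for each speaker s, a boolean mask that treats s as p1
    and everyone else collectively as p2.
    """
    speakers = sorted(set(speaker_ids))
    return {s: [sid == s for sid in speaker_ids] for s in speakers}
```

Each mask can then be fed to the two-party machinery unchanged, which is why no architectural modification is needed.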
3.1 Training Samples Generation
We first describe how we construct the positive
and negative training samples, which play a key
role in the self-guided contrastive learning
approach. Suppose that we have a dialogue dataset
$\mathcal{D} = \{S_k\}_{k=1}^{K}$, where
$S_k = \{u_1^{p_1}, u_2^{p_2}, u_3^{p_1}, u_4^{p_2},
\dots, u_{t-1}^{p_1}, u_t^{p_2}\}$ is the $k$-th
dialogue session with $t$ utterances, and $p_1$ and
$p_2$ represent the two interlocutors. We treat
each utterance in a dialogue as a turn, regardless
of which interlocutor it corresponds to. For
convenience of notation, the subscript $k$ in $S_k$
is omitted in the following sections.
We treat $S$ (i.e., the original dialogue) as a
positive sample. To construct a negative sample
$S'$, we first randomly select an interlocutor in
$S$, say $p_1$, and keep all of its turns. Then we
fill the remaining turns of $S$ with utterances of
$p_2$ randomly sampled from all dialogue sessions.
For each positive sample, we repeat this operation
multiple times to generate the desired number of
negative samples.
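The negative-sample construction above can be sketched as follows. The representation of a dialogue as a list of (speaker, utterance) pairs and the flat pool of the other speaker's utterances are assumptions about the data layout, made for illustration.

```python
import random

def make_negative(dialogue, other_speaker_pool, keep_speaker="p1"):
    """Build one negative sample S' as described above (sketch).

    dialogue           : list of (speaker, utterance) pairs.
    other_speaker_pool : utterances of the non-kept speaker sampled
                         from all dialogue sessions.
    keep_speaker       : the interlocutor whose turns are kept verbatim.
    """
    negative = []
    for speaker, utt in dialogue:
        if speaker == keep_speaker:
            negative.append((speaker, utt))  # keep this side unchanged
        else:
            # replace the other side with randomly sampled utterances
            negative.append((speaker, random.choice(other_speaker_pool)))
    return negative
```

Calling this several times per positive sample yields the desired number of negatives.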
3.2 Model Architecture
Figure 2 shows the architecture of dial2vec, which
consists of two parts: encoding and contrastive
learning. After training, dial2vec aggregates the
embeddings from both interlocutors to obtain the
final dialogue embedding, which is further used for
downstream tasks.
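The final aggregation step might look like the sketch below. Averaging is shown as one plausible instance of interlocutor-level pooling; the exact pooling operator is an assumption here, not a claim about the paper's implementation.

```python
import numpy as np

def interlocutor_level_pooling(e_p1, e_p2):
    """Aggregate the two interlocutors' embeddings into one dialogue
    embedding (illustrative: element-wise mean of the two vectors)."""
    return (np.asarray(e_p1, dtype=float) + np.asarray(e_p2, dtype=float)) / 2.0
```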
3.2.1 Encoding
Following Bao et al. (2020), we use four types of
embeddings as input to dial2vec: token embedding,
relative positional embedding, turn embedding, and
role embedding. To encode the dialogue, we first
concatenate all the utterances and then tokenize
them through WordPiece (Wu et al.,2016) to obtain
a long token sequence. The tokens along with their
corresponding position, turn, and role indices are
respectively mapped into four embedding spaces
and summed to form the final input embedding.
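The four-way embedding sum can be sketched as below. Random tables stand in for learned lookup matrices, all sizes are illustrative, and indexing the positional table by absolute index is a simplification of the relative positional embedding described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_inputs(token_ids, turn_ids, role_ids,
                 vocab=100, max_len=64, n_turns=10, n_roles=2, d=8):
    """Sketch of the input layer: token + position + turn + role
    embeddings summed per token (sizes and tables are illustrative)."""
    tok = rng.normal(size=(vocab, d))
    pos = rng.normal(size=(max_len, d))
    trn = rng.normal(size=(n_turns, d))
    rol = rng.normal(size=(n_roles, d))
    n = len(token_ids)
    # one (n, d) matrix per embedding type, summed element-wise
    return tok[token_ids] + pos[np.arange(n)] + trn[turn_ids] + rol[role_ids]
```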
3.2.2 Contrastive Learning
Suppose that the output embeddings from the
encoder are $\{h_1, h_2, h_3, \dots, h_n\}$, where
$h_i \in \mathbb{R}^d$ is the output embedding
corresponding to the $i$-th input token and $n$ is
the length of the input sequence,