color. Although these patterns appear in only
three utterances, they convey the key conversational
semantics (e.g., topics) and are more important
than the other parts (e.g., greetings and chit-chat).
We argue that capturing and leveraging them is one
of the keys to learning high-quality unsupervised
dialogue embeddings.
In this work, we propose dial2vec, a self-guided
contrastive learning approach to solve the proposed
task. Dial2vec considers a dialogue as an informa-
tion exchange process between the two interlocu-
tors and learns embeddings for both interlocutors
with the help of each other. Specifically, dial2vec
first encodes a dialogue through a PLM and assigns
each interlocutor a self-representation by
masking the non-corresponding positions in the
encoding outputs. Then it calculates a matching
matrix via the token-level dot-product operation
between the two self-representations, obtaining a
cross-representation for each interlocutor. Finally,
the two cross-representations are leveraged as guid-
ance to help the two self-representations gradually
learn the interlocutor-level interaction-aware infor-
mation and eliminate the interaction-free informa-
tion during the training procedure.
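The tensor flow described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the PLM encodings are stand-ins, and the softmax-weighted combination used to form the cross-representations is an assumption about how the matching matrix guides each interlocutor.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dial2vec_flow(hidden, speaker_ids):
    """Sketch of dial2vec's representation flow.

    hidden:      (T, d) token encodings from a PLM.
    speaker_ids: (T,) 0/1 array marking which interlocutor
                 produced each token.
    """
    mask_a = (speaker_ids == 0)[:, None].astype(hidden.dtype)
    mask_b = (speaker_ids == 1)[:, None].astype(hidden.dtype)
    # Self-representations: zero out the other interlocutor's positions.
    self_a, self_b = hidden * mask_a, hidden * mask_b
    # Token-level matching matrix between the two self-representations.
    match = self_a @ self_b.T                        # (T, T)
    # Cross-representations: each side re-expressed through a weighted
    # combination of the other side's tokens (weighting is an assumption).
    cross_a = softmax(match, axis=1) @ self_b        # guidance for A
    cross_b = softmax(match.T, axis=1) @ self_a      # guidance for B
    return self_a, self_b, cross_a, cross_b
```

During training, the cross-representations serve as contrastive guidance pulling each self-representation toward interaction-aware content.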
To verify our model, we build a comprehensive
benchmark comprising a total of 98,879 dialogues
by introducing six widely-used dialogue datasets,
including BiTOD (Lin et al., 2021), Doc2dial (Feng
et al., 2020), MetalWOZ (Lee et al., 2019), MultiWOZ
(Eric et al., 2019), Self-dialogue (Fainberg
et al., 2018), and SGD (Rastogi et al., 2020). Each
dataset consists of thousands of dialogues, where
each dialogue is provided with a domain label (e.g.,
hotel booking and movie). We leverage these la-
bels and design three evaluation tasks: domain
categorization, semantic relatedness, and dialogue
retrieval. We categorize them as intrinsic or
extrinsic tasks according to their focus.
Experimental results on this benchmark show
that dial2vec outperforms the baselines by a substantial
margin. Compared with the strongest baseline,
dial2vec achieves average absolute improvements of
8.7, 9.0, and 13.8 points in purity, Spearman's
correlation, and mean average precision (MAP)
on the three tasks, respectively.
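The paper does not spell out the metric computations, but purity and MAP have standard definitions; a minimal sketch of both, assuming clusterings over dialogue embeddings and 0/1 relevance rankings for retrieval:

```python
import numpy as np

def purity(cluster_ids, labels):
    """Clustering purity: each cluster votes its majority domain label;
    purity is the fraction of dialogues covered by those majorities."""
    cluster_ids, labels = np.asarray(cluster_ids), np.asarray(labels)
    majority_total = 0
    for c in np.unique(cluster_ids):
        # Count the most common domain label within this cluster.
        _, counts = np.unique(labels[cluster_ids == c], return_counts=True)
        majority_total += counts.max()
    return majority_total / len(labels)

def mean_average_precision(ranked_relevance):
    """MAP over queries; each entry is a 0/1 relevance list ordered by
    the ranking that the dialogue embeddings induce for that query."""
    aps = []
    for rel in ranked_relevance:
        hits, precisions = 0, []
        for rank, r in enumerate(rel, start=1):
            if r:
                hits += 1
                precisions.append(hits / rank)
        aps.append(sum(precisions) / max(hits, 1))
    return sum(aps) / len(aps)
```

For example, `purity([0, 0, 1, 1], ["a", "a", "a", "b"])` gives 0.75, since cluster 1 splits its vote between two domains.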
We also conduct experiments with the single inter-
locutor’s embeddings, their aggregation strategies,
and the overall dialogue embedding distributions
to study how dial2vec achieves such strong performance.
The results demonstrate that dial2vec
learns both informative and discriminative embed-
dings for the two interlocutors and achieves the
best performance when combining them through
the proposed interlocutor-level pooling aggregation
strategy.
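The interlocutor-level pooling idea can be sketched as follows. This is an illustrative reading, not the paper's exact operator: mean pooling within each interlocutor's tokens and averaging the two resulting vectors are both assumptions.

```python
import numpy as np

def interlocutor_pooling(hidden, speaker_ids):
    """Pool token embeddings per interlocutor, then combine.

    hidden:      (T, d) token embeddings for one dialogue.
    speaker_ids: (T,) 0/1 array of token ownership.
    """
    hidden, speaker_ids = np.asarray(hidden), np.asarray(speaker_ids)
    # Pool each interlocutor's tokens separately, preserving the
    # per-speaker information a flat average over all tokens would blur.
    emb_a = hidden[speaker_ids == 0].mean(axis=0)   # interlocutor A
    emb_b = hidden[speaker_ids == 1].mean(axis=0)   # interlocutor B
    # Combine the two interlocutor embeddings (averaging is assumed).
    return (emb_a + emb_b) / 2.0                    # dialogue embedding
```

The design point is that pooling per interlocutor before combining keeps each speaker's embedding discriminative, rather than letting one verbose speaker dominate a single flat average.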
2 Related Work
2.1 Text Embedding
Text embedding aims to encode a piece of text into
a distributed vector that represents its semantics.
Early works (Bengio et al., 2003; Mikolov
et al., 2013; Pennington et al., 2014) learn unsupervised
word embeddings by exploiting word-level
co-occurrence information in the skip-gram
or CBOW tasks. Recently, Devlin et al. (2018);
Liu et al. (2019a); Yang et al. (2019); Raffel et al.
(2020) pre-train deep Transformers (Vaswani et al.,
2017) with a series of pretext tasks, setting a new
state of the art on the GLUE benchmark (Wang
et al., 2018) and exhibiting strong potential
for producing general text embeddings. Along
this line, Gao et al. (2021); Yan et al. (2021); Liu
et al. (2021); Chuang et al. (2022); Nishikawa et al.
(2022); Zhou et al. (2022); Klein and Nabi (2022)
fine-tune the PLMs with contrastive learning ob-
jectives, achieving remarkable improvements in
learning unsupervised sentence embeddings. Luo
et al. (2021) introduce a data-augmentation-based
contrastive learning approach for learning document
embeddings, achieving superior performance
over word2vec-based approaches (Le and Mikolov,
2014; Chen, 2017).
For dialogue embedding, the above approaches
are generally unsatisfactory, as they typically ob-
tain dialogue embeddings by averaging the pre-
trained word or sentence embeddings, ignoring the
interlocutor-level conversational interactions. Al-
though conversational-PLMs pre-trained with dia-
logue data can solve this problem to some extent
(Wu et al., 2020a; Bao et al., 2020; Roller et al.,
2021), they mainly focus on learning end-to-end
models which are not sufficient for our task. As a
comparison, we study how to produce high-quality
dialogue embeddings by fully exploiting the con-
versational information.
2.2 Contrastive Learning
Contrastive learning is an emerging self-supervised
learning method that can improve the representation
capability of PLMs in both the pre-training and
fine-tuning stages. Wu et al. (2020b); Meng et al.