a single surface name is insufficient (Yin et al.,
2019). Although some automatic search strategies
(Gao et al., 2020; Schick and Schütze, 2020; Schick et al., 2020) have been suggested, relevant
research and applications are still under-explored.
3 Proposed Model: BERT-Flow-VAE
3.1 Problem Formulation and Motivation
Problem Formulation
Multi-label text classification is a broad concept that includes many sub-fields, such as eXtreme Multi-label Text Classification (XMTC), Hierarchical Multi-label Text Classification (HMTC) and multi-label topic modeling. Instead of following these approaches, our model adopts a simpler assumption: the labels do not have a hierarchical structure, and the distribution of examples per label is not extremely skewed.
More precisely, given an input corpus consisting of N documents D = {D_1, ..., D_N}, the model assigns zero, one, or multiple labels to each document D_i ∈ D, based on a weak supervision signal from a dictionary W of {topic surface name: keywords} pairs provided by the user.
This task is more challenging than multi-class text classification, as samples are assumed to have non-mutually exclusive labels. It is also a more practical setting for text classification, because documents usually belong to more than one conceptual class (Tsoumakas and Katakis, 2007).
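To make the input and output format concrete, the following minimal sketch shows a hypothetical weak-supervision dictionary W and the kind of multi-label assignment the model is expected to produce; the topic names, keywords and documents are illustrative placeholders, not examples from the paper.

```python
# A hypothetical weak-supervision dictionary W: {topic surface name: keywords}.
# Topic names and keywords are illustrative only.
W = {
    "sports":   ["game", "team", "score", "league"],
    "politics": ["election", "senate", "policy", "vote"],
    "finance":  ["stock", "market", "revenue", "earnings"],
}

# A toy corpus D = {D_1, ..., D_N}.
D = [
    "The team clinched the league title after a late score.",
    "Markets rallied on strong earnings ahead of the senate vote.",
    "A quiet day with nothing newsworthy.",
]

# Expected output: a (possibly empty) set of topic labels per document,
# since labels are non-mutually exclusive and zero labels are allowed.
predicted_labels = [{"sports"}, {"finance", "politics"}, set()]
```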
Motivation
Inspired by work on VAE and β-VAE (see Appendix A), we assume that the semantic information within sentence embeddings is composed of multiple disentangled factors in the latent space. Each latent factor can be seen as a label (topic) that may appear independently. Hence, we adopted VAE as our framework to approach this task.
3.2 Preparing the Inputs
Language Model and Sentence Embedding Strategy
Since we model the latent factors from the semantic information of sentences encoded in the word embeddings, we first need to convert sentences into embeddings. Specifically, given the input corpus D, we process it into a collection of sentence embeddings E_s ∈ R^{N×V}, where V is the embedding dimension of the language model. Taking BERT as an example, there are two main ways to produce such sentence embeddings: (1) using the special token ([CLS] in BERT) and (2) using a mean-pooling strategy to aggregate all word embeddings into a single sentence embedding. We tested both strategies and report their performance in Section 5. Lastly, for computational efficiency, we used distil-BERT (Sanh et al., 2019), a lighter version of BERT with comparable performance, as our language model.
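As a minimal sketch of the two embedding strategies, the snippet below uses the Hugging Face transformers library with the distilbert-base-uncased checkpoint (the exact checkpoint is an assumption, not specified in the text above); it extracts the [CLS] embedding and a padding-aware mean-pooled embedding.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")
model.eval()

sentences = ["A first example sentence.", "Another document in the corpus."]
enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = model(**enc)
hidden = out.last_hidden_state                # shape (N, T, V)

# Strategy (1): embedding of the special [CLS] token (first position).
cls_emb = hidden[:, 0, :]                     # shape (N, V)

# Strategy (2): mean-pool all token embeddings, ignoring padding positions.
mask = enc["attention_mask"].unsqueeze(-1).float()            # (N, T, 1)
mean_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
```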
Moreover, instead of simply averaging the embeddings of the words in a sentence with equal weights, we also tested a TF-IDF averaging strategy. Specifically, we first calculated the weights of the words in a sentence using the TF-IDF algorithm with L2 normalization, and then averaged the word embeddings according to these weights. To avoid the weights of some common words becoming nearly zero, we combined 10% mean-pooling weights and 90% TF-IDF pooling weights to obtain the final embeddings.
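One possible reading of this weighting scheme is sketched below with scikit-learn; it blends uniform weights with L2-normalized TF-IDF weights at a 10%/90% ratio. It operates on whitespace-level tokens and their word vectors, and the re-normalization of the TF-IDF weights per sentence is a simplifying assumption (the alignment with sub-word tokens used by the language model is not shown).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_weighted_pool(corpus_tokens, word_vecs, alpha=0.9):
    """Blend uniform mean pooling with TF-IDF weighted pooling.

    corpus_tokens: list of token lists, one per sentence (assumed non-empty).
    word_vecs:     list of (num_tokens, V) arrays aligned with corpus_tokens.
    alpha:         share of the TF-IDF weights (0.9), the rest is uniform (0.1).
    """
    # TF-IDF with L2 normalization, fitted on the whitespace-joined sentences.
    vectorizer = TfidfVectorizer(norm="l2", lowercase=True)
    tfidf = vectorizer.fit_transform([" ".join(toks) for toks in corpus_tokens])
    vocab = vectorizer.vocabulary_

    pooled = []
    for i, (toks, vecs) in enumerate(zip(corpus_tokens, word_vecs)):
        row = tfidf[i].toarray().ravel()
        # Per-token TF-IDF weight; tokens outside the TF-IDF vocabulary get 0.
        w_tfidf = np.array([row[vocab[t.lower()]] if t.lower() in vocab else 0.0
                            for t in toks])
        if w_tfidf.sum() > 0:
            w_tfidf = w_tfidf / w_tfidf.sum()
        w_mean = np.full(len(toks), 1.0 / len(toks))
        # Final weights: 10% mean-pooling weights + 90% TF-IDF weights.
        w = (1 - alpha) * w_mean + alpha * w_tfidf
        pooled.append((w[:, None] * vecs).sum(axis=0))
    return np.stack(pooled)
```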
Flow-calibration
Sentence embeddings obtained from BERT without extra fine-tuning have been found to poorly capture the semantic meaning of sentences, as reflected by the performance of BERT on sentence-level tasks such as predicting Semantic Textual Similarity (STS) (Reimers and Gurevych, 2019). This may be caused by anisotropy (embeddings occupying a narrow cone in the vector space), a common problem of embeddings produced by language models (Ethayarajh, 2019; Li et al., 2020). To address this problem, following Li et al. (2020), we adopted BERT-Flow to calibrate the sentence embeddings. More precisely, we used a shallow Glow (Kingma and Dhariwal, 2018), a normalizing-flow-based model, with K = 16 flow steps, L = 1 level, random permutation and affine coupling, to post-process the sentence embeddings from all 7 layers of distil-BERT (including the word embedding layer). We tested different combinations of the 7 post-processed embeddings and took the average of the embeddings from the first, second and sixth layers, based on the metrics evaluated on the STS benchmark dataset.
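A full Glow implementation is beyond the scope of this section, but the sketch below illustrates the building block described above: one flow step consisting of a fixed random permutation followed by an affine coupling layer, stacked K = 16 times at a single level, trained to map embeddings to a standard Gaussian latent space. The hidden size and training details are placeholder assumptions, and components of the actual Glow/BERT-flow implementation (e.g., actnorm) are omitted.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Affine coupling: transform half the dimensions conditioned on the other half."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.half = dim // 2
        # Small network predicting scale and shift for the second half.
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)                      # keep scales bounded for stability
        y2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=-1)                # log |det Jacobian| of this step
        return torch.cat([x1, y2], dim=-1), log_det

class FlowStep(nn.Module):
    """One flow step: fixed random permutation followed by affine coupling."""
    def __init__(self, dim):
        super().__init__()
        self.register_buffer("perm", torch.randperm(dim))
        self.coupling = AffineCoupling(dim)

    def forward(self, x):
        return self.coupling(x[:, self.perm])  # permutation has zero log-det

class ShallowGlow(nn.Module):
    """K stacked flow steps (L = 1 level), mapping embeddings to a latent z."""
    def __init__(self, dim, K=16):
        super().__init__()
        self.steps = nn.ModuleList(FlowStep(dim) for _ in range(K))

    def forward(self, x):
        log_det = torch.zeros(x.size(0), device=x.device)
        for step in self.steps:
            x, ld = step(x)
            log_det = log_det + ld
        return x, log_det

# Training maximizes the likelihood of the embeddings under a standard
# Gaussian prior on z:  loss = -(log N(z; 0, I) + log|det|).
flow = ShallowGlow(dim=768, K=16)
emb = torch.randn(32, 768)                     # placeholder sentence embeddings
z, log_det = flow(emb)
prior = torch.distributions.Normal(0.0, 1.0)
nll = -(prior.log_prob(z).sum(dim=-1) + log_det).mean()
```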
Since normalizing-flow-based models can create an invertible mapping from the BERT embedding space to a standard Gaussian latent space (Li et al., 2020), flow calibration offers two advantages: (1) it alleviates the anisotropy problem, making the sentence embeddings more semantically distinguishable, and (2) it transforms the distribution of BERT embeddings into a standard Gaussian, which fits the