BERT-Flow-VAE: A Weakly-supervised Model for Multi-Label Text
Classification
Ziwen Liu
University College London
z.liu.19@ucl.ac.uk
Scott Allan Orr
University College London
scott.orr@ucl.ac.uk
Josep Grau-Bové
University College London
josep.grau.bove@ucl.ac.uk
Abstract
Multi-label Text Classification (MLTC) is the task of categorizing documents into one or more topics. Considering the large volumes of data and varying domains of such tasks, fully-supervised learning requires fully manually annotated datasets, which is costly and time-consuming. In this paper, we propose BERT-Flow-VAE (BFV), a Weakly-Supervised Multi-Label Text Classification (WSMLTC) model that reduces the need for full supervision. This new model: (1) produces BERT sentence embeddings and calibrates them using a flow model, (2) generates an initial topic-document matrix by averaging the results of a seeded sparse topic model and a textual entailment model that only require the surface names of topics and 4-6 seed words per topic, and (3) adopts a VAE framework to reconstruct the embeddings under the guidance of the topic-document matrix. Finally, (4) it uses the means produced by the encoder model in the VAE architecture as predictions for MLTC. Experimental results on 6 multi-label datasets show that BFV can substantially outperform other baseline WSMLTC models in key metrics and achieve approximately 84% of the performance of a fully-supervised model.
1 Introduction
As vast numbers of written comments are posted
daily on social media and e-commerce platforms,
there is an increasing demand for methods that
efficiently and effectively extract useful informa-
tion from this unstructured text data. One of the
methods to analyze this unstructured text data is
to classify it into organized categories. This
can be considered as a Multi-label Text Classifica-
tion (MLTC) task since a single document may contain
multiple non-mutually-exclusive topics (aspects).
There are a range of relevant applications of this
task such as categorizing movies by genres (Hoang,
2018), multi-label sentiment analysis (Almeida
et al.,2018) and multi-label toxicity identification
(Gunasekara and Nejadgholi,2018).
Fully-supervised learning methods are undesir-
able for this task, because of the diversity of do-
mains of application and cost of manual labelling
(Brody and Elhadad,2010). Seeded topic mod-
els, such as SeededLDA and CorEx (Jagarlamudi
et al.,2012;Gallagher et al.,2017), where users
can designate seed words as a prior to guide the
models to find topics of interest, can be seen as
a Weakly-Supervised Multi-Label Text Classifica-
tion (WSMLTC) method. Nevertheless, as these
models are mainly statistical models based on bag-
of-words representation, they fail to fully exploit
key sentence elements such as context and word
positions. In contrast, large pre-trained language
models such as BERT and GPT-3 (Devlin et al.,
2018;Brown et al.,2020) produce contextualized
embeddings for each word in a sentence, which
has afforded them great success in the NLP field
(Minaee et al.,2021;Ethayarajh,2019).
Recently, prompt-based Few-Shot Learning
(FSL) and Zero-Shot Learning (ZSL) methods (Yin
et al.,2019,2020;Gao et al.,2020) that take advan-
tage of the general knowledge of large pre-trained
language models can also approach MLTC tasks us-
ing only a few examples or topic surface names as
a means of supervision. Specifically, these models convert text classification to a textual entailment task by preparing a template such as ’This example is about _’ as input, and then estimating the probability of the model filling the blank with certain topic names. However, this method does not work well for abstract topics, and there is no agreed way to use multiple words for the entailment task.
In this paper, we propose BERT-Flow-VAE
(BFV), a WSMLTC model. It is based on the Vari-
ational AutoEncoder (VAE) (Kingma and Welling,
2013) framework to reconstruct the sentence em-
beddings obtained from distil-BERT (Sanh et al.,
2019). Inspired by the work of (Li et al.,2020), we
use a shallow Glow (Kingma and Dhariwal,2018)
model to map the sentence embeddings to a stan-
dard Gaussian space before feeding them into the
VAE model. Finally, we use the averaged results
of a seeded sparse topic model and a ZSL model to
guide our model to build latent variables towards
pre-specified topics as predictions for MLTC.
Our contributions can be listed as follows: (1)
We propose BFV, a WSMLTC model based on
VAE framework, that can achieve comparable per-
formance to a fully-supervised method on 6 multi-
label datasets with only limited inputs (4 to 6 seed
words per topic and the surface names of topics). (2) We show that using a normalizing-flow model to calibrate sentence embeddings before feeding them into a VAE model can improve the model’s MLTC performance, suggesting that such pre-processing is needed because it better fits the overall objective of the VAE framework. (3) We show that the topic classification performance of the ZSL method can be further improved by properly integrating predictions from a seeded sparse topic model, which complements the ZSL results by naturally incorporating multiple words to define a topic and can act as a form of regularization.
2 Related Work
Seeded Topic Model
Guided (seeded) topic
models are built to find more desirable topics
by incorporating users’ prior domain knowledge.
These seeded topic models can be seen as Weakly-
Supervised (WS) methods to find specific topics in
a corpus. Andrzejewski and Zhu (2009) proposed
a model by using ’z-labels’ to control which words
appear or do not appear in certain topics. Andrzejew-
ski et al. (2009) presented DFLDA to construct
Must-Link and Cannot-Link conditions between
words to indirectly force the emergence of topics.
Jagarlamudi et al. (2012) proposed SeededLDA to
incorporate seed words for each topic to guide the
results found by LDA. This is achieved by biasing
(1) topics to produce seed words and (2) documents
containing seed words to select corresponding top-
ics. Gallagher et al. (2017) presented Correlation
Explanation (CorEx), a model searching for top-
ics that are ’maximally informative’ about a set
of documents. Seed words can be flexibly incor-
porated into the model during fitting. Meng et al.
(2020a) proposed CatE that jointly embeds words,
documents and seeded categories (topics) into a
shared space. The category distinctive information
is encoded during the process.
Weakly-supervised Text Classification
Recently, Weakly-Supervised Text Classification
(WSTC) has been rapidly developed (Meng et al.,
2020b; Wang et al., 2020). Most of these works use pseudo label/document generation and
self-training. Particularly, Meng et al. (2018)
proposed WeSTClass which uses seed information
such as label surface name and keywords to
generate pseudo documents and refines itself via
self-training. Mekala and Shang (2020) proposed
ConWea that uses contextualized embeddings to
disambiguate user input seed words and generates
pseudo labels for unlabeled documents based on
these words to train a text classifier. COSINE from
(Yu et al.,2020) receives weak supervision and
generates pseudo labels to perform contrastive
learning (with confidence reweighting) to train a
classifier. Some studies integrated simple rules
as weak supervision signals: Ren et al. (2020)
used rule-annotated weak labels to denoise labels,
which then supervise a classifier to predict unseen
samples; Karamanolakis et al. (2021) developed
ASTRA that utilizes task-specific unlabeled
data, few labeled data, and domain-specific rules
through a student model, a teacher model and
self-training. However, these WSTC methods
were specifically designed for multi-class tasks
and are not optimized for WSMLTC tasks in
which documents could belong to multiple classes
simultaneously.
Prompt-based Zero-Shot Learning
MLTC tasks can also be approached with very limited
supervision by Prompt-based Few-Shot Learning
(FSL) or Zero-Shot Learning (ZSL). For example,
Yin et al. (2019) proposed a ZSL method for text
classification tasks by treating text classification
as a textual entailment problem. This model
treats an input text as a premise and prepares a corresponding hypothesis (template) such as ’This example is about _’ for the entailment model.
Finally, it uses the probability of the model filling
the blank with topic names as the topic predictions.
However, the choice of template and word for
the entailment task requires domain knowledge
and is often sub-optimal (Gao et al.,2020).
Also, it is not straightforward to find multiple
words as entailment for a topic. This may limit
the model’s ability to understand abstract topics (e.g.,
’evacuation’ and ’infrastructure’) where providing
a single surface name is insufficient (Yin et al.,
2019). Although some automatic search strategies
(Gao et al.,2020;Schick and Schütze,2020;
Schick et al.,2020) have been suggested, relevant
research and applications are still under-explored.
3 Proposed Model: BERT-Flow-VAE
3.1 Problem Formulation and Motivation
Problem Formulation
Multi-label text classification is a broad task that includes many sub-fields such as eXtreme Multi-label Text Classification (XMTC), Hierarchical Multi-label Text Classification (HMTC) and multi-label topic modeling. In our model, instead of following these approaches, we make the simpler assumption that the labels do not have a hierarchical structure and that the distribution of examples per label is not extremely skewed.
More precisely, given an input corpus consisting of $N$ documents $\mathcal{D} = \{D_1, \dots, D_N\}$, the model assigns zero, one, or multiple labels to each document $D_i \in \mathcal{D}$ based on a weak supervision signal from a dictionary of {topic surface name: keywords}, $\mathcal{W}$, provided by the user.
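For concreteness, a minimal sketch of what $\mathcal{W}$ might look like is given below; the topic names and seed words are invented placeholders for illustration, not the actual inputs used in the experiments.

```python
# Illustrative weak supervision W: {topic surface name: seed words}.
# These topics and seed words are made-up placeholders, not the paper's inputs.
W = {
    "price":   ["price", "cost", "expensive", "cheap", "value"],
    "service": ["staff", "service", "friendly", "rude", "helpful"],
    "food":    ["food", "taste", "delicious", "menu", "dish"],
}
topic_names = list(W.keys())    # surface names, used by the zero-shot backend
seed_words = list(W.values())   # 4-6 seed words per topic, used by the seeded topic model
```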
This is a more challenging task than multi-class
text classification as samples are assumed to have
non-mutually exclusive labels. This is a more prac-
tical assumption for text classification tasks because
documents usually belong to more than one con-
ceptual class (Tsoumakas and Katakis,2007).
Motivation
Inspired by relevant work on VAE and β-VAE (see Appendix A), we assume that the semantic information within sentence embeddings is composed of multiple disentangled factors in
the latent space. Each latent factor can be seen as a
label (topic) that may appear independently. Hence,
we adopted VAE as our framework to approach this
task.
3.2 Preparing the Inputs
Language Model and Sentence Embedding Strategy
As we will model the latent factors from the semantic information of sentences encoded in the word embeddings, we first need to convert sentences into embeddings. Specifically, given the input corpus $\mathcal{D}$, we first process it into a collection of sentence embeddings $E_s \in \mathbb{R}^{N \times V}$, where $V$ is the embedding dimension of the language model. Taking BERT as an example,
there are two main ways to produce such sentence
embeddings: (1) using the special token (
[CLS]
in
BERT) and (2) using a mean-pooling strategy to ag-
gregate all word embeddings into a single sentence
embedding. We tested and showed the performance
of the two versions in section 5. Lastly, for com-
putational efficiency, we used distil-BERT (Sanh
et al.,2019) as our language model, which is a
lighter version of BERT with comparable perfor-
mance.
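As an illustration, a minimal sketch of the two strategies using the Hugging Face transformers library is shown below; this is standard library usage rather than the authors’ released code, and the checkpoint name and batching details are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Two sentence-embedding strategies on distil-BERT: the special [CLS] token
# vs. mean pooling over word embeddings (masked to ignore padding).
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

def embed(sentences, strategy="mean"):
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state          # (batch, seq_len, 768)
    if strategy == "cls":
        return hidden[:, 0]                              # special-token embedding
    mask = enc["attention_mask"].unsqueeze(-1).float()   # ignore padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)          # masked mean pooling

Es = embed(["The staff were friendly but the room was small."])
```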
Moreover, instead of simply averaging the em-
beddings of words in a sentence with equal weights,
we also tested a TF-IDF averaging strategy. Specif-
ically, we first calculated the weights of the words in a sentence using the TF-IDF algorithm with $L_2$ normalization, and then averaged the word embeddings according to the TF-IDF weights. To avoid the weights of some common words being nearly zero, we combined 10% mean-pooling weights and 90% TF-IDF pooling weights to produce the final embeddings.
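The TF-IDF pooling step could look roughly like the following sketch; the alignment between TF-IDF vocabulary terms and the language model’s subword tokens is glossed over here, and `word_embs` is a hypothetical list holding one vector per whitespace-separated word.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_pool(docs, word_embs, alpha=0.1):
    """word_embs[i]: (len(docs[i].split()), dim) array of word vectors,
    assumed aligned with the whitespace tokens of docs[i]."""
    vec = TfidfVectorizer(norm="l2", lowercase=True)
    tfidf = vec.fit_transform(docs)                  # (N, |vocab|), L2-normalized rows
    vocab = vec.vocabulary_
    pooled = []
    for i, doc in enumerate(docs):
        tokens = doc.lower().split()
        w = np.array([tfidf[i, vocab[t]] if t in vocab else 0.0 for t in tokens])
        w = w / w.sum() if w.sum() > 0 else np.full(len(tokens), 1.0 / len(tokens))
        weights = alpha / len(tokens) + (1.0 - alpha) * w   # 10% mean + 90% TF-IDF
        pooled.append(weights @ word_embs[i])               # weighted average of word vectors
    return np.stack(pooled)
```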
Flow-calibration
Sentence embeddings ob-
tained from BERT without extra fine-tuning
have been found to poorly capture the semantic
meaning of sentences. This is reflected by the
performance of BERT on sentence-level tasks
such as predicting Semantic Textual Similarity
(STS) (Reimers and Gurevych,2019). This may
be caused by anisotropy (embeddings occupy
a narrow cone in the vector space), a common
problem of embeddings produced by language
models (Ethayarajh,2019;Li et al.,2020). To
address this problem, following the work of (Li
et al.,2020), we adopted BERT-Flow to calibrate
the sentence embeddings. More exactly, we
used a shallow Glow (Kingma and Dhariwal,
2018) with K = 16 and L = 1, a normalizing-flow
based model, with random permutation and affine
coupling to post-process the sentence embeddings
from all 7 layers of distil-BERT (including the
word embedding layer). We tested different
combinations of the 7 post-processed embeddings
and took the average of embeddings from the
first, second and sixth layer based on the metrics
evaluated on the STS benchmark dataset.
Since normalizing-flow based models can create
an invertible mapping from the BERT embedding
space to a standard Gaussian latent space (Li et al.,
2020), the advantages of using flow calibration
are: (1) it mitigates anisotropy, making the sen-
tence embeddings more semantically distinguish-
able, and (2) it converts the distribution of BERT
embeddings to be standard Gaussian, which fits the
objective of minimizing mean-squared reconstruc-
tion error and Kullback–Leibler Divergence (KLD)
with a standard Gaussian prior distribution in the
following VAE model.
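A heavily hedged sketch of the calibration step is shown below. `flow` stands in for a trained shallow Glow, and its `to_gaussian` method is a hypothetical interface (not a real library call); the layer indices are placeholders for the three layers chosen on the STS benchmark.

```python
import torch

# `layer_embs`: list of 7 tensors (one per distil-BERT layer, including the
# word-embedding layer), each of shape (N, 768). `flow`: a trained shallow
# Glow (K=16, L=1) mapping BERT embeddings to a standard Gaussian space.
def calibrate(layer_embs, flow, layers=(0, 1, 5)):
    # Map each selected layer through the flow, then average the calibrated
    # embeddings. Which three of the seven layers to average is chosen on the
    # STS benchmark; the indices here are placeholders for that choice.
    with torch.no_grad():
        calibrated = [flow.to_gaussian(layer_embs[l]) for l in layers]
    return torch.stack(calibrated).mean(dim=0)
```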
Backend Model
To guide our model towards
some pre-specified topics, we used Zero-Shot Text
Classification method (0SHOT-TC) proposed by
(Yin et al.,2019) as the backend model. Specifi-
cally, we used RoBERTa-large (Liu et al.,2019)
as the language model for 0SHOT-TC. Following
the example mentioned previously, we prepared a
template (hypothesis) of the form ’This example is about _’ for each sentence (premise) and filled the blank with the surface name of each topic. Finally, we took the probability of entailment as that of the topic appearing in the sentence for each class and collected these probabilities as $T_{\text{0SHOT-TC}} \in \mathbb{R}^{N \times M}$, where $M$ is the number of topics.
However, because current zero-shot learning
methods lack an agreed way to find multiple words
as entailment for a topic, we further used a seeded
topic model as a complement. More exactly, we se-
lected Anchored Correlation Explanation (CorEx)
(Gallagher et al.,2017) as another backend model.
By following the approach used by (Jagarlamudi
et al.,2012;Gallagher et al.,2017), we randomly
chose 4 to 6 seed words from the top 20 most dis-
criminating words of each topic as seed words to
better simulate real-world applications. Finally,
we estimated the unnormalized document-topic matrix $T_{\text{CorEx}} \in \mathbb{R}^{N \times M}$ and took the combination:
$$T = \omega \times T_{\text{0SHOT-TC}} + (1 - \omega) \times T_{\text{CorEx}},$$
where $\omega$ is the combination weight. We set $\omega = 0.5$ herein (details will be discussed in section 5.2).
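A sketch of how the two backend models could be combined is given below; the MNLI checkpoint `roberta-large-mnli`, the CountVectorizer preprocessing, the `anchor_strength` value, and the use of CorEx’s `p_y_given_x` as the document-topic scores are assumptions rather than details taken from the paper.

```python
import numpy as np
from transformers import pipeline
from sklearn.feature_extraction.text import CountVectorizer
from corextopic import corextopic as ct

def backend_topic_matrix(docs, W, omega=0.5):
    topic_names = list(W.keys())

    # (1) 0SHOT-TC: entailment probabilities per topic surface name.
    zsl = pipeline("zero-shot-classification", model="roberta-large-mnli")
    T_zsl = np.zeros((len(docs), len(topic_names)))
    for i, doc in enumerate(docs):
        out = zsl(doc, candidate_labels=topic_names,
                  hypothesis_template="This example is about {}.",
                  multi_label=True)
        for label, score in zip(out["labels"], out["scores"]):
            T_zsl[i, topic_names.index(label)] = score

    # (2) Anchored CorEx guided by the seed words.
    vec = CountVectorizer(binary=True)
    X = vec.fit_transform(docs)
    corex = ct.Corex(n_hidden=len(topic_names))
    corex.fit(X, words=vec.get_feature_names_out().tolist(),
              anchors=list(W.values()), anchor_strength=3)
    T_corex = corex.p_y_given_x                    # (N, M) document-topic scores

    # (3) Combine the two backends: T = omega * T_0SHOT-TC + (1 - omega) * T_CorEx
    return omega * T_zsl + (1 - omega) * T_corex
```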
3.3 Model Description
Model Architecture and Objective Function
An overview of the model architecture can be seen
in Fig 1. Specifically, we used fully connected
layers combined with layer normalization (Ba
et al.,2016) and Parametric ReLU (PReLU) (He
et al., 2015). The encoder model $q_\phi$ receives flow-calibrated sentence embeddings $E_s$ and outputs means ($\mu \in \mathbb{R}^{N \times M}$) and variances ($\sigma \in \mathbb{R}^{N \times M}$), which will be the inputs to the decoder model $p_\theta$ to produce the reconstructed sentence embeddings $\hat{E}_s \in \mathbb{R}^{N \times V}$.
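A minimal PyTorch sketch of such an encoder/decoder pair is shown below; the hidden width and depth are placeholders, since the text only specifies fully connected layers with layer normalization and PReLU.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a calibrated sentence embedding to topic-wise mean and log-variance."""
    def __init__(self, emb_dim, n_topics, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.LayerNorm(hidden), nn.PReLU(),
        )
        self.mu = nn.Linear(hidden, n_topics)       # document-topic means
        self.logvar = nn.Linear(hidden, n_topics)   # log sigma^2

    def forward(self, e_s):
        h = self.body(e_s)
        return self.mu(h), self.logvar(h)

class Decoder(nn.Module):
    """Reconstructs the sentence embedding from a sampled latent code."""
    def __init__(self, n_topics, emb_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_topics, hidden), nn.LayerNorm(hidden), nn.PReLU(),
            nn.Linear(hidden, emb_dim),
        )

    def forward(self, z):
        return self.body(z)
```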
[Figure 1: Architecture of the proposed model.]

As in the vanilla VAE model in Appendix A, the objective function of our model contains a reconstruction loss and a KLD loss. We used the mean-squared error between $E_s$ and $\hat{E}_s$ as the reconstruction loss, because the input embeddings have been calibrated to have a standard Gaussian distribution, and used the KL divergence between the output ($\mu$ and $\sigma$) of the encoder and the prior $\mathcal{N}(0, I)$ as the KLD loss. In addition, in order to guide the model’s direction towards the pre-specified topics, we added another loss term dubbed the topic loss:
$$L_T = -\frac{1}{NM}\sum_{i}^{N}\sum_{j}^{M} T_{ij} \cdot \log\big(\mathrm{sigmoid}(\mu_{ij})\big),$$
where $\mathrm{sigmoid}(\cdot)$ is the element-wise sigmoid function. $L_T$ is the binary cross-entropy between $\mu$ and $T$ to encourage $\mu$ to be closer to $T$.
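In code, the topic loss is a direct transcription of the formula above; the small epsilon is only for numerical stability and is an implementation choice, not a detail from the paper.

```python
import torch

def topic_loss(mu: torch.Tensor, T: torch.Tensor) -> torch.Tensor:
    # mu: (N, M) encoder means; T: (N, M) combined backend topic matrix.
    # Averaging over all N x M entries reproduces the 1/(NM) double sum.
    return -(T * torch.log(torch.sigmoid(mu) + 1e-8)).mean()
```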
Notice that the value of $\mathrm{sigmoid}(\mu) \in \mathbb{R}^{N \times M}$ produced by the encoder can be viewed as a document-topic matrix. Thus, we used it as the model’s prediction for MLTC: $\mathrm{sigmoid}(\mu_{ij}) > 0.5$ is predicted as positive (i.e., topic $j$ appears in the $i$-th document). $\sigma$ is left as free values to reconstruct $\hat{E}_s$.
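The prediction step is then a simple thresholding of the encoder means, for example:

```python
import torch

def predict(mu: torch.Tensor) -> torch.Tensor:
    # mu: (N, M) encoder means; returns a binary (N, M) document-topic matrix.
    return (torch.sigmoid(mu) > 0.5).int()
```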
As shown by (Higgins et al., 2016; Burgess et al., 2018), the weighting of the different components in the objective function of a VAE model is important for finding disentangled representations. In particular, based on our observations, the ratio between $L_{KLD}$ and $L_T$ is crucial. Hence, we set a hyper-parameter $\gamma$ in the objective function controlling the ratio of $L_{KLD}$ and $L_T$. Finally, the objective function of our VAE model is:
$$\mathcal{L}(\theta, \phi; E_s, T, \alpha, \eta) = -\,(L_R + \alpha L_{KLD} + \eta L_T),$$
where $\alpha = 0.1 \times \gamma^{-1}$ and $\eta = 0.1 \times \gamma \times M$. It can be seen that a higher value of $\gamma$ will lead to a heavier penalty on $L_T$, and therefore $\mu$ will become more similar to $T$; conversely, a lower value of $\gamma$ will make $\mu$ diverge from $T$. As $L_{KLD}$ pushes