a single surface name is insufficient (Yin et al.,
2019). Although some automatic search strategies
(Gao et al., 2020; Schick and Schütze, 2020; Schick et al., 2020) have been suggested, relevant
research and applications are still under-explored.
3 Proposed Model: BERT-Flow-VAE
3.1 Problem Formulation and Motivation
Problem Formulation
Multi-label text classification is a broad concept that includes many sub-fields, such as eXtreme Multi-label Text Classification (XMTC), Hierarchical Multi-label Text Classification (HMTC) and multi-label topic modeling. Instead of following these approaches, our model adopts a simpler assumption: the labels do not have a hierarchical structure, and the distribution of examples per label is not extremely skewed.
More precisely, given an input corpus consisting of N documents D = {D_1, ..., D_N}, the model assigns zero, one, or multiple labels to each document D_i ∈ D, based on a weak supervision signal from a dictionary W of {topic surface name: keywords} pairs provided by the user.
This task is more challenging than multi-class text classification, as samples are assumed to have non-mutually exclusive labels. It is also a more practical setting for text classification, because documents usually belong to more than one conceptual class (Tsoumakas and Katakis, 2007).
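To make the input and output format concrete, the following minimal sketch shows a hypothetical weak-supervision dictionary W and the kind of multi-label assignment the model is expected to produce; the topic names, keywords and documents are illustrative placeholders, not examples from the paper.

```python
# A hypothetical weak-supervision dictionary W: {topic surface name: keywords}.
# Topic names and keywords are illustrative only.
W = {
    "sports":   ["game", "team", "score", "league"],
    "politics": ["election", "senate", "policy", "vote"],
    "finance":  ["stock", "market", "revenue", "earnings"],
}

# A toy corpus D = {D_1, ..., D_N}.
D = [
    "The team clinched the league title after a late score.",
    "Markets rallied on strong earnings ahead of the senate vote.",
    "A quiet day with nothing newsworthy.",
]

# Expected output: a (possibly empty) set of topic labels per document,
# since labels are non-mutually exclusive and zero labels are allowed.
predicted_labels = [{"sports"}, {"finance", "politics"}, set()]
```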
Motivation
Inspired by work on VAE and β-VAE (see Appendix A), we assume that the semantic information within sentence embeddings is composed of multiple disentangled factors in the latent space. Each latent factor can be seen as a label (topic) that may appear independently. Hence, we adopted VAE as our framework to approach this task.
3.2 Preparing the Inputs
Language Model and Sentence Embedding Strategy
Since we model the latent factors from the semantic information of sentences encoded in the word embeddings, we first need to convert sentences into embeddings. Specifically, given the input corpus D, we process it into a collection of sentence embeddings E_s ∈ R^{N×V}, where V is the embedding dimension of the language model. Taking BERT as an example, there are two main ways to produce such sentence embeddings: (1) using the special token ([CLS] in BERT) and (2) using a mean-pooling strategy to aggregate all word embeddings into a single sentence embedding. We tested both strategies and report their performance in Section 5. Lastly, for computational efficiency, we used distil-BERT (Sanh et al., 2019), a lighter version of BERT with comparable performance, as our language model.
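As a minimal sketch of the two embedding strategies, the snippet below uses the Hugging Face transformers library with the distilbert-base-uncased checkpoint (the exact checkpoint is an assumption, not specified in the text above); it extracts the [CLS] embedding and a padding-aware mean-pooled embedding.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")
model.eval()

sentences = ["A first example sentence.", "Another document in the corpus."]
enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = model(**enc)
hidden = out.last_hidden_state                # shape (N, T, V)

# Strategy (1): embedding of the special [CLS] token (first position).
cls_emb = hidden[:, 0, :]                     # shape (N, V)

# Strategy (2): mean-pool all token embeddings, ignoring padding positions.
mask = enc["attention_mask"].unsqueeze(-1).float()            # (N, T, 1)
mean_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
```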
Moreover, instead of simply averaging the embeddings of the words in a sentence with equal weights, we also tested a TF-IDF averaging strategy. Specifically, we first calculated the weights of the words in a sentence using the TF-IDF algorithm with L2 normalization, and then averaged the word embeddings according to these weights. To avoid the weights of some common words becoming nearly zero, we combined 10% mean-pooling weights and 90% TF-IDF pooling weights to obtain the final embeddings.
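One possible reading of this weighting scheme is sketched below with scikit-learn; it blends uniform weights with L2-normalized TF-IDF weights at a 10%/90% ratio. It operates on whitespace-level tokens and their word vectors, and the re-normalization of the TF-IDF weights per sentence is a simplifying assumption (the alignment with sub-word tokens used by the language model is not shown).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_weighted_pool(corpus_tokens, word_vecs, alpha=0.9):
    """Blend uniform mean pooling with TF-IDF weighted pooling.

    corpus_tokens: list of token lists, one per sentence (assumed non-empty).
    word_vecs:     list of (num_tokens, V) arrays aligned with corpus_tokens.
    alpha:         share of the TF-IDF weights (0.9), the rest is uniform (0.1).
    """
    # TF-IDF with L2 normalization, fitted on the whitespace-joined sentences.
    vectorizer = TfidfVectorizer(norm="l2", lowercase=True)
    tfidf = vectorizer.fit_transform([" ".join(toks) for toks in corpus_tokens])
    vocab = vectorizer.vocabulary_

    pooled = []
    for i, (toks, vecs) in enumerate(zip(corpus_tokens, word_vecs)):
        row = tfidf[i].toarray().ravel()
        # Per-token TF-IDF weight; tokens outside the TF-IDF vocabulary get 0.
        w_tfidf = np.array([row[vocab[t.lower()]] if t.lower() in vocab else 0.0
                            for t in toks])
        if w_tfidf.sum() > 0:
            w_tfidf = w_tfidf / w_tfidf.sum()
        w_mean = np.full(len(toks), 1.0 / len(toks))
        # Final weights: 10% mean-pooling weights + 90% TF-IDF weights.
        w = (1 - alpha) * w_mean + alpha * w_tfidf
        pooled.append((w[:, None] * vecs).sum(axis=0))
    return np.stack(pooled)
```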
Flow-calibration
Sentence embeddings obtained from BERT without extra fine-tuning have been found to poorly capture the semantic meaning of sentences, as reflected by the performance of BERT on sentence-level tasks such as predicting Semantic Textual Similarity (STS) (Reimers and Gurevych, 2019). This may be caused by anisotropy (embeddings occupying a narrow cone in the vector space), a common problem of embeddings produced by language models (Ethayarajh, 2019; Li et al., 2020). To address this problem, following Li et al. (2020), we adopted BERT-Flow to calibrate the sentence embeddings. More precisely, we used a shallow Glow (Kingma and Dhariwal, 2018), a normalizing-flow-based model, with K = 16 flow steps, L = 1 level, random permutation and affine coupling, to post-process the sentence embeddings from all 7 layers of distil-BERT (including the word embedding layer). We tested different combinations of the 7 post-processed embeddings and took the average of the embeddings from the first, second and sixth layers, based on the metrics evaluated on the STS benchmark dataset.
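A full Glow implementation is beyond the scope of this section, but the sketch below illustrates the building block described above: one flow step consisting of a fixed random permutation followed by an affine coupling layer, stacked K = 16 times at a single level, trained to map embeddings to a standard Gaussian latent space. The hidden size and training details are placeholder assumptions, and components of the actual Glow/BERT-flow implementation (e.g., actnorm) are omitted.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Affine coupling: transform half the dimensions conditioned on the other half."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.half = dim // 2
        # Small network predicting scale and shift for the second half.
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)                      # keep scales bounded for stability
        y2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=-1)                # log |det Jacobian| of this step
        return torch.cat([x1, y2], dim=-1), log_det

class FlowStep(nn.Module):
    """One flow step: fixed random permutation followed by affine coupling."""
    def __init__(self, dim):
        super().__init__()
        self.register_buffer("perm", torch.randperm(dim))
        self.coupling = AffineCoupling(dim)

    def forward(self, x):
        return self.coupling(x[:, self.perm])  # permutation has zero log-det

class ShallowGlow(nn.Module):
    """K stacked flow steps (L = 1 level), mapping embeddings to a latent z."""
    def __init__(self, dim, K=16):
        super().__init__()
        self.steps = nn.ModuleList(FlowStep(dim) for _ in range(K))

    def forward(self, x):
        log_det = torch.zeros(x.size(0), device=x.device)
        for step in self.steps:
            x, ld = step(x)
            log_det = log_det + ld
        return x, log_det

# Training maximizes the likelihood of the embeddings under a standard
# Gaussian prior on z:  loss = -(log N(z; 0, I) + log|det|).
flow = ShallowGlow(dim=768, K=16)
emb = torch.randn(32, 768)                     # placeholder sentence embeddings
z, log_det = flow(emb)
prior = torch.distributions.Normal(0.0, 1.0)
nll = -(prior.log_prob(z).sum(dim=-1) + log_det).mean()
```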
Since normalizing-flow-based models can create an invertible mapping from the BERT embedding space to a standard Gaussian latent space (Li et al., 2020), flow calibration offers two advantages: (1) it alleviates the anisotropy problem, making the sentence embeddings more semantically distinguishable, and (2) it transforms the distribution of BERT embeddings into a standard Gaussian, which fits the