
2 Background
2.1 Weakly-supervised text classification
In weakly-supervised text classification, the system is allowed to view the entire test set at evaluation time. Having access to all test data allows novel preprocessing approaches unavailable in traditional text classification, such as preliminary clustering of test samples (Mekala and Shang, 2020; Wang et al., 2021) before attempting final classification. In the process, the system has an opportunity to examine overall characteristics of the test set.
Existing methods for weakly-supervised text classification focus on effectively leveraging such additional information. The dominant approach involves generating pseudo-data to train a neural text classifier. Most methods assign labels to samples in the test set by identifying operative keywords within the text (Meng et al., 2018). They obtain seed words that best represent each category name. Then, each sample in the test set is assigned the label whose keywords are most relevant to its content.
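The seed-word matching step above can be sketched as follows. This is an illustrative simplification, not the exact procedure of any cited method: the seed lists and the overlap-count scoring rule are assumptions made for the example.

```python
# Hypothetical sketch of seed-word-based pseudo-labeling:
# assign each test document the label whose seed words
# appear most often in its text. Seed lists are invented
# for illustration only.

SEED_WORDS = {
    "sports": ["game", "team", "score"],
    "politics": ["election", "senate", "vote"],
}

def pseudo_label(document: str) -> str:
    tokens = document.lower().split()
    # Count seed-word hits per candidate label.
    scores = {
        label: sum(tokens.count(w) for w in seeds)
        for label, seeds in SEED_WORDS.items()
    }
    # The best-matching label becomes the pseudo-label.
    return max(scores, key=scores.get)

print(pseudo_label("The team won the game with a late score"))
# -> sports
```

As the section notes, this scheme fails silently when a document contains none of the seed words, which motivates the entailment-based alternative below.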
Later works improve this pipeline by automatically generating seed words (Meng et al., 2020b) or incorporating pre-trained language models to utilize contextual information of representative keywords (Mekala and Shang, 2020).
Seed-word-based pseudo-labeling, however, is heavily dependent on the existence of representative seed words in test samples. Seed-word-based matching cannot fully utilize the information in contextual language representations, because the classification of each document involves brittle global hyperparameters such as the total number of seed words (Meng et al., 2020b) or word embedding distance (Wang et al., 2021).
In this work, we entirely forgo the seed-word generation process during pseudo-labeling. We show that replacing seed-word generation with entailment-based text classification is more reliable and performant for text classification with weak supervision.
2.2 Entailment-based text classification
Textual entailment (Fyodorov et al., 2000; MacCartney and Manning, 2009) measures the likelihood of a sentence appearing after another. Since entailment is evaluated to a probability value, the task can be extended for use in text classification. In entailment-based text classification, classification is posed as a textual entailment problem: given a test document, the system ranks the probabilities that sentences each containing a possible class label (hypotheses) will immediately follow the document text. The class label belonging to the most probable hypothesis is selected as the classification prediction. A hypothesis for topic classification, for example, could be “This text is about <topic>”. The flexibility in prompt choices for constructing the hypotheses makes entailment-based classification extremely adaptable to different task types.
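A minimal sketch of this ranking procedure is below. The `entail_prob` scorer is a stand-in assumption: in practice it would be an NLI model's entailment probability, whereas here it is a toy word-overlap score used only to keep the example self-contained.

```python
# Sketch of entailment-based classification: build one hypothesis
# per class label, score each against the document with an
# entailment scorer, and return the label of the most probable
# hypothesis.

def entail_prob(premise: str, hypothesis: str) -> float:
    # Toy stand-in for an NLI model: normalized word overlap
    # between premise and hypothesis, in [0, 1].
    norm = lambda s: {w.strip(".,").lower() for w in s.split()}
    p, h = norm(premise), norm(hypothesis)
    return len(p & h) / max(len(h), 1)

def classify(document: str, labels: list[str]) -> str:
    template = "This text is about {}."  # prompt from the section above
    hypotheses = {lbl: template.format(lbl) for lbl in labels}
    scores = {lbl: entail_prob(document, hyp)
              for lbl, hyp in hypotheses.items()}
    # The label whose hypothesis is most probable wins.
    return max(scores, key=scores.get)

print(classify("This article is about politics and the senate",
               ["sports", "politics"]))
# -> politics
```

Note that only the prompt template changes across tasks, which is the source of the adaptability described above.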
Although entailment-based sentence scoring is popular in zero- and few-shot text classification (Yin et al., 2019, 2020), the robustness of such approaches has recently been called into question (Ma et al., 2021). Since entailment-based classifiers rely heavily on lexical patterns, a large variance is observed in classification performance across different domains. We find that the self-training commonly found in weakly-supervised classification mitigates such robustness issues in entailment-based classification to a large degree.
3 The LIME Framework
LIME enhances the two-phase weakly-supervised
classification pipeline with an entailment-based
pseudo-labeling scheme.
Examples
Test sample (t)   “I love the food.”
Class label (c)   “Positive”
Verbalizer        “Positive” → “good”
Prompt            “It was <verbalizer(c)>.”
Hypothesis (h)    “It was good.”
Table 1: Example test sample, class label, verbalizer, prompt, and entailment hypothesis. Converting class labels with a verbalizer is an optional procedure.
3.1 Phase 1: Pseudo-labeling
Textual entailment evaluates the likelihood of a hypothesis h succeeding some text t. Given C = {c_1, c_2, ..., c_n}, the set of all possible labels for t, we generate H = {h_1, h_2, ..., h_n}, the set of all entailment hypotheses. Every sentence h_i asserts that its corresponding c_i ∈ C is the correct label for t. Each h_i is constructed from a designated prompt and an optional verbalizer for each dataset (Schick and Schütze, 2021):

h_i = prompt(verbalizer(c_i))
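The construction h_i = prompt(verbalizer(c_i)) can be sketched directly. The verbalizer mapping and prompt string follow the running example in Table 1; the “Negative” entry and the function names are illustrative assumptions.

```python
# Sketch of hypothesis construction: each class label is optionally
# mapped through a verbalizer, then inserted into a prompt template,
# following the running example in Table 1.

VERBALIZER = {"Positive": "good", "Negative": "bad"}  # "Negative" is assumed

def verbalize(label: str) -> str:
    # Optional step: fall back to the raw label if unmapped.
    return VERBALIZER.get(label, label)

def prompt(word: str) -> str:
    return f"It was {word}."

def build_hypotheses(labels: list[str]) -> list[str]:
    # One hypothesis h_i per candidate label c_i.
    return [prompt(verbalize(c)) for c in labels]

print(build_hypotheses(["Positive", "Negative"]))
# -> ['It was good.', 'It was bad.']
```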
Prompts dictate the wording of the hypotheses,