LIME: Weakly-Supervised Text Classification Without Seeds
Seongmin Park and Jihwa Lee
ActionPower, Seoul, Republic of Korea
{seongmin.park, jihwa.lee}@actionpower.kr
Abstract

In weakly-supervised text classification, only label names act as sources of supervision. Predominant approaches to weakly-supervised text classification utilize a two-phase framework, where test samples are first assigned pseudo-labels and are then used to train a neural text classifier. In most previous work, the pseudo-labeling step depends on obtaining seed words that best capture the relevance of each class label. We present LIME (Labels Identified with Maximal Entailment), a framework for weakly-supervised text classification that entirely replaces the brittle seed-word generation process with entailment-based pseudo-classification. We find that combining weakly-supervised classification and textual entailment mitigates shortcomings of both, resulting in a more streamlined and effective classification pipeline. With just an off-the-shelf textual entailment model, LIME outperforms recent baselines in weakly-supervised text classification and achieves state-of-the-art results on 4 benchmarks. We open-source our code at https://github.com/seongminp/LIME.
1 Introduction
Weakly-supervised text classification (Meng et al., 2018) is an important avenue of research in low-resource text classification. Unlike in traditional text classification, all supervision derives from textual information in category names. Weakly-supervised classification offers a practical approach to classification because it does not necessitate massive amounts of training data.
Another distinct aspect of weakly-supervised text classification is that the system has access to the entire test set at evaluation time, instead of encountering test samples sequentially. Exploiting this characteristic, recent approaches employ keyword-matching pseudo-labeling schemes to tentatively assign class labels to each test sample before using that information to train a separate classifier (Meng et al., 2018; Mekala and Shang, 2020; Wang et al., 2021). Pseudo-labels are assigned by counting how many "seed words" of each class are found in the test sample. Keyword-matching-based labeling, however, is neither adaptable nor flexible, because the semantic information embedded in class names cannot be extracted adaptively for distinct classification tasks.
Inspired by recent advances in prompt-based text classification (Yin et al., 2019, 2020; Schick and Schütze, 2021), we replace the keyword-based pseudo-labeling step with a more streamlined entailment-based approach. Extensive experiments show that entailment-based classifiers assign more accurate pseudo-labels, with greater task adaptability and far fewer hyperparameters. We find that our method realizes the benefits of both entailment-based classification and self-training.
Our contributions are as follows:

1. We present LIME, a novel framework for weakly-supervised text classification that utilizes textual entailment. LIME surpasses current state-of-the-art weakly-supervised methods in all tested benchmarks.

2. We show that self-training with pseudo-labels can mitigate unsolved robustness issues in entailment-based classification (Ma et al., 2021).

3. We experimentally confirm that higher confidence in pseudo-labels translates to better classification accuracy in self-training. We also find that a balance between filtering out low-confidence labels and preserving a sizable pseudo-training corpus is important.
2 Background
2.1 Weakly-supervised text classification
In weakly-supervised text classification, the system is allowed to view the entire test set at evaluation time. Having access to all test data allows novel preprocessing approaches unavailable in traditional text classification, such as preliminary clustering of test samples (Mekala and Shang, 2020; Wang et al., 2021) before attempting final classification. In the process, the system has an opportunity to examine the overall characteristics of the test set.
Existing methods for weakly-supervised text classification focus on effectively leveraging such additional information. The dominant approach involves generating pseudo-data to train a neural text classifier. Most methods assign labels to samples in the test set by identifying operative keywords within the text (Meng et al., 2018). They obtain seed words that best represent each category name. Then, each sample in the test set is assigned the label whose keywords are most relevant to its content.
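To make the keyword-counting idea concrete, here is a minimal sketch; the seed lists and the abstention rule are illustrative assumptions, not the exact procedure of any specific prior system:

```python
# Minimal sketch of seed-word pseudo-labeling: count seed-word hits per class
# and assign the highest-scoring label. The seed lists and the abstention rule
# are illustrative assumptions, not the procedure of any cited system.
def pseudo_label(document, seed_words):
    tokens = document.lower().split()
    # Count how many seed words of each class occur in the document.
    scores = {
        label: sum(tokens.count(word) for word in words)
        for label, words in seed_words.items()
    }
    best_label = max(scores, key=scores.get)
    # Abstain when no seed word appears, since there is no evidence to label.
    return best_label if scores[best_label] > 0 else None

seeds = {"sports": ["game", "team", "score"], "politics": ["election", "vote"]}
print(pseudo_label("The team won the game by a single score.", seeds))  # sports
```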
Later works improve this pipeline by automatically generating seed words (Meng et al., 2020b) or by incorporating pre-trained language models to utilize the contextual information of representative keywords (Mekala and Shang, 2020).
Seed-word-based pseudo-labeling, however, depends heavily on the presence of representative seed words in test samples. Seed-word matching also cannot fully utilize the information in contextual language representations, because the classification of each document involves brittle global hyperparameters such as the total number of seed words (Meng et al., 2020b) or word-embedding distances (Wang et al., 2021).

In this work, we entirely forgo the seed-word generation process during pseudo-labeling. We show that replacing seed-word generation with entailment-based classification is more reliable and performant for text classification under weak supervision.
2.2 Entailment-based text classification
Textual entailment (Fyodorov et al., 2000; MacCartney and Manning, 2009) measures the likelihood of one sentence following another. Since entailment is evaluated as a probability, the task can be extended for use in text classification. In entailment-based text classification, classification is posed as a textual entailment problem: given a test document, the system ranks the probabilities that sentences each containing a possible class label (hypotheses) will immediately follow the document text. The class label belonging to the most probable hypothesis is selected as the classification prediction. A hypothesis for topic classification, for example, could be "This text is about <topic>". The flexibility in prompt choices for constructing the hypotheses makes entailment-based classification extremely adaptable to different task types.
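As an illustration, such entailment-based classification is available off the shelf; the sketch below uses the Hugging Face transformers zero-shot pipeline with a public MNLI model, where the model choice and template are our assumptions rather than a setup prescribed by this paper:

```python
# Sketch of entailment-based zero-shot classification with an off-the-shelf
# NLI model. The model name and hypothesis template are assumptions chosen
# for illustration; any MNLI-style entailment model could be substituted.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="roberta-large-mnli")

result = classifier(
    "The striker scored twice in the final minutes.",
    candidate_labels=["sports", "politics", "business"],
    hypothesis_template="This text is about {}.",  # one hypothesis per label
)
print(result["labels"][0])  # label whose hypothesis is most entailed
```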
Although entailment-based sentence scoring is popular in zero- and few-shot text classification (Yin et al., 2019, 2020), the robustness of such approaches has recently been called into question (Ma et al., 2021). Because entailment-based classifiers rely heavily on lexical patterns, their classification performance varies widely across domains. We find that the self-training commonly found in weakly-supervised classification mitigates such robustness issues in entailment-based classification to a large degree.
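As a rough sketch of how such self-training consumes entailment outputs (the threshold value and data layout are illustrative assumptions, not the paper's reported configuration):

```python
# Illustrative confidence filtering for self-training: keep only test samples
# whose entailment-based pseudo-label is confident, then use the survivors to
# train a separate classifier. The 0.8 threshold is an assumed example value;
# a stricter cutoff gives cleaner labels but a smaller pseudo-training corpus.
def build_pseudo_training_set(samples, threshold=0.8):
    # `samples` holds (document, pseudo_label, confidence) triples.
    return [
        (doc, label)
        for doc, label, confidence in samples
        if confidence >= threshold
    ]
```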
3 The LIME Framework
LIME enhances the two-phase weakly-supervised classification pipeline with an entailment-based pseudo-labeling scheme.
Test sample (t)   "I love the food."
Class label (c)   "Positive"
Verbalizer        "Positive" → "good"
Prompt            "It was <verbalizer(c)>."
Hypothesis (h)    "It was good."

Table 1: Example test sample, class label, verbalizer, prompt, and entailment hypothesis. Converting class labels with a verbalizer is an optional procedure.
3.1 Phase 1: Pseudo-labeling
Textual entailment evaluates the likelihood of a hypothesis $h$ succeeding some text $t$. Given $C = \{c_1, c_2, \ldots, c_n\}$, the set of all possible labels for $t$, we generate $H = \{h_1, h_2, \ldots, h_n\}$, the set of all entailment hypotheses. Every sentence $h_i$ asserts that its corresponding $c_i \in C$ is the correct label for $t$. $h_i$ is constructed from a designated prompt and an optional verbalizer for each dataset (Schick and Schütze, 2021):

$$h_i = \mathrm{prompt}(\mathrm{verbalizer}(c_i))$$
Prompts dictate the wording of the hypotheses, and an optional verbalizer maps each class label to a word better suited to the prompt (Table 1).
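A minimal sketch of this construction and scoring is below; the verbalizer mapping, prompt wording, and NLI model are illustrative assumptions mirroring Table 1, not a configuration fixed by the paper:

```python
# Sketch of hypothesis construction h_i = prompt(verbalizer(c_i)), scored with
# an off-the-shelf NLI model. The verbalizer, prompt, and model are assumed
# examples mirroring Table 1.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

labels = ["Positive", "Negative"]
verbalizer = {"Positive": "good", "Negative": "bad"}  # optional label rewording
prompt = "It was {}."

def classify(text):
    scores = {}
    for c in labels:
        hypothesis = prompt.format(verbalizer[c])  # h_i = prompt(verbalizer(c_i))
        # Score the (premise, hypothesis) pair; keep the entailment probability.
        outputs = nli({"text": text, "text_pair": hypothesis}, top_k=None)
        scores[c] = next(o["score"] for o in outputs if o["label"] == "ENTAILMENT")
    return max(scores, key=scores.get)

print(classify("I love the food."))  # expected: "Positive"
```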