
A CURRICULUM LEARNING APPROACH FOR MULTI-DOMAIN TEXT CLASSIFICATION
USING KEYWORD WEIGHT RANKING
Zilin Yuan1, Yinghui Li1, Yangning Li1, Rui Xie2, Wei Wu2, Hai-Tao Zheng1,3∗
1Shenzhen International Graduate School, Tsinghua University
2Meituan, 3Peng Cheng Laboratory
* Corresponding author. (E-mail: zheng.haitao@sz.tsinghua.edu.cn)
ABSTRACT
Text classification is a classic NLP task, but it has two prominent
shortcomings. On the one hand, text classification is deeply
domain-dependent: a classifier trained on the corpus of one domain
may not perform well in another. On the other hand, text
classification models require large amounts of annotated data for
training, and some domains simply lack enough annotated data. It is
therefore valuable to investigate how to efficiently utilize text
data from different domains to improve model performance across
domains. Some multi-domain text classification models are trained
with adversarial training to extract the features shared among all
domains and the specific features of each domain. We observe that
the distinctness of the domain-specific features varies across
domains, so in this paper we propose a curriculum learning strategy
based on keyword weight ranking to improve the performance of
multi-domain text classification models. Experimental results on the
Amazon review and FDU-MTL datasets show that our curriculum learning
strategy effectively improves the performance of multi-domain text
classification models based on adversarial learning and outperforms
state-of-the-art methods.
Index Terms—Multi-Domain Text Classification, Curriculum
Learning, Keyword Weight Ranking
1. INTRODUCTION
Text classification is one of the fundamental NLP tasks, with a wide
range of applications such as spam detection [1], news classification
[2], and the evaluation of e-commerce products [3]. Research on text
classification can be traced back to methods based on expert rules in
the 1950s. In the 1990s, machine learning methods combining feature
engineering with classifiers began to appear [4], and today the more
popular approach is to classify with deep learning methods such as
CNNs [5], RNNs [6, 7], and attention mechanisms [8].
However, regardless of the method, two main problems remain: heavy
domain dependence and the need for large amounts of annotated corpus.
Domain dependence means that a classifier trained on one domain may
not perform equally well on other domains, because the meaning of
vocabulary can differ across domains; even the same word may express
different meanings in different domains. As shown in Fig. 1, the word
"infantile" [9] often carries a negative meaning in the domain of
Movie Review (e.g., "The idea of the movie is infantile"), but
usually has no obvious emotional connotation in the evaluation of
Infant Products (e.g., "The infantile toy was sold out yesterday").
[Figure: "The idea of the movie is infantile" → Negative (Movie Review); "The infantile toy was sold out yesterday" → Neutral (Infant Products)]
Fig. 1. The different sentiments of “infantile” in different domains.
Therefore, when we want to train classifiers on texts from different
domains, we need enough labeled data in each domain, yet not every
domain has a sufficient corpus for training. It is thus necessary to
make full use of the corpora of different domains when classifying
the texts of a specific domain, a task known as Multi-Domain Text
Classification (MDTC) [10, 11]. However, traditional MDTC methods
[11, 12] all ignore an important piece of information: the
classification difficulty differs from domain to domain.
Since the classification difficulty is inconsistent across domains,
this property can be exploited to make the model learn the data from
easy to difficult. This way of learning resembles human learning, in
which simple lessons are learned first, followed by complex ones.
This learning paradigm is called curriculum learning [13], and it has
shown notable improvements on NLP tasks such as dialog state tracking
[14], few-shot text classification [15], Chinese Spell Checking [16],
and so on. The core of curriculum learning lies in the difficulty
measurer for data samples and the data scheduler. Combined with the
extraction of private and shared features in multi-domain text
classification, we propose that the sum of the weights of a domain's
keywords can serve as a measurer of how difficult that domain's
specific features are to extract, and thus adjust the order in which
each domain's corpus is fed into the model.
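To make the data-scheduler side of the curriculum concrete, the
following is a minimal sketch of an easy-to-hard schedule, assuming a
precomputed domain ordering produced by a difficulty measurer; the
names baby_step_schedule and train_epoch are illustrative and not
from the paper.

def baby_step_schedule(domain_order, corpora, train_epoch, epochs_per_step=1):
    """Cumulatively feed domains to the model, easiest first.

    domain_order: domain names sorted from easy to hard by the measurer.
    corpora: dict mapping domain name -> list of labeled examples.
    train_epoch: callable that trains the model for one epoch on a list.
    """
    pool = []  # examples from all domains introduced so far
    for domain in domain_order:
        pool.extend(corpora[domain])      # add the next-easiest domain
        for _ in range(epochs_per_step):
            train_epoch(pool)             # retrain on the growing pool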
Based on the above motivations, we propose a framework called
Keyword-weight-aware Curriculum Learning (KCL) for MDTC,
which includes the following two features:
1) We calculate the word weights of the texts in each domain, take
the Top-N words as the domain keywords, and use the sum of the
weights of these N keywords to measure the difficulty of extracting
each domain's domain-specific features. The higher the sum, the more
distinct the domain-specific features and the easier they are to
extract, so that domain should enter the model for training earlier
(see the sketch after this list).
2) We apply different keyword extraction methods and test different
numbers of keywords to find the best ordering of domains.
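As a hedged illustration of feature 1), the sketch below scores each
domain by the sum of its Top-N keyword weights, assuming TF-IDF as
the weighting method; the function name and parameters are
hypothetical, and other extractors could be swapped in. A domain with
a larger sum has more distinct domain-specific features and is
scheduled earlier.

from sklearn.feature_extraction.text import TfidfVectorizer

def rank_domains_by_keyword_weight(domain_texts, top_n=20):
    """domain_texts: dict mapping domain name -> list of raw documents.
    Returns domain names sorted from easiest (largest Top-N weight sum)
    to hardest, i.e., the order in which they enter training."""
    domains = list(domain_texts)
    # Treat each domain's concatenated corpus as one "document" so that
    # the IDF term reflects how distinctive a word is to that domain.
    corpus = [" ".join(docs) for docs in domain_texts.values()]
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)

    scores = {}
    for i, domain in enumerate(domains):
        weights = tfidf[i].toarray().ravel()
        top = sorted(weights, reverse=True)[:top_n]  # Top-N keyword weights
        scores[domain] = float(sum(top))             # higher sum = easier
    return sorted(domains, key=scores.get, reverse=True)

The returned ordering can then be passed directly as the
domain_order argument of the scheduler sketched above.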
The experimental results show that our proposed approach improves
MDTC performance and achieves new state-of-the-art results on the
Amazon review and FDU-MTL datasets.