OTSeq2Set: An Optimal Transport Enhanced Sequence-to-Set Model for
Extreme Multi-label Text Classification
Jie Cao
Polytechnic Institute
Zhejiang University
caojie@zju.edu.cn
Yin Zhang
College of Computer Science and Technology
Zhejiang University
yinzh@zju.edu.cn
Abstract
Extreme multi-label text classification (XMTC) is the task of finding the most relevant subset of labels from an extremely large-scale label collection. Recently, some deep learning models have achieved state-of-the-art results on XMTC tasks. These models commonly predict scores for all labels with a fully connected layer as the last layer of the model. However, such models cannot predict a relatively complete and variable-length label subset for each document, because they select the positive labels relevant to a document by a fixed threshold or by taking the top k labels in descending order of scores. A less popular family of deep learning models, sequence-to-sequence (Seq2Seq) models, focuses on predicting variable-length positive labels in sequence style. However, the labels in XMTC tasks are essentially an unordered set rather than an ordered sequence, so the default order of labels constrains Seq2Seq models during training. To address this limitation of Seq2Seq, we propose an autoregressive sequence-to-set model for XMTC tasks named OTSeq2Set. Our model generates predictions in a student-forcing scheme and is trained by a loss function based on bipartite matching, which enables permutation invariance. Meanwhile, we use the optimal transport distance as a measurement to force the model to focus on the closest labels in the semantic label space. Experiments show that OTSeq2Set outperforms other competitive baselines on 4 benchmark datasets. In particular, on the Wikipedia dataset with 31k labels, it outperforms the state-of-the-art Seq2Seq method by 16.34% in micro-F1 score. The code is available at https://github.com/caojie54/OTSeq2Set.
1 Introduction
Extreme multi-label text classification (XMTC) is a Natural Language Processing (NLP) task of finding the most relevant subset of labels from an extremely large-scale label set. It has many usage scenarios, such as item categorization in e-commerce and tagging Wikipedia articles. XMTC becomes more important with the fast growth of big data.

Corresponding author
As in many other NLP tasks, deep learning based models also achieve state-of-the-art performance in XMTC. For example, AttentionXML (You et al., 2019), X-Transformer (Chang et al., 2020) and LightXML (Jiang et al., 2021) have achieved remarkable improvements in evaluation metrics over previous methods. These models are all composed of three parts (Jiang et al., 2021): text representation, label recalling, and label ranking. The first part converts the raw text to text representation vectors; the label recalling part then gives scores for all cluster or tree nodes, each covering a portion of the labels; finally, the label ranking part predicts scores for all labels in descending order. Notice that the label recalling and label ranking parts both use fully connected layers. Although models based on fully connected layers have excellent performance, they share a drawback: they cannot generate a variable-length and relatively complete label set for each document, because they select the positive labels relevant to a document by a fixed threshold or by taking the top k labels in descending order of label scores, both of which depend on a human decision. Another type of deep learning based model is Seq2Seq learning, which focuses on predicting only the variable-length positive labels, such as MLC2Seq (Nam et al., 2017) and SGM (Yang et al., 2018). MLC2Seq and SGM adapt the Seq2Seq model to multi-label classification (MLC) tasks by reordering label permutations according to label frequency. However, a pre-defined label order cannot solve the fundamental problem of Seq2Seq-based models: the labels in XMTC tasks are essentially an unordered set rather than an ordered sequence. Yang et al. (2019) address this problem on MLC tasks via reinforcement learning, designing a reward function to reduce the model's dependence on the label order, but their approach requires pretraining the model with the Maximum Likelihood Estimation (MLE) method. Such two-stage training is not efficient for XMTC tasks with large-scale label sets.

arXiv:2210.14523v2 [cs.CL] 14 Nov 2022
To address the above problems, we propose an autoregressive sequence-to-set model, OTSeq2Set, which generates a subset of labels for each document and ignores the order of the ground truth during training. OTSeq2Set is based on the Seq2Seq model (Bahdanau et al., 2015), which consists of an encoder and a decoder with an attention mechanism. The bipartite matching method has been successfully applied to the named entity recognition task (Tan et al., 2021) and the keyphrase generation task (Ye et al., 2021) to alleviate the impact of order in the targets. Chen et al. (2019) and Li et al. (2020) have successfully applied the optimal transport algorithm to enable sequence-level training for Seq2Seq learning. Both methods can achieve an optimal matching between two sequences, but they differ: the former matches the two sequences one to one, while the latter gives a matrix containing regularized scores of all connections. We combine the two methods in our model.
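The optimal transport distance referred to above can be approximated with the entropy-regularized Sinkhorn iteration. The following is a minimal NumPy sketch, not the paper's implementation: the toy cost matrix, the regularization strength `eps`, the uniform marginals, and the iteration count are all illustrative assumptions.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularized optimal transport between two uniform
    distributions over label sets, via Sinkhorn iterations."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)        # marginal over generated labels
    b = np.full(m, 1.0 / m)        # marginal over ground-truth labels
    K = np.exp(-cost / eps)        # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):       # alternate scaling to match marginals
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan P

# toy cost between 2 generated and 3 gold label embeddings
cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.9]])
P = sinkhorn_plan(cost)
ot_distance = float((P * cost).sum())  # the OT distance used as a loss signal
```

In this sketch the scalar `ot_distance` plays the role of the sequence-level measurement: it is small when each generated label sits close to some gold label in the embedding space.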
The contributions of this paper are summarized as follows: (1) We propose two schemes for using bipartite matching in XMTC tasks, which are suitable for datasets with different label distributions. (2) We combine bipartite matching and the optimal transport distance to compute the overall training loss, using the student-forcing scheme when generating predictions in the training stage. Our model thus avoids exposure bias; besides, the optimal transport distance as a measurement forces the model to focus on the closest labels in the semantic label space. (3) We add a lightweight convolution module to the Seq2Seq model, which achieves a stable improvement while requiring only a few parameters. (4) Experimental results show that our model achieves significant improvements on four benchmark datasets. For example, on the Wikipedia dataset with 31k labels, it outperforms the state-of-the-art method by 16.34% in micro-F1 score, and on Amazon-670K, it outperforms the state-of-the-art model by 14.86% in micro-F1 score.
2 Methodology
2.1 Overview
Here we define the necessary notations and describe the sequence-to-set XMTC task. Given a text sequence $x$ containing $l$ words, the task aims to assign to $x$ a subset $y$ containing $n$ labels from the total label set $L$. Unlike fully-connected-layer-based methods, which give scores to all labels, the Seq2Set XMTC task is modeled as finding an optimal positive label sequence that maximizes the joint probability $P(\hat{y} \mid x)$, which is as follows:

$$P(\hat{y} \mid x) = \prod_{i=1}^{n} P\left(\hat{y}_{\rho(i)} \mid y^{g}_{1}, y^{g}_{2}, \ldots, y^{g}_{i-1}, x\right), \quad (1)$$

where $y^{g}$ is the sequence generated by greedy search, $y$ is the ground-truth sequence with its default order, and $\hat{y}$ is the best-matched reordering of $y$ computed by bipartite matching. As described in Eq. (1), we use the student-forcing scheme to avoid the exposure bias (Ranzato et al., 2016) between the generation stage and the training stage. Furthermore, combining this scheme with bipartite matching enables the model to eliminate the influence of the default order of labels.
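The bipartite matching step that produces the reordered target $\hat{y}$ can be sketched with the Hungarian algorithm as implemented in SciPy. The cost matrix below is a toy pairwise distance between generated slots and gold labels; the label names and cost values are illustrative assumptions, not the paper's actual matching cost.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i][j]: cost of matching the i-th generated slot to the j-th gold label
cost = np.array([[0.9, 0.1, 0.5],
                 [0.2, 0.8, 0.6],
                 [0.7, 0.4, 0.05]])

# minimum-cost one-to-one matching between slots and gold labels
row_ind, col_ind = linear_sum_assignment(cost)

gold = ["labelA", "labelB", "labelC"]
# reorder the ground truth so that position i holds the gold label
# matched to generated slot i; this reordered sequence plays the role of y-hat
y_hat = [gold[j] for j in col_ind]   # -> ["labelB", "labelA", "labelC"]
```

Training against `y_hat` instead of `gold` is what makes the loss permutation-invariant: any ordering of the gold set yields the same matching cost.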
2.2 Sequence-to-Set Model
Our proposed Seq2Set model is based on the Seq2Seq model (Bahdanau et al., 2015) and consists of an encoder and a set decoder with an attention mechanism and an extra lightweight convolution layer (Wu et al., 2019), which are introduced in detail below.
Encoder We implement the encoder with a bidirectional GRU that reads the text sequence $x$ from both directions and computes the hidden states for each word as follows:

$$\overrightarrow{h}_{i} = \overrightarrow{\mathrm{GRU}}(\overrightarrow{h}_{i-1}, e(x_i)) \quad (2)$$
$$\overleftarrow{h}_{i} = \overleftarrow{\mathrm{GRU}}(\overleftarrow{h}_{i+1}, e(x_i)) \quad (3)$$

where $e(x_i)$ is the embedding of $x_i$. The final representation of the $i$-th word is $h_i = [\overrightarrow{h}_{i}; \overleftarrow{h}_{i}]$, the concatenation of the hidden states from both directions.
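Equations (2) and (3) describe a standard bidirectional GRU; a minimal PyTorch sketch follows. The vocabulary, embedding, and hidden sizes are illustrative assumptions, not the paper's hyperparameters.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # e(x_i)
        # bidirectional=True runs the forward and backward GRUs of
        # Eqs. (2)-(3) and concatenates their states per word
        self.gru = nn.GRU(emb_dim, hidden_dim,
                          batch_first=True, bidirectional=True)

    def forward(self, x):
        h, _ = self.gru(self.embed(x))
        return h   # (batch, l, 2 * hidden_dim): h_i = [forward; backward]

enc = Encoder()
x = torch.randint(0, 1000, (2, 7))   # batch of 2 documents, 7 words each
h = enc(x)
print(h.shape)                       # torch.Size([2, 7, 128])
```

Each row `h[:, i]` is the concatenated representation $h_i$ that the attention mechanism consumes in the next subsection.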
Attention with lightweight convolution After the encoder computes $h_i$ for all elements in $x$, we compute a context vector $c_t$ to focus on different