OTSeq2Set: An Optimal Transport Enhanced Sequence-to-Set Model for
Extreme Multi-label Text Classification
Jie Cao
Polytechnic Institute
Zhejiang University
caojie@zju.edu.cn
Yin Zhang
College of Computer Science and Technology
Zhejiang University
yinzh@zju.edu.cn
Abstract
Extreme multi-label text classification (XMTC) is the task of finding the most relevant subset of labels from an extremely large-scale label collection. Recently, some deep learning models have achieved state-of-the-art results on XMTC tasks. These models commonly predict scores for all labels with a fully connected layer as the last layer of the model. However, such models cannot predict a relatively complete and variable-length label subset for each document, because they select the positive labels relevant to a document by a fixed threshold or by taking the top k labels in descending order of scores. A less popular family of deep learning models, sequence-to-sequence (Seq2Seq) models, focuses on predicting variable-length positive labels in sequence style. However, the labels in XMTC tasks are essentially an unordered set rather than an ordered sequence, so the default order of labels constrains Seq2Seq models during training. To address this limitation of Seq2Seq, we propose an autoregressive sequence-to-set model for XMTC tasks named OTSeq2Set. Our model generates predictions in a student-forcing scheme and is trained by a loss function based on bipartite matching, which enables permutation invariance. Meanwhile, we use the optimal transport distance as a measurement to force the model to focus on the closest labels in the semantic label space. Experiments show that OTSeq2Set outperforms other competitive baselines on 4 benchmark datasets. In particular, on the Wikipedia dataset with 31k labels, it outperforms the state-of-the-art Seq2Seq method by 16.34% in micro-F1 score. The code is available at https://github.com/caojie54/OTSeq2Set.
1 Introduction
Extreme multi-label text classification (XMTC) is a Natural Language Processing (NLP) task of finding the most relevant subset of labels from an extremely large-scale label set. It has many usage scenarios, such as item categorization in e-commerce and tagging Wikipedia articles. XMTC becomes more important with the fast growth of big data.

Corresponding author
As in many other NLP tasks, deep learning based models also achieve state-of-the-art performance in XMTC. For example, AttentionXML (You et al., 2019), X-Transformer (Chang et al., 2020) and LightXML (Jiang et al., 2021) have achieved remarkable improvements in evaluation metrics over previous methods. These models are all composed of three parts (Jiang et al., 2021): text representation, label recalling, and label ranking. The first part converts the raw text to text representation vectors; the label recalling part then gives scores for all cluster or tree nodes, each covering a portion of the labels; finally, the label ranking part predicts scores for all labels in descending order. Notice that the label recalling and label ranking parts both use fully connected layers. Although models based on fully connected layers have excellent performance, they share a drawback: they cannot generate a variable-length and relatively complete label set for each document, because they select the positive labels relevant to a document by a fixed threshold or by taking the top k labels in descending order of label scores, both of which depend on a human decision. Another type of deep learning based model is Seq2Seq learning, which focuses on predicting only the variable-length positive labels, such as MLC2Seq (Nam et al., 2017) and SGM (Yang et al., 2018). MLC2Seq and SGM adapt the Seq2Seq model to multi-label classification (MLC) tasks by reordering label permutations according to label frequency. However, a pre-defined label order cannot solve the fundamental problem of Seq2Seq-based models: the labels in XMTC tasks are essentially an unordered set rather than an ordered sequence. Yang et al. (2019) address this problem on MLC tasks via reinforcement learning, designing a reward function to reduce the model's dependence on the label order, but their approach requires pretraining the model with the Maximum Likelihood Estimation (MLE) method. Such two-stage training is not efficient for XMTC tasks with large-scale label sets.

arXiv:2210.14523v2 [cs.CL] 14 Nov 2022
To address the above problems, we propose an autoregressive sequence-to-set model, OTSeq2Set, which generates a subset of labels for each document and ignores the order of the ground truth during training. OTSeq2Set is based on the Seq2Seq model (Bahdanau et al., 2015), which consists of an encoder and a decoder with an attention mechanism. The bipartite matching method has been successfully applied to the named entity recognition task (Tan et al., 2021) and the keyphrase generation task (Ye et al., 2021) to alleviate the impact of order in the targets. Chen et al. (2019) and Li et al. (2020) have successfully applied the optimal transport algorithm to enable sequence-level training for Seq2Seq learning. Both methods can achieve an optimal matching between two sequences, but they differ: the former matches the two sequences one to one, while the latter gives a matrix containing regularized scores of all connections. We combine the two methods in our model.
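The optimal transport distance referred to above can be approximated with the entropy-regularized Sinkhorn iteration. The following is a minimal NumPy sketch, not the paper's implementation: the toy cost matrix, the regularization strength `eps`, the uniform marginals, and the iteration count are all illustrative assumptions.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularized optimal transport between two uniform
    distributions over label sets, via Sinkhorn iterations."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)        # marginal over generated labels
    b = np.full(m, 1.0 / m)        # marginal over ground-truth labels
    K = np.exp(-cost / eps)        # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):       # alternate scaling to match marginals
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan P

# toy cost between 2 generated and 3 gold label embeddings
cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.9]])
P = sinkhorn_plan(cost)
ot_distance = float((P * cost).sum())  # the OT distance used as a loss signal
```

In this sketch the scalar `ot_distance` plays the role of the sequence-level measurement: it is small when each generated label sits close to some gold label in the embedding space.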
The contributions of this paper are summarized as follows: (1) We propose two schemes for using bipartite matching in XMTC tasks, which are suitable for datasets with different label distributions. (2) We combine bipartite matching and the optimal transport distance to compute the overall training loss, using the student-forcing scheme when generating predictions in the training stage. Our model thus avoids exposure bias; besides, the optimal transport distance as a measurement forces the model to focus on the closest labels in the semantic label space. (3) We add a lightweight convolution module to the Seq2Seq model, which achieves a stable improvement while requiring only a few parameters. (4) Experimental results show that our model achieves significant improvements on four benchmark datasets. For example, on the Wikipedia dataset with 31k labels, it outperforms the state-of-the-art method by 16.34% in micro-F1 score, and on Amazon-670K, it outperforms the state-of-the-art model by 14.86% in micro-F1 score.
2 Methodology
2.1 Overview
Here we define the necessary notations and describe the sequence-to-set XMTC task. Given a text sequence $x$ containing $l$ words, the task aims to assign to $x$ a subset $y$ containing $n$ labels from the total label set $L$. Unlike fully-connected-layer-based methods, which give scores to all labels, the Seq2Set XMTC task is modeled as finding an optimal positive label sequence that maximizes the joint probability $P(\hat{y} \mid x)$, which is as follows:

$$P(\hat{y} \mid x) = \prod_{i=1}^{n} P\left(\hat{y}_{\rho(i)} \mid y^{g}_{1}, y^{g}_{2}, \ldots, y^{g}_{i-1}, x\right), \quad (1)$$

where $y^{g}$ is the sequence generated by greedy search, $y$ is the ground-truth sequence with its default order, and $\hat{y}$ is the best-matched reordering of $y$ computed by bipartite matching. As described in Eq. (1), we use the student-forcing scheme to avoid the exposure bias (Ranzato et al., 2016) between the generation stage and the training stage. Furthermore, combining this scheme with bipartite matching enables the model to eliminate the influence of the default order of labels.
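The bipartite matching step that produces the reordered target $\hat{y}$ can be sketched with the Hungarian algorithm as implemented in SciPy. The cost matrix below is a toy pairwise distance between generated slots and gold labels; the label names and cost values are illustrative assumptions, not the paper's actual matching cost.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i][j]: cost of matching the i-th generated slot to the j-th gold label
cost = np.array([[0.9, 0.1, 0.5],
                 [0.2, 0.8, 0.6],
                 [0.7, 0.4, 0.05]])

# minimum-cost one-to-one matching between slots and gold labels
row_ind, col_ind = linear_sum_assignment(cost)

gold = ["labelA", "labelB", "labelC"]
# reorder the ground truth so that position i holds the gold label
# matched to generated slot i; this reordered sequence plays the role of y-hat
y_hat = [gold[j] for j in col_ind]   # -> ["labelB", "labelA", "labelC"]
```

Training against `y_hat` instead of `gold` is what makes the loss permutation-invariant: any ordering of the gold set yields the same matching cost.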
2.2 Sequence-to-Set Model
Our proposed Seq2Set model is based on the Seq2Seq model (Bahdanau et al., 2015) and consists of an encoder and a set decoder with an attention mechanism and an extra lightweight convolution layer (Wu et al., 2019), which are introduced in detail below.
Encoder We implement the encoder with a bidirectional GRU that reads the text sequence $x$ from both directions and computes the hidden states for each word as follows:

$$\overrightarrow{h}_{i} = \overrightarrow{\mathrm{GRU}}(\overrightarrow{h}_{i-1}, e(x_i)) \quad (2)$$
$$\overleftarrow{h}_{i} = \overleftarrow{\mathrm{GRU}}(\overleftarrow{h}_{i+1}, e(x_i)) \quad (3)$$

where $e(x_i)$ is the embedding of $x_i$. The final representation of the $i$-th word is $h_i = [\overrightarrow{h}_{i}; \overleftarrow{h}_{i}]$, the concatenation of the hidden states from both directions.
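Equations (2) and (3) describe a standard bidirectional GRU; a minimal PyTorch sketch follows. The vocabulary, embedding, and hidden sizes are illustrative assumptions, not the paper's hyperparameters.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # e(x_i)
        # bidirectional=True runs the forward and backward GRUs of
        # Eqs. (2)-(3) and concatenates their states per word
        self.gru = nn.GRU(emb_dim, hidden_dim,
                          batch_first=True, bidirectional=True)

    def forward(self, x):
        h, _ = self.gru(self.embed(x))
        return h   # (batch, l, 2 * hidden_dim): h_i = [forward; backward]

enc = Encoder()
x = torch.randint(0, 1000, (2, 7))   # batch of 2 documents, 7 words each
h = enc(x)
print(h.shape)                       # torch.Size([2, 7, 128])
```

Each row `h[:, i]` is the concatenated representation $h_i$ that the attention mechanism consumes in the next subsection.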
Attention with lightweight convolution After the encoder computes $h_i$ for all elements in $x$, we compute a context vector $c_t$ to focus on different