
OTSeq2Set: An Optimal Transport Enhanced Sequence-to-Set Model for
Extreme Multi-label Text Classification
Jie Cao
Polytechnic Institute
Zhejiang University
caojie@zju.edu.cn
Yin Zhang∗
College of Computer Science and Technology
Zhejiang University
yinzh@zju.edu.cn
Abstract
Extreme multi-label text classification
(XMTC) is the task of finding the most
relevant subset of labels from an extremely
large-scale label collection. Recently, some
deep learning models have achieved state-of-
the-art results on XMTC tasks. These models
commonly predict scores for all labels with a
fully connected layer as the last layer of the
model. However, such models cannot predict
a relatively complete, variable-length
label subset for each document, because
they select positive labels relevant to the
document with a fixed threshold or take the top k
labels in descending order of scores. A less
popular family of deep learning models,
sequence-to-sequence (Seq2Seq) models, focuses on
predicting variable-length positive labels in
a sequential fashion. However, the labels in XMTC
tasks essentially form an unordered set rather
than an ordered sequence, so the default order of
labels constrains Seq2Seq models during training.
To address this limitation of Seq2Seq, we
propose an autoregressive sequence-to-set
model for XMTC tasks named OTSeq2Set.
Our model generates predictions in a student-
forcing scheme and is trained with a loss
function based on bipartite matching, which
enables permutation invariance. Meanwhile,
we use the optimal transport distance as a
measure to force the model to focus on
the closest labels in the semantic label space.
Experiments show that OTSeq2Set outperforms
other competitive baselines on 4 benchmark
datasets. Notably, on the Wikipedia
dataset with 31k labels, it outperforms the
state-of-the-art Seq2Seq method by 16.34%
in micro-F1 score. The code is available at
https://github.com/caojie54/OTSeq2Set.
1 Introduction
∗ Corresponding author

Extreme multi-label text classification (XMTC) is a
Natural Language Processing (NLP) task of finding
the most relevant subset of labels from an extremely
large-scale label set. It has many usage scenarios,
such as item categorization in e-commerce and
tagging Wikipedia articles. XMTC has become more
important with the rapid growth of big data.
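To make the setting concrete, the following toy sketch (all label names, scores, and function names are invented for illustration) contrasts the two fixed selection rules commonly applied to the scores produced by a fully connected output layer, top-k selection and a fixed threshold, both of which return label sets whose size or cutoff is chosen by hand rather than by the model:

```python
# Toy illustration of label selection in XMTC. In practice the label
# vocabulary is extremely large; the scores below are invented.
import heapq


def top_k_labels(scores, k):
    """Select the k highest-scoring labels (fixed-size prediction)."""
    return [label for label, _ in
            heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])]


def threshold_labels(scores, tau):
    """Select labels scoring at least tau (size varies, but tau is hand-picked)."""
    return [label for label, s in scores.items() if s >= tau]


# A document's hypothetical scores over a tiny slice of a large label set.
scores = {"e-commerce": 0.91, "retail": 0.64, "sports": 0.08, "history": 0.03}

print(top_k_labels(scores, 2))        # a fixed-size label subset
print(threshold_labels(scores, 0.5))  # a threshold-dependent label subset
```

Neither rule adapts the size of the predicted label set to the document itself, which is the drawback the generative (Seq2Seq-style) models discussed next aim to avoid.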
As in many other NLP tasks, deep learning
based models also achieve state-of-the-art per-
formance in XMTC. For example, AttentionXML
(You et al., 2019), X-Transformer (Chang et al.,
2020) and LightXML (Jiang et al., 2021) have
achieved remarkable improvements in evaluation
metrics over previous state-of-the-art meth-
ods. These models are all composed of three
parts (Jiang et al., 2021): text representation, label
recalling, and label ranking. The first part converts
the raw text to text representation vectors, then the
label recalling part scores all cluster or
tree nodes, each containing a portion of the labels, and finally, the
label ranking part predicts scores for all labels in
descending order. Note that the label recalling
and label ranking parts both use fully connected
layers.
Although the fully connected layer based
models have excellent performance, they have a
drawback: they cannot generate a
variable-length and relatively complete label set for
each document, because they select positive labels relevant to the
document with a fixed threshold or take the top k labels in
descending order of label scores, both of which depend on
human decisions. Another type of deep learning
based model is Seq2Seq learning based methods,
which focus on predicting variable-length positive
labels only, such as MLC2Seq (Nam et al., 2017)
and SGM (Yang et al., 2018). MLC2Seq and SGM
enhance the Seq2Seq model for multi-label classifica-
tion (MLC) tasks by changing label permutations
according to the frequency of labels. However,
a pre-defined label order cannot solve the fundamental problem
of Seq2Seq based models: the labels in
XMTC tasks are essentially an unordered set rather
than an ordered sequence. Yang et al. (2019) solves
arXiv:2210.14523v2 [cs.CL] 14 Nov 2022