
TABMIXER: EXCAVATING LABEL DISTRIBUTION LEARNING WITH SMALL-SCALE
FEATURES
Weiyi Cong, Zhuoran Zheng and Xiuyi Jia∗
CSE, Nanjing University of Science and Technology
∗Corresponding author.
ABSTRACT
Label distribution learning (LDL) differs from multi-label
learning in that it represents the polysemy of instances by
transforming binary label values into descriptive degrees.
Unfortunately, the feature space of a label distribution dataset
is affected by human factors and by the inductive bias of the
feature extractor, which introduces uncertainty into the feature
space. In particular, on datasets with small-scale feature spaces
(feature space dimension ≈ label space dimension), existing
LDL algorithms do not perform well. To address this issue,
we seek to model the feature space with uncertainty-aware
augmentation to alleviate this problem in LDL tasks. Specifically,
we start by augmenting each feature value in a sample's feature
vector into a vector sampled from a Gaussian distribution, where
the variance parameter of the Gaussian is learned by a
sub-network and the mean parameter is given by the feature value
itself. Each feature vector is thereby augmented into a matrix,
which is fed into a mixer with local attention (TabMixer) to
extract latent features. Finally, the latent features are
squeezed by a squeeze network to yield an accurate label
distribution. Extensive experiments verify that our proposed
algorithm is competitive with other LDL algorithms on several
benchmarks.
Index Terms—Label distribution learning, uncertainty
augmentation, Gaussian distribution function, TabMixer.
1. INTRODUCTION
Label distribution learning (LDL) [1] is an important machine
learning paradigm that leverages a function to map a single
instance to a set of labels, where labels are represented as
descriptive degrees that sum to 1. Unlike the multi-label
learning paradigm, LDL conveys richer semantic content when
characterizing an instance's emotions [2,3] and estimating a
learning task's uncertainty [4–6].
Although several classical LDL algorithms [1,7–16] have been
proposed to model the mapping from the feature space to the
label space, these algorithms usually favor an accurate and
ample feature space. Briefly, these algorithms conduct a process
of condensing the representation space rather than augmenting
it. Here, we define a feature space whose dimension ≈ the label
space dimension as a small-scale feature space. One piece of
evidence is that, across many studies, almost all of the
proposed LDL algorithms report weak performance on benchmark
datasets with a large number of labels. From this, we draw two
questions: 1) For the label space, is it difficult for a
comparatively small amount of feature information to provide an
algorithm with effective features for regressing an accurate
label distribution? 2) For the feature space, do human factors
and the uncertainty of the feature extractor cause the low
quality of the feature space? Unfortunately, we cannot inspect
the existing LDL datasets because the details of their feature
processing are opaque. Furthermore, boosting the feature
dimension and inferring the uncertainty of the feature space by
tapping into expert knowledge is costly. To solve these two
problems, we propose a feature augmentation technique with
uncertainty awareness, built on TabMixer (Tabular MLP-Mixer), to
learn LDL datasets with small-scale features. Note that our
network treats all tabular data uniformly and does not
distinguish between logical and continuous values.
Overall, our approach consists of the following stages. First,
to augment the feature space, an MLP-based sub-network (Learner)
is created to learn the variance of a Gaussian function. The
Learner takes the raw feature vector as input and assigns a
unique variance value to each feature value in the raw feature
vector. Combining the above, we can design a Gaussian function
for each element of the raw feature space by taking the feature
value as the mean, and then use Gaussian sampling to obtain a
vector that replaces that element (the random seed is fixed
during the model training phase). At this point, our input has
evolved from 1D to 2D and can be loosely regarded as a grayscale
map.
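As a minimal sketch of this stage, assuming a PyTorch
implementation: the hidden width, the Softplus used to keep
variances positive, and the augmentation length k below are
illustrative assumptions, not the authors' exact settings.

import torch
import torch.nn as nn

class Learner(nn.Module):
    """MLP sub-network that predicts one variance per raw feature value."""
    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, in_dim),
            nn.Softplus(),  # assumption: keep predicted variances positive
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)  # (batch, d) variances, one per feature value

def augment(x: torch.Tensor, var: torch.Tensor, k: int = 16) -> torch.Tensor:
    """Replace each scalar feature by k Gaussian samples with
    mean = the feature value and std = sqrt(learned variance).
    Input x, var: (batch, d) -> output: (batch, d, k), a 2D "grayscale map".
    """
    eps = torch.randn(*x.shape, k, device=x.device)  # seed fixed during training
    return x.unsqueeze(-1) + var.sqrt().unsqueeze(-1) * eps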
Subsequently, the augmented feature information is fed into
TabMixer, where each linear-layer shortcut in TabMixer is a
convolution operator that captures the local characteristics of
the features.
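Under our reading of this description, one TabMixer block might
look as follows: an MLP-Mixer-style block whose identity
shortcuts are replaced by depthwise 1D convolutions acting as
local attention. The kernel size and expansion factor are
assumptions.

class MixerLayer(nn.Module):
    """Mixing MLP over the last axis, with a convolutional shortcut."""
    def __init__(self, dim: int, expansion: int = 4, kernel: int = 3):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * expansion),
            nn.GELU(),
            nn.Linear(dim * expansion, dim),
        )
        # local-attention-like shortcut: depthwise conv along the other axis
        self.shortcut = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, dim); the conv expects (batch, dim, length)
        return self.mlp(self.norm(x)) + self.shortcut(x.transpose(1, 2)).transpose(1, 2)

class TabMixerBlock(nn.Module):
    """Mix across the d raw features, then across the k Gaussian samples."""
    def __init__(self, d: int, k: int):
        super().__init__()
        self.feature_mix = MixerLayer(d)
        self.sample_mix = MixerLayer(k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d, k) from the augmentation stage
        x = self.feature_mix(x.transpose(1, 2)).transpose(1, 2)
        return self.sample_mix(x)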
Finally, the output feature map is squeezed by a squeeze network
to obtain an accurate label distribution, where the output layer
of the network uses a softmax. The network is trained using only
an L1 loss and a KL-divergence loss.
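A minimal sketch of the squeeze network and training loss
follows; the mean pooling and the equal weighting of the two
loss terms are assumptions, since the paper only states that L1
and KL divergence are used.

import torch.nn.functional as F

class SqueezeHead(nn.Module):
    """Squeeze the (batch, d, k) feature map down to a label distribution."""
    def __init__(self, d: int, num_labels: int):
        super().__init__()
        self.fc = nn.Linear(d, num_labels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.mean(dim=-1)  # assumption: pool away the augmentation axis
        return F.softmax(self.fc(x), dim=-1)  # valid label distribution

def ldl_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L1 + KL divergence between predicted and ground-truth distributions."""
    l1 = F.l1_loss(pred, target)
    kl = F.kl_div(pred.clamp_min(1e-8).log(), target, reduction="batchmean")
    return l1 + kl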
We use two standard benchmarks and a synthetic benchmark to
evaluate our approach alongside other comparative algorithms,
and the experimental results verify that the proposed algorithm
is competitive.
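Putting the sketches above together, a hypothetical end-to-end
forward pass might look as follows; all sizes (batch, feature
dimension d, augmentation length k, number of blocks) are
illustrative.

# x: a batch of 32 samples with d = 10 raw features;
# y: ground-truth label distributions over 8 labels.
x, y = torch.randn(32, 10), torch.softmax(torch.randn(32, 8), dim=-1)

learner, head = Learner(in_dim=10), SqueezeHead(d=10, num_labels=8)
mixer = nn.Sequential(*[TabMixerBlock(d=10, k=16) for _ in range(4)])

feat = augment(x, learner(x), k=16)  # (32, 10, 16) uncertainty-augmented map
pred = head(mixer(feat))             # (32, 8) predicted label distribution
loss = ldl_loss(pred, y)
loss.backward()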