be extremely labor-intensive, or even impossible in cases where some objects are so small that they
can hardly be perceived [25].
As for label encoding methods other than the one-hot encoding, all of them implicitly assume that the annotations are reasonably good [12, 9, 26]. Examples of such label encodings include label-smoothing regularization (LSR) and generalized entropy regularization (GER) [12, 26]. In LSR, Szegedy et al. replaced the one-hot encoding of the labels in the cross entropy loss minimization with a mixture of the original one-hot distribution and a uniform distribution; this mixture encoding is obtained by taking a small probability mass from the annotated class and spreading it evenly over all the other classes [12]. In GER, Meister et al. proposed to use a skew-Jensen divergence to encourage the learned conditional distribution to approximate the mixture encoding of the label, and they added this extra divergence term as a regularization to the original cross entropy loss minimization [26]. However, these efforts still have severe limitations. First, the construction of the mixture label encoding can be quite biased, for reasons similar to those accounting for the conditional entropy loss in the dataset construction process. Second, their focus is still on improving the conditional entropy, which can be very challenging: the gap between the conditional entropy of the label and the differential entropy of the input is so large that, once the input is revealed, the conditional entropy of the label distribution can be very small and hard to learn.
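To make the mixture encoding used in LSR concrete, the following sketch (our own illustration; the smoothing parameter eps and the variable names are not taken from [12]) builds the smoothed target distribution and evaluates the corresponding cross entropy against the model's predicted probabilities.

import numpy as np

def smooth_labels(one_hot, eps=0.1):
    # LSR-style mixture encoding: mix the one-hot target with a uniform
    # distribution, i.e., move a small probability mass away from the
    # annotated class and spread it evenly over the classes.
    K = one_hot.shape[-1]
    return (1.0 - eps) * one_hot + eps / K

def cross_entropy(target, pred_probs):
    # Cross entropy between the (smoothed) target and the model prediction.
    return -np.sum(target * np.log(pred_probs + 1e-12), axis=-1).mean()

# Toy example: K = 4 classes, one example annotated with class 2.
one_hot = np.eye(4)[[2]]
smoothed = smooth_labels(one_hot, eps=0.1)   # [[0.025, 0.025, 0.925, 0.025]]
pred = np.array([[0.1, 0.1, 0.7, 0.1]])
loss = cross_entropy(smoothed, pred)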
1.2 Mutual Information Learned Classifiers
In this paper, we propose a new learning framework, mutual information learning (MIL), in which we train classifiers by learning the mutual information of the dataset, i.e., the dependency between the input and the label. This framework is motivated by several observations.
First of all, we show that the existing cross entropy loss minimization for training DNN classifiers actually learns the conditional entropy of the label given the input. From an information-theoretic viewpoint, the mutual information between the input and the label quantifies the information shared by them, while the conditional entropy quantifies the information remaining in the label after the input is revealed. Compared with the conditional entropy, the magnitude of the mutual information can be much larger, so it is not as easily ignored by the model during training, which can alleviate the overfitting phenomenon. An illustration of the relations among the different information quantities involved in a dataset is shown in Figure 3.
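As a concrete illustration of this gap (a toy computation of our own, not an experiment from this paper), the snippet below evaluates H(Y), H(Y|X), and I(X;Y) = H(Y) - H(Y|X) for a small joint distribution in which the input is highly predictive of the label; the mutual information is several times larger than the residual conditional entropy.

import numpy as np

# Toy joint distribution p(x, y) over 2 inputs (rows) and 2 labels (columns)
# in which x is highly predictive of y; the values are for illustration only.
p_xy = np.array([[0.49, 0.01],
                 [0.01, 0.49]])

p_x = p_xy.sum(axis=1)   # marginal p(x)
p_y = p_xy.sum(axis=0)   # marginal p(y)

H_y = -np.sum(p_y * np.log(p_y))                           # H(Y)   ~ 0.693
H_y_given_x = -np.sum(p_xy * np.log(p_xy / p_x[:, None]))  # H(Y|X) ~ 0.098
I_xy = H_y - H_y_given_x                                   # I(X;Y) ~ 0.595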
In addition, several works in 2020 applied information-theoretic tools to investigate DNN models [27, 28, 29]. In [28], Yi et al. investigated adversarial attack problems from an information-theoretic viewpoint and proposed to achieve adversarial attacks by minimizing the mutual information between the input and the label. Yi et al. also established theoretical results characterizing the best an adversary can do when attacking machine learning models [28]. Concurrently, in [27], Wang et al. used information-theoretic tools to derive interesting relations between existing adversarial training formulations and conditional entropy optimizations. These works imply that there are intrinsic connections between the properties of the model learned from the dataset and the information contained in the dataset. Last but not least, some recent works on training DNNs report that increasing the entropy of the labels can improve generalization performance and even make the model more adversarially robust [12, 26, 30]. Though these results imply that overfitting can be due to the model severely ignoring the conditional entropy of the label during training, we argue that a more appropriate quantity for guiding the learning and training of classification systems is the mutual information (MI), which better characterizes the dependency between the input and the label. Besides, the MI usually has a larger magnitude than the conditional label entropy in classification tasks, making it less likely to be ignored by the model or affected by numerical precision. To see this, consider an extreme case where the model outputs a uniform distribution over the labels for every input example. In this case, the conditional entropy of the label achieves its maximum, yet what the model has learned can be meaningless since it fails to characterize the dependency between the input and the label.
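This degenerate case can be checked with the same kind of toy computation (again our own illustration): a model that predicts the uniform label distribution for every input maximizes the conditional label entropy while capturing zero mutual information between the input and the label.

import numpy as np

K, n = 4, 3                               # number of classes and of inputs
p_x = np.full(n, 1.0 / n)                 # equally likely inputs
p_y_given_x = np.full((n, K), 1.0 / K)    # uniform prediction for every input

p_y = p_x @ p_y_given_x                   # induced label marginal (uniform)
H_y_given_x = -np.sum(p_x[:, None] * p_y_given_x * np.log(p_y_given_x))
H_y = -np.sum(p_y * np.log(p_y))
I_xy = H_y - H_y_given_x                  # H(Y|X) = log K is maximal, I(X;Y) = 0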
Under our mutual information learning (MIL) framework, we design a new loss for training DNN classifiers; the loss originates from a representation of the mutual information of the dataset generating distribution, and we propose new pipelines associated with the proposed framework. We will refer to this loss as the mutual information learning loss (milLoss), and to the traditional cross entropy loss as the conditional entropy learning loss (celLoss), since the latter essentially learns the conditional entropy of the dataset generating distribution. When reformulated as a regularized form of the celLoss, the milLoss can be interpreted as a weighted sum of the conditional label entropy loss