Mutual Information Learned Classifiers: an
Information-theoretic Viewpoint of Training Deep Learning
Classification Systems
Jirong Yi, Qiaosheng Zhang, Zhen Chen, Qiao Liu, Wei Shao
October 4, 2022
Abstract
Deep learning systems have been reported to achieve state-of-the-art performance in many
applications, and one of the keys to achieving this is the existence of well-trained classifiers on
benchmark datasets, which can be used as backbone feature extractors in downstream tasks. As
a mainstream loss function for training deep neural network (DNN) classifiers, the cross entropy
loss can easily lead us to models that exhibit severe overfitting when no other techniques, such
as data augmentation, are used to alleviate it. In this paper, we prove that the existing cross
entropy loss minimization for training DNN classifiers essentially learns the conditional entropy
of the underlying data distribution of the dataset, i.e., the information or uncertainty remaining
in the labels after the input is revealed. We then propose a mutual information learning framework
where we train DNN classifiers via learning the mutual information between the label and the
input. Theoretically, we give a lower bound on the population error probability in terms of the
mutual information. In addition, we derive mutual information lower and upper bounds for a
concrete binary classification data model in $\mathbb{R}^n$, as well as an error probability lower
bound in this scenario. Furthermore, we establish the sample complexity for accurately learning
the mutual information from empirical data samples drawn from the underlying data distribution.
Empirically, we conduct extensive experiments on several benchmark datasets to support our theory.
Without bells and whistles, the proposed mutual information learned classifiers (MILCs) achieve
far better generalization performance than state-of-the-art classifiers, with an improvement that
can exceed 10% in test accuracy.
Keywords: classification; mutual information learning; error probability; sample complexity;
overfitting; deep neural network
1 Introduction
Ever since the breakthrough made by Krizhevsky et al. [1], deep learning has found tremendous
applications in different areas such as computer vision, natural language processing, and traditional
signal processing [2, 3, 4], and it has achieved state-of-the-art performance in almost all of them. In
nearly all of these applications, a fundamental classification task is usually involved, i.e., determining
the class to which a given input belongs. Examples from computer vision include image-level
classification in recognition tasks, patch-level classification in object detection, and pixel-level
classification in image segmentation tasks [5, 6, 7]. Many other applications which do not directly
or explicitly involve classification still use models pretrained via classification tasks as a backbone
for extracting useful and meaningful representations for the specific task at hand [8, 9]. In practice, such
extracted representations have been reported to be beneficial for downstream tasks [8, 7, 9].
Jirong Yi is with the CAD Science Group, Hologic Inc., Santa Clara, CA 95054, and the Department of Electrical and
Computer Engineering, University of Iowa, Iowa City, IA 52242. Qiaosheng Zhang is with the Department of Electrical
and Computer Engineering at the National University of Singapore, Singapore 119077. Zhen Chen is with the Department of
Electrical Engineering and Computer Science at the University of California, Irvine, Irvine, CA 92697. Qiao Liu is with the
Department of Statistics at Stanford University, Stanford, CA 94305. Wei Shao is with the Department of Radiology at
Stanford University, Stanford, CA 94305. Emails: jirong.yi@hologic.com, ericzhang8951@gmail.com, zhenc4@uci.edu,
liuqiao@stanford.edu, weishao@stanford.edu. Correspondence should be sent to: jirong.yi@hologic.com,
jirong-yi@uiowa.edu.
To train such classifiers, the deep learning community has mainly been using the cross entropy
loss or its variants as the objective function for guiding the search for a good set of model weights [2].
However, models trained this way can easily overfit the data and yield poor generalization
performance, and this has motivated many techniques for improving generalization. These
techniques can be broadly divided into several categories. From the perspective of data, increasing the
dataset size has proven beneficial for generalization, but collecting huge amounts
of data can be labor-intensive and costly. For example, in medical image analysis and diagnostics,
collecting a dataset can cost millions of dollars, and the data size is usually very small for rare
diseases such as cancer [10]. Data augmentation, such as random cropping, flipping, and color
jittering, is another commonly used technique to increase the diversity of a dataset [9, 11, 12]. However, in
situations where the data itself is scarce, such as in few-shot learning, data augmentation may not be
enough for training a well-performing classifier [13].
From the angle of models, traditional machine learning theory shows that decreasing the flexibility
or complexity of models can help alleviate overfitting [14, 15, 16, 17]. However,
in the deep learning setting, this does not seem to be a feasible solution because the models
which achieve state-of-the-art (SOTA) performance are becoming increasingly complex, with
parameter counts even exceeding one trillion [18, 19, 8]. These huge models are motivated by increasingly
challenging learning problems which require the strong capability of huge models to extract
useful information and find meaningful patterns for solving them, a capability that
smaller models lack [19].
Another line of work for improving generalization performance comes from the regularization
viewpoint, i.e., restricting the model space searched during training to avoid overfitting [2].
Examples include weight decay (or $\ell_2$ regularization), $\ell_1$ regularization, and label smoothing
regularization [12]. The major limitation of these approaches is that they require prior knowledge
about the learning task. For example, $\ell_1$ regularization usually requires the ground truth model
to have sparse weights, while $\ell_2$ regularization requires the ground truth model to have small
weight magnitudes to achieve good generalization performance. Unfortunately, such prior knowledge is not
always available in practice. What makes things worse is that it has recently been reported that such prior
knowledge may make the learned models adversarially vulnerable, so that adversarial attacks can be
easily achieved, because the model can be underfitted to unseen adversarial examples
[20, 21, 22].
1.1 Ignored Conditional Label Entropy
In this paper, we show that the existing cross entropy loss minimization for training deep neural
network classifiers essentially learns the conditional entropy of the underlying joint distribution of the
input and the label. We argue that this can be the fundamental reason for the severe
overfitting of models trained by cross entropy loss minimization, and the extremely small training loss
observed in practice implies that the learned model completely ignores the conditional entropy of the label
distribution. The reasons for this ignorance of the conditional entropy include that the marginal
input entropy is very large compared with the conditional label entropy, and that both the annotation
of input samples during data collection and the encoding of labels during training
ignore the conditional entropy of the label.
To see this, consider the MNIST image recognition task from computer vision, where we
want to train a model to predict to which of 10 classes a given digit image belongs.¹ In
this example, the label has only 10 choices, and the maximum label entropy is $\log_2(10) \approx 3.32$ bits.
However, since the image input is in $\mathbb{R}^{784}$, the maximum entropy of the image distribution can be very
large, i.e., $\log_2 256^{784} = 6272$ bits if we assume each pixel takes values in $\{0, 1, \ldots, 255\}$ uniformly.
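This back-of-the-envelope calculation can be reproduced in a few lines of Python:

```python
import math

# Maximum entropy of the 10-class MNIST label distribution (uniform over classes).
label_bits = math.log2(10)           # ~3.32 bits

# Maximum entropy of a 28x28 image whose 784 pixels are each uniform over {0, ..., 255}.
input_bits = 784 * math.log2(256)    # = 6272 bits

print(f"label entropy <= {label_bits:.2f} bits, input entropy <= {input_bits:.0f} bits")
```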
The gap between the information contained in the input image distribution and that contained in
the label distribution is so large that, when the model is trained to learn the conditional entropy of
the label distribution after revealing the image input, it has a strong tendency to simply ignore this
remaining information and directly treat it as zero. This is indeed the case in many machine learning
applications. In the MNIST classification task, when an image of a handwritten digit is given,
we are usually 100% sure which class the image belongs to. In Figure 1a, we show one such
image; there is no doubt that the digit is 1, so the label entropy is 0 given this image.
However, this is not always the case, because we can have image samples whose classes cannot be
determined with complete certainty. Some such examples are also presented in Figure 1: the digit
in Figure 1b has a truth label of 1 but looks like a 2.
Similarly, the digit in Figure 1c has a truth label of 4 but looks
like a 9, and that in Figure 1d has a truth label of 9 but looks like a 4. Different people may assign
different labels to these image samples, but their ground truth annotations are at the discretion of
the dataset creators. In more complex image classification tasks such as ImageNet classification, a single image
can itself contain multiple objects, and thus belong to multiple classes. However, it has only a single
annotation or label, which depends on the discretion of the human annotators [23]. In Figure 2, we
show image examples from the ImageNet-1K classification task, where the goal is to classify a given
image into one of 1000 classes. Though Figure 2a also contains a pencil, the human annotator only labeled
it with the cauliflower label. In deep learning practice, since we usually use the one-hot encoding of the
label for a given image, i.e., assigning all the probability mass to the annotated class and zero to all
the other classes, the model is further encouraged to ignore the conditional information of the label
[9, 12, 6]. This ignorance of the label entropy allows classifiers to give over-confident label predictions,
resulting in unsatisfactory generalization performance.

¹ http://yann.lecun.com/exdb/mnist/
Figure 1: Image examples from the MNIST dataset. (a) Truth label is 1. (b) Truth label is 1. (c) Truth label is 4. (d) Truth label is 9.
Figure 2: Image examples from the ImageNet dataset. (a) Truth label is cauliflower. (b) Truth label
is cauliflower. Though Figure 2a also contains a pencil, the human annotator only labeled it with the
cauliflower label. In deep learning practice, since we usually use the one-hot encoding of the label
for a given image, i.e., assigning all the probability mass to the annotated class and zero to all the
other classes, the model is further encouraged to ignore the conditional entropy of the label [9, 12, 6].
Based on the above observations, a naive way to improve generalization performance is to
recover the conditional entropy of the label, e.g., by giving multiple annotations to a single image if it
contains multiple objects during data generation, and by using other types of label encoding
instead of one-hot encoding during training. The deep learning community also seems to have
realized the limitations of datasets with a single annotation for images containing multiple objects, and
thus created the ImageNet ReaL benchmark in 2020, more than 10 years after the construction
of the original ImageNet dataset [23, 24]. However, these reassessed labels only partially fix the
conditional information loss of labels, and can still suffer from conditional information loss when an
object itself has uncertainty, as we show in Figure 1. Besides, annotating each object in an image can
be extremely labor-intensive, or even impossible in cases where some objects are so small that they
can hardly be perceived [25].
As for using different label encoding methods instead of one-hot encoding, all of them implicitly
assume that the annotations are reasonably good [12, 9, 26]. Examples of other label encodings
include label smoothing regularization (LSR), generalized entropy regularization (GER), and so
on [12, 26]. In LSR, Szegedy et al. replaced the one-hot encoding of labels in the cross entropy
loss minimization with a mixture of the original one-hot distribution and a uniform distribution;
this mixture encoding is obtained by taking a small probability mass from the annotated class
and spreading it evenly over all the other classes [12]. In GER, Meister et al. proposed to
use a skew-Jensen divergence to encourage the learned conditional distribution to approximate the
mixture encoding of the label, and they added this extra divergence term as a regularization to the original
cross entropy loss minimization [26]. However, these efforts still have severe limitations. First
of all, the construction of the mixture encoding of the label can be quite biased, for reasons similar
to those accounting for the conditional entropy loss in the dataset construction process. Secondly, their focus is
still on improving the conditional entropy, and this can be very challenging since the gap between the
conditional entropy of the label and the differential entropy of the input is so large that, after the
input is revealed, the conditional entropy of the label distribution can be very small and hard to learn.
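As a minimal sketch of the mixture encoding described above (the function and the smoothing weight `eps = 0.1` are illustrative choices following the description in the text, not values prescribed in [12]):

```python
import numpy as np

def smooth_labels(y, num_classes, eps=0.1):
    """Replace one-hot labels with a mixture encoding: move probability mass eps
    off the annotated class and spread it evenly over the other classes."""
    onehot = np.eye(num_classes)[y]  # shape (batch, num_classes)
    return onehot * (1.0 - eps) + (1.0 - onehot) * eps / (num_classes - 1)

# Example: label 3 with 10 classes -> 0.9 on class 3, ~0.0111 on each other class.
print(smooth_labels(np.array([3]), num_classes=10))
```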
1.2 Mutual Information Learned Classifiers
In this paper, we propose a new learning framework, i.e., mutual information learning (MIL), where
we train classifiers via learning the mutual information of the dataset, i.e., the dependency between
the input and the label. This is motivated by several observations.
First of all, we show that the existing cross entropy loss minimization for training DNN classifiers
actually learns the conditional entropy of the label when the input is given. From an information
theoretic viewpoint, the mutual information between the input and the label quantifies the information
shared by them, while the conditional entropy quantifies the information remaining in the label after
the input is revealed. Compared with the conditional entropy, the magnitude of the mutual information
can be much larger, so it may not be so easily ignored by the model during training, which makes it
possible to alleviate the overfitting phenomenon. An illustration of the relations among the different
information quantities associated with a dataset is shown in Figure 3.
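For reference, these quantities are tied together by the standard identity
\[
I(X;Y) \;=\; H(Y) - H(Y \mid X),
\]
so a model trained to match the mutual information must account for the full label entropy $H(Y)$ rather than only the (often near-zero) conditional entropy $H(Y \mid X)$, which makes the target much harder to ignore.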
In addition, in 2020, several works applied information theoretic tools to investigate
DNN models [27, 28, 29]. In [28], Yi et al. investigated adversarial attack problems from an
information theoretic viewpoint, and they proposed to achieve adversarial attacks by minimizing the
mutual information between the input and the label. Yi et al. also established theoretical results
characterizing the best an adversary can do in attacking machine learning models [28].
Concurrently, in [27], Wang et al. used information theoretic tools to derive interesting relations
between existing adversarial training formulations and conditional entropy optimizations.
These works imply that there are intrinsic connections between the properties of the model learned
from a dataset and the information contained in that dataset. Last but not least, some recent works
on training DNNs reported that increasing the entropy of the labels can improve generalization
performance, and even make models more adversarially robust [12, 26, 30]. Though these results
imply that overfitting can be due to severe ignorance of the conditional entropy of the label
during training, we argue that a more appropriate quantity for guiding the learning and training of
classification systems is the mutual information (MI), which better characterizes the dependency
between the input and the label. Besides, the MI usually has a larger magnitude than the conditional
label entropy in classification tasks, making it less likely to be ignored by the model or affected by
numerical precision. To see this, consider an extreme case where the model outputs a uniform
distribution over the labels for an arbitrary input example. In this case, the conditional
entropy of the label achieves its maximum, but what the model has learned can be meaningless, since it
cannot accurately characterize the dependency between the input and the label.
Under our mutual information learning (MIL) framework, we design a new loss for training
DNN classifiers; the loss itself originates from a representation of the mutual information of the
dataset generating distribution, and we propose new pipelines associated with the proposed framework.
We will refer to this loss as the mutual information learning loss (milLoss), and to the traditional
cross entropy loss as the conditional entropy learning loss (celLoss), since the latter essentially learns
the conditional entropy of the dataset generating distribution. When reformulated as a regularized form
of the celLoss, the milLoss can be interpreted as a weighted sum of the conditional label entropy loss
and the label entropy loss.
Figure 3: Information quantities associated with the joint data distribution $p_{X,Y}$. The big circle represents
the differential entropy $h(X)$, i.e., the information contained in the input random variable $X$, and the small
circle represents the entropy $H(Y)$, i.e., the information contained in the label random variable $Y$. The
overlapping area corresponds to the mutual information, i.e., the information shared by $X$ and $Y$,
and the area of the small circle excluding the overlap corresponds to the information
remaining in $Y$ after revealing $X$.
In the regularized form, the milLoss encourages the model not only to
accurately learn the conditional entropy of the label when an input is given, but also to precisely
learn the entropy of the label itself. This is distinctly different from label smoothing regularization
(LSR), the confidence penalty (CP), label correction (LC), etc., which only consider the conditional
entropy of the label [11, 31, 26].
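To make the regularized form concrete, here is a minimal PyTorch sketch of one plausible instantiation; the weight `lam` and the batch-level entropy estimates are our illustrative assumptions, not the exact formulation derived later in the paper:

```python
import torch
import torch.nn.functional as F

def mil_loss(logits, targets, lam=1.0):
    """Illustrative milLoss: a conditional label entropy term (standard cross
    entropy, i.e., the celLoss) plus a label entropy term that matches the
    entropy of the batch-averaged prediction to the empirical label entropy."""
    # Conditional entropy term, as in the usual celLoss.
    cond_term = F.cross_entropy(logits, targets)

    # Entropy of the batch-averaged predicted label distribution.
    p_bar = F.softmax(logits, dim=1).mean(dim=0)
    h_pred = -(p_bar * (p_bar + 1e-12).log()).sum()

    # Empirical entropy of the labels in the batch.
    q = torch.bincount(targets, minlength=logits.size(1)).float()
    q = q / q.sum()
    h_label = -(q[q > 0] * q[q > 0].log()).sum()

    # Label entropy term: penalize mismatch between the two entropies.
    return cond_term + lam * (h_pred - h_label).abs()
```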
For the proposed MIL framework, we establish an error probability lower bound for arbitrary
classification models in terms of the mutual information (MI) associated with the data generating
distribution, by using Fano's inequality [32] and an upper bound on the entropy of the error probability
developed by Yi et al. [28]. These bounds explicitly characterize how the performance of classification
models trained on a dataset is connected to the mutual information contained in it, i.e., the dependency
between the input and the output. Compared to Fano's inequality, our bound is tighter due to a
carefully designed relaxation. Our error probability bound is applicable to arbitrary distributions
and arbitrary learning algorithms. We also consider a concrete binary classification problem in $\mathbb{R}^n$,
and derive both lower and upper bounds on the mutual information associated with the data
distribution. Besides, we derive an error probability bound for this binary classification data model.
We also establish theoretical guarantees for training models to accurately learn the mutual information
associated with an arbitrary dataset generating distribution, and we give the sample complexity for
achieving this in practice. The keys to establishing these results are the universal approximation properties
of neural networks and the concentration of measure phenomenon from statistics [33, 34, 35, 14, 17].
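For context, the classical form of Fano's inequality that these results build on relates the error probability $P_e = \Pr[\hat{Y} \neq Y]$ of any estimator $\hat{Y}$ of the label $Y$ from the input $X$ to the information quantities above:
\[
H_b(P_e) + P_e \log\big(|\mathcal{Y}| - 1\big) \;\geq\; H(Y \mid X) \;=\; H(Y) - I(X;Y),
\]
where $H_b(\cdot)$ is the binary entropy function and $\mathcal{Y}$ is the label alphabet. Rearranging gives a lower bound on $P_e$ that shrinks as the mutual information $I(X;Y)$ grows, which is the shape of the bounds stated above.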
We conduct extensive experiments to validate our theoretical analysis, using classification tasks on
benchmark datasets such as MNIST and CIFAR-10². The empirical results show that the proposed
MIL framework achieves far better generalization performance than the existing conditional
entropy learning approach and its variants [11, 36, 31].
1.3 Related Works
Our work is closely related to the following works, but there are distinct differences between
our work and theirs [6, 28, 29]. First of all, in 2019, Yi et al. formulated the classification
problem under an encoding-decoding paradigm by assuming there is an observation synthesis process
which can generate observations or inputs for a given label, so that the classification task is simply
about inferring the label from the observation. Under this framework, they gave theoretical
characterizations of the robustness of machine learning models to different types of perturbations [6]. They
also characterized the limit of an arbitrary adversarial attacking algorithm against an arbitrary machine
learning system, answering the questions of what the best attack an adversary can achieve is
and what the optimal adversarial attacks look like [28]. We continue to investigate the classification
² https://www.cs.toronto.edu/~kriz/cifar.html