Mutual Information Learned Classifiers: an
Information-theoretic Viewpoint of Training Deep Learning
Classification Systems
Jirong Yi, Qiaosheng Zhang, Zhen Chen, Qiao Liu, Wei Shao
October 4, 2022
Abstract
Deep learning systems have been reported to achieve state-of-the-art performance in many
applications, and one of the keys to achieving this is the existence of well-trained classifiers on
benchmark datasets, which can be used as backbone feature extractors in downstream tasks. As
a mainstream loss function for training deep neural network (DNN) classifiers, the cross entropy
loss can easily lead us to models that exhibit severe overfitting when no other techniques, such
as data augmentation, are used to alleviate it. In this paper, we prove that the existing cross
entropy loss minimization for training DNN classifiers essentially learns the conditional entropy
of the underlying data distribution of the dataset, i.e., the information or uncertainty remaining
in the labels after the input is revealed. We then propose a mutual information learning framework
where we train DNN classifiers via learning the mutual information between the label and the
input. Theoretically, we give a lower bound on the population error probability in terms of the
mutual information. In addition, we derive mutual information lower and upper bounds for a
concrete binary classification data model in $\mathbb{R}^n$, as well as an error probability lower
bound in this scenario. Furthermore, we establish the sample complexity for accurately learning
the mutual information from empirical data samples drawn from the underlying data distribution.
Empirically, we conduct extensive experiments on several benchmark datasets to support our theory.
Without bells and whistles, the proposed mutual information learned classifiers (MILCs) achieve
far better generalization performance than state-of-the-art classifiers, with an improvement that
can exceed 10% in test accuracy.
Keywords: classification; mutual information learning; error probability; sample complexity;
overfitting; deep neural network
1 Introduction
Ever since the breakthrough made by Krizhevsky et al. [1], deep learning has found tremendous
applications in different areas such as computer vision, natural language processing, and traditional
signal processing [2, 3, 4], and it has achieved state-of-the-art performance in almost all of them. In
nearly all of these applications, a fundamental classification task is usually involved, i.e., determining
the class to which a given input belongs. Examples from computer vision include image-level
classification in recognition tasks, patch-level classification in object detection, and pixel-level
classification in image segmentation tasks [5, 6, 7]. Many other applications which do not directly
or explicitly involve classification still use models pretrained via classification tasks as a backbone
for extracting useful and meaningful representations for the specific task at hand [8, 9]. In practice, such
extracted representations have been reported to be beneficial for downstream tasks [8, 7, 9].
Jirong Yi is with the CAD Science Group, Hologic Inc., Santa Clara, CA 95054, and the Department of Electrical and
Computer Engineering, University of Iowa, Iowa City, IA 52242. Qiaosheng Zhang is with the Department of Electrical
and Computer Engineering at the National University of Singapore, Singapore 119077. Zhen Chen is with the Department of
Electrical Engineering and Computer Science at the University of California, Irvine, Irvine, CA 92697. Qiao Liu is with the
Department of Statistics at Stanford University, Stanford, CA 94305. Wei Shao is with the Department of Radiology at
Stanford University, Stanford, CA 94305. Emails: jirong.yi@hologic.com, ericzhang8951@gmail.com, zhenc4@uci.edu,
liuqiao@stanford.edu, weishao@stanford.edu. Correspondence should be sent to: jirong.yi@hologic.com,
jirong-yi@uiowa.edu.
To train such classifiers, the deep learning community has mainly been using the cross entropy
loss or its variants as the objective function for guiding the search for a good set of model weights [2].
However, models trained this way can easily overfit the data and yield poor generalization
performance, and this has motivated many techniques for improving generalization. These
techniques can be broadly divided into several categories. From the perspective of data, increasing the
dataset size has proven beneficial for generalization, but collecting huge amounts
of data can be labor-intensive and costly. For example, in medical image analysis and diagnostics,
collecting a dataset can cost millions of dollars, and the data size is usually very small for rare
diseases such as cancer [10]. Data augmentation, such as random cropping, flipping, and color
jittering, is another commonly used technique to increase the diversity of a dataset [9, 11, 12]. However, in
situations where the data itself is scarce, such as in few-shot learning, data augmentation may not be
enough for training a well-performing classifier [13].
From the angle of models, traditional machine learning theory shows that decreasing the flexibility
or complexity of models can help alleviate overfitting [14, 15, 16, 17]. However,
in the deep learning setting, this does not seem to be a feasible solution because the models
which achieve state-of-the-art (SOTA) performance are becoming increasingly complex, with
parameter counts even exceeding one trillion [18, 19, 8]. These huge models are motivated by increasingly
challenging learning problems which require the strong capability of huge models to extract
useful information and find meaningful patterns for solving them, a capability that
smaller models lack [19].
Another line of work for improving generalization performance comes from the regularization
viewpoint, i.e., restricting the model space searched during training to avoid overfitting [2].
Examples include weight decay (or $\ell_2$ regularization), $\ell_1$ regularization, and label smoothing
regularization [12]. The major limitation of these approaches is that they require prior knowledge
about the learning task. For example, $\ell_1$ regularization usually requires the ground truth model
to have sparse weights, while $\ell_2$ regularization requires the ground truth model to have small
weight magnitudes to achieve good generalization performance. Unfortunately, such prior knowledge is not
always available in practice. What makes things worse is that it has recently been reported that such prior
knowledge may make the learned models adversarially vulnerable, so that adversarial attacks can be
easily achieved, because the model can be underfitted to unseen adversarial examples
[20, 21, 22].
1.1 Ignored Conditional Label Entropy
In this paper, we show that the existing cross entropy loss minimization for training deep neural
network classifiers essentially learns the conditional entropy of the underlying joint distribution of the
input and the label. We argue that this can be the fundamental reason for the severe
overfitting of models trained by cross entropy loss minimization, and the extremely small training loss
observed in practice implies that the learned model completely ignores the conditional entropy of the label
distribution. The reasons for this ignorance of the conditional entropy include that the marginal
input entropy is very large compared with the conditional label entropy, and that both the annotation
of input samples during data collection and the encoding of labels during training
ignore the conditional entropy of the label.
To see this, consider the MNIST image recognition task from computer vision, where we
want to train a model to predict to which of 10 classes a given digit image belongs.¹ In
this example, the label has only 10 choices, and the maximum label entropy is $\log_2(10) \approx 3.32$ bits.
However, since the image input is in $\mathbb{R}^{784}$, the maximum entropy of the image distribution can be very
large, i.e., $\log_2 256^{784} = 6272$ bits if we assume each pixel takes values in $\{0, 1, \ldots, 255\}$ uniformly.
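This back-of-the-envelope calculation can be reproduced in a few lines of Python:

```python
import math

# Maximum entropy of the 10-class MNIST label distribution (uniform over classes).
label_bits = math.log2(10)           # ~3.32 bits

# Maximum entropy of a 28x28 image whose 784 pixels are each uniform over {0, ..., 255}.
input_bits = 784 * math.log2(256)    # = 6272 bits

print(f"label entropy <= {label_bits:.2f} bits, input entropy <= {input_bits:.0f} bits")
```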
The gap between the information contained in the input image distribution and that contained in
the label distribution is so large that, when the model is trained to learn the conditional entropy of
the label distribution after revealing the image input, it has a strong tendency to simply ignore this
remaining information and directly treat it as zero. This is indeed the case in many machine learning
applications. In the MNIST classification task, when an image of a handwritten digit is given,
we are usually 100% sure which class the image belongs to. In Figure 1a, we show one such
image; there is no doubt that the digit is 1, so the label entropy is 0 given this image.
However, this is not always the case, because we can have image samples whose classes cannot be
determined with complete certainty. Some such examples are also presented in Figure 1: the digit
in Figure 1b has a truth label of 1 but looks like a 2.
Similarly, the digit in Figure 1c has a truth label of 4 but looks
like a 9, and that in Figure 1d has a truth label of 9 but looks like a 4. Different people may assign
different labels to these image samples, but their ground truth annotations are at the discretion of
the dataset creators. In more complex image classification tasks such as ImageNet classification, a single image
can itself contain multiple objects, and thus belong to multiple classes. However, it has only a single
annotation or label, which depends on the discretion of the human annotators [23]. In Figure 2, we
show image examples from the ImageNet-1K classification task, where the goal is to classify a given
image into one of 1000 classes. Though Figure 2a also contains a pencil, the human annotator only labeled
it with the cauliflower label. In deep learning practice, since we usually use the one-hot encoding of the
label for a given image, i.e., assigning all the probability mass to the annotated class and zero to all
the other classes, the model is further encouraged to ignore the conditional information of the label
[9, 12, 6]. This ignorance of the label entropy allows classifiers to give over-confident label predictions,
resulting in unsatisfactory generalization performance.

¹ http://yann.lecun.com/exdb/mnist/
Figure 1: Image examples from the MNIST dataset. (a) Truth label is 1. (b) Truth label is 1. (c) Truth label is 4. (d) Truth label is 9.
Figure 2: Image examples from the ImageNet dataset. (a) Truth label is cauliflower. (b) Truth label
is cauliflower. Though Figure 2a also contains a pencil, the human annotator only labeled it with the
cauliflower label. In deep learning practice, since we usually use the one-hot encoding of the label
for a given image, i.e., assigning all the probability mass to the annotated class and zero to all the
other classes, the model is further encouraged to ignore the conditional entropy of the label [9, 12, 6].
Based on the above observations, a naive way to improve generalization performance is to
recover the conditional entropy of the label, e.g., by giving multiple annotations to a single image if it
contains multiple objects during data generation, and by using other types of label encoding
instead of one-hot encoding during training. The deep learning community also seems to have
realized the limitations of datasets with a single annotation for images containing multiple objects, and
thus created the ImageNet ReaL benchmark in 2020, more than 10 years after the construction
of the original ImageNet dataset [23, 24]. However, these reassessed labels only partially fix the
conditional information loss of labels, and can still suffer from conditional information loss when an
object itself has uncertainty, as we show in Figure 1. Besides, annotating each object in an image can
be extremely labor-intensive, or even impossible in cases where some objects are so small that they
can hardly be perceived [25].
As for using different label encoding methods instead of one-hot encoding, all of them implicitly
assume that the annotations are reasonably good [12, 9, 26]. Examples of other label encodings
include label smoothing regularization (LSR), generalized entropy regularization (GER), and so
on [12, 26]. In LSR, Szegedy et al. replaced the one-hot encoding of labels in the cross entropy
loss minimization with a mixture of the original one-hot distribution and a uniform distribution;
this mixture encoding is obtained by taking a small probability mass from the annotated class
and spreading it evenly over all the other classes [12]. In GER, Meister et al. proposed to
use a skew-Jensen divergence to encourage the learned conditional distribution to approximate the
mixture encoding of the label, and they added this extra divergence term as a regularization to the original
cross entropy loss minimization [26]. However, these efforts still have severe limitations. First
of all, the construction of the mixture encoding of the label can be quite biased, for reasons similar
to those accounting for the conditional entropy loss in the dataset construction process. Secondly, their focus is
still on improving the conditional entropy, and this can be very challenging since the gap between the
conditional entropy of the label and the differential entropy of the input is so large that, after the
input is revealed, the conditional entropy of the label distribution can be very small and hard to learn.
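As a minimal sketch of the mixture encoding described above (the function and the smoothing weight `eps = 0.1` are illustrative choices following the description in the text, not values prescribed in [12]):

```python
import numpy as np

def smooth_labels(y, num_classes, eps=0.1):
    """Replace one-hot labels with a mixture encoding: move probability mass eps
    off the annotated class and spread it evenly over the other classes."""
    onehot = np.eye(num_classes)[y]  # shape (batch, num_classes)
    return onehot * (1.0 - eps) + (1.0 - onehot) * eps / (num_classes - 1)

# Example: label 3 with 10 classes -> 0.9 on class 3, ~0.0111 on each other class.
print(smooth_labels(np.array([3]), num_classes=10))
```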
1.2 Mutual Information Learned Classifiers
In this paper, we propose a new learning framework, i.e., mutual information learning (MIL), where
we train classifiers via learning the mutual information of the dataset, i.e., the dependency between
the input and the label. This is motivated by several observations.
First of all, we show that the existing cross entropy loss minimization for training DNN classifiers
actually learns the conditional entropy of the label when the input is given. From an information
theoretic viewpoint, the mutual information between the input and the label quantifies the information
shared by them, while the conditional entropy quantifies the information remaining in the label after
the input is revealed. Compared with the conditional entropy, the magnitude of the mutual information
can be much larger, so it may not be so easily ignored by the model during training, which makes it
possible to alleviate the overfitting phenomenon. An illustration of the relations among the different
information quantities associated with a dataset is shown in Figure 3.
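For reference, these quantities are tied together by the standard identity
\[
I(X;Y) \;=\; H(Y) - H(Y \mid X),
\]
so a model trained to match the mutual information must account for the full label entropy $H(Y)$ rather than only the (often near-zero) conditional entropy $H(Y \mid X)$, which makes the target much harder to ignore.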
In addition, in 2020, several works applied information theoretic tools to investigate
DNN models [27, 28, 29]. In [28], Yi et al. investigated adversarial attack problems from an
information theoretic viewpoint, and they proposed to achieve adversarial attacks by minimizing the
mutual information between the input and the label. Yi et al. also established theoretical results
characterizing the best an adversary can do in attacking machine learning models [28].
Concurrently, in [27], Wang et al. used information theoretic tools to derive interesting relations
between existing adversarial training formulations and conditional entropy optimizations.
These works imply that there are intrinsic connections between the properties of the model learned
from a dataset and the information contained in that dataset. Last but not least, some recent works
on training DNNs reported that increasing the entropy of the labels can improve generalization
performance, and even make models more adversarially robust [12, 26, 30]. Though these results
imply that overfitting can be due to severe ignorance of the conditional entropy of the label
during training, we argue that a more appropriate quantity for guiding the learning and training of
classification systems is the mutual information (MI), which better characterizes the dependency
between the input and the label. Besides, the MI usually has a larger magnitude than the conditional
label entropy in classification tasks, making it less likely to be ignored by the model or affected by
numerical precision. To see this, consider an extreme case where the model outputs a uniform
distribution over the labels for an arbitrary input example. In this case, the conditional
entropy of the label achieves its maximum, but what the model has learned can be meaningless, since it
cannot accurately characterize the dependency between the input and the label.
Under our mutual information learning (MIL) framework, we design a new loss for training
DNN classifiers; the loss itself originates from a representation of the mutual information of the
dataset generating distribution, and we propose new pipelines associated with the proposed framework.
We will refer to this loss as the mutual information learning loss (milLoss), and to the traditional
cross entropy loss as the conditional entropy learning loss (celLoss), since the latter essentially learns
the conditional entropy of the dataset generating distribution. When reformulated as a regularized form
of the celLoss, the milLoss can be interpreted as a weighted sum of the conditional label entropy loss
and the label entropy loss.
Figure 3: Information quantities associated with the joint data distribution $p_{X,Y}$. The big circle represents
the differential entropy $h(X)$, i.e., the information contained in the input random variable $X$, and the small
circle represents the entropy $H(Y)$, i.e., the information contained in the label random variable $Y$. The
overlapping area corresponds to the mutual information, i.e., the information shared by $X$ and $Y$,
and the area of the small circle excluding the overlap corresponds to the information
remaining in $Y$ after revealing $X$.
In the regularized form, the milLoss encourages the model not only to
accurately learn the conditional entropy of the label when an input is given, but also to precisely
learn the entropy of the label itself. This is distinctly different from label smoothing regularization
(LSR), the confidence penalty (CP), label correction (LC), etc., which only consider the conditional
entropy of the label [11, 31, 26].
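To make the regularized form concrete, here is a minimal PyTorch sketch of one plausible instantiation; the weight `lam` and the batch-level entropy estimates are our illustrative assumptions, not the exact formulation derived later in the paper:

```python
import torch
import torch.nn.functional as F

def mil_loss(logits, targets, lam=1.0):
    """Illustrative milLoss: a conditional label entropy term (standard cross
    entropy, i.e., the celLoss) plus a label entropy term that matches the
    entropy of the batch-averaged prediction to the empirical label entropy."""
    # Conditional entropy term, as in the usual celLoss.
    cond_term = F.cross_entropy(logits, targets)

    # Entropy of the batch-averaged predicted label distribution.
    p_bar = F.softmax(logits, dim=1).mean(dim=0)
    h_pred = -(p_bar * (p_bar + 1e-12).log()).sum()

    # Empirical entropy of the labels in the batch.
    q = torch.bincount(targets, minlength=logits.size(1)).float()
    q = q / q.sum()
    h_label = -(q[q > 0] * q[q > 0].log()).sum()

    # Label entropy term: penalize mismatch between the two entropies.
    return cond_term + lam * (h_pred - h_label).abs()
```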
For the proposed MIL framework, we establish an error probability lower bound for arbitrary
classification models in terms of the mutual information (MI) associated with the data generating
distribution, by using Fano's inequality [32] and an upper bound on the entropy of the error probability
developed by Yi et al. [28]. These bounds explicitly characterize how the performance of classification
models trained on a dataset is connected to the mutual information contained in it, i.e., the dependency
between the input and the output. Compared to Fano's inequality, our bound is tighter due to a
carefully designed relaxation. Our error probability bound is applicable to arbitrary distributions
and arbitrary learning algorithms. We also consider a concrete binary classification problem in $\mathbb{R}^n$,
and derive both lower and upper bounds on the mutual information associated with the data
distribution. Besides, we derive an error probability bound for this binary classification data model.
We also establish theoretical guarantees for training models to accurately learn the mutual information
associated with an arbitrary dataset generating distribution, and we give the sample complexity for
achieving this in practice. The keys to establishing these results are the universal approximation properties
of neural networks and the concentration of measure phenomenon from statistics [33, 34, 35, 14, 17].
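For context, the classical form of Fano's inequality that these results build on relates the error probability $P_e = \Pr[\hat{Y} \neq Y]$ of any estimator $\hat{Y}$ of the label $Y$ from the input $X$ to the information quantities above:
\[
H_b(P_e) + P_e \log\big(|\mathcal{Y}| - 1\big) \;\geq\; H(Y \mid X) \;=\; H(Y) - I(X;Y),
\]
where $H_b(\cdot)$ is the binary entropy function and $\mathcal{Y}$ is the label alphabet. Rearranging gives a lower bound on $P_e$ that shrinks as the mutual information $I(X;Y)$ grows, which is the shape of the bounds stated above.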
We conduct extensive experiments to validate our theoretical analysis, using classification tasks on
benchmark datasets such as MNIST and CIFAR-10². The empirical results show that the proposed
MIL framework achieves far better generalization performance than the existing conditional
entropy learning approach and its variants [11, 36, 31].
1.3 Related Works
Our work is closely related to the following works, but there are distinct differences between
our work and theirs [6, 28, 29]. First of all, in 2019, Yi et al. formulated the classification
problem under an encoding-decoding paradigm by assuming there is an observation synthesis process
which can generate observations or inputs for a given label, so that the classification task is simply
about inferring the label from the observation. Under this framework, they gave theoretical
characterizations of the robustness of machine learning models to different types of perturbations [6]. They
also characterized the limit of an arbitrary adversarial attacking algorithm against an arbitrary machine
learning system, answering the questions of what the best attack an adversary can achieve is
and what the optimal adversarial attacks look like [28]. We continue to investigate the classification
² https://www.cs.toronto.edu/~kriz/cifar.html