
set of images $\mathcal{D}$, an image $x$ is sampled uniformly from $\mathcal{D}$, and two augmented views $x_1$ and $x_2$ are obtained from $x$ via a distribution of data augmentations $\mathcal{T}$. They are then fed to an encoder $f$, consisting of a backbone and a projector network, which produces $\ell_2$-normalized embeddings $z_1$ and $z_2$. For a batch of $m$ images, we have $Z_1 = [z_1^1, z_1^2, \ldots, z_1^m]$ and similarly for $Z_2$, which are two observations of the same $Z$. MEC aims to minimize the following loss:
\[
\mathcal{L}_{\mathrm{MEC}} = -\mu \log\det\left(I_m + \lambda Z_1^{\top} Z_2\right) \approx -\operatorname{Tr}\left(\mu \sum_{k=1}^{n} \frac{(-1)^{k+1}}{k} \left(\lambda Z_1^{\top} Z_2\right)^{k}\right), \tag{3}
\]
where the same notations as in Section 2.1 apply, and $n$ is the order of the Taylor expansion. Compared with Equation (1), the formulation in (3) considers not only maximizing entropy but also the view-consistency prior mined from the data itself, thereby learning meaningful representations.
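The following minimal PyTorch-style sketch illustrates how the truncated objective in Equation (3) could be computed. It is illustrative rather than the authors' reference implementation (a PyTorch-like pseudocode is given in Appendix A): the helper name mec_loss, the argument names, and the default expansion order are our own, and $\mu$ is passed in because its value is fixed in Section 2.1, which is not reproduced here.

```python
import torch

def mec_loss(Z1: torch.Tensor, Z2: torch.Tensor, mu: float, eps: float, order: int = 4) -> torch.Tensor:
    """Taylor approximation of -mu * logdet(I_m + lambda * Z1^T Z2), as in Equation (3).

    Z1, Z2: (d, m) matrices whose columns are the l2-normalized embeddings of the
    two augmented views; mu and the distortion eps follow the notation of Section 2.1.
    """
    d, m = Z1.shape
    lam = d / (m * eps ** 2)            # lambda = d / (m * eps^2)
    C = lam * (Z1.T @ Z2)               # (m, m) cross-view correlation matrix
    power, series = C, torch.zeros_like(C)
    for k in range(1, order + 1):       # sum_{k=1}^{n} (-1)^{k+1} / k * C^k
        series = series + ((-1) ** (k + 1) / k) * power
        power = power @ C
    return -mu * torch.trace(series)
```

With sufficiently many terms and $\|C\|_2 < 1$, the truncated sum approaches the exact log-determinant objective.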
As noted in Section 2.1, the convergence condition of the Taylor expansion requires $\|C\|_2 < 1$, where $C = \lambda Z_1^{\top} Z_2$ and $\lambda = d/(m\epsilon^2) = 1/(m \cdot \epsilon^2/d)$. We show that this condition can be strictly satisfied by setting $\epsilon^2/d > 1$, owing to the inequality $\|C\|_2 \le \sqrt{\|C\|_1 \|C\|_\infty} < 1$. In practice, we empirically find that the Taylor expansion converges over a wide range of $\epsilon^2/d$ (Figure 4(c)) with a linear warm-up.
From the preliminary experiments on CIFAR-10 [41] (detailed in Appendix B), we also find that the distributions of representations show progressively finer granularity as $\epsilon$ decreases (Figure 4(b)). This can be interpreted via the practical meaning of the distortion $\epsilon$ (Figure 4(a)): a smaller $\epsilon$ encourages the representation space to be encoded at a finer granularity (and hence more uniformly). By contrast, too small an $\epsilon$ might break the semantic structures of similar images (i.e., tolerance). Therefore, a good choice of $\epsilon$ is needed to balance the uniformity and tolerance properties [75, 74] of representations, and it plays the same role as the temperature term [80] in contrastive learning.
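The spectral-norm bound above is easy to check numerically. The following sketch uses hypothetical values of $d$, $m$, and $\epsilon$ with random unit-norm embeddings, and verifies that with $\epsilon^2/d > 1$ the matrix $C = \lambda Z_1^{\top} Z_2$ indeed satisfies $\|C\|_2 \le \sqrt{\|C\|_1 \|C\|_\infty} < 1$.

```python
import torch

d, m = 128, 256
Z1 = torch.nn.functional.normalize(torch.randn(d, m, dtype=torch.float64), dim=0)
Z2 = torch.nn.functional.normalize(torch.randn(d, m, dtype=torch.float64), dim=0)

eps_sq = 2.0 * d                          # choose eps^2 / d = 2 > 1
lam = d / (m * eps_sq)                    # lambda = d / (m * eps^2)
C = lam * (Z1.T @ Z2)

spectral = torch.linalg.matrix_norm(C, ord=2)                      # ||C||_2
bound = torch.sqrt(torch.linalg.matrix_norm(C, ord=1)              # ||C||_1: max column sum
                   * torch.linalg.matrix_norm(C, ord=float('inf')))  # ||C||_inf: max row sum
print(f"||C||_2 = {spectral.item():.4f} <= bound = {bound.item():.4f} < 1")
```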
An overview of MEC is illustrated in Figure 2, and a PyTorch-like pseudocode is provided in Appendix A. The algorithm describes the minimalist variant of MEC, which can be further improved by integrating a momentum encoder and asymmetric networks (detailed in the experiments).
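The actual pseudocode is in Appendix A; purely for orientation, a hypothetical training step for the minimalist variant might look like the sketch below, reusing the mec_loss helper sketched after Equation (3). The encoder, optimizer, and augmented batches are assumed to be supplied by the surrounding training loop.

```python
import torch.nn.functional as F

def train_step(encoder, optimizer, x1, x2, mu, eps, order=4):
    """One illustrative MEC step: encode two augmented views and minimize Equation (3)."""
    z1 = F.normalize(encoder(x1), dim=1)          # (m, d) l2-normalized embeddings of view 1
    z2 = F.normalize(encoder(x2), dim=1)          # (m, d) l2-normalized embeddings of view 2
    loss = mec_loss(z1.T, z2.T, mu, eps, order)   # transpose to the (d, m) layout used above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```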
2.3 A Unified View of Batch-wise and Feature-wise SSL Objectives
Current SSL methods based on Siamese networks can be roughly divided into two categories: batch-wise methods [12, 31, 13, 15, 14, 10] and feature-wise methods [86, 5, 24, 36]. The former aims to minimize the distance between augmented views of the same sample while maximizing the distance between different samples, which can be viewed as decorrelating the different samples in a batch. The latter, in contrast, tries to decorrelate the different vector components of the representation. The relationship between the two has not been fully understood. Our work bridges these two types of methods through the following derivation:
\[
\mathcal{L}_{\mathrm{MEC}} = \underbrace{-\mu \log\det\left(I_m + \lambda Z_1^{\top} Z_2\right)}_{\text{batch-wise}} = \underbrace{-\mu \log\det\left(I_d + \lambda Z_1 Z_2^{\top}\right)}_{\text{feature-wise}}, \tag{4}
\]
which can be proved since $Z_1^{\top} Z_2 \in \mathbb{R}^{m \times m}$ and $Z_1 Z_2^{\top} \in \mathbb{R}^{d \times d}$ have the same nonzero eigenvalues (the remaining eigenvalues are zero and contribute only factors of one to each determinant).
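A quick numerical check of this identity, with arbitrary dimensions and a small illustrative $\lambda$, confirms that the batch-wise and feature-wise log-determinants in Equation (4) coincide:

```python
import torch

d, m = 64, 32
Z1 = torch.nn.functional.normalize(torch.randn(d, m, dtype=torch.float64), dim=0)
Z2 = torch.nn.functional.normalize(torch.randn(d, m, dtype=torch.float64), dim=0)
lam = 0.01

batch_wise = torch.logdet(torch.eye(m, dtype=torch.float64) + lam * Z1.T @ Z2)    # (m, m) form
feature_wise = torch.logdet(torch.eye(d, dtype=torch.float64) + lam * Z1 @ Z2.T)  # (d, d) form
print(torch.allclose(batch_wise, feature_wise))   # expected: True
```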
In Figure 2, under the framework of MEC, we show the equivalence between batch-wise and feature-wise methods using two examples, SimSiam [14] and Barlow Twins [86]. Taking the Taylor expansion (Equation (2)) of the left-hand side of Equation (4), before the trace operation the diagonal elements of the leading term (i.e., $\mu\lambda Z_1^{\top} Z_2$) measure the similarity between the views of the same images in a batch, and the objective of SimSiam [14] is equivalent to maximizing the trace of this term. Similarly, the leading term of the right-hand side expansion models the correlation between feature dimensions, and the objective of Barlow Twins [86] is equivalent to the second-order expansion of $\mathcal{L}_{\mathrm{MEC}}$. With the above derivation, our method naturally subsumes these two kinds of objectives as its low-order expansions, and we show in the experiments that better downstream task performance can be achieved with higher-order approximations. We further show in Appendix E that MEC can also bridge other self-supervised objectives, and we hope that tying a family of objectives directly to a well-grounded mathematical concept can inspire new methods.
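To make the connection to SimSiam concrete, the following illustrative check (arbitrary $\lambda$, random unit-norm columns; $\mu$ set to an illustrative value) confirms that the trace of the leading term $\mu\lambda Z_1^{\top} Z_2$ is exactly a scaled sum of per-image cosine similarities between the two views, i.e., the quantity a SimSiam-style objective maximizes:

```python
import torch

d, m = 64, 32
Z1 = torch.nn.functional.normalize(torch.randn(d, m, dtype=torch.float64), dim=0)
Z2 = torch.nn.functional.normalize(torch.randn(d, m, dtype=torch.float64), dim=0)
mu, lam = (m + d) / 2, 0.01   # illustrative values

first_order = mu * lam * torch.trace(Z1.T @ Z2)   # trace of the leading Taylor term
cosine_sum = mu * lam * (Z1 * Z2).sum()           # sum_i cos(z_1^i, z_2^i) for unit-norm columns
print(torch.allclose(first_order, cosine_sum))    # expected: True
```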
3 Experiments
We perform self-supervised pre-training using the proposed MEC on the training set of the ImageNet ILSVRC-2012 dataset [17]. After pre-training, we conduct extensive experiments to examine