Pixel-global Self-supervised Learning with Uncertainty-aware Context
Stabilizer
Zhuangzhuang Zhang, Weixiong Zhang
Introduction
Most computer vision (CV) tasks fall into two categories: classification and dense prediction. Dense prediction tasks estimate a label for every pixel of an input image, as in depth prediction, semantic segmentation, and image registration [3, 4], each of which has a wide range of applications. The development of deep learning has brought significant improvements to the CV
field; deep learning-based methods have been shown to outperform heuristic-based methods
significantly. One prominent advantage of deep learning-based methods is that they learn to extract
features rather than rely on hand-crafted features [5], unlocking the potential of learning salient
features that benefit downstream tasks. While deep-learning methods can extract features from
images automatically, they have a serious drawback: an enormous demand for training data. This demand is rarely satisfied in the medical field because annotated datasets are expensive to prepare. The lack of annotated data is especially acute in medical image applications because annotation requires specialized medical expertise [6, 7].
A general hybrid learning approach that combines self-supervised pre-training and supervised fine-tuning has been proposed to address this critical annotation shortage [8-11]. In the self-supervised pre-training phase, various tasks, known as pre-tasks, are introduced for a chosen
backbone neural network to learn semantically meaningful representations. After pre-training, the
backbone network is fine-tuned for downstream tasks. For example, one pre-task generates differently rotated images as training samples; the backbone network learns to predict the rotation applied, and in doing so learns to encode quality representations. After pre-training, the backbone network
is fine-tuned for the downstream image classification task, in which the model learns to classify
input images into different classes.
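For illustration, the following is a minimal PyTorch sketch of such a rotation-prediction pre-task; the ResNet50 backbone, hyperparameters, and random stand-in batch are placeholders rather than any specific published configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Rotation-prediction pre-task: rotate each image by a random multiple
# of 90 degrees and train the backbone to predict which rotation was applied.
backbone = models.resnet50(weights=None)
backbone.fc = nn.Linear(backbone.fc.in_features, 4)  # 4 rotation classes
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(backbone.parameters(), lr=0.01)

def rotation_batch(images):
    """Rotate each image by k*90 degrees; the supervision label is k."""
    labels = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack([torch.rot90(img, k=int(k), dims=(1, 2))
                           for img, k in zip(images, labels)])
    return rotated, labels

# One pre-training step on an unlabeled batch (random stand-in data).
images = torch.randn(8, 3, 224, 224)
rotated, labels = rotation_batch(images)
loss = criterion(backbone(rotated), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

No human annotation is needed: the supervision signal (the rotation index) is generated from the data itself.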
The latest self-supervised learning (SSL) methods adopt contrastive representation learning [2]. Contrastive learning is based
on the assumption that differently augmented views of the same image should have similar
representation, and augmented views of different images should have distinct representations [10,
12]. A typical contrastive SSL method is structured as follows. Suppose we need to train a ResNet50 [13] (a classic deep learning backbone for images) for downstream image classification tasks. Contrastive SSL sets up a student network and a teacher network with the same architecture (ResNet50). In each pre-training iteration, two differently augmented views of the same image are fed into the student and teacher networks, respectively. The discrepancy between their output representations is used as the loss to update the student network's parameters. The teacher network's parameters are
normally updated by the moving average [14] of the student’s parameters. While learning to
perform well for this contrastive pre-task, the backbone network (ResNet50) learns to generate
quality representations for input images. After pre-training, the student network has learned to extract features; we keep its parameters and fine-tune them on downstream classification datasets.
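As a concrete illustration of one such pre-training iteration, here is a minimal PyTorch sketch in the spirit of the teacher-student scheme just described; the momentum value, the use of raw pooled features without a projection head, and the random stand-in batches are simplifying assumptions, not a faithful reproduction of any cited method.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

# Student and teacher share the same backbone architecture (ResNet50 here).
student = models.resnet50(weights=None)
student.fc = nn.Identity()            # expose the 2048-d pooled features
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad = False           # the teacher receives no gradients

optimizer = torch.optim.SGD(student.parameters(), lr=0.05)
momentum = 0.99                       # illustrative EMA momentum

def consistency_loss(s_out, t_out):
    # Discrepancy between the two representations: negative cosine similarity.
    return -F.cosine_similarity(s_out, t_out.detach(), dim=-1).mean()

# Two differently augmented views of the same image batch (random stand-ins).
view1 = torch.randn(8, 3, 224, 224)
view2 = torch.randn(8, 3, 224, 224)

# Update the student from the discrepancy between the branches' outputs.
loss = consistency_loss(student(view1), teacher(view2))
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Update the teacher as a moving average of the student's parameters.
with torch.no_grad():
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_((1.0 - momentum) * s_p)
```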
However, most contrastive SSL methods [3, 14-17] focus on image-level global consistency by treating every image as a class, overlooking local consistency between two augmented views of the same image. Recent SSL models attempt to explore local consistency for generic
images [18-20] and medical images [1, 2], yet they are still limited to enforcing the consistency at
the region level. In these methods, each feature vector corresponds to a region in the original image.
Enforcing such region-level consistency is too coarse-grained to be accurate or adequate for
downstream pixel-wise prediction tasks, such as semantic segmentation and registration. While
pixel-level consistency was explored earlier [1, 21], pixel-to-pixel consistency modeling has not
been well studied. It has been studied by generative self-supervised learning methods [22]
(comparison in Appendix A), but the potential of building pixel-level fine-grained SSL methods
has not been explored. Furthermore, the context difference between a pixel in one view and its
counterpart in the other view was not considered. For example, a pixel on the edge of one cropped
view can be at the center of the other cropped view. Different data augmentations create this context
gap, and it is not ideal to directly push the feature vector towards its corresponding one in the high-
dimensional space.
We developed a novel SSL approach that captures both global consistency and pixel-level local consistency between differently augmented views of the same image to accommodate downstream discriminative and dense prediction tasks. We adopted the teacher-student architecture
used in previous contrastive SSL methods [3, 14, 20]. In our method, the global consistency is
enforced by aggregating the compressed representations of augmented views of the same image.
The pixel-level consistency is enforced by pursuing similar representations for the same pixel in
differently augmented views. Importantly, we introduced an uncertainty-aware context stabilizer
to adaptively preserve the context gap created by the two views from different augmentations.
Moreover, we used Monte Carlo dropout [23] in the stabilizer to measure uncertainty and
adaptively balance the discrepancy between the representations of the same pixels in different
views.
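To make the intuition concrete, below is a hedged PyTorch sketch of how Monte Carlo dropout can yield a per-pixel uncertainty estimate that down-weights the consistency loss for uncertain pixels. The toy head, the inverse-variance weighting, and the assumption that the two views' feature maps are already spatially aligned are all illustrative choices, not the stabilizer's actual formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelHead(nn.Module):
    """Toy dense head with dropout, so repeated forward passes are stochastic."""
    def __init__(self, in_ch=64, out_ch=32, p=0.1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.drop = nn.Dropout2d(p)

    def forward(self, x):
        return self.conv(self.drop(x))

def mc_dropout_stats(head, feats, n_samples=8):
    """Monte Carlo dropout: keep dropout active and run several stochastic
    forward passes; the variance across passes serves as a per-pixel
    uncertainty estimate."""
    head.train()                      # keep dropout active during sampling
    with torch.no_grad():
        samples = torch.stack([head(feats) for _ in range(n_samples)])
    # Mean prediction [B, C, H, W] and per-pixel variance [B, H, W].
    return samples.mean(dim=0), samples.var(dim=0).mean(dim=1)

def weighted_pixel_loss(student_map, teacher_map, uncertainty, eps=1e-6):
    """Per-pixel consistency, down-weighted where the teacher is uncertain
    (inverse-variance weighting; an illustrative choice)."""
    per_pixel = 1.0 - F.cosine_similarity(student_map, teacher_map, dim=1)
    weights = 1.0 / (uncertainty + eps)
    weights = weights / weights.sum()
    return (weights * per_pixel).sum()

# Aligned dense feature maps of the two views, shape [B, C, H, W] (stand-ins).
feats1, feats2 = torch.randn(2, 64, 56, 56), torch.randn(2, 64, 56, 56)
head = PixelHead()
teacher_map, uncertainty = mc_dropout_stats(head, feats2)
loss = weighted_pixel_loss(head(feats1), teacher_map, uncertainty)
```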
We experimentally tested and evaluated the new method using medical images, where high-quality
annotated data are commonly unavailable or insufficient. For instance, a widely used general image
dataset, ImageNet [4, 24], contains more than four million images. In contrast, the samples in most
medical image datasets with annotation are on the order of hundreds to thousands. None of the
public medical image datasets (even with missing or incomplete annotations) has more than one
million images. The data shortage is even more severe in cases where annotations are more labor-
intensive, such as semantic segmentation. Our experiment assembled the largest dataset for medical
image semantic segmentation, with 1808 cases. Despite the small sample size compared to generic
image datasets such as ImageNet, we focused on image semantic segmentation as the downstream
task to assess the performance of our new approach against several state-of-the-art methods. Thanks
to its pixel-level consistency modeling and context stabilizer, the new approach showed superior
performance over the existing methods.
In short, we make two main contributions in this paper:
We propose a novel contrastive SSL approach that effectively enforces global and pixel-
level consistencies, enabling deep learning models to automatically derive semantically
meaningful representations and providing great transfer learning potential to various
downstream tasks, e.g., semantic segmentation.
We address the challenge of modeling pixel-level consistency by proposing an uncertainty-
aware context stabilizer. This stabilizer has two distinctive features: it adaptively cancels the context gap between two random data augmentations, which benefits pixel-wise consistency modeling, and it stabilizes the learning process by estimating uncertainty via Monte Carlo dropout.
Related work
Despite the success of deep learning techniques in computer vision, their demand for large
quantities of training data and human supervision has been a bottleneck for applications where data
annotations are limited or expensive [8]. To address this limitation, SSL methods build models to learn pertinent
image representations via completing different pre-tasks, guided by generated supervision signals.
SSL approaches can be grouped into three categories based on the pre-task designs they adopt:
predictive, generative, and contrastive SSL [12]. The predictive SSL methods apply classification
pre-tasks in their models, training the backbone networks to make predictions based on learned
latent features from unlabeled data. Early pre-tasks include image exemplar [25], relative position
prediction [25], jigsaw puzzle [26], and rotation prediction [27]. The generative SSL methods use
reconstruction pre-tasks for building models to learn latent features without human annotation.
Generative pre-tasks include image denoising [28], inpainting [29], and colorization [30].
Recently, contrastive SSL has attracted great attention and led to various methods, such as
contrastive predictive coding, SimCLR [3], MoCo [31], BYOL [14], and DINO [20]. The main idea
of contrastive learning is to build positive pairs and negative pairs of examples via different ways
of data augmentation and pair selection, though negative pairs are not always necessary [14, 20]. Contrastive SSL models learn to enforce positive pairs to be consistent and negative pairs to be
dissimilar. A typical architecture of contrastive SSL has two branches: a student (online) branch
and a teacher (target) branch, both of which use the same network backbone structures [1-3, 14, 18,
20, 31]. The outputs of the two branches are contrasted to assess the consistency between the representations they learn. The student network's parameters are updated by a back-propagated loss, whereas the teacher network's parameters are normally updated by the
moving average of the student network parameters [20, 31].
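Concretely, this exponential moving average (EMA) update commonly takes the form below, where \(\theta_s\) and \(\theta_t\) denote the student and teacher parameters and \(m\) is a momentum coefficient (values around 0.99 to 0.999 are typical in the cited methods):

$$\theta_t \leftarrow m\,\theta_t + (1 - m)\,\theta_s, \qquad m \in [0, 1)$$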
It is important to note that most existing contrastive SSL methods are limited to capturing global
consistency between projected feature vectors that summarize the whole images [3, 14]. They only
enforce the consistency between global feature vectors and do not capture the high-quality
semantics of these images. To mitigate this, dense contrastive learning methods have been proposed [18,
19] to enforce dense consistency between every pair of feature vectors on the extracted feature
maps of the same images. However, they still focus on consistency between down-sampled feature
maps [18, 19], enforcing region-to-region consistency (at a granularity set by how large a region each feature vector represents; a standard ResNet50, for instance, downsamples by a factor of 32, so each vector on its final feature map summarizes roughly a 32×32-pixel region) rather than pixel-to-pixel consistency. For dense prediction tasks like semantic segmentation, modeling region-level consistency is too coarse-grained to be adequate.
In our new method, we enforced pixel-level consistency while maintaining global consistency to
support classification and dense prediction tasks. The closest related works are [1, 21], but they
have two major drawbacks. First, they enforce local consistency without considering the context
gap created by different data augmentations. This design is suboptimal because the representation
for each pixel contains not just color and intensity information, but more importantly, its context.