Pixel-global Self-supervised Learning with Uncertainty-aware Context
Stabilizer
Zhuangzhuang Zhang, Weixiong Zhang
Introduction
Most computer vision (CV) tasks fall into two categories: classification and dense prediction. Dense prediction tasks estimate a label for every pixel of an input image, as in depth prediction, semantic segmentation, and image registration [3, 4], each of which has a wide range of applications. The development of deep learning has brought significant improvements to the CV
field; deep learning-based methods have been shown to outperform heuristic-based methods
significantly. One prominent advantage of deep learning-based methods is that they learn to extract
features rather than rely on hand-crafted features [5], unlocking the potential of learning salient
features that benefit downstream tasks. While deep-learning methods can extract features from
images automatically, they have a serious drawback: an enormous demand for training data. This demand is rarely satisfied in the medical field because annotated datasets are expensive to prepare. The lack of annotated data is especially acute in medical image applications because annotation requires specialized medical expertise [6, 7].
A general hybrid learning approach that combines self-supervised pre-training and supervised fine-tuning has been proposed to address this critical annotation shortage [8-11]. In the self-supervised pre-training phase, various tasks, known as pre-tasks, are introduced for a chosen
backbone neural network to learn semantically meaningful representations. After pre-training, the
backbone network is fine-tuned for downstream tasks. For example, one pre-task generates differently rotated images as training samples; the backbone network learns to predict the rotation applied, and in doing so learns to encode quality representations. After pre-training, the backbone network
is fine-tuned for the downstream image classification task, in which the model learns to classify
input images into different classes.
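For illustration, the following is a minimal PyTorch sketch of such a rotation-prediction pre-task; the ResNet50 backbone, hyperparameters, and random stand-in batch are placeholders rather than any specific published configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Rotation-prediction pre-task: rotate each image by a random multiple
# of 90 degrees and train the backbone to predict which rotation was applied.
backbone = models.resnet50(weights=None)
backbone.fc = nn.Linear(backbone.fc.in_features, 4)  # 4 rotation classes
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(backbone.parameters(), lr=0.01)

def rotation_batch(images):
    """Rotate each image by k*90 degrees; the supervision label is k."""
    labels = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack([torch.rot90(img, k=int(k), dims=(1, 2))
                           for img, k in zip(images, labels)])
    return rotated, labels

# One pre-training step on an unlabeled batch (random stand-in data).
images = torch.randn(8, 3, 224, 224)
rotated, labels = rotation_batch(images)
loss = criterion(backbone(rotated), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

No human annotation is needed: the supervision signal (the rotation index) is generated from the data itself.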
The latest self-supervised learning (SSL) methods adopt contrastive representation learning [2]. Contrastive learning is based
on the assumption that differently augmented views of the same image should have similar
representation, and augmented views of different images should have distinct representations [10,
12]. A typical contrastive SSL method is structured as follows. Suppose we need to train a ResNet50 [13] (a classic deep learning backbone for images) for downstream image classification tasks. Contrastive SSL sets up a student network and a teacher network with the same architecture (ResNet50). In each pre-training iteration, two differently augmented views of the same image are fed into the student and teacher networks, respectively. The discrepancy between their output representations is used as the loss to update the student network's parameters. The teacher network's parameters are
normally updated by the moving average [14] of the student’s parameters. While learning to
perform well for this contrastive pre-task, the backbone network (ResNet50) learns to generate
quality representations for input images. After pre-training, the student network has learned to extract features; we keep its parameters and fine-tune them on downstream classification datasets.
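As a concrete illustration of one such pre-training iteration, here is a minimal PyTorch sketch in the spirit of the teacher-student scheme just described; the momentum value, the use of raw pooled features without a projection head, and the random stand-in batches are simplifying assumptions, not a faithful reproduction of any cited method.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

# Student and teacher share the same backbone architecture (ResNet50 here).
student = models.resnet50(weights=None)
student.fc = nn.Identity()            # expose the 2048-d pooled features
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad = False           # the teacher receives no gradients

optimizer = torch.optim.SGD(student.parameters(), lr=0.05)
momentum = 0.99                       # illustrative EMA momentum

def consistency_loss(s_out, t_out):
    # Discrepancy between the two representations: negative cosine similarity.
    return -F.cosine_similarity(s_out, t_out.detach(), dim=-1).mean()

# Two differently augmented views of the same image batch (random stand-ins).
view1 = torch.randn(8, 3, 224, 224)
view2 = torch.randn(8, 3, 224, 224)

# Update the student from the discrepancy between the branches' outputs.
loss = consistency_loss(student(view1), teacher(view2))
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Update the teacher as a moving average of the student's parameters.
with torch.no_grad():
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_((1.0 - momentum) * s_p)
```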
However, most contrastive SSL methods [3, 14-17] focus on image-level global consistency by treating every image as a class, overlooking local consistency between two augmented views of the same image. Recent SSL models attempt to explore local consistency for generic
images [18-20] and medical images [1, 2], yet they are still limited to enforcing the consistency at
the region level. In these methods, each feature vector corresponds to a region in the original image.
Enforcing such region-level consistency is too coarse-grained to be accurate or adequate for
downstream pixel-wise prediction tasks, such as semantic segmentation and registration. While
pixel-level consistency was explored earlier [1, 21], pixel-to-pixel consistency modeling has not
been well studied. It has been studied by generative self-supervised learning methods [22]
(comparison in Appendix A), but the potential of building pixel-level fine-grained SSL methods
has not been explored. Furthermore, the context difference between a pixel in one view and its
counterpart in the other view was not considered. For example, a pixel on the edge of one cropped
view can be at the center of the other cropped view. Different data augmentations create this context
gap, and it is not ideal to directly push the feature vector towards its corresponding one in the high-
dimensional space.
We developed a novel SSL approach that captures both global consistency and pixel-level local consistency between differently augmented views of the same image to accommodate downstream discriminative and dense prediction tasks. We adopted the teacher-student architecture
used in previous contrastive SSL methods [3, 14, 20]. In our method, the global consistency is
enforced by aggregating the compressed representations of augmented views of the same image.
The pixel-level consistency is enforced by pursuing similar representations for the same pixel in
differently augmented views. Importantly, we introduced an uncertainty-aware context stabilizer
to adaptively preserve the context gap created by the two views from different augmentations.
Moreover, we used Monte Carlo dropout [23] in the stabilizer to measure uncertainty and
adaptively balance the discrepancy between the representations of the same pixels in different
views.
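To make the intuition concrete, below is a hedged PyTorch sketch of how Monte Carlo dropout can yield a per-pixel uncertainty estimate that down-weights the consistency loss for uncertain pixels. The toy head, the inverse-variance weighting, and the assumption that the two views' feature maps are already spatially aligned are all illustrative choices, not the stabilizer's actual formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelHead(nn.Module):
    """Toy dense head with dropout, so repeated forward passes are stochastic."""
    def __init__(self, in_ch=64, out_ch=32, p=0.1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.drop = nn.Dropout2d(p)

    def forward(self, x):
        return self.conv(self.drop(x))

def mc_dropout_stats(head, feats, n_samples=8):
    """Monte Carlo dropout: keep dropout active and run several stochastic
    forward passes; the variance across passes serves as a per-pixel
    uncertainty estimate."""
    head.train()                      # keep dropout active during sampling
    with torch.no_grad():
        samples = torch.stack([head(feats) for _ in range(n_samples)])
    # Mean prediction [B, C, H, W] and per-pixel variance [B, H, W].
    return samples.mean(dim=0), samples.var(dim=0).mean(dim=1)

def weighted_pixel_loss(student_map, teacher_map, uncertainty, eps=1e-6):
    """Per-pixel consistency, down-weighted where the teacher is uncertain
    (inverse-variance weighting; an illustrative choice)."""
    per_pixel = 1.0 - F.cosine_similarity(student_map, teacher_map, dim=1)
    weights = 1.0 / (uncertainty + eps)
    weights = weights / weights.sum()
    return (weights * per_pixel).sum()

# Aligned dense feature maps of the two views, shape [B, C, H, W] (stand-ins).
feats1, feats2 = torch.randn(2, 64, 56, 56), torch.randn(2, 64, 56, 56)
head = PixelHead()
teacher_map, uncertainty = mc_dropout_stats(head, feats2)
loss = weighted_pixel_loss(head(feats1), teacher_map, uncertainty)
```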
We experimentally tested and evaluated the new method using medical images, where high-quality
annotated data are commonly unavailable or insufficient. For instance, a widely used general image
dataset, ImageNet [4, 24], contains more than four million images. In contrast, the samples in most
medical image datasets with annotation are on the order of hundreds to thousands. None of the
public medical image datasets (even with missing or incomplete annotations) has more than one
million images. The data shortage is even more severe in cases where annotations are more labor-
intensive, such as semantic segmentation. Our experiment assembled the largest dataset for medical
image semantic segmentation, with 1808 cases. Despite the small sample size compared to generic
image datasets such as ImageNet, we focused on image semantic segmentation as the downstream
task to assess the performance of our new approach against several state-of-the-art methods. Thanks
to its pixel-level consistency modeling and context stabilizer, the new approach showed superior
performance over the existing methods.
In short, we make two main contributions in this paper:
We propose a novel contrastive SSL approach that effectively enforces global and pixel-
level consistencies, enabling deep learning models to automatically derive semantically
meaningful representations and providing great transfer learning potential to various
downstream tasks, e.g., semantic segmentation.
We address the challenge of modeling pixel-level consistency by proposing an uncertainty-
aware context stabilizer. This stabilizer has two distinctive features: it adaptively cancels the context gap between two random data augmentations, which benefits pixel-wise consistency modeling, and it stabilizes the learning process by estimating uncertainty via Monte Carlo dropout.
Related work
Despite the success of deep learning techniques in computer vision, their demand for large
quantities of training data and human supervision has been a bottleneck for applications where data
annotations are limited or expensive [8]. To address this limitation, SSL methods build models to learn pertinent
image representations via completing different pre-tasks, guided by generated supervision signals.
SSL approaches can be grouped into three categories based on the pre-task designs they adopt:
predictive, generative, and contrastive SSL [12]. The predictive SSL methods apply classification
pre-tasks in their models, training the backbone networks to make predictions based on learned
latent features from unlabeled data. Early pre-tasks include image exemplar [25], relative position
prediction [25], jigsaw puzzle [26], and rotation prediction [27]. The generative SSL methods use
reconstruction pre-tasks for building models to learn latent features without human annotation.
Generative pre-tasks include image denoising [28], inpainting [29], and colorization [30].
Recently, contrastive SSL has attracted great attention and led to various methods, such as
contrastive predictive coding, SimCLR [3], MoCo [31], BYOL [14], and DINO [20]. The main idea
of contrastive learning is to build positive pairs and negative pairs of examples via different ways
of data augmentation and pair selection, though negative pairs are not always necessary [14, 20]. Contrastive SSL models learn to enforce positive pairs to be consistent and negative pairs to be
dissimilar. A typical architecture of contrastive SSL has two branches: a student (online) branch
and a teacher (target) branch, both of which use the same network backbone structures [1-3, 14, 18,
20, 31]. The outputs of the two branches are contrasted to assess the consistency between the representations they learn. The student network's parameters are updated by a back-propagated loss, whereas the teacher network's parameters are normally updated by the
moving average of the student network parameters [20, 31].
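Concretely, this exponential moving average (EMA) update commonly takes the form below, where \(\theta_s\) and \(\theta_t\) denote the student and teacher parameters and \(m\) is a momentum coefficient (values around 0.99 to 0.999 are typical in the cited methods):

$$\theta_t \leftarrow m\,\theta_t + (1 - m)\,\theta_s, \qquad m \in [0, 1)$$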
It is important to note that most existing contrastive SSL methods are limited to capturing global
consistency between projected feature vectors that summarize the whole images [3, 14]. They only
enforce the consistency between global feature vectors and do not capture the high-quality
semantics of these images. To mitigate this, dense contrastive learning methods have been proposed [18,
19] to enforce dense consistency between every pair of feature vectors on the extracted feature
maps of the same images. However, they still focus on consistency between down-sampled feature
maps [18, 19], enforcing region-to-region consistency (at a granularity set by how large a region each feature vector represents; a standard ResNet50, for instance, downsamples by a factor of 32, so each vector on its final feature map summarizes roughly a 32×32-pixel region) rather than pixel-to-pixel consistency. For dense prediction tasks like semantic segmentation, modeling region-level consistency is too coarse-grained to be adequate.
In our new method, we enforced pixel-level consistency while maintaining global consistency to
support classification and dense prediction tasks. The closest related works are [1, 21], but they
have two major drawbacks. First, they enforce local consistency without considering the context
gap created by different data augmentations. This design is suboptimal because the representation
for each pixel contains not just color and intensity information, but more importantly, its context.