HiCo: Hierarchical Contrastive Learning for
Ultrasound Video Model Pretraining
Chunhui Zhang1,2[0000-0002-9017-1828], Yixiong Chen3,4[0000-0003-0268-076X],
Li Liu3,4,*[0000-0002-4497-0135], Qiong Liu2[0000-0002-5808-2761], and Xi
Zhou2,1[0000-0001-9943-5482]
1Shanghai Jiaotong University, 200240 Shanghai, China
2CloudWalk Technology Co., Ltd, 201203 Shanghai, China
3The Chinese University of Hong Kong (Shenzhen), 518172 Shenzhen, China
liuli@cuhk.edu.cn
4Shenzhen Research Institute of Big Data, 518172 Shenzhen, China
Abstract. Self-supervised ultrasound (US) video model pretraining can
achieve some of the most promising results in US diagnosis using only a
small amount of labeled data. However, it does not take full advantage of
multi-level knowledge when learning deep neural networks (DNNs), and
thus struggles to learn transferable feature representations. This work
proposes a hierarchical contrastive learning (HiCo) method to improve
the transferability for the US video model pretraining. HiCo introduces
both peer-level semantic alignment and cross-level semantic alignment
to facilitate the interaction between different semantic levels, which can
effectively accelerate the convergence speed, leading to better generaliza-
tion and adaptation of the learned model. Additionally, a softened ob-
jective function is implemented by smoothing the hard labels, which can
alleviate the negative effect caused by local similarities of images between
different classes. Experiments with HiCo on five datasets demonstrate its
favorable results over state-of-the-art approaches. The source code of this
work is publicly available at https://github.com/983632847/HiCo.
1 Introduction
Thanks to its cost-effectiveness, safety, and portability, combined with a rea-
sonable sensitivity to a wide variety of pathologies, ultrasound (US) has become
one of the most common medical imaging techniques in clinical diagnosis [1]. To
mitigate sonographers’ reading burden and improve diagnosis efficiency, auto-
matic US analysis using deep learning is becoming popular [2,3,4,5]. In the past
decades, a successful practice is to train a deep neural network (DNN) on a large
number of well-labeled US images within the supervised learning paradigm [1,6].
However, annotations of US images and videos can be expensive to obtain and
sometimes infeasible to access because of the expertise requirements and time-
consuming reading, which motivates the development of US diagnosis that re-
quires few or even no manual annotations.
* Corresponding author.
arXiv:2210.04477v1 [cs.CV] 10 Oct 2022
[Figure 1: (a) vanilla contrastive learning; (b) hierarchical contrastive learning with peer-level alignment (L_ll, L_mm, L_gg) and cross-level alignment (L_mg, L_lg) over local, medium, and global features F_l, F_m, F_g; (c) convergence speed.]
Fig. 1. Motivation of hierarchical contrastive learning. Unlike (a) vanilla contrastive
learning, our (b) hierarchical contrastive learning can fully take advantage of both peer-
level and cross-level information. Thus, (c) the model pretrained with our hierarchical
contrastive learning converges much faster than models learned from scratch, via
supervised learning, or via vanilla contrastive learning.
In recent years, pretraining combined with fine-tuning has attracted great
attention because it can transfer knowledge learned on large amounts of unla-
beled or weakly labeled data to downstream tasks, especially when the amount
of labeled data is limited. This has also profoundly affected the field of US diag-
nosis, which started to pretrain models from massive unlabeled US data accord-
ing to a pretext task. To learn meaningful and strong representations, US video
pretraining methods are designed to correct the order of a reshuffled video
clip, predict the geometric transformation applied to a video clip, or colorize
a grayscale image to its color equivalent [7,8]. Inspired by the powerful
ability of contrastive learning (CL) [9,10] in computer vision, some recent studies
propose to learn US video representations with CL [11,3], which has shown a powerful
learning capability [12,11]. However, most of the existing US video pretraining
methods follow the vanilla contrastive learning setting [10,13] and only use the
output of a certain layer of a DNN for contrast (see Fig. 1(a)). Although the
CL methods are usually better than learning from scratch and supervised learn-
ing, the lack of multi-level information interaction will inevitably degrade the
transferability of pretrained models [3,14].
To address the above issue, we first propose a hierarchical contrastive learn-
ing (HiCo) method for US video model pretraining. The main motivation is to
design a feature-based peer-level and cross-level semantic alignment method (see
Fig. 1(b)) to improve the efficiency of learning and enhance the ability of feature
representation. Specifically, based on the assumption that the top layer of a DNN
has strong semantic information, and the bottom layer has high-resolution local
information (e.g., texture and shape) [15], we design a joint learning task to
force the model to learn multi-level semantic representations during the CL pro-
cess: simultaneously minimizing the peer-level semantic alignment losses (i.e.,
the global, medium, and local CL losses) and the cross-level semantic alignment
losses (i.e., the two global-medium CL losses and the two global-local CL losses).
Intuitively, our framework can greatly improve the convergence speed
of the model (i.e., providing a well-initialized model for downstream tasks) (see
Fig. 1(c)), due to the sufficient interaction of peer-level and cross-level informa-
tion. Different from existing methods [16,17,18,19,20], this work assumes that the
knowledge inside the backbone is sufficient but underutilized, so that simple yet
effective peer-level and cross-level semantic alignments can be used to enhance
the feature representation, rather than designing a complex structure. In addition,
medical images from different classes/lesions may have significant local simi-
larities (e.g., normal and infected individuals have similar regions of tissues and
organs unrelated to disease), which is more severe than natural images. Thus, we
follow the popular label smoothing strategy to design a batch-based softened ob-
jective function during the pretraining to avoid the model being over-confident,
which alleviates the negative effect caused by local similarities.
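As a concrete illustration, the joint objective sketched above, peer-level and cross-level alignment with softened targets, can be written in a few lines. The following is a minimal NumPy sketch, not the paper's implementation; the projection details, the temperature, the smoothing value, and the equal weighting of the seven loss terms are assumptions made here for illustration.

```python
import numpy as np

def softened_nce(a, b, temperature=0.5, smooth=0.1):
    """Contrastive loss between two sets of projected features with
    label smoothing. (a[i], b[i]) are positive pairs; instead of a
    one-hot target on the diagonal, each row keeps 1 - smooth on its
    positive and spreads `smooth` over the negatives, so locally
    similar samples from other classes are penalized less."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                   # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n = len(a)
    target = np.full((n, n), smooth / (n - 1))       # softened targets
    np.fill_diagonal(target, 1.0 - smooth)
    return -np.mean((target * log_prob).sum(axis=1))

def hico_loss(feats_v1, feats_v2, **kw):
    """feats_v1, feats_v2: dicts mapping 'local'/'medium'/'global' to
    (N, D) projected features of two augmented views. Peer-level terms
    align equal levels across views; cross-level terms align the global
    feature with the medium and local ones in both directions."""
    peer = sum(softened_nce(feats_v1[k], feats_v2[k], **kw)
               for k in ("local", "medium", "global"))
    cross = (softened_nce(feats_v1["global"], feats_v2["medium"], **kw) +
             softened_nce(feats_v1["medium"], feats_v2["global"], **kw) +
             softened_nce(feats_v1["global"], feats_v2["local"], **kw) +
             softened_nce(feats_v1["local"], feats_v2["global"], **kw))
    return peer + cross
```

Here `feats_v1` and `feats_v2` would hold the projected multi-level features of two augmented clips from the same video; minimizing `hico_loss` pulls all seven positive pairings together at once.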
The main contributions of this work can be summarized as follows:
1) We propose a novel hierarchical contrastive learning method for US video
model pretraining, which can make full use of the multi-level knowledge inside
a DNN via peer-level semantic alignment and cross-level semantic alignment.
2) We soften one-hot labels during the pretraining process to avoid the model
being over-confident, alleviating the negative effect caused by local similarities
of images between different classes.
3) Experiments on five downstream tasks demonstrate the effectiveness of
our approach in learning transferable representations.
2 Related Work
We first review related works on supervised learning for US diagnosis and then
discuss the self-supervised representation learning.
2.1 US Diagnosis
With the rise of deep learning in computer vision, supervised learning became
the most common strategy for US diagnosis with DNNs [1,3,21,22,23]. In the
last decades, numerous datasets and methods have been introduced for US im-
age classification [24], detection [25] and segmentation [26] tasks. For exam-
ple, some US image datasets with labeled data were designed for breast cancer
classification [27,28], breast US lesions detection [25], diagnosis of malignant
thyroid nodule [29,30], and automated measurement of the fetal head circum-
ference [31]. At the same time, many deep learning approaches have been done
on lung US [32,33], B-line detection or quantification [34,35], pleural line extrac-
tion [36], and subpleural pulmonary lesions [37]. Compared with image-based
datasets, recent video-based US datasets [1,3] are becoming much richer and
can provide more diverse categories and data modalities (e.g., convex and linear
probe US images [3]). Thus, many works focus on video-based US diag-
nosis within the supervised learning paradigm. In [1], a frame-based model was
proposed to correctly distinguish COVID-19 lung US videos from healthy and
bacterial pneumonia data. Other works focus on quality assessment for med-
ical US video compressing [38], localizing target structures [39], or describing
US video content [2]. Until recently, many advanced DNNs (e.g., UNet [40],
DeepLab [41,42], Transformer [43]), and technologies (e.g., neural architecture
search [44], reinforcement learning [45], meta-learning [46]) have brought great
advances in supervised learning for US diagnosis. Unfortunately, US diagnosis
using supervised learning highly relies on large-scale labeled, often expensive
medical datasets.
2.2 Self-supervised Learning
Recently, many self-supervised learning methods for visual feature represen-
tation learning have been developed without using any human-annotated la-
bels [47,48,49]. Existing self-supervised learning methods can be divided into two
main categories, i.e., learning via pretext tasks and CL. A wide range of pre-
text tasks have been proposed to facilitate the development of self-supervised
learning. Examples include solving jigsaw puzzles [50], colorization [8], image
context restoration [51], and relative patch prediction [52]. However, many of
these tasks rely on ad-hoc heuristics that could limit the generalization and ro-
bustness of learned feature representations for downstream tasks [13,10]. CL
has emerged as the front-runner in self-supervised representation learning and
has demonstrated remarkable performance on downstream tasks. Unlike learn-
ing via pretext tasks, CL is a discriminative approach that aims at grouping
similar positive samples closer and repelling negative samples. To achieve this,
a similarity metric is used to measure how close two feature embeddings are.
For computer vision tasks, a standard loss function, i.e., the Noise-Contrastive
Estimation (InfoNCE) loss [53], is computed on the feature representations of
images extracted by a backbone network (e.g., ResNet [22]). Most successful
CL approaches are focused on studying effective contrastive loss, generation of
positive and negative pairs, and sampling method [10,9]. SimCLR [10] is a sim-
ple framework for CL of visual representations with strong data augmentations
and a large training batch size. MoCo [9] builds a dynamic dictionary with a
queue and a moving-averaged encoder. Other works explore learning without
negative samples [54,55], incorporating self-supervised learning with vision
transformers [56], etc.
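For reference, the InfoNCE loss mentioned above can be sketched in a few lines of NumPy; the batch supplies the negatives, and the temperature value of 0.5 is an arbitrary illustrative choice.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.5):
    """InfoNCE loss for a batch of positive pairs (z1[i], z2[i]).
    z1, z2: (N, D) embeddings of two augmented views; for each anchor
    z1[i], z2[i] is the positive and the other rows of z2 are negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)   # unit norm, so the
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)   # dot product is cosine
    logits = z1 @ z2.T / temperature                      # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # cross-entropy with the diagonal (true pairs) as the target class
    return -np.mean(np.diag(log_prob))
```

Matched views give a low loss while mispaired views give a high one, which is exactly the signal that groups positives together and repels negatives.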
Considering the superior performance of contrastive self-supervised learning
in computer vision and medical imaging tasks, this work follows the line of CL.
First, unlike existing CL methods, which usually use the output of a certain
layer of the network for contrast (see Fig. 1), we propose both peer-level and
cross-level alignments to speed up the convergence of model learning.
Second, we design a softened objective function to facilitate CL by
addressing the negative effect of local similarities between different classes.
3 Hierarchical Contrastive Learning
In this section, we present our HiCo approach for US video model pretraining.
To this end, we first introduce the preliminaries of CL, and then present the peer-