HiCo: Hierarchical Contrastive Learning for
Ultrasound Video Model Pretraining
Chunhui Zhang1,2[0000-0002-9017-1828], Yixiong Chen3,4[0000-0003-0268-076X],
Li Liu3,4,*[0000-0002-4497-0135], Qiong Liu2[0000-0002-5808-2761], and Xi
Zhou2,1[0000-0001-9943-5482]
1Shanghai Jiaotong University, 200240 Shanghai, China
2CloudWalk Technology Co., Ltd, 201203 Shanghai, China
3The Chinese University of Hong Kong (Shenzhen), 518172 Shenzhen, China
liuli@cuhk.edu.cn
4Shenzhen Research Institute of Big Data, 518172 Shenzhen, China
Abstract. Self-supervised ultrasound (US) video model pretraining can
achieve some of the most promising results in US diagnosis using only a
small amount of labeled data. However, it does not take full advantage of
multi-level knowledge when learning deep neural networks (DNNs), and
thus struggles to learn transferable feature representations. This work
proposes a hierarchical contrastive learning (HiCo) method to improve
the transferability for the US video model pretraining. HiCo introduces
both peer-level semantic alignment and cross-level semantic alignment
to facilitate the interaction between different semantic levels, which can
effectively accelerate the convergence speed, leading to better generaliza-
tion and adaptation of the learned model. Additionally, a softened ob-
jective function is implemented by smoothing the hard labels, which can
alleviate the negative effect caused by local similarities of images between
different classes. Experiments with HiCo on five datasets demonstrate its
favorable results over state-of-the-art approaches. The source code of this
work is publicly available at https://github.com/983632847/HiCo.
1 Introduction
Thanks to its cost-effectiveness, safety, and portability, combined with a rea-
sonable sensitivity to a wide variety of pathologies, ultrasound (US) has become
one of the most common medical imaging techniques in clinical diagnosis [1]. To
mitigate sonographers’ reading burden and improve diagnosis efficiency, auto-
matic US analysis using deep learning is becoming popular [2,3,4,5]. In the past
decades, a successful practice is to train a deep neural network (DNN) on a large
number of well-labeled US images within the supervised learning paradigm [1,6].
However, annotations of US images and videos can be expensive to obtain and
sometimes infeasible to access because of the expertise requirements and time-
consuming reading, which motivates the development of US diagnosis that re-
quires few or even no manual annotations.
* Corresponding author.
arXiv:2210.04477v1 [cs.CV] 10 Oct 2022
[Figure 1: (a) vanilla contrastive learning; (b) hierarchical contrastive learning with peer-level alignment (L_ll, L_mm, L_gg) and cross-level alignment (L_mg, L_lg) over local, medium, and global features F_l, F_m, F_g; (c) convergence speed.]
Fig. 1. Motivation of hierarchical contrastive learning. Unlike (a) vanilla contrastive
learning, our (b) hierarchical contrastive learning can fully take advantage of both peer-
level and cross-level information. Thus, (c) the model pretrained with our hierarchical
contrastive learning converges much faster than models learned from scratch, via
supervised learning, or via vanilla contrastive learning.
In recent years, pretraining combined with fine-tuning has attracted great
attention because it can transfer knowledge learned on large amounts of unla-
beled or weakly labeled data to downstream tasks, especially when the amount
of labeled data is limited. This has also profoundly affected the field of US diag-
nosis, which started to pretrain models from massive unlabeled US data accord-
ing to a pretext task. To learn meaningful and strong representations, US video
pretraining methods are designed to correct the order of a reshuffled video
clip, predict the geometric transformation applied to a video clip, or colorize
a grayscale image to its color equivalent [7,8]. Inspired by the powerful
ability of contrastive learning (CL) [9,10] in computer vision, some recent studies
propose to learn US video representations with CL [11,3], which has shown a powerful
learning capability [12,11]. However, most of the existing US video pretraining
methods follow the vanilla contrastive learning setting [10,13] and only use the
output of a certain layer of a DNN for contrast (see Fig. 1(a)). Although the
CL methods are usually better than learning from scratch and supervised learn-
ing, the lack of multi-level information interaction will inevitably degrade the
transferability of pretrained models [3,14].
To address the above issue, we first propose a hierarchical contrastive learn-
ing (HiCo) method for US video model pretraining. The main motivation is to
design a feature-based peer-level and cross-level semantic alignment method (see
Fig. 1(b)) to improve the efficiency of learning and enhance the ability of feature
representation. Specifically, based on the assumption that the top layer of a DNN
has strong semantic information, and the bottom layer has high-resolution local
information (e.g., texture and shape) [15], we design a joint learning task to
force the model to learn multi-level semantic representations during the CL pro-
cess: simultaneously minimizing the peer-level semantic alignment losses (i.e.,
the global, medium, and local CL losses) and the cross-level semantic alignment
losses (i.e., the two global-medium CL losses and the two global-local CL losses).
Intuitively, our framework can greatly improve the convergence speed
of the model (i.e., providing a well-initialized model for downstream tasks) (see
Fig. 1(c)), due to the sufficient interaction of peer-level and cross-level informa-
tion. Different from existing methods [16,17,18,19,20], this work assumes that the
knowledge inside the backbone is sufficient but underutilized, so that simple yet
effective peer-level and cross-level semantic alignments can be used to enhance
the feature representation, rather than designing a complex structure. In addition,
medical images from different classes/lesions may have significant local simi-
larities (e.g., normal and infected individuals have similar regions of tissues and
organs unrelated to disease), which is more severe than natural images. Thus, we
follow the popular label smoothing strategy to design a batch-based softened ob-
jective function during the pretraining to avoid the model being over-confident,
which alleviates the negative effect caused by local similarities.
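As a concrete illustration, the joint objective sketched above, peer-level and cross-level alignment with softened targets, can be written in a few lines. The following is a minimal NumPy sketch, not the paper's implementation; the projection details, the temperature, the smoothing value, and the equal weighting of the seven loss terms are assumptions made here for illustration.

```python
import numpy as np

def softened_nce(a, b, temperature=0.5, smooth=0.1):
    """Contrastive loss between two sets of projected features with
    label smoothing. (a[i], b[i]) are positive pairs; instead of a
    one-hot target on the diagonal, each row keeps 1 - smooth on its
    positive and spreads `smooth` over the negatives, so locally
    similar samples from other classes are penalized less."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                   # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n = len(a)
    target = np.full((n, n), smooth / (n - 1))       # softened targets
    np.fill_diagonal(target, 1.0 - smooth)
    return -np.mean((target * log_prob).sum(axis=1))

def hico_loss(feats_v1, feats_v2, **kw):
    """feats_v1, feats_v2: dicts mapping 'local'/'medium'/'global' to
    (N, D) projected features of two augmented views. Peer-level terms
    align equal levels across views; cross-level terms align the global
    feature with the medium and local ones in both directions."""
    peer = sum(softened_nce(feats_v1[k], feats_v2[k], **kw)
               for k in ("local", "medium", "global"))
    cross = (softened_nce(feats_v1["global"], feats_v2["medium"], **kw) +
             softened_nce(feats_v1["medium"], feats_v2["global"], **kw) +
             softened_nce(feats_v1["global"], feats_v2["local"], **kw) +
             softened_nce(feats_v1["local"], feats_v2["global"], **kw))
    return peer + cross
```

Here `feats_v1` and `feats_v2` would hold the projected multi-level features of two augmented clips from the same video; minimizing `hico_loss` pulls all seven positive pairings together at once.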
The main contributions of this work can be summarized as follows:
1) We propose a novel hierarchical contrastive learning method for US video
model pretraining, which can make full use of the multi-level knowledge inside
a DNN via peer-level semantic alignment and cross-level semantic alignment.
2) We soften one-hot labels during the pretraining process to avoid the model
being over-confident, alleviating the negative effect caused by local similarities
of images between different classes.
3) Experiments on five downstream tasks demonstrate the effectiveness of
our approach in learning transferable representations.
2 Related Work
We first review related works on supervised learning for US diagnosis and then
discuss the self-supervised representation learning.
2.1 US Diagnosis
With the rise of deep learning in computer vision, supervised learning became
the most common strategy for US diagnosis with DNNs [1,3,21,22,23]. In the
last decades, numerous datasets and methods have been introduced for US im-
age classification [24], detection [25] and segmentation [26] tasks. For exam-
ple, some US image datasets with labeled data were designed for breast cancer
classification [27,28], breast US lesions detection [25], diagnosis of malignant
thyroid nodule [29,30], and automated measurement of the fetal head circum-
ference [31]. At the same time, many deep learning approaches have been done
on lung US [32,33], B-line detection or quantification [34,35], pleural line extrac-
tion [36], and subpleural pulmonary lesions [37]. Compared with image-based
datasets, recent video-based US datasets [1,3] are becoming much richer and
can provide more diverse categories and data modalities (e.g., convex and linear
probe US images [3]). Thus, many works focus on video-based US diag-
nosis within the supervised learning paradigm. In [1], a frame-based model was
proposed to correctly distinguish COVID-19 lung US videos from healthy and
bacterial pneumonia data. Other works focus on quality assessment for med-
ical US video compressing [38], localizing target structures [39], or describing
US video content [2]. Until recently, many advanced DNNs (e.g., UNet [40],
DeepLab [41,42], Transformer [43]), and technologies (e.g., neural architecture
search [44], reinforcement learning [45], meta-learning [46]) have brought great
advances in supervised learning for US diagnosis. Unfortunately, US diagnosis
using supervised learning highly relies on large-scale labeled, often expensive
medical datasets.
2.2 Self-supervised Learning
Recently, many self-supervised learning methods for visual feature represen-
tation learning have been developed without using any human-annotated la-
bels [47,48,49]. Existing self-supervised learning methods can be divided into two
main categories, i.e., learning via pretext tasks and CL. A wide range of pre-
text tasks have been proposed to facilitate the development of self-supervised
learning. Examples include solving jigsaw puzzles [50], colorization [8], image
context restoration [51], and relative patch prediction [52]. However, many of
these tasks rely on ad-hoc heuristics that could limit the generalization and ro-
bustness of learned feature representations for downstream tasks [13,10]. CL
has emerged as the front-runner in self-supervised representation learning and
has demonstrated remarkable performance on downstream tasks. Unlike learn-
ing via pretext tasks, CL is a discriminative approach that aims at grouping
similar positive samples closer and repelling negative samples. To achieve this,
a similarity metric is used to measure how close two feature embeddings are.
For computer vision tasks, a standard loss function, i.e., the Noise-Contrastive
Estimation (InfoNCE) loss [53], is computed on the feature representations of
images extracted by a backbone network (e.g., ResNet [22]). Most successful
CL approaches are focused on studying effective contrastive loss, generation of
positive and negative pairs, and sampling method [10,9]. SimCLR [10] is a sim-
ple framework for CL of visual representations with strong data augmentations
and a large training batch size. MoCo [9] builds a dynamic dictionary with a
queue and a moving-averaged encoder. Other works explore learning without
negative samples [54,55], incorporating self-supervised learning with vision
transformers [56], etc.
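For reference, the InfoNCE loss mentioned above can be sketched in a few lines of NumPy; the batch supplies the negatives, and the temperature value of 0.5 is an arbitrary illustrative choice.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.5):
    """InfoNCE loss for a batch of positive pairs (z1[i], z2[i]).
    z1, z2: (N, D) embeddings of two augmented views; for each anchor
    z1[i], z2[i] is the positive and the other rows of z2 are negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)   # unit norm, so the
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)   # dot product is cosine
    logits = z1 @ z2.T / temperature                      # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # cross-entropy with the diagonal (true pairs) as the target class
    return -np.mean(np.diag(log_prob))
```

Matched views give a low loss while mispaired views give a high one, which is exactly the signal that groups positives together and repels negatives.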
Considering the superior performance of contrastive self-supervised learning
in computer vision and medical imaging tasks, this work follows the line of CL.
First, unlike existing CL methods, which usually use the output of a certain
layer of the network for contrast (see Fig. 1), we propose both peer-level and
cross-level alignments to speed up the convergence of model learning.
Second, we design a softened objective function to facilitate CL by
addressing the negative effect of local similarities between different classes.
3 Hierarchical Contrastive Learning
In this section, we present our HiCo approach for US video model pretraining.
To this end, we first introduce the preliminaries of CL, and then present the peer-