
the image. Recently, deep learning has had an enormous impact on the field of computer vision, including the semantic segmentation task. Many deep learning models are based on fully convolutional networks (FCNs) [12], whose encoder gradually reduces the spatial resolution and captures more semantic context of an image through larger receptive fields. However, CNNs still have limited receptive fields, which leads to missing context features. To overcome this limitation, PSPNet [13] proposed the Pyramid Pooling Module to enhance global context, while DeepLab [14] utilized atrous convolution to expand receptive fields.
B. Vision Transformer
Transformer [15] is a deep neural network architecture originally proposed for the machine translation task in natural language processing. Nowadays, the Transformer has a huge influence on natural language processing and many other domains, including computer vision. Vision Transformer (ViT) [16] was the first model to successfully apply the Transformer to computer vision, dividing an image into patches, treating each patch as a token, and feeding the resulting sequence into a Transformer. Following this success, PVT [17], Swin [18], and SegFormer [19] were designed as hierarchical Transformers that generate feature maps at different scales to enhance local features, thus improving performance on dense prediction tasks such as detection and segmentation.
C. Polyp Segmentation
Recent advances in deep learning have helped solve med-
ical semantic segmentation tasks effectively, including polyp
segmentation. However, it remains a challenging problem due to the characteristics of medical images and the fact that polyps come in different sizes, shapes, textures, and colors. U-net [20] is a well-known medical image segmentation model that uses skip connections to preserve local features between the encoder and decoder. Inspired by U-net, most models for medical image segmentation [21], [22], [23], [5], [24], [25] use an architecture containing a CNN backbone as the encoder and a decoder that fuses features at different scales to generate the segmentation result. PraNet [26] and SFA [27] focus on the distinction between polyp boundaries and the background to improve segmentation performance. CaraNet [28] designs an attention module for small medical object segmentation. Since the successful application of Transformers in computer vision, recent methods [29], [30], [8] have used the Transformer as a backbone for the polyp segmentation task and yielded promising results. Other methods [31], [7], [32] use both Transformer and CNN backbones for feature extraction and combine features from the two branches, thus enhancing the segmentation result.
III. PROPOSED METHOD
In this section, we describe the proposed LAPFormer in
detail. An overview of our model is presented in Fig. 1.
A. Encoder
We choose MiT (Mix Transformer) proposed in [19] as our
segmentation encoder for two main reasons. First, it consists of
a novel hierarchically structured Transformer encoder which
produces multi-level multi-scale feature outputs. Second, MiT
uses a convolutional kernel instead of Positional Encoding
(PE), thereby avoiding the performance drop that occurs when the testing resolution differs from the training resolution. Convolutional layers are argued to provide positional information for Transformers more effectively than other methods. Furthermore, MiT uses small image patches of size 4×4, which have been shown to benefit dense prediction tasks such as semantic segmentation.
Assume that we have an input image $X$ with spatial dimensions $H \times W \times 3$ (representing the height, width, and channels). MiT generates four feature levels $f_i$ with resolution $\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}} \times C_i$, where $i \in \{1, 2, 3, 4\}$ and $C_{i+1}$ is larger than $C_i$. MiT has six variants that share the same architecture but differ in size, from MiT-B0 to MiT-B5.
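For concreteness, the shape of this feature pyramid can be traced with a short sketch. The input resolution and per-scale channel widths below are illustrative assumptions (roughly matching a small MiT variant), not a definitive configuration:

```python
# Sketch of the multi-scale feature shapes an MiT-style hierarchical encoder
# produces for an H x W x 3 input; values below are illustrative assumptions.
import torch

H, W = 352, 352                    # assumed training resolution
channels = [32, 64, 160, 256]      # assumed channel widths C_1..C_4

x = torch.randn(1, 3, H, W)        # input image X

# Each level f_i has resolution (H / 2^{i+1}) x (W / 2^{i+1}) with C_i channels,
# so spatial size shrinks while channel width grows as i increases.
for i, c in enumerate(channels, start=1):
    h_i, w_i = H // 2 ** (i + 1), W // 2 ** (i + 1)
    print(f"f{i}: {c} x {h_i} x {w_i}")
# f1: 32 x 88 x 88, f2: 64 x 44 x 44, f3: 160 x 22 x 22, f4: 256 x 11 x 11
```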
B. Progressive Feature Fusion
Dense prediction is a well-known class of tasks in computer vision that includes object detection and semantic segmentation. Objects of vastly different scales can appear in the same image. Therefore, multi-scale features are heavily required
to generate good results. The most popular way to utilize
multi-scale features is to construct a feature pyramid network
(FPN) [33]. However, as pointed out in [34] and [35], there are large semantic gaps between features from non-adjacent scales: features two scale levels apart are weakly correlated, while features from adjacent scales are highly correlated.
In FPN, features from the upper scale are upsampled and then directly added to features from the lower scale. We argue that this is sub-optimal: when fusing features, they should be further processed before being passed to the next level. We propose the Progressive Feature Fusion module, which progressively fuses features from upper scales down to lower scales, thereby reducing the information gap between the low-resolution, semantically rich feature maps and the high-resolution, semantically poor ones.
Instead of fusing feature maps with an addition operation, we use concatenation. Feature maps from all scales are upsampled to (H/4, W/4). Then feature maps from the upper scale are progressively fused with those from the lower scale as follows:
$$F(x_i, x_{i-1}) = \mathrm{Linear}([x_i, x_{i-1}])$$
where $x_i$ denotes the features from scale $i$, $x_{i-1}$ denotes the features from one scale lower than $i$, $[\cdot]$ is the concatenation operation, and Linear is a fully-connected layer.
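The sketch below shows one way this fusion scheme could be implemented in PyTorch. The common embedding width, the per-scale input projections, and the use of 1×1 convolutions as the Linear layers are our assumptions for illustration and may differ from the actual LAPFormer implementation:

```python
# A minimal sketch of Progressive Feature Fusion under the stated assumptions:
# every scale is upsampled to (H/4, W/4), then fused top-down with
# F(x_i, x_{i-1}) = Linear([x_i, x_{i-1}]), where Linear acts on channels.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProgressiveFeatureFusion(nn.Module):
    def __init__(self, in_channels=(32, 64, 160, 256), embed_dim=128):
        super().__init__()
        # Project each scale to a common channel width (assumption).
        self.proj = nn.ModuleList(nn.Conv2d(c, embed_dim, 1) for c in in_channels)
        # One fusion layer per adjacent pair; a 1x1 conv is a per-pixel
        # fully-connected layer over the concatenated channels.
        self.fuse = nn.ModuleList(nn.Conv2d(2 * embed_dim, embed_dim, 1) for _ in range(3))

    def forward(self, feats):           # feats = [f1, f2, f3, f4], f1 has the highest resolution
        size = feats[0].shape[-2:]      # (H/4, W/4)
        x = [F.interpolate(p(f), size=size, mode="bilinear", align_corners=False)
             for p, f in zip(self.proj, feats)]
        fused = x[3]                    # start from the most semantic, lowest-resolution scale
        outs = [fused]
        for i in range(2, -1, -1):      # progressively fuse toward the highest-resolution scale
            fused = self.fuse[i](torch.cat([fused, x[i]], dim=1))
            outs.append(fused)
        return outs                     # all fused maps, each at (H/4, W/4)
```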
C. Aggregation for Prediction
In a conventional encoder-decoder architecture, after performing multi-scale feature fusion over the feature pyramid, other models [20], [36], [23], [29] often make predictions only on the last output feature map, which has the highest resolution (Fig. 2a). Others adopt auxiliary predictions during training [37], [28], but at test time they still predict only from the highest-resolution feature map (Fig. 2b).
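As a point of reference, the conventional schemes of Fig. 2a and Fig. 2b can be sketched as follows. This illustrates prior decoders, not our aggregation; the head structure and channel widths are assumptions:

```python
# Sketch of conventional prediction: a single head on the highest-resolution
# fused map (Fig. 2a), optionally with auxiliary heads used only during training (Fig. 2b).
import torch.nn as nn
import torch.nn.functional as F


class ConventionalPredictionHead(nn.Module):
    def __init__(self, embed_dim=128, num_classes=1, use_aux=True):
        super().__init__()
        self.main_head = nn.Conv2d(embed_dim, num_classes, 1)
        self.aux_heads = (nn.ModuleList(nn.Conv2d(embed_dim, num_classes, 1) for _ in range(3))
                          if use_aux else None)

    def forward(self, fused_maps, out_size):
        # fused_maps[-1] is the final, highest-resolution decoder feature map.
        main = F.interpolate(self.main_head(fused_maps[-1]), size=out_size,
                             mode="bilinear", align_corners=False)
        if self.training and self.aux_heads is not None:
            # Auxiliary predictions supervise intermediate maps during training only.
            aux = [F.interpolate(h(m), size=out_size, mode="bilinear", align_corners=False)
                   for h, m in zip(self.aux_heads, fused_maps[:-1])]
            return main, aux
        return main                     # at test time, only the highest-resolution map is used
```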
We argue that predicting only on the highest-resolution feature map is sub-optimal. This forces the highest-resolution feature map to carry a lot of information, but information