LAPFormer: A Light and Accurate Polyp
Segmentation Transformer
Mai Nguyen, Tung Thanh Bui, Quan Van Nguyen, Thanh Tung Nguyen, Toan Van Pham
R&D Lab, Sun* Inc
{nguyen.mai, bui.thanh.tung, nguyen.van.quan, nguyen.tung.thanh, pham.van.toan}@sun-asterisk.com
Abstract—Polyp segmentation is still known as a difficult problem due to the large variety of polyp shapes and of scanning and labeling modalities, which prevents deep learning models from generalizing well to unseen data. Recently, however, Transformer-based approaches have achieved remarkable performance: they extract global context better than CNN-based architectures, which in turn leads to better generalization. To leverage this strength of Transformers, we propose a new encoder-decoder model named LAPFormer, which uses a hierarchical Transformer encoder to extract global features and combines it with our novel CNN (Convolutional Neural Network) decoder to capture the local appearance of polyps. The proposed decoder contains a progressive feature fusion module designed to fuse features from upper and lower scales and to make multi-scale features more correlated. In addition, we use a feature refinement module and a feature selection module for feature processing. We test our model on five popular polyp segmentation benchmarks: Kvasir, CVC-ClinicDB, CVC-ColonDB, CVC-T, and ETIS-Larib.
Index Terms—Polyp Segmentation, Deep Learning
I. INTRODUCTION
A. Overview
Colorectal cancer (CRC) is among the most common cancers worldwide [1]. Colonoscopy has long been recognised as the standard diagnostic procedure for the early detection of colorectal cancer, and several deep learning methods have therefore been proposed to help clinical systems identify colonic polyps. Among these, segmentation-based approaches are widely considered the most suitable and have recently shown promising results. However, colonoscopy has limitations: previous reports estimate that about 18% of polyps are missed during diagnosis [2], [3], because the procedure is operator-driven and depends entirely on the knowledge and skill of the endoscopist. With current colonoscopy equipment, less experienced endoscopists can fail to distinguish polyp regions during examinations [4]. More importantly, previous research has shown that increasing polyp detection accuracy by 1% reduces colorectal cancer risk by approximately 3%. Tools that improve polyp detectability and provide robust segmentation are therefore important.
Recently, with the vigorous development of deep learning, accuracy has improved on many classical problems, including image segmentation. Various studies have aimed to develop CADx models for automatic polyp segmentation, and a few have built models specific to the task. HarDNet-MSEG [5] is one of them: an encoder-decoder architecture based on the HarDNet [6] backbone that achieves high performance on the Kvasir-SEG dataset at up to 86 FPS. AG-ResUNet++ improved UNet++ with attention gates and a ResNet backbone. Another study, TransFuse, combined Transformer and CNN branches using a BiFusion module [7]. ColonFormer [8] used a MiT backbone, an UPer decoder, and residual axial reverse attention to further boost polyp segmentation accuracy. NeoUNet [9] and BlazeNeo [10] proposed effective encoder-decoder networks for polyp segmentation and neoplasm detection. Overall, research on model architecture remains a promising direction.
Among recent deep learning architectures, Transformer-based ones have attracted the most attention. For efficient semantic segmentation, combining the advantages of a hierarchical Transformer encoder with a suitable decoder head has been widely studied. In this paper, we use a Transformer backbone as the encoder and propose a novel, light, and accurate decoder head for the polyp segmentation task. The proposed decoder head consists of a light feature fusion module that efficiently reduces the semantic gap between features from two scale levels, together with a feature selection module and a feature refinement module that help process the backbone features and calibrate the features coming from the progressive feature fusion module before prediction.
B. Our contributions
Our main contributions are:
• We propose a Light and Accurate Polyp Segmentation Transformer, called LAPFormer, that integrates a hierarchical Transformer backbone as the encoder.
• A novel decoder for LAPFormer, which leverages multi-scale features and consists of a Feature Refinement Module and a Feature Selection Module to produce fine polyp segmentation masks.
• Extensive experiments indicate that LAPFormer achieves state-of-the-art results on CVC-ColonDB [11] and competitive results on other well-known polyp segmentation benchmarks, at lower computational complexity than other Transformer-based methods.
II. RELATED WORKS
A. Semantic Segmentation
Semantic segmentation is one of the essential tasks in computer vision: every pixel in the image must be classified. Deep learning has had an enormous impact on computer vision, including semantic segmentation. Many models are based on fully convolutional networks (FCNs) [12], in which the encoder gradually reduces the spatial resolution and captures more semantic context of an image with larger receptive fields. However, CNNs still have limited receptive fields, which leads to missing context. To overcome this limitation, PSPNet [13] proposed a Pyramid Pooling Module to enhance global context, and DeepLab [14] utilized atrous convolution to expand receptive fields.
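As a brief illustration (ours, not from any of the cited papers), a dilated 3×3 convolution enlarges the receptive field without adding parameters; at dilation 2 it covers a 5×5 neighborhood with only nine weights per channel pair:

```python
import torch
import torch.nn as nn

# A 3x3 convolution with dilation=2 samples a 5x5 neighborhood
# (receptive field 5) using only 3x3 = 9 weights per channel pair.
# padding=dilation keeps the spatial resolution unchanged.
atrous = nn.Conv2d(in_channels=64, out_channels=64,
                   kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, 64, 88, 88)   # an example feature map
y = atrous(x)
print(y.shape)                   # torch.Size([1, 64, 88, 88])
```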
B. Vision Transformer
Transformer [15] is a deep neural network architecture originally proposed for machine translation in natural language processing. It now has a huge influence on natural language processing and beyond, including computer vision. Vision Transformer (ViT) [16] was the first model to successfully apply a Transformer to computer vision, by dividing an image into patches, treating each patch as a token, and feeding the token sequence into a Transformer. Following this success, PVT [17], Swin [18], and SegFormer [19] were designed as hierarchical Transformers that generate feature maps at different scales to enhance local features, improving performance on dense prediction tasks such as detection and segmentation.
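To make the tokenization step concrete, the sketch below implements ViT-style patch embedding; the 224×224 input and 768-dimensional embedding follow the common ViT-Base configuration, and the variable names are our own:

```python
import torch
import torch.nn as nn

# ViT-style tokenization: split the image into non-overlapping 16x16
# patches and linearly project each patch to an embedding vector.
# This is equivalent to a Conv2d with kernel_size = stride = patch size.
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)

img = torch.randn(1, 3, 224, 224)
tokens = patch_embed(img)                    # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 196, 768): 196 patch tokens
print(tokens.shape)
```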
C. Polyp segmentation
Recent advances in deep learning have made medical semantic segmentation tasks, including polyp segmentation, far more tractable. The problem remains challenging, however, because of the characteristics of medical images and because polyps come in different sizes, shapes, textures, and colors. U-Net [20] is a well-known medical image segmentation model that uses skip connections to preserve local features between encoder and decoder. Inspired by U-Net, most medical image segmentation models [21], [22], [23], [5], [24], [25] use an architecture with a CNN backbone as the encoder and a decoder that fuses features at different scales to generate the segmentation result. PraNet [26] and SFA [27] focus on the distinction between polyp boundary and background to improve segmentation performance. CaraNet [28] designs an attention module for segmenting small medical objects. Since the successful application of Transformers in computer vision, recent methods [29], [30], [8] have used Transformer backbones for the polyp segmentation task and yielded promising results. Other methods [31], [7], [32] use both a Transformer and a CNN backbone for feature extraction and combine features from the two branches to enhance the segmentation result.
III. PROPOSED METHOD
In this section, we describe the proposed LAPFormer in
detail. An overview of our model is presented in Fig 1.
A. Encoder
We choose MiT (Mix Transformer), proposed in [19], as our segmentation encoder for two main reasons. First, it is a hierarchically structured Transformer encoder that produces multi-level, multi-scale feature outputs. Second, MiT uses a convolutional kernel instead of Positional Encoding (PE), thereby avoiding the performance drop that occurs when the test resolution differs from the training resolution; convolutional layers are argued to be better suited than other methods for injecting location information into Transformers. Furthermore, MiT uses small image patches of size 4×4, which have been shown to favor dense prediction tasks such as semantic segmentation.
Assume an input image X with dimensions $H \times W \times 3$ (height, width, and channels). MiT generates feature maps $f_i$ at four levels with resolution $\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}} \times C_i$, where $i \in \{1, 2, 3, 4\}$ and $C_{i+1} > C_i$. MiT has six variants, MiT-B0 to MiT-B5, which share the same architecture but differ in size.
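To make the pyramid concrete, the snippet below (our illustration, not the authors' code) prints the four feature shapes for an assumed 352×352 input, using channel widths commonly reported for MiT-B1; both numbers are assumptions for the example.

```python
# Illustrative only: shapes of the four MiT feature levels f_i.
# f_i has resolution H / 2^(i+1) x W / 2^(i+1) with C_i channels.
H = W = 352                      # assumed input size
C = [64, 128, 320, 512]          # assumed channel widths (MiT-B1-like)

for i in range(1, 5):
    s = 2 ** (i + 1)
    print(f"f{i}: {H // s} x {W // s} x {C[i - 1]}")
# f1: 88 x 88 x 64
# f2: 44 x 44 x 128
# f3: 22 x 22 x 320
# f4: 11 x 11 x 512
```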
B. Progressive Feature Fusion
Dense prediction is a well-known family of computer vision tasks that includes object detection and semantic segmentation. Objects of vastly different scales can appear in the same picture, so multi-scale features are essential for generating good results. The most popular way to exploit multi-scale features is to construct a feature pyramid network (FPN) [33]. However, as pointed out in [34] and [35], there are large semantic gaps between features from non-adjacent scales: features two scale levels apart are weakly correlated, while features from adjacent scales are highly correlated.
In an FPN, features from the upper scale are upscaled and then directly added to the features from the lower scale. We argue that this is sub-optimal: when fusing features, they should be further processed before being merged into the next level. We propose the Progressive Feature Fusion module, which progressively fuses features from upper scales down to lower scales, thereby reducing the information gap between the low-resolution, semantically rich feature maps and the high-resolution, semantically weak ones.
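For reference, the standard FPN top-down step criticized above can be written as $P_i = \mathrm{Conv}_{1\times 1}(f_i) + \mathrm{Upsample}(P_{i+1})$: the upper-scale map $P_{i+1}$ is upscaled and added element-wise to a lateral projection of $f_i$, with no further processing before the fusion.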
Instead of fusing feature maps with an addition operation, we use concatenation. Feature maps at all scales are first upsampled to (H/4, W/4); feature maps from the upper scale are then progressively fused with the lower scale as follows:
$$F(x_i, x_{i-1}) = \mathrm{Linear}([x_i, x_{i-1}])$$

where $x_i$ denotes the features from scale $i$, $x_{i-1}$ the features from one scale lower than $i$, $[\cdot, \cdot]$ the concatenation operation, and Linear a fully-connected layer.
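A minimal PyTorch sketch of this fusion chain follows. It is our reading of the formula above, under stated assumptions: the module name, the output width c_out, and the choice of bilinear upsampling are ours, not details from the paper. Note that a fully-connected layer applied per pixel over channels is equivalent to a 1×1 convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveFusion(nn.Module):
    """Sketch of F(x_i, x_{i-1}) = Linear([x_i, x_{i-1}]).
    Channel widths follow the assumed MiT-B1-like pyramid; c_out is a guess."""
    def __init__(self, channels=(64, 128, 320, 512), c_out=128):
        super().__init__()
        # A 'Linear' layer over channels acts per pixel, i.e. a 1x1 convolution.
        self.fuse = nn.ModuleList()
        c_prev = channels[-1]
        for c in reversed(channels[:-1]):
            self.fuse.append(nn.Conv2d(c_prev + c, c_out, kernel_size=1))
            c_prev = c_out

    def forward(self, feats):               # feats = [f1, f2, f3, f4]
        size = feats[0].shape[-2:]          # common (H/4, W/4) grid
        feats = [F.interpolate(f, size=size, mode="bilinear",
                               align_corners=False) for f in feats]
        x = feats[-1]                       # start from the top scale f4
        outs = [x]
        for conv, lower in zip(self.fuse, reversed(feats[:-1])):
            x = conv(torch.cat([x, lower], dim=1))  # fuse with one scale below
            outs.append(x)
        return outs                          # progressively fused features
```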
C. Aggregation for Prediction
In a conventional encoder-decoder architecture, after performing multi-scale feature fusion over the feature pyramid, most models [20], [36], [23], [29] make the prediction from the last output feature map, which has the highest resolution (Fig 2a). Others adopt auxiliary predictions during training [37], [28], but at test time they still predict only from the highest-resolution feature map (Fig 2b).
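For context, here is a minimal sketch of that auxiliary-prediction scheme (commonly called deep supervision); the class name, channel widths, and head design are illustrative, not taken from any of the cited models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxHeads(nn.Module):
    """Illustrative deep-supervision heads: one 1x1 prediction head per
    scale. feats are assumed ordered from coarsest to finest, so the
    last head is the highest-resolution one kept at test time."""
    def __init__(self, channels=(512, 320, 128, 64), n_classes=1):
        super().__init__()
        self.heads = nn.ModuleList(nn.Conv2d(c, n_classes, 1) for c in channels)

    def forward(self, feats, out_size):
        # Predict at every scale and upsample all logits to the output size.
        logits = [F.interpolate(h(f), size=out_size, mode="bilinear",
                                align_corners=False)
                  for h, f in zip(self.heads, feats)]
        if self.training:
            return logits       # every scale is supervised during training
        return logits[-1]       # test time: highest-resolution prediction only
```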
We argue that predicting only on the highest-resolution feature maps is sub-optimal. This makes the highest-resolution feature maps carry a lot of information, but information