Query Semantic Reconstruction for Background in Few-Shot Segmentation

Query Semantic Reconstruction for Background in Few-Shot
Segmentation
Haoyan Guan, Michael Spratling
Department of Informatics, King's College London, London, WC2B 4BG, UK
ARTICLE INFO
Keywords:
few-shot learning
semantic segmentation
ABSTRACT
Few-shot segmentation (FSS) aims to segment unseen classes using a few annotated samples.
Typically, a prototype representing the foreground class is extracted from annotated support image(s)
and is matched to features representing each pixel in the query image. However, models learnt in this
way are insufficiently discriminatory, and often produce false positives: misclassifying background
pixels as foreground. Some FSS methods try to address this issue by using the background in the
support image(s) to help identify the background in the query image. However, the backgrounds
of these images are often quite distinct, and hence, the support image background information is
uninformative. This article proposes a method, QSR, that extracts the background from the query
image itself, and as a result is better able to discriminate between foreground and background features
in the query image. This is achieved by modifying the training process to associate prototypes with
class labels including known classes from the training data and latent classes representing unknown
background objects. This class information is then used to extract a background prototype from
the query image. To successfully associate prototypes with class labels and extract a background
prototype that is capable of predicting a mask for the background regions of the image, the machinery
for extracting and using foreground prototypes is induced to become more discriminative between
different classes. Experiments achieve state-of-the-art results for both 1-shot and 5-shot FSS on the
PASCAL-5𝑖 and COCO-20𝑖 datasets. As QSR operates only during training, results are produced with
no extra computational complexity during testing.
1. Introduction
The ability to segment objects is a long-standing goal of
computer vision, and recent methods have achieved extraor-
dinary results (He, Zhang, Ren and Sun,2016;He, Deng,
Zhou, Wang and Qiao,2019;Long, Shelhamer and Darrell,
2015). These results depend on a large number of pixel-
level annotations which are time-consuming and costly to
produce. When facing the situation where few exemplars
from a novel class are available, these methods overfit and
perform poorly. To deal with this situation, few-shot seg-
mentation (FSS) methods aim to predict a segmentation
mask for a novel category using only a few images and their
corresponding segmentation ground-truths.
Most current FSS algorithms (Zhang, Lin, Liu, Yao and
Shen,2019b;Siam, Oreshkin and Jagersand,2019;Zhang,
Lin, Liu, Guo, Wu and Yao,2019a;Lu, He, Zhu, Zhang,
Song and Xiang,2021;Liu, Ding, Jiao, Ji and Ye,2021;
Li, Jampani, Sevilla-Lara, Sun, Kim and Kim,2021;Wu,
Shi, Lin and Cai,2021;Zhang, Xiao and Qin,2021) follow
a similar sequence of steps. Features are extracted from
support and query images by a shared convolutional neural
network (CNN) which is pre-trained on ImageNet (Rus-
sakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpa-
thy, Khosla, Bernstein et al.,2015;Yang, Liu, Li, Jiao and
Ye,2020;Siam, Doraiswamy, Oreshkin, Yao and Jagersand,
2020;Zhang et al.,2019b). Then the support image ground-
truth segmentation mask is used to identify the foreground
information in the support features. Generally, the object
haoyan.guan@kcl.ac.uk (H. Guan); michael.spratling@kcl.ac.uk
(M. Spratling)
class is represented by a single foreground prototype feature
vector (Wang, Liew, Zou, Zhou and Feng,2019;Yang et al.,
2020;Tian, Zhao, Shu, Yang, Li and Jia,2020;Zhang et al.,
2021;Li et al.,2021). Finally, a decoder is used to calculate
the similarity of the foreground prototype and every pixel
in the query feature-set to predict the locations occupied
by the foreground object in the query image. This standard
approach ignores the importance of background features that
can be mined for negative samples in order to reduce false-
positives, and hence, make the model more discriminative.
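The prototype-matching step shared by these methods can be sketched as follows. This is a minimal illustration, not the implementation of any particular paper; the feature dimensions are assumed for the example, and real decoders consume richer inputs than a bare similarity map.

```python
import numpy as np

def match_prototype(query_feat, prototype, eps=1e-8):
    """Cosine similarity between a foreground prototype and every
    query-pixel feature.

    query_feat: (d, h, w) query feature map
    prototype:  (d,) prototype pooled from the support foreground
    Returns an (h, w) similarity map; high values mark pixels the
    decoder would be inclined to label as the support class.
    """
    d, h, w = query_feat.shape
    feat = query_feat.reshape(d, h * w)
    # Normalize both sides so the dot product is a cosine similarity.
    feat = feat / (np.linalg.norm(feat, axis=0, keepdims=True) + eps)
    proto = prototype / (np.linalg.norm(prototype) + eps)
    return (proto @ feat).reshape(h, w)

sim_map = match_prototype(np.random.randn(256, 60, 60), np.random.randn(256))
print(sim_map.shape)  # (60, 60)
```

Thresholding this map already yields a crude foreground mask, which is why a prototype that ignores the query background invites the false positives discussed above.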
Some FSS methods (Yang et al.,2020;Boudiaf, Ker-
vadec, Masud, Piantanida, Ayed and Dolz,2021;Wang
et al.,2019) extract background information from support
images by using the support masks to identify the support
image background. RPMMs (Yang et al.,2020) uses the
Expectation-Maximization (EM) algorithm to mine more
background information in the support images. MLC (Yang,
Zhuo, Qi, Shi and Gao,2021) extracts a global background
prototype by averaging together the backgrounds extracted
from the whole training data in an offline process, then
updates this global background prototype with the support
background during training. However, the same category
object may appear against different backgrounds in differ-
ent images. The background information extracted from or
aligned with the support image(s) is, therefore, unlikely to
be useful for segmenting the query image. Existing FSS
methods ignore the fact that the background information of
an image is most relevant for segmenting that specific image.
In this paper, we are motivated by the issue illustrated
in Fig. 1 and design a method that can extract background
information from the query image itself to make existing
H. Guan, M. Spratling: Preprint submitted to Elsevier Page 1 of 11
arXiv:2210.12055v2 [cs.CV] 21 Dec 2022
Figure 1: Motivation for our method. Most previous FSS methods (as shown above the dashed line) use a decoder to classify
features of the query image, by comparing them to a foreground prototype extracted from the support image and mask. This
process often produces false positives: misclassifying the background (e.g. cat) as the foreground (e.g. dog). QSR (as shown below
the dashed line) uses background information extracted from the query image at training time to learn a more discriminative
decoder, which is achieved by semantic separation and foreground elimination.
FSS algorithms more discriminative. Our method, Query
Semantic Reconstruction (QSR), separates the feature ex-
tracted from a query image according to known classes and
latent classes. Known classes are the categories that appear
in the training data, like dog and cat in the example used
in Fig. 1. Latent classes are unknown categories like mat
and wall which are not explicitly labelled in the training
data, but which can appear in the background in the training
images. QSR learns to eliminate the foreground information
according to the class labels. The remaining classes are used
to define a prototype for the background of the query image
that excludes contributions from the foreground class.
The extracted foreground and background prototypes
are used as input to the prototype decoder module from
the underlying, baseline, FSS method. The decoder pro-
duces predictions of foreground and background masks.
The predictions are compared to a ground-truth mask and
the loss is used to tune the parameters of the model. For
these foreground and background prototypes to be effective
at identifying the foreground and background regions of
the query image, the whole model must be able to make
the prototypes discriminative of features representing dif-
ferent semantics in the images. Hence, our method trains
the underlying FSS method so that at test time it is able to
more accurately segment images. Our method only predicts
background masks during training to optimize the whole
model. Hence, during testing the method is identical to that
of the baseline.
The main contributions of our work are as follows:
1. To address the long-standing high false positive prob-
lem in FSS and to demonstrate that background in-
formation from the query image itself can be em-
ployed usefully for segmentation, we propose QSR
that can be applied to many existing FSS algorithms
to ensure they are better able to discriminate between
foreground and background objects.
2. QSR improves existing FSS methods through opti-
mized training. During testing our method is identical
to the baseline, so no additional parameters or extra
computation is needed at test-time.
3. We demonstrate the effectiveness of QSR using three
different baseline methods: CaNet (Zhang et al.,
2019b), ASGNet (Li et al.,2021) and PFENet (Tian
et al.,2020). For the PASCAL-5𝑖 dataset, QSR im-
proves mIOU results of 1-shot and 5-shot FSS by 1.0%
and 1.5% for CaNet, 1.8% and 2.1% for ASGNet, and
by 1.9% and 4.8% for PFENet. For the COCO-20𝑖
dataset, QSR improves ASGNet by 2.8% and 1.6%,
PFENet by 4.5% and 3.8%.
4. Our method achieves new state-of-the-art performance
on PASCAL-5𝑖, with mIOU of 62.7% in 1-shot,
and 66.7% in 5-shot. On the COCO-20𝑖 dataset, our
method achieves strong results of 36.9% in 1-shot, and
41.2% in 5-shot.
2. Related Work
Semantic segmentation. Semantic segmentation requires
the prediction of per-pixel class labels. The introduction
of end-to-end trained fully convolutional networks (Long
et al.,2015) has provided the foundation for recent success
on this task. Additional innovations to improve segmenta-
tion accuracy further have included a multi-scale cascade
model named U-Net (Ronneberger, Fischer and Brox,2015),
dilated convolution (Chen, Zhu, Papandreou, Schroff and
Adam,2018) and pyramid pooling (Zhao, Shi, Qi, Wang and
Jia,2017). In contrast to these methods, we explore semantic
segmentation in the few-shot scenario.
Few-shot learning. Few-shot learning (FSL) explores
methods to enable models to quickly adapt to perform
classification of new data. FSL methods can be categorized
into generation, optimization or metric learning approaches.
Generation methods (Hariharan and Girshick,2017;Wang,
Girshick, Hebert and Hariharan,2018;Chen, Fu, Zhang,
Jiang, Xue and Sigal,2019;Liu, Sun, Han, Dou and Li,
2020) generate samples or features to augment the novel
class data. Optimization approaches (Finn, Abbeel and
Levine,2017;Ravi and Larochelle,2017) learn commonali-
ties among different tasks, then a novel task can be fine-tuned
on a few annotated samples based on the commonalities.
Metric learning methods (Snell, Swersky and Zemel,2017;
Grant, Finn, Levine, Darrell and Griffiths,2018) learn to
produce a feature space that allows samples to be classified
by comparing the distance between their features. Most
FSL methods focus on image classification and cannot be
easily adapted to produce the per-pixel labels required for
segmentation.
Few-shot segmentation learning. The first FSS method
(Shaban, Bansal, Liu, Essa and Boots,2017) employed
a two-branch comparison framework that has become the
basis for FSS methods. PaNet (Wang et al.,2019) used
prototype feature-vectors to represent support object classes,
then compared their similarity with query features to make
predictions. Other methods have improved different aspects
of this process, for example, by extracting multiple proto-
types representing different semantic classes (Yang et al.,
2020;Li et al.,2021), by iteratively refining the predictions
(Zhang et al.,2019b), or using a training-free prior mask
generation method (Tian et al.,2020). Some methods extract
information not only from support images, mining latent
classes from the training dataset to search for more proto-
types (Yang et al.,2021), or supplementing prototypes with
support predictions (Zhang et al.,2021).
3. Problem Setting
Formally, we define a base dataset $\mathcal{D}_{base}$ with known
classes $\mathcal{C}_{known}$. The FSS task is to use $\mathcal{D}_{base}$ to train a model
which is able to segment new classes $\mathcal{C}_{novel}$, for which only a
few annotated examples are available. The key point of FSS
is that $\mathcal{C}_{novel} \cap \mathcal{C}_{known} = \emptyset$. Specifically, $\mathcal{D}_{base}$ is a large set of
image-mask pairs $\{(I_j, M_j)\}_{j=1}^{Num}$, where $M_j$ is the semantic
segmentation mask for the training image $I_j$, and $Num$ is
the number of image-mask pairs. During testing, the model
has access to a support set $S = \{(I_s^i, M_s^i)\}_{i=1}^{k} \in \mathcal{C}_{novel}$,
where $M_s^i$ is the semantic segmentation mask for support
image $I_s^i$, and $k$ is the number of image-mask pairs, which
is small (typically either 1 or 5 for 1-shot and 5-shot tasks
respectively). A query (or test) set $Q = (I_q, M_q) \in \mathcal{C}_{novel}$ is
used to evaluate the performance of the model, where $M_q$
is the ground-truth mask for image $I_q$. The model uses the
Figure 2: An overview of our method for 1-shot segmentation.
Like other FSS methods, our method extracts a foreground
prototype from the support image and uses this to predict
a foreground segmentation mask for the query image. QSR
(dashed box) operates at training time to learn to represent
different semantic categories in the query image, and uses
this class information to define a background prototype. The
background prototype is then used to predict a segmentation
mask for the background regions of the query image via
the same decoder as is used for the foreground prediction.
To improve the accuracy of this additional prediction, the
decoder is induced to become more discriminative. This ability
to discriminate between foreground and background objects
results in improved performance at test time, when the process
illustrated in the dashed region is not used.
support set $S$ to predict a segmentation mask, $\hat{M}_f$, for each
image $I_q$ in query set $Q$.
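The episodic setting above can be sketched with a toy sampler. This is a hypothetical helper purely for illustration; the name `images_by_class` and its structure are assumptions, not part of the paper.

```python
import random

def sample_episode(images_by_class, novel_classes, k=1):
    """Sample one k-shot FSS episode: a support set S and a query pair Q.

    images_by_class: dict mapping class label -> list of (image, mask) pairs
    novel_classes:   labels disjoint from the training classes C_known
    k:               number of support pairs (1 or 5 in the paper's setting)
    Returns (support_set, query_pair, class_label), all drawn from a
    single novel class so the support annotations describe the query class.
    """
    cls = random.choice(novel_classes)
    pairs = random.sample(images_by_class[cls], k + 1)  # distinct pairs
    return pairs[:k], pairs[k], cls
```

Evaluation repeats this sampling many times and averages the segmentation quality on the query pairs.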
4. Method
4.1. Overview
Fig. 2illustrates our method for 1-shot segmentation.
Both support and query images are input into a shared CNN.
In common with our baselines, CaNet (Zhang et al.,2019b),
ASGNet (Li et al.,2021) and PFENet (Tian et al.,2020), we
use a ResNet (He et al.,2016) pre-trained on ImageNet (Rus-
sakovsky et al.,2015) for this encoder backbone and choose
features generated by 𝑏𝑙𝑜𝑐𝑘2 and 𝑏𝑙𝑜𝑐𝑘3. All parameter
values in 𝑏𝑙𝑜𝑐𝑘2, 𝑏𝑙𝑜𝑐𝑘3, and earlier layers are fixed. These
features are concatenated and encoded using a convolution
layer. The convolution layer parameters are optimized by
the loss function (details in Section 4.3). For CaNet (Zhang
et al.,2019b) and ASGNet (Li et al.,2021), this layer has a
3 × 3 convolution kernel shared between support and query
branches. For PFENet (Tian et al.,2020), two independent
1 × 1 convolution layers are defined for support and query
features respectively. After the convolution layer, the CNN
produces support features $F_s$ and query features $F_q$ of size
$d \times h \times w$, where $d$ is the number of channels, and $h, w$ are
the height and width.
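As a rough sketch of this concatenate-and-fuse step (a 1 × 1 convolution over channels reduces to a matrix multiply per pixel; all dimensions here are illustrative, and the real models take channel counts from the frozen ResNet backbone):

```python
import numpy as np

def fuse_features(f2, f3, w, b):
    """Concatenate block2/block3 features and fuse with a 1x1 conv
    (the PFENet variant; CaNet/ASGNet instead share a 3x3 conv
    between the support and query branches).

    f2: (c2, h, w) block2 features
    f3: (c3, h, w) block3 features
    w:  (out_dim, c2 + c3) 1x1-conv weights
    b:  (out_dim,) bias
    Returns an (out_dim, h, w) fused feature map.
    """
    f = np.concatenate([f2, f3], axis=0)          # (c2 + c3, h, w)
    c, h, wd = f.shape
    # A 1x1 conv is a matmul applied independently at every pixel.
    out = w @ f.reshape(c, h * wd) + b[:, None]
    return out.reshape(-1, h, wd)
```

Only this fusion layer (and later modules) receives gradients; the backbone stays fixed, as stated above.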
As for the baseline methods (Zhang et al.,2019b;Li
et al.,2021;Tian et al.,2020), masked average pooling
(MAP) was used to extract the foreground prototype $P_f$:

$$P_f = \frac{\sum_{i=1}^{hw} F_s(i)\,[M_s(i) = 1]}{\sum_{i=1}^{hw} [M_s(i) = 1]} \qquad (1)$$
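Eq. (1) transcribes directly into code; the array shapes below are assumed for illustration:

```python
import numpy as np

def masked_average_pooling(feat, mask):
    """Masked average pooling (Eq. 1): average the support features
    over the foreground pixels only.

    feat: (d, h, w) support feature map F_s
    mask: (h, w) ground-truth mask M_s (1 = foreground)
    Returns the (d,) foreground prototype P_f.
    """
    fg = (mask == 1).astype(feat.dtype)
    num = (feat * fg[None]).sum(axis=(1, 2))  # sum of foreground features
    den = max(fg.sum(), 1.0)                  # number of foreground pixels
    return num / den
```

The Iverson bracket $[M_s(i) = 1]$ becomes the binary `fg` array, so background pixels contribute nothing to either the numerator or the denominator.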