Query Semantic Reconstruction for Background in Few-Shot Segmentation

Query Semantic Reconstruction for Background in Few-Shot
Segmentation
Haoyan Guan, Michael Spratling
Department of Informatics, King's College London, London, WC2B 4BG, UK
ARTICLE INFO
Keywords:
few-shot learning
semantic segmentation
ABSTRACT
Few-shot segmentation (FSS) aims to segment unseen classes using a few annotated samples.
Typically, a prototype representing the foreground class is extracted from annotated support image(s)
and is matched to features representing each pixel in the query image. However, models learnt in this
way are insufficiently discriminatory, and often produce false positives: misclassifying background
pixels as foreground. Some FSS methods try to address this issue by using the background in the
support image(s) to help identify the background in the query image. However, the backgrounds
of these images are often quite distinct, and hence, the support image background information is
uninformative. This article proposes a method, QSR, that extracts the background from the query
image itself, and as a result is better able to discriminate between foreground and background features
in the query image. This is achieved by modifying the training process to associate prototypes with
class labels including known classes from the training data and latent classes representing unknown
background objects. This class information is then used to extract a background prototype from
the query image. To successfully associate prototypes with class labels and extract a background
prototype that is capable of predicting a mask for the background regions of the image, the machinery
for extracting and using foreground prototypes is induced to become more discriminative between
different classes. Experiments achieve state-of-the-art results for both 1-shot and 5-shot FSS on the
PASCAL-5𝑖 and COCO-20𝑖 datasets. As QSR operates only during training, results are produced with
no extra computational complexity during testing.
1. Introduction
The ability to segment objects is a long-standing goal of
computer vision, and recent methods have achieved extraor-
dinary results (He, Zhang, Ren and Sun,2016;He, Deng,
Zhou, Wang and Qiao,2019;Long, Shelhamer and Darrell,
2015). These results depend on a large number of pixel-
level annotations which are time-consuming and costly to
produce. When facing the situation where few exemplars
from a novel class are available, these methods overfit and
perform poorly. To deal with this situation, few-shot seg-
mentation (FSS) methods aim to predict a segmentation
mask for a novel category using only a few images and their
corresponding segmentation ground-truths.
Most current FSS algorithms (Zhang, Lin, Liu, Yao and
Shen,2019b;Siam, Oreshkin and Jagersand,2019;Zhang,
Lin, Liu, Guo, Wu and Yao,2019a;Lu, He, Zhu, Zhang,
Song and Xiang,2021;Liu, Ding, Jiao, Ji and Ye,2021;
Li, Jampani, Sevilla-Lara, Sun, Kim and Kim,2021;Wu,
Shi, Lin and Cai,2021;Zhang, Xiao and Qin,2021) follow
a similar sequence of steps. Features are extracted from
support and query images by a shared convolutional neural
network (CNN) which is pre-trained on ImageNet (Rus-
sakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpa-
thy, Khosla, Bernstein et al.,2015;Yang, Liu, Li, Jiao and
Ye,2020;Siam, Doraiswamy, Oreshkin, Yao and Jagersand,
2020;Zhang et al.,2019b). Then the support image ground-
truth segmentation mask is used to identify the foreground
information in the support features. Generally, the object
haoyan.guan@kcl.ac.uk (H. Guan); michael.spratling@kcl.ac.uk
(M. Spratling)
class is represented by a single foreground prototype feature
vector (Wang, Liew, Zou, Zhou and Feng,2019;Yang et al.,
2020;Tian, Zhao, Shu, Yang, Li and Jia,2020;Zhang et al.,
2021;Li et al.,2021). Finally, a decoder is used to calculate
the similarity of the foreground prototype and every pixel
in the query feature-set to predict the locations occupied
by the foreground object in the query image. This standard
approach ignores the importance of background features that
can be mined for negative samples in order to reduce false-
positives, and hence, make the model more discriminative.
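The prototype-matching step shared by these methods can be sketched as follows. This is a minimal illustration, not the implementation of any particular paper; the feature dimensions are assumed for the example, and real decoders consume richer inputs than a bare similarity map.

```python
import numpy as np

def match_prototype(query_feat, prototype, eps=1e-8):
    """Cosine similarity between a foreground prototype and every
    query-pixel feature.

    query_feat: (d, h, w) query feature map
    prototype:  (d,) prototype pooled from the support foreground
    Returns an (h, w) similarity map; high values mark pixels the
    decoder would be inclined to label as the support class.
    """
    d, h, w = query_feat.shape
    feat = query_feat.reshape(d, h * w)
    # Normalize both sides so the dot product is a cosine similarity.
    feat = feat / (np.linalg.norm(feat, axis=0, keepdims=True) + eps)
    proto = prototype / (np.linalg.norm(prototype) + eps)
    return (proto @ feat).reshape(h, w)

sim_map = match_prototype(np.random.randn(256, 60, 60), np.random.randn(256))
print(sim_map.shape)  # (60, 60)
```

Thresholding this map already yields a crude foreground mask, which is why a prototype that ignores the query background invites the false positives discussed above.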
Some FSS methods (Yang et al.,2020;Boudiaf, Ker-
vadec, Masud, Piantanida, Ayed and Dolz,2021;Wang
et al.,2019) extract background information from support
images by using the support masks to identify the support
image background. RPMMs (Yang et al.,2020) uses the
Expectation-Maximization (EM) algorithm to mine more
background information in the support images. MLC (Yang,
Zhuo, Qi, Shi and Gao,2021) extracts a global background
prototype by averaging together the backgrounds extracted
from the whole training data in an offline process, then
updates this global background prototype with the support
background during training. However, the same category
object may appear against different backgrounds in differ-
ent images. The background information extracted from or
aligned with the support image(s) is, therefore, unlikely to
be useful for segmenting the query image. Existing FSS
methods ignore the fact that the background information of
an image is most relevant for segmenting that specific image.
In this paper, we are motivated by the issue illustrated
in Fig. 1 and design a method that can extract background
information from the query image itself to make existing
H. Guan, M. Spratling: Preprint submitted to Elsevier Page 1 of 11
arXiv:2210.12055v2 [cs.CV] 21 Dec 2022
Figure 1: Motivation for our method. Most previous FSS methods (as shown above the dashed line) use a decoder to classify
features of the query image, by comparing them to a foreground prototype extracted from the support image and mask. This
process often produces false positives: misclassifying the background (e.g. cat) as the foreground (e.g. dog). QSR (as shown below
the dashed line) uses background information extracted from the query image at training time to learn a more discriminative
decoder, which is achieved by semantic separation and foreground elimination.
FSS algorithms more discriminative. Our method, Query
Semantic Reconstruction (QSR), separates the feature ex-
tracted from a query image according to known classes and
latent classes. Known classes are the categories that appear
in the training data, like dog and cat in the example used
in Fig. 1. Latent classes are unknown categories like mat
and wall which are not explicitly labelled in the training
data, but which can appear in the background in the training
images. QSR learns to eliminate the foreground information
according to the class labels. The remaining classes are used
to define a prototype for the background of the query image
that excludes contributions from the foreground class.
The extracted foreground and background prototypes
are used as input to the prototype decoder module from
the underlying, baseline, FSS method. The decoder pro-
duces predictions of foreground and background masks.
The predictions are compared to a ground-truth mask and
the loss is used to tune the parameters of the model. For
these foreground and background prototypes to be effective
at identifying the foreground and background regions of
the query image, the whole model must be able to make
the prototypes discriminative of features representing dif-
ferent semantics in the images. Hence, our method trains
the underlying FSS method so that at test time it is able to
more accurately segment images. Our method only predicts
background masks during training to optimize the whole
model. Hence, during testing the method is identical to that
of the baseline.
The main contributions of our work are as follows:
1. To address the long-standing high false positive prob-
lem in FSS and to demonstrate that background in-
formation from the query image itself can be em-
ployed usefully for segmentation, we propose QSR
that can be applied to many existing FSS algorithms
to ensure they are better able to discriminate between
foreground and background objects.
2. QSR improves existing FSS methods through opti-
mized training. During testing our method is identical
to the baseline, so no additional parameters or extra
computation is needed at test-time.
3. We demonstrate the effectiveness of QSR using three
different baseline methods: CaNet (Zhang et al.,
2019b), ASGNet (Li et al.,2021) and PFENet (Tian
et al.,2020). For the PASCAL-5𝑖 dataset, QSR im-
proves mIOU results of 1-shot and 5-shot FSS by 1.0%
and 1.5% for CaNet, 1.8% and 2.1% for ASGNet, and
by 1.9% and 4.8% for PFENet. For the COCO-20𝑖
dataset, QSR improves ASGNet by 2.8% and 1.6%,
PFENet by 4.5% and 3.8%.
4. Our method achieves new state-of-the-art performance
on PASCAL-5𝑖, with mIOU of 62.7% in 1-shot,
and 66.7% in 5-shot. On the COCO-20𝑖 dataset, our
method achieves strong results of 36.9% in 1-shot, and
41.2% in 5-shot.
2. Related Work
Semantic segmentation. Semantic segmentation requires
the prediction of per-pixel class labels. The introduction
of end-to-end trained fully convolutional networks (Long
et al.,2015) has provided the foundation for recent success
on this task. Additional innovations to improve segmenta-
tion accuracy further have included a multi-scale cascade
model named U-Net (Ronneberger, Fischer and Brox,2015),
dilated convolution (Chen, Zhu, Papandreou, Schroff and
Adam,2018) and pyramid pooling (Zhao, Shi, Qi, Wang and
Jia,2017). In contrast to these methods, we explore semantic
segmentation in the few-shot scenario.
Few-shot learning. Few-shot learning (FSL) explores
methods to enable models to quickly adapt to perform
classification of new data. FSL methods can be categorized
into generation, optimization or metric learning approaches.
Generation methods (Hariharan and Girshick,2017;Wang,
Girshick, Hebert and Hariharan,2018;Chen, Fu, Zhang,
Jiang, Xue and Sigal,2019;Liu, Sun, Han, Dou and Li,
2020) generate samples or features to augment the novel
class data. Optimization approaches (Finn, Abbeel and
Levine,2017;Ravi and Larochelle,2017) learn commonali-
ties among different tasks, then a novel task can be fine-tuned
on a few annotated samples based on the commonalities.
Metric learning methods (Snell, Swersky and Zemel,2017;
Grant, Finn, Levine, Darrell and Griffiths,2018) learn to
produce a feature space that allows samples to be classified
by comparing the distance between their features. Most
FSL methods focus on image classification and cannot be
easily adapted to produce the per-pixel labels required for
segmentation.
Few-shot segmentation learning. The first FSS method
(Shaban, Bansal, Liu, Essa and Boots,2017) employed
a two-branch comparison framework that has become the
basis for FSS methods. PaNet (Wang et al.,2019) used
prototype feature-vectors to represent support object classes,
then compared their similarity with query features to make
predictions. Other methods have improved different aspects
of this process, for example, by extracting multiple proto-
types representing different semantic classes (Yang et al.,
2020;Li et al.,2021), by iteratively refining the predictions
(Zhang et al.,2019b), or using a training-free prior mask
generation method (Tian et al.,2020). Some methods extract
information not only from support images, mining latent
classes from the training dataset to search for more proto-
types (Yang et al.,2021), or supplementing prototypes with
support predictions (Zhang et al.,2021).
3. Problem Setting
Formally, we define a base dataset $\mathcal{D}_{base}$ with known
classes $\mathcal{C}_{known}$. The FSS task is to use $\mathcal{D}_{base}$ to train a model
which is able to segment new classes $\mathcal{C}_{novel}$, for which only a
few annotated examples are available. The key point of FSS
is that $\mathcal{C}_{novel} \cap \mathcal{C}_{known} = \emptyset$. Specifically, $\mathcal{D}_{base}$ is a large set of
image-mask pairs $\{(I_j, M_j)\}_{j=1}^{Num}$, where $M_j$ is the semantic
segmentation mask for the training image $I_j$, and $Num$ is
the number of image-mask pairs. During testing, the model
has access to a support set $S = \{(I_s^i, M_s^i)\}_{i=1}^{k} \in \mathcal{C}_{novel}$,
where $M_s^i$ is the semantic segmentation mask for support
image $I_s^i$, and $k$ is the number of image-mask pairs, which
is small (typically either 1 or 5 for 1-shot and 5-shot tasks
respectively). A query (or test) set $Q = (I_q, M_q) \in \mathcal{C}_{novel}$ is
used to evaluate the performance of the model, where $M_q$
is the ground-truth mask for image $I_q$. The model uses the
Figure 2: An overview of our method for 1-shot segmentation.
Like other FSS methods, our method extracts a foreground
prototype from the support image and uses this to predict
a foreground segmentation mask for the query image. QSR
(dashed box) operates at training time to learn to represent
different semantic categories in the query image, and uses
this class information to define a background prototype. The
background prototype is then used to predict a segmentation
mask for the background regions of the query image via
the same decoder as is used for the foreground prediction.
To improve the accuracy of this additional prediction, the
decoder is induced to become more discriminative. This ability
to discriminate between foreground and background objects
results in improved performance at test time, when the process
illustrated in the dashed region is not used.
support set $S$ to predict a segmentation mask, $\hat{M}_f$, for each
image $I_q$ in query set $Q$.
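The episodic setting above can be sketched with a toy sampler. This is a hypothetical helper purely for illustration; the name `images_by_class` and its structure are assumptions, not part of the paper.

```python
import random

def sample_episode(images_by_class, novel_classes, k=1):
    """Sample one k-shot FSS episode: a support set S and a query pair Q.

    images_by_class: dict mapping class label -> list of (image, mask) pairs
    novel_classes:   labels disjoint from the training classes C_known
    k:               number of support pairs (1 or 5 in the paper's setting)
    Returns (support_set, query_pair, class_label), all drawn from a
    single novel class so the support annotations describe the query class.
    """
    cls = random.choice(novel_classes)
    pairs = random.sample(images_by_class[cls], k + 1)  # distinct pairs
    return pairs[:k], pairs[k], cls
```

Evaluation repeats this sampling many times and averages the segmentation quality on the query pairs.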
4. Method
4.1. Overview
Fig. 2illustrates our method for 1-shot segmentation.
Both support and query images are input into a shared CNN.
In common with our baselines, CaNet (Zhang et al.,2019b),
ASGNet (Li et al.,2021) and PFENet (Tian et al.,2020), we
use a ResNet (He et al.,2016) pre-trained on ImageNet (Rus-
sakovsky et al.,2015) for this encoder backbone and choose
features generated by 𝑏𝑙𝑜𝑐𝑘2 and 𝑏𝑙𝑜𝑐𝑘3. All parameter
values in 𝑏𝑙𝑜𝑐𝑘2, 𝑏𝑙𝑜𝑐𝑘3, and earlier layers are fixed. These
features are concatenated and encoded using a convolution
layer. The convolution layer parameters are optimized by
the loss function (details in Section 4.3). For CaNet (Zhang
et al.,2019b) and ASGNet (Li et al.,2021), this layer has a
3 × 3 convolution kernel shared between support and query
branches. For PFENet (Tian et al.,2020), two independent
1 × 1 convolution layers are defined for support and query
features respectively. After the convolution layer, the CNN
produces support features $F_s$ and query features $F_q$ of size
$d \times h \times w$, where $d$ is the number of channels, and $h, w$ are
the height and width.
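As a rough sketch of this concatenate-and-fuse step (a 1 × 1 convolution over channels reduces to a matrix multiply per pixel; all dimensions here are illustrative, and the real models take channel counts from the frozen ResNet backbone):

```python
import numpy as np

def fuse_features(f2, f3, w, b):
    """Concatenate block2/block3 features and fuse with a 1x1 conv
    (the PFENet variant; CaNet/ASGNet instead share a 3x3 conv
    between the support and query branches).

    f2: (c2, h, w) block2 features
    f3: (c3, h, w) block3 features
    w:  (out_dim, c2 + c3) 1x1-conv weights
    b:  (out_dim,) bias
    Returns an (out_dim, h, w) fused feature map.
    """
    f = np.concatenate([f2, f3], axis=0)          # (c2 + c3, h, w)
    c, h, wd = f.shape
    # A 1x1 conv is a matmul applied independently at every pixel.
    out = w @ f.reshape(c, h * wd) + b[:, None]
    return out.reshape(-1, h, wd)
```

Only this fusion layer (and later modules) receives gradients; the backbone stays fixed, as stated above.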
As for the baseline methods (Zhang et al.,2019b;Li
et al.,2021;Tian et al.,2020), masked average pooling
(MAP) was used to extract the foreground prototype $P_f$:

$$P_f = \frac{\sum_{i=1}^{hw} F_s(i)\,[M_s(i) = 1]}{\sum_{i=1}^{hw} [M_s(i) = 1]} \qquad (1)$$
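Eq. (1) transcribes directly into code; the array shapes below are assumed for illustration:

```python
import numpy as np

def masked_average_pooling(feat, mask):
    """Masked average pooling (Eq. 1): average the support features
    over the foreground pixels only.

    feat: (d, h, w) support feature map F_s
    mask: (h, w) ground-truth mask M_s (1 = foreground)
    Returns the (d,) foreground prototype P_f.
    """
    fg = (mask == 1).astype(feat.dtype)
    num = (feat * fg[None]).sum(axis=(1, 2))  # sum of foreground features
    den = max(fg.sum(), 1.0)                  # number of foreground pixels
    return num / den
```

The Iverson bracket $[M_s(i) = 1]$ becomes the binary `fg` array, so background pixels contribute nothing to either the numerator or the denominator.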