mainly focus on the construction of the localization head
and overlook the potential of improving the ViT features.
In this paper, we propose a simple but effective feature
enhancement method for existing ViT-based UOD frame-
works, named FORMULA. Our method consists of two
parts, i.e., the foreground guidance module and the multi-
layer feature fusion module. For the foreground guidance
module, we utilize the object mask predicted by an off-the-
shelf UOD detector to highlight the foreground object re-
gion and then refine the object location through an iterative
process. Specifically, we first generate an object mask from
the original ViT feature map using an existing UOD detec-
tor (e.g., LOST or TokenCut). Then, we construct a probability map with a 2D Gaussian distribution from the mask, which roughly localizes the foreground objects. After that,
we highlight the foreground area by applying the probabil-
ity map to the original ViT feature map. Finally, the updated
feature map is used for the UOD detector to obtain a refined
object mask. The whole process can be iterated. In this
way, we enhance the ViT feature map by introducing the
foreground object information and suppressing background
interference. Our method can localize objects much more
accurately with only a few iterations.
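As a concrete illustration, the sketch below (in PyTorch) mimics this iterative refinement. The `detector` callable standing in for LOST or TokenCut, the per-axis Gaussian fitted to the mask coordinates, and the default of two iterations are assumptions made for exposition, not the exact implementation.

```python
import torch

def gaussian_prob_map(mask, h, w):
    """Fit a 2D Gaussian to the foreground coordinates of a binary patch
    mask (shape [h, w]) and evaluate it on the full patch grid."""
    ys, xs = torch.nonzero(mask, as_tuple=True)
    ys, xs = ys.float(), xs.float()
    mu_y, mu_x = ys.mean(), xs.mean()
    var_y = ys.var(unbiased=False).clamp(min=1.0)
    var_x = xs.var(unbiased=False).clamp(min=1.0)
    gy, gx = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32),
                            indexing="ij")
    return torch.exp(-((gy - mu_y) ** 2 / (2 * var_y)
                       + (gx - mu_x) ** 2 / (2 * var_x)))

def foreground_guided_refinement(features, detector, h, w, num_iters=2):
    """features: [h*w, d] ViT patch tokens; detector: callable mapping a
    feature map to a binary [h, w] mask (e.g., LOST or TokenCut)."""
    mask = detector(features)
    for _ in range(num_iters):
        prob = gaussian_prob_map(mask, h, w)        # rough foreground prior
        weighted = features * prob.reshape(-1, 1)   # highlight foreground patches
        mask = detector(weighted)                   # refined object mask
    return mask
```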
In addition, we note that LOST and TokenCut use only the feature map from the last layer of the ViT. However, the scale of objects in non-object-centric images, such as those in COCO [31], can vary greatly. The features from the last layer of a pre-trained ViT mainly capture the regions that matter for classification, which usually correspond to larger scales, so performance on smaller objects suffers. To address this issue,
we propose the multi-layer feature fusion module. In detail,
we simply merge the features from the last several layers
through a weighted summation to aggregate multi-scale in-
formation for unsupervised object discovery.
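As a rough sketch, this fusion amounts to a weighted sum over the patch-token maps of the last few blocks; the function name, interface, and example weights below are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch

def fuse_last_layers(layer_feats, weights=None):
    """layer_feats: list of [h*w, d] patch-token maps from the last few
    ViT blocks; weights: per-layer fusion coefficients (uniform if None)."""
    if weights is None:
        weights = [1.0 / len(layer_feats)] * len(layer_feats)
    # Weighted summation aggregates multi-scale information across layers.
    return sum(w * f for w, f in zip(weights, layer_feats))

# Example (hypothetical): fuse the last three blocks with decaying weights.
# fused = fuse_last_layers([feat_l12, feat_l11, feat_l10], weights=[0.5, 0.3, 0.2])
```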
Our main contributions can be summarized as follows:
• We introduce foreground guidance from the object mask
predicted by an existing UOD detector to the original ViT
feature map and propose an iterative process to refine the
object location.
• We further design a multi-layer feature fusion module
to address the scale variation issues in object detection,
releasing the potential of ViT feature representation for
object discovery.
• The proposed method can be easily incorporated into any existing ViT-based UOD method and achieves new
state-of-the-art results on the popular PASCAL VOC and
COCO benchmarks.
2. Related Work
Self-supervised learning. Learning powerful feature
representations in a self-supervised manner that dispenses
with human annotation has made great progress recently.
This is performed by defining a pretext task that provides
surrogate supervision for feature learning [34, 59, 19].
Despite using no labels, models trained with self-supervision
have outperformed their supervised counterparts on several
downstream tasks, such as image classification [19, 55, 8,
23, 9, 5, 20, 10, 6, 50, 11, 22] and object detection [53, 7].
While [55, 8, 23, 5, 20, 10] have adopted CNNs as pre-training backbones, recent works [6, 50, 11, 22, 29, 28] have explored Transformers [46] for self-supervised visual learning, demonstrating their superiority over traditional CNNs.
Our work utilizes the strong localization capability of self-
supervised ViT for unsupervised object discovery.
Unsupervised object discovery. The goal of unsupervised
object discovery is to localize objects without any supervi-
sion. Early works generally rely on image features encoded
by a CNN [24, 48, 54, 49, 12]. These methods need to compare the extracted features of each image against those of every other image in the dataset, leading to quadratic computational overhead [49]. Moreover, their dependence on inter-image similarity prevents them from being applied to a single image. Recently, LOST [40] and TokenCut [52] have been proposed to address these issues by
leveraging the high-quality feature representation generated
from a self-supervised ViT. Their motivation is that the at-
tention map extracted from the last layer of a pre-trained
ViT contains explicit information about the foreground ob-
jects. Specifically, both methods first construct an intra-image similarity graph using features extracted from the ViT backbone [6, 14]. A heuristic seed-selection-expansion strategy and Normalized Cut [39] are then adopted by LOST and TokenCut, respectively, to segment a single foreground object in an image.
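For context, a minimal sketch of such an intra-image similarity graph built from ViT patch features is given below; the use of key features, cosine similarity, and a fixed threshold `tau` is an illustrative simplification, not the exact construction of either method.

```python
import torch
import torch.nn.functional as F

def patch_similarity_graph(patch_feats, tau=0.2):
    """patch_feats: [n, d] ViT patch features (e.g., keys of the last
    attention layer). Returns the pairwise cosine-similarity matrix and a
    thresholded adjacency matrix usable for graph-based segmentation."""
    f = F.normalize(patch_feats, dim=-1)
    sim = f @ f.t()              # cosine similarity between every pair of patches
    adj = (sim > tau).float()    # keep edges between sufficiently similar patches
    return sim, adj
```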
Although these methods achieve excellent performance, they mainly concentrate on localization design and do not further improve the ViT features for unsupervised object discovery. Instead, our work starts from the perspective of
feature enhancement. Concretely, we introduce the fore-
ground guidance to highlight the foreground regions on ViT
features and propose a multi-layer feature fusion module to
aggregate multi-scale features.
Multi-layer Feature Representations. One of the main
challenges in object detection is to effectively represent features at different scales. Extensive works have been proposed over the years to deal with the multi-scale problem using multi-layer features. These methods leverage the pyramidal features of CNNs to compute a multi-scale feature representation. [33, 21, 26, 1, 37, 18, 30] combine low-
resolution and high-resolution features with upsampling or
lateral connections to aggregate semantic information from
all levels. [32, 4, 30] make predictions at different scales
from different layers and use post-processing to filter the