Foreground Guidance and Multi-Layer Feature Fusion for Unsupervised Object
Discovery with Transformers
Zhiwei Lin∗   Zengyu Yang∗†   Yongtao WangB
Wangxuan Institute of Computer Technology, Peking University
zwlin@pku.edu.cn, yzyysj@gmail.com, wyt@pku.edu.cn
Abstract
Unsupervised object discovery (UOD) has recently
shown encouraging progress with the adoption of pre-
trained Transformer features. However, current methods
based on Transformers mainly focus on designing the local-
ization head (e.g., seed selection-expansion and normalized
cut) and overlook the importance of improving Transformer
features. In this work, we handle the UOD task from the per-
spective of feature enhancement and propose FOReground
guidance and MUlti-LAyer feature fusion for unsupervised
object discovery, dubbed FORMULA. Firstly, we present
a foreground guidance strategy with an off-the-shelf UOD
detector to highlight the foreground regions on the feature
maps and then refine object locations in an iterative fashion.
Moreover, to solve the scale variation issues in object detec-
tion, we design a multi-layer feature fusion module that ag-
gregates features responding to objects at different scales.
The experiments on VOC07, VOC12, and COCO 20k show
that the proposed FORMULA achieves new state-of-the-art
results on unsupervised object discovery. The code will be
released at https://github.com/VDIGPKU/FORMULA.
1. Introduction
Object detection is one of the fundamental problems in
computer vision, which serves a wide range of applica-
tions such as face recognition [42], pose estimation [57],
and autonomous driving [44]. In recent years, significant
success has been achieved in the field [36, 35] thanks to
the increasing amount of annotated training data. However,
the labeling of large-scale datasets [31, 38] is rather costly.
Although multiple techniques, including semi-supervised
learning [2], weakly-supervised learning [17], and self-
supervised learning [19] have been proposed to alleviate
this issue, manual labeling is still required.
Here we concentrate on a fully unsupervised task for ob-
ject detection, named unsupervised object discovery. Pre-
∗Equal contribution. †As an intern at PKU. B Corresponding author.
(a) LOST (b) TokenCut
(c) FORMULA-L (ours) (d) FORMULA-TC (ours)
Figure 1. Example results of UOD on VOC12. In (a) and (b), we
show results obtained by LOST [40] and TokenCut [52] respec-
tively, which are two state-of-the-art UOD methods. The results
of our method are presented in (c) and (d). Predictions are in red,
and ground-truth boxes are in yellow. The proposed FORMULA localizes objects more accurately. Best viewed
in color.
vious CNN-based methods [13, 47, 48, 54] leverage region
proposals and localize objects by comparing the proposed
bounding boxes of each image across a whole image col-
lection. However, these approaches are difficult to scale to
large datasets due to the quadratic complexity brought by
the comparison process [49]. Recently, DINO [6] has found
that the attention maps of a Vision Transformer (ViT) [14]
pre-trained with self-supervised learning reveal salient fore-
ground regions. Motivated by DINO, LOST [40] and To-
kenCut [52] are proposed to discover objects by leverag-
ing the high-quality ViT features. Both methods first con-
struct an undirected graph using the similarity of patch-wise
features from the last layer of the ViT. Then, a two-step
seed-selection-expansion strategy and Normalized Cut [39]
are adopted respectively to segment the foreground objects.
While both approaches have achieved superior results over
previous state-of-the-art [48, 49], we have found that they
mainly focus on the construction of the localization head
and overlook the potential of improving the ViT features.
In this paper, we propose a simple but effective feature
enhancement method for existing ViT-based UOD frame-
works, named FORMULA. Our method consists of two
parts, i.e., the foreground guidance module and the multi-
layer feature fusion module. For the foreground guidance
module, we utilize the object mask predicted by an off-the-
shelf UOD detector to highlight the foreground object re-
gion and then refine the object location through an iterative
process. Specifically, we first generate an object mask from
the original ViT feature map using an existing UOD detec-
tor (e.g., LOST or TokenCut). Then, we construct a prob-
ability map with 2D Gaussian distribution from the mask,
which roughly localizes the foreground objects. After that,
we highlight the foreground area by applying the probabil-
ity map to the original ViT feature map. Finally, the updated
feature map is used for the UOD detector to obtain a refined
object mask. The whole process can be iterated. In this
way, we enhance the ViT feature map by introducing the
foreground object information and suppressing background
interference. Our method can localize objects much more
accurately with only a few iterations.
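For illustration, this refinement loop can be sketched as follows; the `uod_detector` interface and the multiplicative way the Gaussian map re-weights the features are assumptions made for the sketch rather than the exact implementation (the map itself is formalized in Sec. 3.2):

```python
import numpy as np

def refine_with_foreground_guidance(feats, uod_detector, sigma=4.0, num_iters=2):
    """feats: (h, w, d) ViT patch features; uod_detector is any off-the-shelf UOD
    method (e.g., a LOST or TokenCut wrapper) mapping such a feature map to an
    (h, w) binary object mask.  Returns the refined mask."""
    h, w, _ = feats.shape
    mask = uod_detector(feats)                        # initial mask from the original features
    yy, xx = np.mgrid[0:h, 0:w]
    for _ in range(num_iters):
        ys, xs = np.nonzero(mask)
        cy, cx = ys.mean(), xs.mean()                 # rough object center from the mask
        prob = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * sigma ** 2))
        mask = uod_detector(feats * prob[..., None])  # re-run on the highlighted features
    return mask
```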
Besides, we note that LOST and TokenCut only use
the feature map from the last layer of ViT. However, the
scale of objects in non-object-centric images, like those in
COCO [31], can vary greatly. The feature from the last layer
of a pre-trained ViT mainly captures the key areas for clas-
sification, which usually appear at larger scales. Thus, the performance on smaller objects suffers. To address this issue,
we propose the multi-layer feature fusion module. In detail,
we simply merge the features from the last several layers
through a weighted summation to aggregate multi-scale in-
formation for unsupervised object discovery.
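A minimal sketch of this fusion is given below, assuming the per-layer patch features have already been collected from the ViT; the choice of layers and the weights are illustrative placeholders, not the settings used in the paper:

```python
import numpy as np

def fuse_layers(layer_feats, weights):
    """Weighted sum of same-shaped patch feature maps taken from the last few
    ViT layers.  layer_feats: list of (N, D) arrays; weights: one scalar per layer."""
    assert len(layer_feats) == len(weights)
    return sum(w * f for w, f in zip(weights, layer_feats))

# Placeholder example: mix the last three layers, emphasizing the final one.
# fused = fuse_layers([feats_l10, feats_l11, feats_l12], weights=[0.2, 0.3, 0.5])
```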
Our main contributions can be summarized as follows:
• We introduce foreground guidance from the object mask predicted by an existing UOD detector to the original ViT feature map and propose an iterative process to refine the object location.
• We further design a multi-layer feature fusion module to address the scale variation issues in object detection, releasing the potential of ViT feature representation for object discovery.
• The proposed method can be easily incorporated into any existing ViT-based UOD method and achieves new state-of-the-art results on the popular PASCAL VOC and COCO benchmarks.
2. Related Work
Self-supervised learning. Learning powerful feature
representations in a self-supervised manner that dispenses
with human annotation has made great progress recently.
This is performed by defining a pretext task that provides
surrogate supervision for feature learning [34, 59, 19].
Despite no labels, models trained with self-supervision
have outperformed their supervised counterparts on several
downstream tasks, such as image classification [19, 55, 8,
23, 9, 5, 20, 10, 6, 50, 11, 22] and object detection [53, 7].
While [55, 8, 23, 5, 20, 10] have adopted CNN as pre-
training backbones, recent works [6, 50, 11, 22, 29, 28] have
explored Transformers [46] for self-supervised visual learn-
ing, demonstrating their superiority over traditional CNNs.
Our work utilizes the strong localization capability of self-
supervised ViT for unsupervised object discovery.
Unsupervised object discovery. The goal of unsupervised
object discovery is to localize objects without any supervi-
sion. Early works generally rely on image features encoded
by a CNN [24, 48, 54, 49, 12]. These methods need to com-
pare the extracted features of each image across those of
every other image in the dataset, leading to quadratic com-
putational overhead [49]. Besides, the dependence on the
inter-image similarity results in these methods being unable
to run on a single image. Recently, LOST [40] and To-
kenCut [52] have been proposed to address these issues by
leveraging the high-quality feature representation generated
from a self-supervised ViT. Their motivation is that the at-
tention map extracted from the last layer of a pre-trained
ViT contains explicit information about the foreground ob-
jects. Specifically, both methods first propose the construc-
tion of an intra-image similarity graph using features ex-
tracted from the ViT backbone [6, 14]. A heuristic seed-
selection-expansion strategy and Normalized Cut [39] are
then adopted respectively by the two methods to segment a
single foreground object in an image.
Although achieving excellent performance, these meth-
ods mainly concentrate on localization design and fail to
further improve ViT features for unsupervised object dis-
covery. Instead, our work starts from the perspective of
feature enhancement. Concretely, we introduce the fore-
ground guidance to highlight the foreground regions on ViT
features and propose a multi-layer feature fusion module to
aggregate multi-scale features.
Multi-layer Feature Representations. One of the main
challenges in object detection is to effectively represent fea-
tures on different scales. Extensive works have been pro-
posed over the years to deal with the multi-scale problem
with multi-layer features. These methods leverage the pyra-
midal features of CNN to compute a multi-scale feature
representation. [33, 21, 26, 1, 37, 18, 30] combine low-
resolution and high-resolution features with upsampling or
lateral connections to aggregate semantic information from
all levels. [32, 4, 30] make predictions at different scales
from different layers and use post-processing to filter the
Figure 2. The pipeline of FORMULA. To enhance ViT features for unsupervised object discovery, we extract multi-layer features to aggregate information from different scales and introduce foreground guidance from the predicted segmentation to the input of the UOD detector.
final predictions. In addition to CNN, several works have
recently exploited the multi-layer representation for Trans-
former networks. [51] aggregates the class tokens from each
Transformer layer to gather the local, low-level, and middle-
level information that is crucial for fine-grained visual cat-
egorization. In [56], the authors divide the multi-layer rep-
resentations of the Transformer’s encoder and decoder into
different groups and then fuse these group features to fully
leverage the low-level and high-level features for neural ma-
chine translation.
These works inspire us to explore the multi-layer fea-
tures of ViT for better object localization. Instead of de-
signing complicated fusion modules, we propose a simple
and efficient fusion method that sums the features from each
layer of the Transformer with different weights.
3. Approach
In this section, we introduce our method for unsu-
pervised object discovery, i.e., FORMULA. The overall
pipeline of FORMULA is presented in Fig. 2. Firstly, we
briefly review Vision Transformers and their previous ap-
plications in UOD as preliminary knowledge. Then, we de-
scribe the two modules of FORMULA, namely foreground
guidance and multi-layer feature fusion.
3.1. Preliminary
Vision Transformers [14] receive a sequence of image patches and use stacked multi-head self-attention blocks to extract feature maps from images. A ViT divides an input image of size $H \times W$ into a sequence of $N = HW/P^2$ patches of fixed resolution $P \times P$. Patch embeddings are then formed by mapping the flattened patches to a $D$-dimensional latent space with a trainable linear projection. An extra learnable [CLS] token is attached to the patch embeddings, and position embeddings are added to form the standard Transformer input in $\mathbb{R}^{(N+1) \times D}$.
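For concreteness, a standard patchification layer in the spirit of [14] is sketched below; the [CLS] token and position embeddings described above are omitted for brevity, and the hyper-parameters are the usual ViT defaults rather than values specific to this work:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping P x P patches and project each one
    to a D-dimensional embedding (implemented as an equivalent strided conv)."""
    def __init__(self, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                        # x: (B, 3, H, W)
        x = self.proj(x)                         # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)      # (B, N, D) with N = HW / P^2

# tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))   # -> shape (1, 196, 768)
```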
DINO [6] has shown that the attention map extracted
from the last layer of a self-supervised ViT indicates promi-
nent foreground areas. Following this observation, LOST
and TokenCut propose to localize objects using the key fea-
tures $k \in \mathbb{R}^{N \times D}$ from the last layer in two steps. First, an intermediate feature map $F_{int} \in \mathbb{R}^{h \times w}$ is constructed from the inter-patch similarity graph, where $h = H/P$ and $w = W/P$. Specifically, for LOST, it is the map of inverse degrees; for TokenCut, it is the second smallest eigenvector of the graph. Second, an object mask $m \in \{0, 1\}^{h \times w}$ is generated from $F_{int}$ to segment the foreground object.
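As a rough sketch of how $F_{int}$ can be derived from the similarity graph, the snippet below implements the inverse-degree map used by LOST and a simplified spectral variant in place of TokenCut's normalized-cut formulation; the threshold and the plain graph Laplacian are illustrative simplifications, not the methods' exact details:

```python
import numpy as np

def intermediate_map(keys, h, w, method="lost", tau=0.2):
    """Derive F_int from last-layer key features `keys` of shape (N, D), N = h * w."""
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sim = k @ k.T                                  # patch-wise similarity graph
    if method == "lost":
        adj = (sim >= 0).astype(float)             # edges between positively correlated patches
        f = 1.0 / adj.sum(axis=1)                  # map of inverse degrees
    else:                                          # simplified TokenCut-style spectral cut
        adj = np.where(sim > tau, 1.0, 1e-5)       # thresholded affinity matrix
        eigvals, eigvecs = np.linalg.eigh(np.diag(adj.sum(axis=1)) - adj)
        f = eigvecs[:, 1]                          # second smallest eigenvector
    return f.reshape(h, w)
```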
3.2. Foreground Guidance with Self-iteration
In the foreground guidance module, the predicted ob-
ject mask is treated as the foreground guidance to high-
light the foreground region and guide the segmentation pro-
cess. Specifically, given an existing unsupervised object
detector $D$ and an intermediate feature map $F_{int}$ extracted from the pre-trained ViT, the binary foreground object mask $m \in \{0, 1\}^{h \times w}$ of the object can be generated as follows:
$$m = D(F_{int}). \tag{1}$$
Here, $D$ could be any ViT-based object discovery method, e.g., LOST or TokenCut. Moreover, the value of $m(x_i)$ equals 1 if the corresponding patch $i$ with coordinates $x_i$ is predicted to belong to the foreground object. With
the foreground mask $m$, the approximate coordinates of the object center $O$ can be calculated by
$$x_O = \frac{1}{\sum_{i=1}^{h \times w} m(x_i)} \sum_{i=1}^{h \times w} m(x_i)\, x_i. \tag{2}$$
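Eq. (2) is simply the mask-weighted mean of the patch coordinates; a minimal NumPy version, assuming `mask` is the $(h, w)$ binary map $m$:

```python
import numpy as np

def object_center(mask):
    """Mask-weighted mean of the patch coordinates x_i (Eq. 2); returns (row, col)."""
    h, w = mask.shape
    coords = np.stack(np.mgrid[0:h, 0:w], axis=-1).reshape(-1, 2)  # all patch coordinates
    m = mask.reshape(-1).astype(float)                             # m(x_i)
    return (m[:, None] * coords).sum(axis=0) / m.sum()
```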
Then, we construct a probability map $P \in \mathbb{R}^{h \times w}$ using the 2D Gaussian distribution function $g$:
$$P(i) = g(i \mid x_O, \sigma^2) = \frac{1}{2\pi\sigma^2} e^{-\frac{\|x_i - x_O\|^2}{2\sigma^2}}, \tag{3}$$
where $\sigma$ is a hyper-parameter. Intuitively, the value of $P$ indicates the regions in an image that are likely to belong
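For completeness, Eq. (3) can be evaluated over the patch grid as sketched below; scaling each patch feature by its foreground probability is an assumed way of applying the map, not necessarily the exact combination rule:

```python
import numpy as np

def probability_map(shape, center, sigma):
    """Evaluate the 2D Gaussian of Eq. (3) over an (h, w) grid of patch
    coordinates, centered at x_O (e.g., the output of object_center above)."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    d2 = (yy - center[0]) ** 2 + (xx - center[1]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)

# Assumed usage: re-weight the (h, w, d) patch features by P and feed them back
# to the UOD detector for the next refinement iteration.
# P = probability_map(mask.shape, object_center(mask), sigma=4.0)
# guided_feats = feats * P[..., None]
```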