mainly focus on the construction of the localization head
and overlook the potential of improving the ViT features.
In this paper, we propose a simple but effective feature
enhancement method for existing ViT-based UOD frame-
works, named FORMULA. Our method consists of two
parts, i.e., the foreground guidance module and the multi-
layer feature fusion module. For the foreground guidance
module, we utilize the object mask predicted by an off-the-
shelf UOD detector to highlight the foreground object re-
gion and then refine the object location through an iterative
process. Specifically, we first generate an object mask from
the original ViT feature map using an existing UOD detec-
tor (e.g., LOST or TokenCut). Then, we construct a probability map with a 2D Gaussian distribution from the mask, which roughly localizes the foreground objects. After that,
we highlight the foreground area by applying the probabil-
ity map to the original ViT feature map. Finally, the updated
feature map is used for the UOD detector to obtain a refined
object mask. The whole process can be iterated. In this
way, we enhance the ViT feature map by introducing the
foreground object information and suppressing background
interference. Our method can localize objects much more
accurately with only a few iterations.
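As a concrete illustration, the sketch below (in PyTorch) mimics this iterative refinement. The `detector` callable standing in for LOST or TokenCut, the per-axis Gaussian fitted to the mask coordinates, and the default of two iterations are assumptions made for exposition, not the exact implementation.

```python
import torch

def gaussian_prob_map(mask, h, w):
    """Fit a 2D Gaussian to the foreground coordinates of a binary patch
    mask (shape [h, w]) and evaluate it on the full patch grid."""
    ys, xs = torch.nonzero(mask, as_tuple=True)
    ys, xs = ys.float(), xs.float()
    mu_y, mu_x = ys.mean(), xs.mean()
    var_y = ys.var(unbiased=False).clamp(min=1.0)
    var_x = xs.var(unbiased=False).clamp(min=1.0)
    gy, gx = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32),
                            indexing="ij")
    return torch.exp(-((gy - mu_y) ** 2 / (2 * var_y)
                       + (gx - mu_x) ** 2 / (2 * var_x)))

def foreground_guided_refinement(features, detector, h, w, num_iters=2):
    """features: [h*w, d] ViT patch tokens; detector: callable mapping a
    feature map to a binary [h, w] mask (e.g., LOST or TokenCut)."""
    mask = detector(features)
    for _ in range(num_iters):
        prob = gaussian_prob_map(mask, h, w)        # rough foreground prior
        weighted = features * prob.reshape(-1, 1)   # highlight foreground patches
        mask = detector(weighted)                   # refined object mask
    return mask
```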
In addition, we note that LOST and TokenCut use only the feature map from the last layer of the ViT. However, the scale of objects in non-object-centric images, such as those in COCO [31], can vary greatly. The features from the last layer of a pre-trained ViT mainly capture the regions that matter for classification, which usually correspond to larger scales, so performance on smaller objects suffers. To address this issue,
we propose the multi-layer feature fusion module. In detail,
we simply merge the features from the last several layers
through a weighted summation to aggregate multi-scale in-
formation for unsupervised object discovery.
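As a rough sketch, this fusion amounts to a weighted sum over the patch-token maps of the last few blocks; the function name, interface, and example weights below are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch

def fuse_last_layers(layer_feats, weights=None):
    """layer_feats: list of [h*w, d] patch-token maps from the last few
    ViT blocks; weights: per-layer fusion coefficients (uniform if None)."""
    if weights is None:
        weights = [1.0 / len(layer_feats)] * len(layer_feats)
    # Weighted summation aggregates multi-scale information across layers.
    return sum(w * f for w, f in zip(weights, layer_feats))

# Example (hypothetical): fuse the last three blocks with decaying weights.
# fused = fuse_last_layers([feat_l12, feat_l11, feat_l10], weights=[0.5, 0.3, 0.2])
```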
Our main contributions can be summarized as follows:
• We introduce foreground guidance from the object mask
predicted by an existing UOD detector to the original ViT
feature map and propose an iterative process to refine the
object location.
• We further design a multi-layer feature fusion module
to address the scale variation issues in object detection,
releasing the potential of ViT feature representation for
object discovery.
• The proposed method can be easily incorporated into any existing ViT-based UOD method and achieves new
state-of-the-art results on the popular PASCAL VOC and
COCO benchmarks.
2. Related Work
Self-supervised learning. Learning powerful feature
representations in a self-supervised manner that dispenses
with human annotation has made great progress recently.
This is performed by defining a pretext task that provides
surrogate supervision for feature learning [34, 59, 19].
Despite using no labels, models trained with self-supervision
have outperformed their supervised counterparts on several
downstream tasks, such as image classification [19, 55, 8,
23, 9, 5, 20, 10, 6, 50, 11, 22] and object detection [53, 7].
While [55, 8, 23, 5, 20, 10] have adopted CNNs as pre-training backbones, recent works [6, 50, 11, 22, 29, 28] have explored Transformers [46] for self-supervised visual learning, demonstrating their superiority over traditional CNNs.
Our work utilizes the strong localization capability of self-
supervised ViT for unsupervised object discovery.
Unsupervised object discovery. The goal of unsupervised
object discovery is to localize objects without any supervi-
sion. Early works generally rely on image features encoded
by a CNN [24, 48, 54, 49, 12]. These methods need to compare the extracted features of each image against those of every other image in the dataset, leading to quadratic computational overhead [49]. Moreover, their dependence on inter-image similarity prevents them from being applied to a single image. Recently, LOST [40] and TokenCut [52] have been proposed to address these issues by
leveraging the high-quality feature representation generated
from a self-supervised ViT. Their motivation is that the at-
tention map extracted from the last layer of a pre-trained
ViT contains explicit information about the foreground ob-
jects. Specifically, both methods first construct an intra-image similarity graph using features extracted from the ViT backbone [6, 14]. A heuristic seed-selection-expansion strategy and Normalized Cut [39] are then adopted by LOST and TokenCut, respectively, to segment a single foreground object in an image.
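For context, a minimal sketch of such an intra-image similarity graph built from ViT patch features is given below; the use of key features, cosine similarity, and a fixed threshold `tau` is an illustrative simplification, not the exact construction of either method.

```python
import torch
import torch.nn.functional as F

def patch_similarity_graph(patch_feats, tau=0.2):
    """patch_feats: [n, d] ViT patch features (e.g., keys of the last
    attention layer). Returns the pairwise cosine-similarity matrix and a
    thresholded adjacency matrix usable for graph-based segmentation."""
    f = F.normalize(patch_feats, dim=-1)
    sim = f @ f.t()              # cosine similarity between every pair of patches
    adj = (sim > tau).float()    # keep edges between sufficiently similar patches
    return sim, adj
```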
Although these methods achieve excellent performance, they mainly concentrate on localization design and do not further improve the ViT features for unsupervised object discovery. Instead, our work starts from the perspective of
feature enhancement. Concretely, we introduce the fore-
ground guidance to highlight the foreground regions on ViT
features and propose a multi-layer feature fusion module to
aggregate multi-scale features.
Multi-layer Feature Representations. One of the main
challenges in object detection is to effectively represent features at different scales. Extensive works have been proposed over the years to deal with the multi-scale problem using multi-layer features. These methods leverage the pyramidal features of CNNs to compute a multi-scale feature representation. [33, 21, 26, 1, 37, 18, 30] combine low-
resolution and high-resolution features with upsampling or
lateral connections to aggregate semantic information from
all levels. [32, 4, 30] make predictions at different scales
from different layers and use post-processing to filter the