BoxTeacher: Exploring High-Quality Pseudo Labels for Weakly Supervised Instance Segmentation

Tianheng Cheng1,*, Xinggang Wang1,†, Shaoyu Chen1,*, Qian Zhang2, Wenyu Liu1

1School of EIC, Huazhong University of Science & Technology

2Horizon Robotics

https://github.com/hustvl/BoxTeacher

Abstract

Labeling objects with pixel-wise segmentation requires a huge amount of human labor compared to bounding boxes. Most existing methods for weakly supervised instance segmentation focus on designing heuristic losses with priors from bounding boxes. However, we find that box-supervised methods can produce some fine segmentation masks, and we ask whether detectors could learn from these fine masks while ignoring low-quality ones. To answer this question, we present BoxTeacher, an efficient and end-to-end training framework for high-performance weakly supervised instance segmentation, which leverages a sophisticated teacher to generate high-quality masks as pseudo labels. Because the many noisy masks hurt training, we present a mask-aware confidence score to estimate the quality of pseudo masks, and propose a noise-aware pixel loss and a noise-reduced affinity loss to adaptively optimize the student with pseudo masks. Extensive experiments demonstrate the effectiveness of the proposed BoxTeacher. Without bells and whistles, BoxTeacher achieves 35.0 mask AP and 36.5 mask AP with ResNet-50 and ResNet-101, respectively, on the challenging COCO dataset, which outperforms the previous state-of-the-art methods by a significant margin and narrows the gap between box-supervised and mask-supervised methods.

1. Introduction

Instance segmentation, which aims to recognize and segment objects in images, is a challenging task in computer vision. Fortunately, the rapid development of object detection methods [7,40,50] has driven the emergence of a number of successful methods [5,6,23,49,54,55] for effective and efficient instance segmentation. With fine-grained human annotations, recent instance segmentation methods achieve impressive results on the challenging COCO dataset [34]. Nevertheless, labeling instance-level segmentation is much more complicated and time-consuming, e.g., labeling an object with polygon-based masks requires 10.3× more time than labeling it with a 4-point bounding box [11].

*This work was done when Tianheng Cheng and Shaoyu Chen were interns at Horizon Robotics. †Xinggang Wang is the corresponding author: xgwang@hust.edu.cn

[Figure 1 image. Panel (a): BoxInst (30.7 AP) masks next to the ground truth. Panel (b): bar chart of mask AP for BoxInst, Self-Training, and BoxTeacher under the 1× and 3× schedules; bar values as extracted: 30.7, 31.0, 32.6, 31.8, 31.3, 34.2.]

Figure 1. (a) Segmentation Masks from BoxInst. BoxInst (ResNet-50 [24]) can produce some fine segmentation masks with weak supervision from bounding boxes and images. (b) Self-Training with Pseudo Masks on COCO val. We explore self-training to train a CondInst [49] with the pseudo labels generated by BoxInst. However, the improvements are limited.

Recently, a few works [25,31–33,51,53] have explored weakly supervised instance segmentation with box annotations or low-level colors. These weakly supervised methods can effectively train instance segmentation methods [23,49,55] without pixel-wise or polygon-based annotations and obtain fine segmentation masks. As shown in Fig. 1(a), BoxInst [51] can output a few high-quality segmentation masks that follow object boundaries well, e.g., the mask of the person is even more detailed than the ground-truth mask, though other objects may be badly segmented. Naturally, we wonder whether the masks generated by box-supervised methods, especially the high-quality ones, could serve as pseudo segmentation labels to further improve the performance of weakly supervised instance segmentation.

To answer this question, we first employ naive self-training to evaluate the performance of using box-supervised pseudo masks. Given the instance masks generated by BoxInst, we propose a simple yet effective box-based pseudo mask assignment to assign pseudo masks to ground-truth boxes. We then train CondInst [49] with the pseudo masks; CondInst has the same architecture as BoxInst and consists of a detector [50] and a dynamic mask head. Fig. 1(b) shows that self-training brings minor improvements and fails to unleash the power of high-quality pseudo masks, which can be attributed to two obstacles: (1) naive self-training fails to filter out low-quality masks, and (2) the noisy pseudo masks hurt training with a fully supervised pixel-wise loss. Besides, multi-stage self-training is inefficient.

arXiv:2210.05174v2 [cs.CV] 17 Mar 2023

To address these problems, we present BoxTeacher, an end-to-end training framework that takes advantage of high-quality pseudo masks produced by box supervision. BoxTeacher is composed of a sophisticated Teacher and a perturbed Student, in which the teacher generates high-quality pseudo instance masks along with mask-aware confidence scores that estimate the quality of the masks. The proposed box-based pseudo mask assignment then assigns the pseudo masks to the ground-truth boxes. The student is optimized with the ground-truth boxes and pseudo masks through a box-based loss and a noise-aware pseudo mask loss, and then progressively updates the teacher via Exponential Moving Average (EMA). In contrast to naive multi-stage self-training, BoxTeacher is simpler and more efficient. The proposed mask-aware confidence score effectively reduces the impact of low-quality masks. More importantly, pseudo labeling improves the student, which in turn drives the teacher to generate higher-quality masks, hence pushing the limits of box supervision. BoxTeacher can serve as a general training paradigm and is agnostic to the instance segmentation method.
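The EMA update of the teacher can be illustrated with a minimal sketch. It assumes model parameters are stored as a name-to-scalar dictionary purely for illustration (real models hold tensors), and the function name `ema_update` and the default `momentum` value are hypothetical placeholders rather than settings taken from the paper.

```python
from typing import Dict


def ema_update(teacher: Dict[str, float],
               student: Dict[str, float],
               momentum: float = 0.999) -> None:
    """Update the teacher in place as an exponential moving average of the student.

    teacher <- momentum * teacher + (1 - momentum) * student
    The momentum default is an illustrative assumption, not the paper's value.
    """
    for name, student_value in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * student_value


# Illustrative usage: the teacher drifts slowly toward the student.
teacher_params = {"w": 1.0}
student_params = {"w": 0.0}
ema_update(teacher_params, student_params, momentum=0.9)
```

With a momentum close to 1, the teacher changes slowly, so its pseudo masks are less affected by the noise of any single student update.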

To benchmark the proposed BoxTeacher, we adopt CondInst [49] as the basic segmentation method. On the challenging COCO dataset [34], BoxTeacher achieves 35.0 and 36.5 mask AP with ResNet-50 [24] and ResNet-101, respectively, which remarkably outperforms its counterparts. We provide extensive experiments on PASCAL VOC and Cityscapes to demonstrate its effectiveness and generalization ability. Furthermore, BoxTeacher with Swin Transformer [37] obtains 40.6 mask AP as a weakly supervised approach for instance segmentation.

Overall, our contributions can be summarized as follows:

• We approach box-supervised instance segmentation from a new perspective, i.e., self-training with pseudo masks, and illustrate its effectiveness.

• We present BoxTeacher, a simple yet effective framework that leverages pseudo masks with the mask-aware confidence score and the noise-aware pseudo mask loss. Besides, we propose a pseudo mask assignment to assign pseudo masks to ground-truth boxes.

• We improve weakly supervised instance segmentation by large margins and narrow the gap between box-supervised and mask-supervised methods, e.g., BoxTeacher achieves 36.5 mask AP on COCO compared to 39.1 AP obtained by CondInst.

2. Related Work

Instance Segmentation. Methods for instance segmentation can be roughly divided into two groups, i.e., single-stage methods and two-stage methods. Single-stage methods [5,49,58,62] tend to adopt single-stage object detectors [35,50] to localize and recognize objects, and then generate segmentation masks through object embeddings or dynamic convolution [9]. Wang et al. present the box-free SOLO [54] and SOLOv2 [55], which are independent of object detectors. SparseInst [13] and YOLACT [5], aiming for real-time inference, achieve a good trade-off between speed and accuracy. Two-stage methods [14,23,27,29], e.g., Mask R-CNN [23], adopt bounding boxes from object detectors and RoIAlign [23] to extract RoI (region-of-interest) features for object segmentation. Several methods [14,27,29] based on Mask R-CNN have been proposed to refine the segmentation masks for high-quality instance segmentation. Recently, many approaches [7,10,12,17,20,63] based on transformers [18,52] or the Hungarian algorithm [46] have made great progress in instance segmentation.

Weakly Supervised Instance Segmentation. Considering the huge cost of labeling instance segmentation, weakly supervised instance segmentation using image-level labels or bounding boxes has attracted much attention. Several methods [1,2,64,66] exploit image-level labels to generate pseudo masks from activation maps. Khoreva et al. [28] propose to generate pseudo masks with GrabCut [42] from given bounding boxes. BoxCaseg [53] leverages a saliency model to generate pseudo object masks for training Mask R-CNN along with the multiple instance learning (MIL) loss. Recently, many box-supervised methods [25,31,33,51] combine the MIL loss or a pairwise relation loss from low-level features and obtain impressive results with box annotations. In comparison with BoxInst [51], BoxTeacher inherits the box supervision [51] but concentrates more on a novel training paradigm and on exploiting noisy pseudo masks for high-performance box-supervised instance segmentation. Different from DiscoBox [31], which is based on the mean teacher [48], BoxTeacher aims at a simple yet effective training framework that obtains high-quality pseudo masks and learns from noisy masks.

Semi-supervised Learning. Pseudo labeling [3,21,41] and consistency regularization [4,30,43,44,59] have greatly advanced semi-supervised learning, which enables training on large-scale unlabeled datasets. Recently, semi-supervised learning has been widely used in object detection [36,45,60] and semantic segmentation [8,56,61] and has demonstrated its effectiveness. Motivated by the high-quality masks from box supervision, we adopt pseudo labeling and consistency regularization to develop a new training framework for weakly supervised instance segmentation. Compared to [22], which has a similar motivation but targets semi-supervised object detection with labeled images and extra point annotations, BoxTeacher addresses box-supervised instance segmentation with box-only annotations. Compared to [26,47], which adopt multi-stage training and combine weakly supervised and semi-supervised learning, BoxTeacher is a one-stage framework without pre-trained labelers.

3. Naive Self-Training with Pseudo Masks

Revisiting Box-supervised Methods. Note that box-only annotations are sufficient to train an object detector, which can accurately localize and recognize objects. Box-supervised methods [31,33,51] based on object detectors mainly exploit two carefully designed losses to supervise mask predictions, i.e., the multiple instance learning (MIL) loss and the pairwise relation loss. Concretely, the MIL loss uses the bounding boxes to determine the positive and negative bags of pixels of the predicted masks. The pairwise relation loss concentrates on the local relations between pixels from low-level colors or features: neighboring pixels that have similar colors are regarded as a positive pair and should output similar probabilities. The MIL loss and the pairwise relation loss enable box-supervised methods to produce complete segmentation masks, and even some high-quality masks with fine details.
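To make the notion of positive and negative bags concrete, the following is a simplified sketch of a projection-style MIL term, in the spirit of the box projection used by BoxInst [51]. It is an illustration under stated assumptions, not the authors' implementation: the function name `projection_dice_loss`, the nested-list mask format, the exclusive pixel-index box convention, and the `eps` smoothing are all choices made here.

```python
from typing import List, Sequence


def projection_dice_loss(mask_prob: List[List[float]],
                         box: Sequence[int],
                         eps: float = 1e-6) -> float:
    """Dice loss between axis-wise max projections of a mask and of its box.

    mask_prob: H x W foreground probabilities in [0, 1].
    box: (x1, y1, x2, y2) in pixel indices, x2 and y2 exclusive (assumed convention).
    """
    height, width = len(mask_prob), len(mask_prob[0])
    x1, y1, x2, y2 = box

    # A column or row that crosses the box is a positive bag: at least one of
    # its pixels should be foreground, so its max should be close to 1.
    # A column or row outside the box is a negative bag: its max should be 0.
    pred_x = [max(mask_prob[y][x] for y in range(height)) for x in range(width)]
    pred_y = [max(row) for row in mask_prob]
    box_x = [1.0 if x1 <= x < x2 else 0.0 for x in range(width)]
    box_y = [1.0 if y1 <= y < y2 else 0.0 for y in range(height)]

    def dice(pred: List[float], target: List[float]) -> float:
        inter = sum(p * t for p, t in zip(pred, target))
        denom = sum(p * p for p in pred) + sum(t * t for t in target)
        return 1.0 - 2.0 * inter / (denom + eps)

    return dice(pred_x, box_x) + dice(pred_y, box_y)
```

Note that a mask covering only the diagonal of the box yields the same projections as a mask filling the whole box, so this term alone cannot recover object shape; that is the role of the pairwise relation loss described above.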

Naive Self-Training. Considering that box-supervised methods can produce some high-quality masks without mask annotations, we adopt self-training to utilize the high-quality masks as pseudo labels and train an instance segmentation method with full supervision. Specifically, we adopt BoxInst [51] to generate pseudo instance masks on the given dataset $\mathcal{X} = \{X, B^g\}$, which only contains box annotations. For each input image $X$, let $\{B^p, C^p, M^p\}$ denote the predicted bounding boxes, confidence scores, and predicted instance masks, respectively. We propose a simple yet effective Box-based Pseudo Mask Assignment algorithm (Alg. 1) to assign the predicted instance masks to the box annotations via the confidence scores and the intersection-over-union (IoU) between the ground-truth boxes $B^g$ and the predicted boxes $B^p$. The hyper-parameters $\tau_{iou}$ and $\tau_c$ are set to 0.5 and 0.05, respectively. The assigned instance masks are rectified by removing the parts that fall outside the bounding boxes. Then, we adopt the dataset $\hat{\mathcal{X}} = \{X, B^g, M^g\}$ with pseudo instance masks to train an approach, e.g., CondInst [49].

Naive Self-Training is Limited. Fig. 1(b) and Tab. 7 provide the experimental results of naive self-training with pseudo masks. Compared to the pseudo labeler, self-training brings minor improvements and can even fail to surpass the pseudo labeler. We attribute the limited performance to two issues: naive self-training fails to exclude low-quality masks, and the fully supervised loss is sensitive to noisy pseudo masks.

Algorithm 1: Box-based Pseudo Mask Assignment

Input: predicted boxes $B^p \in \mathbb{R}^{N \times 4}$, predicted masks $M^p \in \mathbb{R}^{N \times H \times W}$, confidence scores $C^p \in \mathbb{R}^{N}$, ground-truth boxes $B^g \in \mathbb{R}^{K \times 4}$.
Parameter: IoU threshold $\tau_{iou}$, confidence threshold $\tau_c$.
Output: assigned pseudo masks $M^g \in \mathbb{R}^{K \times H \times W}$.

Initialize output masks $M^g$ with empty (0), assignment index $A \in \mathbb{R}^{K}$ with $-1$;
Filter the predictions by the confidence threshold $\tau_c$;
Sort the confidence scores $C^p$ in descending order with output indices $S \in \mathbb{N}^{N}$;
foreach prediction $i$ in $S$ do
    Initialize $u \leftarrow -1$, $v \leftarrow -1$;
    for $j = 1$ to $K$ do
        $iou_{ij}$ = ComputeIoU($B^p_i$, $B^g_j$);
        if $A_j > 0$ then
            continue;
        end
        if $iou_{ij} \geq \tau_{iou}$ and $iou_{ij} \geq u$ then
            $u \leftarrow iou_{ij}$, $v \leftarrow i$;
        end
        if $v > 0$ then
            Assign mask $M^p_i$ to mask $M^g_j$;
            $A_j \leftarrow i$;
        end
    end
end
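The assignment procedure can be sketched in plain Python. This is one reading of Algorithm 1, not the authors' code: it interprets the inner loop as a greedy search for the unassigned ground-truth box with the highest IoU (tracking the box index and assigning once after the loop), uses 0-based indices, treats the confidence filter as `score >= tau_c`, and applies the rectification step described in the text by zeroing mask pixels outside the matched box. All function names, the `(x1, y1, x2, y2)` box format, and the nested-list mask format are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Box = Sequence[float]   # (x1, y1, x2, y2), assumed box format
Mask = List[List[int]]  # H x W binary mask as nested lists


def compute_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def clip_mask_to_box(mask: Mask, box: Box) -> Mask:
    """Zero out mask pixels that fall outside the box (the rectification step)."""
    x1, y1, x2, y2 = box
    return [[v if (x1 <= x < x2 and y1 <= y < y2) else 0
             for x, v in enumerate(row)]
            for y, row in enumerate(mask)]


def assign_pseudo_masks(pred_boxes: List[Box], pred_masks: List[Mask],
                        scores: List[float], gt_boxes: List[Box],
                        mask_size: Tuple[int, int],
                        tau_iou: float = 0.5,
                        tau_c: float = 0.05) -> Tuple[List[Mask], List[int]]:
    """Greedily assign predicted masks to ground-truth boxes.

    Returns K masks (all zeros where nothing was assigned) and, for each
    ground-truth box, the index of the assigned prediction or -1.
    """
    height, width = mask_size
    assigned = [-1] * len(gt_boxes)
    out = [[[0] * width for _ in range(height)] for _ in gt_boxes]

    # Drop low-confidence predictions, then visit the rest by descending score.
    keep = [i for i, s in enumerate(scores) if s >= tau_c]
    order = sorted(keep, key=lambda i: scores[i], reverse=True)

    for i in order:
        best_iou, best_j = -1.0, -1
        for j, gt in enumerate(gt_boxes):
            if assigned[j] >= 0:  # this ground-truth box already has a mask
                continue
            iou = compute_iou(pred_boxes[i], gt)
            if iou >= tau_iou and iou >= best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0:
            out[best_j] = clip_mask_to_box(pred_masks[i], gt_boxes[best_j])
            assigned[best_j] = i
    return out, assigned
```

Because predictions are visited in descending confidence and each ground-truth box accepts at most one mask, a lower-scoring duplicate detection of the same object is left unassigned, and a ground-truth box with no sufficiently overlapping prediction keeps an empty mask.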

4. BoxTeacher

In this section, we present BoxTeacher, an end-to-end training framework that aims to unleash the power of pseudo masks. In contrast to multi-stage self-training, BoxTeacher, consisting of a teacher and a student, simultaneously facilitates the training of the student and the pseudo labeling of the teacher. The mutual optimization is beneficial to both the teacher and the student, thus leading to higher performance for box-supervised instance segmentation.
