self-training to evaluate the performance of using box-
supervised pseudo masks. Given the generated instance
masks from BoxInst, we propose a simple yet effective box-
based pseudo mask assignment to assign pseudo masks to
ground-truth boxes. We then train CondInst [49] with the
pseudo masks; CondInst has the same architecture as
BoxInst and consists of a detector [50] and a dynamic
mask head. Fig. 1(b) shows that using self-training brings
minor improvements and fails to unleash the power of
high-quality pseudo masks, which can be attributed to two
obstacles: (1) naive self-training fails to filter
low-quality masks, and (2) the noisy pseudo masks hurt
training with a fully supervised pixel-wise loss. Besides,
multi-stage self-training is inefficient.
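As a rough illustration of the box-based pseudo mask assignment mentioned above, the following sketch matches each ground-truth box to the pseudo mask whose tight bounding box overlaps it most. This section does not specify the matching rule, so the IoU criterion, the threshold `iou_thresh`, and all function names here are assumptions made for illustration, not the exact procedure of the method.

```python
# Hypothetical sketch: assign pseudo masks to ground-truth boxes by box IoU.
# Masks are nested lists of 0/1; boxes are (x1, y1, x2, y2) with exclusive x2, y2.

def mask_to_box(mask):
    """Return the tight box around the nonzero pixels, or None for an empty mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)

def box_iou(a, b):
    """Intersection over union of two boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(iw, 0) * max(ih, 0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def assign_pseudo_masks(gt_boxes, masks, iou_thresh=0.5):
    """For each ground-truth box, return the index of the best mask or None.

    A mask is eligible only if its tight box reaches `iou_thresh` (an assumed
    value) with the ground-truth box; unmatched boxes keep box supervision only.
    """
    mask_boxes = [mask_to_box(m) for m in masks]
    assignment = []
    for gt in gt_boxes:
        best_idx, best_iou = None, iou_thresh
        for i, mb in enumerate(mask_boxes):
            if mb is None:
                continue
            iou = box_iou(gt, mb)
            if iou >= best_iou:
                best_idx, best_iou = i, iou
        assignment.append(best_idx)
    return assignment

# Toy usage: two 4x4 pseudo masks and three ground-truth boxes.
masks = [
    [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
    [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]],
]
gt_boxes = [(2, 2, 4, 4), (0, 0, 2, 2), (0, 0, 4, 4)]
print(assign_pseudo_masks(gt_boxes, masks))
```

In this sketch, a ground-truth box with no sufficiently overlapping mask receives `None`, which is one simple way to keep unmatched boxes out of the pseudo mask loss.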
To address these problems, we present BoxTeacher, an
end-to-end training framework, which takes advantage of
high-quality pseudo masks produced by box supervision.
BoxTeacher is composed of a sophisticated Teacher and
a perturbed Student, in which the teacher generates high-
quality pseudo instance masks along with the mask-aware
confidence scores to estimate the quality of masks. Then the
proposed box-based pseudo mask assignment will assign
the pseudo masks to the ground-truth boxes. The student is
normally optimized with the ground-truth boxes and pseudo
masks through box-based loss and noise-aware pseudo
mask loss, and then progressively updates the teacher via
Exponential Moving Average (EMA). In contrast to
naive multi-stage self-training, BoxTeacher is simpler
and more efficient. The proposed mask-aware confidence score
effectively reduces the impact of low-quality masks. More
importantly, pseudo labeling improves the student, which in
turn drives the teacher to generate higher-quality masks,
pushing the limits of box supervision. BoxTeacher can serve
as a general training paradigm and is agnostic to the
underlying instance segmentation method.
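The EMA update of the teacher described above can be sketched as follows. This is a minimal illustration that stores parameters as plain Python floats in dictionaries; the function name `ema_update` and the `momentum` values are assumptions, since this section does not state the decay rate used by BoxTeacher.

```python
# Hypothetical sketch of a teacher update via Exponential Moving Average (EMA).
# Parameters are kept as name -> float for clarity; a real model would hold tensors.

def ema_update(teacher, student, momentum=0.999):
    """Update teacher parameters in place toward the student parameters.

    teacher[name] <- momentum * teacher[name] + (1 - momentum) * student[name]
    """
    for name, s_value in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * s_value
    return teacher

# Toy usage: the teacher drifts slowly toward the student after each step.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
for _ in range(3):
    ema_update(teacher, student, momentum=0.9)
print(teacher)
```

A momentum close to 1 makes the teacher change slowly, which is the usual reason a teacher of this kind yields more stable pseudo labels than the student it tracks.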
To benchmark the proposed BoxTeacher, we adopt
CondInst [49] as the basic segmentation method. On the
challenging COCO dataset [34], BoxTeacher surprisingly
achieves 35.0 and 36.5 mask AP with ResNet-50 [24]
and ResNet-101, respectively, remarkably outperforming
its counterparts. We provide extensive experiments
on PASCAL VOC and Cityscapes to demonstrate its
effectiveness and generalization ability. Furthermore,
BoxTeacher with Swin Transformer [37] obtains 40.6 mask AP
as a weakly supervised approach for instance segmentation.
Overall, our contributions can be summarized as follows:
• We solve the box-supervised instance segmentation
problem from a new perspective, i.e., self-training with
pseudo masks, and illustrate its effectiveness.
• We present BoxTeacher, a simple yet effective
framework, which leverages pseudo masks with the
mask-aware confidence score and the noise-aware pseudo
mask loss. Besides, we propose a pseudo mask assignment
to assign pseudo masks to ground-truth boxes.
• We improve weakly supervised instance segmentation
by large margins and narrow the gap between
box-supervised and mask-supervised methods, e.g.,
BoxTeacher achieves 36.5 mask AP on COCO compared
to 39.1 AP obtained by CondInst.
2. Related Work
Instance Segmentation. Methods for instance segmentation
can be roughly divided into two groups, i.e.,
single-stage methods and two-stage methods. Single-stage
methods [5,49,58,62] tend to adopt single-stage object
detectors [35,50] to localize and recognize objects, and then
generate segmentation masks through object embeddings
or dynamic convolution [9]. Wang et al. present box-free
SOLO [54] and SOLOv2 [55], which are independent of
object detectors. SparseInst [13] and YOLACT [5], aiming for
real-time inference, achieve a good trade-off between speed
and accuracy. Two-stage methods [14,23,27,29] adopt
bounding boxes from object detectors and RoIAlign [23] to
extract the RoI (region-of-interest) features for object seg-
mentation, e.g., Mask R-CNN [23]. Several methods [14,
27,29] based on Mask R-CNN are proposed to refine the
segmentation masks for high-quality instance segmentation.
Recently, many approaches [7,10,12,17,20,63] based on
transformers [18,52] or the Hungarian algorithm [46] have
made great progress in instance segmentation.
Weakly Supervised Instance Segmentation. Considering
the huge cost of labeling instance segmentation, weakly
supervised instance segmentation using image-level labels
or bounding boxes has attracted much attention. Several
methods [1,2,64,66] exploit image-level labels to generate
pseudo masks from activation maps. Khoreva et al. [28]
propose to generate pseudo masks with GrabCut [42] from
given bounding boxes. BoxCaseg [53] leverages a saliency
model to generate pseudo object masks for training Mask R-
CNN along with the multiple instance learning (MIL) loss.
Recently, many box-supervised methods [25,31,33,51]
combine the MIL loss or a pairwise relation loss from
low-level features to obtain impressive results with box
annotations. In comparison with BoxInst [51], BoxTeacher
inherits the box supervision [51] but concentrates more on the
novel training paradigm and on exploiting noisy pseudo masks
for high-performance box-supervised instance
segmentation. Different from DiscoBox [31], which is
based on mean teacher [48], BoxTeacher aims at a simple
yet effective training framework that obtains high-quality
pseudo masks and learns from noisy masks.
Semi-supervised Learning. Pseudo labeling [3,21,41] and
consistency regularization [4,30,43,44,59] have greatly