ated by our proposed instance segmentation generator and
background segmentation completion network (depicted in
Fig. 2).
Moreover, the Latent Diffusion Model (LDM) and Stable Diffusion [40] utilize a powerful cross-attention mechanism [50] to condition the diffusion model on various inputs, such as text and segmentation masks. To leverage the large-scale pretrained Stable Diffusion model for our task, we devise a new inference scheme that generates target instances in a missing region using a segmentation mask provided by our framework. Similar to our approach, SPG-Net [48] and SG-Net [26] use semantic segmentation labels to provide more informative supervision for image completion. Our approach resembles SPG-Net and SG-Net in that ImComplete predicts/recovers the semantic segmentation mask and generates the content based on the mask, but differs in that ImComplete handles more challenging scenarios where instances are entirely removed.
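To illustrate mask-conditioned generation with a pretrained Stable Diffusion model, the snippet below uses the Hugging Face diffusers inpainting pipeline. It is a minimal sketch of the general mechanism only, not the exact inference scheme proposed here; the checkpoint name, prompt, and file paths are assumptions.

```python
# Minimal sketch: mask-conditioned generation with a pretrained Stable
# Diffusion inpainting pipeline (Hugging Face diffusers). Illustrative
# only; checkpoint, prompt, and file paths are assumptions.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",  # assumed checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("masked_scene.png").convert("RGB").resize((512, 512))
mask = Image.open("instance_mask.png").convert("L").resize((512, 512))

# The mask (white = region to fill) plays the role of the instance
# segmentation mask that our framework would provide.
result = pipe(prompt="a dog sitting on the grass",
              image=image, mask_image=mask).images[0]
result.save("completed.png")
```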
To the best of our knowledge, the only existing work on instance-aware image completion is HVITA [38], where a target instance is wholly removed from an image. HVITA consists of four steps: (1) detecting visible instances, (2) constructing a graph over the detected instances to understand the scene context, (3) generating a missing instance and placing it in the missing region, and (4) refining the resulting image. However, HVITA depends on conventional object detection and is designed to complete rectangular regions only. Despite its additional refinement module, HVITA still produces low-quality completion results; in particular, distortion occurs at the boundary between the generated instance and its surroundings. In contrast, our proposed ImComplete handles arbitrarily shaped masks, and its segmentation mask recovery pipeline promotes a better understanding of image context and encourages visual continuity and plausibility at the boundaries between generated and unmasked regions.
3. Method
Our framework aims to complete a corrupted image from which a visual instance has been entirely removed. This problem is challenging because the model must not only generate a target instance but also ensure that the generated instance blends seamlessly with the remaining areas of the image. To address this problem, our framework (ImComplete) completes masked images in three steps. First, we infer a contextually appropriate instance by predicting the category of the missing instance (Sec. 3.1). Second, we complete the semantic segmentation map of the missing region based on the inference result from the previous step (Sec. 3.2). Finally, we transform the masked image with the semantic segmentation map into a realistic completed image (Sec. 3.3).
An overview of our framework is shown in Fig. 2. We provide all architecture and training details in Appendix E.
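A high-level sketch of this three-step pipeline is given below; the three function names are hypothetical placeholders for the components described in Secs. 3.1 to 3.3, not part of the actual implementation.

```python
# High-level sketch of the ImComplete pipeline. The three functions
# are hypothetical placeholders for the components of Secs. 3.1-3.3.
def imcomplete(masked_image, missing_region_mask):
    # Step 1 (Sec. 3.1): infer the class of the missing instance from
    # the visible instances and the missing region location.
    target_class = predict_missing_class(masked_image, missing_region_mask)

    # Step 2 (Sec. 3.2): complete the semantic segmentation map of the
    # missing region, conditioned on the inferred class.
    seg_map = complete_segmentation(masked_image, missing_region_mask,
                                    target_class)

    # Step 3 (Sec. 3.3): translate the masked image plus the recovered
    # segmentation map into a realistic completed image.
    return generate_pixels(masked_image, seg_map)
```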
3.1. Missing Instance Prediction
To obtain relational information between instances in a given scene, ImComplete first predicts instance bounding box coordinates, instance classes, and a semantic segmentation map using the pre-trained DETR [4]. Let the panoptic segmentation map be $S_M = \mathrm{DETR}(I_M)$; we can then extract the box coordinates of the visible instances $B = [b_1, \ldots, b_k]$ and their object classes $c = [c_1, \ldots, c_k]^\top$ from $S_M$, where $k$ is the number of predicted instances. Then, to infer the class of the missing instance $y_{\text{target}}$, a transformer network, called the missing instance inference transformer, utilizes tokens obtained from the object classes $c$ of the visible instances, as well as an additional token responsible for the missing region. In particular, we convert the visible instances' classes $c$ into learnable input tokens using a single linear layer. A quick alternative is to use the object queries from DETR directly as input tokens, but we observe that this performs worse than employing new learnable class embeddings.
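The sketch below illustrates this step with the public DETR detection checkpoint from torch.hub; the confidence threshold, the dummy input, and the embedding dimensions are simplifying assumptions (the paper uses DETR's panoptic variant).

```python
# Sketch: extract visible-instance boxes/classes with pretrained DETR,
# then embed class labels as learnable tokens via a single linear layer.
# Threshold, dummy input, and dimensions are assumptions; the paper
# uses the panoptic DETR variant.
import torch
import torch.nn as nn

detr = torch.hub.load("facebookresearch/detr", "detr_resnet50",
                      pretrained=True).eval()

masked_image = torch.rand(1, 3, 800, 800)           # stand-in for I_M
with torch.no_grad():
    out = detr(masked_image)
probs = out["pred_logits"].softmax(-1)[0, :, :-1]   # drop no-object class
keep = probs.max(-1).values > 0.9                   # assumed threshold
c = probs[keep].argmax(-1)          # classes of the k visible instances
B = out["pred_boxes"][0][keep]      # normalized (cx, cy, w, h), [k, 4]

# One-hot class -> learnable d-dim token via a single linear layer.
num_classes, d = 91, 256
to_token = nn.Linear(num_classes, d)
E_class = to_token(nn.functional.one_hot(c, num_classes).float())  # [k, d]
```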
Furthermore, to inject the location information of the visible instances, we embed their bounding box coordinates into positional encoding vectors and add them to the learnable class embeddings. To obtain the positional encoding vectors, we feed the normalized center coordinates ($C_x$, $C_y$), width ($W$), and height ($H$) of each bounding box into a single linear layer with a sigmoid activation function. We apply the same procedure to the missing region token that will be used for missing instance inference. We explore different ways to create the positional encoding vectors in Appendix B and adopt the aforementioned encoding since it gives the best missing instance inference performance. Formally, the missing instance inference transformer operates as follows:
$$
\begin{aligned}
z_0 &= E_{\mathrm{class}} + E_{\mathrm{pos}} = \mathrm{MLP}(c') + \sigma(\mathrm{MLP}(B')), && z_0 \in \mathbb{R}^{(k+1) \times d}, \\
z'_l &= \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, && l = 1, \ldots, L, \\
z_l &= \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l, && l = 1, \ldots, L, \\
y &= \mathrm{LN}(z_L^0),
\end{aligned}
\tag{1}
$$
where MLP is a multi-layer perceptron, $\sigma$ is a sigmoid activation, $d$ is the dimension of the embedding vectors, $B' = [b_0] \cup B$ with $b_0$ the bounding box coordinates of the missing region, and $c' = [c_0] \cup c$ with $c_0$ an extra class token for the missing instance. The missing instance inference transformer consists of 12 transformer encoder layers ($L = 12$) with eight heads.
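A direct PyTorch rendering of Eq. (1) is sketched below. Hyperparameters follow the text (L = 12 pre-norm encoder layers, eight heads, tokens built from one-hot classes and normalized boxes), while the MLP widths and the final classification head are assumptions.

```python
# Sketch of the missing instance inference transformer of Eq. (1):
# pre-norm encoder blocks (MSA + MLP with residuals) over k visible
# tokens plus one missing-region token. MLP widths and the final
# classification head are assumptions.
import torch
import torch.nn as nn

class MissingInstanceTransformer(nn.Module):
    def __init__(self, num_classes=91, d=256, L=12, heads=8):
        super().__init__()
        self.class_mlp = nn.Linear(num_classes, d)   # E_class from c'
        self.pos_mlp = nn.Linear(4, d)               # E_pos from B'
        self.blocks = nn.ModuleList()
        for _ in range(L):
            self.blocks.append(nn.ModuleDict({
                "ln1": nn.LayerNorm(d),
                "msa": nn.MultiheadAttention(d, heads, batch_first=True),
                "ln2": nn.LayerNorm(d),
                "mlp": nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(),
                                     nn.Linear(4 * d, d)),
            }))
        self.ln_out = nn.LayerNorm(d)
        self.head = nn.Linear(d, num_classes)        # assumed y_target head

    def forward(self, c_onehot, boxes):
        # c_onehot: [k+1, num_classes], index 0 = missing-instance token c_0
        # boxes:    [k+1, 4] normalized (cx, cy, w, h), index 0 = b_0
        z = self.class_mlp(c_onehot) + torch.sigmoid(self.pos_mlp(boxes))
        z = z.unsqueeze(0)                           # [1, k+1, d]
        for blk in self.blocks:
            h = blk["ln1"](z)
            z = z + blk["msa"](h, h, h, need_weights=False)[0]
            z = z + blk["mlp"](blk["ln2"](z))
        y = self.ln_out(z[:, 0])                     # y = LN(z_L^0)
        return self.head(y)                          # missing-class logits
```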
The missing region token interacts with the visible region tokens through the self-attention mechanism. Thus, the network
can more accurately predict the likely class of the missing
instance based on the detected instances and their location