Instance-Aware Image Completion
Jinoh Cho1, Minguk Kang1, Vibhav Vineet2, Jaesik Park1
1Pohang University of Science and Technology (POSTECH), South Korea
2Microsoft Research, United States
Abstract
Image completion is a task that aims to fill in the missing
region of a masked image with plausible contents. How-
ever, existing image completion methods tend to fill in the
missing region with the surrounding texture instead of hallucinating a visual instance that suits the context of the scene. In this work, we propose a
novel image completion model, dubbed ImComplete, that
hallucinates the missing instance such that it harmonizes well with, and thus preserves, the original context. ImComplete first
adopts a transformer architecture that considers the visible
instances and the location of the missing region. Then, Im-
Complete completes the semantic segmentation masks within
the missing region, providing pixel-level semantic and struc-
tural guidance. Finally, the image synthesis blocks generate
photo-realistic content. We perform a comprehensive eval-
uation of the results in terms of visual quality (LPIPS and
FID) and contextual preservation scores (CLIPScore and object detection accuracy) on the COCO-panoptic and Visual
Genome datasets. Experimental results show the superiority
of ImComplete on various natural images.
1. Introduction
Image completion is the task of restoring arbitrary miss-
ing regions in an image. Researchers have been working on
developing image completion models for various applica-
tions, such as image editing [17,29], restoration [25,51], and
object removal [44]. Even though current image completion
models can produce highly realistic results, most previous
works focus on filling in missing regions in a realistic way
without considering the appropriate instance that needs to be
inserted and how it harmonizes with the undamaged regions.
For example, we observe that even the cutting-edge image
completion models [24,33,54] tend to use textures from
the surrounding areas to fill in the missing parts rather than
synthesizing plausible instances.
When the image completion model fills in a missing re-
gion with surrounding textures, it can significantly alter the
overall context of the image. For example, the removal of
the horse in the image of Fig. 1 changes the context around
the missing region from “a person riding a horse on the
beach” to “a boy walking on the beach”. Unfortunately, developing an image completion framework that can synthesize the missing region with an instance in a photo-realistic as well as context-preserving fashion has rarely been studied.
In this paper, we propose an image completion pipeline,
named ImComplete, that can reason about the type of the missing instance in a damaged image and complete the image with plausible content. ImComplete completes images in three stages: 1) identifying the class of the missing
instance, 2) generating a semantic segmentation mask for the
missing area, and 3) using the segmentation guidance to com-
plete the missing region. Specifically, ImComplete employs a
transformer network to examine the image’s context, defined
by analyzing the co-occurrence of instances, to predict the
class of the missing instance. Then, it utilizes a conditional
GAN and a reconstruction network with a transformer body to generate separate segmentation masks for the missing instance and the background of the missing region. Lastly, with state-
of-the-art semantic image synthesis approaches [35,40,43],
ImComplete fills an instance and background scenes that fit
the context in the missing region.
To compare our model against existing image comple-
tion methods, we suggest using two evaluation metrics: (1)
DETR [4] Accuracy, to check whether the proper instance is completed, and (2) CLIPScore [14], to measure how much the context of the image changes after completion, considering both the instance and non-instance parts of the missing region. We
also evaluate our method using standard image quality as-
sessment metrics, FID [15] and LPIPS [57]. The results on
COCO-panoptic [28] and Visual Genome [21] datasets show
that ImComplete has a similar level of visual quality com-
pared to the state-of-the-art image completion approaches.
However, the DETR Accuracy and CLIPScore indicate that
ImComplete can complete damaged images better than the
other methods with suitable instances.
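As a rough illustration of how a CLIPScore-style metric can be computed around the completed region, the snippet below uses the Hugging Face transformers CLIP interface; the crop policy and the 2.5 * max(cos, 0) scaling (the common convention from [14]) are our assumptions rather than the paper's exact protocol.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, text: str) -> float:
    """CLIPScore-style similarity between an image crop and a context query."""
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    cos = torch.cosine_similarity(img, txt).item()
    return 2.5 * max(cos, 0.0)  # scaling convention from Hessel et al. [14]

# Hypothetical usage: crop around the completed region, then score it.
# crop = completed_image.crop(bbox_around_missing_region)
# print(clip_score(crop, "A person riding a horse on the beach"))
```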
Our contributions are summarized as follows:
• We propose a new image completion pipeline called ImComplete that completes the missing region of a masked image in a context-preserving manner.
arXiv:2210.12350v3 [cs.CV] 26 May 2023
[Figure 1: qualitative comparison for the context query “A person riding a horse on the beach”; columns show Input, MAT, RePaint, HVITA, Ours, and Target, with CLIPScores displayed under the results.]
Figure 1. From the first column: the input image with a missing region; results of state-of-the-art image completion approaches, such as MAT [24], RePaint [33], and HVITA [38]; our result (ImComplete); and the target image. We compute CLIPScore around the generated part using the query text. As our approach generates a horse to complete the image rather than filling it with background textures, the CLIPScore of our result is the best among the compared models.
• We propose to use DETR Accuracy and CLIPScore to evaluate the instance/context consistency between the original image and the completed image.
• ImComplete produces high-quality completion results. Our results show better contextual scores (DETR Accuracy and CLIPScore) than cutting-edge image completion approaches while keeping comparable image quality scores (FID and LPIPS).
• We show that ImComplete can be plugged into state-of-the-art semantic image synthesis models, such as SPADE [35], OASIS [43], and Stable Diffusion [40].
• We show that ImComplete can also perform object removal by skipping the proposed missing instance inference step, which demonstrates the flexibility of the proposed approach.
2. Related Work
2.1. Image Completion
Early research on image completion can be roughly
divided into diffusion-based [1,2] and patch-based meth-
ods [6,7,10,13,22,23,49]. These models assume that the
content of the missing region can be recovered from patterns and features in the remaining parts of the image, so they fill in the missing region using only basic visual elements and repeating patterns.
With the advance of deep learning, image completion
models based on deep generative models have become the
mainstream to achieve photo-realistic image completion.
Context encoder [37] utilizes adversarial training inspired
by the Generative Adversarial Network (GAN) [12] and shows
perceptually faithful completion results. VQ-GAN [11] and
MaskGIT [5] use auto-regressive generative transformers to fill the missing region. RePaint [33] proposes a new inference scheme using a pretrained denoising diffusion model [16,46], which reduces the distortion between the generated and known regions.
Along with introducing advanced generative models, re-
searchers have also been working to improve image comple-
tion performance by modifying existing architectures and
convolution operations. A series of studies [31,47,53,54] proposes contextual attention layers that encode long-range contextual embeddings and perform image completion based on them. Liu et al. [30] propose a partial convolutional layer
to mitigate color discrepancy and blurriness in completed
images. Yu et al. [55] generalize the partial convolution by
introducing a dynamic feature selection mechanism at each
spatial coordinate.
Recent works have focused on completing larger missing
regions. Zhao et al. [58] tackle this challenging image completion setting by bridging image conditioning with the modulation
technique used in StyleGAN2 [19]. Li et al. [24] propose
a transformer block that can effectively capture long-range
context interactions and hallucinate large missing regions.
2.2. Semantic Image Synthesis
Generating photo-realistic images from a semantic seg-
mentation mask is called semantic image synthesis. The ob-
jective is to generate realistic images that accurately reflect
the semantic guideline given by a segmentation mask. To
achieve this, the authors of SPADE [35] develop a spatially-
adaptive (de)normalization layer that modulates semantic information into the pixel-wise image features. In addition, OA-
SIS [43] improves the power of the discriminator by re-
placing the vanilla discriminator with a segmentation-based
discriminator. This replacement allows the generator to be
trained with a more informative signal from the discriminator,
resulting in better synthesis results. SPADE and OASIS were
initially not designed to perform the image completion task.
However, in this paper, we extend the usage of semantic
image synthesis blocks and use them for semantic-guided
image completion, where the segmentation guidance is created by our proposed instance segmentation generator and background segmentation completion network (depicted in Fig. 2).
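For readers unfamiliar with this layer, here is a minimal PyTorch sketch of a spatially-adaptive (de)normalization block in the spirit of SPADE [35]; the hidden width, kernel sizes, and two-branch parameterization are illustrative assumptions rather than the authors' exact configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class SPADENorm(nn.Module):
    """Sketch: normalize features, then modulate them with per-pixel
    scale/shift predicted from the (resized) semantic segmentation map."""

    def __init__(self, num_features: int, seg_channels: int, hidden: int = 128):
        super().__init__()
        self.norm = nn.BatchNorm2d(num_features, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(seg_channels, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, num_features, 3, padding=1)
        self.beta = nn.Conv2d(hidden, num_features, 3, padding=1)

    def forward(self, x, seg):
        # seg: one-hot segmentation map, resized to the feature resolution.
        seg = F.interpolate(seg, size=x.shape[-2:], mode="nearest")
        h = self.shared(seg)
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)
```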
Moreover, Latent Diffusion Model (LDM) and Stable Dif-
fusion [40] utilize a powerful cross-attention mechanism [50] to condition the diffusion model on various inputs, such as text and segmentation masks. To leverage the large-scale, pretrained Stable Diffusion model for our task, we devise
a new inference scheme to generate target instances in a
missing region using a segmentation mask provided by our
framework. Similar to our approach, SPG-Net [48] and
SG-Net [26] use semantic segmentation labels to provide more informative supervision for image completion. Our approach is similar to SPG-Net and SG-Net in that ImComplete predicts/recovers the semantic segmentation mask and generates the content based on the mask, but differs in that ImComplete
handles more challenging scenarios where the instances are
entirely removed.
To the best of our knowledge, the only existing work on
instance-aware image completion is HVITA [38], where
a target instance is wholly removed from an image. HVITA
consists of four steps: (1) detecting visible instances, (2) con-
structing a graph using detected instances to understand the
scene context, (3) generating a missing instance and placing
it on the missing region, and (4) refining the inserted image.
Yet, HVITA depends on conventional object detection and
is designed for completing rectangular regions. Despite the
additional refinement module in HVITA, it still produces
low-quality image completion results. In particular, distor-
tion occurs at the boundary between the generated instance
and its surroundings. On the other hand, our proposed Im-
Complete is free to handle arbitrarily shaped masks. Our
sophisticated pipeline for segmentation mask recovery helps
to understand the context of images better and encourages
visual continuity and plausibility at the boundaries between
generated and unmasked regions.
3. Method
Our framework aims to complete a corrupted image from which a visual instance has been completely removed. This prob-
lem is challenging as the model must not only generate a
target instance but also ensure that the generated instance
seamlessly blends with the remaining areas of the image. To
alleviate the problem, our framework (ImComplete) com-
pletes masked images in three steps. First, we infer a con-
textually appropriate instance by figuring out the category
of the missing instance (Sec. 3.1). Second, we complete the
semantic segmentation map of the missing region based on
the inference result from the previous step (Sec. 3.2). Finally,
we transform the masked image with semantic segmentation
maps to a realistic completed image (Sec. 3.3).
The overview figure of our framework is shown in Fig. 2.
We provide the full architecture and training details in Appendix E.
3.1. Missing Instance Prediction
To obtain the relationship information between instances in a given scene, ImComplete first predicts instance bounding box coordinates, instance classes, and a semantic segmentation map using the pre-trained DETR [4]. Let the panoptic segmentation map be $S_M = \mathrm{DETR}(I_M)$; we can then extract the box coordinates of the visible instances $B = [b_1, \dots, b_k]$ and their object classes $c = [c_1, \dots, c_k]$ from $S_M$, where $k$ is the number of predicted instances. Then, to infer the class of the missing instance $y_{\text{target}}$, a transformer network, called the missing instance inference transformer, utilizes tokens obtained from the object classes $c$ of the visible instances, as well as an additional token responsible for the missing region. In particular, we convert the visible instances' classes $c$ into learnable input tokens using a single linear layer. A quick approach would be to use object queries from DETR directly as input tokens, but we observe that this performs worse than employing new learnable class embeddings.
Furthermore, to inject the location information of the visible instances, we embed their bounding box coordinates into positional encoding vectors and add them to the learnable class embeddings. To acquire the positional encoding vectors, we feed the normalized center coordinates ($C_x$, $C_y$), width ($W$), and height ($H$) of each bounding box into a single linear layer with a sigmoid activation function. We apply the same procedure to the missing region token that will be used for missing instance inference. We explored different ways to create the positional encoding vectors in Appendix B and adopt the aforementioned scheme since it gives the best missing instance inference performance. The formulation below describes how our missing instance inference transformer works:
$$
\begin{aligned}
z_0 &= E_{\text{class}} + E_{\text{pos}} = \mathrm{MLP}(c) + \sigma(\mathrm{MLP}(B)), \qquad z_0 \in \mathbb{R}^{(k+1) \times d}, \\
z'_l &= \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, \qquad l = 1, \dots, L, \\
z_l &= \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l, \qquad l = 1, \dots, L, \\
y &= \mathrm{LN}(z_L^0),
\end{aligned} \tag{1}
$$
where MLP is a multi-layer perceptron, $\sigma$ is a sigmoid activation, $d$ is the dimension of the embedding vectors, and, in Eq. (1), $B$ and $c$ denote the augmented sequences $[b_0] \,\Vert\, B$ and $[c_0] \,\Vert\, c$, where $b_0$ is the bounding box coordinate of the missing region and $c_0$ is an extra class token for the missing instance. The missing instance inference transformer consists of 12 transformer encoder layers ($L = 12$) with eight heads.
The missing region token interacts with the visible instance tokens through the self-attention mechanism, so the network can predict the likely class of the missing instance based on the detected instances and their location information. Additional details regarding the training and architecture can be found in Appendix E.1.
[Figure 2: pipeline diagram. In Step 1, DETR extracts the visible instances and their bounding boxes from the input, and the missing instance inference transformer, using learnable positional encodings computed from the bounding boxes, predicts the missing instance class. In Step 2, the background segmentation completion network produces a pseudo segmentation mask, the instance segmentation generator (driven by noise $z \sim \mathcal{N}(0, 1)$) produces an instance segmentation mask, and the two are combined into the final segmentation mask. In Step 3, the segmentation-guided image completion network produces the output.]

Figure 2. Overview of the proposed approach, called ImComplete. ImComplete completes the image in three steps: (1) infer the missing instance class, (2) complete a segmentation map in the missing region, and (3) translate the segmentation map into an image to hallucinate the missing region.
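To make Eq. (1) concrete, below is a minimal PyTorch sketch of the missing instance inference transformer; the embedding width, the stock nn.TransformerEncoder (which uses post-LN rather than the pre-LN form written in Eq. (1)), and the toy class ids are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MissingInstanceTransformer(nn.Module):
    """Sketch of the missing instance inference transformer (Eq. 1).

    Hypothetical embedding size d=256; the paper specifies L=12
    encoder layers with 8 heads.
    """

    def __init__(self, num_classes: int, d: int = 256, layers: int = 12, heads: int = 8):
        super().__init__()
        # Learnable class embeddings; index num_classes is the extra
        # "missing instance" token c_0.
        self.class_embed = nn.Embedding(num_classes + 1, d)
        # Positional encoding: (Cx, Cy, W, H) -> d, followed by a sigmoid.
        self.pos_embed = nn.Linear(4, d)
        enc_layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.norm = nn.LayerNorm(d)
        self.head = nn.Linear(d, num_classes)  # predicts y_target

    def forward(self, classes: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # classes: (B, k+1) long; position 0 holds the missing-instance token.
        # boxes:   (B, k+1, 4) normalized (Cx, Cy, W, H); position 0 is the hole.
        z = self.class_embed(classes) + torch.sigmoid(self.pos_embed(boxes))
        z = self.encoder(z)                    # post-LN stand-in for Eq. (1)
        return self.head(self.norm(z[:, 0]))  # read out the missing-region token

# Toy usage: two visible instances plus the missing-region token.
model = MissingInstanceTransformer(num_classes=133)  # COCO-panoptic-like count
cls = torch.tensor([[133, 17, 0]])                   # [c_0, class 17, class 0] (made-up ids)
box = torch.rand(1, 3, 4)
logits = model(cls, box)                             # shape (1, 133)
```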
3.2. Semantic Segmentation Map Generation
Utilizing the predicted class of the missing instance from
the previous step, we aim to generate the semantic segmenta-
tion map of the missing region. We create the segmentation
map of the instance and the background area individually
with separate modules (instance segmentation generator and
background segmentation completion network) and obtain
the final segmentation map by inserting the missing instance
segmentation into the background segmentation, as shown
in Fig. 2.
Instance Segmentation Generator. We generate the missing instance segmentation using two modules: a generator and a discriminator.
The instance segmentation generator aims to create a plau-
sible segmentation map corresponding to the predicted in-
stance class. For the implementation, we use the architecture
from BigGAN [3], one of the most successful conditional
GANs, with slight modifications. We input the predicted
missing instance class from the previous step and the box co-
ordinates of the missing region to the Conditional Batch Nor-
malization [8] module in the instance segmentation generator.
We train the model using spectral normalization [34] and hinge loss [27] with the DiffAug [59] technique.
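As a sketch of this conditioning, the following Conditional Batch Normalization layer predicts its scale and shift from the predicted class embedding concatenated with the normalized missing-region box coordinates; the embedding width and the concatenation scheme are our assumptions for exposition, while the actual generator follows BigGAN [3].

```python
import torch
import torch.nn as nn

class ConditionalBatchNorm2d(nn.Module):
    """Sketch: BatchNorm whose scale/shift come from (class, bbox) conditioning."""

    def __init__(self, num_features: int, num_classes: int, embed_dim: int = 128):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        self.class_embed = nn.Embedding(num_classes, embed_dim)
        # Condition = class embedding || normalized box (Cx, Cy, W, H).
        self.gamma = nn.Linear(embed_dim + 4, num_features)
        self.beta = nn.Linear(embed_dim + 4, num_features)

    def forward(self, x, y, box):
        # x: (B, C, H, W); y: (B,) predicted class ids; box: (B, 4) float.
        cond = torch.cat([self.class_embed(y), box], dim=1)
        gamma = self.gamma(cond).unsqueeze(-1).unsqueeze(-1)
        beta = self.beta(cond).unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.bn(x) + beta  # BigGAN-style residual gain
```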
Background Segmentation Completion Network. The
background segmentation completion network produces the
segmentation map of the non-instance region without at-
tempting to generate the missing instance. To do this, we
randomly scribble the ground truth segmentation map and
let the background segmentation completion network restore
it using a cross-entropy loss. We experimentally find that
this procedure successfully reconstructs the background seg-
mentation maps. The background segmentation completion
network is implemented using convolutional heads and tails
with a transformer body. In Sec. 4.5, we demonstrate the
importance of the transformer body, particularly in cases where there is a large hole in the damaged image.
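A minimal sketch of this training signal, under assumed interfaces, is shown below: a random scribble corrupts the ground-truth background segmentation map, and the network is trained to restore the per-pixel class labels with a cross-entropy loss. The scribble generator and the network's input format are simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def random_scribble_mask(b, h, w, steps=200):
    """Hypothetical scribble: a random walk marking pixels as corrupted (1)."""
    mask = torch.zeros(b, 1, h, w)
    for i in range(b):
        y, x = h // 2, w // 2
        for _ in range(steps):
            mask[i, 0, y, x] = 1.0
            y = int(torch.clamp(y + torch.randint(-2, 3, (1,)), 0, h - 1))
            x = int(torch.clamp(x + torch.randint(-2, 3, (1,)), 0, w - 1))
    return mask

def train_step(net, seg_gt, num_classes, optimizer):
    # seg_gt: (B, H, W) integer background class labels.
    b, h, w = seg_gt.shape
    mask = random_scribble_mask(b, h, w)                    # 1 = corrupted
    onehot = F.one_hot(seg_gt, num_classes).permute(0, 3, 1, 2).float()
    net_in = torch.cat([onehot * (1 - mask), mask], dim=1)  # erase + flag holes
    logits = net(net_in)                                    # (B, num_classes, H, W)
    loss = F.cross_entropy(logits, seg_gt)                  # restore all pixels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```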
Finally, we obtain the overall segmentation map of the
missing region by inserting the instance segmentation into
the background segmentation. Further training and architec-
ture details can be found in Appendix E.2 and Appendix E.3.
3.3. Segmentation-guided Image Completion
This module is designed to complete the masked im-
age using the reconstructed segmentation mask as the guid-
ance. There are three versions (ImComplete_spade, ImComplete_oasis, and ImComplete_stable) depending on the image
generation approach plugged into our framework.
SPADE/OASIS Version. For ImComplete_spade and ImComplete_oasis, we use a UNet [41]-like completion model to hallucinate the missing region from pairs of the masked image and the predicted segmentation map. The input to the completion model is the masked image, and the predicted segmentation map is conditioned by SPADE [35] or OASIS [43] blocks to help the image completion model precisely fill in the missing region based on the semantics that the segmentation map provides. We only apply the conditioning blocks to the decoder part of the completion network. For ImComplete_oasis, the generator and discriminator losses proposed in the OASIS paper are applied to the masked region, and an L2 loss is imposed on the remaining undamaged area to keep the visible content intact.
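To illustrate how the two objectives can be split by region, here is a minimal sketch assuming a hinge-style adversarial term on the masked region and an L2 term on the visible region; the discriminator interface and loss weighting are assumptions, not the paper's exact formulation (ImComplete_oasis uses the OASIS segmentation-based discriminator losses).

```python
import torch
import torch.nn.functional as F

def completion_loss(pred, target, mask, disc, lam_rec=10.0):
    """Region-split objective sketch.

    pred, target: (B, 3, H, W) completed and ground-truth images.
    mask: (B, 1, H, W), 1 inside the missing region, 0 elsewhere.
    disc: a discriminator scoring realism of the completed image (assumed).
    """
    # Adversarial (hinge) generator loss, restricted to the hole by
    # compositing the prediction into the known pixels first:
    composited = mask * pred + (1 - mask) * target
    g_adv = -disc(composited).mean()
    # L2 reconstruction on the undamaged area keeps visible pixels intact.
    rec = F.mse_loss(pred * (1 - mask), target * (1 - mask))
    return g_adv + lam_rec * rec
```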