ated by our proposed instance segmentation generator and
background segmentation completion network (depicted in
Fig. 2).
Moreover, the Latent Diffusion Model (LDM) and Stable Diffusion [40] utilize a powerful cross-attention mechanism [50] to condition the diffusion model on various inputs, such as text and segmentation masks. To leverage the large-scale pretrained Stable Diffusion model for our task, we devise a new inference scheme that generates target instances in a missing region using a segmentation mask provided by our framework. Similar to our approach, SPG-Net [48] and SG-Net [26] use semantic segmentation labels to provide more informative supervision for image completion. Our approach resembles SPG-Net and SG-Net in that ImComplete predicts/recovers the semantic segmentation mask and generates the content based on the mask, but differs in that ImComplete handles more challenging scenarios where instances are entirely removed.
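To illustrate mask-conditioned generation with a pretrained Stable Diffusion model, the snippet below uses the Hugging Face diffusers inpainting pipeline. It is a minimal sketch of the general mechanism only, not the exact inference scheme proposed here; the checkpoint name, prompt, and file paths are assumptions.

```python
# Minimal sketch: mask-conditioned generation with a pretrained Stable
# Diffusion inpainting pipeline (Hugging Face diffusers). Illustrative
# only; checkpoint, prompt, and file paths are assumptions.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",  # assumed checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("masked_scene.png").convert("RGB").resize((512, 512))
mask = Image.open("instance_mask.png").convert("L").resize((512, 512))

# The mask (white = region to fill) plays the role of the instance
# segmentation mask that our framework would provide.
result = pipe(prompt="a dog sitting on the grass",
              image=image, mask_image=mask).images[0]
result.save("completed.png")
```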
To the best of our knowledge, the only existing work on instance-aware image completion is HVITA [38], where a target instance is wholly removed from an image. HVITA consists of four steps: (1) detecting visible instances, (2) constructing a graph over the detected instances to understand the scene context, (3) generating a missing instance and placing it in the missing region, and (4) refining the resulting image. However, HVITA depends on conventional object detection and is designed to complete rectangular regions only. Despite its additional refinement module, HVITA still produces low-quality completion results; in particular, distortion occurs at the boundary between the generated instance and its surroundings. In contrast, our proposed ImComplete handles arbitrarily shaped masks, and its segmentation mask recovery pipeline promotes a better understanding of image context and encourages visual continuity and plausibility at the boundaries between generated and unmasked regions.
3. Method
Our framework aims to complete a corrupted image from which a visual instance has been entirely removed. This problem is challenging because the model must not only generate a target instance but also ensure that the generated instance blends seamlessly with the remaining areas of the image. To address this problem, our framework (ImComplete) completes masked images in three steps. First, we infer a contextually appropriate instance by predicting the category of the missing instance (Sec. 3.1). Second, we complete the semantic segmentation map of the missing region based on the inference result from the previous step (Sec. 3.2). Finally, we transform the masked image with the semantic segmentation map into a realistic completed image (Sec. 3.3).
An overview of our framework is shown in Fig. 2. We provide all architecture and training details in Appendix E.
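A high-level sketch of this three-step pipeline is given below; the three function names are hypothetical placeholders for the components described in Secs. 3.1 to 3.3, not part of the actual implementation.

```python
# High-level sketch of the ImComplete pipeline. The three functions
# are hypothetical placeholders for the components of Secs. 3.1-3.3.
def imcomplete(masked_image, missing_region_mask):
    # Step 1 (Sec. 3.1): infer the class of the missing instance from
    # the visible instances and the missing region location.
    target_class = predict_missing_class(masked_image, missing_region_mask)

    # Step 2 (Sec. 3.2): complete the semantic segmentation map of the
    # missing region, conditioned on the inferred class.
    seg_map = complete_segmentation(masked_image, missing_region_mask,
                                    target_class)

    # Step 3 (Sec. 3.3): translate the masked image plus the recovered
    # segmentation map into a realistic completed image.
    return generate_pixels(masked_image, seg_map)
```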
3.1. Missing Instance Prediction
To obtain relational information between instances in a given scene, ImComplete first predicts instance bounding box coordinates, instance classes, and a semantic segmentation map using the pre-trained DETR [4]. Let the panoptic segmentation map be $S_M = \mathrm{DETR}(I_M)$; we can then extract the box coordinates of the visible instances $B = [b_1, \ldots, b_k]$ and their object classes $c = [c_1, \ldots, c_k]^\top$ from $S_M$, where $k$ is the number of predicted instances. Then, to infer the class of the missing instance $y_{\text{target}}$, a transformer network, called the missing instance inference transformer, utilizes tokens obtained from the object classes $c$ of the visible instances, as well as an additional token responsible for the missing region. In particular, we convert the visible instances' classes $c$ into learnable input tokens using a single linear layer. A quick alternative is to use the object queries from DETR directly as input tokens, but we observe that this performs worse than employing new learnable class embeddings.
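The sketch below illustrates this step with the public DETR detection checkpoint from torch.hub; the confidence threshold, the dummy input, and the embedding dimensions are simplifying assumptions (the paper uses DETR's panoptic variant).

```python
# Sketch: extract visible-instance boxes/classes with pretrained DETR,
# then embed class labels as learnable tokens via a single linear layer.
# Threshold, dummy input, and dimensions are assumptions; the paper
# uses the panoptic DETR variant.
import torch
import torch.nn as nn

detr = torch.hub.load("facebookresearch/detr", "detr_resnet50",
                      pretrained=True).eval()

masked_image = torch.rand(1, 3, 800, 800)           # stand-in for I_M
with torch.no_grad():
    out = detr(masked_image)
probs = out["pred_logits"].softmax(-1)[0, :, :-1]   # drop no-object class
keep = probs.max(-1).values > 0.9                   # assumed threshold
c = probs[keep].argmax(-1)          # classes of the k visible instances
B = out["pred_boxes"][0][keep]      # normalized (cx, cy, w, h), [k, 4]

# One-hot class -> learnable d-dim token via a single linear layer.
num_classes, d = 91, 256
to_token = nn.Linear(num_classes, d)
E_class = to_token(nn.functional.one_hot(c, num_classes).float())  # [k, d]
```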
Furthermore, to inject the location information of the visible instances, we embed their bounding box coordinates into positional encoding vectors and add them to the learnable class embeddings. To obtain the positional encoding vectors, we feed the normalized center coordinates ($C_x$, $C_y$), width ($W$), and height ($H$) of each bounding box into a single linear layer with a sigmoid activation function. We apply the same procedure to the missing region token that will be used for missing instance inference. We explore different ways to create the positional encoding vectors in Appendix B and adopt the aforementioned encoding since it gives the best missing instance inference performance. Formally, the missing instance inference transformer operates as follows:
$$
\begin{aligned}
z_0 &= E_{\mathrm{class}} + E_{\mathrm{pos}} = \mathrm{MLP}(c') + \sigma(\mathrm{MLP}(B')), && z_0 \in \mathbb{R}^{(k+1) \times d}, \\
z'_l &= \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, && l = 1, \ldots, L, \\
z_l &= \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l, && l = 1, \ldots, L, \\
y &= \mathrm{LN}(z_L^0),
\end{aligned}
\tag{1}
$$
where MLP is a multi-layer perceptron, $\sigma$ is a sigmoid activation, $d$ is the dimension of the embedding vectors, $B' = [b_0] \cup B$ with $b_0$ the bounding box coordinates of the missing region, and $c' = [c_0] \cup c$ with $c_0$ an extra class token for the missing instance. The missing instance inference transformer consists of 12 transformer encoder layers ($L = 12$) with eight heads.
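A direct PyTorch rendering of Eq. (1) is sketched below. Hyperparameters follow the text (L = 12 pre-norm encoder layers, eight heads, tokens built from one-hot classes and normalized boxes), while the MLP widths and the final classification head are assumptions.

```python
# Sketch of the missing instance inference transformer of Eq. (1):
# pre-norm encoder blocks (MSA + MLP with residuals) over k visible
# tokens plus one missing-region token. MLP widths and the final
# classification head are assumptions.
import torch
import torch.nn as nn

class MissingInstanceTransformer(nn.Module):
    def __init__(self, num_classes=91, d=256, L=12, heads=8):
        super().__init__()
        self.class_mlp = nn.Linear(num_classes, d)   # E_class from c'
        self.pos_mlp = nn.Linear(4, d)               # E_pos from B'
        self.blocks = nn.ModuleList()
        for _ in range(L):
            self.blocks.append(nn.ModuleDict({
                "ln1": nn.LayerNorm(d),
                "msa": nn.MultiheadAttention(d, heads, batch_first=True),
                "ln2": nn.LayerNorm(d),
                "mlp": nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(),
                                     nn.Linear(4 * d, d)),
            }))
        self.ln_out = nn.LayerNorm(d)
        self.head = nn.Linear(d, num_classes)        # assumed y_target head

    def forward(self, c_onehot, boxes):
        # c_onehot: [k+1, num_classes], index 0 = missing-instance token c_0
        # boxes:    [k+1, 4] normalized (cx, cy, w, h), index 0 = b_0
        z = self.class_mlp(c_onehot) + torch.sigmoid(self.pos_mlp(boxes))
        z = z.unsqueeze(0)                           # [1, k+1, d]
        for blk in self.blocks:
            h = blk["ln1"](z)
            z = z + blk["msa"](h, h, h, need_weights=False)[0]
            z = z + blk["mlp"](blk["ln2"](z))
        y = self.ln_out(z[:, 0])                     # y = LN(z_L^0)
        return self.head(y)                          # missing-class logits
```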
The missing region token interacts with the visible region tokens through the self-attention mechanism. Thus, the network
can more accurately predict the likely class of the missing
instance based on the detected instances and their location