the dominant colors in the dataset.
To alleviate these issues, conditional image colorization
methods take partial hints in addition to the input image, and
attempt to generate a realistic output image that reflects the
context of the given hints. Several studies have leveraged
user interactions as additional conditions for the model,
assuming that users provide a desired color value for a
region as a point-wise color hint [40] or a scribble [28, 3].
Although these approaches
have made remarkable progress, there still exist nontrivial
limitations. First, existing approaches do not address the
issue of estimating the semantic regions over which the
user-given color hints should be spread, and thus the
colorization model tends to require a large number of user
hints to produce a desirable output. Second, for every
interaction at test time, users are still expected to provide
the local position of each color hint by pointing out the
region of interest (RoI), which increases their effort and
time commitment. Lastly, since existing approaches typically
place the color hints at randomized locations at training
time, the discrepancy between the hint-giving mechanisms of
the training and test phases remains unaddressed.
In this work, we propose a novel model-guided frame-
work for the interactive colorization of a sketch image,
called GuidingPainter. A key idea behind our work is to
make the model actively seek the regions where color hints
would be provided, which can significantly improve the
efficiency of the interactive colorization process. To this
end, GuidingPainter consists of two modules: an active-guidance
module and a colorization module. Although the colorization
module works similarly to those of previous methods, our main
contribution is the hint generation mechanism of the
active-guidance module. The active-guidance module
(Sections 3.2-3.3) (i) divides the input image into multiple
semantic regions and (ii) ranks them in decreasing order of
the model's estimated gain when each region is colorized
(Fig. 1(a)).
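To make this two-step procedure concrete, we give a minimal PyTorch-style sketch of how an active-guidance module of this form could be structured; the class name ActiveGuidance, the layer configuration, and the channel-wise softmax are illustrative assumptions for exposition only, with the actual architecture described in Sections 3.2-3.3.

```python
import torch
import torch.nn as nn

class ActiveGuidance(nn.Module):
    """Sketch of an active-guidance module: predict K soft region masks
    whose fixed channel order doubles as the priority ranking, i.e.,
    channel 0 is the region estimated to benefit the model most."""

    def __init__(self, in_channels: int = 1, num_regions: int = 16):
        super().__init__()
        # A small convolutional head stands in for the segmentation
        # network; the real module would use a full encoder-decoder.
        self.seg_net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, num_regions, 1),
        )

    def forward(self, sketch: torch.Tensor) -> torch.Tensor:
        # Softmax over channels yields a soft partition of the image
        # into num_regions regions, ordered by estimated priority.
        return torch.softmax(self.seg_net(sketch), dim=1)  # (B, K, H, W)
```

Under this convention, requesting k hints amounts to reading off the first k channels of the predicted masks.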
Since it is extremely expensive to obtain ground-truth
segmentation labels, let alone their prioritization, we explore
a simple yet effective approach that identifies the meaningful
regions in order of their priority without any manually
annotated labels. In our active guidance mechanism
(Section 3.3), GuidingPainter learns such regions by
intentionally differentiating the usage frequency of each
channel obtained from the segmentation network. We also
conduct a toy experiment (Section 4.5) to illustrate this
mechanism and to verify the validity of our approach.
Furthermore, we propose several loss terms, e.g., a smoothness
loss and a total variation loss, to improve the colorization
quality of our framework (Section 3.5), and analyze their
effectiveness both quantitatively and qualitatively
(Section 4.6). Note that the
only action required of users in our framework is to select
one representative color for each region the model provides
based on the estimated priorities (Fig. 1(b)). Afterwards, the
colorization network (Section 3.4) generates a high-quality
colorized output by taking the given sketch image and the
color hints as input (Fig. 1(c)).
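As a rough illustration of this training-time mechanism, the snippet below simulates user hints by drawing a number k of top-ranked masks at each step, so that low-index channels are used more frequently than high-index ones, and takes the mask-weighted mean of the ground-truth image as the single representative color of each region; a standard total variation penalty of the kind referenced in Section 3.5 is also shown. The helper names (sample_hints, tv_loss) and the uniform sampling of k are our own simplifying assumptions, not the exact procedure of Section 3.3.

```python
import torch

def sample_hints(masks: torch.Tensor, gt_color: torch.Tensor,
                 max_hints: int) -> torch.Tensor:
    """Illustrative hint simulation: use only the first k ranked masks,
    so channel i is exercised with probability decreasing in i, which
    differentiates the per-channel usage frequency."""
    # masks: (B, K, H, W) soft regions; gt_color: (B, 3, H, W).
    k = int(torch.randint(0, max_hints + 1, (1,)))
    hints = torch.zeros_like(gt_color)
    for i in range(k):
        m = masks[:, i:i + 1]                                # (B, 1, H, W)
        area = m.sum(dim=(2, 3), keepdim=True).clamp(min=1e-6)
        # One representative color per region: the mask-weighted mean of
        # the ground-truth image, spread back over the soft region.
        mean_color = (gt_color * m).sum(dim=(2, 3), keepdim=True) / area
        hints = hints + m * mean_color
    return hints

def tv_loss(img: torch.Tensor) -> torch.Tensor:
    """One plausible form of the total variation loss of Section 3.5,
    penalizing abrupt color changes between neighboring pixels."""
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()
    return dh + dw
```

At test time, the mask-weighted ground-truth mean is simply replaced by the color the user selects for each region.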
In summary, our contributions are threefold:
• We propose a novel model-guided deep image colorization
framework, which prioritizes the regions of a sketch image
in order of their importance to the colorization model.
• GuidingPainter can learn to discover meaningful regions
for colorization and to rank them by priority using only the
ground-truth colorized image, without additional manual
supervision.
• We demonstrate that our framework can be applied to
a variety of datasets by comparing it against previous
interactive colorization approaches in terms of various
metrics, including our proposed evaluation protocol.
2. Related Work
2.1. Deep Image Colorization
Existing deep image colorization methods, which utilize
deep neural networks for colorization, can be divided into
automatic and conditional approaches, depending on whether
additional conditions are involved. Automatic image
colorization models [39, 29, 36, 1] take a gray-scale or
sketch image as an input and generate a colorized image.
CIC [39] proposed a fully automatic colorization model
using convolutional neural networks (CNNs), and Su et
al. [29] further improved the model by extracting the fea-
tures of objects in the input image. Despite the substantial
performance of automatic colorization models, a nontrivial
amount of user intervention is still required in practice.
Conditional image colorization models attempt to re-
solve these limitations by taking reference images [16] or
user interactions [40, 3, 38, 34, 37] as additional input. For
example, Zhang et al. [40] allowed the users to input the
point-wise color hint in real time, and AlacGAN [3] uti-
lized stroke-based user hints by extracting semantic feature
maps. Although these studies show that the results improve
as user hints are given, they generally require a large
number of user interactions.
2.2. Interactive Image Generation
Beyond the colorization task, user interaction is uti-
lized in numerous computer vision tasks, such as image
generation, and image segmentation. In image genera-
tion, research has been actively conducted to utilize vari-
ous user interactions as additional input to GANs. A va-
riety of GAN models employ image-related features from
users to generate user-driven images [7, 17] and face im-
ages [26, 12, 31, 15, 30]. Several models generate and edit
images via natural-language text [35, 23, 42, 2]. In image