the dominant colors in the dataset.
To alleviate these issues, conditional image colorization
methods take partial hints in addition to the input image, and
attempt to generate a realistic output image that reflects the
context of the given hints. Several studies have leveraged
user interactions as additional conditions for the model,
assuming that users provide a desired color value for a
region as a point-wise color hint [40] or a scribble [28, 3].
Although these approaches
have made remarkable progress, there still exist nontrivial
limitations. First, existing approaches do not address the
issue of estimating the semantic regions over which the
user-given color hints should be spread, and thus the
colorization model tends to require a large number of user
hints to produce a desirable output. Second, for every
interaction at test time, users are still expected to provide
the local position of each color hint by pointing out the
region of interest (RoI), which increases their effort and
time commitment. Lastly, since existing approaches typically
place the color hints at randomized locations at training
time, the discrepancy between the hint-giving mechanisms of
the training and test phases remains unaddressed.
In this work, we propose a novel model-guided frame-
work for the interactive colorization of a sketch image,
called GuidingPainter. A key idea behind our work is to
make the model actively seek the regions where color hints
would be provided, which can significantly improve the
efficiency of the interactive colorization process. To this
end, GuidingPainter consists of two modules: an active-guidance
module and a colorization module. Although the colorization
module works similarly to those of previous methods, our main
contribution is the hint generation mechanism of the
active-guidance module. The active-guidance module
(Sections 3.2-3.3) (i) divides the input image into multiple
semantic regions and (ii) ranks them in decreasing order of
the model's estimated gain when each region is colorized
(Fig. 1(a)).
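To make this two-step procedure concrete, we give a minimal PyTorch-style sketch of how an active-guidance module of this form could be structured; the class name ActiveGuidance, the layer configuration, and the channel-wise softmax are illustrative assumptions for exposition only, with the actual architecture described in Sections 3.2-3.3.

```python
import torch
import torch.nn as nn

class ActiveGuidance(nn.Module):
    """Sketch of an active-guidance module: predict K soft region masks
    whose fixed channel order doubles as the priority ranking, i.e.,
    channel 0 is the region estimated to benefit the model most."""

    def __init__(self, in_channels: int = 1, num_regions: int = 16):
        super().__init__()
        # A small convolutional head stands in for the segmentation
        # network; the real module would use a full encoder-decoder.
        self.seg_net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, num_regions, 1),
        )

    def forward(self, sketch: torch.Tensor) -> torch.Tensor:
        # Softmax over channels yields a soft partition of the image
        # into num_regions regions, ordered by estimated priority.
        return torch.softmax(self.seg_net(sketch), dim=1)  # (B, K, H, W)
```

Under this convention, requesting k hints amounts to reading off the first k channels of the predicted masks.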
Since it is extremely expensive to obtain ground-truth
segmentation labels, let alone their prioritization, we explore
a simple yet effective approach that identifies the meaningful
regions in order of their priority without any manually
annotated labels. In our active guidance mechanism
(Section 3.3), GuidingPainter learns such regions by
intentionally differentiating the usage frequency of each
channel obtained from the segmentation network. We also
conduct a toy experiment (Section 4.5) to illustrate this
mechanism and to verify the validity of our approach.
Furthermore, we propose several loss terms, e.g., a smoothness
loss and a total variation loss, to improve the colorization
quality of our framework (Section 3.5), and analyze their
effectiveness both quantitatively and qualitatively
(Section 4.6). Note that the
only action required of users in our framework is to select
one representative color for each region the model provides
based on the estimated priorities (Fig. 1(b)). Afterwards, the
colorization network (Section 3.4) generates a high-quality
colorized output by taking the given sketch image and the
color hints as input (Fig. 1(c)).
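As a rough illustration of this training-time mechanism, the snippet below simulates user hints by drawing a number k of top-ranked masks at each step, so that low-index channels are used more frequently than high-index ones, and takes the mask-weighted mean of the ground-truth image as the single representative color of each region; a standard total variation penalty of the kind referenced in Section 3.5 is also shown. The helper names (sample_hints, tv_loss) and the uniform sampling of k are our own simplifying assumptions, not the exact procedure of Section 3.3.

```python
import torch

def sample_hints(masks: torch.Tensor, gt_color: torch.Tensor,
                 max_hints: int) -> torch.Tensor:
    """Illustrative hint simulation: use only the first k ranked masks,
    so channel i is exercised with probability decreasing in i, which
    differentiates the per-channel usage frequency."""
    # masks: (B, K, H, W) soft regions; gt_color: (B, 3, H, W).
    k = int(torch.randint(0, max_hints + 1, (1,)))
    hints = torch.zeros_like(gt_color)
    for i in range(k):
        m = masks[:, i:i + 1]                                # (B, 1, H, W)
        area = m.sum(dim=(2, 3), keepdim=True).clamp(min=1e-6)
        # One representative color per region: the mask-weighted mean of
        # the ground-truth image, spread back over the soft region.
        mean_color = (gt_color * m).sum(dim=(2, 3), keepdim=True) / area
        hints = hints + m * mean_color
    return hints

def tv_loss(img: torch.Tensor) -> torch.Tensor:
    """One plausible form of the total variation loss of Section 3.5,
    penalizing abrupt color changes between neighboring pixels."""
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()
    return dh + dw
```

At test time, the mask-weighted ground-truth mean is simply replaced by the color the user selects for each region.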
In summary, our contributions are threefold:
• We propose a novel model-guided deep image colorization
framework, which prioritizes the regions of a sketch image
in order of their importance to the colorization model.
• GuidingPainter can learn to discover meaningful regions
for colorization and to rank them by priority using only the
ground-truth colorized image, without additional manual
supervision.
• We demonstrate that our framework can be applied to
a variety of datasets by comparing it against previous
interactive colorization approaches in terms of various
metrics, including our proposed evaluation protocol.
2. Related Work
2.1. Deep Image Colorization
Existing deep image colorization methods, which utilize
deep neural networks for colorization, can be divided into
automatic and conditional approaches, depending on whether
additional conditions are involved. Automatic image
colorization models [39, 29, 36, 1] take a gray-scale or
sketch image as an input and generate a colorized image.
CIC [39] proposed a fully automatic colorization model
using convolutional neural networks (CNNs), and Su et
al. [29] further improved the model by extracting the fea-
tures of objects in the input image. Despite the substantial
performance of automatic colorization models, a nontrivial
amount of user intervention is still required in practice.
Conditional image colorization models attempt to re-
solve these limitations by taking reference images [16] or
user interactions [40, 3, 38, 34, 37] as additional input. For
example, Zhang et al. [40] allowed the users to input the
point-wise color hint in real time, and AlacGAN [3] uti-
lized stroke-based user hints by extracting semantic feature
maps. Although these studies show that the results improve
as user hints are given, they generally require a large
number of user interactions.
2.2. Interactive Image Generation
Beyond the colorization task, user interaction is uti-
lized in numerous computer vision tasks, such as image
generation, and image segmentation. In image genera-
tion, research has been actively conducted to utilize vari-
ous user interactions as additional input to GANs. A va-
riety of GAN models employ image-related features from
users to generate user-driven images [7, 17] and face im-
ages [26, 12, 31, 15, 30]. Several models generate and edit
images via natural-language text [35, 23, 42, 2]. In image