2 Y. Wang et al.
heavily relies on scene understanding. However, even when the ground-truth color
is available for supervision, it is still very challenging to predict pixel colors
from gray images, because the problem is ill-posed: one grayscale input can
correspond to multiple plausible color variants.
Most current methods [54,56,26,12,23,38,49,17,3] formulate colorization as a
pixel-level regression task and, to varying degrees, struggle with the multimodal
nature of the problem. With large-scale training data and end-to-end learning,
they can readily learn color distribution priors, e.g., greenish tones for
vegetation and typical human skin colors. However, for objects with inherent
color ambiguity (e.g., clothes, cars, and other man-made objects), these
approaches tend to predict brownish average colors. To tackle this
multimodality, several works [54,56,24] formulate color prediction as pixel-level
color classification, which allows multiple colors to be assigned to each pixel
according to the posterior probability. Unfortunately, these methods suffer from
regional color inconsistency because each pixel is sampled independently.
Sequential modeling [12,23] only partially alleviates the sampling issue, because
the unidirectional sequential dependence among flattened 2D pixel primitives
causes error accumulation and hinders learning efficiency.
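The averaging effect of regression and the inconsistency of independent sampling can be illustrated with a small numerical sketch. It is not taken from any of the cited methods: the two chroma modes, the noise level, and the bin size below are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (a, b) chroma values for one ambiguous object class
# (e.g. a car that is equally likely to be reddish or bluish).
modes = np.array([[65.0, 45.0], [-25.0, -55.0]])
labels = rng.integers(0, 2, size=1000)
samples = modes[labels] + rng.normal(0.0, 1.0, size=(1000, 2))

# Regression view: the constant prediction that minimizes the mean
# squared error is the sample mean, which lies between the two modes.
l2_pred = samples.mean(axis=0)
dist_to_modes = np.linalg.norm(modes - l2_pred, axis=1)

# Classification view: quantize chroma into bins (bin size is an
# assumption) and keep a probability for every occupied bin.
bin_size = 10
bins = np.floor(samples / bin_size).astype(int)
uniq, counts = np.unique(bins, axis=0, return_counts=True)
probs = counts / counts.sum()
top2 = uniq[np.argsort(counts)[-2:]]  # the two most likely bins

# Independent per-pixel sampling: each pixel of one region draws its own
# mode, so neighbouring pixels of the same object can disagree.
region_modes = rng.integers(0, 2, size=16)

print("L2-optimal chroma:", l2_pred)
print("distance to each mode:", dist_to_modes)
print("two most likely bins:", top2.tolist())
print("modes sampled inside one region:", region_modes.tolist())
```

The regression estimate falls far from both plausible colors, whereas the binned distribution keeps both modes; the last line shows why sampling each pixel independently from such a distribution does not yield a uniformly colored region.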
Apart from the multimodal issue, color bleeding is another common issue
in colorization, caused by inaccurate identification of semantic boundaries. To
suppress such visual artifacts, most works [54,56,26,38,49,17,3] resort to
generative adversarial networks (GANs) to encourage the generated chromatic
distribution to be indistinguishable from that of real-life color images.
Currently, no dedicated algorithms or modules for deep models have been proposed
to address this problem, even though it considerably affects visual quality.
To avoid modeling color multimodality pixel by pixel, we propose a new
colorization framework, PalGAN, that predicts pixel colors in a coarse-to-fine
paradigm. The key idea is to first predict the global palette probability (e.g.,
a palette histogram) from the grayscale input. This prediction does not collapse
into a single specific colorization solution but represents a color distribution
over the potential color variants. Then, the uncertainty in the per-pixel color
assignment is modeled with a generative model in the GAN framework, conditioned
on the grayscale input and the palette histogram. Therefore, multiple
colorization results can be obtained by changing the palette histogram input.
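As a rough illustration of what a palette histogram could look like, the sketch below builds a normalized 2D histogram over the chroma channels of a color image. This is an assumption made for illustration, not the definition used by PalGAN: the function name `palette_histogram`, the choice of the (a, b) channels, the 16 x 16 binning, and the value range are all hypothetical. In the framework described above, such a histogram would be predicted from the grayscale input rather than computed from a color image.

```python
import numpy as np

def palette_histogram(ab, bins=16, ab_range=(-110.0, 110.0)):
    """Normalized 2D histogram over the chroma channels of one image.

    ab: array of shape (H, W, 2) holding per-pixel (a, b) chroma values.
    bins and ab_range are illustrative assumptions.
    """
    a = ab[..., 0].ravel()
    b = ab[..., 1].ravel()
    hist, _, _ = np.histogram2d(a, b, bins=bins, range=[ab_range, ab_range])
    total = hist.sum()
    # Normalize so the histogram is a probability distribution over bins.
    return hist / total if total > 0 else hist

rng = np.random.default_rng(0)

# A synthetic image with varied chroma spreads its mass over many bins.
varied = palette_histogram(rng.uniform(-100.0, 100.0, size=(32, 32, 2)))

# A uniformly colored image puts all of its mass into a single bin.
flat = palette_histogram(np.zeros((32, 32, 2)))

print(varied.shape, np.count_nonzero(varied), np.count_nonzero(flat))
```

Because the histogram discards spatial layout, it describes which colors occur and in what proportion without fixing where they go, which is the property the coarse stage relies on; per-pixel assignment is left to the conditioned generator.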
To guarantee semantically correct and regionally consistent color assignment,
we model color affinities with a proposed chromatic attention module. It
explicitly aligns color affinity with both semantics and low-level
characteristics. Structurally, chromatic attention consists of global
interaction and local delineation. The former enables the use of global context
for color inference by feeding semantic features to the attention mechanism. The
latter preserves regional details by mapping the gray input to color through a
local affine transformation, which is explicitly parameterized by the
correlation between the gray input and the color features. Experiments
demonstrate the effectiveness of our method. It achieves impressive visual
results (Fig. 1) and quantitative superiority over state-of-the-art approaches
on ImageNet [9] and COCO-