points, which inevitably contain fork points due to the com-
plex character structure. Second, these methods are typically tailored to regular and highly structured standard fonts and may not perform well on handwritten characters due to
the large intra-class variance of strokes caused by different
handwriting habits. Last, they aim to optimize the stroke ex-
traction task only and may not produce transferable features
to benefit downstream tasks.
Moreover, there are no standardized benchmarks to pro-
vide a fair comparison between different stroke extraction
methods, which is of great importance to guide and facilitate
further research. The lack of publicly available datasets also leads to inconsistent evaluation protocols. Specifically, (Cao and Tan 2000; Qiguang 2004; Xu et al. 2016) adopt accuracy as the main evaluation metric for stroke extraction; this metric ignores the spatial location of the extracted strokes and therefore cannot comprehensively measure the performance of a stroke extraction algorithm. (Chen et al. 2016, 2017) leverage Hamming distance and cut discrepancy to measure the consistency of stroke interiors and the similarity of stroke boundaries, respectively. These metrics require the extracted strokes and the ground-truth strokes to be strictly aligned in spatial location and category, making it hard to account for missed and false extractions. Thus, how to effectively evaluate stroke extraction algorithms with a reasonable protocol remains an unsolved question.
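For concreteness, the sketch below (using toy binary stroke masks and hypothetical helper names, not the exact protocol of any cited work) shows how an interior-consistency metric such as Hamming distance behaves, and why it presupposes strict spatial alignment between a predicted stroke and its ground truth:

```python
import numpy as np

def hamming_distance(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Fraction of pixels where two binary stroke masks disagree."""
    return float(np.mean(mask_a.astype(bool) != mask_b.astype(bool)))

def mask_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection-over-union of two binary stroke masks."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 1.0

# Two toy 4x4 "stroke" masks: a horizontal bar vs. the same bar shifted down.
pred = np.zeros((4, 4), dtype=np.uint8)
pred[1, :] = 1
gt = np.zeros((4, 4), dtype=np.uint8)
gt[2, :] = 1

print(hamming_distance(pred, gt))  # 0.5: 8 of 16 pixels disagree
print(mask_iou(pred, gt))          # 0.0: no spatial overlap at all
```

A one-pixel misalignment already yields a large Hamming distance and zero overlap, even though the two strokes are otherwise identical; a stroke that is missed entirely or falsely extracted has no aligned counterpart to compare against in the first place.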
To facilitate stroke extraction research, we present a Chi-
nese Character Stroke Extraction (CCSE) benchmark, with
two new large-scale datasets and evaluation methods. As
the foundation of the CCSE benchmark, the datasets must satisfy two requirements: character-level diversity and stroke-level diversity. Specifically, the datasets should cover as many Chinese characters as possible to represent the structural relationships between strokes, which can be very complex (see the left of Figure 2). Moreover, since humans with different writing
habits will produce very different appearances even for the
same stroke (see the right of Figure 2), the datasets should
cover this kind of diversity for models to achieve effective
extraction. To this end, we harvested a large set of Kai Ti
(a kind of Chinese font) Chinese character images and hand-
written Chinese character images to achieve character-level
diversity and stroke-level diversity, respectively.
With the large-scale datasets, we hope to leverage the rep-
resentation power of deep models such as CNNs to solve
the stroke extraction task, which, however, remains an open
question. To this end, we turn the stroke extraction problem into a stroke instance segmentation problem. This change of view allows us to take advantage not only of state-of-the-art instance segmentation models but also of their well-defined evaluation metrics (i.e., box AP and mask AP). We
perform experiments with state-of-the-art instance segmen-
tation models to produce benchmark results that facilitate
further research. Compared to previous stroke extraction methods, our approach requires neither reference images nor in-depth domain expertise. Moreover, the deep models
trained on our dataset are able to produce transferable fea-
tures that consistently benefit the downstream tasks.
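To illustrate this reformulation (a minimal sketch with a hypothetical helper, not our actual data pipeline), each annotated stroke can be encoded as one instance record in the COCO annotation style, after which off-the-shelf instance segmentation models and the standard box AP / mask AP metrics apply unchanged:

```python
import numpy as np

def mask_to_coco_annotation(mask: np.ndarray, stroke_class: int) -> dict:
    """Convert one non-empty binary stroke mask into a COCO-style instance
    record: a bounding box [x, y, w, h] plus the mask itself, so that
    standard box AP / mask AP tooling can evaluate stroke extraction."""
    ys, xs = np.nonzero(mask)
    x0, y0 = int(xs.min()), int(ys.min())
    w, h = int(xs.max() - x0 + 1), int(ys.max() - y0 + 1)
    return {"category_id": stroke_class,
            "bbox": [x0, y0, w, h],
            "segmentation": mask.astype(np.uint8)}

# Toy 5x5 character image containing one diagonal "stroke".
mask = np.zeros((5, 5), dtype=np.uint8)
for i in range(1, 4):
    mask[i, i] = 1

ann = mask_to_coco_annotation(mask, stroke_class=3)
print(ann["bbox"])  # [1, 1, 3, 3]
```

Under this encoding, a missed stroke is simply a false negative and a spurious stroke a false positive, so the AP metrics penalize both without requiring any bespoke alignment procedure.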
We summarize our contributions as follows:
• We propose the first benchmark containing two high-
quality large-scale datasets that satisfy the requirements
of the character-level and stroke-level diversities for
building promising stroke extraction models.
• We cast the stroke extraction problem as a stroke instance segmentation problem. In this way, we build deep stroke extraction models that scale to scenarios with highly diverse characters and large stroke variance while producing transferable features that benefit downstream tasks.
• By leveraging the state-of-the-art instance segmentation
models and well-defined evaluation metrics, we build
standardized benchmarks to facilitate further research.
Related Work
Stroke Extraction
Stroke extraction aims to extract strokes from a handwritten character image (Lee and Wu 1998), which is difficult due to the complex character structure (Cao and Tan 2000) and the large intra-class variance (Xu et al. 2016). Existing methods mainly follow one of two paradigms: stroke extraction from the skeletonized character or from the original character. For the first paradigm, efforts have been put into exploring the relations between strokes by resolving the fork-point issue (Fan and Wu 2000), applying affine transformations to strokes (Liu, Jia, and Tan 2006), detecting ambiguous zones (Su, Cao, and Wang 2009), and using an additional reference image (Zeng et al. 2010). However, these approaches are limited by the thinning step, which introduces stroke distortion and loses short strokes. Stroke extraction from the original image was therefore proposed to overcome this limitation. These approaches focus on leveraging the rich information in characters such as stroke width and curvature
by combining multiple contour information in strokes (Lee
and Wu 1998), exploring pixel-stroke relationships (Cao and
Tan 2000), detecting strokes in multiple directions (Su and
Wang 2004) and using corner points (Yu, Wu, and Yuan
2012). The latest approach (Xu et al. 2016) combines the advantages of both paradigms to further improve performance. Nonetheless, these methods typically rely on hand-crafted rules designed solely for the stroke extraction task. As a result, they inherently struggle with complex characters and highly irregular stroke shapes. Moreover, they cannot be trivially employed for downstream tasks such as font generation, limiting their further application.
Instance Segmentation
The goal of instance segmentation is to segment every instance (countable object) in an image by assigning it a pixel-wise class label. Existing approaches can be broadly
divided into two categories: two-stage (He et al. 2017;
Hsieh et al. 2021) and one-stage (Bolya et al. 2019). Two-
stage methods consist of instance detection and segmenta-
tion steps. In Mask R-CNN (He et al. 2017), one of the most
important milestones in computer vision, the segmentation
head is applied to the detected instances from the Faster
R-CNN (Ren et al. 2015) detector to acquire the instance-
wise segmentation mask. Approaches based on Mask R-
CNN typically demand dense prior proposals or anchors to