FontTransformer: Few-shot High-resolution Chinese Glyph Image
Synthesis via Stacked Transformers
Yitian Liu^a, Zhouhui Lian^a,*
^a Wangxuan Institute of Computer Technology, Peking University, Beijing, 100871, China
ARTICLE INFO
Keywords:
font generation
style transfer
Transformers
ABSTRACT
Automatic generation of high-quality Chinese fonts from a few online training samples is a
challenging task, especially when the number of samples is very small. Existing few-shot font
generation methods can only synthesize low-resolution glyph images that often possess incorrect
topological structures and/or incomplete strokes. To address the problem, this paper proposes
FontTransformer, a novel few-shot learning model for high-resolution Chinese glyph image
synthesis that uses stacked Transformers. The key idea is to apply a parallel Transformer to
avoid the accumulation of prediction errors and a serial Transformer to enhance the
quality of synthesized strokes. Meanwhile, we also design a novel encoding scheme that feeds more
glyph information and prior knowledge to our model, further enabling the generation of
high-resolution and visually pleasing glyph images. Both qualitative and quantitative experimental
results demonstrate the superiority of our method over other existing approaches
on the few-shot Chinese font synthesis task.
1. Introduction
Computer fonts are widely used in our daily lives. The legibility and aesthetics of the fonts adopted in books, posters,
advertisements, etc., are critical to their producers during the design process. Thereby, the demand for high-quality
fonts in various styles has grown rapidly. However, font design is a creative and time-consuming task,
especially for font libraries consisting of large numbers of characters (e.g., Chinese). For example, the official character
set GB18030-2000 consists of 27,533 Chinese characters, most of which have complicated structures and contain dozens
of strokes [1]. Designing or writing out such a large number of complex glyphs in a consistent style is time-consuming
and costly. Thus, more and more researchers and companies are interested in developing systems that can automatically
generate high-quality Chinese fonts from a few input samples.
With the help of various neural network architectures (e.g., CNNs and RNNs), researchers have proposed many
DL-based methods for Chinese font synthesis. DL-based methods aim to model the relationship between input and
output data (outlines, glyph images, or writing trajectories). Most of them are CNN-based models, such as zi2zi [2],
EMD [3], and SCFont [4]. Intuitively, a glyph can be represented as the combination of a writing trajectory and a stroke
rendering style. Thus, some RNN-based methods (e.g., FontRNN [5]) synthesize the writing trajectory
of each Chinese character. Despite the great progress made in the last few years, most existing approaches still need
*Corresponding author
lsflyt@pku.edu.cn (Y. Liu); lianzhouhui@pku.edu.cn (Z. Lian)
https://www.icst.pku.edu.cn/zlian/ (Z. Lian)
ORCID(s): 0000-0002-2683-7170 (Z. Lian)
[Figure 1: (a) pipeline from a source image and reference images to synthesized glyph images, via a parallel Transformer (style transfer, Stage 1) followed by a serial Transformer (refinement, Stage 2); (b) comparison of AGIS-Net (64×64 px) with ours (256×256 px and 1024×1024 px).]
Figure 1: (a) An overview of our method, consisting of a style transfer stage and a refinement stage. (b) Based on
few-shot learning, the proposed FontTransformer can synthesize the high-resolution (e.g., 1024 × 1024) glyph images needed
by font designers to produce high-quality commercial font libraries. Existing approaches (e.g., AGIS-Net) often produce
low-resolution (e.g., 64 × 64) glyph images with blurry outlines.
large amounts of offline or online glyph images to train the font synthesis models. Moreover, the quality of vector
outlines/glyph images synthesized by those methods is often unsatisfactory, especially when the desired font is in a
cursive style or the number of input samples is too small.
The design of our FontTransformer is motivated by the observation that although a Chinese character is typically rendered
as a glyph image, it is composed of sequential strokes in essence. In other words, we can treat a Chinese glyph
as an image integrated with serialized information. However, none of the existing methods mentioned above makes
good use of this property, which leads to serious problems in the few-shot Chinese font synthesis task: generating glyph images
with broken strokes and noisy boundaries. To address this issue, we propose a novel end-to-end few-shot learning
model, FontTransformer, which can synthesize high-quality Chinese fonts (see Figure 1) from just a few online training
samples. Specifically, the proposed FontTransformer is a two-stage model that consists of stacked Transformers:
a parallel Transformer followed by a serial Transformer. The key idea is to apply the parallel Transformer to avoid
the accumulation of prediction errors and the serial Transformer to enhance the quality of synthesized strokes.
In this way, our model can learn both the brush and layout styles of the desired font from just a few input images.
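To make the two-stage idea concrete, the sketch below shows one plausible way to stack a non-autoregressive (parallel) Transformer and an autoregressive (serial) Transformer in PyTorch. All module choices, dimensions, and the learned-query decoding are illustrative assumptions on our part, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class StackedFontTransformer(nn.Module):
    """Sketch of the two-stage idea: a parallel (non-autoregressive)
    Transformer predicts every target token in one pass, then a serial
    (autoregressive) Transformer refines the result token by token.
    All sizes are illustrative placeholders."""

    def __init__(self, vocab=4096, seq_len=256, d_model=512, nhead=8, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        # Learned position queries for one-pass (parallel) decoding.
        self.queries = nn.Parameter(torch.randn(seq_len, d_model))
        self.parallel = nn.Transformer(d_model, nhead, layers, layers, batch_first=True)
        self.serial = nn.Transformer(d_model, nhead, layers, layers, batch_first=True)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, src_tokens, style_tokens, tgt_tokens):
        # Condition on the source-font glyph plus a few style references.
        memory = self.embed(torch.cat([src_tokens, style_tokens], dim=1))
        # Stage 1: decode all positions at once from learned queries, so a
        # wrong early token cannot propagate into later predictions.
        q = self.queries.unsqueeze(0).expand(src_tokens.size(0), -1, -1)
        coarse_logits = self.head(self.parallel(memory, q))
        coarse = coarse_logits.argmax(-1)
        # Stage 2: causal decoding over the coarse tokens to clean up stroke
        # details; at training time the shifted ground truth is teacher-forced.
        n = tgt_tokens.size(1)
        causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        fine_logits = self.head(
            self.serial(self.embed(coarse), self.embed(tgt_tokens), tgt_mask=causal))
        return coarse_logits, fine_logits
```

The design intuition: parallel decoding trades per-token conditioning for robustness to error accumulation, while the serial pass restores local stroke coherence by conditioning each token on its already-refined neighbors.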
Furthermore, font designers typically need high-resolution glyph images (e.g., 1024 × 1024) to create practical
font libraries, which consist of vector glyphs that enable arbitrary scaling without quality loss. However, existing
font synthesis approaches usually synthesize low-resolution glyph images (up to 256 × 256), mainly due to the
exponentially increasing regression difficulty and memory requirements. To resolve this problem, we design a chunked
glyph image encoding scheme based on the fact that there are many repeated patches in binary glyph images. Thereby, our
model can synthesize visually pleasing and high-resolution glyph images (see Figure 1) without markedly increasing
computational cost.
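A minimal sketch of such a chunked encoding follows, assuming binary glyph images and a chunk size that scales with resolution so the token count stays fixed; the patch-to-token codebook below (hashing each chunk's raw bytes) is an illustrative assumption rather than the paper's exact scheme.

```python
import numpy as np

def chunk_encode(glyph, seq_len=1024):
    """Encode a binary glyph image as a fixed-length token sequence by
    splitting it into a constant number of square chunks. Because the
    chunk size grows with the image, a 256x256 and a 1024x1024 glyph
    yield the same number of tokens. The chunk-to-token mapping here is
    illustrative; in practice a codebook would be built over the corpus."""
    side = int(np.sqrt(seq_len))             # chunks per row/column, e.g. 32
    k = glyph.shape[0] // side               # chunk edge length in pixels
    tokens, codebook = [], {}
    for i in range(side):
        for j in range(side):
            patch = glyph[i * k:(i + 1) * k, j * k:(j + 1) * k]
            key = patch.tobytes()            # binary glyphs repeat patches often,
            if key not in codebook:          # so the codebook stays small
                codebook[key] = len(codebook)
            tokens.append(codebook[key])
    return tokens, codebook

# A 1024x1024 glyph still yields exactly 1024 tokens (32x32 chunks).
glyph = np.zeros((1024, 1024), dtype=np.uint8)   # all-background toy glyph
tokens, codebook = chunk_encode(glyph)
assert len(tokens) == 1024 and len(codebook) == 1
```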
In summary, the major contributions of this paper are threefold:
• We propose FontTransformer, a novel few-shot Chinese font synthesis model that uses stacked Transformers to
synthesize high-resolution (e.g., 256 × 256 or 1024 × 1024) glyph images. To the best of our knowledge, this is
the first work that effectively applies Transformers to the task of few-shot Chinese font synthesis.
• We design a novel chunked glyph image encoding scheme to encode glyph images into token sequences. With
this encoding scheme, our method can synthesize arbitrarily high-resolution glyph images while keeping the length
of the token sequence constant.
• Extensive experiments demonstrate that our method is capable of synthesizing high-quality glyph images
in the target font style from a few input samples, outperforming the state of the art both
quantitatively and qualitatively.
2. Related Work
2.1. Chinese Font Synthesis
Font design heavily relies on the personal experience of the designer. Although this process can be eased by
font editing software such as FontLab¹, it still takes considerable time and effort to complete. To generate
fonts quickly and automatically, Campbell and Kautz [6] built a generative manifold for
several standard fonts and generated new fonts by interpolating existing ones in a high-dimensional space. For Chinese
characters, Zong and Zhu [7] proposed StrokeBank, which builds a component mapping dictionary from a seed set using
a semi-supervised algorithm. The main limitation of StrokeBank is that it is hard to extract perfect strokes or radicals
from complex glyphs, especially those in handwritten styles. To generate high-quality handwritten font libraries,
Lian et al. [1] proposed EasyFont, an automatic system that synthesizes personal handwritten fonts by learning styles from
a set of carefully selected samples.
With the rapid development of deep learning techniques, many DL-based methods have been proposed for the
Chinese font synthesis task, such as zi2zi [2], DCFont [8], and CalliGAN [9]. Rewrite [10] attempted to convert
the style of a given glyph image from the source font to the target font by using multiple convolution layers. After that,
Tian [2] designed zi2zi, a variant of Pix2pix [11], which is capable of synthesizing multiple fonts with a single model
by adding a font style embedding. To synthesize more realistic glyph images, Jiang et al. [8] proposed an end-to-end
system, DCFont, which needs only 775 or fewer glyph images as training data to learn the style features. Compared to
¹https://www.fontlab.com/
zi2zi, DCFont can synthesize more visually pleasing glyph images whose style is more consistent with the input samples,
especially for handwritten fonts.
Since the geometric structures of many Chinese characters are complex, existing font synthesis methods often fail
to ensure the structural correctness of synthesized glyphs. To solve this problem, some researchers sought help
from prior knowledge of Chinese characters. For instance, SCFont [4] and FontRL [12] utilized glyph skeletons. These
glyph skeletons contain rich but expensive prior knowledge, which helps the two models generate structurally correct glyphs
and synthesize high-quality Chinese glyph images. However, it is time-consuming to manually annotate the glyph
skeleton of every character in the training dataset. One possible way to handle this problem is to extract glyph skeletons
automatically, as in ChiroGAN [13], where Gao and Wu proposed a fast skeleton extraction method
(ENet) to replace the manual annotation process. There also exist methods that employ more convenient prior
knowledge as guidance. SA-VAE [14] combined the content features extracted by a content recognition network with
a 133-bit character encoding vector that captures the structure, radical, and character indices. Their experimental results
demonstrated that this encoding scheme works better than a simple character embedding. Similar to SA-VAE,
Stroke-GAN [15] used a one-bit stroke encoding to refine labels at the stroke level and a stroke-encoding reconstruction
loss to synthesize better details. Meanwhile, ChinFont [16], a system for synthesizing vector fonts, introduced the
well-known wubi coding to represent the content information of Chinese characters.
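To illustrate what such a structured content encoding might look like, the hypothetical sketch below concatenates one-hot fields for structure, radical, and character index; the field widths are invented (chosen only so the total happens to be 133 bits) and do not reproduce SA-VAE's actual layout.

```python
import numpy as np

def encode_character(structure_id, radical_id, char_id,
                     n_structures=12, n_radicals=100, n_chars=21):
    """Hypothetical structured content encoding in the spirit of SA-VAE:
    concatenated one-hot fields for layout structure, radical, and a
    character index. Field widths here are invented for illustration."""
    vec = np.zeros(n_structures + n_radicals + n_chars, dtype=np.uint8)
    vec[structure_id] = 1                              # layout structure field
    vec[n_structures + radical_id] = 1                 # radical field
    vec[n_structures + n_radicals + char_id] = 1       # character index field
    return vec

# e.g. a left-right character (structure 1) containing radical 42:
v = encode_character(structure_id=1, radical_id=42, char_id=7)
assert v.sum() == 3 and len(v) == 133
```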
The methods mentioned above aim to synthesize glyph images in the desired font style. Alternatively, the writing trajectory
of a Chinese character can be represented as a sequence of points. Thereby, some researchers tried to apply Recurrent Neural
Network (RNN) or Long Short-Term Memory (LSTM) models to the Chinese font synthesis task. Ha [17] was the first to
synthesize writing trajectories using RNNs. Zhang et al. [18] adopted RNNs to recognize and draw
Chinese characters and proposed a pre-processing algorithm that converts natural handwriting sequences into model-friendly
data. More recently, Tang et al. [5] proposed FontRNN, an RNN-based model with a monotonic attention
mechanism and a transfer learning strategy. FontRNN generates the writing trajectories of characters instead of
bitmap images and uses a simple CNN to synthesize shape details.
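The point-sequence representation these RNN-based methods consume can be made concrete with a small sketch; the (dx, dy, pen-state) convention below follows common sketch-RNN-style practice and is an assumption on our part, not FontRNN's exact format.

```python
# Illustrative writing-trajectory format for RNN-based font models:
# each timestep is an offset from the previous pen position plus a pen
# state; pen_down = 0 marks a lift between strokes.
trajectory = [
    ( 0,  0, 1),   # pen touches down at the stroke start
    (25,  0, 1),   # horizontal stroke segment
    (25,  0, 0),   # pen lifts: first stroke ends
    ( 5, 20, 1),   # pen moves and touches down for the next stroke
    ( 0, 30, 1),   # vertical stroke segment
]
# An RNN/LSTM consumes such sequences step by step, predicting the next
# offset and pen state; a CNN can then render stroke details on top.
```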
2.2. Few-shot Chinese Font Synthesis
Few-shot font synthesis aims to synthesize glyph images in the desired font style from very few online training
samples, which can further reduce designers' workload and make font generation applicable in special scenarios such
as ancient book restoration. There exist many few-shot font synthesis methods for Latin or Arabic letters [19, 20].
However, since the Chinese writing system consists of a large number of characters with complex glyphs, the performance
of these methods on the few-shot Chinese font synthesis task is unsatisfactory. To address this challenging problem, Zhang et al. [3] proposed
EMD, which can handle novel styles given a few reference images. Afterwards, Gao et al. [21] designed a few-shot
artistic glyph image synthesis method with shape, texture, and local discriminators. Zhu et al. [22] proposed a novel
method that computes weights from the deep feature similarity between the target character and reference characters
and then decodes the weighted features into the target image. To synthesize more refined details,
Huang et al. [23] proposed RD-GAN, which crops the output images into radical images using a radical
extraction module (REM) and feeds them to a multi-level discriminator (MLD) to guarantee the global
structure and local details. Similarly, instead of modeling global styles, Park et al. [24] introduced a novel approach that
learns localized styles to synthesize glyph images. They decomposed Chinese characters into 371 components and
utilized factorization modules to reconstruct character-wise style representations from a few reference images.
More recently, Liu et al. [25] proposed XMP-Font, which adds stroke-level features to better synthesize inter-component
spacing and connected strokes. Meanwhile, some researchers tried to combine online and offline data
to address the few-shot problem using unsupervised font generation methods, such as ZiGAN [26] and
DG-Font [27]. Wen et al. proposed ZiGAN, an end-to-end Chinese calligraphy font generation framework that utilizes
many unpaired glyph images to align feature distributions. As for DG-Font, its key idea is the proposed Feature
Deformation Skip Connection (FDSC) module, which adopts deformable convolution to perform a geometric
transformation on low-level features.
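As a rough illustration of what a feature-deformation skip connection can look like, the snippet below predicts per-position sampling offsets and applies torchvision's deformable convolution; this simplified design is our assumption, not DG-Font's exact FDSC module.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FeatureDeformSkip(nn.Module):
    """Simplified sketch of a feature-deformation skip connection: predict
    sampling offsets from the feature map, then apply a deformable
    convolution so low-level content features are geometrically warped.
    Not DG-Font's exact FDSC module."""

    def __init__(self, channels=64, k=3):
        super().__init__()
        # 2 offsets (x, y) per kernel tap, predicted from the features.
        self.offset = nn.Conv2d(channels, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform = DeformConv2d(channels, channels, kernel_size=k, padding=k // 2)

    def forward(self, content_feat):
        return self.deform(content_feat, self.offset(content_feat))

feat = torch.randn(1, 64, 32, 32)
warped = FeatureDeformSkip()(feat)   # same shape, geometrically deformed
```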
Since most writing systems consist of only a small number of characters (e.g., English and Bangla), a related task
is to transfer font styles from other languages to Chinese with a few reference glyph images. FTransGAN [28] used
multi-level attention to extract global and local style features from a few English samples. Park et al. [29] proposed MX-
Font, which extracts style features from local regions with multiple localized experts, showing strong generalizability
to unseen languages.
2.3. Vision Transformer
Due to the utilization of multi-head self-attention modules, Transformers [30] have obtained state-of-the-art
performance on many sequence processing tasks, such as machine translation [31], text generation [32], document
classification [33], question answering [34, 35], text recognition [36, 37], and so on. To apply Transformers to computer
vision tasks, one option is to replace some components of CNNs with attention modules. For instance, by adding self-attention modules,
Zhang et al. [38] proposed SA-GAN, which can generate image details using features from all locations. Besides,
researchers have also tried to add CNN modules/structures into Transformers, such as BoTNet [39], LocalViT [40],
CvT [41], and so on.
On the other hand, Transformers can be applied to vision tasks directly. ViT [42] showed that the reliance on CNNs
in vision tasks is unnecessary and achieved state-of-the-art performance on the image classification task. Later,