FontTransformer: Few-shot High-resolution Chinese Glyph Image
Synthesis via Stacked Transformers
Yitian Liu^a, Zhouhui Lian^a,*
^a Wangxuan Institute of Computer Technology, Peking University, Beijing, 100871, China
ARTICLE INFO
Keywords:
font generation
style transfer
Transformers
ABSTRACT
Automatic generation of high-quality Chinese fonts from a few online training samples is a
challenging task, especially when the number of samples is very small. Existing few-shot font
generation methods can only synthesize low-resolution glyph images that often possess incorrect
topological structures and/or incomplete strokes. To address the problem, this paper proposes
FontTransformer, a novel few-shot learning model for high-resolution Chinese glyph image
synthesis that uses stacked Transformers. The key idea is to apply a parallel Transformer to
avoid the accumulation of prediction errors and a serial Transformer to enhance the
quality of synthesized strokes. Meanwhile, we also design a novel encoding scheme that feeds more
glyph information and prior knowledge to our model, further enabling the generation of
high-resolution and visually pleasing glyph images. Both qualitative and quantitative experimental
results demonstrate the superiority of our method over other existing approaches
on the few-shot Chinese font synthesis task.
1. Introduction
Computer fonts are widely used in our daily lives. The legibility and aesthetics of the fonts adopted in books, posters,
advertisements, etc., are critical to their producers during the design process. Thereby, the demand for high-quality
fonts in various styles has grown rapidly. However, font design is a creative and time-consuming task,
especially for font libraries consisting of large numbers of characters (e.g., Chinese). For example, the official character
set GB18030-2000 consists of 27,533 Chinese characters, most of which have complicated structures and contain dozens
of strokes [1]. Designing or writing out such a large number of complex glyphs in a consistent style is time-consuming
and costly. Thus, more and more researchers and companies are interested in developing systems that can automatically
generate high-quality Chinese fonts from a few input samples.
With the help of various neural network architectures (e.g., CNNs and RNNs), researchers have proposed many
DL-based methods for Chinese font synthesis. DL-based methods aim to model the relationship between input and
output data (outlines, glyph images, or writing trajectories). Most of them are CNN-based models, such as zi2zi [2],
EMD [3], and SCFont [4]. Intuitively, a glyph can be represented as the combination of a writing trajectory and a stroke
rendering style. Thus, some RNN-based methods (e.g., FontRNN [5]) synthesize the writing trajectory
of each Chinese character. Despite the great progress made in the last few years, most existing approaches still need
*Corresponding author
lsflyt@pku.edu.cn (Y. Liu); lianzhouhui@pku.edu.cn (Z. Lian)
https://www.icst.pku.edu.cn/zlian/ (Z. Lian)
ORCID(s): 0000-0002-2683-7170 (Z. Lian)
[Figure 1: (a) pipeline from a source image and reference images to synthesized glyph images, via a parallel Transformer (style transfer, Stage 1) followed by a serial Transformer (refinement, Stage 2); (b) comparison of AGIS-Net (64×64 px) with ours (256×256 px and 1024×1024 px).]
Figure 1: (a) An overview of our method, consisting of a style transfer stage and a refinement stage. (b) Based on
few-shot learning, the proposed FontTransformer can synthesize the high-resolution (e.g., 1024 × 1024) glyph images needed
by font designers to produce high-quality commercial font libraries. Existing approaches (e.g., AGIS-Net) often produce
low-resolution (e.g., 64 × 64) glyph images with blurry outlines.
large amounts of offline or online glyph images to train the font synthesis models. Moreover, the quality of vector
outlines/glyph images synthesized by those methods is often unsatisfactory, especially when the desired font is in a
cursive style or the number of input samples is too small.
The design of our FontTransformer is motivated by the observation that although a Chinese character is typically rendered
as a glyph image, it is composed of sequential strokes in essence. In other words, we can treat a Chinese glyph
as an image integrated with serialized information. However, none of the existing methods mentioned above makes
good use of this property, which leads to serious problems in the few-shot Chinese font synthesis task: generating glyph images
with broken strokes and noisy boundaries. To address this issue, we propose a novel end-to-end few-shot learning
model, FontTransformer, which can synthesize high-quality Chinese fonts (see Figure 1) from just a few online training
samples. Specifically, the proposed FontTransformer is a two-stage model that consists of stacked Transformers:
a parallel Transformer followed by a serial Transformer. The key idea is to apply the parallel Transformer to avoid
the accumulation of prediction errors and the serial Transformer to enhance the quality of synthesized strokes.
In this way, our model can learn both the brush and layout styles of the desired font from just a few input images.
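To make the two-stage idea concrete, the sketch below shows one plausible way to stack a non-autoregressive (parallel) Transformer and an autoregressive (serial) Transformer in PyTorch. All module choices, dimensions, and the learned-query decoding are illustrative assumptions on our part, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class StackedFontTransformer(nn.Module):
    """Sketch of the two-stage idea: a parallel (non-autoregressive)
    Transformer predicts every target token in one pass, then a serial
    (autoregressive) Transformer refines the result token by token.
    All sizes are illustrative placeholders."""

    def __init__(self, vocab=4096, seq_len=256, d_model=512, nhead=8, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        # Learned position queries for one-pass (parallel) decoding.
        self.queries = nn.Parameter(torch.randn(seq_len, d_model))
        self.parallel = nn.Transformer(d_model, nhead, layers, layers, batch_first=True)
        self.serial = nn.Transformer(d_model, nhead, layers, layers, batch_first=True)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, src_tokens, style_tokens, tgt_tokens):
        # Condition on the source-font glyph plus a few style references.
        memory = self.embed(torch.cat([src_tokens, style_tokens], dim=1))
        # Stage 1: decode all positions at once from learned queries, so a
        # wrong early token cannot propagate into later predictions.
        q = self.queries.unsqueeze(0).expand(src_tokens.size(0), -1, -1)
        coarse_logits = self.head(self.parallel(memory, q))
        coarse = coarse_logits.argmax(-1)
        # Stage 2: causal decoding over the coarse tokens to clean up stroke
        # details; at training time the shifted ground truth is teacher-forced.
        n = tgt_tokens.size(1)
        causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        fine_logits = self.head(
            self.serial(self.embed(coarse), self.embed(tgt_tokens), tgt_mask=causal))
        return coarse_logits, fine_logits
```

The design intuition: parallel decoding trades per-token conditioning for robustness to error accumulation, while the serial pass restores local stroke coherence by conditioning each token on its already-refined neighbors.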
Furthermore, font designers typically need high-resolution glyph images (e.g., 1024 × 1024) to create practical
font libraries, which consist of vector glyphs that enable arbitrary scaling without quality loss. However, existing
font synthesis approaches usually synthesize low-resolution glyph images (up to 256 × 256), mainly due to the
exponentially increasing regression difficulty and memory requirements. To resolve this problem, we design a chunked
glyph image encoding scheme based on the fact that there are many repeated patches in binary glyph images. Thereby, our
model can synthesize visually pleasing and high-resolution glyph images (see Figure 1) without markedly increasing
computational cost.
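A minimal sketch of such a chunked encoding follows, assuming binary glyph images and a chunk size that scales with resolution so the token count stays fixed; the patch-to-token codebook below (hashing each chunk's raw bytes) is an illustrative assumption rather than the paper's exact scheme.

```python
import numpy as np

def chunk_encode(glyph, seq_len=1024):
    """Encode a binary glyph image as a fixed-length token sequence by
    splitting it into a constant number of square chunks. Because the
    chunk size grows with the image, a 256x256 and a 1024x1024 glyph
    yield the same number of tokens. The chunk-to-token mapping here is
    illustrative; in practice a codebook would be built over the corpus."""
    side = int(np.sqrt(seq_len))             # chunks per row/column, e.g. 32
    k = glyph.shape[0] // side               # chunk edge length in pixels
    tokens, codebook = [], {}
    for i in range(side):
        for j in range(side):
            patch = glyph[i * k:(i + 1) * k, j * k:(j + 1) * k]
            key = patch.tobytes()            # binary glyphs repeat patches often,
            if key not in codebook:          # so the codebook stays small
                codebook[key] = len(codebook)
            tokens.append(codebook[key])
    return tokens, codebook

# A 1024x1024 glyph still yields exactly 1024 tokens (32x32 chunks).
glyph = np.zeros((1024, 1024), dtype=np.uint8)   # all-background toy glyph
tokens, codebook = chunk_encode(glyph)
assert len(tokens) == 1024 and len(codebook) == 1
```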
In summary, the major contributions of this paper are threefold:
• We propose FontTransformer, a novel few-shot Chinese font synthesis model that uses stacked Transformers to
synthesize high-resolution (e.g., 256 × 256 or 1024 × 1024) glyph images. To the best of our knowledge, this is
the first work that effectively applies Transformers to the task of few-shot Chinese font synthesis.
• We design a novel chunked glyph image encoding scheme to encode glyph images into token sequences. With
this encoding scheme, our method can synthesize arbitrarily high-resolution glyph images while keeping the length
of the token sequence constant.
• Extensive experiments demonstrate that our method is capable of synthesizing high-quality glyph images
in the target font style from a few input samples, outperforming the state of the art both
quantitatively and qualitatively.
2. Related Work
2.1. Chinese Font Synthesis
Font design heavily relies on the personal experience of the designer. Although this process can be eased by
font editing software such as FontLab¹, it still takes considerable time and effort to complete. To generate
fonts quickly and automatically, Campbell and Kautz [6] built a generative manifold for
several standard fonts and generated new fonts by interpolating existing ones in a high-dimensional space. For Chinese
characters, Zong and Zhu [7] proposed StrokeBank, which builds a component mapping dictionary from a seed set using
a semi-supervised algorithm. The main limitation of StrokeBank is that it is hard to extract perfect strokes or radicals
from complex glyphs, especially those in handwritten styles. To generate high-quality handwritten font libraries,
Lian et al. [1] proposed EasyFont, an automatic system that synthesizes personal handwritten fonts by learning styles from
a set of carefully selected samples.
With the rapid development of deep learning techniques, many DL-based methods have been proposed for the
Chinese font synthesis task, such as zi2zi [2], DCFont [8], and CalliGAN [9]. Rewrite [10] attempted to convert
the style of a given glyph image from the source font to the target font by using multiple convolution layers. After that,
Tian [2] designed zi2zi, a variant of Pix2pix [11], which is capable of synthesizing multiple fonts with a single model
by adding a font style embedding. To synthesize more realistic glyph images, Jiang et al. [8] proposed an end-to-end
system, DCFont, which needs only 775 or fewer glyph images as training data to learn the style features. Compared to
¹https://www.fontlab.com/
zi2zi, DCFont can synthesize more visually pleasing glyph images whose style is more consistent with the input samples,
especially for handwritten fonts.
Since the geometric structures of many Chinese characters are complex, existing font synthesis methods often fail
to ensure the structural correctness of synthesized glyphs. To solve this problem, some researchers sought help
from prior knowledge of Chinese characters. For instance, SCFont [4] and FontRL [12] utilized glyph skeletons. These
glyph skeletons contain rich but expensive prior knowledge, which helps the two models generate structurally correct glyphs
and synthesize high-quality Chinese glyph images. However, it is time-consuming to manually annotate the glyph
skeleton of every character in the training dataset. One possible way to handle this problem is to extract glyph skeletons
automatically, as in ChiroGAN [13], where Gao and Wu proposed a fast skeleton extraction method
(ENet) to replace the manual annotation process. There also exist methods that employ more convenient prior
knowledge as guidance. SA-VAE [14] combined the content features extracted by a content recognition network with
a 133-bit character encoding vector that captures the structure, radical, and character indices. Their experimental results
demonstrated that this encoding scheme works better than a simple character embedding. Similar to SA-VAE,
Stroke-GAN [15] used a one-bit stroke encoding to refine labels at the stroke level and a stroke-encoding reconstruction
loss to synthesize better details. Meanwhile, ChinFont [16], a system for synthesizing vector fonts, introduced the
well-known wubi coding to represent the content information of Chinese characters.
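To illustrate what such a structured content encoding might look like, the hypothetical sketch below concatenates one-hot fields for structure, radical, and character index; the field widths are invented (chosen only so the total happens to be 133 bits) and do not reproduce SA-VAE's actual layout.

```python
import numpy as np

def encode_character(structure_id, radical_id, char_id,
                     n_structures=12, n_radicals=100, n_chars=21):
    """Hypothetical structured content encoding in the spirit of SA-VAE:
    concatenated one-hot fields for layout structure, radical, and a
    character index. Field widths here are invented for illustration."""
    vec = np.zeros(n_structures + n_radicals + n_chars, dtype=np.uint8)
    vec[structure_id] = 1                              # layout structure field
    vec[n_structures + radical_id] = 1                 # radical field
    vec[n_structures + n_radicals + char_id] = 1       # character index field
    return vec

# e.g. a left-right character (structure 1) containing radical 42:
v = encode_character(structure_id=1, radical_id=42, char_id=7)
assert v.sum() == 3 and len(v) == 133
```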
The methods mentioned above aim to synthesize glyph images in the desired font style. Alternatively, the writing trajectory
of a Chinese character can be represented as a sequence of points. Thereby, some researchers tried to apply Recurrent Neural
Network (RNN) or Long Short-Term Memory (LSTM) models to the Chinese font synthesis task. Ha [17] was the first to
synthesize writing trajectories using RNNs. Zhang et al. [18] adopted RNNs to recognize and draw
Chinese characters and proposed a pre-processing algorithm that converts natural handwriting sequences into model-friendly
data. More recently, Tang et al. [5] proposed FontRNN, an RNN-based model with a monotonic attention
mechanism and a transfer learning strategy. FontRNN generates the writing trajectories of characters instead of
bitmap images and uses a simple CNN to synthesize shape details.
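The point-sequence representation these RNN-based methods consume can be made concrete with a small sketch; the (dx, dy, pen-state) convention below follows common sketch-RNN-style practice and is an assumption on our part, not FontRNN's exact format.

```python
# Illustrative writing-trajectory format for RNN-based font models:
# each timestep is an offset from the previous pen position plus a pen
# state; pen_down = 0 marks a lift between strokes.
trajectory = [
    ( 0,  0, 1),   # pen touches down at the stroke start
    (25,  0, 1),   # horizontal stroke segment
    (25,  0, 0),   # pen lifts: first stroke ends
    ( 5, 20, 1),   # pen moves and touches down for the next stroke
    ( 0, 30, 1),   # vertical stroke segment
]
# An RNN/LSTM consumes such sequences step by step, predicting the next
# offset and pen state; a CNN can then render stroke details on top.
```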
2.2. Few-shot Chinese Font Synthesis
Few-shot font synthesis aims to synthesize glyph images in the desired font style from very few online training
samples, which can further reduce designers' workload and make font generation applicable in special scenarios such
as ancient book restoration. There exist many few-shot font synthesis methods for Latin or Arabic letters [19, 20].
However, since the Chinese writing system consists of a large number of characters with complex glyphs, the performance
of these methods on the few-shot Chinese font synthesis task is unsatisfactory. To address this challenging problem, Zhang et al. [3] proposed
EMD, which can handle novel styles given a few reference images. Afterwards, Gao et al. [21] designed a few-shot
artistic glyph image synthesis method with shape, texture, and local discriminators. Zhu et al. [22] proposed a novel
method that computes weights from the deep feature similarity between the target character and reference characters
and then decodes the weighted features into the target image. To synthesize more refined details,
Huang et al. [23] proposed RD-GAN, which crops the output images into radical images using a radical
extraction module (REM) and feeds them to a multi-level discriminator (MLD) to guarantee the global
structure and local details. Similarly, instead of modeling global styles, Park et al. [24] introduced a novel approach that
learns localized styles to synthesize glyph images. They decomposed Chinese characters into 371 components and
utilized factorization modules to reconstruct character-wise style representations from a few reference images.
More recently, Liu et al. [25] proposed XMP-Font, which adds stroke-level features to better synthesize inter-component
spacing and connected strokes. Meanwhile, some researchers tried to combine online and offline data
to address the few-shot problem using unsupervised font generation methods, such as ZiGAN [26] and
DG-Font [27]. Wen et al. proposed ZiGAN, an end-to-end Chinese calligraphy font generation framework that utilizes
many unpaired glyph images to align feature distributions. As for DG-Font, its key idea is the proposed Feature
Deformation Skip Connection (FDSC) module, which adopts deformable convolution to perform a geometric
transformation on low-level features.
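As a rough illustration of what a feature-deformation skip connection can look like, the snippet below predicts per-position sampling offsets and applies torchvision's deformable convolution; this simplified design is our assumption, not DG-Font's exact FDSC module.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FeatureDeformSkip(nn.Module):
    """Simplified sketch of a feature-deformation skip connection: predict
    sampling offsets from the feature map, then apply a deformable
    convolution so low-level content features are geometrically warped.
    Not DG-Font's exact FDSC module."""

    def __init__(self, channels=64, k=3):
        super().__init__()
        # 2 offsets (x, y) per kernel tap, predicted from the features.
        self.offset = nn.Conv2d(channels, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform = DeformConv2d(channels, channels, kernel_size=k, padding=k // 2)

    def forward(self, content_feat):
        return self.deform(content_feat, self.offset(content_feat))

feat = torch.randn(1, 64, 32, 32)
warped = FeatureDeformSkip()(feat)   # same shape, geometrically deformed
```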
Since most writing systems consist of only a small number of characters (e.g., English and Bangla), a related task
is to transfer font styles from other languages to Chinese with a few reference glyph images. FTransGAN [28] used
multi-level attention to extract global and local style features from a few English samples. Park et al. [29] proposed MX-
Font, which extracts style features from local regions with multiple localized experts, showing strong generalizability
to unseen languages.
2.3. Vision Transformer
Due to the utilization of multi-head self-attention modules, Transformers [30] have obtained state-of-the-art
performance on many sequence processing tasks, such as machine translation [31], text generation [32], document
classification [33], question answering [34, 35], text recognition [36, 37], and so on. To apply Transformers to computer
vision tasks, one option is to replace some components of CNNs with attention modules. For instance, by adding self-attention modules,
Zhang et al. [38] proposed SA-GAN, which can generate image details using features from all locations. Besides,
researchers have also tried to add CNN modules/structures into Transformers, such as BoTNet [39], LocalViT [40],
CvT [41], and so on.
On the other hand, Transformers can be applied to vision tasks directly. ViT [42] showed that the reliance on CNNs
in vision tasks is unnecessary and achieved state-of-the-art performance on the image classification task. Later,