Improving End-to-End Text Image Translation
From the Auxiliary Text Translation Task
Cong Ma1,2, Yaping Zhang1,2*, Mei Tu4, Xu Han1,2, Linghui Wu1,2, Yang Zhao1,2, and Yu Zhou2,3
1School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, P.R. China
2National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences,
No.95 Zhongguan East Road, Beijing 100190, P.R. China
3Fanyu AI Laboratory, Zhongke Fanyu Technology Co., Ltd, Beijing 100190, P.R. China
4Samsung Research China - Beijing (SRC-B)
Email: {cong.ma, xu.han, linghui.wu, yaping.zhang, yang.zhao, yzhou}@nlpr.ia.ac.cn, mei.tu@samsung.com
Abstract—End-to-end text image translation (TIT), which aims at translating the source language embedded in images into the target language, has attracted intensive attention in recent research. However, data sparsity limits the performance of end-to-end text image translation. Multi-task learning is an effective way to alleviate this problem by exploiting knowledge from complementary related tasks. In this paper, we propose a novel text-translation-enhanced text image translation model, which trains the end-to-end model with text translation as an auxiliary task. Through parameter sharing and multi-task training, our model is able to take full advantage of easily available large-scale text parallel corpora. Extensive experimental results show that our proposed method outperforms existing end-to-end methods, and that joint multi-task learning with both the text translation and recognition tasks achieves better results, proving that the translation and recognition auxiliary tasks are complementary.1
I. INTRODUCTION
Text image translation is widely used to translate images containing source language texts into the target language, covering applications such as photo translation, digital document translation, and scene text translation. Figure 1 shows several architectures designed for TIT. Figure 1 (a) depicts the cascade architecture, which chains an optical character recognition (OCR) model and a machine translation (MT) model to translate source texts in images into the target language [1]–[5]. However, OCR models make recognition errors, and MT models are vulnerable to noisy inputs. As a result, mistakes in the recognition results are further amplified by the translation model, causing error propagation. Meanwhile, the OCR and MT models are trained and deployed independently, which makes the overall process redundant: the image containing the source language is first encoded and decoded by the OCR model, and the result is then encoded and decoded again by the MT model, leading to high time and space complexity. In summary, the cascade architecture suffers from 1) error propagation, 2) parameter redundancy, and 3) decoding delay.
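To make the two-pass cost concrete, the following is a minimal sketch of such a cascade pipeline; ocr_model.recognize and mt_model.translate are hypothetical stand-ins for illustration, not APIs from the paper or any specific library.

```python
def cascade_tit(image, ocr_model, mt_model):
    """Hypothetical cascade pipeline: two independent encode-decode passes.

    Any recognition error in src_text is passed verbatim to the MT model,
    which is how cascade systems amplify OCR mistakes.
    """
    src_text = ocr_model.recognize(image)    # pass 1: image -> source text
    tgt_text = mt_model.translate(src_text)  # pass 2: source -> target text
    return tgt_text
```

Because the two models are trained and deployed independently, both sets of encoder-decoder parameters must be kept in memory, and decoding runs twice for every input image.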
*Corresponding author.
1Our codes are available at: https://github.com/EriCongMa/E2E_TIT_With_MT
Fig. 1. Diagram of different architectures designed for text image translation: (a) cascade architecture; (b) end-to-end TIT; (c) multi-task with OCR.

The end-to-end architecture for TIT, shown in Figure 1 (b), is designed to alleviate the shortcomings of the cascade
architecture by transforming text images into the target language directly. However, end-to-end model training needs a dataset containing paired source language text images and corresponding translated target language sentences, which is difficult to collect and annotate, leading to data limitation problems. Existing methods utilized subtitle [6] or synthetic [7] text line images to train and evaluate end-to-end models, but none of these datasets has been released to the public, which limits the research and development of text image translation. Although end-to-end text image translation data is scarce, both OCR and MT have large-scale datasets available, which are a valuable resource for text image translation. To train end-to-end models with such external resources, existing methods explored an auxiliary text image recognition task that incorporates external OCR data, as shown in Figure 1 (c) [6], [7]. However, multi-task training with OCR has two main limitations: 1) it only utilizes the OCR dataset but ignores large-scale text parallel corpora; 2) the auxiliary OCR task only improves the optimization of the image encoder, while the decoder is not fully trained.
To address the shortcomings of multi-task training with OCR, we propose a novel text-translation-enhanced end-to-end text image translation model. As shown in Figure 2 (a), multi-task learning is utilized to train the TIT and MT tasks simultaneously, so that the shared parameters in the transformer encoder and decoder are fully trained with both image translation and text translation data.
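This section does not write out the training objective; a standard multi-task formulation (an assumption for illustration, including the equal task weighting) sums the per-task cross-entropy losses over the shared parameters:

    L(θ) = L_TIT(θ_img, θ_enc, θ_dec) + L_MT(θ_txt, θ_enc, θ_dec)

where θ_enc and θ_dec denote the shared transformer encoder and decoder, θ_img the image encoder, and θ_txt the text embedding, so gradients from both the TIT and MT batches update the shared parameters.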
Fig. 2. Architectures of multi-task learning for text image translation: (a) multi-task learning with the MT task; (b) multi-task learning with MT & OCR tasks. The text encoder and the source language transformer decoder are only utilized during training and are discarded at evaluation.

Specifically, images containing source
language texts are encoded by the image encoder, and source language texts are encoded by the text embedding encoder separately. To map the image features and the text features into the same semantic feature space, a shared transformer encoder is utilized to encode both kinds of features. A shared transformer decoder for the target language then generates translation results given the semantic features from the transformer encoder. The shared transformer is optimized on both TIT and MT data, which has the potential to align image and text features in the shared feature space. Furthermore, by jointly training with the MT and OCR tasks as shown in Figure 2 (b), the end-to-end TIT model can take advantage of both external translation and recognition corpora. Since the recognition task generates source language texts while the translation task generates target language texts, separate transformer decoders for the source and target languages are utilized to achieve the OCR, MT, and TIT tasks under one unified architecture.
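The paper gives no implementation details in this section; the following is a minimal PyTorch sketch of the parameter-sharing scheme in Figure 2 (b). Module names, dimensions, and the convolutional front-end (a stand-in for the TPS-Net + ResNet stack) are assumptions for illustration, and positional encodings and causal masks are omitted for brevity.

```python
import torch.nn as nn

class MultiTaskTIT(nn.Module):
    """Sketch of Fig. 2 (b): modality-specific front-ends feed one shared
    transformer encoder; separate decoders emit source-language text (OCR)
    and target-language text (MT / TIT)."""

    def __init__(self, src_vocab, tgt_vocab, d_model=512, nhead=8, layers=6):
        super().__init__()
        # Image front-end: stand-in for TPS-Net + ResNet; assumes
        # 32-pixel-high text line images of shape (B, 3, 32, W).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=(32, 4), stride=(32, 4)),
            nn.Flatten(start_dim=2),            # -> (B, d_model, W/4)
        )
        self.src_embed = nn.Embedding(src_vocab, d_model)  # text encoder
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        # Shared semantic encoder, updated by TIT, MT, and OCR batches.
        self.shared_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers=layers)
        # Separate decoders for source (OCR) and target (MT / TIT) language.
        self.src_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers=layers)
        self.tgt_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers=layers)
        self.src_proj = nn.Linear(d_model, src_vocab)
        self.tgt_proj = nn.Linear(d_model, tgt_vocab)

    def forward(self, task, image=None, src_tokens=None, tgt_prev=None):
        # TIT and OCR read an image; MT reads source-language tokens.
        if task in ("tit", "ocr"):
            feats = self.image_encoder(image).transpose(1, 2)  # (B, L, d)
        else:  # "mt"
            feats = self.src_embed(src_tokens)
        memory = self.shared_encoder(feats)
        if task == "ocr":  # source-language decoder, training only
            return self.src_proj(
                self.src_decoder(self.src_embed(tgt_prev), memory))
        return self.tgt_proj(
            self.tgt_decoder(self.tgt_embed(tgt_prev), memory))
```

In training, batches from the TIT, MT, and OCR datasets would be interleaved so the shared encoder sees both modalities; at inference only the image encoder, shared encoder, and target-language decoder are kept, matching the note in the Fig. 2 caption.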
Extensive experiments on three translation directions show that our method of multi-task training with the MT auxiliary task outperforms multi-task training with OCR. Furthermore, by jointly training with both the MT and OCR tasks, our model achieves new state-of-the-art results, proving that joint MT and OCR multi-task learning is complementary for text image translation. Meanwhile, our end-to-end model outperforms cascade models with fewer parameters and faster decoding speed, revealing the advantages of the end-to-end TIT model.
The contributions of our work are summarized as follows:
• We propose a multi-task learning architecture for text image translation with an external text parallel corpus. Parameters in the semantic encoder and the cross-lingual decoder are shared with the text translation model, which transfers translation knowledge from the text translation task.
• We further improve the end-to-end text image translation model by incorporating both MT and OCR auxiliary tasks, which improves the optimization of the TIT, MT, and OCR tasks under one unified architecture.
• Extensive experiments show that our proposed method outperforms existing end-to-end methods by making better use of easily available large-scale translation and recognition corpora. Meanwhile, our method outperforms cascade models with fewer parameters and faster decoding speed.
II. RELATED WORK
a) OCR-MT Cascade System: The OCR-MT cascade system first utilizes an OCR model to recognize source language texts embedded in images [8]–[11]. Text translation models are then incorporated to translate the recognized source language texts into the target language [12]–[16]. In [1], recognized sentences in manga images are treated as source language contexts for further document translation. Existing research also explores integrating separate OCR and MT models for historical document translation [3], image document translation [4], scene text translation [17], and photo translation on mobile devices [5], [18]. The cascade system can utilize large-scale datasets to train the OCR and MT models independently. However, errors made by the recognition model are propagated through the MT model, which decreases translation quality. Furthermore, the cascade system has two separate encoder-decoder architectures, leading to parameter redundancy and decoding delay.
b) End-to-End Text Image Translation: Mansimov et al. take a preliminary step for end-to-end image-to-image translation