
Improving End-to-End Text Image Translation
From the Auxiliary Text Translation Task
Cong Ma1,2, Yaping Zhang1,2*, Mei Tu4, Xu Han1,2, Linghui Wu1,2, Yang Zhao1,2, and Yu Zhou2,3
1School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, P.R. China
2National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences,
No.95 Zhongguan East Road, Beijing 100190, P.R. China
3Fanyu AI Laboratory, Zhongke Fanyu Technology Co., Ltd, Beijing 100190, P.R. China
4Samsung Research China - Beijing (SRC-B)
Email: {cong.ma, xu.han, linghui.wu, yaping.zhang, yang.zhao, yzhou}@nlpr.ia.ac.cn, mei.tu@samsung.com
Abstract—End-to-end text image translation (TIT), which aims
at translating the source language embedded in images into
the target language, has attracted intensive attention in recent
research. However, data sparsity limits the performance of end-
to-end text image translation. Multi-task learning is an effective
way to alleviate this problem by exploiting knowledge from
complementary related tasks. In this paper, we propose a novel
text-translation-enhanced text image translation model, which
trains the end-to-end model with text translation as an auxiliary
task. Through parameter sharing and multi-task training, our
model is able to take full advantage of easily available large-
scale parallel text corpora. Extensive experimental results show
that our proposed method outperforms existing end-to-end
methods, and that joint multi-task learning with both the text
translation and recognition tasks achieves further improvements,
demonstrating that the two auxiliary tasks are complementary. 1
I. INTRODUCTION
Text image translation is widely used to translate images
containing source-language texts into the target language, with
applications such as photo translation, digital document translation,
and scene text translation. Figure 1 shows several architectures
designed for TIT. Figure 1 (a) depicts the cascade architecture,
which runs an optical character recognition (OCR) model and a
machine translation (MT) model in sequence to translate source
texts in images into the target language [1]–[5]. However, OCR
models make recognition errors, and MT models are vulnerable
to noisy inputs. As a result, mistakes in the recognition results
are further amplified by the translation model, causing error
propagation. Meanwhile, the OCR and MT models are
trained and deployed independently, which makes the overall
process redundant: the image containing the source language
is first encoded and decoded by the OCR model, then it is
encoded and decoded again by the MT model, leading to high
time and space complexity. In summary, the cascade architecture
suffers from 1) error propagation, 2) parameter
redundancy, and 3) decoding delay.
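For concreteness, the cascade pipeline can be sketched as two independent encode-decode passes. The sketch below is illustrative only: OcrStub and MtStub are hypothetical stand-ins for any trained recognizer and translator, not interfaces from this paper or from a specific library.

```python
# A minimal sketch of the cascade architecture in Figure 1 (a).
# Two separate models are maintained and invoked in sequence, so the
# input is encoded and decoded twice, and OCR errors reach MT verbatim.

class OcrStub:
    def recognize(self, image):
        # A real OCR model would encode the image and decode source text.
        return "recognized source text (possibly containing errors)"

class MtStub:
    def translate(self, text):
        # A real MT model re-encodes the text and decodes the translation;
        # any recognition error in `text` can be amplified here.
        return "translated target text"

def cascade_translate(image, ocr_model, mt_model):
    source_text = ocr_model.recognize(image)   # encode-decode pass 1 (OCR)
    return mt_model.translate(source_text)     # encode-decode pass 2 (MT)

print(cascade_translate(image=None, ocr_model=OcrStub(), mt_model=MtStub()))
```

Because the two models share no parameters, every deployment carries both networks and runs them back to back, which is exactly the parameter redundancy and decoding delay noted above.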
The end-to-end architecture for TIT is shown in Figure 1 (b),
which is designed to alleviate the shortcomings of the cascade
*Corresponding author.
1Our code is available at: https://github.com/EriCongMa/E2E_TIT_With_MT.
Fig. 1. Diagram of different architectures designed for text image translation: (a) cascade architecture; (b) end-to-end TIT; (c) multi-task with OCR.
architecture by transforming text images into the target language
directly. However, end-to-end model training requires a dataset
of paired source-language text images and corresponding
target-language translations, which is difficult to
collect and annotate, leading to data scarcity. Existing
methods utilized subtitle [6] or synthetic [7] text
line images to train and evaluate end-to-end models, but none
of these datasets have been released to the public, which limits
the research and development of text image translation. Although
end-to-end text image translation data is scarce, both OCR and MT have
large-scale available datasets, which are a valuable resource for
text image translation. To exploit these external resources,
existing methods train end-to-end models
with the help of an auxiliary text image recognition task that
incorporates external OCR datasets, as shown in Figure 1
(c) [6], [7]. However, multi-task training with OCR has two
main limitations: 1) it only utilizes OCR datasets and
ignores large-scale parallel text corpora; 2) the OCR auxiliary task
can only improve the optimization of the image encoder, while
the decoder remains insufficiently trained.
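To make the second limitation concrete, the sketch below renders the Figure 1 (c) setup in PyTorch under assumed, illustrative dimensions; it is not the authors' released implementation. A shared image encoder feeds an OCR decoder and a translation decoder, so an OCR-only batch produces gradients for the shared encoder but never for the target-language decoder.

```python
import torch
import torch.nn as nn

# Hypothetical multi-task TIT model in the style of Figure 1 (c).
# Module choices and sizes are assumptions for illustration only.
class MultiTaskTIT(nn.Module):
    def __init__(self, d_model=256, nhead=4, src_vocab=1000, tgt_vocab=1000):
        super().__init__()
        # Shared image encoder: a toy patch embedding standing in for a
        # real convolutional or Transformer image encoder.
        self.patch = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.ocr_decoder = nn.TransformerDecoder(layer, num_layers=2)  # source text
        self.tit_decoder = nn.TransformerDecoder(layer, num_layers=2)  # target text
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.src_out = nn.Linear(d_model, src_vocab)
        self.tgt_out = nn.Linear(d_model, tgt_vocab)

    def forward(self, image, src_tokens, tgt_tokens):
        memory = self.patch(image).flatten(2).transpose(1, 2)  # (B, N, d_model)
        # OCR branch: its loss updates the shared encoder and the OCR decoder.
        ocr_logits = self.src_out(self.ocr_decoder(self.src_embed(src_tokens), memory))
        # TIT branch: the target-language decoder learns only from TIT pairs.
        tit_logits = self.tgt_out(self.tit_decoder(self.tgt_embed(tgt_tokens), memory))
        return ocr_logits, tit_logits
```

With this layout, large OCR corpora strengthen the image encoder, but the target-language decoder can only learn from the scarce TIT pairs; sharing the translation decoder with an MT task, as proposed next, closes that gap.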
To address the shortcomings of multi-task training
with OCR, we propose a novel text-translation-enhanced end-
to-end text image translation model. As shown in Figure 2
(a), multi-task learning is utilized to train the TIT and MT tasks
simultaneously. Shared parameters in the transformer encoder
and decoder are fully trained with both image translation and
text translation data. Specifically, images containing source