
Improving End-to-End Text Image Translation
From the Auxiliary Text Translation Task
Cong Ma1,2, Yaping Zhang1,2*, Mei Tu4, Xu Han1,2, Linghui Wu1,2, Yang Zhao1,2, and Yu Zhou2,3
1School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, P.R. China
2National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences,
No.95 Zhongguan East Road, Beijing 100190, P.R. China
3Fanyu AI Laboratory, Zhongke Fanyu Technology Co., Ltd, Beijing 100190, P.R. China
4Samsung Research China - Beijing (SRC-B)
Email: {cong.ma, xu.han, linghui.wu, yaping.zhang, yang.zhao, yzhou}@nlpr.ia.ac.cn, mei.tu@samsung.com
Abstract—End-to-end text image translation (TIT), which aims
at translating the source language embedded in images into
the target language, has attracted intensive attention in recent
research. However, data sparsity limits the performance of end-
to-end text image translation. Multi-task learning is an effective
way to alleviate this problem by exploiting knowledge from
complementary related tasks. In this paper, we propose a novel
text-translation-enhanced text image translation model, which
trains the end-to-end model with text translation as an auxiliary
task. Through parameter sharing and multi-task training, our
model is able to take full advantage of easily available large-
scale parallel text corpora. Extensive experimental results show
that our proposed method outperforms existing end-to-end
methods, and that joint multi-task learning with both the text
translation and recognition tasks achieves further improvements,
demonstrating that the two auxiliary tasks are complementary. 1
I. INTRODUCTION
Text image translation is widely used to translate images
containing source-language texts into the target language, with
applications such as photo translation, digital document translation,
and scene text translation. Figure 1 shows several architectures
designed for TIT. Figure 1 (a) depicts the cascade architecture,
which runs an optical character recognition (OCR) model and a
machine translation (MT) model in sequence to translate source
texts in images into the target language [1]–[5]. However, OCR
models make recognition errors, and MT models are vulnerable
to noisy inputs. As a result, mistakes in the recognition results
are further amplified by the translation model, causing error
propagation. Meanwhile, the OCR and MT models are
trained and deployed independently, which makes the overall
process redundant: the image containing the source language
is first encoded and decoded by the OCR model, then it is
encoded and decoded again by the MT model, leading to high
time and space complexity. In summary, the cascade architecture
suffers from 1) error propagation, 2) parameter
redundancy, and 3) decoding delay.
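For concreteness, the cascade pipeline can be sketched as two independent encode-decode passes. The sketch below is illustrative only: OcrStub and MtStub are hypothetical stand-ins for any trained recognizer and translator, not interfaces from this paper or from a specific library.

```python
# A minimal sketch of the cascade architecture in Figure 1 (a).
# Two separate models are maintained and invoked in sequence, so the
# input is encoded and decoded twice, and OCR errors reach MT verbatim.

class OcrStub:
    def recognize(self, image):
        # A real OCR model would encode the image and decode source text.
        return "recognized source text (possibly containing errors)"

class MtStub:
    def translate(self, text):
        # A real MT model re-encodes the text and decodes the translation;
        # any recognition error in `text` can be amplified here.
        return "translated target text"

def cascade_translate(image, ocr_model, mt_model):
    source_text = ocr_model.recognize(image)   # encode-decode pass 1 (OCR)
    return mt_model.translate(source_text)     # encode-decode pass 2 (MT)

print(cascade_translate(image=None, ocr_model=OcrStub(), mt_model=MtStub()))
```

Because the two models share no parameters, every deployment carries both networks and runs them back to back, which is exactly the parameter redundancy and decoding delay noted above.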
The end-to-end architecture for TIT is shown in Figure 1 (b),
which is designed to alleviate the shortcomings of the cascade
*Corresponding author.
1Our code is available at: https://github.com/EriCongMa/E2E_TIT_With_MT.
Fig. 1. Diagram of different architectures designed for text image translation: (a) cascade architecture; (b) end-to-end TIT; (c) multi-task with OCR.
architecture by transforming text images into the target language
directly. However, end-to-end model training requires a dataset
of paired source-language text images and corresponding
target-language translations, which is difficult to
collect and annotate, leading to data scarcity. Existing
methods utilized subtitle [6] or synthetic [7] text
line images to train and evaluate end-to-end models, but none
of these datasets have been released to the public, which limits
the research and development of text image translation. Although
end-to-end text image translation data is scarce, both OCR and MT have
large-scale available datasets, which are a valuable resource for
text image translation. To exploit these external resources,
existing methods train end-to-end models
with the help of an auxiliary text image recognition task that
incorporates external OCR datasets, as shown in Figure 1
(c) [6], [7]. However, multi-task training with OCR has two
main limitations: 1) it only utilizes OCR datasets and
ignores large-scale parallel text corpora; 2) the OCR auxiliary task
can only improve the optimization of the image encoder, while
the decoder remains insufficiently trained.
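To make the second limitation concrete, the sketch below renders the Figure 1 (c) setup in PyTorch under assumed, illustrative dimensions; it is not the authors' released implementation. A shared image encoder feeds an OCR decoder and a translation decoder, so an OCR-only batch produces gradients for the shared encoder but never for the target-language decoder.

```python
import torch
import torch.nn as nn

# Hypothetical multi-task TIT model in the style of Figure 1 (c).
# Module choices and sizes are assumptions for illustration only.
class MultiTaskTIT(nn.Module):
    def __init__(self, d_model=256, nhead=4, src_vocab=1000, tgt_vocab=1000):
        super().__init__()
        # Shared image encoder: a toy patch embedding standing in for a
        # real convolutional or Transformer image encoder.
        self.patch = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.ocr_decoder = nn.TransformerDecoder(layer, num_layers=2)  # source text
        self.tit_decoder = nn.TransformerDecoder(layer, num_layers=2)  # target text
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.src_out = nn.Linear(d_model, src_vocab)
        self.tgt_out = nn.Linear(d_model, tgt_vocab)

    def forward(self, image, src_tokens, tgt_tokens):
        memory = self.patch(image).flatten(2).transpose(1, 2)  # (B, N, d_model)
        # OCR branch: its loss updates the shared encoder and the OCR decoder.
        ocr_logits = self.src_out(self.ocr_decoder(self.src_embed(src_tokens), memory))
        # TIT branch: the target-language decoder learns only from TIT pairs.
        tit_logits = self.tgt_out(self.tit_decoder(self.tgt_embed(tgt_tokens), memory))
        return ocr_logits, tit_logits
```

With this layout, large OCR corpora strengthen the image encoder, but the target-language decoder can only learn from the scarce TIT pairs; sharing the translation decoder with an MT task, as proposed next, closes that gap.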
To address the shortcomings of multi-task training
with OCR, we propose a novel text-translation-enhanced end-
to-end text image translation model. As shown in Figure 2
(a), multi-task learning is utilized to train the TIT and MT tasks
simultaneously. Shared parameters in the transformer encoder
and decoder are fully trained with both image translation and
text translation data. Specifically, images containing source