
et al., 2021). However, CTC is capable of taking over the lead role if called upon (e.g., for streaming applications) (Moritz et al., 2019b).
3 Potential CTC Limitations in MT/ST
Why exactly does this joint CTC/attention framework perform so well in ASR? As summarized in column 3 of Table 1, we are particularly interested in three reasons which arise from the combination of the hard vs. soft alignment, conditional independence vs. dependence, and input-synchronous emission vs. autoregressive generation properties of CTC and attention, respectively. These dynamics have become well understood in ASR, owing to the popularity of the joint framework (Watanabe et al., 2018) amongst ASR practitioners.
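For reference, the joint framework trains both branches on shared encoder representations by interpolating their objectives; following Watanabe et al. (2018), the multi-task training loss is commonly written as

$$\mathcal{L}_{\mathrm{MTL}} \;=\; \lambda\,\mathcal{L}_{\mathrm{CTC}} \;+\; (1-\lambda)\,\mathcal{L}_{\mathrm{att}}, \qquad \lambda \in [0,1],$$

where $\lambda$ balances the hard-alignment CTC branch against the soft-alignment attentional decoder.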
So can CTC and attention also complement each other when applied jointly to translation? To our knowledge, this particular question has not been addressed in the literature (for an account of related works, please see §9). ASR, MT, and ST can all be generalized as sequence transduction tasks following the Bayesian formulation, and attentional decoders are a predominant technical solution to each of these tasks. However, the CTC framework appears to have several limitations specific to MT/ST that are not present in ASR; this seemingly diminishes the promise of the joint CTC/attention framework for translation. In this work, we seek to address the following three concerns about MT/ST CTC which appear to inhibit the CTC/attention framework (per Table 1).
L1: Can CTC encoders perform the sophisticated input-to-output mappings required for translation?
Unlike ASR, translation entails non-monotonic mappings due to variable word-ordering across languages. Additionally, inputs may be shorter than outputs, as mappings are not necessarily one-to-one. Furthermore, the mapping task for ST is actually compositional: logically, a speech signal first maps to a source language transcription before being mapped to the ultimate translation. All of these complications appear to directly contradict the hard alignment of CTC. If CTC cannot produce stable encoder representations for MT/ST, then during joint training attention does not receive the optimization benefit that it does in ASR (per row 2 of Table 1). Fortunately, prior works suggest that these challenges are not insurmountable. Chuang et al. (2021) showed that self-attentional encoders can latently model variable word-orders for ST, Libovický and Helcl (2018); Dalmia et al. (2022) proposed up-sampling encoders that produce expanded input representations for MT, and Sanabria and Metze (2018); Higuchi et al. (2022) proposed hierarchical encoders that can compose multiple output resolutions for ASR. In §4.1, we incorporate these techniques into a unified solution which achieves hierarchical encoding for translation.
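To make the combined idea concrete, below is a minimal sketch of such a hierarchical, up-sampling encoder: a lower block supervised by an intermediate CTC loss on the source transcript (reflecting the compositional ST mapping), an up-sampling projection that lengthens the sequence so outputs need not fit within the input length, and an upper block supervised by a final CTC loss on the target translation. All dimensions, depths, the up-sampling factor, and module names are illustrative assumptions, not the exact architecture of §4.1.

```python
# Sketch of a hierarchical, up-sampling CTC encoder for translation, combining
# up-sampling (Libovicky and Helcl, 2018; Dalmia et al., 2022) with
# multi-resolution hierarchical CTC (Sanabria and Metze, 2018; Higuchi et al.,
# 2022). Hyperparameters here are placeholders.
import torch
import torch.nn as nn

class HierarchicalCTCEncoder(nn.Module):
    def __init__(self, d_model=256, src_vocab=512, tgt_vocab=512, upsample=3):
        super().__init__()
        lower_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        upper_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.lower = nn.TransformerEncoder(lower_layer, num_layers=6)
        self.upper = nn.TransformerEncoder(upper_layer, num_layers=6)
        self.src_ctc = nn.Linear(d_model, src_vocab)   # intermediate CTC head: source transcript
        self.up_proj = nn.Linear(d_model, d_model * upsample)
        self.upsample = upsample
        self.tgt_ctc = nn.Linear(d_model, tgt_vocab)   # final CTC head: target translation

    def forward(self, x):                              # x: (B, T, d_model)
        h = self.lower(x)
        src_logits = self.src_ctc(h)                   # supervised at source resolution
        b, t, d = h.shape
        # Expand T -> T * upsample so the target sequence may be longer than
        # the input, easing CTC's hard-alignment length constraint.
        h = self.up_proj(h).reshape(b, t * self.upsample, d)
        h = self.upper(h)                              # re-encode at target resolution
        tgt_logits = self.tgt_ctc(h)
        return src_logits, tgt_logits
```

Both heads would be trained with standard CTC losses (e.g., torch.nn.CTCLoss), with input lengths T and T × upsample, respectively.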
L2: Does CTC-based translation quality lag too far behind that of attention-based models to be useful?
CTC-based ASR has recently shown competitive performance due in large part to improved neural architectures (Gulati et al., 2020) and self-supervised learning (Baevski et al., 2020; Hsu et al., 2021), but the gap between CTC and attention for translation appears to be greater (Gu and Kong, 2021). Perhaps the conditional independence of CTC inhibits quality in MT/ST to such a degree that CTC likelihoods cannot ease the label/exposure biases of the attentional decoder as they do in ASR (per column 3 of Table 1). The relative weakness of non-autoregressive translation approaches has been well-studied. Knowledge distillation (Kim and Rush, 2016; Zhou et al., 2019) and iterative methods (Qian et al., 2021; Chan et al., 2020; Huang et al., 2022) all attempt to bridge the gap between non-autoregressive models and their autoregressive counterparts. In §6, we address this concern empirically: even CTC models with a 28% relative BLEU reduction compared to attention yield improvements when CTC and attention are jointly decoded.
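As an illustration of what joint decoding means here, the sketch below shows one beam-search expansion step that interpolates full-prefix scores from the two models, in the spirit of Watanabe et al. (2018). The scorer callables and the weight lam are hypothetical placeholders; only the interpolation logic is the point.

```python
# One joint CTC/attention beam-search step (sketch). `ctc_prefix_score` and
# `attn_score` are hypothetical placeholders returning full-prefix log scores
# from a CTC prefix scorer and an autoregressive attentional decoder.
from typing import Callable, List

def joint_beam_step(
    hyps: List[List[int]],                        # current beam of token prefixes
    ctc_prefix_score: Callable[[List[int]], float],
    attn_score: Callable[[List[int]], float],
    vocab: List[int],
    lam: float = 0.3,                             # weight on the CTC branch
    beam: int = 5,
) -> List[List[int]]:
    candidates = []
    for prefix in hyps:
        for tok in vocab:
            new = prefix + [tok]
            # Interpolate the two models' log scores for the expanded prefix.
            score = lam * ctc_prefix_score(new) + (1.0 - lam) * attn_score(new)
            candidates.append((score, new))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [new for _, new in candidates[:beam]]
```

Under this scheme, even a CTC branch that is substantially weaker on its own (the 28% relative BLEU gap above) can still reshape the beam in a useful direction.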
L3: Is the alignment information produced by CTC-based translation models reasonable?
In ASR, CTC alignments are reliable enough to segment audio data by force-aligning inputs to target transcriptions (Kürzinger et al., 2020), and they exhibit minimal drift compared to hidden Markov models (Sak et al., 2015). However, CTC alignments are not as well studied in translation. It is an open question whether the input-synchronous emission of CTC for translation has sufficient alignment quality to support the end detection responsibility during joint decoding as it does in ASR (per row 4 of Table 1). Ideally, the CTC alignments are strong enough that CTC can lead joint decoding by proposing candidates for hypothesis expansion in each beam step until all input units are consumed (at which point the end is detected), as in an input-synchronous beam search. More conservatively, the CTC align-