CTC Alignments Improve Autoregressive Translation

Brian Yan*1, Siddharth Dalmia*1, Yosuke Higuchi2,
Graham Neubig1, Florian Metze1, Alan W Black1, Shinji Watanabe1,3

1 Language Technologies Institute, Carnegie Mellon University, USA
2 Department of Communications and Computer Engineering, Waseda University, Japan
3 Human Language Technology Center of Excellence, Johns Hopkins University, USA

{byan, sdalmia}@cs.cmu.edu
Abstract

Connectionist Temporal Classification (CTC) is a widely used approach for automatic speech recognition (ASR) that performs conditionally independent monotonic alignment. However, for translation, CTC exhibits clear limitations due to the contextual and non-monotonic nature of the task and thus lags behind attentional decoder approaches in terms of translation quality. In this work, we argue that CTC does in fact make sense for translation if applied in a joint CTC/attention framework wherein CTC's core properties can counteract several key weaknesses of pure-attention models during training and decoding. To validate this conjecture, we modify the Hybrid CTC/Attention model originally proposed for ASR to support text-to-text translation (MT) and speech-to-text translation (ST). Our proposed joint CTC/attention models outperform pure-attention baselines across six benchmark translation tasks.
1 Introduction

Automatic speech recognition (ASR), machine translation (MT), and speech translation (ST) have conspicuous differences but are all closely related sequence-to-sequence problems. Researchers from these respective fields have long recognized the opportunity for cross-pollinating ideas (He and Deng, 2011), starting from the coupling of statistical ASR (Huang et al., 2014) and MT (Al-Onaizan et al., 1999) which gave rise to early approaches for ST (Waibel, 1996; Ney, 1999). Notably, in the end-to-end era, attentional encoder-decoder approaches emerged in both MT (Bahdanau et al., 2015) and ASR (Chan et al., 2016) and have since risen to great prominence in both fields.

During this same period, there has been another prominent end-to-end approach in ASR: Connectionist Temporal Classification (CTC) (Graves et al., 2006). Unlike the highly flexible attention mechanism, which can handle ASR, MT, and ST alike, CTC models sequence transduction as a monotonic alignment of inputs to outputs and thus fits more naturally with ASR than it does with translation. Still, many interested in non-autoregressive translation have applied CTC to MT (Libovický and Helcl, 2018) and ST (Chuang et al., 2021), and promising techniques have emerged which have shrunk the gap to autoregressive approaches (Saharia et al., 2020; Gu and Kong, 2021; Inaguma et al., 2021b; Huang et al., 2022). These recent developments suggest that the latent alignment ability of CTC is a promising direction for translation; this leads us to ask: can CTC alignments improve autoregressive translation? In particular, we are interested in frameworks which leverage the strengths of CTC while minimizing its several harmful incompatibilities with translation tasks (see §3).

Inspired by the success of Hybrid CTC/Attention in ASR (Watanabe et al., 2017), we investigate jointly modeling CTC with an autoregressive attentional encoder-decoder for translation. Our conjecture is that the monotonic alignment and conditional independence of CTC, which weaken purely CTC-based translation, counteract particular weaknesses of attentional models in joint CTC/attention frameworks. In this work, we seek to investigate how each CTC property interacts with corresponding properties of its attentional counterpart during joint training and decoding. We design a joint CTC/attention architecture for translation (§4) and then examine the positive interactions which ultimately result in improved translation quality compared to pure-attention baselines, as demonstrated on the IWSLT (Cettolo et al., 2012), MuST-C (Di Gangi et al., 2019), and MTedX (Salesky et al., 2021) MT/ST corpora (§6).
2 Background: Joint CTC/Attn for ASR
| CTC | Attention | Joint CTC/Attention | ASR | MT/ST |
| --- | --- | --- | --- | --- |
| $P_{\text{CTC}}(Y \mid X) = \sum_{Z \in \mathcal{Z}} \prod_{t=1}^{T} P(z_t \mid X, z_{1:t-1})$ | $P_{\text{Attn}}(Y \mid X) = \prod_{l=1}^{L} P(y_l \mid y_{1:l-1}, X)$ | $P_{\text{Joint}}(Y \mid X) = P_{\text{CTC}}(Y \mid X)^{\lambda} \times P_{\text{Attn}}(Y \mid X)^{1-\lambda}$ | ✓ | ✓ |
| Hard Alignment: criterion only allows monotonic alignments of inputs to outputs | Soft Alignment: flexible attention-based input-to-output mappings may overfit to irregular patterns | During Training: hard alignment objective produces stable encoder representations, allowing the decoder to more rapidly learn soft alignment patterns | ✓ | L1, see §3 |
| Conditional Independence: assumes that there are no dependencies between each output unit given the input | Conditional Dependence: locally normalized models with output dependency exhibit label/exposure biases | During Decoding: use of conditionally independent likelihoods in joint scoring eases the exposure/label biases from conditionally dependent likelihoods | ✓ | L2, see §3 |
| Input-Synchronous Emission: each input representation emits exactly one blank or non-blank output token | Autoregressive Generation: need to detect end-points and compare hypotheses of different length in beam search | During Decoding: input-synchronous emission determines output length based on input length, counteracting the autoregressive end-detection problem | ✓ | L3, see §3 |
Table 1: Description of three reasons why joint CTC/attention modeling is powerful in ASR. In order to understand whether these positive interactions between properties of the CTC and attention frameworks are applicable to MT/ST, we must address three corresponding concerns, L1-L3, about the applicability of CTC to translation (§3).
Both the CTC (Graves et al., 2006) and attentional encoder-decoder (Bahdanau et al., 2015) frameworks seek to model the Bayesian decision seeking the output, $\hat{Y}$, from all possible sequences, $\mathcal{V}_{\text{tgt}}^{*}$, by selecting the sequence which maximizes the posterior likelihood $P(Y \mid X)$, where $X = \{x_t \in \mathcal{S}_{\text{src}} \mid t = 1, \dots, T\}$ and $Y = \{y_l \in \mathcal{V}_{\text{tgt}} \mid l = 1, \dots, L\}$. As shown in the first two columns of Table 1, CTC and attention offer different formulations of the posterior likelihood, $P_{\text{CTC}}(\cdot)$ and $P_{\text{Attn}}(\cdot)$ respectively.
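To make the joint formulation from Table 1 concrete, here is a minimal log-domain sketch of $\log P_{\text{Joint}} = \lambda \log P_{\text{CTC}} + (1-\lambda) \log P_{\text{Attn}}$; the weight $\lambda = 0.3$ and the hypothesis scores are illustrative assumptions, not values from the paper.

```python
def joint_log_score(log_p_ctc: float, log_p_attn: float, lam: float = 0.3) -> float:
    """Log-domain form of P_Joint = P_CTC^lam * P_Attn^(1-lam) from Table 1."""
    return lam * log_p_ctc + (1.0 - lam) * log_p_attn

# Illustrative: CTC evidence can flip the ranking produced by attention alone.
hyps = {"hyp_a": (-12.0, -9.5), "hyp_b": (-8.0, -11.0)}
best = max(hyps, key=lambda h: joint_log_score(*hyps[h]))
# -> "hyp_b"; attention-only scoring would instead pick "hyp_a"
```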
What are the critical differences between the CTC and attention frameworks? First of all, the attention mechanism is a flexible input-to-output mapping function which allows a decoder to perform soft alignment of an output unit $y_l$ to multiple input units $x_{[\dots]}$ without restriction. One downside of this flexibility is a risk of destabilized optimization (Kim et al., 2017). CTC, on the other hand, marginalizes via hard alignment the likelihoods of all possible input-to-alignment-sequence mappings, $Z = \{z_t \in \mathcal{V}_{\text{tgt}} \cup \{\varnothing\} \mid t = 1, \dots, T\}$, where each output unit $z_t$ maps to a single input unit $x_t$ in a strictly monotonic pattern.¹
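The deterministic $Z \to Y$ collapse behind this hard alignment (defined in footnote 1) is easy to state in code; a minimal sketch, with "∅" as a stand-in for the blank symbol:

```python
BLANK = "∅"  # stand-in for CTC's null-emission symbol (see footnote 1)

def collapse(alignment: list[str]) -> list[str]:
    """Deterministic Z -> Y mapping: drop repeated emissions, then blanks.

    e.g. ["h","h","∅","e","∅","l","l","∅","l","o"] -> ["h","e","l","l","o"];
    the blank between the two l's is what permits a repeated output label.
    """
    out: list[str] = []
    prev = None
    for z in alignment:
        if z != prev and z != BLANK:
            out.append(z)
        prev = z
    return out
```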
Secondly, the attentional decoder models each output unit $y_l$ with conditional dependence on not only the input $X$, but also the previous output units $y_{1:l-1}$. In order to efficiently compute the marginalized likelihoods of all possible $Z \in \mathcal{Z}(Y, T)$ via dynamic programming, CTC makes a conditional independence assumption that each $z_t$ does not depend on $z_{1:t-1}$ if already conditioned on $X$; this is a strong assumption. On the plus side, since CTC does not model any causality between output units, it is not plagued by the same label and exposure biases that exist in attentional decoders due to local normalization of causal likelihoods (Bottou, 1991; Ranzato et al., 2016; Hannun, 2019).
¹ $\varnothing$ is a "blank" denoting null emission, and $Z$ maps deterministically to $Y$ by removing null and repeated emissions.
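The efficient marginalization that this conditional independence buys can be sketched as the standard CTC forward algorithm in log space; this is a textbook implementation following Graves et al. (2006), not code from the paper:

```python
import math

NEG_INF = float("-inf")

def logsumexp(*xs: float) -> float:
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_log_likelihood(log_probs: list[list[float]], target: list[int], blank: int = 0) -> float:
    """log P_CTC(Y|X) via the forward algorithm over the blank-extended target.

    log_probs: T x V per-frame log-posteriors (assumes T >= 1). Conditional
    independence is what makes each frame contribute one additive term
    log_probs[t][label], independent of previous emissions, so marginalizing
    over all alignments runs in O(T * |Y|) time.
    """
    ext = [blank]
    for y in target:                     # interleave blanks: [∅, y1, ∅, ..., yL, ∅]
        ext.extend([y, blank])
    S, T = len(ext), len(log_probs)
    alpha = [NEG_INF] * S                # alpha[s]: log-prob of reaching ext[s]
    alpha[0] = log_probs[0][blank]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            paths = [alpha[s]]                   # stay on the same extended label
            if s >= 1:
                paths.append(alpha[s - 1])       # advance by one position
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                paths.append(alpha[s - 2])       # skip the blank between distinct labels
            new[s] = logsumexp(*paths) + log_probs[t][ext[s]]
        alpha = new
    return logsumexp(alpha[-1], alpha[-2]) if S > 1 else alpha[0]
```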
Finally, the attentional decoder is an autoregressive generator that decodes the output until a stop token, <eos>, is emitted. Comparing likelihoods for sequences of different lengths requires a heuristic brevity penalty. Furthermore, label bias with respect to the stop token manifests as a length problem where likelihoods degenerate for unexpectedly long outputs (Murray and Chiang, 2018). In comparison, CTC is an input-synchronous emitter that consumes an input unit in order to produce an output unit. Therefore, CTC cannot produce an output longer than the input representation which feeds the final posterior output layer, but this also means that CTC does not require any end detection.
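As an illustration of the length-comparison problem, here is one common heuristic, the GNMT-style length penalty of Wu et al. (2016); the paper only notes that some such brevity heuristic is needed, so the exact form, the exponent, and the scores below are assumptions:

```python
def length_normalized(log_p: float, length: int, alpha: float = 0.6) -> float:
    """GNMT-style length penalty (Wu et al., 2016): score = log_p / lp(length)."""
    return log_p / (((5 + length) / 6) ** alpha)

# A short vs. a long hypothesis: raw log-prob favors the short one, while the
# normalized score prefers the longer candidate (illustrative numbers).
short, long_ = (-6.0, 5), (-7.0, 12)
raw_best = max([short, long_], key=lambda h: h[0])                    # -> short
norm_best = max([short, long_], key=lambda h: length_normalized(*h))  # -> long_
```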
As previously shown by Kim et al. (2017) and Watanabe et al. (2017), jointly modeling CTC and an attentional decoder is highly effective in ASR. The foundation of this architecture is a shared encoder, $\text{Enc}$, which feeds into both the CTC, $P_{\text{CTC}}(\cdot)$, and attentional decoder, $P_{\text{Attn}}(\cdot)$, posteriors:

$h = \text{Enc}(X)$  (1)
$P_{\text{CTC}}(z_t \mid X) = \text{CTC}(h_t)$  (2)
$P_{\text{Attn}}(y_l \mid X, y_{1:l-1}) = \text{Dec}(h, y_{1:l-1})$  (3)
where $\text{CTC}(\cdot)$ denotes a projection to the CTC output vocabulary, $\mathcal{V}_{\text{tgt}} \cup \{\varnothing\}$, followed by softmax, and $\text{Dec}(\cdot)$ denotes autoregressive decoder layers followed by a projection to the decoder output vocabulary, $\mathcal{V}_{\text{tgt}} \cup \{\texttt{<eos>}\}$, and softmax. The joint network is optimized via a multi-tasked objective, $\mathcal{L}^{\text{ASR}} = \mathcal{L}^{\text{ASR}}_{\text{CTC}} + \lambda \mathcal{L}^{\text{ASR}}_{\text{Attn}}$, where $\lambda$ interpolates the CTC loss and the cross-entropy loss of the decoder.
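A minimal PyTorch sketch of Eqs. (1)-(3) and this multi-tasked objective; the layer counts, widths, and interpolation weight are illustrative assumptions rather than the paper's configuration:

```python
import torch
import torch.nn as nn

class JointCTCAttention(nn.Module):
    """Sketch of Eqs. (1)-(3): a shared encoder feeding a CTC head and an
    autoregressive decoder, trained with L = L_CTC + lam * L_Attn."""

    def __init__(self, vocab: int, dim: int = 256):
        super().__init__()
        enc = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=6)      # Enc(X)
        self.ctc_head = nn.Linear(dim, vocab + 1)                    # V_tgt ∪ {∅}
        dec = nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=6)      # Dec(h, y_1:l-1)
        self.dec_head = nn.Linear(dim, vocab + 1)                    # V_tgt ∪ {<eos>}
        self.embed = nn.Embedding(vocab + 1, dim)
        self.ctc_loss = nn.CTCLoss(blank=vocab, zero_infinity=True)  # last id = blank

    def forward(self, x, y_in, y_out, x_lens, y_lens, lam: float = 0.3):
        # x: (B, T, dim) already-embedded inputs; y_in/y_out: shifted target ids.
        h = self.encoder(x)                                          # Eq. (1)
        ctc_logp = self.ctc_head(h).log_softmax(-1)                  # Eq. (2)
        mask = nn.Transformer.generate_square_subsequent_mask(y_in.size(1)).to(x.device)
        logits = self.dec_head(self.decoder(self.embed(y_in), h, tgt_mask=mask))  # Eq. (3)
        # For brevity the same targets drive both losses; a real system handles
        # <eos> (decoder) and blank-free, unpadded (CTC) targets separately.
        l_ctc = self.ctc_loss(ctc_logp.transpose(0, 1), y_out, x_lens, y_lens)
        l_attn = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), y_out.reshape(-1))
        return l_ctc + lam * l_attn                                  # multi-tasked objective
```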
Joint decoding is typically performed with a one-pass beam search where CTC plays a secondary role as a joint scorer while attention leads the major hypothesis expansion and end detection functions in the algorithm (Watanabe et al., 2017; Tsunoo et al., 2021). However, CTC is capable of taking over the lead role if called upon, e.g. for streaming applications (Moritz et al., 2019b).
3 Potential CTC Limitations in MT/ST

Why exactly does this joint CTC/attention framework perform so well in ASR? As summarized in column 3 of Table 1, we are particularly interested in three reasons which arise from the combination of the hard vs. soft alignment, conditional independence vs. dependence, and input-synchronous emission vs. autoregressive generation properties of CTC and attention, respectively. These dynamics have become well understood in ASR, owing to the popularity of the joint framework (Watanabe et al., 2018) amongst ASR practitioners.

So can CTC and attention also complement each other when applied jointly to translation?² ASR, MT, and ST can all be generalized as sequence transduction tasks following the Bayesian formulation. Moreover, attentional decoders are a predominant technical solution to each of these tasks. However, the CTC framework appears to have several limitations specific to MT/ST that are not present in ASR; this seemingly diminishes the promise of the joint CTC/attention framework for translation. In this work, we seek to address the following three concerns about MT/ST CTC which appear to inhibit the CTC/attention framework (per Table 1).
L1: Can CTC encoders perform the sophisticated input-to-output mappings required for translation?
Unlike ASR, translation entails non-monotonic mappings due to variable word-ordering across languages. Additionally, inputs may be shorter than outputs, as mappings are not necessarily one-to-one. Furthermore, the mapping task for ST is actually compositional: logically, a speech signal first maps to a source language transcription before being mapped to the ultimate translation. All of these complications appear to directly contradict the hard alignment of CTC. If CTC cannot produce stable encoder representations for MT/ST, then during joint training attention does not receive the optimization benefit it does in ASR (per row 2 of Table 1). Fortunately, prior works suggest that these challenges are not insurmountable. Chuang et al. (2021) showed that self-attentional encoders can latently model variable word-order for ST, Libovický and Helcl (2018) and Dalmia et al. (2022) proposed up-sampling encoders that produce expanded input representations for MT, and Sanabria and Metze (2018) and Higuchi et al. (2022) proposed hierarchical encoders that can compose multiple output resolutions for ASR. In §4.1, we incorporate these techniques into a unified solution which achieves hierarchical encoding for translation.

² This particular question has not been addressed in the literature. For an account of related work, please see §9.
L2: Does CTC-based translation quality lag too far behind attention-based quality to be useful?
CTC-based ASR has recently shown competitive performance due in large part to improved neural architectures (Gulati et al., 2020) and self-supervised learning (Baevski et al., 2020; Hsu et al., 2021), but the gap between CTC and attention for translation appears to be greater (Gu and Kong, 2021). Perhaps the conditional independence of CTC inhibits quality in MT/ST to such a degree that its likelihoods cannot ease the label/exposure biases of the attentional decoder as they do in ASR (per column 3 of Table 1). The relative weakness of non-autoregressive translation approaches has been well-studied. Knowledge distillation (Kim and Rush, 2016; Zhou et al., 2019) and iterative methods (Qian et al., 2021; Chan et al., 2020; Huang et al., 2022) all attempt to bridge the gap between non-autoregressive models and their autoregressive counterparts. In §6, we address this concern empirically: even CTC models with a 28% relative BLEU reduction compared to attention yield improvements when CTC and attention are jointly decoded.
[Figure 1: Hierarchical MT/ST encoders where representations are first up-/down-sampled by SrcEnc_MT/ST and then re-ordered by TgtEnc_MT/ST. (a) Hierarchical MT Encoder; (b) Hierarchical ST Encoder.]

L3: Is the alignment information produced by CTC-based translation models reasonable?
In ASR, CTC alignments are reliable enough to segment audio data by force-aligning inputs to target transcriptions (Kürzinger et al., 2020) and exhibit minimal drift compared to hidden Markov models (Sak et al., 2015). However, CTC alignments are not as well studied in translation. It is an open question whether or not the input-synchronous emission of CTC for translation has sufficient alignment quality to support the end-detection responsibility during joint decoding as it does in ASR (per row 4 of Table 1). Ideally, the CTC alignments are strong enough that CTC can lead joint decoding by proposing candidates for hypothesis expansion in each beam step until all input units are consumed (at which point the end is detected), as in an input-synchronous beam search. More conservatively, the CTC alignments may be too unreliable to take the lead but could still guide the attentional decoder's end detection by penalizing incorrect lengths via joint scoring, as in an output-synchronous beam search. In §4.2, we lay out comparable forms of input- and output-synchronous beam search, which allows us to examine the impact on translation quality depending on whether CTC is explicitly responsible for or only implicitly contributing to end detection.
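To make the input-synchronous variant concrete before §4.2, here is a deliberately simplified frame-synchronous beam step; it merges prefixes by max score instead of summing alignment probabilities (i.e., it omits proper CTC prefix-probability bookkeeping) and leaves out the attention joint-scoring hook, so it sketches the control flow rather than the paper's actual algorithm:

```python
def input_sync_beam_step(hyps, frame_logp, beam=5, blank=0, top_k=5):
    """One simplified input-synchronous beam step: every hypothesis consumes
    the current input frame and emits blank or a token, so decoding ends
    exactly when the frames run out (no <eos> detection needed).

    hyps: list of (prefix tuple, score); frame_logp: this frame's log-posteriors.
    """
    cand = {}
    tokens = sorted(range(len(frame_logp)), key=lambda v: -frame_logp[v])[:top_k]
    for prefix, score in hyps:
        for v in tokens:
            if v == blank or (prefix and v == prefix[-1]):
                new_prefix = prefix            # null or repeated emission
            else:
                new_prefix = prefix + (v,)     # emit a new output token
            s = score + frame_logp[v]
            if new_prefix not in cand or s > cand[new_prefix]:
                cand[new_prefix] = s           # max-merge (approximation)
    return sorted(cand.items(), key=lambda kv: -kv[1])[:beam]
```

Calling this once per encoder frame and reading out the best prefix after the last frame yields the end detection for free: decoding stops exactly when the input is consumed.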
4 Joint CTC/Attention for Translation

4.1 Hierarchical CTC Encoding

Per L1 described in §3, we seek to build a CTC encoder for translation which handles sophisticated input-to-output mappings. We therefore propose to use a hierarchical CTC encoding scheme³ which first aligns inputs to length-adjusted source-oriented encodings before aligning to re-ordered target-oriented encodings, as shown in Figure 1. We decompose the encoding process into two functions: length-adjustment and re-ordering.
Length-adjustment: For MT, we up-sample the lengths of the source-oriented encodings in order to output sequences longer than the input. For ST, we down-sample the lengths of the source-oriented encodings to coerce a discrete textual representation of the real-valued speech input. We enforce source-orientations using CTC criteria that seek to align intermediate-layer encoder representations towards source text sequences.
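A minimal sketch of the two length-adjustment operations; `repeat_interleave` and average pooling are simple stand-ins for the LegoNN OLC up-sampling and strided convolutional down-sampling actually used, and the factors `n` are assumed:

```python
import torch

def upsample_mt(h_src: torch.Tensor, n: int = 2) -> torch.Tensor:
    """MT: expand T source encodings to n*T so CTC can emit outputs longer
    than the input (stand-in for LegoNN OLC up-sampling; factor n assumed)."""
    return h_src.repeat_interleave(n, dim=1)      # (B, T, D) -> (B, n*T, D)

def downsample_st(h_src: torch.Tensor, n: int = 4) -> torch.Tensor:
    """ST: strided pooling over frames coerces a coarser, text-like resolution
    (stand-in for the paper's strided convolutional blocks)."""
    return torch.nn.functional.avg_pool1d(
        h_src.transpose(1, 2), kernel_size=n, stride=n).transpose(1, 2)
```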
Re-ordering: We then obtain target-oriented encodings with hierarchical encoder layers, where re-ordering is enforced using CTC criteria that seek to align final-layer encoder representations towards target text sequences. Critically, the underlying neural network architecture must be able to model latent re-ordering, as the CTC criterion itself will only consider monotonic alignments of the final encoder representation to the target.

³ Hierarchical CTC encoding is a flexible technique which has been applied to various multi-objective scenarios in prior ASR works (Sanabria and Metze, 2018; Higuchi et al., 2022).
Our proposed MT/ST hierarchical encoders consist of the following components:

$h^{\text{SRC}} = \text{SrcEnc}_{\text{MT/ST}}(X)$  (4)
$P_{\text{CTC}}(z^{\text{SRC}}_t \mid X) = \text{SrcCTC}_{\text{MT/ST}}(h^{\text{SRC}}_t)$  (5)
$h^{\text{TGT}} = \text{TgtEnc}_{\text{MT/ST}}(h^{\text{SRC}})$  (6)
$P_{\text{CTC}}(z^{\text{TGT}}_t \mid X) = \text{TgtCTC}_{\text{MT/ST}}(h^{\text{TGT}}_t)$  (7)

where $\text{SrcEnc}_{\text{MT}}(\cdot)$ is realized by $N_1$ Transformer (Vaswani et al., 2017) layers followed by $N_2$ up-sampling LegoNN Output Length Controller (OLC) layers (Dalmia et al., 2022), while $\text{TgtEnc}_{\text{MT}}(\cdot)$ is realized by $N_3$ non-up-sampling LegoNN OLC layers. We chose LegoNN based on its previously demonstrated effectiveness for up-sampling textual representations and its ability to perform latent re-ordering via self-attention. $\text{SrcEnc}_{\text{ST}}(\cdot)$ is realized by $N_1$ convolutional blocks for down-sampling (Dong et al., 2018) followed by $N_2$ Conformer layers (Gulati et al., 2020), while $\text{TgtEnc}_{\text{ST}}(\cdot)$ is realized by $N_3$ Conformer layers. We chose Conformer based on its previously demonstrated effectiveness for modeling local and global dependencies in speech signals and its ability to perform latent re-ordering via self-attention.
The hierarchical encoders are jointly optimized with an attentional decoder using a multi-tasked objective, $\mathcal{L} = \mathcal{L}_{\text{SrcCTC}} + \lambda_1 \mathcal{L}_{\text{TgtCTC}} + \lambda_2 \mathcal{L}_{\text{Attn}}$, where the $\lambda$s interpolate source-oriented CTC, target-oriented CTC, and decoder cross-entropy losses.
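A sketch of this objective under assumed tensor shapes; the $\lambda$ values are illustrative, and the same target tensor is reused for both target-side losses for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def hierarchical_loss(src_logp, tgt_logp, dec_logits, src_y, tgt_y,
                      src_in_lens, tgt_in_lens, src_y_lens, tgt_y_lens,
                      lam1=0.3, lam2=1.0):
    """L = L_SrcCTC + lam1 * L_TgtCTC + lam2 * L_Attn.

    src_logp / tgt_logp: (T, B, V) log-posteriors from the SrcCTC / TgtCTC
    heads (T differs after length-adjustment); dec_logits: (B, L, V).
    """
    l_src = ctc(src_logp, src_y, src_in_lens, src_y_lens)   # source-oriented CTC
    l_tgt = ctc(tgt_logp, tgt_y, tgt_in_lens, tgt_y_lens)   # target-oriented CTC
    l_attn = F.cross_entropy(dec_logits.reshape(-1, dec_logits.size(-1)),
                             tgt_y.reshape(-1))             # decoder cross-entropy
    return l_src + lam1 * l_tgt + lam2 * l_attn
```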
4.2 Input/Output-Synchronous Decoding

Per L2 and L3 described in §3, we seek to design a joint decoding algorithm with input- and output-synchronous variants of one-pass beam search which differ only in whether CTC or attention takes the leading role. As shown in Algorithms 1 and 2, we propose to align the input and output beam-step functions along three common functions: hypothesis expansion, joint scoring, and end detection.

Output-Synchrony: Consider first that attention is in the leading role, which means that we are working with an output-synchronous beam search. Note that this is the algorithm originally proposed for joint CTC/attention decoding in ASR (Watanabe et al., 2017).
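For comparison with the input-synchronous sketch in §3, here is a simplified output-synchronous beam step in the same style: attention leads hypothesis expansion while CTC joint-scores each candidate. The two scoring callbacks are assumed model hooks, and $\lambda$ is illustrative:

```python
def output_sync_beam_step(hyps, attn_next_logp, ctc_prefix_logp, beam=5, lam=0.3):
    """One simplified output-synchronous beam step (cf. Watanabe et al., 2017):
    attention proposes expansions; CTC joint-scores them, penalizing candidates
    whose implied length disagrees with the consumed input.

    hyps: list of (prefix tuple, accumulated attention log-prob).
    attn_next_logp(prefix) -> {token: log P_Attn(token | prefix, X)}
    ctc_prefix_logp(prefix) -> CTC prefix log-probability of the prefix
    """
    cand = []
    for prefix, attn_lp in hyps:
        for tok, lp in attn_next_logp(prefix).items():      # attention expands
            new_prefix = prefix + (tok,)
            new_attn = attn_lp + lp
            joint = lam * ctc_prefix_logp(new_prefix) + (1 - lam) * new_attn
            cand.append((new_prefix, new_attn, joint))      # CTC joint-scores
    cand.sort(key=lambda c: -c[2])
    return [(p, a) for p, a, _ in cand[:beam]]
```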