
et al., 2021). However, CTC is capable of taking over the lead role if called upon (e.g., for streaming applications) (Moritz et al., 2019b).
3 Potential CTC Limitations in MT/ST
Why exactly does this joint CTC/attention framework perform so well in ASR? As summarized in column 3 of Table 1, we are particularly interested in three reasons which arise from the combination of the hard vs. soft alignment, conditional independence vs. dependence, and input-synchronous emission vs. autoregressive generation properties of CTC and attention, respectively. These dynamics have become well understood in ASR, owing to the popularity of the joint framework (Watanabe et al., 2018) amongst ASR practitioners.
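For reference, the joint framework trains both branches on shared encoder representations by interpolating their objectives; following Watanabe et al. (2018), the multi-task training loss is commonly written as

$$\mathcal{L}_{\mathrm{MTL}} \;=\; \lambda\,\mathcal{L}_{\mathrm{CTC}} \;+\; (1-\lambda)\,\mathcal{L}_{\mathrm{att}}, \qquad \lambda \in [0,1],$$

where $\lambda$ balances the hard-alignment CTC branch against the soft-alignment attentional decoder.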
So can CTC and attention also complement each other when applied jointly to translation? To our knowledge, this particular question has not been addressed in the literature (for an account of related works, please see §9). ASR, MT, and ST can all be generalized as sequence transduction tasks following the Bayesian formulation, and attentional decoders are a predominant technical solution to each of these tasks. However, the CTC framework appears to have several limitations specific to MT/ST that are not present in ASR; this seemingly diminishes the promise of the joint CTC/attention framework for translation. In this work, we seek to address the following three concerns about MT/ST CTC which appear to inhibit the CTC/attention framework (per Table 1).
L1: Can CTC encoders perform the sophisticated input-to-output mappings required for translation?
Unlike ASR, translation entails non-monotonic mappings due to variable word-ordering across languages. Additionally, inputs may be shorter than outputs, as mappings are not necessarily one-to-one. Furthermore, the mapping task for ST is actually compositional: logically, a speech signal first maps to a source language transcription before being mapped to the ultimate translation. All of these complications appear to directly contradict the hard alignment of CTC. If CTC cannot produce stable encoder representations for MT/ST, then during joint training attention does not receive the optimization benefit that it does in ASR (per row 2 of Table 1). Fortunately, prior works suggest that these challenges are not insurmountable. Chuang et al. (2021) showed that self-attentional encoders can latently model variable word-orders for ST, Libovický and Helcl (2018); Dalmia et al. (2022) proposed up-sampling encoders that produce expanded input representations for MT, and Sanabria and Metze (2018); Higuchi et al. (2022) proposed hierarchical encoders that can compose multiple output resolutions for ASR. In §4.1, we incorporate these techniques into a unified solution which achieves hierarchical encoding for translation.
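To make the combined idea concrete, below is a minimal sketch of such a hierarchical, up-sampling encoder: a lower block supervised by an intermediate CTC loss on the source transcript (reflecting the compositional ST mapping), an up-sampling projection that lengthens the sequence so outputs need not fit within the input length, and an upper block supervised by a final CTC loss on the target translation. All dimensions, depths, the up-sampling factor, and module names are illustrative assumptions, not the exact architecture of §4.1.

```python
# Sketch of a hierarchical, up-sampling CTC encoder for translation, combining
# up-sampling (Libovicky and Helcl, 2018; Dalmia et al., 2022) with
# multi-resolution hierarchical CTC (Sanabria and Metze, 2018; Higuchi et al.,
# 2022). Hyperparameters here are placeholders.
import torch
import torch.nn as nn

class HierarchicalCTCEncoder(nn.Module):
    def __init__(self, d_model=256, src_vocab=512, tgt_vocab=512, upsample=3):
        super().__init__()
        lower_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        upper_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.lower = nn.TransformerEncoder(lower_layer, num_layers=6)
        self.upper = nn.TransformerEncoder(upper_layer, num_layers=6)
        self.src_ctc = nn.Linear(d_model, src_vocab)   # intermediate CTC head: source transcript
        self.up_proj = nn.Linear(d_model, d_model * upsample)
        self.upsample = upsample
        self.tgt_ctc = nn.Linear(d_model, tgt_vocab)   # final CTC head: target translation

    def forward(self, x):                              # x: (B, T, d_model)
        h = self.lower(x)
        src_logits = self.src_ctc(h)                   # supervised at source resolution
        b, t, d = h.shape
        # Expand T -> T * upsample so the target sequence may be longer than
        # the input, easing CTC's hard-alignment length constraint.
        h = self.up_proj(h).reshape(b, t * self.upsample, d)
        h = self.upper(h)                              # re-encode at target resolution
        tgt_logits = self.tgt_ctc(h)
        return src_logits, tgt_logits
```

Both heads would be trained with standard CTC losses (e.g., torch.nn.CTCLoss), with input lengths T and T × upsample, respectively.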
L2: Does CTC-based translation quality lag too far behind that of attention-based models to be useful?
CTC-based ASR has recently shown competitive performance due in large part to improved neural architectures (Gulati et al., 2020) and self-supervised learning (Baevski et al., 2020; Hsu et al., 2021), but the gap between CTC and attention for translation appears to be greater (Gu and Kong, 2021). Perhaps the conditional independence of CTC inhibits quality in MT/ST to such a degree that CTC likelihoods cannot ease the label/exposure biases of the attentional decoder as they do in ASR (per column 3 of Table 1). The relative weakness of non-autoregressive translation approaches has been well-studied. Knowledge distillation (Kim and Rush, 2016; Zhou et al., 2019) and iterative methods (Qian et al., 2021; Chan et al., 2020; Huang et al., 2022) all attempt to bridge the gap between non-autoregressive models and their autoregressive counterparts. In §6, we address this concern empirically: even CTC models with a 28% relative BLEU reduction compared to attention yield improvements when CTC and attention are jointly decoded.
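As an illustration of what joint decoding means here, the sketch below shows one beam-search expansion step that interpolates full-prefix scores from the two models, in the spirit of Watanabe et al. (2018). The scorer callables and the weight lam are hypothetical placeholders; only the interpolation logic is the point.

```python
# One joint CTC/attention beam-search step (sketch). `ctc_prefix_score` and
# `attn_score` are hypothetical placeholders returning full-prefix log scores
# from a CTC prefix scorer and an autoregressive attentional decoder.
from typing import Callable, List

def joint_beam_step(
    hyps: List[List[int]],                        # current beam of token prefixes
    ctc_prefix_score: Callable[[List[int]], float],
    attn_score: Callable[[List[int]], float],
    vocab: List[int],
    lam: float = 0.3,                             # weight on the CTC branch
    beam: int = 5,
) -> List[List[int]]:
    candidates = []
    for prefix in hyps:
        for tok in vocab:
            new = prefix + [tok]
            # Interpolate the two models' log scores for the expanded prefix.
            score = lam * ctc_prefix_score(new) + (1.0 - lam) * attn_score(new)
            candidates.append((score, new))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [new for _, new in candidates[:beam]]
```

Under this scheme, even a CTC branch that is substantially weaker on its own (the 28% relative BLEU gap above) can still reshape the beam in a useful direction.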
L3: Is the alignment information produced by CTC-based translation models reasonable?
In ASR, CTC alignments are reliable enough to segment audio data by force-aligning inputs to target transcriptions (Kürzinger et al., 2020), and they exhibit minimal drift compared to hidden Markov models (Sak et al., 2015). However, CTC alignments are not as well studied in translation. It is an open question whether the input-synchronous emission of CTC for translation has sufficient alignment quality to support the end detection responsibility during joint decoding as it does in ASR (per row 4 of Table 1). Ideally, the CTC alignments are strong enough that CTC can lead joint decoding by proposing candidates for hypothesis expansion in each beam step until all input units are consumed (at which point the end is detected), as in an input-synchronous beam search. More conservatively, the CTC align-