
4 Experiments
4.1 Datasets
We conduct experiments on both text-to-text ST
and speech-to-text ST tasks.
• Text-to-text ST (T2T-ST)
IWSLT15 English→Vietnamese (En→Vi) (133K pairs) (Cettolo et al., 2015; data from nlp.stanford.edu/projects/nmt/). We use TED tst2012 as the validation set (1553 pairs) and TED tst2013 as the test set (1268 pairs). Following the previous setting (Raffel et al., 2017; Ma et al., 2020c), we replace tokens whose frequency is less than 5 with ⟨unk⟩; the vocabulary sizes are 17K and 7.7K for English and Vietnamese, respectively.
WMT15 German→English (De→En) (4.5M pairs) (data from www.statmt.org/wmt15/). We use newstest2013 as the validation set (3000 pairs) and newstest2015 as the test set (2169 pairs). 32K BPE (Sennrich et al., 2016) is applied, and the vocabulary is shared across languages.
• Speech-to-text ST (S2T-ST)
MuST-C English→German (En→De) (234K pairs) and English→Spanish (En→Es) (270K pairs) (Di Gangi et al., 2019; data from https://ict.fbk.eu/must-c). We use dev as the validation set (1423 pairs for En→De, 1316 pairs for En→Es) and tst-COMMON as the test set (2641 pairs for En→De, 2502 pairs for En→Es), respectively. Following Ma et al. (2020b), we use Kaldi (Povey et al., 2011) to extract 80-dimensional log-mel filter bank features for speech, computed with a 25 ms window size and a 10 ms window shift, and we use SentencePiece (Kudo and Richardson, 2018) to build a unigram vocabulary of size 10,000 separately for the source and target text.
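For illustration, this preprocessing could be approximated by the sketch below, which uses torchaudio's Kaldi-compatible filter bank (instead of Kaldi itself) and the SentencePiece Python API; the file paths are placeholders rather than the actual MuST-C files.

import torchaudio
import torchaudio.compliance.kaldi as kaldi
import sentencepiece as spm

# 80-dimensional log-mel filter bank, 25 ms window, 10 ms shift
waveform, sample_rate = torchaudio.load("sample.wav")  # placeholder audio path
features = kaldi.fbank(
    waveform,
    num_mel_bins=80,
    frame_length=25.0,
    frame_shift=10.0,
    sample_frequency=sample_rate,
)

# unigram vocabulary of size 10,000 (built analogously for the target-side text)
spm.SentencePieceTrainer.train(
    input="train.en",        # placeholder path to source-side training text
    model_prefix="spm_en",
    vocab_size=10000,
    model_type="unigram",
)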
4.2 Experimental Settings
We conduct experiments on the following systems.
All implementations are based on Transformer (Vaswani et al., 2017) and adapted from the Fairseq library (Ott et al., 2019).
Offline
Full-sentence MT (Vaswani et al., 2017), which waits for the complete source input and then starts translating.
Wait-k
Wait-k policy (Ma et al., 2019), the most widely used fixed policy, which first READs k source tokens and then alternately READs one token and WRITEs one token.
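For intuition, the resulting READ/WRITE schedule can be summarized by the minimal sketch below; it is an illustration of the policy, not the training or inference code used in this paper.

def wait_k_read_schedule(k, src_len, tgt_len):
    """For each target position t, the number of source tokens READ before the t-th WRITE."""
    schedule = []
    for t in range(1, tgt_len + 1):
        # READ until k + (t - 1) source tokens are available (capped at the source length),
        # then WRITE the t-th target token.
        schedule.append(min(k + t - 1, src_len))
    return schedule

# e.g., wait_k_read_schedule(k=3, src_len=6, tgt_len=7) -> [3, 4, 5, 6, 6, 6, 6]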
Multipath Wait-k
An efficient training method for wait-k (Elbayad et al., 2020), which randomly samples a different k between batches during training.
Adaptive Wait-k
A heuristic composition of multiple wait-k models (k = 1, ..., 13) (Zheng et al., 2020), which decides whether to translate according to the output probabilities of the wait-k models.
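A rough sketch of such a decision rule is given below; the thresholds rho, the lag bookkeeping, and the top1_prob helper are assumptions made for illustration and do not correspond to the exact heuristic of Zheng et al. (2020).

def adaptive_wait_k_action(models, rho, num_read, num_written, source_finished,
                           k_min=1, k_max=13):
    """Return "READ" or "WRITE" for the next step (illustrative only)."""
    lag = num_read - num_written      # effective k of the current state
    if source_finished or lag >= k_max:
        return "WRITE"                # no source left to read, or the lag is already maximal
    if lag < k_min:
        return "READ"                 # not enough source context for any wait-k expert
    # hypothetical helper: top-1 probability of the next token under the wait-`lag` model
    confidence = models[lag].top1_prob(num_read, num_written)
    return "WRITE" if confidence >= rho[lag] else "READ"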
MoE Wait-k
Mixture-of-experts wait-k policy (Zhang and Feng, 2021c), the SOTA fixed policy, which applies multiple experts to learn multiple wait-k policies during training (code: github.com/ictnlp/MoE-Waitk).
MMA
Monotonic multi-head attention (Ma et al., 2020c), which predicts a Bernoulli variable to decide READ/WRITE; the Bernoulli variable is jointly learned with multi-head attention (code: github.com/pytorch/fairseq/tree/master/examples/simultaneous_translation).
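As a loose illustration of this kind of decision, the single-head sketch below thresholds a Bernoulli WRITE probability obtained from an attention energy; the energy function is a placeholder, and MMA's actual multi-head formulation and joint training are omitted.

import torch

def monotonic_read_write(query, key, energy_fn, threshold=0.5):
    """Single-head sketch: threshold a Bernoulli WRITE probability derived from an energy."""
    p_write = torch.sigmoid(energy_fn(query, key))  # probability of choosing WRITE
    return "WRITE" if p_write.item() >= threshold else "READ"

# toy usage with a dot-product energy over random vectors
q, k = torch.randn(8), torch.randn(8)
action = monotonic_read_write(q, k, energy_fn=lambda a, b: a @ b)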
GSiMT
Generative ST (Miao et al., 2021), which also predicts a Bernoulli variable to decide READ/WRITE, where the variable is trained within a generative framework via dynamic programming.
RealTranS
End-to-end simultaneous speech translation with the Wait-K-Stride-N strategy (Zeng et al., 2021), which waits for N frames at each step.
MoSST
Monotonic-segmented streaming speech translation (Dong et al., 2022), which uses an integrate-and-fire method to segment the speech.
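For intuition, a minimal sketch of integrate-and-fire segmentation is given below; the per-frame weights and the firing threshold of 1.0 follow the general continuous integrate-and-fire idea and are assumptions for illustration, not MoSST's exact formulation.

def integrate_and_fire(weights, threshold=1.0):
    """Return the frame indices where a segment boundary fires (illustrative only)."""
    boundaries, acc = [], 0.0
    for i, w in enumerate(weights):
        acc += w                   # accumulate per-frame weights in (0, 1)
        if acc >= threshold:       # enough evidence accumulated: fire a boundary here
            boundaries.append(i)
            acc -= threshold       # carry the remainder over to the next segment
    return boundaries

# e.g., integrate_and_fire([0.2, 0.5, 0.4, 0.6, 0.9]) -> [2, 4]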
ITST
The proposed method in Sec. 3.
T2T-ST Settings
We apply Transformer-Small (4 heads) for En→Vi and Transformer-Base/Big (8/16 heads) for De→En. Note that we apply a unidirectional encoder in the Transformer to enable simultaneous decoding. Since GSiMT involves dynamic programming, which makes its training expensive, we report GSiMT on WMT15 De→En (Base) (Miao et al., 2021). For T2T-ST evaluation, we report BLEU (Papineni et al., 2002) for translation quality and Average Lagging (AL, in tokens) (Ma et al., 2019) for latency. We also give the results with SacreBLEU in Appendix B.
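For reference, AL for a policy with read schedule g(t) (the number of source tokens read before writing the t-th target token) is defined by Ma et al. (2019) as

AL = \frac{1}{\tau} \sum_{t=1}^{\tau} \left( g(t) - \frac{t-1}{|\mathbf{y}|/|\mathbf{x}|} \right), \qquad \tau = \min\{ t \mid g(t) = |\mathbf{x}| \},

where |\mathbf{x}| and |\mathbf{y}| denote the source and target lengths; the notation here follows Ma et al. (2019).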
S2T-ST Settings
The proposed ITST can perform end-to-end speech-to-text ST in two manners: fixed pre-decision and flexible pre-decision (Ma et al., 2020b). For fixed pre-decision, following Ma et al. (2020b), we apply ConvTransformer-Espnet (4 heads) (Inaguma et al., 2020) for both En→De and En→Es, which adds a 3-layer convolutional network before the encoder to capture speech features. Note that the encoder is also unidirectional for simultaneous decoding. The convolutional layers and the encoder are initialized from a pre-trained ASR model. All systems make a fixed pre-decision of READ/WRITE every 7 source tokens (i.e., every 280 ms). For flexible pre-decision,