ESB: A BENCHMARK FOR MULTI-DOMAIN
END-TO-END SPEECH RECOGNITION
Sanchit Gandhi, Patrick von Platen & Alexander M. Rush
Hugging Face
{sanchit, patrick, sasha}@huggingface.co
ABSTRACT
Speech recognition applications cover a range of different audio and text distri-
butions, with different speaking styles, background noise, transcription punctu-
ation and character casing. However, many speech recognition systems require
dataset-specific tuning (audio filtering, punctuation removal and normalisation of
casing), therefore assuming a-priori knowledge of both the audio and text dis-
tributions. This tuning requirement can lead to systems failing to generalise to
other datasets and domains. To promote the development of multi-domain speech
systems, we introduce the End-to-end Speech Benchmark (ESB) for evaluating
the performance of a single automatic speech recognition (ASR) system across a
broad set of speech datasets. Benchmarked systems must use the same data pre-
and post-processing algorithm across datasets, assuming the audio and text data
distributions are a-priori unknown. We compare a series of state-of-the-art (SoTA)
end-to-end (E2E) systems on this benchmark, demonstrating how a single speech
system can be applied and evaluated on a wide range of data distributions. We find
E2E systems to be effective across datasets: in a fair comparison, E2E systems
achieve within 2.6% of SoTA systems tuned to a specific dataset. Our analysis re-
veals that transcription artefacts, such as punctuation and casing, pose difficulties
for ASR systems and should be included in evaluation. We believe E2E bench-
marking over a range of datasets promotes the research of multi-domain speech
recognition systems. ESB is available at https://huggingface.co/esb.
1 INTRODUCTION
Speech recognition covers various applications, including dictation, voice assistants, video caption-
ing, telephone conversations and meeting transcriptions (Aksënova et al., 2021). Each application
has domain-specific data distributions for both the audio inputs and transcription outputs. The audio
inputs are derived from different recording conditions, degrees of background noise, speakers and
styles (narrated, oratory or spontaneous). The nature of the transcriptions is also domain-dependent;
in formal settings, such as meeting transcriptions, the text must be orthographic1 and satisfy standard
formatting conventions, whereas in more informal settings, such as telephone conversations, punc-
tuation and casing are often omitted (Kim & Woodland, 2003). To handle the diversity of speech
recognition conditions, there is a need for multi-domain systems that maintain their performance
over a collection of datasets with different audio and transcription distributions.
However, most automatic speech recognition (ASR) systems are trained and evaluated on a single
dataset, utilising dataset-specific model architectures and pre-/post-processing to optimise for single
dataset performance (Likhomanenko et al., 2020). Such dataset-specific tuning assumes a-priori
knowledge of both the audio and text distribution and yields systems that transfer poorly to other
datasets and domains. A generalisable system should transfer to different datasets and domains
given their training data, but without the need for dataset-specific tuning (Wang et al., 2019b) or a-priori
knowledge of the data distributions. End-to-end (E2E) systems consist of a single model that maps
the raw audio inputs to the transcription outputs (Graves & Jaitly, 2014). Learning directly from
data, E2E systems do not require dataset-specific configurations (Hannun et al., 2014). As such,
1 orthographic: the accepted way of spelling and writing words according to standard usage (McIntosh &
Cambridge University Press., 2015).
they can be applied independently to different datasets and domains (Chan et al., 2021; Radford
et al., 2022).
To facilitate the research of multi-domain, generalisable ASR systems, we present the End-to-end
Speech Benchmark (ESB), a benchmark for evaluating a single ASR system across a collection
of speech datasets spanning different domains and speech recognition conditions. Benchmarked
systems must use the same architecture across datasets and may not use dataset-specific pre- or
post-processing. Therefore, ESB favours systems that can be applied independently across speech
recognition domains with no a-priori knowledge of the data distributions. None of the datasets pre-
sented in ESB were created specifically for the benchmark; all are pre-existing datasets, chosen because
they are widely considered by the speech community to be the most applicable, challenging and
interesting datasets. We adopt an open-source and open-science approach by considering datasets
that are freely available and accessible.
To demonstrate ESB, we perform baseline experiments with five different E2E approaches. We find
these E2E systems to be effective across datasets. In a fair comparison, they perform to within
2.6% word error rate of state-of-the-art systems tuned to a specific dataset. Our analysis shows
that transcription artefacts, such as punctuation and casing, make the task of speech recognition
more difficult and should be included in evaluation. We believe E2E benchmarking over a range of
datasets encourages the research of multi-domain speech recognition systems.
2 RELATED WORK
Speech recognition datasets have long focused on covering different domains and speaking styles:
the TIMIT (Garofolo et al., 1993a) and Wall Street Journal (Garofolo et al., 1993b) corpora contain
news broadcast recordings, SwitchBoard (Godfrey et al., 1992) and Fisher (Cieri et al., 2004a;b;
2005a;b) spontaneous telephone conversations, LibriSpeech (Panayotov et al., 2015) narrated au-
diobooks, Common Voice (Ardila et al., 2020) narrated Wikipedia articles and TED-LIUM (Her-
nandez et al., 2018) oratory educational talks. More recently, datasets such as People’s Speech
(Galvez et al., 2021) and GigaSpeech (Chen et al., 2021) extend this to cover multiple domains in
one dataset. However, these multi-domain datasets still lack important domains and speaking styles, such as
conversational speech, which are currently only covered by individual datasets. We see this
as an important trend towards multi-domain speech recognition and collect different datasets to form
a unified ASR benchmark.
Traditionally, ASR systems are trained on case and punctuation normalised text (NIST, 1998; Povey
et al., 2011); the transcriptions are pre-processed to remove casing and punctuation before training
and evaluation. However, in certain speech recognition applications, orthographic transcriptions are
required (Kim & Woodland, 2001). Recent work has looked at training ASR systems on orthographic
transcriptions (O’Neill et al., 2021; Radford et al., 2022), relying on a data-driven E2E approach in
learning to predict cased and punctuated outputs. However, the features of orthographic text remain
challenging for ASR systems. We evaluate a single system over multiple datasets and include all
dataset-specific transcription formatting requirements.
For text understanding, GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a) pro-
vide well-established benchmarks for assessing the generalisation abilities of a single system over
a range of different natural language understanding tasks. The SUPERB (wen Yang et al., 2021)
and XTREME-S (Conneau et al., 2022) benchmarks assess a single system over multiple spoken
language processing tasks. This paper extends these efforts to show that English ASR has sufficient
diversity in datasets and domains to merit a benchmark of its own.
3 MOTIVATION FOR AN END-TO-END BENCHMARK
Different speech domains have different data distributions for audio artefacts (quality, speakers and
styles) and transcription outputs (punctuation, casing, orthography). In using the term end-to-end
(E2E), we refer to systems that map from the raw audio inputs to the transcription outputs without
domain-specific architectures or additional processing. In this section, we describe the existing
works regarding multi-domain and E2E ASR and outline the principal issues involved.
Recent datasets have focused on domains with more challenging audio inputs, specifically in audio
quality, speakers and speaking style (Panayotov et al., 2015; Ardila et al., 2020; Wang et al., 2021;
Hernandez et al., 2018; Chen et al., 2021; O’Neill et al., 2021; Del Rio et al., 2022; Carletta, 2007;
Renals et al., 2007; Godfrey et al., 1992; Cieri et al., 2004a;b; 2005a;b). These datasets incorporate
distinct audio domains, each with different recording conditions and degrees of background noise.
Each dataset includes speakers from both native and non-native English-speaking backgrounds, and
together cover accents and dialects from seven different language regions (Del Rio et al., 2022). The
speaking style for each dataset falls into one of three categories: narrated, oratory or spontaneous,
with each style having different distributions for speaking speed and utterance length. We discuss
the individual datasets in detail in Section 4.
For many ASR systems, a series of dataset-specific pre- and post-processing steps are applied when
training and evaluating systems on individual datasets. For the 10 datasets in this work, there are 10
different Kaldi (Povey et al., 2011) recipes in use, each with unique pre- and post-processing steps.
Of these recipes, one is not even publicly accessible. Employing dataset-specific pre-processing
steps results in systems that do not transfer to different domains. For example, a system that extracts
speech features without a noise-suppression algorithm works adequately well for a dataset with low-
background noise, but the same approach produces much worse results on a noisy dataset (Kim &
Stern, 2016).
Recent speech recognition datasets also include full transcriptions with all the necessary ortho-
graphic features required for their respective domains (Carletta, 2007; Renals et al., 2007; O’Neill
et al., 2021; Del Rio et al., 2022). These datasets aim to encourage ASR systems capable of pro-
ducing transcriptions that adhere to the formatting requirements of the target text domain. We note
that this differs from the standard ASR output transcription format known as Standard Normalised
Orthographic Representation (SNOR) (NIST, 1998), which consists of single-case letters without
punctuation marks or numbers. This format is necessary for ASR systems that do not predict punctu-
ated and cased outputs, relying on post-processing to restore transcription formatting (Chen, 1999).
By contrast, many speech recognition applications, such as financial meeting transcriptions or legal
documents, require orthographic text.
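To make the distinction concrete, the sketch below (our own illustration, not part of any official scoring recipe) reduces an orthographic transcription to a SNOR-like form by lowercasing and stripping punctuation; ESB deliberately keeps the orthographic form as the reference, so this is precisely the kind of normalisation that benchmarked systems cannot rely on.

```python
import re

def snor_normalise(text: str) -> str:
    """Reduce an orthographic transcription to a SNOR-like form:
    single-case letters without punctuation. A full SNOR conversion
    would also verbalise numbers ("9" -> "nine"), omitted here."""
    text = text.lower()                       # collapse casing
    text = re.sub(r"[^\w\s']", " ", text)     # strip punctuation, keep apostrophes
    return re.sub(r"\s+", " ", text).strip()  # squeeze whitespace

print(snor_normalise("Dr. Smith arrived at 9 a.m., didn't he?"))
# -> "dr smith arrived at 9 a m didn't he"
```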
In circumstances where orthographic text is required, it is typically achieved through a series of
dataset-specific post-processing steps applied to the ASR output, each of which treats a single or-
thographic feature (Beeferman et al., 1998; Lita et al., 2003; Kim & Woodland, 2003; Gravano et al.,
2009; Yuan & Briscoe, 2016). However, there are significant shortcomings to this pipeline approach.
Firstly, certain orthographic decisions can only be made using acoustic information rather than text
alone. For instance, an inflection in vocal pitch at the end of a sentence can change its mean-
ing from a statement to a question, thus requiring a question mark instead of a period. Secondly,
cascading a series of post-processing steps into the speech recognition pipeline may lead to error
propagation that hampers overall system performance (Knill et al., 2018; Lu et al., 2019). Finally,
the pipeline system is evaluated for each post-processing component individually. This can result in
individual components being optimised in isolation, at the expense of overall performance, due
to distribution shift (Sculley et al., 2015). As a result, post-processing can lead to systems failing to
accurately predict orthographic transcriptions on datasets where it is required.
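As a purely illustrative sketch of this pipeline approach (the three stage functions are hypothetical placeholders, not real APIs), the cascade below makes the two failure modes explicit: the text-only stages never see the acoustics, and an error in any stage propagates into the final transcript.

```python
from typing import Callable

def cascaded_transcribe(
    audio,
    asr: Callable[[object], str],         # audio -> normalised (SNOR-like) text
    restore_punct: Callable[[str], str],  # text  -> punctuated text
    truecase: Callable[[str], str],       # text  -> cased text
) -> str:
    """Hypothetical cascaded pipeline. Each stage only sees the previous
    stage's text output, so acoustic cues (e.g. the rising pitch that marks
    a question) are unavailable downstream, and early errors propagate."""
    text = asr(audio)           # e.g. "is the meeting at nine"
    text = restore_punct(text)  # e.g. "is the meeting at nine."  (should be "?")
    return truecase(text)       # e.g. "Is the meeting at nine."
```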
These issues and the need for dataset-specific pre- or post-processing can be bypassed entirely by
designing end-to-end models that map from speech directly to orthographic transcripts (Graves & Jaitly,
2014; Chan et al., 2016). E2E models have been shown to outperform traditional cascaded ASR
systems, particularly when large amounts of labelled speech data are available (Hannun et al., 2014;
Synnaeve et al., 2020; Radford et al., 2022). What is more, E2E ASR systems require a single stage
of evaluation; the ASR system is assessed on the cased and punctuated transcription outputs that are
generated for the downstream application, giving a single, unified measure of overall performance.
However, for the further development and refinement of these systems, it is important to have a
benchmark targeting the specific challenges that end-to-end models face.
4 ESB DATASETS
ESB comprises eight English speech recognition datasets, capturing a broad range of domains,
acoustic conditions, speaker styles, and transcription requirements. We retain all punctuation, cas-
ing and formatting in the transcription outputs. Only annotation mistakes, such as double empty
Table 1: Datasets description and statistics. Speaking style falls into one of three categories: narrated
(N), oratory (O) and spontaneous (S). Datasets with multiple speaking styles are shown separated
by a comma. Dataset sizes for the train/validation/test splits are quoted in hours of audio data. The
transcription format is either normalised (Norm.), punctuated (P) or punctuated and cased (P+C).
Dataset                  Domain                        Style  Train/Val/Test     Trans.
LibriSpeech              Audiobook                     N      960 / 11 / 11      Norm.
Common Voice             Wikipedia                     N      1409 / 27 / 27     P+C
VoxPopuli                EU Parliament                 O      523 / 5 / 5        P
TED-LIUM                 TED talks                     O      454 / 2 / 3        Norm.
GigaSpeech               Audiobook, podcast, YouTube   N, S   2500 / 12 / 40     P
SPGISpeech               Meetings                      O, S   4900 / 100 / 100   P+C
Earnings-22              Meetings                      O, S   105 / 5 / 5        P+C
AMI                      Meetings                      S      78 / 9 / 9         P+C
SwitchBoard (optional)   Telephone                     S      3572 / 30 / 7      Norm.
CHiME-4 (optional)       Broadcast news                N      19 / 11 / 7        P+C
spaces, or annotation elements that cannot be considered transcriptions, such as <unk>, are cor-
rected. A comprehensive list of all transcription error corrections is detailed in Appendix A.2. As
the objective of ESB is to motivate the development of end-to-end ASR, systems must use the same
architecture across all datasets without any dataset-specific pre-processing or post-processing. Good
performance requires systems capable of handling a range of audio and text conditions without any
prior dataset-specific knowledge of the data distributions. The main datasets in ESB are accessible
with permissive licensing. We also include two optional paid datasets that cover interesting
and unique domains of speech recognition, but do not require their inclusion for submission to the
benchmark. We describe the datasets below and in Table 1, with additional details in Appendix A.
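As a practical note, the ESB-distributed datasets are hosted on the Hugging Face Hub and can be loaded with the `datasets` library. The sketch below assumes a repository name of `esb/datasets`, per-corpus config names such as `"librispeech"`, and `"audio"`/`"text"` column names; check https://huggingface.co/esb for the exact identifiers and any access agreements required by the gated corpora.

```python
# Minimal loading sketch; repository, config and column names are assumptions
# to be verified against https://huggingface.co/esb.
from datasets import load_dataset

librispeech = load_dataset("esb/datasets", "librispeech", split="train", streaming=True)

sample = next(iter(librispeech))
print(sample["text"])                    # unnormalised reference transcription
print(sample["audio"]["sampling_rate"])  # raw audio and its sampling rate
```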
LibriSpeech (Panayotov et al., 2015) is a standard large-scale dataset for evaluating ASR systems.
It consists of approximately 1000 hours of narrated audiobooks collected from the LibriVox2 project.
Whilst instrumental in enabling researchers to leverage a large body of pre-existing transcribed
speech data, its standalone use presents limitations. The audiobook domain provides high-quality
recording conditions that result in little to no background noise and the narrated speaking style lacks
the acoustic and prosodic features of spontaneous speech. The transcriptions are non-orthographic
without punctuation and casing. Since the books read are in the public domain, many contain an-
tiquated language and writing styles atypical of modern-day speech. We anticipate that competitive
systems will perform extremely well on LibriSpeech (Zhang et al., 2020). We include LibriSpeech in
ESB to facilitate a comparison of performance between ideal speech recognition conditions and the
more challenging settings presented by other datasets in the benchmark. We use the standard split
of train, validation (dev-clean, dev-other) and test sets (test-clean, test-other).
Common Voice (Ardila et al., 2020) is a series of crowd-sourced open-licensed speech datasets
where speakers record text from Wikipedia in various languages. Since anyone can contribute
recordings, there is significant variation in both audio quality and speakers. The audio conditions
are challenging, with recording artefacts, accented speech, hesitations, and the presence of foreign
words. The transcriptions are orthographic, with both casing and punctuation. However, the speak-
ing style remains narrated (a shortcoming shared with LibriSpeech). We use the English subset of
version 9.0 (27-4-2022), with approximately 1,400 hours and data splits provided therein.
VoxPopuli (Wang et al., 2021) is a large-scale multilingual speech corpus consisting of data sourced
from 2009-2020 European Parliament event recordings. Consequently, it occupies the unique do-
main of oratory, political speech, largely sourced from non-native speakers. We use the English
subset with approximately 550 hours and the canonical data splits.
TED-LIUM (Hernandez et al., 2018) is based on English-language TED Talk conference videos.
The transcribed talks cover a range of different cultural, political, and academic topics, resulting in a
2 https://librivox.org/
technical vocabulary. We use the Release 3 edition of the training set with approximately 450 hours and
the legacy distribution of validation and test data, consistent with earlier releases for comparison.
GigaSpeech (Chen et al., 2021) is a multi-domain English speech recognition corpus curated from
audiobooks, podcasts and YouTube. It covers both narrated and spontaneous speech over a variety
of topics, such as arts, science and sports. It is the only corpus in the benchmark to cover multiple
domains. We use the large subset (2,500 hours) for training and the standard validation and test splits.
SPGISpeech (O’Neill et al., 2021) is an English speech recognition corpus composed of company
earnings calls that have been manually transcribed by S&P Global, Inc. The transcriptions are fully-
formatted according to a professional style guide for oratory and spontaneous speech. We train on
the large subset (5,000 hours) and evaluate on the canonical validation and test splits.
Earnings-22 (Del Rio et al., 2022) is a 119-hour corpus of English-language earnings calls collected
from global companies. The dataset was developed with the goal of aggregating a broad range of
speakers and accents covering real-world financial topics. The speakers and accents are highly diverse,
drawn from seven different language regions. To create train-
validation-test splits, we partition the Earnings-22 corpus 90:5:5.
AMI (Carletta, 2007; Renals et al., 2007) comprises 100 hours of meeting recordings captured using
different recording streams. The corpus contains manually annotated orthographic transcriptions of
the meetings aligned at the word level. Individual samples of the AMI dataset contain very large
audio files (between 10 and 60 minutes), which we segment to lengths feasible for training most
ASR systems (for details, see Appendix A). We use the individual headset microphones (AMI-IHM)
version of the dataset and the train, validation and test sets provided therein.
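The exact segmentation procedure is described in Appendix A; as a rough sketch of the general idea (the 30-second cap and the word-tuple format below are our own illustrative assumptions), the word-level alignment can be greedily packed into chunks of bounded duration:

```python
def segment_alignment(words, max_len_s: float = 30.0):
    """words: iterable of (word, start_s, end_s) tuples from the word-level
    alignment of one AMI recording. Greedily packs consecutive words into
    chunks whose span does not exceed max_len_s, returning
    (text, start_s, end_s) tuples suitable for training most ASR systems."""
    chunks, current = [], []
    for word, start, end in words:
        if current and end - current[0][1] > max_len_s:
            chunks.append((" ".join(w for w, _, _ in current),
                           current[0][1], current[-1][2]))
            current = []
        current.append((word, start, end))
    if current:
        chunks.append((" ".join(w for w, _, _ in current),
                       current[0][1], current[-1][2]))
    return chunks
```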
SwitchBoard (optional) is a collection of two-sided conversational telephone speech amongst
speakers from the US. Recorded over 10 years ago and at a lower sampling rate than the other
corpora, it presents a noisy and challenging ASR problem. We partition 5% of the SwitchBoard
(Godfrey et al., 1992) corpus to form the validation split. We combine the remainder of the Switch-
Board corpus with Fisher (Cieri et al., 2004a;b) to form a train set consisting of approximately
3,600 hours. The test sets are the Hub5Eval2000 (Linguistic Data Consortium, 2002) data with two
subsets: SwitchBoard and CallHome.
CHiME-4 (optional) (Vincent et al., 2017) consists of narrated samples from the Wall Street Journal
corpus (Garofolo et al., 1993b). Recordings are taken in challenging noisy environments using a 6-
channel tablet-based microphone array. We limit the official training data to single-channel and 18
hours by randomly selecting one of the six channels for each of the official training recordings. We
use the official 1-channel development and test sets in their original annotated form.
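A minimal sketch of this channel restriction (the recording identifiers and seed below are illustrative; the official CHiME-4 distribution defines its own file layout): each official training recording is assigned one of the six channels at random, with a fixed seed so the subset is reproducible.

```python
import random

def pick_channels(recording_ids, num_channels: int = 6, seed: int = 0):
    """Assign one randomly chosen microphone channel (1-6) to each official
    CHiME-4 training recording; the fixed seed makes the selection reproducible."""
    rng = random.Random(seed)
    return {rec_id: rng.randint(1, num_channels) for rec_id in recording_ids}

# pick_channels(["rec_001", "rec_002"]) -> {"rec_001": 4, "rec_002": 1}  (values illustrative)
```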
SwitchBoard is a popular dataset for assessing ASR systems due to its unique telephone conversation
domain. Alongside CHiME-4, these two datasets present challenging and noisy audio conditions.
However, both datasets require payment for use. Thus, we include these corpora as optional extras
in ESB; the score for these datasets is standalone and does not contribute to the overall benchmark
score.
5 EVALUATION
System Requirements ESB requires a single system to be defined and evaluated across the con-
stituent datasets. The system must use the same architecture as well as training and evaluation
algorithms for all datasets. This requirement includes using the same data pre- and post-processing
of the audio inputs, target transcriptions, and system predictions. There is no restriction on the sys-
tem being a single model, provided it is defined uniformly across all datasets. Given the range in
size of the different datasets, hyper-parameter tuning is permitted, provided the algorithm for hyper-
parameter tuning is consistent across datasets. The validation sets from each dataset are used to
optimise system configurations and for hyper-parameter tuning, while the test sets are used only for
the final evaluation.
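Since the same evaluation algorithm must be applied to every dataset, scoring amounts to computing the word error rate between the unnormalised references and the system predictions. A minimal sketch is shown below using the `jiwer` package (one possible choice of WER implementation, not mandated by ESB); note that no dataset-specific normalisation is applied to either side.

```python
import jiwer

def score(references, predictions) -> float:
    """Corpus-level word error rate over cased, punctuated transcriptions.
    No dataset-specific normalisation is applied to references or predictions."""
    return jiwer.wer(references, predictions)

refs = ["Is the meeting at 9 a.m.?"]
hyps = ["is the meeting at nine A M"]
print(f"WER: {100 * score(refs, hyps):.1f}%")  # punctuation and casing errors count
```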
Systems submitted to ESB may be trained and developed using any public or private data,
including unlabelled audio data for pretraining, unlabelled text corpora for training language models
(LMs) and labelled audio data for supervised training. However, systems may only use the ESB-
distributed versions of the datasets included in the benchmark; in some cases, these datasets include