ESB: A BENCHMARK FOR MULTI-DOMAIN
END-TO-END SPEECH RECOGNITION
Sanchit Gandhi, Patrick von Platen & Alexander M. Rush
Hugging Face
{sanchit, patrick, sasha}@huggingface.co
ABSTRACT
Speech recognition applications cover a range of different audio and text distri-
butions, with different speaking styles, background noise, transcription punctu-
ation and character casing. However, many speech recognition systems require
dataset-specific tuning (audio filtering, punctuation removal and normalisation of
casing), therefore assuming a-priori knowledge of both the audio and text dis-
tributions. This tuning requirement can lead to systems failing to generalise to
other datasets and domains. To promote the development of multi-domain speech
systems, we introduce the End-to-end Speech Benchmark (ESB) for evaluating
the performance of a single automatic speech recognition (ASR) system across a
broad set of speech datasets. Benchmarked systems must use the same data pre-
and post-processing algorithm across datasets, assuming the audio and text data
distributions are a-priori unknown. We compare a series of state-of-the-art (SoTA)
end-to-end (E2E) systems on this benchmark, demonstrating how a single speech
system can be applied and evaluated on a wide range of data distributions. We find
E2E systems to be effective across datasets: in a fair comparison, E2E systems
achieve within 2.6% of SoTA systems tuned to a specific dataset. Our analysis re-
veals that transcription artefacts, such as punctuation and casing, pose difficulties
for ASR systems and should be included in evaluation. We believe E2E bench-
marking over a range of datasets promotes the research of multi-domain speech
recognition systems. ESB is available at https://huggingface.co/esb.
1 INTRODUCTION
Speech recognition covers various applications, including dictation, voice assistants, video caption-
ing, telephone conversations and meeting transcriptions (Aksënova et al., 2021). Each application
has domain-specific data distributions for both the audio inputs and transcription outputs. The audio
inputs are derived from different recording conditions, degrees of background noise, speakers and
styles (narrated, oratory or spontaneous). The nature of the transcriptions is also domain-dependent;
in formal settings, such as meeting transcriptions, the text must be orthographic1 and satisfy standard
formatting conventions, whereas in more informal settings, such as telephone conversations, punc-
tuation and casing are often omitted (Kim & Woodland, 2003). To handle the diversity of speech
recognition conditions, there is a need for multi-domain systems that maintain their performance
over a collection of datasets with different audio and transcription distributions.
However, most automatic speech recognition (ASR) systems are trained and evaluated on a single
dataset, utilising dataset-specific model architectures and pre-/post-processing to optimise for single
dataset performance (Likhomanenko et al., 2020). Such dataset-specific tuning assumes a-priori
knowledge of both the audio and text distribution and yields systems that transfer poorly to other
datasets and domains. A generalisable system should transfer to different datasets and domains
given their training data, but without the need for dataset-specific tuning (Wang et al., 2019b) or a-priori
knowledge of the data distributions. End-to-end (E2E) systems consist of a single model that maps
the raw audio inputs to the transcription outputs (Graves & Jaitly, 2014). Learning directly from
data, E2E systems do not require dataset-specific configurations (Hannun et al., 2014). As such,
1 orthographic: the accepted way of spelling and writing words according to standard usage (McIntosh &
Cambridge University Press., 2015).
they can be applied independently to different datasets and domains (Chan et al., 2021; Radford
et al., 2022).
To facilitate the research of multi-domain, generalisable ASR systems, we present the End-to-end
Speech Benchmark (ESB), a benchmark for evaluating a single ASR system across a collection
of speech datasets spanning different domains and speech recognition conditions. Benchmarked
systems must use the same architecture across datasets and may not use dataset-specific pre- or
post-processing. Therefore, ESB favours systems that can be applied independently across speech
recognition domains with no a-priori knowledge of the data distributions. None of the datasets pre-
sented in ESB were created specifically for the benchmark; all are pre-existing datasets, chosen because
they are widely considered by the speech community to be the most applicable, challenging and
interesting datasets. We adopt an open-source and open-science approach by considering datasets
that are freely available and accessible.
To demonstrate ESB, we perform baseline experiments with five different E2E approaches. We find
these E2E systems to be effective across datasets. In a fair comparison, they perform to within
2.6% word error rate of state-of-the-art systems tuned to a specific dataset. Our analysis shows
that transcription artefacts, such as punctuation and casing, make the task of speech recognition
more difficult and should be included in evaluation. We believe E2E benchmarking over a range of
datasets encourages the research of multi-domain speech recognition systems.
2 RELATED WORK
Speech recognition datasets have long focused on covering different domains and speaking styles:
the TIMIT (Garofolo et al., 1993a) and Wall Street Journal (Garofolo et al., 1993b) corpora contain
news broadcast recordings, SwitchBoard (Godfrey et al., 1992) and Fisher (Cieri et al., 2004a;b;
2005a;b) spontaneous telephone conversations, LibriSpeech (Panayotov et al., 2015) narrated au-
diobooks, Common Voice (Ardila et al., 2020) narrated Wikipedia articles and TED-LIUM (Her-
nandez et al., 2018) oratory educational talks. More recently, datasets such as People’s Speech
(Galvez et al., 2021) and GigaSpeech (Chen et al., 2021) extend this to cover multiple domains in
one dataset. However, these multi-domain datasets still lack important domains and speaking styles, such as
conversational speech, which are currently only covered by individual datasets. We see this
as an important trend towards multi-domain speech recognition and collect different datasets to form
a unified ASR benchmark.
Traditionally, ASR systems are trained on case and punctuation normalised text (NIST, 1998; Povey
et al., 2011); the transcriptions are pre-processed to remove casing and punctuation before training
and evaluation. However, in certain speech recognition applications, orthographic transcriptions are
required (Kim & Woodland, 2001). Recent work has looked at training ASR systems on orthographic
transcriptions (O’Neill et al., 2021; Radford et al., 2022), relying on a data-driven E2E approach in
learning to predict cased and punctuated outputs. However, the features of orthographic text remain
challenging for ASR systems. We evaluate a single system over multiple datasets and include all
dataset-specific transcription formatting requirements.
For text understanding, GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a) pro-
vide well-established benchmarks for assessing the generalisation abilities of a single system over
a range of different natural language understanding tasks. The SUPERB (wen Yang et al., 2021)
and XTREME-S (Conneau et al., 2022) benchmarks assess a single system over multiple spoken
language processing tasks. This paper extends these efforts to show that English ASR has sufficient
diversity in datasets and domains to merit a benchmark of its own.
3 MOTIVATION FOR AN END-TO-END BENCHMARK
Different speech domains have different data distributions for audio artefacts (quality, speakers and
styles) and transcription outputs (punctuation, casing, orthography). In using the term end-to-end
(E2E), we refer to systems that map from the raw audio inputs to the transcription outputs without
domain-specific architectures or additional processing. In this section, we describe the existing
works regarding multi-domain and E2E ASR and outline the principal issues involved.
Recent datasets have focused on domains with more challenging audio inputs, specifically in audio
quality, speakers and speaking style (Panayotov et al., 2015; Ardila et al., 2020; Wang et al., 2021;
Hernandez et al., 2018; Chen et al., 2021; O’Neill et al., 2021; Del Rio et al., 2022; Carletta, 2007;
Renals et al., 2007; Godfrey et al., 1992; Cieri et al., 2004a;b; 2005a;b). These datasets incorporate
distinct audio domains, each with different recording conditions and degrees of background noise.
Each dataset includes speakers from both native and non-native English-speaking backgrounds, and
together cover accents and dialects from seven different language regions (Del Rio et al., 2022). The
speaking style for each dataset falls into one of three categories: narrated, oratory or spontaneous,
with each style having different distributions for speaking speed and utterance length. We discuss
the individual datasets in detail in Section 4.
For many ASR systems, a series of dataset-specific pre- and post-processing steps are applied when
training and evaluating systems on individual datasets. For the 10 datasets in this work, there are 10
different Kaldi (Povey et al., 2011) recipes in use, each with unique pre- and post-processing steps.
Of these recipes, one is not even publicly accessible. Employing dataset-specific pre-processing
steps results in systems that do not transfer to different domains. For example, a system that extracts
speech features without a noise-suppression algorithm works adequately well for a dataset with low-
background noise, but the same approach produces much worse results on a noisy dataset (Kim &
Stern, 2016).
Recent speech recognition datasets also include full transcriptions with all the necessary ortho-
graphic features required for their respective domains (Carletta, 2007; Renals et al., 2007; O’Neill
et al., 2021; Del Rio et al., 2022). These datasets aim to encourage ASR systems capable of pro-
ducing transcriptions that adhere to the formatting requirements of the target text domain. We note
that this differs from the standard ASR output transcription format known as Standard Normalised
Orthographic Representation (SNOR) (NIST, 1998), which consists of single-case letters without
punctuation marks or numbers. This format is necessary for ASR systems that do not predict punctu-
ated and cased outputs, relying on post-processing to restore transcription formatting (Chen, 1999).
By contrast, many speech recognition applications, such as financial meeting transcriptions or legal
documents, require orthographic text.
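To make the distinction concrete, the sketch below (our own illustration, not part of any official scoring recipe) reduces an orthographic transcription to a SNOR-like form by lowercasing and stripping punctuation; ESB deliberately keeps the orthographic form as the reference, so this is precisely the kind of normalisation that benchmarked systems cannot rely on.

```python
import re

def snor_normalise(text: str) -> str:
    """Reduce an orthographic transcription to a SNOR-like form:
    single-case letters without punctuation. A full SNOR conversion
    would also verbalise numbers ("9" -> "nine"), omitted here."""
    text = text.lower()                       # collapse casing
    text = re.sub(r"[^\w\s']", " ", text)     # strip punctuation, keep apostrophes
    return re.sub(r"\s+", " ", text).strip()  # squeeze whitespace

print(snor_normalise("Dr. Smith arrived at 9 a.m., didn't he?"))
# -> "dr smith arrived at 9 a m didn't he"
```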
In circumstances where orthographic text is required, it is typically achieved through a series of
dataset-specific post-processing steps applied to the ASR output, each of which treats a single or-
thographic feature (Beeferman et al., 1998; Lita et al., 2003; Kim & Woodland, 2003; Gravano et al.,
2009; Yuan & Briscoe, 2016). However, there are significant shortcomings to this pipeline approach.
Firstly, certain orthographic decisions can only be made using acoustic information rather than text
alone. For instance, an inflection in vocal pitch at the end of a sentence can change its mean-
ing from a statement to a question, thus requiring a question mark instead of a period. Secondly,
cascading a series of post-processing steps into the speech recognition pipeline may lead to error
propagation that hampers overall system performance (Knill et al., 2018; Lu et al., 2019). Finally,
the pipeline system is evaluated for each post-processing component individually. This can result in
individual components being optimised in isolation, at the expense of overall performance, due
to distribution shift (Sculley et al., 2015). As a result, post-processing can lead to systems failing to
accurately predict orthographic transcriptions on datasets where it is required.
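As a purely illustrative sketch of this pipeline approach (the three stage functions are hypothetical placeholders, not real APIs), the cascade below makes the two failure modes explicit: the text-only stages never see the acoustics, and an error in any stage propagates into the final transcript.

```python
from typing import Callable

def cascaded_transcribe(
    audio,
    asr: Callable[[object], str],         # audio -> normalised (SNOR-like) text
    restore_punct: Callable[[str], str],  # text  -> punctuated text
    truecase: Callable[[str], str],       # text  -> cased text
) -> str:
    """Hypothetical cascaded pipeline. Each stage only sees the previous
    stage's text output, so acoustic cues (e.g. the rising pitch that marks
    a question) are unavailable downstream, and early errors propagate."""
    text = asr(audio)           # e.g. "is the meeting at nine"
    text = restore_punct(text)  # e.g. "is the meeting at nine."  (should be "?")
    return truecase(text)       # e.g. "Is the meeting at nine."
```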
These issues and the need for dataset-specific pre- or post-processing can be bypassed entirely by
designing end-to-end models that map from speech directly to orthographic transcripts (Graves & Jaitly,
2014; Chan et al., 2016). E2E models have been shown to outperform traditional cascaded ASR
systems, particularly when large amounts of labelled speech data are available (Hannun et al., 2014;
Synnaeve et al., 2020; Radford et al., 2022). What is more, E2E ASR systems require a single stage
of evaluation; the ASR system is assessed on the cased and punctuated transcription outputs that are
generated for the downstream application, giving a single, unified measure of overall performance.
However, for the further development and refinement of these systems, it is important to have a
benchmark targeting the specific challenges that end-to-end models face.
4 ESB DATASETS
ESB comprises eight English speech recognition datasets, capturing a broad range of domains,
acoustic conditions, speaker styles, and transcription requirements. We retain all punctuation, cas-
ing and formatting in the transcription outputs. Only annotation mistakes, such as double empty
Table 1: Datasets description and statistics. Speaking style falls into one of three categories: narrated
(N), oratory (O) and spontaneous (S). Datasets with multiple speaking styles are shown separated
by a comma. Dataset sizes for the train/validation/test splits are quoted in hours of audio data. The
transcription format is either normalised (Norm.), punctuated (P) or punctuated and cased (P+C).
Dataset                  Domain                        Style  Train/Val/Test     Trans.
LibriSpeech              Audiobook                     N      960 / 11 / 11      Norm.
Common Voice             Wikipedia                     N      1409 / 27 / 27     P+C
VoxPopuli                EU Parliament                 O      523 / 5 / 5        P
TED-LIUM                 TED talks                     O      454 / 2 / 3        Norm.
GigaSpeech               Audiobook, podcast, YouTube   N, S   2500 / 12 / 40     P
SPGISpeech               Meetings                      O, S   4900 / 100 / 100   P+C
Earnings-22              Meetings                      O, S   105 / 5 / 5        P+C
AMI                      Meetings                      S      78 / 9 / 9         P+C
SwitchBoard (optional)   Telephone                     S      3572 / 30 / 7      Norm.
CHiME-4 (optional)       Broadcast news                N      19 / 11 / 7        P+C
spaces, or annotation elements that cannot be considered transcriptions, such as <unk>, are cor-
rected. A comprehensive list of all transcription error corrections is detailed in Appendix A.2. As
the objective of ESB is to motivate the development of end-to-end ASR, systems must use the same
architecture across all datasets without any dataset-specific pre-processing or post-processing. Good
performance requires systems capable of handling a range of audio and text conditions without any
prior dataset-specific knowledge of the data distributions. The main datasets in ESB are accessible
with permissive licensing. We also include two optional paid datasets that cover interesting
and unique domains of speech recognition, but do not require their inclusion for submission to the
benchmark. We describe the datasets below and in Table 1, with additional details in Appendix A.
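As a practical note, the ESB-distributed datasets are hosted on the Hugging Face Hub and can be loaded with the `datasets` library. The sketch below assumes a repository name of `esb/datasets`, per-corpus config names such as `"librispeech"`, and `"audio"`/`"text"` column names; check https://huggingface.co/esb for the exact identifiers and any access agreements required by the gated corpora.

```python
# Minimal loading sketch; repository, config and column names are assumptions
# to be verified against https://huggingface.co/esb.
from datasets import load_dataset

librispeech = load_dataset("esb/datasets", "librispeech", split="train", streaming=True)

sample = next(iter(librispeech))
print(sample["text"])                    # unnormalised reference transcription
print(sample["audio"]["sampling_rate"])  # raw audio and its sampling rate
```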
LibriSpeech (Panayotov et al., 2015) is a standard large-scale dataset for evaluating ASR systems.
It consists of approximately 1000 hours of narrated audiobooks collected from the LibriVox2 project.
Whilst instrumental in enabling researchers to leverage a large body of pre-existing transcribed
speech data, its standalone use presents limitations. The audiobook domain provides high-quality
recording conditions that result in little to no background noise and the narrated speaking style lacks
the acoustic and prosodic features of spontaneous speech. The transcriptions are non-orthographic
without punctuation and casing. Since the books read are in the public domain, many contain an-
tiquated language and writing styles atypical of modern-day speech. We anticipate that competitive
systems will perform extremely well on LibriSpeech (Zhang et al., 2020). We include LibriSpeech in
ESB to facilitate a comparison of performance between ideal speech recognition conditions and the
more challenging settings presented by other datasets in the benchmark. We use the standard split
of train, validation (dev-clean, dev-other) and test sets (test-clean, test-other).
Common Voice (Ardila et al., 2020) is a series of crowd-sourced open-licensed speech datasets
where speakers record text from Wikipedia in various languages. Since anyone can contribute
recordings, there is significant variation in both audio quality and speakers. The audio conditions
are challenging, with recording artefacts, accented speech, hesitations, and the presence of foreign
words. The transcriptions are orthographic, with both casing and punctuation. However, the speak-
ing style remains narrated (a shortcoming shared with LibriSpeech). We use the English subset of
version 9.0 (27-4-2022), with approximately 1,400 hours and data splits provided therein.
VoxPopuli (Wang et al., 2021) is a large-scale multilingual speech corpus consisting of data sourced
from 2009-2020 European Parliament event recordings. Consequently, it occupies the unique do-
main of oratory, political speech, largely sourced from non-native speakers. We use the English
subset with approximately 550 hours and the canonical data splits.
TED-LIUM (Hernandez et al., 2018) is based on English-language TED Talk conference videos.
The transcribed talks cover a range of different cultural, political, and academic topics, resulting in a
2 https://librivox.org/
technical vocabulary. We use the Release 3 edition of the training set with approximately 450 hours and
the legacy distribution of validation and test data, consistent with earlier releases for comparison.
GigaSpeech (Chen et al., 2021) is a multi-domain English speech recognition corpus curated from
audiobooks, podcasts and YouTube. It covers both narrated and spontaneous speech over a variety
of topics, such as arts, science and sports. It is the only corpus in the benchmark to cover multiple
domains. We use the large subset (2,500 hours) for training and the standard validation and test splits.
SPGISpeech (O’Neill et al., 2021) is an English speech recognition corpus composed of company
earnings calls that have been manually transcribed by S&P Global, Inc. The transcriptions are fully-
formatted according to a professional style guide for oratory and spontaneous speech. We train on
the large subset (5,000 hours) and evaluate on the canonical validation and test splits.
Earnings-22 (Del Rio et al., 2022) is a 119-hour corpus of English-language earnings calls collected
from global companies. The dataset was developed with the goal of aggregating a broad range of
speakers and accents covering real-world financial topics. The speakers and accents are highly diverse,
drawn from seven different language regions. To create train-
validation-test splits, we partition the Earnings-22 corpus 90:5:5.
AMI (Carletta, 2007; Renals et al., 2007) comprises 100 hours of meeting recordings captured using
different recording streams. The corpus contains manually annotated orthographic transcriptions of
the meetings aligned at the word level. Individual samples of the AMI dataset contain very large
audio files (between 10 and 60 minutes), which we segment to lengths feasible for training most
ASR systems (for details, see Appendix A). We use the individual headset microphones (AMI-IHM)
version of the dataset and the train, validation and test sets provided therein.
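The exact segmentation procedure is described in Appendix A; as a rough sketch of the general idea (the 30-second cap and the word-tuple format below are our own illustrative assumptions), the word-level alignment can be greedily packed into chunks of bounded duration:

```python
def segment_alignment(words, max_len_s: float = 30.0):
    """words: iterable of (word, start_s, end_s) tuples from the word-level
    alignment of one AMI recording. Greedily packs consecutive words into
    chunks whose span does not exceed max_len_s, returning
    (text, start_s, end_s) tuples suitable for training most ASR systems."""
    chunks, current = [], []
    for word, start, end in words:
        if current and end - current[0][1] > max_len_s:
            chunks.append((" ".join(w for w, _, _ in current),
                           current[0][1], current[-1][2]))
            current = []
        current.append((word, start, end))
    if current:
        chunks.append((" ".join(w for w, _, _ in current),
                       current[0][1], current[-1][2]))
    return chunks
```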
SwitchBoard (optional) is a collection of two-sided conversational telephone speech amongst
speakers from the US. Recorded over 10 years ago and at a lower sampling rate than the other
corpora, it presents a noisy and challenging ASR problem. We partition 5% of the SwitchBoard
(Godfrey et al., 1992) corpus to form the validation split. We combine the remainder of the Switch-
Board corpus with Fisher (Cieri et al., 2004a;b) to form a train set consisting of approximately
3,600 hours. The test sets are the Hub5Eval2000 (Linguistic Data Consortium, 2002) data with two
subsets: SwitchBoard and CallHome.
CHiME-4 (optional) (Vincent et al., 2017) consists of narrated samples from the Wall Street Journal
corpus (Garofolo et al., 1993b). Recordings are taken in challenging noisy environments using a 6-
channel tablet-based microphone array. We limit the official training data to single-channel and 18
hours by randomly selecting one of the six channels for each of the official training recordings. We
use the official 1-channel development and test sets in their original annotated form.
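A minimal sketch of this channel restriction (the recording identifiers and seed below are illustrative; the official CHiME-4 distribution defines its own file layout): each official training recording is assigned one of the six channels at random, with a fixed seed so the subset is reproducible.

```python
import random

def pick_channels(recording_ids, num_channels: int = 6, seed: int = 0):
    """Assign one randomly chosen microphone channel (1-6) to each official
    CHiME-4 training recording; the fixed seed makes the selection reproducible."""
    rng = random.Random(seed)
    return {rec_id: rng.randint(1, num_channels) for rec_id in recording_ids}

# pick_channels(["rec_001", "rec_002"]) -> {"rec_001": 4, "rec_002": 1}  (values illustrative)
```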
SwitchBoard is a popular dataset for assessing ASR systems due to its unique telephone conversation
domain. Alongside CHiME-4, these two datasets present challenging and noisy audio conditions.
However, both datasets require payment for use. Thus, we include these corpora as optional extras
in ESB; the score for these datasets is standalone and does not contribute to the overall benchmark
score.
5 EVALUATION
System Requirements ESB requires a single system to be defined and evaluated across the con-
stituent datasets. The system must use the same architecture as well as training and evaluation
algorithms for all datasets. This requirement includes using the same data pre- and post-processing
of the audio inputs, target transcriptions, and system predictions. There is no restriction on the sys-
tem being a single model, provided it is defined uniformly across all datasets. Given the range in
size of the different datasets, hyper-parameter tuning is permitted, provided the algorithm for hyper-
parameter tuning is consistent across datasets. The validation sets from each dataset are used to
optimise system configurations and for hyper-parameter tuning, while the test sets are used only for
the final evaluation.
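Since the same evaluation algorithm must be applied to every dataset, scoring amounts to computing the word error rate between the unnormalised references and the system predictions. A minimal sketch is shown below using the `jiwer` package (one possible choice of WER implementation, not mandated by ESB); note that no dataset-specific normalisation is applied to either side.

```python
import jiwer

def score(references, predictions) -> float:
    """Corpus-level word error rate over cased, punctuated transcriptions.
    No dataset-specific normalisation is applied to references or predictions."""
    return jiwer.wer(references, predictions)

refs = ["Is the meeting at 9 a.m.?"]
hyps = ["is the meeting at nine A M"]
print(f"WER: {100 * score(refs, hyps):.1f}%")  # punctuation and casing errors count
```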
Systems submitted to ESB may be trained and developed using any public or private data,
including unlabelled audio data for pretraining, unlabelled text corpora for training language models
(LMs) and labelled audio data for supervised training. However, systems may only use the ESB-
distributed versions of the datasets included in the benchmark; in some cases, these datasets include