Recent datasets have focused on domains with more challenging audio inputs, specifically in terms of
audio quality, speaker characteristics and speaking style (Panayotov et al., 2015; Ardila et al., 2020; Wang et al., 2021;
Hernandez et al., 2018; Chen et al., 2021; O’Neill et al., 2021; Del Rio et al., 2022; Carletta, 2007;
Renals et al., 2007; Godfrey et al., 1992; Cieri et al., 2004a;b; 2005a;b). These datasets incorporate
distinct audio domains, each with different recording conditions and degrees of background noise.
The datasets include speakers from both native and non-native English-speaking backgrounds, and
together they cover accents and dialects from seven different language regions (Del Rio et al., 2022). The
speaking style for each dataset falls into one of three categories: narrated, oratory or spontaneous,
with each style having different distributions for speaking speed and utterance length. We discuss
the individual datasets in detail in Section 4.
For many ASR systems, a series of dataset-specific pre- and post-processing steps is applied when
training and evaluating systems on individual datasets. For the 10 datasets in this work, there are 10
different Kaldi (Povey et al., 2011) recipes in use, each with unique pre- and post-processing steps.
Of these recipes, one is not even publicly accessible. Employing dataset-specific pre-processing
steps results in systems that do not transfer to different domains. For example, a system that extracts
speech features without a noise-suppression algorithm works adequately for a dataset with low
background noise, but the same approach produces much worse results on a noisy dataset (Kim &
Stern, 2016).
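To make this transferability issue concrete, the Python sketch below contrasts two hypothetical per-dataset normalisation routines of the kind a recipe might apply before training; the function names and rules are illustrative assumptions on our part, not steps taken from any actual Kaldi recipe.

```python
import re

def normalise_audiobook(text: str) -> str:
    """Hypothetical pre-processing for a clean, narrated corpus:
    lower-case and strip all punctuation and mark-up."""
    text = text.lower()
    text = re.sub(r"[^a-z' ]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def normalise_meetings(text: str) -> str:
    """Hypothetical pre-processing for a meetings corpus:
    keep casing and punctuation but map disfluency tags to one token."""
    text = re.sub(r"<(um|uh|hmm)>", "<hes>", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()

sample = "Okay <um> let's start, Dr. Smith."
# The two recipes produce incompatible targets for the same utterance,
# so a model trained under one normalisation scheme does not transfer cleanly.
print(normalise_audiobook(sample))  # "okay um let's start dr smith"
print(normalise_meetings(sample))   # "Okay <hes> let's start, Dr. Smith."
```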
Recent speech recognition datasets also include full transcriptions with all the necessary ortho-
graphic features required for their respective domains (Carletta, 2007; Renals et al., 2007; O’Neill
et al., 2021; Del Rio et al., 2022). These datasets aim to encourage ASR systems capable of pro-
ducing transcriptions that adhere to the formatting requirements of the target text domain. We note
that this differs from the standard ASR output transcription format known as Standard Normalised
Orthographic Representation (SNOR) (NIST, 1998), which consists of single-case letters without
punctuation marks or numbers. This format is necessary for ASR systems that do not predict punctu-
ated and cased outputs, relying on post-processing to restore transcription formatting (Chen, 1999).
In contrast, many speech recognition applications, such as financial meeting transcriptions or legal
documents, require orthographic text.
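To illustrate the difference, the following sketch reduces an orthographic transcript to a SNOR-style string. It is a minimal approximation of the format described above (single case, no punctuation) and deliberately omits number verbalisation; it is not a faithful implementation of the NIST specification.

```python
import re

def to_snor(transcript: str) -> str:
    """Reduce an orthographic transcript to a SNOR-style string:
    single-case letters, no punctuation. Digit handling (e.g. verbalising
    "1992" as "nineteen ninety two") is omitted here for brevity."""
    text = transcript.upper()
    # Keep letters, apostrophes and spaces; drop punctuation, symbols and digits.
    text = re.sub(r"[^A-Z' ]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(to_snor("Dr. Smith joined Switchboard in 1992, didn't he?"))
# -> "DR SMITH JOINED SWITCHBOARD IN DIDN'T HE"
```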
In circumstances where orthographic text is required, it is typically achieved through a series of
dataset-specific post-processing steps applied to the ASR output, each of which treats a single or-
thographic feature (Beeferman et al., 1998; Lita et al., 2003; Kim & Woodland, 2003; Gravano et al.,
2009; Yuan & Briscoe, 2016). However, there are significant shortcomings to this pipeline approach.
Firstly, certain orthographic decisions can only be made using acoustic information rather than text
alone. For instance, an inflection in vocal pitch at the end of a sentence can change its mean-
ing from a statement to a question, thus requiring a question mark instead of a period. Secondly,
cascading a series of post-processing steps into the speech recognition pipeline may lead to error
propagation that hampers overall system performance (Knill et al., 2018; Lu et al., 2019). Finally,
the pipeline system is evaluated for each post-processing component individually. This can result in
individual components being optimised in isolation at the expense of overall performance, due
to distribution shift (Sculley et al., 2015). As a result, post-processing can lead to systems failing to
accurately predict orthographic transcriptions on datasets where it is required.
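The toy cascade below, with placeholder components of our own devising, illustrates both problems: each stage sees only the text output of the previous one, so a question that is signalled acoustically is punctuated as a statement, and the error is carried through unchanged.

```python
def restore_punctuation(text: str) -> str:
    # Placeholder: a real component would use a trained model
    # (e.g. in the spirit of Beeferman et al., 1998); here we naively
    # end every utterance with a period, so questions are mislabelled.
    return text.rstrip() + "."

def truecase(text: str) -> str:
    # Placeholder truecaser: capitalise the first character only.
    return text[:1].upper() + text[1:]

def postprocess(snor_output: str) -> str:
    # Each stage sees only text, never the audio, and inherits any
    # error introduced upstream.
    return truecase(restore_punctuation(snor_output))

asr_output = "are we meeting at ten"  # rising pitch in the audio: a question
print(postprocess(asr_output))        # "Are we meeting at ten." -- wrong mark
```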
These issues, and the need for dataset-specific pre- or post-processing, can be bypassed entirely by
designing end-to-end (E2E) models that map speech directly to orthographic transcripts (Graves & Jaitly,
2014; Chan et al., 2016). E2E models have been shown to outperform traditional cascaded ASR
systems, particularly when large amounts of labelled speech data are available (Hannun et al., 2014;
Synnaeve et al., 2020; Radford et al., 2022). Moreover, E2E ASR systems require a single stage
of evaluation; the ASR system is assessed on the cased and punctuated transcription outputs that are
generated for the downstream application, giving a single, unified measure of overall performance.
However, for the further development and refinement of these systems, it is important to have a
benchmark targeting the specific challenges that end-to-end models face.
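As an illustration of this single-stage evaluation, word error rate can be computed directly on the cased and punctuated strings, so that casing and punctuation mistakes are penalised alongside word errors; the jiwer package is used here purely as one illustrative WER implementation, not as prescribed tooling.

```python
# pip install jiwer  (illustrative choice of WER implementation)
import jiwer

reference = "Are we meeting at 10 a.m. tomorrow, Dr. Smith?"
hypothesis = "are we meeting at 10 AM tomorrow Dr Smith"

# A single, unified score on the orthographic transcripts: casing and
# punctuation differences count against the system just like word errors.
print(f"Orthographic WER: {jiwer.wer(reference, hypothesis):.2f}")
```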
4 ESB DATASETS
ESB comprises eight English speech recognition datasets, capturing a broad range of domains,
acoustic conditions, speaker styles, and transcription requirements. We retain all punctuation, cas-
ing and formatting in the transcription outputs. Only annotation mistakes, such as double empty