A COMPARISON OF TRANSFORMER, CONVOLUTIONAL, AND RECURRENT NEURAL
NETWORKS ON PHONEME RECOGNITION
Kyuhong Shim and Wonyong Sung
Department of Electrical and Computer Engineering, Seoul National University, Korea
{skhu20, wysung}@snu.ac.kr
ABSTRACT
Phoneme recognition is a very important part of speech recog-
nition that requires the ability to extract phonetic features
from multiple frames. In this paper, we compare and analyze CNN, RNN, Transformer, and Conformer models on phoneme recognition. For the CNN, the ContextNet model is
used for the experiments. First, we compare the accuracy of
various architectures under different constraints, such as the
receptive field length, parameter size, and layer depth. Second, we interpret the performance differences among these models, especially when the observable sequence length varies. Our
analyses show that Transformer and Conformer models bene-
fit from the long-range accessibility of self-attention through
input frames.
Index Terms—Transformer, Conformer, CNN, RNN,
Phoneme recognition
1. INTRODUCTION
The ability to extract phonologically meaningful features
is essential for various speech processing tasks such as au-
tomatic speech recognition (ASR) [1, 2, 3], speaker veri-
fication [4], and speech synthesis [5, 6]. Such phoneme awareness is a fundamental building block of human intelligence; not only spoken but also written language directly corresponds to combinations of phonemes.
In speech processing, DNN architectures can be catego-
rized by how the feature extraction mechanism incorporates
past and future information. First, convolutional neural net-
works (CNNs) exploit the fixed-length convolution kernel to
aggregate multiple frame information. Because each frame
can only access nearby frames within the kernel size in a con-
volutional layer, CNN models often stack multiple layers to
capture long-range relationships. Second, recurrent neural
networks (RNNs) compress the entire past/future sequence
into a single feature vector. This compression enables RNNs to utilize the entire sequence efficiently; however, RNNs suffer from the loss of long-range information because the representation space is restricted to a single vector. In contrast,
Transformer-based models process the entire sequence simultaneously using self-attention, where each frame directly
accesses every other frame and adaptively determines their
importance [7]. In other words, Transformer-based models
are more advantageous for long-range dependency modeling
compared to CNN and RNN models. For this reason, Trans-
former has become the universal choice for state-of-the-art
speech processing in recent years. However, phoneme recog-
nition is considered a task that depends on a very short time
interval of speech when compared to linguistic processing.
Many phonemes can be classified even with only one or a few
frames of speech. Thus, the phoneme classification efficacy
of DNNs, especially Transformer-based ones that can process long-range relationships, needs to be studied in detail.
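The receptive-field contrast described above can be made concrete with a small sketch (illustrative only, not part of the paper; the helper names are ours, and stride-1, dilation-1 convolutions are assumed). After n stacked 1-D convolutions of kernel size k, one output frame can see 1 + n(k - 1) input frames, whereas one self-attention layer already connects every pair of frames:

```python
import math

def conv_receptive_field(kernel_size: int, num_layers: int) -> int:
    """Frames visible to one output frame after stacking stride-1 conv layers."""
    return 1 + num_layers * (kernel_size - 1)

def layers_needed(kernel_size: int, target_frames: int) -> int:
    """Number of stacked conv layers required to cover target_frames."""
    return math.ceil((target_frames - 1) / (kernel_size - 1))

# With kernel size 5, ten layers still see only 41 frames, and covering a
# 1000-frame utterance takes 250 layers; a single self-attention layer
# attends across all 1000 frames at once.
print(conv_receptive_field(kernel_size=5, num_layers=10))   # → 41
print(layers_needed(kernel_size=5, target_frames=1000))     # → 250
```

This linear growth is why CNN models must stack many layers (or enlarge kernels) to capture long-range relationships, while attention-based models obtain global access in a single layer.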
In this paper, we compare four different DNN architec-
tures for phoneme recognition. Specifically, we compare
CNN, RNN, Transformer, and Conformer [3] models un-
der the same conditions. Then, we analyze how different
components and limitations of each architecture affect the
performance. We emphasize that phoneme recognition is the
most suitable task for evaluating the phonetic feature extrac-
tion capability. This is because other speech-related tasks
usually require a model to encapsulate more information than
phonetic knowledge in features. For example, for end-to-end
speech recognition, the model should utilize phonetic and
linguistic information together to generate correct transcrip-
tion [8]. For speaker verification, speaker diarization, and
speech synthesis, the model must consider the non-phonetic
aspects of the speech, such as pitch, accent, speed, or loud-
ness. On the other hand, phoneme recognition performance
can be easily measured by accuracy, and the result solely
depends on the feature quality.
We summarize our findings below:
• Although each phoneme is uttered within a short period, the phoneme recognition accuracy of DNNs keeps improving until the receptive field becomes fairly long.
• When the receptive field length becomes longer, Trans-
former and Conformer show consistent performance
improvement, in contrast to CNN.
• When the parameter size is very small, such as 1M, ContextNet performs best. ContextNet is also advantageous in terms of inference time on GPUs.
arXiv:2210.00367v1 [eess.AS] 1 Oct 2022