A COMPARISON OF TRANSFORMER, CONVOLUTIONAL, AND RECURRENT NEURAL
NETWORKS ON PHONEME RECOGNITION
Kyuhong Shim and Wonyong Sung
Department of Electrical and Computer Engineering, Seoul National University, Korea
{skhu20, wysung}@snu.ac.kr
ABSTRACT
Phoneme recognition is a very important part of speech recog-
nition that requires the ability to extract phonetic features
from multiple frames. In this paper, we compare and analyze CNN, RNN, Transformer, and Conformer models on phoneme recognition. For the CNN, the ContextNet model is used in the experiments. First, we compare the accuracy of
various architectures under different constraints, such as the
receptive field length, parameter size, and layer depth. Sec-
ond, we interpret the performance difference of these models,
especially when the observable sequence length varies. Our
analyses show that Transformer and Conformer models bene-
fit from the long-range accessibility of self-attention through
input frames.
Index Terms: Transformer, Conformer, CNN, RNN,
Phoneme recognition
1. INTRODUCTION
The ability to extract phonologically meaningful features
is essential for various speech processing tasks such as au-
tomatic speech recognition (ASR) [1, 2, 3], speaker veri-
fication [4], and speech synthesis [5, 6]. Such phoneme-
awareness is a fundamental building block for human intel-
ligence; not only the spoken but also the written language
directly corresponds to combinations of phonemes.
In speech processing, DNN architectures can be catego-
rized by how the feature extraction mechanism incorporates
past and future information. First, convolutional neural net-
works (CNNs) exploit the fixed-length convolution kernel to
aggregate multiple frame information. Because each frame
can only access nearby frames within the kernel size in a con-
volutional layer, CNN models often stack multiple layers to
capture long-range relationships. Second, recurrent neural
networks (RNNs) compress the entire past/future sequence
into a single feature vector. This compression enables RNN
to utilize the entire sequence efficiently; however, RNN suf-
fers from the loss of long-range information because the rep-
resentation space is restricted to a single vector. In contrast,
Transformer-based models process the entire sequence simul-
taneously using self-attention, where each frame directly
accesses every other frame and adaptively determines their
importance [7]. In other words, Transformer-based models
are more advantageous for long-range dependency modeling
compared to CNN and RNN models. For this reason, Trans-
former has become the universal choice for state-of-the-art
speech processing in recent years. However, phoneme recog-
nition is considered a task that depends on a very short time
interval of speech when compared to linguistic processing.
Many phonemes can be classified even with only one or a few
frames of speech. Thus, the phoneme classification efficacy
of DNNs, especially Transformer-based ones that can process
long-range relationships, needs to be studied in detail.
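To make this contrast concrete, the small sketch below (ours, not from the paper; kernel sizes and layer counts are illustrative) computes how the receptive field of stacked stride-1 convolutions grows with depth, while a single self-attention layer already spans every input frame.

```python
def conv_receptive_field(num_layers: int, kernel_size: int) -> int:
    """Receptive field, in frames, of a stack of stride-1 1-D convolutions."""
    # Each stride-1 layer extends the receptive field by (kernel_size - 1).
    return 1 + num_layers * (kernel_size - 1)

print(conv_receptive_field(num_layers=1, kernel_size=5))   # 5 frames
print(conv_receptive_field(num_layers=10, kernel_size=5))  # 41 frames
# A single self-attention layer, in contrast, lets each of the T input
# frames attend to all T frames directly, regardless of depth.
```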
In this paper, we compare four different DNN architec-
tures for phoneme recognition. Specifically, we compare
CNN, RNN, Transformer, and Conformer [3] models un-
der the same conditions. Then, we analyze how different
components and limitations of each architecture affect the
performance. We emphasize that phoneme recognition is the
most suitable task for evaluating the phonetic feature extrac-
tion capability. This is because other speech-related tasks
usually require a model to encapsulate more information than
phonetic knowledge in features. For example, for end-to-end
speech recognition, the model should utilize phonetic and
linguistic information together to generate correct transcrip-
tion [8]. For speaker verification, speaker diarization, and
speech synthesis, the model must consider the non-phonetic
aspects of the speech, such as pitch, accent, speed, or loud-
ness. On the other hand, phoneme recognition performance
can be easily measured by accuracy, and the result solely
depends on the feature quality.
We summarize our findings below:
• Although each phoneme is uttered within a short period, the phoneme recognition accuracy of DNNs improves until the receptive field becomes fairly long.
• As the receptive field length increases further, Transformer and Conformer show consistent performance improvement, in contrast to CNN.
• When the parameter size is very small, such as 1M, ContextNet performs best. ContextNet is also advantageous when considering GPU inference time.
2. RELATED WORK
2.1. Phoneme recognition
Earlier studies first introduced neural networks for
phoneme recognition [9], such as time-delay networks [10]
and bidirectional LSTM [11]. In these works, the benefit of
considering more than about 10 frames was marginal.
Recently, phoneme recognition has been widely used as a tool to evaluate the amount of phonetic information in DNN features
learned from other tasks, including ASR and self-supervised
learning. For example, Mockingjay [12], wav2vec 2.0 [13]
and wav2vec-U [14] exploit phoneme recognition on self-
supervised pre-trained Transformer models to demonstrate
that their models learn general speech representations. Our
work is different from these works in that we directly train
models on the phoneme recognition task. By doing so, the
model can fully utilize its capability in extracting phonetic
characteristics without being distracted by other objectives.
2.2. Transformer-based speech processing
Several studies have investigated the behavior of Trans-
former models in order to understand their superior per-
formance. Probing experiments on the self-supervised Trans-
former models discovered that Transformers detect diverse
aspects of audio, including voice pitch, fluency, duration, and
phonemes [15, 16, 17]. On the other hand, analyses of the attention maps revealed that Transformer considers the entire sequence in phonetic feature extraction, a behavior named phonetic localization [8]. For example, a self-attention head that performs phonetic localization would assign high attention weights to similarly pronounced frames.
Furthermore, different self-attention heads specialize in different phonetic relationships [8, 18]. Specifically, phonetic self-attention behaviors can be separated into similarity-
based and content-based ones, where the former focuses on
the pairwise similarity of frames while the latter considers the
content of each frame [19]. We note that such unique behav-
iors have not been reported in CNN- and RNN-based models.
2.3. Comparison between Transformer and Others
Extensive studies have been conducted to compare CNNs to Transformers in the vision domain [20]. In particular, compar-
isons between vision Transformer (ViT) and CNN show that
they learn very different aspects of an image [21]; for exam-
ple, ViT and CNN behave as low-pass and high-pass filters,
respectively [22]. However, such in-depth analyses have rarely been conducted in the speech domain. Several works have inves-
tigated RNN-based and Transformer-based models for ASR
tasks [23, 24, 25], but only the final word error rate and training dynamics were compared. In our experiments, we carefully
design model configurations for a fair comparison and com-
pare four architectures with the same constraints.
Fig. 1: Illustration of a single ContextNet block.
3. DNN ARCHITECTURE
3.1. CNN
We choose ContextNet [2] as a baseline because the Con-
textNet block has been employed in many state-of-the-art
CNN-based ASR models [2, 26]. The ContextNet architecture
differs from other CNNs in two components: depthwise sep-
arable (DS) convolution [27, 2, 26, 28] and squeeze-excite
(SE) module [29]. Figure 1 shows one ContextNet block that
includes four DS convolution layers, residual connection, and
SE module. Note that we take a block as the basic unit of
ContextNet for experiments.
First, DS convolution includes a depthwise convolution
of large kernel size k followed by a pointwise convolution
of kernel size 1. The former aggregates neighboring frames
without mixing channels, and the latter combines every chan-
nel for each frame. This two-step process makes DS con-
volution parameter-efficient because the model can increase
the kernel size without increasing the number of parameters
much. Second, the SE module adaptively re-weights channels based on the per-channel feature averaged over the entire sequence. The SE module is an efficient way to incorporate global information into feature processing; however, it does not distinguish between frames because the same channel weights are applied to every frame feature.
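The following is a minimal PyTorch sketch of one DS convolution followed by an SE module, as described above; it is our own illustration, not the paper's exact configuration (the channel count, kernel size, SE bottleneck ratio, and the omission of normalization and activation layers are all simplifying assumptions). It makes concrete why SE cannot distinguish frames: the same channel weights scale every time step.

```python
import torch
import torch.nn as nn

class DSConvSE(nn.Module):
    """Sketch: depthwise-separable convolution + squeeze-excite (SE).
    Sizes are illustrative, not the paper's ContextNet configuration."""
    def __init__(self, channels: int = 256, kernel_size: int = 5):
        super().__init__()
        # Depthwise conv: large kernel k aggregates neighboring frames
        # per channel (groups=channels means no channel mixing).
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        # Pointwise conv: kernel size 1, combines every channel per frame.
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        # SE: a small bottleneck MLP produces one weight per channel.
        self.se = nn.Sequential(
            nn.Linear(channels, channels // 8), nn.ReLU(),
            nn.Linear(channels // 8, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        x = self.pointwise(self.depthwise(x))
        # Squeeze: average each channel over the entire sequence.
        w = self.se(x.mean(dim=-1))        # (batch, channels)
        # Excite: the SAME per-channel weights scale every frame,
        # which is why SE ignores frame-to-frame differences.
        return x * w.unsqueeze(-1)
```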
3.2. RNN
We use LSTM [30] as our default RNN layer. Specifically,
we stack multiple bidirectional LSTM layers to build an RNN
model [1]. Unlike other architectures, RNN-based models re-
quire sequential processing of frames, which causes slow inference, especially for bidirectional models.
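As a minimal sketch of this setup (the input and hidden sizes are our own illustrative assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn

# Stacked bidirectional LSTM encoder; nn.LSTM concatenates the forward
# and backward features, so each frame gets a 2 * hidden_size vector.
rnn = nn.LSTM(input_size=80, hidden_size=512, num_layers=4,
              bidirectional=True, batch_first=True)

x = torch.randn(8, 100, 80)        # (batch, time, features)
out, _ = rnn(x)                    # out: (8, 100, 1024)
# Each frame's output compresses the entire past (forward direction)
# and the entire future (backward direction) into one vector, and the
# frames must be processed sequentially in each direction.
```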
3.3. Transformer
We employ the pre-norm Transformer layer [31] which in-
cludes two submodules: multi-head self-attention and feed-
forward. Please refer to the original work [7] for the inter-
nal structure of submodules. While the post-norm design was
employed in the original Transformer model, the pre-norm
design is adopted in many speech and language processing
models [3, 32].
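A minimal PyTorch sketch of the pre-norm ordering is given below; it is our own illustration with assumed dimensions, and the submodules are simplified relative to the original design [7]. The point is the placement of LayerNorm: it is applied before each submodule, and the residual connection bypasses both.

```python
import torch
import torch.nn as nn

class PreNormTransformerLayer(nn.Module):
    """Sketch of a pre-norm Transformer layer. In post-norm, LayerNorm
    would instead follow each residual sum. Dimensions are illustrative."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Multi-head self-attention submodule: normalize first, then add.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Feed-forward submodule: normalize first, then add.
        return x + self.ffn(self.norm2(x))
```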