A COMPARISON OF TRANSFORMER, CONVOLUTIONAL, AND RECURRENT NEURAL
NETWORKS ON PHONEME RECOGNITION
Kyuhong Shim and Wonyong Sung
Department of Electrical and Computer Engineering, Seoul National University, Korea
{skhu20, wysung}@snu.ac.kr
ABSTRACT
Phoneme recognition is a very important part of speech recog-
nition that requires the ability to extract phonetic features
from multiple frames. In this paper, we compare and analyze CNN, RNN, Transformer, and Conformer models on phoneme recognition. For the CNN, the ContextNet model is
used for the experiments. First, we compare the accuracy of
various architectures under different constraints, such as the
receptive field length, parameter size, and layer depth. Second, we interpret the performance differences among these models, especially when the observable sequence length varies. Our
analyses show that Transformer and Conformer models bene-
fit from the long-range accessibility of self-attention through
input frames.
Index Terms—Transformer, Conformer, CNN, RNN,
Phoneme recognition
1. INTRODUCTION
The ability to extract phonologically meaningful features
is essential for various speech processing tasks such as au-
tomatic speech recognition (ASR) [1, 2, 3], speaker veri-
fication [4], and speech synthesis [5, 6]. Such phoneme awareness is a fundamental building block of human intelligence; not only spoken but also written language directly corresponds to combinations of phonemes.
In speech processing, DNN architectures can be catego-
rized by how the feature extraction mechanism incorporates
past and future information. First, convolutional neural net-
works (CNNs) exploit the fixed-length convolution kernel to
aggregate multiple frame information. Because each frame
can only access nearby frames within the kernel size in a con-
volutional layer, CNN models often stack multiple layers to
capture long-range relationships. Second, recurrent neural
networks (RNNs) compress the entire past/future sequence
into a single feature vector. This compression enables RNNs to utilize the entire sequence efficiently; however, RNNs suffer from the loss of long-range information because the representation space is restricted to a single vector. In contrast,
Transformer-based models process the entire sequence simultaneously using self-attention, where each frame directly
accesses every other frame and adaptively determines their
importance [7]. In other words, Transformer-based models
are more advantageous for long-range dependency modeling
compared to CNN and RNN models. For this reason, Trans-
former has become the universal choice for state-of-the-art
speech processing in recent years. However, phoneme recog-
nition is considered a task that depends on a very short time
interval of speech when compared to linguistic processing.
Many phonemes can be classified even with only one or a few
frames of speech. Thus, the phoneme classification efficacy
of DNNs, especially Transformer-based ones that can process long-range relationships, needs to be studied in detail.
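The receptive-field contrast described above can be made concrete with a small sketch (illustrative only, not part of the paper; the helper names are ours, and stride-1, dilation-1 convolutions are assumed). After n stacked 1-D convolutions of kernel size k, one output frame can see 1 + n(k - 1) input frames, whereas one self-attention layer already connects every pair of frames:

```python
import math

def conv_receptive_field(kernel_size: int, num_layers: int) -> int:
    """Frames visible to one output frame after stacking stride-1 conv layers."""
    return 1 + num_layers * (kernel_size - 1)

def layers_needed(kernel_size: int, target_frames: int) -> int:
    """Number of stacked conv layers required to cover target_frames."""
    return math.ceil((target_frames - 1) / (kernel_size - 1))

# With kernel size 5, ten layers still see only 41 frames, and covering a
# 1000-frame utterance takes 250 layers; a single self-attention layer
# attends across all 1000 frames at once.
print(conv_receptive_field(kernel_size=5, num_layers=10))   # → 41
print(layers_needed(kernel_size=5, target_frames=1000))     # → 250
```

This linear growth is why CNN models must stack many layers (or enlarge kernels) to capture long-range relationships, while attention-based models obtain global access in a single layer.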
In this paper, we compare four different DNN architec-
tures for phoneme recognition. Specifically, we compare
CNN, RNN, Transformer, and Conformer [3] models un-
der the same conditions. Then, we analyze how different
components and limitations of each architecture affect the
performance. We emphasize that phoneme recognition is the
most suitable task for evaluating the phonetic feature extrac-
tion capability. This is because other speech-related tasks
usually require a model to encapsulate more information than
phonetic knowledge in features. For example, for end-to-end
speech recognition, the model should utilize phonetic and
linguistic information together to generate correct transcrip-
tion [8]. For speaker verification, speaker diarization, and
speech synthesis, the model must consider the non-phonetic
aspects of the speech, such as pitch, accent, speed, or loud-
ness. On the other hand, phoneme recognition performance
can be easily measured by accuracy, and the result solely
depends on the feature quality.
We summarize our findings below:
• Although each phoneme is uttered within a short period, the phoneme recognition accuracy of DNNs keeps improving until the receptive field becomes fairly long.
• When the receptive field length becomes longer, Trans-
former and Conformer show consistent performance
improvement, in contrast to CNN.
• When the parameter size is very small, such as 1M, ContextNet performs best. ContextNet is also advantageous in terms of inference time on GPUs.
arXiv:2210.00367v1 [eess.AS] 1 Oct 2022