AUTOMATIC SPEECH RECOGNITION OF LOW-RESOURCE
LANGUAGES BASED ON CHUKCHI
Cydnie Davenport
Linguistic Theory and Language Description Masters program
Higher School of Economics
Moscow, Russia
davenport.cyd@gmail.com
Emil Nadimanov
Computational Linguistics Masters program
Higher School of Economics
Moscow, Russia
nadimaemi@gmail.com
Anastasia Safonova
Computational Linguistics Masters program
Higher School of Economics
Moscow, Russia
an.saphonova@gmail.com
Tatiana Yudina
Computational Linguistics Masters program
Higher School of Economics
Moscow, Russia
yudina.tatiana22@gmail.com
October 13, 2022
1 Introduction
The following paper presents a project focused on the research and creation of a new Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) system based on the Chukchi language. The aim is to develop a system that makes the language more accessible to speakers of Chukchi, enabling applications such as annotating subtitles on videos and movies, providing more accessible data for research and analysis, or the creation of chat-bots for online users. This system should consist of: an acoustic model, which receives a fragment of an audio signal and outputs the probability of the various phonemes in that fragment; a language model, which determines which candidate transcriptions are more or less likely; and a decoder, which selects the most likely prediction. Predictive automatic speech recognition models already exist and are a popular focus in the realm of Natural Language Processing; however, the most challenging cases are low-resource languages, due to extreme data deficits.
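To make the division of labor between these three components concrete, the following toy sketch (ours, not a description of any existing implementation) shows how a decoder might combine acoustic-model and language-model scores; the scoring callables and the weight are illustrative placeholders.

    import math

    def decode(candidates, acoustic_score, lm_score, lm_weight=0.8):
        """Pick the candidate transcription with the best combined log-score.

        `acoustic_score` and `lm_score` are placeholder callables standing in
        for the acoustic and language models; `lm_weight` balances the two.
        """
        best, best_score = None, -math.inf
        for cand in candidates:
            score = acoustic_score(cand) + lm_weight * lm_score(cand)
            if score > best_score:
                best, best_score = cand, score
        return best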
This project is centered around a multi-step research process. Initially, we began by analyzing the Chukchi language from a linguistic perspective, but to clarify the motivations for building this system, it must also be looked at from a cultural and sociolinguistic perspective: what is known about the language, whether there are cultural influences within it, why such a system is necessary, and so on. Once there was an extensive understanding of the subject at hand, the next step was finding usable data. For this project, this included broadcasts from a Russia-based Chukchi radio station, videos and lessons from YouTube, written translations of the Bible, and the Higher School of Economics' set of Chukchi corpora known as Chuklang. Once enough data is collected, there then comes the task of cleaning it. This included labeling and segmenting audio data for training, cleaning and filtering out unnecessary symbols (mainly Russian) from text, and determining which data would be used for pre-training and which would be used for testing the resultant model. Once enough data has been collected and cleaned, we sample and train various models to understand how they process the data. Additionally, we try various encoders to understand how well they filter out noise and acoustic artifacts. Further research must be conducted in order to compare models designed for both high- and low-resource languages. Various designs and tools for training ASR models include VQ-VAE, XLSR, the toolkit Kaldi, wav2vec, and more. The intended result of this project is an automatic speech recognition system that works seamlessly with Chukchi and has the potential to be applied to other low-resource languages.
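As one concrete starting point among the tools named above, a pretrained multilingual wav2vec 2.0 (XLSR) checkpoint can be loaded through the HuggingFace transformers library; the sketch below only extracts contextual representations, since a Chukchi vocabulary and CTC head would still have to be added for fine-tuning. The checkpoint name is the public facebook/wav2vec2-large-xlsr-53 model; the audio input is a stand-in.

    import torch
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

    # Pretrained on 53 languages without supervision; no Chukchi-specific head yet.
    name = "facebook/wav2vec2-large-xlsr-53"
    extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
    model = Wav2Vec2Model.from_pretrained(name)

    # One second of 16 kHz mono audio as a stand-in for a real recording.
    audio = torch.zeros(16000).numpy()
    inputs = extractor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        features = model(**inputs).last_hidden_state  # shape: (1, frames, 1024)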
2 BACKGROUND
2.1 The Chukchi Language
The Chukotko-Kamchatkan family of languages is generally held to contain two branches. The northern branch is
referred to as the Chukotian branch (or “Luorevetlan”, based on the Chukchi ethnonym) and consists of Chukchi,
Koryak, Alutor and Kerek (now extinct). The second branch is known as Itelmen, and contains the language Western
Itelmen, which itself consists of two dialects: Khajrjusovo and Sedanka [5]. The language of focus for this paper is
Chukchi, a polysynthetic language spoken primarily within the Chukotka Autonomous Okrug, which is located in the
easternmost part of Siberia. Chukchi itself is an endangered indigenous language with fewer than 10,000 speakers at present, and most speakers are bilingual, with Russian as their primary language; fewer than 100 speakers do not speak Russian at all. Instances and usages of the language are difficult to come by, and it is not taught in schools. The decreasing use of the language in everyday life, as well as the prominence of Russian within the community, demonstrates the necessity of an automatic speech recognition system, so that we may provide more accessibility to such an endangered and very low-resource language and its community.
2.2 What is a Low-Resource Language?
In the field of NLP, research tends to have a large focus on languages where data and native speakers are easily
accessible, and the language is relatively well-known. These are referred to as high-resource languages, and as such,
produce a large quantity of data. On the other hand, low-resource languages (occasionally referred to as LRLs) are
usually “...less studied, resource scarce, less computerized, less privileged, less commonly taught, or low density...” [8], and therefore are not prioritized in the realm of NLP research. However, this is actually one of the major motivating factors for our project. Chukchi is an incredibly low-resource language; for instance, most of the up-to-date information regarding the language and its speakers is most easily accessed from a detailed article on Wikipedia¹. The low-resourcedness of Chukchi is what inspired this project, as it is an endangered language, and one that is not particularly accessible in terms of media, education, and history. By creating a new automatic speech recognition system, not only can accessibility be provided for this language, but new opportunities are also created for the same achievements in other low-resource languages.
2.2.1 What is an ASR System?
Traditionally, modern automatic speech recognition systems are made up of three different parts: a lexicon, an acoustic model, and a language model². The lexicon contains the information that an ASR system needs to be able to
understand the input it receives on the base level. This includes things such as phonetic transcription codes that are
used for the target language’s phonemes. For English, ARPABET³ and TIMIT⁴ are the most commonly used codes and transcriptions, developed by the Defense Advanced Research Projects Agency (DARPA).
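For illustration, a fragment of such a lexicon can be represented as a simple mapping from words to ARPABET phoneme sequences (the entries below follow CMUdict-style transcriptions; a Chukchi lexicon would need its own phoneme inventory and transcription scheme):

    # Tiny illustrative ARPABET lexicon; real lexicons hold many thousands of entries.
    LEXICON = {
        "speech": ["S", "P", "IY", "CH"],
        "model":  ["M", "AA", "D", "AH", "L"],
    }

    def phonemes(word):
        """Return the phoneme sequence for a word, or None if out of vocabulary."""
        return LEXICON.get(word.lower())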
The second component of an ASR system is the acoustic model, which is responsible for forging the relationships
between the phonemes of a language (such as the ones provided in the lexicon) and an audio signal. This interaction is
supported by the use of transcripts along with their respective audio files⁵; the model is thus supposed to be able to build statistical representations of the feature vector sequences of a particular phoneme (or sound unit) and classify it [11]. This
allows the system to recognize and distinguish this particular sound unit from the rest of the phonemes that it may
encounter in both training data and experimental data.
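As a minimal sketch of the input side of this component, the feature vectors mentioned above are commonly MFCCs or similar spectral features; the snippet below computes them with the librosa library (the file name is a placeholder, and 13 coefficients is just a conventional choice):

    import librosa

    # Load a recording as 16 kHz mono and compute 13 MFCCs per frame.
    audio, sr = librosa.load("chukchi_clip.wav", sr=16000)  # placeholder path
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
    print(mfcc.shape)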
Finally, there is the language model, which provides context and allows the system to see the language in its naturally occurring form. This is where training comes in: by training the language model on text, the system learns which word sequences are likely, so that its hypotheses become more comprehensive and coherent. By design, the system, with the help of all of these components, should then be able to predict speech patterns.
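To make the language model's role concrete, here is a toy add-one-smoothed bigram model; production systems use smoothed higher-order n-grams or neural models, but the job is the same: rank candidate transcriptions by how probable they are as word sequences.

    from collections import Counter

    def train_bigram(sentences):
        """Count unigrams and bigrams over whitespace-tokenized sentences."""
        unigrams, bigrams = Counter(), Counter()
        for sent in sentences:
            words = ["<s>"] + sent.split() + ["</s>"]
            unigrams.update(words)
            bigrams.update(zip(words, words[1:]))
        return unigrams, bigrams

    def sentence_prob(sent, unigrams, bigrams):
        """Probability of a sentence under the bigram model, add-one smoothed."""
        words = ["<s>"] + sent.split() + ["</s>"]
        p = 1.0
        for w1, w2 in zip(words, words[1:]):
            p *= (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(unigrams))
        return p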
¹ Link to the article in question: https://en.wikipedia.org/wiki/Chukchi_language#:~:text=Chukchi%20%2F%CB%88t%CA%83%CA%8Ak,mainly%20in%20Chukotka%20Autonomous%20Okrug.&text=In%20the%20UNESCO%20Red%20Book,the%20list%20of%20endangered%20languages
² Information about the basic components of an automatic speech recognition system is widely available; one of the more easily understandable sources can be found here: https://voximplant.com/blog/what-is-automatic-speech-recognition
³ https://en.wikipedia.org/wiki/ARPABET
⁴ https://en.wikipedia.org/wiki/TIMIT
⁵ Microsoft conducts extensive research regarding acoustic models; more information, as well as links to other sources and publications, can be found here: https://www.microsoft.com/en-us/research/project/acoustic-modeling/
2.3 Previous Research
Regarding previous research focused on low-resource languages and ASR, there have been multiple approaches to
finding the most efficient and effective model for processing such a limited amount of available data. The basic
framework for processing speech was typically based on a few components. For example, these components could include an autoencoder (denoising or otherwise), dual transformation for both text and speech, and bidirectional sequence modeling, typically with a major focus on unsupervised pre-training [10]. In addition to this, many approaches also
included a Transformer-based unified model structure [10], [7]. The goal was to have a system that could sample the language evenly and return feedback to the model, learning as it continued to sample more data [13]. These components
are crucial to the creation of our model, and will be utilized in this project.
Both universities and major corporations alike (e.g., Google with Strope et al., 2011) have also researched the most
effective ways to implement ideal features for training both acoustic models and language models.
In many cases, it is incredibly difficult to create a pre-training environment that is entirely unsupervised, but the key
here is that it is almost unsupervised. The benefit of unsupervised data pre-training is that it makes data much more
usable. Without the need for supervision, the amount of usable data increases significantly, which gives us much more
accessibility to languages that lack a significant amount of data (i.e., being able to utilize data in a more efficient way).
The key to much of the unsupervised training that already exists is the technique implemented: discriminative training, in which a dual unigram and trigram language model was used to estimate relative truth, applied in either an active or a passive setting. Passive learning was the technique and algorithm that dominated the realm of automatic speech recognition for much of its lifespan.
Passive learning was the initial algorithm used for training language models. This meant that a model was trained based
on a single implementation of a set of data, fixed in time. As a result, there was no room allowed for the model to
improve. Additionally, all the data being used was usually transcribed under human supervision, and training a model to
work with language data ended up being a very time-consuming process. By taking the workload off of the researchers and volunteers who transcribe this data and manually check the model, more effective and efficient means of training an ASR system can be developed. Active learning mechanisms are particularly useful in cases like these: with a feedback system, the model learns from its own output and ultimately needs less labeled data. This can prove invaluable in developing an ASR system for Chukchi and other low-resource languages.
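A minimal sketch of the selection step at the heart of such an active-learning loop, assuming the trained model exposes some per-utterance confidence measure (the callable below is a placeholder for it): the least confident utterances are the ones worth sending to a human transcriber.

    def select_for_transcription(model_confidence, unlabeled_pool, budget=50):
        """Pick the `budget` utterances the model is least confident about.

        `model_confidence` is a placeholder callable mapping an utterance to a
        confidence score; after human transcription, the chosen utterances are
        added to the training set and the model is retrained.
        """
        ranked = sorted(unlabeled_pool, key=model_confidence)
        return ranked[:budget]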
3 DATA COLLECTION
Given that Chukchi is a very low-resource language with very few speakers, finding usable data proved difficult, as
was discussed above. Samples of both spoken and written Chukchi were selected from any source that could be found.
This included Charles Weinstein's website on Chukchi, with translations and descriptions in both French and Russian; recent news broadcasts in Chukchi (December 2020 and January 2021) from the Anadyr'-based radio station Radio “Purga”; videos from YouTube; corpora from Chuklang.ru; as well as translated parts of the Bible from Bible.is.
3.1 Radio ‘Purga’
One of the main sources of high-quality annotated data was the radio station Radio “Purga,” which reports news in Chukchi on a regular (almost daily) basis as a special feature of the station. A representative of this radio station provided our research team with a total of 2.53 hours of audio data from 30 episodes of morning news. These audio files then had to be manually split into shorter paired chunks of audio recording and text in order to be used further. Fortunately, each broadcast came with its own script. However, there were a few issues with this data: the actual recording and the script would sometimes differ, and the script also contained several sentences of pure Russian speech, which made parts of the audio files unusable. Both of these issues were solved by excluding such recording-script pairs from the dataset.
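Although the splitting described above was done manually, a first automatic pass is possible; as a hedged sketch, the pydub library can split a long broadcast at silences, with thresholds that would need per-recording tuning (all values and file names below are placeholders):

    from pydub import AudioSegment
    from pydub.silence import split_on_silence

    audio = AudioSegment.from_wav("purga_episode.wav")  # placeholder file name
    chunks = split_on_silence(
        audio,
        min_silence_len=700,   # ms of silence that counts as a pause
        silence_thresh=-40,    # dBFS below which audio is treated as silence
    )
    for i, chunk in enumerate(chunks):
        chunk.export(f"chunk_{i:03d}.wav", format="wav")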
3.2 YouTube Videos
YouTube was another primary resource for finding instances of Chukchi audio samples. All videos found were then converted into WAV format (a conversion sketch follows the list below). This portion of the corpus contains:
• stories;
• Chukchi online language lessons;
• interviews with native speakers;
• lessons from the project “Vetgav. Chukchi lessons”;
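As a sketch of the conversion step mentioned above, downloaded videos can be converted to 16 kHz mono WAV by calling the ffmpeg command-line tool from Python; the file names and sample rate here are illustrative, not a record of the exact settings used.

    import subprocess

    def to_wav(src, dst, sample_rate=16000):
        """Convert any ffmpeg-readable file to mono WAV at the given rate."""
        subprocess.run(
            ["ffmpeg", "-y", "-i", src, "-ar", str(sample_rate), "-ac", "1", dst],
            check=True,
        )

    to_wav("lesson_01.mp4", "lesson_01.wav")  # placeholder names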