AUTOMATIC SPEECH RECOGNITION OF LOW-RESOURCE
LANGUAGES BASED ON CHUKCHI
Cydnie Davenport
Linguistic Theory and Language Description Masters program
Higher School of Economics
Moscow, Russia
davenport.cyd@gmail.com
Emil Nadimanov
Computational Linguistics Masters program
Higher School of Economics
Moscow, Russia
nadimaemi@gmail.com
Anastasia Safonova
Computational Linguistics Masters program
Higher School of Economics
Moscow, Russia
an.saphonova@gmail.com
Tatiana Yudina
Computational Linguistics Masters program
Higher School of Economics
Moscow, Russia
yudina.tatiana22@gmail.com
October 13, 2022
1 Introduction
The following paper presents a project focused on the research and creation of a new Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) system based on the Chukchi language. The aim is to develop a system that makes the language more accessible to speakers of Chukchi, enabling applications such as annotating subtitles on videos and movies, providing more accessible data for research and analysis, or the creation of chat-bots for online users. This system should consist of: an acoustic model, which receives a fragment of an audio signal and outputs the probability of the various phonemes in that fragment; a language model, which determines which candidate transcriptions are more or less likely; and a decoder, which selects the most likely prediction. Predictive automatic speech recognition models already exist and are a popular focus in the realm of Natural Language Processing; however, the most challenging cases are low-resource languages, due to extreme data deficits.
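To make the division of labor between these three components concrete, the following toy sketch (ours, not a description of any existing implementation) shows how a decoder might combine acoustic-model and language-model scores; the scoring callables and the weight are illustrative placeholders.

    import math

    def decode(candidates, acoustic_score, lm_score, lm_weight=0.8):
        """Pick the candidate transcription with the best combined log-score.

        `acoustic_score` and `lm_score` are placeholder callables standing in
        for the acoustic and language models; `lm_weight` balances the two.
        """
        best, best_score = None, -math.inf
        for cand in candidates:
            score = acoustic_score(cand) + lm_weight * lm_score(cand)
            if score > best_score:
                best, best_score = cand, score
        return best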
This project is centered around a multi-step research process. Initially, we began by analyzing the Chukchi language from a linguistic perspective, but to clarify the motivations for building this system, it must also be looked at from a cultural and sociolinguistic perspective: what is known about the language, whether there are cultural influences within it, why such a system is necessary, and so on. Once there was an extensive understanding of the subject at hand, the next step was finding usable data. For this project, this included broadcasts from a Russia-based Chukchi radio station, videos and lessons from YouTube, written translations of the Bible, and the Higher School of Economics' set of Chukchi corpora known as Chuklang. Once enough data is collected, there then comes the task of cleaning it. This included labeling and segmenting audio data for training, cleaning and filtering out unnecessary symbols (mainly Russian) from text, and determining which data would be used for pre-training and which would be used for testing the resultant model. Once enough data has been collected and cleaned, we sample and train various models to understand how they process the data. Additionally, we try various encoders to understand how well they filter out noise and acoustic artifacts. Further research must be conducted in order to compare models designed for both high- and low-resource languages. Various designs and tools for training ASR models include VQ-VAE, XLSR, the toolkit Kaldi, wav2vec, and more. The intended result of this project is an automatic speech recognition system that works seamlessly with Chukchi and has the potential to be applied to other low-resource languages.
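As one concrete starting point among the tools named above, a pretrained multilingual wav2vec 2.0 (XLSR) checkpoint can be loaded through the HuggingFace transformers library; the sketch below only extracts contextual representations, since a Chukchi vocabulary and CTC head would still have to be added for fine-tuning. The checkpoint name is the public facebook/wav2vec2-large-xlsr-53 model; the audio input is a stand-in.

    import torch
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

    # Pretrained on 53 languages without supervision; no Chukchi-specific head yet.
    name = "facebook/wav2vec2-large-xlsr-53"
    extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
    model = Wav2Vec2Model.from_pretrained(name)

    # One second of 16 kHz mono audio as a stand-in for a real recording.
    audio = torch.zeros(16000).numpy()
    inputs = extractor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        features = model(**inputs).last_hidden_state  # shape: (1, frames, 1024)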
2 BACKGROUND
2.1 The Chukchi Language
The Chukotko-Kamchatkan family of languages is generally held to contain two branches. The northern branch is
referred to as the Chukotian branch (or “Luorevetlan”, based on the Chukchi ethnonym) and consists of Chukchi,
Koryak, Alutor and Kerek (now extinct). The second branch is known as Itelmen, and contains the language Western
Itelmen, which itself consists of two dialects: Khajrjusovo and Sedanka [5]. The language of focus for this paper is
Chukchi, a polysynthetic language spoken primarily within the Chukotka Autonomous Okrug, which is located in the
easternmost part of Siberia. Chukchi itself is an endangered indigenous language with fewer than 10,000 speakers at present, and most speakers are bilingual, with Russian as their primary language; fewer than 100 speakers do not speak Russian at all. Instances and usages of the language are difficult to come by, and it is not taught in schools. The decreasing use of the language in everyday life, as well as the prominence of Russian within the community, demonstrates the necessity of an automatic speech recognition system, so that we may provide more accessibility to such an endangered and very low-resource language and its community.
2.2 What is a Low-Resource Language?
In the field of NLP, research tends to have a large focus on languages where data and native speakers are easily
accessible, and the language is relatively well-known. These are referred to as high-resource languages, and as such,
produce a large quantity of data. On the other hand, low-resource languages (occasionally referred to as LRLs) are
usually “...less studied, resource scarce, less computerized, less privileged, less commonly taught, or low density...” [8], and therefore are not prioritized in the realm of NLP research. However, this is actually one of the major motivating factors for our project. Chukchi is an incredibly low-resource language; for instance, most of the up-to-date information regarding the language and its speakers is most easily accessed from a detailed article on Wikipedia¹. The low-resourcedness of Chukchi is what inspired this project, as it is an endangered language, and one that is not particularly accessible in terms of media, education, and history. By creating a new automatic speech recognition system, not only can accessibility be provided for this language, but new opportunities are also created for the same achievements in other low-resource languages.
2.2.1 What is an ASR System?
Traditionally, modern automatic speech recognition systems are made up of three different parts: a lexicon, an acoustic model, and a language model². The lexicon contains the information that an ASR system needs to be able to
understand the input it receives on the base level. This includes things such as phonetic transcription codes that are
used for the target language’s phonemes. For English, ARPABET³ and TIMIT⁴ are the most commonly used codes and transcriptions, developed by the Defense Advanced Research Projects Agency (DARPA).
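For illustration, a fragment of such a lexicon can be represented as a simple mapping from words to ARPABET phoneme sequences (the entries below follow CMUdict-style transcriptions; a Chukchi lexicon would need its own phoneme inventory and transcription scheme):

    # Tiny illustrative ARPABET lexicon; real lexicons hold many thousands of entries.
    LEXICON = {
        "speech": ["S", "P", "IY", "CH"],
        "model":  ["M", "AA", "D", "AH", "L"],
    }

    def phonemes(word):
        """Return the phoneme sequence for a word, or None if out of vocabulary."""
        return LEXICON.get(word.lower())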
The second component of an ASR system is the acoustic model, which is responsible for forging the relationships
between the phonemes of a language (such as the ones provided in the lexicon) and an audio signal. This interaction is
supported by the use of transcripts along with their respective audio files⁵; the model is thus supposed to be able to build statistical representations of the feature vector sequences of a particular phoneme (or sound unit) and classify it [11]. This
allows the system to recognize and distinguish this particular sound unit from the rest of the phonemes that it may
encounter in both training data and experimental data.
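As a minimal sketch of the input side of this component, the feature vectors mentioned above are commonly MFCCs or similar spectral features; the snippet below computes them with the librosa library (the file name is a placeholder, and 13 coefficients is just a conventional choice):

    import librosa

    # Load a recording as 16 kHz mono and compute 13 MFCCs per frame.
    audio, sr = librosa.load("chukchi_clip.wav", sr=16000)  # placeholder path
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
    print(mfcc.shape)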
Finally, there is the language model, which provides context and allows the system to see the language in its naturally occurring form. This is where training comes in: by training the language model on text, the system learns which word sequences are likely, so that its hypotheses become more comprehensive and coherent. By design, the system, with the help of all of these components, should then be able to predict speech patterns.
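To make the language model's role concrete, here is a toy add-one-smoothed bigram model; production systems use smoothed higher-order n-grams or neural models, but the job is the same: rank candidate transcriptions by how probable they are as word sequences.

    from collections import Counter

    def train_bigram(sentences):
        """Count unigrams and bigrams over whitespace-tokenized sentences."""
        unigrams, bigrams = Counter(), Counter()
        for sent in sentences:
            words = ["<s>"] + sent.split() + ["</s>"]
            unigrams.update(words)
            bigrams.update(zip(words, words[1:]))
        return unigrams, bigrams

    def sentence_prob(sent, unigrams, bigrams):
        """Probability of a sentence under the bigram model, add-one smoothed."""
        words = ["<s>"] + sent.split() + ["</s>"]
        p = 1.0
        for w1, w2 in zip(words, words[1:]):
            p *= (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(unigrams))
        return p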
¹ Link to the article in question: https://en.wikipedia.org/wiki/Chukchi_language#:~:text=Chukchi%20%2F%CB%88t%CA%83%CA%8Ak,mainly%20in%20Chukotka%20Autonomous%20Okrug.&text=In%20the%20UNESCO%20Red%20Book,the%20list%20of%20endangered%20languages
² Information about the basic components of an automatic speech recognition system is widely available; one of the more easily understandable sources can be found here: https://voximplant.com/blog/what-is-automatic-speech-recognition
³ https://en.wikipedia.org/wiki/ARPABET
⁴ https://en.wikipedia.org/wiki/TIMIT
⁵ Microsoft conducts extensive research regarding acoustic models; more information, as well as links to other sources and publications, can be found here: https://www.microsoft.com/en-us/research/project/acoustic-modeling/
2.3 Previous Research
Regarding previous research focused on low-resource languages and ASR, there have been multiple approaches to
finding the most efficient and effective model for processing such a limited amount of available data. The basic
framework for processing speech was typically based on a few components. For example, these components could include an autoencoder (denoising or otherwise), dual transformation for both text and speech, and bidirectional sequence modeling, typically with a major focus on unsupervised pre-training [10]. In addition to this, many approaches also
included a Transformer-based unified model structure [10], [7]. The goal was to have a system that could sample the language evenly and return feedback to the model, learning as it continued to sample more data [13]. These components
are crucial to the creation of our model, and will be utilized in this project.
Both universities and major corporations alike (e.g., Google with Strope et al., 2011) have also researched the most
effective ways to implement ideal features for training both acoustic models and language models.
In many cases, it is incredibly difficult to create a pre-training environment that is entirely unsupervised, but the key
here is that it is almost unsupervised. The benefit of unsupervised data pre-training is that it makes data much more
usable. Without the need for supervision, the amount of usable data increases significantly, which gives us much more
accessibility to languages that lack a significant amount of data (i.e., being able to utilize data in a more efficient way).
The key to much of the unsupervised training that already exists is the technique implemented: discriminative training, in which a dual unigram and trigram language model was used to estimate relative truth, applied in either an active or a passive setting. Passive learning was the technique and algorithm that dominated the realm of automatic speech recognition for much of its lifespan.
Passive learning was the initial algorithm used for training language models. This meant that a model was trained based
on a single implementation of a set of data, fixed in time. As a result, there was no room allowed for the model to
improve. Additionally, all the data being used was usually transcribed under human supervision, and training a model to
work with language data ended up being a very time-consuming process. By taking the workload off of the researchers and volunteers who transcribe this data and manually check the model, more effective and efficient means of training an ASR system can be developed. Active learning mechanisms are particularly useful in cases like these: with a feedback system, the model learns from its own output and ultimately needs less labeled data. This can prove invaluable in developing an ASR system for Chukchi and other low-resource languages.
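A minimal sketch of the selection step at the heart of such an active-learning loop, assuming the trained model exposes some per-utterance confidence measure (the callable below is a placeholder for it): the least confident utterances are the ones worth sending to a human transcriber.

    def select_for_transcription(model_confidence, unlabeled_pool, budget=50):
        """Pick the `budget` utterances the model is least confident about.

        `model_confidence` is a placeholder callable mapping an utterance to a
        confidence score; after human transcription, the chosen utterances are
        added to the training set and the model is retrained.
        """
        ranked = sorted(unlabeled_pool, key=model_confidence)
        return ranked[:budget]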
3 DATA COLLECTION
Given that Chukchi is a very low-resource language with very few speakers, finding usable data proved difficult, as
was discussed above. Samples of both spoken and written Chukchi were selected from any source that could be found.
This included Charles Weinstein's website on Chukchi, with translations and descriptions in both French and Russian; recent news broadcasts in Chukchi (December 2020 and January 2021) from the Anadyr'-based radio station Radio “Purga”; videos from YouTube; corpora from Chuklang.ru; as well as translated parts of the Bible from Bible.is.
3.1 Radio ‘Purga’
One of the main sources of high-quality annotated data was the radio station Radio “Purga,” which reports news in Chukchi on a regular (almost daily) basis as a special feature of the station. A representative of this radio station provided our research team with a total of 2.53 hours of audio data from 30 episodes of morning news. These audio files then had to be manually split into shorter paired chunks of audio recording and text in order to be used further. Fortunately, each broadcast came with its own script. However, there were a few issues with this data: the actual recording and the script would sometimes differ, and the script also contained several sentences of pure Russian speech, which made parts of the audio files unusable. Both of these issues were solved by excluding such recording-script pairs from the dataset.
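Although the splitting described above was done manually, a first automatic pass is possible; as a hedged sketch, the pydub library can split a long broadcast at silences, with thresholds that would need per-recording tuning (all values and file names below are placeholders):

    from pydub import AudioSegment
    from pydub.silence import split_on_silence

    audio = AudioSegment.from_wav("purga_episode.wav")  # placeholder file name
    chunks = split_on_silence(
        audio,
        min_silence_len=700,   # ms of silence that counts as a pause
        silence_thresh=-40,    # dBFS below which audio is treated as silence
    )
    for i, chunk in enumerate(chunks):
        chunk.export(f"chunk_{i:03d}.wav", format="wav")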
3.2 YouTube Videos
YouTube was another primary resource for finding instances of Chukchi audio samples. All videos found were then converted into WAV format (a conversion sketch follows the list below). This portion of the corpus contains:
• stories;
• Chukchi online language lessons;
• interviews with native speakers;
• lessons from the project “Vetgav. Chukchi lessons”;
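As a sketch of the conversion step mentioned above, downloaded videos can be converted to 16 kHz mono WAV by calling the ffmpeg command-line tool from Python; the file names and sample rate here are illustrative, not a record of the exact settings used.

    import subprocess

    def to_wav(src, dst, sample_rate=16000):
        """Convert any ffmpeg-readable file to mono WAV at the given rate."""
        subprocess.run(
            ["ffmpeg", "-y", "-i", src, "-ar", str(sample_rate), "-ac", "1", dst],
            check=True,
        )

    to_wav("lesson_01.mp4", "lesson_01.wav")  # placeholder names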