PROFICIENCY ASSESSMENT OF L2 SPOKEN ENGLISH USING WAV2VEC 2.0
Stefano Bannò 1,2, Marco Matassoni 1
1 Fondazione Bruno Kessler, Trento, Italy
2 University of Trento, Trento, Italy
ABSTRACT
The increasing demand for learning English as a second language has led to a growing interest in methods for automatically assessing spoken language proficiency. Most approaches use hand-crafted features, but their efficacy relies on their particular underlying assumptions and they risk discarding potentially salient information about proficiency. Other approaches rely on transcriptions produced by ASR systems, which may not provide a faithful rendition of a learner's utterance in specific scenarios (e.g., non-native children's spontaneous speech). Furthermore, transcriptions do not yield any information about relevant aspects such as intonation, rhythm or prosody. In this paper, we investigate the use of wav2vec 2.0 for assessing overall and individual aspects of proficiency on two small datasets, one of which is publicly available. We find that this approach significantly outperforms the BERT-based baseline system trained on ASR and manual transcriptions used for comparison.
Index Terms: automatic assessment of spoken language proficiency, computer assisted language learning
1. INTRODUCTION
With the increasing number of learners of English as a second language (L2) worldwide, there has been a growing demand for automated spoken language assessment systems, both for formal settings, such as internationally recognised language tests, and for practice situations in Computer Assisted Language Learning (CALL). Indeed, compared to human graders, automatic graders can ensure greater consistency and speed at a lower cost, since the recruitment and training of new human experts are expensive and can offer only a small increase in performance [1].
In automatic assessment of L2 spoken language proficiency, sequential input data from a learner is used to predict a grade or a level with respect to holistic proficiency or to specific aspects of proficiency [2, 3]. The input may consist of acoustic features, recognised words, phones and/or time-aligned information, or other information, such as fundamental frequency, extracted directly from the audio or from automatic speech recognition (ASR) transcriptions. Most approaches in the literature extract sets of hand-crafted features related to specific aspects of proficiency, such as fluency [4], pronunciation [5], prosody [6] and text complexity [7], which are then fed into graders to predict analytic scores targeting those specific aspects. A similar approach concatenates multiple hand-crafted features targeting more than one aspect in order to produce overall feature sets, which are then passed through graders to predict holistic grades, as shown in [8, 9, 10, 11]. The effectiveness of hand-crafted features for grading either single aspects or overall proficiency heavily relies on their particular underlying assumptions, and they risk discarding potentially salient information about proficiency. For holistic grading, this issue has been addressed by replacing hand-crafted features with automatically derived features, either through an end-to-end system [12] or in multiple stages [13, 14]. Other studies have employed graders that are trained on holistic grades but whose inputs and topology are adapted to focus on specific aspects of proficiency, such as pronunciation [15], rhythm [16] and text [17, 18]. In these cases, an evident limitation is the lack of information concerning aspects of proficiency that are not included in the input data fed to the grader, although it is possible to combine multiple graders targeting different aspects, as shown in [19]. This limitation is particularly relevant for systems using ASR transcriptions, for two reasons: first, transcriptions contain a certain word error rate (WER) and may not faithfully render the contents of a learner's performance; secondly, although they might preserve some information about pronunciation (e.g., in the ASR confidence scores), they do not yield any information about other relevant aspects of a learner's performance, e.g., intonation, rhythm or prosody. Nevertheless, ASR transcriptions remain a valuable resource for highly specific tasks in CALL applications, such as spoken grammatical error correction and feedback [20].
In this work, to address these issues and limitations, we propose an approach based on self-supervised speech representations using wav2vec 2.0 [21, 22]. Recent studies have demonstrated that self-supervised learning (SSL) is an effective approach for various downstream speech processing tasks, such as ASR, emotion recognition, keyword spotting, speaker diarisation and speaker identification [21, 23]. In these studies, contextual representations were applied by means of pre-trained models. Specifically, it has
been demonstrated that such models are capable of capturing a wide range of speech-related features and linguistic information, such as audio, fluency, suprasegmental pronunciation, and even syntactic and semantic text-based features for L1, L2, read and spontaneous speech [24]. In the field of CALL, SSL has been applied to mispronunciation detection and diagnosis [25, 26, 27] and automatic pronunciation assessment [28], but, to the best of our knowledge, it has not been investigated for the assessment of overall spoken proficiency nor of other specific aspects of proficiency, such as formal correctness, communicative effectiveness, lexical richness and complexity, and relevance.
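As an illustration of the kind of pipeline such representations enable, the following is a minimal sketch in which frame-level wav2vec 2.0 outputs for one spoken answer are mean-pooled into an utterance vector and passed to a hypothetical linear grader. The checkpoint name, the pooling strategy and the grader head are assumptions made for illustration only; the architecture actually used in our experiments is described in Section 3.

# Minimal sketch (assumed checkpoint and pooling; not the exact system of Section 3):
# pool frame-level wav2vec 2.0 features into one utterance vector and grade it.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL_NAME = "facebook/wav2vec2-base"  # assumed pre-trained checkpoint

extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
encoder = Wav2Vec2Model.from_pretrained(MODEL_NAME).eval()

def utterance_embedding(wav_path: str) -> torch.Tensor:
    """Return a mean-pooled wav2vec 2.0 embedding for one spoken answer."""
    waveform, sr = torchaudio.load(wav_path)
    waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)
    inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, frames, hidden_size)
    return hidden.mean(dim=1).squeeze(0)               # (hidden_size,) utterance vector

# A hypothetical grader head: a single linear layer mapping the embedding to a score.
grader = torch.nn.Linear(encoder.config.hidden_size, 1)
score = grader(utterance_embedding("answer.wav"))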
In this work, we first test the effectiveness of wav2vec 2.0 in predicting the holistic proficiency level of L2 English learners' answers included in a publicly available dataset. Subsequently, we do the same on a non-publicly available learner corpus, which also contains annotations related to single aspects of proficiency (i.e., pronunciation, fluency, formal correctness, communicative effectiveness, relevance, and lexical richness and complexity) that we try to predict with specific graders. The baseline system used for comparison is a BERT [29] model that takes transcriptions as input. We use only manual transcriptions for our experiments on the publicly available dataset, whereas we use both manual and ASR transcriptions for our experiments on the second dataset. In particular, the manual transcriptions also contain hesitations and fragments of words, which serve as proxies for pronunciation and fluency.
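For comparison, the following is a minimal sketch of a transcription-based grader in the spirit of this baseline: a BERT model with a single regression output read as a holistic grade. The checkpoint, the truncation settings and the example transcription are illustrative assumptions, not the exact baseline configuration.

# Minimal sketch of a transcription-based baseline (assumed checkpoint and input):
# a BERT model with one regression output used as a holistic grade.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1  # single output treated as a score
)

# A hypothetical manual transcription, keeping hesitations and word fragments,
# which act as weak proxies for fluency and pronunciation.
transcription = "erm I think having a part-time job is er import- important for students"
batch = tokenizer(transcription, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    score = model(**batch).logits.squeeze().item()  # untrained here, so arbitrary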
Another aspect that should be highlighted is that we conduct our experiments using a small quantity of training data and still manage to achieve promising results on both datasets.
Section 2 describes the data used in our experiments. Section 3 illustrates the model architectures. Section 4 shows the results of our experiments. Finally, in Section 5, we analyse and discuss the results and reflect upon next steps.
2. DATA
2.1. ICNALE
In order to test our approach, we consider the International Corpus Network of Asian Learners of English (ICNALE) [30], a publicly available dataset (http://language.sakura.ne.jp/icnale/download.html) comprising written and spoken responses of English learners ranging from A2 to B2 of the Common European Framework of Reference (CEFR) for languages [31], as well as of native speakers. The L1s of the speakers are not indicated, but they may be inferred from their countries of origin: China, Hong Kong, Indonesia, Japan, South Korea, Pakistan, the Philippines, Singapore, Thailand, and Taiwan. The CEFR levels were assigned prior to collecting the data: the ICNALE team required all the learners to take an L2 vocabulary size test and to present their scores in English proficiency tests such as TOEFL, TOEIC, IELTS, etc. On the basis of these two scores, the learners were classified into proficiency levels. The spoken section of the corpus consists of two parts: one containing monologues and the other containing dialogues. For our experiments, we only considered the monologues, i.e., 4332 answers lasting between 36 and 69 seconds in which learners are required to express their opinion about two statements, one on the importance of having a part-time job and one on smoking in restaurants. The available metadata includes manual transcriptions of the learners' answers, personal information about the learners' education history, and their assigned CEFR levels. We divide the data into a training set of 3898 answers and a development set and a test set of 217 answers each. For the experiments on this dataset, proficiency assessment is treated as a classification task with five classes: A2, B1_1, B1_2, B2, and native speakers (see Table 1). To the best of our knowledge, the ICNALE spoken monologues have only been used in [32], but in that study the answers to the two statements are considered and evaluated independently, so no comparison is possible. The experiments described in [33], instead, only include a section of essays and spoken dialogues.
         Train   Dev   Test   Total
A2         299    16     17     332
B1_1       792    44     44     880
B1_2      1681    94     93    1868
B2         586    33     33     652
native     540    30     30     600
Total     3898   217    217    4332

Table 1: Number of answers for each CEFR proficiency level in ICNALE.
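A minimal sketch of this five-class setup, assuming a generic pre-trained checkpoint and the off-the-shelf classification head from the transformers library (the architecture actually used is described in Section 3), could look as follows:

# Minimal sketch of the five-class ICNALE setup of Table 1 (assumed checkpoint
# and off-the-shelf classification head; not the exact system of Section 3).
from transformers import Wav2Vec2ForSequenceClassification

LEVELS = ["A2", "B1_1", "B1_2", "B2", "native"]
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base",                        # assumed checkpoint
    num_labels=len(LEVELS),
    id2label=dict(enumerate(LEVELS)),
    label2id={level: i for i, level in enumerate(LEVELS)},
)
# The 3898-answer training set would be used for fine-tuning, the 217-answer
# development set for model selection, and the remaining 217 answers for the
# final evaluation.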
2.2. TLT-school
In Trentino, an autonomous region in northern Italy, the linguistic competence of Italian students has been assessed in recent years through proficiency tests in both English and German [34], involving about 3000 students ranging from 9 to 16 years old, belonging to four different school grades (5th, 8th, 10th, 11th) and three CEFR proficiency levels (A1, A2, B1). Since our experiments are conducted only on the B1 section of the English spoken part of the corpus, we do not describe the German section, as its analysis goes beyond the scope of this paper.
After eliminating the answers containing only silence or non-speech background, the spoken section is composed of 494 responses to 7 small-talk questions about everyday life situations. It is worth mentioning that some answers are characterised by a number of issues (e.g., words belonging to multiple languages or off-topic answers). We decided not to eliminate these answers from the