PROFICIENCY ASSESSMENT OF L2 SPOKEN ENGLISH USING WAV2VEC 2.0
Stefano Bannò 1,2, Marco Matassoni 1
1 Fondazione Bruno Kessler, Trento, Italy
2 University of Trento, Trento, Italy
ABSTRACT
The increasing demand for learning English as a second language has led to a growing interest in methods for automatically assessing spoken language proficiency. Most approaches use hand-crafted features, but their efficacy relies on their particular underlying assumptions and they risk discarding potentially salient information about proficiency. Other approaches rely on transcriptions produced by ASR systems, which may not provide a faithful rendition of a learner's utterance in specific scenarios (e.g., non-native children's spontaneous speech). Furthermore, transcriptions do not yield any information about relevant aspects such as intonation, rhythm or prosody. In this paper, we investigate the use of wav2vec 2.0 for assessing overall and individual aspects of proficiency on two small datasets, one of which is publicly available. We find that this approach significantly outperforms the BERT-based baseline system trained on ASR and manual transcriptions used for comparison.
Index Terms: automatic assessment of spoken language proficiency, computer assisted language learning
1. INTRODUCTION
With the increasing number of learners of English as a second language (L2) worldwide, there has been a growing demand for automated spoken language assessment systems, both for formal settings, such as internationally recognised language tests, and for practice situations in Computer Assisted Language Learning (CALL). Indeed, compared to human graders, automatic graders can ensure greater consistency and speed at a lower cost, since the recruitment and training of new human experts are expensive and can offer only a small increase in performance [1].
In automatic assessment of L2 spoken language proficiency, sequential input data from a learner is used to predict a grade or a level with respect to holistic proficiency or to specific aspects of proficiency [2, 3]. The input may consist of acoustic features, recognised words, phones and/or time-aligned information, or other information, such as fundamental frequency, extracted directly from the audio or from automatic speech recognition (ASR) transcriptions. Most approaches in the literature extract sets of hand-crafted features related to specific aspects of proficiency, such as fluency [4], pronunciation [5], prosody [6] and text complexity [7], which are then fed into graders to predict analytic scores targeting those specific aspects. A similar approach concatenates multiple hand-crafted features targeting more than one aspect in order to produce overall feature sets, which are then passed through graders to predict holistic grades, as shown in [8, 9, 10, 11]. The effectiveness of hand-crafted features for grading either single aspects or overall proficiency heavily relies on their particular underlying assumptions, and they risk discarding potentially salient information about proficiency. For holistic grading, this issue has been addressed by replacing hand-crafted features with automatically derived features, either through an end-to-end system [12] or in multiple stages [13, 14]. Other studies have employed graders that are trained on holistic grades but whose inputs and topology are adapted to focus on specific aspects of proficiency, such as pronunciation [15], rhythm [16] and text [17, 18]. In these cases, an evident limitation is the lack of information concerning aspects of proficiency that are not included in the input data fed to the grader, although it is possible to combine multiple graders targeting different aspects, as shown in [19]. This limitation is particularly relevant for systems using ASR transcriptions, for two reasons: first, transcriptions contain a certain word error rate (WER) and may not faithfully render the contents of a learner's performance; secondly, although they might preserve some information about pronunciation (e.g., in the ASR confidence scores), they do not yield any information about other relevant aspects of a learner's performance, e.g., intonation, rhythm or prosody. Nevertheless, ASR transcriptions remain a valuable resource for highly specific tasks in CALL applications, such as spoken grammatical error correction and feedback [20].
In this work, to address these issues and limitations, we propose an approach based on self-supervised speech representations using wav2vec 2.0 [21, 22]. Recent studies have demonstrated that self-supervised learning (SSL) is an effective approach for various downstream speech processing tasks, such as ASR, emotion recognition, keyword spotting, speaker diarisation and speaker identification [21, 23]. In these studies, contextual representations were applied by means of pre-trained models. Specifically, it has
been demonstrated that such models are capable of capturing a wide range of speech-related features and linguistic information, such as audio, fluency, suprasegmental pronunciation, and even syntactic and semantic text-based features for L1, L2, read and spontaneous speech [24]. In the field of CALL, SSL has been applied to mispronunciation detection and diagnosis [25, 26, 27] and automatic pronunciation assessment [28], but, to the best of our knowledge, it has not been investigated for the assessment of overall spoken proficiency nor of other specific aspects of proficiency, such as formal correctness, communicative effectiveness, lexical richness and complexity, and relevance.
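As an illustration of the kind of pipeline such representations enable, the following is a minimal sketch in which frame-level wav2vec 2.0 outputs for one spoken answer are mean-pooled into an utterance vector and passed to a hypothetical linear grader. The checkpoint name, the pooling strategy and the grader head are assumptions made for illustration only; the architecture actually used in our experiments is described in Section 3.

# Minimal sketch (assumed checkpoint and pooling; not the exact system of Section 3):
# pool frame-level wav2vec 2.0 features into one utterance vector and grade it.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL_NAME = "facebook/wav2vec2-base"  # assumed pre-trained checkpoint

extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
encoder = Wav2Vec2Model.from_pretrained(MODEL_NAME).eval()

def utterance_embedding(wav_path: str) -> torch.Tensor:
    """Return a mean-pooled wav2vec 2.0 embedding for one spoken answer."""
    waveform, sr = torchaudio.load(wav_path)
    waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)
    inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, frames, hidden_size)
    return hidden.mean(dim=1).squeeze(0)               # (hidden_size,) utterance vector

# A hypothetical grader head: a single linear layer mapping the embedding to a score.
grader = torch.nn.Linear(encoder.config.hidden_size, 1)
score = grader(utterance_embedding("answer.wav"))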
In this work, we first test the effectiveness of wav2vec 2.0 in predicting the holistic proficiency level of L2 English learners' answers included in a publicly available dataset. Subsequently, we do the same on a non-publicly available learner corpus, which also contains annotations related to single aspects of proficiency (i.e., pronunciation, fluency, formal correctness, communicative effectiveness, relevance, and lexical richness and complexity) that we try to predict with specific graders. The baseline system used for comparison is a BERT [29] model that takes transcriptions as input. We use only manual transcriptions for our experiments on the publicly available dataset, whereas we use both manual and ASR transcriptions for our experiments on the second dataset. In particular, the manual transcriptions also contain hesitations and fragments of words, which serve as proxies for pronunciation and fluency.
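For comparison, the following is a minimal sketch of a transcription-based grader in the spirit of this baseline: a BERT model with a single regression output read as a holistic grade. The checkpoint, the truncation settings and the example transcription are illustrative assumptions, not the exact baseline configuration.

# Minimal sketch of a transcription-based baseline (assumed checkpoint and input):
# a BERT model with one regression output used as a holistic grade.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1  # single output treated as a score
)

# A hypothetical manual transcription, keeping hesitations and word fragments,
# which act as weak proxies for fluency and pronunciation.
transcription = "erm I think having a part-time job is er import- important for students"
batch = tokenizer(transcription, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    score = model(**batch).logits.squeeze().item()  # untrained here, so arbitrary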
Another aspect that should be highlighted is that we conduct our experiments using a small quantity of training data and still manage to achieve promising results on both datasets.
Section 2 describes the data used in our experiments. Section 3 illustrates the model architectures. Section 4 shows the results of our experiments. Finally, in Section 5, we analyse and discuss the results and reflect upon next steps.
2. DATA
2.1. ICNALE
In order to test our approach, we consider the International Corpus Network of Asian Learners of English (ICNALE) [30], a publicly available dataset (http://language.sakura.ne.jp/icnale/download.html) comprising written and spoken responses of English learners ranging from A2 to B2 of the Common European Framework of Reference (CEFR) for languages [31], as well as of native speakers. The L1s of the speakers are not indicated, but they may be inferred from their countries of origin: China, Hong Kong, Indonesia, Japan, South Korea, Pakistan, the Philippines, Singapore, Thailand, and Taiwan. The CEFR levels were assigned prior to collecting the data: the ICNALE team required all the learners to take an L2 vocabulary size test and to present their scores in English proficiency tests such as TOEFL, TOEIC, IELTS, etc. On the basis of these two scores, the learners were classified into proficiency levels. The spoken section of the corpus consists of two parts: one containing monologues and the other containing dialogues. For our experiments, we only considered the monologues, i.e., 4332 answers lasting between 36 and 69 seconds in which learners are required to express their opinion about two statements, one on the importance of having a part-time job and one on smoking in restaurants. The available metadata includes manual transcriptions of the learners' answers, personal information about the learners' education history, and their assigned CEFR levels. We divide the data into a training set of 3898 answers and a development set and a test set of 217 answers each. For the experiments on this dataset, proficiency assessment is treated as a classification task with five classes: A2, B1_1, B1_2, B2, and native speakers (see Table 1). To the best of our knowledge, the ICNALE spoken monologues have only been used in [32], but in that study the answers to the two statements are considered and evaluated independently, so no comparison is possible. The experiments described in [33], instead, only include a section of essays and spoken dialogues.
         Train   Dev   Test   Total
A2         299    16     17     332
B1_1       792    44     44     880
B1_2      1681    94     93    1868
B2         586    33     33     652
native     540    30     30     600
Total     3898   217    217    4332

Table 1: Number of answers for each CEFR proficiency level in ICNALE.
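A minimal sketch of this five-class setup, assuming a generic pre-trained checkpoint and the off-the-shelf classification head from the transformers library (the architecture actually used is described in Section 3), could look as follows:

# Minimal sketch of the five-class ICNALE setup of Table 1 (assumed checkpoint
# and off-the-shelf classification head; not the exact system of Section 3).
from transformers import Wav2Vec2ForSequenceClassification

LEVELS = ["A2", "B1_1", "B1_2", "B2", "native"]
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base",                        # assumed checkpoint
    num_labels=len(LEVELS),
    id2label=dict(enumerate(LEVELS)),
    label2id={level: i for i, level in enumerate(LEVELS)},
)
# The 3898-answer training set would be used for fine-tuning, the 217-answer
# development set for model selection, and the remaining 217 answers for the
# final evaluation.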
2.2. TLT-school
In Trentino, an autonomous region in northern Italy, the linguistic competence of Italian students has been assessed in recent years through proficiency tests in both English and German [34], involving about 3000 students ranging from 9 to 16 years old, belonging to four different school grades (5th, 8th, 10th, 11th) and three CEFR proficiency levels (A1, A2, B1). Since our experiments are conducted only on the B1 section of the English spoken part of the corpus, we do not describe the German section, as its analysis goes beyond the scope of this paper.
After eliminating the answers containing only silence or non-speech background, the spoken section is composed of 494 responses to 7 small-talk questions about everyday life situations. It is worth mentioning that some answers are characterised by a number of issues (e.g., words belonging to multiple languages or off-topic answers). We decided not to eliminate these answers from the