
been demonstrated that such models are capable of captur-
ing a wide range of speech-related features and linguistic in-
formation, such as audio, fluency, suprasegmental pronuncia-
tion, and even syntactic and semantic text-based features for
L1, L2, read and spontaneous speech [24]. In the field of
CALL, SSL has been applied to mispronunciation detection
and diagnosis [25, 26, 27] and automatic pronunciation as-
sessment [28], but, to the best of our knowledge, it has not
been investigated for the assessment of overall spoken pro-
ficiency nor of other specific aspects of proficiency, such as
formal correctness, communicative effectiveness, lexical rich-
ness and complexity, and relevance.
In this work, we first test the effectiveness of wav2vec
2.0 in predicting the holistic proficiency level of L2 English
learners’ answers included in a publicly available dataset.
Subsequently, we do the same on a non-publicly available
learner corpus, which also contains annotations related to
single aspects of proficiency (i.e., pronunciation, fluency,
formal correctness, communicative effectiveness, relevance
and lexical richness and complexity) that we try to predict
with specific graders. The baseline system used for compar-
ison is a BERT [29] model that takes transcriptions as input.
We use only manual transcriptions for our experiments on
the publicly available dataset, whereas we use both manual
and ASR transcriptions for our experiments on the second
dataset. In particular, the manual transcriptions also contain
hesitations and fragments of words, which serve as proxies
for pronunciation and fluency.
Notably, we conduct our experiments using a small quantity of
training data and still achieve promising results on both
datasets.
Section 2 describes the data used in our experiments. Sec-
tion 3 illustrates the model architectures. Section 4 shows the
results of our experiments. Finally, in Section 5, we analyse
and discuss the results and reflect upon next steps.
2. DATA
2.1. ICNALE
In order to test our approach, we consider the International
Corpus Network of Asian Learners of English (ICNALE) [30], a
publicly available dataset
(http://language.sakura.ne.jp/icnale/download.html)
comprising written and spoken responses of English learners
ranging from A2 to B2 of the Common European Framework of
Reference (CEFR) for languages [31] and, in part, of native
speakers. The L1s of the speakers are not indicated, but they
may be inferred from their countries of origin: China, Hong
Kong, Indonesia, Japan, South Korea, Pakistan, Philippines,
Singapore, Thailand, and Taiwan. The CEFR levels were assigned
prior to collecting the data, as the ICNALE team required all
the learners to take an L2 vocabulary size test and to present
their scores in English proficiency tests such as TOEFL,
TOEIC, IELTS, etc. On the basis of these two scores, the
learners were classified into proficiency levels. The spoken
section of the corpus consists of two parts: one containing
monologues and the other containing dialogues. For our ex-
periments, we only considered the monologues, i.e., 4332
answers lasting between 36 and 69 seconds in which learners
are required to express their opinion about two statements
on the importance of having a part-time job and on smok-
ing in restaurants. The available metadata includes manual
transcriptions of the learners’ answers, personal information
about learners’ education history, and their assigned CEFR
levels. We divide the data into a training set of 3898 answers,
a development set and a test set with 217 answers each. For
the experiments on this dataset, proficiency assessment is
treated as a classification task with five classes: A2, B1_1,
B1_2, B2, and native speakers (see Table 1). To the best of
our knowledge, the ICNALE spoken monologues have only
been used in [32], but in this study, the answers to the two
statements are considered and evaluated independently, so
no comparison is possible. The experiments described in
[33], instead, only include a section of essays and spoken
dialogues.
          Train   Dev   Test   Total
A2          299    16     17     332
B1_1        792    44     44     880
B1_2       1681    94     93    1868
B2          586    33     33     652
native      540    30     30     600
Total      3898   217    217    4332

Table 1: Number of answers for each CEFR proficiency level
in ICNALE.
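As a sanity check on the split reported in Table 1, the following minimal Python sketch recomputes the per-split totals and the share of each split. The label-to-index mapping and the majority-class reference accuracy are illustrative assumptions added here; neither is taken from the experiments described in this paper.

```python
# Answer counts per class and split, copied from Table 1.
COUNTS = {
    "A2":     {"train": 299,  "dev": 16, "test": 17},
    "B1_1":   {"train": 792,  "dev": 44, "test": 44},
    "B1_2":   {"train": 1681, "dev": 94, "test": 93},
    "B2":     {"train": 586,  "dev": 33, "test": 33},
    "native": {"train": 540,  "dev": 30, "test": 30},
}

# Hypothetical integer encoding for the five-class task (an assumption,
# not the encoding used by the authors).
LABEL2ID = {label: i for i, label in enumerate(COUNTS)}


def split_totals(counts):
    """Sum the number of answers in each split across all classes."""
    totals = {}
    for per_split in counts.values():
        for split, n in per_split.items():
            totals[split] = totals.get(split, 0) + n
    return totals


def majority_baseline(counts, split):
    """Accuracy obtained by always predicting the most frequent training class.

    This is only an illustrative reference point, not a baseline
    reported in the paper.
    """
    majority = max(counts, key=lambda c: counts[c]["train"])
    total = sum(per_split[split] for per_split in counts.values())
    return majority, counts[majority][split] / total


totals = split_totals(COUNTS)
grand_total = sum(totals.values())
fractions = {split: n / grand_total for split, n in totals.items()}

if __name__ == "__main__":
    print(totals, grand_total)
    print({split: round(f, 3) for split, f in fractions.items()})
    print(majority_baseline(COUNTS, "test"))
```

With the counts above, the split is roughly 90% training, 5% development and 5% test, and the class distribution is clearly imbalanced, which is worth keeping in mind when reading accuracy figures on this dataset.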
2.2. TLT-school
In Trentino, an autonomous region in northern Italy, the
linguistic competence of Italian students has been assessed in
recent years through proficiency tests in both English and
German [34], involving about 3000 students ranging from 9 to 16
years old, belonging to four different school grades (5th, 8th,
10th, 11th) and three CEFR proficiency levels (A1, A2, B1).
Since our experiments are conducted only on the B1 section
of the English spoken part of the corpus, we do not describe
the German section, as its analysis goes beyond the scope of
this paper.
After eliminating the answers containing only silence or
non-speech background, the spoken section is composed of
494 responses to 7 small talk questions about everyday life
situations. It is worth mentioning that some answers are char-
acterized by a number of issues (e.g., presence of words be-
longing to multiple languages or presence of off-topic an-
swers). We decided not to eliminate these answers from the