
label distribution and outperforms the strong baselines employed in readability assessments.
This study makes two main contributions. First, we present the largest corpus to date of sentences annotated according to established language ability indicators. Second, we propose a sentence-level assessment model that handles unbalanced label distributions. CEFR-SP and the sentence-level assessment code are available for future research at https://github.com/yukiar/CEFR-SP; the licenses of the data sources are detailed in the Ethics Statement section.
2 Related Work
Related studies have assessed text levels at different granularities (document and sentence) and with different level definitions (readability/complexity and CEFR).
2.1 Document-based Readability
Previous studies have assessed readability and created corpora with document readability annotations. WeeBit (Vajjala and Meurers, 2012), the OneStopEnglish corpus (Vajjala and Lučić, 2018), and Newsela provide manually written documents for various readability levels. Working with these annotated corpora, previous studies have used various linguistic and psycholinguistic features to develop models for assessing document-based readability (Heilman et al., 2007; Kate et al., 2010; Vajjala and Meurers, 2012; Xia et al., 2016; Vajjala and Lučić, 2018). Neural network-based approaches have proven better than feature-based models (Azpiazu and Pera, 2019; Meng et al., 2020; Imperial, 2021; Martinc et al., 2021). In particular, Deutsch et al. (2020) showed that pretrained language models outperform feature-based approaches and that adding linguistic features to them yields no performance gains.
2.2 Sentence-based Readability
Previous studies annotated sentence complexity based on crowd workers' subjective perceptions. Štajner et al. (2017) used a 5-point scale to rate the complexity of sentences written by humans or generated by text simplification models. Brunato et al. (2018) used a 7-point scale for sentences extracted from the news sections of treebanks (McDonald et al., 2013). However, as Section 3.4 confirms, relating complexity to language ability descriptions is challenging. Naderi et al. (2019) annotated German sentence complexity based on language learners' subjective judgements.
In contrast, the CEFR level of a sentence should be judged objectively, based on an understanding of language learners' skills. Hence, we presume that a sentence's CEFR level can be judged only by language education professionals based on their teaching experience. For sentence-based readability assessments, previous studies regarded all sentences in a document as having the same readability (Collins-Thompson and Callan, 2004; Dell'Orletta et al., 2011; Vajjala and Meurers, 2014; Ambati et al., 2016; Howcroft and Demberg, 2017). As we show in Section 3.4, this assumption hardly holds.
The simplicity of a sentence is one of the primary aspects of text simplification evaluation, which is commonly judged by humans. A few corpora are annotated with sentence simplicity for automatic quality estimation of text simplification (Štajner et al., 2016; Alva-Manchego et al., 2021). Nakamachi et al. (2020) applied a pretrained language model to estimate sentence simplicity and used it as the reward for a reinforcement learning-based text simplification model. Sentence simplicity is distinct from CEFR levels, which are grounded in established language ability descriptions.
2.3 CEFR-based Text Levels
Attempts have been made to establish criteria for CEFR-level assessments. For example, the English Profile (Salamoura and Saville, 2010) and CEFR-J (Ishii and Tono, 2018) projects relate English vocabulary and grammar to CEFR levels based on learner-written and textbook corpora. Tools such as Text Inspector (https://textinspector.com/) and CVLA (Uchida and Negishi, 2018) endeavour to measure the level of English reading passages automatically. Xia et al. (2016) collected reading passages from Cambridge English Exams and predicted their CEFR levels using features proposed to assess readability. Rama and Vajjala (2021) demonstrated that Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) consistently achieved high accuracy for multilingual CEFR-level classification.
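To make this kind of classification setup concrete, the following is a minimal sketch of fine-tuning a pretrained BERT model for six-way CEFR-level sentence classification with the Hugging Face Transformers library. The model name, hyperparameters, and toy data are illustrative assumptions, not the configuration used by Rama and Vajjala (2021).

```python
# Hedged sketch: fine-tune BERT for 6-way CEFR-level classification
# (A1..C2). Model choice and hyperparameters are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(LEVELS)
)

# Toy (sentence, level) pairs; a real corpus such as CEFR-SP supplies
# expert-annotated labels instead.
train = [("I like dogs.", "A1"),
         ("The committee deferred its verdict pending further review.", "C1")]

enc = tokenizer([s for s, _ in train], padding=True, truncation=True,
                return_tensors="pt")
labels = torch.tensor([LEVELS.index(lv) for _, lv in train])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few passes over the toy batch
    out = model(**enc, labels=labels)  # cross-entropy over the 6 levels
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Predict the level of an unseen sentence.
model.eval()
with torch.no_grad():
    logits = model(**tokenizer("She runs fast.", return_tensors="pt")).logits
print(LEVELS[logits.argmax(dim=-1).item()])
```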
Although these micro- (i.e., vocabulary and grammar) and macro-level (i.e., passage-level) approaches have proven useful, few attempts have been made to assign CEFR levels at the sentence level, despite its importance in learning and teaching. Pilán et al. (2014) conducted a sentence-level assessment for Swedish based on CEFR; however,