
TAPE: Assessing Few-shot Russian Language Understanding
Ekaterina Taktasheva1,2∗, Tatiana Shavrina1,3∗, Alena Fenogenova1∗, Denis Shevelev1,
Nadezhda Katricheva1, Maria Tikhonova1,2, Albina Akhmetgareeva1,
Oleg Zinkevich2, Anastasiia Bashmakova2, Svetlana Iordanskaia2, Alena Spiridonova2,
Valentina Kurenshchikova2, Ekaterina Artemova4,5, Vladislav Mikhailov1
1SberDevices, 2HSE University, 3Artificial Intelligence Research Institute,
4Huawei Noah’s Ark lab, 5CIS LMU Munich, Germany
Correspondence: rybolos@gmail.com
Abstract
Recent advances in zero-shot and few-shot learning have shown promise for a range of research and practical purposes. However, this fast-growing area lacks standardized evaluation suites for non-English languages, hindering progress outside the Anglo-centric paradigm. To address this gap, we propose TAPE (Text Attack and Perturbation Evaluation), a novel benchmark that includes six more complex NLU tasks for Russian, covering multi-hop reasoning, ethical concepts, logic, and commonsense knowledge. TAPE's design focuses on systematic zero-shot and few-shot NLU evaluation: (i) linguistic-oriented adversarial attacks and perturbations for analyzing robustness, and (ii) subpopulations for nuanced interpretation. A detailed analysis of the autoregressive baselines indicates that simple spelling-based perturbations affect performance the most, while paraphrasing the input has a more negligible effect. At the same time, the results demonstrate a significant gap between the neural and human baselines for most tasks. We publicly release TAPE¹ to foster research on robust LMs that can generalize to new tasks when little to no supervision is available.
1 Introduction
The ability to acquire new concepts from a few examples is central to human intelligence (Tenenbaum et al., 2011). Recent advances in the NLP field have fostered the development of language models (LMs; Radford et al., 2019; Brown et al., 2020) that exhibit such generalization capacity under a wide range of few-shot learning and prompting methods (Liu et al., 2021; Beltagy et al., 2022). The community has addressed various aspects of few-shot learning, such as efficient model application (Schick and Schütze, 2021), adaptation to unseen tasks and domains (Bansal et al., 2020a,b), and
∗Equal contribution.
¹ tape-benchmark.com
cross-lingual generalization (Winata et al., 2021; Lin et al., 2021).
The latest research has raised the essential question of standardized evaluation protocols to assess few-shot generalization from multiple perspectives. Novel toolkits and benchmarks mainly focus on systematic evaluation design (Bragg et al., 2021; Zheng et al., 2022), cross-task generalization (Ye et al., 2021; Wang et al., 2022), and real-world scenarios (Alex et al., 2021). However, this rapidly developing area fails to provide similar evaluation suites for non-English languages, hindering progress outside the Anglo-centric paradigm.
Motivation and Contributions. In this paper, we introduce TAPE², a novel benchmark for few-shot Russian language understanding evaluation. Our objective is to provide a reliable tool and methodology for the nuanced assessment of zero-shot and few-shot methods for Russian. This objective is achieved through two main contributions.
Contribution 1. Our first contribution is the creation of six more complex question answering (QA), Winograd schema, and ethics tasks for Russian. The tasks require understanding many aspects of language, multi-hop reasoning, logic, and commonsense knowledge. The motivation is that existing systems match or outperform human baselines on most of the existing QA tasks for Russian, e.g., those from Russian SuperGLUE (Shavrina et al., 2020): DaNetQA (Glushkova et al., 2020), MuSeRC, and RuCoS (Fenogenova et al., 2020). To the best of our knowledge, datasets on ethical concepts have not yet been created for Russian. To bridge this gap, we propose one of the first Russian datasets for estimating the ability of LMs to predict human ethical judgments about various text situations.
Contribution 2. Our second contribution is to develop a framework for multifaceted zero-shot and
² Text Attack and Perturbation Evaluation.
arXiv:2210.12813v1 [cs.CL] 23 Oct 2022