TAPE: Assessing Few-shot Russian Language Understanding
Ekaterina Taktasheva1,2, Tatiana Shavrina1,3, Alena Fenogenova1, Denis Shevelev1, Nadezhda Katricheva1, Maria Tikhonova1,2, Albina Akhmetgareeva1, Oleg Zinkevich2, Anastasiia Bashmakova2, Svetlana Iordanskaia2, Alena Spiridonova2, Valentina Kurenshchikova2, Ekaterina Artemova4,5, Vladislav Mikhailov1
1SberDevices, 2HSE University, 3Artificial Intelligence Research Institute, 4Huawei Noah's Ark Lab, 5CIS LMU Munich, Germany
Correspondence: rybolos@gmail.com
Equal contribution.
Abstract
Recent advances in zero-shot and few-shot learning have shown promise for a range of research and practical purposes. However, this fast-growing area lacks standardized evaluation suites for non-English languages, hindering progress outside the Anglo-centric paradigm. To address this problem, we propose TAPE (Text Attack and Perturbation Evaluation), a novel benchmark that includes six more complex NLU tasks for Russian, covering multi-hop reasoning, ethical concepts, logic, and commonsense knowledge. The TAPE design focuses on systematic zero-shot and few-shot NLU evaluation: (i) linguistic-oriented adversarial attacks and perturbations for analyzing robustness, and (ii) subpopulations for nuanced interpretation. A detailed analysis of the autoregressive baselines indicates that simple spelling-based perturbations affect performance the most, while paraphrasing the input has a much smaller effect. At the same time, the results demonstrate a significant gap between the neural and human baselines for most tasks. We publicly release TAPE (tape-benchmark.com) to foster research on robust LMs that can generalize to new tasks when little to no supervision is available.
1 Introduction
The ability to acquire new concepts from a few examples is central to human intelligence (Tenenbaum et al., 2011). Recent advances in the NLP field have fostered the development of language models (LMs; Radford et al., 2019; Brown et al., 2020) that exhibit such generalization capacity under a wide range of few-shot learning and prompting methods (Liu et al., 2021; Beltagy et al., 2022). The community has addressed various aspects of few-shot learning, such as efficient model application (Schick and Schütze, 2021), adaptation to unseen tasks and domains (Bansal et al., 2020a,b), and cross-lingual generalization (Winata et al., 2021; Lin et al., 2021).
The latest research has raised an essential question of standardized evaluation protocols to assess few-shot generalization from multiple perspectives. The novel tool-kits and benchmarks mainly focus on systematic evaluation design (Bragg et al., 2021; Zheng et al., 2022), cross-task generalization (Ye et al., 2021; Wang et al., 2022), and real-world scenarios (Alex et al., 2021). However, this rapidly developing area fails to provide similar evaluation suites for non-English languages, hindering progress outside the Anglo-centric paradigm.
Motivation and Contributions. In this paper, we introduce TAPE (Text Attack and Perturbation Evaluation), a novel benchmark for few-shot Russian language understanding evaluation. Our objective is to provide a reliable tool and methodology for nuanced assessment of zero-shot and few-shot methods for Russian. This objective is achieved through two main contributions.
Contribution 1. Our first contribution is to create six more complex question answering (QA), Winograd schema, and ethics tasks for Russian. The tasks require understanding many aspects of language, multi-hop reasoning, logic, and commonsense knowledge. The motivation behind this is that there are systems that match or outperform human baselines on most of the existing QA tasks for Russian, e.g., the ones from Russian SuperGLUE (Shavrina et al., 2020): DaNetQA (Glushkova et al., 2020), MuSeRC and RuCoS (Fenogenova et al., 2020). To the best of our knowledge, datasets on ethical concepts have not yet been created in Russian. To bridge this gap, we propose one of the first Russian datasets for estimating the ability of LMs to predict human ethical judgments about various text situations.
Contribution 2. Our second contribution is to develop a framework for multifaceted zero-shot and few-shot NLU evaluation. The design includes (i) linguistic-oriented adversarial attacks and perturbations for testing robustness, and (ii) subpopulations for nuanced performance analysis. Here, we follow the methodological principles and recommendations by Bowman and Dahl (2021) and Bragg et al. (2021), which motivate the need for systematic benchmark design and adversarially constructed test sets.
Findings. Our findings are five-fold: (i) zero-shot evaluation may outperform few-shot evaluation, meaning that the autoregressive baselines fail to utilize demonstrations; (ii) few-shot results may be unstable and sensitive to prompt changes; (iii) as a negative result, zero-shot and few-shot generation for open-domain and span selection QA tasks leads to near-zero performance; (iv) the baselines are most vulnerable to spelling-based and emoji-based adversarial perturbations; and (v) human annotators significantly outperform the neural baselines, indicating that there is still room for developing robust and generalizable systems.
2 Related Work
Benchmark Critique. Benchmarks such as GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) have become de facto standard tools to measure progress in NLP. However, recent studies have criticized the canonical benchmarking approaches. Bender et al. (2021) warn that performance gains are achieved at the cost of carbon footprint. Elangovan et al. (2021) claim that the current benchmarks evaluate the LM's ability to memorize rather than generalize because of the significant overlap between the train and test datasets. Church and Kordoni (2022) argue that benchmarks focus on relatively easy tasks instead of creating long-term challenges. Raji et al. (2021) raise concerns about the resource-intensive task design: in particular, benchmarks come with large-scale training datasets, which are expensive to create. This may lead to benchmark stagnation, as new tasks cannot be added easily (Barbosa-Silva et al., 2022). In turn, few-shot benchmarking offers a promising avenue for evaluating LMs in terms of generalization capacity and computational and resource costs.
Few-shot Benchmarking. Research in few-shot benchmarking has evolved in several directions. Schick and Schütze (2021) create FewGLUE by sampling small fixed-sized training datasets from SuperGLUE; variance w.r.t. training dataset size and sampling strategy is not reported. Later works overcome these issues by exploring evaluation strategies such as K-fold cross-validation (Perez et al., 2021), bagging, and multi-splits, introduced in FewNLU (Zheng et al., 2022). Additionally, FewNLU explores correlations between performance on development and test sets and stability w.r.t. the number of runs. CrossFit (Ye et al., 2021) studies cross-task generalization by unifying task formats and splitting tasks into training, development, and test sets. FLEX (Bragg et al., 2021) covers the best practices and provides a unified interface for different types of transfer and varying shot sizes. Finally, to the best of our knowledge, the only non-English dataset for few-shot benchmarking is FewCLUE in Chinese (Xu et al., 2021). TAPE is the first few-shot benchmark for Russian, and it introduces variations at the data level by creating adversarial test sets.
3 Task Formulations
TAPE includes six novel datasets for Russian, each requiring the modeling of at least two “intellectual abilities”: logical reasoning (§3.1; extended Winograd schema challenge), reasoning with world knowledge (§3.2; CheGeKa, RuOpenBookQA and RuWorldTree), multi-hop reasoning (§3.2; MultiQ), and ethical judgments (§3.3; Ethics1/2). This section describes the task formulations, general data collection stages, and dataset examples. Appendix A provides the general dataset statistics, while Appendix E.1 includes details on dataset collection and the extra validation stage via the crowd-sourcing platform Toloka (toloka.ai; Pavlichenko et al., 2021).
3.1 Logical Reasoning
Winograd. The Winograd schema challenge composes tasks with syntactic ambiguity, which can be resolved with logical reasoning (Levesque et al., 2012). The texts for the dataset are collected with a semi-automatic pipeline. First, lists of 11 typical grammatical structures with syntactic homonymy (mainly case) are compiled by a few authors with a linguistic background (see Appendix B). Queries corresponding to these constructions are submitted to the search interface of the Russian National Corpus (ruscorpora.ru/en), which includes a sub-corpus with resolved homonymy. In the resulting 2k+ sentences, homonymy is resolved automatically with the UDPipe package and then validated manually by a few authors. Each sentence is split into multiple examples in the binary classification format, indicating whether the reference pronoun is dependent on the chosen candidate noun.
Context: “Brosh' iz Pompei, kotoraya perezhila veka.” (A trinket from Pompeii that has survived the centuries.)
Reference: “kotoraya” (that)
Candidate Answer: “Brosh'” (A trinket)
Label: ✓ (correct resolution)
3.2 Reasoning with World Knowledge
RuOpenBookQA. RuOpenBookQA is a QA dataset with multiple-choice elementary-level science questions, which probe understanding of 1k+ core science facts. The dataset is built with automatic translation of the original English dataset by Mihaylov et al. (2018) and manual validation by a few authors.

Question: “Yesli chelovek idet v napravlenii, protivopolozhnom napravleniyu strelki kompasa, on idet...” (If a person walks in the direction opposite to the compass needle, they are going...)
Answers: (A) “na zapad” (west); (B) “na sever” (north); (C) “na vostok” (east); (D) “na yug” (south).
RuWorldTree. The collection approach of RuWorldTree is similar to that of RuOpenBookQA, the main difference being the additional lists of facts and the logical order that is attached to the output of each answer to a question (Jansen et al., 2018).

Question: “Kakoye svoystvo vody izmenitsya, kogda voda dostignet tochki zamerzaniya?” (What property of water will change when the water reaches the freezing point?)
Answers: (A) “tsvet” (color); (B) “massa” (mass); (C) “sostoyaniye” (state of matter); (D) “ves” (weight).
MultiQ. Multi-hop reasoning has been one of the least explored QA directions for Russian. The task is addressed by the MuSeRC dataset (Fenogenova et al., 2020) and only a few dozen questions in SberQuAD (Efimov et al., 2020) and RuBQ (Rybin et al., 2021). In response, we have developed a semi-automatic pipeline for multi-hop dataset generation based on Wikidata and Wikipedia. First, we extract the triplets from Wikidata and search for their intersections. Two triplets (subject, relation, object) are needed to compose an answerable multi-hop question. For instance, the question “Na kakom kontinente nakhoditsya strana, grazhdaninom kotoroy byl Yokhannes Blok?” (In what continent lies the country of which Johannes Block was a citizen?) is formed by a sequence of five graph units: “Blok, Yokhannes” (Block, Johannes), “grazhdanstvo” (country of citizenship), “Germaniya” (Germany), “chast' sveta” (continent), and “Yevropa” (Europe). Second, several hundred question templates corresponding to such sequences are manually curated by a few authors and are further used to fine-tune ruT5-large (hf.co/sberbank-ai/ruT5-large) to generate multi-hop questions given the graph unit sequences. Third, the resulting questions undergo paraphrasing (Fenogenova, 2021) and a manual validation procedure to control the quality and diversity. Finally, each question is linked to two Wikipedia paragraphs with the help of wptools (github.com/siznax/wptools), where all graph units appear in natural language. The task is to select the answer span using information from both paragraphs. A toy sketch of the triplet chaining is given after the example below.
Question: “Gde nakhoditsya istok reki, pritokom kotoroy yavlyayetsya Getar?” (Where is the source of the river, the tributary of which is the Getar?)
Supporting Text: “Getar — reka v Armenii. Beryot nachalo na territorii Kotaykskoy oblasti, protekayet cherez tsentral'nuyu chast' Yerevana i vpadayet v Razdan.” (The Getar is a river in Armenia. [It] originates in the Kotayk region, flows through the central part of Yerevan and flows into the Hrazdan.)
Main Text: “Razdan — reka v Armenii. Vytekayet iz ozera Sevan v yego severo-zapadnoy chasti, nedaleko ot goroda Sevan.” (The Hrazdan is a river in Armenia. [It] originates at the northwest extremity of Lake Sevan, near the city of Sevan.)
Answer: Sevan
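To make the triplet-chaining step concrete, the following toy Python sketch shows how two Wikidata-style triplets that share a bridge entity can be combined into a two-hop question; the function name, template, and data layout are purely illustrative and are not part of the released pipeline.

```python
# Two (subject, relation, object) triplets that intersect on the bridge entity "Germaniya".
triplet_1 = ("Blok, Yokhannes", "grazhdanstvo", "Germaniya")   # citizenship
triplet_2 = ("Germaniya", "chast' sveta", "Yevropa")           # continent

def compose_two_hop(first, second, template):
    """Chain two triplets whose object and subject coincide into a question-answer pair."""
    subject_1, _relation_1, bridge = first
    bridge_2, _relation_2, answer = second
    assert bridge == bridge_2, "triplets must intersect on the bridge entity"
    return template.format(subject=subject_1), answer

template = "Na kakom kontinente nakhoditsya strana, grazhdaninom kotoroy byl {subject}?"
question, answer = compose_two_hop(triplet_1, triplet_2, template)
print(question)  # In what continent lies the country of which Johannes Block was a citizen?
print(answer)    # Yevropa
```

In the actual pipeline, a fine-tuned ruT5-large model generates the question from the graph unit sequence instead of a fixed template, and the result is paraphrased and manually validated.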
Figure 1: Overview of the TAPE design. (a) D_test is passed to the adversarial framework (§4.2) to create the adversarial test set D^A_test that includes the original and adversarial examples. (b) We randomly sample 5 sets of demonstration examples from D_train for each k ∈ {1, 4, 8}. In the zero-shot scenario, we skip this stage. (c) After that, we merge the demonstrations, when applicable, with the examples from D^A_test to construct evaluation episodes E^N_k. (d) Each E^N_k is used to obtain predictions from the model. (e) The performance is summarized in a diagnostic evaluation report. BF – BUTTERFINGERS, AS – ADDSENT, S – subpopulation.

CheGeKa. The CheGeKa game (en.wikipedia.org/wiki/what_where_when) setup is similar to Jeopardy!: the player should answer questions based on wit and common sense knowledge. We directly contacted the authors of Russian Jeopardy! (Mikhalkova, 2021) and asked about including their training and private test sets in our benchmark. The task is to provide a free response given a question and the question category.
Question: “Imenno on napisal muzyku k opere Turandot.” (It was he who composed the music for the opera “Turandot”.)
Category: “Komediya del' arte” (Commedia dell'arte)
Answer: “Puchchini” (Puccini)
3.3 Ethical Judgments
There is a multitude of approaches to evaluating ethics in machine learning. The Ethics dataset for Russian is created from scratch for the first time, relying on a design compatible with Hendrycks et al. (2021). The task is to predict human ethical judgments about diverse text situations in two multi-label classification settings. The first one is to identify the presence of concepts in normative ethics, such as virtue, law, moral, justice, and utilitarianism (Ethics1). The second one is to evaluate the positive or negative implementation of these concepts with binary categories (Ethics2).
The dataset is composed in a semi-automatic mode. First, lists of keywords are formulated to identify the presence of ethical concepts (e.g., “kill”, “give”, “create”, etc.). The keyword collection is extended with synonyms obtained automatically using the semantic similarity tools of the RusVectores project (Kutuzov and Kuzmenko, 2017). After that, the news and fiction sub-corpora of the Taiga corpus (Shavrina and Shapovalova, 2017) are filtered to extract short texts containing these keywords. Each text is annotated via Toloka as documented in Appendix E.1.
Text: “Pechen'kami sobstvennogo prigotovleniya nagradila 100-letnyaya Greta Plokh malysha, kotoryy pomog yey pereyti cherez ozhivlennoye shosse po peshekhodnomu perekhodu.” (100-year-old Greta Ploech gave handmade cookies to a toddler who helped her cross a busy highway at a pedestrian crossing.)
Labels1: ✓ (Virtue), ✗ (Law), ✗ (Moral), ✓ (Justice), ✓ (Utilitarianism)
Labels2: ✓ (Virtue), ✓ (Law), ✓ (Moral), ✓ (Justice), ✓ (Utilitarianism)
4 Design
4.1 Evaluation Principles
This section outlines our evaluation principles, which are based on methodological recommendations and open research questions discussed by Bragg et al. (2021), Bowman and Dahl (2021), and Beltagy et al. (2022), including sample size design, a varying number of shots, reporting variability, diagnostic performance analysis, and adversarial test sets. Figure 1 describes the TAPE design.
Data Sampling. Each task in our benchmark consists of a training set D_train with labeled examples and a test set D_test. The splits are randomly sampled, except for RuOpenBookQA, RuWorldTree, and CheGeKa, where we use the original splits. We keep the dataset size up to 1k and purposefully include imbalanced data for the text classification tasks.
No extra data. We do not provide validation sets nor any additional unlabeled data, in order to test the zero-shot and few-shot generalization capabilities of LMs (Bao et al., 2019; Tam et al., 2021).
Number of shots. We consider k ∈ {1, 4, 8} for few-shot evaluation to account for sensitivity to the number of shots. We also include zero-shot evaluation, which can be a strong baseline and simulate scenarios where no supervision is available.
Episode sampling. We provide 5 episodes in each k-shot setting, k ∈ {1, 4, 8}, and report the standard deviation over the episodes to estimate the variability due to the selection of demonstrations (Schick and Schütze, 2021). Each episode E_i = (E^i_train,k, D^A_test) consists of k demonstration examples E^i_train,k randomly sampled from D_train with replacement, and a single test set D^A_test acquired via the combination of original and adversarial test data.
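The episode construction can be sketched as follows; this is a minimal illustration under our own assumptions about the data representation, and the function and variable names are ours rather than part of the released code.

```python
import random

def sample_episodes(train_set, adversarial_test_set, k, n_episodes=5, seed=0):
    """Sketch: each episode pairs k demonstrations drawn from the training set
    (with replacement) with the fixed adversarial test set."""
    rng = random.Random(seed)
    return [
        {
            "demonstrations": [rng.choice(train_set) for _ in range(k)],
            "test": adversarial_test_set,
        }
        for _ in range(n_episodes)
    ]

# Toy data standing in for D_train and D^A_test.
train = [{"text": f"train example {i}", "label": i % 2} for i in range(100)]
adv_test = [{"text": f"test example {i}", "label": i % 2} for i in range(20)]

# Five episodes per k-shot setting; k = 0 (zero-shot) yields empty demonstration lists.
episodes = {k: sample_episodes(train, adv_test, k) for k in (0, 1, 4, 8)}
```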
Subpopulations. Subpopulations (Goel et al., 2021) are utilized for fine-grained performance analysis w.r.t. such properties of D_test as length, domain, and others.
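As an illustration of what subpopulation-level reporting might look like, the sketch below groups test examples by a slicing criterion and reports per-group accuracy; the slicer and all names are illustrative assumptions on our side, not TAPE internals.

```python
from collections import defaultdict

def accuracy_by_subpopulation(examples, predictions, slicer):
    """Group test examples into subpopulations with `slicer` and report per-group accuracy."""
    buckets = defaultdict(list)
    for example, prediction in zip(examples, predictions):
        buckets[slicer(example)].append(prediction == example["label"])
    return {name: sum(hits) / len(hits) for name, hits in buckets.items()}

# Example slicer: bucket by input length, one of the properties mentioned above.
def length_bucket(example):
    return "short" if len(example["text"].split()) < 20 else "long"

examples = [{"text": "a " * n, "label": n % 2} for n in range(5, 45, 5)]
predictions = [1, 0, 1, 0, 1, 0, 1, 0]
print(accuracy_by_subpopulation(examples, predictions, length_bucket))
```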
Robustness. LMs are susceptible to adversarial examples, purposefully designed to force them to output a wrong prediction given a modified input (Ebrahimi et al., 2018; Liang et al., 2018; Jia and Liang, 2017). We analyze the LMs' robustness to different types of adversarial data transformations. Here, each E^i_train,k corresponds to T + 1 test variations, including the original D_test and T adversarial test sets D^A_test, acquired through the modification of D_test. T depends on the dataset and can be adjusted based on the user's needs.
4.2 Adversarial Framework
4.2.1 Attacks and Perturbations
Table 1 summarizes the TAPE adversarial attacks and perturbations based on the generally accepted typology (Zhang et al., 2020; Wang et al., 2021b).

Word-level Perturbations. Word-level perturbations utilize several strategies to perturb tokens, ranging from the imitation of typos (Jin et al., 2020) to synonym replacement (Wei and Zou, 2019). We consider the following:
Spelling. BUTTERFINGERS is a typo-based perturbation that adds noise to the data by mimicking spelling mistakes made by humans through character swaps based on keyboard distance.
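A minimal sketch of this kind of keyboard-distance typo noise is given below. It is our own simplified re-implementation with a toy neighbour map for a few Latin letters, not the exact perturbation used in TAPE; the probability argument corresponds to the adversarial threshold discussed in §4.2.2.

```python
import random

# Toy keyboard-neighbour map; the real perturbation would use full (Cyrillic) keyboard layouts.
NEIGHBOURS = {
    "a": "qwsz", "s": "awedxz", "d": "serfcx", "e": "wsdr",
    "o": "iklp", "n": "bhjm", "t": "rfgy", "i": "ujko",
}

def butterfingers(text, probability=0.1, seed=0):
    """Replace characters with keyboard neighbours with the given probability."""
    rng = random.Random(seed)
    out = []
    for char in text:
        neighbours = NEIGHBOURS.get(char.lower())
        if neighbours and rng.random() < probability:
            out.append(rng.choice(neighbours))
        else:
            out.append(char)
    return "".join(out)

print(butterfingers("a person walks in the direction opposite to the compass needle", 0.2))
```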
Modality. EMOJIFY replaces the input words with the corresponding emojis, preserving their original meaning. A few authors have manually validated translations of the English emoji dictionary.
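The word-to-emoji substitution could be approximated as follows; the toy dictionary and names are our own assumptions, whereas the actual perturbation relies on the manually validated Russian translation of an English emoji dictionary.

```python
import random

# Toy word-to-emoji dictionary; the real perturbation uses a validated Russian lexicon.
EMOJI_MAP = {"river": "🏞️", "water": "💧", "cookie": "🍪", "dog": "🐕", "house": "🏠"}

def emojify(text, probability=0.5, seed=0):
    """Replace dictionary words with emojis that preserve their meaning."""
    rng = random.Random(seed)
    tokens = []
    for token in text.split():
        replacement = EMOJI_MAP.get(token.lower().strip(".,!?"))
        if replacement and rng.random() < probability:
            tokens.append(replacement)
        else:
            tokens.append(token)
    return " ".join(tokens)

print(emojify("The Getar is a river in Armenia, and its water flows into the Hrazdan.", probability=1.0))
```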
Sentence-level Perturbations. In contrast to word-level perturbations, sentence-level perturbation techniques affect the syntactic structure:
Random. Easy Data Augmentation (EDA; Wei and Zou, 2019) has proved to be effective in fooling LMs on text classification tasks. We use two EDA configurations: swapping words (EDA_SWAP) and deleting tokens (EDA_DELETE).
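A rough sketch of the two EDA configurations, simplified for illustration and written under our own assumptions about tokenization (whitespace splitting):

```python
import random

def eda_swap(tokens, n_swaps=1, seed=0):
    """Randomly swap pairs of tokens (EDA_SWAP)."""
    rng = random.Random(seed)
    tokens = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def eda_delete(tokens, probability=0.1, seed=0):
    """Randomly delete tokens with the given probability (EDA_DELETE)."""
    rng = random.Random(seed)
    kept = [token for token in tokens if rng.random() >= probability]
    return kept if kept else [rng.choice(tokens)]  # never return an empty sentence

sentence = "What property of water will change when the water reaches the freezing point".split()
print(" ".join(eda_swap(sentence)))
print(" ".join(eda_delete(sentence)))
```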
Paraphrasis. BACKTRANSLATION (Yaseen and Langer, 2021) allows generating linguistic variations of the input without changing named entities. We use the OpusMT model (hf.co/Helsinki-NLP/opus-mt) to translate the input text into English and back to Russian.
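Back-translation through English can be sketched with Marian checkpoints from the Hugging Face transformers library. The specific ru-en/en-ru model names and generation settings below are our assumption; the paper only specifies the OpusMT family.

```python
from transformers import MarianMTModel, MarianTokenizer

def load(model_name):
    return MarianTokenizer.from_pretrained(model_name), MarianMTModel.from_pretrained(model_name)

def translate(text, tokenizer, model):
    batch = tokenizer([text], return_tensors="pt", truncation=True)
    generated = model.generate(**batch, max_length=256)
    return tokenizer.decode(generated[0], skip_special_tokens=True)

# Assumed OpusMT checkpoints for the ru -> en -> ru round trip.
ru_en_tok, ru_en = load("Helsinki-NLP/opus-mt-ru-en")
en_ru_tok, en_ru = load("Helsinki-NLP/opus-mt-en-ru")

def back_translate(text):
    """Paraphrase the input by translating it into English and back into Russian."""
    english = translate(text, ru_en_tok, ru_en)
    return translate(english, en_ru_tok, en_ru)

print(back_translate("Гетар — река в Армении."))
```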
Distraction. ADDSENT is an adversarial attack that generates extra words or sentences with the help of a generative text model. We pass the input to the mGPT LM (hf.co/THUMT/mGPT) and generate continuations with the sampling strategy. In the multiple-choice QA tasks, we replace one or more incorrect answers with their generated alternatives.
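The distractor generation step can be sketched with the Hugging Face causal-LM API. The checkpoint name and sampling hyperparameters below are our assumptions; substitute the mGPT checkpoint referenced above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "ai-forever/mGPT"  # assumed public multilingual GPT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def add_sent(text, max_new_tokens=20, seed=0):
    """Append a sampled continuation to the input, acting as a distracting sentence."""
    torch.manual_seed(seed)
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,          # sampling strategy, as in the attack description
        top_p=0.95,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(add_sent("Гетар — река в Армении."))
```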
4.2.2 Data Curation
Adversarial perturbations and attacks are effectively utilized to exploit weaknesses in LMs (Goel et al., 2021). At the same time, popular techniques may distort semantic meanings or generate invalid adversarial examples (Wang et al., 2021a). We aim to address this problem by using: (i) adversarial probability thresholds, (ii) task-specific constraints, and (iii) semantic filtering.
Probability thresholds. The degree of input modification can be controlled with an adversarial probability threshold, which serves as a hyperparameter: the higher the probability, the more the input gets modified. The thresholds used in our experiments are specified in Table 1.