TAPE: Assessing Few-shot Russian Language Understanding
Ekaterina Taktasheva1,2, Tatiana Shavrina1,3, Alena Fenogenova1, Denis Shevelev1, Nadezhda Katricheva1, Maria Tikhonova1,2, Albina Akhmetgareeva1, Oleg Zinkevich2, Anastasiia Bashmakova2, Svetlana Iordanskaia2, Alena Spiridonova2, Valentina Kurenshchikova2, Ekaterina Artemova4,5, Vladislav Mikhailov1
1SberDevices, 2HSE University, 3Artificial Intelligence Research Institute, 4Huawei Noah's Ark Lab, 5CIS LMU Munich, Germany
Correspondence: rybolos@gmail.com
Equal contribution.
Abstract
Recent advances in zero-shot and few-shot learning have shown promise for a range of research and practical purposes. However, this fast-growing area lacks standardized evaluation suites for non-English languages, hindering progress outside the Anglo-centric paradigm. To address this problem, we propose TAPE (Text Attack and Perturbation Evaluation), a novel benchmark that includes six more complex NLU tasks for Russian, covering multi-hop reasoning, ethical concepts, logic, and commonsense knowledge. The TAPE design focuses on systematic zero-shot and few-shot NLU evaluation: (i) linguistic-oriented adversarial attacks and perturbations for analyzing robustness, and (ii) subpopulations for nuanced interpretation. A detailed analysis of the autoregressive baselines indicates that simple spelling-based perturbations affect performance the most, while paraphrasing the input has a much smaller effect. At the same time, the results demonstrate a significant gap between the neural and human baselines for most tasks. We publicly release TAPE (tape-benchmark.com) to foster research on robust LMs that can generalize to new tasks when little to no supervision is available.
1 Introduction
The ability to acquire new concepts from a few examples is central to human intelligence (Tenenbaum et al., 2011). Recent advances in the NLP field have fostered the development of language models (LMs; Radford et al., 2019; Brown et al., 2020) that exhibit such generalization capacity under a wide range of few-shot learning and prompting methods (Liu et al., 2021; Beltagy et al., 2022). The community has addressed various aspects of few-shot learning, such as efficient model application (Schick and Schütze, 2021), adaptation to unseen tasks and domains (Bansal et al., 2020a,b), and cross-lingual generalization (Winata et al., 2021; Lin et al., 2021).
The latest research has raised an essential question of standardized evaluation protocols to assess few-shot generalization from multiple perspectives. The novel tool-kits and benchmarks mainly focus on systematic evaluation design (Bragg et al., 2021; Zheng et al., 2022), cross-task generalization (Ye et al., 2021; Wang et al., 2022), and real-world scenarios (Alex et al., 2021). However, this rapidly developing area fails to provide similar evaluation suites for non-English languages, hindering progress outside the Anglo-centric paradigm.
Motivation and Contributions. In this paper, we introduce TAPE (Text Attack and Perturbation Evaluation), a novel benchmark for few-shot Russian language understanding evaluation. Our objective is to provide a reliable tool and methodology for nuanced assessment of zero-shot and few-shot methods for Russian. This objective is achieved through two main contributions.
Contribution 1. Our first contribution is to create six more complex question answering (QA), Winograd schema, and ethics tasks for Russian. The tasks require understanding many aspects of language, multi-hop reasoning, logic, and commonsense knowledge. The motivation behind this is that there are systems that match or outperform human baselines on most of the existing QA tasks for Russian, e.g., the ones from Russian SuperGLUE (Shavrina et al., 2020): DaNetQA (Glushkova et al., 2020), MuSeRC and RuCoS (Fenogenova et al., 2020). To the best of our knowledge, datasets on ethical concepts have not yet been created in Russian. To bridge this gap, we propose one of the first Russian datasets for estimating the ability of LMs to predict human ethical judgments about various text situations.
Contribution 2. Our second contribution is to develop a framework for multifaceted zero-shot and few-shot NLU evaluation. The design includes (i) linguistic-oriented adversarial attacks and perturbations for testing robustness, and (ii) subpopulations for nuanced performance analysis. Here, we follow the methodological principles and recommendations by Bowman and Dahl (2021) and Bragg et al. (2021), which motivate the need for systematic benchmark design and adversarially constructed test sets.
Findings. Our findings are five-fold: (i) zero-shot evaluation may outperform few-shot evaluation, meaning that the autoregressive baselines fail to utilize demonstrations; (ii) few-shot results may be unstable and sensitive to prompt changes; (iii) as a negative result, zero-shot and few-shot generation for open-domain and span selection QA tasks leads to near-zero performance; (iv) the baselines are most vulnerable to spelling-based and emoji-based adversarial perturbations; and (v) human annotators significantly outperform the neural baselines, indicating that there is still room for developing robust and generalizable systems.
2 Related Work
Benchmark Critique. Benchmarks such as GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) have become de facto standard tools to measure progress in NLP. However, recent studies have criticized the canonical benchmarking approaches. Bender et al. (2021) warn that performance gains are achieved at the cost of carbon footprint. Elangovan et al. (2021) claim that the current benchmarks evaluate the LM's ability to memorize rather than generalize because of the significant overlap between the train and test datasets. Church and Kordoni (2022) argue that benchmarks focus on relatively easy tasks instead of creating long-term challenges. Raji et al. (2021) raise concerns about the resource-intensive task design: in particular, benchmarks come with large-scale training datasets, which are expensive to create. This may lead to benchmark stagnation, as new tasks cannot be added easily (Barbosa-Silva et al., 2022). In turn, few-shot benchmarking offers a promising avenue for evaluating LMs in terms of generalization capacity and computational and resource costs.
Few-shot Benchmarking. Research in few-shot benchmarking has evolved in several directions. Schick and Schütze (2021) create FewGLUE by sampling small fixed-sized training datasets from SuperGLUE; variance w.r.t. training dataset size and sampling strategy is not reported. Later works overcome these issues by exploring evaluation strategies such as K-fold cross-validation (Perez et al., 2021), bagging, and multi-splits, introduced in FewNLU (Zheng et al., 2022). Additionally, FewNLU explores correlations between performance on development and test sets and stability w.r.t. the number of runs. CrossFit (Ye et al., 2021) studies cross-task generalization by unifying task formats and splitting tasks into training, development, and test sets. FLEX (Bragg et al., 2021) covers the best practices and provides a unified interface for different types of transfer and varying shot sizes. Finally, to the best of our knowledge, the only non-English dataset for few-shot benchmarking is FewCLUE in Chinese (Xu et al., 2021). TAPE is the first few-shot benchmark for Russian, and it introduces variations at the data level by creating adversarial test sets.
3 Task Formulations
TAPE includes six novel datasets for Russian, each requiring the modeling of at least two “intellectual abilities”: logical reasoning (§3.1; extended Winograd schema challenge), reasoning with world knowledge (§3.2; CheGeKa, RuOpenBookQA and RuWorldTree), multi-hop reasoning (§3.2; MultiQ), and ethical judgments (§3.3; Ethics1/2). This section describes the task formulations, general data collection stages, and dataset examples. Appendix A provides the general dataset statistics, while Appendix E.1 includes details on dataset collection and the extra validation stage via the crowd-sourcing platform Toloka (toloka.ai; Pavlichenko et al., 2021).
3.1 Logical Reasoning
Winograd. The Winograd schema challenge composes tasks with syntactic ambiguity, which can be resolved with logical reasoning (Levesque et al., 2012). The texts for the dataset are collected with a semi-automatic pipeline. First, lists of 11 typical grammatical structures with syntactic homonymy (mainly case) are compiled by a few authors with a linguistic background (see Appendix B). Queries corresponding to these constructions are submitted to the search interface of the Russian National Corpus (ruscorpora.ru/en), which includes a sub-corpus with resolved homonymy. In the resulting 2k+ sentences, homonymy is resolved automatically with the UDPipe package and then validated manually by a few authors. Each sentence is split into multiple examples in the binary classification format, indicating whether the reference pronoun is dependent on the chosen candidate noun.
Context: “Brosh' iz Pompei, kotoraya perezhila veka.” (A trinket from Pompeii that has survived the centuries.)
Reference: “kotoraya” (that)
Candidate Answer: “Brosh'” (A trinket)
Label: ✓ (correct resolution)
3.2 Reasoning with World Knowledge
RuOpenBookQA. RuOpenBookQA is a QA dataset with multiple-choice elementary-level science questions, which probe understanding of 1k+ core science facts. The dataset is built with automatic translation of the original English dataset by Mihaylov et al. (2018) and manual validation by a few authors.

Question: “Yesli chelovek idet v napravlenii, protivopolozhnom napravleniyu strelki kompasa, on idet...” (If a person walks in the direction opposite to the compass needle, they are going...)
Answers: (A) “na zapad” (west); (B) “na sever” (north); (C) “na vostok” (east); (D) “na yug” (south).
RuWorldTree. The collection approach of RuWorldTree is similar to that of RuOpenBookQA, the main difference being the additional lists of facts and the logical order that is attached to the output of each answer to a question (Jansen et al., 2018).

Question: “Kakoye svoystvo vody izmenitsya, kogda voda dostignet tochki zamerzaniya?” (What property of water will change when the water reaches the freezing point?)
Answers: (A) “tsvet” (color); (B) “massa” (mass); (C) “sostoyaniye” (state of matter); (D) “ves” (weight).
MultiQ. Multi-hop reasoning has been one of the least explored QA directions for Russian. The task is addressed by the MuSeRC dataset (Fenogenova et al., 2020) and only a few dozen questions in SberQuAD (Efimov et al., 2020) and RuBQ (Rybin et al., 2021). In response, we have developed a semi-automatic pipeline for multi-hop dataset generation based on Wikidata and Wikipedia. First, we extract the triplets from Wikidata and search for their intersections. Two triplets (subject, relation, object) are needed to compose an answerable multi-hop question. For instance, the question “Na kakom kontinente nakhoditsya strana, grazhdaninom kotoroy byl Yokhannes Blok?” (In what continent lies the country of which Johannes Block was a citizen?) is formed by a sequence of five graph units: “Blok, Yokhannes” (Block, Johannes), “grazhdanstvo” (country of citizenship), “Germaniya” (Germany), “chast' sveta” (continent), and “Yevropa” (Europe). Second, several hundred question templates corresponding to such sequences are manually curated by a few authors and are further used to fine-tune ruT5-large (hf.co/sberbank-ai/ruT5-large) to generate multi-hop questions given the graph unit sequences. Third, the resulting questions undergo paraphrasing (Fenogenova, 2021) and a manual validation procedure to control the quality and diversity. Finally, each question is linked to two Wikipedia paragraphs with the help of wptools (github.com/siznax/wptools), where all graph units appear in natural language. The task is to select the answer span using information from both paragraphs. A toy sketch of the triplet chaining is given after the example below.
Question: “Gde nakhoditsya istok reki, pritokom kotoroy yavlyayetsya Getar?” (Where is the source of the river, the tributary of which is the Getar?)
Supporting Text: “Getar — reka v Armenii. Beryot nachalo na territorii Kotaykskoy oblasti, protekayet cherez tsentral'nuyu chast' Yerevana i vpadayet v Razdan.” (The Getar is a river in Armenia. [It] originates in the Kotayk region, flows through the central part of Yerevan and flows into the Hrazdan.)
Main Text: “Razdan — reka v Armenii. Vytekayet iz ozera Sevan v yego severo-zapadnoy chasti, nedaleko ot goroda Sevan.” (The Hrazdan is a river in Armenia. [It] originates at the northwest extremity of Lake Sevan, near the city of Sevan.)
Answer: Sevan
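To make the triplet-chaining step concrete, the following toy Python sketch shows how two Wikidata-style triplets that share a bridge entity can be combined into a two-hop question; the function name, template, and data layout are purely illustrative and are not part of the released pipeline.

```python
# Two (subject, relation, object) triplets that intersect on the bridge entity "Germaniya".
triplet_1 = ("Blok, Yokhannes", "grazhdanstvo", "Germaniya")   # citizenship
triplet_2 = ("Germaniya", "chast' sveta", "Yevropa")           # continent

def compose_two_hop(first, second, template):
    """Chain two triplets whose object and subject coincide into a question-answer pair."""
    subject_1, _relation_1, bridge = first
    bridge_2, _relation_2, answer = second
    assert bridge == bridge_2, "triplets must intersect on the bridge entity"
    return template.format(subject=subject_1), answer

template = "Na kakom kontinente nakhoditsya strana, grazhdaninom kotoroy byl {subject}?"
question, answer = compose_two_hop(triplet_1, triplet_2, template)
print(question)  # In what continent lies the country of which Johannes Block was a citizen?
print(answer)    # Yevropa
```

In the actual pipeline, a fine-tuned ruT5-large model generates the question from the graph unit sequence instead of a fixed template, and the result is paraphrased and manually validated.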
Figure 1: Overview of the TAPE design. (a) D_test is passed to the adversarial framework (§4.2) to create the adversarial test set D^A_test that includes the original and adversarial examples. (b) We randomly sample 5 sets of demonstration examples from D_train for each k ∈ {1, 4, 8}. In the zero-shot scenario, we skip this stage. (c) After that, we merge the demonstrations, when applicable, with the examples from D^A_test to construct evaluation episodes E^N_k. (d) Each E^N_k is used to obtain predictions from the model. (e) The performance is summarized in a diagnostic evaluation report. BF – BUTTERFINGERS, AS – ADDSENT, S – subpopulation.

CheGeKa. The CheGeKa game (en.wikipedia.org/wiki/what_where_when) setup is similar to Jeopardy!: the player should answer questions based on wit and common sense knowledge. We directly contacted the authors of Russian Jeopardy! (Mikhalkova, 2021) and asked about including their training and private test sets in our benchmark. The task is to provide a free response given a question and the question category.
Question: “Imenno on napisal muzyku k opere Turandot.” (It was he who composed the music for the opera “Turandot”.)
Category: “Komediya del' arte” (Commedia dell'arte)
Answer: “Puchchini” (Puccini)
3.3 Ethical Judgments
There is a multitude of approaches to evaluating ethics in machine learning. The Ethics dataset for Russian is created from scratch for the first time, relying on a design compatible with Hendrycks et al. (2021). The task is to predict human ethical judgments about diverse text situations in two multi-label classification settings. The first one is to identify the presence of concepts in normative ethics, such as virtue, law, moral, justice, and utilitarianism (Ethics1). The second one is to evaluate the positive or negative implementation of these concepts with binary categories (Ethics2).
The dataset is composed in a semi-automatic mode. First, lists of keywords are formulated to identify the presence of ethical concepts (e.g., “kill”, “give”, “create”, etc.). The keyword collection is extended with synonyms obtained automatically using the semantic similarity tools of the RusVectores project (Kutuzov and Kuzmenko, 2017). After that, the news and fiction sub-corpora of the Taiga corpus (Shavrina and Shapovalova, 2017) are filtered to extract short texts containing these keywords. Each text is annotated via Toloka as documented in Appendix E.1.
Text: “Pechen'kami sobstvennogo prigotovleniya nagradila 100-letnyaya Greta Plokh malysha, kotoryy pomog yey pereyti cherez ozhivlennoye shosse po peshekhodnomu perekhodu.” (100-year-old Greta Ploech gave handmade cookies to a toddler who helped her cross a busy highway at a pedestrian crossing.)
Labels1: ✓ (Virtue), ✗ (Law), ✗ (Moral), ✓ (Justice), ✓ (Utilitarianism)
Labels2: ✓ (Virtue), ✓ (Law), ✓ (Moral), ✓ (Justice), ✓ (Utilitarianism)
4 Design
4.1 Evaluation Principles
This section outlines our evaluation principles, which are based on methodological recommendations and open research questions discussed by Bragg et al. (2021), Bowman and Dahl (2021), and Beltagy et al. (2022), including sample size design, a varying number of shots, reporting variability, diagnostic performance analysis, and adversarial test sets. Figure 1 describes the TAPE design.
Data Sampling. Each task in our benchmark consists of a training set D_train with labeled examples and a test set D_test. The splits are randomly sampled, except for RuOpenBookQA, RuWorldTree, and CheGeKa, where we use the original splits. We keep the dataset size up to 1k and purposefully include imbalanced data for the text classification tasks.
No extra data. We do not provide validation sets nor any additional unlabeled data, in order to test the zero-shot and few-shot generalization capabilities of LMs (Bao et al., 2019; Tam et al., 2021).
Number of shots. We consider k ∈ {1, 4, 8} for few-shot evaluation to account for sensitivity to the number of shots. We also include zero-shot evaluation, which can be a strong baseline and simulate scenarios where no supervision is available.
Episode sampling. We provide 5 episodes in each k-shot setting, k ∈ {1, 4, 8}, and report the standard deviation over the episodes to estimate the variability due to the selection of demonstrations (Schick and Schütze, 2021). Each episode E_i = (E^i_train,k, D^A_test) consists of k demonstration examples E^i_train,k randomly sampled from D_train with replacement, and a single test set D^A_test acquired via the combination of original and adversarial test data.
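The episode construction can be sketched as follows; this is a minimal illustration under our own assumptions about the data representation, and the function and variable names are ours rather than part of the released code.

```python
import random

def sample_episodes(train_set, adversarial_test_set, k, n_episodes=5, seed=0):
    """Sketch: each episode pairs k demonstrations drawn from the training set
    (with replacement) with the fixed adversarial test set."""
    rng = random.Random(seed)
    return [
        {
            "demonstrations": [rng.choice(train_set) for _ in range(k)],
            "test": adversarial_test_set,
        }
        for _ in range(n_episodes)
    ]

# Toy data standing in for D_train and D^A_test.
train = [{"text": f"train example {i}", "label": i % 2} for i in range(100)]
adv_test = [{"text": f"test example {i}", "label": i % 2} for i in range(20)]

# Five episodes per k-shot setting; k = 0 (zero-shot) yields empty demonstration lists.
episodes = {k: sample_episodes(train, adv_test, k) for k in (0, 1, 4, 8)}
```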
Subpopulations. Subpopulations (Goel et al., 2021) are utilized for fine-grained performance analysis w.r.t. such properties of D_test as length, domain, and others.
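As an illustration of what subpopulation-level reporting might look like, the sketch below groups test examples by a slicing criterion and reports per-group accuracy; the slicer and all names are illustrative assumptions on our side, not TAPE internals.

```python
from collections import defaultdict

def accuracy_by_subpopulation(examples, predictions, slicer):
    """Group test examples into subpopulations with `slicer` and report per-group accuracy."""
    buckets = defaultdict(list)
    for example, prediction in zip(examples, predictions):
        buckets[slicer(example)].append(prediction == example["label"])
    return {name: sum(hits) / len(hits) for name, hits in buckets.items()}

# Example slicer: bucket by input length, one of the properties mentioned above.
def length_bucket(example):
    return "short" if len(example["text"].split()) < 20 else "long"

examples = [{"text": "a " * n, "label": n % 2} for n in range(5, 45, 5)]
predictions = [1, 0, 1, 0, 1, 0, 1, 0]
print(accuracy_by_subpopulation(examples, predictions, length_bucket))
```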
Robustness. LMs are susceptible to adversarial examples, purposefully designed to force them to output a wrong prediction given a modified input (Ebrahimi et al., 2018; Liang et al., 2018; Jia and Liang, 2017). We analyze the LMs' robustness to different types of adversarial data transformations. Here, each E^i_train,k corresponds to T + 1 test variations, including the original D_test and T adversarial test sets D^A_test, acquired through the modification of D_test. T depends on the dataset and can be adjusted based on the user's needs.
4.2 Adversarial Framework
4.2.1 Attacks and Perturbations
Table 1 summarizes the TAPE adversarial attacks and perturbations based on the generally accepted typology (Zhang et al., 2020; Wang et al., 2021b).

Word-level Perturbations. Word-level perturbations utilize several strategies to perturb tokens, ranging from the imitation of typos (Jin et al., 2020) to synonym replacement (Wei and Zou, 2019). We consider the following:
Spelling. BUTTERFINGERS is a typo-based perturbation that adds noise to the data by mimicking spelling mistakes made by humans through character swaps based on keyboard distance.
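A minimal sketch of this kind of keyboard-distance typo noise is given below. It is our own simplified re-implementation with a toy neighbour map for a few Latin letters, not the exact perturbation used in TAPE; the probability argument corresponds to the adversarial threshold discussed in §4.2.2.

```python
import random

# Toy keyboard-neighbour map; the real perturbation would use full (Cyrillic) keyboard layouts.
NEIGHBOURS = {
    "a": "qwsz", "s": "awedxz", "d": "serfcx", "e": "wsdr",
    "o": "iklp", "n": "bhjm", "t": "rfgy", "i": "ujko",
}

def butterfingers(text, probability=0.1, seed=0):
    """Replace characters with keyboard neighbours with the given probability."""
    rng = random.Random(seed)
    out = []
    for char in text:
        neighbours = NEIGHBOURS.get(char.lower())
        if neighbours and rng.random() < probability:
            out.append(rng.choice(neighbours))
        else:
            out.append(char)
    return "".join(out)

print(butterfingers("a person walks in the direction opposite to the compass needle", 0.2))
```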
Modality. EMOJIFY replaces the input words with the corresponding emojis, preserving their original meaning. A few authors have manually validated translations of the English emoji dictionary.
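The word-to-emoji substitution could be approximated as follows; the toy dictionary and names are our own assumptions, whereas the actual perturbation relies on the manually validated Russian translation of an English emoji dictionary.

```python
import random

# Toy word-to-emoji dictionary; the real perturbation uses a validated Russian lexicon.
EMOJI_MAP = {"river": "🏞️", "water": "💧", "cookie": "🍪", "dog": "🐕", "house": "🏠"}

def emojify(text, probability=0.5, seed=0):
    """Replace dictionary words with emojis that preserve their meaning."""
    rng = random.Random(seed)
    tokens = []
    for token in text.split():
        replacement = EMOJI_MAP.get(token.lower().strip(".,!?"))
        if replacement and rng.random() < probability:
            tokens.append(replacement)
        else:
            tokens.append(token)
    return " ".join(tokens)

print(emojify("The Getar is a river in Armenia, and its water flows into the Hrazdan.", probability=1.0))
```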
Sentence-level Perturbations. In contrast to word-level perturbations, sentence-level perturbation techniques affect the syntactic structure:
Random. Easy Data Augmentation (EDA; Wei and Zou, 2019) has proved to be effective in fooling LMs on text classification tasks. We use two EDA configurations: swapping words (EDA_SWAP) and deleting tokens (EDA_DELETE).
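A rough sketch of the two EDA configurations, simplified for illustration and written under our own assumptions about tokenization (whitespace splitting):

```python
import random

def eda_swap(tokens, n_swaps=1, seed=0):
    """Randomly swap pairs of tokens (EDA_SWAP)."""
    rng = random.Random(seed)
    tokens = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def eda_delete(tokens, probability=0.1, seed=0):
    """Randomly delete tokens with the given probability (EDA_DELETE)."""
    rng = random.Random(seed)
    kept = [token for token in tokens if rng.random() >= probability]
    return kept if kept else [rng.choice(tokens)]  # never return an empty sentence

sentence = "What property of water will change when the water reaches the freezing point".split()
print(" ".join(eda_swap(sentence)))
print(" ".join(eda_delete(sentence)))
```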
Paraphrasis. BACKTRANSLATION (Yaseen and Langer, 2021) allows generating linguistic variations of the input without changing named entities. We use the OpusMT model (hf.co/Helsinki-NLP/opus-mt) to translate the input text into English and back to Russian.
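Back-translation through English can be sketched with Marian checkpoints from the Hugging Face transformers library. The specific ru-en/en-ru model names and generation settings below are our assumption; the paper only specifies the OpusMT family.

```python
from transformers import MarianMTModel, MarianTokenizer

def load(model_name):
    return MarianTokenizer.from_pretrained(model_name), MarianMTModel.from_pretrained(model_name)

def translate(text, tokenizer, model):
    batch = tokenizer([text], return_tensors="pt", truncation=True)
    generated = model.generate(**batch, max_length=256)
    return tokenizer.decode(generated[0], skip_special_tokens=True)

# Assumed OpusMT checkpoints for the ru -> en -> ru round trip.
ru_en_tok, ru_en = load("Helsinki-NLP/opus-mt-ru-en")
en_ru_tok, en_ru = load("Helsinki-NLP/opus-mt-en-ru")

def back_translate(text):
    """Paraphrase the input by translating it into English and back into Russian."""
    english = translate(text, ru_en_tok, ru_en)
    return translate(english, en_ru_tok, en_ru)

print(back_translate("Гетар — река в Армении."))
```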
Distraction. ADDSENT is an adversarial attack that generates extra words or sentences with the help of a generative text model. We pass the input to the mGPT LM (hf.co/THUMT/mGPT) and generate continuations with the sampling strategy. In the multiple-choice QA tasks, we replace one or more incorrect answers with their generated alternatives.
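The distractor generation step can be sketched with the Hugging Face causal-LM API. The checkpoint name and sampling hyperparameters below are our assumptions; substitute the mGPT checkpoint referenced above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "ai-forever/mGPT"  # assumed public multilingual GPT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def add_sent(text, max_new_tokens=20, seed=0):
    """Append a sampled continuation to the input, acting as a distracting sentence."""
    torch.manual_seed(seed)
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,          # sampling strategy, as in the attack description
        top_p=0.95,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(add_sent("Гетар — река в Армении."))
```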
4.2.2 Data Curation
Adversarial perturbations and attacks are effectively utilized to exploit weaknesses in LMs (Goel et al., 2021). At the same time, popular techniques may distort semantic meanings or generate invalid adversarial examples (Wang et al., 2021a). We aim to address this problem by using: (i) adversarial probability thresholds, (ii) task-specific constraints, and (iii) semantic filtering.
Probability thresholds. The degree of input modification can be controlled with an adversarial probability threshold, which serves as a hyperparameter: the higher the probability, the more the input gets modified. The thresholds used in our experiments are specified in Table 1.