Lexical Generalization Improves with Larger Models and Longer Training
Elron Bandel1,2   Yoav Goldberg1,3   Yanai Elazar3,4
1Computer Science Department, Bar Ilan University
2IBM Research   3Allen Institute for Artificial Intelligence
4Paul G. Allen School of Computer Science & Engineering, University of Washington
elron.bandel@gmail.com
Abstract

While fine-tuned language models perform well on many tasks, they have also been shown to rely on superficial surface features such as lexical overlap. Excessive use of such heuristics can lead to failure on challenging inputs. We analyze the use of lexical overlap heuristics in natural language inference, paraphrase detection, and reading comprehension (using a novel contrastive dataset), and find that larger models are much less susceptible to adopting lexical overlap heuristics. We also find that longer training leads models to abandon lexical overlap heuristics. Finally, we provide evidence that the disparity between model sizes has its source in the pre-trained model.1
1 Introduction

Pretrained Language Models (PLMs) have dramatically improved performance on a wide range of NLP tasks, leading benchmarks that claim to track progress in language understanding, like GLUE (Wang et al., 2018) and SQuAD (Rajpurkar et al., 2016), to be considered "solved". However, many works show these models to be brittle and to generalize poorly to "out-of-distribution examples" (Naik et al., 2018; McCoy et al., 2019; Gardner et al., 2020).
One of the reasons for the poor generalization is that models adopt superficial heuristics from the training data, such as lexical overlap: a simple match of words between two textual instances. For instance, a paraphrase-detection model that makes use of this heuristic may determine that two sentences are paraphrases of each other simply by comparing their bags-of-words. While this heuristic sometimes works, it is also often wrong, as demonstrated in Figure 1.
1 Code and data are available at: https://github.com/elronbandel/lexical-generalization.
Figure 1: Illustration of the difference in behavior on the question pair "Can a bad person become good?" / "Can a good person become bad?". A paraphrase detection model that is too small, or trained too briefly, relies on the lexical overlap heuristic (the two questions have identical bags-of-words) and wrongly predicts Paraphrase. In contrast, a model that is larger, or trained for longer, does not rely on lexical overlap and correctly predicts Not Paraphrase.
Indeed, while such heuristics are effective for solving large in-domain datasets, various tests expose that models often rely on them, and are thus right for the wrong reasons (McCoy et al., 2019; Zhang et al., 2019). Consequently, different works have tackled these problems, proposing algorithms and specialized methods for reducing the use of such heuristics and improving models' generalization (He et al., 2019; Utama et al., 2020; Moosavi et al., 2020; Tu et al., 2020; Liu et al., 2022).
In this work, we link the adoption of the lexical overlap heuristic to the size of PLMs and to the number of iterations in the fine-tuning process. We show that much of the benefit of the above methods can be achieved simply by using larger PLMs and finetuning them for longer. We show that larger PLMs and longer-trained models are much less prone to rely on lexical overlap, despite this not being manifested on standard validation sets. We validate these findings on three widely used PLMs: BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and ALBERT (Lan et al., 2019), and three tasks: Natural Language Inference (NLI), Paraphrase Detection (PD), and Reading Comprehension (RC). For RC, we collect a new contrastive test set, ALSQA, based on SQuAD 2.0 (Rajpurkar et al., 2018), while controlling for the lexical overlap between the texts and the questions; it contains 365 examples.
2 Related Work

This work relates to a line of research using behavioral methods for understanding model behavior (Ribeiro et al., 2020; Elazar et al., 2021a; Vig et al., 2020), and more specifically to the extent to which specific heuristics are used by models for prediction (Poliak et al., 2018; Naik et al., 2018; McCoy et al., 2019; Jacovi et al., 2021; Elazar et al., 2021b). The use of heuristics such as lexical overlap also typically points to models' lexical under-sensitivity (Welbl et al., 2020), studied in different setups (Iyyer et al., 2018; Jia and Liang, 2017; Gan and Ng, 2019; Misra et al., 2020; Belinkov and Bisk, 2018; Ribeiro et al., 2020; Ebrahimi et al., 2018; Bandel et al., 2022).
Our work suggests that the size of PLMs affects their inductive bias (Haussler, 1988) towards the preferred strategy. Studies of inductive biases in NLP have gained attention recently (Dyer et al., 2016; Battaglia et al., 2018; Dhingra et al., 2018; Ravfogel et al., 2019; McCoy et al., 2020a). While Warstadt et al. (2020) studied the effect of the amount of pre-training data on linguistic generalization, we show that additional training iterations and larger models affect models' inductive biases with respect to the use of lexical overlap heuristics.
Finally, Tu et al. (2020) show that the adoption of heuristics can be explained by the ability to generalize from the few examples that are not aligned with the heuristics. They show that larger model size and longer fine-tuning can marginally increase the model's ability to generalize from minority groups, an insight also discussed by Djolonga et al. (2021). Our exclusive focus on lexical heuristics, together with a robust, fine-grained experimental setup, leads us to conclude, unlike Tu et al. (2020), that the change in lexical-heuristic adoption behavior is consistent and non-marginal.
3 Measuring Reliance on Lexical Overlap

Definition: Reliance on Lexical Overlap We say that a model makes use of lexical heuristics if it relies on superficial lexical cues in the text, making use of their identity rather than the semantics of the text.
3.1 Experimental Design

We focus on tasks that involve some inference over text pairs. We estimate the usage of the lexical-overlap heuristic by training a model on a standard training set for the task, and then inspecting the model's predictions on a diagnostic set containing high lexical overlap instances,2 where half of the instances are consistent with the heuristic and half are inconsistent (e.g., the first two rows of each dataset in Figure 2, respectively). Models that rely on the heuristic will perform well on the consistent group and significantly worse on the inconsistent one.
Metric We define the HEURistic score (HEUR) of a model on a diagnostic set as the difference between the model's performance on the consistent examples and its performance on the inconsistent examples. Higher HEUR values indicate heavy use of the lexical overlap heuristic (bad), while lower values indicate lower reliance on the heuristic (good).
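The following is a minimal sketch of how HEUR can be computed from per-example predictions (our illustration, not the authors' released code); it assumes each diagnostic example is already tagged as heuristic-consistent or heuristic-inconsistent.

from dataclasses import dataclass
from typing import List

@dataclass
class DiagnosticExample:
    gold: str         # gold label
    pred: str         # model prediction
    consistent: bool  # True if the lexical-overlap heuristic points to the gold label

def accuracy(examples: List[DiagnosticExample]) -> float:
    return sum(ex.gold == ex.pred for ex in examples) / len(examples)

def heur_score(examples: List[DiagnosticExample]) -> float:
    # HEUR = accuracy on consistent examples - accuracy on inconsistent examples.
    # Higher values indicate stronger reliance on the lexical overlap heuristic.
    consistent = [ex for ex in examples if ex.consistent]
    inconsistent = [ex for ex in examples if not ex.consistent]
    return accuracy(consistent) - accuracy(inconsistent)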
3.2 Data

For the training sets we use the MNLI dataset (Williams et al., 2018) for Natural Language Inference (NLI; Dagan et al., 2005; Bowman et al., 2015), Quora Question Pairs (QQP; Sharma et al., 2019) for paraphrase detection, and SQuAD 2.0 (Rajpurkar et al., 2018) for Reading Comprehension (RC). The corresponding high-lexical-overlap diagnostic sets are described below:
HANS (McCoy et al., 2019) is an NLI dataset designed as a challenging test set for reliance on different heuristics. Models that predict entailment based solely on one of these heuristics will fail on the half of the examples where the heuristic does not hold. We focus on the lexical overlap portion of the dataset, termed HANS-Overlap.
PAWS (Zhang et al., 2019) is a paraphrase detection dataset containing paraphrase and non-paraphrase pairs with high lexical overlap. The pairs were generated by controlled word swapping and back-translation, followed by a manual filtering step. We use the Quora Question Pairs portion of the dataset (PAWS-QQP).
ALSQA To test lexical overlap heuristic utilization in Reading Comprehension models, we create a new test set: Analyzing Lexically Similar QA (ALSQA). We augment the SQuAD 2.0 dataset (Rajpurkar et al., 2018) by asking crowdworkers to generate questions with high context-overlap from questions with low overlap (these questions are paraphrases of the original questions). In the case of unanswerable questions, annotators were asked to re-write the question without changing its meaning while maintaining the reason for unanswerability.3 ALSQA contains 365 question pairs, 190 with an answer and 174 without. Examples from the dataset are presented in Figure 2 and Appendix B.

3 Full details are provided in Appendix A.
2 We define the lexical overlap between a pair of texts as the number of unique words that appear in both texts, divided by the number of unique words in the shorter text. The definition for ALSQA is similar, but we consider lemmas instead of words and ignore function words and proper names.
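As an illustration of this measure (our sketch; the paper does not specify the tokenizer, and the lemma-based ALSQA variant with function-word and proper-name filtering is not implemented here), the word-level overlap can be computed as:

import re

def lexical_overlap(text1: str, text2: str) -> float:
    # Lowercased word tokenization via a simple regex (an assumption).
    words1 = set(re.findall(r"\w+", text1.lower()))
    words2 = set(re.findall(r"\w+", text2.lower()))
    # Normalize by the number of unique words in the shorter of the two texts.
    shorter = words1 if len(text1) <= len(text2) else words2
    return len(words1 & words2) / len(shorter)

# The PAWS pair from Figure 2 has full lexical overlap (1.0),
# even though the two questions are not paraphrases:
lexical_overlap("Can a bad person become good?",
                "Can a good person become bad?")  # -> 1.0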
Dataset | Text1 | Text2 | Label
HANS | The banker near the judge saw the actor. | The banker saw the actor. | E ✓
HANS | The doctors visited the lawyer. | The lawyer visited the doctors. | NE ✗
PAWS | What should I prefer study or job? | What should I prefer job or study? | P ✓
PAWS | Can a bad person become good? | Can a good person become bad? | NP ✗
ALSQA | ... "downsize" revision of vehicle categories. By 1977, GM's full-sized cars reflected the crisis. By 1979, virtually all "full-size" American cars had shrunk, featuring smaller engines and smaller outside dimensions. Chrysler ended production of their full-sized luxury sedans at the end of the 1981 model year ... | By which year did full sized American cars shrink to be smaller? | A ✓
ALSQA | (same context as above) | What vehicle category did Chrysler change to in 1977? | NA ✗

Figure 2: Examples from the datasets. All examples have high in-pair lexical overlap; in the ALSQA examples, overlapping content words are colored in blue in the original figure. Text1 and Text2 correspond to the premise and hypothesis in HANS, the two sentences in PAWS, and the context and question in ALSQA. The labels for HANS pairs are Entailment (E) or Non-Entailment (NE); for PAWS, Paraphrase (P) or Non-Paraphrase (NP); for ALSQA, Answerable (A) or Non-Answerable (NA). An example is marked with ✓ if the lexical overlap points to the label (consistent-with-heuristic), and with ✗ otherwise.
Figure 3: "Larger is better." ELECTRA models of different sizes perform equally on the subset of ALSQA that is consistent with the lexical overlap heuristic (✓); on the inconsistent subset (✗), larger models are less likely to adopt the heuristic, and therefore generalize better.
4 Experiments and Results

Setup We experiment with three strong PLMs: BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and ELECTRA (Clark et al., 2020), each with a few size variants. Since performance after finetuning can vary, both in-domain (Dodge et al., 2020) and out-of-domain (McCoy et al., 2020b), we follow Clark et al. (2020) and finetune every model six times, with different seeds and learning rates (specified in Appendix C), and report the median of these results. All models achieve results comparable to those reported in the literature. For each model we consider different stages of the training process: the early variant is after one epoch, while the late variant is after six epochs.
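As a small sketch of this reporting protocol (our illustration; the accuracy values come from whatever the six finetuning runs produce, not from the paper):

from statistics import median
from typing import Dict, List

def report_median(per_run_accuracy: List[Dict[str, float]]) -> Dict[str, float]:
    # per_run_accuracy holds one dict per (seed, learning rate) finetuning run,
    # e.g. {"early": acc_after_1_epoch, "late": acc_after_6_epochs}.
    # The reported number for each stage is the median over the six runs.
    return {stage: median(run[stage] for run in per_run_accuracy)
            for stage in ("early", "late")}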
NLI and PD are text-pair classification tasks. RC, however, is a span prediction task; as such, we explore two versions of it: (1) the regular span prediction task, and (2) a text-pair classification task where the goal is to predict whether the question is answerable from the text, which we call Answerability.
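A minimal sketch of the Answerability formulation (an assumption about the preprocessing, not the authors' released code): SQuAD 2.0-style examples can be mapped to binary labels using the dataset's is_impossible field.

from typing import List, Tuple

def to_answerability_examples(squad_data: dict) -> List[Tuple[str, str, int]]:
    # Convert SQuAD 2.0 JSON (the top-level dict with a "data" field) into
    # (context, question, label) triples, where 1 = answerable and 0 = not.
    examples = []
    for article in squad_data["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                label = 0 if qa.get("is_impossible", False) else 1
                examples.append((context, qa["question"], label))
    return examples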
Finding I: Larger is Better (but it is not reflected on the dev set) We report the results in Table 1. Larger models perform consistently better on the lexical challenge across tasks and models: for NLI, the early BERT models gradually improve their HEUR score from 93.7 in the base version to 63.5 in the large version. Similarly for PD, the early RoBERTa improves its HEUR score from 84.4 in the base version to 76.0 in the large version. The differences in HEUR scores are remarkable, as the improvement on the standard validation set is relatively small compared with the improvement on HEUR. For instance, RoBERTa-early on HANS improves its dev performance from the base to the large version by 2.2%, while HEUR improves by 79.4% in relative terms. Apart from the size trend, the absolute numbers also differ significantly between tasks. RoBERTa-large practically "solved" this part of HANS, with a median accuracy of 96%. On the other hand,