Lexical Generalization Improves with Larger Models and Longer Training
Elron Bandel1,2   Yoav Goldberg1,3   Yanai Elazar3,4
1Computer Science Department, Bar Ilan University
2IBM Research   3Allen Institute for Artificial Intelligence
4Paul G. Allen School of Computer Science & Engineering, University of Washington
elron.bandel@gmail.com
Abstract

While fine-tuned language models perform well on many tasks, they have also been shown to rely on superficial surface features such as lexical overlap. Excessive use of such heuristics can lead to failure on challenging inputs. We analyze the use of lexical overlap heuristics in natural language inference, paraphrase detection, and reading comprehension (using a novel contrastive dataset), and find that larger models are much less susceptible to adopting lexical overlap heuristics. We also find that longer training leads models to abandon lexical overlap heuristics. Finally, we provide evidence that the disparity between model sizes has its source in the pre-trained model.1
1 Introduction

Pretrained Language Models (PLMs) have dramatically improved performance on a wide range of NLP tasks, leading benchmarks that claim to track progress in language understanding, like GLUE (Wang et al., 2018) and SQuAD (Rajpurkar et al., 2016), to be considered "solved". However, many works show these models to be brittle and to generalize poorly to "out-of-distribution examples" (Naik et al., 2018; McCoy et al., 2019; Gardner et al., 2020).
One of the reasons for the poor generalization is that models adopt superficial heuristics from the training data, such as lexical overlap: a simple match of words between two textual instances. For instance, a paraphrase-detection model that makes use of this heuristic may determine that two sentences are paraphrases of each other simply by comparing their bags-of-words. While this heuristic sometimes works, it is also often wrong, as demonstrated in Figure 1.
1 Code and data are available at: https://github.com/elronbandel/lexical-generalization.
Figure 1: Illustration of the difference in behavior on the question pair "Can a bad person become good?" / "Can a good person become bad?". A paraphrase detection model that is too small, or trained too briefly, relies on the lexical overlap heuristic (the two questions have identical bags-of-words) and wrongly predicts Paraphrase. In contrast, a model that is larger, or trained for longer, does not rely on lexical overlap and correctly predicts Not Paraphrase.
Indeed, while such heuristics are effective for solving large in-domain datasets, various tests expose that models often rely on them, and are thus right for the wrong reasons (McCoy et al., 2019; Zhang et al., 2019). Consequently, different works have tackled these problems, proposing algorithms and specialized methods for reducing the use of such heuristics and improving models' generalization (He et al., 2019; Utama et al., 2020; Moosavi et al., 2020; Tu et al., 2020; Liu et al., 2022).
In this work, we link the adoption of the lexical overlap heuristic to the size of PLMs and to the number of iterations in the fine-tuning process. We show that much of the benefit of the above methods can be achieved simply by using larger PLMs and finetuning them for longer. We show that larger PLMs and longer-trained models are much less prone to rely on lexical overlap, despite this not being manifested on standard validation sets. We validate these findings on three widely used PLMs: BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and ALBERT (Lan et al., 2019), and three tasks: Natural Language Inference (NLI), Paraphrase Detection (PD), and Reading Comprehension (RC). For RC, we collect a new contrastive test set, ALSQA, based on SQuAD 2.0 (Rajpurkar et al., 2018), while controlling for the lexical overlap between the texts and the questions; it contains 365 examples.
2 Related Work

This work relates to a line of research using behavioral methods for understanding model behavior (Ribeiro et al., 2020; Elazar et al., 2021a; Vig et al., 2020), and more specifically to the extent to which specific heuristics are used by models for prediction (Poliak et al., 2018; Naik et al., 2018; McCoy et al., 2019; Jacovi et al., 2021; Elazar et al., 2021b). The use of heuristics such as lexical overlap also typically points to models' lexical under-sensitivity (Welbl et al., 2020), studied in different setups (Iyyer et al., 2018; Jia and Liang, 2017; Gan and Ng, 2019; Misra et al., 2020; Belinkov and Bisk, 2018; Ribeiro et al., 2020; Ebrahimi et al., 2018; Bandel et al., 2022).
Our work suggests that the size of PLMs affects their inductive bias (Haussler, 1988) towards the preferred strategy. Studies of inductive biases in NLP have gained attention recently (Dyer et al., 2016; Battaglia et al., 2018; Dhingra et al., 2018; Ravfogel et al., 2019; McCoy et al., 2020a). While Warstadt et al. (2020) studied the effect of the amount of pre-training data on linguistic generalization, we show that additional training iterations and larger models affect models' inductive biases with respect to the use of lexical overlap heuristics.
Finally, Tu et al. (2020) show that the adoption of heuristics can be explained by the ability to generalize from the few examples that are not aligned with the heuristics. They show that larger model size and longer fine-tuning can marginally increase the model's ability to generalize from minority groups, an insight also discussed by Djolonga et al. (2021). Our exclusive focus on lexical heuristics, together with a robust, fine-grained experimental setup, leads us to conclude, unlike Tu et al. (2020), that the change in lexical-heuristic adoption behavior is consistent and non-marginal.
3 Measuring Reliance on Lexical Overlap

Definition: Reliance on Lexical Overlap We say that a model makes use of lexical heuristics if it relies on superficial lexical cues in the text, making use of their identity rather than the semantics of the text.
3.1 Experimental Design

We focus on tasks that involve some inference over text pairs. We estimate the usage of the lexical-overlap heuristic by training a model on a standard training set for the task, and then inspecting the model's predictions on a diagnostic set containing high lexical overlap instances,2 where half of the instances are consistent with the heuristic and half are inconsistent (e.g., the first two rows of each dataset in Figure 2, respectively). Models that rely on the heuristic will perform well on the consistent group and significantly worse on the inconsistent one.
Metric We define the HEURistic score (HEUR) of a model on a diagnostic set as the difference between the model's performance on the consistent examples and its performance on the inconsistent examples. Higher HEUR values indicate heavy use of the lexical overlap heuristic (bad), while lower values indicate lower reliance on the heuristic (good).
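The following is a minimal sketch of how HEUR can be computed from per-example predictions (our illustration, not the authors' released code); it assumes each diagnostic example is already tagged as heuristic-consistent or heuristic-inconsistent.

from dataclasses import dataclass
from typing import List

@dataclass
class DiagnosticExample:
    gold: str         # gold label
    pred: str         # model prediction
    consistent: bool  # True if the lexical-overlap heuristic points to the gold label

def accuracy(examples: List[DiagnosticExample]) -> float:
    return sum(ex.gold == ex.pred for ex in examples) / len(examples)

def heur_score(examples: List[DiagnosticExample]) -> float:
    # HEUR = accuracy on consistent examples - accuracy on inconsistent examples.
    # Higher values indicate stronger reliance on the lexical overlap heuristic.
    consistent = [ex for ex in examples if ex.consistent]
    inconsistent = [ex for ex in examples if not ex.consistent]
    return accuracy(consistent) - accuracy(inconsistent)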
3.2 Data

For the training sets we use the MNLI dataset (Williams et al., 2018) for Natural Language Inference (NLI; Dagan et al., 2005; Bowman et al., 2015), Quora Question Pairs (QQP; Sharma et al., 2019) for paraphrase detection, and SQuAD 2.0 (Rajpurkar et al., 2018) for Reading Comprehension (RC). The corresponding high-lexical-overlap diagnostic sets are described below:
HANS (McCoy et al., 2019) is an NLI dataset designed as a challenging test set for reliance on different heuristics. Models that predict entailment based solely on one of these heuristics will fail on the half of the examples where the heuristic does not hold. We focus on the lexical overlap portion of the dataset, termed HANS-Overlap.
PAWS (Zhang et al., 2019) is a paraphrase detection dataset containing paraphrase and non-paraphrase pairs with high lexical overlap. The pairs were generated by controlled word swapping and back-translation, followed by a manual filtering step. We use the Quora Question Pairs portion of the dataset (PAWS-QQP).
ALSQA To test lexical overlap heuristic utilization in Reading Comprehension models, we create a new test set: Analyzing Lexically Similar QA (ALSQA). We augment the SQuAD 2.0 dataset (Rajpurkar et al., 2018) by asking crowdworkers to generate questions with high context-overlap from questions with low overlap (these questions are paraphrases of the original questions). In the case of unanswerable questions, annotators were asked to re-write the question without changing its meaning while maintaining the reason for unanswerability.3 ALSQA contains 365 question pairs, 190 with an answer and 174 without. Examples from the dataset are presented in Figure 2 and Appendix B.

3 Full details are provided in Appendix A.
2 We define the lexical overlap between a pair of texts as the number of unique words that appear in both texts, divided by the number of unique words in the shorter text. The definition for ALSQA is similar, but we consider lemmas instead of words and ignore function words and proper names.
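As an illustration of this measure (our sketch; the paper does not specify the tokenizer, and the lemma-based ALSQA variant with function-word and proper-name filtering is not implemented here), the word-level overlap can be computed as:

import re

def lexical_overlap(text1: str, text2: str) -> float:
    # Lowercased word tokenization via a simple regex (an assumption).
    words1 = set(re.findall(r"\w+", text1.lower()))
    words2 = set(re.findall(r"\w+", text2.lower()))
    # Normalize by the number of unique words in the shorter of the two texts.
    shorter = words1 if len(text1) <= len(text2) else words2
    return len(words1 & words2) / len(shorter)

# The PAWS pair from Figure 2 has full lexical overlap (1.0),
# even though the two questions are not paraphrases:
lexical_overlap("Can a bad person become good?",
                "Can a good person become bad?")  # -> 1.0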
Dataset | Text1 | Text2 | Label
HANS | The banker near the judge saw the actor. | The banker saw the actor. | E ✓
HANS | The doctors visited the lawyer. | The lawyer visited the doctors. | NE ✗
PAWS | What should I prefer study or job? | What should I prefer job or study? | P ✓
PAWS | Can a bad person become good? | Can a good person become bad? | NP ✗
ALSQA | ... "downsize" revision of vehicle categories. By 1977, GM's full-sized cars reflected the crisis. By 1979, virtually all "full-size" American cars had shrunk, featuring smaller engines and smaller outside dimensions. Chrysler ended production of their full-sized luxury sedans at the end of the 1981 model year ... | By which year did full sized American cars shrink to be smaller? | A ✓
ALSQA | (same context as above) | What vehicle category did Chrysler change to in 1977? | NA ✗

Figure 2: Examples from the datasets. All examples have high in-pair lexical overlap; in the ALSQA examples, overlapping content words are colored in blue in the original figure. Text1 and Text2 correspond to the premise and hypothesis in HANS, the two sentences in PAWS, and the context and question in ALSQA. The labels for HANS pairs are Entailment (E) or Non-Entailment (NE); for PAWS, Paraphrase (P) or Non-Paraphrase (NP); for ALSQA, Answerable (A) or Non-Answerable (NA). An example is marked with ✓ if the lexical overlap points to the label (consistent-with-heuristic), and with ✗ otherwise.
Figure 3: "Larger is better." ELECTRA models of different sizes perform equally on the subset of ALSQA that is consistent with the lexical overlap heuristic (✓); on the inconsistent subset (✗), larger models are less likely to adopt the heuristic, and therefore generalize better.
4 Experiments and Results

Setup We experiment with three strong PLMs: BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and ELECTRA (Clark et al., 2020), each with a few size variants. Since performance after finetuning can vary, both in-domain (Dodge et al., 2020) and out-of-domain (McCoy et al., 2020b), we follow Clark et al. (2020) and finetune every model six times, with different seeds and learning rates (specified in Appendix C), and report the median of these results. All models achieve results comparable to those reported in the literature. For each model we consider different stages of the training process: the early variant is after one epoch, while the late variant is after six epochs.
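As a small sketch of this reporting protocol (our illustration; the accuracy values come from whatever the six finetuning runs produce, not from the paper):

from statistics import median
from typing import Dict, List

def report_median(per_run_accuracy: List[Dict[str, float]]) -> Dict[str, float]:
    # per_run_accuracy holds one dict per (seed, learning rate) finetuning run,
    # e.g. {"early": acc_after_1_epoch, "late": acc_after_6_epochs}.
    # The reported number for each stage is the median over the six runs.
    return {stage: median(run[stage] for run in per_run_accuracy)
            for stage in ("early", "late")}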
NLI and PD are text-pair classification tasks. RC, however, is a span prediction task; as such, we explore two versions of it: (1) the regular span prediction task, and (2) a text-pair classification task where the goal is to predict whether the question is answerable from the text, which we call Answerability.
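A minimal sketch of the Answerability formulation (an assumption about the preprocessing, not the authors' released code): SQuAD 2.0-style examples can be mapped to binary labels using the dataset's is_impossible field.

from typing import List, Tuple

def to_answerability_examples(squad_data: dict) -> List[Tuple[str, str, int]]:
    # Convert SQuAD 2.0 JSON (the top-level dict with a "data" field) into
    # (context, question, label) triples, where 1 = answerable and 0 = not.
    examples = []
    for article in squad_data["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                label = 0 if qa.get("is_impossible", False) else 1
                examples.append((context, qa["question"], label))
    return examples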
Finding I: Larger is Better (but it is not reflected on the dev set) We report the results in Table 1. Larger models perform consistently better on the lexical challenge across tasks and models: for NLI, the early BERT models gradually improve their HEUR score from 93.7 in the base version to 63.5 in the large version. Similarly for PD, the early RoBERTa improves its HEUR score from 84.4 in the base version to 76.0 in the large version. The differences in HEUR scores are remarkable, as the improvement on the standard validation set is relatively small compared with the improvement on HEUR. For instance, RoBERTa-early on HANS improves its dev performance from the base to the large version by 2.2%, while HEUR improves by 79.4% in relative terms. Apart from the size trend, the absolute numbers also differ significantly between tasks. RoBERTa-large practically "solved" this part of HANS, with a median accuracy of 96%. On the other hand,