What do Large Language Models Learn beyond Language?
Avinash Madasu Shashank Srivastava
UNC Chapel Hill
{avinashm,ssrivastava}@cs.unc.edu
Abstract
Large language models (LMs) have rapidly become a mainstay in Natural Language Processing. These models are known to acquire rich linguistic knowledge from training on large amounts of text. In this paper, we investigate if pre-training on text also confers these models with helpful ‘inductive biases’ for non-linguistic reasoning. On a set of 19 diverse non-linguistic tasks involving quantitative computations, recognizing regular expressions and reasoning over strings, we find that pretrained models significantly outperform comparable non-pretrained neural models. This remains true even in experiments where non-pretrained models are trained with fewer parameters to account for model regularization effects. We further explore the effect of text domain on LMs by pretraining models on text from different domains and provenances. Our experiments surprisingly reveal that the positive effects of pre-training persist even when pretraining on multi-lingual text or computer code, and even for text generated from synthetic languages. Our findings suggest a hitherto unexplored deep connection between pre-training and inductive learning abilities of language models¹.
1 Introduction
Pretrained Language Models (LMs) have shown singular success on a range of natural language understanding tasks, to the extent that they have become foundational for contemporary NLP systems. Several works have investigated why pretraining works so well (Warstadt et al., 2019; Zhao et al., 2020). In particular, studies have shown that pretrained LMs like BERT capture linguistic knowledge about syntax (Lin et al., 2019; Wu et al., 2020), semantics (Vulić et al., 2020b,a) and morphology (Hofmann et al., 2020, 2021). In fact, Tenney et al. (2019) demonstrated that learned representations in pretrained LMs even internally reflect the classical NLP pipeline. Since most NLP benchmarks such as SuperGLUE (Wang et al., 2019) are naturally focused on tasks such as textual entailment and reading comprehension that require linguistic knowledge and reasoning, it is unsurprising that LMs have achieved strong results on these tasks. On the other hand, little work so far has explored the abilities of pretrained LMs for learning non-linguistic tasks.

Figure 1: We investigate the effect of pretraining of language models on learning non-linguistic tasks using three task paradigms involving symbolic reasoning.

¹ https://github.com/avinashsai/NILM

In this paper, we explore whether pretraining on text is inherently about learning language, or if pretraining also imbues LMs with skills for symbolic manipulation and non-linguistic reasoning (for example, performing quantitative computation such as finding the median of a set of numbers, recognizing regular expressions, or identifying whether a string is a palindrome, as shown in Figure 1). In other words, we investigate whether and how pretraining develops helpful inductive biases for non-linguistic reasoning. For this analysis, we create a set of 19 tasks from three categories of task paradigms: quantitative computation (§3.1), recognizing regular expressions (§3.2), and string reasoning (§3.3). Figure 1 shows an example for each category, and the full list of tasks is described in Table 1. We experiment with transformer and RNN based LMs (§4) for learning these tasks, and perform a comparative analysis with (non-pretrained) neural model variants from the perspective of learning metrics such as accuracy and sample efficiency.
Task | Input Eg. | Output Eg. | Classes | Input range
Odd classification | 4210 | 0 | 0 - 1 | [1, 20000]
Even classification | 4210 | 1 | 0 - 1 | [1, 20000]
Odd even classification | 4210 even | 1 | 0 - 1 | [1, 20000]
Decimal operation | 872 / 436 | 2 | 0 - 9 | [1, 10000]
Decimal & word operation | four / 2 | 2 | 0 - 9 | [1, 10000]
Mean | 15,-8,15,-5,-14,-3 ? | 0 | 0 - 9 | [-15, 15]
Median | 3,6,5,15,2,3,-6,-2,9,-3,-9,-5,-14 ? | 2 | 0 - 9 | [-15, 15]
Mode | 5,9,7,0,2,5,3,3,3,0 ? | 3 | 0 - 9 | [0, 9]
Recognize {0, 1, 2}*02* | 01202102222 | 1 | 0 - 1 | [0, 2]
Recognize AA*BB*CC*DD*EE* | aaaaaaabbbbcccccddde | 1 | 0 - 1 | [a, e]
Palindrome classification | aWXXWa | 1 | 0 - 1 | [a-z], [A-Z]
Anagram classification | rGrPJhk-khGrPJr | 1 | 0 - 1 | [a-z], [A-Z]
Isogram classification | vFJoSj | 1 | 0 - 1 | [a-z], [A-Z]
Tautonym classification | stPvg-tPvga | 1 | 0 - 1 | [a-z], [A-Z]
Length of a string | teeo | 4 | 0 - 9 | [a-z]
Count of unique characters | deiieediid | 3 | 0 - 9 | [a-j]
Parity check | 011101001110 | 0 | 0 - 1 | [0, 1]
Vowels classification | iivxcmoouo | 0 | 0 - 9 | [a-z]
Maximum frequent character | jjjcjj | 9 (j) | 0 - 9 | [a-j]

Table 1: Description of the non-linguistic tasks with input and output examples. Classes are the class labels for each task. Input range denotes the range of the input operands in each task.
Our experiments (§5) reveal that pretrained models overall perform substantially better and are more sample efficient on most tasks. However, there are significant differences and patterns in performance between task types, as well as variance between different LM architectures. Since non-pretrained models do not have the benefit of regularization that comes from pretraining, a plausible reason for the discrepancy between them and pretrained LMs might be underfitting of the non-pretrained models when trained on comparatively small dataset sizes. To account for this, we also comprehensively explore the effect of model size (§6) of non-pretrained models for both transformer and RNN architectures. We find that the discrepancy in performance remains even for smaller neural models, indicating that the differences are not simply due to a mismatch in model and data sizes.

Finally, we investigate the role that pretraining data plays in influencing task performance on non-linguistic tasks (§7). We experiment with pretraining on different domains of text, pretraining on perturbed representations of natural language text (such as shuffled word order), pretraining on text of computer programs (with no linguistic properties of natural languages), pretraining on multi-lingual and non-English text, and pretraining with synthetic text (data sampled from synthetic distributions). Our analysis reveals that the advantages of pretraining surprisingly persist to various degrees across these variations, suggesting hitherto unexplored connections between pretraining and the learning abilities of language models. Our contributions are:
• We compare a range of pretrained LMs and non-pretrained models on a carefully designed suite of 19 classification tasks that require non-linguistic reasoning.
• We comprehensively explore the role of the pretraining data by experimenting with models pretrained from texts with different provenances.
• We establish that the positive effects of pretraining are not simply due to better model regularization by experimenting with neural models with different complexities and architectures.
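As an illustrative sketch of this comparison (not our exact training setup, which is described in §4), a pretrained classifier and a comparable non-pretrained counterpart can be instantiated as follows, assuming a BERT-style encoder accessed through the HuggingFace transformers library:

```python
# Illustrative sketch only: instantiating a pretrained vs. a randomly
# initialized (non-pretrained) classifier of the same architecture.
# The checkpoint name and label count below are assumptions for illustration;
# see Section 4 for the models and training details actually used.
from transformers import (AutoConfig, AutoTokenizer,
                          AutoModelForSequenceClassification)

model_name = "bert-base-uncased"   # example checkpoint (assumption)
num_labels = 10                    # e.g., digit classes 0-9

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Pretrained LM: weights initialized from the pretraining checkpoint.
pretrained = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=num_labels)

# Non-pretrained counterpart: identical architecture, random initialization.
config = AutoConfig.from_pretrained(model_name, num_labels=num_labels)
non_pretrained = AutoModelForSequenceClassification.from_config(config)

# Both models are then fine-tuned on the same task data (e.g., the median
# task), and accuracy / sample efficiency are compared.
inputs = tokenizer("3,6,5,15,2,3,-6,-2,9,-3,-9,-5,-14 ?", return_tensors="pt")
print(pretrained(**inputs).logits.shape)   # torch.Size([1, 10])
```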
2 Related Work
A body of work has investigated contextual word embeddings to determine whether they capture aspects of mathematical meaning for numbers (Naik et al., 2019). Wallace et al. (2019) probed numeracy in the token embeddings of contextual language models such as ELMo and BERT. Thawani et al. (2021) surveyed numerical understanding in NLP models using 7 sub-tasks such as measurement estimation and word problems. Our work diverges from these in exploring a richer set of tasks, including harder tasks such as set operations. Further, previous methods explore mathematical reasoning tasks posed as language problems, which conflates the problems of language and mathematical learning and also makes the datasets susceptible to biases due to data collection. Our analysis circumvents both these issues by design.

Some previous works have explored the ability of RNN and Transformer architectures for learning regular languages (Weiss et al., 2018; Sennhauser and Berwick, 2018; Suzgun et al., 2019b; Bhattamishra et al., 2020), closing brackets (Skachkova et al., 2018), and dynamic counting (Suzgun et al., 2019a). However, they focus on the learnability of these tasks with specific architectures, and do not look at pretrained LMs, which are our focus here.

Finally, in our discussion, we conceptually stretch the notion of inductive bias. The idea of inductive bias is usually associated with specific model types (McCoy et al., 2020; Kharitonov and Chaabouni, 2021), architectures (Xu et al., 2021; Brutzkus and Globerson, 2021) and regularization approaches (Helmbold and Long, 2015). We believe that extending this to refer to learning tasks with pretrained LMs is both reasonable and useful.
3 NILM
In this section, we describe the tasks used for our analysis, which we refer to as NILM (measuring Non-linguistic Inductive bias in Language Models). The tasks correspond to three task paradigms: (1) quantitative computation, (2) regular expressions, and (3) string reasoning. Each task in NILM is posed as a classification task. The descriptions for all the tasks with input and output examples, class labels and the input range are shown in Table 1. Each task has a synthetically generated dataset with train/dev/test splits². To avoid biases in the datasets, relevant numbers and strings in individual examples are uniformly sampled from the appropriate ranges.

² The training set size for all tasks is 10K, dev set size is 1K and test set size is 1K, except for tasks on recognizing regular expressions, where the test set size is 2K following previous work (Bhattamishra et al., 2020).
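As a minimal sketch of this construction (illustrative only; the exact generation constraints for each task follow Table 1), the dataset for the odd classification task, for example, can be built as follows:

```python
# Minimal sketch of the dataset construction described above, using the
# "Odd classification" task as an example: numbers are sampled uniformly
# from [1, 20000] and split into train/dev/test (10K/1K/1K).
import random

random.seed(0)

def make_odd_example(low=1, high=20000):
    n = random.randint(low, high)   # uniform sampling avoids dataset biases
    return str(n), n % 2            # input string, label 1 if odd else 0

examples = [make_odd_example() for _ in range(12000)]
train, dev, test = examples[:10000], examples[10000:11000], examples[11000:]
print(train[0])   # e.g., ('12541', 1)
```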
3.1 Quantitative computation
This task paradigm focuses on tasks involving arithmetic and set statistics.
Odd classification. Classify if a number is odd.
Even classification. Classify if a number is even.
Odd even classification. For a given number N and a string “even” or “odd”, classify if the number satisfies the string condition.
Decimal operation. Subtract or divide two numbers. Operands are represented in decimal notation.
Decimal & word operation. Subtract or divide two numbers. Operands are represented in decimal or word notation.
Mean. Given a set of numbers, output the mean.
Median. Given a set, output the median.
Mode. Given a set of numbers, output the mode.
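For instance, examples for the decimal operation task might be generated as sketched below. This is an illustrative sketch; since the class labels are the digits 0-9, we assume that only operand pairs whose result is a single non-negative digit are retained:

```python
# Illustrative sketch for generating "Decimal operation" examples
# (subtract or divide two numbers). Operands are sampled uniformly from
# [1, 10000]; pairs whose result is not a single digit are rejected
# (an assumption made here because the class labels are 0-9).
import random

def make_decimal_op_example(low=1, high=10000):
    while True:
        a, b = random.randint(low, high), random.randint(low, high)
        op = random.choice(["-", "/"])
        if op == "-" and 0 <= a - b <= 9:
            return f"{a} - {b}", a - b
        if op == "/" and a % b == 0 and a // b <= 9:
            return f"{a} / {b}", a // b

print(make_decimal_op_example())   # e.g., ('872 / 436', 2)
```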
3.2 Recognizing regular expressions
This task paradigm focuses on recognizing regular expressions. The training data consists of positive and negative examples of strings matching a regular expression (Bhattamishra et al., 2020).
Recognize {0,1,2}*02*. Recognize if a pattern matches {0,1,2}*02*. The maximum length of the patterns is 20.
Recognize AA*BB*CC*DD*EE*. Recognize if a pattern matches AA*BB*CC*DD*EE*. The maximum length of the patterns is 30.
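As an illustration, strings for the first pattern can be labeled with a standard regular-expression matcher, as in the simplified sketch below (the actual datasets follow the generation setup of Bhattamishra et al., 2020; in practice a balanced mix of positive and negative strings would typically be enforced):

```python
# Simplified sketch: labeling strings for the "{0,1,2}*02*" recognition task.
# Random strings over the alphabet {0, 1, 2} (length <= 20) are labeled 1
# iff the whole string matches the pattern.
import random
import re

PATTERN = re.compile(r"[012]*02*")   # {0,1,2}*02* in standard regex syntax

def make_regex_example(max_len=20):
    length = random.randint(1, max_len)
    s = "".join(random.choice("012") for _ in range(length))
    label = int(PATTERN.fullmatch(s) is not None)
    return s, label

print(make_regex_example())
print(PATTERN.fullmatch("01202102222") is not None)   # True (positive example from Table 1)
```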
3.3 String reasoning
This task paradigm focuses on reasoning tasks over
individual strings or pairs of strings.
Palindrome classification. A string is a palindrome if it reads the same forward and backward. The task is to classify whether a given string is a palindrome. The string length ranges from 1 to 15.
Anagram classification. Two strings are anagrams if one is formed by rearranging letters from the other. The task is to classify if a pair of strings are anagrams. The string length ranges from 2 to 15.
Isogram classification. A string is an isogram if it has no repeating characters. The task is to classify whether a given string is an isogram. The string length ranges from 1 to 52.
Tautonym classification. A tautonym is a word which can be broken down into two identical parts, with the same spelling. The task is to classify whether a given string is a tautonym. The string length ranges from 1 to 10.
Length of a string. Output the length of a given string. The string length ranges from 1 to 10.
Count of unique characters. Given a string, count the number of unique characters in it. The string length ranges from 10 to 30.
Parity check. Given a binary string, output if the counts of ones and zeros are the same. The maximum length of the binary string is 20.
Vowels classification. Given a string, classify if the string contains only vowel characters. The string length ranges from 3 to 10.
Maximum frequent character. Given a string, output the character that occurs most frequently in it.
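The labeling functions for these string tasks are straightforward; the sketch below illustrates a few of them (palindrome, isogram, anagram, and tautonym), assuming inputs generated as described above:

```python
# Illustrative labeling functions for some of the string reasoning tasks.
def is_palindrome(s: str) -> int:
    # 1 if the string reads the same forward and backward.
    return int(s == s[::-1])

def is_isogram(s: str) -> int:
    # 1 if no character repeats.
    return int(len(set(s)) == len(s))

def is_anagram(a: str, b: str) -> int:
    # 1 if one string is a rearrangement of the other.
    return int(sorted(a) == sorted(b))

def is_tautonym(s: str) -> int:
    # 1 if the string consists of two identical halves.
    n = len(s)
    return int(n % 2 == 0 and s[:n // 2] == s[n // 2:])

print(is_palindrome("aWXXWa"), is_anagram("rGrPJhk", "khGrPJr"))   # 1 1
```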