What do Large Language Models Learn beyond Language?
Avinash Madasu Shashank Srivastava
UNC Chapel Hill
{avinashm,ssrivastava}@cs.unc.edu
Abstract
Large language models (LMs) have rapidly become a mainstay in Natural Language Processing. These models are known to acquire rich linguistic knowledge from training on large amounts of text. In this paper, we investigate if pre-training on text also confers these models with helpful ‘inductive biases’ for non-linguistic reasoning. On a set of 19 diverse non-linguistic tasks involving quantitative computations, recognizing regular expressions and reasoning over strings, we find that pretrained models significantly outperform comparable non-pretrained neural models. This remains true even in experiments where non-pretrained models are trained with fewer parameters to account for model regularization effects. We further explore the effect of text domain on LMs by pretraining models on text from different domains and provenances. Our experiments surprisingly reveal that the positive effects of pre-training persist even when pretraining on multi-lingual text or computer code, and even for text generated from synthetic languages. Our findings suggest a hitherto unexplored deep connection between pre-training and inductive learning abilities of language models¹.
1 Introduction
Pretrained Language Models (LMs) have shown singular success on a range of natural language understanding tasks, to the extent that they have become foundational for contemporary NLP systems. Several works have investigated why pretraining works so well (Warstadt et al., 2019; Zhao et al., 2020). In particular, studies have shown that pretrained LMs like BERT capture linguistic knowledge about syntax (Lin et al., 2019; Wu et al., 2020), semantics (Vulić et al., 2020b,a) and morphology (Hofmann et al., 2020, 2021). In fact, Tenney et al. (2019) demonstrated that learned representations in pretrained LMs even internally reflect the classical NLP pipeline. Since most NLP benchmarks such as SuperGLUE (Wang et al., 2019) are naturally focused on tasks such as textual entailment and reading comprehension that require linguistic knowledge and reasoning, it is unsurprising that LMs have achieved strong results on these tasks. On the other hand, little work so far has explored the abilities of pretrained LMs for learning non-linguistic tasks.

Figure 1: We investigate the effect of pretraining of language models on learning non-linguistic tasks using three task paradigms involving symbolic reasoning.

¹ https://github.com/avinashsai/NILM

In this paper, we explore whether pretraining on text is inherently about learning language, or if pretraining also imbues LMs with skills for symbolic manipulation and non-linguistic reasoning (for example, performing quantitative computation such as finding the median of a set of numbers, recognizing regular expressions, or identifying whether a string is a palindrome, as shown in Figure 1). In other words, we investigate whether and how pretraining develops helpful inductive biases for non-linguistic reasoning. For this analysis, we create a set of 19 tasks from three categories of task paradigms: quantitative computation (§3.1), recognizing regular expressions (§3.2), and string reasoning (§3.3). Figure 1 shows an example for each category, and the full list of tasks is described in Table 1. We experiment with transformer and RNN based LMs (§4) for learning these tasks, and perform a comparative analysis with (non-pretrained) neural model variants from the perspective of learning metrics such as accuracy and sample efficiency.
Task | Input Eg. | Output Eg. | Classes | Input range
Odd classification | 4210 | 0 | 0 - 1 | [1, 20000]
Even classification | 4210 | 1 | 0 - 1 | [1, 20000]
Odd even classification | 4210 even | 1 | 0 - 1 | [1, 20000]
Decimal operation | 872 / 436 | 2 | 0 - 9 | [1, 10000]
Decimal & word operation | four / 2 | 2 | 0 - 9 | [1, 10000]
Mean | 15,-8,15,-5,-14,-3 ? | 0 | 0 - 9 | [-15, 15]
Median | 3,6,5,15,2,3,-6,-2,9,-3,-9,-5,-14 ? | 2 | 0 - 9 | [-15, 15]
Mode | 5,9,7,0,2,5,3,3,3,0 ? | 3 | 0 - 9 | [0, 9]
Recognize {0, 1, 2}*02* | 01202102222 | 1 | 0 - 1 | [0, 2]
Recognize AA*BB*CC*DD*EE* | aaaaaaabbbbcccccddde | 1 | 0 - 1 | [a, e]
Palindrome classification | aWXXWa | 1 | 0 - 1 | [a-z], [A-Z]
Anagram classification | rGrPJhk-khGrPJr | 1 | 0 - 1 | [a-z], [A-Z]
Isogram classification | vFJoSj | 1 | 0 - 1 | [a-z], [A-Z]
Tautonym classification | stPvg-tPvga | 1 | 0 - 1 | [a-z], [A-Z]
Length of a string | teeo | 4 | 0 - 9 | [a-z]
Count of unique characters | deiieediid | 3 | 0 - 9 | [a-j]
Parity check | 011101001110 | 0 | 0 - 1 | [0, 1]
Vowels classification | iivxcmoouo | 0 | 0 - 9 | [a-z]
Maximum frequent character | jjjcjj | 9 (j) | 0 - 9 | [a-j]

Table 1: Description of the non-linguistic tasks with input and output examples. Classes are the class labels for each task. Input range denotes the range of the input operands in each task.
Our experiments (§5) reveal that pretrained models overall perform substantially better and are more sample efficient on most tasks. However, there are significant differences and patterns in performance between task types, as well as variance between different LM architectures. Since non-pretrained models do not have the benefit of regularization that comes from pretraining, a plausible reason for the discrepancy between them and pretrained LMs might be underfitting of the non-pretrained models when trained on comparatively small dataset sizes. To account for this, we also comprehensively explore the effect of model size (§6) of non-pretrained models for both transformer and RNN architectures. We find that the discrepancy in performance remains even for smaller neural models, indicating that the differences are not simply due to a mismatch in model and data sizes.

Finally, we investigate the role that pretraining data plays in influencing task performance on non-linguistic tasks (§7). We experiment with pretraining on different domains of text, pretraining on perturbed representations of natural language text (such as shuffled word order), pretraining on text of computer programs (with no linguistic properties of natural languages), pretraining on multi-lingual and non-English text, and pretraining with synthetic text (data sampled from synthetic distributions). Our analysis reveals that the advantages of pretraining surprisingly persist to various degrees across these variations, suggesting hitherto unexplored connections between pretraining and the learning abilities of language models. Our contributions are:
• We compare a range of pretrained LMs and non-pretrained models on a carefully designed suite of 19 classification tasks that require non-linguistic reasoning.
• We comprehensively explore the role of the pretraining data by experimenting with models pretrained from texts with different provenances.
• We establish that the positive effects of pretraining are not simply due to better model regularization by experimenting with neural models with different complexities and architectures.
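As an illustrative sketch of this comparison (not our exact training setup, which is described in §4), a pretrained classifier and a comparable non-pretrained counterpart can be instantiated as follows, assuming a BERT-style encoder accessed through the HuggingFace transformers library:

```python
# Illustrative sketch only: instantiating a pretrained vs. a randomly
# initialized (non-pretrained) classifier of the same architecture.
# The checkpoint name and label count below are assumptions for illustration;
# see Section 4 for the models and training details actually used.
from transformers import (AutoConfig, AutoTokenizer,
                          AutoModelForSequenceClassification)

model_name = "bert-base-uncased"   # example checkpoint (assumption)
num_labels = 10                    # e.g., digit classes 0-9

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Pretrained LM: weights initialized from the pretraining checkpoint.
pretrained = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=num_labels)

# Non-pretrained counterpart: identical architecture, random initialization.
config = AutoConfig.from_pretrained(model_name, num_labels=num_labels)
non_pretrained = AutoModelForSequenceClassification.from_config(config)

# Both models are then fine-tuned on the same task data (e.g., the median
# task), and accuracy / sample efficiency are compared.
inputs = tokenizer("3,6,5,15,2,3,-6,-2,9,-3,-9,-5,-14 ?", return_tensors="pt")
print(pretrained(**inputs).logits.shape)   # torch.Size([1, 10])
```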
2 Related Work
A body of work has investigated contextual word embeddings to determine whether they capture aspects of mathematical meaning for numbers (Naik et al., 2019). Wallace et al. (2019) probed numeracy in the token embeddings of contextual language models such as ELMo and BERT. Thawani et al. (2021) surveyed numerical understanding in NLP models using 7 sub-tasks such as measurement estimation and word problems. Our work diverges from these in exploring a richer set of tasks, including harder tasks such as set operations. Further, previous methods explore mathematical reasoning tasks posed as language problems, which conflates the problems of language and mathematical learning and also makes the datasets susceptible to biases due to data collection. Our analysis circumvents both these issues by design.

Some previous works have explored the ability of RNN and Transformer architectures for learning regular languages (Weiss et al., 2018; Sennhauser and Berwick, 2018; Suzgun et al., 2019b; Bhattamishra et al., 2020), closing brackets (Skachkova et al., 2018), and dynamic counting (Suzgun et al., 2019a). However, they focus on the learnability of these tasks with specific architectures, and do not look at pretrained LMs, which are our focus here.

Finally, in our discussion, we conceptually stretch the notion of inductive bias. The idea of inductive bias is usually associated with specific model types (McCoy et al., 2020; Kharitonov and Chaabouni, 2021), architectures (Xu et al., 2021; Brutzkus and Globerson, 2021) and regularization approaches (Helmbold and Long, 2015). We believe that extending this to refer to learning tasks with pretrained LMs is both reasonable and useful.
3 NILM
In this section, we describe the tasks used for our analysis, which we refer to as NILM (measuring Non-linguistic Inductive bias in Language Models). The tasks correspond to three task paradigms: (1) quantitative computation, (2) regular expressions, and (3) string reasoning. Each task in NILM is posed as a classification task. The descriptions for all the tasks with input and output examples, class labels and the input range are shown in Table 1. Each task has a synthetically generated dataset with train/dev/test splits². To avoid biases in the datasets, relevant numbers and strings in individual examples are uniformly sampled from the appropriate ranges.

² The training set size for all tasks is 10K, dev set size is 1K and test set size is 1K, except for tasks on recognizing regular expressions, where the test set size is 2K following previous work (Bhattamishra et al., 2020).
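As a minimal sketch of this construction (illustrative only; the exact generation constraints for each task follow Table 1), the dataset for the odd classification task, for example, can be built as follows:

```python
# Minimal sketch of the dataset construction described above, using the
# "Odd classification" task as an example: numbers are sampled uniformly
# from [1, 20000] and split into train/dev/test (10K/1K/1K).
import random

random.seed(0)

def make_odd_example(low=1, high=20000):
    n = random.randint(low, high)   # uniform sampling avoids dataset biases
    return str(n), n % 2            # input string, label 1 if odd else 0

examples = [make_odd_example() for _ in range(12000)]
train, dev, test = examples[:10000], examples[10000:11000], examples[11000:]
print(train[0])   # e.g., ('12541', 1)
```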
3.1 Quantitative computation
This task paradigm focuses on tasks involving arithmetic and set statistics.
Odd classification. Classify if a number is odd.
Even classification. Classify if a number is even.
Odd even classification. For a given number N and a string “even” or “odd”, classify if the number satisfies the string condition.
Decimal operation. Subtract or divide two numbers. Operands are represented in decimal notation.
Decimal & word operation. Subtract or divide two numbers. Operands are represented in decimal or word notation.
Mean. Given a set of numbers, output the mean.
Median. Given a set, output the median.
Mode. Given a set of numbers, output the mode.
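For instance, examples for the decimal operation task might be generated as sketched below. This is an illustrative sketch; since the class labels are the digits 0-9, we assume that only operand pairs whose result is a single non-negative digit are retained:

```python
# Illustrative sketch for generating "Decimal operation" examples
# (subtract or divide two numbers). Operands are sampled uniformly from
# [1, 10000]; pairs whose result is not a single digit are rejected
# (an assumption made here because the class labels are 0-9).
import random

def make_decimal_op_example(low=1, high=10000):
    while True:
        a, b = random.randint(low, high), random.randint(low, high)
        op = random.choice(["-", "/"])
        if op == "-" and 0 <= a - b <= 9:
            return f"{a} - {b}", a - b
        if op == "/" and a % b == 0 and a // b <= 9:
            return f"{a} / {b}", a // b

print(make_decimal_op_example())   # e.g., ('872 / 436', 2)
```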
3.2 Recognizing regular expressions
This task paradigm focuses on recognizing regular expressions. The training data consists of positive and negative examples of strings matching a regular expression (Bhattamishra et al., 2020).
Recognize {0,1,2}*02*. Recognize if a pattern matches {0,1,2}*02*. The maximum length of the patterns is 20.
Recognize AA*BB*CC*DD*EE*. Recognize if a pattern matches AA*BB*CC*DD*EE*. The maximum length of the patterns is 30.
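As an illustration, strings for the first pattern can be labeled with a standard regular-expression matcher, as in the simplified sketch below (the actual datasets follow the generation setup of Bhattamishra et al., 2020; in practice a balanced mix of positive and negative strings would typically be enforced):

```python
# Simplified sketch: labeling strings for the "{0,1,2}*02*" recognition task.
# Random strings over the alphabet {0, 1, 2} (length <= 20) are labeled 1
# iff the whole string matches the pattern.
import random
import re

PATTERN = re.compile(r"[012]*02*")   # {0,1,2}*02* in standard regex syntax

def make_regex_example(max_len=20):
    length = random.randint(1, max_len)
    s = "".join(random.choice("012") for _ in range(length))
    label = int(PATTERN.fullmatch(s) is not None)
    return s, label

print(make_regex_example())
print(PATTERN.fullmatch("01202102222") is not None)   # True (positive example from Table 1)
```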
3.3 String reasoning
This task paradigm focuses on reasoning tasks over
individual strings or pairs of strings.
Palindrome classification. A string is a palindrome if it reads the same forward and backward. The task is to classify whether a given string is a palindrome. The string length ranges from 1 to 15.
Anagram classification. Two strings are anagrams if one is formed by rearranging letters from the other. The task is to classify if a pair of strings are anagrams. The string length ranges from 2 to 15.
Isogram classification. A string is an isogram if it has no repeating characters. The task is to classify whether a given string is an isogram. The string length ranges from 1 to 52.
Tautonym classification. A tautonym is a word which can be broken down into two identical parts, with the same spelling. The task is to classify whether a given string is a tautonym. The string length ranges from 1 to 10.
Length of a string. Output the length of a given string. The string length ranges from 1 to 10.
Count of unique characters. Given a string, count the number of unique characters in it. The string length ranges from 10 to 30.
Parity check. Given a binary string, output if the counts of ones and zeros are the same. The maximum length of the binary string is 20.
Vowels classification. Given a string, classify if the string contains only vowel characters. The string length ranges from 3 to 10.
Maximum frequent character. Given a string, output the character that occurs most frequently in it.
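The labeling functions for these string tasks are straightforward; the sketch below illustrates a few of them (palindrome, isogram, anagram, and tautonym), assuming inputs generated as described above:

```python
# Illustrative labeling functions for some of the string reasoning tasks.
def is_palindrome(s: str) -> int:
    # 1 if the string reads the same forward and backward.
    return int(s == s[::-1])

def is_isogram(s: str) -> int:
    # 1 if no character repeats.
    return int(len(set(s)) == len(s))

def is_anagram(a: str, b: str) -> int:
    # 1 if one string is a rearrangement of the other.
    return int(sorted(a) == sorted(b))

def is_tautonym(s: str) -> int:
    # 1 if the string consists of two identical halves.
    n = len(s)
    return int(n % 2 == 0 and s[:n // 2] == s[n // 2:])

print(is_palindrome("aWXXWa"), is_anagram("rGrPJhk", "khGrPJr"))   # 1 1
```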