
conflates the problems of language and mathematical learning and also makes the datasets susceptible to biases due to data collection. Our analysis circumvents both these issues by design.
Some previous works have explored the ability of RNN and Transformer architectures to learn regular languages (Weiss et al., 2018; Sennhauser and Berwick, 2018; Suzgun et al., 2019b; Bhattamishra et al., 2020), closing brackets (Skachkova et al., 2018), and dynamic counting (Suzgun et al., 2019a). However, they focus on the learnability of these tasks with specific architectures and do not look at pretrained LMs, which are our focus here.
Finally, in our discussion, we conceptually stretch the notion of inductive bias. The idea of inductive bias is usually associated with specific model types (McCoy et al., 2020; Kharitonov and Chaabouni, 2021), architectures (Xu et al., 2021; Brutzkus and Globerson, 2021) and regularization approaches (Helmbold and Long, 2015). We believe that extending this to refer to learning tasks with pretrained LMs is both reasonable and useful.
3 NILM
In this section, we describe the tasks used for our analysis, which we refer to as NILM (measuring Non-linguistic Inductive bias in Language Models). The tasks correspond to three task paradigms: (1) quantitative computation, (2) regular expressions, and (3) string reasoning. Each task in NILM is posed as a classification task. The descriptions for all the tasks with input and output examples, class labels and the input range are shown in Table 1. Each task has a synthetically generated dataset with train/dev/test splits (see footnote 2). To avoid biases in the datasets, relevant numbers and strings in individual examples are uniformly sampled from the appropriate ranges.
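As a rough illustration of the uniform sampling described above, the following sketch builds splits for the odd classification task. The function name `make_odd_dataset`, the seeds, and the number range are assumptions made for this example only; the actual input ranges are those listed in Table 1.

```python
import random


def make_odd_dataset(n_examples, low, high, seed=0):
    """Sample integers uniformly from [low, high] and label each as odd (1) or not (0)."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n_examples):
        number = rng.randint(low, high)  # uniform over the closed interval
        label = int(number % 2 == 1)
        examples.append((str(number), label))
    return examples


# Hypothetical range; split sizes follow the 10K/1K/1K sizes stated in footnote 2.
train = make_odd_dataset(10_000, low=1, high=20_000, seed=0)
dev = make_odd_dataset(1_000, low=1, high=20_000, seed=1)
test = make_odd_dataset(1_000, low=1, high=20_000, seed=2)
```

Sampling the inputs uniformly, rather than collecting them from text, is what keeps the label distribution free of collection artifacts in this sketch.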
3.1 Quantitative computation
This task paradigm focuses on tasks involving arithmetic and set statistics.
Odd classification. Classify if a number is odd.
Even classification. Classify if a number is even.
Odd even classification. For a given number N and a string “even” or “odd”, classify if the number satisfies the string condition.
Decimal operation. Subtract or divide two numbers. Operands are represented in decimal notation.
Footnote 2: The training set size for all tasks is 10K, the dev set size is 1K, and the test set size is 1K, except for the tasks on recognizing regular expressions, where the test set size is 2K, following previous work (Bhattamishra et al., 2020).
Decimal & word operation. Subtract or divide two numbers. Operands are represented in decimal or word notation.
Mean. Given a set of numbers, output the mean.
Median. Given a set of numbers, output the median.
Mode. Given a set of numbers, output the mode.
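The target functions for several of these tasks can be sketched in a few lines. The helper names `satisfies_parity` and `set_statistics` are hypothetical, and the sketch only computes the underlying quantities; how each value is mapped to a class label is specified in Table 1 and is not reproduced here.

```python
import statistics


def satisfies_parity(number, condition):
    """Odd even classification: does `number` satisfy the "even" or "odd" condition?"""
    if condition not in ("even", "odd"):
        raise ValueError("condition must be 'even' or 'odd'")
    return (number % 2 == 0) == (condition == "even")


def set_statistics(numbers):
    """Mean, median and mode of a collection of numbers."""
    return {
        "mean": statistics.mean(numbers),
        "median": statistics.median(numbers),
        # With several equally frequent values, statistics.mode (Python 3.8+)
        # returns the first one encountered; the tie rule is an assumption here.
        "mode": statistics.mode(numbers),
    }
```

For example, `set_statistics([1, 2, 2, 3, 7])` gives a mean of 3, a median of 2 and a mode of 2.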
3.2 Recognizing regular expressions
This task paradigm focuses on recognizing regular expressions. The training data consists of positive and negative examples of strings matching a regular expression (Bhattamishra et al., 2020).
Recognize {0,1,2}*02*. Recognize if a pattern matches {0,1,2}*02*. The maximum length of the patterns is 20.
Recognize AA*BB*CC*DD*EE*. Recognize if a pattern matches AA*BB*CC*DD*EE*. The maximum length of the patterns is 30.
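For reference, the two languages can be written with Python's `re` module as below. This is a minimal sketch for labeling candidate strings, not the generation procedure of Bhattamishra et al. (2020); the names `PATTERN_1`, `PATTERN_2` and `matches` are chosen for this example, and the rewriting of AA* as A+ is a standard equivalence.

```python
import re

# {0,1,2}*02*: any string over {0, 1, 2}, then a 0, then zero or more 2s.
PATTERN_1 = re.compile(r"[012]*02*")
# AA*BB*CC*DD*EE*: one or more A, then B, C, D and E, in that order.
PATTERN_2 = re.compile(r"A+B+C+D+E+")


def matches(pattern, string):
    """Return True only if the whole string belongs to the language."""
    return pattern.fullmatch(string) is not None
```

Using `fullmatch` rather than `search` matters here, since a negative example may still contain a matching substring.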
3.3 String reasoning
This task paradigm focuses on reasoning tasks over individual strings or pairs of strings.
Palindrome classification. A string is a palindrome if it reads the same forward and backward. The task is to classify whether a given string is a palindrome. The string length ranges from 1 to 15.
Anagram classification. Two strings are anagrams if one is formed by rearranging letters from the other. The task is to classify if a pair of strings are anagrams. The string length ranges from 2 to 15.
Isogram classification. A string is an isogram if it has no repeating characters. The task is to classify whether a given string is an isogram. The string length ranges from 1 to 52.
Tautonym classification. A tautonym is a word which can be broken down into two identical parts, with the same spelling. The task is to classify whether a given string is a tautonym. The string length ranges from 1 to 10.
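The four definitions above translate directly into short predicates. The sketch below assumes case-sensitive comparison of characters, which the text does not state explicitly, and the function names are illustrative.

```python
from collections import Counter


def is_palindrome(s):
    """The string reads the same forward and backward."""
    return s == s[::-1]


def is_anagram(a, b):
    """Both strings use exactly the same characters with the same counts."""
    return Counter(a) == Counter(b)


def is_isogram(s):
    """No character occurs more than once."""
    return len(set(s)) == len(s)


def is_tautonym(s):
    """The string splits into two identical halves (so its length must be even)."""
    half, remainder = divmod(len(s), 2)
    return remainder == 0 and half > 0 and s[:half] == s[half:]
```

Under these definitions a string of length 1 is always a palindrome and an isogram, and never a tautonym.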
Length of a string. Output the length of a given string. The string length ranges from 1 to 10.
Count of unique characters. Given a string, count the number of unique characters in it. The string length ranges from 10 to 30.
Parity check. Given a binary string, output if the counts of ones and zeros are the same. The maximum length of the binary string is 20.
Vowels classification. Given a string, classify if the string contains only vowel characters. The string length ranges from 3 to 10.
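The length, unique-character, parity and vowel tasks can be sketched in the same way. The vowel set (a, e, i, o, u in both cases) and the function names are assumptions for this example; the parity check follows the definition given above, comparing the counts of ones and zeros.

```python
VOWELS = set("aeiouAEIOU")  # assumed vowel inventory


def string_length(s):
    """Number of characters in the string."""
    return len(s)


def count_unique(s):
    """Number of distinct characters in the string."""
    return len(set(s))


def equal_ones_and_zeros(bits):
    """Parity check as defined above: same number of ones and zeros."""
    return bits.count("1") == bits.count("0")


def only_vowels(s):
    """True if the non-empty string consists solely of vowel characters."""
    return len(s) > 0 and set(s) <= VOWELS
```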
Maximum frequent character.
Given a string,