
BERT, and other models based on BERT, such as PatentBERT (Lee and Hsiang 2019), DocBERT (Adhikari et al. 2019), SciBERT (Beltagy, Lo, and Cohan 2019), DistilBERT (Sanh et al. 2019), and K-BERT (Liu et al. 2020), have achieved groundbreaking results in diverse language understanding tasks, including question answering (QA) (Reddy, Chen, and Manning 2019; Fan
et al. 2019; Lewis et al. 2019), text summarization (Liu and Lapata 2019; Zhang, Wei, and
Zhou 2019), sentence prediction (Shin, Lee, and Jung 2019; Lan et al. 2019), dialogue
response generation (Zhang et al. 2019; Wang et al. 2019), natural language inference
(McCoy, Pavlick, and Linzen 2019; Richardson et al. 2020), and sentiment classification
(Gao et al. 2019; Thongtan and Phienthrakul 2019; Munikar, Shakya, and Shrestha 2019).
The model studied in this paper, RoBERTa (Liu et al. 2019b), is a highly optimized version of the original BERT architecture. First published in 2019, RoBERTa improved over BERT on various benchmarks by margins ranging from 0.9 percent [on the Quora Question Pairs dataset (Iyer, Dandekar, and Csernai 2016)] to 16.2 percent [on the Recognizing Textual Entailment dataset (Dagan, Glickman, and Magnini 2005; Haim et al. 2006; Giampiccolo et al. 2007; Bentivogli et al. 2009)].
Specifically, RoBERTa is trained with larger mini-batches and learning rates, removes the next-sentence pre-training objective, and focuses on improving the masked language modeling (MLM) objective, yielding better performance than BERT on problems such as Multi-Genre Natural Language Inference (Williams, Nangia, and Bowman 2017) and Question-Based Natural Language Inference (Rajpurkar et al. 2016). RoBERTa-based models have approached near-human performance on various (subsequently described) commonsense natural language understanding (NLU) benchmarks.
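To make the MLM objective concrete, the following minimal Python sketch illustrates fill-in-the-gap prediction with RoBERTa. It assumes the publicly available Hugging Face transformers library and the roberta-base checkpoint; these are illustrative choices for exposition, not part of this paper's experimental setup.

# Minimal illustration of RoBERTa's masked language modeling (MLM) objective.
# Assumes the Hugging Face `transformers` library and the public `roberta-base`
# checkpoint (illustrative choices, not the models evaluated in this paper).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

# RoBERTa uses "<mask>" as its mask token.
for prediction in fill_mask("The capital of France is <mask>.", top_k=3):
    # Each prediction carries a proposed filler token and its probability.
    print(f"{prediction['token_str']!r}  p={prediction['score']:.3f}")

Fill-in-the-gap queries of this kind are also the basis of several of the probing studies discussed below.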
BERT’s original success on these NLU tasks has also motivated researchers to adapt
it for multi-modal language representation (Lu et al. 2019; Sun et al. 2019), cross-
lingual language models (Lample and Conneau 2019), and domain-specific language
models, including in the medicine- (Alsentzer et al. 2019; Wang et al. 2020) and biology-
related domains (Lee et al. 2020). Due to this widespread use, and the fact that even recent, more advanced models with billions of parameters rely on similar technology (deep transformers), it has become important to systematically study the linguistic properties of BERT using a battery of tests inspired by work first conducted in the behavioral sciences. In prior work, for example, several approaches have been proposed to study the knowledge encoded within BERT, including fill-in-the-gap probes
of MLM (Rogers, Kovaleva, and Rumshisky 2020; Wu et al. 2019), analysis of self-
attention weights (Kobayashi et al. 2020; Ettinger 2020), the probing of classifiers with
different BERT representations as inputs (Liu et al. 2019a; Warstadt and Bowman 2020),
and a ‘CheckList’ style approach to systematically evaluate the linguistic capability of a
BERT-based model (Ribeiro et al. 2020). Evidence from this line of research suggests that
BERT encodes a hierarchy of linguistic information, with surface features at the bottom,
syntactic features in the middle, and semantic features at the top (Jawahar, Sagot, and
Seddah 2019). It ‘naturally’ learns syntactic information from pre-training text.
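As a concrete illustration of the probing-classifier methodology referenced above, the short Python sketch below extracts frozen, layer-wise BERT representations on which a lightweight classifier could then be trained. The library, checkpoint, example sentence, and pooling strategy are assumptions made for exposition and are not drawn from the cited studies.

# Sketch of extracting layer-wise BERT representations for a probing classifier.
# Assumes the Hugging Face `transformers` library and `bert-base-uncased`; the
# example sentence and mean-pooling are illustrative assumptions only.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("The cat that the dog chased ran away.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple containing the embedding layer plus one tensor
# per transformer layer, each of shape (batch, sequence_length, hidden_size).
for layer, states in enumerate(outputs.hidden_states):
    # Mean-pool over tokens to obtain one frozen feature vector per layer; a
    # linear probe trained on these vectors indicates what that layer encodes.
    sentence_vector = states.mean(dim=1)
    print(f"layer {layer:2d}: feature vector of dimension {sentence_vector.shape[-1]}")

Comparing the accuracy of such probes across layers is the kind of evidence behind the surface-to-syntactic-to-semantic hierarchy described above.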
However, it has been found that, while such information can be recovered from its token representations (Wu et al. 2020), BERT does not fully ‘understand’ naturalistic concepts like negation and is insensitive to malformed input (Rogers, Kovaleva, and Rumshisky 2020). The latter finding echoes adversarial experiments, not unlike those conducted in the computer vision community, that researchers have used to test BERT’s robustness. Some of these experiments have shown that, even though BERT encodes information about entity types, relations, semantic roles, and proto-roles well, it struggles with representations of numbers (Wallace et al. 2019b) and is brittle to named entity replacements (Balasubramanian et al. 2020). Moreover, (?) also found