Finding Dataset Shortcuts with Grammar Induction
Dan Friedman, Alexander Wettig, Danqi Chen
Department of Computer Science, Princeton University
{dfriedman,awettig,danqic}@cs.princeton.edu
Abstract
Many NLP datasets have been found to contain shortcuts: simple decision rules that achieve surprisingly high accuracy. However, it is difficult to discover shortcuts automatically. Prior work on automatic shortcut detection has focused on enumerating features like unigrams or bigrams, which can find only low-level shortcuts, or relied on post-hoc model interpretability methods like saliency maps, which reveal qualitative patterns without a clear statistical interpretation. In this work, we propose to use probabilistic grammars to characterize and discover shortcuts in NLP datasets. Specifically, we use a context-free grammar to model patterns in sentence classification datasets and a synchronous context-free grammar to model datasets involving sentence pairs. The resulting grammars reveal interesting shortcut features in a number of datasets, including both simple and high-level features, and automatically identify groups of test examples on which conventional classifiers fail. Finally, we show that the features we discover can be used to generate diagnostic contrast examples and incorporated into standard robust optimization methods to improve worst-group accuracy.¹
1 Introduction
Many NLP datasets have been found to contain shortcuts: simple decision rules that achieve surprisingly high accuracy. For example, it is possible to get high classification accuracy on paraphrase identification datasets by predicting that sentences with many common words are paraphrases of each other (Zhang et al., 2019). Such classifiers are said to be "right for the wrong reason" (McCoy et al., 2019). Shortcuts are a problem if they do not generalize to the intended test distribution (Geirhos et al., 2020). For example, paraphrase identification models might misclassify non-paraphrases that have many overlapping words.

¹Our code for inducing grammars and finding dataset shortcuts is available at https://github.com/princeton-nlp/ShortcutGrammar.
Shortcuts have been reported in many established datasets (e.g., McCoy et al., 2019; Gururangan et al., 2018; Niven and Kao, 2019; Schuster et al., 2019), as a consequence of annotation artifacts or so-called spurious correlations. Typically, these discoveries are the result of human intuition about possible patterns in a particular dataset. Our goal in this paper is to discover shortcuts automatically. If we can identify shortcuts, we can identify categories of examples on which conventional classifiers will fail, and try to mitigate these weaknesses by collecting more training data or using robust optimization algorithms.
The main challenge in automatically identifying shortcuts is to develop a formal framework for describing patterns in language data that can capture both simple and high-level features and that makes it possible to search for these patterns efficiently. Prior work has addressed only simple features like unigrams and bigrams (Wang and Culotta, 2020; Wang et al., 2022; Gardner et al., 2021), and it is difficult to extend this approach to more sophisticated patterns, like lexical overlap, without knowing the pattern in advance. Other model-based approaches use black-box interpretability methods, applying gradient-based techniques to identify tokens or training instances that influence a particular prediction (Han et al., 2020; Pezeshkpour et al., 2022; Bastings et al., 2021). These methods offer local, qualitative hints about the decision of a classifier on a particular test instance, but do not identify dataset-level features nor provide a way of measuring the strength of correlation.
Our approach is to use grammar induction to characterize and discover shortcut features (§2). Probabilistic grammars provide a principled framework for describing patterns in natural language, allowing us to formally model both simple features and more sophisticated patterns. They admit tractable search algorithms and provide a natural way to measure statistics about the correlation between text patterns and labels. Grammars also offer a mechanism for identifying contrastive features: templates that appear in similar contexts across classes but take on different values, which can be used to construct diagnostic test examples (see examples in Figure 1).

[Figure 1 omitted: a synchronous parse tree with paired terminals such as "A girl/A kid", "jumps/sits", and "in the air/in a chair", alongside lists of the top subtrees generated for the entailment, contradiction, and neutral classes.]

Figure 1: Left: The most likely parse tree for an example from the SNLI validation set according to our synchronous grammar. The numbered nodes index non-terminal symbols and ∅ denotes an empty symbol. We highlight the subtrees that provide the strongest evidence in favor of the label contradiction, and show alternative spans generated by these non-terminals after conditioning on the class labels (Right).
In this work, we focus on both single-sentence (e.g., sentiment analysis) and sentence-pair classification datasets (e.g., NLI, paraphrase detection). While we can use context-free grammars to model features in sentences, sentence-pair datasets present a particular challenge, as it is difficult to enumerate interactions between the pair of sentences. We propose to use synchronous context-free grammars, which formally model insertion, deletion, and alignment. We find that we can extract meaningful, dataset-specific structures that describe latent patterns characterizing these classification tasks.
We apply our approach to four classification datasets: IMDb, SUBJ, SNLI, and QQP. After illustrating the shortcut features (§3), we explore whether state-of-the-art classifiers exploit these shortcuts by identifying minority examples in the datasets and then generating contrastive examples (§4). Then we demonstrate that these features can be used in robust optimization algorithms to improve generalization (§5). Finally, we compare this approach with model-based interpretability methods and n-gram features, which do not explicitly model syntax (§6). Overall, we find that grammar induction provides a flexible and expressive representation for modeling features in NLP datasets.
2 Method
2.1 Overview
We focus on text classification datasets $\mathcal{D} \subseteq \mathcal{X} \times \mathcal{Y}$, where $y \in \mathcal{Y}$ is a categorical label and $x \in \mathcal{X}$ is either a sentence or a pair of sentences, each consisting of a sequence of words from a discrete vocabulary $\mathcal{V}$. When $x$ is a sentence pair, we will write $x = (x_a, x_b)$ and refer to $x_a$ and $x_b$ as the source and target sentences, respectively.

We aim to automatically extract a description of the features that characterize the relationship between $x$ and $y$, and our key idea is to define the features in terms of a dataset-specific grammar. Compared to a grammar extracted from a standard treebank, the grammars we induce serve as interpretable models for the distribution of sentences in the dataset. Our approach consists of two steps:
1. Grammar induction: First, we induce a grammar for the (unlabeled) training instances $x_1, \ldots, x_N$ and get the maximum likelihood trees $t_1, \ldots, t_N$.

2. Finding features: We define features in terms of subtrees in the grammar, which describe patterns in the input sentences, and we search for features that have high mutual information with the class labels.
We induce one shared grammar for all classes rather than incorporating labels during grammar induction, so that the non-terminal symbols have a consistent meaning across classes. This facilitates finding contrastive features. For example, in Figure 1 (right), the non-terminal symbol 77 always generates pairs of verbs, but the distribution changes according to the class label: 77 is more likely to generate walk/walking when the class label is entailment and more likely to generate stand/sits when the class label is contradiction.
2.2 Grammar Induction
In this section, we describe the two grammars we use and the training procedure.
Context-free grammar
A context-free grammar (CFG) consists of an inventory of terminal symbols $\mathcal{V}$ (words), non-terminal symbols $\mathcal{N}$, and production rules of the form $\alpha \rightarrow \beta \in \mathcal{R}$, where $\alpha \in \mathcal{N}$ and $\beta \in (\mathcal{V} \cup \mathcal{N})^*$. A probabilistic CFG (PCFG) defines a distribution over trees by assigning every non-terminal symbol a categorical distribution over production rules, with the probability of a tree defined as the product of the probabilities of the production rules used to generate it. A PCFG defines a distribution over sequences of words $x \in \mathcal{V}^*$ by marginalizing over all trees which generate the sentence $x$ (denoted by $\mathrm{yield}(t)$):

$$p(x) = \sum_{t\,:\,\mathrm{yield}(t) = x} p(t).$$
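For concreteness, this marginalization can be computed in polynomial time with the inside (CKY) algorithm. Below is a minimal Python sketch for a PCFG in Chomsky normal form; the dictionary-based rule tables and the toy grammar are illustrative assumptions, not the vectorized implementation used in this paper (see Complexity below).

```python
import math
from collections import defaultdict

def logsumexp(vals):
    vals = [v for v in vals if v != float("-inf")]
    if not vals:
        return float("-inf")
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def inside_log_prob(words, lex, rules, root="S"):
    """Inside algorithm for a PCFG in Chomsky normal form.
    lex[(A, w)] = log P(A -> w); rules[(A, B, C)] = log P(A -> B C).
    Returns log p(words), marginalizing over all parse trees."""
    n = len(words)
    beta = {}  # (i, j, A) -> log inside score of A spanning words[i:j]
    for i, w in enumerate(words):
        for (A, word), lp in lex.items():
            if word == w:
                beta[(i, i + 1, A)] = lp
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            scores = defaultdict(list)
            for k in range(i + 1, j):
                for (A, B, C), lp in rules.items():
                    left, right = beta.get((i, k, B)), beta.get((k, j, C))
                    if left is not None and right is not None:
                        scores[A].append(lp + left + right)
            for A, vals in scores.items():
                beta[(i, j, A)] = logsumexp(vals)
    return beta.get((0, n, root), float("-inf"))

# Toy grammar in which p("a girl jumps") = 1 under deterministic rules.
lex = {("DT", "a"): 0.0, ("NN", "girl"): 0.0, ("VP", "jumps"): 0.0}
rules = {("S", "NP", "VP"): 0.0, ("NP", "DT", "NN"): 0.0}
print(inside_log_prob(["a", "girl", "jumps"], lex, rules))  # 0.0
```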
Synchronous grammar
Many NLP shortcuts have been found in datasets involving pairs of sentences $x = (x_a, x_b)$. We model patterns in these datasets using a synchronous PCFG (SCFG; Lewis and Stearns, 1968; Wu, 1997), a grammar for defining probability distributions over pairs of sequences. An SCFG assumes that both sequences were generated from a single context-free parse tree, whose terminal symbols have the form $w_a/w_b$, where $w_a$ and $w_b$ are either words in $x_a$ or $x_b$, respectively, or an empty symbol, denoted by $\varnothing$, which represents a null alignment (Figure 1). SCFG productions can also be thought of as translation rules: the emission $w_a/w_b$ represents a substitution (replacing word $w_a$ with $w_b$), and null alignments represent insertion or deletion. An SCFG makes strong simplifying assumptions about the possible relationships between sentences, but, as we will show, it is still capable of modeling interesting, hierarchical structure in real-world datasets.
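For intuition about this generative story, the sketch below (a toy encoding of our own choosing, not the paper's data structures) represents the Figure 1 derivation as a nested tuple whose leaves are $w_a/w_b$ pairs, with None standing in for the empty symbol $\varnothing$, and projects out the two yields; rule probabilities and inversion are omitted.

```python
def yields(node):
    """Project the source and target yields of a synchronous parse tree.
    Leaves are (src, tgt) word pairs; None marks a null alignment."""
    if all(isinstance(c, (str, type(None))) for c in node):
        src, tgt = node
        return ([src] if src else []), ([tgt] if tgt else [])
    src, tgt = [], []
    for child in node:
        s, t = yields(child)
        src, tgt = src + s, tgt + t
    return src, tgt

# The derivation from Figure 1, pairing each source word with a target word.
tree = ((("A", "A"), ("girl", "kid")),
        (("jumps", "sits"),
         ((("in", "in"), ("the", "a")), ("air", "chair"))))
print(" ".join(yields(tree)[0]))  # A girl jumps in the air
print(" ".join(yields(tree)[1]))  # A kid sits in a chair
```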
Parameterization and training
The parameters of the grammar consist of a vector $\theta$ with one entry $\theta_{\alpha \rightarrow \beta}$ for every rule $\alpha \rightarrow \beta \in \mathcal{R}$. Following Kim et al. (2019), we parameterize the grammars using a neural network. We use the neural CFG parameterization from Kim et al. (2019) and develop a similar parameterization for our SCFG, with extensions for the terminal production rules. We defer the full details to the appendix (Section A).
Given a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ and a grammar $\mathcal{G} = (\mathcal{V}, \mathcal{N}, \mathcal{R})$, we find a maximum likelihood estimate $\theta^*$ by maximizing the marginal likelihood of the (unlabeled) training sentences:

$$\theta^* = \arg\max_{\theta} \sum_{i=1}^{N} \log p(x_i \mid \mathcal{G}, \theta).$$

We optimize the parameters using gradient descent. After training, we calculate the maximum likelihood trees $t_1, \ldots, t_N$ for the training data, and use these trees as the basis of further analysis.
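A minimal end-to-end sketch of this objective is shown below, using a small differentiable inside recursion and plain scalar rule logits; the paper instead uses the neural parameterization of Kim et al. (2019) and a vectorized inside algorithm. The symbol count, vocabulary size, and random toy corpus are all illustrative assumptions.

```python
import torch

def log_inside(terms, rules):
    """Differentiable inside algorithm in log space, for a toy grammar in
    which every symbol can either emit a word or expand with a binary rule.
    terms: (n, S) log p(A -> w_i); rules: (S, S, S) log p(A -> B C)."""
    n, S = terms.shape
    beta = {(i, i + 1): terms[i] for i in range(n)}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            splits = [
                (beta[(i, k)][None, :, None] + beta[(k, j)][None, None, :]
                 + rules).logsumexp(dim=(1, 2))
                for k in range(i + 1, j)
            ]
            beta[(i, j)] = torch.stack(splits).logsumexp(dim=0)
    return beta[(0, n)][0]  # treat symbol 0 as the root

S, V = 8, 50  # illustrative symbol and vocabulary sizes
rule_logits = (0.01 * torch.randn(S, S, S)).requires_grad_()
term_logits = (0.01 * torch.randn(S, V)).requires_grad_()
opt = torch.optim.Adam([rule_logits, term_logits], lr=1e-2)
corpus = [torch.randint(0, V, (5,)) for _ in range(32)]  # toy word-id corpus

for sent in corpus:
    # Normalize jointly over binary and terminal rules so that, for each
    # symbol A, its rule probabilities sum to one.
    logp = torch.cat([rule_logits.reshape(S, -1), term_logits], 1).log_softmax(1)
    rules = logp[:, :S * S].reshape(S, S, S)
    terms = logp[:, S * S:][:, sent].T  # (n, S)
    loss = -log_inside(terms, rules)    # negative log marginal likelihood
    opt.zero_grad()
    loss.backward()
    opt.step()
```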
Complexity
Training and parsing require enumerating the trees consistent with the input, which is calculated with the inside algorithm for CFGs (Lari and Young, 1990) and the bitext inside algorithm for SCFGs (Wu, 1997). The inside algorithm has space and time complexity of $O(|x|^3 |\mathcal{G}|)$ and the bitext inside algorithm has space and time complexity of $O(|x_a|^3 |x_b|^3 |\mathcal{G}|)$, where $|\mathcal{G}|$ is a grammar constant determined by the number of rules in the grammar. We use a vectorized GPU implementation of the inside algorithm provided by Torch-Struct (Rush, 2020) and we implement a vectorized version of the bitext inside algorithm. The cost of the bitext parsing algorithm imposes a practical limitation on the length of the sentences we consider (Section 3.1). We also discuss possible efficient approximations in the Limitations section, which we leave as an avenue for future work.
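For a rough sense of why this matters for the length limit in Section 3.1 (our arithmetic, ignoring the grammar constant $|\mathcal{G}|$):

```python
# Chart sizes (span/split combinations per grammar rule) for 15-word inputs.
n_a = n_b = 15
cfg_chart = n_a ** 3                  # ~3.4e3 for a single sentence
scfg_chart = (n_a ** 3) * (n_b ** 3)  # ~1.1e7 for a sentence pair
print(f"{cfg_chart:.1e} vs {scfg_chart:.1e}")
```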
2.3 Finding Features
The tree-annotated corpus provides a structured representation of the dataset that we can now query to find discriminative patterns at various levels of abstraction. We follow a simple procedure for finding dataset-level patterns using our tree annotations.
First, given a set of trees $t_1, \ldots, t_N$ with class labels $y_1, \ldots, y_N$, we extract the set of complete subtrees, which are subtrees whose leaves are all terminal symbols. There is at most one unique subtree for each non-terminal node in each tree, so the number of complete subtrees is roughly on the order of the number of words in the training data.
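A sketch of this extraction step, using a simple nested-tuple tree encoding of our own choosing:

```python
def complete_subtrees(tree):
    """Enumerate the complete subtrees of a parse tree: one per non-terminal
    node, consisting of that node and everything it dominates.
    Trees are (label, child, ...) tuples; terminal words are plain strings."""
    if isinstance(tree, str):
        return []
    subtrees = [tree]
    for child in tree[1:]:
        subtrees.extend(complete_subtrees(child))
    return subtrees

tree = ("S", ("NP", ("DT", "a"), ("NN", "girl")), ("VP", "jumps"))
for s in complete_subtrees(tree):
    print(s)  # the S, NP, DT, NN, and VP subtrees, five in total
```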
Next, we calculate the mutual information between each subtree and the class labels. We treat each subtree $s$ as a boolean-valued feature function on trees, $\phi_s(t) = \mathbb{1}[s \in t]$. Let $Z_s$ be a random variable denoting the output of $\phi_s$ and let $Y$ be a random variable over $\mathcal{Y}$. The mutual information is defined as:

$$I(Z_s; Y) = \sum_{z_s \in \{0,1\}} \sum_{y \in \mathcal{Y}} p(y, z_s) \log \frac{p(y, z_s)}{p(y)\,p(z_s)}.$$
We estimate the mutual information between $Z_s$ and $Y$ using $\hat{p}(y, z_s) \propto 1 + \sum_{i=1}^{N} \mathbb{1}[y_i = y \wedge \phi_s(t_i) = z_s]$. Mutual information measures the expected amount of information we learn about $Y$ by learning the value of $\phi_s$. While we use mutual information in this paper, we could also score the features using other feature-importance metrics, such as z-score (Gardner et al., 2021).
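A sketch of this estimator, where the add-one smoothing over the $2 \times |\mathcal{Y}|$ joint cells is our reading of $\hat{p}$ above; each subtree $s$ would supply feature_vals[i] $= \phi_s(t_i)$:

```python
import math
from collections import Counter

def mutual_information(feature_vals, labels, classes):
    """Estimate I(Z_s; Y) with add-one smoothing over the (z, y) cells."""
    joint = Counter(zip(feature_vals, labels))
    total = len(labels) + 2 * len(classes)
    mi = 0.0
    for z in (0, 1):
        for y in classes:
            p_zy = (joint[(z, y)] + 1) / total
            p_z = sum(joint[(z, y2)] + 1 for y2 in classes) / total
            p_y = sum(joint[(z2, y)] + 1 for z2 in (0, 1)) / total
            mi += p_zy * math.log(p_zy / (p_z * p_y))
    return mi

# A feature that fires mostly on one class carries information about Y.
z = [1, 1, 1, 0, 0, 0, 0, 0]
y = ["pos", "pos", "pos", "neg", "neg", "neg", "pos", "neg"]
print(mutual_information(z, y, ["pos", "neg"]))
```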
To visualize the most discriminative patterns, we group the highest-ranked subtrees according to their root label and majority class label. Let $S(\alpha, y)$ denote the set of subtrees with root label $\alpha$ and majority class label $y$. We define a composite feature, $Z_{\alpha,y} = \bigvee_{s \in S(\alpha, y)} Z_s$, as the union of features in $S(\alpha, y)$. The result of this procedure is a concise list of class-conditional non-terminal features, which we can inspect to identify the patterns that are broadly discriminative across the dataset.
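A sketch of the grouping-and-union step, assuming each subtree feature has already been evaluated on every training example (function and variable names are ours):

```python
from collections import defaultdict

def composite_features(subtree_values, root_label, majority_class):
    """Group subtree features by (root label, majority class) and OR their
    per-example values, yielding one composite feature Z_{alpha,y} per group.
    subtree_values: dict mapping subtree -> list of 0/1 values per example."""
    groups = defaultdict(list)
    for s, values in subtree_values.items():
        groups[(root_label[s], majority_class[s])].append(values)
    return {
        key: [int(any(row)) for row in zip(*members)]
        for key, members in groups.items()
    }
```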
3 Finding Shortcuts
3.1 Experimental Setup
Datasets
We apply our approach to two single-sentence and two sentence-pair classification datasets. IMDb (Maas et al., 2011) is a binary sentiment analysis dataset consisting of paragraph-length movie reviews. SUBJ (Pang and Lee, 2004) is a subjectivity classification dataset, containing sentences labeled as either subjective or objective. SNLI (Bowman et al., 2015) is a three-way classification task; given two sentences, $x_a$ and $x_b$, the objective is to determine whether $x_a$ entails $x_b$, contradicts $x_b$, or is neutral. QQP (Iyer et al., 2017) consists of pairs of questions from quora.com labeled as being either paraphrases or not paraphrases.
For all experiments, we fix the size of the grammar to 96 non-terminal symbols, divided into 32 internal non-terminal symbols and 64 pre-terminal symbols, similar to Kim et al. (2019), and use a lowercase word-level tokenizer with a maximum vocabulary size of 20,000 words. For the sentence-pair datasets, we randomly sample 65K/16K class-balanced sets of training/validation examples that fit within the length limit imposed by the bitext inside-outside algorithm ($|x_a| \times |x_b| < 225$). This length limit covers approximately 80% of SNLI and 70% of QQP. For IMDb, we split the movie reviews into sentences for the purposes of training the PCFG, but compute feature statistics using the full reviews. More implementation details are in Appendix B.
3.2 A Look at the Top Features
In this section, we qualitatively explore some of the dataset-level shortcuts we find. Our procedure for each dataset is the same: given a set of trees, we enumerate all complete subtrees and sort them by mutual information, treating each subtree as a binary indicator variable. Then we group the subtree features by root label and majority class label and inspect the most informative groups of subtrees.
The majority class label for a feature $Z$ is defined as the most common class label among training examples for which $Z = 1$. For each feature $Z$, we report the number of training examples for which $Z = 1$ (Count) and the percentage of these having the majority class label (% Majority). We present the most interesting results here and include extended results in Appendix D.1.
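These two statistics are straightforward to compute for any binary feature; a small sketch (the helper name is ours):

```python
from collections import Counter

def feature_stats(values, labels):
    """Count and % Majority for a binary feature over the training set."""
    hits = [y for z, y in zip(values, labels) if z == 1]
    majority, n_majority = Counter(hits).most_common(1)[0]
    return len(hits), majority, 100.0 * n_majority / len(hits)

z = [1, 1, 1, 0]
y = ["neg", "neg", "pos", "pos"]
print(feature_stats(z, y))  # (3, 'neg', 66.7): Count = 3, % Majority = 66.7
```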
IMDb
Not surprisingly, we find that the most informative features in IMDb include adjectives, adverbs, and nouns with high sentiment valence. We highlight some of the more interesting patterns in Table 1. For example, we discover that negative reviews are almost three times as likely as positive reviews to mention the length of the film (node 8), and we find that the grammar has learned a clear category corresponding to names (node 5).

We also confirm that the grammar recovers a known shortcut in IMDb, numerical ratings (Ross et al., 2021; Pezeshkpour et al., 2022). We find that a single non-terminal (node 29) is responsible for generating ratings, including both simple, numerical ratings, which have been documented in earlier work, as well as letter grades and ratings on a star system (Table 2).
SUBJ
In the SUBJ dataset (Table 3), the most informative features reflect how this dataset was constructed (Pang and Lee, 2004): the subjective class consists of movie reviews from Rotten Tomatoes and the objective class consists of movie summaries.