Finding Dataset Shortcuts with Grammar Induction
Dan Friedman, Alexander Wettig, Danqi Chen
Department of Computer Science, Princeton University
{dfriedman,awettig,danqic}@cs.princeton.edu
Abstract
Many NLP datasets have been found to contain shortcuts: simple decision rules that achieve surprisingly high accuracy. However, it is difficult to discover shortcuts automatically. Prior work on automatic shortcut detection has focused on enumerating features like unigrams or bigrams, which can find only low-level shortcuts, or relied on post-hoc model interpretability methods like saliency maps, which reveal qualitative patterns without a clear statistical interpretation. In this work, we propose to use probabilistic grammars to characterize and discover shortcuts in NLP datasets. Specifically, we use a context-free grammar to model patterns in sentence classification datasets and a synchronous context-free grammar to model datasets involving sentence pairs. The resulting grammars reveal interesting shortcut features in a number of datasets, including both simple and high-level features, and automatically identify groups of test examples on which conventional classifiers fail. Finally, we show that the features we discover can be used to generate diagnostic contrast examples and incorporated into standard robust optimization methods to improve worst-group accuracy.¹
1 Introduction
Many NLP datasets have been found to contain shortcuts: simple decision rules that achieve surprisingly high accuracy. For example, it is possible to get high classification accuracy on paraphrase identification datasets by predicting that sentences with many common words are paraphrases of each other (Zhang et al., 2019). Such classifiers are said to be "right for the wrong reason" (McCoy et al., 2019). Shortcuts are a problem if they do not generalize to the intended test distribution (Geirhos et al., 2020). For example, paraphrase identification models might misclassify non-paraphrases that have many overlapping words.

¹Our code for inducing grammars and finding dataset shortcuts is available at https://github.com/princeton-nlp/ShortcutGrammar.
Shortcuts have been reported in many established datasets (e.g., McCoy et al., 2019; Gururangan et al., 2018; Niven and Kao, 2019; Schuster et al., 2019), as a consequence of annotation artifacts or so-called spurious correlations. Typically, these discoveries are the result of human intuition about possible patterns in a particular dataset. Our goal in this paper is to discover shortcuts automatically. If we can identify shortcuts, we can identify categories of examples on which conventional classifiers will fail, and try to mitigate these weaknesses by collecting more training data or using robust optimization algorithms.
The main challenge in automatically identifying shortcuts is to develop a formal framework for describing patterns in language data that can capture both simple and high-level features and that makes it possible to search for these patterns efficiently. Prior work has addressed only simple features like unigrams and bigrams (Wang and Culotta, 2020; Wang et al., 2022; Gardner et al., 2021), and it is difficult to extend this approach to more sophisticated patterns, like lexical overlap, without knowing the pattern in advance. Other model-based approaches use black-box interpretability methods, applying gradient-based techniques to identify tokens or training instances that influence a particular prediction (Han et al., 2020; Pezeshkpour et al., 2022; Bastings et al., 2021). These methods offer local, qualitative hints about the decision of a classifier on a particular test instance, but do not identify dataset-level features nor provide a way of measuring the strength of correlation.
Our approach is to use grammar induction to characterize and discover shortcut features (§2). Probabilistic grammars provide a principled framework for describing patterns in natural language, allowing us to formally model both simple features and more sophisticated patterns. They admit tractable search algorithms and provide a natural way to measure statistics about the correlation between text patterns and labels. Grammars also offer a mechanism for identifying contrastive features: templates that appear in similar contexts across classes but take on different values, which can be used to construct diagnostic test examples (see examples in Figure 1).

[Figure 1 omitted: a synchronous parse tree with paired terminals such as "A girl/A kid", "jumps/sits", and "in the air/in a chair", alongside lists of the top subtrees generated for the entailment, contradiction, and neutral classes.]

Figure 1: Left: The most likely parse tree for an example from the SNLI validation set according to our synchronous grammar. The numbered nodes index non-terminal symbols and ∅ denotes an empty symbol. We highlight the subtrees that provide the strongest evidence in favor of the label contradiction, and show alternative spans generated by these non-terminals after conditioning on the class labels (Right).
In this work, we focus on both single-sentence (e.g., sentiment analysis) and sentence-pair classification datasets (e.g., NLI, paraphrase detection). While we can use context-free grammars to model features in sentences, sentence-pair datasets present a particular challenge, as it is difficult to enumerate interactions between the pair of sentences. We propose to use synchronous context-free grammars, which formally model insertion, deletion, and alignment. We find that we can extract meaningful, dataset-specific structures that describe latent patterns characterizing these classification tasks.
We apply our approach to four classification datasets: IMDb, SUBJ, SNLI, and QQP. After illustrating the shortcut features (§3), we explore whether state-of-the-art classifiers exploit these shortcuts by identifying minority examples in the datasets and then generating contrastive examples (§4). Then we demonstrate that these features can be used in robust optimization algorithms to improve generalization (§5). Finally, we compare this approach with model-based interpretability methods and n-gram features, which do not explicitly model syntax (§6). Overall, we find that grammar induction provides a flexible and expressive representation for modeling features in NLP datasets.
2 Method
2.1 Overview
We focus on text classification datasets $\mathcal{D} \subseteq \mathcal{X} \times \mathcal{Y}$, where $y \in \mathcal{Y}$ is a categorical label and $x \in \mathcal{X}$ is either a sentence or a pair of sentences, each consisting of a sequence of words from a discrete vocabulary $\mathcal{V}$. When $x$ is a sentence pair, we will write $x = (x_a, x_b)$ and refer to $x_a$ and $x_b$ as the source and target sentences, respectively.

We aim to automatically extract a description of the features that characterize the relationship between $x$ and $y$, and our key idea is to define the features in terms of a dataset-specific grammar. Compared to a grammar extracted from a standard treebank, the grammars we induce serve as interpretable models for the distribution of sentences in the dataset. Our approach consists of two steps:
1. Grammar induction: First, we induce a grammar for the (unlabeled) training instances $x_1, \ldots, x_N$ and get the maximum likelihood trees $t_1, \ldots, t_N$.

2. Finding features: We define features in terms of subtrees in the grammar, which describe patterns in the input sentences, and we search for features that have high mutual information with the class labels.
We induce one shared grammar for all classes rather than incorporating labels during grammar induction, so that the non-terminal symbols have a consistent meaning across classes. This facilitates finding contrastive features. For example, in Figure 1 (right), the non-terminal symbol 77 always generates pairs of verbs, but the distribution changes according to the class label: 77 is more likely to generate walk/walking when the class label is entailment and more likely to generate stand/sits when the class label is contradiction.
2.2 Grammar Induction
In this section, we describe the two grammars we use and the training procedure.
Context-free grammar
A context-free grammar (CFG) consists of an inventory of terminal symbols $\mathcal{V}$ (words), non-terminal symbols $\mathcal{N}$, and production rules of the form $\alpha \rightarrow \beta \in \mathcal{R}$, where $\alpha \in \mathcal{N}$ and $\beta \in (\mathcal{V} \cup \mathcal{N})^*$. A probabilistic CFG (PCFG) defines a distribution over trees by assigning every non-terminal symbol a categorical distribution over production rules, with the probability of a tree defined as the product of the probabilities of the production rules used to generate it. A PCFG defines a distribution over sequences of words $x \in \mathcal{V}^*$ by marginalizing over all trees which generate the sentence $x$ (denoted by $\mathrm{yield}(t)$):

$$p(x) = \sum_{t\,:\,\mathrm{yield}(t) = x} p(t).$$
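For concreteness, this marginalization can be computed in polynomial time with the inside (CKY) algorithm. Below is a minimal Python sketch for a PCFG in Chomsky normal form; the dictionary-based rule tables and the toy grammar are illustrative assumptions, not the vectorized implementation used in this paper (see Complexity below).

```python
import math
from collections import defaultdict

def logsumexp(vals):
    vals = [v for v in vals if v != float("-inf")]
    if not vals:
        return float("-inf")
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def inside_log_prob(words, lex, rules, root="S"):
    """Inside algorithm for a PCFG in Chomsky normal form.
    lex[(A, w)] = log P(A -> w); rules[(A, B, C)] = log P(A -> B C).
    Returns log p(words), marginalizing over all parse trees."""
    n = len(words)
    beta = {}  # (i, j, A) -> log inside score of A spanning words[i:j]
    for i, w in enumerate(words):
        for (A, word), lp in lex.items():
            if word == w:
                beta[(i, i + 1, A)] = lp
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            scores = defaultdict(list)
            for k in range(i + 1, j):
                for (A, B, C), lp in rules.items():
                    left, right = beta.get((i, k, B)), beta.get((k, j, C))
                    if left is not None and right is not None:
                        scores[A].append(lp + left + right)
            for A, vals in scores.items():
                beta[(i, j, A)] = logsumexp(vals)
    return beta.get((0, n, root), float("-inf"))

# Toy grammar in which p("a girl jumps") = 1 under deterministic rules.
lex = {("DT", "a"): 0.0, ("NN", "girl"): 0.0, ("VP", "jumps"): 0.0}
rules = {("S", "NP", "VP"): 0.0, ("NP", "DT", "NN"): 0.0}
print(inside_log_prob(["a", "girl", "jumps"], lex, rules))  # 0.0
```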
Synchronous grammar
Many NLP shortcuts have been found in datasets involving pairs of sentences $x = (x_a, x_b)$. We model patterns in these datasets using a synchronous PCFG (SCFG; Lewis and Stearns, 1968; Wu, 1997), a grammar for defining probability distributions over pairs of sequences. An SCFG assumes that both sequences were generated from a single context-free parse tree, whose terminal symbols have the form $w_a/w_b$, where $w_a$ and $w_b$ are either words in $x_a$ or $x_b$, respectively, or an empty symbol, denoted by $\varnothing$, which represents a null alignment (Figure 1). SCFG productions can also be thought of as translation rules: the emission $w_a/w_b$ represents a substitution (replacing word $w_a$ with $w_b$), and null alignments represent insertion or deletion. An SCFG makes strong simplifying assumptions about the possible relationships between sentences, but, as we will show, it is still capable of modeling interesting, hierarchical structure in real-world datasets.
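For intuition about this generative story, the sketch below (a toy encoding of our own choosing, not the paper's data structures) represents the Figure 1 derivation as a nested tuple whose leaves are $w_a/w_b$ pairs, with None standing in for the empty symbol $\varnothing$, and projects out the two yields; rule probabilities and inversion are omitted.

```python
def yields(node):
    """Project the source and target yields of a synchronous parse tree.
    Leaves are (src, tgt) word pairs; None marks a null alignment."""
    if all(isinstance(c, (str, type(None))) for c in node):
        src, tgt = node
        return ([src] if src else []), ([tgt] if tgt else [])
    src, tgt = [], []
    for child in node:
        s, t = yields(child)
        src, tgt = src + s, tgt + t
    return src, tgt

# The derivation from Figure 1, pairing each source word with a target word.
tree = ((("A", "A"), ("girl", "kid")),
        (("jumps", "sits"),
         ((("in", "in"), ("the", "a")), ("air", "chair"))))
print(" ".join(yields(tree)[0]))  # A girl jumps in the air
print(" ".join(yields(tree)[1]))  # A kid sits in a chair
```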
Parameterization and training
The parameters of the grammar consist of a vector $\theta$ with one entry $\theta_{\alpha \rightarrow \beta}$ for every rule $\alpha \rightarrow \beta \in \mathcal{R}$. Following Kim et al. (2019), we parameterize the grammars using a neural network. We use the neural CFG parameterization from Kim et al. (2019) and develop a similar parameterization for our SCFG, with extensions for the terminal production rules. We defer the full details to the appendix (Section A).
Given a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ and a grammar $\mathcal{G} = (\mathcal{V}, \mathcal{N}, \mathcal{R})$, we find a maximum likelihood estimate $\theta^*$ by maximizing the marginal likelihood of the (unlabeled) training sentences:

$$\theta^* = \arg\max_{\theta} \sum_{i=1}^{N} \log p(x_i \mid \mathcal{G}, \theta).$$

We optimize the parameters using gradient descent. After training, we calculate the maximum likelihood trees $t_1, \ldots, t_N$ for the training data, and use these trees as the basis of further analysis.
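A minimal end-to-end sketch of this objective is shown below, using a small differentiable inside recursion and plain scalar rule logits; the paper instead uses the neural parameterization of Kim et al. (2019) and a vectorized inside algorithm. The symbol count, vocabulary size, and random toy corpus are all illustrative assumptions.

```python
import torch

def log_inside(terms, rules):
    """Differentiable inside algorithm in log space, for a toy grammar in
    which every symbol can either emit a word or expand with a binary rule.
    terms: (n, S) log p(A -> w_i); rules: (S, S, S) log p(A -> B C)."""
    n, S = terms.shape
    beta = {(i, i + 1): terms[i] for i in range(n)}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            splits = [
                (beta[(i, k)][None, :, None] + beta[(k, j)][None, None, :]
                 + rules).logsumexp(dim=(1, 2))
                for k in range(i + 1, j)
            ]
            beta[(i, j)] = torch.stack(splits).logsumexp(dim=0)
    return beta[(0, n)][0]  # treat symbol 0 as the root

S, V = 8, 50  # illustrative symbol and vocabulary sizes
rule_logits = (0.01 * torch.randn(S, S, S)).requires_grad_()
term_logits = (0.01 * torch.randn(S, V)).requires_grad_()
opt = torch.optim.Adam([rule_logits, term_logits], lr=1e-2)
corpus = [torch.randint(0, V, (5,)) for _ in range(32)]  # toy word-id corpus

for sent in corpus:
    # Normalize jointly over binary and terminal rules so that, for each
    # symbol A, its rule probabilities sum to one.
    logp = torch.cat([rule_logits.reshape(S, -1), term_logits], 1).log_softmax(1)
    rules = logp[:, :S * S].reshape(S, S, S)
    terms = logp[:, S * S:][:, sent].T  # (n, S)
    loss = -log_inside(terms, rules)    # negative log marginal likelihood
    opt.zero_grad()
    loss.backward()
    opt.step()
```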
Complexity
Training and parsing require enumerating the trees consistent with the input, which is calculated with the inside algorithm for CFGs (Lari and Young, 1990) and the bitext inside algorithm for SCFGs (Wu, 1997). The inside algorithm has space and time complexity of $O(|x|^3 |\mathcal{G}|)$ and the bitext inside algorithm has space and time complexity of $O(|x_a|^3 |x_b|^3 |\mathcal{G}|)$, where $|\mathcal{G}|$ is a grammar constant determined by the number of rules in the grammar. We use a vectorized GPU implementation of the inside algorithm provided by Torch-Struct (Rush, 2020) and we implement a vectorized version of the bitext inside algorithm. The cost of the bitext parsing algorithm imposes a practical limitation on the length of the sentences we consider (Section 3.1). We also discuss possible efficient approximations in the Limitations section, which we leave as an avenue for future work.
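For a rough sense of why this matters for the length limit in Section 3.1 (our arithmetic, ignoring the grammar constant $|\mathcal{G}|$):

```python
# Chart sizes (span/split combinations per grammar rule) for 15-word inputs.
n_a = n_b = 15
cfg_chart = n_a ** 3                  # ~3.4e3 for a single sentence
scfg_chart = (n_a ** 3) * (n_b ** 3)  # ~1.1e7 for a sentence pair
print(f"{cfg_chart:.1e} vs {scfg_chart:.1e}")
```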
2.3 Finding Features
The tree-annotated corpus provides a structured representation of the dataset that we can now query to find discriminative patterns at various levels of abstraction. We follow a simple procedure for finding dataset-level patterns using our tree annotations.
First, given a set of trees $t_1, \ldots, t_N$ with class labels $y_1, \ldots, y_N$, we extract the set of complete subtrees, which are subtrees whose leaves are all terminal symbols. There is at most one unique subtree for each non-terminal node in each tree, so the number of complete subtrees is roughly on the order of the number of words in the training data.
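A sketch of this extraction step, using a simple nested-tuple tree encoding of our own choosing:

```python
def complete_subtrees(tree):
    """Enumerate the complete subtrees of a parse tree: one per non-terminal
    node, consisting of that node and everything it dominates.
    Trees are (label, child, ...) tuples; terminal words are plain strings."""
    if isinstance(tree, str):
        return []
    subtrees = [tree]
    for child in tree[1:]:
        subtrees.extend(complete_subtrees(child))
    return subtrees

tree = ("S", ("NP", ("DT", "a"), ("NN", "girl")), ("VP", "jumps"))
for s in complete_subtrees(tree):
    print(s)  # the S, NP, DT, NN, and VP subtrees, five in total
```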
Next, we calculate the mutual information between each subtree and the class labels. We treat each subtree $s$ as a boolean-valued feature function on trees, $\phi_s(t) = \mathbb{1}[s \in t]$. Let $Z_s$ be a random variable denoting the output of $\phi_s$ and let $Y$ be a random variable over $\mathcal{Y}$. The mutual information is defined as:

$$I(Z_s; Y) = \sum_{z_s \in \{0,1\}} \sum_{y \in \mathcal{Y}} p(y, z_s) \log \frac{p(y, z_s)}{p(y)\,p(z_s)}.$$
We estimate the mutual information between $Z_s$ and $Y$ using $\hat{p}(y, z_s) \propto 1 + \sum_{i=1}^{N} \mathbb{1}[y_i = y \wedge \phi_s(t_i) = z_s]$. Mutual information measures the expected amount of information we learn about $Y$ by learning the value of $\phi_s$. While we use mutual information in this paper, we could also score the features using other feature-importance metrics, such as z-score (Gardner et al., 2021).
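A sketch of this estimator, where the add-one smoothing over the $2 \times |\mathcal{Y}|$ joint cells is our reading of $\hat{p}$ above; each subtree $s$ would supply feature_vals[i] $= \phi_s(t_i)$:

```python
import math
from collections import Counter

def mutual_information(feature_vals, labels, classes):
    """Estimate I(Z_s; Y) with add-one smoothing over the (z, y) cells."""
    joint = Counter(zip(feature_vals, labels))
    total = len(labels) + 2 * len(classes)
    mi = 0.0
    for z in (0, 1):
        for y in classes:
            p_zy = (joint[(z, y)] + 1) / total
            p_z = sum(joint[(z, y2)] + 1 for y2 in classes) / total
            p_y = sum(joint[(z2, y)] + 1 for z2 in (0, 1)) / total
            mi += p_zy * math.log(p_zy / (p_z * p_y))
    return mi

# A feature that fires mostly on one class carries information about Y.
z = [1, 1, 1, 0, 0, 0, 0, 0]
y = ["pos", "pos", "pos", "neg", "neg", "neg", "pos", "neg"]
print(mutual_information(z, y, ["pos", "neg"]))
```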
To visualize the most discriminative patterns, we group the highest-ranked subtrees according to their root label and majority class label. Let $S(\alpha, y)$ denote the set of subtrees with root label $\alpha$ and majority class label $y$. We define a composite feature, $Z_{\alpha,y} = \bigvee_{s \in S(\alpha, y)} Z_s$, as the union of features in $S(\alpha, y)$. The result of this procedure is a concise list of class-conditional non-terminal features, which we can inspect to identify the patterns that are broadly discriminative across the dataset.
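A sketch of the grouping-and-union step, assuming each subtree feature has already been evaluated on every training example (function and variable names are ours):

```python
from collections import defaultdict

def composite_features(subtree_values, root_label, majority_class):
    """Group subtree features by (root label, majority class) and OR their
    per-example values, yielding one composite feature Z_{alpha,y} per group.
    subtree_values: dict mapping subtree -> list of 0/1 values per example."""
    groups = defaultdict(list)
    for s, values in subtree_values.items():
        groups[(root_label[s], majority_class[s])].append(values)
    return {
        key: [int(any(row)) for row in zip(*members)]
        for key, members in groups.items()
    }
```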
3 Finding Shortcuts
3.1 Experimental Setup
Datasets
We apply our approach to two single-sentence and two sentence-pair classification datasets. IMDb (Maas et al., 2011) is a binary sentiment analysis dataset consisting of paragraph-length movie reviews. SUBJ (Pang and Lee, 2004) is a subjectivity classification dataset, containing sentences labeled as either subjective or objective. SNLI (Bowman et al., 2015) is a three-way classification task; given two sentences, $x_a$ and $x_b$, the objective is to determine whether $x_a$ entails $x_b$, contradicts $x_b$, or is neutral. QQP (Iyer et al., 2017) consists of pairs of questions from quora.com labeled as being either paraphrases or not paraphrases.
For all experiments, we fix the size of the grammar to 96 non-terminal symbols, divided into 32 internal non-terminal symbols and 64 pre-terminal symbols, similar to Kim et al. (2019), and use a lowercase word-level tokenizer with a maximum vocabulary size of 20,000 words. For the sentence-pair datasets, we randomly sample 65K/16K class-balanced sets of training/validation examples that fit within the length limit imposed by the bitext inside-outside algorithm ($|x_a| \times |x_b| < 225$). This length limit covers approximately 80% of SNLI and 70% of QQP. For IMDb, we split the movie reviews into sentences for the purposes of training the PCFG, but compute feature statistics using the full reviews. More implementation details are in Appendix B.
3.2 A Look at the Top Features
In this section, we qualitatively explore some of the dataset-level shortcuts we find. Our procedure for each dataset is the same: given a set of trees, we enumerate all complete subtrees and sort them by mutual information, treating each subtree as a binary indicator variable. Then we group the subtree features by root label and majority class label and inspect the most informative groups of subtrees.
The majority class label for a feature $Z$ is defined as the most common class label among training examples for which $Z = 1$. For each feature $Z$, we report the number of training examples for which $Z = 1$ (Count) and the percentage of these having the majority class label (% Majority). We present the most interesting results here and include extended results in Appendix D.1.
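These two statistics are straightforward to compute for any binary feature; a small sketch (the helper name is ours):

```python
from collections import Counter

def feature_stats(values, labels):
    """Count and % Majority for a binary feature over the training set."""
    hits = [y for z, y in zip(values, labels) if z == 1]
    majority, n_majority = Counter(hits).most_common(1)[0]
    return len(hits), majority, 100.0 * n_majority / len(hits)

z = [1, 1, 1, 0]
y = ["neg", "neg", "pos", "pos"]
print(feature_stats(z, y))  # (3, 'neg', 66.7): Count = 3, % Majority = 66.7
```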
IMDb
Not surprisingly, we find that the most informative features in IMDb include adjectives, adverbs, and nouns with high sentiment valence. We highlight some of the more interesting patterns in Table 1. For example, we discover that negative reviews are almost three times as likely as positive reviews to mention the length of the film (node 8), and we find that the grammar has learned a clear category corresponding to names (node 5).

We also confirm that the grammar recovers a known shortcut in IMDb, numerical ratings (Ross et al., 2021; Pezeshkpour et al., 2022). We find that a single non-terminal (node 29) is responsible for generating ratings, including both simple, numerical ratings, which have been documented in earlier work, as well as letter grades and ratings on a star system (Table 2).
SUBJ
In the SUBJ dataset (Table 3), the most informative features reflect how this dataset was constructed (Pang and Lee, 2004): the subjective class consists of movie reviews from Rotten Tomatoes and the objective class consists of movie summaries.