CTL++: Evaluating Generalization on Never-Seen Compositional Patterns
of Known Functions, and Compatibility of Neural Representations
Róbert Csordás1, Kazuki Irie1, Jürgen Schmidhuber1,2
1The Swiss AI Lab IDSIA, USI & SUPSI, Lugano, Switzerland
2AI Initiative, KAUST, Thuwal, Saudi Arabia
{robert, kazuki, juergen}@idsia.ch
Abstract
Well-designed diagnostic tasks have played a key role in studying the failure of neural nets (NNs) to generalize systematically. Famous examples include SCAN and Compositional Table Lookup (CTL). Here we introduce CTL++, a new diagnostic dataset based on compositions of unary symbolic functions. While the original CTL is used to test length generalization or productivity, CTL++ is designed to test systematicity of NNs, that is, their capability to generalize to unseen compositions of known functions. CTL++ splits functions into groups and tests performance on group elements composed in a way not seen during training. We show that recent CTL-solving Transformer variants fail on CTL++. The simplicity of the task design allows for fine-grained control of task difficulty, as well as many insightful analyses. For example, we measure how much overlap between groups is needed by tested NNs for learning to compose. We also visualize how learned symbol representations in outputs of functions from different groups are compatible in case of success but not in case of failure. These results provide insights into failure cases reported on more complex compositions in the natural language domain. Our code is public.1
1 Introduction
Neural networks (NNs) should ideally learn from training data to generalize systematically (Fodor et al., 1988), by learning generally applicable rules instead of relying on pure pattern matching. Existing NNs, however, typically do not. For example, in the context of sequence-processing NNs, superficial differences between training and test distributions, e.g., with respect to input sequence length or unseen input/word combinations, are enough to prevent current NNs from generalizing (Lake and Baroni, 2018). Training on large amounts of data might alleviate the problem, but it is infeasible to cover all possible lengths and combinations.

1https://github.com/robertcsordas/ctlpp
Indeed, while large language models trained on large amounts of data have obtained impressive results (Brown et al., 2020), they often fail on tasks requiring simple algorithmic reasoning, e.g., simple arithmetic (Rae et al., 2021). A promising way to achieve systematic generalization is to make NNs more compositional (Schmidhuber, 1990), by reflecting and exploiting the hierarchical structure of many problems either within some NN's learned weights, or through tailored NN architectures. For example, recent work by Csordás et al. (2022) proposes architectural modifications to the standard Transformer (Vaswani et al., 2017) motivated by the principles of compositionality. The resulting Neural Data Router (NDR) exhibits strong length generalization or productivity on representative datasets such as Compositional Table Lookup (CTL; Liska et al., 2018; Hupkes et al., 2019).
The focus of the present work is on systematicity: the capability to generalize to unseen compositions of known functions/words. This is crucial for learning to process natural language or to reason on algorithmic problems without an excessive amount of training examples. Some of the existing benchmarks (such as COGS (Kim and Linzen, 2020) and PCFG (Hupkes et al., 2020)) are almost solvable by plain NNs with careful tuning (Csordás et al., 2021), while others, such as CFQ (Keysers et al., 2020), are much harder. A recent analysis of CFQ by Bogin et al. (2022) suggests that the difficult examples have a common characteristic: they contain some local structures (describable by parse trees) which are not present in the training examples. These findings provide hints for constructing diagnostic tasks for testing systematicity that are both challenging and intuitive (simple to define and analyze).
We propose CTL++, a new diagnostic dataset building upon CTL. CTL++ is basically as simple as the
original CTL in terms of task definition, but adds the core challenge of compositional generalization absent in CTL. Such simplicity allows for insightful analyses: one low-level reason for the failure to generalize compositionally appears to be the failure to learn functions whose outputs are symbol representations compatible with the inputs of other learned neural functions. We will visualize this.
Well-designed diagnostic datasets have historically contributed to studies of systematic generalization in NNs. Our CTL++ strives to continue this tradition.
2 Original CTL
Our new task (Sec. 3) is based on the CTL task (Liska et al., 2018; Hupkes et al., 2019; Dubois et al., 2020), whose examples consist of compositions of bijective unary functions defined over a set of symbols. Each example in the original CTL is defined by one input symbol and a list of functions to be applied sequentially, i.e., the first function is applied to the input symbol and the resulting output becomes the input to the second function, and so forth. The functions are bijective and randomly generated. The original CTL uses eight different symbols. We represent each symbol by a natural number, and each function by a letter. For example, ‘d a b 3’ corresponds to the expression d(a(b(3))). The model has to predict the corresponding output symbol (this can be viewed as a sequence classification task). When the train/test distributions are independent and identically distributed (IID), even the basic Transformer achieves perfect test accuracy (Csordás et al., 2022). The task becomes more interesting when test examples are longer than training examples. In such a productivity split, which is the common setting of the original CTL (Dubois et al., 2020; Csordás et al., 2022), standard Transformers fail, while the NDR and the bidirectional LSTM work perfectly.
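To make this setup concrete, below is a minimal sketch of how such CTL examples could be generated. The eight-symbol alphabet follows the description above; the letter names, the helpers make_random_bijections and make_example, and the string format are illustrative assumptions, not the released dataset code.

import random

NUM_SYMBOLS = 8                     # the original CTL uses eight symbols
FUNCTION_NAMES = list("abcdefgh")   # assumed names; each letter denotes one unary function

def make_random_bijections(names, num_symbols, rng):
    # One random bijective lookup table (a permutation of the symbols) per function name.
    tables = {}
    for name in names:
        perm = list(range(num_symbols))
        rng.shuffle(perm)
        tables[name] = perm          # tables[name][x] is the output symbol for input symbol x
    return tables

def make_example(tables, length, rng):
    # Sample a composition such as 'd a b 3' and compute its target symbol d(a(b(3))).
    fns = [rng.choice(sorted(tables)) for _ in range(length)]
    start = rng.randrange(NUM_SYMBOLS)
    value = start
    for name in reversed(fns):       # the function written next to the symbol is applied first
        value = tables[name][value]
    return " ".join(fns + [str(start)]), value

rng = random.Random(0)
tables = make_random_bijections(FUNCTION_NAMES, NUM_SYMBOLS, rng)
print(make_example(tables, length=3, rng=rng))   # prints a (source string, target symbol) pair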
3 Extensions for Systematicity: CTL++
To introduce a systematicity split to the CTL framework, we divide the set of functions into disjoint groups and restrict the sampling process such that some patterns of composition between group elements are never sampled for training, only for testing. Based on this simple principle, we derive three variations of CTL++. They differ from each other in terms of the compositional patterns used for testing (and excluded from training), as described below.
Figure 1: Sampling graph for variant ‘A’ (nodes: IN, Ga, Gb, OUT).
We also visualize the differences using sampling graphs, in which the nodes represent the groups and the edges specify the possible compositional patterns. The colors of the edges reflect when the edges are used: black for both training and testing, blue only for training, and red only for testing.
Variation ‘A’ (as in ‘Alternating’). Here functions are divided into two groups, Ga and Gb. During training, successively composed functions are sampled from different groups in an alternating way, i.e., successive functions cannot be from the same group. During testing, however, only functions from the same group can be composed. The sampling graph is shown in Fig. 1. Importantly, single function applications are part of the training set, to allow the model to learn common input/output symbol representations for the interface between the different groups.
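As a rough sketch of this sampling constraint (the concrete group assignment, the composition length, and the helper sample_functions are illustrative assumptions, not the released generator):

import random

rng = random.Random(0)

# Assumed split of the eight function names into two disjoint groups.
GROUP_A = list("abcd")
GROUP_B = list("efgh")

def sample_functions(length, mode, rng):
    # mode == "alternate": successive functions always come from different groups.
    # mode == "repeat":    all functions come from one randomly chosen group.
    if mode == "repeat":
        group = rng.choice([GROUP_A, GROUP_B])
        return [rng.choice(group) for _ in range(length)]
    order = [GROUP_A, GROUP_B] if rng.random() < 0.5 else [GROUP_B, GROUP_A]
    return [rng.choice(order[i % 2]) for i in range(length)]

# Variant 'A': alternating compositions for training, same-group compositions only at test time.
# Single-function applications (length 1) are also included in the training set.
train_fns = sample_functions(length=4, mode="alternate", rng=rng)
test_fns = sample_functions(length=4, mode="repeat", rng=rng)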
Variation ‘R’ (as in ‘Repeating’). This variant is the complement of variation ‘A’ above. To get a training example, either Ga or Gb is sampled, and all functions in that example are drawn from that same group for the whole sequence. In test examples, functions are sampled in an alternating way. There is thus no exchange of information between the groups during training, except for the shared input embeddings and the output classification weight matrix. The sampling graph is like the one in Fig. 1 for ‘A’, except that blue edges become red and vice versa (see Fig. 5 in the appendix).
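Under the same illustrative assumptions, variation ‘R’ is obtained simply by swapping the two modes of the sample_functions helper sketched above:

# Variant 'R': same-group compositions for training, alternating compositions only at test time.
train_fns_r = sample_functions(length=4, mode="repeat", rng=rng)
test_fns_r = sample_functions(length=4, mode="alternate", rng=rng)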
Variation ‘S’ (as in ‘Staged’). In this variant, functions are divided into five disjoint groups: Ga1, Ga2, Gb1, Gb2 and Go. As indicated by the indices, each group belongs to one of the two paths (‘a’ or ‘b’) and one of the two stages (‘1’ or ‘2’), except for Go, which belongs only to stage ‘2’ and is shared between paths ‘a’ and ‘b’ during training. The corresponding sampling graph is shown in Fig. 2. To get a training example, we sample an integer K, which defines the sequence length as 2K + 1, and iterate the following process for k ∈ [0, ..., K] and