SOLVING REASONING TASKS WITH A SLOT TRANSFORMER
Ryan Faulkner
Deepmind
rfaulk@google.com
Daniel Zoran
Deepmind
danielzoran@deepmind.com
ABSTRACT
The ability to carve the world into useful abstractions in order to reason about time and space is a crucial component of intelligence. In order to perceive and act effectively using our senses, we must parse and compress large amounts of information for further downstream reasoning to take place, allowing increasingly complex concepts to emerge. If there is any hope of scaling representation learning methods to work with real world scenes and temporal dynamics, then there must be a way to learn accurate, concise, and composable abstractions across time. We present the Slot Transformer, an architecture that leverages slot attention, transformers and iterative variational inference on video scene data to infer such representations. We evaluate the Slot Transformer on the CLEVRER, Kinetics-600 and CATER datasets and demonstrate that the approach enables robust modelling of, and reasoning about, complex behaviours, with scores on these datasets that compare favourably to existing baselines. Finally we evaluate the effectiveness of key components of the architecture, the model's representational capacity and its ability to predict from incomplete input.
1 Introduction
Reasoning over time is an indispensable skill when navigating and interacting with a complex environment. However, reasoning about the world becomes an intractable problem if we are incapable of compressing it into a reduced set of relevant abstractions. For this reason, relational reasoning and abstracting high-level concepts from complex scene data is a critical area of machine learning research. Past approaches use relational reasoning and composable scene representations and have met success on static datasets [Santoro et al., 2017, 2018, Greff et al., 2020, Locatello et al., 2020, Burgess et al., 2019]; however, those gains must now be extended across time, which introduces a large set of new complexities in the form of temporal dynamics and scene physics. Some recent work has utilized transformers [Jaegle et al., 2021, Vaswani et al., 2017, Ding et al., 2020], which provide a potential path toward the type of approach that can succeed at solving this problem.
The goal of the approach presented in this paper is to learn to output useful spatio-temporal representations of the input scene sequence, which we hypothesize are critical to solving downstream tasks for scene understanding and generalisation in domains containing complex visual scenes and temporal dynamics. The target domains we have chosen for evaluating this type of model involve scenes of synthetic and real world objects and behaviours, requiring complex scene understanding and abstract reasoning via question and answer datasets. This work combines a number of existing ideas in a novel way, namely slot attention [Locatello et al., 2020], transformers [Vaswani et al., 2017] and iterative variational inference [Marino et al., 2018], in order to better understand conceptual scene representation and how it can be achieved and embedded in more complex systems. Several hypotheses drive the
direction for this work. First, information about scenes may be more efficiently represented as independent components
rather than in a monolithic representation. Second, ingesting spatio-temporal input in such a way that there is no bias
to any step of the sequence is critical (see supplementary material for analysis) when forming representations about
space and time over long sequences. Finally, processing information in an iterative fashion can provide the means to
recursively recombine information in a way that is useful for reasoning tasks.
The main contributions of this work are: 1) present the Slot Transformer architecture for spatio-temporal inference
and reasoning and a framework for learning how to encode useful representations, 2) evaluate this model against
downstream tasks that require video understanding and reasoning capabilities in order to be solved, and finally 3)
demonstrate the role the components of the overall approach play in the problem solving capabilities induced during
training. We have chosen three tasks on which to evaluate the Slot Transformer: CLEVRER [Yi et al., 2020], a video dataset where questions are posed about objects in the scene; Kinetics-600, comprising YouTube video data for action classification [Carreira et al., 2018]; and CATER [Girdhar and Ramanan, 2019], an object-relational dataset that requires video-level understanding of scene events to solve. In each of these cases we evaluate the Slot Transformer
against current state-of-the-art approaches.
2 Related Work
We have drawn inspiration from models that induce scene understanding via object discovery. In particular, the
approaches taken by Locatello et al. [2020], Greff et al. [2020], Burgess et al. [2019] all involve learning latent
representations that allow the scene to be parsed into distinct objects. Slot attention [Locatello et al., 2020] provides a
means to extract relevant scene information into a set of latent vectors where each one queries scene pixels (or ResNet
encoded super pixels) and where the attention softmax is done across the query axis rather than the channel axis,
inducing slots to compete to explain each pixel in the input. Another slotted approach appears in IODINE [Greff et al.,
2020], which leverages iterative amortized inference [Marino et al., 2018, Andrychowicz et al., 2016] to produce better
posterior latent estimates for the slots, which can then be used to reconstruct the image with a spatial Gaussian mixture, using the masks and means decoded from each slot as the mixing weights and component means respectively. We aim to
extend this type of approach by applying the same ideas to sequential input, in particular, video input.
Self-attention and transformers [Vaswani et al., 2017, Parisotto et al., 2019] are also central to our work and have
played a critical role in the field in recent years. Some of the latest applications have demonstrated the efficacy of
this mechanism applied to problem solving in domains that require a capacity to reason as a success criterion [Santoro et al., 2018, Clark et al., 2020, Russin et al., 2021]. In particular, self-attention provides a means by which to form a pairwise relationship between any two elements in a sequence; however, this comes at a cost that scales quadratically in the sequence length, so care must be taken when choosing how to apply this technique. One recent approach involving video sequence data, the TimeSformer from Bertasius et al. [2021], utilizes attention over image patches across time and has been applied successfully to action classification [Goyal et al., 2017, Carreira et al., 2018].
There has been a great deal of work on self-supervised video representation learning methods [Wang et al., 2021, Qian
et al., 2020, Feichtenhofer et al., 2021, Gopalakrishnan et al., 2022] among the family of video representation learning
models. In particular, Qian et al. [2020] explore a contrastive loss applied to Kinetics-600; we include some of these results in Table 3 below. In Kipf et al. [2021] the authors present an architecture similar
to ours, albeit without generative losses, where they show that their model achieves state-of-the-art performance on
image object segmentation in video sequences and optical flow field prediction. In Ding et al. [2020] the authors use
an object representation model (Burgess et al. [2019]) as input to a transformer and achieve very strong results on
CLEVRER and CATER.
Neuro-symbolic logic based models [d’Avila Garcez and Lamb, 2020] have been used to solve temporal reasoning problems and, in particular, have been applied to CLEVRER and CATER. In section 4 we examine the performance of the Dynamic
Context Learner (DCL, Chen et al. [2021]) and Neuro-Symbolic Dynamic Reasoning (NS-DR, Yi et al. [2020]) where
NS-DR consists of neural networks for video parsing and dynamics prediction and a symbolic logic program executor
and DCL is composed of a program parser and a symbolic executor, while making use of extra labelled data when applied
to CLEVRER. Some components of these methods rely on explicit modeling of reasoning mechanisms crafted for the
problem domain or on additional labelled annotations. In contrast, the intent of our approach is to form an inductive
bias around reasoning about sequences in general.
Recently Jaegle et al. [2021] have published their work on the Perceiver, a model that makes use of attention asymmetry
to ingest temporal multi-modal input. In this work an attention bottleneck is introduced that reduces the dimensionality
of the positional axis through which model inputs pass. This provides tractability for inputs of large temporal and
spatial dimensions, where the quadratic scaling of transformers becomes otherwise prohibitive. Notably, the authors show that this approach succeeds on multimodal data, achieving state-of-the-art scores on the AudioSet dataset [Gemmeke et al., 2017] containing audio and video inputs. Our approach is similar in spirit to the Perceiver in that it compresses representations via attention mechanisms; however, we focus on the ability to reason using encoded sequences and primarily reduce over spatial axes.
3 Model
We now present the Slot Transformer, a generative transformer model that leverages iterative inference to produce
improved latent estimates given an input sequence. Our model can broadly be described in terms of three phases (see
Figure 1): an encode phase where a spatio-temporal input is compressed to form a representation, then a decode phase
where the representation is used to reconstruct known data or predict unseen data; and finally an iterate phase, which encompasses the other two phases, whereby conditioning new representations on those from the previous iteration over the input sequence enables more useful representations to be inferred. This process flow is similar to, and inspired by, the encode-process-decode paradigm [Hamrick et al., 2018]. We leverage the ideas of iterative attention used in slot attention and IODINE [Locatello et al., 2020, Greff et al., 2020], based upon iterative variational inference, to model good, compact representations for downstream tasks.

Figure 1: The general model architecture with the encoding, decoding and iteration phases of the model. The encode phase computes an updated context $c_{k+1}$, the decode phase produces a reconstruction $x'_k$. Finally, the iterate process repeats the encode-decode steps, compressing more information into the context at each iteration. KL and likelihood losses are computed across all iterations.
3.1 Encoder
The input sequence $X \in \mathbb{R}^{T \times H \times W \times 3}$ is encoded once from raw pixels with a residual network [He et al., 2015] applied to each frame across the sequence: $e = f_{\mathrm{ResNet}}(X)$, where $e \in \mathbb{R}^{T \times H' \times W' \times E}$ with $H' < H$ and $W' < W$. To the resulting image encodings $e$ we concatenate a spatial encoding basis comprised of Fourier basis functions dependent upon the spatial coordinates of the pixels.
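As a rough illustration of this step, the sketch below builds a small Fourier spatial basis and concatenates it to per-frame feature maps. The number of frequencies, the coordinate normalisation and the basis itself are assumptions chosen for illustration; the paper does not specify them here.

```python
import numpy as np

def fourier_spatial_basis(h, w, num_freqs=4):
    """Build a Fourier position basis over an h x w grid.

    Returns an array of shape (h, w, 4 * num_freqs): sin/cos pairs for each
    of the two spatial coordinates at num_freqs frequencies (an assumed,
    common choice of basis).
    """
    ys = np.linspace(-1.0, 1.0, h)
    xs = np.linspace(-1.0, 1.0, w)
    grid_y, grid_x = np.meshgrid(ys, xs, indexing="ij")       # (h, w) each
    feats = []
    for f in range(1, num_freqs + 1):
        for coord in (grid_y, grid_x):
            feats.append(np.sin(np.pi * f * coord))
            feats.append(np.cos(np.pi * f * coord))
    return np.stack(feats, axis=-1)                            # (h, w, 4*num_freqs)

def concat_spatial_basis(e):
    """Concatenate the spatial basis to encoded frames e of shape (T, H', W', E)."""
    t, h, w, _ = e.shape
    basis = fourier_spatial_basis(h, w)                        # (H', W', B)
    basis = np.broadcast_to(basis, (t,) + basis.shape)         # (T, H', W', B)
    return np.concatenate([e, basis], axis=-1)

# Example: 8 frames of 16x16 feature maps with 64 channels.
e = np.random.randn(8, 16, 16, 64)
print(concat_spatial_basis(e).shape)                           # (8, 16, 16, 64 + 16)
```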
Next we define the context: a set of $T \times K$ vectors (or slots), each of $C_{\mathrm{context}}$ dimensions, that store contextual information about the scene. We denote the context as $c \in \mathbb{R}^{T \times K \times C_{\mathrm{context}}}$, where $T$ is the number of time steps in the input sequence. We will perform iterative updates on the context over the entire spatio-temporal volume with each forward pass through the model. At each iteration, the encoder's role is to infer an updated spatio-temporal context via the context transformer, where the input is the combination of the context from the previous iteration plus the output of the slot attention on the input (more detail in Section 3.3).
We initialize the context by sampling a standard Gaussian of size $K \times C_{\mathrm{context}}$ and then tiling this tensor over the same number of time steps as the input$^1$. This yields the initial context $c_0$. Sampling the context this way we break symmetry across the slots at each frame, and by tiling across time we encourage slots to have the same role across adjacent time-steps. The slotted context provides an inductive bias to learn structured state information of the scene dynamics over time and space, which the attention and context transform operations facilitate.
$^1$ It should be noted that while in this work the encoded representations retain the full time dimension, this is not strictly necessary and this architecture may be modified to project to lower-cardinality time dimensions in a straightforward manner.
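A minimal sketch of this initialisation, assuming illustrative shapes: a single $K \times C_{\mathrm{context}}$ standard-Gaussian draw is tiled across the $T$ time steps so that every frame starts from the same per-slot noise.

```python
import numpy as np

def init_context(t, k, c_context, rng=None):
    """Sample an initial context c0 of shape (T, K, C_context).

    A single (K, C_context) standard-Gaussian draw breaks symmetry across
    slots; tiling the same draw across time encourages each slot to keep
    the same role on adjacent time steps.
    """
    rng = np.random.default_rng() if rng is None else rng
    slots = rng.standard_normal((k, c_context))     # one draw per slot
    return np.tile(slots[None, :, :], (t, 1, 1))    # repeat across T steps

c0 = init_context(t=16, k=8, c_context=64)
print(c0.shape)                   # (16, 8, 64)
print(np.allclose(c0[0], c0[5]))  # True: identical across time at initialization
```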
3.1.1 Slot Attention
Given the initial context, $c_0$, we next apply a shared MLP across the slots to construct the queries, $q_k$, one for each slot at each time step. We use these queries to attend over the keys and values decoded from the super pixels in $e$ and their corresponding spatial position encodings. The softmax step in this operation is applied across the slot dimension such that slots "compete" over which pixels in the scene they "explain", and losses that reward more efficient representations will induce the model to learn representations that best explain the scene [Locatello et al., 2020]. One critical detail of this attention readout is that it occurs batched across time steps; learning temporal relationships is handled by the context transformer described below. Much of this detail is captured in Figure 1.
The readout from the slot-attention step, $a_k$, for each slot at each time step is combined with the context via GRU-style gating [Hochreiter and Schmidhuber, 1997, Cho et al., 2014] with the gating function $c'_k = f_{\mathrm{gate}}(c_k, a_k)$.
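To make the readout concrete, here is a minimal numpy sketch of a single slot-attention step at one time step, with the softmax taken over the slot axis and a simple sigmoid gate standing in for the GRU-style update. The projection shapes and the gating function are illustrative assumptions, not the exact implementation.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_step(context, pixels, wq, wk, wv):
    """One slot-attention readout at a single time step.

    context: (K, C) slots; pixels: (N, E) encoded super pixels (+ position basis).
    The softmax runs over the K slot axis, so slots compete to explain each pixel.
    """
    q = context @ wq                                         # (K, D)
    k = pixels @ wk                                          # (N, D)
    v = pixels @ wv                                          # (N, D)
    logits = q @ k.T / np.sqrt(q.shape[-1])                  # (K, N)
    attn = softmax(logits, axis=0)                           # compete across slots, per pixel
    attn = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)   # weighted mean per slot
    return attn @ v                                          # (K, D) readout a_k

def gated_update(context, readout, w_gate):
    """Stand-in for the GRU-style gate combining the context with the readout."""
    gate = 1.0 / (1.0 + np.exp(-(np.concatenate([context, readout], -1) @ w_gate)))
    return gate * context + (1.0 - gate) * readout

K, C, N, E = 8, 64, 256, 80
rng = np.random.default_rng(0)
ctx, pix = rng.standard_normal((K, C)), rng.standard_normal((N, E))
wq, wk, wv = rng.standard_normal((C, C)), rng.standard_normal((E, C)), rng.standard_normal((E, C))
a = slot_attention_step(ctx, pix, wq, wk, wv)
ctx_new = gated_update(ctx, a, rng.standard_normal((2 * C, C)))
print(ctx_new.shape)  # (8, 64)
```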
3.1.2 Context Transformer
Now that we have obtained the updated context $c'$ using the input, it is passed through a transformer [Vaswani et al., 2017], $c_k = T_{\mathrm{context}}(c'_k)$, where positional encodings across the time sequence are applied but no masking. This enables the model to form temporal connections across the sequence via the context slots. At this step the same transformer weights are applied separately to each slot across time, meaning that elements of each slot only communicate with other elements of that slot over the time axis.
The reason for this approach is two-fold: 1) it provides an inductive bias for the slots to adopt specialized roles when explaining the input, and 2) it ensures that the complexity of the transform operation does not scale with the number of slots. At this step we could choose to introduce a temporal asymmetry, as has shown promise in other work [Jaegle et al., 2021]. This could yield better scalability and representational power if done correctly, and we leave this as an avenue for future work.
One final note: the context transformer is only applied on the first iteration of the encoder, as this allows the model to scale more easily to greater numbers of iterations; on subsequent iterations information from the input is integrated through the slot attention operation alone.
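The sketch below shows the reshaping this design implies: the context of shape (T, K, C) is split into one sequence per slot so that a shared, unmasked attention runs over the time axis only. A single attention head stands in for the full transformer block; the positional-encoding and weight details are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_over_time(x, wq, wk, wv):
    """Unmasked single-head self-attention over a (T, C) sequence."""
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)   # (T, T)
    return attn @ v

def context_transformer(context, wq, wk, wv, time_pos):
    """Apply the same attention weights to each slot independently over time.

    context: (T, K, C); time_pos: (T, C) temporal positional encodings.
    Elements of one slot never attend to elements of another slot, so the
    cost scales with T^2 but not with the number of slots K.
    """
    t, k, c = context.shape
    x = context + time_pos[:, None, :]             # add temporal positions
    per_slot = np.transpose(x, (1, 0, 2))          # (K, T, C): one sequence per slot
    out = np.stack([self_attention_over_time(s, wq, wk, wv) for s in per_slot])
    return np.transpose(out, (1, 0, 2))            # back to (T, K, C)

T, K, C = 16, 8, 64
rng = np.random.default_rng(0)
ctx = rng.standard_normal((T, K, C))
wq, wk, wv = (rng.standard_normal((C, C)) for _ in range(3))
print(context_transformer(ctx, wq, wk, wv, rng.standard_normal((T, C))).shape)  # (16, 8, 64)
```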
3.2 Decoder
Once the encoder has generated an updated context we may hang new losses off of this representation by way of the decoder. From here there are many possibilities for the way forward; for the scope of this work we chose to explore image reconstruction over all frames in the sequence. We hypothesize that this will help induce the model to learn useful and composable representations via the slots and to learn the temporal dynamics of the input via the context transformer. We also decided to apply variational inference [Doersch, 2021], as we posit that this should help the model to better generalize by compressing the latent representations. Therefore, we use the context to parameterize a Gaussian distribution (approximate posterior) by linearly projecting the context into parameters $\lambda_k \in \mathbb{R}^{2C_{\mathrm{latent}}}$ and decode after sampling from it: $\mu_k, \log\sigma_k = \lambda_k = W_\lambda c_k$ and $z_k \sim \mathcal{N}(\mu_k, \sigma_k)$. Our choice of decoder is certainly dependent upon this choice; we discuss it in more detail in the next section.
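A minimal sketch of this projection and sampling step, assuming a diagonal Gaussian and a simple mean / log-standard-deviation split of the projected parameters.

```python
import numpy as np

def posterior_sample(context, w_lambda, rng=None):
    """Project context (T, K, C_context) to Gaussian parameters and sample z.

    w_lambda: (C_context, 2 * C_latent). The projection is split into a mean
    and a log standard deviation per latent channel; z is drawn with the
    reparameterization trick.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = context @ w_lambda                        # (T, K, 2 * C_latent)
    mu, log_sigma = np.split(lam, 2, axis=-1)       # (T, K, C_latent) each
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(log_sigma) * eps                # reparameterized sample
    return z, mu, log_sigma

T, K, C_context, C_latent = 16, 8, 64, 32
rng = np.random.default_rng(0)
ctx = rng.standard_normal((T, K, C_context))
z, mu, log_sigma = posterior_sample(ctx, rng.standard_normal((C_context, 2 * C_latent)), rng)
print(z.shape)  # (16, 8, 32)
```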
3.2.1 Spatial Broadcast Decoder & Masks
Since the slots contain information that is likely spatially entangled, we want to ensure that the encoder is left free to learn the spatio-temporal, content-based representations among the slots that can then be used by the decoder to rebuild the input. For this reason we use a spatial broadcast decoder [Watters et al., 2019] to distill information from the slots, with the key property that the context channels are broadcast across spatial dimensions and decoded without upsampling; in effect each slot is allowed to explain independent parts of the image content. The spatial decoder is batch applied across the batch, time and slot dimensions of the slotted latent to produce a one-channel mask, $m_{k,t}$, and an RGB mean image, $r_{k,t}$: $m_{k,t}, r_{k,t} = f_{\mathrm{decoder}}(z_{k,t})$. These two elements are combined via a weighted sum in a manner similar to that done in Slot Attention: $x'_t = \sum_{k=1}^{K} m_{k,t}\, r_{k,t}$.
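The sketch below illustrates the broadcast-and-combine step: each slot latent is tiled over the spatial grid, decoded (here by a toy linear map rather than the real convolutional decoder) into a mask logit and an RGB mean, and the per-slot outputs are combined by a weighted sum. Normalising the masks with a softmax across slots is an assumption borrowed from related slot-based decoders.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_broadcast_decode(z_t, w_dec, h, w):
    """Decode the K slot latents of one frame into masks and RGB means.

    z_t: (K, C_latent); w_dec: (C_latent + 2, 4) toy linear decoder standing in
    for the convolutional spatial broadcast decoder. Each latent is broadcast
    to every (y, x) location, concatenated with coordinates, and decoded to
    one mask logit plus three RGB channels.
    """
    k, c = z_t.shape
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
    coords = np.stack([ys, xs], axis=-1)                         # (h, w, 2)
    grid = np.broadcast_to(z_t[:, None, None, :], (k, h, w, c))  # broadcast latents
    grid = np.concatenate([grid, np.broadcast_to(coords, (k, h, w, 2))], axis=-1)
    out = grid @ w_dec                                           # (K, h, w, 4)
    mask_logits, rgb = out[..., :1], out[..., 1:]
    masks = softmax(mask_logits, axis=0)                         # normalize across slots
    return masks, rgb

def combine(masks, rgb):
    """Weighted sum of per-slot RGB means: x'_t = sum_k m_{k,t} * r_{k,t}."""
    return (masks * rgb).sum(axis=0)                             # (h, w, 3)

K, C_latent, H, W = 8, 32, 16, 16
rng = np.random.default_rng(0)
masks, rgb = spatial_broadcast_decode(rng.standard_normal((K, C_latent)),
                                      rng.standard_normal((C_latent + 2, 4)), H, W)
print(combine(masks, rgb).shape)  # (16, 16, 3)
```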
3.3 Iterative Model & Losses
As mentioned in Section 3.2, variational inference is applied to the model by projecting the context output from the encoder as Gaussian posterior parameter estimates over a latent distribution from which we sample $z_k$.
We define a conditional prior, $P(z \mid x_{1..p})$, by feeding the first $p < T$ steps of the context to the context transformer, thereby limiting the prior to information within a subsequence of the input to form the prior context. To predict the remaining $T - p$ steps of the latent representation, an auto-regressive model is used where the initial state is formed via a self-attention operation on the prior context concatenated with a learnable token. After the self-attention operation is executed, the output token is read and used as the initial state of the auto-regressive model. The output of this model is combined with the prior context to form the full definition of the latent prior parameters for the full sequence.
We make use of iterative inference [Marino et al., 2018] as applied by Greff et al. [2020], where updates to the posterior parameter estimate, $\lambda_k$, are computed dependent on the sampled latents $z_k$, the input data $x$, the encoder architecture $f_\phi$, and auxiliary inputs $a_k$:

$$\lambda_{k+1} = \lambda_k + f_\phi(z_k, x, a_k) \qquad (1)$$
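In code, Equation 1 amounts to a short refinement loop; the encoder call below is a placeholder standing in for the full slot-attention and context-transformer encoder, and the shapes are illustrative.

```python
import numpy as np

def iterative_inference(x, lam0, encoder_update, num_iters=3, rng=None):
    """Refine posterior parameters lambda over several iterations (Eq. 1).

    lam0: initial parameters of shape (T, K, 2 * C_latent).
    encoder_update: callable f_phi(z, x, aux) -> delta-lambda of the same shape,
    standing in for the slot-attention encoder; aux carries auxiliary inputs.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = lam0
    for _ in range(num_iters):
        mu, log_sigma = np.split(lam, 2, axis=-1)
        z = mu + np.exp(log_sigma) * rng.standard_normal(mu.shape)  # sample latents
        lam = lam + encoder_update(z, x, aux=None)                  # lambda_{k+1}
    return lam

# Toy example with a random "encoder" just to exercise the loop shape-wise.
T, K, C_latent = 16, 8, 32
rng = np.random.default_rng(0)
fake_update = lambda z, x, aux: 0.01 * rng.standard_normal(z.shape[:-1] + (2 * C_latent,))
lam = iterative_inference(x=None, lam0=np.zeros((T, K, 2 * C_latent)),
                          encoder_update=fake_update, rng=rng)
print(lam.shape)  # (16, 8, 64)
```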
On each iteration we also compute the reconstruction loss in addition to the Kullback-Leibler (KL) divergence [Kullback and Leibler, 1951] of the posterior parameters with respect to the prior (i.e. the ELBO):

$$\mathcal{L}_{\mathrm{gen}} = \sum_{t=1}^{T} \sum_{k=1}^{K} D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu^{\mathrm{post}}_{k,t}, \sigma^{\mathrm{post}\,2}_{k,t}) \,\|\, \mathcal{N}(\mu^{\mathrm{prior}}_{k,t}, \sigma^{\mathrm{prior}\,2}_{k,t})\right) - \log \mathcal{L}(x_t \mid x'_t) \qquad (2)$$

We use a mixture distribution for the output likelihood $\log \mathcal{L}(x_t \mid x'_t)$, where the masks define a categorical distribution of components over Gaussian data distributions.
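A sketch of this generative loss, using the closed-form KL between diagonal Gaussians and a spatial Gaussian-mixture log-likelihood in which the decoded masks act as mixing weights. The fixed output scale sigma_x is an assumption.

```python
import numpy as np

def kl_diag_gaussians(mu_q, log_sigma_q, mu_p, log_sigma_p):
    """Closed-form KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ), summed over channels."""
    var_q, var_p = np.exp(2 * log_sigma_q), np.exp(2 * log_sigma_p)
    kl = log_sigma_p - log_sigma_q + (var_q + (mu_q - mu_p) ** 2) / (2 * var_p) - 0.5
    return kl.sum(axis=-1)

def mixture_log_likelihood(x_t, masks, rgb, sigma_x=0.1):
    """Spatial Gaussian-mixture likelihood for one frame.

    x_t: (H, W, 3); masks: (K, H, W, 1) mixing weights (sum to 1 over K);
    rgb: (K, H, W, 3) per-slot component means; sigma_x is an assumed fixed scale.
    """
    log_comp = -0.5 * ((x_t[None] - rgb) ** 2 / sigma_x**2
                       + np.log(2 * np.pi * sigma_x**2)).sum(axis=-1, keepdims=True)
    log_mix = np.log(masks + 1e-8) + log_comp                   # (K, H, W, 1)
    m = log_mix.max(axis=0, keepdims=True)                      # log-sum-exp over slots
    return (m + np.log(np.exp(log_mix - m).sum(axis=0, keepdims=True))).sum()

# Toy shapes: KL term per (t, k) plus negative log-likelihood per frame.
T, K, C, H, W = 4, 3, 8, 16, 16
rng = np.random.default_rng(0)
kl = kl_diag_gaussians(rng.standard_normal((T, K, C)), np.zeros((T, K, C)),
                       np.zeros((T, K, C)), np.zeros((T, K, C))).sum()
masks = np.full((K, H, W, 1), 1.0 / K)
nll = -sum(mixture_log_likelihood(rng.standard_normal((H, W, 3)), masks,
                                  rng.standard_normal((K, H, W, 3))) for _ in range(T))
print(float(kl + nll))  # scalar L_gen
```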
3.4 Auxiliary Losses
In addition to the variational and supervised losses we define auxiliary losses to help to induce better representations for
scene understanding. For this we use the information from the slot specialization and language describing the scene
when available.
3.4.1 Object Mask Prediction
This loss is formed via model predictions on the latent slot values for selected target frames, similar to the self-supervision strategies used in Ding et al. [2020]. For this we take the samples from the final-iteration context, $z_K$, then select $s < S$ slots at random, masking out the last $k$ steps, for $k$ a fixed parameter; this is then fed once through the context transformer to compose $z'_K$. Next, from the masked steps we randomly select a subset of step indices, $T_{\mathrm{targets}}$, and compute the $L_2$ norm over the difference between the target predictions from $z'_K$ and the true latents from $z_K$: $\mathcal{L}_{\mathrm{object}} = \sum_{i \in T_{\mathrm{targets}}} \| z_K - z'_K \|$. All of our top results were obtained when including this loss, with predictions done across the last half of the input sequence for each of 1, 2, and 3 slots, summed together.
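A shape-level sketch of this auxiliary loss, under assumed shapes and with an identity placeholder for the context transformer: a few randomly chosen slots have their last k steps masked, the sequence is re-encoded, and the L2 difference between predicted and true latents is summed over randomly chosen target steps.

```python
import numpy as np

def object_prediction_loss(z_true, context_transformer, num_slots_masked=2,
                           mask_steps=8, num_targets=4, rng=None):
    """L2 prediction loss on masked slot latents (shapes only; toy transformer).

    z_true: (T, K, C) final-iteration latents. A few slots have their last
    `mask_steps` steps zeroed out, the sequence is passed once through the
    context transformer, and predictions at randomly selected masked steps
    are compared to the true latents.
    """
    rng = np.random.default_rng() if rng is None else rng
    t, k, c = z_true.shape
    z_masked = z_true.copy()
    slots = rng.choice(k, size=num_slots_masked, replace=False)
    z_masked[t - mask_steps:, slots, :] = 0.0                   # mask out last steps
    z_pred = context_transformer(z_masked)                      # one pass to fill them in
    targets = rng.choice(np.arange(t - mask_steps, t), size=num_targets, replace=False)
    diff = z_pred[targets][:, slots] - z_true[targets][:, slots]
    return np.sqrt((diff ** 2).sum(axis=-1)).sum()              # summed L2 norms

T, K, C = 16, 8, 32
rng = np.random.default_rng(0)
z_true = rng.standard_normal((T, K, C))
identity_transformer = lambda z: z                              # placeholder transformer
print(object_prediction_loss(z_true, identity_transformer, rng=rng))
```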
3.4.2 Question Prediction
Given a Question-Answer dataset, we hypothesize that we can learn better representations if they are good enough to predict the original question when conditioned on the answer. During training, given the answer and the model latent $z_K$, we compose a new belief vector from these using a transformer and a CLS token. This serves as the initial state of an auto-regressive model where we use teacher forcing from the question tokens and compute the sequence of question word embeddings. The cross-entropy between the predicted question embeddings and the true ones forms the question prediction loss, $\mathcal{L}_{\mathrm{question}}$.
3.5 Heads for Downstream Tasks
The context output from the final iteration, $c_K$, now forms an encoded sequence that may be used as input for specialized tasks. The task head in Figure 2 depicts the general head architecture used for the downstream tasks defined in Section 4. The full input to the head consists of the question(s) provided with the task (when available), the final context from the encoder $c_K$, the mask information from the last iteration of decoder output $m_K$, and a classification token, CLS. As alluded to in Section 3.6, we hypothesize at this point that $c_K$ and $m_K$ should contain information about the objects in the scene and the relations they share with each other across the sequence. Note that the time axis of $c_K$ is now expanded to include all slots, so it is a sequence of length $T \times K$. The token CLS is prepended to the input sequence.
These elements are concatenated along the time axis, including absolute position encodings, and form the input to a multi-layer gated transformer [Parisotto et al., 2019]. From the output, the contents of the initial position corresponding to the CLS token are fed into an MLP (and softmaxed, depending on the task) to produce logits, which can then be compared to the task labels via cross-entropy to produce a supervised loss $\mathcal{L}_{\mathrm{QA}}$.
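A heavily simplified sketch of this head: the slotted context is flattened over time, concatenated with question embeddings and a CLS token, passed through a toy single-layer attention block in place of the gated transformer, and the CLS position is read out for classification. The mask input $m_K$ and the exact position encodings are omitted; all names and shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def task_head(c_k, question_emb, cls_token, w_attn, w_out):
    """Classify from the CLS position of a toy single-layer attention head.

    c_k: (T, K, D) final context, flattened to a length T*K sequence;
    question_emb: (L, D) question token embeddings; cls_token: (D,);
    w_attn: (D, D) shared q/k/v projection (a simplification); w_out: (D, num_classes).
    """
    seq = np.concatenate([cls_token[None], question_emb,
                          c_k.reshape(-1, c_k.shape[-1])], axis=0)   # (1 + L + T*K, D)
    pos = np.linspace(-1, 1, seq.shape[0])[:, None]                  # crude position code
    seq = seq + pos
    q = k = v = seq @ w_attn
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    out = attn @ v
    logits = out[0] @ w_out                                          # read CLS position
    return softmax(logits)

T, K, D, L, num_classes = 16, 8, 32, 10, 21
rng = np.random.default_rng(0)
probs = task_head(rng.standard_normal((T, K, D)), rng.standard_normal((L, D)),
                  rng.standard_normal(D), rng.standard_normal((D, D)),
                  rng.standard_normal((D, num_classes)))
print(probs.shape)  # (21,)
```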