training. We have chosen three tasks on which to evaluate the Slot Transformer: CLEVRER [Yi et al., 2020], a video
dataset in which questions are posed about objects in the scene; Kinetics-600 [Carreira et al., 2018], comprising YouTube
video data for action classification; and CATER [Girdhar and Ramanan, 2019], an object-relational dataset that requires
video understanding of scene events to solve. In each of these cases we evaluate the Slot Transformer
against current state-of-the-art approaches.
2 Related Work
We have drawn inspiration from models that induce scene understanding via object discovery. In particular, the
approaches taken by Locatello et al. [2020], Greff et al. [2020], Burgess et al. [2019] all involve learning latent
representations that allow the scene to be parsed into distinct objects. Slot attention [Locatello et al., 2020] provides a
means to extract relevant scene information into a set of latent slot vectors, each of which queries scene pixels (or
ResNet-encoded super-pixels); the attention softmax is taken over the query (slot) axis rather than the input axis,
inducing slots to compete to explain each pixel in the input. Another slotted approach appears in IODINE [Greff et al.,
2020], which leverages iterative amortized inference [Marino et al., 2018, Andrychowicz et al., 2016] to produce better
posterior latent estimates for the slots, which can then be used to reconstruct the image with a spatial Gaussian mixture
whose mixing weights and component means are the masks and means decoded from each slot. We aim to
extend this type of approach by applying the same ideas to sequential input, in particular, video input.
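To make the competition induced by this normalisation concrete, the sketch below shows a single slot-attention read-out in NumPy. It is a minimal illustration under our own naming and shape conventions, not the implementation of Locatello et al. [2020]; it omits the layer normalisation, GRU update and iteration loop of the full algorithm.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_readout(slots, inputs, w_q, w_k, w_v):
    """One cross-attention read in the style of slot attention.

    slots:  [num_slots, d]   -- queries
    inputs: [num_pixels, d]  -- keys/values (e.g. ResNet-encoded super-pixels)
    """
    q = slots @ w_q                              # [num_slots, d]
    k = inputs @ w_k                             # [num_pixels, d]
    v = inputs @ w_v                             # [num_pixels, d]
    logits = q @ k.T / np.sqrt(q.shape[-1])      # [num_slots, num_pixels]
    # Normalise over the *slot* axis: each pixel's unit of attention mass is
    # shared out among slots, so slots compete to explain that pixel.
    attn = softmax(logits, axis=0)
    # Weighted mean over pixels gives each slot's update.
    weights = attn / attn.sum(axis=1, keepdims=True)
    return weights @ v, attn                     # [num_slots, d], [num_slots, num_pixels]

# Illustrative shapes only.
rng = np.random.default_rng(0)
d, num_slots, num_pixels = 32, 4, 64
slots = rng.normal(size=(num_slots, d))
inputs = rng.normal(size=(num_pixels, d))
w_q, w_k, w_v = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
updates, attn = slot_attention_readout(slots, inputs, w_q, w_k, w_v)
print(attn.sum(axis=0)[:3])   # each pixel's attention sums to 1 across slots
```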
Self-attention and transformers [Vaswani et al., 2017, Parisotto et al., 2019] are also central to our work and have
played a critical role in the field in recent years. Some of the latest applications have demonstrated the efficacy of
this mechanism when applied to problem solving in domains where a capacity to reason is a criterion for success [Santoro
et al., 2018, Clark et al., 2020, Russin et al., 2021]. In particular, self-attention provides a means to form a
pairwise relationship between any two elements in a sequence; however, this comes at a cost that scales quadratically in
the sequence length, so care must be taken when choosing how to apply this technique. One recent approach to
video sequence data, the TimeSformer of Bertasius et al. [2021], utilizes attention over image patches across time
and has been applied successfully to action classification on video sequences [Goyal et al., 2017, Carreira et al., 2018].
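To make the quadratic cost concrete, the short sketch below counts pairwise attention scores for a video of T frames, each split into N patches, comparing joint space-time attention with a divided (temporal-then-spatial) factorisation of the kind used by the TimeSformer. These are back-of-the-envelope counts only; the frame and patch numbers are illustrative and not taken from any of the cited experiments.

```python
# Back-of-the-envelope pairwise-score counts for self-attention over video
# tokens (constants, heads and channel sizes omitted).

def full_spacetime_pairs(T, N):
    """Joint attention over all T*N patch tokens: quadratic in T*N."""
    return (T * N) ** 2

def divided_spacetime_pairs(T, N):
    """Divided (temporal-then-spatial) attention: each token attends over the
    T frames at its spatial location, then over the N patches in its frame."""
    return T * N * (T + N)

T, N = 32, 196  # e.g. 32 frames, each split into 14 x 14 patches
print(f"{full_spacetime_pairs(T, N):,}")     # 39,337,984
print(f"{divided_spacetime_pairs(T, N):,}")  # 1,430,016
```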
There has been a great deal of work on self-supervised video representation learning methods [Wang et al., 2021, Qian
et al., 2020, Feichtenhofer et al., 2021, Gopalakrishnan et al., 2022]. In particular, Qian et al. [2020] explore a
contrastive loss and apply it to Kinetics-600; we include some of these results in table 3 below. Kipf et al. [2021]
present an architecture similar to ours, albeit without generative losses, and show that their model achieves
state-of-the-art performance on object segmentation in video sequences and on optical flow field prediction. Ding et al.
[2020] use the representations of an object-centric model [Burgess et al., 2019] as input to a transformer and achieve
very strong results on CLEVRER and CATER.
Neuro-symbolic logic-based models [d’Avila Garcez and Lamb, 2020] have been applied to temporal reasoning
problems and, in particular, to CLEVRER and CATER. In section 4 we examine the performance of the Dynamic
Context Learner (DCL, Chen et al. [2021]) and Neuro-Symbolic Dynamic Reasoning (NS-DR, Yi et al. [2020]):
NS-DR consists of neural networks for video parsing and dynamics prediction together with a symbolic logic program
executor, while DCL is composed of a program parser and a symbolic executor and makes use of extra labelled data
when applied to CLEVRER. Some components of these methods rely on explicit modeling of reasoning mechanisms crafted for the
problem domain or on additional labelled annotations. In contrast, the intent of our approach is to form an inductive
bias around reasoning about sequences in general.
Recently Jaegle et al. [2021] have published their work on the Perceiver, a model that makes use of attention asymmetry
to ingest temporal multi-modal input. In this work an attention bottleneck is introduced that reduces the dimensionality
of the positional axis through which model inputs pass. This provides tractability for inputs of large temporal and
spatial dimensions, where the quadratic scaling of transformers would otherwise become prohibitive. Notably, the authors
show that this approach succeeds on multimodal data, achieving state-of-the-art scores on the AudioSet dataset [Gemmeke
et al., 2017], which contains audio and video inputs. Our approach is similar in spirit to the Perceiver in compressing
representations via attention mechanisms; however, we focus on the ability to reason using encoded sequences and
primarily reduce over spatial axes.
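To illustrate the asymmetry, the sketch below implements a single cross-attention read in which a small latent array queries a much larger flattened input array, so the cost is linear rather than quadratic in the number of input tokens. This is our own minimal illustration of the bottleneck idea, not the Perceiver implementation; all names and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_bottleneck(latents, inputs, w_q, w_k, w_v):
    """A small latent array queries a large input array.

    latents: [num_latents, d]  with num_latents << num_inputs
    inputs:  [num_inputs, d]   e.g. flattened spatio-temporal tokens
    Cost is O(num_latents * num_inputs) rather than O(num_inputs^2).
    """
    q = latents @ w_q
    k = inputs @ w_k
    v = inputs @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # [num_latents, num_inputs]
    return attn @ v                                          # [num_latents, d]

# Illustrative sizes only.
rng = np.random.default_rng(0)
d, num_latents, num_inputs = 64, 128, 50_000
latents = rng.normal(size=(num_latents, d))
inputs = rng.normal(size=(num_inputs, d))
w_q, w_k, w_v = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
compressed = cross_attention_bottleneck(latents, inputs, w_q, w_k, w_v)
print(compressed.shape)  # (128, 64): subsequent self-attention acts on 128 tokens
```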
3 Model
We now present the Slot Transformer, a generative transformer model that leverages iterative inference to produce
improved latent estimates given an input sequence. Our model can broadly be described in terms of three phases (see
Figure 1): an encode phase where a spatio-temporal input is compressed to form a representation, then a decode phase