training. We have chosen three tasks on which to evaluate the Slot Transformer: CLEVRER [Yi et al., 2020], a video
dataset in which questions are posed about objects in the scene; Kinetics-600 [Carreira et al., 2018], comprising YouTube
video data for action classification; and CATER [Girdhar and Ramanan, 2019], an object-relational dataset that requires
video understanding of scene events to solve. In each of these cases we evaluate the Slot Transformer
against current state-of-the-art approaches.
2 Related Work
We have drawn inspiration from models that induce scene understanding via object discovery. In particular, the
approaches taken by Locatello et al. [2020], Greff et al. [2020], Burgess et al. [2019] all involve learning latent
representations that allow the scene to be parsed into distinct objects. Slot attention [Locatello et al., 2020] provides a
means to extract relevant scene information into a set of latent slot vectors, each of which queries scene pixels (or
ResNet-encoded super-pixels); the attention softmax is taken over the query (slot) axis rather than the input axis,
inducing slots to compete to explain each pixel in the input. Another slotted approach appears in IODINE [Greff et al.,
2020], which leverages iterative amortized inference [Marino et al., 2018, Andrychowicz et al., 2016] to produce better
posterior latent estimates for the slots, which can then be used to reconstruct the image with a spatial Gaussian mixture
whose mixing weights and component means are the masks and means decoded from each slot. We aim to
extend this type of approach by applying the same ideas to sequential input, in particular, video input.
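To make the competition induced by this normalisation concrete, the sketch below shows a single slot-attention read-out in NumPy. It is a minimal illustration under our own naming and shape conventions, not the implementation of Locatello et al. [2020]; it omits the layer normalisation, GRU update and iteration loop of the full algorithm.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_readout(slots, inputs, w_q, w_k, w_v):
    """One cross-attention read in the style of slot attention.

    slots:  [num_slots, d]   -- queries
    inputs: [num_pixels, d]  -- keys/values (e.g. ResNet-encoded super-pixels)
    """
    q = slots @ w_q                              # [num_slots, d]
    k = inputs @ w_k                             # [num_pixels, d]
    v = inputs @ w_v                             # [num_pixels, d]
    logits = q @ k.T / np.sqrt(q.shape[-1])      # [num_slots, num_pixels]
    # Normalise over the *slot* axis: each pixel's unit of attention mass is
    # shared out among slots, so slots compete to explain that pixel.
    attn = softmax(logits, axis=0)
    # Weighted mean over pixels gives each slot's update.
    weights = attn / attn.sum(axis=1, keepdims=True)
    return weights @ v, attn                     # [num_slots, d], [num_slots, num_pixels]

# Illustrative shapes only.
rng = np.random.default_rng(0)
d, num_slots, num_pixels = 32, 4, 64
slots = rng.normal(size=(num_slots, d))
inputs = rng.normal(size=(num_pixels, d))
w_q, w_k, w_v = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
updates, attn = slot_attention_readout(slots, inputs, w_q, w_k, w_v)
print(attn.sum(axis=0)[:3])   # each pixel's attention sums to 1 across slots
```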
Self-attention and transformers [Vaswani et al., 2017, Parisotto et al., 2019] are also central to our work and have
played a critical role in the field in recent years. Some of the latest applications have demonstrated the efficacy of
this mechanism when applied to problem solving in domains where a capacity to reason is a criterion for success [Santoro
et al., 2018, Clark et al., 2020, Russin et al., 2021]. In particular, self-attention provides a means to form a
pairwise relationship between any two elements in a sequence; however, this comes at a cost that scales quadratically in
the sequence length, so care must be taken when choosing how to apply this technique. One recent approach to
video sequence data, the TimeSformer of Bertasius et al. [2021], utilizes attention over image patches across time
and has been applied successfully to action classification on video sequences [Goyal et al., 2017, Carreira et al., 2018].
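To make the quadratic cost concrete, the short sketch below counts pairwise attention scores for a video of T frames, each split into N patches, comparing joint space-time attention with a divided (temporal-then-spatial) factorisation of the kind used by the TimeSformer. These are back-of-the-envelope counts only; the frame and patch numbers are illustrative and not taken from any of the cited experiments.

```python
# Back-of-the-envelope pairwise-score counts for self-attention over video
# tokens (constants, heads and channel sizes omitted).

def full_spacetime_pairs(T, N):
    """Joint attention over all T*N patch tokens: quadratic in T*N."""
    return (T * N) ** 2

def divided_spacetime_pairs(T, N):
    """Divided (temporal-then-spatial) attention: each token attends over the
    T frames at its spatial location, then over the N patches in its frame."""
    return T * N * (T + N)

T, N = 32, 196  # e.g. 32 frames, each split into 14 x 14 patches
print(f"{full_spacetime_pairs(T, N):,}")     # 39,337,984
print(f"{divided_spacetime_pairs(T, N):,}")  # 1,430,016
```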
There has been a great deal of work on self-supervised video representation learning methods [Wang et al., 2021, Qian
et al., 2020, Feichtenhofer et al., 2021, Gopalakrishnan et al., 2022]. In particular, Qian et al. [2020] explore a
contrastive loss and apply it to Kinetics-600; we include some of these results in table 3 below. Kipf et al. [2021]
present an architecture similar to ours, albeit without generative losses, and show that their model achieves
state-of-the-art performance on object segmentation in video sequences and on optical flow field prediction. Ding et al.
[2020] use the representations of an object-centric model [Burgess et al., 2019] as input to a transformer and achieve
very strong results on CLEVRER and CATER.
Neuro-symbolic logic-based models [d’Avila Garcez and Lamb, 2020] have been applied to temporal reasoning
problems and, in particular, to CLEVRER and CATER. In section 4 we examine the performance of the Dynamic
Context Learner (DCL, Chen et al. [2021]) and Neuro-Symbolic Dynamic Reasoning (NS-DR, Yi et al. [2020]):
NS-DR consists of neural networks for video parsing and dynamics prediction together with a symbolic logic program
executor, while DCL is composed of a program parser and a symbolic executor and makes use of extra labelled data
when applied to CLEVRER. Some components of these methods rely on explicit modeling of reasoning mechanisms crafted for the
problem domain or on additional labelled annotations. In contrast, the intent of our approach is to form an inductive
bias around reasoning about sequences in general.
Recently Jaegle et al. [2021] have published their work on the Perceiver, a model that makes use of attention asymmetry
to ingest temporal multi-modal input. In this work an attention bottleneck is introduced that reduces the dimensionality
of the positional axis through which model inputs pass. This provides tractability for inputs of large temporal and
spatial dimensions, where the quadratic scaling of transformers would otherwise become prohibitive. Notably, the authors
show that this approach succeeds on multimodal data, achieving state-of-the-art scores on the AudioSet dataset [Gemmeke
et al., 2017], which contains audio and video inputs. Our approach is similar in spirit to the Perceiver in compressing
representations via attention mechanisms; however, we focus on the ability to reason using encoded sequences and
primarily reduce over spatial axes.
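To illustrate the asymmetry, the sketch below implements a single cross-attention read in which a small latent array queries a much larger flattened input array, so the cost is linear rather than quadratic in the number of input tokens. This is our own minimal illustration of the bottleneck idea, not the Perceiver implementation; all names and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_bottleneck(latents, inputs, w_q, w_k, w_v):
    """A small latent array queries a large input array.

    latents: [num_latents, d]  with num_latents << num_inputs
    inputs:  [num_inputs, d]   e.g. flattened spatio-temporal tokens
    Cost is O(num_latents * num_inputs) rather than O(num_inputs^2).
    """
    q = latents @ w_q
    k = inputs @ w_k
    v = inputs @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # [num_latents, num_inputs]
    return attn @ v                                          # [num_latents, d]

# Illustrative sizes only.
rng = np.random.default_rng(0)
d, num_latents, num_inputs = 64, 128, 50_000
latents = rng.normal(size=(num_latents, d))
inputs = rng.normal(size=(num_inputs, d))
w_q, w_k, w_v = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
compressed = cross_attention_bottleneck(latents, inputs, w_q, w_k, w_v)
print(compressed.shape)  # (128, 64): subsequent self-attention acts on 128 tokens
```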
3 Model
We now present the Slot Transformer, a generative transformer model that leverages iterative inference to produce
improved latent estimates given an input sequence. Our model can broadly be described in terms of three phases (see
Figure 1): an encode phase where a spatio-temporal input is compressed to form a representation, then a decode phase