
factors and their corresponding latent dimensions. However, without an inductive bias in the data or the model, learning a disentangled latent space for a VAE is theoretically impossible [7]. Therefore, we must either train the VAE with some degree of supervision on the data or apply an inductive bias to the model to learn a disentangled latent space.
In complex datasets such as the Carla dataset [8], generative factors are defined at a more abstract level. In addition, not all generative factors are known, and the domain of the observed generative factors can be continuous. Therefore, providing labels based on generative factors for each image can be expensive. As a result, using match pairing [9] is more practical for a complex dataset. However, structuring the latent space of a VAE without labels, based only on partitions for more than one generative factor, can be challenging. As shown in [10], disentanglement performance can decrease when changes in other factors affect the distribution learned for a given factor.
Although achieving full disentanglement with weak supervision is theoretically possible, in practice total disentanglement may not be achieved, depending on the size of the latent space, incomplete knowledge about the generative factors, the level of abstraction at which the generative factors are defined, etc. For example, if a fixed number of latent dimensions is reserved for a rain generative factor, the selected dimensions may not capture all the information about that factor, although they predominantly encode the rain information. Therefore, a mechanism is needed to learn partial disentanglement for complex datasets.
Logic tensor networks (LTNs) [11] distill knowledge into the network weights based on a set of rules during training. For this purpose, the loss function is defined as a set of rules, and training is the process of optimizing the network parameters to minimize the satisfaction of the loss rule. Since LTNs use first-order fuzzy logic semantics (real logic), the rules can be satisfied partially. These characteristics make LTNs suitable for defining partial disentanglement.
Currently, OOD detection and reasoning approaches [4]
try to achieve partial disentanglement through model-based
inductive bias. However, they cannot guarantee the mapping
between generative factors and specific latent dimensions.
Our contribution: To address the aforementioned issues, we propose an OOD reasoning framework that consists of three phases: data partitioning, training OOD reasoners, and runtime OOD reasoning. Data partitions are formed based on the observed values of the generative factors, and OOD reasoners (latent dimensions of a weakly-disentangled VAE) are designed with match pairing supervision and an LTN. Using the LTN version of a VAE allows us to define disentanglement formally based on samples from the given data partitions. Inspired by [9], we define adaptation and isolation rules for achieving disentanglement. The adaptation rule ensures that a change in the values of a generative factor is reflected in the distribution learned for the corresponding latent dimensions. The isolation rule guarantees that a change in a given factor is reflected only in its corresponding latent dimensions. Since LTN uses first-order fuzzy logic semantics (real logic), the adaptation and isolation rules can be partially satisfied. As a result, the VAE can achieve a proper level of disentanglement even when the latent space is small and some generative factors are not observed during training. Finally, we use the dimensions corresponding to a given factor to identify samples that are OOD with respect to that factor. We show the effect of the defined constraints on disentanglement through visualization as well as mutual information [12]. We also show that our approach achieves an AUROC of 0.98 and 0.89 on the Carla dataset for the rain and city reasoners, respectively.
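To make the adaptation and isolation rules concrete, the sketch below scores them as fuzzy truth values on a match pair of encoded latent means that differ only in the rain factor. This is a minimal, hypothetical illustration rather than the exact formulation of our framework: the dimension assignment (RAIN_DIMS) and the predicate shapes exp(-d) and 1 - exp(-d) are placeholder assumptions.

import numpy as np

RAIN_DIMS = [0, 1]  # hypothetical: latent dimensions assigned to the rain factor

def fuzzy_changed(a, b):
    # Fuzzy truth value in [0, 1] that two vectors differ (placeholder predicate).
    return 1.0 - np.exp(-np.linalg.norm(a - b))

def fuzzy_unchanged(a, b):
    # Fuzzy truth value in [0, 1] that two vectors are (almost) equal.
    return np.exp(-np.linalg.norm(a - b))

def adaptation(mu1, mu2, dims=RAIN_DIMS):
    # A pair differing only in the rain factor should differ in the rain dimensions.
    return fuzzy_changed(mu1[dims], mu2[dims])

def isolation(mu1, mu2, dims=RAIN_DIMS):
    # The same pair should leave all other dimensions unchanged.
    other = [i for i in range(len(mu1)) if i not in dims]
    return fuzzy_unchanged(mu1[other], mu2[other])

# Example: 6-dimensional latent means of a match pair that differs only in rain.
mu1 = np.array([0.1, -0.4, 0.7, 0.0, 0.3, -0.2])
mu2 = np.array([1.2,  0.9, 0.7, 0.0, 0.3, -0.2])
print(adaptation(mu1, mu2), isolation(mu1, mu2))  # both close to 1: rules well satisfied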
II. BACKGROUND
Our framework is built on three concepts, which we introduce in this section.
Variational autoencoder: A VAE is a machine learning model formed by two attached neural networks: the encoder and the decoder. Given an input x, the encoder qφ(z|x) with parameters φ maps the input to a latent representation z. The decoder pθ(x|z) with parameters θ regenerates the data from the representation z. Equation 1 shows the ELBO-based loss of a VAE, i.e., the negative ELBO that is minimized during training.
loss = −E_{qφ(z|x)}[log pθ(x|z)] + KL(qφ(z|x) || p(z))   (1)
The first and second terms are the reconstruction and regularization losses, respectively. The reconstruction loss ensures that the distribution learned for the data reflects the main factors required for data reconstruction. The regularization loss evaluates the similarity between the learned distribution and the prior (usually a standard Gaussian distribution N(0, 1)) using the KL divergence between the two [13].
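As a concrete illustration, the following sketch computes this loss in PyTorch for the common case of a diagonal Gaussian posterior, a standard normal prior, and a Bernoulli (binary cross-entropy) reconstruction term; these likelihood choices are assumptions for the example, not requirements of our framework.

import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    # Reconstruction term -E_q[log p(x|z)], assuming inputs scaled to [0, 1].
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian posterior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # Minimizing this sum corresponds to minimizing the loss in Eq. (1).
    return recon + kl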
Logic tensor network: A logic tensor network (LTN) uses logical constraints to distill knowledge about the data and the model into the model weights during training. The knowledge is formally described by a first-order fuzzy logic named real logic. Real logic uses a set of functions, predicates, variables, and constants to form logical terms. These elements, alongside operators such as negation, conjunction, etc., form the syntax of real logic. A fuzzy semantics is defined for real logic so that the rules can be satisfied partially. The operators are semantically defined based on product real logic. Table I summarizes the operator definitions. In this table, a, b, and a1, ..., an are predicates with values in [0, 1]. Learning for a logic tensor network is the process of finding a set of parameters that maximizes the satisfaction of the rules or minimizes the satisfaction of a loss rule defined in real logic [11].
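Since Table I is not reproduced here, the sketch below shows the standard product-logic connectives commonly used with LTNs, together with a smooth universal quantifier (a p-mean error aggregator); the exact operator set and aggregator listed in Table I may differ, so treat this as an illustrative assumption.

import numpy as np

# Product (fuzzy) logic connectives over truth values in [0, 1].
def Not(a):        return 1.0 - a
def And(a, b):     return a * b                # product t-norm
def Or(a, b):      return a + b - a * b        # probabilistic sum
def Implies(a, b): return 1.0 - a + a * b      # Reichenbach implication

def Forall(truths, p=2):
    # Smooth universal quantifier: 1 minus the p-mean of the errors (1 - truth value).
    truths = np.asarray(truths, dtype=float)
    return 1.0 - np.mean((1.0 - truths) ** p) ** (1.0 / p)

# A rule evaluated over several samples is satisfied to a degree in [0, 1].
print(Forall([0.9, 0.8, 0.95]))  # ~0.87: largely, but not fully, satisfied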
Weakly supervised disentanglement: Disentanglement is defined by two concepts: consistency and restrictiveness. Given a generative factor s encoded in dimensions i ∈ I of the latent space, consistency means that changes in the distributions of dimensions outside the specified set I do not affect the given factor s. Restrictiveness means that the other factors s′ ∈ S \ {s} are immune to changes in the distributions of the specified dimensions (i ∈ I) that encode the generative factor s [9]. We can use different
levels of weak supervision to attain disentanglement, such as