independent conditioning on $w'$, the decomposition of the third term in Eq. (2) holds. The relationship between $y$ and $w'$ will be further explained in Section 4.1.2. Given $q_\phi(w, z|x, y) = q_\phi(w, z|x) = q_\phi(w|x) \cdot q_\phi(z|x)$, we rewrite the joint probability in Eq. (2) in the form of Bayesian variational inference as the first term of the learning objective:
$$\mathcal{L}_1 = -\mathbb{E}_{q_\phi(w,z|x)}[\log p_\theta(x|w,z)] - \mathbb{E}_{q_\phi(w|x)}[\log p_\gamma(y|w)] + D_{KL}\big(q_\phi(w,z|x)\,\|\,p(w,z)\big). \tag{3}$$
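As a concrete illustration, the following is a minimal PyTorch sketch of Eq. (3), assuming a diagonal-Gaussian posterior over the concatenated $(w, z)$ with parameters (`mu`, `logvar`), a standard-normal prior $p(w, z)$, and Gaussian likelihoods for both expectation terms (so they reduce to MSE up to constants); `x_recon` and `y_pred` are hypothetical decoder and property-predictor outputs:

```python
import torch
import torch.nn.functional as F

def loss_l1(x, x_recon, y, y_pred, mu, logvar):
    # Reconstruction term: -E_q[log p(x|w,z)], Gaussian likelihood up to a constant
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # Property term: -E_q[log p(y|w)], again a Gaussian-likelihood surrogate
    prop = F.mse_loss(y_pred, y, reduction="sum")
    # Analytic KL(q(w,z|x) || N(0, I)) for a diagonal-Gaussian posterior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + prop + kl
```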
Meanwhile, since the objective function in Eq. (3) does not enforce our assumptions that $z$ is independent of $w$ and $y$, and that the values in $w$ are mutually independent, we decompose the KL divergence in Eq. (3) and penalize the terms:
$$\mathcal{L}_2 = \rho_1 \cdot D_{KL}\big(q(z, w)\,\|\,q(z)q(w)\big) + \rho_2 \cdot D_{KL}\Big(q(w)\,\Big\|\,\prod_i q(w_i)\Big), \tag{4}$$
where $\rho_1$ and $\rho_2$ are coefficient hyper-parameters that weight the two penalty terms. Details of the proof and derivation of the overall objective can be found in Appendix A.
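Both terms in Eq. (4) are total-correlation-style divergences and are intractable to evaluate exactly. One possible estimator (a β-TC-VAE-style minibatch-weighted approximation, not prescribed by our derivation) is sketched below for the second term, $D_{KL}(q(w)\,\|\,\prod_i q(w_i))$; the first term can be estimated analogously by treating the $z$ and $w$ blocks as two groups:

```python
import math
import torch

def log_gauss(w, mu, logvar):
    # Elementwise log N(w; mu, diag(exp(logvar)))
    return -0.5 * (math.log(2 * math.pi) + logvar + (w - mu) ** 2 / logvar.exp())

def total_correlation(w, mu, logvar):
    # Minibatch-weighted estimate of KL(q(w) || prod_i q(w_i)).
    # w: (n, d) posterior samples; mu, logvar: (n, d) posterior parameters.
    n = w.size(0)
    # log q(w_i | x_j) for every sample/posterior pair: shape (n, n, d)
    log_qw_pairs = log_gauss(w.unsqueeze(1), mu.unsqueeze(0), logvar.unsqueeze(0))
    log_qw = torch.logsumexp(log_qw_pairs.sum(-1), dim=1) - math.log(n)
    log_qw_marginals = (torch.logsumexp(log_qw_pairs, dim=1) - math.log(n)).sum(-1)
    return (log_qw - log_qw_marginals).mean()
```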
4.1.2 Relating the properties and latent variables
To model the dependence between the correlated properties and the associated latent variables $p(y|w)$ in Eq. (3), as well as to capture the correlation among properties, we propose to directly learn the specific relationship between the disentangled latent variables in $w$ and the properties $y$. The correlations among $y$ are also captured. Specifically, we design a mask pooling layer realized by a mask matrix $M \in \{0,1\}^{l \times m}$, where $l$ is the dimension of the latent vector $w$ and $m$ is the number of properties. $M$ captures how $w$ relates to $y$: $M_{i,j} = 1$ denotes that $w_i$ relates to the $j$-th property $y_j$; otherwise there is no relation. In this way, two properties that relate to the same variable in $w$ can be regarded as correlated. The binary elements in $M$ are trained with the Gumbel-Softmax function. In implementation, the $L_1$ norm of the mask matrix is also added to the objective to encourage the sparsity of $M$.
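A minimal sketch of such a mask pooling layer, assuming PyTorch's standard Gumbel-Softmax with hard (straight-through) sampling; parameterizing each entry of $M$ with a pair of "off"/"on" logits is one concrete choice, not the only possible one:

```python
import torch
import torch.nn.functional as F

class MaskPooling(torch.nn.Module):
    # Learnable binary mask M in {0,1}^{l x m} with an L1 sparsity penalty.
    def __init__(self, l, m, tau=1.0):
        super().__init__()
        # Two logits per entry of M: index 0 = "off", index 1 = "on"
        self.logits = torch.nn.Parameter(torch.zeros(l, m, 2))
        self.tau = tau

    def forward(self):
        # hard=True yields discrete {0,1} samples with straight-through gradients
        sample = F.gumbel_softmax(self.logits, tau=self.tau, hard=True)
        M = sample[..., 1]          # (l, m) binary mask
        sparsity = M.sum()          # L1 norm of M (entries are non-negative)
        return M, sparsity
```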
Next, given the learned mask matrix $M$, we model the mapping from $w$ to $y$. For the properties $y$, we can calculate the corresponding $w'$ that aggregates the values in $w$ contributing to each property as $wJ^T \odot M$, each column of which corresponds to the related latent variables in $w$ to be aggregated to predict the corresponding $y$. For each property $y_j$ in $y$, we aggregate all the information from its related latent variable set in $w$ into the next-level latent variable $w'_j$ (i.e., the $j$-th variable of $w'$) via an aggregation function $h$:
$$w' = h(wJ^T \odot M; \beta), \tag{5}$$
where $J$ is a vector with all values equal to one, $\odot$ represents element-wise multiplication, and $\beta$ is the parameter of $h$. Then the property $y$ can be predicted using $w'$ as:
$$y = f(w'; \gamma), \tag{6}$$
where $f$ is the set of prediction functions with $w' = h(wJ^T \odot M; \beta)$ as the input, and $\gamma$ is the set of parameters, which will be further explained in the next section. Thus, we have built a one-to-one mapping between $w'$ and $y$. In addition, the correlation of $y_i$ and $y_j$ can be recovered if $M_{\cdot i}^T \cdot M_{\cdot j} \neq 0$.
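A sketch of Eqs. (5)-(6), assuming a simple sum as the aggregation function $h$ (the learnable $h$ with parameters $\beta$ is one generalization of this); broadcasting implements $wJ^T \odot M$ without materializing $J$:

```python
import torch

def aggregate(w, M):
    # w: (batch, l) latent vectors; M: (l, m) binary mask matrix.
    # w.unsqueeze(-1) * M broadcasts to (batch, l, m), i.e. (w J^T) ⊙ M:
    # column j keeps only the latent variables related to property y_j.
    masked = w.unsqueeze(-1) * M
    # Sum over the l latent dimensions as a minimal choice of h -> w' in (batch, m)
    return masked.sum(dim=1)

# Each bridging variable w'_j then feeds the per-property predictor,
# y = f(w'; gamma), realized by the invertible map of Section 4.1.3.
```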
4.1.3 Invertible constraint for multiple-property control
As stated in the problem formulation, our proposed model aims to generate a data point $x$ that satisfies the given property value requirements. The most straightforward way to do this is to separately model both directions of the mutual dependence between each $y_i$ and its relevant latent variable set $w'_i$. However, this can incur double errors in the two-way mapping, since there exists a complex correlation among the properties in $y$ and there are many cases where $M_{\cdot i}^T \cdot M_{\cdot j} \neq 0$. To address this, we propose an invertible function that mathematically ensures the exact recovery of the bridging variables $w'$ given a group of desired properties $y$, based on the following deduction.
As in Eq. (6), the set of correlated properties $y = \{y_1, y_2, \ldots, y_m\}$ are correlated with the set of latent variables $w' = \{w'_1, w'_2, \ldots, w'_m\}$ in a one-to-one mapping fashion. Thus, we assume that $y$ can be sampled from a multivariate Gaussian given $w'$ as follows:
$$p(y|w') = \mathcal{N}\big(y\,|\,f(w'; \gamma), \Sigma\big); \quad y = (y_1, y_2, \ldots, y_m),\ w' = \{w'_1, w'_2, \ldots, w'_m\},\ \Sigma \in \mathbb{R}^{m \times m},$$
$$\text{s.t.}\ f(w'; \gamma)[j] = \bar{f}(w'; \gamma)[j] + w'_j, \quad \mathrm{Lip}\big(\bar{f}(w'; \gamma)[j]\big) < 1 \ \text{if}\ \|W_k\|_2 < 1, \quad j = 1, \ldots, m, \tag{7}$$
where $\mathrm{Lip}$ denotes the Lipschitz constant. Namely, to precisely control the properties $y$, we learn a set of invertible functions $f(w'; \gamma)$, indicated in Eq. (6), to model $p_\gamma(y|w')$; $\gamma$ is the set of parameters in Eq. (6). The constraint enforces $f(w'; \gamma)[j]$ to be an invertible function to achieve mutual dependence between $y_j$ and $w'_j$ [2]. As a result, we have the third term of the objective function:
$$\mathcal{L}_3 = -\mathbb{E}_{w' \sim p(w')}\big[\mathcal{N}(y\,|\,f(w'; \gamma), \Sigma)\big] + \big\|\mathrm{Lip}\big(\bar{f}(w'; \gamma)[j]\big) - 1\big\|^2. \tag{8}$$
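One standard way to realize the constraint in Eq. (7) is an invertible-residual construction in the spirit of [2]: spectral normalization bounds each $\|W_k\|_2$, a scaling coefficient below one makes $\bar{f}$ contractive, and the inverse $w' = f^{-1}(y)$ is then recovered by fixed-point iteration. The sketch below is illustrative; the hidden width, activation, and iteration count are assumptions:

```python
import torch

class InvertibleProperty(torch.nn.Module):
    # y = coeff * f_bar(w') + w', with Lip(coeff * f_bar) < 1.
    def __init__(self, m, hidden=64, coeff=0.97):
        super().__init__()
        sn = torch.nn.utils.spectral_norm
        # Spectral norm keeps each ||W_k||_2 near 1; ELU is 1-Lipschitz,
        # so scaling by coeff < 1 makes the residual branch a contraction.
        self.f_bar = torch.nn.Sequential(
            sn(torch.nn.Linear(m, hidden)),
            torch.nn.ELU(),
            sn(torch.nn.Linear(hidden, m)),
        )
        self.coeff = coeff

    def forward(self, w_prime):
        return self.coeff * self.f_bar(w_prime) + w_prime

    def inverse(self, y, iters=50):
        # Banach fixed-point iteration w' <- y - coeff * f_bar(w'); it converges
        # because the residual branch is contractive.
        w_prime = y.clone()
        for _ in range(iters):
            w_prime = y - self.coeff * self.f_bar(w_prime)
        return w_prime
```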