
example, scalar parameters are quantized into a finite number of bins before being embedded as vectors (see the supplementary material for the quantization details), and since each primitive has at most five parameters, we pack all parameter embeddings into a single code. Each constraint reference, on the other hand, is a primitive index and is embedded directly as a code. Each token of an $L_0$-typed instance is therefore encoded as
$$e_{t^0.x} = \mathrm{enc}_{type}(t^0) + \mathrm{enc}_{pos}(t^0.x) + \mathrm{enc}_{param}(t^0.x) \mid \mathrm{enc}_{ref}(t^0.x), \tag{1}$$
where $t^0.x$ iterates over the split tokens (i.e., type, parameters, references), the type embedding is shared by all tokens of the instance, the position embedding indexes the token within the whole split-tokenized sequence of $S$, and the parameter or reference embedding is applied where applicable.
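As a rough illustration of Eq. (1), the sketch below sums a type embedding, a position embedding, and either a parameter or a reference embedding for one token. The embedding tables, their sizes, the width `DIM`, the pseudo-random initialisation, and the helper names (`make_table`, `add`, `encode_token`) are all hypothetical stand-ins. In the actual model the embeddings are learned, and the packing of up to five quantized parameters into one code is not reproduced here.

```python
import random

DIM = 8  # embedding width (assumed for illustration)

def make_table(num_rows, seed):
    """Stand-in for a learned embedding table: num_rows vectors of size DIM."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(num_rows)]

def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]

ENC_TYPE = make_table(4, seed=0)    # one row per L0 type (assumed count)
ENC_POS = make_table(64, seed=1)    # one row per position in the split sequence
ENC_PARAM = make_table(32, seed=2)  # one row per quantization bin (assumed count)
ENC_REF = make_table(16, seed=3)    # one row per referable primitive index

def encode_token(type_id, pos, param_bin=None, ref_index=None):
    """e = enc_type + enc_pos + (enc_param | enc_ref), whichever applies."""
    parts = [ENC_TYPE[type_id], ENC_POS[pos]]
    if param_bin is not None:
        parts.append(ENC_PARAM[param_bin])
    elif ref_index is not None:
        parts.append(ENC_REF[ref_index])
    return add(*parts)

# A constraint instance split into a type token and two reference tokens.
# All three tokens share the same type embedding but have distinct positions.
tokens = [
    encode_token(type_id=2, pos=10),
    encode_token(type_id=2, pos=11, ref_index=0),
    encode_token(type_id=2, pos=12, ref_index=3),
]
```

The type token receives only the type and position terms, which is how this sketch reads "applied where applicable".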
Concept detection
We build the detection network as an encoder-decoder transformer following [2]. The transformer encoder operates on the encoded sketch sequence $[e_{t^0_i \in S}]$ and produces the contextualized sequence $[e'_{t^0_i \in S}]$ through layers of self-attention and feed-forward. The transformer decoder takes a learnable set of concept queries $[q_i]$ of size $k_{qry}$, plus a special query $q_R$ for composition generation, and applies interleaved self-attention, cross-attention to $[e'_{t^0_i}]$, and feed-forward layers to obtain the implicit concept codes $[q_i]$ and $q_R$. The concept codes are further quantized into $[q'_i]$ by selecting concept prototypes from a library $\mathbf{L}_1$ that implicitly encodes $L_1$, before being expanded into explicit forms.
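The quantization step can be pictured with the minimal sketch below. It assumes that a prototype is selected by nearest-neighbour search under squared Euclidean distance. That rule is a common choice for codebook lookup but is an assumption here, since the text above does not state the selection criterion. The function names and the toy two-dimensional library are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equally sized vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def quantize(codes, library):
    """Replace each code with its closest library prototype (assumed rule).

    Returns the selected prototype indices and the quantized codes.
    """
    indices = [
        min(range(len(library)), key=lambda k: squared_distance(code, library[k]))
        for code in codes
    ]
    return indices, [library[k] for k in indices]

# Toy library with three prototypes and three implicit concept codes.
library = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
codes = [[0.9, 0.1], [0.1, 0.2], [-0.2, 1.3]]
indices, quantized = quantize(codes, library)
```

Under this assumption, each implicit code is snapped to one library entry, so every detected concept instance is tied to a discrete concept type.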
4.2 Explicit concept structure generation
Concept structure expansion
Given a library code $q' \in \mathbf{L}_1$ representing a type $T^1 \in L_1$, we expand its explicit structure through an MLP into a collection of codes $[\mathbf{t}^0_i]$ representing the $L_0$ type instances $[t^0_i]$, and a matrix representing the composition $R_{T^1}$ of $[t^0_i]$ and arguments (cf. List 1).
[Inset figure. Labels: concept A, concept B, $R_S$. Legend: primitive, constraint, inward arg, outward arg.]
We fix the maximum number of $L_0$ type instances to $k_{L_0}$ (12 by default) and split the arguments into two groups, inward arguments and outward arguments, each with at most $k_{arg}$ entries (2 by default). Each type code $\mathbf{t}^0_i$ is decoded by $\mathrm{dec}_{type}(\cdot)$, the inverse of $\mathrm{enc}_{type}(\cdot)$ in Sec. 4.1, into discrete probabilities over $L_0$, with an additional probability for the null type φ that marks the element as empty (cf. Sec. 5.1). An inward argument points only to a primitive inside the concept structure and originates from a constraint outside; conversely, an outward argument points only to primitives outside and originates from a constraint inside the concept (see inset for illustration). Splitting the arguments into two groups eases composition computation, as discussed below.
The composition operator $R_{T^1}$ is implemented as an assignment matrix $\mathbf{R}_{T^1}$ of shape $(2k_{L_0} + k_{arg}) \times (k_{L_0} + k_{arg})$, where each row corresponds to a constraint reference or an inward argument, and each column to a primitive or an outward argument. The factor of two for constraint references arises because every constraint we consider in the dataset [17] has at most two arguments. Each row is a discrete probability distribution such that $\sum_j \mathbf{R}_{T^1}[i, j] = 1$, with the maximum entry signifying that the $i$-th constraint reference/inward argument refers to the $j$-th primitive/outward argument. We compute $\mathbf{R}_{T^1}$ by first mapping the concept code $q'$ to a matrix of logits in the shape of $\mathbf{R}_{T^1}$ and then applying a softmax to each row. Notably, we rule out meaningless self-loops, where an element refers back to itself, as well as inward arguments referring to outward arguments, by masking the diagonal blocks $\mathbf{R}_{T^1}[2i:2i+2, i]$, $i \in [k_{L_0}]$, and the argument block $\mathbf{R}_{T^1}[2k_{L_0}:, k_{L_0}:]$, setting their logits to $-\infty$.
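The masking and row-wise softmax can be sketched as follows. This is a plain-Python illustration, not the authors' implementation: the logits are placeholders for the MLP output, $k_{L_0}$ is reduced to 3 for readability, and the names `build_mask` and `masked_row_softmax` are hypothetical. Assigning probability 0 to a masked entry is equivalent to setting its logit to $-\infty$ before the softmax.

```python
import math

K_L0 = 3   # max number of L0 instances (12 by default in the text; small here)
K_ARG = 2  # max number of inward / outward arguments

ROWS = 2 * K_L0 + K_ARG  # two reference slots per instance, then inward args
COLS = K_L0 + K_ARG      # primitives, then outward args

def build_mask():
    """True marks entries whose logits are treated as -inf."""
    mask = [[False] * COLS for _ in range(ROWS)]
    for i in range(K_L0):
        # Reference slots 2i and 2i+1 of element i must not point to element i.
        mask[2 * i][i] = True
        mask[2 * i + 1][i] = True
    for r in range(2 * K_L0, ROWS):
        for c in range(K_L0, COLS):
            # Inward arguments must not point to outward arguments.
            mask[r][c] = True
    return mask

def masked_row_softmax(logits, mask):
    """Row-wise softmax in which masked entries receive probability 0."""
    result = []
    for row, mask_row in zip(logits, mask):
        # Subtract the largest unmasked logit for numerical stability.
        peak = max(x for x, m in zip(row, mask_row) if not m)
        exps = [0.0 if m else math.exp(x - peak) for x, m in zip(row, mask_row)]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result

# Placeholder logits; in the model they come from an MLP applied to the concept code.
logits = [[0.1 * (r + c) for c in range(COLS)] for r in range(ROWS)]
R = masked_row_softmax(logits, build_mask())
```

Every row keeps at least one unmasked column, so each row of `R` remains a valid probability distribution after masking.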
Cross-concept composition
Aside from references inside a concept, references across concepts are generated to complete the whole sketch graph. We achieve cross-concept references by argument passing (see inset above for illustration). In particular, we implement the cross-concept composition operator $R_S$ as an assignment matrix $\mathbf{R}_S$ of shape $(k_{qry} \cdot k_{arg}) \times (k_{qry} \cdot k_{arg})$, mapped directly from $q_R$ through an MLP. As with the in-concept composition matrix, each row of the cross-concept matrix is a discrete distribution such that $\sum_j \mathbf{R}_S[i, j] = 1$, with the maximum entry signifying that the $(i \bmod k_{arg})$-th outward argument of the $\lfloor i/k_{arg} \rfloor$-th concept instance refers to the $(j \bmod k_{arg})$-th inward argument of the $\lfloor j/k_{arg} \rfloor$-th concept instance.
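The flat indexing of the cross-concept matrix can be unpacked with a small sketch. The matrix below is a hand-made placeholder rather than MLP output, $k_{qry}$ is set to 3 for illustration, and the helper names `split_index` and `strongest_links` are hypothetical.

```python
K_QRY = 3  # number of concept queries (assumed small for illustration)
K_ARG = 2  # arguments per group, as in the text

def split_index(flat):
    """Map a flat row/column index to (concept instance, argument slot)."""
    return divmod(flat, K_ARG)

def strongest_links(R_S):
    """For each row, report which inward argument its outward argument selects."""
    links = []
    for i, row in enumerate(R_S):
        j = max(range(len(row)), key=row.__getitem__)  # column of the maximum entry
        links.append((split_index(i), split_index(j)))
    return links

# Placeholder assignment matrix: row i puts most of its mass on column (i + K_ARG) mod n.
n = K_QRY * K_ARG
R_S = [[0.9 if j == (i + K_ARG) % n else 0.1 / (n - 1) for j in range(n)] for i in range(n)]
links = strongest_links(R_S)
```

Here `links[0]` pairs outward argument 0 of concept instance 0 with inward argument 0 of concept instance 1, because row 0 peaks at column 2 and `divmod(2, K_ARG)` is `(1, 0)`.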
The complete cross-concept reference is therefore the product of three transport matrices:
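$$\mathbf{R}_{cref}[t^1_i, t^1_j] = \mathbf{R}_{t^1_i}[:2k_{L_0}, k_{L_0}:] \times \mathbf{R}_S[i \cdot k_{arg} : (i+1) \cdot k_{arg},\; j \cdot k_{arg} : (j+1) \cdot k_{arg}] \times \mathbf{R}_{t^1_j}[2k_{L_0}:, :k_{L_0}],$$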