
The second knockoff property (2.2) is satisfied if, given the original explanatory variables $X_{1:d}$, the knockoffs $\tilde{X}_{1:d}$ have no effect on the response variable $Y$. Finding knockoff generation methods such that the first knockoff property (2.1) is satisfied can be challenging. For some specific distributions, like the multivariate normal distribution, it is possible to obtain exact knockoff copies (see for example Candès et al. 2018; Gimenez et al. 2019; Sesia et al. 2018). If the distribution of $X_{1:d}$ is more complex, various methods have been proposed in the literature. These methods can be used to construct knockoffs that approximately satisfy the knockoff property (2.1). In contrast, the second knockoff property (2.2) is easily satisfied if the outcome variable $Y$ is not used to construct the knockoffs.
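To illustrate the multivariate normal case mentioned above, the following is a minimal numpy sketch of the equicorrelated Gaussian knockoff construction of Barber and Candès (2015) and Candès et al. (2018); the function name and the shrinkage constant are our own illustrative choices, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_knockoffs(X, Sigma, mu=None):
    """Sample exact knockoffs for rows X[i] ~ N(mu, Sigma).

    Equicorrelated construction: with S = diag(s) and
    s_j = min(1, 2 * lambda_min(Sigma)), drawing
    Xtilde | X ~ N(X - (X - mu) Sigma^{-1} S, 2S - S Sigma^{-1} S)
    makes (X, Xtilde) satisfy the exchangeability property (2.1).
    """
    n, d = X.shape
    mu = np.zeros(d) if mu is None else mu
    lam_min = np.linalg.eigvalsh(Sigma)[0]
    s = np.full(d, min(1.0, 2.0 * lam_min) * 0.999)  # slight shrink keeps cov PSD
    S = np.diag(s)
    Sigma_inv = np.linalg.inv(Sigma)
    cond_mean = X - (X - mu) @ Sigma_inv @ S
    cond_cov = 2.0 * S - S @ Sigma_inv @ S
    L = np.linalg.cholesky(cond_cov + 1e-12 * np.eye(d))
    return cond_mean + rng.standard_normal((n, d)) @ L.T

# Example: equicorrelated Gaussian covariates with correlation 0.5
d, n = 3, 50000
Sigma = 0.5 * np.ones((d, d)) + 0.5 * np.eye(d)
X = rng.standard_normal((n, d)) @ np.linalg.cholesky(Sigma).T
X_knockoff = gaussian_knockoffs(X, Sigma)
```

By construction, the marginal covariance of the knockoffs reproduces $\Sigma$, and the cross-covariance between $X_j$ and $\tilde{X}_k$ matches that between $X_j$ and $X_k$ for $j \neq k$, which is exactly the swap-invariance that (2.1) requires in the Gaussian case.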
To recall some key terms for controlled variable selection and model-X knockoffs, we consider the following problem. Assume that we have obtained a sample from a response variable of interest $Y$ together with covariates $X_{1:d}$ which might explain $Y$. The goal is to identify a subset of $X_{1:d}$ containing important variables which have an effect on $Y$. To formalize this, let us assume that the response only depends on a (small) subset of variables $S \subset \{1, \ldots, d\}$ such that, conditionally on $\{X_i\}_{i \in S}$, the outcome variable $Y$ is independent of all other covariates. We further denote by $\hat{S}$ the set of important variables which has been identified with a variable selection procedure. Usually, such variable selection procedures are designed in a way that the false discovery rate is controlled, i.e.,
\[
\mathbb{E}\left[\frac{\#\{i : i \in \hat{S} \setminus S\}}{\#\{i : i \in \hat{S}\}}\right] \le q,
\]
for some nominal level $q \in (0,1)$ and with the convention $\frac{0}{0} = 0$.
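The false discovery rate is the expectation of the ratio above, the false discovery proportion, over repeated runs of the selection procedure. A minimal Python sketch of this ratio (the function name is ours, for illustration only):

```python
def false_discovery_proportion(S_hat, S):
    """#(S_hat \\ S) / #S_hat, with the convention 0/0 := 0."""
    S_hat, S = set(S_hat), set(S)
    if not S_hat:
        return 0.0  # nothing selected, so no false discoveries
    return len(S_hat - S) / len(S_hat)

# One false discovery (variable 7) among four selected variables
fdp = false_discovery_proportion({1, 2, 3, 7}, {1, 2, 3, 4, 5})  # 0.25
```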
It has been shown in Candès et al. (2018) that model-X knockoffs are a variable selection method for which the false discovery rate is controlled. In the following, we will briefly recall the most important steps of the model-X knockoffs framework. A first key element is a method to construct model-X knockoffs satisfying the knockoff properties (2.1) and (2.2). Additionally, measures of feature importance $Z_i$ and $\tilde{Z}_i$ are required for each variable $X_i$, $1 \le i \le d$, and its knockoff copy $\tilde{X}_i$, $1 \le i \le d$, respectively. These measures of feature importance can be obtained from standard ML methods, for example a lasso or elastic net regression of $Y$ on the augmented vector $(X_{1:d}, \tilde{X}_{1:d})$. The feature importance scores of each variable and its knockoff are then combined into a knockoff statistic, e.g., $W_i = Z_i - \tilde{Z}_i$. This knockoff statistic is antisymmetric, so that a large positive value of $W_i$ is an indication for an important variable $X_i$, whereas for an unimportant variable $X_i$, positive and negative values of $W_i$ should be equally likely. The estimated set of important variables in the model-X knockoffs framework, while controlling the false discovery rate, is then obtained as $\hat{S} := \{i : W_i \ge \tau_q\}$. Here, the threshold $\tau_q$ is given by (Barber and Candès 2015; Candès et al. 2018)³
\[
\tau_q = \min\left\{ t > 0 : \frac{1 + \#\{i : W_i \le -t\}}{\#\{i : W_i \ge t\}} \le q \right\}.
\]
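The selection step above can be sketched in a few lines of numpy; the toy values of $W$ are for illustration only, and the minimum over $t > 0$ is taken over the observed magnitudes $|W_i|$, which is sufficient because the estimated false discovery proportion only changes at these values.

```python
import numpy as np

def knockoff_threshold(W, q):
    """Data-dependent threshold tau_q of Barber and Candes (2015)."""
    candidates = np.sort(np.abs(W[W != 0]))
    for t in candidates:
        fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return t
    return np.inf  # no feasible threshold: select no variables

# Toy knockoff statistics W_i = Z_i - Ztilde_i for d = 8 variables
W = np.array([5.0, 4.2, 3.1, -0.5, 0.3, 2.8, -1.1, 0.2])
tau = knockoff_threshold(W, q=0.3)
selected = np.flatnonzero(W >= tau)  # estimated set of important variables
```

Here the threshold lands at $t = 2.8$: four statistics lie at or above it and none at or below $-2.8$, so the estimated false discovery proportion is $1/4 \le q$, and the first four large positive statistics are selected.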
The validity and quality of model-X knockoffs fundamentally depend on the procedure used for generating knockoffs which satisfy the properties (2.1) and (2.2). In the following, we will propose a new knockoff generation method. The new method utilizes vine copulas, which are a powerful tool for high-dimensional dependence modeling.
³Note that recently proposed extensions can be employed to derandomize knockoffs and/or find better thresholds for the knockoff filter (see for example Emery and Keich 2019; Gimenez and Zou 2019; Luo et al. 2022; Ren et al. 2021). Many of these methods try to improve the stability of knockoff filters by generating multiple or simultaneous knockoffs and combining them in an appropriate way. These advanced methods or extensions could also be combined with the vine copula knockoff generation method. For the sake of simplicity, this is left for future research.