Vine copula based knockoff generation for high-dimensional
controlled variable selection
Malte S. Kurz
TUM School of Management
Technical University of Munich
Arcisstr. 21, 80333 Munich, Germany
October 21, 2022
Abstract
Vine copulas are a flexible tool for high-dimensional dependence modeling. In this article, we discuss the generation of approximate model-X knockoffs with vine copulas. It is shown how Gaussian knockoffs can be generalized to Gaussian copula knockoffs. A convenient way to parametrize Gaussian copulas is via partial correlation vines. We discuss how completion problems for partial correlation vines are related to Gaussian knockoffs. A natural generalization of partial correlation vines is given by vine copulas, which are well suited for the generation of approximate model-X knockoffs. We discuss a specific D-vine structure which is advantageous for obtaining vine copula knockoff models. In a simulation study, we demonstrate that vine copula knockoff models are effective and powerful for high-dimensional controlled variable selection.
1 Introduction
In various fields, like economics, finance, biology or medicine, researchers and practitioners try to identify important variables explaining a response variable of interest. The set of potential explanatory variables is often high-dimensional. Therefore, appropriate statistical tools are necessary to control the false discovery rate. Model-X knockoffs (Candès et al. 2018) can be used in such situations for high-dimensional controlled variable selection.
Knockoffs have been introduced by Barber and Candès (2015) and have been generalized to model-X knockoffs in Candès et al. (2018). The general idea of knockoffs is the following: Denote the response variable by $Y$ and the $d$-dimensional vector of explanatory variables by $X_{1:d}$. For the explanatory variables $X_{1:d}$, knockoff copies $\widetilde{X}_{1:d}$ are constructed. To obtain valid knockoffs, two properties need to be satisfied. First, it is required that, conditionally on the true explanatory variables $X_{1:d}$, the knockoffs $\widetilde{X}_{1:d}$ are not associated with the response variable $Y$. At the same time, the knockoff copies $\widetilde{X}_{1:d}$ need to be constructed in a way that their distributional structure is very similar to that of the original variables $X_{1:d}$.^1 In variable selection procedures, these knockoffs are then added as control variables. Only if an original variable $X_j$ is substantially more relevant than its knockoff counterpart $\widetilde{X}_j$ will it be considered as an important explanatory variable for the response $Y$. The knockoffs framework by Barber and Candès (2015) and Candès et al. (2018) is then constructed in a way such that the false discovery rate is, at least approximately, controlled.
malte.kurz@tum.de
^1 A formal definition of knockoffs will be given in Section 2.
A key part of the knockoffs framework for high-dimensional controlled variable selection is the knockoff generation procedure or model. Different methods have been proposed and analyzed in the literature. The simplest approach relies on a multivariate Gaussian distribution which is obtained by matching the first and second moments (Candès et al. 2018). More flexible alternatives are based on hidden Markov models (Sesia et al. 2018) or variational auto-encoders (Liu and Zheng 2018). Deep learning methods for knockoff generation have been analyzed in Romano et al. (2020) and Sudarshan et al. (2020), and generative adversarial networks in Jordon et al. (2019).
In the following, we will propose a knockoff generation method based on vine copulas. Vine copulas (Aas et al. 2009; Bedford and Cooke 2001; Joe 1997) are a flexible model class for high-dimensional dependence modeling and have already been applied as high-dimensional generative models (see Tagasovska et al. 2019). We will discuss how Gaussian knockoffs can be generalized to Gaussian copula knockoffs. Gaussian copula knockoffs in particular allow for more flexible, i.e., non-normal, marginal distributions. Gaussian copulas can be parametrized with so-called partial correlation vines (Bedford and Cooke 2002; Kurowicka and Cooke 2003). We will explain how completion problems for partial correlation vines are related to the construction of knockoffs. Partial correlation vines can be generalized to vine copulas in order to allow for more flexible dependence structures. We will introduce a specific D-vine copula model which is particularly well suited for constructing approximate model-X knockoffs. In a simulation study, it is demonstrated that vine copula based knockoffs are effective and powerful for high-dimensional controlled variable selection. An implementation of all three knockoff methods (Gaussian knockoffs, Gaussian copula knockoffs and vine copula knockoffs) is available in the accompanying Python package vineknockoffs (Kurz 2022).
The paper is structured as follows. In Section 2, we review the most important concepts of the model-X knockoff framework. Gaussian knockoffs are discussed in Section 3. The generalization to Gaussian copula knockoffs, as well as partial correlation vines and vine copulas, is discussed in Section 4. In Section 5, we introduce vine copula based knockoffs and discuss implementation details. A simulation study, which analyzes the finite sample performance for high-dimensional controlled variable selection, is presented in Section 6. Concluding remarks are given in Section 7.
2 Model-X knockoffs
A random vector $\widetilde{X}_{1:d} \in \mathbb{R}^d$ is called a model-X knockoff copy of $X_{1:d} \in \mathbb{R}^d$ if the following properties are satisfied:
$$(X_{1:d}, \widetilde{X}_{1:d}) \overset{d}{=} (X_{1:d}, \widetilde{X}_{1:d})_{\mathrm{swap}(\mathcal{S})}, \quad \text{for each } \mathcal{S} \subseteq 1{:}d := \{1, \ldots, d\}, \tag{2.1}$$
$$Y \perp \widetilde{X}_{1:d} \mid X_{1:d}, \tag{2.2}$$
where $\overset{d}{=}$ denotes equality in distribution.^2
The first knockoff property (2.1) implies that the joint distribution of the vector of covariates $X_{1:d}$ together with the vector of its knockoffs $\widetilde{X}_{1:d}$ is invariant against any kind of swap. A swap with subset $\mathcal{S} \subseteq 1{:}d$ is obtained by swapping the entries $X_j$ and $\widetilde{X}_j$ for each $j \in \mathcal{S}$ in the augmented vector $(X_{1:d}, \widetilde{X}_{1:d})$. For example, for $d = 3$, we can consider the subset $\mathcal{S} = \{1, 3\}$, and the knockoff property (2.1) becomes
$$\big(X_1, X_2, X_3, \widetilde{X}_1, \widetilde{X}_2, \widetilde{X}_3\big) \overset{d}{=} \big(\widetilde{X}_1, X_2, \widetilde{X}_3, X_1, \widetilde{X}_2, X_3\big).$$
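To make the swap operation concrete, the following minimal Python sketch (the function name is ours and not part of any package discussed here) applies $\mathrm{swap}(\mathcal{S})$ to an augmented data matrix whose first $d$ columns contain the original variables and whose last $d$ columns contain the knockoffs. Under property (2.1), the swapped matrix has the same joint distribution as the original augmented matrix.

```python
import numpy as np

def swap_columns(xx_aug, swap_set, d):
    """Exchange original and knockoff columns for the (1-based) indices in swap_set."""
    out = xx_aug.copy()
    for j in swap_set:
        out[:, [j - 1, d + j - 1]] = out[:, [d + j - 1, j - 1]]
    return out

# Example for d = 3 and S = {1, 3}: the column pairs (X_1, knockoff of X_1)
# and (X_3, knockoff of X_3) are exchanged.
rng = np.random.default_rng(42)
xx_aug = rng.normal(size=(5, 6))
xx_swapped = swap_columns(xx_aug, swap_set={1, 3}, d=3)
```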
^2 Originally, knockoffs have been introduced by Barber and Candès (2015) under the assumption that the covariates are fixed. The term model-X knockoffs was introduced by Candès et al. (2018), who treat the covariates $X_{1:d}$ as random variables in order to be able to apply knockoffs in high-dimensional settings.
The second knockoff property (2.2) is satisfied if, given the original explanatory variables $X_{1:d}$, the knockoffs $\widetilde{X}_{1:d}$ have no effect on the response variable $Y$. Finding knockoff generation methods such that the first knockoff property (2.1) is satisfied can be challenging. For some specific distributions, like the multivariate normal distribution, it is possible to obtain exact knockoff copies (see for example Candès et al. 2018; Gimenez et al. 2019; Sesia et al. 2018). If the distribution of $X_{1:d}$ is more complex, various methods have been proposed in the literature. These methods can be used to construct knockoffs that, to a certain extent, approximately satisfy the knockoff property (2.1). In contrast, the second knockoff property (2.2) is easily satisfied if the outcome variable $Y$ is not used to construct the knockoffs.
To recall some key terms for controlled variable selection and model-X knockoffs, we consider the following problem. Assume that we have obtained a sample from a response variable of interest $Y$ together with covariates $X_{1:d}$ which might explain $Y$. The goal is to identify a subset of $X_{1:d}$ containing important variables which have an effect on $Y$. To formalize this, let us assume that the response only depends on a (small) subset of variables $\mathcal{S} \subset \{1, \ldots, d\}$ such that, conditionally on $\{X_i\}_{i \in \mathcal{S}}$, the outcome variable $Y$ is independent of all other covariates. We further denote by $\widehat{\mathcal{S}}$ the set of important variables which has been identified with a variable selection procedure. Usually, such variable selection procedures are designed in a way that the false discovery rate is controlled, i.e.,
$$\mathbb{E}\left[\frac{\#\{i : i \in \widehat{\mathcal{S}} \setminus \mathcal{S}\}}{\#\{i : i \in \widehat{\mathcal{S}}\}}\right] \le q,$$
for some nominal level $q \in (0,1)$ and with the convention $\frac{0}{0} = 0$.
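As an illustration only (not part of the original text), the quantity inside the expectation, the false discovery proportion of a given selection, can be computed as in the following sketch; the false discovery rate is its expectation over repeated samples.

```python
def false_discovery_proportion(selected, important):
    """FDP = #(selected \\ important) / #selected, with the convention 0/0 := 0."""
    selected, important = set(selected), set(important)
    if not selected:
        return 0.0
    return len(selected - important) / len(selected)

# Toy example: one of the three selected variables is a false discovery -> FDP = 1/3.
fdp = false_discovery_proportion(selected={1, 4, 7}, important={1, 4, 9})
```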
It has been shown in Candès et al. (2018) that model-X knockoffs yield a variable selection method for which the false discovery rate is controlled. In the following, we briefly review the most important steps of the model-X knockoffs framework. A first key element is a method to construct model-X knockoffs satisfying the knockoff properties (2.1) and (2.2). Additionally, measures of feature importance $Z_i$ and $\widetilde{Z}_i$ are required for each variable $X_i$, $1 \le i \le d$, and its knockoff copy $\widetilde{X}_i$, $1 \le i \le d$, respectively. These measures of feature importance can be obtained from standard ML methods, like for example a lasso or elastic net regression of $Y$ on the augmented vector $(X_{1:d}, \widetilde{X}_{1:d})$. The feature importance scores of each variable and its knockoff are then combined into a knockoff statistic, e.g., $W_i = Z_i - \widetilde{Z}_i$. This knockoff statistic is antisymmetric, so that a large positive value of the knockoff statistic $W_i$ is an indication of an important variable $X_i$. At the same time, for an unimportant variable $X_i$, positive and negative values of the knockoff statistic $W_i$ should be equally likely. The estimated set of important variables from the model-X knockoffs framework, while controlling the false discovery rate, is then obtained as $\widehat{\mathcal{S}} := \{i : W_i \ge \tau_q\}$. Here, the threshold $\tau_q$ is given by (Barber and Candès 2015; Candès et al. 2018)^3
$$\tau_q = \min\left\{ t > 0 : \frac{1 + \#\{i : W_i \le -t\}}{\#\{i : W_i \ge t\}} \le q \right\}.$$
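A minimal Python sketch of this selection step, assuming the knockoff statistics $W_i$ have already been computed (the helper name is ours, not the API of an existing package):

```python
import numpy as np

def knockoff_threshold(w_stats, q):
    """Smallest t > 0 with (1 + #{W_i <= -t}) / #{W_i >= t} <= q (knockoff+ threshold)."""
    w_stats = np.asarray(w_stats)
    candidates = np.sort(np.abs(w_stats[w_stats != 0]))  # candidate values for t
    for t in candidates:
        ratio = (1 + np.sum(w_stats <= -t)) / max(np.sum(w_stats >= t), 1)
        if ratio <= q:
            return t
    return np.inf  # no feasible threshold, nothing is selected

w_stats = np.array([2.1, 1.8, 1.5, 1.4, 0.9, -0.3])
tau_q = knockoff_threshold(w_stats, q=0.2)                  # tau_q = 0.9 in this toy example
selected = {i + 1 for i, w in enumerate(w_stats) if w >= tau_q}
```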
The validity and quality of model-X knockoffs fundamentally depends on the procedure used
for generating knockoffs which satisfy the properties (2.1) and (2.2). In the following, we will
propose a new such knockoff generation method. The new method utilizes vine copulas which
are a powerful tool for high-dimensional dependence modeling.
^3 Note that recently proposed extensions can be employed to derandomize knockoffs and/or to find better thresholds for the knockoff filter (see for example Emery and Keich 2019; Gimenez and Zou 2019; Luo et al. 2022; Ren et al. 2021). Many of these methods try to improve the stability of knockoff filters by generating multiple or simultaneous knockoffs and combining them in an appropriate way. These advanced methods or extensions could also be combined with the vine copula knockoff generation method. For the sake of simplicity, this is left for future research.
3 Gaussian knockoffs
Assume that the $d$-dimensional covariates $X_{1:d} \in \mathbb{R}^d$ are multivariate normally distributed, i.e., $X_{1:d} \sim \mathcal{N}_d(0, \Sigma)$, where $\mathcal{N}_d(\mu, \Sigma)$ denotes a $d$-dimensional normal distribution with expectation $\mu$ and covariance matrix $\Sigma$. Model-X knockoffs $\widetilde{X}_{1:d} \in \mathbb{R}^d$ can then be obtained from the following joint normal distribution (Candès et al. 2018)
$$(X_{1:d}, \widetilde{X}_{1:d}) \sim \mathcal{N}_{2d}(0, G), \quad \text{where } G = \begin{pmatrix} \Sigma & \Sigma - \operatorname{diag}(s) \\ \Sigma - \operatorname{diag}(s) & \Sigma \end{pmatrix} \tag{3.1}$$
and $\operatorname{diag}(s)$ is a diagonal matrix such that $G$ is positive semidefinite. Typically, the vector $s$ is obtained by solving a semidefinite program, see Candès et al. (2018).

Having specified, or estimated, $\Sigma$, $s$ and therefore also $G$, knockoffs $\widetilde{X}_{1:d}$ can simply be obtained by sampling from the conditional distribution (Candès et al. 2018)
$$\widetilde{X}_{1:d} \mid X_{1:d} = x_{1:d} \sim \mathcal{N}_d(\mu, V),$$
with
$$\mu = x_{1:d} - x_{1:d}\,\Sigma^{-1}\operatorname{diag}(s), \qquad V = 2\operatorname{diag}(s) - \operatorname{diag}(s)\,\Sigma^{-1}\operatorname{diag}(s).$$
This procedure to generate knockoffs will in the following be called Gaussian knockoffs. It is restricted to the case of multivariate normally distributed covariates $X_{1:d}$. However, it can also be applied to covariates that are not normally distributed. In such cases, their distribution is approximated by a multivariate normal distribution, and the knockoff generation procedure is sometimes also called second-order knockoffs (Candès et al. 2018).^4 The name reflects the fact that the procedure is designed in a way that the first and second moments of the data are matched, but not necessarily the entire distribution.
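A minimal Python sketch of this sampling step is given below. It assumes that $\Sigma$ and the vector $s$ are already available (e.g., from an equicorrelated construction or a semidefinite program); the function names are ours and not the API of an existing package.

```python
import numpy as np

def sample_gaussian_knockoffs(x, sigma, s, rng):
    """Sample Gaussian knockoffs for data x (n x d) given covariance sigma and vector s."""
    d = sigma.shape[0]
    sigma_inv_ds = np.linalg.solve(sigma, np.diag(s))   # Sigma^{-1} diag(s)
    mu = x - x @ sigma_inv_ds                            # conditional mean
    v = 2 * np.diag(s) - np.diag(s) @ sigma_inv_ds       # conditional covariance
    chol = np.linalg.cholesky(v + 1e-10 * np.eye(d))     # small jitter for numerical stability
    return mu + rng.standard_normal(x.shape) @ chol.T

rng = np.random.default_rng(0)
d, n = 5, 200
sigma = 0.5 * np.ones((d, d)) + 0.5 * np.eye(d)          # equicorrelated covariance
s = np.full(d, 0.8)                                       # choice of s keeping G in (3.1) positive semidefinite
x = rng.multivariate_normal(np.zeros(d), sigma, size=n)
x_knockoffs = sample_gaussian_knockoffs(x, sigma, s, rng)
```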
4 Gaussian copula knockoffs, partial correlation vines and vine copulas
The Gaussian knockoff generation procedure is restricted to the multivariate normal case, i.e., it requires the distributional assumption $X_{1:d} \sim \mathcal{N}_d(0, \Sigma)$. If this assumption fails to hold, Gaussian knockoffs might still be used. However, depending on how well the true distribution of $X_{1:d}$ can be approximated with a multivariate normal distribution, Gaussian knockoffs might not produce valid results.
4.1 Gaussian copula knockoffs
A straightforward generalization of Gaussian knockoffs is given by assuming a Gaussian copula for $X_{1:d}$, i.e.,
$$U_{1:d} := \big(F_1(X_1), \ldots, F_d(X_d)\big) \sim C^{\mathrm{Gau}}_d(R),$$
where $C^{\mathrm{Gau}}_d(R)$ denotes a $d$-dimensional Gaussian copula with correlation matrix $R$, and $F_1, \ldots, F_d$ are arbitrary absolutely continuous marginal distributions such that $X_1 \sim F_1, \ldots, X_d \sim F_d$. Note that by definition
$$U_{1:d} \sim C^{\mathrm{Gau}}_d(R) \iff Y_{1:d} := \big(\Phi^{-1}(U_1), \ldots, \Phi^{-1}(U_d)\big) \sim \mathcal{N}_d(0, R),$$
where $\Phi(\cdot)$ denotes the cdf of the univariate standard normal distribution.

^4 We will in the following always use the term Gaussian knockoffs, irrespective of whether $X_{1:d}$ is multivariate normally distributed or not.
To obtain valid knockoffs, we use the following model
$$(U_{1:d}, \widetilde{U}_{1:d}) \sim C^{\mathrm{Gau}}_{2d}(H), \quad \text{where } H = \begin{pmatrix} R & R - \operatorname{diag}(r) \\ R - \operatorname{diag}(r) & R \end{pmatrix}. \tag{4.1}$$
The vector $r$, depending on $R$, can be obtained as the solution to the same kind of optimization problem that is solved to obtain $s$ depending on $\Sigma$ (see (3.1)). We further set $\widetilde{X}_{1:d} := \big(F_1^{-1}(\widetilde{U}_1), \ldots, F_d^{-1}(\widetilde{U}_d)\big)$ such that $\widetilde{X}_1 \sim F_1, \ldots, \widetilde{X}_d \sim F_d$.

Note that Gaussian knockoffs are a special case of Gaussian copula knockoffs, obtained by setting all marginal distributions $F_1, \ldots, F_d$ to the cdfs of the normal distributions $\mathcal{N}(0, \sigma^2_1), \ldots, \mathcal{N}(0, \sigma^2_d)$ with the respective variances $\sigma^2_1, \ldots, \sigma^2_d$.
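As an illustration, the following Python sketch implements this construction under simplifying assumptions: empirical marginal cdfs, a plug-in estimate of the copula correlation matrix $R$ on the normal-score scale, and a user-supplied vector $r$. The function names are ours and do not correspond to the vineknockoffs API.

```python
import numpy as np
from scipy.stats import norm, rankdata

def gaussian_copula_knockoffs(x, r_vec, rng):
    """Sketch of Gaussian copula knockoffs for an (n, d) sample x; r_vec plays the role of r in (4.1)."""
    n, d = x.shape
    # 1. Probability integral transform via (rescaled) empirical cdfs.
    u = rankdata(x, axis=0) / (n + 1)
    # 2. Normal scores and plug-in estimate of the copula correlation matrix R.
    z = norm.ppf(u)
    corr = np.corrcoef(z, rowvar=False)
    # 3. Gaussian knockoffs on the latent normal scale (conditional sampling as in Section 3).
    corr_inv_dr = np.linalg.solve(corr, np.diag(r_vec))
    mu = z - z @ corr_inv_dr
    v = 2 * np.diag(r_vec) - np.diag(r_vec) @ corr_inv_dr
    z_ko = mu + rng.standard_normal((n, d)) @ np.linalg.cholesky(v + 1e-10 * np.eye(d)).T
    # 4. Back-transform: standard normal cdf, then marginal (empirical) quantile functions.
    u_ko = norm.cdf(z_ko)
    x_ko = np.empty_like(x)
    for j in range(d):
        x_ko[:, j] = np.quantile(x[:, j], u_ko[:, j])
    return x_ko

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=1.0, size=(500, 4))  # non-normal margins
x_knockoffs = gaussian_copula_knockoffs(x, r_vec=np.full(4, 0.6), rng=rng)
```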
4.2 Partial correlation vines and completion problems
A straightforward approach to generate Gaussian copula knockoffs relies on partial correlation vines and vine copulas. Before we discuss partial correlation vines themselves, we want to introduce vines and the subclasses of so-called regular vines (R-vines) and drawable vines (D-vines). Vines as a graph-theoretic concept have been introduced by Bedford and Cooke (2002) and form the basis for partial correlation vines (Bedford and Cooke 2002; Kurowicka and Cooke 2003) and vine copulas (Aas et al. 2009; Bedford and Cooke 2001; Joe 1997).

For $d$ variables, a vine $\mathcal{V} := (T_1, \ldots, T_{d-1}) := ((N_1, E_1), \ldots, (N_{d-1}, E_{d-1}))$ consists of $d-1$ trees. Each tree $T_j$ consists of nodes $N_j$ and edges $E_j$ which form a connected graph with no cycle. In the vine, nodes of tree $T_j$ are edges in tree $T_{j-1}$, i.e., $N_j = E_{j-1}$ for $j = 2, \ldots, d-1$. The nodes of the first tree are $N_1 = \{1, \ldots, d\}$, i.e., the variable indices.

If a vine satisfies the so-called proximity condition, it is called a regular vine (R-vine). The proximity condition for $j = 2, \ldots, d-1$ is the requirement that two nodes can only be connected by an edge in tree $T_j$ if the nodes (being edges in $T_{j-1}$) share a common node in tree $T_{j-1}$. An R-vine is called a drawable vine (D-vine) if all nodes in the first tree $T_1$ are connected to at most two other nodes. The graph-theoretic structure of a D-vine can be nicely visualized. In Figure 1, we show the four-dimensional case. The order of the variables in the first tree $T_1$ ($X_1 - \cdots - X_d$) can be chosen arbitrarily and is sometimes also called the structure of the D-vine.
Figure 1: Four-dimensional partial correlation D-vine. (Tree $T_1$: nodes 1, 2, 3, 4 with edges 12, 23, 34 and partial correlations $\rho_{12}, \rho_{23}, \rho_{34}$; tree $T_2$: edges 13;2, 24;3 with $\rho_{13;2}, \rho_{24;3}$; tree $T_3$: edge 14;23 with $\rho_{14;23}$.)
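The edge sets of a D-vine are fully determined by the variable order in the first tree. The following small Python sketch (our own illustration, not from the paper) builds the tree sequence for an arbitrary order and reproduces the structure of Figure 1 for the order 1-2-3-4.

```python
def dvine_trees(order):
    """Edge sets E_1, ..., E_{d-1} of a D-vine; an edge is (a, b, conditioning_set)."""
    d = len(order)
    trees = []
    for j in range(1, d):                      # tree T_j
        edges = []
        for i in range(d - j):
            a, b = order[i], order[i + j]      # conditioned pair
            conditioning = tuple(order[i + 1:i + j])
            edges.append((a, b, conditioning))
        trees.append(edges)
    return trees

# Order 1-2-3-4 as in Figure 1:
# T_1: (1,2), (2,3), (3,4); T_2: (1,3 | 2), (2,4 | 3); T_3: (1,4 | 2,3).
trees = dvine_trees([1, 2, 3, 4])
```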
To understand the labeling of the nodes and edges in Figure 1, we need to introduce some more graph-theoretic concepts and notation. We consider an R-vine for $d$ variables. The com-