Vine copula based knockoff generation for high-dimensional
controlled variable selection
Malte S. Kurz
TUM School of Management
Technical University of Munich
Arcisstr. 21, 80333 Munich, Germany
October 21, 2022
Abstract
Vine copulas are a flexible tool for high-dimensional dependence modeling. In this article, we discuss the generation of approximate model-X knockoffs with vine copulas. It is shown how Gaussian knockoffs can be generalized to Gaussian copula knockoffs. A convenient way to parametrize Gaussian copulas is via partial correlation vines. We discuss how completion problems for partial correlation vines are related to Gaussian knockoffs. A natural generalization of partial correlation vines is given by vine copulas, which are well suited for the generation of approximate model-X knockoffs. We discuss a specific D-vine structure which is advantageous for obtaining vine copula knockoff models. In a simulation study, we demonstrate that vine copula knockoff models are effective and powerful for high-dimensional controlled variable selection.
1 Introduction
In various fields, like economics, finance, biology or medicine, researchers and practitioners try to identify important variables explaining a response variable of interest. The set of potential explanatory variables is often high-dimensional. Therefore, appropriate statistical tools are necessary to control the false discovery rate. Model-X knockoffs (Candès et al. 2018) can be used in such situations for high-dimensional controlled variable selection.
Knockoffs have been introduced by Barber and Candès (2015) and have been generalized to model-X knockoffs in Candès et al. (2018). The general idea of knockoffs is the following: Denote the response variable by $Y$ and the $d$-dimensional vector of explanatory variables by $X_{1:d}$. For the explanatory variables $X_{1:d}$, knockoff copies $\widetilde{X}_{1:d}$ are constructed. To obtain valid knockoffs, two properties need to be satisfied. First, it is required that, conditionally on the true explanatory variables $X_{1:d}$, the knockoffs $\widetilde{X}_{1:d}$ are not associated with the response variable $Y$. At the same time, the knockoff copies $\widetilde{X}_{1:d}$ need to be constructed in a way that their distributional structure is very similar to that of the original variables $X_{1:d}$.^1 In variable selection procedures, these knockoffs are then added as control variables. Only if an original variable $X_j$ is substantially more relevant than its knockoff counterpart $\widetilde{X}_j$ will it be considered as an important explanatory variable for the response $Y$. The knockoffs framework by Barber and Candès (2015) and Candès et al. (2018) is then constructed in a way such that the false discovery rate is, at least approximately, controlled.
malte.kurz@tum.de
^1 A formal definition of knockoffs will be given in Section 2.
A key part of the knockoffs framework for high-dimensional controlled variable selection is the knockoff generation procedure or model. Different methods have been proposed and analyzed in the literature. The simplest approach relies on a multivariate Gaussian distribution which is obtained by matching the first and second moments (Candès et al. 2018). More flexible alternatives are based on hidden Markov models (Sesia et al. 2018) or variational auto-encoders (Liu and Zheng 2018). Deep learning methods for knockoff generation have been analyzed in Romano et al. (2020) and Sudarshan et al. (2020), and generative adversarial networks in Jordon et al. (2019).
In the following, we will propose a knockoff generation method based on vine copulas. Vine copulas (Aas et al. 2009; Bedford and Cooke 2001; Joe 1997) are a flexible model class for high-dimensional dependence modeling and have already been applied as high-dimensional generative models (see Tagasovska et al. 2019). We will discuss how Gaussian knockoffs can be generalized to Gaussian copula knockoffs. Gaussian copula knockoffs in particular allow for more flexible, i.e., non-normal, marginal distributions. Gaussian copulas can be parametrized with so-called partial correlation vines (Bedford and Cooke 2002; Kurowicka and Cooke 2003). We will explain how completion problems for partial correlation vines are related to the construction of knockoffs. Partial correlation vines can be generalized to vine copulas in order to allow for more flexible dependence structures. We will introduce a specific D-vine copula model which is particularly well suited for constructing approximate model-X knockoffs. In a simulation study, it is demonstrated that vine copula based knockoffs are effective and powerful for high-dimensional controlled variable selection. An implementation of all three knockoff methods (Gaussian knockoffs, Gaussian copula knockoffs and vine copula knockoffs) is available in the accompanying Python package vineknockoffs (Kurz 2022).
The paper is structured as follows. In Section 2, we review the most important concepts of the model-X knockoff framework. Gaussian knockoffs are discussed in Section 3. The generalization to Gaussian copula knockoffs, as well as partial correlation vines and vine copulas, is discussed in Section 4. In Section 5, we introduce vine copula based knockoffs and discuss implementation details. A simulation study, which analyzes the finite sample performance for high-dimensional controlled variable selection, is presented in Section 6. Concluding remarks are given in Section 7.
2 Model-X knockoffs
A random vector $\widetilde{X}_{1:d} \in \mathbb{R}^d$ is called a model-X knockoff copy of $X_{1:d} \in \mathbb{R}^d$ if the following properties are satisfied:
$$(X_{1:d}, \widetilde{X}_{1:d}) \overset{d}{=} (X_{1:d}, \widetilde{X}_{1:d})_{\mathrm{swap}(\mathcal{S})}, \quad \text{for each } \mathcal{S} \subseteq 1{:}d := \{1, \ldots, d\}, \tag{2.1}$$
$$Y \perp \widetilde{X}_{1:d} \mid X_{1:d}, \tag{2.2}$$
where $\overset{d}{=}$ denotes equality in distribution.^2
The first knockoff property (2.1) implies that the joint distribution of the vector of covariates $X_{1:d}$ together with the vector of its knockoffs $\widetilde{X}_{1:d}$ is invariant against any kind of swap. A swap with subset $\mathcal{S} \subseteq 1{:}d$ is obtained by swapping the entries $X_j$ and $\widetilde{X}_j$ for each $j \in \mathcal{S}$ in the augmented vector $(X_{1:d}, \widetilde{X}_{1:d})$. For example, for $d = 3$, we can consider the subset $\mathcal{S} = \{1, 3\}$, and the knockoff property (2.1) becomes
$$\big(X_1, X_2, X_3, \widetilde{X}_1, \widetilde{X}_2, \widetilde{X}_3\big) \overset{d}{=} \big(\widetilde{X}_1, X_2, \widetilde{X}_3, X_1, \widetilde{X}_2, X_3\big).$$
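To make the swap operation concrete, the following minimal Python sketch (the function name is ours and not part of any package discussed here) applies $\mathrm{swap}(\mathcal{S})$ to an augmented data matrix whose first $d$ columns contain the original variables and whose last $d$ columns contain the knockoffs. Under property (2.1), the swapped matrix has the same joint distribution as the original augmented matrix.

```python
import numpy as np

def swap_columns(xx_aug, swap_set, d):
    """Exchange original and knockoff columns for the (1-based) indices in swap_set."""
    out = xx_aug.copy()
    for j in swap_set:
        out[:, [j - 1, d + j - 1]] = out[:, [d + j - 1, j - 1]]
    return out

# Example for d = 3 and S = {1, 3}: the column pairs (X_1, knockoff of X_1)
# and (X_3, knockoff of X_3) are exchanged.
rng = np.random.default_rng(42)
xx_aug = rng.normal(size=(5, 6))
xx_swapped = swap_columns(xx_aug, swap_set={1, 3}, d=3)
```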
^2 Originally, knockoffs have been introduced by Barber and Candès (2015) under the assumption that the covariates are fixed. The term model-X knockoffs was introduced by Candès et al. (2018), who treat the covariates $X_{1:d}$ as random variables in order to be able to apply knockoffs in high-dimensional settings.
The second knockoff property (2.2) is satisfied if, given the original explanatory variables $X_{1:d}$, the knockoffs $\widetilde{X}_{1:d}$ have no effect on the response variable $Y$. Finding knockoff generation methods such that the first knockoff property (2.1) is satisfied can be challenging. For some specific distributions, like the multivariate normal distribution, it is possible to obtain exact knockoff copies (see for example Candès et al. 2018; Gimenez et al. 2019; Sesia et al. 2018). If the distribution of $X_{1:d}$ is more complex, various methods have been proposed in the literature. These methods can be used to construct knockoffs that, to a certain extent, approximately satisfy the knockoff property (2.1). In contrast, the second knockoff property (2.2) is easily satisfied if the outcome variable $Y$ is not used to construct the knockoffs.
To recall some key terms for controlled variable selection and model-X knockoffs, we consider the following problem. Assume that we have obtained a sample from a response variable of interest $Y$ together with covariates $X_{1:d}$ which might explain $Y$. The goal is to identify a subset of $X_{1:d}$ containing important variables which have an effect on $Y$. To formalize this, let us assume that the response only depends on a (small) subset of variables $\mathcal{S} \subset \{1, \ldots, d\}$ such that, conditionally on $\{X_i\}_{i \in \mathcal{S}}$, the outcome variable $Y$ is independent of all other covariates. We further denote by $\widehat{\mathcal{S}}$ the set of important variables which has been identified with a variable selection procedure. Usually, such variable selection procedures are designed in a way that the false discovery rate is controlled, i.e.,
$$\mathbb{E}\left[\frac{\#\{i : i \in \widehat{\mathcal{S}} \setminus \mathcal{S}\}}{\#\{i : i \in \widehat{\mathcal{S}}\}}\right] \le q,$$
for some nominal level $q \in (0,1)$ and with the convention $\frac{0}{0} = 0$.
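As an illustration only (not part of the original text), the quantity inside the expectation, the false discovery proportion of a given selection, can be computed as in the following sketch; the false discovery rate is its expectation over repeated samples.

```python
def false_discovery_proportion(selected, important):
    """FDP = #(selected \\ important) / #selected, with the convention 0/0 := 0."""
    selected, important = set(selected), set(important)
    if not selected:
        return 0.0
    return len(selected - important) / len(selected)

# Toy example: one of the three selected variables is a false discovery -> FDP = 1/3.
fdp = false_discovery_proportion(selected={1, 4, 7}, important={1, 4, 9})
```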
It has been shown in Candès et al. (2018) that model-X knockoffs yield a variable selection method for which the false discovery rate is controlled. In the following, we briefly review the most important steps of the model-X knockoffs framework. A first key element is a method to construct model-X knockoffs satisfying the knockoff properties (2.1) and (2.2). Additionally, measures of feature importance $Z_i$ and $\widetilde{Z}_i$ are required for each variable $X_i$, $1 \le i \le d$, and its knockoff copy $\widetilde{X}_i$, $1 \le i \le d$, respectively. These measures of feature importance can be obtained from standard ML methods, like for example a lasso or elastic net regression of $Y$ on the augmented vector $(X_{1:d}, \widetilde{X}_{1:d})$. The feature importance scores of each variable and its knockoff are then combined into a knockoff statistic, e.g., $W_i = Z_i - \widetilde{Z}_i$. This knockoff statistic is antisymmetric, so that a large positive value of the knockoff statistic $W_i$ is an indication of an important variable $X_i$. At the same time, for an unimportant variable $X_i$, positive and negative values of the knockoff statistic $W_i$ should be equally likely. The estimated set of important variables from the model-X knockoffs framework, while controlling the false discovery rate, is then obtained as $\widehat{\mathcal{S}} := \{i : W_i \ge \tau_q\}$. Here, the threshold $\tau_q$ is given by (Barber and Candès 2015; Candès et al. 2018)^3
$$\tau_q = \min\left\{ t > 0 : \frac{1 + \#\{i : W_i \le -t\}}{\#\{i : W_i \ge t\}} \le q \right\}.$$
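A minimal Python sketch of this selection step, assuming the knockoff statistics $W_i$ have already been computed (the helper name is ours, not the API of an existing package):

```python
import numpy as np

def knockoff_threshold(w_stats, q):
    """Smallest t > 0 with (1 + #{W_i <= -t}) / #{W_i >= t} <= q (knockoff+ threshold)."""
    w_stats = np.asarray(w_stats)
    candidates = np.sort(np.abs(w_stats[w_stats != 0]))  # candidate values for t
    for t in candidates:
        ratio = (1 + np.sum(w_stats <= -t)) / max(np.sum(w_stats >= t), 1)
        if ratio <= q:
            return t
    return np.inf  # no feasible threshold, nothing is selected

w_stats = np.array([2.1, 1.8, 1.5, 1.4, 0.9, -0.3])
tau_q = knockoff_threshold(w_stats, q=0.2)                  # tau_q = 0.9 in this toy example
selected = {i + 1 for i, w in enumerate(w_stats) if w >= tau_q}
```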
The validity and quality of model-X knockoffs fundamentally depends on the procedure used
for generating knockoffs which satisfy the properties (2.1) and (2.2). In the following, we will
propose a new such knockoff generation method. The new method utilizes vine copulas which
are a powerful tool for high-dimensional dependence modeling.
^3 Note that recently proposed extensions can be employed to derandomize knockoffs and/or to find better thresholds for the knockoff filter (see for example Emery and Keich 2019; Gimenez and Zou 2019; Luo et al. 2022; Ren et al. 2021). Many of these methods try to improve the stability of knockoff filters by generating multiple or simultaneous knockoffs and combining them in an appropriate way. These advanced methods or extensions could also be combined with the vine copula knockoff generation method. For the sake of simplicity, this is left for future research.
3 Gaussian knockoffs
Assume that the $d$-dimensional covariates $X_{1:d} \in \mathbb{R}^d$ are multivariate normally distributed, i.e., $X_{1:d} \sim \mathcal{N}_d(0, \Sigma)$, where $\mathcal{N}_d(\mu, \Sigma)$ denotes a $d$-dimensional normal distribution with expectation $\mu$ and covariance matrix $\Sigma$. Model-X knockoffs $\widetilde{X}_{1:d} \in \mathbb{R}^d$ can then be obtained from the following joint normal distribution (Candès et al. 2018)
$$(X_{1:d}, \widetilde{X}_{1:d}) \sim \mathcal{N}_{2d}(0, G), \quad \text{where } G = \begin{pmatrix} \Sigma & \Sigma - \operatorname{diag}(s) \\ \Sigma - \operatorname{diag}(s) & \Sigma \end{pmatrix} \tag{3.1}$$
and $\operatorname{diag}(s)$ is a diagonal matrix such that $G$ is positive semidefinite. Typically, the vector $s$ is obtained by solving a semidefinite program, see Candès et al. (2018).

Having specified, or estimated, $\Sigma$, $s$ and therefore also $G$, knockoffs $\widetilde{X}_{1:d}$ can simply be obtained by sampling from the conditional distribution (Candès et al. 2018)
$$\widetilde{X}_{1:d} \mid X_{1:d} = x_{1:d} \sim \mathcal{N}_d(\mu, V),$$
with
$$\mu = x_{1:d} - x_{1:d}\,\Sigma^{-1}\operatorname{diag}(s), \qquad V = 2\operatorname{diag}(s) - \operatorname{diag}(s)\,\Sigma^{-1}\operatorname{diag}(s).$$
This procedure to generate knockoffs will in the following be called Gaussian knockoffs. It is restricted to the case of multivariate normally distributed covariates $X_{1:d}$. However, it can also be applied to covariates that are not normally distributed. In such cases, their distribution is approximated by a multivariate normal distribution, and the knockoff generation procedure is sometimes also called second-order knockoffs (Candès et al. 2018).^4 The name reflects the fact that the procedure is designed in a way that the first and second moments of the data are matched, but not necessarily the entire distribution.
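A minimal Python sketch of this sampling step is given below. It assumes that $\Sigma$ and the vector $s$ are already available (e.g., from an equicorrelated construction or a semidefinite program); the function names are ours and not the API of an existing package.

```python
import numpy as np

def sample_gaussian_knockoffs(x, sigma, s, rng):
    """Sample Gaussian knockoffs for data x (n x d) given covariance sigma and vector s."""
    d = sigma.shape[0]
    sigma_inv_ds = np.linalg.solve(sigma, np.diag(s))   # Sigma^{-1} diag(s)
    mu = x - x @ sigma_inv_ds                            # conditional mean
    v = 2 * np.diag(s) - np.diag(s) @ sigma_inv_ds       # conditional covariance
    chol = np.linalg.cholesky(v + 1e-10 * np.eye(d))     # small jitter for numerical stability
    return mu + rng.standard_normal(x.shape) @ chol.T

rng = np.random.default_rng(0)
d, n = 5, 200
sigma = 0.5 * np.ones((d, d)) + 0.5 * np.eye(d)          # equicorrelated covariance
s = np.full(d, 0.8)                                       # choice of s keeping G in (3.1) positive semidefinite
x = rng.multivariate_normal(np.zeros(d), sigma, size=n)
x_knockoffs = sample_gaussian_knockoffs(x, sigma, s, rng)
```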
4 Gaussian copula knockoffs, partial correlation vines and vine copulas
The Gaussian knockoff generation procedure is restricted to the multivariate normal case, i.e., it requires the distributional assumption $X_{1:d} \sim \mathcal{N}_d(0, \Sigma)$. If this assumption fails to hold, Gaussian knockoffs might still be used. However, depending on how well the true distribution of $X_{1:d}$ can be approximated with a multivariate normal distribution, Gaussian knockoffs might not produce valid results.
4.1 Gaussian copula knockoffs
A straightforward generalization of Gaussian knockoffs is given by assuming a Gaussian copula for $X_{1:d}$, i.e.,
$$U_{1:d} := \big(F_1(X_1), \ldots, F_d(X_d)\big) \sim C^{\mathrm{Gau}}_d(R),$$
where $C^{\mathrm{Gau}}_d(R)$ denotes a $d$-dimensional Gaussian copula with correlation matrix $R$, and $F_1, \ldots, F_d$ are arbitrary absolutely continuous marginal distributions such that $X_1 \sim F_1, \ldots, X_d \sim F_d$. Note that by definition
$$U_{1:d} \sim C^{\mathrm{Gau}}_d(R) \iff Y_{1:d} := \big(\Phi^{-1}(U_1), \ldots, \Phi^{-1}(U_d)\big) \sim \mathcal{N}_d(0, R),$$
where $\Phi(\cdot)$ denotes the cdf of the univariate standard normal distribution.

^4 We will in the following always use the term Gaussian knockoffs, irrespective of whether $X_{1:d}$ is multivariate normally distributed or not.
To obtain valid knockoffs, we use the following model
$$(U_{1:d}, \widetilde{U}_{1:d}) \sim C^{\mathrm{Gau}}_{2d}(H), \quad \text{where } H = \begin{pmatrix} R & R - \operatorname{diag}(r) \\ R - \operatorname{diag}(r) & R \end{pmatrix}. \tag{4.1}$$
The vector $r$, depending on $R$, can be obtained as the solution to the same kind of optimization problem that is solved to obtain $s$ depending on $\Sigma$ (see (3.1)). We further set $\widetilde{X}_{1:d} := \big(F_1^{-1}(\widetilde{U}_1), \ldots, F_d^{-1}(\widetilde{U}_d)\big)$ such that $\widetilde{X}_1 \sim F_1, \ldots, \widetilde{X}_d \sim F_d$.

Note that Gaussian knockoffs are a special case of Gaussian copula knockoffs, obtained by setting all marginal distributions $F_1, \ldots, F_d$ to the cdfs of the normal distributions $\mathcal{N}(0, \sigma^2_1), \ldots, \mathcal{N}(0, \sigma^2_d)$ with the respective variances $\sigma^2_1, \ldots, \sigma^2_d$.
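As an illustration, the following Python sketch implements this construction under simplifying assumptions: empirical marginal cdfs, a plug-in estimate of the copula correlation matrix $R$ on the normal-score scale, and a user-supplied vector $r$. The function names are ours and do not correspond to the vineknockoffs API.

```python
import numpy as np
from scipy.stats import norm, rankdata

def gaussian_copula_knockoffs(x, r_vec, rng):
    """Sketch of Gaussian copula knockoffs for an (n, d) sample x; r_vec plays the role of r in (4.1)."""
    n, d = x.shape
    # 1. Probability integral transform via (rescaled) empirical cdfs.
    u = rankdata(x, axis=0) / (n + 1)
    # 2. Normal scores and plug-in estimate of the copula correlation matrix R.
    z = norm.ppf(u)
    corr = np.corrcoef(z, rowvar=False)
    # 3. Gaussian knockoffs on the latent normal scale (conditional sampling as in Section 3).
    corr_inv_dr = np.linalg.solve(corr, np.diag(r_vec))
    mu = z - z @ corr_inv_dr
    v = 2 * np.diag(r_vec) - np.diag(r_vec) @ corr_inv_dr
    z_ko = mu + rng.standard_normal((n, d)) @ np.linalg.cholesky(v + 1e-10 * np.eye(d)).T
    # 4. Back-transform: standard normal cdf, then marginal (empirical) quantile functions.
    u_ko = norm.cdf(z_ko)
    x_ko = np.empty_like(x)
    for j in range(d):
        x_ko[:, j] = np.quantile(x[:, j], u_ko[:, j])
    return x_ko

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=1.0, size=(500, 4))  # non-normal margins
x_knockoffs = gaussian_copula_knockoffs(x, r_vec=np.full(4, 0.6), rng=rng)
```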
4.2 Partial correlation vines and completion problems
A straightforward approach to generate Gaussian copula knockoffs relies on partial correlation vines and vine copulas. Before we discuss partial correlation vines themselves, we want to introduce vines and the subclasses of so-called regular vines (R-vines) and drawable vines (D-vines). Vines as a graph-theoretic concept have been introduced by Bedford and Cooke (2002) and form the basis for partial correlation vines (Bedford and Cooke 2002; Kurowicka and Cooke 2003) and vine copulas (Aas et al. 2009; Bedford and Cooke 2001; Joe 1997).

For $d$ variables, a vine $\mathcal{V} := (T_1, \ldots, T_{d-1}) := ((N_1, E_1), \ldots, (N_{d-1}, E_{d-1}))$ consists of $d-1$ trees. Each tree $T_j$ consists of nodes $N_j$ and edges $E_j$ which form a connected graph with no cycle. In the vine, nodes of tree $T_j$ are edges in tree $T_{j-1}$, i.e., $N_j = E_{j-1}$ for $j = 2, \ldots, d-1$. The nodes of the first tree are $N_1 = \{1, \ldots, d\}$, i.e., the variable indices.

If a vine satisfies the so-called proximity condition, it is called a regular vine (R-vine). The proximity condition for $j = 2, \ldots, d-1$ is the requirement that two nodes can only be connected by an edge in tree $T_j$ if the nodes (being edges in $T_{j-1}$) share a common node in tree $T_{j-1}$. An R-vine is called a drawable vine (D-vine) if all nodes in the first tree $T_1$ are connected to at most two other nodes. The graph-theoretic structure of a D-vine can be nicely visualized. In Figure 1, we show the four-dimensional case. The order of the variables in the first tree $T_1$ ($X_1 - \cdots - X_d$) can be chosen arbitrarily and is sometimes also called the structure of the D-vine.
Figure 1: Four-dimensional partial correlation D-vine. (Tree $T_1$: nodes 1, 2, 3, 4 with edges 12, 23, 34 and partial correlations $\rho_{12}, \rho_{23}, \rho_{34}$; tree $T_2$: edges 13;2, 24;3 with $\rho_{13;2}, \rho_{24;3}$; tree $T_3$: edge 14;23 with $\rho_{14;23}$.)
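The edge sets of a D-vine are fully determined by the variable order in the first tree. The following small Python sketch (our own illustration, not from the paper) builds the tree sequence for an arbitrary order and reproduces the structure of Figure 1 for the order 1-2-3-4.

```python
def dvine_trees(order):
    """Edge sets E_1, ..., E_{d-1} of a D-vine; an edge is (a, b, conditioning_set)."""
    d = len(order)
    trees = []
    for j in range(1, d):                      # tree T_j
        edges = []
        for i in range(d - j):
            a, b = order[i], order[i + j]      # conditioned pair
            conditioning = tuple(order[i + 1:i + j])
            edges.append((a, b, conditioning))
        trees.append(edges)
    return trees

# Order 1-2-3-4 as in Figure 1:
# T_1: (1,2), (2,3), (3,4); T_2: (1,3 | 2), (2,4 | 3); T_3: (1,4 | 2,3).
trees = dvine_trees([1, 2, 3, 4])
```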
To understand the labeling of the nodes and edges in Figure 1, we need to introduce some more graph-theoretic concepts and notation. We consider an R-vine for $d$ variables. The com-