High-dimensional Measurement Error Models
for Lipschitz Loss
Xin Ma and Suprateek Kundu
Abstract
Recently emerging large-scale biomedical data pose exciting opportunities for scientific discoveries. However, the ultrahigh dimensionality and non-negligible measurement errors in the data may create difficulties in estimation. Existing methods for high-dimensional covariates with measurement error are limited, usually requiring knowledge of the noise distribution and focusing on linear or generalized linear models. In this work, we develop high-dimensional measurement error models for a class of Lipschitz loss functions that encompasses logistic regression, hinge loss, and quantile regression, among others. Our estimator is designed to minimize the L1 norm among all estimators belonging to suitable feasible sets, without requiring any knowledge of the noise distribution. Subsequently, we generalize these estimators to a Lasso analog version that is computationally scalable to higher dimensions. We derive theoretical guarantees in terms of finite sample statistical error bounds and sign consistency, even when the dimensionality increases exponentially with the sample size. Extensive simulation studies demonstrate superior performance compared to existing methods in classification and quantile regression problems. An application to a gender classification task based on brain functional connectivity in the Human Connectome Project data illustrates improved accuracy under our approach, and the ability to reliably identify significant brain connections that drive gender differences.
Keywords: Classification; Lipschitz loss; measurement error models; neuroimaging analysis.
Department of Statistics, Florida State University
Department of Biostatistics, The University of Texas at MD Anderson Cancer Center
Corresponding author: Email: SKundu2@mdanderson.org; Address: 1400 Pressler Street, Unit 1411,
Houston, TX 77030
1 Introduction
High-dimensional data have emerged in various research fields such as human genetics, neuroimaging, and microbiome studies. When the number of features in the data becomes larger than the sample size, or even increases exponentially with the sample size, traditional regression models fail to provide estimates of the regression coefficients, and the classical large sample theory no longer applies. In order to accommodate an ultrahigh number of features in the regression framework, a series of penalized methods have been proposed. These methods assume that only a small set of features contributes to the outcome variable, so that the regression coefficient vector includes very few nonzero elements. Well-known examples of such sparse learning methods include the Lasso with the convex L1 penalty (Tibshirani, 1996) and the closely related Dantzig selector (Bickel et al., 2009), as well as non-convex methods such as the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001) and the minimax concave penalty (MCP) (Zhang, 2010), among others.
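For concreteness, these penalized estimators can be written in a common schematic form; the notation below is chosen here purely for illustration and is not specific to any one of the cited papers:
\[
\hat{\beta} \in \arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, x_i^\top \beta\big) + p_\lambda(\beta) \right\},
\]
where \(\ell\) is a loss function and \(p_\lambda\) is a penalty encouraging sparsity. Taking \(p_\lambda(\beta) = \lambda \|\beta\|_1\) recovers the Lasso, while SCAD and MCP replace the L1 term with folded-concave penalties that reduce the bias incurred by L1 shrinkage on large coefficients.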
For these penalized methods, the sparsity assumption on the true regression coefficients has enabled the establishment of desirable theoretical results, both for linear regression models and for generalized linear regression settings (James and Radchenko, 2009). More recently, Negahban et al. (2012) proposed a unified framework for M-estimators based on decomposable regularizers for a wide class of convex differentiable loss functions. In the context of generalized linear models, Fan and Lv (2011) investigated the performance of specific non-convex penalties including SCAD and MCP in ultrahigh dimensions and showed them to possess the oracle property under mild assumptions. For the specific case of binary classification in the context of penalized support vector machines (SVM), Peng et al. (2016) derived finite sample statistical error bounds under L1 penalties and further showed the oracle property of non-convex penalized SVM under certain conditions. A wider class of Lipschitz loss functions involving support vector machines and quantile regression was studied by Dedieu (2019) under varying penalties, with a focus on deriving finite sample statistical error bounds under high-dimensional settings.
The above approaches, and the traditional literature on penalized approaches for high-dimensional regression, have essentially ignored the presence of measurement error in high-dimensional covariates. However, this can be an unrealistic assumption, especially in medical imaging studies, where measurement errors in the images can result from various factors such as technical limitations, experimental design, and so on (Raser and O'shea, 2005; Liu, 2016). Ignoring measurement error in estimation has been shown to result in biased estimates and attenuation to the null (Carroll and Stefanski, 1994). A more recent line of work has extended penalized sparse learning to the case of high-dimensional covariates with measurement errors in linear regression settings. See, for example, the work by Loh et al. (2012); Datta and Zou (2017) and more recent work involving grouped penalties in the presence of noisy functional features (Ma and Kundu, 2022). These approaches typically use corrected versions of the objective functions in order to tackle the measurement error in covariates. Such approaches often require additional validation samples in order to compute moments of the measurement error distribution, which may not always be feasible in practice. Further, storing the high-dimensional noise covariance may impose excessive memory requirements that can result in computational bottlenecks. Some recent approaches bypass the challenges of computing and storing the high-dimensional noise covariance by not requiring knowledge of the noise distribution. Examples include the seminal work by Rosenbaum and Tsybakov (2010, 2013) involving the matrix uncertainty selector (MUS), which is motivated by the Dantzig selector and relies on feasible sets that are guaranteed to contain the true parameter with high confidence. Other examples include sparse total least squares (Zhu et al., 2011) and orthogonal matching pursuit (Chen and Caramanis, 2013).
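To fix ideas, the MUS for linear regression can be written, up to notational conventions that we adapt here for illustration, as an L1 minimization over a data-driven feasible set. With observed noisy design \(W = X + U\) and response \(y\), it takes the form
\[
\hat{\beta} \in \arg\min \left\{ \|\beta\|_1 \; : \; \left\| \tfrac{1}{n} W^\top (y - W\beta) \right\|_\infty \le \mu \|\beta\|_1 + \tau \right\},
\]
where the constants \(\mu\) and \(\tau\) are chosen to reflect the magnitudes of the measurement error \(U\) and of the regression error, respectively, so that the true parameter lies in the feasible set with high probability. Only a bound on the noise magnitude, rather than its full distribution, is needed.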
There has been a parallel development of generalized linear models involving measurement errors in covariates, which has unfortunately been restricted to fixed or low-dimensional covariate features. Readers can refer to Stefanski (2000) for a review of measurement error models. In recent work, Sørensen et al. (2018) proposed a heuristic approach known as the generalized MUS (GMUS) estimator for generalized linear models (GLM); however, no theoretical aspects in terms of error bounds were investigated. Beyond GLM and linear regression, there are very limited approaches for measurement error models, to our knowledge. Some examples include quantile regression methods involving covariates with noise, such as Wei and Carroll (2009), who proposed a bias-correction approach based on joint estimating equations, and Wang et al. (2012a), who proposed a method based on corrected score functions. The above approaches for GLM and quantile regression typically have desirable asymptotic properties, but theoretical guarantees in terms of finite sample error bounds are lacking in these cases, and they do not cater to high-dimensional settings of interest where the number of covariates may increase exponentially with the sample size. In this context, we note that deriving provably flexible estimators for high-dimensional covariates (n << p) in classification as well as quantile regression scenarios is known to be a challenging problem, even in the case without measurement errors (Peng et al., 2016; Dedieu, 2019).
In this article, we address the task of developing provably flexible estimators for high-dimensional covariates in the presence of measurement errors, for a general class of Lipschitz continuous loss functions that goes beyond the routinely studied simple linear regression and GLM settings. In addition to logistic regression, which lies within the GLM family, the class of loss functions includes support vector machine (SVM) classification and quantile regression loss problems as special cases, among others. We note that a similar class of loss functions was investigated in Dedieu (2019), who proposed an L1-constrained slope estimation but without considering measurement errors in covariates. Our work is distinct from their approach, both in terms of methodology as well as theoretical derivations and implementation. Our estimators are designed to minimize the L1 norm among a wide class of estimators belonging to suitable feasible sets that contain the true parameter with high confidence. The proposed approach is motivated by developments in Rosenbaum and Tsybakov (2010), who proposed the matrix uncertainty selector (MUS) for linear regression problems in the presence of measurement error. However, our contributions constitute non-trivial generalizations of the MUS to a much wider class of loss functions that cover commonly encountered classification and quantile regression problems as special cases. Another distinction compared to Rosenbaum and Tsybakov (2010) is that we propose a Lasso analog of the proposed method that is computationally scalable to much higher dimensions, and we establish finite sample error bounds for this analog estimator. In our treatment, we start with the case without measurement error, and subsequently generalize our framework to the case with additive measurement errors. We denote the proposed estimators for the general class of smooth Lipschitz loss functions as matrix uncertainty estimators for Lipschitz loss (MULL) throughout the article. To our knowledge, the proposed development addresses a critical gap in the high-dimensional measurement error modeling literature.
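For orientation, representative members of this class, written in generic notation used here only for illustration, include the logistic and hinge losses for binary labels \(y \in \{-1, 1\}\) and the check loss for quantile regression at level \(\tau \in (0, 1)\):
\[
\ell_{\text{logistic}}(u) = \log\big(1 + e^{-u}\big), \qquad \ell_{\text{hinge}}(u) = \max(0, 1 - u), \qquad u = y \, x^\top \beta,
\]
\[
\rho_\tau(u) = u \big( \tau - \mathbf{1}\{u < 0\} \big), \qquad u = y - x^\top \beta.
\]
Each of these is Lipschitz continuous with constant at most one: the derivative of the logistic loss is bounded by one in absolute value, the hinge loss has slopes in \(\{-1, 0\}\), and the check loss has slopes \(\tau\) and \(\tau - 1\).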
We derive finite sample statistical error bounds and sign consistency results for the proposed MULL estimators and the corresponding Lasso analog version, even when the number of covariates increases exponentially with the sample size, which guarantee sound operating characteristics of the method in high-dimensional cases. We implement the proposed approaches via efficient algorithms that scale to high-dimensional settings. Similar to Rosenbaum and Tsybakov (2010), the proposed approach does not require additional validation samples to compute the moments of the measurement error distribution, which helps alleviate practical difficulties. We evaluate the operating characteristics via extensive simulation experiments and illustrate the superior performance of our proposed estimator over other competing methods in a variety of settings. The proposed approach is applied to a gender classification task based on the functional connectome in the Human Connectome Project (HCP), where the number of edges in the brain network increases quadratically with the number of brain regions. In these types of studies, the brain network is estimated from the