High-dimensional Measurement Error Models
for Lipschitz Loss
Xin Ma and Suprateek Kundu
Abstract
Recently emerging large-scale biomedical data pose exciting opportunities for scientific discoveries. However, the ultrahigh dimensionality and non-negligible measurement errors in the data may create difficulties in estimation. Existing methods for high-dimensional covariates with measurement error are limited, usually requiring knowledge of the noise distribution and focusing on linear or generalized linear models. In this work, we develop high-dimensional measurement error models for a class of Lipschitz loss functions that encompasses logistic regression, hinge loss, and quantile regression, among others. Our estimator is designed to minimize the L1 norm among all estimators belonging to suitable feasible sets, without requiring any knowledge of the noise distribution. Subsequently, we generalize these estimators to a Lasso analog version that is computationally scalable to higher dimensions. We derive theoretical guarantees in terms of finite sample statistical error bounds and sign consistency, even when the dimensionality increases exponentially with the sample size. Extensive simulation studies demonstrate superior performance compared to existing methods in classification and quantile regression problems. An application to a gender classification task based on brain functional connectivity in the Human Connectome Project data illustrates improved accuracy under our approach, and the ability to reliably identify significant brain connections that drive gender differences.
Keywords: Classification; Lipschitz loss; measurement error models; neuroimaging analysis.
Department of Statistics, Florida State University
Department of Biostatistics, The University of Texas at MD Anderson Cancer Center
Corresponding author: Email: SKundu2@mdanderson.org; Address: 1400 Pressler Street, Unit 1411,
Houston, TX 77030
1 Introduction
High-dimensional data have emerged in various research fields such as human genetics, neuroimaging, and microbiome studies. When the number of features in the data becomes larger than the sample size, or even increases exponentially with the sample size, traditional regression models fail to provide estimates of the regression coefficients, and the classical large sample theory no longer applies. In order to accommodate an ultrahigh number of features in the regression framework, a series of penalized methods have been proposed. These methods assume that only a small set of features contributes to the outcome variable, so that the regression coefficient vector includes very few nonzero elements. Well-known examples of such sparse learning methods include the Lasso with the convex L1 penalty (Tibshirani, 1996) and the closely related Dantzig selector (Bickel et al., 2009), as well as non-convex methods such as the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001) and the minimax concave penalty (MCP) (Zhang, 2010), among others.
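For concreteness, these penalized estimators can be written in a common schematic form; the notation below is chosen here purely for illustration and is not specific to any one of the cited papers:
\[
\hat{\beta} \in \arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, x_i^\top \beta\big) + p_\lambda(\beta) \right\},
\]
where \(\ell\) is a loss function and \(p_\lambda\) is a penalty encouraging sparsity. Taking \(p_\lambda(\beta) = \lambda \|\beta\|_1\) recovers the Lasso, while SCAD and MCP replace the L1 term with folded-concave penalties that reduce the bias incurred by L1 shrinkage on large coefficients.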
For these penalized methods, the sparsity assumption on the true regression coefficients has enabled the establishment of desirable theoretical results, both for linear regression models and for generalized linear regression settings (James and Radchenko, 2009). More recently, Negahban et al. (2012) proposed a unified framework for M-estimators based on decomposable regularizers for a wide class of convex differentiable loss functions. In the context of generalized linear models, Fan and Lv (2011) investigated the performance of specific non-convex penalties including SCAD and MCP in ultrahigh dimensions and showed them to possess the oracle property under mild assumptions. For the specific case of binary classification in the context of penalized support vector machines (SVM), Peng et al. (2016) derived finite sample statistical error bounds under L1 penalties and further showed the oracle property of non-convex penalized SVM under certain conditions. A wider class of Lipschitz loss functions involving support vector machines and quantile regression was studied by Dedieu (2019) under varying penalties, with a focus on deriving finite sample statistical error bounds under high-dimensional settings.
The above approaches, and the traditional literature on penalized approaches for high-dimensional regression, have essentially ignored the presence of measurement error in high-dimensional covariates. However, this can be an unrealistic assumption, especially in medical imaging studies, where measurement errors in the images can result from various factors such as technical limitations, experimental design, and so on (Raser and O'shea, 2005; Liu, 2016). Ignoring measurement error in estimation has been shown to result in biased estimates and attenuation to the null (Carroll and Stefanski, 1994). A more recent line of work has extended penalized sparse learning to the case of high-dimensional covariates with measurement errors in linear regression settings. See, for example, the work by Loh et al. (2012); Datta and Zou (2017) and more recent work involving grouped penalties in the presence of noisy functional features (Ma and Kundu, 2022). These approaches typically use corrected versions of the objective functions in order to tackle the measurement error in covariates. Such approaches often require additional validation samples in order to compute moments of the measurement error distribution, which may not always be feasible in practice. Further, storing the high-dimensional noise covariance may impose excessive memory requirements that can result in computational bottlenecks. Some recent approaches bypass the challenges of computing and storing the high-dimensional noise covariance by not requiring knowledge of the noise distribution. Examples include the seminal work by Rosenbaum and Tsybakov (2010, 2013) involving the matrix uncertainty selector (MUS), which is motivated by the Dantzig selector and relies on feasible sets that are guaranteed to contain the true parameter with high confidence. Other examples include sparse total least squares (Zhu et al., 2011) and orthogonal matching pursuit (Chen and Caramanis, 2013).
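To fix ideas, the MUS for linear regression can be written, up to notational conventions that we adapt here for illustration, as an L1 minimization over a data-driven feasible set. With observed noisy design \(W = X + U\) and response \(y\), it takes the form
\[
\hat{\beta} \in \arg\min \left\{ \|\beta\|_1 \; : \; \left\| \tfrac{1}{n} W^\top (y - W\beta) \right\|_\infty \le \mu \|\beta\|_1 + \tau \right\},
\]
where the constants \(\mu\) and \(\tau\) are chosen to reflect the magnitudes of the measurement error \(U\) and of the regression error, respectively, so that the true parameter lies in the feasible set with high probability. Only a bound on the noise magnitude, rather than its full distribution, is needed.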
There has been a parallel development of generalized linear models involving measurement errors in covariates, which has unfortunately been restricted to fixed or low-dimensional covariate features. Readers can refer to Stefanski (2000) for a review of measurement error models. In recent work, Sørensen et al. (2018) proposed a heuristic approach known as the generalized MUS (GMUS) estimator for generalized linear models (GLM); however, no theoretical aspects in terms of error bounds were investigated. Beyond GLM and linear regression, there are very limited approaches for measurement error models, to our knowledge. Some examples include quantile regression methods involving covariates with noise, such as Wei and Carroll (2009), who proposed a bias-correction approach based on joint estimating equations, and Wang et al. (2012a), who proposed a method based on corrected score functions. The above approaches for GLM and quantile regression typically have desirable asymptotic properties, but theoretical guarantees in terms of finite sample error bounds are lacking in these cases, and they do not cater to high-dimensional settings of interest where the number of covariates may increase exponentially with the sample size. In this context, we note that deriving provably flexible estimators for high-dimensional covariates (n << p) in classification as well as quantile regression scenarios is known to be a challenging problem, even in the case without measurement errors (Peng et al., 2016; Dedieu, 2019).
In this article, we address the task of developing provably flexible estimators for high-dimensional covariates in the presence of measurement errors, for a general class of Lipschitz continuous loss functions that goes beyond the routinely studied simple linear regression and GLM settings. In addition to logistic regression, which lies within the GLM family, the class of loss functions includes support vector machine (SVM) classification and quantile regression loss problems as special cases, among others. We note that a similar class of loss functions was investigated in Dedieu (2019), who proposed an L1-constrained slope estimation but without considering measurement errors in covariates. Our work is distinct from their approach, both in terms of methodology as well as theoretical derivations and implementation. Our estimators are designed to minimize the L1 norm among a wide class of estimators belonging to suitable feasible sets that contain the true parameter with high confidence. The proposed approach is motivated by developments in Rosenbaum and Tsybakov (2010), who proposed the matrix uncertainty selector (MUS) for linear regression problems in the presence of measurement error. However, our contributions constitute non-trivial generalizations of the MUS to a much wider class of loss functions that cover commonly encountered classification and quantile regression problems as special cases. Another distinction compared to Rosenbaum and Tsybakov (2010) is that we propose a Lasso analog of the proposed method that is computationally scalable to much higher dimensions, and we establish finite sample error bounds for this analog estimator. In our treatment, we start with the case without measurement error, and subsequently generalize our framework to the case with additive measurement errors. We denote the proposed estimators for the general class of smooth Lipschitz loss functions as matrix uncertainty estimators for Lipschitz loss (MULL) throughout the article. To our knowledge, the proposed development addresses a critical gap in the high-dimensional measurement error modeling literature.
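For orientation, representative members of this class, written in generic notation used here only for illustration, include the logistic and hinge losses for binary labels \(y \in \{-1, 1\}\) and the check loss for quantile regression at level \(\tau \in (0, 1)\):
\[
\ell_{\text{logistic}}(u) = \log\big(1 + e^{-u}\big), \qquad \ell_{\text{hinge}}(u) = \max(0, 1 - u), \qquad u = y \, x^\top \beta,
\]
\[
\rho_\tau(u) = u \big( \tau - \mathbf{1}\{u < 0\} \big), \qquad u = y - x^\top \beta.
\]
Each of these is Lipschitz continuous with constant at most one: the derivative of the logistic loss is bounded by one in absolute value, the hinge loss has slopes in \(\{-1, 0\}\), and the check loss has slopes \(\tau\) and \(\tau - 1\).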
We derive finite sample statistical error bounds and sign consistency results for the proposed MULL estimators and the corresponding Lasso analog version, even when the number of covariates increases exponentially with the sample size, which guarantee sound operating characteristics of the method in high-dimensional cases. We implement the proposed approaches via efficient algorithms that scale to high-dimensional settings. Similar to Rosenbaum and Tsybakov (2010), the proposed approach does not require additional validation samples to compute the moments of the measurement error distribution, which helps alleviate practical difficulties. We evaluate the operating characteristics via extensive simulation experiments and illustrate the superior performance of our proposed estimator over other competing methods in a variety of settings. The proposed approach is applied to a gender classification task based on the functional connectome in the Human Connectome Project (HCP), where the number of edges in the brain network increases quadratically with the number of brain regions. In these types of studies, the brain network is estimated from the