
Robust Estimation of Loss-Based Measures of
Model Performance under Covariate Shift
Samantha Morrison1, Constantine Gatsonis1, Issa J. Dahabreh2-4, Bing Li1, and Jon A.
Steingrimsson1
1Department of Biostatistics, School of Public Health, Brown University, Providence, RI
2CAUSALab, Harvard T.H. Chan School of Public Health, Boston, MA
3Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA
4Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA
Thursday 6th October, 2022
Address for correspondence: email: jon steingrimsson@brown.edu.
arXiv:2210.01980v1 [stat.ME] 5 Oct 2022
Abstract
We present methods for estimating loss-based measures of the performance of a
prediction model in a target population that differs from the source population in which
the model was developed, in settings where outcome and covariate data are available
from the source population but only covariate data are available on a simple random
sample from the target population. Prior work adjusting for differences between the
two populations has used various weighting estimators with inverse odds or density
ratio weights. Here, we develop more robust estimators for the target population risk
(expected loss) that can be used with data-adaptive (e.g., machine learning-based)
estimation of nuisance parameters. We examine the large-sample properties of the
estimators and evaluate finite sample performance in simulations. Last, we apply the
methods to data from lung cancer screening using nationally representative data from
the National Health and Nutrition Examination Survey (NHANES) and extend our
methods to account for the complex survey design of the NHANES.
Keywords: transportability, covariate shift, domain adaptation, MSE, double robustness,
weighting
1 Introduction
Ideally, a prediction model should be evaluated using data from the target population where
it will be applied, but typically the source data used for model building are not obtained
from a random sample of that target population [1] and model performance in the source
data may not reflect performance in the target population. A major reason why model
performance estimates may not transport to the target population is “covariate shift,” that
is, the presence of differences in the covariate distribution between the population underlying
the source data (i.e., the source population) and the target population [24]. Under covariate
shift, the conditional distribution of the outcome given the covariates is the same in both
the source and target population, but the covariate distributions of the two populations are
different (i.e., the populations have a different “case-mix”). Such differences affect model
performance as evaluated by measures that are averages of a loss function over the target
population distribution (e.g., the mean squared error, MSE; mean absolute error; the Brier
score [5]).
To understand how covariate shift affects model performance, it is useful to consider “pre-
diction error modifiers” – covariates that are associated with prediction error, as assessed
by a specific loss function, for a given model [6]. When prediction error modifiers have
a different distribution between the source and target population, estimators of loss-based
measures of model performance that only use source data are biased for target population
model performance [7]. Previous efforts to correct this bias used importance-weighting ap-
proaches [8] to re-weight observations in the source data by the ratio of the covariate density
of the target and source population or, equivalently, by the inverse of the conditional odds
of being from the source population [6], in order to construct asymptotically unbiased esti-
mators of the target population risk (i.e., the expected loss in the target population). The
conditional odds (or the covariate density ratio) are almost always unknown and need to be
estimated using statistical models. For the importance-weighting estimators to consistently
estimate the target population risk, the model for the conditional odds needs to be correctly
specified.
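As a minimal sketch of the importance-weighting idea described above, the following simulation reweights source-sample losses by the covariate density ratio. Everything in it is an illustrative assumption rather than part of the paper: the normal covariate distributions, the outcome model, the deliberately misspecified model `g`, and the fact that the density ratio is known in closed form (in practice it would have to be estimated).

```python
import math
import random


def g(x):
    # Hypothetical prediction model; it is misspecified on purpose so that
    # the squared error depends on x (x is a prediction error modifier).
    return x


def squared_error(y, pred):
    return (y - pred) ** 2


def weighted_risk(losses, weights):
    """Normalized importance-weighted average of source-sample losses."""
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


rng = random.Random(0)
n = 20000

# Assumed source population: X ~ N(0, 1), Y = 2X + noise.
x_src = [rng.gauss(0.0, 1.0) for _ in range(n)]
y_src = [2.0 * x + rng.gauss(0.0, 1.0) for x in x_src]
losses_src = [squared_error(y, g(x)) for x, y in zip(x_src, y_src)]

# Assumed target population: X ~ N(1, 1), same conditional law of Y given X.
# The density ratio N(1, 1) / N(0, 1) at x equals exp(x - 0.5); with equal
# sample sizes it is also the inverse odds of being from the source.
weights = [math.exp(x - 0.5) for x in x_src]

naive_risk = sum(losses_src) / n                 # ignores covariate shift
iw_risk = weighted_risk(losses_src, weights)     # reweighted to the target
```

In this toy setup the expected squared error is 2 in the source population and 3 in the target population, so the unweighted source average estimates the wrong quantity while the weighted average estimates the target population risk.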
Even when the model for the conditional odds of being from the source population is
correctly specified, importance-weighting estimators are inefficient [9]. Furthermore,
asymptotically valid inference for importance-weighting estimators requires estimators of the
conditional odds to converge at a $\sqrt{n}$ rate [10]. That precludes using data-adaptive estimators
(e.g., machine learning estimators) of the density ratio or the conditional odds because such
estimators converge at a slower than $\sqrt{n}$ rate. Data-adaptive estimators, however, are very
appealing in applied work because subject matter knowledge is typically inadequate to deter-
mine the correct specification of the conditional odds models, particularly when the covariates
that differ in distribution between the source and target population are high-dimensional or
have multiple continuous components.
In this paper, we develop a doubly robust estimator for the target population risk that
involves estimating both the expected loss conditional on covariates and the probability of
participation in the source. Our estimator is consistent for the target population risk if at
least one of these models is correctly specified, but not necessarily both, and can be used for
asymptotically valid inference even if the models are estimated using methods that converge
at a rate slower than $\sqrt{n}$ (i.e., allowing the use of data-adaptive estimation). In the process
of developing the doubly robust estimator, we also develop a novel conditional loss modeling-
based estimator that relies on estimating the expected loss conditional on covariates. We
provide identifiability conditions, identification results, and large-sample properties for the
doubly robust estimator. We compare the finite-sample performance of the doubly robust
and conditional loss modeling-based estimators against importance-weighting estimators in
simulations. Last, we apply the methods to estimate model performance of a prediction
model for lung cancer diagnosis built using data from the National Lung Screening Trial
(NLST) in a target population of people eligible for lung cancer screening in the US. The
sample from the target population comes from the National Health and Nutrition Examination
Survey (NHANES), a complex survey that involves multi-stage clustering and variable
probability sampling. We show how to modify our estimators to account for the complex
sampling design and to incorporate weights that account for the oversampling of certain
subgroups, survey non-response, and post-stratification adjustments.
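The double robustness property described in this section can be illustrated with a standard augmented estimator that combines a conditional loss model with inverse-odds weights. The sketch below is an assumption made for illustration, not necessarily the exact estimator analyzed in the paper: the function name `doubly_robust_risk`, the simulated populations, and the closed-form nuisance functions `true_h` and `true_p` are all hypothetical.

```python
import math
import random


def doubly_robust_risk(loss_src, p_src, h_src, h_tgt):
    """Augmented (doubly robust) estimate of the target population risk.

    loss_src: observed losses in the source sample
    p_src:    Pr[D = 1 | X] at each source covariate value
    h_src:    E[loss | X, D = 1] at each source covariate value
    h_tgt:    E[loss | X, D = 1] at each target covariate value
    """
    n0 = len(h_tgt)
    # Inverse-odds weighted residuals correct errors in the loss model.
    correction = sum((1.0 - p) / p * (loss - h)
                     for loss, p, h in zip(loss_src, p_src, h_src))
    return (sum(h_tgt) + correction) / n0


rng = random.Random(1)
n = 40000
# Assumed toy populations: source X ~ N(0, 1), target X ~ N(1, 1),
# Y = 2X + noise, and a misspecified model g(x) = x with squared error loss.
x_src = [rng.gauss(0.0, 1.0) for _ in range(n)]
y_src = [2.0 * x + rng.gauss(0.0, 1.0) for x in x_src]
x_tgt = [rng.gauss(1.0, 1.0) for _ in range(n)]
loss_src = [(y - x) ** 2 for x, y in zip(x_src, y_src)]


def true_h(x):
    # E[(Y - g(X))^2 | X = x] in this toy setup.
    return x * x + 1.0


def true_p(x):
    # Pr[D = 1 | X = x] with equal source and target sample sizes.
    return 1.0 / (1.0 + math.exp(x - 0.5))


# Correct conditional loss model, wrong (constant) participation model.
dr_wrong_p = doubly_robust_risk(
    loss_src, [0.5] * n,
    [true_h(x) for x in x_src], [true_h(x) for x in x_tgt])

# Wrong (zero) conditional loss model, correct participation model.
dr_wrong_h = doubly_robust_risk(
    loss_src, [true_p(x) for x in x_src], [0.0] * n, [0.0] * n)
```

In this toy setup the target population risk is 3, and both calls estimate it even though each uses one misspecified nuisance function, which is the behavior the double robustness property predicts.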
2 Identification of the target population risk
2.1 Setup, study design, and targets of inference
Let $Y$ be an outcome and $X$ a covariate vector. Let $X^*$ be a vector that contains a subset
of the covariates in $X$, and $g(X^*)$ a prediction model for the conditional expectation of $Y$
given $X^*$ whose performance we are interested in evaluating in a target population of
substantive interest. We consider a setting where we have access to outcome and covariate
data on a sample from the source population $\{(X_i, Y_i) : i = 1, \ldots, n_1\}$ and covariate data
on a separately obtained random sample from the target population $\{X_i : i = 1, \ldots, n_0\}$.
Let $D$ be an indicator of being from the source population (i.e., $D = 1$ if an observation
comes from the source population and $D = 0$ if an observation comes from the target
population). In this setup, the data used to estimate model performance in the target
population are the combined source population sample and target population sample
$\{(X_i, D_i, D_i \times Y_i) : i = 1, \ldots, n = n_1 + n_0\}$. We also assume that the model is fit using
data that are independent of the data used to assess model performance (e.g., the model is an
external model or it is fit using training data from the source population and an independent
set of test data from the source population is used for model assessment).
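The composite data structure described above can be sketched in a few lines. The record type, the function name `build_composite`, and the numeric values are hypothetical and serve only to show how the source and target samples are stacked; a single scalar covariate stands in for the covariate vector.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    x: float   # covariate (a single number here for simplicity)
    d: int     # 1 = source population sample, 0 = target population sample
    dy: float  # D * Y: the outcome in the source sample, 0 in the target sample


def build_composite(source, target_x) -> List[Record]:
    """Stack source (x, y) pairs and target covariates into (X, D, D*Y) records."""
    records = [Record(x, 1, 1 * y) for x, y in source]
    # Outcomes are not observed in the target sample, so D * Y is 0 there.
    records += [Record(x, 0, 0.0) for x in target_x]
    return records


source = [(0.2, 1.1), (1.5, 2.9), (-0.7, -1.3)]  # hypothetical (x, y) pairs
target_x = [0.9, 2.1]                            # hypothetical target covariates

composite = build_composite(source, target_x)
n1 = sum(r.d for r in composite)   # source sample size
n0 = len(composite) - n1           # target sample size
```

Storing the product of the indicator and the outcome makes explicit that outcome information enters any estimator only through source-sample observations.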
We focus on estimation of loss-based measures of model performance because many popular
evaluation measures are loss-based (e.g., mean squared error, Brier loss, and the absolute
loss). A loss function $L(Y, g(X^*))$ measures the discrepancy between the observed outcome