

Robust Estimation of Loss-Based Measures of Model Performance under Covariate Shift

Samantha Morrison1, Constantine Gatsonis1, Issa J. Dahabreh2-4, Bing Li1, and Jon A. Steingrimsson∗1

1Department of Biostatistics, School of Public Health, Brown University, Providence, RI

2CAUSALab, Harvard T.H. Chan School of Public Health, Boston, MA

3Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA

4Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA

Thursday 6th October, 2022

∗Address for correspondence: email: jon steingrimsson@brown.edu.

arXiv:2210.01980v1 [stat.ME] 5 Oct 2022

Abstract

We present methods for estimating loss-based measures of the performance of a prediction model in a target population that differs from the source population in which the model was developed, in settings where outcome and covariate data are available from the source population but only covariate data are available on a simple random sample from the target population. Prior work adjusting for differences between the two populations has used various weighting estimators with inverse odds or density ratio weights. Here, we develop more robust estimators for the target population risk (expected loss) that can be used with data-adaptive (e.g., machine learning-based) estimation of nuisance parameters. We examine the large-sample properties of the estimators and evaluate finite-sample performance in simulations. Last, we apply the methods to data from lung cancer screening using nationally representative data from the National Health and Nutrition Examination Survey (NHANES) and extend our methods to account for the complex survey design of the NHANES.

Keywords: transportability, covariate shift, domain adaptation, MSE, double robustness, weighting

1 Introduction

Ideally, a prediction model should be evaluated using data from the target population where it will be applied, but typically the source data used for model building are not obtained from a random sample of that target population [1] and model performance in the source data may not reflect performance in the target population. A major reason why model performance estimates may not transport to the target population is "covariate shift," that is, the presence of differences in the covariate distribution between the population underlying the source data (i.e., the source population) and the target population [2–4]. Under covariate shift, the conditional distribution of the outcome given the covariates is the same in both the source and target population, but the covariate distributions of the two populations are different (i.e., the populations have a different "case-mix"). Such differences affect model performance as evaluated by measures that are averages of a loss function over the target population distribution (e.g., the mean squared error, MSE; mean absolute error; the Brier score [5]).
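
As a minimal illustration of what "loss-based" means here, the following sketch computes the measures named above as sample averages of a pointwise loss. The function names and toy numbers are illustrative assumptions and do not come from the paper.

```python
# Illustrative sketch: loss-based performance measures as averages of a
# pointwise loss. Function names and data are hypothetical.

def squared_error(y, pred):
    return (y - pred) ** 2

def absolute_error(y, pred):
    return abs(y - pred)

def empirical_risk(loss, ys, preds):
    """Average of loss(y, pred) over paired outcomes and predictions."""
    if len(ys) != len(preds) or not ys:
        raise ValueError("ys and preds must be non-empty and of equal length")
    return sum(loss(y, p) for y, p in zip(ys, preds)) / len(ys)

# Toy binary outcomes with predicted probabilities.
ys = [1, 0, 1, 0]
preds = [0.5, 0.5, 1.0, 0.0]

# With a binary outcome and probability predictions, the mean squared
# error is the Brier score.
mse = empirical_risk(squared_error, ys, preds)
mae = empirical_risk(absolute_error, ys, preds)
```

An average of this kind computed on source data estimates the source population risk; whether it also reflects the target population risk depends on how the covariate distributions compare.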

To understand how covariate shift affects model performance, it is useful to consider "prediction error modifiers": covariates that are associated with prediction error, as assessed by a specific loss function, for a given model [6]. When prediction error modifiers have a different distribution between the source and target population, estimators of loss-based measures of model performance that only use source data are biased for target population model performance [7]. Previous efforts to correct this bias used importance-weighting approaches [8] to re-weight observations in the source data by the ratio of the covariate density of the target and source population or, equivalently, by the inverse of the conditional odds of being from the source population [6], in order to construct asymptotically unbiased estimators of the target population risk (i.e., the expected loss in the target population). The conditional odds (or the covariate density ratio) are almost always unknown and need to be estimated using statistical models. For the importance-weighting estimators to consistently estimate the target population risk, the model for the conditional odds needs to be correctly specified.
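
The re-weighting idea can be sketched on a toy data set. The example below is an assumption-laden illustration, not the estimator as defined in the cited work: it uses a single binary covariate, a hypothetical prediction model `g`, and cell frequencies in place of a fitted model for the conditional odds, which is feasible only for a low-dimensional discrete covariate.

```python
# Illustrative sketch of an inverse-odds weighting estimator of the target
# population MSE. All names and data are hypothetical; the inverse odds are
# estimated by cell counts rather than by a statistical model.
from collections import Counter

def g(x):
    """Hypothetical prediction model under evaluation."""
    return 1.0 + 2.0 * x

# Source sample: covariate and outcome. Target sample: covariate only.
source = [(0, 1.0), (0, 2.0), (0, 0.0), (1, 5.0)]
target_x = [0, 1, 1, 1]

n_source = Counter(x for x, _ in source)
n_target = Counter(target_x)

def inverse_odds(x):
    """Estimate Pr(D=0 | X=x) / Pr(D=1 | X=x) from cell counts."""
    return n_target[x] / n_source[x]

losses = [(y - g(x)) ** 2 for x, y in source]

# Unweighted source-sample MSE (ignores covariate shift).
source_mse = sum(losses) / len(losses)

# Weighted estimator: re-weight source losses by the estimated inverse
# odds and normalize by the target sample size.
weighted_mse = sum(
    inverse_odds(x) * loss for (x, _), loss in zip(source, losses)
) / len(target_x)
```

In this toy data the large loss occurs at the covariate value that is rare in the source sample and common in the target sample, so the weighted estimate is larger than the unweighted source MSE.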

Even when the model for the conditional odds of being from the source population is correctly specified, importance-weighting estimators are inefficient [9]. Furthermore, asymptotically valid inference for importance-weighting estimators requires estimators of the conditional odds to converge at a $\sqrt{n}$ rate [10]. That precludes using data-adaptive estimators (e.g., machine learning estimators) of the density ratio or the conditional odds because such estimators converge at a rate slower than $\sqrt{n}$. Data-adaptive estimators, however, are very appealing in applied work because subject matter knowledge is typically inadequate to determine the correct specification of the conditional odds models, particularly when the covariates that differ in distribution between the source and target population are high-dimensional or have multiple continuous components.

In this paper, we develop a doubly robust estimator for the target population risk that involves estimating both the expected loss conditional on covariates and the probability of participation in the source. Our estimator is consistent for the target population risk if at least one of these models is correctly specified, but not necessarily both, and can be used for asymptotically valid inference even if the models are estimated using methods that converge at a rate slower than $\sqrt{n}$ (i.e., allowing the use of data-adaptive estimation). In the process of developing the doubly robust estimator, we also develop a novel conditional loss modeling-based estimator that relies on estimating the expected loss conditional on covariates. We provide identifiability conditions, identification results, and large-sample properties for the doubly robust estimator. We compare the finite-sample performance of the doubly robust and conditional loss modeling-based estimators against importance-weighting estimators in simulations. Last, we apply the methods to estimate model performance of a prediction model for lung cancer diagnosis built using data from the National Lung Screening Trial (NLST) in a target population of people eligible for lung cancer screening in the US. The sample from the target population comes from the National Health and Nutrition Examination Survey (NHANES), a complex survey that involves multi-stage clustering and variable probability sampling. We show how to modify our estimators to account for the complex sampling design and to incorporate weights that account for the oversampling of certain subgroups, survey non-response, and post-stratification adjustments.
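
The double robustness property can be illustrated with a small sketch. The functional form used below (weighted residuals of a conditional loss model in the source sample, plus predicted conditional losses averaged over the target sample) is a standard augmented construction and is an assumption of this sketch; the estimator itself is defined in later sections of the paper that are not reproduced here. All names and data are hypothetical.

```python
# Illustrative sketch of an augmented (doubly robust style) estimator of
# the target population risk. The functional form is an assumption of this
# sketch; all names and data are hypothetical.

def dr_risk(source_x, source_loss, target_x, odds, cond_loss):
    """
    source_x, source_loss: covariates and observed losses, source sample
    target_x: covariates, target sample
    odds(x): estimate of Pr(D=0 | X=x) / Pr(D=1 | X=x)
    cond_loss(x): estimate of E[loss | X=x, D=1]
    """
    # Weighted residuals of the conditional loss model in the source sample.
    correction = sum(
        odds(x) * (loss - cond_loss(x))
        for x, loss in zip(source_x, source_loss)
    )
    # Predicted conditional losses summed over the target sample.
    plug_in = sum(cond_loss(x) for x in target_x)
    return (correction + plug_in) / len(target_x)

source_x = [0, 0, 0, 1]
source_loss = [0.0, 1.0, 1.0, 4.0]
target_x = [0, 1, 1, 1]

def good_odds(x):
    return {0: 1 / 3, 1: 3.0}[x]    # matches the cell counts above

def bad_odds(x):
    return 1.0                      # deliberately wrong

def good_cond_loss(x):
    return {0: 2 / 3, 1: 4.0}[x]    # source cell means of the loss

def bad_cond_loss(x):
    return 0.0                      # deliberately wrong

both_right = dr_risk(source_x, source_loss, target_x, good_odds, good_cond_loss)
odds_only = dr_risk(source_x, source_loss, target_x, good_odds, bad_cond_loss)
loss_only = dr_risk(source_x, source_loss, target_x, bad_odds, good_cond_loss)
both_wrong = dr_risk(source_x, source_loss, target_x, bad_odds, bad_cond_loss)
```

In this toy example the estimate is the same whenever at least one of the two inputs is right, and it falls back to the unweighted source average when both are wrong. The exact agreement is a feature of the saturated toy setup; in general the property is about consistency in large samples.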

2 Identification of the target population risk

2.1 Setup, study design, and targets of inference

Let $Y$ be an outcome and $X$ a covariate vector. Let $X^*$ be a vector that contains a subset of the covariates in $X$, and $g(X^*)$ a prediction model for the conditional expectation of $Y$ given $X^*$ whose performance we are interested in evaluating in a target population of substantive interest. We consider a setting where we have access to outcome and covariate data on a sample from the source population $\{(X_i, Y_i) : i = 1, \ldots, n_1\}$ and covariate data on a separately obtained random sample from the target population $\{X_i : i = 1, \ldots, n_0\}$. Let $D$ be an indicator of being from the source population (i.e., $D = 1$ if an observation comes from the source population and $D = 0$ if an observation comes from the target population). In this setup, the data used to estimate model performance in the target population are the combined source population sample and target population sample $\{(X_i, D_i, D_i \times Y_i) : i = 1, \ldots, n = n_1 + n_0\}$. We also assume that the model is fit using data that are independent of the data used to assess model performance (e.g., the model is an external model or it is fit using training data from the source population and an independent set of test data from the source population is used for model assessment).
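
The combined data structure described above can be sketched as follows. The type and function names are hypothetical, and a scalar covariate stands in for the covariate vector.

```python
# Illustrative sketch of the stacked data structure (X_i, D_i, D_i * Y_i).
# Names and values are hypothetical.
from typing import List, NamedTuple

class Observation(NamedTuple):
    x: float    # covariate (a scalar here; a vector in general)
    d: int      # 1 = source sample, 0 = target sample
    dy: float   # D * Y: the outcome in the source sample, 0 in the target

def stack_samples(source_xy, target_x) -> List[Observation]:
    source_rows = [Observation(x, 1, 1 * y) for x, y in source_xy]
    # Outcomes are not observed in the target sample, so D * Y is 0 there.
    target_rows = [Observation(x, 0, 0.0) for x in target_x]
    return source_rows + target_rows

data = stack_samples([(0.2, 1.0), (1.5, 3.0)], [0.7, 2.1, 0.4])
n1 = sum(obs.d for obs in data)   # source sample size
n0 = len(data) - n1               # target sample size
```

Storing the product of the indicator and the outcome makes explicit that no outcome information from the target sample enters any estimator built on these rows.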

We focus on estimation of loss-based measures of model performance because many popular evaluation measures are loss-based (e.g., mean squared error, Brier loss, and the absolute loss). A loss function $L(Y, g(X^*))$ measures the discrepancy between the observed outcome
