Robust angle-based transfer learning in high dimensions
Tian Gu1, Yi Han2, and Rui Duan3,†
1Department of Biostatistics, Columbia Mailman School of Public Health, New
York, NY 10032, USA
2Department of Statistics, Columbia University, New York, NY 10027, USA
3Department of Biostatistics, Harvard T.H. Chan School of Public Health,
Boston, MA 02115, USA
†Corresponding author: rduan@hsph.harvard.edu
Abstract
Transfer learning enhances the performance of a target model by leveraging data from
related source populations, a technique particularly beneficial when target data is scarce.
This study addresses the challenge of training high-dimensional regression models with
limited target data, in the context of heterogeneous source populations. We consider a
practical setting where only the parameter estimates of the pre-trained source models
are accessible, instead of the individual-level source data. Under the setting with only
one source model, we propose a novel flexible angle-based transfer learning (angleTL)
method, which leverages the concordance between the source and the target model parameters.
We show that the proposed angleTL is adaptive to the signal strength of
the target model, unifies several benchmark methods by construction, and can prevent
negative transfer when between-population heterogeneity is large. We also provide al-
gorithms to effectively incorporate multiple source models accounting for the fact that
some source models may be more helpful than others. Our high-dimensional asymptotic
analysis provides interpretations and insights on when a source model can be useful for
the target, and demonstrates the superiority of angleTL over other benchmark methods.
We perform extensive simulation studies to validate our theoretical conclusions and show
the feasibility of applying angleTL to transferring existing genetic risk prediction models
across multiple biobanks.
arXiv:2210.12759v4 [stat.ME] 10 Nov 2023
1 Introduction
Insufficient training data presents a critical challenge across various domains. In finance, the scarcity
of comprehensive user credit histories hampers efforts in evaluating individual financial risk and de-
tecting fraud (Teja, 2017). Similarly, in precision medicine, the limited availability of medical
records, especially for studying rare diseases or minority and disadvantaged sub-populations, com-
promises both the fairness and clinical efficacy of treatments (Jia and Shi, 2017; Kim and Milliken,
2019). These constraints highlight an urgent need for innovative approaches that can enhance model
performance in the face of data limitations (Viele et al., 2014; Li and Song, 2020; Yang and Kim,
2020).
When borrowing information from external sources, a major challenge is to account for the
potential data heterogeneity. The external data are likely collected from a different population
with diverse characteristics (Chen et al., 2020) or be historical data where the variable definitions
and measures may change over time (Mansukhani et al., 2019; Mitchell et al., 2021). Many data
integration methods have been proposed to leverage potentially shared information across populations
while addressing data heterogeneity. For example, some methods are based on re-weighting or
re-sampling the source data such that they are more similar to the target (Huang et al., 2006;
Pan et al., 2010; Long et al., 2013); some methods assume there is a unique lower dimensional
representation of features across populations, which can be transferred from the source to the
target (Tzeng et al., 2014; Ganin and Lempitsky, 2015; Sun and Saenko, 2016); some methods
propose to use the target data to calibrate the source models (Girshick et al., 2014; Li et al., 2020;
Gu et al., 2023a). In situations where no data from the target population is available, methods
have been proposed to combine source models aiming for distributional robustness by optimizing
for the worst-case performance, assuming the target population can be represented as either a
single source or a mixture of source populations (Meinshausen and Bühlmann, 2015; Wang et al.,
2023). The performance of many of the aforementioned methods largely depends on whether the
underlying assumptions regarding the similarity between the source and the target populations
hold, which is usually unknown in practice. Therefore, it is desirable to develop methods that are
adaptive to the underlying data heterogeneity, or that at least prevent the case where incorporating the
source information leads to worse model performance than not including it, known as the “negative
transfer” phenomenon (Weiss et al., 2016).
In addition, there might be data-sharing constraints such that external sources cannot share
individual-level data with the target study. Federated or distributed algorithms are proposed to
overcome such data-sharing barriers by sharing only summary-level statistics across studies, many
of which rely on sharing the gradients or higher-order derivatives of objective functions (Duan
et al., 2018, 2020, 2022; Cai et al., 2021; Li et al., 2021) and may require iteratively sharing updated
summary statistics across datasets (Li et al., 2021). However, distributed and federated learning
are less feasible without a collaborative environment or certain infrastructures that enable efficient
computing and timely information sharing. In contrast, pre-trained models from existing studies are
often more accessible. With increasing attention to reproducibility and open science, more journals
require studies to publish their results as supplementary materials or to make them shareable upon
request (Thompson et al., 2006; Roobol et al., 2012). Many platforms allow direct implementation
(Belbasis and Panagiotou, 2022) or validation of fitted models on secure collaborative platforms
such as PheKB (Kirby et al., 2016) and FATE (Liu et al., 2021). There is an increasing need for
data integration methods that can directly leverage fitted models to improve the model performance
of a target study.
Regression models are broadly applied in many fields, due to advantages such as simplicity,
computational efficiency, and interpretability (Hafemeister and Satija, 2019; Wynants et al., 2020;
Wu et al., 2020), and they are also the building blocks of many data analysis pipelines and tools
(Van Buuren et al., 2006; Tan, 2006). In this paper, we consider the problem of incorporating
pre-trained regression models from external sources to help train a target model with limited target
data. For the $i$-th subject in the target data, let $Y_i \in \mathbb{R}$ denote the outcome variable of interest and
$X_i \in \mathbb{R}^p$ denote a set of $p$-dimensional covariates. We consider
$$Y_i = X_i^\top \beta + \epsilon_i, \quad \text{for } i \in \{1, \dots, n\},$$
where $\epsilon_i \in \mathbb{R}$ is the random noise with mean zero and variance $\sigma^2$, $\beta \in \mathbb{R}^p$ is the regression coefficient
of interest, and $n$ is the target sample size. In addition to the target data, we observe a source model
fitted on an external source dataset, where we observe the source estimate $\hat{w} \in \mathbb{R}^p$ of the underlying
regression coefficient $w$ in the source population. In the case where the underlying coefficients $\beta$ and
$w$ show certain similarities, we hope to leverage $\hat{w}$ to assist the estimation of the target parameter
$\beta$.
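To fix ideas, the following minimal NumPy sketch (dimensions, noise levels, and the form of the source coefficient are arbitrary choices for the example, not taken from the paper) simulates target data from this linear model together with a source estimate $\hat{w}$ that is concordant with $\beta$ but differs in scale:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 200, 1.0  # limited target sample, high-dimensional covariates

# Target coefficient beta and a concordant (but rescaled, perturbed) source coefficient w.
beta = rng.normal(size=p) / np.sqrt(p)
w = 2.0 * beta + 0.3 * rng.normal(size=p) / np.sqrt(p)

# Target data: Y_i = X_i' beta + eps_i, i = 1, ..., n.
X = rng.normal(size=(n, p))
Y = X @ beta + sigma * rng.normal(size=n)

# In practice only a (possibly noisy) source estimate w_hat is observed, not w itself.
w_hat = w + 0.05 * rng.normal(size=p)
```

Here $w$ is highly concordant with $\beta$ in direction even though its $L_2$ distance from $\beta$ is large, which is exactly the regime the paper targets.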
Similar problems have been considered in the literature. In a series of transfer learning work,
the source model estimate $\hat{w}$ was incorporated through a regularization of the form $\|\beta - \hat{w}\|_q$,
for some positive constant $q$. For example, Li et al. (2020) proposed the transLASSO algorithm
that leverages the source data by adding a penalty term $\|\beta - \hat{w}\|_1$ when learning $\beta$ using the
target data. Later, transLASSO was extended to generalized linear models (Tian and Yang,
2022), functional linear regression (Lin and Reimherr, 2022), and Q-learning (Chen et al., 2022).
Considering data-sharing constraints, Li et al. (2021) proposed a federated learning approach that
shares gradients and Hessian matrices, and Gu et al. (2023b) proposed a method to incorporate
information from fitted models through a synthetic data approach. Similar ideas are also used in
the multi-task learning literature, where $L_2$-norm-based penalties, i.e., $\|\beta - w\|_2$, are introduced to
leverage the similarities between model parameters (Tian et al., 2022; Duan and Wang, 2022). The
underlying similarity assumption of these methods is that the $L_q$ distance between $\beta$ and $w$ is small,
and we refer to this class of methods as distance-based transfer learning methods.
In real-world applications, distance-based transfer learning methods may be less effective
when $\beta$ and $w$ are highly concordant but their distance is not small. For example, the
source outcome may be defined differently from the target outcome (categorical versus continuous
characterization of the same outcome), or there might be standardization procedures applied to the
source data that are unknown to us. The source model might also be fitted using a different but related
outcome variable correlated with the target outcome (Miglioretti, 2003; Stearns, 2010). To leverage
the concordance between the model parameters, several alternative similarity characterizations have
been proposed. For example, Li et al. (2014) proposed a bivariate ridge regression assuming the $j$-th
entries $(\beta_j, w_j)$ follow a bivariate normal distribution with shared correlation $\rho$ for all $j \in [p]$.
Similarly, Maier et al. (2014) proposed a multi-trait prediction method, assuming multivariate
normally distributed random effects to leverage shared genetic architectures across traits. Qiao
et al. (2019) proposed a method for multi-objective optimization problems. Liang et al. (2020)
proposed a calibrated version of transLASSO, which allows the scales of the model parameters to
differ. However, these methods require individual-level data from the source, and their robustness is
unclear when the level of data heterogeneity is high.
In this paper, we propose an angle-based transfer learning approach, named angleTL, which
leverages the similarity of two regression models through a novel penalization obtained by decoupling
the angle distance between $\beta$ and $w$. As a consequence, angleTL adapts to the signal strength
of the target model and inherently guards against negative transfer, while offering a simpler form both
conceptually and computationally. Our approach unifies the distance-based transfer learning, target-only,
and source-only estimators as specific instances, and is thus guaranteed to have superior performance.
Our high-dimensional asymptotic analysis provides a precise characterization of the prediction risk,
illustrating the bias-variance trade-off and the influence of the signal strengths, the similarity between
source and target, the noise level, and the estimation error of the source on the accuracy of the final
estimator. Given multiple source models, we propose methods to effectively incorporate them toward
better predictive performance in the target population. The proposed methods require only the
parameter estimates of fitted source models, which are more accessible in practice. We perform
extensive simulation studies to validate our theoretical conclusions and evaluate the performance of
angleTL by training genetic risk models for low-density lipoprotein (LDL) cholesterol using data from
multiple large-scale biobanks.
2 Angle-based transfer learning
Let $Y \in \mathbb{R}^n$ denote the outcome variable of interest and $X \in \mathbb{R}^{n \times p}$ denote a set of $p$-dimensional
covariates in the target data of size $n$. Without borrowing information from the source, we can
obtain a target-only estimator of $\beta$ through a ridge regression,
$$\tilde{\beta}_{\lambda} = \arg\min_{\beta} \frac{1}{n}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_2^2, \qquad (1)$$
where $\lambda$ is a tuning parameter. The penalty on $\|\beta\|_2^2$ helps reduce the overall mean squared error
of the estimator in the high-dimensional setting (Hastie et al., 2009).
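As a concrete sketch of Eq. (1) (simulated data; the function name and dimensions are our own choices), the ridge estimator can be computed directly from its first-order condition:

```python
import numpy as np

def ridge(X, Y, lam):
    """Target-only ridge estimator minimizing (1/n)||Y - X b||^2 + lam * ||b||^2."""
    n, p = X.shape
    # First-order condition: (X'X + n*lam*I_p) b = X'Y.
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ Y)

rng = np.random.default_rng(1)
n, p = 40, 100  # p > n: the ridge penalty makes the problem well-posed
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + rng.normal(size=n)
beta_ridge = ridge(X, Y, lam=0.5)
```

The $n$ multiplying $\lambda$ in the linear system simply accounts for the $1/n$ factor on the loss in Eq. (1).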
Suppose that we also observe $\hat{w} \in \mathbb{R}^p$, the parameter estimates of a source model fitted on an
external source dataset with a sample size potentially much larger than $n$. Due to data heterogeneity,
the two underlying regression coefficients, $\beta$ and $w$, may not be the same but may share certain
similarities, and hence $\hat{w}$ can be used to guide the estimation of $\beta$. Following a series of recent
transfer learning methods (Li et al., 2022, 2021; Gu et al., 2023b), we define the $L_2$-distance-based
transfer learning estimator (distTL) to be
$$\check{\beta}_{\lambda_d} = \arg\min_{\beta} \frac{1}{n}\|Y - X\beta\|_2^2 + \lambda_d\|\beta - \hat{w}\|_2^2, \qquad (2)$$
where $\lambda_d$ is a tuning parameter. Imposing a distance-based penalty, $\|\beta - \hat{w}\|_2$, is equivalent to
adding a constraint $\|\beta - \hat{w}\|_2 \leq h$, which reduces the parameter space to an $L_2$ ball centered at $\hat{w}$,
as shown in panel A of Fig. 1. Anchoring on the source estimator $\hat{w}$, distTL encourages estimators
closer to $\hat{w}$ while allowing for the calibration of potential differences.
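The distTL estimator in Eq. (2) likewise admits a ridge-type closed form, shrinking toward $\hat{w}$ rather than toward zero. A minimal sketch (simulated data; function name and dimensions are ours):

```python
import numpy as np

def dist_tl(X, Y, w_hat, lam_d):
    """distTL estimator of Eq. (2):
    minimize (1/n)||Y - X b||^2 + lam_d * ||b - w_hat||^2."""
    n, p = X.shape
    # First-order condition: (X'X + n*lam_d*I_p) b = X'Y + n*lam_d*w_hat,
    # i.e. ridge regression anchored at w_hat instead of at zero.
    return np.linalg.solve(X.T @ X + n * lam_d * np.eye(p),
                           X.T @ Y + n * lam_d * w_hat)

rng = np.random.default_rng(2)
n, p = 30, 60
w_hat = rng.normal(size=p)
X = rng.normal(size=(n, p))
Y = X @ w_hat + rng.normal(size=n)

# As lam_d grows, the estimator is pulled toward the source estimate w_hat.
b_weak = dist_tl(X, Y, w_hat, lam_d=0.1)
b_strong = dist_tl(X, Y, w_hat, lam_d=1e6)
```

With $\lambda_d \to \infty$ the estimator collapses onto $\hat{w}$ (pure source), while $\lambda_d \to 0$ removes the anchoring entirely, matching the "calibration" interpretation above.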
As introduced earlier, in real-world applications, it is possible that $\beta$ and $w$ are concordant to
some degree, but $\|\beta - w\|_2$ is not small. In such cases, distTL may be less effective and we may
Figure 1: Geometric illustration of the distance-based similarity characterization (A); the angle-based
characterization (B); and the situation where distance-based transfer learning has the same
predictive risk as the proposed method (C).
consider a more general characterization of the similarity using the angle distance, characterized
by $\sin\Theta(w, \beta) = \sqrt{1 - \frac{(w^\top\beta)^2}{\|w\|_2^2\|\beta\|_2^2}}$. As shown in panel B of Fig. 1, if the target model parameter
$\beta$ is restricted by a constraint that $\sin\Theta(\beta, w) \leq d$, the source model parameter $w$ can provide
directional information about $\beta$ and reduce the parameter space of $\beta$ to a cone. When $w$ and $\beta$ are
close in $L_2$ distance, the angle between the two vectors is also small, while the converse need not hold.
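The angle distance above is scale-free, unlike the $L_2$ distance. A small numerical illustration of this asymmetry (NumPy; the vectors are illustrative choices of ours):

```python
import numpy as np

def sin_theta(u, v):
    """Angle distance sin(Theta(u, v)) = sqrt(1 - (u'v)^2 / (||u||_2^2 ||v||_2^2))."""
    cos_sq = (u @ v) ** 2 / ((u @ u) * (v @ v))
    return np.sqrt(max(0.0, 1.0 - cos_sq))  # clip tiny negatives from rounding

beta = np.array([1.0, 2.0, 3.0])
w = 10.0 * beta + np.array([0.0, 0.0, 0.1])  # nearly proportional to beta

l2_dist = np.linalg.norm(w - beta)  # large: roughly 9 * ||beta||_2
angle = sin_theta(w, beta)          # tiny: the direction is almost identical
```

This is precisely the regime motivating angleTL: $\|\beta - w\|_2$ is large (so distance-based penalties borrow little), yet $w$ carries nearly perfect directional information about $\beta$.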
Instead of directly penalizing $\sin\Theta(\hat{w}, \beta) = \sqrt{1 - \frac{(\hat{w}^\top\beta)^2}{\|\hat{w}\|_2^2\|\beta\|_2^2}}$, which may lead to computational
complexity, we consider an alternative angle-based transfer learning estimator defined as
$$\hat{\beta}_{\lambda,\eta} = \arg\min_{\beta} \frac{1}{n}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 - 2\eta\,\hat{w}^\top\beta, \qquad (3)$$
where $\lambda$ and $\eta$ are tuning parameters. To see the connection between the proposed penalty terms
and the $\sin\Theta$ distance, without loss of generality, we can take $\|\hat{w}\|_2^2 = 1$. Due to the duality
of constrained optimization and penalization, penalizing $\sin\Theta(\hat{w}, \beta)$ is equivalent to imposing a
constraint $\sqrt{1 - \frac{(\hat{w}^\top\beta)^2}{\|\beta\|_2^2}} < a$ for some $a$, which can be written as $\frac{\hat{w}^\top\beta}{\|\beta\|_2} > \sqrt{1 - a^2}$. While this
constraint controls the ratio $\frac{\hat{w}^\top\beta}{\|\beta\|_2}$, we propose to decouple its numerator and denominator,
introducing two separate constraints, $\hat{w}^\top\beta > b$ (equivalent to $-\hat{w}^\top\beta < -b$) and $\|\beta\|_2 < c$
(equivalent to $\|\beta\|_2^2 < c^2$). While this decoupling provides computational simplicity and efficiency,
the set $\{\beta : \hat{w}^\top\beta > b, \|\beta\|_2 < c\}$ may not necessarily include, or be included in, the set
$\{\beta : \sqrt{1 - (\hat{w}^\top\beta)^2/\|\beta\|_2^2} < a\}$ when $b$ and $c$ are two freely adjustable tuning parameters. So the
proposed penalization is neither equivalent to, nor a relaxation of, the $\sin\Theta$ penalty.
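To make the decoupled penalization concrete, here is a minimal sketch (simulated data; function name and dimensions are ours) of the estimator in Eq. (3), obtained by setting the gradient of the objective to zero:

```python
import numpy as np

def angle_tl(X, Y, w_hat, lam, eta):
    """angleTL estimator of Eq. (3):
    minimize (1/n)||Y - X b||^2 + lam*||b||^2 - 2*eta*(w_hat' b)."""
    n, p = X.shape
    # First-order condition gives a ridge-type closed form:
    # (X'X + n*lam*I_p) b = X'Y + n*eta*w_hat.
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p),
                           X.T @ Y + n * eta * w_hat)

rng = np.random.default_rng(3)
n, p = 30, 80
beta = rng.normal(size=p) / np.sqrt(p)
w_hat = 3.0 * beta  # concordant source estimate on a different scale
X = rng.normal(size=(n, p))
Y = X @ beta + 0.5 * rng.normal(size=n)

b_hat = angle_tl(X, Y, w_hat, lam=0.5, eta=0.1)
```

Setting $\eta = 0$ recovers the target-only ridge estimator of Eq. (1), while larger $\eta$ tilts the solution toward the direction of $\hat{w}$.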
The proposed penalization has several advantages. First, given $\lambda$ and $\eta$, Equation (3) has a
closed-form solution $\hat{\beta}_{\lambda,\eta} = (X^\top X + n\lambda I_p)^{-1}(X^\top Y + n\eta\hat{w})$, which ensures computational
efficiency. Second, it is adaptive to the signal strength of the target parameter. Specifically, when
$\|\beta\|_2$ is large, the constraint $\hat{w}^\top\beta > b$ can be satisfied by $\beta$'s with a large $\sin\Theta$ distance from $\hat{w}$.
In this scenario, the angle constraint is relatively minimal, and less information is borrowed from the
source. Conversely, when $\|\beta\|_2$ is small, there is a tighter constraint on the angle distance, which
results in borrowing more information from the source. Intuitively, with a sufficiently strong signal,