Hierarchical Bayes estimation of small area means using statistical linkage of disparate data sources

Soumojit Das1 and Partha Lahiri2

1University of Maryland, College Park, MD, USA, soumojit@umd.edu

2University of Maryland, College Park, MD, USA, plahiri@umd.edu

Abstract

We propose a Bayesian approach to estimate finite population means for small areas. The proposed methodology improves on traditional sample survey methods because, unlike those methods, it borrows strength from multiple data sources. Our approach is fundamentally different from the existing small area Bayesian approach to finite population sampling, which typically assumes a hierarchical model for all units of the finite population. We assume such a model only for the units of the finite population in which the outcome variable is observed, because for these units the assumed model can be checked using existing statistical tools. Modeling unobserved units of the finite population is challenging because the assumed model cannot be checked in the absence of data on the outcome variable. To make reasonable modeling assumptions, we propose to form several cells for each small area using factors that potentially influence the outcome variable of interest. This strategy is expected to bring some degree of homogeneity within a given cell and also among cells from different small areas that are constructed with the same factor level combination. Instead of modeling true probabilities for unobserved individual units, we assume that population means of cells with the same combination of factor levels are identical across small areas and the population mean of true probabilities for a cell is identical to the mean of true values for the observed units in that cell. We apply our proposed methodology to a real-life COVID-19 survey, linking information from multiple disparate data sources to estimate vaccine-hesitancy rates (proportions) for 50 US states and Washington, D.C. (small areas). We also provide practical ways of model selection that can be applied to a wider class of models under similar settings but for a diverse range of scientific problems.

Keywords: Administrative Data, Finite Population sampling, Informative sampling, Hierarchical Bayes, Multiple surveys, Multi-level Modeling, Nonprobability surveys, Synthetic estimation, Robust estimation.

arXiv:2210.04980v2 [stat.ME] 30 Apr 2023

1 Introduction

Ericson (1969) laid the foundation for the subjective Bayesian approach to finite population sampling. In this approach, the entire matrix of characteristics for all units of the finite population can be viewed as the finite population parameter matrix. In practice, a function (e.g., finite population mean or proportion of a characteristic of interest) or a vector of functions (e.g., finite population means or proportions of several characteristics) of this finite population parameter matrix is considered for inference. Using a subjective prior on the finite population parameter matrix, inferences on the finite population parameter(s) of interest can be drawn using the posterior predictive distribution of the unobserved units of the finite population given the observed sample.
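As a brief illustration of this predictive idea, consider a single characteristic $y_i$ for units $i = 1, \dots, N$, and let $s$ denote the set of sampled units. This notation is introduced here only for illustration and is not taken from the paper. The finite population mean splits into an observed part and an unobserved part, so its posterior mean requires predictions only for the unobserved units:

```latex
\bar{Y} = \frac{1}{N}\Big(\sum_{i \in s} y_i + \sum_{i \notin s} y_i\Big),
\qquad
E\big(\bar{Y} \mid y_s\big)
  = \frac{1}{N}\Big(\sum_{i \in s} y_i + \sum_{i \notin s} E\big(y_i \mid y_s\big)\Big),
```

where $y_s = \{y_i : i \in s\}$ is the observed sample and each $E(y_i \mid y_s)$ is taken under the posterior predictive distribution.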

Using this basic idea of Ericson (1969), several papers were written on the estimation of finite population parameters. The methodology developed can be used to solve problems in small area estimation, repeated surveys, and other important applications. The papers can be broadly classified as Empirical Bayesian or EB (e.g., Ghosh and Meeden (1986), Ghosh and Lahiri (1987), Ghosh et al. (1989), Nandram and Sedransk (1993), Arora et al. (1997), among others) and Hierarchical Bayesian or HB (e.g., Ghosh and Lahiri (1992), Datta and Ghosh (1995), Ghosh and Meeden (1997), Malec et al. (1997), Little (2004), Chen et al. (2012), Ghosh (2009), Liu and Lahiri (2017), Nandram et al. (2018), Ha and Sedransk (2019), and others). In an empirical Bayesian approach, hyperparameters are estimated using a classical method (e.g., maximum likelihood). In contrast, in the hierarchical Bayesian approach, priors, usually noninformative or weakly informative, are put on the hyperparameters.
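The distinction can be sketched with a generic two-stage model. The symbols below ($\theta$ for the unknown quantities, $\lambda$ for the hyperparameters, $f$, $g$, and $\pi$ for the densities) are illustrative assumptions and do not correspond to any specific model in the papers cited above:

```latex
y \mid \theta \sim f(y \mid \theta), \qquad \theta \mid \lambda \sim g(\theta \mid \lambda).

\text{EB:}\quad \text{inference is based on } p\big(\theta \mid y, \hat{\lambda}\big),
  \quad \hat{\lambda} \text{ estimated from the marginal distribution of } y.

\text{HB:}\quad \lambda \sim \pi(\lambda), \qquad
  p(\theta \mid y) = \int p(\theta \mid y, \lambda)\, \pi(\lambda \mid y)\, d\lambda .
```

Under this sketch, the EB route conditions on a point estimate of the hyperparameters, whereas the HB route averages over their posterior distribution and so carries the uncertainty about $\lambda$ into the inference on $\theta$.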

The greater accessibility of administrative and Big Data and advances in technology are now providing new opportunities for researchers to solve a wide range of problems that could not be solved using a single data source. However, these databases are often unstructured and are available in disparate forms, making inferences using data linkages quite challenging. There is, therefore, a growing need to develop innovative statistical data linkage tools to link such complex multiple datasets. Using only one primary survey to answer scientific questions about the whole population or large geographical areas or subpopulations may be effective and reliable, but using it for smaller domains or small areas can often lead to unreliable estimation and unrealistic measures of uncertainty. Using an appropriate statistical model to combine information from multiple data sources, one can often obtain reliable estimates for small areas. Good reviews of different approaches to small area estimation can be found in Jiang (2007), Datta (2009), Pfeffermann (2013), Rao and Molina (2015), and Ghosh (2020), among others.
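To see why a single survey can be adequate nationally yet unreliable for small areas, the following minimal sketch compares the standard error of a sample proportion at two sample sizes. It assumes simple random sampling with no finite population correction or design effect, which is a simplification and not the design of any survey discussed here; the function name and all numbers are hypothetical.

```python
import math


def srs_proportion_se(p, n):
    """Standard error of a sample proportion under simple random sampling.

    Assumption: no finite population correction and no design effect.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("proportion must lie in [0, 1]")
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical sample sizes: a national sample versus one small area.
national_se = srs_proportion_se(0.2, 10_000)
small_area_se = srs_proportion_se(0.2, 25)
```

Because the standard error scales with one over the square root of the sample size, the small-area value in this toy setting is twenty times the national one.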

Scott (1977), Pfeffermann and Sverchkov (1999), Bonnéry et al. (2012), and others discussed the concept of informative sampling, under which the distribution of the sample could be markedly different from the one assumed for the finite population even after conditioning on related auxiliary variables. Most of the papers cited in the preceding paragraphs assume non-informative sampling, so the distribution of the sample is assumed to be identical to the distribution assumed on the finite population. This could be a strong assumption in many applications. The informative sampling approach suggested by Pfeffermann and Sverchkov (1999) and Pfeffermann and Sverchkov (2007) is one possible solution, but this approach requires additional modeling of survey weights. Rubin (1983) suggested modeling the finite population given the inclusion probabilities (or, equivalently, basic weights) for making inferences about the finite population parameters of interest. However, in practice, inclusion probabilities for all finite population units or detailed design information may not be available. Also, survey data may contain information only on the final weights for the respondents, which incorporate nonresponse and calibration adjustments. Verret et al. (2015) proposed an alternative approach in which inclusion probabilities are used to augment the sample model. Unlike Rubin (1983), their approach needs weights only for the sample. Both Pfeffermann and Sverchkov (2007) and Verret et al. (2015) considered empirical best linear unbiased prediction (EBLUP) of small area means.
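A toy example may help show why weights matter when sampling is informative. The sketch below compares an unweighted sample mean with a weighted (Hajek-type) mean that needs weights only for sampled units. It is a generic illustration and not the augmented model of Verret et al. (2015) or the method proposed in this paper; the function name and the data are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted (Hajek-type) mean: sum(w * y) / sum(w)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * y for y, w in zip(values, weights)) / total_weight


# Hypothetical respondents: units with y = 1 were over-sampled,
# so each of them carries a small weight.
y = [1, 1, 1, 0]
w = [10, 10, 10, 70]

unweighted = sum(y) / len(y)    # treats the sample as if it mirrored the population
weighted = weighted_mean(y, w)  # down-weights the over-sampled units
```

In this made-up sample the unweighted mean is 0.75 while the weighted mean is 0.30, which illustrates how far the sample distribution can sit from the population distribution when inclusion depends on the outcome.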

In this paper, we propose a hierarchical Bayesian approach to estimate finite population means for small areas. Our proposed method combines information from multiple disparate data sources such as probability surveys, non-probability surveys, administrative data, census records, social media big data, and/or any other sources of relevant information. We propose to reduce the degree of informativeness in sampling by incorporating important auxiliary variables that potentially affect the outcome variable of interest. Following Verret et al. (2015), we also add the survey weights in our model. Our approach differs inherently from the existing small area Bayesian approach to finite population sampling, which typically assumes a hierarchical model for all units of the finite population. We assume such an elaborate model only for the units of the finite population in which the outcome variable was observed. The reason is intuitive: for these units, the model can be checked using existing statistical tools.

To make reasonable modeling assumptions, we propose to form several cells for each of the small areas using factors that potentially influence our outcome variable of interest. The number of cells is carefully chosen so that the cell population sizes can be obtained from reliable sources (e.g., population projection data). This strategy is expected to bring some degree of homogeneity within a given cell and also among cells from different small areas that are constructed with the same factor level combination. However, unlike Multilevel Regression and Poststratification (MRP; Gelman and Little, 1997), our cell construction will not achieve full homogeneity either in terms of the outcome variable of interest or survey weights. We stress, though, that achieving full homogeneity in the outcome variable and weights is likely to require numerous cells for which reliable population size data may not be available from population projection data, so that one may have to rely on unstable survey data for small cells.

In contrast to the usual modeling approaches under similar scenarios (e.g., MRP), we do not assume an exchangeable model within a cell or an elaborate model for unobserved individual units, because the assumed model cannot be checked in the absence of data on the outcome variable. Instead, drawing inspiration from synthetic methods for small areas, we assume that population means of cells with the same combination of factor levels are identical across small areas and the population mean of true probabilities for a cell is identical to the mean of true values for the observed units in that cell. Since the sample size in a given area is small, many such cells are unrepresented in the sample. If a sample for that cell can be found from other areas, the assumed model is used to produce a hierarchical Bayes synthetic estimate of the finite population mean of that cell for the area. This assumes that the observations within a given cell and area are similar to those for the cell from other areas. If the cell is unrepresented in all areas, the cell mean can be predicted from the population model, but to simplify the methodology and to induce a greater degree of robustness, we can simply ignore that cell when the contribution from that cell is negligible, as in our illustrative example. The basic idea is to make the proposed methodology robust against possible violation of the assumed model.
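The aggregation step implied by this paragraph can be sketched as a population-weighted average of cell means that are shared across areas. The code below is only a minimal illustration under stated assumptions: the function name, cell labels, counts, and cell means are hypothetical, and renormalizing over the remaining cells is one possible reading of "ignoring" a cell that is unrepresented in all areas, not necessarily the authors' exact procedure. In a hierarchical Bayes analysis the same computation could be applied to each posterior draw of the cell means.

```python
def synthetic_area_mean(cell_pop_sizes, cell_means):
    """Population-weighted synthetic estimate of one area's mean.

    cell_pop_sizes: dict mapping cell label -> population count in this area.
    cell_means: dict mapping cell label -> cell mean assumed common to all
        areas; contains only cells observed in at least one area.
    Assumption: cells absent from cell_means are dropped and the remaining
    population shares are renormalized.
    """
    usable = {c: n for c, n in cell_pop_sizes.items()
              if c in cell_means and n > 0}
    total = sum(usable.values())
    if total == 0:
        raise ValueError("no usable cells for this area")
    return sum(n * cell_means[c] for c, n in usable.items()) / total


# Hypothetical cell means pooled across areas (cell "C" was never observed).
pooled_cell_means = {"A": 0.30, "B": 0.20}

# Hypothetical population counts for one small area.
area_counts = {"A": 600, "B": 390, "C": 10}

estimate = synthetic_area_mean(area_counts, pooled_cell_means)
```

With these made-up inputs the estimate is (600 × 0.30 + 390 × 0.20) / 990, about 0.26; cell "C" holds roughly 1% of the area's population, so dropping it changes the result very little.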

As an application of the proposed methodology, we use a real-life example that combines a COVID-19 probability survey representing the entire US adult population, a non-probability survey representing only active US adult Facebook users, Census Bureau estimates of adult population counts at granular levels, and data from an independent COVID-19 data reporting website to estimate the vaccine hesitancy rates (proportions) for the US states and the District of Columbia (small areas). Through this application, we will demonstrate the problems with regular design-based estimates when used for small areas and how our methodology may be employed to get more robust estimates along with stable measures of uncertainty.

We now give an outline for the rest of the paper. In section 2, we describe our methodology in detail. In section 3, we describe an application of our method to a real
