Hierarchical Bayes estimation of small area means using statistical linkage of disparate data sources
Soumojit Das1 and Partha Lahiri2

1University of Maryland, College Park, MD, USA, soumojit@umd.edu
2University of Maryland, College Park, MD, USA, plahiri@umd.edu
arXiv:2210.04980v2 [stat.ME] 30 Apr 2023
Abstract
We propose a Bayesian approach to estimate finite population means for small
areas. The proposed methodology improves on traditional sample survey methods
because, unlike those methods, it borrows strength from multiple data sources.
Our approach is fundamentally different from the existing small area Bayesian
approach to finite population sampling, which typically assumes a hierarchical
model for all units of the finite population. We assume such a model only for
the units of the finite population in which the outcome variable is observed
because, for these units, the assumed model can be checked using existing
statistical tools. Modeling unobserved units of the finite population is
challenging because the assumed model cannot be checked in the absence of data
on the outcome variable. To make reasonable modeling assumptions, we propose to
form several cells for each small area using factors that potentially influence
the outcome variable of interest. This strategy is expected to bring some degree
of homogeneity within a given cell and also among cells from different small
areas that are constructed with the same factor level combination. Instead of
modeling true probabilities for unobserved individual units, we assume that
population means of cells with the same combination of factor levels are
identical across small areas and the population mean of true probabilities for a
cell is identical to the mean of true values for the observed units in that
cell. We apply our proposed methodology to a real-life COVID-19 survey, linking
information from multiple disparate data sources to estimate vaccine-hesitancy
rates (proportions) for 50 US states and Washington, D.C. (small areas). We also
provide practical ways of model selection that can be applied to a wider class
of models under similar settings but for a diverse range of scientific problems.
Keywords: Administrative Data, Finite Population sampling, Informative sampling,
Hierarchical Bayes, Multiple surveys, Multi-level Modeling, Nonprobability surveys,
Synthetic estimation, Robust estimation.
1 Introduction
Ericson (1969) laid the foundation of the subjective Bayesian approach to finite
population sampling. In this approach, the entire matrix of characteristics for
all units of the finite population can be viewed as the finite population
parameter matrix. In practice, a function (e.g., finite population mean or
proportion of a characteristic of interest) or a vector of functions (e.g.,
finite population means or proportions of several characteristics) of this
finite population parameter matrix is considered for inference. Using a
subjective prior on the finite population parameter matrix, inferences on the
finite population parameter(s) of interest can be drawn using the posterior
predictive distribution of the unobserved units of the finite population given
the observed sample.
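To make the role of the posterior predictive distribution concrete, the finite population mean can be split into an observed and an unobserved part. The notation below (population size N, sample s, unit values y_i, population mean \bar{Y}) is introduced here for illustration only and is not taken from the paper.

```latex
\bar{Y} \;=\; \frac{1}{N}\Big(\sum_{i \in s} y_i \;+\; \sum_{i \notin s} y_i\Big),
\qquad
E\big(\bar{Y} \mid y_s\big) \;=\; \frac{1}{N}\Big(\sum_{i \in s} y_i \;+\; \sum_{i \notin s} E\big(y_i \mid y_s\big)\Big).
```

Only the second sum is uncertain once the sample is observed, so inference about \bar{Y} reduces to predicting the unobserved y_i from their posterior predictive distribution.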
Using this basic idea of Ericson (1969), several papers were written on the estimation of
finite population parameters. The methodology developed can be used to solve problems
in small area estimation, repeated surveys, and other important applications. The papers
can be broadly classified as Empirical Bayesian or EB (e.g., Ghosh and Meeden (1986),
Ghosh and Lahiri (1987), Ghosh et al. (1989), Nandram and Sedransk (1993), Arora et al.
(1997), among others) and Hierarchical Bayesian or HB (e.g., Ghosh and Lahiri (1992),
Datta and Ghosh (1995), Ghosh and Meeden (1997), Malec et al. (1997), Little (2004),
Chen et al. (2012), Ghosh (2009), Liu and Lahiri (2017), Nandram et al. (2018), Ha
and Sedransk (2019), and others). In an empirical Bayesian approach, hyperparameters
are estimated using a classical method (e.g., maximum likelihood). In contrast, in the
hierarchical Bayesian approach, priors – usually noninformative or weakly informative –
are put on the hyperparameters.
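The contrast between the two approaches can be illustrated with a minimal sketch on a toy area-level normal model, y_i | theta_i ~ N(theta_i, D_i) and theta_i ~ N(mu, A), with known sampling variances D_i. This is not the model used in this paper. The data, the simple moment estimator of A, and the uniform prior on A over (0, A_max) evaluated on a grid are all assumptions made for illustration.

```python
import math

def weighted_mean(y, D, A):
    """Weighted least squares estimate of mu for a given between-area variance A."""
    w = [1.0 / (A + d) for d in D]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def shrink(y, D, A):
    """E[theta_i | y, A] under a flat prior on mu: shrink y_i by B_i = D_i / (A + D_i)."""
    mu = weighted_mean(y, D, A)
    return [yi - d / (A + d) * (yi - mu) for yi, d in zip(y, D)]

def eb_estimates(y, D):
    """Empirical Bayes: plug in a moment estimate of A (truncated at zero)."""
    m = len(y)
    ybar = sum(y) / m
    s2 = sum((yi - ybar) ** 2 for yi in y) / (m - 1)
    A_hat = max(0.0, s2 - sum(D) / m)
    return shrink(y, D, A_hat), A_hat

def log_post_A(y, D, A):
    """log p(A | y) up to a constant, with flat priors on mu and A."""
    mu = weighted_mean(y, D, A)
    v = [A + d for d in D]
    return (-0.5 * sum(math.log(vi) for vi in v)
            - 0.5 * math.log(sum(1.0 / vi for vi in v))
            - 0.5 * sum((yi - mu) ** 2 / vi for yi, vi in zip(y, v)))

def hb_estimates(y, D, A_max=50.0, n_grid=2000):
    """Hierarchical Bayes: average the shrinkage estimate over the posterior of A."""
    grid = [A_max * (k + 0.5) / n_grid for k in range(n_grid)]
    logp = [log_post_A(y, D, A) for A in grid]
    top = max(logp)
    w = [math.exp(lp - top) for lp in logp]
    total = sum(w)
    w = [wi / total for wi in w]  # normalized posterior weights on the grid
    est = [0.0] * len(y)
    for A, wA in zip(grid, w):
        for i, t in enumerate(shrink(y, D, A)):
            est[i] += wA * t
    return est, w

# Hypothetical direct estimates and known sampling variances for five areas.
y = [12.0, 8.5, 15.2, 9.8, 11.1]
D = [4.0, 1.0, 9.0, 2.0, 0.5]

eb, A_hat = eb_estimates(y, D)
hb, post_w = hb_estimates(y, D)
```

The empirical Bayes version treats the estimated A as if it were known, while the hierarchical Bayes version averages over the posterior uncertainty in A; the two sets of point estimates are typically close, but their implied measures of uncertainty can differ.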
The greater accessibility of administrative and Big Data and advances in
technology are now providing new opportunities for researchers to solve a wide
range of problems that would not be possible using a single data source.
However, these databases are often unstructured and are available in disparate
forms, making inferences using data linkages quite challenging. There is,
therefore, a growing need to develop innovative statistical data linkage tools
to link such complex multiple datasets. Using only one primary survey to answer
scientific questions about the whole population or large geographical areas or
subpopulations may be effective and reliable, but using it for smaller domains
or small areas can often lead to unreliable estimates and unrealistic measures
of uncertainty. Using an appropriate statistical model to combine information
from multiple data sources, one can often obtain reliable estimates for small
areas. A good review of different approaches to small area estimation can be
found in Jiang (2007), Datta (2009), Pfeffermann (2013), Rao and Molina (2015),
Ghosh (2020), among others.
Scott (1977), Pfeffermann and Sverchkov (1999), Bonnéry et al. (2012), and
others discussed the concept of informative sampling under which the
distribution of the sample could be markedly different from the one assumed for
the finite population even after conditioning on related auxiliary variables.
Most of the papers cited in the preceding paragraphs assume non-informative
sampling, so the distribution of the sample is assumed to be identical to the
distribution assumed on the finite population. This could be a strong assumption
in many applications. The informative sampling approach suggested by Pfeffermann
and Sverchkov (1999) and Pfeffermann and Sverchkov (2007) is one possible
solution, but this approach requires additional modeling of survey weights.
Rubin (1983) suggested modeling the finite population given the inclusion
probabilities (or, equivalently, basic weights) for making inferences about the
finite population parameters of interest. However, in practice, inclusion
probabilities for all finite population units or detailed design information may
not be available. Also, survey data may contain information only on the final
weights for the respondents that incorporate nonresponse and calibration. Verret
et al. (2015) proposed an alternative approach in which inclusion probabilities
are used to augment the sample model. Unlike Rubin (1983), their approach needs
weights only for the sample. Both Pfeffermann and Sverchkov (2007) and Verret
et al. (2015) considered empirical best linear unbiased prediction (EBLUP) of
small area means.
In this paper, we propose a hierarchical Bayesian approach to estimate finite
population means for small areas. Our proposed method combines information from
multiple disparate data sources such as probability surveys, non-probability
surveys, administrative data, census records, social media big data, and any
other sources of relevant information. We propose to reduce the degree of
informativeness in sampling by incorporating important auxiliary variables that
potentially affect the outcome variable of interest. Following Verret et al.
(2015), we also add the survey weights in our model. Our approach differs
inherently from the existing small area Bayesian approach to finite population
sampling, which typically assumes a hierarchical model for all units of the
finite population. We assume such an elaborate model only for the units of the
finite population in which the outcome variable was observed. The reason is
intuitive: for these units, the model can be checked using existing statistical
tools.
To make reasonable modeling assumptions, we propose to form several cells for
each of the small areas using factors that potentially influence our outcome
variable of interest. The number of cells is carefully chosen, so the cell
population sizes can be obtained from reliable sources (e.g., population
projection data). This strategy is expected to bring some degree of homogeneity
within a given cell and also among cells from different small areas that are
constructed with the same factor level combination. However, unlike Multilevel
Regression and Poststratification (MRP; Gelman and Little, 1997), our cell
construction will not achieve full homogeneity either in terms of the outcome
variable of interest or survey weights. But we stress that achieving full
homogeneity in the outcome variable and weights is likely to require numerous
cells for which reliable population size data may not be available from
population projection data, and one may have to rely on some unstable survey
data for small cells.
In contrast to the usual modeling approaches under similar scenarios (e.g., MRP), we
do not assume an exchangeable model within a cell or an elaborate model for unobserved
individual units because the assumed model cannot be checked in the absence of data
on the outcome variable. Instead, drawing inspiration from synthetic methods for small
areas, we assume that population means of cells with the same combination of factor
levels are identical across small areas and the population mean of true probabilities for
a cell is identical to the mean of true values for the observed units in that cell. Since
the sample size in a given area is small, many such cells are unrepresented in the sample.
If a sample for that cell can be found from other areas, the assumed model is used to
produce a hierarchical Bayes synthetic estimate of the finite population mean of that
cell for the area. This assumes that the observations within a given cell and area are
similar to those for the cell from other areas. If the cell is unrepresented in all areas, the
cell mean can be predicted from the population model, but to simplify the methodology
and to induce a greater degree of robustness, we can simply ignore that cell when the
contribution from that cell is negligible as in our illustrative example. The basic idea
is to make the proposed methodology robust against possible violation of the assumed
model.
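The synthetic step described above can be sketched as a population-weighted average of cell means that are shared across areas. This is a minimal illustration, not the paper's hierarchical Bayes estimator: the cell means are taken as given inputs, the cell labels, area names, and counts are hypothetical, and renormalizing over the covered cells when a cell is ignored is an assumption made here.

```python
def synthetic_area_means(cell_means, pop_counts):
    """Population-weighted synthetic estimate for each area.

    cell_means: {cell: estimated mean}, shared by all areas; a cell that is
                unrepresented in every area is simply absent from this dict.
    pop_counts: {area: {cell: population count}} from an external source.
    """
    estimates = {}
    for area, counts in pop_counts.items():
        # Keep only cells with an available estimate (assumption: drop the
        # rest and renormalize, which is reasonable only if they are negligible).
        covered = {c: n for c, n in counts.items() if c in cell_means}
        total = sum(covered.values())
        estimates[area] = sum(n * cell_means[c] for c, n in covered.items()) / total
    return estimates

# Hypothetical cells defined by age group and sex; ("35+", "M") has no sample anywhere.
cell_means = {("18-34", "F"): 0.30, ("18-34", "M"): 0.20, ("35+", "F"): 0.10}
pop_counts = {
    "Area A": {("18-34", "F"): 100, ("18-34", "M"): 100, ("35+", "F"): 200, ("35+", "M"): 50},
    "Area B": {("18-34", "F"): 300, ("18-34", "M"): 100, ("35+", "F"): 100},
}
area_means = synthetic_area_means(cell_means, pop_counts)
```

Because the cell means are common to all areas, areas differ only through their population composition, which is what lets an area with no sample in a given cell still receive an estimate for it.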
As an application of the proposed methodology, we use a real-life example that
combines a COVID-19 probability survey representing the entire US adult
population, a non-probability survey representing only active US adult Facebook
users, Census Bureau estimates of adult population counts at granular levels,
and data from an independent COVID-19 data reporting website to estimate the
vaccine hesitancy rates (proportions) for the US states and the District of
Columbia (small areas). Through this application, we demonstrate the problems
with regular design-based estimates when used for small areas and how our
methodology may be employed to get more robust estimates along with stable
measures of uncertainty.
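A short sketch shows why design-based estimates can be unstable for small areas. It computes a weighted (Hájek-type) proportion and Kish's approximate effective sample size for each area. The records, area labels, and weights below are hypothetical and are not taken from the surveys used in this paper.

```python
from collections import defaultdict

def direct_estimates(records):
    """Design-based estimates by area.

    records: iterable of (area, y, weight) with y in {0, 1}.
    Returns {area: (weighted proportion, Kish effective sample size)}.
    """
    sum_w = defaultdict(float)
    sum_wy = defaultdict(float)
    sum_w2 = defaultdict(float)
    for area, y, w in records:
        sum_w[area] += w
        sum_wy[area] += w * y
        sum_w2[area] += w * w
    return {
        a: (sum_wy[a] / sum_w[a], sum_w[a] ** 2 / sum_w2[a])
        for a in sum_w
    }

# Hypothetical respondents: (area, hesitant indicator, survey weight).
toy_sample = [
    ("MD", 1, 1200.0),
    ("MD", 0, 800.0),
    ("MD", 0, 2000.0),
    ("DC", 1, 500.0),
]
direct = direct_estimates(toy_sample)
```

With a single respondent, the estimate for the second area is forced to 0 or 1, and unequal weights push the effective sample size of the first area below its nominal size of three; this is the kind of instability that borrowing strength across areas and data sources is meant to reduce.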
We now give an outline for the rest of the paper. In section 2, we describe our
methodology in detail. In section 3, we describe an application of our method to a real