JOINT POINT AND VARIANCE ESTIMATION UNDER A HIERARCHICAL
BAYESIAN MODEL FOR SURVEY COUNT DATA*
BY TERRANCE D. SAVITSKY1,a,
JULIE GERSHUNSKAYA2,b AND MARK CRANKSHAW2,c
1Office of Survey Methods Research, U.S. Bureau of Labor Statistics, aSavitsky.Terrance@bls.gov
2OEUS Statistical Methods Division, U.S. Bureau of Labor Statistics, bGershunskaya.Julie@bls.gov;
cCrankshaw.Mark@bls.gov
We propose a novel Bayesian framework for the joint modeling of sur-
vey point and variance estimates for count data. The approach incorporates
an induced prior distribution on the modeled true variance that sets it equal
to the generating variance of the point estimate, a key property more readily
achieved for continuous data response type models. Our count data model
formulation allows the input of domains at multiple resolutions (e.g., states,
regions, nation) and simultaneously benchmarks modeled estimates at higher
resolutions (e.g., states) to those at lower resolutions (e.g., regions) in a fash-
ion that borrows more strength to sharpen our domain estimates at higher res-
olutions. We conduct a simulation study that generates a population of units
within domains to produce ground truth statistics to compare to direct and
modeled estimates performed on samples taken from the population where
we show improved reductions in error across domains. The model is applied
to the job openings variable and other data items published in the Job Open-
ings and Labor Turnover Survey administered by the U.S. Bureau of Labor
Statistics.
1. Introduction. Count data response variables are commonly measured by government
surveys; for example, the American Community Survey administered by the U.S. Census Bureau
counts the population below a poverty threshold for household domains indexed by geography
(e.g., census tracts). The U.S. Census Bureau administers the Consumer Expenditures
surveys of consumer units (independent households) for the U.S. Bureau of Labor Statistics
(BLS) that include count variables related to local and regional locations of the consumer
units. BLS administers surveys and a census instrument of business establishments related to
total employment and its components (e.g., job openings, hires, separations).
As with surveys conducted for continuous data response types, surveys that include count
data responses aggregate respondent-level counts, such as total employment for a business es-
tablishment respondent, to a collection of domains (such as state-by-industry classification)
and produce both a point estimate and an estimated variance statistic for each domain. Small
domain estimation models for the continuous response type that jointly model the point esti-
mates and the estimated variances for the domains exist within both frequentist and Bayesian
frameworks; see, for example, Maiti et al. (2014) and Sugasawa et al. (2017). These mod-
els borrow strength from the underlying correlations among the domain estimates to provide
de-noised model-based estimators that are characterized by lower mean squared errors. The
inferential goal for these models is to extract model-smoothed point and variance estimates
for publication to their data users. It is important to note that the domain-indexed variances
are not known, but estimated, such that small domain models provide an opportunity to enhance
the quality of both point and variance estimates through their joint estimation since
they are typically correlated.
*U.S. Bureau of Labor Statistics, 2 Massachusetts Ave. N.E, Washington, D.C. 20212 USA
Keywords and phrases: Bayesian hierarchical models, Small Area Estimation, Count data, Stan.
arXiv:2210.14366v1 [stat.ME] 25 Oct 2022
Bayesian models for continuous data point and variance estimates are easily designed such
that the mean of the marginal likelihood for the estimated variances represents a denoised
“true” variance. The true variance, in turn, is set to be the “generating” variance for the noisy
point estimate in its likelihood centered around the estimated true mean value (Sugasawa
et al. 2017); for example, suppose $v_d$ represents the estimated sampling variance for domain
$d \in (1, \ldots, N)$ associated with continuous response, $y_d$. In the case of unknown, latent true
domain variance, $\sigma_d^2$, one may impose a likelihood, $v_d \mid \sigma_d^2 \overset{ind}{\sim} f(\sigma_d^2)$, with mean $\sigma_d^2$. One
typically chooses the conditional likelihood, $y_d \mid \theta_d, \sigma_d^2 \overset{ind}{\sim} N(\theta_d, \sigma_d^2)$, where $N(\cdot)$ denotes
the normal distribution. We see that the variance in the conditional likelihood for continuous
$y_d$ is readily and naturally set to equal the latent true variance, $\sigma_d^2$. This connection between
the point and variance estimates where the true modeled variance is set as the generating
variance of the noisy point estimate ties together the likelihoods for the point and variance
estimates in a single model framework.
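To make the continuous-data construction concrete, the following is a minimal simulation sketch and not the authors' implementation. It assumes a Gamma form for the unspecified $f(\cdot)$, chosen only because its mean can be pinned to $\sigma_d^2$. The function name `simulate_continuous_domain`, the shape value, and all numeric settings are illustrative assumptions.

```python
import random
import statistics


def simulate_continuous_domain(theta, sigma2, shape=5.0, n_rep=20000, seed=1):
    """Repeatedly draw (y_d, v_d) for one domain with true mean theta and variance sigma2.

    Assumption: v_d | sigma2 ~ Gamma(shape, scale = sigma2 / shape), so E[v_d] = sigma2.
    The text leaves f(.) unspecified; the Gamma is only a convenient stand-in.
    """
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    # Point estimates: normal likelihood whose generating variance is sigma2.
    y_draws = [rng.gauss(theta, sd) for _ in range(n_rep)]
    # Variance estimates: noisy, but centered on the same sigma2.
    v_draws = [rng.gammavariate(shape, sigma2 / shape) for _ in range(n_rep)]
    return y_draws, v_draws


y_draws, v_draws = simulate_continuous_domain(theta=10.0, sigma2=4.0)
# Both summaries should land near sigma2 = 4, because one parameter drives
# the spread of the point estimates and the mean of the variance estimates.
print("mean of v_d draws:    ", statistics.fmean(v_draws))
print("variance of y_d draws:", statistics.variance(y_draws))
```

The point of the sketch is that a single parameter appears in both likelihoods, which is the property the count data model has to work harder to reproduce.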
We are not aware of a (small area) model for count data in the small area estimation literature
where the estimated sampling variance is modeled jointly with the direct point estimate such
that the generating variance of the direct point estimate is set equal to the mean of the estimated
sampling variance (where we interpret the mean as the latent “true” domain variance). In their
recent comprehensive review article of small area estimation methods, Sugasawa & Kubokawa
(2020) note the possibility to model the point estimate with a non-normal distribution and
more broadly discuss the use of generalized linear models. They do not, however, explicate
a count data model that incorporates estimated variances. Similarly, Rao & Molina (2015)
discuss over-dispersed Poisson models for count data, but none that incorporate estimated
domain variances. Tzavidis et al. (2015) develop a Poisson small area model for count data, but
assume the domain variance is equal to the mean of the Poisson likelihood for the domain
point estimate, so they do not input domain variances at all.
The literature does, however, provide a recent example where Bradley et al. (2016) construct
a joint model for geographically-indexed point and variance estimates under a count
data response. They define a Poisson likelihood such that the model conditional variance
(of the point estimate) is defined as $\mathrm{Var}(y_d \mid x_d) = \exp(\lambda_d)$ for count data response, $y_d$,
associated to domain $d \in (1, \ldots, N)$, where $x_d$ is a set of covariates and $\lambda_d$ is the log-mean
parameter. By contrast, under a normal likelihood with mean $\lambda_d$ and variance $\phi_d$
for logarithm, $\log(v_d)$, of true variance $v_d$ of $y_d$, the associated mean is
$E(v_d \mid x_d) = \sigma_d^2 = \exp(\lambda_d + \sigma_d^2/2) \neq \mathrm{Var}(y_d \mid x_d)$. Although Bradley et al. (2016) utilize a random effect by
specifying a likelihood for the log-variance, this construction does not produce a true variance
that is equal to the generating variance of the count data response. In short, the literature
on small area models for count data that incorporate estimated variances is limited, and
to our knowledge no implementation ensures that $\sigma_d^2 = \mathrm{Var}(y_d \mid x_d)$. Perhaps
the reason for the limited literature focused on count data models for small area estimation is
that for many datasets domain level counts are sufficiently large to approximate with a continuous
data distribution, though we have mentioned some data examples above with count
data variables that exhibit low counts for some domains.
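The mismatch described above can be checked numerically. The sketch below is only an illustration of the general point, not a reproduction of the Bradley et al. (2016) model. It writes the log-scale variance generically as `s2` (an illustrative name) and compares the Poisson variance $\exp(\lambda_d)$ with the mean of a lognormal whose log has mean $\lambda_d$ and variance `s2`.

```python
import math


def poisson_variance(log_mean):
    """Variance (and mean) of a Poisson with log-mean parameter log_mean."""
    return math.exp(log_mean)


def lognormal_mean(log_mean, s2):
    """Mean of a variable whose logarithm is normal with mean log_mean and variance s2."""
    return math.exp(log_mean + 0.5 * s2)


lam = math.log(20.0)  # hypothetical log-mean for one domain
for s2 in (0.0, 0.25, 1.0):
    gap = lognormal_mean(lam, s2) - poisson_variance(lam)
    print(f"s2={s2:4.2f}  Poisson variance={poisson_variance(lam):7.3f}  "
          f"lognormal mean={lognormal_mean(lam, s2):7.3f}  gap={gap:6.3f}")
```

The two quantities agree only when `s2` is zero, so sharing $\lambda_d$ between the two likelihoods does not by itself equate the modeled true variance with the generating variance of the count.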
This paper, by contrast, constructs a joint model for a count data point estimate $y_d$ and
its estimated variance $v_d$ where the modeled true variance $\sigma_d^2$ (the mean of the conditional
likelihood for $v_d$) is set equal to the variance $\mathrm{Var}(y_d \mid x_d)$ of the point estimate likelihood.
We extend a multiplicative random effect in our model specification for the point estimate
as suggested by Zhou et al. (2012) for non-survey data where there is no associated variance
estimate. They discuss that a Poisson-Lognormal prior set-up better fits the data with more
appealing large sample theoretical properties as compared to the Negative Binomial model
because the more flexible formulation of the former allows the data to learn a higher degree
of over-dispersion. We extend their Poisson-Lognormal set-up in our Bayesian hierarchical
model framework to indirectly induce a prior on the true variance of the likelihood for the
estimated variance such that the true variance equals the generating variance for the point
estimate.
Our formulation further leverages a Bayesian hierarchical probability model construction
by including the variable of interest at multiple resolutions (e.g., nested geographic levels,
such as states, regions, nation). The model simultaneously benchmarks modeled point es-
timates at higher resolutions (e.g., states) to those at lower resolutions (e.g., regions) in a
fashion that sharpens the estimation quality at higher resolutions. Traditional benchmarking
discussed in Rao & Molina (2015), by contrast, is often performed as a second step after
modeling is completed such that it tends to add back some of the variance removed by mod-
eling. We avoid this loss of efficiency by including the benchmarking as part of performing
estimation at multiple resolutions in a single step as is done in Savitsky (2016).
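For contrast with the single-step approach, the sketch below shows a simple ratio adjustment, one common form of second-step benchmarking. It is not the method proposed in this paper, and the function name `ratio_benchmark` and the numbers are hypothetical.

```python
def ratio_benchmark(state_estimates, region_total):
    """Rescale state-level estimates so that they sum to a fixed regional total.

    This is a post-hoc (second-step) adjustment: every state estimate is
    multiplied by the same factor, regardless of how precise it is.
    """
    current_total = sum(state_estimates)
    if current_total <= 0:
        raise ValueError("state estimates must sum to a positive value")
    factor = region_total / current_total
    return [factor * estimate for estimate in state_estimates]


# Hypothetical modeled state estimates and a regional benchmark total.
modeled_states = [120.0, 45.0, 310.0]
benchmarked = ratio_benchmark(modeled_states, region_total=500.0)
print(benchmarked, sum(benchmarked))
```

Because the same factor is applied to every state, noise in the regional total propagates into all state estimates, which is one way to see how a second step can add back variance.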
1.1. Job Openings and Labor Turnover Survey (JOLTS) Motivating Dataset. This paper
was motivated by the JOLTS survey conducted by the U.S. Bureau of Labor Statistics. JOLTS
measures dynamic trends in the labor market by tracking job openings, hires and separations,
among other variables, on a monthly basis. The survey is conducted nationally with the intent
to provide a national-level estimator for a collection of industries (defined based on the North
American Industry Classification System (NAICS) codes). Users, however, desire to have state-level
estimates of these labor market dynamic variables for each industry. A major challenge to
produce state-level estimates of JOLTS variables is the small number of surveyed business
establishments in many states; in fact, in some industries there are states that may not have
any sample responses in a given month. It is cost prohibitive for BLS to increase their sample
size and to use blocking by state in order to support state level estimation. Our goal in this pa-
per is to model the collection of state-by-industry point estimates and variances constructed
from the national survey to extract more efficient, higher quality estimators and to impute es-
timated values for state-level domains with no underlying sample. The Bayesian hierarchical
model that we construct in the sequel for each industry will simultaneously impute missing
point estimates and variances for those states excluded from the sample in any given month.
The remainder of this paper is organized as follows: Section 2 provides the mathematical
formulation of our joint model for state-level point and variance estimates for each month in
each industry. We then extend this cross-sectional by-month model to a time-series construction
that jointly models a collection of state-indexed time-series for each of the point and
variance estimates in each industry. We design a simulation study in Section 3 that generates
respondent level population of count data and constructs true values for point estimates over
a collection of domains. Our simulation design then takes a sample of respondents and pro-
duces domain-based point and both true and estimated variances of direct point estimates for
the sampled respondents in each domain. We compare the MSE performances of the sample-
based direct estimator and our modeled estimator based on the population ground truth. In
Section 4, we proceed to apply our model to provide JOLTS state-based point and variance
estimates and we illustrate the smoothing property of our models. We conclude with a brief
discussion in Section 5.
2. Model for Count Data Point and Variance Estimates.
2.1. Cross-sectional Model Under Assumed Known Variances. We proceed to describe
the formulation of a cross-sectional model for the estimation of count data point estimates in
a given month. We utilize the structure of JOLTS data to describe our model, for ease of
understanding. Domains in JOLTS are defined by intersections of states and industries. We
consider separate models by industry, each estimated over the collection of states. For each
domain $d \in (1, \ldots, N)$ we observe sample-based estimates, $y_d$, and respective estimates of
their variances, $v_d$, where $N$ denotes the total number of domains in a given industry.
We construct a model for a count data response, rather than treating point estimate, $y_d$,
as continuous because the JOLTS variables of interest, such as job openings, are often characterized
by very small counts for a given industry and state such that the conditional likelihood
is expected to be very skewed (unlike a symmetric normal distribution). Our model
will specify a Poisson-lognormal model that allows for an over-dispersed marginal likelihood
for point estimate $y_d$ for domain $d \in (1, \ldots, N)$ where the data may estimate the variance,
$\mathrm{Var}(y_d \mid x_d) \geq E(y_d \mid x_d)$. We assume that the constructed domain variances, $(v_d)_{d=1}^{N}$, are
known such that we treat them as fixed. So, we specify a likelihood for $y_d \mid x_d$ and set its
variance equal to $v_d$, which we see below is not trivial.
2.1.1. Poisson-lognormal Mixture Likelihood Formulation. We allow for overdispersion
through a Poisson-lognormal joint likelihood with:
(1) $y_d \mid \theta_d, \varepsilon_d \overset{ind}{\sim} \mathrm{Poisson}(\theta_d \varepsilon_d)$,
where the use of latent random effects $\varepsilon_d$ with a mean fixed to 1 allows for the variance to
be greater than or equal to the mean by specifying the following distribution for the latent
likelihood,
(2) $\varepsilon_d \mid \phi_d \overset{ind}{\sim} \mathrm{LN}(-0.5\phi_d^2, \phi_d^2)$.
Together, Equations 1 and 2 for observed direct sample-based estimates $y_d$ (of job openings,
hires, separations, or other JOLTS items) define a lognormal scale mixture of Poisson distributions
parameterized so that the mean is $E(y_d \mid \theta_d) = \theta_d$, where $\theta_d$ represents the parameter of
interest. We accomplish setting the mean of the mixture likelihood to $\theta_d$ through our use of
$-0.5\phi_d^2$ in Equation 2, which restricts the prior mean of $\varepsilon_d$ to be 1. If we had instead chosen
a Gamma distribution for $\varepsilon_d$, the resulting likelihood after marginalizing over $\varepsilon_d$ would have
been a closed-form negative binomial distribution, which is not the case for our Poisson-lognormal
mixture. We nevertheless select the lognormal instead of the Gamma because
Zhou et al. (2012) highlight that the lognormal has proven more flexible for the modeling of
heavy tails, which we expect in the JOLTS data due to the presence of domains with very small
counts.
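The mixture of Equations 1 and 2 can be simulated directly to check that the mean stays at $\theta_d$ while the variance exceeds it. The sketch below is a standard-library illustration, not the authors' Stan code. The helper names, the Knuth-style Poisson sampler (adequate only for small means), and the values of `theta` and `phi2` are assumptions made for the example.

```python
import math
import random
import statistics


def sample_poisson(rng, mean):
    """Draw one Poisson variate by Knuth's multiplication method (small means only)."""
    limit = math.exp(-mean)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return count
        count += 1


def sample_poisson_lognormal(rng, theta, phi2):
    """Draw y ~ Poisson(theta * eps) with eps ~ LN(-0.5 * phi2, phi2), so E[eps] = 1."""
    eps = rng.lognormvariate(-0.5 * phi2, math.sqrt(phi2))
    return sample_poisson(rng, theta * eps)


def simulate_mixture(theta, phi2, n_rep=20000, seed=7):
    """Return n_rep draws from the Poisson-lognormal mixture."""
    rng = random.Random(seed)
    return [sample_poisson_lognormal(rng, theta, phi2) for _ in range(n_rep)]


draws = simulate_mixture(theta=5.0, phi2=0.5)
print("sample mean:    ", statistics.fmean(draws))      # should be near theta
print("sample variance:", statistics.variance(draws))   # should exceed the mean
```

Setting `phi2` close to zero makes the random effect nearly constant at 1, in which case the draws behave like an ordinary Poisson with variance close to the mean.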
2.1.2. Linking Model for Conditional Mean, $\theta_d$. The conditional mean parameters, $\theta_d$, are
used to borrow strength across domains and are constructed with the formulation
(3) $\theta_d = X_d \exp(\lambda_d)$,
where $X_d$ is an “offset”. The use of an offset allows specification of the regression model for
a normalized rate, $\exp(\lambda_d)$, which allows the data to estimate correlations among domains
for smoothing among domains of different sizes. The use of a magnitude offset is typical for
count data models; for example, in estimating disease prevalence (Gelman et al. 2014). The
offset $X_d$ is assumed known and does not contain error. In application to JOLTS, we use the
employment level, derived from the Current Employment Statistics (CES) survey conducted
by the Bureau of Labor Statistics, as the offset. The total employment values are typically
much larger in magnitude than the JOLTS variables (e.g., job openings, quits, hires) and the
rate of JOLTS variables to total employment composes a natural ratio in a similar fashion to
the number with a disease over the total population in disease mapping (Gelman et al. 2014).
Although the CES-based $X_d$ is an estimate, it is based on a much larger sample than the
JOLTS estimate, and so we ignore the variance associated with estimation of $X_d$.
We model $\lambda_d$ in Equation 3 on the log-scale such that $\lambda_d \in \mathbb{R}$ and we may specify a normal
prior distribution. We specify a linking model for $\lambda_d$ in Equation 4 that allows
“borrowing strength” across domains from a set of covariates with,
(4) $\lambda_d \sim N(\beta x_d, \tau^2)$.
The prior specifies that the log-rate $\lambda_d$ follows a normal distribution, centered at $\beta x_d$ with variance
$\tau^2$; $x_d$ is a set of covariates and $\beta$ is a vector of regression coefficients. Hyperparameters $\beta$
and $\tau^2$ are “global” in the sense that their values are shared for all domains.
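Equations 3 and 4 together describe how a domain mean is generated from its offset and covariates. The sketch below draws from that prior for a few domains. It is illustrative only: the function name `draw_theta`, the intercept-plus-one-covariate design, and the offset values standing in for employment levels are all hypothetical.

```python
import math
import random


def draw_theta(offsets, covariates, beta, tau, seed=0):
    """Draw theta_d = X_d * exp(lambda_d) with lambda_d ~ N(beta . x_d, tau^2).

    offsets:    list of X_d values (assumed known, for example employment levels)
    covariates: list of covariate vectors x_d, one per domain
    beta:       regression coefficients shared by all domains
    tau:        standard deviation of the log-rate around its regression mean
    """
    rng = random.Random(seed)
    thetas = []
    for offset, x_d in zip(offsets, covariates):
        linear_predictor = sum(b * x for b, x in zip(beta, x_d))
        log_rate = rng.gauss(linear_predictor, tau)   # Equation 4
        thetas.append(offset * math.exp(log_rate))    # Equation 3
    return thetas


# Hypothetical inputs: three domains of very different size, intercept plus one covariate.
offsets = [1_000.0, 50_000.0, 250_000.0]
covariates = [[1.0, 0.2], [1.0, -0.1], [1.0, 0.0]]
beta = [math.log(0.04), 0.5]   # baseline rate of 4 percent on the natural scale
print(draw_theta(offsets, covariates, beta, tau=0.1))
```

Because the regression acts on the rate and the offset carries the scale, small and large domains share the same `beta` and `tau`, which is what permits borrowing strength across them.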
2.1.3. Setting Conditional Variance of $y_d$ to Equal Known Variance, $v_d$. The Poisson-lognormal
mixture likelihood of Equations 1 and 2 produces the marginal variance,
(5) $\mathrm{Var}(y_d \mid \theta_d, \phi_d) = \theta_d + \theta_d^2 \left(\exp(\phi_d^2) - 1\right)$,
where the marginal variance is a function of parameters $(\theta_d, \phi_d)$. By construction, $v_d = \mathrm{Var}(y_d \mid X_d)$,
so we need to set the marginal variance (after integrating out $\varepsilon_d$) of the Poisson-lognormal
mixture equal to $v_d$ (treated as known and fixed) with,
(6) $v_d = \mathrm{Var}(y_d \mid \theta_d, \phi_d) = \theta_d + \theta_d^2 \left(\exp(\phi_d^2) - 1\right)$.
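The marginal variance in Equation 5 follows from the law of total variance applied to the mixture of Equations 1 and 2. The short derivation below is a sketch added for completeness; it uses only the Poisson property that mean and variance coincide and the standard lognormal variance formula.

```latex
\begin{align*}
\mathrm{Var}(y_d \mid \theta_d, \phi_d)
  &= \mathrm{E}\left[\mathrm{Var}(y_d \mid \theta_d, \varepsilon_d)\right]
   + \mathrm{Var}\left(\mathrm{E}[y_d \mid \theta_d, \varepsilon_d]\right) \\
  &= \mathrm{E}[\theta_d \varepsilon_d] + \mathrm{Var}(\theta_d \varepsilon_d) \\
  &= \theta_d + \theta_d^2\, \mathrm{Var}(\varepsilon_d \mid \phi_d), \\
\mathrm{Var}(\varepsilon_d \mid \phi_d)
  &= \left(\exp(\phi_d^2) - 1\right)\exp\left(2(-0.5\phi_d^2) + \phi_d^2\right)
   = \exp(\phi_d^2) - 1 .
\end{align*}
```

The third line uses $\mathrm{E}[\varepsilon_d \mid \phi_d] = 1$, and the last line shows that the location parameter $-0.5\phi_d^2$ cancels the scale factor in the lognormal variance.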
Our inferential interest is in $\theta_d$, the conditional mean for $y_d$, and we specify the linking
model for $\theta_d$ to borrow strength among domains to produce a smoothed estimator. So, we
accomplish setting $v_d$ to be the marginal variance for $y_d$ by solving for $\phi_d^2$ in Equation 6 to
achieve,
(7) $\phi_d^2 = \log\left(\dfrac{v_d - \theta_d}{\theta_d^2} + 1\right)$,
where we have induced a prior on $\phi_d$ through our linking model distribution imposed on $\lambda_d$
(where we recall Equation 3). In other words, unlike the typical set-up in Bayesian models,
we do not directly set a prior distribution for $\phi_d$ but induce it through $\theta_d$ and the functional
relationship of Equation 7 to guarantee that Equation 6 is achieved. As discussed in the
introduction, Sugasawa & Kubokawa (2020) mention the possibility for use of non-normal
likelihoods and generalized linear models, but do not include the joint modeling of point
estimates and variances for count data. The same is true of Rao & Molina (2015). To our
knowledge, ours is the first treatment of a count data model for domain level data that enforces the
variance condition of Equation 6.
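The functional relationship of Equation 7 is easy to check by a round trip through Equation 6. The sketch below is illustrative; the function names `marginal_variance` and `solve_phi2` are not from the paper. One point it makes explicit is that the formula returns a nonnegative $\phi_d^2$ only when $v_d \geq \theta_d$, since the mixture cannot be under-dispersed.

```python
import math


def marginal_variance(theta, phi2):
    """Equation 5: Var(y | theta, phi) = theta + theta^2 * (exp(phi^2) - 1)."""
    return theta + theta ** 2 * math.expm1(phi2)


def solve_phi2(v, theta):
    """Equation 7: phi^2 = log((v - theta) / theta^2 + 1), for v >= theta > 0."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    if v < theta:
        raise ValueError("v must be at least theta for a nonnegative phi^2")
    return math.log1p((v - theta) / theta ** 2)


# Hypothetical domain: mean 5 and variance 30.
theta, v = 5.0, 30.0
phi2 = solve_phi2(v, theta)
print("phi^2:", phi2)
print("recovered variance:", marginal_variance(theta, phi2))
```

In a sampler, `theta` changes at every iteration, so `phi2` would be recomputed from the current `theta` each time instead of being given its own prior.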
2.2. Model Extension for Joint Modeling of Point Estimates and Variances, $(y_d, v_d)$.
2.2.1. Likelihood for Observed Variances, $v_d$. We next address the case where the true,
underlying domain variances are unknown such that we construct an additional likelihood
statement for the observed variances, $v_d$, centered on the true, latent variances.
We denote the true latent variance of sample-based point estimate $y_d$ by $\sigma_d^2$ such that
$\sigma_d^2 = \mathrm{Var}(y_d \mid \theta_d, \phi_d)$. We achieve this equality by altering Equation 6 to,
(8) $\sigma_d^2 = \mathrm{Var}(y_d \mid \theta_d, \phi_d) = \theta_d + \theta_d^2 \left(\exp(\phi_d^2) - 1\right)$.
Note, however, that true sampling variances $\sigma_d^2$ are not observed. Instead, we observe
estimated variances $v_d$. We choose to work with observed squared coefficient of variation,