JOINT POINT AND VARIANCE ESTIMATION UNDER A HIERARCHICAL BAYESIAN MODEL FOR SURVEY COUNT DATA*

BY TERRANCE D. SAVITSKY$^{1,a}$, JULIE GERSHUNSKAYA$^{2,b}$ AND MARK CRANKSHAW$^{2,c}$

$^{1}$Office of Survey Methods Research, U.S. Bureau of Labor Statistics, $^{a}$Savitsky.Terrance@bls.gov

$^{2}$OEUS Statistical Methods Division, U.S. Bureau of Labor Statistics, $^{b}$Gershunskaya.Julie@bls.gov; $^{c}$Crankshaw.Mark@bls.gov

We propose a novel Bayesian framework for the joint modeling of survey point and variance estimates for count data. The approach incorporates an induced prior distribution on the modeled true variance that sets it equal to the generating variance of the point estimate, a key property more readily achieved for continuous data response type models. Our count data model formulation allows the input of domains at multiple resolutions (e.g., states, regions, nation) and simultaneously benchmarks modeled estimates at higher resolutions (e.g., states) to those at lower resolutions (e.g., regions) in a fashion that borrows more strength to sharpen our domain estimates at higher resolutions. We conduct a simulation study that generates a population of units within domains to produce ground truth statistics, which we compare to direct and modeled estimates computed on samples taken from the population; the modeled estimates show reduced error across domains. The model is applied to the job openings variable and other data items published in the Job Openings and Labor Turnover Survey administered by the U.S. Bureau of Labor Statistics.

1. Introduction. Count data response variables are commonly measured by government surveys; for example, the American Community Survey administered by the U.S. Census Bureau counts the population below a poverty threshold for household domains indexed by geography (e.g., census tracts). The U.S. Census Bureau administers the Consumer Expenditures surveys of consumer units (independent households) for the U.S. Bureau of Labor Statistics (BLS) that include count variables related to local and regional locations of the consumer units. BLS administers surveys and a census instrument of business establishments related to total employment and its components (e.g., job openings, hires, separations).

As with surveys conducted for continuous data response types, surveys that include count data responses aggregate respondent-level counts, such as total employment for a business establishment respondent, to a collection of domains (such as state-by-industry classification) and produce both a point estimate and an estimated variance statistic for each domain. Small domain estimation models for the continuous response type that jointly model the point estimates and the estimated variances for the domains exist within both frequentist and Bayesian frameworks; see, for example, Maiti et al. (2014) and Sugasawa et al. (2017). These models borrow strength from the underlying correlations among the domain estimates to provide de-noised model-based estimators that are characterized by lower mean squared errors. The inferential goal for these models is to extract model-smoothed point and variance estimates for publication to their data users. It is important to note that the domain-indexed variances are not known, but estimated, such that small domain models provide an opportunity to enhance the quality of both point and variance estimates through their joint estimation since they are typically correlated.

*U.S. Bureau of Labor Statistics, 2 Massachusetts Ave. N.E, Washington, D.C. 20212 USA

Keywords and phrases: Bayesian hierarchical models, Small Area Estimation, Count data, Stan.

arXiv:2210.14366v1 [stat.ME] 25 Oct 2022

Bayesian models for continuous data point and variance estimates are easily designed such that the mean of the marginal likelihood for the estimated variances represents a denoised “true” variance. The true variance, in turn, is set to be the “generating” variance for the noisy point estimate in its likelihood centered around the estimated true mean value (Sugasawa et al. 2017); for example, suppose $v_d$ represents the estimated sampling variance for domain $d \in (1,\ldots,N)$ associated with continuous response, $y_d$. In the case of unknown, latent true domain variance, $\sigma_d^2$, one may impose a likelihood, $v_d \mid \sigma_d^2 \overset{ind}{\sim} f(\sigma_d^2)$ with mean $\sigma_d^2$. One typically chooses the conditional likelihood, $y_d \mid \theta_d, \sigma_d^2 \overset{ind}{\sim} \mathcal{N}(\theta_d, \sigma_d^2)$, where $\mathcal{N}(\cdot)$ denotes the normal distribution. We see that the variance in the conditional likelihood for continuous $y_d$ is readily and naturally set to equal the latent true variance, $\sigma_d^2$. This connection between the point and variance estimates where the true modeled variance is set as the generating variance of the noisy point estimate ties together the likelihoods for the point and variance estimates in a single model framework.
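To make the continuous-data construction concrete, the following minimal Python sketch simulates one pair $(y_d, v_d)$ per domain so that the same latent $\sigma_d^2$ is both the variance of $y_d$ and the mean of $v_d$. The text leaves the variance likelihood $f$ unspecified; the Gamma form used here, the function name `simulate_continuous_domains` and the `dof` parameter are illustrative assumptions, not the authors' specification.

```python
import numpy as np

def simulate_continuous_domains(theta, sigma2, dof=10, seed=0):
    """Draw (y_d, v_d) pairs that share the latent variance sigma2_d.

    Assumption: f is taken to be a Gamma distribution with mean sigma2_d
    (a scaled chi-square with `dof` degrees of freedom); the source only
    requires that f has mean sigma2_d.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    # Point estimate: y_d | theta_d, sigma2_d ~ N(theta_d, sigma2_d)
    y = rng.normal(loc=theta, scale=np.sqrt(sigma2))
    # Variance estimate: Gamma(shape=dof/2, scale=2*sigma2_d/dof) has mean sigma2_d
    v = rng.gamma(shape=dof / 2.0, scale=2.0 * sigma2 / dof)
    return y, v

if __name__ == "__main__":
    y, v = simulate_continuous_domains(theta=[10.0, 20.0], sigma2=[4.0, 9.0])
    print(y, v)
```

Under this sketch, averaging many draws of `v` for a fixed domain recovers `sigma2`, which is the tie between the two likelihoods that the paragraph describes.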

We are not aware of a (small area) model for count data in the small area estimation literature where the estimated sampling variance is modeled jointly with the direct point estimate such that the generating variance of the direct point estimate is set equal to the mean of the estimated sampling variance (where we interpret the mean as the latent “true” domain variance). In their recent comprehensive review article of small area estimation methods, Sugasawa & Kubokawa (2020) note the possibility to model the point estimate with a non-normal distribution and more broadly discuss the use of generalized linear models. They do not, however, explicate a count data model that incorporates estimated variances. Similarly, Rao & Molina (2015) discuss over-dispersed Poisson models for count data, but none that incorporate estimated domain variances. Tzavidis et al. (2015) develop a Poisson small area model for count data, but assume the domain variance is equal to the mean of the Poisson likelihood for the domain point estimate, so they do not input domain variances at all.

The literature does, however, provide a recent example where Bradley et al. (2016) construct a joint model for geographically-indexed point and variance estimates under a count data response. They define a Poisson likelihood such that the model conditional variance (of the point estimate) is defined as $\mathrm{Var}(y_d \mid x_d) = \exp(\lambda_d)$ for count data response, $y_d$, associated to domain $d \in (1,\ldots,N)$, where $x_d$ is a set of covariates and $\lambda_d$ is the log-mean parameter. By contrast, under a normal likelihood with mean $\lambda_d$ and variance $\phi_d$ for logarithm, $\log(v_d)$, of true variance $v_d$ of $y_d$, the associated mean is $E(v_d \mid x_d) = \sigma^2 \exp(\lambda_d + \sigma_d^2/2) \neq \mathrm{Var}(y_d \mid x_d)$. Although Bradley et al. (2016) utilize a random effect by

specifying a likelihood for the log-variance, this construction does not produce a true variance that is equal to the generating variance of the count data response. In short, the literature on small area models for count data that incorporate estimated variances is limited, and to our knowledge there are no implementations that ensure $\sigma_d^2 = \mathrm{Var}(y_d \mid x_d)$. Perhaps the reason for the limited literature focused on count data models for small area estimation is that for many datasets domain-level counts are sufficiently large to approximate with a continuous data distribution, though we have mentioned above some examples of count data variables that express low counts for some domains.

This paper, by contrast, constructs a joint model for a count data point estimate $y_d$ and its estimated variance $v_d$ where the modeled true variance $\sigma_d^2$ (the mean of the conditional likelihood for $v_d$) is set equal to the variance $\mathrm{Var}(y_d \mid x_d)$ of the point estimate likelihood. We extend a multiplicative random effect in our model specification for the point estimate as suggested by Zhou et al. (2012) for non-survey data where there is no associated variance estimate. They discuss that a Poisson-Lognormal prior set-up better fits the data with more appealing large sample theoretical properties as compared to the Negative Binomial model because the more flexible formulation of the former allows the data to learn a higher degree of over-dispersion. We extend their Poisson-Lognormal set-up in our Bayesian hierarchical model framework to indirectly induce a prior on the true variance of the likelihood for the estimated variance such that the true variance equals the generating variance for the point estimate.

Our formulation further leverages a Bayesian hierarchical probability model construction by including the variable of interest at multiple resolutions (e.g., nested geographic levels, such as states, regions, nation). The model simultaneously benchmarks modeled point estimates at higher resolutions (e.g., states) to those at lower resolutions (e.g., regions) in a fashion that sharpens the estimation quality at higher resolutions. Traditional benchmarking discussed in Rao & Molina (2015), by contrast, is often performed as a second step after modeling is completed such that it tends to add back some of the variance removed by modeling. We avoid this loss of efficiency by including the benchmarking as part of performing estimation at multiple resolutions in a single step as is done in Savitsky (2016).

1.1. Job Openings and Labor Turnover Survey (JOLTS) Motivating Dataset. This paper was motivated by JOLTS, conducted by the U.S. Bureau of Labor Statistics. JOLTS measures dynamic trends in the labor market by tracking job openings, hires and separations, among other variables, on a monthly basis. The survey is conducted nationally with the intent to provide a national-level estimator for a collection of industries (defined based on North American Industry Classification System (NAICS) codes). Users, however, desire to have state-level estimates of these labor market dynamic variables for each industry. A major challenge in producing state-level estimates of JOLTS variables is the small number of surveyed business establishments in many states; in fact, in some industries there are states that may not have any sample responses in a given month. It is cost prohibitive for BLS to increase their sample size and to use blocking by state in order to support state-level estimation. Our goal in this paper is to model the collection of state-by-industry point estimates and variances constructed from the national survey to extract more efficient, higher quality estimators and to impute estimated values for state-level domains with no underlying sample. The Bayesian hierarchical model that we construct in the sequel for each industry will simultaneously impute missing point estimates and variances for those states excluded from the sample in any given month.

The remainder of this paper is organized as follows: Section 2 provides the mathematical formulation of our joint model for state-level point and variance estimates for each month in each industry. We then extend this cross-sectional by-month model to a time-series construction that jointly models a collection of state-indexed time-series for each of the point and variance estimates in each industry. We design a simulation study in Section 3 that generates a respondent-level population of count data and constructs true values for point estimates over a collection of domains. Our simulation design then takes a sample of respondents and produces domain-based point estimates and both true and estimated variances of direct point estimates for the sampled respondents in each domain. We compare the MSE performances of the sample-based direct estimator and our modeled estimator based on the population ground truth. In Section 4, we proceed to apply our model to provide JOLTS state-based point and variance estimates and we illustrate the smoothing property of our models. We conclude with a brief discussion in Section 5.

2. Model for Count Data Point and Variance Estimates.

2.1. Cross-sectional Model Under Assumed Known Variances. We proceed to describe the formulation of a cross-sectional model for the estimation of count data point estimates in a given month. We utilize the structure of JOLTS data to describe our model, for ease of understanding. Domains in JOLTS are defined by intersections of states and industries. We consider separate models by industry, each estimated over the collection of states. For each domain $d \in (1,\ldots,N)$ we observe sample-based estimates, $y_d$, and respective estimates of their variances, $v_d$, where $N$ denotes the total number of domains in a given industry.

We construct a model for a count data response, rather than treating the point estimate, $y_d$, as continuous because the JOLTS variables of interest, such as job openings, are often characterized by very small counts for a given industry and state such that the conditional likelihood is expected to be very skewed (unlike a symmetric normal distribution). Our model will specify a Poisson-lognormal model that allows for an over-dispersed marginal likelihood for point estimate $y_d$ for domain $d \in (1,\ldots,N)$ where the data may estimate the variance, $\mathrm{Var}(y_d \mid x_d) \geq E(y_d \mid x_d)$. We assume that the constructed domain variances, $(v_d)_{d=1}^{N}$, are known such that we treat them as fixed. So, we specify a likelihood for $y_d \mid x_d$ and set its variance equal to $v_d$, which we see below is not trivial.

2.1.1. Poisson-lognormal Mixture Likelihood Formulation. We allow for overdispersion through a Poisson-lognormal joint likelihood with:

(1) $y_d \mid \theta_d, \varepsilon_d \overset{ind}{\sim} \mathrm{Poisson}(\theta_d \varepsilon_d)$,

where the use of latent random effects $\varepsilon_d$ with a mean fixed to 1 allows for the variance to be greater than or equal to the mean by specifying the following distribution for the latent random effects,

(2) $\varepsilon_d \mid \phi_d \overset{ind}{\sim} \mathrm{LN}(-0.5\phi_d^2, \phi_d^2)$.

Together, Equations 1 and 2 imply that the observed direct sample-based estimates $y_d$ (of job openings, hires, separations, or other JOLTS items) follow a lognormal scale mixture of Poisson distributions parameterized so that the mean is $E(y_d \mid \theta_d) = \theta_d$, where $\theta_d$ represents the parameter of interest. We accomplish setting the mean of the mixture likelihood to $\theta_d$ through our use of $-0.5\phi_d^2$ in Equation 2, which restricts the prior mean of $\varepsilon_d$ to be 1. If we had instead chosen a Gamma distribution for $\varepsilon_d$, the resulting likelihood after marginalizing over $\varepsilon_d$ would have been a closed-form negative binomial distribution, which is not the case for our Poisson-lognormal mixture. We, nevertheless, select the lognormal instead of the Gamma because Zhou et al. (2012) highlight that the lognormal has proven more flexible for the modeling of heavy tails, which we expect in the JOLTS data due to the presence of domains with very small counts.
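The mixture of Equations 1 and 2 can be checked by simulation. The sketch below is a minimal illustration, not the authors' code: the function name `simulate_poisson_lognormal` and the example values of $\theta_d$ and $\phi_d^2$ are assumptions chosen for demonstration. It draws $\varepsilon_d$ from the lognormal with log-mean $-0.5\phi_d^2$ and log-variance $\phi_d^2$, then draws $y_d$ from a Poisson with mean $\theta_d \varepsilon_d$.

```python
import numpy as np

def simulate_poisson_lognormal(theta, phi2, size, seed=0):
    """Draw from the Poisson-lognormal mixture of Equations 1 and 2.

    theta : conditional mean parameter (hypothetical example value)
    phi2  : squared lognormal scale, phi_d^2 (hypothetical example value)
    """
    rng = np.random.default_rng(seed)
    # NumPy parameterises the lognormal by the mean and standard deviation
    # of log(eps), so LN(-0.5*phi2, phi2) uses sigma = sqrt(phi2).
    eps = rng.lognormal(mean=-0.5 * phi2, sigma=np.sqrt(phi2), size=size)
    # Equation 1: y | theta, eps ~ Poisson(theta * eps)
    y = rng.poisson(theta * eps)
    return eps, y

if __name__ == "__main__":
    eps, y = simulate_poisson_lognormal(theta=5.0, phi2=0.25, size=400_000)
    print("mean of eps:", eps.mean())   # should be close to 1
    print("mean of y:  ", y.mean())     # should be close to theta
    print("var of y:   ", y.var())      # should exceed the mean
```

With a large number of draws, the sample mean of `eps` sits near 1 and the sample mean of `y` near `theta`, while the sample variance of `y` exceeds its mean, which is the over-dispersion the lognormal random effect is meant to allow.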

2.1.2. Linking Model for Conditional Mean, $\theta_d$. The conditional mean parameter, $\theta_d$, is used to borrow strength across domains and is constructed with the formulation

(3) $\theta_d = X_d \exp(\lambda_d)$,

where $X_d$ is an “offset”. The use of an offset allows specification of the regression model for a normalized rate, $\exp(\lambda_d)$, which allows the data to estimate correlations among domains for smoothing among domains of different sizes. The use of a magnitude offset is typical for count data models; for example, in estimating disease prevalence (Gelman et al. 2014). The offset $X_d$ is assumed known and does not contain error. In application to JOLTS, we use the employment level, derived from the Current Employment Statistics (CES) survey conducted by the Bureau of Labor Statistics, as the offset. The total employment values are typically much larger in magnitude than the JOLTS variables (e.g., job openings, quits, hires) and the rate of JOLTS variables to total employment composes a natural ratio in a similar fashion to the number with a disease over the total population in disease mapping (Gelman et al. 2014). Although the CES-based $X_d$ is an estimate, it is based on a much larger sample than the JOLTS estimate, and so we ignore the variance associated with estimation of $X_d$.

We model $\lambda_d$ in Equation 3 on the log-scale such that $\lambda_d \in \mathbb{R}$ and we may specify a normal prior distribution. We specify a linking model for $\lambda_d$ in Equation 4 that allows “borrowing strength” across domains from a set of covariates with,

(4) $\lambda_d \sim \mathcal{N}(\beta x_d, \tau^2)$.

The prior specifies that the log-rate $\lambda_d$ follows a normal distribution, centered at $\beta x_d$ with variance $\tau^2$; $x_d$ is a set of covariates and $\beta$ is a vector of regression coefficients. Hyperparameters $\beta$ and $\tau^2$ are “global” in the sense that their values are shared for all domains.
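As a rough illustration of how Equations 3 and 4 combine, the sketch below draws the log-rates from the normal linking model and scales the resulting rates by the offset. It is a prior-predictive sketch under assumed inputs: the function name `draw_theta`, the covariate layout (one row per domain) and all numeric values are hypothetical, and the real model estimates $\beta$ and $\tau^2$ rather than fixing them.

```python
import numpy as np

def draw_theta(offset, covariates, beta, tau2, seed=0):
    """Draw lambda_d from Equation 4 and form theta_d via Equation 3.

    offset     : X_d, e.g. employment level per domain (assumed known)
    covariates : array of shape (N, p), one row x_d per domain (hypothetical layout)
    beta, tau2 : global hyperparameters shared by all domains
    """
    rng = np.random.default_rng(seed)
    offset = np.asarray(offset, dtype=float)
    x = np.asarray(covariates, dtype=float)
    beta = np.asarray(beta, dtype=float)
    # Equation 4: lambda_d ~ N(x_d . beta, tau2)
    lam = rng.normal(loc=x @ beta, scale=np.sqrt(tau2))
    # Equation 3: theta_d = X_d * exp(lambda_d)
    theta = offset * np.exp(lam)
    return lam, theta

if __name__ == "__main__":
    lam, theta = draw_theta(
        offset=[100.0, 1000.0, 50000.0],
        covariates=[[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]],
        beta=[-3.0, 0.1],
        tau2=0.05,
    )
    print(lam, theta)
```

Because the regression acts on the rate `exp(lam)` rather than on the count, domains of very different sizes share the same `beta` and `tau2`, which is what lets small domains borrow strength from large ones.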

2.1.3. Setting Conditional Variance of $y_d$ to Equal Known Variance, $v_d$. The Poisson-lognormal mixture likelihood of Equations 1 and 2 produces the marginal variance,

(5) $\mathrm{Var}(y_d \mid \theta_d, \phi_d) = \theta_d + \theta_d^2\left[\exp(\phi_d^2) - 1\right]$,

where the marginal variance is a function of parameters $(\theta_d, \phi_d)$. By construction, $v_d = \mathrm{Var}(y_d \mid X_d)$, so we need to set the marginal variance (after integrating out $\varepsilon_d$) of the Poisson-lognormal mixture equal to $v_d$ (treated as known and fixed) with,

(6) $v_d = \mathrm{Var}(y_d \mid \theta_d, \phi_d) = \theta_d + \theta_d^2\left[\exp(\phi_d^2) - 1\right]$.

Our inferential interest is in $\theta_d$, the conditional mean for $y_d$, and we specify the linking model for $\theta_d$ to borrow strength among domains to produce a smoothed estimator. So, we accomplish setting $v_d$ to be the marginal variance for $y_d$ by solving Equation 6 for $\phi_d^2$ to achieve,

(7) $\phi_d^2 = \log\left(\frac{v_d - \theta_d}{\theta_d^2} + 1\right)$,

where we have induced a prior on $\phi_d$ through our linking model distribution imposed on $\lambda_d$ (where we recall Equation 3). In other words, unlike the typical set-up in Bayesian models, we do not directly set a prior distribution for $\phi_d$ but induce it through $\theta_d$ and the functional relationship of Equation 7 to guarantee that Equation 6 is achieved. As discussed in the introduction, Sugasawa & Kubokawa (2020) mention the possibility for use of non-normal likelihoods and generalized linear models, but do not include the joint modeling of point estimates and variances for count data. The same is true of Rao & Molina (2015). In short, to our knowledge ours is the first treatment of a count data model for domain level data that enforces the variance condition of Equation 6.
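The relationship between Equations 6 and 7 can be sketched numerically. The helper names `marginal_variance` and `induced_phi2` below are hypothetical, and the guard that rejects $v_d < \theta_d$ is an assumption added for this illustration: Equation 7 yields a non-negative $\phi_d^2$ only when $v_d \geq \theta_d$, and the excerpt does not say how the authors treat other cases.

```python
import numpy as np

def marginal_variance(theta, phi2):
    """Equation 5/6: theta + theta^2 * (exp(phi2) - 1)."""
    theta = np.asarray(theta, dtype=float)
    return theta + theta**2 * np.expm1(phi2)

def induced_phi2(v, theta):
    """Equation 7: phi2 = log((v - theta) / theta^2 + 1).

    Assumption for this sketch: inputs with v < theta are rejected,
    since they would give a negative phi2.
    """
    v = np.asarray(v, dtype=float)
    theta = np.asarray(theta, dtype=float)
    if np.any(v < theta):
        raise ValueError("Equation 7 needs v >= theta for a non-negative phi2")
    return np.log1p((v - theta) / theta**2)

if __name__ == "__main__":
    theta = np.array([2.0, 50.0])      # hypothetical conditional means
    v = np.array([3.0, 400.0])         # hypothetical known variances
    phi2 = induced_phi2(v, theta)
    # Plugging phi2 back into Equation 6 should return v.
    print(phi2, marginal_variance(theta, phi2))
```

The round trip shows why no separate prior on $\phi_d$ is needed in this construction: once $\theta_d$ is drawn from the linking model and $v_d$ is fixed, $\phi_d^2$ is determined.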

2.2. Model Extension for Joint Modeling of Point Estimates and Variances, (yd, vd).

2.2.1. Likelihood for Observed Variances, $v_d$. We next address the case where the true, underlying domain variances are unknown such that we construct an additional likelihood statement for the observed variances, $v_d$, centered on the true, latent variances.

We denote the true latent variance of sample-based point estimate $y_d$ by $\sigma_d^2$ such that $\sigma_d^2 = \mathrm{Var}(y_d \mid \theta_d, \phi_d)$. We achieve this equality by altering Equation 6 to,

(8) $\sigma_d^2 = \mathrm{Var}(y_d \mid \theta_d, \phi_d) = \theta_d + \theta_d^2\left[\exp(\phi_d^2) - 1\right]$.

Note, however, that true sampling variances $\sigma_d^2$ are not observed. Instead, we observe estimated variances $v_d$. We choose to work with observed squared coefficient of variation,
