Autoencoded sparse Bayesian in-IRT factorization, calibration, and amortized
inference for the Work Disability Functional Assessment Battery
Joshua C. Chang Carson C. Chow Julia Porcino
NIH Clinical Center NIH NIDDK NIH Clinical Center
Abstract
The Work Disability Functional Assessment Bat-
tery (WD-FAB) is a multidimensional item re-
sponse theory (IRT) instrument designed for as-
sessing work-related mental and physical func-
tion based on responses to an item bank. In
prior iterations it was developed using traditional
means – linear factorization and null hypothesis
statistical testing for item partitioning/selection,
and finally, posthoc calibration of disjoint uni-
dimensional IRT models. As a result, the WD-
FAB, like many other IRT instruments, is a
posthoc model. Its item partitioning, based on
exploratory factor analysis, is blind to the fi-
nal nonlinear IRT model and is not performed
in a manner consistent with goodness of fit to
the final model. In this manuscript, we de-
velop a Bayesian hierarchical model for self-
consistently performing the following simulta-
neous tasks: scale factorization, item selection,
parameter identification, and response scoring.
This method uses sparsity-based shrinkage to ob-
viate the linear factorization and null hypoth-
esis statistical tests that are usually required
for developing multidimensional IRT models, so
that item partitioning is consistent with the ulti-
mate nonlinear factor model. We also analogize
our multidimensional IRT model to probabilis-
tic autoencoders, specifying an encoder function
that amortizes the inference of ability parame-
ters from item responses. The encoder function
is equivalent to the “VBE” step in a stochas-
tic variational Bayesian expectation maximiza-
tion (VBEM) procedure that we use for approxi-
mate Bayesian inference on the entire model. We
use the method on a sample of WD-FAB item re-
sponses and compare the resulting item discrim-
inations to those obtained using the traditional
posthoc method.
1 Introduction
The United States Social Security Administration (SSA),
the administrator of the largest federal disability benefits
program in the US, is tasked with determining the eligi-
bility of approximately two million applicants annually for
benefits. Determining a person’s ability to engage in work
is difficult. Additionally, capacity for work in individuals
may change over time and tools are needed for assessing
these changes, for instance in support of return-to-work
programs.
The statutory definition of disability requires determining
whether a person’s ability to work is limited by the pres-
ence of medical conditions (SSA). Modern models of dis-
ability such as the World Health Organization (WHO)’s
International Classification of Functioning, Disability and
Health (ICF) view disability as a biopsychosocial con-
struct (Brandt and Smalligan, 2019), contextualizing dis-
ability as an interaction between the functional capability
of individuals and the needs and opportunities of their envi-
ronment. Assessing disability through this lens is resource-
intensive, motivating the development of tools to aid in the
adjudication process by objectively characterizing the func-
tional ability of an applicant. The Work Disability Func-
tional Assessment Battery (WD-FAB) is such a tool for un-
derstanding work-related physical and mental function of
individuals relative to the working adult population based
on responses to a battery of items.
1.1 Work Disability Functional Assessment Battery
The WD-FAB was developed by researchers at the Boston
University Health and Disability Research Institute (BU) in
collaboration with the National Institutes of Health (NIH)
and with the support of the Social Security Administration
(SSA). The intended use of this instrument is to provide
more standardized and consistent information about an in-
dividual’s functional abilities to help inform SSA’s dis-
ability adjudication process. The WD-FAB provides eight
scores across two domains of physical and mental function
that are relevant to a person’s ability to work. The ICF is
one of the key frameworks for the content of these domains.
The ICF includes categories for classifying function at the
cellular, organ, and whole person level, referred to as activ-
ities and participation. The WD-FAB focuses on measuring
activity.
The development of the WD-FAB is detailed in several pa-
pers (Marfeo et al., 2018; Meterko et al., 2015; Jette et al.,
2019; Porcino et al., 2018). Subject matter experts used
the ICF, discipline-specific frameworks, and existing func-
tional assessment instruments to develop a bank of approx-
imately 300 physical and 300 mental items that pertain to
work-related function. They further divided the physical
items into four subcategories (PD - physical demands, PDR
- physical demands replenishment, PF - physical function,
DA - daily activities) and mental items into three categories
(CC - community cognition, II - interpersonal interactions,
BH - behavioral health) based on how they relate to ICF
content; however, they did not use this categorization in
their analyses.
The item banks consist of questions that ask about a range
of everyday type activities, such as vacuuming, emptying
a dishwasher, painting a room, walking a block, turning a
door knob, speaking to someone on the phone, and manag-
ing under stress. Valid responses were graded on either four
or five option Likert scales with ordinal responses such as
agreement (Strongly agree, Agree, Disagree, Strongly dis-
agree), or frequency (Never, Rarely, Sometimes, Often, Al-
ways). Overall, these studies collected item responses from
a total of 11,901 subjects sampled from claimants for dis-
ability benefits as well as working-age adults who represent
the general population of the United States.
The developers of the WD-FAB then followed the PROMIS
guidelines (Fries et al., 2014; Cella et al., 2007; DeWalt
et al., 2007) for measure development. They first per-
formed exploratory factor analysis on the response matrix,
the output of which is a collection of linear factors with
dense loadings. Then, they extracted the first four factors.
For each factor they used stepwise rejection of items based
on null hypothesis statistical testing, thresholding to select
a subset of items for each dimension. They then assessed
validity of unidimensionality of each of the item subsets
using confirmatory factor analysis. Finally, they calibrated
independent predictive models for how a person may re-
spond to each subset of items. Besides the arbitrariness of
the thresholds used for item selection, a major weakness
of this procedure is that the scale factorization is not
performed in a way that is mindful of the final nonlinear
model. Alternate item factorizations that do not arise from
the linear factor analyses are prematurely excluded, uncer-
tainty in the factorization is not propagated, and the IRT
model is effectively a posthoc analysis. For this reason, we
will refer to the prior WD-FAB instrument as the posthoc
WD-FAB.
1.2 Item Response Theory
Item response theory (IRT), a generative latent-variable
modeling framework, is the dominant statistical paradigm
for quantifying assessments. Some applications of IRT
include standardized tests such as the Graduate Record
Exam (GRE) (Kingston and Dorans, 1982), the Scholas-
tic Aptitude Test (SAT) (Carlson and von Davier, 2013)
and the Graduate Management Admission Test (Kingston
et al., 1985). Other applications of IRT include medi-
cal/psychological assessments such as activities of daily
living (Fieo et al., 2010), quality of life (Bilbao et al.,
2014), and personality tests (Goldberg, 1992; Bore et al.,
2020; Saunders and Ngo, 2017; DeYoung et al., 2016;
Funke, 2005; Spence et al., 2012). IRT also serves as the
theoretical basis for the WD-FAB (Meterko et al., 2015;
Marfeo et al., 2016, 2019; Chang et al., 2022b).
In item response theory (IRT), a person’s test responses
are modeled as an interaction between personal traits (also
called abilities) and item-specific parameters. The item pa-
rameters relate to the difficulty of the item and the discrim-
ination of the item, or the degree to which the question’s
responses are determined by personal traits. The two types
of attributes work together to predict an individual’s re-
sponses via item response functions. Conversely, a set of
responses may be statistically inverted in order to estimate
an individual’s ability. The central idea behind IRT is to
use person-specific abilities in order to make comparisons
between people in a population.
Multidimensional instruments: For complex phenom-
ena, such as disability, a single scalar factor cannot ade-
quately describe how a person would respond to a diverse
set of items (Yuker, 1994). In these cases, one can develop
a multidimensional IRT model (MIRT). Like in the WD-
FAB, MIRT models are typically composed of ensembles
of unidimensional models, developed using the stepwise
procedure of linear factor analyses followed by calibration
of disjoint nonlinear unidimensional IRT models. Each
step in these procedures requires statistical decisions –
in practice these decisions are performed using arbitrary P-
value cutoffs. Ultimately, the resulting MIRT model is a
post-hoc model, and the initial item partitioning steps are
not performed with consideration to how well the final IRT
model fits the data. This issue is problematic because abili-
ties are derived from response patterns with the assumption
that the model accurately represents the response patterns
of the population.
1.3 Novelty and relation to prior work
In this manuscript, we re-examine the methodology behind
the WD-FAB and highlight how modern statistical tech-
niques can improve it. Specifically, we show that proba-
bilistic autoencoders can serve as a complete pipeline for
translating survey responses into a set of interpretable in-
dicators about functional ability, with greater predictive
power than existing techniques. Prior work has noted that
IRT models are inherently similar to probabilistic autoen-
coders (Chang et al., 2019; Converse et al., 2019, 2021),
where an encoder performs amortized inference on person-
specific abilities. Viewing IRT models as a specific cate-
gory of autoencoders motivates extensions to standard IRT
methods. Prior work, however, has not constrained the encoder function to leave the statistics of the decoder unmodified.
Our main methodological contributions are: (1) the adaptation of Bayesian sparsity methods to perform factorization directly in an IRT model, and (2) the specification of an encoder function, fully specified by the decoder, that defines the “VBE” step of a variational Bayesian expectation maximization algorithm – and in doing so does not modify the statistics of the decoder.
2 Methods
2.1 Notation
The response data takes the form of a P × I matrix, where P corresponds to the number of people and I corresponds to the number of items. We denote this matrix X. Unless otherwise stated, we will index rows in this matrix using the symbol p and columns of this matrix using the symbol i. Each entry of this matrix is a valid response from the set {1, 2, ..., K}, where K = 5 for the WD-FAB.
Parameters in the model may vary according to person p, item i, and latent dimension d. We generally use bold letters for denoting the collection of all values of a parameter (e.g., θ denotes the collection of all ability parameters). For specific slices of a parameter we use bold lowercase symbols – for example, $\boldsymbol{\theta}_p = (\theta_p^{(1)}, \theta_p^{(2)}, \ldots, \theta_p^{(D)})$ corresponds to a vector of all ability parameters for person p.

In this manuscript we will denote the collection of all model parameters as Γ, the collection of all ability parameters as θ, and the collection of all model parameters except the ability parameters as Γ\θ.
2.2 Multidimensional IRT as a probabilistic
autoencoder
The unidimensional ability scale graded response model
(GRM) (Samejima, 1969) is an item response theory (IRT)
model for ordinal responses. The GRM states that the prob-
ability that person p responds to item i with a choice j is
\[
\Pr(X_{pi} = j \mid \theta_p, \boldsymbol{\tau}_i, \lambda_i) = \Pr(X_{pi} \geq j \mid \theta_p, \tau_{i,j}, \lambda_i) - \Pr(X_{pi} \geq j+1 \mid \theta_p, \tau_{i,j+1}, \lambda_i), \tag{1}
\]
where we define the GRM in its probit variation, utilizing the cumulative distribution function Φ for the unit normal distribution, so that
\[
\Pr(x \geq j \mid \theta, \tau_j, \lambda) =
\begin{cases}
\Phi\bigl(\lambda(\theta - \tau_j)\bigr) & j \in [2, K] \\
1 & j \leq 1 \\
0 & j > K .
\end{cases} \tag{2}
\]

Figure 1: Plate diagram corresponding to the multidimensional IRT model in Eq. 3. Applying the horseshoe prior to λ performs factorization for the model through sparsity. [Plate diagram nodes: X, λ, W, θ, τ, horseshoe; plates: P, I, D.]
Within the model, $\boldsymbol{\tau}_i = (\tau_{i,1}, \tau_{i,2}, \ldots)$, where $\tau_{i,j+1} \geq \tau_{i,j}$, are item difficulty parameters. The ability parameters $\theta_p$ map a person's ability ranking within their population to a real-valued scale. The remaining parameters $\lambda_i$ are item discrimination parameters – they represent how informative a particular item is to the scale, and vice versa. When the discrimination goes to zero, the item is effectively decoupled from the scale.
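As a concrete illustration of Eqs. 1-2, the following Python sketch (our own, not the WD-FAB codebase; the function name and the numpy/scipy dependencies are assumptions) computes the category probabilities of a single item for a given ability:

    import numpy as np
    from scipy.stats import norm

    def grm_category_probs(theta, tau, lam):
        """Category probabilities Pr(X = j), j = 1..K, under the probit GRM (Eqs. 1-2).

        theta : scalar ability
        tau   : increasing thresholds (tau_2, ..., tau_K), length K - 1
        lam   : non-negative discrimination
        """
        tau = np.asarray(tau, dtype=float)
        # Pr(x >= j): 1 for j <= 1, Phi(lam * (theta - tau_j)) for j in [2, K], 0 for j > K
        cum = np.concatenate(([1.0], norm.cdf(lam * (theta - tau)), [0.0]))
        # Eq. 1: Pr(X = j) is the difference of adjacent cumulative probabilities
        return cum[:-1] - cum[1:]

    # Example for a 5-category item:
    # probs = grm_category_probs(theta=0.3, tau=[-1.0, -0.2, 0.5, 1.4], lam=1.2)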
Extending the GRM to multiple ability scale dimensions,
we define a discrimination-weighted mixture GRM:
\[
\Pr\bigl(X_{pi} = j \mid \{\theta_p^{(d)}\}_d, \{\{\tau_{ij}^{(d)}\}_j\}_d, \{\lambda_i^{(d)}\}_d\bigr)
= \sum_{d=1}^{D} w_{id} \Pr\bigl(X_{pi} = j \mid \theta_p^{(d)}, \tau_{i,j}^{(d)}, \tau_{i,j+1}^{(d)}, \lambda_i^{(d)}\bigr),
\qquad
w_{id} = \lambda_i^{(d)} \Big/ \sum_{d=1}^{D} \lambda_i^{(d)}, \tag{3}
\]
noting that $\lambda_i^{(d)} = 0 \implies w_{id} = 0$; this form of weighting allows us to extend the GRM to a mixture model without needing to introduce any new free parameters. The de-
pendencies between the variables within this model are de-
picted in Fig. 1.
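To make Eq. 3 concrete, here is a small Python sketch (again our own illustration with assumed names, building on grm_category_probs above) of the discrimination-weighted mixture for a single person-item pair:

    import numpy as np

    def mixture_grm_probs(theta_p, tau_i, lam_i):
        """Pr(X_pi = j), j = 1..K, under the discrimination-weighted mixture GRM (Eq. 3).

        theta_p : (D,) abilities for person p
        tau_i   : (D, K-1) increasing thresholds for item i, one row per scale
        lam_i   : (D,) non-negative discriminations for item i
        """
        lam_i = np.asarray(lam_i, dtype=float)
        w = lam_i / lam_i.sum()                          # Eq. 3 weights; lam = 0 implies w = 0
        per_dim = np.stack([grm_category_probs(theta_p[d], tau_i[d], lam_i[d])
                            for d in range(len(lam_i))])  # (D, K) unidimensional GRM probs
        return w @ per_dim                                # mixture over the D scales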
This multidimensional IRT model assumes that each per-
son’s ability consists of D scales. The parameter $\theta_p^{(d)}$ is the ability for person p on scale d and $\lambda_i^{(d)}$ is the discrimination of item i with respect to scale d. It strongly resembles prob-
abilistic matrix factorization and other probabilistic autoen-
coders. When trained on a sample of individuals and their
responses, the model in Eq. 3 defines a total likelihood
\[
\pi(\mathbf{X} \mid \boldsymbol{\theta}, \boldsymbol{\lambda}, \boldsymbol{\tau}) =
\prod_{p} \prod_{i} \prod_{j} \Pr\bigl(X_{pi} = j \mid \{\theta_p^{(d)}\}_d, \{\{\tau_{ij}^{(d)}\}_j\}_d, \{\lambda_i^{(d)}\}_d\bigr)^{\delta_{X_{pi},j}} \tag{4}
\]
that takes as input a high-dimensional response matrix $\mathbf{X} = (X_{pi})$ and derives a lower-dimensional representation matrix $\boldsymbol{\theta} = (\theta_p^{(d)})_{pd}$, where the pth row in the representation matrix corresponds to the multidimensional ability for person p. The weight matrix $\mathbf{W} = (w_{id})_{id}$ decodes the ability components for an individual into probability masses for their item responses. This matrix serves the same purpose as a factor loading matrix in principal components analysis. Our objective is to obtain this matrix in unison with other model parameters that directly relate to how individuals might respond to a given item battery.
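For illustration, the likelihood of Eq. 4 can be evaluated directly from the mixture probabilities; the sketch below (assumed names, and a naive double loop rather than a vectorized implementation) computes the log-likelihood of a full response matrix:

    import numpy as np

    def total_log_likelihood(X, theta, tau, lam):
        """log pi(X | theta, lam, tau) of Eq. 4, summed over all observed responses.

        X     : (P, I) integer responses in 1..K
        theta : (P, D) ability matrix
        tau   : (I, D, K-1) threshold parameters
        lam   : (I, D) discrimination parameters
        """
        P, I = X.shape
        ll = 0.0
        for p in range(P):
            for i in range(I):
                probs = mixture_grm_probs(theta[p], tau[i], lam[i])
                ll += np.log(probs[X[p, i] - 1])  # the delta exponent selects the observed category
        return ll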
Sparse factorization: By determining the matrix W, we
factor the items into multiple scales. For improving the in-
terpretability of these factorizations, we seek sparse factors,
as in sparse probabilistic matrix factorization (Gopalan
et al., 2014; Mnih and Salakhutdinov, 2008; Chang et al.,
2019, 2020). We accomplish this goal by using the horse-
shoe priors (Carvalho et al., 2010; Bhadra et al., 2015,
2019) on the discrimination parameters on a scale-by-scale
basis. Our overall hierarchical probabilistic model for si-
multaneous factorization and calibration of the multidi-
mensional GRM is specified:
\begin{align}
\log \pi(\boldsymbol{\lambda} \mid \boldsymbol{\xi}, \boldsymbol{\kappa}, \boldsymbol{\eta}) &= -\sum_{i,d} \left[ \frac{\bigl(\lambda_i^{(d)}\bigr)^2}{2\bigl(\xi_i^{(d)} \kappa^{(d)}\bigr)^2} + \log\bigl(\xi_i^{(d)} \kappa^{(d)}\bigr) \right] + \sum_{i} \log \pi(\mathbf{w}_i \mid \eta_i) + \text{const} \tag{5a} \\
\pi(\mathbf{w}_i \mid \eta_i) &\propto \exp\left( \eta_i^{-1} \sum_{d} w_i^{(d)} \log w_i^{(d)} \right) \tag{5b} \\
\sigma_i^{(d)} &= \xi_i^{(d)} \kappa^{(d)}, \qquad \xi_i^{(d)} \sim \mathrm{cauchy}^{+}(0, 1) \tag{5c} \\
\eta_i &\sim \mathrm{normal}^{+}(0, \eta_0), \qquad \kappa^{(d)} \sim \mathrm{cauchy}^{+}\bigl(0, \kappa_0^{(d)}\bigr) \tag{5d} \\
\tau_{i,2}^{(d)} &\sim \mathrm{normal}\bigl(\mu_i^{(d)}, 1\bigr), \qquad \tau_{i,j}^{(d)} \mid \tau_{i,j-1}^{(d)} \sim \mathrm{normal}^{+}\bigl(\tau_{i,j-1}^{(d)}, 1\bigr) \tag{5e} \\
\mu_i^{(d)} &\sim \mathrm{normal}(0, 1), \qquad \theta_p^{(d)} \sim \mathrm{normal}(0, 1) \tag{5f}
\end{align}
where the discrimination parameters $\lambda_i^{(d)}$ are each constrained to non-negativity and we define a per-item entropy penalty in Eq. 5b.
The dimension-wise horseshoe priors on the discrimination
parameters encourage scale sparsity, and the item-wise en-
tropy priors encourage items to load into a small number
of scales.
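A minimal sketch of the resulting log prior over the discriminations, assuming the half-normal horseshoe form of Eqs. 5a and 5c together with the entropy penalty of Eq. 5b (function and argument names are ours, not the authors'):

    import numpy as np

    def sparse_grm_log_prior(lam, xi, kappa, eta):
        """Unnormalized log prior over the discriminations, per Eqs. 5a-5c (sketch).

        lam   : (I, D) non-negative discriminations
        xi    : (I, D) local half-Cauchy scales
        kappa : (D,)   global per-dimension scales
        eta   : (I,)   per-item entropy scaling factors
        """
        sigma = xi * kappa                                     # Eq. 5c: per-entry prior scale
        half_normal = -(lam ** 2) / (2.0 * sigma ** 2) - np.log(sigma)
        w = lam / lam.sum(axis=1, keepdims=True)               # Eq. 3 weights
        neg_entropy = np.sum(w * np.log(np.clip(w, 1e-12, None)), axis=1)
        return half_normal.sum() + (neg_entropy / eta).sum()   # Eq. 5a up to a constant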
Hyperparameter scaling: If the a priori expectation is that the dominant scale (on a per-item basis) holds weight q and the other weights are uniform, then $\eta_i = -q\log(q) - (1-q)\log\bigl((1-q)/(D-1)\bigr)$ is an appropriate value for the scaling factor $\eta_i$. In this manuscript we use q = 0.8.
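For example, this scaling factor is simply the entropy of the target weight profile; the sketch below (illustrative names, assuming D = 8 to match the eight WD-FAB scores) evaluates it for the paper's q = 0.8:

    import numpy as np

    def entropy_scale(q=0.8, D=8):
        """eta_i for a weight profile with dominant weight q on one scale and the
        remaining mass 1 - q spread uniformly over the other D - 1 scales."""
        return -q * np.log(q) - (1.0 - q) * np.log((1.0 - q) / (D - 1))

    # e.g. entropy_scale(q=0.8, D=8) is roughly 0.89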
The parameters $\kappa^{(d)}$ control the overall amount of sparsity in each scale dimension. For partitioning a set of I items into D dimensions, we expect each dimension to have approximately I/D nonzero terms. As in Piironen and Vehtari (2017b) and van der Pas et al. (2014), we derived an approximate scaling on $\kappa^{(d)}$ based on asymptotic approximation of the bias in the posterior mode. This approximation suggests the scaling $\kappa_0^{(d)} = \sqrt{\Delta(D, K, I)/P}$, where $\Delta(D, K, I)$ is a constant derived in the Supplemental Methods.
2.3 Autoencoded amortized inference
Item response models like the WD-FAB are intended to score new response patterns, effec-
tively reducing high-dimensional response vectors to low-
dimensional ability representations. In probabilistic au-
toencoders, the mapping is known as the encoder. As part
of training the generative hierarchical Bayesian model of
Eq. 5 (the decoder), we also learn the encoder function
$\mathrm{encoder}(\mathbf{X}_p) = q_{\theta_p} : \mathbb{R}^D \to \mathbb{R}^{+}$, where $q_{\theta_p}$ is an approximation of the marginal density $\pi(\boldsymbol{\theta}_p \mid \mathbf{X}_p)$. This surrogate density can then be used for approximating posterior expectations
\[
\mathbb{E}_{\boldsymbol{\theta}\mid\mathbf{X}}\bigl(g(\boldsymbol{\theta}_p) \mid \mathbf{X}_p, \mathbf{X}\bigr)
= \int g(\boldsymbol{\theta}_p) \iint \pi(\boldsymbol{\theta}_p \mid \boldsymbol{\lambda}, \boldsymbol{\tau}, \mathbf{X}_p)\, \pi(\boldsymbol{\lambda}, \boldsymbol{\tau} \mid \mathbf{X})\, d\boldsymbol{\lambda}\, d\boldsymbol{\tau}\, d\boldsymbol{\theta}_p
\approx \int g(\boldsymbol{\theta}_p)\, \mathrm{encoder}(\mathbf{X}_p)\, d\boldsymbol{\theta}_p. \tag{6}
\]
We note that the model defined in Eq. 5, without men-
tion of an encoder function, is already sufficiently defined
for Bayesian inference. For this reason, one needs to take
care that the encoder function does not modify the
statistics of the model. We do so by defining a variational
Bayesian expectation maximization (VBEM) algorithm for
inferring the model and solving for the implied encoder
function, which ends up obeying the integral relationship
in Eq. 6.
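As a usage sketch (our own, assuming for illustration that the encoder returns a diagonal Gaussian surrogate; all names are hypothetical), a posterior expectation as in Eq. 6 can be approximated by Monte Carlo over the encoder's output density:

    import numpy as np

    def encoder_expectation(g, encoder, X_p, n_samples=1000, rng=None):
        """Monte Carlo approximation of Eq. 6: E[g(theta_p) | X_p] under the surrogate
        density returned by the encoder (assumed here to be a diagonal Gaussian)."""
        rng = np.random.default_rng() if rng is None else rng
        mean, std = encoder(X_p)                    # assumed encoder output: (D,) mean and std
        draws = mean + std * rng.standard_normal((n_samples, mean.size))
        return np.mean([g(theta) for theta in draws], axis=0)

    # Example: a point score is the posterior mean, encoder_expectation(lambda t: t, encoder, X_p)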
2.4 Variational Bayesian EM
In the WD-FAB, the high dimensionality of the item bank
makes Markov-Chain Monte-Carlo based inference of the
model in Eq. 5 computationally impractical. Instead,
we developed an efficient variational Bayesian expecta-
tion maximization (VBEM) (Bernardo et al., 2003) pro-
cedure that resembles common training techniques used
for learning variational probabilistic autoencoders (Higgins
et al., 2016; Ainsworth et al., 2018; Ansari and Soh, 2018;
Kingma and Welling, 2013; Doersch, 2016). Additionally,