Autoencoded sparse Bayesian in-IRT factorization, calibration, and amortized
inference for the Work Disability Functional Assessment Battery
Joshua C. Chang Carson C. Chow Julia Porcino
NIH Clinical Center NIH NIDDK NIH Clinical Center
Abstract
The Work Disability Functional Assessment Bat-
tery (WD-FAB) is a multidimensional item re-
sponse theory (IRT) instrument designed for as-
sessing work-related mental and physical func-
tion based on responses to an item bank. In
prior iterations it was developed using traditional
means – linear factorization and null hypothesis
statistical testing for item partitioning/selection,
and finally, posthoc calibration of disjoint uni-
dimensional IRT models. As a result, the WD-
FAB, like many other IRT instruments, is a
posthoc model. Its item partitioning, based on
exploratory factor analysis, is blind to the fi-
nal nonlinear IRT model and is not performed
in a manner consistent with goodness of fit to
the final model. In this manuscript, we de-
velop a Bayesian hierarchical model for self-
consistently performing the following simulta-
neous tasks: scale factorization, item selection,
parameter identification, and response scoring.
This method uses sparsity-based shrinkage to ob-
viate the linear factorization and null hypoth-
esis statistical tests that are usually required
for developing multidimensional IRT models, so
that item partitioning is consistent with the ulti-
mate nonlinear factor model. We also analogize
our multidimensional IRT model to probabilis-
tic autoencoders, specifying an encoder function
that amortizes the inference of ability parame-
ters from item responses. The encoder function
is equivalent to the “VBE” step in a stochas-
tic variational Bayesian expectation maximiza-
tion (VBEM) procedure that we use for approxi-
mate Bayesian inference on the entire model. We
use the method on a sample of WD-FAB item re-
sponses and compare the resulting item discrim-
inations to those obtained using the traditional
posthoc method.
1 Introduction
The United States Social Security Administration (SSA),
the administrator of the largest federal disability benefits
program in the US, is tasked with determining the eligi-
bility of approximately two million applicants annually for
benefits. Determining a person’s ability to engage in work
is difficult. Additionally, capacity for work in individuals
may change over time and tools are needed for assessing
these changes, for instance in support of return-to-work
programs.
The statutory definition of disability requires determining
whether a person’s ability to work is limited by the pres-
ence of medical conditions (SSA). Modern models of dis-
ability such as the World Health Organization (WHO)’s
International Classification of Functioning, Disability and
Health (ICF) view disability as a biopsychosocial con-
struct (Brandt and Smalligan, 2019), contextualizing dis-
ability as an interaction between the functional capability
of individuals and the needs and opportunities of their envi-
ronment. Assessing disability through this lens is resource-
intensive, motivating the development of tools to aid in the
adjudication process by objectively characterizing the func-
tional ability of an applicant. The Work Disability Func-
tional Assessment Battery (WD-FAB) is such a tool for un-
derstanding work-related physical and mental function of
individuals relative to the working adult population based
on responses to a battery of items.
1.1 Work Disability Functional Assessment Battery
The WD-FAB was developed by researchers at the Boston
University Health and Disability Research Institute (BU) in
collaboration with the National Institutes of Health (NIH)
and with the support of the Social Security Administration
(SSA). The intended use of this instrument is to provide
more standardized and consistent information about an in-
dividual’s functional abilities to help inform SSA’s dis-
ability adjudication process. The WD-FAB provides eight
scores across two domains of physical and mental function
that are relevant to a person’s ability to work. The ICF is
one of the key frameworks for the content of these domains.
The ICF includes categories for classifying function at the
cellular, organ, and whole person level, referred to as activ-
ities and participation. The WD-FAB focuses on measuring
activity.
The development of the WD-FAB is detailed in several pa-
pers (Marfeo et al., 2018; Meterko et al., 2015; Jette et al.,
2019; Porcino et al., 2018). Subject matter experts used
the ICF, discipline-specific frameworks, and existing func-
tional assessment instruments to develop a bank of approx-
imately 300 physical and 300 mental items that pertain to
work-related function. They further divided the physical
items into four subcategories (PD - physical demands, PDR
- physical demands replenishment, PF - physical function,
DA - daily activities) and mental items into three categories
(CC - community cognition, II - interpersonal interactions,
BH - behavioral health) based on how they relate to ICF
content; however, they did not use this categorization in
their analyses.
The item banks consist of questions that ask about a range
of everyday type activities, such as vacuuming, emptying
a dishwasher, painting a room, walking a block, turning a
door knob, speaking to someone on the phone, and manag-
ing under stress. Valid responses were graded on either four
or five option Likert scales with ordinal responses such as
agreement (Strongly agree, Agree, Disagree, Strongly dis-
agree), or frequency (Never, Rarely, Sometimes, Often, Al-
ways). Overall, these studies collected item responses from
a total of 11,901 subjects sampled from claimants for dis-
ability benefits as well as working-age adults who represent
the general population of the United States.
The developers of the WD-FAB then followed the PROMIS
guidelines (Fries et al., 2014; Cella et al., 2007; DeWalt
et al., 2007) for measure development. They first per-
formed exploratory factor analysis on the response matrix,
the output of which is a collection of linear factors with
dense loadings. Then, they extracted the first four factors.
For each factor they used stepwise rejection of items based
on null hypothesis statistical testing, thresholding to select
a subset of items for each dimension. They then assessed
validity of unidimensionality of each of the item subsets
using confirmatory factor analysis. Finally, they calibrated
independent predictive models for how a person may re-
spond to each subset of items. Besides the arbitrariness of
the thresholds used for item selection, a major weakness
of this procedure is that the scale factorization is not
performed in a way that is mindful of the final nonlinear
model. Alternate item factorizations that do not arise from
the linear factor analyses are prematurely excluded, uncer-
tainty in the factorization is not propagated, and the IRT
model is effectively a posthoc analysis. For this reason, we
will refer to the prior WD-FAB instrument as the posthoc
WD-FAB.
1.2 Item Response Theory
Item response theory (IRT), a generative latent-variable
modeling framework, is the dominant statistical paradigm
for quantifying assessments. Some applications of IRT
include standardized tests such as the Graduate Record
Exam (GRE) (Kingston and Dorans, 1982), the Scholas-
tic Aptitude Test (SAT) (Carlson and von Davier, 2013)
and the Graduate Management Admission Test (Kingston
et al., 1985). Other applications of IRT include medi-
cal/psychological assessments such as activities of daily
living (Fieo et al., 2010), quality of life (Bilbao et al.,
2014), and personality tests (Goldberg, 1992; Bore et al.,
2020; Saunders and Ngo, 2017; DeYoung et al., 2016;
Funke, 2005; Spence et al., 2012). IRT also serves as the
theoretical basis for the WD-FAB (Meterko et al., 2015;
Marfeo et al., 2016, 2019; Chang et al., 2022b).
In item response theory (IRT), a person’s test responses
are modeled as an interaction between personal traits (also
called abilities) and item-specific parameters. The item pa-
rameters relate to the difficulty of the item and the discrim-
ination of the item, or the degree to which the question’s
responses are determined by personal traits. The two types
of attributes work together to predict an individual’s re-
sponses via item response functions. Conversely, a set of
responses may be statistically inverted in order to estimate
an individual’s ability. The central idea behind IRT is to
use person-specific abilities in order to make comparisons
between people in a population.
Multidimensional instruments: For complex phenom-
ena, such as disability, a single scalar factor cannot ade-
quately describe how a person would respond to a diverse
set of items (Yuker, 1994). In these cases, one can develop
a multidimensional IRT model (MIRT). Like in the WD-
FAB, MIRT models are typically composed of ensembles
of unidimensional models, developed using the stepwise
procedure of linear factor analyses followed by calibration
of disjoint nonlinear unidimensional IRT models. Each
step in these procedures requires statistical decisions –
in practice these decisions are performed using arbitrary P-
value cutoffs. Ultimately, the resulting MIRT model is a
post-hoc model, and the initial item partitioning steps are
not performed with consideration to how well the final IRT
model fits the data. This issue is problematic because abili-
ties are derived from response patterns with the assumption
that the model accurately represents the response patterns
of the population.
1.3 Novelty and relation to prior work
In this manuscript, we re-examine the methodology behind
the WD-FAB and highlight how modern statistical tech-
niques can improve it. Specifically, we show that proba-
bilistic autoencoders can serve as a complete pipeline for
translating survey responses into a set of interpretable in-
dicators about functional ability, with greater predictive
power than existing techniques. Prior work has noted that
IRT models are inherently similar to probabilistic autoen-
coders (Chang et al., 2019; Converse et al., 2019, 2021),
where an encoder performs amortized inference on person-
specific abilities. Viewing IRT models as a specific cate-
gory of autoencoders motivates extensions to standard IRT
methods. Prior work, however, has not constrained the encoder function to leave the statistics of the decoder unmodified.
Our main methodological contributions are: (1) the adaptation of Bayesian sparsity methods to perform factorization directly in an IRT model, and (2) the specification of an encoder function, fully specified by the decoder, that defines the “VBE” step of a variational Bayesian expectation maximization algorithm – and in doing so does not modify the statistics of the decoder.
2 Methods
2.1 Notation
The response data takes the form of a P × I matrix, where P corresponds to the number of people and I corresponds to the number of items. We denote this matrix X. Unless otherwise stated, we will index rows in this matrix using the symbol p and columns of this matrix using the symbol i. Each entry of this matrix is a valid response from the set {1, 2, ..., K}, where K = 5 for the WD-FAB.
Parameters in the model may vary according to person p, item i, and latent dimension d. We generally use bold letters for denoting the collection of all values of a parameter (e.g., θ denotes the collection of all ability parameters). For specific slices of a parameter we use bold lowercase symbols – for example, $\boldsymbol{\theta}_p = (\theta_p^{(1)}, \theta_p^{(2)}, \ldots, \theta_p^{(D)})$ corresponds to a vector of all ability parameters for person p.

In this manuscript we will denote the collection of all model parameters as Γ, the collection of all ability parameters as θ, and the collection of all model parameters except the ability parameters as Γ\θ.
2.2 Multidimensional IRT as a probabilistic
autoencoder
The unidimensional ability scale graded response model
(GRM) (Samejima, 1969) is an item response theory (IRT)
model for ordinal responses. The GRM states that the prob-
ability that person p responds to item i with a choice j is
\[
\Pr(X_{pi} = j \mid \theta_p, \boldsymbol{\tau}_i, \lambda_i) = \Pr(X_{pi} \geq j \mid \theta_p, \tau_{i,j}, \lambda_i) - \Pr(X_{pi} \geq j+1 \mid \theta_p, \tau_{i,j+1}, \lambda_i), \tag{1}
\]
where we define the GRM in its probit variation, utilizing the cumulative distribution function Φ for the unit normal distribution, so that
\[
\Pr(x \geq j \mid \theta, \tau_j, \lambda) =
\begin{cases}
\Phi\bigl(\lambda(\theta - \tau_j)\bigr) & j \in [2, K] \\
1 & j \leq 1 \\
0 & j > K .
\end{cases} \tag{2}
\]

Figure 1: Plate diagram corresponding to the multidimensional IRT model in Eq. 3. Applying the horseshoe prior to λ performs factorization for the model through sparsity. [Plate diagram nodes: X, λ, W, θ, τ, horseshoe; plates: P, I, D.]
Within the model, $\boldsymbol{\tau}_i = (\tau_{i,1}, \tau_{i,2}, \ldots)$, where $\tau_{i,j+1} \geq \tau_{i,j}$, are item difficulty parameters. The ability parameters $\theta_p$ map a person's ability ranking within their population to a real-valued scale. The remaining parameters $\lambda_i$ are item discrimination parameters – they represent how informative a particular item is to the scale, and vice versa. When the discrimination goes to zero, the item is effectively decoupled from the scale.
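As a concrete illustration of Eqs. 1-2, the following Python sketch (our own, not the WD-FAB codebase; the function name and the numpy/scipy dependencies are assumptions) computes the category probabilities of a single item for a given ability:

    import numpy as np
    from scipy.stats import norm

    def grm_category_probs(theta, tau, lam):
        """Category probabilities Pr(X = j), j = 1..K, under the probit GRM (Eqs. 1-2).

        theta : scalar ability
        tau   : increasing thresholds (tau_2, ..., tau_K), length K - 1
        lam   : non-negative discrimination
        """
        tau = np.asarray(tau, dtype=float)
        # Pr(x >= j): 1 for j <= 1, Phi(lam * (theta - tau_j)) for j in [2, K], 0 for j > K
        cum = np.concatenate(([1.0], norm.cdf(lam * (theta - tau)), [0.0]))
        # Eq. 1: Pr(X = j) is the difference of adjacent cumulative probabilities
        return cum[:-1] - cum[1:]

    # Example for a 5-category item:
    # probs = grm_category_probs(theta=0.3, tau=[-1.0, -0.2, 0.5, 1.4], lam=1.2)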
Extending the GRM to multiple ability scale dimensions,
we define a discrimination-weighted mixture GRM:
\[
\Pr\bigl(X_{pi} = j \mid \{\theta_p^{(d)}\}_d, \{\{\tau_{ij}^{(d)}\}_j\}_d, \{\lambda_i^{(d)}\}_d\bigr)
= \sum_{d=1}^{D} w_{id} \Pr\bigl(X_{pi} = j \mid \theta_p^{(d)}, \tau_{i,j}^{(d)}, \tau_{i,j+1}^{(d)}, \lambda_i^{(d)}\bigr),
\qquad
w_{id} = \lambda_i^{(d)} \Big/ \sum_{d=1}^{D} \lambda_i^{(d)}, \tag{3}
\]
noting that $\lambda_i^{(d)} = 0 \implies w_{id} = 0$; this form of weighting allows us to extend the GRM to a mixture model without needing to introduce any new free parameters. The de-
pendencies between the variables within this model are de-
picted in Fig. 1.
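To make Eq. 3 concrete, here is a small Python sketch (again our own illustration with assumed names, building on grm_category_probs above) of the discrimination-weighted mixture for a single person-item pair:

    import numpy as np

    def mixture_grm_probs(theta_p, tau_i, lam_i):
        """Pr(X_pi = j), j = 1..K, under the discrimination-weighted mixture GRM (Eq. 3).

        theta_p : (D,) abilities for person p
        tau_i   : (D, K-1) increasing thresholds for item i, one row per scale
        lam_i   : (D,) non-negative discriminations for item i
        """
        lam_i = np.asarray(lam_i, dtype=float)
        w = lam_i / lam_i.sum()                          # Eq. 3 weights; lam = 0 implies w = 0
        per_dim = np.stack([grm_category_probs(theta_p[d], tau_i[d], lam_i[d])
                            for d in range(len(lam_i))])  # (D, K) unidimensional GRM probs
        return w @ per_dim                                # mixture over the D scales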
This multidimensional IRT model assumes that each per-
son’s ability consists of D scales. The parameter $\theta_p^{(d)}$ is the ability for person p on scale d and $\lambda_i^{(d)}$ is the discrimination of item i with respect to scale d. It strongly resembles prob-
abilistic matrix factorization and other probabilistic autoen-
coders. When trained on a sample of individuals and their
responses, the model in Eq. 3 defines a total likelihood
\[
\pi(\mathbf{X} \mid \boldsymbol{\theta}, \boldsymbol{\lambda}, \boldsymbol{\tau}) =
\prod_{p} \prod_{i} \prod_{j} \Pr\bigl(X_{pi} = j \mid \{\theta_p^{(d)}\}_d, \{\{\tau_{ij}^{(d)}\}_j\}_d, \{\lambda_i^{(d)}\}_d\bigr)^{\delta_{X_{pi},j}} \tag{4}
\]
that takes as input a high-dimensional response matrix $\mathbf{X} = (X_{pi})$ and derives a lower-dimensional representation matrix $\boldsymbol{\theta} = (\theta_p^{(d)})_{pd}$, where the pth row in the representation matrix corresponds to the multidimensional ability for person p. The weight matrix $\mathbf{W} = (w_{id})_{id}$ decodes the ability components for an individual into probability masses for their item responses. This matrix serves the same purpose as a factor loading matrix in principal components analysis. Our objective is to obtain this matrix in unison with other model parameters that directly relate to how individuals might respond to a given item battery.
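For illustration, the likelihood of Eq. 4 can be evaluated directly from the mixture probabilities; the sketch below (assumed names, and a naive double loop rather than a vectorized implementation) computes the log-likelihood of a full response matrix:

    import numpy as np

    def total_log_likelihood(X, theta, tau, lam):
        """log pi(X | theta, lam, tau) of Eq. 4, summed over all observed responses.

        X     : (P, I) integer responses in 1..K
        theta : (P, D) ability matrix
        tau   : (I, D, K-1) threshold parameters
        lam   : (I, D) discrimination parameters
        """
        P, I = X.shape
        ll = 0.0
        for p in range(P):
            for i in range(I):
                probs = mixture_grm_probs(theta[p], tau[i], lam[i])
                ll += np.log(probs[X[p, i] - 1])  # the delta exponent selects the observed category
        return ll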
Sparse factorization: By determining the matrix W, we
factor the items into multiple scales. For improving the in-
terpretability of these factorizations, we seek sparse factors,
as in sparse probabilistic matrix factorization (Gopalan
et al., 2014; Mnih and Salakhutdinov, 2008; Chang et al.,
2019, 2020). We accomplish this goal by using the horse-
shoe priors (Carvalho et al., 2010; Bhadra et al., 2015,
2019) on the discrimination parameters on a scale-by-scale
basis. Our overall hierarchical probabilistic model for si-
multaneous factorization and calibration of the multidi-
mensional GRM is specified:
\begin{align}
\log \pi(\boldsymbol{\lambda} \mid \boldsymbol{\xi}, \boldsymbol{\kappa}, \boldsymbol{\eta}) &= -\sum_{i,d} \left[ \frac{\bigl(\lambda_i^{(d)}\bigr)^2}{2\bigl(\xi_i^{(d)} \kappa^{(d)}\bigr)^2} + \log\bigl(\xi_i^{(d)} \kappa^{(d)}\bigr) \right] + \sum_{i} \log \pi(\mathbf{w}_i \mid \eta_i) + \text{const} \tag{5a} \\
\pi(\mathbf{w}_i \mid \eta_i) &\propto \exp\left( \eta_i^{-1} \sum_{d} w_i^{(d)} \log w_i^{(d)} \right) \tag{5b} \\
\sigma_i^{(d)} &= \xi_i^{(d)} \kappa^{(d)}, \qquad \xi_i^{(d)} \sim \mathrm{cauchy}^{+}(0, 1) \tag{5c} \\
\eta_i &\sim \mathrm{normal}^{+}(0, \eta_0), \qquad \kappa^{(d)} \sim \mathrm{cauchy}^{+}\bigl(0, \kappa_0^{(d)}\bigr) \tag{5d} \\
\tau_{i,2}^{(d)} &\sim \mathrm{normal}\bigl(\mu_i^{(d)}, 1\bigr), \qquad \tau_{i,j}^{(d)} \mid \tau_{i,j-1}^{(d)} \sim \mathrm{normal}^{+}\bigl(\tau_{i,j-1}^{(d)}, 1\bigr) \tag{5e} \\
\mu_i^{(d)} &\sim \mathrm{normal}(0, 1), \qquad \theta_p^{(d)} \sim \mathrm{normal}(0, 1) \tag{5f}
\end{align}
where the discrimination parameters $\lambda_i^{(d)}$ are each constrained to non-negativity and we define a per-item entropy penalty in Eq. 5b.
The dimension-wise horseshoe priors on the discrimination
parameters encourage scale sparsity, and the item-wise en-
tropy priors encourage items to load into a small number
of scales.
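A minimal sketch of the resulting log prior over the discriminations, assuming the half-normal horseshoe form of Eqs. 5a and 5c together with the entropy penalty of Eq. 5b (function and argument names are ours, not the authors'):

    import numpy as np

    def sparse_grm_log_prior(lam, xi, kappa, eta):
        """Unnormalized log prior over the discriminations, per Eqs. 5a-5c (sketch).

        lam   : (I, D) non-negative discriminations
        xi    : (I, D) local half-Cauchy scales
        kappa : (D,)   global per-dimension scales
        eta   : (I,)   per-item entropy scaling factors
        """
        sigma = xi * kappa                                     # Eq. 5c: per-entry prior scale
        half_normal = -(lam ** 2) / (2.0 * sigma ** 2) - np.log(sigma)
        w = lam / lam.sum(axis=1, keepdims=True)               # Eq. 3 weights
        neg_entropy = np.sum(w * np.log(np.clip(w, 1e-12, None)), axis=1)
        return half_normal.sum() + (neg_entropy / eta).sum()   # Eq. 5a up to a constant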
Hyperparameter scaling: If the a priori expectation is that the dominant scale (on a per-item basis) holds weight q and the other weights are uniform, then $\eta_i = -q\log(q) - (1-q)\log\bigl((1-q)/(D-1)\bigr)$ is an appropriate value for the scaling factor $\eta_i$. In this manuscript we use q = 0.8.
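For example, this scaling factor is simply the entropy of the target weight profile; the sketch below (illustrative names, assuming D = 8 to match the eight WD-FAB scores) evaluates it for the paper's q = 0.8:

    import numpy as np

    def entropy_scale(q=0.8, D=8):
        """eta_i for a weight profile with dominant weight q on one scale and the
        remaining mass 1 - q spread uniformly over the other D - 1 scales."""
        return -q * np.log(q) - (1.0 - q) * np.log((1.0 - q) / (D - 1))

    # e.g. entropy_scale(q=0.8, D=8) is roughly 0.89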
The parameters $\kappa^{(d)}$ control the overall amount of sparsity in each scale dimension. For partitioning a set of I items into D dimensions, we expect each dimension to have approximately I/D nonzero terms. As in Piironen and Vehtari (2017b) and van der Pas et al. (2014), we derived an approximate scaling on $\kappa^{(d)}$ based on asymptotic approximation of the bias in the posterior mode. This approximation suggests the scaling $\kappa_0^{(d)} = \sqrt{\Delta(D, K, I)/P}$, where $\Delta(D, K, I)$ is a constant derived in the Supplemental Methods.
2.3 Autoencoded amortized inference
Item response models like the WD-FAB are intended to score new response patterns, effec-
tively reducing high-dimensional response vectors to low-
dimensional ability representations. In probabilistic au-
toencoders, the mapping is known as the encoder. As part
of training the generative hierarchical Bayesian model of
Eq. 5 (the decoder), we also learn the encoder function
$\mathrm{encoder}(\mathbf{X}_p) = q_{\theta_p} : \mathbb{R}^D \to \mathbb{R}^{+}$, where $q_{\theta_p}$ is an approximation of the marginal density $\pi(\boldsymbol{\theta}_p \mid \mathbf{X}_p)$. This surrogate density can then be used for approximating posterior expectations
\[
\mathbb{E}_{\boldsymbol{\theta}\mid\mathbf{X}}\bigl(g(\boldsymbol{\theta}_p) \mid \mathbf{X}_p, \mathbf{X}\bigr)
= \int g(\boldsymbol{\theta}_p) \iint \pi(\boldsymbol{\theta}_p \mid \boldsymbol{\lambda}, \boldsymbol{\tau}, \mathbf{X}_p)\, \pi(\boldsymbol{\lambda}, \boldsymbol{\tau} \mid \mathbf{X})\, d\boldsymbol{\lambda}\, d\boldsymbol{\tau}\, d\boldsymbol{\theta}_p
\approx \int g(\boldsymbol{\theta}_p)\, \mathrm{encoder}(\mathbf{X}_p)\, d\boldsymbol{\theta}_p. \tag{6}
\]
We note that the model defined in Eq. 5, without men-
tion of an encoder function, is already sufficiently defined
for Bayesian inference. For this reason, one needs to take
care that the encoder function does not modify the
statistics of the model. We do so by defining a variational
Bayesian expectation maximization (VBEM) algorithm for
inferring the model and solving for the implied encoder
function, which ends up obeying the integral relationship
in Eq. 6.
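As a usage sketch (our own, assuming for illustration that the encoder returns a diagonal Gaussian surrogate; all names are hypothetical), a posterior expectation as in Eq. 6 can be approximated by Monte Carlo over the encoder's output density:

    import numpy as np

    def encoder_expectation(g, encoder, X_p, n_samples=1000, rng=None):
        """Monte Carlo approximation of Eq. 6: E[g(theta_p) | X_p] under the surrogate
        density returned by the encoder (assumed here to be a diagonal Gaussian)."""
        rng = np.random.default_rng() if rng is None else rng
        mean, std = encoder(X_p)                    # assumed encoder output: (D,) mean and std
        draws = mean + std * rng.standard_normal((n_samples, mean.size))
        return np.mean([g(theta) for theta in draws], axis=0)

    # Example: a point score is the posterior mean, encoder_expectation(lambda t: t, encoder, X_p)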
2.4 Variational Bayesian EM
In the WD-FAB, the high dimensionality of the item bank
makes Markov-Chain Monte-Carlo based inference of the
model in Eq. 5 computationally impractical. Instead,
we developed an efficient variational Bayesian expecta-
tion maximization (VBEM) (Bernardo et al., 2003) pro-
cedure that resembles common training techniques used
for learning variational probabilistic autoencoders (Higgins
et al., 2016; Ainsworth et al., 2018; Ansari and Soh, 2018;
Kingma and Welling, 2013; Doersch, 2016). Additionally,