
The WD-FAB produces scores across two domains of physical and mental function
that are relevant to a person’s ability to work. The ICF is
one of the key frameworks for the content of these domains.
The ICF includes categories for classifying function at the cellular, organ, and whole-person levels; function at the whole-person level is referred to as activities and participation. The WD-FAB focuses on measuring activity.
The development of the WD-FAB is detailed in several pa-
pers (Marfeo et al., 2018; Meterko et al., 2015; Jette et al.,
2019; Porcino et al., 2018). Subject matter experts used the ICF, discipline-specific frameworks, and existing functional assessment instruments to develop a bank of approximately 300 physical and 300 mental items that pertain to work-related function. They further divided the physical items into four subcategories (PD - physical demands, PDR - physical demands replenishment, PF - physical function, DA - daily activities) and the mental items into three subcategories (CC - community cognition, II - interpersonal interactions, BH - behavioral health) based on how they relate to ICF content; however, they did not use this categorization in their analyses.
The item banks consist of questions that ask about a range of everyday activities, such as vacuuming, emptying a dishwasher, painting a room, walking a block, turning a door knob, speaking to someone on the phone, and managing under stress. Valid responses were graded on either four- or five-option Likert scales with ordinal levels such as agreement (Strongly agree, Agree, Disagree, Strongly disagree) or frequency (Never, Rarely, Sometimes, Often, Always). Overall, these studies collected item responses from a total of 11,901 subjects sampled from claimants for disability benefits as well as working-age adults representative of the general population of the United States.
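For illustration, the following minimal sketch (in Python, with hypothetical item names and toy data) shows how such ordinal Likert labels might be encoded into an integer-valued response matrix prior to analysis:

```python
import pandas as pd

# Hypothetical response levels; the actual WD-FAB item coding may differ.
AGREEMENT = ["Strongly disagree", "Disagree", "Agree", "Strongly agree"]
FREQUENCY = ["Never", "Rarely", "Sometimes", "Often", "Always"]

def encode_ordinal(column, levels):
    """Map ordinal Likert labels to integers 0..K-1 (NaN if unmatched)."""
    return column.map({label: k for k, label in enumerate(levels)})

# Toy responses to two hypothetical items from the bank.
raw = pd.DataFrame({
    "vacuum_carpet": ["Agree", "Strongly agree", "Disagree"],
    "manage_stress": ["Often", "Never", "Sometimes"],
})
X = pd.DataFrame({
    "vacuum_carpet": encode_ordinal(raw["vacuum_carpet"], AGREEMENT),
    "manage_stress": encode_ordinal(raw["manage_stress"], FREQUENCY),
})
print(X)
```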
The developers of the WD-FAB then followed the PROMIS
guidelines (Fries et al., 2014; Cella et al., 2007; DeWalt
et al., 2007) for measure development. They first per-
formed exploratory factor analysis on the response matrix,
the output of which is a collection of linear factors with
dense loadings. Then, they extracted the first four factors.
For each factor, they used stepwise rejection of items based on null-hypothesis statistical testing, thresholding to select a subset of items for each dimension. They then assessed the unidimensionality of each item subset using confirmatory factor analysis. Finally, they calibrated independent predictive models of how a person would respond to each subset of items. Besides the arbitrariness of the thresholds used for item selection, a major weakness of this procedure is that the scale factorization is not performed with the final nonlinear model in mind. Alternate item factorizations that do not arise from the linear factor analyses are prematurely excluded, uncertainty in the factorization is not propagated, and the IRT model is effectively a posthoc analysis. For this reason, we will refer to the prior WD-FAB instrument as the posthoc WD-FAB.
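A schematic of this stepwise pipeline is sketched below, using scikit-learn's FactorAnalysis as a stand-in for the software the developers actually used; the cutoff value and the use of raw integer responses (rather than, e.g., polychoric correlations) are illustrative simplifications, not the published choices:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(500, 40)).astype(float)  # toy response matrix

# Step 1: exploratory factor analysis yielding dense linear loadings.
fa = FactorAnalysis(n_components=4).fit(X)
loadings = fa.components_  # shape (n_factors, n_items)

# Step 2: threshold the loadings to assign items to factors.
# The 0.4 cutoff is an illustrative placeholder, not a published value.
CUTOFF = 0.4
item_subsets = {f: np.flatnonzero(np.abs(loadings[f]) > CUTOFF)
                for f in range(loadings.shape[0])}

# Steps 3-4 (confirmatory factor analysis on each subset, then calibration
# of a separate unidimensional IRT model per subset) run downstream;
# crucially, nothing from those models feeds back into the factorization.
```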
1.2 Item Response Theory
Item response theory (IRT), a generative latent-variable
modeling framework, is the dominant statistical paradigm
for quantifying assessments. Applications of IRT include standardized tests such as the Graduate Record Examination (GRE) (Kingston and Dorans, 1982), the Scholastic Aptitude Test (SAT) (Carlson and von Davier, 2013), and the Graduate Management Admission Test (Kingston et al., 1985). Other applications of IRT include medical and psychological assessments such as activities of daily
living (Fieo et al., 2010), quality of life (Bilbao et al.,
2014), and personality tests (Goldberg, 1992; Bore et al.,
2020; Saunders and Ngo, 2017; DeYoung et al., 2016;
Funke, 2005; Spence et al., 2012). IRT also serves as the
theoretical basis for the WD-FAB (Meterko et al., 2015;
Marfeo et al., 2016, 2019; Chang et al., 2022b).
In IRT, a person's test responses are modeled as an interaction between personal traits (also called abilities) and item-specific parameters. The item parameters encode the difficulty of the item and its discrimination, that is, the degree to which responses to the item are determined by the personal traits. These two sets of attributes jointly predict an individual's responses via item response functions. Conversely, a set of responses may be statistically inverted in order to estimate an individual's ability. The central idea behind IRT is to use person-specific abilities to make comparisons between people in a population.
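For example, in the standard two-parameter logistic (2PL) model for dichotomous items (a textbook special case rather than the polytomous models used for the WD-FAB), the probability that a person with ability $\theta$ endorses item $i$ is
$$\Pr(y_i = 1 \mid \theta) = \frac{1}{1 + \exp\left(-a_i(\theta - b_i)\right)},$$
where $b_i$ is the item's difficulty and $a_i$ its discrimination. Graded response models for Likert items generalize this form by placing ordered difficulty thresholds between adjacent response levels.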
Multidimensional instruments: For complex phenom-
ena, such as disability, a single scalar factor cannot ade-
quately describe how a person would respond to a diverse
set of items (Yuker, 1994). In these cases, one can develop
a multidimensional IRT model (MIRT). As in the WD-FAB, MIRT models are typically composed of ensembles of unidimensional models, developed using the stepwise procedure of linear factor analyses followed by calibration of disjoint nonlinear unidimensional IRT models. Each step in these procedures requires statistical decisions; in practice, these decisions are made using arbitrary P-value cutoffs. Ultimately, the resulting MIRT model is a posthoc model, and the initial item partitioning steps are not performed with regard to how well the final IRT model fits the data. This issue is problematic because abilities are derived from response patterns under the assumption that the model accurately represents the response patterns of the population.
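As a concrete sketch (a minimal illustration, not the model calibrated in the cited studies), a multidimensional 2PL item response function replaces the scalar discrimination with a loading vector over the latent dimensions; a sparse factorization corresponds to most entries of that vector being zero:

```python
import numpy as np

def mirt_2pl_prob(theta, a, b):
    """Multidimensional 2PL: P(y=1 | theta) = sigmoid(a . theta - b).

    theta : latent ability vector, shape (D,)
    a     : item loading/discrimination vector, shape (D,); under a
            sparse factorization most entries are (near) zero
    b     : scalar item difficulty/intercept
    """
    return 1.0 / (1.0 + np.exp(-(a @ theta - b)))

# An item that loads only on the first of four latent dimensions.
theta = np.array([0.5, -1.0, 0.2, 0.0])
a = np.array([1.3, 0.0, 0.0, 0.0])  # sparse loading vector
print(mirt_2pl_prob(theta, a, b=0.4))
```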
1.3 Novelty and relation to prior work
In this manuscript, we re-examine the methodology behind
the WD-FAB and highlight how modern statistical tech-
niques can improve it. Specifically, we show that proba-