A Spectral Approach to Item Response Theory
Duc Nguyen
Department of Computer and Information Science
University of Pennsylvania
mdnguyen@seas.upenn.edu
Anderson Y. Zhang
Department of Statistics and Data Science
University of Pennsylvania
ayz@wharton.upenn.edu
Abstract
The Rasch model is one of the most fundamental models in item response theory and has wide-ranging applications from education testing to recommendation systems. In a universe with $n$ users and $m$ items, the Rasch model assumes that the binary response $X_{li} \in \{0,1\}$ of a user $l$ with parameter $\theta_l$ to an item $i$ with parameter $\beta_i$ (e.g., a user likes a movie, a student correctly solves a problem) is distributed as $P(X_{li} = 1) = 1/(1 + \exp(-(\theta_l - \beta_i)))$. In this paper, we propose a new item estimation algorithm for this celebrated model (i.e., to estimate $\beta$). The core of our algorithm is the computation of the stationary distribution of a Markov chain defined on an item-item graph. We complement our algorithmic contributions with finite-sample error guarantees, the first of their kind in the literature, showing that our algorithm is consistent and enjoys favorable optimality properties. We discuss practical modifications to accelerate and robustify the algorithm that practitioners can adopt. Experiments on synthetic and real-life datasets, ranging from small education testing datasets to large recommendation systems datasets, show that our algorithm is scalable, accurate, and competitive with the most commonly used methods in the literature.
1 Introduction
Item response theory (IRT) is the study of the relationship between latent characteristics (a student's ability versus a test's difficulty, or a user's taste versus a movie's features) and the manifestations of these characteristics (a student's performance on a test or a user's rating of a movie). Originally developed by the psychometric community [45, 51], item response theory has been applied to diverse settings such as education testing [37], crowdsourcing [54], recommendation systems [12], finance [50] and marketing research [10].
One of the most fundamental models in IRT is the Rasch model [45]. It models the binary response $X_{li} \in \{0,1\}$ of user $l$ with latent parameter $\theta_l \in \mathbb{R}$ to item $i$ with latent parameter $\beta_i \in \mathbb{R}$ by
$$P(X_{li} = 1) = \frac{1}{1 + \exp(-(\theta_l - \beta_i))}. \qquad (1)$$
For example, in education testing, $\theta_l$ corresponds to the ability of student $l$, $\beta_i$ to the difficulty of problem $i$, and $X_{li} = 1$ if the student correctly solves the problem. Binary response data has grown abundant in modern domains: Netflix famously switched from a 5-star rating system to a binary like/dislike feedback system, and data on students' engagement and performance grew significantly as education moved online during the pandemic.
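For concreteness, Equation (1) is a one-line function. A minimal sketch in Python (the function name is ours):

```python
import numpy as np

def rasch_prob(theta_l: float, beta_i: float) -> float:
    """P(X_li = 1) under the Rasch model, Equation (1)."""
    return 1.0 / (1.0 + np.exp(-(theta_l - beta_i)))
```

A student whose ability exceeds the problem's difficulty ($\theta_l > \beta_i$) succeeds with probability above $1/2$, matching the education testing reading above.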
Traditionally, the goal of estimation under the Rasch model is to recover the item parameters $\beta$. In education testing, an estimate of the item parameters can be used to calibrate scores across different versions of a test. In recommendation systems, the item parameters can be used to produce a ranking over the items. In general, estimation is challenging under the Rasch model because for each user and item pair, we only get a single observation, or none in the case of missing data.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.04317v2 [cs.LG] 29 Oct 2023
Joint maximum likelihood estimation (JMLE) is one of the earliest methods developed for the estimation problem [3, 22, 25, 27]. It estimates both the user and item parameters by maximizing the joint likelihood function using an alternating maximization algorithm. While efficient, JMLE is known to be inconsistent (that is, even as $n \to \infty$, JMLE does not recover $\beta$) when the number of items is finite [3, 24] (e.g., Figure 1a). Intuitively, this is because there are many nuisance user parameters relative to a finite number of item parameters. As a result, JMLE is mostly used for preliminary parameter estimation, and researchers have developed other solutions to address the inconsistency problem, broadly consisting of three approaches, as follows.
The first approach is marginal maximum likelihood estimation (MMLE) [7]. The statistician first specifies a prior distribution over the user parameters. The objective of MMLE is to maximize the marginal likelihood function, which integrates out the user parameters. In practice, MMLE runs quite fast, handles missing data well, and is reasonably accurate. However, its performance depends on the accuracy of the prior distribution. If the prior is misspecified, MMLE may produce inaccurate estimates (e.g., Figure 1b). Model selection is thus a crucial procedure when applying MMLE to real data.
The second approach is conditional maximum likelihood estimation (CMLE) [3, 22, 27]. CMLE builds on the fact that under the Rasch model the total number of positive responses $s_l$ for each user $l$ is a sufficient statistic for the user parameter $\theta_l$. Instead of the joint likelihood function, CMLE maximizes the likelihood conditioned on $\{s_l\}_{l=1}^n$. Unlike JMLE, CMLE is statistically consistent without requiring any distributional assumptions about $\theta$. For small datasets with no missing data, CMLE is quite accurate. However, it may incur high computational cost and numerical issues on large datasets with many items and missing entries. Practitioners have observed that CMLE often produces inaccurate estimates in this regime [35, 36] (e.g., Figure 1c).
The third approach, which our algorithm follows, uses pairwise information that can be extracted from binary responses. Intuitively, if a user responds to two items, one negatively and one positively, we learn that the latter is 'better'. Following this intuition, previous authors [23, 17, 48] have designed spectral algorithms that first construct an item-item matrix and then compute its leading eigenvector. One common limitation of these methods is that the item-item matrix is assumed to be dense. Therefore, these methods are not directly extendable to large-scale datasets in applications such as recommendation systems, where the item-item observations are sparse.
Furthermore, most theoretical guarantees for the above methods are asymptotic $(n \to \infty)$. However, finite-sample error guarantees are useful in real-life applications. For example, when we observe only a handful of responses to a new item, it is important to have an accurate estimate of the error in the item parameter. Asymptotic guarantees, on the other hand, are accurate mostly in the large-sample regime and can be inaccurate in the data-poor regime.
Our Contributions: Motivated by known limitations of the existing methods, we propose a new, theoretically grounded algorithm that addresses these limitations and performs competitively with the most commonly used methods in the literature. More specifically:

- In Sections 2 and 4, we describe the spectral algorithm and practical modifications – an accelerated version of the original algorithm and a regularization strategy – that allow the algorithm to scale to large real-life datasets with sparse observation patterns and alleviate numerical issues.
- In Section 3, we present non-asymptotic error guarantees for the spectral method – the first of their kind in the literature – in Theorems 3.1 and 3.3. Notably, under the regime where $m$ grows, the spectral algorithm achieves, up to a constant factor, the optimal estimation error achievable by any unbiased estimator (Theorem 3.4). Under the challenging regime where $m$ is a constant or grows very slowly, we show that the spectral algorithm is, unlike JMLE, consistent (Corollary 3.2).
- In Section 5, we present experimental results on a wide range of datasets, both synthetic and real, to show that our spectral algorithm is competitive with the most commonly used methods in the literature, works off-the-shelf with minimal tuning, and is scalable to large datasets.
1.1 Notations and Problem Formulation

As briefly described before, in a universe of $n$ users and $m$ items, each user $l$ has a latent parameter $\theta_l \in \mathbb{R}$ and each item $i$ has a latent parameter $\beta_i \in \mathbb{R}$. The reader may recognize that there is a fundamental identifiability issue associated with the Rasch model pertaining to translation. That is, $\{\theta, \beta\}$ and $\{\theta + \alpha 1_n, \beta + \alpha 1_m\}$ describe the same model for any $\alpha \in \mathbb{R}$. For this reason, we impose a normalization constraint on the item parameters: $\beta^\top 1_m = 0$. We consider the fixed-range setting where $\beta_i \in [\beta_{\min}, \beta_{\max}]$ for all $i \in [m]$, for some constants $\beta_{\min}, \beta_{\max}$. Similarly, we assume that $\theta_l \in [\theta_{\min}, \theta_{\max}]$ for some constants $\theta_{\min}, \theta_{\max}$.¹ The observed data is $X \in \{0, 1, *\}^{n \times m}$, where $*$ denotes missing data; for entries where $X_{li} \neq *$, $X_{li}$ is independently distributed per Equation (1).

Let $A \in \{0,1\}^{n \times m}$ denote the assignment matrix, where $A_{li} = 1$ if user $l$ responds (either negatively or positively) to item $i$ and $A_{li} = 0$ if user $l$ does not respond to item $i$ (i.e., $X_{li} = *$). Define $B = A^\top A$; i.e., $B_{ij}$ is the number of users who respond to both items $i$ and $j$. The goal of item estimation is to obtain an estimate $\hat{\beta}$ from the observed data $X$, and the metric of interest is the $\ell_2$ error, $\|\hat{\beta} - \beta\|_2$.
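The assignment matrix and co-response counts are cheap to derive from the data. A minimal sketch, assuming the missing marker $*$ is stored as NaN in a float array (an encoding choice of ours, not the paper's):

```python
import numpy as np

def assignment_and_coresponse(X: np.ndarray):
    """Derive A (who responded to what) and B = A^T A (co-response counts)."""
    A = (~np.isnan(X)).astype(float)  # A_li = 1 iff X_li != * (NaN here)
    B = A.T @ A                       # B_ij = # users responding to both i and j
    return A, B
```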
2 The Spectral Estimator
In this section we describe our spectral algorithm, which is summarized in Algorithm 1. At a high level, the algorithm constructs a Markov chain on a graph whose vertices are the items and whose transition probabilities are estimated from the observed user-item response data. The algorithm then computes the stationary distribution of this Markov chain, and the estimate $\hat{\beta}$ is obtained following a simple transformation.
We first define, for each item pair $i, j$ and a fixed assignment $A$, a quantity which we term the pairwise differential measurement:
$$Y_{ij} = \sum_{l=1}^n A_{li} A_{lj} X_{li} (1 - X_{lj}), \quad \forall i \neq j \in [m]. \qquad (2)$$
Intuitively, $Y_{ij}$ is the number of users who respond $1$ to $i$ and $0$ to $j$. Given the pairwise differential measurements, consider a Markov chain $P \in [0,1]^{m \times m}$ whose transition probabilities are defined as follows:
$$P_{ij} = \begin{cases} \frac{1}{d} Y_{ij} & \text{if } i \neq j \\ 1 - \sum_{k \neq i} \frac{1}{d} Y_{ik} & \text{if } i = j, \end{cases} \qquad (3)$$
where $d$ is a sufficiently large normalization factor chosen such that the resulting pairwise transition probability matrix does not contain any negative entries. Typically, $d = O(\max_{i \in [m]} \sum_{k \neq i} B_{ik})$.
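Since $Y_{ij} = \sum_l (A_{li} X_{li}) \cdot (A_{lj}(1 - X_{lj}))$, the whole matrix of pairwise differential measurements is a single matrix product. A minimal sketch, reusing the NaN-for-missing encoding above and one valid choice of $d$:

```python
import numpy as np

def transition_matrix(X: np.ndarray):
    """Build Y (Equation 2) and the Markov chain P (Equation 3)."""
    A = (~np.isnan(X)).astype(float)
    Xf = np.nan_to_num(X)         # missing entries become 0 but are masked by A
    pos = A * Xf                  # entrywise A . X: observed positive responses
    neg = A * (1.0 - Xf)          # entrywise A . (1 - X): observed negative responses
    Y = pos.T @ neg               # Y_ij = sum_l A_li A_lj X_li (1 - X_lj)
    np.fill_diagonal(Y, 0.0)
    d = Y.sum(axis=1).max() + 1.0 # large enough that diagonal entries stay >= 0
    P = Y / d
    np.fill_diagonal(P, 1.0 - P.sum(axis=1))
    return P, d
```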
The algorithm then computes the stationary distribution $\hat{\pi}$ of the Markov chain (e.g., using power iteration) and recovers $\hat{\beta}$ using a truncated log transformation step. The truncated transformation is used to facilitate the theoretical analysis. The statistician could use any reasonable estimate of $\beta_{\max} - \beta_{\min}$ with little impact on the practical performance of the algorithm. In real-life datasets, the constructed Markov chain is often sparse (not every pair of items has nonzero pairwise differential measurements). Practitioners can take advantage of this sparsity to speed up the computation of the stationary distribution, such as by using sparse matrix-vector multiplication subroutines, as in the sketch below.
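As one illustration of the sparsity remark, a power-iteration step can run on a compressed sparse representation of $P$. A sketch, assuming SciPy is available:

```python
import numpy as np
from scipy.sparse import csr_matrix

def stationary_distribution_sparse(P: np.ndarray, tol: float = 1e-8) -> np.ndarray:
    """Power iteration exploiting zero entries of P (pairs with Y_ij = 0)."""
    PsT = csr_matrix(P.T)      # transpose once: pi @ P is computed as P^T @ pi
    m = P.shape[0]
    pi = np.full(m, 1.0 / m)
    while True:
        nxt = PsT @ pi         # sparse matvec: O(nnz(P)) instead of O(m^2)
        nxt /= np.abs(nxt).sum()
        if np.abs(nxt - pi).sum() < tol:
            return nxt
        pi = nxt
```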
To understand the intuition behind our spectral algorithm, let us consider the following idealized Markov chain where the state transition probabilities are exact:
$$P^*_{ij} = \begin{cases} \frac{1}{d} Y^*_{ij} & \text{for } i \neq j \\ 1 - \frac{1}{d} \sum_{k \neq i} Y^*_{ik} & \text{for } i = j, \end{cases} \qquad (4)$$
where $Y^*_{ij} = \sum_{l=1}^n A_{li} A_{lj} \mathbb{E}[X_{li}(1 - X_{lj})]$. For every pair $i, j$, given a sufficiently large number of users who respond to both items, $Y_{ij}$ will concentrate around $Y^*_{ij}$. Then, under an appropriately large scaling factor $d$, $P_{ij} \approx P^*_{ij}$ and the two Markov chains are 'close'. This means that the stationary distribution of $P$ is also close to that of $P^*$. At the same time, the true item parameter $\beta$ is directly related to the stationary distribution of $P^*$. This relation is summarized by Proposition 2.1.
¹ The bounded-range assumption is a common one in the literature on the Rasch model. Intuitively, it eliminates the presence of items that are always responded to positively (or negatively) and of users who only respond positively (or negatively), which leads to parameter unidentifiability [26].
Algorithm 1 Spectral Estimator
Input: User-item binary response data $X \in \{0,1,*\}^{n \times m}$.
Output: An estimate of the item parameters $\hat{\beta} = [\hat{\beta}_1, \ldots, \hat{\beta}_m]$.
1: Construct a Markov chain $P$ per Equation (3).
2: Compute the stationary distribution of $P$:
   Initialize $\pi^{(0)} = [\frac{1}{m}, \ldots, \frac{1}{m}]$.
   For $t = 1, 2, \ldots$ until convergence, compute
   $$\pi^{(t)} = \frac{\pi^{(t-1)} P}{\|\pi^{(t-1)} P\|_1}.$$
3: Compute $\bar{\beta}_i = \log \max\left\{\hat{\pi}_i, \frac{1}{m} e^{-(\beta_{\max} - \beta_{\min})}\right\}$ for $i \in [m]$.
4: Return the normalized item parameters, i.e., $\hat{\beta} = \bar{\beta} - (\bar{\beta}^\top 1_m / m) \, 1_m$.
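Putting the pieces together, a compact rendering of Algorithm 1 might look as follows. This is a sketch under our NaN encoding; `beta_range` stands in for the statistician's estimate of $\beta_{\max} - \beta_{\min}$, and `transition_matrix` is the sketch from earlier in this section:

```python
import numpy as np

def spectral_estimate(X: np.ndarray, beta_range: float,
                      tol: float = 1e-8, max_iter: int = 10_000) -> np.ndarray:
    """Steps 1-4 of Algorithm 1 (a sketch, not the authors' reference code)."""
    P, _ = transition_matrix(X)          # step 1: Markov chain per Equation (3)
    m = P.shape[0]
    pi = np.full(m, 1.0 / m)             # step 2: power iteration
    for _ in range(max_iter):
        nxt = pi @ P
        nxt /= np.abs(nxt).sum()
        if np.abs(nxt - pi).sum() < tol:
            pi = nxt
            break
        pi = nxt
    floor = np.exp(-beta_range) / m      # step 3: truncated log transform
    bar = np.log(np.maximum(pi, floor))
    return bar - bar.mean()              # step 4: normalize so beta^T 1_m = 0
```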
Proposition 2.1. Consider the idealized Markov chain described in Equation (4). The stationary distribution $\pi^*$ of $P^*$ satisfies $\pi^*_i = e^{\beta_i} / (\sum_{k=1}^m e^{\beta_k})$ for $i \in [m]$.
Essentially, Proposition 2.1 states that $\pi^*$ is proportional to $e^{\beta}$. Thus $\beta$ can be recovered from $\pi^*$ up to a global normalization. Now, given a sufficiently large number of users, the empirical stationary distribution $\hat{\pi}$ will be close to $\pi^*$, and naturally the obtained estimate $\hat{\beta}$ is also close to $\beta$.
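A short detailed-balance calculation makes the proposition plausible (a sketch of the idea; the full proof is in the supplementary materials). Writing $\sigma(x) = 1/(1 + e^{-x})$ so that $\mathbb{E}[X_{li}] = \sigma(\theta_l - \beta_i)$, each summand of $Y^*_{ij}$ relates to the corresponding summand of $Y^*_{ji}$ by a factor independent of the user:
$$\sigma(\theta_l - \beta_i)\bigl(1 - \sigma(\theta_l - \beta_j)\bigr) = \frac{e^{\theta_l - \beta_i}}{(1 + e^{\theta_l - \beta_i})(1 + e^{\theta_l - \beta_j})} = e^{\beta_j - \beta_i}\, \sigma(\theta_l - \beta_j)\bigl(1 - \sigma(\theta_l - \beta_i)\bigr).$$
Summing over users gives $Y^*_{ij} = e^{\beta_j - \beta_i} Y^*_{ji}$, so $\pi^*_i P^*_{ij} = \pi^*_j P^*_{ji}$ holds with $\pi^*_i \propto e^{\beta_i}$: the idealized chain is reversible with exactly the stationary distribution stated in Proposition 2.1.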
Readers who are familiar with the ranking-from-pairwise-comparisons literature might recognize the similarity between the spectral algorithm and Rank Centrality [43] for parameter estimation under the Bradley-Terry-Luce (BTL) model [38]. Similarly to Rank Centrality, our algorithm constructs a Markov chain on the item-item graph and recovers parameter estimates from its stationary distribution. In both cases, the Markov chain interpretation is motivated by the unique characteristics of the BTL and Rasch likelihood functions. However, the Markov chain construction differs between our algorithm and Rank Centrality, and so does the resulting analysis.
3 Theoretical Analysis
In this section, we present the main theoretical contributions of the paper. Specifically, in Section 3.1 we obtain two finite-sample error bounds for two different regimes of $m$: where $m$ is a constant or grows very slowly, and where $m$ grows at least logarithmically in $n$. In addition to our upper bounds, we show in Section 3.2 a Cramér-Rao lower bound on the mean squared error of any unbiased estimator, establishing the optimality of the spectral algorithm under the second regime. For the special case $m = 2$, we show that the error rate obtained by the spectral algorithm is optimal up to a log factor.
3.1 Finite-Sample Error Guarantees

Sampling Model: Let us consider a random sampling model where, for each user $l \in [n]$, each item $i \in [m]$ is independently shown to that user with probability $p$ (i.e., $P(A_{li} = 1) = p$). Once shown an item $i$, user $l$ responds with $X_{li}$ distributed according to Equation (1).
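This sampling model is straightforward to simulate, which gives a quick empirical check of the guarantees below. A sketch (the sizes and uniform parameter draws are illustrative choices of ours; `spectral_estimate` is the sketch from Section 2):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 20_000, 50, 0.2
theta = rng.uniform(-1.0, 1.0, size=n)
beta = rng.uniform(-1.0, 1.0, size=m)
beta -= beta.mean()                              # normalization beta^T 1_m = 0

shown = rng.random((n, m)) < p                   # P(A_li = 1) = p
prob = 1.0 / (1.0 + np.exp(-(theta[:, None] - beta[None, :])))
X = np.where(shown, (rng.random((n, m)) < prob).astype(float), np.nan)

beta_hat = spectral_estimate(X, beta_range=2.0)  # 2.0 = beta_max - beta_min here
print(np.linalg.norm(beta_hat - beta))           # l2 error; shrinks as n grows
```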
Under this sampling model and the regime where $m$ is a constant or grows very slowly, we obtain the following upper bound on the estimation error of the spectral algorithm, which is, to the best of our knowledge, the first finite-sample error guarantee for any consistent estimator under the Rasch model in the literature.
Theorem 3.1. Consider the sampling model described above. Suppose that $np^2 \geq C \log m$ for a sufficiently large constant $C$. Then the output of the spectral algorithm satisfies
$$\|\hat{\beta} - \beta\|_2 \leq C \cdot \frac{\sqrt{\max\{m, \log(np^2)\}}}{\sqrt{np^2}}$$
with probability at least $1 - \min\{e^{-12m}, \frac{1}{(np^2)^{12}}\} - \exp(-C_1 np^2)$, where $C, C_1$ are constants.
As alluded to before in our algorithm description, the proof of Theorem 3.1 uses Markov chain analysis, and a central object is the idealized Markov chain $P^*$ with its stationary distribution $\pi^*$. The proof is rather long and involved, so we defer the details to the supplementary materials and describe here the main idea. The starting point is a Markov chain eigen-perturbation bound (see Lemma A.3):
$$\|\hat{\pi} - \pi^*\|_2 \leq \frac{\|\pi^{*\top}(P - P^*)\|_2}{\mu(P^*) - \|P - P^*\|_2},$$
where $\mu(P^*)$ is the spectral gap of the idealized Markov chain. We then bound the numerator and the denominator separately. We show that, under the setting of Theorem 3.1,
$$\mu(P^*) - \|P - P^*\|_2 = \Omega\left(\frac{1}{d}\right) \quad \text{and} \quad \|\pi^{*\top}(P - P^*)\|_2 = O\left(\frac{\sqrt{\max\{m, \log(np^2)\}}}{d\, m \sqrt{np^2}}\right).$$
Combining these bounds with the following relation gives us the desired error bound:
$$\|\hat{\beta} - \beta\|_2 = O\bigl(m \cdot \|\hat{\pi} - \pi^*\|_2\bigr).$$
As an immediate consequence of Theorem 3.1, we can also prove the consistency of the spectral algorithm in the constant-$m$ regime. As mentioned previously, JMLE, one of the most well-known methods in the Rasch modeling literature, is inconsistent in this regime.

Corollary 3.2. Consider the setting of Theorem 3.1. For a fixed $m$ and $p = 1$, the spectral algorithm is a consistent estimator of $\beta$. That is, its output $\hat{\beta}$ satisfies
$$\lim_{n \to \infty} P\bigl(\|\hat{\beta} - \beta\|_2 < \epsilon\bigr) = 1, \quad \forall \epsilon > 0.$$
Under the regime where $m$ grows, we can sharpen the result of Theorem 3.1. Specifically, when the number of items shown to each user is sufficiently large, we improve the bound by a $\sqrt{p}$ factor, which can be significant when $p$ is small. This is summarized by the following theorem.

Theorem 3.3. Consider the setting of Theorem 3.1. Assume further that $mp \geq C'' \log n$ for a sufficiently large constant $C''$. Then the output of the spectral algorithm satisfies
$$\|\hat{\beta} - \beta\|_2 \leq C \sqrt{\frac{m}{np}}$$
with probability at least $1 - \exp(-C_2 np^2) - 2n^{-9}$, where $C, C_2$ are constants.
The reader may wonder why there is a difference between the two regimes. Intuitively, when $m$ is a small constant, the distribution of the number of items shown to each user is not tightly concentrated: some users are shown all of the items while some are shown only one. By design, our spectral algorithm uses pairwise differential measurements. This means that when a user responds to only one item, that information is not fully used. On the other hand, when $mp = \Omega(\log n)$, the number of items shown to the users is concentrated (all users are shown approximately the same number of items) and more pairwise differential measurements are available. Less information is under-utilized by the algorithm, and it enjoys a tighter (in fact optimal) error rate.
3.2 Cramér-Rao Lower Bound

In this section, we present results complementary to the finite-sample error guarantees obtained in the previous section. Notably, under the regime where $m$ is allowed to grow with $n$, we show that the minimum mean squared error achievable by any unbiased estimator is no more than a constant factor smaller than the upper bound for the spectral algorithm established in Theorem 3.3. This optimality result is summarized by the following theorem.

Theorem 3.4. Consider the sampling model described in Section 3.1. Let $T$ be any unbiased estimator of the item parameters. Then the mean squared error of such an estimator is lower bounded as
$$\mathbb{E}\bigl[\|T(X) - \beta\|_2^2\bigr] \geq c \cdot \frac{m}{np},$$
where $T(X)$ is the output of the estimator $T$ given data $X$ and $c$ is a constant.