A Spectral Approach to Item Response Theory
Duc Nguyen
Department of Computer and Information Science
University of Pennsylvania
mdnguyen@seas.upenn.edu
Anderson Y. Zhang
Department of Statistics and Data Science
University of Pennsylvania
ayz@wharton.upenn.edu
Abstract
The Rasch model is one of the most fundamental models in item response theory and has wide-ranging applications from education testing to recommendation systems. In a universe with $n$ users and $m$ items, the Rasch model assumes that the binary response $X_{li} \in \{0,1\}$ of a user $l$ with parameter $\theta_l$ to an item $i$ with parameter $\beta_i$ (e.g., a user likes a movie, a student correctly solves a problem) is distributed as $P(X_{li} = 1) = 1/(1 + \exp(-(\theta_l - \beta_i)))$. In this paper, we propose a new item estimation algorithm for this celebrated model (i.e., to estimate $\beta$). The core of our algorithm is the computation of the stationary distribution of a Markov chain defined on an item-item graph. We complement our algorithmic contributions with finite-sample error guarantees, the first of their kind in the literature, showing that our algorithm is consistent and enjoys favorable optimality properties. We discuss practical modifications to accelerate and robustify the algorithm that practitioners can adopt. Experiments on synthetic and real-life datasets, ranging from small education testing datasets to large recommendation systems datasets, show that our algorithm is scalable, accurate, and competitive with the most commonly used methods in the literature.
1 Introduction
Item response theory (IRT) is the study of the relationship between latent characteristics (a student's ability versus a test's difficulty, or a user's taste versus a movie's features) and the manifestations of these characteristics (a student's performance on a test or a user's rating of a movie). Originally developed by the psychometric community [45, 51], item response theory has been applied to diverse settings such as education testing [37], crowdsourcing [54], recommendation systems [12], finance [50] and marketing research [10].
One of the most fundamental models in IRT is the Rasch model [45]. It models the binary response $X_{li} \in \{0,1\}$ of user $l$ with latent parameter $\theta_l \in \mathbb{R}$ to item $i$ with latent parameter $\beta_i \in \mathbb{R}$ by
$$P(X_{li} = 1) = \frac{1}{1 + \exp(-(\theta_l - \beta_i))}. \qquad (1)$$
For example, in education testing, $\theta_l$ corresponds to the ability of student $l$, $\beta_i$ to the difficulty of problem $i$, and $X_{li} = 1$ if the student correctly solves the problem. Binary response data has grown abundant in modern domains: Netflix famously switched from a 5-star rating system to a binary like/dislike feedback system, and data on students' engagement and performance grew significantly as education moved online during the pandemic.
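For concreteness, Equation (1) is a one-line function. A minimal sketch in Python (the function name is ours):

```python
import numpy as np

def rasch_prob(theta_l: float, beta_i: float) -> float:
    """P(X_li = 1) under the Rasch model, Equation (1)."""
    return 1.0 / (1.0 + np.exp(-(theta_l - beta_i)))
```

A student whose ability exceeds the problem's difficulty ($\theta_l > \beta_i$) succeeds with probability above $1/2$, matching the education testing reading above.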
Traditionally, the goal of estimation under the Rasch model is to recover the item parameters $\beta$. In education testing, an estimate of the item parameters can be used to calibrate scores across different versions of a test. In recommendation systems, the item parameters can be used to produce a ranking over the items. In general, estimation is challenging under the Rasch model because for each user and item pair, we only get a single observation, or none in the case of missing data.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.04317v2 [cs.LG] 29 Oct 2023
Joint maximum likelihood estimation (JMLE) is one of the earliest methods developed for the estimation problem [3, 22, 25, 27]. It estimates both the user and item parameters by maximizing the joint likelihood function using an alternating maximization algorithm. While efficient, JMLE is known to be inconsistent (that is, even as $n \to \infty$, JMLE does not recover $\beta$) when the number of items is finite [3, 24] (e.g., Figure 1a). Intuitively, this is because there are many nuisance user parameters relative to a finite number of item parameters. As a result, JMLE is mostly used for preliminary parameter estimation, and researchers have developed other solutions to address the inconsistency problem, broadly consisting of three approaches, as follows.
The first approach is marginal maximum likelihood estimation (MMLE) [7]. The statistician first specifies a prior distribution over the user parameters. The objective of MMLE is to maximize the marginal likelihood function, which integrates out the user parameters. In practice, MMLE runs quite fast, handles missing data well, and is reasonably accurate. However, its performance depends on the accuracy of the prior distribution. If the prior is misspecified, MMLE may produce inaccurate estimates (e.g., Figure 1b). Model selection is thus a crucial procedure when applying MMLE to real data.
The second approach is conditional maximum likelihood estimation (CMLE) [3, 22, 27]. CMLE builds on the fact that under the Rasch model the total number of positive responses $s_l$ for each user $l$ is a sufficient statistic for the user parameter $\theta_l$. Instead of the joint likelihood function, CMLE maximizes the likelihood conditioned on $\{s_l\}_{l=1}^n$. Unlike JMLE, CMLE is statistically consistent without requiring any distributional assumptions about $\theta$. For small datasets with no missing data, CMLE is quite accurate. However, it may incur high computational cost and numerical issues on large datasets with many items and missing entries. Practitioners have observed that CMLE often produces inaccurate estimates in this regime [35, 36] (e.g., Figure 1c).
The third approach, which our algorithm follows, uses pairwise information that can be extracted from binary responses. Intuitively, if a user responds to two items, one negatively and one positively, we learn that the latter is 'better'. Following this intuition, previous authors [23, 17, 48] have designed spectral algorithms that first construct an item-item matrix and then compute its leading eigenvector. One common limitation of these methods is that the item-item matrix is assumed to be dense. Therefore, these methods are not directly extendable to large-scale datasets in applications such as recommendation systems, where the item-item observations are sparse.
Furthermore, most theoretical guarantees for the above methods are asymptotic $(n \to \infty)$. However, finite-sample error guarantees are useful in real-life applications. For example, when we observe only a handful of responses to a new item, it is important to have an accurate estimate of the error in the item parameter. Asymptotic guarantees, on the other hand, are accurate mostly in the large-sample regime and can be inaccurate in the data-poor regime.
Our Contributions: Motivated by known limitations of the existing methods, we propose a new, theoretically grounded algorithm that addresses these limitations and performs competitively with the most commonly used methods in the literature. More specifically:

- In Sections 2 and 4, we describe the spectral algorithm and practical modifications – an accelerated version of the original algorithm and a regularization strategy – that allow the algorithm to scale to large real-life datasets with sparse observation patterns and alleviate numerical issues.
- In Section 3, we present non-asymptotic error guarantees for the spectral method – the first of their kind in the literature – in Theorems 3.1 and 3.3. Notably, under the regime where $m$ grows, the spectral algorithm achieves, up to a constant factor, the optimal estimation error achievable by any unbiased estimator (Theorem 3.4). Under the challenging regime where $m$ is a constant or grows very slowly, we show that the spectral algorithm is, unlike JMLE, consistent (Corollary 3.2).
- In Section 5, we present experimental results on a wide range of datasets, both synthetic and real, to show that our spectral algorithm is competitive with the most commonly used methods in the literature, works off-the-shelf with minimal tuning, and is scalable to large datasets.
1.1 Notations and Problem Formulation

As briefly described before, in a universe of $n$ users and $m$ items, each user $l$ has a latent parameter $\theta_l \in \mathbb{R}$ and each item $i$ has a latent parameter $\beta_i \in \mathbb{R}$. The reader may recognize that there is a fundamental identifiability issue associated with the Rasch model pertaining to translation. That is, $\{\theta, \beta\}$ and $\{\theta + \alpha 1_n, \beta + \alpha 1_m\}$ describe the same model for any $\alpha \in \mathbb{R}$. For this reason, we impose a normalization constraint on the item parameters: $\beta^\top 1_m = 0$. We consider the fixed-range setting where $\beta_i \in [\beta_{\min}, \beta_{\max}]$ for all $i \in [m]$, for some constants $\beta_{\min}, \beta_{\max}$. Similarly, we assume that $\theta_l \in [\theta_{\min}, \theta_{\max}]$ for some constants $\theta_{\min}, \theta_{\max}$.¹ The observed data is $X \in \{0, 1, *\}^{n \times m}$, where $*$ denotes missing data; for entries where $X_{li} \neq *$, $X_{li}$ is independently distributed per Equation (1).

Let $A \in \{0,1\}^{n \times m}$ denote the assignment matrix, where $A_{li} = 1$ if user $l$ responds (either negatively or positively) to item $i$ and $A_{li} = 0$ if user $l$ does not respond to item $i$ (i.e., $X_{li} = *$). Define $B = A^\top A$; i.e., $B_{ij}$ is the number of users who respond to both items $i$ and $j$. The goal of item estimation is to obtain an estimate $\hat{\beta}$ from the observed data $X$, and the metric of interest is the $\ell_2$ error, $\|\hat{\beta} - \beta\|_2$.
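The assignment matrix and co-response counts are cheap to derive from the data. A minimal sketch, assuming the missing marker $*$ is stored as NaN in a float array (an encoding choice of ours, not the paper's):

```python
import numpy as np

def assignment_and_coresponse(X: np.ndarray):
    """Derive A (who responded to what) and B = A^T A (co-response counts)."""
    A = (~np.isnan(X)).astype(float)  # A_li = 1 iff X_li != * (NaN here)
    B = A.T @ A                       # B_ij = # users responding to both i and j
    return A, B
```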
2 The Spectral Estimator
In this section we describe our spectral algorithm, which is summarized in Algorithm 1. At a high level, the algorithm constructs a Markov chain on a graph whose vertices are the items and whose transition probabilities are estimated from the observed user-item response data. The algorithm then computes the stationary distribution of this Markov chain, and the estimate $\hat{\beta}$ is obtained following a simple transformation.
We first define, for each item pair $i, j$ and a fixed assignment $A$, a quantity which we term the pairwise differential measurement:
$$Y_{ij} = \sum_{l=1}^n A_{li} A_{lj} X_{li} (1 - X_{lj}), \quad \forall i \neq j \in [m]. \qquad (2)$$
Intuitively, $Y_{ij}$ is the number of users who respond $1$ to $i$ and $0$ to $j$. Given the pairwise differential measurements, consider a Markov chain $P \in [0,1]^{m \times m}$ whose transition probabilities are defined as follows:
$$P_{ij} = \begin{cases} \frac{1}{d} Y_{ij} & \text{if } i \neq j \\ 1 - \sum_{k \neq i} \frac{1}{d} Y_{ik} & \text{if } i = j, \end{cases} \qquad (3)$$
where $d$ is a sufficiently large normalization factor chosen such that the resulting pairwise transition probability matrix does not contain any negative entries. Typically, $d = O(\max_{i \in [m]} \sum_{k \neq i} B_{ik})$.
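Since $Y_{ij} = \sum_l (A_{li} X_{li}) \cdot (A_{lj}(1 - X_{lj}))$, the whole matrix of pairwise differential measurements is a single matrix product. A minimal sketch, reusing the NaN-for-missing encoding above and one valid choice of $d$:

```python
import numpy as np

def transition_matrix(X: np.ndarray):
    """Build Y (Equation 2) and the Markov chain P (Equation 3)."""
    A = (~np.isnan(X)).astype(float)
    Xf = np.nan_to_num(X)         # missing entries become 0 but are masked by A
    pos = A * Xf                  # entrywise A . X: observed positive responses
    neg = A * (1.0 - Xf)          # entrywise A . (1 - X): observed negative responses
    Y = pos.T @ neg               # Y_ij = sum_l A_li A_lj X_li (1 - X_lj)
    np.fill_diagonal(Y, 0.0)
    d = Y.sum(axis=1).max() + 1.0 # large enough that diagonal entries stay >= 0
    P = Y / d
    np.fill_diagonal(P, 1.0 - P.sum(axis=1))
    return P, d
```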
The algorithm then computes the stationary distribution $\hat{\pi}$ of the Markov chain (e.g., using power iteration) and recovers $\hat{\beta}$ using a truncated log transformation step. The truncated transformation is used to facilitate the theoretical analysis. The statistician could use any reasonable estimate of $\beta_{\max} - \beta_{\min}$ with little impact on the practical performance of the algorithm. In real-life datasets, the constructed Markov chain is often sparse (not every pair of items has nonzero pairwise differential measurements). Practitioners can take advantage of this sparsity to speed up the computation of the stationary distribution, such as by using sparse matrix-vector multiplication subroutines, as in the sketch below.
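As one illustration of the sparsity remark, a power-iteration step can run on a compressed sparse representation of $P$. A sketch, assuming SciPy is available:

```python
import numpy as np
from scipy.sparse import csr_matrix

def stationary_distribution_sparse(P: np.ndarray, tol: float = 1e-8) -> np.ndarray:
    """Power iteration exploiting zero entries of P (pairs with Y_ij = 0)."""
    PsT = csr_matrix(P.T)      # transpose once: pi @ P is computed as P^T @ pi
    m = P.shape[0]
    pi = np.full(m, 1.0 / m)
    while True:
        nxt = PsT @ pi         # sparse matvec: O(nnz(P)) instead of O(m^2)
        nxt /= np.abs(nxt).sum()
        if np.abs(nxt - pi).sum() < tol:
            return nxt
        pi = nxt
```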
To understand the intuition behind our spectral algorithm, let us consider the following idealized Markov chain where the state transition probabilities are exact:
$$P^*_{ij} = \begin{cases} \frac{1}{d} Y^*_{ij} & \text{for } i \neq j \\ 1 - \frac{1}{d} \sum_{k \neq i} Y^*_{ik} & \text{for } i = j, \end{cases} \qquad (4)$$
where $Y^*_{ij} = \sum_{l=1}^n A_{li} A_{lj} \mathbb{E}[X_{li}(1 - X_{lj})]$. For every pair $i, j$, given a sufficiently large number of users who respond to both items, $Y_{ij}$ will concentrate around $Y^*_{ij}$. Then, under an appropriately large scaling factor $d$, $P_{ij} \approx P^*_{ij}$ and the two Markov chains are 'close'. This means that the stationary distribution of $P$ is also close to that of $P^*$. At the same time, the true item parameter $\beta$ is directly related to the stationary distribution of $P^*$. This relation is summarized by Proposition 2.1.
¹ The bounded-range assumption is a common one in the literature on the Rasch model. Intuitively, it eliminates the presence of items that are always responded to positively (or negatively) and of users who only respond positively (or negatively), which leads to parameter unidentifiability [26].
Algorithm 1 Spectral Estimator
Input: User-item binary response data $X \in \{0,1,*\}^{n \times m}$.
Output: An estimate of the item parameters $\hat{\beta} = [\hat{\beta}_1, \ldots, \hat{\beta}_m]$.
1: Construct a Markov chain $P$ per Equation (3).
2: Compute the stationary distribution of $P$:
   Initialize $\pi^{(0)} = [\frac{1}{m}, \ldots, \frac{1}{m}]$.
   For $t = 1, 2, \ldots$ until convergence, compute
   $$\pi^{(t)} = \frac{\pi^{(t-1)} P}{\|\pi^{(t-1)} P\|_1}.$$
3: Compute $\bar{\beta}_i = \log \max\left\{\hat{\pi}_i, \frac{1}{m} e^{-(\beta_{\max} - \beta_{\min})}\right\}$ for $i \in [m]$.
4: Return the normalized item parameters, i.e., $\hat{\beta} = \bar{\beta} - (\bar{\beta}^\top 1_m / m) \, 1_m$.
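Putting the pieces together, a compact rendering of Algorithm 1 might look as follows. This is a sketch under our NaN encoding; `beta_range` stands in for the statistician's estimate of $\beta_{\max} - \beta_{\min}$, and `transition_matrix` is the sketch from earlier in this section:

```python
import numpy as np

def spectral_estimate(X: np.ndarray, beta_range: float,
                      tol: float = 1e-8, max_iter: int = 10_000) -> np.ndarray:
    """Steps 1-4 of Algorithm 1 (a sketch, not the authors' reference code)."""
    P, _ = transition_matrix(X)          # step 1: Markov chain per Equation (3)
    m = P.shape[0]
    pi = np.full(m, 1.0 / m)             # step 2: power iteration
    for _ in range(max_iter):
        nxt = pi @ P
        nxt /= np.abs(nxt).sum()
        if np.abs(nxt - pi).sum() < tol:
            pi = nxt
            break
        pi = nxt
    floor = np.exp(-beta_range) / m      # step 3: truncated log transform
    bar = np.log(np.maximum(pi, floor))
    return bar - bar.mean()              # step 4: normalize so beta^T 1_m = 0
```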
Proposition 2.1. Consider the idealized Markov chain described in Equation (4). The stationary distribution $\pi^*$ of $P^*$ satisfies $\pi^*_i = e^{\beta_i} / (\sum_{k=1}^m e^{\beta_k})$ for $i \in [m]$.
Essentially, Proposition 2.1 states that $\pi^*$ is proportional to $e^{\beta}$. Thus $\beta$ can be recovered from $\pi^*$ up to a global normalization. Now, given a sufficiently large number of users, the empirical stationary distribution $\hat{\pi}$ will be close to $\pi^*$, and naturally the obtained estimate $\hat{\beta}$ is also close to $\beta$.
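A short detailed-balance calculation makes the proposition plausible (a sketch of the idea; the full proof is in the supplementary materials). Writing $\sigma(x) = 1/(1 + e^{-x})$ so that $\mathbb{E}[X_{li}] = \sigma(\theta_l - \beta_i)$, each summand of $Y^*_{ij}$ relates to the corresponding summand of $Y^*_{ji}$ by a factor independent of the user:
$$\sigma(\theta_l - \beta_i)\bigl(1 - \sigma(\theta_l - \beta_j)\bigr) = \frac{e^{\theta_l - \beta_i}}{(1 + e^{\theta_l - \beta_i})(1 + e^{\theta_l - \beta_j})} = e^{\beta_j - \beta_i}\, \sigma(\theta_l - \beta_j)\bigl(1 - \sigma(\theta_l - \beta_i)\bigr).$$
Summing over users gives $Y^*_{ij} = e^{\beta_j - \beta_i} Y^*_{ji}$, so $\pi^*_i P^*_{ij} = \pi^*_j P^*_{ji}$ holds with $\pi^*_i \propto e^{\beta_i}$: the idealized chain is reversible with exactly the stationary distribution stated in Proposition 2.1.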
Readers who are familiar with the ranking-from-pairwise-comparisons literature might recognize the similarity between the spectral algorithm and Rank Centrality [43] for parameter estimation under the Bradley-Terry-Luce (BTL) model [38]. Similarly to Rank Centrality, our algorithm constructs a Markov chain on the item-item graph and recovers parameter estimates from its stationary distribution. In both cases, the Markov chain interpretation is motivated by the unique characteristics of the BTL and Rasch likelihood functions. However, the Markov chain construction differs between our algorithm and Rank Centrality, and so does the resulting analysis.
3 Theoretical Analysis
In this section, we present the main theoretical contributions of the paper. Specifically, in Section 3.1 we obtain two finite-sample error bounds for two different regimes of $m$: where $m$ is a constant or grows very slowly, and where $m$ grows at least logarithmically in $n$. In addition to our upper bounds, we show in Section 3.2 a Cramér-Rao lower bound on the mean squared error of any unbiased estimator, establishing the optimality of the spectral algorithm under the second regime. For the special case $m = 2$, we show that the error rate obtained by the spectral algorithm is optimal up to a log factor.
3.1 Finite-Sample Error Guarantees

Sampling Model: Let us consider a random sampling model where, for each user $l \in [n]$, each item $i \in [m]$ is independently shown to that user with probability $p$ (i.e., $P(A_{li} = 1) = p$). Once shown an item $i$, user $l$ responds with $X_{li}$ distributed according to Equation (1).
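This sampling model is straightforward to simulate, which gives a quick empirical check of the guarantees below. A sketch (the sizes and uniform parameter draws are illustrative choices of ours; `spectral_estimate` is the sketch from Section 2):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 20_000, 50, 0.2
theta = rng.uniform(-1.0, 1.0, size=n)
beta = rng.uniform(-1.0, 1.0, size=m)
beta -= beta.mean()                              # normalization beta^T 1_m = 0

shown = rng.random((n, m)) < p                   # P(A_li = 1) = p
prob = 1.0 / (1.0 + np.exp(-(theta[:, None] - beta[None, :])))
X = np.where(shown, (rng.random((n, m)) < prob).astype(float), np.nan)

beta_hat = spectral_estimate(X, beta_range=2.0)  # 2.0 = beta_max - beta_min here
print(np.linalg.norm(beta_hat - beta))           # l2 error; shrinks as n grows
```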
Under this sampling model and the regime where $m$ is a constant or grows very slowly, we obtain the following upper bound on the estimation error of the spectral algorithm, which is, to the best of our knowledge, the first finite-sample error guarantee for any consistent estimator under the Rasch model in the literature.
Theorem 3.1. Consider the sampling model described above. Suppose that $np^2 \geq C \log m$ for a sufficiently large constant $C$. Then the output of the spectral algorithm satisfies
$$\|\hat{\beta} - \beta\|_2 \leq C \cdot \frac{\sqrt{\max\{m, \log(np^2)\}}}{\sqrt{np^2}}$$
with probability at least $1 - \min\{e^{-12m}, \frac{1}{(np^2)^{12}}\} - \exp(-C_1 np^2)$, where $C, C_1$ are constants.
As alluded to before in our algorithm description, the proof of Theorem 3.1 uses Markov chain analysis, and a central object is the idealized Markov chain $P^*$ with its stationary distribution $\pi^*$. The proof is rather long and involved, so we defer the details to the supplementary materials and describe here the main idea. The starting point is a Markov chain eigen-perturbation bound (see Lemma A.3):
$$\|\hat{\pi} - \pi^*\|_2 \leq \frac{\|\pi^{*\top}(P - P^*)\|_2}{\mu(P^*) - \|P - P^*\|_2},$$
where $\mu(P^*)$ is the spectral gap of the idealized Markov chain. We then bound the numerator and the denominator separately. We show that, under the setting of Theorem 3.1,
$$\mu(P^*) - \|P - P^*\|_2 = \Omega\left(\frac{1}{d}\right) \quad \text{and} \quad \|\pi^{*\top}(P - P^*)\|_2 = O\left(\frac{\sqrt{\max\{m, \log(np^2)\}}}{d\, m \sqrt{np^2}}\right).$$
Combining these bounds with the following relation gives us the desired error bound:
$$\|\hat{\beta} - \beta\|_2 = O\bigl(m \cdot \|\hat{\pi} - \pi^*\|_2\bigr).$$
As an immediate consequence of Theorem 3.1, we can also prove the consistency of the spectral algorithm in the constant-$m$ regime. As mentioned previously, JMLE, one of the most well-known methods in the Rasch modeling literature, is inconsistent in this regime.

Corollary 3.2. Consider the setting of Theorem 3.1. For a fixed $m$ and $p = 1$, the spectral algorithm is a consistent estimator of $\beta$. That is, its output $\hat{\beta}$ satisfies
$$\lim_{n \to \infty} P\bigl(\|\hat{\beta} - \beta\|_2 < \epsilon\bigr) = 1, \quad \forall \epsilon > 0.$$
Under the regime where $m$ grows, we can sharpen the result of Theorem 3.1. Specifically, when the number of items shown to each user is sufficiently large, we improve the bound by a $\sqrt{p}$ factor, which can be significant when $p$ is small. This is summarized by the following theorem.

Theorem 3.3. Consider the setting of Theorem 3.1. Assume further that $mp \geq C'' \log n$ for a sufficiently large constant $C''$. Then the output of the spectral algorithm satisfies
$$\|\hat{\beta} - \beta\|_2 \leq C \sqrt{\frac{m}{np}}$$
with probability at least $1 - \exp(-C_2 np^2) - 2n^{-9}$, where $C, C_2$ are constants.
The reader may wonder why there is a difference between the two regimes. Intuitively, when $m$ is a small constant, the distribution of the number of items shown to each user is not tightly concentrated: some users are shown all of the items while some are shown only one. By design, our spectral algorithm uses pairwise differential measurements. This means that when a user responds to only one item, that information is not fully used. On the other hand, when $mp = \Omega(\log n)$, the number of items shown to the users is concentrated (all users are shown approximately the same number of items) and more pairwise differential measurements are available. Less information is under-utilized by the algorithm, and it enjoys a tighter (in fact optimal) error rate.
3.2 Cramér-Rao Lower Bound

In this section, we present results complementary to the finite-sample error guarantees obtained in the previous section. Notably, under the regime where $m$ is allowed to grow with $n$, we show that the minimum mean squared error achievable by any unbiased estimator is no more than a constant factor smaller than the upper bound for the spectral algorithm established in Theorem 3.3. This optimality result is summarized by the following theorem.

Theorem 3.4. Consider the sampling model described in Section 3.1. Let $T$ be any unbiased estimator of the item parameters. Then the mean squared error of such an estimator is lower bounded as
$$\mathbb{E}\bigl[\|T(X) - \beta\|_2^2\bigr] \geq c \cdot \frac{m}{np},$$
where $T(X)$ is the output of the estimator $T$ given data $X$ and $c$ is a constant.