versions of a test. In recommendation systems, the item parameters can be used to produce a ranking
over the items. In general, estimation is challenging under the Rasch model because for each user
and item pair, we observe at most a single response, and none at all when the entry is missing.
Joint maximum likelihood estimation (JMLE) is one of the earliest methods developed for the estimation
problem [3, 22, 25, 27]. It estimates both the user and item parameters by maximizing the joint
likelihood function with an alternating maximization algorithm. While efficient, JMLE is known to
be inconsistent (that is, even as $n \to \infty$, JMLE does not recover $\beta^*$) when the number of items is
finite [3, 24] (e.g., Figure 1a). Intuitively, this is because there are many nuisance user parameters
relative to only a finite number of item parameters. As a result, JMLE is mostly used for preliminary
parameter estimation, and researchers have developed other solutions to address the inconsistency
problem, broadly falling into the following three approaches.
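Before turning to these alternatives, it may help to write down the objects involved. Under the standard parameterization of the Rasch model (the response notation $X_{li} \in \{0,1\}$ for user $l$ and item $i$ is ours, and we adopt one common sign convention for $\beta$),
\[
\Pr(X_{li} = 1) \;=\; \frac{e^{\theta^*_l - \beta^*_i}}{1 + e^{\theta^*_l - \beta^*_i}},
\]
and JMLE maximizes the joint log-likelihood
\[
\ell_{\mathrm{joint}}(\theta, \beta) \;=\; \sum_{(l,i)\,\text{observed}} \Big[ X_{li}\,(\theta_l - \beta_i) - \log\!\big(1 + e^{\theta_l - \beta_i}\big) \Big],
\]
alternating between updates of $\theta$ with $\beta$ held fixed and updates of $\beta$ with $\theta$ held fixed.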
The first approach is marginal maximum likelihood estimation (MMLE) [7]. The statistician first
specifies a prior distribution over the user parameters; MMLE then maximizes the marginal likelihood
function, which integrates out the user parameters. In practice, MMLE runs quickly, handles missing
data well, and is reasonably accurate. However, its performance depends on how accurately the prior
distribution is specified: if the prior is misspecified, MMLE may produce inaccurate estimates (e.g.,
Figure 1b). Model selection is thus a crucial step when applying MMLE to real data.
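Concretely, assuming a prior density $\pi$ over the user parameter (our notation), MMLE maximizes over $\beta$ the marginal log-likelihood
\[
\ell_{\mathrm{marg}}(\beta) \;=\; \sum_{l=1}^{n} \log \int \prod_{i \in \Omega_l} \Pr\big(X_{li} \mid \theta, \beta_i\big)\, \pi(\theta)\, d\theta,
\]
where $\Omega_l$ denotes the set of items to which user $l$ responded. The one-dimensional integral is typically approximated by numerical quadrature in implementations, which is why MMLE is fast; the resulting estimate, however, inherits the accuracy of the assumed $\pi$.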
The second approach is conditional maximum likelihood estimation (CMLE) [3, 22, 27]. CMLE builds
on the fact that, under the Rasch model, the total number of positive responses $s_l$ of each user $l$
is a sufficient statistic for the user parameter $\theta^*_l$. Instead of the joint likelihood function, CMLE
maximizes the likelihood conditioned on $\{s_l\}_{l=1}^{n}$. Unlike JMLE, CMLE is statistically consistent
without requiring any distributional assumptions on $\theta^*$. For small datasets with no missing data,
CMLE is quite accurate. However, it may incur high computational cost and numerical issues on
large datasets with many items and missing entries. Practitioners have observed that CMLE often
produces inaccurate estimates in this regime [35, 36] (e.g., Figure 1c).
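For intuition, consider a single user who responds to all $m$ items, and write $\varepsilon_i = e^{-\beta_i}$ (our notation). Conditioned on $s_l = r$, the probability of a response pattern $(x_1, \dots, x_m)$ is
\[
\Pr\big(X_{l1} = x_1, \dots, X_{lm} = x_m \mid s_l = r\big) \;=\; \frac{\prod_{i=1}^{m} \varepsilon_i^{x_i}}{\gamma_r(\varepsilon_1, \dots, \varepsilon_m)},
\]
where $\gamma_r$ is the elementary symmetric polynomial of degree $r$. The user parameter $\theta^*_l$ cancels, which is why CMLE needs no assumption on $\theta^*$; with missing data, however, the symmetric functions must be evaluated separately for each user's set of observed items, one source of the computational and numerical difficulties mentioned above.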
The third approach, which our algorithm follows, uses pairwise information that can be extracted
from binary responses. Intuitively, if a user responds to two items, one negatively and one positively,
we learn that the latter is ‘better’. Following this intuition, previous authors [23, 17, 48] have designed
spectral algorithms that first construct an item-item matrix and then compute its leading eigenvector.
One common limitation of these methods is that the item-item matrix is assumed to be dense.
Therefore, they do not directly extend to large-scale datasets in applications such as
recommendation systems, where the item-item observations are sparse.
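As a schematic illustration only – this is neither the exact construction of [23, 17, 48] nor the algorithm developed in this paper, and the function name and normalization below are our own – the generic recipe is: count, for every ordered item pair $(i, j)$, how many users answered $i$ negatively and $j$ positively, normalize the resulting item-item matrix, and take its leading eigenvector as a score vector over the items.

import numpy as np

def leading_eigenvector_scores(X, n_iter=1000, tol=1e-10):
    # Schematic sketch: X is an (n_users, m_items) array with entries in
    # {0, 1} and np.nan marking missing responses.
    n, m = X.shape
    M = np.zeros((m, m))
    for l in range(n):
        observed = ~np.isnan(X[l])
        pos = np.flatnonzero(observed & (X[l] == 1))
        neg = np.flatnonzero(observed & (X[l] == 0))
        # a negative response to item i together with a positive response
        # to item j counts as one pairwise "win" of j over i
        for i in neg:
            M[i, pos] += 1.0
    # normalize rows to obtain a row-stochastic matrix; rows with no
    # pairwise counts fall back to the uniform distribution
    row_sums = M.sum(axis=1, keepdims=True)
    P = np.where(row_sums > 0, M / np.maximum(row_sums, 1.0), 1.0 / m)
    # power iteration for the leading left eigenvector (the stationary
    # distribution of P), used as a score vector over the items
    v = np.full(m, 1.0 / m)
    for _ in range(n_iter):
        v_next = v @ P
        v_next /= v_next.sum()
        if np.abs(v_next - v).sum() < tol:
            v = v_next
            break
        v = v_next
    return v

The dense loops and plain power iteration above are purely illustrative; on large datasets a sparse matrix representation and a sparse eigensolver would be used instead.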
Furthermore, most theoretical guarantees for the above methods are asymptotic ($n \to \infty$). However,
finite-sample error guarantees are useful in real-life applications. For example, when we only
observe a handful of responses to a new item, it is important to have an accurate estimate of the error
in that item's parameter estimate. Asymptotic guarantees, on the other hand, are accurate mainly in the
large-sample regime and can be inaccurate in the data-poor regime.
Our Contributions: Motivated by known limitations of the existing methods, we propose a new,
theoretically grounded algorithm that addresses these limitations and performs competitively with the
most commonly used methods in the literature. More specifically:
• In Sections 2 and 4, we describe the spectral algorithm and practical modifications – an
accelerated version of the original algorithm and a regularization strategy – that allow the
algorithm to scale to large real-life datasets with sparse observation patterns and alleviate
numerical issues.
• In Section 3, we present non-asymptotic error guarantees for the spectral method – the
first of their kind in the literature – in Theorems 3.1 and 3.3. Notably, in the regime
where $m$ grows, the spectral algorithm attains, up to a constant factor, the optimal estimation
error achievable by any unbiased estimator (Theorem 3.4). In the challenging regime where
$m$ is constant or grows very slowly, we show that the spectral algorithm is, unlike JMLE,
consistent (Corollary 3.2).
• In Section 5, we present experimental results on a wide range of datasets, both synthetic
and real, showing that our spectral algorithm is competitive with the most commonly used
methods in the literature, works off-the-shelf with minimal tuning, and scales to large
datasets.