FasterRisk: Fast and Accurate Interpretable Risk Scores
Jiachang Liu¹, Chudi Zhong¹, Boxuan Li¹, Margo Seltzer², Cynthia Rudin¹
¹Duke University   ²University of British Columbia
{jiachang.liu, chudi.zhong, boxuan.li}@duke.edu
mseltzer@cs.ubc.ca, cynthia@cs.duke.edu
Abstract
Over the last century, risk scores have been the most popular form of predictive
model used in healthcare and criminal justice. Risk scores are sparse linear models
with integer coefficients; often these models can be memorized or placed on an
index card. Typically, risk scores have been created either without data or by
rounding logistic regression coefficients, but these methods do not reliably produce
high-quality risk scores. Recent work used mathematical programming, which
is computationally slow. We introduce an approach for efficiently producing a
collection of high-quality risk scores learned from data. Specifically, our approach
produces a pool of almost-optimal sparse continuous solutions, each with a different
support set, using a beam-search algorithm. Each of these continuous solutions is
transformed into a separate risk score through a “star ray” search, where a range of
multipliers are considered before rounding the coefficients sequentially to maintain
low logistic loss. Our algorithm returns all of these high-quality risk scores for the
user to consider. This method completes within minutes and can be valuable in a
broad variety of applications.
1 Introduction
Risk scores are sparse linear models with integer coefficients that predict risks. They are arguably the most popular form of predictive model for high-stakes decisions through the last century and are the standard form of model used in criminal justice [4, 22] and medicine [19, 27, 34, 31, 41].
1. Oval Shape             -2 points   ...
2. Irregular Shape         4 points   + ...
3. Circumscribed Margin   -5 points   + ...
4. Spiculated Margin       2 points   + ...
5. Age ≥ 60                3 points   +
                              SCORE   =

SCORE  -7     -5     -4     -3     -2     -1
RISK   6.0%   10.6%  13.8%  17.9%  22.8%  28.6%
SCORE   0      1      2      3      4      5
RISK   35.2%  42.4%  50.0%  57.6%  64.8%  71.4%

Figure 1: Risk score on the mammo dataset [15], whose population is biopsy patients. It predicts the risk of malignancy of a breast lesion. The risk score is from FasterRisk on a fold of a 5-CV split. The AUCs on the training and test sets are 0.854 and 0.853, respectively.
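To illustrate how a score card like Figure 1 is used, the short sketch below totals the points for a hypothetical patient and looks the total up in the score-to-risk table; the points and risks are transcribed from the figure, while the patient and the feature names are illustrative.

```python
# Points and score-to-risk table transcribed from Figure 1 (mammo dataset risk score).
points = {"oval_shape": -2, "irregular_shape": 4, "circumscribed_margin": -5,
          "spiculated_margin": 2, "age_ge_60": 3}
risk = {-7: "6.0%", -5: "10.6%", -4: "13.8%", -3: "17.9%", -2: "22.8%", -1: "28.6%",
        0: "35.2%", 1: "42.4%", 2: "50.0%", 3: "57.6%", 4: "64.8%", 5: "71.4%"}

# Hypothetical patient: irregular shape, circumscribed margin, age 67 (so age_ge_60 = 1).
patient = {"oval_shape": 0, "irregular_shape": 1, "circumscribed_margin": 1,
           "spiculated_margin": 0, "age_ge_60": 1}

score = sum(points[f] * v for f, v in patient.items())   # 4 - 5 + 3 = 2
print(f"SCORE = {score}, RISK = {risk[score]}")           # -> SCORE = 2, RISK = 50.0%
```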
Their history dates back to at least the criminal justice work of Burgess [8], where, based on their criminal history and demographics, individuals were assigned integer point scores between 0 and 21 that determined the probability of their "making good or of failing upon parole." Other famous risk scores are arguably the most widely-used predictive models in healthcare. These include the APGAR score [3], developed in 1952 and given to newborns, and the CHADS₂ score [18], which estimates stroke risk for atrial fibrillation patients. Figures 1 and 2 show example risk scores, which estimate the risk of a breast lesion being malignant.
1. Irregular Shape         4 points   ...
2. Circumscribed Margin   -5 points   + ...
3. Spiculated Margin       2 points   + ...
4. Age ≥ 45                1 point    + ...
5. Age ≥ 60                3 points   +
                              SCORE   =

SCORE  -5     -4     -3     -2     -1      0
RISK   7.3%   9.7%   12.9%  16.9%  21.9%  27.8%
SCORE   1      2      3      4      5      6
RISK   34.6%  42.1%  50.0%  57.9%  65.4%  72.2%

Figure 2: A second risk score on the mammo dataset on the same fold as in Figure 1. The AUCs on the training and test sets are 0.855 and 0.859, respectively. FasterRisk can produce a diverse pool of high-quality models. Users can choose a model that best fits their domain knowledge.
Risk scores have the benefit of being easily memorized; usually their names reveal the full model – for instance, the factors in CHADS₂ are past Chronic heart failure, Hypertension, Age ≥ 75 years, Diabetes, and past Stroke (where past stroke receives 2 points and the others each receive 1 point). For risk scores, counterfactuals are often trivial to compute, even without a calculator. Also, checking that the data and calculations are correct is easier with risk scores than with other approaches. In short, risk scores have been created by humans for a century to support a huge spectrum of applications [2, 23, 30, 43, 44, 47], because humans find them easy to understand.
Traditionally, risk scores have been created in two main ways: (1) without data, with expert knowledge
only (and validated only afterwards on data), and (2) using a semi-manual process involving manual
feature selection and rounding of logistic regression coefficients. That is, these approaches rely
heavily on domain expertise and rely little on data. Unfortunately, the alternative of building a model
directly from data leads to computationally hard problems: optimizing risk scores over a global
objective on data is NP-hard, because in order to produce integer-valued scores, the feasible region
must be the integer lattice. There have been only a few approaches to design risk scores automatically [5, 6, 9, 10, 16, 32, 33, 38, 39, 40], but each of these has a flaw that limits its use in practice: the
optimization-based approaches use mathematical programming solvers (which require a license) that
are slow and scale poorly, and the other methods are randomized greedy algorithms, producing fast
but much lower-quality solutions. We need an approach that exhibits the best of both worlds: speed
fast enough to operate in a few minutes on a laptop and optimization/search capability as powerful as
that of the mathematical programming tools. Our method, FasterRisk, lies at this intersection. It is
fast enough to enable interactive model design and can rapidly produce a large pool of models from
which users can choose rather than producing only a single model.
One may wonder why simple rounding of $\ell_1$-regularized logistic regression coefficients does not yield sufficiently good risk scores. Past works [37, 39] explain this as follows: the sheer amount of $\ell_1$ regularization needed to get a very sparse solution leads to large biases and worse loss values, and rounding goes against the performance gradient. For example, consider the following coefficients from $\ell_1$ regularization: [1.45, .87, .83, .47, .23, .15, ...]. This model is worse than its unregularized counterpart due to the bias induced by the large $\ell_1$ term. Its rounded solution is [1, 1, 1, 0, 0, 0, ...], which leads to even worse loss. Instead, one could multiply all the coefficients by a constant and then round, but which constant is best? There are an infinite number of choices. Even if some value of the multiplier leads to minimal loss due to rounding, the bias from the $\ell_1$ term still limits the quality of the solution.
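To make this concrete, the following sketch (an illustration on synthetic data with an arbitrary regularization strength and multiplier grid, not an experiment from this paper) fits an $\ell_1$-regularized logistic regression, rounds the coefficients directly, and also rounds after scaling by a few candidate multipliers, comparing the resulting logistic losses.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def logistic_loss(X, y, w, b):
    """Sum of log(1 + exp(-y * (Xw + b))) with y in {-1, +1}."""
    return float(np.sum(np.logaddexp(0.0, -y * (X @ w + b))))

X, y01 = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
y = 2 * y01 - 1  # convert labels from {0, 1} to {-1, +1}

# Strong l1 penalty to force sparsity (illustrative value of C).
lr = LogisticRegression(penalty="l1", C=0.05, solver="liblinear").fit(X, y01)
w, b = lr.coef_.ravel(), lr.intercept_[0]

print("continuous loss:    ", logistic_loss(X, y, w, b))
print("naive rounding loss:", logistic_loss(X, y, np.round(w), np.round(b)))

# Scale-then-round: a multiplier m stretches the coefficients before rounding;
# the risk model later divides the integer score by m, so the loss is evaluated
# on (w_int / m, b_int / m).
for m in [1.0, 2.0, 5.0, 10.0]:
    w_int, b_int = np.round(m * w), np.round(m * b)
    print(f"multiplier {m:>5}: rounded loss = {logistic_loss(X, y, w_int / m, b_int / m):.2f}")
```

With a strong penalty, the continuous coefficients are already biased toward zero; naive rounding typically worsens the loss further, and scanning multipliers mitigates the rounding error but not the $\ell_1$ bias itself.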
The algorithm presented here does not have these disadvantages. The steps are: (1) Fast subset search with $\ell_0$ optimization (avoiding the bias from $\ell_1$). This requires the solution of an NP-hard problem, but our fast subset selection algorithm is able to solve this quickly. We proceed from this accurate sparse continuous solution, preserving both sparseness and accuracy in the next steps. (2) Find a pool of diverse continuous sparse solutions that are almost as good as the solution found in (1) but with different support sets. (3) A "star ray" search, where we search for feasible integer-valued solutions along multipliers of each item in the pool from (2). By using multipliers, the search space resembles the rays of a star, because it extends each coefficient in the pool outward from the origin to search for solutions. To find integer solutions, we perform a local search (a form of sequential rounding). This method yields high-performance solutions: we provide a theoretical upper bound on the loss difference between the continuous sparse solution and the rounded integer sparse solution.
Through extensive experiments, we show that our proposed method is computationally fast and
produces high-quality integer solutions. This work thus provides valuable and novel tools to create
risk scores for professionals in many different fields, such as healthcare, finance, and criminal justice.
Contributions: Our contributions include the three-step framework for producing risk scores, a
beam-search-based algorithm for logistic regression with bounded coefficients (for Step 1), the search
algorithm to find pools of diverse high-quality continuous solutions (for Step 2), the star ray search
technique using multipliers (Step 3), and a theorem guaranteeing the quality of the star ray search.
2 Related Work
Optimization-based approaches: Risk scores, which model $P(y = 1 \mid x)$, are different from threshold classifiers, which predict either $y = 1$ or $y = -1$ given $x$. Most work in the area of optimization of integer-valued sparse linear models focuses on classifiers, not risk scores [5, 6, 9, 32, 33, 37, 40, 46].
This difference is important, because a classifier generally cannot be calibrated well for use in risk
scoring: only its single decision point is optimized. Despite this, several works use the hinge loss
to calibrate predictions [6, 9, 32]. All of these optimization-based algorithms use mathematical
programming solvers (i.e., integer programming solvers), which tend to be slow and cannot be used
on larger problems. However, they can handle both feature selection and integer constraints.
To directly optimize risk scores, typically the logistic loss is used. The RiskSLIM algorithm [39] optimizes the logistic loss with $\ell_0$ regularization, subject to integer constraints on the
coefficients. RiskSLIM uses callbacks to a MIP solver, alternating between solving linear programs
and using branch-and-cut to divide and reduce the search space. The branch-and-cut procedure needs
to keep track of unsolved nodes, whose number increases exponentially with the size of the feature
space. Thus, RiskSLIM’s major challenge is scalability.
Local search-based approaches: As discussed earlier, a natural way to produce a scoring system or
risk score is by selecting features manually and rounding logistic regression coefficients or hinge-loss
solutions to integers [10, 11, 39]. While rounding is fast, rounding errors can cause the solution quality
to be much worse than that of the optimization-based approaches. Several works have proposed
improvements over traditional rounding. In Randomized Rounding [10], each coefficient is rounded up or down randomly, based on its continuous coefficient value. However, randomized rounding does not seem to perform well in practice. Chevaleyre [10] also proposed Greedy Rounding, where
coefficients are rounded sequentially. While this technique aimed to provide theoretical guarantees
for the hinge loss, we identified a serious flaw in the argument, rendering the bounds incorrect (see
Appendix B). The RiskSLIM paper [39] proposed SequentialRounding, which, at each iteration,
chooses a coefficient to round up or down, making the best choice according to the regularized
logistic loss. This gives better solutions than other types of rounding, because the coefficients are
considered together through their performance on the loss function, not independently.
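The following is a minimal sketch of the SequentialRounding idea as described above (our own simplified rendering, not the RiskSLIM implementation; the $\ell_0$ penalty is omitted because it is constant once the support is fixed):

```python
import numpy as np

def logistic_loss(X, y, w, b):
    """Sum of log(1 + exp(-y * (Xw + b))), with y in {-1, +1}."""
    return float(np.sum(np.logaddexp(0.0, -y * (X @ w + b))))

def sequential_rounding(X, y, w_cont, b):
    """Round each coordinate of w_cont to its floor or ceiling, one at a time,
    always committing the single choice that yields the lowest logistic loss."""
    w = np.asarray(w_cont, dtype=float).copy()
    unrounded = [j for j in range(len(w)) if w[j] != np.round(w[j])]
    while unrounded:
        best = None  # (loss, coordinate index, rounded value)
        for j in unrounded:
            for v in (np.floor(w[j]), np.ceil(w[j])):
                w_try = w.copy()
                w_try[j] = v
                loss = logistic_loss(X, y, w_try, b)
                if best is None or loss < best[0]:
                    best = (loss, j, v)
        _, j, v = best
        w[j] = v              # commit the best single rounding decision
        unrounded.remove(j)
    return w
```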
A drawback of SequentialRounding is that it considers rounding up or down only to the nearest
integer from the continuous solution. By considering multipliers, we consider a much larger space
of possible solutions. The idea of multipliers (i.e., "scale and round") is used for medical scoring systems [11], though, as far as we know, it has been used only with traditional rounding rather than
SequentialRounding, which could easily lead to poor performance, and we have seen no previous
work that studies how to perform scale-and-round in a systematic, computationally efficient way.
While the general idea of scale-and-round seems simple, it is not: there are an infinite number of
possible multipliers, and, for each one, a number of possible nearby integer coefficient vectors that is
the size of a hypercube, expanding exponentially in the search space.
Sampling Methods: The Bayesian method of Ertekin et al. [16] samples scoring systems, favoring those that are simpler and more accurate, according to a prior. "Pooling" [39] creates multiple models through sampling along the regularization path of ElasticNet. As discussed, when regularization is tuned high enough to induce sparse solutions, it results in substantial bias and low-quality solutions (see [37, 39] for numerous experiments on this point). Note that there is a literature on finding diverse solutions to mixed-integer optimization problems [e.g., 1], but it focuses only on linear objective functions.
Algorithm 1 FasterRisk$(\mathcal{D}, k, C, B, \epsilon, T, N_m) \rightarrow \{(w^{+t}, w_0^{+t}, m^t)\}_t$

Input: dataset $\mathcal{D}$ (consisting of feature matrix $X \in \mathbb{R}^{n \times p}$ and labels $y \in \mathbb{R}^n$), sparsity constraint $k$, coefficient constraint $C = 5$, beam search size $B = 10$, tolerance level $\epsilon = 0.3$, number of attempts $T = 50$, number of multipliers to try $N_m = 20$.
Output: a pool $\mathcal{P}$ of scoring systems $\{(w^t, w_0^t), m^t\}$, where $t$ is the index enumerating all found scoring systems with $\|w^t\|_0 \le k$ and $\|w^t\|_\infty \le C$, and $m^t$ is the corresponding multiplier.
1: Call Algorithm 2 SparseBeamLR$(\mathcal{D}, k, C, B)$ to find a high-quality solution $(w^*, w_0^*)$ to the sparse logistic regression problem with continuous coefficients satisfying a box constraint, i.e., solve Problem (3). (Algorithm SparseBeamLR will call Algorithm ExpandSuppBy1 as a subroutine, which grows the solution by beam search.)
2: Call Algorithm 5 CollectSparseDiversePool$((w^*, w_0^*), \epsilon, T)$, which solves Problem (4). Place its output $\{(w^t, w_0^t)\}_t$ in pool $\mathcal{P} = \{(w^*, w_0^*)\}$: $\mathcal{P} \leftarrow \mathcal{P} \cup \{(w^t, w_0^t)\}_t$.
3: Send each member $t$ of the pool $\mathcal{P}$, which is $(w^t, w_0^t)$, to Algorithm 3 StarRaySearch$(\mathcal{D}, (w^t, w_0^t), C, N_m)$ to perform a line search among possible multiplier values and obtain an integer solution $(w^{+t}, w_0^{+t})$ with multiplier $m^t$. Algorithm 3 calls Algorithm 6 AuxiliaryLossRounding, which conducts the rounding step.
4: Return the collection of risk scores $\{(w^{+t}, w_0^{+t}, m^t)\}_t$. If desired, return only the best model according to the logistic loss.
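For readers who want to try the method, the authors released a Python package; a usage sketch is shown below. The package name (`fasterrisk` on PyPI), the class names `RiskScoreOptimizer` and `RiskScoreClassifier`, and their argument and method names are assumptions based on our recollection of the project's README and may differ from the current release, so verify them against the official documentation. Labels are assumed to be in {-1, +1} and features binarized.

```python
# Hypothetical usage sketch of the released fasterrisk package (pip install fasterrisk).
# Class and method names are assumptions based on the project README; verify before use.
import numpy as np
from fasterrisk.fasterrisk import RiskScoreOptimizer, RiskScoreClassifier

X_train = np.random.randint(0, 2, size=(200, 10)).astype(float)   # binary features
y_train = np.where(X_train[:, 0] + X_train[:, 1] > 0, 1, -1)      # labels in {-1, +1}

opt = RiskScoreOptimizer(X=X_train, y=y_train, k=5)  # k = sparsity constraint
opt.optimize()

# The optimizer returns a pool of models: multipliers, integer intercepts, integer coefficients.
multipliers, intercepts, coefficients = opt.get_models()
print(f"{len(multipliers)} risk scores found")

# Wrap one model from the pool as a classifier and get calibrated risk predictions.
clf = RiskScoreClassifier(multipliers[0], intercepts[0], coefficients[0])
risk = clf.predict_prob(X_train)
```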
3 Methodology
Define the dataset $\mathcal{D} = \{(1, x_i, y_i)\}_{i=1}^n$ (1 is a static feature corresponding to the intercept) and the scaled dataset as $\frac{1}{m} \times \mathcal{D} = \{(\frac{1}{m}, \frac{1}{m}x_i, y_i)\}_{i=1}^n$, for a real-valued $m$. Our goal is to produce high-quality risk scores within a few minutes on a small personal computer. We start with an optimization problem similar to RiskSLIM's [39], which minimizes the logistic loss subject to sparsity constraints and integer coefficients:
$$\min_{w, w_0} L(w, w_0, \mathcal{D}), \quad \text{where } L(w, w_0, \mathcal{D}) = \sum_{i=1}^n \log\left(1 + \exp\left(-y_i (x_i^T w + w_0)\right)\right) \tag{1}$$
$$\text{such that } \|w\|_0 \le k \text{ and } w \in \mathbb{Z}^p, \; \forall j \in [1, .., p] \; w_j \in [-5, 5], \; w_0 \in \mathbb{Z}.$$
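For readers who prefer code to notation, the loss in Equation (1) (ignoring the sparsity and integrality constraints) can be transcribed directly; the function and variable names below are our own.

```python
import numpy as np

def logistic_loss(X, y, w, w0):
    """L(w, w0, D) = sum_i log(1 + exp(-y_i * (x_i^T w + w0))), with y_i in {-1, +1}.

    np.logaddexp(0, z) computes log(1 + exp(z)) in a numerically stable way.
    """
    return float(np.sum(np.logaddexp(0.0, -y * (X @ w + w0))))

# Tiny example with a 3-feature model and integer coefficients in [-5, 5].
X = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
y = np.array([1.0, -1.0])
print(logistic_loss(X, y, w=np.array([2.0, -3.0, 1.0]), w0=-1.0))
```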
In practice, the range of these box constraints $[-5, 5]$ is user-defined and can be different for each coefficient. (We use 5 for ease of exposition.) The sparsity constraint $\|w\|_0 \le k$ or integer constraints $w \in \mathbb{Z}^p$ make the problem NP-hard, and this is a difficult mixed-integer nonlinear program. Transforming the original features to all possible dummy variables, which is a standard type of preprocessing [e.g., 24], changes the model into a (flexible) generalized additive model; such models can be as accurate as the best machine learning models [39, 42]. Thus, we generally process variables in $x$ to be binary.
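One standard way to obtain such binary variables (a generic sketch, not the authors' preprocessing code; the column names and thresholds are illustrative) is to expand each continuous feature into threshold indicator columns:

```python
import pandas as pd

def binarize(df: pd.DataFrame, thresholds: dict[str, list[float]]) -> pd.DataFrame:
    """Expand continuous columns into 0/1 indicator columns such as 'Age>=60'.

    `thresholds` maps a column name to its cut points; categorical columns can
    be handled separately, e.g. with pd.get_dummies.
    """
    out = {}
    for col, cuts in thresholds.items():
        for c in cuts:
            out[f"{col}>={c}"] = (df[col] >= c).astype(int)
    return pd.DataFrame(out, index=df.index)

df = pd.DataFrame({"Age": [37, 52, 63, 71]})
print(binarize(df, {"Age": [45, 60]}))
```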
To make the solution space substantially larger than $[-5, -4, ..., 4, 5]^p$, we use multipliers. The problem becomes:
$$\min_{w, w_0, m} L\left(w, w_0, \tfrac{1}{m}\mathcal{D}\right), \quad \text{where } L\left(w, w_0, \tfrac{1}{m}\mathcal{D}\right) = \sum_{i=1}^n \log\left(1 + \exp\left(-y_i \frac{x_i^T w + w_0}{m}\right)\right) \tag{2}$$
$$\text{such that } \|w\|_0 \le k, \; w \in \mathbb{Z}^p, \; \forall j \in [1, .., p], \; w_j \in [-5, 5], \; w_0 \in \mathbb{Z}, \; m > 0.$$
Note that the use of multipliers does not weaken the interpretability of the risk score: the user still sees integer risk scores composed of values $w_j \in \{-5, -4, .., 4, 5\}$, $w_0 \in \mathbb{Z}$. Only the risk conversion table is calculated differently, as $P(Y = 1 \mid x) = 1/(1 + e^{-f(x)})$ where $f(x) = \frac{1}{m}(w^T x + w_0)$.
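For example, a score-to-risk table like those in Figures 1 and 2 can be regenerated from an integer model with a few lines; the intercept and multiplier below are placeholders for illustration, not values reported in the paper.

```python
import numpy as np

def risk_table(scores, w0, m):
    """Convert integer total scores s = w^T x into risks via P = 1 / (1 + exp(-(s + w0) / m))."""
    scores = np.asarray(scores, dtype=float)
    return 1.0 / (1.0 + np.exp(-(scores + w0) / m))

# Placeholder model: integer intercept w0 and multiplier m chosen by the star ray search.
w0, m = -1, 2.0
for s, r in zip(range(-5, 6), risk_table(range(-5, 6), w0, m)):
    print(f"SCORE {s:>3}   RISK {100 * r:5.1f}%")
```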
Our method proceeds in three steps, as outlined in Algorithm 1. In the first step, it approximately solves the following sparse logistic regression problem with a box constraint (but not integer constraints), detailed in Section 3.1 and Algorithm 2:
$$(w^*, w_0^*) \in \operatorname*{argmin}_{w, w_0} L(w, w_0, \mathcal{D}), \quad \|w\|_0 \le k, \; w \in \mathbb{R}^p, \; \forall j \in [1, ..., p], \; w_j \in [-5, 5], \; w_0 \in \mathbb{R}. \tag{3}$$
The algorithm gives an accurate and sparse real-valued solution $(w^*, w_0^*)$.
The second step produces many near-optimal sparse logistic regression solutions, again without integer constraints, detailed in Section 3.2 and Algorithm 5. Algorithm 5 uses $(w^*, w_0^*)$ from the first step to find a set $\{(w^t, w_0^t)\}_t$ such that for all $t$ and a given threshold $\epsilon_w$:
$$(w^t, w_0^t) \text{ obeys } L(w^t, w_0^t, \mathcal{D}) \le L(w^*, w_0^*, \mathcal{D}) \times (1 + \epsilon_w) \tag{4}$$
$$\|w^t\|_0 \le k, \; w^t \in \mathbb{R}^p, \; \forall j \in [1, ..., p], \; w_j^t \in [-5, 5], \; w_0^t \in \mathbb{R}.$$
After these steps, we have a pool of almost-optimal sparse logistic regression models. In the third step, for each coefficient vector in the pool, we compute a risk score. It is a feasible integer solution $(w^{+t}, w_0^{+t})$ to the following, which includes a positive multiplier $m^t > 0$:
$$L\left(w^{+t}, w_0^{+t}, \tfrac{1}{m^t}\mathcal{D}\right) \le L(w^t, w_0^t, \mathcal{D}) + \epsilon^t, \tag{5}$$
$$w^{+t} \in \mathbb{Z}^p, \; \forall j \in [1, ..., p], \; w_j^{+t} \in [-5, 5], \; w_0^{+t} \in \mathbb{Z},$$
where we derive a tight theoretical upper bound on $\epsilon^t$. A detailed solution to (5) is shown in Algorithm 6 in Appendix A. We solve the optimization problem for a large range of multipliers in Algorithm 3 for each coefficient vector in the pool, choosing the best multiplier for each coefficient vector. This third step yields a large collection of risk scores, all of which are approximately as accurate as the best sparse logistic regression model that can be obtained. All steps in this process are fast and scalable.
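To make the third step concrete, here is a heavily simplified, self-contained sketch of a star ray search: for each candidate multiplier, the continuous coefficients are scaled and then rounded coordinate-by-coordinate to whichever of floor or ceiling gives the lower logistic loss, and the best (multiplier, integer solution) pair is kept. This illustrates the idea only; the paper's Algorithms 3 and 6 choose the multiplier grid and the rounding order more carefully and come with the bound on $\epsilon^t$. The rounding order and the up-front intercept rounding below are our own simplifications.

```python
import numpy as np

def logistic_loss(X, y, w, w0):
    return float(np.sum(np.logaddexp(0.0, -y * (X @ w + w0))))

def star_ray_search(X, y, w_cont, w0_cont, C=5, n_multipliers=20):
    """Scan multipliers; for each, scale and then sequentially round to integers in [-C, C]."""
    best = None  # (loss, multiplier, integer coefficients, integer intercept)
    m_max = C / max(np.max(np.abs(w_cont)), 1e-12)  # keep scaled coefficients inside the box
    for m in np.linspace(1.0, m_max, n_multipliers):
        w = np.clip(m * w_cont, -C, C)
        w0 = float(np.round(m * w0_cont))  # round the intercept up front for simplicity
        # Sequential rounding: commit one coordinate at a time to floor or ceiling,
        # choosing the option with the lower loss on the rescaled data (X / m).
        for j in np.argsort(-np.abs(w)):
            floor_v, ceil_v = np.floor(w[j]), np.ceil(w[j])
            losses = []
            for v in (floor_v, ceil_v):
                w_try = w.copy()
                w_try[j] = v
                losses.append(logistic_loss(X / m, y, w_try, w0 / m))
            w[j] = (floor_v, ceil_v)[int(np.argmin(losses))]
        loss = logistic_loss(X / m, y, w, w0 / m)
        if best is None or loss < best[0]:
            best = (loss, m, w.astype(int), int(w0))
    return best
```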
Algorithm 2 SparseBeamLR$(\mathcal{D}, k, C, B) \rightarrow (w, w_0)$

Input: dataset $\mathcal{D}$, sparsity constraint $k$, coefficient constraint $C$, and beam search size $B$.
Output: a sparse continuous coefficient vector $(w, w_0)$ with $\|w\|_0 \le k$, $\|w\|_\infty \le C$.
1: Define $N_+$ and $N_-$ as numbers of positive and negative labels, respectively.
2: $w_0 \leftarrow \log(N_+/N_-)$, $w \leftarrow 0$   ▷ Initialize the intercept and coefficients.
3: $\mathcal{F} \leftarrow \emptyset$   ▷ Initialize the collection of found supports as an empty set.
4: $(\mathcal{W}, \mathcal{F}) \leftarrow$ ExpandSuppBy1$(\mathcal{D}, (w, w_0), \mathcal{F}, B)$.   ▷ Returns $B$ models of support size 1.
5: for $t = 2, ..., k$ do   ▷ Beam search to expand the support
6:   $\mathcal{W}_{\mathrm{tmp}} \leftarrow \emptyset$
7:   for $(w', w_0') \in \mathcal{W}$ do   ▷ Each of these has support size $t - 1$
8:     $(\mathcal{W}', \mathcal{F}) \leftarrow$ ExpandSuppBy1$(\mathcal{D}, (w', w_0'), \mathcal{F}, B)$.   ▷ Returns $B$ models with support size $t$.
9:     $\mathcal{W}_{\mathrm{tmp}} \leftarrow \mathcal{W}_{\mathrm{tmp}} \cup \mathcal{W}'$
10:  end for
11:  Reset $\mathcal{W}$ to be the $B$ solutions in $\mathcal{W}_{\mathrm{tmp}}$ with the smallest logistic loss values.
12: end for
13: Pick $(w, w_0)$ from $\mathcal{W}$ with the smallest logistic loss.
14: Return $(w, w_0)$.
3.1 High-quality Sparse Continuous Solution
There are many different approaches for sparse logistic regression, including $\ell_1$ regularization [35], ElasticNet [48], $\ell_0$ regularization [13, 24], and orthogonal matching pursuit (OMP) [14, 25], but none of these approaches seem to be able to handle both the box constraints and the sparsity constraint in Problem (3), so we developed a new approach. This approach, in Algorithm 2, SparseBeamLR,
uses beam search for best subset selection: each iteration contains several coordinate descent steps
to determine whether a new variable should be added to the support, and it clips coefficients to
the box $[-5, 5]$ as it proceeds. Hence the algorithm is able to determine, before committing to the new variable, whether it is likely to decrease the loss while obeying the box constraints. This beam search algorithm for solving (3) implicitly uses the assumption that one of the best models of size $k$ contains variables of one of the best models of size $k-1$. This type of assumption has been studied in the sparse learning literature [14] (Theorem 5). However, we are not aware of any other work that applies box constraints or beam search for sparse logistic regression. In Appendix E, we show that our method produces better solutions than the OMP method presented in [14].
Algorithm 2 calls the ExpandSuppBy1 Algorithm, which has two major steps. The detailed algorithm
can be found in Appendix A. For the first step, given a solution $w$, we perform optimization on each single coordinate $j$ outside of the current support $\mathrm{supp}(w)$:
$$d_j^* \in \operatorname*{argmin}_{d \in [-5, 5]} L(w + d\, e_j, w_0, \mathcal{D}) \quad \text{for } j \text{ where } w_j = 0. \tag{6}$$
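A minimal numeric sketch of the coordinate step in Equation (6), using a generic bounded 1-D minimizer (scipy's bounded Brent method) rather than the paper's own update rule; the data and function names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def logistic_loss(X, y, w, w0):
    return float(np.sum(np.logaddexp(0.0, -y * (X @ w + w0))))

def best_coordinate_step(X, y, w, w0, j, box=5.0):
    """Solve d*_j = argmin_{d in [-box, box]} L(w + d * e_j, w0, D) for a coordinate j
    outside the current support (w[j] == 0)."""
    def loss_of_d(d):
        w_try = w.copy()
        w_try[j] = d
        return logistic_loss(X, y, w_try, w0)
    res = minimize_scalar(loss_of_d, bounds=(-box, box), method="bounded")
    return res.x, res.fun  # optimal coefficient value and the resulting loss

# Rank all features outside the support by the loss achieved after the 1-D fit.
# (Illustrative data; in the algorithm this step is used inside ExpandSuppBy1.)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 8)).astype(float)
y = np.where(X[:, 0] + rng.normal(size=100) > 0.5, 1.0, -1.0)
w, w0 = np.zeros(8), 0.0
losses = {j: best_coordinate_step(X, y, w, w0, j)[1] for j in range(8) if w[j] == 0}
print(min(losses, key=losses.get))  # feature whose single-coordinate fit has the lowest loss
```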