FasterRisk: Fast and Accurate Interpretable Risk Scores
Jiachang Liu¹, Chudi Zhong¹, Boxuan Li¹, Margo Seltzer², Cynthia Rudin¹
¹Duke University   ²University of British Columbia
{jiachang.liu, chudi.zhong, boxuan.li}@duke.edu
mseltzer@cs.ubc.ca, cynthia@cs.duke.edu
Abstract
Over the last century, risk scores have been the most popular form of predictive
model used in healthcare and criminal justice. Risk scores are sparse linear models
with integer coefficients; often these models can be memorized or placed on an
index card. Typically, risk scores have been created either without data or by
rounding logistic regression coefficients, but these methods do not reliably produce
high-quality risk scores. Recent work used mathematical programming, which
is computationally slow. We introduce an approach for efficiently producing a
collection of high-quality risk scores learned from data. Specifically, our approach
produces a pool of almost-optimal sparse continuous solutions, each with a different
support set, using a beam-search algorithm. Each of these continuous solutions is
transformed into a separate risk score through a “star ray” search, where a range of
multipliers are considered before rounding the coefficients sequentially to maintain
low logistic loss. Our algorithm returns all of these high-quality risk scores for the
user to consider. This method completes within minutes and can be valuable in a
broad variety of applications.
1 Introduction
Risk scores are sparse linear models with integer coefficients that predict risks. They are arguably the most popular form of predictive model for high-stakes decisions through the last century and are the standard form of model used in criminal justice [4, 22] and medicine [19, 27, 34, 31, 41].
1. Oval Shape             -2 points   ...
2. Irregular Shape         4 points   + ...
3. Circumscribed Margin   -5 points   + ...
4. Spiculated Margin       2 points   + ...
5. Age ≥ 60                3 points   +
                              SCORE   =

SCORE  -7     -5     -4     -3     -2     -1
RISK   6.0%   10.6%  13.8%  17.9%  22.8%  28.6%
SCORE   0      1      2      3      4      5
RISK   35.2%  42.4%  50.0%  57.6%  64.8%  71.4%

Figure 1: Risk score on the mammo dataset [15], whose population is biopsy patients. It predicts the risk of malignancy of a breast lesion. The risk score is from FasterRisk on a fold of a 5-CV split. The AUCs on the training and test sets are 0.854 and 0.853, respectively.
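To illustrate how a score card like Figure 1 is used, the short sketch below totals the points for a hypothetical patient and looks the total up in the score-to-risk table; the points and risks are transcribed from the figure, while the patient and the feature names are illustrative.

```python
# Points and score-to-risk table transcribed from Figure 1 (mammo dataset risk score).
points = {"oval_shape": -2, "irregular_shape": 4, "circumscribed_margin": -5,
          "spiculated_margin": 2, "age_ge_60": 3}
risk = {-7: "6.0%", -5: "10.6%", -4: "13.8%", -3: "17.9%", -2: "22.8%", -1: "28.6%",
        0: "35.2%", 1: "42.4%", 2: "50.0%", 3: "57.6%", 4: "64.8%", 5: "71.4%"}

# Hypothetical patient: irregular shape, circumscribed margin, age 67 (so age_ge_60 = 1).
patient = {"oval_shape": 0, "irregular_shape": 1, "circumscribed_margin": 1,
           "spiculated_margin": 0, "age_ge_60": 1}

score = sum(points[f] * v for f, v in patient.items())   # 4 - 5 + 3 = 2
print(f"SCORE = {score}, RISK = {risk[score]}")           # -> SCORE = 2, RISK = 50.0%
```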
Their history dates back to at least the criminal justice work of Burgess [8], where, based on their criminal history and demographics, individuals were assigned integer point scores between 0 and 21 that determined the probability of their "making good or of failing upon parole." Other famous risk scores are arguably the most widely-used predictive models in healthcare. These include the APGAR score [3], developed in 1952 and given to newborns, and the CHADS₂ score [18], which estimates stroke risk for atrial fibrillation patients. Figures 1 and 2 show example risk scores, which estimate the risk of a breast lesion being malignant.
1. Irregular Shape         4 points   ...
2. Circumscribed Margin   -5 points   + ...
3. Spiculated Margin       2 points   + ...
4. Age ≥ 45                1 point    + ...
5. Age ≥ 60                3 points   +
                              SCORE   =

SCORE  -5     -4     -3     -2     -1      0
RISK   7.3%   9.7%   12.9%  16.9%  21.9%  27.8%
SCORE   1      2      3      4      5      6
RISK   34.6%  42.1%  50.0%  57.9%  65.4%  72.2%

Figure 2: A second risk score on the mammo dataset on the same fold as in Figure 1. The AUCs on the training and test sets are 0.855 and 0.859, respectively. FasterRisk can produce a diverse pool of high-quality models. Users can choose a model that best fits their domain knowledge.
Risk scores have the benefit of being easily memorized; usually their names reveal the full model – for instance, the factors in CHADS₂ are past Chronic heart failure, Hypertension, Age ≥ 75 years, Diabetes, and past Stroke (where past stroke receives 2 points and the others each receive 1 point). For risk scores, counterfactuals are often trivial to compute, even without a calculator. Also, checking that the data and calculations are correct is easier with risk scores than with other approaches. In short, risk scores have been created by humans for a century to support a huge spectrum of applications [2, 23, 30, 43, 44, 47], because humans find them easy to understand.
Traditionally, risk scores have been created in two main ways: (1) without data, with expert knowledge
only (and validated only afterwards on data), and (2) using a semi-manual process involving manual
feature selection and rounding of logistic regression coefficients. That is, these approaches rely
heavily on domain expertise and rely little on data. Unfortunately, the alternative of building a model
directly from data leads to computationally hard problems: optimizing risk scores over a global
objective on data is NP-hard, because in order to produce integer-valued scores, the feasible region
must be the integer lattice. There have been only a few approaches to design risk scores automatically [5, 6, 9, 10, 16, 32, 33, 38, 39, 40], but each of these has a flaw that limits its use in practice: the
optimization-based approaches use mathematical programming solvers (which require a license) that
are slow and scale poorly, and the other methods are randomized greedy algorithms, producing fast
but much lower-quality solutions. We need an approach that exhibits the best of both worlds: speed
fast enough to operate in a few minutes on a laptop and optimization/search capability as powerful as
that of the mathematical programming tools. Our method, FasterRisk, lies at this intersection. It is
fast enough to enable interactive model design and can rapidly produce a large pool of models from
which users can choose rather than producing only a single model.
One may wonder why simple rounding of $\ell_1$-regularized logistic regression coefficients does not yield sufficiently good risk scores. Past works [37, 39] explain this as follows: the sheer amount of $\ell_1$ regularization needed to get a very sparse solution leads to large biases and worse loss values, and rounding goes against the performance gradient. For example, consider the following coefficients from $\ell_1$ regularization: [1.45, .87, .83, .47, .23, .15, ...]. This model is worse than its unregularized counterpart due to the bias induced by the large $\ell_1$ term. Its rounded solution is [1, 1, 1, 0, 0, 0, ...], which leads to even worse loss. Instead, one could multiply all the coefficients by a constant and then round, but which constant is best? There are an infinite number of choices. Even if some value of the multiplier leads to minimal loss due to rounding, the bias from the $\ell_1$ term still limits the quality of the solution.
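To make this concrete, the following sketch (an illustration on synthetic data with an arbitrary regularization strength and multiplier grid, not an experiment from this paper) fits an $\ell_1$-regularized logistic regression, rounds the coefficients directly, and also rounds after scaling by a few candidate multipliers, comparing the resulting logistic losses.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def logistic_loss(X, y, w, b):
    """Sum of log(1 + exp(-y * (Xw + b))) with y in {-1, +1}."""
    return float(np.sum(np.logaddexp(0.0, -y * (X @ w + b))))

X, y01 = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
y = 2 * y01 - 1  # convert labels from {0, 1} to {-1, +1}

# Strong l1 penalty to force sparsity (illustrative value of C).
lr = LogisticRegression(penalty="l1", C=0.05, solver="liblinear").fit(X, y01)
w, b = lr.coef_.ravel(), lr.intercept_[0]

print("continuous loss:    ", logistic_loss(X, y, w, b))
print("naive rounding loss:", logistic_loss(X, y, np.round(w), np.round(b)))

# Scale-then-round: a multiplier m stretches the coefficients before rounding;
# the risk model later divides the integer score by m, so the loss is evaluated
# on (w_int / m, b_int / m).
for m in [1.0, 2.0, 5.0, 10.0]:
    w_int, b_int = np.round(m * w), np.round(m * b)
    print(f"multiplier {m:>5}: rounded loss = {logistic_loss(X, y, w_int / m, b_int / m):.2f}")
```

With a strong penalty, the continuous coefficients are already biased toward zero; naive rounding typically worsens the loss further, and scanning multipliers mitigates the rounding error but not the $\ell_1$ bias itself.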
The algorithm presented here does not have these disadvantages. The steps are: (1) Fast subset search with $\ell_0$ optimization (avoiding the bias from $\ell_1$). This requires the solution of an NP-hard problem, but our fast subset selection algorithm is able to solve this quickly. We proceed from this accurate sparse continuous solution, preserving both sparseness and accuracy in the next steps. (2) Find a pool of diverse continuous sparse solutions that are almost as good as the solution found in (1) but with different support sets. (3) A "star ray" search, where we search for feasible integer-valued solutions along multipliers of each item in the pool from (2). By using multipliers, the search space resembles the rays of a star, because it extends each coefficient in the pool outward from the origin to search for solutions. To find integer solutions, we perform a local search (a form of sequential rounding). This method yields high-performance solutions: we provide a theoretical upper bound on the loss difference between the continuous sparse solution and the rounded integer sparse solution.
Through extensive experiments, we show that our proposed method is computationally fast and
produces high-quality integer solutions. This work thus provides valuable and novel tools to create
risk scores for professionals in many different fields, such as healthcare, finance, and criminal justice.
Contributions: Our contributions include the three-step framework for producing risk scores, a
beam-search-based algorithm for logistic regression with bounded coefficients (for Step 1), the search
algorithm to find pools of diverse high-quality continuous solutions (for Step 2), the star ray search
technique using multipliers (Step 3), and a theorem guaranteeing the quality of the star ray search.
2 Related Work
Optimization-based approaches: Risk scores, which model $P(y = 1 \mid x)$, are different from threshold classifiers, which predict either $y = 1$ or $y = -1$ given $x$. Most work in the area of optimization of integer-valued sparse linear models focuses on classifiers, not risk scores [5, 6, 9, 32, 33, 37, 40, 46].
This difference is important, because a classifier generally cannot be calibrated well for use in risk
scoring: only its single decision point is optimized. Despite this, several works use the hinge loss
to calibrate predictions [6, 9, 32]. All of these optimization-based algorithms use mathematical
programming solvers (i.e., integer programming solvers), which tend to be slow and cannot be used
on larger problems. However, they can handle both feature selection and integer constraints.
To directly optimize risk scores, typically the logistic loss is used. The RiskSLIM algorithm [39] optimizes the logistic loss with $\ell_0$ regularization, subject to integer constraints on the
coefficients. RiskSLIM uses callbacks to a MIP solver, alternating between solving linear programs
and using branch-and-cut to divide and reduce the search space. The branch-and-cut procedure needs
to keep track of unsolved nodes, whose number increases exponentially with the size of the feature
space. Thus, RiskSLIM’s major challenge is scalability.
Local search-based approaches: As discussed earlier, a natural way to produce a scoring system or
risk score is by selecting features manually and rounding logistic regression coefficients or hinge-loss
solutions to integers [10, 11, 39]. While rounding is fast, rounding errors can cause the solution quality
to be much worse than that of the optimization-based approaches. Several works have proposed
improvements over traditional rounding. In Randomized Rounding [10], each coefficient is rounded up or down randomly, based on its continuous coefficient value. However, randomized rounding does not seem to perform well in practice. Chevaleyre [10] also proposed Greedy Rounding, where
coefficients are rounded sequentially. While this technique aimed to provide theoretical guarantees
for the hinge loss, we identified a serious flaw in the argument, rendering the bounds incorrect (see
Appendix B). The RiskSLIM paper [39] proposed SequentialRounding, which, at each iteration,
chooses a coefficient to round up or down, making the best choice according to the regularized
logistic loss. This gives better solutions than other types of rounding, because the coefficients are
considered together through their performance on the loss function, not independently.
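The following is a minimal sketch of the SequentialRounding idea as described above (our own simplified rendering, not the RiskSLIM implementation; the $\ell_0$ penalty is omitted because it is constant once the support is fixed):

```python
import numpy as np

def logistic_loss(X, y, w, b):
    """Sum of log(1 + exp(-y * (Xw + b))), with y in {-1, +1}."""
    return float(np.sum(np.logaddexp(0.0, -y * (X @ w + b))))

def sequential_rounding(X, y, w_cont, b):
    """Round each coordinate of w_cont to its floor or ceiling, one at a time,
    always committing the single choice that yields the lowest logistic loss."""
    w = np.asarray(w_cont, dtype=float).copy()
    unrounded = [j for j in range(len(w)) if w[j] != np.round(w[j])]
    while unrounded:
        best = None  # (loss, coordinate index, rounded value)
        for j in unrounded:
            for v in (np.floor(w[j]), np.ceil(w[j])):
                w_try = w.copy()
                w_try[j] = v
                loss = logistic_loss(X, y, w_try, b)
                if best is None or loss < best[0]:
                    best = (loss, j, v)
        _, j, v = best
        w[j] = v              # commit the best single rounding decision
        unrounded.remove(j)
    return w
```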
A drawback of SequentialRounding is that it considers rounding up or down only to the nearest
integer from the continuous solution. By considering multipliers, we consider a much larger space
of possible solutions. The idea of multipliers (i.e., "scale and round") is used for medical scoring systems [11], though, as far as we know, it has been used only with traditional rounding rather than
SequentialRounding, which could easily lead to poor performance, and we have seen no previous
work that studies how to perform scale-and-round in a systematic, computationally efficient way.
While the general idea of scale-and-round seems simple, it is not: there are an infinite number of
possible multipliers, and, for each one, a number of possible nearby integer coefficient vectors that is
the size of a hypercube, expanding exponentially in the search space.
Sampling Methods: The Bayesian method of Ertekin et al. [16] samples scoring systems, favoring those that are simpler and more accurate, according to a prior. "Pooling" [39] creates multiple models through sampling along the regularization path of ElasticNet. As discussed, when regularization is tuned high enough to induce sparse solutions, it results in substantial bias and low-quality solutions (see [37, 39] for numerous experiments on this point). Note that there is a literature on finding diverse solutions to mixed-integer optimization problems [e.g., 1], but it focuses only on linear objective functions.
Algorithm 1 FasterRisk$(\mathcal{D}, k, C, B, \epsilon, T, N_m) \rightarrow \{(w^{+t}, w_0^{+t}, m^t)\}_t$

Input: dataset $\mathcal{D}$ (consisting of feature matrix $X \in \mathbb{R}^{n \times p}$ and labels $y \in \mathbb{R}^n$), sparsity constraint $k$, coefficient constraint $C = 5$, beam search size $B = 10$, tolerance level $\epsilon = 0.3$, number of attempts $T = 50$, number of multipliers to try $N_m = 20$.
Output: a pool $\mathcal{P}$ of scoring systems $\{(w^t, w_0^t), m^t\}$, where $t$ is the index enumerating all found scoring systems with $\|w^t\|_0 \le k$ and $\|w^t\|_\infty \le C$, and $m^t$ is the corresponding multiplier.
1: Call Algorithm 2 SparseBeamLR$(\mathcal{D}, k, C, B)$ to find a high-quality solution $(w^*, w_0^*)$ to the sparse logistic regression problem with continuous coefficients satisfying a box constraint, i.e., solve Problem (3). (Algorithm SparseBeamLR will call Algorithm ExpandSuppBy1 as a subroutine, which grows the solution by beam search.)
2: Call Algorithm 5 CollectSparseDiversePool$((w^*, w_0^*), \epsilon, T)$, which solves Problem (4). Place its output $\{(w^t, w_0^t)\}_t$ in pool $\mathcal{P} = \{(w^*, w_0^*)\}$: $\mathcal{P} \leftarrow \mathcal{P} \cup \{(w^t, w_0^t)\}_t$.
3: Send each member $t$ of the pool $\mathcal{P}$, which is $(w^t, w_0^t)$, to Algorithm 3 StarRaySearch$(\mathcal{D}, (w^t, w_0^t), C, N_m)$ to perform a line search among possible multiplier values and obtain an integer solution $(w^{+t}, w_0^{+t})$ with multiplier $m^t$. Algorithm 3 calls Algorithm 6 AuxiliaryLossRounding, which conducts the rounding step.
4: Return the collection of risk scores $\{(w^{+t}, w_0^{+t}, m^t)\}_t$. If desired, return only the best model according to the logistic loss.
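For readers who want to try the method, the authors released a Python package; a usage sketch is shown below. The package name (`fasterrisk` on PyPI), the class names `RiskScoreOptimizer` and `RiskScoreClassifier`, and their argument and method names are assumptions based on our recollection of the project's README and may differ from the current release, so verify them against the official documentation. Labels are assumed to be in {-1, +1} and features binarized.

```python
# Hypothetical usage sketch of the released fasterrisk package (pip install fasterrisk).
# Class and method names are assumptions based on the project README; verify before use.
import numpy as np
from fasterrisk.fasterrisk import RiskScoreOptimizer, RiskScoreClassifier

X_train = np.random.randint(0, 2, size=(200, 10)).astype(float)   # binary features
y_train = np.where(X_train[:, 0] + X_train[:, 1] > 0, 1, -1)      # labels in {-1, +1}

opt = RiskScoreOptimizer(X=X_train, y=y_train, k=5)  # k = sparsity constraint
opt.optimize()

# The optimizer returns a pool of models: multipliers, integer intercepts, integer coefficients.
multipliers, intercepts, coefficients = opt.get_models()
print(f"{len(multipliers)} risk scores found")

# Wrap one model from the pool as a classifier and get calibrated risk predictions.
clf = RiskScoreClassifier(multipliers[0], intercepts[0], coefficients[0])
risk = clf.predict_prob(X_train)
```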
3 Methodology
Define the dataset $\mathcal{D} = \{(1, x_i, y_i)\}_{i=1}^n$ (1 is a static feature corresponding to the intercept) and the scaled dataset as $\frac{1}{m} \times \mathcal{D} = \{(\frac{1}{m}, \frac{1}{m}x_i, y_i)\}_{i=1}^n$, for a real-valued $m$. Our goal is to produce high-quality risk scores within a few minutes on a small personal computer. We start with an optimization problem similar to RiskSLIM's [39], which minimizes the logistic loss subject to sparsity constraints and integer coefficients:
$$\min_{w, w_0} L(w, w_0, \mathcal{D}), \quad \text{where } L(w, w_0, \mathcal{D}) = \sum_{i=1}^n \log\left(1 + \exp\left(-y_i (x_i^T w + w_0)\right)\right) \tag{1}$$
$$\text{such that } \|w\|_0 \le k \text{ and } w \in \mathbb{Z}^p, \; \forall j \in [1, .., p] \; w_j \in [-5, 5], \; w_0 \in \mathbb{Z}.$$
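For readers who prefer code to notation, the loss in Equation (1) (ignoring the sparsity and integrality constraints) can be transcribed directly; the function and variable names below are our own.

```python
import numpy as np

def logistic_loss(X, y, w, w0):
    """L(w, w0, D) = sum_i log(1 + exp(-y_i * (x_i^T w + w0))), with y_i in {-1, +1}.

    np.logaddexp(0, z) computes log(1 + exp(z)) in a numerically stable way.
    """
    return float(np.sum(np.logaddexp(0.0, -y * (X @ w + w0))))

# Tiny example with a 3-feature model and integer coefficients in [-5, 5].
X = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
y = np.array([1.0, -1.0])
print(logistic_loss(X, y, w=np.array([2.0, -3.0, 1.0]), w0=-1.0))
```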
In practice, the range of these box constraints $[-5, 5]$ is user-defined and can be different for each coefficient. (We use 5 for ease of exposition.) The sparsity constraint $\|w\|_0 \le k$ or integer constraints $w \in \mathbb{Z}^p$ make the problem NP-hard, and this is a difficult mixed-integer nonlinear program. Transforming the original features to all possible dummy variables, which is a standard type of preprocessing [e.g., 24], changes the model into a (flexible) generalized additive model; such models can be as accurate as the best machine learning models [39, 42]. Thus, we generally process variables in $x$ to be binary.
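One standard way to obtain such binary variables (a generic sketch, not the authors' preprocessing code; the column names and thresholds are illustrative) is to expand each continuous feature into threshold indicator columns:

```python
import pandas as pd

def binarize(df: pd.DataFrame, thresholds: dict[str, list[float]]) -> pd.DataFrame:
    """Expand continuous columns into 0/1 indicator columns such as 'Age>=60'.

    `thresholds` maps a column name to its cut points; categorical columns can
    be handled separately, e.g. with pd.get_dummies.
    """
    out = {}
    for col, cuts in thresholds.items():
        for c in cuts:
            out[f"{col}>={c}"] = (df[col] >= c).astype(int)
    return pd.DataFrame(out, index=df.index)

df = pd.DataFrame({"Age": [37, 52, 63, 71]})
print(binarize(df, {"Age": [45, 60]}))
```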
To make the solution space substantially larger than $[-5, -4, ..., 4, 5]^p$, we use multipliers. The problem becomes:
$$\min_{w, w_0, m} L\left(w, w_0, \tfrac{1}{m}\mathcal{D}\right), \quad \text{where } L\left(w, w_0, \tfrac{1}{m}\mathcal{D}\right) = \sum_{i=1}^n \log\left(1 + \exp\left(-y_i \frac{x_i^T w + w_0}{m}\right)\right) \tag{2}$$
$$\text{such that } \|w\|_0 \le k, \; w \in \mathbb{Z}^p, \; \forall j \in [1, .., p], \; w_j \in [-5, 5], \; w_0 \in \mathbb{Z}, \; m > 0.$$
Note that the use of multipliers does not weaken the interpretability of the risk score: the user still sees integer risk scores composed of values $w_j \in \{-5, -4, .., 4, 5\}$, $w_0 \in \mathbb{Z}$. Only the risk conversion table is calculated differently, as $P(Y = 1 \mid x) = 1/(1 + e^{-f(x)})$ where $f(x) = \frac{1}{m}(w^T x + w_0)$.
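For example, a score-to-risk table like those in Figures 1 and 2 can be regenerated from an integer model with a few lines; the intercept and multiplier below are placeholders for illustration, not values reported in the paper.

```python
import numpy as np

def risk_table(scores, w0, m):
    """Convert integer total scores s = w^T x into risks via P = 1 / (1 + exp(-(s + w0) / m))."""
    scores = np.asarray(scores, dtype=float)
    return 1.0 / (1.0 + np.exp(-(scores + w0) / m))

# Placeholder model: integer intercept w0 and multiplier m chosen by the star ray search.
w0, m = -1, 2.0
for s, r in zip(range(-5, 6), risk_table(range(-5, 6), w0, m)):
    print(f"SCORE {s:>3}   RISK {100 * r:5.1f}%")
```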
Our method proceeds in three steps, as outlined in Algorithm 1. In the first step, it approximately solves the following sparse logistic regression problem with a box constraint (but not integer constraints), detailed in Section 3.1 and Algorithm 2:
$$(w^*, w_0^*) \in \operatorname*{argmin}_{w, w_0} L(w, w_0, \mathcal{D}), \quad \|w\|_0 \le k, \; w \in \mathbb{R}^p, \; \forall j \in [1, ..., p], \; w_j \in [-5, 5], \; w_0 \in \mathbb{R}. \tag{3}$$
The algorithm gives an accurate and sparse real-valued solution $(w^*, w_0^*)$.
The second step produces many near-optimal sparse logistic regression solutions, again without integer constraints, detailed in Section 3.2 and Algorithm 5. Algorithm 5 uses $(w^*, w_0^*)$ from the first step to find a set $\{(w^t, w_0^t)\}_t$ such that for all $t$ and a given threshold $\epsilon_w$:
$$(w^t, w_0^t) \text{ obeys } L(w^t, w_0^t, \mathcal{D}) \le L(w^*, w_0^*, \mathcal{D}) \times (1 + \epsilon_w) \tag{4}$$
$$\|w^t\|_0 \le k, \; w^t \in \mathbb{R}^p, \; \forall j \in [1, ..., p], \; w_j^t \in [-5, 5], \; w_0^t \in \mathbb{R}.$$
After these steps, we have a pool of almost-optimal sparse logistic regression models. In the third step, for each coefficient vector in the pool, we compute a risk score. It is a feasible integer solution $(w^{+t}, w_0^{+t})$ to the following, which includes a positive multiplier $m^t > 0$:
$$L\left(w^{+t}, w_0^{+t}, \tfrac{1}{m^t}\mathcal{D}\right) \le L(w^t, w_0^t, \mathcal{D}) + \epsilon^t, \tag{5}$$
$$w^{+t} \in \mathbb{Z}^p, \; \forall j \in [1, ..., p], \; w_j^{+t} \in [-5, 5], \; w_0^{+t} \in \mathbb{Z},$$
where we derive a tight theoretical upper bound on $\epsilon^t$. A detailed solution to (5) is shown in Algorithm 6 in Appendix A. We solve the optimization problem for a large range of multipliers in Algorithm 3 for each coefficient vector in the pool, choosing the best multiplier for each coefficient vector. This third step yields a large collection of risk scores, all of which are approximately as accurate as the best sparse logistic regression model that can be obtained. All steps in this process are fast and scalable.
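To make the third step concrete, here is a heavily simplified, self-contained sketch of a star ray search: for each candidate multiplier, the continuous coefficients are scaled and then rounded coordinate-by-coordinate to whichever of floor or ceiling gives the lower logistic loss, and the best (multiplier, integer solution) pair is kept. This illustrates the idea only; the paper's Algorithms 3 and 6 choose the multiplier grid and the rounding order more carefully and come with the bound on $\epsilon^t$. The rounding order and the up-front intercept rounding below are our own simplifications.

```python
import numpy as np

def logistic_loss(X, y, w, w0):
    return float(np.sum(np.logaddexp(0.0, -y * (X @ w + w0))))

def star_ray_search(X, y, w_cont, w0_cont, C=5, n_multipliers=20):
    """Scan multipliers; for each, scale and then sequentially round to integers in [-C, C]."""
    best = None  # (loss, multiplier, integer coefficients, integer intercept)
    m_max = C / max(np.max(np.abs(w_cont)), 1e-12)  # keep scaled coefficients inside the box
    for m in np.linspace(1.0, m_max, n_multipliers):
        w = np.clip(m * w_cont, -C, C)
        w0 = float(np.round(m * w0_cont))  # round the intercept up front for simplicity
        # Sequential rounding: commit one coordinate at a time to floor or ceiling,
        # choosing the option with the lower loss on the rescaled data (X / m).
        for j in np.argsort(-np.abs(w)):
            floor_v, ceil_v = np.floor(w[j]), np.ceil(w[j])
            losses = []
            for v in (floor_v, ceil_v):
                w_try = w.copy()
                w_try[j] = v
                losses.append(logistic_loss(X / m, y, w_try, w0 / m))
            w[j] = (floor_v, ceil_v)[int(np.argmin(losses))]
        loss = logistic_loss(X / m, y, w, w0 / m)
        if best is None or loss < best[0]:
            best = (loss, m, w.astype(int), int(w0))
    return best
```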
Algorithm 2 SparseBeamLR$(\mathcal{D}, k, C, B) \rightarrow (w, w_0)$

Input: dataset $\mathcal{D}$, sparsity constraint $k$, coefficient constraint $C$, and beam search size $B$.
Output: a sparse continuous coefficient vector $(w, w_0)$ with $\|w\|_0 \le k$, $\|w\|_\infty \le C$.
1: Define $N_+$ and $N_-$ as numbers of positive and negative labels, respectively.
2: $w_0 \leftarrow \log(N_+/N_-)$, $w \leftarrow 0$   ▷ Initialize the intercept and coefficients.
3: $\mathcal{F} \leftarrow \emptyset$   ▷ Initialize the collection of found supports as an empty set.
4: $(\mathcal{W}, \mathcal{F}) \leftarrow$ ExpandSuppBy1$(\mathcal{D}, (w, w_0), \mathcal{F}, B)$.   ▷ Returns $B$ models of support size 1.
5: for $t = 2, ..., k$ do   ▷ Beam search to expand the support
6:   $\mathcal{W}_{\mathrm{tmp}} \leftarrow \emptyset$
7:   for $(w', w_0') \in \mathcal{W}$ do   ▷ Each of these has support size $t - 1$
8:     $(\mathcal{W}', \mathcal{F}) \leftarrow$ ExpandSuppBy1$(\mathcal{D}, (w', w_0'), \mathcal{F}, B)$.   ▷ Returns $B$ models with support size $t$.
9:     $\mathcal{W}_{\mathrm{tmp}} \leftarrow \mathcal{W}_{\mathrm{tmp}} \cup \mathcal{W}'$
10:  end for
11:  Reset $\mathcal{W}$ to be the $B$ solutions in $\mathcal{W}_{\mathrm{tmp}}$ with the smallest logistic loss values.
12: end for
13: Pick $(w, w_0)$ from $\mathcal{W}$ with the smallest logistic loss.
14: Return $(w, w_0)$.
3.1 High-quality Sparse Continuous Solution
There are many different approaches for sparse logistic regression, including $\ell_1$ regularization [35], ElasticNet [48], $\ell_0$ regularization [13, 24], and orthogonal matching pursuit (OMP) [14, 25], but none of these approaches seem to be able to handle both the box constraints and the sparsity constraint in Problem (3), so we developed a new approach. This approach, in Algorithm 2, SparseBeamLR,
uses beam search for best subset selection: each iteration contains several coordinate descent steps
to determine whether a new variable should be added to the support, and it clips coefficients to
the box $[-5, 5]$ as it proceeds. Hence the algorithm is able to determine, before committing to the new variable, whether it is likely to decrease the loss while obeying the box constraints. This beam search algorithm for solving (3) implicitly uses the assumption that one of the best models of size $k$ contains variables of one of the best models of size $k-1$. This type of assumption has been studied in the sparse learning literature [14] (Theorem 5). However, we are not aware of any other work that applies box constraints or beam search for sparse logistic regression. In Appendix E, we show that our method produces better solutions than the OMP method presented in [14].
Algorithm 2 calls the ExpandSuppBy1 Algorithm, which has two major steps. The detailed algorithm
can be found in Appendix A. For the first step, given a solution $w$, we perform optimization on each single coordinate $j$ outside of the current support $\mathrm{supp}(w)$:
$$d_j^* \in \operatorname*{argmin}_{d \in [-5, 5]} L(w + d\, e_j, w_0, \mathcal{D}) \quad \text{for } j \text{ where } w_j = 0. \tag{6}$$
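A minimal numeric sketch of the coordinate step in Equation (6), using a generic bounded 1-D minimizer (scipy's bounded Brent method) rather than the paper's own update rule; the data and function names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def logistic_loss(X, y, w, w0):
    return float(np.sum(np.logaddexp(0.0, -y * (X @ w + w0))))

def best_coordinate_step(X, y, w, w0, j, box=5.0):
    """Solve d*_j = argmin_{d in [-box, box]} L(w + d * e_j, w0, D) for a coordinate j
    outside the current support (w[j] == 0)."""
    def loss_of_d(d):
        w_try = w.copy()
        w_try[j] = d
        return logistic_loss(X, y, w_try, w0)
    res = minimize_scalar(loss_of_d, bounds=(-box, box), method="bounded")
    return res.x, res.fun  # optimal coefficient value and the resulting loss

# Rank all features outside the support by the loss achieved after the 1-D fit.
# (Illustrative data; in the algorithm this step is used inside ExpandSuppBy1.)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 8)).astype(float)
y = np.where(X[:, 0] + rng.normal(size=100) > 0.5, 1.0, -1.0)
w, w0 = np.zeros(8), 0.0
losses = {j: best_coordinate_step(X, y, w, w0, j)[1] for j in range(8) if w[j] == 0}
print(min(losses, key=losses.get))  # feature whose single-coordinate fit has the lowest loss
```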