[Figure 2 shows a risk score card for estimating the risk of a breast lesion being malignant: 1. Irregular Shape (4 points), 2. Circumscribed Margin (-5 points), 3. Spiculated Margin (2 points), 4. Age ≥ 45 (1 point), 5. Age ≥ 60 (3 points); the conversion table maps total scores -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6 to risks 7.3%, 9.7%, 12.9%, 16.9%, 21.9%, 27.8%, 34.6%, 42.1%, 50.0%, 57.9%, 65.4%, 72.2%.]

Figure 2: A second risk score on the mammo dataset on the same fold as in Figure 1. The AUCs on the training and test sets are 0.855 and 0.859, respectively. FasterRisk can produce a diverse pool of high-quality models. Users can choose a model that best fits with their domain knowledge.
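As a small illustration of how such a scorecard is used (a sketch; the points and the score-to-risk table are transcribed from Figure 2, while the dictionary keys and function name are ours), the total score is a sum of integer points and the predicted risk is read from the table:

```python
# Points and score-to-risk table transcribed from the Figure 2 risk score.
POINTS = {
    "irregular_shape": 4,
    "circumscribed_margin": -5,
    "spiculated_margin": 2,
    "age_ge_45": 1,
    "age_ge_60": 3,
}
RISK_TABLE = {  # total score -> predicted probability of malignancy
    -5: 0.073, -4: 0.097, -3: 0.129, -2: 0.169, -1: 0.219, 0: 0.278,
     1: 0.346,  2: 0.421,  3: 0.500,  4: 0.579,  5: 0.654,  6: 0.722,
}

def predict_risk(patient: dict) -> float:
    """Sum the points of the features that are present, then look up the risk."""
    score = sum(pts for name, pts in POINTS.items() if patient.get(name, False))
    score = max(-5, min(6, score))  # clamp to the score range shown in the figure's table
    return RISK_TABLE[score]

# Example: a 67-year-old patient with a spiculated, non-circumscribed lesion
# scores 2 + 1 + 3 = 6, i.e., a predicted risk of 72.2%.
print(predict_risk({"spiculated_margin": True, "age_ge_45": True, "age_ge_60": True}))
```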
Risk scores have the benefit of being easily memorized; usually their names reveal the full model – for instance, the factors in CHADS₂ are past Chronic heart failure, Hypertension, Age ≥ 75 years, Diabetes, and past Stroke (where past stroke receives 2 points and the others each receive 1 point). For risk scores, counterfactuals are often trivial to compute, even without a calculator. Also, checking that the data and calculations are correct is easier with risk scores than with other approaches. In short, risk scores have been created by humans for a century to support a huge spectrum of applications [2, 23, 30, 43, 44, 47], because humans find them easy to understand.
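As a concrete example of how such a score is applied (a minimal sketch; the point values are the standard CHADS₂ ones just described, while the function and argument names are ours), the whole model is a handful of integer additions:

```python
def chads2_score(chf: bool, hypertension: bool, age_ge_75: bool,
                 diabetes: bool, prior_stroke: bool) -> int:
    """CHADS2: 1 point each for C, H, A, and D; 2 points for past Stroke."""
    return chf + hypertension + age_ge_75 + diabetes + 2 * prior_stroke

# An 80-year-old with hypertension and a prior stroke scores 1 + 1 + 2 = 4.
print(chads2_score(chf=False, hypertension=True, age_ge_75=True,
                   diabetes=False, prior_stroke=True))
```

Counterfactuals are equally simple: asking "what if this patient did not have hypertension?" amounts to subtracting a single point.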
Traditionally, risk scores have been created in two main ways: (1) without data, with expert knowledge
only (and validated only afterwards on data), and (2) using a semi-manual process involving manual
feature selection and rounding of logistic regression coefficients. That is, these approaches rely
heavily on domain expertise and rely little on data. Unfortunately, the alternative of building a model
directly from data leads to computationally hard problems: optimizing risk scores over a global
objective on data is NP-hard, because in order to produce integer-valued scores, the feasible region
must be the integer lattice. There have been only a few approaches to design risk scores automatically
[5, 6, 9, 10, 16, 32, 33, 38, 39, 40], but each of these has a flaw that limits its use in practice: the
optimization-based approaches use mathematical programming solvers (which require a license) that
are slow and scale poorly, and the other methods are randomized greedy algorithms, producing fast
but much lower-quality solutions. We need an approach that exhibits the best of both worlds: speed
fast enough to operate in a few minutes on a laptop and optimization/search capability as powerful as
that of the mathematical programming tools. Our method, FasterRisk, lies at this intersection. It is
fast enough to enable interactive model design and can rapidly produce a large pool of models from
which users can choose rather than producing only a single model.
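For concreteness, the global objective referred to above can be written as follows (a sketch, assuming logistic loss, a sparsity bound $k$, and box-constrained integer coefficients; the exact constraint set varies across the methods cited):
\[
\min_{w \in \mathbb{Z}^{p},\ w_0 \in \mathbb{Z}} \ \sum_{i=1}^{n} \log\bigl(1 + \exp\bigl(-y_i (x_i^\top w + w_0)\bigr)\bigr)
\quad \text{s.t.} \quad \|w\|_0 \le k, \quad w_j \in \{-c, \dots, c\} \ \ \forall j.
\]
It is the integrality constraint on $w$ that makes the feasible region the integer lattice and the problem NP-hard.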
One may wonder why simple rounding of ℓ₁-regularized logistic regression coefficients does not yield sufficiently good risk scores. Past works [37, 39] explain this as follows: the sheer amount of ℓ₁ regularization needed to get a very sparse solution leads to large biases and worse loss values, and rounding goes against the performance gradient. For example, consider the following coefficients from ℓ₁ regularization: [1.45, .87, .83, .47, .23, .15, ...]. This model is worse than its unregularized counterpart due to the bias induced by the large ℓ₁ term. Its rounded solution is [1, 1, 1, 0, 0, 0, ...], which leads to even worse loss. Instead, one could multiply all the coefficients by a constant and then round, but which constant is best? There are an infinite number of choices. Even if some value of the multiplier leads to minimal loss due to rounding, the bias from the ℓ₁ term still limits the quality of the solution.
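To make the multiplier issue concrete, here is a small sketch (synthetic data; the coefficient vector is the hypothetical one above, and the grid of multipliers is arbitrary) showing that the loss of the rounded, rescaled solution depends non-trivially on the chosen multiplier:

```python
import numpy as np

def logistic_loss(X, y, w, b):
    """Mean logistic loss with labels y in {-1, +1}."""
    margins = y * (X @ w + b)
    return float(np.mean(np.logaddexp(0.0, -margins)))

rng = np.random.default_rng(0)

# Hypothetical continuous coefficients like those in the example above.
w_cont = np.array([1.45, 0.87, 0.83, 0.47, 0.23, 0.15])
b_cont = -0.5

# Synthetic data drawn from the corresponding logistic model.
X = rng.normal(size=(1000, 6))
p = 1.0 / (1.0 + np.exp(-(X @ w_cont + b_cont)))
y = np.where(rng.random(1000) < p, 1, -1)

print("continuous loss:", logistic_loss(X, y, w_cont, b_cont))

# Multiply-then-round: for each multiplier m, round m*w to integers. The
# multiplier is folded into the score-to-risk mapping, so the effective
# coefficients are round(m*w)/m; the intercept is left continuous for brevity.
for m in [1.0, 1.5, 2.0, 3.0, 5.0]:
    w_int = np.round(m * w_cont)
    loss = logistic_loss(X, y, w_int / m, b_cont)
    print(f"m={m}: integer coefficients {w_int.astype(int)}, loss={loss:.4f}")
```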
The algorithm presented here does not have these disadvantages. The steps are: (1) Fast subset search with ℓ₀ optimization (avoiding the bias from ℓ₁). This requires the solution of an NP-hard problem, but our fast subset selection algorithm is able to solve this quickly. We proceed from this accurate sparse continuous solution, preserving both sparseness and accuracy in the next steps. (2) Find a pool of diverse continuous sparse solutions that are almost as good as the solution found in (1) but with different support sets. (3) A “star ray” search, where we search for feasible integer-valued solutions along multipliers of each item in the pool from (2). By using multipliers, the search space resembles the rays of a star, because it extends each coefficient in the pool outward from the origin to search for solutions. To find integer solutions, we perform a local search (a form of sequential rounding). This method yields high-performance solutions: we provide a theoretical upper bound on the loss difference between the continuous sparse solution and the rounded integer sparse solution.
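At a high level, the three steps can be illustrated by the following heavily simplified, runnable sketch (the subset search is done by brute force, the intercept and the local-search polishing step are omitted, and all names are ours, not the actual FasterRisk API):

```python
import numpy as np
from itertools import combinations

def logistic_loss(X, y, w):
    """Mean logistic loss with labels y in {-1, +1}; intercept omitted for brevity."""
    return float(np.mean(np.logaddexp(0.0, -y * (X @ w))))

def fit_logistic(X, y, support, steps=2000, lr=0.1):
    """Gradient descent on the logistic loss, restricted to the given support set."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (p - (y > 0)) / len(y)
        w[support] -= lr * grad[support]
    return w

def faster_risk_sketch(X, y, k=3, bound=5, multipliers=np.linspace(1, 5, 9), tol=1.05):
    d = X.shape[1]
    # (1) Sparse continuous solution: brute force over supports of size k
    #     (a stand-in for the paper's fast l0 subset-selection algorithm).
    fitted = []
    for s in combinations(range(d), k):
        w = fit_logistic(X, y, list(s))
        fitted.append((logistic_loss(X, y, w), w))
    fitted.sort(key=lambda t: t[0])
    best_loss = fitted[0][0]
    # (2) Pool of diverse near-optimal continuous solutions (different supports,
    #     loss within a tolerance of the best).
    pool = [w for loss, w in fitted if loss <= tol * best_loss]
    # (3) "Star ray" search: scale each pooled solution by each multiplier and
    #     round to bounded integers; the effective coefficients are w_int / m.
    scores = []
    for w in pool:
        for m in multipliers:
            w_int = np.clip(np.round(m * w), -bound, bound)
            scores.append((logistic_loss(X, y, w_int / m), w_int.astype(int), m))
    scores.sort(key=lambda t: t[0])
    return scores

# Tiny synthetic demo.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = np.where(X[:, 0] + 0.8 * X[:, 2] - 0.5 * X[:, 4] + 0.3 * rng.normal(size=300) > 0, 1, -1)
for loss, w_int, m in faster_risk_sketch(X, y)[:3]:
    print(f"loss={loss:.3f}  integer coefficients={w_int}  multiplier={m:.2f}")
```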