Third, since most bilevel optimization methods rely on separate steps to deal with $\theta$ and $z$, the optimization problem that defines $g$ can be built on the fly, adjusting the size of the decision vector $z$ as needed (e.g. scaling the number of possible transplants with the patient/donor pool).
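For instance, the following sketch (a toy illustration rather than our actual implementation; the function name `build_inner_problem` and the data fields are hypothetical, and cvxpy is just one possible modeling layer) shows how the inner problem defining $g$ can be instantiated anew at every decision step, with the decision vector $z$ sized to match the current pool:

```python
import numpy as np
import cvxpy as cp

def build_inner_problem(costs, required, y):
    """Build and solve the inner problem g(x, y) for one instance.

    The decision vector z is sized on the fly: one variable per candidate
    in `costs`, so pools of any size are handled by the same code.  Here
    the virtual parameters y enter as cost adjustments (one possible choice).
    """
    n = len(costs)                                        # instance-dependent size
    z = cp.Variable(n, nonneg=True)                       # decision vector z
    objective = cp.Minimize((np.asarray(costs) + y) @ z)  # modified cost f~
    constraints = [cp.sum(z) >= required, z <= 1]         # feasible set C~
    cp.Problem(objective, constraints).solve()
    return z.value

# The same routine handles pools of different sizes.
for pool_size in (5, 12):
    costs = np.random.rand(pool_size)
    y = np.zeros(pool_size)   # in our method, y would come from h(x, theta)
    print(build_inner_problem(costs, required=3, y=y))
```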
Finally, the reformulated policy features a vector (the virtual parameters $y$) that is absent in the original formulation. Choosing $y$ is a design decision, and the trickiest step in applying our method. A conservative approach consists of: 1) starting from a mathematical model of the original decision problem (e.g. routing, assignment, selection); 2) selecting a set of parameters from that model to be used as $y$.
While this approach is viable, it misses a key opportunity: as the notation and terminology for $\tilde{C}$ and $\tilde{f}$ imply, the feasible set and cost function of the inner optimization problem $g$ may differ from the original ones. For example, it is possible to introduce penalties or rewards, buffers on constraint satisfaction thresholds, or even to relax constraints or to remove cost terms. This flexibility provides a new design handle that can be used to “partition” the challenging aspects of the original problem between $h$ and $g$, depending on which of the two components is better suited to deal with them. While this is a powerful mechanism for tackling complex problems, it does require a degree of creativity and expertise to be used effectively. We provide a few examples of how to do so in Section 5.
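As a small, hypothetical illustration of this design handle (not one of the use cases from Section 5): a virtual parameter can act as a learned buffer on a constraint threshold, while a hard requirement of the original problem is relaxed into a penalized slack term, so that $h$ absorbs the uncertainty and $g$ remains a simple deterministic problem. All names below are made up for the example.

```python
import numpy as np
import cvxpy as cp

def inner_problem_with_buffer(costs, demand, y_buffer, penalty=10.0):
    """Modified inner problem illustrating the freedom in choosing C~ and f~:
    * `y_buffer` (a virtual parameter produced by h) tightens the coverage
      threshold, acting as a safety margin against uncertain demand;
    * the original hard constraint is relaxed via a penalized slack, i.e. a
      cost term takes the place of strict feasibility.
    """
    costs = np.asarray(costs, dtype=float)
    n = len(costs)
    z = cp.Variable(n, nonneg=True)
    slack = cp.Variable(nonneg=True)
    constraints = [
        cp.sum(z) + slack >= demand + y_buffer,  # buffered, relaxed threshold
        z <= 1,
    ]
    objective = cp.Minimize(costs @ z + penalty * slack)  # modified cost f~
    cp.Problem(objective, constraints).solve()
    return z.value
```

In this split, the learned component decides how conservative to be (through `y_buffer`), while the optimization component handles the constrained structure; deciding which aspects of the problem go to $h$ and which to $g$ is exactly the design choice discussed above.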
Sequential Formulation and RL
The training formulation from Equation (5) applies to single-stage problems, but it can be extended to sequential decision-making. Formally, we can view the problem as a Markov Decision Process $\langle X, Z, p_+, f, p_1, \gamma \rangle$, where $X$ is the set of possible (observable) states, $Z$ is the decision space, $p_+$ is the probability distribution for state transitions, $f$ is the cost function, $p_1$ is the probability distribution for the initial state, and $\gamma \in (0, 1]$ is a discount factor. At each step of the sequential process we will have access to a distinct observed state $x_k$ and we will output a distinct decision vector $z_k$; the value $p_+(x_{k+1}, x_k, z_k)$ denotes the probability of reaching state $x_{k+1}$ when applying decisions $z_k$ in state $x_k$; the cost function $f$ is the same as in Equations (2) and (5). Within this framework, the training problem becomes:
$$
\begin{aligned}
\mathop{\arg\min}_{\theta \in \Theta} \quad & \mathbb{E}_{\tau \sim p}\!\left[ \sum_{k=1}^{eoh} \gamma^k f(x_{k+1}, x_k, z_k) \right] \\
\text{with: } & z_k = g(x_k, y_k) = \mathop{\arg\min}_{z \in \tilde{C}(x_k, y_k)} \tilde{f}(x_k, y_k, z) \\
& y_k = h(x_k, \theta)
\end{aligned}
\tag{6}
$$
where $y_k$ is the ML model output for step $k$, and $\tau$ refers to a trajectory, i.e. to a sequence of states, ML outputs, and decisions $\{(x_k, y_k, z_k)\}_{k=1}^{eoh}$. The probability of a trajectory is given by $p(\tau) = p_1(x_1) \prod_{k=1}^{eoh} p_+(x_{k+1}, x_k, z_k)$. The term $eoh$ is the End Of Horizon, and it can be infinite if the sequence is not upper-bounded. In this case, the discount factor $\gamma$ should be strictly lower than 1, while $\gamma$ should be equal to 1 for decision problems over a finite horizon.
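To make the objective of Equation (6) concrete, the sketch below evaluates it on a single sampled trajectory. The callables `h`, `g`, `f` and `sample_transition` are placeholders for the ML model, the inner solver, the cost function and the transition distribution $p_+$; in practice the expectation would be approximated by averaging over many such rollouts.

```python
def rollout_cost(h, g, f, sample_transition, x1, theta, gamma=0.99, eoh=50):
    """Discounted cost of one trajectory, mirroring Equation (6):
    at each step the ML model h produces the virtual parameters y_k, the
    inner solver g returns the decisions z_k, the environment samples the
    next state, and the discounted cost terms are accumulated.
    """
    x, total = x1, 0.0
    for k in range(1, eoh + 1):
        y = h(x, theta)                    # y_k = h(x_k, theta)
        z = g(x, y)                        # z_k = argmin over C~(x_k, y_k)
        x_next = sample_transition(x, z)   # x_{k+1} ~ p+(., x_k, z_k)
        total += gamma**k * f(x_next, x, z)
        x = x_next
    return total
```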
Equation (6) can be interpreted as defining a Reinforcement Learning problem, where $h$ plays the role of a conventional RL policy, and $g$ can be viewed at training time as part of the environment. This is an important observation, since it implies that any RL approach can be used. Once an optimized parameter vector $\theta^*$ has been obtained, the reformulated policy is given by $g(x, h(x, \theta^*))$ as usual. The sequential formulation directly generalizes the single-step one, meaning that RL algorithms can also be used to tackle Equation (5); in fact, this is exactly what we do in our experimentation, since it simplifies the exposition.
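A minimal sketch of this interpretation using the Gymnasium interface (the callables `solve_g`, `sample_transition`, `cost` and `x1_sampler` are assumptions standing in for $g$, $p_+$, $f$ and $p_1$; the spaces are simplified): the inner solver is folded into the environment's step function, so the virtual parameters $y$ become the RL action and any off-the-shelf agent can be used to train $h$.

```python
import numpy as np
import gymnasium as gym

class InnerSolverEnv(gym.Env):
    """Environment that hides the inner solver g from the RL agent.

    The agent's action is the virtual parameter vector y; step() solves g,
    applies the resulting decisions, and returns the negated cost as reward.
    """

    def __init__(self, solve_g, sample_transition, cost, x1_sampler, x_dim, y_dim):
        self.solve_g, self.sample_transition = solve_g, sample_transition
        self.cost, self.x1_sampler = cost, x1_sampler
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(x_dim,), dtype=np.float32)
        self.action_space = gym.spaces.Box(-np.inf, np.inf, shape=(y_dim,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.x = self.x1_sampler()              # x_1 ~ p_1
        return self.x, {}

    def step(self, y):
        z = self.solve_g(self.x, y)             # z_k = g(x_k, y_k)
        x_next = self.sample_transition(self.x, z)
        reward = -self.cost(x_next, self.x, z)  # RL maximizes reward = -f
        self.x = x_next
        # termination/truncation handling is omitted in this sketch
        return self.x, reward, False, False, {}
```

Once $\theta^*$ has been trained with any standard agent on such an environment, deployment simply feeds the policy output $y$ to the same solver, recovering $g(x, h(x, \theta^*))$.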
4 Relation to Other Approaches
Decision Focused Learning (DFL)
Approaches in this class seek to train estimators for the parameters of optimization models, while explicitly accounting for the impact that estimation inaccuracy has on the decisions. The idea gained attention after the seminal work by [4], with several new methods proposed since then; these are well covered in a recent survey by [5].
While the original approach was limited to convex costs and constraints, subsequent works have tackled combinatorial problems, either in an approximate fashion via continuous relaxations [6], or in an exact fashion by assuming a fixed feasible space and linear costs [7, 8]. Outer and inner relaxations have also been used to improve scalability [9, 10].
DFL can be interpreted as solving (via subgradient optimization) a close variant of Equation (5), with some restrictions and one generalization. In particular, all DFL approaches lack virtual parameters: the ML model is instead expected to estimate part of the future state, i.e. $y$ is a prediction for a portion of $x_+$. For example, the estimator might look at current traffic to predict future travel times. Additionally, $\tilde{f}$ always corresponds to the original decision problem cost, and in all but a few cases the feasible space is fixed, i.e. $C(x, y) = C$. The connection between $y$ and $x_+$ implies that