that are similar to other RL problems such as robot control. Others have also shown that long-term fairness
is nontrivial, and analyzing it in the context of a static scenario can be harmful because it contradicts fairness objectives optimized in static settings [21, 22, 26]. For example, [21, 26] find that providing a direct subsidy for a disadvantaged group with the purpose of improving some institutional utility actually widens the gap between advantaged and disadvantaged groups over time, which further shows that long-term fairness is difficult to achieve. There have been a growing number of studies
on fairness in the long term with various algorithmic approaches [28, 35, 24, 14]. [24] proposes a graph-based algorithm to improve fairness in recommendations for items and suppliers, relating fairness to breaking the perpetuation of bias in the interactions between users and items. [14] proposes the use of causal directed acyclic graphs (DAGs) as a paradigm for studying fairness in dynamical systems, arguing that causal reasoning can help improve the fairness of off-policy learning and that, if the underlying environment dynamics are known, causal DAGs can be used as simulators for the models in training. [35] provides a framework for studying long-term fairness and finds that static fairness constraints can either promote fairness or increase the disparity between advantaged and disadvantaged groups in dynamical systems. [17] studies how to maintain long-term fairness of item exposure in recommendation tasks where items are split into groups, using a modified Constrained Policy Optimization (CPO) procedure [1]. [9] introduces the fairness notion of return parity, a measure of the similarity in expected returns across different demographic groups, and provides an algorithm for minimizing this disparity.
Several recent works have also considered fairness-like constraints in deep reinforcement learning in various contexts. [8] designs fairness-optimized actor-critic algorithms for deep reinforcement learning, enforcing fairness by multiplicatively adjusting the reward for fairness-utility optimization in standard actor-critic reinforcement learning. The work of [32] studies multi-dimensional reward functions for MDPs motivated by fairness and equality constraints, and provides a theoretical analysis of the approximation error with respect to the optimal average reward. Our focus is on proposing a practical algorithm for making fair decisions in the dynamic environments formulated in [16] and on showing that policy optimization through advantage regularization can find neural network policies that significantly outperform previously known strategies in the dynamic setting.
Policy Optimization under Constraints.
The most widely adopted formulation of RL with a set of constraints is constrained Markov decision processes (CMDPs) [2, 34]. Safety constraints are incorporated by augmenting the standard MDP framework with constraints over expectations of auxiliary costs. When the model is known in discrete tabular settings, a CMDP is solvable using linear programming (LP) [2]. However, results are limited for model-free scenarios where the dynamics are unknown, and for large-scale or even continuous state-action spaces [1, 11, 34]. More importantly, in high-dimensional CMDP settings where high-capacity function approximators are adopted, both the objective and the constraints are non-convex. Recent methods for solving CMDPs in continuous spaces can be divided into two categories according to how constraints are incorporated. In soft-constrained RL, a common practice is to apply the Lagrangian method with learnable Lagrange multipliers and solve the converted unconstrained saddle-point optimization problem using policy-based methods [5, 10, 33]. Such Lagrangian methods achieve safety once policies converge asymptotically, but allow possible violations during training. In contrast, hard-constrained RL aims to learn safe policies throughout training. Representative works include Constrained Policy Optimization (CPO) based on trust regions [1], surrogate algorithms with stepwise [15] and supermartingale [27] surrogate constraints, as well as Lyapunov-based approaches [11, 12].
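As an illustrative sketch (using generic notation that is not defined in this excerpt, with a single discounted auxiliary cost), the soft-constrained setting above can be written as

\max_{\pi} \; J_R(\pi) \quad \text{s.t.} \quad J_C(\pi) \le d, \qquad J_R(\pi) = \mathbb{E}_{\tau \sim \pi}\!\Big[\textstyle\sum_t \gamma^t r(s_t, a_t)\Big], \quad J_C(\pi) = \mathbb{E}_{\tau \sim \pi}\!\Big[\textstyle\sum_t \gamma^t c(s_t, a_t)\Big],

which the Lagrangian approach relaxes to the saddle-point problem

\min_{\lambda \ge 0} \, \max_{\pi} \; J_R(\pi) - \lambda \big(J_C(\pi) - d\big),

where the multiplier \lambda is learned jointly with the policy.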
3 Policy Optimization with Advantage Regularization
In long-term fairness studies [16], fairness is evaluated over a time horizon during which the agent interacts with the environment, and the environment can change in response to these interactions. Simulations following the MDP framework are one way to analyze fairness over time and to systematically derive strategies that maximize fairness in the long term rather than in a single step. MDPs naturally incorporate the idea that actions made in the present can have accumulating consequences on the environment over time. Long-term fairness is evaluated with metrics that describe the consequences of an agent's policy on the different subgroups in an environment over time. These metrics are computed at each step of the MDP and incorporate data collected from past time steps.
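As a minimal sketch of this evaluation loop (the gym-style environment interface, the two-group setup, and the cumulative-disparity metric below are illustrative assumptions, not the specific metrics of [16]):

import numpy as np

def evaluate_long_term_fairness(env, policy, horizon=200):
    """Roll out `policy` in `env` and record a fairness metric at every step.

    The metric here is an illustrative disparity measure: the absolute gap in
    cumulative positive outcomes between two subgroups, so the value at step t
    depends on data collected from all past time steps.
    """
    obs = env.reset()
    cumulative_positive = {0: 0, 1: 0}  # positive outcomes per subgroup so far
    disparity_over_time = []
    for t in range(horizon):
        action = policy(obs)
        obs, reward, done, info = env.step(action)
        # The environment is assumed to report which subgroup was affected
        # and whether the outcome was positive (e.g., a granted loan was repaid).
        cumulative_positive[info["group"]] += int(info["positive_outcome"])
        disparity_over_time.append(
            abs(cumulative_positive[0] - cumulative_positive[1])
        )
        if done:
            break
    return np.array(disparity_over_time)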