Policy Optimization with Advantage Regularization
for Long-Term Fairness in Decision Systems
Eric Yang Yu
UC San Diego
Zhizhen Qin
UC San Diego
Min Kyung Lee
UT Austin
Sicun Gao
UC San Diego
Abstract
Long-term fairness is an important factor of consideration in designing and deploying learning-based decision systems in high-stake decision-making contexts.
Recent work has proposed the use of Markov Decision Processes (MDPs) to formu-
late decision-making with long-term fairness requirements in dynamically changing
environments, and demonstrated major challenges in directly deploying heuristic
and rule-based policies that worked well in static environments. We show that
policy optimization methods from deep reinforcement learning can be used to find
strictly better decision policies that can often achieve both higher overall utility and
less violation of the fairness requirements, compared to previously-known strategies. In particular, we propose new methods for imposing fairness requirements in
policy optimization by regularizing the advantage evaluation of different actions.
Our proposed methods make it easy to impose fairness constraints without reward
engineering or sacrificing training efficiency. We perform detailed analyses in three
established case studies, including attention allocation in incident monitoring, bank
loan approval, and vaccine distribution in population networks.
1 Introduction
Learning-based algorithmic decision systems are increasingly used in high-stake decision-making
contexts. A critical factor of consideration in their design and deployment is to ensure fairness and
avoid disparate impacts on the marginalized populations [25]. Although many approaches have been
developed to study and ensure fairness in algorithmic decision systems [25, 13], most of the literature
studies fair decision-making in a one-shot context, i.e., making the decision that maximizes
fairness in a static setting. This approach fails to explicitly address how decisions made in the present
may affect the future status and behaviors of targeted groups, which in turn can form a feedback
loop that negatively impacts the effectiveness and fairness of the decision-making strategies. In other
words, the implications associated with long-term fairness, or fairness evaluated over a time horizon
rather than in a single time step, are largely under-studied.
The long-term impact of such decision systems has recently been explored through explicit modeling
of the dynamics and feedback effects in the interactions between the decision-makers and the
targeted populations [16, 28, 35, 24, 14]. In particular, the recent work of [16] has demonstrated,
with concrete simulation examples, how long-term fairness cannot be analyzed in closed form,
but requires the use of more computational analysis tools based on simulations. They proposed
to formulate such long-term dynamics and the interaction between the decision-making and the
environment in the framework of Markov Decision Processes (MDPs). This formulation and the
corresponding simulation environments make it possible to take advantage of recent advances in deep
reinforcement learning (RL) for finding new decision-making policies that can achieve both better
overall utility and fairness, compared to manually designed heuristic and rule-based strategies.
One challenge of directly using RL-based methods for learning decision-making policies, however, is
that the goal of decision systems in the high-stake decision-making context is often inherently multi-objective.
On one hand, from a utilitarian perspective, an effective policy should try to maximize the
overall expected utility of the decisions for all targeted groups. On the other hand, constraints such
as the fairness requirements should be explicitly enforced to prevent biased policies that negatively
treat certain groups in temporarily or historically disadvantaged situations. Since the standard RL
framework for policy optimization only optimizes a policy with respect to a monolithic reward
function, it can be difficult to enforce these fairness requirements during training. An intuitive
approach for enforcing fairness in standard RL is to define a penalty term in the objective function
that captures the magnitude of violation of the fairness requirements by the policy. However, this
approach would require the RL user to define the trade-off between the utilitarian objective and the
fairness objective, typically as weights on each objective. This requirement of reward engineering
can make it hard to justify the policies obtained by the RL algorithms, because one can question
whether the predefined weights have introduced problematic assumptions and trade-offs between
the objectives in the first place. The monolithic reward definitions may also incentivize the learning
agent to perform reward hacking [29], or adopt undesirable behaviors that exploit the wrong trade-off
between the different objectives, such as conservatively accumulating incremental rewards without
achieving the overall goal just to avoid the penalty of violating constraints.
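Concretely, this intuitive penalty-based approach amounts to hand-crafting a scalarized reward of the form (the notation here is ours and is introduced only for illustration):
\[
\tilde{r}(s_t, a_t) \;=\; r(s_t, a_t) \;-\; \lambda \, \Delta_{\text{fair}}(s_t, a_t),
\]
where $r$ is the utilitarian reward, $\Delta_{\text{fair}}$ measures the instantaneous violation of the fairness requirement, and $\lambda$ is the user-chosen trade-off weight. The learned policy can be highly sensitive to the choice of $\lambda$, which is precisely the reward-engineering burden and reward-hacking risk described above.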
One approach for addressing such problems is the framework of Constrained Markov Decision
Processes (CMDPs) [2], which allows the RL user to explicitly declare rewards and constraints and
use RL algorithms such as Constrained Policy Optimization (CPO) [1] to simultaneously maximize
reward and minimize constraint violation over time. However, the CMDP formulation only requires
the learning algorithms to lower the expectation of constraint violation asymptotically, i.e., achieving
constraint-abiding policies in a probabilistic sense if training time is allowed to be infinitely long, and
cannot ensure fairness for policies trained in practice. Moreover, in comparison to standard policy
optimization methods such as Proximal Policy Optimization (PPO) [31], algorithms for CMDPs can
take significantly longer, and the policies obtained after finite training periods may still have high
constraint-violation rates and poor performance.
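For reference, a CMDP augments the MDP objective with expectation constraints over auxiliary costs; in standard notation (ours, not quoted from [2]), the problem is
\[
\max_{\pi} \;\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} c_i(s_t, a_t)\right] \le d_i, \quad i = 1, \dots, m,
\]
where each auxiliary cost $c_i$ could, for instance, encode the per-step violation of a fairness requirement and $d_i$ is its allowed budget.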
We propose new methods for enforcing long-term fairness properties in decision systems by taking a
constrained RL approach. At a high level, we enforce fairness requirements at the policy gradient level
during policy optimization with minimal additional computational overhead. By enforcing fairness
constraints through advantage regularization rather than at the objective level, we avoid reward
engineering or hacking on the decision problems and aim to algorithmically optimize the trade-off
between utility and fairness. At the same time, the proposed learning algorithms can train the decision
policies much more efficiently than existing CMDP methods. Finally, the simplicity of our approach
enables easy integration with off-the-shelf policy optimization algorithms in RL. Our methods are
inspired by Lyapunov stability methods for improving stability in control systems [19, 30, 23, 4, 7, 6].
We show that fairness properties can be handled in a similar framework with our specific design of
the constraint regularization terms. In sum, our main contributions are as follows:
• We show that RL approaches are effective for designing policies that can achieve long-term fairness,
where existing heuristic and rule-based approaches do not perform well [16]. Specifically, we
demonstrate that policy optimization methods can find strictly better decision policies that achieve
higher overall utility and lower violation of the fairness requirements than previously-known strategies.
• We propose novel methods for imposing fairness requirements in standard policy optimization
procedures by regularizing the advantage evaluation during the policy gradient steps. This approach
uses control-theoretic frameworks to enforce fairness constraints and avoids reward engineering
issues in the decision-making context [16].
• We evaluate our approaches in several established case studies using the simulation environments [16, 3],
such as incident monitoring, bank loan approval, and vaccine distribution for infectious diseases
in population networks. We find that the proposed policy optimization with advantage regularization
is able to find policies that perform better than previously-known strategies, both in achieving higher
overall utility and lower violation of the fairness requirements in all the case study environments.
2 Related Work
Long-term Fairness in Algorithmic Decision-Making.
The work in [16] is the first to formulate
long-term fairness problems in decision systems as Markov Decision Processes (MDPs). The
simulation environments proposed in the work allow us to consider the agent design problem in ways
that are similar to other RL problems such as robot control. Others have also shown that long-term fairness
is nontrivial, and analyzing it in the context of a static scenario can be harmful because it contradicts
fairness objectives optimized in static settings [21, 22, 26]. For example, [21, 26] find that providing
a direct subsidy for a disadvantaged group with the purpose of improving some institutional utility
actually widens the gap between advantaged and disadvantaged groups over time, which further
shows that long-term fairness is difficult to achieve. There have been a growing number of studies
on fairness in the long term with various algorithmic approaches [28, 35, 24, 14]. [24] proposes
a graph-based algorithm to improve fairness in recommendations for items and suppliers. They
relate fairness to breaking the perpetuation of bias in the interactions between users and items. [14]
proposes the use of causal directed acyclic graphs (DAGs) as a paradigm for studying fairness in
dynamical systems. They argue that causal reasoning can help improve the fairness of off-policy
learning, and if the underlying environment dynamics are known, causal DAGs can be used as
simulators for the models in training. [35] provides a framework for studying long-term fairness
and finds that static fairness constraints can either promote fairness or increase disparity between
advantaged and disadvantaged groups in dynamical systems. [17] studies how to maintain long-term
fairness on item exposure for the task of splitting items into groups by recommendation, using a
modified Constrained Policy Optimization (CPO) procedure [1]. [9] introduces the fairness notion of
return parity, a measure of the similarity in expected returns across different demographic groups,
and provides an algorithm for minimizing this disparity.
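As a rough sketch in our own notation (the precise definition is given in [9]), for two demographic groups $a$ and $b$ the return-parity gap can be written as
\[
\Delta_{\text{RP}} \;=\; \big|\, \mathbb{E}_{\pi}[\,G \mid z = a\,] \;-\; \mathbb{E}_{\pi}[\,G \mid z = b\,] \,\big|,
\]
where $G$ denotes the return and $z$ the group membership; their algorithm then seeks policies that keep $\Delta_{\text{RP}}$ small.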
Several recent works have also considered fairness-like constraints in deep reinforcement learning in
various contexts. [8] designs fairness-optimized actor-critic algorithms in deep reinforcement
learning. They enforce fairness by multiplicatively adjusting the reward for fairness utility optimization
in standard actor-critic reinforcement learning. The work of [32] studied multi-dimensional
reward functions for MDPs motivated by fairness and equality constraints, and performed theoretical
analysis on the approximation error with respect to the optimal average reward. Our focus is on
proposing a practical algorithm for making fair decisions in the dynamic environments formulated
in [16], and on showing that policy optimization through advantage regularization can find neural network
policies that significantly outperform previously known strategies in the dynamic setting.
Policy Optimization under Constraints.
The most widely-adopted formulation of RL with a set
of constraints is constrained Markov Decision Processes (CMDPs) [2, 34]. Safety constraints are
incorporated by augmenting the standard MDP framework with constraints over expectations of
auxiliary costs. When models are known in discrete tabular settings, a CMDP is solvable using linear
programming (LP) [2]. However, results are limited for model-free scenarios where model dynamics
are unknown, and for large-scale or even continuous state action spaces [1, 11, 34]. More importantly,
both the objective and the constraints in high-dimensional CMDP settings, where high-capacity function
approximators are adopted, are non-convex. Recent methods for solving CMDPs in continuous spaces
can be divided into two categories, in terms of how they incorporate constraints. In soft-constrained
RL, it is common practice to apply the Lagrangian method with learnable Lagrangian multipliers and
solve the resulting unconstrained saddle-point optimization problem using policy-based methods
[5, 10, 33]. Such Lagrangian methods achieve overall safety when policies converge asymptotically,
while allowing possible violations during training. In contrast, hard-constrained RL
aims to learn safe policies throughout training. Representative works include Constrained Policy
Optimization (CPO) based on trust region [1], surrogate algorithms with stepwise [15] and supermartingale [27] surrogate constraints, as well as Lyapunov-based approaches [11, 12].
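As a brief sketch of the soft-constrained route in standard notation (not tied to any one of the cited works): with return $J_r(\pi) = \mathbb{E}_{\pi}[\sum_t \gamma^t r(s_t, a_t)]$ and constraint return $J_c(\pi) = \mathbb{E}_{\pi}[\sum_t \gamma^t c(s_t, a_t)]$, the Lagrangian relaxation of the CMDP is the saddle-point problem
\[
\min_{\lambda \ge 0} \; \max_{\pi} \;\; J_r(\pi) \;-\; \lambda \big( J_c(\pi) - d \big),
\]
which is typically solved by alternating policy-gradient updates on $\pi$ with gradient updates on the multiplier $\lambda$; constraint satisfaction is only guaranteed at convergence, which is the asymptotic behavior noted above.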
3 Policy Optimization with Advantage Regularization
In long-term fairness studies [16], fairness is evaluated over a time horizon where the agent interacts
with the environment, and the environment can change in response to the interactions. Simulations
following the MDP framework are one way to analyze fairness over time and systematically come up
with strategies for maximizing fairness in the long term rather than in a single step. MDPs naturally
incorporate the idea that actions made in the present can have accumulating consequences on the
environment over time. Long-term fairness is evaluated with metrics that describe the consequences
of an agent’s policy on the different subgroups in an environment over time. These metrics are
computed at each step of the MDP, and include data collected from the past time steps.
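To make this concrete, the sketch below rolls out a policy in a hypothetical two-group, gym-style environment, tracks a history-dependent fairness gap at each step, and shows one simple way such a violation signal could be folded into the advantage estimates before a standard PPO update. The environment interface (the info fields "group" and "benefit"), the gap metric, and the regularization rule are illustrative placeholders chosen for clarity rather than the exact definitions used in our experiments or in the environments of [16].

```python
# Illustrative sketch only: the environment interface, the group-gap metric,
# and the regularization rule are simplified placeholders, not the exact
# definitions used in this paper or in the simulation suite of [16].
import numpy as np


def discounted_sum(deltas, gamma=0.99, lam=0.95):
    """Backward-accumulate TD residuals into GAE-style advantage estimates."""
    out = np.zeros_like(deltas, dtype=float)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        out[t] = running
    return out


def rollout(env, policy, value_fn, horizon=128):
    """Collect one trajectory plus a per-step long-term fairness metric.

    The metric is the absolute gap between the cumulative benefit rates of
    two groups observed so far; it depends on the whole history, which is
    what makes the fairness notion "long-term" rather than one-shot.
    """
    obs = env.reset()
    benefit = {0: 0.0, 1: 0.0}
    count = {0: 1e-8, 1: 1e-8}
    rewards, violations, values = [], [], []
    for _ in range(horizon):
        action = policy(obs)
        next_obs, reward, done, info = env.step(action)
        g = info["group"]                # group affected by this decision
        count[g] += 1
        benefit[g] += info["benefit"]    # 1.0 if the decision helped group g
        gap = abs(benefit[0] / count[0] - benefit[1] / count[1])
        rewards.append(reward)
        violations.append(gap)
        values.append(value_fn(obs))
        obs = env.reset() if done else next_obs
    return np.array(rewards), np.array(violations), np.array(values)


def regularized_advantages(rewards, violations, values, beta=1.0, gamma=0.99):
    """Shift reward advantages by the growth of the fairness gap.

    This is one simple instantiation of advantage regularization: actions
    whose outcomes widened the group gap have their advantage reduced, so an
    off-the-shelf PPO update is steered toward fairer behavior without any
    reward engineering.
    """
    next_values = np.append(values[1:], values[-1])
    td_residuals = rewards + gamma * next_values - values
    adv = discounted_sum(td_residuals, gamma)
    gap_growth = np.maximum(0.0, np.diff(violations, prepend=violations[0]))
    adv = adv - beta * gap_growth
    # Normalization keeps the scale compatible with standard PPO code.
    return (adv - adv.mean()) / (adv.std() + 1e-8)
```

The regularized advantages can be passed directly to any clipped-surrogate PPO implementation in place of the usual advantage estimates, which is what allows the approach to integrate with off-the-shelf policy optimization code.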