that are similar to other RL problems such as robot control. Others have also shown that long-term fairness
is nontrivial, and analyzing it in the context of a static scenario can be harmful because it contradicts fairness objectives optimized in static settings [21, 22, 26]. For example, [21, 26] find that providing a direct subsidy for a disadvantaged group with the purpose of improving some institutional utility actually widens the gap between advantaged and disadvantaged groups over time, which further shows that long-term fairness is difficult to achieve. There have been a growing number of studies
on fairness in the long term with various algorithmic approaches [28, 35, 24, 14]. [24] proposes a graph-based algorithm to improve fairness in recommendations for items and suppliers, relating fairness to breaking the perpetuation of bias in the interactions between users and items. [14] proposes the use of causal directed acyclic graphs (DAGs) as a paradigm for studying fairness in dynamical systems, arguing that causal reasoning can help improve the fairness of off-policy learning and that, if the underlying environment dynamics are known, causal DAGs can be used as simulators for the models in training. [35] provides a framework for studying long-term fairness and finds that static fairness constraints can either promote fairness or increase the disparity between advantaged and disadvantaged groups in dynamical systems. [17] studies how to maintain long-term fairness of item exposure in recommendation tasks where items are split into groups, using a modified Constrained Policy Optimization (CPO) procedure [1]. [9] introduces the fairness notion of return parity, a measure of the similarity in expected returns across different demographic groups, and provides an algorithm for minimizing this disparity.
Several recent works have also considered fairness-like constraints in deep reinforcement learning in various contexts. [8] designs fairness-optimized actor-critic algorithms for deep reinforcement learning, enforcing fairness by multiplicatively adjusting the reward for fairness-utility optimization in standard actor-critic reinforcement learning. The work of [32] studies multi-dimensional reward functions for MDPs motivated by fairness and equality constraints, and provides a theoretical analysis of the approximation error with respect to the optimal average reward. Our focus is on proposing a practical algorithm for making fair decisions in the dynamic environments formulated in [16] and on showing that policy optimization through advantage regularization can find neural network policies that significantly outperform previously known strategies in the dynamic setting.
Policy Optimization under Constraints.
The most widely adopted formulation of RL with a set of constraints is constrained Markov decision processes (CMDPs) [2, 34]. Safety constraints are incorporated by augmenting the standard MDP framework with constraints over expectations of auxiliary costs. When the model is known in discrete tabular settings, a CMDP is solvable using linear programming (LP) [2]. However, results are limited for model-free scenarios where the dynamics are unknown, and for large-scale or even continuous state-action spaces [1, 11, 34]. More importantly, in high-dimensional CMDP settings where high-capacity function approximators are adopted, both the objective and the constraints are non-convex. Recent methods for solving CMDPs in continuous spaces can be divided into two categories according to how constraints are incorporated. In soft-constrained RL, a common practice is to apply the Lagrangian method with learnable Lagrange multipliers and solve the converted unconstrained saddle-point optimization problem using policy-based methods [5, 10, 33]. Such Lagrangian methods achieve safety once policies converge asymptotically, but allow possible violations during training. In contrast, hard-constrained RL aims to learn safe policies throughout training. Representative works include Constrained Policy Optimization (CPO) based on trust regions [1], surrogate algorithms with stepwise [15] and supermartingale [27] surrogate constraints, as well as Lyapunov-based approaches [11, 12].
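As an illustrative sketch (using generic notation that is not defined in this excerpt, with a single discounted auxiliary cost), the soft-constrained setting above can be written as

\max_{\pi} \; J_R(\pi) \quad \text{s.t.} \quad J_C(\pi) \le d, \qquad J_R(\pi) = \mathbb{E}_{\tau \sim \pi}\!\Big[\textstyle\sum_t \gamma^t r(s_t, a_t)\Big], \quad J_C(\pi) = \mathbb{E}_{\tau \sim \pi}\!\Big[\textstyle\sum_t \gamma^t c(s_t, a_t)\Big],

which the Lagrangian approach relaxes to the saddle-point problem

\min_{\lambda \ge 0} \, \max_{\pi} \; J_R(\pi) - \lambda \big(J_C(\pi) - d\big),

where the multiplier \lambda is learned jointly with the policy.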
3 Policy Optimization with Advantage Regularization
In long-term fairness studies [16], fairness is evaluated over a time horizon during which the agent interacts with the environment, and the environment can change in response to these interactions. Simulations following the MDP framework are one way to analyze fairness over time and to systematically derive strategies that maximize fairness in the long term rather than in a single step. MDPs naturally incorporate the idea that actions made in the present can have accumulating consequences on the environment over time. Long-term fairness is evaluated with metrics that describe the consequences of an agent's policy on the different subgroups in an environment over time. These metrics are computed at each step of the MDP and incorporate data collected from past time steps.
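As a minimal sketch of this evaluation loop (the gym-style environment interface, the two-group setup, and the cumulative-disparity metric below are illustrative assumptions, not the specific metrics of [16]):

import numpy as np

def evaluate_long_term_fairness(env, policy, horizon=200):
    """Roll out `policy` in `env` and record a fairness metric at every step.

    The metric here is an illustrative disparity measure: the absolute gap in
    cumulative positive outcomes between two subgroups, so the value at step t
    depends on data collected from all past time steps.
    """
    obs = env.reset()
    cumulative_positive = {0: 0, 1: 0}  # positive outcomes per subgroup so far
    disparity_over_time = []
    for t in range(horizon):
        action = policy(obs)
        obs, reward, done, info = env.step(action)
        # The environment is assumed to report which subgroup was affected
        # and whether the outcome was positive (e.g., a granted loan was repaid).
        cumulative_positive[info["group"]] += int(info["positive_outcome"])
        disparity_over_time.append(
            abs(cumulative_positive[0] - cumulative_positive[1])
        )
        if done:
            break
    return np.array(disparity_over_time)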