
most invariably accepted by UG Responders (see Camerer,
2011). Relatedly, UG Responders often reject offers below
30%, presumably as retaliation for being treated unfairly
(Güth et al., 1982; Thaler, 1988; Güth & Tietz, 1990; Bolton
& Zwick, 1995; Nowak et al., 2000; Camerer & Fehr, 2006).
A growing body of experimental work has revealed that in-
duced emotions strongly affect UG Responder’s accept/reject
behavior, with positive emotions increasing the chance of low
offers being accepted (e.g., Riepl et al., 2016; Andrade &
Ariely, 2009), and negative emotions decreasing the chance
of low offers being accepted (e.g., Bonini et al., 2011; Harlé
& Sanfey, 2010; Liu et al., 2016; Moretti & Di Pellegrino,
2010; Vargas et al., 2019). Experimentally, these emotions
are often induced by a movie clip or recall task.
3 A Computational Model of UG Responder
Recently, Nobandegani et al. (2020) presented a process
model of UG Responder, called sample-based expected util-
ity (SbEU). SbEU provides a unified account of several dis-
parate empirical findings in UG (i.e., the effects of expecta-
tion, competition, and time pressure on UG Responder), and
also explains the effect of a wide range of emotions on UG
Responder (Lizotte, Nobandegani, & Shultz, 2021).
Nobandegani et al.’s process-level account rests on two
main assumptions. First, UG Responder uses SbEU to esti-
mate the expected-utility gap between their expectation and
the offer, i.e., E[u(offer) − u(expectation)], where u(·) denotes
Responder's utility function. If this estimate is pos-
itive — indicating that the offer made is, on average, higher
than Responder’s expectation — Responder accepts the offer;
otherwise, Responder rejects the offer. This assumption is
supported by substantial empirical evidence showing that Re-
sponder’s expectation serves as a reference point for subjec-
tive valuation of offers (Sanfey, 2009; Battigalli et al., 2015;
Vavra et al., 2018; Xiang et al., 2013; Chang & Sanfey, 2013).
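The first assumption can be sketched as a simple sampling procedure. In the sketch below, the linear utility function and the Gaussian distribution of Responder's expectation are illustrative assumptions, not part of the model specification:

```python
import random

def responder_decision(offer, expectation_samples, u=lambda x: x):
    # Assumption 1: estimate the expected-utility gap
    # E[u(offer) - u(expectation)] from sampled expectations,
    # and accept exactly when the estimate is positive.
    gap = sum(u(offer) - u(e) for e in expectation_samples) / len(expectation_samples)
    return "accept" if gap > 0 else "reject"

random.seed(0)
# Hypothetical Responder whose expectation fluctuates around 40% of the pie.
expectations = [random.gauss(40, 5) for _ in range(1000)]
print(responder_decision(50, expectations))  # offer above expectation -> accept
print(responder_decision(20, expectations))  # offer below expectation -> reject
```

On this reading, rejection of low offers needs no separate retaliation mechanism: any offer whose estimated utility gap relative to expectation is negative is declined.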
The second assumption is that negative emotions elevate
loss-aversion while positive emotions lower loss-aversion
(Lizotte et al., 2021). Again, this assumption is supported by
mounting empirical evidence (e.g., De Martino et al., 2010;
Sokol-Hessner et al., 2015, 2009) suggesting that emotions
modulate loss-aversion — the tendency to overweight losses
as compared to gains (Kahneman & Tversky, 1979).
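The second assumption can be illustrated with a prospect-theoretic value function. Treating the offer–expectation shortfall as a loss, and the specific loss-aversion coefficients below, are simplifying assumptions made only for illustration:

```python
def value(x, lam):
    # Prospect-theoretic value function (Kahneman & Tversky, 1979):
    # losses loom lam times larger than equivalent gains.
    return x if x >= 0 else lam * x

# Shortfall of a low offer (30% of the pie) below an expectation of 40%.
shortfall = 30 - 40

# Positive emotion: lower loss aversion -> milder subjective loss.
print(value(shortfall, lam=1.5))  # -15.0
# Negative emotion: higher loss aversion -> harsher subjective loss.
print(value(shortfall, lam=3.0))  # -30.0
```

Under this framing, a negative emotion that raises lam makes the same low offer feel subjectively worse, pushing a borderline offer from acceptance toward rejection, consistent with the induced-emotion findings reviewed above.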
Concretely, SbEU assumes that an agent estimates expected
utility:
\[
\mathbb{E}[u(o)] = \int p(o)\, u(o)\, \mathrm{d}o, \tag{1}
\]
using self-normalized importance sampling (Nobandegani et
al., 2018; Nobandegani & Shultz, 2020b, 2020c), with its im-
portance distribution q∗ aiming to optimally minimize mean-
squared error (MSE):
\[
\hat{E} = \frac{1}{\sum_{j=1}^{s} w_j} \sum_{i=1}^{s} w_i\, u(o_i), \quad \forall i: o_i \sim q^*, \; w_i = \frac{p(o_i)}{q^*(o_i)}, \tag{2}
\]
\[
q^*(o) \propto p(o)\, |u(o)| \sqrt{\frac{1 + |u(o)|\sqrt{s}}{|u(o)|\sqrt{s}}}. \tag{3}
\]
MSE is a standard measure of estimation quality, widely used
in decision theory and mathematical statistics (Poor, 2013).
In Eqs. (1)–(3), o denotes an outcome of a risky gamble, p(o)
the objective probability of outcome o, u(o) the subjective
utility of outcome o, Ê the importance-sampling estimate of
expected utility given in Eq. (1), q∗ the importance-sampling
distribution, o_i an outcome randomly sampled from q∗, and s
the number of samples drawn from q∗.
SbEU has so far explained a broad range of empirical
findings in human decision-making, e.g., the fourfold pat-
terns of risk preferences in both outcome probability and out-
come magnitude (Nobandegani et al., 2018), risky decoy and
violation of betweenness (Nobandegani et al., 2019c), vio-
lation of stochastic dominance (Xia, Nobandegani, Shultz,
& Bhui, 2022), violation of cumulative independence (Cao,
Nobandegani, & Shultz, 2022), the three contextual effects of
similarity, attraction, and compromise (da Silva Castanheira,
Nobandegani, Shultz, & Otto, 2019), the Allais, St. Peters-
burg, and Ellsberg paradoxes (Nobandegani & Shultz, 2020b,
2020c; Nobandegani et al., 2021), cooperation in Prisoner’s
Dilemma (Nobandegani et al., 2019a), and human coordina-
tion behavior in coordination games (Nobandegani & Shultz,
2020a). Notably, SbEU is the first, and thus far the only,
resource-rational process model that bridges between risky,
value-based, and game-theoretic decision-making.
4 Training RL Agents in UG
In this section, we substantiate the idea of cognitive mod-
els as simulators in the context of moral decision-making,
by having RL agents learn about fairness through interacting
with a cognitive model of UG Responder (Nobandegani et
al., 2020), as a proxy for human Responders, thereby making
their training process both less costly and faster.
To train RL Proposers, we leverage the broad framework
of multi-armed bandits in reinforcement learning (Katehakis
& Veinott, 1987; Gittins, 1979), and adopt the well-known
Thompson Sampling method (Thompson, 1933). Specifi-
cally, we assume that RL Proposer should decide what per-
centage of the total money T they are willing to offer to
SbEU Responder. For ease of analysis, here we assume
that RL Proposer chooses between a finite set of options:
\[
A = \left\{0, \frac{T}{10}, \frac{2T}{10}, \cdots, \frac{9T}{10}, T\right\}.
\]
In reinforcement learning terminology, RL Proposer
learns, through trial and error while striking a balance be-
tween exploration and exploitation, which option a∈A
yields the highest mean reward. Here, we train RL Proposers
using Thompson Sampling, a well-established method in the
reinforcement learning literature enjoying strong optimality
guarantees (Agrawal & Goyal, 2012, 2013); see Algorithm 1.
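In outline, the training loop maintains a Beta posterior over each offer's acceptance rate and plays the offer with the highest sampled expected payoff. In the sketch below, the simulated accept probabilities are a hypothetical stand-in for SbEU Responder, and the payoff-weighted reward is one plausible reading of the setup, not the exact specification of Algorithm 1:

```python
import random

def thompson_sampling(accept_prob, actions, n_rounds, T, rng=random):
    # S[a]: times offer a was accepted; F[a]: times it was rejected.
    S = {a: 0 for a in actions}
    F = {a: 0 for a in actions}
    for _ in range(n_rounds):
        # Sample a plausible acceptance rate for each offer from its
        # Beta posterior, then play the offer with the highest
        # sampled expected payoff (Proposer keeps T - a on acceptance).
        draws = {a: rng.betavariate(S[a] + 1, F[a] + 1) for a in actions}
        a = max(actions, key=lambda x: draws[x] * (T - x))
        if rng.random() < accept_prob(a):
            S[a] += 1
        else:
            F[a] += 1
    return S, F

random.seed(1)
T = 100
actions = [T * k // 10 for k in range(11)]  # A = {0, T/10, ..., T}
# Hypothetical proxy Responder: low offers are mostly rejected.
accept_prob = lambda a: min(1.0, a / 40)
S, F = thompson_sampling(accept_prob, actions, n_rounds=5000, T=T)
best = max(actions, key=lambda a: S[a] + F[a])  # most-played offer
print(best)
```

Under these assumed acceptance rates, offers around 40% maximize the Proposer's expected payoff, so the most-played arm concentrates there; substituting the SbEU model for accept_prob yields the training regime described above.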
Algorithm 1 can be described in simple terms as follows.
At the start, i.e., prior to any learning, the number of times
an offer a ∈ A has so far been accepted, S_a (S for success), and the
number of times it has been rejected, F_a (F for failure), are both set to