Probabilities of Causation Adequate Size of Experimental and Observational Samples Ang Li

Probabilities of Causation: Adequate Size of
Experimental and Observational Samples
Ang Li
Department of Computer Science
University of California Los Angeles
Los Angeles, CA 90095
angli@cs.ucla.edu
Ruirui Mao
Department of Statistics
University of Wisconsin Madison
Madison, Wisconsin 53705
rmao28@wisc.edu
Judea Pearl
Department of Computer Science
University of California Los Angeles
Los Angeles, CA 90095
judea@cs.ucla.edu
Abstract
The probabilities of causation are commonly used to solve decision-making problems.
Tian and Pearl derived sharp bounds for the probability of necessity and
sufficiency (PNS), the probability of sufficiency (PS), and the probability of
necessity (PN) using experimental and observational data. The assumption is that one
is in possession of a large enough sample to permit an accurate estimation of the
experimental and observational distributions. In this study, we present a method for
determining the sample size needed for such estimation, when a given confidence
interval (CI) is specified. We further show by simulation that the proposed sample
size delivered stable estimations of the bounds of PNS.
1 Introduction
The probabilities of causation are widely used in many areas of industry, marketing, and health
science, to solve decision-making problems. For example, Li and Pearl [6, 8] proposed the “benefit
function”, a linear combination of the probabilities of causation, which is the payoff/cost associated
with selecting an individual with given characteristics to identify a set of individuals who are most
likely to exhibit a desired mode of behavior. Mueller and Pearl [11] demonstrated, for example, that
the probabilities of causation should be considered in personalized decision-making. Li et al. [7]
showed that the probabilities of causation can improve the accuracy of machine learning algorithms.
Pearl [14] first used the structural causal model (SCM) to define three binary probabilities of
causation (i.e., PNS, PN, and PS) [3, 4, 15]. Tian and Pearl [16] then used experimental and
observational data to bound those three probabilities of causation. Li and Pearl [8, 10] established
formal proof of those bounds. Mueller, Li, and Pearl [12] proposed narrowing the bounds of PNS
using covariate information and the causal structure. Dawid et al. [2] also proposed using covariate
information to narrow the bounds of PN. Li and Pearl [5] recently established the theoretical bounds
of nonbinary probabilities of causation.
All the abovementioned works are asymptotic (i.e., they assume an adequately large sample size to
estimate the experimental and observational distributions). The results proposed in those works
are relationships between the experimental and observational distributions and the probabilities of
causation. However, the adequate sample size for obtaining those probabilities of causation remains
unclear, thereby creating a barrier between the theoretical results and the real-world applications.
Preprint. Under review.
arXiv:2210.05027v1 [cs.AI] 10 Oct 2022
Consider the following motivating example: a mobile carrier that wants to identify customers who
are likely to discontinue their services within the next quarter based on customer characteristics
(company management has access to user data, such as income, age, usage, and monthly payments).
The carrier will then offer these customers a special renewal deal to dissuade them from discontinuing
their services and to increase their service renewal rate. These offers provide considerable discounts
to the customers, and the management prefers that these offers be made only to those customers who
would continue to use the service if and only if they receive the offer. The manager decides to use Li
and Pearl’s unit selection model [8] but is unsure how many experimental and observational samples
are required. Are 1000 experimental and 1000 observational samples adequate to bound the benefit
function such that the errors of the bounds are within 0.1?
In this study, we assess the adequacy of the sample size in terms of CIs.
We would then be able to answer the question, “How many samples are adequate to estimate the
probability of causation?” with “This number of samples is adequate to obtain the bounds of the
probability of causation in 95% CIs with margins of error of 0.05.” The probabilities of causation
are not identifiable in most cases; therefore, in such cases the CIs are for the bounds of the
probabilities of causation.
2 Preliminaries
We review the definitions for the three aspects of binary causation in this section, as defined in [14].
We use the language of counterfactuals in SCM, as defined in [3, 4]. We use $Y_x = y$ to denote the
counterfactual sentence “Variable $Y$ would have the value $y$, had $X$ been $x$”. For the rest of the
paper, we use $y_x$ to denote the event $Y_x = y$, $y_{x'}$ to denote the event $Y_{x'} = y$, $y'_x$ to
denote the event $Y_x = y'$, and $y'_{x'}$ to denote the event $Y_{x'} = y'$. We assume that the
experimental distribution will be summarized in the form of causal effects such as $P(y_x)$ and the
observational distribution will be summarized in the form of a joint probability function such as
$P(x, y)$. If neither variable is specified, variable $X$ represents treatment and variable $Y$
represents effect.
The following are three prominent probabilities of causation:
Definition 1 (Probability of necessity (PN)). Let $X$ and $Y$ be two binary variables in a causal
model $M$, let $x$ and $y$ stand for the propositions $X = \text{true}$ and $Y = \text{true}$,
respectively, and $x'$ and $y'$ for their complements. The probability of necessity is defined as the
expression [14]
$$\text{PN} = P(Y_{x'} = \text{false} \mid X = \text{true}, Y = \text{true}) = P(y'_{x'} \mid x, y)$$
Definition 2 (Probability of sufficiency (PS)). [14]
$$\text{PS} = P(y_x \mid y', x')$$
Definition 3 (Probability of necessity and sufficiency (PNS)). [14]
$$\text{PNS} = P(y_x, y'_{x'})$$
PNS stands for the probability that $y$ would respond to $x$ both ways, and therefore measures both
the sufficiency and necessity of $x$ to produce $y$.
Tian and Pearl [16] used Balke’s program [1] to provide tight bounds for PNS, PN, and PS.
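PNS has the following tight bounds:
$$
\max\left\{
\begin{array}{c}
0, \\
P(y_x) - P(y_{x'}), \\
P(y) - P(y_{x'}), \\
P(y_x) - P(y)
\end{array}
\right\}
\le \text{PNS} \le
\min\left\{
\begin{array}{c}
P(y_x), \\
P(y'_{x'}), \\
P(x, y) + P(x', y'), \\
P(y_x) - P(y_{x'}) + P(x, y') + P(x', y)
\end{array}
\right\}
\tag{1}
$$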
Note that we omit the bounds of PN and PS because this study focuses primarily on the adequate
sample size for estimating the bounds of PNS; the approach extends readily to the other probabilities of causation.
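To make Equation 1 concrete, the following minimal sketch evaluates both sides of the inequality from given experimental and observational distributions. The function name `pns_bounds` and its argument names are illustrative choices, not notation from the paper; $P(y)$ is obtained as $P(x, y) + P(x', y)$ and $P(y'_{x'})$ as $1 - P(y_{x'})$.

```python
def pns_bounds(p_yx, p_yxp, p_xy, p_xyp, p_xpy, p_xpyp):
    """Return (lower, upper) bounds of PNS as given in Equation 1.

    Experimental inputs:  p_yx  = P(y_x),   p_yxp  = P(y_{x'})
    Observational inputs: p_xy  = P(x, y),  p_xyp  = P(x, y'),
                          p_xpy = P(x', y), p_xpyp = P(x', y')
    (Argument names are hypothetical, chosen for this sketch.)
    """
    p_y = p_xy + p_xpy  # P(y) is a marginal of the observational joint
    lower = max(0.0, p_yx - p_yxp, p_y - p_yxp, p_yx - p_y)
    upper = min(
        p_yx,
        1.0 - p_yxp,                    # P(y'_{x'}) = 1 - P(y_{x'})
        p_xy + p_xpyp,
        p_yx - p_yxp + p_xyp + p_xpy,
    )
    return lower, upper


# Illustrative (made-up) distributions; the bounds come out near [0.3, 0.6].
lower, upper = pns_bounds(0.6, 0.3, 0.3, 0.2, 0.1, 0.4)
```

The example inputs are invented for illustration only; they were chosen so that the experimental and observational distributions are mutually compatible.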
Figure 1: The causal model, where $X$ is a binary treatment, $Y$ is a binary effect, and $Z$ is a set of 20 independent binary confounders.
3 Main Result
The bounds of PNS are linear combinations of the experimental distributions $P(y_x)$, $P(y_{x'})$ and
the observational distributions $P(x, y)$, $P(x, y')$, $P(x', y)$, $P(x', y')$ from Equation 1. Therefore, if
we can obtain the CI of each of these distributions, then we can obtain the CIs of the bounds of the
PNS. Let $R$ be a random variable such that $R = 1$ if the event $y_x$ occurs and $R = 0$ if the event
$y'_x$ occurs; then it is clear that $R \sim \text{Bernoulli}(P(y_x))$. Therefore, if we use the frequentist
approach to estimate the experimental and observational distributions, we have the following theorem
and corollary (the detailed proofs are in the appendix):
Theorem 4. Given $m$ experimental samples and $n$ observational samples, if the frequentist approach
is used to estimate the experimental and observational distributions, then the margin of error of the
bounds of PNS in a $1 - \alpha$ confidence interval is at most
$z_{1-\alpha/2}\left(\sqrt{\frac{1}{m}} + \sqrt{\frac{1}{n}}\right)$, where $z_{1-\alpha/2}$ can be found in the
z-table of the standard normal distribution.
Corollary 5. If the frequentist approach is used to estimate the experimental and observational
distributions, then to obtain a margin of error of at most 0.05 for the bounds of PNS in a 95%
confidence interval, we need $m$ experimental samples and $n$ observational samples, where
$\sqrt{\frac{1}{m}} + \sqrt{\frac{1}{n}} \le 5/196$. More specifically, if $m = n$, 6147 experimental and 6147
observational samples are adequate to obtain a margin of error of at most 0.05 for the bounds of PNS
in a 95% confidence interval.
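The arithmetic behind Theorem 4 and Corollary 5 can be checked with a short sketch. The function names below are hypothetical; the first evaluates the bound $z_{1-\alpha/2}(\sqrt{1/m} + \sqrt{1/n})$ using the exact normal quantile, and the second solves the equal-sample-size case $m = n$ with the rounded value $z = 1.96$ implied by the constant $5/196$.

```python
import math
from statistics import NormalDist


def pns_margin_bound(m, n, alpha=0.05):
    """Upper bound on the margin of error from Theorem 4."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1 - alpha/2}
    return z * (math.sqrt(1 / m) + math.sqrt(1 / n))


def min_equal_sample_size(margin=0.05, z=1.96):
    """Smallest m = n with z * (2 / sqrt(m)) <= margin.

    Rearranging gives m >= (2 * z / margin) ** 2; z = 1.96 is the
    rounded 97.5% normal quantile used in Corollary 5.
    """
    return math.ceil((2 * z / margin) ** 2)


required = min_equal_sample_size()          # 6147, matching Corollary 5
bound_at_1000 = pns_margin_bound(1000, 1000)  # roughly 0.124
```

Under the same bound, 1000 experimental and 1000 observational samples guarantee a margin of only about 0.124 for the PNS bounds, which is looser than 0.1.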
This number of samples ensures that the margins of error of the bounds of PNS in the 95% CI are no
more than 0.05. However, in practice, we usually do not need this many samples because only one term
in the PNS bounds of Equation 1 (i.e., $P(y_x) - P(y_{x'}) + P(x, y') + P(x', y)$) consists of four
distributions. Terms such as $P(y_x)$ require only 385 experimental and 385 observational samples to
obtain a margin of error of at most 0.05 in the 95% CI, and terms such as $P(y_x) - P(y_{x'})$ require
only 1537 experimental and 1537 observational samples to obtain a margin of error of at most 0.05 in
the 95% CI. We illustrate the actual errors of the estimations in simulated studies in the following
section.
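One way to arrive at the sample sizes 385, 1537, and 6147 is sketched below. It is a reconstruction under stated assumptions rather than the appendix proof: a normal-approximation interval for each estimated distribution, the worst-case Bernoulli variance $p(1-p) \le 1/4$, $z_{0.975} = 1.96$, equal sample sizes $m = n$, and the margins of the individual distributions in a term simply added.

```latex
% Assumptions: normal approximation, p(1-p) <= 1/4, z = 1.96, m = n,
% and the margin of a k-distribution term bounded by the sum of k margins.
\begin{align*}
\text{one distribution, e.g. } P(y_x):\quad
  & 1.96\sqrt{\tfrac{1}{4m}} \le 0.05
  \iff \sqrt{m} \ge 19.6
  \iff m \ge 384.16
  \;\Rightarrow\; m = 385, \\
\text{two distributions, e.g. } P(y_x) - P(y_{x'}):\quad
  & 2 \cdot 1.96\sqrt{\tfrac{1}{4m}} \le 0.05
  \iff \sqrt{m} \ge 39.2
  \iff m \ge 1536.64
  \;\Rightarrow\; m = 1537, \\
\text{four distributions}:\quad
  & 4 \cdot 1.96\sqrt{\tfrac{1}{4m}} \le 0.05
  \iff \sqrt{m} \ge 78.4
  \iff m \ge 6146.56
  \;\Rightarrow\; m = 6147.
\end{align*}
```

The four-distribution case reduces to $1.96 \cdot 2/\sqrt{m} \le 0.05$, which agrees with the bound of Theorem 4 when $m = n$.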
4 Simulation Results
Here, we present simulated studies to show that the proposed numbers of experimental and
observational samples are adequate to obtain the desired margin errors of the bounds of PNS using two
SCMs.
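The general shape of such a simulation can be sketched as follows. This is not the paper's experiment: the "true" distributions below are made-up values rather than distributions generated from the SCMs of Figure 1, $m$ is interpreted as the number of samples per experimental arm, and all function names are hypothetical. The sketch draws samples, estimates each distribution by its relative frequency, and records how far the estimated PNS bounds fall from the true ones.

```python
import random
from collections import Counter


def pns_bounds(p_yx, p_yxp, joint):
    """Equation 1; joint maps (x, y) in {0, 1}^2 to P(X = x, Y = y)."""
    p_y = joint[(1, 1)] + joint[(0, 1)]
    lower = max(0.0, p_yx - p_yxp, p_y - p_yxp, p_yx - p_y)
    upper = min(p_yx, 1.0 - p_yxp,
                joint[(1, 1)] + joint[(0, 0)],
                p_yx - p_yxp + joint[(1, 0)] + joint[(0, 1)])
    return lower, upper


# Hypothetical "true" distributions, chosen only for illustration.
TRUE_P_YX, TRUE_P_YXP = 0.6, 0.3
TRUE_JOINT = {(1, 1): 0.3, (1, 0): 0.2, (0, 1): 0.1, (0, 0): 0.4}


def estimate_bounds(m, n, rng):
    # Experimental data: relative frequency of Y = 1 in each arm
    # (assumption: m samples per arm).
    p_yx = sum(rng.random() < TRUE_P_YX for _ in range(m)) / m
    p_yxp = sum(rng.random() < TRUE_P_YXP for _ in range(m)) / m
    # Observational data: relative frequencies of the four (x, y) cells.
    cells = list(TRUE_JOINT)
    draws = rng.choices(cells, weights=[TRUE_JOINT[c] for c in cells], k=n)
    counts = Counter(draws)
    joint = {c: counts[c] / n for c in cells}
    return pns_bounds(p_yx, p_yxp, joint)


def worst_errors(m, n, trials=20, seed=0):
    """Per trial, the larger of the lower-bound and upper-bound errors."""
    rng = random.Random(seed)
    true_lo, true_hi = pns_bounds(TRUE_P_YX, TRUE_P_YXP, TRUE_JOINT)
    errors = []
    for _ in range(trials):
        lo, hi = estimate_bounds(m, n, rng)
        errors.append(max(abs(lo - true_lo), abs(hi - true_hi)))
    return errors


errors = worst_errors(6147, 6147)
share_within = sum(e <= 0.05 for e in errors) / len(errors)
```

With these illustrative distributions, `share_within` is the fraction of trials in which both estimated bounds lie within 0.05 of the true bounds.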
4.1 Causal Model
To estimate the margin errors of the bounds, we must first understand the data generation process
to have true experimental and observational distributions. The two models we use are shown
in Figure 1 (the two models have the same causal graph but different coefficients in their SCMs; the
generation method of the models is in the appendix), where $X$ is a binary treatment, $Y$ is a binary
effect, and $Z$ is a set of 20 independent confounders (say $Z_1, ..., Z_{20}$). The structural equations
are as follows (for simplicity, we let $x = 1$, $x' = 0$, and $y = 1$, $y' = 0$):