

Probabilities of Causation: Adequate Size of Experimental and Observational Samples

Ang Li

Department of Computer Science

University of California Los Angeles

Los Angeles, CA 90095

angli@cs.ucla.edu

Ruirui Mao

Department of Statistics

University of Wisconsin Madison

Madison, Wisconsin 53705

rmao28@wisc.edu

Judea Pearl

Department of Computer Science

University of California Los Angeles

Los Angeles, CA 90095

judea@cs.ucla.edu

Abstract

The probabilities of causation are commonly used to solve decision-making problems. Tian and Pearl derived sharp bounds for the probability of necessity and sufficiency (PNS), the probability of sufficiency (PS), and the probability of necessity (PN) using experimental and observational data. The assumption is that one is in possession of a large enough sample to permit an accurate estimation of the experimental and observational distributions. In this study, we present a method for determining the sample size needed for such estimation when a given confidence interval (CI) is specified. We further show by simulation that the proposed sample size delivers stable estimations of the bounds of PNS.

1 Introduction

The probabilities of causation are widely used in many areas of industry, marketing, and health science to solve decision-making problems. For example, Li and Pearl [ ] proposed the "benefit function", a linear combination of the probabilities of causation, which is the payoff/cost associated with selecting an individual with given characteristics, to identify a set of individuals who are most likely to exhibit a desired mode of behavior. Mueller and Pearl [ ] demonstrated that the probabilities of causation should be considered in personalized decision-making. Li et al. [ ] showed that the probabilities of causation can improve the accuracy of machine learning algorithms.

Pearl [ ] first used the structural causal model (SCM) to define three binary probabilities of causation (i.e., PNS, PN, and PS) [ ]. Tian and Pearl [ ] then used experimental and observational data to bound those three probabilities of causation. Li and Pearl [ ] established a formal proof of those bounds. Mueller, Li, and Pearl [ ] proposed narrowing the bounds of PNS using covariate information and the causal structure. Dawid et al. [ ] also proposed using covariate information to narrow the bounds of PN. Li and Pearl [ ] recently established the theoretical bounds of nonbinary probabilities of causation.

All the abovementioned works are asymptotic (i.e., they assume an adequately large sample size to estimate the experimental and observational distributions). The results proposed in those works are relationships between the experimental and observational distributions and the probabilities of causation. However, the adequate sample size for obtaining those probabilities of causation remains unclear, thereby creating a barrier between the theoretical results and real-world applications.

Preprint. Under review. arXiv:2210.05027v1 [cs.AI] 10 Oct 2022

Consider the following motivating example: a mobile carrier wants to identify customers who are likely to discontinue their services within the next quarter based on customer characteristics (company management has access to user data, such as income, age, usage, and monthly payments). The carrier will then offer these customers a special renewal deal to dissuade them from discontinuing their services and to increase its service renewal rate. These offers provide considerable discounts to the customers, and the management prefers that these offers be made only to those customers who would continue to use the service if and only if they receive the offer. The manager decides to use Li and Pearl's unit selection model [ ] but is unsure how many experimental and observational samples are required. Are 1000 experimental and 1000 observational samples adequate to bound the benefit function such that the errors of the bounds are within 0.1?

In this study, we assess the adequacy of the sample size in terms of CIs. We would then be able to answer the question "How many samples are adequate to estimate the probability of causation?" with "This number of samples is adequate to obtain the bounds of the probability of causation in 95% CIs with margins of error of 0.05." In most cases, the probabilities of causation are not identifiable; therefore, the CIs in such cases are for the bounds of the probabilities of causation.

2 Preliminaries

We review the definitions for the three aspects of binary causation in this section, as defined in [ ]. We use the language of counterfactuals in SCM, as defined in [ ]. We use $Y_x = y$ to denote the counterfactual sentence "Variable $Y$ would have the value $y$, had $X$ been $x$". For the rest of the paper, we use $y_x$ to denote the event $Y_x = y$, $y_{x'}$ to denote the event $Y_{x'} = y$, $y'_x$ to denote the event $Y_x = y'$, and $y'_{x'}$ to denote the event $Y_{x'} = y'$. We assume that the experimental distribution will be summarized in the form of causal effects such as $P(y_x)$ and that the observational distribution will be summarized in the form of a joint probability function such as $P(x, y)$. Unless otherwise specified, variable $X$ represents treatment and variable $Y$ represents effect.

The following are three prominent probabilities of causation:

Definition 1 (Probability of necessity (PN)). Let $X$ and $Y$ be two binary variables in a causal model $M$, let $x$ and $y$ stand for the propositions $X = \text{true}$ and $Y = \text{true}$, respectively, and $x'$ and $y'$ for their complements. The probability of necessity is defined as the expression [14]

$$\text{PN} \triangleq P(Y_{x'} = \text{false} \mid X = \text{true}, Y = \text{true}) \triangleq P(y'_{x'} \mid x, y)$$

Definition 2 (Probability of sufficiency (PS)). [14]

$$\text{PS} \triangleq P(y_x \mid y', x')$$

Definition 3 (Probability of necessity and sufficiency (PNS)). [14]

$$\text{PNS} \triangleq P(y_x, y'_{x'})$$

PNS stands for the probability that $y$ would respond to $x$ both ways, and therefore measures both the sufficiency and necessity of $x$ to produce $y$.

Tian and Pearl [16] used Balke’s program [1] to provide tight bounds for PNS, PN, and PS.

PNS has the following tight bounds:

$$\max\left\{0,\; P(y_x) - P(y_{x'}),\; P(y) - P(y_{x'}),\; P(y_x) - P(y)\right\} \le \text{PNS} \le \min\left\{P(y_x),\; P(y'_{x'}),\; P(x, y) + P(x', y'),\; P(y_x) - P(y_{x'}) + P(x, y') + P(x', y)\right\} \tag{1}$$

Note that we omit the bounds of PN and PS because this study focuses primarily on the adequate sample size for estimating the bounds of PNS; the approach extends straightforwardly to the other probabilities of causation.
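As an illustration of how Equation 1 is evaluated, the following minimal Python sketch computes the lower and upper bounds of PNS from the two experimental and four observational quantities. The function name `pns_bounds`, its argument names, and the input numbers are hypothetical and chosen only for this example; they do not come from the paper.

```python
def pns_bounds(p_yx, p_yxp, p_xy, p_xyp, p_xpy, p_xpyp):
    """Return (lower, upper) bounds on PNS following Equation 1.

    Experimental inputs:  p_yx = P(y_x), p_yxp = P(y_{x'}).
    Observational inputs: p_xy = P(x, y), p_xyp = P(x, y'),
                          p_xpy = P(x', y), p_xpyp = P(x', y').
    """
    p_y = p_xy + p_xpy      # P(y), obtained by marginalising the joint over X
    p_ypxp = 1.0 - p_yxp    # P(y'_{x'}) = 1 - P(y_{x'})
    lower = max(0.0, p_yx - p_yxp, p_y - p_yxp, p_yx - p_y)
    upper = min(p_yx, p_ypxp, p_xy + p_xpyp, p_yx - p_yxp + p_xyp + p_xpy)
    return lower, upper


# Illustrative (made-up) distributions, not taken from the paper.
lower, upper = pns_bounds(0.7, 0.3, 0.4, 0.1, 0.1, 0.4)
print(round(lower, 3), round(upper, 3))
```

With these made-up inputs the bounds come out at roughly 0.4 and 0.6. Every term inside the `max` and `min` is a sum or difference of at most four estimated probabilities, which is the property the sample-size analysis relies on.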

Figure 1: The causal model, where $X$ is a binary treatment, $Y$ is a binary effect, and $Z$ is a set of 20 independent binary confounders.

3 Main Result

The bounds of PNS are linear combinations of the experimental distributions $P(y_x)$, $P(y_{x'})$ and the observational distributions $P(x, y)$, $P(x, y')$, $P(x', y)$, $P(x', y')$ from Equation 1. Therefore, if we can obtain the CI of each of these distributions, we can obtain the CIs of the bounds of PNS. Let $R$ be a random variable such that $R = 1$ if the event $y_x$ occurs and $R = 0$ if the event $y'_x$ occurs; then clearly $R \sim \text{Bernoulli}(P(y_x))$. Therefore, if we use the frequentist approach to estimate the experimental and observational distributions, we have the following theorem and corollary (the detailed proofs are in the appendix):

Theorem 4. Given $m$ experimental samples and $n$ observational samples, if the frequentist approach is used to estimate the experimental and observational distributions, then the margin of error of the bounds of PNS in a $1-\alpha$ confidence interval is at most $z_{1-\alpha/2}\left(\sqrt{\frac{1}{m}} + \sqrt{\frac{1}{n}}\right)$, where $z_{1-\alpha/2}$ can be found in the z-table of the standard normal distribution.
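The bound in Theorem 4 is easy to evaluate numerically. The sketch below is a hedged illustration using only the Python standard library: `pns_margin_bound` is a hypothetical helper name, and the quantile $z_{1-\alpha/2}$ is taken from `statistics.NormalDist` rather than from a printed z-table.

```python
from math import sqrt
from statistics import NormalDist


def pns_margin_bound(m, n, alpha=0.05):
    """Upper bound on the margin of error of the PNS bounds (Theorem 4).

    m: number of experimental samples, n: number of observational samples,
    alpha: one minus the confidence level (0.05 for a 95% CI).
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # z_{1 - alpha/2}
    return z * (sqrt(1.0 / m) + sqrt(1.0 / n))


# Example: evaluate the bound for 1000 experimental and 1000 observational samples.
print(pns_margin_bound(1000, 1000))
```

With $m = n = 1000$ the bound evaluates to roughly 0.124 at the 95% level. Note that this is a worst-case guarantee on the PNS bounds only; it says nothing directly about other quantities such as the benefit function.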

Corollary 5. If the frequentist approach is used to estimate the experimental and observational distributions, then to obtain a margin of error of at most 0.05 for the bounds of PNS in a 95% confidence interval, we need $m$ experimental samples and $n$ observational samples, where

$$\sqrt{\frac{1}{m}} + \sqrt{\frac{1}{n}} \le \frac{5}{196}.$$

More specifically, with $m = n$, 6147 experimental and 6147 observational samples are adequate to obtain a margin of error of at most 0.05 for the bounds of PNS in a 95% confidence interval.
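The condition in Corollary 5 can be inverted to obtain sample sizes. The following sketch is illustrative only: both function names are hypothetical, and it uses the exact normal quantile from `statistics.NormalDist` instead of the rounded value 1.96 implied by the constant 5/196, which yields the same answer for the default setting.

```python
from math import ceil, sqrt
from statistics import NormalDist


def min_equal_sample_size(margin=0.05, alpha=0.05):
    """Smallest m = n with z * (sqrt(1/m) + sqrt(1/n)) <= margin."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # With m = n the condition reads 2 * z / sqrt(m) <= margin.
    return ceil((2.0 * z / margin) ** 2)


def min_observational_size(m, margin=0.05, alpha=0.05):
    """Smallest n meeting the condition for a fixed m, or None if no n works."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    slack = margin / z - sqrt(1.0 / m)  # budget left for sqrt(1/n)
    if slack <= 0.0:
        return None  # m alone already exhausts the error budget
    return ceil(1.0 / slack ** 2)


print(min_equal_sample_size())        # 6147
print(min_observational_size(10000))  # fewer observational samples suffice
```

The second helper shows the trade-off implied by the inequality: a larger experimental sample lowers the observational sample size required, while too small an experimental sample cannot be compensated by any amount of observational data.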

This number of samples ensures that the margins of error of the bounds of PNS in a 95% CI are no more than 0.05. In practice, however, we usually do not need this many samples, because only one term in the PNS bounds of Equation 1 (i.e., $P(y_x) - P(y_{x'}) + P(x, y') + P(x', y)$) consists of four distributions. Terms such as $P(y_x)$ require only 385 experimental and 385 observational samples to obtain a margin of error of at most 0.05 in the 95% CI, and terms such as $P(y_x) - P(y_{x'})$ require only 1537 experimental and 1537 observational samples to obtain a margin of error of at most 0.05 in the 95% CI. We illustrate the actual errors of the estimates in simulated studies in the following section.
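The figures 385, 1537, and 6147 can be reproduced with a short calculation. The derivation below is a sketch under two assumptions that are not spelled out in this section (the paper's own proof is in its appendix): each probability is estimated by a sample frequency with the usual normal-approximation interval, and the worst case $p(1-p) \le 1/4$ is used for the variance.

```latex
% Assumption: \hat{p} is a sample frequency from N samples.
\[
  \text{half-width}(\hat{p})
  = z_{1-\alpha/2}\sqrt{\frac{p(1-p)}{N}}
  \le \frac{z_{1-\alpha/2}}{2\sqrt{N}},
  \qquad\text{since } p(1-p) \le \tfrac{1}{4}.
\]
% One probability, e.g. P(y_x), with z_{0.975} = 1.96 and margin 0.05:
\[
  \frac{1.96}{2\sqrt{N}} \le 0.05
  \iff N \ge \left(\frac{1.96}{0.1}\right)^{2} = 384.16
  \implies N = 385.
\]
% Difference of two probabilities, e.g. P(y_x) - P(y_{x'}): half-widths add.
\[
  2 \cdot \frac{1.96}{2\sqrt{N}} \le 0.05
  \iff N \ge \left(\frac{1.96}{0.05}\right)^{2} = 1536.64
  \implies N = 1537.
\]
% Four probabilities: two experimental (m samples), two observational (n samples).
\[
  2 \cdot \frac{1.96}{2\sqrt{m}} + 2 \cdot \frac{1.96}{2\sqrt{n}}
  = 1.96\left(\sqrt{\tfrac{1}{m}} + \sqrt{\tfrac{1}{n}}\right) \le 0.05,
  \qquad m = n \implies m \ge 78.4^{2} = 6146.56 \implies m = 6147.
\]
```

Under these assumptions the last line coincides with the bound of Theorem 4, which suggests why the four-distribution term drives the overall sample-size requirement.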

4 Simulation Results

Here, we present simulated studies, using two SCMs, to show that the proposed numbers of experimental and observational samples are adequate to obtain the desired margins of error of the bounds of PNS.
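To make the idea of such a simulated study concrete, here is a minimal Python sketch. It does not reproduce the paper's models: it assumes a toy SCM with a single fair-coin binary confounder $Z$ and hand-picked coefficients, and it assumes that each interventional distribution is estimated from $m$ draws. All function names and numbers are hypothetical.

```python
import random


def p_x_given_z(z):
    """Hypothetical P(X = 1 | Z = z)."""
    return 0.3 + 0.4 * z


def p_y_given_xz(x, z):
    """Hypothetical P(Y = 1 | X = x, Z = z)."""
    return 0.2 + 0.5 * x + 0.2 * z


def bounds(p_yx, p_yxp, joint):
    """PNS bounds of Equation 1; joint[(x, y)] = P(X = x, Y = y)."""
    p_y = joint[(1, 1)] + joint[(0, 1)]
    lower = max(0.0, p_yx - p_yxp, p_y - p_yxp, p_yx - p_y)
    upper = min(p_yx, 1.0 - p_yxp, joint[(1, 1)] + joint[(0, 0)],
                p_yx - p_yxp + joint[(1, 0)] + joint[(0, 1)])
    return lower, upper


def true_distributions():
    """Exact distributions by enumerating Z, with P(Z = 1) = 0.5."""
    p_yx = sum(0.5 * p_y_given_xz(1, z) for z in (0, 1))
    p_yxp = sum(0.5 * p_y_given_xz(0, z) for z in (0, 1))
    joint = {(x, y): 0.0 for x in (0, 1) for y in (0, 1)}
    for z in (0, 1):
        for x in (0, 1):
            px = p_x_given_z(z) if x else 1.0 - p_x_given_z(z)
            py = p_y_given_xz(x, z)
            joint[(x, 1)] += 0.5 * px * py
            joint[(x, 0)] += 0.5 * px * (1.0 - py)
    return p_yx, p_yxp, joint


def sample_y(x, z, rng):
    return int(rng.random() < p_y_given_xz(x, z))


def estimated_distributions(m, n, rng):
    # Experimental data: m draws under do(X = 1) and m draws under do(X = 0).
    p_yx = sum(sample_y(1, rng.randint(0, 1), rng) for _ in range(m)) / m
    p_yxp = sum(sample_y(0, rng.randint(0, 1), rng) for _ in range(m)) / m
    # Observational data: n draws where X follows its natural mechanism.
    counts = {(x, y): 0 for x in (0, 1) for y in (0, 1)}
    for _ in range(n):
        z = rng.randint(0, 1)
        x = int(rng.random() < p_x_given_z(z))
        counts[(x, sample_y(x, z, rng))] += 1
    joint = {key: count / n for key, count in counts.items()}
    return p_yx, p_yxp, joint


def simulate(m=6147, n=6147, seed=0):
    """Absolute errors of the estimated lower and upper PNS bounds."""
    rng = random.Random(seed)
    true_lo, true_hi = bounds(*true_distributions())
    est_lo, est_hi = bounds(*estimated_distributions(m, n, rng))
    return abs(est_lo - true_lo), abs(est_hi - true_hi)


print(simulate())
```

For this toy model the true bounds are 0.5 and 0.7. Repeating `simulate` over many seeds and sample sizes would give an empirical picture of how conservative the 0.05 guarantee is for this particular model.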

4.1 Causal Model

To estimate the margins of error of the bounds, we must first know the data-generating process so that the true experimental and observational distributions are available. The two models we use are shown in Figure 1 (the two models have the same causal graph but different coefficients in their SCMs; the method used to generate the models is in the appendix), where $X$ is a binary treatment, $Y$ is a binary effect, and $Z$ is a set of 20 independent confounders (say $Z_1, ..., Z_{20}$). The structural equations are as follows (for simplicity, we let $x = 1$, $x' = 0$, $y = 1$, and $y' = 0$):
