SLIDE 1

Counterfactual Policy Evaluation in Reproducing Kernel Hilbert Spaces

Krikamol Muandet
Max Planck Institute for Intelligent Systems, Tübingen, Germany
Jeju, Korea — February 22, 2019

SLIDE 2

Acknowledgment

Motonobu Kanagawa (University of Tübingen), Sorawit Saengkyongam (UCL), Sanparith Marukatat (NECTEC)

SLIDE 3

1. Introduction
2. Counterfactual Mean Embedding
3. Policy Evaluation
4. Discussion

SLIDE 4

Introduction

SLIDE 5


Motivation

Example domains: recommendation, autonomous cars, healthcare.

Goal: Identify the best (causal) policy.

SLIDE 7


Personalization

SLIDE 8


Healthcare

SLIDE 13


A Causal Policy

X: context, T: treatment, Y: outcome, π: policy. Example: X = {age, gender}, T = pills, Y = cholesterol level.

◮ A context x ∼ ρ.
◮ A treatment t ∼ π(t | x) for (x, t) ∈ X × T.
◮ An outcome y ∼ η(y | x, t) for (x, t, y) ∈ X × T × Y.

Diagram: Context X → Policy π → Treatment T → Outcome Y.

Footnote: the terms “context” and “covariate” are used interchangeably.
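This data-generating process is easy to simulate. Below is a minimal, hypothetical Python sketch; the particular choices of ρ, π, and η are illustrative assumptions, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_unit():
    # Context x ~ rho: age and a binary gender indicator (illustrative choices).
    x = np.array([rng.integers(20, 80), rng.integers(0, 2)])
    # Treatment t ~ pi(t | x): a logged policy that prescribes pills more often
    # to older patients.
    p_pill = 1.0 / (1.0 + np.exp(-(x[0] - 50) / 10.0))
    t = rng.binomial(1, p_pill)
    # Outcome y ~ eta(y | x, t): cholesterol level, lowered by the treatment.
    y = 200.0 + 0.5 * x[0] - 30.0 * t + rng.normal(0.0, 5.0)
    return x, t, y

logged_data = [sample_unit() for _ in range(1000)]  # a dataset collected under pi
```

Note how the treatment depends on the context: this is precisely the selection bias that makes the logged data non-randomized.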

SLIDE 15


How to Identify Good Policies

Randomized experiments (A/B tests):
✓ Gold standard in science
× Expensive, time-consuming, or unethical

Observational studies:
✓ No randomization; cheaper, safer, and more ethical
× Selection bias

SLIDE 16

Counterfactual Mean Embedding

SLIDE 19


Potential Outcome Framework

Standard framework in social science, econometrics, and healthcare. Treatment T ∈ {0, 1} and potential outcomes Y0, Y1 ∈ ℝ.

◮ T ∈ {placebo, injection}
◮ Y0 = cholesterol level if T = placebo
◮ Y1 = cholesterol level if T = injection

Unit   Y1   Y0   Y1 − Y0
A      15   20    −5
B      10   12    −2
C       5   11    −6
D      12   19    −7

Individual treatment effect: ITE(i) := Y1(i) − Y0(i). In practice only one of Y1(i), Y0(i) is ever observed for each unit, so the other entry is missing — the Fundamental Problem of Causal Inference (FPCI) (Rubin 2005).
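If both potential outcomes were observable, as in the hypothetical table above, the ITE and the average treatment effect (ATE) would follow immediately; a minimal sketch (the FPCI says exactly that one entry per unit is never observed, so this computation is infeasible on real data):

```python
# Hypothetical potential-outcome table from the slide: unit -> (Y1, Y0).
outcomes = {"A": (15, 20), "B": (10, 12), "C": (5, 11), "D": (12, 19)}

ite = {unit: y1 - y0 for unit, (y1, y0) in outcomes.items()}
ate = sum(ite.values()) / len(ite)
print(ite)  # {'A': -5, 'B': -2, 'C': -6, 'D': -7}
print(ate)  # -5.0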

SLIDE 26


Rubin’s Causal Model

Causal effect is defined w.r.t. the counterfactual outcomes.

◮ What would the value of Y1 have been, had the subject gotten the injection?

◮ Covariates (X) associated with each unit are available.
◮ Confounders (Z) affecting both T and Y simultaneously may exist.
◮ A propensity score: ρ(x) := P(T = 1 | X = x).
◮ We observe a dataset D = {(x1, t1, y1), (x2, t2, y2), ..., (xn, tn, yn)}, where (xi, ti, yi) := (covariate, received treatment, outcome).
◮ The treatment assignment mechanism is not known. (Rubin 2005)

SLIDE 28


Rubin’s Causal Model

Main Assumptions

◮ Stable unit treatment value assumption (SUTVA): the outcome of the i-th unit is independent of the other units' outcomes and received treatments.
◮ Unconfoundedness/ignorability/exogeneity: Y0, Y1 ⊥⊥ T | X.
◮ Treatment positivity: for all x and t, 0 < P(T = t | X = x) < 1.

Theorem (Propensity Score)

Let ρ(X) = P(T = 1 | X) be the propensity score. Suppose that ignorability holds. Then Y0, Y1 ⊥⊥ T | ρ(X).
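Since the treatment assignment mechanism is unknown, ρ(x) is commonly estimated from the logged data; a minimal sketch using logistic regression (scikit-learn assumed available; the data here are synthetic):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy logged data: X are covariates, t the received binary treatments.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
t = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Fit rho_hat(x) ~ P(T = 1 | X = x) by logistic regression.
model = LogisticRegression().fit(X, t)
rho_hat = model.predict_proba(X)[:, 1]  # estimated propensity for each unit
```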

SLIDE 34


Counterfactual Distribution

π0: null/logged policy; π1: target/new policy. Our goal is to answer the following counterfactual question: “How would the outcomes have changed, had we switched from the null policy π0 to the target policy π1?” Let Yi be the outcome and Zi = (Xi, Ti) for i ∈ {0, 1}. Chernozhukov et al. (2013) define the counterfactual distribution

P_{Y1} := ∫ P_{Y0|Z0}(y | z) dP_{Z1}(z).

Under the main assumptions, the counterfactual distribution P_{Y1} corresponds to the interventional distribution P*_{Y1}.

We will construct an estimate for P_{Y1} without any sample from it.
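When Z and Y are discrete, the definition can be evaluated directly as a reweighting of the conditional outcome distribution; a toy sketch with two values each of z and y (all probabilities are illustrative assumptions):

```python
import numpy as np

# Conditional outcome distribution P(Y0 = y | Z0 = z): rows indexed by z, columns by y.
P_y_given_z = np.array([[0.8, 0.2],   # z = 0
                        [0.3, 0.7]])  # z = 1
P_z_old = np.array([0.9, 0.1])  # P_{Z0}: the null policy rarely produces z = 1
P_z_new = np.array([0.2, 0.8])  # P_{Z1}: the target policy mostly produces z = 1

# P_{Y1}(y) = sum_z P(y | z) P_{Z1}(z): condition on logged data, average under pi_1.
P_y_counterfactual = P_z_new @ P_y_given_z
print(P_y_counterfactual)  # [0.4 0.6]
```

The conditional P(y | z) is learned from logged data under π0, while the averaging distribution over z comes from π1 — exactly the structure of the definition above.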

SLIDE 36


Implicit Representation of Distributions

Diagram: distributions P and Q on X are mapped to their mean embeddings µP and µQ in the RKHS H.

Kernel Mean Embedding (Berlinet and Thomas-Agnan 2004, Smola et al. 2007)

Let φ(x) = k(x, ·) be a canonical feature map from X into H. A kernel mean embedding (KME) of a distribution P over X is defined by

µP := ∫_X φ(x) dP(x) = ∫_X k(x, ·) dP(x).

The embedding µP is well-defined if
1. the kernel k is measurable, and
2. the kernel is bounded, i.e., k(x, x) < ∞ for all x ∈ X.
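Empirically, µP is estimated from samples x1, ..., xn ∼ P as µ̂P = (1/n) Σᵢ k(xᵢ, ·). A minimal sketch that evaluates the empirical embedding at query points; the Gaussian kernel and its bandwidth are illustrative choices:

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 sigma^2)); bounded, so the KME is well defined.
    d = a[:, None, :] - b[None, :, :]
    return np.exp(-np.sum(d**2, axis=-1) / (2 * sigma**2))

def kme_eval(samples, query):
    # Evaluate mu_hat_P(q) = (1/n) sum_i k(x_i, q) at each query point q.
    return gauss_kernel(samples, query).mean(axis=0)

X = np.random.randn(500, 2)    # samples from P
grid = np.random.randn(10, 2)  # points at which to evaluate the embedding
print(kme_eval(X, grid))
```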

SLIDE 37


Embedding of Conditional Distributions

Diagram: k(x, ·) ∈ H is mapped by C_{YX} C_{XX}⁻¹ to µ_{Y|X=x} = C_{YX} C_{XX}⁻¹ k(x, ·) ∈ G, the embedding of P(Y | X = x).

The conditional mean embedding of P(Y | X) can be defined as the operator

U_{Y|X} : H → G, U_{Y|X} := C_{YX} C_{XX}⁻¹.
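In practice U_{Y|X} is estimated from paired samples with a regularizer ε, so that µ̂_{Y|X=x} becomes a weighted sum Σᵢ wᵢ(x) ϕ(yᵢ). A minimal sketch of the weights; the Gaussian kernel and the value of ε are illustrative assumptions:

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    d = a[:, None, :] - b[None, :, :]
    return np.exp(-np.sum(d**2, axis=-1) / (2 * sigma**2))

def cme_weights(X, x_query, eps=1e-2, sigma=1.0):
    # mu_hat_{Y|X=x} = sum_i w_i(x) * phi(y_i), with
    # w(x) = (K + n*eps*I)^{-1} k_X(x): a regularized empirical version
    # of C_YX C_XX^{-1} applied to k(x, .).
    n = len(X)
    K = gauss_kernel(X, X, sigma)          # Gram matrix on the inputs
    k_x = gauss_kernel(X, x_query, sigma)  # kernel evaluations at the queries
    return np.linalg.solve(K + n * eps * np.eye(n), k_x)
```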

SLIDE 40


Counterfactual Mean Embedding

Recall that we have π0: null/logged policy, π1: target/new policy. An embedding of P_{Y1} = ∫ P_{Y0|Z0}(y|z) dP_{Z1}(z) can be defined by

µ_{Y1} := ∫ ϕ(y) dP_{Y1}(y) = ∫∫ ϕ(y) dP_{Y0|Z0}(y|z) dP_{Z1}(z) = C_{Y0Z0} C_{Z0}⁻¹ µ_{Z1}.

Theorem (causal interpretation)

Suppose that exogeneity holds, i.e., Y0, Y1 ⊥⊥ T | X almost surely, and that the common support assumption holds. Then

µ_{Y1} = µ*_{Y1},

where µ*_{Y1} denotes the RKHS embedding of the interventional distribution P*_{Y1}.

SLIDE 42


Counterfactual Mean Embedding

Proposition (empirical estimate)

Given samples (z1, y1), ..., (zn, yn) from P_{Y0Z0}(z, y) and z′1, ..., z′m from P_{Z1}(z), define

Ψ := [ϕ(y1), ..., ϕ(yn)]ᵀ,  K_{ij} := k(z_i, z_j),  L_{ij} := k(z_i, z′_j),  1_m := (1/m, ..., 1/m)ᵀ.

Then

µ̂_{Y1} = Ĉ_{Y0Z0}(Ĉ_{Z0} + εI)⁻¹ µ̂_{Z1} = Ψᵀ(K + nεI)⁻¹ L 1_m = Σ_{i=1}^{n} β_i ϕ(y_i),

with weights β := (K + nεI)⁻¹ L 1_m.
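The weights β above reduce to a single linear solve; a minimal sketch (the Gaussian kernel on Z and the value of ε are illustrative assumptions, not prescribed by the proposition):

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    d = a[:, None, :] - b[None, :, :]
    return np.exp(-np.sum(d**2, axis=-1) / (2 * sigma**2))

def cme_beta(Z0, Z1, eps=1e-2, sigma=1.0):
    # beta = (K + n*eps*I)^{-1} L 1_m, so that mu_hat_{Y1} = sum_i beta_i phi(y_i).
    n, m = len(Z0), len(Z1)
    K = gauss_kernel(Z0, Z0, sigma)   # K_ij = k(z_i, z_j) on logged pairs
    L = gauss_kernel(Z0, Z1, sigma)   # L_ij = k(z_i, z'_j) against target samples
    return np.linalg.solve(K + n * eps * np.eye(n), L @ np.full(m, 1.0 / m))
```

Note that no sample from P_{Y1} is needed: the estimator only reweights the logged outcomes y1, ..., yn.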

Theorem (uniform convergence)

Under some technical assumptions, if ε_n decays to zero sufficiently slowly as n → ∞ and lim_{n→∞} ‖µ̂_{Z1} − µ_{Z1}‖_H = 0, we have, as n → ∞,

‖µ̂_{Y1} − µ_{Y1}‖_G → 0 in probability.

SLIDE 44


Convergence Rate

Theorem

Let g := dP_{Z1}/dP_{Z0} and θ(z, z̃) := E[ℓ(Y0, Ỹ0) | Z0 = z, Z̃0 = z̃]. Assume that g ∈ Range(Tᵅ) for 0 < α ≤ 1 and that θ ∈ Range((T ⊗ T)ᵝ) for 0 < β ≤ 1. Then for ε_n = c·n^{−1/(1+β+max(1−α,α))}, with c > 0 arbitrary but independent of n, we have

‖Ĉ_{Y0Z0}(Ĉ_{Z0} + ε_n I)⁻¹ µ̂_{Z1} − µ_{Y1}‖_F = O_p(n^{−(α+β)/(2(1+β+max(1−α,α)))})

as n → ∞.

Remark: α controls the overlap between P_{Z1} and P_{Z0}; β controls the smoothness of P_{Y0|Z0}(y|z). Our estimator has a “doubly robust”-like property.

SLIDE 45

Policy Evaluation

SLIDE 50


Policy Evaluation

Consider a recommendation platform:

◮ Context: user information x ∈ X
◮ Treatment: recommendation t ∼ π(t|x)
◮ Outcome: reward y = δ(x, t)

Given logged data from an initial policy π0 and samples from a target policy π1:

D0 = {(x1, t1, y1), ..., (xn, tn, yn)},  D1 = {(x*1, t*1), ..., (x*m, t*m)}.

Assume that P0(y | x′, t′) = P1(y | x′, t′). Then we have

P1(y) = ∫ P1(y | x*, t*) dP1(x*, t*) = ∫ P0(y | x, t) dP1(x, t).

P1(y) is the counterfactual reward distribution under the new policy π1. Let Z0 = (X, T) and Z1 = (X*, T*); then

µ_{P1(y)} = C_{Y0Z0}(C_{Z0Z0} + εI)⁻¹ µ_{Z1}.
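For policy evaluation one often only needs the expected reward under π1. Since RKHS expectations are inner products with the embedding, using the identity feature on outcomes the estimate collapses to a β-weighted average of the logged rewards. A self-contained toy sketch; the data-generating choices are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_kernel(a, b, sigma=1.0):
    d = a[:, None, :] - b[None, :, :]
    return np.exp(-np.sum(d**2, axis=-1) / (2 * sigma**2))

# Toy logged data: z = (x, t) pairs under pi_0 with rewards y, and
# context-treatment samples drawn under the target policy pi_1.
n, m = 500, 500
Z0 = rng.normal(0.0, 1.0, size=(n, 2))
y = np.sin(Z0[:, 0]) + Z0[:, 1] + 0.1 * rng.normal(size=n)
Z1 = rng.normal(0.5, 1.0, size=(m, 2))

eps = 1e-2
K = gauss_kernel(Z0, Z0)
L = gauss_kernel(Z0, Z1)
beta = np.linalg.solve(K + n * eps * np.eye(n), L @ np.full(m, 1.0 / m))

# With the identity feature on outcomes, the embedding's mean reduces to a
# beta-weighted average of logged rewards: the estimated value of pi_1.
value_hat = float(beta @ y)
```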

SLIDE 51


Experimental Results

Figure: mean squared error (log scale, roughly 10⁻⁷ to 10⁻¹) versus number of observations (1,000 to 1,000,000), in four panels: stochastic target policy with (M=100, K=10) and (M=10, K=5), and deterministic target policy with (M=100, K=10) and (M=10, K=5). Estimators compared: CME, Direct, DR, Slate, wIPS, OnPolicy.

Dataset: Microsoft Learning to Rank Challenge dataset (MSLR-WEB30K)

SLIDE 52

Discussion

SLIDE 58


Discussion

In policy learning, given a policy πθ, the objective and its gradient are

J(θ) := E_{x∼ρX} E_{t∼πθ(t|x)} E_{y∼η(y|x,t)}[δ(x, t, y)],
∇θJ(θ) = E_{πθ}[δ(x, t, y) ∇θ log πθ(t|x)].

The gradient ∇θJ(θ) can be directly estimated by CME (a sketch of the score-function form follows below). Several disciplines that rely on observational studies will benefit from this work:

◮ Social science, econometrics, healthcare, finance, etc.

Further directions: include experimental data to improve the policy, and incorporate multiple sets of observational data obtained from different policies. Our problem is related to (batch) reinforcement learning, policy gradient methods, and contextual bandits in machine learning.
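The score-function form of ∇θJ(θ) above is the classic REINFORCE estimator; for contrast with the CME-based approach, here is a minimal Monte Carlo sketch of it for a softmax policy over discrete treatments. This is not the CME estimator from this talk, and the policy parameterization and reward interface are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def grad_log_pi(theta, x, t):
    # pi_theta(t | x) = softmax(theta @ x)[t]; gradient of log pi w.r.t. theta.
    p = softmax(theta @ x)
    g = -np.outer(p, x)
    g[t] += x
    return g

def policy_gradient(theta, contexts, reward, n_samples=1000):
    # Monte Carlo estimate of E_{pi_theta}[delta(x, t, y) * grad log pi_theta(t | x)];
    # `reward` is a user-supplied (hypothetical) function of (x, t).
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        x = contexts[rng.integers(len(contexts))]
        t = rng.choice(len(theta), p=softmax(theta @ x))
        grad += reward(x, t) * grad_log_pi(theta, x, t)
    return grad / n_samples
```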

SLIDE 60


Contact

Location: Max Planck Campus Tübingen
Website: http://krikamol.org
Email: krikamol@tuebingen.mpg.de
Publications: http://krikamol.org/research/pubs.htm

SLIDE 61


References I

A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2004.

V. Chernozhukov, I. Fernández-Val, and B. Melly. Inference on counterfactual distributions. Econometrica, 81(6):2205–2268, 2013.

D. B. Rubin. Causal inference using potential outcomes. Journal of the American Statistical Association, 100(469):322–331, 2005.

A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proceedings of the 18th International Conference on Algorithmic Learning Theory (ALT), pages 13–31. Springer-Verlag, 2007.
