SLIDE 1

On estimation of functional causal models: Post-nonlinear causal model as an example

Kun Zhang, Zhikun Wang, Bernhard Schölkopf

Dept. Empirical Inference, Max Planck Institute for Intelligent Systems, Tübingen, Germany

SLIDE 2

Causal discovery

lCausal discovery: identify causal

relations from purely observed data

lIn the past decades, under certain

assumptions, it was made possible to derive causation from passively

  • bserved data

¡ statistical data ⇒ causal structure ¡ causal Markov assumption ¡ faithfulness…

X1     X2
1.1    1.0
2.1    2.0
3.1    4.2
2.3   −0.6
1.3    2.2
1.8    0.9
 …      …

[diagram: X1 — X2 — which causal structure generated the data?]
SLIDE 3

Constraint-based causal discovery

  • under the Markov condition & faithfulness assumption, uses (conditional) independence constraints to find candidate causal structures
  • example: PC algorithm (Spirtes & Glymour, 1991)
  • Markov equivalence class
  • pattern Y—X—Z
  • same adjacencies
  • → if all equivalent DAGs agree on an edge's orientation; — if they disagree
  • might be unique: v-structure

Y ⊥⊥ Z | X        Y ⊥⊥ Z
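The independence constraints above can be illustrated with a toy sketch (my own construction, not the PC algorithm itself): for a v-structure Y → X ← Z, Y and Z are marginally independent but become dependent once X is conditioned on. A partial-correlation proxy stands in for the CI test:

```python
# Toy illustration of the (conditional) independence tests that let
# constraint-based methods orient a v-structure Y -> X <- Z.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
y = rng.normal(size=n)
z = rng.normal(size=n)
x = y + z + 0.3 * rng.normal(size=n)   # collider: Y -> X <- Z

def corr(a, b):
    return abs(np.corrcoef(a, b)[0, 1])

def partial_corr(a, b, c):
    # correlation of a and b after linearly regressing out c
    # (a simple Gaussian-CI-test proxy)
    ra = a - np.polyval(np.polyfit(c, a, 1), c)
    rb = b - np.polyval(np.polyfit(c, b, 1), c)
    return abs(np.corrcoef(ra, rb)[0, 1])

print(corr(y, z))             # near 0: Y and Z marginally independent
print(partial_corr(y, z, x))  # large: dependent given X -> v-structure
```

The asymmetry between the two tests is exactly what makes the v-structure orientation unique within the Markov equivalence class.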

SLIDE 4
  • {local causal structures} → {conditional independences}

X ⊥⊥ Z | Y

[diagram: candidate DAGs over X, Y, Z (and over X, Y) consistent with X ⊥⊥ Z | Y]

Constraint-based method: An inverse problem

SLIDE 8
  • {local causal structures} → {conditional independences}

X ⊥⊥ Z | Y

[diagram: via faithfulness, the conditional independence X ⊥⊥ Z | Y maps back to a Markov equivalence class of DAGs over X, Y, Z]

  • Instead, try to directly identify local causal structures with functional causal models

[diagram: the two-variable case — X → Y vs. Y → X — offers no conditional independence constraints to exploit]

two-variable case?

Constraint-based method: An inverse problem

SLIDE 9

Outline

  • Causal discovery based on functional causal models (FCMs)
  • Estimation of FCMs: Relationship between dependence minimization and maximum likelihood
  • Post-nonlinear causal model as an example: By warped Gaussian processes with flexible noise distribution

SLIDE 10

Causality is about data-generating process

  • Functional causal model Y = f(X, E): Effect generated from cause with independent noise
  • Why useful?
      • structural constraints on f guarantee identifiability of the causal model, i.e., asymmetry between X and Y
      • in practice f can usually be approximated with a well-constrained form!
  • How to distinguish cause from effect: Fit the model in both directions, and see which direction gives independence between the assumed cause and the noise

[diagram: X and E enter f, which outputs Y]
SLIDE 11

FCM cannot distinguish cause from effect without constraints on f

  • Without constraints on f, for a given (X, Y), both Y = f₁(X, E) with E ⊥⊥ X and X = f₂(Y, E₁) with E₁ ⊥⊥ Y are possible
  • E.g., with a Gram–Schmidt-like orthogonalization procedure (Darmois, '51; Hyvärinen & Pajunen, '99)

x̃ = cdf(x), so x̃ ∼ U(0, 1);  e = cdf(y | x̃) = ∫₋∞^y p_{x̃,Y}(x̃, t) dt.  Then (x, y) ↦ (x̃, e), with E ⊥⊥ X.

SLIDE 12

(Generally) identifiable FCMs with independent noise

  • linear non-Gaussian acyclic causal model (Shimizu et al., '06): Y = aX + E
  • additive noise model (Hoyer et al., '09): Y = f(X) + E
  • post-nonlinear (PNL) causal model (Zhang & Hyvärinen, '09): Y = f₂(f₁(X) + E)

Some papers estimate these models by maximum likelihood; others propose to minimize the dependence between X and the estimated noise Ê: What is the difference?

SLIDE 13

Mutual information minimization vs. maximum likelihood?

  • Model: Y = f(X, E; θ₁); E ⊥⊥ X, E ∼ p(E; θ₂), f ∈ F (appropriately constrained)

  • Maximum likelihood: maximizing

    l_{X→Y}(θ) = Σᵢ₌₁ᵀ log p_F(xᵢ, yᵢ)
               = Σᵢ₌₁ᵀ log p(X = xᵢ) + Σᵢ₌₁ᵀ log p(E = êᵢ; θ₂) − Σᵢ₌₁ᵀ log |∂f/∂E|_{E=êᵢ}

  • Independence criterion: minimizing

    I(X, Ê; θ) = −(1/T) Σᵢ₌₁ᵀ log p(X = xᵢ) + (1/T) Σᵢ₌₁ᵀ log p(X = xᵢ, Y = yᵢ)
                 − (1/T) Σᵢ₌₁ᵀ log p(Ê = êᵢ; θ₂) + (1/T) Σᵢ₌₁ᵀ log |∂f/∂E|_{E=êᵢ}

  • These are related by

    (1/T) l_{X→Y}(θ) = (1/T) Σᵢ₌₁ᵀ log p(X = xᵢ, Y = yᵢ) − I(X, Ê; θ)

  • Since p(X, Y) does not depend on θ, maximum likelihood is equivalent to mutual information minimization
  • However, it is convenient to do model selection with maximum likelihood!
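The equivalence follows from a change of variables from (x, y) to (x, ê); a sketch of the intermediate step, term-by-term consistent with the objectives above:

```latex
% empirical mutual information between X and the estimated noise
I(X,\hat E;\theta)
  = \hat H(X) + \hat H(\hat E) - \hat H(X,\hat E)
  = -\frac1T\sum_{i=1}^T \log p(X=x_i)
    -\frac1T\sum_{i=1}^T \log p(\hat E=\hat e_i;\theta_2)
    +\frac1T\sum_{i=1}^T \log p_{X,\hat E}(x_i,\hat e_i).
% change of variables (x, y) \mapsto (x, \hat e), with Jacobian |\partial f/\partial E|:
p_{X,\hat E}(x_i,\hat e_i)
  = p(X=x_i,\,Y=y_i)\,\Bigl|\frac{\partial f}{\partial E}\Bigr|_{E=\hat e_i}.
% substituting recovers the relation
\frac1T\, l_{X\to Y}(\theta)
  = \frac1T\sum_{i=1}^T \log p(X=x_i,\,Y=y_i) - I(X,\hat E;\theta).
```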

SLIDE 14

Loss that might be caused by a wrongly specified noise distribution

  • Maximum likelihood (or mutual information minimization) aims to maximize J_{X→Y}
  • If f has additive noise, ∂f/∂E ≡ 1, and the problem reduces to an ordinary regression problem
  • In general, if p(E) is wrongly specified (e.g., simply set to Gaussian), the estimated f might not be statistically consistent
  • The estimated f might have to be sacrificed to make the estimated noise closer to the specified p(E), so that the first term becomes bigger; it is the trade-off between the two terms that is maximized

J_{X→Y} = Σᵢ₌₁ᵀ log p(E = êᵢ) − Σᵢ₌₁ᵀ log |∂f/∂E|_{E=êᵢ}

Y = f(X, E; θ₁); E ⊥⊥ X, E ∼ p(E; θ₂), f ∈ F (appropriately constrained)
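The cost of a misspecified noise distribution can be previewed on the noise term alone (a minimal numeric illustration, my own construction, not the paper's estimator): a two-component mixture of Gaussians fitted by EM to skewed samples attains a higher average log-likelihood than the best single Gaussian.

```python
# Compare a single-Gaussian noise model against a flexible
# mixture-of-Gaussians (MoG) model on skewed (log-normal) noise.
import numpy as np

rng = np.random.default_rng(3)
e = rng.lognormal(0.0, 0.7, size=2000)  # skewed "true" noise

def gauss_logpdf(v, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mu) ** 2 / var)

# single Gaussian fit (what "simply set to Gaussian" assumes)
ll_gauss = gauss_logpdf(e, e.mean(), e.var()).mean()

# 2-component MoG fitted by a few EM iterations
w = np.array([0.5, 0.5])
mu = np.array([e.mean() - 0.5, e.mean() + 0.5])
var = np.array([e.var(), e.var()])
for _ in range(200):
    logp = np.log(w) + gauss_logpdf(e[:, None], mu, var)  # (n, 2)
    r = np.exp(logp - logp.max(1, keepdims=True))
    r /= r.sum(1, keepdims=True)                          # responsibilities
    nk = r.sum(0)
    w, mu = nk / len(e), (r * e[:, None]).sum(0) / nk
    var = (r * (e[:, None] - mu) ** 2).sum(0) / nk
ll_mog = np.log(
    np.exp(np.log(w) + gauss_logpdf(e[:, None], mu, var)).sum(1)
).mean()
print(ll_mog > ll_gauss)  # the flexible noise model fits skewed noise better
```

In the full FCM objective, the gap between the two fits is exactly the slack that forces the estimated f to absorb the noise mismatch.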

SLIDE 15

Post-nonlinear causal model: An example

  • Without prior knowledge, the assumed model is expected to be
      • general enough: can approximate the true generating process
      • identifiable: asymmetry between causes and effects
  • post-nonlinear (PNL) causal model: Y = f₂(f₁(X) + E)
  • The PNL causal model is generally identifiable (Zhang & Hyvärinen, '09)

SLIDE 16

Estimating PNL model by Warped Gaussian processes with non-Gaussian noise

  • Previously estimated by minimizing the mutual information between X and Ê: difficult to do model selection
  • A maximum likelihood perspective:
      • Use a Gaussian process (GP) prior for f₁ ⇒ warped Gaussian processes (Snelson et al., 2004)
      • Further use a flexible noise distribution (mixture of Gaussians) for E; otherwise the estimated f may be inconsistent
      • Represent f₂ with some basis functions
      • Use MCMC for Bayesian inference on f₁, f₂, and E

Y = f2 ( f1 (X) + E)
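Under the PNL model the likelihood follows by a change of variables: with e = f₂⁻¹(y) − f₁(x), p(y|x) = p_E(e) / |f₂′(f₂⁻¹(y))|. A hedged sketch with illustrative choices (f₁(x) = 2x, f₂ = exp, standard Gaussian noise — my own toy instantiation, not the warped-GP estimator):

```python
# Evaluate the PNL conditional likelihood p(y|x) by change of variables.
import numpy as np

f1 = lambda x: 2.0 * x
f2inv = np.log                    # inverse of f2 = exp
log_abs_f2prime = lambda z: z     # log |f2'(z)| = log exp(z) = z
log_pE = lambda e: -0.5 * (np.log(2 * np.pi) + e ** 2)  # standard Gaussian

def pnl_loglik(y, x):
    # y = f2(f1(x) + e)  =>  e = f2^{-1}(y) - f1(x)
    # p(y|x) = p_E(e) / |f2'(f2^{-1}(y))|
    z = f2inv(y)
    return log_pE(z - f1(x)) - log_abs_f2prime(z)

# sanity check: p(y|x) integrates to ~1 over y for a fixed x
x0 = 0.3
ys = np.linspace(1e-6, 200.0, 400001)
p = np.exp(pnl_loglik(ys, x0))
mass = np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(ys))  # trapezoid rule
print(round(mass, 3))
```

Summing such per-point log-likelihoods over the data gives the objective that the warped-GP approach maximizes (with f₁, f₂, and the MoG noise parameters inferred rather than fixed).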

SLIDE 17

Simulation: Settings and results

  • To illustrate the different behaviors of the estimated PNL causal model, compare
      • warped GPs with MoG noise (WGP-MoG)
      • warped GPs with Gaussian noise (WGP-Gaussian)
      • the mutual information minimization approach with MLPs and MoG noise (PNL-MLP)
  • Generated data: Z = 2X + E, Y = Z, i.e., f₁(X) = 2X, f₂(Z) = Z, and E is log-normal

Y = f2 ( f1 (X) + E)
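This setting can be reproduced in a few lines (a sketch; the sample size, X distribution, and log-normal parameters are my assumptions). A plain least-squares fit recovers the linear trend, but its residuals are heavily right-skewed, which is exactly the mismatch that trips up the Gaussian-noise model:

```python
# Generate the slide's simulation setting: Z = 2X + E, Y = Z, E log-normal.
import numpy as np

rng = np.random.default_rng(4)
n = 500
x = rng.uniform(-1.5, 1.5, n)
e = rng.lognormal(0.0, 1.0, n)   # log-normal noise (parameters assumed)
y = 2.0 * x + e                  # f1(x) = 2x, f2 = identity

# residuals of an ordinary linear fit are strongly right-skewed,
# so a Gaussian noise assumption is misspecified here
resid = y - np.polyval(np.polyfit(x, y, 1), x)
skew = np.mean((resid - resid.mean()) ** 3) / resid.std() ** 3
print(skew > 1.0)
```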

[plot: data points (X, Y) and the function f₁(x)]
SLIDE 18
  • WGP-Gaussian: [plots: (b) estimated PNL function f₂ (warping function); (c) estimated f₁ (unwarped data points, GP posterior mean); (d) X vs. estimated noise — not independent!]
  • WGP-MoG: [plots: (a) estimated PNL function f₂ (warping function) — almost linear!; (b) estimated f₁ (unwarped data points, GP posterior mean); (c) X vs. estimated noise — independent!; (d) MoG noise distribution]
  • PNL-MLP: [plots: (a) estimated PNL function f₂; (b) estimated f₁ (unwarped data points, estimated f₁); (c) X vs. estimated noise — independent!]

Y = f₂(f₁(X) + E)

SLIDE 19

On real data

  • Apply different approaches for causal direction determination on 77 cause-effect pairs, for which the ground truth is known from background information
  • On pairs 22 and 57, PNL-WGP-MoG prefers X→Y, which is plausible given the background information, but PNL-WGP-Gaussian prefers Y→X

Method        PNL-MLP   PNL-WGP-Gaussian   PNL-WGP-MoG   ANM   GPI   IGCI
Accuracy (%)  70        67                 76            63    72    73

Accuracy of different methods for causal direction determination on the cause-effect pairs. (ANM: additive noise model; GPI: Gaussian process latent variable model based approach; IGCI: information geometric causal inference)

[scatter plot — Data pair 22; X: age of a person, Y: corresponding height]

[scatter plot — Data pair 57; X: latitude of the country's capital, Y: life expectancy]

SLIDE 20

Conclusion

  • The functional causal model (FCM) based approach can fully identify the causal structure
  • For estimating FCMs, maximum likelihood is equivalent to minimizing the mutual information between the assumed cause and the noise
  • Bayesian model selection is easier to do from the maximum likelihood point of view
  • In general, the performance depends on the specified noise distribution; it is better to learn it from data

Thanks for listening.
