

SLIDE 1

Local Independence Tests for Point Processes: Learning causality in event models

Nikolaj Thams, University of Copenhagen November 21st, 2019

Time to Event Data and Machine Learning Workshop Joint work with Niels Richard Hansen

SLIDE 2

  • Hawkes processes
  • Causality
  • Local independence test
  • Experimental results
  • Conclusion

SLIDE 3

Learning causality in event models?

[Figure: events of four processes a, b, c, h marked along a timeline from 0 to T]


SLIDE 5

Hawkes Processes

SLIDE 6

Point process

Point processes. A point process with marks V = {1, …, d} is a collection of random measures

    N^k = ∑_i δ_{T^k_i},

where T^k_i is the i'th event of type k. This defines counting processes t ↦ N^k_t := N^k(0, t].

[Figure: a sample path of N_t, jumping by one at the event times T_1, T_2, T_3]

If the compensator Λ^k_t of N^k_t equals ∫_0^t λ^k_s ds for some λ^k, then λ^k is the intensity of N^k. Observe that E N^k_t = ∫_0^t E λ^k_s ds.

Famous examples: the Poisson process (λ_t constant) and the Hawkes process (next slide).
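As a concrete illustration (our own, not from the slides), here is a minimal Python sketch of the simplest example above, the homogeneous Poisson process: i.i.d. Exp(λ) waiting times produce the event times T_i, and N_t counts the events in (0, t]. All names are illustrative.

```python
# Minimal sketch: simulate a homogeneous Poisson process with constant
# intensity lam on (0, T] via i.i.d. exponential waiting times.
import numpy as np

def simulate_poisson(lam, T, rng):
    """Return event times of a Poisson process with rate lam on (0, T]."""
    times = []
    t = rng.exponential(1.0 / lam)
    while t <= T:
        times.append(t)
        t += rng.exponential(1.0 / lam)
    return np.array(times)

rng = np.random.default_rng(0)
events = simulate_poisson(lam=0.5, T=20.0, rng=rng)
# N_T = number of events in (0, T]; here E[N_T] = lam * T = 10.
print(len(events), events[:3])
```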

SLIDE 7

Hawkes processes

Hawkes process. The process with intensity

    λ^k_t = β^k_0 + ∑_{v∈V} ∫_{−∞}^{t−} g_{vk}(t − s) N^v(ds) = β^k_0 + ∑_{v∈V} ∑_{T^v_i < t} g_{vk}(t − T^v_i)

is called the (linear) Hawkes process, with kernels g_{vk} for some integrable functions, e.g. g_{vk}(x) = β^{vk}_1 e^{−β^{vk}_2 x}.

This motivates using graphs for summarizing dependencies:

[Figure: intensity paths of two processes N^1 and N^2 over time, and the corresponding two-node dependence graph on {1, 2}]
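The conditional-intensity form above can be simulated by thinning. Below is a hedged Python sketch of Ogata-style thinning for a univariate linear Hawkes process with the exponential kernel g(x) = β_1 e^{−β_2 x} from the slide; the univariate restriction and all parameter names are our simplifications, and stability requires β_1/β_2 < 1.

```python
# Sketch of Ogata's thinning algorithm for a univariate linear Hawkes
# process with kernel g(x) = b1 * exp(-b2 * x).
import numpy as np

def simulate_hawkes(mu, b1, b2, T, rng):
    events, t = [], 0.0
    while t < T:
        # With an exponential kernel the intensity only decays between
        # events, so the current intensity is a valid upper bound.
        lam_bar = mu + sum(b1 * np.exp(-b2 * (t - s)) for s in events)
        t += rng.exponential(1.0 / lam_bar)
        if t >= T:
            break
        lam_t = mu + sum(b1 * np.exp(-b2 * (t - s)) for s in events)
        if rng.uniform() <= lam_t / lam_bar:  # accept with prob lam_t / lam_bar
            events.append(t)
    return np.array(events)

rng = np.random.default_rng(1)
ev = simulate_hawkes(mu=0.3, b1=0.5, b2=1.0, T=100.0, rng=rng)
print(len(ev))
```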


SLIDE 9

Causality


SLIDE 12

Causal inference

Static system. Structural Causal Models (SCMs) consist of functional assignments, summarized by parents in a graph:

    X_i := f_i(X_{pa(i)}, ε_i),   i ∈ V

[Figure: a graph on X_1, X_2, X_3, and the same graph under an intervention X_i := c]

Essential assumption: the SCM also describes the system under interventions X_i := c.

A graph G satisfies, in conjunction with a separation criterion ⊥:

  • the global Markov property, if A ⊥ B | C implies A ⊥⊥_P B | C;
  • faithfulness, if A ⊥⊥_P B | C implies A ⊥ B | C.

The global Markov property and faithfulness are the motivation for developing conditional independence tests in causality. See (Peters et al. 2017) for details.
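To make this concrete, here is a small sketch (our own, not from the talk) that samples from the linear SCM X1 → X2 → X3 and checks the implied conditional independence X1 ⊥⊥ X3 | X2 with a partial correlation; all function names are our own.

```python
# Data from the SCM X1 -> X2 -> X3; d-separation gives X1 ⊥ X3 | X2,
# which we check empirically via partial correlation.
import numpy as np

rng = np.random.default_rng(2)
n = 5000
X1 = rng.normal(size=n)
X2 = 0.8 * X1 + rng.normal(size=n)
X3 = 0.8 * X2 + rng.normal(size=n)

def partial_corr(x, y, z):
    """Correlation of x and y after regressing out z from both."""
    Z = np.column_stack([np.ones_like(z), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

print(np.corrcoef(X1, X3)[0, 1])  # clearly nonzero: X1, X3 dependent
print(partial_corr(X1, X3, X2))   # near zero: X1 ⊥⊥ X3 | X2
```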

SLIDE 13

Causal inference: Dynamical system

Causal ideas have been generalized to the dynamical setting, e.g. (Didelez 2008; Mogensen, Malinsky, et al. 2018; Mogensen and Hansen 2018):

[Figure: a time-unrolled graph on nodes X^i_{t_j}, i = 1, 2, 3, at successive times t_1, t_2, t_3, …, together with the summary graph on X^1, X^2, X^3]

SLIDE 14

Local independence

Local independence. Let N be a marked point process. For subsets A, B, C ⊆ V, we say that B is locally independent of A given C if for every b ∈ B,

    λ^{b,A∪C}_t = E[λ^b_t | F^{A∪C}_t] has a version that is F^C_t-measurable,

and we write A ̸→ B | C. Heuristically, the intensity of b, when observing A ∪ C, depends only on events of C.

Under faithfulness assumptions, there exist algorithms for learning the causal graph (Meek 2014; Mogensen and Hansen 2018), by removing the edge a → b if a ̸→ b | C for some C. In practice, this requires an empirical test for independence!

SLIDE 15

Local independence test


SLIDE 17

Local independence test

We want to test

    H_0 : j ̸→ k | C,

equivalently, that λ^{k,C}_t is a version of λ^{k,C∪{j}}_t. We propose to fit

    λ^{k,C∪{j}}_t = β^k_0 + ∫^t g_{jk}(t − s) N^j(ds) + λ^{k,C}_t.

Then the test of H_0 : g_{jk} = 0 will have the right level if we estimate the true λ^{k,C}.

Problem: if there are latent variables, the marginalized model may not be a Hawkes process. So how do we estimate λ^{k,C} generally, to retain level?

[Figure: a graph on nodes j, k, c with a latent node h]

SLIDE 18

Volterra approximations

To develop a non-parametric fit for λ^C, we prove the following theorem, resembling Volterra series for continuous systems.

Theorem. Suppose that N is a stationary point process. There exists a sequence of functions h^α_N such that, letting

    λ^N_t = h^0_N + ∑_{n=1}^{N} ∑_{|α|=n} ∫_{−∞}^{t} ⋯ ∫_{−∞}^{t} h^α_N(t − s_1, …, t − s_n) N^{α_1}(ds_1) ⋯ N^{α_n}(ds_n),

we have λ^N_t → λ^C in probability as N → ∞.

SLIDE 19

Approximating intensity

λ^C approximations:

  • A1: Approximate by 2nd-order iterated integrals.
  • A2: Approximate kernels using tensor splines:

    h^α(x_1, …, x_n) ≈ ∑_{j_1=1}^{d} ⋯ ∑_{j_n=1}^{d} β^α_{j_1,…,j_n} b_{j_1}(x_1) ⋯ b_{j_n}(x_n)

In vector notation:

    λ^C_t(β) = β_0 + ∑_{v∈C} ∫_{−∞}^{t−} (β^v)^T Φ_1(t − s) N^v(ds) + ∑_{v_1,v_2∈C, v_2≥v_1} ∫_{−∞}^{t−} ∫_{−∞}^{t−} (β^{v_1v_2})^T Φ_2(t − s_1, t − s_2) N^{(v_1,v_2)}(ds_1, ds_2) =: β_C^T x^C_t

Expanding g_{jk} in the same spline basis, we get

    λ^{k,C∪{j}}_t = β^k_0 + ∫^t g_{jk}(t − s) N^j(ds) + λ^{k,C}_t = (β^j)^T x^j_t + β_C^T x^C_t =: β^T x_t
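A sketch of how the first-order part of the design vector x_t can be assembled: for each past event at lag t − s, evaluate a vector of basis functions and sum over events. We use Gaussian bumps on a lag grid as a stand-in for the spline basis Φ_1 of the talk; all names here are illustrative.

```python
# First-order features: x_t entries for one process are sum_{s<t} b_j(t-s).
import numpy as np

def basis(lag, centers, width=1.0):
    """Vector of basis functions (Gaussian bumps) evaluated at a lag."""
    return np.exp(-0.5 * ((lag - centers) / width) ** 2)

def first_order_features(t, event_times, centers, width=1.0):
    lags = t - event_times[event_times < t]
    if lags.size == 0:
        return np.zeros_like(centers)
    # broadcast: (num_events, 1) lags against (1, num_basis) centers
    return basis(lags[:, None], centers[None, :], width).sum(axis=0)

centers = np.linspace(0.5, 5.0, 6)     # lag grid carrying the basis
events = np.array([1.0, 2.5, 4.0])
x_t = first_order_features(t=5.0, event_times=events, centers=centers)
# lambda_t(beta) is then beta0 + beta @ x_t; second-order terms are
# built analogously from pairs of past events.
print(x_t.round(3))
```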

SLIDE 20

Maximum Likelihood Estimation

The likelihood is concave for linear intensities!

    log L_T(β) = ∫_0^T log(β^T x_t) N^k(dt) − β^T ∫_0^T x_t dt

We penalize with a roughness penalty:

    max_β  log L_T(β) − κ_0 β^T Ω β   s.t.  Xβ ≥ 0

The distribution of the maximum likelihood estimate is approximately normal:

    β̂ ∼ N( (I + 2κ_0 Ĵ_T^{−1} Ω) β_0 , Ĵ_T^{−1} K̂_T Ĵ_T^{−1} )   (approximately),

with K̂_T = ∫_0^T x_t x_t^T / (β̂^T x_t) dt and Ĵ_T = K̂_T − 2κ_0 Ω.
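A hedged sketch of this penalized, constrained fit, discretizing the integral term on a time grid. The design matrices X_ev (rows x_{t_i} at event times) and X_grid (rows x_t on the grid), the penalty matrix Ω and the scale κ_0 are assumed given; the starting point assumes nonnegative features so the constraint Xβ ≥ 0 holds initially.

```python
# Penalized MLE sketch for a linear intensity lambda_t = beta^T x_t.
import numpy as np
from scipy.optimize import minimize, LinearConstraint

def fit_penalized(X_ev, X_grid, dt, Omega, kappa0):
    p = X_ev.shape[1]

    def neg_pen_loglik(beta):
        lam_ev = X_ev @ beta
        # log-likelihood: log-intensity summed over events, minus the
        # compensator integral approximated on the grid
        ll = np.sum(np.log(np.maximum(lam_ev, 1e-12)))
        ll -= np.sum(X_grid @ beta) * dt
        return -(ll - kappa0 * beta @ Omega @ beta)

    # keep the intensity nonnegative at event and grid times: X beta >= 0
    X_all = np.vstack([X_ev, X_grid])
    cons = LinearConstraint(X_all, lb=0.0)
    res = minimize(neg_pen_loglik, np.full(p, 0.1),
                   method="trust-constr", constraints=[cons])
    return res.x
```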

SLIDE 21

Local Independence Test

Given the distribution of β̂ = (β̂_j, β̂_C), we can test the hypothesis H_0 : j ̸→ k | C. How do we test Φ^T β_j ≡ 0?

  • First idea: β̂_j is approximately normal, so test β_j = 0 directly.
  • Better idea (see Wood 2012): evaluate the basis Φ on a grid G = {x_1, …, x_M}. The fitted function values over the grid are then Φ(G)^T β̂_j. If β̂_j ∼ N(µ_j, Σ_j), the Wald test statistic for the null hypothesis Φ(G)^T µ_j = 0 is

    T_α = (β̂_j)^T Φ(G) [ Φ(G)^T Σ_j Φ(G) ]^{−1} Φ(G)^T β̂_j    (1)

This is χ²(M)-distributed, and we can test for significance of components!
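The statistic (1) translates directly into code. A minimal sketch, assuming β̂_j, Σ_j and the grid-evaluated basis Φ(G) are given; we use a plain linear solve where Wood (2012) works with a rank-adjusted (pseudo-)inverse.

```python
# Grid-based Wald test: T = f^T [Phi^T Sigma Phi]^{-1} f with f = Phi(G)^T beta.
import numpy as np
from scipy.stats import chi2

def wald_grid_test(beta_j, Sigma_j, Phi_G):
    """beta_j: (p,), Sigma_j: (p, p), Phi_G: (p, M) basis values on the grid."""
    f = Phi_G.T @ beta_j            # fitted function values over the grid
    V = Phi_G.T @ Sigma_j @ Phi_G   # their covariance
    T = f @ np.linalg.solve(V, f)   # Wald statistic, eq. (1)
    M = Phi_G.shape[1]
    return T, chi2.sf(T, df=M)      # p-value under H0: g_jk = 0
```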

SLIDE 22

Summary of test

We summarize our proposed test. To test j ̸→ k | C:

  • Approximate λ^C by the Volterra expansion at degree 2, with spline kernels.
  • Fit λ^{k,C∪{j}}(β) within this model class by penalized MLE.
  • Test Φ^T β_j ≡ 0 using grid evaluation and the Wald approximation.
  • If the test is accepted, conclude local independence.
SLIDE 23

Experimental results

SLIDE 24

Experiment 1: Testing various structures

In each of the following structures, we test a ̸→ b | {b} ∪ C:

[Figure: graph structures L1, L2, L3 and P1, P2, P3 on nodes a, b, c and a latent node h]

We obtain acceptance rates:

[Figure: bar chart of H_0 acceptance rates per structure; test outcomes marked accepted or rejected]


SLIDE 26

Causal discovery

We evaluate the performance of the test within the CA algorithm, which estimates the causal graph.

[Figure: event data for a, b, c, d on a timeline up to T; starting from the complete graph on {a, b, c, d}, the CA algorithm removes edges one by one, e.g. a → b after finding a ̸→ b | {b, c, d}]

SLIDE 27

Experiment 2: Causal discovery

We simulate random graphs, simulate a dataset from each graph, recover the graph from the dataset, and measure the Structural Hamming Distance (SHD) to the true graph.

SHD between G_1 and G_2: the minimum number of actions (flipping, adding or removing an edge) needed to turn G_1 into G_2.

[Figure: SHD to the true graph against the dimension of the graph, for the baseline and the LI test]
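A sketch of one common SHD convention for directed graphs, matching the definition above in that a flipped edge costs one action; the function and example are our own.

```python
# SHD: one action per unordered pair whose edges disagree (add, remove,
# or flip). A, B are boolean adjacency matrices with A[i, j] = (i -> j).
import numpy as np

def shd(A, B):
    d = A.shape[0]
    dist = 0
    for i in range(d):
        for j in range(i + 1, d):
            if (A[i, j], A[j, i]) != (B[i, j], B[j, i]):
                dist += 1
    return dist

A = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=bool)  # a -> b -> c
B = np.array([[0, 0, 0], [1, 0, 1], [0, 0, 0]], dtype=bool)  # a <- b -> c
print(shd(A, B))  # 1: the a-b edge is flipped
```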

SLIDE 28

Conclusion

SLIDE 29

Conclusion

  • Causal inference is possible in point process models, using conditional independence tests!
  • Facing latent components in a Hawkes model, the marginal process may not be Hawkes.
  • The Volterra expansions can overcome this model misspecification, by fitting a general functional form of intensities.
  • We propose a testing framework based on splines, with promising experimental results.

SLIDE 30

References i

Daley, Daryl J and David Vere-Jones (2007). An Introduction to the Theory of Point Processes: Volume II: General Theory and Structure. Springer Science & Business Media.

Didelez, Vanessa (2008). "Graphical models for marked point processes based on local independence". In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70.1, pp. 245–264.

Meek, Christopher (2014). "Toward learning graphical and causal process models". In: Proceedings of the UAI 2014 Conference on Causal Inference: Learning and Prediction, Volume 1274. CEUR-WS.org, pp. 43–48.

SLIDE 31

References ii

Mogensen, Søren Wengel and Niels Richard Hansen (2018). "Markov equivalence of marginalized local independence graphs". In: arXiv preprint arXiv:1802.10163. To appear in Ann. Statist.

Mogensen, Søren Wengel, Daniel Malinsky, and Niels Richard Hansen (2018). "Causal learning for partially observed stochastic dynamical systems". In: 34th Conference on Uncertainty in Artificial Intelligence (UAI 2018). Association for Uncertainty in Artificial Intelligence (AUAI), pp. 350–360.

Peters, Jonas, Dominik Janzing, and Bernhard Schölkopf (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press.

Wood, Simon N (2012). "On p-values for smooth components of an extended generalized additive model". In: Biometrika 100.1, pp. 221–228.

SLIDE 32

Questions?

SLIDE 33

Volterra: Sketch of proof I

First we show the representation at time 0, i.e. for λ_0.

  • 1. For any λ_0, use that 1_{|λ_0|<N} λ_0 → λ_0 in probability and 1_{|λ_0|<N} λ_0 ∈ L¹(F).
  • 2. Define F_τ = σ(T_1 ∧ τ, T_2 ∧ τ, …), and show that ∪_{τ≤0} L¹(F_τ) is dense in L¹(F), where F = σ(N_t, t < 0), via martingale convergence.
  • 3. Through a combinatorial argument, show that for λ_0 ∈ L¹(F_τ), 1_{N([τ,0])=1} λ_0 has an additive decomposition

        ∑_{n=1}^{N} β_n ∫_{[τ,0]} f(t_1) 1_{D_n} dN(t_n) → 1_{N([τ,0])=1} λ_0   a.s. as N → ∞

  • 4. Extend to 1_{N([τ,0])=M} λ_0 and sum all terms.
SLIDE 34

Volterra: Sketch of proof II

  • 5. Using time-homogeneity, λ(π, {N_s}_{s<π}) = λ(0, {N^π_s}_{s<0}), we extend the result to every time t.
  • 6. The extension to multivariate point processes is simple, using:

    ∫_{((−∞,0]×V)^n} h_n(t_1, v_1, …, t_n, v_n) N(dt_n × dv_n) = ∑_{|α|=n} ∫_{(−∞,0]^n} h^α(t_1, …, t_n) N^α(dt_n)

SLIDE 35

Local independence graphs

Local independence graph. For a point process with coordinates V = {1, …, d}, define the local independence graph G = (V, E) by

    E = {(a, b) | a → b | V \ {a}}

Example: [Figure: a local independence graph on three nodes a, b, c]

SLIDE 36

Graphs and µ-separation

[Figure: a graph on a, c_1, c_2, b, and a walk through it illustrating a collider]

µ-connection and separation. For G = (V, E), let a, b ∈ V and C ⊆ V. A µ-connecting walk p from a to b given C is a walk from a to b such that:

  • 1. p is non-trivial and its final edge points to b;
  • 2. a ∉ C;
  • 3. coll(p) ⊆ An(C);
  • 4. noncoll(p) ∩ C = ∅.

If no walks from a to b are µ-connecting given C, then a and b are µ-separated, and we write a ⊥_µ b | C.

SLIDE 37

Global Markov property

The following concepts relate local independence to a graph G:

Global Markov property: A ⊥_µ B | C implies A ̸→ B | C.
Faithfulness: A ̸→ B | C implies A ⊥_µ B | C.

The global Markov property makes the local independence graph "relevant" for understanding the underlying point process.

Recovering the graph using independence tests: assuming faithfulness and the global Markov property, (Meek 2014) proposes an algorithm which is guaranteed to return the true local independence graph, essentially by testing a ̸→ b | C for all a, b and sets C of increasing size.

SLIDE 38

Backup: The CA algorithm

Algorithm 1: Causal Analysis (CA) algorithm

    Initialize G = (V, E_CA) as a fully connected graph
    for v ∈ V do:
        n = 0
        while n < |pa(v)| do:
            for v′ ∈ pa(v) do:
                for C ⊆ pa(v) \ {v′} with |C| = n do:
                    if v′ ̸→ v | C then remove (v′, v) from E_CA
            n = n + 1
    return G = (V, E_CA)

In short: for pairs (v′, v), remove the edge v′ → v if there exists a set C such that v′ ̸→ v | C.
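The algorithm translates directly into Python. In the sketch below the local independence test is abstracted as a callback is_indep(u, v, C); this interface is our own, and whether v's own past is included in the conditioning is left to the callback.

```python
# Sketch of the CA algorithm: prune edges u -> v whenever some
# conditioning set C among v's remaining parents renders u locally
# independent of v.
from itertools import combinations

def ca_algorithm(V, is_indep):
    parents = {v: set(V) - {v} for v in V}   # start from the complete graph
    for v in V:
        n = 0
        while n < len(parents[v]):
            for u in sorted(parents[v]):     # snapshot; set may shrink
                for C in combinations(sorted(parents[v] - {u}), n):
                    if is_indep(u, v, set(C)):
                        parents[v].discard(u)  # remove edge u -> v
                        break
            n += 1
    return {(u, v) for v in V for u in parents[v]}
```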

SLIDE 39

Backup: P1 and P2

Definition.

  • ⊥ satisfies P1 if separation v′ ⊥ v | C for v′ ∉ C implies (v′, v) ∉ E.
  • ⊥ satisfies P2 if the lack of an edge (v′, v) implies the existence of a set C ⊆ pa(v) such that v′ ⊥ v | C.

The CA algorithm assumes both P1 and P2. d-separation satisfies P1, and δ- and µ-separation satisfy P1 and P2. We show that for any ⊥ satisfying P1 and P2, two graphs have the same separations exactly if they are equal.

SLIDE 40

Backup: Example of Local independence

Example: three children (a, b, c) throwing a ball, a → b → c → a → ⋯, each holding the ball for an Exp(1) waiting time. N^v counts the number of times child v has thrown the ball. This has intensities:

    λ^a_t = 1_{N^a_t = N^c_t},   λ^b_t = 1_{N^b_t < N^a_t},   λ^c_t = 1_{N^c_t < N^b_t}

We find b ̸→ a | {a, c} and a → b | {b, c}, because:

    λ^{a,{a,b,c}}_t = E[λ^a_t | F^{{a,b,c}}_t] = 1_{N^a_t = N^c_t} ∈ F^{{a,c}}_t
    λ^{b,{a,b,c}}_t = E[λ^b_t | F^{{a,b,c}}_t] = 1_{N^b_t < N^a_t} ∉ F^{{b,c}}_t

Also a → a | {b, c}, because λ^{a,{a,b,c}}_t = 1_{N^a_t = N^c_t} ∉ F^{{b,c}}_t.
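For intuition, a minimal simulation of this example (our own code): the current holder throws the ball after an Exp(1) waiting time, in the cycle a → b → c → a, and each throw is an event of the thrower's counting process.

```python
# Ball-throwing example: Exp(1) holding times, cycle a -> b -> c -> a.
import numpy as np

rng = np.random.default_rng(3)
T, t = 20.0, 0.0
holder, events = "a", []            # child a starts with the ball
nxt = {"a": "b", "b": "c", "c": "a"}
while True:
    t += rng.exponential(1.0)       # Exp(1) holding time
    if t > T:
        break
    events.append((t, holder))      # N^v jumps when child v throws
    holder = nxt[holder]

# At any time, N^a_t = N^c_t exactly when a holds the ball, which
# reproduces the indicator intensities on the slide.
print(events[:5])
```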

SLIDE 41

Backup: Runtime

[Figure: runtime in seconds against point count, for first- and second-order tests in structures S1, S3 and S4i]

Figure 1: Runtime of 300 invocations of the local empirical independence test. a ̸→ b | {b} ∪ C was tested 100 times in each of the structures S1, S3 and S4i.

SLIDE 42

Backup: Tuning κ0

[Figure: boxplots of test p-values against the roughness-penalty scale κ_0 (10⁻⁴ to 10⁴), for structures S1, S2, S3 and S4i–S4iv]

Figure 2: Boxplots of p-values from the 7 structures. From each structure, 100 Hawkes processes were simulated and the local empirical independence test was run, each with the roughness penalty at various levels of κ_0. Each simulation produced a p-value, which is plotted. The red dotted line shows the 5% level. The headers show the ground truth of whether a ̸→ b | {b} ∪ C. The dark-green line shows the fraction of the simulated p-values falling below the 5% level.
SLIDE 43

Backup: Latent experiment

[Figure: SHD of estimated graphs against the number of observed nodes, for the first- and second-order tests, with 0, 1 and 2 latent variables]

Figure 3: Structural Hamming Distances of graphs estimated using the ECA-algorithm with a first- and second-order local empirical independence test (the second being the standard one used above). Each of the panels 0, 1 and 2 indicates the number |V \ O| of latent variables. The lines represent the average SHD within each group.