Local Independence Tests for Point Processes
Learning causality in event models
Nikolaj Thams, University of Copenhagen
November 21st, 2019
Time to Event Data and Machine Learning Workshop
Joint work with Niels Richard Hansen
Outline: Hawkes processes · Causality · Local independence test · Experimental results · Conclusion
Learning causality in event models?
[Figure: events of four types (a, b, c, h) marked along a time axis up to time T]
Point processes
A point process with marks V = {1, . . . , d} is a collection
  N^k = Σ_i δ_{T^k_i},
where T^k_i is the i'th event of type k. This defines counting processes
  t ↦ N^k_t := N^k(0, t].
[Figure: a counting process N_t jumping at events T1, T2, T3]
If the compensator A^k_t of N^k_t equals ∫_0^t λ^k_s ds for some λ^k, then λ^k is the intensity of N^k. Observe that
  E[N^k_t] = ∫_0^t E[λ^k_s] ds.
Famous examples: the Poisson process (λ_t constant) and the Hawkes process (next slide).
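The identity E[N_t] = ∫_0^t E[λ_s] ds can be sanity-checked in a few lines of simulation. This is a minimal sketch under my own illustrative names and parameters, for a homogeneous Poisson process where λ_s ≡ λ and hence E[N_t] = λt:

```python
import numpy as np

def counting_process(event_times, t):
    """N_t = number of events in (0, t]."""
    event_times = np.asarray(event_times)
    return int(np.sum((event_times > 0) & (event_times <= t)))

# Homogeneous Poisson process with rate lam: E[N_t] = lam * t.
rng = np.random.default_rng(0)
lam, horizon, reps = 2.0, 10.0, 2000
counts = []
for _ in range(reps):
    gaps = rng.exponential(1.0 / lam, size=50)  # i.i.d. inter-arrival times
    times = np.cumsum(gaps)
    counts.append(counting_process(times, horizon))
print(np.mean(counts))  # close to lam * horizon = 20
```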
Hawkes processes
The process with intensity
  λ^k_t = β^k_0 + Σ_{v∈V} ∫_{−∞}^{t−} g_{vk}(t − s) N^v(ds) = β^k_0 + Σ_{v∈V} Σ_{T^v_i < t} g_{vk}(t − T^v_i)
is called the (linear) Hawkes process, with kernels g_{vk} given by integrable functions, e.g. g_{vk}(x) = β^{vk}_1 e^{−β^{vk}_2 x}.
This motivates using graphs for summarizing dependencies:
[Figure: simulated intensities of N^1 and N^2 over time, and the summary graph on nodes 1 and 2]
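For concreteness, the exponential-kernel intensity above can be evaluated directly. This is a hedged sketch (function name and parameter values are illustrative, not from the talk) of a univariate case:

```python
import numpy as np

def hawkes_intensity(t, events, beta0, beta1, beta2):
    """lambda_t = beta0 + sum over past events s < t of g(t - s),
    with exponential kernel g(x) = beta1 * exp(-beta2 * x)."""
    past = np.asarray(events, dtype=float)
    past = past[past < t]
    return float(beta0 + np.sum(beta1 * np.exp(-beta2 * (t - past))))

events = [1.0, 2.0, 2.5]
print(hawkes_intensity(0.5, events, beta0=0.3, beta1=0.5, beta2=1.0))  # 0.3 (no past events yet)
print(hawkes_intensity(3.0, events, beta0=0.3, beta1=0.5, beta2=1.0))  # excited above the baseline
```

Each event pushes the intensity up by beta1, and the excitation decays at rate beta2, which is what produces the spiky intensity paths in the figure.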
Causal inference
Static system: Structural Causal Models (SCMs) consist of functional assignments, summarized by parents in a graph:
  X_i = f_i(X_{pa_i}, ε_i), i ∈ V
[Figure: a graph on X1, X2, X3, before and after the intervention X3 := c]
Essential assumption: the model also describes the system under interventions X_i := c.
A graph G satisfies, in conjunction with a separation criterion ⊥_G:
  Global Markov property: A ⊥_G B | C ⇒ X_A ⫫ X_B | X_C
  Faithfulness: X_A ⫫ X_B | X_C ⇒ A ⊥_G B | C
The global Markov property and faithfulness are the motivation for developing conditional independence tests in causality. See (Peters et al. 2017) for details.
Causal inference: Dynamical system
Causal ideas have been generalized to the dynamical setting, e.g. (Didelez 2008; Mogensen, Malinsky, et al. 2018; Mogensen and Hansen 2018).
[Figure: the processes X^1, X^2, X^3 unrolled over times t_1, t_2, t_3, . . . , and the corresponding summary graph on X1, X2, X3]
Local independence
Let N be a marked point process. For subsets A, B, C ⊆ V, we say that B is locally independent of A given C if for every b ∈ B,
  λ^{b,A∪C}_t = E[λ^b_t | F^{A∪C}_t]
has a version that is F^C_t-measurable, and we write A ̸→ B | C. Heuristically, the intensity of b, when observing A ∪ C, depends only on events of C.
Under faithfulness assumptions, there exist algorithms for learning the causal graph (Meek 2014; Mogensen and Hansen 2018) by removing the edge a → b if a ̸→ b | C for some C. In practice, this requires an empirical test for independence!
Local independence test
We want to test: H0 : j ̸→ k | C. Equivalently, to test whether λ^{k,C}_t is a version of λ^{k,C∪{j}}_t. We propose to fit:
  λ^{k,C∪{j}}_t = β^k_0 + ∫^t g_{jk}(t − s) N^j(ds) + λ^{k,C}_t
Then the test of H0 : g_{jk} = 0 will have the right level if we estimate the true λ^{k,C}.
Problem: if there are latent variables, the marginalized model may not be a Hawkes process. So how do we estimate λ^C generally, to retain level?
[Figure: graph with observed nodes j, k, c and a latent node h]
Volterra approximations
To develop a non-parametric fit for λ^C, we prove the following theorem, resembling Volterra series for continuous systems.
Theorem. Suppose that N is a stationary point process. There exists a sequence of functions h^α_N such that, letting
  λ^N_t = h^0_N + Σ_{n=1}^N Σ_{|α|=n} ∫_{−∞}^t · · · ∫_{−∞}^t h^α_N(t − s_1, . . . , t − s_n) N^{α_1}(ds_1) · · · N^{α_n}(ds_n),
we have λ^N_t → λ^C_t in probability as N → ∞.
Approximating intensity
λ^C approximations:
A1: Approximate by 2nd-order iterated integrals.
A2: Approximate kernels using tensor splines:
  h^α(x_1, . . . , x_n) ≈ Σ_{j_1=1}^d · · · Σ_{j_n=1}^d β^α_{j_1,...,j_n} b_{j_1}(x_1) · · · b_{j_n}(x_n)
In vector notation:
  λ^C_t(β) = β_0 + Σ_{v∈C} ∫_{−∞}^{t−} (β^v)^T Φ_1(t − s) N^v(ds) + Σ_{v_1,v_2∈C, v_2≥v_1} ∫_{−∞}^{t−} ∫_{−∞}^{t−} (β^{v_1 v_2})^T Φ_2(t − s_1, t − s_2) N^{(v_1,v_2)}(ds_1, ds_2) =: β_C^T x^C_t
Similarly for g_{jk}, such that
  λ^{k,C∪{j}}_t = β^k_0 + ∫^t g_{jk}(t − s) N^j(ds) + λ^{k,C}_t = β_j^T x^j_t + β_C^T x^C_t =: β^T x_t
Maximum Likelihood Estimation
The likelihood is concave for linear intensities!
  log L_T(β) = ∫_0^T log(β^T x_t) N^k(dt) − β^T ∫_0^T x_t dt
We penalize with a roughness penalty:
  max_β log L_T(β) − κ_0 β^T Ω β  s.t. Xβ ≥ 0
The distribution of the maximum likelihood estimate is approximately normal:
  β̂ ≈ N((I + 2κ_0 Ĵ_T^{−1} Ω) β_0, Ĵ_T^{−1} K̂_T Ĵ_T^{−1})
with K̂_T = ∫_0^T (x_t x_t^T)/(β̂^T x_t) dt and Ĵ_T = K̂_T − 2κ_0 Ω.
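The penalized fit can be sketched on a discretized toy problem. This is my own illustration, not the talk's implementation: a constant intensity, so x_t = (1), the integrals replaced by sums, and the positivity constraint handled via a simple bound:

```python
import numpy as np
from scipy.optimize import minimize

def neg_penalized_loglik(beta, X_events, X_grid, dt, kappa0, Omega):
    """-(log L_T(beta) - kappa0 * beta' Omega beta), with the integrals
    discretized: sum of log(beta'x_t) over events minus a Riemann sum
    of beta'x_t over a time grid."""
    lam_events = X_events @ beta
    if np.any(lam_events <= 0):
        return np.inf
    loglik = np.sum(np.log(lam_events)) - np.sum(X_grid @ beta) * dt
    return -(loglik - kappa0 * beta @ Omega @ beta)

rng = np.random.default_rng(1)
# Toy design: constant-rate process observed on [0, T], x_t = (1,).
T, true_rate = 100.0, 2.0
n_events = rng.poisson(true_rate * T)
X_events = np.ones((n_events, 1))
grid = np.arange(0.0, T, 0.1)
X_grid = np.ones((len(grid), 1))
Omega = np.eye(1)
res = minimize(neg_penalized_loglik, x0=np.array([1.0]),
               args=(X_events, X_grid, 0.1, 0.01, Omega),
               bounds=[(1e-8, None)])
print(res.x[0])   # close to true_rate = 2.0
```

Because the negative log-likelihood is convex in β and the penalty is quadratic, a generic local optimizer finds the global penalized maximum.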
Local Independence Test
Given the distribution of β = (β_j, β_C), we can test the hypothesis H0 : j ̸→ k | C. How do we test Φ^T β_j ≡ 0?
Evaluating the spline basis on a grid G of M points, the vector of function values over the grid is Φ(G)^T β_j. If β̂_j ∼ N(μ_j, Σ_j), then the Wald test statistic for the null hypothesis Φ(G)^T μ_j = 0 is:
  T_α = β̂_j^T Φ(G) [Φ(G)^T Σ_j Φ(G)]^{−1} Φ(G)^T β̂_j
This is χ²(M)-distributed, and we can test for significance of components!
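The Wald statistic above is straightforward to compute. A hedged sketch with made-up numbers (the grid matrix Φ(G) and covariance Σ below are illustrative):

```python
import numpy as np
from scipy.stats import chi2

def wald_test(beta_hat, Sigma, Phi_G):
    """Wald statistic for H0: Phi_G' beta = 0 (the kernel vanishes on
    the grid G), chi^2-distributed with M = #grid points df under H0."""
    f = Phi_G.T @ beta_hat                 # function values on the grid
    V = Phi_G.T @ Sigma @ Phi_G            # their covariance
    stat = float(f @ np.linalg.solve(V, f))
    pval = float(chi2.sf(stat, df=Phi_G.shape[1]))
    return stat, pval

# Hypothetical example: 3 spline coefficients evaluated on a 2-point grid.
Phi_G = np.array([[1.0, 0.5], [0.5, 1.0], [0.2, 0.3]])
Sigma = 0.1 * np.eye(3)
stat0, p0 = wald_test(np.zeros(3), Sigma, Phi_G)          # H0 true
stat1, p1 = wald_test(np.array([2.0, 2.0, 2.0]), Sigma, Phi_G)
print(p0, p1)   # p0 = 1.0; p1 near 0, so H0 is rejected
```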
Summary of test
We summarize our proposed test. To test j ̸→ k | C:
1. Fit λ^{k,C}(β) within the model class by penalized MLE.
2. Perform a Wald test of the hypothesis that the kernel g_{jk} vanishes on a grid.
Experiment 1: Testing various structures
In each of the following structures, we test a ̸→ b | {b} ∪ C:
[Figure: the test structures L1, L2, L3 and P1, P2, P3, built from nodes a, b, c and a latent h]
We obtain acceptance rates:
[Figure: H0 acceptance rates (accepted vs. rejected) per structure L1, L2, L3, P1, P2, P3]
Causal discovery
We evaluate the performance of the test within the CA algorithm, which estimates the causal graph.
[Figure: observed event data for a, b, c, d; tests such as a ̸→ b | {b, c, d} successively remove edges from the fully connected graph on a, b, c, d]
Experiment 2: Causal discovery
We simulate random graphs, simulate a dataset from each graph, recover the graph from the dataset, and measure the Structural Hamming Distance (SHD) to the true graph.
SHD between G_1 and G_2: the minimum number of edge flips, additions or removals needed to turn G_1 into G_2.
[Figure: SHD to the true graph versus the dimension of the graph, for the LI test and a baseline]
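The SHD is easy to compute on adjacency matrices. A minimal sketch under the assumption that the graphs have no 2-cycles, so a pair of opposite mismatches is exactly one flip (names are illustrative):

```python
import numpy as np

def shd(A, B):
    """Structural Hamming Distance between two directed graphs given as
    adjacency matrices: one count per edge addition or removal, and one
    per flip (an edge present in both graphs but reversed)."""
    A, B = np.asarray(A, bool), np.asarray(B, bool)
    diff = A != B
    # A flipped edge mismatches in both (i, j) and (j, i); count it once.
    flips = diff & diff.T
    np.fill_diagonal(flips, False)
    return int(diff.sum() - flips.sum() // 2)

G1 = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])  # a -> b, b -> c
G2 = np.array([[0, 0, 1], [1, 0, 1], [0, 0, 0]])  # b -> a, b -> c, a -> c
print(shd(G1, G2))   # 2: one flip (a, b) plus one added edge a -> c
```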
Conclusion
Causal discovery in event models requires local independence tests!
The test does not assume that the (marginalized) process is Hawkes.
The Volterra expansion permits a general functional form of intensities.
The approach is supported by experimental results.
References
Daley, Daryl J and David Vere-Jones (2007). An Introduction to the Theory of Point Processes: Volume II: General Theory and Structure. Springer Science & Business Media.
Didelez, Vanessa (2008). "Graphical models for marked point processes based on local independence". In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70.1, pp. 245–264.
Meek, Christopher (2014). "Toward learning graphical and causal process models". In: Proceedings of the UAI 2014 Conference on Causal Inference: Learning and Prediction, Volume 1274. CEUR-WS.org, pp. 43–48.
Mogensen, Søren Wengel and Niels Richard Hansen (2018). "Markov equivalence of marginalized local independence graphs". In: arXiv preprint arXiv:1802.10163. To appear in Ann. Statist.
Mogensen, Søren Wengel, Daniel Malinsky, and Niels Richard Hansen (2018). "Causal learning for partially observed stochastic dynamical systems". In: 34th Conference on Uncertainty in Artificial Intelligence. Association For Uncertainty in Artificial Intelligence (AUAI).
Peters, Jonas, Dominik Janzing, and Bernhard Schölkopf (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press.
Wood, Simon N (2012). "On p-values for smooth components of an extended generalized additive model". In: Biometrika 100.1, pp. 221–228.
Volterra: Sketch of proof I
First we show the representation at time 0, i.e. that
  λ^N_0 → λ_0 in probability, and 1_{|λ_0|<N} λ_0 ∈ L¹(F),
where F = σ(N_t, t < 0), via martingale convergence.
Using an additive decomposition,
  Σ_{n=1}^N β_n ∫_{[τ,0]} f(t_1) 1_{D_n} dN(t_n) → λ_0 1_{N([τ,0])=1} almost surely as N → ∞.
Volterra: Sketch of proof II
Integrals over the marked space decompose over mark sequences α:
  ∫_{((−∞,0]×V)^n} h_n(t_1, v_1, . . . , t_n, v_n) N(dt_n × dv_n) = Σ_{|α|=n} ∫_{(−∞,0]^n} h^α(t_1, . . . , t_n) N^α(dt_n)
By stationarity, the representation at time 0 shifts to any time π:
  λ(π, {N_s}_{s<π}) = λ(0, {N^π_s}_{s<0})
Local independence graphs
Local independence graph: for a point process with coordinates V = {1, . . . , d}, define the local independence graph G = (V, E) by
  E = {(a, b) | a → b | V\{a}}
[Example: a local independence graph on nodes a, b, c]
Graphs and µ-separation
[Figure: a graph on a, c1, c2, b, and a walk through a collider]
µ-connection and separation: for G = (V, E), let a, b ∈ V and C ⊆ V. A µ-connecting walk p from a to b given C is a walk from a to b such that a ∉ C, every collider on p is in C, every non-collider is not in C, and the final edge has a head at b.
If no walks from a to b are µ-connecting given C, then a and b are µ-separated and we write a ⊥_µ b | C.
Global Markov property
The following concepts relate local independence to a graph G:
  Global Markov property: A ⊥_µ B | C implies A ̸→ B | C.
  Faithfulness: A ̸→ B | C implies A ⊥_µ B | C.
The global Markov property makes the local independence graph "relevant" for understanding the underlying point process.
Recovering the graph using independence tests: assuming faithfulness and the global Markov property, (Meek 2014) proposes an algorithm which is guaranteed to return the true local independence graph, essentially by testing a ̸→ b | C for all a, b and sets C of increasing size.
Backup: The CA algorithm
Algorithm 1 (Causal Analysis algorithm)
  Initialize G = (V, E_CA) as a fully connected graph
  for v ∈ V:
    n = 0
    while n < |pa(v)|:
      for v′ ∈ pa(v):
        for C ⊆ pa(v)\{v′} with |C| = n:
          if v′ ̸→ v | C: remove (v′, v) from E_CA
      n = n + 1
  return G = (V, E_CA)
In short: for each pair (v′, v), remove the edge v′ → v if there exists a set C such that v′ ̸→ v | C.
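The loop structure above can be sketched in a few lines. This is a toy illustration with an oracle independence test standing in for the empirical test; all names (ca_algorithm, indep, oracle) are mine:

```python
from itertools import combinations

def ca_algorithm(V, indep):
    """Sketch of the CA algorithm. indep(vp, v, C) should return True
    iff vp is locally independent of v given C (vp -/-> v | C)."""
    pa = {v: set(V) for v in V}        # fully connected, incl. self-edges
    for v in V:
        n = 0
        while n < len(pa[v]):
            for vp in sorted(pa[v]):   # iterate over a snapshot
                for C in combinations(sorted(pa[v] - {vp}), n):
                    if indep(vp, v, set(C)):
                        pa[v].discard(vp)   # remove the edge vp -> v
                        break
            n += 1
    return {(vp, v) for v in V for vp in pa[v]}

# Oracle for the chain a -> b -> c (every node also self-excites).
true_edges = {("a", "a"), ("b", "b"), ("c", "c"), ("a", "b"), ("b", "c")}
def oracle(vp, v, C):
    # Placeholder oracle: independent exactly when no edge, for any C.
    return (vp, v) not in true_edges

edges = ca_algorithm(["a", "b", "c"], oracle)
print(edges == true_edges)
```

With a faithful oracle the algorithm recovers the true edge set; in practice the oracle is replaced by the local independence test of the main slides.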
Backup: P1 and P2
Definition: . . . ∈ C implies (v′, v) ∉ E . . . that v′ ⊥ v | C.
The CA algorithm assumes both P1 and P2. d-separation satisfies P1, and δ- and µ-separation satisfy P1 and P2. We show that for ⊥ satisfying P1 and P2, two graphs have the same separations exactly if they are equal.
Backup: Example of local independence
Example: 3 children (a, b, c) throwing a ball, a →^{exp(1)} b →^{exp(1)} c →^{exp(1)} a → · · ·. N^v counts the number of throws by child v.
  λ^a_t = 1_{N^a_t = N^c_t}    λ^b_t = 1_{N^b_t < N^a_t}    λ^c_t = 1_{N^c_t < N^b_t}
We find b ̸→ a | {a, c} and a → b | {b, c}, because:
  λ^{a,{a,b,c}}_t = E[λ^a_t | F^{{a,b,c}}_t] = 1_{N^a_t = N^c_t} ∈ F^{{a,c}}_t
  λ^{b,{a,b,c}}_t = E[λ^b_t | F^{{a,b,c}}_t] = 1_{N^b_t < N^a_t} ∉ F^{{b,c}}_t
Also a → a | {b, c}, because λ^{a,{a,b,c}}_t = 1_{N^a_t = N^c_t} ∉ F^{{b,c}}_t.
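The example's indicator intensities can be coded up directly to see the ball circulate. A hedged sketch (the function name and encoding are mine):

```python
def intensities(Na, Nb, Nc):
    """Intensities of the ball game a -> b -> c -> a, as functions of
    the throw counts: a child can throw only while holding the ball."""
    lam_a = 1.0 if Na == Nc else 0.0   # a throws once c has caught up
    lam_b = 1.0 if Nb < Na else 0.0    # b throws after receiving from a
    lam_c = 1.0 if Nc < Nb else 0.0    # c throws after receiving from b
    return lam_a, lam_b, lam_c

# At the start (no throws yet) only a can throw:
print(intensities(0, 0, 0))   # (1.0, 0.0, 0.0)
# After a throws once, only b holds the ball:
print(intensities(1, 0, 0))   # (0.0, 1.0, 0.0)
```

At every count configuration exactly one intensity is positive, matching the heuristic that a's intensity is determined by N^a and N^c alone.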
Backup: Runtime
[Figure: run time (s) versus point count, for first- and second-order tests in structures S1, S3 and S4i]
Figure 1: Runtime of 300 invocations of the local empirical independence test. a ̸→̂_λ b | b, C was tested 100 times in each of the structures S1, S3 and S4i.
Backup: Tuning κ0
[Figure: boxplots of test p-values against the roughness-penalty scale κ0 (10^{−4} to 10^4) for structures S1, S2, S3, S4i–S4iv]
Figure 2: Boxplots of p-values from the 7 structures. From each structure, 100 Hawkes processes were simulated and the local empirical independence test was run, each with the roughness penalty at various levels of κ0. Each simulation produced a p-value, which is plotted. The red dotted line shows the 5% level for the test of a ̸→ b | b, C. The dark-green line shows the fraction of the simulated p-values falling below the 5% level.
Backup: Latent experiment
[Figure: SHD of estimated graphs versus number of observed nodes, with 0, 1 and 2 latent variables]
Figure 3: Structural Hamming Distances of graphs estimated using the CA algorithm with a first- and second-order local empirical independence test (the second being the standard one, used above). Each of the boxes 0, 1 and 2 indicates the number |V\O| of latent variables. The lines represent the average SHD within each group.