SLIDE 1

Structure learning for CTBN’s

Blazej Miasojedow

Institute of Applied Mathematics and Mechanics, University of Warsaw

05 June 2020

¹ Based on joint works with Wojciech Niemiro (Warsaw/Torun), Wojciech Rejchel (Torun), and Maryia Shpak (Lublin)

Blazej Miasojedow (UW) 05 June 2020 1 / 24

SLIDE 2

Outline

1. CTBN
2. Structure learning
   - Full observations
   - Partial observations


SLIDE 4

1. CTBN
2. Structure learning
   - Full observations
   - Partial observations

SLIDE 5

Continuous time Bayesian networks

X(t) is a multivariate Markov jump process on the state space X = ∏_{v∈V} X_v, where:

- (V, E) is a directed graph, with cycles allowed, describing the dependence structure.
- X_v is the space of possible values at node v, assumed to be discrete.
- The intensity matrix Q is given by conditional intensities:

  Q(x, x′) = Q_v(x_{pa(v)}; x_v, x′_v)   if x_{−v} = x′_{−v} and x_v ≠ x′_v for some v,
  Q(x, x′) = 0                           if x and x′ differ at more than one coordinate,

  where pa(v) denotes the set of parents of node v in the graph (V, E).
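The generative dynamics above can be sketched in code. Below is a minimal Gillespie-style simulator for a binary CTBN: at each step only one node may change, at the rate given by its conditional intensity under the current parent configuration. The two-node chain and its rate values are illustrative assumptions, not taken from the slides.

```python
import random

def simulate_ctbn(rates, parents, x0, t_max, rng):
    """Gillespie-style simulation of a CTBN trajectory.

    rates[v] is a function (parent_config, a, a2) -> intensity Q_v(c; a, a2),
    parents[v] lists the parent nodes of v.  Returns the jump chain as a
    list of (time, state) pairs, ending exactly at t_max.
    """
    x = list(x0)
    t, path = 0.0, [(0.0, tuple(x0))]
    while True:
        # Candidate transitions: one node changes, the rest stay fixed.
        cand = []
        for v in range(len(x)):
            c = tuple(x[u] for u in parents[v])
            for a2 in (0, 1):
                if a2 != x[v]:
                    cand.append((v, a2, rates[v](c, x[v], a2)))
        total = sum(r for _, _, r in cand)
        t += rng.expovariate(total)          # exponential holding time
        if t >= t_max:
            path.append((t_max, tuple(x)))
            return path
        # Choose which transition fires, proportionally to its intensity.
        u, acc = rng.random() * total, 0.0
        for v, a2, r in cand:
            acc += r
            if u <= acc:
                x[v] = a2
                break
        path.append((t, tuple(x)))

# A two-node binary chain 0 -> 1: node 1 flips faster when node 0 is 1.
rng = random.Random(0)
rates = [lambda c, a, b: 1.0,                       # node 0 has no parents
         lambda c, a, b: 2.0 if c[0] == 1 else 0.5]  # node 1 depends on node 0
path = simulate_ctbn(rates, parents=[[], [0]], x0=(0, 0), t_max=5.0, rng=rng)
```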



SLIDE 7

Example (figure omitted)

SLIDE 8

Probability densities of CTBNs

The density can be expressed as a product of conditional densities:

p(X) = ν(x(0)) ∏_{v∈V} p(X_v | X_{pa(v)}),

with

p(X_v | X_{pa(v)}) = [ ∏_{c∈X_{pa(v)}} ∏_{a∈X_v} ∏_{a′∈X_v, a′≠a} Q_v(c; a, a′)^{n_v^T(c; a, a′)} ] · [ ∏_{c∈X_{pa(v)}} ∏_{a∈X_v} exp( −Q_v(c; a) t_v^T(c; a) ) ],

where:
- n_v^T(c; a, a′) is the number of jumps from a to a′ at node v that occurred while the parent configuration was c;
- t_v^T(c; a) is the length of time during which the state of node v was a and the configuration of the parents was c.
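The sufficient statistics n_v^T(c; a, a′) and t_v^T(c; a) appearing in the density are simple to compute from a fully observed trajectory. A sketch, assuming the trajectory is stored as a list of (time, state) pairs; the example trajectory is hypothetical:

```python
from collections import defaultdict

def sufficient_stats(path, parents, v):
    """Jump counts n_v(c; a, a') and occupation times t_v(c; a) for node v,
    from a trajectory given as [(t_0, x_0), ..., (T, x_T)]."""
    n = defaultdict(int)
    t = defaultdict(float)
    for (t0, x0), (t1, x1) in zip(path, path[1:]):
        c = tuple(x0[u] for u in parents[v])
        t[(c, x0[v])] += t1 - t0          # time spent in state x0[v] under config c
        if x1[v] != x0[v]:
            n[(c, x0[v], x1[v])] += 1     # node v jumped at time t1
    return dict(n), dict(t)

# Hypothetical trajectory of a 2-node binary process; node 1 has parent 0.
path = [(0.0, (0, 0)), (1.0, (1, 0)), (1.5, (1, 1)), (4.0, (1, 0)), (5.0, (1, 0))]
n, t = sufficient_stats(path, parents=[[], [0]], v=1)
# n counts node-1 jumps per parent configuration; the occupation times in t
# sum to the total observation window T = 5.
```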


SLIDE 9

1. CTBN
2. Structure learning
   - Full observations
   - Partial observations

SLIDE 10

Structure learning

Based on observations, we want to reconstruct the structure of the graph and then estimate the conditional intensity matrices. We consider two cases:

1. The full trajectory is observed.
2. The trajectory is observed only at fixed time points t_1^obs, ..., t_k^obs, with some noise.

SLIDE 11

Connections with standard Bayesian networks

- Bayesian networks: independent observations, but the graph must be acyclic.
- CTBNs: dependent observations (a Markov process), but no restrictions on the graph.
- The structure learning problem is easier to formulate for CTBNs, since no acyclicity restriction is required.
- The analysis of methods is more demanding for CTBNs: we have to deal with Markov jump processes.


SLIDE 15

Existing approaches

- Search-and-score strategies based on a full Bayesian model: Nodelman (2007); Acerbi et al. (2014).
- Mean-field approximation combined with variational inference: Linzner and Koeppl (2018).
- Estimating parameters for the full graph in a Bayesian setting and removing edges based on marginal posterior probabilities: Linzner et al. (2019).

SLIDE 16

Full observations

Idea:

1. Start with the full model.
2. Express log Q_v(c; a, a′) = β^⊤ Z(c), where β is a vector of unknown parameters and Z(c) is a vector of dummy variables encoding the configuration of all nodes except v.
3. Estimate a sparse β by the Lasso,

   argmin_β { −ℓ(β) + λ‖β‖_1 },

   where ℓ is the log-likelihood given by

   ℓ(β) = ∑_{w∈V} ∑_{c∈X_{−w}} ∑_{s∈X_w} ∑_{s′∈X_w, s′≠s} [ n_w(c; s, s′) β^{w⊤}_{s,s′} Z_w(c) − t_w(c; s) exp( β^{w⊤}_{s,s′} Z_w(c) ) ].   (1)
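A minimal sketch of the node-wise Lasso (1) for a single pair (s, s′), using plain proximal gradient descent (ISTA) on the negative log-likelihood plus the ℓ1 penalty. The toy data, step size, and iteration count are illustrative assumptions, not the authors' implementation:

```python
import math

def soft(x, a):
    """Soft-thresholding, the proximal operator of a * |.|_1."""
    return math.copysign(max(abs(x) - a, 0.0), x)

def lasso_ctbn(data, lam, dim, step=0.05, iters=2000):
    """ISTA sketch of the node-wise Lasso (1) for one pair (s, s').

    data is a list of (Z(c), n, t) triples: the dummy vector for parent
    configuration c, the jump count n_w(c; s, s') and occupation time t_w(c; s).
    Minimises  sum_c [ -n * b.Z + t * exp(b.Z) ] + lam * |b|_1  over b.
    """
    b = [0.0] * dim
    for _ in range(iters):
        grad = [0.0] * dim
        for z, n, t in data:
            w = math.exp(sum(bi * zi for bi, zi in zip(b, z)))
            for j in range(dim):
                grad[j] += (-n + t * w) * z[j]   # gradient of the negative log-lik
        b = [soft(bj - step * gj, step * lam) for bj, gj in zip(b, grad)]
    return b

# Toy data for one pair (s, s'): Z(c) = [1, I(parent = 1)].
# The jump rate is visibly higher when the second dummy is 1.
data = [([1.0, 0.0], 2, 4.0),   # config c = 0: 2 jumps over 4 time units
        ([1.0, 1.0], 8, 4.0)]   # config c = 1: 8 jumps over 4 time units
b = lasso_ctbn(data, lam=0.1, dim=2)
# b[1] > 0 recovers the positive effect of the parent on the jump rate.
```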


SLIDE 17

Example

We consider a binary CTBN with three nodes A, B and C. For the node A we define the function Z_A as

Z_A(b, c) = [1, I(b = 1), I(c = 1)]^⊤,

and β is defined as

β = ( β^A_{0,1}, β^A_{1,0}, β^B_{0,1}, β^B_{1,0}, β^C_{0,1}, β^C_{1,0} )^⊤.

With a slight abuse of notation, the vector β^A_{0,1} is given as

β^A_{0,1} = ( β^A_{0,1}(1), β^A_{0,1}(B), β^A_{0,1}(C) )^⊤.

SLIDE 18

Connection between parametrization and structure

In our setting, identifying the edges of the graph is equivalent to finding the non-zero elements of β:

β^w_{0,1}(u) ≠ 0 or β^w_{1,0}(u) ≠ 0  ⇔  the edge u → w exists.
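Reading the graph off a fitted coefficient vector is then a one-liner. A sketch with a hypothetical dictionary layout for β̂ (both the layout and the numbers are assumptions for illustration):

```python
def recover_edges(beta, eps=1e-8):
    """Read the graph off the coefficients: u -> w is an edge iff some
    coefficient of beta linking u to node w is (numerically) non-zero.

    beta maps (w, s, s') to a dict {u: coefficient} over candidate parents u.
    """
    edges = set()
    for (w, _s, _s2), coefs in beta.items():
        for u, val in coefs.items():
            if abs(val) > eps:
                edges.add((u, w))
    return edges

# Hypothetical fitted coefficients for part of a 3-node binary CTBN (A, B, C).
beta_hat = {
    ("A", 0, 1): {"B": 1.3, "C": 0.0},
    ("A", 1, 0): {"B": 0.0, "C": 0.0},
    ("B", 0, 1): {"A": 0.0, "C": -0.7},
    ("B", 1, 0): {"A": 0.0, "C": 0.0},
}
edges = recover_edges(beta_hat)   # {("B", "A"), ("C", "B")}
```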


SLIDE 19

Notation and assumptions

- d_0 = |supp(β)|, S = supp(β), C(ξ, S) = {θ : |θ_{S^c}|_1 ≤ ξ |θ_S|_1} for some ξ > 1, β_min = min_{k∈S} |β_k|.
- The factor

  F(ξ) = inf_{0≠θ∈C(ξ,S)} [ ∑_{w∈V} ∑_{s′≠s} ∑_{c_{S_w}∈X_{S_w}} exp( β^{w⊤}_{s,s′} Z_w(c_{S_w}, 0) ) ( θ^{w⊤}_{s,s′} Z_w(c_{S_w}, 0) )² ] / ( |θ_S|_1 |θ|_∞ ).   (2)

- We assume that F(ξ) > 0 for some ξ > 1.
- ∆ = max_{s≠s′} Q(s, s′).

SLIDE 20

Main result

Theorem 1 (Shpak, Rejchel, BM 2020)

Let ε ∈ (0, 1) and ξ > 1 be arbitrary. Suppose that F(ξ) defined in (2) is positive and

T > 36 [ (max_{w∈V} |S_w| + 1) log 2 + log( d‖ν‖²/ε ) ] / [ min_{w∈V, s∈X_w, c_{S_w}∈X_{S_w}} π²(s, c_{S_w}, 0) ρ_1 ].   (3)

We also assume that T∆ ≥ 2 and

2 (ξ + 1)/(ξ − 1) · log(K/ε)/√T ≤ λ ≤ 2 ζ F(ξ) / ( e (ξ + 1) |S| ),   (4)

where K = 2(2 + e²) d (d − 1) and ζ = min_{w∈V, s∈X_w, c_{S_w}∈X_{S_w}} π(s, c_{S_w}, 0)/2. Then with probability at least 1 − 2ε we have

|β̂ − β|_∞ ≤ 2 e ξ λ / ( (ξ + 1) ζ F(ξ) ).   (5)

SLIDE 21

Consistency of model selection

Corollary 2

Let R denote the right-hand side of inequality (5). Consider the thresholded Lasso estimator with set of non-zero coordinates Ŝ: the set Ŝ contains only those coefficients of the Lasso estimator that are larger in absolute value than a pre-specified threshold δ. If β_min/2 > δ ≥ R, then

P( Ŝ = S ) ≥ 1 − 2ε.
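The thresholding step of Corollary 2 is straightforward to state in code; a sketch with hypothetical coefficient values:

```python
def threshold(beta_hat, delta):
    """Thresholded Lasso: keep only the coordinates with |beta_hat_k| > delta."""
    return {k for k, v in beta_hat.items() if abs(v) > delta}

# Hypothetical Lasso output: two strong signals, two near-zero coordinates.
beta_hat = {"b1": 0.9, "b2": 0.04, "b3": -0.6, "b4": -0.02}
S_hat = threshold(beta_hat, delta=0.1)   # {"b1", "b3"}
```

Per the corollary, any δ in the window [R, β_min/2) recovers S with probability at least 1 − 2ε.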


SLIDE 22

Remarks

- Ignoring constants, ∆, and the parameters of the MJP (ν, π, ρ_1, ζ, etc.) in the assumptions, the estimation error is small whenever

  T ≥ log²(d/ε) |S|² / F²(ξ).

- Conditions (3) and (4) also depend on the parameters of the MJP, precisely on the stationary distribution π and the spectral gap ρ_1, which in general decrease exponentially with d. However, in some specific cases it can be proved that they decrease only polynomially.

SLIDE 23

CIF vs. F

The cone invertibility factor (CIF) is defined as

F̄(ξ) = inf_{0≠θ∈C(ξ,S)} θ^⊤ ∇²ℓ(β) θ / ( |θ_S|_1 |θ|_∞ ),

and

θ^⊤ ∇²ℓ(β) θ = (1/T) ∑_{w∈V} ∑_{c∈X_{−w}} ∑_{s′≠s} t_w(c; s) ( θ^{w⊤}_{s,s′} Z_w(c) )² exp( β^{w⊤}_{s,s′} Z_w(c) ).   (6)

- The CIF implies "strong convexity" restricted to the cone.
- We use the classical strategy of proof, in which a positive CIF is required.
- In our case the CIF contains a sum over exponentially many random variables, so we introduce F to overcome this difficulty.

SLIDE 24

Lower bounds on F

Lemma 3

For every ξ > 1 we have, with high probability,

F̄(ξ) ≥ ζ F(ξ) ≥ ζ / ( ξ A_β ),   (7)

where

A_β = ∑_{w∈V} ∑_{s′≠s} ∑_{j : β^w_{s,s′}(j) ≠ 0} exp( −β^w_{s,s′}(j) ).   (8)

SLIDE 25

Sketch of proof

- We use the classical technique, in which one needs to bound ‖∇ℓ(β)‖_∞ and F̄(ξ) with high probability.
- To bound ‖∇ℓ(β)‖_∞ we derive a new concentration inequality for occupation times of MJPs.
- To bound F̄(ξ) we use the Lezaud inequality.

SLIDE 26

Details of implementation

1. Compute the Lasso estimator on a grid of λ values (estimators for different nodes can be computed in parallel):

   β̂^w_{s,s′}(i) = argmin_{θ^w_{s,s′}} { ℓ^w_{s,s′}(θ^w_{s,s′}) + λ_i |θ^w_{s,s′}|_1 }.

2. Choose λ by BIC:

   i* = argmin_{1≤i≤100} { n ℓ^w_{s,s′}( β̂^w_{s,s′}(i) ) + log(n) ‖β̂^w_{s,s′}(i)‖_0 }.

3. Choose the threshold δ by GIC:

   δ* = argmin_{δ∈Ω} { n ℓ^w_{s,s′}( β̂^{w,δ}_{s,s′} ) + log( 2d(d − 1) ) ‖β̂^{w,δ}_{s,s′}‖_0 }.
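Steps 2 and 3 above are both instances of minimising n · loss + penalty · support size over a grid, so one helper covers BIC (penalty log n) and GIC (penalty log(2d(d − 1))). The losses and support sizes below are made-up illustrations:

```python
import math

def ic_select(losses, supports, penalty, n):
    """Pick the grid index minimising the information criterion
    n * loss_i + penalty * ||beta_i||_0."""
    scores = [n * l + penalty * s for l, s in zip(losses, supports)]
    return min(range(len(scores)), key=scores.__getitem__)

n, d = 500, 6
# Hypothetical fitted losses along a 5-point lambda grid, with shrinking supports.
losses = [0.40, 0.41, 0.43, 0.60, 0.90]
supports = [10, 6, 3, 1, 0]
i_bic = ic_select(losses, supports, penalty=math.log(n), n=n)                # BIC: choose lambda
i_gic = ic_select(losses, supports, penalty=math.log(2 * d * (d - 1)), n=n)  # GIC: choose threshold
```

Both criteria trade fit against support size; with these toy numbers they agree on the 3-coefficient model.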


SLIDE 27

Chain example (figure omitted)

SLIDE 28

Partial observations

For partial observations we can define the Lasso estimator analogously, but with a likelihood of the form

ℓ(β) = −log ∫ g(y|x) p_β(x) dx,

where g is the distribution of the observations y given the hidden trajectory x, and p_β is the density of the hidden trajectory of the CTBN.

- To solve the Lasso problem we can use a generalized EM algorithm.
- The expectation step can be done via numerical integration (Nodelman (2007), Linzner and Koeppl (2018), Linzner et al. (2019)) or by an MCMC algorithm (Rao and Teh (2012)).
- The theoretical analysis of the estimator would be much more challenging, because ℓ is no longer convex.
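The generalized EM loop can be sketched schematically. Here `sample_stats` is a stand-in for the E-step (in practice a numerical-integration or Rao-Teh-type MCMC estimate of the expected sufficient statistics given the observations), and the M-step for a single unpenalized rate is the closed-form update q ← E[n]/E[t], the maximiser of E[n] log q − q E[t]. The constant stand-in sampler below is a toy assumption so the loop runs:

```python
def mcem_rate(sample_stats, q0, iters=20):
    """Schematic Monte Carlo EM for one conditional intensity.

    sample_stats(q) stands in for the E-step: it should draw hidden
    trajectories given the observations and return Monte Carlo averages
    (E[n], E[t]) of the jump count and occupation time under the current q.
    """
    q = q0
    for _ in range(iters):
        n_bar, t_bar = sample_stats(q)   # E-step (sampling-based in practice)
        q = n_bar / t_bar                # M-step: closed form for a single rate
    return q

# Toy stand-in E-step: pretend the posterior expectations do not depend on q
# and equal (6 expected jumps, 3.0 expected time units).
q_hat = mcem_rate(lambda q: (6.0, 3.0), q0=1.0)
```

With the Lasso penalty, the M-step is instead a penalized fit as in (1), with the observed statistics replaced by their posterior expectations.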



SLIDE 32

Thank you!