Optimal Information Passing: How much vs. How fast - Abbas Kazemipour (PowerPoint PPT Presentation)



slide-1
SLIDE 1

Optimal Information Passing: How much vs. How fast

Abbas Kazemipour

MAST Group Meeting, University of Maryland, College Park, kaazemi@umd.edu

March 24, 2016

Abbas Kazemipour (UMD) FMMC, SLEM March 24, 2016 1 / 20

slide-2
SLIDE 2

Overview

1 Introduction

Discrete Hawkes Process as a Markov Chain

2 Part 2: Stationary Distributions


slide-3
SLIDE 3

Discrete Hawkes Process as a Markov Chain

1 Discrete Hawkes Process:

x_k = Ber(φ(θ^T x_{k−p}^{k−1})),   (1)

2 The history components x_{k−p}^{k−1}, a binary vector of length p, form a Markov chain.

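Model (1) can be simulated directly; a minimal sketch in Python, where the logistic link φ and the parameter values are illustrative assumptions rather than the talk's actual choices:

```python
import math
import random

def simulate_hawkes(theta, mu, n, seed=0):
    """Simulate a discrete Hawkes process: x_k ~ Ber(phi(mu + theta . history)).
    theta has length p; the Markov state is the binary history x_{k-p}^{k-1}."""
    rng = random.Random(seed)
    p = len(theta)
    phi = lambda u: 1.0 / (1.0 + math.exp(-u))   # logistic link (assumed)
    history = [0] * p                            # initial history, all silent
    spikes = []
    for _ in range(n):
        rate = phi(mu + sum(t * x for t, x in zip(theta, history)))
        x = 1 if rng.random() < rate else 0
        spikes.append(x)
        history = history[1:] + [x]              # shift the Markov state
    return spikes

train = simulate_hawkes([0.5, 0.2, 0.1], mu=-1.0, n=500)
```

Each run of this sketch is exactly one walk across the 2^p history states.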


slide-5
SLIDE 5

Simulation: p = 100, n = 500, s = 3 and γ_n = 0.1.

1 Each spike train under this model corresponds to a walk across the states.

2 The corresponding likelihood is the product of the weights of the edges visited along the walk.

3 Figure: State Space for p = 3



slide-8
SLIDE 8

Introduction

1 We observe n consecutive snapshots of length p (a total of n + p − 1 samples): {x_k}_{k=−p+1}^{n}

2 x_1^n can be approximated by a sequence of Bernoulli random variables with rates λ_1^n

What is a good optimization problem for estimating θ? Answer: ℓ1-regularized ML.
How does a suitable n compare to p, s order-wise? Answer: n = O(p^{2/3}).
How does such an estimator perform compared to traditional estimation methods? Answer: much better! But why?



slide-14
SLIDE 14

Preliminaries

1 Consider the Discrete Hawkes process model

λ_i = μ + θ′ x_{i−p}^{i−1},   (2)

2 Negative (conditional) log-likelihood

L(θ) = −(1/n) Σ_{i=1}^{n} [x_i log λ_i − λ_i].   (3)

3 Bernoulli approximation

L(θ) ≈ −(1/n) Σ_{i=1}^{n} [x_i log λ_i + (1 − x_i) log(1 − λ_i)] = h(x_1, x_2, ···, x_n).   (4)

4 The negative log-likelihood equals the joint entropy (information) of the spiking.

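The Bernoulli approximation (4) is the empirical cross-entropy of the spike train; a minimal sketch (the spike and rate values below are illustrative):

```python
import math

def neg_log_likelihood(x, lam):
    """Bernoulli negative log-likelihood (4):
    -(1/n) * sum_i [x_i log(lam_i) + (1 - x_i) log(1 - lam_i)]."""
    n = len(x)
    return -sum(xi * math.log(li) + (1 - xi) * math.log(1 - li)
                for xi, li in zip(x, lam)) / n

x = [1, 0, 0, 1]            # observed spikes
lam = [0.9, 0.1, 0.2, 0.8]  # predicted Bernoulli rates
nll = neg_log_likelihood(x, lam)
```

Rates that match the data better drive the objective toward zero, consistent with reading (4) as the information content of the spiking.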


slide-18
SLIDE 18

ML vs. ℓ1-regularization

Maximum Likelihood Estimation

θ̂_ML = arg min_{θ∈Θ} L(θ),   (5)

1 Maximizes the joint entropy of spiking to have maximum transferred information.

ℓ1-regularized estimate

θ̂_sp := arg min_{θ∈Θ} L(θ) + γ_n ‖θ‖_1.   (6)

2 What does regularization do apart from motivating sparsity?
3 To show: regularization determines the speed of data transfer.
4 Battle between speed and amount of information.
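Problems like (6) are commonly solved by proximal gradient descent, whose key ingredient is the soft-thresholding operator (the step below is a generic sketch, not the talk's algorithm; the vector and threshold are illustrative):

```python
def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, applied elementwise:
    shrink each component toward 0 by t, zeroing anything smaller than t."""
    return [max(abs(vi) - t, 0.0) * (1 if vi > 0 else -1) for vi in v]

# One proximal-gradient step on L(theta) + gamma * ||theta||_1 would be:
#   theta <- soft_threshold(theta - step * grad_L(theta), step * gamma)
theta = [0.8, -0.05, 0.3]
shrunk = soft_threshold(theta, 0.1)   # the small component is zeroed: sparsity
```

The zeroing of small components is the mechanism behind "motivating sparsity" in point 2.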


slide-21
SLIDE 21

Second Largest Eigenvalue Modulus and its Significance

1 The Markov chain defined by the history components of the Hawkes process has a stationary distribution π.
2 It converges to π irrespective of the initial state.
3 How fast this happens determines how fast the data has been transferred.
4 The transition probability matrix is a function of θ.
5 Perron-Frobenius theorem: P has a unique largest eigenvalue λ_1 = 1.
6 The second largest eigenvalue modulus determines the speed of convergence:

λ = max{λ_2, −λ_n},   max_{i∈S} ‖P^t(i, ·) − π(·)‖ ∼ Cλ^t,   t → ∞

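For a two-state symmetric chain the spectrum is explicit (eigenvalues 1 and 2a − 1 for P = [[a, 1−a], [1−a, a]]), which makes the geometric decay ‖P^t(i, ·) − π(·)‖ ∼ Cλ^t easy to check numerically; a sketch with an illustrative a:

```python
def step(row, P):
    """One step of the chain: row distribution times the 2x2 transition matrix P."""
    return [row[0] * P[0][0] + row[1] * P[1][0],
            row[0] * P[0][1] + row[1] * P[1][1]]

a = 0.25
P = [[a, 1 - a], [1 - a, a]]       # symmetric chain, pi = (1/2, 1/2)
slem = abs(2 * a - 1)              # second largest eigenvalue modulus
dist = [1.0, 0.0]                  # start deterministically in state 0
tv = []
for t in range(5):
    dist = step(dist, P)
    tv.append(abs(dist[0] - 0.5))  # distance to pi (TV up to a factor of 2)
# tv shrinks by exactly slem at each step
```

Smaller SLEM means faster forgetting of the initial state, i.e., faster data transfer in the sense of the slide.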


slide-27
SLIDE 27

Example

1 λ = 0:

P = [1/2 1/2; 1/2 1/2]   (7)

2 λ = 0.5:

P = [1/4 3/4; 3/4 1/4]   (8)

P^2 = [0.625 0.375; 0.375 0.625]   (9)

P^5 = [0.4844 0.5156; 0.5156 0.4844]   (10)

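The matrix powers in the example can be reproduced directly; a small pure-Python sketch:

```python
def matmul(A, B):
    """Multiply two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def matpow(P, n):
    """n-th power of a 2x2 matrix by repeated multiplication."""
    out = [[1.0, 0.0], [0.0, 1.0]]   # identity
    for _ in range(n):
        out = matmul(out, P)
    return out

P = [[0.25, 0.75], [0.75, 0.25]]
P2 = matpow(P, 2)   # diagonal 0.625, off-diagonal 0.375
P5 = matpow(P, 5)   # entries within 2^-5 of the stationary value 1/2
```

With λ = 0.5 the distance to stationarity halves each step, which is why P^5 is already within about 0.016 of the uniform distribution.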


slide-29
SLIDE 29

Methods of Estimating the SLEM for Reversible Chains

1 Mixing rate of the chain, spectral gap
2 Bounds on TV minimization, eigenvalue decomposition
3 Dirichlet forms and the Poincaré inequality
4 Comparison techniques
5 Wilson's method
6 Nash inequalities
7 Evolving sets and martingales
8 Representation theory


slide-37
SLIDE 37

Fastest Mixing Markov Chain Problem

1 FMMC problem: choose P such that the SLEM is minimized.
2 Quick review: a semidefinite program (SDP) has the form

minimize c^T x
subject to x_1 F_1 + ··· + x_n F_n + G ⪯ 0, Ax = b,

where G, F_1, ···, F_n are symmetric matrices.


slide-39
SLIDE 39

SDP Characterization of Fastest Mixing Markov Chain [Boyd et al.]

1 Assumption: P = P^T.
2 Stationary distribution: (1/n)𝟙.
3 Idea: project onto the orthogonal complement of (1/n)𝟙:

λ = ‖QPQ‖_2, where Q = I − (1/n)𝟙𝟙^T.

4 Thus

λ = ‖P − (1/n)𝟙𝟙^T‖_2.

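For a symmetric 2-state chain the identity λ = ‖P − (1/n)𝟙𝟙^T‖_2 can be checked by hand, since a symmetric 2x2 matrix [[a, b], [b, a]] has eigenvalues a + b and a − b; a sketch with an illustrative P:

```python
P = [[0.25, 0.75], [0.75, 0.25]]
n = 2
# M = P - (1/n) * ones * ones^T: subtract the rank-one stationary projector
M = [[P[i][j] - 1.0 / n for j in range(n)] for i in range(n)]
# eigenvalues of the symmetric matrix [[a, b], [b, a]] are a + b and a - b
eigs = [M[0][0] + M[0][1], M[0][0] - M[0][1]]
spectral_norm = max(abs(e) for e in eigs)   # equals the SLEM of P
```

Here P has eigenvalues 1 and −0.5, so subtracting the projector removes the eigenvalue 1 and the spectral norm of what remains is exactly the SLEM, 0.5.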


slide-43
SLIDE 43

SDP Characterization of Fastest Mixing Markov Chain [Boyd et al.]

1 The FMMC problem:

minimize ‖P − (1/n)𝟙𝟙^T‖_2
subject to P𝟙 = 𝟙,
Pij ≥ 0,
Pij = 0 if (i, j) not connected

2 Equivalently:

minimize t
subject to −tI ⪯ P − (1/n)𝟙𝟙^T ⪯ tI,
P𝟙 = 𝟙,
Pij ≥ 0,
Pij = 0 if (i, j) not connected



slide-45
SLIDE 45

SDP Characterization of Fastest Mixing Markov Chain [Boyd et al.]

1 Cannot directly apply this to the DHP.
2 P ≠ P^T.
3 The number of states grows exponentially.
4 What can we do then?

Lemma
For a DHP, λ is monotone in ‖θ‖_1.

5 Important implication: ℓ1-regularization not only enforces sparsity but also minimizes the SLEM!

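The lemma can be illustrated in the smallest case p = 1 (an illustrative choice; the logistic link is also an assumption). The history chain then has two states, a 2x2 stochastic matrix has eigenvalues 1 and r1 − r0, so the SLEM is |φ(μ + θ) − φ(μ)|, which grows with |θ|:

```python
import math

def slem_p1(theta, mu):
    """SLEM of the 2-state history chain of a p=1 discrete Hawkes process.
    From state x the next spike has probability phi(mu + theta * x);
    the 2x2 stochastic transition matrix has eigenvalues 1 and r1 - r0."""
    phi = lambda u: 1.0 / (1.0 + math.exp(-u))
    r0, r1 = phi(mu), phi(mu + theta)
    return abs(r1 - r0)

# SLEM increases with ||theta||_1, so shrinking theta speeds up mixing:
slems = [slem_p1(t, mu=-1.0) for t in (0.0, 0.5, 1.0, 2.0)]
```

This is the trade-off of the talk in miniature: larger θ carries more history (more information per spike train) but mixes more slowly.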


slide-50
SLIDE 50

Stationary Distributions

1 Given a large P, find the stationary distributions.
2 Applications: statistical inference, network analysis, etc.
3 Power iteration methods (Lanczos algorithm, Arnoldi algorithm, etc.) and MCMC.
4 Can we write this problem as a convex optimization problem?
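The power-iteration route mentioned above, in its simplest form (the 3-state chain is an illustrative example):

```python
def stationary_by_power_iteration(P, iters=200):
    """Find pi with pi P = pi by repeatedly applying P to a uniform start.
    Works when the chain is ergodic; convergence rate is governed by the SLEM."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

P = [[0.9, 0.1, 0.0],
     [0.4, 0.5, 0.1],
     [0.0, 0.3, 0.7]]
pi = stationary_by_power_iteration(P)   # converges to (0.75, 0.1875, 0.0625)
```

Lanczos and Arnoldi accelerate exactly this computation for large sparse P; MCMC instead samples from π without forming it.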


slide-54
SLIDE 54

Introduction

1 Quick review:

Borel-Cantelli lemma
For events A_1, ···, A_∞, if Σ_i P(A_i) < ∞, then only finitely many of the A_i occur.

2 For a Markov chain, the n-step transition probabilities are given by

P_ij(n) = P^n(i, j)

3 Goal: minimize the number of stationary states of the Markov chain to find the support of a positive sparse vector.



slide-57
SLIDE 57

Introduction

1 Suppose x is a positive s-sparse vector.
2 We have measurements y = Ax.
3 WLOG assume ‖x‖_1 = 1 (measure y_1 = 𝟙^T x, for example).
4 Suppose x is the stationary distribution of a Markov chain!

Main idea
Finding the sparsest x ≡ finding the Markov chain with the smallest number of stationary states!



slide-61
SLIDE 61

Introduction

1 By the Borel-Cantelli lemma (and a little more work), state i is transient if and only if Σ_n p_ii(n) < ∞.

2 Goal: try to minimize Σ_i Σ_n p_ii(n).

3 The resulting program:

minimize tr(I + P + P^2 + ···)
subject to P𝟙 = 𝟙, xP = x, y = Ax,
Pij ≥ 0, Pij = 0 if (i, j) not connected

4 Unfortunately the objective function does not converge!


slide-65
SLIDE 65

Formulation

1 Note that

tr(I + P + P^2 + ···) = tr((I − P)^{−1}) = Σ_i 1/(1 − λ_i)

2 Need to remove λ_1 = 1!
3 Fundamental matrix Z of the Markov chain:

Z = Σ_{i=0}^{∞} (P^i − 𝟙x^T) = (I − P + 𝟙x^T)^{−1}

4 The elements of Z are finite!

Z_jk = Σ_{t=0}^{∞} (P_jk(t) − x_k)

5 Z_jk represents how quickly the probability mass at node k from a random walk beginning at node j converges to x_k.

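The fundamental matrix can be checked on a small chain by inverting I − P + 𝟙x^T directly; a pure-Python sketch with an illustrative 2-state chain:

```python
def inv2(M):
    """Inverse of a 2x2 matrix by the adjugate formula."""
    a, b = M[0]
    c, d = M[1]
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

P = [[0.25, 0.75], [0.75, 0.25]]
x = [0.5, 0.5]                      # stationary distribution of P
# A = I - P + ones * x^T
A = [[(1.0 if i == j else 0.0) - P[i][j] + x[j] for j in range(2)]
     for i in range(2)]
Z = inv2(A)                         # fundamental matrix of the chain
# Since (I - P + ones x^T) ones = ones, the rows of Z sum to 1.
```

For this chain Z = [[5/6, 1/6], [1/6, 5/6]]: the deviation of each diagonal entry from 1 aggregates the geometric excess probability 0.5·(−0.5)^t of returning to the start, matching the Σ_t (P_jk(t) − x_k) reading above.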


slide-70
SLIDE 70

Formulation

1 Final formulation:

minimize tr((I − P + 𝟙x^T)^{−1})
subject to P𝟙 = 𝟙, xP = x, y = Ax,
Pij ≥ 0, Pij = 0 if (i, j) not connected

2 Still not convex! Need to relax.
3 Might be tougher than ℓ1-regularization; for now, just a new perspective.
4 Might help come up with new algorithms [Ozdaglar et al.]
