

SLIDE 1

Confidence intervals for the mixing time
of a reversible Markov chain
from a single sample path

Daniel Hsu† Aryeh Kontorovich♯ Csaba Szepesvári⋆

†Columbia University, ♯Ben-Gurion University, ⋆University of Alberta

ITA 2016

SLIDES 2–6

Problem

◮ Irreducible, aperiodic, time-homogeneous Markov chain

$$X_1 \to X_2 \to X_3 \to \cdots$$

◮ There is a unique stationary distribution $\pi$ with

$$\lim_{t\to\infty} \mathcal{L}(X_t \mid X_1 = x) = \pi \quad \text{for all } x \in \mathcal{X}.$$

◮ The mixing time $t_{\mathrm{mix}}$ is the earliest time $t$ with

$$\sup_{x\in\mathcal{X}} \bigl\| \mathcal{L}(X_t \mid X_1 = x) - \pi \bigr\|_{\mathrm{tv}} \le 1/4 .$$

Problem (informal): Determine (confidently) whether $t \ge t_{\mathrm{mix}}$ after seeing $X_1, X_2, \ldots, X_t$.

Problem (formal): Given $\delta \in (0, 1)$ and $X_{1:t}$, determine a non-trivial interval $I_t \subseteq [0, \infty]$ with $\Pr(t_{\mathrm{mix}} \in I_t) \ge 1 - \delta$.
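To make the definition concrete, here is a minimal brute-force sketch that computes $t_{\mathrm{mix}}$ for a fully known transition matrix. The talk's whole point is that $P$ is not known, so this is only a reference computation; the function name and the example chain are hypothetical.

```python
import numpy as np

def mixing_time(P, tol=0.25, t_max=10_000):
    """Smallest t with max_x ||P^t(x, .) - pi||_tv <= tol,
    for a fully known transition matrix P (brute force)."""
    d = P.shape[0]
    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    w, V = np.linalg.eig(P.T)
    pi = np.real(V[:, np.argmax(np.real(w))])
    pi /= pi.sum()
    Pt = np.eye(d)
    for t in range(1, t_max + 1):
        Pt = Pt @ P
        # Total variation from the worst starting state.
        tv = 0.5 * np.abs(Pt - pi).sum(axis=1).max()
        if tv <= tol:
            return t
    return None

# Example: lazy random walk on a 3-cycle (fast mixing, tmix = 1 here).
P = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
print(mixing_time(P))
```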

SLIDES 7–9

Some motivation from machine learning and statistics

Chernoff bounds for Markov chains $X_1 \to X_2 \to \cdots$: for suitably well-behaved $f\colon \mathcal{X} \to \mathbb{R}$, with probability at least $1 - \delta$,

$$\left| \frac{1}{t} \sum_{i=1}^{t} f(X_i) - \mathbb{E}_\pi f \right| \;\le\; \underbrace{\tilde{O}\left( \sqrt{\frac{t_{\mathrm{mix}} \log(1/\delta)}{t}} \right)}_{\text{deviation bound}} .$$

The bound depends on $t_{\mathrm{mix}}$, which may be unknown a priori.

Examples:
◮ Bayesian inference: posterior means & variances via MCMC
◮ Reinforcement learning: mean action rewards in an MDP
◮ Supervised learning: error rates of hypotheses from non-i.i.d. data

Need observable deviation bounds.
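A tiny simulation sketch of this deviation behavior, assuming a hypothetical two-state chain; the chain, the function $f$, and all parameter values are illustrative and not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-state chain with stationary distribution (2/3, 1/3);
# f is the indicator of state 1, so E_pi f = 1/3.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2 / 3, 1 / 3])
f = np.array([0.0, 1.0])

def sample_path(P, t, x0=0):
    """Simulate X_1, ..., X_t from transition matrix P."""
    xs = [x0]
    for _ in range(t - 1):
        xs.append(rng.choice(len(P), p=P[xs[-1]]))
    return np.array(xs)

for t in (100, 1_000, 10_000):
    xs = sample_path(P, t)
    # Deviation of the empirical mean from E_pi f; shrinks roughly
    # like sqrt(tmix / t), as in the bound on the slide.
    print(t, abs(f[xs].mean() - pi @ f))
```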

SLIDES 10–13

Observable deviation bounds from mixing time bounds?

Suppose an estimator $\hat{t}_{\mathrm{mix}} = \hat{t}_{\mathrm{mix}}(X_{1:t})$ of $t_{\mathrm{mix}}$ satisfies

$$\Pr\bigl(t_{\mathrm{mix}} \le \hat{t}_{\mathrm{mix}} + \varepsilon_t\bigr) \ge 1 - \delta .$$

Then with probability at least $1 - 2\delta$,

$$\left| \frac{1}{t} \sum_{i=1}^{t} f(X_i) - \mathbb{E}_\pi f \right| \;\le\; \tilde{O}\left( \sqrt{\frac{(\hat{t}_{\mathrm{mix}} + \varepsilon_t) \log(1/\delta)}{t}} \right) .$$

But $\hat{t}_{\mathrm{mix}}$ is computed from $X_{1:t}$, so $\varepsilon_t$ may itself depend on $t_{\mathrm{mix}}$. Deviation bounds for point estimators are therefore insufficient: we need (observable) confidence intervals for $t_{\mathrm{mix}}$.

SLIDES 14–17

What we do

1. Shift focus to the relaxation time $t_{\mathrm{relax}}$ to enable spectral methods.
2. Lower/upper bounds on the sample path length needed for point estimation of $t_{\mathrm{relax}}$.
3. A new algorithm for constructing confidence intervals for $t_{\mathrm{relax}}$.

SLIDES 18–21

Relaxation time

◮ Let $P$ be the transition operator of the Markov chain, and let $\lambda_\star$ be its second-largest eigenvalue modulus (i.e., the largest eigenvalue modulus other than 1).

◮ Spectral gap: $\gamma_\star := 1 - \lambda_\star$. Relaxation time: $t_{\mathrm{relax}} := 1/\gamma_\star$.

$$(t_{\mathrm{relax}} - 1) \ln 2 \;\le\; t_{\mathrm{mix}} \;\le\; t_{\mathrm{relax}} \ln \frac{4}{\pi_\star} \qquad \text{for } \pi_\star := \min_{x\in\mathcal{X}} \pi(x).$$

The assumptions on $P$ ensure $\gamma_\star, \pi_\star \in (0, 1)$.

Spectral approach: construct CIs for $\gamma_\star$ and $\pi_\star$.
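As a concrete check of these definitions, here is a minimal numpy sketch that computes $\gamma_\star$, $t_{\mathrm{relax}}$, and the two-sided bounds on $t_{\mathrm{mix}}$ for a known reversible chain. It assumes $P$ and $\pi$ are given, which the talk's single-path setting does not grant; the function name is illustrative.

```python
import numpy as np

def relaxation_bounds(P, pi):
    """Spectral gap, relaxation time, and the sandwich bounds on tmix
    for a known reversible transition matrix P with stationary pi."""
    # L = diag(pi)^{1/2} P diag(pi)^{-1/2} is symmetric for reversible P
    # and shares P's (real) spectrum.
    s = np.sqrt(pi)
    L = (s[:, None] * P) / s[None, :]
    lam = np.linalg.eigvalsh((L + L.T) / 2)   # ascending; guard asymmetry
    lam_star = max(lam[-2], abs(lam[0]))      # second-largest modulus
    gamma_star = 1.0 - lam_star
    t_relax = 1.0 / gamma_star
    lower = (t_relax - 1.0) * np.log(2.0)
    upper = t_relax * np.log(4.0 / pi.min())
    return gamma_star, t_relax, lower, upper

# Lazy 3-cycle walk from before: gap 0.75, trelax = 4/3.
P = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
print(relaxation_bounds(P, np.ones(3) / 3))
```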

SLIDES 22–25

Our results (point estimation)

We restrict to reversible Markov chains on finite state spaces. Let $d$ be the (a priori known) cardinality of the state space $\mathcal{X}$.

1. Lower bound: to estimate $\gamma_\star$ within a constant multiplicative factor, every algorithm needs (with probability at least 1/4) a sample path of length

$$\Omega\left( \frac{d \log d}{\gamma_\star} + \frac{1}{\pi_\star} \right).$$

2. Upper bound: a simple algorithm estimates $\gamma_\star$ and $\pi_\star$ within a constant multiplicative factor (w.h.p.) with a sample path of length

$$O\!\left( \frac{\log d}{\pi_\star \gamma_\star^3} \right) \text{ (for } \gamma_\star\text{)}, \qquad O\!\left( \frac{\log d}{\pi_\star \gamma_\star} \right) \text{ (for } \pi_\star\text{)}.$$

But a point estimator does not by itself yield a confidence interval.

SLIDES 26–27

Our results (confidence intervals)

3. New algorithm: given $\delta \in (0, 1)$ and $X_{1:t}$ as input, it constructs intervals $I_t^{\gamma_\star}$ and $I_t^{\pi_\star}$ such that

$$\Pr\bigl( \gamma_\star \in I_t^{\gamma_\star} \bigr) \ge 1 - \delta \qquad \text{and} \qquad \Pr\bigl( \pi_\star \in I_t^{\pi_\star} \bigr) \ge 1 - \delta .$$

The widths of the intervals converge a.s. to zero at a $\sqrt{(\log \log t)/t}$ rate.

4. Hybrid approach: use the new algorithm to turn error bounds for point estimators into observable CIs. (This improves the asymptotic rate for the $\pi_\star$ interval.)

SLIDES 28–31

Plug-in estimator

◮ Reversibility grants the symmetry of

$$M := \mathrm{diag}(\pi)\, P = \bigl[ \Pr\nolimits_{X_1\sim\pi}(X_1 = x,\, X_2 = x') \bigr]_{x,x'\in\mathcal{X}}$$

(doublet state probabilities in the stationary chain).

◮ Moreover, the eigenvalues of

$$L := \mathrm{diag}(\pi)^{-1/2} M \,\mathrm{diag}(\pi)^{-1/2}$$

are real and satisfy

$$1 = \lambda_1 > \lambda_2 \ge \cdots \ge \lambda_d > -1 , \qquad \gamma_\star = 1 - \max\{\lambda_2, |\lambda_d|\} .$$

◮ Plug-in estimator: estimate $\pi$ and $M$ from $X_{1:t}$ (using empirical frequencies), then plug in to the formula for $\gamma_\star$.
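A minimal sketch of this plug-in estimator, assuming states are encoded as integers $0, \dots, d-1$; the function name and the tiny floor on $\hat\pi$ are illustrative choices, not the paper's.

```python
import numpy as np

def plugin_gap_estimate(xs, d):
    """Plug-in estimate of the spectral gap from a single path xs over
    states {0, ..., d-1}: empirical pi-hat and doublet matrix M-hat,
    then gamma-hat from the eigenvalues of the symmetrized L-hat."""
    xs = np.asarray(xs)
    # Empirical doublet frequencies M-hat[x, x'] ~ Pr(X_i = x, X_{i+1} = x').
    M = np.zeros((d, d))
    np.add.at(M, (xs[:-1], xs[1:]), 1.0)
    M /= len(xs) - 1
    pi_hat = np.bincount(xs, minlength=d) / len(xs)
    s = np.sqrt(np.maximum(pi_hat, 1e-12))   # floor avoids divide-by-zero
    L = M / np.outer(s, s)
    L = (L + L.T) / 2                        # enforce symmetry of estimate
    lam = np.linalg.eigvalsh(L)              # ascending order
    return 1.0 - max(lam[-2], abs(lam[0]))   # gamma-hat
```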

SLIDES 32–34

Chicken-and-egg problem

A (matrix) Chernoff bound (for Markov chains) gives error bounds for estimates of $\pi$ and $M$ (and ultimately of $L$ and $\gamma_\star$): e.g., w.h.p.,

$$|\hat{\gamma}_\star - \gamma_\star| \;\le\; \|\hat{L} - L\| \;\le\; O\left( \sqrt{\frac{\log(d)\, \log(t/\pi_\star)}{\gamma_\star \pi_\star t}} \right).$$

This has an inverse dependence on $\gamma_\star$: one can't "solve the bound" for $\gamma_\star$ (unlike with "empirical Bernstein" inequalities).

SLIDES 35–40

Direct estimation of P

Alternative: directly estimate $P$ from $X_{1:t}$.

◮ Key advantage: observable confidence intervals for $P$ via an "empirical Bernstein" inequality for martingales.

Two problems:

1. Without appealing to the symmetry structure, one can argue that

$$\|\hat{P} - P\| \le \varepsilon \;\implies\; |\hat{\gamma}_\star - \gamma_\star| \le O\bigl(\varepsilon^{1/(2d)}\bigr) ,$$

but this implies an exponential slow-down in the rate.

2. A direct appeal to the symmetry structure of

$$L = \mathrm{diag}(\pi)^{1/2}\, P \,\mathrm{diag}(\pi)^{-1/2}$$

gives bounds that depend on $\pi$, which is unknown.

Our approach: directly estimate $P$, and indirectly estimate $\pi$ via $\hat{P}$.
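A minimal sketch of the smoothed direct estimate; the smoothing parameter alpha is a hypothetical choice, since the slides only say that Laplace smoothing is used to make $\hat{P}$ ergodic.

```python
import numpy as np

def smoothed_transition_estimate(xs, d, alpha=1.0):
    """Row-normalized transition counts with Laplace smoothing, so that
    P-hat is strictly positive and hence the transition operator of an
    ergodic chain. alpha is an illustrative smoothing parameter."""
    xs = np.asarray(xs)
    N = np.zeros((d, d))
    np.add.at(N, (xs[:-1], xs[1:]), 1.0)   # transition counts
    N += alpha                             # Laplace smoothing
    return N / N.sum(axis=1, keepdims=True)
```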

SLIDES 41–45

Indirect estimation of π

1. We ensure that $\hat{P}$ is the transition operator of an ergodic chain (easy via Laplace smoothing).

2. Key step: estimate $\pi$ via $\hat{P}$, through the group inverse $\hat{A}^{\#}$ of $I - \hat{P}$.

The group inverse $A^{\#}$ contains "virtually everything that one would want to know about the chain" [with transition operator $P$] (Meyer, 1975).

◮ It reveals the unique stationary distribution $\hat{\pi}$ w.r.t. $\hat{P}$; this is our indirect estimate of $\pi$.

◮ It tells us how to bound $\|\hat{\pi} - \pi\|_\infty$ in terms of $\|\hat{P} - P\|$. Hence, from this, we construct a confidence interval for $\pi$.
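A sketch of the group-inverse computation, assuming the classical fundamental-matrix identity $A^{\#} = (I - P + \mathbf{1}\pi^{\mathsf{T}})^{-1} - \mathbf{1}\pi^{\mathsf{T}}$ for an ergodic chain. Here $\hat{\pi}$ is obtained by solving the stationarity equations directly, whereas the talk reads $\hat{\pi}$ off the group inverse; treat the ordering below as an illustrative shortcut.

```python
import numpy as np

def stationary_from_P(P):
    """Stationary distribution of an ergodic P: solve pi P = pi with
    sum(pi) = 1 as an overdetermined linear system."""
    d = P.shape[0]
    A = np.vstack([P.T - np.eye(d), np.ones((1, d))])
    b = np.zeros(d + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def group_inverse_I_minus_P(P, pi):
    """Group inverse A# of A = I - P for an ergodic chain, via
    A# = (I - P + 1 pi^T)^{-1} - 1 pi^T."""
    d = P.shape[0]
    Pi = np.outer(np.ones(d), pi)   # rank-one limit matrix 1 pi^T
    return np.linalg.inv(np.eye(d) - P + Pi) - Pi

# Consistency check: I - A A# equals 1 pi^T, so every row of
# I - (I - P) @ group_inverse_I_minus_P(P, pi) reproduces pi.
```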

SLIDE 46

Overall algorithm (outline)

1. Form an empirical estimate of, and confidence intervals for, $P$ (exploiting the Markov property & "empirical Bernstein"-type bounds).

2. Form an estimate of, and confidence intervals for, $\pi$ (via the group inverse of $I - \hat{P}$).

3. Form an estimate of, and a confidence interval for, $\gamma_\star$ (via the confidence intervals for $\pi$ and $P$, & eigenvalue perturbation theory).
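Putting the sketches above together end to end, point estimates only: the interval construction itself, which is the paper's contribution, is omitted here. The example chain is a hypothetical birth-death chain, which is automatically reversible; it assumes the helper sketches above are in scope.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical birth-death chain (tridiagonal, hence reversible);
# its stationary distribution is (1/4, 1/2, 1/4) and gamma_star = 0.5.
P_true = np.array([[0.50, 0.50, 0.00],
                   [0.25, 0.50, 0.25],
                   [0.00, 0.50, 0.50]])

xs = [0]
for _ in range(200_000):
    xs.append(rng.choice(3, p=P_true[xs[-1]]))
xs = np.array(xs)

P_hat = smoothed_transition_estimate(xs, d=3, alpha=0.1)
pi_hat = stationary_from_P(P_hat)
gamma_hat = plugin_gap_estimate(xs, d=3)
print(pi_hat)                      # close to (0.25, 0.5, 0.25)
print(gamma_hat, 1.0 / gamma_hat)  # gap and relaxation-time estimates
```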

SLIDES 47–52

Recap and future work

◮ We resolve the "chicken-and-egg" problem of observable confidence intervals for the mixing time from a single sample path.

◮ We strongly exploit the Markov property and ergodicity in the confidence intervals for $P$ and $\pi$.

◮ Problem #1: close the gap between the lower and upper bounds on sample path length (for point estimation).

◮ Problem #2: overcome the computational bottlenecks from matrix operations.

◮ Problem #3: handle large/continuous state spaces under suitable assumptions.

Thanks!