Entropy Rate Estimation for Markov Chains with Large State Space

Yanjun Han (Stanford EE), Jiantao Jiao (Berkeley EECS), Chuan-Zheng Lee (Stanford EE), Tsachy Weissman (Stanford EE), Yihong Wu (Yale Stats), Tiancheng Yu (Tsinghua EE)

NIPS 2018, Montréal, Canada


Entropy Rate Estimation

Entropy rate of a stationary process $\{X_t\}_{t=1}^{\infty}$:

$$\bar{H} \triangleq \lim_{n\to\infty} \frac{H(X^n)}{n}, \qquad H(X^n) = \sum_{x^n \in \mathcal{X}^n} p_{X^n}(x^n) \log \frac{1}{p_{X^n}(x^n)}.$$

◮ fundamental limit of the expected logarithmic loss when predicting the next symbol given all past symbols
◮ fundamental limit of data compression for stationary stochastic processes

Our Task
Given a length-$n$ trajectory $\{X_t\}_{t=1}^{n}$, estimate $\bar{H}$.
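A quick illustration of the definition (our own sketch, not from the talk): for a small two-state stationary Markov chain we can evaluate $H(X^n)/n$ by brute-force enumeration and watch it approach $\bar{H}$ from above. All names here are ours.

```python
import itertools
import numpy as np

# Example transition matrix P[i, j] = Pr(X_{t+1} = j | X_t = i).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Stationary distribution: solve pi P = pi with the constraint sum(pi) = 1.
A = np.vstack([P.T - np.eye(2), np.ones(2)])
pi, *_ = np.linalg.lstsq(A, np.array([0.0, 0.0, 1.0]), rcond=None)

def block_entropy(n):
    """H(X^n) in nats, by summing over all 2^n length-n sequences."""
    h = 0.0
    for seq in itertools.product(range(2), repeat=n):
        p = pi[seq[0]]
        for cur, nxt in zip(seq, seq[1:]):
            p *= P[cur, nxt]
        if p > 0:
            h -= p * np.log(p)
    return h

# For a stationary process, H(X^n)/n is non-increasing in n and tends to
# the entropy rate, so the printed values decrease toward the limit.
for n in (1, 2, 4, 8, 12):
    print(n, block_entropy(n) / n)
```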


From Entropy to Entropy Rate

Theorem (Jiao–Venkat–Han–Weissman '15, Wu–Yang '16)
For discrete entropy estimation with support size $S$, consistent estimation is possible if and only if $n \gg S/\log S$.

Sample Complexity

◮ i.i.d. process: $n \asymp S/\log S$
◮ constant process: $n \asymp \infty$
◮ Markov chain with large state space: $n \asymp$ ?
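As a sense of scale (our own arithmetic, not from the slides), the $\log S$ saving in the theorem is substantial for large alphabets:

```python
import math

# Threshold n ~ S / log S (up to constants) vs. the naive n ~ S.
for S in (10**3, 10**6, 10**9):
    print(f"S = {S:>13,}   S/log(S) = {S / math.log(S):>16,.0f}")
```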


Assumption

The data-generating process $\{X_t\}_{t=1}^{n}$ is a reversible first-order Markov chain with relaxation time $\tau_{\mathrm{rel}}$.

◮ The relaxation time $\tau_{\mathrm{rel}} = (\text{spectral gap})^{-1} \ge 1$ characterizes the mixing time of the Markov chain.
◮ High-dimensional setting: the state space size $S = |\mathcal{X}|$ is large and may scale with $n$.
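For concreteness, here is a minimal sketch (ours, using the common convention that $\tau_{\mathrm{rel}}$ is the reciprocal of the absolute spectral gap) of computing the relaxation time of a known reversible transition matrix:

```python
import numpy as np

def relaxation_time(P):
    """1 / (absolute spectral gap) of a reversible transition matrix P.

    Assumes P is irreducible and reversible, so D^{1/2} P D^{-1/2}
    (D = diag(pi)) is symmetric with real eigenvalues.
    """
    S = P.shape[0]
    # Stationary distribution: solve pi P = pi with sum(pi) = 1.
    A = np.vstack([P.T - np.eye(S), np.ones(S)])
    b = np.zeros(S + 1); b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)

    d = np.sqrt(pi)
    sym = (d[:, None] * P) / d[None, :]          # symmetric, similar to P
    lam = np.sort(np.abs(np.linalg.eigvalsh(sym)))[::-1]
    return 1.0 / (1.0 - lam[1])                  # lam[0] == 1 (Perron root)

# Lazy two-state chain: eigenvalues 1 and 0.97, so tau_rel is about 33.
P = np.array([[0.99, 0.01],
              [0.02, 0.98]])
print(relaxation_time(P))
```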


Estimators

For a first-order Markov chain,

$$\bar{H} = H(X_1 \mid X_0) = \sum_{i=1}^{S} \pi_i \, H(X_1 \mid X_0 = i),$$

where $\pi$ is the stationary distribution and $H(X_1 \mid X_0 = i)$ is the conditional entropy of the next state given the current state $i$.

◮ Estimate of $\pi_i$: empirical frequency $\hat{\pi}_i$ of state $i$
◮ Estimate of $H(X_1 \mid X_0 = i)$: estimate the discrete entropy from the samples $X^{(i)} = \{X_j : X_{j-1} = i\}$

Estimators
◮ Empirical estimator: $\bar{H}_{\mathrm{emp}} = \sum_{i=1}^{S} \hat{\pi}_i \, \hat{H}_{\mathrm{emp}}(X^{(i)})$
◮ Proposed estimator: $\bar{H}_{\mathrm{opt}} = \sum_{i=1}^{S} \hat{\pi}_i \, \hat{H}_{\mathrm{opt}}(X^{(i)})$
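A minimal sketch (ours, not the authors' code) of the empirical estimator: count transitions along the trajectory, take $\hat{\pi}_i$ from the row counts, and use the plug-in entropy of each empirical row for $\hat{H}_{\mathrm{emp}}(X^{(i)})$. The proposed $\bar{H}_{\mathrm{opt}}$ has the same outer structure but replaces the plug-in step with a minimax rate-optimal entropy estimator in the spirit of Jiao–Venkat–Han–Weissman '15 or Wu–Yang '16, which we do not reproduce here.

```python
import numpy as np

def plugin_entropy(w):
    """Plug-in Shannon entropy, in nats, of a count (or probability) vector."""
    p = w[w > 0] / w.sum()
    return float(-(p * np.log(p)).sum())

def entropy_rate_emp(traj, S):
    """H_emp = sum_i pi_hat_i * plugin_entropy(empirical row i)."""
    N = np.zeros((S, S))
    for a, b in zip(traj, traj[1:]):          # N[i, j] = #{t: X_t=i, X_{t+1}=j}
        N[a, b] += 1
    row = N.sum(axis=1)
    pi_hat = row / row.sum()                  # empirical state frequencies
    return sum(pi_hat[i] * plugin_entropy(N[i])
               for i in range(S) if row[i] > 0)

# Simulate a chain and compare against the exact rate H = sum_i pi_i H(P[i]).
rng = np.random.default_rng(0)
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])
x = [0]
for _ in range(100_000):
    x.append(rng.choice(2, p=P[x[-1]]))
print("estimate:", entropy_rate_emp(x, S=2))

pi = np.array([2/3, 1/3])                     # stationary for this P: pi P = pi
print("truth:   ", sum(pi[i] * plugin_entropy(P[i]) for i in range(2)))
```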


Main Results

Sample complexity as a function of the relaxation time $\tau_{\mathrm{rel}}$:

Empirical estimator $\bar{H}_{\mathrm{emp}}$
◮ $\tau_{\mathrm{rel}} = 1$: $n \asymp S^2$
◮ $\tau_{\mathrm{rel}} = \Theta(S/\log^3 S)$: $n \gtrsim S^2$

Proposed estimator $\bar{H}_{\mathrm{opt}}$
◮ $\tau_{\mathrm{rel}} = 1$: $n \asymp S/\log S$
◮ $1 + \Theta(\log^2 S/\sqrt{S}) \lesssim \tau_{\mathrm{rel}} \lesssim \Theta(S/\log^3 S)$: $n \asymp S^2/\log S$

For a wide range of $\tau_{\mathrm{rel}}$, the sample complexity does not depend on $\tau_{\mathrm{rel}}$.

Application: Fundamental Limits of Language Models

[Figure: estimated conditional entropy $\hat{H}(X_k \mid X_1^{k-1})$ vs. memory length $k$ on PTB and 1BW, with the best known model for each corpus marked. Caption: Estimated and achieved fundamental limits of language modeling.]

◮ Penn Treebank (PTB): 1.50 vs. 5.96 bits per word
◮ Google's One Billion Words (1BW): 3.46 vs. 4.55 bits per word