 
Entropy Rate Estimation for Markov Chains with Large State Space

Yanjun Han (Stanford EE), Jiantao Jiao (Berkeley EECS), Chuan-Zheng Lee (Stanford EE), Tsachy Weissman (Stanford EE), Yihong Wu (Yale Stats), Tiancheng Yu (Tsinghua EE)

NIPS 2018, Montréal, Canada

Entropy Rate Estimation

Entropy rate of a stationary process $\{X_t\}_{t=1}^{\infty}$:

$$\bar{H} \triangleq \lim_{n\to\infty} \frac{H(X^n)}{n}, \qquad H(X^n) = \sum_{x^n \in \mathcal{X}^n} p_{X^n}(x^n) \log \frac{1}{p_{X^n}(x^n)}.$$

◮ fundamental limit of the expected logarithmic loss when predicting the next symbol given all past symbols
◮ fundamental limit of data compression for stationary stochastic processes

Our Task
Given a length-$n$ trajectory $\{X_t\}_{t=1}^{n}$, estimate $\bar{H}$.

From Entropy to Entropy Rate

Theorem (Jiao–Venkat–Han–Weissman'15, Wu–Yang'16)
For discrete entropy estimation with support size $S$, consistent estimation is possible if and only if $n \gg \frac{S}{\log S}$.

Sample Complexity
◮ i.i.d. process: $n \asymp \frac{S}{\log S}$
◮ in between: $n \asymp\ ?$
◮ constant process: $n \asymp \infty$

Assumption

Assumption
The data-generating process $\{X_t\}_{t=1}^{n}$ is a reversible first-order Markov chain with relaxation time $\tau_{\text{rel}}$.

◮ Relaxation time $\tau_{\text{rel}} = (\text{spectral gap})^{-1} \ge 1$ characterizes the mixing time of the Markov chain
◮ High-dimensional setting: state space size $S = |\mathcal{X}|$ is large and may scale with $n$
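To make the relaxation time concrete, the sketch below approximates it for a small reversible chain using only the Python standard library. It assumes that "spectral gap" means the absolute spectral gap, $1 - \max_{i \ge 2} |\lambda_i|$, which is consistent with $\tau_{\text{rel}} \ge 1$ but is not spelled out on the slide. The function name `relaxation_time`, its arguments, and the power-iteration approach are illustrative choices, not something taken from the talk.

```python
import math


def _matvec(M, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def relaxation_time(P, pi, iters=500):
    """Approximate 1 / (absolute spectral gap) of a reversible Markov chain.

    P  : row-stochastic transition matrix, as a list of lists
    pi : its stationary distribution (all entries assumed positive)
    Assumption: "spectral gap" is the absolute gap 1 - max_{i>=2} |lambda_i|.
    """
    S = len(P)
    root = [math.sqrt(p) for p in pi]
    # For a reversible chain, A = D^{1/2} P D^{-1/2} is symmetric and has the
    # eigenvector sqrt(pi) with eigenvalue 1. Removing that component leaves
    # a matrix B that carries the remaining eigenvalues.
    B = [[root[i] * P[i][j] / root[j] - root[i] * root[j] for j in range(S)]
         for i in range(S)]
    # Power iteration on B^2, so eigenvalues +x and -x do not compete.
    v = [(-1.0) ** k * (k + 1) for k in range(S)]  # arbitrary starting vector
    scale = math.sqrt(sum(x * x for x in v))
    v = [x / scale for x in v]
    lam_sq = 0.0
    for _ in range(iters):
        w = _matvec(B, _matvec(B, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # every remaining eigenvalue is exactly zero
            lam_sq = 0.0
            break
        # Rayleigh quotient of B^2 at the unit vector v.
        lam_sq = max(0.0, sum(a * b for a, b in zip(v, w)))
        v = [x / norm for x in w]
    gap = 1.0 - math.sqrt(lam_sq)
    return math.inf if gap <= 0.0 else 1.0 / gap


# A memoryless (i.i.d.) chain has tau_rel = 1; a "sticky" chain has a larger one.
iid_chain = [[0.5, 0.5], [0.5, 0.5]]
sticky_chain = [[0.9, 0.1], [0.1, 0.9]]
print(relaxation_time(iid_chain, [0.5, 0.5]))
print(relaxation_time(sticky_chain, [0.5, 0.5]))
```

The fixed iteration count and the arbitrary starting vector keep the sketch short; a starting vector that happens to be orthogonal to the dominant eigenvector, or a chain with nearly tied eigenvalues, would need a more careful eigenvalue routine.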

Estimators

For a first-order Markov chain:

$$\bar{H} = H(X_1 \mid X_0) = \sum_{i=1}^{S} \underbrace{\pi_i}_{\text{stationary distribution}} \, \underbrace{H(X_1 \mid X_0 = i)}_{\text{conditional entropy}}$$

◮ Estimate of $\pi_i$: empirical frequency $\hat{\pi}_i$ of state $i$
◮ Estimate of $H(X_1 \mid X_0 = i)$: estimate discrete entropy from samples $X^{(i)} = \{X_j : X_{j-1} = i\}$

Estimators
◮ Empirical estimator: $\bar{H}_{\text{emp}} = \sum_{i=1}^{S} \hat{\pi}_i \, \hat{H}_{\text{emp}}(X^{(i)})$
◮ Proposed estimator: $\bar{H}_{\text{opt}} = \sum_{i=1}^{S} \hat{\pi}_i \, \hat{H}_{\text{opt}}(X^{(i)})$
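The empirical estimator can be sketched in a few lines of Python. This is only an illustration of the plug-in construction described above: the function names are hypothetical, entropy is measured in bits, and $\hat{\pi}_i$ is taken as the frequency of state $i$ among the positions that have a successor, which is one of several reasonable conventions. The proposed estimator would replace `empirical_entropy` with a minimax-optimal entropy estimator, which is not reproduced here.

```python
import math
from collections import Counter, defaultdict


def empirical_entropy(samples):
    """Plug-in (maximum-likelihood) entropy of a list of symbols, in bits."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def empirical_entropy_rate(trajectory):
    """Plug-in entropy-rate estimate for a first-order Markov chain, in bits."""
    # Group successors by the preceding state: X^(i) = {X_j : X_{j-1} = i}.
    successors = defaultdict(list)
    for prev, nxt in zip(trajectory, trajectory[1:]):
        successors[prev].append(nxt)
    n_transitions = len(trajectory) - 1
    # Weight each per-state entropy by the empirical frequency of that state
    # (assumption: counted over the positions that have a successor).
    return sum(
        (len(succ) / n_transitions) * empirical_entropy(succ)
        for succ in successors.values()
    )


# A deterministic alternation carries no uncertainty about the next symbol.
print(empirical_entropy_rate([0, 1, 0, 1, 0, 1]))
# From each state, both successors appear equally often in this trajectory.
print(empirical_entropy_rate([0, 0, 1, 1, 0, 0, 1, 1, 0]))
```

With a large state space most states are visited only a few times, so each per-state plug-in entropy is computed from very few samples; this is the regime where the empirical estimator needs more data than the proposed one.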

Main Results

Sample complexity as a function of the relaxation time $\tau_{\text{rel}}$:

Empirical estimator $\bar{H}_{\text{emp}}$
◮ $1 \le \tau_{\text{rel}} \le \Theta\!\left(\frac{S}{\log^3 S}\right)$: $n \asymp S^2$
◮ $\tau_{\text{rel}}$ beyond $\Theta\!\left(\frac{S}{\log^3 S}\right)$: $n \gtrsim S^2$

Proposed estimator $\bar{H}_{\text{opt}}$
◮ $\tau_{\text{rel}} = 1$: $n \asymp \frac{S}{\log S}$
◮ $1 < \tau_{\text{rel}} < 1 + \Theta\!\left(\frac{\log^2 S}{\sqrt{S}}\right)$: $n \lesssim \frac{S^2}{\log S}$
◮ $1 + \Theta\!\left(\frac{\log^2 S}{\sqrt{S}}\right) \le \tau_{\text{rel}} \le \Theta\!\left(\frac{S}{\log^3 S}\right)$: $n \asymp \frac{S^2}{\log S}$
◮ $\tau_{\text{rel}}$ beyond $\Theta\!\left(\frac{S}{\log^3 S}\right)$: $n \gtrsim \frac{S^2}{\log S}$

For a wide range of $\tau_{\text{rel}}$, the sample complexity does not depend on $\tau_{\text{rel}}$.
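To get a feel for these scalings, the sketch below evaluates them for a given state-space size. It is illustrative arithmetic only: every $\Theta(\cdot)$ constant is set to 1 and the logarithm is taken to be natural, both of which are assumptions, and the function name and dictionary keys are hypothetical.

```python
import math


def sample_size_scales(S):
    """Evaluate the scalings from the main results with all constants set to 1.

    Assumptions: natural logarithm, unit Theta-constants. The values show
    orders of magnitude only, not actual sample-size requirements.
    """
    log_s = math.log(S)
    return {
        "iid_optimal": S / log_s,                  # n ~ S / log S
        "markov_optimal": S ** 2 / log_s,          # n ~ S^2 / log S
        "markov_empirical": S ** 2,                # n ~ S^2
        "tau_rel_lower": 1 + log_s ** 2 / math.sqrt(S),
        "tau_rel_upper": S / log_s ** 3,
    }


for name, value in sample_size_scales(10_000).items():
    print(f"{name}: {value:.3g}")
```

Under these assumptions the proposed estimator saves a factor of $\log S$ over the empirical one, and moving from i.i.d. data to a Markov chain costs a factor of $S$.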

Application: Fundamental Limits of Language Models

[Figure: estimated conditional entropy $\hat{H}(X_k \mid X_1^{k-1})$ versus memory length $k$ for PTB and 1BW, with the best known model for each dataset shown for comparison.]

Figure: Estimated and achieved fundamental limits of language modeling

◮ Penn Treebank (PTB): 1.50 vs. 5.96 bits per word
◮ Google's One Billion Words (1BW): 3.46 vs. 4.55 bits per word