Entropy Rate Estimation for Markov Chains with Large State Space

Yanjun Han (Stanford EE), Jiantao Jiao (Berkeley EECS), Chuan-Zheng Lee (Stanford EE), Tsachy Weissman (Stanford EE), Yihong Wu (Yale Stats), Tiancheng Yu (Tsinghua EE)

NIPS 2018
Entropy Rate Estimation

Entropy rate of a stationary process \{X_t\}_{t=1}^{\infty}:

    \bar{H} \triangleq \lim_{n \to \infty} \frac{H(X^n)}{n}, \qquad H(X^n) = \sum_{x^n \in \mathcal{X}^n} p_{X^n}(x^n) \log \frac{1}{p_{X^n}(x^n)}.

◮ fundamental limit of the expected logarithmic loss when predicting the next symbol given all past symbols
◮ fundamental limit of data compression for stationary stochastic processes

Our Task
Given a length-n trajectory \{X_t\}_{t=1}^{n}, estimate \bar{H}.
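As a sanity check on the definition, H(X^n)/n for a small chain can be computed by brute force and compared with the one-step conditional entropy it converges to. A minimal numpy sketch; the two-state chain below is illustrative, not from the talk:

```python
import numpy as np
from itertools import product

# Illustrative two-state chain (numbers are assumptions, not from the talk)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2 / 3, 1 / 3])          # stationary distribution: pi @ P == pi

def seq_prob(seq):
    """Probability of a trajectory under the stationary chain."""
    p = pi[seq[0]]
    for a, b in zip(seq, seq[1:]):
        p *= P[a, b]
    return p

def block_entropy(n):
    """H(X^n) in bits, by enumerating all 2^n trajectories."""
    probs = np.array([seq_prob(s) for s in product([0, 1], repeat=n)])
    return -np.sum(probs * np.log2(probs))

# Entropy rate of a stationary first-order chain: H(X_1 | X_0)
rate = -sum(pi[i] * P[i, j] * np.log2(P[i, j])
            for i in range(2) for j in range(2))

for n in [2, 5, 10]:
    print(f"H(X^{n})/{n} = {block_entropy(n) / n:.4f}  (rate = {rate:.4f})")
```

For a stationary first-order chain the chain rule gives H(X^n) = H(X_1) + (n-1) H(X_1|X_0), so H(X^n)/n decreases monotonically to the entropy rate.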
From Entropy to Entropy Rate
Theorem (Jiao–Venkat–Han–Weissman’15, Wu–Yang’16)
For discrete entropy estimation with support size S, consistent estimation is possible if and only if n ≫
S log S .
Sample Complexity
i.i.d. process constant process n ≍
S log S
n ≍ ∞ n ≍ ?
3 / 7
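The S/log S threshold in the theorem reflects the heavy downward bias of the naive plug-in entropy estimator when n is comparable to S. A quick numerical illustration (uniform distribution; the sizes are arbitrary choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
S = 1000                       # support size (illustrative)
true_H = np.log(S)             # entropy of Uniform(S), in nats

def plugin_entropy(samples):
    """Plug-in estimate: entropy of the empirical distribution."""
    _, counts = np.unique(samples, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

for n in [S, 10 * S, 100 * S]:
    est = np.mean([plugin_entropy(rng.integers(S, size=n)) for _ in range(20)])
    print(f"n = {n:>6}: plug-in ≈ {est:.3f} nats, true = {true_H:.3f} nats")
```

At n ≈ S the plug-in estimate falls well short of log S; the minimax-optimal estimators behind the theorem correct this bias and push the required sample size down to S/log S.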
Assumption

The data-generating process \{X_t\}_{t=1}^{n} is a reversible first-order Markov chain with relaxation time τ_rel.

◮ Relaxation time τ_rel = (spectral gap)^{-1} ≥ 1 characterizes the mixing time of the Markov chain
◮ High-dimensional setting: the state-space size S = |𝒳| is large and may scale with n
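For a given reversible chain, τ_rel can be read off the spectrum of the transition matrix. A sketch for a random walk on a weighted undirected graph, a standard family of reversible chains; the weights below are made up for illustration:

```python
import numpy as np

# Random walk on a weighted undirected graph: P[i, j] = W[i, j] / deg(i).
# W symmetric => chain is reversible. Weights are illustrative.
W = np.array([[1.0, 2.0, 0.0],
              [2.0, 1.0, 3.0],
              [0.0, 3.0, 1.0]])
deg = W.sum(axis=1)
P = W / deg[:, None]
pi = deg / deg.sum()                   # stationary distribution

# Reversibility makes D_pi^{1/2} P D_pi^{-1/2} symmetric => real spectrum.
A = np.diag(np.sqrt(pi)) @ P @ np.diag(1 / np.sqrt(pi))
eigs = np.sort(np.linalg.eigvalsh(A))[::-1]   # 1 = eigs[0] >= eigs[1] >= ...

# Using the absolute spectral gap, 1 - max(|lambda_2|, |lambda_S|)
gap = 1 - max(abs(eigs[1]), abs(eigs[-1]))
tau_rel = 1 / gap
print(f"relaxation time tau_rel ≈ {tau_rel:.3f}")
```

The symmetrization trick is what reversibility buys: detailed balance π_i P_ij = π_j P_ji makes A symmetric, so `eigvalsh` applies and the gap is well defined.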
Estimators

For a first-order Markov chain,

    \bar{H} = H(X_1 \mid X_0) = \sum_{i=1}^{S} \underbrace{\pi_i}_{\text{stationary distribution}} \, \underbrace{H(X_1 \mid X_0 = i)}_{\text{conditional entropy}}.

◮ Estimate of π_i: empirical frequency \hat{\pi}_i of state i
◮ Estimate of H(X_1 | X_0 = i): estimate the discrete entropy from the samples X^{(i)} = \{X_j : X_{j-1} = i\}

Estimators
◮ Empirical estimator: \bar{H}_{emp} = \sum_{i=1}^{S} \hat{\pi}_i \hat{H}_{emp}(X^{(i)})
◮ Proposed estimator: \bar{H}_{opt} = \sum_{i=1}^{S} \hat{\pi}_i \hat{H}_{opt}(X^{(i)})
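The empirical estimator \bar{H}_{emp} is straightforward to implement from transition counts; a self-contained sketch (the simulated two-state chain is only a sanity check, not from the talk). \hat{H}_{opt} would plug a bias-corrected entropy estimator into the same skeleton:

```python
import numpy as np
from collections import Counter, defaultdict

def empirical_entropy(counts):
    """Plug-in entropy (nats) of the empirical distribution given counts."""
    p = np.array(list(counts), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log(p))

def entropy_rate_emp(traj):
    """Empirical estimator: sum_i pi_hat_i * H_emp(X^(i))."""
    trans = defaultdict(Counter)              # successor counts per state
    for a, b in zip(traj, traj[1:]):
        trans[a][b] += 1
    n_trans = len(traj) - 1
    return sum(sum(succ.values()) / n_trans * empirical_entropy(succ.values())
               for succ in trans.values())

# Sanity check against a chain whose rate we know exactly (illustrative numbers)
rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([2 / 3, 1 / 3])
true_rate = -np.sum(pi[:, None] * P * np.log(P))   # H(X_1 | X_0) in nats

state, traj = 0, []
for _ in range(100_000):
    traj.append(state)
    state = rng.choice(2, p=P[state])

print(f"estimate = {entropy_rate_emp(traj):.4f}, true = {true_rate:.4f}")
```

With only two states the empirical estimator converges quickly; the large-S regime of the talk is exactly where this plug-in approach starts to need n ≍ S² samples.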
Main Results

Empirical estimator \bar{H}_{emp}:
◮ τ_rel = 1: n ≍ S²
◮ τ_rel ∈ (1, Θ(S / log³ S)]: n ≳ S²

Proposed estimator \bar{H}_{opt}:
◮ τ_rel = 1: n ≍ S / log S
◮ τ_rel ∈ (1, 1 + Θ(log² S / √S)): n ≲ S² / log S
◮ τ_rel ∈ [1 + Θ(log² S / √S), Θ(S / log³ S)]: n ≍ S² / log S

For a wide range of τ_rel, the sample complexity does not depend on τ_rel.
Application: Fundamental Limits of Language Models

Figure: Estimated and achieved fundamental limits of language modeling — estimated conditional entropy \hat{H}(X_k | X_1^{k-1}) versus memory length k on PTB and 1BW, with the best known models for PTB and 1BW marked.

◮ Penn Treebank (PTB): 1.50 vs. 5.96 bits per word
◮ Google's One Billion Words (1BW): 3.46 vs. 4.55 bits per word
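The curves in the figure come from estimating H(X_k | X_1^{k-1}) from k-gram statistics of a corpus. A toy sketch using the plain plug-in estimate (the talk's curves use the bias-corrected estimator, and the toy corpus below is invented):

```python
from collections import Counter
from math import log2

def cond_entropy_emp(words, k):
    """Plug-in estimate of H(X_k | X_1^{k-1}) in bits per word,
    from k-gram and (k-1)-gram counts over the same positions."""
    positions = range(len(words) - k + 1)
    kgrams = Counter(tuple(words[i:i + k]) for i in positions)
    ctx = Counter(tuple(words[i:i + k - 1]) for i in positions)
    n = len(words) - k + 1
    return -sum(c / n * log2(c / ctx[g[:-1]]) for g, c in kgrams.items())

# Invented toy corpus; the real curves would use PTB or 1BW text
text = ("the cat sat on the mat and the dog sat on the rug " * 50).split()
for k in [1, 2, 3]:
    print(f"k = {k}: conditional entropy ≈ {cond_entropy_emp(text, k):.3f} bits")
```

Longer contexts give lower conditional entropy, which is the decay the figure shows; the gap down to the estimated limit is the headroom left for language models.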