Entropy Rate Estimation for Markov Chains with Large State Space

Yanjun Han (Stanford EE), Jiantao Jiao (Berkeley EECS), Chuan-Zheng Lee (Stanford EE), Tsachy Weissman (Stanford EE), Yihong Wu (Yale Stats), Tiancheng Yu (Tsinghua EE)

NIPS 2018, Montréal, Canada


Entropy Rate Estimation

Entropy rate of a stationary process $\{X_t\}_{t=1}^{\infty}$:

$$\bar{H} \triangleq \lim_{n\to\infty} \frac{H(X^n)}{n}, \qquad H(X^n) = \sum_{x^n \in \mathcal{X}^n} p_{X^n}(x^n) \log \frac{1}{p_{X^n}(x^n)}.$$

◮ fundamental limit of the expected logarithmic loss when predicting the next symbol given all past symbols
◮ fundamental limit of data compression for stationary stochastic processes

Our Task
Given a length-$n$ trajectory $\{X_t\}_{t=1}^{n}$, estimate $\bar{H}$.
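A quick illustration of the definition (our own sketch, not from the talk): for a small two-state stationary Markov chain we can evaluate $H(X^n)/n$ by brute-force enumeration and watch it approach $\bar{H}$ from above. All names here are ours.

```python
import itertools
import numpy as np

# Example transition matrix P[i, j] = Pr(X_{t+1} = j | X_t = i).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Stationary distribution: solve pi P = pi with the constraint sum(pi) = 1.
A = np.vstack([P.T - np.eye(2), np.ones(2)])
pi, *_ = np.linalg.lstsq(A, np.array([0.0, 0.0, 1.0]), rcond=None)

def block_entropy(n):
    """H(X^n) in nats, by summing over all 2^n length-n sequences."""
    h = 0.0
    for seq in itertools.product(range(2), repeat=n):
        p = pi[seq[0]]
        for cur, nxt in zip(seq, seq[1:]):
            p *= P[cur, nxt]
        if p > 0:
            h -= p * np.log(p)
    return h

# For a stationary process, H(X^n)/n is non-increasing in n and tends to
# the entropy rate, so the printed values decrease toward the limit.
for n in (1, 2, 4, 8, 12):
    print(n, block_entropy(n) / n)
```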


From Entropy to Entropy Rate

Theorem (Jiao–Venkat–Han–Weissman '15, Wu–Yang '16)
For discrete entropy estimation with support size $S$, consistent estimation is possible if and only if $n \gg S/\log S$.

Sample Complexity

◮ i.i.d. process: $n \asymp S/\log S$
◮ constant process: $n \asymp \infty$
◮ Markov chain with large state space: $n \asymp$ ?
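As a sense of scale (our own arithmetic, not from the slides), the $\log S$ saving in the theorem is substantial for large alphabets:

```python
import math

# Threshold n ~ S / log S (up to constants) vs. the naive n ~ S.
for S in (10**3, 10**6, 10**9):
    print(f"S = {S:>13,}   S/log(S) = {S / math.log(S):>16,.0f}")
```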


Assumption

The data-generating process $\{X_t\}_{t=1}^{n}$ is a reversible first-order Markov chain with relaxation time $\tau_{\mathrm{rel}}$.

◮ The relaxation time $\tau_{\mathrm{rel}} = (\text{spectral gap})^{-1} \ge 1$ characterizes the mixing time of the Markov chain.
◮ High-dimensional setting: the state space size $S = |\mathcal{X}|$ is large and may scale with $n$.
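For concreteness, here is a minimal sketch (ours, using the common convention that $\tau_{\mathrm{rel}}$ is the reciprocal of the absolute spectral gap) of computing the relaxation time of a known reversible transition matrix:

```python
import numpy as np

def relaxation_time(P):
    """1 / (absolute spectral gap) of a reversible transition matrix P.

    Assumes P is irreducible and reversible, so D^{1/2} P D^{-1/2}
    (D = diag(pi)) is symmetric with real eigenvalues.
    """
    S = P.shape[0]
    # Stationary distribution: solve pi P = pi with sum(pi) = 1.
    A = np.vstack([P.T - np.eye(S), np.ones(S)])
    b = np.zeros(S + 1); b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)

    d = np.sqrt(pi)
    sym = (d[:, None] * P) / d[None, :]          # symmetric, similar to P
    lam = np.sort(np.abs(np.linalg.eigvalsh(sym)))[::-1]
    return 1.0 / (1.0 - lam[1])                  # lam[0] == 1 (Perron root)

# Lazy two-state chain: eigenvalues 1 and 0.97, so tau_rel is about 33.
P = np.array([[0.99, 0.01],
              [0.02, 0.98]])
print(relaxation_time(P))
```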


Estimators

For a first-order Markov chain,

$$\bar{H} = H(X_1 \mid X_0) = \sum_{i=1}^{S} \pi_i \, H(X_1 \mid X_0 = i),$$

where $\pi$ is the stationary distribution and $H(X_1 \mid X_0 = i)$ is the conditional entropy of the next state given the current state $i$.

◮ Estimate of $\pi_i$: empirical frequency $\hat{\pi}_i$ of state $i$
◮ Estimate of $H(X_1 \mid X_0 = i)$: estimate the discrete entropy from the samples $X^{(i)} = \{X_j : X_{j-1} = i\}$

Estimators
◮ Empirical estimator: $\bar{H}_{\mathrm{emp}} = \sum_{i=1}^{S} \hat{\pi}_i \, \hat{H}_{\mathrm{emp}}(X^{(i)})$
◮ Proposed estimator: $\bar{H}_{\mathrm{opt}} = \sum_{i=1}^{S} \hat{\pi}_i \, \hat{H}_{\mathrm{opt}}(X^{(i)})$
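A minimal sketch (ours, not the authors' code) of the empirical estimator: count transitions along the trajectory, take $\hat{\pi}_i$ from the row counts, and use the plug-in entropy of each empirical row for $\hat{H}_{\mathrm{emp}}(X^{(i)})$. The proposed $\bar{H}_{\mathrm{opt}}$ has the same outer structure but replaces the plug-in step with a minimax rate-optimal entropy estimator in the spirit of Jiao–Venkat–Han–Weissman '15 or Wu–Yang '16, which we do not reproduce here.

```python
import numpy as np

def plugin_entropy(w):
    """Plug-in Shannon entropy, in nats, of a count (or probability) vector."""
    p = w[w > 0] / w.sum()
    return float(-(p * np.log(p)).sum())

def entropy_rate_emp(traj, S):
    """H_emp = sum_i pi_hat_i * plugin_entropy(empirical row i)."""
    N = np.zeros((S, S))
    for a, b in zip(traj, traj[1:]):          # N[i, j] = #{t: X_t=i, X_{t+1}=j}
        N[a, b] += 1
    row = N.sum(axis=1)
    pi_hat = row / row.sum()                  # empirical state frequencies
    return sum(pi_hat[i] * plugin_entropy(N[i])
               for i in range(S) if row[i] > 0)

# Simulate a chain and compare against the exact rate H = sum_i pi_i H(P[i]).
rng = np.random.default_rng(0)
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])
x = [0]
for _ in range(100_000):
    x.append(rng.choice(2, p=P[x[-1]]))
print("estimate:", entropy_rate_emp(x, S=2))

pi = np.array([2/3, 1/3])                     # stationary for this P: pi P = pi
print("truth:   ", sum(pi[i] * plugin_entropy(P[i]) for i in range(2)))
```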


Main Results

Sample complexity as a function of the relaxation time $\tau_{\mathrm{rel}}$:

Empirical estimator $\bar{H}_{\mathrm{emp}}$
◮ $\tau_{\mathrm{rel}} = 1$: $n \asymp S^2$
◮ $\tau_{\mathrm{rel}} = \Theta(S/\log^3 S)$: $n \gtrsim S^2$

Proposed estimator $\bar{H}_{\mathrm{opt}}$
◮ $\tau_{\mathrm{rel}} = 1$: $n \asymp S/\log S$
◮ $1 + \Theta(\log^2 S/\sqrt{S}) \lesssim \tau_{\mathrm{rel}} \lesssim \Theta(S/\log^3 S)$: $n \asymp S^2/\log S$

For a wide range of $\tau_{\mathrm{rel}}$, the sample complexity does not depend on $\tau_{\mathrm{rel}}$.

Application: Fundamental Limits of Language Models

[Figure: estimated conditional entropy $\hat{H}(X_k \mid X_1^{k-1})$ vs. memory length $k$ on PTB and 1BW, with the best known model for each corpus marked. Caption: Estimated and achieved fundamental limits of language modeling.]

◮ Penn Treebank (PTB): 1.50 vs. 5.96 bits per word
◮ Google's One Billion Words (1BW): 3.46 vs. 4.55 bits per word