Three Approaches towards Optimal Property Estimation and Testing - PowerPoint PPT Presentation

Three Approaches towards Optimal Property Estimation and Testing Jiantao Jiao (Stanford EE) Joint work with: Yanjun Han, Dmitri Pavlichin, Kartik Venkat, Tsachy Weissman Frontiers in Distribution Testing Workshop, FOCS 2017 Oct. 14th, 2017 1 / 23

Statistical properties Disclaimer: Throughout this talk, n refers to the number of samples, S refer to the alphabet size of a distribution. 1 Shannon entropy: H ( P ) � � S i =1 − p i ln p i . 2 F α ( P ): F α ( P ) � � S i =1 p α i , α > 0 . 3 KL divergence, χ 2 divergence, L 1 distance, Hellinger distance F ( P , Q ) � � S i =1 f ( p i , q i ) for f ( x , y ) = x ln( x / y ) , ( x − y ) 2 / x , | x − y | , ( √ x − √ y ) 2 . 2 / 23

Tolerant testing/learning/estimation We focus on the question: how many samples are needed to achieve accuracy ǫ for estimating these properties from empirical data? Example: L 1 ( P , U S ), U S = (1 / S , 1 / S , . . . , 1 / S ), observe n i.i.d. samples from P ; � S (VV’11, VV’11): exist approach whose error is n ln n when ln S � n � S ; no consistent estimator when n � S S ln S ; � The MLE plug-in L 1 ( ˆ S n when n � S . P n , U S ) achieves error 3 / 23

Tolerant testing/learning/estimation We focus on the question: how many samples are needed to achieve accuracy ǫ for estimating these properties from empirical data? Example: L 1 ( P , U S ), U S = (1 / S , 1 / S , . . . , 1 / S ), observe n i.i.d. samples from P ; � S (VV’11, VV’11): exist approach whose error is n ln n when ln S � n � S ; no consistent estimator when n � S S ln S ; � The MLE plug-in L 1 ( ˆ S n when n � S . P n , U S ) achieves error Effective sample size enlargement Minimax rate-optimal with n samples ⇐ ⇒ MLE with n ln n samples Similar results also hold for Shannon entropy (VV’11, VV’11, VV’13, WY’16, JVHW’15), power sum functional (JVHW’15), R´ enyi entropy estimation (AOST’14), χ 2 , Hellinger, and KL-divergence estimation (HJW’16, BZLV’16), L r norm estimation under Gaussian white noise model (HJMW’17), L 1 distance estimation (JHW’16), etc. except for support size (WY’16) 3 / 23

Effective sample size enlargement E | ˆ R minmax ( F , P , n ) = inf sup F − F ( P ) | ˆ F ( X 1 ,..., X n ) P ∈P E | F ( ˆ R plug-in ( F , P , n ) = sup P n ) − F ( P ) | . P ∈P F ( P ) P R minmax ( F , P , n ) R plug-in ( F , P , n ) S � � 1 S log( S ) S log( S ) � p i log M S + + √ n √ n p i n log( n ) n i =1 S S S � p α 0 < α ≤ 1 F α ( P ) = M S i , 2 ( n log( n )) α n α i =1 S 1 − α S 1 − α S S 1 F α ( P ) , 2 < α < 1 M S ( n log( n )) α + n α + √ n √ n ( n log( n )) − ( α − 1) n − ( α − 1) 1 < α < 3 F α ( P ) , M S 2 1 1 α ≥ 3 F α ( P ) , M S √ n √ n 2 � �� n log( n ) � n , n S − Θ max � S S Se − Θ � { P : min i p i ≥ 1 1 ( p i � = 0) S } S Se i =1 � q i S S S � q i � � � | p i − q i | M S q i ∧ q i ∧ n ln n n i =1 i =1 i =1 4 / 23

Effective sample size enlargement Divergence functions: here P , Q ∈ M S where we have m samples from p and n samples from q . For the Kullback-Leibler and χ 2 divergence estimators we only consider ( P , Q ) ∈ { ( P , Q ) | P , Q ∈ M S , P i Q i ≤ u ( S ) } where u ( S ) is some function of S . F ( P , Q ) R minmax ( F , P , m , n ) R plug-in ( F , P , m , n ) S � � S S � | p i − q i | + min { m , n } log(min { m , n } ) min { m , n } i =1 S � � 1 S S ( √ p i − √ q i ) 2 � 2 min { m , n } log(min { m , n } ) min { m , n } i =1 S � � Su ( S ) log( u ( S )) � u ( S ) Su ( S ) log( u ( S )) � u ( S ) p i S S � D ( P � Q ) = p i log + + + + + + √ m √ n √ m √ n m log( m ) n log( n ) q i m n i =1 p 2 Su ( S ) 2 u ( S ) 3 / 2 Su ( S ) 2 u ( S ) 3 / 2 S u ( S ) u ( S ) χ 2 ( P � Q ) = i � − 1 + + + + √ m √ n √ m √ n q i n log( n ) n i =1 5 / 23

Goal of this talk Understand the mechanism behind the logarithmic sample size enlargement. For what functionals do we have this phenomenon? What concrete algorithms achieve this phenomenon? If there exist multiple approaches, what are their relative advantages and disadvantages? 6 / 23

First approach: Approximation methodology Question Is the enlargement phenomenon caused by the fact that the functionals are permutation invariant (symmetric)? 7 / 23

First approach: Approximation methodology Question Is the enlargement phenomenon caused by the fact that the functionals are permutation invariant (symmetric)? Answer Nope. :) Literature on approximation methodology VV’11 (linear estimator), WY’16, WY’16 JVHW’15, AOST’14, HJW’16, BZLV’16, HJMW’16, JHW’16 7 / 23

Example: L 1 distance estimation Given Q = ( q 1 , q 2 , . . . , q S ), we estimate L 1 ( P , Q ) given i.i.d. samples from P . Theorem (J., Han, Weissman’16) √ √ q i ∧ q i �� S � Suppose ln S � ln n � ln n ln n , S ≥ 2 . Then, i =1 � q i S E P | ˆ � inf sup L − L 1 ( P , Q ) | ≍ q i ∧ n ln n . (1) ˆ L P ∈M S i =1 For the MLE, we have S � q i E P | L 1 ( ˆ � sup P n , Q ) − L 1 ( P , Q ) | ≍ q i ∧ n . (2) P ∈M S i =1 8 / 23

Confidence sets in binomial model: coverage probability ≍ 1 − n − A 0 1 Θ = [0 , 1] n ˆ p ∼ B( n , p )

Confidence sets in binomial model: coverage probability ≍ 1 − n − A ln n n 0 1 Θ = [0 , 1] n ˆ p ∼ B( n , p )

Confidence sets in binomial model: coverage probability ≍ 1 − n − A ln n n 0 1 p < ln n Θ = [0 , 1] ˆ n n ˆ p ∼ B( n , p )

Confidence sets in binomial model: coverage probability ≍ 1 − n − A ∼ ln n n ln n U (ˆ p ) n 0 1 p < ln n Θ = [0 , 1] ˆ n n ˆ p ∼ B( n , p )

Confidence sets in binomial model: coverage probability ≍ 1 − n − A ∼ ln n n ln n U (ˆ p ) n 0 1 p < ln n p > ln n Θ = [0 , 1] ˆ ˆ n n n ˆ p ∼ B( n , p )

Confidence sets in binomial model: coverage probability ≍ 1 − n − A � p ln n ˆ ∼ ln n ∼ n n ln n U (ˆ p ) U (ˆ p ) n 0 1 p < ln n p > ln n Θ = [0 , 1] ˆ ˆ n n n ˆ p ∼ B( n , p ) 9 / 23

Confidence sets in binomial model: coverage probability ≍ 1 − n − A � p ln n ˆ ∼ ln n ∼ n n ln n U (ˆ p ) U (ˆ p ) n 0 1 p < ln n p > ln n Θ = [0 , 1] ˆ ˆ n n n ˆ p ∼ B( n , p ) Theorem (J., Han, Weissman’16) Partition [0 , 1] into finitely number of intervals I i = [ x i , x i +1 ] , x 0 = 0 , n , √ x i +1 − √ x i ≍ � x 1 ≍ ln n ln n n . Then, p ∈ 2 I i with probability 1 − n − A ; 1 if p ∈ I i , then ˆ 2 if ˆ p ∈ I i , then p ∈ 2 I i with probability 1 − n − A ; 3 Those intervals are of the shortest length. 9 / 23

Algorithmic description of Approximation methodology p ′ First conduct sampling splitting, get ˆ p i , ˆ i i.i.d. with distribution 2 n · B( n / 2 , p i ). Suppose q i ∈ I j . For each i do the following: 1 if ˆ p i ∈ I j , compute best polynomial approximation in 2 I j : P K ( x ; q i ) = arg min z ∈ 2 I j || z − q i | − P ( z ) | , max (3) P ∈ Poly K and then estimate | p i − q i | by the unbiased estimator of P K ( p i ; q i ) p ′ using ˆ i ; 2 if ˆ p ′ p i / ∈ I j , estimate | p i − q i | by | ˆ i − q i | ; 3 sum everything up. 10 / 23

Why it works? 1 Suppose ˆ p i ∈ I j . No matter what we use to estimate, one can always assume that p i ∈ 2 I j ; 2 The bias of the MLE is approximately (Strukov and Timan’77) � q i sup || p i − q i | − E | ˆ p i − q i || ≍ q i ∧ n ; (4) p i ∈ 2 I j 3 The bias of the Approximation methodology is approximately (Ditzian and Totik’87) � q i sup || p i − q i | − P K ( p i ; q i ) | ≍ q i ∧ n ln n . (5) p i ∈ 2 I j 4 Permutation invariance does not play a role since we are doing symbol by symbol bias correction; 5 The bias dominates in high dimensions (measure concentration phenomenon). 11 / 23

Properties of the Approximation Methodology 1 Applies to essentially any functional 2 Applies to a wide range of statistical models (binomial, Poisson, Gaussian, etc) 3 Near-linear complexity 4 Explicit polynomial approximation for each different functional 5 Need to tune parameters in practice 12 / 23

Second approach: Local moment matching methodology Motivation Does there exist a single plug-in estimator that can replace the Approximation methodology? 13 / 23

Second approach: Local moment matching methodology Motivation Does there exist a single plug-in estimator that can replace the Approximation methodology? Answer No. For any plug-in rule ˆ P , there exists a fixed Q such that L 1 ( ˆ P , Q ) requires n ≫ S samples to consistently estimate L 1 ( P , Q ), while the S optimal method requires at most n ≫ ln S . 13 / 23

Second approach: Local moment matching methodology Motivation Does there exist a single plug-in estimator that can replace the Approximation methodology? Answer No. For any plug-in rule ˆ P , there exists a fixed Q such that L 1 ( ˆ P , Q ) requires n ≫ S samples to consistently estimate L 1 ( P , Q ), while the S optimal method requires at most n ≫ ln S . Weakened goal What about we only consider permutation invariant functionals? Literature on the local moment matching methodology VV’11 (linear programming), HJW’17 13 / 23

Three Approaches towards Optimal Property Estimation and Testing - PowerPoint PPT Presentation

Three Approaches towards Optimal Property Estimation and Testing Jiantao Jiao (Stanford EE) Joint work with: Yanjun Han, Dmitri Pavlichin, Kartik Venkat, Tsachy Weissman Frontiers in Distribution Testing Workshop, FOCS 2017 Oct. 14th, 2017 1

Toward Computing Towards an Optimal . . . An (Almost) Optimal . . . Minor Problem an Optimal

An Optimal Estimation An Optimal Estimation Approach to AIRS CO Approach to AIRS CO Retrievals

What is the cloud? Property of TalentWise Property of TalentWise Cloud HCM Players Property of

PROPERTY RATES PROPERTY RATES PROPERTY RATES PROPERTY RATES BUFFALO CITY MUNICIPALITY

Towards Optimal Constructions of Towards Optimal Constructions of Dynamically Corrected Quantum

Motion Estimation by Affine Transforms Motion Estimation by Affine Transforms Motion Estimation

Optimal Agents Nick Hay 27th September 2005 1 / 36 Nick Hay Optimal Agents The Optimal Agent

Inverse problems and control optimal in non-linear mechanics C. Stolz 1 2 Introduction

End User Development: Approaches Towards A End User Development: Approaches Towards A Flexible

Towards a Taxonomy of Approaches Towards a Taxonomy of Approaches for for Mining of Source Code

Learning Approaches to Estimate Depth from RGB Lecture 5 What will we learn - Latest Approaches

Treasurers Institute Sun, Nov. 17, 2019 Property Tax Errors Property Tax Errors Property Tax

PT Mega Manunggal Property Tbk 1 PT Mega Manunggal Property Tbk 2 PT Mega Manunggal Property

PT Mega Manunggal Property Tbk 1 PT Mega Manunggal Property Tbk 2 PT Mega Manunggal Property

Towards Optimal Placement Estimating the Effect . . . Precise Formulation of . . . of Bio-Weapon

MLSE Channel Estimation MLSE Channel Estimation MLSE Channel Estimation Parametric or Non-

Eclipse plug-in for the JML interface Marius Hillenbrand ITI Karlsruhe 1 it is THE modern

Theory Plug-in for Rodin 3.0 T.S. Hoang 1 A. Salehi 1 M. Butler 1 L. Voisin 2 1 ECS, University of

Plug-and-Play ADMM and Forward-Backward Splitting Ernest K. Ryu 1 Jialin Liu 1 Sicheng Wang 2

Delegation with Updatable Unambiguous Proofs and PPAD-Hardness Lisa Yang MIT Based on joint

Interval Analysis Without Intervals Paul Taylor Department of Computer Science University of

Extending Apache SpamAssassin Using Plugins Michael Parker ApacheCon 2005 [ Start Slide ]

A Tutorial on Writing (Binary) Bro Plugins Robin Sommer Corelight / International Computer

Strand Persistency Vaibhav Gogte, William Wang $ , Stephan Diestelhorst $ , Peter M. Chen, Satish