  1. Data Amplification: Instance-Optimal Property Estimation. Yi Hao and Alon Orlitsky, {yih179, alon}@ucsd.edu

  2. Outline: Definitions; Estimators; Prior results; Data amplification; Example: Shannon entropy; Ideas to take away (instance-optimal algorithms, data amplification)

  3. Definitions

  4. Discrete Distributions
     - Discrete support set X, e.g. {heads, tails} = {h, t}, or {..., −1, 0, 1, ...} = Z
     - Distribution p over X: probability p_x for each x ∈ X, with p_x ≥ 0 and ∑_{x ∈ X} p_x = 1; e.g. p = (p_h, p_t) with p_h = .6, p_t = .4
     - P: a collection of distributions; P_X: all distributions over X, e.g. P_{h,t} = {(p_h, p_t)} = {(.6, .4), (.4, .6), (.5, .5), (0, 1), ...}

  5. Distribution Property: f : P → R, mapping a distribution to a real value
     - Shannon entropy: H(p) = ∑_x p_x log(1/p_x)
     - Rényi entropy: H_α(p) = (1/(1−α)) log(∑_x p_x^α)
     - Support size: S(p) = ∑_x 1_{p_x > 0}
     - Support coverage: S_m(p) = ∑_x (1 − (1 − p_x)^m), the expected # of distinct symbols in m samples
     - Distance to fixed q: L_q(p) = ∑_x |p_x − q_x|
     - Highest probability: max(p) = max{p_x : x ∈ X}
     - ... Many applications
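
To make these definitions concrete, here is a minimal Python sketch (not from the slides) that evaluates a few of the listed properties for an explicitly known distribution; the function names are illustrative.

```python
import numpy as np

# A minimal sketch (not from the slides): evaluating some of the above
# properties when the distribution p is known explicitly as a vector.

def shannon_entropy(p):
    """H(p) = sum_x p_x log(1/p_x), in nats."""
    p = p[p > 0]
    return float(np.sum(p * np.log(1.0 / p)))

def renyi_entropy(p, alpha):
    """H_alpha(p) = (1/(1-alpha)) log(sum_x p_x^alpha), for alpha != 1."""
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))

def support_size(p):
    """S(p) = number of symbols with p_x > 0."""
    return int(np.sum(p > 0))

def support_coverage(p, m):
    """S_m(p) = sum_x (1 - (1 - p_x)^m): expected # distinct symbols in m samples."""
    return float(np.sum(1.0 - (1.0 - p) ** m))

def l1_distance(p, q):
    """L_q(p) = sum_x |p_x - q_x|."""
    return float(np.sum(np.abs(p - q)))

p = np.array([0.6, 0.4])           # the (p_h, p_t) example from the slides
print(shannon_entropy(p))           # ~0.673 nats
print(support_coverage(p, m=10))    # close to 2 for this p
```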

  6. Property Estimation
     - Unknown: p ∈ P. Given: a property f and samples X^n ∼ p. Estimate: f(p)
     - Entropy of English words: X = {English words}, p unknown, estimate H(p)
     - # of species in a habitat: X = {bird species}, p unknown, estimate S(p)
     - How to estimate f(p) when p is unknown?

  7. Estimators

  8. Learn from Examples
     - Observe n independent samples X^n = X_1, ..., X_n ∼ p, which reveal information about p
     - Estimate f(p) with an estimator f^est : X^n → R; the estimate for f(p) is f^est(X^n)
     - Simplest estimators?

  9. Empirical (Plug-In) Estimator
     - N_x: # of times x appears in X^n ∼ p; p^emp_x := N_x / n; f^emp(X^n) = f(p^emp(X^n))
     - a.k.a. the MLE estimator in the literature
     - Advantages: plug-and-play (two simple steps), universal (applies to all properties), intuitive and stable
     - Best-known, most-used {distribution, property} estimator. Performance?
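
As an illustration of the two plug-in steps, here is a minimal Python sketch (not from the slides): count the symbols, form p^emp_x = N_x/n, and apply the property f to p^emp. The helper names are illustrative.

```python
import numpy as np
from collections import Counter

# A minimal sketch (not from the slides) of the empirical (plug-in) estimator:
# form p^emp from the counts N_x / n, then plug it into the property f.

def empirical_distribution(samples):
    """p^emp_x = N_x / n for the symbols observed in the samples.
    Note: symbol identities are dropped, which is fine for symmetric
    properties such as entropy or support size, but not for L_q(p)."""
    n = len(samples)
    counts = Counter(samples)
    return np.array([c / n for c in counts.values()])

def plug_in_estimate(samples, f):
    """f^emp(X^n) = f(p^emp(X^n)) for a property f acting on a distribution."""
    return f(empirical_distribution(samples))

# Example: plug-in Shannon entropy estimate from n samples of a known p.
rng = np.random.default_rng(0)
p = np.array([0.6, 0.4])
samples = rng.choice(["h", "t"], size=1000, p=p)
entropy = lambda q: float(np.sum(q * np.log(1.0 / q)))
print(plug_in_estimate(samples, entropy))   # should be near H(p) ~ 0.673
```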

  10. Mean Absolute Error (MAE): a classical alternative to the PAC formulation
     - Absolute error: |f^est(X^n) − f(p)|
     - Mean absolute error: L_{f^est}(p, n) := E_{X^n ∼ p} |f^est(X^n) − f(p)|
     - Worst-case MAE over P: L_{f^est}(P, n) := max_{p ∈ P} L_{f^est}(p, n)
     - Min-max MAE over P: L(P, n) := min_{f^est} L_{f^est}(P, n)
     - MSE: similar definitions, similar results, but slightly more complex expressions
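
Since L_{f^est}(p, n) is an expectation over the sample X^n, for a fixed, known p it can be approximated by Monte Carlo. The sketch below is my own illustration (not from the slides), applied to the plug-in entropy estimator; the trial count and test distribution are arbitrary choices.

```python
import numpy as np

# A minimal sketch (not from the slides): approximating the MAE
# L_{f_est}(p, n) = E |f_est(X^n) - f(p)| by Monte Carlo,
# here for the plug-in entropy estimator on a fixed distribution p.

def entropy(q):
    q = q[q > 0]
    return float(np.sum(q * np.log(1.0 / q)))

def mae_plug_in_entropy(p, n, trials=2000, seed=0):
    rng = np.random.default_rng(seed)
    true_value = entropy(p)
    errors = []
    for _ in range(trials):
        counts = rng.multinomial(n, p)      # N_x for each symbol
        p_emp = counts / n                  # p^emp
        errors.append(abs(entropy(p_emp) - true_value))
    return float(np.mean(errors))

p = np.full(100, 1 / 100)                   # uniform over k = 100 symbols
print(mae_plug_in_entropy(p, n=200))        # plug-in MAE with n = 200 samples
print(mae_plug_in_entropy(p, n=200 * 6))    # roughly n log n samples (log 200 ~ 5.3)
```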

  11. Prior Results

  12. Abbreviation
     - If |X| is finite, write |X| = k and P_X = Δ_k, the k-dimensional standard simplex
     - For support size, use Δ_{≥1/k} := {p : p_x ≥ 1/k or p_x = 0, ∀x}

  13. Prior Work: Empirical and Min-Max MAEs
     References: P03, VV11a/b, WY14/19, JVHW14, AOST14, OSW16, JHW16, ADOS17
     Property (base function): empirical MAE L_{f^emp}(Δ_k, n) vs. min-max MAE L(Δ_k, n)
     - Entropy (p_x log(1/p_x)) [1]: k/n + (log k)/√n vs. k/(n log n) + (log k)/√n
     - Supp. coverage (1 − (1 − p_x)^m) [2]: m exp(−Θ(n/m)) vs. m exp(−Θ(n log n / m))
     - Power sum (p_x^α, α ∈ (0, 1/2]) [3]: k/n^α vs. k/(n log n)^α
     - Power sum (p_x^α, α ∈ (1/2, 1)) [4]: k/n^α + k^{1−α}/√n vs. k/(n log n)^α + k^{1−α}/√n
     - Dist. to fixed q (|p_x − q_x|) [5]: ∑_x (√(q_x/n) ∧ q_x) vs. ∑_x (√(q_x/(n log n)) ∧ q_x)
     - Support size (1_{p_x > 0}) [6]: k exp(−Θ(n/k)) vs. k exp(−Θ(n log n / k))
     ⋆ n improves to n log n when comparing the worst-case performances
     [1] n ≳ k for empirical; n ≳ k/log k for minimax
     [2] k = ∞; n ≳ m for empirical; n ≳ m/log m for minimax
     [3] α ∈ (0, 1/2]: n ≳ k^{1/α} for empirical; n ≳ k^{1/α}/log k and log k ≳ log n for minimax
     [4] α ∈ (1/2, 1): n ≳ k^{1/α} for empirical; n ≳ k^{1/α}/log k for minimax
     [5] additional assumptions required, see JHW18
     [6] consider Δ_{≥1/k} instead of Δ_k; k log k ≳ n ≳ k/log k for minimax

  14. Data Amplification

  15. Beyond the Min-Max Approach
     - The min-max approach is overly pessimistic: practical distributions often possess nice structure and are rarely the worst possible
     ⋆ Derive "competitive" estimators: they require no knowledge of the distribution's structure, yet adapt to the simplicity of the underlying distribution
     ⋆ Achieve n to n log n "amplification": distribution by distribution, the performance of our estimator with n samples is as good as that of the empirical estimator with n log n samples

  16. Instance-Optimal Property Estimation: for a broad class of properties, we derive an "instance-optimal" estimator which does as well with n samples as the empirical estimator would do with n log n samples, for every distribution.

  17. Example: Shannon Entropy

  18. Shannon Entropy
     Theorem 1: There is an estimator f^new such that for any ε ≤ 1, n, and p,
     L_{f^new}(p, n) − L_{f^emp}(p, ε·n·log n) ≲ ε
     Comments:
     - f^new requires only X^n and ε, and runs in near-linear time
     - The log n amplification factor is optimal
     - log n ≥ 10 for n ≥ 22,027: an "order-of-magnitude improvement"
     - ε can be a vanishing function of n
     - For a distribution with finite support size S_p, the bound ε improves to ε ∧ (S_p/n + 1/n^{0.49})
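
Theorem 1 compares f^new at sample size n against the empirical estimator at sample size roughly n log n. The estimator f^new itself is not spelled out on these slides, so the sketch below (my own illustration, not the authors' method) only quantifies the comparison target: it measures the plug-in MAE at n and at n log n samples on one hard instance. The near-uniform p and the sizes k, n are arbitrary illustrative choices.

```python
import numpy as np

# A minimal sketch (not the paper's estimator f^new): it quantifies the
# benchmark in Theorem 1 by measuring the plug-in MAE with n samples and
# with n log n samples on a single hard, near-uniform instance.

def entropy(q):
    q = q[q > 0]
    return float(np.sum(q * np.log(1.0 / q)))

def plug_in_mae(p, n, trials=1000, seed=0):
    rng = np.random.default_rng(seed)
    h = entropy(p)
    errs = [abs(entropy(rng.multinomial(n, p) / n) - h) for _ in range(trials)]
    return float(np.mean(errs))

k, n = 5000, 1000
p = np.full(k, 1 / k)                              # uniform over k symbols
amplified_n = int(n * np.log(n))                   # ~6907 samples
print("plug-in MAE with n samples:      ", plug_in_mae(p, n))
print("plug-in MAE with n log n samples:", plug_in_mae(p, amplified_n))
# Theorem 1 says f^new with n samples matches the second line up to ~epsilon,
# and does so for every p, not just this instance.
```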

  19. Simple Implications
     - The empirical entropy estimator has been studied for a long time (G. A. Miller, "Note on the bias of information estimates", 1955) and is much easier to analyze than minimax estimators
     ⋆ Our result holds at the distribution level, hence strengthens many results derived over the past half-century in a unified manner
     - In the large-alphabet regime n = o(k/log k): L(Δ_k, n) ≤ (1 + o(1)) log(1 + (k−1)/(n log n))

  20. Large-Alphabet Entropy Estimation
     Proof of L_{f^emp}(Δ_k, n) ≤ (1 + o(1)) log(1 + (k−1)/n) for n = o(k):
     - Absolute bias [P03]:
       0 ≤ H(p) − E[H(p^emp)] = E[D_KL(p^emp ‖ p)] ≤ E[log(1 + χ²(p^emp ‖ p))] ≤ log(1 + E[χ²(p^emp ‖ p)]) = log(1 + (k−1)/n)
     - Mean deviation: changing one sample modifies f^emp by ≤ (log n)/n; applying the Efron-Stein inequality gives mean deviation ≤ (log n)/√n
     ⋆ The proof is very simple compared to those of min-max estimators
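
The last equality in the bias chain uses E[χ²(p^emp ‖ p)] = (k−1)/n, where χ²(q ‖ p) = ∑_x (q_x − p_x)²/p_x. Below is a quick Monte Carlo sanity check of that identity (my own illustration, not from the slides; the Dirichlet-drawn p is an arbitrary full-support choice).

```python
import numpy as np

# A minimal sketch (not from the slides) checking the identity used in the
# bias bound above: E[chi^2(p^emp || p)] = (k - 1)/n for a full-support p,
# where chi^2(q || p) = sum_x (q_x - p_x)^2 / p_x.

rng = np.random.default_rng(0)
k, n, trials = 50, 200, 20000
p = rng.dirichlet(np.ones(k))          # a random full-support distribution

chi2_values = []
for _ in range(trials):
    p_emp = rng.multinomial(n, p) / n
    chi2_values.append(np.sum((p_emp - p) ** 2 / p))

print(np.mean(chi2_values))            # Monte Carlo estimate
print((k - 1) / n)                     # exact value: 49/200 = 0.245
```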

  21. Large-Alphabet Entropy Estimation (Cont'd)
     Theorem 1 strengthens the result and yields, for n = o(k/log k),
     L(Δ_k, n) ≤ log(1 + (k−1)/(n log n)) + o(1)
     ⋆ Is this the right expression for entropy estimation?
     - It is meaningful since H(p) can be as large as log k
     - For n = Ω(k/log k), by [VV11a/b, WY14/19, JVHW14],
       L(Δ_k, n) ≍ k/(n log n) + (log k)/√n ≍ log(1 + (k−1)/(n log n)) + o(1)
     - So one should write L(Δ_k, n) in the latter form

  22. Ideas to Take Away

  23. Ideas
     Instance-optimal algorithms:
     - Worst-case algorithm analysis is pessimistic; modern data science calls for instance-optimal algorithms
     - Better performance on easier instances, and real data is often intrinsically simpler
     Data amplification:
     - Designing optimal learning algorithms directly might be hard
     - Instead, find a simple algorithm that works, then emulate its performance with an algorithm that uses fewer samples

  24. Thank you!
