Data Amplification: Instance-Optimal Property Estimation
Yi Hao and Alon Orlitsky
{yih179, alon}@ucsd.edu
Outline
– Definitions
– Estimators
– Prior results
– Data amplification
– Example: Shannon entropy
Ideas to take away:
– Instance-optimal algorithm
– Data amplification
Discrete support set X
– e.g., {heads, tails} = {h, t}, or {..., −1, 0, 1, ...} = Z
Distribution p over X, with probability px for each x ∈ X
– px ≥ 0, ∑x∈X px = 1
– e.g., p = (ph, pt) with ph = .6, pt = .4
P: a collection of distributions; PX: all distributions over X
– e.g., P{h, t} = {(ph, pt)} = {(.6, .4), (.4, .6), (.5, .5), (0, 1), ...}
f : P → R maps a distribution to a real value
– Shannon entropy: H(p) = ∑x px log(1/px)
– Rényi entropy: Hα(p) = (1/(1−α)) log(∑x px^α)
– Support size: S(p) = ∑x 1{px > 0}
– Support coverage: Sm(p) = ∑x (1 − (1 − px)^m), the expected # of distinct symbols in m samples
– Distance to a fixed q: Lq(p) = ∑x ∣px − qx∣
– Highest probability: max(p) = max{px : x ∈ X}
– ... many applications
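As a concrete reference, here is a minimal Python sketch (ours, not from the slides) that evaluates a few of these properties for an explicitly known p; the function names are illustrative.

```python
import numpy as np

def shannon_entropy(p):
    """H(p) = sum_x px * log(1/px), in nats."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

def support_size(p):
    """S(p) = number of symbols with px > 0."""
    return int((np.asarray(p) > 0).sum())

def support_coverage(p, m):
    """Sm(p) = expected number of distinct symbols in m samples."""
    p = np.asarray(p, dtype=float)
    return float((1.0 - (1.0 - p) ** m).sum())

# Example: the (.6, .4) coin from the earlier slide
p = [0.6, 0.4]
print(shannon_entropy(p))      # ≈ 0.673 nats
print(support_size(p))         # 2
print(support_coverage(p, 3))  # ≈ 1.72 expected distinct symbols in 3 flips
```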
Unknown: p ∈ P. Given: property f and samples X^n ∼ p. Estimate: f(p)
Entropy of English words
Given: X = {English words}, unknown: p, estimate: H(p)
# species in habitat
Given: X = {bird species}, unknown: p, estimate: S(p)
How to estimate f(p) when p is unknown?
Observe n independent samples X^n = X1, ..., Xn ∼ p
– reveal information about p; estimate f(p)
Estimator: fest : X^n → R
Estimate for f(p): fest(X^n)
Simplest estimators?
Nx: # times x appears in X^n ∼ p
pemp_x := Nx / n (the empirical distribution)
femp(X^n) = f(pemp(X^n))
a.k.a. the MLE estimator in the literature
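A minimal sketch of this plug-in recipe, assuming a finite sample given as a list of symbols; the helper names are ours, not the talk's.

```python
import numpy as np
from collections import Counter

def empirical_distribution(samples):
    """pemp_x = Nx / n: the fraction of the sample equal to x."""
    n = len(samples)
    return {x: c / n for x, c in Counter(samples).items()}

def plug_in_entropy(samples):
    """femp(X^n) = f(pemp): plug the empirical distribution into H."""
    pemp = empirical_distribution(samples)
    return float(-sum(q * np.log(q) for q in pemp.values()))

# Example: n = 1000 samples from the (.6, .4) coin
rng = np.random.default_rng(0)
samples = rng.choice(["h", "t"], size=1000, p=[0.6, 0.4])
print(plug_in_entropy(samples))   # should land near H(p) ≈ 0.673
print(len(set(samples)))          # plug-in estimate of the support size
```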
Advantages
– plug-and-play: two simple steps
– universal: applies to all properties
– intuitive and stable
Best-known, most-used {distribution, property} estimator. Performance?
Classical alternative to the PAC formulation
– Absolute error: ∣fest(X^n) − f(p)∣
– Lfest(p, n) := E_{X^n∼p} ∣fest(X^n) − f(p)∣, the mean absolute error (MAE)
– Lfest(P, n) := max_{p∈P} Lfest(p, n), the worst-case MAE over P
– L(P, n) := min_{fest} Lfest(P, n), the min-max MAE over P
MSE – similar definitions, similar results, but slightly more complex expressions
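A rough Monte Carlo sketch (ours, not the slides') of Lfemp(p, n) for Shannon entropy, i.e., the mean absolute error of the empirical estimator on one fixed instance p; the trial count and seed are arbitrary.

```python
import numpy as np

def entropy_of(p):
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

def empirical_entropy(samples, k):
    counts = np.bincount(samples, minlength=k)
    return entropy_of(counts / counts.sum())

def mae_empirical_entropy(p, n, trials=500, seed=0):
    """Monte Carlo estimate of Lfemp(p, n) = E |femp(X^n) - H(p)|."""
    rng = np.random.default_rng(seed)
    p = np.asarray(p, dtype=float)
    k = len(p)
    true_h = entropy_of(p)
    errs = [abs(empirical_entropy(rng.choice(k, size=n, p=p), k) - true_h)
            for _ in range(trials)]
    return float(np.mean(errs))

# Example: MAE of the empirical entropy estimator, uniform p over 100 symbols
p = np.full(100, 1 / 100)
print(mae_empirical_entropy(p, n=200))
```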
If ∣X∣ is finite, write ∣X∣ = k and PX = ∆k, the standard simplex of all distributions over k symbols
∆≥1/k := {p : px ≥ 1/k or px = 0, ∀x}, used for support size
References: P03, VV11a/b, WY14/19, JVHW14, AOST14, OSW16, JHW16, ADOS17
Property | Base function | Lfemp(∆k, n) | L(∆k, n)
Entropy [1] | px log(1/px) | k/n + log k/√n | k/(n log n) + log k/√n
Support coverage [2] | 1 − (1 − px)^m | m exp(−Θ(n/m)) | m exp(−Θ(n log n / m))
Power sum [3], α ∈ (0, 1/2] | px^α | k/n^α | k/(n log n)^α
Power sum [4], α ∈ (1/2, 1) | px^α | k/n^α + k^(1−α)/√n | k/(n log n)^α + k^(1−α)/√n
Distance to fixed q [5] | ∣px − qx∣ | ∑x qx ∧ √(qx/n) | ∑x qx ∧ √(qx/(n log n))
Support size [6] | 1{px > 0} | k exp(−Θ(n/k)) | k exp(−Θ(√(n log n / k)))

⋆ n → n log n when comparing the worst-case performances

[1] n ≳ k for empirical; n ≳ k/log k for minimax
[2] k = ∞; n ≳ m for empirical; n ≳ m/log m for minimax
[3] α ∈ (0, 1/2]: n ≳ k^(1/α) for empirical; n ≳ k^(1/α)/log k and log k ≳ log n for minimax
[4] α ∈ (1/2, 1): n ≳ k^(1/α) for empirical; n ≳ k^(1/α)/log k for minimax
[5] additional assumptions required, see JHW18
[6] consider ∆≥1/k instead of ∆k; k log k ≳ n ≳ k/log k for minimax
The min-max approach is overly pessimistic for practical distributions
⋆ Derive “competitive” estimators
– need no knowledge of the distribution's structure, yet adapt to the simplicity of the underlying distribution
⋆ Achieve n to nlog n “amplification”
– distribution by distribution, the performance of our estimator with n samples is as good as that of the empirical estimator with n log n samples
For a broad class of properties, we derive an "instance-optimal" estimator which does as well with n samples as the empirical estimator would do with n log n samples, for every distribution.
Theorem 1. There is an estimator fnew such that for any ε ≤ 1, n, and p,
Lfnew(p, n) − Lfemp(p, ε n log n) ≲ ε
Comments
– fnew requires only X^n and ε, and runs in near-linear time
– the log n amplification factor is optimal
– log n ≥ 10 for n ≥ 22,027 (natural log): an "order-of-magnitude improvement"
– ε can be a vanishing function of n
– if p has finite support size Sp, then ε improves to ε ∧ (Sp/n + 1/n^0.49)
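To visualize the benchmark Theorem 1 targets, the sketch below (ours) compares the empirical entropy estimator's error with n samples against its error with roughly n log n samples on one large-alphabet instance. It only shows the size of the gap that amplification closes; it does not implement fnew.

```python
import numpy as np

def emp_entropy_mae(p, n, trials=200, seed=0):
    """MAE of the empirical (plug-in) entropy estimator on p, via simulation."""
    rng = np.random.default_rng(seed)
    k, true_h = len(p), float(-(p * np.log(p)).sum())
    errs = []
    for _ in range(trials):
        q = np.bincount(rng.choice(k, size=n, p=p), minlength=k) / n
        q = q[q > 0]
        errs.append(abs(float(-(q * np.log(q)).sum()) - true_h))
    return float(np.mean(errs))

k = 2000
p = np.full(k, 1 / k)              # one large-alphabet instance (uniform)
for n in [200, 500, 1000]:
    n_amp = int(n * np.log(n))     # the "amplified" sample size
    print(n, emp_entropy_mae(p, n), emp_entropy_mae(p, n_amp))
```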
Empirical entropy estimator
– has been studied for a long time
– much easier to analyze compared to minimax estimators
⋆ Our result holds at the distribution level, and hence strengthens many results derived over the past half-century, in a unified manner
– large-alphabet regime n = o(k/log k): L(∆k, n) ≤ (1 + o(1)) log(1 + (k−1)/(n log n))
Proof of Lfemp(∆k, n) ≤ (1 + o(1)) log(1 + (k−1)/n) for n = o(k)
– absolute bias [P03]:
0 ≤ H(p) − E H(pemp) = E DKL(pemp ∥ p) ≤ E log(1 + χ²(pemp ∥ p)) ≤ log(1 + E χ²(pemp ∥ p)) = log(1 + (k−1)/n)
– mean deviation: changing a single sample modifies femp by at most (log n)/n; applying the Efron-Stein inequality gives mean deviation ≤ (log n)/√n
⋆ The proof is very simple compared to that of min-max estimators
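A quick numerical sanity check (ours, not part of the slides) of the bias chain above: the simulated bias H(p) − E H(pemp) should stay below log(1 + (k−1)/n). The choices of k, n, and the random p are arbitrary.

```python
import numpy as np

# Check: 0 <= H(p) - E[H(pemp)] <= log(1 + (k-1)/n)
rng = np.random.default_rng(1)
k, n, trials = 500, 100, 3000
p = rng.dirichlet(np.ones(k))                 # an arbitrary distribution on k symbols
true_h = float(-(p * np.log(p)).sum())
biases = []
for _ in range(trials):
    q = np.bincount(rng.choice(k, size=n, p=p), minlength=k) / n
    q = q[q > 0]
    biases.append(true_h + float((q * np.log(q)).sum()))   # H(p) - H(pemp)
print(np.mean(biases), np.log(1 + (k - 1) / n))             # bias vs. the bound
```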
Theorem 1 strengthens the result and yields, for n = o(k/log n), L(∆k, n) ≤ log(1 + (k−1)/(n log n)) + o(1)
⋆ Right expression for entropy estimation?
– meaningful, since H(p) can be as large as log k
– for n = Ω(k/log k), by [VV11a/b, WY14/19, JVHW14],
L(∆k, n) ≍ k/(n log n) + log k/√n ≍ log(1 + (k−1)/(n log n)) + o(1)
– should write L(∆k,n) in the latter form
Instance-optimal algorithm
– worst-case algorithm analysis is pessimistic
– modern data science calls for instance-optimal algorithms
– better performance on easier instances: data is intrinsically simpler
Data amplification
– designing optimal learning algorithms directly might be hard
– instead, find a simple algorithm that works
– emulate its performance by an algorithm that uses fewer samples