 
              On Learning Parametric Non-Smooth Continuous Distributions Sudeep Kamath 1 Alon Orlitsky 2 Venkatadheeraj Pichapati 3 Ehsan Zobeidi 2 1 PDT Partners 2 Department of Electrical and Computer Engineering University of California San Diego 3 Apple Inc.
Motivation Motivation •Learning distribution : Classical problem in statistics •Several applications : •Weather 1 •Finance •Data is •rarely discrete •rarely from smooth class of distributions •Can we learn class of (non-smooth) continuous distributions? 2
Notation Motivation ∞ ∫ • p.d.f. : s.t. f ( x ) f ( x ) dx = 1 −∞ F ( x ) = ∫ x • c.d.f.: f ( x ) −∞ • Continuous distributions: no Dirac delta components in f ( x ) • Parametric distribution: can be defined by a parameter(s) C θ θ f λ ( x ) = λ e − λ x • Eg class of exponential distributions 3
Problem Motivation X n = X 1 , X 2 , X 3 , . . . , X n i.i.d. samples from • n f ( x ) X n • Learn from f ( x ) • Output : p.d.f. g X n ( x ) • Measuring how well approximates ? g ( x ) f ( x ) • Distance function ? D ( f , g X n ) • How to estimate distance over all sequences? : E X n ∼ f D ( f , g X n ) 4
Distance Motivation • Distances between distributions D ℓ 1 ( f , g ) = ∫ ∞ : | f ( x ) − g ( x ) | dx ℓ 1 • x = −∞ 2 ( f , g ) = ∫ ∞ ℓ (2) ( f ( x ) − g ( x )) 2 dx : D ℓ 2 • 2 x = −∞ KL D KL ( f , g ) = D ( f || g ) = ∫ ∞ f ( x )log f ( x ) : g ( x ) dx • x = −∞ • For parametric continuous distributions ℓ 2 • Estimating parameter/s reduces and ℓ 1 2 • KL loss • Applications • compression (information theory) • Machine Learning (log loss) 5
Loss function Motivation • Learning loss : • Average additional bits required to code (n+1)th sample E X n ∼ f D ( f || g X n ) • • Loss over class C θ r n (C θ , g ) = max E X n ∼ f D ( f || g X n ) • f ∈ C θ • Instantaneous redundancy (minimax KL loss) r n (C θ ) = min g r n (C θ , g ) • 6
Cumulative Redundancy Motivation • Compression loss • Additional bits to code X^n n − 1 ∑ R n (C θ ) = min g R n (C θ , g ) = min g max E X n ∼ f D ( f || g X j ) • f ∈ C θ j =0 X n • One can code one by one • Code first sample X 1 X 1 • get an estimate of distribution (using ) and code second sample n − 1 ∑ R n ≤ r i • i =0 7
Gaussian Distributions Motivation • Class of Gaussian distributions with unknown mean and known variance 1 e − ( x − θ )2 f θ ( x ) = 2 • 2 π n 1 ∑ • Estimate mean = (ML estimator, sufficient statistic) X i n i =1 • Output : distribution with estimated mean • Near optimal estimator r n = 1 2 n (1 + o (1)) • • Is it true for any class? 8
Smooth Distributions Motivation • Asymptotic Normality of MLE θ ML ( X n )) → N ( 0, 1 I ( θ ) ) n ( θ − ̂ • • is Fisher Information I ( θ ) ∂ 2 ∂ δ 2 D ( f θ || f θ + δ ) | δ =0 = I ( θ ) • θ ML ) ≈ E[ D ( f θ || f θ ) + ∂ ∂ δ D ( f θ || f θ + δ ) | δ =0 ( ̂ θ ML − θ )] E D ( f θ || f ̂ • + ∂ 2 ∂ δ 2 D ( f θ || f θ + δ ) | δ =0 ( ̂ θ ML − θ ) 2 ] 2 I ( θ ) n = 1 1 = I ( θ ) 2 n 9
Smooth distributions Motivation • Lower bound 1 1 • If parameter can be estimated to , n α lim sup r n ≥ α • n • For smooth distributions it has actually been shown that r n ≈ 1 # parameters 2 n ( ) • • How about non-smooth distributions? • distributions with no Fisher Information? 1 A. Barron, N. Hengartner et al., “Information theory and superefficiency,” The Annals of Statistics, vol. 26, no. 5, pp. 1800–1825, 1998. 10
Uniform distributions Motivation • Class of uniform distributions f θ ( x ) = 1 θ 1 0 ≤ x ≤ θ • ≈ 1 max( X n ) • ML estimator : (Estimates to an accuracy of θ n • KL loss for plug in ML estimator : infinite ℓ 2 and losses are still finite • ℓ 1 2 max( X n ) • Output estimator should provide mass even after • How can we allocate probability? Can be finite? r n 11
Prior Motivation • To derive estimator • Consider Pareto distribution Π on θ • There exists a closed form solution for arg min g E θ ∼Π D ( f , g X n ) • g x n ( x ) = f ( x | x n , θ ∼ Π ) • • Further r n ≥ min g E θ ∼Π D ( f , g X n ) 12
r_n for Uniform class Motivation n max( X n ) • Allocates mass uniformly till n + 1 1 1 max( X n ) • Remaining falls pollynomially ( ) after x n +1 n + 1 • This estimator incurs same loss over all uniform distributions • Hence upper and lower bound matches pdf r n ≈ 1 1.2 • n 1.0 1 0.8 • Here one parameter leads Original n 0.6 Estimator 1 0.4 • recall for smooth class 2 n 0.2 x 0.2 0.4 0.6 0.8 1.0 1.2 13
Uniform with 2 parameters Motivation • Class of uniform distributions with both ends unknown • Similar technique to derive optimal estimator n − 2 min( X n ) max( X n ) • Allocates uniformly between and n 1 1 max( X n ) probability falling pollynomially ( ) after • ( x − min( X n )) n − 1 n 1 1 min( X n ) probability falling pollynomially ( ) before • pdf (max( X n ) − x ) n − 1 n 0.6 r n ≈ 2 1 ( loss per parameter) 0.5 • n n 0.4 Original 0.3 Estimator 0.2 0.1 x 14 - 1.0 - 0.5 0.5 1.0
Uniform with fixed width Motivation • Class of uniform distributions with known width but unknown start point f θ ( x ) = 1 θ ≤ x ≤ θ +1 • • Once again optimal estimator derived using “prior” technique • Optimal estimator min( X n ) max( X n ) • Allocates p.d.f. of 1 between and max( X n ) 1 + min( X n ) • p.d.f. falls linearly between and min( X n ) max( X n ) − 1 • p.d.f. falls linearly between and r n ≈ 1 • n 15
Truncated distributions Motivation • Consider any continuous distribution f • Truncated class: Class of distributions generated by truncating f at θ pdf f θ ( x ) = f ( x ) 0.6 F ( θ ) 1 x ≤ θ • 0.5 • No Fisher information for this class either 0.4 Original 0.3 • Transformation y = F ( x ) Estimator 0.2 • Maps this class to class of uniform distributions 0.1 • Optimal estimator already known x - 2 - 1 1 2 x = F − 1 ( y ) • Map back to using transformation x r n ≈ 1 • n 16
r n in general? Motivation 1 • Is always per parameter? r n n • Consider triangle distribution pdf • Fisher Information doesn’t exist 2.0 • Looks smoother than uniform 1.5 1 1 Original • Is loss or ? 1.0 Estimator n 2 n 0.5 x 0.2 0.4 0.6 0.8 1.0 1.2 1.4 17
Scaled Distributions Motivation • Triangle distribution can in fact be seen as a scaled distribution • Consider a p.d.f. with all mass between 0 and 1 • One can scale (stretch) the distribution f θ ( x ) = 1 θ f ( x θ ) • • Pareto distribution is again “least favorable” prior • Optimal estimator can be derived: x i ∞ 0 ∏ n +1 1 θ ) d θ ∫ θ f ( i =0 θ g x n ( x n +1 ) = ∞ x i 0 ∏ n 1 θ ) d θ ∫ θ f ( i =0 θ ∀ x ∈ (0,1), f ( x ) ≠ 0, f (1 − ) ≠ 0, f ′ � (1 − ) • If is finite r n ≈ 1 • n • Recovers for class of uniform distributions starting at 0 r n 18
Scaled Distributions Motivation f (1 − ) = 0 • For triangle distribution, • Previous result doesn’t apply • Calculating is tricky r n • Derived bounds on which suggest bounds on R n r n • lim r n = 0 ≥ 1 • Bounds suggest for triangle distributions can be and r n 2 n 3 2 − π 4 ≈ 0.715 < 1 ≤ n n 19
Future Work Motivation • Establishing r n • for scaled • other classes of distributions 20
Recommend
More recommend