

SLIDE 1

On Learning Parametric Non-Smooth Continuous Distributions

Sudeep Kamath¹, Alon Orlitsky², Venkatadheeraj Pichapati³, Ehsan Zobeidi²

¹ PDT Partners
² Department of Electrical and Computer Engineering, University of California San Diego
³ Apple Inc.

SLIDE 2

Motivation

  • Learning a distribution: a classical problem in statistics
  • Several applications:
    • weather
    • finance
  • Data is:
    • rarely discrete
    • rarely from a smooth class of distributions
  • Can we learn a class of (non-smooth) continuous distributions?

SLIDE 3

Notation

  • p.d.f.: $f(x)$ s.t. $\int_{-\infty}^{\infty} f(x)\,dx = 1$
  • c.d.f.: $F(x) = \int_{-\infty}^{x} f(t)\,dt$
  • Continuous distributions: no Dirac delta components in $f(x)$
  • Parametric distribution: $f(x)$ can be defined by a parameter (or parameters) $\theta$; the class is denoted $\mathcal{C}_\theta$
  • E.g., the class of exponential distributions $f_\lambda(x) = \lambda e^{-\lambda x}$

SLIDE 4

Problem

  • $n$ i.i.d. samples $X^n = X_1, X_2, X_3, \ldots, X_n$ from $f(x)$
  • Learn $f(x)$ from $X^n$
  • Output: p.d.f. $g_{X^n}(x)$
  • How well does $g_{X^n}(x)$ approximate $f(x)$?
  • Distance function $D(f, g_{X^n})$?
  • How to evaluate the distance over all sequences: $\mathbb{E}_{X^n \sim f}\, D(f, g_{X^n})$
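To make this pipeline concrete, here is a minimal Monte Carlo sketch of $\mathbb{E}_{X^n \sim f}\, D(f, g_{X^n})$; the exponential family, the plug-in ML estimator, and all function names are illustrative assumptions, not the estimators studied in these slides.

```python
# Sketch: estimate E_{X^n ~ f} D(f || g_{X^n}) by Monte Carlo for
# f = Exp(lam) and the plug-in ML estimator g = Exp(lam_hat).
import numpy as np

rng = np.random.default_rng(0)

def kl_exp(lam, mu):
    # Closed form: D(Exp(lam) || Exp(mu)) = log(lam/mu) + mu/lam - 1
    return np.log(lam / mu) + mu / lam - 1.0

def expected_kl(lam=2.0, n=100, trials=10_000):
    losses = np.empty(trials)
    for t in range(trials):
        x = rng.exponential(scale=1.0 / lam, size=n)
        lam_hat = 1.0 / x.mean()          # ML estimate of the rate
        losses[t] = kl_exp(lam, lam_hat)
    return losses.mean()

print(expected_kl())  # close to 1/(2n) = 0.005 for this smooth class
```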

SLIDE 5

Distance

  • Distances between distributions:
    • $\ell_1$: $D_{\ell_1}(f, g) = \int_{-\infty}^{\infty} |f(x) - g(x)|\,dx$
    • $\ell_2^2$: $D_{\ell_2^2}(f, g) = \int_{-\infty}^{\infty} (f(x) - g(x))^2\,dx$
    • KL: $D_{KL}(f, g) = D(f \| g) = \int_{-\infty}^{\infty} f(x) \log \frac{f(x)}{g(x)}\,dx$
  • For parametric continuous distributions, estimating the parameter(s) reduces the $\ell_1$ and $\ell_2^2$ losses
  • KL loss applications:
    • compression (information theory)
    • machine learning (log loss)
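For reference, a hedged numerical sketch of the three distances on a grid; the grid bounds, the resolution, and the gauss helper are assumptions for illustration.

```python
# Sketch: the three distances via a simple Riemann sum, assuming both
# densities live (effectively) on [lo, hi].
import numpy as np

def distances(f, g, lo=-10.0, hi=10.0, m=200_001):
    x = np.linspace(lo, hi, m)
    dx = x[1] - x[0]
    fx, gx = f(x), g(x)
    l1 = np.sum(np.abs(fx - gx)) * dx
    l2sq = np.sum((fx - gx) ** 2) * dx
    ratio = np.where(fx > 0, fx / np.maximum(gx, 1e-300), 1.0)
    kl = np.sum(np.where(fx > 0, fx * np.log(ratio), 0.0)) * dx
    return l1, l2sq, kl

gauss = lambda mu: (lambda x: np.exp(-(x - mu) ** 2 / 2) / np.sqrt(2 * np.pi))
print(distances(gauss(0.0), gauss(0.5)))  # KL should be 0.5**2 / 2 = 0.125
```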

SLIDE 6

Loss function

  • Learning loss: $\mathbb{E}_{X^n \sim f}\, D(f \| g_{X^n})$
  • Average additional bits required to code the $(n+1)$-th sample
  • Loss over class $\mathcal{C}_\theta$: $r_n(\mathcal{C}_\theta, g) = \max_{f \in \mathcal{C}_\theta} \mathbb{E}_{X^n \sim f}\, D(f \| g_{X^n})$
  • Instantaneous redundancy (minimax KL loss): $r_n(\mathcal{C}_\theta) = \min_g r_n(\mathcal{C}_\theta, g)$

SLIDE 7

Cumulative Redundancy

  • Compression loss: additional bits to code $X^n$
  • One can code $X^n$ one by one:
    • code the first sample $X_1$
    • get an estimate of the distribution (using $X_1$) and code the second sample, and so on
  • $R_n(\mathcal{C}_\theta) = \min_g R_n(\mathcal{C}_\theta, g) = \min_g \max_{f \in \mathcal{C}_\theta} \mathbb{E}_{X^n \sim f} \sum_{j=0}^{n-1} D(f \| g_{X^j})$
  • $R_n \le \sum_{i=0}^{n-1} r_i$
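The last bound follows by playing the instantaneously optimal estimator at each step; a short check (my reconstruction of the standard argument, not spelled out on the slide), where $g^{(j)}$ attains $r_j(\mathcal{C}_\theta)$:

```latex
\mathbb{E}_{X^n \sim f} \sum_{j=0}^{n-1} D\big(f \,\|\, g^{(j)}_{X^j}\big)
\;\le\; \sum_{j=0}^{n-1} \max_{f \in \mathcal{C}_\theta} \mathbb{E}_{X^j \sim f}\, D\big(f \,\|\, g^{(j)}_{X^j}\big)
\;=\; \sum_{j=0}^{n-1} r_j(\mathcal{C}_\theta),
```

and taking the max over $f$ and then the min over coding schemes gives $R_n \le \sum_j r_j$.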

SLIDE 8

Gaussian Distributions

  • Class of Gaussian distributions with unknown mean and known variance: $f_\theta(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(x - \theta)^2}{2}}$
  • Estimated mean $= \frac{1}{n} \sum_{i=1}^{n} X_i$ (ML estimator, sufficient statistic)
  • Output: distribution with the estimated mean
  • Near-optimal estimator: $r_n = \frac{1}{2n}(1 + o(1))$
  • Is it true for any class?
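For this class, the expected KL loss of the plug-in estimator can be checked in one line (a reconstruction from the standard Gaussian KL formula, not shown on the slide):

```latex
D\big(f_\theta \,\|\, f_{\hat\theta}\big) = \frac{(\hat\theta - \theta)^2}{2},
\qquad
\mathbb{E}_{X^n \sim f_\theta}\big(\hat\theta - \theta\big)^2 = \frac{1}{n}
\;\Longrightarrow\;
\mathbb{E}\, D\big(f_\theta \,\|\, f_{\hat\theta}\big) = \frac{1}{2n}.
```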

SLIDE 9

Smooth Distributions

  • Asymptotic normality of the MLE: $\sqrt{n}\,(\hat\theta_{ML}(X^n) - \theta) \to \mathcal{N}\!\left(0, \frac{1}{I(\theta)}\right)$, where $I(\theta)$ is the Fisher information
  • $\frac{\partial^2}{\partial \delta^2} D(f_\theta \| f_{\theta+\delta})\big|_{\delta=0} = I(\theta)$
  • $\mathbb{E}\, D(f_\theta \| f_{\hat\theta_{ML}}) \approx \mathbb{E}\Big[ D(f_\theta \| f_\theta) + \frac{\partial}{\partial \delta} D(f_\theta \| f_{\theta+\delta})\big|_{\delta=0} (\hat\theta_{ML} - \theta) + \frac{1}{2} \frac{\partial^2}{\partial \delta^2} D(f_\theta \| f_{\theta+\delta})\big|_{\delta=0} (\hat\theta_{ML} - \theta)^2 \Big] = I(\theta) \cdot \frac{1}{2 I(\theta) n} = \frac{1}{2n}$
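A quick symbolic sanity check of the curvature identity for one smooth family (the exponential class from Slide 3; the choice of family is an assumption for illustration):

```python
# Check that d^2/d(delta)^2 D(f_theta || f_{theta+delta}) at delta = 0
# equals the Fisher information, here I(theta) = 1/theta^2 for the
# exponential class with rate parameter theta.
import sympy as sp

theta, delta = sp.symbols('theta delta', positive=True)
# Closed-form KL between Exp(theta) and Exp(theta + delta)
kl = sp.log(theta / (theta + delta)) + (theta + delta) / theta - 1
curvature = sp.diff(kl, delta, 2).subs(delta, 0)
print(sp.simplify(curvature))  # prints 1/theta**2, i.e. I(theta)
```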

SLIDE 10

Smooth distributions

  • Lower bound¹: if the parameter can be estimated to accuracy $\frac{1}{n^\alpha}$, then $\limsup n\, r_n \ge \alpha$
  • For smooth distributions it has actually been shown that $r_n \approx \frac{1}{2n} \cdot (\text{\# parameters})$
  • How about non-smooth distributions?
  • Distributions with no Fisher information?

¹ A. Barron, N. Hengartner et al., "Information theory and superefficiency," The Annals of Statistics, vol. 26, no. 5, pp. 1800–1825, 1998.

SLIDE 11

Uniform distributions

  • Class of uniform distributions: $f_\theta(x) = \frac{1}{\theta} \mathbf{1}_{0 \le x \le \theta}$
  • ML estimator: $\max(X^n)$ (estimates $\theta$ to an accuracy of $\approx \frac{1}{n}$)
  • KL loss for the plug-in ML estimator: infinite (it assigns zero density on $(\max(X^n), \theta]$, where $f_\theta > 0$)
  • $\ell_1$ and $\ell_2^2$ losses are still finite
  • The output estimator should allocate mass even beyond $\max(X^n)$
  • How can we allocate probability? Can $r_n$ be finite?
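A numerical check of the $\approx \frac{1}{n}$ accuracy claim; the exact identity $\mathbb{E}[\theta - \max(X^n)] = \frac{\theta}{n+1}$ is a standard fact, not stated on the slide.

```python
# Monte Carlo check that max(X^n) undershoots theta by about theta/n,
# matching E[theta - max(X^n)] = theta/(n+1) for Uniform[0, theta].
import numpy as np

rng = np.random.default_rng(1)
theta, n, trials = 1.0, 100, 50_000
gaps = theta - rng.uniform(0, theta, size=(trials, n)).max(axis=1)
print(gaps.mean(), theta / (n + 1))  # both approximately 0.0099
```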

SLIDE 12

Prior

  • To derive the estimator, consider a prior $\Pi$ on $\theta$: the Pareto distribution
  • There exists a closed-form solution for $\arg\min_g \mathbb{E}_{\theta \sim \Pi}\, D(f, g_{X^n})$: $g_{x^n}(x) = f(x \mid x^n, \theta \sim \Pi)$
  • Further, $r_n \ge \min_g \mathbb{E}_{\theta \sim \Pi}\, D(f, g_{X^n})$
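A sketch of what that closed form looks like for the uniform class, assuming a Pareto$(\alpha, \beta)$ prior (my reconstruction; the slide does not spell it out). The posterior is again Pareto, with shape $a = \alpha + n$ and scale $m = \max(\beta, \max(x^n))$, so the predictive density is:

```latex
g_{x^n}(x)
= \mathbb{E}_{\theta \sim \Pi(\cdot \mid x^n)}\!\left[\tfrac{1}{\theta}\,\mathbf{1}_{x \le \theta}\right]
= \begin{cases}
\dfrac{a}{a+1} \cdot \dfrac{1}{m}, & 0 \le x \le m,\\[6pt]
\dfrac{a}{a+1} \cdot \dfrac{m^{a}}{x^{a+1}}, & x > m.
\end{cases}
```

Letting $\alpha \to 0$ gives $a = n$, which is exactly the $\frac{n}{n+1}$ / $\frac{1}{n+1}$ split on the next slide.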

SLIDE 13

$r_n$ for Uniform class

  • Allocates $\frac{n}{n+1}$ mass uniformly up to $\max(X^n)$
  • The remaining $\frac{1}{n+1}$ falls polynomially ($\propto \frac{1}{x^{n+1}}$) after $\max(X^n)$
  • This estimator incurs the same loss over all uniform distributions
  • Hence the upper and lower bounds match: $r_n \approx \frac{1}{n}$
  • Here one parameter leads to $\frac{1}{n}$; recall $\frac{1}{2n}$ for a smooth class

[Figure: p.d.f. of the original distribution vs. the estimator]
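A runnable sketch of this estimator (function names are mine; the density is the one described above, and it is continuous at $\max(X^n)$):

```python
# Sketch of the optimal estimator for Uniform[0, theta]: mass n/(n+1)
# spread uniformly on [0, max(X^n)], with the remaining 1/(n+1)
# decaying as 1/x^(n+1) beyond max(X^n).
import numpy as np

def uniform_class_estimator(samples):
    n, m = len(samples), float(np.max(samples))
    def pdf(x):
        x = np.asarray(x, dtype=float)
        body = (n / (n + 1)) / m                                   # on [0, m]
        tail = (n / (n + 1)) * m**n / np.maximum(x, m) ** (n + 1)  # x > m
        return np.where((x >= 0) & (x <= m), body,
                        np.where(x > m, tail, 0.0))
    return pdf

rng = np.random.default_rng(2)
g = uniform_class_estimator(rng.uniform(0.0, 1.0, size=50))
xs = np.linspace(0.0, 5.0, 500_001)
print(np.sum(g(xs)) * (xs[1] - xs[0]))  # ~ 1, up to the tiny x > 5 tail
```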

SLIDE 14

Uniform with 2 parameters

  • Class of uniform distributions with both ends unknown
  • Similar technique to derive the optimal estimator:
    • allocates $\frac{n-2}{n}$ mass uniformly between $\min(X^n)$ and $\max(X^n)$
    • $\frac{1}{n}$ probability falling polynomially ($\propto \frac{1}{(x - \min(X^n))^{n-1}}$) after $\max(X^n)$
    • $\frac{1}{n}$ probability falling polynomially ($\propto \frac{1}{(\max(X^n) - x)^{n-1}}$) before $\min(X^n)$
  • $r_n \approx \frac{2}{n}$ ($\frac{1}{n}$ loss per parameter)

[Figure: p.d.f. of the original distribution vs. the estimator]

SLIDE 15

Uniform with fixed width

  • Class of uniform distributions with known width but unknown start point: $f_\theta(x) = \mathbf{1}_{\theta \le x \le \theta + 1}$
  • Once again the optimal estimator is derived using the "prior" technique
  • Optimal estimator (see the sketch below):
    • allocates a p.d.f. of 1 between $\min(X^n)$ and $\max(X^n)$
    • p.d.f. falls linearly between $\max(X^n)$ and $1 + \min(X^n)$
    • p.d.f. falls linearly between $\min(X^n)$ and $\max(X^n) - 1$
  • $r_n \approx \frac{1}{n}$
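A runnable sketch of this piecewise-linear density (naming is mine). It integrates to exactly 1: the flat part has length $\max - \min$, and the two triangles contribute the remaining $1 - (\max - \min)$.

```python
# Sketch of the fixed-width estimator: density 1 on [min(X^n), max(X^n)],
# ramping linearly down to 0 at min(X^n) + 1 on the right and at
# max(X^n) - 1 on the left. Assumes max - min < 1 (almost surely true).
import numpy as np

def fixed_width_estimator(samples):
    lo, hi = float(np.min(samples)), float(np.max(samples))
    slope = 1.0 / (1.0 - (hi - lo))        # shared slope of both ramps
    def pdf(x):
        x = np.asarray(x, dtype=float)
        left = (x - (hi - 1.0)) * slope    # 0 at hi - 1, reaches 1 at lo
        right = ((lo + 1.0) - x) * slope   # 1 at hi, reaches 0 at lo + 1
        return np.clip(np.minimum(1.0, np.minimum(left, right)), 0.0, None)
    return pdf

g = fixed_width_estimator(np.array([0.31, 0.40, 0.77, 0.95]))
xs = np.linspace(-1.0, 2.0, 300_001)
print(np.sum(g(xs)) * (xs[1] - xs[0]))  # ~ 1.0 (exactly normalized density)
```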

SLIDE 16

Truncated distributions

  • Consider any continuous distribution $f$
  • Truncated class: class of distributions generated by truncating $f$ at $\theta$: $f_\theta(x) = \frac{f(x)}{F(\theta)} \mathbf{1}_{x \le \theta}$
  • No Fisher information for this class either
  • Transformation $y = F(x)$ maps this class to the class of uniform distributions
  • Optimal estimator already known
  • Map back to $x$ using the transformation $x = F^{-1}(y)$
  • $r_n \approx \frac{1}{n}$

[Figure: p.d.f. of the original distribution vs. the estimator]
SLIDE 17

$r_n$ in general?

  • Is $r_n$ always $\frac{1}{n}$ per parameter?
  • Consider the triangle distribution
  • Fisher information doesn't exist
  • Looks smoother than uniform
  • Is the loss $\frac{1}{n}$? $\frac{1}{2n}$?

[Figure: p.d.f. of the triangle distribution vs. the estimator]

SLIDE 18

Scaled Distributions

  • The triangle distribution can in fact be seen as a scaled distribution
  • Consider a p.d.f. $f$ with all mass between 0 and 1
  • One can scale (stretch) the distribution: $f_\theta(x) = \frac{1}{\theta} f\!\left(\frac{x}{\theta}\right)$
  • The Pareto distribution is again the "least favorable" prior
  • Optimal estimator can be derived: $g_{x^n}(x_{n+1}) = \dfrac{\int_0^\infty \prod_{i=1}^{n+1} \frac{1}{\theta} f\!\left(\frac{x_i}{\theta}\right) \frac{d\theta}{\theta}}{\int_0^\infty \prod_{i=1}^{n} \frac{1}{\theta} f\!\left(\frac{x_i}{\theta}\right) \frac{d\theta}{\theta}}$
  • If $f(x) \neq 0$ for all $x \in (0,1)$, $f(1^-) \neq 0$, and $f'(1^-)$ is finite, then $r_n \approx \frac{1}{n}$
  • Recovers $r_n \approx \frac{1}{n}$ for the class of uniform distributions starting at 0
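A hedged numerical sketch of this predictive density; the quadrature grid, the direct products, and the function names are assumptions, and for large $n$ the products would need to be computed in log space.

```python
# Sketch: evaluate g_{x^n}(x) as a ratio of two theta-integrals on a
# log-spaced grid. Direct products underflow for large n; fine here.
import numpy as np

def scaled_class_predictive(samples, f, grid_size=20_000):
    x = np.asarray(samples, dtype=float)
    thetas = np.geomspace(x.max() * 1e-3, x.max() * 1e3, grid_size)
    w = np.gradient(thetas) / thetas            # d(theta)/theta weights
    def integral(vals):
        # integral of prod_i (1/theta) f(vals_i / theta) d(theta)/theta
        return np.sum(np.prod(f(vals[:, None] / thetas) / thetas, axis=0) * w)
    denom = integral(x)
    return lambda x_new: integral(np.append(x, x_new)) / denom

# Example with a symmetric triangle base density on (0, 1), f(1-) = 0
f = lambda t: np.where((t > 0) & (t < 1), 2.0 - 4.0 * np.abs(t - 0.5), 0.0)
rng = np.random.default_rng(4)
samples = (rng.uniform(size=20) + rng.uniform(size=20)) / 2  # triangle draws
g = scaled_class_predictive(samples, f)
print(g(0.5), g(1.2))   # predictive density at two points
```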

SLIDE 19

Scaled Distributions

  • For the triangle distribution, $f(1^-) = 0$
  • The previous result doesn't apply
  • Calculating $r_n$ is tricky
  • Derived bounds on $R_n$, which suggest bounds on $r_n$
  • Bounds suggest $\lim r_n = 0$, and that $r_n$ for triangle distributions can be $\ge \frac{1}{2n}$ and $\le \frac{\frac{3}{2} - \frac{\pi}{4}}{n} \approx \frac{0.715}{n} < \frac{1}{n}$

SLIDE 20

Future Work

  • Establishing $r_n$:
    • for scaled distributions
    • for other classes of distributions