SLIDE 1

Regularization prescriptions and convex duality: density estimation and Rényi entropies

Ivan Mizera
University of Alberta, Department of Mathematical and Statistical Sciences
Edmonton, Alberta, Canada
Linz, October 2008

Joint work with Roger Koenker (University of Illinois at Urbana-Champaign).
Gratefully acknowledging the support of the Natural Sciences and Engineering Research Council of Canada.

SLIDE 2

Density estimation (say)

A useful heuristic: maximum likelihood. Given the data points X1, X2, . . . , Xn, solve

$$\prod_{i=1}^{n} f(X_i) \to \max_f$$

or equivalently

$$-\sum_{i=1}^{n} \log f(X_i) \to \min_f$$

under the side conditions $f \geq 0$, $\int f = 1$.

SLIDE 3

Note that useful...

[figure]

SLIDE 4

Dirac catastrophe!
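A minimal numeric illustration of the catastrophe (hypothetical data; not from the slides): a kernel mixture with shrinking bandwidth is a legitimate density for every h > 0, yet it drives the likelihood to infinity.

```python
import numpy as np
from scipy.stats import norm

x = np.array([0.0, 1.0, 2.5, 4.0])               # hypothetical data
for h in [1.0, 0.1, 0.01, 0.001]:
    # Gaussian-kernel mixture density evaluated at the data points
    f_at_data = norm.pdf((x[:, None] - x) / h).mean(axis=1) / h
    print(h, np.sum(np.log(f_at_data)))          # log-likelihood -> infinity as h -> 0
```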

SLIDE 5

Preventing the disaster for the general case

  • Sieves (...)

SLIDE 6

Preventing the disaster for the general case

  • Sieves (...)
  • Regularization

$$-\sum_{i=1}^{n} \log f(X_i) \to \min_f, \qquad f \geq 0, \ \int f = 1$$

SLIDE 7

Preventing the disaster for the general case

  • Sieves (...)
  • Regularization

$$-\sum_{i=1}^{n} \log f(X_i) \to \min_f, \qquad J(f) \leq \Lambda, \ f \geq 0, \ \int f = 1$$

SLIDE 8

Preventing the disaster for the general case

  • Sieves (...)
  • Regularization

$$-\sum_{i=1}^{n} \log f(X_i) + \lambda J(f) \to \min_f, \qquad f \geq 0, \ \int f = 1$$

SLIDE 9

Preventing the disaster for the general case

  • Sieves (...)
  • Regularization

$$-\sum_{i=1}^{n} \log f(X_i) + \lambda J(f) \to \min_f, \qquad f \geq 0, \ \int f = 1$$

J(·) is a penalty (penalizing complexity, lack of smoothness, etc.); for instance

$$J(f) = \int |(\log f)''| = \mathrm{TV}((\log f)'), \quad \text{or also} \quad J(f) = \int |(\log f)'''| = \mathrm{TV}((\log f)'')$$

Good (1971), Good and Gaskins (1971), Silverman (1982), Leonard (1978), Gu (2002), Wahba, Lin, and Leng (2002)
See also: Eggermont and LaRiccia (2001), Ramsay and Silverman (2006), Hartigan (2000), Hartigan and Hartigan (1985), Davies and Kovac (2004)
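To make the penalized prescription concrete, a rough illustrative discretization in Python follows (hypothetical names and data, not the authors' implementation; they solve a convex dual with an interior-point method, so a generic solver applied to this nonsmooth objective is only a sketch):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.standard_normal(200)                         # hypothetical data
grid = np.linspace(x.min() - 1.0, x.max() + 1.0, 101)
dx = grid[1] - grid[0]

def objective(u, lam=0.5):
    # u is log f on the grid, up to a constant; renormalize so integral f = 1
    logf = u - np.log(np.sum(np.exp(u)) * dx)
    nll = -np.interp(x, grid, logf).sum()            # -sum_i log f(X_i)
    d1 = np.diff(logf) / dx                          # (log f)'
    return nll + lam * np.sum(np.abs(np.diff(d1)))   # + lambda * TV((log f)')

res = minimize(objective, np.zeros(grid.size), method="Powell")
f_hat = np.exp(res.x - np.log(np.sum(np.exp(res.x)) * dx))
```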

SLIDE 10

See also in particular

Roger Koenker and Ivan Mizera (2007): Density estimation by total variation regularization
Roger Koenker and Ivan Mizera (2006): The alter egos of the regularized maximum likelihood density estimators: deregularized maximum-entropy, Shannon, Rényi, Simpson, Gini, and stretched strings
Roger Koenker, Ivan Mizera, and Jungmo Yoon (200?): What do kernel density estimators optimize?
Roger Koenker and Ivan Mizera (2008): Primal and dual formulations relevant for the numerical estimation of a probability density via regularization
Roger Koenker and Ivan Mizera (200?): Quasi-concave density estimation

http://www.stat.ualberta.ca/∼mizera/
http://www.econ.uiuc.edu/∼roger/

SLIDE 11

Preventing the disaster for special cases

  • Shape constraint: monotonicity

$$-\sum_{i=1}^{n} \log f(X_i) \to \min_f, \qquad f \geq 0, \ \int f = 1$$

SLIDE 12

Preventing the disaster for special cases

  • Shape constraint: monotonicity

$$-\sum_{i=1}^{n} \log f(X_i) \to \min_f, \qquad f \text{ decreasing}, \ f \geq 0, \ \int f = 1$$

Grenander (1956), Jongbloed (1998), Groeneboom, Jongbloed, and Wellner (2001), ...

SLIDE 13

Preventing the disaster for special cases

  • Shape constraint: monotonicity

$$-\sum_{i=1}^{n} \log f(X_i) \to \min_f, \qquad f \text{ decreasing}, \ f \geq 0, \ \int f = 1$$

Grenander (1956), Jongbloed (1998), Groeneboom, Jongbloed, and Wellner (2001), ...

  • Shape constraint: (strong) unimodality

$$-\sum_{i=1}^{n} \log f(X_i) \to \min_f, \qquad f \geq 0, \ \int f = 1$$

SLIDE 14

Preventing the disaster for special cases

  • Shape constraint: monotonicity

$$-\sum_{i=1}^{n} \log f(X_i) \to \min_f, \qquad f \text{ decreasing}, \ f \geq 0, \ \int f = 1$$

Grenander (1956), Jongbloed (1998), Groeneboom, Jongbloed, and Wellner (2001), ...

  • Shape constraint: (strong) unimodality

$$-\sum_{i=1}^{n} \log f(X_i) \to \min_f, \qquad -\log f \text{ convex}, \ f \geq 0, \ \int f = 1$$

Eggermont and LaRiccia (2000), Walther (2000), Rufibach and Dümbgen (2006), Pal, Woodroofe, and Meyer (2006)

SLIDE 15

Note

Shape constraint: no regularization parameter to be set... ... but of course, we need to believe that the shape is plausible

SLIDE 16

Note

Shape constraint: no regularization parameter to be set...
...but of course, we need to believe that the shape is plausible.

Regularization via TV penalty...
...vs. the log-concavity shape constraint: the differential operator is the same, only the constraint is somewhat different:

$$\int |(\log f)''| \leq \Lambda, \quad \text{in the dual } |(\log f)''| \leq \Lambda$$

Log-concavity: $(\log f)'' \leq 0$

SLIDE 17

Note

Shape constraint: no regularization parameter to be set...
...but of course, we need to believe that the shape is plausible.

Regularization via TV penalty...
...vs. the log-concavity shape constraint: the differential operator is the same, only the constraint is somewhat different:

$$\int |(\log f)''| \leq \Lambda, \quad \text{in the dual } |(\log f)''| \leq \Lambda$$

Log-concavity: $(\log f)'' \leq 0$

Only the functional analysis may be a bit more difficult...
...so let us do the shape-constrained case first.

SLIDE 18

The hidden charm of log-concave distributions

A density f is called log-concave if − log f is convex. (Usual conventions: − log 0 = ∞, convex where finite, ...)

SLIDE 19

The hidden charm of log-concave distributions

A density f is called log-concave if − log f is convex. (Usual conventions: − log 0 = ∞, convex where finite, ...)

Schoenberg 1940s, Karlin 1950s (monotone likelihood ratio)
Karlin (1968) - a monograph about their mathematics
Barlow and Proschan (1975) - reliability
Flinn and Heckman (1975) - social choice
Caplin and Nalebuff (1991a,b) - voting theory
Devroye (1984) - how to simulate from them
Mizera (1994) - M-estimators

SLIDE 20

The hidden charm of log-concave distributions

A density f is called log-concave if − log f is convex. (Usual conventions: − log 0 = ∞, convex where finite, ...)

Schoenberg 1940s, Karlin 1950s (monotone likelihood ratio)
Karlin (1968) - a monograph about their mathematics
Barlow and Proschan (1975) - reliability
Flinn and Heckman (1975) - social choice
Caplin and Nalebuff (1991a,b) - voting theory
Devroye (1984) - how to simulate from them
Mizera (1994) - M-estimators

Uniform, Normal, Exponential, Logistic, Weibull, Gamma...
  • all log-concave

If f is log-concave, then

  • it is unimodal (“strongly”)
  • the convolution with any unimodal density is unimodal
  • the convolution with any log-concave density is log-concave
  • f = e−g, with g convex...

SLIDE 21

The hidden charm of log-concave distributions

A density f is called log-concave if − log f is convex. (Usual conventions: − log 0 = ∞, convex where finite, ...)

Schoenberg 1940s, Karlin 1950s (monotone likelihood ratio)
Karlin (1968) - a monograph about their mathematics
Barlow and Proschan (1975) - reliability
Flinn and Heckman (1975) - social choice
Caplin and Nalebuff (1991a,b) - voting theory
Devroye (1984) - how to simulate from them
Mizera (1994) - M-estimators

Uniform, Normal, Exponential, Logistic, Weibull, Gamma...
  • all log-concave

If f is log-concave, then

  • it is unimodal (“strongly”)
  • the convolution with any unimodal density is unimodal
  • the convolution with any log-concave density is log-concave
  • f = e−g, with g convex...

No heavy tails! t-distributions (finance!): not log-concave (!!)
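A quick numeric check of this point (illustrative only, not from the slides): discrete second differences of log f stay nonpositive for the normal, but not for a t density.

```python
import numpy as np
from scipy.stats import norm, t

xs = np.linspace(-10, 10, 2001)
for name, dist in [("normal", norm()), ("t, 3 df", t(df=3))]:
    curv = np.diff(np.log(dist.pdf(xs)), 2)      # discrete (log f)''
    print(name, "log-concave:", bool(np.all(curv <= 1e-9)))
```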

SLIDE 22

A convex problem

Let g = − log f; let K be the cone of convex functions. The original problem is transformed:

$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) \to \min_g, \qquad g \in K, \ \int e^{-g} = 1$$

SLIDE 23

A convex problem

Let g = − log f; let K be the cone of convex functions. The original problem is transformed:

$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) + \int e^{-g} \to \min_g, \qquad g \in K$$
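Why may the side condition be dropped? A short argument, standard in this literature, spelled out here for completeness: add a constant c to g and minimize over c,

$$\frac{d}{dc}\left[\frac{1}{n}\sum_{i=1}^{n}\bigl(g(X_i)+c\bigr) + \int e^{-g-c}\right] = 1 - e^{-c}\int e^{-g} = 0,$$

so at any minimizer $\int e^{-(g+c)} = 1$: the normalization holds automatically.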

SLIDE 24

A convex problem

Let g = − log f; let K be the cone of convex functions. The original problem is transformed:

$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) + \int e^{-g} \to \min_g, \qquad g \in K$$

and generalized: let ψ be convex and nonincreasing (like $e^{-x}$):

$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) + \int \psi(g) \to \min_g, \qquad g \in K$$

SLIDE 25

A convex problem

Let g = − log f; let K be the cone of convex functions. The original problem is transformed:

$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) + \int e^{-g} \to \min_g, \qquad g \in K$$

and generalized: let ψ be convex and nonincreasing (like $e^{-x}$):

$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) + \int \psi(g) \to \min_g, \qquad g \in K$$

SLIDE 26

Primal and dual

Recall: K is the cone of convex functions; ψ is convex and nonincreasing. The strong Fenchel dual of

$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) + \int \psi(g)\,dx \to \min_g, \qquad g \in K \tag{P}$$

is

$$-\int \psi^*(-f)\,dx \to \max_f, \qquad f = \frac{d(P_n - G)}{dx}, \ G \in K^* \tag{D}$$

Extremal relation: $f = -\psi'(g)$.

For penalized estimation, in a discretized setting: Koenker and Mizera (2007b)

SLIDE 27

Remarks

$\psi^*(y) = \sup_{x \in \mathrm{dom}\,\psi} (yx - \psi(x))$ is the conjugate of ψ.

If primal solutions g are sought in some space, then dual solutions G are sought in a dual space: for instance, if g ∈ C(X) and X is compact, then G ∈ C(X)∗, the space of (signed) Radon measures on X. The equality $f = \frac{d(P_n - G)}{dx}$ is thus a feasibility constraint (for other G, the dual objective is −∞).

K∗ is the dual cone to K: a collection of (signed) Radon measures G such that $\int g\,dG \geq 0$ for any convex g.

Dual: good for computation...
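As a sanity check of this machinery (a worked example, not on the original slide; ψ(x) = e^{-x} as in the maximum likelihood instance below):

$$\psi^*(y) = \sup_{x}\,\bigl(yx - e^{-x}\bigr) = \begin{cases} y - y\log(-y), & y < 0, \\ 0, & y = 0, \\ +\infty, & y > 0, \end{cases}$$

so $-\int \psi^*(-f)\,dx = -\int \bigl(f\log f - f\bigr)\,dx$, which for a density ($\int f = 1$) is the Shannon entropy plus a constant, exactly the dual objective of the slide after next.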

SLIDE 28

Dual: good not only for computation

Couldn't we have heavy-tailed distributions here too? ...possibly going beyond log-concavity?

Recall: the strong Fenchel dual of

$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) + \int \psi(g)\,dx \to \min_g, \qquad g \in K \tag{P}$$

is

$$-\int \psi^*(-f)\,dx \to \max_f, \qquad f = \frac{d(P_n - G)}{dx}, \ G \in K^* \tag{D}$$

Extremal relation: $f = -\psi'(g)$.

SLIDE 29

Instance: maximum likelihood, α = 1

For $\psi(x) = e^{-x}$, we have

$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) + \int e^{-g} \to \min_g, \qquad g \in K \tag{P}$$

$$-\int f \log f\,dx \to \max_f, \qquad f = \frac{d(P_n - G)}{dx}, \ G \in K^* \tag{D}$$

... a maximum entropy formulation

Extremal relation: $f = e^{-g}$

g required convex → f log-concave

How about entropies alternative to Shannon entropy?
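A compact numeric sketch of this instance (a hypothetical grid discretization, not the authors' implementation, which works with the dual via Mosek; cvxpy with an exponential-cone-capable solver such as ECOS or SCS is assumed):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.standard_normal(100))             # hypothetical data
m = 200
grid = np.linspace(x[0], x[-1], m)
dx = grid[1] - grid[0]

# Linear interpolation matrix A: (A @ g)[i] approximates g(X_i)
A = np.zeros((x.size, m))
j = np.clip(np.searchsorted(grid, x) - 1, 0, m - 2)
w = (x - grid[j]) / dx
A[np.arange(x.size), j], A[np.arange(x.size), j + 1] = 1 - w, w

g = cp.Variable(m)                                # g = -log f on the grid
objective = cp.sum(A @ g) / x.size + cp.sum(cp.exp(-g)) * dx
problem = cp.Problem(cp.Minimize(objective), [cp.diff(g, 2) >= 0])  # g convex
problem.solve()
f_hat = np.exp(-g.value)                          # log-concave density estimate
```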

SLIDE 30

Rényi system

Rényi (1961, 1965): entropies defined with the help of

$$\frac{1}{1-\alpha}\log\left(\int f^{\alpha}(x)\,dx\right),$$

with Shannon entropy being the limiting form as α → 1. Various entropies correspond to various known divergences:

α = 1: Shannon entropy, Kullback-Leibler divergence
α = 2: Rényi-Simpson-Gini entropy, Pearson's χ²
α = 1/2: Hellinger distance
α = 0: reversed Kullback-Leibler

New heuristics: MLE → Shannon dual → Rényi duals → ? primals
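A small helper making the definition concrete (illustrative, not from the slides; the α = 0.999 entry approximates the Shannon limit):

```python
import numpy as np

def renyi_entropy(f, dx, alpha):
    """Renyi entropy of a density f sampled on a grid with spacing dx."""
    f = np.asarray(f)
    if np.isclose(alpha, 1.0):                   # Shannon limit
        p = f[f > 0]
        return -np.sum(p * np.log(p)) * dx
    return np.log(np.sum(f ** alpha) * dx) / (1.0 - alpha)

# Example: standard normal; the Shannon value is 0.5*log(2*pi*e) ~ 1.4189
xs = np.linspace(-8, 8, 4001)
f = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)
print([round(renyi_entropy(f, xs[1] - xs[0], a), 4) for a in (0.5, 0.999, 2.0)])
```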

SLIDE 31

ψ and ψ∗ for various α

[figure: ψ and ψ∗ plotted for α = 2, 1, 1/2, 0]

SLIDE 32

Some properties for all α

The density estimators with Rényi entropies, as defined above:

  • are supported by the convex hull of the data
  • have an expected value equal to the sample mean of the data
  • have a primal solution g that is a polyhedral convex function (that is, determined by its values at the data points Xi, being the maximal convex function minorizing those values)
  • are well-defined: the minimum of the primal formulation is attained

SLIDE 33

Instance: α = 2

$$-\frac{1}{2}\int f^2(y)\,dy \to \max_f, \qquad f = \frac{d(P_n - G)}{dy}, \ G \in K^* \tag{D}$$

$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) + \frac{1}{2}\int g^2\,dx \to \min_g, \qquad g \in K \tag{P}$$

Minimum Pearson χ², maximum Rényi-Simpson-Gini entropy

Extremal relation: $f = -g$

g required convex → f concave

That yields a class more restrictive than log-concave
  • and thus is not of interest for us!

SLIDE 34

But perhaps for others...

Replacing g by −f gives

$$-\frac{1}{n}\sum_{i=1}^{n} f(X_i) + \frac{1}{2}\int f^2\,dx \to \min$$

the objective function of the "least squares estimator" of Groeneboom, Jongbloed, and Wellner (2001). A folk tune (in the penalized context): Aidu and Vapnik (1989), Terrell (1990).

...and more generally, the primal form for α > 1 is equivalent to the objective function of the "minimum density power divergence estimators" introduced by Basu, Harris, Hjort, and Jones (1998) in the context of parametric M-estimation.
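For concreteness, a sketch of that criterion as I understand it from Basu et al. (1998), with β = α − 1 (hypothetical names; verify the constants against the paper before use):

```python
import numpy as np

def dpd_objective(f_grid, dx, f_at_data, beta=1.0):
    # Empirical density-power-divergence criterion:
    # integral f^(1+beta) - (1 + 1/beta) * (1/n) * sum_i f(X_i)^beta;
    # beta = 1 gives twice the "least squares" objective above.
    return np.sum(f_grid ** (1 + beta)) * dx - (1 + 1 / beta) * np.mean(f_at_data ** beta)
```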

SLIDE 35

De profundis: α = 0

Not explicitly a member of the Rényi family; nevertheless, a limit.

$$\int \log f\,dy \to \max_f, \qquad f = \frac{d(P_n - G)}{dy}, \ G \in K^* \tag{D}$$

$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) - \int \log g\,dx \to \min_{g \in C(X)}, \qquad g \in K \tag{P}$$

Empirical likelihood (Owen, 2001)

Extremal relation: $g = 1/f$; the primal thus estimates the "sparsity function"

g required convex → 1/f convex
  • that would yield a very nice family of functions...

...but numerically still fragile.

SLIDE 36

The hierarchy of ρ-convex functions

Hardy, Littlewood, and Pólya (1934): means of order ρ
Avriel (1972): ρ-convex functions

ρ < 0: $f^\rho$ convex
ρ = 0: log-concave
ρ > 0: $f^\rho$ concave

The class of ρ-convex densities grows with decreasing ρ: if ρ₁ < ρ₂, then every ρ₂-convex density is ρ₁-convex. Every ρ-convex density is quasi-concave: it has convex level sets.

Our α corresponds to ρ = α − 1; that is, if we apply the estimating prescription whose dual involves the Rényi α-entropy, then the result is guaranteed to lie in the domain of (α − 1)-convex functions.

SLIDE 37

So the winner is: α = 1/2

"Moderate progress within the limits of law", the "Hellinger selector":

$$\int \sqrt{f}\,dx \to \max_f, \qquad f = \frac{d(P_n - G)}{dx}, \ G \in K^* \tag{D}$$

$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) + \int \frac{1}{g}\,dx \to \min_{g \in C(X)}, \qquad g \in K \tag{P}$$

Extremal relation: $f = g^{-2}$

g required convex → $f^{-1/2}$ convex (f is −1/2-convex)
  • all log-concave densities
  • the whole t family

the primal thus estimates $f^{-1/2}$ (...rootosparsity)
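A quick numeric sanity check of that claim (illustrative only, not from the slides): $f^{-1/2}$ should be convex for log-concave densities and for t densities alike.

```python
import numpy as np
from scipy.stats import norm, t, cauchy

xs = np.linspace(-15, 15, 3001)
for name, dist in [("normal", norm()), ("t, 3 df", t(df=3)), ("cauchy", cauchy())]:
    r = dist.pdf(xs) ** (-0.5)                    # f^(-1/2)
    ok = np.all(np.diff(r, 2) >= -1e-8)           # second differences >= 0 => convex
    print(name, "is -1/2-convex:", bool(ok))
```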

SLIDE 38

Weibull, n = 200; left Shannon, right Hellinger

SLIDE 39

Another Weibull, n = 200; left Shannon, right Hellinger

SLIDE 40

Four points at the vertices of the square

SLIDE 41

Student data on criminal fingers

SLIDE 42

Once again, but with logarithmic contours

SLIDE 43

Simulated data: uniform distribution

SLIDE 44

A panoramic view

SLIDE 45

Computation

Main problem: enforcing convexity in the optimization.

Easy in dimension 1; in dimension 2, the most promising way seems to be to employ a finite-difference scheme: estimate the Hessian (the matrix of second derivatives of f) by finite differences...
...and then enforce this matrix to be positive semidefinite.

That means: semidefinite programming...
...but with a (slightly) nonlinear objective function.

In dimension two, one can express the semidefiniteness of the matrix by a rotated quadratic cone...
...and also the reciprocal value can be tricked in that way.

Thus: the Hellinger selector turns out to be computationally easier than (Shannon) maximum likelihood...

We acknowledge using a Danish commercial implementation called Mosek, by Erling Andersen, and an open-source code by Michael Saunders.

See also Cule, Samworth, and Stewart (2008)
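To make the cone reformulation concrete, a minimal sketch (hypothetical variables, using cvxpy's SOC constraint): for a 2×2 matrix [[a, b], [b, c]], positive semidefiniteness is exactly a, c ≥ 0 and b² ≤ ac, a rotated quadratic cone, which can be rewritten as a standard second-order cone.

```python
import cvxpy as cp

a, b, c = cp.Variable(), cp.Variable(), cp.Variable()

# [[a, b], [b, c]] >> 0  <=>  a, c >= 0 and b^2 <= a*c  (rotated quadratic cone)
# <=> the standard second-order cone ||(2b, a - c)||_2 <= a + c
psd_2x2 = cp.SOC(a + c, cp.hstack([2 * b, a - c]))

# Toy use: the largest b compatible with given diagonal entries (b* = sqrt(2))
prob = cp.Problem(cp.Maximize(b), [psd_2x2, a == 1, c == 2])
prob.solve()
print(round(b.value, 4))   # ~1.4142
```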

SLIDE 46

Summary

  • We can estimate a density restricted to a broader domain than log-concave, including also heavy-tailed distributions.
  • Generalizing the formulation dual to maximum likelihood within the family of Rényi entropies indexed by α, we obtain an interesting family of divergence-based primal/dual estimators.
  • Each yields estimates in its corresponding ρ-convex class, in a natural way.
  • Our choice is α = 1/2, which in the dual picks the feasible density closest to the uniform, on the convex hull of the data, in Hellinger distance.
  • It yields −1/2-convex densities, which include all log-concave densities but also the t family, that is, algebraic tails; seemingly all practically important quasi-concave densities.
  • And in dimension 2 it is computationally somewhat more convenient than the other possibilities.

SLIDE 47

Duality heuristics

Recall: penalized estimation, discretized setting.

Primal:

$$\frac{1}{n}\sum_{i=1}^{n} g(x_i) + J(-Dg) + \int \psi(g) \to \min_g \tag{P}$$

where (typically) $J(-Dg) = \lambda \|g^{(k)}\|_p^p$.

Dual:

$$-\int \psi^*(f) - J^*(h) \to \max_{f,h}, \qquad f = \frac{d(P_n + D^*h)}{dx} \tag{D}$$

where ψ∗ is again the conjugate of ψ, J∗ is the conjugate of J, and D∗ is the operator adjoint to D; strong duality yields $f = \psi'(g)$.

SLIDE 48

Instances

Silverman (1982), Leonard (1978): p = 2, k = 3
Gu (2002), Wahba, Lin, and Leng (2002): p = 2, k = 2
Davies and Kovac (2004), Hartigan (2000), Hartigan and Hartigan (1985): p = 1, k = 1
Koenker and Mizera (2006a,b,c): p = 1, k = 1, 2, 3

Recall: the conjugate of a norm is the indicator of the unit ball in the dual norm. If $J(-Dg) = \lambda \int |g'|$, then the dual is equivalent to

$$-\int \psi^*(f) \to \max_{f,h}, \qquad f = \frac{d(P_n + D^*h)}{dx}, \quad \|h\|_\infty \leq \lambda$$

If $\psi(u) = e^u$ (which means that $\psi^*(u) = u \log u$, up to an affine term that is constant on the feasible set), then the primal is a maximum likelihood prescription penalized by $\int |(\log f)'| = \mathrm{TV}(\log f)$.

And the dual means: stretch h, the antiderivative of f, in the L∞ neighborhood ("tube") of Pn... (and for other α as well!)

SLIDE 49

Stretching (“tauting”) strings


Cumulative distribution function: tube with δ = 0.1

SLIDE 50

“tube” may be somewhat ambiguous...

SLIDE 51

...but nevertheless, there is one that matches


...and the density estimate is its derivative (Koenker and Mizera 2006b).
