SLIDE 1

On adaptation for the posterior distribution under local and sup-norm

Judith Rousseau, Marc Hoffmann and Johannes Schmidt-Hieber

ENSAE - CREST and CEREMADE, Université Paris-Dauphine

Brown


SLIDE 2

Outline

1. Bayesian nonparametrics: posterior concentration (generalities, adaptation, idea of the proof)

2. Why adaptation is easy: the white noise model

3. What about $f(x_0)$? Or $\|f - f_0\|_\infty$?

4. A series of negative results



SLIDE 4

Generalities

◮ Model: $Y_1^n \mid \theta \sim p^n_\theta$ (density w.r.t. $\mu$), $\theta \in \Theta$

◮ A priori: $\theta \sim \Pi$ (prior distribution) → posterior distribution

$$d\Pi(\theta \mid Y_1^n) = \frac{p^n_\theta(Y_1^n)\, d\Pi(\theta)}{m(Y_1^n)}, \qquad Y_1^n = (Y_1, \dots, Y_n)$$

◮ Posterior concentration: for $d(\cdot,\cdot)$ a loss on $\Theta$ and $\theta_0 \in \Theta$ the true parameter,

$$E_{\theta_0}\big(\Pi[U_{\epsilon_n} \mid Y_1^n]\big) = 1 + o(1), \qquad U_{\epsilon_n} = \{\theta :\ d(\theta, \theta_0) \le \epsilon_n\}, \quad \epsilon_n \downarrow 0$$

◮ Minimax concentration rates on a class $\Theta_\alpha(L)$:

$$\sup_{\theta_0 \in \Theta_\alpha(L)} E_{\theta_0}\Big(\Pi\big[U^c_{M\epsilon_n(\alpha)} \mid Y_1^n\big]\Big) = o(1),$$

where $\epsilon_n(\alpha)$ = minimax rate under $d(\cdot,\cdot)$ over $\Theta_\alpha(L)$.
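As a toy illustration of this definition (a minimal conjugate-Gaussian sketch with illustrative constants, not from the talk), the posterior mass of the shrinking ball $U_{\epsilon_n}$ can be computed in closed form and seen to tend to 1 whenever $\sqrt{n}\,\epsilon_n \to \infty$:

```python
import numpy as np
from scipy.stats import norm

def posterior_ball_mass(y, theta0, eps):
    """Pi[{|theta - theta0| <= eps} | Y^n] for the toy model
    Y_i ~ N(theta, 1) i.i.d. with conjugate prior theta ~ N(0, 1)."""
    n = len(y)
    post_var = 1.0 / (n + 1.0)        # conjugate Gaussian update
    post_mean = y.sum() * post_var
    sd = np.sqrt(post_var)
    return norm.cdf(theta0 + eps, post_mean, sd) - norm.cdf(theta0 - eps, post_mean, sd)

rng = np.random.default_rng(0)
theta0 = 0.5
for n in [100, 1_000, 10_000]:
    y = theta0 + rng.standard_normal(n)
    eps_n = n ** (-1 / 3)             # any eps_n with sqrt(n) * eps_n -> infinity
    print(n, round(posterior_ball_mass(y, theta0, eps_n), 4))
# The printed masses increase towards 1: E_theta0(Pi[U_eps_n | Y^n]) = 1 + o(1).
```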


SLIDE 5

Examples of models and losses for which nice results exist

◮ Density estimation: $Y_i \sim p_\theta$ i.i.d.

$$d(p_\theta, p_{\theta'})^2 = \int \big(\sqrt{p_\theta} - \sqrt{p_{\theta'}}\big)^2(x)\, dx, \qquad \text{or} \qquad d(p_\theta, p_{\theta'}) = \int |p_\theta - p_{\theta'}|(x)\, dx$$

◮ Regression function: $Y_i = f(x_i) + \epsilon_i$, $\epsilon_i \sim N(0, \sigma^2)$, $\theta = (f, \sigma)$

$$d(p_\theta, p_{\theta'}) = \|f - f'\|_2, \qquad \text{or} \qquad d(p_\theta, p_{\theta'})^2 = n^{-1} \sum_{i=1}^n H^2\big(p_\theta(y \mid X_i), p_{\theta'}(y \mid X_i)\big), \quad H = \text{Hellinger}$$

◮ White noise:

$$dY(t) = f(t)\,dt + n^{-1/2} dW(t) \ \Leftrightarrow\ Y_i = \theta_i + n^{-1/2}\epsilon_i, \ i \in \mathbb{N}, \qquad d(p_\theta, p_{\theta'}) = \|f - f'\|_2$$
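As a quick numerical companion (a sketch with made-up stand-in densities, not from the talk), the two density-estimation losses above can be approximated on a grid:

```python
import numpy as np
from scipy.stats import norm

# Grid approximation of the two density-estimation losses, using two
# Gaussians as stand-ins for p_theta and p_theta'.
x = np.linspace(-10.0, 10.0, 20_001)
dx = x[1] - x[0]
p = norm.pdf(x, loc=0.0, scale=1.0)
q = norm.pdf(x, loc=0.5, scale=1.2)

hellinger_sq = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx  # d(p,q)^2 = ∫ (√p − √q)²
l1 = np.sum(np.abs(p - q)) * dx                             # d(p,q)   = ∫ |p − q|
print(hellinger_sq, l1)
```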


SLIDE 6

Examples: functional classes

$\Theta_\alpha(L)$ = Hölder ball $H(\alpha, L)$; $\epsilon_n(\alpha) = n^{-\alpha/(2\alpha+1)}$ = minimax rate over $H(\alpha, L)$.

◮ Density example, Hellinger loss. Prior = DPM:

$$f(x) = f_{P,\sigma}(x) = \int \varphi_\sigma(x - \mu)\, dP(\mu), \qquad \sigma \sim I\Gamma(a, b), \quad P \sim DP(A, G_0)$$

$$\sup_{f_0 \in \Theta_\alpha(L)} E_{f_0}\Big( \Pi\big[ U^c_{M(n/\log n)^{-\alpha/(2\alpha+1)}}(f_0) \mid Y_1^n \big] \Big) = o(1), \qquad U_\epsilon(f_0) = \{f :\ h(f_0, f) \le \epsilon\}$$

[Is the log n term necessary?]

$$\Rightarrow\ E_{f_0}\big[ h(\hat f, f_0)^2 \big] \lesssim (n/\log n)^{-2\alpha/(2\alpha+1)}, \qquad \hat f(x) = E^\pi[f(x) \mid Y^n]$$
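To make the DPM prior concrete, here is a minimal sketch of one random density $f_{P,\sigma}$ drawn from it via truncated stick-breaking; the truncation level, base measure $G_0 = N(0,1)$ and hyperparameter values are illustrative assumptions, not values from the talk.

```python
import numpy as np

def draw_dpm_density(A=1.0, a=2.0, b=1.0, n_atoms=200, rng=None):
    """One draw of f_{P,sigma}(x) = ∫ phi_sigma(x - mu) dP(mu) with
    P ~ DP(A, G0) (truncated stick-breaking, G0 = N(0, 1)) and
    sigma ~ Inverse-Gamma(a, b)."""
    rng = rng or np.random.default_rng()
    v = rng.beta(1.0, A, size=n_atoms)                         # stick-breaking fractions
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))  # mixture weights
    mu = rng.standard_normal(n_atoms)                          # atoms ~ G0 = N(0, 1)
    sigma = 1.0 / rng.gamma(a, 1.0 / b)                        # Inverse-Gamma(a, b) draw

    def f(x):
        u = (np.atleast_1d(x)[:, None] - mu) / sigma
        phi = np.exp(-0.5 * u ** 2) / (sigma * np.sqrt(2.0 * np.pi))
        return (w * phi).sum(axis=1)                           # sum_j w_j phi_sigma(x - mu_j)

    return f

f = draw_dpm_density(rng=np.random.default_rng(1))
print(f(np.linspace(-3.0, 3.0, 7)))   # one random density drawn from the prior
```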



SLIDE 8

Adaptation

For such $d(\cdot,\cdot)$, adaptation is easy: with a prior that depends on neither $n$ nor $\alpha$,

$$\sup_{\alpha_1 \le \alpha \le \alpha_2}\ \sup_{\theta_0 \in \Theta_\alpha(L)} E_{\theta_0}\Big( \Pi\big[ U^c_{M(n/\log n)^{-\alpha/(2\alpha+1)}} \mid Y_1^n \big] \Big) = o(1).$$

◮ Why?



SLIDE 10

Idea of the proof

Let $U_n = U_{M(n/\log n)^{-\alpha/(2\alpha+1)}}$, $\ell_n(\theta) = \log p^n_\theta(Y_1^n)$ and $\bar\epsilon_n = (n/\log n)^{-\alpha/(2\alpha+1)}$. Then

$$\Pi[U_n^c \mid Y_1^n] = \frac{\int_{U_n^c} e^{\ell_n(\theta) - \ell_n(\theta_0)}\, d\Pi(\theta)}{\int_\Theta e^{\ell_n(\theta) - \ell_n(\theta_0)}\, d\Pi(\theta)} =: \frac{N_n}{D_n}$$

and, for any test $\phi_n = \phi_n(Y_1^n) \in [0, 1]$,

$$P_{\theta_0}\Big( \Pi[U_n^c \mid Y_1^n] > e^{-\tau n \bar\epsilon_n^2} \Big) \le E^n_{\theta_0}[\phi_n] + P_{\theta_0}\Big( D_n < e^{-c n \bar\epsilon_n^2} \Big) + e^{(c+\tau) n \bar\epsilon_n^2} \int_{U_n^c} E_\theta[1 - \phi_n]\, d\Pi(\theta).$$
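For completeness, this bound follows from a standard two-step argument not displayed on the slide (written here assuming, as a simplification, an indicator test $\phi_n \in \{0,1\}$; the $[0,1]$-valued case is analogous). Split on the event $\{D_n \ge e^{-cn\bar\epsilon_n^2}\}$:

$$P_{\theta_0}\Big( \frac{N_n}{D_n} > e^{-\tau n\bar\epsilon_n^2} \Big) \le E^n_{\theta_0}[\phi_n] + P_{\theta_0}\big( D_n < e^{-cn\bar\epsilon_n^2} \big) + P_{\theta_0}\big( N_n(1 - \phi_n) > e^{-(c+\tau)n\bar\epsilon_n^2} \big),$$

then bound the last term by Markov's inequality, Fubini, and the change of measure $E_{\theta_0}\big[ e^{\ell_n(\theta) - \ell_n(\theta_0)}(1 - \phi_n) \big] = E_\theta[1 - \phi_n]$:

$$P_{\theta_0}\big( N_n(1 - \phi_n) > e^{-(c+\tau)n\bar\epsilon_n^2} \big) \le e^{(c+\tau)n\bar\epsilon_n^2} \int_{U_n^c} E_\theta[1 - \phi_n]\, d\Pi(\theta).$$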


SLIDE 11

Constraints

$$E^n_{\theta_0}[\phi_n] = o(1) \quad \& \quad \sup_{d(\theta, \theta_0) > M\bar\epsilon_n} E_\theta[1 - \phi_n] = o\big(e^{-cn\bar\epsilon_n^2}\big) \qquad \to \text{ this is where the loss } d(\cdot,\cdot) \text{ enters: such tests must exist}$$

$$P_{\theta_0}\Big( D_n < e^{-cn\bar\epsilon_n^2} \Big) = o(1). \quad \text{We need:} \quad D_n \ge \int_{S_n} e^{\ell_n(\theta) - \ell_n(\theta_0)}\, d\Pi(\theta) \ge e^{-2n\bar\epsilon_n^2}\, \Pi\big( S_n \cap \{\ell_n(\theta) - \ell_n(\theta_0) > -2n\bar\epsilon_n^2\} \big)$$

◮ OK if $S_n = \{ KL(p^n_{\theta_0}, p^n_\theta) \le n\bar\epsilon_n^2;\ V(p^n_{\theta_0}, p^n_\theta) \le n\bar\epsilon_n^2 \}$ and $\Pi(S_n) \ge e^{-cn\bar\epsilon_n^2}$ → links $d(\cdot,\cdot)$ with $KL(\cdot,\cdot)$.
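As a worked instance (standard, though not spelled out on the slide): in the white noise model used on the next slide, $Y_{ik} \sim N(\theta_{ik}, 1/n)$ independently, so

$$KL(p^n_{\theta_0}, p^n_\theta) = \frac{n}{2}\, \|\theta_0 - \theta\|_2^2, \qquad V(p^n_{\theta_0}, p^n_\theta) = n\, \|\theta_0 - \theta\|_2^2,$$

and $S_n$ is (up to constants) the $L_2$ ball $\{\|\theta - \theta_0\|_2 \le \bar\epsilon_n\}$; the condition $\Pi(S_n) \ge e^{-cn\bar\epsilon_n^2}$ becomes a prior-mass lower bound on that ball.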


SLIDE 12

Example: white noise model + L2 loss

$$Y_{ik} = \theta_{ik} + n^{-1/2}\epsilon_{ik}, \quad \epsilon_{ik} \sim N(0, 1), \ i \in \mathbb{N}, \ k \le 2^{i-1} \qquad \big( dY(t) = f(t)\,dt + n^{-1/2} dW(t) \big)$$

◮ Hölder class ($\alpha$): $\theta_0 \in \{\theta :\ |\theta_{ik}| \le L\, 2^{-i(\alpha + 1/2)}\ \forall i, k\}$

◮ Prior: spike and slab

$$\theta_{ik} \sim (1 - p_n)\,\delta_{(0)} + p_n\, g, \qquad \text{e.g. } g = N(0, v), \ p_n = 1/n$$

◮ Concentration:

$$S_n \approx \{ \|\theta - \theta_0\|_2^2 \le (n/\log n)^{-2\alpha/(2\alpha+1)} \} \ \to\ \forall j \ge J_{n,\alpha},\ k \le 2^j:\ \theta_{j,k} = 0, \qquad 2^{J_{n,\alpha}} = (n/\log n)^{1/(2\alpha+1)} =: R_n,$$

so $\Pi(S_n) \gtrsim e^{-C R_n \log n} =: e^{-C n \epsilon_n^2}$, and

$$E_{\theta_0}[\phi_n] = o(1), \qquad \sup_{\theta \in \Theta_n;\ \|\theta - \theta_0\|_2 > M\epsilon_n} E_\theta[1 - \phi_n] \le e^{-c n \epsilon_n^2}.$$
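Since the spike-and-slab prior is conjugate coordinate-by-coordinate in this model, the posterior is available in closed form; the following sketch (illustrative constants and a hypothetical Hölder-type truth, not from the talk) shows the resulting selective shrinkage across resolution levels:

```python
import numpy as np
from scipy.stats import norm

def spike_slab_posterior_mean(y, n, p=None, v=1.0):
    """Exact posterior mean of theta for Y ~ N(theta, 1/n) under the prior
    theta ~ (1 - p) delta_0 + p N(0, v), computed coordinatewise (conjugacy)."""
    p = 1.0 / n if p is None else p
    s2 = 1.0 / n
    m_slab = norm.pdf(y, 0.0, np.sqrt(v + s2))   # marginal density, slab component
    m_spike = norm.pdf(y, 0.0, np.sqrt(s2))      # marginal density, spike component
    w = p * m_slab / (p * m_slab + (1.0 - p) * m_spike)   # P(slab | y)
    return w * (v / (v + s2)) * y                # E[theta | y]

# Demo: coefficients of a hypothetical alpha-smooth truth with
# |theta_{ik}| = L 2^{-i(alpha+1/2)}, observed in white noise.
rng = np.random.default_rng(2)
n, alpha, L = 10_000, 1.0, 1.0
for i in range(1, 12):
    theta0 = L * 2 ** (-i * (alpha + 0.5)) * np.ones(2 ** (i - 1))
    y = theta0 + rng.standard_normal(theta0.size) / np.sqrt(n)
    theta_hat = spike_slab_posterior_mean(y, n)
    print(i, float(np.abs(theta_hat).max()))
# Low levels (large coefficients) are kept nearly unshrunk; from roughly the
# level where |theta0_{ik}| ~ sqrt(log(n)/n) onwards, coefficients go to ~0.
```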

SLIDE 13

What about $f(x_0)$? Or $\|f - f_0\|_\infty$?

$$Y_{ik} = \theta_{ik} + n^{-1/2}\epsilon_{ik}, \quad \epsilon_{ik} \sim N(0, 1), \qquad \theta_0 \in \{\theta :\ |\theta_{ik}| \le L\, 2^{-i(\alpha + 1/2)}\ \forall i, k\}$$

◮ Prior: spike and slab, $\theta_{ik} \sim (1 - p_n)\,\delta_{(0)} + p_n\, g$, $p_n = 1/n$

◮ Losses:

$$l(\theta, \theta_0) = \Big( \sum_{i,k} (\theta_{ik} - \theta^o_{ik})\, \psi_{ik}(x_0)\, 2^{i/2} \Big)^2 \quad \text{(local)}, \qquad l(\theta, \theta_0) = \|\theta - \theta_0\|_\infty = \sum_i \max_k |\theta_{ik} - \theta^o_{ik}|\, 2^{i/2} \quad \text{(sup)}$$

◮ Bayesian concentration: $\forall \alpha > 0$, $\exists \theta_0 \in \Theta_\alpha(L)$ s.t.

$$E_{\theta_0}\Big( \Pi\big[ l(\theta, \theta_0) \le n^{-(\alpha - 1/2)/(2\alpha+1)} (\log n)^q \mid Y_1^n \big] \Big) = o(1)$$

Sub-optimal: take $\theta^o_{i0} = \rho_n 2^{-i/2}$ for $i \le I_n$ and $\theta^o_{ik} = 0$ otherwise: $\forall J > 0$,

$$\sum_{i > J} \sum_k (\theta^o_{ik})^2 \le n^{-2\alpha/(2\alpha+1)}, \qquad \sum_{i > J} \max_k |\theta^o_{ik}|\, 2^{i/2} > n^{-(\alpha - 1/2)/(2\alpha+1)} (\log n)^q$$
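Both losses are easy to evaluate from the level-by-level coefficient arrays; here is a small sketch (hypothetical helper names, made-up demo numbers, and an assumed level-by-level storage convention):

```python
import numpy as np

def sup_loss(theta, theta0):
    """l(theta, theta0) = sum_i 2^{i/2} max_k |theta_{ik} - theta0_{ik}|;
    coefficients are given level by level as lists of arrays (level i = 1 first)."""
    return sum(2 ** (i / 2) * np.abs(t - t0).max()
               for i, (t, t0) in enumerate(zip(theta, theta0), start=1))

def local_loss(theta, theta0, psi_at_x0):
    """l(theta, theta0) = (sum_{ik} (theta_{ik} - theta0_{ik}) psi_{ik}(x0) 2^{i/2})^2;
    psi_at_x0[i-1] holds the basis evaluations psi_{ik}(x0) at level i."""
    s = sum(2 ** (i / 2) * ((t - t0) * psi).sum()
            for i, (t, t0, psi) in enumerate(zip(theta, theta0, psi_at_x0), start=1))
    return s ** 2

# Tiny demo with made-up coefficients and basis evaluations (illustrative only).
theta  = [np.array([0.3]), np.array([0.1, -0.05])]   # levels i = 1, 2
theta0 = [np.zeros(1), np.zeros(2)]
psi    = [np.array([1.0]), np.array([0.0, -1.0])]    # psi_{ik}(x0), made up
print(local_loss(theta, theta0, psi), sup_loss(theta, theta0))
```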


SLIDE 14

Risk?

$$Y_{ik} = \theta_{ik} + n^{-1/2}\epsilon_{ik}, \quad \epsilon_{ik} \sim N(0, 1), \qquad \theta_0 \in \{\theta :\ |\theta_{ik}| \le L\, 2^{-i(\alpha + 1/2)}\ \forall i, k\}$$

◮ Prior: $\theta_{ik} \sim (1 - p_n)\,\delta_{(0)} + p_n\, g$, $p_n = 1/n$

◮ Sub-optimal concentration, BUT for $\hat\theta = E^\pi[\theta \mid Y^n]$,

$$\limsup_n\ \sup_{\alpha_1 \le \alpha \le \alpha_2}\ (n/\log n)^{2\alpha/(2\alpha+1)}\ \sup_{\theta_0 \in \Theta_\alpha} E^n_{\theta_0}\big[ l(\hat\theta, \theta_0) \big] < +\infty$$

Questions

◮ Question 1: How general is this (negative) result?

◮ Question 2: What does it tell us about posterior concentration?



SLIDE 17

A first general result

$H(\alpha_1, L) \cup H(\alpha_2, L) \subset \Theta$, $\alpha_1 < \alpha_2$.

◮ Local loss $l(\theta, \theta_0) = (\theta(x) - \theta_0(x))^2$

Result: there exists no prior that leads to adaptive minimax concentration over any collection of Hölder balls: for every prior $\pi$ on $\Theta$ and every $M > 0$,

$$\max_j\ \sup_{\theta_0 \in H(\alpha_j, L)} E_{\theta_0}\Big( \Pi\big[ l(\theta, \theta_0) > M n^{-2\alpha_j/(2\alpha_j+1)} \mid Y^n \big] \Big) = 1 + o(1)$$

◮ What do we lose?

◮ $L_\infty$ and local loss: if $\exists \theta_0 \in \Theta$ with

$$P_{\theta_0}\Big( \Pi\big[ l(\theta, \theta_0) > M n^{-2\alpha_2/(2\alpha_2+1)} \mid Y^n \big] > e^{-n\tau} \Big) = o(1), \quad \tau > 0,$$

then things are worse:

$$\max_j\ \sup_{\theta_0 \in H(\alpha_j, L)} E_{\theta_0}\Big( \Pi\big[ l(\theta, \theta_0) > n^{-(2\alpha_j - \tau)/(2\alpha_j+1)} \mid Y^n \big] \Big) = 1 + o(1)$$


SLIDE 18

Still not completely satisfying

◮ For the local loss: if we could find a prior losing only a $\log n$ factor, then who cares!

◮ For the $L_\infty$ loss: something smaller than $e^{-n\tau}$ is to be expected because of the tests.

Can we be more precise? Slightly.



SLIDE 21

Another negative result

$H(\alpha_1, L) \cup H(\alpha_2, L) \subset \Theta$, $\alpha_1 < \alpha_2$, $\epsilon_n(\alpha) = (n/\log n)^{-\alpha/(2\alpha+1)}$, $2^{J_{n,\alpha_2}} = (n/\log n)^{1/(2\alpha_2+1)}$.

If there exists $\theta_0 \in H(\alpha_2, L)$ such that

$$\pi\big( \|\theta - \theta_0\|_2 \le c\,\epsilon_n(\alpha_2) \big) \ge e^{-n\epsilon_n^2(\alpha_2)},$$

$$\pi\Big( \sum_{j \ge J_{n,\alpha_2}} \sum_k \theta_{jk}^2 > A\,\epsilon_n(\alpha_2)^2 \Big) \le e^{-B n \epsilon_n^2(\alpha_2)},$$

and $\exists \rho_n \downarrow 0$ s.t.

$$\pi\Big( \sum_{j \ge J_{n,\alpha_2}} 2^{j/2} \max_k |\theta_{jk}| > \rho_n\,\epsilon_n(\alpha_1) \Big) \le e^{-B n \epsilon_n^2(\alpha_2)},$$

then there exists $\theta_1 \in H(\alpha_1, L)$ with

$$E_{\theta_1}\big( \Pi[\, l(\theta, \theta_1) \gg \epsilon_n(\alpha_1) \mid Y^n \,] \big) = 1 + o(1).$$


SLIDE 22

Conclusion

◮ Bayesian methods are great for risks related to Kullback–Leibler: $L_2$ in regression, Hellinger or $L_1$ in density estimation, etc.

◮ How to understand some specific features of these big models? More tricky.

◮ Can we prove that, for every prior $\pi$, there is no adaptation in $L_\infty$ for concentration rates?

◮ Why should we care? → the interpretation of credible bands!?

◮ Are these negative results related to the non-existence of adaptive confidence bands in $L_\infty$?

◮ If there is no adaptive prior: it is important to understand the types of $\theta_0$ that won't work, e.g. . . .


SLIDE 23

THANK YOU
