Bayesian estimation of the discrepancy with misspecified parametric models (PowerPoint presentation)


SLIDE 1


Bayesian estimation of the discrepancy with misspecified parametric models

Pierpaolo De Blasi

University of Torino & Collegio Carlo Alberto

Bayesian Nonparametrics workshop ICERM, 17-21 September 2012

Joint work with S. Walker

SLIDE 2


Outline

  • Semiparametric density estimation
  • Asymptotics and illustration
  • References

SLIDE 3


BNP density estimation

  • Let X1, . . . , Xn be exchangeable (i.e. conditionally iid) observations from an unknown density f on the real line.

  • If F is the density space and Π(df) the prior, Bayes' theorem gives

Π(A | X1, . . . , Xn) = ∫_A ∏_{i=1}^n f(Xi) Π(df) / ∫_F ∏_{i=1}^n f(Xi) Π(df)

  • Wealth of Bayesian nonparametric (BNP) models
  • Dirichlet process mixtures of continuous densities;
  • log spline models;
  • Bernstein polynomials;
  • log Gaussian processes.
  • All with well-studied asymptotic properties, e.g. posterior concentration rates: Π(f : d(f, f0) > M εn | X1, . . . , Xn) → 0 as n → ∞, when X1, X2, . . . are iid from some "true" f0.
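The Bayes update above can be made concrete with a toy discrete prior over a handful of candidate densities, so the integrals become sums. The three candidates and the data-generating f0 below are made up for this sketch (f0(x) = 2(1 − x) is borrowed from the talk's first illustration):

```python
import numpy as np

# Toy Bayes update over densities: discrete prior over three candidate
# densities on [0, 1]; posterior mass is prior mass times the likelihood
# prod_i f(x_i), normalized.
rng = np.random.default_rng(0)

candidates = {
    "uniform":    lambda x: np.ones_like(x),
    "triangular": lambda x: 2 * (1 - x),   # the f0 used later in the talk
    "increasing": lambda x: 2 * x,
}
prior = {name: 1 / 3 for name in candidates}

# Data actually drawn from f0(x) = 2(1 - x), via inverse-cdf sampling:
# F(x) = 1 - (1 - x)^2, so F^{-1}(u) = 1 - sqrt(1 - u).
x = 1 - np.sqrt(1 - rng.uniform(size=200))

# Log-likelihoods, then the normalized posterior Π(f | x_1, ..., x_n).
loglik = {name: np.sum(np.log(f(x))) for name, f in candidates.items()}
m = max(loglik.values())
w = {name: np.exp(l - m) * prior[name] for name, l in loglik.items()}
z = sum(w.values())
posterior = {name: v / z for name, v in w.items()}
print(posterior)  # mass concentrates on "triangular"
```

With n = 200 the posterior puts essentially all its mass on the true density, a finite-dimensional caricature of the posterior concentration statement above.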

SLIDE 4


Discrepancy from a parametric model

  • Suppose now we have a favorite parametric family fθ(x), θ ∈ Θ ⊂ ℝp, likely to be misspecified: there is no θ such that f0 = fθ.

  • We want to learn about the best parameter value θ0, which minimizes the Kullback-Leibler divergence from the true f0:

θ0 = arg min_{θ∈Θ} ∫ f0 log(f0/fθ)

  • A nonparametric component W is introduced to model the discrepancy between f0 and the closest density fθ0: fθ,W(x) ∝ fθ(x) W(x), so that C(x) := W(x) / ∫ W(s) fθ(s) ds is designed to estimate C0(x) = f0(x)/fθ0(x).

SLIDE 5


Related works - Frequentist

Hjort and Glad (1995)

  • Start with a parametric density estimate fθ̂(x), θ̂ being, e.g., the MLE maximizing the log-likelihood ∑_{i=1}^n log fθ(xi).

  • Then multiply it by a nonparametric kernel-type estimate of the correction function r(x) = f0(x)/fθ̂(x):

f̂(x) = fθ̂(x) r̂(x) = (1/n) ∑_{i=1}^n Kh(xi − x) fθ̂(x)/fθ̂(xi)

in a two-stage sequential analysis.

  • f̂ is shown to be more precise than the traditional kernel density estimator in a broad neighborhood around the parametric family, while losing little when f0 is far from the parametric family.
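A minimal sketch of the two-stage estimator above. The ingredients (Gamma data, an exponential parametric start, a Gaussian kernel, the bandwidth) are assumptions of this sketch, not taken from the slides:

```python
import numpy as np

# Hjort-Glad-style parametric-start kernel estimate:
# f_hat(x) = f_par(x) * (1/n) * sum_i K_h(x_i - x) / f_par(x_i),
# i.e. a kernel estimate of the correction r(x) = f0(x)/f_par(x)
# multiplied back onto the parametric fit.
rng = np.random.default_rng(1)

# Hypothetical setup: data from Gamma(2, 1); parametric start = Exponential.
data = rng.gamma(shape=2.0, scale=1.0, size=400)

rate = 1.0 / data.mean()                     # exponential MLE: 1 / xbar
f_par = lambda x: rate * np.exp(-rate * x)

def hjort_glad(x, data, h):
    """Kernel-corrected parametric density estimate at points x."""
    K = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # Gaussian kernel
    w = K((data[None, :] - x[:, None]) / h) / h             # K_h(x_i - x)
    return f_par(x) * np.mean(w / f_par(data)[None, :], axis=1)

xs = np.linspace(0.1, 8, 200)
est = hjort_glad(xs, data, h=0.4)
print(est.sum() * (xs[1] - xs[0]))  # total mass, close to 1
```

When f0 really is exponential the correction factor is flat and the estimate inherits the parametric rate; here, with Gamma data, the kernel term repairs the misspecified start.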

SLIDE 6


Related works - Bayes

Nonparametric prior built around a parametric model via f(x) = fθ(x)g(Fθ(x)), where Fθ is the cdf of fθ and g is a density on [0, 1] with prior Π.

  • Verdinelli and Wasserman (1998): Π as an infinite exponential family. Application to goodness-of-fit testing.

  • Rousseau (2008): Π as a mixture of betas. Application to goodness-of-fit testing.

  • Tokdar (2007): Π as a log Gaussian process prior. Application to posterior inference for densities with unbounded support. For g(x) = e^{Z(x)} / ∫_0^1 e^{Z(s)} ds and Z a Gaussian process with covariance σ(·, ·), f(x) can be written f(x) ∝ fθ(x) e^{Z̃(x)}, with W(x) = e^{Z̃(x)}, for Z̃ a Gaussian process with covariance σ(Fθ(·), Fθ(·)).

SLIDE 7


Posterior updating

fθ,W(x) ∝ fθ(x) W(x),   C(x) := W(x) / ∫ W(s) fθ(s) ds.

  • Truly semi-parametric: the aim is first to learn about the best parameter θ0, and then to see how close fθ0 is to f0 via C(x) = W(x) / ∫ W(s) fθ(s) ds.

  • A situation in which the updating process from prior to posterior may be seen as problematic: the model fθ,W is intrinsically non-identified in (θ, C).

  • The full Bayesian update

π̃(θ, W | x1, . . . , xn) ∝ π(θ) π(W) ∏_{i=1}^n fθ,W(xi)

is appropriate for learning about f0; it is not so for learning about (θ0, C0).

  • The marginal posterior π̃(θ | x1, . . . , xn) = ∫ π̃(θ, W | x1, . . . , xn) dW has no interpretation: it is not clear which parameter value this π̃ is targeting.

SLIDE 8


Posterior updating

  • What removes us from the formal Bayes set-up is the desire to specifically learn about θ0.

  • θ0 is defined without any reference to W or C. Whether we are interested in learning about C0 or not, our beliefs about θ0 should not change.

  • Hence, the appropriate update for θ is the parametric one:

π(θ | x1, . . . , xn) ∝ π(θ) ∏_{i=1}^n fθ(xi).

  • We keep updating W according to the semi-parametric model,

π̃(W | θ, x1, . . . , xn) ∝ π(W) ∏_{i=1}^n fθ,W(xi),

so our updating scheme is the non-full Bayesian update

π(θ, W | x1, . . . , xn) = π̃(W | θ, x1, . . . , xn) π(θ | x1, . . . , xn).
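The parametric half of this update, π(θ | x1, . . . , xn) ∝ π(θ) ∏ fθ(xi), is easy to sketch on a grid. The setup below borrows the truncated-exponential illustration appearing later in the talk; the grid limits, sample size and seed are assumptions of the sketch:

```python
import numpy as np

# Grid evaluation of the parametric posterior for the talk's Illustration 1:
# f_theta(x) = theta * exp(-theta x) / (1 - exp(-theta)) on [0, 1],
# improper prior pi(theta) ∝ 1/theta, data from f0(x) = 2(1 - x).
rng = np.random.default_rng(2)
x = 1 - np.sqrt(1 - rng.uniform(size=500))   # inverse-cdf draws from 2(1 - x)

thetas = np.linspace(0.5, 4.0, 2000)

def loglik(th):
    # sum_i log f_theta(x_i) = n*log(theta) - n*log(1 - e^-theta) - theta*sum(x)
    return len(x) * (np.log(th) - np.log1p(-np.exp(-th))) - th * x.sum()

logpost = np.array([loglik(t) for t in thetas]) - np.log(thetas)  # prior 1/theta
logpost -= logpost.max()
post = np.exp(logpost)
post /= post.sum() * (thetas[1] - thetas[0])  # normalize to a density

mean = np.sum(thetas * post) * (thetas[1] - thetas[0])
print(round(mean, 2))  # typically lands near theta0 = 2.15
```

The conditional update of W given each θ draw would then run on top of this, but it needs a Gaussian-process sampler and is beyond a few lines.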

SLIDE 9


Posterior updating

π(θ, W|x1, . . . , xn) = ˜ π(W|θ, x1, . . . , xn) π(θ|x1, . . . , xn).

  • (θ, W) are estimated sequentially, with W reflecting additional uncertainty about θ.

  • Marginalization of the posterior over W is well defined,

π(W | x1, . . . , xn) = ∫_Θ π̃(W | θ, x1, . . . , xn) π(dθ | x1, . . . , xn),

since π(θ | x1, . . . , xn) describes the beliefs about the real parameter θ0.

  • Coherence is about properly defining the quantities of interest and showing that the Bayesian updates provide learning about these quantities; this is checked by what is yielded asymptotically.

  • Hence we seek frequentist validation: we show that the posterior of (θ, C) converges to a point mass at (θ0, C0).

SLIDE 10


Lenk (2003)

  • Let I be a compact interval on the real line and Z a Gaussian process. Lenk (2003) considers the semi-parametric density model

f(x) = fθ(x) e^{Z(x)} / ∫_I fθ(s) e^{Z(s)} ds

for fθ(x) a member of the exponential family.

  • In the Karhunen-Loève expansion of Z(x), the orthogonal basis is chosen so that the sample paths integrate to zero.

  • A further assumption for identification: the orthogonal basis does not contain any of the canonical statistics of fθ(x).

  • Estimation is based on truncation of the series expansion or on imputation of the Gaussian process at a fixed grid of points; see Tokdar (2007).

SLIDE 11


Bounded W(x)

  • Building upon Lenk (2003), we keep working with Gaussian processes and consider

fθ,W(x) = fθ(x) W(x) / ∫_I fθ(s) W(s) ds,   W(x) = Ψ(Z(x)),

where Ψ(u) is a cdf having a smooth unimodal symmetric density ψ(u) on the real line.

  • With an additional condition on Ψ(u), we can show that W(x) preserves the asymptotic properties of the log Gaussian process prior.

  • On the other hand, with W(x) ≤ 1, Walker (2011) describes a latent model which can deal with the intractable normalizing constant. It is based on

∑_{k=0}^∞ (n+k−1 choose k) [∫ fθ(s) (1 − W(s)) ds]^k = (1 / ∫ W(s) fθ(s) ds)^n.
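The displayed identity is the negative binomial series ∑_k (n+k−1 choose k) q^k = (1 − q)^{−n} with q = ∫ fθ(s)(1 − W(s)) ds, so that 1 − q = ∫ W fθ. A quick numeric check with stand-in numbers (the values of n and q here are arbitrary):

```python
import math

# Verify: sum_k C(n+k-1, k) * q^k == (1 - q)^(-n), the identity behind
# Walker (2011)'s latent-variable treatment of the normalizing constant.
n = 5
q = 0.3                      # stands in for integral of f_theta * (1 - W)
partial = sum(math.comb(n + k - 1, k) * q**k for k in range(200))
closed = (1 - q) ** (-n)     # equals (1 / integral of W * f_theta)^n
print(partial, closed)       # the two agree to machine precision
```

Truncating the series at k = 200 is already far past convergence for q = 0.3; the latent model works by augmenting with the integer k rather than truncating.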

SLIDE 12


Link function Ψ(u)

  • Lipschitz condition on log Ψ(u): ψ(u)/Ψ(u) ≤ m uniformly on ℝ, satisfied by the standard Laplace cdf, the standard logistic cdf, and the standard Cauchy cdf, but not by the standard normal cdf.

  • For fixed θ, write pz = fθ,Ψ(z). It can be shown that, when ‖z1 − z2‖∞ < ε,

h(pz1, pz2) ≤ mε e^{mε/2},   K(pz1, pz2) ≤ m²ε² e^{mε} (1 + mε).

  • The posterior asymptotic results of van der Vaart and van Zanten (2008) carry over to this setting: if Ψ^{−1}(f0/fθ) is contained in the support of Z, then

Π{pz : h(pz, f0) > ε | X1, . . . , Xn} → 0, F0∞-a.s.

Results on posterior contraction rates can also be derived.
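The Lipschitz condition is easy to check numerically: for the standard logistic cdf, ψ = Ψ(1 − Ψ), so ψ/Ψ = 1 − Ψ ≤ 1 (m = 1), while for the standard normal the ratio φ(u)/Φ(u) is the hazard function of −u and grows without bound as u → −∞:

```python
import math

# psi/Psi for the logistic link (bounded by m = 1) vs the normal link
# (unbounded as u -> -infinity), illustrating why the Gaussian cdf
# fails the Lipschitz condition on log Psi.
def logistic_ratio(u):
    p = 1 / (1 + math.exp(-u))
    return p * (1 - p) / p            # psi/Psi = 1 - Psi <= 1

def normal_ratio(u):
    phi = math.exp(-u * u / 2) / math.sqrt(2 * math.pi)
    Phi = 0.5 * math.erfc(-u / math.sqrt(2))
    return phi / Phi                  # normal hazard of -u; ~ |u| for u << 0

us = [-30, -10, -2, 0, 2]
print([round(logistic_ratio(u), 3) for u in us])  # all <= 1
print([round(normal_ratio(u), 3) for u in us])    # grows like |u| on the left
```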

SLIDE 13


Conditional posterior of W

(A) Lipschitz condition on log Ψ(u); (B) fθ(x) is continuous and bounded away from zero; (C) the support of Z contains the space C(I) of continuous densities on I.

Theorem 1. Under assumptions (A), (B) and (C), the conditional posterior of W given θ is exponentially consistent at every f0 ∈ C(I), i.e. for any ε > 0,

π̃{W : h(fθ,W, f0) > ε | θ, X1, . . . , Xn} ≤ e^{−dn}, F0∞-a.s. for some d > 0 as n → ∞.

  • As a corollary, for fixed θ, the posterior of C(x) = W(x) / ∫_I fθ(s) W(s) ds consistently estimates the discrepancy f0(x)/fθ(x).

  • The exponential convergence to 0 is a by-product of standard techniques for proving posterior consistency.

SLIDE 14


Marginal posterior of θ

  • For given f0, let θ0 be the parameter value that minimizes ∫_I f0 log(f0/fθ):

θ0 = arg min_{θ∈Θ} ∫ f0 log(f0/fθ)

  • Under some regularity conditions on the family fθ and on the prior at θ0, the posterior accumulates at θ0 at rate √n:

π{|θ − θ0| > Mn n^{−1/2} | X1, . . . , Xn} → 0, F0∞-a.s.;

see Kleijn and van der Vaart (2012).

  • One of the key regularity conditions on fθ is the existence of an open neighborhood U of θ0 and a square-integrable function mθ0(x) such that, for all θ1, θ2 ∈ U, |log(fθ1/fθ2)| ≤ mθ0 |θ1 − θ2|, P0-a.s.

  • For our purposes, we focus on a different local property: there exist α > 0 and an open neighborhood U of θ0 such that, for all θ1, θ2 ∈ U,

‖log(fθ1/fθ2)‖∞ ≲ |θ1 − θ2|^α.   (D)

SLIDE 15


Marginal posterior of W

Assume the regularity conditions on fθ and π(θ) are satisfied for f0 ∈ C(I). Recall the definition of the marginal posterior of W,

π(W | x1, . . . , xn) = ∫_Θ π̃(W | θ, x1, . . . , xn) π(dθ | x1, . . . , xn).

Theorem 2. Under assumptions (A), (B), (C) and (D), the marginal posterior of W satisfies

π{W : h(fθ0,W, f0) > ε | X1, . . . , Xn} → 0, F0∞-a.s. as n → ∞.

  • The marginal posterior of W is evaluated outside a neighborhood defined in terms of θ0. Clearly, if π(θ) is degenerate at θ0, the result follows directly from Theorem 1.

  • Hint of the proof: it suffices to consider the posterior when the prior is restricted to |θ − θ0| ≤ Mn n^{−1/2}. We then manipulate numerator and denominator by using (D) together with the inequalities

exp{−‖log(fθ0/fθ)‖∞} ≤ ∫_I fθ0(x) W(x) dx / ∫_I fθ(x) W(x) dx ≤ exp{‖log(fθ0/fθ)‖∞}.

SLIDE 16


Marginal posterior of C

Recall the definition C(x) = Cθ,W(x) = W(x) / ∫ W(s) fθ(s) ds, which is designed to estimate C0(x) = f0(x)/fθ0(x).

Corollary. Under the hypotheses of Theorem 2, as n → ∞,

π{∫_I |C − C0| > ε | X1, . . . , Xn} → 0, F0∞-a.s.

  • Together with π{|θ − θ0| > Mn n^{−1/2} | X1, . . . , Xn} → 0, we conclude that the posterior of (θ, C) converges to (θ0, C0).

  • Hint of the proof: Theorem 2 implies that ∫_I |Cθ0,W − C0| goes to 0. By the triangle inequality, it is sufficient to show that, uniformly over |θ − θ0| ≤ Mn n^{−1/2}, ∫_I |Cθ,W − Cθ0,W| → 0. This we show by using (D).

SLIDE 17


Illustration 1

n = 500 observations from f0(x) = 2(1 − x); fθ(x) = θ e^{−θx}/(1 − e^{−θ}) with improper prior π(θ) ∝ 1/θ.
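The minimum-KL value θ0 = 2.15 quoted on the next slide can be recovered directly. Using ∫ x f0(x) dx = 1/3, the θ-dependent part of ∫ f0 log(f0/fθ) reduces to g(θ) = −log θ + log(1 − e^{−θ}) + θ/3, so θ0 solves g′(θ) = 0. The bisection below is a check on that reduction, not the talk's code:

```python
import math

# theta0 = argmin KL(f0 || f_theta) for f0(x) = 2(1 - x) and the
# truncated exponential f_theta(x) = theta e^(-theta x) / (1 - e^(-theta)).
# g'(theta) = -1/theta + e^(-theta)/(1 - e^(-theta)) + 1/3, solved by bisection.
def gprime(t):
    return -1 / t + math.exp(-t) / (1 - math.exp(-t)) + 1 / 3

lo, hi = 1.0, 4.0                      # gprime(lo) < 0 < gprime(hi)
for _ in range(60):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if gprime(mid) < 0 else (lo, mid)
theta0 = (lo + hi) / 2
print(round(theta0, 2))  # -> 2.15
```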

Figure: Estimated (bold) and true (dashed) discrepancy C0(x); x-axis x ∈ [0, 1], y-axis C(x).

SLIDE 18


Parametric Bayes update

Simulation from the posterior of θ. The minimum K-L parameter value is θ0 = 2.15.


Figure: Posterior distribution of θ with parametric Bayes update. Posterior mean 2.13.

SLIDE 19


Full Bayes update

Using the proper conditional posterior π̃(θ | W, x1, . . . , xn) ∝ π(θ) ∏_i fθ,W(xi).


Figure: Posterior distribution of θ with formal Bayes update. Posterior mean 1.97.

SLIDE 20


Illustration 2

n = 500 observations from f0(x) = 2x; fθ(x) = θ x^{θ−1} with 0 < θ < 1 and a uniform prior for θ. θ0 = 1.
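Here θ0 sits on the boundary of Θ: using ∫_0^1 2x log x dx = −1/2, the θ-dependent part of the KL divergence is g(θ) = −log θ + (θ − 1)/2, whose unconstrained minimizer is θ = 2, outside (0, 1). Since g′(θ) = −1/θ + 1/2 < 0 on (0, 1], g is decreasing there and the constrained minimizer is θ0 = 1. A tiny check of that reduction:

```python
import math

# theta-dependent part of KL(f0 || f_theta) for f0(x) = 2x and
# f_theta(x) = theta * x^(theta - 1), restricted to theta in (0, 1]:
# g(theta) = -log(theta) + (theta - 1)/2, decreasing on (0, 1].
g = lambda t: -math.log(t) + (t - 1) / 2
grid = [k / 1000 for k in range(1, 1001)]   # theta in (0, 1]
theta0 = min(grid, key=g)
print(theta0)  # -> 1.0
```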


Figure: Posterior distributions of θ with parametric Bayes update (top) and formal Bayes update (bottom).

SLIDE 21


Discussion

  • Both the proposed update and the formal Bayes update seem to provide a suitable estimate of C, which is not surprising given the model's flexibility in estimating C0 with alternative values of θ. Yet the posterior for θ is more accurate under the parametric Bayesian update.

  • This shows that semiparametric models need to be thought about carefully: the parametric part needs to define which θ is being targeted. Future work will deal with:

  • fθ with unbounded support.
  • Extension to posterior contraction rates.
  • Connections with asymptotic properties of empirical Bayes.
  • Use of the C(x) function for model selection.

SLIDE 22


References

  • De Blasi & Walker (2012). Bayesian estimation of the discrepancy with misspecified parametric models. Tech. Rep., submitted.

  • Hjort & Glad (1995). Nonparametric density estimation with a parametric start. Ann. Statist. 23, 882-904.

  • Kleijn & van der Vaart (2012). The Bernstein-von Mises theorem under misspecification. Electron. J. Stat. 6, 354-381.

  • Lenk (1988). The logistic normal distribution for Bayesian, nonparametric, predictive densities. J. Amer. Statist. Assoc. 83, 509-516.

  • Rousseau (2008). Approximating interval hypothesis: p-values and Bayes factors. In Bayesian Statistics 8, 417-452.

  • Tokdar (2007). Towards a faster implementation of density estimation with logistic Gaussian process priors. J. Comp. Graph. Statist. 16, 633-655.

  • van der Vaart & van Zanten (2008). Rates of contraction of posterior distributions based on Gaussian process priors. Ann. Statist. 36, 1435-1463.

  • Verdinelli & Wasserman (1998). Bayesian goodness-of-fit testing using infinite-dimensional exponential families. Ann. Statist. 26, 1215-1241.

  • Walker (2011). Posterior sampling when the normalizing constant is unknown. Comm. Statist. Simulation Comput. 40, 784-792.
