Exchangeability (Peter Orbanz, Columbia University)



slide-1
SLIDE 1

Exchangeability

Peter Orbanz

Columbia University

slide-2
SLIDE 2

PARAMETERS AND PATTERNS

Parameters

P(X|θ) = Probability[data|pattern]

[Figure: output y plotted against input x]

Inference idea

data = underlying pattern + independent noise

Peter Orbanz 2 / 25

slide-3
SLIDE 3

TERMINOLOGY

Parametric model

◮ Number of parameters fixed (or constantly bounded) w.r.t. sample size

Nonparametric model

◮ Number of parameters grows with sample size
◮ ∞-dimensional parameter space

Example: Density estimation

[Figure: density estimation example with two panels, "Parametric" and "Nonparametric"; labels: x1, x2, µ, p(x)]
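The parameter-count distinction above can be sketched for density estimation. This is a minimal illustration, not the construction used on the slide: the Gaussian fit, the kernel estimator with one Gaussian kernel per data point, the bandwidth 0.5 and the standard normal test data are all assumptions chosen for the example.

```python
import random
import statistics

def gaussian_fit(data):
    """Parametric estimate: always two parameters (mean, standard deviation)."""
    dist = statistics.NormalDist.from_samples(data)
    return dist.pdf, 2

def kernel_density(data, bandwidth):
    """Nonparametric estimate: one Gaussian kernel centred at every data point."""
    kernels = [statistics.NormalDist(x, bandwidth) for x in data]

    def pdf(x):
        return sum(k.pdf(x) for k in kernels) / len(kernels)

    return pdf, len(kernels)

rng = random.Random(0)
for n in (10, 100, 1000):
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # hypothetical data
    _, n_param = gaussian_fit(data)
    _, n_nonparam = kernel_density(data, bandwidth=0.5)
    # The parametric count stays at 2; the kernel count equals the sample size.
    print(n, n_param, n_nonparam)
```

The second returned value is the number of location parameters each estimate carries, which stays fixed in the first case and grows with the sample in the second.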

slide-4
SLIDE 4

NONPARAMETRIC BAYESIAN MODEL

Definition

A nonparametric Bayesian model is a Bayesian model on an ∞-dimensional parameter space.

Interpretation

Parameter space T = set of possible patterns. Recall previous tutorials:

Model               T                   Application
Gaussian process    Smooth functions    Regression problems
DP mixtures         Smooth densities    Density estimation
CRP, 2-param. CRP   Partitions          Clustering

Solution to Bayesian problem = posterior distribution on patterns

[Sch95]

slide-5
SLIDE 5

DE FINETTI’S THEOREM

Infinite exchangeability

For all π ∈ S∞ (= infinite symmetric group): P(X1, X2, . . . ) = P(Xπ(1), Xπ(2), . . . )

or

π(P) = P

Theorem (de Finetti)

P exchangeable ⇔

$$P(X_1, X_2, \ldots) = \int_{M(X)} \Big( \prod_{n=1}^{\infty} Q(X_n) \Big) \, d\nu(Q)$$

◮ Q is a random measure
◮ ν uniquely determined by P
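One direction of the theorem can be checked by hand on a small case: a mixture of i.i.d. laws is exchangeable, although the variables are no longer independent. The sketch below uses exact rational arithmetic on coin flips; the two-point mixing measure `NU` on the coin bias is a hypothetical choice made only for illustration.

```python
from fractions import Fraction
from itertools import permutations, product

# Hypothetical two-point mixing measure nu on the coin bias p.
NU = {Fraction(1, 5): Fraction(1, 2), Fraction(9, 10): Fraction(1, 2)}

def mixture_prob(bits, nu=NU):
    """P(X_1..X_n = bits) = sum_p nu(p) * prod_i p^x_i * (1 - p)^(1 - x_i)."""
    total = Fraction(0)
    for p, weight in nu.items():
        term = weight
        for x in bits:
            term *= p if x == 1 else 1 - p
        total += term
    return total

def is_exchangeable(n):
    """Check permutation invariance of the joint law on all length-n bit strings."""
    return all(
        mixture_prob(bits) == mixture_prob(perm)
        for bits in product((0, 1), repeat=n)
        for perm in permutations(bits)
    )

print(is_exchangeable(4))
# Joint probability of (1, 1) versus the product of the marginals:
print(mixture_prob((1, 1)), mixture_prob((1,)) ** 2)
```

The last line shows that the joint probability of two ones differs from the squared marginal, so the sequence is exchangeable without being i.i.d.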

slide-6
SLIDE 6

FINITE EXCHANGEABILITY

Finite sequence X1, . . . , Xn

Exchangeability of finite sequence ⇏ de Finetti-representation

Example: Two exchangeable random bits

            X1 = 0    X1 = 1
X2 = 0        0        1/2
X2 = 1       1/2        0

Suppose de Finetti holds; then

$$0 = P(X_1 = X_2 = 1) = \int_{[0,1]} p^2 \, d\nu(p) \;\Rightarrow\; \nu\{p = 0\} = 1$$

$$0 = P(X_1 = X_2 = 0) = \int_{[0,1]} (1 - p)^2 \, d\nu(p) \;\Rightarrow\; \nu\{p = 1\} = 1$$

Both conclusions cannot hold at once, so no mixing measure ν exists.

Intuition

Finite exchangeability does not eliminate sequential patterns.

[DF80]
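The two-bit counterexample can also be verified numerically. Under any mixture of i.i.d. coin flips, Cov(X1, X2) equals the variance of the bias p under ν and is therefore non-negative, so a negative covariance rules out a de Finetti representation. The sketch below is only an illustration of that check; the helper names are hypothetical.

```python
from fractions import Fraction

# Joint law of the two bits: mass 1/2 on (0, 1) and mass 1/2 on (1, 0).
JOINT = {(0, 1): Fraction(1, 2), (1, 0): Fraction(1, 2)}

def prob(x1, x2):
    return JOINT.get((x1, x2), Fraction(0))

def covariance():
    """Cov(X1, X2) = E[X1 X2] - E[X1] E[X2] under JOINT."""
    e1 = sum(p * x1 for (x1, _), p in JOINT.items())
    e2 = sum(p * x2 for (_, x2), p in JOINT.items())
    e12 = sum(p * x1 * x2 for (x1, x2), p in JOINT.items())
    return e12 - e1 * e2

# Swapping the two coordinates leaves every probability unchanged.
exchangeable = all(prob(a, b) == prob(b, a) for a in (0, 1) for b in (0, 1))
print(exchangeable, covariance())
```

The pair is exchangeable, yet its covariance is negative, which no mixture of i.i.d. bits can produce.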

slide-7
SLIDE 7

SUPPORT OF PRIORS

[Figure: the model drawn as a subset of M(X), with a point P0 = Pθ0 inside the model and a point P0 outside the model (misspecified)]

[Gho10, KvdV06]

slide-8
SLIDE 8

SUPPORT OF NONPARAMETRIC PRIORS

Large support

◮ Support of nonparametric priors is larger (∞-dimensional) than of parametric priors (finite-dimensional).

◮ However: No uniform prior (or even “neutral” improper prior) exists on M(X).

Interpretation of nonparametric prior assumptions

Concentration of nonparametric prior on subset of M(X) typically represents structural prior assumption.

◮ GP regression with unknown bandwidth:
  ◮ Any continuous function possible
  ◮ Prior can express e.g. “very smooth functions are more probable”
◮ Clustering: Expected number of clusters is...
  ◮ ...small → CRP prior
  ◮ ...power law → two-parameter CRP
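The clustering assumption above can be made concrete by simulating the number of occupied tables under both priors. This is a rough sketch under assumed settings (concentration α = 1, discount 0.5, n = 2000); the function names are hypothetical, and the seating rule used is the standard two-parameter scheme, which reduces to the CRP when the discount is 0.

```python
import random

def crp_num_clusters(n, alpha, discount=0.0, rng=None):
    """Seat n customers by the (two-parameter) CRP and return the number of tables.

    discount = 0 gives the one-parameter CRP.
    """
    rng = rng or random.Random()
    sizes = []  # sizes[k] = number of customers at table k
    for i in range(n):  # i customers are already seated
        k = len(sizes)
        p_new = (alpha + discount * k) / (alpha + i)
        if rng.random() < p_new:
            sizes.append(1)
        else:
            # Existing table chosen with probability proportional to (size - discount).
            weights = [s - discount for s in sizes]
            table = rng.choices(range(k), weights=weights)[0]
            sizes[table] += 1
    return len(sizes)

def expected_clusters_crp(n, alpha):
    """Exact expected number of tables under the one-parameter CRP."""
    return sum(alpha / (alpha + i) for i in range(n))

rng = random.Random(1)
print(crp_num_clusters(2000, 1.0, 0.0, rng), expected_clusters_crp(2000, 1.0))
print(crp_num_clusters(2000, 1.0, 0.5, rng))
```

With these settings the one-parameter count grows roughly logarithmically in n, while the discounted version grows roughly like a power of n.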

slide-9
SLIDE 9

PARAMETERIZED MODELS

Probability model

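[Diagram: random variable X mapping ω ∈ Ω to X(ω), with law P(X) = X[P]; parameter variable Θ mapping ω to Θ(ω) ∈ T]

Parameterized model P[X|Θ]

[Diagram: chain of maps from Ω via X∞ to the sequence space X∞, via F to M(X) ⊃ P, and via T to T; the composite is Θ]

◮ P = {P[X|θ] | θ ∈ T }
◮ F ≡ law of large numbers
◮ T : P[ . |Θ = θ] → θ bijection
◮ Θ := T ◦ F ◦ X∞

[Sch95]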

slide-10
SLIDE 10

JUSTIFICATION: BY EXCHANGEABILITY

Again: de Finetti

$$P(X_1, X_2, \ldots) = \int_{M(X)} \Big( \prod_{n=1}^{\infty} Q(X_n) \Big) \, d\nu(Q) = \int_{T} \Big( \prod_{n=1}^{\infty} Q(X_n \mid \Theta = \theta) \Big) \, d\nu_T(\theta)$$

◮ Θ random measure (since Θ(ω) ∈ M(X))

Convergence results

The de Finetti theorem comes with a convergence result attached:

◮ Empirical measure: $F_n \xrightarrow{\text{weakly}} \theta$ as $n \to \infty$
◮ Posterior Λn(Θ|X1, . . . , Xn) = Λn( . , ω) in M(T ) exists
◮ Posterior convergence: $\Lambda_n(\,\cdot\,, \omega) \xrightarrow{n \to \infty} \delta_{\Theta(\omega)}$

[Kal01]
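The convergence statements can be watched in the simplest conjugate case. The sketch below assumes Bernoulli data with a hypothetical true parameter 0.3 and a Beta(1, 1) prior, neither of which comes from the slides; it prints the empirical frequency alongside the posterior mean and variance as n grows.

```python
import random

def beta_posterior(data, a=1.0, b=1.0):
    """Mean and variance of the Beta(a + k, b + n - k) posterior for Bernoulli data."""
    k = sum(data)
    a_post, b_post = a + k, b + len(data) - k
    total = a_post + b_post
    mean = a_post / total
    var = a_post * b_post / (total ** 2 * (total + 1))
    return mean, var

rng = random.Random(0)
theta = 0.3  # hypothetical true parameter
data = [1 if rng.random() < theta else 0 for _ in range(20000)]
for n in (10, 100, 1000, 20000):
    mean, var = beta_posterior(data[:n])
    # Empirical frequency, posterior mean, posterior variance.
    print(n, sum(data[:n]) / n, round(mean, 4), var)
```

The shrinking variance is the finite-dimensional analogue of the posterior collapsing to a point mass at Θ(ω).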

slide-11
SLIDE 11

SPECIAL TYPES OF EXCHANGEABLE DATA

slide-12
SLIDE 12

MODIFICATIONS

Pólya Urns

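$$P(X_{n+1} \mid X_1 = x_1, \ldots, X_n = x_n) = \frac{1}{\alpha + n} \sum_{j=1}^{n} \delta_{x_j}(X_{n+1}) + \frac{\alpha}{\alpha + n} G_0(X_{n+1})$$

Exchangeable:

◮ ν is DP(α, G0)
◮ $\prod_{n=1}^{\infty} Q(X_n \mid \theta) = \prod_{n=1}^{\infty} \theta(X_n) = \prod_{n=1}^{\infty} \sum_{j=1}^{\infty} c_j \delta_{t_j}(X_n)$

Exchangeable increment processes (H. Bühlmann)

Stationary, exchangeable increment process = mixture of Lévy processes

$$P((X_t)_{t \in \mathbb{R}_+}) = \int L_{\alpha,\gamma,\mu}((X_t)_{t \in \mathbb{R}_+}) \, d\nu(\alpha, \gamma, \mu)$$

$L_{\alpha,\gamma,\mu}$ = Lévy process with jump measure µ

[Bü60, Kal01]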
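The Pólya urn predictive rule translates directly into a sampler. The following is a minimal sketch: the base measure G0 is taken to be a standard normal and α = 2, both of which are assumptions for illustration, and `polya_urn` is a hypothetical helper name.

```python
import random

def polya_urn(n, alpha, base_sampler, rng):
    """Draw X_1..X_n sequentially from the urn scheme.

    Given i earlier values, the next one is a fresh draw from G0 with
    probability alpha / (alpha + i), otherwise a uniformly chosen earlier value.
    """
    xs = []
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            xs.append(base_sampler(rng))
        else:
            xs.append(rng.choice(xs))
    return xs

rng = random.Random(0)
# G0 = standard normal is an assumption made for this example.
sample = polya_urn(50, alpha=2.0, base_sampler=lambda r: r.gauss(0.0, 1.0), rng=rng)
print(len(set(sample)), "distinct values among", len(sample))
```

Repeated values appear because earlier draws are reused, which matches the discrete form of θ in the product above.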

slide-13
SLIDE 13

MODIFICATION 2: RANDOM PARTITIONS

Random partition of N

Π = {B1, B2, . . .} e.g. {{1, 3, 5, . . .}, {2, 4}, {10}, . . .}

Paint-box distribution

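◮ Weights s1, s2, . . . ≥ 0 with $\sum_j s_j \le 1$
◮ U1, U2, . . . ∼ Uniform[0, 1]

[Figure: unit interval split into intervals of lengths s1, s2, . . . and a remainder of length $1 - \sum_j s_j$, with the points U1, U2, U3 marked]

Sampling Π ∼ β[ . |s]:

i, j ∈ N in same block ⇔ Ui, Uj in same interval
{i} separate block ⇔ Ui in the remaining interval of length $1 - \sum_j s_j$

Theorem (Kingman)

Π exchangeable ⇔

$$P(\Pi \in \,\cdot\,) = \int \beta[\Pi \in \,\cdot \mid s] \, Q(ds)$$

[Kin78]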
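The paint-box sampling rule can be sketched for the first n integers. The weights below are hypothetical, and `paint_box_partition` is an illustrative helper rather than a reference implementation: indices whose uniforms fall in the same weighted interval share a block, and indices landing in the leftover interval become singletons.

```python
import bisect
import itertools
import random

def paint_box_partition(weights, n, rng):
    """Sample a partition of {1, ..., n} from the paint-box with the given weights."""
    bounds = list(itertools.accumulate(weights))  # right endpoints of the intervals
    blocks = {}
    singletons = []
    for i in range(1, n + 1):
        u = rng.random()
        k = bisect.bisect_right(bounds, u)  # index of the interval containing u
        if k < len(bounds):
            blocks.setdefault(k, []).append(i)
        else:
            singletons.append([i])  # u fell in the leftover interval
    return list(blocks.values()) + singletons

rng = random.Random(0)
print(paint_box_partition([0.5, 0.3], 12, rng))  # hypothetical weights s = (0.5, 0.3)
```

Relabelling the indices only permutes the i.i.d. uniforms, which is why the resulting random partition is exchangeable.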

slide-14
SLIDE 14

ROTATION INVARIANCE

Rotatable sequence

Pn(X1, . . . , Xn) = Pn(Rn(X1, . . . , Xn)) for all Rn ∈ O(n)

Infinite case

X1, X2, . . . rotatable :⇔ X1, . . . , Xn rotatable for all n

Theorem (Freedman)

Infinite sequence rotatable iff

$$P(X_1, X_2, \ldots) = \int_{\mathbb{R}_+} \Big( \prod_{n=1}^{\infty} N_\sigma(X_n) \Big) \, d\nu_{\mathbb{R}_+}(\sigma)$$

Nσ denotes (0, σ)-Gaussian
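The easy direction of the theorem can be checked on a finite marginal: a scale mixture of centred i.i.d. Gaussians has a density that depends on the data only through its Euclidean norm, so rotating the data leaves it unchanged. The sketch below assumes a hypothetical three-point mixing measure and treats σ as a standard deviation; both are choices made for the example.

```python
import math

# Hypothetical mixing measure nu on the scale sigma (treated as a standard deviation).
NU = {0.5: 0.3, 1.0: 0.5, 3.0: 0.2}

def gaussian_pdf(x, sigma):
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_density(xs, nu=NU):
    """Density of (X_1..X_n) under a nu-mixture of i.i.d. centred Gaussians."""
    return sum(w * math.prod(gaussian_pdf(x, s) for x in xs) for s, w in nu.items())

def rotate(xs, angle):
    """Rotate a 2-vector by the given angle (an element of O(2))."""
    x, y = xs
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

point = (1.0, 2.0)
# The two values agree up to floating-point rounding.
print(mixture_density(point), mixture_density(rotate(point, 0.7)))
```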

slide-15
SLIDE 15

TWO INTERPRETATIONS

As special case of de Finetti

◮ Rotatable ⇒ exchangeable
◮ General de Finetti: Parameter space T = M(X)
◮ Rotation invariance: T shrinks to {Nσ|σ ∈ R+}

As invariance under different symmetry

◮ Exchangeability = invariance of P(X1, X2, ...) under group action
◮ Freedman: Different group (O(n) rather than S∞)
◮ In these cases: symmetry ⇒ decomposition theorem

slide-16
SLIDE 16

NON-EXCHANGEABLE DATA

slide-17
SLIDE 17

EXCHANGEABILITY: RANDOM GRAPHS

Random graph with independent edges

Given: θ : [0, 1]² → [0, 1] symmetric function

◮ U1, U2, . . . ∼ Uniform[0, 1]
◮ Edge (i, j) present: (i, j) ∼ Bernoulli(θ(Ui, Uj))

Call this distribution Γ(G ∈ . |θ).

[Figure: the function θ on the unit square [0, 1]²; its value at (U1, U2) gives Pr{edge 1, 2}]

Theorem (Aldous; Hoover)

A random (dense) graph G is exchangeable iff

$$P(G \in \,\cdot\,) = \int_T \Gamma(G \in \,\cdot \mid \theta) \, Q(d\theta)$$

[Figure: example with vertices labeled 1 to 9]

[Ald81, Hoo79]
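The sampling scheme Γ(G ∈ . |θ) is short enough to write out. In this sketch the function θ(u, v) = u·v is an arbitrary symmetric choice made for illustration, and `sample_graph` is a hypothetical helper returning an adjacency matrix.

```python
import random

def sample_graph(theta, n, rng):
    """Sample an n-vertex graph: edge {i, j} ~ Bernoulli(theta(U_i, U_j))."""
    u = [rng.random() for _ in range(n)]  # one latent uniform per vertex
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < theta(u[i], u[j]):
                adj[i][j] = adj[j][i] = 1  # undirected edge, no self-loops
    return adj

rng = random.Random(0)
adj = sample_graph(lambda a, b: a * b, 9, rng)  # theta(u, v) = u * v is an assumption
for row in adj:
    print(row)
```

Because the latent uniforms are i.i.d., relabelling the vertices does not change the distribution of the adjacency matrix.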


slide-19
SLIDE 19

DE FINETTI: GEOMETRY

Finite case

$$P = \sum_{e_i \in E} \nu_i e_i$$

◮ E = {e1, e2, e3}
◮ (ν1, ν2, ν3) barycentric coordinates

[Figure: triangle with vertices e1, e2, e3 and a point P with barycentric coordinates ν1, ν2, ν3]

Infinite/continuous case

$$P(\,\cdot\,) = \int_E e(\,\cdot\,) \, d\nu(e) = \int_T k(\theta, \,\cdot\,) \, d\nu_T(\theta)$$

◮ k : T → E ⊂ M(X) probability kernel (= conditional probability)
◮ k is random measure with values k(θ, . ) ∈ E
◮ de Finetti: $k(\theta, \,\cdot\,) = \prod_{n \in \mathbb{N}} Q(\,\cdot \mid \theta)$ and T = M(X)

slide-20
SLIDE 20

DECOMPOSITION BY SYMMETRY

Theorem (Varadarajan)

◮ G nice group on space Y
◮ Call measure µ ergodic if µ(A) ∈ {0, 1} for all G-invariant sets A.
◮ E := {ergodic probability measures}

Then there is a Markov kernel k : Y → E s.t.:

$$P \in M(Y) \text{ is } G\text{-invariant} \;\Leftrightarrow\; P(A) = \int_T k(\theta, A) \, d\nu(\theta)$$

de Finetti

◮ G = S∞ and Y = X∞
◮ G-invariant sets = exchangeable events
◮ E = factorial distributions (“Hewitt-Savage 0-1 law”)

[Var63]

slide-21
SLIDE 21

SYMMETRY AND SUFFICIENCY

slide-22
SLIDE 22

SUFFICIENT STATISTICS

Problem

Apparently no direct connection with standard models

Sufficient Statistic

Functions Sn of data sufficient if:

◮ Intuitively: Sn(X1, . . . , Xn) contains all information sample provides on parameter
◮ Formally: Pn(X1, . . . , Xn|Θ, Sn) = Pn(X1, . . . , Xn|Sn) for all n

Sufficiency and symmetry

◮ P exchangeable ⇔ $S_n(x_1, \ldots, x_n) = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$ sufficient
◮ P rotatable ⇔ $S_n(x_1, \ldots, x_n) = \sqrt{\sum_{i=1}^{n} x_i^2} = \|(x_1, \ldots, x_n)\|_2$ sufficient
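The formal definition of sufficiency can be checked exactly for coin flips, where the empirical measure of a bit string is determined by its number of ones. The sketch below, with hypothetical parameter values 1/3 and 4/5, computes the conditional law of the sequence given that count and shows it does not depend on θ.

```python
from fractions import Fraction
from itertools import product
from math import comb

def conditional_given_count(theta, n):
    """Return {x: P(x | sum(x), theta)} for n i.i.d. Bernoulli(theta) bits."""
    joint = {
        x: theta ** sum(x) * (1 - theta) ** (n - sum(x))
        for x in product((0, 1), repeat=n)
    }
    totals = {}
    for x, p in joint.items():
        totals[sum(x)] = totals.get(sum(x), 0) + p
    return {x: p / totals[sum(x)] for x, p in joint.items()}

cond_a = conditional_given_count(Fraction(1, 3), 4)  # hypothetical parameter values
cond_b = conditional_given_count(Fraction(4, 5), 4)
print(cond_a == cond_b)       # the conditional law is the same for both parameters
print(cond_a[(1, 0, 0, 1)])   # uniform over the comb(4, 2) arrangements of two ones
```

Given the count, every arrangement is equally likely, so the data carry no further information about θ.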

slide-23
SLIDE 23

DECOMPOSITION BY SUFFICIENCY

Theorem (Diaconis and Freedman; Lauritzen; several others)

Given: Sufficient statistic Sn for each n

kn( . , sn) = conditional probability of X1, . . . , Xn given sn

1. kn converges to a limit function:
   $$k_n(\,\cdot\,, S_n(X_1(\omega), \ldots, X_n(\omega))) \xrightarrow{n \to \infty} k_\infty(\,\cdot\,, \omega)$$
2. P(X1, X2, . . . ) has the decomposition
   $$P(\,\cdot\,) = \int k_\infty(\,\cdot\,, \omega) \, d\nu(\omega)$$
3. The model P ⊂ M(X) is a convex set with extreme points k∞( . , ω)
4. The measure ν is uniquely determined by P

(Theorem statement omits technical conditions.)

slide-24
SLIDE 24

EXAMPLES

de Finetti’s theorem

P exchangeable ⇔ $S_n(x_1, \ldots, x_n) = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$ sufficient

Rotation invariance

P rotatable ⇔ $S_n(x_1, \ldots, x_n) = \|(x_1, \ldots, x_n)\|_2$ sufficient

Kingman’s theorem

Π exchangeable ⇔ asymptotic block sizes are sufficient statistic

Exponential families (Küchler and Lauritzen)

Choose X = R∞. Under suitable regularity conditions: Sn additive, i.e.

$$S_n(x_1, \ldots, x_n) = \frac{1}{n} \sum_{i=1}^{n} S_0(x_i)$$

if and only if ergodic measures are exponential family.

[KL89]

slide-25
SLIDE 25

SUMMARY

Non-exchangeable data

◮ Identify invariance principle and its ergodic measures
◮ Ergodic measures ↔ generalize i.i.d. distributions ↔ likelihood
◮ Prior = distribution on ergodic measures

Random structure                    Theorem of                     Mixtures of...
Exchangeable sequences              de Finetti; Hewitt & Savage    product distributions
Processes with exch. increments     Bühlmann                       Lévy processes
Exchangeable partitions             Kingman                        "paint-box distributions"
Exchangeable arrays                 Aldous; Hoover; Kallenberg     sampling scheme on [0, 1]²
Block-exchangeable sequences        Diaconis & Freedman            Markov chains
Exchangeable Rd-sequences with      Küchler & Lauritzen            Exponential families
  additive sufficient statistics

slide-26
SLIDE 26

REFERENCES I

[Ald81] D. J. Aldous. Representations for partially exchangeable arrays of random variables. J. Multivariate Anal., 11(4):581–598, 1981.

[Bü60] H. Bühlmann. Austauschbare stochastische Variabeln und ihre Grenzwertsätze. PhD thesis, 1960. University of California Press, 1960.

[DF80] P. Diaconis and D. Freedman. Finite exchangeable sequences. The Annals of Probability, 8(4):745–764, 1980.

[Gho10] S. Ghosal. Dirichlet process, related priors and posterior asymptotics. In N. L. Hjort et al., editors, Bayesian Nonparametrics, pages 36–83. Cambridge University Press, 2010.

[Hoo79] D. N. Hoover. Relations on probability spaces and arrays of random variables. Technical report, Institute of Advanced Study, Princeton, 1979.

[Kal01] O. Kallenberg. Foundations of Modern Probability. Springer, 2nd edition, 2001.

[Kin78] J. F. C. Kingman. The representation of partition structures. J. London Math. Soc., 2(18):374–380, 1978.

[KL89] U. Küchler and S. L. Lauritzen. Exponential families, extreme point models and minimal space-time invariant functions for stochastic processes with stationary and independent increments. Scand. J. Stat., 16:237–261, 1989.

[KvdV06] B. J. K. Kleijn and A. W. van der Vaart. Misspecification in infinite-dimensional Bayesian statistics. Annals of Statistics, 34(2):837–877, 2006.

[Sch95] M. J. Schervish. Theory of Statistics. Springer, 1995.

[Var63] V. S. Varadarajan. Groups of automorphisms of Borel spaces. Transactions of the American Mathematical Society, 109(2):191–220, 1963.