Exchangeability
Peter Orbanz
Columbia University
PARAMETERS AND PATTERNS

Parameters
P(X|θ) = Probability[data|pattern]
[Figure: scatter plot of data, output y against input x]
Inference idea: data = underlying pattern + independent noise
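The inference idea can be sketched in a few lines (the pattern f and the noise level 0.3 are hypothetical choices for illustration only):

```python
import math
import random

random.seed(0)

# Hypothetical underlying pattern: a smooth function f.
f = lambda x: math.sin(x)

# Observed data = underlying pattern + independent noise.
xs = [random.uniform(-5, 5) for _ in range(50)]
ys = [f(x) + random.gauss(0, 0.3) for x in xs]
```

Inference then tries to recover f from the noisy pairs (xs, ys).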
Peter Orbanz 2 / 25
Parametric
◮ Number of parameters fixed (or constantly bounded) w.r.t. sample size
[Figure: Gaussian density with mean µ over coordinates (x1, x2)]

Nonparametric
◮ Number of parameters grows with sample size
◮ ∞-dimensional parameter space
[Figure: density estimate p(x)]
Peter Orbanz 3 / 25
A nonparametric Bayesian model is a Bayesian model on an ∞-dimensional parameter space.
Parameter space T = set of possible patterns. Recall previous tutorials:

Model              T                  Application
Gaussian process   Smooth functions   Regression problems
DP mixtures        Smooth densities   Density estimation
CRP, 2-param. CRP  Partitions         Clustering

Solution to a Bayesian problem = posterior distribution on patterns
[Sch95]
For all π ∈ S∞ (= infinite symmetric group):

P(X1, X2, . . . ) = P(Xπ(1), Xπ(2), . . . ), in short: π(P) = P

de Finetti's theorem:

P exchangeable ⇔ P(X1, X2, . . . ) = ∫_{M(X)} ∏_{n=1}^∞ Q(Xn) ν(dQ)

◮ Q is a random measure
◮ ν uniquely determined by P
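A minimal simulation of the theorem, assuming the illustrative special case ν = Uniform[0, 1] mixing over Bernoulli(p) measures Q:

```python
import random

def exchangeable_pair(rng):
    """One draw from the de Finetti mixture: first sample Q (a Bernoulli(p)
    measure with p ~ nu = Uniform[0,1]), then sample X1, X2 i.i.d. from Q."""
    p = rng.random()
    return tuple(1 if rng.random() < p else 0 for _ in range(2))

rng = random.Random(1)
N = 200_000
counts = {}
for _ in range(N):
    s = exchangeable_pair(rng)
    counts[s] = counts.get(s, 0) + 1

p10 = counts.get((1, 0), 0) / N   # equals p01 by exchangeability
p01 = counts.get((0, 1), 0) / N   # both are ∫ p(1-p) dν(p) = 1/6
p11 = counts.get((1, 1), 0) / N   # ∫ p² dν(p) = 1/3 ≠ (1/2)², so not i.i.d.
```

The sequence is exchangeable but not i.i.d.: observing X1 = 1 raises the predictive probability of X2 = 1.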
Exchangeability of a finite sequence ⇏ de Finetti representation

Example: X2 := 1 − X1, i.e.

          X2 = 0   X2 = 1
X1 = 0      0       1/2
X1 = 1     1/2       0

Suppose de Finetti holds; then
0 = P(X1 = X2 = 1) = ∫ p² ν(dp)        ⇒ ν{p = 0} = 1
0 = P(X1 = X2 = 0) = ∫ (1 − p)² ν(dp)  ⇒ ν{p = 1} = 1
Contradiction, so no representation exists.
Finite exchangeability does not eliminate sequential patterns.
[DF80]
[Figure: model P = {Pθ} as a subset of M(X); true distribution P0. If P0 = Pθ0 for some θ0, the model is well-specified; P0 outside model: misspecified.]
[Gho10, KvdV06]
◮ Support of nonparametric priors is larger (∞-dimensional) than that of parametric priors (finite-dimensional).
◮ However: no uniform prior (or even "neutral" improper prior) exists on M(X).
◮ Concentration of a nonparametric prior on a subset of M(X) typically represents a structural prior assumption.
◮ GP regression with unknown bandwidth:
  ◮ Any continuous function possible
  ◮ Prior can express e.g. "very smooth functions are more probable"
◮ Clustering: Expected number of clusters is...
  ◮ ...small → CRP prior
  ◮ ...power law → two-parameter CRP
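The two cluster-growth regimes can be checked by simulation. This is a sketch of the (two-parameter) CRP seating rule; the parameter values α = 1, σ = 0.5 are illustrative:

```python
import random

def crp_num_clusters(n, alpha, sigma=0.0, seed=0):
    """Number of occupied tables after seating n customers.
    sigma = 0: one-parameter CRP (logarithmic cluster growth);
    0 < sigma < 1: two-parameter CRP (power-law cluster growth)."""
    rng = random.Random(seed)
    tables = []                      # customers per table
    for i in range(n):
        r = rng.random() * (i + alpha)
        acc = 0.0
        for j, c in enumerate(tables):
            acc += c - sigma         # join table j w.p. (c_j - sigma)/(i + alpha)
            if r < acc:
                tables[j] += 1
                break
        else:                        # new table w.p. (alpha + sigma·#tables)/(i + alpha)
            tables.append(1)
    return len(tables)

# Average over a few runs: CRP yields few clusters, two-parameter CRP many.
k_crp = sum(crp_num_clusters(2000, 1.0, 0.0, s) for s in range(20)) / 20
k_py  = sum(crp_num_clusters(2000, 1.0, 0.5, s) for s in range(20)) / 20
```

For n = 2000 the one-parameter CRP gives on the order of log n ≈ 8 clusters, while σ = 0.5 gives on the order of √n clusters.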
[Figure: commutative diagram — sample space Ω with ω ↦ X(ω); the sequence map X∞ : Ω → X∞; the empirical-measure map F : X∞ → M(X) ⊃ P; the parameter map T : P → T ; Θ(ω) ∈ T ]

◮ P = {P[X | θ] | θ ∈ T }
◮ F ≡ law of large numbers (maps a sequence to its limiting empirical measure)
◮ T : P[ . | Θ = θ] ↦ θ is a bijection
◮ Θ := T ◦ F ◦ X∞
[Sch95]
P(X1, X2, . . . ) = ∫_{M(X)} ∏_{n=1}^∞ Q(Xn) ν(dQ) = ∫_T ∏_{n=1}^∞ Q(Xn | Θ = θ) ν(dθ)

◮ Θ is a random measure (since Θ(ω) ∈ M(X))
The de Finetti theorem comes with a convergence result attached:
◮ Empirical measure: Fn := (1/n) ∑_{i=1}^n δXi → θ weakly as n → ∞
◮ Posterior Λn(Θ ∈ . | X1, . . . , Xn) = Λn( . , ω) in M(T ) exists
◮ Posterior convergence: Λn( . , ω) → δΘ(ω) as n → ∞
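Posterior concentration is easy to see in the simplest exchangeable model: Bernoulli draws with a conjugate Beta(1, 1) prior. This is an illustrative special case, not the general theorem; the true parameter 0.7 is an arbitrary choice:

```python
import random

random.seed(2)
theta = 0.7                                  # the realized Θ(ω)
xs = [1 if random.random() < theta else 0 for _ in range(20_000)]

def posterior(n):
    """Beta posterior moments after n observations (Beta(1,1) prior)."""
    k = sum(xs[:n])
    a, b = 1 + k, 1 + n - k
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

m_small, v_small = posterior(100)
m_large, v_large = posterior(20_000)
# Λn concentrates: the variance shrinks and the mean approaches Θ(ω) = 0.7.
```

As n grows, the posterior collapses onto the point mass δΘ(ω), exactly as the abstract statement promises.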
[Kal01]
Predictive rule (Blackwell–MacQueen urn):

P(Xn+1 ∈ . | X1 = x1, . . . , Xn = xn) = 1/(α + n) ∑_{j=1}^n δxj( . ) + α/(α + n) G0( . )

Exchangeable:

P(X1, X2, . . . ) = ∫ ∏_{n=1}^∞ Q(Xn | θ) ν(dθ), where

◮ ν is DP(α, G0)
◮ ∏_{n=1}^∞ Q(Xn | θ) = ∏_{n=1}^∞ θ(Xn) = ∏_{n=1}^∞ ∑_{j=1}^∞ cj δtj(Xn)
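The predictive rule translates directly into a sequential sampler. A sketch; the standard-normal base measure G0 is an arbitrary illustrative choice:

```python
import random

def urn_sample(n, alpha, base_sampler, rng):
    """Sequentially sample X1..Xn: after i draws, repeat a previous value xj
    w.p. 1/(alpha + i) each, or draw fresh from G0 w.p. alpha/(alpha + i)."""
    xs = []
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            xs.append(base_sampler(rng))     # new value from G0
        else:
            xs.append(rng.choice(xs))        # repeat an old value
    return xs

rng = random.Random(3)
xs = urn_sample(500, 2.0, lambda r: r.gauss(0, 1), rng)
num_distinct = len(set(xs))   # grows only logarithmically in n
```

Even though G0 is continuous, values repeat with positive probability, so the sample contains far fewer distinct values than draws. This is the discreteness of DP samples in action.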
Stationary, exchangeable increment process = mixture of Lévy processes:

P((Xt)t∈R+ ∈ . ) = ∫ Lα,γ,µ( . ) ν(d(α, γ, µ))

◮ Lα,γ,µ = Lévy process with characteristic triplet (α, γ, µ), where µ is the jump measure
[Büh60, Kal01]
Π = {B1, B2, . . .} partition of N, e.g. {{1, 3, 5, . . .}, {2, 4}, {10}, . . .}

◮ Weights s1, s2, . . . ≥ 0 with ∑j sj ≤ 1
◮ U1, U2, . . . ∼ Uniform[0, 1]

[Figure: unit interval split into subintervals of lengths s1, s2, . . . and a leftover interval of length 1 − ∑j sj, with U1, U2, U3 falling into the subintervals]

Sampling Π ∼ β( . | s):
◮ i, j ∈ N in same block ⇔ Ui, Uj in same interval
◮ {i} a singleton block ⇔ Ui falls in the leftover interval of length 1 − ∑j sj

Kingman's theorem: Π exchangeable ⇔ P(Π ∈ . ) = ∫ β(Π ∈ . | s) ν(ds)
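The paintbox construction is a few lines of code. A sketch; the weights (0.5, 0.3), leaving leftover mass 0.2, are chosen arbitrarily for illustration:

```python
import random
from itertools import accumulate

def paintbox(n, s, rng):
    """Partition {0,...,n-1}: U_i landing in interval j joins block j;
    U_i landing in the leftover mass 1 - sum(s) forms a singleton block."""
    bounds = list(accumulate(s))          # right endpoints of the intervals
    blocks, singletons = {}, []
    for i in range(n):
        u = rng.random()
        j = next((j for j, b in enumerate(bounds) if u < b), None)
        if j is None:
            singletons.append({i})        # leftover interval -> singleton
        else:
            blocks.setdefault(j, set()).add(i)
    return list(blocks.values()) + singletons

rng = random.Random(4)
part = paintbox(1000, [0.5, 0.3], rng)
```

With n = 1000 one expects a block of roughly 500 elements, one of roughly 300, and about 200 singletons from the leftover mass.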
[Kin78]
Pn(X1, . . . , Xn) = Pn(Rn(X1, . . . , Xn)) for all Rn ∈ O(n)
X1, X2, . . . rotatable :⇔ X1, . . . , Xn rotatable for all n
Freedman's theorem: an infinite sequence is rotatable iff

P(X1, X2, . . . ) = ∫_{R+} ∏_{n=1}^∞ Nσ(Xn) ν(dσ)

where Nσ denotes the (0, σ)-Gaussian.
◮ Rotatable ⇒ exchangeable
◮ General de Finetti: Parameter space T = M(X)
◮ Rotation invariance: T shrinks to {Nσ | σ ∈ R+}

◮ Exchangeability = invariance of P(X1, X2, . . . ) under a group action
◮ Freedman: different group (O(n) rather than S∞)
◮ In these cases: symmetry ⇒ decomposition theorem
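A quick numerical check that i.i.d. Gaussians are rotatable: apply an O(2) rotation to independent standard-normal pairs and compare sample moments. The rotation angle 0.7 is arbitrary:

```python
import math
import random

random.seed(5)
angle = 0.7                       # arbitrary rotation angle; R ∈ O(2)
c, s = math.cos(angle), math.sin(angle)

orig, rot = [], []
for _ in range(100_000):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    orig.append(x1)
    rot.append(c * x1 - s * x2)   # first coordinate of R(x1, x2)

# Both samples should look N(0, 1): compare means and variances.
mean_o = sum(orig) / len(orig)
mean_r = sum(rot) / len(rot)
var_o = sum(x * x for x in orig) / len(orig)
var_r = sum(x * x for x in rot) / len(rot)
```

The rotated coordinate has the same distribution as the original, which is the invariance that forces the Gaussian form in Freedman's theorem.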
Given: θ : [0, 1]² → [0, 1] symmetric function (a "graphon")

◮ U1, U2, . . . ∼ Uniform[0, 1]
◮ Edge (i, j) present: (i, j) ∼ Bernoulli(θ(Ui, Uj))

Call this distribution Γ(G ∈ . | θ).

[Figure: heat map of θ on [0, 1]² and the induced adjacency matrix]

Aldous–Hoover theorem: a random (dense) graph G is exchangeable iff

P(G ∈ . ) = ∫ Γ(G ∈ . | θ) Q(dθ)
[Ald81, Hoo79]
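Sampling from Γ( . | θ) is a two-step scheme. The graphon θ(u, v) = uv below is a hypothetical choice used only for illustration (expected edge density E[UV] = 1/4):

```python
import random

def sample_graph(n, theta, rng):
    """Sample an n-vertex graph: draw U_i ~ Uniform[0,1], then include
    each edge (i, j), i < j, independently w.p. theta(U_i, U_j)."""
    us = [rng.random() for _ in range(n)]
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < theta(us[i], us[j])}

rng = random.Random(6)
edges = sample_graph(200, lambda u, v: u * v, rng)
density = len(edges) / (200 * 199 / 2)
```

The vertex labels play no role beyond indexing the Ui, which is exactly why the resulting graph distribution is exchangeable.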
Finite-dimensional picture (simplex):

P = ∑i νi ei

◮ E = {e1, e2, e3} extreme points
◮ (ν1, ν2, ν3) barycentric coordinates of P

[Figure: triangle with vertices e1, e2, e3 and interior point P]

In general:

P( . ) = ∫_E e( . ) dν(e) = ∫_T k(θ, . ) dνT(θ)

◮ k : T → E ⊂ M(X) probability kernel (= conditional probability)
◮ k is a random measure with values k(θ, . ) ∈ E
◮ de Finetti: k(θ, . ) = ∏_{n∈N} Q( . | θ) and T = M(X)
◮ G nice group acting on a space Y
◮ Call a measure µ ergodic if µ(A) ∈ {0, 1} for all G-invariant sets A
◮ E := {ergodic probability measures}

Then there is a Markov kernel k : Y → E s.t.:

P ∈ M(Y) G-invariant ⇔ P(A) = ∫ k(θ, A) dν(θ)

Example:
◮ G = S∞ and Y = X∞
◮ G-invariant sets = exchangeable events
◮ E = factorial (i.i.d. product) distributions ("Hewitt–Savage 0-1 law")
[Var63]
Apparently no direct connection with standard models
Functions Sn of the data are sufficient if:

◮ Intuitively: Sn(X1, . . . , Xn) contains all information the sample provides on the parameter
◮ Formally: Pn(X1, . . . , Xn | Θ, Sn) = Pn(X1, . . . , Xn | Sn) for all n

Examples:
◮ P exchangeable ⇔ Sn(x1, . . . , xn) = (1/n) ∑_{i=1}^n δxi (the empirical measure) is sufficient
◮ P rotatable ⇔ Sn(x1, . . . , xn) = ∑_{i=1}^n xi² = ‖(x1, . . . , xn)‖² is sufficient
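For an exchangeable model the likelihood depends on the sample only through the empirical measure. This is easy to verify for i.i.d. Bernoulli sequences, an illustrative special case:

```python
from math import prod

def bernoulli_likelihood(xs, p):
    """Likelihood of a 0/1 sequence under i.i.d. Bernoulli(p)."""
    return prod(p if x else 1 - p for x in xs)

a = [1, 0, 0, 1, 1]
b = [0, 1, 1, 1, 0]   # same empirical measure: three 1s, two 0s

# Identical sufficient statistic ⇒ identical likelihood for every p.
same = all(abs(bernoulli_likelihood(a, p) - bernoulli_likelihood(b, p)) < 1e-12
           for p in (0.1, 0.3, 0.5, 0.7, 0.9))
```

Reordering the sample changes nothing the likelihood can see, which is sufficiency of the empirical measure in its simplest form.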
Given: a sufficient statistic Sn for each n, with kn( . , sn) = conditional probability of X1, . . . , Xn given Sn = sn.

Then (under technical conditions) the limit

kn( . , Sn(X1(ω), . . . , Xn(ω))) → k∞( . , ω) as n → ∞

exists, and P( . ) = ∫ k∞( . , ω) P(dω).

(Theorem statement omits technical conditions.)
◮ P exchangeable ⇔ Sn(x1, . . . , xn) = (1/n) ∑_{i=1}^n δxi sufficient
◮ P rotatable ⇔ Sn(x1, . . . , xn) = ‖(x1, . . . , xn)‖² sufficient
◮ Π exchangeable ⇔ asymptotic block sizes are sufficient statistic

Choose X = R∞. Under suitable regularity conditions: Sn additive, i.e.

Sn(x1, . . . , xn) = (1/n) ∑_{i=1}^n S0(xi),

if and only if the ergodic measures are an exponential family.
[KL89]
◮ Identify invariance principle and its ergodic measures
◮ Ergodic measures ↔ generalize i.i.d. distributions ↔ likelihood
◮ Prior = distribution on ergodic measures
Random structure                     Theorem of            Mixtures of...
Exchangeable sequences               de Finetti;           product distributions
                                     Hewitt & Savage
Processes with exch. increments      Bühlmann              Lévy processes
Exchangeable partitions              Kingman               "paint-box" distributions
Exchangeable arrays                  Aldous; Hoover;       sampling schemes on [0, 1]²
                                     Kallenberg
Block-exchangeable sequences         Diaconis & Freedman   Markov chains
Exchangeable Rd-sequences with       Küchler & Lauritzen   exponential families
additive sufficient statistics
[Ald81] David J. Aldous. Representations for partially exchangeable arrays of random variables. J. Multivariate Anal., 11(4):581–598, 1981.
[Büh60] Hans Bühlmann. Austauschbare stochastische Variabeln und ihre Grenzwertsätze. Univ. Calif. Publ. Statist., 3:1–35, 1960.
[DF80] Persi Diaconis and David Freedman. Finite exchangeable sequences. Ann. Probab., 8(4):745–764, 1980.
[Gho10] Subhashis Ghosal. The Dirichlet process, related priors and posterior asymptotics. In Bayesian Nonparametrics, Cambridge University Press, 2010.
[Hoo79] Douglas N. Hoover. Relations on probability spaces and arrays of random variables. Preprint, Institute for Advanced Study, Princeton, 1979.
[Kal01] Olav Kallenberg. Foundations of Modern Probability. 2nd edition, Springer, 2001.
[Kin78] J. F. C. Kingman. The representation of partition structures. J. London Math. Soc., 18:374–380, 1978.
[KL89] Uwe Küchler and Steffen L. Lauritzen. Exponential families, extreme point models and minimal space-time invariant functions for stochastic processes with stationary and independent increments. Scand. J. Stat., 16:237–261, 1989.
[KvdV06] B. J. K. Kleijn and A. W. van der Vaart. Misspecification in infinite-dimensional Bayesian statistics. Ann. Statist., 34(2):837–877, 2006.
[Sch95] Mark J. Schervish. Theory of Statistics. Springer, 1995.
[Var63] V. S. Varadarajan. Groups of automorphisms of Borel spaces. Trans. Amer. Math. Soc., 109:191–220, 1963.