Tractable Inference for Probabilistic Models
Manfred Opper (Aston University, Birmingham, U.K.)
In collaboration with: Ole Winther (TU Denmark), Dörthe Malzahn (TU Denmark), Lehel Csató (Aston U)
The general Structure
D = observed data, S = hidden variables (unknown causes, etc.)
Bayes' rule:

P(S|D) = P(D|S) × P(S) / P(D)

posterior = likelihood × prior / evidence
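As a toy illustration (all numbers invented for this example), Bayes' rule for a binary hidden variable in a few lines of Python:

```python
# Minimal illustration of Bayes' rule for a binary hidden variable S.
# The prior and likelihood values are invented for this example.
prior = {+1: 0.5, -1: 0.5}           # P(S)
likelihood = {+1: 0.8, -1: 0.3}      # P(D | S) for the observed data D
evidence = sum(likelihood[s] * prior[s] for s in prior)  # P(D)
posterior = {s: likelihood[s] * prior[s] / evidence for s in prior}
print(posterior)  # {1: 0.727..., -1: 0.272...}
```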
Overview
- Inference with probabilistic models: Examples
- A “canonical” model
- Problems with inference and approximate solutions
- Cavity/TAP approximation
- Applications
- Outlook
Example I: Modeling with Gaussian Processes
- Observations: data D = (y_1, . . . , y_N) observed at points x_i ∈ R^D.
  [Figure: one-dimensional regression data with GP fit]
- Model for observations:
  y_i = f(x_i) + "noise"  (regression, e.g. with positive noise)
  y_i = sign[f(x_i) + "noise"]  (classification)
- A priori information about the "latent variable" (the function f):
  realization of a Gaussian random process with covariance K(x, x′).
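For reference, a minimal sketch of GP regression in the conjugate case of Gaussian noise, where the posterior is available in closed form (kernel and noise level are illustrative choices); the talk's point is that a positive-noise or sign likelihood breaks exactly this closed form:

```python
import numpy as np

# GP regression with Gaussian noise: posterior mean and covariance in
# closed form. The RBF kernel and noise level sigma are illustrative.
def rbf(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, (30, 1))                  # training inputs x_i
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)  # noisy observations y_i
sigma = 0.1

Xs = np.linspace(-5, 5, 100)[:, None]            # test inputs
K = rbf(X, X) + sigma**2 * np.eye(len(X))        # K(X, X) + σ²I
Ks = rbf(Xs, X)                                  # K(X*, X)
mean = Ks @ np.linalg.solve(K, y)                # posterior mean of f
cov = rbf(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)  # posterior covariance
```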
Modeling with Gaussian processes: Windfields
Ambiguities in the local observation model for measuring wind velocity fields from satellites.
The local observation model is given by a mixture density network (MDN).
Solution: Model prior distribution of wind fields using a Gaussian process.
Example II: Code Division Multiple Access (CDMA)
- K users in mobile communication try to transmit message bits S_1, . . . , S_K with S_i ∈ {−1, 1} over a single channel.
- Modulation: multiply the message with a spreading code x_k(n), n = 1, . . . , N_c.
- Received signals:

  y(n) = Σ_{k=1}^{K} S_k x_k(n) + σ ε(n)

- Inference: estimate the S_k's from the y(n)'s (= regression with binary variables). (Introduced to the machine learning community by Toshiyuki Tanaka.)
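A minimal simulation of this observation model; the random ±1 spreading codes and noise level are illustrative assumptions, and the matched-filter estimate is only a simple baseline, not the inference method discussed in the talk:

```python
import numpy as np

# CDMA observation model: y(n) = Σ_k S_k x_k(n) + σ ε(n).
rng = np.random.default_rng(0)
K, Nc, sigma = 8, 16, 0.3
S = rng.choice([-1, 1], size=K)          # message bits, one per user
x = rng.choice([-1, 1], size=(K, Nc))    # spreading codes x_k(n)
y = S @ x + sigma * rng.normal(size=Nc)  # received signal y(n)

S_hat = np.sign(x @ y)                   # matched-filter baseline estimate
print("bit errors:", int(np.sum(S_hat != S)))
```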
A canonical Class of Distributions
P(S) = (1/Z) ∏_i ρ_i(S_i) exp( Σ_{i<j} S_i J_ij S_j )

ρ_i models local observations (likelihood) or local constraints.
[Graph: nodes i and j coupled by J_ij]
The normalization Z usually coincides with the probability P(D) of the observed data.
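To make the cost of exact inference concrete: for small N one can compute Z and the moments by enumeration, as in the sketch below (binary S_i with uniform ρ_i as an illustrative special case); the O(2^N) scaling is precisely what the approximations in this talk avoid.

```python
import numpy as np
from itertools import product

# Exact Z, E[S_i], and E[S_i S_j] for the canonical model with S_i ∈ {−1,+1}
# and uniform ρ_i (illustrative choice). Cost grows as 2^N.
rng = np.random.default_rng(0)
N = 10
J = np.triu(rng.normal(0, 1 / np.sqrt(N), (N, N)), 1)  # couplings for i < j

states = np.array(list(product([-1, 1], repeat=N)))    # all 2^N configurations
w = np.exp(np.einsum('si,ij,sj->s', states, J, states))  # exp(Σ_{i<j} S_i J_ij S_j)
Z = w.sum()
m = (w[:, None] * states).sum(0) / Z                   # E[S_i]
C = np.einsum('s,si,sj->ij', w, states, states) / Z    # E[S_i S_j]
```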
Problems with Inference
- Dependent variables → high-dimensional integrals/sums.
- Exact inference is impossible if the random variables are continuous (and non-Gaussian).
- The Laplace approximation for integrals is impossible if the integrand is non-differentiable.
- "Learning" of the coupling matrix J by the EM algorithm (maximum likelihood) requires the correlations E[S_i S_j].
Non-variational Approximations
- Bethe approximation / belief propagation (Yedidia, Freeman & Weiss): "tree-like" graphs.
  [Graph: tree-like neighbourhood of site i]
- TAP-type approximations: many neighbours, weak dependencies; the neighbourhood acts as a Gaussian random influence on site i.
  [Graph: densely connected neighbourhood of site i]
Gibbs Free Energy
- Gives the moments and Z = P(D) simultaneously.
- Allows the application of optimization methods.

Φ(m) := min_Q { KL(Q||P) : E_Q[S_i] = m_i, E_Q[S_i²] = M_i, i = 1, . . . , N } − ln Z

[Plot: Φ(m) as a function of m; the minimum lies at m = E[S_i] with value − ln P(D)]
TAP Approximation to Free Energy
Introduce a tunable interaction strength l:

P_l(S) = (1/Z) ∏_i ρ_i(S_i) exp( l Σ_{i<j} S_i J_ij S_j )

Exact result:

Φ_{l=1} = Φ_{l=0} + ∫_0^1 dl ∂Φ_l/∂l = Φ_{l=0} − (1/2) Σ_{i,j} m_i J_ij m_j − (1/2) ∫_0^1 dl Tr(C_l J)

with covariance C_l.

- TAP (Thouless, Anderson & Palmer): expand Φ_l to O(l²).
- Adaptive TAP (Opper & Winther): Gaussian approximation for C_l:

  C_l^g = (Λ_l − l J)^{−1}
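A small numeric sketch of the coupling correction under the Gaussian ansatz, crudely holding Λ_l fixed at a constant Λ (the actual adaptive TAP scheme instead matches Λ_l to the non-Gaussian marginals); for fixed Λ the integral has the closed form ln det Λ − ln det(Λ − J), which the code uses as a sanity check:

```python
import numpy as np

# Evaluate −½ ∫₀¹ dl Tr(C_l J) with C_l = (Λ − lJ)⁻¹ and Λ held fixed
# (an illustrative simplification of the adaptive TAP scheme).
rng = np.random.default_rng(0)
N = 10
J = rng.normal(0, 0.5 / np.sqrt(N), (N, N))
J = (J + J.T) / 2
np.fill_diagonal(J, 0.0)
Lam = 2.0 * np.eye(N)                      # illustrative choice with Λ − J ≻ 0

ls = np.linspace(0.0, 1.0, 201)
tr = np.array([np.trace(np.linalg.inv(Lam - l * J) @ J) for l in ls])
integral = np.sum((tr[:-1] + tr[1:]) / 2) * (ls[1] - ls[0])  # trapezoid rule

# sanity check: for fixed Λ, ∫₀¹ Tr((Λ − lJ)⁻¹ J) dl = ln det Λ − ln det(Λ − J)
exact = np.linalg.slogdet(Lam)[1] - np.linalg.slogdet(Lam - J)[1]
print(-0.5 * integral, -0.5 * exact)       # the two should agree closely
```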
Properties of TAP Free Energy
- The free energy has the form

  Φ_TAP(m, M) = Φ_0(m, M) + Φ^g(m, M) − Φ_0^g(m, M)

  The Φ's are convex and correspond to:
  Φ_0(m, M): true likelihood, no interactions.
  Φ^g(m, M): Gaussian likelihood, full interactions.
  Φ_0^g(m, M): Gaussian likelihood, no interactions.
- Minimizing Φ_TAP over the hyperparameters reproduces the fixed points of an approximate EM algorithm.
Relation to Cavity Approach
Φ_0 = max_{λ^0, γ^0} { − Σ_i ln Z_i^0(γ_i^0, λ_i^0) + m^T γ^0 + (1/2) M^T λ^0 }

with

Z_i^0(γ_i^0, λ_i^0) = ∫ dS ρ_i(S) exp( γ_i^0 S + (1/2) λ_i^0 S² )
                    = ∫ dS ρ_i(S) E_z[ exp( S (γ_i^0 + √(λ_i^0) z) ) ]

with z a standard normal Gaussian random variable.
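The second form uses the Gaussian identity E_z[e^{az}] = e^{a²/2}; a quick Monte Carlo check with illustrative values (valid for λ ≥ 0):

```python
import numpy as np

# Check E_z[exp(S(γ + √λ z))] = exp(γS + ½λS²) for z ~ N(0, 1).
rng = np.random.default_rng(0)
S, gamma, lam = 0.7, 0.3, 0.5              # illustrative values, lam >= 0
z = rng.normal(size=1_000_000)
mc = np.exp(S * (gamma + np.sqrt(lam) * z)).mean()
closed = np.exp(gamma * S + 0.5 * lam * S**2)
print(mc, closed)                          # agree to Monte Carlo accuracy
```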
Algorithm: Expectation Propagation (T. Minka)
Introduce an effective Gaussian distribution whose likelihood is

∏_{i=1}^N ρ_i^g(S_i) = ∏_{i=1}^N e^{−λ_i S_i² + γ_i S_i}

Iterate over the sites (see the sketch below):
- Visit site i and replace the Gaussian site likelihood by the true likelihood:
  new marginal P_i(S) ∝ [P_i^g(S) / ρ_i^g(S)] ρ_i(S)
- Recompute E[S_i] and E[S_i²].
- Recompute λ_i and γ_i → move on to the next site.
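A compact, self-contained sketch of one way this loop can look for binary S_i with uniform ρ_i (so the tilted moments reduce to a tanh); the coupling scale, the ½-convention for λ_i, and recomputing the full matrix inverse each update are illustrative simplifications, not the talk's implementation:

```python
import numpy as np
from itertools import product

# EP sketch for P(S) ∝ Π_i ρ_i(S_i) exp(Σ_{i<j} S_i J_ij S_j), S_i ∈ {−1,+1},
# uniform ρ_i; sites approximated by exp(γ_i S_i − ½ λ_i S_i²).
rng = np.random.default_rng(0)
N = 8
J = rng.normal(0, 0.5 / np.sqrt(N), (N, N))
J = (J + J.T) / 2
np.fill_diagonal(J, 0.0)

gamma, lam = np.zeros(N), np.ones(N)            # site parameters γ_i, λ_i
for sweep in range(50):
    for i in range(N):
        Sigma = np.linalg.inv(np.diag(lam) - J) # covariance of Gaussian approx
        mu = Sigma @ gamma
        p_cav = 1.0 / Sigma[i, i] - lam[i]      # cavity precision at site i
        h_cav = mu[i] / Sigma[i, i] - gamma[i]  # cavity field at site i
        m = np.tanh(h_cav)                      # tilted E[S_i] (uses S_i² = 1)
        v = 1.0 - m * m                         # tilted Var[S_i]
        lam[i] = 1.0 / v - p_cav                # moment matching → new site
        gamma[i] = m / v - h_cav

mu = np.linalg.inv(np.diag(lam) - J) @ gamma    # EP estimate of E[S_i]

# compare with exact enumeration (feasible only for small N)
states = np.array(list(product([-1, 1], repeat=N)))
w = np.exp(0.5 * np.einsum('si,ij,sj->s', states, J, states))
print(mu)
print((w[:, None] * states).sum(0) / w.sum())
```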
Exact Average-Case Behaviour: Random J-Matrix Ensembles, N → ∞
Assume an orthogonal random matrix ensemble for J_N with the asymptotic scaling of the generating function

(1/N) ln ⟨ e^{(1/2) Tr(A J_N)} ⟩_J ≃ Tr G(A/N)

For N → ∞, the average-case properties (replica symmetry) of exact inference and of the ADATAP approximation agree (if there is a single solution).
Application: Non-Gaussian Regression
y = f(x) + ξ with positive noise p(ξ) = λ e^{−λξ} I_{ξ>0}: estimate the parameter λ with N = 1000.
[Figure: data and sparse GP fit; BV set size: 10, estimated likelihood parameter: 2.0594]
Example: Estimation of Wind Fields
[Figure: wind field estimates (arrow scales 10 m s⁻¹ and 20 m s⁻¹); panels: likelihood, Monte Carlo prediction, ADATAP prediction]
CDMA Results I (Winther & Fabricius)
[Scatter plots: exact vs. naive mean field and exact vs. TAP]
Results for the Bayes-optimal prediction h_i = artanh(m_i): exact vs. naive mean field and exact vs. ADATAP. K = 8 users and N_c = 16.
CDMA Results II (Winther & Fabricius)
[Plot: bit-error rate (BER) on a log scale vs. number of users K, for naive mean field, adaptive TAP, linear MMSE, hard serial interference cancellation, and matched filter]
Bit-error rate as a function of the number of users. SNR = 10 dB and spreading factor N_c = 20.
Approximate analytical Bootstrap
Goal: estimate average-case properties (e.g. test errors, uncertainty) of a statistical predictor (e.g. an SVM) without hold-out test data.
Bootstrap (Efron): generate new pseudo training data by resampling the old training data with replacement.
Original training data: D_0 = (z_1, z_2, z_3). Bootstrap samples: D_1 = (z_1, z_1, z_2); D_2 = (z_1, z_2, z_2); D_3 = (z_3, z_3, z_3), . . .
Problem: each sample requires time-consuming retraining of the predictor.
Approximate analytical approach: average over samples with the help of the "replica trick".
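For contrast, a minimal sketch of the naive sampling-plus-retraining bootstrap that the analytical approach is meant to avoid; ridge regression stands in for the predictor (the talk uses an SVM), and out-of-bag points serve as test data:

```python
import numpy as np

# Naive bootstrap: resample with replacement, retrain, average the loss.
rng = np.random.default_rng(0)
N = 200
X = rng.normal(size=(N, 5))
y = X @ rng.normal(size=5) + 0.3 * rng.normal(size=N)

def train(Xb, yb, reg=1e-2):
    # ridge regression as an illustrative stand-in for the predictor
    return np.linalg.solve(Xb.T @ Xb + reg * np.eye(Xb.shape[1]), Xb.T @ yb)

losses = []
for b in range(200):                        # every sample retrains: the cost
    idx = rng.integers(0, N, size=N)        # resample with replacement
    w = train(X[idx], y[idx])
    oob = np.setdiff1d(np.arange(N), idx)   # left-out points act as test data
    losses.append(np.mean((X[oob] @ w - y[oob]) ** 2))
print("bootstrapped test loss:", np.mean(losses))
```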
Support Vector Classifier (Vapnik)
The SVM predicts y = sign[f̂_{D_0}(x)] for x ∈ R^d, with f̂_{D_0}(x) = Σ_{j=1}^N y_j α_j K(x, x_j) and K a positive definite kernel.
Setting S_i = Σ_{j=1}^N y_j α_j K(x_i, x_j), the α's can be found from the convex optimization problem (see the sketch below):
Minimize S^T K^{−1} S under the constraints S_i y_i ≥ 1, i = 1, . . . , N.
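The problem above transcribes directly into code for a small dense case; the toy data, RBF kernel, and SciPy's SLSQP solver are illustrative choices (a hard-margin sketch, not a production SVM):

```python
import numpy as np
from scipy.optimize import minimize

# Hard-margin kernel SVM in the S-parametrization:
# minimize SᵀK⁻¹S subject to y_i S_i ≥ 1.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] + X[:, 1])              # linearly separable toy labels

def rbf(A, B, ell=1.0):
    return np.exp(-0.5 * ((A[:, None] - B[None]) ** 2).sum(-1) / ell**2)

Kmat = rbf(X, X) + 1e-8 * np.eye(len(X))    # jitter keeps K invertible
Kinv = np.linalg.inv(Kmat)

res = minimize(lambda S: S @ Kinv @ S,
               x0=y.astype(float),          # feasible start: y_i S_i = 1
               constraints=[{'type': 'ineq', 'fun': lambda S: y * S - 1}])
coef = Kinv @ res.x                         # coef_j = y_j α_j

x_new = np.array([[0.5, 0.5]])              # f(x) = Σ_j coef_j K(x, x_j)
print("prediction:", np.sign(rbf(x_new, X) @ coef))
```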
Probabilistic Formulation of Support Vector Machines
Define the prior

μ[S] = 1/√((2π)^N β^{−N} |K|) · exp( −(β/2) S^T K^{−1} S )

and the pseudo-likelihood

∏_j P(y_j|S) = ∏_j Θ(y_j S_j − 1), where Θ(u) = 1 for u > 0 and 0 otherwise.

For β → ∞, the measure P[S|D] ∝ μ[S] P(D|S) concentrates at the vector Ŝ which solves the SVM optimization problem.
Analytical Average using Replicas
Let s_j = number of times data point y_j appears in the bootstrap sample D. Then

E_D[Z^n] = E_D[ ∫ ∏_{a=1}^n dS^a μ[S^a] ∏_{j,a} P^{s_j}(y_j|S^a_j) ]
         = ∫ ∏_{a=1}^n dS^a μ[S^a] ∏_{j=1}^N exp( (S/N) [ ∏_{a=1}^n P(y_j|S^a_j) − 1 ] )

where S is the bootstrap sample size and the counts s_j are treated as independent Poisson(S/N) variables.

New intractable statistical model with coupled replicas! Need approximate inference tools & the limit n → 0.
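The exponential form follows from the Poisson identity E[x^{s_j}] = exp((S/N)(x − 1)); a quick numeric sanity check of that step with illustrative sizes:

```python
import numpy as np

# s_j = occurrences of a fixed point j in S draws with replacement from N
# items is Binomial(S, 1/N) ≈ Poisson(S/N), so E[x^{s_j}] ≈ exp((S/N)(x−1)).
rng = np.random.default_rng(0)
N, S, x = 100, 100, 0.7                    # illustrative values
s_j = rng.binomial(S, 1.0 / N, size=1_000_000)
print((x ** s_j).mean(), np.exp((S / N) * (x - 1)))
```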
Results: Classification & Regression
Compare the TAP approximation theory with bootstrap simulation (= sampling + retraining). Generalization error:
[Plots: bootstrapped classification error vs. bootstrap sample size S for Crabs (N=200), Pima (N=532), Sonar (N=208), and Wisconsin (N=683); bootstrapped square loss vs. sample size S for Boston (N=506), with the average number of test points indicated; curves compare simulation with the approximate theories (TAP, variational Gaussian, mean field)]
SVM results cont’d
Uncertainty of SVM Prediction at test points
[Plots: density of the bootstrapped local field at a test input x, simulation vs. theory; scatter of simulated p(−1|x) against theoretical p(−1|x); example values S: 0.376, T: 0.405]
Regression
Distribution of predictor on training points
[Plots: densities of the bootstrapped prediction at individual inputs x; abundance vs. L1]
Outlook
- Systematic improvement
- Tractable substructures
- More complex dependencies (e.g. directed graphs)
- Fast algorithms & sparsity
- Combinatorial optimization problems, metastability
- Performance bounds?
Some Worse Results
[Plots: densities of the bootstrapped prediction at inputs x where theory and simulation agree less well]