SLIDE 1

Tractable Inference for Probabilistic Models

Manfred Opper (Aston University, Birmingham, U.K.)

In collaboration with:

  • Ole Winther (TU Denmark)
  • Dörthe Malzahn (TU Denmark)
  • Lehel Csató (Aston University)
SLIDE 3

The general Structure

D = observed data, S = hidden variables (unknown causes, etc.)

Bayes' rule:

$$P(S \mid D) = \frac{P(D \mid S) \times P(S)}{P(D)}$$

posterior distribution = likelihood × prior distribution / P(D)
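For discrete hidden variables, the rule is a two-line computation. A toy numeric sketch (the prior and likelihood numbers below are invented purely for illustration):

```python
import numpy as np

# Toy discrete example: hidden S ∈ {0, 1}, one observed outcome D
prior = np.array([0.7, 0.3])        # P(S)
likelihood = np.array([0.2, 0.9])   # P(D | S) for the observed D

joint = likelihood * prior          # P(D | S) P(S)
evidence = joint.sum()              # P(D) = Σ_S P(D | S) P(S)
posterior = joint / evidence        # P(S | D)
print(posterior)                    # ≈ [0.341, 0.659]
```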

SLIDE 4

Overview

  • Inference with probabilistic models: Examples
  • A “canonical” model
  • Problems with inference and approximate solutions
  • Cavity/TAP approximation
  • Applications
  • Outlook
SLIDE 5

Example I: Modeling with Gaussian Processes

  • Observations: data D = (y_1, . . . , y_N) observed at points x_i ∈ R^D.

[Figure: GP regression example; BV set size 10, likelihood parameter 2.0594]

  • Model for the observations:

    y_i = f(x_i) + "noise"          (regression, e.g. with positive noise)
    y_i = sign[f(x_i) + "noise"]    (classification)

  • A priori information about the "latent variable" (the function f):

a realization of a Gaussian random process with covariance K(x, x′).
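As a concrete illustration of this prior (not code from the talk; the squared-exponential kernel, its length scale, and the exponential noise scale are assumptions of the sketch), one can draw latent functions f from the GP and corrupt them with positive noise:

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=3.0):
    """Squared-exponential covariance K(x, x') = exp(−(x−x')² / 2ℓ²)."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / length_scale**2)

x = np.linspace(-20, 20, 200)
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))   # jitter for numerical stability

# Draw latent functions f ~ N(0, K) from the GP prior
rng = np.random.default_rng(0)
f = rng.multivariate_normal(np.zeros(len(x)), K, size=3)

# Regression-style observations with positive (exponential) noise
y = f[0] + rng.exponential(scale=0.5, size=len(x))
```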

SLIDE 6

Modeling with Gaussian processes: Windfields

Ambiguities in the local observation model (an MDN, mixture density network) for measuring wind velocity fields from satellites.

Solution: model the prior distribution of wind fields using a Gaussian process.

SLIDE 7

Example II: Code Division Multiple Access (CDMA)

  • K users in mobile communication try to transmit message bits S_1, . . . , S_K with S_i ∈ {−1, 1} over a single channel.
  • Modulation: multiply each message bit with a spreading code x_k(n) for n = 1, . . . , N_c.
  • Received signals:

$$y(n) = \sum_{k=1}^{K} S_k x_k(n) + \sigma\,\varepsilon(n)$$

  • Inference: estimate the S_k from the y(n) (= regression with binary variables). (Introduced to the machine learning community by Toshiyuki Tanaka.)
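A minimal simulation of this observation model, with a naive matched-filter detector for comparison (the random ±1 spreading codes, the noise level, and the detector choice are assumptions of the sketch, not part of the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
K_users, Nc, sigma = 8, 16, 0.5

S = rng.choice([-1, 1], size=K_users)                      # message bits S_k
x = rng.choice([-1, 1], size=(K_users, Nc)) / np.sqrt(Nc)  # spreading codes x_k(n)

# Received chips y(n) = Σ_k S_k x_k(n) + σ ε(n)
y = S @ x + sigma * rng.standard_normal(Nc)

# Naive matched-filter detection: correlate y with each user's code
S_hat = np.sign(x @ y)
print("bit errors:", int(np.sum(S_hat != S)))
```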

SLIDE 8

A canonical Class of Distributions

$$P(S) = \frac{1}{Z} \prod_i \rho_i(S_i)\, \exp\Big( \sum_{i<j} S_i J_{ij} S_j \Big)$$

ρ_i models local observations (likelihood) or local constraints.

[Figure: graph of variable nodes i, j coupled by J_ij]

Normalization Z usually coincides with probability P(D) of observed data.
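For small N this class can be handled exactly by enumeration, a useful reference point for the approximations on the following slides. A sketch assuming binary S_i ∈ {−1, +1} with uniform ρ_i:

```python
import numpy as np
from itertools import product

def exact_inference(J):
    """Brute-force Z, E[S_i], and E[S_i S_j] for the canonical model
    P(S) ∝ Π_i ρ_i(S_i) exp(Σ_{i<j} S_i J_ij S_j), with uniform ρ_i and
    binary S_i ∈ {−1, +1}.  Cost is O(2^N): reference use only."""
    N = J.shape[0]
    states = np.array(list(product([-1, 1], repeat=N)))        # all 2^N configs
    # Σ_{i<j} S_i J_ij S_j = ½ SᵀJS for symmetric J with zero diagonal
    weights = np.exp(0.5 * np.einsum('si,ij,sj->s', states, J, states))
    Z = weights.sum()
    p = weights / Z
    means = p @ states                                          # E[S_i]
    correlations = np.einsum('s,si,sj->ij', p, states, states)  # E[S_i S_j]
    return Z, means, correlations
```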

SLIDE 9

Problems with Inference

  • Variables are dependent → high-dimensional integrals/sums.
  • Exact inference is impossible if the random variables are continuous (and non-Gaussian).
  • The Laplace approximation for integrals is impossible if the integrand is non-differentiable.
  • "Learning" of the coupling matrix J by the EM algorithm (maximum likelihood) requires the correlations E[S_i S_j].

SLIDE 10

Non-variational Approximations

  • Bethe approximation / belief propagation (Yedidia, Freeman & Weiss): "tree-like" graphs.
  • TAP-type approximations: many neighbours, weak dependencies; the neighbourhood of a site i acts as a Gaussian random influence.

SLIDE 11

Gibbs Free Energy

  • Gives the moments and Z = P(D) simultaneously.
  • Applicability of optimization methods.

$$\Phi(m, M) \doteq \min_Q \left\{ \mathrm{KL}(Q \,\|\, P) \;\middle|\; E_Q[S_i] = m_i,\; E_Q[S_i^2] = M_i,\; i = 1, \ldots, N \right\} - \ln Z$$

[Figure: Φ(m) as a function of m; the minimum value −ln P(D) is attained at m_i = E[S_i]]

SLIDE 12

TAP Approximation to Free Energy

Introduce a tunable interaction strength l:

$$P_l(S) = \frac{1}{Z_l} \prod_i \rho_i(S_i)\, \exp\Big( l \sum_{i<j} S_i J_{ij} S_j \Big)$$

Exact result:

$$\Phi_{l=1} = \Phi_{l=0} + \int_0^1 dl\, \frac{\partial \Phi_l}{\partial l} = \Phi_{l=0} - \frac{1}{2} \sum_{i,j} m_i J_{ij} m_j - \frac{1}{2} \int_0^1 dl\, \mathrm{Tr}(C_l J)$$

with covariance C_l.

  • TAP (Thouless, Anderson & Palmer): expand Φ_l to O(l²).
  • Adaptive TAP (Opper & Winther): Gaussian approximation for C_l:

$$C_l^g = (\Lambda_l - l J)^{-1}$$
SLIDE 13

Properties of TAP Free Energy

  • The free energy has the form

$$\Phi_{\mathrm{TAP}}(m, M) = \Phi_0(m, M) + \Phi^g(m, M) - \Phi_0^g(m, M)$$

The Φ's are convex and correspond to:
Φ_0(m, M): true likelihood, no interactions.
Φ^g(m, M): Gaussian likelihood, full interactions.
Φ_0^g(m, M): Gaussian likelihood, no interactions.

  • The hyperparameters minimizing Φ_TAP equal the fixed points of an approximate EM algorithm.

SLIDE 14

Relation to Cavity Approach

$$\Phi_0 = \max_{\lambda^0, \gamma^0} \Big\{ -\sum_i \ln Z_i^0(\gamma_i^0, \lambda_i^0) + m^T \gamma^0 + \frac{1}{2} M^T \lambda^0 \Big\}$$

with

$$Z_i^0(\gamma_i^0, \lambda_i^0) = \int dS\, \rho_i(S)\, \exp\Big( \gamma_i^0 S + \frac{1}{2} \lambda_i^0 S^2 \Big) = \int dS\, \rho_i(S)\, E_z\Big[ \exp\Big( S \big( \gamma_i^0 + \sqrt{\lambda_i^0}\, z \big) \Big) \Big]$$

with z a standard normal Gaussian random variable.
SLIDE 15

Algorithm: Expectation Propagation (T. Minka)

Introduce an effective Gaussian distribution whose likelihood is

$$\prod_{i=1}^{N} \rho_i^g(S_i) = \prod_{i=1}^{N} e^{-\lambda_i S_i^2 + \gamma_i S_i}$$

Cycle over sites i:

  • Replace the Gaussian likelihood term by the true likelihood; the new marginal is

$$P_i(S) \propto P_i^g(S)\, \frac{\rho_i(S)}{\rho_i^g(S)}$$

  • Recompute E[S_i] and E[S_i²].
  • Recompute λ_i and γ_i → move to the next site.
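A minimal numerical sketch of this loop for the canonical binary model. Everything concrete here is an assumption of the sketch, not the talk's implementation: local factors ρ_i(S_i) ∝ e^{θ_i S_i}, the −½λS² parameterization of the Gaussian sites, the damping, and the per-site matrix inversion (clear, not efficient):

```python
import numpy as np

def ep_binary(J, theta, n_sweeps=100, damping=0.5):
    """EP sketch for P(S) ∝ Π_i e^{θ_i S_i} exp(Σ_{i<j} S_i J_ij S_j),
    S_i ∈ {−1, +1}.  Each binary factor is approximated by a Gaussian
    site exp(−½ λ_i S_i² + γ_i S_i), refined by moment matching against
    the tilted (cavity × true factor) marginal.  J: symmetric, zero
    diagonal, weak enough that diag(λ) − J stays positive definite."""
    N = J.shape[0]
    lam = np.ones(N)       # site precisions λ_i
    gam = np.zeros(N)      # site linear terms γ_i
    for _ in range(n_sweeps):
        for i in range(N):
            # Marginal of the global Gaussian Q(S) ∝ exp(−½ Sᵀ(Λ−J)S + γᵀS)
            C = np.linalg.inv(np.diag(lam) - J)
            mu = C @ gam
            # Cavity distribution for site i, in natural parameters
            r_cav = 1.0 / C[i, i] - lam[i]
            b_cav = mu[i] / C[i, i] - gam[i]
            # Tilted moments: for S_i = ±1 the quadratic term is constant,
            # so E[S_i] = tanh(θ_i + b_cav) and Var[S_i] = 1 − E[S_i]²
            m = np.tanh(theta[i] + b_cav)
            v = 1.0 - m ** 2
            # Moment matching: new site = tilted Gaussian / cavity
            lam[i] += damping * ((1.0 / v - r_cav) - lam[i])
            gam[i] += damping * ((m / v - b_cav) - gam[i])
    C = np.linalg.inv(np.diag(lam) - J)
    return C @ gam         # approximate means E[S_i]

# Usage on a small, weakly coupled system
rng = np.random.default_rng(0)
A = 0.05 * rng.standard_normal((8, 8))
J = (A + A.T) / 2
np.fill_diagonal(J, 0.0)
theta = rng.normal(0.0, 0.5, size=8)
print(ep_binary(J, theta))
```

Comparing its output against the brute-force marginals from the earlier enumeration sketch (small N) is a quick sanity check.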
SLIDE 16

Exact average-case behaviour: random J-matrix ensembles, N → ∞

Assume an orthogonal random matrix ensemble for J_N with the asymptotic scaling of the generating function

$$\frac{1}{N} \ln \Big\langle e^{\frac{1}{2} \mathrm{Tr}(A J_N)} \Big\rangle_J \simeq \mathrm{Tr}\, G(A/N)$$

For N → ∞, the average-case properties (replica symmetry) of exact inference and of the ADATAP approximation agree (if there is a single solution).

SLIDE 17

Application: Non-Gaussian Regression

y = f(x) + ξ with positive noise p(ξ) = λ e^{−λξ} 1_{ξ>0}: estimate the parameter λ with N = 1000.

[Figure: GP fit with BV set size 10; estimated likelihood parameter 2.0594]
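A toy version of this noise model (the latent function, the sample size, and the naive moment estimator for λ are assumptions of the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
N, lam_true = 1000, 2.0

x = rng.uniform(-20, 20, size=N)
f = np.sin(x / 4.0)                    # stand-in for the latent function
xi = rng.exponential(scale=1.0 / lam_true, size=N)  # p(ξ) = λe^{−λξ}, ξ > 0
y = f + xi

# With f known, the maximum-likelihood estimate of λ is 1 / mean residual
lam_hat = 1.0 / np.mean(y - f)
print(f"estimated λ ≈ {lam_hat:.3f}")  # close to the true value 2.0
```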

SLIDE 18

Example: Estimation of Wind Fields

[Figure: wind-field estimates with 10 m s⁻¹ and 20 m s⁻¹ reference arrows; panels show the likelihood, the Monte Carlo prediction, and the ADATAP prediction]

SLIDE 19

CDMA Results I (Winther & Fabricius)

[Figure: scatter plots of exact vs. naive mean-field and exact vs. ADATAP predictions]

Results for the Bayes-optimal prediction h_i = artanh(m_i): exact vs. mean field and exact vs. ADATAP, for K = 8 users and N_c = 16.

SLIDE 20

CDMA Results II (Winther & Fabricius)

[Figure: bit-error rate (BER, log scale from 10⁻⁴ to 10⁻¹) vs. number of users K for the naive mean-field, adaptive TAP, linear MMSE, hard serial IC, and matched-filter detectors]

Bit-error rate as a function of the number of users K, at SNR = 10 dB and spreading factor N_c = 20.

SLIDE 21

Approximate analytical Bootstrap

Goal: estimate average-case properties (e.g. test errors, uncertainty) of a statistical predictor (e.g. an SVM) without held-out test data.

Bootstrap (Efron): generate new pseudo training data by resampling the old training data with replacement.

Original training data: D_0 = (z_1, z_2, z_3)
Bootstrap samples: D_1 = (z_1, z_1, z_2); D_2 = (z_1, z_2, z_2); D_3 = (z_3, z_3, z_3), . . .

Problem: each sample requires time-consuming retraining of the predictor.

Approximate analytical approach: average over the samples with the help of the "replica trick".
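To make the "sampling + retraining" baseline concrete, here is a minimal bootstrap loop (the toy nearest-mean predictor and the evaluation on points left out of each resample are assumptions of the sketch):

```python
import numpy as np

def fit_nearest_mean(X, y):
    """Toy predictor: store the two class means."""
    return X[y == -1].mean(axis=0), X[y == 1].mean(axis=0)

def predict_nearest_mean(model, X):
    m_neg, m_pos = model
    closer_pos = (np.linalg.norm(X - m_pos, axis=1)
                  < np.linalg.norm(X - m_neg, axis=1))
    return np.where(closer_pos, 1, -1)

def bootstrap_test_error(X, y, n_boot=200, seed=0):
    """Resample with replacement, retrain, and test on the points
    that did not make it into the bootstrap sample."""
    rng = np.random.default_rng(seed)
    N, errors = len(y), []
    for _ in range(n_boot):
        idx = rng.integers(0, N, size=N)           # bootstrap sample D_b
        held_out = np.setdiff1d(np.arange(N), idx)
        if held_out.size == 0:
            continue
        model = fit_nearest_mean(X[idx], y[idx])   # costly retraining step
        y_hat = predict_nearest_mean(model, X[held_out])
        errors.append(np.mean(y_hat != y[held_out]))
    return np.mean(errors)
```

The retraining inside the loop is exactly the cost the analytical replica approach avoids.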

SLIDE 22

Support Vector Classifier (Vapnik)

The SVM predicts, for x ∈ R^d,

$$y = \mathrm{sign}[\hat{f}_{D_0}(x)], \qquad \hat{f}_{D_0}(x) = \sum_{j=1}^{N} y_j \alpha_j K(x, x_j)$$

with K a positive definite kernel. Setting

$$S_i = \sum_{j=1}^{N} y_j \alpha_j K(x_i, x_j),$$

the α's can be found from the convex optimization problem

$$\text{minimize } S^T K^{-1} S \quad \text{subject to } S_i y_i \geq 1, \; i = 1, \ldots, N.$$
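This quadratic program is small enough to hand to a generic convex solver. A minimal sketch (the toy data, the RBF kernel, and the use of cvxpy are assumptions of the sketch, not part of the talk):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
N = 20
X = rng.standard_normal((N, 2))
y = np.where(X[:, 0] + 0.1 * rng.standard_normal(N) > 0, 1.0, -1.0)

# RBF kernel matrix, with jitter so K is safely positive definite
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * d2) + 1e-6 * np.eye(N)

# Minimize SᵀK⁻¹S subject to S_i y_i ≥ 1
Kinv = np.linalg.inv(K)
Kinv = 0.5 * (Kinv + Kinv.T)            # symmetrize for the solver
S = cp.Variable(N)
prob = cp.Problem(cp.Minimize(cp.quad_form(S, Kinv)),
                  [cp.multiply(y, S) >= 1])
prob.solve()

# Recover the α's: S_i = Σ_j y_j α_j K(x_i, x_j)  ⇒  y∘α = K⁻¹S
alpha = (Kinv @ S.value) / y
```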

SLIDE 23

Probabilistic formulation of Support Vector Machines

Define the prior

$$\mu[S] = \frac{1}{\sqrt{(2\pi)^N \beta^{-N} |K|}} \exp\Big( -\frac{\beta}{2} S^T K^{-1} S \Big)$$

and the pseudo-likelihood

$$\prod_j P(y_j \mid S) = \prod_j \Theta(y_j S_j - 1),$$

where Θ(u) = 1 for u > 0 and 0 otherwise.

For β → ∞, the measure P[S|D] ∝ µ[S] P(D|S) concentrates at the vector Ŝ which solves the SVM optimization problem.

SLIDE 24

Analytical Average using Replicas

Let s_j = the number of times data point y_j appears in the bootstrap sample D. Then

$$E_D[Z^n] = E_D\left[ \int \prod_{a=1}^{n} \big(dS^a\, \mu[S^a]\big) \prod_{j,a} P^{s_j}(y_j \mid S_j^a) \right] = \int \prod_{a=1}^{n} \big(dS^a\, \mu[S^a]\big) \prod_{j=1}^{N} \exp\Big( \frac{S}{N} \prod_{a=1}^{n} P(y_j \mid S_j^a) \Big)$$

(S is the bootstrap sample size.) A new intractable statistical model with coupled replicas! We need approximate inference tools and the limit n → 0.

SLIDE 25

Results: Classification & Regression

Compare the TAP approximation theory with bootstrap simulation (= sampling + retraining). Generalization error:

[Figure: bootstrapped classification error vs. bootstrap sample size S for Crabs (N=200), Pima (N=532), Sonar (N=208), and Wisconsin (N=683); bootstrapped square loss vs. sample size S for Boston (N=506), comparing simulation with TAP, variational-Gaussian, and mean-field theory; the average number of test points falls from 341 to 70 as S grows]

SLIDE 26

SVM results cont’d

Uncertainty of SVM Prediction at test points

[Figure: density of the bootstrapped local field at a test input x, and simulation vs. theory for p(−1|x); at the marked point, simulation gives 0.376 and theory 0.405]

SLIDE 27

Regression

Distribution of predictor on training points

[Figure: densities of the bootstrapped prediction at selected inputs x, simulation vs. theory]

SLIDE 28

Outlook

  • Systematic improvement
  • Tractable substructures
  • More complex dependencies (e.g. directed graphs)
  • Fast algorithms & sparsity
  • Combinatorial optimization problems, metastability
  • Performance bounds?
SLIDE 29

Some worse Results

[Figure: examples with poorer agreement between theory and simulation; densities of the bootstrapped prediction at several inputs x]