

  1. Tractable Inference for Probabilistic Models
Manfred Opper (Aston University, Birmingham, U.K.)
In collaboration with: Ole Winther (TU Denmark), Dörthe Malzahn (TU Denmark), Lehel Csató (Aston University)

  2. The General Structure
D = observed data, S = hidden variables (unknown causes, etc.)
Bayes' rule:
P(S | D) = P(D | S) × P(S) / P(D)
(posterior distribution = likelihood × prior distribution / evidence)
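To make the rule concrete, here is a minimal numerical sketch for a binary hidden variable; the prior and likelihood values are invented purely for illustration.

```python
# Bayes' rule for a single binary hidden variable S in {-1, +1}.
# The prior and likelihood numbers below are invented purely for illustration.
prior = {+1: 0.7, -1: 0.3}          # P(S)
likelihood = {+1: 0.2, -1: 0.6}     # P(D | S) for the observed data D

evidence = sum(likelihood[s] * prior[s] for s in (+1, -1))              # P(D)
posterior = {s: likelihood[s] * prior[s] / evidence for s in (+1, -1)}  # P(S | D)
print(posterior)  # {1: 0.4375, -1: 0.5625}
```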

  3. Overview
• Inference with probabilistic models: examples
• A "canonical" model
• Problems with inference and approximate solutions
• Cavity/TAP approximation
• Applications
• Outlook

  4. Example I: Modeling with Gaussian Processes
• Observations: data D = (y_1, ..., y_N) observed at points x_i ∈ R^D.
[Figure: 1-D regression example; BV set size: 10, likelihood parameter: 2.0594.]
• Model for the observations:
  y_i = f(x_i) + "noise"          (regression, e.g. with positive noise)
  y_i = sign[f(x_i) + "noise"]    (classification)
• A priori information about the "latent variable" (the function f): realization of a Gaussian random process with covariance K(x, x').
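For orientation, a minimal sketch of the tractable special case, GP regression with Gaussian noise, follows; the squared-exponential kernel, its hyperparameters and the toy data are assumptions made here for illustration (the talk's examples use non-Gaussian noise and classification likelihoods, which is where approximate inference becomes necessary).

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=3.0, variance=1.0):
    """Squared-exponential covariance; just one possible choice of K(x, x')."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_regression(X, y, X_star, noise_var=0.1):
    """Posterior mean and variance of f at X_star for y_i = f(x_i) + Gaussian noise.
    With Gaussian noise everything stays Gaussian, so inference is exact."""
    A = rbf_kernel(X, X) + noise_var * np.eye(len(X))
    K_s = rbf_kernel(X, X_star)
    mean = K_s.T @ np.linalg.solve(A, y)
    cov = rbf_kernel(X_star, X_star) - K_s.T @ np.linalg.solve(A, K_s)
    return mean, np.diag(cov)

# Toy 1-D data (invented): noisy observations of a smooth function.
rng = np.random.default_rng(0)
X = rng.uniform(-20, 20, size=40)
y = np.sin(0.3 * X) + 0.3 * rng.standard_normal(40)
mean, var = gp_regression(X, y, np.linspace(-20, 20, 200))
```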

  5. Modeling with Gaussian Processes: Wind Fields
The local observation model (an MDN network) for measuring wind velocity fields from satellites is ambiguous.
Solution: model the prior distribution of wind fields using a Gaussian process.

  6. Example II: Code Division Multiple Access (CDMA)
• K users in mobile communication try to transmit message bits S_1, ..., S_K with S_k ∈ {−1, +1} over a single channel.
• Modulation: multiply each message with a spreading code x_k(n), n = 1, ..., N_c.
• Received signals:
  y(n) = Σ_{k=1}^{K} S_k x_k(n) + σ ε(n)
• Inference: estimate the S_k from the y(n) (= regression with binary variables).
(Introduced to the machine learning community by Toshiyuki Tanaka.)
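A minimal simulation of this observation model, using a matched-filter detector as the simplest baseline (not the Bayes-optimal detector discussed later); the parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
K, N_c, sigma = 8, 16, 0.5   # users, spreading factor, noise level (illustrative values)

S = rng.choice([-1, 1], size=K)                          # message bits S_k
x = rng.choice([-1, 1], size=(K, N_c)) / np.sqrt(N_c)    # spreading codes x_k(n)
eps = rng.standard_normal(N_c)

y = S @ x + sigma * eps       # received chips y(n) = sum_k S_k x_k(n) + sigma * eps(n)

# Matched filter + sign decision: the simplest of the detectors compared later,
# ignoring interference between users.
S_hat = np.sign(x @ y)
print("bit errors:", int(np.sum(S_hat != S)))
```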

  7. A Canonical Class of Distributions
P(S) = (1/Z) Π_i ρ_i(S_i) exp( Σ_{i<j} S_i J_ij S_j )
[Diagram: sites i and j coupled by J_ij.]
ρ_i models local observations (likelihood) or local constraints.
The normalization Z usually coincides with the probability P(D) of the observed data.
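The sketch below writes down the unnormalized log-probability of this canonical model and computes Z and the first moments by brute-force enumeration for a tiny binary example; the couplings and local terms are invented. This is exactly the exponential-cost computation that the approximations below are designed to avoid.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
N = 4                                   # brute force is only feasible for tiny N
J = 0.3 * rng.standard_normal((N, N))
J = (J + J.T) / 2
np.fill_diagonal(J, 0.0)
h = rng.standard_normal(N)              # rho_i(S_i) = exp(h_i * S_i), an illustrative local term

def log_weight(S):
    """log[ prod_i rho_i(S_i) * exp( sum_{i<j} S_i J_ij S_j ) ]  (J symmetric, zero diagonal)."""
    return h @ S + 0.5 * S @ J @ S

states = np.array(list(product([-1, 1], repeat=N)))       # all 2^N configurations
w = np.exp([log_weight(s) for s in states])
Z = w.sum()                                                # normalization (= P(D) in the model)
m = (states * w[:, None]).sum(axis=0) / Z                  # moments E[S_i]
print("ln Z =", np.log(Z), " m =", m)
```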

  8. Problems with Inference
• Dependent variables → high-dimensional integrals/sums.
• Exact inference is impossible if the random variables are continuous (and non-Gaussian).
• The Laplace approximation for integrals is impossible if the integrand is non-differentiable.
• "Learning" the coupling matrix J by the EM algorithm (maximum likelihood) requires the correlations E[S_i S_j].

  9. Non-Variational Approximations
• Bethe approximation / belief propagation (Yedidia, Freeman & Weiss): "tree-like" graphs around site i.
• TAP-type approximations: many neighbours, weak dependencies; the neighbourhood acts on site i as a Gaussian random influence.

  10. Gibbs Free Energy
• Gives the moments and Z = P(D) simultaneously.
• Applicability of optimization methods.
Φ(m, M) ≐ min_Q { KL(Q || P) : E_Q[S_i] = m_i, E_Q[S_i^2] = M_i, i = 1, ..., N } − ln Z
[Figure: Φ(m) as a function of m; the minimum, −ln P(D), is attained at m_i = E[S_i].]
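As a concrete, simplified instance of this idea, the sketch below implements the naive mean-field Gibbs free energy for binary spins S_i ∈ {−1, +1} with ρ_i(S_i) = exp(h_i S_i): a single objective in the moments m whose stationary points give both the m_i and an approximation to −ln Z. The adaptive TAP free energy of the next slides refines this; the model parameters here are invented.

```python
import numpy as np

def mf_free_energy(m, J, h):
    """Naive mean-field Gibbs free energy for P(S) ∝ exp(h·S + sum_{i<j} S_i J_ij S_j),
    S_i in {-1,+1}: negative entropy of the factorized Q minus E_Q[log unnormalized P].
    Its value at a minimum is an upper bound on (and approximation to) -ln Z."""
    qp, qm = (1 + m) / 2, (1 - m) / 2
    neg_entropy = np.sum(qp * np.log(qp) + qm * np.log(qm))
    expected_log_weight = h @ m + 0.5 * m @ J @ m     # J symmetric, zero diagonal
    return neg_entropy - expected_log_weight

def mf_fixed_point(J, h, n_iter=500):
    """Stationarity of the free energy gives m_i = tanh(h_i + sum_j J_ij m_j)."""
    m = np.zeros(len(h))
    for _ in range(n_iter):
        m = np.tanh(h + J @ m)
    return m

rng = np.random.default_rng(3)
N = 4
J = 0.3 * rng.standard_normal((N, N)); J = (J + J.T) / 2; np.fill_diagonal(J, 0.0)
h = rng.standard_normal(N)
m = mf_fixed_point(J, h)
print("moments:", m, "  free energy (≈ -ln Z):", mf_free_energy(m, J, h))
```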

  11. TAP Approximation to the Free Energy
Introduce a tunable interaction strength l:
P_l(S) = (1/Z_l) Π_i ρ_i(S_i) exp( l Σ_{i<j} S_i J_ij S_j )
Exact result:
Φ_{l=1} = Φ_{l=0} + ∫_0^1 dl ∂Φ_l/∂l = Φ_{l=0} − (1/2) Σ_{i,j} m_i J_ij m_j − (1/2) ∫_0^1 dl Tr(C_l J)
with covariance C_l.
• TAP (Thouless, Anderson & Palmer): expand Φ_l to O(l^2).
• Adaptive TAP (Opper & Winther): Gaussian approximation for C_l:
  C_l^g = (Λ_l − l J)^{−1}
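For the classical case of ±1 spins with ρ_i(S_i) = exp(h_i S_i), the O(l²) expansion leads to the well-known TAP self-consistency equations with an Onsager reaction term; adaptive TAP determines the reaction term self-consistently from the Gaussian approximation of C_l rather than from the second-order expansion. A sketch of the classical second-order version (parameters invented):

```python
import numpy as np

def tap_fixed_point(J, h, n_iter=500, damping=0.5):
    """TAP equations for binary spins:
       m_i = tanh(h_i + sum_j J_ij m_j - m_i * sum_j J_ij^2 (1 - m_j^2)).
    The last term is the Onsager correction; adaptive TAP replaces J_ij^2 (1 - m_j^2)
    by self-consistently determined reaction coefficients."""
    m = np.zeros(len(h))
    for _ in range(n_iter):
        onsager = (J**2) @ (1 - m**2)
        m_new = np.tanh(h + J @ m - m * onsager)
        m = damping * m + (1 - damping) * m_new      # damping stabilizes the iteration
    return m

rng = np.random.default_rng(4)
N = 6
J = rng.standard_normal((N, N)) / np.sqrt(N); J = (J + J.T) / 2; np.fill_diagonal(J, 0.0)
h = 0.5 * rng.standard_normal(N)
print("TAP moments:", tap_fixed_point(J, h))
```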

  12. Properties of the TAP Free Energy
• The free energy has the form
  Φ_TAP(m, M) = Φ_0(m, M) + Φ_g(m, M) − Φ_g0(m, M)
  The Φ's are convex and correspond to:
  Φ_0(m, M): true likelihood, no interactions.
  Φ_g(m, M): Gaussian likelihood, full interactions.
  Φ_g0(m, M): Gaussian likelihood, no interactions.
• Hyperparameters that minimize Φ_TAP coincide with fixed points of an approximate EM algorithm.

  13. Relation to the Cavity Approach
Φ_0 = max_{λ^0, γ^0} [ − Σ_i ln Z_i(γ_i^0, λ_i^0) + m^T γ^0 + (1/2) M^T λ^0 ]
with
Z_i(γ_i^0, λ_i^0) = ∫ dS ρ_i(S) exp( γ_i^0 S + (1/2) λ_i^0 S^2 )
                  = ∫ dS ρ_i(S) E_z[ exp( S (γ_i^0 + sqrt(λ_i^0) z) ) ]
with z a standard normal Gaussian random variable.

  14. Algorithm: Expectation Propagation (T. Minka)
Introduce an effective Gaussian distribution with the likelihood
Π_{i=1}^{N} ρ_i^g(S_i) = Π_{i=1}^{N} exp( −λ_i S_i^2 + γ_i S_i )
For each site i:
• Replace the Gaussian likelihood by the true likelihood. New marginal:
  P_i(S) ∝ P_i^g(S) ρ_i(S) / ρ_i^g(S)
• Recompute E[S_i] and E[S_i^2].
• Recompute λ_i and γ_i → new site.
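Below is a minimal sketch of such an EP loop for the case of a Gaussian prior N(0, K) multiplied by non-Gaussian site terms ρ_i(S_i) (the setting of the Gaussian-process examples). Instead of analytic site updates, the tilted moments E[S_i], E[S_i²] are computed by 1-D numerical integration on a grid, and the global Gaussian is recomputed by naive re-inversion; here λ_i plays the role of a site precision. All model details in the usage example are invented.

```python
import numpy as np

def ep(K, log_rho, n_sweeps=20):
    """Expectation propagation for P(S) ∝ N(S; 0, K) * prod_i rho_i(S_i).
    Each site is approximated by exp(gamma_i * S - 0.5 * lambda_i * S^2)."""
    N = K.shape[0]
    grid = np.linspace(-8.0, 8.0, 2001)           # 1-D grid for the tilted-moment integrals
    lam = np.zeros(N)                              # site precisions  lambda_i
    gam = np.zeros(N)                              # site linear terms gamma_i
    K_inv = np.linalg.inv(K)
    Sigma, mu = K.copy(), np.zeros(N)              # global Gaussian for lam = gam = 0
    for _ in range(n_sweeps):
        for i in range(N):
            # Cavity distribution: remove the Gaussian site i from the current marginal.
            cav_prec = 1.0 / Sigma[i, i] - lam[i]
            cav_h = mu[i] / Sigma[i, i] - gam[i]
            if cav_prec <= 0:                      # skip numerically problematic updates
                continue
            cav_var, cav_mean = 1.0 / cav_prec, cav_h / cav_prec
            # Tilted distribution: cavity Gaussian times the true site rho_i.
            logw = log_rho[i](grid) - 0.5 * (grid - cav_mean) ** 2 / cav_var
            w = np.exp(logw - logw.max()); w /= w.sum()
            m1 = np.sum(w * grid)                  # E[S_i] under the tilted distribution
            v1 = np.sum(w * grid**2) - m1**2       # Var[S_i]
            # Moment matching: new Gaussian site parameters.
            lam[i] = 1.0 / v1 - cav_prec
            gam[i] = m1 / v1 - cav_h
            Sigma = np.linalg.inv(K_inv + np.diag(lam))   # naive recompute of the global Gaussian
            mu = Sigma @ gam
    return mu, Sigma

# Usage sketch: hard classification sites rho_i(S) = Theta(y_i * S) on a toy GP prior.
rng = np.random.default_rng(5)
N = 5
xs = rng.standard_normal(N)
K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2) + 1e-6 * np.eye(N)
y = np.sign(rng.standard_normal(N))
log_rho = [lambda s, yi=yi: np.where(yi * s > 0, 0.0, -np.inf) for yi in y]
mu, Sigma = ep(K, log_rho)
print("approximate posterior means:", mu)
```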

  15. Exact Average-Case Behaviour: Random J-Matrix Ensembles, N → ∞
Assume an orthogonal random matrix ensemble for J_N with asymptotic scaling of the generating function
(1/N) ln E_J[ exp( (1/2) Tr(A J_N) ) ] ≃ Tr G(A/N)
For N → ∞: the average-case properties (replica symmetry) of exact inference and of the ADATAP approximation agree (if there is a single solution).

  16. Application: Non-Gaussian Regression
y = f(x) + ξ with positive noise p(ξ) = λ e^{−λξ} I_{ξ>0}: estimate the parameter λ with N = 1000.
[Figure: 1-D regression fit; BV set size: 10, likelihood parameter: 2.0594.]
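The one-sided exponential noise makes the site likelihood non-Gaussian and non-differentiable at ξ = 0, so neither exact inference nor a Laplace approximation applies; site terms of the form sketched below can, however, be handled by an EP/ADATAP loop such as the ep() sketch above (λ = 2 is an arbitrary illustrative value).

```python
import numpy as np

def positive_noise_log_site(y_i, lam=2.0):
    """Log site likelihood for y_i = f(x_i) + xi with p(xi) = lam * exp(-lam * xi), xi > 0:
    the latent value f must lie below the observation y_i."""
    def log_rho(f):
        xi = y_i - f
        return np.where(xi > 0, np.log(lam) - lam * xi, -np.inf)
    return log_rho

# e.g. log_rho = [positive_noise_log_site(y_i) for y_i in y] could be passed to ep(K, log_rho).
```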

  17. Example: Estimation of Wind Fields
[Figure: likelihood, Monte Carlo prediction and ADATAP prediction of the wind field (velocity scales 10 m s^{-1} and 20 m s^{-1}).]

  18. CDMA Results I (Winther & Fabricius)
[Figure: scatter plots of naive mean field vs. exact and of ADATAP vs. exact for the Bayes-optimal prediction h_i = artanh(m_i).]
K = 8 users and N_c = 16.

  19. CDMA Results II (Winther & Fabricius)
[Figure: bit-error rate (BER) as a function of the number of users K for naive mean field, adaptive TAP, linear MMSE, hard serial IC and the matched filter.]
SNR = 10 dB and spreading factor N_c = 20.

  20. Approximate Analytical Bootstrap
Goal: estimate average-case properties (e.g. test errors, uncertainty) of a statistical predictor (e.g. an SVM) without held-out test data.
Bootstrap (Efron): generate new pseudo training data by resampling the old training data with replacement.
Original training data: D_0 = (z_1, z_2, z_3)
Bootstrap samples: D_1 = (z_1, z_1, z_2); D_2 = (z_1, z_2, z_2); D_3 = (z_3, z_3, z_3); ...
Problem: each sample requires time-consuming retraining of the predictor.
Approximate analytical approach: average over the samples with the help of the "replica trick". (The brute-force baseline is sketched below.)
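For contrast with the analytical average, here is a sketch of the brute-force procedure: resample, retrain, and test on the points left out of each bootstrap sample. The `predictor` interface (a function that returns a fitted prediction function) is a hypothetical convention for this sketch.

```python
import numpy as np

def bootstrap_test_error(X, y, predictor, n_boot=200, seed=0):
    """Brute-force bootstrap estimate of the classification test error:
    for each bootstrap sample, retrain and evaluate on the left-out points.
    This is the costly loop that the replica-based analytical average avoids."""
    rng = np.random.default_rng(seed)
    N = len(y)
    errors = []
    for _ in range(n_boot):
        idx = rng.integers(0, N, size=N)                  # resample with replacement
        out = np.setdiff1d(np.arange(N), idx)             # points not drawn serve as test data
        if len(out) == 0:
            continue
        predict = predictor(X[idx], y[idx])               # time-consuming retraining
        errors.append(np.mean(predict(X[out]) != y[out]))
    return float(np.mean(errors))
```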

  21. Support Vector Classifier (Vapnik)
The SVM predicts y = sign[ f̂_{D_0}(x) ] for x ∈ R^d, with f̂_{D_0}(x) = Σ_{j=1}^{N} y_j α_j K(x, x_j) and K a positive definite kernel.
Setting S_i = Σ_{j=1}^{N} y_j α_j K(x_i, x_j), the α's can be found from the convex optimization problem:
minimize S^T K^{−1} S subject to the constraints S_i y_i ≥ 1, i = 1, ..., N.
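A minimal sketch of training and predicting with a kernel SVM; it uses scikit-learn's SVC with a precomputed kernel matrix, which solves the soft-margin dual of the problem above (a large C approximates the hard-margin constraints on the slide, and the learned decision function includes a bias term not present in the slide's formulation). The RBF kernel and toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

def rbf(A, B, gamma=1.0):
    """Illustrative positive definite kernel K(x, x')."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(6)
X = rng.standard_normal((60, 2))
y = np.sign(X[:, 0] + 0.3 * rng.standard_normal(60))        # toy labels

clf = SVC(kernel="precomputed", C=1e3)                       # large C ≈ hard margin
clf.fit(rbf(X, X), y)

X_test = rng.standard_normal((10, 2))
f_hat = clf.decision_function(rbf(X_test, X))                # ≈ sum_j y_j alpha_j K(x, x_j) + bias
y_pred = np.sign(f_hat)
```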

  22. Probabilistic Formulation of Support Vector Machines
Define the prior
μ[S] = exp( −(β/2) S^T K^{−1} S ) / sqrt( (2π)^N β^{−N} |K| )
and the pseudo-likelihood
Π_j P(y_j | S) = Π_j Θ(y_j S_j − 1),
where Θ(u) = 1 for u > 0 and 0 otherwise.
For β → ∞, the measure P[S | D] ∝ μ[S] P(D | S) concentrates at the vector Ŝ that solves the SVM optimization problem.
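In code, the ingredients of this formulation are just another Gaussian prior plus hard step-function site terms, of the same form used by the ep() sketch earlier; β and the margin convention follow the slide, everything else is illustrative.

```python
import numpy as np

def svm_log_site(y_j):
    """Log pseudo-likelihood log Theta(y_j * S_j - 1): zero when the margin constraint
    y_j * S_j > 1 holds and -inf otherwise."""
    return lambda s: np.where(y_j * s > 1, 0.0, -np.inf)

def svm_log_prior(S, K, beta):
    """Log of the (unnormalized) Gaussian prior mu[S] ∝ exp(-beta/2 * S^T K^{-1} S);
    as beta -> infinity the posterior concentrates on the SVM solution."""
    return -0.5 * beta * S @ np.linalg.solve(K, S)
```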

  23. Analytical Average Using Replicas
Let s_j = the number of times data point y_j appears in the bootstrap sample D:
E_D[Z^n] = E_D[ ∫ Π_{a=1}^{n} (dS^a μ[S^a]) Π_{j,a} P^{s_j}(y_j | S_j^a) ]
         = ∫ Π_{a=1}^{n} (dS^a μ[S^a]) Π_{j=1}^{N} exp( (S/N) [ Π_{a=1}^{n} P(y_j | S_j^a) − 1 ] )
New intractable statistical model with coupled replicas! We need approximate inference tools and the limit n → 0.

  24. Results: Classification & Regression
Compare the TAP approximation theory with bootstrap simulation (= sampling + retraining).
[Figures: bootstrapped classification error, bootstrapped square loss and average number of test points as functions of the bootstrap sample size S, comparing simulation with the approximate theories (TAP, variational Gaussian, mean field) on the Wisconsin (N = 683), Pima (N = 532), Boston (N = 506), Sonar (N = 208) and Crabs (N = 200) data sets.]

  25. SVM Results (continued)
Uncertainty of the SVM prediction at test points.
[Figure: density of the bootstrapped local field at a test input x, with simulated and theoretical estimates of p(−1 | x) (simulation: 0.376, theory: 0.405).]

  26. Regression
Distribution of the predictor on the training points.
[Figure: density of the bootstrapped prediction at a training input, and histogram of the L1 error.]

  27. Outlook
• Systematic improvement
• Tractable substructures
• More complex dependencies (e.g. directed graphs)
• Fast algorithms & sparsity
• Combinatorial optimization problems, metastability
• Performance bounds?
