Tractable Inference for Probabilistic Models
Manfred Opper (Aston University, Birmingham, U.K.)
In collaboration with: Ole Winther (TU Denmark), Dörthe Malzahn (TU Denmark), Lehel Csató (Aston U)
The general Structure
D = observed data, S = hidden variables (unknown causes, etc.)
Bayes' rule:

P(S|D) = P(D|S) × P(S) / P(D)

posterior = likelihood × prior / evidence
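As a toy illustration (all numbers invented for this example), Bayes' rule for a binary hidden variable in a few lines of Python:

```python
# Minimal illustration of Bayes' rule for a binary hidden variable S.
# The prior and likelihood values are invented for this example.
prior = {+1: 0.5, -1: 0.5}           # P(S)
likelihood = {+1: 0.8, -1: 0.3}      # P(D | S) for the observed data D
evidence = sum(likelihood[s] * prior[s] for s in prior)  # P(D)
posterior = {s: likelihood[s] * prior[s] / evidence for s in prior}
print(posterior)  # {1: 0.727..., -1: 0.272...}
```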
Overview
- Inference with probabilistic models: Examples
- A “canonical” model
- Problems with inference and approximate solutions
- Cavity/TAP approximation
- Applications
- Outlook
Example I: Modeling with Gaussian Processes
- Observations: data D = (y_1, . . . , y_N) observed at points x_i ∈ R^D.
  [Figure: one-dimensional regression data with GP fit]
- Model for observations:
  y_i = f(x_i) + "noise"  (regression, e.g. with positive noise)
  y_i = sign[f(x_i) + "noise"]  (classification)
- A priori information about the "latent variable" (the function f):
  realization of a Gaussian random process with covariance K(x, x′).
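For reference, a minimal sketch of GP regression in the conjugate case of Gaussian noise, where the posterior is available in closed form (kernel and noise level are illustrative choices); the talk's point is that a positive-noise or sign likelihood breaks exactly this closed form:

```python
import numpy as np

# GP regression with Gaussian noise: posterior mean and covariance in
# closed form. The RBF kernel and noise level sigma are illustrative.
def rbf(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, (30, 1))                  # training inputs x_i
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)  # noisy observations y_i
sigma = 0.1

Xs = np.linspace(-5, 5, 100)[:, None]            # test inputs
K = rbf(X, X) + sigma**2 * np.eye(len(X))        # K(X, X) + σ²I
Ks = rbf(Xs, X)                                  # K(X*, X)
mean = Ks @ np.linalg.solve(K, y)                # posterior mean of f
cov = rbf(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)  # posterior covariance
```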
Modeling with Gaussian processes: Windfields
Ambiguities in the local observation model for measuring wind velocity fields from satellites.
The local observation model is given by a mixture density network (MDN).
Solution: Model prior distribution of wind fields using a Gaussian process.
Example II: Code Division Multiple Access (CDMA)
- K users in mobile communication try to transmit message bits S_1, . . . , S_K with S_i ∈ {−1, 1} over a single channel.
- Modulation: multiply the message with a spreading code x_k(n), n = 1, . . . , N_c.
- Received signals:

  y(n) = Σ_{k=1}^{K} S_k x_k(n) + σ ε(n)

- Inference: estimate the S_k's from the y(n)'s (= regression with binary variables). (Introduced to the machine learning community by Toshiyuki Tanaka.)
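A minimal simulation of this observation model; the random ±1 spreading codes and noise level are illustrative assumptions, and the matched-filter estimate is only a simple baseline, not the inference method discussed in the talk:

```python
import numpy as np

# CDMA observation model: y(n) = Σ_k S_k x_k(n) + σ ε(n).
rng = np.random.default_rng(0)
K, Nc, sigma = 8, 16, 0.3
S = rng.choice([-1, 1], size=K)          # message bits, one per user
x = rng.choice([-1, 1], size=(K, Nc))    # spreading codes x_k(n)
y = S @ x + sigma * rng.normal(size=Nc)  # received signal y(n)

S_hat = np.sign(x @ y)                   # matched-filter baseline estimate
print("bit errors:", int(np.sum(S_hat != S)))
```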
A canonical Class of Distributions
P(S) = (1/Z) ∏_i ρ_i(S_i) exp( Σ_{i<j} S_i J_ij S_j )

ρ_i models local observations (likelihood) or local constraints.
[Graph: nodes i and j coupled by J_ij]
The normalization Z usually coincides with the probability P(D) of the observed data.
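To make the cost of exact inference concrete: for small N one can compute Z and the moments by enumeration, as in the sketch below (binary S_i with uniform ρ_i as an illustrative special case); the O(2^N) scaling is precisely what the approximations in this talk avoid.

```python
import numpy as np
from itertools import product

# Exact Z, E[S_i], and E[S_i S_j] for the canonical model with S_i ∈ {−1,+1}
# and uniform ρ_i (illustrative choice). Cost grows as 2^N.
rng = np.random.default_rng(0)
N = 10
J = np.triu(rng.normal(0, 1 / np.sqrt(N), (N, N)), 1)  # couplings for i < j

states = np.array(list(product([-1, 1], repeat=N)))    # all 2^N configurations
w = np.exp(np.einsum('si,ij,sj->s', states, J, states))  # exp(Σ_{i<j} S_i J_ij S_j)
Z = w.sum()
m = (w[:, None] * states).sum(0) / Z                   # E[S_i]
C = np.einsum('s,si,sj->ij', w, states, states) / Z    # E[S_i S_j]
```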
Problems with Inference
- Dependent variables → high-dimensional integrals/sums.
- Exact inference is impossible if the random variables are continuous (and non-Gaussian).
- The Laplace approximation for integrals is impossible if the integrand is non-differentiable.
- "Learning" of the coupling matrix J by the EM algorithm (maximum likelihood) requires the correlations E[S_i S_j].
Non-variational Approximations
- Bethe approximation / belief propagation (Yedidia, Freeman & Weiss): "tree-like" graphs.
  [Graph: tree-like neighbourhood of site i]
- TAP-type approximations: many neighbours, weak dependencies; the neighbourhood acts as a Gaussian random influence on site i.
  [Graph: densely connected neighbourhood of site i]
Gibbs Free Energy
- Gives the moments and Z = P(D) simultaneously.
- Allows the application of optimization methods.

Φ(m) := min_Q { KL(Q||P) : E_Q[S_i] = m_i, E_Q[S_i²] = M_i, i = 1, . . . , N } − ln Z

[Plot: Φ(m) as a function of m; the minimum lies at m = E[S_i] with value − ln P(D)]
TAP Approximation to Free Energy
Introduce a tunable interaction strength l:

P_l(S) = (1/Z) ∏_i ρ_i(S_i) exp( l Σ_{i<j} S_i J_ij S_j )

Exact result:

Φ_{l=1} = Φ_{l=0} + ∫_0^1 dl ∂Φ_l/∂l = Φ_{l=0} − (1/2) Σ_{i,j} m_i J_ij m_j − (1/2) ∫_0^1 dl Tr(C_l J)

with covariance C_l.

- TAP (Thouless, Anderson & Palmer): expand Φ_l to O(l²).
- Adaptive TAP (Opper & Winther): Gaussian approximation for C_l:

  C_l^g = (Λ_l − l J)^{−1}
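A small numeric sketch of the coupling correction under the Gaussian ansatz, crudely holding Λ_l fixed at a constant Λ (the actual adaptive TAP scheme instead matches Λ_l to the non-Gaussian marginals); for fixed Λ the integral has the closed form ln det Λ − ln det(Λ − J), which the code uses as a sanity check:

```python
import numpy as np

# Evaluate −½ ∫₀¹ dl Tr(C_l J) with C_l = (Λ − lJ)⁻¹ and Λ held fixed
# (an illustrative simplification of the adaptive TAP scheme).
rng = np.random.default_rng(0)
N = 10
J = rng.normal(0, 0.5 / np.sqrt(N), (N, N))
J = (J + J.T) / 2
np.fill_diagonal(J, 0.0)
Lam = 2.0 * np.eye(N)                      # illustrative choice with Λ − J ≻ 0

ls = np.linspace(0.0, 1.0, 201)
tr = np.array([np.trace(np.linalg.inv(Lam - l * J) @ J) for l in ls])
integral = np.sum((tr[:-1] + tr[1:]) / 2) * (ls[1] - ls[0])  # trapezoid rule

# sanity check: for fixed Λ, ∫₀¹ Tr((Λ − lJ)⁻¹ J) dl = ln det Λ − ln det(Λ − J)
exact = np.linalg.slogdet(Lam)[1] - np.linalg.slogdet(Lam - J)[1]
print(-0.5 * integral, -0.5 * exact)       # the two should agree closely
```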
Properties of TAP Free Energy
- The free energy has the form

  Φ_TAP(m, M) = Φ_0(m, M) + Φ^g(m, M) − Φ_0^g(m, M)

  The Φ's are convex and correspond to:
  Φ_0(m, M): true likelihood, no interactions.
  Φ^g(m, M): Gaussian likelihood, full interactions.
  Φ_0^g(m, M): Gaussian likelihood, no interactions.
- Minimizing Φ_TAP over the hyperparameters reproduces the fixed points of an approximate EM algorithm.
Relation to Cavity Approach
Φ_0 = max_{λ^0, γ^0} { − Σ_i ln Z_i^0(γ_i^0, λ_i^0) + m^T γ^0 + (1/2) M^T λ^0 }

with

Z_i^0(γ_i^0, λ_i^0) = ∫ dS ρ_i(S) exp( γ_i^0 S + (1/2) λ_i^0 S² )
                    = ∫ dS ρ_i(S) E_z[ exp( S (γ_i^0 + √(λ_i^0) z) ) ]

with z a standard normal Gaussian random variable.
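The second form uses the Gaussian identity E_z[e^{az}] = e^{a²/2}; a quick Monte Carlo check with illustrative values (valid for λ ≥ 0):

```python
import numpy as np

# Check E_z[exp(S(γ + √λ z))] = exp(γS + ½λS²) for z ~ N(0, 1).
rng = np.random.default_rng(0)
S, gamma, lam = 0.7, 0.3, 0.5              # illustrative values, lam >= 0
z = rng.normal(size=1_000_000)
mc = np.exp(S * (gamma + np.sqrt(lam) * z)).mean()
closed = np.exp(gamma * S + 0.5 * lam * S**2)
print(mc, closed)                          # agree to Monte Carlo accuracy
```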
Algorithm: Expectation Propagation (T. Minka)
Introduce an effective Gaussian distribution whose likelihood is

∏_{i=1}^N ρ_i^g(S_i) = ∏_{i=1}^N e^{−λ_i S_i² + γ_i S_i}

Iterate over the sites (see the sketch below):
- Visit site i and replace the Gaussian site likelihood by the true likelihood:
  new marginal P_i(S) ∝ [P_i^g(S) / ρ_i^g(S)] ρ_i(S)
- Recompute E[S_i] and E[S_i²].
- Recompute λ_i and γ_i → move on to the next site.
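A compact, self-contained sketch of one way this loop can look for binary S_i with uniform ρ_i (so the tilted moments reduce to a tanh); the coupling scale, the ½-convention for λ_i, and recomputing the full matrix inverse each update are illustrative simplifications, not the talk's implementation:

```python
import numpy as np
from itertools import product

# EP sketch for P(S) ∝ Π_i ρ_i(S_i) exp(Σ_{i<j} S_i J_ij S_j), S_i ∈ {−1,+1},
# uniform ρ_i; sites approximated by exp(γ_i S_i − ½ λ_i S_i²).
rng = np.random.default_rng(0)
N = 8
J = rng.normal(0, 0.5 / np.sqrt(N), (N, N))
J = (J + J.T) / 2
np.fill_diagonal(J, 0.0)

gamma, lam = np.zeros(N), np.ones(N)            # site parameters γ_i, λ_i
for sweep in range(50):
    for i in range(N):
        Sigma = np.linalg.inv(np.diag(lam) - J) # covariance of Gaussian approx
        mu = Sigma @ gamma
        p_cav = 1.0 / Sigma[i, i] - lam[i]      # cavity precision at site i
        h_cav = mu[i] / Sigma[i, i] - gamma[i]  # cavity field at site i
        m = np.tanh(h_cav)                      # tilted E[S_i] (uses S_i² = 1)
        v = 1.0 - m * m                         # tilted Var[S_i]
        lam[i] = 1.0 / v - p_cav                # moment matching → new site
        gamma[i] = m / v - h_cav

mu = np.linalg.inv(np.diag(lam) - J) @ gamma    # EP estimate of E[S_i]

# compare with exact enumeration (feasible only for small N)
states = np.array(list(product([-1, 1], repeat=N)))
w = np.exp(0.5 * np.einsum('si,ij,sj->s', states, J, states))
print(mu)
print((w[:, None] * states).sum(0) / w.sum())
```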
Exact Average-Case Behaviour: Random J-Matrix Ensembles, N → ∞
Assume an orthogonal random matrix ensemble for J_N with the asymptotic scaling of the generating function

(1/N) ln ⟨ e^{(1/2) Tr(A J_N)} ⟩_J ≃ Tr G(A/N)

For N → ∞, the average-case properties (replica symmetry) of exact inference and of the ADATAP approximation agree (if there is a single solution).
Application: Non-Gaussian Regression
y = f(x) + ξ with positive noise p(ξ) = λ e^{−λξ} I_{ξ>0}: estimate the parameter λ with N = 1000.
[Figure: data and sparse GP fit; BV set size: 10, estimated likelihood parameter: 2.0594]
Example: Estimation of Wind Fields
[Figure: wind field estimates (arrow scales 10 m s⁻¹ and 20 m s⁻¹); panels: likelihood, Monte Carlo prediction, ADATAP prediction]
CDMA Results I (Winther & Fabricius)
[Scatter plots: exact vs. naive mean field and exact vs. TAP]
Results for the Bayes-optimal prediction h_i = artanh(m_i): exact vs. naive mean field and exact vs. ADATAP. K = 8 users and N_c = 16.
CDMA Results II (Winther & Fabricius)
[Plot: bit-error rate (BER) on a log scale vs. number of users K, for naive mean field, adaptive TAP, linear MMSE, hard serial interference cancellation, and matched filter]
Bit-error rate as a function of the number of users. SNR = 10 dB and spreading factor N_c = 20.
Approximate analytical Bootstrap
Goal: estimate average-case properties (e.g. test errors, uncertainty) of a statistical predictor (e.g. an SVM) without hold-out test data.
Bootstrap (Efron): generate new pseudo training data by resampling the old training data with replacement.
Original training data: D_0 = (z_1, z_2, z_3). Bootstrap samples: D_1 = (z_1, z_1, z_2); D_2 = (z_1, z_2, z_2); D_3 = (z_3, z_3, z_3), . . .
Problem: each sample requires time-consuming retraining of the predictor.
Approximate analytical approach: average over samples with the help of the "replica trick".
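For contrast, a minimal sketch of the naive sampling-plus-retraining bootstrap that the analytical approach is meant to avoid; ridge regression stands in for the predictor (the talk uses an SVM), and out-of-bag points serve as test data:

```python
import numpy as np

# Naive bootstrap: resample with replacement, retrain, average the loss.
rng = np.random.default_rng(0)
N = 200
X = rng.normal(size=(N, 5))
y = X @ rng.normal(size=5) + 0.3 * rng.normal(size=N)

def train(Xb, yb, reg=1e-2):
    # ridge regression as an illustrative stand-in for the predictor
    return np.linalg.solve(Xb.T @ Xb + reg * np.eye(Xb.shape[1]), Xb.T @ yb)

losses = []
for b in range(200):                        # every sample retrains: the cost
    idx = rng.integers(0, N, size=N)        # resample with replacement
    w = train(X[idx], y[idx])
    oob = np.setdiff1d(np.arange(N), idx)   # left-out points act as test data
    losses.append(np.mean((X[oob] @ w - y[oob]) ** 2))
print("bootstrapped test loss:", np.mean(losses))
```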
Support Vector Classifier (Vapnik)
The SVM predicts y = sign[f̂_{D_0}(x)] for x ∈ R^d, with f̂_{D_0}(x) = Σ_{j=1}^N y_j α_j K(x, x_j) and K a positive definite kernel.
Setting S_i = Σ_{j=1}^N y_j α_j K(x_i, x_j), the α's can be found from the convex optimization problem (see the sketch below):
Minimize S^T K^{−1} S under the constraints S_i y_i ≥ 1, i = 1, . . . , N.
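The problem above transcribes directly into code for a small dense case; the toy data, RBF kernel, and SciPy's SLSQP solver are illustrative choices (a hard-margin sketch, not a production SVM):

```python
import numpy as np
from scipy.optimize import minimize

# Hard-margin kernel SVM in the S-parametrization:
# minimize SᵀK⁻¹S subject to y_i S_i ≥ 1.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] + X[:, 1])              # linearly separable toy labels

def rbf(A, B, ell=1.0):
    return np.exp(-0.5 * ((A[:, None] - B[None]) ** 2).sum(-1) / ell**2)

Kmat = rbf(X, X) + 1e-8 * np.eye(len(X))    # jitter keeps K invertible
Kinv = np.linalg.inv(Kmat)

res = minimize(lambda S: S @ Kinv @ S,
               x0=y.astype(float),          # feasible start: y_i S_i = 1
               constraints=[{'type': 'ineq', 'fun': lambda S: y * S - 1}])
coef = Kinv @ res.x                         # coef_j = y_j α_j

x_new = np.array([[0.5, 0.5]])              # f(x) = Σ_j coef_j K(x, x_j)
print("prediction:", np.sign(rbf(x_new, X) @ coef))
```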
Probabilistic Formulation of Support Vector Machines
Define the prior

μ[S] = 1/√((2π)^N β^{−N} |K|) · exp( −(β/2) S^T K^{−1} S )

and the pseudo-likelihood

∏_j P(y_j|S) = ∏_j Θ(y_j S_j − 1), where Θ(u) = 1 for u > 0 and 0 otherwise.

For β → ∞, the measure P[S|D] ∝ μ[S] P(D|S) concentrates at the vector Ŝ which solves the SVM optimization problem.
Analytical Average using Replicas
Let s_j = number of times data point y_j appears in the bootstrap sample D. Then

E_D[Z^n] = E_D[ ∫ ∏_{a=1}^n dS^a μ[S^a] ∏_{j,a} P^{s_j}(y_j|S^a_j) ]
         = ∫ ∏_{a=1}^n dS^a μ[S^a] ∏_{j=1}^N exp( (S/N) [ ∏_{a=1}^n P(y_j|S^a_j) − 1 ] )

where S is the bootstrap sample size and the counts s_j are treated as independent Poisson(S/N) variables.

New intractable statistical model with coupled replicas! Need approximate inference tools & the limit n → 0.
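The exponential form follows from the Poisson identity E[x^{s_j}] = exp((S/N)(x − 1)); a quick numeric sanity check of that step with illustrative sizes:

```python
import numpy as np

# s_j = occurrences of a fixed point j in S draws with replacement from N
# items is Binomial(S, 1/N) ≈ Poisson(S/N), so E[x^{s_j}] ≈ exp((S/N)(x−1)).
rng = np.random.default_rng(0)
N, S, x = 100, 100, 0.7                    # illustrative values
s_j = rng.binomial(S, 1.0 / N, size=1_000_000)
print((x ** s_j).mean(), np.exp((S / N) * (x - 1)))
```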
Results: Classification & Regression
Compare the TAP approximation theory with bootstrap simulation (= sampling + retraining). Generalization error:
[Plots: bootstrapped classification error vs. bootstrap sample size S for Crabs (N=200), Pima (N=532), Sonar (N=208), and Wisconsin (N=683); bootstrapped square loss vs. sample size S for Boston (N=506), with the average number of test points indicated; curves compare simulation with the approximate theories (TAP, variational Gaussian, mean field)]
SVM results cont’d
Uncertainty of SVM Prediction at test points
[Plots: density of the bootstrapped local field at a test input x, simulation vs. theory; scatter of simulated p(−1|x) against theoretical p(−1|x); example values S: 0.376, T: 0.405]
Regression
Distribution of predictor on training points
[Plots: densities of the bootstrapped prediction at individual inputs x; abundance vs. L1]
Outlook
- Systematic improvement
- Tractable substructures
- More complex dependencies (e.g. directed graphs)
- Fast algorithms & sparsity
- Combinatorial optimization problems, metastability
- Performance bounds?
Some Worse Results
[Plots: densities of the bootstrapped prediction at inputs x where theory and simulation agree less well]