SLIDE 1

BEYOND MEAN-FIELD APPROXIMATION

AURÉLIEN DECELLE

LABORATOIRE DE RECHERCHE EN INFORMATIQUE UNIVERSITÉ PARIS SUD

SLIDE 2

MOTIVATIONS

Why inverse problems ?  In Machine Learning → online recognition tasks  In Physics → understanding a physical system from observations  In social science → getting insight of latent properties

SLIDE 3

HOW HARD ?

Direct problems are already hard : understanding equilibrium properties can be (very) challenging (e.g. spin glasses). Inverse problems can be harder : ideally, maximizing the likelihood would involve computing the partition function many times. In particular, serious problems can appear because of :

  • Overfitting
  • Non-convex functions
  • Slow convergence in the direct problem
SLIDE 4

HOW HARD ?

Depending on the system, different optimization schemes can be adopted

SLIDE 5

DEEP LEARNING

SLIDE 6

ICML STUFF

SLIDE 7

WHY WE NEED TO GO BEYOND MF

MF maps the distribution of the data onto a particular parametric form of probability distribution :

$\min_\theta \, \mathrm{KL}\big( p_{\mathrm{data}} \,\|\, p_{\mathrm{target}}(\theta) \big)$

nMF : $p_{\mathrm{nMF}}(s) = \prod_i p_i(s_i)$

Bethe approx. : $p_{\mathrm{BA}}(s) = \prod_{(ij)} \frac{p_{ij}(s_i, s_j)}{p_i(s_i)\, p_j(s_j)} \, \prod_i p_i(s_i)$

SLIDE 8

WHY WE NEED TO GO BEYOND MF

What about when the system cannot be described by this particular form of distribution ?

  • Long-range correlations
  • Very specific topology
  • Presence of hidden nodes

⊕ How to include prior information ?

SLIDE 9

OTHER METHODS ?

Pseudo-likelihood

  • Trade-off between complexity and the level of approximation
  • Consistent for infinite sampling
  • Can deal with priors

But : overfits

Max-likelihood

  • Same as the last two points above

But : overfits, and can be very slow

SLIDE 10

OTHER METHODS ?

Adaptive cluster expansion

  • Avoids overfitting
  • Consistently develops clusters of larger sizes

But it is hard to write it …

Contrastive divergence

  • Very fast
  • A trade-off can be found between speed and exactness

Overfits, and can be bad if convergence is very slow !

Minimum Probability Flow

  • Fast to converge
  • Consistent

But probably does not work well for small sampling.

SLIDE 11

PSEUDO-LIKELIHOOD METHOD

  • Principle
  • Comparison with MF
  • Regularization
  • Decimation
  • Generalisation and extension
SLIDE 12

SETTINGS

We consider the following problem : a system of discrete variables $s_i = 1, \ldots, q$ (ok, let's say $s_i = \pm 1$ in the following)

  • Interacting by pairs and having biases (local fields).

$H(s) = -\sum_{\langle i,j \rangle} J_{ij} s_i s_j - \sum_i h_i s_i \qquad p(s) = \frac{e^{-\beta H(s)}}{Z}$

Then a set of configurations is collected : $\{ s^{(a)} \}_{a=1,\ldots,M}$. Using them, it is possible to compute the likelihood.

Reconstruction error : $\varepsilon^2 = \sum_{i<j} \big(J_{ij} - J^*_{ij}\big)^2 \,\Big/\, \sum_{i<j} \big(J^*_{ij}\big)^2$
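For concreteness, the error measure above can be computed like this (a minimal numpy sketch; the function name and the choice of normalizing by the true couplings $J^*$ are my own):

```python
import numpy as np

def reconstruction_error(J_inferred, J_true):
    """eps^2 = sum_{i<j} (J_ij - J*_ij)^2 / sum_{i<j} (J*_ij)^2.
    Only the upper triangle is summed so each pair <i,j> counts once."""
    iu = np.triu_indices_from(J_true, k=1)
    return np.sum((J_inferred[iu] - J_true[iu]) ** 2) / np.sum(J_true[iu] ** 2)

# toy check on a 3-spin coupling matrix
J_star = np.array([[ 0.0, 0.5, -0.3],
                   [ 0.5, 0.0,  0.7],
                   [-0.3, 0.7,  0.0]])
perfect = reconstruction_error(J_star, J_star)       # exact reconstruction: 0
doubled = reconstruction_error(2 * J_star, J_star)   # errors as large as the couplings: 1
```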

SLIDE 13

SETTINGS

The likelihood function. Probability of observing the configurations $= \prod_a \frac{e^{-\beta H(s^{(a)})}}{Z}$. Define the log-likelihood $\mathcal{L} = \sum_a \big( -\beta H(s^{(a)}) - \log Z \big)$

Problem of maximization … how to compute average values efficiently ? $\frac{\partial \mathcal{L}}{\partial J_{ij}} \propto \langle s_i s_j \rangle_{\mathrm{data}} - \langle s_i s_j \rangle_{\mathrm{model}}$

SLIDE 14

PSEUDO-LIKELIHOOD

Goal : find a function that can be maximized and would infer the $J$s correctly

$p(s) = p(s_i \mid s_{\setminus i}) \, p(s_{\setminus i})$

$p(s_i \mid s_{\setminus i}) = \frac{e^{\beta s_i (\sum_k J_{ik} s_k + h_i)}}{2\cosh\big(\beta (\sum_k J_{ik} s_k + h_i)\big)}$ can be maximized !

Ekeberg et al. : protein folding. ??? : training RBM
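A minimal numpy sketch of this conditional (assuming the Ising convention $p(s) \propto e^{\beta(\sum_{i<j} J_{ij} s_i s_j + \sum_i h_i s_i)}$ with symmetric $J$ and zero diagonal; the function name is mine):

```python
import numpy as np

def p_conditional(s, i, J, h, beta=1.0):
    r"""p(s_i | s_{\i}) = exp(beta*s_i*f_i) / (2*cosh(beta*f_i)),
    with the local field f_i = sum_k J_ik s_k + h_i (J_ii = 0 assumed,
    so f_i does not depend on s_i itself)."""
    f = J[i] @ s + h[i]
    return np.exp(beta * s[i] * f) / (2.0 * np.cosh(beta * f))

# sanity check: the two states of spin 0 sum to probability one
rng = np.random.default_rng(0)
N = 6
J = rng.normal(size=(N, N)); J = (J + J.T) / 2; np.fill_diagonal(J, 0.0)
h = rng.normal(size=N)
s = rng.choice([-1, 1], size=N).astype(float)
s_flip = s.copy(); s_flip[0] *= -1
total = p_conditional(s, 0, J, h) + p_conditional(s_flip, 0, J, h)
```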

SLIDE 15

PSEUDO-LIKELIHOOD

Can we have theoretical insight ? Yes : for infinite Gibbs sampling, the maximum is correct ! Consider $\mathcal{PL}_i = \sum_a \log p\big(s_i^{(a)} \mid s_{\setminus i}^{(a)}\big)$ ; we replace the distribution over the data by the Boltzmann one :

$\mathcal{PL}_i = \sum_{C} \frac{e^{-\beta H_G(s^C)}}{Z_G} \, \log p\big(s_i^C \mid s_{\setminus i}^C\big)$

The maximum is reached when the couplings of $H$ and of $H_G$ are equal

SLIDE 16

PSEUDO-LIKELIHOOD

When no hidden variables are present, the PL is convex ! Therefore only one maximum exists ! The PL can be maximized without too much trouble using for instance

  • Newton's method
  • Gradient ascent

And the complexity goes as $O(N^2 M)$. Let's understand how this works and how it compares to MF
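To make this concrete, here is a gradient-ascent sketch of the PLM on toy data (my own minimal implementation, not the talk's code, under the convention $p(s) \propto e^{\beta(\sum J_{ij} s_i s_j + \sum h_i s_i)}$; each row of $J$ is fitted from its own conditional and the two estimates are symmetrized at the end):

```python
import numpy as np

def plm_fit(samples, beta=1.0, lr=0.1, n_steps=500):
    r"""Maximize PL_i = sum_a log p(s_i^(a) | s_\i^(a)) for every site i
    by plain gradient ascent -- the PL is convex, so there is one maximum.
    Gradient: dPL_i/dJ_ik = beta * < (s_i - tanh(beta*f_i)) * s_k >_data."""
    M, N = samples.shape
    J = np.zeros((N, N))
    h = np.zeros(N)
    for _ in range(n_steps):
        F = samples @ J.T + h                    # F[a, i] = sum_k J_ik s_k^(a) + h_i
        delta = samples - np.tanh(beta * F)      # s_i - <s_i>_conditional
        J += lr * beta * (delta.T @ samples) / M
        np.fill_diagonal(J, 0.0)                 # no self-couplings
        h += lr * beta * delta.mean(axis=0)
    return (J + J.T) / 2, h                      # symmetrize the two row estimates

# toy data: exact samples from two coupled spins, p(s1,s2) ∝ exp(beta*J*s1*s2)
rng = np.random.default_rng(0)
beta, J_true, M = 0.5, 1.0, 20000
s1 = rng.choice([-1.0, 1.0], size=M)
p_plus = np.exp(beta * J_true * s1) / (2 * np.cosh(beta * J_true))
s2 = np.where(rng.random(M) < p_plus, 1.0, -1.0)
J_hat, h_hat = plm_fit(np.stack([s1, s2], axis=1), beta=beta)
```

With $M = 20000$ samples the inferred coupling lands close to the true value 1.0, illustrating the consistency-for-infinite-sampling claim of the previous slide.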

SLIDE 17

RECALL OF THE SETTING

A set of M equilibrium configurations $s^{(a)}, \ a = 1, \ldots, M$. On one side we use the MF equations :

$J_{ij} = -\big(C^{-1}\big)_{ij} \qquad m_i = \tanh\Big( \beta \big( \textstyle\sum_k J_{ik} m_k + h_i \big) \Big) \ \forall i$

On the other side we maximize the pseudo-likelihood :

$\mathcal{PL}_i = -\sum_a \log\Big( 1 + e^{-2\beta s_i^{(a)} ( \sum_k J_{ik} s_k^{(a)} + h_i )} \Big) \quad \forall i$
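The MF side of this comparison is essentially one line on top of the empirical correlation matrix. A sketch (the $1/\beta$ factor and the zeroed diagonal are my own bookkeeping choices):

```python
import numpy as np

def nmf_couplings(samples, beta=1.0):
    """Naive mean-field inversion J_ij = -(C^{-1})_ij / beta, where
    C_ij = <s_i s_j> - <s_i><s_j> is the connected correlation matrix."""
    m = samples.mean(axis=0)
    C = samples.T @ samples / len(samples) - np.outer(m, m)
    J = -np.linalg.inv(C) / beta
    np.fill_diagonal(J, 0.0)   # self-couplings are not defined
    return J

# toy check on exact samples from a weakly coupled pair of spins,
# p(s1, s2) ∝ exp(beta*J*s1*s2): nMF should recover J ≈ 0.2
rng = np.random.default_rng(1)
beta, J_true, M = 1.0, 0.2, 50000
s1 = rng.choice([-1.0, 1.0], size=M)
p_plus = np.exp(beta * J_true * s1) / (2 * np.cosh(beta * J_true))
s2 = np.where(rng.random(M) < p_plus, 1.0, -1.0)
J_hat = nmf_couplings(np.stack([s1, s2], axis=1), beta=beta)
```

At stronger couplings the nMF estimate degrades while the PLM stays consistent, which is the point of the comparisons on the next slides.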

SLIDE 18

MEAN-FIELD AND PLM

Curie-Weiss $J_{ij} = 1/N$ with N = 100 spins. Hopfield $J_{ij} = \sum_a \xi_i^a \xi_j^a$ with N = 100 spins

and two patterns, M = 100k

SLIDE 19

MEAN-FIELD AND PLM

SK model, N = 64, with M = $10^6, 10^7, 10^8$. 2D model, $J_{ij} = -1$, N = 49, with M = $10^4, 10^5, 10^6$

  • E. Aurell and M. Ekeberg 2012
SLIDE 20

WHAT ABOUT THE STRUCTURE ?

SLIDE 21

WHAT ABOUT THE STRUCTURE ?

How is the L1 norm included in PLM ?

$\mathcal{PL}_i = -\sum_a \log\Big( 1 + e^{-2\beta s_i^{(a)} \sum_k J_{ik} s_k^{(a)}} \Big) - \lambda \sum_k |J_{ik}| \quad \forall i$

Leads to sparse solutions … how to fix $\lambda$ ?
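A standard way to handle the non-smooth L1 term is proximal gradient (ISTA-style): after each PL gradient step, soft-threshold the couplings with threshold lr · λ. This recipe is my own illustration, not necessarily the talk's choice; the proximal operator itself is:

```python
import numpy as np

def soft_threshold(J, t):
    """Proximal operator of t * sum |J_ik|: shrink every coupling toward
    zero and set those smaller than t (in magnitude) exactly to zero --
    this is what produces the sparse solutions."""
    return np.sign(J) * np.maximum(np.abs(J) - t, 0.0)

row = np.array([0.30, -0.05, 0.12, 0.0])
shrunk = soft_threshold(row, 0.1)   # 0.30 -> 0.20, -0.05 -> 0, 0.12 -> 0.02
```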

SLIDE 22

WHAT ABOUT THE STRUCTURE ?

SLIDE 23

WHAT ABOUT THE STRUCTURE ?

SLIDE 24

VERY SIMPLE IDEA : DECIMATION

Progressively decimate the parameters with a small absolute value. Not NEW :

  • In optimization problems using BP (Montanari et al.)
  • Optimal Brain Damage (LeCun)
SLIDE 25

DECIMATION ALGORITHM

Given a set of equilibrium configurations and all parameters unfixed

  • 1. Maximize the pseudo-likelihood function over all non-fixed parameters
  • 2. Decimate the $\rho(t)$ smallest parameters (in magnitude) and fix them to zero
  • 3. If (criterion is reached)
  • 1. exit
  • 4. Else
  • 1. $t \leftarrow t + 1$
  • 2. goto 1.

Joint work with F. Ricci-Tersenghi
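The loop above can be sketched as follows (a schematic of my own: `maximize_pl` stands in for any pseudo-likelihood maximizer that respects the mask of still-free couplings, `rho` is the fraction fixed per round, and the stopping criterion of the next slides is left out):

```python
import numpy as np

def decimate(samples, maximize_pl, rho=0.05, n_rounds=3):
    """(1) maximize the PL over the non-fixed couplings, (2) fix the
    rho fraction of smallest couplings (in magnitude) to zero, (3) repeat."""
    N = samples.shape[1]
    free = np.triu(np.ones((N, N), dtype=bool), k=1)   # couplings still to infer
    J = np.zeros((N, N))
    for _ in range(n_rounds):
        J = maximize_pl(samples, free)                 # step 1
        mags = np.where(free, np.abs(J), np.inf)       # never re-fix a fixed one
        k = max(1, int(rho * free.sum()))              # how many to fix this round
        idx = np.unravel_index(np.argsort(mags, axis=None)[:k], mags.shape)
        free[idx] = False                              # step 2: fix them ...
        J[idx] = 0.0                                   # ... to zero
    return J, free

# toy run with a stub maximizer that just reads off a fixed matrix
target = np.array([[0.0, 3.0, 1.0, 2.0],
                   [0.0, 0.0, 4.0, 5.0],
                   [0.0, 0.0, 0.0, 6.0],
                   [0.0, 0.0, 0.0, 0.0]])
stub = lambda samples, free: np.where(free, target, 0.0)
J, free = decimate(np.ones((10, 4)), stub, rho=0.2, n_rounds=2)
```

With `rho=0.2` and two rounds, the two weakest couplings (1.0 and 2.0) are decimated while the strong ones survive.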

SLIDE 26

DECIMATION ALGORITHM

Given a set of equilibrium configurations and all parameters unfixed

  • 1. Maximize the pseudo-likelihood function over all non-fixed parameters
  • 2. Decimate the $\rho(t)$ smallest parameters (in magnitude) and fix them to zero
  • 3. If (criterion is reached)
  • 1. exit
  • 4. Else
  • 1. $t \leftarrow t + 1$
  • 2. goto 1.

????

SLIDE 27

CAN YOU GUESS THE CRITERION ?

Random graph with 16 nodes

SLIDE 28

CAN YOU GUESS THE CRITERION ?

Random graph with 16 nodes. (Plot annotations : the difference increases / the difference decreases.)

SLIDE 29

HOW DOES IT LOOK ?

2D ferro model, M = 4500, $\beta = 0.8$

SLIDE 30

COMPARISON WITH L1 : ROC

(ROC axes : # true negatives vs. # true positives ; "My objective !" marks the ideal corner.)

SLIDE 31

COMPARISON WITH L1 : ROC

SLIDE 32

SOME MORE COMPARISONS (IF TIME)

SLIDE 33

TO BE CONTINUED …

Can be adapted for the max-likelihood of the parallel dynamics (A.D. and P. Zhang)

$p\big(s(t+1) \mid s(t)\big) = \prod_i \frac{e^{\beta s_i(t+1) ( \sum_k J_{ik} s_k(t) + h_i )}}{2\cosh\big( \beta ( \sum_k J_{ik} s_k(t) + h_i ) \big)}$

Has been applied to « Detection of cheating by decimation algorithm », Shogo Yamanaka, Masayuki Ohzeki, A.D.
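A sketch of this one-step transition probability (my own code; note that for the kinetic problem $J$ need not be symmetric):

```python
import numpy as np
from itertools import product

def transition_prob(s_next, s, J, h, beta=1.0):
    """Parallel (synchronous) dynamics:
    p(s(t+1) | s(t)) = prod_i exp(beta*s_i(t+1)*f_i) / (2*cosh(beta*f_i)),
    with f_i = sum_k J_ik s_k(t) + h_i.  The product factorizes over sites
    because all spins are updated simultaneously given s(t)."""
    f = J @ s + h
    return np.prod(np.exp(beta * s_next * f) / (2.0 * np.cosh(beta * f)))

# sanity check: summing over the 2^N next states gives probability one
rng = np.random.default_rng(0)
N = 3
J = rng.normal(size=(N, N))          # asymmetric couplings are allowed here
h = rng.normal(size=N)
s = rng.choice([-1.0, 1.0], size=N)
total = sum(transition_prob(np.array(c), s, J, h)
            for c in product([-1.0, 1.0], repeat=N))
```

Because this transition probability is normalized configuration by configuration, the partition function never appears, which is what makes the kinetic max-likelihood tractable.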

SLIDE 34

EXTENSION ?

The PLM relies on the evaluation of the one-point marginals ; why not use two points or more ? “Composite Likelihood Estimation for Restricted Boltzmann Machines” by Yasuda et al. Define

$\mathcal{PL}_k = \frac{1}{\#\,k\text{-tuples}} \sum_{k\text{-tuples } c} \ \sum_{\mathrm{data}} \log p\big( s_c^{(\mathrm{data})} \mid s_{\setminus c}^{(\mathrm{data})} \big)$

They show that $\mathcal{PL}_1 \le \mathcal{PL}_2 \le \cdots \le \mathcal{PL}_k \le \cdots \le \mathcal{PL}_N$ = true likelihood !

SLIDE 35

EXTENSION : THREE-BODY INTERACTIONS

The maximum likelihood can be seen as a maximum-entropy problem where we would like to fit the 2-point correlations and the local biases !

$H = -\sum_{i<j} J_{ij} s_i s_j - \sum_i h_i s_i$

There are already a lot of parameters, $O(N^2)$. What if the system « could » have n-body interactions ?

$H = -\sum_{i<j} J_{ij} s_i s_j - \sum_i h_i s_i - \sum_{i<j<k} J_{ijk} s_i s_j s_k - \cdots$
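A small sketch of the extended energy function (my own illustration; the 3-body couplings are stored as a sparse dict, and the sign convention matches the pairwise case $H = -\sum J_{ij} s_i s_j - \sum h_i s_i$):

```python
import numpy as np
from itertools import combinations

def energy(s, J, h, J3=None):
    """H = -sum_{i<j} J_ij s_i s_j - sum_i h_i s_i - sum_{i<j<k} J_ijk s_i s_j s_k.
    J3 is a sparse dict {(i, j, k): coupling} with i < j < k."""
    N = len(s)
    E = -sum(J[i, j] * s[i] * s[j] for i, j in combinations(range(N), 2))
    E -= h @ s
    for (i, j, k), c in (J3 or {}).items():
        E -= c * s[i] * s[j] * s[k]
    return E

# toy check: one pairwise bond, then the same system plus one triplet
s = np.array([1.0, 1.0, -1.0])
J = np.zeros((3, 3)); J[0, 1] = 1.0
E_pair = energy(s, J, np.zeros(3))                     # -1.0
E_trip = energy(s, J, np.zeros(3), {(0, 1, 2): 2.0})   # -1.0 + 2.0 = 1.0
```

The dict representation keeps the parameter count manageable when only a few of the $O(N^3)$ possible triplets are actually present, which is the regime the next slides probe.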

SLIDE 36

EXTENSION : THREE-BODY INTERACTIONS

We need to find an indicator that there could be new interactions. Let's consider the following experiment :

  • Take a system S1 : 2D ferro without field
  • Take a system S2 : 2D ferro without field but with some 3-body interactions
  • Make the inference on the two models, with a pairwise model and with a model that includes 3-body interactions

SLIDE 37

EXTENSION : THREE-BODY INTERACTIONS

LEFT : S1 (whatever model I use for inference). RIGHT : S2 when doing inference with the wrong model. Error on the correlation matrix.
SLIDE 38

EXTENSION : THREE-BODY INTERACTIONS

Take the errors on the 3-point correlation functions and plot them in decreasing order ! Can you guess how many three-body interactions there are ?

SLIDE 39

EXTENSION : THREE-BODY INTERACTIONS

  • Wrong model – histogram of the error on the 3-point correlations
  • Correct model – histogram of the error on the 3-point correlations

SLIDE 40

SUMMARY - CONCLUSION

  • Beyond-MF methods : perform much better on non-trivial topologies (or in the strong-coupling regime)
  • Recovering the exact or approximate structure (by decimation), without the need to fix parameters
  • Detecting many-body interactions inside high-order correlations, « generalizing » max-ent

As seen : PLM can be extended to become better and better, at the cost of complexity !