BEYOND MEAN-FIELD APPROXIMATION
AURÉLIEN DECELLE
LABORATOIRE DE RECHERCHE EN INFORMATIQUE, UNIVERSITÉ PARIS SUD
MOTIVATIONS
Why inverse problems?
- In Machine Learning → online recognition tasks
- In Physics → understanding a physical system from observations
- In social science → getting insight into latent properties
Direct problems are already hard: understanding equilibrium properties can be (very) challenging (e.g. spin glasses). Inverse problems can be harder: ideally, maximizing the likelihood would require computing the partition function many times. In particular, serious problems can appear if …
Depending on the system, different optimization schemes can be adopted.
MF maps the distribution of the data onto a particular form of probability distribution:
$$\min_{\theta} \; \mathrm{KL}\big(p_{\mathrm{data}} \,\big\|\, p_{\mathrm{target}}(\theta)\big), \qquad q(\underline{s}) = \prod_{(ij)} \frac{q_{ij}(s_i, s_j)}{q_i(s_i)\, q_j(s_j)} \prod_i q_i(s_i)$$
What about when the system cannot be described by this particular form of distribution?
⊕ How to put in prior information?
- Pseudo-likelihood: but it overfits
- Max likelihood: but it overfits and can be very slow
- Adaptive cluster expansion: but it is hard to write it …
- Contrastive divergence: it overfits, and can be bad if convergence is very slow!
- Minimum Probabilistic Flow: but it probably does not work well for small sampling.
We consider the following problem: a system of discrete variables $s_i = 1, \dots, q$ (OK, let's say $s_i = \pm 1$ in the following), with
$$\mathcal{H} = \sum_{\langle i,j \rangle} J_{ij} s_i s_j + \sum_i h_i s_i, \qquad p(\underline{s}) = \frac{e^{-\beta \mathcal{H}(\underline{s})}}{Z}$$
Then a set of configurations is collected: $\{\underline{s}^{(a)}\}_{a=1,\dots,M}$. Using them, it is possible to compute the likelihood. Reconstruction error:
$$\varepsilon^2 = \frac{\sum_{i<j} \big(J_{ij} - J^*_{ij}\big)^2}{\sum_{i<j} J_{ij}^2}$$
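A minimal sketch of this setup (mine, not from the slides): a Metropolis sampler collecting the M equilibrium configurations, plus the reconstruction error. The Hamiltonian sign convention follows the one above; `n_sweeps` and the helper names are arbitrary choices.

```python
import numpy as np

def energy(s, J, h):
    """H = sum_{i<j} J_ij s_i s_j + sum_i h_i s_i (J symmetric, zero diagonal)."""
    return 0.5 * s @ J @ s + h @ s

def metropolis_samples(J, h, beta, n_samples, n_sweeps=50, rng=None):
    """Collect n_samples approximately equilibrated configurations in {-1,+1}^N."""
    rng = np.random.default_rng(rng)
    N = len(h)
    s = rng.choice([-1, 1], size=N)
    samples = []
    for _ in range(n_samples):
        for _ in range(n_sweeps * N):
            i = rng.integers(N)
            dE = -2 * s[i] * (J[i] @ s + h[i])   # energy change when flipping s_i
            if dE <= 0 or rng.random() < np.exp(-beta * dE):
                s[i] = -s[i]
        samples.append(s.copy())
    return np.array(samples)

def reconstruction_error(J_inferred, J_true):
    """epsilon^2 = sum (J - J*)^2 / sum J^2, over the off-diagonal couplings."""
    iu = np.triu_indices_from(J_true, k=1)
    return np.sum((J_inferred[iu] - J_true[iu])**2) / np.sum(J_true[iu]**2)
```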
The likelihood function
Probability of observing the configurations: $\prod_a e^{-\beta \mathcal{H}(\underline{s}^{(a)})} / Z$. Define the log-likelihood $\mathcal{L} = \sum_a \big(-\beta \mathcal{H}(\underline{s}^{(a)})\big) - M \log Z$.
Problem of maximization … how to compute average values efficiently?
$$\frac{\partial \mathcal{L}}{\partial J_{ij}} \propto \langle s_i s_j \rangle_{\mathrm{data}} - \langle s_i s_j \rangle_{\mathrm{model}}$$
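A sketch of why this is expensive (my illustration, not the speaker's code): every gradient step must re-estimate the model correlations, e.g. with the Metropolis sampler from the previous snippet. With $\mathcal{H} = +\sum J_{ij} s_i s_j$ the ascent direction picks up a minus sign.

```python
import numpy as np

def likelihood_gradient_step(J, h, beta, data, lr=0.01, n_model=1000):
    """One Boltzmann-learning step: match <s_i s_j>_model to <s_i s_j>_data."""
    corr_data = data.T @ data / len(data)            # <s_i s_j>_data
    model = metropolis_samples(J, h, beta, n_model)  # re-sample the model (slow!)
    corr_model = model.T @ model / len(model)        # <s_i s_j>_model
    # dL/dJ_ij = -beta*M*(<ss>_data - <ss>_model) under H = +sum J s s,
    # so ascent on L moves J against the moment mismatch
    J = J - lr * (corr_data - corr_model)
    np.fill_diagonal(J, 0.0)
    return J
```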
Goal: find a function that can be maximized and that would infer the Js correctly.
$$p(\underline{s}) = p(s_i \mid \underline{s}_{\setminus i}) \, p(\underline{s}_{\setminus i})$$
$$p(s_i \mid \underline{s}_{\setminus i}) = \frac{e^{-\beta s_i \left(\sum_k J_{ik} s_k + h_i\right)}}{2 \cosh\!\big(\beta \left(\sum_k J_{ik} s_k + h_i\right)\big)}$$
This can be maximized! Ekeberg et al.: protein folding. ???: training RBMs.
Can we have theoretical insight? Yes: for infinite Gibbs sampling, the maximum is correct! Consider
$$\mathcal{PL}_i = \sum_a \log p\big(s_i^{(a)} \mid \underline{s}_{\setminus i}^{(a)}\big)$$
and replace the distribution over the data by the Boltzmann distribution of a generating Hamiltonian $\mathcal{H}_G$:
$$\mathcal{PL}_i = \sum_{C} \frac{e^{-\beta \mathcal{H}_G(\underline{s}^{C})}}{Z_G} \log p\big(s_i^{C} \mid \underline{s}_{\setminus i}^{C}\big)$$
The maximum is reached when the couplings of $\mathcal{H}_G$ and $\mathcal{H}$ are equal.
When no hidden variables are present, the PL is convex! Therefore only one maximum exists! The PL can be optimized without too much trouble using, for instance, … And the complexity goes as $O(N^2 M)$. Let's understand how this works and how it compares with MF.
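For instance, each site can be fitted independently with an off-the-shelf convex optimizer. A sketch under my own choices (SciPy's L-BFGS-B, final symmetrization): one objective evaluation costs O(NM) per site, hence O(N²M) overall, matching the complexity quoted above.

```python
import numpy as np
from scipy.optimize import minimize

def plm_fit(data, beta=1.0):
    """Maximize PL_i row by row; data is an (M, N) array of +-1 spins."""
    M, N = data.shape
    J, h = np.zeros((N, N)), np.zeros(N)
    for i in range(N):
        s_i = data[:, i]
        rest = np.delete(data, i, axis=1)         # all spins except i
        def neg_pl(w):                            # w = (couplings of site i, field h_i)
            x = beta * s_i * (rest @ w[:-1] + w[-1])
            return np.logaddexp(0.0, 2.0 * x).sum()   # -log p, convex in w
        w = minimize(neg_pl, np.zeros(N), method="L-BFGS-B").x
        J[i, np.arange(N) != i] = w[:-1]
        h[i] = w[-1]
    return (J + J.T) / 2, h                       # average the two estimates of J_ij
```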
Take a set of M equilibrium configurations $\underline{s}^{(a)}$, $a = 1, \dots, M$. On one side we use the MF equations:
$$J_{ij} = -\big(C^{-1}\big)_{ij}, \qquad m_i = \tanh\Big(\sum_j J_{ij} m_j + h_i\Big) \quad \forall i$$
On the other side we maximize the pseudo-likelihood, i.e. we minimize for each $i$
$$-\mathcal{PL}_i = \sum_a \log\Big(1 + e^{\,2 \beta s_i^{(a)} \sum_k J_{ik} s_k^{(a)}}\Big) \quad \forall i$$
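The MF side of this comparison is a single matrix inversion. A sketch, where $C$ is the connected correlation matrix estimated from the same M configurations:

```python
import numpy as np

def mean_field_couplings(data):
    """Naive mean-field / linear-response inversion: J = -C^{-1} (off-diagonal)."""
    m = data.mean(axis=0)                            # magnetizations m_i
    C = data.T @ data / len(data) - np.outer(m, m)   # C_ij = <s_i s_j> - m_i m_j
    J = -np.linalg.inv(C)                            # J_ij = -(C^-1)_ij, as above
    np.fill_diagonal(J, 0.0)                         # the diagonal is not a coupling
    return J
```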
Curie–Weiss: $J_{ij} = -1/N$, with N = 100 spins. Hopfield: $J_{ij} = \sum_a \xi_i^a \xi_j^a$, with N = 100 spins, two patterns, and M = 100k samples.
SK model: N = 64, with M = 10⁶, 10⁷, 10⁸. 2D ferromagnetic model: $J_{ij} = -1$, N = 49, with M = 10⁴, 10⁵, 10⁶.
How is the L1-norm included in PLM?
$$\mathcal{PL}_i = -\sum_a \log\Big(1 + e^{\,2 \beta s_i^{(a)} \sum_k J_{ik} s_k^{(a)}}\Big) - \lambda \sum_k |J_{ik}| \quad \forall i$$
This leads to sparse solutions … but how to fix $\lambda$?
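One way to minimize this l1-penalized objective for a single site is a proximal-gradient (ISTA) loop; this is a sketch under my own assumptions (step size, iteration count, field omitted), with `lam` playing the role of $\lambda$ above.

```python
import numpy as np
from scipy.special import expit   # numerically stable logistic sigmoid

def plm_l1_site(data, i, beta=1.0, lam=0.01, lr=0.05, n_iter=2000):
    """l1-regularized PLM for the couplings of one site i (field omitted)."""
    s_i, rest = data[:, i], np.delete(data, i, axis=1)
    w = np.zeros(rest.shape[1])
    for _ in range(n_iter):
        x = beta * s_i * (rest @ w)
        # gradient of (1/M) * sum_a log(1 + e^{2 x_a}) with respect to w
        grad = rest.T @ (2.0 * beta * s_i * expit(2.0 * x)) / len(data)
        w -= lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-threshold
    return w   # sparse estimate of the couplings of site i
```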
Progressively decimate the parameters with a small absolute value. Not new: …
Joint work with F. Ricci-Tersenghi.
Given a set of equilibrium configurations, start with all parameters unfixed; then repeatedly maximize the PL and fix to zero the couplings with the smallest absolute values.
[Figure: random graph with 16 nodes — the difference increases, then decreases]
[Figure: 2D ferro model, M = 4500, β = 0.8; axes: # true negatives vs # true positives — my objective!]
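A hedged sketch of the decimation loop just described. Here `plm_fit_masked` is a hypothetical variant of `plm_fit` above that only optimizes the couplings flagged as active; the fraction removed per step is an arbitrary choice.

```python
import numpy as np

def plm_decimate(data, beta=1.0, frac_per_step=0.05, n_steps=10):
    """Alternate PLM fits with fixing the weakest couplings to zero."""
    N = data.shape[1]
    active = np.ones((N, N), dtype=bool)       # which couplings are still free
    np.fill_diagonal(active, False)
    for _ in range(n_steps):
        # hypothetical masked fit: decimated couplings stay pinned at zero
        J, h = plm_fit_masked(data, beta, active)
        cut = np.quantile(np.abs(J[active]), frac_per_step)
        active &= np.abs(J) > cut              # decimate the weakest fraction
    return J, h
```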
The method can be adapted to the max-likelihood of the parallel dynamics (A.D. and P. Zhang):
$$p\big(\underline{s}(t+1) \mid \underline{s}(t)\big) = \prod_i \frac{e^{-\beta s_i(t+1) \left(\sum_k J_{ik} s_k(t) + h_i\right)}}{2 \cosh\!\big(\beta \left(\sum_k J_{ik} s_k(t) + h_i\right)\big)}$$
It has been applied to « detection of cheating by decimation algorithm » (Shogo Yamanaka, Masayuki Ohzeki, A.D.).
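In this dynamical setting each site is effectively a logistic regression of $s_i(t+1)$ on the previous configuration $\underline{s}(t)$; here is a minimal sketch (J need not be symmetric here, so no symmetrization is applied, and the self-coupling is kept).

```python
import numpy as np
from scipy.optimize import minimize

def kinetic_plm(traj, beta=1.0):
    """traj: (T+1, N) array of +-1 spins generated by the parallel dynamics."""
    prev, nxt = traj[:-1], traj[1:]
    N = traj.shape[1]
    J, h = np.zeros((N, N)), np.zeros(N)
    for i in range(N):
        def neg_log_lik(w):                    # w = (row i of J, field h_i)
            # p(s_i(t+1)|s(t)) = e^{-beta s_i(t+1) theta_i(t)} / 2cosh(beta theta_i(t))
            x = beta * nxt[:, i] * (prev @ w[:-1] + w[-1])
            return np.logaddexp(0.0, 2.0 * x).sum()
        w = minimize(neg_log_lik, np.zeros(N + 1), method="L-BFGS-B").x
        J[i], h[i] = w[:-1], w[-1]
    return J, h
```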
The PLM relies on the evaluation of the one-point conditionals; why not use two points or more? ("Composite Likelihood Estimation for Restricted Boltzmann Machines" by Yasuda et al.) Define
$$\mathcal{PL}_k = \frac{1}{\#\,k\text{-tuples}} \sum_{k\text{-tuples}\; c} \sum_{\mathrm{data}} \log p\big(\underline{s}_c^{(\mathrm{data})} \mid \underline{s}_{\setminus c}^{(\mathrm{data})}\big)$$
They show that $\mathcal{PL}_1 \le \mathcal{PL}_2 \le \dots \le \mathcal{PL}_k \le \dots \le \mathcal{PL}_N$ = the true likelihood!
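As an illustration of the k = 2 term (my sketch, not from Yasuda et al.): the conditional of a pair (i, j) given the rest is normalized over its four joint states instead of two, which is where both the extra accuracy and the extra cost come from.

```python
import numpy as np
from itertools import product

def pair_conditional_loglik(J, h, beta, data, i, j):
    """Sum over samples of log p(s_i, s_j | all other spins), Ising model above."""
    total = 0.0
    for s in data:
        # local fields from the *other* spins (independent of s_i, s_j)
        theta_i = J[i] @ s - J[i, i] * s[i] - J[i, j] * s[j] + h[i]
        theta_j = J[j] @ s - J[j, j] * s[j] - J[j, i] * s[i] + h[j]
        def neg_beta_e(si, sj):                # -beta * (terms of H touching i or j)
            return -beta * (si * theta_i + sj * theta_j + J[i, j] * si * sj)
        log_z = np.logaddexp.reduce(           # normalize over the 4 joint states
            [neg_beta_e(si, sj) for si, sj in product((-1, 1), repeat=2)])
        total += neg_beta_e(s[i], s[j]) - log_z
    return total
```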
The maximum likelihood can be seen as a maximum-entropy problem where we would like to fit the 2-point correlations and the local biases:
$$\mathcal{H} = \sum_{i<j} J_{ij} s_i s_j + \sum_i h_i s_i$$
There are already a lot of parameters, $O(N^2)$. What if the system « could » have n-body interactions?
$$\mathcal{H} = \sum_{i<j} J_{ij} s_i s_j + \sum_i h_i s_i + \sum_{i<j<l} J_{ijl} s_i s_j s_l + \cdots$$
We need to find an indicator that there could be new interactions. Let's consider the following experiment: a model with 3-body interactions included.
[Figure — LEFT: S1 (whatever model is used for inference); RIGHT: S2 when doing inference with the wrong model. Error on the correlations.]
Take the errors on the 3-point correlation functions and plot them in decreasing order! Can you guess how many three-body interactions there are?
[Figure: histograms of the error on the 3-point correlations]
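A sketch of how this indicator could be computed, assuming fresh samples drawn from the inferred pairwise model (e.g. via the Metropolis sketch earlier): a visible gap in the sorted errors hints at how many 3-body terms are missing.

```python
import numpy as np
from itertools import combinations

def three_point_errors(data, model_samples):
    """|<s_i s_j s_k>_data - <s_i s_j s_k>_model| for all triplets, sorted."""
    N = data.shape[1]
    errs = [abs(np.mean(data[:, i] * data[:, j] * data[:, k]) -
                np.mean(model_samples[:, i] * model_samples[:, j] * model_samples[:, k]))
            for i, j, k in combinations(range(N), 3)]
    return np.sort(errs)[::-1]                 # decreasing order, as on the slide
```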
… (or strong-coupling regime)
… (without the need of fixing parameters)
« Generalizing » max-ent: as seen, the PLM can be extended to become better and better, at the cost of complexity!