On convergence of Approximate Message Passing - Francesco Caltagirone - PowerPoint PPT Presentation



SLIDE 1

On convergence of Approximate Message Passing

Francesco Caltagirone(1), Florent Krzakala(2) and Lenka Zdeborová(1)

(1) Institut de Physique Théorique, CEA Saclay (2) LPS, Ecole Normale Supérieure, Paris

SLIDE 2

Compressed Sensing

y = Fx + ξ

ρ = K/N,   α = M/N,   ⟨ξ²⟩ = Δ

F is an M×N random matrix with i.i.d. elements. The signal x is an N-component vector, of which only K < N components are non-zero. The measurement y is an M-component vector, with M < N, and ξ is white noise with variance Δ.

[Figure: the linear system y = Fx + ξ; y and F are GIVEN, x is to be RECONSTRUCTED.]
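To make the setup concrete, here is a minimal sketch (not from the slides; all names and parameter values are illustrative) that generates such an instance with a sparse Gauss-Bernoulli signal:

```python
import numpy as np

def cs_instance(N=200, alpha=0.5, rho=0.1, Delta=1e-4, seed=0):
    """Generate a compressed-sensing instance y = F x + xi.

    F: M x N matrix with i.i.d. entries of variance 1/N (M = alpha*N);
    x: N-component signal with ~rho*N non-zero Gaussian entries;
    xi: white noise with variance Delta.
    """
    rng = np.random.default_rng(seed)
    M = int(alpha * N)
    F = rng.normal(0.0, 1.0 / np.sqrt(N), size=(M, N))
    support = rng.random(N) < rho               # K < N non-zero components
    x = np.where(support, rng.normal(0.0, 1.0, N), 0.0)
    xi = rng.normal(0.0, np.sqrt(Delta), M)     # white measurement noise
    y = F @ x + xi
    return F, x, y

F, x, y = cs_instance()
print(F.shape, np.count_nonzero(x))
```

Given F and y (plus the statistics of x and ξ), the task is to reconstruct x.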

SLIDE 3

Standard techniques

Minimization of a norm under the linear constraint Fx = y.

ℓ0 minimization:

min_x ||x||₀  with  Fx = y,   ||x||₀ = number of non-zero elements

A non-convex norm, exponentially hard to minimize.

ℓ1 minimization (Candès, Tao, Donoho):

||x||₁ = Σ_{i=1}^{N} |x_i|

A convex norm, easy to minimize. The ℓ1 norm well approximates the ℓ0 norm.
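The ℓ1 minimization can be tried with an off-the-shelf linear-programming solver by splitting x into positive and negative parts. A sketch (my own illustration; the sizes, seed and use of scipy.optimize.linprog are assumptions, not from the slides):

```python
import numpy as np
from scipy.optimize import linprog

def l1_reconstruct(F, y):
    """Basis pursuit: min ||x||_1 subject to F x = y, cast as an LP.

    With x = u - v, u >= 0, v >= 0, we have ||x||_1 = sum(u + v)."""
    M, N = F.shape
    c = np.ones(2 * N)               # objective: sum(u) + sum(v)
    A_eq = np.hstack([F, -F])        # constraint: F u - F v = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None))
    return res.x[:N] - res.x[N:]

# Noiseless instance well inside the l1 success region of the phase diagram
rng = np.random.default_rng(1)
N, M, K = 100, 50, 5
F = rng.normal(0.0, 1.0 / np.sqrt(N), (M, N))
x = np.zeros(N)
x[rng.choice(N, K, replace=False)] = rng.normal(0.0, 1.0, K)
x_hat = l1_reconstruct(F, F @ x)
print(np.max(np.abs(x_hat - x)))
```

With K/M well below the Donoho-Tanner threshold, the LP recovers x with high probability.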

SLIDE 4

The Donoho-Tanner line

[Figure: the linear system y = Fx + ξ, with K = ρN and M = αN.]

α = 1: square matrix, we can invert it. α < 1: rectangular matrix, under-determined. α = ρ: information-theoretical limit. Let us consider the noiseless case with a measurement matrix with i.i.d. elements distributed according to a Gaussian with zero mean and a variance of order 1/N.

[Figure: phase diagram in the (ρ0, α) plane: the Donoho-Tanner ℓ1-minimization line α_ℓ1(ρ0), the Bayesian AMP line α_EM-BP(ρ0), s-BP results for N = 10³ and N = 10⁴, and the information-theoretical limit α = ρ0.]

SLIDE 5

Bayesian setting. GOAL: reconstruct the signal, given the measurement vector, the measurement matrix, and prior knowledge of the (sparse) distribution of the signal elements.

Approximate Message Passing

Donoho, Maleki, Montanari (2009)

Powerful algorithm. Convergence issues.

Setting and motivation

SLIDE 6

Setting and motivation

F_{μi} = γ/N + N(0,1)/√N,   P(x) = (1 − ρ)δ(x) + ρN(0,1)

This is the simplest case in which Approximate Message Passing (AMP) has convergence problems: if the mean γ is sufficiently large, AMP displays violent divergences. Divergences of this kind are observed in many other cases and are the main obstacle to a wider use of AMP.

y = Fx + ξ

In this simple case there are workarounds that ensure convergence, such as a "mean-removal" procedure. BUT the case is interesting because we want to understand the origin of the non-convergence which, we argue, is of the same nature in more complicated settings.
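The non-zero-mean ensemble above is straightforward to sample; a quick sketch (function and parameter names are mine) that also checks that the empirical mean of the entries is close to γ/N:

```python
import numpy as np

def nonzero_mean_matrix(M, N, gamma, seed=0):
    """F_{mu i} = gamma/N + N(0,1)/sqrt(N): i.i.d. entries with
    mean gamma/N and variance 1/N."""
    rng = np.random.default_rng(seed)
    return gamma / N + rng.normal(0.0, 1.0, size=(M, N)) / np.sqrt(N)

F = nonzero_mean_matrix(M=300, N=1000, gamma=3.0)
print(1000 * F.mean())   # close to gamma = 3
```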

SLIDE 7

Bayesian Inference with Belief Propagation

Bayes formula:

P(x|F, y) = P(x|F) P(y|F, x) / P(y|F)

Conditional probability of the measurement vector:

P(y|F, x) = ∏_{μ=1}^{M} (2πΔ_μ)^{-1/2} exp[ −(y_μ − Σ_{i=1}^{N} F_{μi} x_i)² / (2Δ_μ) ]

Posterior distribution:

P(x|F, y) = (1/Z(y, F)) ∏_{i=1}^{N} [(1 − ρ)δ(x_i) + ρφ(x_i)] ∏_{μ=1}^{M} (2πΔ_μ)^{-1/2} exp[ −(y_μ − Σ_{i=1}^{N} F_{μi} x_i)² / (2Δ_μ) ]

MMSE estimator (marginal means):

x*_i = ∫ dx_i x_i ν_i(x_i),   ν_i(x_i) ≡ ∫_{{x_j}_{j≠i}} P(x|F, y),   E = Σ_{i=1}^{N} (x_i − s_i)² / N

Computing the marginals exactly takes an exponential time, unfeasible.

SLIDE 8

Bayes optimal setting

If we know exactly the prior distribution of the signal elements and of the noise, we are in the so-called BAYES OPTIMAL setting. In the following we will consider that this is the case. When it is not, the prior can be learned efficiently by adding a step to the algorithm that I will present. (I will not talk about this.)

SLIDE 9

Belief Propagation (Cavity method)

[Figure: factor graph with messages m_{i→μ} and m_{μ→i} between the variables x_i and the factors given by the rows of F, plus prior nodes P(x_i).]

  • Two kinds of nodes: factors (matrix rows) and variables (signal elements).
  • Belief propagation works for locally tree-like graphs or for densely and weakly connected graphs.
  • Messages represent an approximation to the marginal distribution of a variable.
  • We can introduce a third kind of node: the prior distribution on the signal elements, acting as a local field.
  • Messages are updated according to a sequential or parallel schedule until convergence (a fixed point).

SLIDE 10

Belief Propagation, r-BP and AMP

BP r-BP AMP

O(N 2) O(N 2)

continuous messages

mi→µ

m

µ → i P(x)

x

F

(xi)

numbers projection

  • perations

dense matrix, TAP For the last step one assumes parallel update

O(N 2)

In this case, fast matrix multiplication algorithms can be applied, reducing the complexity to

Nlog(N)

Donoho, Maleki, Montanari (2009) Krzakala et al. (2012)

SLIDE 11

AMP Algorithm

V^{t+1}_μ = Σ_i F²_{μi} v^t_i                                                       (1)

ω^{t+1}_μ = Σ_i F_{μi} a^t_i − [(y_μ − ω^t_μ)/(Δ + V^t_μ)] Σ_i F²_{μi} v^t_i         (2)

(Σ^{t+1}_i)² = [ Σ_μ F²_{μi}/(Δ + V^{t+1}_μ) ]^{−1}                                  (3)

R^{t+1}_i = a^t_i + [Σ_μ F_{μi}(y_μ − ω^{t+1}_μ)/(Δ + V^{t+1}_μ)] / [Σ_μ F²_{μi}/(Δ + V^{t+1}_μ)]   (4)

a^{t+1}_i = f1((Σ^{t+1}_i)², R^{t+1}_i)                                              (5)

v^{t+1}_i = f2((Σ^{t+1}_i)², R^{t+1}_i)                                              (6)

f_k(Σ², R) is the k-th connected cumulant with respect to the measure

Q(x) = (1/Z(Σ², R)) P(x) e^{−(x−R)²/(2Σ²)} / √(2πΣ²)

so a_i and v_i are the AMP estimates of the mean and the variance of the i-th signal component. The performance of the algorithm can be evaluated through

E^t = (1/N) Σ_{i=1}^{N} (s_i − a^t_i)²,   V^t = (1/N) Σ_{i=1}^{N} v_i
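Equations (1)-(6) translate almost line by line into NumPy. The sketch below assumes the Gauss-Bernoulli prior of the previous slides, for which f1 and f2 (the posterior mean and variance under the measure Q) have a simple closed form; the code and all names are my illustration, not the authors' implementation:

```python
import numpy as np

def f1_f2(Sigma2, R, rho):
    """Mean (f1) and variance (f2) of Q(x) ~ P(x) exp(-(x-R)^2 / (2 Sigma2))
    for the prior P(x) = (1-rho) delta(x) + rho N(0,1)."""
    # log posterior odds of the Gaussian component vs the zero component
    log_w = (np.log(rho) - 0.5 * np.log(1 + Sigma2) - R**2 / (2 * (1 + Sigma2))) \
          - (np.log(1 - rho) - 0.5 * np.log(Sigma2) - R**2 / (2 * Sigma2))
    w = 1.0 / (1.0 + np.exp(-log_w))      # posterior weight of the Gaussian part
    m = R / (1 + Sigma2)                  # its mean
    s2 = Sigma2 / (1 + Sigma2)            # its variance
    a = w * m
    return a, w * (s2 + m**2) - a**2

def amp(F, y, rho, Delta, iters=100):
    """AMP iterations (1)-(6) with parallel update."""
    M, N = F.shape
    F2 = F**2
    a, v = np.zeros(N), rho * np.ones(N)
    omega, V = y.copy(), np.ones(M)
    for _ in range(iters):
        V_new = F2 @ v                                         # (1)
        omega = F @ a - (y - omega) / (Delta + V) * V_new      # (2)
        V = V_new
        Sigma2 = 1.0 / (F2.T @ (1.0 / (Delta + V)))            # (3)
        R = a + Sigma2 * (F.T @ ((y - omega) / (Delta + V)))   # (4)
        a, v = f1_f2(Sigma2, R, rho)                           # (5), (6)
    return a, v

# Zero-mean instance, where AMP is expected to converge
rng = np.random.default_rng(0)
N, alpha, rho, Delta = 1000, 0.6, 0.1, 1e-6
M = int(alpha * N)
F = rng.normal(0.0, 1.0 / np.sqrt(N), (M, N))
x = np.where(rng.random(N) < rho, rng.normal(0.0, 1.0, N), 0.0)
y = F @ x + rng.normal(0.0, np.sqrt(Delta), M)
a, v = amp(F, y, rho, Delta)
print(np.mean((a - x)**2))   # the error E^t
```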

SLIDE 12

AMP Algorithm

Equations (1)-(6) and the error measures E^t and V^t, as on the previous slide.

The AMP algorithm does NOT depend explicitly on the value of the mean of the matrix elements:

F_{μi} = γ/N + N(0,1)/√N

SLIDE 13

Convergence

Given a certain (sufficiently high) measurement ratio, with very small or zero noise, in the Bayes optimal case.

SLIDE 14

State Evolution (infinite N)

Bayati, Montanari '11 (rigorous, zero-mean case); Krzakala et al. '12 (replicas, zero-mean case); Caltagirone, Krzakala, Zdeborová '14 (replicas, non-zero-mean case).

State evolution is the asymptotic analysis of the average performance of the inference algorithm when the size of the signal goes to infinity. It gives a good indication of what happens in a practical situation if the size of the signal is sufficiently large. It can be obtained rigorously in simple cases, and non-rigorously with the replica method in more involved cases.

SLIDE 15

State Evolution (infinite N)

V^{t+1} = ∫ ds P(s) ∫ Dz f2( (Δ + V^t)/α , s + zA(E^t, D^t) + γ²D^t )

E^{t+1} = ∫ ds P(s) ∫ Dz [ s − f1( (Δ + V^t)/α , s + zA(E^t, D^t) + γ²D^t ) ]²

D^{t+1} = ∫ ds P(s) ∫ Dz [ s − f1( (Δ + V^t)/α , s + zA(E^t, D^t) + γ²D^t ) ]

with

A(E^t, D^t) = √[ (E^t + Δ + γ²(D^t)²)/α ]

and

D^t = (1/N) Σ_j (s_j − a^t_j),   E^t = (1/N) Σ_{i=1}^{N} (s_i − a^t_i)²,   V^t = (1/N) Σ_{i=1}^{N} v_i

If the mean is zero, the density evolution does not depend on D.
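The integrals ∫ds P(s) ∫Dz above can be estimated by Monte Carlo sampling, which gives a simple numerical sketch of the state evolution (Gauss-Bernoulli prior assumed; f1, f2 are its posterior mean and variance, and all names are mine):

```python
import numpy as np

def f1_f2(Sigma2, R, rho):
    # Posterior mean/variance under P(x) = (1-rho) delta(x) + rho N(0,1)
    log_w = (np.log(rho) - 0.5 * np.log(1 + Sigma2) - R**2 / (2 * (1 + Sigma2))) \
          - (np.log(1 - rho) - 0.5 * np.log(Sigma2) - R**2 / (2 * Sigma2))
    w = 1.0 / (1.0 + np.exp(-log_w))
    m, s2 = R / (1 + Sigma2), Sigma2 / (1 + Sigma2)
    a = w * m
    return a, w * (s2 + m**2) - a**2

def se_step(V, E, D, rho, alpha, Delta, gamma, n=200_000, seed=0):
    """One Monte Carlo estimate of the state-evolution update for (V, E, D)."""
    rng = np.random.default_rng(seed)
    s = np.where(rng.random(n) < rho, rng.normal(0.0, 1.0, n), 0.0)  # s ~ P(s)
    z = rng.normal(0.0, 1.0, n)                                      # z ~ N(0,1)
    A = np.sqrt((E + Delta + gamma**2 * D**2) / alpha)
    R = s + z * A + gamma**2 * D
    a, v = f1_f2((Delta + V) / alpha, R, rho)
    return v.mean(), ((s - a)**2).mean(), (s - a).mean()

# Zero-mean case: (V, E) shrink together on the Nishimori line, D stays ~ 0
rho, alpha, Delta = 0.1, 0.5, 1e-8
V, E, D = rho, rho, 0.0
for t in range(50):
    V, E, D = se_step(V, E, D, rho, alpha, Delta, gamma=0.0, seed=t)
print(V, E, D)
```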

SLIDE 16

The Nishimori Condition

Bayes optimal setting:

E^t = V^t, D^t = 0   ⇒   E^{t+1} = V^{t+1}, D^{t+1} = 0

Therefore, analytically, if the evolution starts (exactly) on the Nishimori Line (NL) it stays on it until convergence. BUT what is the effect of small perturbations with respect to the NL?

  • Very small fluctuations due to numerical precision in the DE
  • Fluctuations due to finite size in the AMP algorithm
SLIDE 17

[Figure: V versus E trajectory for a Gaussian signal, Gaussian inference, ρ = 0.2, no spinodal.]

Zero-mean case: convergence on the NL (Bayati, Montanari).

The non-zero mean adds a third dimension, D, to the phase space!

SLIDE 18

Stability Analysis (I)

In the (D, V, E) space we introduce the variable K = V − E; on the NL, (K* = 0, D* = 0).

The NL is a "fixed line" of the evolution:

K^{t+1} = f_K(V^t, K^t, D^t),   D^{t+1} = f_D(V^t, K^t, D^t)

and we study the perturbations δK, δD around it.

SLIDE 19

Stability Analysis (II)

δK^t = K^t − K*,   δD^t = D^t − D*,   (δK^{t+1}, δD^{t+1})ᵀ = M · (δK^t, δD^t)ᵀ

We linearize the equations with

M = ( ∂_K f_K(V^t, 0, 0)   ∂_D f_K(V^t, 0, 0) )
    ( ∂_K f_D(V^t, 0, 0)   ∂_D f_D(V^t, 0, 0) )
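Given any implementation of the update map (f_K, f_D), the matrix M can be estimated by finite differences around the fixed line and its eigenvalues inspected. A generic sketch (the toy map below is purely illustrative, not the actual density-evolution map):

```python
import numpy as np

def stability_matrix(step, V, eps=1e-6):
    """Finite-difference estimate of M = d(K', D')/d(K, D) at (K, D) = (0, 0).

    `step(V, K, D)` returns (K_next, D_next); the NL is the fixed line
    step(V, 0, 0) = (0, 0)."""
    base = np.array(step(V, 0.0, 0.0))
    col_K = (np.array(step(V, eps, 0.0)) - base) / eps   # d/dK column
    col_D = (np.array(step(V, 0.0, eps)) - base) / eps   # d/dD column
    return np.column_stack([col_K, col_D])

# Toy update map with made-up eigenvalues, just to exercise the machinery
def toy_step(V, K, D):
    return 0.8 * K, 1.2 * D   # lambda_K = 0.8 (stable), lambda_D = 1.2 (unstable)

M = stability_matrix(toy_step, V=0.1)
print(np.abs(np.linalg.eigvals(M)))   # |lambda| > 1 signals instability of the NL
```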

SLIDE 20

Stability Analysis (II)

δK^t = K^t − K*,   δD^t = D^t − D*,   (δK^{t+1}, δD^{t+1})ᵀ = M · (δK^t, δD^t)ᵀ

We linearize the equations with M as before. When the signal is Gauss-Bernoulli with zero mean, the off-diagonal terms vanish:

M = diag( ∂_K f_K(V^t, 0, 0),  ∂_D f_D(V^t, 0, 0) )

SLIDE 21

Stability Analysis (II)

∂_D f_D(V^t) = −[αγ²/(Δ + V^t)] ∫ ds P(s) ∫ Dz f2(A², s + zA) = −αγ² V^{t+1}/(Δ + V^t)

∂_K f_K(V^t) = −(1/2) [1/(Δ + V^t)] ∫ ds P(s) ∫ Dz { f4(A², s + zA) + 2 [f2(A², s + zA)]² + 2 [f1(A², s + zA) − s] f3(A², s + zA) }

[Figure: log10 V versus D trajectories for γ = 1.9, 2.5, 2.9, 3.6, at ρ = 0.1, α = 0.3, Δ = 10⁻¹⁰.]

Behaviour of the eigenvalues λ_K, λ_D of M:

  • γ < γc(1): the eigenvalue is always less than 1 in modulus.
  • γc(1) < γ < γc(2): the eigenvalue becomes larger than 1 in a limited region.
  • γ > γc(2): the eigenvalue is larger than 1 in modulus down to the fixed point.

SLIDE 22

Density Evolution and AMP

[Figure: the critical values γc(1) and γc(2) as functions of the undersampling rate α. Inset: convergence rate R of the AMP algorithm versus γ for N = 1000, 4000, 16000.]

For zero measurement noise both critical values do NOT depend on the undersampling rate α. For weak noise, only the second critical value has a very weak dependence on both α and Δ.

[Inset] Convergence rate of the AMP algorithm for different signal sizes. The transition becomes sharper and sharper as N → ∞; it is expected to move towards the second critical value and to behave similarly to the density evolution.

SLIDE 23

SwAMP algorithm, a possible solution

Manoel, Krzakala, Tramel, Zdeborová (2014)

With random sequential update, the convergence problems disappear.

SLIDE 24

SwAMP algorithm, a possible solution

A very effective solution that works well in many interesting cases! It loses, however, the property of involving only matrix multiplications.

arXiv:1401.6384, arXiv:1406.4311 (Caltagirone, Manoel, Krzakala, Tramel, Zdeborová)

It is not a universal solution (LOW RANK!): see also Rangan, Schniter, Fletcher; Kabashima.

SLIDE 25

Conclusions and Perspectives

  • We found that the origin of the convergence problems is an instability of the Nishimori Line.
  • We provided a possible solution with the SwAMP algorithm.
  • [Perspective] Relate this kind of instability in the density evolution to the shape of the replica potential.
  • [Perspective] Perform the same kind of analysis for the case of dictionary learning.

THANK YOU!