SLIDE 1

Pooling fidelity and phase recovery Joan Bruna, Arthur Szlam, and Yann LeCun

SLIDE 2

Neural networks:

  • (Simplest version) functions of the form

$L_k \circ L_{k-1} \circ \cdots \circ L_0$,

  • each Lj is of the form

$L_j(x_{j-1}) = h(A_j x_{j-1} - b_j)$,

  • $A_j$ is a matrix and $b_j$ is a vector.
  • $h$ is an elementwise nonlinearity.
  • $A_j$ and $b_j$ are optimized for a given task, usually via gradient descent.
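
A minimal numpy sketch of this definition (the layer widths and the choice $h = \mathrm{ReLU}$ below are illustrative assumptions, not from the slides):

```python
import numpy as np

def layer(x, A, b, h):
    """One layer L_j: x -> h(A x - b)."""
    return h(A @ x - b)

relu = lambda z: np.maximum(z, 0.0)

rng = np.random.default_rng(0)
dims = [8, 16, 16, 4]  # assumed widths: input 8, two hidden layers, output 4
params = [(rng.standard_normal((m, n)), rng.standard_normal(m))
          for n, m in zip(dims[:-1], dims[1:])]

x = rng.standard_normal(dims[0])
for A, b in params:    # the composition L_k o ... o L_0
    x = layer(x, A, b, relu)
print(x.shape)         # (4,)
```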
SLIDE 3

Convolutional neural networks:

  • The input xj has a grid structure, and Aj specializes to a convolution.
  • The pointwise nonlinearity is followed by a pooling operator.
  • Pooling introduces invariance (on the grid) at the cost of lower resolution (on the grid).
SLIDE 4

Pooling in neural networks:

  • Usually block $\ell_p$:

$$[P(z)]_i = \left( |z_{i_1}|^p + |z_{i_2}|^p + \cdots + |z_{i_s}|^p \right)^{1/p} = \|z_{I_i}\|_p$$

  • In words: the $i$th coordinate of the output is the $\ell_p$ norm of the $i$th block of $z$.
  • In convolutional nets, blocks of indices are usually small spatial blocks, and $p$ is either 1, 2, or, most usually, $\infty$.
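
A small numpy illustration of block $\ell_p$ pooling (the block size and test vector are arbitrary):

```python
import numpy as np

def lp_pool(z, block_size, p):
    """Block l_p pooling: the i-th output is ||z_{I_i}||_p over disjoint
    consecutive blocks of size block_size; p=np.inf gives max pooling."""
    blocks = z.reshape(-1, block_size)   # row i is the block z_{I_i}
    return np.linalg.norm(blocks, ord=p, axis=1)

z = np.array([1.0, -3.0, 2.0, 2.0])
print(lp_pool(z, 2, 1))       # [4.     4.    ]
print(lp_pool(z, 2, 2))       # [3.1623 2.8284]
print(lp_pool(z, 2, np.inf))  # [3.     2.    ]
```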
SLIDE 5

Examples/History:

  • Hubel & Wiesel 1962

– neuroscientists’ description of lower level mammalian vision

  • Fukushima 1980: Neocognitron

– An artificial version.
– Filters hand-designed in the first versions; later versions used Hebbian learning.
– Each layer has the same architecture.

SLIDE 6
  • LeCun 1988, many others: Convolutional nets

– Filters trained discriminatively (original versions), and for reconstruction and discrimination (2005+).
– Training end to end via backpropagation (the chain rule) and stochastic gradient descent.
– Currently state of the art in object recognition, localization, and detection in images, and in various speech recognition, localization, and detection tasks.
– Mathematically poorly understood. (not controversial)
– Poorly understood. (controversial)

SLIDE 7

Some wild speculation:

  • sparse coding
  • piecewise linear maps
  • pooling is key! (sparsify/decorrelate then contract).
  • output of network should be invariant to things we don’t care about, but sensitive to things we care about.

SLIDE 8

Have results of Mallat and co-authors:

  • Convnet with $\ell_2$ pooling, specially chosen filters, and some other modifications is provably invariant to deformations while preserving signal energy.

  • what about sensitivity to things we care about?
SLIDE 9

The phase recovery problem:

  • Classical version: the goal is to find a signal $z \in \mathbb{C}^d$ given the absolute value of its (discrete) Fourier transform $|\mathcal{F}(z)|$.
  • With no additional information, this is not possible:
    1. each coefficient can be rotated independently;
    2. even if $z$ is real, the absolute value of the Fourier transform is translation invariant;
    3. in a strong sense, the majority of the information in signals we know and love is in the Fourier phase, not the magnitude, as the next two slides illustrate. (A quick check of points 1 and 2 appears below.)
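
A quick numpy check of points 1 and 2 (the signal here is arbitrary; an illustration added for concreteness):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(16)              # a real test signal

# 2. |F(z)| is invariant to circular translation:
print(np.allclose(np.abs(np.fft.fft(z)),
                  np.abs(np.fft.fft(np.roll(z, 5)))))             # True

# 1. rotating each Fourier coefficient independently preserves |F(z)|:
phases = np.exp(1j * rng.uniform(0.0, 2 * np.pi, size=16))
w = np.fft.ifft(phases * np.fft.fft(z))  # a different signal ...
print(np.allclose(np.abs(np.fft.fft(w)), np.abs(np.fft.fft(z))))  # ... same modulus
```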

SLIDE 10

[Images: a cow and a duck.]

SLIDE 11

[Images: reconstruction from the phase of the duck with the modulus of the cow, and from the phase of the cow with the modulus of the duck.]

SLIDE 12
  • Much simpler: consider a real dictionary; “phase” is the signs of the analysis coefficients.
  • If the dictionary is orthogonal, one can flip the sign of an inner product at will.
  • If the dictionary is overcomplete, interactions force rigidity:
SLIDE 13

Proposition 1 (Balan et al. 2006, 2013). Let $F = (f_1, \ldots, f_m)$ with $f_i \in \mathbb{R}^n$, set $d(x, x') = \min(\|x - x'\|, \|x + x'\|)$, and let $\lambda_-(G)$ and $\lambda_+(G)$ denote the lower and upper frame bounds of a set of vectors $G$. The mapping $M(x) = \{|\langle x, f_i \rangle|\}_{i \le m}$ satisfies

$$\forall\, x, x' \in \mathbb{R}^n, \quad A\, d(x, x') \le \|M(x) - M(x')\| \le B\, d(x, x'), \tag{1}$$

where

$$A = \min_{S \subset \{1 \ldots m\}} \sqrt{\lambda_-^2(F_S) + \lambda_-^2(F_{S^c})}, \tag{2} \qquad B = \lambda_+(F). \tag{3}$$

In particular, $M$ is injective if and only if for any subset $S \subseteq \{1, \ldots, m\}$, either $F_S$ or $F_{S^c}$ is an invertible frame.

SLIDE 14

Proof of “in particular” (assuming F is spanning):

  • Suppose that for any subset $S \subseteq \{1, \ldots, m\}$, either $F_S$ or $F_{S^c}$ is spanning. Fix $x, x' \in \mathbb{R}^n$ with $|f_i^T x| = |f_i^T x'|$ for all $i$, and let $S$ be the set

$$S = \{\, i : \operatorname{sign}(f_i^T x) = \operatorname{sign}(f_i^T x') \,\}.$$

  If $F_S$ is spanning, $x = x'$; else $F_{S^c}$ is spanning and $x = -x'$.
  • Suppose not; let $S$ be the offending set of indices. Pick $x \ne 0$ such that $F_S^T x = 0$, and $x' \ne 0$ such that $F_{S^c}^T x' = 0$. Then $x + x'$ and $x - x'$ have the same modulus.

SLIDE 15

An example for the l2 subspace case:

  • consider

$$F = \begin{pmatrix} 1 & 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & 1 & 1/\sqrt{2} \\ 1 & -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}$$

  with groups $I_1 = \{1, 2\}$, $I_2 = \{3, 4\}$, and $I_3 = \{5, 6\}$.

  • Is $\ell_2$ pooling invertible here?
SLIDE 16

(Same example as the previous slide.)

  • Is $\ell_2$ pooling invertible here?
  • No.
SLIDE 17

Proposition 2 (Casazza et al. 2013, BLS 2013). The $\ell_2$ pooling operator $P_2$ satisfies

$$\forall\, x, x', \quad A_2\, d(x, x') \le \|P_2(x) - P_2(x')\| \le B_2\, d(x, x'), \tag{4}$$

where

$$A_2 = \min_{G \in Q_2}\ \min_{S \subset \{1 \ldots m\}} \sqrt{\lambda_-^2(G_S) + \lambda_-^2(G_{S^c})}, \qquad B_2 = \lambda_+(F). \tag{5}$$

In particular, $P_2$ is injective (up to a global sign) if and only if for any subset $S \subseteq \{1, \ldots, m\}$ and all $G \in Q_2$, either $G_S$ or $G_{S^c}$ is an invertible frame. Here $Q_2$ is the set of all block orthogonal transforms applied to $F$.

SLIDE 18

Proof of “in particular” (assuming F is spanning):

  • Suppose that for any subset $S \subseteq \{1, \ldots, m\}$, either $G_S$ or $G_{S^c}$ is spanning for every $G \in Q_2$.
    – Fix $x, x' \in \mathbb{R}^n$ with $P_2(x) = P_2(x')$.
    – Choose orthonormal bases $G_k$ for the subspace spanned by $F_k$ so that the coordinates $u = G_k^T x$ and $u' = G_k^T x'$ satisfy
      ∗ $u'_i = u_i = 0$ for $i \in \{3, \ldots, d\}$,
      ∗ $u_1 = u'_1$ and $|u_2| = |u'_2|$.
    Now use the previous argument.
  • Suppose not; rotate into bad coordinates and use the previous method. $P_2$ is invariant to block rotations.

SLIDE 19

Corollary 3 (less than a week old). If $K > 2n$ and $F$ has the property that any $n$ columns are spanning, $P_2$ is invertible (in particular, for random block orthogonal $F$ with $K > 2n$).

SLIDE 20

Half rectification:

  • Let $\Omega = \Omega(F, \alpha)$ be the set of subsets $S$ of $\{1, \ldots, m\}$ such that some $x$ has $f_i^T x > \alpha$ for $i \in S$ and $f_i^T x \le \alpha$ for $i \in S^c$.

Proposition 4. Let $A_0 = \min_{S \in \Omega} \lambda_-(F_S \cup V_S)$. Then the half-rectification operator $M_\alpha(x) = \rho_\alpha(F^T x)$ is injective if and only if $A_0 > 0$. Moreover, it satisfies

$$\forall\, x, x', \quad A_0 \|x - x'\| \le \|M_\alpha(x) - M_\alpha(x')\| \le B_0 \|x - x'\|, \tag{6}$$

with $B_0 = \max_{S \in \Omega} \lambda_+(F_S) \le \lambda_+(F)$.
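
For concreteness, a tiny sketch of this measurement operator, under the reading $\rho_\alpha(t) = \max(t, \alpha)$ (a plain ReLU when $\alpha = 0$); the random frame below is an assumption, not from the slides:

```python
import numpy as np

def m_alpha(F, x, alpha=0.0):
    """Half-rectification measurements M_alpha(x) = rho_alpha(F^T x),
    taking rho_alpha(t) = max(t, alpha)."""
    return np.maximum(F.T @ x, alpha)

rng = np.random.default_rng(0)
F = rng.standard_normal((4, 12))    # columns f_i in R^4
x = rng.standard_normal(4)
y = m_alpha(F, x)                   # measurements
S = np.flatnonzero(F.T @ x > 0.0)   # the active set S from Omega(F, 0)
```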

SLIDE 21

Corollary 5. Let $d = 2$. Then the rectified $\ell_2$ pooling operator $R_2$ satisfies

$$\forall\, x, x', \quad \tilde{A}_2\, d(x, x') \le \|R_2(x) - R_2(x')\| \le B_2\, d(x, x'), \tag{7}$$

where

$$\tilde{A}_p = \inf_{x, x'}\ \min_{F' \in Q_{p,x,x'}}\ \min_{S \subset S_x \cap S_{x'}} \left( \lambda_-^2\!\big(F_{S_x \cup S_{x'} \setminus (S_x \cap S_{x'})}\big) + \lambda_-^2(F'_S) + \lambda_-^2(F'_{S^c}) \right)^{1/2}.$$

SLIDE 22
  • for $\ell_1$, $\ell_\infty$: need to replace $Q$.
  • statements are somewhat messier, and not tight.
  • for random (block orthonormal) frames with $K > 4n$, invertibility holds with probability 1.

SLIDE 23

Will now discuss some experiments. But first, need algorithms for phase recovery:

  • alternating minimization
  • phaselift [Candes et al] and phasecut [Waldspurger et al]
SLIDE 24
  • As above, denote the frame $\{f_1, \ldots, f_m\} = F$ and set $F^{(-1)}$ to be the pseudoinverse of $F$; let $F_k$ be the frame vectors in the $k$th block, with $I_k$ the indices of the $k$th block.
  • Starting with an initial signal $x_0$, update
    1. $y^{(n)}_{I_k} = (P_p(x))_k \dfrac{F_k x^{(n)}}{\|F_k x^{(n)}\|_p}, \quad k = 1 \ldots K,$
    2. $x^{(n+1)} = F^{(-1)} y^{(n)}.$
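
A minimal numpy sketch of this iteration (the frame, block structure, and fixed iteration count below are illustrative assumptions, not the authors' code):

```python
import numpy as np

def lp_pool(z, blocks, p):
    """[P_p(z)]_k = ||z_{I_k}||_p for a list of index blocks I_k."""
    return np.array([np.linalg.norm(z[I], ord=p) for I in blocks])

def alt_min(F, y, blocks, p=2, x0=None, iters=500, eps=1e-12):
    """Alternating minimization for inverting l_p pooling.
    F is n x m with columns f_i; y holds the measured pooled values;
    the estimate is defined up to a global sign."""
    F_pinv = np.linalg.pinv(F.T)          # F^(-1): pseudoinverse of the analysis map
    rng = np.random.default_rng(0)
    x = rng.standard_normal(F.shape[0]) if x0 is None else np.asarray(x0, float)
    for _ in range(iters):
        z = F.T @ x                       # step 1: rescale each block of the
        yhat = np.empty_like(z)           # current coefficients so its l_p
        for k, I in enumerate(blocks):    # norm matches the measurement
            yhat[I] = y[k] * z[I] / max(np.linalg.norm(z[I], ord=p), eps)
        x = F_pinv @ yhat                 # step 2: least-squares synthesis
    return x

# usage: random frame, blocks of size 2, l_2 pooling
rng = np.random.default_rng(1)
n, m = 8, 32
F = rng.standard_normal((n, m))
blocks = [np.arange(k, k + 2) for k in range(0, m, 2)]
x_true = rng.standard_normal(n)
y = lp_pool(F.T @ x_true, blocks, 2)
x_hat = alt_min(F, y, blocks, p=2)
err = min(np.linalg.norm(x_hat - x_true), np.linalg.norm(x_hat + x_true))
```

As the later slides note, with a random initialization this iteration is not guaranteed to reach the true signal.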
SLIDE 25
  • alternating minimization
  • phaselift [Candes et al] and phasecut [Waldspurger et al]
SLIDE 26
  • phaselift and phasecut both use the lifting trick of [Balan et al]: consider a matrix variable corresponding to $xx^*$.
  • absolute value constraints are linear when lifted.
  • many more variables.
  • ugly nonconvex rank-1 constraint; phaselift and phasecut are different relaxations of the lifted problem.
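
To make the lifting concrete: $|\langle f_i, x \rangle|^2 = \operatorname{Tr}(f_i f_i^* X)$ with $X = xx^*$, so the modulus constraints are linear in $X$. A phaselift-style sketch in the real case (assumptions: cvxpy with an SDP-capable solver such as SCS; this illustrates the relaxation, it is not the authors' implementation):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, m = 6, 24
F = rng.standard_normal((n, m))     # columns f_i
x_true = rng.standard_normal(n)
b = (F.T @ x_true) ** 2             # |<f_i, x>|^2, linear in X = x x^T

X = cp.Variable((n, n), PSD=True)   # lifted variable (many more unknowns)
constraints = [cp.trace(np.outer(F[:, i], F[:, i]) @ X) == b[i]
               for i in range(m)]
# trace minimization replaces the nonconvex rank-1 constraint
prob = cp.Problem(cp.Minimize(cp.trace(X)), constraints)
prob.solve()

# unlift: the top eigenvector of X recovers x up to sign
w, V = np.linalg.eigh(X.value)
x_hat = np.sqrt(max(w[-1], 0.0)) * V[:, -1]
```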

SLIDE 27
  • alternating minimization is not, as far as we know, guaranteed to converge to the correct solution, even when $P_p$ is invertible.
  • phasecut and phaselift are guaranteed with high probability for the (classical) phase recovery problem if we have enough (random!) measurements.
  • In practice, if the inversion is easy enough, or if $x_0$ is close to the true solution, alternating minimization can work well. Moreover,
  • alternating minimization can be run essentially unchanged for each $\ell_p$; for half rectification, we only use the nonnegative entries of $y$ for reconstruction.

SLIDE 28
  • We would like to use the same basic algorithm in all settings to get an idea of the relative difficulty of the recovery problem for different $p$,
  • but if our algorithm simply returns poor results in each case, differences between the cases might be masked.
  • The alternating minimization can be very effective when well initialized.

SLIDE 29
  • When given a training set of the data to recover, we use a simple regression to find $x_0$.
  • Fix a number of neighbors $q$ (in the experiments below we use $q = 10$), and suppose $X$ is the training set.
  • Set $G = P_p(X)$; for a new point $x$ to recover from $P_p(x)$, find the $q$ nearest neighbors in $G$ of $P_p(x)$ and take their principal component to serve as $x_0$ in the alternating minimization algorithm.
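
A sketch of this initializer (the variable names and the uncentered reading of "principal component" are my assumptions):

```python
import numpy as np

def nn_init(y, G, X_train, q=10):
    """Nearest-neighbor regression initializer: y = P_p(x) is the pooled
    measurement, rows of G are P_p of the training signals (rows of
    X_train); return the principal component of the q nearest neighbors."""
    idx = np.argsort(np.linalg.norm(G - y, axis=1))[:q]
    nbrs = X_train[idx]                 # the q nearest training signals
    # top right-singular vector; uncentered, since x is only defined up to sign
    _, _, Vt = np.linalg.svd(nbrs, full_matrices=False)
    return Vt[0]
```

The returned vector can be passed as $x_0$ to the alternating minimization sketch above.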

SLIDE 30
  • random data, random dictionary
  • phasecut+alternating minimization vs. alternating minimization vs. rectified alternating minimization

[Figure: number of samples vs. reconstruction accuracy (mean $|r^T x|^2 / (\|r\|^2 \|x\|^2)$); curves: phaselift+am, linear am, relu am.]

SLIDE 31
  • structured data with training samples, random dictionary
  • phasecut+am vs. phasecut vs. nn-regress am vs. am

[Figures, panels MNIST and PATCHES: number of samples vs. reconstruction accuracy (mean $|r^T x|^2 / (\|r\|^2 \|x\|^2)$); curves: phasecut, phasecut+am, am + nn init, am.]

SLIDE 32
  • alternating minimization with good initialization gives great results!
  • see also [Yang et al 2013] on phaselift with a sparse prior.
  • but this is trivial and much faster.
SLIDE 33

[Figures, several panels: number of samples vs. reconstruction accuracy (mean $|r^T x|^2 / (\|r\|^2 \|x\|^2)$); curves: linear, relu.]

SLIDE 34

[Figures, several panels: number of samples vs. reconstruction accuracy (mean $|r^T x|^2 / (\|r\|^2 \|x\|^2)$); curves: linear and relu with random init, am + nn init, and nn regress.]

SLIDE 35

[Figures, several panels, up to 800 samples: number of samples vs. reconstruction accuracy (mean $|r^T x|^2 / (\|r\|^2 \|x\|^2)$); curves: linear and relu with random init, am + nn init, and nn regress.]

SLIDE 36

The experiments show that:

  • For every data set, with random initializations and dictionaries, recovery is easier with half rectification before pooling than without.
  • $\ell_\infty$, $\ell_1$, and $\ell_2$ pooling are all roughly the same difficulty to invert.
  • Good initialization improves performance; indeed, alternating minimization with nearest neighbor regression outperforms phaselift and phasecut (which of course do not have the luxury of samples from the prior, as the regressor does).
  • Adapted analysis “frames” (with regularization) are easier to invert than random analysis frames, with or without regularization.

SLIDE 37
  • Each of these conclusions is unfortunately only true up to the optimization method: it may be that a different optimizer would lead to different results.