Pooling fidelity and phase recovery
Joan Bruna, Arthur Szlam, and Yann LeCun
Neural networks:
- (Simplest version) functions of the form
Lk ◦ Lk−1 ◦ ... ◦ L0,
- each Lj is of the form
Lj(xj−1) = h(Aj xj−1 − bj),
- Aj is a matrix and bj is a vector
- h is an elementwise nonlinearity.
- Aj optimized for a given task, usually via gradient descent.
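As a concrete illustration, a minimal numpy sketch of this composition (the widths, weights, and the choice h = tanh are illustrative, not from the slides):

```python
import numpy as np

def layer(x, A, b, h=np.tanh):
    """One layer L_j(x) = h(A x - b), with h an elementwise nonlinearity."""
    return h(A @ x - b)

rng = np.random.default_rng(0)
widths = [8, 16, 16, 4]                                   # illustrative layer widths
params = [(rng.standard_normal((widths[j + 1], widths[j])),
           rng.standard_normal(widths[j + 1]))
          for j in range(len(widths) - 1)]

# Apply L_k o L_{k-1} o ... o L_0 to a random input.
x = rng.standard_normal(widths[0])
for A, b in params:
    x = layer(x, A, b)
print(x.shape)                                            # (4,)
```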
Convolutional neural networks:
- The input xj has a grid structure, and Aj specializes to a convolution.
- The pointwise nonlinearity is followed by a pooling operator.
- Pooling introduces invariance (on the grid) at the cost of lower resolution (on the grid).
Pooling in neural networks:
- Usually block ℓp (see the sketch below):
  [P(z)]_i = ( |z_{i1}|^p + |z_{i2}|^p + ... + |z_{is}|^p )^{1/p} = ||z_{I_i}||_p
- In words: the ith coordinate of the output is the ℓp norm of the ith block of z.
- In convolutional nets, blocks of indices are usually small spatial blocks,
- p is either 1, 2, or most usually, ∞.
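A small sketch of block ℓp pooling over contiguous blocks of size s (the contiguous layout and the helper name are illustrative; in convnets the blocks are small spatial windows):

```python
import numpy as np

def block_lp_pool(z, s, p=2):
    """[P(z)]_i = ||z_{I_i}||_p, where I_i is the i-th contiguous block of s coordinates."""
    blocks = np.abs(z.reshape(-1, s))            # assumes len(z) is a multiple of s
    if np.isinf(p):
        return blocks.max(axis=1)                # l-infinity pooling, i.e. max pooling
    return (blocks ** p).sum(axis=1) ** (1.0 / p)

z = np.array([1.0, -2.0, 0.5, 3.0, 0.0, -1.0])
print(block_lp_pool(z, s=2, p=1))                # [3.  3.5 1. ]
print(block_lp_pool(z, s=2, p=2))                # l2 norm of each pair
print(block_lp_pool(z, s=2, p=np.inf))           # [2. 3. 1.]
```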
Examples/History:
- Hubel & Wiesel 1962
– neuroscientists’ description of lower level mammalian vision
- Fukushima 1971: Neocognitron
– An artificial version.
– Filters hand designed in first versions; later used Hebbian learning.
– Each layer has the same architecture.
- LeCun 1988, many others: Convolutional nets
– Filters trained discriminatively (original versions), and for reconstruction and discrimination (2005+).
– Training end to end via backpropagation (the chain rule) and stochastic gradient descent.
– Currently state of the art in object recognition, localization, and detection in images, and in various speech recognition, localization, and detection tasks.
– Mathematically poorly understood. (not controversial)
– Poorly understood. (controversial)
Some wild speculation:
- sparse coding
- piecewise linear maps
- pooling is key! (sparsify/decorrelate then contract).
- output of network should be invariant to things we don't care about, but sensitive to things we care about.
Results of Mallat and co-authors:
- A convnet with ℓ2 pooling, specially chosen filters, and some other modifications is provably invariant to deformations while preserving signal energy.
- What about sensitivity to things we care about?
The phase recovery problem:
- Classical version: the goal is to find a signal z ∈ C^d given the absolute value of its (discrete) Fourier transform, |F(z)|.
- With no additional information, this is not possible:
- 1. Each Fourier coefficient's phase can be rotated independently.
- 2. Even if z is real, the absolute value of the Fourier transform is translation invariant.
- 3. In a strong sense, the majority of the information in signals we know and love is in the Fourier phase, not the magnitude:
[Images: cow; duck; phase of duck with magnitude of cow; phase of cow with magnitude of duck.]
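A quick numerical check of points 1 and 2 (numpy; the signal and the shift amount are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(64)

# 2. |F(z)| is unchanged by a circular shift of z (translation invariance).
print(np.allclose(np.abs(np.fft.fft(z)), np.abs(np.fft.fft(np.roll(z, 7)))))   # True

# 1. Each Fourier coefficient's phase can be rotated independently without changing |F(z)|.
phases = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, 64))
z2 = np.fft.ifft(np.fft.fft(z) * phases)            # generally a very different (complex) signal
print(np.allclose(np.abs(np.fft.fft(z2)), np.abs(np.fft.fft(z))))               # True
```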
- Much simpler: consider a real dictionary; the "phase" is then the signs of the analysis coefficients.
- If the dictionary is orthogonal, the sign of an inner product can be flipped at will.
- If the dictionary is overcomplete, interactions force rigidity:
Proposition 1 (Balan et al 2006, 2013). Let F = (f1, . . . , fm) with fi ∈ R^n, set d(x, x′) = min(||x − x′||, ||x + x′||), and let λ−(G) and λ+(G) be the lower and upper frame bounds of a set of vectors G. The mapping M(x) = {|⟨x, fi⟩|}_{i≤m} satisfies

∀ x, x′ ∈ R^n,   A d(x, x′) ≤ ||M(x) − M(x′)|| ≤ B d(x, x′),   (1)

where

A = min_{S ⊂ {1,...,m}} ( λ−²(F_S) + λ−²(F_{S^c}) )^{1/2},   (2)
B = λ+(F).   (3)

In particular, M is injective if and only if for any subset S ⊆ {1, . . . , m}, either F_S or F_{S^c} is an invertible frame.
Proof of "in particular" (assuming F is spanning):
- Suppose for any subset S ⊆ {1, . . . , m}, either F_S or F_{S^c} is spanning. Fix x, x′ ∈ R^n with |fi^T x| = |fi^T x′| for all i, and let S be the set
  S = {i : sign(fi^T x) = sign(fi^T x′)}.
  Then F_S^T (x − x′) = 0 and F_{S^c}^T (x + x′) = 0, so if F_S is spanning, x = x′; else F_{S^c} is spanning and x = −x′.
- Suppose not; let S be the offending set of indices. Pick x ≠ 0 such that F_S^T x = 0, and x′ ≠ 0 such that F_{S^c}^T x′ = 0. Then x + x′ and x − x′ have the same coefficient moduli, but are not equal up to sign.
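A brute-force check of this criterion for small frames (an illustrative helper, exponential in m, so only usable for tiny examples):

```python
import numpy as np
from itertools import chain, combinations

def abs_measurements_injective(F):
    """F: n x m matrix whose columns are the frame vectors f_i. Returns True iff
    for every subset S of columns, F_S or F_{S^c} spans R^n (the Prop. 1 criterion)."""
    n, m = F.shape

    def spans(cols):
        return len(cols) >= n and np.linalg.matrix_rank(F[:, list(cols)]) == n

    for S in chain.from_iterable(combinations(range(m), r) for r in range(m + 1)):
        Sc = tuple(i for i in range(m) if i not in S)
        if not (spans(S) or spans(Sc)):
            return False
    return True

rng = np.random.default_rng(0)
print(abs_measurements_injective(rng.standard_normal((3, 5))))  # m = 2n - 1: generically True
print(abs_measurements_injective(rng.standard_normal((3, 4))))  # m < 2n - 1: always False
```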
An example for the ℓ2 subspace case:
- consider
F =
  [  1      1/√2    1/√2 ]
  [ 1/√2     1      1/√2 ]
  [  1     −1/√2    1/√2 ]
with groups I1 = {1, 2}, I2 = {3, 4}, and I3 = {5, 6}
- Is ℓ2 pooling invertible here?
- No.
Proposition 2 (Casazza et al 2013, BLS 2013). The ℓ2 pooling operator P2 satisfies

∀ x, x′,   A2 d(x, x′) ≤ ||P2(x) − P2(x′)|| ≤ B2 d(x, x′),   (4)

where

A2 = min_{G ∈ Q2} min_{S ⊂ {1,...,m}} ( λ−²(G_S) + λ−²(G_{S^c}) )^{1/2},   B2 = λ+(F).   (5)

In particular, P2 is injective (up to a global sign) if and only if for all G ∈ Q2 and any subset S ⊆ {1, . . . , m}, either G_S or G_{S^c} is an invertible frame. Here Q2 is the set of all block orthogonal transforms applied to F.
Proof of "in particular" (assuming F is spanning):
- Suppose for every G ∈ Q2 and any subset S ⊆ {1, . . . , m}, either G_S or G_{S^c} is spanning.
  – Fix x, x′ ∈ R^n with P2(x) = P2(x′).
  – Choose orthonormal bases G_k for the subspaces spanned by the F_k so that the coordinates u = G_k^T x and u′ = G_k^T x′ satisfy
    ∗ u_i = u′_i = 0 for i ∈ {3, ..., d}
    ∗ u_1 = u′_1 and |u_2| = |u′_2|.
  – Now use the previous argument.
- Suppose not; rotate into the bad coordinates and use the previous method. P2 is invariant to block rotations.
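A quick numerical check of that last point: ℓ2 pooling sees x only through the block norms ||F_k^T x||, so it cannot distinguish F from any block-rotated frame G ∈ Q2 (sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, K = 6, 2, 4                                       # signal dim, block size, number of blocks
F_blocks = [rng.standard_normal((n, s)) for _ in range(K)]

def P2(blocks, x):
    """l2 pooling of the analysis coefficients: (||F_1^T x||_2, ..., ||F_K^T x||_2)."""
    return np.array([np.linalg.norm(Fk.T @ x) for Fk in blocks])

# Rotate the frame vectors inside each block by an orthogonal matrix (an element of Q2).
G_blocks = []
for Fk in F_blocks:
    R, _ = np.linalg.qr(rng.standard_normal((s, s)))
    G_blocks.append(Fk @ R)

x = rng.standard_normal(n)
print(np.allclose(P2(F_blocks, x), P2(G_blocks, x)))    # True: P2 cannot tell F and G apart
```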
Corollary 3 (less than a week old). If K > 2n and F has the property that any n columns are spanning, P2 is invertible (in particular, for random block orthogonal F with K > 2n).
Half rectification:
- Let Ω = Ω(F, α) be the set of subsets S of {1, ..., m} such that for some x, fi^T x > α for i ∈ S and fi^T x ≤ α for i ∈ S^c.

Proposition 4. Let A0 = min_{S ∈ Ω} λ−(F_S ∪ V_S). Then the half-rectification operator Mα(x) = ρα(F^T x) is injective if and only if A0 > 0. Moreover, it satisfies

∀ x, x′,   A0 ||x − x′|| ≤ ||Mα(x) − Mα(x′)|| ≤ B0 ||x − x′||,   (6)

with B0 = max_{S ∈ Ω} λ+(F_S) ≤ λ+(F).
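A minimal sketch of the half-rectified measurement map, under the assumption that ρα acts entrywise as ρα(u) = max(u − α, 0) (the exact form of ρα is not spelled out in the slides):

```python
import numpy as np

def M_alpha(F, x, alpha=0.0):
    """Half-rectified analysis coefficients; assumes rho_alpha(u) = max(u - alpha, 0)."""
    return np.maximum(F.T @ x - alpha, 0.0)

rng = np.random.default_rng(0)
F = rng.standard_normal((3, 8))          # 8 frame vectors in R^3, one per column
x = rng.standard_normal(3)
y = M_alpha(F, x, alpha=0.1)
S = np.flatnonzero(y > 0)                # the active set realized by this x (an element of Omega)
print(y.round(2), S)
```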
Corollary 5. Let d = 2. Then the rectified ℓ2 pooling operator R2 satisfies

∀ x, x′,   Ã2 d(x, x′) ≤ ||R2(x) − R2(x′)|| ≤ B2 d(x, x′),   (7)

where

Ãp = inf_{x,x′} min_{F′ ∈ Q_{p,x,x′}} min_{S ⊂ S_x ∩ S_{x′}} ( λ−²(F_{S_x ∪ S_{x′} \ (S_x ∩ S_{x′})}) + λ−²(F′_S) + λ−²(F′_{S^c}) )^{1/2}.
- For ℓ1 and ℓ∞: one needs to replace Q2 with an appropriate set of transforms.
- The statements are somewhat messier, and not tight.
- For random (block orthonormal) frames with K > 4n, invertibility holds with probability 1.
Will now discuss some experiments. But first, we need algorithms for phase recovery:
- alternating minimization
- phaselift [Candès et al] and phasecut [Waldspurger et al]
- As above, denote the frame {f1, . . . , fm} = F and let F^(−1) be the pseudoinverse of F; let F_k be the frame vectors in the kth block, with I_k the indices of the kth block.
- Starting with an initial signal x0, update (see the sketch after this list):
  1. y^(n)_{I_k} = (P_p(x))_k · F_k x^(n) / ||F_k x^(n)||_p,   k = 1, . . . , K,
  2. x^(n+1) = F^(−1) y^(n).
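A sketch of this iteration for ℓ2 pooling (the block layout, iteration count, and success metric are illustrative choices, not the authors' code):

```python
import numpy as np

def am_recover_l2(F, blocks, y, x0, iters=500):
    """Alternating minimization: given y_k = ||F_k^T x||_2, alternately
    (1) rescale the current block coefficients to the target norms,
    (2) map back to signal space with the pseudoinverse of the analysis operator."""
    F_pinv = np.linalg.pinv(F.T)                 # F^(-1) in the slides' notation
    x = x0.copy()
    for _ in range(iters):
        c = F.T @ x                              # current analysis coefficients
        z = np.zeros_like(c)
        for k, Ik in enumerate(blocks):
            nrm = np.linalg.norm(c[Ik])
            z[Ik] = y[k] * c[Ik] / nrm if nrm > 0 else y[k] / np.sqrt(len(Ik))
        x = F_pinv @ z                           # x^(n+1) = F^(-1) y^(n)
    return x

rng = np.random.default_rng(0)
n, s, K = 4, 2, 10                               # signal dim, block size, number of blocks
F = rng.standard_normal((n, s * K))              # random frame, columns grouped in pairs
blocks = [list(range(k * s, (k + 1) * s)) for k in range(K)]
x_true = rng.standard_normal(n)
y = np.array([np.linalg.norm(F[:, Ik].T @ x_true) for Ik in blocks])

x_hat = am_recover_l2(F, blocks, y, x0=rng.standard_normal(n))
corr = abs(x_hat @ x_true) / (np.linalg.norm(x_hat) * np.linalg.norm(x_true))
print(round(corr, 3))                            # near 1 only when the iteration succeeds
```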
- alternating minimization
- phaselift [Candès et al] and phasecut [Waldspurger et al]
- phaselift and phasecut both use the lifting trick of [Balan et al]: consider a matrix variable corresponding to xx*.
- Absolute value constraints are linear when lifted.
- Many more variables.
- Ugly nonconvex rank-1 constraint; phaselift and phasecut are different relaxations of the lifted problem.
- Alternating minimization is not, as far as we know, guaranteed to converge to the correct solution, even when P_p is invertible.
- phasecut and phaselift are guaranteed with high probability for the (classical) phase recovery problem if we have enough (random!) measurements.
- In practice, if the inversion is easy enough, or if x0 is close to the true solution, alternating minimization can work well. Moreover,
- alternating minimization can be run essentially unchanged for each ℓp; for half rectification, we only use the nonnegative entries in y for reconstruction.
- We would like to use the same basic algorithm for all settings to get an idea of the relative difficulty of the recovery problem for different p,
- but if our algorithm simply returns poor results in every case, differences between the cases might be masked.
- The alternating minimization can be very effective when well initialized.
- When given a training set of the data to recover, we use a simple regression to find x0.
- Fix a number of neighbors q (in the experiments below we use q = 10), and suppose X is the training set.
- Set G = Pp(X), and for a new point x to recover from Pp(x), find the q nearest neighbors of Pp(x) in G, and take the principal component of the corresponding training points to serve as x0 in the alternating minimization algorithm (see the sketch below).
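A sketch of this initializer (whether the neighbors are centered before taking the principal component is not specified in the slides; this version skips centering, and the names `pool`, `X`, and `block_lp_pool` are assumptions):

```python
import numpy as np

def nn_init(pool, X, y_target, q=10):
    """Nearest-neighbor regression initializer: pool the training set, find the q
    training points whose poolings are closest to y_target, and return the top
    principal direction of those training points as the starting signal x0."""
    G = np.stack([pool(x) for x in X])                  # G = P_p(X)
    nearest = np.argsort(np.linalg.norm(G - y_target, axis=1))[:q]
    neighbors = X[nearest]                              # the corresponding training signals
    _, _, Vt = np.linalg.svd(neighbors, full_matrices=False)
    return Vt[0]                                        # top right singular vector, used as x0

# usage sketch (names hypothetical):
#   x0 = nn_init(lambda v: block_lp_pool(F.T @ v, s=2, p=2), X_train, Pp_of_new_point, q=10)
```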
- random data, random dictionary
- phasecut+alternating minimization vs. alternating minimization vs. rectified alternating minimization
[Plot: number of samples vs. reconstruction accuracy, mean |r^T x|² / (||r||² ||x||²); curves: phaselift+am, linear am, relu am.]
- structured data with training samples, random dictionary
- phasecut+am vs. phasecut vs. nn-regress am vs. am
[Plots: number of samples vs. reconstruction accuracy, mean |r^T x|² / (||r||² ||x||²); curves: phasecut, phasecut+am, am + nn init, am.]
MNIST PATCHES
- Alternating minimization with good initialization gives great results!
- See also [Yang et al 2013] on phaselift with a sparse prior.
- But this is trivial and much faster.
[Plots: number of samples vs. reconstruction accuracy, mean |r^T x|² / (||r||² ||x||²); curves: linear, relu.]
[Plots (up to 300 samples): number of samples vs. reconstruction accuracy, mean |r^T x|² / (||r||² ||x||²); curves: linear and relu, each with random init, am + nn init, and nn regress.]
[Plots (up to 800 samples): number of samples vs. reconstruction accuracy, mean |r^T x|² / (||r||² ||x||²); curves: linear and relu, each with random init, am + nn init, and nn regress.]
The experiments show that:
- For every data set, with random initializations and dictionaries, recovery is easier with half rectification before pooling than without.
- ℓ∞, ℓ1, and ℓ2 pooling are all roughly the same difficulty to invert.
- Good initialization improves performance; indeed, alternating minimization with nearest neighbor regression outperforms phaselift and phasecut (which of course do not have the luxury of samples from the prior, as the regressor does).
- Adapted analysis "frames" (with regularization) are easier to invert than random analysis frames, with or without regularization.
- Each of these conclusions is unfortunately only true up to the opti-