Pooling fidelity and phase recovery
Joan Bruna, Arthur Szlam, and Yann LeCun
Neural networks:
• (Simplest version) functions of the form $L_k \circ L_{k-1} \circ \cdots \circ L_0$,
• each $L_j$ is of the form $L_j(x_{j-1}) = h(A_j x_{j-1} - b_j)$,
• $A_j$ is a matrix and $b_j$ is a vector,
• $h$ is an elementwise nonlinearity,
• $A_j$ is optimized for a given task, usually via gradient descent.
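A minimal sketch of such a composition in numpy (illustrative only, not the authors' code; the shapes and the choice of tanh for $h$ are assumptions):

```python
import numpy as np

def layer(x, A, b, h):
    # One layer: L_j(x_{j-1}) = h(A_j x_{j-1} - b_j).
    return h(A @ x - b)

def network(x, params, h=np.tanh):
    # Compose L_k o ... o L_0 by applying each layer in turn.
    for A, b in params:
        x = layer(x, A, b, h)
    return x

rng = np.random.default_rng(0)
params = [(rng.standard_normal((5, 4)), np.zeros(5)),   # L_0: R^4 -> R^5
          (rng.standard_normal((3, 5)), np.zeros(3))]   # L_1: R^5 -> R^3
y = network(rng.standard_normal(4), params)
print(y.shape)  # (3,)
```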
Convolutional neural networks:
• The input $x_j$ has a grid structure, and $A_j$ specializes to a convolution.
• The pointwise nonlinearity is followed by a pooling operator.
• Pooling introduces invariance (on the grid) at the cost of lower resolution (on the grid).
Pooling in neural networks:
• Usually block $\ell_p$:
$$[P(z)]_i = \left( |z_{i_1}|^p + |z_{i_2}|^p + \cdots + |z_{i_s}|^p \right)^{1/p} = \|z_{I_i}\|_p$$
• In words: the $i$th coordinate of the output is the $\ell_p$ norm of the $i$th block of $z$.
• In convolutional nets, blocks of indices are usually small spatial blocks,
• $p$ is either 1, 2, or most usually, $\infty$.
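A minimal numpy sketch of block $\ell_p$ pooling (illustrative; the contiguous block layout is an assumption):

```python
import numpy as np

def block_pool(z, block_size, p=2):
    # [P(z)]_i = l_p norm of the i-th block of z; p = inf gives max pooling.
    blocks = np.abs(z).reshape(-1, block_size)
    if np.isinf(p):
        return blocks.max(axis=1)
    return (blocks ** p).sum(axis=1) ** (1.0 / p)

z = np.array([3.0, -4.0, 1.0, 1.0])
print(block_pool(z, 2, p=2))       # [5.0, sqrt(2)]
print(block_pool(z, 2, p=np.inf))  # [4.0, 1.0]
```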
Examples/History:
• Hubel & Wiesel 1962
 – neuroscientists' description of lower-level mammalian vision
• Fukushima 1980: Neocognitron
 – An artificial version.
 – Filters hand designed in first versions; later versions used Hebbian learning.
 – Each layer has the same architecture.
• LeCun 1988, many others: Convolutional nets
 – Filters trained discriminatively (original versions), and for reconstruction and discrimination (2005+).
 – Training end to end via backpropagation (the chain rule) and stochastic gradient descent.
 – Currently state of the art in object recognition, localization, and detection in images, and in various speech recognition, localization, and detection tasks.
 – Mathematically poorly understood. (not controversial)
 – Poorly understood. (controversial)
Some wild speculation:
• sparse coding
• piecewise linear maps
• pooling is key! (sparsify/decorrelate, then contract).
• The output of the network should be invariant to things we don't care about, but sensitive to things we do care about.
We have results of Mallat and co-authors:
• A convnet with $\ell_2$ pooling, specially chosen filters, and some other modifications is provably invariant to deformations while preserving signal energy.
• What about sensitivity to things we care about?
The phase recovery problem:
• Classical version: the goal is to find a signal $z \in \mathbb{C}^d$ given the absolute value of its (discrete) Fourier transform $|\mathcal{F}(z)|$.
• With no additional information, this is not possible:
 1. Each coefficient can be rotated independently.
 2. Even if $z$ is real, the absolute value of the Fourier transform is translation invariant.
 3. In a strong sense, the majority of the information in signals we know and love is in the Fourier phase, not the magnitude:
[Figure: a cow image and a duck image, and hybrids combining the Fourier phase of one with the magnitude of the other: "phase of duck, abs of cow" and "phase of cow, abs of duck". Each hybrid resembles its phase donor.]
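The cow/duck demonstration can be mimicked numerically on random 1-D signals (a hypothetical stand-in for the images): a hybrid built from one signal's Fourier magnitude and the other's phase correlates far more with the phase donor.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(64)   # stand-in for "cow"
b = rng.standard_normal(64)   # stand-in for "duck"

A, B = np.fft.fft(a), np.fft.fft(b)
# Combine the magnitude of a with the phase of b.
hybrid = np.real(np.fft.ifft(np.abs(A) * np.exp(1j * np.angle(B))))

# Cosine similarity: the hybrid resembles the phase donor b, not a.
corr = lambda u, v: np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(corr(hybrid, b), corr(hybrid, a))
```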
• Much simpler: consider a real dictionary; "phase" is the signs of the analysis coefficients.
• If the dictionary is orthogonal, one can flip the sign of an inner product at will.
• If the dictionary is overcomplete, interactions force rigidity:
Proposition 1 (Balan et al. 2006, 2013). Let $F = (f_1, \ldots, f_m)$ with $f_i \in \mathbb{R}^n$, set $d(x, x') = \min(\|x - x'\|, \|x + x'\|)$, and let $\lambda_-(G)$ and $\lambda_+(G)$ be the lower and upper frame bounds of a set of vectors $G$. The mapping $M(x) = \{|\langle x, f_i \rangle|\}_{i \le m}$ satisfies
$$\forall x, x' \in \mathbb{R}^n, \quad A\, d(x, x') \le \|M(x) - M(x')\| \le B\, d(x, x'), \qquad (1)$$
where
$$A = \min_{S \subset \{1 \ldots m\}} \sqrt{\lambda_-^2(F_S) + \lambda_-^2(F_{S^c})}, \qquad (2)$$
$$B = \lambda_+(F). \qquad (3)$$
In particular, $M$ is injective if and only if for any subset $S \subseteq \{1, \ldots, m\}$, either $F_S$ or $F_{S^c}$ is an invertible frame.
Proof of "in particular" (assuming $F$ is spanning):
• Suppose for any subset $S \subseteq \{1, \ldots, m\}$, either $F_S$ or $F_{S^c}$ is spanning. Fix $x, x' \in \mathbb{R}^n$ with $|f_i^T x| = |f_i^T x'|$ for all $i$, and let $S$ be the set $S = \{i : \mathrm{sign}(f_i^T x) = \mathrm{sign}(f_i^T x')\}$. If $F_S$ is spanning, $x = x'$; else $x = -x'$.
• Suppose not; let $S$ be the offending set of indices. Pick $x \ne 0$ such that $F_S^T x = 0$, and $x' \ne 0$ such that $F_{S^c}^T x' = 0$. Then $x + x'$ and $x - x'$ have the same modulus.
An example for the $\ell_2$ subspace case:
• Consider
$$F = \begin{pmatrix} 1 & 0 & 1/\sqrt{2} & 0 & 1/\sqrt{2} & 1/\sqrt{2} \\ 0 & 1 & 1/\sqrt{2} & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}$$
with groups $I_1 = \{1, 2\}$, $I_2 = \{3, 4\}$, and $I_3 = \{5, 6\}$.
• Is $\ell_2$ pooling invertible here?
• No.
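The answer can be verified numerically. The counterexample pair $x = (1, 0, 1)$, $x' = (-1, 0, 1)$ below is hand-derived, not given in the slides:

```python
import numpy as np

s = 1 / np.sqrt(2)
F = np.array([[1, 0, s, 0,  s, s],
              [0, 1, s, 0,  0, 0],
              [0, 0, 0, 1, -s, s]])
blocks = [[0, 1], [2, 3], [4, 5]]

def P2(x):
    # l2 pooling of the analysis coefficients F^T x over the three groups.
    c = F.T @ x
    return np.array([np.linalg.norm(c[b]) for b in blocks])

x  = np.array([1.0, 0.0, 1.0])
xp = np.array([-1.0, 0.0, 1.0])
print(P2(x))
print(P2(xp))  # same pooled output, yet xp != +/- x
```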
Proposition 2 (Casazza et al. 2013, BLS 2013). The $\ell_2$ pooling operator $P_2$ satisfies
$$\forall x, x', \quad A_2\, d(x, x') \le \|P_2(x) - P_2(x')\| \le B_2\, d(x, x'), \qquad (4)$$
where
$$A_2 = \min_{G \in Q_2} \min_{S \subset \{1 \ldots m\}} \sqrt{\lambda_-^2(G_S) + \lambda_-^2(G_{S^c})}, \qquad B_2 = \lambda_+(F). \qquad (5)$$
In particular, $P_2$ is injective (up to a global sign) if and only if for any subset $S \subseteq \{1, \ldots, m\}$, either $G_S$ or $G_{S^c}$ is an invertible frame for all $G \in Q_2$. Here $Q_2$ is the set of all block orthogonal transforms applied to $F$.
Proof of "in particular" (assuming $F$ is spanning):
• Suppose for any subset $S \subseteq \{1, \ldots, m\}$, either $G_S$ or $G_{S^c}$ is spanning for every $G \in Q_2$.
 – Fix $x, x' \in \mathbb{R}^n$ with $P_2(x) = P_2(x')$.
 – Choose orthonormal bases $G_k$ for the subspace spanned by $F_k$ so that the coordinates $u = G_k^T x$ and $u' = G_k^T x'$ satisfy
  ∗ $u_i = u'_i = 0$ for $i \in \{3, \ldots, d\}$
  ∗ $u_1 = u'_1$ and $|u_2| = |u'_2|$
 – Now use the previous argument.
• Suppose not; rotate into bad coordinates and use the previous method. ($P_2$ is invariant to block rotations.)
Corollary 3 (less than a week old). If $K > 2n$ and $F$ has the property that any $n$ columns are spanning, then $P_2$ is invertible (in particular, for random block orthogonal $F$ with $K > 2n$).
Half rectification:
• Let $\Omega = \Omega(F, \alpha)$ be the set of subsets $S$ of $\{1, \ldots, m\}$ such that some $x$ has $f_i^T x > \alpha$ for $i \in S$ and $f_i^T x \le \alpha$ for $i \in S^c$.

Proposition 4. Let $A_0 = \min_{S \in \Omega} \lambda_-(F_S \cup V_S)$. Then the half-rectification operator $M_\alpha(x) = \rho_\alpha(F^T x)$ is injective if and only if $A_0 > 0$. Moreover, it satisfies
$$\forall x, x', \quad A_0 \|x - x'\| \le \|M_\alpha(x) - M_\alpha(x')\| \le B_0 \|x - x'\|, \qquad (6)$$
with $B_0 = \max_{S \in \Omega} \lambda_+(F_S) \le \lambda_+(F)$.
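A one-line sketch of the operator, assuming $\rho_\alpha(u) = \max(u - \alpha, 0)$ (the slides do not spell out $\rho_\alpha$; the frame and signal below are arbitrary):

```python
import numpy as np

def half_rectify(F, x, alpha=0.0):
    # M_alpha(x) = rho_alpha(F^T x), assuming rho_alpha(u) = max(u - alpha, 0).
    return np.maximum(F.T @ x - alpha, 0.0)

rng = np.random.default_rng(0)
F = rng.standard_normal((3, 8))  # n = 3, m = 8 frame vectors as columns
x = rng.standard_normal(3)
y = half_rectify(F, x)
print(y)  # m nonnegative rectified coefficients
```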
Corollary 5. Let $d = 2$. Then the rectified $\ell_2$ pooling operator $R_2$ satisfies
$$\forall x, x', \quad \tilde{A}_2\, d(x, x') \le \|R_2(x) - R_2(x')\| \le B_2\, d(x, x'), \qquad (7)$$
where
$$\tilde{A}_2 = \inf_{x, x'} \min_{S \subset S_x \cap S_{x'}} \min_{F' \in \tilde{Q}_{2, x, x'}} \left( \lambda_-^2(F_{S_x \cup S_{x'} \setminus (S_x \cap S_{x'})}) + \lambda_-^2(F'_S) + \lambda_-^2(F'_{S^c}) \right)^{1/2}.$$
• For $\ell_1$, $\ell_\infty$: need to replace $Q$.
• Statements somewhat messier, not tight.
• For random (block orthonormal) frames with $K > 4n$, invertibility with probability 1.
Will now discuss some experiments. But first, we need algorithms for phase recovery:
• alternating minimization
• phaselift [Candès et al.] and phasecut [Waldspurger et al.]
• As above, denote the frame $\{f_1, \ldots, f_m\} = F$ and set $F^{(-1)}$ to be the pseudoinverse of $F$;
• let $F_k$ be the frame vectors in the $k$th block, with $I_k$ the indices of the $k$th block.
• Starting with an initial signal $x_0$, update
 1. $y^{(n)}_{I_k} = (P_p(x))_k \dfrac{F_k x^{(n)}}{\|F_k x^{(n)}\|_p}$, $k = 1 \ldots K$,
 2. $x^{(n+1)} = F^{(-1)} y^{(n)}$.
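The update above can be sketched for $\ell_2$ pooling as follows (an illustrative numpy implementation; the block layout, iteration count, and warm start are assumptions, and the back-projection is done by least squares on the analysis coefficients):

```python
import numpy as np

def pool2(F, blocks, x):
    # Block l2 pooling of the analysis coefficients F^T x.
    c = F.T @ x
    return np.array([np.linalg.norm(c[b]) for b in blocks])

def alt_min(F, blocks, target, x0, n_iter=500):
    # Alternate between (1) rescaling each coefficient block to the target
    # pooled norm while keeping its direction, and (2) mapping back to
    # signal space by least squares (pseudoinverse).
    Ft_pinv = np.linalg.pinv(F.T)
    x = x0.copy()
    for _ in range(n_iter):
        c = F.T @ x
        y = np.zeros_like(c)
        for k, b in enumerate(blocks):
            nb = np.linalg.norm(c[b])
            if nb > 1e-12:
                y[b] = target[k] * c[b] / nb
        x = Ft_pinv @ y
    return x

rng = np.random.default_rng(1)
n, m = 3, 8
F = rng.standard_normal((n, m))
blocks = [list(range(2 * k, 2 * k + 2)) for k in range(m // 2)]
x_true = rng.standard_normal(n)
target = pool2(F, blocks, x_true)

# Warm start near the truth; each step is a projection, so the pooled
# residual never exceeds its initial value.
x0 = x_true + 0.01 * rng.standard_normal(n)
x_rec = alt_min(F, blocks, target, x0)
print(np.linalg.norm(pool2(F, blocks, x_rec) - target))
```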
• Phaselift and phasecut both use the lifting trick of [Balan et al.]: consider a matrix variable corresponding to $xx^*$.
• Absolute value constraints become linear when lifted.
• Many more variables.
• Ugly nonconvex rank-1 constraint; phaselift and phasecut are different relaxations of the lifted problem.
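The linearity claim rests on the identity $|\langle f, x \rangle|^2 = \mathrm{tr}(f f^T X)$ with $X = x x^T$, which can be checked numerically (real case, arbitrary random vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
x = rng.standard_normal(n)
f = rng.standard_normal(n)

X = np.outer(x, x)  # the lifted variable corresponding to x x^T
# The quadratic measurement |<f, x>|^2 is linear in X: it equals tr(f f^T X).
lhs = np.dot(f, x) ** 2
rhs = np.trace(np.outer(f, f) @ X)
print(np.isclose(lhs, rhs))  # True
```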
• Alternating minimization is not, as far as we know, guaranteed to converge to the correct solution, even when $P_p$ is invertible.
• Phasecut and phaselift are guaranteed with high probability for the (classical) phase recovery problem if one has enough (random!) measurements.
• In practice, if the inversion is easy enough, or if $x_0$ is close to the true solution, alternating minimization can work well. Moreover,
• alternating minimization can be run essentially unchanged for each $\ell_p$; for half rectification, we only use the nonnegative entries of $y$ for reconstruction.
• We would like to use the same basic algorithm in all settings to get an idea of the relative difficulty of the recovery problem for different $p$,
• but if our algorithm simply returns poor results in every case, differences between the cases might be masked.
• Alternating minimization can be very effective when well initialized.