Harmonic Analysis of Deep Convolutional Networks


  1. Harmonic Analysis of Deep Convolutional Networks (1). Yuan YAO, HKUST. Based on talks by Mallat, Bolcskei, and others.

  2. Acknowledgement. A follow-up course at HKUST: https://deeplearning-math.github.io/

  3. High Dimensional Natural Image Classification
     • High-dimensional data $x = (x(1), \ldots, x(d)) \in \mathbb{R}^d$.
     • Classification: estimate a class label $f(x)$ given $n$ sample values $\{x_i, y_i = f(x_i)\}_{i \le n}$.
     • Image classification: $d = 10^6$, with huge variability inside classes (Anchor, Joshua Tree, Beaver, Lotus Water Lily). The goal is to find invariants.

  4. Curse of Dimensionality
     • Analysis in high dimension: $x \in \mathbb{R}^d$ with $d \ge 10^6$.
     • Points are far apart in high dimension $d$:
       - 10 points cover $[0,1]$ at a distance $10^{-1}$;
       - 100 points are needed for $[0,1]^2$;
       - $10^d$ points are needed for $[0,1]^d$: impossible if $d \ge 20$.
     • Points are concentrated in the $2^d$ corners of the cube:
       $$\lim_{d \to \infty} \frac{\mathrm{volume}(\text{sphere of radius } r)}{\mathrm{volume}([0,r]^d)} = 0.$$
     ⇒ Euclidean metrics are not appropriate on raw data.
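A quick numerical illustration of this volume concentration (not part of the original slides; the closed-form ball volume is a standard fact):

```python
import math

# Illustration of the volume-concentration claim:
# vol(ball of radius r in d dims) / vol([0, r]^d) = pi^(d/2) / Gamma(d/2 + 1),
# independent of r, and it collapses towards 0 once d reaches a few tens.
def ball_to_cube_ratio(d):
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

for d in [1, 2, 5, 10, 20, 50]:
    print(f"d = {d:3d}   ratio = {ball_to_cube_ratio(d):.3e}")
# Nearly all of the cube's volume sits in its 2^d corners, outside any Euclidean ball,
# so Euclidean metrics become uninformative on raw high-dimensional data.
```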

  5. A Blessing from the Physical World? Multiscale "compositional" sparsity
     • Variables $x(u)$ indexed by a low-dimensional $u$: time/space, pixels in images, particles in physics, words in text...
     • Multiscale interactions of $d$ variables: from $d^2$ pairwise interactions to $O(\log_2 d)$ multiscale interactions.
     • Multiscale analysis: wavelets on groups of symmetries; hierarchical architecture.

  6. Learning as an Approximation
     • To estimate $f(x)$ from a sampling $\{x_i, y_i = f(x_i)\}_{i \le M}$ we must build an $M$-parameter approximation $f_M$ of $f$.
     • Precise sparse approximation requires some "regularity".
     • For binary classification, $f(x) = 1$ if $x \in \Omega$ and $f(x) = -1$ if $x \notin \Omega$, so $f(x) = \mathrm{sign}(\tilde{f}(x))$ where $\tilde{f}$ is potentially regular.
     • What type of regularity? How to compute $f_M$?

  7. One Hidden Layer Neural Networks
     • One-hidden-layer neural network:
       $$f_M(x) = \sum_{n=1}^{M} \alpha_n \, \rho(w_n \cdot x + b_n), \qquad w_n \cdot x = \sum_k w_{k,n}\, x(k),$$
       where $\{w_{k,n}\}_{k,n}$ and $\{\alpha_n\}_n$ are learned: a non-linear approximation.
     • Fourier series: $\rho(u) = e^{iu}$ gives $f_M(x) = \sum_{n=1}^{M} \alpha_n e^{i w_n \cdot x}$.
     • For nearly all $\rho$: essentially the same approximation results.
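A minimal sketch of such a one-hidden-layer approximation on an assumed toy target; only the outer coefficients $\alpha_n$ are fitted (by least squares, over random fixed inner weights), which is a simplification of full training but already shows the non-linear expansion:

```python
import numpy as np

# Sketch of f_M(x) = sum_{n<=M} alpha_n * rho(w_n . x + b_n) with rho(u) = max(u, 0).
rng = np.random.default_rng(0)
d, M, n_samples = 2, 200, 1000

X = rng.uniform(-1.0, 1.0, size=(n_samples, d))
f = lambda X: np.sin(3 * X[:, 0]) * np.cos(2 * X[:, 1])      # assumed toy target f(x)

W = rng.normal(size=(d, M))                        # inner weights w_n (kept fixed here)
b = rng.normal(size=M)                             # biases b_n
H = np.maximum(X @ W + b, 0.0)                     # hidden layer rho(w_n . x + b_n)
alpha, *_ = np.linalg.lstsq(H, f(X), rcond=None)   # learn the outer coefficients alpha_n

f_M = lambda X: np.maximum(X @ W + b, 0.0) @ alpha
print("train RMSE:", np.sqrt(np.mean((f_M(X) - f(X)) ** 2)))
```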

  8. Piecewise Linear Approximation
     • Piecewise linear approximation with $\rho(u) = \max(u, 0)$:
       $$\tilde{f}(x) = \sum_n a_n \, \rho(x - n\epsilon).$$
     • If $f$ is Lipschitz, $|f(x) - f(x')| \le C |x - x'|$, then $|f(x) - \tilde{f}(x)| \le C\epsilon$.
     ⇒ Need $M = \epsilon^{-1}$ points to cover $[0,1]$ at a distance $\epsilon$, hence $\|f - f_M\| \le C M^{-1}$.
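A small sketch of this rate on an assumed 1D Lipschitz target; the piecewise-linear interpolant at the knots $n\epsilon$ can be written as a ReLU expansion of the above form (plus a constant term), so its error illustrates the $C/M$ bound:

```python
import numpy as np

# Piecewise-linear (= ReLU hinge) approximation of a Lipschitz function on [0, 1].
f = lambda x: np.abs(np.sin(4 * x))            # Lipschitz target
x = np.linspace(0, 1, 5000)

for M in [10, 100, 1000]:
    knots = np.linspace(0, 1, M + 1)           # spacing eps = 1/M
    f_M = np.interp(x, knots, f(knots))        # interpolant at the knots n*eps
    print(f"M = {M:5d}   max error = {np.max(np.abs(f(x) - f_M)):.5f}")
# The maximum error shrinks roughly in proportion to 1/M, as the slide predicts.
```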

  9. Linear Ridge Approximation
     • Piecewise linear ridge approximation for $x \in [0,1]^d$, with $\rho(u) = \max(u, 0)$:
       $$\tilde{f}(x) = \sum_n a_n \, \rho(w_n \cdot x - n\epsilon).$$
     • If $f$ is Lipschitz, $|f(x) - f(x')| \le C \|x - x'\|$, then sampling at a distance $\epsilon$ gives $|f(x) - \tilde{f}(x)| \le C\epsilon$.
     ⇒ Need $M = \epsilon^{-d}$ points to cover $[0,1]^d$ at a distance $\epsilon$, hence $\|f - f_M\| \le C M^{-1/d}$: curse of dimensionality!

  10. Approximation with Regularity
     • What prior condition makes learning possible?
     • Approximation of regular functions in $C^s[0,1]^d$: $|f(x) - p_u(x)| \le C |x - u|^s$ with $p_u(x)$ a polynomial, for all $x, u$.
     • If $|x - u| \le \epsilon^{1/s}$ then $|f(x) - p_u(x)| \le C\epsilon$.
     ⇒ Need $M = \epsilon^{-d/s}$ points to cover $[0,1]^d$ at a distance $\epsilon^{1/s}$, hence $\|f - f_M\| \le C M^{-s/d}$.
     • One cannot do better in $C^s[0,1]^d$; this is not good because $s \ll d$. Failure of classical approximation theory.

  11. Kernel Learning
     • Change of variable $\Phi(x) = \{\phi_k(x)\}_{k \le d'}$ to nearly linearize $f(x)$, which is then approximated by a 1D projection:
       $$\tilde{f}(x) = \langle \Phi(x), w \rangle = \sum_k w_k \phi_k(x).$$
     • Data $x \in \mathbb{R}^d$ are mapped to $\Phi(x) \in \mathbb{R}^{d'}$ and classified by a linear classifier $w$; the metric $\|x - x'\|$ becomes $\|\Phi(x) - \Phi(x')\|$.
     • How and when is it possible to find such a $\Phi$?
     • What "regularity" of $f$ is needed?
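A toy sketch (not from the slides) of such a change of variable: two concentric classes in $\mathbb{R}^2$ are not linearly separable, but the assumed lift $\Phi(x) = (x(1), x(2), \|x\|^2)$ nearly linearizes the boundary:

```python
import numpy as np

# Concentric classes become linearly separable after the lift Phi(x) = (x1, x2, ||x||^2).
rng = np.random.default_rng(0)
n = 500
r = np.concatenate([rng.uniform(0.0, 1.0, n), rng.uniform(1.5, 2.5, n)])
theta = rng.uniform(0.0, 2.0 * np.pi, 2 * n)
X = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
y = np.concatenate([-np.ones(n), np.ones(n)])                       # labels f(x)

Phi = np.column_stack([X, (X ** 2).sum(axis=1), np.ones(2 * n)])    # lift + bias term
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)                         # least-squares linear classifier
print("accuracy in the lifted space:", np.mean(np.sign(Phi @ w) == y))   # ~1.0
```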

  12. Spirit of Fisher's Linear Discriminant Analysis: Reduction of Dimensionality
     • Discriminative change of variable $\Phi(x)$: $\Phi(x) \ne \Phi(x')$ if $f(x) \ne f(x')$, so $\exists\, \tilde{f}$ with $f(x) = \tilde{f}(\Phi(x))$.
     • If $\tilde{f}$ is Lipschitz, $|\tilde{f}(z) - \tilde{f}(z')| \le C \|z - z'\|$ with $z = \Phi(x)$, then $|f(x) - f(x')| \le C \|\Phi(x) - \Phi(x')\|$.
     • Discriminative: $\|\Phi(x) - \Phi(x')\| \ge C^{-1} |f(x) - f(x')|$.
     • For $x \in \Omega$, if $\Phi(\Omega)$ is bounded and of low dimension $d'$, then $\|f - f_M\| \le C M^{-1/d'}$.
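A minimal sketch of Fisher's discriminant direction on assumed Gaussian toy data, giving a 1D discriminative variable $z = \langle v, x \rangle$ in the spirit of this slide:

```python
import numpy as np

# Fisher direction: v proportional to Sw^{-1} (mu1 - mu0), Sw = within-class scatter.
rng = np.random.default_rng(1)
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
cov = np.array([[1.0, 0.3], [0.3, 0.5]])
X0 = rng.multivariate_normal(mu0, cov, 300)
X1 = rng.multivariate_normal(mu1, cov, 300)

Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
v = np.linalg.solve(Sw, X1.mean(0) - X0.mean(0))
v /= np.linalg.norm(v)

z0, z1 = X0 @ v, X1 @ v                      # the low-dimensional variable z = Phi(x)
thr = 0.5 * (z0.mean() + z1.mean())
acc = 0.5 * (np.mean(z0 < thr) + np.mean(z1 > thr))
print("accuracy of the 1D Fisher projection:", acc)
```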

  13. Deep Convolutional Networks
     • The revival of neural networks (Y. LeCun): a cascade of linear convolutions $L_1, L_2, \ldots$ interleaved with the non-linear scalar neuron $\rho(u) = \max(u, 0)$, producing hierarchical invariants and a linearization, followed by a linear classification $y = \tilde{f}(\Phi(x))$.
     • Optimize the $L_j$ under architecture constraints: over $10^9$ parameters.
     • Exceptional results for images, speech, language, bio-data... Why does it work so well? A difficult problem.

  14. Deep Convolutional Networks
     • Layers $x(u) \to x_1(u, k_1) \to x_2(u, k_2) \to \cdots \to x_J(u, k_J) \to$ classification, with $x_j = \rho L_j x_{j-1}$.
     • $L_j$ is a linear combination of convolutions and subsampling, summed across channels:
       $$x_j(u, k_j) = \rho\Big( \sum_k x_{j-1}(\cdot, k) \star h_{k_j, k}(u) \Big).$$
     • $\rho$ is contractive: $|\rho(u) - \rho(u')| \le |u - u'|$, e.g. $\rho(u) = \max(u, 0)$ or $\rho(u) = |u|$.
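A minimal sketch of one such layer for a 1D signal, with illustrative shapes and random filters standing in for the learned $h_{k_j,k}$:

```python
import numpy as np

# One layer x_j(u, k_j) = rho( sum_k x_{j-1}(., k) * h_{k_j, k}(u) ):
# convolve each input channel, sum across channels, apply the contractive rho.
def conv_layer(x_prev, h, rho=lambda u: np.maximum(u, 0.0)):
    """x_prev: (length, in_channels), h: (out_channels, in_channels, filter_len)."""
    K_out, K_in, _ = h.shape
    out = []
    for kj in range(K_out):
        acc = np.zeros(x_prev.shape[0])
        for k in range(K_in):                          # sum across input channels
            acc += np.convolve(x_prev[:, k], h[kj, k], mode="same")
        out.append(rho(acc))
    return np.stack(out, axis=1)                       # (length, out_channels)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 1))                           # x(u): 1D signal, one channel
h1 = rng.normal(size=(8, 1, 5))                        # first-layer filters h_{k_1, k}
x1 = conv_layer(x, h1)                                 # x_1(u, k_1)
print(x1.shape)                                        # (64, 8)
```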

  15. Many Questions
     • The network computes $x(u) \to x_1(u, k_1) \to x_2(u, k_2) \to \cdots \to x_J(u, k_J) \to$ classification, with $x_j = \rho L_j x_{j-1}$.
     • Why convolutions? Translation covariance.
     • Why no overfitting? Contractions, dimension reduction.
     • Why a hierarchical cascade?
     • Why introduce non-linearities?
     • How and what to linearise?
     • What are the roles of the multiple channels in each layer?

  16. Linear Dimension Reduction
     • Classes $\Omega_1, \Omega_2, \Omega_3, \ldots$ are the level sets of $f(x)$: $\Omega_t = \{x : f(x) = t\}$.
     • If the level sets (classes) are parallel to a linear space, then variables are eliminated by the linear projection $\Phi(x)$: invariants.

  17. Linearise for Dimensionality Reduction
     • Classes are the level sets of $f(x)$: $\Omega_t = \{x : f(x) = t\}$.
     • If the level sets $\Omega_t$ are not parallel to a linear space:
       - linearise them with a change of variable $\Phi(x)$;
       - then reduce dimension with linear projections.
     • Difficult because the $\Omega_t$ are high-dimensional, irregular, and known only from few samples.

  18. Level Set Geometry: Symmetries
     • Curse of dimensionality ⇒ not local but global geometry. Level sets (classes) are characterised by their global symmetries.
     • A symmetry is an operator $g$ which preserves the level sets globally: $f(g.x) = f(x)$ for all $x$.
     • If $g_1$ and $g_2$ are symmetries then $g_1.g_2$ is also a symmetry: $f(g_1.g_2.x) = f(g_2.x) = f(x)$.

  19. Groups of Symmetries
     • $G = \{\text{all symmetries}\}$ is a (generally unknown) group:
       - closure: $g.g' \in G$ for all $(g, g') \in G^2$;
       - inverse: $g^{-1} \in G$ for all $g \in G$;
       - associativity: $(g.g').g'' = g.(g'.g'')$;
       - if commutative, $g.g' = g'.g$: Abelian group.
     • Group of dimension $n$ if it has $n$ generators: $g = g_1^{p_1} g_2^{p_2} \cdots g_n^{p_n}$.
     • Lie group: infinitely small generators (Lie algebra).

  20. Translation and Deformations
     • Digit classification: a deformed digit $x'(u) = x(u - \tau(u))$ stays in the same class as $x(u)$ (e.g. $\Omega_3$, $\Omega_5$).
     - Globally invariant to the translation group: a small group.
     - Locally invariant to small diffeomorphisms: a huge group.
     (Video of Philipp Scott Johnson.)
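A small sketch on an assumed 1D "digit-like" signal of how these transformations act; both a translation and a small warp $\tau(u) = \epsilon u$ preserve the class yet produce a large raw Euclidean change, which is why invariant representations are needed:

```python
import numpy as np

# Deformation x_tau(u) = x(u - tau(u)) implemented by linear interpolation.
u = np.linspace(0, 1, 512, endpoint=False)
x = np.exp(-((u - 0.5) ** 2) / 0.005) * np.cos(80 * np.pi * u)   # localized oscillating pattern

def deform(x, u, tau):
    """Sample x at the warped positions u - tau(u)."""
    return np.interp(u - tau(u), u, x)

x_shift = deform(x, u, lambda v: 0.1 * np.ones_like(v))   # translation, tau(u) = 0.1
x_warp = deform(x, u, lambda v: 0.02 * v)                 # small diffeomorphism, ||grad tau||_inf = 0.02

nrm = np.linalg.norm(x)
print("relative change under translation:", np.linalg.norm(x - x_shift) / nrm)
print("relative change under deformation:", np.linalg.norm(x - x_warp) / nrm)
# Both ratios are O(1): the raw Euclidean metric does not see that the class is
# unchanged, hence the need for a translation-invariant, deformation-stable Phi(x).
```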

  21. Rotation and Scaling Variability
     • Rotation and deformations. Group: $SO(2) \times \mathrm{Diff}(SO(2))$.
     • Scaling and deformations. Group: $\mathbb{R} \times \mathrm{Diff}(\mathbb{R})$.

  22. Linearize Symmetries
     • A change of variable $\Phi(x)$ must linearize the orbits $\{g.x\}_{g \in G}$.
     • Linearise symmetries with a change of variable $\Phi(x)$: the orbit points $x, g_1.x, \ldots, g_1^p.x$ (and likewise for $x'$) are mapped to $\Phi(x), \Phi(g_1.x), \ldots, \Phi(g_1^p.x)$ lying close to a linear space.
     • Lipschitz: $\forall x, g: \|\Phi(x) - \Phi(g.x)\| \le C \|g\|$.

  23. Translation and Deformations
     • Digit classification: $x'(u)$ vs. $x(u)$.
     - Globally invariant to the translation group.
     - Locally invariant to small diffeomorphisms.
     ⇒ Linearize small diffeomorphisms: Lipschitz regular.
     (Video of Philipp Scott Johnson.)

  24. Translations and Deformations
     • Invariance to translations: $g.x(u) = x(u - c)$ ⇒ $\Phi(g.x) = \Phi(x)$.
     • Small diffeomorphisms: $g.x(u) = x(u - \tau(u))$, with metric $\|g\| = \|\nabla\tau\|_\infty$ (maximum scaling). Linearisation by Lipschitz continuity: $\|\Phi(x) - \Phi(g.x)\| \le C \|\nabla\tau\|_\infty$.
     • Discriminative change of variable: $\|\Phi(x) - \Phi(x')\| \ge C^{-1} |f(x) - f(x')|$.

  25. Fourier Deformation Instability
     • Fourier transform: $\hat{x}(\omega) = \int x(t)\, e^{-i\omega t}\, dt$.
     • Translation $x_c(t) = x(t - c)$ gives $\hat{x}_c(\omega) = e^{-ic\omega}\, \hat{x}(\omega)$ ⇒ the modulus is invariant to translations: $\Phi(x) = |\hat{x}| = |\hat{x}_c|$.
     • Instability to small deformations $x_\tau(t) = x(t - \tau(t))$: $\big|\, |\hat{x}_\tau(\omega)| - |\hat{x}(\omega)| \,\big|$ is big at high frequencies. Already for $\tau(t) = \epsilon t$, $\|\, |\hat{x}| - |\hat{x}_\tau| \,\|$ can be far larger than $\|\nabla\tau\|_\infty \|x\|$.
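A numerical sketch of both facts, with the discrete FFT of an assumed high-frequency signal standing in for the continuous Fourier transform:

```python
import numpy as np

# |x^| is invariant to translation, but a tiny dilation tau(t) = eps * t
# changes |x^| by much more than eps * ||x|| when x oscillates at high frequency.
N = 4096
t = np.linspace(0, 1, N, endpoint=False)
x = np.exp(-((t - 0.5) ** 2) / 0.002) * np.cos(2 * np.pi * 400 * t)   # high-frequency bump

mod = lambda s: np.abs(np.fft.fft(s))
eps = 0.02

x_c = np.roll(x, N // 10)                       # exact (circular) translation by c = 0.1
x_tau = np.interp(t - eps * t, t, x)            # dilation, ||grad tau||_inf = eps

ref = np.linalg.norm(mod(x))                    # = sqrt(N) * ||x|| by Parseval
print("translation: || |x^| - |x_c^| ||   / || x^ || =", np.linalg.norm(mod(x) - mod(x_c)) / ref)
print("deformation: || |x^| - |x_tau^| || / || x^ || =", np.linalg.norm(mod(x) - mod(x_tau)) / ref)
# The first ratio is essentially 0; the second is of order one, far larger than
# eps = 0.02: the Fourier modulus is translation invariant but unstable to deformations.
```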

  26. Wavelet Transform
     • Complex wavelet: $\psi(t) = \psi^a(t) + i\,\psi^b(t)$; dilated: $\psi_\lambda(t) = 2^{-j}\,\psi(2^{-j} t)$ with $\lambda = 2^{-j}$.
     • The supports of $|\hat{\psi}_\lambda(\omega)|^2$ at the different scales $\lambda$, together with a low-pass $|\hat{\phi}(\omega)|^2$, cover the frequency axis of $\hat{x}(\omega)$.
     • Convolution: $x \star \psi_\lambda(t) = \int x(u)\, \psi_\lambda(t - u)\, du$.
     • Wavelet transform: $Wx = \big( x \star \phi(t),\; x \star \psi_\lambda(t) \big)_{t, \lambda}$. Unitary: $\|Wx\|^2 = \|x\|^2$.
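A minimal sketch of such a transform with ad-hoc Morlet-like filters (a real implementation would use a library such as Kymatio); the family is normalized by hand so that the Littlewood-Paley sum equals 1, which is exactly what makes $W$ unitary:

```python
import numpy as np

# Wavelet transform Wx = (x * phi, x * psi_lambda)_lambda computed in the Fourier domain.
N, J = 1024, 5
omega = np.fft.fftfreq(N, d=1.0 / N)                     # integer frequency grid

gauss = lambda w, xi, sig: np.exp(-((np.abs(w) - xi) ** 2) / (2 * sig ** 2))
psi_hats = [gauss(omega, 0.4 * N / 2 ** j, 0.1 * N / 2 ** j) for j in range(J)]   # dilations
phi_hat = np.exp(-(omega ** 2) / (2 * (0.1 * N / 2 ** J) ** 2))                    # low-pass

# Normalize so that sum_lambda |psi^_lambda|^2 + |phi^|^2 = 1 at every frequency.
lp = sum(np.abs(h) ** 2 for h in psi_hats) + np.abs(phi_hat) ** 2
psi_hats = [h / np.sqrt(lp) for h in psi_hats]
phi_hat = phi_hat / np.sqrt(lp)

x = np.random.default_rng(0).normal(size=N)
x_hat = np.fft.fft(x)
Wx = [np.fft.ifft(x_hat * h) for h in psi_hats + [phi_hat]]   # x * psi_lambda and x * phi

energy_ratio = sum(np.linalg.norm(w) ** 2 for w in Wx) / np.linalg.norm(x) ** 2
print("||Wx||^2 / ||x||^2 =", energy_ratio)                   # = 1.0 up to round-off
```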
