

  1. Understanding (or not) Deep Convolutional Networks. Stéphane Mallat, École Normale Supérieure. www.di.ens.fr/data

  2. Deep Neural Networks • Approximations of high-dimensional functions from examples, for classification and regression. • Applications: computer vision, audio and music classification, natural language analysis, bio-medical data, unstructured data… • Related to: neurophysiology of vision and audition, quantum and statistical physics, linguistics, … • Mathematics: statistics, probability, harmonic analysis, geometry, optimization. Little is understood.

  3. High Dimensional Learning • High-dimensional data $x = (x(1), \ldots, x(d)) \in \mathbb{R}^d$. • Classification: estimate a class label $f(x)$ given $n$ sample values $\{x_i, y_i = f(x_i)\}_{i \le n}$. • Image classification: $d = 10^6$, with huge variability inside classes (example classes: anchor, Joshua tree, beaver, lotus water lily). The goal is to find invariants.

  4. Curse of Dimensionality • $f(x)$ can be approximated from examples $\{x_i, f(x_i)\}_i$ by local interpolation if $f$ is regular and there are close examples. • But covering $[0,1]^d$ at a Euclidean distance $\epsilon$ needs $\epsilon^{-d}$ points (already $10^{10}$ points for $\epsilon = 0.1$ and $d = 10$), so in high dimension $\|x - x_i\|$ is always large: the huge variability inside classes cannot be sampled densely. A small numerical illustration follows.
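A minimal numpy sketch of this effect (an illustration added here, not taken from the talk): with a fixed budget of uniform samples in $[0,1]^d$, the distance from a query point to its nearest example grows quickly with $d$.

```python
# Illustration (not from the talk): nearest-neighbour distances between
# uniform samples in [0,1]^d grow with d, so local interpolation has no
# close examples to work with in high dimension.
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # fixed budget of "training" samples

for d in [2, 10, 100, 1000]:
    x = rng.uniform(size=(n, d))           # n points in [0,1]^d
    q = rng.uniform(size=d)                # one query point
    dists = np.linalg.norm(x - q, axis=1)  # Euclidean distances to samples
    print(f"d={d:5d}  nearest neighbour at distance {dists.min():.2f}")
```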

  5. Linearisation by Change of Variable • A change of variable $\Phi(x) = \{\phi_k(x)\}_{k \le d'}$ maps the data $x \in \mathbb{R}^d$ to $\Phi(x) \in \mathbb{R}^{d'}$ so as to nearly linearize $f(x)$, which is then approximated by a linear classifier (a 1D projection along $w$): $\tilde f(x) = \langle \Phi(x), w \rangle = \sum_k w_k \phi_k(x)$.
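A toy sketch of this idea (the target $f$ and the features $\phi_k$ below are hypothetical choices for illustration, not Mallat's): once $\Phi$ linearises $f$, a least-squares fit of $w$ recovers $f$ as $\langle \Phi(x), w \rangle$.

```python
# Toy sketch: approximate a non-linear f by a linear function of a
# hand-chosen change of variable Phi, f_tilde(x) = <Phi(x), w>.
import numpy as np

rng = np.random.default_rng(0)

def f(x):                       # toy target, not linear in x itself
    return np.sin(x[:, 0]) + x[:, 1] ** 2

def phi(x):                     # hypothetical features that linearise f
    return np.stack([np.sin(x[:, 0]), x[:, 1] ** 2, np.ones(len(x))], axis=1)

x_train = rng.uniform(-2, 2, size=(200, 2))
w, *_ = np.linalg.lstsq(phi(x_train), f(x_train), rcond=None)  # fit w

x_test = rng.uniform(-2, 2, size=(5, 2))
print(phi(x_test) @ w)   # f_tilde(x) = <Phi(x), w>
print(f(x_test))         # matches, because Phi linearises f exactly here
```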

  6. Deep Convolution Networks • The revival of an old (1950s) idea (Y. LeCun, G. Hinton): a cascade of linear convolutions $L_1, L_2, \ldots$ interleaved with a non-linear scalar "neuron" $\rho(u) = |u|$, ending with a linear classification on $\Phi(x)$. • The $L_j$ are optimized under architecture constraints, with over $10^9$ parameters. • Exceptional results for images, speech, and bio-data classification; products by Facebook, IBM, Google, Microsoft, Yahoo... Why does it work so well?

  7. The ImageNet Database • A database with 1 million images and 2000 classes.

  8. Alex Deep Convolution Network (A. Krizhevsky, I. Sutskever, G. Hinton) • ImageNet supervised training: $1.2 \times 10^6$ examples, $10^3$ classes; 15.3% testing error in 2012. • Newer networks reach about 5% error, with up to 150 layers. (The learned first-layer filters resemble wavelets.)

  9. Image Classification

  10. Scene Labeling / Car Driving

  11. Overview • Linearisation of symmetries • Deep convolutional network architectures • Simplified convolutional trees: wavelet scattering • Deep networks: contractions, linearization and separations

  12. Separation and Linearization with Φ • Separation: the change of variable must preserve the information needed to compute $f(x) = \tilde f(\Phi(x))$: $\Phi(x) \ne \Phi(x')$ if $f(x) \ne f(x')$. If moreover $\|\Phi(x) - \Phi(x')\| \ge \epsilon\, |f(x) - f(x')|$, then $\tilde f(z)$ is Lipschitz. • Linearization: for $\tilde f(z) = \langle w, z \rangle$ to be linear, $\Phi$ must linearize the level sets $\Omega_t = \{x : f(x) = t\}$: $\forall x \in \Omega_t$, $f(x) = \langle \Phi(x), w \rangle = t$, so the images $\Phi(\Omega_t)$ for all $t$ lie in parallel linear spaces orthogonal to $w$.

  13. Linearization of Symmetries • No local estimation is possible because of the curse of dimensionality. • A symmetry is an operator $g$ which preserves the level sets: $\forall x$, $f(g.x) = f(x)$ (a global property). If $g_1$ and $g_2$ are symmetries then $g_1.g_2$ is also a symmetry, which yields groups $G$ of symmetries (high-dimensional ones). • A change of variable $\Phi(x)$ must linearize the orbits $\{g.x\}_{g \in G}$. Problem: find the symmetries and linearise them.

  14. Contract to Linearize Symmetries • A change of variable $\Phi(x)$ must linearize the orbits $\{g.x\}_{g \in G}$; the problem is to find the symmetries and linearise them. • This is done by regularizing the orbit, removing its high curvature: contraction yields linearisation.

  15. Translation and Deformations • Digit classification between images $x(u)$ and $x'(u)$ must be: globally invariant to the translation group (a small group), and locally invariant to small diffeomorphisms (a huge group). (Video of Philipp Scott Johnson.)

  16. Deep Convolutional Networks • Layer cascade $x_j = \rho L_j x_{j-1}$: the input $x(u)$ is mapped to $x_1(u, k_1), x_2(u, k_2), \ldots, x_J(u, k_J)$, indexed by position $u$ and channel $k_j$, with up to $J = 150$ layers before the classification stage. • $\rho$ is a pointwise contractive non-linearity: $\forall (\alpha, \alpha') \in \mathbb{R}^2$, $|\rho(\alpha) - \rho(\alpha')| \le |\alpha - \alpha'|$. Examples: $\rho(u) = \max(u, 0)$ or $\rho(u) = |u|$. • The $L_j$ are optimised to minimise the training error with stochastic gradient descent and back-propagation. • What is the role of the linear operators $L_j$ and of $\rho$?

  17. Deep Convolutional Networks • In the cascade $x_j = \rho L_j x_{j-1}$, the operator $L_j$ has several roles: • $L_j$ eliminates useless linear variables: dimension reduction. • $L_j$ computes appropriate variables contracted by $\rho$: this linearizes and computes invariants to groups of symmetries. • $L_j$ is a linear preprocessing for the next layers.

  18. Deep Convolutional Networks • $L_j$ is a linear combination of convolutions and subsampling, with a sum across channels: $x_j(u, k_j) = \rho\big( \sum_k x_{j-1}(\cdot, k) \star h_{k_j,k}(u) \big)$. • The filters $h_{k_j,k}(u)$ are optimized to minimise the training error. A sketch of one such layer is given below.
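A literal numpy sketch of this layer (the random filters stand in for trained $h_{k_j,k}$; the filter size and stride are illustrative assumptions):

```python
# One layer x_j = rho(L_j x_{j-1}): each output channel k_j sums
# convolutions of all input channels k with filters h[k_j, k], applies a
# pointwise contraction, then subsamples.
import numpy as np
from scipy.signal import convolve2d

def relu(a):
    return np.maximum(a, 0.0)   # rho(u) = max(u, 0), contractive

def conv_layer(x_prev, h, stride=2):
    # x_prev: (K_in, H, W) channels; h: (K_out, K_in, s, s) filters
    out = []
    for kj in range(h.shape[0]):
        acc = sum(convolve2d(x_prev[k], h[kj, k], mode="same")
                  for k in range(x_prev.shape[0]))   # sum across channels
        out.append(relu(acc)[::stride, ::stride])    # rho, then subsample
    return np.stack(out)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 32, 32))        # e.g. an RGB image
h1 = rng.standard_normal((8, 3, 5, 5)) * 0.1  # placeholder filters
x1 = conv_layer(x0, h1)
print(x1.shape)                              # (8, 16, 16)
```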

  19. Simplified Convolutional Networks • No channel combination: $x_j(u, k_j) = \rho\big( x_{j-1}(\cdot, k_{j-1}) \star h_{k_j,k_{j-1}}(u) \big)$, so $L_j$ applies one convolution and subsampling per channel, with no channel interaction. • Since $\rho(\alpha) = \alpha$ for $\alpha \ge 0$, if $h_{k_j,k_{j-1}}$ is an averaging filter (so that its output stays non-negative when $x_{j-1} \ge 0$), then $x_j(u, k_j) = x_{j-1}(\cdot, k_{j-1}) \star h_{k_j,k_{j-1}}(u)$: the non-linearity disappears on averaged channels.

  20. Convolution Tree Network • With no channel combination, the cascade $x \to \rho L_1 \to x_1 \to \rho L_2 \to x_2 \to \cdots \to \rho L_J \to x_J$ becomes a tree of convolutions, in which every filter is either an averaging (low-pass) filter or a band-pass filter.

  21. Wavelet Transform • In this tree, $W_1$ is a cascade of low-pass filters and a band-pass filter: the averaging and band-pass branches together implement a wavelet transform.

  22. Wavelet Filter Bank • With $\rho(\alpha) = |\alpha|$, cascading $|W_1|$ computes $|x \star \psi_{2^j,\theta}|$ across the scales $2^0, 2^1, 2^2, \ldots, 2^J$, where $\psi_{2^j,\theta}$ is the equivalent filter of the cascade. • This yields a sparse representation.

  23. Scale Separation with Wavelets • Complex wavelet: $\psi(u) = g(u) \exp(i \xi \cdot u)$, $u \in \mathbb{R}^2$, rotated and dilated: $\psi_{2^j,\theta}(u) = 2^{-j} \psi(2^{-j} r_\theta u)$ (the slide shows the real and imaginary parts). • Wavelet transform: $Wx = \big( x \star \phi_{2^J}(u),\; x \star \psi_{2^j,\theta}(u) \big)_{j \le J,\, \theta}$, where $x \star \phi_{2^J}$ is an average and the $x \star \psi_{2^j,\theta}$ carry the higher frequencies. • The modulus $|x \star \psi_{2^j,\theta}(u)|$ eliminates the phase, which encodes local translation. A sketch of such a filter bank follows.
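A minimal numpy sketch of a Gabor/Morlet-style bank of rotated and dilated wavelets (the envelope width sigma, frequency xi, and support size are illustrative assumptions, not the talk's exact parameters):

```python
# Morlet-like wavelet: a Gaussian envelope g times a complex exponential,
# rotated by theta and dilated by 2^j, following
# psi_{2^j,theta}(u) = 2^-j * psi(2^-j * r_theta(u)).
import numpy as np

def morlet(size=32, j=0, theta=0.0, xi=3.0, sigma=0.8):
    half = size // 2
    u1, u2 = np.meshgrid(np.arange(-half, half), np.arange(-half, half),
                         indexing="ij")
    # rotate then dilate the coordinates: v = 2^-j * r_theta(u)
    c, s = np.cos(theta), np.sin(theta)
    v1 = (c * u1 + s * u2) * 2.0 ** -j
    v2 = (-s * u1 + c * u2) * 2.0 ** -j
    g = np.exp(-(v1 ** 2 + v2 ** 2) / (2 * sigma ** 2))  # envelope g(u)
    psi = g * np.exp(1j * xi * v1)                        # g(u) exp(i xi.u)
    return 2.0 ** -j * psi

# one filter per scale j and orientation theta
bank = [morlet(j=j, theta=k * np.pi / 4) for j in range(3) for k in range(4)]
print(len(bank), bank[0].shape, bank[0].dtype)  # 12 (32, 32) complex128
```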

  24. Wavelet Scattering Network • Cascade the wavelet transforms with the modulus $\rho(\alpha) = |\alpha|$ and average the outputs: $x \to \rho W_1 \to \rho W_2 \to \cdots \to \rho W_J \to x_J$, with averaging filters at the output. • Scattering coefficients: $Sx = \big\{\, |\cdots ||x \star \psi_{2^{j_1},\theta_1}| \star \psi_{2^{j_2},\theta_2}| \cdots \star \psi_{2^{j_m},\theta_m}| \star \phi_J \,\big\}_{j_k,\, \theta_k}$.
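A simplified second-order sketch of these coefficients, reusing the `morlet` helper from the previous sketch. It is a stand-in, not a full implementation: a real scattering transform keeps only increasing scales $j_2 > j_1$ and subsamples the averaged maps, which this sketch omits for brevity.

```python
# Second-order scattering sketch: iterate "convolve with a wavelet, take
# the modulus", then average each map with a Gaussian low-pass phi_J.
import numpy as np
from scipy.ndimage import gaussian_filter

def wav_modulus(x, psi):
    # |x * psi| by FFT convolution (circular boundary for simplicity)
    return np.abs(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(psi, x.shape)))

def scattering2(x, bank, sigma_J=8.0):
    coeffs = [gaussian_filter(x, sigma_J)]               # order 0: x * phi_J
    for psi1 in bank:
        u1 = wav_modulus(x, psi1)                        # |x * psi_1|
        coeffs.append(gaussian_filter(u1, sigma_J))      # order 1
        for psi2 in bank:
            u2 = wav_modulus(u1, psi2)                   # ||x*psi_1|*psi_2|
            coeffs.append(gaussian_filter(u2, sigma_J))  # order 2
    return np.stack(coeffs)

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64))
bank = [morlet(j=j, theta=k * np.pi / 4) for j in range(2) for k in range(4)]
S = scattering2(x, bank)
print(S.shape)   # (1 + 8 + 64, 64, 64) averaged coefficient maps
```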

  25. Scattering Properties • The cascade $|W_1|, |W_2|, |W_3|, \ldots$ computes $S_J x = \big( x \star \phi_{2^J},\; |x \star \psi_{\lambda_1}| \star \phi_{2^J},\; ||x \star \psi_{\lambda_1}| \star \psi_{\lambda_2}| \star \phi_{2^J},\; |||x \star \psi_{\lambda_1}| \star \psi_{\lambda_2}| \star \psi_{\lambda_3}| \star \phi_{2^J},\; \ldots \big)_{\lambda_1, \lambda_2, \lambda_3, \ldots}$ • Each $|W_k|$ is contractive and norm-preserving: $\| |W_k x| - |W_k x'| \| \le \|x - x'\|$ and $\|W_k x\| = \|x\|$. • Lemma: $\|[W_k, D_\tau]\| = \|W_k D_\tau - D_\tau W_k\| \le C \|\nabla \tau\|_\infty$. • Theorem: for appropriate wavelets, a scattering is contractive, $\|S_J x - S_J y\| \le \|x - y\|$ ($L^2$ stability); it is translation invariant and it linearizes small deformations: if $D_\tau x(u) = x(u - \tau(u))$ then $\lim_{J \to \infty} \|S_J D_\tau x - S_J x\| \le C \|\nabla \tau\|_\infty \|x\|$.

  26. Digit Classification: MNIST (Joan Bruna) • Supervised estimation of $y = f(x)$ by a linear classifier applied to $S_J x$. • The representation $S_J x$ is invariant to translations and to specific deformations, separates different patterns, and linearises small deformations, with no learning. • Classification errors with 50000 training samples: Conv. Net. (LeCun et al.) 0.5%, Scattering 0.4%.

  27. Classification of Textures (J. Bruna) • CUReT database, 61 classes; supervised estimation of $y = f(x)$ by a linear classifier on texture scattering moments $S_J x$, with $2^J$ equal to the image size. • Classification errors with 46 training samples per class: Fourier spectrum 1%, histogram features 1%, Scattering 0.2%.

  28. Reconstruction from Scattering • Second-order scattering: $S_J x = \{\, x \star \phi_J,\; |x \star \psi_{2^{j_1},\theta_1}| \star \phi_J,\; ||x \star \psi_{2^{j_1},\theta_1}| \star \psi_{2^{j_2},\theta_2}| \star \phi_J \,\}$. If $x$ has $N^2$ pixels and $J = \log_2 N$ (translation invariant), then $S_J x$ has $O((\log_2 N)^2)$ coefficients. • If $x(u)$ is a stationary process, $S_J x \approx \{\, E(x),\; E(|x \star \psi_{2^{j_1},\theta_1}|),\; E(||x \star \psi_{2^{j_1},\theta_1}| \star \psi_{2^{j_2},\theta_2}|) \,\}$. • Gradient descent reconstruction: given a random initialisation $x_0$, iteratively update $x_n$ to minimise $\|S_J x - S_J x_n\|$, as in the sketch below.
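A hedged PyTorch sketch of this descent loop: it substitutes a tiny first-order "scattering" with random filters standing in for wavelets, so that autograd supplies the gradient. The filters, sizes, learning rate, and optimiser are illustrative assumptions, not the talk's setup, which uses the full wavelet $S_J$.

```python
# Reconstruction sketch: start from a random x_0 and descend the gradient
# of ||S x_target - S x_n||^2 for a simplified differentiable S.
import torch

torch.manual_seed(0)
N = 32

def low_pass(x, k=8):                     # crude phi_J: local averaging
    return torch.nn.functional.avg_pool2d(x[None, None], k)[0, 0]

filters = torch.randn(6, N, N)            # random stand-ins for wavelets
f_hat = torch.fft.fft2(filters)

def scatter(x):                           # first order: |x * psi| * phi_J
    u = torch.abs(torch.fft.ifft2(torch.fft.fft2(x) * f_hat))
    return torch.stack([low_pass(u[i]) for i in range(len(u))])

x_target = torch.rand(N, N)
s_target = scatter(x_target)

x = torch.rand(N, N, requires_grad=True)  # random initialisation x_0
opt = torch.optim.Adam([x], lr=0.05)
for n in range(200):
    loss = ((scatter(x) - s_target) ** 2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))                        # decreases toward 0
```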

  29. Translation Invariant Models (Joan Bruna) • Figure: original textures and 2D turbulence, compared with a Gaussian process model having the same second-order moments, and with syntheses from the $O((\log_2 N)^2)$ scattering coefficients of order 2, which also capture sparsity.

  30. Complex Image Classification (Edouard Oyallon) • Example classes: Joshua tree, beaver, anchor, metronome, water lily, boat. • Supervised estimation of $y = f(x)$ by a linear classifier on $S_J x$, with no learning of the representation. • Classification errors on CIFAR-10: Deep-Net 7%, Scattering/Unsupervised 20%.

  31. Generation with Deep Networks (A. Radford, L. Metz, S. Chintala) • Unsupervised generative models built with convolutional networks. • Trained on a database of faces: linearization. • Trained on a database including bedrooms: interpolations.
