
Deep Neural Network Mathematical Mysteries for High Dimensional Learning - PowerPoint PPT Presentation



  1. Deep Neural Network Mathematical Mysteries for High Dimensional Learning
  Stéphane Mallat, École Normale Supérieure, www.di.ens.fr/data

  2. High Dimensional Learning
  • High-dimensional data $x = (x(1), \ldots, x(d)) \in \mathbb{R}^d$.
  • Classification: estimate a class label $f(x)$ given $n$ sample values $\{x_i, y_i = f(x_i)\}_{i \le n}$.
  • Example: image classification with $d = 10^6$ (classes such as Anchor, Joshua Tree, Beaver, Lotus Water Lily): huge variability inside classes, so one must find invariants.

  3. High Dimensional Learning
  • High-dimensional data $x = (x(1), \ldots, x(d)) \in \mathbb{R}^d$.
  • Regression: approximate a functional $f(x)$ given $n$ sample values $\{x_i, y_i = f(x_i) \in \mathbb{R}\}_{i \le n}$.
  • Physics: energy $f(x)$ of a state vector $x$ (astronomy, quantum chemistry). Importance of symmetries.

  4. Curse of Dimensionality
  • $f(x)$ can be approximated from examples $\{x_i, f(x_i)\}_i$ by local interpolation if $f$ is regular and there are close examples.
  [Figure: a regular grid of samples covering the unit square, $d = 2$.]
  • Need $\epsilon^{-d}$ points to cover $[0,1]^d$ at a Euclidean distance $\epsilon$.
  • Problem: in high dimension, $\|x - x_i\|$ is always large.

  5. Multiscale Separation
  • Variables $x(u)$ indexed by a low-dimensional $u$: time/space... pixels in images, particles in physics, words in text...
  • Multiscale interactions of $d$ variables: from $d^2$ pairwise interactions to $O(\log_2 d)$ multiscale interactions.
  • Multiscale analysis: wavelets on groups of symmetries; hierarchical architecture.

  6. Overview
  • 1 hidden layer networks, approximation theory and the curse of dimensionality
  • Kernel learning
  • Dimension reduction with a change of variables
  • Deep neural networks and symmetry groups
  • Wavelet scattering transforms
  • Applications and many open questions
  Reference: Understanding Deep Convolutional Networks, arXiv 2016.

  7. Learning as an Approximation
  • To estimate $f(x)$ from a sampling $\{x_i, y_i = f(x_i)\}_{i \le M}$ we must build an $M$-parameter approximation $f_M$ of $f$.
  • Precise sparse approximation requires some "regularity".
  • For binary classification, $f(x) = 1$ if $x \in \Omega$ and $f(x) = -1$ if $x \notin \Omega$; then $f(x) = \mathrm{sign}(\tilde f(x))$ where $\tilde f$ is potentially regular.
  • What type of regularity? How to compute $f_M$?

  8. 1 Hidden Layer Neural Networks
  • One-hidden-layer neural network (ridge functions):
    $f_M(x) = \sum_{n=1}^{M} \alpha_n\, \rho(w_n \cdot x + b_n)$, with $w_n \cdot x = \sum_k w_{k,n}\, x(k)$.
  • The $\{w_{k,n}\}_{k,n}$ and $\{\alpha_n\}_n$ are learned: a non-linear approximation with $M$ terms.
  • Theorem (Cybenko, Hornik, Stinchcombe, White): for "reasonable" bounded $\rho(u)$ and appropriate choices of $w_n$ and $\alpha_n$,
    $\forall f \in L^2[0,1]^d$, $\lim_{M \to \infty} \|f - f_M\| = 0$.
  • No big deal: the curse of dimensionality is still there.
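
  As a concrete illustration of the approximant $f_M(x) = \sum_{n=1}^{M} \alpha_n \rho(w_n \cdot x + b_n)$, here is a minimal NumPy sketch; the random choice of the ridge directions $w_n, b_n$ and the least-squares fit of the $\alpha_n$ are illustrative assumptions, not the construction used in the theorem.

  ```python
  import numpy as np

  def relu(u):
      return np.maximum(u, 0.0)

  def fit_one_hidden_layer(X, y, M, rng):
      """Fit f_M(x) = sum_n alpha_n * rho(w_n.x + b_n).

      Illustrative scheme: random ridge directions w_n and offsets b_n,
      then a least-squares fit of the output weights alpha_n.
      """
      d = X.shape[1]
      W = rng.normal(size=(d, M))          # ridge directions w_n
      b = rng.uniform(-1.0, 1.0, size=M)   # offsets b_n
      H = relu(X @ W + b)                  # hidden activations rho(w_n.x + b_n)
      alpha, *_ = np.linalg.lstsq(H, y, rcond=None)
      return W, b, alpha

  def predict(X, W, b, alpha):
      return relu(X @ W + b) @ alpha

  # Toy regression in d = 5 dimensions with a smooth target f(x).
  rng = np.random.default_rng(1)
  X = rng.uniform(0, 1, size=(2000, 5))
  y = np.sin(X.sum(axis=1))
  W, b, alpha = fit_one_hidden_layer(X, y, M=200, rng=rng)
  print(np.mean((predict(X, W, b, alpha) - y) ** 2))   # training mean-square error
  ```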

  9. 1 Hidden Layer Neural Networks
  • One-hidden-layer neural network: $f_M(x) = \sum_{n=1}^{M} \alpha_n\, \rho(w_n \cdot x + b_n)$, with $w_n \cdot x = \sum_k w_{k,n}\, x(k)$; the $\{w_{k,n}\}_{k,n}$ and $\{\alpha_n\}_n$ are learned (non-linear approximation).
  • Fourier series: $\rho(u) = e^{iu}$ gives $f_M(x) = \sum_{n=1}^{M} \alpha_n\, e^{i w_n \cdot x}$.
  • For nearly all $\rho$: essentially the same approximation results.

  10. Piecewise Linear Approximation
  • Piecewise linear approximation with $\rho(u) = \max(u, 0)$:
    $\tilde f(x) = \sum_n a_n\, \rho(x - n\epsilon)$.
  • If $f$ is Lipschitz, $|f(x) - f(x')| \le C\,|x - x'|$, then $|f(x) - \tilde f(x)| \le C\epsilon$.
  • Need $M = \epsilon^{-1}$ points to cover $[0,1]$ at a distance $\epsilon$, hence $\|f - f_M\| \le C\, M^{-1}$.
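
  A minimal sketch of the 1-D piecewise linear approximation $\tilde f(x) = \sum_n a_n \rho(x - n\epsilon)$, assuming a least-squares fit of the coefficients $a_n$ on a fine grid (an illustrative choice); the printed errors decay as $M$ grows, consistent with the $O(M^{-1})$ rate for a Lipschitz target.

  ```python
  import numpy as np

  def relu(u):
      return np.maximum(u, 0.0)

  def relu_interpolant(f, M, x):
      """Piecewise-linear approximation f~(x) = sum_n a_n * relu(x - n*eps)
      on [0, 1], with knot spacing eps = 1/M; coefficients a_n fitted by
      least squares on a fine grid (illustrative choice)."""
      eps = 1.0 / M
      knots = np.arange(-1, M) * eps                   # one extra knot left of 0
      grid = np.linspace(0, 1, 20 * M)
      A = relu(grid[:, None] - knots[None, :])
      a, *_ = np.linalg.lstsq(A, f(grid), rcond=None)
      return relu(x[:, None] - knots[None, :]) @ a

  f = lambda x: np.abs(np.sin(4 * x))                  # Lipschitz target on [0, 1]
  x = np.linspace(0, 1, 5000)
  for M in (10, 20, 40, 80):
      err = np.max(np.abs(relu_interpolant(f, M, x) - f(x)))
      print(M, err)                                    # error shrinks as M grows
  ```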

  11. Linear Ridge Approximation
  • Piecewise linear ridge approximation for $x \in [0,1]^d$, with $\rho(u) = \max(u, 0)$:
    $\tilde f(x) = \sum_n a_n\, \rho(w_n \cdot x - n\epsilon)$.
  • If $f$ is Lipschitz, $|f(x) - f(x')| \le C\,\|x - x'\|$, then sampling at a distance $\epsilon$ gives $|f(x) - \tilde f(x)| \le C\epsilon$.
  • Need $M = \epsilon^{-d}$ points to cover $[0,1]^d$ at a distance $\epsilon$, hence $\|f - f_M\| \le C\, M^{-1/d}$: curse of dimensionality!

  12. Approximation with Regularity
  • What prior condition makes learning possible?
  • Approximation of regular functions in $C^s[0,1]^d$: $|f(x) - p_u(x)| \le C\,|x - u|^s$ for all $x, u$, with $p_u(x)$ a polynomial.
  • If $|x - u| \le \epsilon^{1/s}$ then $|f(x) - p_u(x)| \le C\epsilon$, so we need $M = \epsilon^{-d/s}$ points to cover $[0,1]^d$ at a distance $\epsilon^{1/s}$, hence $\|f - f_M\| \le C\, M^{-s/d}$.
  • One cannot do better in $C^s[0,1]^d$; this is not good because $s \ll d$. Failure of classical approximation theory.
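
  To see how bad the rate $\|f - f_M\| \le C\, M^{-s/d}$ is, a back-of-the-envelope computation with illustrative numbers (not taken from the slides):

  ```latex
  \[
  \|f - f_M\| \le C\,M^{-s/d}
  \;\Longrightarrow\;
  M \gtrsim \epsilon^{-d/s}.
  \]
  % Illustrative numbers: for accuracy \epsilon = 10^{-1}, regularity s = 2
  % and dimension d = 20, one needs M \gtrsim (10^{-1})^{-20/2} = 10^{10}
  % samples, already intractable; for images with d of order 10^6 the bound
  % is hopeless, hence the failure when s << d.
  ```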

  13. Kernel Learning
  • Change of variable $\Phi(x) = \{\phi_k(x)\}_{k \le d'}$ to nearly linearize $f(x)$, which is approximated by
    $\tilde f(x) = \langle \Phi(x), w \rangle = \sum_k w_k\, \phi_k(x)$.
  • The data $x \in \mathbb{R}^d$ is mapped to $\Phi(x) \in \mathbb{R}^{d'}$, followed by a linear classifier (a 1D projection on $w$).
  • The metric changes from $\|x - x'\|$ to $\|\Phi(x) - \Phi(x')\|$.
  • How and when is it possible to find such a $\Phi$? What "regularity" of $f$ is needed?
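
  A toy sketch of the change-of-variable idea: the degree-2 polynomial feature map below is only an illustrative stand-in for $\Phi$, chosen so that a non-linear $f$ becomes exactly linear in $\Phi(x)$.

  ```python
  import numpy as np

  def phi(X):
      """Illustrative change of variable Phi(x): degree-2 polynomial features
      of a 2-D input. Any feature map that (nearly) linearises f could be
      substituted here."""
      x1, x2 = X[:, 0], X[:, 1]
      return np.column_stack([np.ones(len(X)), x1, x2, x1 * x2, x1**2, x2**2])

  rng = np.random.default_rng(0)
  X = rng.uniform(-1, 1, size=(500, 2))
  f = X[:, 0] ** 2 + X[:, 1] ** 2                    # non-linear in x ...
  w, *_ = np.linalg.lstsq(phi(X), f, rcond=None)     # ... but linear in Phi(x)
  print(np.max(np.abs(phi(X) @ w - f)))              # essentially zero
  ```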

  14. Increase Dimensionality
  • Proposition: there exists a hyperplane separating any two subsets of $N$ points $\{\Phi(x_i)\}_i$ in dimension $d' > N + 1$ if the $\{\Phi(x_i)\}_i$ are not in an affine subspace of dimension $< N$.
  • So one can choose $\Phi$ to increase dimensionality, but this leads to overfitting: the problem is generalisation.
  • Example: Gaussian kernel $\langle \Phi(x), \Phi(x') \rangle = \exp\!\big(-\|x - x'\|^2 / (2\sigma^2)\big)$; $\Phi(x)$ is of dimension $d' = \infty$.
  • If $\sigma$ is small, this behaves like a nearest-neighbour classifier.
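
  A sketch of the Gaussian kernel in use; the kernel ridge fit, the toy two-class data and the values of $\sigma$ and the regularisation are all illustrative assumptions, not part of the slide.

  ```python
  import numpy as np

  def gaussian_kernel(X, Z, sigma):
      """<Phi(x), Phi(z)> = exp(-||x - z||^2 / (2 sigma^2))."""
      d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
      return np.exp(-d2 / (2 * sigma ** 2))

  # Kernel ridge classifier on two toy Gaussian classes (illustrative setup).
  rng = np.random.default_rng(0)
  X = np.vstack([rng.normal(-1, 0.5, (50, 2)), rng.normal(1, 0.5, (50, 2))])
  y = np.hstack([-np.ones(50), np.ones(50)])

  sigma, lam = 0.3, 1e-3
  K = gaussian_kernel(X, X, sigma)
  alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

  def classify(Xtest):
      return np.sign(gaussian_kernel(Xtest, X, sigma) @ alpha)

  print((classify(X) == y).mean())   # training accuracy; with small sigma the
                                     # rule behaves like a nearest-neighbour classifier
  ```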

  15. Reduction of Dimensionality
  • Discriminative change of variable $\Phi(x)$: $\Phi(x) \ne \Phi(x')$ if $f(x) \ne f(x')$, so there exists $\tilde f$ with $f(x) = \tilde f(\Phi(x))$.
  • If $\tilde f$ is Lipschitz, $|\tilde f(z) - \tilde f(z')| \le C\,\|z - z'\|$ with $z = \Phi(x)$, then $|f(x) - f(x')| \le C\,\|\Phi(x) - \Phi(x')\|$.
  • Discriminative: $\|\Phi(x) - \Phi(x')\| \ge C^{-1}\, |f(x) - f(x')|$.
  • For $x \in \Omega$, if $\Phi(\Omega)$ is bounded and of low dimension $d'$, then $\|f - f_M\| \le C\, M^{-1/d'}$.

  16. Deep Convolution Networks
  • The revival of neural networks (Y. LeCun): a cascade of linear convolutions $L_j$ and pointwise non-linearities $\rho(u) = \max(u, 0)$ (one scalar non-linearity per neuron), building hierarchical invariants and a linearization $\Phi(x)$, followed by a linear classification $y = \tilde f(\Phi(x))$.
  • The $L_j$ are optimized under architecture constraints: over $10^9$ parameters.
  • Exceptional results for images, speech, language, bio-data...
  • Why does it work so well? A difficult problem.

  17. ImageNet Data Basis • Data basis with 1 million images and 2000 classes

  18. Alex Deep Convolution Network
  • A. Krizhevsky, I. Sutskever, G. Hinton. ImageNet supervised training: $1.2 \times 10^6$ examples, $10^3$ classes; 15.3% testing error in 2012.
  • Newer networks reach about 5% error, with up to 150 layers!
  • [Figure: learned first-layer filters, which resemble wavelets.]

  19. Image Classification

  20. Scene Labeling / Car Driving

  21. Why Understanding?
  • Szegedy, Zaremba, Sutskever, Bruna, Erhan, Goodfellow, Fergus: an image $x$ that is correctly classified becomes $\tilde x = x + \epsilon$, with $\|\epsilon\| < 10^{-2}\, \|x\|$, classified as an ostrich.
  • Trial and error testing cannot guarantee reliability.
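
  The following toy sketch illustrates why a tiny perturbation can flip a decision in high dimension; it uses a plain linear score, not the actual attack on deep networks from the cited paper, and every number in it is an illustrative assumption.

  ```python
  import numpy as np

  # Toy adversarial perturbation x~ = x + eps with ||eps|| << ||x||:
  # for a linear score w.x, a tiny step along w flips the sign whenever
  # the margin is small, even though eps is negligible relative to x.
  rng = np.random.default_rng(0)
  d = 10**4
  w = rng.normal(size=d)
  w /= np.linalg.norm(w)
  x = rng.normal(size=d)
  x -= (x @ w - 0.01) * w                              # place x just on the + side

  eps = -0.02 * w                                      # tiny, targeted perturbation
  print(np.linalg.norm(eps) / np.linalg.norm(x))       # about 2e-4: eps is tiny
  print(np.sign(x @ w), np.sign((x + eps) @ w))        # +1.0 -1.0: the label flips
  ```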

  22. Deep Convolutional Networks
  • Layers $x_0(u) = x(u)$, $x_1(u, k_1)$, $x_2(u, k_2)$, ..., $x_J(u, k_J)$, with $x_j = \rho\, L_j\, x_{j-1}$, followed by a classification stage.
  • $L_j$ is a linear combination of convolutions and subsampling, summed across channels:
    $x_j(u, k_j) = \rho\Big( \sum_k x_{j-1}(\cdot, k) \star h_{k_j, k}(u) \Big)$.
  • $\rho$ is contractive: $|\rho(u) - \rho(u')| \le |u - u'|$, e.g. $\rho(u) = \max(u, 0)$ or $\rho(u) = |u|$.
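
  A minimal 1-D NumPy sketch of one layer $x_j(u, k_j) = \rho\big(\sum_k x_{j-1}(\cdot, k) \star h_{k_j, k}(u)\big)$; the filter sizes, the number of channels and the subsampling stride are illustrative assumptions.

  ```python
  import numpy as np

  def conv_layer(x_prev, h, stride=2):
      """One layer x_j = rho(L_j x_{j-1}) for 1-D signals (illustrative shapes).

      x_prev : (K_in, N)         previous-layer channels x_{j-1}(., k)
      h      : (K_out, K_in, S)  filters h_{k_j, k}
      Returns (K_out, N/stride): rho of the sum of convolutions across input
      channels, subsampled by `stride`.
      """
      K_out, K_in, S = h.shape
      out = []
      for kj in range(K_out):
          acc = sum(np.convolve(x_prev[k], h[kj, k], mode="same")
                    for k in range(K_in))              # sum across channels
          out.append(np.maximum(acc, 0.0)[::stride])   # rho = ReLU, then subsample
      return np.array(out)

  rng = np.random.default_rng(0)
  x0 = rng.normal(size=(1, 256))                       # input signal, one channel
  h1 = rng.normal(size=(8, 1, 5))                      # layer-1 filters
  h2 = rng.normal(size=(16, 8, 5))                     # layer-2 filters
  x1 = conv_layer(x0, h1)
  x2 = conv_layer(x1, h2)
  print(x1.shape, x2.shape)                            # (8, 128) (16, 64)
  ```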

  23. Linearisation in Deep Networks
  • A. Radford, L. Metz, S. Chintala.
  • Trained on a data basis of faces: linearization.
  • On a data basis including bedrooms: interpolations.

  24. Many Questions
  • Layers $x(u) \to x_1(u, k_1) \to x_2(u, k_2) \to \cdots \to x_J(u, k_J)$ with $x_j = \rho\, L_j\, x_{j-1}$, then classification.
  • Why convolutions? Translation covariance.
  • Why no overfitting? Contractions, dimension reduction.
  • Why a hierarchical cascade?
  • Why introduce non-linearities?
  • How and what to linearise?
  • What are the roles of the multiple channels in each layer?

  25. Linear Dimension Reduction
  • Classes $\Omega_1, \Omega_2, \Omega_3$ are level sets of $f(x)$: $\Omega_t = \{x : f(x) = t\}$.
  • If the level sets (classes) are parallel to a linear space, then variables are eliminated by linear projections $\Phi(x)$: invariants.

  26. Linearise for Dimensionality Reduction
  • Classes are the level sets $\Omega_t = \{x : f(x) = t\}$ of $f(x)$: $\Omega_1, \Omega_2, \Omega_3$.
  • If the level sets $\Omega_t$ are not parallel to a linear space: linearise them with a change of variable $\Phi(x)$, then reduce dimension with linear projections.
  • Difficult because the $\Omega_t$ are high-dimensional, irregular, and known only from few samples.

  27. Level Set Geometry: Symmetries
  • Because of the curse of dimensionality, one cannot rely on the local geometry of the level sets (the classes $\Omega_1, \Omega_2$) but on their global geometry, characterised by their global symmetries.
  • A symmetry is an operator $g$ which preserves the level sets: $f(g.x) = f(x)$ for all $x$ (a global property).
  • If $g_1$ and $g_2$ are symmetries then $g_1.g_2$ is also a symmetry: $f(g_1.g_2.x) = f(g_2.x) = f(x)$.
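
  A trivial numerical check of the definition, assuming a toy invariant $f$ (the sum of the coordinates) and circular shifts as the operators $g$: each shift preserves $f$, and so does their composition.

  ```python
  import numpy as np

  # Toy symmetry check: f(x) = sum of coordinates is invariant to circular
  # shifts, and the composition of two shifts is again a shift, hence again
  # a symmetry of f.
  f = lambda x: x.sum()
  g1 = lambda x: np.roll(x, 3)
  g2 = lambda x: np.roll(x, -7)

  x = np.random.default_rng(0).normal(size=64)
  print(np.isclose(f(g1(x)), f(x)))            # g1 is a symmetry
  print(np.isclose(f(g1(g2(x))), f(x)))        # so is g1.g2
  ```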

  28. Groups of Symmetries
  • $G = \{\text{all symmetries}\}$ is a group (unknown): $g.g' \in G$ for all $(g, g') \in G^2$; inverse $g^{-1} \in G$ for all $g \in G$; associative $(g.g').g'' = g.(g'.g'')$. If commutative, $g.g' = g'.g$: Abelian group.
  • Group of dimension $n$ if it has $n$ generators: $g = g_1^{p_1} g_2^{p_2} \cdots g_n^{p_n}$.
  • Lie group: infinitely small generators (Lie algebra).

  29. Translation and Deformations
  • Digit classification ($\Omega_3$, $\Omega_5$): $x'(u) = x(u - \tau(u))$.
  • Globally invariant to the translation group: a small group.
  • Locally invariant to small diffeomorphisms: a huge group.
  • (Video of Philipp Scott Johnson.)
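
  A 1-D sketch of the warping $x'(u) = x(u - \tau(u))$, with a constant $\tau$ (global translation) and a small oscillating $\tau$ (small diffeomorphism); the toy signal, the linear interpolation and the boundary handling are illustrative choices.

  ```python
  import numpy as np

  def deform(x, tau):
      """Warp x'(u) = x(u - tau(u)) by linear interpolation (1-D sketch);
      values outside the support are clamped to the edges by np.interp."""
      u = np.arange(len(x), dtype=float)
      return np.interp(u - tau(u), u, x)

  u = np.arange(256, dtype=float)
  x = np.sin(2 * np.pi * u / 64)                        # toy "digit" signal

  x_translated = deform(x, lambda u: 5.0 * np.ones_like(u))              # translation
  x_deformed = deform(x, lambda u: 3.0 * np.sin(2 * np.pi * u / 256))    # small diffeo

  print(np.max(np.abs(x_translated - x)), np.max(np.abs(x_deformed - x)))
  ```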
