
Deep Neural Network Mathematical Mysteries for High Dimensional Learning - PowerPoint PPT Presentation



  1. Deep Neural Network Mathematical Mysteries for High Dimensional Learning
  Stéphane Mallat, École Normale Supérieure, www.di.ens.fr/data

  2. High Dimensional Learning
  • High-dimensional data $x = (x(1), \ldots, x(d)) \in \mathbb{R}^d$.
  • Classification: estimate a class label $f(x)$ given $n$ sample values $\{x_i, y_i = f(x_i)\}_{i \le n}$.
  • Example: image classification with $d = 10^6$ (classes such as Anchor, Joshua Tree, Beaver, Lotus Water Lily): huge variability inside classes, so one must find invariants.

  3. High Dimensional Learning
  • High-dimensional data $x = (x(1), \ldots, x(d)) \in \mathbb{R}^d$.
  • Regression: approximate a functional $f(x)$ given $n$ sample values $\{x_i, y_i = f(x_i) \in \mathbb{R}\}_{i \le n}$.
  • Physics: energy $f(x)$ of a state vector $x$ (astronomy, quantum chemistry). Importance of symmetries.

  4. Curse of Dimensionality
  • $f(x)$ can be approximated from examples $\{x_i, f(x_i)\}_i$ by local interpolation if $f$ is regular and there are close examples.
  [Figure: a regular grid of samples covering the unit square, $d = 2$.]
  • Need $\epsilon^{-d}$ points to cover $[0,1]^d$ at a Euclidean distance $\epsilon$.
  • Problem: in high dimension, $\|x - x_i\|$ is always large.

  5. Multiscale Separation
  • Variables $x(u)$ indexed by a low-dimensional $u$: time/space... pixels in images, particles in physics, words in text...
  • Multiscale interactions of $d$ variables: from $d^2$ pairwise interactions to $O(\log_2 d)$ multiscale interactions.
  • Multiscale analysis: wavelets on groups of symmetries; hierarchical architecture.

  6. Overview
  • 1 hidden layer networks, approximation theory and the curse of dimensionality
  • Kernel learning
  • Dimension reduction with a change of variables
  • Deep neural networks and symmetry groups
  • Wavelet scattering transforms
  • Applications and many open questions
  Reference: Understanding Deep Convolutional Networks, arXiv 2016.

  7. Learning as an Approximation
  • To estimate $f(x)$ from a sampling $\{x_i, y_i = f(x_i)\}_{i \le M}$ we must build an $M$-parameter approximation $f_M$ of $f$.
  • Precise sparse approximation requires some "regularity".
  • For binary classification, $f(x) = 1$ if $x \in \Omega$ and $f(x) = -1$ if $x \notin \Omega$; then $f(x) = \mathrm{sign}(\tilde f(x))$ where $\tilde f$ is potentially regular.
  • What type of regularity? How to compute $f_M$?

  8. 1 Hidden Layer Neural Networks
  • One-hidden-layer neural network (ridge functions):
    $f_M(x) = \sum_{n=1}^{M} \alpha_n\, \rho(w_n \cdot x + b_n)$, with $w_n \cdot x = \sum_k w_{k,n}\, x(k)$.
  • The $\{w_{k,n}\}_{k,n}$ and $\{\alpha_n\}_n$ are learned: a non-linear approximation with $M$ terms.
  • Theorem (Cybenko, Hornik, Stinchcombe, White): for "reasonable" bounded $\rho(u)$ and appropriate choices of $w_n$ and $\alpha_n$,
    $\forall f \in L^2[0,1]^d$, $\lim_{M \to \infty} \|f - f_M\| = 0$.
  • No big deal: the curse of dimensionality is still there.
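
  As a concrete illustration of the approximant $f_M(x) = \sum_{n=1}^{M} \alpha_n \rho(w_n \cdot x + b_n)$, here is a minimal NumPy sketch; the random choice of the ridge directions $w_n, b_n$ and the least-squares fit of the $\alpha_n$ are illustrative assumptions, not the construction used in the theorem.

  ```python
  import numpy as np

  def relu(u):
      return np.maximum(u, 0.0)

  def fit_one_hidden_layer(X, y, M, rng):
      """Fit f_M(x) = sum_n alpha_n * rho(w_n.x + b_n).

      Illustrative scheme: random ridge directions w_n and offsets b_n,
      then a least-squares fit of the output weights alpha_n.
      """
      d = X.shape[1]
      W = rng.normal(size=(d, M))          # ridge directions w_n
      b = rng.uniform(-1.0, 1.0, size=M)   # offsets b_n
      H = relu(X @ W + b)                  # hidden activations rho(w_n.x + b_n)
      alpha, *_ = np.linalg.lstsq(H, y, rcond=None)
      return W, b, alpha

  def predict(X, W, b, alpha):
      return relu(X @ W + b) @ alpha

  # Toy regression in d = 5 dimensions with a smooth target f(x).
  rng = np.random.default_rng(1)
  X = rng.uniform(0, 1, size=(2000, 5))
  y = np.sin(X.sum(axis=1))
  W, b, alpha = fit_one_hidden_layer(X, y, M=200, rng=rng)
  print(np.mean((predict(X, W, b, alpha) - y) ** 2))   # training mean-square error
  ```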

  9. 1 Hidden Layer Neural Networks
  • One-hidden-layer neural network: $f_M(x) = \sum_{n=1}^{M} \alpha_n\, \rho(w_n \cdot x + b_n)$, with $w_n \cdot x = \sum_k w_{k,n}\, x(k)$; the $\{w_{k,n}\}_{k,n}$ and $\{\alpha_n\}_n$ are learned (non-linear approximation).
  • Fourier series: $\rho(u) = e^{iu}$ gives $f_M(x) = \sum_{n=1}^{M} \alpha_n\, e^{i w_n \cdot x}$.
  • For nearly all $\rho$: essentially the same approximation results.

  10. Piecewise Linear Approximation
  • Piecewise linear approximation with $\rho(u) = \max(u, 0)$:
    $\tilde f(x) = \sum_n a_n\, \rho(x - n\epsilon)$.
  • If $f$ is Lipschitz, $|f(x) - f(x')| \le C\,|x - x'|$, then $|f(x) - \tilde f(x)| \le C\epsilon$.
  • Need $M = \epsilon^{-1}$ points to cover $[0,1]$ at a distance $\epsilon$, hence $\|f - f_M\| \le C\, M^{-1}$.
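
  A minimal sketch of the 1-D piecewise linear approximation $\tilde f(x) = \sum_n a_n \rho(x - n\epsilon)$, assuming a least-squares fit of the coefficients $a_n$ on a fine grid (an illustrative choice); the printed errors decay as $M$ grows, consistent with the $O(M^{-1})$ rate for a Lipschitz target.

  ```python
  import numpy as np

  def relu(u):
      return np.maximum(u, 0.0)

  def relu_interpolant(f, M, x):
      """Piecewise-linear approximation f~(x) = sum_n a_n * relu(x - n*eps)
      on [0, 1], with knot spacing eps = 1/M; coefficients a_n fitted by
      least squares on a fine grid (illustrative choice)."""
      eps = 1.0 / M
      knots = np.arange(-1, M) * eps                   # one extra knot left of 0
      grid = np.linspace(0, 1, 20 * M)
      A = relu(grid[:, None] - knots[None, :])
      a, *_ = np.linalg.lstsq(A, f(grid), rcond=None)
      return relu(x[:, None] - knots[None, :]) @ a

  f = lambda x: np.abs(np.sin(4 * x))                  # Lipschitz target on [0, 1]
  x = np.linspace(0, 1, 5000)
  for M in (10, 20, 40, 80):
      err = np.max(np.abs(relu_interpolant(f, M, x) - f(x)))
      print(M, err)                                    # error shrinks as M grows
  ```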

  11. Linear Ridge Approximation
  • Piecewise linear ridge approximation for $x \in [0,1]^d$, with $\rho(u) = \max(u, 0)$:
    $\tilde f(x) = \sum_n a_n\, \rho(w_n \cdot x - n\epsilon)$.
  • If $f$ is Lipschitz, $|f(x) - f(x')| \le C\,\|x - x'\|$, then sampling at a distance $\epsilon$ gives $|f(x) - \tilde f(x)| \le C\epsilon$.
  • Need $M = \epsilon^{-d}$ points to cover $[0,1]^d$ at a distance $\epsilon$, hence $\|f - f_M\| \le C\, M^{-1/d}$: curse of dimensionality!

  12. Approximation with Regularity
  • What prior condition makes learning possible?
  • Approximation of regular functions in $C^s[0,1]^d$: $|f(x) - p_u(x)| \le C\,|x - u|^s$ for all $x, u$, with $p_u(x)$ a polynomial.
  • If $|x - u| \le \epsilon^{1/s}$ then $|f(x) - p_u(x)| \le C\epsilon$, so we need $M = \epsilon^{-d/s}$ points to cover $[0,1]^d$ at a distance $\epsilon^{1/s}$, hence $\|f - f_M\| \le C\, M^{-s/d}$.
  • One cannot do better in $C^s[0,1]^d$; this is not good because $s \ll d$. Failure of classical approximation theory.
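
  To see how bad the rate $\|f - f_M\| \le C\, M^{-s/d}$ is, a back-of-the-envelope computation with illustrative numbers (not taken from the slides):

  ```latex
  \[
  \|f - f_M\| \le C\,M^{-s/d}
  \;\Longrightarrow\;
  M \gtrsim \epsilon^{-d/s}.
  \]
  % Illustrative numbers: for accuracy \epsilon = 10^{-1}, regularity s = 2
  % and dimension d = 20, one needs M \gtrsim (10^{-1})^{-20/2} = 10^{10}
  % samples, already intractable; for images with d of order 10^6 the bound
  % is hopeless, hence the failure when s << d.
  ```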

  13. Kernel Learning
  • Change of variable $\Phi(x) = \{\phi_k(x)\}_{k \le d'}$ to nearly linearize $f(x)$, which is approximated by
    $\tilde f(x) = \langle \Phi(x), w \rangle = \sum_k w_k\, \phi_k(x)$.
  • The data $x \in \mathbb{R}^d$ is mapped to $\Phi(x) \in \mathbb{R}^{d'}$, followed by a linear classifier (a 1D projection on $w$).
  • The metric changes from $\|x - x'\|$ to $\|\Phi(x) - \Phi(x')\|$.
  • How and when is it possible to find such a $\Phi$? What "regularity" of $f$ is needed?
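
  A toy sketch of the change-of-variable idea: the degree-2 polynomial feature map below is only an illustrative stand-in for $\Phi$, chosen so that a non-linear $f$ becomes exactly linear in $\Phi(x)$.

  ```python
  import numpy as np

  def phi(X):
      """Illustrative change of variable Phi(x): degree-2 polynomial features
      of a 2-D input. Any feature map that (nearly) linearises f could be
      substituted here."""
      x1, x2 = X[:, 0], X[:, 1]
      return np.column_stack([np.ones(len(X)), x1, x2, x1 * x2, x1**2, x2**2])

  rng = np.random.default_rng(0)
  X = rng.uniform(-1, 1, size=(500, 2))
  f = X[:, 0] ** 2 + X[:, 1] ** 2                    # non-linear in x ...
  w, *_ = np.linalg.lstsq(phi(X), f, rcond=None)     # ... but linear in Phi(x)
  print(np.max(np.abs(phi(X) @ w - f)))              # essentially zero
  ```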

  14. Increase Dimensionality
  • Proposition: there exists a hyperplane separating any two subsets of $N$ points $\{\Phi(x_i)\}_i$ in dimension $d' > N + 1$ if the $\{\Phi(x_i)\}_i$ are not in an affine subspace of dimension $< N$.
  • So one can choose $\Phi$ to increase dimensionality, but this leads to overfitting: the problem is generalisation.
  • Example: Gaussian kernel $\langle \Phi(x), \Phi(x') \rangle = \exp\!\big(-\|x - x'\|^2 / (2\sigma^2)\big)$; $\Phi(x)$ is of dimension $d' = \infty$.
  • If $\sigma$ is small, this behaves like a nearest-neighbour classifier.
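
  A sketch of the Gaussian kernel in use; the kernel ridge fit, the toy two-class data and the values of $\sigma$ and the regularisation are all illustrative assumptions, not part of the slide.

  ```python
  import numpy as np

  def gaussian_kernel(X, Z, sigma):
      """<Phi(x), Phi(z)> = exp(-||x - z||^2 / (2 sigma^2))."""
      d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
      return np.exp(-d2 / (2 * sigma ** 2))

  # Kernel ridge classifier on two toy Gaussian classes (illustrative setup).
  rng = np.random.default_rng(0)
  X = np.vstack([rng.normal(-1, 0.5, (50, 2)), rng.normal(1, 0.5, (50, 2))])
  y = np.hstack([-np.ones(50), np.ones(50)])

  sigma, lam = 0.3, 1e-3
  K = gaussian_kernel(X, X, sigma)
  alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

  def classify(Xtest):
      return np.sign(gaussian_kernel(Xtest, X, sigma) @ alpha)

  print((classify(X) == y).mean())   # training accuracy; with small sigma the
                                     # rule behaves like a nearest-neighbour classifier
  ```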

  15. Reduction of Dimensionality
  • Discriminative change of variable $\Phi(x)$: $\Phi(x) \ne \Phi(x')$ if $f(x) \ne f(x')$, so there exists $\tilde f$ with $f(x) = \tilde f(\Phi(x))$.
  • If $\tilde f$ is Lipschitz, $|\tilde f(z) - \tilde f(z')| \le C\,\|z - z'\|$ with $z = \Phi(x)$, then $|f(x) - f(x')| \le C\,\|\Phi(x) - \Phi(x')\|$.
  • Discriminative: $\|\Phi(x) - \Phi(x')\| \ge C^{-1}\, |f(x) - f(x')|$.
  • For $x \in \Omega$, if $\Phi(\Omega)$ is bounded and of low dimension $d'$, then $\|f - f_M\| \le C\, M^{-1/d'}$.

  16. Deep Convolution Networks
  • The revival of neural networks (Y. LeCun): a cascade of linear convolutions $L_j$ and pointwise non-linearities $\rho(u) = \max(u, 0)$ (one scalar non-linearity per neuron), building hierarchical invariants and a linearization $\Phi(x)$, followed by a linear classification $y = \tilde f(\Phi(x))$.
  • The $L_j$ are optimized under architecture constraints: over $10^9$ parameters.
  • Exceptional results for images, speech, language, bio-data...
  • Why does it work so well? A difficult problem.

  17. ImageNet Data Basis • Data basis with 1 million images and 2000 classes

  18. Alex Deep Convolution Network
  • A. Krizhevsky, I. Sutskever, G. Hinton. ImageNet supervised training: $1.2 \times 10^6$ examples, $10^3$ classes; 15.3% testing error in 2012.
  • Newer networks reach about 5% error, with up to 150 layers!
  • [Figure: learned first-layer filters, which resemble wavelets.]

  19. Image Classification

  20. Scene Labeling / Car Driving

  21. Why Understanding?
  • Szegedy, Zaremba, Sutskever, Bruna, Erhan, Goodfellow, Fergus: an image $x$ that is correctly classified becomes $\tilde x = x + \epsilon$, with $\|\epsilon\| < 10^{-2}\, \|x\|$, classified as an ostrich.
  • Trial and error testing cannot guarantee reliability.
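
  The following toy sketch illustrates why a tiny perturbation can flip a decision in high dimension; it uses a plain linear score, not the actual attack on deep networks from the cited paper, and every number in it is an illustrative assumption.

  ```python
  import numpy as np

  # Toy adversarial perturbation x~ = x + eps with ||eps|| << ||x||:
  # for a linear score w.x, a tiny step along w flips the sign whenever
  # the margin is small, even though eps is negligible relative to x.
  rng = np.random.default_rng(0)
  d = 10**4
  w = rng.normal(size=d)
  w /= np.linalg.norm(w)
  x = rng.normal(size=d)
  x -= (x @ w - 0.01) * w                              # place x just on the + side

  eps = -0.02 * w                                      # tiny, targeted perturbation
  print(np.linalg.norm(eps) / np.linalg.norm(x))       # about 2e-4: eps is tiny
  print(np.sign(x @ w), np.sign((x + eps) @ w))        # +1.0 -1.0: the label flips
  ```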

  22. Deep Convolutional Networks
  • Layers $x_0(u) = x(u)$, $x_1(u, k_1)$, $x_2(u, k_2)$, ..., $x_J(u, k_J)$, with $x_j = \rho\, L_j\, x_{j-1}$, followed by a classification stage.
  • $L_j$ is a linear combination of convolutions and subsampling, summed across channels:
    $x_j(u, k_j) = \rho\Big( \sum_k x_{j-1}(\cdot, k) \star h_{k_j, k}(u) \Big)$.
  • $\rho$ is contractive: $|\rho(u) - \rho(u')| \le |u - u'|$, e.g. $\rho(u) = \max(u, 0)$ or $\rho(u) = |u|$.
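
  A minimal 1-D NumPy sketch of one layer $x_j(u, k_j) = \rho\big(\sum_k x_{j-1}(\cdot, k) \star h_{k_j, k}(u)\big)$; the filter sizes, the number of channels and the subsampling stride are illustrative assumptions.

  ```python
  import numpy as np

  def conv_layer(x_prev, h, stride=2):
      """One layer x_j = rho(L_j x_{j-1}) for 1-D signals (illustrative shapes).

      x_prev : (K_in, N)         previous-layer channels x_{j-1}(., k)
      h      : (K_out, K_in, S)  filters h_{k_j, k}
      Returns (K_out, N/stride): rho of the sum of convolutions across input
      channels, subsampled by `stride`.
      """
      K_out, K_in, S = h.shape
      out = []
      for kj in range(K_out):
          acc = sum(np.convolve(x_prev[k], h[kj, k], mode="same")
                    for k in range(K_in))              # sum across channels
          out.append(np.maximum(acc, 0.0)[::stride])   # rho = ReLU, then subsample
      return np.array(out)

  rng = np.random.default_rng(0)
  x0 = rng.normal(size=(1, 256))                       # input signal, one channel
  h1 = rng.normal(size=(8, 1, 5))                      # layer-1 filters
  h2 = rng.normal(size=(16, 8, 5))                     # layer-2 filters
  x1 = conv_layer(x0, h1)
  x2 = conv_layer(x1, h2)
  print(x1.shape, x2.shape)                            # (8, 128) (16, 64)
  ```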

  23. Linearisation in Deep Networks
  • A. Radford, L. Metz, S. Chintala.
  • Trained on a data basis of faces: linearization.
  • On a data basis including bedrooms: interpolations.

  24. Many Questions
  • Layers $x(u) \to x_1(u, k_1) \to x_2(u, k_2) \to \cdots \to x_J(u, k_J)$ with $x_j = \rho\, L_j\, x_{j-1}$, then classification.
  • Why convolutions? Translation covariance.
  • Why no overfitting? Contractions, dimension reduction.
  • Why a hierarchical cascade?
  • Why introduce non-linearities?
  • How and what to linearise?
  • What are the roles of the multiple channels in each layer?

  25. Linear Dimension Reduction
  • Classes $\Omega_1, \Omega_2, \Omega_3$ are level sets of $f(x)$: $\Omega_t = \{x : f(x) = t\}$.
  • If the level sets (classes) are parallel to a linear space, then variables are eliminated by linear projections $\Phi(x)$: invariants.

  26. Linearise for Dimensionality Reduction
  • Classes are the level sets $\Omega_t = \{x : f(x) = t\}$ of $f(x)$: $\Omega_1, \Omega_2, \Omega_3$.
  • If the level sets $\Omega_t$ are not parallel to a linear space: linearise them with a change of variable $\Phi(x)$, then reduce dimension with linear projections.
  • Difficult because the $\Omega_t$ are high-dimensional, irregular, and known only from few samples.

  27. Level Set Geometry: Symmetries
  • Because of the curse of dimensionality, one cannot rely on the local geometry of the level sets (the classes $\Omega_1, \Omega_2$) but on their global geometry, characterised by their global symmetries.
  • A symmetry is an operator $g$ which preserves the level sets: $f(g.x) = f(x)$ for all $x$ (a global property).
  • If $g_1$ and $g_2$ are symmetries then $g_1.g_2$ is also a symmetry: $f(g_1.g_2.x) = f(g_2.x) = f(x)$.
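
  A trivial numerical check of the definition, assuming a toy invariant $f$ (the sum of the coordinates) and circular shifts as the operators $g$: each shift preserves $f$, and so does their composition.

  ```python
  import numpy as np

  # Toy symmetry check: f(x) = sum of coordinates is invariant to circular
  # shifts, and the composition of two shifts is again a shift, hence again
  # a symmetry of f.
  f = lambda x: x.sum()
  g1 = lambda x: np.roll(x, 3)
  g2 = lambda x: np.roll(x, -7)

  x = np.random.default_rng(0).normal(size=64)
  print(np.isclose(f(g1(x)), f(x)))            # g1 is a symmetry
  print(np.isclose(f(g1(g2(x))), f(x)))        # so is g1.g2
  ```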

  28. Groups of Symmetries
  • $G = \{\text{all symmetries}\}$ is a group (unknown): $g.g' \in G$ for all $(g, g') \in G^2$; inverse $g^{-1} \in G$ for all $g \in G$; associative $(g.g').g'' = g.(g'.g'')$. If commutative, $g.g' = g'.g$: Abelian group.
  • Group of dimension $n$ if it has $n$ generators: $g = g_1^{p_1} g_2^{p_2} \cdots g_n^{p_n}$.
  • Lie group: infinitely small generators (Lie algebra).

  29. Translation and Deformations
  • Digit classification ($\Omega_3$, $\Omega_5$): $x'(u) = x(u - \tau(u))$.
  • Globally invariant to the translation group: a small group.
  • Locally invariant to small diffeomorphisms: a huge group.
  • (Video of Philipp Scott Johnson.)
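
  A 1-D sketch of the warping $x'(u) = x(u - \tau(u))$, with a constant $\tau$ (global translation) and a small oscillating $\tau$ (small diffeomorphism); the toy signal, the linear interpolation and the boundary handling are illustrative choices.

  ```python
  import numpy as np

  def deform(x, tau):
      """Warp x'(u) = x(u - tau(u)) by linear interpolation (1-D sketch);
      values outside the support are clamped to the edges by np.interp."""
      u = np.arange(len(x), dtype=float)
      return np.interp(u - tau(u), u, x)

  u = np.arange(256, dtype=float)
  x = np.sin(2 * np.pi * u / 64)                        # toy "digit" signal

  x_translated = deform(x, lambda u: 5.0 * np.ones_like(u))              # translation
  x_deformed = deform(x, lambda u: 3.0 * np.sin(2 * np.pi * u / 256))    # small diffeo

  print(np.max(np.abs(x_translated - x)), np.max(np.abs(x_deformed - x)))
  ```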
