MLCC 2017 Deep Learning, Lorenzo Rosasco (UNIGE-MIT-IIT), June 29, 2017
  1. MLCC 2017 Deep Learning Lorenzo Rosasco UNIGE-MIT-IIT June 29, 2017

  2. What? Classification
     Object classification: what's in this image?
     Note: beyond vision, one can classify graphs, strings, networks, time-series...

  3. What makes the problem hard?
     ◮ Viewpoint
     ◮ Semantic variability
     Note: identification vs categorization...

  4. Categorization: a learning approach
     [Figure: a training set of images labeled "mug" or "remote", and a test set of new images to be labeled.]

  5. Supervised learning
     Given $(x_1, y_1), \dots, (x_n, y_n)$, find $f$ such that $\operatorname{sign} f(x_{\text{new}}) = y_{\text{new}}$.
     ◮ $x \in \mathbb{R}^D$, a vectorization of an image
     ◮ $y = \pm 1$, a label (mug/remote)

  6. Learning and data representation
     Consider $f(x) = w^\top \Phi(x)$. A two-step learning scheme is often considered:
     ◮ supervised learning of $w$
     ◮ expert design or unsupervised learning of the data representation $\Phi$

  7. Data representation
     $\Phi : \mathbb{R}^D \to \mathbb{R}^p$, a mapping of the data into a new format better suited for further processing.

  8. Data representation by design
     Dictionaries of features
     ◮ Wavelets & friends
     ◮ SIFT, HOG, etc.
     Kernels
     ◮ Classic off-the-shelf: Gaussian $K(x, x') = e^{-\gamma \|x - x'\|^2}$
     ◮ Structured input: kernels on histograms, graphs, etc.

  9. In practice all is multi-layer! (an old slide)
     Data representation schemes, e.g. in vision and speech, involve multiple layers.
     Pipeline: raw data are often processed by
     ◮ first computing some low-level features,
     ◮ then learning some mid-level representation,
     ◮ ...
     ◮ finally using supervised learning.
     These stages are often done separately:
     ◮ a good way to exploit unlabelled data...
     ◮ but is it possible to design end-to-end learning systems?

  10. In practice all is deep learning! (updated slide)
     Data representation schemes, e.g. in vision and speech, involve deep learning.
     Pipeline:
     ◮ Design some wild, but "differentiable", hierarchical architecture.
     ◮ Proceed with end-to-end learning!
     Architecture (rather than feature) engineering.

  11. Road Map
     Part I: Basic neural networks
     ◮ Neural network definition
     ◮ Optimization, approximation and statistics
     Part II: One step beyond
     ◮ Auto-encoders
     ◮ Convolutional neural networks
     ◮ Tips and tricks

  12. Part I: Basic Neural Networks

  13. Shallow nets
     $f(x) = w^\top \Phi(x)$, with the representation $x \mapsto \Phi(x)$ fixed.
     Examples
     ◮ Dictionaries: $\Phi(x) = \cos(B^\top x) = (\cos(\beta_1^\top x), \dots, \cos(\beta_p^\top x))$ with $B = (\beta_1, \dots, \beta_p)$ fixed frequencies.
     ◮ Kernel methods: $\Phi(x) = (e^{-\|\beta_1 - x\|^2}, \dots, e^{-\|\beta_n - x\|^2})$ with $\beta_1 = x_1, \dots, \beta_n = x_n$ the input points.
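For concreteness, a minimal NumPy sketch of these two fixed feature maps; the random frequencies and the use of the training points as centers are illustrative choices, not taken from the slides.

```python
import numpy as np

def cos_features(X, B):
    """Dictionary features: Phi(x) = cos(B^T x), one cosine per fixed frequency."""
    return np.cos(X @ B)               # (n, D) @ (D, p) -> (n, p)

def gaussian_features(X, centers):
    """Kernel-type features: Phi(x) = (exp(-||beta_1 - x||^2), ..., exp(-||beta_m - x||^2))."""
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (n, m)
    return np.exp(-sq_dists)

X = np.random.randn(5, 3)              # 5 points in R^3
B = np.random.randn(3, 10)             # 10 fixed frequencies beta_1, ..., beta_10
Phi_cos = cos_features(X, B)           # shape (5, 10)
Phi_rbf = gaussian_features(X, X)      # centers = the input points, shape (5, 5)
```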

  14. Shallow nets (cont.)
     $f(x) = w^\top \Phi(x)$, with $x \mapsto \Phi(x)$ fixed.
     Empirical Risk Minimization (ERM):
     $\min_w \sum_{i=1}^n (y_i - w^\top \Phi(x_i))^2$
     Note: the function $f$ depends linearly on $w$, so the ERM problem is convex!
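Because the objective is linear least squares in $w$, it can be solved directly; a sketch with made-up sizes and data for illustration:

```python
import numpy as np

n, p = 50, 10
Phi = np.random.randn(n, p)            # rows are Phi(x_i)
y = np.sign(np.random.randn(n))        # labels +/- 1

# Convex ERM: ordinary least squares gives the minimizer directly.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
predictions = np.sign(Phi @ w)
```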

  15. Interlude: optimization by gradient descent (GD)
     Batch gradient descent:
     $w_{t+1} = w_t - \gamma \nabla \widehat{E}(w_t)$, where $\widehat{E}(w) = \sum_{i=1}^n (y_i - w^\top \Phi(x_i))^2$,
     so that $\nabla \widehat{E}(w) = -2 \sum_{i=1}^n \Phi(x_i)^\top (y_i - w^\top \Phi(x_i))$.
     ◮ Constant step-size, depending on the curvature (Hessian norm)
     ◮ It is a descent method
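A sketch of batch gradient descent for this objective; the constant step derived from the Hessian norm and the iteration count are illustrative choices.

```python
import numpy as np

n, p = 50, 10
Phi = np.random.randn(n, p)
y = np.sign(np.random.randn(n))

w = np.zeros(p)
# Constant step-size from the curvature: the Hessian of E is 2 * Phi^T Phi.
step = 1.0 / (2 * np.linalg.norm(Phi.T @ Phi, 2))
for t in range(1000):
    grad = -2 * Phi.T @ (y - Phi @ w)  # gradient of sum_i (y_i - w^T Phi(x_i))^2
    w = w - step * grad
```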

  16. Gradient descent illustrated

  17. Stochastic gradient descent (SGD)
     $w_{t+1} = w_t + 2\gamma_t \Phi(x_t)^\top (y_t - w_t^\top \Phi(x_t))$
     Compare to
     $w_{t+1} = w_t + 2\gamma_t \sum_{i=1}^n \Phi(x_i)^\top (y_i - w_t^\top \Phi(x_i))$
     ◮ Decaying step-size $\gamma_t = 1/\sqrt{t}$
     ◮ Lower iteration cost
     ◮ It is not a descent method (SG"D"?)
     ◮ Multiple passes (epochs) over the data are needed
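A corresponding SGD sketch: one sample per update, a decaying step of order $1/\sqrt{t}$ (the constant in front is an illustrative choice), and multiple epochs.

```python
import numpy as np

n, p = 50, 10
Phi = np.random.randn(n, p)
y = np.sign(np.random.randn(n))

w = np.zeros(p)
t = 0
for epoch in range(20):                        # multiple passes (epochs) over the data
    for i in np.random.permutation(n):         # one randomly chosen sample per update
        t += 1
        gamma = 0.05 / np.sqrt(t)              # decaying step ~ 1/sqrt(t); the constant is illustrative
        w = w + 2 * gamma * Phi[i] * (y[i] - Phi[i] @ w)
```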

  18. SGD vs GD

  19. Summary so far
     Given data $(x_1, y_1), \dots, (x_n, y_n)$ and a fixed representation $\Phi$:
     ◮ Consider $f(x) = w^\top \Phi(x)$
     ◮ Find $w$ by SGD: $w_{t+1} = w_t + 2\gamma_t \Phi(x_t)^\top (y_t - w_t^\top \Phi(x_t))$
     Can we jointly learn $\Phi$?

  20. Neural nets
     Basic idea: compose simply parameterized representations
     $\Phi = \Phi_L \circ \cdots \circ \Phi_2 \circ \Phi_1$
     Let $d_0 = D$ and $\Phi_\ell : \mathbb{R}^{d_{\ell-1}} \to \mathbb{R}^{d_\ell}$, $\ell = 1, \dots, L$, and in particular
     $\Phi_\ell = \sigma \circ W_\ell$, $\ell = 1, \dots, L$,
     where $W_\ell : \mathbb{R}^{d_{\ell-1}} \to \mathbb{R}^{d_\ell}$ is linear/affine and $\sigma$ is a nonlinear map acting component-wise, $\sigma : \mathbb{R} \to \mathbb{R}$.

  21. Deep neural nets
     $f(x) = w^\top \Phi_L(x)$, with the compositional representation $\Phi_L = \Phi_L \circ \cdots \circ \Phi_1$, where $\Phi_1 = \sigma \circ W_1, \dots, \Phi_L = \sigma \circ W_L$.
     ERM:
     $\min_{w, (W_j)_j} \frac{1}{n} \sum_{i=1}^n (y_i - w^\top \Phi_L(x_i))^2$

  22. Neural networks jargon
     $\Phi_L(x) = \sigma(W_L \cdots \sigma(W_2 \sigma(W_1 x)))$
     ◮ Each intermediate representation corresponds to a (hidden) layer
     ◮ The dimensionalities $(d_\ell)_\ell$ correspond to the number of hidden units
     ◮ The nonlinearity $\sigma$ is called the activation function
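A sketch of the forward pass $\Phi_L(x) = \sigma(W_L \cdots \sigma(W_1 x))$ followed by the final linear layer; the layer sizes and the ReLU activation are illustrative choices.

```python
import numpy as np

def sigma(a):
    return np.maximum(a, 0.0)                  # ReLU activation, applied component-wise

def representation(x, weights):
    """Phi_L(x) = sigma(W_L ... sigma(W_2 sigma(W_1 x)))."""
    h = x
    for W in weights:                          # W_l maps R^{d_{l-1}} -> R^{d_l}
        h = sigma(W @ h)
    return h

D, d1, d2 = 8, 16, 4                           # d_0 = D, then two hidden layers
weights = [np.random.randn(d1, D), np.random.randn(d2, d1)]
w = np.random.randn(d2)                        # final supervised linear layer
x = np.random.randn(D)
f_x = w @ representation(x, weights)           # f(x) = w^T Phi_L(x)
```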

  23. Neural networks & neurons
     [Figure: a single neuron with inputs $x_1, x_2, x_3$ and weights $W_{1j}, W_{2j}, W_{3j}$, computing $(W^\top x)_j = \sum_{t=1}^3 W_{tj} x_t$. Caption: "hi, I am a neuron".]
     ◮ Each neuron computes an inner product based on a column of a weight matrix $W$
     ◮ The nonlinearity $\sigma$ is the neuron activation function

  24. Deep neural networks
     [Figure: such neurons composed into a multi-layer network; each unit computes $(W^\top x)_j = \sum_{t=1}^3 W_{tj} x_t$.]

  25. Activation functions
     For $\alpha \in \mathbb{R}$, consider:
     ◮ sigmoid $s(\alpha) = 1/(1 + e^{-\alpha})$,
     ◮ hyperbolic tangent $s(\alpha) = (e^\alpha - e^{-\alpha})/(e^\alpha + e^{-\alpha})$,
     ◮ ReLU $s(\alpha) = |\alpha|_+$ (aka ramp, hinge),
     ◮ softplus $s(\alpha) = \log(1 + e^\alpha)$.
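The same four activations as element-wise NumPy functions (a sketch; the hyperbolic tangent is written out explicitly, although `np.tanh` would do).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def hyperbolic_tangent(a):
    return (np.exp(a) - np.exp(-a)) / (np.exp(a) + np.exp(-a))   # equals np.tanh(a)

def relu(a):
    return np.maximum(a, 0.0)                  # |a|_+ , the ramp / hinge

def softplus(a):
    return np.log1p(np.exp(a))                 # log(1 + e^a)

a = np.linspace(-3, 3, 7)
print(sigmoid(a), hyperbolic_tangent(a), relu(a), softplus(a), sep="\n")
```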

  26. Some questions
     $f_{w,(W_\ell)_\ell}(x) = w^\top \Phi_{(W_\ell)_\ell}(x)$, with $\Phi_{(W_\ell)_\ell}(x) = \sigma(W_L \cdots \sigma(W_2 \sigma(W_1 x)))$.
     We have our model, but:
     ◮ Optimization: can we train efficiently?
     ◮ Approximation: are we dealing with rich models?
     ◮ Statistics: how hard is it to generalize from finite data?

  27. Neural network function spaces
     Consider the nonlinear space of functions of the form $f_{w,(W_\ell)_\ell} : \mathbb{R}^D \to \mathbb{R}$,
     $f_{w,(W_\ell)_\ell}(x) = w^\top \Phi_{(W_\ell)_\ell}(x)$, $\quad \Phi_{(W_\ell)_\ell}(x) = \sigma(W_L \cdots \sigma(W_2 \sigma(W_1 x)))$,
     where $w, (W_\ell)_\ell$ may vary. Very little structure... but we can:
     ◮ train by gradient descent (next)
     ◮ get (some) approximation/statistical guarantees (later)

  28. One layer neural networks
     Consider only one hidden layer:
     $f_{w,W}(x) = w^\top \sigma(Wx) = \sum_{j=1}^u w_j \sigma(x^\top W_j)$
     and ERM again:
     $\sum_{i=1}^n (y_i - f_{w,W}(x_i))^2$
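A sketch of the one-hidden-layer model and its empirical risk; the sigmoid activation and the sizes are illustrative, and $W_j$ is taken to be the $j$-th row of $W$.

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))            # sigmoid activation (one possible choice)

def f(x, w, W):
    """f_{w,W}(x) = w^T sigma(W x) = sum_j w_j sigma(W_j . x), W_j the j-th row of W."""
    return w @ sigma(W @ x)

def empirical_risk(X, y, w, W):
    return sum((y_i - f(x_i, w, W)) ** 2 for x_i, y_i in zip(X, y))

D, u, n = 5, 8, 30                             # input dimension, hidden units, samples
X = np.random.randn(n, D)
y = np.sign(np.random.randn(n))
w, W = np.random.randn(u), np.random.randn(u, D)
print(empirical_risk(X, y, w, W))
```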

  29. Computations
     Consider
     $\min_{w,W} \widehat{E}(w, W)$, $\quad \widehat{E}(w, W) = \sum_{i=1}^n (y_i - f_{(w,W)}(x_i))^2$.
     The problem is non-convex! (possibly smooth, depending on $\sigma$)

  30. Back-propagation & GD
     Empirical risk minimization:
     $\min_{w,W} \widehat{E}(w, W)$, $\quad \widehat{E}(w, W) = \sum_{i=1}^n (y_i - f_{(w,W)}(x_i))^2$.
     An approximate minimizer is computed via the following gradient method:
     $w_j^{t+1} = w_j^t - \gamma_t \frac{\partial \widehat{E}}{\partial w_j}(w^t, W^t)$
     $W_{j,k}^{t+1} = W_{j,k}^t - \gamma_t \frac{\partial \widehat{E}}{\partial W_{j,k}}(w^{t+1}, W^t)$
     where the step-size $(\gamma_t)_t$ is often called the learning rate.

  31. Back-propagation & chain rule
     Direct computations show that:
     $\frac{\partial \widehat{E}}{\partial w_j}(w, W) = -2 \sum_{i=1}^n \underbrace{(y_i - f_{(w,W)}(x_i))}_{\Delta_{j,i}}\, h_{j,i}$
     $\frac{\partial \widehat{E}}{\partial W_{j,k}}(w, W) = -2 \sum_{i=1}^n \underbrace{(y_i - f_{(w,W)}(x_i))\, w_j\, \sigma'(W_j^\top x_i)}_{\eta_{i,k}}\, x_i^k$
     Back-prop equations: $\eta_{i,k} = \Delta_{j,i}\, c_j\, \sigma'(W_j^\top x_i)$
     Using the above equations, the updates are performed in two steps:
     ◮ Forward pass: compute function values keeping the weights fixed
     ◮ Backward pass: compute the errors and propagate them
     ◮ Then the weights are updated.
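A sketch of these two partial derivatives computed over the whole training set, assuming a sigmoid activation; variable names and sizes are illustrative.

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigma_prime(a):
    s = sigma(a)
    return s * (1.0 - s)

def gradients(X, y, w, W):
    """Partial derivatives of E(w, W) = sum_i (y_i - f_{(w,W)}(x_i))^2."""
    A = X @ W.T                                # pre-activations, shape (n, u)
    H = sigma(A)                               # hidden values h_{j,i}, shape (n, u)
    residual = y - H @ w                       # y_i - f_{(w,W)}(x_i), shape (n,)
    grad_w = -2 * H.T @ residual                                       # dE/dw_j
    grad_W = -2 * (residual[:, None] * w * sigma_prime(A)).T @ X       # dE/dW_{j,k}
    return grad_w, grad_W

D, u, n = 5, 8, 30
X, y = np.random.randn(n, D), np.sign(np.random.randn(n))
w, W = np.random.randn(u), np.random.randn(u, D)
grad_w, grad_W = gradients(X, y, w, W)
```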

  32. SGD is typically preferred
     $w_j^{t+1} = w_j^t + 2\gamma_t\, (y_t - f_{(w^t, W^t)}(x_t))\, h_{j,t}$
     $W_{j,k}^{t+1} = W_{j,k}^t + 2\gamma_t\, (y_t - f_{(w^{t+1}, W^t)}(x_t))\, w_j\, \sigma'(W_j^\top x_t)\, x_t^k$
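A sketch of the corresponding single-sample update, reading $w_j$ in the second equation as the already-updated output weight (one possible reading of the slide); the activation and step-size are illustrative.

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigma_prime(a):
    s = sigma(a)
    return s * (1.0 - s)

def sgd_step(x_t, y_t, w, W, gamma):
    """One stochastic update of (w, W) on the single example (x_t, y_t)."""
    a = W @ x_t                                # pre-activations
    h = sigma(a)                               # hidden values h_{j,t}
    r = y_t - w @ h                            # y_t - f_{(w^t, W^t)}(x_t)
    w_new = w + 2 * gamma * r * h              # update the output weights first
    r_new = y_t - w_new @ h                    # residual recomputed with w^{t+1}, as on the slide
    W_new = W + 2 * gamma * r_new * (w_new * sigma_prime(a))[:, None] * x_t[None, :]
    return w_new, W_new

D, u = 5, 8
w, W = np.random.randn(u), np.random.randn(u, D)
x_t, y_t = np.random.randn(D), 1.0
w, W = sgd_step(x_t, y_t, w, W, gamma=0.1)
```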

  33. Non-convexity and SGD

  34. A few remarks
     ◮ Optimization by gradient methods, typically SGD
     ◮ Online update rules are potentially biologically plausible (Hebbian learning rules describing neuron plasticity)
     ◮ Multiple layers can be handled analogously
     ◮ Multiple step-sizes, one per layer, can be considered
     ◮ Initialization is tricky (more later)
     ◮ NO convergence guarantees
     ◮ More tricks later

  35. Some questions
     ◮ What is the benefit of multiple layers?
     ◮ Why does stochastic gradient seem to work?

  36. Wrapping up Part I
     ◮ Learning the classifier and the representation
     ◮ From shallow to deep learning
     ◮ SGD and back-propagation

  37. Coming up
     ◮ Auto-encoders and unsupervised data
     ◮ Convolutional neural networks
     ◮ Tricks and tips

  38. Part II: One Step Beyond

  39. Unsupervised learning with neural networks
     ◮ Because unlabeled data abound
     ◮ Because the obtained weights can be used to initialize supervised learning (pre-training)

  40. Auto-encoders
     [Figure: input $x$, weights $W$, reconstructed output $x$.]
     ◮ A neural network with one input layer, one output layer and one (or more) hidden layers connecting them.
     ◮ The output layer has as many nodes as the input layer.
     ◮ It is trained to predict the input rather than some target output.
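A minimal auto-encoder sketch trained by SGD to reconstruct its input; the untied encoder/decoder weights, sigmoid code layer, sizes and step-size are all illustrative choices, not from the slides.

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

D, d = 20, 5                                   # input dimension, hidden (code) dimension
W_enc = 0.1 * np.random.randn(d, D)            # encoder weights
W_dec = 0.1 * np.random.randn(D, d)            # decoder weights (untied from the encoder here)
X = np.random.randn(200, D)                    # unlabeled data

gamma = 0.01
for epoch in range(10):
    for x in X:
        h = sigma(W_enc @ x)                   # hidden code
        x_hat = W_dec @ h                      # reconstruction (linear output layer)
        r = x_hat - x                          # reconstruction error
        # gradients of ||x_hat - x||^2 with respect to the two weight matrices
        grad_dec = 2 * np.outer(r, h)
        grad_enc = 2 * np.outer((W_dec.T @ r) * h * (1 - h), x)
        W_dec -= gamma * grad_dec
        W_enc -= gamma * grad_enc
```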
