 
              MLCC 2017 Deep Learning Lorenzo Rosasco UNIGE-MIT-IIT June 29, 2017
What? Classification Object classification What’s in this image? Note : beyond vision: classify graphs, strings, networks, time-series. . . L.Rosasco
What makes the problem hard?
◮ Viewpoint
◮ Semantic variability
Note: identification vs. categorization...

Categorization: a learning approach
Training: mug, mug, mug, … remote, remote, remote, …
Test: mug, mug, remote, mug, remote, remote

Supervised learning
Given $(x_1, y_1), \dots, (x_n, y_n)$, find $f$ such that $\operatorname{sign} f(x_{\text{new}}) = y_{\text{new}}$
◮ $x \in \mathbb{R}^D$ a vectorization of an image
◮ $y = \pm 1$ a label (mug/remote)

Learning and data representation
Consider $f(x) = w^\top \Phi(x)$. A two-step learning scheme is often considered:
◮ supervised learning of $w$
◮ expert design or unsupervised learning of the data representation $\Phi$

Data representation
$\Phi : \mathbb{R}^D \to \mathbb{R}^p$
A mapping of the data into a new format better suited for further processing

Data representation by design
Dictionaries of features
◮ Wavelets & friends.
◮ SIFT, HoG etc.
Kernels
◮ Classic off the shelf: Gaussian $K(x, x') = e^{-\gamma \|x - x'\|^2}$
◮ Structured input: kernels on histograms, graphs etc.

In practice all is multi-layer! (an old slide)
Data representation schemes, e.g. in vision and speech, involve multiple layers.
Pipeline. Raw data are often processed by:
◮ first computing some low-level features,
◮ then learning some mid-level representation,
◮ ...
◮ finally using supervised learning.
These stages are often done separately:
◮ a good way to exploit unlabelled data...
◮ but is it possible to design end-to-end learning systems?

In practice all is deep learning! (updated slide)
Data representation schemes, e.g. in vision and speech, involve deep learning.
Pipeline
◮ Design some wild but "differentiable" hierarchical architecture.
◮ Proceed with end-to-end learning!
Architecture (rather than feature) engineering

Road Map
Part I: Basic neural networks
◮ Neural networks definition
◮ Optimization, approximation and statistics
Part II: One step beyond
◮ Auto-encoders
◮ Convolutional neural networks
◮ Tips and tricks

Part I: Basic Neural Networks

Shallow nets
$f(x) = w^\top \Phi(x)$, with $x \mapsto \Phi(x)$ fixed.
Examples
◮ Dictionaries: $\Phi(x) = \cos(B^\top x) = (\cos(\beta_1^\top x), \dots, \cos(\beta_p^\top x))$ with $B = (\beta_1, \dots, \beta_p)$ fixed frequencies.
◮ Kernel methods: $\Phi(x) = (e^{-\|\beta_1 - x\|^2}, \dots, e^{-\|\beta_n - x\|^2})$ with $\beta_1 = x_1, \dots, \beta_n = x_n$ the input points.

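The two fixed feature maps above can be sketched in a few lines. This is a minimal illustration in plain Python, not code from the lecture: the function names and the toy values of `B`, `x` and `w` are assumptions chosen for the example.

```python
import math

def cosine_features(x, B):
    """Phi(x) = (cos(beta_1^T x), ..., cos(beta_p^T x)) for fixed frequencies B."""
    return [math.cos(sum(b_k * x_k for b_k, x_k in zip(beta, x))) for beta in B]

def gaussian_features(x, centers):
    """Phi(x) = (exp(-||beta_1 - x||^2), ..., exp(-||beta_n - x||^2))."""
    return [math.exp(-sum((c_k - x_k) ** 2 for c_k, x_k in zip(c, x)))
            for c in centers]

def shallow_net(x, w, phi):
    """f(x) = w^T Phi(x), where phi is a fixed (not learned) feature map."""
    return sum(w_j * f_j for w_j, f_j in zip(w, phi(x)))

# Hypothetical toy values: two frequencies, one input point, two weights.
B = [[0.0, 0.0], [1.0, -1.0]]
x = [2.0, 2.0]
w = [0.5, 0.25]
value = shallow_net(x, w, lambda v: cosine_features(v, B))
print(cosine_features(x, B), value)
```

In both cases only `w` would be learned; the frequencies or centers stay fixed, which is what makes the model linear in its parameters.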
Shallow nets (cont.)
$f(x) = w^\top \Phi(x)$, with $x \mapsto \Phi(x)$ fixed.
Empirical Risk Minimization (ERM)
$$\min_w \sum_{i=1}^n (y_i - w^\top \Phi(x_i))^2$$
Note: the function $f$ depends linearly on $w$, so the ERM problem is convex!

Interlude: optimization by Gradient Descent (GD)
Batch gradient descent
$$w_{t+1} = w_t - \gamma \nabla_w \hat{E}(w_t)$$
where
$$\hat{E}(w) = \sum_{i=1}^n (y_i - w^\top \Phi(x_i))^2$$
so that
$$\nabla_w \hat{E}(w) = -2 \sum_{i=1}^n \Phi(x_i)^\top (y_i - w^\top \Phi(x_i))$$
◮ Constant step-size depending on the curvature (Hessian norm)
◮ It is a descent method

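The batch update above can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions: the toy data, the step-size `gamma=0.1` and the number of steps are hypothetical choices, with the step-size picked small enough for this particular data set.

```python
def grad_E(w, Phi, y):
    """Gradient of E(w) = sum_i (y_i - w^T Phi_i)^2,
    i.e. -2 * sum_i Phi_i * (y_i - w^T Phi_i)."""
    g = [0.0] * len(w)
    for phi_i, y_i in zip(Phi, y):
        r = y_i - sum(w_j * p_j for w_j, p_j in zip(w, phi_i))  # residual
        for j, p_j in enumerate(phi_i):
            g[j] -= 2.0 * p_j * r
    return g

def batch_gd(Phi, y, gamma, steps):
    """w_{t+1} = w_t - gamma * grad E(w_t), with a constant step-size."""
    w = [0.0] * len(Phi[0])
    for _ in range(steps):
        g = grad_E(w, Phi, y)
        w = [w_j - gamma * g_j for w_j, g_j in zip(w, g)]
    return w

# Hypothetical toy data generated by w* = (2, -1); each row of Phi is Phi(x_i).
Phi = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, -1.0, 1.0]
w = batch_gd(Phi, y, gamma=0.1, steps=500)
print(w)
```

Every step uses all n points, so the cost per iteration grows with the data set size.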
Gradient descent illustrated

Stochastic gradient descent (SGD)
$$w_{t+1} = w_t + 2\gamma_t \Phi(x_t)^\top (y_t - w_t^\top \Phi(x_t))$$
Compare to
$$w_{t+1} = w_t + 2\gamma \sum_{i=1}^n \Phi(x_i)^\top (y_i - w_t^\top \Phi(x_i))$$
◮ Decaying step-size $\gamma_t = 1/\sqrt{t}$
◮ Lower iteration cost
◮ It is not a descent method (SG"D"?)
◮ Multiple passes (epochs) over data needed

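A minimal sketch of the stochastic update, in plain Python. The assumptions here are not from the slides: points are visited cyclically rather than sampled at random, the step-size is `gamma0 / sqrt(t)` with a hypothetical constant `gamma0 = 0.1`, and the toy data are noise-free.

```python
import math

def sgd(Phi, y, gamma0, epochs):
    """One point per step: w <- w + 2 * gamma_t * Phi_t * (y_t - w^T Phi_t),
    with decaying step-size gamma_t = gamma0 / sqrt(t)."""
    w = [0.0] * len(Phi[0])
    t = 0
    for _ in range(epochs):                  # multiple passes over the data
        for phi_t, y_t in zip(Phi, y):
            t += 1
            gamma_t = gamma0 / math.sqrt(t)
            r = y_t - sum(w_j * p_j for w_j, p_j in zip(w, phi_t))
            w = [w_j + 2.0 * gamma_t * p_j * r for w_j, p_j in zip(w, phi_t)]
    return w

# Hypothetical toy data generated by w* = (2, -1).
Phi = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, -1.0, 1.0]
w = sgd(Phi, y, gamma0=0.1, epochs=10000)
print(w)
```

Each step touches a single point, which is the lower iteration cost mentioned above; the price is that many passes are needed and the empirical risk need not decrease at every step.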
SGD vs GD

Summary so far
Given data $(x_1, y_1), \dots, (x_n, y_n)$ and a fixed representation $\Phi$
◮ Consider $f(x) = w^\top \Phi(x)$
◮ Find $w$ by SGD: $w_{t+1} = w_t + 2\gamma_t \Phi(x_t)^\top (y_t - w_t^\top \Phi(x_t))$
Can we jointly learn $\Phi$?

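Neural Nets
Basic idea: compose simply parameterized representations
$$\Phi = \Phi_L \circ \cdots \circ \Phi_2 \circ \Phi_1$$
Let $d_0 = D$ and
$$\Phi_\ell : \mathbb{R}^{d_{\ell-1}} \to \mathbb{R}^{d_\ell}, \quad \ell = 1, \dots, L$$
and in particular
$$\Phi_\ell = \sigma \circ W_\ell, \quad \ell = 1, \dots, L$$
where
$$W_\ell : \mathbb{R}^{d_{\ell-1}} \to \mathbb{R}^{d_\ell}, \quad \ell = 1, \dots, L$$
is linear/affine and $\sigma$ is a nonlinear map acting component-wise, $\sigma : \mathbb{R} \to \mathbb{R}$.
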
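Deep neural nets
$f(x) = w^\top \boldsymbol{\Phi}_L(x)$, with the compositional representation
$$\boldsymbol{\Phi}_L = \Phi_L \circ \cdots \circ \Phi_1, \qquad \Phi_1 = \sigma \circ W_1, \; \dots, \; \Phi_L = \sigma \circ W_L$$
ERM
$$\min_{w, (W_j)_j} \frac{1}{n} \sum_{i=1}^n (y_i - w^\top \boldsymbol{\Phi}_L(x_i))^2$$
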
Neural networks jargon
$$\boldsymbol{\Phi}_L(x) = \sigma(W_L \dots \sigma(W_2 \sigma(W_1 x)))$$
◮ Each intermediate representation corresponds to a (hidden) layer
◮ The dimensionalities $(d_\ell)_\ell$ correspond to the number of hidden units
◮ The nonlinearity $\sigma$ is called the activation function

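The nested formula above is just a loop over layers. The sketch below is a plain Python illustration with assumed conventions: each weight matrix is stored as a list of rows, one row per hidden unit, ReLU is used as the activation, and the matrices `W1`, `W2` and the output weights `w` are hypothetical toy values.

```python
def matvec(W, x):
    """Apply the linear map W (a list of rows) to the vector x."""
    return [sum(w_k * x_k for w_k, x_k in zip(row, x)) for row in W]

def forward(x, weights, sigma):
    """Compute sigma(W_L ... sigma(W_2 sigma(W_1 x))), one hidden layer at a time."""
    h = x
    for W in weights:
        h = [sigma(a) for a in matvec(W, h)]
    return h

def relu(a):
    return max(a, 0.0)

# Hypothetical sizes: d_0 = 2 inputs, d_1 = 3 hidden units, d_2 = 1 hidden unit.
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 0.0]]
W2 = [[1.0, 1.0, 1.0]]
w = [2.0]
x = [2.0, 1.0]

representation = forward(x, [W1, W2], relu)               # Phi_L(x)
f_x = sum(w_j * h_j for w_j, h_j in zip(w, representation))  # f(x) = w^T Phi_L(x)
print(representation, f_x)
```

The length of each intermediate `h` is the number of hidden units in that layer.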
Neural networks & neurons
[Figure: a neuron $j$ ("hi, i am a neuron") with inputs $x^1, x^2, x^3$ and weights $W_j^1, W_j^2, W_j^3$, computing $W_j^\top x = \sum_{t=1}^3 W_j^t x^t$]
◮ Each neuron computes an inner product based on a column of a weight matrix $W$
◮ The nonlinearity $\sigma$ is the neuron activation function.

Deep neural networks
[Figure: a network of neurons, each with inputs $x^1, x^2, x^3$ and weights $W_j^1, W_j^2, W_j^3$, computing $W_j^\top x = \sum_{t=1}^3 W_j^t x^t$]

Activation functions
For $\alpha \in \mathbb{R}$ consider,
◮ sigmoid $s(\alpha) = 1/(1 + e^{-\alpha})$,
◮ hyperbolic tangent $s(\alpha) = (e^{\alpha} - e^{-\alpha})/(e^{\alpha} + e^{-\alpha})$,
◮ ReLU $s(\alpha) = |\alpha|_+$ (aka ramp, hinge),
◮ Softplus $s(\alpha) = \log(1 + e^{\alpha})$.

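For reference, the four activation functions can be written directly from their formulas. This is a small plain Python sketch; the function names are illustrative, and the direct formulas can overflow for inputs of very large magnitude, which a numerically careful implementation would guard against.

```python
import math

def sigmoid(a):
    """s(a) = 1 / (1 + exp(-a))"""
    return 1.0 / (1.0 + math.exp(-a))

def hyperbolic_tangent(a):
    """s(a) = (exp(a) - exp(-a)) / (exp(a) + exp(-a))"""
    return (math.exp(a) - math.exp(-a)) / (math.exp(a) + math.exp(-a))

def relu(a):
    """s(a) = |a|_+ = max(a, 0)"""
    return max(a, 0.0)

def softplus(a):
    """s(a) = log(1 + exp(a)), a smooth approximation of ReLU"""
    return math.log1p(math.exp(a))

for name, s in [("sigmoid", sigmoid), ("tanh", hyperbolic_tangent),
                ("relu", relu), ("softplus", softplus)]:
    print(name, [round(s(a), 3) for a in (-2.0, 0.0, 2.0)])
```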
Some questions
$$f_{w,(W_\ell)_\ell}(x) = w^\top \Phi_{(W_\ell)_\ell}(x), \qquad \Phi_{(W_\ell)_\ell}(x) = \sigma(W_L \dots \sigma(W_2 \sigma(W_1 x)))$$
We have our model but:
◮ Optimization: Can we train efficiently?
◮ Approximation: Are we dealing with rich models?
◮ Statistics: How hard is it to generalize from finite data?

Neural networks function spaces
Consider the nonlinear space of functions of the form $f_{w,(W_\ell)_\ell} : \mathbb{R}^D \to \mathbb{R}$,
$$f_{w,(W_\ell)_\ell}(x) = w^\top \Phi_{(W_\ell)_\ell}(x), \qquad \Phi_{(W_\ell)_\ell}(x) = \sigma(W_L \dots \sigma(W_2 \sigma(W_1 x)))$$
where $w, (W_\ell)_\ell$ may vary.
Very little structure... but we can:
◮ train by gradient descent (next)
◮ get (some) approximation/statistical guarantees (later)

One layer neural networks
Consider only one hidden layer:
$$f_{w,W}(x) = w^\top \sigma(Wx) = \sum_{j=1}^u w_j \sigma(x^\top W_j)$$
and ERM again
$$\sum_{i=1}^n (y_i - f_{w,W}(x_i))^2$$

Computations
Consider
$$\min_{w,W} \hat{E}(w, W), \qquad \hat{E}(w, W) = \sum_{i=1}^n (y_i - f_{(w,W)}(x_i))^2.$$
Problem is non-convex! (possibly smooth depending on $\sigma$)

Back-propagation & GD
Empirical risk minimization,
$$\min_{w,W} \hat{E}(w, W), \qquad \hat{E}(w, W) = \sum_{i=1}^n (y_i - f_{(w,W)}(x_i))^2.$$
An approximate minimizer is computed via the following gradient method
$$w_j^{t+1} = w_j^t - \gamma_t \frac{\partial \hat{E}}{\partial w_j}(w^t, W^t)$$
$$W_{j,k}^{t+1} = W_{j,k}^t - \gamma_t \frac{\partial \hat{E}}{\partial W_{j,k}}(w^{t+1}, W^t)$$
where the step-size $(\gamma_t)_t$ is often called the learning rate.

Back-propagation & chain rule
Direct computations show that:
$$\frac{\partial \hat{E}}{\partial w_j}(w, W) = -2 \sum_{i=1}^n \underbrace{(y_i - f_{(w,W)}(x_i))}_{\Delta_{j,i}} \, h_{j,i}$$
$$\frac{\partial \hat{E}}{\partial W_{j,k}}(w, W) = -2 \sum_{i=1}^n \underbrace{(y_i - f_{(w,W)}(x_i)) \, w_j \, \sigma'(W_j^\top x_i)}_{\eta_{i,k}} \, x_i^k$$
where $h_{j,i} = \sigma(W_j^\top x_i)$ is the output of hidden unit $j$ on input $x_i$.
Back-prop equations: $\eta_{i,k} = \Delta_{j,i} \, w_j \, \sigma'(W_j^\top x_i)$
Using the above equations, the updates are performed in two steps:
◮ Forward pass: compute function values keeping weights fixed,
◮ Backward pass: compute errors and propagate them,
◮ Hence the weights are updated.

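The forward and backward passes for one hidden layer can be sketched and checked numerically. This is a plain Python illustration under assumptions that are not in the slides: tanh is used as the activation, each row `W[j]` holds the weights of hidden unit j, and the toy parameters and data are hypothetical. The finite-difference comparison at the end is a sanity check on the formulas, not part of the training method.

```python
import math

def sigma(a):
    return math.tanh(a)

def dsigma(a):
    return 1.0 - math.tanh(a) ** 2

def forward(w, W, x):
    """Forward pass: pre-activations a_j = W_j^T x, hidden outputs h_j, output f."""
    a = [sum(W_jk * x_k for W_jk, x_k in zip(W_j, x)) for W_j in W]
    h = [sigma(a_j) for a_j in a]
    f = sum(w_j * h_j for w_j, h_j in zip(w, h))
    return a, h, f

def risk(w, W, X, y):
    """Empirical risk: sum_i (y_i - f(x_i))^2."""
    return sum((y_i - forward(w, W, x_i)[2]) ** 2 for x_i, y_i in zip(X, y))

def gradients(w, W, X, y):
    """Backward pass: gradients of the risk with respect to w and W."""
    gw = [0.0] * len(w)
    gW = [[0.0] * len(W[0]) for _ in W]
    for x_i, y_i in zip(X, y):
        a, h, f = forward(w, W, x_i)
        delta = y_i - f                          # output error
        for j in range(len(w)):
            gw[j] += -2.0 * delta * h[j]
            eta = delta * w[j] * dsigma(a[j])    # error propagated to unit j
            for k in range(len(x_i)):
                gW[j][k] += -2.0 * eta * x_i[k]
    return gw, gW

# Hypothetical toy parameters and data.
w = [0.5, -0.3]
W = [[0.1, 0.2], [-0.4, 0.3]]
X = [[1.0, 2.0], [0.5, -1.0]]
y = [1.0, -1.0]
gw, gW = gradients(w, W, X, y)

# Central finite difference on W[0][1] as a numerical check of the formula.
eps = 1e-6
W_plus = [row[:] for row in W]
W_plus[0][1] += eps
W_minus = [row[:] for row in W]
W_minus[0][1] -= eps
numeric = (risk(w, W_plus, X, y) - risk(w, W_minus, X, y)) / (2 * eps)
print(gW[0][1], numeric)
```

The same error term `delta` is reused for every weight, which is why the backward pass costs about as much as the forward pass.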
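SGD is typically preferred
$$w_j^{t+1} = w_j^t + 2\gamma_t \, (y_t - f_{(w^t,W^t)}(x_t)) \, h_{j,t}$$
$$W_{j,k}^{t+1} = W_{j,k}^t + 2\gamma_t \, (y_t - f_{(w^{t+1},W^t)}(x_t)) \, w_j \, \sigma'(W_j^\top x_t) \, x_t^k$$
The plus sign follows from moving against the gradient of the squared error at the single point $(x_t, y_t)$, which carries a factor $-2$.
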
Non convexity and SGD

A few remarks
◮ Optimization by gradient methods, typically SGD
◮ Online update rules are potentially biologically plausible (Hebbian learning rules describing neuron plasticity)
◮ Multiple layers can be considered analogously
◮ A different step-size per layer can be considered
◮ Initialization is tricky (more later)
◮ NO convergence guarantees
◮ More tricks later

Some questions
◮ What is the benefit of multiple layers?
◮ Why does stochastic gradient seem to work?

Wrapping up part I
◮ Learning classifier and representation
◮ From shallow to deep learning
◮ SGD and backpropagation

Coming up
◮ Autoencoders and unsupervised data?
◮ Convolutional neural networks
◮ Tricks and tips

Part II: One step beyond

Unsupervised learning with neural networks
◮ Because unlabeled data abound
◮ Because the learned weights can be used to initialize supervised learning (pre-training)

Auto-encoders
[Figure: input $x$, weights $W$, reconstructed output $x$]
◮ A neural network with one input layer, one output layer and one (or more) hidden layers connecting them.
◮ The output layer has as many nodes as the input layer.
◮ It is trained to predict the input rather than some target output.

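A minimal sketch of an auto-encoder with one hidden unit, in plain Python. Everything specific here is an assumption made for illustration: the tanh encoder, the linear decoder `V`, the squared reconstruction error, the per-sample gradient updates, the step-size and the toy data lying on a line. The slides do not prescribe these choices.

```python
import math

def encode(W, x):
    """Hidden representation h_j = tanh(W_j^T x)."""
    return [math.tanh(sum(W_jk * x_k for W_jk, x_k in zip(W_j, x))) for W_j in W]

def decode(V, h):
    """Linear output layer with as many nodes as the input."""
    return [sum(V_kj * h_j for V_kj, h_j in zip(V_k, h)) for V_k in V]

def reconstruction_error(W, V, X):
    """Sum over the data of ||x - decode(encode(x))||^2."""
    return sum((x_k - r_k) ** 2
               for x in X for x_k, r_k in zip(x, decode(V, encode(W, x))))

def train(W, V, X, gamma=0.01, epochs=200):
    """Per-sample gradient steps; the target is the input itself. Updates W, V in place."""
    for _ in range(epochs):
        for x in X:
            h = encode(W, x)
            r = [x_k - xh_k for x_k, xh_k in zip(x, decode(V, h))]  # x - x_hat
            # Error propagated back to each hidden unit, times tanh'(a) = 1 - h^2.
            back = [sum(r[k] * V[k][j] for k in range(len(x))) * (1.0 - h[j] ** 2)
                    for j in range(len(h))]
            for k in range(len(x)):
                for j in range(len(h)):
                    V[k][j] += 2.0 * gamma * r[k] * h[j]
            for j in range(len(h)):
                for m in range(len(x)):
                    W[j][m] += 2.0 * gamma * back[j] * x[m]
    return W, V

# Hypothetical unlabeled data on a line in R^2, compressed to one hidden unit.
X = [[-1.0, -1.0], [-0.5, -0.5], [0.5, 0.5], [1.0, 1.0]]
W = [[0.5, 0.0]]        # encoder: R^2 -> R^1
V = [[0.5], [0.0]]      # decoder: R^1 -> R^2
before = reconstruction_error(W, V, X)
train(W, V, X)
after = reconstruction_error(W, V, X)
print(before, after)
```

No labels appear anywhere in the training loop, and the trained encoder weights `W` are the part that could be reused to initialize a supervised network.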