Neural Networks, Hugo Larochelle (@hugo_larochelle), Google Brain: presentation transcript

  1. Neural Networks. Hugo Larochelle (@hugo_larochelle), Google Brain

  2. NEURAL NETWORK ONLINE COURSE (http://info.usherbrooke.ca/hlarochelle/neural_networks) Topics: online videos ‣ for a more detailed description of neural networks… ‣ … and much more!

  4. NEURAL NETWORKS • What we'll cover ‣ how neural networks take an input x and make a prediction f(x) - forward propagation - types of units ‣ how to train neural nets (classifiers) on data - loss function - backpropagation - gradient descent algorithms - tricks of the trade ‣ deep learning - unsupervised pre-training - dropout - batch normalization

  5. Neural Networks Making predictions with feedforward neural networks

  6. ARTIFICIAL NEURON Topics: connection weights, bias, activation function • Neuron pre-activation (or input activation): a(x) = b + \sum_i w_i x_i = b + w^\top x • Neuron (output) activation: h(x) = g(a(x)) = g(b + \sum_i w_i x_i) • w are the connection weights • b is the neuron bias • g(\cdot) is called the activation function
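
As a rough NumPy sketch of the formulas above (the function name neuron_output and the example values are illustrative, not from the slides):

```python
import numpy as np

def neuron_output(x, w, b, g=np.tanh):
    """Single artificial neuron: pre-activation a(x) = b + w^T x,
    then output activation h(x) = g(a(x))."""
    a = b + np.dot(w, x)   # pre-activation (a scalar)
    return g(a)            # activation

# Example: a 3-input neuron with a tanh activation function.
x = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 0.1, -0.4])
print(neuron_output(x, w, b=0.2))
```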

  7. ARTIFICIAL NEURON Topics: connection weights, bias, activation function [Figure: output of a single neuron plotted over the inputs (x_1, x_2)] • the output range is determined by g(\cdot) • the bias b only changes the position of the riff (from Pascal Vincent's slides)

  8. CAPACITY OF NEURAL NETWORK Topics: single hidden layer neural network [Figure: a single-hidden-layer network (input units i, hidden units j with weights w_ji, output unit k with weights w_kj, plus bias units) and the decision function it computes over (x_1, x_2)] (from Pascal Vincent's slides)

  9. CAPACITY OF NEURAL NETWORK Topics: single hidden layer neural network [Figure: hidden units y_1, ..., y_4 each split the input space (x_1, x_2); the output z_1 combines them to carve out a region] (from Pascal Vincent's slides)

  10. CAPACITY OF NEURAL NETWORK Topics: single hidden layer neural network [Figure: a three-layer ("trois couches") network representing decision regions R_1 and R_2 over (x_1, x_2)] (from Pascal Vincent's slides)

  11. CAPACITY OF NEURAL NETWORK Topics: universal approximation • Universal approximation theorem (Hornik, 1991): ‣ ‘‘a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units’’ • The result applies for sigmoid, tanh and many other hidden layer activation functions • This is a good result, but it doesn't mean there is a learning algorithm that can find the necessary parameter values!

  12. NEURAL NETWORK Topics: multilayer neural network • Could have L hidden layers: ‣ layer pre-activation for k > 0 (with h^{(0)}(x) = x): a^{(k)}(x) = b^{(k)} + W^{(k)} h^{(k-1)}(x) ‣ hidden layer activation (k from 1 to L): h^{(k)}(x) = g(a^{(k)}(x)) ‣ output layer activation (k = L+1): h^{(L+1)}(x) = o(a^{(L+1)}(x)) = f(x)
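
A minimal sketch of this layer recursion in NumPy, assuming tanh hidden units and a linear output unit (the function name forward and the layer sizes are illustrative):

```python
import numpy as np

def forward(x, weights, biases, g=np.tanh, o=lambda a: a):
    """Forward propagation: h^(0) = x; a^(k) = b^(k) + W^(k) h^(k-1);
    h^(k) = g(a^(k)) for hidden layers, and o(a^(L+1)) = f(x) at the output."""
    h = x
    for k, (W, b) in enumerate(zip(weights, biases)):
        a = b + W @ h                        # layer pre-activation
        last = (k == len(weights) - 1)
        h = o(a) if last else g(a)           # output vs. hidden activation
    return h                                 # f(x)

# Example: 4 inputs -> 5 hidden units -> 3 outputs.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(5, 4)), rng.normal(size=(3, 5))]
bs = [np.zeros(5), np.zeros(3)]
print(forward(rng.normal(size=4), Ws, bs))
```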

  13. ACTIVATION FUNCTION Topics: sigmoid activation function • Squashes the neuron's pre-activation between 0 and 1 • Always positive • Bounded • Strictly increasing • g(a) = \mathrm{sigm}(a) = \frac{1}{1+\exp(-a)}
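
A small NumPy sketch of the sigmoid (the piecewise form is an implementation choice, not from the slides; it only exponentiates non-positive values to avoid overflow):

```python
import numpy as np

def sigm(a):
    """Logistic sigmoid g(a) = 1 / (1 + exp(-a)), mapping any real a into (0, 1)."""
    a = np.asarray(a, dtype=float)
    e = np.exp(-np.abs(a))                 # always <= 1, never overflows
    return np.where(a >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

print(sigm(np.array([-5.0, 0.0, 5.0])))    # approx [0.0067, 0.5, 0.9933]
```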

  14. ACTIVATION FUNCTION Topics: hyperbolic tangent (‘‘tanh’’) activation function • Squashes the neuron's pre-activation between -1 and 1 • Can be positive or negative • Bounded • Strictly increasing • g(a) = \tanh(a) = \frac{\exp(a)-\exp(-a)}{\exp(a)+\exp(-a)} = \frac{\exp(2a)-1}{\exp(2a)+1}
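
A quick check (illustrative, not from the slides) that the two algebraic forms on this slide agree with the library tanh for moderate pre-activations; the exp(2a) form would overflow for very large a:

```python
import numpy as np

a = np.array([-2.0, 0.0, 2.0])
g1 = np.tanh(a)                                   # library tanh
g2 = (np.exp(2 * a) - 1) / (np.exp(2 * a) + 1)    # same formula as on the slide
print(np.allclose(g1, g2), g1)                    # True, values in (-1, 1)
```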

  15. ACTIVATION FUNCTION Topics: rectified linear activation function • Bounded below by 0 (always non-negative) • Not upper bounded • Strictly increasing • Tends to give neurons with sparse activities • g(a) = \mathrm{reclin}(a) = \max(0, a)
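
A one-line NumPy sketch (function name reclin taken from the slide's notation):

```python
import numpy as np

def reclin(a):
    """Rectified linear activation g(a) = max(0, a): zero for negative
    pre-activations (hence sparse activities), identity otherwise."""
    return np.maximum(0.0, a)

print(reclin(np.array([-1.5, 0.0, 2.3])))   # -> [0.  0.  2.3]
```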

  16. ACTIVATION FUNCTION Topics: softmax activation function • For multi-class classification: ‣ we need multiple outputs (1 output per class) ‣ we would like to estimate the conditional probability p(y = c \mid x) • We use the softmax activation function at the output: o(a) = \mathrm{softmax}(a) = \left[ \frac{\exp(a_1)}{\sum_c \exp(a_c)}, \dots, \frac{\exp(a_C)}{\sum_c \exp(a_c)} \right]^\top ‣ strictly positive ‣ sums to one • Predicted class is the one with the highest estimated probability
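
A short NumPy sketch of the softmax output (subtracting the maximum is a standard numerical-stability trick, an implementation choice rather than something on the slide):

```python
import numpy as np

def softmax(a):
    """Softmax output activation: strictly positive components that sum to one.
    Subtracting max(a) leaves the result unchanged but avoids overflow in exp."""
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

a = np.array([1.0, 2.0, 0.5])          # output pre-activations, one per class
p = softmax(a)
print(p, p.sum(), np.argmax(p))        # estimated p(y=c|x), 1.0, predicted class
```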

  17. FLOW GRAPH Topics: flow graph • Forward propagation can be represented as an acyclic flow graph going from the input x, through the pre-activations a^{(k)}(x), the activations h^{(k)}(x) and the parameters W^{(k)}, b^{(k)}, up to the output f(x) • It's a nice way of implementing forward propagation in a modular way ‣ each box could be an object with an fprop method, that computes the value of the box given its parents ‣ calling the fprop method of each box in the right order yields forward propagation
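
A minimal sketch of the "box with an fprop method" idea (the class names Linear and Tanh and the layer sizes are illustrative, not from the slides):

```python
import numpy as np

class Linear:
    """Box computing a(x) = b + W x from its parent's value."""
    def __init__(self, W, b):
        self.W, self.b = W, b
    def fprop(self, parent_value):
        return self.b + self.W @ parent_value

class Tanh:
    """Box applying an element-wise non-linearity to its parent's value."""
    def fprop(self, parent_value):
        return np.tanh(parent_value)

# Calling fprop on each box in topological order yields forward propagation.
rng = np.random.default_rng(1)
boxes = [Linear(rng.normal(size=(5, 4)), np.zeros(5)), Tanh(),
         Linear(rng.normal(size=(3, 5)), np.zeros(3))]
value = rng.normal(size=4)              # the input x
for box in boxes:
    value = box.fprop(value)            # value of the box given its parent
print(value)                            # f(x)
```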

  18. Neural Networks Training feedforward neural networks

  19. MACHINE LEARNING Topics: empirical risk minimization, regularization • Empirical (structural) risk minimization ‣ framework to design learning algorithms: \arg\min_\theta \frac{1}{T} \sum_t l(f(x^{(t)}; \theta), y^{(t)}) + \lambda \Omega(\theta) ‣ l(f(x^{(t)}; \theta), y^{(t)}) is a loss function ‣ \Omega(\theta) is a regularizer (penalizes certain values of \theta) • Learning is cast as optimization ‣ ideally, we'd optimize classification error, but it's not smooth ‣ the loss function is a surrogate for what we truly should optimize (e.g. an upper bound)
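
A small sketch of the regularized empirical risk as a function of generic loss, model and regularizer callables (all names and the tiny linear-model example are illustrative, not from the slides):

```python
import numpy as np

def regularized_risk(loss, f, Omega, theta, X, Y, lam):
    """1/T sum_t l(f(x_t; theta), y_t) + lam * Omega(theta)."""
    T = len(X)
    risk = sum(loss(f(x, theta), y) for x, y in zip(X, Y)) / T
    return risk + lam * Omega(theta)

# Tiny illustration: a linear model with squared loss and an L2 regularizer.
f = lambda x, theta: theta @ x
loss = lambda pred, y: (pred - y) ** 2
Omega = lambda theta: np.sum(theta ** 2)
X = [np.array([1.0, 2.0]), np.array([0.0, 1.0])]
Y = [1.0, -1.0]
print(regularized_risk(loss, f, Omega, np.array([0.5, -0.5]), X, Y, lam=0.01))
```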

  20. MACHINE LEARNING Topics: stochastic gradient descent (SGD) • Algorithm that performs updates after each example: ‣ initialize \theta \equiv \{ W^{(1)}, b^{(1)}, \dots, W^{(L+1)}, b^{(L+1)} \} ‣ for N epochs (a training epoch = one iteration over all examples): - for each training example (x^{(t)}, y^{(t)}): \Delta = -\nabla_\theta l(f(x^{(t)}; \theta), y^{(t)}) - \lambda \nabla_\theta \Omega(\theta), then \theta \leftarrow \theta + \alpha \Delta • To apply this algorithm to neural network training, we need ‣ the loss function l(f(x^{(t)}; \theta), y^{(t)}) ‣ a procedure to compute the parameter gradients \nabla_\theta l(f(x^{(t)}; \theta), y^{(t)}) ‣ the regularizer \Omega(\theta) (and its gradient \nabla_\theta \Omega(\theta)) ‣ an initialization method for \theta
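
A generic sketch of this update loop, assuming the caller supplies the gradient procedures that the following slides describe (the function name sgd and its signature are illustrative):

```python
import numpy as np

def sgd(theta, grad_loss, grad_Omega, data, alpha=0.01, lam=0.0, n_epochs=10):
    """SGD as on the slide: for each training example take a step
    Delta = -grad_theta l(f(x; theta), y) - lam * grad_theta Omega(theta),
    then theta <- theta + alpha * Delta.
    theta is a list of parameter arrays; grad_loss / grad_Omega return matching lists."""
    for epoch in range(n_epochs):              # one epoch = a pass over all examples
        for x, y in data:
            g_l = grad_loss(theta, x, y)
            g_O = grad_Omega(theta)
            for p, gl, gO in zip(theta, g_l, g_O):
                p -= alpha * (gl + lam * gO)   # in-place parameter update
    return theta
```

The loss, its parameter gradients (via backpropagation) and the regularizer plugged in here are exactly the ingredients the next slides cover.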

  21. LOSS FUNCTION Topics: loss function for classification • The neural network estimates f(x)_c = p(y = c \mid x) ‣ we could maximize the probability of y^{(t)} given x^{(t)} over the training set • To frame this as a minimization, we minimize the negative log-likelihood: l(f(x), y) = -\sum_c 1_{(y = c)} \log f(x)_c = -\log f(x)_y ‣ we take the (natural) log for numerical stability and mathematical simplicity ‣ sometimes referred to as the cross-entropy
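
A one-line NumPy sketch of this loss (the function name nll_loss and the example probabilities are illustrative):

```python
import numpy as np

def nll_loss(f_x, y):
    """Negative log-likelihood for classification: l(f(x), y) = -log f(x)_y,
    where f_x holds the estimated probabilities p(y=c|x) and y is the class index."""
    return -np.log(f_x[y])

f_x = np.array([0.1, 0.7, 0.2])     # e.g. softmax outputs
print(nll_loss(f_x, y=1))           # -log 0.7 ~= 0.357
```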

  22. BACKPROPAGATION Topics: backpropagation algorithm • Use the chain rule to efficiently compute gradients, top to bottom: ‣ compute the output gradient (before activation): \nabla_{a^{(L+1)}(x)} (-\log f(x)_y) = -(e(y) - f(x)) ‣ for k from L+1 to 1: - compute the gradients of the hidden layer parameters: \nabla_{W^{(k)}} (-\log f(x)_y) = (\nabla_{a^{(k)}(x)} (-\log f(x)_y)) \, h^{(k-1)}(x)^\top and \nabla_{b^{(k)}} (-\log f(x)_y) = \nabla_{a^{(k)}(x)} (-\log f(x)_y) - compute the gradient of the hidden layer below: \nabla_{h^{(k-1)}(x)} (-\log f(x)_y) = W^{(k)\top} \nabla_{a^{(k)}(x)} (-\log f(x)_y) - compute the gradient of the hidden layer below (before activation): \nabla_{a^{(k-1)}(x)} (-\log f(x)_y) = \nabla_{h^{(k-1)}(x)} (-\log f(x)_y) \odot [\dots, g'(a^{(k-1)}(x)_j), \dots]
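
A compact sketch of these recursions, assuming tanh hidden layers, a softmax output and the negative log-likelihood loss (the function name backprop and the forward-pass bookkeeping are illustrative):

```python
import numpy as np

def backprop(x, y, Ws, bs):
    """Gradients of -log f(x)_y w.r.t. every W^(k), b^(k):
    start from grad_a(L+1) = -(e(y) - f(x)) and move down layer by layer."""
    # forward pass, storing the layer inputs h[0] = x, h[1] = h^(1)(x), ...
    h = [x]
    for k, (W, b) in enumerate(zip(Ws, bs)):
        a = b + W @ h[-1]
        if k < len(Ws) - 1:
            h.append(np.tanh(a))                         # hidden activation
        else:
            e = np.exp(a - a.max()); f_x = e / e.sum()   # softmax output f(x)
    grad_a = f_x.copy(); grad_a[y] -= 1.0                # -(e(y) - f(x))
    grads_W, grads_b = [], []
    for k in reversed(range(len(Ws))):
        grads_W.insert(0, np.outer(grad_a, h[k]))        # grad w.r.t. W^(k)
        grads_b.insert(0, grad_a.copy())                 # grad w.r.t. b^(k)
        if k > 0:
            grad_h = Ws[k].T @ grad_a                    # grad of hidden layer below
            grad_a = grad_h * (1.0 - h[k] ** 2)          # times g'(a) for tanh
    return grads_W, grads_b
```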

  23. ACTIVATION FUNCTION Topics: sigmoid activation function gradient • Partial derivative: g'(a) = g(a)(1 - g(a)) • where g(a) = \mathrm{sigm}(a) = \frac{1}{1+\exp(-a)}
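
A quick sketch (names illustrative) that computes this derivative from the sigmoid value itself and checks it against a finite difference:

```python
import numpy as np

def sigm_grad(a):
    """g'(a) = g(a)(1 - g(a)) for the logistic sigmoid."""
    g = 1.0 / (1.0 + np.exp(-a))
    return g * (1.0 - g)

eps = 1e-6   # finite-difference sanity check at a = 0.3
num = (1/(1+np.exp(-(0.3+eps))) - 1/(1+np.exp(-(0.3-eps)))) / (2*eps)
print(np.isclose(sigm_grad(0.3), num))   # True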

  24. ACTIVATION FUNCTION Topics: tanh activation function gradient • Partial derivative: g'(a) = 1 - g(a)^2 • where g(a) = \tanh(a) = \frac{\exp(a)-\exp(-a)}{\exp(a)+\exp(-a)} = \frac{\exp(2a)-1}{\exp(2a)+1}
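
The same idea for tanh, again expressible from the activation value (name illustrative):

```python
import numpy as np

def tanh_grad(a):
    """g'(a) = 1 - tanh(a)^2."""
    return 1.0 - np.tanh(a) ** 2

print(tanh_grad(np.array([-2.0, 0.0, 2.0])))   # largest slope (1.0) at a = 0
```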

  25. ACTIVATION FUNCTION Topics: rectified linear activation function gradient • Partial derivative: g'(a) = 1_{a > 0} • where g(a) = \mathrm{reclin}(a) = \max(0, a)
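
A sketch of the indicator derivative (the convention of using 0 at a = 0 is an implementation choice):

```python
import numpy as np

def reclin_grad(a):
    """g'(a) = 1 if a > 0 else 0 (the indicator 1_{a>0}; we pick 0 at a = 0)."""
    return (np.asarray(a) > 0).astype(float)

print(reclin_grad(np.array([-1.5, 0.0, 2.3])))   # -> [0. 0. 1.]
```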

  26. FLOW GRAPH Topics: automatic differentiation • Each object also has a bprop method ‣ it computes the gradient of the loss with respect to each parent ‣ fprop depends on the fprop of a box's parents, while bprop depends on the bprop of a box's children • By calling bprop in the reverse order, we get backpropagation ‣ we only need to reach the parameters
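
A minimal sketch extending the earlier fprop boxes with a bprop method (class names and sizes are illustrative; the final ones-vector stands in for the gradient of the loss with respect to the output):

```python
import numpy as np

class Tanh:
    def fprop(self, x):
        self.h = np.tanh(x)
        return self.h
    def bprop(self, grad_output):
        # chain rule: grad w.r.t. parent = grad w.r.t. output * g'(a)
        return grad_output * (1.0 - self.h ** 2)

class Linear:
    def __init__(self, W, b):
        self.W, self.b = W, b
    def fprop(self, x):
        self.x = x
        return self.b + self.W @ x
    def bprop(self, grad_output):
        self.grad_W = np.outer(grad_output, self.x)   # parameter gradients we need
        self.grad_b = grad_output
        return self.W.T @ grad_output                 # gradient passed to the parent

# fprop in order, then bprop in reverse order = backpropagation.
rng = np.random.default_rng(2)
boxes = [Linear(rng.normal(size=(5, 4)), np.zeros(5)), Tanh(),
         Linear(rng.normal(size=(3, 5)), np.zeros(3))]
value = rng.normal(size=4)
for box in boxes:
    value = box.fprop(value)
grad = np.ones(3)                    # pretend gradient of the loss w.r.t. f(x)
for box in reversed(boxes):
    grad = box.bprop(grad)
print(boxes[0].grad_W.shape, grad.shape)   # parameter gradient and input gradient
```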
