 
Neural Networks
Hugo Larochelle (@hugo_larochelle)
Google Brain
2 NEURAL NETWORK ONLINE COURSE
http://info.usherbrooke.ca/hlarochelle/neural_networks
Topics: online videos
‣ for a more detailed description of neural networks…
‣ … and much more!
3 NEURAL NETWORKS
• What we’ll cover
‣ how neural networks take an input x and make a prediction f(x)
- forward propagation
- types of units
‣ how to train neural nets (classifiers) on data
- loss function
- backpropagation
- gradient descent algorithms
- tricks of the trade
‣ deep learning
- unsupervised pre-training
- dropout
- batch normalization
[Figure: feedforward network mapping inputs x_1 ... x_j ... x_d to the output f(x)]
Neural Networks Making predictions with feedforward neural networks
5 ARTIFICIAL NEURON
Topics: connection weights, bias, activation function
• Neuron pre-activation (or input activation):
  $a(x) = b + \sum_i w_i x_i = b + w^\top x$
• Neuron (output) activation:
  $h(x) = g(a(x)) = g(b + \sum_i w_i x_i)$
• $w$ are the connection weights
• $b$ is the neuron bias
• $g(\cdot)$ is called the activation function
[Figure: a single neuron with inputs x_1 ... x_d, weights w_1 ... w_d and bias b]
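The pre-activation and activation of a single neuron can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the course: the function name `neuron`, the choice of tanh for g, and the example numbers are all assumptions.

```python
import math

def neuron(x, w, b, g=math.tanh):
    """Return (a, h) for one neuron.

    a = b + sum_i w_i * x_i   (pre-activation)
    h = g(a)                  (output activation)
    """
    a = b + sum(w_i * x_i for w_i, x_i in zip(w, x))
    return a, g(a)

# Illustrative values only (not from the slides).
a, h = neuron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.1)
print(a, h)
```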
6 ARTIFICIAL NEURON
Topics: connection weights, bias, activation function
[Figure: output y of a single neuron plotted over the inputs x_1 and x_2]
• range determined by $g(\cdot)$
• bias $b$ only changes the position of the ridge
(from Pascal Vincent’s slides)
7 CAPACITY OF NEURAL NETWORK
Topics: single hidden layer neural network
[Figure: a network with inputs x_1, x_2 (input i), hidden units y_1, y_2 (hidden j), a bias unit and output z_1 (output k), with weights w_ji and w_kj; surface plots of y_1, y_2 and z_1 over the (x_1, x_2) plane]
(from Pascal Vincent’s slides)
8 CAPACITY OF NEURAL NETWORK
Topics: single hidden layer neural network
[Figure: output z_1 over the (x_1, x_2) plane, formed by combining four hidden units y_1, y_2, y_3, y_4]
(from Pascal Vincent’s slides)
9 CAPACITY OF NEURAL NETWORK
Topics: single hidden layer neural network
[Figure: decision regions R_1 and R_2 in the (x_1, x_2) plane for a three-layer network]
(from Pascal Vincent’s slides)
10 CAPACITY OF NEURAL NETWORK
Topics: universal approximation
• Universal approximation theorem (Hornik, 1991):
‣ "a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units"
• The result applies for sigmoid, tanh and many other hidden layer activation functions
• This is a good result, but it doesn’t mean there is a learning algorithm that can find the necessary parameter values!
11 NEURAL NETWORK
Topics: multilayer neural network
• Could have L hidden layers:
‣ layer pre-activation for k > 0 (with $h^{(0)}(x) = x$):
  $a^{(k)}(x) = b^{(k)} + W^{(k)} h^{(k-1)}(x)$
‣ hidden layer activation (k from 1 to L):
  $h^{(k)}(x) = g(a^{(k)}(x))$
‣ output layer activation (k = L + 1):
  $h^{(L+1)}(x) = o(a^{(L+1)}(x)) = f(x)$
[Figure: network with inputs x_1 ... x_j ... x_d, hidden layers h^(1)(x) and h^(2)(x), and parameters W^(1), b^(1), W^(2), b^(2), W^(3), b^(3)]
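The layer recursion above can be sketched as a loop over (W, b) pairs. This is a minimal sketch under stated assumptions: weight matrices are stored as lists of rows, tanh is used for g, a softmax is used for the output function o, and the helper names (`affine`, `softmax`, `forward`) and parameter values are illustrative rather than taken from the slides.

```python
import math

def affine(W, b, h):
    """Pre-activation a = b + W h, with W stored as a list of rows."""
    return [b_i + sum(w_ij * h_j for w_ij, h_j in zip(row, h))
            for row, b_i in zip(W, b)]

def softmax(a):
    """Output activation; subtracting max(a) avoids overflow in exp."""
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, layers, g=math.tanh, o=softmax):
    """layers = [(W1, b1), ..., (W_{L+1}, b_{L+1})]; returns f(x)."""
    h = x                                   # h^(0)(x) = x
    for W, b in layers[:-1]:                # hidden layers k = 1..L
        h = [g(a_i) for a_i in affine(W, b, h)]
    W, b = layers[-1]                       # output layer k = L + 1
    return o(affine(W, b, h))

# Illustrative parameters: 2 inputs, 3 hidden units, 2 classes.
layers = [
    ([[0.5, -0.5], [0.25, 0.75], [-1.0, 1.0]], [0.0, 0.1, -0.1]),
    ([[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]], [0.0, 0.0]),
]
f_x = forward([1.0, 2.0], layers)
print(f_x)
```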
12 ACTIVATION FUNCTION
Topics: sigmoid activation function
• Squashes the neuron’s pre-activation between 0 and 1
• Always positive
• Bounded
• Strictly increasing
  $g(a) = \mathrm{sigm}(a) = \frac{1}{1 + \exp(-a)}$
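A direct translation of the sigmoid formula overflows in Python for large negative inputs, because `math.exp` raises `OverflowError` on very large arguments. The sketch below branches on the sign of a to avoid that. The branching trick is a common implementation choice added here, not something stated on the slide.

```python
import math

def sigm(a):
    """Logistic sigmoid 1 / (1 + exp(-a)), written to avoid overflow."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    z = math.exp(a)          # a < 0, so exp(a) is in (0, 1)
    return z / (1.0 + z)     # algebraically equal to 1 / (1 + exp(-a))

print(sigm(0.0), sigm(2.0), sigm(-1000.0))
```

In floating point the result saturates to exactly 0.0 or 1.0 for extreme inputs, even though the mathematical function stays strictly between the two.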
13 ACTIVATION FUNCTION
Topics: hyperbolic tangent ("tanh") activation function
• Squashes the neuron’s pre-activation between -1 and 1
• Can be positive or negative
• Bounded
• Strictly increasing
  $g(a) = \tanh(a) = \frac{\exp(a) - \exp(-a)}{\exp(a) + \exp(-a)} = \frac{\exp(2a) - 1}{\exp(2a) + 1}$
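The two forms of tanh given above are equal. The short derivation below multiplies the numerator and the denominator of the first form by exp(a). It is added as a worked step and is not part of the original slide.

```latex
\tanh(a)
  = \frac{\exp(a) - \exp(-a)}{\exp(a) + \exp(-a)}
  = \frac{\bigl(\exp(a) - \exp(-a)\bigr)\exp(a)}{\bigl(\exp(a) + \exp(-a)\bigr)\exp(a)}
  = \frac{\exp(2a) - 1}{\exp(2a) + 1}
```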
14 ACTIVATION FUNCTION
Topics: rectified linear activation function
• Bounded below by 0 (always non-negative)
• Not upper bounded
• Monotonically non-decreasing (strictly increasing only for a > 0, constant at 0 for a ≤ 0)
• Tends to give neurons with sparse activities
  $g(a) = \mathrm{reclin}(a) = \max(0, a)$
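The "sparse activities" property can be seen on a handful of pre-activations: every negative pre-activation maps to exactly zero. The values below are made up for illustration.

```python
def reclin(a):
    """Rectified linear activation max(0, a)."""
    return max(0.0, a)

# Hypothetical pre-activations of five hidden units.
pre_activations = [-1.5, 0.3, -0.2, 2.0, -0.7]
h = [reclin(a) for a in pre_activations]

# Fraction of units that are exactly zero (inactive).
sparsity = sum(1 for v in h if v == 0.0) / len(h)
print(h, sparsity)
```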
15 ACTIVATION FUNCTION
Topics: softmax activation function
• For multi-class classification:
‣ we need multiple outputs (1 output per class)
‣ we would like to estimate the conditional probability $p(y = c \mid x)$
• We use the softmax activation function at the output:
  $o(a) = \mathrm{softmax}(a) = \left[ \frac{\exp(a_1)}{\sum_c \exp(a_c)} \; \dots \; \frac{\exp(a_C)}{\sum_c \exp(a_c)} \right]^\top$
‣ strictly positive
‣ sums to one
• Predicted class is the one with highest estimated probability
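A small sketch of the softmax output and the resulting class prediction follows. Subtracting the maximum pre-activation before exponentiating is a standard stability trick that leaves the result unchanged. It is an addition here, not part of the slide, and the input values are illustrative.

```python
import math

def softmax(a):
    """softmax(a)_c = exp(a_c) / sum_c' exp(a_c')."""
    m = max(a)                               # shift for numerical stability
    exps = [math.exp(a_c - m) for a_c in a]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([2.0, 1.0, 0.1])
# Predicted class: index of the highest estimated probability.
predicted = max(range(len(p)), key=p.__getitem__)
print(p, predicted)
```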
16 FLOW GRAPH
Topics: flow graph
• Forward propagation can be represented as an acyclic flow graph
• It’s a nice way of implementing forward propagation in a modular way
‣ each box could be an object with an fprop method that computes the value of the box given its parents
‣ calling the fprop method of each box in the right order yields forward propagation
[Figure: flow graph from x through a^(1)(x), h^(1)(x) and a^(2)(x) to f(x), with parameter boxes W^(1), b^(1), W^(2), b^(2)]
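One way to picture the "box with an fprop method" idea is the sketch below. The class names (`Input`, `Affine`, `Tanh`) and the example values are hypothetical. For brevity the parameters W and b are stored inside the `Affine` box rather than being separate parent boxes as in the slide's graph.

```python
import math

class Input:
    """Leaf box holding the input vector x."""
    def __init__(self, value):
        self.value = value

    def fprop(self):
        return self.value

class Affine:
    """Box computing a = b + W h from the value of its parent box."""
    def __init__(self, parent, W, b):
        self.parent, self.W, self.b = parent, W, b
        self.value = None

    def fprop(self):
        h = self.parent.value
        self.value = [b_i + sum(w * v for w, v in zip(row, h))
                      for row, b_i in zip(self.W, self.b)]
        return self.value

class Tanh:
    """Box applying tanh elementwise to its parent's value."""
    def __init__(self, parent):
        self.parent = parent
        self.value = None

    def fprop(self):
        self.value = [math.tanh(v) for v in self.parent.value]
        return self.value

x = Input([1.0, -1.0])
a1 = Affine(x, W=[[0.5, 0.5], [1.0, -1.0]], b=[0.0, 0.0])
h1 = Tanh(a1)

# Calling fprop with parents before children yields forward propagation.
for box in (x, a1, h1):
    box.fprop()
print(h1.value)
```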
Neural Networks Training feedforward neural networks
18 MACHINE LEARNING
Topics: empirical risk minimization, regularization
• Empirical (structural) risk minimization
‣ framework to design learning algorithms
  $\arg\min_\theta \frac{1}{T} \sum_t l(f(x^{(t)}; \theta), y^{(t)}) + \lambda \Omega(\theta)$
‣ $l(f(x^{(t)}; \theta), y^{(t)})$ is a loss function
‣ $\Omega(\theta)$ is a regularizer (penalizes certain values of $\theta$)
• Learning is cast as optimization
‣ ideally, we’d optimize classification error, but it’s not smooth
‣ loss function is a surrogate for what we truly should optimize (e.g. upper bound)
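The regularized objective can be evaluated directly from its definition. The sketch below uses a deliberately tiny, hypothetical setup (a scalar linear model, squared-error loss and a squared-weight regularizer) purely to make the arithmetic easy to follow. None of these choices come from the slides, which target classifiers.

```python
def regularized_risk(theta, data, f, loss, omega, lam):
    """(1/T) * sum_t loss(f(x_t; theta), y_t) + lam * omega(theta)."""
    T = len(data)
    empirical = sum(loss(f(x, theta), y) for x, y in data) / T
    return empirical + lam * omega(theta)

# Hypothetical toy choices, for illustration only.
def f(x, theta):
    return theta * x            # scalar linear model

def loss(pred, y):
    return (pred - y) ** 2      # squared error

def omega(theta):
    return theta ** 2           # L2-style penalty

data = [(1.0, 2.0), (2.0, 4.0)]
value = regularized_risk(1.0, data, f, loss, omega, lam=0.5)
print(value)  # (1 + 4) / 2 + 0.5 * 1 = 3.0
```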
19 MACHINE LEARNING
Topics: stochastic gradient descent (SGD)
• Algorithm that performs updates after each example
‣ initialize $\theta$ ($\theta \equiv \{W^{(1)}, b^{(1)}, \dots, W^{(L+1)}, b^{(L+1)}\}$)
‣ for N epochs (training epoch = iteration over all examples)
- for each training example $(x^{(t)}, y^{(t)})$
  $\Delta = -\nabla_\theta l(f(x^{(t)}; \theta), y^{(t)}) - \lambda \nabla_\theta \Omega(\theta)$
  $\theta \leftarrow \theta + \alpha \Delta$
• To apply this algorithm to neural network training, we need
‣ the loss function $l(f(x^{(t)}; \theta), y^{(t)})$
‣ a procedure to compute the parameter gradients $\nabla_\theta l(f(x^{(t)}; \theta), y^{(t)})$
‣ the regularizer $\Omega(\theta)$ (and the gradient $\nabla_\theta \Omega(\theta)$)
‣ initialization method for $\theta$
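The update loop above can be sketched generically, with the loss gradient and regularizer gradient passed in as functions. The toy problem used to exercise it (one-parameter linear model with squared error) and all names are assumptions chosen so the result is easy to check, not the course's setup.

```python
def sgd(theta, data, grad_loss, grad_omega, alpha, lam, n_epochs):
    """Stochastic gradient descent: one parameter update per example."""
    for _ in range(n_epochs):                 # one epoch = one pass over the data
        for x, y in data:
            # Delta = -grad_theta loss - lam * grad_theta Omega
            delta = [-gl - lam * go
                     for gl, go in zip(grad_loss(theta, x, y), grad_omega(theta))]
            # theta <- theta + alpha * Delta
            theta = [t + alpha * d for t, d in zip(theta, delta)]
    return theta

# Hypothetical toy problem: f(x; theta) = theta[0] * x with squared-error loss.
def grad_loss(theta, x, y):
    return [2.0 * (theta[0] * x - y) * x]

def grad_omega(theta):
    return [2.0 * theta[0]]                   # gradient of theta[0] ** 2

data = [(1.0, 2.0), (2.0, 4.0)]               # generated by y = 2 x
theta = sgd([0.0], data, grad_loss, grad_omega, alpha=0.1, lam=0.0, n_epochs=50)
print(theta)
```

With `lam=0.0` the estimate approaches 2.0; a positive `lam` would pull it toward zero.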
20 LOSS FUNCTION
Topics: loss function for classification
• Neural network estimates $f(x)_c = p(y = c \mid x)$
‣ we could maximize the probabilities of $y^{(t)}$ given $x^{(t)}$ in the training set
• To frame as minimization, we minimize the negative log-likelihood (log is the natural log, ln):
  $l(f(x), y) = -\sum_c 1_{(y = c)} \log f(x)_c = -\log f(x)_y$
‣ we take the log for numerical stability and mathematical simplicity
‣ sometimes referred to as cross-entropy
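Both forms of the negative log-likelihood can be computed and compared directly. This is a small sketch with made-up probabilities; the function names are illustrative.

```python
import math

def nll(f_x, y):
    """-log f(x)_y, where f_x lists class probabilities and y is the true class index."""
    return -math.log(f_x[y])

def nll_indicator_sum(f_x, y):
    """Same loss written as -sum_c 1(y = c) log f(x)_c."""
    return -sum((1.0 if c == y else 0.0) * math.log(p) for c, p in enumerate(f_x))

f_x = [0.7, 0.2, 0.1]          # hypothetical network output
loss_value = nll(f_x, 0)
print(loss_value, nll_indicator_sum(f_x, 0))
```

The loss is zero when the network assigns probability 1 to the correct class and grows without bound as that probability approaches 0.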
21 BACKPROPAGATION
Topics: backpropagation algorithm
• Use the chain rule to efficiently compute gradients, top to bottom
‣ compute output gradient (before activation), where $e(y)$ is the one-hot vector for class $y$:
  $\nabla_{a^{(L+1)}(x)} -\log f(x)_y = -(e(y) - f(x))$
‣ for k from L+1 to 1
- compute gradients of hidden layer parameters
  $\nabla_{W^{(k)}} -\log f(x)_y = \left(\nabla_{a^{(k)}(x)} -\log f(x)_y\right) h^{(k-1)}(x)^\top$
  $\nabla_{b^{(k)}} -\log f(x)_y = \nabla_{a^{(k)}(x)} -\log f(x)_y$
- compute gradient of hidden layer below
  $\nabla_{h^{(k-1)}(x)} -\log f(x)_y = W^{(k)\top} \left(\nabla_{a^{(k)}(x)} -\log f(x)_y\right)$
- compute gradient of hidden layer below (before activation), with $\odot$ the elementwise product
  $\nabla_{a^{(k-1)}(x)} -\log f(x)_y = \left(\nabla_{h^{(k-1)}(x)} -\log f(x)_y\right) \odot [\dots, g'(a^{(k-1)}(x)_j), \dots]$
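The backward recursion can be sketched for the smallest interesting case: one tanh hidden layer followed by a softmax output, with the analytic gradient compared against a finite-difference estimate. Everything concrete here (helper names, parameter values, the choice of tanh, the check on a single weight) is an assumption made for illustration.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(a):
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, W1, b1, W2, b2):
    a1 = [s + b for s, b in zip(matvec(W1, x), b1)]
    h1 = [math.tanh(v) for v in a1]
    a2 = [s + b for s, b in zip(matvec(W2, h1), b2)]
    return h1, softmax(a2)

def loss(x, y, W1, b1, W2, b2):
    return -math.log(forward(x, W1, b1, W2, b2)[1][y])

def backward(x, y, W1, b1, W2, b2):
    h1, f = forward(x, W1, b1, W2, b2)
    # Output gradient before activation: -(e(y) - f(x)).
    grad_a2 = [p - (1.0 if c == y else 0.0) for c, p in enumerate(f)]
    grad_W2 = [[g * h for h in h1] for g in grad_a2]     # grad_a2 h1^T
    grad_b2 = list(grad_a2)
    # Gradient of the hidden layer below: W2^T grad_a2.
    grad_h1 = [sum(W2[c][j] * grad_a2[c] for c in range(len(grad_a2)))
               for j in range(len(h1))]
    # Before activation: multiply elementwise by tanh'(a) = 1 - tanh(a)^2.
    grad_a1 = [gh * (1.0 - h * h) for gh, h in zip(grad_h1, h1)]
    grad_W1 = [[g * x_i for x_i in x] for g in grad_a1]  # grad_a1 x^T
    grad_b1 = list(grad_a1)
    return grad_W1, grad_b1, grad_W2, grad_b2

# Illustrative example and parameters.
x, y = [1.0, -2.0], 1
W1, b1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1, -0.1]
W2, b2 = [[0.3, -0.1, 0.2], [-0.4, 0.5, 0.1]], [0.05, -0.05]
grad_W1, grad_b1, grad_W2, grad_b2 = backward(x, y, W1, b1, W2, b2)

# Central finite-difference estimate for the single weight W1[0][1].
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_plus[0][1] += eps
W1_minus = [row[:] for row in W1]
W1_minus[0][1] -= eps
numeric = (loss(x, y, W1_plus, b1, W2, b2)
           - loss(x, y, W1_minus, b1, W2, b2)) / (2 * eps)
print(grad_W1[0][1], numeric)
```

The two printed numbers should agree closely; a mismatch usually points at a sign or transpose error in the backward pass.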
22 ACTIVATION FUNCTION
Topics: sigmoid activation function gradient
• Partial derivative:
  $g'(a) = g(a)(1 - g(a))$
  where $g(a) = \mathrm{sigm}(a) = \frac{1}{1 + \exp(-a)}$
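The sigmoid derivative follows from the chain rule in two steps. The derivation is added here as a worked step and is not on the original slide.

```latex
g(a) = \bigl(1 + \exp(-a)\bigr)^{-1}
\quad\Longrightarrow\quad
g'(a) = \frac{\exp(-a)}{\bigl(1 + \exp(-a)\bigr)^{2}}
      = \frac{1}{1 + \exp(-a)} \cdot \frac{\exp(-a)}{1 + \exp(-a)}
      = g(a)\,\bigl(1 - g(a)\bigr),
\qquad\text{since } 1 - g(a) = \frac{\exp(-a)}{1 + \exp(-a)} .
```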
23 ACTIVATION FUNCTION
Topics: tanh activation function gradient
• Partial derivative:
  $g'(a) = 1 - g(a)^2$
  where $g(a) = \tanh(a) = \frac{\exp(a) - \exp(-a)}{\exp(a) + \exp(-a)} = \frac{\exp(2a) - 1}{\exp(2a) + 1}$
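The tanh derivative can be checked with the quotient rule. The shorthand u and v below is introduced only for this derivation.

```latex
u(a) = \exp(a) - \exp(-a), \quad v(a) = \exp(a) + \exp(-a)
\quad\Longrightarrow\quad u'(a) = v(a), \quad v'(a) = u(a)

g(a) = \frac{u(a)}{v(a)}
\quad\Longrightarrow\quad
g'(a) = \frac{u'(a)\,v(a) - u(a)\,v'(a)}{v(a)^{2}}
      = \frac{v(a)^{2} - u(a)^{2}}{v(a)^{2}}
      = 1 - g(a)^{2}
```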
24 ACTIVATION FUNCTION
Topics: rectified linear activation function gradient
• Partial derivative:
  $g'(a) = 1_{a > 0}$
  where $g(a) = \mathrm{reclin}(a) = \max(0, a)$
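The indicator form of the rectified linear gradient can be compared against a finite-difference estimate away from the kink at a = 0. The helper names are illustrative, and returning 0 exactly at a = 0 is a convention assumed here.

```python
def reclin(a):
    return max(0.0, a)

def reclin_grad(a):
    """Indicator 1_{a > 0}; at a = 0 this returns 0.0 by convention."""
    return 1.0 if a > 0 else 0.0

def numeric_grad(g, a, eps=1e-6):
    """Central finite-difference estimate of g'(a)."""
    return (g(a + eps) - g(a - eps)) / (2 * eps)

for a in (-2.0, -0.5, 0.5, 3.0):
    print(a, reclin_grad(a), numeric_grad(reclin, a))
```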
25 FLOW GRAPH
Topics: automatic differentiation
• Each object also has a bprop method
‣ it computes the gradient of the loss with respect to each parent
‣ fprop depends on the fprop of a box’s parents, while bprop depends on the bprop of a box’s children
• By calling bprop in the reverse order, we get backpropagation
‣ only need to reach the parameters
[Figure: flow graph from x through a^(1)(x), h^(1)(x) and a^(2)(x) to f(x), with parameter boxes W^(1), b^(1), W^(2), b^(2)]
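A minimal sketch of boxes with both fprop and bprop methods is given below, using scalar values to keep it short. The class names, the scalar simplification, and the choice to treat the final tanh output as the "loss" are all assumptions for illustration. The sketch also accumulates a gradient for the input x, even though, as noted above, only the parameters need to be reached.

```python
import math

class Box:
    def __init__(self, *parents):
        self.parents = parents
        self.value = None
        self.grad = 0.0      # gradient of the loss w.r.t. this box's value

class Leaf(Box):
    """Input or parameter box: nothing to compute or propagate."""
    def __init__(self, value):
        super().__init__()
        self.value = value

    def fprop(self):
        pass

    def bprop(self):
        pass

class Mul(Box):
    def fprop(self):
        a, b = self.parents
        self.value = a.value * b.value

    def bprop(self):
        a, b = self.parents
        a.grad += self.grad * b.value
        b.grad += self.grad * a.value

class Add(Box):
    def fprop(self):
        self.value = sum(p.value for p in self.parents)

    def bprop(self):
        for p in self.parents:
            p.grad += self.grad

class Tanh(Box):
    def fprop(self):
        self.value = math.tanh(self.parents[0].value)

    def bprop(self):
        self.parents[0].grad += self.grad * (1.0 - self.value ** 2)

x, w, b = Leaf(0.5), Leaf(-1.2), Leaf(0.3)
wx = Mul(w, x)
a = Add(wx, b)
h = Tanh(a)

order = [x, w, b, wx, a, h]          # parents before children
for box in order:
    box.fprop()

h.grad = 1.0                         # seed: d(loss)/d(h) with loss = h
for box in reversed(order):          # children before parents
    box.bprop()
print(w.grad, b.grad)
```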