

  1. Image: Jose-Luis Olivares CMP784 DEEP LEARNING Lecture #03 – Multi-layer Perceptrons Aykut Erdem // Hacettepe University // Spring 2018

  2. Breaking news!
     • Practical 1 is out!
       - Learning neural word embeddings
       - Due Friday, Mar. 16, 23:59:59
     • Paper presentations and quizzes will start next week!
       - Discuss your slides with me 3-4 days prior to your presentation
       - Submit your final slides by the night before the class.
       - We don't have any code walker or demonstrator.

  3. Previously on CMP784 • Learning problem • Parametric vs. non-parametric models • Nearest-neighbor classifier • Linear classification • Linear regression • Capacity • Hyperparameters • Underfitting • Overfitting • Bias-variance tradeoff • Model selection • Cross-validation

  4. Lecture overview
     • the perceptron
     • the multi-layer perceptron
     • stochastic gradient descent
     • backpropagation
     • shallow yet very powerful: word2vec
     • Disclaimer: Much of the material and slides for this lecture were borrowed from
       - Hugo Larochelle's Neural networks slides
       - Nick Locascio's MIT 6.S191 slides
       - Efstratios Gavves and Max Welling's UvA deep learning class
       - Leonid Sigal's CPSC532L class
       - Richard Socher's CS224d class
       - Dan Jurafsky's CS124 class

  5. A Brief History of Neural Networks
     [timeline figure, up to today] Image: VUNI Inc.

  6. The Perceptron

  7. The Perceptron
     [diagram: inputs x_0 … x_n with weights w_0 … w_n and a bias b (constant input 1), feeding a sum ∑ followed by a non-linearity]

  8. Perceptron Forward Pass
     • Neuron pre-activation (or input activation): a(x) = b + Σ_i w_i x_i = b + w^T x
     • Neuron output activation: h(x) = g(a(x)) = g(b + Σ_i w_i x_i)
       where w are the weights (parameters), b is the bias term, and g(·) is called the activation function
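
     A minimal NumPy sketch of this forward pass (the variable names and the choice of sigmoid for g are illustrative assumptions, not from the slides):

         import numpy as np

         def sigmoid(a):
             # one possible activation function g: g(a) = 1 / (1 + exp(-a))
             return 1.0 / (1.0 + np.exp(-a))

         def neuron_forward(x, w, b, g=sigmoid):
             a = b + np.dot(w, x)   # pre-activation: a(x) = b + sum_i w_i x_i = b + w^T x
             return g(a)            # output activation: h(x) = g(a(x))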

  9. Output Activation of the Neuron
     • h(x) = g(a(x)) = g(b + Σ_i w_i x_i)
     • Range is determined by g(·)
     • Bias only changes the position of the riff (the activation curve is only shifted along a(x))
     Image credit: Pascal Vincent (from Pascal Vincent's slides)

  10. Linear Activation Function
     • h(x) = g(a(x)) = g(b + Σ_i w_i x_i)
     • g(a) = a
     • No nonlinear transformation
     • No input squashing

  11. Sigmoid Activation Function
     • h(x) = g(a(x)) = g(b + Σ_i w_i x_i)
     • g(a) = sigm(a) = 1 / (1 + exp(-a))
     • Squashes the neuron's output between 0 and 1
     • Always positive
     • Bounded
     • Strictly increasing
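
     A quick numerical check of these properties (an illustrative sketch, not from the slides):

         import numpy as np

         a = np.linspace(-10, 10, 1000)
         s = 1.0 / (1.0 + np.exp(-a))       # sigm(a)
         assert np.all((s > 0) & (s < 1))   # always positive and bounded between 0 and 1
         assert np.all(np.diff(s) > 0)      # strictly increasing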

  12. Perceptron Forward Pass
     • h(x) = g(a(x)) = g(b + Σ_i w_i x_i)
     • Example: inputs (2, 3, -1, 5) with weights (0.1, 0.5, 2.5, 0.2) and bias weight 3.0

  13. Perceptron Forward Pass
     • h(x) = g((2*0.1) + (3*0.5) + (-1*2.5) + (5*0.2) + (1*3.0))

  14. Perceptron Forward Pass
     • h(x) = g(3.2) = σ(3.2) = 1 / (1 + e^(-3.2)) ≈ 0.96
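
     The arithmetic on the slide can be reproduced directly (a small sketch using the sigmoid for g):

         import numpy as np

         x = np.array([2.0, 3.0, -1.0, 5.0])
         w = np.array([0.1, 0.5, 2.5, 0.2])
         b = 3.0                           # bias weight, multiplied by the constant input 1

         a = b + np.dot(w, x)              # (2*0.1) + (3*0.5) + (-1*2.5) + (5*0.2) + (1*3.0) = 3.2
         h = 1.0 / (1.0 + np.exp(-a))      # sigma(3.2)
         print(round(a, 2), round(h, 2))   # 3.2 0.96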

  15. Hyperbolic Tangent (tanh) Activation Function
     • h(x) = g(a(x)) = g(b + Σ_i w_i x_i)
     • g(a) = tanh(a) = (exp(a) - exp(-a)) / (exp(a) + exp(-a)) = (exp(2a) - 1) / (exp(2a) + 1)
     • Squashes the neuron's output between -1 and 1
     • Can be positive or negative
     • Bounded
     • Strictly increasing
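
     The two forms of tanh above are equivalent, which is easy to verify numerically along with the (-1, 1) range (an illustrative sketch):

         import numpy as np

         a = np.linspace(-5, 5, 101)
         t1 = (np.exp(a) - np.exp(-a)) / (np.exp(a) + np.exp(-a))
         t2 = (np.exp(2 * a) - 1) / (np.exp(2 * a) + 1)
         assert np.allclose(t1, t2) and np.allclose(t1, np.tanh(a))
         assert np.all((t1 > -1) & (t1 < 1))   # squashed between -1 and 1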

  16. Rectified Linear (ReLU) Activation Function
     • h(x) = g(a(x)) = g(b + Σ_i w_i x_i)
     • g(a) = reclin(a) = max(0, a)
     • Bounded below by 0 (always non-negative)
     • Not upper bounded
     • Monotonically increasing (flat at 0 for negative inputs)
     • Tends to produce units with sparse activities
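
     The sparsity claim is easy to see on random pre-activations (a toy sketch; the Gaussian inputs are just an illustration):

         import numpy as np

         rng = np.random.default_rng(0)
         a = rng.normal(size=1000)      # toy pre-activations
         h = np.maximum(0.0, a)         # reclin(a) = max(0, a)
         assert np.all(h >= 0)          # bounded below by 0
         print(np.mean(h == 0))         # roughly half of the units are exactly zero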

  17. Decision Boundary of a Neuron
     • Could do binary classification:
       - with sigmoid, one can interpret the neuron as estimating p(y = 1 | x)
       - also known as a logistic regression classifier
       - if the activation is greater than 0.5, predict 1; otherwise predict 0
     • Decision boundary is linear
     • Same idea can be applied to a tanh activation
     Image credit: Pascal Vincent (from Pascal Vincent's slides)
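
     Because the sigmoid crosses 0.5 exactly where the pre-activation is 0, "predict 1 if the output is above 0.5" is the same test as b + w^T x > 0, which is why the boundary is linear. A minimal sketch with made-up parameters:

         import numpy as np

         def predict(x, w, b):
             p = 1.0 / (1.0 + np.exp(-(b + np.dot(w, x))))   # estimated p(y = 1 | x)
             return 1 if p > 0.5 else 0                      # equivalent to checking b + w^T x > 0

         print(predict(np.array([1.0, 2.0]), np.array([0.5, -0.25]), 0.1))   # 1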

  18. Capacity of Single Neuron
     • Can solve linearly separable problems
     [plots: linearly separable functions such as OR(x_1, x_2) and AND(x_1, x_2) over the (x_1, x_2) input space]

  19. Capacity of Single Neuron
     • Cannot solve non-linearly separable problems
     [plot: XOR(x_1, x_2) over the (x_1, x_2) input space, which no single linear boundary can separate]
     • Need to transform the input into a better representation
     • Remember basis functions!
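
     One way to see how a better representation helps: hidden units computing OR and NAND re-represent the inputs so that a single output neuron can AND them into XOR. A hand-wired sketch with threshold units (the weights are chosen by hand, not learned):

         import numpy as np

         def step(a):
             return (a > 0).astype(int)

         def xor_net(x1, x2):
             x = np.array([x1, x2])
             h_or = step(np.dot([1, 1], x) - 0.5)      # OR(x1, x2)
             h_nand = step(np.dot([-1, -1], x) + 1.5)  # NAND(x1, x2)
             return step(h_or + h_nand - 1.5)          # AND of the two hidden units

         for x1 in (0, 1):
             for x2 in (0, 1):
                 print(x1, x2, xor_net(x1, x2))        # XOR truth table: 0, 1, 1, 0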

  20. Perceptron Diagram Simplified
     [diagram: inputs x_0 … x_n, weights w_0 … w_n, bias b, sum ∑ and non-linearity]

  21. Perceptron Diagram Simplified
     [simplified diagram: inputs x_0 … x_n connected directly to a single output o_0]

  22. Multi-Output Perceptron
     • Remember multi-way classification:
       - We need multiple outputs (1 output per class)
       - We need to estimate the conditional probability p(y = c | x)
       - Discriminative learning
     • Softmax activation function at the output:
       o(a) = softmax(a) = [exp(a_1) / Σ_c exp(a_c), …, exp(a_C) / Σ_c exp(a_c)]^T
       - Strictly positive
       - Sums to one
     • Predict the class with the highest estimated class conditional probability.
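
     A minimal softmax sketch (subtracting the max is a standard numerical-stability trick that does not change the result; it is not part of the slide's formula):

         import numpy as np

         def softmax(a):
             # o(a)_c = exp(a_c) / sum_c' exp(a_c')
             e = np.exp(a - np.max(a))
             return e / np.sum(e)

         p = softmax(np.array([2.0, 1.0, -1.0]))
         print(p, p.sum(), p.argmax())   # strictly positive, sums to one, predict the argmax class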

  23. Multi-Layer Perceptron

  24. Single Hidden Layer Neural Network
     • Hidden layer pre-activation: a(x) = b^(1) + W^(1) x, i.e. a(x)_i = b_i^(1) + Σ_j W_ij^(1) x_j
     • Hidden layer activation: h(x) = g(a(x))
     • Output layer activation: o(x) = o(b^(2) + (w^(2))^T h^(1)(x))
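
     Putting the three formulas together, one possible single-hidden-layer forward pass (a sketch; the tanh hidden activation, sigmoid output, and all-zero example parameters are assumptions):

         import numpy as np

         def forward_one_hidden(x, W1, b1, w2, b2, g=np.tanh):
             a1 = b1 + W1 @ x                                 # a(x) = b(1) + W(1) x
             h1 = g(a1)                                       # h(x) = g(a(x))
             return 1.0 / (1.0 + np.exp(-(b2 + w2 @ h1)))     # o(b(2) + w(2)^T h(1)(x)), sigmoid output

         x = np.array([-20.0, 45.0])
         W1 = np.zeros((3, 2)); b1 = np.zeros(3)
         w2 = np.zeros(3); b2 = 0.0
         print(forward_one_hidden(x, W1, b1, w2, b2))         # 0.5 with all-zero parameters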

  25. Multi-Layer Perceptron (MLP)
     • Consider a network with L hidden layers.
       - layer pre-activation for k > 0: a^(k)(x) = b^(k) + W^(k) h^(k-1)(x)   (with h^(0)(x) = x)
       - hidden layer activation from 1 to L: h^(k)(x) = g(a^(k)(x))
       - output layer activation (k = L+1): h^(L+1)(x) = o(a^(L+1)(x)) = f(x)
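
     The same recursion generalizes to L hidden layers with a loop (a sketch; Ws and bs are assumed lists of per-layer weights and biases whose last entry is the output layer):

         import numpy as np

         def mlp_forward(x, Ws, bs, g=np.tanh, output=lambda a: a):
             h = x                                   # h(0)(x) = x
             for W, b in zip(Ws[:-1], bs[:-1]):
                 h = g(b + W @ h)                    # h(k)(x) = g(b(k) + W(k) h(k-1)(x))
             return output(bs[-1] + Ws[-1] @ h)      # f(x) = o(a(L+1)(x))

         # example with made-up shapes: two hidden layers of 4 units, one linear output
         rng = np.random.default_rng(0)
         Ws = [rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(1, 4))]
         bs = [np.zeros(4), np.zeros(4), np.zeros(1)]
         print(mlp_forward(rng.normal(size=3), Ws, bs))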

  26. Deep Neural Network
     [diagram: inputs x_0 … x_n, multiple hidden layers of units h_0 … h_n, outputs o_0 … o_n]

  27. Capacity of Neural Networks
     • Consider a single layer neural network
     [figure: a two-input network with hidden units y_1, y_2 and output z_1, shown together with the decision regions each unit defines over the (x_1, x_2) input space]
     Image credit: Pascal Vincent (from Pascal Vincent's slides)

  28. Capacity of Neural Networks
     • Consider a single layer neural network
     [figure: hidden units y_1 … y_4 combine into the output z_1, partitioning the (x_1, x_2) input space]
     Image credit: Pascal Vincent (from Pascal Vincent's slides)

  29. Capacity of Neural Networks
     • Consider a single layer neural network
     Image credit: Pascal Vincent (from Pascal Vincent's slides)

  30. Universal Approximation
     • Universal Approximation Theorem (Hornik, 1991):
       - "a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units"
     • This applies for sigmoid, tanh and many other activation functions.
     • However, this does not mean that there is a learning algorithm that can find the necessary parameter values.

  31. Applying Neural Networks

  32. Example Problem: Will my flight be delayed?
     • Inputs: Temperature: -20 F, Wind Speed: 45 mph, i.e. x = [-20, 45]
     [diagram: the two inputs x_0, x_1 feed hidden units h_0, h_1, h_2 and an output o_0]
     • Predicted: 0.05

  33. Example Problem: Will my flight be delayed?
     [diagram: the network maps x = [-20, 45] through the hidden layer to the output]
     • Predicted: 0.05

  34. Example Problem: Will my flight be delayed?
     [same diagram, now with the true label shown]
     • Input: [-20, 45]   Predicted: 0.05   Actual: 1

  35. Quantifying Loss
     • For a single example, the loss ℓ(f(x^(i); θ), y^(i)) compares the predicted output f(x^(i); θ) against the actual label y^(i)
     • Example: input [-20, 45], Predicted: 0.05, Actual: 1

  36. Total Loss
     Input        Predicted   Actual
     [-20, 45]    0.05        1
     [80, 0]      0.02        0
     [4, 15]      0.96        1
     [45, 60]     0.35        1
     J(θ) = (1/N) Σ_i ℓ(f(x^(i); θ), y^(i))

  37. Total Loss
     • J(θ) = (1/N) Σ_i ℓ(f(x^(i); θ), y^(i)): the total (empirical) loss averages the per-example losses over the N training examples.
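
     With the four (prediction, label) pairs from the table, J(θ) can be computed once a per-example loss ℓ is chosen; binary cross-entropy is an assumed choice here, since the slide leaves ℓ unspecified:

         import numpy as np

         predicted = np.array([0.05, 0.02, 0.96, 0.35])   # f(x(i); theta), from the slide
         actual = np.array([1.0, 0.0, 1.0, 1.0])          # y(i), from the slide

         # assumed per-example loss: binary cross-entropy
         losses = -(actual * np.log(predicted) + (1 - actual) * np.log(1 - predicted))
         J = losses.mean()                                # J(theta) = (1/N) sum_i loss_i
         print(J)                                         # ≈ 1.03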
