Neural Networks: CE417, Introduction to Artificial Intelligence (lecture slides)


  1. Neural Networks. CE417: Introduction to Artificial Intelligence, Sharif University of Technology, Fall 2019, Soleymani. Some slides are based on Anca Dragan's slides (CS188, UC Berkeley), and some are adapted from Bhiksha Raj (11-785, CMU, 2019).

  2. (no text content)

  3. McCulloch-Pitts neuron: binary threshold. Inputs x_1, …, x_N with weights w_1, …, w_N are combined into z = Σ_i w_i x_i, and the unit outputs
        y = 1 if z ≥ θ, 0 if z < θ   (θ: activation threshold).
     Equivalently, with a bias b = −θ, the unit outputs 1 if Σ_i w_i x_i + b ≥ 0 and 0 otherwise.
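To make the slide above concrete, here is a minimal sketch of such a binary threshold unit in NumPy; the function names and the example weights and threshold are illustrative choices of mine, not taken from the lecture.

```python
import numpy as np

def threshold_neuron(x, w, theta):
    """Binary threshold unit: fires (outputs 1) iff the weighted sum reaches theta."""
    z = np.dot(w, x)                     # weighted sum of inputs
    return 1 if z >= theta else 0

def threshold_neuron_bias(x, w, b):
    """Equivalent formulation with a bias b = -theta."""
    return 1 if np.dot(w, x) + b >= 0 else 0

# Example (illustrative values): two inputs, both weighted 1, threshold 1.5
x = np.array([1.0, 1.0])
w = np.array([1.0, 1.0])
print(threshold_neuron(x, w, theta=1.5))     # 1, since 2.0 >= 1.5
print(threshold_neuron_bias(x, w, b=-1.5))   # same decision in the bias form
```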

  4. Neural nets and the brain. (Figure: a single unit summing weighted inputs x_1, …, x_N plus a bias b.)
     • Neural nets are composed of networks of computational models of neurons called perceptrons.

  5. The perceptron.
        y = 1 if Σ_i w_i x_i ≥ θ, else 0   (equivalently, bias b = −θ)
     • A threshold unit: it "fires" if the weighted sum of inputs exceeds a threshold.
       – Electrical engineers will call this a threshold gate.
     • A basic unit of Boolean circuits.
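As a small illustration of the "threshold gate as a basic unit of Boolean circuits" remark, the sketch below realizes AND and OR with hand-picked weights and thresholds; the values and binary 0/1 inputs are my own assumptions, not from the slides.

```python
import numpy as np

def threshold_gate(x, w, theta):
    """Perceptron as a threshold gate: output 1 iff the weighted input sum reaches theta."""
    return int(np.dot(w, x) >= theta)

# AND and OR over binary inputs, realized by hand-chosen weights/thresholds
AND = lambda a, b: threshold_gate([a, b], w=[1, 1], theta=2)   # fires only if both inputs are 1
OR  = lambda a, b: threshold_gate([a, b], w=[1, 1], theta=1)   # fires if at least one input is 1

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))
```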

  6. Perceptron.
        z = Σ_i w_i x_i + b
     (Figure: the decision boundary in the (x_1, x_2) plane, with output 1 on one side and 0 on the other.)
     • Learn this function.
     • A step function across a hyperplane.

  7. Learning the perceptron.
        y = 1 if Σ_i w_i x_i + b ≥ 0, 0 otherwise
     • Given a number of input-output pairs, learn the weights and bias.
       – Learn W = [w_1, …, w_N] and b, given several (X, y) pairs.

  8. Restating the perceptron. Restate the perceptron equation by adding another dimension to X: set x_{d+1} = 1 and fold the bias into the weights (w_{d+1} = b). Then
        y = 1 if Σ_{i=1}^{d+1} w_i x_i ≥ 0, 0 otherwise.
     • Note that the boundary Σ_{i=1}^{d+1} w_i x_i = 0 is now a hyperplane through the origin.
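A brief sketch of this restatement: appending a constant feature 1 and folding b into the weight vector gives exactly the same decisions as the explicit-bias form. All numbers below are illustrative, not from the slides.

```python
import numpy as np

def perceptron_output(w, x):
    """Threshold unit with the bias folded into the weights (hyperplane through the origin)."""
    return int(np.dot(w, x) >= 0)

def augment(x):
    """Append the constant feature x_{d+1} = 1 so the bias becomes just another weight."""
    return np.append(x, 1.0)

# Illustrative weights w and bias b
w, b = np.array([2.0, -1.0]), -0.5
x = np.array([0.5, 0.3])

w_aug = np.append(w, b)                      # w_{d+1} = b
print(perceptron_output(w_aug, augment(x)))  # same decision as...
print(int(np.dot(w, x) + b >= 0))            # ...the explicit-bias form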

  9. The Perceptron Problem. Find the hyperplane Σ_{i=1}^{d+1} w_i x_i = 0 that perfectly separates the two groups of points.

  10. Perceptron Algorithm: Summary.
     • Cycle through the training instances.
     • Only update w on misclassified instances.
     • If an instance is misclassified:
       – If the instance is of the positive class: w = w + x
       – If the instance is of the negative class: w = w − x
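The following is a minimal sketch of this algorithm on a toy linearly separable set (the data and epoch count are my own assumptions); the updates w ← w + x for positive and w ← w − x for negative misclassifications are written compactly as w ← w + y·x with labels y ∈ {+1, −1}.

```python
import numpy as np

def perceptron_train(X, y, epochs=20):
    """Perceptron learning rule from the slide: for each misclassified (x, y),
    update w <- w + x (positive class) or w <- w - x (negative class), i.e. w <- w + y*x.
    The bias is handled by an appended constant feature (as in slide 8)."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])    # absorb bias
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if np.sign(np.dot(w, x_i)) != y_i:      # misclassified (sign(0) counts as wrong)
                w += y_i * x_i
    return w

# Tiny linearly separable toy set (illustrative data, not from the lecture)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
w = perceptron_train(X, y)
print(w, np.sign(np.hstack([X, np.ones((4, 1))]) @ w))   # should reproduce y
```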

  11. Training of Single Layer.
     • Weight update for a training pair (x^(n), y^(n)):   w^(t+1) = w^(t) − η ∇E_n(w^(t))
     • Perceptron: if sign(w^T x^(n)) ≠ y^(n), i.e. the pair is misclassified, then
          η ∇E_n(w^(t)) = −η y^(n) x^(n),   with   E_n(w) = −y^(n) w^T x^(n)
     • ADALINE (Widrow-Hoff, LMS, or delta rule):
          η ∇E_n(w^(t)) = −η (y^(n) − w^T x^(n)) x^(n),   with   E_n(w) = (y^(n) − w^T x^(n))²
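A compact sketch contrasting the two per-example updates on this slide: the perceptron rule only reacts to misclassifications, while the delta (ADALINE/LMS) rule always steps proportionally to the residual. The learning rates and the sample pair are illustrative.

```python
import numpy as np

def perceptron_update(w, x, y, eta=1.0):
    """Perceptron rule: move the weights only when sign(w.x) disagrees with the label y."""
    if np.sign(np.dot(w, x)) != y:
        w = w + eta * y * x
    return w

def delta_rule_update(w, x, y, eta=0.1):
    """ADALINE / Widrow-Hoff / LMS (delta) rule: always step proportionally
    to the residual (y - w.x)."""
    return w + eta * (y - np.dot(w, x)) * x

# Illustrative single training pair
w = np.zeros(3)
x, y = np.array([1.0, -0.5, 1.0]), +1
print(perceptron_update(w.copy(), x, y))   # jumps by y*x when misclassified
print(delta_rule_update(w.copy(), x, y))   # small step proportional to the error
```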

  12. Perceptron vs. Delta Rule.
     • Perceptron learning rule: guaranteed to succeed if the training examples are linearly separable.
     • Delta rule: guaranteed to converge to the hypothesis with the minimum squared error; can also be used for regression problems.

  13. Reminder: Linear Classifiers. Inputs are feature values f_1, f_2, f_3, …; each feature has a weight w_1, w_2, w_3, …; the sum Σ_i w_i f_i is the activation. If the activation is positive, output +1; if negative, output −1. (Figure: features feeding a weighted sum followed by a > 0 test.)

  14. The "soft" perceptron (logistic).
        z = Σ_i w_i x_i − θ,    y = 1 / (1 + exp(−z))
     • A "squashing" function instead of a threshold at the output: the sigmoid "activation" replaces the threshold.
     • Activation: the function that acts on the weighted combination of inputs (and threshold).
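A minimal sketch of this "soft" perceptron: the hard threshold is replaced by the logistic sigmoid, so the output varies smoothly in (0, 1). The input values here are illustrative.

```python
import numpy as np

def sigmoid(z):
    """Logistic 'squashing' function: smooth, bounded in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def soft_perceptron(x, w, theta):
    """'Soft' perceptron: sigmoid applied to the weighted sum minus the threshold."""
    return sigmoid(np.dot(w, x) - theta)

# Illustrative values: the output moves smoothly between 0 and 1 instead of jumping
x, w = np.array([1.0, 2.0]), np.array([0.5, -0.25])
for theta in (-2.0, 0.0, 2.0):
    print(theta, soft_perceptron(x, w, theta))
```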

  15. Sigmoid neurons.
     • These give a real-valued output that is a smooth and bounded function of their total input.
     • Typically they use the logistic function.
     • They have nice derivatives.

  16. Other "activations".
        sigmoid: 1 / (1 + exp(−z));   tanh(z);   log(1 + e^z)
     • The activation does not always have to be a squashing function; we will hear more about activations later.
     • We will continue to assume a "threshold" activation in this lecture.

  17. How to get probabilistic decisions?
     • Activation: z = w^T x.
     • If z = w^T x is very positive → want the probability to go to 1.
     • If z = w^T x is very negative → want the probability to go to 0.
     • Sigmoid function: φ(z) = 1 / (1 + e^(−z)).

  18. Best w? Maximum likelihood estimation:
        max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)
     with
        P(y^(i) = +1 | x^(i); w) = 1 / (1 + e^(−w^T x^(i)))
        P(y^(i) = −1 | x^(i); w) = 1 − 1 / (1 + e^(−w^T x^(i)))
     = Logistic Regression
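A short sketch of this binary log-likelihood, assuming labels coded as +1/−1 so that both cases collapse to log σ(y · w^T x); the data values are my own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    """Log-likelihood of binary logistic regression with labels y in {+1, -1}:
    sum_i log P(y_i | x_i; w), where P(+1|x;w) = sigmoid(w.x) and P(-1|x;w) = 1 - sigmoid(w.x).
    Since sigmoid(-z) = 1 - sigmoid(z), both cases collapse to sigmoid(y_i * w.x_i)."""
    return np.sum(np.log(sigmoid(y * (X @ w))))

# Illustrative data
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.5, 0.5]])
y = np.array([+1, -1, -1])
print(log_likelihood(np.array([0.5, -0.5]), X, y))
```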

  19. Multiclass Logistic Regression.
     • Multi-class linear classification: a weight vector for each class y: w_y.
     • Score (activation) of a class y: w_y^T x.
     • Prediction: the class with the highest score wins.
     • How to make the scores into probabilities? Softmax:
          z_1, z_2, z_3 → e^(z_1) / (e^(z_1) + e^(z_2) + e^(z_3)),  e^(z_2) / (e^(z_1) + e^(z_2) + e^(z_3)),  e^(z_3) / (e^(z_1) + e^(z_2) + e^(z_3))
       (original activations → softmax activations)
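A small, numerically stable softmax sketch, turning a vector of class scores into probabilities; the example scores are illustrative.

```python
import numpy as np

def softmax(z):
    """Turn raw class scores (activations) into probabilities.
    Subtracting max(z) first does not change the result (it cancels in the ratio)
    but avoids overflow in exp for large scores."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

scores = np.array([2.0, 1.0, 0.1])   # illustrative activations w_y . x for three classes
probs = softmax(scores)
print(probs, probs.sum())            # probabilities summing to 1; argmax unchanged
```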

  20. Best w? Maximum likelihood estimation:
        max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)
     with
        P(y^(i) | x^(i); w) = e^(w_{y^(i)} · f(x^(i))) / Σ_y e^(w_y · f(x^(i)))
     = Multi-Class Logistic Regression

  21. Batch Gradient Ascent on the Log Likelihood Objective
        max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)
     • init w
     • for iter = 1, 2, …
          w ← w + α Σ_i ∇ log P(y^(i) | x^(i); w)
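A sketch of batch gradient ascent for the binary logistic-regression likelihood above (labels in {+1, −1}); the step size, iteration count, and data are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_ascent(X, y, alpha=0.1, iters=100):
    """Batch gradient ascent on the log-likelihood of binary logistic regression
    (labels in {+1, -1}): the full gradient sums over every training example
    before each update, as in the pseudo-code on the slide.
    The gradient of log sigmoid(y_i * w.x_i) w.r.t. w is (1 - sigmoid(y_i * w.x_i)) * y_i * x_i."""
    w = np.zeros(X.shape[1])                       # init w
    for _ in range(iters):                         # for iter = 1, 2, ...
        margins = y * (X @ w)
        grad = ((1.0 - sigmoid(margins)) * y) @ X  # sum_i grad log P(y_i | x_i; w)
        w = w + alpha * grad
    return w

# Illustrative data
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.5, 0.5], [-1.0, -2.0]])
y = np.array([+1, +1, -1, -1])
print(batch_gradient_ascent(X, y))
```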

  22. Logistic regression: multi-class.
        w_k^(t+1) = w_k^(t) − η ∇_{w_k} J(W^(t))
        ∇_{w_k} J(W) = Σ_{i=1}^{n} ( P(y_k | x^(i); W) − y_k^(i) ) x^(i)
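A sketch of this multi-class gradient with one-hot targets Y and one weight column per class; the layout (a d × K weight matrix) and the toy data are my own choices.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def multiclass_lr_grad(W, X, Y):
    """Gradient of the multi-class logistic-regression cost in the form on the slide:
    for each class k, grad_{w_k} J(W) = sum_i (P(y_k | x_i; W) - y_ik) x_i,
    with Y one-hot (n x K) and W holding one weight column per class (d x K)."""
    P = softmax(X @ W)       # n x K matrix of class probabilities P(y_k | x_i; W)
    return X.T @ (P - Y)     # d x K matrix of per-class gradients

def gradient_descent_step(W, X, Y, eta=0.1):
    """One update w_k <- w_k - eta * grad_{w_k} J(W), done for all classes at once."""
    return W - eta * multiclass_lr_grad(W, X, Y)

# Illustrative toy problem: 3 samples, 2 features, 3 classes
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = np.eye(3)                # each sample belongs to its own class (one-hot)
W = np.zeros((2, 3))
print(gradient_descent_step(W, X, Y))
```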

  23. Stochastic Gradient Ascent on the Log Likelihood Objective
        max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)
     Observation: once the gradient on one training example has been computed, we might as well incorporate it before computing the next one.
     • init w
     • for iter = 1, 2, …
          pick random j
          w ← w + α ∇ log P(y^(j) | x^(j); w)
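The same objective optimized stochastically, as in the slide's loop: one random example per update. The hyperparameters and data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stochastic_gradient_ascent(X, y, alpha=0.1, iters=500, seed=0):
    """Stochastic gradient ascent for binary logistic regression (labels in {+1, -1}):
    each iteration picks one random example j and immediately applies its gradient,
    instead of summing over the whole training set."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])                      # init w
    for _ in range(iters):                        # for iter = 1, 2, ...
        j = rng.integers(len(X))                  # pick random j
        grad_j = (1.0 - sigmoid(y[j] * (X[j] @ w))) * y[j] * X[j]
        w = w + alpha * grad_j                    # w <- w + alpha * grad log P(y_j | x_j; w)
    return w

# Illustrative data
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.5, 0.5], [-1.0, -2.0]])
y = np.array([+1, +1, -1, -1])
print(stochastic_gradient_ascent(X, y))
```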

  24. Mini-Batch Gradient Ascent on the Log Likelihood Objective
        max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)
     Observation: the gradient over a small set of training examples (= a mini-batch) can be computed, so we might as well do that instead of using a single example.
     • init w
     • for iter = 1, 2, …
          pick a random subset J of the training examples
          w ← w + α Σ_{j ∈ J} ∇ log P(y^(j) | x^(j); w)
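And the mini-batch variant: each update sums the gradient over a small random subset J. The batch size and other values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_gradient_ascent(X, y, alpha=0.1, iters=200, batch_size=2, seed=0):
    """Mini-batch gradient ascent for binary logistic regression (labels in {+1, -1}):
    each iteration sums the gradient over a small random subset J of the data,
    a middle ground between batch and purely stochastic updates."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])                                      # init w
    for _ in range(iters):                                        # for iter = 1, 2, ...
        J = rng.choice(len(X), size=batch_size, replace=False)    # pick random subset J
        margins = y[J] * (X[J] @ w)
        grad = ((1.0 - sigmoid(margins)) * y[J]) @ X[J]           # sum_{j in J} grad log P(y_j | x_j; w)
        w = w + alpha * grad
    return w

# Illustrative data
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.5, 0.5], [-1.0, -2.0]])
y = np.array([+1, +1, -1, -1])
print(minibatch_gradient_ascent(X, y))
```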

  25. Networks with hidden units.
     • Networks without hidden units are very limited in the input-output mappings they can learn to model.
       – A simple function such as XOR cannot be modeled with a single-layer network.
     • More layers of linear units do not help: it is still linear.
     • Fixed output non-linearities are not enough.
     • We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?
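To make the XOR point concrete, here is a hand-wired two-layer network of threshold units that computes XOR, something no single-layer perceptron can represent; all weights and biases are chosen by hand for illustration, not learned and not from the lecture.

```python
import numpy as np

def step(z):
    """Hard threshold activation: 1 if z >= 0, else 0."""
    return (z >= 0).astype(int)

def xor_net(x1, x2):
    """Two-layer network of threshold units computing XOR.
    Hidden unit h1 fires for 'x1 OR x2', h2 fires for 'x1 AND x2';
    the output fires for 'h1 AND NOT h2', i.e. exactly one input is 1."""
    x = np.array([x1, x2])
    W_hidden = np.array([[1.0, 1.0],      # h1: OR  (threshold 1 -> bias -1)
                         [1.0, 1.0]])     # h2: AND (threshold 2 -> bias -2)
    b_hidden = np.array([-1.0, -2.0])
    h = step(W_hidden @ x + b_hidden)
    w_out, b_out = np.array([1.0, -2.0]), -1.0   # output unit: h1 AND NOT h2
    return int(w_out @ h + b_out >= 0)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))   # 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0
```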

  26. Neural Networks

  27. The multi-layer perceptron

  28. Neural Networks Properties.
     • Theorem (Universal Function Approximators): a two-layer neural network with a sufficient number of neurons can approximate any continuous function to any desired accuracy.
     • Practical considerations:
       – Can be seen as learning the features.
       – A large number of neurons brings a danger of overfitting (hence early stopping!).

  29. Universal Function Approximation Theorem*.
     In words: given any continuous function f(x), if a 2-layer neural network has enough hidden units, then there is a choice of weights that allows it to closely approximate f(x).
     Cybenko (1989), "Approximation by superpositions of a sigmoidal function".
     Hornik (1991), "Approximation capabilities of multilayer feedforward networks".
     Leshno and Schocken (1991), "Multilayer feedforward networks with non-polynomial activation functions can approximate any function".

  30. Universal Function Approximation Theorem*.
     Cybenko (1989), "Approximation by superpositions of a sigmoidal function".
     Hornik (1991), "Approximation capabilities of multilayer feedforward networks".
     Leshno and Schocken (1991), "Multilayer feedforward networks with non-polynomial activation functions can approximate any function".

  31. Expressiveness of neural networks.
     • All Boolean functions can be represented by a network with a single hidden layer, but it might require exponentially many (in the number of inputs) hidden units.
     • Continuous functions:
       – Any continuous function on a compact domain can be approximated to arbitrary accuracy by a network with one hidden layer [Cybenko 1989].
       – Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].
