Artificial Neural Networks (based on "Machine Learning", T. Mitchell) - PowerPoint PPT Presentation



  1. 0. Artificial Neural Networks. Based on “Machine Learning”, T. Mitchell, McGraw-Hill, 1997, ch. 4. Acknowledgement: the present slides are an adaptation of slides drawn by T. Mitchell.

  2. 1. PLAN
     1. Introduction
        Connectionist models
        2 NN examples: ALVINN driving system, face recognition
     2. The perceptron; the linear unit; the sigmoid unit
        The gradient descent learning rule
     3. Multilayer networks of sigmoid units
        The Backpropagation algorithm
        Hidden layer representations
        Overfitting in NNs
     4. Advanced topics
        Alternative error functions
        Predicting a probability function
        [ Recurrent networks ]
        [ Dynamically modifying the network structure ]
        [ Alternative error minimization procedures ]
     5. Expressive capabilities of NNs

  3. 2. Connectionist Models
     Consider humans:
     • Neuron switching time: .001 sec.
     • Number of neurons: 10^10
     • Connections per neuron: 10^4 - 10^5
     • Scene recognition time: 0.1 sec.
     • 100 inference steps doesn’t seem like enough
       → much parallel computation
     Properties of artificial neural nets:
     • Many neuron-like threshold switching units
     • Many weighted interconnections among units
     • Highly parallel, distributed process
     • Emphasis on tuning weights automatically

  4. 3. A First NN Example
     ALVINN drives at 70 mph on highways [Pomerleau, 1993]
     [Figure: a 30x32 sensor input retina feeding 4 hidden units, which feed 30 output units ranging from "sharp left" through "straight ahead" to "sharp right".]

  5. 4. A Second NN Example: Neural Nets for Face Recognition
     http://www.cs.cmu.edu/~tom/faces.html
     [Figure: typical input images, and a network with 30x32 inputs and head-pose outputs: left, strt, rght, up.]
     Results: 90% accurate in learning head pose, and in recognizing 1-of-20 faces.

  6. 5. Learned Weights
     [Figure: the hidden-unit weights of the face-recognition network (30x32 inputs; outputs left, strt, rght, up), shown after 1 epoch and after 100 epochs.]

  7. 6. Design Issues for these two NN Examples
     See Tom Mitchell's "Machine Learning" book, pp. 82-83 and 114 for ALVINN, and pp. 112-117 for face recognition:
     • input encoding
     • output encoding
     • network graph structure
     • learning parameters: η (learning rate), α (momentum), number of iterations

  8. 7. When to Consider Neural Networks
     • Input is high-dimensional discrete or real-valued (e.g. raw sensor input)
     • Output is discrete or real-valued
     • Output is a vector of values
     • Possibly noisy data
     • Form of target function is unknown
     • Human readability of result is unimportant

  9. 8. 2. The Perceptron [Rosenblatt, 1962]
     [Figure: inputs x_1, ..., x_n with weights w_1, ..., w_n, plus x_0 = 1 with weight w_0, feeding a summation unit net = Σ_{i=0}^{n} w_i x_i followed by a threshold.]
     o(x_1, ..., x_n) = 1 if w_0 + w_1 x_1 + ... + w_n x_n > 0
                        -1 otherwise
     Sometimes we'll use the simpler vector notation:
     o(x) = 1 if w · x > 0
            -1 otherwise
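The threshold rule above can be sketched in a few lines of Python. The function name, data layout, and the example weight vector (which happens to implement AND, the question posed on the next slide) are our own illustration, not from the slides:

```python
# Sketch of the perceptron output rule: o(x) = 1 if w·x > 0 else -1,
# with the bias handled as w[0] for the constant input x_0 = 1.

def perceptron_output(w, x):
    """w = [w_0, w_1, ..., w_n]; x = [x_1, ..., x_n]."""
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if net > 0 else -1

# Assumed example weights implementing AND(x1, x2) over inputs in {0, 1}
w_and = [-1.5, 1.0, 1.0]
print(perceptron_output(w_and, [1, 1]))  # 1
print(perceptron_output(w_and, [1, 0]))  # -1
```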

  10. 9. Decision Surface of a Perceptron
      [Figure: (a) a set of + and - points in the (x1, x2) plane separated by a line; (b) an XOR-like set of + and - points that no single line can separate.]
      The perceptron represents some useful functions:
      • What weights represent g(x_1, x_2) = AND(x_1, x_2)?
      But certain examples may not be linearly separable:
      • Therefore, we'll want networks of these...

  11. 10. The Perceptron Training Rule
      w_i ← w_i + ∆w_i with ∆w_i = η(t − o)x_i
      or, in vector notation:
      w ← w + ∆w with ∆w = η(t − o)x
      where:
      • t = c(x) is the target value
      • o is the perceptron output
      • η is a small positive constant (e.g., .1) called the learning rate
      It will converge (proven by [Minsky & Papert, 1969])
      • if the training data is linearly separable
      • and η is sufficiently small.
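A minimal sketch of this training loop, applying ∆w_i = η(t − o)x_i once per example per epoch. The function names, the fixed epoch budget, and the OR training set are our own assumptions:

```python
# Perceptron training rule: w_i <- w_i + eta*(t - o)*x_i, per example.

def perceptron_output(w, x):
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if net > 0 else -1

def train_perceptron(examples, eta=0.1, epochs=50):
    """examples: list of (x, t) pairs with t in {-1, +1}."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                # w[0] is the bias weight for x_0 = 1
    for _ in range(epochs):
        for x, t in examples:
            o = perceptron_output(w, x)
            w[0] += eta * (t - o)      # x_0 = 1
            for i, xi in enumerate(x):
                w[i + 1] += eta * (t - o) * xi
    return w

# Learn OR, which is linearly separable
data = [([0, 0], -1), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w = train_perceptron(data)
print([perceptron_output(w, x) for x, _ in data])  # [-1, 1, 1, 1]
```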

  12. 11. 2′. The Linear Unit
      To understand the perceptron's training rule, consider a (simpler) linear unit, where
      o = w_0 + w_1 x_1 + ... + w_n x_n = Σ_{i=0}^{n} w_i x_i   (with x_0 = 1)
      Let's learn the w_i's that minimize the squared error
      E[w] ≡ (1/2) Σ_{d ∈ D} (t_d − o_d)^2
      where D is the set of training examples.
      The linear unit uses the gradient descent training rule, presented on the next slides.
      Remark: Ch. 6 (Bayesian Learning) shows that the hypothesis h = (w_0, w_1, ..., w_n) that minimises E is the most probable hypothesis given the training data.

  13. 12. The Gradient Descent Rule
      Gradient: ∇E[w] ≡ [∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n]
      Training rule: w ← w + ∆w, with ∆w = −η∇E[w]
      Therefore, w_i ← w_i + ∆w_i, with ∆w_i = −η ∂E/∂w_i
      [Figure: the error surface E[w] plotted as a paraboloid over the (w0, w1) plane; gradient descent steps move downhill toward the minimum.]

  14. 13. The Gradient Descent Rule for the Linear Unit: Computation
      ∂E/∂w_i = ∂/∂w_i [ (1/2) Σ_d (t_d − o_d)^2 ]
              = (1/2) Σ_d ∂/∂w_i (t_d − o_d)^2
              = (1/2) Σ_d 2(t_d − o_d) ∂/∂w_i (t_d − o_d)
              = Σ_d (t_d − o_d) ∂/∂w_i (t_d − w · x_d)
              = Σ_d (t_d − o_d)(−x_{i,d})
      Therefore
      ∆w_i = η Σ_d (t_d − o_d) x_{i,d}
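The resulting formula ∂E/∂w_i = −Σ_d (t_d − o_d) x_{i,d} can be checked against a numeric central difference. Everything below (function names, the tiny data set, the tolerance) is our own illustration:

```python
# Numeric check of the slide's derivation for a linear unit:
# analytic dE/dw_i should match (E(w+eps) - E(w-eps)) / (2*eps).

def error(w, data):
    return 0.5 * sum((t - sum(wi * xi for wi, xi in zip(w, x))) ** 2
                     for x, t in data)

def analytic_grad(w, data, i):
    return -sum((t - sum(wj * xj for wj, xj in zip(w, x))) * x[i]
                for x, t in data)

# x vectors include x_0 = 1 as their first component
data = [([1, 0.5], 1.0), ([1, -1.0], -0.5), ([1, 2.0], 2.0)]
w = [0.1, 0.3]
eps = 1e-6
for i in range(2):
    w_plus = list(w);  w_plus[i] += eps
    w_minus = list(w); w_minus[i] -= eps
    numeric = (error(w_plus, data) - error(w_minus, data)) / (2 * eps)
    assert abs(numeric - analytic_grad(w, data, i)) < 1e-5
print("analytic gradient matches numeric gradient")
```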

  15. 14. The Gradient Descent Algorithm for the Linear Unit
      Gradient-Descent(training_examples, η)
      Each training example is a pair of the form ⟨x, t⟩, where x is the vector of input values and t is the target output value. η is the learning rate (e.g., .05).
      • Initialize each w_i to some small random value
      • Until the termination condition is met:
        – Initialize each ∆w_i to zero.
        – For each ⟨x, t⟩ in training_examples:
          ∗ Input the instance x to the unit and compute the output o
          ∗ For each linear unit weight w_i: ∆w_i ← ∆w_i + η(t − o)x_i
        – For each linear unit weight w_i: w_i ← w_i + ∆w_i
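The procedure above can be transcribed almost line for line. Variable names, the fixed-iteration termination condition, and the linear target t = 2x + 1 are our own assumptions:

```python
import random

# Sketch of the slide's batch Gradient-Descent for a linear unit.
# x_0 = 1 is prepended to each example so w[0] acts as the bias.

def gradient_descent(training_examples, eta=0.05, iterations=200):
    n = len(training_examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(iterations):              # termination: fixed #iterations
        delta = [0.0] * (n + 1)
        for x, t in training_examples:
            xs = [1.0] + list(x)             # x_0 = 1
            o = sum(wi * xi for wi, xi in zip(w, xs))
            for i in range(n + 1):           # accumulate eta*(t - o)*x_i
                delta[i] += eta * (t - o) * xs[i]
        for i in range(n + 1):               # apply the batch update
            w[i] += delta[i]
    return w

# Fit the noiseless target t = 2x + 1
random.seed(0)
data = [([x], 2 * x + 1) for x in [-1.0, 0.0, 1.0, 2.0]]
w = gradient_descent(data)
print(w)  # approximately [1.0, 2.0]
```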

  16. 15. Convergence [Hertz et al., 1991]
      The gradient descent training rule used by the linear unit is guaranteed to converge to a hypothesis with minimum squared error
      • given a sufficiently small learning rate η
      • even when the training data contains noise
      • even when the training data is not separable by H
      Note: If η is too large, the gradient descent search runs the risk of overstepping the minimum in the error surface rather than settling into it. For this reason, one common modification of the algorithm is to gradually reduce the value of η as the number of gradient descent steps grows.

  17. 16. Remark
      Gradient descent (and similarly, gradient ascent: w ← w + η∇E) is an important general paradigm for learning. It is a strategy for searching through a large or infinite hypothesis space that can be applied whenever
      • the hypothesis space contains continuously parametrized hypotheses, and
      • the error can be differentiated w.r.t. these hypothesis parameters.
      Practical difficulties in applying the gradient method:
      • if there are multiple local optima in the error surface, then there is no guarantee that the procedure will find the global optimum;
      • converging to a local optimum can sometimes be quite slow.
      To alleviate these difficulties, a variation called the incremental (or: stochastic) gradient method was proposed.

  18. 17. Incremental (Stochastic) Gradient Descent
      Batch mode Gradient Descent:
      Do until satisfied
        1. Compute the gradient ∇E_D[w]
        2. w ← w − η∇E_D[w]
      where E_D[w] ≡ (1/2) Σ_{d ∈ D} (t_d − o_d)^2
      Incremental mode Gradient Descent:
      Do until satisfied
        • For each training example d in D
          1. Compute the gradient ∇E_d[w]
          2. w ← w − η∇E_d[w]
      where E_d[w] ≡ (1/2)(t_d − o_d)^2
      Convergence: Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η is made small enough.
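The incremental variant differs from a batch implementation only in where the weight update happens: inside the per-example loop, using ∇E_d[w] alone. Again, the names, epoch budget, and the small linear data set are our own assumptions:

```python
# Sketch of incremental (stochastic) gradient descent for a linear unit:
# the weights are updated after every single example d.

def incremental_gradient_descent(training_examples, eta=0.05, epochs=100):
    n = len(training_examples[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        for x, t in training_examples:
            xs = [1.0] + list(x)                      # x_0 = 1
            o = sum(wi * xi for wi, xi in zip(w, xs))
            for i in range(n + 1):                    # w <- w - eta*grad(E_d)
                w[i] += eta * (t - o) * xs[i]
    return w

data = [([x], 2 * x + 1) for x in [-1.0, 0.0, 1.0, 2.0]]
print(incremental_gradient_descent(data))  # close to [1.0, 2.0]
```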

  19. 18. 2′′. The Sigmoid Unit
      [Figure: inputs x_1, ..., x_n with weights w_1, ..., w_n, plus x_0 = 1 with weight w_0, feeding net = Σ_{i=0}^{n} w_i x_i and output o = σ(net) = 1/(1 + e^{−net}).]
      σ(x) = 1/(1 + e^{−x}) is the sigmoid function
      Nice property: dσ(x)/dx = σ(x)(1 − σ(x))
      We can derive gradient descent rules to train
      • One sigmoid unit
      • Multilayer networks of sigmoid units → Backpropagation
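The "nice property" can be verified numerically in a couple of lines; the helper name, test points, and tolerance below are our own illustration:

```python
import math

# The sigmoid, plus a numeric check that d(sigma)/dx = sigma(x)*(1 - sigma(x)).

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

eps = 1e-6
for x in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
    analytic = sigmoid(x) * (1.0 - sigmoid(x))
    assert abs(numeric - analytic) < 1e-8

print(sigmoid(0.0))  # 0.5
```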

  20. 19. Error Gradient for the Sigmoid Unit
      ∂E/∂w_i = ∂/∂w_i [ (1/2) Σ_{d ∈ D} (t_d − o_d)^2 ]
              = (1/2) Σ_d 2(t_d − o_d) ∂/∂w_i (t_d − o_d)
              = Σ_d (t_d − o_d) (−∂o_d/∂w_i)
              = −Σ_d (t_d − o_d) (∂o_d/∂net_d)(∂net_d/∂w_i)
      But
      ∂o_d/∂net_d = ∂σ(net_d)/∂net_d = o_d(1 − o_d)
      ∂net_d/∂w_i = ∂(w · x_d)/∂w_i = x_{i,d}
      where net_d = Σ_{i=0}^{n} w_i x_{i,d}
      So:
      ∂E/∂w_i = −Σ_{d ∈ D} o_d(1 − o_d)(t_d − o_d) x_{i,d}
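Plugging this gradient into ∆w_i = −η ∂E/∂w_i gives a batch training loop for a single sigmoid unit. The sketch below is our own illustration: the names, η, the epoch count, and the OR data set (with targets in {0, 1}) are all assumptions, not from the slides:

```python
import math, random

# Batch gradient descent for one sigmoid unit using the slide's gradient:
# delta_w_i = eta * sum_d (t_d - o_d) * o_d * (1 - o_d) * x_{i,d}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_sigmoid_unit(data, eta=0.5, epochs=2000):
    n = len(data[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):
        delta = [0.0] * (n + 1)
        for x, t in data:
            xs = [1.0] + list(x)                      # x_0 = 1
            o = sigmoid(sum(wi * xi for wi, xi in zip(w, xs)))
            for i in range(n + 1):
                delta[i] += eta * (t - o) * o * (1 - o) * xs[i]
        for i in range(n + 1):
            w[i] += delta[i]
    return w

# Learn OR with targets in (0, 1)
random.seed(1)
data = [([0, 0], 0.0), ([0, 1], 1.0), ([1, 0], 1.0), ([1, 1], 1.0)]
w = train_sigmoid_unit(data)
outputs = [sigmoid(w[0] + w[1] * x[0] + w[2] * x[1]) for x, _ in data]
print([round(o) for o in outputs])
```

Note the extra factor o_d(1 − o_d) relative to the linear unit's update: it comes from the sigmoid's derivative and shrinks the step when the unit saturates.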

  21. 20. Remark
      Instead of gradient descent, one could use linear programming to find a hypothesis consistent with separable data. [Duda & Hart, 1973] have shown that linear programming can be extended to the non-linearly separable case. However, linear programming does not scale to multilayer networks, as gradient descent does (see next section).

  22. 21. 3. Multilayer Networks of Sigmoid Units: An Example
      This network was trained to recognize 1 of 10 vowel sounds occurring in the context "h_d" (e.g. "head", "hid"). The inputs have been obtained from a spectral analysis of sound. The 10 network outputs correspond to the 10 possible vowel sounds. The network prediction is the output whose value is the highest.
      [Figure: a network with two inputs (the formants F1, F2), one hidden layer, and 10 outputs ("head", "hid", "who'd", "hood", ...).]

  23. 22. This plot illustrates the highly non-linear decision surface represented by the learned network. Points shown on the plot are test examples, distinct from the examples used to train the network. (from [Huang & Lippmann, 1988])
