 
0. Artificial Neural Networks
Based on “Machine Learning”, T. Mitchell, McGraw-Hill, 1997, ch. 4
Acknowledgement: The present slides are an adaptation of slides drawn by T. Mitchell
1. PLAN
1. Introduction
   Connectionist models
   Two NN examples: ALVINN driving system, face recognition
2. The perceptron; the linear unit; the sigmoid unit
   The gradient descent learning rule
3. Multilayer networks of sigmoid units
   The Backpropagation algorithm
   Hidden layer representations
   Overfitting in NNs
4. Advanced topics
   Alternative error functions
   Predicting a probability function
   [ Recurrent networks ]
   [ Dynamically modifying the network structure ]
   [ Alternative error minimization procedures ]
5. Expressive capabilities of NNs
2. Connectionist Models
Consider humans:
• Neuron switching time: 0.001 sec.
• Number of neurons: $10^{10}$
• Connections per neuron: $10^{4}$ to $10^{5}$
• Scene recognition time: 0.1 sec.
• 100 inference steps doesn’t seem like enough → much parallel computation
Properties of artificial neural nets:
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed process
• Emphasis on tuning weights automatically
3. A First NN Example: ALVINN drives at 70 mph on highways [ Pomerleau, 1993 ]
[Figure: the ALVINN network. A 30x32 sensor input retina feeds 4 hidden units, which feed 30 output units ranging from Sharp Left through Straight Ahead to Sharp Right.]
4. A Second NN Example: Neural Nets for Face Recognition
http://www.cs.cmu.edu/~tom/faces.html
[Figure: typical input images, and a network with 30x32 inputs and output units left, strt, rght, up.]
Results: 90% accurate learning head pose, and recognizing 1-of-20 faces
5. Learned Weights
[Figure: the learned network weights (30x32 inputs; outputs left, strt, rght, up), shown after 1 epoch and after 100 epochs.]
6. Design Issues for these two NN Examples
See Tom Mitchell’s “Machine Learning” book, pp. 82-83 and 114 for ALVINN, and pp. 112-177 for face recognition:
• input encoding
• output encoding
• network graph structure
• learning parameters: η (learning rate), α (momentum), number of iterations
7. When to Consider Neural Networks
• Input is high-dimensional discrete or real-valued (e.g. raw sensor input)
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant
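8. 2. The Perceptron [Rosenblatt, 1962]
[Figure: a perceptron. Inputs $x_1, \dots, x_n$ with weights $w_1, \dots, w_n$, plus a constant input $x_0 = 1$ with weight $w_0$, feed a summation unit $\sum_{i=0}^{n} w_i x_i$ followed by a threshold.]
$$o(x_1, \dots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n \ge 0 \\ -1 & \text{otherwise.} \end{cases}$$
Sometimes we’ll use simpler vector notation:
$$o(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} \ge 0 \\ -1 & \text{otherwise.} \end{cases}$$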
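The threshold rule above can be sketched in a few lines of Python. This is only an illustration: the function name `perceptron_output` and the convention that the caller passes $x_1, \dots, x_n$ without the constant $x_0 = 1$ are assumptions, not part of the slides.

```python
def perceptron_output(weights, inputs):
    """Return +1 or -1 for a threshold unit.

    weights: [w0, w1, ..., wn]
    inputs:  [x1, ..., xn]  (the constant input x0 = 1 is added here)
    """
    x = [1.0] + list(inputs)                      # prepend x0 = 1
    net = sum(w * xi for w, xi in zip(weights, x))
    return 1 if net >= 0 else -1                  # threshold at zero


# Example with hypothetical weights w0 = -1, w1 = 2:
print(perceptron_output([-1.0, 2.0], [1.0]))      # net = 1
print(perceptron_output([-1.0, 2.0], [0.0]))      # net = -1
```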
9. Decision Surface of a Perceptron
[Figure: two plots in the $(x_1, x_2)$ plane. (a) positive and negative examples that a straight line can separate; (b) examples that no straight line can separate.]
Represents some useful functions
• What weights represent $g(x_1, x_2) = \mathrm{AND}(x_1, x_2)$?
But certain examples may not be linearly separable
• Therefore, we’ll want networks of these...
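One way to explore the question above is a brute-force search over a small grid of integer weights. The sketch below assumes inputs encoded as 0/1 and outputs as +1/-1. The helper name `find_weights` and the grid range are illustrative choices, not from the slides.

```python
from itertools import product


def perceptron_output(w, x1, x2):
    # Threshold unit: +1 if w0 + w1*x1 + w2*x2 >= 0, else -1
    return 1 if w[0] + w[1] * x1 + w[2] * x2 >= 0 else -1


def find_weights(target, grid=range(-2, 3)):
    """Return the first (w0, w1, w2) on the grid that reproduces `target`, or None."""
    for w in product(grid, repeat=3):
        if all(perceptron_output(w, a, b) == target(a, b)
               for a in (0, 1) for b in (0, 1)):
            return w
    return None


# Inputs are 0/1; targets use the perceptron's +1 / -1 convention.
and_weights = find_weights(lambda a, b: 1 if (a and b) else -1)
xor_weights = find_weights(lambda a, b: 1 if (a != b) else -1)
print(and_weights, xor_weights)
```

On this grid the search returns a weight vector for AND and `None` for XOR. XOR is the standard example of a function that is not linearly separable, so no choice of weights works, on this grid or off it.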
10. The Perceptron Training Rule
$$w_i \leftarrow w_i + \Delta w_i \quad \text{with} \quad \Delta w_i = \eta (t - o) x_i$$
or, in vector notation:
$$\vec{w} \leftarrow \vec{w} + \Delta \vec{w} \quad \text{with} \quad \Delta \vec{w} = \eta (t - o) \vec{x}$$
where:
• $t = c(\vec{x})$ is the target value
• $o$ is the perceptron output
• $\eta$ is a small positive constant (e.g., 0.1) called the learning rate
It will converge (proven by [ Minsky & Papert, 1969 ])
• if the training data is linearly separable
• and $\eta$ is sufficiently small.
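A minimal sketch of the training rule, assuming zero initial weights and a stopping criterion of one full pass without mistakes (or `max_epochs` passes). Both choices, and the name `train_perceptron`, are assumptions made for the example.

```python
def predict(w, inputs):
    x = [1.0] + list(inputs)                       # x0 = 1
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


def train_perceptron(examples, eta=0.1, max_epochs=100):
    """examples: list of (inputs, target) pairs with target in {+1, -1}."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                            # w[0] is the weight of x0 = 1
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, t in examples:
            o = predict(w, inputs)
            if o != t:
                mistakes += 1
                x = [1.0] + list(inputs)
                # w_i <- w_i + eta * (t - o) * x_i
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
        if mistakes == 0:                          # every example classified correctly
            break
    return w


# AND with 0/1 inputs and +1/-1 targets (linearly separable).
and_data = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]
w = train_perceptron(and_data, eta=0.5)
print(w, [predict(w, x) for x, _ in and_data])
```

When the output already equals the target, $(t - o) = 0$ and the rule leaves the weights unchanged, which is why the update is skipped for correctly classified examples.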
11. 2′. The Linear Unit
To understand the perceptron’s training rule, consider a (simpler) linear unit, where
$$o = w_0 + w_1 x_1 + \cdots + w_n x_n = \sum_{i=0}^{n} w_i x_i$$
[Figure: a linear unit. Inputs $x_0 = 1, x_1, \dots, x_n$ with weights $w_0, w_1, \dots, w_n$ feed a summation unit.]
Let’s learn $w_i$’s that minimize the squared error
$$E[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$
where $D$ is the set of training examples.
The linear unit uses the gradient descent training rule, presented on the next slides.
Remark: Ch. 6 (Bayesian Learning) shows that the hypothesis $h = (w_0, w_1, \dots, w_n)$ that minimises $E$ is the most probable hypothesis given the training data.
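As a small illustration of the error measure defined above, the sketch below evaluates $E[\vec{w}]$ for a linear unit on a toy data set. The function names and the data are made up for the example.

```python
def linear_output(w, inputs):
    # o = w0 + w1*x1 + ... + wn*xn
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], inputs))


def squared_error(w, examples):
    """E[w] = 1/2 * sum over training examples d of (t_d - o_d)^2."""
    return 0.5 * sum((t - linear_output(w, x)) ** 2 for x, t in examples)


# Toy data generated by t = 1 + 2*x (an assumption for illustration).
examples = [([1.0], 3.0), ([2.0], 5.0)]
print(squared_error([1.0, 2.0], examples))   # weights that fit the data exactly
print(squared_error([0.0, 0.0], examples))   # all-zero weights
```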
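12. The Gradient Descent Rule
Gradient:
$$\nabla E[\vec{w}] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \cdots, \frac{\partial E}{\partial w_n} \right]$$
Training rule:
$$\vec{w} \leftarrow \vec{w} + \Delta \vec{w}, \quad \text{with} \quad \Delta \vec{w} = -\eta \nabla E[\vec{w}].$$
Therefore,
$$w_i \leftarrow w_i + \Delta w_i, \quad \text{with} \quad \Delta w_i = -\eta \frac{\partial E}{\partial w_i}.$$
[Figure: the error surface $E[\vec{w}]$ plotted over the weights $w_0$ and $w_1$.]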
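13. The Gradient Descent Rule for the Linear Unit: Computation
$$\begin{aligned}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_d (t_d - o_d)^2 = \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d) = \sum_d (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d) \\
&= \sum_d (t_d - o_d)(-x_{i,d})
\end{aligned}$$
Therefore
$$\Delta w_i = \eta \sum_d (t_d - o_d) x_{i,d}$$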
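The derived gradient can be checked numerically against a central finite difference of $E$. The sketch below does this on made-up data. All names, the step size `h` and the sample weights are assumptions for the example.

```python
def linear_output(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))        # x already includes x0 = 1


def error(w, examples):
    return 0.5 * sum((t - linear_output(w, x)) ** 2 for x, t in examples)


def analytic_gradient(w, examples):
    # dE/dw_i = sum_d (t_d - o_d) * (-x_{i,d})
    grad = [0.0] * len(w)
    for x, t in examples:
        o = linear_output(w, x)
        for i, xi in enumerate(x):
            grad[i] += (t - o) * (-xi)
    return grad


def numeric_gradient(w, examples, h=1e-6):
    # Central difference: (E(w + h*e_i) - E(w - h*e_i)) / (2h)
    grad = []
    for i in range(len(w)):
        up = w[:i] + [w[i] + h] + w[i + 1:]
        down = w[:i] + [w[i] - h] + w[i + 1:]
        grad.append((error(up, examples) - error(down, examples)) / (2 * h))
    return grad


examples = [([1.0, 2.0], 1.0), ([1.0, -1.0], 0.0)]     # (x with x0 = 1, t)
w = [0.5, -0.3]
print(analytic_gradient(w, examples))
print(numeric_gradient(w, examples))
```

The two printed vectors should agree up to rounding error.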
14. The Gradient Descent Algorithm for the Linear Unit
Gradient-Descent(training examples, $\eta$)
Each training example is a pair of the form $\langle \vec{x}, t \rangle$, where $\vec{x}$ is the vector of input values and $t$ is the target output value. $\eta$ is the learning rate (e.g., 0.05).
• Initialize each $w_i$ to some small random value
• Until the termination condition is met
  – Initialize each $\Delta w_i$ to zero.
  – For each $\langle \vec{x}, t \rangle$ in training examples
    ∗ Input the instance $\vec{x}$ to the unit and compute the output $o$
    ∗ For each linear unit weight $w_i$
        $\Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i$
  – For each linear unit weight $w_i$
        $w_i \leftarrow w_i + \Delta w_i$
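The steps above can be sketched in Python as follows. The termination condition is not specified on the slide, so a fixed number of passes (`epochs`) is assumed here. The initialization range, the seed, and the toy data are also assumptions.

```python
import random


def gradient_descent(examples, eta=0.05, epochs=1000, seed=0):
    """Batch gradient descent for a linear unit.

    examples: list of (inputs, t) pairs, where inputs = [x1, ..., xn].
    Returns the weights [w0, w1, ..., wn].
    """
    rng = random.Random(seed)
    n = len(examples[0][0])
    # Initialize each w_i to some small random value
    w = [rng.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):                        # assumed termination condition
        delta = [0.0] * (n + 1)                    # initialize each delta_w_i to zero
        for inputs, t in examples:
            x = [1.0] + list(inputs)               # x0 = 1
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i, xi in enumerate(x):
                delta[i] += eta * (t - o) * xi     # accumulate over all examples
        w = [wi + di for wi, di in zip(w, delta)]  # one update per pass
    return w


# Toy data generated by t = 1 + 2*x (an assumption for illustration).
data = [([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0), ([3.0], 7.0)]
print(gradient_descent(data))
```

On this noise-free data the learned weights approach $w_0 = 1$, $w_1 = 2$.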
15. Convergence [ Hertz et al., 1991 ]
The gradient descent training rule used by the linear unit is guaranteed to converge to a hypothesis with minimum squared error
• given a sufficiently small learning rate η
• even when the training data contains noise
• even when the training data is not separable by H
Note: If η is too large, the gradient descent search runs the risk of overstepping the minimum in the error surface rather than settling into it. For this reason, one common modification of the algorithm is to gradually reduce the value of η as the number of gradient descent steps grows.
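The effect described in the note can be seen on a one-weight linear unit $o = w x$. The sketch below compares a small constant rate, a constant rate that is too large, and a decaying rate. The data, the specific rates and the schedule $\eta_k = \eta_0 / (1 + k)$ are all assumptions chosen for illustration; the slides do not prescribe a particular schedule.

```python
def run(eta_schedule, steps=200, w=0.0):
    """Batch gradient descent on a one-weight linear unit o = w * x."""
    examples = [(1.0, 2.0), (2.0, 4.0)]        # (x, t) with t = 2x, so the minimum is at w = 2
    for k in range(steps):
        # dE/dw = sum_d (t_d - o_d) * (-x_d)
        grad = sum((t - w * x) * (-x) for x, t in examples)
        w = w - eta_schedule(k) * grad
    return w


w_small = run(lambda k: 0.1)                   # small constant rate
w_large = run(lambda k: 0.5, steps=20)         # constant rate that is too large
w_decay = run(lambda k: 0.5 / (1 + k))         # same initial rate, reduced at every step
print(w_small, w_large, w_decay)
```

With the large constant rate each step overshoots the minimum by more than the previous one, while the decaying schedule starts from the same rate and still settles near $w = 2$.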
16. Remark
Gradient descent (and similarly, gradient ascent: $\vec{w} \leftarrow \vec{w} + \eta \nabla E$) is an important general paradigm for learning. It is a strategy for searching through a large or infinite hypothesis space that can be applied whenever
• the hypothesis space contains continuously parametrized hypotheses
• the error can be differentiated w.r.t. these hypothesis parameters.
Practical difficulties in applying the gradient method:
• if there are multiple local optima in the error surface, then there is no guarantee that the procedure will find the global optimum.
• converging to a local optimum can sometimes be quite slow.
To alleviate these difficulties, a variation called the incremental (or: stochastic) gradient method was proposed.
17. Incremental (Stochastic) Gradient Descent
Batch mode Gradient Descent:
Do until satisfied
1. Compute the gradient $\nabla E_D[\vec{w}]$
2. $\vec{w} \leftarrow \vec{w} - \eta \nabla E_D[\vec{w}]$
where
$$E_D[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$
Incremental mode Gradient Descent:
Do until satisfied
• For each training example $d$ in $D$
  1. Compute the gradient $\nabla E_d[\vec{w}]$
  2. $\vec{w} \leftarrow \vec{w} - \eta \nabla E_d[\vec{w}]$
where
$$E_d[\vec{w}] \equiv \frac{1}{2} (t_d - o_d)^2$$
Convergence: The Incremental Gradient Descent can approximate the Batch Gradient Descent arbitrarily closely if η is made small enough.
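A minimal sketch of the incremental mode for a linear unit, where the weights change after every example rather than once per pass. The fixed number of passes, the zero initialization and the toy data are assumptions for the example.

```python
def incremental_gradient_descent(examples, eta=0.05, epochs=1000):
    """Incremental (stochastic) gradient descent for a linear unit."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):                        # assumed stopping criterion
        for inputs, t in examples:
            x = [1.0] + list(inputs)               # x0 = 1
            o = sum(wi * xi for wi, xi in zip(w, x))
            # Gradient of E_d = 1/2 (t - o)^2 is -(t - o) * x, so
            # w <- w - eta * grad E_d  becomes  w <- w + eta * (t - o) * x
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w


# Toy data generated by t = 1 + 2*x (an assumption for illustration).
data = [([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0), ([3.0], 7.0)]
print(incremental_gradient_descent(data))
```

The only difference from the batch version is where the weight update sits: inside the loop over examples instead of after it.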
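18. 2′′. The Sigmoid Unit
[Figure: a sigmoid unit. Inputs $x_0 = 1, x_1, \dots, x_n$ with weights $w_0, w_1, \dots, w_n$ feed a summation unit followed by the sigmoid function.]
$$net = \sum_{i=0}^{n} w_i x_i, \qquad o = \sigma(net) = \frac{1}{1 + e^{-net}}$$
$\sigma(x)$ is the sigmoid function $\frac{1}{1 + e^{-x}}$
Nice property: $\frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))$
We can derive gradient descent rules to train
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation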
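The sigmoid and its derivative property can be sketched and checked numerically as follows. The function names and the test point are assumptions for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    # Uses the property d(sigma)/dx = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)


# Compare with a central finite difference at an arbitrary point.
x, h = 1.3, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(sigmoid_derivative(x), numeric)
```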
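19. Error Gradient for the Sigmoid Unit
$$\begin{aligned}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d) \\
&= \sum_d (t_d - o_d) \left( -\frac{\partial o_d}{\partial w_i} \right) \\
&= -\sum_d (t_d - o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i}
\end{aligned}$$
where $net_d = \sum_{i=0}^{n} w_i x_{i,d}$
But
$$\frac{\partial o_d}{\partial net_d} = \frac{\partial \sigma(net_d)}{\partial net_d} = o_d (1 - o_d), \qquad \frac{\partial net_d}{\partial w_i} = \frac{\partial (\vec{w} \cdot \vec{x}_d)}{\partial w_i} = x_{i,d}$$
So:
$$\frac{\partial E}{\partial w_i} = -\sum_{d \in D} o_d (1 - o_d)(t_d - o_d) x_{i,d}$$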
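The final formula can be turned into code directly. The sketch below computes the gradient for a single sigmoid unit and prints it next to a finite-difference estimate. The names, data and weights are assumptions for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def output(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))   # x includes x0 = 1


def error(w, examples):
    return 0.5 * sum((t - output(w, x)) ** 2 for x, t in examples)


def sigmoid_unit_gradient(w, examples):
    # dE/dw_i = -sum_d o_d * (1 - o_d) * (t_d - o_d) * x_{i,d}
    grad = [0.0] * len(w)
    for x, t in examples:
        o = output(w, x)
        for i, xi in enumerate(x):
            grad[i] -= o * (1.0 - o) * (t - o) * xi
    return grad


examples = [([1.0, 0.5], 1.0), ([1.0, -1.5], 0.0)]          # (x with x0 = 1, t)
w, h = [0.1, -0.2], 1e-6
numeric = [
    (error([w[0] + h, w[1]], examples) - error([w[0] - h, w[1]], examples)) / (2 * h),
    (error([w[0], w[1] + h], examples) - error([w[0], w[1] - h], examples)) / (2 * h),
]
print(sigmoid_unit_gradient(w, examples), numeric)
```

A gradient descent step for the sigmoid unit then takes the same form as for the linear unit, $w_i \leftarrow w_i - \eta \, \partial E / \partial w_i$.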
20. Remark
Instead of gradient descent, one could use linear programming to find a hypothesis consistent with separable data. [ Duda & Hart, 1973 ] have shown that linear programming can be extended to the non-linearly separable case. However, linear programming does not scale to multilayer networks, as gradient descent does (see next section).
21. 3. Multilayer Networks of Sigmoid Units
An example
[Figure: a multilayer network with inputs F1 and F2 and output units labeled head, hid, who’d, hood, ...]
This network was trained to recognize 1 of 10 vowel sounds occurring in the context “h_d” (e.g. “head”, “hid”). The inputs have been obtained from a spectral analysis of sound. The 10 network outputs correspond to the 10 possible vowel sounds. The network prediction is the output whose value is the highest.
22. This plot illustrates the highly non-linear decision surface represented by the learned network. Points shown on the plot are test examples distinct from the examples used to train the network.
From [ Huang & Lippmann, 1988 ]