Artificial Neural Networks Based on Machine Learning, T. Mitchell, - PowerPoint PPT Presentation

0. Artificial Neural Networks
Based on “Machine Learning”, T. Mitchell, McGraw-Hill, 1997, ch. 4
Acknowledgement: The present slides are an adaptation of slides drawn by T. Mitchell

1. PLAN
1. Introduction
   Connectionist models
   2 NN examples: ALVINN driving system, face recognition
2. The perceptron; the linear unit; the sigmoid unit
   The gradient descent learning rule
3. Multilayer networks of sigmoid units
   The Backpropagation algorithm
   Hidden layer representations
   Overfitting in NNs
4. Advanced topics
   Alternative error functions
   Predicting a probability function
   [ Recurrent networks ]
   [ Dynamically modifying the network structure ]
   [ Alternative error minimization procedures ]
5. Expressive capabilities of NNs

2. Connectionist Models
Consider humans:
• Neuron switching time: 0.001 sec.
• Number of neurons: $10^{10}$
• Connections per neuron: $10^{4}$ to $10^{5}$
• Scene recognition time: 0.1 sec.
• 100 inference steps doesn’t seem like enough → much parallel computation
Properties of artificial neural nets:
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed process
• Emphasis on tuning weights automatically

3. A First NN Example: ALVINN drives at 70 mph on highways [Pomerleau, 1993]
[Figure: network with a 30x32 sensor input retina, 4 hidden units, and 30 output units ranging from Sharp Left through Straight Ahead to Sharp Right]

4. A Second NN Example: Neural Nets for Face Recognition
http://www.cs.cmu.edu/~tom/faces.html
[Figure: typical input images; network with 30x32 inputs and outputs left, strt, rght, up]
Results: 90% accurate learning head pose, and recognizing 1-of-20 faces

5. Learned Weights
[Figure: learned weights of the face recognition network (outputs left, strt, rght, up; 30x32 inputs), shown after 1 epoch and after 100 epochs]

6. Design Issues for these two NN Examples
See Tom Mitchell’s “Machine Learning” book, pp. 82-83 and 114 for ALVINN, and pp. 112-117 for face recognition:
• input encoding
• output encoding
• network graph structure
• learning parameters: η (learning rate), α (momentum), number of iterations

7. When to Consider Neural Networks
• Input is high-dimensional discrete or real-valued (e.g. raw sensor input)
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant

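8. 2. The Perceptron [Rosenblatt, 1962]
[Figure: inputs $x_1, \dots, x_n$ with weights $w_1, \dots, w_n$, plus a constant input $x_0 = 1$ with weight $w_0$, feed a unit that computes $\sum_{i=0}^{n} w_i x_i$ and thresholds the result]
$$o(x_1, \dots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n \ge 0 \\ -1 & \text{otherwise.} \end{cases}$$
Sometimes we’ll use simpler vector notation:
$$o(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} \ge 0 \\ -1 & \text{otherwise.} \end{cases}$$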

9. Decision Surface of a Perceptron
[Figure: (a) + and − examples in the $(x_1, x_2)$ plane that a straight line can separate; (b) + and − examples that are not linearly separable]
Represents some useful functions
• What weights represent $g(x_1, x_2) = AND(x_1, x_2)$?
But certain examples may not be linearly separable
• Therefore, we’ll want networks of these...
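
As a minimal sketch of the question above, the snippet below evaluates a two-input perceptron on the four Boolean input pairs. The weight values $w_0 = -0.8$, $w_1 = w_2 = 0.5$ are one possible answer chosen for illustration (many others work), and the names `perceptron_output` and `and_weights` are hypothetical.

```python
def perceptron_output(weights, inputs):
    """Return +1 if w0 + w1*x1 + ... + wn*xn >= 0, else -1.

    weights[0] is the bias weight w0, paired with the constant input x0 = 1.
    """
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if net >= 0 else -1


# One possible weight choice for AND (an assumption, not the only solution).
and_weights = [-0.8, 0.5, 0.5]

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron_output(and_weights, (x1, x2)))
```

Only the input (1, 1) pushes the weighted sum above the threshold, so the unit outputs +1 there and -1 elsewhere.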

10. The Perceptron Training Rule
$$w_i \leftarrow w_i + \Delta w_i \quad \text{with} \quad \Delta w_i = \eta (t - o) x_i$$
or, in vector notation:
$$\vec{w} \leftarrow \vec{w} + \Delta \vec{w} \quad \text{with} \quad \Delta \vec{w} = \eta (t - o) \vec{x}$$
where:
• $t = c(\vec{x})$ is the target value
• $o$ is the perceptron output
• η is a small positive constant (e.g., .1) called the learning rate
It will converge (proven by [Minsky & Papert, 1969])
• if the training data is linearly separable
• and η is sufficiently small.
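
The training rule can be sketched in a few lines of Python. This is an illustrative sketch, not the slides' own code: the zero initial weights, the `max_epochs` cap, the stopping test (one full pass without a mistake), and the names `predict` and `train_perceptron` are all assumptions.

```python
def predict(w, x):
    """Perceptron output in {+1, -1}; w[0] is the weight of the constant x0 = 1."""
    xs = (1.0,) + tuple(x)
    return 1 if sum(wi * xi for wi, xi in zip(w, xs)) >= 0 else -1


def train_perceptron(examples, eta=0.1, max_epochs=100):
    """Apply w_i <- w_i + eta * (t - o) * x_i to each example in turn."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in examples:
            o = predict(w, x)
            if o != t:
                mistakes += 1
                xs = (1.0,) + tuple(x)
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xs)]
        if mistakes == 0:  # every example classified correctly: stop
            break
    return w


# AND with targets in {+1, -1}: a linearly separable training set.
and_examples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
learned = train_perceptron(and_examples)
print(learned)
```

When $t = o$ the update is zero, so only misclassified examples change the weights.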

11. 2′. The Linear Unit
To understand the perceptron’s training rule, consider a (simpler) linear unit, where
$$o = w_0 + w_1 x_1 + \cdots + w_n x_n = \sum_{i=0}^{n} w_i x_i \quad (x_0 = 1)$$
Let’s learn $w_i$’s that minimize the squared error
$$E[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$
where $D$ is the set of training examples.
The linear unit uses the gradient descent training rule, presented on the next slides.
Remark: Ch. 6 (Bayesian Learning) shows that, under certain conditions, the hypothesis $h = (w_0, w_1, \dots, w_n)$ that minimises $E$ is also the most probable hypothesis given the training data.

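12. The Gradient Descent Rule
Gradient:
$$\nabla E[\vec{w}] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \cdots, \frac{\partial E}{\partial w_n} \right]$$
Training rule:
$$\vec{w} \leftarrow \vec{w} + \Delta \vec{w}, \quad \text{with} \quad \Delta \vec{w} = -\eta \nabla E[\vec{w}].$$
Therefore,
$$w_i \leftarrow w_i + \Delta w_i, \quad \text{with} \quad \Delta w_i = -\eta \frac{\partial E}{\partial w_i}.$$
[Figure: error surface $E[\vec{w}]$ plotted over the weights $w_0$ and $w_1$]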

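13. The Gradient Descent Rule for the Linear Unit: Computation
$$\begin{aligned}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d} (t_d - o_d)^2 = \frac{1}{2} \sum_{d} \frac{\partial}{\partial w_i} (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_{d} 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d) = \sum_{d} (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d) \\
&= \sum_{d} (t_d - o_d)(-x_{i,d})
\end{aligned}$$
Therefore
$$\Delta w_i = \eta \sum_{d} (t_d - o_d) \, x_{i,d}$$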

14. The Gradient Descent Algorithm for the Linear Unit
Gradient-Descent(training examples, η)
Each training example is a pair of the form $\langle \vec{x}, t \rangle$, where $\vec{x}$ is the vector of input values and $t$ is the target output value. η is the learning rate (e.g., .05).
• Initialize each $w_i$ to some small random value
• Until the termination condition is met
  – Initialize each $\Delta w_i$ to zero.
  – For each $\langle \vec{x}, t \rangle$ in training examples
    ∗ Input the instance $\vec{x}$ to the unit and compute the output $o$
    ∗ For each linear unit weight $w_i$: $\Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i$
  – For each linear unit weight $w_i$: $w_i \leftarrow w_i + \Delta w_i$
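
The algorithm above translates almost line for line into Python. The sketch below makes several assumptions that the slide leaves open: the termination condition is a fixed number of passes (`epochs`), the initial weights are drawn from [-0.05, 0.05] with a fixed seed, and the toy data set follows $t = 1 + 2x$. The function name `gradient_descent` is hypothetical.

```python
import random


def gradient_descent(examples, eta=0.05, epochs=1000, seed=0):
    """Batch gradient descent for a linear unit o = w0 + w1*x1 + ... + wn*xn."""
    rng = random.Random(seed)
    n = len(examples[0][0])
    # Initialize each w_i to some small random value.
    w = [rng.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):  # assumed termination condition: fixed pass count
        delta = [0.0] * (n + 1)  # initialize each delta_w_i to zero
        for x, t in examples:
            xs = (1.0,) + tuple(x)  # x0 = 1 carries the bias weight w0
            o = sum(wi * xi for wi, xi in zip(w, xs))
            for i, xi in enumerate(xs):
                delta[i] += eta * (t - o) * xi
        # Update the weights once per pass, after summing over all examples.
        w = [wi + di for wi, di in zip(w, delta)]
    return w


# Noise-free toy data generated from t = 1 + 2x (an assumption for the demo).
toy_examples = [((float(x),), 1.0 + 2.0 * x) for x in range(4)]
weights = gradient_descent(toy_examples)
print(weights)
```

On this data the learned weights approach $w_0 = 1$ and $w_1 = 2$, the parameters used to generate the targets.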

15. Convergence [Hertz et al., 1991]
The gradient descent training rule used by the linear unit is guaranteed to converge to a hypothesis with minimum squared error
• given a sufficiently small learning rate η
• even when the training data contains noise
• even when the training data is not separable by H
Note: If η is too large, the gradient descent search runs the risk of overstepping the minimum in the error surface rather than settling into it. For this reason, one common modification of the algorithm is to gradually reduce the value of η as the number of gradient descent steps grows.
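
The effect of the learning rate can be seen on a toy problem. The sketch below assumes a one-weight linear unit $o = w x$ trained on three noise-free examples of $t = 2x$. The decay schedule $\eta_k = \eta_0 / (1 + k)$ is only one possible way to reduce η gradually and is an assumption, as is the helper name `final_error`.

```python
def final_error(eta_schedule, steps=200):
    """Run batch gradient descent on o = w*x and return the final squared error.

    eta_schedule(k) gives the learning rate used at step k.
    """
    xs = [1.0, 2.0, 3.0]
    ts = [2.0 * x for x in xs]  # targets from t = 2x
    w = 0.0
    for k in range(steps):
        eta = eta_schedule(k)
        # dE/dw = -sum_d (t_d - o_d) * x_d for E = 1/2 * sum_d (t_d - o_d)^2
        grad = -sum((t - w * x) * x for x, t in zip(xs, ts))
        w -= eta * grad
    return 0.5 * sum((t - w * x) ** 2 for x, t in zip(xs, ts))


small = final_error(lambda k: 0.01)             # small constant rate
large = final_error(lambda k: 0.2)              # rate too large: overshoots
decayed = final_error(lambda k: 0.2 / (1 + k))  # same start, gradually reduced

print(small, large, decayed)
```

With the constant rate 0.2 every step overshoots the minimum by more than it corrects, so the error grows, while the small constant rate and the decayed rate both settle near zero error on this problem.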

16. Remark
Gradient descent (and similarly, gradient ascent: $\vec{w} \leftarrow \vec{w} + \eta \nabla E$) is an important general paradigm for learning. It is a strategy for searching through a large or infinite hypothesis space that can be applied whenever
• the hypothesis space contains continuously parametrized hypotheses
• the error can be differentiated w.r.t. these hypothesis parameters.
Practical difficulties in applying the gradient method:
• if there are multiple local optima in the error surface, then there is no guarantee that the procedure will find the global optimum.
• converging to a local optimum can sometimes be quite slow.
To alleviate these difficulties, a variation called the incremental (or: stochastic) gradient method was proposed.

17. Incremental (Stochastic) Gradient Descent
Batch mode Gradient Descent:
Do until satisfied
1. Compute the gradient $\nabla E_D[\vec{w}]$
2. $\vec{w} \leftarrow \vec{w} - \eta \nabla E_D[\vec{w}]$
where $E_D[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
Incremental mode Gradient Descent:
Do until satisfied
• For each training example $d$ in $D$
  1. Compute the gradient $\nabla E_d[\vec{w}]$
  2. $\vec{w} \leftarrow \vec{w} - \eta \nabla E_d[\vec{w}]$
where $E_d[\vec{w}] \equiv \frac{1}{2} (t_d - o_d)^2$
Convergence: The Incremental Gradient Descent can approximate the Batch Gradient Descent arbitrarily closely if η is made small enough.
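
A minimal sketch of the incremental mode for a linear unit follows. The only change from batch mode is that the weights are updated after every example instead of once per pass. The fixed pass count, the zero initial weights, the toy data ($t = 1 + 2x$), and the name `incremental_gradient_descent` are assumptions made for illustration.

```python
def incremental_gradient_descent(examples, eta=0.05, epochs=1000):
    """Incremental (stochastic) gradient descent for a linear unit.

    For each example d, E_d = 1/2 * (t_d - o_d)^2, so the step
    w <- w - eta * grad(E_d) becomes w_i <- w_i + eta * (t_d - o_d) * x_{i,d}.
    """
    n = len(examples[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):  # assumed stopping rule: fixed number of passes
        for x, t in examples:
            xs = (1.0,) + tuple(x)  # x0 = 1 carries the bias weight w0
            o = sum(wi * xi for wi, xi in zip(w, xs))
            # Update immediately, using the gradient of this example only.
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xs)]
    return w


# Noise-free toy data generated from t = 1 + 2x (an assumption for the demo).
line_examples = [((float(x),), 1.0 + 2.0 * x) for x in range(4)]
sgd_weights = incremental_gradient_descent(line_examples)
print(sgd_weights)
```

Implementations often visit the examples in a random order on each pass. The sketch keeps the fixed order "for each training example d in D" used on the slide.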

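18. 2′′. The Sigmoid Unit
[Figure: inputs $x_1, \dots, x_n$ with weights $w_1, \dots, w_n$, plus a constant input $x_0 = 1$ with weight $w_0$, feed a unit that computes $net = \sum_{i=0}^{n} w_i x_i$ and outputs $o = \sigma(net) = \frac{1}{1 + e^{-net}}$]
$\sigma(x)$ is the sigmoid function
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Nice property:
$$\frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))$$
We can derive gradient descent rules to train
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation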
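
The derivative property can be checked numerically. The sketch below compares $\sigma(x)(1 - \sigma(x))$ with a central finite difference at a few points. The test points and the step size `h` are arbitrary choices, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """The sigmoid function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Derivative computed from the identity sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


h = 1e-6  # step for the central finite difference
for x in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    print(x, sigmoid_derivative(x), numeric)
```

The two columns should agree to several decimal places, which is what makes the identity convenient: the derivative comes for free once the unit's output is known.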

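19. Error Gradient for the Sigmoid Unit
$$\begin{aligned}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 = \frac{1}{2} \sum_{d} \frac{\partial}{\partial w_i} (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_{d} 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d) = \sum_{d} (t_d - o_d) \left( -\frac{\partial o_d}{\partial w_i} \right) \\
&= -\sum_{d} (t_d - o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i}
\end{aligned}$$
But
$$\frac{\partial o_d}{\partial net_d} = \frac{\partial \sigma(net_d)}{\partial net_d} = o_d (1 - o_d), \qquad \frac{\partial net_d}{\partial w_i} = \frac{\partial (\vec{w} \cdot \vec{x}_d)}{\partial w_i} = x_{i,d}$$
So:
$$\frac{\partial E}{\partial w_i} = -\sum_{d \in D} o_d (1 - o_d)(t_d - o_d) \, x_{i,d}$$
where $net_d = \sum_{i=0}^{n} w_i x_{i,d}$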
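
The final formula can be sketched directly in code and compared against the squared error it differentiates. The OR data set with 0/1 targets, the sample weight vector, and the function names are assumptions made for illustration.

```python
import math


def unit_output(w, x):
    """Sigmoid unit output o = sigma(net), with net = sum_i w_i * x_i and x0 = 1."""
    net = sum(wi * xi for wi, xi in zip(w, (1.0,) + tuple(x)))
    return 1.0 / (1.0 + math.exp(-net))


def squared_error(w, examples):
    """E[w] = 1/2 * sum_d (t_d - o_d)^2."""
    return 0.5 * sum((t - unit_output(w, x)) ** 2 for x, t in examples)


def error_gradient(w, examples):
    """dE/dw_i = -sum_d o_d * (1 - o_d) * (t_d - o_d) * x_{i,d}."""
    grad = [0.0] * len(w)
    for x, t in examples:
        xs = (1.0,) + tuple(x)
        o = unit_output(w, x)
        for i, xi in enumerate(xs):
            grad[i] -= o * (1.0 - o) * (t - o) * xi
    return grad


# OR with 0/1 targets and an arbitrary weight vector (both assumed for the demo).
or_examples = [((0, 0), 0.0), ((0, 1), 1.0), ((1, 0), 1.0), ((1, 1), 1.0)]
w_demo = [0.1, -0.2, 0.3]
print(error_gradient(w_demo, or_examples))
```

A gradient descent step for the sigmoid unit then subtracts η times each component of this gradient from the corresponding weight.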

20. Remark
Instead of gradient descent, one could use linear programming to find a hypothesis consistent with separable data. [Duda & Hart, 1973] have shown that linear programming can be extended to the non-linearly separable case. However, linear programming does not scale to multilayer networks, as gradient descent does (see next section).

21. 3. Multilayer Networks of Sigmoid Units
An example
This network was trained to recognize 1 of 10 vowel sounds occurring in the context “h_d” (e.g. “head”, “hid”). The inputs have been obtained from a spectral analysis of sound. The 10 network outputs correspond to the 10 possible vowel sounds. The network prediction is the output whose value is the highest.
[Figure: network with two inputs, F1 and F2, and outputs labelled head, hid, who’d, hood, ...]
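
Picking the prediction from the output units amounts to an argmax. The sketch below uses only the four labels visible in the figure and made-up activation values; both the values and the helper name `predict_label` are hypothetical.

```python
def predict_label(outputs, labels):
    """Return the label of the output unit with the highest value."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return labels[best]


# Four of the ten vowel classes, with hypothetical output activations.
vowel_labels = ["head", "hid", "who'd", "hood"]
activations = [0.12, 0.81, 0.05, 0.33]

prediction = predict_label(activations, vowel_labels)
print(prediction)
```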

22. This plot illustrates the highly non-linear decision surface represented by the learned network. Points shown on the plot are test examples distinct from the examples used to train the network.
From [Huang & Lippmann, 1988]
