SLIDE 1

Artificial Neural Networks

  • Threshold units
  • Gradient descent
  • Multilayer networks
  • Backpropagation
  • Hidden layer representations
  • Example: Face Recognition
  • Advanced topics

SLIDE 2

Connectionist Models

Consider humans:

  • Neuron switching time ≈ .001 second
  • Number of neurons ≈ 10^10
  • Connections per neuron ≈ 10^4–10^5
  • Scene recognition time ≈ .1 second
  • 100 inference steps doesn’t seem like enough

→ much parallel computation

Properties of artificial neural nets (ANNs):

  • Many neuron-like threshold switching units
  • Many weighted interconnections among units
  • Highly parallel, distributed processing
  • Emphasis on tuning weights automatically

SLIDE 3

When to Consider Neural Networks

  • Input is high-dimensional discrete or real-valued (e.g., raw sensor input)

  • Output is discrete or real-valued
  • Output is a vector of values
  • Possibly noisy data
  • Form of target function is unknown
  • Human readability of result is unimportant

Examples:

  • Speech phoneme recognition [Waibel]
  • Image classification [Kanade, Baluja, Rowley]
  • Financial prediction

SLIDE 4

ALVINN drives 70 mph on highways

SLIDE 5

Perceptron

[Figure: a perceptron. Inputs x1, x2, . . . , xn with weights w1, w2, . . . , wn (plus x0 = 1 with bias weight w0) feed a summation unit Σ, followed by a threshold.]

o = 1 if Σ_{i=0}^{n} wi xi > 0, and −1 otherwise (with x0 = 1). Written out:

o(x1, . . . , xn) =
    1 if w0 + w1x1 + · · · + wnxn > 0
    −1 otherwise

Sometimes we’ll use simpler vector notation:

o(x) =
    1 if w · x > 0
    −1 otherwise
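The thresholded output above translates directly into code; a minimal sketch (the function name is illustrative):

```python
def perceptron_output(w, x):
    """Perceptron output: 1 if w0 + w1*x1 + ... + wn*xn > 0, else -1.
    w has n+1 entries; w[0] is the bias weight w0 (its input x0 is fixed at 1)."""
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if net > 0 else -1
```

For example, `perceptron_output([-0.5, 1.0], [1.0])` returns 1 (net = 0.5 > 0), while `perceptron_output([-0.5, 1.0], [0.0])` returns −1.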

SLIDE 6

Decision Surface of a Perceptron

[Figure: (a) a linearly separable set of + and − examples in the (x1, x2) plane, split by a perceptron’s decision line; (b) a set of examples that no single line can separate.]
Represents some useful functions

  • What weights represent g(x1, x2) = AND(x1, x2)?

But some functions not representable

  • e.g., not linearly separable
  • Therefore, we’ll want networks of these...
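One answer to the AND question above (for inputs in {0, 1}; other weight choices work too): take w0 = −1.5 and w1 = w2 = 1, so the net input exceeds 0 only when both inputs are 1. A quick check:

```python
def perceptron_output(w, x):
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if net > 0 else -1

w_and = [-1.5, 1.0, 1.0]  # w0 = -1.5, w1 = w2 = 1
for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), perceptron_output(w_and, [x1, x2]))
# only (1, 1) yields 1; the other three inputs yield -1
```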

SLIDE 7

Perceptron training rule

wi ← wi + Δwi

where Δwi = η(t − o)xi

Where:

  • t = c(x) is target value

  • o is perceptron output
  • η is small constant (e.g., .1) called learning rate
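A minimal sketch of the rule in code (the learning rate and epoch cap are illustrative; when o = t the update is zero, so for separable data the weights eventually stop changing):

```python
def train_perceptron(examples, eta=0.1, epochs=100):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.
    examples: list of (x, t) pairs with t in {-1, +1};
    w[0] is the bias weight w0 (its input x0 is fixed at 1)."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        for x, t in examples:
            net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            o = 1 if net > 0 else -1
            w[0] += eta * (t - o)              # x0 = 1
            for i, xi in enumerate(x):
                w[i + 1] += eta * (t - o) * xi
    return w
```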

SLIDE 8

Perceptron training rule

Can prove it will converge

  • If training data is linearly separable
  • and η sufficiently small

SLIDE 9

Gradient Descent

To understand, consider simpler linear unit, where

o = w0 + w1x1 + · · · + wnxn

Let’s learn wi’s that minimize the squared error

E[w] ≡ ½ Σ_{d∈D} (td − od)²

Where D is set of training examples

SLIDE 10

Gradient Descent

[Figure: the error surface E[w] plotted over weights (w0, w1); for a linear unit it is a paraboloid with a single global minimum.]

Gradient:

∇E[w] ≡ [ ∂E/∂w0, ∂E/∂w1, · · · , ∂E/∂wn ]

Training rule:

Δw = −η ∇E[w]

i.e.,

Δwi = −η ∂E/∂wi

SLIDE 11

Gradient Descent

∂E/∂wi = ∂/∂wi ½ Σ_d (td − od)²
       = ½ Σ_d ∂/∂wi (td − od)²
       = ½ Σ_d 2(td − od) ∂/∂wi (td − od)
       = Σ_d (td − od) ∂/∂wi (td − w · xd)

∂E/∂wi = Σ_d (td − od)(−xi,d)
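As a sanity check on the final expression, the analytic gradient Σ_d (td − od)(−xi,d) can be compared against a finite-difference estimate of E (the data and weights below are made up for illustration):

```python
def error(w, data):
    """E[w] = 1/2 * sum over d of (t_d - o_d)^2 for a linear unit o = w . x."""
    return 0.5 * sum((t - sum(wi * xi for wi, xi in zip(w, x))) ** 2
                     for x, t in data)

def analytic_grad(w, data, i):
    """dE/dw_i = sum over d of (t_d - o_d) * (-x_{i,d})."""
    return sum((t - sum(wi * xi for wi, xi in zip(w, x))) * (-x[i])
               for x, t in data)

data = [([1.0, 2.0], 1.0), ([1.0, -1.0], -2.0)]   # each x includes x0 = 1
w = [0.3, -0.7]
eps = 1e-6
for i in range(2):
    wp = w[:]; wp[i] += eps
    wm = w[:]; wm[i] -= eps
    numeric = (error(wp, data) - error(wm, data)) / (2 * eps)
    assert abs(numeric - analytic_grad(w, data, i)) < 1e-6
```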

SLIDE 12

Gradient Descent

Gradient-Descent(training_examples, η)

Each training example is a pair of the form ⟨x, t⟩, where x is the vector of input values, and t is the target output value. η is the learning rate (e.g., .05).

  • Initialize each wi to some small random value
  • Until the termination condition is met, Do
    – Initialize each Δwi to zero
    – For each ⟨x, t⟩ in training_examples, Do
      ∗ Input the instance x to the unit and compute the output o
      ∗ For each linear unit weight wi, Do
        Δwi ← Δwi + η(t − o)xi
    – For each linear unit weight wi, Do
      wi ← wi + Δwi
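The Gradient-Descent procedure above, transcribed as a sketch (the learning rate, epoch-count termination test, and seed are illustrative choices):

```python
import random

def gradient_descent(training_examples, eta=0.05, epochs=1000, seed=0):
    """Batch gradient descent for a linear unit o = w . x (x includes x0 = 1).
    Accumulates Delta_wi over all examples, then applies it once per pass."""
    rnd = random.Random(seed)
    n = len(training_examples[0][0])
    w = [rnd.uniform(-0.05, 0.05) for _ in range(n)]   # small random values
    for _ in range(epochs):                            # termination condition
        delta = [0.0] * n
        for x, t in training_examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                delta[i] += eta * (t - o) * x[i]
        for i in range(n):
            w[i] += delta[i]
    return w
```

For instance, on examples generated from t = 2·x1 − 1 the learned weights approach w0 ≈ −1, w1 ≈ 2.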

SLIDE 13

Summary

Perceptron training rule guaranteed to succeed if

  • Training examples are linearly separable
  • Sufficiently small learning rate η

Linear unit training rule uses gradient descent

  • Guaranteed to converge to hypothesis with minimum squared error

  • Given sufficiently small learning rate η
  • Even when training data contains noise
  • Even when training data not separable by H

SLIDE 14

Incremental (Stochastic) Gradient Descent

Batch mode Gradient Descent: Do until satisfied
  1. Compute the gradient ∇E_D[w]
  2. w ← w − η ∇E_D[w]

Incremental mode Gradient Descent: Do until satisfied
  • For each training example d in D
    1. Compute the gradient ∇E_d[w]
    2. w ← w − η ∇E_d[w]

E_D[w] ≡ ½ Σ_{d∈D} (td − od)²

E_d[w] ≡ ½ (td − od)²

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η made small enough
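The incremental variant moves the weight update inside the per-example loop, following ∇E_d instead of ∇E_D; a sketch (learning rate and epoch cap are illustrative):

```python
def incremental_gradient_descent(training_examples, eta=0.05, epochs=1000):
    """Stochastic gradient descent for a linear unit o = w . x (x includes
    x0 = 1): one weight update per example, using the gradient of E_d."""
    n = len(training_examples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, t in training_examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                w[i] += eta * (t - o) * x[i]   # w <- w - eta * grad(E_d)
    return w
```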

SLIDE 15

Multilayer Networks of Sigmoid Units

SLIDE 16

Learning Complex Concepts with Gradient Descent

  • Threshold Units
    – Complex decision surfaces
    – But, cannot differentiate threshold rule
  • Linear Units
    – Differentiable
    – But, networks can only learn linear functions

Need a non-linear, differentiable threshold function

SLIDE 17

Sigmoid Unit

[Figure: a sigmoid unit. Inputs x1, . . . , xn with weights w1, . . . , wn (plus x0 = 1 with bias weight w0) feed a summation unit Σ, followed by the sigmoid.]

net = Σ_{i=0}^{n} wi xi,    o = σ(net) = 1 / (1 + e^(−net))

σ(x) is the sigmoid function:

σ(x) = 1 / (1 + e^(−x))

Nice property:

dσ(x)/dx = σ(x)(1 − σ(x))

We can derive gradient descent rules to train

  • One sigmoid unit
  • Multilayer networks of sigmoid units → Backpropagation
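The sigmoid and its “nice property” derivative, as a sketch in code:

```python
import math

def sigmoid(x):
    """The sigmoid function: sigma(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    """d sigma / dx = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)
```

For example, sigmoid(0) = 0.5 and sigmoid_deriv(0) = 0.25, and a finite-difference estimate of the slope matches the closed form.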

SLIDE 18

Error Gradient for a Sigmoid Unit

∂E/∂wi = ∂/∂wi ½ Σ_{d∈D} (td − od)²
       = ½ Σ_d ∂/∂wi (td − od)²
       = ½ Σ_d 2(td − od) ∂/∂wi (td − od)
       = Σ_d (td − od) (−∂od/∂wi)
       = −Σ_d (td − od) (∂od/∂netd)(∂netd/∂wi)

But we know:

∂od/∂netd = ∂σ(netd)/∂netd = od(1 − od)

∂netd/∂wi = ∂(w · xd)/∂wi = xi,d

So:

∂E/∂wi = −Σ_{d∈D} (td − od) od(1 − od) xi,d

SLIDE 19

Backpropagation Algorithm

Initialize all weights to small random numbers.
Until satisfied, Do

  • For each training example, Do
    1. Input the training example to the network and compute the network outputs
    2. For each output unit k
       δk ← ok(1 − ok)(tk − ok)
    3. For each hidden unit h
       δh ← oh(1 − oh) Σ_{k∈outputs} wh,k δk
    4. Update each network weight wi,j:
       wi,j ← wi,j + Δwi,j   where   Δwi,j = η δj xi,j
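A minimal transcription of the algorithm for one hidden layer of sigmoid units (the network sizes, learning rate, epoch count, and the OR training data in the test are illustrative choices):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w_hidden, w_out, x):
    """Forward pass. Each row of a weight matrix holds one unit's incoming
    weights; index 0 is the bias weight (its input is fixed at 1)."""
    xb = [1.0] + list(x)
    h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
    hb = [1.0] + h
    o = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_out]
    return h, o

def backprop(examples, n_in, n_hidden, n_out, eta=0.5, epochs=3000, seed=0):
    rnd = random.Random(seed)
    w_hidden = [[rnd.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    w_out = [[rnd.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
             for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in examples:
            # 1. compute the network outputs
            h, o = forward(w_hidden, w_out, x)
            # 2. output unit deltas: delta_k = o_k (1 - o_k)(t_k - o_k)
            d_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            # 3. hidden unit deltas: delta_h = o_h (1 - o_h) sum_k w_hk delta_k
            d_hid = [h[j] * (1 - h[j]) *
                     sum(w_out[k][j + 1] * d_out[k] for k in range(n_out))
                     for j in range(n_hidden)]
            # 4. weight updates: w_ij <- w_ij + eta * delta_j * x_ij
            hb = [1.0] + h
            xb = [1.0] + list(x)
            for k in range(n_out):
                for i in range(n_hidden + 1):
                    w_out[k][i] += eta * d_out[k] * hb[i]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_hidden[j][i] += eta * d_hid[j] * xb[i]
    return w_hidden, w_out
```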

SLIDE 20

More on Backpropagation

  • Gradient descent over entire network weight vector
  • Easily generalized to arbitrary directed graphs
  • Will find a local, not necessarily global error minimum
    – In practice, often works well (can run multiple times)

  • Often include weight momentum α

Δwi,j(n) = ηδjxi,j + αΔwi,j(n − 1)

  • Minimizes error over training examples
    – Will it generalize well to subsequent examples?

  • Training can take thousands of iterations → slow!
  • Using network after training is very fast
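The momentum term above keeps a per-weight running Δw; a sketch, with `grad_term` standing in for δj·xi,j (all names and constants are illustrative):

```python
def momentum_step(w, grad_term, prev_dw, eta=0.1, alpha=0.9):
    """One update with momentum: dw(n) = eta * grad_term + alpha * dw(n-1).
    grad_term[i] plays the role of delta_j * x_ij in the backprop rule."""
    dw = [eta * g + alpha * p for g, p in zip(grad_term, prev_dw)]
    w = [wi + di for wi, di in zip(w, dw)]
    return w, dw

# repeated identical gradients: the step grows toward eta / (1 - alpha)
w, dw = [0.0], [0.0]
for _ in range(3):
    w, dw = momentum_step(w, [1.0], dw)
```

With α = 0.9 the steady-state step on a constant gradient is ten times the plain gradient step, which is what lets momentum roll through small plateaus.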

SLIDE 21

Learning Hidden Layer Representations

[Figure: an 8 × 3 × 8 network: 8 input units, 3 hidden units, 8 output units.]

A target function:

Input    →  Output
10000000 → 10000000
01000000 → 01000000
00100000 → 00100000
00010000 → 00010000
00001000 → 00001000
00000100 → 00000100
00000010 → 00000010
00000001 → 00000001

Can this be learned??

SLIDE 22

Learning Hidden Layer Representations

A network:

[Figure: the 8 × 3 × 8 network of inputs, hidden units, and outputs.]

Learned hidden layer representation:

Input       Hidden Values      Output
10000000 → .89 .04 .08 → 10000000
01000000 → .01 .11 .88 → 01000000
00100000 → .01 .97 .27 → 00100000
00010000 → .99 .97 .71 → 00010000
00001000 → .03 .05 .02 → 00001000
00000100 → .22 .99 .99 → 00000100
00000010 → .80 .01 .98 → 00000010
00000001 → .60 .94 .01 → 00000001
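Rounding each learned hidden vector in the table to 0/1 suggests the network has invented a distinct 3-bit code for each of the eight inputs; a quick check:

```python
# learned hidden values from the table above (one row per input pattern)
hidden = [
    [.89, .04, .08], [.01, .11, .88], [.01, .97, .27], [.99, .97, .71],
    [.03, .05, .02], [.22, .99, .99], [.80, .01, .98], [.60, .94, .01],
]
codes = ["".join("1" if v >= 0.5 else "0" for v in h) for h in hidden]
print(codes)
print(len(set(codes)))  # 8 distinct codes: a compact binary encoding
```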

SLIDE 23

Training

[Figure: sum of squared errors for each output unit, plotted over 2500 training epochs.]

SLIDE 24

Training

[Figure: evolution of the hidden unit encoding for input 01000000 over 2500 training epochs.]

SLIDE 25

Training

[Figure: weights from the inputs to one hidden unit, evolving over 2500 training epochs.]

SLIDE 26

Convergence of Backpropagation

Gradient descent to some local minimum

  • Perhaps not global minimum...
  • Add momentum
  • Stochastic gradient descent
  • Train multiple nets with different initial weights

Nature of convergence

  • Initialize weights near zero
  • Therefore, initial networks near-linear
  • Increasingly non-linear functions possible as training progresses

SLIDE 27

Expressive Capabilities of ANNs

Boolean functions:

  • Every boolean function can be represented by network with single hidden layer
  • but might require exponential (in number of inputs) hidden units

Continuous functions:

  • Every bounded continuous function can be approximated with arbitrarily small error, by network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
  • Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]

SLIDE 28

Overfitting in ANNs

[Figure: error versus number of weight updates (examples 1 and 2), plotting training set error and validation set error for each.]

28

slide-29
SLIDE 29

Neural Nets for Face Recognition

[Figure: the face-recognition network: 30×32 image inputs feeding hidden units, with four outputs (left, strt, rght, up). Typical input images are shown.]

90% accurate learning head pose, and recognizing 1-of-20 faces

29

slide-30
SLIDE 30

Learned Hidden Unit Weights

[Figure: learned weights between the 30×32 inputs and the hidden units, alongside typical input images; outputs are left, strt, rght, up.]

http://www.cs.cmu.edu/~tom/faces.html

30

slide-31
SLIDE 31

Alternative Error Functions

Penalize large weights:

E(w) ≡ ½ Σ_{d∈D} Σ_{k∈outputs} (tkd − okd)² + γ Σ_{i,j} w²_{ji}

Train on target slopes as well as values:

E(w) ≡ ½ Σ_{d∈D} Σ_{k∈outputs} [ (tkd − okd)² + μ Σ_{j∈inputs} ( ∂tkd/∂x_j^d − ∂okd/∂x_j^d )² ]

Tie together weights:

  • e.g., in phoneme recognition network
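The weight penalty γ Σ w²_{ji} contributes 2γw_{ji} to each partial derivative ∂E/∂w_{ji}, so gradient descent on the penalized error just adds a shrink-toward-zero term to every update; a sketch (names and constants are illustrative):

```python
def update_with_weight_decay(w, grad, eta=0.05, gamma=0.001):
    """Gradient step on E(w) + gamma * sum_ji w_ji^2: the penalty adds
    2 * gamma * w_ji to dE/dw_ji, shrinking every weight toward zero."""
    return [wi - eta * (gi + 2 * gamma * wi) for wi, gi in zip(w, grad)]

# with a zero error gradient, only the decay acts: each weight shrinks
# by a factor of (1 - 2 * eta * gamma)
w = update_with_weight_decay([1.0, -2.0], [0.0, 0.0], eta=0.5, gamma=0.1)
```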

SLIDE 32

Recurrent Networks

[Figure: (a) a feedforward network computing y(t + 1) from x(t); (b) a recurrent network whose context units c(t) feed hidden-unit activations back in as inputs; (c) the recurrent network unfolded in time, with copies at steps t − 2, t − 1, and t producing y(t + 1).]
