

SLIDE 1

Supervised Learning in Neural Networks

Keith L. Downing

The Norwegian University of Science and Technology (NTNU) Trondheim, Norway keithd@idi.ntnu.no

March 7, 2011

SLIDE 2

Supervised Learning

- Constant feedback from an instructor, indicating not only right/wrong, but also the correct answer for each training case.
- Many cases (i.e., input-output pairs) to be learned.
- Weights are modified by a complex procedure (back-propagation) based on output error.
- Feed-forward networks with back-propagation learning are the standard implementation; 99% of neural network applications use this.
- Typical usage: problems with (a) lots of input-output training data, and (b) the goal of a mapping (function) from inputs to outputs.
- Not biologically plausible, although the cerebellum appears to exhibit some aspects.
- But the result of backprop, a trained ANN that performs some function, can be very useful to neuroscientists as a sufficiency proof.

SLIDE 3

Backpropagation Overview

[Figure: encoder/decoder schematic. Training/test cases {(d1, r1), (d2, r2), (d3, r3), ...}; the network's output r3 is compared with the target r* to give the error E = r3 − r*, which drives the gradient dE/dW.]

Feed-Forward Phase - Inputs are sent through the ANN to compute outputs.

Feedback Phase - Error is passed back from the output to the input layers and used to update weights along the way.
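To make the two phases concrete, here is a minimal NumPy sketch (mine, not the author's code) of one training step for a single-hidden-layer sigmoid network; it anticipates the update rules derived on later slides, and the names (`W1`, `W2`, `eta`) are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, t, W1, W2, eta=0.1):
    # Feed-forward phase: inputs sent through the ANN to compute outputs.
    h = sigmoid(W1 @ x)                       # hidden-layer activations
    o = sigmoid(W2 @ h)                       # output-layer activations
    # Feedback phase: error passed back from output toward input,
    # updating weights along the way.
    delta_o = (t - o) * o * (1 - o)           # output error terms
    delta_h = (W2.T @ delta_o) * h * (1 - h)  # hidden error terms
    W2 += eta * np.outer(delta_o, h)
    W1 += eta * np.outer(delta_h, x)
    return o
```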

SLIDE 4

Training -vs- Testing

[Figure: training cases are presented to the neural net N times, with learning; test cases are presented 1 time, without learning.]

- Generalization: correctly handling test cases (that the ANN has not been trained on).
- Over-Training: weights become so fine-tuned to the training cases that generalization suffers: failure on many test cases.
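The slides do not prescribe a recipe for avoiding over-training, but a common one is early stopping: monitor error on the held-out test cases and stop once it starts rising. A sketch under that assumption, with hypothetical helpers `train_epoch` (one pass with learning) and `sse` (error evaluation without learning):

```python
def train_with_early_stopping(net, train_cases, test_cases, max_epochs=1000):
    """Stop when generalization (test-case error) starts to degrade."""
    best_error = float("inf")
    for epoch in range(max_epochs):
        train_epoch(net, train_cases)   # one pass over training cases, learning on
        error = sse(net, test_cases)    # evaluation pass, no learning
        if error > best_error:          # over-training detected
            break
        best_error = error
    return net
```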

SLIDE 5

Widrow-Hoff (a.k.a. Delta) Rule

[Figure: node N with inputs 1, 2, 3; input X reaches N through weight w; S is the summed input and Y the output.]

T = target output value; δ = error.

$$\delta = T - Y$$

$$\Delta w = \eta\,\delta\,X$$

- Delta (δ) = error; Eta (η) = learning rate.
- Goal: change w so as to reduce |δ|.
- Intuition: if δ > 0, we want to decrease it, so we must increase Y. Thus we must increase the sum of weighted inputs to N, which we do by increasing w if X is positive, or decreasing w if X is negative. The case δ < 0 is similar.
- Assumes the derivative of N's transfer function is everywhere non-negative.
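A direct transcription of the rule into Python (my sketch; names are illustrative):

```python
def delta_rule_update(w, x, y, t, eta=0.1):
    """Widrow-Hoff update for one weight on one training case."""
    delta = t - y               # δ = T - Y
    return w + eta * delta * x  # w + Δw, where Δw = η·δ·X
```

If δ > 0 and X > 0, the weight grows, which raises Y and shrinks |δ|, matching the intuition above.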

SLIDE 6

Gradient Descent

- Goal: minimize total error across all output nodes.
- Method: modify weights throughout the network (i.e., at all levels) to follow the route of steepest descent in error space.

[Figure: error E as a function of the weight vector W; gradient descent steps ΔE move toward the minimum.]

$$\Delta w_{ij} = -\eta\,\frac{\partial E_i}{\partial w_{ij}}$$
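As a toy illustration of steepest descent (mine, not from the slides): minimizing the one-weight error surface E(w) = (w - 2)^2 by repeatedly stepping against the gradient:

```python
eta, w = 0.1, 0.0
for step in range(50):
    grad = 2 * (w - 2)   # dE/dw for E(w) = (w - 2)^2
    w -= eta * grad      # Δw = -η · dE/dw
print(w)                 # close to the minimum at w = 2.0
```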

SLIDE 7

Computing ∂Ei/∂wij

[Figure: node i with inputs x_1d..x_nd, weights w_i1..w_in, summed input sum_id, transfer function f_T, output o_id, target t_id, and error E_id.]

Sum of Squared Errors (SSE):

$$E_i = \frac{1}{2} \sum_{d \in D} (t_{id} - o_{id})^2$$

$$\frac{\partial E_i}{\partial w_{ij}} = \frac{1}{2} \sum_{d \in D} 2\,(t_{id} - o_{id})\,\frac{\partial (t_{id} - o_{id})}{\partial w_{ij}} = \sum_{d \in D} (t_{id} - o_{id})\,\frac{\partial (-o_{id})}{\partial w_{ij}}$$

SLIDE 8

Computing ∂(−oid)/∂wij

[Figure: same node-i fragment as on the previous slide.]

Since output = f(sum of weighted inputs):

$$\frac{\partial E_i}{\partial w_{ij}} = \sum_{d \in D} (t_{id} - o_{id})\,\frac{\partial\,(-f_T(\text{sum}_{id}))}{\partial w_{ij}}, \quad \text{where} \quad \text{sum}_{id} = \sum_{k=1}^{n} w_{ik}\,x_{kd}$$

Using the Chain Rule:

$$\frac{\partial f(g(x))}{\partial x} = \frac{\partial f}{\partial g(x)} \times \frac{\partial g(x)}{\partial x}$$

$$\frac{\partial f_T(\text{sum}_{id})}{\partial w_{ij}} = \frac{\partial f_T(\text{sum}_{id})}{\partial \text{sum}_{id}} \times \frac{\partial \text{sum}_{id}}{\partial w_{ij}} = \frac{\partial f_T(\text{sum}_{id})}{\partial \text{sum}_{id}} \times x_{jd}$$

SLIDE 9

Computing ∂sumid/∂wij

Easy!!

$$\frac{\partial\,\text{sum}_{id}}{\partial w_{ij}} = \frac{\partial \left( \sum_{k=1}^{n} w_{ik}\,x_{kd} \right)}{\partial w_{ij}} = \frac{\partial\,(w_{i1}x_{1d} + w_{i2}x_{2d} + \dots + w_{ij}x_{jd} + \dots + w_{in}x_{nd})}{\partial w_{ij}}$$

$$= \frac{\partial(w_{i1}x_{1d})}{\partial w_{ij}} + \frac{\partial(w_{i2}x_{2d})}{\partial w_{ij}} + \dots + \frac{\partial(w_{ij}x_{jd})}{\partial w_{ij}} + \dots + \frac{\partial(w_{in}x_{nd})}{\partial w_{ij}} = 0 + 0 + \dots + x_{jd} + \dots + 0 = x_{jd}$$

SLIDE 10

Computing ∂fT(sumid)/∂sumid

Harder for some fT.

fT = Identity function: $f_T(\text{sum}_{id}) = \text{sum}_{id}$, so

$$\frac{\partial f_T(\text{sum}_{id})}{\partial \text{sum}_{id}} = 1 \quad \Rightarrow \quad \frac{\partial f_T(\text{sum}_{id})}{\partial w_{ij}} = \frac{\partial f_T(\text{sum}_{id})}{\partial \text{sum}_{id}} \times \frac{\partial \text{sum}_{id}}{\partial w_{ij}} = 1 \times x_{jd} = x_{jd}$$

fT = Sigmoid: $f_T(\text{sum}_{id}) = \frac{1}{1 + e^{-\text{sum}_{id}}}$, so

$$\frac{\partial f_T(\text{sum}_{id})}{\partial \text{sum}_{id}} = o_{id}(1 - o_{id}) \quad \Rightarrow \quad \frac{\partial f_T(\text{sum}_{id})}{\partial w_{ij}} = o_{id}(1 - o_{id}) \times x_{jd}$$
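A quick numerical check (my sketch) that the sigmoid's derivative really is o(1 − o), using a central finite difference:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

s, h = 0.5, 1e-6
o = sigmoid(s)
numeric = (sigmoid(s + h) - sigmoid(s - h)) / (2 * h)  # finite difference
assert abs(numeric - o * (1 - o)) < 1e-8               # matches o(1 - o)
```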

SLIDE 11

The only non-trivial calculation

$$\frac{\partial f_T(\text{sum}_{id})}{\partial \text{sum}_{id}} = \frac{\partial\,(1 + e^{-\text{sum}_{id}})^{-1}}{\partial \text{sum}_{id}} = (-1)(1 + e^{-\text{sum}_{id}})^{-2}\,\frac{\partial(1 + e^{-\text{sum}_{id}})}{\partial \text{sum}_{id}} = (-1)(-1)\,e^{-\text{sum}_{id}}(1 + e^{-\text{sum}_{id}})^{-2} = \frac{e^{-\text{sum}_{id}}}{(1 + e^{-\text{sum}_{id}})^2}$$

But notice that, since $1 - f_T(\text{sum}_{id}) = \frac{e^{-\text{sum}_{id}}}{1 + e^{-\text{sum}_{id}}}$:

$$\frac{e^{-\text{sum}_{id}}}{(1 + e^{-\text{sum}_{id}})^2} = f_T(\text{sum}_{id})\,(1 - f_T(\text{sum}_{id})) = o_{id}(1 - o_{id})$$

SLIDE 12

Putting it all together

$$\frac{\partial E_i}{\partial w_{ij}} = \sum_{d \in D} (t_{id} - o_{id})\,\frac{\partial(-f_T(\text{sum}_{id}))}{\partial w_{ij}} = -\sum_{d \in D} (t_{id} - o_{id})\,\frac{\partial f_T(\text{sum}_{id})}{\partial \text{sum}_{id}} \times \frac{\partial \text{sum}_{id}}{\partial w_{ij}}$$

So for fT = Identity:

$$\frac{\partial E_i}{\partial w_{ij}} = -\sum_{d \in D} (t_{id} - o_{id})\,x_{jd}$$

and for fT = Sigmoid:

$$\frac{\partial E_i}{\partial w_{ij}} = -\sum_{d \in D} (t_{id} - o_{id})\,o_{id}(1 - o_{id})\,x_{jd}$$
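A sketch (mine) of the sigmoid-case gradient over a whole dataset, vectorized with NumPy; `X` holds one training case per row:

```python
import numpy as np

def gradient_sse(W, X, t):
    """∂E_i/∂w_ij for a sigmoid node: rows of X are cases x_d, t holds t_id."""
    o = 1.0 / (1.0 + np.exp(-(X @ W)))   # o_id for every case d
    return -((t - o) * o * (1 - o)) @ X  # -Σ_d (t-o)·o·(1-o)·x_jd, per weight j
```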

SLIDE 13

Weight Updates (fT = Sigmoid)

Batch: update weights after each training epoch.

$$\Delta w_{ij} = -\eta\,\frac{\partial E_i}{\partial w_{ij}} = \eta \sum_{d \in D} (t_{id} - o_{id})\,o_{id}(1 - o_{id})\,x_{jd}$$

The weight changes are actually computed after each training case, but w_ij is not updated until the epoch's end.

Incremental: update weights after each training case.

$$\Delta w_{ij} = -\eta\,\frac{\partial E_{id}}{\partial w_{ij}} = \eta\,(t_{id} - o_{id})\,o_{id}(1 - o_{id})\,x_{jd}$$

A lower learning rate (η) is recommended here than for the batch method. Results can depend on case-presentation order, so randomly shuffle the cases after each epoch.
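A sketch (mine) contrasting the two schemes for a single sigmoid node; the default eta values follow the slide's advice of a lower rate for incremental updates:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def batch_epoch(W, X, t, eta=0.3):
    """Compute Δw per case, but apply the accumulated sum at epoch's end."""
    dW = np.zeros_like(W)
    for x_d, t_d in zip(X, t):
        o = sigmoid(W @ x_d)
        dW += eta * (t_d - o) * o * (1 - o) * x_d
    return W + dW

def incremental_epoch(W, X, t, eta=0.1):
    """Update W after each case, in randomly shuffled order."""
    for d in np.random.permutation(len(X)):
        o = sigmoid(W @ X[d])
        W = W + eta * (t[d] - o) * o * (1 - o) * X[d]
    return W
```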

SLIDE 14

Backpropagation in Multi-Layered Neural Networks

[Figure: node j feeding downstream nodes 1..n through weights w_1j..w_nj; the chains of partials ∂o_jd/∂sum_jd, ∂sum_kd/∂o_jd, and ∂E_d/∂sum_kd trace the influence of sum_jd on the error E_d.]

For each node j and each training case d, backpropagation computes an error term

$$\delta_{jd} = -\frac{\partial E_d}{\partial \text{sum}_{jd}}$$

by calculating the influence of sum_jd along each connection from node j to the next downstream layer.

SLIDE 15

Computing δjd

[Figure: same network fragment as on the previous slide.]

Along the upper path, the contribution to $\frac{\partial E_d}{\partial \text{sum}_{jd}}$ is:

$$\frac{\partial o_{jd}}{\partial \text{sum}_{jd}} \times \frac{\partial \text{sum}_{1d}}{\partial o_{jd}} \times \frac{\partial E_d}{\partial \text{sum}_{1d}}$$

So, summing along all paths:

$$\frac{\partial E_d}{\partial \text{sum}_{jd}} = \frac{\partial o_{jd}}{\partial \text{sum}_{jd}} \sum_{k=1}^{n} \frac{\partial \text{sum}_{kd}}{\partial o_{jd}}\,\frac{\partial E_d}{\partial \text{sum}_{kd}}$$

SLIDE 16

Computing δjd

Just as before, most terms in the derivative of the sum are 0, so:

$$\frac{\partial \text{sum}_{kd}}{\partial o_{jd}} = w_{kj}$$

Assuming fT is a sigmoid:

$$\frac{\partial o_{jd}}{\partial \text{sum}_{jd}} = \frac{\partial f_T(\text{sum}_{jd})}{\partial \text{sum}_{jd}} = o_{jd}(1 - o_{jd})$$

Thus:

$$\delta_{jd} = -\frac{\partial E_d}{\partial \text{sum}_{jd}} = -\frac{\partial o_{jd}}{\partial \text{sum}_{jd}} \sum_{k=1}^{n} \frac{\partial \text{sum}_{kd}}{\partial o_{jd}}\,\frac{\partial E_d}{\partial \text{sum}_{kd}} = -o_{jd}(1 - o_{jd}) \sum_{k=1}^{n} w_{kj}(-\delta_{kd}) = o_{jd}(1 - o_{jd}) \sum_{k=1}^{n} w_{kj}\,\delta_{kd}$$

SLIDE 17

Computing δjd

Note that δjd is defined recursively in terms of the δ values in the next downstream layer:

$$\delta_{jd} = o_{jd}(1 - o_{jd}) \sum_{k=1}^{n} w_{kj}\,\delta_{kd}$$

So all δ values in the network can be computed by moving backwards through the network, one layer at a time.
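A sketch (mine) of that backward sweep for a stack of sigmoid layers. It assumes `activations[l]` holds the outputs o of layer l (layer 0 = input) and `weights[l]` connects layer l to layer l+1:

```python
import numpy as np

def backward_deltas(activations, weights, targets):
    """Return δ vectors for layers 1..L, computed output layer first."""
    o_out = activations[-1]
    deltas = [(targets - o_out) * o_out * (1 - o_out)]   # output-layer δ
    for l in range(len(weights) - 1, 0, -1):
        o = activations[l]
        # δ_jd = o_jd(1 - o_jd) Σ_k w_kj δ_kd, with δ_kd one layer downstream
        deltas.insert(0, (weights[l].T @ deltas[0]) * o * (1 - o))
    return deltas
```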

SLIDE 18

Computing ∂Ed/∂wij from δjd: Easy!!

[Figure: node j feeding node i through weight w_ij; w_ij affects E_d only via sum_id.]

The only effect of w_ij upon the error is via its effect upon sum_id:

$$\frac{\partial \text{sum}_{id}}{\partial w_{ij}} = o_{jd}$$

So:

$$\frac{\partial E_d}{\partial w_{ij}} = \frac{\partial \text{sum}_{id}}{\partial w_{ij}} \times \frac{\partial E_d}{\partial \text{sum}_{id}} = \frac{\partial \text{sum}_{id}}{\partial w_{ij}} \times (-\delta_{id}) = -o_{jd}\,\delta_{id}$$

SLIDE 19

Computing ∆wij

Given an error term δ_id (for node i on training case d), the update of w_ij for all nodes j that feed into node i is:

$$\Delta w_{ij} = -\eta\,\frac{\partial E_d}{\partial w_{ij}} = -\eta\,(-o_{jd}\,\delta_{id}) = \eta\,\delta_{id}\,o_{jd}$$

So given δ_id, you can easily calculate Δw_ij for all incoming arcs.
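Combining this with the earlier backward sweep, a sketch (mine) that applies Δw_ij = η·δ_id·o_jd to every layer at once via outer products, under the same assumed data layout:

```python
import numpy as np

def apply_updates(weights, activations, deltas, eta=0.1):
    """Δw_ij = η · δ_id · o_jd for all incoming arcs of every node."""
    for l in range(len(weights)):
        # deltas[l] belongs to layer l+1; activations[l] to layer l
        weights[l] += eta * np.outer(deltas[l], activations[l])
    return weights
```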

SLIDE 20

Learning XOR

[Figure: three XOR training runs; sum-squared error (0.0 to 1.2) vs. epoch (0 to 1000), each run converging along a different trajectory.]

- Epoch = all 4 entries of the XOR truth table.
- 2 (inputs) x 2 (hidden) x 1 (output) network.
- Random initialization of all weights in [-1, 1].
- XOR is not linearly separable, so it takes a while!
- Each run is different due to the random weight initialization.
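Putting the whole derivation to work, a compact runnable sketch (my code, not the author's) that trains a 2x2x1 sigmoid network on XOR with incremental updates, weights initialized in [-1, 1], and a bias input fixed at 1 (some runs may need more epochs or a different seed):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 0.0])
W1 = rng.uniform(-1, 1, (2, 3))   # hidden weights (last column: bias)
W2 = rng.uniform(-1, 1, 3)        # output weights (last entry: bias)
sig = lambda s: 1 / (1 + np.exp(-s))
eta = 0.5

for epoch in range(10000):        # one epoch = all 4 truth-table entries
    for d in rng.permutation(4):
        x = np.append(X[d], 1.0)              # bias node outputs 1
        h = np.append(sig(W1 @ x), 1.0)
        o = sig(W2 @ h)
        delta_o = (T[d] - o) * o * (1 - o)    # output δ
        delta_h = (W2[:2] * delta_o) * h[:2] * (1 - h[:2])  # hidden δ
        W2 += eta * delta_o * h
        W1 += eta * np.outer(delta_h, x)

for x in X:                       # outputs should approach 0, 1, 1, 0
    print(x, sig(W2 @ np.append(sig(W1 @ np.append(x, 1.0)), 1.0)))
```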

SLIDE 21

Learning to Classify Wines

Class | Properties
  1   | 14.23  1.71  2.43  15.6  127  2.8   ...
  1   | 13.2   1.78  2.14  11.2  100  2.65  ...
  2   | 13.11  1.01  1.7   15     78  2.98  ...
  3   | 13.17  2.59  2.37  20    120  1.65  ...
 ...

[Figure: network mapping the 13 wine properties through a hidden layer to a single wine-class output.]

SLIDE 22

Wine Runs

[Figure: four wine-classification training runs, sum-squared error vs. epoch (0 to 100): 13x5x1 (lrate = 0.3), 13x5x1 (lrate = 0.1), 13x10x1 (lrate = 0.3), and 13x25x1 (lrate = 0.3).]

SLIDE 23

Momentum: Combatting Local Minima

[Figure: the previous weight change Δw(t−1) adds momentum to the current change Δw(t).]

$$\Delta w_{ij}(t) = -\eta\,\frac{\partial E_i}{\partial w_{ij}} + \alpha\,\Delta w_{ij}(t-1)$$
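A minimal sketch (mine) of the momentum update; the α-weighted previous step is the only addition to plain gradient descent:

```python
def momentum_update(W, grad, prev_dW, eta=0.1, alpha=0.9):
    """Δw(t) = -η·∂E/∂w + α·Δw(t-1); returns updated weights and the step."""
    dW = -eta * grad + alpha * prev_dW
    return W + dW, dW
```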

SLIDE 24

Practical Tips

1. Only add as many hidden layers and hidden nodes as necessary. Too many → more weights to learn and an increased chance of over-specialization.

2. Scale all input values to the same range, typically [0, 1] or [-1, 1] (see the sketch after this list, which also covers tips 3 and 5).

3. Use target values of 0.1 (for 0) and 0.9 (for 1) to avoid the saturation effects of sigmoids.

4. Beware of tricky encodings of input (and decodings of output) values. Don't combine too much information into a single node's activation value (even though it's fun to try), since this can make proper weights difficult (or impossible) to learn.

5. For discrete (e.g., nominal) values, one (input or output) node per value is often most effective, e.g., car model and city of residence vs. income and education for assessing car-insurance risk.

6. All initial weights should be relatively small: [-0.1, 0.1].

7. Bias nodes can be helpful for complicated data sets.

8. Check that all your layer sizes, activation functions, activation ranges, weight ranges, learning rates, etc. make sense in terms of each other and your goals for the ANN. One improper choice can ruin the results.
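A sketch (mine) of tips 2, 3, and 5 as preprocessing helpers: scaling inputs to [0, 1], softening binary targets to 0.1/0.9, and one-hot encoding nominal values:

```python
import numpy as np

def scale_01(X):
    """Tip 2: rescale each input column to [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

def soften_targets(t):
    """Tip 3: map 0 -> 0.1 and 1 -> 0.9 to avoid sigmoid saturation."""
    return np.where(np.asarray(t) == 1, 0.9, 0.1)

def one_hot(values, categories):
    """Tip 5: one input/output node per discrete value."""
    return np.array([[1.0 if v == c else 0.0 for c in categories]
                     for v in values])
```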

SLIDE 25

Bias Nodes

[Figure: a bias node with constant output 1, connected by its own weights w to nodes in the input and output layers.]

- Constant output = 1.
- All outgoing weights are independent and modifiable by backprop.
- The negative of each such weight functions like a threshold for the downstream node.
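One common implementation (an assumption of mine, consistent with the XOR sketch earlier) treats the bias node as an extra input clamped to 1, so backprop trains its weight like any other:

```python
import numpy as np

def with_bias(x):
    """Append the bias node's constant output (1) to an input vector."""
    return np.append(x, 1.0)

# For node i: sum_i = w · x + w_bias · 1, so the node's net input is
# positive exactly when w · x exceeds -w_bias, i.e. -w_bias acts as a threshold.
```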

SLIDE 26

Supervised Learning in the Cerebellum

[Figure: cerebellar circuit. Mossy fibers carry sensory + cortical inputs (including an efference copy) to granular cells, whose parallel fibers contact Purkinje cells; Golgi cells provide feedback in the granular layer; climbing fibers from the inferior olive carry somatosensory (touch, pain, body position) + cortical inputs; Purkinje cells inhibit deep cerebellar neurons projecting to the cerebral cortex and spinal cord.]

- Granular cells detect contexts.
- Parallel fibers and Purkinje cells map contexts to actions.
- Climbing fibers from the inferior olive provide (supervisory) feedback signals for LTD on parallel-fiber-to-Purkinje synapses.
