
Error-correcting learning: Delta rule

Important distinction (and notation):

- t_j: target of unit j; the (correct) activation specified by the environment (training example)
- a_j: activation of unit j that results from actually running the network

Note: in the Hebb rule, a_j was specified and so would now be called t_j.

Hebb rule: Δw_{ij} = ε t_j a_i (where t_j is the activation "clamped" on the output unit)

Delta rule: change weights so as to reduce the difference between the actual output (a_j) and the target output (t_j) ("delta" = difference between target and activation):

Δw_{ij} = ε (t_j − a_j) a_i

Similar to correlation, but with error: weight changes focus on predictive differences.

Hebbian/correlational learning depends on predictive similarities
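A minimal sketch of the two rules in NumPy (the function names and the single linear output unit are my framing, not from the slides):

```python
import numpy as np

def hebb_update(w, a_in, t, eps=0.1):
    # Hebb rule: weight change is the product of the clamped
    # output activation t_j and the input activation a_i.
    return w + eps * t * a_in

def delta_update(w, a_in, t, eps=0.1):
    # Delta rule: same form, but scaled by the error (t_j - a_j)
    # between the target and the activation the network produced.
    a_out = w @ a_in              # linear output unit: a_j = sum_i a_i w_ij
    return w + eps * (t - a_out) * a_in
```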

1 / 11

Learning on orthogonal patterns (one pass): Delta = Hebb

Delta rule: Δw_{ij} = ε (t_j − a_j) a_i (assume linear units: a_j = n_j)

Note: Delta = Hebb if a_j = 0.

For the first pattern p1, w_{ij} = 0, so a_j^{[p1]} = n_j^{[p1]} = 0, and

Δw_{ij} (= w_{ij}) = ε (t_j^{[p1]} − 0) a_i^{[p1]} = ε t_j^{[p1]} a_i^{[p1]}

which is the Hebb rule with the target as the output activation.

For p2:

a_j^{[p2]} = Σ_i a_i^{[p2]} w_{ij} = Σ_i a_i^{[p2]} ε t_j^{[p1]} a_i^{[p1]} = ε t_j^{[p1]} Σ_i a_i^{[p2]} a_i^{[p1]}

where Σ_i a_i^{[p2]} a_i^{[p1]} is the dot product of p1 and p2. Since p1 and p2 are orthogonal, this sum is 0, so a_j^{[p2]} = 0. Thus

Δw_{ij} = ε t_j^{[p2]} a_i^{[p2]}    and    w_{ij} = ε (t_j^{[p1]} a_i^{[p1]} + t_j^{[p2]} a_i^{[p2]})

the Hebb rule again.

In fact, a_j^{[p]} = 0 for the first presentation of each training pattern p, so at the end of one sweep through all the patterns:

w_{ij} = ε Σ_p (t_j^{[p]} − a_j^{[p]}) a_i^{[p]} = ε Σ_p t_j^{[p]} a_i^{[p]}

This is just Hebbian learning using the targets t_j as output activations (a_j). Note that the Delta rule is inherently multi-pass (a_j ≠ 0 on subsequent presentations): weight changes caused by one pattern affect the error on others.
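A quick numerical check of this argument (a sketch; the two patterns are made-up values chosen to be orthogonal):

```python
import numpy as np

eps = 1.0
patterns = [np.array([1., 1., -1., -1.]),   # p1
            np.array([1., -1., 1., -1.])]   # p2: orthogonal to p1
targets = [1.0, -1.0]

w = np.zeros(4)
for a_in, t in zip(patterns, targets):
    a_out = w @ a_in                 # 0 on the first presentation of each pattern
    w += eps * (t - a_out) * a_in    # Delta rule

# Hebbian weights using targets as output activations
w_hebb = eps * sum(t * a_in for a_in, t in zip(patterns, targets))
print(np.allclose(w, w_hebb))        # True: one Delta pass == Hebb
```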

2 / 11

Effects of training on response to input patterns

Calculated in terms of changes to activations for pattern p′ caused by training on a single pattern p:

Δa_j^{[p′]} = Σ_i a_i^{[p′]} Δw_{ij} = Σ_i a_i^{[p′]} ε (t_j^{[p]} − a_j^{[p]}) a_i^{[p]}
            = ε (t_j^{[p]} − a_j^{[p]}) Σ_i a_i^{[p′]} a_i^{[p]}
            = ε (t_j^{[p]} − a_j^{[p]}) dp(p′, p)

where dp(p′, p) is the dot product of p′ and p.

- If p and p′ are orthogonal, training on p will have no effect on p′.
- If p and p′ are not orthogonal, training on p will affect performance on p′ (weighted by similarity), which may be good (generalization) or bad (interference).
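The same relationship in code (a sketch; the helper name delta_a and the example values are mine):

```python
import numpy as np

def delta_a(a_pprime, a_p, t_p, w, eps=0.1):
    # Change in unit j's activation on p' caused by one Delta-rule
    # update on p: eps * (t_j - a_j) * dp(p', p)
    err = t_p - w @ a_p
    return eps * err * (a_pprime @ a_p)

w = np.array([0.2, -0.1, 0.4, 0.0])
p = np.array([1., 1., -1., -1.])
p_orth = np.array([1., -1., 1., -1.])    # orthogonal to p
print(delta_a(p_orth, p, t_p=1.0, w=w))  # 0.0: training on p leaves p' untouched
```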

3 / 11

Delta rule as gradient descent in error (linear units)

a_j = Σ_i a_i w_{ij}        Error E = (1/2) Σ_j (t_j − a_j)²

(Lens does not include the 1/2.)

Computation chain: a_i → a_j (via w_{ij}) → E (compared with target t_j)

Gradient descent: Δw_{ij} = −ε ∂E/∂w_{ij}

∂E/∂w_{ij} = (∂E/∂a_j)(∂a_j/∂w_{ij})   (chain rule)
           = −(t_j − a_j) a_i

(Lens has an extra factor of 2.)

Δw_{ij} = −ε ∂E/∂w_{ij} = ε (t_j − a_j) a_i = Delta rule
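A finite-difference check that the Delta rule is exactly this gradient for a linear unit (a sketch with made-up values):

```python
import numpy as np

a_in, t, h = np.array([0.5, -1.0, 2.0]), 1.5, 1e-6
w = np.array([0.1, 0.2, -0.3])

def E(w):
    a_out = w @ a_in                    # linear unit
    return 0.5 * (t - a_out) ** 2

# Analytic gradient from the chain rule: dE/dw_ij = -(t_j - a_j) * a_i
grad = -(t - w @ a_in) * a_in

# Numerical gradient by central differences
num = np.array([(E(w + h * np.eye(3)[i]) - E(w - h * np.eye(3)[i])) / (2 * h)
                for i in range(3)])
print(np.allclose(grad, num))           # True
```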

4 / 11


Delta rule as gradient descent in error (sigmoid units)

n_j = Σ_i a_i w_{ij}        a_j = 1 / (1 + exp(−n_j))        Error E = (1/2) Σ_j (t_j − a_j)²

Computation chain: a_i → n_j (via w_{ij}) → a_j (sigmoid) → E (compared with target t_j)

Gradient descent: Δw_{ij} = −ε ∂E/∂w_{ij}

∂E/∂w_{ij} = (∂E/∂a_j)(da_j/dn_j)(∂n_j/∂w_{ij})
           = −(t_j − a_j) a_j (1 − a_j) a_i

Δw_{ij} = −ε ∂E/∂w_{ij} = ε (t_j − a_j) a_j (1 − a_j) a_i
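The same check for a sigmoid unit; the extra a_j (1 − a_j) factor is the sigmoid's own derivative (again a sketch with made-up values):

```python
import numpy as np

a_in, t, h = np.array([0.5, -1.0, 2.0]), 1.0, 1e-6
w = np.array([0.1, 0.2, -0.3])
sig = lambda n: 1.0 / (1.0 + np.exp(-n))

def E(w):
    return 0.5 * (t - sig(w @ a_in)) ** 2

a_out = sig(w @ a_in)
# Chain rule: dE/dw_ij = -(t_j - a_j) * a_j(1 - a_j) * a_i
grad = -(t - a_out) * a_out * (1 - a_out) * a_in
num = np.array([(E(w + h * np.eye(3)[i]) - E(w - h * np.eye(3)[i])) / (2 * h)
                for i in range(3)])
print(np.allclose(grad, num))   # True
```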

5 / 11

When does the Delta rule succeed or fail?

- The Delta rule is optimal: it will find a set of weights that produces zero error if such a set exists.
- Need to distinguish "succeed" = zero error from "succeed" = correct binary classification.
- Guaranteed to succeed (zero error) if the input patterns are linearly independent (LI); see the rank check below.
  - No pattern can be created by recombining scaled versions of the others (i.e., there is something unique about each pattern; cf. Hebb: no similarity).
  - Orthogonal patterns are linearly independent (LI is a weaker constraint).
  - Linearly independent patterns can be similar as long as other aspects are unique.
- Succeed at binary classification of outputs: linear separability.
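Linear independence can be checked with a rank computation; a minimal sketch (the example patterns are mine, chosen to be similar but not redundant):

```python
import numpy as np

P = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.]])    # patterns as rows: pairwise similar, none redundant

# Patterns are linearly independent iff the matrix has full row rank
print(np.linalg.matrix_rank(P) == P.shape[0])   # True -> zero error is attainable
```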

6 / 11

Linear separability

- The Delta rule is guaranteed to succeed at binary classification if the task is linearly separable.
- The weights define a plane (a line for two input units) through input (state) space on which n_j = 0.
- It must be possible to position this plane so that all patterns requiring n_j < 0 fall on one side and all patterns requiring n_j > 0 fall on the other.
- Linear separability is a property of the relationship between input and target patterns.
- AND and OR are linearly separable, but XOR is not (see the sketch below).

n_j = a_1 w_1 + a_2 w_2 + b_j = 0   ⟹   a_2 = −(w_1/w_2) a_1 − b_j/w_2   (of the form y = ax + b)
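A sketch contrasting the two cases: the same Delta-rule training loop, with outputs thresholded at 0.5, classifies AND correctly but can never get all four XOR cases right (the setup and learning-rate values are my choices):

```python
import numpy as np

def delta_classify(T, epochs=2000, eps=0.1):
    # Linear unit with a bias weight, trained by the Delta rule;
    # outputs are thresholded at 0.5 for classification.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
    Xb = np.hstack([X, np.ones((4, 1))])        # bias as an always-on input
    w = np.zeros(3)
    for _ in range(epochs):
        for x, t in zip(Xb, T):
            w += eps * (t - w @ x) * x          # Delta rule
    return ((Xb @ w) > 0.5).astype(int)

print(delta_classify(np.array([0., 0., 0., 1.])))  # AND -> [0 0 0 1]: separable
print(delta_classify(np.array([0., 1., 1., 0.])))  # XOR -> some cases wrong: not separable
```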

7 / 11

XOR

n_j = a_1 w_1 + a_2 w_2 + b_j = 0   ⟹   a_2 = −(w_1/w_2) a_1 − b_j/w_2   (of the form y = ax + b)

[Figure: the four XOR patterns plotted in the (a_1, a_2) plane; no single line of this form separates the patterns with output 0 from those with output 1.]

8 / 11


XOR with extra dimension

The XOR task can be converted to one that is linearly separable by adding a new "input," which corresponds to a third dimension in state space. The task is no longer XOR:

Input 1   Input 2   New input   Output
   0         0          0          0
   0         1          0          1
   1         0          0          1
   1         1          1          0
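Assuming the table above (the third input is on only when both inputs are on), a single linear unit trained with the Delta rule now succeeds; a sketch:

```python
import numpy as np

X = np.array([[0, 0, 0], [0, 1, 0], [1, 0, 0], [1, 1, 1]], float)  # third input = a1 AND a2
T = np.array([0., 1., 1., 0.])
Xb = np.hstack([X, np.ones((4, 1))])   # bias as an always-on input

w = np.zeros(4)
for _ in range(5000):
    for x, t in zip(Xb, T):
        w += 0.1 * (t - w @ x) * x     # Delta rule, linear unit

print(((Xb @ w) > 0.5).astype(int))    # [0 1 1 0]: the augmented task is separable
```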

9 / 11

XOR with intermediate (“hidden”) units

- Intermediate units can re-represent the input patterns as new patterns with altered similarities.
- Targets that are not linearly separable in the input space can be linearly separable in the intermediate representational space (see the sketch below).
- Intermediate units are called "hidden" because their activations are not determined directly by the training environment (inputs and targets).
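A sketch of the re-representation idea with hand-wired (not learned) hidden units computing OR and AND of the inputs; in that hidden space, XOR becomes linearly separable:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)

# Hand-wired hidden layer: h1 = OR(a1, a2), h2 = AND(a1, a2)
H = np.stack([(X.sum(1) >= 1).astype(float),
              (X.sum(1) >= 2).astype(float)], axis=1)

# In hidden space the XOR targets [0, 1, 1, 0] ARE linearly separable:
# output = 1 iff h1 - h2 > 0.5 (i.e., "OR but not AND")
out = (H @ np.array([1., -1.]) > 0.5).astype(int)
print(out)   # [0 1 1 0]
```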

10 / 11