SLIDE 14
Back-propagation principle
A smart method for efficiently computing the gradient (w.r.t. the weights) of a Neural Network cost function, based on the chain rule of differentiation.
The cost function is Q(t) = Σ_m loss(Y_m, D_m), where m runs over the training-set examples.
Usually, loss(Y_m, D_m) = ||Y_m - D_m||²  [quadratic error]
Total gradient:      W(t+1) = W(t) - λ(t) ∇_W Q(t) + μ(t) (W(t) - W(t-1))
Stochastic gradient: W(t+1) = W(t) - λ(t) ∇_W Q_m(t) + μ(t) (W(t) - W(t-1))
where Q_m = loss(Y_m, D_m) is the error computed on only ONE example, randomly drawn from the training set at every iteration, λ(t) is the learning rate (fixed, decreasing or adaptive), and μ(t) is the momentum.
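A minimal sketch (NumPy; the function name sgd_momentum_step and variable names are illustrative, not from the slides) of the stochastic-gradient update with momentum written above:

import numpy as np

def sgd_momentum_step(W, W_prev, grad_Qm, lr, mom):
    # W(t+1) = W(t) - λ(t) * ∇_W Q_m(t) + μ(t) * (W(t) - W(t-1))
    W_next = W - lr * grad_Qm + mom * (W - W_prev)
    # return W(t+1) and W(t); the latter becomes W(t-1) at the next iteration
    return W_next, W

Here grad_Qm is the gradient of the loss on ONE randomly drawn training example; computing it is exactly the backpropagation step described next.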
Now, how to compute ∂Q_m/∂W_ij ?
Backprop through fully-connected layers: use of the chain rule for derivative computation
[Network diagram: neuron i (output y_i) connects to neuron j through weight w_ij; neuron j computes s_j and y_j = f(s_j), and connects to downstream neurons k (inputs s_k) through weights w_jk]
∂E_m/∂W_ij = (∂E_m/∂s_j)(∂s_j/∂W_ij) = (∂E_m/∂s_j) y_i
Let δ_j = ∂E_m/∂s_j.  Then W_ij(t+1) = W_ij(t) - λ(t) y_i δ_j
If neuron j is an output: δ_j = (∂E_m/∂s_j) = (∂E_m/∂y_j)(∂y_j/∂s_j), and with E_m = ||Y_m - D_m||², δ_j = 2 (y_j - D_j) f'(s_j)
Otherwise: δ_j = (∂E_m/∂s_j) = Σ_k (∂E_m/∂s_k)(∂s_k/∂s_j) = Σ_k δ_k (∂s_k/∂s_j) = Σ_k δ_k W_jk (∂y_j/∂s_j), so δ_j = (Σ_k W_jk δ_k) f'(s_j) if neuron j is "hidden"
(and W_0j(t+1) = W_0j(t) - λ(t) δ_j)
⇒ all the δ_j can be computed successively, from the last layer to upstream layers, by "error backpropagation" from the output
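To make the recursion concrete, here is a minimal sketch (NumPy; the one-hidden-layer architecture, tanh activation, omitted biases and all variable names are assumptions, not from the slides) of one stochastic-gradient backprop step with quadratic loss, applying exactly the δ formulas above:

import numpy as np

def f(s):                       # activation function
    return np.tanh(s)

def f_prime(s):                 # its derivative f'(s)
    return 1.0 - np.tanh(s) ** 2

def backprop_step(x, d, W1, W2, lr=0.01):
    # x: input (n_in,), d: target (n_out,), W1: (n_in, n_hidden), W2: (n_hidden, n_out)
    # Forward pass: s_j = Σ_i W_ij y_i ; y_j = f(s_j)   (biases omitted for brevity)
    s1 = x @ W1;  y1 = f(s1)                 # hidden layer
    s2 = y1 @ W2; y2 = f(s2)                 # output layer
    # Output neurons: δ_j = 2 (y_j - D_j) f'(s_j)
    delta2 = 2.0 * (y2 - d) * f_prime(s2)
    # Hidden neurons: δ_j = (Σ_k W_jk δ_k) f'(s_j)   -- error backpropagated from output
    delta1 = (W2 @ delta2) * f_prime(s1)
    # Weight updates: W_ij(t+1) = W_ij(t) - λ(t) y_i δ_j
    W2 -= lr * np.outer(y1, delta2)
    W1 -= lr * np.outer(x, delta1)
    return W1, W2

The same pattern repeats for deeper networks: each layer's δ vector is obtained from the next layer's δ through the transposed weight matrix, times f' of that layer's pre-activations.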