Training Neural Networks with Local Error Signals - Arild Nøkland - PowerPoint PPT Presentation



SLIDE 1

Training Neural Networks with Local Error Signals

Arild Nøkland, Lars H. Eidnes

SLIDE 2

Local learning

  • Typically we train neural networks by backpropagating errors from the loss function back through the layers.
  • It is hard to explain how the brain could do this.
    • Backward locking, weight symmetry, and other problems.
  • There would be massive practical benefits if you could avoid this:
    • No need to keep activations in memory.
    • Easy parallelization: put each layer on its own GPU and train all layers at the same time.

SLIDE 3

Training each layer on its own works!

Results on more datasets later.

SLIDE 4

The approach

Train each layer with two sub-networks, each with its own loss function
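The per-layer setup can be sketched roughly as follows. This is a minimal numpy sketch, not the paper's implementation: it assumes fully connected layers (the paper uses convolutional sub-networks), and the class name, loss weighting `beta`, and all shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(p, y_onehot):
    return -np.mean(np.sum(y_onehot * np.log(p + 1e-12), axis=1))

def cosine_sim_matrix(x):
    x = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-8)
    return x @ x.T

class LocalLayer:
    """One hidden layer with its own two local heads: a prediction
    sub-network scored with cross entropy, and a similarity sub-network
    scored against the label similarity matrix."""

    def __init__(self, d_in, d_hidden, n_classes):
        self.W = rng.normal(0.0, 0.1, (d_in, d_hidden))            # layer weights
        self.W_pred = rng.normal(0.0, 0.1, (d_hidden, n_classes))  # pred head
        self.W_sim = rng.normal(0.0, 0.1, (d_hidden, d_hidden))    # sim head

    def forward(self, x):
        return np.maximum(x @ self.W, 0.0)  # ReLU

    def local_losses(self, x, y_onehot, beta=0.99):
        h = self.forward(x)
        pred_loss = cross_entropy(softmax(h @ self.W_pred), y_onehot)
        sim_loss = np.mean((cosine_sim_matrix(h @ self.W_sim)
                            - cosine_sim_matrix(y_onehot)) ** 2)
        # Weighted combination of the two losses; the weighting is an assumption.
        return (1.0 - beta) * pred_loss + beta * sim_loss

# Each layer passes its activation forward as if it were fresh input data;
# no error signal ever travels backwards between layers.
```

Stacking layers then just means feeding `layer.forward(x)` into the next `LocalLayer`; each layer's parameters are updated only from its own `local_losses`.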

SLIDE 5

Similarity matching loss

Intuition: Want things from the same class to have similar representations. Measure similarity with a matrix of cosine similarities.
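Concretely, the loss compares the cosine-similarity matrix of the activations with that of the one-hot labels. A numpy sketch (the function names are ours, not the paper's):

```python
import numpy as np

def cosine_similarity_matrix(x):
    """Pairwise cosine similarities between the rows of a batch."""
    x = x.reshape(x.shape[0], -1)  # flatten each example
    x = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-8)
    return x @ x.T

def similarity_matching_loss(activations, labels_onehot):
    """Squared error between the activation similarity matrix and the
    label similarity matrix. Label similarities are 1 for same-class
    pairs and 0 for different-class pairs, so same-class examples are
    pushed towards similar representations."""
    S_h = cosine_similarity_matrix(activations)
    S_y = cosine_similarity_matrix(labels_onehot)
    return np.mean((S_h - S_y) ** 2)
```

The loss is zero exactly when same-class activations are perfectly aligned and different-class activations are orthogonal.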

SLIDE 6

Results

SLIDE 7

Results

SLIDE 8

Results

SLIDE 9

Optimization vs generalization

  • Back-prop has the fastest and lowest drop in training error
  • Local learning is competitive with back-prop in terms of test error
  • Local learning is a good regularizer
  • But: both the pred- and sim-losses help optimization in a complementary way

SLIDE 10

Sim-loss + global backprop

SLIDE 11

Results, back-prop free version

  • Still have 1-step backprop. To remove it:
    • Remove the conv2d before the sim-loss
    • Use feedback alignment [Lillicrap et al., 2014] through the linear layer before the pred-loss
    • Also: use a random projection of the labels

SLIDE 12

Summary

  • We train each layer on its own, without global backprop
  • We use two loss functions:
    • Standard cross-entropy loss
    • A similarity matching loss
      • Squared error on similarity matrices
      • Wants similar activations for things of the same class
  • Works well on VGG-like networks

SLIDE 13

Intriguing questions

  • We’ve just prodded the space of local loss functions, and stumbled across something that helps a lot. Is there more to be found in this space?
  • Can we better understand how layers interact when they are trained on their own? I.e., why does this work?
  • Does something like this happen in the brain?