SLIDE 1

Regularization Effect of Large Initial Learning Rate

Yuanzhi Li* (Carnegie Mellon University), Colin Wei* (Stanford University), Tengyu Ma (Stanford University)

SLIDES 2–5

Large Initial Learning Rate is Crucial for Generalization

  • Common schedule: large initial learning rate + annealing (schedule sketch below)
  • … but a small learning rate gives better train and test performance up until annealing
  • Large LR outperforms small LR after annealing!

[Figure: train accuracy and validation accuracy curves for the two schedules; the annealing point is marked.]
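The schedule in the bullets above can be written down concretely. Below is a minimal PyTorch-style sketch comparing "large initial LR + annealing" against a small constant LR; the 0.1 and 0.01 rates, the epoch-30 annealing point, and the dummy linear model are illustrative placeholders, not the paper's experimental settings.

    # Sketch: "large initial LR + annealing" vs. a small constant LR.
    import torch
    import torch.nn as nn

    def make_optimizer_and_scheduler(model, large_init=True):
        if large_init:
            # Large initial LR, annealed (divided by 10) at the milestone epoch.
            opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
            sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[30], gamma=0.1)
        else:
            # Small constant LR: better train/val accuracy until the other run anneals.
            opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
            sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[], gamma=0.1)
        return opt, sched

    # Toy usage with a dummy linear model and random data.
    model = nn.Linear(32, 10)
    opt, sched = make_optimizer_and_scheduler(model, large_init=True)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(60):
        x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()  # steps once per epoch; drops lr from 0.1 to 0.01 at epoch 30
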

SLIDES 6–15

LR schedule changes order of learning patterns ⇒ generalization

  • Small LR quickly memorizes hard-to-fit “class signatures”
  • Ignores other patterns, harming generalization
  • Large initial LR + annealing learns easy-to-fit patterns first
  • Only memorizes hard-to-fit patterns after annealing
  • ⇒ learns to use all patterns, helping generalization!
  • Intuition: larger LR ⇒ larger noise in activations ⇒ effectively weaker representational power ⇒ won’t overfit to “signatures” (see the update decomposition below)
  • Non-convexity is crucial: different LR schedules find different solutions
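
To make the noise intuition concrete, here is one standard way to write the SGD step in LaTeX; this is an illustrative decomposition, not the paper's formal analysis.

    % Minibatch SGD step, split into the full-batch gradient plus minibatch noise.
    % \eta is the learning rate, B a random minibatch, L the full-batch loss.
    \[
      w_{t+1} \;=\; w_t - \eta\,\nabla \hat{L}_B(w_t)
              \;=\; w_t - \eta\,\nabla L(w_t)
                    \;-\; \eta\,\underbrace{\big(\nabla \hat{L}_B(w_t) - \nabla L(w_t)\big)}_{\text{minibatch noise } \xi_t}
    \]
    % The perturbation \eta\,\xi_t scales linearly with \eta: a large initial rate keeps
    % activations noisy and prevents fitting fine-grained "class signatures"; after
    % annealing the noise shrinks and those patterns can be memorized.
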
SLIDES 16–21

Demonstration on Modified CIFAR10

Group 1: 20% of examples with hard-to-generalize, easy-to-fit patterns (the original image only)
Group 2: 20% of examples with easy-to-generalize, hard-to-fit patterns (a hard-to-fit patch indicating the class)
Group 3: 60% of examples with both patterns (a construction sketch follows the bullets)

  • Small LR memorizes the patch, ignores the rest of the image
  • ⇒ learns image features only from the 20% of examples without the patch (Group 1)
  • Large initial LR initially ignores the patch, only learns it after annealing
  • ⇒ learns image features from the 80% of examples that contain the image (Groups 1 and 3)

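
A minimal sketch, assuming CIFAR-10-style arrays, of how the three groups above could be assembled; the patch size, placement, and label encoding are illustrative placeholders, not the paper's exact construction.

    # Sketch: split CIFAR-10-style data into the three groups described above.
    import numpy as np

    def add_class_patch(img, label, size=4):
        """Stamp a small corner patch whose value encodes the label (placeholder encoding)."""
        img = img.copy()
        img[:size, :size, :] = (label + 1) / 10.0
        return img

    def build_modified_dataset(images, labels, seed=0):
        """images: (N, 32, 32, 3) floats in [0, 1]; labels: (N,) ints in 0..9."""
        rng = np.random.default_rng(seed)
        n = len(images)
        idx = rng.permutation(n)
        g1, g2, g3 = idx[: n // 5], idx[n // 5 : 2 * n // 5], idx[2 * n // 5 :]
        out = images.copy()
        # Group 1 (20%): original image only, left untouched.
        # Group 2 (20%): patch only; blank the image and keep just the class patch.
        for i in g2:
            out[i] = add_class_patch(np.zeros_like(images[i]), labels[i])
        # Group 3 (60%): both the original image and the class patch.
        for i in g3:
            out[i] = add_class_patch(images[i], labels[i])
        return out, labels
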
SLIDES 22–24

Theoretical Setting

Group 1: 20% of examples with hard-to-generalize, easy-to-fit patterns (linearly classifiable patterns)
Group 2: 20% of examples with easy-to-generalize, hard-to-fit patterns (clustered but not linearly separable)
Group 3: 60% of examples containing both patterns
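
A minimal sketch of synthetic data in the spirit of this setting; the dimension, margin, XOR-style cluster layout, and the way groups are encoded (zeroing the absent component) are illustrative assumptions, not the paper's exact distribution.

    # Sketch: two-component examples x = (x_p, x_q) with label y in {-1, +1}.
    import numpy as np

    def sample_two_pattern_data(n, d=20, seed=0):
        """x_p: linearly classifiable (hard to generalize, easy to fit).
        x_q: clustered but not linearly separable (easy to generalize, hard to fit)."""
        rng = np.random.default_rng(seed)
        y = rng.choice([-1, 1], size=n)

        # Linearly classifiable component: exactly separable along a fixed direction w.
        w = rng.standard_normal(d)
        w /= np.linalg.norm(w)
        noise = rng.standard_normal((n, d))
        noise -= (noise @ w)[:, None] * w              # remove the w-direction from the noise
        margin = 0.1 + rng.random(n)                   # small positive per-example margin
        x_p = noise + (margin * y)[:, None] * w

        # Clustered, not linearly separable component: XOR-style cluster centers.
        centers_pos = np.array([[3.0, 3.0], [-3.0, -3.0]])   # clusters for y = +1
        centers_neg = np.array([[3.0, -3.0], [-3.0, 3.0]])   # clusters for y = -1
        pick = rng.integers(0, 2, size=n)
        x_q = np.where((y == 1)[:, None], centers_pos[pick], centers_neg[pick])
        x_q = x_q + 0.1 * rng.standard_normal((n, 2))

        # Group assignment: 20% keep only x_p, 20% keep only x_q, 60% keep both.
        u = rng.random(n)
        x_q[u < 0.2] = 0.0                             # Group 1: pattern p only
        x_p[(u >= 0.2) & (u < 0.4)] = 0.0              # Group 2: pattern q only
        return np.hstack([x_p, x_q]), y
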

SLIDES 25–29

Conclusion

  • Small LR optimizes faster, but generalizes worse than large initial LR + annealing
  • Explanation: order of learning pattern types
  • Easy-to-generalize, hard-to-fit patterns
  • Hard-to-generalize, easy-to-fit patterns
  • SGD noise from a large LR is the mechanism for regularization

Come find our poster: 10:45 AM -- 12:45 PM @ East Exhibition Hall B + C #144!