Musings on Continual Learning - Pulkit Agrawal - PowerPoint PPT Presentation



SLIDE 1

Musings on Continual Learning

Pulkit Agrawal

SLIDE 2

SLIDE 3

What is a zebra?

SLIDE 4

What is a zebra?

SLIDE 5

Success in Reinforcement Learning

ATARI Games 21 million games! ~10-50 million interactions!

Simulation, Closed World, Known Model

SLIDE 6

Impressive Specialists

SLIDE 7

Today's AI: Task Specific
AI we want: Generalists (???)

SLIDE 8

Learn to perform N tasks → solve the (N+1)th, more complex task
Core characteristic: reuse past knowledge to solve new tasks faster

SLIDE 9

Success on Imagenet

SLIDE 10

Knowledge for classification: training on N tasks → object classification knowledge

SLIDE 11

Knowledge for classification: training on N tasks → object classification knowledge

SLIDE 12

Orange? Apple? Reuse knowledge by fine-tuning

SLIDE 13

Imagenet: 1000 examples/class
New task: ~100 examples/class
Apple? Orange?

SLIDE 14

Still need hundreds of “labelled” data points! Fine-tuning with very few data points won’t be effective!

SLIDE 15

Problem Setup

Training set: Apple, Orange

SLIDE 16

Problem Setup

Training set: Apple, Orange; Test: Apple? Orange?

SLIDE 17

Training set: Apple, Orange; Test: Apple? Orange?

Use Nearest Neighbors

SLIDE 18

Training set: Apple, Orange; Test: Apple? Orange?

Use Nearest Neighbors

SLIDE 19

Training set: Apple, Orange; Test: Apple? Orange?

What does the performance depend on?

SLIDE 20

Training set: Apple, Orange; Test: Apple? Orange?

What does the performance depend on?

Features might not be optimized for matching!
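The matching step above can be sketched in a few lines: classify a test image by the label of its closest training example in feature space. A minimal pure-Python sketch, with made-up 3-D feature vectors standing in for network embeddings:

```python
# Nearest-neighbor classification in a fixed feature space.
# The feature vectors below are hypothetical stand-ins for network embeddings.
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(query, train_set):
    """Return the label of the closest training example to the query."""
    return min(train_set, key=lambda ex: euclidean(query, ex[0]))[1]

train = [([1.0, 0.1, 0.0], "apple"),
         ([0.9, 0.2, 0.1], "apple"),
         ([0.1, 1.0, 0.8], "orange")]

print(nearest_neighbor([0.8, 0.15, 0.05], train))  # apple
```

As the slide notes, the whole scheme stands or falls on whether the feature space puts same-class images close together, which is exactly what metric learning optimizes for.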

SLIDE 21

Metric Learning via Siamese Networks*

(*Hadsell et al. 2006)

Instead of one v/s all classification

SLIDE 22

Metric Learning via Siamese Networks*

(*Hadsell et al. 2006)

SLIDE 23

Metric Learning via Siamese Networks*

(*Hadsell et al. 2006)


Same class: Output = 1

SLIDE 24

Metric Learning via Siamese Networks*

(*Hadsell et al. 2006)


Same class: Output = 1

SLIDE 25

Metric Learning via Siamese Networks*

(*Hadsell et al. 2006)

Same class: Output = 1
Different class: Output = 0

SLIDE 26

Metric Learning via Siamese Networks*

(*Hadsell et al. 2006)

Same class: Output = 1
Different class: Output = 0
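The training signal behind this setup can be written out concretely. A minimal sketch of the contrastive loss from the cited Hadsell et al. 2006 paper: pull same-class pairs together, and push different-class pairs at least a margin apart. The embeddings here are hypothetical stand-ins for the twin network's outputs:

```python
# Contrastive loss (Hadsell et al. 2006) for Siamese metric learning.
# Same-class pairs (label 1) are penalized for being far apart;
# different-class pairs (label 0) are penalized only if closer than `margin`.
import math

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(emb_a, emb_b)))
    if same:
        return d ** 2                      # same class: loss grows with distance
    return max(0.0, margin - d) ** 2       # different class: only if too close

# A same-class pair sitting one unit apart incurs a loss of 1.0 ...
print(contrastive_loss([0.0, 0.0], [1.0, 0.0], same=True))   # 1.0
# ... while a different-class pair beyond the margin incurs none.
print(contrastive_loss([0.0, 0.0], [2.0, 0.0], same=False))  # 0.0
```

Minimizing this over many pairs is what shapes the feature space so that simple distance-based matching (the 0.1 / 0.8 scores on the following slides) becomes reliable.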

SLIDE 27

Solving using Siamese Network

Training set: Apple, Orange; Test: Apple? Orange?

SLIDE 28

Training set: Apple, Orange

Solving using Siamese Network

Siamese Net 0.1

SLIDE 29

Training set: Apple, Orange

Solving using Siamese Network

Siamese Net 0.1 Siamese Net 0.8

SLIDE 30

Training set: Apple, Orange

Solving using Siamese Network

Siamese Net 0.1 Siamese Net 0.8
Also look at Matching Networks, Vinyals et al. 2016

SLIDE 31

Another perspective

θ: parameters after training on, say, Imagenet

SLIDE 32

Another perspective

θ: parameters after training on, say, Imagenet
Task 1: Apple v/s Orange

SLIDE 33

Another perspective

θ: parameters after training on, say, Imagenet
Task 1: Apple v/s Orange (fine-tuning)

SLIDE 34

Another perspective

θ: parameters after training on, say, Imagenet
Task 1: Apple v/s Orange; Task 2: Dog v/s Cat

SLIDE 35

Another perspective

θ: parameters after training on, say, Imagenet
Task 1: Apple v/s Orange; Task 2: Dog v/s Cat

SLIDE 36

Another perspective

Amount of fine-tuning:
Task 1: Apple v/s Orange; Task 2: Dog v/s Cat

SLIDE 37

What if?

Amount of fine-tuning:
Task 1: Apple v/s Orange; Task 2: Dog v/s Cat
Fine-tuning would be faster! Can we optimize to make fine-tuning easier?

SLIDE 38

How to do it?

Task1: Apple v/s Orange

Hariharan et al. 2016, Finn et al. 2017

SLIDE 39

How to do it?

Task1: Apple v/s Orange

Hariharan et al. 2016, Finn et al. 2017

SLIDE 40

How to do it?

Task1: Apple v/s Orange

Hariharan et al. 2016, Finn et al. 2017

(i.e. train for fast fine-tuning!)
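"Train for fast fine-tuning" can be made concrete with a first-order MAML-style sketch in the spirit of the cited Finn et al. 2017. The toy problem, step sizes, and targets below are all made up for illustration: each "task" is fitting a scalar target a with loss (θ - a)², and the meta-update moves the initialization so that one inner gradient step adapts well. Real MAML differentiates through the inner update; this first-order variant deliberately skips that.

```python
# First-order MAML-style sketch on a toy scalar problem (illustrative only).
# Each task k has loss L_k(theta) = (theta - a_k)^2.

def inner_step(theta, a, alpha=0.1):
    """One gradient step of fine-tuning on a single task."""
    return theta - alpha * 2 * (theta - a)

def meta_train(tasks, theta=0.0, beta=0.05, steps=2000):
    """Optimize the initialization so post-adaptation loss is small."""
    for t in range(steps):
        a = tasks[t % len(tasks)]
        adapted = inner_step(theta, a)
        # first-order meta-update: gradient of the post-adaptation loss,
        # taken at the adapted parameters (ignores the inner-step Jacobian)
        theta -= beta * 2 * (adapted - a)
    return theta

tasks = [2.0, 4.0]            # two hypothetical tasks
theta0 = meta_train(tasks)    # initialization trained for fast fine-tuning
print(theta0)                 # ends up near 3.0, midway between the tasks
```

From this initialization, a single inner step gets close to either task's optimum, which is the whole point of optimizing for fine-tuning rather than for any one task.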

SLIDE 41

Generalizing to N tasks

Task1: Apple v/s Orange

Hariharan et al. 2016, Finn et al. 2017

SLIDE 42

More Details

Task1: Apple v/s Orange

Hariharan et al. 2016, Finn et al. 2017

Low Shot Visual Recognition, Hariharan et al. 2016
Model Agnostic Meta-learning, Finn et al. 2017

SLIDE 43

Until Now

Fine-tuning
Nearest Neighbor Matching
Siamese Network based Metric Learning
Meta-Learning: training for fine-tuning
Better Features → Better Transfer!

SLIDE 44

In practice, how good are these features?

Dog from Imagenet: Accuracy ~80%
Dog: Accuracy ~20%

SLIDE 45

Consider the task of identifying cars …

Positives Negatives

SLIDE 46

???

Testing the model

SLIDE 47

Learning Spurious Correlations

Unbiased look at Dataset bias, Torralba et al. 2011

SLIDE 48

More parameters in the network → more chances of learning spurious correlations!
Maybe this problem will be avoided if we first learn simple tasks and then more complex ones?

SLIDE 49

Fine-tuning

Sequential/Continual Task Learning

Poor performance on Task-1 !!!

Catastrophic Forgetting!!!

SLIDE 50

Catastrophic forgetting in closely related tasks

Training on rotating MNIST Test

High Accuracy Low Accuracy

SLIDE 51

In machine learning, we generally assume IID* data

*IID: Independently and Identically Distributed

Sample batches of data!

Each batch: uniform distribution of rotations

SLIDE 52

In machine learning, we generally assume IID* data

*IID: Independently and Identically Distributed

Sample batches of data!

Each batch: uniform distribution of rotations

In the real world, data is often not batched :)
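The contrast between IID batches and a real-world stream can be sketched directly. Assuming a rotating-MNIST-style setup where each example carries one of a few rotation angles (the angles and counts below are made up), an IID sampler draws from the whole dataset, while a sequential stream delivers one rotation at a time:

```python
# IID batching vs. a sequential stream (illustrative rotating-data setup).
import random

random.seed(0)
rotations = [0, 90, 180, 270]
# sequential stream: all 0-degree examples first, then all 90-degree, ...
stream = [r for r in rotations for _ in range(100)]

def iid_batch(data, size=8):
    """Sample a batch uniformly over the whole dataset (the IID assumption)."""
    return random.sample(data, size)

sequential_batch = stream[:8]           # what a real-world stream hands you
print(sorted(set(iid_batch(stream))))   # typically a mix of rotations
print(sorted(set(sequential_batch)))    # a single rotation: [0]
```

Training on the sequential stream is exactly the regime where the rotating-MNIST experiment on the previous slide shows catastrophic forgetting, even though the same data shuffled into IID batches trains fine.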

SLIDE 53

Continual learning is natural …

SLIDE 54

In the context of reinforcement learning

SLIDE 55

SLIDE 56

Investigating Human Priors for Playing Video Games, Rachit Dubey, Pulkit Agrawal, Deepak Pathak, Alyosha Efros, Tom Griffiths (ICML 2018)

SLIDE 57

Humans make use of prior knowledge for exploration

Investigating Human Priors for Playing Video Games, Dubey R., Agrawal P., Pathak D., Efros A., Griffiths T. (ICML 2018)

SLIDE 58

Humans make use of prior knowledge for exploration

Investigating Human Priors for Playing Video Games, Dubey R., Agrawal P., Pathak D., Efros A., Griffiths T. (ICML 2018)

SLIDE 59

What about Reinforcement Learning Agents?

SLIDE 60

In a simpler version of the game ..

Investigating Human Priors for Playing Video Games, Dubey R., Agrawal P., Pathak D., Efros A., Griffiths T. (ICML 2018)

SLIDE 61

For RL agents, both games are the same!

Investigating Human Priors for Playing Video Games, Dubey R., Agrawal P., Pathak D., Efros A., Griffiths T. (ICML 2018)

SLIDE 62

Equip Reinforcement Learning Agents with prior knowledge?

SLIDE 63

Hand-design

Common-Sense/Prior Knowledge

SLIDE 64

Hand-design Learn from Experience

Common-Sense/Prior Knowledge

Transfer in Reinforcement Learning → very limited success
A good solution to continual learning is required!

SLIDE 65

How to deal with catastrophic forgetting?

Just remember the weights for each task!

SLIDE 66

Progressive Networks (Rusu et al. 2016)

SLIDE 67

Can we do something smarter than storing all the weights?

SLIDE 68

Overcoming Catastrophic Forgetting (Kirkpatrick et al. 2017)
EWC: Elastic Weight Consolidation
Don’t change weights that are informative of task A

Fisher Information
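The EWC idea from Kirkpatrick et al. 2017 can be sketched as a quadratic penalty: while training on task B, each weight is anchored to its task-A value in proportion to its Fisher information. The scalar weights and Fisher values below are made up for illustration:

```python
# EWC regularizer sketch: penalty = (lambda/2) * sum_i F_i * (theta_i - theta*_A,i)^2
# Toy values; in practice F_i is estimated from gradients on task A's data.

def ewc_penalty(theta, theta_a, fisher, lam=1.0):
    return 0.5 * lam * sum(f * (t - ta) ** 2
                           for t, ta, f in zip(theta, theta_a, fisher))

theta_a = [1.0, -2.0, 0.5]    # weights after training on task A
fisher  = [10.0, 0.01, 1.0]   # high Fisher = weight was informative for task A
theta   = [1.1, 0.0, 0.5]     # candidate weights while learning task B

# Nudging the high-Fisher weight by 0.1 costs roughly as much as moving
# the low-Fisher weight by 2.0: informative weights are held in place.
print(ewc_penalty(theta, theta_a, fisher))
```

During task-B training this penalty is simply added to task B's loss, so uninformative weights stay free to change while task-A-critical ones are "elastically" held.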

SLIDE 69

Overcoming Catastrophic Forgetting (Kirkpatrick et al. 2017)

SLIDE 70

Eventually we will run out of capacity!
Is there a better way to make use of the neural network capacity?

SLIDE 71

(Han et al. 2015)

Neural Networks are compressible post-training

(Slide adapted from Brian Cheung)

SLIDE 72

(Han et al. 2015)

Neural Networks are compressible post-training

(Slide adapted from Brian Cheung)

SLIDE 73

Negligible performance change after pruning → neural networks are over-parameterized
Can we make use of over-parameterization?
We will have to make use of “excess” capacity during training
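The compression observation can be illustrated with magnitude pruning in the spirit of Han et al. 2015: zero out the smallest-magnitude weights and keep the rest. The weight list and keep-fraction below are made up for illustration:

```python
# Magnitude pruning sketch: keep only the largest-magnitude weights.

def prune(weights, keep_fraction=0.5):
    """Zero out all but the top keep_fraction of weights by magnitude."""
    k = int(len(weights) * keep_fraction)
    # threshold = smallest magnitude among the k weights we keep
    threshold = sorted(abs(w) for w in weights)[len(weights) - k]
    return [w if abs(w) >= threshold else 0.0 for w in weights]

w = [0.01, -0.9, 0.05, 1.2, -0.02, 0.4]
print(prune(w))   # small weights zeroed: [0.0, -0.9, 0.0, 1.2, 0.0, 0.4]
```

That half (or more) of the weights can be removed with negligible accuracy loss is the over-parameterization the next slides propose to exploit for continual learning.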

SLIDE 74

Superposition of many models into one (Cheung et al., 2019)

Superposition: many weight matrices W(1), W(2), W(3), … stored in one model W

Refer to the paper for details
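The core trick in Cheung et al. 2019 can be sketched numerically: bind each model's weights with its own random ±1 context and sum them into one shared parameter vector; unbinding with the matching context recovers that model's weights up to interference from the others. The dimensions, number of models, and random weights below are made up, and this toy uses elementwise ±1 binding rather than the paper's full construction:

```python
# Superposition sketch: w = sum_k c_k * w_k (elementwise), with random
# +/-1 context vectors c_k. Retrieval: c_k * w ~= w_k + interference.
import random

random.seed(0)
dim = 2000
contexts = [[random.choice([-1, 1]) for _ in range(dim)] for _ in range(3)]
models   = [[random.gauss(0, 1)     for _ in range(dim)] for _ in range(3)]

# one shared parameter vector holding all three models
w = [sum(c[i] * m[i] for c, m in zip(contexts, models)) for i in range(dim)]

def retrieve(k):
    """Unbind model k's weights from the shared vector using its context."""
    return [contexts[k][i] * w[i] for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

# Retrieval correlates strongly with the stored model (about 1/sqrt(3)
# here, since two other models contribute interference), and only
# negligibly with the models it was not bound to.
print(cosine(retrieve(0), models[0]))
print(cosine(retrieve(0), models[1]))
```

Because the interference behaves like zero-mean noise, many models can share one parameter vector before retrieval degrades, which is how "excess" capacity gets used during training.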