Multiple scales of task and reward-based learning. Jane Wang, Zeb Kurth-Nelson, Sam Ritter, Hubert Soyer, Remi Munos, Charles Blundell, Joel Leibo, Dhruva Tirumala, Dharshan Kumaran, Matt Botvinick. NIPS 2017 Meta-learning Workshop.


SLIDE 1

Multiple scales of task and reward-based learning

Jane Wang

Zeb Kurth-Nelson, Sam Ritter, Hubert Soyer, Remi Munos, Charles Blundell, Joel Leibo, Dhruva Tirumala, Dharshan Kumaran, Matt Botvinick. NIPS 2017 Meta-learning Workshop, December 9, 2017.

SLIDE 2

SLIDE 3

Building machines that learn and think like people, Lake et al, 2016

SLIDE 4

Raven’s progressive matrices (J. C. Raven, 1936)

SLIDE 5

Meta-Learning: Learning inductive biases or priors

Evolutionary principles in self-referential learning (Schmidhuber, 1987); Learning to learn (Thrun & Pratt, 1998)

Learning faster with more tasks, benefiting from transfer across tasks and learning on related tasks

SLIDE 6

Meta-RL: learning to learn from reward feedback

[Figure: learning curves from Harlow (Psychological Review, 1949), performance over training episodes]

SLIDE 7

Meta-RL: learning to learn from reward feedback

[Figure: learning curves from Harlow (Psychological Review, 1949), performance approaching ceiling over training episodes]

SLIDE 8

Multiple scales of reward-based learning

[Diagram: timeline; learning task specifics takes place within 1 task]

SLIDE 9

Multiple scales of reward-based learning

[Diagram: learning task specifics within 1 task; learning priors over a distribution of tasks]

Nested learning algorithms happening in parallel, on different timescales

SLIDE 10

Multiple scales of reward-based learning

[Diagram: learning task specifics within 1 task; learning priors over a distribution of tasks; learning physics, universal structure, and architecture over a lifetime?]

SLIDE 11

Multiple scales of reward-based learning

[Diagram: learning task specifics within 1 task; learning priors over a distribution of tasks]

SLIDE 12

Different ways of building priors

➢ Handcrafted features, expert knowledge, teaching signals
➢ Learning a good initialization: Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (Finn et al, 2017 ICML)
➢ Learning a meta-optimizer: Learning to learn by gradient descent by gradient descent (Andrychowicz et al, 2016)
➢ Learning an embedding function: Matching networks for one shot learning (Vinyals et al, 2016)
➢ Bayesian program learning: Human-level concept learning through probabilistic program induction (Lake et al, 2015)
➢ Implicitly learned via recurrent neural networks/external memory: Meta-learning with memory-augmented neural networks (Santoro et al, 2016)
➢ …

What all these have in common is a way to build in assumptions that constrain the space of hypotheses to search over
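To make one entry in the list above concrete, here is a minimal sketch of "learning a good initialization" in the MAML style (Finn et al, 2017). It assumes a functional `task_loss` that maps a list of parameter tensors to a scalar loss; all names are illustrative and this is not the reference implementation.

```python
# Hedged MAML-style inner loop: the prior is a shared initialization theta that
# is adapted to a new task with a few gradient steps. The outer loop (not shown)
# backpropagates through this adaptation to improve theta itself.
import torch

def maml_inner_adapt(theta, task_loss, lr_inner=0.01, steps=1):
    """theta: list of tensors (shared init, requires_grad=True);
    task_loss(params) -> scalar loss of the current task under `params`."""
    adapted = [t for t in theta]
    for _ in range(steps):
        grads = torch.autograd.grad(task_loss(adapted), adapted, create_graph=True)
        adapted = [t - lr_inner * g for t, g in zip(adapted, grads)]
    return adapted  # outer loop: minimize task_loss(adapted) on held-out data w.r.t. theta
```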

SLIDE 13

RNNs + distribution of tasks to learn prior implicitly

[Diagram: learning task specifics within 1 task (in activations); learning priors over a distribution of tasks (in weights)]

Constrain the hypothesis space with a task distribution that is correlated in the prior we want to learn, but varies in the ways we want to abstract over (e.g., the specific image or reward contingency)

Prefrontal cortex and flexible cognitive control: Rules without symbols (Rougier et al, 2005); Domain randomization for transferring deep neural networks from simulation to the real world (Tobin et al, 2017)

Use the activations of a recurrent neural network (RNN) to implement RL in its dynamics, shaped by priors learned in the weights

SLIDE 14

Learning the correct policy

[Diagram: an RL learning algorithm (deep NN) maps observations from the environment (or task) to a policy over actions]

Map observations to actions in order to maximize reward from the environment.

SLIDE 15

Learning the correct policy with an RNN

[Diagram: an RL learning algorithm (RNN) maps observations from the environment (or task) to a policy over actions]

Map a history of observations and states to future actions in order to maximize reward for a sequential task.

Song et al, 2017 eLife; Miconi et al, 2017 eLife; Barak, 2017 Curr Opin Neurobiol

SLIDE 16

Learning to learn the correct policy: meta-RL

[Diagram: an RL learning algorithm (RNN) maps observations from a distribution of environments/tasks (Environment 1 ... Task i) to a policy over actions]

Map a history of observations and past rewards/actions to future actions in order to maximize reward over a distribution of tasks.

SLIDE 17

Learning to learn the correct policy: meta-RL

[Diagram: the RNN receives the observation together with the last reward and last action, and outputs the next action; tasks are drawn from a distribution of environments]

Map a history of observations and past rewards/actions to future actions in order to maximize reward over a distribution of tasks.

Wang et al, 2016. Learning to reinforcement learn. arXiv:1611.05763
Duan et al, 2016. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv:1611.02779
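A minimal sketch (not the authors' released code) of the agent interface these slides and the two papers above describe: a recurrent policy that receives, at each step, the current observation concatenated with the previous action and previous reward. Module and variable names (`MetaRLAgent`, `policy_head`, `value_head`) are illustrative.

```python
# Hedged sketch of a meta-RL agent: an LSTM policy whose input includes the
# last action (one-hot) and last reward, so adaptation can happen in the
# recurrent state rather than in the weights.
import torch
import torch.nn as nn


class MetaRLAgent(nn.Module):
    def __init__(self, obs_dim: int, num_actions: int, hidden_dim: int = 48):
        super().__init__()
        # Input = observation + one-hot previous action + scalar previous reward.
        self.lstm = nn.LSTMCell(obs_dim + num_actions + 1, hidden_dim)
        self.policy_head = nn.Linear(hidden_dim, num_actions)  # actor logits
        self.value_head = nn.Linear(hidden_dim, 1)              # critic baseline
        self.num_actions = num_actions

    def forward(self, obs, prev_action, prev_reward, state=None):
        prev_a = nn.functional.one_hot(prev_action, self.num_actions).float()
        x = torch.cat([obs, prev_a, prev_reward.unsqueeze(-1)], dim=-1)
        h, c = self.lstm(x, state)
        return self.policy_head(h), self.value_head(h), (h, c)
```

At test time the weights are frozen, so any within-episode adaptation must be carried by the LSTM state.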

SLIDE 18

What is a “task distribution”? What is “task structure”?

SLIDE 19

What is a task?

SLIDE 20

What is a task?

➢ Visuospatial/perceptual features

SLIDE 21

What is a task?

➢ Visuospatial/perceptual features
➢ Domain (language, images, robotics, etc.)

SLIDE 22

What is a task?

➢ Visuospatial/perceptual features
➢ Domain (language, images, robotics, etc.)
➢ Reward contingencies

SLIDE 23

What is a task?

➢ Visuospatial/perceptual features
➢ Domain (language, images, robotics, etc.)
➢ Reward contingencies
➢ Temporal structure/dynamics

SLIDE 24

What is a task?

➢ Visuospatial/perceptual features
➢ Domain (language, images, robotics, etc.)
➢ Reward contingencies
➢ Temporal structure/dynamics
➢ Interactivity and actions

SLIDE 25

What is a task?

➢ Visuospatial/perceptual features
➢ Domain (language, images, robotics, etc.)
➢ Reward contingencies
➢ Temporal structure/dynamics
➢ Interactivity and actions

SLIDE 26

What is a task?

➢ Visuospatial/perceptual features
➢ Domain (language, images, robotics, etc.)
➢ Reward contingencies
➢ Temporal structure/dynamics
➢ Interactivity and actions

SLIDE 27

[Diagram: when the training tasks are too closely related, the result is OVERFITTING]

SLIDE 28

[Diagram: when the training tasks are too closely related, the result is OVERFITTING]

SLIDE 29

[Diagram: when the training tasks are too different, the result is CATASTROPHIC FORGETTING / INTERFERENCE]

SLIDE 30

[Diagram: when the training tasks are too different, the result is CATASTROPHIC FORGETTING / INTERFERENCE]

SLIDE 31

What is the sweet spot of task relatedness?

➢ Visuospatial/perceptual features
➢ Domain (language, images, robotics, etc.)
➢ Reward contingencies
➢ Temporal structure/dynamics
➢ Interactivity and actions

SLIDE 32

What is the sweet spot of task relatedness?

➢ Visuospatial/perceptual features
➢ Domain (language, images, robotics, etc.)
➢ Reward contingencies
➢ Temporal structure/dynamics
➢ Interactivity and actions

(but eventually vary over!)

SLIDE 33

Harlow task

[Figure: learning curves from Harlow (Psychological Review, 1949), performance approaching ceiling over training episodes]

SLIDE 34

Meta-RL in the Harlow task

[Figure: Harlow's learning curves alongside Meta-RL learning curves, both approaching ceiling performance over training episodes (Harlow, Psychological Review, 1949)]

SLIDE 35

Ingredients: Environment

[Diagram: a task distribution Φ; tasks Task_1 (φ_1) ... Task_i (φ_i) ... Task_N (φ_N), one per episode (Episode 1 ... Episode i ... Episode N)]

  • Distribution of RL tasks with structure

SLIDE 36

Ingredients: Architecture

  • Primary RL algorithm used to train the weights: advantage actor-critic (Mnih et al 2016); see the sketch below
    ○ Turned off during test
  • Auxiliary inputs in addition to observation: reward and action
  • Recurrence (LSTM) to integrate history
  • Emergence of a secondary RL algorithm implemented in recurrent activity dynamics
    ○ Operates in the absence of weight changes
    ○ With potentially radically different properties
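A hedged sketch of how these ingredients fit together: the recurrent agent defined earlier is trained with an ordinary advantage actor-critic loss over one episode. The `env` object (a gym-style `reset`/`step` interface returning tensor observations), the hyperparameter values, and the function name are illustrative assumptions; the real training setup (e.g., distributed workers) is more involved.

```python
# Hedged sketch of the "primary" RL algorithm: A2C applied to the recurrent
# agent, with reward and previous action fed back as auxiliary inputs.
import torch

def a2c_episode_loss(agent, env, gamma=0.9, beta_v=0.5, beta_h=0.05):
    """Unroll one episode and return the A2C loss used to train the weights."""
    obs = env.reset()                                   # assumed to return a [1, obs_dim] tensor
    state = None
    prev_action, prev_reward = torch.zeros(1, dtype=torch.long), torch.zeros(1)
    log_probs, values, entropies, rewards = [], [], [], []
    done = False
    while not done:
        logits, value, state = agent(obs, prev_action, prev_reward, state)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        obs, reward, done, _ = env.step(action.item())  # gym-style step assumed
        log_probs.append(dist.log_prob(action))
        values.append(value.squeeze(-1))
        entropies.append(dist.entropy())
        rewards.append(float(reward))
        prev_action, prev_reward = action, torch.tensor([float(reward)])
    # Discounted returns and advantages.
    returns, R = [], 0.0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.tensor(returns)
    values = torch.cat(values)
    advantages = returns - values
    policy_loss = -(torch.stack(log_probs).squeeze() * advantages.detach()).sum()
    value_loss = advantages.pow(2).sum()
    entropy_bonus = torch.stack(entropies).sum()
    return policy_loss + beta_v * value_loss - beta_h * entropy_bonus
```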

SLIDE 37

Independent bandits

2-armed bandits independently drawn from a uniform Bernoulli distribution, held constant for 100 trials = 1 episode.

p_i = probability of payout for arm i, drawn uniformly from [0, 1]
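A small sketch of this task distribution under the stated assumptions (payout probabilities drawn uniformly from [0, 1] and held fixed for a 100-trial episode); `choose_arm` stands in for whatever agent is being evaluated.

```python
# Hedged sketch of the independent two-armed bandit episode described above.
import random

def sample_independent_bandit():
    return [random.random(), random.random()]            # p1, p2 ~ Uniform[0, 1]

def run_episode(choose_arm, num_trials=100):
    """choose_arm(history) -> 0 or 1; returns total reward for one episode."""
    probs = sample_independent_bandit()
    history, total = [], 0.0
    for _ in range(num_trials):
        arm = choose_arm(history)
        reward = 1.0 if random.random() < probs[arm] else 0.0   # Bernoulli payout
        history.append((arm, reward))
        total += reward
    return total
```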

SLIDE 38

Independent bandits

2-armed bandits independently drawn from a uniform Bernoulli distribution. Tested with fixed weights.

SLIDE 39

Independent bandits

2-armed bandits independently drawn from a uniform Bernoulli distribution. Tested with fixed weights.

[Figure: performance of Meta-RL on independent bandits compared with standard bandit algorithms (worse to better)]

SLIDE 40

Independent bandits

2-armed bandits independently drawn from a uniform Bernoulli distribution. Tested with fixed weights. Performance comparable to standard bandit algorithms.

[Figure: performance of Meta-RL on independent bandits compared with standard bandit algorithms (worse to better)]

SLIDE 41

Ablation Experiments


SLIDE 42

Ablation Experiments

SLIDE 43

Ablation Experiments

SLIDE 44

Structured bandits

Bandits with correlational structure: {pL, pR} = {ν, 1-ν}. Meta-RL learns to exploit structure in the environment.

[Figure: performance on independent vs. correlated bandits]
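For contrast with the independent case above, a one-line change in the task sampler gives the correlated distribution {pL, pR} = {ν, 1-ν}; names are illustrative.

```python
# Hedged sketch of the correlated ("structured") bandit distribution: a single
# latent parameter nu ties the two arms together.
import random

def sample_correlated_bandit():
    nu = random.random()          # latent parameter, Uniform[0, 1]
    return [nu, 1.0 - nu]         # pL and pR are perfectly anti-correlated

# A meta-learner trained on this distribution can infer both arms' payout
# probabilities from pulls of either arm, unlike in the independent case.
```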

SLIDE 45

LSTM hidden states internalize structure

[Figure: LSTM hidden-state trajectories for independent vs. correlated bandits, plotted against pL and pR]

SLIDE 46

LSTM hidden states internalize structure

[Figure: LSTM hidden-state trajectories for independent vs. correlated bandits, plotted against pL and pR]

SLIDE 47

LSTM hidden states internalize structure

[Figure: LSTM hidden-state trajectories for independent vs. correlated bandits, plotted against pL and pR]

SLIDE 48

LSTM hidden states internalize structure

[Figure: LSTM hidden-state trajectories for independent vs. correlated bandits]

SLIDE 49

Structured bandits

11-arm bandits that require sampling a lower-reward arm in order to gain information for maximal long-term gain.

[Diagram: the informative arm pays $0.3; of the remaining ten arms, one pays $5 and the rest pay $1]

SLIDE 50

Structured bandits

11-arm bandits that require sampling a lower-reward arm in order to gain information for maximal long-term gain.

[Figure: Meta-RL performance on the informative-arm bandit (informative arm $0.3, target arm $5)]

SLIDE 51

Volatile bandits

Each episode, a new parameter value for volatility is sampled

[Figure: example arm probabilities over time in a low-volatility episode and a high-volatility episode]
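A hedged sketch of a volatile bandit generator in the spirit of this slide: volatility is sampled once per episode and controls how often the arm probabilities change. The exact generative process used in the experiments is not specified here, so the details below (swap dynamics, parameter range) are assumptions for illustration.

```python
# Hedged sketch: per-episode volatility governing how often arm payout
# probabilities change within the episode.
import random

def run_volatile_episode(num_trials=100):
    volatility = random.uniform(0.0, 0.2)        # per-trial probability of a change (assumed range)
    p = [random.random(), random.random()]        # initial arm payout probabilities
    trajectory = []
    for _ in range(num_trials):
        if random.random() < volatility:          # occasionally the arm probabilities swap
            p = [p[1], p[0]]
        trajectory.append(tuple(p))
    return volatility, trajectory
```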

SLIDE 52

Volatile bandits

Each episode, a new parameter value for volatility is sampled. Meta-RL achieves the lowest total regret over traditional methods.

[Figure: total regret of Meta-RL vs. traditional bandit methods in low- and high-volatility episodes]

SLIDE 53

Volatile bandits

Each episode, a new parameter value for volatility is sampled. Meta-RL achieves the lowest total regret over traditional methods.

Meta-RL also adjusts its effective learning rate to the volatility (despite frozen weights).

[Figure: behavior in low- and high-volatility episodes]

SLIDE 54

The emergent RL algorithm is capable of conforming to a wide variety of task structure:

  • Negotiate the exploration-exploitation tradeoff
  • Leverage task structure (correlations in the environment, informative choices, abstractions, etc.)
  • Display different effective hyperparameters (e.g., learning rate)
  • ...

SLIDE 55

Drawbacks to using RNNs

[Diagram: learning task specifics within 1 task; learning priors over a distribution of tasks. Information about specific tasks is lost once an episode ends]

SLIDE 56

Using memory of specific past experiences to influence decision-making

What did I like the last time I was here?

SLIDE 57

Contextual bandits

[Diagram: arm payout probabilities pr (values such as 0.1 and 0.9) differ between Context 1 and Context 2]

SLIDE 58

Using memory of past exploration

[Diagram: an external memory table with KEY and VALUE columns; the current context (Context 1) will serve as a key]

SLIDE 59

Using memory of past exploration

[Diagram: the agent interacts with Context 1 for one episode; the memory table has KEY and VALUE columns]

A1: hidden state at the end of the episode; contains critical task-related information.

SLIDE 60

Using memory of past exploration

[Diagram: after one episode of interaction with Context 1, the pair (key = Context 1, value = A1) is written to the memory table]

A1: hidden state at the end of the episode; contains critical task-related information.

SLIDE 61

Using memory of past exploration

[Diagram: the memory table holds (Context 1 → A1) and, after one episode of interaction with Context 2, also (Context 2 → A2)]

SLIDE 62

Using memory of past exploration

[Diagram: the memory table holds (Context 1 → A1), (Context 2 → A2) and, after one episode of interaction with Context 3, (Context 3 → A3)]

SLIDE 63

Using memory of past exploration

[Diagram: the memory table holds (Context 1 → A1), (Context 2 → A2), (Context 3 → A3); when Context 1 reappears, it is used as a key to retrieve A1]
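A minimal sketch of the episodic memory mechanism these slides describe: the end-of-episode hidden state A_i is stored under the context that was active, and a reappearing context retrieves it. The class, the cosine-similarity retrieval rule, and the method names are illustrative assumptions, not the exact mechanism of Ritter et al.

```python
# Hedged sketch of an episodic key-value memory: keys are context embeddings,
# values are the LSTM hidden states saved at the end of each episode.
import torch

class EpisodicMemory:
    def __init__(self):
        self.keys, self.values = [], []          # context embeddings and stored hidden states

    def write(self, context_key, hidden_state):
        self.keys.append(context_key)
        self.values.append(hidden_state)

    def read(self, context_key):
        if not self.keys:
            return None
        # Retrieve the value whose key best matches the query context
        # (nearest neighbour; the paper's exact retrieval rule may differ).
        sims = torch.stack([torch.cosine_similarity(context_key, k, dim=0) for k in self.keys])
        return self.values[int(sims.argmax())]
```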

SLIDE 64

Contextual bandits: Barcodes

[Diagram: each context is a binary barcode (e.g., 0010100101, 1110100001, 1010111101, ...) associated with arm payout probabilities pr such as {0.1, 0.9} or {0.9, 0.1}]

Ritter et al, in prep

SLIDE 65

Contextual bandits: Barcodes

[Figure: Meta-RL performance on barcode contextual bandits, first exposure vs. repeat exposure to a context]

Ritter et al, in prep

SLIDE 66

Meta-reinforcement learning

  • Key requirements:
    ○ Recurrent dynamics integrating past reward, history, and observations
    ○ Primary error-based RL algorithm that uses reward prediction error to adjust weights
    ○ Distribution of related tasks with shared structure

  • Resultant effects:
    ○ Structure of tasks is absorbed into the weights as priors, leading to faster learning with more tasks
    ○ The learned RL algorithm is implemented in recurrent activations, not weights, with the potential to be drastically different from the base algorithm and matched to task structure

SLIDE 67

[Diagram: META-RL, a recurrent network with history input, trained on a set of interrelated RL tasks (Env_1 (φ_1) ... Env_i (φ_i) ... Env_N (φ_N)) drawn from a distribution Φ; it internalizes task structure and handles complex task structure, the exploration-exploitation tradeoff, and adaptive hyperparameters]

SLIDE 68

Matt Botvinick, Zeb Kurth-Nelson, Sam Ritter, Dharshan Kumaran, Chris Summerfield, Hubert Soyer, Joel Leibo, Dhruva Tirumala, Remi Munos, Charles Blundell, Demis Hassabis, ...and many others at DeepMind. All of you.

Thank you!

joinus@deepmind.com