Model Based Reinforcement Learning
Oriol Vinyals (DeepMind)

SLIDE 1

Model Based Reinforcement Learning

Oriol Vinyals (DeepMind) @OriolVinyalsML

May 2018, Stanford University

SLIDE 2

The Reinforcement Learning Paradigm

[Diagram: an Agent interacts with an Environment, sending ACTIONS and receiving OBSERVATIONS in pursuit of a GOAL.]

SLIDE 3

The Reinforcement Learning Paradigm

[Diagram: at each step the agent takes action $a_t$, receives reward $r_t$, and observes state $x_t$.]

Maximize the Return (long-term reward): $R_t = \sum_{t' \ge t} \gamma^{t'-t} r_{t'} = r_t + \gamma R_{t+1}$, with discount $\gamma \in [0,1]$
With a Policy (action distribution): $\pi = P(a_t \mid x_t, \ldots)$
Measure success with the Value Function: $V(x_t) = \mathbb{E}[R_t]$
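To make these definitions concrete, here is a tiny Python check (with an invented reward sequence) that the recursive identity $R_t = r_t + \gamma R_{t+1}$ agrees with the direct discounted sum:

```python
# Discounted return: R_t = sum_{t' >= t} gamma^(t'-t) * r_{t'} = r_t + gamma * R_{t+1}
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]  # invented episode rewards

def direct_return(rewards, t, gamma):
    """Direct discounted sum from step t to the end of the episode."""
    return sum(gamma ** (tp - t) * r for tp, r in enumerate(rewards) if tp >= t)

def recursive_return(rewards, t, gamma):
    """Recursive form: R_t = r_t + gamma * R_{t+1}."""
    if t >= len(rewards):
        return 0.0
    return rewards[t] + gamma * recursive_return(rewards, t + 1, gamma)

for t in range(len(rewards)):
    assert abs(direct_return(rewards, t, gamma) - recursive_return(rewards, t, gamma)) < 1e-12
print(direct_return(rewards, 0, gamma))  # R_0
```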

SLIDE 4

A Classic Dilemma

[Image: a Deep Learning Researcher and an "old school" AI Researcher.]

SLIDE 5

A Classic Dilemma

SLIDE 6

A Classic Dilemma

Deep RL vs. Model Based RL

SLIDE 7

(Deep) Model Based RL

Deep RL + Model Based RL ⇒ Deep Generative Model + Deep RL

Two instances: Imagination Augmented Agents; Learning Model Based Planning from Scratch

SLIDE 8

Imagination Augmented Agents (NIPS17)

Joint work with: Theo Weber*, Sebastien Racaniere*, David Reichert*, Razvan Pascanu*, Yujia Li*, Lars Buesing, Arthur Guez, Danilo Rezende, Adrià Puigdomènech Badia, Peter Battaglia, Nicolas Heess, David Silver, Daan Wierstra

SLIDE 9

Intro to I2A

  • We have good environment models ⇒ can we use them to solve tasks?
  • How do we do model-based RL and deal with imperfect simulators?
  • In this particular approach, we treat the generative model as an oracle of possible futures. ⇒ How do we interpret those 'warnings'?
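As a rough sketch of the I2A idea only (not the paper's architecture): short rollouts are imagined with a learned, possibly imperfect model, encoded, and concatenated with a model-free path, so the agent can learn how much to trust the model's 'warnings'. Every module below is a random or trivial stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

def env_model(state, action):        # learned, possibly imperfect, model (stand-in)
    return np.tanh(state + 0.1 * action)

def rollout_policy(state):           # cheap policy used only for imagining (stand-in)
    return rng.standard_normal(state.shape)

def encode_rollout(states):          # rollout encoder (an RNN in I2A; a mean here)
    return np.mean(states, axis=0)

def i2a_features(state, n_rollouts=3, depth=2):
    """Imagine several short futures and summarise them as extra features."""
    codes = []
    for _ in range(n_rollouts):
        s, imagined = state, []
        for _ in range(depth):
            s = env_model(s, rollout_policy(s))
            imagined.append(s)
        codes.append(encode_rollout(np.stack(imagined)))
    model_free = state                            # model-free path
    return np.concatenate([model_free] + codes)   # input to the policy/value heads

print(i2a_features(rng.standard_normal(4)).shape)
```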

SLIDE 10

Imagination Augmented Agents (I2A)

SLIDE 11

Imagination Planning Networks (IPNs)

SLIDE 12

Imagination Planning Networks (IPNs)

SLIDE 13

Sokoban environment

  • Procedurally generated
  • Irreversible decisions
SLIDE 14

Sokoban environment

SLIDE 15

Video

[Video: a success example and a failure example.]

SLIDE 16

What happens if our model is bad?

SLIDE 17

SLIDE 18

Mental retries with I2A

SLIDE 19

Mental retries with I2A

SLIDE 20

Mental retries with I2A

Solves 95% of levels!

SLIDE 21

Imagination efficiency

Imagination is expensive ⇒ can we limit the number of times we ask the agent to imagine a transition in order to solve a level? In other words, can we guide the search more efficiently than current methods?

SLIDE 22

One model, many tasks

SLIDE 23

Metaminipacman

Five events:

  • Do nothing
  • Eat a small pill
  • Eat a power pill
  • Eat a ghost
  • Be eaten by a ghost

We assign each event a different reward, and create five different games:

  • 'Regular'
  • 'Rush' (eat big pills as fast as possible)
  • 'Hunt' (eat ghosts; pills are OK, I guess)
  • 'Ambush' (eat ghosts, avoid everything else)
  • 'Avoid' (everything hurts)
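A minimal sketch of this 'one model, many tasks' setup: the dynamics are shared and only an event-to-reward lookup changes per game. The numbers below are invented; only their signs follow the descriptions above.

```python
# Hypothetical reward per (game, event); values are illustrative only.
REWARDS = {
    "regular": {"nothing": 0.0, "small_pill": 1.0,  "power_pill": 2.0,  "eat_ghost": 5.0,  "eaten_by_ghost": -10.0},
    "rush":    {"nothing": 0.0, "small_pill": 0.0,  "power_pill": 10.0, "eat_ghost": 0.0,  "eaten_by_ghost": -10.0},
    "hunt":    {"nothing": 0.0, "small_pill": 0.1,  "power_pill": 1.0,  "eat_ghost": 10.0, "eaten_by_ghost": -10.0},
    "ambush":  {"nothing": 0.0, "small_pill": -0.1, "power_pill": -0.1, "eat_ghost": 10.0, "eaten_by_ghost": -10.0},
    "avoid":   {"nothing": 0.1, "small_pill": -1.0, "power_pill": -1.0, "eat_ghost": -5.0, "eaten_by_ghost": -10.0},
}

def reward(game: str, event: str) -> float:
    """The environment dynamics are shared; only this lookup differs per game."""
    return REWARDS[game][event]

print(reward("hunt", "eat_ghost"), reward("avoid", "small_pill"))
```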

SLIDE 24

Results

[Charts: results on the 'Avoid' and 'Ambush' tasks.]

SLIDE 25

Learning model-based planning from scratch

Joint work with: Razvan Pascanu*, Yujia Li*, Theo Weber*, Sebastien Racaniere*, David Reichert*, Lars Buesing, Arthur Guez, Danilo Rezende, Adrià Puigdomènech Badia, Peter Battaglia, Nicolas Heess, David Silver, Daan Wierstra

SLIDE 26

Prior work: Spaceship Task v1.0

Hamrick, Ballard, Pascanu, Vinyals, Heess, Battaglia (2017). Metacontrol for Adaptive Imagination-Based Optimization, ICLR 2017.

  • Propel the spaceship to the home planet (white) by choosing thruster force and magnitude
  • Other planets' (grey) gravitational fields influence the trajectory
  • Continuous, contextual bandit problem
SLIDE 27

Prior work: Imagination-based metacontroller

  • Restricted to bandit problems
SLIDE 28

This paper:

Imagination-based Planner (IBP)

SLIDES 29–33

Spaceship Task v2.0: Multiple actions

  • Use the thruster multiple times
  • Increased difficulty over Spaceship Task v1.0:
    1. Pay for fuel
    2. Multiplicative control noise
  • Opens up new strategies, such as:
    1. Move away from challenging gravity wells
    2. Apply the thruster toward the target
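A minimal sketch of the two added difficulties, with invented dynamics and constants: each thruster firing pays a fuel cost proportional to its magnitude, and the executed thrust is corrupted by multiplicative control noise.

```python
import numpy as np

rng = np.random.default_rng(0)
FUEL_COST = 0.05   # invented constant: fuel paid per unit of thrust magnitude
NOISE_STD = 0.10   # invented constant: scale of the multiplicative control noise

def apply_thruster(velocity, thrust):
    """One control step: pay for fuel, then apply noisy thrust."""
    fuel_cost = FUEL_COST * np.linalg.norm(thrust)
    executed = thrust * (1.0 + NOISE_STD * rng.standard_normal())  # multiplicative noise
    return velocity + executed, -fuel_cost  # new velocity, (negative) reward term

v, cost = apply_thruster(np.zeros(2), np.array([1.0, 0.5]))
print(v, cost)
```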

SLIDES 34–36

Imagination-based Planner

  • Imagination can be:
    ○ Current step only: imagine only from the current state
    ○ Chained steps only: imagine a sequence of actions
    ○ Imagination tree: a manager chooses whether to use the current (root) state, or chain imagined states together
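The three modes differ only in which state the next imagined step starts from. A minimal sketch, with a random choice standing in for the learned manager:

```python
import random

def pick_imagination_root(mode, current_state, imagined_states, manager=None):
    """Choose the state the next imagination step starts from.

    mode: 'current' -> always the real current state
          'chain'   -> the most recently imagined state
          'tree'    -> a manager picks any previously considered state
    """
    if mode == "current" or not imagined_states:
        return current_state
    if mode == "chain":
        return imagined_states[-1]
    if mode == "tree":
        candidates = [current_state] + imagined_states
        return (manager or random.choice)(candidates)  # random stand-in for the manager
    raise ValueError(mode)

print(pick_imagination_root("chain", 0, [1, 2, 3]))  # -> 3
```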

SLIDE 37

Imagination-based Planner

SLIDE 38

Imagination-based Planner

SLIDE 39

Real trials: 3 actions

[Trajectories shown for 0, 1, and 2 imaginations per action.]

More complex plans emerge:
1. Move away from complex gravity
2. Slow the velocity
3. Move to the target

SLIDE 40

Different strategies for exploration

[Panels: 1-step, n-step, and imagination-tree exploration strategies.]

SLIDES 41–44

Results

SLIDES 45–49

Imagination-based Planner

How does it work? (learnable components are bold)

1. On each step, inputs:
   ○ State, s_t: the planet and ship positions, etc.
   ○ Imagined state, s'_t: internal state belief
   ○ History, h_t: summary of planning steps so far
2. The Controller policy returns an action, a_t
3. The Manager routes the action to the world or to imagination, r_t
4. If the route, r_t, indicates:
   a. "Imagination": the model predicts an imagined state, s'_{t+1}
   b. "World": acting produces a new state, s_{t+1}
5. The Memory aggregates the new info into an updated history, h_{t+1}
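One planning step from the list above as a runnable sketch; every learned component is replaced by a random or linear stand-in, and the dynamics are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

def controller(s, s_imag, h):   # policy: returns an action (random stand-in)
    return rng.standard_normal(2)

def manager(s, s_imag, h):      # routes the action (random stand-in)
    return rng.choice(["world", "imagination"])

def model(s, a):                # learned model used for imagining (stand-in)
    return s + 0.1 * a

def world_step(s, a):           # the real environment (invented dynamics)
    return s + 0.1 * a + 0.01 * rng.standard_normal(2)

def memory(h, record):          # aggregates the planning history (an LSTM in the paper)
    return h + [record]

def planner_step(s, s_imag, h):
    a = controller(s, s_imag, h)
    route = manager(s, s_imag, h)
    if route == "imagination":
        s_imag = model(s_imag, a)   # imagined transition; the world is untouched
    else:
        s = world_step(s, a)        # act for real
        s_imag = s                  # restart imagination from the new real state
    return s, s_imag, memory(h, (route, a))

s = s_imag = np.zeros(2)
h = []
for _ in range(5):
    s, s_imag, h = planner_step(s, s_imag, h)
print([route for route, _ in h])
```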

SLIDES 50–53

Imagination-based Planner

How is it trained?

Three distinct, concurrent, on-policy training loops:

1. Model/Imagination (interaction network). Supervised: s_t, a_t → s_{t+1}
2. Controller/Memory (MLP/LSTM). SVG: the reward, u_t, is assumed to be |s_{t+1} - s*|^2. The model, imagination, memory, and controller are differentiable; the Manager's discrete r_t choices are treated as constants.
3. Manager: finite-horizon MDP (MLP Q-net, stochastic). REINFORCE: return = reward + computation costs, (u_t + c_t)
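The three objectives written out as plain functions, as a sketch only (numpy arrays and fixed constants; no actual gradient machinery):

```python
import numpy as np

# 1. Model/Imagination: supervised prediction error on observed transitions.
def model_loss(predicted_next_state, true_next_state):
    return np.mean((predicted_next_state - true_next_state) ** 2)

# 2. Controller/Memory: SVG-style objective. The cost u_t = |s_{t+1} - s*|^2 is
#    differentiated through the (differentiable) model/memory/controller, with
#    the manager's discrete routing choices treated as constants.
def controller_cost(s_next, s_star):
    return np.sum((s_next - s_star) ** 2)

# 3. Manager: REINFORCE on the return = task reward + computation costs (u_t + c_t);
#    the per-imagination cost constant is invented here.
def manager_return(u_t, n_imaginations, cost_per_imagination=0.1):
    return -(u_t + cost_per_imagination * n_imaginations)

print(model_loss(np.ones(2), np.zeros(2)),
      controller_cost(np.array([1.0, 1.0]), np.zeros(2)),
      manager_return(2.0, 3))
```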

SLIDE 54

Bonus Paper: MCTSnet

Joint work with: Arthur Guez*, Theo Weber*, Ioannis Antonoglou, Karen Simonyan, Daan Wierstra, Remi Munos, David Silver

SLIDE 55

Vanilla MCTS: a single simulation (tree-policy phase)

[Diagram: tree after some simulations. Selection uses a UCB-type rule: a fixed function of Q, visit counts {N}, and optionally a prior net.]

SLIDE 56

MCTS: a single simulation (tree-policy phase)

[Diagram: the simulation descends and expands a leaf, using the true model for each transition; each edge stores Q and visit counts {N}.]

SLIDE 57

MCTS: a single simulation

[Diagram: the leaf state x_leaf is evaluated by a pretrained value network, producing V.]

SLIDE 58

MCTS: a single simulation (backup phase)

[Diagram: along the path, Q ← V and N_a ← N_a + 1.]

SLIDE 59

MCTS: a single simulation (backup phase)

[Diagram: the backup Q ← V, N_a ← N_a + 1 is applied at every node along the traversed path.]

MCTS output: after many simulations, take max Q (or max N) at the root node
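The whole vanilla simulation from slides 55–59 fits in a short sketch: UCB selection, expansion with the true model, evaluation by a value network (a trivial stand-in here), and the Q ← V, N_a + 1 backup:

```python
import math

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}   # action -> child Node
        self.N = {}          # action -> visit count
        self.Q = {}          # action -> mean value

ACTIONS = (0, 1)

def true_model(state, action):   # stand-in deterministic transition
    return state + (1 if action else -1)

def value_net(state):            # stand-in for the pretrained value network
    return -abs(state - 3) / 10.0

def ucb(node, action, c=1.4):    # fixed function of Q and visit counts
    n = node.N.get(action, 0)
    if n == 0:
        return float("inf")
    return node.Q[action] + c * math.sqrt(math.log(sum(node.N.values())) / n)

def simulate(root):
    """One simulation: UCB selection, expansion, evaluation, backup."""
    path, node = [], root
    while True:
        a = max(ACTIONS, key=lambda act: ucb(node, act))
        path.append((node, a))
        if a not in node.children:
            node.children[a] = Node(true_model(node.state, a))  # expand leaf
            break
        node = node.children[a]
    v = value_net(node.children[a].state)                       # evaluate leaf
    for node, a in path:                                        # backup phase
        n = node.N.get(a, 0)
        node.Q[a] = (node.Q.get(a, 0.0) * n + v) / (n + 1)      # Q <- V (running mean)
        node.N[a] = n + 1                                       # N_a + 1

root = Node(0)
for _ in range(50):
    simulate(root)
print(max(ACTIONS, key=lambda a: root.N.get(a, 0)))  # output: max N at the root
```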

SLIDE 60

MCTSnet model: a single simulation (tree-policy phase)

[Diagram: tree after some simulations. A simulation policy network selects actions, starting from the root embedding.]

SLIDE 61

MCTSnet model: a single simulation (tree-policy phase)

[Diagram: expand a leaf, using the true model for each transition.]

SLIDE 62

MCTSnet model: a single simulation (tree-policy phase)

[Diagram: the new leaf state x_leaf is embedded by an embed network.]

SLIDE 63

MCTSnet model: a single simulation (backup phase)

[Diagram: a backup network updates the embeddings along the path. Note: the reward and action should also be provided as input to the backup net.]

SLIDE 64

MCTSnet model: a single simulation (backup phase)

SLIDE 65

Embeddings represent a tree-shaped memory of past rollouts.

[Diagram: a simulation expands node x (the first time the embed net is called on it); later, another simulation visits the same node x and expands a node y.]
SLIDE 66

Multiple simulations / search

A single forward of the MCTSnet:

[Diagram: a single forward pass runs simulations 1 … K; a readout network maps the resulting root memory to the net output, trained with a loss (optionally using side information) on input x.]

SLIDE 67

MCTSnet architecture (cartoon)

[Diagram: a simulation goes down the tree; the new tree node x_leaf is evaluated/embedded; the backup runs along the traversed path.]

SLIDE 68

Recap of MCTSnet modules

[Diagram: simulation policy network, embed network, backup network, and readout network; the readout maps the root memory for input x to logits, the net output.]
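The four modules in code form, as a sketch in which every network is a random stand-in (in MCTSnet they are trained jointly):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8

class Node:
    def __init__(self, state):
        self.state = state
        self.h = embed_net(state)     # per-node memory embedding
        self.children = {}            # action -> child Node

def embed_net(state):                 # embeds a newly expanded state (stand-in)
    return np.tanh(rng.standard_normal(DIM) + state)

def sim_policy_net(h):                # picks the next action from a node's embedding
    return int(h.sum() > 0)

def backup_net(h_parent, h_child, action, reward):   # updates memory along the path
    return np.tanh(h_parent + h_child + action + reward)

def readout_net(h_root):              # maps root memory to action logits
    return h_root[:2]

def true_model(state, action):        # stand-in environment: next state, reward
    return state + (1 if action else -1), -0.1

def one_simulation(root):
    """Tree-policy descent, expansion with embedding, then backup of embeddings."""
    path, node = [], root
    while True:
        a = sim_policy_net(node.h)
        if a not in node.children:
            s, r = true_model(node.state, a)
            node.children[a] = Node(s)        # embed_net runs on expansion
            path.append((node, a, r))
            break
        _, r = true_model(node.state, a)
        path.append((node, a, r))
        node = node.children[a]
    for parent, a, r in reversed(path):       # backup phase along the path
        parent.h = backup_net(parent.h, parent.children[a].h, a, r)

root = Node(0)
for _ in range(10):                           # K simulations in one forward pass
    one_simulation(root)
print(readout_net(root.h))                    # net output after K simulations
```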

SLIDE 69

Problem setting: classification

Data:
  Input: x, a Sokoban frame
  Target: a*, an "oracle" action (obtained from running a long MCTS + vnet + TT search)

SLIDE 70

Loss

Classification loss (predict the oracle action in each state x).

The gradient of the loss splits into differentiable and non-differentiable parts: the differentiable part is trained with straight-through backpropagation, while all internal actions are trained with REINFORCE using a pseudo-reward.
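A toy numpy illustration of that split: the readout path gets an exact cross-entropy gradient (ordinary backprop), while a sampled internal action gets a REINFORCE gradient weighted by a pseudo-reward; the pseudo-reward convention below is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Differentiable part: cross-entropy of the readout logits vs. the oracle action.
logits = rng.standard_normal(4)
oracle = 2
p = softmax(logits)
loss = -np.log(p[oracle])
dlogits = p.copy()
dlogits[oracle] -= 1.0                 # exact gradient: p - onehot(oracle)

# Non-differentiable part: an internal simulation action sampled from a policy.
sim_logits = rng.standard_normal(3)
pi = softmax(sim_logits)
a = rng.choice(3, p=pi)
pseudo_reward = -loss                  # invented convention: reward low outer loss
dlog = -pi.copy()
dlog[a] += 1.0                         # grad of log pi(a) wrt sim_logits
dsim_logits = -pseudo_reward * dlog    # REINFORCE, written as a descent direction

print(loss, dlogits, dsim_logits)
```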

SLIDE 71

SLIDE 72

Thanks!!

@OriolVinyalsML

May 2018