Model Based Reinforcement Learning
Oriol Vinyals (DeepMind) @OriolVinyalsML
May 2018, Stanford University
The Reinforcement Learning Paradigm
[Diagram: the Agent sends ACTIONS to the Environment; the Environment returns OBSERVATIONS, in pursuit of a GOAL]
At each step the agent takes action at and receives reward rt and state xt.
Maximize Return, the long-term reward: Rt = Σt’≥t γ^(t’−t) rt’ = rt + γRt+1, with discount γ ∈ [0,1]
With Policy, an action distribution: π = P(at | xt, ...)
Measure success with the Value Function: V(xt) = E[Rt]
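As a quick illustration of the recursion Rt = rt + γRt+1, here is a minimal Python sketch; the function name and the γ values are ours, not from the talk:

```python
# Minimal sketch: compute R_t = r_t + gamma * R_{t+1} for a finite episode.
def discounted_returns(rewards, gamma=0.99):
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # R_t = r_t + gamma * R_{t+1}
        returns[t] = running
    return returns

# Example: three rewards with gamma = 0.9.
print(discounted_returns([1.0, 0.0, 1.0], gamma=0.9))  # ~ [1.81, 0.9, 1.0]
```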
Deep Learning Researcher vs. “Old school” AI Researcher
Deep RL + Model Based RL ⇒ Deep Generative Model + Deep RL:
○ Imagination-Augmented Agents
○ Learning Model-Based Planning from Scratch
Joint work with: Theo Weber*, Sebastien Racaniere*, David Reichert*, Razvan Pascanu*, Yujia Li*, Lars Buesing, Arthur Guez, Danilo Rezende, Adrià Puigdomènech Badia, Peter Battaglia, Nicolas Heess, David Silver, Daan Wierstra
The model lets the agent imagine possible futures. ⇒ How do we interpret those ‘warnings’?
[Figure: imagined rollouts labelled Success / Failure]
Solves 95% of levels!
Imagination is expensive ⇒ can we limit the number of times we ask the agent to imagine a transition in order to solve a level? In other words, can we guide the search more efficiently than current methods?
Five events:
We assign a different reward to each event and create five different games, e.g. Avoid and Ambush.
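A minimal sketch of this setup: each game is just a different reward vector over the same five events. The event names and reward values below are placeholders, not the actual values from the paper.

```python
# Placeholder sketch: five games = five reward assignments over five events.
# Event names and reward values are illustrative, not the paper's.
EVENTS = ("event_a", "event_b", "event_c", "event_d", "event_e")

GAME_REWARDS = {
    "avoid":  (0.1, -1.0, -1.0, -1.0, 0.0),
    "ambush": (0.0,  0.0,  0.0,  5.0, -1.0),
    # ...three more games, each a different vector over the same events.
}

def reward(game, event):
    return GAME_REWARDS[game][EVENTS.index(event)]
```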
Joint work with: Razvan Pascanu*, Yujia Li*, Theo Weber*, Sebastien Racaniere*, David Reichert*, Lars Buesing, Arthur Guez, Danilo Rezende, Adrià Puigdomènech Badia, Peter Battaglia, Nicolas Heess, David Silver, Daan Wierstra
Hamrick, Ballard, Pascanu, Vinyals, Heess, Battaglia. Metacontrol for Adaptive Imagination-Based Optimization. ICLR 2017.
Task: control a spaceship by choosing the thruster force (direction and magnitude) to influence the trajectory.
Task costs (sketched below): 1. Pay for fuel 2. Multiplicative control noise
Learned strategy: 1. Move away from challenging gravity wells 2. Apply thruster toward target
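A hedged sketch of the two task properties, a fuel cost proportional to thrust and multiplicative noise on the executed control; all constants and names are illustrative assumptions, not the paper's dynamics:

```python
import random

# Illustrative only: fuel cost plus multiplicative control noise.
def apply_thrust(velocity, thrust, fuel_cost_coeff=0.01, noise_scale=0.05):
    noisy_thrust = thrust * (1.0 + random.gauss(0.0, noise_scale))  # noise
    fuel_cost = fuel_cost_coeff * abs(thrust)                       # pay for fuel
    return velocity + noisy_thrust, fuel_cost
```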
Imagination strategies (see the sketch below):
○ Current step only: imagine only from the current state
○ Chained steps only: imagine a sequence of actions
○ Imagination tree: manager chooses whether to use current (root) state, or chain imagined states together
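A hedged sketch of the three strategies, assuming a one-step model `model(state, action) -> next_state` and a `policy(state) -> action`; the uniform-random root choice in the tree branch is a placeholder for the learned manager:

```python
import random

def imagine(model, policy, real_state, strategy, n_rollouts=3, depth=2):
    """Collect imagined states under one of the three strategies."""
    imagined = []
    if strategy == "current_only":
        # Every imagination restarts from the real (root) state.
        for _ in range(n_rollouts):
            imagined.append(model(real_state, policy(real_state)))
    elif strategy == "chained":
        # One chain: each imagined state feeds the next imagination step.
        state = real_state
        for _ in range(depth):
            state = model(state, policy(state))
            imagined.append(state)
    elif strategy == "tree":
        # Pick the root of each new imagination among all states so far
        # (here uniformly at random, standing in for the manager).
        frontier = [real_state]
        for _ in range(n_rollouts):
            root = random.choice(frontier)
            child = model(root, policy(root))
            frontier.append(child)
            imagined.append(child)
    return imagined
```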
[Figure: trajectories with 0, 1, and 2 imaginations per action]
More complex plans: the agent 1. moves away from complex gravity wells, 2. slows its velocity, 3. moves to the target.
[Figure: plans under 1-step, n-step, and imagination-tree strategies]
How does it work? (learnable components: controller, manager, model/imagination, memory; a code sketch follows the list)
1. On each step, inputs:
○ State, st: the planet and ship positions, etc.
○ Imagined state, s’t: internal state belief
○ History, ht: summary of planning steps so far
2. Controller policy returns action, at
3. Manager routes the action to the world or to imagination, rt
4. If the route, rt, indicates:
a. “Imagination”: the model predicts the imagined state, s’t+1
b. “World”: the action is executed in the world, yielding the new state, st+1
5. Memory aggregates the new info into an updated history, ht+1
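A minimal Python sketch of steps 1-5; the module interfaces (`controller`, `manager`, `model`, `memory`, `env.step`) are our assumptions, not the paper's code:

```python
# Hedged sketch of one metacontrol step. The learned modules are passed
# in as callables; their exact signatures here are illustrative.
def metacontrol_step(env, model, controller, manager, memory,
                     s_t, s_imag, h_t):
    a_t = controller(s_t, s_imag, h_t)             # 2. controller -> action
    r_t = manager(s_t, s_imag, h_t)                # 3. manager -> route
    if r_t == "imagination":                       # 4a. imagine a next state
        s_imag = model(s_imag, a_t)
        s_next = s_t                               # the world does not advance
    else:                                          # 4b. act in the world
        s_next = env.step(a_t)
        s_imag = s_next                            # reset belief to reality
    h_next = memory(h_t, s_t, s_imag, a_t, r_t)    # 5. update history
    return s_next, s_imag, h_next
```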
How is it trained?
Three distinct, concurrent, on-policy training loops (sketched below):
1. Model/Imagination (interaction network). Supervised: st, at → st+1
2. Controller/Memory (MLP/LSTM). SVG: the reward, ut, is taken to be |st+1 − s*|². Model, imagination, memory, and controller are differentiable; the manager’s discrete rt choices are treated as constants.
3. Manager: finite-horizon MDP (MLP Q-net, stochastic). REINFORCE: return = reward plus computation costs, i.e. (ut + ct)
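A hedged sketch of the three losses in PyTorch-style code; all names and shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def model_loss(pred_next_state, next_state):
    # 1. Supervised one-step model: (s_t, a_t) -> s_{t+1}.
    return F.mse_loss(pred_next_state, next_state)

def controller_loss(final_state, target_state):
    # 2. SVG-style: differentiate |s_{t+1} - s*|^2 through the model,
    #    imagination, memory, and controller; routes r_t held constant.
    return ((final_state - target_state) ** 2).sum()

def manager_loss(route_log_probs, task_costs, compute_costs):
    # 3. REINFORCE on the discrete routes, with return = (u_t + c_t);
    #    the return is a constant with respect to the gradient.
    ret = torch.tensor([u + c for u, c in zip(task_costs, compute_costs)])
    return (ret * torch.stack(route_log_probs)).sum()
```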
Joint work with: Arthur Guez*, Theo Weber*, Ioannis Antonoglou, Karen Simonyan, Daan Wierstra, Remi Munos, David Silver
Standard MCTS with learned evaluation:
1. Select: descend the tree (after some simulations) with a UCB-type rule, a fixed function of Q, the visit counts {N}, and a prior network.
2. Expand: add a new leaf, x_leaf, using the true model for each transition.
3. Evaluate: a pretrained value network returns V(x_leaf).
4. Backup: along the traversed path, update Q ← V and Na ← Na + 1.
MCTS output: after many simulations, take max Q (or max N) at the root node.
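A sketch of the selection rule and the backup in the AlphaGo/PUCT style; the `Node` layout and the `c_puct` constant are illustrative assumptions:

```python
import math

class Node:
    def __init__(self, priors):
        self.priors = priors      # action -> prior probability (prior net)
        self.children = {}        # action -> Node
        self.N = 0                # visit count
        self.Q = 0.0              # mean value estimate

def select_action(node, c_puct=1.0):
    """UCB-type rule: a fixed function of Q, visit counts {N}, and priors."""
    total = sum(c.N for c in node.children.values())
    def ucb(action, child):
        bonus = c_puct * node.priors[action] * math.sqrt(total) / (1 + child.N)
        return child.Q + bonus
    return max(node.children.items(), key=lambda kv: ucb(*kv))[0]

def backup(path, v):
    """Along the traversed path: Na <- Na + 1, Q <- running mean of values."""
    for node in path:
        node.N += 1
        node.Q += (v - node.Q) / node.N
```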
MCTSnet architecture (cartoon): the same search, with the hand-designed statistics replaced by learned networks.
1. Simulation down the tree (after some sims), driven by a simulation policy network starting from the root embedding.
2. Expand, using the true model for each transition.
3. Evaluate/embed the new tree node x_leaf with an embed network (the first time a node x is expanded is the first time the embed net is called on it).
4. Backup along the traversed path with a backup network (note: reward and action should also be provided as input to the backup net).
Later, another simulation visits the same node x (and expands a node y), so x’s embedding is refined again by the backup network.
After simulations 1 … K, a readout network maps the root embedding to the MCTSnet output (logits); side information can also enter the loss.
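Putting the cartoon together, a hedged pseudocode sketch of one MCTSnet forward pass; the dictionary-based tree bookkeeping and the four networks' interfaces are our assumptions:

```python
def mctsnet(x_root, true_model, embed_net, sim_policy_net, backup_net,
            readout_net, num_sims=10):
    root = {"x": x_root, "h": embed_net(x_root), "children": {}}
    for _ in range(num_sims):
        # Simulation down the tree, driven by the simulation policy network.
        node, path = root, []
        action = sim_policy_net(node["h"])
        while action in node["children"]:
            path.append((node, action))
            node = node["children"][action]
            action = sim_policy_net(node["h"])
        # Expand (using the true model for the transition) and embed x_leaf.
        x_leaf, reward = true_model(node["x"], action)
        child = {"x": x_leaf, "h": embed_net(x_leaf), "children": {}}
        node["children"][action] = child
        path.append((node, action))
        # Backup along the traversed path; reward and action are also inputs.
        # (For brevity only the leaf transition's reward is passed up.)
        h_below = child["h"]
        for parent, a in reversed(path):
            parent["h"] = backup_net(parent["h"], h_below, a, reward)
            h_below = parent["h"]
    # Readout network maps the final root embedding to action logits.
    return readout_net(root["h"])
```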
Data: input x = a Sokoban frame; target a* = the “oracle” action (obtained from running a long MCTS+vnet+TT search).
Classification loss: predict the oracle action a* in each state x.
The gradient of this loss splits into a differentiable part, trained with straight-through backprop, and a non-differentiable part: all internal (simulation) actions are REINFORCE-trained, using a pseudo-reward derived from the loss.
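A minimal sketch of this split objective; the pseudo-reward computation is left abstract and all names and shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

# The readout path is trained by ordinary backprop on the classification
# loss; the discrete simulation actions are REINFORCE-trained with a
# pseudo-reward (e.g. the resulting decrease in loss).
def mctsnet_loss(logits, oracle_action, sim_log_probs, pseudo_rewards):
    # Differentiable part: predict the oracle action a* from the net output.
    ce = F.cross_entropy(logits, oracle_action)
    # Non-differentiable part: score all internal actions by pseudo-reward,
    # treated as constants with respect to the gradient.
    pr = torch.tensor(pseudo_rewards)
    reinforce = -(pr * torch.stack(sim_log_probs)).sum()
    return ce + reinforce
```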