SLIDE 1

Reinforcement Learning and Simulation-Based Search

David Silver

SLIDE 2

Outline

1. Reinforcement Learning
2. Simulation-Based Search
3. Planning Under Uncertainty

SLIDE 3

Markov Decision Process

Definition: A Markov Decision Process is a tuple ⟨S, A, P, R⟩, where

- S is a finite set of states
- A is a finite set of actions
- P is a state transition probability matrix, $P^a_{ss'} = \mathbb{P}[s' \mid s, a]$
- R is a reward function, $R^a_s = \mathbb{E}[r \mid s, a]$

Assume for this talk that all sequences terminate, γ = 1
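To make the definition concrete, here is a minimal Python sketch of such a tuple for an invented two-state, two-action MDP (the state and action names, transition probabilities, and rewards are illustrative only, not from the talk):

```python
import numpy as np

# A toy MDP <S, A, P, R>: 2 states, 2 actions (all values invented).
S = ["s0", "s1"]          # finite set of states
A = ["left", "right"]     # finite set of actions

# P[a][s][s'] = probability of transitioning from s to s' under action a.
P = {
    "left":  np.array([[0.9, 0.1],
                       [0.5, 0.5]]),
    "right": np.array([[0.2, 0.8],
                       [0.1, 0.9]]),
}

# R[a][s] = expected immediate reward for taking action a in state s.
R = {
    "left":  np.array([0.0, 1.0]),
    "right": np.array([0.5, 0.0]),
}

# Sanity check: each row of each transition matrix is a distribution.
for a in A:
    assert np.allclose(P[a].sum(axis=1), 1.0)
```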

SLIDE 4

Planning and Reinforcement Learning

Planning: given MDP M, maximise expected future reward.

Reinforcement Learning: given sample sequences from the MDP,
$$\{s^k_1, a^k_1, r^k_1, s^k_2, a^k_2, \ldots, s^k_{T^k}\}_{k=1}^{K} \sim M,$$
maximise expected future reward.

SLIDE 5

Simulation-Based Search

A simulator M is a generative model of an MDP

Given a state s_t and action a_t, the simulator can generate a next state s_{t+1} and reward r_{t+1}.

A simulator can be used to generate sequences of experience, starting from any "root" state s_1:
$$\{s_1, a_1, r_1, s_2, a_2, \ldots, s_T\} \sim M$$

Simulation-based search applies reinforcement learning to simulated experience.
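A minimal sketch of such a generative model in Python; the `step` interface and the random-walk dynamics are invented for illustration:

```python
import random

class Simulator:
    """A generative model of an MDP: maps (state, action) to a sampled
    (next_state, reward). Here: a toy random walk on states 0..10, invented
    for illustration; reaching either end terminates the episode."""

    def step(self, s, a):
        # a is -1 (left) or +1 (right); the move succeeds with probability 0.8
        s_next = s + a if random.random() < 0.8 else s - a
        reward = 1.0 if s_next >= 10 else 0.0
        done = s_next <= 0 or s_next >= 10
        return s_next, reward, done

def simulate_episode(sim, s1, policy):
    """Generate one sequence {s1, a1, r1, s2, a2, ...} ~ M, pi from root s1."""
    episode, s, done = [], s1, False
    while not done:
        a = policy(s)
        s_next, r, done = sim.step(s, a)
        episode.append((s, a, r))
        s = s_next
    return episode

# Usage: a uniform-random simulation policy from root state 5.
episode = simulate_episode(Simulator(), 5, lambda s: random.choice([-1, +1]))
```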

SLIDE 6

Monte-Carlo Simulation

Given a model M and a simulation policy π(s, a) = Pr(a | s).

Simulate K episodes from root state s_1:
$$\{s_1, a^k_1, r^k_1, s^k_2, a^k_2, \ldots, s^k_{T^k}\}_{k=1}^{K} \sim M, \pi$$

Evaluate the state by mean total reward (Monte-Carlo evaluation):
$$V(s_1) = \frac{1}{K} \sum_{k=1}^{K} \sum_{t=1}^{T^k} r^k_t \;\xrightarrow{P}\; \mathbb{E}\left[\, \sum_{t=1}^{T} r_t \;\middle|\; s_1 \right]$$
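This estimator is a one-liner in code; a sketch, where `rollout(s1)` is assumed to simulate one episode under π (e.g. via the `simulate_episode` sketch above) and return its total reward:

```python
def mc_evaluate(rollout, s1, K=1000):
    """Monte-Carlo evaluation: the mean total reward of K episodes simulated
    from root state s1 estimates V(s1); by the law of large numbers it
    converges in probability to the expected total reward E[sum_t r_t | s1].

    `rollout(s1)` is assumed to run one episode under the simulation policy
    and return its total reward."""
    return sum(rollout(s1) for _ in range(K)) / K
```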

SLIDE 7

Simple Monte-Carlo Search

Given a model M and a simulation policy π.

For each action a ∈ A, simulate K episodes from root state s_t:
$$\{s_t, a, r^k_1, s^k_2, a^k_2, \ldots, s^k_{T^k}\}_{k=1}^{K} \sim M, \pi$$

Evaluate actions by mean total reward:
$$Q(s_t, a) = \frac{1}{K} \sum_{k=1}^{K} \sum_{u=1}^{T^k} r^k_u \;\xrightarrow{P}\; \mathbb{E}\left[\, \sum_{u=1}^{T} r_u \;\middle|\; s_t, a \right]$$

Select the real action with maximum value:
$$a_t = \operatorname*{argmax}_{a \in A} Q(s_t, a)$$
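Putting the pieces together, a sketch of simple Monte-Carlo search against the hypothetical `step(s, a) -> (s_next, reward, done)` simulator interface used above:

```python
import random

def simple_mc_search(sim, s_t, actions, policy, K=100):
    """Simple Monte-Carlo search: for each candidate first action a, simulate
    K episodes from s_t that begin with a and then follow the simulation
    policy; score a by its mean total reward Q(s_t, a); act greedily."""
    def rollout(first_action):
        s, a, total, done = s_t, first_action, 0.0, False
        while not done:
            s, r, done = sim.step(s, a)
            total += r
            a = policy(s)
        return total

    Q = {a: sum(rollout(a) for _ in range(K)) / K for a in actions}
    return max(Q, key=Q.get)

# Usage with the toy simulator sketched earlier:
# a_t = simple_mc_search(Simulator(), 5, [-1, +1],
#                        lambda s: random.choice([-1, +1]), K=200)
```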

SLIDE 8

Monte-Carlo Tree Search

Simulate sequences starting from root state s_1.
Build a search tree containing all visited states.
Repeat (each simulation):

- Evaluate states V(s) by the mean total reward of all sequences through node s
- Improve the simulation policy by picking the child s′ with maximum V(s′)

Converges on the optimal search tree, V(s) → V*(s).
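A compact sketch of this loop under the same hypothetical simulator interface. As on the slide, the tree policy is purely greedy in the estimated values; practical implementations add an exploration bonus (e.g. UCT), which is omitted here:

```python
import random

def mcts(sim, s1, actions, n_simulations=1000):
    """Simplified Monte-Carlo tree search: every visited (state, action) pair
    is added to the tree and evaluated by the mean total reward of all
    simulations passing through it; inside the tree the simulation policy
    acts greedily on those values, outside it acts randomly (roll-out)."""
    N, W = {}, {}   # visit counts and summed total rewards per (state, action)

    def value(s, a):
        return W[(s, a)] / N[(s, a)]

    for _ in range(n_simulations):
        s, done, path, total = s1, False, [], 0.0
        while not done:
            untried = [a for a in actions if (s, a) not in N]
            if untried:                                  # outside tree: roll-out
                a = random.choice(untried)
            else:                                        # inside tree: greedy
                a = max(actions, key=lambda a: value(s, a))
            path.append((s, a))
            s, r, done = sim.step(s, a)
            total += r
        for s_i, a_i in path:                            # back up mean rewards
            N[(s_i, a_i)] = N.get((s_i, a_i), 0) + 1
            W[(s_i, a_i)] = W.get((s_i, a_i), 0.0) + total

    return max(actions, key=lambda a: W.get((s1, a), 0.0) / N.get((s1, a), 1))
```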

SLIDE 9

[Figure: a Monte-Carlo search tree for a two-player game, with alternating max and min levels over actions a1, a2, a3 and b1, b2, b3. Nodes are labelled with win/visit counts (e.g. 6/7 and 2/3; root: 9/12), and roll-outs below the tree end in terminal rewards.]

SLIDE 10

Advantages of MC Tree Search

- Highly selective best-first search
- Focused on the future
- Uses sampling to break the curse of dimensionality
- Works for "black-box" simulators (only requires samples)
- Computationally efficient, anytime, parallelisable

SLIDE 11

Disadvantages of MC Tree Search

- Monte-Carlo estimates have high variance
- No generalisation between related states

SLIDE 12

Temporal-Difference Search

Simulate sequences starting from root state s_1.
Build a search tree containing all visited states.
Repeat (each simulation):

- Evaluate states V(s) by temporal-difference learning
- Improve the simulation policy by picking the child s′ with maximum V(s′)

Converges on the optimal search tree, V(s) → V*(s).
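The change from Monte-Carlo tree search is the evaluation step: values are updated from bootstrapped one-step targets rather than complete returns. A minimal sketch of the tabular TD(0) update along one simulated episode (the step size α is a free parameter):

```python
def td_update(V, episode, alpha=0.1):
    """Tabular TD(0) along one simulated episode [(s, a, r, s_next), ...].
    Moves V(s) toward the bootstrapped target r + V(s_next); with gamma = 1
    and terminating episodes, V at an unseen or terminal state is taken as 0."""
    for s, a, r, s_next in episode:
        v = V.get(s, 0.0)
        V[s] = v + alpha * (r + V.get(s_next, 0.0) - v)
```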

SLIDE 13

Linear Temporal-Difference Search

Simulate sequences starting from root state s_1.
Build a linear function approximator V(s) = φ(s)⊤θ over all visited states.
Repeat (each simulation):

- Evaluate states V(s) by linear temporal-difference learning
- Improve the simulation policy by picking the child s′ with maximum V(s′)
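The corresponding linear TD(0) update; the feature map φ and the one-hot example below are invented for illustration (any fixed feature extractor works):

```python
import numpy as np

def linear_td_update(theta, phi, episode, alpha=0.01):
    """Linear TD(0): V(s) = phi(s)^T theta, updated along one simulated
    episode [(s, r, s_next, done), ...]. The TD error uses the bootstrapped
    target r + V(s_next), with V(s_next) = 0 at termination (gamma = 1)."""
    for s, r, s_next, done in episode:
        v = phi(s) @ theta
        v_next = 0.0 if done else phi(s_next) @ theta
        theta += alpha * (r + v_next - v) * phi(s)   # gradient step on phi(s)
    return theta

# Usage with an invented one-hot feature map over 11 random-walk states:
phi = lambda s: np.eye(11)[s]
theta = np.zeros(11)
```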

SLIDE 14

Demo

SLIDE 15

Planning Under Uncertainty

Consider a history h_t of actions, observations, and rewards:
h_t = a_1, o_1, r_1, ..., a_t, o_t, r_t

- What if the state s is unknown? i.e. we only have some beliefs b(s) = P(s | h_t)
- What if the MDP dynamics P are unknown? i.e. we only have some beliefs b(P) = p(P | h_t)
- What if the MDP reward function R is unknown? i.e. we only have some beliefs b(R) = p(R | h_t)

SLIDE 16

Belief State MDP

- Plan in an augmented state space over beliefs
- Each action now transitions to a new belief state
- This defines an enormous MDP over belief states

SLIDE 17

Histories and Belief States

[Figure: a history tree with nodes ε, a1, a2, a1o1, a1o2, a2o1, a2o2, a1o1a1, a1o1a2, ..., shown alongside the corresponding belief tree with nodes P(s), P(s|a1), P(s|a2), P(s|a1o1), P(s|a1o2), P(s|a2o1), P(s|a2o2), P(s|a1o1a1), P(s|a1o1a2), ...]

SLIDE 18

Belief State Planning

- We can apply simulation-based search to the belief-state MDP, since these methods are effective in very large state spaces
- Unfortunately, updating belief states is slow
- Belief-state planners therefore cannot scale up to realistic problems
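To see where the cost comes from: an exact Bayes-filter belief update is O(|S|²) per step, which is prohibitive for large state spaces. A sketch, with an assumed observation model Z[o] giving the likelihoods P(o | s′):

```python
import numpy as np

def belief_update(b, a, o, P, Z):
    """Exact Bayes filter: b'(s') ∝ P(o | s') * sum_s P(s' | s, a) b(s).
    P[a] is the |S|x|S| transition matrix, Z[o] the length-|S| vector of
    observation likelihoods P(o | s'). The matrix-vector product makes each
    update O(|S|^2), which is what makes belief-state planning slow."""
    predicted = b @ P[a]            # sum_s b(s) P(s' | s, a)
    b_new = Z[o] * predicted        # weight by observation likelihood
    return b_new / b_new.sum()      # normalise to a distribution
```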

SLIDE 19

Root Sampling

- Each simulation, pick one world from the root beliefs: sample a state, transition dynamics, and reward function
- Run the simulation as if that world were real
- Build the plan in history space (fast!)
- Evaluate histories V(h), e.g. by Monte-Carlo evaluation
- Improve the simulation policy, e.g. by greedy action selection: $a_t = \operatorname*{argmax}_a V(h_t a)$
- Never updates beliefs during search
- But still converges on the optimal search tree with respect to the beliefs, V(h) → V*(h)
- Intuitively, it averages over different worlds; the tree provides a filter
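A sketch of the root-sampling loop in the spirit of POMCP-style planners; the `sample_world()` interface and the tuple encoding of histories are invented for illustration:

```python
def root_sampling_search(sample_world, actions, n_simulations=1000):
    """Root sampling: each simulation draws one concrete world (state +
    dynamics + rewards) from the root beliefs, then plans in history space.
    Histories are tuples (a1, o1, a2, o2, ...); V(h) is the mean total reward
    of simulations reaching h, so averaging over simulations averages over
    sampled worlds, and the history tree acts as a filter."""
    N, W = {}, {}   # visits and summed total reward per history node

    def V(node):
        # Optimistic value for unvisited nodes drives exploration.
        return W[node] / N[node] if N.get(node) else float("inf")

    for _ in range(n_simulations):
        world = sample_world()       # one world, held fixed for this simulation
        h, total, done = (), 0.0, False
        while not done:
            a = max(actions, key=lambda a: V(h + (a,)))  # greedy over action nodes
            o, r, done = world.step(a)                   # beliefs never updated here
            total += r
            h = h + (a, o)
        for i in range(len(h) + 1):  # back up every prefix of the history
            node = h[:i]
            N[node] = N.get(node, 0) + 1
            W[node] = W.get(node, 0.0) + total

    return max(actions, key=lambda a: W.get((a,), 0.0) / N.get((a,), 1))
```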

SLIDE 20

Demo