Reinforcement Learning and Simulation-Based Search
David Silver
Outline
1 Reinforcement Learning
2 Simulation-Based Search
3 Planning Under Uncertainty
Reinforcement Learning
Markov Decision Process
Definition: A Markov Decision Process is a tuple $\langle S, A, P, R \rangle$, where
- $S$ is a finite set of states
- $A$ is a finite set of actions
- $P$ is a state transition probability matrix, $P^a_{ss'} = \mathbb{P}[s' \mid s, a]$
- $R$ is a reward function, $R^a_s = \mathbb{E}[r \mid s, a]$
Assume for this talk that all sequences terminate, γ = 1
Planning and Reinforcement Learning
Planning: given MDP $\mathcal{M}$, maximise expected future reward.
Reinforcement learning: given sample sequences from the MDP,
$$\{s_1, a^k_1, r^k_1, s^k_2, a^k_2, \ldots, s^k_{T^k}\}_{k=1}^{K} \sim \mathcal{M},$$
maximise expected future reward.
Simulation-Based Search
A simulator $\mathcal{M}$ is a generative model of an MDP:
- Given a state $s_t$ and action $a_t$, the simulator can generate a next state $s_{t+1}$ and reward $r_{t+1}$.
- A simulator can be used to generate sequences of experience starting from any "root" state $s_1$: $\{s_1, a_1, r_1, s_2, a_2, \ldots, s_T\} \sim \mathcal{M}$
- Simulation-based search applies reinforcement learning to this simulated experience (a minimal simulator interface is sketched below).
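To make the "black-box" requirement concrete, the sketch below shows the single operation such a simulator needs to expose. This is a hypothetical Python interface, not code from the talk; `CoinFlipSim` and `rollout` are illustrative names, and the later sketches reuse `rollout`.

```python
import random

class Simulator:
    """Generative model of an MDP: sample (next_state, reward) from (state, action)."""
    def sample(self, state, action):
        raise NotImplementedError

class CoinFlipSim(Simulator):
    """Hypothetical toy simulator: 'stop' terminates; 'go' flips a fair coin."""
    def sample(self, state, action):
        if action == "stop":
            return None, 0.0                      # None marks a terminal state
        heads = random.random() < 0.5
        return ("heads" if heads else "tails"), (1.0 if heads else -1.0)

def rollout(sim, state, policy, max_steps=1000):
    """Generate one episode of rewards {r_1, r_2, ...} ~ M, pi from a root state."""
    rewards = []
    for _ in range(max_steps):
        state, reward = sim.sample(state, policy(state))
        rewards.append(reward)
        if state is None:                         # episode terminated
            break
    return rewards
```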
Monte-Carlo Search
Monte-Carlo Simulation
Given a model $\mathcal{M}$ and a simulation policy $\pi(s, a) = \Pr(a \mid s)$:
- Simulate $K$ episodes from root state $s_1$: $\{s_1, a^k_1, r^k_1, s^k_2, a^k_2, \ldots, s^k_{T^k}\}_{k=1}^{K} \sim \mathcal{M}, \pi$
- Evaluate the state by mean total reward (Monte-Carlo evaluation, sketched below):
$$V(s_1) = \frac{1}{K} \sum_{k=1}^{K} \sum_{t=1}^{T^k} r^k_t \;\xrightarrow{P}\; \mathbb{E}\!\left[\sum_{t=1}^{T} r_t \,\middle|\, s_1\right]$$
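This evaluation is only a few lines of code. A minimal sketch, reusing the hypothetical `rollout` helper from the simulator sketch above:

```python
def mc_evaluate(sim, root_state, policy, num_episodes):
    """Estimate V(s1) as the mean total reward over K simulated episodes."""
    total = 0.0
    for _ in range(num_episodes):
        total += sum(rollout(sim, root_state, policy))
    return total / num_episodes
```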
Simple Monte-Carlo Search
Given a model $\mathcal{M}$ and a simulation policy $\pi$, for each action $a \in A$:
- Simulate $K$ episodes from root state $s_t$: $\{s_t, a, r^k_1, s^k_2, a^k_2, \ldots, s^k_{T^k}\}_{k=1}^{K} \sim \mathcal{M}, \pi$
- Evaluate each action by mean total reward:
$$Q(s_t, a) = \frac{1}{K} \sum_{k=1}^{K} \sum_{u=1}^{T^k} r^k_u \;\xrightarrow{P}\; \mathbb{E}\!\left[\sum_{u=1}^{T} r_u \,\middle|\, s_t, a\right]$$
- Select the real action with maximum value (see the sketch below):
$$a_t = \operatorname*{argmax}_{a \in A} Q(s_t, a)$$
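A minimal sketch of simple Monte-Carlo search under the same hypothetical `Simulator`/`rollout` interface as above (the first action is forced, then the simulation policy takes over):

```python
def simple_mc_search(sim, root_state, actions, policy, num_episodes):
    """Evaluate each action by mean total reward, then act greedily."""
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(num_episodes):
            state, reward = sim.sample(root_state, a)    # force first action a
            episode_return = reward
            if state is not None:
                episode_return += sum(rollout(sim, state, policy))
            total += episode_return
        q[a] = total / num_episodes                      # Q(s_t, a)
    return max(q, key=q.get)                             # a_t = argmax_a Q(s_t, a)
```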
Monte-Carlo Tree Search
- Simulate sequences starting from root state $s_1$
- Build a search tree containing all visited states
- Repeat (each simulation):
  - Evaluate states $V(s)$ by the mean total reward of all sequences through node $s$
  - Improve the simulation policy by picking the child $s'$ with maximum $V(s')$
- Converges on the optimal search tree, $V(s) \to V^*(s)$ (a sketch follows this list)
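A minimal sketch of this loop, under the same hypothetical simulator interface. It tracks action values Q(s, a) rather than child values V(s'), a common bookkeeping variant of the greedy in-tree policy the slide describes; practical implementations also add an exploration bonus (e.g. UCT), which this sketch omits:

```python
from collections import defaultdict

def mcts(sim, root_state, actions, rollout_policy, num_sims):
    """Simplified Monte-Carlo tree search: greedy selection inside the tree,
    one expansion per simulation, roll-out beyond the tree, and
    mean-total-reward backups."""
    n = defaultdict(int)          # visit counts N(s, a)
    q = defaultdict(float)        # action values Q(s, a)
    tree = {root_state}           # states expanded into the search tree

    for _ in range(num_sims):
        state, path, rewards = root_state, [], []
        # Selection: act greedily while inside the tree (unvisited actions first)
        while state is not None and state in tree:
            action = max(actions, key=lambda a: float("inf")
                         if n[(state, a)] == 0 else q[(state, a)])
            path.append((state, action))
            state, reward = sim.sample(state, action)
            rewards.append(reward)
        # Expansion + roll-out: add the new state, finish with the roll-out policy
        if state is not None:
            tree.add(state)
            rewards += rollout(sim, state, rollout_policy)
        # Backup: update each visited (s, a) toward the mean total reward
        total = sum(rewards)
        for s, a in path:
            n[(s, a)] += 1
            q[(s, a)] += (total - q[(s, a)]) / n[(s, a)]

    return max(actions, key=lambda a: q[(root_state, a)])
```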
[Figure: a Monte-Carlo search tree for a two-player game. Tree nodes alternate max and min levels with actions a1-a3 and b1-b3, each labelled with its wins/visits count (root: 9/12); roll-outs below the search tree terminate in rewards of 1.]
Advantages of MC Tree Search
- Highly selective best-first search
- Focused on the future
- Uses sampling to break the curse of dimensionality
- Works for "black-box" simulators (only requires samples)
- Computationally efficient, anytime, and parallelisable
Disadvantages of MC Tree Search
- Monte-Carlo estimates have high variance
- No generalisation between related states
Temporal-Difference Search
- Simulate sequences starting from root state $s_1$
- Build a search tree containing all visited states
- Repeat (each simulation):
  - Evaluate states $V(s)$ by temporal-difference learning (the update is given below)
  - Improve the simulation policy by picking the child $s'$ with maximum $V(s')$
- Converges on the optimal search tree, $V(s) \to V^*(s)$
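Concretely, the evaluation step is the standard TD(0) update. With $\gamma = 1$ as assumed earlier, after each simulated transition $(s_t, r_{t+1}, s_{t+1})$ the value estimate moves toward its bootstrapped target, for some step size $\alpha$:
$$V(s_t) \leftarrow V(s_t) + \alpha \left( r_{t+1} + V(s_{t+1}) - V(s_t) \right)$$
Unlike Monte-Carlo backups, this bootstraps from $V(s_{t+1})$, reducing variance at the cost of bias.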
Linear Temporal-Difference Search
- Simulate sequences starting from root state $s_1$
- Build a linear function approximator $V(s) = \phi(s)^\top \theta$ over all visited states
- Repeat (each simulation):
  - Evaluate states $V(s)$ by linear temporal-difference learning
  - Improve the simulation policy by picking the child $s'$ with maximum $V(s')$ (see the sketch below)
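A minimal sketch of linear TD search, assuming a user-supplied feature map `phi` (the simulator interface and the one-sample greedy lookahead are the same hypothetical simplifications as in the earlier sketches; a practical version would add exploration, e.g. ε-greedy):

```python
import numpy as np

def linear_td_search(sim, root_state, actions, phi, num_sims, alpha=0.01):
    """Linear TD search: V(s) = phi(s) . theta, updated by TD(0) with gamma = 1
    along simulated episodes; the simulation policy is greedy w.r.t. V."""
    theta = np.zeros_like(phi(root_state), dtype=float)
    v = lambda s: 0.0 if s is None else float(phi(s) @ theta)

    for _ in range(num_sims):
        state = root_state
        while state is not None:
            # Greedy lookahead: sample each action once, pick max r + V(s').
            samples = {a: sim.sample(state, a) for a in actions}
            action = max(actions, key=lambda a: samples[a][1] + v(samples[a][0]))
            next_state, reward = samples[action]
            # TD(0) update: theta <- theta + alpha * (r + V(s') - V(s)) * phi(s)
            theta += alpha * (reward + v(next_state) - v(state)) * phi(state)
            state = next_state
    return theta
```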
Demo
Planning Under Uncertainty
Consider a history $h_t$ of actions, observations, and rewards: $h_t = a_1, o_1, r_1, \ldots, a_t, o_t, r_t$
- What if the state $s$ is unknown? i.e. we only have beliefs $b(s) = P(s \mid h_t)$
- What if the MDP dynamics $P$ are unknown? i.e. we only have beliefs $b(P) = p(P \mid h_t)$
- What if the MDP reward function $R$ is unknown? i.e. we only have beliefs $b(R) = p(R \mid h_t)$
Belief State MDP
- Plan in an augmented state space over beliefs
- Each action now transitions to a new belief state (the update is sketched below)
- This defines an enormous MDP over belief states
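For the unknown-state case, the belief transition is the standard Bayes filter. The observation model $Z(o \mid s', a)$ below is an assumption for illustration (the slides do not define one); after taking action $a$ and observing $o$, the new belief is
$$b'(s') \;\propto\; Z(o \mid s', a) \sum_{s \in S} P^a_{ss'} \, b(s),$$
normalised over $s'$. Computing this sum over the state space at every step is precisely the cost that makes belief-state planning slow.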
Histories and Belief States
[Figure: a history tree and the corresponding belief tree, side by side. The history tree branches from the empty history ε by actions a1, a2 and observations o1, o2 (nodes ε, a1, a2, a1o1, a1o2, a2o1, a2o2, a1o1a1, a1o1a2, ...); the belief tree mirrors it with belief states P(s), P(s|a1), P(s|a2), P(s|a1o1), ..., P(s|a1o1a1), P(s|a1o1a2).]
Belief State Planning
- We can apply simulation-based search to the belief-state MDP, since these methods are effective in very large state spaces
- Unfortunately, updating belief states is slow
- As a result, belief-state planners cannot scale up to realistic problems
Root Sampling
- Each simulation, pick one world from the root beliefs: sample a state, transition dynamics, and reward function
- Run the simulation as if that world were real
- Build the plan in history space (fast!)
- Evaluate histories $V(h)$, e.g. by Monte-Carlo evaluation
- Improve the simulation policy, e.g. by greedy action selection: $a_t = \operatorname*{argmax}_{a} V(h_t a)$
- Never updates beliefs during search
- But still converges on the optimal search tree with respect to the beliefs, $V(h) \to V^*(h)$
- Intuitively, it averages over different worlds; the tree provides the filter (a sketch follows this list)
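A minimal sketch of root sampling. The `beliefs.sample_world()` and `world.step()` interfaces are hypothetical stand-ins for sampling a state, dynamics, and reward function from the root beliefs; values are stored per history node, and beliefs are never touched inside the loop:

```python
from collections import defaultdict

def root_sampling_search(beliefs, actions, num_sims):
    """Root sampling: each simulation commits to one world drawn from the root
    beliefs and runs entirely in history space; beliefs are never updated."""
    n = defaultdict(int)      # visit counts N(h)
    v = defaultdict(float)    # values V(h) = mean total reward through h

    for _ in range(num_sims):
        world = beliefs.sample_world()        # one sampled state/dynamics/rewards
        state, history, total, done = world.state, (), 0.0, False
        while not done:
            # Greedy improvement: a_t = argmax_a V(h_t a), unvisited actions first
            action = max(actions, key=lambda a: float("inf")
                         if n[history + (a,)] == 0 else v[history + (a,)])
            state, obs, reward, done = world.step(state, action)
            history, total = history + (action, obs), total + reward
        # Monte-Carlo evaluation: back up total reward to every history prefix
        h = ()
        for item in history:
            h += (item,)
            n[h] += 1
            v[h] += (total - v[h]) / n[h]

    return max(actions, key=lambda a: v[(a,)])
```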