Offline Policy-search in Bayesian Reinforcement Learning
Michael Castronovo
University of Liège, Belgium
Advisor: Damien Ernst
15th March 2017
Contents
- Introduction
- Problem Statement
- Offline Prior-based Policy-search (OPPS)
- Artificial Neural Networks for BRL (ANN-BRL)
- Benchmarking for BRL
- Conclusion
2
Introduction
What is Reinforcement Learning (RL)?
A sequential decision-making process in which an agent observes an environment, collects data and reacts appropriately.
Example: Train a Dog with Food Rewards
- Context: Markov-decision process (MDP)
- Single trajectory (= only 1 try)
- Discounted rewards (= early decisions are more important)
- Infinite horizon (= the number of decisions is infinite)
3
The Exploration / Exploitation dilemma (E/E dilemma) An agent has two objectives:
- Increase its knowledge of the environment
- Maximise its short-term rewards
⇒ Find a compromise to avoid suboptimal long-term behaviour

In this work, we assume that:
- The reward function is known
(= agent knows if an action is good or bad)
- The transition function is unknown
(= agent does not know how actions modify the environment)
4
Reasonable assumption: the transition function is not unknown, but rather uncertain:
⇒ We have some prior knowledge about it
⇒ This setting is called Bayesian Reinforcement Learning

What is Bayesian Reinforcement Learning (BRL)?
A Reinforcement Learning problem where we assume some prior knowledge is available at the start, in the form of an MDP distribution.
5
Intuitively...
A process that allows us to simulate decision-making problems similar to the one we expect to face.

Example: A robot has to find the exit of an unknown maze.
→ Perform simulations on other mazes beforehand
→ Learn an algorithm based on those experiences (e.g.: a wall follower)
6
Contents
- Introduction
- Problem Statement
- Offline Prior-based Policy-search (OPPS)
- Artificial Neural Networks for BRL (ANN-BRL)
- Benchmarking for BRL
- Conclusion
7
Problem statement
Let M = (X, U, x_0^M, f_M(·), ρ_M(·), γ) be a given unknown MDP, where
- X = {x^(1), . . . , x^(n_X)} denotes its finite state space
- U = {u^(1), . . . , u^(n_U)} denotes its finite action space
- x_0^M denotes its initial state
- x′ ∼ f_M(x, u) denotes the next state when performing action u in state x
- r_t = ρ_M(x_t, u_t, x_{t+1}) ∈ [R_min, R_max] denotes an instantaneous deterministic, bounded reward
- γ ∈ [0, 1] denotes its discount factor

Let h_t = (x_0^M, u_0, r_0, x_1, . . . , x_{t−1}, u_{t−1}, r_{t−1}, x_t) denote the history observed until time t.
8
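To make this notation concrete, here is a minimal Python sketch of such a finite MDP and of sampling one transition. The class, field names and layout are illustrative assumptions, not the thesis' or the BBRL library's implementation.

```python
import numpy as np

class FiniteMDP:
    """Minimal finite MDP M = (X, U, x0, f, rho, gamma) with a stochastic
    transition function f (one categorical distribution per (x, u) pair)."""

    def __init__(self, n_states, n_actions, x0, transition_probs, rewards, gamma):
        self.n_states = n_states      # |X|
        self.n_actions = n_actions    # |U|
        self.x0 = x0                  # initial state x_0^M
        self.P = transition_probs     # shape (n_states, n_actions, n_states)
        self.R = rewards              # R[x, u, x'] = rho_M(x, u, x')
        self.gamma = gamma

    def step(self, x, u, rng):
        """Sample x' ~ f_M(x, u) and return (x', r) with r = rho_M(x, u, x')."""
        x_next = rng.choice(self.n_states, p=self.P[x, u])
        return x_next, self.R[x, u, x_next]

# Usage: rng = np.random.default_rng(); x_next, r = mdp.step(x, u, rng)
```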
An E/E strategy is a stochastic policy π that, given the current history h_t, returns an action u_t:

    u_t ∼ π(h_t)

The expected return of a given E/E strategy π on MDP M:

    J^π_M = E_M [ Σ_t γ^t r_t ]

where
    x_0 = x_0^M
    x_{t+1} ∼ f_M(x_t, u_t)
    r_t = ρ_M(x_t, u_t, x_{t+1})
9
RL (no prior distribution)
We want to find a high-performance E/E strategy π*_M for a given MDP M:

    π*_M ∈ arg max_π J^π_M

BRL (prior distribution p^0_M(·))
A prior distribution defines a distribution over each uncertain component of M (f_M(·) in our case). Given a prior distribution p^0_M(·), the goal is to find a policy π*, called Bayes optimal:

    π* = arg max_π J^π_{p^0_M(·)}

where J^π_{p^0_M(·)} = E_{M ∼ p^0_M(·)} [ J^π_M ]
10
Contents
- Introduction
- Problem Statement
- Offline Prior-based Policy-search (OPPS)
- Artificial Neural Networks for BRL (ANN-BRL)
- Benchmarking for BRL
- Conclusion
11
Offline Prior-based Policy-search (OPPS)
- 1. Define a rich set of E/E strategies:
→ Build a large set of N formulas
→ Build a formula-based strategy for each formula of this set
- 2. Search for the best E/E strategy on average, according to the given MDP distribution:
→ Formalise this problem as an N-armed bandit problem
12
- 1. Define a rich set of E/E strategies
Let F_K be the discrete set of formulas of size at most K. A formula of size K is obtained by combining K elements among:
- Variables: Q̂^t_1(x, u), Q̂^t_2(x, u), Q̂^t_3(x, u)
- Operators: +, −, ×, /, | · |, 1/·, √·, min(·, ·), max(·, ·)
Examples:
- Formula of size 2: F(x, u) = | Q̂^t_1(x, u) |
- Formula of size 4: F(x, u) = Q̂^t_3(x, u) − | Q̂^t_1(x, u) |

To each formula F ∈ F_K, we associate a formula-based strategy π_F, defined as follows:

    π_F(h_t) ∈ arg max_{u ∈ U} F(x_t, u)
13
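As a rough illustration of how such a formula-based strategy can be evaluated, here is a minimal Python sketch (not the thesis' implementation); the formula is represented as a plain function of the three Q-value estimates, and `qhat` is an assumed array holding the estimates Q̂^t_i(x, u).

```python
import numpy as np

def formula_strategy(formula, qhat, x):
    """Greedy action selection u in arg max_u F(x, u), where `formula`
    combines the three Q-value estimates Q1, Q2, Q3 of a state-action pair."""
    scores = [formula(qhat[0, x, u], qhat[1, x, u], qhat[2, x, u])
              for u in range(qhat.shape[2])]
    return int(np.argmax(scores))

# Example: the size-4 formula F(x, u) = Q3(x, u) - |Q1(x, u)|
F = lambda q1, q2, q3: q3 - abs(q1)

qhat = np.random.rand(3, 5, 3)     # 3 variables, 5 states, 3 actions (toy values)
u = formula_strategy(F, qhat, x=2)
```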
Problems:
- F_K is too large
(|F_5| ≃ 300,000 formulas for 3 variables and 11 operators)
- Formulas of F_K are redundant
(= different formulas can define the same policy)

Examples:
- 1. Q̂^t_1(x, u) and Q̂^t_1(x, u) − Q̂^t_3(x, u) + Q̂^t_3(x, u)
- 2. Q̂^t_1(x, u) and −(−Q̂^t_1(x, u))

Solution:
⇒ Reduce F_K
14
Reduction process
→ Partition F_K into equivalence classes, two formulas being equivalent if and only if they lead to the same policy
→ Retrieve the formula of minimal length of each class into a set F̄_K

Example: |F̄_5| ≃ 3,000 while |F_5| ≃ 300,000

Computing F̄_K exactly may be too expensive. We instead use an efficient heuristic approach to compute a good approximation of this set.
15
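The heuristic itself is not detailed on this slide. The following Python sketch only illustrates one plausible way to approximate the reduction: fingerprint each formula by the greedy actions it induces on randomly sampled Q-values and keep the shortest formula per fingerprint. The sampling scheme, fingerprint and names are assumptions made for illustration.

```python
import numpy as np

def reduce_formulas(formulas, sizes, n_samples=1000, n_actions=4, seed=0):
    """Approximate the partition of F_K into equivalence classes.

    Two formulas are treated as equivalent when they pick the same greedy
    action on every randomly sampled set of Q-values; the shortest formula
    of each class is kept."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_samples, 3, n_actions))   # (samples, variables, actions)

    shortest = {}  # fingerprint -> (size, formula)
    for F, size in zip(formulas, sizes):
        scores = F(q[:, 0, :], q[:, 1, :], q[:, 2, :])    # (samples, actions)
        fingerprint = tuple(np.argmax(scores, axis=1))    # greedy action per sample
        if fingerprint not in shortest or size < shortest[fingerprint][0]:
            shortest[fingerprint] = (size, F)
    return [F for _, F in shortest.values()]

# Example: Q1 and Q1 - Q3 + Q3 collapse to a single representative
formulas = [lambda q1, q2, q3: q1,
            lambda q1, q2, q3: q1 - q3 + q3,
            lambda q1, q2, q3: q3 - np.abs(q1)]
reduced = reduce_formulas(formulas, sizes=[1, 3, 4])
```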
- 2. Search for the best E/E strategy on average

A naive approach based on Monte-Carlo simulations (= evaluating all strategies) is time-inefficient, even after the reduction of the set of formulas.

Problem:
In order to discriminate between the formulas, we need to compute an accurate estimation of J^π_{p^0_M(·)} for each formula, which requires a large number of simulations.

Solution:
Distribute the computational resources efficiently.
⇒ Formalise this problem as a multi-armed bandit problem and use a well-studied algorithm to solve it.
16
What is a multi-armed bandit problem?
A reinforcement learning problem where the agent faces several bandit machines and has to identify the one providing the highest reward on average within a given number of tries.
17
Formalisation
Formalise this search as an N-armed bandit problem.
- To each formula F_n ∈ F̄_K (n ∈ {1, . . . , N}), we associate an arm
- Pulling arm n consists in randomly drawing an MDP M according to p^0_M(·), and performing a single simulation of policy π_{F_n} on M
- The reward associated with arm n is the observed discounted return of π_{F_n} on M

⇒ This defines a multi-armed bandit problem for which many algorithms have been proposed (e.g.: UCB1, UCB-V, KL-UCB, . . . )
18
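As a sketch of how such a bandit could be solved, here is a minimal UCB1 loop in Python. The thesis does not prescribe this exact code; `draw_mdp` and `simulate` are assumed helpers that sample an MDP from the prior and return the discounted return of a formula-based strategy on it.

```python
import math

def ucb1(strategies, draw_mdp, simulate, budget):
    """UCB1 over N arms, where arm n is the formula-based strategy n.
    Pulling an arm = one simulation of that strategy on an MDP drawn
    from the prior; the reward is the observed discounted return."""
    n = len(strategies)
    counts = [0] * n
    means = [0.0] * n

    for t in range(1, budget + 1):
        if t <= n:                         # play each arm once first
            arm = t - 1
        else:
            arm = max(range(n),
                      key=lambda k: means[k] + math.sqrt(2 * math.log(t) / counts[k]))
        ret = simulate(strategies[arm], draw_mdp())   # discounted return
        counts[arm] += 1
        means[arm] += (ret - means[arm]) / counts[arm]

    return max(range(n), key=lambda k: means[k])      # best strategy on average
```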
Learning Exploration/Exploitation in Reinforcement Learning
- M. Castronovo, F. Maes, R. Fonteneau & D. Ernst (EWRL 2012, 8 pages)
BAMCP versus OPPS: an Empirical Comparison
- M. Castronovo, D. Ernst & R. Fonteneau (BENELEARN 2014, 8 pages)
19
Contents
- Introduction
- Problem Statement
- Offline Prior-based Policy-search (OPPS)
- Artificial Neural Networks for BRL (ANN-BRL)
- Benchmarking for BRL
- Conclusion
20
Artificial Neural Networks for BRL (ANN-BRL)
We exploit an analogy between decision-making and classification problems.

A reinforcement learning problem consists in finding a policy π which associates an action u ∈ U to any history h.
A multi-class classification problem consists in finding a rule C(·) which associates a class c ∈ {1, . . . , K} to any vector v ∈ R^n (n ∈ N).

⇒ Formalise a BRL problem as a classification problem in order to use any classification algorithm, such as Artificial Neural Networks.
21
- 1. Generate a training dataset:
→ Perform simulations on MDPs drawn from p^0_M(·)
→ For each encountered history, recommend an action
→ Reprocess each history h into a vector of fixed size
⇒ Extract a fixed set of features (= variables for OPPS)
- 2. Train ANNs:
⇒ Use a boosting algorithm
22
- 1. Generate a training dataset
In order to generate a trajectory, we need a policy:
- A random policy?
Con: Lack of histories for late decisions
- An optimal policy? (f_M(·) is known for M ∼ p^0_M(·))
Con: Lack of histories for early decisions
⇒ Why not both?

Let π^(i) be an ε-optimal policy used for drawing trajectory i (out of a total of n trajectories), with ε = i/n:
π^(i)(h_t) = u* with probability 1 − ε, and is drawn uniformly at random from U otherwise.
23
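A minimal Python sketch of this mixed data-generation policy, assuming an `optimal_action(mdp, x)` helper is available (which is possible here since f_M(·) is known for sampled MDPs); names and structure are illustrative only.

```python
import random

def make_eps_optimal_policy(i, n, optimal_action, actions, rng=random):
    """Policy pi^(i) used to draw trajectory i out of n: with probability
    1 - eps it plays the optimal action of the sampled MDP, otherwise a
    uniformly random action, where eps = i / n."""
    eps = i / n

    def policy(mdp, x):
        if rng.random() < 1.0 - eps:
            return optimal_action(mdp, x)   # optimal w.r.t. the known f_M
        return rng.choice(actions)          # uniform exploration
    return policy

# Trajectory 1 is almost optimal, trajectory n is almost fully random:
# policies = [make_eps_optimal_policy(i, n, optimal_action, actions) for i in range(1, n + 1)]
```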
To each history h^(1)_0, . . . , h^(1)_{T−1}, . . . , h^(n)_0, . . . , h^(n)_{T−1} observed during the simulations, we associate a label for each action:
- 1 if we recommend the action
- −1 otherwise

Example: U = {u^(1), u^(2), u^(3)} : h^(1) ↔ (−1, 1, −1) ⇒ We recommend action u^(2)

We recommend actions which are optimal w.r.t. M (f_M(·) is known for M ∼ p^0_M(·)).
24
All histories are reprocessed in order to feed the ANNs with vectors of fixed size.
⇒ Extract a fixed set of N features: φ_{h_t} = [φ^(1)_{h_t}, . . . , φ^(N)_{h_t}]

We considered two types of features:
- Q-values:
φ_{h_t} = [Q_{h_t}(x_t, u^(1)), . . . , Q_{h_t}(x_t, u^(n_U))]
- Transition counters:
φ_{h_t} = [C_{h_t}(⟨x^(1), u^(1), x^(1)⟩), . . . , C_{h_t}(⟨x^(n_X), u^(n_U), x^(n_X)⟩)]
25
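To illustrate the second feature type, here is a small Python sketch computing transition counters from a history; the history format (a list of (x, u, r, x') transitions) and the function name are assumptions for illustration, not the thesis' code.

```python
import numpy as np

def transition_counters(history, n_states, n_actions):
    """Feature vector counting how often each transition <x, u, x'> was
    observed in the history h_t (flattened to a fixed-size vector)."""
    counts = np.zeros((n_states, n_actions, n_states))
    for x, u, r, x_next in history:
        counts[x, u, x_next] += 1
    return counts.ravel()          # length n_states * n_actions * n_states

# Example on a toy history of 3 transitions in a 2-state, 2-action MDP
history = [(0, 1, 0.0, 1), (1, 0, 1.0, 1), (1, 0, 1.0, 0)]
phi = transition_counters(history, n_states=2, n_actions=2)   # 8 features
```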
- 2. Train ANNs
Adaboost algorithm:
- 1. Associate a weight to each sample of the training dataset
- 2. Train a weak classifier on the weighted training dataset
- 3. Increase the weights of the samples misclassified by the
combined weak classifiers trained previously
- 4. Repeat from Step 2
Problems
- Adaboost only addresses two-class classification problems
(reminder: we have one class for each action)
⇒ Use the SAMME algorithm instead
- Backpropagation does not take the weights of the samples into account
⇒ Use a re-sampling algorithm for the training dataset
26
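Since backpropagation ignores per-sample boosting weights, one common workaround (assumed here for illustration, not quoted from the thesis) is to resample the training set in proportion to the weights before each boosting round, so that a standard unweighted learner can be trained. A minimal Python sketch:

```python
import numpy as np

def resample_by_weight(X, y, weights, rng=None):
    """Draw a new training set of the same size, sampling each example
    with probability proportional to its boosting weight, so that a
    standard (unweighted) backpropagation learner can be used."""
    rng = rng or np.random.default_rng()
    p = np.asarray(weights, dtype=float)
    p /= p.sum()
    idx = rng.choice(len(X), size=len(X), replace=True, p=p)
    return X[idx], y[idx]

# One boosting round would then train an ANN on the resampled set and
# update the weights of the misclassified examples as in SAMME.
```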
Approximate Bayes Optimal Policy Search using NNs
- M. Castronovo, V. François-Lavet, R. Fonteneau, D. Ernst & A. Couëtoux (ICAART 2017, 13 pages)
27
Contents
- Introduction
- Problem Statement
- Offline Prior-based Policy-search (OPPS)
- Artificial Neural Networks for BRL (ANN-BRL)
- Benchmarking for BRL
- Conclusion
28
Benchmarking for BRL
Bayesian literature
Compare the performance of each algorithm on well-chosen MDPs with several prior distributions.

Our benchmark
Compare the performance of each algorithm on a distribution of MDPs, using a (possibly) different distribution as prior knowledge.
Prior distribution = Test distribution ⇒ Accurate case
Prior distribution ≠ Test distribution ⇒ Inaccurate case

Additionally, the computation times of each algorithm are part of our comparison criteria.
29
Motivations:
⇒ No selection bias (= good on a single MDP does not imply good on a distribution of MDPs)
⇒ Accurate case evaluates generalisation capabilities
⇒ Inaccurate case evaluates robustness capabilities
⇒ Real-life applications are subject to time constraints (= computation times cannot be overlooked)
30
The Experimental Protocol
An experiment consists in evaluating the performance of several algorithms on a test distribution p_M(·) when trained on a prior distribution p^0_M(·).

One algorithm → several agents (we test several configurations)

We draw N MDPs M^(1), . . . , M^(N) from the test distribution p_M(·) in advance, and we evaluate the agents as follows:
→ Build policy π offline w.r.t. p^0_M(·)
→ For each sampled MDP M^(i), compute an estimate J̄^π_{M^(i)} of J^π_{M^(i)}
→ Use these values to compute an estimate J̄^π_{p_M(·)} of J^π_{p_M(·)}
31
Estimate of J^π_M:
Truncate each trajectory after T steps (with η = 0.001):

    T = log( η (1 − γ) / R_max ) / log γ

    J^π_M ≈ J̄^π_M = Σ_{t=0}^{T} γ^t r_t

where η denotes the accuracy of our estimate.
32
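As a quick sanity check of this truncation rule, here is a small Python computation of T under assumed values γ = 0.95 and R_max = 2 (the actual benchmark values may differ):

```python
import math

def truncation_horizon(eta, gamma, r_max):
    """Smallest T such that the discarded tail of the discounted sum,
    gamma^T * r_max / (1 - gamma), is at most eta."""
    return math.ceil(math.log(eta * (1.0 - gamma) / r_max) / math.log(gamma))

# Assumed values for illustration: eta = 0.001, gamma = 0.95, r_max = 2
T = truncation_horizon(0.001, 0.95, 2.0)   # T = 207 steps
```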
Estimate of J^π_{p_M(·)}:

We compute μ_π = J̄^π_{p_M(·)} and σ_π, the empirical mean and standard deviation of the results observed on the N MDPs drawn from p_M(·):

    J^π_{p_M(·)} ≈ J̄^π_{p_M(·)} = (1 / N) Σ_{1 ≤ i ≤ N} J̄^π_{M^(i)}

The statistical confidence interval at 95% for J^π_{p_M(·)} is computed as:

    J^π_{p_M(·)} ∈ [ J̄^π_{p_M(·)} − 2 σ_π / √N ; J̄^π_{p_M(·)} + 2 σ_π / √N ]
33
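A minimal Python sketch of this aggregation step (illustrative names; `returns_per_mdp` would hold the per-MDP estimates J̄^π_{M^(i)}):

```python
import numpy as np

def aggregate_returns(returns_per_mdp):
    """Empirical mean of the per-MDP estimated returns and its ~95%
    confidence interval: mean +/- 2 * sigma / sqrt(N)."""
    r = np.asarray(returns_per_mdp, dtype=float)
    mean = r.mean()
    half_width = 2.0 * r.std(ddof=1) / np.sqrt(len(r))
    return mean, (mean - half_width, mean + half_width)

# Example: mean, ci = aggregate_returns([31.2, 35.8, 28.4, 33.1])
```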
Time constraints We want to classify algorithms based on their time performance. More precisely, we want to identify the best algorithm(s) with respect to:
- 1. Offline computation time constraint
- 2. Online computation time constraint
We filter the agents depending on the time constraints:
- Agents not satisfying the time constraints are discarded
- For each algorithm, we select the best agent on average
- We build the list of agents whose performances are not statistically different from the best one observed (Z-test)
34
Experiments
GC - Generalised Chain (n_X = 5, n_U = 3)
GDL - Generalised Double-Loop (n_X = 9, n_U = 2)
Grid (n_X = 25, n_U = 4)
35
Simple algorithms
- Random
- ε-Greedy
- Soft-Max
State-of-the-art BRL algorithms
- BAMCP
- BFS3
- SBOSS
- BEB
Our algorithms
- OPPS-DS
- ANN-BRL
36
Results
[Figure panels: GC, GDL and Grid experiments; axes: offline time bound (in m) vs. online time bound (in ms); legend: Random, ε-Greedy, Soft-max, OPPS-DS, BAMCP, BFS3, SBOSS, BEB, ANN-BRL (Q), ANN-BRL (C)]
Figure: Best algorithms w.r.t offline/online periods (accurate case)
37
Agent          Score on GC     Score on GDL   Score on Grid
Random         31.12 ± 0.90    2.79 ± 0.07    0.22 ± 0.06
ε-Greedy       40.62 ± 1.55    3.05 ± 0.07    6.90 ± 0.31
Soft-Max       34.73 ± 1.74    2.79 ± 0.10    0.00 ± 0.00
BAMCP          35.56 ± 1.27    3.11 ± 0.07    6.43 ± 0.30
BFS3           39.84 ± 1.74    2.90 ± 0.07    3.46 ± 0.23
SBOSS          35.90 ± 1.89    2.81 ± 0.10    4.50 ± 0.33
BEB            41.72 ± 1.63    3.09 ± 0.07    6.76 ± 0.30
OPPS-DS        42.47 ± 1.91    3.10 ± 0.07    7.03 ± 0.30
ANN-BRL (Q)    42.01 ± 1.80    3.11 ± 0.08    6.15 ± 0.31
ANN-BRL (C)    35.95 ± 1.90    2.81 ± 0.09    4.09 ± 0.31
Table: Best algorithms w.r.t Performance (accurate case)
38
Results
[Figure panels: GC, GDL and Grid experiments; axes: offline time bound (in m) vs. online time bound (in ms); legend: Random, ε-Greedy, Soft-max, OPPS-DS, BAMCP, BFS3, SBOSS, BEB, ANN-BRL (Q), ANN-BRL (C)]
Figure: Best algorithms w.r.t offline/online periods (inaccurate case)
39
Agent          Score on GC     Score on GDL   Score on Grid
Random         31.67 ± 1.05    2.76 ± 0.08    0.23 ± 0.06
ε-Greedy       37.69 ± 1.75    2.88 ± 0.07    0.63 ± 0.09
Soft-Max       34.75 ± 1.64    2.76 ± 0.10    0.00 ± 0.00
BAMCP          33.87 ± 1.26    2.85 ± 0.07    0.51 ± 0.09
BFS3           36.87 ± 1.82    2.85 ± 0.07    0.42 ± 0.09
SBOSS          38.77 ± 1.89    2.86 ± 0.07    0.29 ± 0.07
BEB            38.34 ± 1.62    2.88 ± 0.07    0.29 ± 0.05
OPPS-DS        39.29 ± 1.71    2.99 ± 0.08    1.09 ± 0.17
ANN-BRL (Q)    38.76 ± 1.71    2.92 ± 0.07    4.29 ± 0.22
ANN-BRL (C)    36.30 ± 1.82    2.84 ± 0.08    0.91 ± 0.15
Table: Best algorithms w.r.t Performance (inaccurate case)
40
BAMCP versus OPPS: an Empirical Comparison
- M. Castronovo, D. Ernst & R. Fonteneau (BENELEARN 2014, 8 pages)
Benchmarking for Bayesian Reinforcement Learning
- M. Castronovo, D. Ernst, A. Couëtoux & R. Fonteneau (PLoS One 2016, 25 pages)
41
Contents
- Introduction
- Problem Statement
- Offline Prior-based Policy-search (OPPS)
- Artificial Neural Networks for BRL (ANN-BRL)
- Benchmarking for BRL
- Conclusion
42
Conclusion
Summary
- 1. Algorithms:
- Offline Prior-based Policy-search (OPPS)
- Artificial Neural Networks for BRL (ANN-BRL)
- 2. New BRL benchmark
- 3. An open-source library
43
BBRL: Benchmarking tools for Bayesian Reinforcement Learning
https://github.com/mcastron/BBRL/
44
Future work
- OPPS
→ Feature selection (PCA)
→ Continuous formula space
- ANN-BRL
→ Extension to high-dimensional problems
→ Replace ANNs by other ML algorithms (e.g.: SVMs, decision trees)
- BRL Benchmark
→ Design new distributions to identify specific characteristics
- Flexible BRL algorithm