Safe Reinforcement Learning for Decision-Making in Autonomous Driving



  1. Safe Reinforcement Learning for Decision-Making in Autonomous Driving. Edouard Leurent, Odalric-Ambrym Maillard, Denis Efimov, Wilfrid Perruquetti, Yann Blanco. SequeL, Inria Lille – Nord Europe; Valse, Inria Lille – Nord Europe; Renault Group. Lille, April 2019.

  2. Motivation: the classic autonomous driving pipeline.

  3. Motivation: the classic autonomous driving pipeline. In practice, ◮ the behavioural layer is a hand-crafted rule-based system (e.g. an FSM).

  4. Motivation: the classic autonomous driving pipeline. In practice, ◮ the behavioural layer is a hand-crafted rule-based system (e.g. an FSM). ◮ This will not scale to complex scenes, nor handle negotiation and aggressiveness.

  5. Reinforcement Learning: why? Search for an optimal policy \pi(a \mid s):

     \max_\pi \; \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \;\Big|\; a_t \sim \pi(s_t),\; s_{t+1} \sim T(s_t, a_t) \Big]   (the policy return R^T_\pi)

     The dynamics T(s_{t+1} \mid s_t, a_t) are unknown: the agent learns by interacting with the environment. Challenges: ◮ exploration vs. exploitation ◮ partial observability ◮ credit assignment ◮ safety.
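
     As a concrete illustration (not part of the slides), the policy return above can be estimated by Monte-Carlo rollouts. A minimal sketch, assuming a Gymnasium-style environment (reset returns (obs, info), step returns a 5-tuple) and a hypothetical `policy` callable:

         import numpy as np

         def estimate_policy_return(env, policy, gamma=0.95, episodes=100):
             # Monte-Carlo estimate of E[ sum_t gamma^t r(s_t, a_t) ] under `policy`.
             returns = []
             for _ in range(episodes):
                 obs, _info = env.reset()
                 done, truncated, rewards = False, False, []
                 while not (done or truncated):
                     obs, r, done, truncated, _info = env.step(policy(obs))
                     rewards.append(r)
                 discounts = gamma ** np.arange(len(rewards))
                 returns.append(float(np.dot(discounts, rewards)))
             return float(np.mean(returns))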

  6. Reinforcement Learning: how? Model-free: 1. directly optimise \pi(a \mid s) through policy evaluation and policy improvement.

  7. Reinforcement Learning: how? Model-free: 1. directly optimise \pi(a \mid s) through policy evaluation and policy improvement. Model-based: 1. learn a model \hat{T}(s_{t+1} \mid s_t, a_t) of the dynamics; 2. (planning) leverage it to compute

     \max_\pi \; \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \;\Big|\; a_t \sim \pi(s_t),\; s_{t+1} \sim \hat{T}(s_t, a_t) \Big]

     + Better sample efficiency, interpretability, priors.
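
     To make the planning step concrete (this is an illustration, not the planner used in the talk), here is a minimal random-shooting planner over a learned dynamics model; `model`, `reward_fn` and `action_space` are hypothetical stand-ins:

         import numpy as np

         def random_shooting_plan(state, model, reward_fn, action_space,
                                  horizon=10, n_candidates=100, gamma=0.95):
             # Sample random action sequences, roll them out through the learned
             # dynamics model(s, a) -> s', and return the first action of the best one.
             best_return, best_action = -np.inf, None
             for _ in range(n_candidates):
                 s, total = state, 0.0
                 actions = [action_space[np.random.randint(len(action_space))]
                            for _ in range(horizon)]
                 for t, a in enumerate(actions):
                     s_next = model(s, a)                      # learned dynamics \hat{T}
                     total += gamma ** t * reward_fn(s, a, s_next)
                     s = s_next
                 if total > best_return:
                     best_return, best_action = total, actions[0]
             return best_action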

  8. A first benchmark: the highway-env environment. ◮ Vehicle kinematics: the Kinematic Bicycle Model. ◮ Low-level longitudinal and lateral controllers. ◮ Behavioural models: IDM and MOBIL. ◮ Graphical road network and route planning. A few baseline agents (setup): ◮ model-free: DQN; ◮ model-based (planning): Value Iteration and MCTS.
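
     For reference, a minimal interaction loop with highway-env looks roughly like the sketch below, with a random policy standing in for the DQN / Value Iteration / MCTS agents; the environment id and the Gymnasium-style API are assumptions that may differ across library versions:

         import gymnasium as gym
         import highway_env  # noqa: F401  (registers the highway environments)

         env = gym.make("highway-v0", render_mode="rgb_array")
         obs, info = env.reset(seed=0)
         done = truncated = False
         episode_return = 0.0
         while not (done or truncated):
             action = env.action_space.sample()   # stand-in for a trained agent
             obs, reward, done, truncated, info = env.step(action)
             episode_return += reward
         env.close()
         print("episode return:", episode_return)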

  9. A first benchmark: results. [Figure: histograms of episode rewards (left) and episode lengths (right) for Value Iteration, DQN and MCTS.] Videos available (link on the original slide).

  10. The safety / performance trade-off. Let us look at the performance of DQN.

  11. The safety / performance trade-off. Let us look at the performance of DQN. Uncertainty and risk: ◮ high return variance, many collisions; ◮ in RL, we only maximise the return in expectation.

  12. The safety / performance trade-off. Let us look at the performance of DQN. Uncertainty and risk: ◮ high return variance, many collisions; ◮ in RL, we only maximise the return in expectation. Conflicting objectives: ◮ the reward is r_t = \omega_v \cdot \text{velocity} - \omega_c \cdot \text{collision}; ◮ we only control the return R^T_\pi = \sum_t \gamma^t r_t; ◮ for any fixed \omega, there can be many optimal policies with different velocity/collision ratios → the Pareto-optimal curve.

  13. A first formalisation of risk: Constrained Reinforcement Learning. ◮ Augment the MDP with a cost function c : S \times A \times S \to \mathbb{R}, a cost discount \gamma_c, and a budget \beta. ◮ Optimise the reward while keeping the cost under the budget:

      \max_\pi \; \mathbb{E}_\pi \Big[ \sum_{t=0}^{\infty} \gamma^t r_t \Big] \quad \text{s.t.} \quad \mathbb{E}_\pi \Big[ \sum_{t=0}^{\infty} \gamma_c^t c_t \Big] \le \beta

      Budgeted Reinforcement Learning: find a single budget-dependent policy \pi(a \mid s, \beta) that solves all the corresponding CMDPs.
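
      One way to expose this setting in code (an illustration, not from the slides) is an environment wrapper that reports a cost signal alongside the reward; `cost_fn` is a hypothetical user-supplied c(s, a, s'), and a Gymnasium-style API is assumed:

          import gymnasium as gym

          class ConstrainedWrapper(gym.Wrapper):
              """Expose a cost signal c(s, a, s') alongside the reward (CMDP view)."""

              def __init__(self, env, cost_fn):
                  super().__init__(env)
                  self.cost_fn = cost_fn      # hypothetical user-supplied c(s, a, s')
                  self._last_obs = None

              def reset(self, **kwargs):
                  obs, info = self.env.reset(**kwargs)
                  self._last_obs = obs
                  return obs, info

              def step(self, action):
                  obs, reward, done, truncated, info = self.env.step(action)
                  info["cost"] = self.cost_fn(self._last_obs, action, obs)
                  self._last_obs = obs
                  return obs, reward, done, truncated, info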

  14. A BMDP algorithm: Lagrangian Relaxation. Consider the dual problem and replace the hard constraint by a soft penalty weighted by a Lagrangian multiplier \lambda:

      \max_\pi \; \mathbb{E}_\pi \Big[ \sum_t \gamma^t r_t - \lambda \gamma_c^t c_t \Big]

      ◮ Train many policies \pi_k with penalties \lambda_k and recover the corresponding cost budgets \beta_k. ◮ Very data- and memory-heavy.
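
      The recipe above can be sketched as follows; `train_policy` and `evaluate` are hypothetical stand-ins for any RL algorithm trained on the shaped reward r_t - \lambda c_t and for a Monte-Carlo policy evaluation, and the multiplier values are arbitrary:

          def lagrangian_sweep(train_policy, evaluate, lambdas=(0.1, 1.0, 5.0, 15.0, 20.0)):
              # Train one policy per multiplier, then recover the budget it satisfies
              # by measuring its expected cumulative cost.
              policies = {}
              for lam in lambdas:
                  policy = train_policy(reward_fn=lambda r, c, lam=lam: r - lam * c)
                  beta_k = evaluate(policy)["expected_cost"]   # recovered budget beta_k
                  policies[beta_k] = policy
              return policies   # one (budget, policy) pair per lambda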

  15. Our BMDP algorithm: Budgeted Fitted-Q [Carrara et al. 2019], a model-free, value-based, fixed-point iteration procedure.

      Q_r^{n+1}(s_i, a_i, \beta_i) \xleftarrow{\text{regression}} r'_i + \gamma \sum_{a' \in A} \pi_n^A(s'_i, a', \beta_i) \, Q_r^n\big(s'_i, a', \pi_n^B(s'_i, a', \beta_i)\big)

      Q_c^{n+1}(s_i, a_i, \beta_i) \xleftarrow{\text{regression}} c'_i + \gamma_c \sum_{a' \in A} \pi_n^A(s'_i, a', \beta_i) \, Q_c^n\big(s'_i, a', \pi_n^B(s'_i, a', \beta_i)\big)

      (\pi_n^A, \pi_n^B) \leftarrow \arg\max_{(\pi^A, \pi^B) \in \Psi_n} \sum_{a \in A} \pi^A(s, a, \beta) \, Q_r^n\big(s, a, \pi^B(s, a, \beta)\big)

      \Psi_n = \Big\{ \pi^A \in \mathcal{M}(A)^{S \times \mathbb{R}}, \; \pi^B \in \mathbb{R}^{S \times A \times \mathbb{R}}, \;\text{such that}\; \forall s \in S, \forall \beta \in \mathbb{R}: \; \sum_{a \in A} \pi^A(s, a, \beta) \, Q_c^n\big(s, a, \pi^B(s, a, \beta)\big) \le \beta \Big\}
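
      To give a feel for the greedy step, here is a deterministic simplification on a discretised budget grid. This is not the exact BFTQ construction (which mixes two points on the convex hull of the (Q_c, Q_r) pairs); the array shapes and grid are assumptions:

          import numpy as np

          def greedy_budgeted_action(q_r, q_c, beta, budget_grid):
              # q_r, q_c: arrays of shape (n_actions, n_budgets) holding
              # Q_r(s, a, b) and Q_c(s, a, b) for one state s.
              feasible = q_c <= beta                     # (a, b) pairs within the cost budget
              if feasible.any():
                  masked = np.where(feasible, q_r, -np.inf)
                  a, b = np.unravel_index(np.argmax(masked), q_r.shape)
              else:                                      # nothing feasible: least unsafe fallback
                  a, b = np.unravel_index(np.argmin(q_c), q_c.shape)
              return a, budget_grid[b]                   # action and next-step budget beta'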

  16. From dynamic programming to RL: Continuous Reinforcement Learning. 1. Risk-sensitive exploration. 2. Scalable function approximation. 3. Parallel computing of the targets and experiences.

  17. Risk-sensitive exploration.

      Algorithm 1: Risk-sensitive exploration
      1.  Initialise an empty batch D.
      2.  for each intermediate batch do
      3.    for each episode in the batch do
      4.      Sample an initial budget \beta \sim U(B).
      5.      while the episode is not done do
      6.        Update \varepsilon from its schedule.
      7.        Sample z \sim U([0, 1]).
      8.        if z > \varepsilon then sample (a, \beta') from (\pi_A, \pi_B)   // exploit
      9.        else sample (a, \beta') from U(\Delta_{AB})                      // explore
      10.       Append the transition (s, \beta, a, r', c', s') to the batch D.
      11.       Update the episode budget \beta \leftarrow \beta'.
      12.   (\pi_A, \pi_B) \leftarrow BFTQ(D).
      13. return the batch of transitions D.
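
      A compressed Python sketch of this exploration scheme (the intermediate-batch refitting loop is folded into a single pass for brevity); `env`, `pi`, `epsilon_schedule` and `bftq_fit` are hypothetical callables, and a Gymnasium-style API with a cost reported in `info` is assumed:

          import random

          def risk_sensitive_exploration(env, pi, budget_range, n_episodes,
                                         epsilon_schedule, bftq_fit):
              batch, step = [], 0
              for _ in range(n_episodes):
                  beta = random.uniform(*budget_range)            # beta ~ U(B)
                  s, _info = env.reset()
                  done = truncated = False
                  while not (done or truncated):
                      eps = epsilon_schedule(step)
                      if random.random() > eps:                   # exploit (pi_A, pi_B)
                          a, next_beta = pi(s, beta)
                      else:                                       # explore: random (a, beta')
                          a = env.action_space.sample()
                          next_beta = random.uniform(*budget_range)
                      s_next, r, done, truncated, info = env.step(a)
                      c = info.get("cost", 0.0)
                      batch.append((s, beta, a, r, c, s_next))    # transition (s, beta, a, r', c', s')
                      s, beta, step = s_next, next_beta, step + 1
              return bftq_fit(batch), batch                       # (pi_A, pi_B) <- BFTQ(D), and D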

  18. Function approximation. [Figure: neural network for approximating the Q-functions. The state s and the budget \beta are encoded and passed through two hidden layers, which output Q_r(a) and Q_c(a) for every action; shown here with state dimension 2 and 2 actions.]
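
      A possible PyTorch sketch of such a network (layer widths and the exact way \beta is encoded are assumptions, not taken from the slides):

          import torch
          import torch.nn as nn

          class BudgetedQNet(nn.Module):
              """(state, budget) encoder, two hidden layers, and two heads:
              Q_r and Q_c, with one output per action."""

              def __init__(self, state_dim=2, n_actions=2, hidden=64):
                  super().__init__()
                  self.encoder = nn.Sequential(nn.Linear(state_dim + 1, hidden), nn.ReLU())
                  self.body = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                            nn.Linear(hidden, hidden), nn.ReLU())
                  self.q_r = nn.Linear(hidden, n_actions)   # reward head
                  self.q_c = nn.Linear(hidden, n_actions)   # cost head

              def forward(self, state, beta):
                  x = torch.cat([state, beta], dim=-1)      # concatenate state and budget
                  h = self.body(self.encoder(x))
                  return self.q_r(h), self.q_c(h)

          # Example: a batch of 16 states of dimension 2 with their budgets.
          net = BudgetedQNet()
          q_r, q_c = net(torch.randn(16, 2), torch.rand(16, 1))   # each of shape (16, 2)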

  19. Parallel computing of the targets.

      Algorithm 2: BFTQ (parallel)
      In: D, B, \gamma_c, \gamma_r, fit_r, fit_c (regression algorithms). Out: Q_r, Q_c.
      1. X = \{ (s_i, a_i, \beta_i) \}_{i \in [0, |D|]}.
      2. Initialise Q_r = Q_c = (s, a, \beta) \to 0.
      3. repeat
      4.   Y_r, Y_c = compute_targets(D, Q_r, Q_c, B, \gamma_c, \gamma_r)
      5.   Q_r, Q_c = fit_r(X, Y_r), fit_c(X, Y_c)
      6. until convergence or timeout.

      Algorithm 3: Compute targets (parallel)
      1. Q_r, Q_c = Q(D)   // perform a single forward pass
      2. Split D among workers: D = \cup_{w \in W} D_w.
      3. for each worker w \in W, in parallel: (Y_c^w, Y_r^w) \leftarrow compute_targets(D_w, Q_r, Q_c, B, \gamma_c, \gamma_r)
      4. Join the results: Y_c = \cup_{w \in W} Y_c^w and Y_r = \cup_{w \in W} Y_r^w.
      5. return (Y_c, Y_r)
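
      The worker split in Algorithm 3 can be sketched with Python's standard library; `compute_chunk_targets` is a hypothetical per-worker routine that evaluates the regression targets of its chunk using the already-computed Q values (it must be a top-level, picklable function):

          from concurrent.futures import ProcessPoolExecutor
          from itertools import chain

          def compute_targets_parallel(batch, compute_chunk_targets, n_workers=4):
              # Split the batch D among workers, compute each chunk's targets
              # in parallel, then join the results.
              chunks = [batch[i::n_workers] for i in range(n_workers)]
              with ProcessPoolExecutor(max_workers=n_workers) as pool:
                  results = pool.map(compute_chunk_targets, chunks)
              return list(chain.from_iterable(results))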

  20. Experiments. [Figure: results grouped by budget level, \beta \in [0.01, 0.09], [0.11, 0.19], [0.21, 0.29], [0.31, 1.00], compared with Lagrangian penalties \lambda \in \{15, 20\}.] Video available (link on the original slide).

  21. Looking back. [Figure: the histograms of rewards and episode lengths for VI, DQN and MCTS from slide 9.] Compared to DQN, MCTS did very well in terms of safety; VI, not so much.

  22. Model bias. Model-free: 1. directly optimise \pi(a \mid s) through policy evaluation and policy improvement. Model-based: 1. learn a model \hat{T}(s_{t+1} \mid s_t, a_t) of the dynamics; 2. (planning) leverage it to compute

      \max_\pi \; \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \;\Big|\; a_t \sim \pi(s_t),\; s_{t+1} \sim \hat{T}(s_t, a_t) \Big]

      + Better sample efficiency, interpretability, priors.

  23. Model bias. Model-free: 1. directly optimise \pi(a \mid s) through policy evaluation and policy improvement. Model-based: 1. learn a model \hat{T}(s_{t+1} \mid s_t, a_t) of the dynamics; 2. (planning) leverage it to compute

      \max_\pi \; \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \;\Big|\; a_t \sim \pi(s_t),\; s_{t+1} \sim \hat{T}(s_t, a_t) \Big]

      + Better sample efficiency, interpretability, priors.
      - Model bias: T \ne \hat{T} (see the example linked on the original slide).
