Safe Reinforcement Learning for Decision-Making in Autonomous Driving



  1. Safe Reinforcement Learning for Decision-Making in Autonomous Driving. Edouard Leurent, Odalric-Ambrym Maillard, Denis Efimov, Wilfrid Perruquetti, Yann Blanco. SequeL, Inria Lille – Nord Europe; Valse, Inria Lille – Nord Europe; Renault Group. Lille, April 2019.

  2. Motivation: the classic autonomous driving pipeline.

  3. Motivation: the classic autonomous driving pipeline. In practice, ◮ the behavioural layer is a hand-crafted rule-based system (e.g. an FSM).

  4. Motivation: the classic autonomous driving pipeline. In practice, ◮ the behavioural layer is a hand-crafted rule-based system (e.g. an FSM). ◮ This will not scale to complex scenes, nor handle negotiation and aggressiveness.

  5. Reinforcement Learning: why? Search for an optimal policy \pi(a \mid s):

     \max_\pi \; \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \;\Big|\; a_t \sim \pi(s_t),\; s_{t+1} \sim T(s_t, a_t) \Big]   (the policy return R^T_\pi)

     The dynamics T(s_{t+1} \mid s_t, a_t) are unknown: the agent learns by interacting with the environment. Challenges: ◮ exploration vs. exploitation ◮ partial observability ◮ credit assignment ◮ safety.
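
     As a concrete illustration (not part of the slides), the policy return above can be estimated by Monte-Carlo rollouts. A minimal sketch, assuming a Gymnasium-style environment (reset returns (obs, info), step returns a 5-tuple) and a hypothetical `policy` callable:

         import numpy as np

         def estimate_policy_return(env, policy, gamma=0.95, episodes=100):
             # Monte-Carlo estimate of E[ sum_t gamma^t r(s_t, a_t) ] under `policy`.
             returns = []
             for _ in range(episodes):
                 obs, _info = env.reset()
                 done, truncated, rewards = False, False, []
                 while not (done or truncated):
                     obs, r, done, truncated, _info = env.step(policy(obs))
                     rewards.append(r)
                 discounts = gamma ** np.arange(len(rewards))
                 returns.append(float(np.dot(discounts, rewards)))
             return float(np.mean(returns))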

  6. Reinforcement Learning: how? Model-free: 1. directly optimise \pi(a \mid s) through policy evaluation and policy improvement.

  7. Reinforcement Learning: how? Model-free: 1. directly optimise \pi(a \mid s) through policy evaluation and policy improvement. Model-based: 1. learn a model \hat{T}(s_{t+1} \mid s_t, a_t) of the dynamics; 2. (planning) leverage it to compute

     \max_\pi \; \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \;\Big|\; a_t \sim \pi(s_t),\; s_{t+1} \sim \hat{T}(s_t, a_t) \Big]

     + Better sample efficiency, interpretability, priors.
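
     To make the planning step concrete (this is an illustration, not the planner used in the talk), here is a minimal random-shooting planner over a learned dynamics model; `model`, `reward_fn` and `action_space` are hypothetical stand-ins:

         import numpy as np

         def random_shooting_plan(state, model, reward_fn, action_space,
                                  horizon=10, n_candidates=100, gamma=0.95):
             # Sample random action sequences, roll them out through the learned
             # dynamics model(s, a) -> s', and return the first action of the best one.
             best_return, best_action = -np.inf, None
             for _ in range(n_candidates):
                 s, total = state, 0.0
                 actions = [action_space[np.random.randint(len(action_space))]
                            for _ in range(horizon)]
                 for t, a in enumerate(actions):
                     s_next = model(s, a)                      # learned dynamics \hat{T}
                     total += gamma ** t * reward_fn(s, a, s_next)
                     s = s_next
                 if total > best_return:
                     best_return, best_action = total, actions[0]
             return best_action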

  8. A first benchmark: the highway-env environment. ◮ Vehicle kinematics: the Kinematic Bicycle Model. ◮ Low-level longitudinal and lateral controllers. ◮ Behavioural models: IDM and MOBIL. ◮ Graphical road network and route planning. A few baseline agents (setup): ◮ model-free: DQN; ◮ model-based (planning): Value Iteration and MCTS.
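
     For reference, a minimal interaction loop with highway-env looks roughly like the sketch below, with a random policy standing in for the DQN / Value Iteration / MCTS agents; the environment id and the Gymnasium-style API are assumptions that may differ across library versions:

         import gymnasium as gym
         import highway_env  # noqa: F401  (registers the highway environments)

         env = gym.make("highway-v0", render_mode="rgb_array")
         obs, info = env.reset(seed=0)
         done = truncated = False
         episode_return = 0.0
         while not (done or truncated):
             action = env.action_space.sample()   # stand-in for a trained agent
             obs, reward, done, truncated, info = env.step(action)
             episode_return += reward
         env.close()
         print("episode return:", episode_return)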

  9. A first benchmark: results. [Figure: histograms of episode rewards (left) and episode lengths (right) for Value Iteration, DQN and MCTS.] Videos available (link on the original slide).

  10. The safety / performance trade-off. Let us look at the performance of DQN.

  11. The safety / performance trade-off. Let us look at the performance of DQN. Uncertainty and risk: ◮ high return variance, many collisions; ◮ in RL, we only maximise the return in expectation.

  12. The safety / performance trade-off. Let us look at the performance of DQN. Uncertainty and risk: ◮ high return variance, many collisions; ◮ in RL, we only maximise the return in expectation. Conflicting objectives: ◮ the reward is r_t = \omega_v \cdot \text{velocity} - \omega_c \cdot \text{collision}; ◮ we only control the return R^T_\pi = \sum_t \gamma^t r_t; ◮ for any fixed \omega, there can be many optimal policies with different velocity/collision ratios → the Pareto-optimal curve.

  13. A first formalisation of risk: Constrained Reinforcement Learning. ◮ Augment the MDP with a cost function c : S \times A \times S \to \mathbb{R}, a cost discount \gamma_c, and a budget \beta. ◮ Optimise the reward while keeping the cost under the budget:

      \max_\pi \; \mathbb{E}_\pi \Big[ \sum_{t=0}^{\infty} \gamma^t r_t \Big] \quad \text{s.t.} \quad \mathbb{E}_\pi \Big[ \sum_{t=0}^{\infty} \gamma_c^t c_t \Big] \le \beta

      Budgeted Reinforcement Learning: find a single budget-dependent policy \pi(a \mid s, \beta) that solves all the corresponding CMDPs.
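
      One way to expose this setting in code (an illustration, not from the slides) is an environment wrapper that reports a cost signal alongside the reward; `cost_fn` is a hypothetical user-supplied c(s, a, s'), and a Gymnasium-style API is assumed:

          import gymnasium as gym

          class ConstrainedWrapper(gym.Wrapper):
              """Expose a cost signal c(s, a, s') alongside the reward (CMDP view)."""

              def __init__(self, env, cost_fn):
                  super().__init__(env)
                  self.cost_fn = cost_fn      # hypothetical user-supplied c(s, a, s')
                  self._last_obs = None

              def reset(self, **kwargs):
                  obs, info = self.env.reset(**kwargs)
                  self._last_obs = obs
                  return obs, info

              def step(self, action):
                  obs, reward, done, truncated, info = self.env.step(action)
                  info["cost"] = self.cost_fn(self._last_obs, action, obs)
                  self._last_obs = obs
                  return obs, reward, done, truncated, info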

  14. A BMDP algorithm: Lagrangian Relaxation. Consider the dual problem and replace the hard constraint by a soft penalty weighted by a Lagrangian multiplier \lambda:

      \max_\pi \; \mathbb{E}_\pi \Big[ \sum_t \gamma^t r_t - \lambda \gamma_c^t c_t \Big]

      ◮ Train many policies \pi_k with penalties \lambda_k and recover the corresponding cost budgets \beta_k. ◮ Very data- and memory-heavy.
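
      The recipe above can be sketched as follows; `train_policy` and `evaluate` are hypothetical stand-ins for any RL algorithm trained on the shaped reward r_t - \lambda c_t and for a Monte-Carlo policy evaluation, and the multiplier values are arbitrary:

          def lagrangian_sweep(train_policy, evaluate, lambdas=(0.1, 1.0, 5.0, 15.0, 20.0)):
              # Train one policy per multiplier, then recover the budget it satisfies
              # by measuring its expected cumulative cost.
              policies = {}
              for lam in lambdas:
                  policy = train_policy(reward_fn=lambda r, c, lam=lam: r - lam * c)
                  beta_k = evaluate(policy)["expected_cost"]   # recovered budget beta_k
                  policies[beta_k] = policy
              return policies   # one (budget, policy) pair per lambda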

  15. Our BMDP algorithm: Budgeted Fitted-Q [Carrara et al. 2019], a model-free, value-based, fixed-point iteration procedure.

      Q_r^{n+1}(s_i, a_i, \beta_i) \xleftarrow{\text{regression}} r'_i + \gamma \sum_{a' \in A} \pi_n^A(s'_i, a', \beta_i) \, Q_r^n\big(s'_i, a', \pi_n^B(s'_i, a', \beta_i)\big)

      Q_c^{n+1}(s_i, a_i, \beta_i) \xleftarrow{\text{regression}} c'_i + \gamma_c \sum_{a' \in A} \pi_n^A(s'_i, a', \beta_i) \, Q_c^n\big(s'_i, a', \pi_n^B(s'_i, a', \beta_i)\big)

      (\pi_n^A, \pi_n^B) \leftarrow \arg\max_{(\pi^A, \pi^B) \in \Psi_n} \sum_{a \in A} \pi^A(s, a, \beta) \, Q_r^n\big(s, a, \pi^B(s, a, \beta)\big)

      \Psi_n = \Big\{ \pi^A \in \mathcal{M}(A)^{S \times \mathbb{R}}, \; \pi^B \in \mathbb{R}^{S \times A \times \mathbb{R}}, \;\text{such that}\; \forall s \in S, \forall \beta \in \mathbb{R}: \; \sum_{a \in A} \pi^A(s, a, \beta) \, Q_c^n\big(s, a, \pi^B(s, a, \beta)\big) \le \beta \Big\}
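
      To give a feel for the greedy step, here is a deterministic simplification on a discretised budget grid. This is not the exact BFTQ construction (which mixes two points on the convex hull of the (Q_c, Q_r) pairs); the array shapes and grid are assumptions:

          import numpy as np

          def greedy_budgeted_action(q_r, q_c, beta, budget_grid):
              # q_r, q_c: arrays of shape (n_actions, n_budgets) holding
              # Q_r(s, a, b) and Q_c(s, a, b) for one state s.
              feasible = q_c <= beta                     # (a, b) pairs within the cost budget
              if feasible.any():
                  masked = np.where(feasible, q_r, -np.inf)
                  a, b = np.unravel_index(np.argmax(masked), q_r.shape)
              else:                                      # nothing feasible: least unsafe fallback
                  a, b = np.unravel_index(np.argmin(q_c), q_c.shape)
              return a, budget_grid[b]                   # action and next-step budget beta'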

  16. From dynamic programming to RL: Continuous Reinforcement Learning. 1. Risk-sensitive exploration. 2. Scalable function approximation. 3. Parallel computing of the targets and experiences.

  17. Risk-sensitive exploration.

      Algorithm 1: Risk-sensitive exploration
      1.  Initialise an empty batch D.
      2.  for each intermediate batch do
      3.    for each episode in the batch do
      4.      Sample an initial budget \beta \sim U(B).
      5.      while the episode is not done do
      6.        Update \varepsilon from its schedule.
      7.        Sample z \sim U([0, 1]).
      8.        if z > \varepsilon then sample (a, \beta') from (\pi_A, \pi_B)   // exploit
      9.        else sample (a, \beta') from U(\Delta_{AB})                      // explore
      10.       Append the transition (s, \beta, a, r', c', s') to the batch D.
      11.       Update the episode budget \beta \leftarrow \beta'.
      12.   (\pi_A, \pi_B) \leftarrow BFTQ(D).
      13. return the batch of transitions D.
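
      A compressed Python sketch of this exploration scheme (the intermediate-batch refitting loop is folded into a single pass for brevity); `env`, `pi`, `epsilon_schedule` and `bftq_fit` are hypothetical callables, and a Gymnasium-style API with a cost reported in `info` is assumed:

          import random

          def risk_sensitive_exploration(env, pi, budget_range, n_episodes,
                                         epsilon_schedule, bftq_fit):
              batch, step = [], 0
              for _ in range(n_episodes):
                  beta = random.uniform(*budget_range)            # beta ~ U(B)
                  s, _info = env.reset()
                  done = truncated = False
                  while not (done or truncated):
                      eps = epsilon_schedule(step)
                      if random.random() > eps:                   # exploit (pi_A, pi_B)
                          a, next_beta = pi(s, beta)
                      else:                                       # explore: random (a, beta')
                          a = env.action_space.sample()
                          next_beta = random.uniform(*budget_range)
                      s_next, r, done, truncated, info = env.step(a)
                      c = info.get("cost", 0.0)
                      batch.append((s, beta, a, r, c, s_next))    # transition (s, beta, a, r', c', s')
                      s, beta, step = s_next, next_beta, step + 1
              return bftq_fit(batch), batch                       # (pi_A, pi_B) <- BFTQ(D), and D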

  18. Function approximation. [Figure: neural network for approximating the Q-functions. The state s and the budget \beta are encoded and passed through two hidden layers, which output Q_r(a) and Q_c(a) for every action; shown here with state dimension 2 and 2 actions.]
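
      A possible PyTorch sketch of such a network (layer widths and the exact way \beta is encoded are assumptions, not taken from the slides):

          import torch
          import torch.nn as nn

          class BudgetedQNet(nn.Module):
              """(state, budget) encoder, two hidden layers, and two heads:
              Q_r and Q_c, with one output per action."""

              def __init__(self, state_dim=2, n_actions=2, hidden=64):
                  super().__init__()
                  self.encoder = nn.Sequential(nn.Linear(state_dim + 1, hidden), nn.ReLU())
                  self.body = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                            nn.Linear(hidden, hidden), nn.ReLU())
                  self.q_r = nn.Linear(hidden, n_actions)   # reward head
                  self.q_c = nn.Linear(hidden, n_actions)   # cost head

              def forward(self, state, beta):
                  x = torch.cat([state, beta], dim=-1)      # concatenate state and budget
                  h = self.body(self.encoder(x))
                  return self.q_r(h), self.q_c(h)

          # Example: a batch of 16 states of dimension 2 with their budgets.
          net = BudgetedQNet()
          q_r, q_c = net(torch.randn(16, 2), torch.rand(16, 1))   # each of shape (16, 2)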

  19. Parallel computing of the targets.

      Algorithm 2: BFTQ (parallel)
      In: D, B, \gamma_c, \gamma_r, fit_r, fit_c (regression algorithms). Out: Q_r, Q_c.
      1. X = \{ (s_i, a_i, \beta_i) \}_{i \in [0, |D|]}.
      2. Initialise Q_r = Q_c = (s, a, \beta) \to 0.
      3. repeat
      4.   Y_r, Y_c = compute_targets(D, Q_r, Q_c, B, \gamma_c, \gamma_r)
      5.   Q_r, Q_c = fit_r(X, Y_r), fit_c(X, Y_c)
      6. until convergence or timeout.

      Algorithm 3: Compute targets (parallel)
      1. Q_r, Q_c = Q(D)   // perform a single forward pass
      2. Split D among workers: D = \cup_{w \in W} D_w.
      3. for each worker w \in W, in parallel: (Y_c^w, Y_r^w) \leftarrow compute_targets(D_w, Q_r, Q_c, B, \gamma_c, \gamma_r)
      4. Join the results: Y_c = \cup_{w \in W} Y_c^w and Y_r = \cup_{w \in W} Y_r^w.
      5. return (Y_c, Y_r)
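
      The worker split in Algorithm 3 can be sketched with Python's standard library; `compute_chunk_targets` is a hypothetical per-worker routine that evaluates the regression targets of its chunk using the already-computed Q values (it must be a top-level, picklable function):

          from concurrent.futures import ProcessPoolExecutor
          from itertools import chain

          def compute_targets_parallel(batch, compute_chunk_targets, n_workers=4):
              # Split the batch D among workers, compute each chunk's targets
              # in parallel, then join the results.
              chunks = [batch[i::n_workers] for i in range(n_workers)]
              with ProcessPoolExecutor(max_workers=n_workers) as pool:
                  results = pool.map(compute_chunk_targets, chunks)
              return list(chain.from_iterable(results))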

  20. Experiments. [Figure: results grouped by budget level, \beta \in [0.01, 0.09], [0.11, 0.19], [0.21, 0.29], [0.31, 1.00], compared with Lagrangian penalties \lambda \in \{15, 20\}.] Video available (link on the original slide).

  21. Looking back. [Figure: the histograms of rewards and episode lengths for VI, DQN and MCTS from slide 9.] Compared to DQN, MCTS did very well in terms of safety; VI, not so much.

  22. Model bias. Model-free: 1. directly optimise \pi(a \mid s) through policy evaluation and policy improvement. Model-based: 1. learn a model \hat{T}(s_{t+1} \mid s_t, a_t) of the dynamics; 2. (planning) leverage it to compute

      \max_\pi \; \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \;\Big|\; a_t \sim \pi(s_t),\; s_{t+1} \sim \hat{T}(s_t, a_t) \Big]

      + Better sample efficiency, interpretability, priors.

  23. Model bias. Model-free: 1. directly optimise \pi(a \mid s) through policy evaluation and policy improvement. Model-based: 1. learn a model \hat{T}(s_{t+1} \mid s_t, a_t) of the dynamics; 2. (planning) leverage it to compute

      \max_\pi \; \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \;\Big|\; a_t \sim \pi(s_t),\; s_{t+1} \sim \hat{T}(s_t, a_t) \Big]

      + Better sample efficiency, interpretability, priors.
      - Model bias: T \ne \hat{T} (see the example linked on the original slide).
