1. Safe Policy Improvement with Baseline Bootstrapping. Romain Laroche, Paul Trichelair, Rémi Tachet des Combes

4. Problem setting: the batch setting
• Fixed dataset, no direct interaction with the environment.
• Access to the behavioural policy, called the baseline.
• Objective: improve the baseline with high probability.
• Commonly encountered in real-world applications: distributed systems, long trajectories.

7. Contributions
Novel batch RL algorithm: SPIBB
• SPIBB comes with reliability guarantees in finite MDPs.
• SPIBB is as computationally efficient as classic RL.
Finite MDPs benchmark
• Extensive benchmark of existing algorithms.
• Empirical analysis on random MDPs and baselines.
Infinite MDPs benchmark
• Model-free SPIBB for use with function approximation.
• First deep RL algorithm reliable in the batch setting.

8. Robust Markov Decision Processes [Iyengar, 2005; Nilim and El Ghaoui, 2005]
• The true environment $M^* = \langle \mathcal{X}, \mathcal{A}, P^*, R^*, \gamma \rangle$ is unknown.
• The Maximum Likelihood Estimation (MLE) MDP is built from the dataset counts: $\hat{M} = \langle \mathcal{X}, \mathcal{A}, \hat{P}, \hat{R}, \gamma \rangle$.
• The robust MDP set $\Xi(\hat{M}, e)$ contains $M^*$ with probability at least $1 - \delta$.
• The error function $e(x, a)$ is derived from concentration bounds.
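To make the MLE construction concrete, here is a minimal sketch in Python, assuming the batch is stored as (x, a, r, x') index tuples; the function name and data layout are illustrative, not from the paper:

```python
import numpy as np

def mle_mdp(dataset, n_states, n_actions):
    """Build the MLE MDP (P_hat, R_hat) and the counts N_D from a batch
    of (x, a, r, x_next) transitions."""
    counts = np.zeros((n_states, n_actions))
    P_hat = np.zeros((n_states, n_actions, n_states))
    R_hat = np.zeros((n_states, n_actions))
    for x, a, r, x_next in dataset:
        counts[x, a] += 1
        P_hat[x, a, x_next] += 1
        R_hat[x, a] += r
    visited = counts > 0
    P_hat[visited] /= counts[visited][:, None]   # normalize transition counts
    R_hat[visited] /= counts[visited]            # average observed rewards
    P_hat[~visited] = 1.0 / n_states             # arbitrary model where N_D(x, a) = 0
    return P_hat, R_hat, counts
```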

10. Existing algorithms
[Petrik et al., 2016]: SPI by robust baseline regret minimization
• Robust MDPs consider the max-min of the value over $\Xi$, which favours over-conservative policies.
• They also consider the max-min of the value improvement, which is an NP-hard problem.
• RaMDP instead hacks the reward to account for uncertainty, $\hat{R}(x, a) \leftarrow \hat{R}(x, a) - \frac{\kappa_{\mathrm{adj}}}{\sqrt{N_D(x, a)}}$, which is not theoretically grounded.
[Thomas, 2015]: High-Confidence Policy Improvement
• HCPI searches for the best regularization hyperparameter that still allows safe policy improvement.
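The RaMDP adjustment is one line on top of the MLE estimates; a hedged sketch reusing the counts from the function above (the $\kappa_{\mathrm{adj}}$ value is a hyperparameter, shown here with an arbitrary setting):

```python
import numpy as np

def ramdp_rewards(R_hat, counts, kappa_adj=0.003):
    """RaMDP: subtract an uncertainty penalty that shrinks with the visit
    count, then solve the adjusted MDP with standard planning. The max
    with 1 avoids dividing by zero for unvisited pairs."""
    return R_hat - kappa_adj / np.sqrt(np.maximum(counts, 1))
```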

12. Safe Policy Improvement with Baseline Bootstrapping (SPIBB)
• Tractable approximate solution to the robust policy improvement formulation.
• SPIBB allows a policy update only where there is sufficient evidence.
• Sufficient evidence = a state-action count that exceeds some threshold hyperparameter $N_\wedge$.
SPIBB algorithm (see the sketch below)
• Construction of the bootstrapped set: $B = \{(x, a) \in \mathcal{X} \times \mathcal{A} : N_D(x, a) < N_\wedge\}$.
• Optimization over a constrained policy set: $\pi^{\odot}_{\text{spibb}} = \operatorname{argmax}_{\pi \in \Pi_b} \rho(\pi, \hat{M})$, with $\Pi_b = \{\pi \text{ s.t. } \pi(a | x) = \pi_b(a | x) \text{ if } (x, a) \in B\}$.
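A minimal sketch of the corresponding greedy projection step, assuming tabular Q-values, baseline probabilities, and counts as NumPy arrays (function and variable names are ours, not the paper's):

```python
import numpy as np

def spibb_greedy_step(Q, pi_b, counts, n_wedge):
    """One SPIBB policy-improvement step: keep the baseline's probability
    on bootstrapped (rarely observed) pairs, and move all of the remaining
    probability mass onto the best non-bootstrapped action."""
    pi = np.zeros_like(pi_b)
    bootstrapped = counts < n_wedge                 # the set B
    for x in range(Q.shape[0]):
        pi[x, bootstrapped[x]] = pi_b[x, bootstrapped[x]]
        free = ~bootstrapped[x]
        if free.any():
            mass = pi_b[x, free].sum()              # mass the update may reallocate
            best = np.flatnonzero(free)[np.argmax(Q[x, free])]
            pi[x, best] = mass
    return pi
```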

16. SPIBB policy iteration: policy improvement step example

Q-value              Baseline policy     Bootstrapping    SPIBB policy update
Q^(i)(x, a1) = 1     π_b(a1|x) = 0.1     (x, a1) ∈ B      π^(i+1)(a1|x) = 0.1
Q^(i)(x, a2) = 2     π_b(a2|x) = 0.4     (x, a2) ∉ B      π^(i+1)(a2|x) = 0.0
Q^(i)(x, a3) = 3     π_b(a3|x) = 0.3     (x, a3) ∉ B      π^(i+1)(a3|x) = 0.7
Q^(i)(x, a4) = 4     π_b(a4|x) = 0.2     (x, a4) ∈ B      π^(i+1)(a4|x) = 0.2

Bootstrapped pairs (a1, a4) keep their baseline probabilities, and the mass of the non-bootstrapped actions (0.4 + 0.3 = 0.7) all moves to a3, the best non-bootstrapped action. Note that a4 has the highest Q-value but stays frozen at 0.2 because its estimate lacks support in the data.
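Running the same numbers through the illustrative `spibb_greedy_step` sketch from slide 12 reproduces the table:

```python
import numpy as np

Q = np.array([[1.0, 2.0, 3.0, 4.0]])      # Q^(i) for the single state x
pi_b = np.array([[0.1, 0.4, 0.3, 0.2]])   # baseline policy
counts = np.array([[2, 10, 10, 3]])       # a1 and a4 fall below N_wedge, hence in B
print(spibb_greedy_step(Q, pi_b, counts, n_wedge=5))
# [[0.1 0.  0.7 0.2]]
```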

18. Theoretical analysis
Theorem (Convergence). Policy iteration converges to a policy $\pi^{\odot}_{\text{spibb}}$ that is $\Pi_b$-optimal in the MLE MDP $\hat{M}$.
Theorem (Safe policy improvement). With high probability $1 - \delta$:
$\rho(\pi^{\odot}_{\text{spibb}}, M^*) - \rho(\pi_b, M^*) \geq \rho(\pi^{\odot}_{\text{spibb}}, \hat{M}) - \rho(\pi_b, \hat{M}) - \frac{4 V_{\max}}{1 - \gamma} \sqrt{\frac{2}{N_\wedge} \log \frac{2 |\mathcal{X}| |\mathcal{A}| 2^{|\mathcal{X}|}}{\delta}}$
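The penalty term makes the role of $N_\wedge$ explicit; a small helper to evaluate it (names ours, a direct transcription of the bound above):

```python
import math

def spibb_penalty(n_states, n_actions, n_wedge, delta, v_max, gamma):
    """Penalty term of the safe-improvement bound. It shrinks as
    O(1 / sqrt(N_wedge)): a higher threshold tightens the guarantee,
    at the cost of a more constrained, more baseline-like policy."""
    log_term = math.log(2 * n_states * n_actions * 2**n_states / delta)
    return 4 * v_max / (1 - gamma) * math.sqrt(2 * log_term / n_wedge)
```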

19. Model-free formulation
SPIBB algorithm
• It may be formulated in a model-free manner by setting the targets:
$y_j^{(i)} = r_j + \gamma \sum_{a' | (x'_j, a') \in B} \pi_b(a' | x'_j) \, Q^{(i)}(x'_j, a') + \gamma \left( \sum_{a' | (x'_j, a') \notin B} \pi_b(a' | x'_j) \right) \max_{a' | (x'_j, a') \notin B} Q^{(i)}(x'_j, a')$
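A minimal per-transition sketch of that target, assuming the next-state Q-values, baseline probabilities, and bootstrap mask are available as NumPy vectors (names ours, not the paper's):

```python
import numpy as np

def spibb_target(r, q_next, pi_b_next, in_B, gamma=0.99):
    """Model-free SPIBB target for one transition (x, a, r, x'):
    bootstrapped next actions contribute their baseline-weighted
    Q-values; the remaining baseline probability mass goes to the
    greedy non-bootstrapped action."""
    target = r + gamma * np.dot(pi_b_next[in_B], q_next[in_B])
    if (~in_B).any():
        target += gamma * pi_b_next[~in_B].sum() * q_next[~in_B].max()
    return target
```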
