Lecture 15: Batch RL
Emma Brunskill, CS234 Reinforcement Learning, Winter 2019
Slides drawn from Philip Thomas, with modifications


  1. Lecture 15: Batch RL. Emma Brunskill, CS234 Reinforcement Learning, Winter 2019. Slides drawn from Philip Thomas, with modifications.

  2. Class Structure • Last time: Meta Reinforcement Learning • This time: Batch RL • Next time: Quiz

  3. A Scientific Experiment

  4. A Scientific Experiment

  5. What Should We Do For a New Student?

  6. Involves Counterfactual Reasoning

  7. Involves Generalization

  8. Batch Reinforcement Learning

  9. Batch RL

  10. Batch RL

  11. The Problem • If you apply an existing method, do you have confidence that it will work?

  12. A property of many real applications • Deploying "bad" policies can be costly or dangerous

  13. Deploying bad policies can be costly

  14. Deploying bad policies can be dangerous

  15. What property should a safe batch reinforcement learning algorithm have? • Given past experience from the current policy (or policies), produce a new policy • "Guarantee that, with probability at least 1 − δ, the policy will not be changed to one that is worse than the current policy." • You get to choose δ • The guarantee is not contingent on the tuning of any hyperparameters

  16. Table of Contents
  1. Notation
  2. Creating a safe batch reinforcement learning algorithm
     • Off-policy policy evaluation (OPE)
     • High-confidence off-policy policy evaluation (HCOPE)
     • Safe policy improvement (SPI)

  17. Notation
  • Policy π: π(a | s) = P(a_t = a | s_t = s)
  • Trajectory: T = (s_1, a_1, r_1, s_2, a_2, r_2, ..., s_L, a_L, r_L)
  • Historical data: D = {T_1, T_2, ..., T_n}
  • Historical data is generated by a behavior policy, π_b
  • Objective: $V^{\pi} = \mathbb{E}\left[ \sum_{t=1}^{L} \gamma^t R_t \,\middle|\, \pi \right]$
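To make the notation concrete, here is a minimal sketch of how the data might be represented in code. The names (Trajectory, discounted_return) and the choice of storing each trajectory as (s_t, a_t, r_t) tuples are illustrative assumptions, not from the slides.

```python
# Illustrative sketch of the notation above (names are hypothetical).
# A trajectory T is a list of (s_t, a_t, r_t) tuples for t = 1, ..., L; the
# historical data D = {T_1, ..., T_n} is a list of such trajectories, collected
# under the behavior policy pi_b.
from typing import List, Tuple

Trajectory = List[Tuple[int, int, float]]   # (s_t, a_t, r_t)

def discounted_return(traj: Trajectory, gamma: float) -> float:
    """Sum_{t=1}^{L} gamma^t * r_t, matching the objective V^pi on the slide."""
    return sum(gamma ** (t + 1) * r for t, (_s, _a, r) in enumerate(traj))

D: List[Trajectory] = []   # historical data D = {T_1, ..., T_n}
```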

  18. Safe batch reinforcement learning algorithm
  • Reinforcement learning algorithm, A
  • Historical data, D, which is a random variable
  • Policy produced by the algorithm, A(D), which is a random variable
  • A safe batch reinforcement learning algorithm, A, satisfies:
    $\Pr\left( V^{A(D)} \ge V^{\pi_b} \right) \ge 1 - \delta$
    or, more generally,
    $\Pr\left( V^{A(D)} \ge V_{\min} \right) \ge 1 - \delta$

  19. Table of Contents
  1. Notation
  2. Creating a safe batch reinforcement learning algorithm
     • Off-policy policy evaluation (OPE)
     • High-confidence off-policy policy evaluation (HCOPE)
     • Safe policy improvement (SPI)

  20. Creating a safe batch reinforcement learning algorithm
  • Off-policy policy evaluation (OPE): for any evaluation policy π_e, convert the historical data D into n independent and unbiased estimates of V^{π_e}
  • High-confidence off-policy policy evaluation (HCOPE): use a concentration inequality to convert the n independent and unbiased estimates of V^{π_e} into a 1 − δ confidence lower bound on V^{π_e}
  • Safe policy improvement (SPI): use the HCOPE method to create a safe batch reinforcement learning algorithm

  21. Off-policy policy evaluation (OPE)

  22. Importance Sampling

  23. Importance Sampling
  $\text{IS}(D) = \frac{1}{n} \sum_{i=1}^{n} \left( \prod_{t=1}^{L} \frac{\pi_e(a_t^i \mid s_t^i)}{\pi_b(a_t^i \mid s_t^i)} \right) \left( \sum_{t=1}^{L} \gamma^t R_t^i \right)$
  $\mathbb{E}[\text{IS}(D)] = V^{\pi_e}$
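A sketch of the estimator above, assuming trajectories stored as (s, a, r) tuples and policies given as callables pi(a, s) that return action probabilities (both are assumptions about representation, not from the slides):

```python
import numpy as np

def is_estimate(D, pi_e, pi_b, gamma):
    """Ordinary importance sampling: for each trajectory, multiply its discounted
    return by the product of per-step likelihood ratios pi_e(a|s) / pi_b(a|s),
    then average over the n trajectories."""
    per_trajectory = []
    for traj in D:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_e(a, s) / pi_b(a, s)
            ret += gamma ** (t + 1) * r
        per_trajectory.append(weight * ret)
    return float(np.mean(per_trajectory))
```

The estimate is unbiased for V^{π_e} provided π_b(a | s) > 0 wherever π_e(a | s) > 0, which is what makes the likelihood ratios well defined.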

  24. Creating a safe batch reinforcement learning algorithm
  • Off-policy policy evaluation (OPE): for any evaluation policy π_e, convert the historical data D into n independent and unbiased estimates of V^{π_e}
  • High-confidence off-policy policy evaluation (HCOPE): use a concentration inequality to convert the n independent and unbiased estimates of V^{π_e} into a 1 − δ confidence lower bound on V^{π_e}
  • Safe policy improvement (SPI): use the HCOPE method to create a safe batch reinforcement learning algorithm

  25. High-confidence off-policy policy evaluation (HCOPE)

  26. Hoeffding's inequality
  • Let X_1, ..., X_n be n independent, identically distributed random variables such that X_i ∈ [0, b]
  • Then with probability at least 1 − δ:
    $\mathbb{E}[X_i] \ge \frac{1}{n} \sum_{i=1}^{n} X_i - b \sqrt{\frac{\ln(1/\delta)}{2n}}$
  • In our case, $X_i = w_i \sum_{t=1}^{L} \gamma^t R_t^i$ with $w_i = \prod_{t=1}^{L} \frac{\pi_e(a_t^i \mid s_t^i)}{\pi_b(a_t^i \mid s_t^i)}$, i.e. the importance-sampled return of trajectory i
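A minimal sketch of the HCOPE step, applying the bound above to the per-trajectory importance-sampled returns; it assumes each estimate is known to lie in [0, b]:

```python
import numpy as np

def hcope_lower_bound(per_trajectory_estimates, b, delta):
    """1 - delta confidence lower bound on E[X_i] (and hence on V^{pi_e}) via
    Hoeffding's inequality, assuming each X_i lies in [0, b]."""
    x = np.asarray(per_trajectory_estimates, dtype=float)
    n = len(x)
    return float(x.mean() - b * np.sqrt(np.log(1.0 / delta) / (2.0 * n)))
```

Note that b must upper-bound the importance-weighted return, which can be very large because the trajectory weights w_i can be large; this is one reason Hoeffding-based bounds tend to be loose in practice.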

  27. Safe policy improvement (SPI)

  28. Safe policy improvement (SPI)
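Combining the OPE and HCOPE pieces from slides 20-26, a simplified sketch of the SPI step is: compute the HCOPE lower bound for a candidate policy and return it only if the bound clears the safety threshold; otherwise fall back to the behavior policy ("no solution found"). This reuses the hypothetical is_estimate and hcope_lower_bound sketches above, and it deliberately ignores refinements used in practice, such as splitting the data between candidate selection and a final safety test and accounting for testing several candidates on the same data.

```python
def safe_policy_improvement(D, pi_b, candidate_policies, v_min, b, delta, gamma):
    """Simplified SPI sketch: among candidate policies whose 1 - delta HCOPE
    lower bound exceeds v_min, return the one with the highest bound;
    if none qualifies, keep the behavior policy ("no solution found")."""
    best_policy, best_bound = None, None
    for pi_e in candidate_policies:
        # One unbiased estimate of V^{pi_e} per trajectory (the OPE step).
        per_trajectory = [is_estimate([traj], pi_e, pi_b, gamma) for traj in D]
        bound = hcope_lower_bound(per_trajectory, b, delta)  # the HCOPE step
        if bound >= v_min and (best_bound is None or bound > best_bound):
            best_policy, best_bound = pi_e, bound
    return best_policy if best_policy is not None else pi_b
```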

  29. Off-policy policy evaluation
  • Importance sampling (IS):
    $\text{IS}(D) = \frac{1}{n} \sum_{i=1}^{n} \left( \prod_{t=1}^{L} \frac{\pi_e(a_t^i \mid s_t^i)}{\pi_b(a_t^i \mid s_t^i)} \right) \left( \sum_{t=1}^{L} \gamma^t R_t^i \right)$
  • Per-decision importance sampling (PDIS):
    $\text{PDIS}(D) = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=1}^{L} \gamma^t \left( \prod_{\tau=1}^{t} \frac{\pi_e(a_\tau^i \mid s_\tau^i)}{\pi_b(a_\tau^i \mid s_\tau^i)} \right) R_t^i$
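A sketch of PDIS under the same representational assumptions as the IS sketch above: the weight applied to R_t includes only the likelihood ratios up to time t, which typically reduces variance.

```python
import numpy as np

def pdis_estimate(D, pi_e, pi_b, gamma):
    """Per-decision importance sampling: reward R_t is weighted by the product of
    likelihood ratios for steps 1..t only, rather than the full-trajectory weight."""
    per_trajectory = []
    for traj in D:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_e(a, s) / pi_b(a, s)   # running product over tau = 1..t
            ret += gamma ** (t + 1) * weight * r
        per_trajectory.append(ret)
    return float(np.mean(per_trajectory))
```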

  30. Off-policy policy evaluation (revisited)
  • Importance sampling (IS):
    $\text{IS}(D) = \frac{1}{n} \sum_{i=1}^{n} w_i \sum_{t=1}^{L} \gamma^t R_t^i$
  • Weighted importance sampling (WIS):
    $\text{WIS}(D) = \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i \sum_{t=1}^{L} \gamma^t R_t^i$
  where $w_i = \prod_{t=1}^{L} \frac{\pi_e(a_t^i \mid s_t^i)}{\pi_b(a_t^i \mid s_t^i)}$ is the importance weight of trajectory i
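A sketch of WIS under the same assumptions as the earlier sketches; the only change from ordinary IS is normalizing by the sum of the trajectory weights instead of by n.

```python
import numpy as np

def wis_estimate(D, pi_e, pi_b, gamma):
    """Weighted importance sampling: same per-trajectory weights and returns as IS,
    but normalized by sum(w_i) rather than by n."""
    weights, returns = [], []
    for traj in D:
        w, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            w *= pi_e(a, s) / pi_b(a, s)
            ret += gamma ** (t + 1) * r
        weights.append(w)
        returns.append(ret)
    weights, returns = np.asarray(weights), np.asarray(returns)
    return float(np.sum(weights * returns) / np.sum(weights))
```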

  31. Off-policy policy evaluation (revisited)
  • Weighted importance sampling (WIS):
    $\text{WIS}(D) = \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i \sum_{t=1}^{L} \gamma^t R_t^i$
  • Biased: when n = 1, $\mathbb{E}[\text{WIS}(D)] = V^{\pi_b}$
  • Strongly consistent estimator of V^{π_e}, i.e. $\Pr\left( \lim_{n \to \infty} \text{WIS}(D) = V^{\pi_e} \right) = 1$, if:
    • Finite horizon
    • One behavior policy, or bounded rewards

  32. Off-policy policy evaluation (revisited) • Weighted per-decision importance sampling • Also called consistent weighted per-decision importance sampling • A fun exercise!

  33. Control variates
  • Given: X
  • Estimate: µ = E[X]
  • µ̂ = X
  • Unbiased: E[µ̂] = E[X] = µ
  • Variance: Var(µ̂) = Var(X)
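The slide sets up the baseline estimator µ̂ = X, which is unbiased but has variance Var(X). The control-variate idea is to subtract a correlated variable Y whose mean E[Y] is known; the estimator stays unbiased and its variance can be reduced. A minimal sketch with hypothetical inputs (samples x, correlated samples y, and the known mean y_mean), not code from the lecture:

```python
import numpy as np

def control_variate_estimate(x, y, y_mean):
    """Estimate E[X] as mean(x) - c * (mean(y) - y_mean), where y_mean = E[Y] is known.
    The variance-minimizing coefficient is c = Cov(X, Y) / Var(Y); here c is estimated
    from the same samples, the usual practical choice (strictly, a small bias)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    c = np.cov(x, y, ddof=1)[0, 1] / np.var(y, ddof=1)
    return float(x.mean() - c * (y.mean() - y_mean))
```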
