CS 730/730W/830: Intro AI, Lecture 18: MDP Wrap-Up, ADP, Q-Learning


  1. CS 730/730W/830: Intro AI. MDP Wrap-Up, ADP, Q-Learning. One handout: slides. Project proposals are due. Wheeler Ruml (UNH), Lecture 18.

  2. MDP Wrap-Up: ■ RTDP ■ MDPs. (Remaining sections: ADP, Q-Learning.)

  3. Real-time Dynamic Programming (for a known MDP): which states to update?
     ■ initialize U to an upper bound
     ■ update U as we follow the greedy policy from s₀:
         U(s) ← R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')
     so updates go to states that the agent is likely to visit (nice anytime profile).
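
A minimal Python sketch of one RTDP trial as described on the slide (not from the course materials): it assumes a tabular MDP stored as nested dictionaries, with T[s][a] mapping successor states to probabilities and R[s] giving the state reward, and a U table already initialized to an admissible upper bound. All names are illustrative.

```python
import random

def rtdp_trial(T, R, U, s0, gamma, max_steps=100):
    """One RTDP trial: follow the greedy policy from s0, backing up U at each visited state."""
    s = s0
    for _ in range(max_steps):
        # Bellman backup at the current state only
        q = {a: sum(p * U[s2] for s2, p in T[s][a].items()) for a in T[s]}
        a = max(q, key=q.get)
        U[s] = R[s] + gamma * q[a]
        # simulate the greedy action to choose the next state to update
        succs, probs = zip(*T[s][a].items())
        s = random.choices(succs, weights=probs)[0]
```

Running many such trials from s₀ concentrates backups on the states the greedy policy actually reaches, which is what gives RTDP its anytime profile.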

  4. Summary of MDP Solving
     ■ value iteration: compute U^{π*}
       ◆ prioritized sweeping
       ◆ RTDP
     ■ policy iteration: compute U^π using
       ◆ linear algebra (exact)
       ◆ simplified value iteration (exact and faster?)
       ◆ modified PI (a few updates, so inexact)
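
For reference, a straightforward sketch of the value-iteration sweep named first in the summary, using the same nested-dictionary MDP representation as above; the function name and stopping tolerance are assumptions, not from the slides.

```python
def value_iteration(T, R, gamma, eps=1e-6):
    """Sweep Bellman backups over all states until the largest single update is below eps."""
    U = {s: 0.0 for s in T}
    while True:
        delta = 0.0
        for s in T:
            best = max(sum(p * U[s2] for s2, p in T[s][a].items()) for a in T[s])
            new_u = R[s] + gamma * best
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < eps:
            return U
```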

  5. ADP: Model-based Reinforcement Learning

  6. Adaptive Dynamic Programming ('model-based'; active vs. passive): learn T and R as we go, calculating π using MDP methods (e.g., VI or PI):
         until max-update ≤ loss-bound · (1 − γ)² / (2γ²):
             for each state s:
                 U(s) ← R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')
         π(s) = argmax_a Σ_{s'} T(s, a, s') U(s')
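
A sketch of the model-estimation half of ADP under the same tabular assumptions: the agent counts transitions and remembers observed rewards, and after each observation (or every few) it can re-solve the estimated MDP with VI or PI, e.g. the value_iteration sketch above. The class and method names are illustrative, not from the lecture.

```python
from collections import defaultdict

class ADPAgent:
    """Model-based RL: estimate T and R from experience, then plan with any MDP solver."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.R = {}                                          # s -> observed reward

    def observe(self, s, a, s2, r):
        """Record one experience (s, a, s', r) and update the model estimates."""
        self.counts[(s, a)][s2] += 1
        self.R[s2] = r

    def T_hat(self, s, a):
        """Maximum-likelihood estimate of T(s, a, .) from the counts."""
        c = self.counts[(s, a)]
        n = sum(c.values())
        return {s2: k / n for s2, k in c.items()}
```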

  7. Prioritized Sweeping: given an experience (s, a, s', r):
     ■ update the model
     ■ update s
     ■ repeat k times: do the highest-priority update
     To update state s with change δ in U(s):
         update U(s)
         priority of s ← 0
         for each predecessor s' of s:
             priority of s' ← max of its current priority and max_a δ · T̂(s', a, s)
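
A simplified sketch of the sweeping loop, again with illustrative names: T_hat and R_hat are the estimated model in the nested-dictionary form used above, preds[s] lists predecessor states, and the queue holds (-priority, tiebreak, state) tuples for Python's min-heap. Rather than raising a predecessor's priority in place, it lazily pushes a duplicate entry, a common shortcut.

```python
import heapq
from itertools import count

tiebreak = count()  # keeps heap entries comparable when priorities tie

def sweep(U, T_hat, R_hat, preds, pq, gamma, k):
    """Pop and process up to k highest-priority states from the queue pq."""
    for _ in range(k):
        if not pq:
            break
        _, _, s = heapq.heappop(pq)
        old = U[s]
        U[s] = R_hat[s] + gamma * max(
            sum(p * U[s2] for s2, p in T_hat[s][a].items()) for a in T_hat[s])
        delta = abs(U[s] - old)
        for sp in preds.get(s, ()):
            # priority of a predecessor: delta scaled by how likely it is to reach s
            prio = delta * max(T_hat[sp][a].get(s, 0.0) for a in T_hat[sp])
            if prio > 1e-9:
                heapq.heappush(pq, (-prio, next(tiebreak), sp))
```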

  8. Policy Iteration: repeat until π doesn't change:
     ■ given π, compute U^π(s) for all states
     ■ given U, calculate the policy by one-step look-ahead
     If π doesn't change, U doesn't either: we are at an equilibrium (= optimal π)!
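
A sketch of exact policy iteration with NumPy arrays (T of shape [S, A, S], R of shape [S]); the evaluation step solves the linear system (I − γ T_π) U = R, matching the 'linear algebra (exact)' option from the earlier summary. Shapes and names are assumptions.

```python
import numpy as np

def policy_iteration(T, R, gamma):
    """Alternate exact policy evaluation and greedy one-step improvement until stable."""
    S, A, _ = T.shape
    pi = np.zeros(S, dtype=int)
    while True:
        T_pi = T[np.arange(S), pi]                       # [S, S]: rows for the chosen actions
        U = np.linalg.solve(np.eye(S) - gamma * T_pi, R)  # solve (I - gamma T_pi) U = R
        Q = R[:, None] + gamma * (T @ U)                  # [S, A] one-step look-ahead
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            return pi, U
        pi = new_pi
```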

  9. Exploration vs. Exploitation: the problem with acting greedily is getting stuck in local minima. One fix is an optimistic backup:
         U⁺(s) ← R(s) + γ max_a f( Σ_{s'} T(s, a, s') U⁺(s'), N(a, s) )
     where f(u, n) = R_max if n < k, and u otherwise.
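
A one-function sketch of the optimistic backup, with an assumed count table N[(a, s)] and illustrative constants R_max and k; the MDP representation is the same nested-dictionary form as in the earlier sketches.

```python
def optimistic_backup(s, U_plus, T, R, N, gamma, R_max, k):
    """One backup of the optimistic value U+ using the exploration function f."""
    def f(u, n):
        # treat an under-tried action as if it were worth R_max, driving exploration
        return R_max if n < k else u

    return R[s] + gamma * max(
        f(sum(p * U_plus[s2] for s2, p in T[s][a].items()), N.get((a, s), 0))
        for a in T[s])
```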

  10. Break
      ■ asst 4
      ■ final papers: writing-intensive

  11. Q-Learning: Model-free Reinforcement Learning

  12. Q-Learning: recall the Bellman equation and its Q-value form:
          U(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')
          Q(s, a) = γ Σ_{s'} T(s, a, s') (R(s') + max_{a'} Q(s', a'))
      Given an experience ⟨s, a, s', r⟩, update Q directly, without a model:
          Q(s, a) ← Q(s, a) + α (error)
          Q(s, a) ← Q(s, a) + α (sensed − predicted)
          Q(s, a) ← Q(s, a) + α (γ (r + max_{a'} Q(s', a')) − Q(s, a))
      α ≈ 1/N? Policy: choose a random action with probability 1/N?
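
A tabular Q-learning sketch that follows the slide's update as written, including its placement of γ outside the (r + max_{a'} Q) term and the α ≈ 1/N and explore-with-probability-1/N suggestions; the more common textbook variant instead uses r + γ max_{a'} Q(s', a') as the target. Table and function names are illustrative.

```python
from collections import defaultdict
import random

Q = defaultdict(float)   # (s, a) -> estimated action value
N = defaultdict(int)     # (s, a) -> number of times a was tried in s

def q_update(s, a, s2, r, actions, gamma):
    """Apply the slide's update: Q(s,a) += alpha * (gamma*(r + max_a' Q(s',a')) - Q(s,a))."""
    N[(s, a)] += 1
    alpha = 1.0 / N[(s, a)]                               # alpha ~ 1/N, as suggested
    target = gamma * (r + max(Q[(s2, a2)] for a2 in actions))
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def choose_action(s, actions):
    """Act randomly with probability ~ 1/N(s), otherwise greedily with respect to Q."""
    n = 1 + sum(N[(s, a)] for a in actions)
    if random.random() < 1.0 / n:
        return random.choice(list(actions))
    return max(actions, key=lambda a: Q[(s, a)])
```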

  13. Summary
      Model known (solving the MDP):
      ■ value iteration
      ■ policy iteration: compute U^π using
        ◆ linear algebra
        ◆ simplified value iteration
        ◆ a few updates (modified PI)
      Model unknown (RL):
      ■ ADP using
        ◆ value iteration
        ◆ a few updates (e.g., prioritized sweeping)
      ■ Q-learning

  14. EOLQs
      ■ What question didn't you get to ask today?
      ■ What's still confusing?
      ■ What would you like to hear more about?
      Please write down your most pressing question about AI and put it in the box on your way out. Thanks!
