DS595/CS525 Reinforcement Learning, Prof. Yanhua Li


  1. This lecture will be recorded!!! Welcome to DS595/CS525 Reinforcement Learning. Prof. Yanhua Li. Time: 6:00pm - 8:50pm, R. Zoom lecture, Fall 2020.

  2. Last lecture
     v Reinforcement Learning Components
       § Model, Value function, Policy
     v Model-based Control
       § Policy Evaluation, Policy Iteration, Value Iteration
     v Project 1 description

  3. Quiz 1: Week 4 (9/24, R)
     v Model-based Control
       § Policy Evaluation, Policy Iteration, Value Iteration
       § 20 min at the beginning
         • You can start as early as 5:55PM and finish as late as 6:20PM. The quiz duration is 20 minutes.
       § Log in to the class Zoom so you can ask questions regarding the quiz in the Zoom chat box.
     Project 1 due Week 4 (9/24, R)

  4. This lecture
     v Markov Process (Markov Chain), Markov Reward Process, and Markov Decision Process
       § MP, MRP, MDP, POMDP
     v Review: Model-based control
       § Policy Iteration and Value Iteration
     v Model-Free Policy Evaluation
       § Monte Carlo policy evaluation
       § Temporal-difference (TD) policy evaluation

  5. Example: Taxi passenger-seeking task as a decision-making process
     [Figure: six taxi locations s1, ..., s6 along a road]
     v States: locations of the taxi (s1, ..., s6) on the road
     v Actions: Left or Right
     v Rewards: +1 in state s1, +3 in state s5, 0 in all other states

  6. RL components
     v Often include one or more of:
       § Model: representation of how the world changes in response to the agent's actions
       § Policy: function mapping the agent's states to actions
       § Value function: future rewards from being in a state and/or taking an action when following a particular policy

  7. RL components: (1) Model
     v Agent's representation of how the world changes in response to the agent's action, with two parts:
       § Transition model: predicts the next agent state, p(s_{t+1} = s' | s_t = s, a_t = a)
       § Reward model: predicts the immediate reward, r(s, a)
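As a concrete illustration of these two parts, a tabular model can be stored as plain dictionaries; the states, actions, and numbers below are made up, loosely following the taxi example:

```python
# A minimal tabular model sketch (illustrative only; the entries are assumptions).
# P[(s, a)] maps each next state s' to its probability p(s' | s, a);
# R[(s, a)] is the immediate reward r(s, a).
P = {
    ("s2", "left"):  {"s1": 1.0},
    ("s2", "right"): {"s3": 0.7, "s2": 0.3},
}
R = {
    ("s2", "left"):  1.0,   # moving toward s1 earns +1 in the taxi example
    ("s2", "right"): 0.0,
}
```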

  8. RL components: (2) Policy
     v Policy π determines how the agent chooses actions
       § π : S → A, a mapping from states to actions
     v Deterministic policy:
       § π(s) = a
       § In other words, π(a|s) = 1, and π(a'|s) = π(a''|s) = 0 for the other actions a', a''
     v Stochastic policy:
       § π(a|s) = Pr(a_t = a | s_t = s)
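A minimal sketch of the two policy types (the states, actions, and probabilities are hypothetical):

```python
import random

ACTIONS = ["left", "right"]

def deterministic_policy(s):
    # pi(s) = a: exactly one action per state.
    return "right" if s != "s5" else "left"

def stochastic_policy(s):
    # pi(a|s): sample an action from a distribution over actions.
    probs = {"left": 0.3, "right": 0.7}
    return random.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]
```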

  9. RL components: (3) Value Function
     v Value function V^π: expected discounted sum of future rewards under a particular policy π
     v Discount factor γ weighs immediate vs. future rewards
     v Can be used to quantify goodness/badness of states and actions
     v And to decide how to act, by comparing policies
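In the notation of Sutton and Barto (which later slides cite), the definition behind this bullet is:

```latex
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\, r_t + \gamma\, r_{t+1} + \gamma^{2} r_{t+2} + \cdots \,\middle|\, s_t = s \right]
```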

  10. RL agents and algorithms
      v Model-free: no model
      v Model-based: explicit model

  11. Find a good policy: Problem settings
      Model-based control:
        v (Agent's internal computation)
        § Given a model of how the world works
        § Dynamics and reward model
        § Algorithm computes how to act in order to maximize expected reward (may involve planning)
      Model-free control:
        v Computing while interacting with the environment
        § Agent doesn't know how the world works
        § Interacts with the world to implicitly/explicitly learn how the world works
        § Agent improves its policy

  12. Find a good policy: Problem settings
      Model-based control (agent's internal computation):
        § Frozen Lake, Project 1
        § Know all the rules of the game / a perfect model
        § Dynamic programming, tree search
      Model-free control (computing while interacting with the environment):
        § Taxi passenger-seeking problem
        § Demand/traffic dynamics are uncertain
        § Huge state space
      [Figure: three sample taxi paths, Path 1, Path 2, Path 3]

  13. Find a good policy: Problem settings
      Model-based control:
        v Given: MDP
          § S, A, P, R, γ
        v Output:
          § π
      Model-free control:
        v Given: MDP without P and R
          § S, A, γ
        v Unknown: P, R
        v Output:
          § π

  14. This lecture
      v Markov Process (Markov Chain), Markov Reward Process, and Markov Decision Process
        § MP, MRP, MDP, POMDP
      v Review: Model-based control
        § Policy Iteration and Value Iteration
      v Model-Free Policy Evaluation
        § Monte Carlo policy evaluation
        § Temporal-difference (TD) policy evaluation

  15. MP, MRP, and MDP
      v Markov Process (Markov Chain)
      v Markov Reward Process
      v Markov Decision Process

  16. Random Walks on Graphs
      [Figure panels: a random walk on a graph, with applications: random walk sampling, routing, a molecule in liquid, influence diffusion]

  17. Undirected Graphs
      [Figure: an undirected graph with six numbered nodes]

  18. Random Walk
      v Adjacency matrix A (symmetric, since the graph is undirected) and degree matrix D for a 4-node example graph:
        $$A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} 3 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 2 \end{pmatrix}$$
      v Transition probability matrix:
        $$P_{ij} = \begin{cases} 1/d_i & \text{if } (i,j) \in E \\ 0 & \text{otherwise} \end{cases}, \qquad P = D^{-1}A = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \end{pmatrix}$$
      v |E|: number of links
      v Stationary distribution: $$\pi_i = \frac{d_i}{2|E|}$$
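A quick numpy check of these quantities, using the 4-node example graph from the slide:

```python
import numpy as np

# Verify the random-walk quantities on slide 18's 4-node example graph.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)                # node degrees: (3, 2, 3, 2)
P = A / d[:, None]               # P = D^{-1} A, row-stochastic
pi = d / d.sum()                 # stationary distribution pi_i = d_i / (2|E|)
assert np.allclose(pi @ P, pi)   # pi is invariant under one step of the walk
```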

  19. A random walker: Markov Chain / Markov Process
      [Figure: six-state Markov chain over s1, ..., s6, with transition probabilities (0.3, 0.4, 0.7) on the edges]

  20. A random walker: Markov Chain / Markov Process
      [Same six-state Markov chain as slide 19]
      v One step of the chain maps the state distribution s_0 to s_1 = s_0 P
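A minimal sketch of this one-step update; the 3-state transition matrix is a hypothetical stand-in for the slide's six-state chain, whose exact probabilities live in the figure:

```python
import numpy as np

# Propagate a state distribution one step of the chain: s1 = s0 @ P.
# This 3-state matrix is a made-up stand-in for the slide's 6-state chain.
P = np.array([[0.3, 0.7, 0.0],
              [0.3, 0.3, 0.4],
              [0.0, 0.7, 0.3]])
s0 = np.array([0.0, 1.0, 0.0])   # start in the middle state with probability 1
s1 = s0 @ P                      # distribution after one step
print(s1)                        # [0.3 0.3 0.4]
```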

  21. Taxi passenger-seeking task: Markov Process, Episodes
      [Same six-state Markov chain as slide 19]
      Example: sample episodes starting from s3
        § s3, s2, s2, s2, s1, s1, ...
        § s3, s3, s4, s5, s6, s6, ...
        § s3, s4, s5, s4, ...
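Episodes like these can be sampled directly from the transition model; a sketch, with transition probabilities assumed from the figure's 0.3/0.4/0.7 edge labels:

```python
import random

# Sample episodes from a Markov chain, as on slide 21.
# P[s] maps each next state to its transition probability; the exact numbers
# are assumptions standing in for the slide's figure.
P = {
    "s1": {"s1": 0.7, "s2": 0.3},
    "s2": {"s1": 0.4, "s2": 0.3, "s3": 0.3},
    "s3": {"s2": 0.3, "s3": 0.3, "s4": 0.4},
    "s4": {"s3": 0.3, "s4": 0.3, "s5": 0.4},
    "s5": {"s4": 0.3, "s5": 0.3, "s6": 0.4},
    "s6": {"s5": 0.3, "s6": 0.7},
}

def sample_episode(start, steps):
    states = [start]
    for _ in range(steps):
        nxt = random.choices(list(P[states[-1]]),
                             weights=list(P[states[-1]].values()))[0]
        states.append(nxt)
    return states

print(sample_episode("s3", 5))   # e.g. ['s3', 's4', 's5', 's4', ...]
```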

  22. MP, MRP, and MDP
      v Markov Process (Markov Chain)
      v Markov Reward Process
      v Markov Decision Process

  23. A random walker + rewards: Markov Reward Process (MRP)
      [Same six-state Markov chain as slide 19]
      v Reward: +1 in s1, +3 in s5, 0 in all other states

  24. A random walker + rewards: Markov Reward Process
      [Same six-state Markov chain as slide 19]
      v Reward: +1 in s1, +3 in s5, 0 in all other states
      v Sample return for a sample 4-step episode, γ = 1/2:
        § s3 (t=1), s4 (t=2), s5 (t=3), s5 (t=4): G_1 = ? G_3 = ?

  25. A random walker + rewards: Markov Reward Process
      [Same six-state Markov chain as slide 19]
      v Reward: +1 in s1, +3 in s5, 0 in all other states
      v Sample returns for sample 4-step episodes, γ = 1/2:
        § s3, s4, s5, s6: G_1 = ?
        § s3, s3, s4, s3: G_1 = ?
        § s3, s2, s1, s1: G_1 = ?
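As a worked example for the first episode (assuming the convention G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ..., with the reward collected in the state visited at each step):

```latex
G_1 = r_1 + \gamma r_2 + \gamma^2 r_3 + \gamma^3 r_4
    = 0 + \tfrac{1}{2}\cdot 0 + \tfrac{1}{4}\cdot 3 + \tfrac{1}{8}\cdot 0
    = \tfrac{3}{4}
```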

  26. [Figure: three sample paths starting from s3]
      Samples:
      v s3, s4, s5, s6, ... (Path 1)
      v s3, s3, s4, s3, ... (Path 2)
      v s3, s2, s1, s1, ... (Path 3)
      v ...

  27. Return vs. Value function
      [Figure: sample paths Path 1, Path 2, Path 3]
      Samples:
      v s3, s4, s5, s6, ...
      v s3, s3, s4, s3, ...
      v ...
      The return is computed from a single sampled episode; the value function is the expected return over episodes.

  28. MP, MRP, and MDP
      v Markov Process (Markov Chain)
      v Markov Reward Process
      v Markov Decision Process

  29. Taxi passenger-seeking task: Markov Decision Process (MDP)
      [Figure: six states s1, ..., s6 with actions a1 and a2 (Left / Right) at each state]
      Deterministic transition model

  30. This lecture
      v Markov Process (Markov Chain), Markov Reward Process, and Markov Decision Process
        § MP, MRP, MDP, POMDP
      v Review:
        § Policy Iteration and Value Iteration
      v Model-Free Policy Evaluation
        § Monte Carlo policy evaluation
        § Temporal-difference (TD) policy evaluation

  31. For deterministic policy:
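The slide's equation is an image; for a deterministic policy, the Bellman expectation equation it presumably shows is:

```latex
V^{\pi}(s) = r\big(s, \pi(s)\big) + \gamma \sum_{s' \in S} p\big(s' \mid s, \pi(s)\big)\, V^{\pi}(s')
```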

  32. For deterministic and stochastic policy: From: Reinforcement Learning: An Introduction, Sutton and Barto, 2nd Edition
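The figure here is from Sutton and Barto; the general (stochastic-policy) form of the Bellman expectation equation it presumably refers to is:

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \Big[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^{\pi}(s') \Big]
```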

  33. (All-in-one algorithm) From: Reinforcement Learning: An Introduction, Sutton and Barto, 2nd Edition
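Assuming the "all-in-one algorithm" here is value iteration, which folds the policy-improvement step into each evaluation sweep (as in Sutton and Barto), a minimal tabular sketch:

```python
import numpy as np

# Minimal tabular value iteration sketch (assuming slide 33's "all-in-one
# algorithm" is value iteration, as in Sutton & Barto).
# P[s, a, s2] = p(s2 | s, a), shape (S, A, S); R[s, a] = r(s, a), shape (S, A).
def value_iteration(P, R, gamma=0.9, theta=1e-8):
    S = P.shape[0]
    V = np.zeros(S)
    while True:
        Q = R + gamma * P @ V        # Q[s, a] = r(s, a) + gamma * E[V(s')]
        V_new = Q.max(axis=1)        # greedy backup over actions
        if np.max(np.abs(V_new - V)) < theta:
            return V_new, Q.argmax(axis=1)   # values and a greedy policy
        V = V_new
```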

  34. Deterministic policy

  35. From: Reinforcement Learning: An Introduction, Sutton and Barto, 2nd Edition
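Slides 34-35 show figures from Sutton and Barto; assuming they present policy iteration for a deterministic policy, a minimal tabular sketch:

```python
import numpy as np

# Minimal tabular policy iteration sketch (an assumed reconstruction of the
# Sutton & Barto algorithm referenced on slides 34-35).
# P has shape (S, A, S); R has shape (S, A); the policy pi is deterministic.
def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    S = P.shape[0]
    pi = np.zeros(S, dtype=int)      # start with action 0 in every state
    while True:
        # Policy evaluation: iterate V(s) = r(s, pi(s)) + gamma * E[V(s')].
        V = np.zeros(S)
        while True:
            V_new = R[np.arange(S), pi] + gamma * P[np.arange(S), pi] @ V
            delta = np.max(np.abs(V_new - V))
            V = V_new
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to V.
        pi_new = (R + gamma * P @ V).argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```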

  36. This lecture
      v Markov Process (Markov Chain), Markov Reward Process, and Markov Decision Process
        § MP, MRP, MDP, POMDP
      v Review:
        § Policy Iteration and Value Iteration
      v Model-Free Policy Evaluation
        § Monte Carlo policy evaluation
        § Temporal-difference (TD) policy evaluation

  37. Review of Dynamic Programming for policy evaluation (model-based)
      [Backup diagram: state s, action a, successor state s']
      Equivalently,
      $$V_k^{\pi}(s) = \mathbb{E}_{\pi}\big[\, r_t + \gamma V_{k-1}^{\pi}(s_{t+1}) \,\big|\, s_t = s \,\big]$$

  38. Review of Dynamic Programming for policy evaluation (model-based)
      [Backup diagram, labeled "Bootstrapping"]
      $$V_k^{\pi}(s) = \mathbb{E}_{\pi}\big[\, r_t + \gamma V_{k-1}^{\pi}(s_{t+1}) \,\big|\, s_t = s \,\big]$$
      v Bootstrapping: the update for V uses an existing estimate of V
      v Known model: P(s'|s,a) and r(s,a)

  39. Review of Dynamic Programming for policy evaluation (model-based)
      [Backup diagram, labeled "Bootstrapping"]
      $$V_k^{\pi}(s) = \mathbb{E}_{\pi}\big[\, r_t + \gamma V_{k-1}^{\pi}(s_{t+1}) \,\big|\, s_t = s \,\big]$$
      v Requires a model of the MDP: P(s'|s,a) and r(s,a)
      v Bootstraps the future return using a value estimate
      v Requires the Markov assumption: bootstrapping happens regardless of history
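A minimal sketch of this model-based DP evaluation loop (array shapes and names are assumptions, consistent with the earlier sketches):

```python
import numpy as np

# DP policy evaluation from slides 37-39:
#   V_k(s) = sum_a pi(a|s) [ r(s, a) + gamma * sum_s' P(s'|s, a) V_{k-1}(s') ]
# P has shape (S, A, S); R has shape (S, A); pi has shape (S, A).
def dp_policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    S = P.shape[0]
    V = np.zeros(S)
    while True:
        Q = R + gamma * P @ V          # bootstrap: uses the previous estimate V
        V_new = (pi * Q).sum(axis=1)   # expectation over pi(a|s)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new
```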

  40. Model-free Policy Evaluation
      v What if we don't know the dynamics model P and/or the reward model R?
      v Today: policy evaluation without a model
        § Given data and/or the ability to interact in the environment, efficiently compute a good estimate of a policy π

  41. Model-free Policy Evaluation
      v Monte Carlo (MC) policy evaluation
        § First-visit based
        § Every-visit based
      v Temporal Difference (TD)
        § TD(0)
      v Metrics to evaluate and compare algorithms
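As a preview of the estimators named above, a minimal sketch; the episode format (a list of (state, reward) pairs, with the reward collected at the visited state) is an assumption:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.5):
    """First-visit Monte Carlo: V(s) = average return over first visits to s."""
    returns, counts = defaultdict(float), defaultdict(int)
    for episode in episodes:
        G, first_G = 0.0, {}
        for s, r in reversed(episode):   # accumulate the return backwards
            G = r + gamma * G
            first_G[s] = G               # ends up holding the FIRST visit's return
        for s, g in first_G.items():
            returns[s] += g
            counts[s] += 1
    return {s: returns[s] / counts[s] for s in returns}

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.5):
    """One TD(0) step: move V(s) toward the bootstrapped target r + gamma V(s')."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
```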
