DS595/CS525 Reinforcement Learning


  1. This lecture will be recorded!!! Welcome to DS595/CS525 Reinforcement Learning, Prof. Yanhua Li. Time: 6:00pm – 8:50pm, R (Thursdays), Zoom lecture, Fall 2020.

  2. Last Lecture
     - What is reinforcement learning?
     - Differences from other AI problems
     - Application stories
     - Topics to be covered in this course
     - Course logistics

  3. Reinforcement Learning: What is it? Reinforcement learning (RL) is an area of machine learning concerned with how software agents ought to take actions in an environment to maximize some notion of cumulative reward (from Wikipedia). Key components: 1. Model, 2. Value function, 3. Policy.

  4. RL involves 4 key aspects:
     1. Optimization: the goal is to find an optimal way to make decisions, with maximized total cumulated rewards.
     2. Exploration.
     3. Generalization: programming all possibilities is not possible.
     4. Delayed consequences.

  5. Branches of Machine Learning (diagram: Supervised Learning, Unsupervised Learning, Reinforcement Learning, AI planning, and Imitation Learning, all within Machine Learning). From David Silver's slides.

  6. Today's topics
     - Reinforcement Learning Components
       - Model, Value function, Policy
     - Model-based Planning
       - Policy Evaluation, Policy Search
     - Project 1 demo and description

  7. Today's topics
     - Reinforcement Learning Components
       - State vs observation
       - Stochastic vs deterministic model and policy
       - Model, Value function, Policy
     - Model-based Planning
       - Policy Evaluation, Policy Search
     - Project 1 demo and description

  8. Reinforcement Learning Components (agent-environment loop diagram: Observation, Action, Reward, Environment).

  9. Agent-Environment interactions over time (sequential decision process). Each time step t:
     1. Agent takes an action a_t;
     2. World updates given action a_t, emits observation o_t and reward r_t;
     3. Agent receives observation o_t and reward r_t.

 10. Interaction history, decision-making
     - History: h_t = (a_1, o_1, r_1, ..., a_t, o_t, r_t)
     - Agent chooses action a_{t+1} based on history h_t
     - State: information assumed to determine what happens next, as a function of history: s_t = f(h_t). In many cases, for simplicity, s_t = o_t.

 11. State transition & Markov property
     - Transition probability: p(s_{t+1} | s_t, a_t)
     - State s_t is Markov if and only if: p(s_{t+1} | s_t, a_t) = p(s_{t+1} | h_t, a_t)
     - The future is independent of the past, given the present.

 12. Two example tasks:
     - A taxi driver seeking passengers. State: (current location, with or without passenger). Action: a direction to go (Path 1, Path 2, or Path 3).
     - Hypertension control. State (observation): current blood pressure. Action: take medication or not.

 13. More on the Markov Property
     1. Does the Markov property always hold? No.
     2. What if the Markov property does not hold?

 14. More on the Markov Property
     1. Does the Markov property always hold? No.
     2. What if the Markov property does not hold? Make it Markov by setting the state to be the history: s_t = h_t. Again, in practice, we often assume the most recent observation is a sufficient statistic of the history: s_t = o_t.
     - The state representation has big implications for: 1. computational complexity, 2. data required, 3. resulting performance.

 15. Fully vs Partially Observable Markov Decision Process
     - Fully observable: what you observe fully represents the environment state; s_t = o_t.
     - Partially observable: what you observe only partially represents the environment state; use the history, s_t = h_t.

 16. Examples: Breakout game; poker games (fully vs. partially observable).

 17. Deterministic vs Stochastic Model
     - Deterministic: given history & action, a single observation & reward. Common assumption in robotics and controls.
       - p(s_{t+1} | s_t, a_t) = 1 for s_{t+1} = s'; p(s_{t+1} | s_t, a_t) = 0 for s_{t+1} ≠ s'
       - r(s_t, a_t) = 3, for s_t = s, a_t = a
     - Stochastic: given history & action, many potential observations & rewards. Common assumption for customers, patients, and hard-to-model domains.
       - 0 ≤ p(s_{t+1} | s_t, a_t) < 1
       - P[r(s_t, a_t) = 3] = 50%, P[r(s_t, a_t) = 5] = 50%, for s_t = s, a_t = a
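
A minimal Python sketch of the distinction (illustrative only, not from the slides; the integer states, action names, and the 0.5/0.5 splits mirror the numbers above but are otherwise made up):

```python
import random

# Hypothetical one-step models for a single (state, action) pair.
def deterministic_model(s, a):
    # Exactly one successor and one reward: p(s' | s, a) = 1 for that s'.
    return s + 1, 3.0

def stochastic_model(s, a):
    # Several possible successors/rewards, each with probability below 1.
    next_state = random.choices([s, s + 1], weights=[0.5, 0.5])[0]
    reward = random.choices([3.0, 5.0], weights=[0.5, 0.5])[0]
    return next_state, reward

print(deterministic_model(0, "right"))  # always (1, 3.0)
print(stochastic_model(0, "right"))     # varies between runs
```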

 18. Examples: Breakout game vs. hypertension control, for both the transition and reward models.

 19. Example: Taxi passenger-seeking task as a decision-making process (figure: states s_1, ..., s_6 along a road)
     - States: locations of the taxi (s_1, ..., s_6) on the road
     - Actions: Left or Right
     - Rewards: +1 in state s_1, +3 in state s_5, 0 in all other states
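
A minimal encoding of this task, sketched in Python; the slide specifies only the states, actions, and rewards, so the dictionary layout, the left-to-right ordering of s1..s6, and the behavior at the road ends are assumptions:

```python
# States, actions, and rewards of the taxi passenger-seeking task.
states = ["s1", "s2", "s3", "s4", "s5", "s6"]   # locations on the road
actions = ["left", "right"]

reward = {s: 0.0 for s in states}               # 0 in all other states
reward["s1"] = 1.0                              # +1 in state s1
reward["s5"] = 3.0                              # +3 in state s5

# Assumed deterministic transitions: s1..s6 laid out left-to-right,
# staying put at the two ends of the road (layout not given on the slide).
def step(s, a):
    i = states.index(s)
    j = i - 1 if a == "left" else i + 1
    return states[max(0, min(len(states) - 1, j))]
```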

 20. RL components. Often include one or more of:
     - Model: representation of how the world changes in response to the agent's actions
     - Policy: function mapping the agent's states to actions
     - Value function: future rewards from being in a state and/or taking an action when following a particular policy

 21. RL components. Often include one or more of:
     - Model: representation of how the world changes in response to the agent's actions
     - Policy: function mapping the agent's states to actions
     - Value function: future rewards from being in a state and/or taking an action when following a particular policy

 22. RL components: Model. The agent's representation of how the world changes in response to the agent's actions, with two parts:
     - Transition model: predicts the next agent state, p(s_{t+1} = s' | s_t = s, a_t = a)
     - Reward model: predicts the immediate reward, r(s_t = s, a_t = a)

 23. Taxi passenger-seeking task: stochastic Markov model (figure: states s_1, ..., s_6)
     - Taxi agent's transition model: 0.5 = p(s_3 | s_3, right) = p(s_4 | s_3, right); 0.5 = p(s_4 | s_4, right) = p(s_5 | s_4, right)
     - The RL agent's reward model is r'_1 = r'_2 = ... = r'_6 = 0, which may be wrong. The true reward model is r = [1, 0, 0, 0, 3, 0].

 24. RL components. Often include one or more of:
     - Model: representation of how the world changes in response to the agent's actions
     - Policy: function mapping the agent's states to actions
     - Value function: future rewards from being in a state and/or taking an action when following a particular policy

 25. RL components: Policy
     - Policy π determines how the agent chooses actions: π : S → A, a mapping from states to actions
     - Deterministic policy: π(s) = a. In other words, π(a|s) = 1, and π(a'|s) = π(a''|s) = 0 for the other actions.
     - Stochastic policy: π(a|s) = Pr(a_t = a | s_t = s)
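
A small sketch of the two policy types in Python (illustrative; the action set and the 50/50 weights are placeholders, not from the slide):

```python
import random

ACTIONS = ["left", "right"]

# Deterministic policy: pi(s) = a, i.e. pi(a|s) = 1 and 0 for every other action.
def pi_deterministic(s):
    return "right"

# Stochastic policy: pi(a|s) = Pr(a_t = a | s_t = s); here a 50/50 choice in every state.
def pi_stochastic(s):
    return random.choices(ACTIONS, weights=[0.5, 0.5])[0]
```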

 26. Taxi passenger-seeking task: Policy (figure: states s_1, ..., s_6 with 50%/50% arrows)
     - Action set: {left, right}. The policy is shown by the arrows.
     - Q1: Is this a deterministic or a stochastic policy?
     - Q2: Give an example of the other policy type.

 27. RL components. Often include one or more of:
     - Model: representation of how the world changes in response to the agent's actions
     - Policy: function mapping the agent's states to actions
     - Value function: future rewards from being in a state and/or taking an action when following a particular policy

 28. RL components: Value Function
     - Value function V^π: expected discounted sum of future rewards under a particular policy π
     - Discount factor γ weighs immediate vs future rewards, with γ in [0, 1]
     - Can be used to quantify the goodness/badness of states and actions
     - And to decide how to act by comparing policies
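
Written out as a formula (the standard definition, consistent with the description above):

```latex
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\, r_t + \gamma\, r_{t+1} + \gamma^{2} r_{t+2} + \cdots \;\middle|\; s_t = s \,\right],
\qquad \gamma \in [0, 1]
```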

 29. Taxi passenger-seeking task: Value function (figure: states s_1, ..., s_6). Discount factor γ = 0.
     - Policy #1: π(s_1) = π(s_2) = ... = π(s_6) = right. Q: V^π?
     - Policy #2: π(left | s_i) = π(right | s_i) = 50%, for i = 1, ..., 6. Q: V^π?

 30. Types of RL agents/algorithms
     - Model-free: explicit value function and/or policy function; no model
     - Model-based: explicit model; may or may not have a policy and/or value function

 31. Today's topics
     - Reinforcement Learning Components
       - Model, Value function, Policy
     - Model-based Planning
       - MDP model
       - Policy Evaluation, Policy Search
     - Project 1 demo and description

 32. MDP: Markov Decision Process

 33. MDP components: transition model, reward model, policy function, value function.
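
The equations behind these labels do not survive in this transcript; in the notation used on the other slides they are, roughly:

```latex
\begin{aligned}
\text{Transition model:}\quad & p(s_{t+1} = s' \mid s_t = s,\, a_t = a)\\
\text{Reward model:}\quad & r(s, a) = \mathbb{E}\left[\, r_t \mid s_t = s,\, a_t = a \,\right]\\
\text{Policy:}\quad & \pi(a \mid s) = \Pr(a_t = a \mid s_t = s)\\
\text{Value function:}\quad & V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, \textstyle\sum_{k \ge 0} \gamma^{k} r_{t+k} \;\middle|\; s_t = s \right]
\end{aligned}
```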

 34. Taxi passenger-seeking task as an MDP: transition model, reward model, policy function, value function (figure: states s_1, ..., s_6, actions a_1, a_2), with a deterministic transition model.

 35. Transition model, reward model, policy function, value function.

 36. Transition model, reward model, policy function, value function.

 37. Taxi passenger-seeking task: MDP policy evaluation (figure: states s_1, ..., s_6, actions a_1, a_2)
     - Let π(s) = a_1 ∀ s, with γ = 0.
     - What is the value of this policy?
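
A hedged sketch of tabular policy evaluation for a finite MDP like this one (iterative Bellman backups; the `transition`/`reward` dictionary layout and the convergence threshold are my own choices, not from the slide):

```python
def policy_evaluation(states, policy, transition, reward, gamma=0.0, tol=1e-8):
    """Iterate V(s) <- r(s, pi(s)) + gamma * sum_s' p(s' | s, pi(s)) * V(s').

    transition[(s, a)] is a dict {s': probability}; reward[(s, a)] is a scalar.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            new_v = reward[(s, a)] + gamma * sum(
                p * V[s2] for s2, p in transition[(s, a)].items()
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V
```

With γ = 0, as on this slide, the future term drops out and V^π(s) is simply the immediate reward r(s, π(s)).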

 38. Taxi passenger-seeking task: MDP control (figure: states s_1, ..., s_6, actions a_1, a_2)
     - 6 discrete states (location of the taxi)
     - 2 actions: Left or Right
     - How many deterministic policies are there? (See the worked count below.)
     - Is the optimal policy for an MDP always unique?
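
A deterministic policy chooses one of the |A| actions independently in each of the |S| states, so for 6 states and 2 actions the count asked for above is:

```latex
|A|^{|S|} = 2^{6} = 64
```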

  39. If the policy doesn't change, can it ever change again? Is there a maximum number of iterations of policy iteration?

  40. Project 1 starts today, due 9/24 at midnight: https://users.wpi.edu/~yli15/courses/DS595CS525Fall20/Assignments.html

  41. Any Comments & Critiques?
