  1. INFOB2KI 2019-2020, Utrecht University, The Netherlands. ARTIFICIAL INTELLIGENCE: Markov decision processes. Lecturer: Silja Renooij. These slides are part of the INFOB2KI Course Notes, available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

  2. PageRank (Google). PageRank can be understood as: a) a Markov Chain; b) a Markov Decision Process; c) a Partially Observable Markov Decision Process; d) none of the above.

  3. Markov models. A Markov model is a stochastic model that assumes the Markov property. Stochastic model: models a process where the state depends on previous states in a non-deterministic way. Markov property: the probability distribution of future states, conditioned on both past and present values, depends only upon the present state: "given the present, the future does not depend on the past". Generally, this assumption enables reasoning and computation with the model that would otherwise be intractable.
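
A minimal sketch of what the Markov property means operationally: the next state is sampled from a distribution that depends only on the current state, never on the earlier history. The two-state weather chain below is a made-up example, not from the slides.

```python
import random

# Hypothetical two-state weather chain: the transition probabilities
# depend only on the current state, never on earlier history.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(current):
    """Sample the successor state using only the current state."""
    states, probs = zip(*TRANSITIONS[current].items())
    return random.choices(states, weights=probs, k=1)[0]

state = "sunny"
for _ in range(5):
    state = next_state(state)
    print(state)
```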

  4. Markov model types. The four model types, by observability and purpose:
- Fully observable, prediction: Markov chain
- Fully observable, planning (typically for optimisation purposes): MDP (Markov decision process)
- Partially observable, prediction: Hidden Markov model
- Partially observable, planning: POMDP (Partially observable Markov decision process)
Prediction models can be represented at the variable level by a (Dynamic) Bayesian network. [Figure: a DBN with state variables S1, S2, S3, ... and observation variables O1, O2, O3, ...]

  5. PageRank (Google). PageRank can be understood as: a) a Markov Chain; b) a Markov Decision Process; c) a Partially Observable Markov Decision Process; d) none of the above.
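
To connect the quiz to the definitions that follow: PageRank models a random surfer as a Markov chain over pages and computes its stationary distribution by power iteration. A toy sketch, in which the 3-page link matrix and the damping factor 0.85 are illustrative assumptions, not anything from the slides:

```python
import numpy as np

# Toy web of 3 pages; column-stochastic link matrix: entry [j, i] is the
# probability of following a link from page i to page j.
L = np.array([
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
])
d = 0.85                    # damping factor (a commonly quoted value, assumed here)
n = L.shape[0]
G = d * L + (1 - d) / n     # random-surfer Markov chain ("Google matrix")

rank = np.full(n, 1.0 / n)  # start from a uniform distribution over pages
for _ in range(100):        # power iteration towards the stationary distribution
    rank = G @ rank
print(rank)
```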

  6. MDPs: outline. Search in non-deterministic environments. Solution: an optimal policy (plan) of actions that maximizes rewards (decision-theoretic planning). Algorithm: the Bellman equation and value iteration. Link with learning.

  7. Running example: Grid World. A maze-like problem: the agent lives in a grid, where walls block the agent's path. Noisy movement: actions do not always go as planned. If there is a wall in the chosen direction, the agent stays put; 80% of the time, the action North takes the agent North; 10% of the time, North takes the agent West, and 10% East (the same deviation applies to the other actions). The agent receives a reward each "time" step: a small "living" reward each step (which can be negative), while big rewards come at the end (good or bad). Goal: maximize the sum of rewards.
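
As one possible concrete reading of the 80/10/10 noise model above, here is a small sketch of a single noisy step; the grid layout, coordinates and helper names are illustrative, not the course code.

```python
import random

# Illustrative one-step noise model for Grid World: the intended action
# succeeds 80% of the time; 10% it slips to the left, 10% to the right.
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
SLIPS = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}

def step(pos, action, is_wall):
    """Apply a noisy action; if the resulting cell is a wall, stay put."""
    left, right = SLIPS[action]
    actual = random.choices([action, left, right], weights=[0.8, 0.1, 0.1])[0]
    dx, dy = MOVES[actual]
    nxt = (pos[0] + dx, pos[1] + dy)
    return pos if is_wall(nxt) else nxt

# Example: a hypothetical 4x3 grid with one internal wall at (1, 1).
walls = {(1, 1)}
in_grid = lambda p: 0 <= p[0] < 4 and 0 <= p[1] < 3
print(step((0, 0), "N", is_wall=lambda p: p in walls or not in_grid(p)))
```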

  8. Search in non-deterministic environments: Grid World example. [Figures: Deterministic Grid World vs. Stochastic Grid World.] Noisy movement: actions do not always go as planned.

  9. MDP: in search of a `plan' that maximises reward. Each MDP state projects a search tree: s is a state, a is an action available in s, and (s, a, s') is called a transition, with probability T(s, a, s') = P(s' | s, a) and reward R(s, a, s').

  10. Goals, rewards and optimality criteria. Planning goals are encoded in the reward function. Example: achieving a state satisfying property P at minimal cost is encoded by making any state satisfying P a zero-reward absorbing state and assigning all other states negative reward. Rewards: additive and time-separable. Transitions: their effect is uncertain. Objective: maximize expected total reward; future rewards may be discounted. Planning horizon: finite, infinite or indefinite (the latter is a special case of infinite, where the agent is guaranteed to reach a terminal state).

  11. Markov Decision Processes. MDPs are non-deterministic search problems. An MDP is defined by: a set of states s ∈ S; a set of actions a ∈ A; a transition function T(s, a, s'), the probability that a from s leads to s', i.e. P(s' | s, a), also called the model or the dynamics; a reward function R(s, a, s') (sometimes just R(s) or R(s')); a start state; and sometimes a terminal state.
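
As a concrete reading of this definition, a minimal sketch of how the ingredients of an MDP could be bundled into a data structure; the field names and layout are my own, not from the slides.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

State = str
Action = str

@dataclass
class MDP:
    states: List[State]
    actions: List[Action]
    # T[(s, a)] is a list of (s', probability) pairs: the dynamics P(s' | s, a).
    T: Dict[Tuple[State, Action], List[Tuple[State, float]]]
    # R(s, a, s'): reward received on the transition from s to s' via a.
    R: Callable[[State, Action, State], float]
    start: State
    terminals: Set[State] = field(default_factory=set)  # optional terminal states
```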

  12. What is Markov about MDPs? Recall: "Markov" generally means that given the present state, the future and the past are independent. For Markov decision processes, "Markov" means that action outcomes depend only on the current state (Andrey Markov, 1856-1922). This is just like search, where the successor function could only depend on the current state (not the history).

  13. Running example: Grid World. A maze-like problem: the agent lives in a grid, where walls block the agent's path. Noisy movement: actions do not always go as planned. If there is a wall in the chosen direction, the agent stays put; 80% of the time, the action North takes the agent North; 10% of the time, North takes the agent West, and 10% East (the same deviation applies to the other actions). The agent receives a reward each "time" step: a small "living" reward each step (which can be negative), while big rewards come at the end (good or bad). Goal: maximize the sum of rewards.

  14. Policies. Recall: in deterministic search, the optimal plan is a sequence of actions from the start to a goal. For MDPs, we want an optimal policy π*: S → A. A policy π gives an action for each state; an optimal policy is one that maximizes expected utility (reward) if followed. [Figure: the optimal policy when R(s, a, s') = -0.03 for all non-terminal states s.] Note: an explicit policy defines a reflex agent.

  15. Optimal Policies: examples. [Figures: four Grid World policies, for living rewards R(s) = -0.01, R(s) = -0.03, R(s) = -0.4 and R(s) = -2.0; each grid has a +1 and a -1 terminal state.] Visualisation: each cell represents the state in which the robot occupies that cell; the arrow indicates the optimal action in the given state.

  16. Utilities of reward sequences. What preferences should an agent have over reward sequences? More or less: [1, 2, 2] or [2, 3, 4]? Now or later: [0, 0, 1] or [1, 0, 0]? It is reasonable to maximize the sum of rewards. It is also reasonable to prefer rewards now to rewards later. A solution: let the values of rewards decay exponentially.

  17. Discounting. [Figure: how much a reward is worth now, worth one step later, and worth two steps later, under discounting.]

  18. Discounting: implementation. How to discount? Each time we descend a level, we multiply in the discount once. Why discount? Sooner rewards probably do have higher utility than later rewards, and discounting also helps our algorithms converge. Example: the value of receiving [1, 2, 3] with a discount of 0.5 is 1*1 + 0.5*2 + 0.25*3 = 2.75, which is less than that of [3, 2, 1] (= 4.25).
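
The arithmetic of the example can be checked with a short helper; the function name is mine, for illustration only.

```python
def discounted_value(rewards, gamma):
    """Sum of rewards, each multiplied by the discount once per level descended."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_value([1, 2, 3], 0.5))  # 1*1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_value([3, 2, 1], 0.5))  # 3*1 + 0.5*2 + 0.25*1 = 4.25 > 2.75
```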

  19. Return: reward in the long run. Episodic tasks: the interaction breaks naturally into episodes, e.g. plays of a game or trips through a maze. The return is the total reward from time t up to the time T that ends the episode:

$R_t = r_{t+1} + r_{t+2} + \dots + r_T$

Continuing tasks: the interaction does not have natural episodes. The return is discounted by a discount rate γ, 0 ≤ γ ≤ 1:

$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$

(γ close to 0: shortsighted; γ close to 1: farsighted)

  20. Solving MDPs

  21. Optimal quantities. A state s has a value V(s): V^π(s) is the expected reward from s when acting according to policy π; V*(s) is the expected reward when starting in s and thereafter acting optimally. The optimal policy: π*(s) is the optimal action from state s. Important property: a policy π that is greedy with respect to V* is an optimal policy π*. [Figure: a search-tree fragment with state s, action a, q-state (s, a), transition (s, a, s') and successor state s'.] (For later: an intermediate q-state has value Q(s, a).)

  22. Characterizing any V^π(s). What is the value of following policy π when in state s_t? First, consider the deterministic situation:

$V^{\pi}(s_t) = E[R_t] = R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$

$= r_{t+1} + \sum_{k=1}^{\infty} \gamma^k r_{t+k+1} = r_{t+1} + \gamma \sum_{i=0}^{\infty} \gamma^i r_{t+i+2} = r_{t+1} + \gamma R_{t+1}$

$= r_{t+1} + \gamma V^{\pi}(s_{t+1}) = R(s_t, \pi(s_t), s_{t+1}) + \gamma V^{\pi}(s_{t+1})$

Noise: take the expected value over all possible next states:

$V^{\pi}(s_t) = \sum_{s_{t+1}} P(s_{t+1} \mid s_t, \pi(s_t)) \, \big[ R(s_t, \pi(s_t), s_{t+1}) + \gamma V^{\pi}(s_{t+1}) \big]$

where P() is given by the transition function T().
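
Read as an update rule, this expectation gives iterative policy evaluation. A minimal sketch, assuming the MDP is stored as plain dictionaries (T maps (s, a) to a list of (s', probability) pairs, R is a function); the names are illustrative, not the course's code.

```python
def evaluate_policy(states, policy, T, R, gamma=0.9, tol=1e-6):
    """Repeatedly apply V(s) <- sum_s' P(s'|s,pi(s)) [R(s,pi(s),s') + gamma V(s')].

    Terminal states can be modelled by giving T[(s, a)] an empty successor list.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            new_v = sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V
```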

  23. Example: Policy Evaluation. [Figures: two Grid Worlds with terminal states; V^π is shown for each state (indicated in the cell), once for the policy "Always Go Right" and once for "Always Go Forward".]

  24. Characterizing the optimal V*(s). The expected reward from state s is maximized by acting optimally in s and thereafter, so the optimal value for a state is obtained when following the optimal policy π*:

$V^*(s) = V^{\pi^*}(s) = \max_{\pi} V^{\pi}(s)$

$= \max_{\pi} \sum_{s'} T(s, \pi(s), s') \, \big[ R(s, \pi(s), s') + \gamma V^*(s') \big]$

$= \max_{a} \sum_{s'} T(s, a, s') \, \big[ R(s, a, s') + \gamma V^*(s') \big] \quad \big( = \max_{a} Q^*(s, a) \big)$

This equation is called the Bellman equation.
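
The outline earlier pairs the Bellman equation with value iteration; below is a minimal sketch of that algorithm under the same assumed dictionary representation as above (illustrative names, not the course's code).

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Turn the Bellman equation into an update:
    V(s) <- max_a sum_s' T(s,a,s') [R(s,a,s') + gamma V(s')]."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```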

  25. Using V*(s) to obtain π*(s). The optimal policy can be extracted from V*(s):

$\pi^*(s) = \arg\max_{a} \sum_{s'} T(s, a, s') \, \big[ R(s, a, s') + \gamma V^*(s') \big]$

using one-step look-ahead, i.e.
• use the Bellman equation once more to compute the given summation for all actions;
• rather than returning the max value, return the action that gives the max value.
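
The same one-step look-ahead, as a hedged sketch in the representation assumed above:

```python
def extract_policy(states, actions, T, R, V, gamma=0.9):
    """pi*(s) = argmax_a sum_s' T(s,a,s') [R(s,a,s') + gamma V*(s')]."""
    return {
        s: max(
            actions,
            key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)]),
        )
        for s in states
    }
```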
