  1. INFOB2KI 2019-2020, Utrecht University, The Netherlands. ARTIFICIAL INTELLIGENCE: Markov decision processes. Lecturer: Silja Renooij. These slides are part of the INFOB2KI Course Notes, available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

  2. PageRank (Google). PageRank can be understood as: a) a Markov Chain; b) a Markov Decision Process; c) a Partially Observable Markov Decision Process; d) none of the above.

  3. Markov models. A Markov model is a stochastic model that assumes the Markov property. Stochastic model: models a process where the state depends on previous states in a non-deterministic way. Markov property: the probability distribution of future states, conditioned on both past and present values, depends only upon the present state: "given the present, the future does not depend on the past". Generally, this assumption enables reasoning and computation with the model that would otherwise be intractable.
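
A minimal sketch of what the Markov property means operationally: the next state is sampled from a distribution that depends only on the current state, never on the earlier history. The two-state weather chain below is a made-up example, not from the slides.

```python
import random

# Hypothetical two-state weather chain: the transition probabilities
# depend only on the current state, never on earlier history.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(current):
    """Sample the successor state using only the current state."""
    states, probs = zip(*TRANSITIONS[current].items())
    return random.choices(states, weights=probs, k=1)[0]

state = "sunny"
for _ in range(5):
    state = next_state(state)
    print(state)
```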

  4. Markov model types. The four model types, by observability and purpose:
- Fully observable, prediction: Markov chain
- Fully observable, planning (typically for optimisation purposes): MDP (Markov decision process)
- Partially observable, prediction: Hidden Markov model
- Partially observable, planning: POMDP (Partially observable Markov decision process)
Prediction models can be represented at the variable level by a (Dynamic) Bayesian network. [Figure: a DBN with state variables S1, S2, S3, ... and observation variables O1, O2, O3, ...]

  5. PageRank (Google). PageRank can be understood as: a) a Markov Chain; b) a Markov Decision Process; c) a Partially Observable Markov Decision Process; d) none of the above.
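
To connect the quiz to the definitions that follow: PageRank models a random surfer as a Markov chain over pages and computes its stationary distribution by power iteration. A toy sketch, in which the 3-page link matrix and the damping factor 0.85 are illustrative assumptions, not anything from the slides:

```python
import numpy as np

# Toy web of 3 pages; column-stochastic link matrix: entry [j, i] is the
# probability of following a link from page i to page j.
L = np.array([
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
])
d = 0.85                    # damping factor (a commonly quoted value, assumed here)
n = L.shape[0]
G = d * L + (1 - d) / n     # random-surfer Markov chain ("Google matrix")

rank = np.full(n, 1.0 / n)  # start from a uniform distribution over pages
for _ in range(100):        # power iteration towards the stationary distribution
    rank = G @ rank
print(rank)
```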

  6. MDPs: outline. Search in non-deterministic environments. Solution: an optimal policy (plan) of actions that maximizes rewards (decision-theoretic planning). Algorithm: the Bellman equation and value iteration. Link with learning.

  7. Running example: Grid World. A maze-like problem: the agent lives in a grid, where walls block the agent's path. Noisy movement: actions do not always go as planned. If there is a wall in the chosen direction, the agent stays put; 80% of the time, the action North takes the agent North; 10% of the time, North takes the agent West, and 10% East (the same deviation applies to the other actions). The agent receives a reward each "time" step: a small "living" reward each step (which can be negative), while big rewards come at the end (good or bad). Goal: maximize the sum of rewards.
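
As one possible concrete reading of the 80/10/10 noise model above, here is a small sketch of a single noisy step; the grid layout, coordinates and helper names are illustrative, not the course code.

```python
import random

# Illustrative one-step noise model for Grid World: the intended action
# succeeds 80% of the time; 10% it slips to the left, 10% to the right.
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
SLIPS = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}

def step(pos, action, is_wall):
    """Apply a noisy action; if the resulting cell is a wall, stay put."""
    left, right = SLIPS[action]
    actual = random.choices([action, left, right], weights=[0.8, 0.1, 0.1])[0]
    dx, dy = MOVES[actual]
    nxt = (pos[0] + dx, pos[1] + dy)
    return pos if is_wall(nxt) else nxt

# Example: a hypothetical 4x3 grid with one internal wall at (1, 1).
walls = {(1, 1)}
in_grid = lambda p: 0 <= p[0] < 4 and 0 <= p[1] < 3
print(step((0, 0), "N", is_wall=lambda p: p in walls or not in_grid(p)))
```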

  8. Search in non-deterministic environments: Grid World example. [Figures: Deterministic Grid World vs. Stochastic Grid World.] Noisy movement: actions do not always go as planned.

  9. MDP: in search of a `plan' that maximises reward. Each MDP state projects a search tree: s is a state, a is an action available in s, and (s, a, s') is called a transition, with probability T(s, a, s') = P(s' | s, a) and reward R(s, a, s').

  10. Goals, rewards and optimality criteria. Planning goals are encoded in the reward function. Example: achieving a state satisfying property P at minimal cost is encoded by making any state satisfying P a zero-reward absorbing state and assigning all other states negative reward. Rewards: additive and time-separable. Transitions: their effect is uncertain. Objective: maximize expected total reward; future rewards may be discounted. Planning horizon: finite, infinite or indefinite (the latter is a special case of infinite, where the agent is guaranteed to reach a terminal state).

  11. Markov Decision Processes. MDPs are non-deterministic search problems. An MDP is defined by: a set of states s ∈ S; a set of actions a ∈ A; a transition function T(s, a, s'), the probability that a from s leads to s', i.e. P(s' | s, a), also called the model or the dynamics; a reward function R(s, a, s') (sometimes just R(s) or R(s')); a start state; and sometimes a terminal state.
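
As a concrete reading of this definition, a minimal sketch of how the ingredients of an MDP could be bundled into a data structure; the field names and layout are my own, not from the slides.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

State = str
Action = str

@dataclass
class MDP:
    states: List[State]
    actions: List[Action]
    # T[(s, a)] is a list of (s', probability) pairs: the dynamics P(s' | s, a).
    T: Dict[Tuple[State, Action], List[Tuple[State, float]]]
    # R(s, a, s'): reward received on the transition from s to s' via a.
    R: Callable[[State, Action, State], float]
    start: State
    terminals: Set[State] = field(default_factory=set)  # optional terminal states
```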

  12. What is Markov about MDPs? Recall: "Markov" generally means that given the present state, the future and the past are independent. For Markov decision processes, "Markov" means that action outcomes depend only on the current state (Andrey Markov, 1856-1922). This is just like search, where the successor function could only depend on the current state (not the history).

  13. Running example: Grid World. A maze-like problem: the agent lives in a grid, where walls block the agent's path. Noisy movement: actions do not always go as planned. If there is a wall in the chosen direction, the agent stays put; 80% of the time, the action North takes the agent North; 10% of the time, North takes the agent West, and 10% East (the same deviation applies to the other actions). The agent receives a reward each "time" step: a small "living" reward each step (which can be negative), while big rewards come at the end (good or bad). Goal: maximize the sum of rewards.

  14. Policies. Recall: in deterministic search, the optimal plan is a sequence of actions from the start to a goal. For MDPs, we want an optimal policy π*: S → A. A policy π gives an action for each state; an optimal policy is one that maximizes expected utility (reward) if followed. [Figure: the optimal policy when R(s, a, s') = -0.03 for all non-terminal states s.] Note: an explicit policy defines a reflex agent.

  15. Optimal Policies: examples. [Figures: four Grid World policies, for living rewards R(s) = -0.01, R(s) = -0.03, R(s) = -0.4 and R(s) = -2.0; each grid has a +1 and a -1 terminal state.] Visualisation: each cell represents the state in which the robot occupies that cell; the arrow indicates the optimal action in the given state.

  16. Utilities of reward sequences. What preferences should an agent have over reward sequences? More or less: [1, 2, 2] or [2, 3, 4]? Now or later: [0, 0, 1] or [1, 0, 0]? It is reasonable to maximize the sum of rewards. It is also reasonable to prefer rewards now to rewards later. A solution: let the values of rewards decay exponentially.

  17. Discounting. [Figure: how much a reward is worth now, worth one step later, and worth two steps later, under discounting.]

  18. Discounting: implementation. How to discount? Each time we descend a level, we multiply in the discount once. Why discount? Sooner rewards probably do have higher utility than later rewards, and discounting also helps our algorithms converge. Example: the value of receiving [1, 2, 3] with a discount of 0.5 is 1*1 + 0.5*2 + 0.25*3 = 2.75, which is less than that of [3, 2, 1] (= 4.25).
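
The arithmetic of the example can be checked with a short helper; the function name is mine, for illustration only.

```python
def discounted_value(rewards, gamma):
    """Sum of rewards, each multiplied by the discount once per level descended."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_value([1, 2, 3], 0.5))  # 1*1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_value([3, 2, 1], 0.5))  # 3*1 + 0.5*2 + 0.25*1 = 4.25 > 2.75
```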

  19. Return: reward in the long run. Episodic tasks: the interaction breaks naturally into episodes, e.g. plays of a game or trips through a maze. The return is the total reward from time t up to the time T that ends the episode:

$R_t = r_{t+1} + r_{t+2} + \dots + r_T$

Continuing tasks: the interaction does not have natural episodes. The return is discounted by a discount rate γ, 0 ≤ γ ≤ 1:

$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$

(γ close to 0: shortsighted; γ close to 1: farsighted)

  20. Solving MDPs

  21. Optimal quantities. A state s has a value V(s): V^π(s) is the expected reward from s when acting according to policy π; V*(s) is the expected reward when starting in s and thereafter acting optimally. The optimal policy: π*(s) is the optimal action from state s. Important property: a policy π that is greedy with respect to V* is an optimal policy π*. [Figure: a search-tree fragment with state s, action a, q-state (s, a), transition (s, a, s') and successor state s'.] (For later: an intermediate q-state has value Q(s, a).)

  22. Characterizing any V^π(s). What is the value of following policy π when in state s_t? First, consider the deterministic situation:

$V^{\pi}(s_t) = E[R_t] = R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$

$= r_{t+1} + \sum_{k=1}^{\infty} \gamma^k r_{t+k+1} = r_{t+1} + \gamma \sum_{i=0}^{\infty} \gamma^i r_{t+i+2} = r_{t+1} + \gamma R_{t+1}$

$= r_{t+1} + \gamma V^{\pi}(s_{t+1}) = R(s_t, \pi(s_t), s_{t+1}) + \gamma V^{\pi}(s_{t+1})$

Noise: take the expected value over all possible next states:

$V^{\pi}(s_t) = \sum_{s_{t+1}} P(s_{t+1} \mid s_t, \pi(s_t)) \, \big[ R(s_t, \pi(s_t), s_{t+1}) + \gamma V^{\pi}(s_{t+1}) \big]$

where P() is given by the transition function T().
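
Read as an update rule, this expectation gives iterative policy evaluation. A minimal sketch, assuming the MDP is stored as plain dictionaries (T maps (s, a) to a list of (s', probability) pairs, R is a function); the names are illustrative, not the course's code.

```python
def evaluate_policy(states, policy, T, R, gamma=0.9, tol=1e-6):
    """Repeatedly apply V(s) <- sum_s' P(s'|s,pi(s)) [R(s,pi(s),s') + gamma V(s')].

    Terminal states can be modelled by giving T[(s, a)] an empty successor list.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            new_v = sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V
```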

  23. Example: Policy Evaluation. [Figures: two Grid Worlds with terminal states; V^π is shown for each state (indicated in the cell), once for the policy "Always Go Right" and once for "Always Go Forward".]

  24. Characterizing the optimal V*(s). The expected reward from state s is maximized by acting optimally in s and thereafter, so the optimal value for a state is obtained when following the optimal policy π*:

$V^*(s) = V^{\pi^*}(s) = \max_{\pi} V^{\pi}(s)$

$= \max_{\pi} \sum_{s'} T(s, \pi(s), s') \, \big[ R(s, \pi(s), s') + \gamma V^*(s') \big]$

$= \max_{a} \sum_{s'} T(s, a, s') \, \big[ R(s, a, s') + \gamma V^*(s') \big] \quad \big( = \max_{a} Q^*(s, a) \big)$

This equation is called the Bellman equation.
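
The outline earlier pairs the Bellman equation with value iteration; below is a minimal sketch of that algorithm under the same assumed dictionary representation as above (illustrative names, not the course's code).

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Turn the Bellman equation into an update:
    V(s) <- max_a sum_s' T(s,a,s') [R(s,a,s') + gamma V(s')]."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```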

  25. Using V*(s) to obtain π*(s). The optimal policy can be extracted from V*(s):

$\pi^*(s) = \arg\max_{a} \sum_{s'} T(s, a, s') \, \big[ R(s, a, s') + \gamma V^*(s') \big]$

using one-step look-ahead, i.e.
• use the Bellman equation once more to compute the given summation for all actions;
• rather than returning the max value, return the action that gives the max value.
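
The same one-step look-ahead, as a hedged sketch in the representation assumed above:

```python
def extract_policy(states, actions, T, R, V, gamma=0.9):
    """pi*(s) = argmax_a sum_s' T(s,a,s') [R(s,a,s') + gamma V*(s')]."""
    return {
        s: max(
            actions,
            key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)]),
        )
        for s in states
    }
```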
