
About this class

Partially Observable Markov Decision Processes [Most of this lecture based on Kaelbling, Littman, and Cassandra, 1998]


Recall the MDP Framework

Slightly different notation this time.

S: finite set of states of the world.
A: finite set of actions.
T : S × A → Π(S): state transition function. Write T(s, a, s′) for the probability of ending in state s′ when starting from state s and taking action a.
R : S × A → ℝ: reward function. R(s, a) is the expected reward for taking action a in state s.


Partial Observability

A POMDP is a tuple ⟨S, A, T, R, Ω, O⟩ where S, A, T, R describe an MDP, and:

Ω is a finite set of observations the agent can experience.
O : S × A → Π(Ω) is the observation function, giving, for each action and the resulting state, a probability distribution over possible observations. O(s′, a, o) is the probability of making observation o given that the agent took action a and landed in state s′.
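
A minimal sketch of how these components might be stored, assuming NumPy arrays indexed as T[s, a, s′], R[s, a], and O[s′, a, o] (the array layout and class name are illustrative assumptions, not part of the notes):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class POMDP:
    """Container for <S, A, T, R, Omega, O> with finite S, A, Omega."""
    T: np.ndarray  # T[s, a, s'] = Pr(s' | s, a),   shape (|S|, |A|, |S|)
    R: np.ndarray  # R[s, a]     = expected reward, shape (|S|, |A|)
    O: np.ndarray  # O[s', a, o] = Pr(o | a, s'),   shape (|S|, |A|, |Omega|)

    def __post_init__(self):
        # Each T(s, a, ·) and each O(s', a, ·) must be a probability distribution.
        assert np.allclose(self.T.sum(axis=2), 1.0)
        assert np.allclose(self.O.sum(axis=2), 1.0)
```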


How to Control a POMDP

[Figure (from Kaelbling, Littman, and Cassandra): the POMDP control architecture, in which observations from the world feed a state estimator (SE) that maintains a belief state b, and a policy π maps b to actions.]



State Estimation

The agent keeps an internal belief state that summarizes its previous experience. The state estimator (SE) updates this belief state based on the last action, the current observation, and the previous belief state. What should the belief state be? The most probable state of the world? But this could lead to big problems. Suppose I'm wrong? Suppose I'm uncertain and can gain value through taking an informative action? Instead we will use probability distributions over the true state of the world.


Example 1

(from Kaelbling, Littman, and Cassandra) Four states arranged in a line; state 3 is a goal state. The task is episodic. There are two actions, East and West, which succeed with probability 0.9 and, when they fail, move in the opposite direction. If no movement is possible, the agent stays in the same location. Suppose the agent starts off equally likely to be in any of the three non-goal states, then takes action East twice and does not observe the goal state. What is the evolution of belief states?

[0.333, 0.333, 0.000, 0.333]
[0.100, 0.450, 0.000, 0.450]
[0.100, 0.164, 0.000, 0.736]

There will always be some probability mass on each of the non-goal states, since actions have some chance of failing.
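
A minimal numeric sketch reproducing this evolution, assuming the four states sit in a line with walls at both ends (so a move into a wall leaves the agent in place):

```python
import numpy as np

# States 1-4 in a line; index 2 (state 3) is the goal.
# Actions succeed with prob 0.9; on failure the agent moves the opposite way.
GOAL = 2
SUCCEED, FAIL = 0.9, 0.1

def step_east(b):
    """One East action followed by the observation 'not at the goal'."""
    b_new = np.zeros(4)
    for s, p in enumerate(b):
        east = min(s + 1, 3)   # bumping into the right wall keeps you in place
        west = max(s - 1, 0)   # bumping into the left wall keeps you in place
        b_new[east] += SUCCEED * p
        b_new[west] += FAIL * p
    b_new[GOAL] = 0.0          # we did not observe the goal
    return b_new / b_new.sum() # renormalize

b = np.array([1/3, 1/3, 0.0, 1/3])   # uniform over the non-goal states
for _ in range(2):
    b = step_east(b)
    print(np.round(b, 3))
# -> [0.1, 0.45, 0.0, 0.45], then [0.1, 0.164, 0.0, 0.736]
```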

Example 2

(from Littman, 2009) Suppose in either of the two Start states you can look up and make an observation that will be either Green or Red. This gives you the information you need to succeed, but if there's a small penalty for actions or some discounting, you wouldn't necessarily do it if you were using the most probable state (for example, if your initial belief state puts probability 1/4 on being in (rewardLeft, start) and 3/4 on being in (rewardRight, start)).

Interesting connection, again, to value of information, and to exploration-exploitation.



Belief State Updates

Let b(s) be the probability assigned to world state s by belief state b. Then Σ_{s∈S} b(s) = 1.

Given b, a, o, compute b′:

b′(s′) = Pr(s′ | o, a, b)
       = Pr(o | s′, a, b) Pr(s′ | a, b) / Pr(o | a, b)
       = Pr(o | s′, a) Σ_{s∈S} Pr(s′ | a, b, s) Pr(s | a, b) / Pr(o | a, b)
       = O(s′, a, o) Σ_{s∈S} T(s, a, s′) b(s) / Pr(o | a, b)

The denominator is a normalizing factor, so this is all easy to compute.
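
A minimal sketch of this update, assuming the transition and observation models are stored as NumPy arrays T[s, a, s′] and O[s′, a, o] as above (the array layout is an assumption):

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """SE(b, a, o): b'(s') is proportional to O(s', a, o) * sum_s T(s, a, s') b(s)."""
    b_next = O[:, a, o] * (T[:, a, :].T @ b)  # unnormalized numerator
    return b_next / b_next.sum()              # dividing by Pr(o | a, b) normalizes
```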


The “Belief MDP”

State space: B, the set of belief states.
Action space: A, the same as in the original MDP.
Transition model:
τ(b, a, b′) = Pr(b′ | a, b) = Σ_{o∈Ω} Pr(b′ | a, b, o) Pr(o | a, b)
where Pr(b′ | b, a, o) is 1 if SE(b, a, o) = b′ and 0 otherwise.
Reward function: ρ(b, a) = Σ_{s∈S} b(s) R(s, a).

Isn't this delusional? I'm getting rewarded just for believing I'm in a good state? It only works because my updates are based on a correct observation and transition model of the world, so the belief state represents the true probabilities of being in each world state.

The bad news: in general, it is very hard to solve continuous-space MDPs (there are uncountably many belief states).
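
A minimal sketch of ρ and τ under the same array conventions, reusing the belief_update function sketched above (the observation count n_obs and function names are assumptions):

```python
import numpy as np

def rho(b, a, R):
    """Belief-MDP reward: rho(b, a) = sum_s b(s) R(s, a)."""
    return b @ R[:, a]

def tau(b, a, b_next, T, O, n_obs):
    """Belief-MDP transition: add Pr(o | a, b) for every observation o
    whose update SE(b, a, o) lands exactly on b_next."""
    pred = T[:, a, :].T @ b       # Pr(s' | a, b) for each s'
    total = 0.0
    for o in range(n_obs):
        p_o = O[:, a, o] @ pred   # Pr(o | a, b)
        if p_o > 0 and np.allclose(belief_update(b, a, o, T, O), b_next):
            total += p_o
    return total
```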


Policy Trees / Contingent Plans

Think about finite-horizon policies. We can't just have a mapping from states to actions in this case, because we don't know what state we're going to be in. Instead, formulate contingent plans or policy trees that tell the agent what to do in case of each particular sequence of observations from a given start (world) state.


Let a(p) be the action specified at the top of a policy tree p, and oi(p) be the policy subtree induced from p when observing oi.

Suppose p is a one-step policy tree. Then
Vp(s) = R(s, a(p)).

Now, how do we go from the value functions constructed from policy trees of depth t−1 to value functions constructed from policy trees of depth t?

Vp(s) = R(s, a(p)) + γ [expected value of the future]
      = R(s, a(p)) + γ Σ_{s′∈S} Pr(s′ | s, a(p)) Σ_{oi∈Ω} Pr(oi | s′, a(p)) V_{oi(p)}(s′)
      = R(s, a(p)) + γ Σ_{s′∈S} T(s, a(p), s′) Σ_{oi∈Ω} O(s′, a(p), oi) V_{oi(p)}(s′)

Since we won't actually know s, we need:
Vp(b) = Σ_{s∈S} b(s) Vp(s)
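
A minimal sketch of this recursion, assuming the array conventions used earlier and a simple tree representation (the PolicyTree class and its field names are illustrative assumptions):

```python
import numpy as np
from dataclasses import dataclass
from typing import Optional

@dataclass
class PolicyTree:
    action: int                      # a(p): action at the root
    subtrees: Optional[dict] = None  # maps observation o_i -> subtree o_i(p); None for depth 1

def value_vector(p, T, O, R, gamma):
    """Return <V_p(s_1), ..., V_p(s_n)> for policy tree p."""
    a = p.action
    v = R[:, a].copy()               # depth-1 case: V_p(s) = R(s, a(p))
    if p.subtrees:
        for o, sub in p.subtrees.items():
            v_sub = value_vector(sub, T, O, R, gamma)
            # gamma * sum_{s'} T(s, a, s') O(s', a, o) V_{o(p)}(s')
            v += gamma * (T[:, a, :] @ (O[:, a, o] * v_sub))
    return v

def value_of_belief(p, b, T, O, R, gamma):
    """V_p(b) = sum_s b(s) V_p(s)."""
    return b @ value_vector(p, T, O, R, gamma)
```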


Let αp = ⟨Vp(s1), . . . , Vp(sn)⟩. Then Vp(b) = b · αp.

The value of the optimal t-step policy starting from belief state b is then given by:
Vt(b) = max_{p∈P} b · αp
where P is the (finite) set of all t-step policy trees.
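
A minimal sketch of this maximization over a given collection of t-step policy trees, reusing value_vector from above (enumerating every t-step tree is exponential, so trees is assumed to be supplied):

```python
def optimal_t_step_value(b, trees, T, O, R, gamma):
    """V_t(b) = max over p in P of b · alpha_p, where alpha_p = value_vector(p)."""
    return max(b @ value_vector(p, T, O, R, gamma) for p in trees)
```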