
Learning in Autonomous Systems: Markov Decision Processes
Profs. Luca Iocchi, Giorgio Grisetti
Master in Artificial Intelligence and Robotics, Sapienza University of Rome, A.Y. 2015/2016


1. University of Rome "La Sapienza", Master in Artificial Intelligence and Robotics. Learning in Autonomous Systems, Profs. Luca Iocchi, Giorgio Grisetti, A.Y. 2015/2016.

2. Sapienza University of Rome, Master in Artificial Intelligence and Robotics. Learning in Autonomous Systems: Markov Decision Processes. Luca Iocchi.

3. Markov Decision Processes (MDP)
Markov Decision Processes (MDPs) are discrete-time (stochastic) control processes that describe the evolution of a dynamic system over which we have control of the actions to be executed. They are used in many applications, including robotics and control. Depending on the available knowledge, MDPs are used to model both reasoning/planning and learning tasks.

4. DBN vs. MDP
DBNs/HMMs are used for state estimation (known model) or model parameter estimation (unknown model).
Input: observations, control/actions, (training data). Output: state estimate (model parameters).
MDPs are used for planning (known model) or reinforcement learning (unknown model).
Input: state, reward, (transition function). Output: best action to perform in each state.

5. DBN vs. MDP
DBNs/HMMs and MDPs are all probabilistic graphical models.
DBN/HMM graphical models represent conditional probabilities among variables and a temporal unfolding of the system evolution (nodes = random variables, edges = conditional probabilities).
MDP graphical models explicitly represent the actions causing state transitions (nodes = states, edges = actions).

6. DBN vs. MDP
Example: grid world representation.
X = { (r, c) | r = 1, ..., Nrows, c = 1, ..., Ncols }
A = { Left, Right, Up, Down }
Different graphical models are used for the DBN and for the MDP.

7. DBN vs. MDP
Example: grid world representation with a DBN for state estimation.
DBN: X = { (r, c) | r = 1, ..., Nrows, c = 1, ..., Ncols }, A = { Left, Right, Up, Down }, δ = transition function, Z = { Z_Left, Z_Right, Z_Up, Z_Down }.
Input: z_{1:T}, a_{1:T}. Output: P(x_t | z_{1:T}, a_{1:T}).

8. DBN vs. MDP
Example: grid world representation with an MDP for planning/learning.
MDP: X = { (r, c) | r = 1, ..., Nrows, c = 1, ..., Ncols }, A = { Left, Right, Up, Down }, δ = transition function, r = reward function.
Planning: input: MDP model (with δ and r); output: best actions.
Learning: input: MDP model (without δ and r); output: best actions.

9. DBN vs. MDP
Running example: grid controller (see the Web site).
Only the Left and Right actions are available, with non-deterministic effects (see the next slides).
Different problems: state estimation, planning, reinforcement learning.

10. Markov Decision Processes (MDP): deterministic transitions
MDP = ⟨X, A, δ, r⟩
X is a finite set of states
A is a finite set of actions
δ : X × A → X is the transition function
r : X × A → ℝ is the reward function
Markov property: x_{t+1} = δ(x_t, a_t) and r_t = r(x_t, a_t).
Sometimes the reward function is defined as r : X → ℝ.

11. Markov Decision Processes (MDP): non-deterministic transitions
MDP = ⟨X, A, δ, r⟩
X is a finite set of states
A is a finite set of actions
δ : X × A → 2^X is the transition function
r : X × A × X → ℝ is the reward function

12. Markov Decision Processes (MDP): stochastic transitions
MDP = ⟨X, A, δ, r⟩, where the transition function is replaced by a probability distribution:
X is a finite set of states
A is a finite set of actions
P(X × A × X) is a probability distribution over transitions
r : X × A × X → ℝ is the reward function
Note: P(X × A × X) is expressed as P(x' | x, a), that is, the conditional probability of the successor state given the current state and the current action.
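
One way to make the stochastic formulation concrete is the following Python sketch (not part of the slides). The states, actions, transition probabilities, and reward function below are illustrative assumptions, loosely modeled on the grid-controller running example with only Left and Right actions.

```python
# Minimal dict-based encoding of a stochastic MDP <X, A, P, r>.
# All concrete values are illustrative assumptions, not taken from the slides.

X = ["S0", "S1", "S2"]          # states
A = ["Left", "Right"]           # actions

# P[x][a] maps each successor x' to P(x' | x, a); here each move succeeds
# with probability 0.8 and leaves the state unchanged otherwise.
P = {
    "S0": {"Right": {"S1": 0.8, "S0": 0.2}, "Left": {"S0": 1.0}},
    "S1": {"Right": {"S2": 0.8, "S1": 0.2}, "Left": {"S0": 0.8, "S1": 0.2}},
    "S2": {"Right": {"S2": 1.0},            "Left": {"S1": 0.8, "S2": 0.2}},
}

# r(x, a, x'): reward 100 for entering the rightmost state, 0 otherwise (arbitrary choice).
def r(x, a, x_next):
    return 100.0 if x_next == "S2" and x != "S2" else 0.0
```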

13. Full Observability in MDP
States are fully observable. In the presence of non-deterministic or stochastic actions, the state resulting from the execution of an action is not known before the action is executed, but it can be fully observed after its execution.

14. MDP Solution Concept
Given an MDP, we want to find an optimal policy.
A policy is a function π : X → A.
Optimality is defined with respect to maximizing the (expected value of the) cumulative discounted reward:
V^π(x_1) = E[ r_1 + γ r_2 + γ² r_3 + ... ]
where r_t = r(x_t, a_t, x_{t+1}), a_t = π(x_t), and γ ∈ [0, 1] is the discount factor for future rewards.
Optimal policy: π* ≡ argmax_π V^π(x), ∀x ∈ X.
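
As a quick numeric illustration of the cumulative discounted reward (the reward sequence and the value of γ below are made up), consider an agent that collects reward 100 on its third step and 0 before that:

```python
# Hypothetical reward sequence r_1, r_2, r_3 and discount factor gamma.
gamma = 0.9
rewards = [0.0, 0.0, 100.0]

# V = r_1 + gamma*r_2 + gamma^2*r_3 + ...
V = sum(gamma**t * r_t for t, r_t in enumerate(rewards))
print(V)  # 0 + 0.9*0 + 0.81*100 = 81.0
```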

15. Value function
Deterministic case:
V^π(x) = r_1 + γ r_2 + γ² r_3 + ...
V^π_(t)(x_t) = r_t + γ r_{t+1} + γ² r_{t+2} + ...
V^π_(t)(x_t) = r_t + γ (r_{t+1} + γ (r_{t+2} + ...)) = r_t + γ V^π_(t+1)(x_{t+1})
Non-deterministic/stochastic case:
V^π(x) = E[ r_1 + γ r_2 + γ² r_3 + ... ]

16. Reasoning and Learning in MDP
If the MDP ⟨X, A, δ, r⟩ is completely known → reasoning or planning: the optimal policy is computed off-line (i.e., before the actual execution of the task).
If the MDP ⟨X, A, δ, r⟩ is not completely known → learning: the optimal policy is computed on-line (i.e., during the execution of the task).
Advantages: adaptive to changing, unknown characteristics of the environment.
Disadvantages: time consuming; it may execute undesired behaviors.

17. Solving the MDP (reasoning): Dynamic programming
Given the MDP ⟨X, A, δ, r⟩:
Initialize V^(0)(x) and π_0(x) randomly.
Iterate the two steps:
1. V(x) ← Σ_{x'} P(x' | x, π(x)) [ r(x, π(x), x') + γ V(x') ]
2. π(x) ← argmax_{a ∈ A} Σ_{x'} P(x' | x, a) [ r(x, a, x') + γ V(x') ]
Termination condition: no changes in π.
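
A minimal Python sketch of this iteration, assuming the illustrative dict-based X, A, P and reward function r from the representation example above:

```python
import random

def dynamic_programming(X, A, P, r, gamma=0.9, max_iter=1000):
    """Alternate one value backup under the current policy with a greedy policy update."""
    V = {x: 0.0 for x in X}                        # arbitrary initialization
    pi = {x: random.choice(A) for x in X}          # random initial policy
    for _ in range(max_iter):
        # Step 1: V(x) <- sum_x' P(x'|x,pi(x)) [ r(x,pi(x),x') + gamma V(x') ]
        for x in X:
            V[x] = sum(p * (r(x, pi[x], x2) + gamma * V[x2])
                       for x2, p in P[x][pi[x]].items())
        # Step 2: greedy policy improvement.
        new_pi = {x: max(A, key=lambda a: sum(p * (r(x, a, x2) + gamma * V[x2])
                                              for x2, p in P[x][a].items()))
                  for x in X}
        if new_pi == pi:                           # termination: no changes in pi
            break
        pi = new_pi
    return V, pi
```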

18. Solving the MDP (reasoning): Value Iteration
Given the MDP ⟨X, A, δ, r⟩:
Initialize V^(0)(x) randomly.
Iterate the step:
1. V^(t)(x) ← max_{a ∈ A} Σ_{x'} P(x' | x, a) [ r(x, a, x') + γ V^(t-1)(x') ]
Then compute π(x) ← argmax_{a ∈ A} Σ_{x'} P(x' | x, a) [ r(x, a, x') + γ V^(t)(x') ]
Termination condition: ∀x, |V^(t)(x) - V^(t-1)(x)| < θ.
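
A value-iteration sketch matching the update and termination test on this slide, again assuming the illustrative dict-based X, A, P, r introduced earlier:

```python
def value_iteration(X, A, P, r, gamma=0.9, theta=1e-6):
    """Iterate the Bellman optimality backup until the value function stabilizes."""
    V = {x: 0.0 for x in X}
    while True:
        delta = 0.0
        for x in X:
            v_new = max(sum(p * (r(x, a, x2) + gamma * V[x2])
                            for x2, p in P[x][a].items())
                        for a in A)
            delta = max(delta, abs(v_new - V[x]))
            V[x] = v_new
        if delta < theta:                          # |V_t(x) - V_{t-1}(x)| < theta for all x
            break
    # Extract the greedy policy from the (near-)optimal value function.
    pi = {x: max(A, key=lambda a: sum(p * (r(x, a, x2) + gamma * V[x2])
                                      for x2, p in P[x][a].items()))
          for x in X}
    return V, pi
```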

19. Solving the MDP (reasoning): Policy Iteration
Given the MDP ⟨X, A, δ, r⟩:
Initialize the policy π_0(x) randomly.
Iterate the steps:
1. Solve the linear system in V(x): V(x) = Σ_{x'} P(x' | x, π(x)) [ r(x, π(x), x') + γ V(x') ]
2. Update π(x) ← argmax_{a ∈ A} Σ_{x'} P(x' | x, a) [ r(x, a, x') + γ V(x') ]
Termination condition: no changes in π.
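
The exact policy-evaluation step (the linear system in V) can be sketched with NumPy as follows; X, A, P, r are again the illustrative dict-based structures used above, and the initial policy is arbitrary rather than random:

```python
import numpy as np

def policy_iteration(X, A, P, r, gamma=0.9):
    """Exact policy evaluation via a linear solve, followed by greedy improvement."""
    idx = {x: i for i, x in enumerate(X)}
    pi = {x: A[0] for x in X}                      # arbitrary initial policy
    while True:
        # Step 1: solve (I - gamma * P_pi) V = r_pi for the current policy pi.
        P_pi = np.zeros((len(X), len(X)))
        r_pi = np.zeros(len(X))
        for x in X:
            for x2, p in P[x][pi[x]].items():
                P_pi[idx[x], idx[x2]] = p
                r_pi[idx[x]] += p * r(x, pi[x], x2)
        V = np.linalg.solve(np.eye(len(X)) - gamma * P_pi, r_pi)
        # Step 2: greedy policy improvement with respect to V.
        new_pi = {x: max(A, key=lambda a: sum(p * (r(x, a, x2) + gamma * V[idx[x2]])
                                              for x2, p in P[x][a].items()))
                  for x in X}
        if new_pi == pi:                           # termination: no changes in pi
            return V, pi
        pi = new_pi
```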

20. Example 1: simple deterministic grid world
[Figure: a 2 x 3 grid world with states S3, S4, G in the top row and S0, S1, S2 in the bottom row; arrows between adjacent states are labeled in red with their rewards, 100 on the arrows entering G and 0 on all the others.]
Task: reaching the goal state G from the initial state S0.
MDP ⟨X, A, δ, r⟩:
X = { S0, S1, S2, S3, S4, G }
A = { L, R, U, D }
δ is represented by the arrows in the figure (e.g., δ(S0, R) = S1)
r(x, a) is represented by the red values on the arrows in the figure (e.g., r(S0, R) = 0)
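
To connect this example with the earlier sketches, here is one possible encoding in the same dict-based format. The figure is only partly recoverable here, so the grid layout, the absorbing goal, and the stay-put behavior at the grid border are assumptions:

```python
# Hedged encoding of Example 1 (layout and border behavior assumed, not given).
X = ["S0", "S1", "S2", "S3", "S4", "G"]
A = ["L", "R", "U", "D"]

grid = {"S0": (0, 0), "S1": (0, 1), "S2": (0, 2),    # bottom row
        "S3": (1, 0), "S4": (1, 1), "G":  (1, 2)}    # top row
pos_to_state = {pos: s for s, pos in grid.items()}
moves = {"L": (0, -1), "R": (0, 1), "U": (1, 0), "D": (-1, 0)}

def step(x, a):
    if x == "G":                                     # assume the goal is absorbing
        return "G"
    row, col = grid[x]
    drow, dcol = moves[a]
    return pos_to_state.get((row + drow, col + dcol), x)   # stay put if off-grid

# Deterministic transitions expressed in the stochastic dict format (probability 1).
P = {x: {a: {step(x, a): 1.0} for a in A} for x in X}

def r(x, a, x_next):
    return 100.0 if x_next == "G" and x != "G" else 0.0

# Under these assumptions, value_iteration(X, A, P, r, gamma=0.9) from the sketch
# above gives V(S2) = V(S4) = 100, V(S1) = V(S3) = 90, V(S0) = 81, and a policy
# that always moves toward G.
```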
