SLIDE 1

University of Rome “La Sapienza”
Master in Artificial Intelligence and Robotics

Learning in Autonomous Systems

Profs. Luca Iocchi, Giorgio Grisetti

A.Y. 2015/2016

SLIDE 2

Sapienza University of Rome, Italy
Master in Artificial Intelligence and Robotics
Learning in Autonomous Systems

Markov Decision Processes

Luca Iocchi

SLIDE 3

Markov Decision Processes (MDP)

Markov Decision Processes (MDPs) are discrete-time (stochastic) control processes describing the evolution of a dynamic system over which we have control of the actions to be executed. They are used in many applications, including robotics and control. Depending on the available knowledge, MDPs are used to model both reasoning/planning and learning tasks.

SLIDE 4

DBN vs. MDP

DBNs/HMMs are used for state estimation (known model) or model parameter estimation (unknown model).
Input: observations, control/actions, (training data)
Output: state estimate (or model parameters)

MDPs are used for planning (known model) or reinforcement learning (unknown model).
Input: state, reward, (transition function)
Output: best action to perform in each state

SLIDE 5

DBN vs. MDP

DBNs/HMMs and MDPs are all probabilistic graphical models. DBN/HMM graphical models represent conditional probabilities among variables and a temporal unfolding of the system evolution (nodes = random variables, edges = conditional probabilities). MDP graphical models explicitly represent actions causing state transitions (nodes = states, edges = actions).

SLIDE 6

DBN vs. MDP

Example: grid world representation
X = {(r, c) | r = 1, …, N_rows, c = 1, …, N_cols}
A = {Left, Right, Up, Down}
Different graphical models for DBN and MDP.
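As a concrete illustration (not part of the original slides), this state and action space can be written directly in Python; the grid size below is an assumed example value:

N_ROWS, N_COLS = 4, 4  # assumed example size; the slides leave N_rows, N_cols generic
X = [(r, c) for r in range(1, N_ROWS + 1) for c in range(1, N_COLS + 1)]
A = ["Left", "Right", "Up", "Down"]
print(len(X))  # 16 states for a 4x4 grid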

SLIDE 7

DBN vs. MDP

Example: grid world representation with DBN for state estimation
DBN:
X = {(r, c) | r = 1, …, N_rows, c = 1, …, N_cols}
A = {Left, Right, Up, Down}
δ = transition function
Z = {Z_Left, Z_Right, Z_Up, Z_Down}
Input: z_{1:T}, a_{1:T}
Output: P(x_t | z_{1:T}, a_{1:T})
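For the state-estimation side, here is a minimal discrete Bayes filter sketch computing P(x_t | z_{1:t}, a_{1:t}) over a finite state space; the dictionary-based model representation (P_trans, P_obs) is an assumption of this example, not notation from the slides:

from collections import defaultdict

def bayes_filter_step(belief, action, observation, P_trans, P_obs):
    # belief: dict mapping state -> P(x_{t-1} | z_{1:t-1}, a_{1:t-1})
    # P_trans[(x, a)]: dict mapping x' -> P(x' | x, a)
    # P_obs[(x, z)]: observation likelihood P(z | x)
    predicted = defaultdict(float)
    for x, p in belief.items():                      # predict with the motion model
        for x_next, p_trans in P_trans[(x, action)].items():
            predicted[x_next] += p * p_trans
    updated = {x: p * P_obs[(x, observation)]        # weight by observation likelihood
               for x, p in predicted.items()}
    eta = sum(updated.values())                      # normalize
    return {x: p / eta for x, p in updated.items()}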

SLIDE 8

DBN vs. MDP

Example: grid world representation with MDP for planning/learning
MDP:
X = {(r, c) | r = 1, …, N_rows, c = 1, …, N_cols}
A = {Left, Right, Up, Down}
δ = transition function
r = reward function
Planning: input the MDP model (with δ and r); output the best actions.
Learning: input the MDP model (without δ and r); output the best actions.

SLIDE 9

DBN vs. MDP

Running example: grid controller (see the Web site). Only Left and Right actions, with non-deterministic effects (see next slides). Different problems: state estimation, planning, reinforcement learning.

SLIDE 10

Markov Decision Processes (MDP)

Deterministic transitions
MDP = ⟨X, A, δ, r⟩
X is a finite set of states
A is a finite set of actions
δ : X × A → X is the transition function
r : X × A → ℝ is the reward function
Markov property: x_{t+1} = δ(x_t, a_t) and r_t = r(x_t, a_t)
Sometimes the reward function is defined as r : X → ℝ.
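A minimal illustrative sketch of the deterministic case: thanks to the Markov property, one simulation step needs only the current state and action (the dynamics below are hypothetical):

def step(x, a, delta, reward):
    # Markov property: x_{t+1} = delta(x_t, a_t) and r_t = r(x_t, a_t)
    return delta(x, a), reward(x, a)

def delta(x, a):
    # hypothetical one-dimensional dynamics: "R" moves right, anything else stays
    return x + 1 if a == "R" else x

def reward(x, a):
    # hypothetical goal reward at state 2
    return 100.0 if x == 2 and a == "R" else 0.0

print(step(0, "R", delta, reward))  # (1, 0.0)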

SLIDE 11

Markov Decision Processes (MDP)

Non-deterministic transitions
MDP = ⟨X, A, δ, r⟩
X is a finite set of states
A is a finite set of actions
δ : X × A → 2^X is the transition function
r : X × A × X → ℝ is the reward function

SLIDE 12

Markov Decision Processes (MDP)

Stochastic transitions
MDP = ⟨X, A, δ, r⟩
X is a finite set of states
A is a finite set of actions
P(X × A × X) is a probability distribution over transitions
r : X × A × X → ℝ is the reward function
Note: P(X × A × X) is expressed as P(x′ | x, a), the conditional probability of the successor state given the current state and the current action.
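With stochastic transitions, P(x′ | x, a) can be stored as a nested dictionary and sampled from; a minimal sketch (the specific probabilities below are made up for illustration):

import random

# P_trans[(x, a)] maps each successor state x' to P(x' | x, a)
P_trans = {((1, 1), "Right"): {(1, 2): 0.8, (2, 1): 0.1, (1, 1): 0.1}}

def sample_next_state(x, a):
    successors = P_trans[(x, a)]          # the distribution P(. | x, a)
    states = list(successors)
    return random.choices(states, weights=[successors[s] for s in states])[0]

print(sample_next_state((1, 1), "Right"))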

SLIDE 13

Full Observability in MDP

States are fully observable. In the presence of non-deterministic or stochastic actions, the state resulting from the execution of an action is not known before the action is executed, but it can be fully observed after its execution.

SLIDE 14

MDP Solution Concept

Given an MDP, we want to find an optimal policy. A policy is a function π : X → A. Optimality is defined with respect to maximizing the (expected value of the) cumulative discounted reward:
V^π(x_1) = E[r̄_1 + γ r̄_2 + γ² r̄_3 + …]
where r̄_t = r(x_t, a_t, x_{t+1}), a_t = π(x_t), and γ ∈ [0, 1] is the discount factor for future rewards.
Optimal policy: π* ≡ argmax_π V^π(x), ∀x ∈ X
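The cumulative discounted reward of one trajectory is straightforward to compute; a small illustrative helper (the reward sequence and γ below are made-up example values):

def discounted_return(rewards, gamma):
    # r_1 + gamma*r_2 + gamma^2*r_3 + ... for a finite reward sequence
    return sum(gamma ** t, r)  if False else sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return([0.0, 0.0, 100.0], 0.9))  # 0 + 0 + 0.81*100 = 81.0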

SLIDE 15

Value function

Deterministic case:
V^π(x) = r_1 + γ r_2 + γ² r_3 + …
V^π_(t)(x) = r_t + γ r_{t+1} + γ² r_{t+2} + …
V^π_(t)(x) = r_t + γ (r_{t+1} + γ (r_{t+2} + …)) = r_t + γ V^π_(t+1)(x)

Non-deterministic/stochastic case:
V^π(x) = E[r_1 + γ r_2 + γ² r_3 + …]

SLIDE 16

Reasoning and Learning in MDP

If the MDP ⟨X, A, δ, r⟩ is completely known → reasoning or planning. The optimal policy is computed off-line (i.e., before the actual execution of the task).

If the MDP ⟨X, A, δ, r⟩ is not completely known → learning. The optimal policy is computed on-line (i.e., during the execution of the task). Advantages: adaptive to changing, unknown characteristics of the environment. Disadvantages: time consuming, and it may execute undesired behaviors.

SLIDE 17

Solving the MDP (reasoning)

Dynamic programming
Given the MDP ⟨X, A, δ, r⟩:
Initialize V_(0)(x) and π_0(x) randomly.
Iterate the two steps:
1. V(x) ← Σ_{x′} P(x′ | x, π(x)) [r(x, π(x), x′) + γ V(x′)]
2. π(x) ← argmax_{a∈A} Σ_{x′} P(x′ | x, a) [r(x, a, x′) + γ V(x′)]
Termination condition: no changes in π.
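A compact Python sketch of this two-step iteration, assuming a tabular model where X and A are lists, P[(x, a)] maps successor states to probabilities, and R(x, a, x2) is the reward function (these names are this example's conventions, not the slides'):

import random

def dp_policy_optimization(X, A, P, R, gamma=0.9):
    V = {x: 0.0 for x in X}                    # initial V (slides: random; zeros work too)
    pi = {x: random.choice(A) for x in X}      # random initial policy
    while True:
        for x in X:                            # step 1: back up V under the current pi
            V[x] = sum(p * (R(x, pi[x], x2) + gamma * V[x2])
                       for x2, p in P[(x, pi[x])].items())
        new_pi = {}                            # step 2: greedy policy improvement
        for x in X:
            new_pi[x] = max(A, key=lambda a: sum(
                p * (R(x, a, x2) + gamma * V[x2])
                for x2, p in P[(x, a)].items()))
        if new_pi == pi:                       # terminate: no changes in pi
            return V, pi
        pi = new_pi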

SLIDE 18

Solving the MDP (reasoning)

Value Iteration
Given the MDP ⟨X, A, δ, r⟩:
Initialize V_(0)(x) randomly.
Iterate the step:
1. V_(t)(x) ← max_{a∈A} Σ_{x′} P(x′ | x, a) [r(x, a, x′) + γ V_(t−1)(x′)]
Then compute π(x) ← argmax_{a∈A} Σ_{x′} P(x′ | x, a) [r(x, a, x′) + γ V_(t)(x′)]
Termination condition: ∀x, |V_(t)(x) − V_(t−1)(x)| < θ
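A sketch of value iteration under the same tabular conventions as the sketch above:

def value_iteration(X, A, P, R, gamma=0.9, theta=1e-6):
    V = {x: 0.0 for x in X}
    while True:
        # one Bellman-optimality backup per state
        V_new = {x: max(sum(p * (R(x, a, x2) + gamma * V[x2])
                            for x2, p in P[(x, a)].items())
                        for a in A)
                 for x in X}
        converged = all(abs(V_new[x] - V[x]) < theta for x in X)
        V = V_new
        if converged:
            break
    # extract the greedy policy from the converged value function
    pi = {x: max(A, key=lambda a: sum(p * (R(x, a, x2) + gamma * V[x2])
                                      for x2, p in P[(x, a)].items()))
          for x in X}
    return V, pi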

SLIDE 19

Solving the MDP (reasoning)

Policy Iteration
Given the MDP ⟨X, A, δ, r⟩:
Initialize the policy π_0(x) randomly.
Iterate the two steps:
1. Solve the linear system in V(x):
   V(x) = Σ_{x′} P(x′ | x, π(x)) [r(x, π(x), x′) + γ V(x′)]
2. Update π(x) ← argmax_{a∈A} Σ_{x′} P(x′ | x, a) [r(x, a, x′) + γ V(x′)]
Termination condition: no changes in π.
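The policy-evaluation step is a linear system, so it can be solved exactly; a sketch with numpy, under the same tabular conventions as above:

import numpy as np

def evaluate_policy_exactly(X, P, R, pi, gamma=0.9):
    # Solve (I - gamma * P_pi) V = R_pi, the linear system of step 1.
    n = len(X)
    idx = {x: i for i, x in enumerate(X)}
    P_pi = np.zeros((n, n))
    R_pi = np.zeros(n)
    for x in X:
        for x2, p in P[(x, pi[x])].items():
            P_pi[idx[x], idx[x2]] = p
            R_pi[idx[x]] += p * R(x, pi[x], x2)
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
    return {x: float(V[idx[x]]) for x in X}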

SLIDE 20

Example 1: simple deterministic grid world

Reaching the goal state G from initial state S0.

[Figure: transition graph over the states S0, S1, S2, S3, S4, G; arrows show the transitions and red labels the rewards (two arrows carry reward 100).]

MDP ⟨X, A, δ, r⟩
X = {S0, S1, S2, S3, S4, G}
A = {L, R, U, D}
δ represented as arrows in the figure (e.g., δ(S0, R) = S1)
r(x, a) represented as red values on the arrows in the figure (e.g., r(S0, R) = 0)
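Deterministic MDPs like this one plug into the stochastic solvers above by giving every transition probability 1; a minimal sketch using only what the slide states explicitly (δ(S0, R) = S1 with r(S0, R) = 0; the rest of the layout is in the figure):

delta = {("S0", "R"): "S1"}   # only the transition the slide states explicitly
rew = {("S0", "R"): 0.0}      # r(S0, R) = 0 (red label in the figure)

# deterministic -> stochastic: each known (x, a) reaches its successor with prob. 1
P = {(x, a): {x_next: 1.0} for (x, a), x_next in delta.items()}

def R(x, a, x_next):
    return rew[(x, a)]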

SLIDE 21

Example 2: deterministic grid controller

Reaching the right-most side of the environment from any initial state.
MDP ⟨X, A, δ, r⟩
X = {(r, c) | coordinates in the grid}
A = {Left, Right, Up, Down}
δ: cardinal movements, with no effect (i.e., the agent remains in the current state) if the destination state is a black square
r: 1000 for reaching the right-most column, -10 for hitting any obstacle, 0 otherwise
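A direct Python transcription of this reward specification (illustrative; obstacle is an assumed predicate and n_cols an assumed constant, not names from the slides):

MOVES = {"Left": (0, -1), "Right": (0, 1), "Up": (-1, 0), "Down": (1, 0)}

def reward(x, a, n_cols, obstacle):
    row, col = x
    dr, dc = MOVES[a]
    if obstacle(row + dr, col + dc):   # the move hits a black square
        return -10.0
    if col + dc == n_cols:             # the move reaches the right-most column
        return 1000.0
    return 0.0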

SLIDE 22

Example 3: non-deterministic grid controller

Reaching the right-most side of the environment from any initial state.
MDP ⟨X, A, δ, r⟩
X = {(r, c) | coordinates in the grid}
A = {Left, Right}
δ: cardinal movements with non-deterministic effects (see next slide)
r: 1000 for reaching the right-most column, -10 for hitting any obstacle, +1 for any Right action, -1 for any Left action

SLIDE 23

Example 3: non-deterministic grid controller

Transition probability:
P(x_{t+1} = (r′, c′) | x_t = (r, c), a = Right) =
  γ_F, if ¬obstacle(r, c+1) ∧ c′ = c+1 ∧ r′ = r
  (1 − γ_F)/2, if ¬obstacle(r, c+1) ∧ c′ = c+1 ∧ |r′ − r| = 1
  γ_B, if obstacle(r, c+1) ∧ c′ = c−1 ∧ r′ = r
  (1 − γ_B)/2, if obstacle(r, c+1) ∧ c′ = c−1 ∧ |r′ − r| = 1
  0, otherwise
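This piecewise definition translates directly into Python (obstacle is an assumed predicate; gamma_f and gamma_b are the slide's γ_F and γ_B):

def p_right(x_next, x, gamma_f, gamma_b, obstacle):
    # P(x_{t+1} = (r2, c2) | x_t = (r, c), a = Right)
    r, c = x
    r2, c2 = x_next
    if not obstacle(r, c + 1):                 # forward cell is free
        if c2 == c + 1 and r2 == r:
            return gamma_f
        if c2 == c + 1 and abs(r2 - r) == 1:
            return (1 - gamma_f) / 2
    else:                                      # forward cell blocked: bounce back
        if c2 == c - 1 and r2 == r:
            return gamma_b
        if c2 == c - 1 and abs(r2 - r) == 1:
            return (1 - gamma_b) / 2
    return 0.0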

SLIDE 24

Example 4: Hanoi Towers

SLIDE 25

Example 5: Tic-Tac-Toe

SLIDE 26

References

[ArtInt] David Poole and Alan Mackworth. Artificial Intelligence: Foundations of Computational Agents, Chapter 9.5 Decision Processes. Cambridge University Press, 2010. On-line: http://artint.info/
