SLIDE 1

University of Rome “La Sapienza”
Master in Artificial Intelligence and Robotics

Learning in Autonomous Systems

Profs. Luca Iocchi, Giorgio Grisetti

A.Y. 2015/2016

SLIDE 2

Sapienza University of Rome, Italy
Master in Artificial Intelligence and Robotics
Learning in Autonomous Systems

Markov Decision Processes

Luca Iocchi

SLIDE 3

Markov Decision Processes (MDP)

Markov Decision Processes (MDPs) are discrete-time (stochastic) control processes describing the evolution of a dynamic system over which we have control of the actions to be executed. They are used in many applications, including robotics and control. Depending on the available knowledge, MDPs are used to model both reasoning/planning and learning tasks.

SLIDE 4

DBN vs. MDP

DBNs/HMMs are used for state estimation (known model) or model parameter estimation (unknown model).
Input: observations, control/actions, (training data)
Output: state estimate (or model parameters)

MDPs are used for planning (known model) or reinforcement learning (unknown model).
Input: state, reward, (transition function)
Output: best action to perform in each state

SLIDE 5

DBN vs. MDP

DBNs/HMMs and MDPs are all probabilistic graphical models. DBN/HMM graphical models represent conditional probabilities among variables and a temporal unfolding of the system evolution (nodes = random variables, edges = conditional probabilities). MDP graphical models explicitly represent actions causing state transitions (nodes = states, edges = actions).

SLIDE 6

DBN vs. MDP

Example: grid world representation
X = {(r, c) | r = 1, …, N_rows, c = 1, …, N_cols}
A = {Left, Right, Up, Down}
Different graphical models for DBN and MDP.
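As a concrete illustration (not part of the original slides), this state and action space can be written directly in Python; the grid size below is an assumed example value:

N_ROWS, N_COLS = 4, 4  # assumed example size; the slides leave N_rows, N_cols generic
X = [(r, c) for r in range(1, N_ROWS + 1) for c in range(1, N_COLS + 1)]
A = ["Left", "Right", "Up", "Down"]
print(len(X))  # 16 states for a 4x4 grid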

SLIDE 7

DBN vs. MDP

Example: grid world representation with DBN for state estimation
DBN:
X = {(r, c) | r = 1, …, N_rows, c = 1, …, N_cols}
A = {Left, Right, Up, Down}
δ = transition function
Z = {Z_Left, Z_Right, Z_Up, Z_Down}
Input: z_{1:T}, a_{1:T}
Output: P(x_t | z_{1:T}, a_{1:T})
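For the state-estimation side, here is a minimal discrete Bayes filter sketch computing P(x_t | z_{1:t}, a_{1:t}) over a finite state space; the dictionary-based model representation (P_trans, P_obs) is an assumption of this example, not notation from the slides:

from collections import defaultdict

def bayes_filter_step(belief, action, observation, P_trans, P_obs):
    # belief: dict mapping state -> P(x_{t-1} | z_{1:t-1}, a_{1:t-1})
    # P_trans[(x, a)]: dict mapping x' -> P(x' | x, a)
    # P_obs[(x, z)]: observation likelihood P(z | x)
    predicted = defaultdict(float)
    for x, p in belief.items():                      # predict with the motion model
        for x_next, p_trans in P_trans[(x, action)].items():
            predicted[x_next] += p * p_trans
    updated = {x: p * P_obs[(x, observation)]        # weight by observation likelihood
               for x, p in predicted.items()}
    eta = sum(updated.values())                      # normalize
    return {x: p / eta for x, p in updated.items()}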

SLIDE 8

DBN vs. MDP

Example: grid world representation with MDP for planning/learning
MDP:
X = {(r, c) | r = 1, …, N_rows, c = 1, …, N_cols}
A = {Left, Right, Up, Down}
δ = transition function
r = reward function
Planning: input the MDP model (with δ and r); output the best actions.
Learning: input the MDP model (without δ and r); output the best actions.

SLIDE 9

DBN vs. MDP

Running example: grid controller (see the Web site). Only Left and Right actions, with non-deterministic effects (see next slides). Different problems: state estimation, planning, reinforcement learning.

SLIDE 10

Markov Decision Processes (MDP)

Deterministic transitions
MDP = ⟨X, A, δ, r⟩
X is a finite set of states
A is a finite set of actions
δ : X × A → X is the transition function
r : X × A → ℝ is the reward function
Markov property: x_{t+1} = δ(x_t, a_t) and r_t = r(x_t, a_t)
Sometimes the reward function is defined as r : X → ℝ.
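A minimal illustrative sketch of the deterministic case: thanks to the Markov property, one simulation step needs only the current state and action (the dynamics below are hypothetical):

def step(x, a, delta, reward):
    # Markov property: x_{t+1} = delta(x_t, a_t) and r_t = r(x_t, a_t)
    return delta(x, a), reward(x, a)

def delta(x, a):
    # hypothetical one-dimensional dynamics: "R" moves right, anything else stays
    return x + 1 if a == "R" else x

def reward(x, a):
    # hypothetical goal reward at state 2
    return 100.0 if x == 2 and a == "R" else 0.0

print(step(0, "R", delta, reward))  # (1, 0.0)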

SLIDE 11

Markov Decision Processes (MDP)

Non-deterministic transitions
MDP = ⟨X, A, δ, r⟩
X is a finite set of states
A is a finite set of actions
δ : X × A → 2^X is the transition function
r : X × A × X → ℝ is the reward function

SLIDE 12

Markov Decision Processes (MDP)

Stochastic transitions
MDP = ⟨X, A, δ, r⟩
X is a finite set of states
A is a finite set of actions
P(X × A × X) is a probability distribution over transitions
r : X × A × X → ℝ is the reward function
Note: P(X × A × X) is expressed as P(x′ | x, a), the conditional probability of the successor state given the current state and the current action.
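With stochastic transitions, P(x′ | x, a) can be stored as a nested dictionary and sampled from; a minimal sketch (the specific probabilities below are made up for illustration):

import random

# P_trans[(x, a)] maps each successor state x' to P(x' | x, a)
P_trans = {((1, 1), "Right"): {(1, 2): 0.8, (2, 1): 0.1, (1, 1): 0.1}}

def sample_next_state(x, a):
    successors = P_trans[(x, a)]          # the distribution P(. | x, a)
    states = list(successors)
    return random.choices(states, weights=[successors[s] for s in states])[0]

print(sample_next_state((1, 1), "Right"))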

SLIDE 13

Full Observability in MDP

States are fully observable. In the presence of non-deterministic or stochastic actions, the state resulting from the execution of an action is not known before the action is executed, but it can be fully observed after its execution.

SLIDE 14

MDP Solution Concept

Given an MDP, we want to find an optimal policy. A policy is a function π : X → A. Optimality is defined with respect to maximizing the (expected value of the) cumulative discounted reward:
V^π(x_1) = E[r̄_1 + γ r̄_2 + γ² r̄_3 + …]
where r̄_t = r(x_t, a_t, x_{t+1}), a_t = π(x_t), and γ ∈ [0, 1] is the discount factor for future rewards.
Optimal policy: π* ≡ argmax_π V^π(x), ∀x ∈ X
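The cumulative discounted reward of one trajectory is straightforward to compute; a small illustrative helper (the reward sequence and γ below are made-up example values):

def discounted_return(rewards, gamma):
    # r_1 + gamma*r_2 + gamma^2*r_3 + ... for a finite reward sequence
    return sum(gamma ** t, r)  if False else sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return([0.0, 0.0, 100.0], 0.9))  # 0 + 0 + 0.81*100 = 81.0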

SLIDE 15

Value function

Deterministic case:
V^π(x) = r_1 + γ r_2 + γ² r_3 + …
V^π_(t)(x) = r_t + γ r_{t+1} + γ² r_{t+2} + …
V^π_(t)(x) = r_t + γ (r_{t+1} + γ (r_{t+2} + …)) = r_t + γ V^π_(t+1)(x)

Non-deterministic/stochastic case:
V^π(x) = E[r_1 + γ r_2 + γ² r_3 + …]

SLIDE 16

Reasoning and Learning in MDP

If the MDP ⟨X, A, δ, r⟩ is completely known → reasoning or planning. The optimal policy is computed off-line (i.e., before the actual execution of the task).

If the MDP ⟨X, A, δ, r⟩ is not completely known → learning. The optimal policy is computed on-line (i.e., during the execution of the task). Advantages: adaptive to changing, unknown characteristics of the environment. Disadvantages: time consuming, and it may execute undesired behaviors.

SLIDE 17

Solving the MDP (reasoning)

Dynamic programming
Given the MDP ⟨X, A, δ, r⟩:
Initialize V_(0)(x) and π_0(x) randomly.
Iterate the two steps:
1. V(x) ← Σ_{x′} P(x′ | x, π(x)) [r(x, π(x), x′) + γ V(x′)]
2. π(x) ← argmax_{a∈A} Σ_{x′} P(x′ | x, a) [r(x, a, x′) + γ V(x′)]
Termination condition: no changes in π.
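A compact Python sketch of this two-step iteration, assuming a tabular model where X and A are lists, P[(x, a)] maps successor states to probabilities, and R(x, a, x2) is the reward function (these names are this example's conventions, not the slides'):

import random

def dp_policy_optimization(X, A, P, R, gamma=0.9):
    V = {x: 0.0 for x in X}                    # initial V (slides: random; zeros work too)
    pi = {x: random.choice(A) for x in X}      # random initial policy
    while True:
        for x in X:                            # step 1: back up V under the current pi
            V[x] = sum(p * (R(x, pi[x], x2) + gamma * V[x2])
                       for x2, p in P[(x, pi[x])].items())
        new_pi = {}                            # step 2: greedy policy improvement
        for x in X:
            new_pi[x] = max(A, key=lambda a: sum(
                p * (R(x, a, x2) + gamma * V[x2])
                for x2, p in P[(x, a)].items()))
        if new_pi == pi:                       # terminate: no changes in pi
            return V, pi
        pi = new_pi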

SLIDE 18

Solving the MDP (reasoning)

Value Iteration
Given the MDP ⟨X, A, δ, r⟩:
Initialize V_(0)(x) randomly.
Iterate the step:
1. V_(t)(x) ← max_{a∈A} Σ_{x′} P(x′ | x, a) [r(x, a, x′) + γ V_(t−1)(x′)]
Then compute π(x) ← argmax_{a∈A} Σ_{x′} P(x′ | x, a) [r(x, a, x′) + γ V_(t)(x′)]
Termination condition: ∀x, |V_(t)(x) − V_(t−1)(x)| < θ
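A sketch of value iteration under the same tabular conventions as the sketch above:

def value_iteration(X, A, P, R, gamma=0.9, theta=1e-6):
    V = {x: 0.0 for x in X}
    while True:
        # one Bellman-optimality backup per state
        V_new = {x: max(sum(p * (R(x, a, x2) + gamma * V[x2])
                            for x2, p in P[(x, a)].items())
                        for a in A)
                 for x in X}
        converged = all(abs(V_new[x] - V[x]) < theta for x in X)
        V = V_new
        if converged:
            break
    # extract the greedy policy from the converged value function
    pi = {x: max(A, key=lambda a: sum(p * (R(x, a, x2) + gamma * V[x2])
                                      for x2, p in P[(x, a)].items()))
          for x in X}
    return V, pi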

SLIDE 19

Solving the MDP (reasoning)

Policy Iteration
Given the MDP ⟨X, A, δ, r⟩:
Initialize the policy π_0(x) randomly.
Iterate the two steps:
1. Solve the linear system in V(x):
   V(x) = Σ_{x′} P(x′ | x, π(x)) [r(x, π(x), x′) + γ V(x′)]
2. Update π(x) ← argmax_{a∈A} Σ_{x′} P(x′ | x, a) [r(x, a, x′) + γ V(x′)]
Termination condition: no changes in π.
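The policy-evaluation step is a linear system, so it can be solved exactly; a sketch with numpy, under the same tabular conventions as above:

import numpy as np

def evaluate_policy_exactly(X, P, R, pi, gamma=0.9):
    # Solve (I - gamma * P_pi) V = R_pi, the linear system of step 1.
    n = len(X)
    idx = {x: i for i, x in enumerate(X)}
    P_pi = np.zeros((n, n))
    R_pi = np.zeros(n)
    for x in X:
        for x2, p in P[(x, pi[x])].items():
            P_pi[idx[x], idx[x2]] = p
            R_pi[idx[x]] += p * R(x, pi[x], x2)
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
    return {x: float(V[idx[x]]) for x in X}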

SLIDE 20

Example 1: simple deterministic grid world

Reaching the goal state G from initial state S0.

[Figure: transition graph over the states S0, S1, S2, S3, S4, G; arrows show the transitions and red labels the rewards (two arrows carry reward 100).]

MDP ⟨X, A, δ, r⟩
X = {S0, S1, S2, S3, S4, G}
A = {L, R, U, D}
δ represented as arrows in the figure (e.g., δ(S0, R) = S1)
r(x, a) represented as red values on the arrows in the figure (e.g., r(S0, R) = 0)
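Deterministic MDPs like this one plug into the stochastic solvers above by giving every transition probability 1; a minimal sketch using only what the slide states explicitly (δ(S0, R) = S1 with r(S0, R) = 0; the rest of the layout is in the figure):

delta = {("S0", "R"): "S1"}   # only the transition the slide states explicitly
rew = {("S0", "R"): 0.0}      # r(S0, R) = 0 (red label in the figure)

# deterministic -> stochastic: each known (x, a) reaches its successor with prob. 1
P = {(x, a): {x_next: 1.0} for (x, a), x_next in delta.items()}

def R(x, a, x_next):
    return rew[(x, a)]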

SLIDE 21

Example 2: deterministic grid controller

Reaching the right-most side of the environment from any initial state.
MDP ⟨X, A, δ, r⟩
X = {(r, c) | coordinates in the grid}
A = {Left, Right, Up, Down}
δ: cardinal movements, with no effect (i.e., the agent remains in the current state) if the destination state is a black square
r: 1000 for reaching the right-most column, -10 for hitting any obstacle, 0 otherwise
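A direct Python transcription of this reward specification (illustrative; obstacle is an assumed predicate and n_cols an assumed constant, not names from the slides):

MOVES = {"Left": (0, -1), "Right": (0, 1), "Up": (-1, 0), "Down": (1, 0)}

def reward(x, a, n_cols, obstacle):
    row, col = x
    dr, dc = MOVES[a]
    if obstacle(row + dr, col + dc):   # the move hits a black square
        return -10.0
    if col + dc == n_cols:             # the move reaches the right-most column
        return 1000.0
    return 0.0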

SLIDE 22

Example 3: non-deterministic grid controller

Reaching the right-most side of the environment from any initial state.
MDP ⟨X, A, δ, r⟩
X = {(r, c) | coordinates in the grid}
A = {Left, Right}
δ: cardinal movements with non-deterministic effects (see next slide)
r: 1000 for reaching the right-most column, -10 for hitting any obstacle, +1 for any Right action, -1 for any Left action

SLIDE 23

Example 3: non-deterministic grid controller

Transition probability:
P(x_{t+1} = (r′, c′) | x_t = (r, c), a = Right) =
  γ_F, if ¬obstacle(r, c+1) ∧ c′ = c+1 ∧ r′ = r
  (1 − γ_F)/2, if ¬obstacle(r, c+1) ∧ c′ = c+1 ∧ |r′ − r| = 1
  γ_B, if obstacle(r, c+1) ∧ c′ = c−1 ∧ r′ = r
  (1 − γ_B)/2, if obstacle(r, c+1) ∧ c′ = c−1 ∧ |r′ − r| = 1
  0, otherwise
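This piecewise definition translates directly into Python (obstacle is an assumed predicate; gamma_f and gamma_b are the slide's γ_F and γ_B):

def p_right(x_next, x, gamma_f, gamma_b, obstacle):
    # P(x_{t+1} = (r2, c2) | x_t = (r, c), a = Right)
    r, c = x
    r2, c2 = x_next
    if not obstacle(r, c + 1):                 # forward cell is free
        if c2 == c + 1 and r2 == r:
            return gamma_f
        if c2 == c + 1 and abs(r2 - r) == 1:
            return (1 - gamma_f) / 2
    else:                                      # forward cell blocked: bounce back
        if c2 == c - 1 and r2 == r:
            return gamma_b
        if c2 == c - 1 and abs(r2 - r) == 1:
            return (1 - gamma_b) / 2
    return 0.0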

SLIDE 24

Example 4: Hanoi Towers

SLIDE 25

Example 5: Tic-Tac-Toe

SLIDE 26

References

[ArtInt] David Poole and Alan Mackworth. Artificial Intelligence: Foundations of Computational Agents, Chapter 9.5 Decision Processes. Cambridge University Press, 2010. On-line: http://artint.info/
