SLIDE 1

Stochastic Optimal Control – part 2: discrete time, Markov Decision Processes, Reinforcement Learning

Marc Toussaint

Machine Learning & Robotics Group – TU Berlin
mtoussai@cs.tu-berlin.de
ICML 2008, Helsinki, July 5th, 2008

  • Why stochasticity?
  • Markov Decision Processes
  • Bellman optimality equation, Dynamic Programming, Value Iteration
  • Reinforcement Learning: learning from experience

SLIDE 2

Why consider stochasticity?

1) the system is inherently stochastic
2) the true system is actually deterministic, but
   a) the system is described on a level of abstraction/simplification, which makes the model approximate and stochastic
   b) sensors/observations are noisy; we never know the exact state
   c) we can handle only a part of the whole system – partial knowledge → uncertainty – decomposed planning; factored state representation

[diagram: agent 1 and agent 2 interacting with a shared world]

  • probabilities are a tool to represent information and uncertainty

– there are many sources of uncertainty

SLIDE 3

Machine Learning models of stochastic processes

  • Markov Processes

defined by random variables $x_0, x_1, \dots$ and transition probabilities $P(x_{t+1} \mid x_t)$

[diagram: Markov chain $x_0 \to x_1 \to x_2$]

  • non-Markovian Processes

– higher-order Markov Processes, autoregression models
– structured models (hierarchical, grammars, text models)
– Gaussian processes (both discrete and continuous time)
– etc.

  • continuous time processes

– stochastic differential equations

SLIDE 4

Markov Decision Processes

  • a Markov Process on the random variables of states $x_t$, actions $a_t$, and rewards $r_t$

[diagram: MDP graphical model – states $x_0, x_1, x_2$, actions $a_0, a_1, a_2$, rewards $r_0, r_1, r_2$, policy $\pi$]

$P(x_{t+1} \mid a_t, x_t)$ – transition probability (1)
$P(r_t \mid a_t, x_t)$ – reward probability (2)
$P(a_t \mid x_t) = \pi(a_t \mid x_t)$ – policy (3)

  • we will assume stationarity, no explicit dependency on time:

– $P(x' \mid a, x)$ and $P(r \mid a, x)$ are invariable properties of the world
– the policy $\pi$ is a property of the agent
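To make these objects concrete, here is a minimal Python/numpy sketch of such a finite stationary MDP (the array names `P`, `R`, `pi` and the tiny 2-state example are illustrative assumptions, not from the slides):

```python
import numpy as np

# a tiny 2-state, 2-action MDP, stationary as assumed above
n_states, n_actions = 2, 2

# P[a, x, x'] = P(x' | a, x): one row-stochastic matrix per action
P = np.array([[[0.9, 0.1],    # action 0, from state 0
               [0.1, 0.9]],   # action 0, from state 1
              [[0.5, 0.5],    # action 1, from state 0
               [0.5, 0.5]]])  # action 1, from state 1

# R[a, x] = expected immediate reward E[r | a, x]
R = np.array([[0.0, 1.0],
              [0.5, 0.0]])

# a stochastic policy pi[x, a] = pi(a | x), a property of the agent
pi = np.full((n_states, n_actions), 0.5)

assert np.allclose(P.sum(axis=-1), 1.0)   # transition probabilities sum to 1
assert np.allclose(pi.sum(axis=-1), 1.0)  # policy probabilities sum to 1
```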

SLIDE 5
Optimal policies
  • value (expected discounted return) of policy $\pi$ when started in $x$:

$V^\pi(x) = E\{\, r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid x_0 = x;\, \pi \,\}$

(cf. the cost function $C(x_0, a_{0:T}) = \phi(x_T) + \sum_{t=0}^{T-1} R(t, x_t, a_t)$)

  • optimal value function: $V^*(x) = \max_\pi V^\pi(x)$

  • a policy $\pi^*$ is optimal iff $\forall x:\; V^{\pi^*}(x) = V^*(x)$ (simultaneously maximizing the value in all states)

  • There always exists (at least one) optimal deterministic policy!
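As a side note, for a finite MDP the value of a fixed deterministic policy can be computed exactly by solving a linear system, since the definition above gives $V^\pi = R^\pi + \gamma P^\pi V^\pi$. A minimal sketch (array conventions as in the earlier block; names illustrative):

```python
import numpy as np

def policy_value(P, R, policy, gamma=0.9):
    """Solve V = R_pi + gamma * P_pi V for a deterministic policy.

    P[a, x, x'] = P(x' | a, x), R[a, x] = E[r | a, x],
    policy[x] = integer action chosen in state x.
    """
    n = P.shape[1]
    P_pi = P[policy, np.arange(n), :]   # P_pi[x, x'] = P(x' | pi(x), x)
    R_pi = R[policy, np.arange(n)]      # R_pi[x]     = R(pi(x), x)
    # V^pi = (I - gamma P_pi)^{-1} R_pi  (the Bellman equation in matrix form)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
```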

SLIDE 6

Bellman optimality equation

$V^\pi(x) = E\{\, r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid x_0 = x;\, \pi \,\}$
$= E\{ r_0 \mid x_0 = x;\, \pi \} + \gamma\, E\{ r_1 + \gamma r_2 + \cdots \mid x_0 = x;\, \pi \}$
$= R(\pi(x), x) + \gamma \sum_{x'} P(x' \mid \pi(x), x)\; E\{ r_1 + \gamma r_2 + \cdots \mid x_1 = x';\, \pi \}$
$= R(\pi(x), x) + \gamma \sum_{x'} P(x' \mid \pi(x), x)\; V^\pi(x')$

  • Bellman optimality equation:

$V^*(x) = \max_a \big[ R(a, x) + \gamma \sum_{x'} P(x' \mid a, x)\, V^*(x') \big]$
$\pi^*(x) = \operatorname{argmax}_a \big[ R(a, x) + \gamma \sum_{x'} P(x' \mid a, x)\, V^*(x') \big]$

(if $\pi$ selected an action other than $\operatorname{argmax}_a[\cdot]$, it wouldn't be optimal: a $\pi'$ that equals $\pi$ everywhere except $\pi'(x) = \operatorname{argmax}_a[\cdot]$ would be better)

  • this is the principle of optimality in the stochastic case (related to Viterbi, max-product algorithm)

SLIDE 7

Dynamic Programming

  • Bellman optimality equation:

$V^*(x) = \max_a \big[ R(a, x) + \gamma \sum_{x'} P(x' \mid a, x)\, V^*(x') \big]$

  • Value Iteration (initialize $V_0(x) = 0$, iterate $k = 0, 1, \dots$):

$\forall x:\; V_{k+1}(x) = \max_a \big[ R(a, x) + \gamma \sum_{x'} P(x' \mid a, x)\, V_k(x') \big]$

– stopping criterion: $\max_x |V_{k+1}(x) - V_k(x)| \le \epsilon$ (see script for proof of convergence)

  • once it has converged, choose the policy

$\pi_k(x) = \operatorname{argmax}_a \big[ R(a, x) + \gamma \sum_{x'} P(x' \mid a, x)\, V_k(x') \big]$
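A minimal Value Iteration sketch under the array conventions introduced earlier ($\gamma$ and $\epsilon$ are illustrative parameter choices):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """P[a, x, x'] = P(x' | a, x), R[a, x] = R(a, x)."""
    V = np.zeros(P.shape[1])                    # initialize V_0(x) = 0
    while True:
        # Q[a, x] = R(a, x) + gamma * sum_x' P(x'|a,x) V(x')
        Q = R + gamma * P @ V
        V_new = Q.max(axis=0)                   # Bellman backup: max over actions
        if np.max(np.abs(V_new - V)) <= eps:    # stopping criterion from the slide
            return V_new, Q.argmax(axis=0)      # V* and the greedy policy pi_k
        V = V_new
```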

SLIDE 8

maze example

  • typical example of a value function in navigation

[online demo – or switch to Terran Lane’s lecture...]

SLIDE 9

comments

  • Bellman’s principle of optimality is the core of the methods
  • it refers to the recursive thinking of what makes a path optimal

– the recursive property of the optimal value function

  • related to Viterbi, max-product algorithm

SLIDE 10

Learning from experience

  • Reinforcement Learning problem: the models $P(x' \mid a, x)$ and $P(r \mid a, x)$ are not known; only exploration is allowed

[diagram: experience $\{x_t, a_t, r_t\}$ →(model learning)→ MDP model $P, R$ →(Dynamic Prog.)→ value/Q-function $V, Q$ →(policy update)→ policy $\pi$; TD learning and Q-learning map experience directly to the value/Q-function, and policy search (EM policy optim.) maps experience directly to the policy]

SLIDE 11

Model learning

  • trivial on a direct discrete representation: use the experience data to estimate the model, $\hat P(x' \mid a, x) \propto \#(x' \leftarrow x \mid a)$ (see the sketch after this list)

– for non-direct representations: Machine Learning methods

  • use DP to compute optimal policy for estimated model
  • Exploration-Exploitation is not a Dilemma

possible solutions: E3 algorithm, Bayesian RL (see later)
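A minimal sketch of this estimate-then-plan loop (the `(x, a, r, x')` tuple format of `experience` and the small smoothing constant are illustrative assumptions, not from the slide):

```python
import numpy as np

def estimate_model(experience, n_states, n_actions, smoothing=1e-3):
    """experience: list of (x, a, r, x') tuples.
    Returns P_hat[a, x, x'] proportional to #(x' <- x | a), and R_hat[a, x]."""
    counts = np.full((n_actions, n_states, n_states), smoothing)
    r_sum = np.zeros((n_actions, n_states))
    r_cnt = np.zeros((n_actions, n_states))
    for x, a, r, x_next in experience:
        counts[a, x, x_next] += 1          # transition counts #(x' <- x | a)
        r_sum[a, x] += r
        r_cnt[a, x] += 1
    P_hat = counts / counts.sum(axis=-1, keepdims=True)   # normalize rows
    R_hat = r_sum / np.maximum(r_cnt, 1)                  # mean observed reward
    return P_hat, R_hat
```

The resulting `P_hat`, `R_hat` can be fed directly into the `value_iteration` sketch above to compute the optimal policy for the estimated model.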

SLIDE 12

Temporal Difference

  • recall Value Iteration:

$\forall x:\; V_{k+1}(x) = \max_a \big[ R(a, x) + \gamma \sum_{x'} P(x' \mid a, x)\, V_k(x') \big]$

  • Temporal Difference learning (TD): given experience $(x_t, a_t, r_t, x_{t+1})$,

$V_{\text{new}}(x_t) = (1 - \alpha)\, V_{\text{old}}(x_t) + \alpha\, [r_t + \gamma V_{\text{old}}(x_{t+1})]$
$= V_{\text{old}}(x_t) + \alpha\, [r_t + \gamma V_{\text{old}}(x_{t+1}) - V_{\text{old}}(x_t)]$

... this is a stochastic variant of Dynamic Programming → one can prove convergence with probability 1 (see Q-learning in the script)

  • reinforcement:

– more reward than expected ($r_t > V_{\text{old}}(x_t) - \gamma V_{\text{old}}(x_{t+1})$) → increase $V(x_t)$
– less reward than expected ($r_t < V_{\text{old}}(x_t) - \gamma V_{\text{old}}(x_{t+1})$) → decrease $V(x_t)$
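The TD update above in code, a minimal sketch ($\alpha$ and $\gamma$ are illustrative parameter choices):

```python
def td_update(V, x_t, r_t, x_next, alpha=0.1, gamma=0.9):
    """One TD(0) step on a tabular value function V (a numpy array)."""
    td_error = r_t + gamma * V[x_next] - V[x_t]  # more/less reward than expected?
    V[x_t] += alpha * td_error                   # increase or decrease V(x_t)
    return V
```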

SLIDE 13

Q-learning: convergence with probability 1

  • Q-learning:

$Q^\pi(a, x) = E\{\, r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid x_0 = x,\, a_0 = a;\, \pi \,\}$

$Q^*(a, x) = R(a, x) + \gamma \sum_{x'} P(x' \mid a, x)\, \max_{a'} Q^*(a', x')$

Q-Value Iteration: $\forall a, x:\; Q_{k+1}(a, x) = R(a, x) + \gamma \sum_{x'} P(x' \mid a, x)\, \max_{a'} Q_k(a', x')$

Q-learning: $Q_{\text{new}}(x_t, a_t) = (1 - \alpha)\, Q_{\text{old}}(x_t, a_t) + \alpha\, [r_t + \gamma \max_a Q_{\text{old}}(x_{t+1}, a)]$

  • Q-learning is a stochastic approximation of Q-VI:

Q-VI is deterministic: $Q_{k+1} = T(Q_k)$
Q-learning is stochastic: $Q_{k+1} = (1 - \alpha) Q_k + \alpha [T(Q_k) + \eta_k]$, where $\eta_k$ is zero mean!
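The tabular Q-learning update above as a minimal code sketch (action selection, e.g. ε-greedy, would be layered on top; names illustrative):

```python
def q_update(Q, x_t, a_t, r_t, x_next, alpha=0.1, gamma=0.9):
    """One Q-learning step on a tabular Q[x, a] numpy array."""
    target = r_t + gamma * Q[x_next].max()       # bootstrap with max_a' Q(x', a')
    Q[x_t, a_t] = (1 - alpha) * Q[x_t, a_t] + alpha * target
    return Q
```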

SLIDE 14

Q-learning impact

  • Q-Learning (Watkins, 1988) is the first provably convergent direct adaptive optimal control algorithm

  • Great impact on the field of Reinforcement Learning:

– smaller representation than models
– automatically focuses attention to where it is needed, i.e., no sweeps through state space
– though it does not solve the exploration-versus-exploitation issue
– ε-greedy, optimistic initialization, etc.

SLIDE 15

Eligibility traces

  • Temporal Difference:

$V_{\text{new}}(x_0) = V_{\text{old}}(x_0) + \alpha\, [r_0 + \gamma V_{\text{old}}(x_1) - V_{\text{old}}(x_0)]$

  • longer reward sequence: $r_0\, r_1\, r_2\, r_3\, r_4\, r_5\, r_6\, r_7$

temporal credit assignment – think further backwards: receiving $r_3$ also tells us something about $V(x_0)$:

$V_{\text{new}}(x_0) = V_{\text{old}}(x_0) + \alpha\, [r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 V_{\text{old}}(x_3) - V_{\text{old}}(x_0)]$

  • online implementation: remember where you’ve been recently (“eligibility trace”) and update those values as well:

$e(x_t) \leftarrow e(x_t) + 1$
$\forall x:\; V_{\text{new}}(x) = V_{\text{old}}(x) + \alpha\, e(x)\, [r_t + \gamma V_{\text{old}}(x_{t+1}) - V_{\text{old}}(x_t)]$
$\forall x:\; e(x) \leftarrow \gamma \lambda\, e(x)$

  • core topic of the Sutton & Barto book

– a great improvement
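A minimal TD($\lambda$) sketch implementing exactly the three updates above (accumulating traces; the parameter values are illustrative):

```python
import numpy as np

def td_lambda_step(V, e, x_t, r_t, x_next, alpha=0.1, gamma=0.9, lam=0.9):
    """One TD(lambda) step; V and e are tabular numpy arrays over states."""
    e[x_t] += 1.0                                # mark current state as eligible
    td_error = r_t + gamma * V[x_next] - V[x_t]
    V += alpha * e * td_error                    # update all recently visited states
    e *= gamma * lam                             # decay the eligibility trace
    return V, e
```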

SLIDE 16

comments

  • again, Bellman’s principle of optimality is the core of the methods:

TD(λ), Q-learning, and eligibilities are all methods that converge to a function obeying the Bellman optimality equation

SLIDE 17

E3: Explicit Explore or Exploit

  • (John Langford)

from the observed data, construct two MDPs:

(1) MDP_known includes the sufficiently often visited states and executed actions, with (rather exact) estimates of P and R (a model which captures what you know)
(2) MDP_unknown = MDP_known, except the reward is 1 for all actions which leave the known states and 0 otherwise (a model which captures the optimism of exploration)

  • the algorithm (see the sketch below):

(1) If the last x is not in Known: choose the least previously used action.
(2) Else:
    (a) [seek exploration] If $V_{\text{unknown}} > \epsilon$, act according to $V_{\text{unknown}}$ until the state is unknown (or $t \bmod T = 0$), then go to (1).
    (b) [exploit] Else act according to $V_{\text{known}}$.
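A rough sketch of this decision rule in Python (the `known` set, the `visit_counts` bookkeeping, and the precomputed policies `pi_known`, `pi_unknown` from planning in the two MDPs are assumed to be maintained elsewhere; all names are illustrative):

```python
def e3_action(x, known, visit_counts, V_unknown, pi_unknown, pi_known, eps):
    """One decision step of E3 (illustrative sketch, not the full algorithm)."""
    if x not in known:
        # balanced wandering: try the least previously used action in x
        return min(visit_counts[x], key=visit_counts[x].get)
    if V_unknown[x] > eps:
        return pi_unknown[x]   # attempted exploration: head for unknown states
    return pi_known[x]         # exploitation: act greedily in the known MDP
```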

SLIDE 18

E3 – Theory

  • for any (unknown) MDP:

– the total number of actions and the computation time required by E3 are $\text{poly}(|X|, |A|, T^*, \frac{1}{\epsilon}, \ln \frac{1}{\delta})$
– performance guarantee: with probability at least $(1 - \delta)$, the expected return of E3 will exceed $V^* - \epsilon$

  • details:

– actual return: $\frac{1}{T} \sum_{t=1}^{T} r_t$
– let $T^*$ denote the (unknown) mixing time of the MDP
– one key insight: even the optimal policy will take time $O(T^*)$ to achieve an actual return that is near-optimal

  • a straightforward & intuitive approach!

– the exploration-exploitation dilemma is not a dilemma!
– cf. active learning, information seeking, curiosity, variance analysis

SLIDE 19

Bayesian Reinforcement Learning

  • initially, we don’t know the MDP

– but based on experience we can estimate it

  • parametrize the MDP by a parameter $\theta$ (e.g., direct parametrization: $P(x' \mid a, x) = \theta_{x'ax}$)

– given experience $x_0 a_0\, x_1 a_1\, x_2 a_2 \cdots$ we can estimate the posterior $b(\theta) = P(\theta \mid x_{0:t}, a_{0:t})$

  • given a “posterior belief b about the world”, plan to maximize the policy value in this distribution of worlds:

$V^*(x, b) = \max_a \big[ R(a, x) + \gamma \sum_{x'} \int_\theta P(x' \mid a, x;\, \theta)\, b(\theta)\, d\theta\;\, V^*(x', b') \big]$

  • (see last year’s ICML tutorial; old theory from Operations Research)
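For the direct parametrization, a convenient concrete choice of belief, assumed here for illustration, is a Dirichlet distribution over each row $\theta_{\cdot ax}$, since the posterior then stays Dirichlet under transition counting:

```python
import numpy as np

n_states, n_actions = 2, 2                          # tiny illustrative MDP

# Dirichlet pseudo-counts alpha[a, x, x'] represent the belief b(theta)
alpha = np.ones((n_actions, n_states, n_states))    # uniform prior

def update_belief(alpha, x, a, x_next):
    """Posterior update of b(theta) after observing x --a--> x'."""
    alpha[a, x, x_next] += 1                        # conjugate update: add a count
    return alpha

def expected_model(alpha):
    """E_b[P(x' | a, x; theta)], i.e. the integral over theta above."""
    return alpha / alpha.sum(axis=-1, keepdims=True)
```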
SLIDE 20

comments

  • there is no fundamentally unsolved exploration-exploitation dilemma!

– but an efficiency issue

SLIDE 21

further topics

  • function approximation:

– Laplacian eigenfunctions as a value function representation (Mahadevan)
– Gaussian processes (Carl Rasmussen, Yaakov Engel)

  • representations:

– macro (hierarchical) policies, abstractions, options (Precup, Sutton, Singh, Dietterich, Parr)
– predictive state representations (PSRs; Littman, Sutton, Singh)

  • partial observability (POMDPs)

SLIDE 22
  • we conclude with a demo from Andrew Ng:

– helicopter flight
– actions: standard remote control
– reward functions: hand-designed for specific tasks
[videos: rolls, inverted flight]

SLIDE 23

appendix: discrete time continuous state control

  • same framework as MDPs, different conventional notation:

– control $u_t$ ↔ action $a_t$
– system: $x_{t+1} = x_t + f(t, x_t, u_t) + \xi_t$ (discrete time) or $dx = f(t, x, u)\, dt + d\xi$ (continuous time) ↔ transition probability $P(x_{t+1} \mid a_t, x_t)$
– final cost $\phi(x_T)$ and local cost $R(t, x_t, u_t)$ ↔ reward probability $P(r_t \mid a_t, x_t)$
– expected trajectory cost $C(x_0, u_{0:T})$ ↔ policy value $V^\pi(x)$
– optimal cost-to-go $J(t, x)$ ↔ optimal value function $V^*(x)$

  • discrete time stochastic controlled system:

$x_{t+1} = f(x_t, u_t) + \xi\,,\quad \xi \sim \mathcal{N}(0, Q)$
$P(x_{t+1} \mid u_t, x_t) = \mathcal{N}(x_{t+1} \mid f(x_t, u_t),\, Q)$

  • the objective is to minimize the expectation of the cost

$C(x_{0:T}, u_{0:T}) = \sum_{t=0}^{T} R(t, x_t, u_t)$
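A minimal rollout sketch of such a system (the dynamics `f`, noise covariance `Q`, and control sequence are illustrative assumptions):

```python
import numpy as np

def simulate(f, x0, controls, Q, seed=0):
    """Roll out x_{t+1} = f(x_t, u_t) + xi with xi ~ N(0, Q)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    trajectory = [x]
    for u in controls:
        xi = rng.multivariate_normal(np.zeros(x.shape[0]), Q)  # Gaussian system noise
        x = f(x, u) + xi
        trajectory.append(x)
    return np.asarray(trajectory)
```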

SLIDE 24

appendix: discrete time continuous state control

  • just as in the MDP case, the value function obeys the Bellman optimality equation:

$J_t(x) = \min_u \big[ R(t, x, u) + \int_{x'} P(x' \mid u, x)\, J_{t+1}(x')\, dx' \big]$

  • 2 types of optimal control problems:

– open-loop: find a control sequence $u^*_{1:T}$ that minimizes the expected cost
– closed-loop: find a control law $\pi^*: (t, x) \mapsto u_t$ (which exploits the true state observation in each time step and maps it to a feedback control signal) that minimizes the expected cost

SLIDE 25

appendix: Linear-quadratic-Gaussian (LQG) case

  • consider a linear control process with Gaussian noise and quadratic costs:

$P(x_t \mid x_{t-1}, u_t) = \mathcal{N}(x_t \mid A x_{t-1} + B u_t,\, Q)\,,\quad C(x_{1:T}, u_{1:T}) = \sum_{t=1}^{T} x_t^\top R\, x_t + u_t^\top H\, u_t$

  • assume we know the exact cost-to-go $J_t(x)$ at time $t$ and that it has the form $J_t(x) = x^\top V_t\, x$. Then

$J_{t-1}(x) = \min_u \big[ x^\top R x + u^\top H u + \int_y \mathcal{N}(y \mid Ax + Bu,\, Q)\; y^\top V_t\, y\; dy \big]$
$= \min_u \big[ x^\top R x + u^\top H u + (Ax + Bu)^\top V_t (Ax + Bu) + \operatorname{tr}(V_t Q) \big]$
$= \min_u \big[ x^\top R x + u^\top (H + B^\top V_t B)\, u + 2\, u^\top B^\top V_t A\, x + x^\top A^\top V_t A\, x + \operatorname{tr}(V_t Q) \big]$

minimization yields

$0 = 2 (H + B^\top V_t B)\, u^* + 2\, B^\top V_t A\, x \;\Rightarrow\; u^* = -(H + B^\top V_t B)^{-1} B^\top V_t A\, x$

$J_{t-1}(x) = x^\top V_{t-1}\, x\,,\quad V_{t-1} = R + A^\top V_t A - A^\top V_t B\, (H + B^\top V_t B)^{-1} B^\top V_t A$

(the constant $\operatorname{tr}(V_t Q)$ depends on neither $x$ nor $u$, so it affects neither the minimization nor the quadratic form)

  • this is called the Riccati equation
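A minimal sketch of the resulting backward Riccati recursion (the zero terminal condition $V_T = 0$ is an illustrative assumption; matrix names as on the slide):

```python
import numpy as np

def lqg_backward(A, B, R, H, T):
    """Backward Riccati recursion; returns gains K_t with u*_t = -K_t x."""
    n = A.shape[0]
    V = np.zeros((n, n))                 # terminal condition V_T = 0 (assumed)
    gains = []
    for _ in range(T):
        # K = (H + B^T V B)^{-1} B^T V A, so that u* = -K x
        K = np.linalg.solve(H + B.T @ V @ B, B.T @ V @ A)
        # Riccati update: V_{t-1} = R + A^T V A - A^T V B K
        V = R + A.T @ V @ A - A.T @ V @ B @ K
        gains.append(K)
    return gains[::-1]                   # gains ordered t = 1..T
```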
