SLIDE 1

Stochastic Optimal Control – part 2: discrete time, Markov Decision Processes, Reinforcement Learning

Marc Toussaint

Machine Learning & Robotics Group – TU Berlin
mtoussai@cs.tu-berlin.de
ICML 2008, Helsinki, July 5th, 2008

  • Why stochasticity?
  • Markov Decision Processes
  • Bellman optimality equation, Dynamic Programming, Value Iteration
  • Reinforcement Learning: learning from experience

SLIDE 2

Why consider stochasticity?

1) the system is inherently stochastic
2) the true system is actually deterministic, but
   a) the system is described on a level of abstraction/simplification, which makes the model approximate and stochastic
   b) sensors/observations are noisy; we never know the exact state
   c) we can handle only a part of the whole system – partial knowledge → uncertainty – decomposed planning; factored state representation

[diagram: agent 1 and agent 2 interacting with a shared world]

  • probabilities are a tool to represent information and uncertainty

– there are many sources of uncertainty

SLIDE 3

Machine Learning models of stochastic processes

  • Markov Processes

defined by random variables $x_0, x_1, \dots$ and transition probabilities $P(x_{t+1} \mid x_t)$

[diagram: Markov chain $x_0 \to x_1 \to x_2$]

  • non-Markovian Processes

– higher-order Markov Processes, autoregression models
– structured models (hierarchical, grammars, text models)
– Gaussian processes (both discrete and continuous time)
– etc.

  • continuous time processes

– stochastic differential equations

SLIDE 4

Markov Decision Processes

  • a Markov Process on the random variables of states $x_t$, actions $a_t$, and rewards $r_t$

[diagram: MDP graphical model – states $x_0, x_1, x_2$, actions $a_0, a_1, a_2$, rewards $r_0, r_1, r_2$, policy $\pi$]

$P(x_{t+1} \mid a_t, x_t)$ – transition probability (1)
$P(r_t \mid a_t, x_t)$ – reward probability (2)
$P(a_t \mid x_t) = \pi(a_t \mid x_t)$ – policy (3)

  • we will assume stationarity, no explicit dependency on time:

– $P(x' \mid a, x)$ and $P(r \mid a, x)$ are invariable properties of the world
– the policy $\pi$ is a property of the agent
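To make these objects concrete, here is a minimal Python/numpy sketch of such a finite stationary MDP (the array names `P`, `R`, `pi` and the tiny 2-state example are illustrative assumptions, not from the slides):

```python
import numpy as np

# a tiny 2-state, 2-action MDP, stationary as assumed above
n_states, n_actions = 2, 2

# P[a, x, x'] = P(x' | a, x): one row-stochastic matrix per action
P = np.array([[[0.9, 0.1],    # action 0, from state 0
               [0.1, 0.9]],   # action 0, from state 1
              [[0.5, 0.5],    # action 1, from state 0
               [0.5, 0.5]]])  # action 1, from state 1

# R[a, x] = expected immediate reward E[r | a, x]
R = np.array([[0.0, 1.0],
              [0.5, 0.0]])

# a stochastic policy pi[x, a] = pi(a | x), a property of the agent
pi = np.full((n_states, n_actions), 0.5)

assert np.allclose(P.sum(axis=-1), 1.0)   # transition probabilities sum to 1
assert np.allclose(pi.sum(axis=-1), 1.0)  # policy probabilities sum to 1
```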

SLIDE 5
Optimal policies
  • value (expected discounted return) of policy $\pi$ when started in $x$:

$V^\pi(x) = E\{\, r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid x_0 = x;\, \pi \,\}$

(cf. the cost function $C(x_0, a_{0:T}) = \phi(x_T) + \sum_{t=0}^{T-1} R(t, x_t, a_t)$)

  • optimal value function: $V^*(x) = \max_\pi V^\pi(x)$

  • a policy $\pi^*$ is optimal iff $\forall x:\; V^{\pi^*}(x) = V^*(x)$ (simultaneously maximizing the value in all states)

  • There always exists (at least one) optimal deterministic policy!
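As a side note, for a finite MDP the value of a fixed deterministic policy can be computed exactly by solving a linear system, since the definition above gives $V^\pi = R^\pi + \gamma P^\pi V^\pi$. A minimal sketch (array conventions as in the earlier block; names illustrative):

```python
import numpy as np

def policy_value(P, R, policy, gamma=0.9):
    """Solve V = R_pi + gamma * P_pi V for a deterministic policy.

    P[a, x, x'] = P(x' | a, x), R[a, x] = E[r | a, x],
    policy[x] = integer action chosen in state x.
    """
    n = P.shape[1]
    P_pi = P[policy, np.arange(n), :]   # P_pi[x, x'] = P(x' | pi(x), x)
    R_pi = R[policy, np.arange(n)]      # R_pi[x]     = R(pi(x), x)
    # V^pi = (I - gamma P_pi)^{-1} R_pi  (the Bellman equation in matrix form)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
```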

SLIDE 6

Bellman optimality equation

$V^\pi(x) = E\{\, r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid x_0 = x;\, \pi \,\}$
$= E\{ r_0 \mid x_0 = x;\, \pi \} + \gamma\, E\{ r_1 + \gamma r_2 + \cdots \mid x_0 = x;\, \pi \}$
$= R(\pi(x), x) + \gamma \sum_{x'} P(x' \mid \pi(x), x)\; E\{ r_1 + \gamma r_2 + \cdots \mid x_1 = x';\, \pi \}$
$= R(\pi(x), x) + \gamma \sum_{x'} P(x' \mid \pi(x), x)\; V^\pi(x')$

  • Bellman optimality equation:

$V^*(x) = \max_a \big[ R(a, x) + \gamma \sum_{x'} P(x' \mid a, x)\, V^*(x') \big]$
$\pi^*(x) = \operatorname{argmax}_a \big[ R(a, x) + \gamma \sum_{x'} P(x' \mid a, x)\, V^*(x') \big]$

(if $\pi$ selected an action other than $\operatorname{argmax}_a[\cdot]$, it wouldn't be optimal: a $\pi'$ that equals $\pi$ everywhere except $\pi'(x) = \operatorname{argmax}_a[\cdot]$ would be better)

  • this is the principle of optimality in the stochastic case (related to Viterbi, max-product algorithm)

SLIDE 7

Dynamic Programming

  • Bellman optimality equation:

$V^*(x) = \max_a \big[ R(a, x) + \gamma \sum_{x'} P(x' \mid a, x)\, V^*(x') \big]$

  • Value Iteration (initialize $V_0(x) = 0$, iterate $k = 0, 1, \dots$):

$\forall x:\; V_{k+1}(x) = \max_a \big[ R(a, x) + \gamma \sum_{x'} P(x' \mid a, x)\, V_k(x') \big]$

– stopping criterion: $\max_x |V_{k+1}(x) - V_k(x)| \le \epsilon$ (see script for proof of convergence)

  • once it has converged, choose the policy

$\pi_k(x) = \operatorname{argmax}_a \big[ R(a, x) + \gamma \sum_{x'} P(x' \mid a, x)\, V_k(x') \big]$
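A minimal Value Iteration sketch under the array conventions introduced earlier ($\gamma$ and $\epsilon$ are illustrative parameter choices):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """P[a, x, x'] = P(x' | a, x), R[a, x] = R(a, x)."""
    V = np.zeros(P.shape[1])                    # initialize V_0(x) = 0
    while True:
        # Q[a, x] = R(a, x) + gamma * sum_x' P(x'|a,x) V(x')
        Q = R + gamma * P @ V
        V_new = Q.max(axis=0)                   # Bellman backup: max over actions
        if np.max(np.abs(V_new - V)) <= eps:    # stopping criterion from the slide
            return V_new, Q.argmax(axis=0)      # V* and the greedy policy pi_k
        V = V_new
```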

SLIDE 8

maze example

  • typical example of a value function in navigation

[online demo – or switch to Terran Lane’s lecture...]

SLIDE 9

comments

  • Bellman’s principle of optimality is the core of the methods
  • it refers to the recursive thinking of what makes a path optimal

– the recursive property of the optimal value function

  • related to Viterbi, max-product algorithm

SLIDE 10

Learning from experience

  • Reinforcement Learning problem: the models $P(x' \mid a, x)$ and $P(r \mid a, x)$ are not known; only exploration is allowed

[diagram: experience $\{x_t, a_t, r_t\}$ →(model learning)→ MDP model $P, R$ →(Dynamic Prog.)→ value/Q-function $V, Q$ →(policy update)→ policy $\pi$; TD learning and Q-learning map experience directly to the value/Q-function, and policy search (EM policy optim.) maps experience directly to the policy]

SLIDE 11

Model learning

  • trivial on a direct discrete representation: use the experience data to estimate the model, $\hat P(x' \mid a, x) \propto \#(x' \leftarrow x \mid a)$ (see the sketch after this list)

– for non-direct representations: Machine Learning methods

  • use DP to compute optimal policy for estimated model
  • Exploration-Exploitation is not a Dilemma

possible solutions: E3 algorithm, Bayesian RL (see later)
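A minimal sketch of this estimate-then-plan loop (the `(x, a, r, x')` tuple format of `experience` and the small smoothing constant are illustrative assumptions, not from the slide):

```python
import numpy as np

def estimate_model(experience, n_states, n_actions, smoothing=1e-3):
    """experience: list of (x, a, r, x') tuples.
    Returns P_hat[a, x, x'] proportional to #(x' <- x | a), and R_hat[a, x]."""
    counts = np.full((n_actions, n_states, n_states), smoothing)
    r_sum = np.zeros((n_actions, n_states))
    r_cnt = np.zeros((n_actions, n_states))
    for x, a, r, x_next in experience:
        counts[a, x, x_next] += 1          # transition counts #(x' <- x | a)
        r_sum[a, x] += r
        r_cnt[a, x] += 1
    P_hat = counts / counts.sum(axis=-1, keepdims=True)   # normalize rows
    R_hat = r_sum / np.maximum(r_cnt, 1)                  # mean observed reward
    return P_hat, R_hat
```

The resulting `P_hat`, `R_hat` can be fed directly into the `value_iteration` sketch above to compute the optimal policy for the estimated model.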

SLIDE 12

Temporal Difference

  • recall Value Iteration:

$\forall x:\; V_{k+1}(x) = \max_a \big[ R(a, x) + \gamma \sum_{x'} P(x' \mid a, x)\, V_k(x') \big]$

  • Temporal Difference learning (TD): given experience $(x_t, a_t, r_t, x_{t+1})$,

$V_{\text{new}}(x_t) = (1 - \alpha)\, V_{\text{old}}(x_t) + \alpha\, [r_t + \gamma V_{\text{old}}(x_{t+1})]$
$= V_{\text{old}}(x_t) + \alpha\, [r_t + \gamma V_{\text{old}}(x_{t+1}) - V_{\text{old}}(x_t)]$

... this is a stochastic variant of Dynamic Programming → one can prove convergence with probability 1 (see Q-learning in the script)

  • reinforcement:

– more reward than expected ($r_t > V_{\text{old}}(x_t) - \gamma V_{\text{old}}(x_{t+1})$) → increase $V(x_t)$
– less reward than expected ($r_t < V_{\text{old}}(x_t) - \gamma V_{\text{old}}(x_{t+1})$) → decrease $V(x_t)$
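The TD update above in code, a minimal sketch ($\alpha$ and $\gamma$ are illustrative parameter choices):

```python
def td_update(V, x_t, r_t, x_next, alpha=0.1, gamma=0.9):
    """One TD(0) step on a tabular value function V (a numpy array)."""
    td_error = r_t + gamma * V[x_next] - V[x_t]  # more/less reward than expected?
    V[x_t] += alpha * td_error                   # increase or decrease V(x_t)
    return V
```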

SLIDE 13

Q-learning: convergence with probability 1

  • Q-learning:

$Q^\pi(a, x) = E\{\, r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid x_0 = x,\, a_0 = a;\, \pi \,\}$

$Q^*(a, x) = R(a, x) + \gamma \sum_{x'} P(x' \mid a, x)\, \max_{a'} Q^*(a', x')$

Q-Value Iteration: $\forall a, x:\; Q_{k+1}(a, x) = R(a, x) + \gamma \sum_{x'} P(x' \mid a, x)\, \max_{a'} Q_k(a', x')$

Q-learning: $Q_{\text{new}}(x_t, a_t) = (1 - \alpha)\, Q_{\text{old}}(x_t, a_t) + \alpha\, [r_t + \gamma \max_a Q_{\text{old}}(x_{t+1}, a)]$

  • Q-learning is a stochastic approximation of Q-VI:

Q-VI is deterministic: $Q_{k+1} = T(Q_k)$
Q-learning is stochastic: $Q_{k+1} = (1 - \alpha) Q_k + \alpha [T(Q_k) + \eta_k]$, where $\eta_k$ is zero mean!
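The tabular Q-learning update above as a minimal code sketch (action selection, e.g. ε-greedy, would be layered on top; names illustrative):

```python
def q_update(Q, x_t, a_t, r_t, x_next, alpha=0.1, gamma=0.9):
    """One Q-learning step on a tabular Q[x, a] numpy array."""
    target = r_t + gamma * Q[x_next].max()       # bootstrap with max_a' Q(x', a')
    Q[x_t, a_t] = (1 - alpha) * Q[x_t, a_t] + alpha * target
    return Q
```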

SLIDE 14

Q-learning impact

  • Q-Learning (Watkins, 1988) is the first provably convergent direct adaptive optimal control algorithm

  • Great impact on the field of Reinforcement Learning:

– smaller representation than models
– automatically focuses attention to where it is needed, i.e., no sweeps through state space
– though it does not solve the exploration-versus-exploitation issue
– ε-greedy, optimistic initialization, etc.

SLIDE 15

Eligibility traces

  • Temporal Difference:

$V_{\text{new}}(x_0) = V_{\text{old}}(x_0) + \alpha\, [r_0 + \gamma V_{\text{old}}(x_1) - V_{\text{old}}(x_0)]$

  • longer reward sequence: $r_0\, r_1\, r_2\, r_3\, r_4\, r_5\, r_6\, r_7$

temporal credit assignment – think further backwards: receiving $r_3$ also tells us something about $V(x_0)$:

$V_{\text{new}}(x_0) = V_{\text{old}}(x_0) + \alpha\, [r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 V_{\text{old}}(x_3) - V_{\text{old}}(x_0)]$

  • online implementation: remember where you’ve been recently (“eligibility trace”) and update those values as well:

$e(x_t) \leftarrow e(x_t) + 1$
$\forall x:\; V_{\text{new}}(x) = V_{\text{old}}(x) + \alpha\, e(x)\, [r_t + \gamma V_{\text{old}}(x_{t+1}) - V_{\text{old}}(x_t)]$
$\forall x:\; e(x) \leftarrow \gamma \lambda\, e(x)$

  • core topic of the Sutton & Barto book

– a great improvement
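A minimal TD($\lambda$) sketch implementing exactly the three updates above (accumulating traces; the parameter values are illustrative):

```python
import numpy as np

def td_lambda_step(V, e, x_t, r_t, x_next, alpha=0.1, gamma=0.9, lam=0.9):
    """One TD(lambda) step; V and e are tabular numpy arrays over states."""
    e[x_t] += 1.0                                # mark current state as eligible
    td_error = r_t + gamma * V[x_next] - V[x_t]
    V += alpha * e * td_error                    # update all recently visited states
    e *= gamma * lam                             # decay the eligibility trace
    return V, e
```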

SLIDE 16

comments

  • again, Bellman’s principle of optimality is the core of the methods:

TD(λ), Q-learning, and eligibilities are all methods that converge to a function obeying the Bellman optimality equation

SLIDE 17

E3: Explicit Explore or Exploit

  • (John Langford)

from the observed data, construct two MDPs:

(1) MDP_known includes the sufficiently often visited states and executed actions, with (rather exact) estimates of P and R (a model which captures what you know)
(2) MDP_unknown = MDP_known, except the reward is 1 for all actions which leave the known states and 0 otherwise (a model which captures the optimism of exploration)

  • the algorithm (see the sketch below):

(1) If the last x is not in Known: choose the least previously used action.
(2) Else:
    (a) [seek exploration] If $V_{\text{unknown}} > \epsilon$, act according to $V_{\text{unknown}}$ until the state is unknown (or $t \bmod T = 0$), then go to (1).
    (b) [exploit] Else act according to $V_{\text{known}}$.
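A rough sketch of this decision rule in Python (the `known` set, the `visit_counts` bookkeeping, and the precomputed policies `pi_known`, `pi_unknown` from planning in the two MDPs are assumed to be maintained elsewhere; all names are illustrative):

```python
def e3_action(x, known, visit_counts, V_unknown, pi_unknown, pi_known, eps):
    """One decision step of E3 (illustrative sketch, not the full algorithm)."""
    if x not in known:
        # balanced wandering: try the least previously used action in x
        return min(visit_counts[x], key=visit_counts[x].get)
    if V_unknown[x] > eps:
        return pi_unknown[x]   # attempted exploration: head for unknown states
    return pi_known[x]         # exploitation: act greedily in the known MDP
```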

SLIDE 18

E3 – Theory

  • for any (unknown) MDP:

– the total number of actions and the computation time required by E3 are $\text{poly}(|X|, |A|, T^*, \frac{1}{\epsilon}, \ln \frac{1}{\delta})$
– performance guarantee: with probability at least $(1 - \delta)$, the expected return of E3 will exceed $V^* - \epsilon$

  • details:

– actual return: $\frac{1}{T} \sum_{t=1}^{T} r_t$
– let $T^*$ denote the (unknown) mixing time of the MDP
– one key insight: even the optimal policy will take time $O(T^*)$ to achieve an actual return that is near-optimal

  • a straightforward & intuitive approach!

– the exploration-exploitation dilemma is not a dilemma!
– cf. active learning, information seeking, curiosity, variance analysis

SLIDE 19

Bayesian Reinforcement Learning

  • initially, we don’t know the MDP

– but based on experience we can estimate it

  • parametrize the MDP by a parameter $\theta$ (e.g., direct parametrization: $P(x' \mid a, x) = \theta_{x'ax}$)

– given experience $x_0 a_0\, x_1 a_1\, x_2 a_2 \cdots$ we can estimate the posterior $b(\theta) = P(\theta \mid x_{0:t}, a_{0:t})$

  • given a “posterior belief b about the world”, plan to maximize the policy value in this distribution of worlds:

$V^*(x, b) = \max_a \big[ R(a, x) + \gamma \sum_{x'} \int_\theta P(x' \mid a, x;\, \theta)\, b(\theta)\, d\theta\;\, V^*(x', b') \big]$

  • (see last year’s ICML tutorial; old theory from Operations Research)
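For the direct parametrization, a convenient concrete choice of belief, assumed here for illustration, is a Dirichlet distribution over each row $\theta_{\cdot ax}$, since the posterior then stays Dirichlet under transition counting:

```python
import numpy as np

n_states, n_actions = 2, 2                          # tiny illustrative MDP

# Dirichlet pseudo-counts alpha[a, x, x'] represent the belief b(theta)
alpha = np.ones((n_actions, n_states, n_states))    # uniform prior

def update_belief(alpha, x, a, x_next):
    """Posterior update of b(theta) after observing x --a--> x'."""
    alpha[a, x, x_next] += 1                        # conjugate update: add a count
    return alpha

def expected_model(alpha):
    """E_b[P(x' | a, x; theta)], i.e. the integral over theta above."""
    return alpha / alpha.sum(axis=-1, keepdims=True)
```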
SLIDE 20

comments

  • there is no fundamentally unsolved exploration-exploitation dilemma!

– but an efficiency issue

SLIDE 21

further topics

  • function approximation:

– Laplacian eigenfunctions as a value function representation (Mahadevan)
– Gaussian processes (Carl Rasmussen, Yaakov Engel)

  • representations:

– macro (hierarchical) policies, abstractions, options (Precup, Sutton, Singh, Dietterich, Parr)
– predictive state representations (PSRs; Littman, Sutton, Singh)

  • partial observability (POMDPs)

SLIDE 22
  • we conclude with a demo from Andrew Ng:

– helicopter flight
– actions: standard remote control
– reward functions: hand-designed for specific tasks
[videos: rolls, inverted flight]

SLIDE 23

appendix: discrete time continuous state control

  • same framework as MDPs, different conventional notation:

– control $u_t$ ↔ action $a_t$
– system: $x_{t+1} = x_t + f(t, x_t, u_t) + \xi_t$ (discrete time) or $dx = f(t, x, u)\, dt + d\xi$ (continuous time) ↔ transition probability $P(x_{t+1} \mid a_t, x_t)$
– final cost $\phi(x_T)$ and local cost $R(t, x_t, u_t)$ ↔ reward probability $P(r_t \mid a_t, x_t)$
– expected trajectory cost $C(x_0, u_{0:T})$ ↔ policy value $V^\pi(x)$
– optimal cost-to-go $J(t, x)$ ↔ optimal value function $V^*(x)$

  • discrete time stochastic controlled system:

$x_{t+1} = f(x_t, u_t) + \xi\,,\quad \xi \sim \mathcal{N}(0, Q)$
$P(x_{t+1} \mid u_t, x_t) = \mathcal{N}(x_{t+1} \mid f(x_t, u_t),\, Q)$

  • the objective is to minimize the expectation of the cost

$C(x_{0:T}, u_{0:T}) = \sum_{t=0}^{T} R(t, x_t, u_t)$
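A minimal rollout sketch of such a system (the dynamics `f`, noise covariance `Q`, and control sequence are illustrative assumptions):

```python
import numpy as np

def simulate(f, x0, controls, Q, seed=0):
    """Roll out x_{t+1} = f(x_t, u_t) + xi with xi ~ N(0, Q)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    trajectory = [x]
    for u in controls:
        xi = rng.multivariate_normal(np.zeros(x.shape[0]), Q)  # Gaussian system noise
        x = f(x, u) + xi
        trajectory.append(x)
    return np.asarray(trajectory)
```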

SLIDE 24

appendix: discrete time continuous state control

  • just as in the MDP case, the value function obeys the Bellman optimality equation:

$J_t(x) = \min_u \big[ R(t, x, u) + \int_{x'} P(x' \mid u, x)\, J_{t+1}(x')\, dx' \big]$

  • 2 types of optimal control problems:

– open-loop: find a control sequence $u^*_{1:T}$ that minimizes the expected cost
– closed-loop: find a control law $\pi^*: (t, x) \mapsto u_t$ (which exploits the true state observation in each time step and maps it to a feedback control signal) that minimizes the expected cost

SLIDE 25

appendix: Linear-quadratic-Gaussian (LQG) case

  • consider a linear control process with Gaussian noise and quadratic costs:

$P(x_t \mid x_{t-1}, u_t) = \mathcal{N}(x_t \mid A x_{t-1} + B u_t,\, Q)\,,\quad C(x_{1:T}, u_{1:T}) = \sum_{t=1}^{T} x_t^\top R\, x_t + u_t^\top H\, u_t$

  • assume we know the exact cost-to-go $J_t(x)$ at time $t$ and that it has the form $J_t(x) = x^\top V_t\, x$. Then

$J_{t-1}(x) = \min_u \big[ x^\top R x + u^\top H u + \int_y \mathcal{N}(y \mid Ax + Bu,\, Q)\; y^\top V_t\, y\; dy \big]$
$= \min_u \big[ x^\top R x + u^\top H u + (Ax + Bu)^\top V_t (Ax + Bu) + \operatorname{tr}(V_t Q) \big]$
$= \min_u \big[ x^\top R x + u^\top (H + B^\top V_t B)\, u + 2\, u^\top B^\top V_t A\, x + x^\top A^\top V_t A\, x + \operatorname{tr}(V_t Q) \big]$

minimization yields

$0 = 2 (H + B^\top V_t B)\, u^* + 2\, B^\top V_t A\, x \;\Rightarrow\; u^* = -(H + B^\top V_t B)^{-1} B^\top V_t A\, x$

$J_{t-1}(x) = x^\top V_{t-1}\, x\,,\quad V_{t-1} = R + A^\top V_t A - A^\top V_t B\, (H + B^\top V_t B)^{-1} B^\top V_t A$

(the constant $\operatorname{tr}(V_t Q)$ depends on neither $x$ nor $u$, so it affects neither the minimization nor the quadratic form)

  • this is called the Riccati equation
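A minimal sketch of the resulting backward Riccati recursion (the zero terminal condition $V_T = 0$ is an illustrative assumption; matrix names as on the slide):

```python
import numpy as np

def lqg_backward(A, B, R, H, T):
    """Backward Riccati recursion; returns gains K_t with u*_t = -K_t x."""
    n = A.shape[0]
    V = np.zeros((n, n))                 # terminal condition V_T = 0 (assumed)
    gains = []
    for _ in range(T):
        # K = (H + B^T V B)^{-1} B^T V A, so that u* = -K x
        K = np.linalg.solve(H + B.T @ V @ B, B.T @ V @ A)
        # Riccati update: V_{t-1} = R + A^T V A - A^T V B K
        V = R + A.T @ V @ A - A.T @ V @ B @ K
        gains.append(K)
    return gains[::-1]                   # gains ordered t = 1..T
```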
