Reinforcement Learning and Dynamic Programming
Talk 5 by Daniela and Christoph
Content
Reinforcement Learning Problem
- Agent-Environment Interface
- Markov Decision Processes
- Value Functions
- Bellman equations
Dynamic Programming
- Policy Evaluation, Improvement and Iteration
- Asynchronous DP
- Generalized Policy Iteration
Reinforcement Learning Problem
- Learning from interactions
- Achieving a goal
Example robot
(Diagram: 4x4 grid of states 1-16 with the four movement actions)
The reward is -1 for every transition, except the last transition, whose reward is 2.
Agent-Environment Interface
Agent
- Learner
- Decision maker
Environment
- Everything outside of the agent
Interaction
- State: S_t ∈ S
- Reward: R_t ∈ ℝ
- Action: A_t ∈ A(S_t)
Discrete time steps
- t = 0, 1, 2, 3, ...
Example Robot
A sample episode on the six-state grid:
S_0 = 1 → (reward -1) → S_1 = 2 → (reward -1) → S_2 = 5 → (reward -1) → S_3 = 5 → (reward 2) → S_4 = 6
Policy
- In each state, the agent can choose between different actions. The probability that the agent selects a possible action is called the policy.
- π_t(a|s): probability that A_t = a if S_t = s
- In reinforcement learning: the agent changes the
policy as a result of the experience
π_t(up|s_i) = 0.25, π_t(down|s_i) = 0.25, π_t(left|s_i) = 0.25, π_t(right|s_i) = 0.25
Example Robot: Diagram
(Diagram: transition graph over states 1-6 with action probabilities 0.25 / 0.5 on the edges)
Reward signal
- Goal: maximize the total amount of cumulative reward over the long run
(Diagram: transition graph over states 1-6; every transition yields reward -1, except transitions into state 6, which yield reward 2)
Return
Sum of the rewards
- G_t = R_{t+1} + R_{t+2} + R_{t+3} + ... + R_T, where T is the final time step
Maximize the expected return
Example (t = 0): a three-step episode gives G_0 = -1 - 1 + 2 = 0; a five-step episode gives G_0 = -1 - 1 - 1 - 1 + 2 = -2.
Discounting
- If the task is a continuing task, a discount rate for the return is
needed
Discount rate determines the present value of the future rewards in a continuing task
- G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}
  where γ is called the discount rate: 0 ≤ γ ≤ 1
Unified notation: G_t = Σ_{k=0}^{T-t-1} γ^k R_{t+k+1} (allowing T = ∞ or γ = 1, but not both)
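The two return definitions can be checked with a small sketch (the reward sequence is taken from the return example above; the backward recursion G_t = R_{t+1} + γ·G_{t+1} is an implementation choice of this sketch, not from the slides):

```python
def discounted_return(rewards, gamma=1.0):
    """G_t = R_{t+1} + gamma*R_{t+2} + ..., computed backwards
    via the recursion G_t = R_{t+1} + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Episode from the return example: rewards -1, -1, 2.
print(discounted_return([-1, -1, 2], gamma=1.0))  # 0.0 (undiscounted)
print(discounted_return([-1, -1, 2], gamma=0.9))  # about -0.28
```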
The Markov Property
- Pr{R_{t+1} = r, S_{t+1} = s′ | S_0, A_0, R_1, ..., S_{t-1}, A_{t-1}, R_t, S_t, A_t} = Pr{R_{t+1} = r, S_{t+1} = s′ | S_t, A_t}
- State signal summarizes past sensations compactly such that
all relevant information is retained
- Decisions are assumed to be a function of the current state only
Markov Decision Processes
Task has to satisfy the Markov Property
- If the state and action spaces are finite, then it is called a
finite Markov decision process
- Given any state and action, s and a, the probability of each possible next state and reward, s′ and r, is: p(s′, r | s, a) = Pr{S_{t+1} = s′, R_{t+1} = r | S_t = s, A_t = a}
Example robot
p(s′, r | s, a) = Pr{S_{t+1} = s′, R_{t+1} = r | S_t = s, A_t = a}
p(2, -1 | 1, right) = 1
p(4, -1 | 1, down) = 1
p(4, -1 | 1, up) = 0
(Diagram: transition graph over states 1-6 with action probabilities and rewards)
Markov Decision Processes
- Given any current state and action, s and a, together with any next state, s′, the expected value of the next reward is: r(s, a, s′) = E[R_{t+1} | S_t = s, A_t = a, S_{t+1} = s′]
Example robot
r(s, a, s′) = E[R_{t+1} | S_t = s, A_t = a, S_{t+1} = s′]
r(1, right, 2) = -1
r(1, down, 4) = -1
r(5, right, 6) = 2
(Diagram: transition graph over states 1-6 with action probabilities and rewards)
Value functions
- Value functions estimate how good it is for the agent to be in
a given state (state-value function) or how good it is to perform a certain action in a given state (action-value function)
- Value functions are defined with respect to particular policies
- The value of a state s under a policy π is the expected return when starting in s and following π thereafter: v_π(s) = E_π[G_t | S_t = s]
- v_π is called the state-value function for policy π
State-value function
Property of state-value function
- Bellman equation for vπ
- Expresses a relationship between the value of a state and
the value of its successor states
v_π(s) = E_π[G_t | S_t = s] = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [r + γ v_π(s′)]
Example state-value function
v_π(s) = E_π[G_t | S_t = s] = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [r + γ v_π(s′)]
(Diagram: three states 1, 2, 3; state 3 is terminal; γ = 1)
v_π(1) = 3 * 0.25 * (-1 + v_π(1)) + 0.25 * (-1 + v_π(2))
v_π(2) = 2 * 0.25 * (-1 + v_π(2)) + 0.25 * (-1 + v_π(1)) + 0.25 * (2 + v_π(3))
v_π(3) = 0
Solving this system gives v_π(1) = -9, v_π(2) = -5, v_π(3) = 0.
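With γ = 1 and the terminal value v_π(3) = 0, the two remaining Bellman equations form a linear system. A minimal NumPy sketch (the matrix rearrangement is an assumption of this sketch, not shown on the slides):

```python
import numpy as np

# Bellman equations of the three-state example (gamma = 1, v(3) = 0):
#   v1 = 0.75*(-1 + v1) + 0.25*(-1 + v2)
#   v2 = 0.50*(-1 + v2) + 0.25*(-1 + v1) + 0.25*(2 + 0)
# Moving the unknowns to the left-hand side gives A @ [v1, v2] = b:
A = np.array([[0.25, -0.25],
              [-0.25, 0.50]])
b = np.array([-1.0, -0.25])
v1, v2 = np.linalg.solve(A, b)
print(v1, v2)  # -9.0 -5.0
```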
Action-value function
- The expected return when taking action a in state s and thereafter following policy π
- q_π(s, a) = E_π[G_t | S_t = s, A_t = a]
- qπ is called the action-value function for policy π
Optimal policy
A policy π is better than or equal to a policy π′ if its state-value function is greater than or equal to that of π′ for all states:
- π ≥ π′ if and only if v_π(s) ≥ v_π′(s) for all s ∈ S
Optimal state-value function
- v*(s) = max_π v_π(s), for all s ∈ S
Optimal action-value function
- q*(s, a) = max_π q_π(s, a), for all s ∈ S and a ∈ A(s)
Bellman optimality equation
- Without a reference to any specific policy
Bellman optimality equation for v*
- v*(s) = max_{a ∈ A(s)} Σ_{s′,r} p(s′, r | s, a) [r + γ v*(s′)]
Bellman optimality equation for v*
v*(s) = max_{a ∈ A(s)} Σ_{s′,r} p(s′, r | s, a) [r + γ v*(s′)]
(Diagram: three states 1, 2, 3; state 3 is terminal; γ = 1; actions up, down, left, right)
v*(1) = max{ -1 + v*(1), -1 + v*(1), -1 + v*(1), -1 + v*(2) }
v*(2) = max{ -1 + v*(2), -1 + v*(2), -1 + v*(1), 2 + v*(3) }
v*(3) = 0
v*(1) = ?  v*(2) = ?
Bellman optimality equation
Bellman optimality equation for q*
- q*(s, a) = Σ_{s′,r} p(s′, r | s, a) [r + γ max_{a′} q*(s′, a′)]
Bellman optimality equation
- System of nonlinear equations, one for each state
- N states: there are N equations and N unknowns
- If we know p(s′, r | s, a) and r(s, a, s′), then in principle one can solve this system of equations
- If we have v*, it is relatively easy to determine an optimal policy
(Diagram: example grid of v* values and the corresponding optimal policy π*)
Assumptions for solving the Bellman optimality equation
- Markov property
- We know the dynamics of the environment
- We have enough computational resources to complete the
computation of the solution
- Problem: Long computational time
- Solution: Dynamic programming
Dynamic Programming
Collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process.
Problem of classic DP algorithms: they are of limited utility in reinforcement learning:
- Assumption of perfect model
- Great computational expense
Key Idea of Dynamic Programming
Goal: find an optimal policy. Problem: solve the Bellman optimality equation
v*(s) = max_a Σ_{s′,r} p(s′, r | s, a) [r + γ v*(s′)]
Solution methods:
- Direct search
- Linear programming
- Dynamic programming
Key Idea of Dynamic Programming
Key idea of DP (and of reinforcement learning in general): Use of value functions to organize and structure the search for good policies Dynamic programming approach: Introduce two concepts:
- Policy evaluation
- Policy improvement
Use those concepts to get an optimal policy
Assumptions
We always assume that the environment is a finite MDP, i.e:
- State, action, and reward sets S, A(s), and R, for s ∈ S, are finite
- Dynamics are given by a set of probabilities p(s′, r | s, a), for all s ∈ S, a ∈ A(s), r ∈ R, and s′ ∈ S⁺ (S⁺ is S plus a terminal state if the problem is episodic)
Policy Evaluation
How to compute the state-value function v_π for an arbitrary policy π. Recall the Bellman equation:
v_π(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [r + γ v_π(s′)]
Existence and uniqueness of v_π are guaranteed if:
- Either γ < 1 or
- Eventual termination is guaranteed from all states under policy π
Iterative Policy Evaluation
Consider iterative solution methods for the Bellman equation. Consider a sequence of approximate value functions v_0, v_1, v_2, ..., each mapping S⁺ to ℝ. The initial approximation v_0 is chosen arbitrarily (except that terminal states, if any, must be given value 0). Subsequently, use the Bellman equation for v_π as an update rule:
v_{k+1}(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [r + γ v_k(s′)]
for all s ∈ S.
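The update rule can be sketched on the 4x4 gridworld used in the following example. A sketch under these assumptions (not stated on the slides in this form): states are 0-indexed 0-15 rather than 1-16, the terminals are the corner states 0 and 15, every transition yields reward -1, and moves that would leave the grid keep the agent in place:

```python
import numpy as np

N, TERMINALS = 16, {0, 15}          # 4x4 grid, terminal corners
ACTIONS = ("up", "down", "left", "right")

def step(s, a):
    """Deterministic move; walking off the grid leaves the state unchanged."""
    r, c = divmod(s, 4)
    if a == "up":    r = max(r - 1, 0)
    if a == "down":  r = min(r + 1, 3)
    if a == "left":  c = max(c - 1, 0)
    if a == "right": c = min(c + 1, 3)
    return 4 * r + c

def policy_evaluation(policy, gamma=1.0, theta=1e-6):
    """Sweep v(s) <- sum_a pi(a|s) * [r + gamma * v(s')] until stable."""
    v = np.zeros(N)
    while True:
        delta = 0.0
        for s in range(N):
            if s in TERMINALS:
                continue
            new_v = sum(p * (-1.0 + gamma * v[step(s, a)])
                        for a, p in policy(s).items())
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

random_policy = lambda s: {a: 0.25 for a in ACTIONS}
v = policy_evaluation(random_policy)
print(np.round(v).reshape(4, 4))    # matches the v_pi grid from the slides (0, -14, -20, -22, ...)
```

Note the in-place updates: each sweep already uses the freshest available values, which typically converges faster than keeping two separate arrays.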
Iterative Policy Evaluation
v_{k+1}(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [r + γ v_k(s′)]
Convergence: one can show that the sequence {v_k} converges to v_π as k → ∞ under the same conditions that guarantee the existence of v_π, i.e.
- Either γ < 1 or
- Eventual termination is guaranteed from all states under policy π
Consider the robot example. Goal: reach the top-left or bottom-right corner (nonterminal states are S = {2, 3, ..., 15}).
(Diagram: 4x4 grid of states 1-16 with the four movement actions)
The reward is -1 for all transitions.
Example: Iterative Policy Evaluation
Recall: the initial approximation can be chosen arbitrarily (except for terminal states) → choose v_0(s) = 0 for all states s ∈ S⁺ = {1, 2, ..., 16}.
v_0 for the random policy:
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
Example: Iterative Policy Evaluation
Let's calculate v_1 for s = 6:
v_1(6) = Σ_{a ∈ {u,d,l,r}} π(a|6) Σ_{s′} p(s′ | 6, a) [r + γ v_0(s′)]
       = 0.25 * {(-1 + v_0(2)) + (-1 + v_0(10)) + (-1 + v_0(5)) + (-1 + v_0(7))}
       = 0.25 * {-1 - 1 - 1 - 1} = -1
(using π(a|6) = 0.25 for all a, r = -1, and v_0(s′) = 0 for all s′)
Analogously, for all nonterminal states s ∈ S: v_1(s) = -1.
Example: Iterative Policy Evaluation
For the terminal states 1 and 16 the process terminates, i.e. v_k(1) = v_k(16) = 0 for all k.
v_1 for the random policy:
 0.0 -1.0 -1.0 -1.0
-1.0 -1.0 -1.0 -1.0
-1.0 -1.0 -1.0 -1.0
-1.0 -1.0 -1.0  0.0
Example: Iterative Policy Evaluation
Let's calculate v_2 for s = 6 (γ = 1):
v_2(6) = Σ_{a ∈ {u,d,l,r}} π(a|6) Σ_{s′} p(s′ | 6, a) [r + γ v_1(s′)]
       = 0.25 * {(-1 - γ) + (-1 - γ) + (-1 - γ) + (-1 - γ)}
       = 0.25 * {-2 - 2 - 2 - 2} = -2
Example: Iterative Policy Evaluation
(Diagram: 4x4 grid with the states away from the terminals highlighted)
Analogously, we get v_2(s) = -2 for all highlighted states s.
Example: Iterative Policy Evaluation
Let's calculate v_2 for s = 2 (γ = 1):
v_2(2) = Σ_{a ∈ {u,d,l,r}} π(a|2) Σ_{s′} p(s′ | 2, a) [r + γ v_1(s′)]
       = 0.25 * {(-1 + v_1(2)) + (-1 + v_1(6)) + (-1 + v_1(1)) + (-1 + v_1(3))}
       = 0.25 * {-2 - 2 - 1 - 2} = -1.75
Example: Iterative Policy Evaluation
(Diagram: 4x4 grid with the states adjacent to a terminal highlighted)
Analogously, we get v_2(s) = -1.75 for all highlighted states s.
Example: Iterative Policy Evaluation
v_2 for the random policy:
 0.0 -1.7 -2.0 -2.0
-1.7 -2.0 -2.0 -2.0
-2.0 -2.0 -2.0 -1.7
-2.0 -2.0 -1.7  0.0
Example: Iterative Policy Evaluation
v_k for the random policy:

k = 0:
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0

k = 1:
 0.0 -1.0 -1.0 -1.0
-1.0 -1.0 -1.0 -1.0
-1.0 -1.0 -1.0 -1.0
-1.0 -1.0 -1.0  0.0

k = 2:
 0.0 -1.7 -2.0 -2.0
-1.7 -2.0 -2.0 -2.0
-2.0 -2.0 -2.0 -1.7
-2.0 -2.0 -1.7  0.0

k = 3:
 0.0 -2.4 -2.9 -3.0
-2.4 -2.9 -3.0 -2.9
-2.9 -3.0 -2.9 -2.4
-3.0 -2.9 -2.4  0.0

...

k = 10:
 0.0 -6.1 -8.4 -9.0
-6.1 -7.7 -8.4 -8.4
-8.4 -8.4 -7.7 -6.1
-9.0 -8.4 -6.1  0.0

...

k → ∞ (v_π):
 0.0 -14. -20. -22.
-14. -18. -20. -20.
-20. -20. -18. -14.
-22. -20. -14.  0.0
Policy Evaluation
Reason for computing the value function v_π for a policy π: → finding better policies → policy improvement
- Suppose we have determined the value function v_π for an arbitrary deterministic policy π
- Should we change the policy to deterministically choose an action a ≠ π(s) for some state s?
- What we know: how good it is to follow the current policy from s: v_π(s)
- What we want to know: would it be better or worse to change to the new policy?
Policy Improvement
Would it be better or worse to change to the new policy? (new policy: for some s choose action a ≠ π(s)) → Consider selecting a in s and thereafter following the existing policy π. The value of this way of behaving is:
q_π(s, a) = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s, A_t = a] = Σ_{s′,r} p(s′, r | s, a) [r + γ v_π(s′)]
→ If this is greater than v_π(s), i.e. if it is better to select a once in s and thereafter follow π than it would be to follow π all the time, then we would expect the new policy to be better than π.
Policy Improvement
Policy Improvement Theorem
Let π and π′ be any pair of deterministic policies such that, for all s ∈ S, q_π(s, π′(s)) ≥ v_π(s) (1). Then the policy π′ must be as good as, or better than, π. That is, it must obtain greater or equal expected return from all states s ∈ S: v_π′(s) ≥ v_π(s) (2). Moreover, if there is strict inequality of (1) at any state, then there must be strict inequality of (2) at at least one state.
For the situation before:
- Suppose we have a deterministic policy π, and a new policy π′ that equals π except for one state s, for which π′(s) = a ≠ π(s)
- Suppose q_π(s, a) ≥ v_π(s), i.e. (1) is satisfied
→ By the policy improvement theorem, π′ is as good as, or better than, π
Policy Improvement
Claim: q_π(s, π′(s)) ≥ v_π(s) (1) ⟹ v_π′(s) ≥ v_π(s)
Policy Improvement Theorem: Proof
Proof:
v_π(s) ≤ q_π(s, π′(s))
       = E[R_{t+1} + γ v_π(S_{t+1}) | S_t = s, A_t = π′(s)]
       = E_π′[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]
       ≤ E_π′[R_{t+1} + γ q_π(S_{t+1}, π′(S_{t+1})) | S_t = s]                      (by (1))
       = E_π′[R_{t+1} + γ E_π′[R_{t+2} + γ v_π(S_{t+2})] | S_t = s]
       = E_π′[R_{t+1} + γ R_{t+2} + γ² v_π(S_{t+2}) | S_t = s]
       ≤ E_π′[R_{t+1} + γ R_{t+2} + γ² q_π(S_{t+2}, π′(S_{t+2})) | S_t = s]          (by (1))
       = E_π′[R_{t+1} + γ R_{t+2} + γ² R_{t+3} + γ³ v_π(S_{t+3}) | S_t = s]
       ⋮
       ≤ E_π′[R_{t+1} + γ R_{t+2} + γ² R_{t+3} + γ³ R_{t+4} + ⋯ | S_t = s] = E_π′[G_t | S_t = s]
       = v_π′(s)
- What we have seen: given a (deterministic) policy and its value function, we can easily evaluate a change in the policy at a single state
- What if we allow changes at all states?
→ For a given (deterministic) policy π, select at each state s ∈ S the action that appears best according to q_π(s, a)
→ i.e., consider the new greedy policy π′, given by π′(s) = argmax_a q_π(s, a)   (3)
→ take the action that looks best in the short term, after one step of lookahead, according to v_π
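The greedy selection (3) is a one-step lookahead, which can be sketched on the gridworld (assumptions of this sketch: 0-indexed states 0-15, terminal corners 0 and 15, reward -1, moves off the grid leave the state unchanged; the v values below are the converged random-policy values from the example):

```python
# Greedy policy improvement on the 4x4 gridworld (states 0-15, terminals 0 and 15).
N, TERMINALS = 16, {0, 15}
ACTIONS = ("up", "down", "left", "right")

def step(s, a):
    """Deterministic move; walking off the grid leaves the state unchanged."""
    r, c = divmod(s, 4)
    if a == "up":    r = max(r - 1, 0)
    if a == "down":  r = min(r + 1, 3)
    if a == "left":  c = max(c - 1, 0)
    if a == "right": c = min(c + 1, 3)
    return 4 * r + c

def greedy(v, s, gamma=1.0):
    """All actions maximizing the one-step lookahead q(s, a) = r + gamma*v(s')."""
    q = {a: -1.0 + gamma * v[step(s, a)] for a in ACTIONS}
    best = max(q.values())
    return {a for a, qa in q.items() if qa == best}

# v_pi of the random policy (values from the slides, row by row):
v = [0, -14, -20, -22,
     -14, -18, -20, -20,
     -20, -20, -18, -14,
     -22, -20, -14, 0]
print(greedy(v, 1))   # {'left'}: head straight for the terminal corner
print(greedy(v, 5))   # both 'up' and 'left' tie (order may vary)
```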
Policy Improvement
By construction, the greedy policy π′ fulfills the condition q_π(s, π′(s)) ≥ v_π(s) → by the policy improvement theorem, the policy π′ is as good as, or better than, the original policy.
Policy Improvement
The process of making a new policy that improves on an original policy, by making it greedy with respect to the value function of the original policy, is called policy improvement.
Policy Improvement
Suppose the new greedy policy π′ is as good as, but not better than, the old policy π. Then v_π = v_π′, and from π′(s) = argmax_a q_π(s, a) it follows that for all s ∈ S:
v_π′(s) = max_a E[R_{t+1} + γ v_π′(S_{t+1}) | S_t = s, A_t = a] = max_a Σ_{s′,r} p(s′, r | s, a) [r + γ v_π′(s′)].
This is the Bellman optimality equation, and therefore v_π′ must be v*, and both π and π′ must be optimal policies.
Policy Improvement
- All the ideas of policy improvement extend to stochastic policies. (A stochastic policy π specifies probabilities π(a|s) for taking each action a in each state s.)
- In particular, the policy improvement theorem also holds for stochastic policies, under the natural definition: q_π(s, π′(s)) = Σ_a π′(a|s) q_π(s, a)
Example: Policy Improvement
Value function v_π of the random policy:
 0.0 -14. -20. -22.
-14. -18. -20. -20.
-20. -20. -18. -14.
-22. -20. -14.  0.0
Policy improvement: make the policy greedy with respect to v_π in every state.
Example: Policy Improvement
(Diagram: 4x4 grid showing the new greedy policy π′ as arrows)
New policy π′: is π′ a better policy than the random policy π?
Example: Policy Improvement
Let's calculate v_π′. For state 2, π′ deterministically chooses "left", into terminal state 1:
v_π′(2) = Σ_{a ∈ {left}} π′(a|2) Σ_{s′} p(s′ | 2, a) [-1 + v_π′(s′)]
        = -1 + v_π′(1) = -1
Example: Policy Improvement
For state 3, π′ chooses "left":
v_π′(3) = -1 + v_π′(2) = -1 - 1 = -2
Example: Policy Improvement
For state 6, π′ splits between "left" and "up" (probability 0.5 each):
v_π′(6) = 0.5 * {(-1 + v_π′(5)) + (-1 + v_π′(2))} = 0.5 * {-2 - 2} = -2
Example: Policy Improvement
For state 4, π′ splits between "left" and "down" (probability 0.5 each):
v_π′(4) = 0.5 * {(-1 + v_π′(3)) + (-1 + v_π′(8))} = 0.5 * {-3 - 3} = -3
Example: Policy Improvement
Value function v_π′ of the new policy:
 0.0 -1.0 -2.0 -3.0
-1.0 -2.0 -3.0 -2.0
-2.0 -3.0 -2.0 -1.0
-3.0 -2.0 -1.0  0.0
Value function v_π of the random policy:
 0.0 -14. -20. -22.
-14. -18. -20. -20.
-20. -20. -18. -14.
-22. -20. -14.  0.0
Since v_π(s) ≤ -14 for all nonterminal states and v_π′(s) ≥ -3 for all nonterminal states, clearly v_π′(s) ≥ v_π(s) for all s ∈ S → π′ is better than π.
Policy Iteration
Policy iteration is a way of finding an optimal policy: once a policy π has been improved using v_π to yield a better policy π′, we can compute v_π′ and improve it again to yield an even better policy π′′. Thus, we can obtain a sequence of monotonically improving policies and value functions:
π_0 →E v_π0 →I π_1 →E v_π1 →I π_2 →E ... →I π* →E v*
where →E denotes a policy evaluation and →I denotes a policy improvement.
Policy Iteration
π_0 →E v_π0 →I π_1 →E v_π1 →I π_2 →E ... →I π* →E v*
Because a finite MDP has only a finite number of policies, policy iteration has to converge to an optimal policy and optimal value function in a finite number of iterations.
Policy iteration often converges in surprisingly few iterations:
Example: Policy Iteration
Take the random policy as π_0 on the 4x4 grid (states 1-16).
π_0 →E v_π0:
 0.0 -14. -20. -22.
-14. -18. -20. -20.
-20. -20. -18. -14.
-22. -20. -14.  0.0
→I π_1 (Diagram: the improved greedy policy π_1 shown as arrows on the grid)
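The whole evaluation-improvement loop can be sketched on the gridworld (same assumptions as the earlier sketches: 0-indexed states, terminal corners, reward -1; γ = 0.9 is chosen here so that evaluating an arbitrary initial deterministic policy converges, and the strict-improvement test guarantees termination):

```python
import numpy as np

# Policy iteration on the 4x4 gridworld (states 0-15, terminals 0 and 15).
N, TERMINALS = 16, {0, 15}
ACTIONS = ("up", "down", "left", "right")

def step(s, a):
    """Deterministic move; walking off the grid leaves the state unchanged."""
    r, c = divmod(s, 4)
    if a == "up":    r = max(r - 1, 0)
    if a == "down":  r = min(r + 1, 3)
    if a == "left":  c = max(c - 1, 0)
    if a == "right": c = min(c + 1, 3)
    return 4 * r + c

def evaluate(policy, gamma, theta=1e-6):
    """Iterative policy evaluation of a deterministic policy (dict s -> a)."""
    v = np.zeros(N)
    while True:
        delta = 0.0
        for s in range(N):
            if s in TERMINALS:
                continue
            new_v = -1.0 + gamma * v[step(s, policy[s])]
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

def policy_iteration(gamma=0.9):
    policy = {s: "up" for s in range(N) if s not in TERMINALS}
    while True:
        v = evaluate(policy, gamma)                      # policy evaluation
        stable = True
        for s in policy:                                 # policy improvement
            best = max(ACTIONS, key=lambda a: -1.0 + gamma * v[step(s, a)])
            if -1.0 + gamma * v[step(s, best)] > -1.0 + gamma * v[step(s, policy[s])]:
                policy[s] = best
                stable = False
        if stable:
            return policy, v

policy, v = policy_iteration()
print(np.round(v, 2).reshape(4, 4))   # optimal values under gamma = 0.9
```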
Each of its iterations involves policy evaluation, which is itself an iterative process that may require multiple sweeps through the state set.
Policy Iteration: Drawback
Exact convergence to v_π occurs only in the limit of iterative policy evaluation. Do we really need exact convergence? → No. Value iteration: stop policy evaluation after just one sweep of the state set.
Value Iteration
Value iteration can be written as a simple backup operation that combines the policy improvement and truncated policy evaluation steps:
v_{k+1}(s) = max_a E[R_{t+1} + γ v_k(S_{t+1}) | S_t = s, A_t = a] = max_a Σ_{s′,r} p(s′, r | s, a) [r + γ v_k(s′)]   (4)
for all s ∈ S. The sequence {v_k} converges to v* under the same assumptions that guarantee the existence of v*, i.e.
- Either γ < 1 or
- Eventual termination is guaranteed from all states under the optimal policy
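A sketch of the backup (4) on the 4x4 gridworld (assumptions as in the earlier sketches: 0-indexed states, terminal corners 0 and 15, reward -1; γ = 1 is safe here because termination is guaranteed under the optimal policy):

```python
import numpy as np

# Value iteration on the 4x4 gridworld (states 0-15, terminals 0 and 15).
N, TERMINALS = 16, {0, 15}
ACTIONS = ("up", "down", "left", "right")

def step(s, a):
    """Deterministic move; walking off the grid leaves the state unchanged."""
    r, c = divmod(s, 4)
    if a == "up":    r = max(r - 1, 0)
    if a == "down":  r = min(r + 1, 3)
    if a == "left":  c = max(c - 1, 0)
    if a == "right": c = min(c + 1, 3)
    return 4 * r + c

def value_iteration(gamma=1.0, theta=1e-6):
    """v(s) <- max_a [r + gamma * v(s')], swept until stable."""
    v = np.zeros(N)
    while True:
        delta = 0.0
        for s in range(N):
            if s in TERMINALS:
                continue
            new_v = max(-1.0 + gamma * v[step(s, a)] for a in ACTIONS)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

v = value_iteration()
print(v.reshape(4, 4))
# v*(s) is minus the number of steps to the nearest terminal corner:
#  0 -1 -2 -3
# -1 -2 -3 -2
# -2 -3 -2 -1
# -3 -2 -1  0
```

Compared with policy iteration, no explicit policy is maintained: the max over actions plays the role of the improvement step inside every evaluation sweep.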
Asynchronous Dynamic Programming
Major drawback of the DP methods discussed so far: → they require sweeps over the whole state set → if the state set is very large, a single sweep is already prohibitively expensive. «Solution»: asynchronous DP algorithms.
Asynchronous DP algorithms:
- Are iterative DP algorithms that are not organized in terms of
systematic sweeps of the state set.
- Back up the values of states in any order whatsoever, using whatever values of other states happen to be available.
- Must continue to back up the values of all the states to
converge correctly (can’t ignore any state after some point in the computation).
Asynchronous Dynamic Programming
Version of asynchronous value iteration: on each step k it backs up the value of only one state s_k, using the value iteration backup:
v_{k+1}(s_k) = max_a E[R_{t+1} + γ v_k(S_{t+1}) | S_t = s_k, A_t = a]
If 0 ≤ γ < 1, convergence to v* is guaranteed given only that all states occur in the sequence {s_k} infinitely often.
Asynchronous algorithms make it easier to intermix computation with real-time interaction: to solve a given MDP, we can run an iterative DP algorithm at the same time that an agent is actually experiencing the MDP → experience can be used to determine the states to which the DP algorithm applies its backups. At the same time, the latest value and policy information from the algorithm can guide the agent's decision-making.
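A sketch of the single-state backup, choosing the state to back up at random on each step (gridworld assumptions as before; the slides' guarantee assumes 0 ≤ γ < 1, but γ = 1 still works here because the task is episodic under the greedy policy):

```python
import random
import numpy as np

# Asynchronous value iteration on the 4x4 gridworld: back up one
# randomly chosen state per step instead of sweeping the whole state set.
N, TERMINALS = 16, {0, 15}
ACTIONS = ("up", "down", "left", "right")

def step(s, a):
    """Deterministic move; walking off the grid leaves the state unchanged."""
    r, c = divmod(s, 4)
    if a == "up":    r = max(r - 1, 0)
    if a == "down":  r = min(r + 1, 3)
    if a == "left":  c = max(c - 1, 0)
    if a == "right": c = min(c + 1, 3)
    return 4 * r + c

random.seed(0)
v = np.zeros(N)
nonterminal = [s for s in range(N) if s not in TERMINALS]
for _ in range(5000):                  # every state keeps being revisited
    s = random.choice(nonterminal)     # any order of states is allowed
    v[s] = max(-1.0 + v[step(s, a)] for a in ACTIONS)   # gamma = 1 here

print(v.reshape(4, 4))   # converges to the same v* as synchronous value iteration
```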
Generalized Policy Iteration
Policy iteration consists of two interacting processes:
- Policy evaluation: making value function consistent with the
current policy
- Policy improvement: making the policy greedy w.r.t. the
current value function
Generalized policy iteration (GPI) refers to the general idea of letting policy evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes.
Generalized Policy Iteration
Interacting processes: Policy evaluation & policy improvement
- In policy iteration, these two processes alternate, each completing before the other begins.
- In value iteration, only one iteration of policy evaluation is
performed in between each policy improvement.
- In asynchronous DP methods, the evaluation and
improvement processes are interleaved at an even finer grain. As long as both processes continue to update all states, the ultimate result is typically the same: convergence to optimal value function and an optimal policy.
Generalized Policy Iteration
Almost all reinforcement learning methods are well described as GPI.
Generalized Policy Iteration
It is easy to see that if both the evaluation process and the improvement process stabilize, then the value function and policy must be optimal:
- The value function stabilizes only when it is consistent with the current policy
- The policy stabilizes only when it is greedy w.r.t. the current value function
→ Both processes stabilize only when a policy has been found that is greedy w.r.t. its own value function → the Bellman optimality equation holds → the policy and value function are optimal
Generalized Policy Iteration
Evaluation and improvement processes in GPI: Both competing and cooperating
Generalized Policy Iteration
The two processes pull in opposing directions, yet interact to find the optimal solution.
Efficiency of Dynamic Programming
A DP method is guaranteed to find an optimal policy in polynomial time, even though the total number of (deterministic) policies is m^n
- n = number of states
- m = number of actions
→ DP is exponentially faster than any direct search in policy space could be
- In practice, DP methods can be used with today’s computers
to solve MDPs with millions of states.
- Both policy and value iteration are widely used, and it is not
clear which, if either, is better in general.
- In practice, these methods usually converge much faster than
their theoretical worst-case run times.
- On problems with large state spaces, asynchronous DP methods are often preferred.