SLIDE 1 DS595/CS525 Reinforcement Learning
Welcome to
Time: 6:00pm – 8:50pm R, Zoom Lecture, Fall 2020
This lecture will be recorded!!!
SLIDE 2 Last lecture
v Reinforcement Learning Components
§ Model, Value function, Policy
v Model-based Control
§ Policy Evaluation, Policy Iteration, Value Iteration
v Project 1 description.
SLIDE 3 Quiz 1 Week 4 (9/24 R)
v Model-based Control
§ Policy Evaluation, Policy Iteration, Value Iteration
§ 20 min at the beginning
- You can start as early as 5:55PM and finish as late as 6:20PM. The quiz duration is 20 minutes.
§ Log in to the class Zoom so you can ask questions about the quiz in the Zoom chat box.
Project 1 due Week 4 (9/24 R)
SLIDE 4 This lecture
v Markov Process (Markov Chain), Markov Reward Process, and Markov Decision Process
§ MP, MRP, MDP, POMDP
v Review: Model-based control
§ Policy Iteration, and Value iteration
v Model-Free Policy Evaluation
§ Monte Carlo policy evaluation
§ Temporal-difference (TD) policy evaluation
SLIDE 5
Example: Taxi passenger-seeking task as a decision-making process
States: locations of the taxi (s1, ..., s6) on the road
Actions: Left or Right
Rewards: +1 in state s1, +3 in state s5, 0 in all other states
[Figure: road segment with states s1–s6 in a row]
SLIDE 6
RL components
v Often include one or more of
§ Model: representation of how the world changes in response to the agent's actions
§ Policy: function mapping the agent's states to actions
§ Value function: future rewards from being in a state and/or taking an action when following a particular policy
SLIDE 7
RL components: (1) Model
v Agent's representation of how the world changes in response to the agent's actions, with two parts:
Transition model predicts next agent state
p(s_{t+1} = s' | s_t = s, a_t = a)
Reward model predicts immediate reward r(s,a)
SLIDE 8 RL components: (2)Policy
v Policy π determines how the agent chooses
actions
§ π : S → A, mapping from states to actions
v Deterministic policy:
§ π(s) = a
§ In other words,
- π(a|s) = 1,
- π(a'|s) = π(a''|s) = 0
v Stochastic policy:
§ π(a|s) = Pr(a_t = a | s_t = s)
[Figure: actions a, a', a'' available from state s, for deterministic vs. stochastic policies]
SLIDE 9 RL components: (3)Value Function
v Value function V^π: expected discounted sum of future rewards under a particular policy π
v Discount factor γ weighs immediate vs. future rewards
v Can be used to quantify goodness/badness of states and actions
v And decide how to act by comparing policies
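Written out, a standard formulation of the return and value function (the slide's own equation is not in the transcript, so treat the exact notation as an assumption):

```latex
% Return from time t, with discount factor 0 <= gamma <= 1:
G_t = r_t + \gamma\, r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}
% State-value function of policy pi:
V^{\pi}(s) = \mathbb{E}_{\pi}\left[\, G_t \mid s_t = s \,\right]
```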
SLIDE 10
RL agents and algorithms
v Model-based: explicit model
v Model-free: no model
SLIDE 11 Find a good policy: Problem settings
Model-based control
v (Agent's internal computation)
§ Given model of how the world works
§ Dynamics and reward model
§ Algorithm computes how to act in order to maximize expected reward
Model-free control
v Computing while interacting with environment
§ Agent doesn't know how world works
§ Interacts with world to implicitly/explicitly learn how world works
§ Agent improves policy (may involve planning)
SLIDE 12 Find a good policy: Problem settings
Model-based control
v (Agent's internal computation)
§ Frozen Lake (Project 1)
§ Know all rules of the game / perfect model
§ Dynamic programming, tree search
Model-free control
v Computing while interacting with environment
§ Taxi passenger-seeking problem
§ Demand/traffic dynamics are uncertain
§ Huge state space
[Figure: candidate taxi routes Path 1, Path 2, Path 3]
SLIDE 13 Find a good policy: Problem settings
Model-based control
v Given: MDP
§ S, A, P, R, γ
v Output:
§ π
Model-free control
v Given: MDP without P and R
§ S, A, γ
v Unknown:
§ P, R
v Output:
§ π
SLIDE 14 This lecture
v Markov Process (Markov Chain), Markov Reward Process, and Markov Decision Process
§ MP, MRP, MDP, POMDP
v Review: Model-based control
§ Policy Iteration, and Value iteration
v Model-Free Policy Evaluation
§ Monte Carlo policy evaluation
§ Temporal-difference (TD) policy evaluation
SLIDE 15 MP, MRP, and MDP
v Markov Process (Markov Chain)
v Markov Reward Process
v Markov Decision Process
SLIDE 16 Random Walks on Graphs
Examples: random walk sampling, random walk routing, influence diffusion, a molecule in liquid
SLIDE 17
Undirected Graphs
[Figure: undirected graph with nodes 1–6]
Undirected !!
SLIDE 18 Random Walk
v Adjacency matrix
v Transition probability matrix
v |E|: number of links
v Stationary distribution
[Figure: undirected graph on nodes 1–4 with degrees d = (3, 2, 3, 2)]
D = diag(3, 2, 3, 2)
A (symmetric, undirected) =
  [0 1 1 1]
  [1 0 1 0]
  [1 1 0 1]
  [1 0 1 0]
P = A · D^{-1} =
  [ 0   1/2  1/3  1/2]
  [1/3   0   1/3   0 ]
  [1/3  1/2   0   1/2]
  [1/3   0   1/3   0 ]
P_ij = A_ij / k_j  (1/k_j if (i, j) ∈ E, 0 otherwise)
Stationary distribution: π_i = d_i / (2|E|)
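As a quick check of these quantities, here is a minimal NumPy sketch. The edge set is inferred from the degree sequence d = (3, 2, 3, 2) shown in D, so treat the exact graph as an assumption about the slide's figure.

```python
import numpy as np

# Hypothetical 4-node undirected graph consistent with degrees d = (3, 2, 3, 2)
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

d = A.sum(axis=1)                  # node degrees
P = A @ np.diag(1.0 / d)           # column-stochastic transition matrix, P = A D^{-1}

# Stationary distribution of a random walk on an undirected graph: pi_i = d_i / (2|E|)
pi = d / A.sum()                   # A.sum() equals 2|E|
print(np.allclose(P @ pi, pi))     # True: pi is a fixed point of P
```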
SLIDE 19
SLIDE 20 A random walker: Markov Chain / Markov Process
[Figure: chain of states s1–s6; self-loop probability 0.7 at s1 and s6, 0.4 at s2–s5; probability 0.3 to each neighboring state]
SLIDE 21 A random walker: Markov Chain / Markov Process
[Figure: state chain s1–s6 with transition probabilities (as above)]
Distribution update after one step: s0 · P = s1 (here s0 and s1 denote the state distributions at times 0 and 1)
SLIDE 22 Taxi passenger-seeking task: Markov Process --- Episodes
[Figure: state chain s1–s6 with transition probabilities (as above)]
Example: sample episodes starting from s3:
v s3, s2, s2, s2, s1, s1, ...
v s3, s3, s4, s5, s6, s6, ...
v s3, s4, s5, s4, ...
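A minimal sketch of how such episodes can be sampled. The transition matrix below is reconstructed to match the probabilities in the figure (self-loops 0.7 at the ends, 0.4 in the middle, 0.3 to each neighbor); treat the exact values as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
states = ["s1", "s2", "s3", "s4", "s5", "s6"]

# Row-stochastic transition matrix assumed from the figure
P = np.array([[0.7, 0.3, 0.0, 0.0, 0.0, 0.0],
              [0.3, 0.4, 0.3, 0.0, 0.0, 0.0],
              [0.0, 0.3, 0.4, 0.3, 0.0, 0.0],
              [0.0, 0.0, 0.3, 0.4, 0.3, 0.0],
              [0.0, 0.0, 0.0, 0.3, 0.4, 0.3],
              [0.0, 0.0, 0.0, 0.0, 0.3, 0.7]])

def sample_episode(start=2, length=6):
    """Sample a trajectory from the Markov chain, starting at state index `start` (s3)."""
    s = start
    episode = [states[s]]
    for _ in range(length - 1):
        s = rng.choice(len(states), p=P[s])   # draw the next state from row s of P
        episode.append(states[s])
    return episode

print(sample_episode())   # e.g. ['s3', 's4', 's5', 's4', ...]
```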
SLIDE 23 MP, MRP, and MDP
v Markov Process (Markov Chain)
v Markov Reward Process
v Markov Decision Process
SLIDE 24
SLIDE 25 A random walker + rewards: Markov Reward Process (MRP)
[Figure: state chain s1–s6 with transition probabilities (as above)]
v Reward: +1 in s1, +3 in s5, 0 in all other states.
SLIDE 26
SLIDE 27
SLIDE 28 A random walker + rewards: Markov Reward Process
[Figure: state chain s1–s6 with transition probabilities (as above)]
v Reward: +1 in s1, +3 in s5, 0 in all other states
v Sample returns for sample 4-step episodes, γ = 1/2
v Episode s3 (t=1), s4 (t=2), s5 (t=3), s5 (t=4): G1 = ? G3 = ?
SLIDE 29 A random walker + rewards: Markov Reward Process
[Figure: state chain s1–s6 with transition probabilities (as above)]
v Reward: +1 in s1, +3 in s5, 0 in all other states
v Sample returns for sample 4-step episodes, γ = 1/2
v s3, s4, s5, s6: G1 = ?
v s3, s3, s4, s3: G1 = ?
v s3, s2, s1, s1: G1 = ?
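A short sketch of the return computation for these episodes, assuming the reward is collected at the state visited at each step (the exact indexing convention should follow the lecture's definition of G):

```python
def discounted_return(rewards, gamma=0.5):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# s3, s4, s5, s6 -> rewards [0, 0, 3, 0]
print(discounted_return([0, 0, 3, 0]))   # 0.75
# s3, s3, s4, s3 -> rewards [0, 0, 0, 0]
print(discounted_return([0, 0, 0, 0]))   # 0.0
# s3, s2, s1, s1 -> rewards [0, 0, 1, 1]
print(discounted_return([0, 0, 1, 1]))   # 0.375
```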
SLIDE 30
Samples:
v s3, s4, s5, s6, ...
v s3, s3, s4, s3, ...
v s3, s2, s1, s1, ...
v ...
[Figure: sampled paths Path 1, Path 2, Path 3 through the state chain]
SLIDE 31 Return vs. Value function
Samples:
v s3, s4, s5, s6, ...
v s3, s3, s4, s3, ...
v ...
[Figure: sampled paths Path 1, Path 2, Path 3 through the state chain; each sample path yields one return]
SLIDE 32
SLIDE 33
SLIDE 34
SLIDE 35
SLIDE 36 MP, MRP, and MDP
v Markov Process (Markov Chain)
v Markov Reward Process
v Markov Decision Process
SLIDE 37
SLIDE 38 Taxi passenger-seeking task: Markov Decision Process (MDP)
[Figure: state chain s1–s6; from each state the agent can take action a1 or a2]
Deterministic transition model
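A minimal sketch of this MDP in code. The mapping of a1/a2 to Left/Right, the boundary behavior, and the reward timing are assumptions for illustration, not the lecture's exact specification.

```python
# Hypothetical encoding of the taxi MDP with a deterministic transition model.
STATES = ["s1", "s2", "s3", "s4", "s5", "s6"]
ACTIONS = ["Left", "Right"]            # assumed to correspond to a1 / a2
REWARD = {"s1": 1, "s5": 3}            # +1 in s1, +3 in s5, 0 elsewhere

def step(state, action):
    """Deterministic transition: move one state left or right, staying put at the ends."""
    i = STATES.index(state)
    i = max(i - 1, 0) if action == "Left" else min(i + 1, len(STATES) - 1)
    next_state = STATES[i]
    return next_state, REWARD.get(next_state, 0)

print(step("s4", "Right"))             # ('s5', 3)
```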
SLIDE 39 This lecture
v Markov Process (Markov Chain), Markov Reward Process, and Markov Decision Process
§ MP, MRP, MDP, POMDP
v Review:
§ Policy Iteration, and Value iteration
v Model-Free Policy Evaluation
§ Monte Carlo policy evaluation
§ Temporal-difference (TD) policy evaluation
SLIDE 40
SLIDE 41
SLIDE 42
For deterministic policy:
SLIDE 43 For deterministic and stochastic policy:
From: Reinforcement Learning: An Introduction, Sutton and Barto, 2nd Edition
SLIDE 44
SLIDE 45
SLIDE 46
SLIDE 47 (All-in-one algorithm)
From: Reinforcement Learning: An Introduction, Sutton and Barto, 2nd Edition
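The pseudocode on this slide is taken from Sutton and Barto and is not in the transcript. The "all-in-one" description suggests value iteration (policy evaluation and improvement folded into one backup); here is a hedged NumPy sketch under that assumption. The array layout (P with shape (A, S, S), R with shape (S, A)) is chosen only for illustration.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-6):
    """Value iteration for a finite MDP (a sketch, not the textbook's exact pseudocode).

    P : np.ndarray, shape (A, S, S), P[a, s, s2] = probability of s2 given state s, action a
    R : np.ndarray, shape (S, A),    R[s, a]     = expected immediate reward
    Returns the value estimate V and a greedy (deterministic) policy.
    """
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s2 P(s2|s,a) * V(s2)
        Q = R + gamma * (P @ V).T          # (P @ V) has shape (A, S); transpose to (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new, Q.argmax(axis=1)
        V = V_new
```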
SLIDE 48
Deterministic policy
SLIDE 49 From: Reinforcement Learning: An Introduction, Sutton and Barto, 2nd Edition
SLIDE 50 This lecture
v Markov Process (Markov Chain), Markov Reward Process, and Markov Decision Process
§ MP, MRP, MDP, POMDP
v Review:
§ Policy Iteration, and Value iteration
v Model-Free Policy Evaluation
§ Monte Carlo policy evaluation
§ Temporal-difference (TD) policy evaluation
SLIDE 51
Review of Dynamic Programming for policy evaluation (model-based)
equivalently,
[Backup diagram: state, action, next states]
V_k^π(s) = E_π[ r + γ V_{k-1}^π(s') ]
SLIDE 52
Review of Dynamic Programming for policy evaluation (model-based)
v Bootstrapping: update for V uses an estimate
v Known model P(s'|s,a) and r(s,a)
[Backup diagram: state, action, next states — bootstrapping]
V_k^π(s) = E_π[ r + γ V_{k-1}^π(s') ]
SLIDE 53
Review of Dynamic Programming for policy evaluation (model-based)
v Requires model of MDP P(s’|s,a) and r(s,a)
v Bootstraps future return using value estimate
v Requires Markov assumption: bootstrapping regardless of history
[Backup diagram: state, action, next states — bootstrapping]
V_k^π(s) = E_π[ r + γ V_{k-1}^π(s') ]
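A minimal NumPy sketch of this update (iterative policy evaluation with a known model). The array layout and the deterministic-policy representation are assumptions for illustration.

```python
import numpy as np

def dp_policy_evaluation(P, R, pi, gamma=0.9, theta=1e-6):
    """Iterative policy evaluation: V_k(s) = E_pi[ r + gamma * V_{k-1}(s') ].

    P : np.ndarray, shape (A, S, S); R : np.ndarray, shape (S, A)
    pi: np.ndarray, shape (S,), a deterministic policy giving the action index for each state
    """
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        # One synchronous sweep: back up every state using the model and the current estimate V
        V_new = np.array([R[s, pi[s]] + gamma * P[pi[s], s] @ V for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new
```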
SLIDE 54
Model-free Policy Evaluation
v What if we don't know the dynamics model P and/or the reward model R?
v Today: policy evaluation without a model
v Given data and/or the ability to interact in the environment, efficiently compute a good estimate of the value of a policy π
SLIDE 55
Model-free Policy Evaluation
v Monte Carlo (MC) policy evaluation
§ First-visit based
§ Every-visit based
v Temporal Difference (TD)
§ TD(0)
v Metrics to evaluate and compare algorithms
SLIDE 56 Monte Carlo (MC) policy evaluation
v Return of a trajectory under policy π
v Value function:
§ Expectation over trajectories T generated by following π
v Simple idea: Value = mean return
§ Sample a set of trajectories and average the returns
[Figure: from state s, sampled returns G1(s), G2(s), G3(s)]
SLIDE 57
SLIDE 58
SLIDE 59
For example: s1, a1, r1, s2, a2, r2, s2, a3, r3, … …
SLIDE 60 Bias, Variance, MSE
v Biased vs unbiased estimator
§ Whether the bias is zero
v Consistent vs inconsistent estimator
§ Whether the estimator converges to the ground truth as n goes to infinity
SLIDE 61
SLIDE 62
For example: s1, a1, r1, s2, a2, r2, s2, a3, r3, … …
SLIDE 63
For example: s1, a1, r1, s2, a2, r2, s2, a3, r3, … …
SLIDE 64
For example: s1, a1, r1, s2, a2, r2, s2, a3, r3, … …
SLIDE 65
For example: s1, a1, r1, s2, a2, r2, s2, a3, r3, … …
?
How about α = 1?
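The equations on these slides are not in the transcript; the question about α refers to the standard incremental form of the Monte Carlo update, which can be written as follows (a standard formulation, stated here as an assumption about the missing slide content):

```latex
% After observing return G_{i,t} from state s in episode i:
N(s) \leftarrow N(s) + 1, \qquad
V(s) \leftarrow V(s) + \alpha \,\big( G_{i,t} - V(s) \big)
% \alpha = 1/N(s): exact running average of all observed returns
% constant \alpha < 1: exponentially weighted average (recent returns count more)
% \alpha = 1: V(s) is simply replaced by the most recent return
```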
SLIDE 66 MC on policy evaluation
[Figure: state chain s1–s6 with actions a1 and a2]
Taxi passenger-seeking process: R = [1, 0, 0, 0, 3, 0] for any action
π(s) = a1, ∀s; γ = 1; any action from s1 or s6 terminates the episode
Given episode (s3, a1, 0, s3, a1, 0, s2, a1, 0, s1, a1, 1, T):
Q1: First-visit MC estimate of V for each state?
Q2: Every-visit MC estimate of V(s2)?
SLIDE 67 Example: MC on policy evaluation
[Figure: state chain s1–s6 with actions a1 and a2]
Taxi passenger-seeking process: R = [1, 0, 0, 0, 3, 0] for any action
π(s) = a1, ∀s; γ = 1; any action from s1 or s6 terminates the episode
Given episode (s3, a1, 0, s3, a1, 0, s2, a1, 0, s1, a1, 1, T):
Q1: First-visit MC estimate of V for each state? V = [1, 1, 1, 0, 0, 0]
Q2: Every-visit MC estimate of V(s2)? V(s2) = 1
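A small sketch that reproduces these numbers (γ = 1; the rewards are read off the (state, action, reward) triples of the given episode):

```python
# First-visit and every-visit MC estimates for the episode on this slide.
episode = [("s3", 0), ("s3", 0), ("s2", 0), ("s1", 1)]     # (state, reward) pairs

# Return G_t from each time step (gamma = 1, so just the sum of remaining rewards)
returns_from = [sum(r for _, r in episode[t:]) for t in range(len(episode))]

first_visit, every_visit = {}, {}
for t, (s, _) in enumerate(episode):
    every_visit.setdefault(s, []).append(returns_from[t])   # record every visit
    if s not in first_visit:
        first_visit[s] = [returns_from[t]]                   # record only the first visit

print({s: sum(g) / len(g) for s, g in first_visit.items()})  # {'s3': 1.0, 's2': 1.0, 's1': 1.0}
print({s: sum(g) / len(g) for s, g in every_visit.items()})  # s3 averages the returns of both visits
```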
SLIDE 68
MC policy evaluation
v MC updates the value estimate using a sample of the return to approximate an expectation
[Backup diagram: full trajectory from the current state to the terminal state T]
SLIDE 69
MC policy evaluation limitations
v Generally high variance
§ Reducing variance can require a lot of data
v Requires episodic settings
§ Episode must end before data from that episode can be used to update the value function
[Backup diagram: full trajectory from the current state to the terminal state T]
SLIDE 70
Model-free Policy Evaluation
v Monte Carlo (MC) policy evaluation
§ First-visit based
§ Every-visit based
v Temporal Difference (TD)
§ TD(0)
§ Combination of MC and Dynamic Programming
“If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning.” – Sutton and Barto 2017
SLIDE 71
MC + DP = TD
v Dynamic Programming (DP) policy evaluation
v Monte Carlo (MC) policy evaluation
v Temporal Difference (TD)
Rewritten as
V_k^π(s) = E_π[ r + γ V_{k-1}^π(s') ]
SLIDE 73
MC + DP = TD
v Can be rewritten as
SLIDE 74 Example: TD policy evaluation
[Figure: state chain s1–s6 with actions a1 and a2]
Taxi passenger-seeking process: R = [1, 0, 0, 0, 3, 0] for any action
π(s) = a1, ∀s; γ = 1; any action from s1 or s6 terminates the episode
Given episode (s3, a1, 0, s3, a1, 0, s2, a1, 0, s1, a1, 1, T):
Q1: First-visit MC estimate of V for each state? V = [1, 1, 1, 0, 0, 0]
Q2: Every-visit MC estimate of V(s2)? V(s2) = 1
Q3: TD estimate of all states (initialized at 0) with α = 1?
SLIDE 75
TD(0) policy evaluation
v TD updates the value estimate using a sample of s_{t+1} to approximate the expectation
v TD updates the value estimate by bootstrapping, using the estimate of V(s_{t+1})
[Backup diagram: one-step transition; T, terminal state]
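A minimal sketch of the TD(0) backup, replayed on the episode from the earlier example (α = 1, γ = 1, values initialized at 0). The terminal-state handling (V(T) = 0) is an assumption.

```python
def td0_update(V, s, r, s_next, alpha=0.5, gamma=1.0):
    """One TD(0) backup: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    td_target = r + gamma * V[s_next]
    V[s] = V[s] + alpha * (td_target - V[s])
    return V

# Episode (s3, 0, s3), (s3, 0, s2), (s2, 0, s1), (s1, 1, T), with alpha = 1, gamma = 1
V = {s: 0.0 for s in ["s1", "s2", "s3", "s4", "s5", "s6", "T"]}
for (s, r, s_next) in [("s3", 0, "s3"), ("s3", 0, "s2"), ("s2", 0, "s1"), ("s1", 1, "T")]:
    td0_update(V, s, r, s_next, alpha=1.0)
print(V)   # only s1 is updated to 1; the other states keep their bootstrapped value of 0
```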
SLIDE 76
Policy evaluation: comparison
                            DP    MC    TD
Model-free method           No    Yes   Yes
Handles non-episodic case   Yes   No    Yes
No Markovian assumption     No    Yes   No
Consistent estimator        Yes   Yes   Yes
Unbiased estimator          --    Yes   No
(First-visit MC is unbiased; every-visit MC is biased but consistent.)
SLIDE 77 Next Lecture
v Markov Process (Markov Chain), Markov Reward Process, and Markov Decision Process
§ MP, MRP, MDP, POMDP
v Review
§ Policy Iteration and Value Iteration
v Model-Free Policy Evaluation
§ Monte Carlo policy evaluation
§ Temporal-difference (TD) policy evaluation
v Model-Free Control
§ Monte Carlo control
§ Temporal-difference (TD) control
§ SARSA
§ Q-learning control
SLIDE 78
Any Comments & Critiques?