SLIDE 1

Welcome to DS595/CS525 Reinforcement Learning

  • Prof. Yanhua Li
  • Time: 6:00pm – 8:50pm, Thursdays (R), Zoom lecture, Fall 2020
  • This lecture will be recorded!!!

SLIDE 2

Last Lecture

  • Model-Free Control
      • Generalized policy iteration
      • Control with exploration
      • Monte Carlo (MC) policy iteration
      • Temporal-Difference (TD) policy iteration: SARSA, Q-Learning
      • Maximization bias and Double-Q-Learning
      • Project 2 description

SLIDES 3-4

This Lecture

  • Value Function Approximation (VFA)
      • Introduction
      • VFA for Policy Evaluation
      • VFA for Control

SLIDE 5

RL algorithms

  • Model-based control
      • Policy evaluation (DP)
      • Policy iteration
      • Value iteration
  • Model-Free Control
      • Policy evaluation: MC (first/every visit) and TD
      • Value/Policy iteration
          • MC iteration
          • TD iteration: SARSA, Q-Learning, Double-Q-Learning
  • Value function representation: from tabular to function representation
  • Combining policy function approximation with value function approximation: (Asynchronous) Advantage Actor-Critic (A2C, A3C)

SLIDES 6-7

Value function representations

  • Many problems have enormous state and/or action spaces, so a tabular representation of the value function is insufficient.
  • Tabular representation: Vπ can be viewed as a vector with one entry Vπ(s) per state s, i.e., a mapping S → Vπ of |S| dimensions.
  • Function representation: Vπ(s; w), parameterized by a vector w with k dimensions, where k << |S|.

SLIDE 8

Value Function Approximation (VFA)

  • Represent a (state-action/state) value function with a parameterized function instead of a table:
      • state value: Vπ(s) represented as Vπ(s; w), a function of the state s and parameters w
      • state-action value: Qπ(s,a) represented as Qπ(s,a; w), a function of the state s, the action a, and parameters w
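To make the parameterization concrete, here is a minimal sketch of a linear state-value approximator in Python/NumPy (my own illustration, not from the slides; the feature map featurize and the dimension k are assumptions):

```python
import numpy as np

def featurize(s, k=8):
    """Illustrative feature map x(s): any k-dimensional feature vector works."""
    x = np.zeros(k)
    x[s % k] = 1.0
    return x

def v_hat(s, w):
    """Linear VFA: V(s; w) = x(s)^T w, with k parameters regardless of |S|."""
    return featurize(s, w.shape[0]) @ w

w = np.ones(8)       # k = 8 parameters
print(v_hat(3, w))   # approximate value of state 3
```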

SLIDES 9-10

Why VFA? Benefits of VFA?

  • The state and/or action space can be huge, making a tabular representation impossible.
  • We want a more compact representation that generalizes across states, or across states and actions.

SLIDE 11

Benefits of Generalization via VFA

  • Huge state and/or action spaces: VFA reduces the memory needed.
  • A more compact representation generalizes across states or state-action pairs.
      • Advantage of tabular: it stores the exact value of each s or (s,a).
  • A trade-off: representational capacity vs. (computational and space) efficiency, with w of k dimensions, k << |S|.

SLIDES 12-13

What function for VFA?

  • What function approximator should we use for Vπ(s; w) and Qπ(s,a; w)?
  • Many possible function approximators, including:
      • Linear combinations of features
      • Neural networks
      • Decision trees
      • Nearest neighbors, and more
  • In this class we will focus on function approximators that are differentiable, because differentiability lets us optimize the parameters w by gradient descent.
  • Two very popular classes of differentiable function approximators:
      • Linear feature representations
      • Neural networks (Deep Reinforcement Learning)
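As a hedged illustration of why differentiability matters (my own sketch, not from the slides): both popular classes expose a gradient of the value estimate with respect to w, which the updates later in this lecture rely on. The feature vector and layer sizes below are arbitrary assumptions.

```python
import numpy as np

def v_linear(x, w):
    """Linear VFA: V = x^T w; the gradient w.r.t. w is simply x."""
    return x @ w, x

def v_mlp(x, W1, w2):
    """Tiny one-hidden-layer network: V = w2^T tanh(W1 x); also differentiable."""
    h = np.tanh(W1 @ x)
    v = w2 @ h
    grad_w2 = h                             # dV/dw2
    grad_W1 = np.outer((1 - h**2) * w2, x)  # dV/dW1 by the chain rule
    return v, grad_W1, grad_w2

x = np.array([1.0, 0.5])
W1 = 0.1 * np.ones((3, 2))
w2 = np.ones(3)
print(v_linear(x, np.ones(2))[0], v_mlp(x, W1, w2)[0])
```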

SLIDE 14

This Lecture

  • Value Function Approximation (VFA)
      • Introduction
      • VFA for Policy Evaluation
      • VFA for Control

SLIDES 15-16

Review: Gradient Descent

  • Consider a function J(w) that is a differentiable function of a parameter vector w.
  • The goal is to find the parameter vector w that minimizes J.
  • The gradient of J(w) is the vector of partial derivatives:

    ∇w J(w) = ( ∂J(w)/∂w1, ..., ∂J(w)/∂wn )^T

  • The gradient vector points in the uphill direction of J(w).
  • To minimize J(w), we subtract the α-weighted gradient vector from w in each iteration:

    w ← w − α ∇w J(w)
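A minimal sketch of this update loop (illustrative, not from the slides; the quadratic objective and the step size α are made-up examples):

```python
import numpy as np

def grad_J(w):
    """Gradient of the example objective J(w) = ||w - w_star||^2 (illustrative)."""
    w_star = np.array([1.0, -2.0, 0.5])
    return 2 * (w - w_star)

alpha = 0.1
w = np.zeros(3)
for _ in range(100):
    w = w - alpha * grad_J(w)  # w <- w - alpha * grad J(w)
print(w)                       # converges toward w_star = [1, -2, 0.5]
```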

SLIDE 17

VFA problem

  • Suppose an oracle function exists that takes a state s as input and outputs the true Vπ(s).
      • The oracle may not be accessible in practice (that is the model-free problem setting).
  • The objective is to find the best approximate representation of Vπ(s), given a particular parameterized function Vπ(s; w).

SLIDES 18-19

From full gradient to stochastic gradient

  • Objective: the mean squared error between the oracle value and the approximation,

    J(w) = ½ Eπ[ (Vπ(s) − Vπ(s; w))² ]

    Without loss of generality, a constant factor ½ was added; it cancels the 2 produced by differentiating the square.
  • Full gradient descent uses an expectation over states:

    Δw = α ∇w J(w) = −α Eπ[ (Vπ(s) − Vπ(s; w)) ∇w Vπ(s; w) ],   with update w ← w − Δw

  • Stochastic gradient descent (SGD) samples the gradient using a single state:

    Δw = −α (Vπ(s) − Vπ(s; w)) ∇w Vπ(s; w)

  • The sampled gradient is an unbiased estimate of the full gradient, so SGD performs the full update in expectation.
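A short sketch of the sampled update for a linear Vπ(s; w) = x(s)^T w (my own illustration; the oracle target value here is a made-up number standing in for Vπ(s)):

```python
import numpy as np

def sgd_vfa_step(w, x_s, v_target, alpha):
    """One SGD step for linear VFA: Delta_w = -alpha*(target - x^T w)*x, then w <- w - Delta_w."""
    delta_w = -alpha * (v_target - x_s @ w) * x_s
    return w - delta_w

w = np.ones(3)
x_s = np.array([1.0, 0.0, 2.0])
w = sgd_vfa_step(w, x_s, v_target=4.0, alpha=0.1)  # moves V(s;w)=3 toward 4
```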

SLIDES 20-21

Model-free Policy Evaluation: From Tabular Representation to VFA

  • Following a fixed policy π (or having access to prior data), the goal is to estimate Vπ and/or Qπ.
  • Tabular: maintain a lookup table, e.g., V(1), V(2), V(3), V(4), V(5), to store the estimates of Vπ and/or Qπ.
  • VFA: maintain a function parameter vector w instead, and represent the estimate as Vπ(s; w).
  • In both cases, update the estimates
      • after each episode (Monte Carlo methods), or
      • after each step (TD methods).

SLIDE 25

  • From updating an initial V over iterations
      • MC
      • TD
  • To updating an initial w over iterations (see the sketch after this list)

SLIDES 29-30

Linear VFA with MC (Practice)

  • Seven states s1, ..., s7 with features:
    x(s1)=[2,0,0,0,0,0,0,1]^T, x(s2)=[0,2,0,0,0,0,0,1]^T, x(s3)=[0,0,2,0,0,0,0,1]^T,
    x(s4)=[0,0,0,2,0,0,0,1]^T, x(s5)=[0,0,0,0,2,0,0,1]^T, x(s6)=[0,0,0,0,0,2,0,1]^T,
    x(s7)=[0,0,0,0,0,0,1,2]^T
  • Init w0=[1,1,1,1,1,1,1,1]^T, α=0.5, γ=1
  • Episode: (s1,a1,0, s7,a1,0, s7,a1,0, T)
  • Question: what are Δw and w1 = w0 − Δw after the update with the first visit of s1?
  • Answer:
      • Return from the first visit of s1: Gs1 = 0 (every reward in the episode is 0)
      • Current estimate: V(s1) = x(s1)^T w0 = 2·1 + 1·1 = 3
      • Δw = −α (Gs1 − V(s1)) x(s1) = −0.5·(0 − 3)·[2,0,0,0,0,0,0,1]^T = [3,0,0,0,0,0,0,1.5]^T
      • w1 = w0 − Δw = [1,1,1,1,1,1,1,1]^T − [3,0,0,0,0,0,0,1.5]^T = [−2,1,1,1,1,1,1,−0.5]^T
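A quick NumPy check of this arithmetic (my own verification script, not from the slides):

```python
import numpy as np

x_s1 = np.array([2, 0, 0, 0, 0, 0, 0, 1.0])
w0 = np.ones(8)
alpha, G = 0.5, 0.0                        # the return from s1 is 0

delta_w = -alpha * (G - x_s1 @ w0) * x_s1  # x_s1 @ w0 = 3
w1 = w0 - delta_w
print(delta_w)  # [ 3.  0.  0.  0.  0.  0.  0.  1.5]
print(w1)       # [-2.  1.  1.  1.  1.  1.  1. -0.5]
```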
SLIDE 31

Tabular representation

SLIDE 35

Linear VFA with TD (Offline Practice)

  • Same seven states s1, ..., s7 and features as in SLIDES 29-30; Init w0=[1,1,1,1,1,1,1,1]^T, α=0.5, γ=1
  • Question: what is w1 after a TD update with the tuple (s1,a1,1,s7)?

SLIDE 36

Linear VFA with TD (Offline Practice)

  • TD target: r + γ V(s7; w0) = 1 + 1·(x(s7)^T w0) = 1 + 3 = 4
  • Current estimate: V(s1; w0) = x(s1)^T w0 = 3
  • Δw = −α (target − V(s1; w0)) x(s1) = −0.5·(4 − 3)·[2,0,0,0,0,0,0,1]^T = [−1,0,0,0,0,0,0,−0.5]^T
  • Answer: w1 = w0 − Δw = [2,1,1,1,1,1,1,1.5]^T
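And the corresponding NumPy check for the TD step (again my own verification, not from the slides):

```python
import numpy as np

x_s1 = np.array([2, 0, 0, 0, 0, 0, 0, 1.0])
x_s7 = np.array([0, 0, 0, 0, 0, 0, 1, 2.0])
w0 = np.ones(8)
alpha, gamma, r = 0.5, 1.0, 1.0

target = r + gamma * (x_s7 @ w0)                # 1 + 3 = 4
delta_w = -alpha * (target - x_s1 @ w0) * x_s1
print(w0 - delta_w)                             # [2. 1. 1. 1. 1. 1. 1. 1.5]
```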

SLIDE 37

This Lecture

  • Value Function Approximation (VFA)
      • Introduction
      • VFA for Policy Evaluation
      • VFA for Control

SLIDES 38-44

Recall: Tabular representation

SLIDES 45-46

Model-Free Q-Learning Control with Value Function Approximation (VFA)
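The update equations on these slides did not survive extraction. As a hedged reconstruction of the standard Q-Learning update with linear VFA (my sketch; the state-action feature vectors and the list of candidate next-action features are illustrative assumptions):

```python
import numpy as np

def q_hat(x_sa, w):
    """Linear action-value VFA: Q(s,a; w) = x(s,a)^T w."""
    return x_sa @ w

def q_learning_update(w, x_sa, r, next_action_features, gamma, alpha):
    """One Q-Learning step: the target maximizes over the next state's actions."""
    target = r + gamma * max(q_hat(x, w) for x in next_action_features)
    delta_w = -alpha * (target - q_hat(x_sa, w)) * x_sa
    return w - delta_w
```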

SLIDE 47

RL algorithms (the same overview as SLIDE 5, shown again as a recap)

SLIDE 48

Project 3 is available

  • Starts 10/15 (Thursday); due 10/29 (Thursday), midnight
  • http://users.wpi.edu/~yli15/courses/DS595CS525Fall20/Assignments.html
  • https://github.com/yingxue-zhang/DS595CS525-RL-Projects/tree/master/Project3

SLIDE 49

Next Lecture

  • (Continued) Value Function Approximation
      • Linear value function
  • Review of Deep Learning
  • Deep Learning implementation in PyTorch (by TA Yingxue)

SLIDE 50

Questions?