Policy Search - Hill Climbing - PowerPoint PPT Presentation

SLIDE 1

Policy Search - Hill Climbing

SLIDE 2

Policy Search
  • Hill Climbing
  • Genetic Search
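
Hill climbing over policy parameters can be made concrete with a short sketch. This is a minimal illustration, not the lecture's code: `evaluate_policy` is a hypothetical stand-in for estimating E[R] by rollouts, and the Gaussian perturbation scale `sigma` is an assumed choice.

```python
import numpy as np

def evaluate_policy(theta):
    """Hypothetical stand-in: estimate the expected return E[R] of the
    policy with parameters theta (in practice, average several rollouts).
    Here: a toy objective whose optimum is at theta = 1."""
    return -float(np.sum((theta - 1.0) ** 2))

def hill_climb(theta, sigma=0.1, n_iters=200, seed=0):
    """Stochastic hill climbing: perturb the parameters with Gaussian
    noise and keep the perturbation only if estimated E[R] improves."""
    rng = np.random.default_rng(seed)
    best = evaluate_policy(theta)
    for _ in range(n_iters):
        candidate = theta + sigma * rng.standard_normal(theta.shape)
        score = evaluate_policy(candidate)
        if score > best:          # greedy accept; otherwise discard
            theta, best = candidate, score
    return theta, best

print(hill_climb(np.zeros(4)))
```
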

SLIDE 3

Policy Search
  • Hill Climbing
  • Genetic Search

[Figure: illustration not recoverable from the transcript.]

SLIDE 4

Policy Search
  • CMA-ES
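
For contrast with hill climbing, here is a population-based sketch in the spirit of genetic search / CMA-ES, using the same hypothetical `evaluate_policy` as above. Real CMA-ES also adapts a full covariance matrix and step size; this toy evolution strategy reduces that to a fixed sigma decay.

```python
import numpy as np

def evaluate_policy(theta):
    """Same hypothetical E[R] estimator as in the hill-climbing sketch."""
    return -float(np.sum((theta - 1.0) ** 2))

def es_search(theta, pop_size=20, sigma=0.5, n_gens=100, seed=0):
    """Toy evolution strategy: sample a population around the current
    mean, select the best quarter, and recombine them into a new mean."""
    rng = np.random.default_rng(seed)
    n_elite = max(1, pop_size // 4)
    for _ in range(n_gens):
        pop = theta + sigma * rng.standard_normal((pop_size, theta.size))
        scores = np.array([evaluate_policy(p) for p in pop])
        elite = pop[np.argsort(scores)[-n_elite:]]  # highest estimated E[R]
        theta = elite.mean(axis=0)                  # recombination
        sigma *= 0.95                               # crude step-size decay
    return theta

print(es_search(np.zeros(4)))
```
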
SLIDE 5

Gradient Bandits

SLIDE 6

Gradient Bandits

Just a scalar per arm. No states! But in the full RL case, the policy influences future states!
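
The gradient bandit update is small enough to show in full. This sketch follows the standard formulation (one preference scalar per arm, softmax policy, running mean reward as baseline); the arm means and Gaussian reward noise are illustrative assumptions.

```python
import numpy as np

def gradient_bandit(true_means, alpha=0.1, n_steps=2000, seed=0):
    """Gradient bandit: stochastic gradient ascent on E[R] using only one
    scalar preference per arm -- no states anywhere."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    H = np.zeros(k)                         # preferences, one per arm
    avg_reward = 0.0
    for t in range(1, n_steps + 1):
        exp_h = np.exp(H - H.max())
        pi = exp_h / exp_h.sum()            # softmax policy over arms
        a = rng.choice(k, p=pi)
        r = true_means[a] + rng.standard_normal()
        avg_reward += (r - avg_reward) / t  # baseline: mean reward so far
        indicator = np.eye(k)[a]            # one-hot for the chosen arm
        H += alpha * (r - avg_reward) * (indicator - pi)
    return H

print(gradient_bandit([0.2, 0.5, 1.0]))     # arm 2 should win
```
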

SLIDE 7

Proof of policy gradient theorem

SLIDE 8

Proof of policy gradient theorem
  • Marginalize R, push in gradient
  • Dynamics + reward constant w.r.t. θ

SLIDE 9

Proof of policy gradient theorem
  • Marginalize R, push in gradient
  • Dynamics + reward constant w.r.t. θ
  • Expanding V_t(s') creates deeply nested computation: at every step, compute every state you could get to from every state you could have been in
  • Transform into a simple sum over time steps and states: what is the total probability of being at each state at each time step?
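
Written out, the transformation the slide describes is the policy gradient theorem in its standard form (the slide's own symbols are not fully recoverable, so the notation below is assumed: J(θ) is expected return and µ(s) a state-visitation weight):

```latex
\nabla_\theta J(\theta)
  = \sum_{t}\sum_{s} \Pr(S_t = s \mid \pi_\theta)
      \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a)
  \;\propto\; \sum_{s} \mu(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a)
```

Here µ(s) = Σ_t Pr(S_t = s | π_θ), suitably discounted, is exactly the "total probability of being at each state at each time step" the slide asks for: the deeply nested expansion collapses into a single weighted sum over states.
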

SLIDE 10

Proof of policy gradient theorem
  • Marginalize R, push in gradient
  • Dynamics + reward constant w.r.t. θ
  • Expanding V_t(s') creates deeply nested computation: at every step, compute every state you could get to from every state you could have been in
  • Transform into a simple sum over time steps and states: what is the total probability of being at each state at each time step? (un-normalized)
  • Normalized version: the same weights, normalized into a (steady-state) probability distribution over states

SLIDE 11

REINFORCE
  • All-actions form: uses a Q approximation over all actions, not a sample return

SLIDE 12

REINFORCE

SLIDE 13

REINFORCE
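
A minimal REINFORCE sketch on a hypothetical two-state toy MDP (the MDP, step sizes, and tabular softmax parameterization are all illustrative assumptions, not the lecture's example). It uses the sampled return G, not a learned Q, which is exactly what makes it high-variance:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rollout(theta, horizon=10):
    """Hypothetical toy MDP: action 1 moves you to (or keeps you in)
    state 1, and only (state 1, action 1) pays reward +1."""
    s, episode = 0, []
    for _ in range(horizon):
        a = rng.choice(2, p=softmax(theta[s]))
        r = 1.0 if (s == 1 and a == 1) else 0.0
        episode.append((s, a, r))
        s = 1 if a == 1 else 0
    return episode

def reinforce(alpha=0.05, gamma=0.99, n_episodes=2000):
    """REINFORCE: Monte Carlo policy gradient driven by the sampled
    return G_t -- no value function, no bootstrapping."""
    theta = np.zeros((2, 2))              # softmax preference per (s, a)
    for _ in range(n_episodes):
        G = 0.0
        for s, a, r in reversed(rollout(theta)):
            G = r + gamma * G             # return from this step onward
            grad_logpi = -softmax(theta[s])
            grad_logpi[a] += 1.0          # gradient of log softmax
            theta[s] += alpha * G * grad_logpi
    return theta

theta = reinforce()
print(softmax(theta[0]), softmax(theta[1]))   # both should favor action 1
```
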

SLIDE 14

Gradient Bandits + Baseline
  • Baseline ← mean reward
  • Baseline term has expectation of zero over samples

SLIDE 15

Gradient Bandits + Baseline | REINFORCE + Baseline b(s)
  • Baseline ← mean reward
  • Baseline term has expectation of zero over samples
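
The "expectation of zero" claim is the standard baseline identity, written out below (score-function form assumed from context). Because the baseline b(s) does not depend on the action, it adds nothing to the expected gradient:

```latex
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\bigl[ b(s)\, \nabla_\theta \log \pi_\theta(a \mid s) \bigr]
  = b(s) \sum_{a} \pi_\theta(a \mid s)\,
      \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}
  = b(s)\, \nabla_\theta \sum_{a} \pi_\theta(a \mid s)
  = b(s)\, \nabla_\theta 1 = 0
```

So subtracting the mean reward (gradient bandits) or b(s) (REINFORCE) leaves the gradient estimate unbiased while potentially reducing its variance.
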

SLIDE 16

SLIDE 17

Actor-only: policy search
  • Directly parameterized policy
  • No value functions (except baseline in REINFORCE)
  • Continuous actions natural to represent
  • High variance, no bootstrapping
  • Scales w/ policy complexity, not size of state space

SLIDE 18

Actor-only: policy search methods
  • Directly parameterized policy
  • No value functions (except baseline in REINFORCE)
  • Continuous actions natural to represent
  • High variance, no bootstrapping
  • Scales w/ policy complexity, not size of state space

Critic-only: value function methods
  • Indirect policy via VF
  • Discrete actions
  • Lower variance
  • Bootstrapping
  • Scales with size of state space

SLIDE 19

Actor-only: policy search methods
  • Directly parameterized policy
  • No value functions (except baseline in REINFORCE)
  • Continuous actions natural to represent
  • High variance, no bootstrapping
  • Scales w/ policy complexity, not size of state space

Critic-only: value function methods
  • Indirect policy via VF
  • Discrete actions
  • Lower variance
  • Bootstrapping
  • Scales with size of state space

Actor-Critic: policy search + value function
  • Benefits of both!
  • Continuous actions
  • Bootstrapping
  • Lower variance
  • Scales primarily with policy complexity

SLIDE 20

Actor-only: policy search methods
  • Directly parameterized policy
  • No value functions (except baseline in REINFORCE)
  • Continuous actions natural to represent
  • High variance, no bootstrapping
  • Scales w/ policy complexity, not size of state space

Critic-only: value function methods
  • Indirect policy via VF
  • Discrete actions
  • Lower variance
  • Bootstrapping
  • Scales with size of state space

Actor-Critic: policy search + value function
  • Benefits of both!
  • Continuous actions
  • Bootstrapping
  • Lower variance
  • Scales primarily with policy complexity

Many of the most popular contemporary methods are A-C:
  • Proximal Policy Optimization (PPO)
  • A3C
  • Soft Actor-Critic
  • DDPG
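
To make the "benefits of both" concrete, here is a minimal one-step actor-critic sketch on the same hypothetical toy MDP used in the REINFORCE sketch above. The critic's TD error replaces the Monte Carlo return, which is where the bootstrapping and lower variance come from; PPO, A3C, SAC, and DDPG are far more elaborate, and this only shows the shared actor + critic structure.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def env_step(s, a):
    """Same hypothetical toy MDP as in the REINFORCE sketch."""
    r = 1.0 if (s == 1 and a == 1) else 0.0
    return (1 if a == 1 else 0), r

def actor_critic(alpha_actor=0.05, alpha_critic=0.1, gamma=0.99,
                 n_steps=20000):
    """One-step actor-critic: the critic learns V(s) by TD(0), and the
    actor ascends the policy gradient using the TD error as its signal."""
    theta = np.zeros((2, 2))   # actor: softmax preferences per (s, a)
    V = np.zeros(2)            # critic: tabular state values
    s = 0
    for _ in range(n_steps):
        pi = softmax(theta[s])
        a = rng.choice(2, p=pi)
        s2, r = env_step(s, a)
        td_error = r + gamma * V[s2] - V[s]   # bootstrapped target
        V[s] += alpha_critic * td_error       # critic update
        grad_logpi = -pi
        grad_logpi[a] += 1.0
        theta[s] += alpha_actor * td_error * grad_logpi  # actor update
        s = s2
    return theta, V

theta, V = actor_critic()
print(softmax(theta[0]), softmax(theta[1]), V)
```
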