Policy Search - Hill Climbing - PowerPoint PPT Presentation

SLIDE 1

Policy Search - Hill Climbing

SLIDE 2

Policy Search
  • Hill Climbing
  • Genetic Search
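
Hill climbing over policy parameters can be made concrete with a short sketch. This is a minimal illustration, not the lecture's code: `evaluate_policy` is a hypothetical stand-in for estimating E[R] by rollouts, and the Gaussian perturbation scale `sigma` is an assumed choice.

```python
import numpy as np

def evaluate_policy(theta):
    """Hypothetical stand-in: estimate the expected return E[R] of the
    policy with parameters theta (in practice, average several rollouts).
    Here: a toy objective whose optimum is at theta = 1."""
    return -float(np.sum((theta - 1.0) ** 2))

def hill_climb(theta, sigma=0.1, n_iters=200, seed=0):
    """Stochastic hill climbing: perturb the parameters with Gaussian
    noise and keep the perturbation only if estimated E[R] improves."""
    rng = np.random.default_rng(seed)
    best = evaluate_policy(theta)
    for _ in range(n_iters):
        candidate = theta + sigma * rng.standard_normal(theta.shape)
        score = evaluate_policy(candidate)
        if score > best:          # greedy accept; otherwise discard
            theta, best = candidate, score
    return theta, best

print(hill_climb(np.zeros(4)))
```
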

SLIDE 3

Policy Search
  • Hill Climbing
  • Genetic Search

[Figure: illustration not recoverable from the transcript.]

SLIDE 4

Policy Search
  • CMA-ES
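
For contrast with hill climbing, here is a population-based sketch in the spirit of genetic search / CMA-ES, using the same hypothetical `evaluate_policy` as above. Real CMA-ES also adapts a full covariance matrix and step size; this toy evolution strategy reduces that to a fixed sigma decay.

```python
import numpy as np

def evaluate_policy(theta):
    """Same hypothetical E[R] estimator as in the hill-climbing sketch."""
    return -float(np.sum((theta - 1.0) ** 2))

def es_search(theta, pop_size=20, sigma=0.5, n_gens=100, seed=0):
    """Toy evolution strategy: sample a population around the current
    mean, select the best quarter, and recombine them into a new mean."""
    rng = np.random.default_rng(seed)
    n_elite = max(1, pop_size // 4)
    for _ in range(n_gens):
        pop = theta + sigma * rng.standard_normal((pop_size, theta.size))
        scores = np.array([evaluate_policy(p) for p in pop])
        elite = pop[np.argsort(scores)[-n_elite:]]  # highest estimated E[R]
        theta = elite.mean(axis=0)                  # recombination
        sigma *= 0.95                               # crude step-size decay
    return theta

print(es_search(np.zeros(4)))
```
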
SLIDE 5

Gradient Bandits

SLIDE 6

Gradient Bandits

Just a scalar per arm. No states! But in the full RL case, the policy influences future states!
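
The gradient bandit update is small enough to show in full. This sketch follows the standard formulation (one preference scalar per arm, softmax policy, running mean reward as baseline); the arm means and Gaussian reward noise are illustrative assumptions.

```python
import numpy as np

def gradient_bandit(true_means, alpha=0.1, n_steps=2000, seed=0):
    """Gradient bandit: stochastic gradient ascent on E[R] using only one
    scalar preference per arm -- no states anywhere."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    H = np.zeros(k)                         # preferences, one per arm
    avg_reward = 0.0
    for t in range(1, n_steps + 1):
        exp_h = np.exp(H - H.max())
        pi = exp_h / exp_h.sum()            # softmax policy over arms
        a = rng.choice(k, p=pi)
        r = true_means[a] + rng.standard_normal()
        avg_reward += (r - avg_reward) / t  # baseline: mean reward so far
        indicator = np.eye(k)[a]            # one-hot for the chosen arm
        H += alpha * (r - avg_reward) * (indicator - pi)
    return H

print(gradient_bandit([0.2, 0.5, 1.0]))     # arm 2 should win
```
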

SLIDE 7

Proof of policy gradient theorem

SLIDE 8

Proof of policy gradient theorem
  • Marginalize R, push in gradient
  • Dynamics + reward constant w.r.t. θ

SLIDE 9

Proof of policy gradient theorem
  • Marginalize R, push in gradient
  • Dynamics + reward constant w.r.t. θ
  • Expanding V_t(s') creates deeply nested computation: at every step, compute every state you could get to from every state you could have been in
  • Transform into a simple sum over time steps and states: what is the total probability of being at each state at each time step?
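
Written out, the transformation the slide describes is the policy gradient theorem in its standard form (the slide's own symbols are not fully recoverable, so the notation below is assumed: J(θ) is expected return and µ(s) a state-visitation weight):

```latex
\nabla_\theta J(\theta)
  = \sum_{t}\sum_{s} \Pr(S_t = s \mid \pi_\theta)
      \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a)
  \;\propto\; \sum_{s} \mu(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a)
```

Here µ(s) = Σ_t Pr(S_t = s | π_θ), suitably discounted, is exactly the "total probability of being at each state at each time step" the slide asks for: the deeply nested expansion collapses into a single weighted sum over states.
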

SLIDE 10

Proof of policy gradient theorem
  • Marginalize R, push in gradient
  • Dynamics + reward constant w.r.t. θ
  • Expanding V_t(s') creates deeply nested computation: at every step, compute every state you could get to from every state you could have been in
  • Transform into a simple sum over time steps and states: what is the total probability of being at each state at each time step? (un-normalized)
  • Normalized version: the same weights, normalized into a (steady-state) probability distribution over states

SLIDE 11

REINFORCE
  • All-actions form: uses a Q approximation over all actions, not a sample return

SLIDE 12

REINFORCE

SLIDE 13

REINFORCE
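
A minimal REINFORCE sketch on a hypothetical two-state toy MDP (the MDP, step sizes, and tabular softmax parameterization are all illustrative assumptions, not the lecture's example). It uses the sampled return G, not a learned Q, which is exactly what makes it high-variance:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rollout(theta, horizon=10):
    """Hypothetical toy MDP: action 1 moves you to (or keeps you in)
    state 1, and only (state 1, action 1) pays reward +1."""
    s, episode = 0, []
    for _ in range(horizon):
        a = rng.choice(2, p=softmax(theta[s]))
        r = 1.0 if (s == 1 and a == 1) else 0.0
        episode.append((s, a, r))
        s = 1 if a == 1 else 0
    return episode

def reinforce(alpha=0.05, gamma=0.99, n_episodes=2000):
    """REINFORCE: Monte Carlo policy gradient driven by the sampled
    return G_t -- no value function, no bootstrapping."""
    theta = np.zeros((2, 2))              # softmax preference per (s, a)
    for _ in range(n_episodes):
        G = 0.0
        for s, a, r in reversed(rollout(theta)):
            G = r + gamma * G             # return from this step onward
            grad_logpi = -softmax(theta[s])
            grad_logpi[a] += 1.0          # gradient of log softmax
            theta[s] += alpha * G * grad_logpi
    return theta

theta = reinforce()
print(softmax(theta[0]), softmax(theta[1]))   # both should favor action 1
```
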

SLIDE 14

Gradient Bandits + Baseline
  • Baseline ← mean reward
  • Baseline term has expectation of zero over samples

SLIDE 15

Gradient Bandits + Baseline | REINFORCE + Baseline b(s)
  • Baseline ← mean reward
  • Baseline term has expectation of zero over samples
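
The "expectation of zero" claim is the standard baseline identity, written out below (score-function form assumed from context). Because the baseline b(s) does not depend on the action, it adds nothing to the expected gradient:

```latex
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\bigl[ b(s)\, \nabla_\theta \log \pi_\theta(a \mid s) \bigr]
  = b(s) \sum_{a} \pi_\theta(a \mid s)\,
      \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}
  = b(s)\, \nabla_\theta \sum_{a} \pi_\theta(a \mid s)
  = b(s)\, \nabla_\theta 1 = 0
```

So subtracting the mean reward (gradient bandits) or b(s) (REINFORCE) leaves the gradient estimate unbiased while potentially reducing its variance.
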

SLIDE 16

SLIDE 17

Actor-only: policy search
  • Directly parameterized policy
  • No value functions (except baseline in REINFORCE)
  • Continuous actions natural to represent
  • High variance, no bootstrapping
  • Scales w/ policy complexity, not size of state space

SLIDE 18

Actor-only: policy search methods
  • Directly parameterized policy
  • No value functions (except baseline in REINFORCE)
  • Continuous actions natural to represent
  • High variance, no bootstrapping
  • Scales w/ policy complexity, not size of state space

Critic-only: value function methods
  • Indirect policy via VF
  • Discrete actions
  • Lower variance
  • Bootstrapping
  • Scales with size of state space

SLIDE 19

Actor-only: policy search methods
  • Directly parameterized policy
  • No value functions (except baseline in REINFORCE)
  • Continuous actions natural to represent
  • High variance, no bootstrapping
  • Scales w/ policy complexity, not size of state space

Critic-only: value function methods
  • Indirect policy via VF
  • Discrete actions
  • Lower variance
  • Bootstrapping
  • Scales with size of state space

Actor-Critic: policy search + value function
  • Benefits of both!
  • Continuous actions
  • Bootstrapping
  • Lower variance
  • Scales primarily with policy complexity

SLIDE 20

Actor-only: policy search methods
  • Directly parameterized policy
  • No value functions (except baseline in REINFORCE)
  • Continuous actions natural to represent
  • High variance, no bootstrapping
  • Scales w/ policy complexity, not size of state space

Critic-only: value function methods
  • Indirect policy via VF
  • Discrete actions
  • Lower variance
  • Bootstrapping
  • Scales with size of state space

Actor-Critic: policy search + value function
  • Benefits of both!
  • Continuous actions
  • Bootstrapping
  • Lower variance
  • Scales primarily with policy complexity

Many of the most popular contemporary methods are A-C:
  • Proximal Policy Optimization (PPO)
  • A3C
  • Soft Actor-Critic
  • DDPG
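
To make the "benefits of both" concrete, here is a minimal one-step actor-critic sketch on the same hypothetical toy MDP used in the REINFORCE sketch above. The critic's TD error replaces the Monte Carlo return, which is where the bootstrapping and lower variance come from; PPO, A3C, SAC, and DDPG are far more elaborate, and this only shows the shared actor + critic structure.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def env_step(s, a):
    """Same hypothetical toy MDP as in the REINFORCE sketch."""
    r = 1.0 if (s == 1 and a == 1) else 0.0
    return (1 if a == 1 else 0), r

def actor_critic(alpha_actor=0.05, alpha_critic=0.1, gamma=0.99,
                 n_steps=20000):
    """One-step actor-critic: the critic learns V(s) by TD(0), and the
    actor ascends the policy gradient using the TD error as its signal."""
    theta = np.zeros((2, 2))   # actor: softmax preferences per (s, a)
    V = np.zeros(2)            # critic: tabular state values
    s = 0
    for _ in range(n_steps):
        pi = softmax(theta[s])
        a = rng.choice(2, p=pi)
        s2, r = env_step(s, a)
        td_error = r + gamma * V[s2] - V[s]   # bootstrapped target
        V[s] += alpha_critic * td_error       # critic update
        grad_logpi = -pi
        grad_logpi[a] += 1.0
        theta[s] += alpha_actor * td_error * grad_logpi  # actor update
        s = s2
    return theta, V

theta, V = actor_critic()
print(softmax(theta[0]), softmax(theta[1]), V)
```
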