Policy Search: Hill Climbing

Policy Search: Genetic Search

[Figures: hand-drawn diagrams; labels not recoverable]
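Hill climbing over policy parameters can be sketched roughly as follows. This is a minimal illustration, not the method drawn on the slide: `evaluate` is a hypothetical stand-in for rolling out a policy and averaging its return, and the target vector, step size, and iteration count are all assumptions chosen for the example.

```python
import numpy as np

def evaluate(theta, rng, n_episodes=20):
    """Hypothetical stand-in for policy rollouts: a noisy return peaked at theta = [1, -2]."""
    target = np.array([1.0, -2.0])
    noise = rng.normal(0.0, 0.01, size=n_episodes)
    return float(np.mean(-np.sum((theta - target) ** 2) + noise))

def hill_climb(theta0, n_iters=500, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    best = evaluate(theta, rng)
    for _ in range(n_iters):
        # Propose a random perturbation of the current policy parameters.
        candidate = theta + step * rng.normal(size=theta.shape)
        score = evaluate(candidate, rng)
        # Keep the perturbation only if the estimated return improves.
        if score > best:
            theta, best = candidate, score
    return theta, best

theta, best = hill_climb([0.0, 0.0])
```

No gradient is used anywhere: the search only compares estimated returns, which is why it is grouped with genetic search rather than with the gradient methods that follow.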
Policy Search: Gradient Bandits

Gradient bandits: just a scalar per arm. No states!
But in the full RL case, the policy influences future states!
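A small sketch of the gradient bandit case, assuming a softmax over one scalar preference per arm and a running-mean reward baseline. The arm means, step size, and function names are made up for illustration.

```python
import numpy as np

def softmax(h):
    z = np.exp(h - np.max(h))
    return z / z.sum()

def gradient_bandit(true_means, n_steps=5000, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    k = len(true_means)
    H = np.zeros(k)      # one scalar preference per arm; there is no state
    baseline = 0.0       # running mean of all rewards seen so far
    for t in range(1, n_steps + 1):
        pi = softmax(H)
        a = rng.choice(k, p=pi)
        r = rng.normal(true_means[a], 1.0)
        baseline += (r - baseline) / t
        one_hot = np.zeros(k)
        one_hot[a] = 1.0
        # Stochastic gradient ascent on expected reward.
        H += alpha * (r - baseline) * (one_hot - pi)
    return softmax(H)

pi = gradient_bandit(np.array([0.0, 0.5, 2.0]))
```

The action chosen here never changes what the learner sees next, which is exactly what stops being true in the full RL case.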
Proof of the Policy Gradient Theorem

Marginalize over R, then push the gradient in.
Dynamics and reward are constant w.r.t. θ.
Expanding v_π(s') creates a deeply nested computation: at every step, compute every state you could get to from every state you could have been in.
Transform into a simple sum over time steps and states: what is the total probability of being at each state at each time step?
Normalized version: the steady-state probability of each state.
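The steps of the proof can be written out as below. This is a sketch in the standard textbook notation for the undiscounted episodic case, which is an assumption since the slide's own symbols did not survive: η(s) = Σ_k Pr(s_0 → s, k, π) is the expected number of visits to s, and μ(s) = η(s) / Σ_{s'} η(s') is its normalized version.

```latex
\begin{align*}
\nabla v_\pi(s)
  &= \nabla \Big[ \sum_a \pi(a \mid s)\, q_\pi(s,a) \Big] \\
  &= \sum_a \Big[ \nabla \pi(a \mid s)\, q_\pi(s,a) + \pi(a \mid s)\, \nabla q_\pi(s,a) \Big] \\
  &= \sum_a \Big[ \nabla \pi(a \mid s)\, q_\pi(s,a)
     + \pi(a \mid s)\, \nabla \sum_{s',r} p(s',r \mid s,a)\big(r + v_\pi(s')\big) \Big] \\
  &= \sum_a \Big[ \nabla \pi(a \mid s)\, q_\pi(s,a)
     + \pi(a \mid s) \sum_{s'} p(s' \mid s,a)\, \nabla v_\pi(s') \Big] \\
  &= \sum_{x} \sum_{k=0}^{\infty} \Pr(s \to x, k, \pi) \sum_a \nabla \pi(a \mid x)\, q_\pi(x,a) \\[4pt]
\nabla J(\theta) &= \nabla v_\pi(s_0)
   = \sum_s \eta(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s,a)
   \;\propto\; \sum_s \mu(s) \sum_a q_\pi(s,a)\, \nabla \pi(a \mid s)
\end{align*}
```

The fourth line is where the reward and dynamics drop out of the gradient, and the fifth line is the unrolled recursion collapsed into a sum over states and time steps.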
REINFORCE

All-actions form: uses a Q approximation, not a sample return.
REINFORCE: uses the sample return instead.
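A minimal REINFORCE sketch with a tabular softmax policy. The corridor environment, reward of +1 at the goal, discount, step size, and episode count are all assumptions chosen to keep the example small. The gradient is accumulated over the episode and applied once at the end, a per-episode batch variant of the update.

```python
import numpy as np

N_STATES, GOAL, MAX_STEPS = 4, 3, 50

def step(s, a):
    """Hypothetical corridor: action 1 moves right, action 0 moves left; +1 on reaching GOAL."""
    s2 = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    done = s2 == GOAL
    return s2, (1.0 if done else 0.0), done

def policy(theta, s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def reinforce(n_episodes=3000, alpha=0.1, gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((N_STATES, 2))
    for _ in range(n_episodes):
        # Sample one full episode with the current policy.
        s, done, traj = 0, False, []
        while not done and len(traj) < MAX_STEPS:
            a = rng.choice(2, p=policy(theta, s))
            s2, r, done = step(s, a)
            traj.append((s, a, r))
            s = s2
        # Walk backwards to get the sample return G_t for every step.
        grad, G = np.zeros_like(theta), 0.0
        for t in reversed(range(len(traj))):
            s_t, a_t, r_t = traj[t]
            G = r_t + gamma * G
            grad_log = -policy(theta, s_t)   # grad of log softmax: one_hot(a) - pi
            grad_log[a_t] += 1.0
            grad[s_t] += (gamma ** t) * G * grad_log
        theta += alpha * grad
    return theta

theta = reinforce()
```

Only the sampled action at each step contributes to the update, weighted by the sampled return, in contrast to the all-actions form that sums over every action with a Q estimate.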
REINFORCE + Baseline

Gradient bandits + baseline: the baseline is the mean reward.
REINFORCE + baseline: the baseline b(s) term is zero in expectation, so it does not bias the sampled gradient.
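The zero-expectation claim can be checked exactly for a single state by enumerating the actions. The preferences `theta` and action values `q` below are hypothetical numbers. The variance drop shown is for this particular example only and is not guaranteed for every choice of baseline.

```python
import numpy as np

theta = np.array([0.2, -0.1, 0.5])   # hypothetical softmax preferences
q = np.array([1.0, 2.0, 4.0])        # hypothetical action values
pi = np.exp(theta) / np.exp(theta).sum()

# Row a holds grad_theta log pi(a) = one_hot(a) - pi for a softmax policy.
grad_log_pi = np.eye(3) - pi

def estimator_stats(b):
    """Exact mean and total variance of the one-sample estimate (q[A] - b) * grad log pi(A)."""
    samples = (q - b)[:, None] * grad_log_pi   # one row per possible action A
    mean = pi @ samples
    var = pi @ np.sum((samples - mean) ** 2, axis=1)
    return mean, var

mean_no_b, var_no_b = estimator_stats(0.0)
mean_b, var_b = estimator_stats(pi @ q)   # baseline = expected value under pi
baseline_term = pi @ grad_log_pi          # E[grad log pi(A)]: the zero vector
```

Because `baseline_term` is zero, any baseline that does not depend on the action leaves the expected gradient unchanged and only changes the spread of the samples.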
Policy Search vs. Value Function Methods vs. Actor-Critic

Policy search: actually search over a parameterized policy
  No value functions (except baseline in REINFORCE)
  Continuous actions natural to represent
  High variance, no bootstrapping
  Complexity scales w/ policy complexity, not size of state space

Value function methods: policy via VF over actions
  Bootstrapping lowers variance
  Complexity scales with size of state space

Actor-Critic: policy search + value function methods, both!
  Many of the most popular contemporary methods are A-C: Policy Optimization, 3C,
PG
:
(