CS 440/ECE448 Lecture 22: Reinforcement Learning
Slides by Svetlana Lazebnik, 11/2016. Modified by Mark Hasegawa-Johnson, 4/2019.
Image by Nicolas P. Rougier - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=29327040
(Each episode of interaction with the environment is called a trial.)
Action    System prompt
GreetS    Welcome to NJFun. Please say an activity name or say 'list activities' for a list of activities I know about.
GreetU    Welcome to NJFun. How may I help you?
ReAsk1S   I know about amusement parks, aquariums, cruises, historic sites, museums, parks, theaters, wineries, and zoos. Please say an activity name from this list.
ReAsk1M   Please tell me the activity type. You can also tell me the location and time.
Initial gait vs. learned gait. From: Nate Kohl and Peter Stone, "Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion," IEEE International Conference on Robotics and Automation, 2004.
[Video: Pieter Abbeel et al.]
[Video: Sergey Levine et al., Berkeley]
[Figure: illustration of an RL framework for neural machine translation; at each step, the NMT environment (left) computes a candidate, and the agent's actions include WRITE.]
Reinforcement learning strategies:
Model-based: learn the model of the MDP (transition probabilities and rewards) and try to solve the MDP concurrently.
Model-free: learn how to act without explicitly learning the transition probabilities P(s' | s, a).
Q-learning: learn an action-utility function Q(s, a) that tells us the value of doing action a in state s.
Model-based reinforcement learning: try to learn the model of the MDP (transition probabilities and rewards) and learn how to act (solve the MDP) simultaneously. Keep track of how many times each transition (s, a, s') occurs, and estimate the transition probabilities P(s' | s, a) according to these relative frequencies.
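A minimal sketch, not from the slides, of this counting estimate (tabular states and a dict-of-dicts count table are assumptions):

```python
from collections import defaultdict

class ModelEstimator:
    """Maximum-likelihood model estimation from observed transitions."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}

    def observe(self, s, a, s_next):
        """Record one observed transition (s, a, s')."""
        self.counts[(s, a)][s_next] += 1

    def P(self, s_next, s, a):
        """Estimated P(s' | s, a) = count(s, a, s') / count(s, a)."""
        total = sum(self.counts[(s, a)].values())
        return self.counts[(s, a)][s_next] / total if total else 0.0
```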
Then choose optimal actions given the model of the environment we've experienced through our actions so far:

π*(s) = argmax_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U(s')
f(u,n) trades off greed [preference for high utility u] against curiosity [preference for low observed frequencies n]
f(u, n) = R+   if n < N_e
          u    otherwise

Choose actions optimistically:

a = argmax_{a' ∈ A(s)} f( Σ_{s'} P(s' | s, a') U(s'), N(s, a') )

where f is the exploration function and N(s, a') is the number of times we've taken action a' in state s.
Set utility of a’ to R+ [= optimistic reward estimate] if a’ in state s explored less than Ne [a constant] times Set utility to actual observed utility
If we have Q(s, a), we can select the next action without knowing the transition model:

π*(s) = argmax_a Q*(s, a)

instead of

π*(s) = argmax_a Σ_{s'} P(s' | s, a) U(s')
Source: Berkeley CS188
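A minimal sketch (a dict-based Q-table keyed by (state, action) is an assumption) of reading the greedy policy straight off Q, with no transition model needed:

```python
def greedy_action(Q, s, actions):
    """pi*(s) = argmax_a Q(s, a); unseen pairs default to 0."""
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```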
Utilities and Q-values are related by:

U(s) = max_{a ∈ A(s)} Q(s, a)

Q(s, a) = R(s) + γ Σ_{s'} P(s' | s, a) max_{a'} Q(s', a')
Pretend that the observed transition (s, a, s') is the only possible outcome. Call this "local quality" Q_local(s, a); it is computed from the current Q estimates. Then combine it with the old value, using a learning rate α, to compute Q_new(s, a).
Q_local(s, a) = R(s) + γ max_{a'} Q(s', a')

Q_new(s, a) = (1 - α) Q(s, a) + α Q_local(s, a)
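A minimal sketch of this TD update after observing one transition (s, a, r, s'); the dict-based Q-table and the constant values are assumptions:

```python
GAMMA = 0.9  # discount factor (assumed value)
ALPHA = 0.1  # learning rate alpha (assumed value)

def td_update(Q, s, a, r, s_next, actions):
    """One TD Q-learning step: blend old Q(s, a) with the local target."""
    q_local = r + GAMMA * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - ALPHA) * Q.get((s, a), 0.0) + ALPHA * q_local
```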
Exploration function Number of times we’ve taken action a’ from state s
'
a
'
a
Exploration function Number of times we’ve taken action a’ from state s
'
a
'
a
Exploration function Number of times we’ve taken action a’ from state s
That’s not necessarily the action we will take next time…
)" * + ,", !" , .(,", !")
Exploration function Number of times we’ve taken action a’ from state s’
That is the action we will take next time…
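For contrast, a minimal sketch of the on-policy (SARSA-style) update, which bootstraps from the action a' actually taken next rather than from the greedy maximum; the setting and constants match the previous sketch and are assumptions:

```python
GAMMA, ALPHA = 0.9, 0.1  # same assumed constants as in the previous sketch

def sarsa_update(Q, s, a, r, s_next, a_next):
    """On-policy TD step: the target uses the action we will actually take."""
    q_local = r + GAMMA * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = (1 - ALPHA) * Q.get((s, a), 0.0) + ALPHA * q_local
```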
Function approximation: so far, we've assumed a lookup-table representation for the utility function U(s) or action-utility function Q(s, a). This does not scale to very large or continuous state spaces (as in robot control). Alternative idea: approximate the utility function as a weighted linear combination of features:

U(s) = w_1 f_1(s) + w_2 f_2(s) + ... + w_n f_n(s)
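A minimal sketch of the linear approximation with a TD-style gradient update on the weights; the feature functions are hypothetical, chosen here for a 2-D grid state:

```python
import numpy as np

def features(s):
    """Hypothetical features f_1(s)..f_n(s) for a 2-D state s = (x, y)."""
    x, y = s
    return np.array([1.0, x, y, x * y])

def U(w, s):
    """U(s) = w_1 f_1(s) + w_2 f_2(s) + ... + w_n f_n(s)."""
    return w @ features(s)

def td_weight_update(w, s, r, s_next, alpha=0.1, gamma=0.9):
    """Move w toward the TD target; the gradient of the linear U is features(s)."""
    error = r + gamma * U(w, s_next) - U(w, s)
    return w + alpha * error * features(s)
```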
Policy search: start with an initial parameterized policy and tweak its parameters to improve the expected reward.
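A minimal sketch of the simplest form of policy search, hill climbing on the parameters by random perturbation; evaluate_policy is a hypothetical function that estimates expected reward (e.g., average return over a few trials):

```python
import numpy as np

def hill_climb(evaluate_policy, theta, iters=100, step=0.1):
    """Keep random parameter perturbations that raise estimated expected reward."""
    best = evaluate_policy(theta)
    for _ in range(iters):
        candidate = theta + step * np.random.randn(*theta.shape)
        score = evaluate_policy(candidate)
        if score > best:
            theta, best = candidate, score
    return theta
```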