Lecture #6 – Deep Reinforcement Learning
Aykut Erdem // Hacettepe University // Spring 2019
CMP722
ADVANCED COMPUTER VISION
Image: StarCraft II DeepMind feature layer API
Previously on CMP722: image captioning
Illustration: William Joel
Lecture overview
Disclaimer: Much of the material and slides for this lecture were borrowed from
– Katja Hofmann's Deep Learning Indaba 2018 lecture on "Reinforcement Learning"
Decision Making and Learning under Uncertainty
[Slide: choosing a restaurant – Buzz Feathers, TaMaties, Java Junction, Jeff's Place, DCM, Nca'Kos, Vlambojant Hutmakers, Mirriam's Kitchen, Otaku, Roman's Pizza]
Reinforcement Learning (RL): learning to make decisions under uncertainty – successfully applied in a wide range of applications
RL can model a vast range of problems
Three problem areas that motivated RL research: Optimal Control, Games, Animal Learning
Lindquist, J. (1962). Operations of a hydrothermal electric system: A multistage decision process. Transactions of the American Institute of Electrical Engineers. Pereira, M., Campodónico, N., & Kelman, R. (1998). Long-term hydro scheduling based on stochastic models. EPSOM 98.
Photo by Magda Ehlers from Pexels
Long-term consequences in optimal control
Figure from: Mario Pereira, Nora Campodónico, & Rafael Kelman. "Long-term hydro scheduling based on stochastic models." EPSOM 98.
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210–229. Samuel, A. L. (1967). Some studies in machine learning using the game of checkers. II—Recent progress. IBM Journal of Research and Development, 11(6).
Photo credit: https://www.flickr.com/photos/shawnzlea/261793051/
Samuel’s Checkers Player
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275(5306), 1593–1599.
Photo credit: https://www.flickr.com/photos/scorius/750037290
Figure from: Schultz, Dayan, & Montague. "A neural substrate of prediction and reward." Science 1997.
RL as a valuable tool for modelling neurological phenomena
Further Reading
Interfaces, 15(6).
Maia, T. V., & Frank, M. J. (2011). From reinforcement learning models to psychiatric and neurological disorders. Nature Neuroscience, 14(2).
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press, 2nd Edition. http://incompleteideas.net/book/the-book-2nd.html – Chapters 1, 14–16
In RL – an agent interacts with an environment
The agent observes state s_t ∈ S and acts with policy π(a|s), selecting action a_t ∈ A
The environment, with transition dynamics p(s_{t+1} | s_t, a_t) and reward function r(r_{t+1} | s_t, a_t), returns reward r_{t+1} ∈ ℝ and the next state s_{t+1} ∈ S
Photo credit: https://www.flickr.com/photos/steveonjava/8170183457
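To make the interaction loop concrete, here is a minimal sketch in Python. The `env` and `policy` objects are hypothetical stand-ins (not any particular library's API) that follow the s_t, a_t, r_{t+1} notation above.

```python
import random

def run_episode(env, policy, max_steps=100):
    """Roll out one episode of agent-environment interaction.

    Assumes `env` exposes reset() -> state and
    step(action) -> (next_state, reward, done), and that `policy`
    maps a state to a dict {action: probability}; both interfaces
    are illustrative stand-ins, not a specific library's API.
    """
    trajectory = []                      # (s_t, a_t, r_{t+1}) tuples
    state = env.reset()                  # initial state s_0
    for _ in range(max_steps):
        probs = policy(state)            # pi(a | s_t)
        acts, weights = list(probs.keys()), list(probs.values())
        action = random.choices(acts, weights=weights)[0]  # a_t ~ pi
        # The environment applies p(s_{t+1} | s_t, a_t) and emits r_{t+1}
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory
```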
Markov Decision Process (MDP)
Defined by M = (S, A, p, r, γ)
discount factor: γ ∈ (0, 1)
(Markov property: transitions and rewards depend only on the most recent state and action)
Return: G_t = Σ_{k=0}^∞ γ^k r_{t+k+1}
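As a quick numerical check of the return definition, a minimal sketch in plain Python (the function name is mine); it uses the recursion G_t = r_{t+1} + γ G_{t+1}, which reappears in the Bellman equations later:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = sum_{k=0}^{inf} gamma^k * r_{t+k+1} for a finite
    list of rewards [r_{t+1}, r_{t+2}, ...]."""
    g = 0.0
    # Accumulate backwards using G_t = r_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# e.g., rewards [1, 1, 1] with gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71
assert abs(discounted_return([1, 1, 1], 0.9) - 2.71) < 1e-9
```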
State space
Again – an important modelling choice, e.g., for the reservoir example:
a) Discrete states "low" and "high" reservoir level
b) Coarse discretization: "0-10%", "10-20%", …, "90-100%"
c) Continuous states – current reservoir level (e.g., 67%)
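A tiny sketch of these three choices for the reservoir example (plain Python; the function and scheme names are mine):

```python
def reservoir_state(level, scheme="coarse"):
    """Map a continuous reservoir level in [0, 1] to a state under the
    three modelling choices above (scheme names are illustrative)."""
    if scheme == "binary":                      # (a) two discrete states
        return "low" if level < 0.5 else "high"
    if scheme == "coarse":                      # (b) 10% bins
        bin_start = min(int(level * 10), 9) * 10
        return f"{bin_start}-{bin_start + 10}%"
    return level                                # (c) continuous state

print(reservoir_state(0.67, "binary"))      # high
print(reservoir_state(0.67, "coarse"))      # 60-70%
print(reservoir_state(0.67, "continuous"))  # 0.67
```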
Mnih et al. results in Atari – a lesson in generality
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540).
Screenshots from: Kurin, V., Nowozin, S., Hofmann, K., Beyer, L., & Leibe, B. (2017). The Atari Grand Challenge Dataset. http://atarigrandchallenge.com/
Case Study: Investigating Human Priors for Playing Video Games
https://rach0012.github.io/humanRL_website/
Action space
Again – an important modelling choice. Common choices:
a) Discrete, e.g., on/off, which button to press (Atari)
b) Continuous, e.g., how much force to apply, how quickly to accelerate
c) Active research area: large, complex action spaces (e.g., combinatorial, mixed discrete/continuous, natural language)
A Platform for Research: TextWorld
https://www.microsoft.com/en-us/research/project/textworld/
Rewards
Rewards specify the task objective (e.g., the score in Atari)
Reward design strongly shapes the learned solutions
For details and full video: https://blog.openai.com/faulty-reward-functions/
Further Reading
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press, 2nd Edition. http://incompleteideas.net/book/the-book-2nd.html – Chapter 3, 9.5
Policy Gradient: Intuition
Multi-armed bandit problems
Photo credit: https://www.flickr.com/photos/knothing/11264853546/
Slide sequence – four arms, action probabilities updated after each outcome:
.5 .5 .5 .5 → lose ☹ → .45 .45 .55 .55 → win ☺ → .4 .4 .6 .6 → win ☺ → .45 .45 .55 .55
Intuition: decrease the probability of actions that led to losses, increase the probability of actions that led to wins.
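A toy simulation of this intuition (plain Python; the "nudge and renormalize" rule and the hidden win probabilities are my own illustrative choices, not the exact rule behind the slide's numbers):

```python
import random

def update(probs, chosen, won, step=0.05):
    """Nudge the chosen arm's probability up after a win and down after
    a loss (step = 0.05 mirrors the .50 -> .45/.55 jumps on the slides),
    then renormalize so the values still form a distribution."""
    probs = probs.copy()
    probs[chosen] = max(probs[chosen] + (step if won else -step), 1e-6)
    total = sum(probs)
    return [p / total for p in probs]

probs = [0.25, 0.25, 0.25, 0.25]    # start uniform over four arms
win_chance = [0.2, 0.3, 0.7, 0.8]   # hidden per-arm win probabilities (made up)
for _ in range(1000):
    arm = random.choices(range(4), weights=probs)[0]
    won = random.random() < win_chance[arm]
    probs = update(probs, arm, won)
print(probs)  # probability mass drifts toward the arms that win more often
```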
Focus on the Policy: Parametric Form
π(a | s; θ) = exp(h(s, a; θ)) / Σ_{a′∈A} exp(h(s, a′; θ))
π(a | s; θ) – a probability distribution over actions
θ – learnable parameters
h(s, a; θ) – action preferences
the softmax gives normalized probabilities
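A minimal NumPy sketch of this softmax policy, assuming linear action preferences h(s, a; θ) = θ_a · φ(s) (the linear form and all names are illustrative):

```python
import numpy as np

def softmax_policy(preferences):
    """pi(a | s; theta) = exp(h(s,a;theta)) / sum_a' exp(h(s,a';theta))."""
    z = preferences - preferences.max()   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy sizes: 4 actions, 3 state features
theta = np.zeros((4, 3))              # learnable parameters
phi_s = np.array([0.2, -1.0, 0.5])    # feature vector for some state s
print(softmax_policy(theta @ phi_s))  # uniform [0.25 0.25 0.25 0.25] for theta = 0
```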
Policy Gradient Objective
Goal: find parameters θ that maximize expected reward
J(θ) = Σ_{s∈S} μ_π(s) Σ_{a∈A} π(a | s; θ) r_{s,a}
where μ_π(s) is the distribution over states visited under policy π
Challenge: computing updates to the parameterized policy depends on the unknown environment dynamics
The Policy Gradient Theorem
Key insight: the gradient of J does not require derivatives of μ_π(s)
∇_θ J(θ) ∝ Σ_{s∈S} μ_π(s) Σ_{a∈A} q_π(s, a) ∇_θ π(a | s; θ)
The sums can be estimated from experience by sampling and reweighting
Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. NIPS 2000. See Sutton & Barto 2018, chapter 13
Policy Gradient Algorithm: REINFORCE
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4), 229–256.
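A compact REINFORCE sketch for the softmax policy above (NumPy; the `env` interface and feature map `phi` are hypothetical stand-ins, and `softmax_policy` is the function sketched earlier):

```python
import numpy as np

def reinforce_episode(env, theta, phi, gamma=0.99, lr=0.01):
    """One REINFORCE update: roll out an episode with the softmax policy,
    then take gradient steps along G_t * grad log pi(a_t | s_t; theta)."""
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:                                   # sample a trajectory
        probs = softmax_policy(theta @ phi(s))
        a = np.random.choice(len(probs), p=probs)
        s2, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s2
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g                    # return G_t
        probs = softmax_policy(theta @ phi(states[t]))
        # grad log pi for a linear-softmax policy: phi(s) on the chosen
        # action's row minus its probability-weighted average over actions
        grad = -np.outer(probs, phi(states[t]))
        grad[actions[t]] += phi(states[t])
        theta = theta + lr * (gamma ** t) * g * grad  # gradient ascent step
    return theta, sum(rewards)
```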
Example Applications – Visual Dialog Learning
Das, A., Kottur, S., Moura, J. M. F., Lee, S., & Batra, D. (2017). Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning. ICCV 2017.
RL over the interaction between a Questioner and an Answerer agent
Example Applications – Manipulation
Trained with an algorithm called PPO (Proximal Policy Optimization)
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. For an overview of OpenAI's work on dexterous manipulation see: https://blog.openai.com/learning-dexterity/
Temporal Difference – Overview
Key idea: update value estimates from previous estimates (bootstrapping)
Running Example: TD-Learning in Malmo
Project Malmo – an AI experimentation platform built on Minecraft
github.com/Microsoft/malmo
Johnson, M., Hofmann, K., Hutton, T., & Bignell, D. (2016). The Malmo Platform for Artificial Intelligence Experimentation. IJCAI 2016.
Running Example: TD-Learning in Malmo
Task: cliff walking – the agent has to learn to navigate to the blue goal block. Adapted from Sutton & Barto 2018, chapter 6.
Try this at home, see https://github.com/Microsoft/malmo – tutorial 6
States, actions, rewards …
[Figures: learning performance over training, and the learned policy]
Challenge: Data Efficiency
Many sampled episodes are needed to estimate returns.
Action-Value (Q) Function
Q_π(s_t, a_t) ≡ E_π[G_t | s_t, a_t] = E_π[Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t, a_t]
Bellman Equations
Q_π(s_t, a_t) ≡ E_π[G_t | s_t, a_t] = E_π[Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t, a_t]
= E_π[r_{t+1} + γ E_π[G_{t+1} | s_{t+1}, a_{t+1}] | s_t, a_t]
= E_π[r_{t+1} + γ Q_π(s_{t+1}, a_{t+1}) | s_t, a_t]
Temporal Difference (TD) Error
Q_π(s_t, a_t) = E_π[r_{t+1} + γ Q_π(s_{t+1}, a_{t+1}) | s_t, a_t]
δ = Q_π(s_t, a_t) − E_π[r_{t+1} + γ Q_π(s_{t+1}, a_{t+1}) | s_t, a_t]
Q-Learning Algorithm
Watkins, C. J. C. H. (1989). Learning from delayed rewards. Doctoral dissertation, King's College, Cambridge. Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4). Algorithm from: Sutton & Barto 2018, chapter 6, page 131.
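A minimal tabular Q-learning sketch following the update rule from Sutton & Barto, chapter 6 (the `env` interface, with `reset`, `step`, and `n_actions`, is a hypothetical stand-in such as the cliff-walking task):

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy.
    Update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)                    # Q[(state, action)] -> value
    actions = list(range(env.n_actions))      # assumed discrete action set
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:     # explore
                a = random.choice(actions)
            else:                             # exploit current estimates
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)
            # TD target bootstraps from the estimate of the next state
            target = r if done else r + gamma * max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```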
Back to the Cliff Walking Example …
Example: update the action value with the observed reward (e.g., r = −0.1) and the current Q-value estimate of the state we ended up in
After 10 minutes of training using Q-Learning:
Q-Learning with Function Approximation
Represent Q with a parametric function approximator, e.g., a deep neural net, and minimize the squared TD error
J(θ) = E[(r_t + γ max_{a′∈A} Q(s_{t+1}, a′; θ⁻) − Q(s_t, a_t; θ))²]
(θ⁻ denotes a separate, periodically updated copy of the parameters)
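A sketch of this objective for a batch of transitions (NumPy; `q_net` and `q_target` are hypothetical functions returning one value per action, standing in for the online network Q(·; θ) and the target network Q(·; θ⁻)):

```python
import numpy as np

def dqn_loss(q_net, q_target, batch, gamma=0.99):
    """Mean squared TD error over a batch:
    J(theta) = E[(r + gamma * max_a' Q(s', a'; theta^-) - Q(s, a; theta))^2]."""
    states, actions, rewards, next_states, dones = batch
    q_sa = q_net(states)[np.arange(len(actions)), actions]  # Q(s_t, a_t; theta)
    # Bootstrapped targets use the frozen target parameters theta^-
    max_next = q_target(next_states).max(axis=1)
    targets = rewards + gamma * (1.0 - dones) * max_next
    return np.mean((targets - q_sa) ** 2)
```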
Case Study: Human-level control through deep reinforcement learning
Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540).
Learning to navigate Minecraft from pixels using DQN
Further Reading
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press, 2nd Edition. http://incompleteideas.net/book/the-book-2nd.html – Chapters 6, 9–11
Abbeel, P., Coates, A., Quigley, M., & Ng, A. Y. (2007). An application of reinforcement learning to aerobatic helicopter flight. NIPS 2007. Project homepage: http://heli.stanford.edu/
Edwards, A. L., et al. (2016). Application of real-time machine learning to myoelectric prosthesis control: A case series in adaptive switching. Prosthetics and Orthotics International, 40(5).