Lecture #6 – Deep Reinforcement Learning
Aykut Erdem // Hacettepe University // Spring 2019
CMP722
ADVANCED COMPUTER VISION
Image: StarCraft II DeepMind feature layer API
Previously on CMP722: image captioning
Illustration: William Joel
Lecture overview
Disclaimer: Much of the material and slides for this lecture were borrowed from
– Katja Hofmann's Deep Learning Indaba 2018 lecture on "Reinforcement Learning"
Decision Making and Learning under Uncertainty
[Slide: choosing a restaurant – Buzz Feathers, TaMaties, Java Junction, Jeff's Place, DCM, Nca'Kos, Vlambojant Hutmakers, Mirriam's Kitchen, Otaku, Roman's Pizza]
Reinforcement Learning (RL): learning to make decisions under uncertainty – successfully applied in a wide range of applications
RL can model a vast range of problems
Three problem areas that motivated RL research: Optimal Control, Games, Animal Learning
Lindquist, J. (1962). Operations of a hydrothermal electric system: A multistage decision process. Transactions of the American Institute of Electrical Engineers. Pereira, M., Campodónico, N., & Kelman, R. (1998). Long-term hydro scheduling based on stochastic models. EPSOM 98.
Photo by Magda Ehlers from Pexels
Long-term consequences in optimal control
Figure from: Mario Pereira, Nora Campodónico, & Rafael Kelman. "Long-term hydro scheduling based on stochastic models." EPSOM 98.
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210–229. Samuel, A. L. (1967). Some studies in machine learning using the game of checkers. II—Recent progress. IBM Journal of Research and Development, 11(6).
Photo credit: https://www.flickr.com/photos/shawnzlea/261793051/
Samuel’s Checkers Player
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275(5306), 1593–1599.
Photo credit: https://www.flickr.com/photos/scorius/750037290
Figure from: Schultz, Dayan, & Montague. "A neural substrate of prediction and reward." Science 1997.
RL as a valuable tool for modelling neurological phenomena
Further Reading
Interfaces, 15(6).
Maia, T. V., & Frank, M. J. (2011). From reinforcement learning models to psychiatric and neurological disorders. Nature Neuroscience, 14(2).
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press, 2nd Edition. http://incompleteideas.net/book/the-book-2nd.html – Chapters 1, 14–16
In RL – an agent interacts with an environment
The agent observes state s_t ∈ S and acts with policy π(a|s), selecting action a_t ∈ A
The environment, with transition dynamics p(s_{t+1} | s_t, a_t) and reward function r(r_{t+1} | s_t, a_t), returns reward r_{t+1} ∈ ℝ and the next state s_{t+1} ∈ S
Photo credit: https://www.flickr.com/photos/steveonjava/8170183457
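To make the interaction loop concrete, here is a minimal sketch in Python. The `env` and `policy` objects are hypothetical stand-ins (not any particular library's API) that follow the s_t, a_t, r_{t+1} notation above.

```python
import random

def run_episode(env, policy, max_steps=100):
    """Roll out one episode of agent-environment interaction.

    Assumes `env` exposes reset() -> state and
    step(action) -> (next_state, reward, done), and that `policy`
    maps a state to a dict {action: probability}; both interfaces
    are illustrative stand-ins, not a specific library's API.
    """
    trajectory = []                      # (s_t, a_t, r_{t+1}) tuples
    state = env.reset()                  # initial state s_0
    for _ in range(max_steps):
        probs = policy(state)            # pi(a | s_t)
        acts, weights = list(probs.keys()), list(probs.values())
        action = random.choices(acts, weights=weights)[0]  # a_t ~ pi
        # The environment applies p(s_{t+1} | s_t, a_t) and emits r_{t+1}
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory
```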
Markov Decision Process (MDP)
Defined by M = (S, A, p, r, γ)
discount factor: γ ∈ (0, 1)
(Markov property: transitions and rewards depend only on the most recent state and action)
Return: G_t = Σ_{k=0}^∞ γ^k r_{t+k+1}
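As a quick numerical check of the return definition, a minimal sketch in plain Python (the function name is mine); it uses the recursion G_t = r_{t+1} + γ G_{t+1}, which reappears in the Bellman equations later:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = sum_{k=0}^{inf} gamma^k * r_{t+k+1} for a finite
    list of rewards [r_{t+1}, r_{t+2}, ...]."""
    g = 0.0
    # Accumulate backwards using G_t = r_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# e.g., rewards [1, 1, 1] with gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71
assert abs(discounted_return([1, 1, 1], 0.9) - 2.71) < 1e-9
```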
State space
Again – an important modelling choice, e.g., for the reservoir example:
a) Discrete states "low" and "high" reservoir level
b) Coarse discretization: "0-10%", "10-20%", …, "90-100%"
c) Continuous states – current reservoir level (e.g., 67%)
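A tiny sketch of these three choices for the reservoir example (plain Python; the function and scheme names are mine):

```python
def reservoir_state(level, scheme="coarse"):
    """Map a continuous reservoir level in [0, 1] to a state under the
    three modelling choices above (scheme names are illustrative)."""
    if scheme == "binary":                      # (a) two discrete states
        return "low" if level < 0.5 else "high"
    if scheme == "coarse":                      # (b) 10% bins
        bin_start = min(int(level * 10), 9) * 10
        return f"{bin_start}-{bin_start + 10}%"
    return level                                # (c) continuous state

print(reservoir_state(0.67, "binary"))      # high
print(reservoir_state(0.67, "coarse"))      # 60-70%
print(reservoir_state(0.67, "continuous"))  # 0.67
```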
Mnih et al. results in Atari – a lesson in generality
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540).
Screenshots from: Kurin, V., Nowozin, S., Hofmann, K., Beyer, L., & Leibe, B. (2017). The Atari Grand Challenge Dataset. http://atarigrandchallenge.com/
Case Study: Investigating Human Priors for Playing Video Games
https://rach0012.github.io/humanRL_website/
Action space
Again – an important modelling choice. Common choices:
a) Discrete, e.g., on/off, which button to press (Atari)
b) Continuous, e.g., how much force to apply, how quickly to accelerate
c) Active research area: large, complex action spaces (e.g., combinatorial, mixed discrete/continuous, natural language)
A Platform for Research: TextWorld
https://www.microsoft.com/en-us/research/project/textworld/
Rewards
Rewards specify the task objective (e.g., the score in Atari)
Reward design strongly shapes the learned solutions
For details and full video: https://blog.openai.com/faulty-reward-functions/
Further Reading
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press, 2nd Edition. http://incompleteideas.net/book/the-book-2nd.html – Chapter 3, 9.5
Policy Gradient: Intuition
Multi-armed bandit problems
Photo credit: https://www.flickr.com/photos/knothing/11264853546/
Slide sequence – four arms, action probabilities updated after each outcome:
.5 .5 .5 .5 → lose ☹ → .45 .45 .55 .55 → win ☺ → .4 .4 .6 .6 → win ☺ → .45 .45 .55 .55
Intuition: decrease the probability of actions that led to losses, increase the probability of actions that led to wins.
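A toy simulation of this intuition (plain Python; the "nudge and renormalize" rule and the hidden win probabilities are my own illustrative choices, not the exact rule behind the slide's numbers):

```python
import random

def update(probs, chosen, won, step=0.05):
    """Nudge the chosen arm's probability up after a win and down after
    a loss (step = 0.05 mirrors the .50 -> .45/.55 jumps on the slides),
    then renormalize so the values still form a distribution."""
    probs = probs.copy()
    probs[chosen] = max(probs[chosen] + (step if won else -step), 1e-6)
    total = sum(probs)
    return [p / total for p in probs]

probs = [0.25, 0.25, 0.25, 0.25]    # start uniform over four arms
win_chance = [0.2, 0.3, 0.7, 0.8]   # hidden per-arm win probabilities (made up)
for _ in range(1000):
    arm = random.choices(range(4), weights=probs)[0]
    won = random.random() < win_chance[arm]
    probs = update(probs, arm, won)
print(probs)  # probability mass drifts toward the arms that win more often
```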
Focus on the Policy: Parametric Form
π(a | s; θ) = exp(h(s, a; θ)) / Σ_{a′∈A} exp(h(s, a′; θ))
π(a | s; θ) – a probability distribution over actions
θ – learnable parameters
h(s, a; θ) – action preferences
the softmax gives normalized probabilities
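A minimal NumPy sketch of this softmax policy, assuming linear action preferences h(s, a; θ) = θ_a · φ(s) (the linear form and all names are illustrative):

```python
import numpy as np

def softmax_policy(preferences):
    """pi(a | s; theta) = exp(h(s,a;theta)) / sum_a' exp(h(s,a';theta))."""
    z = preferences - preferences.max()   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy sizes: 4 actions, 3 state features
theta = np.zeros((4, 3))              # learnable parameters
phi_s = np.array([0.2, -1.0, 0.5])    # feature vector for some state s
print(softmax_policy(theta @ phi_s))  # uniform [0.25 0.25 0.25 0.25] for theta = 0
```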
Policy Gradient Objective
Goal: find parameters θ that maximize expected reward
J(θ) = Σ_{s∈S} μ_π(s) Σ_{a∈A} π(a | s; θ) r_{s,a}
where μ_π(s) is the distribution over states visited under policy π
Challenge: computing updates to the parameterized policy depends on the unknown environment dynamics
The Policy Gradient Theorem
Key insight: the gradient of J does not require derivatives of μ_π(s)
∇_θ J(θ) ∝ Σ_{s∈S} μ_π(s) Σ_{a∈A} q_π(s, a) ∇_θ π(a | s; θ)
The sums can be estimated from experience by sampling and reweighting
Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. NIPS 2000. See Sutton & Barto 2018, chapter 13
Policy Gradient Algorithm: REINFORCE
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4), 229–256.
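A compact REINFORCE sketch for the softmax policy above (NumPy; the `env` interface and feature map `phi` are hypothetical stand-ins, and `softmax_policy` is the function sketched earlier):

```python
import numpy as np

def reinforce_episode(env, theta, phi, gamma=0.99, lr=0.01):
    """One REINFORCE update: roll out an episode with the softmax policy,
    then take gradient steps along G_t * grad log pi(a_t | s_t; theta)."""
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:                                   # sample a trajectory
        probs = softmax_policy(theta @ phi(s))
        a = np.random.choice(len(probs), p=probs)
        s2, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s2
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g                    # return G_t
        probs = softmax_policy(theta @ phi(states[t]))
        # grad log pi for a linear-softmax policy: phi(s) on the chosen
        # action's row minus its probability-weighted average over actions
        grad = -np.outer(probs, phi(states[t]))
        grad[actions[t]] += phi(states[t])
        theta = theta + lr * (gamma ** t) * g * grad  # gradient ascent step
    return theta, sum(rewards)
```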
Example Applications – Visual Dialog Learning
Das, A., Kottur, S., Moura, J. M. F., Lee, S., & Batra, D. (2017). Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning. ICCV 2017.
RL over the interaction between a Questioner and an Answerer agent
Example Applications – Manipulation
Trained with an algorithm called PPO (Proximal Policy Optimization)
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. For an overview of OpenAI's work on dexterous manipulation see: https://blog.openai.com/learning-dexterity/
Temporal Difference – Overview
Key idea: update value estimates from previous estimates (bootstrapping)
Running Example: TD-Learning in Malmo
Project Malmo – an AI experimentation platform built on Minecraft
github.com/Microsoft/malmo
Johnson, M., Hofmann, K., Hutton, T., & Bignell, D. (2016). The Malmo Platform for Artificial Intelligence Experimentation. IJCAI 2016.
Running Example: TD-Learning in Malmo
Task: cliff walking – the agent has to learn to navigate to the blue goal block. Adapted from Sutton & Barto 2018, chapter 6.
Try this at home, see https://github.com/Microsoft/malmo – tutorial 6
States, actions, rewards …
[Figures: learning performance over training, and the learned policy]
Challenge: Data Efficiency
Many sampled episodes are needed to estimate returns.
Action-Value (Q) Function
Q_π(s_t, a_t) ≡ E_π[G_t | s_t, a_t] = E_π[Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t, a_t]
Bellman Equations
Q_π(s_t, a_t) ≡ E_π[G_t | s_t, a_t] = E_π[Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t, a_t]
= E_π[r_{t+1} + γ E_π[G_{t+1} | s_{t+1}, a_{t+1}] | s_t, a_t]
= E_π[r_{t+1} + γ Q_π(s_{t+1}, a_{t+1}) | s_t, a_t]
Temporal Difference (TD) Error
Q_π(s_t, a_t) = E_π[r_{t+1} + γ Q_π(s_{t+1}, a_{t+1}) | s_t, a_t]
δ = Q_π(s_t, a_t) − E_π[r_{t+1} + γ Q_π(s_{t+1}, a_{t+1}) | s_t, a_t]
Q-Learning Algorithm
Watkins, C. J. C. H. (1989). Learning from delayed rewards. Doctoral dissertation, King's College, Cambridge. Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4). Algorithm from: Sutton & Barto 2018, chapter 6, page 131.
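A minimal tabular Q-learning sketch following the update rule from Sutton & Barto, chapter 6 (the `env` interface, with `reset`, `step`, and `n_actions`, is a hypothetical stand-in such as the cliff-walking task):

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy.
    Update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)                    # Q[(state, action)] -> value
    actions = list(range(env.n_actions))      # assumed discrete action set
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:     # explore
                a = random.choice(actions)
            else:                             # exploit current estimates
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)
            # TD target bootstraps from the estimate of the next state
            target = r if done else r + gamma * max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```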
Back to the Cliff Walking Example …
Example: update the action value with the observed reward (e.g., r = −0.1) and the current Q-value estimate of the state we ended up in
After 10 minutes of training using Q-Learning:
Q-Learning with Function Approximation
Represent Q with a parametric function approximator, e.g., a deep neural net, and minimize the squared TD error
J(θ) = E[(r_t + γ max_{a′∈A} Q(s_{t+1}, a′; θ⁻) − Q(s_t, a_t; θ))²]
(θ⁻ denotes a separate, periodically updated copy of the parameters)
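A sketch of this objective for a batch of transitions (NumPy; `q_net` and `q_target` are hypothetical functions returning one value per action, standing in for the online network Q(·; θ) and the target network Q(·; θ⁻)):

```python
import numpy as np

def dqn_loss(q_net, q_target, batch, gamma=0.99):
    """Mean squared TD error over a batch:
    J(theta) = E[(r + gamma * max_a' Q(s', a'; theta^-) - Q(s, a; theta))^2]."""
    states, actions, rewards, next_states, dones = batch
    q_sa = q_net(states)[np.arange(len(actions)), actions]  # Q(s_t, a_t; theta)
    # Bootstrapped targets use the frozen target parameters theta^-
    max_next = q_target(next_states).max(axis=1)
    targets = rewards + gamma * (1.0 - dones) * max_next
    return np.mean((targets - q_sa) ** 2)
```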
Case Study: Human-level control through deep reinforcement learning
Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540).
Learning to navigate Minecraft from pixels using DQN
Further Reading
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press, 2nd Edition. http://incompleteideas.net/book/the-book-2nd.html – Chapters 6, 9–11
Abbeel, P., Coates, A., Quigley, M., & Ng, A. Y. (2007). An application of reinforcement learning to aerobatic helicopter flight. NIPS 2007. Project homepage: http://heli.stanford.edu/
Edwards, A. L., et al. (2016). Application of real-time machine learning to myoelectric prosthesis control: A case series in adaptive switching. Prosthetics and Orthotics International, 40(5).