CMP722 ADVANCED COMPUTER VISION - Lecture #6: Deep Reinforcement Learning
slide-1
SLIDE 1

Lecture #6 – Deep Reinforcement Learning

Aykut Erdem // Hacettepe University // Spring 2019

CMP722

ADVANCED COMPUTER VISION

Image: StarCraft II DeepMind feature layer API

slide-2
SLIDE 2
Previously on CMP722

  • image captioning
  • visual question answering
  • case study: neural module networks

Illustration: William Joel

slide-3
SLIDE 3

Lecture overview

  • case studies (and a bit of history)
  • formalizing reinforcement learning
  • policy gradient methods
  • temporal differences, Q-learning

Disclaimer: Much of the material and slides for this lecture were borrowed from Katja Hofmann’s Deep Learning Indaba 2018 lecture on "Reinforcement Learning".

3

slide-4
SLIDE 4

Decision Making and Learning under Uncertainty

4

Image: choosing among restaurants (Buzz Feathers, Java Junction, Jeff’s Place, Roman’s Pizza, ...) as an everyday example of decision making under uncertainty

slide-5
SLIDE 5

Reinforcement Learning (RL)

  • the science and engineering of decision making and learning under uncertainty
  • a type of machine learning that models learning from experience in a wide range of applications

5

slide-6
SLIDE 6

Case Studies (and a bit of history)

6

slide-7
SLIDE 7

RL can model a vast range of problems

  • Example problems that motivated RL research: optimal control, games, animal learning

7

slide-8
SLIDE 8

Lindquist, J. 1962, "Operations of a hydrothermal electric system: A multistage decision process." Transactions of the American Institute of Electrical Engineers. Mario Pereira, Nora Campodónico, & Rafael Kelman, 1998, "Long-term hydro scheduling based on stochastic models." EPSOM 98.

Photo by Magda Ehlers from Pexels

8

slide-9
SLIDE 9

Long-term consequences in optimal control

Figure from: Mario Pereira, Nora Campodónico, & Rafael Kelman. "Long-term hydro scheduling based on stochastic models." EPSOM 98.

9

slide-10
SLIDE 10

Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210–229.
Samuel, A. L. (1967). Some studies in machine learning using the game of checkers. II: Recent progress. IBM Journal of Research and Development, 11(6), 601–617.

Photo credit: https://www.flickr.com/photos/shawnzlea/261793051/

10

slide-11
SLIDE 11

Samuel’s Checkers Player

11

slide-12
SLIDE 12

Schultz, Wolfram, Peter Dayan, and P. Read Montague. "A neural substrate of prediction and reward." Science 275.5306 (1997): 1593-1599.

Photo credit: https://www.flickr.com/photos/scorius/750037290

12

slide-13
SLIDE 13

RL as a valuable tool for modelling neurological phenomena

Figure from: Schultz, Wolfram, Peter Dayan, and P. Read Montague. "A neural substrate of prediction and reward." Science 1997.

13

slide-14
SLIDE 14

Further Reading

  • White, D. J. (1985). Real applications of Markov decision processes. Interfaces, 15(6).
  • White, D. J. (1988). Further real applications of Markov decision processes. Interfaces, 18(5).
  • Maia, Tiago V., and Michael J. Frank. "From reinforcement learning models to psychiatric and neurological disorders." Nature Neuroscience 14.2 (2011).
  • Sutton, R. S., & Barto, A. G. (2017). Reinforcement learning: An introduction. MIT Press, 2nd Edition. http://incompleteideas.net/book/the-book-2nd.html Chapters 1, 14-16

14

slide-15
SLIDE 15

Formalizing Reinforcement Learning

15

slide-16
SLIDE 16

In RL – agent interacts with an environment

Photo credit: https://www.flickr.com/photos/steveonjava/8170183457

agent environment

16

slide-17
SLIDE 17

In RL – agent interacts with an environment

state $s_t \in \mathcal{S}$

agent environment

17

Photo credit: https://www.flickr.com/photos/steveonjava/8170183457

slide-18
SLIDE 18

In RL – agent interacts with an environment

state $s_t \in \mathcal{S}$

agent environment

action $a_t \in \mathcal{A}$

18

Photo credit: https://www.flickr.com/photos/steveonjava/8170183457

slide-19
SLIDE 19

In RL – agent interacts with an environment

state $s_t \in \mathcal{S}$

agent environment

action $a_t \in \mathcal{A}$, reward $r_t \in \mathbb{R}$

19

Photo credit: https://www.flickr.com/photos/steveonjava/8170183457

slide-20
SLIDE 20

In RL – agent interacts with an environment

state $s_t \in \mathcal{S}$

agent environment

action $a_t \in \mathcal{A}$, reward $r_t \in \mathbb{R}$

20

Photo credit: https://www.flickr.com/photos/steveonjava/8170183457

slide-21
SLIDE 21

In RL – agent interacts with an environment

state $s_t \in \mathcal{S}$

agent environment

action $a_t \in \mathcal{A}$, reward $r_t \in \mathbb{R}$

21

Photo credit: https://www.flickr.com/photos/steveonjava/8170183457

slide-22
SLIDE 22

In RL – agent interacts with an environment

state $s_{t+1} \in \mathcal{S}$

agent

acts with policy $\pi(a \mid s)$

environment

action $a_t \in \mathcal{A}$, reward $r_{t+1} \in \mathbb{R}$

22

Photo credit: https://www.flickr.com/photos/steveonjava/8170183457

slide-23
SLIDE 23

In RL – agent interacts with an environment

state $s_{t+1} \in \mathcal{S}$

agent

acts with policy $\pi(a \mid s)$

environment

transition dynamics $p(s_{t+1} \mid s_t, a_t)$ and reward function $r(r_{t+1} \mid s_t, a_t)$

action $a_t \in \mathcal{A}$, reward $r_{t+1} \in \mathbb{R}$

23

Photo credit: https://www.flickr.com/photos/steveonjava/8170183457
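A minimal sketch of this interaction loop in Python (illustrative only: the `env` and `policy` objects and their Gym-style reset()/step() interface are assumptions, not part of the slides):

```python
# Sketch of the agent-environment loop: the agent observes s_t, acts with
# a_t ~ pi(a|s_t), and the environment returns r_{t+1} and s_{t+1}.
def run_episode(env, policy, max_steps=1000):
    state = env.reset()                               # s_0
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy.sample(state)                 # a_t ~ pi(a | s_t)
        next_state, reward, done = env.step(action)   # dynamics p and reward r
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```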

slide-24
SLIDE 24

Markov Decision Process (MDP)

Defined by $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r, \gamma)$

discount factor: $\gamma \in (0,1)$

  • Key assumption: Markov property (dynamics only depend on most recent state and action)

24

slide-25
SLIDE 25

Markov Decision Process (MDP)

Defined by $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r, \gamma)$

discount factor: $\gamma \in (0,1)$

  • Key assumption: Markov property (dynamics only depend on most recent state and action)
  • Define goal: take actions that maximize the (discounted) cumulative return

$G_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1}$

25
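A small sketch of this return in Python, computed backwards over a list of observed rewards (the reward values and discount factor are illustrative):

```python
# G_t = sum_k gamma^k r_{t+k+1}, accumulated from the end of the episode
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.9**2 * 1.0 = 0.81
```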

slide-26
SLIDE 26

Examples – States, Actions, Rewards

26

slide-27
SLIDE 27

State space

  • Important modelling choice: how to represent the problem?
  • Example: hydroelectric power control problem
  • Consider choices:

a) Discrete states: “low” and “high” reservoir level
b) Coarse discretization: “0-10%”, “10-20%”, …, “90-100%”
c) Continuous states: current reservoir level (e.g., 67%)

27

slide-28
SLIDE 28

State space

  • Important modelling choice: how to represent the problem?
  • Considerations:
  • Is the Markov property satisfied?
  • (How) can prior (expert) knowledge be encoded?
  • Effects on optimal solution?
  • Effects on data efficiency?

28

slide-29
SLIDE 29

Mnih et al. results in Atari – a lesson in generality

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540).

Screenshots from: Kurin, V., Nowozin, S., Hofmann, K., Beyer, L., & Leibe, B. (2017). The Atari Grand Challenge Dataset. http://atarigrandchallenge.com/

29

slide-30
SLIDE 30

Case Study: Investigating Human Priors for Playing Video Games

  • Dubey, R., Agrawal, P., Pathak, D., Griffiths, T. L., & Efros, A. A, ICML 2018.

https://rach0012.github.io/humanRL_website/

30

slide-31
SLIDE 31

31

Case Study: Investigating Human Priors for Playing Video Games

  • Dubey, R., Agrawal, P., Pathak, D., Griffiths, T. L., & Efros, A. A, ICML 2018.

https://rach0012.github.io/humanRL_website/

slide-32
SLIDE 32

32

Case Study: Investigating Human Priors for Playing Video Games

  • Dubey, R., Agrawal, P., Pathak, D., Griffiths, T. L., & Efros, A. A, ICML 2018.
slide-33
SLIDE 33

33

Case Study: Investigating Human Priors for Playing Video Games

  • Dubey, R., Agrawal, P., Pathak, D., Griffiths, T. L., & Efros, A. A, ICML 2018.
slide-34
SLIDE 34

34

Case Study: Investigating Human Priors for Playing Video Games

  • Dubey, R., Agrawal, P., Pathak, D., Griffiths, T. L., & Efros, A. A, ICML 2018.
slide-35
SLIDE 35

35

Case Study: Investigating Human Priors for Playing Video Games

  • Dubey, R., Agrawal, P., Pathak, D., Griffiths, T. L., & Efros, A. A, ICML 2018.
slide-36
SLIDE 36

Action space

Again – important modelling choice, common:
a) Discrete, e.g., on/off, which button to press (Atari)
b) Continuous, e.g., how much force to apply, how quickly to accelerate
c) Active research area: large, complex action spaces (e.g., combinatorial, mixed discrete/continuous, natural language)

  • Trade-offs include: data efficiency, generalization

36

slide-37
SLIDE 37

A Platform for Research: TextWorld

https://www.microsoft.com/en-us/research/project/textworld/

37

slide-38
SLIDE 38

Rewards

  • Key Question: where do RL agents’ goals come from?
  • In some settings – natural reward signal may be available (e.g., game score in Atari)
  • More typically – important modelling choice with strong effects on learned solutions

38

slide-39
SLIDE 39

Rewards

For details and full video: https://blog.openai.com/faulty-reward-functions/

39

slide-40
SLIDE 40

Further Reading

  • Sutton, R. S., & Barto, A. G. (2017). Reinforcement learning: An introduction. MIT Press, 2nd Edition. http://incompleteideas.net/book/the-book-2nd.html Chapters 3, 9.5

40

slide-41
SLIDE 41

RL Approaches 1: Policy Gradient Methods

41

slide-42
SLIDE 42

Policy Gradient: Intuition

  • Focus on learning a good behaviour policy

42

slide-43
SLIDE 43

Policy Gradient: Intuition

  • Example: Learning in multi-armed bandit problems

Photo credit: https://www.flickr.com/photos/knothing/11264853546/

43

slide-44
SLIDE 44

Policy Gradient: Intuition

.5 .5 .5 .5

44

  • Example: Learning in

Multi-armed bandit problems

Photo credit: https://www.flickr.com/photos/knothing/11264853546/

slide-45
SLIDE 45

Policy Gradient: Intuition

.5 .5 .5 .5

45

  • Example: Learning in

Multi-armed bandit problems

Photo credit: https://www.flickr.com/photos/knothing/11264853546/

slide-46
SLIDE 46

Policy Gradient: Intuition

.5 .5 .5 .5 – Lose!

46

  • Example: Learning in

Multi-armed bandit problems

Photo credit: https://www.flickr.com/photos/knothing/11264853546/

slide-47
SLIDE 47

Policy Gradient: Intuition

.45 .45 .55 .55

47

  • Example: Learning in

Multi-armed bandit problems

Photo credit: https://www.flickr.com/photos/knothing/11264853546/

slide-48
SLIDE 48

Policy Gradient: Intuition

.45 .45 .55 .55

48

  • Example: Learning in

Multi-armed bandit problems

Photo credit: https://www.flickr.com/photos/knothing/11264853546/

slide-49
SLIDE 49

Policy Gradient: Intuition

.45 .45 .55 .55 – Win!

49

  • Example: Learning in

Multi-armed bandit problems

Photo credit: https://www.flickr.com/photos/knothing/11264853546/

slide-50
SLIDE 50

Policy Gradient: Intuition

.4 .4 .6 .6

50

  • Example: Learning in

Multi-armed bandit problems

Photo credit: https://www.flickr.com/photos/knothing/11264853546/

slide-51
SLIDE 51

Policy Gradient: Intuition

.4 .4 .6 .6

51

  • Example: Learning in

Multi-armed bandit problems

Photo credit: https://www.flickr.com/photos/knothing/11264853546/

slide-52
SLIDE 52

Policy Gradient: Intuition

.4 .4 .6 .6 – Win!

52

  • Example: Learning in

Multi-armed bandit problems

Photo credit: https://www.flickr.com/photos/knothing/11264853546/

slide-53
SLIDE 53

Policy Gradient: Intuition

.45 .45 .55 .55

53

  • Example: Learning in

Multi-armed bandit problems

Photo credit: https://www.flickr.com/photos/knothing/11264853546/

slide-54
SLIDE 54

Policy Gradient: Intuition

  • Focus on learning a good behaviour policy
  • Repeat:
  • 1. Collect experience using the current policy
  • 2. Update the policy towards better outcomes

54

slide-55
SLIDE 55

Focus on the Policy: Parametric Form

  • Most common parameterization:

! " #; % = '((*,,;-) ∑,0∈2 '( *,,0;-

55

slide-56
SLIDE 56

Focus on the Policy: Parametric Form

  • Most common parameterization:

! " #; % = '((*,,;-) ∑,0∈2 '( *,,0;-

probability dist stribution

56

slide-57
SLIDE 57

Focus on the Policy: Parametric Form

  • Most common parameterization:

! " #; % = '((*,,;-) ∑,0∈2 '( *,,0;-

le learnable le p parameters

57

slide-58
SLIDE 58

Focus on the Policy: Parametric Form

  • Most common parameterization:

! " #; % = '((*,,;-) ∑,0∈2 '( *,,0;-

act ction preference ces s

58

slide-59
SLIDE 59

Focus on the Policy: Parametric Form

  • Most common parameterization:

! " #; % = '((*,,;-) ∑,0∈2 '( *,,0;-

nor normalized d pr proba

  • babi

bilities

59
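A sketch of this softmax parameterization in Python, assuming (purely for illustration) linear action preferences h(s, a; θ) over state features:

```python
import numpy as np

def softmax_policy(theta, state_features):
    prefs = state_features @ theta        # action preferences h(s, a; theta)
    prefs = prefs - prefs.max()           # subtract max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()    # normalized probability distribution

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 3))           # 4 state features, 3 actions (illustrative)
probs = softmax_policy(theta, rng.normal(size=4))
action = rng.choice(3, p=probs)           # sample a ~ pi(. | s; theta)
```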

slide-60
SLIDE 60

Policy Gradient Objective

  • Goal: find parameters $\theta$ that maximize expected reward

$J(\theta) = \sum_{s \in \mathcal{S}} \mu_\pi(s) \sum_{a \in \mathcal{A}} \pi(a \mid s; \theta) \, r_{s,a}$

60
slide-61
SLIDE 61

Policy Gradient Objective

  • Goal: find parameters $\theta$ that maximize expected reward

$J(\theta) = \sum_{s \in \mathcal{S}} \mu_\pi(s) \sum_{a \in \mathcal{A}} \pi(a \mid s; \theta) \, r_{s,a}$

61

slide-62
SLIDE 62

Policy Gradient Objective

  • Goal: find parameters $\theta$ that maximize expected reward

$J(\theta) = \sum_{s \in \mathcal{S}} \mu_\pi(s) \sum_{a \in \mathcal{A}} \pi(a \mid s; \theta) \, r_{s,a}$

62

slide-63
SLIDE 63

Policy Gradient Objective

  • Goal: find parameters $\theta$ that maximize expected reward

$J(\theta) = \sum_{s \in \mathcal{S}} \mu_\pi(s) \sum_{a \in \mathcal{A}} \pi(a \mid s; \theta) \, r_{s,a}$

63

slide-64
SLIDE 64

Policy Gradient Objective

  • Goal: find parameters $\theta$ that maximize expected reward

$J(\theta) = \sum_{s \in \mathcal{S}} \mu_\pi(s) \sum_{a \in \mathcal{A}} \pi(a \mid s; \theta) \, r_{s,a}$

  • Challenge: compute updates to the parameterized policy – which depends on the unknown environment dynamics

64

slide-65
SLIDE 65

The Policy Gradient Theorem

  • Key insight: the gradient of $J$ does not require derivatives of $\mu_\pi(s)$

$\nabla_\theta J \propto \sum_{s \in \mathcal{S}} \mu_\pi(s) \sum_{a \in \mathcal{A}} q_\pi(s, a) \, \nabla_\theta \pi(a \mid s; \theta)$

Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. NIPS 2000. See Sutton & Barto 2018, chapter 13

65

slide-66
SLIDE 66

The Policy Gradient Theorem

  • Terms can be estimated from data!

∇" # ∝ %

&∈(

)*(,) %

.∈/

0&

.∇1 2|,; #

66

Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. NIPS 2000. See Sutton & Barto 2018, chapter 13

slide-67
SLIDE 67

The Policy Gradient Theorem

  • Terms can be estimated from data!

∇" # ∝ %

&∈(

)*(,) %

.∈/

0&

.∇1 2|,; # and reweighting

67

Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. NIPS 2000. See Sutton & Barto 2018, chapter 13
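One way to read the reweighting step (a standard identity, added here for completeness rather than taken verbatim from the slides): since $\nabla_\theta \pi = \pi \, \nabla_\theta \log \pi$, the double sum becomes an expectation under the policy and its on-policy state distribution, which is what makes it estimable from sampled interaction:

$\nabla_\theta J \propto \sum_{s \in \mathcal{S}} \mu_\pi(s) \sum_{a \in \mathcal{A}} \pi(a \mid s; \theta) \, q_\pi(s, a) \, \nabla_\theta \log \pi(a \mid s; \theta) = \mathbb{E}_\pi\left[ q_\pi(s, a) \, \nabla_\theta \log \pi(a \mid s; \theta) \right]$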

slide-68
SLIDE 68

The Policy Gradient Theorem

  • Computing the gradient

∇" # ∝ %

&∈(

)*(,) %

.∈/

0&

.∇1 2|,; #

68

Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. NIPS 2000. See Sutton & Barto 2018, chapter 13

slide-69
SLIDE 69

Policy Gradient Algorithm: REINFORCE

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4), 229-256.

Algorithm from: Sutton & Barto 2018, chapter 13, page 328

69
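A compact REINFORCE sketch in Python, under the same illustrative assumptions as before (linear softmax policy, hypothetical Gym-style `env`); it renders the "collect experience, then update the policy" loop from the earlier slide, not the exact listing in Sutton & Barto:

```python
import numpy as np

def softmax(prefs):
    prefs = prefs - prefs.max()
    e = np.exp(prefs)
    return e / e.sum()

def reinforce(env, n_features, n_actions, episodes=1000, alpha=0.01, gamma=0.99):
    theta = np.zeros((n_features, n_actions))
    for _ in range(episodes):
        # 1. collect experience using the current policy
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            probs = softmax(s @ theta)
            a = np.random.choice(n_actions, p=probs)
            s_next, r, done = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next
        # 2. update the policy towards better outcomes
        g = 0.0
        for t in reversed(range(len(rewards))):
            g = rewards[t] + gamma * g                 # return G_t
            probs = softmax(states[t] @ theta)
            # grad of log pi(a_t|s_t) for a linear softmax policy:
            # features x (one_hot(a_t) - pi(.|s_t))
            grad_log_pi = np.outer(states[t], -probs)
            grad_log_pi[:, actions[t]] += states[t]
            theta += alpha * (gamma ** t) * g * grad_log_pi
    return theta
```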

slide-70
SLIDE 70

Example Applications – Visual Dialog Learning

  • Das, A., Kottur, S., Moura, J. M., Lee, S., & Batra, D.

(2017). Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning. ICCV, 2017.

  • Use REINFORCE to optimize interaction between Questioner and Answerer agents

70

slide-71
SLIDE 71

Example Applications - Manipulation

  • Uses a highly scalable implementation of an advanced Policy Gradient algorithm called PPO (Proximal Policy Optimization)

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

For OpenAI’s work on dexterous manipulation see: https://blog.openai.com/learning-dexterity/

71

slide-72
SLIDE 72

RL Approaches 2: Temporal Differences, Q-Learning

72

slide-73
SLIDE 73

Temporal Difference - Overview

  • Policy Gradient: focused on policy improvement
  • Here: focus on estimating policy value
  • Intuitively – estimate how good an action is in a given situation
  • Key insight: values can be estimated efficiently by bootstrapping from previous estimates

73

slide-74
SLIDE 74

Running Example: TD-Learning in Malmo

Project Malmo

  • A platform for AI experimentation, built on Minecraft
  • microsoft.com/en-us/research/project/project-malmo/
  • Open source on GitHub: github.com/Microsoft/malmo

The Malmo Platform for Artificial Intelligence Experimentation. Matthew Johnson, Katja Hofmann, Tim Hutton, & David Bignell, 2016.

slide-75
SLIDE 75

Running Example: TD-Learning in Malmo

Task: cliff walking – the agent has to learn to navigate to the blue goal block. Adapted from Sutton & Barto 2018, chapter 6.

Try this at home, see https://github.com/Microsoft/malmo - tutorial 6

75

slide-76
SLIDE 76

States, actions, rewards …

76

slide-77
SLIDE 77

Performance of a Random Policy

77

slide-78
SLIDE 78

Challenge: Data Efficiency

  • With basic policy gradient / REINFORCE, require many policy rollouts to estimate returns.

  • Can we do better?

78

slide-79
SLIDE 79

Challenge: Data Efficiency

  • With basic policy gradient / REINFORCE, require many policy rollouts to estimate returns.

  • Can we do better?
  • Yes – using ideas from dynamic programming

79

slide-80
SLIDE 80

Action-Value (Q) Function

  • Define the action-value function

Q" #$, &$ ≡ (" )$ #$, &$ = (" +

,-. /

0,1

$2,23|#$, &$

80

slide-81
SLIDE 81

Action-Value (Q) Function

  • Define the action-value function

Q" #$, &$ ≡ (" )$ #$, &$ = (" +

,-. /

0,1

$2,23|#$, &$

81

slide-82
SLIDE 82

Bellman Equations

  • The Bellman equation defines Q recursively:

82

$Q_\pi(s_t, a_t) \equiv \mathbb{E}_\pi\left[ G_t \mid s_t, a_t \right] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1} \,\middle|\, s_t, a_t \right]$

$= \mathbb{E}_\pi\left[ r_{t+1} + \gamma \, \mathbb{E}_\pi\left[ G_{t+1} \mid s_{t+1}, a_{t+1} \right] \,\middle|\, s_t, a_t \right]$

slide-83
SLIDE 83

Bellman Equations

  • The Bellman equation defines Q recursively:

$Q_\pi(s_t, a_t) \equiv \mathbb{E}_\pi\left[ G_t \mid s_t, a_t \right] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1} \,\middle|\, s_t, a_t \right]$

$= \mathbb{E}_\pi\left[ r_{t+1} + \gamma \, \mathbb{E}_\pi\left[ G_{t+1} \mid s_{t+1}, a_{t+1} \right] \,\middle|\, s_t, a_t \right]$

$= \mathbb{E}_\pi\left[ r_{t+1} + \gamma \, Q_\pi(s_{t+1}, a_{t+1}) \,\middle|\, s_t, a_t \right]$

83

slide-84
SLIDE 84

Temporal Difference (TD) Error

  • If Q-value estimates are accurate, the following must hold:

Q" #$, &$ = (" )

$*+ + -Q" #$*+, &$*+ #$, &$

84

slide-85
SLIDE 85

Temporal Difference (TD) Error

  • If Q-value estimates are accurate, the following must hold:

Q" #$, &$ = (" )

$*+ + -Q" #$*+, &$*+ #$, &$

  • If not, there is an error:

. = Q" #$, &$ − (" )

$*+ + -Q" #$*+, &$*+ #$, &$

  • To learn better Q-value estimates – minimize .

85

slide-86
SLIDE 86

Q-Learning Algorithm

Watkins, C. J. C. H. (1989). Learning from delayed rewards (Doctoral dissertation, King's College, Cambridge).
Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3).
Algorithm from: Sutton & Barto 2018, chapter 6, page 131

86
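A tabular Q-learning sketch in Python for a small task like cliff walking (the ε-greedy behaviour policy and all hyperparameters are illustrative assumptions; `env` is again a hypothetical Gym-style environment with discrete states and actions):

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # bootstrap the target from the current estimate at the next state
            td_target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q
```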

slide-87
SLIDE 87

Back to the Cliff Walking Example …

  • Rewards are propagated backwards in time

Example: Update the action value with the observed reward (e.g., r = -0.1) and the current Q-value estimate of the state we ended up in

87
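For concreteness, a worked update with assumed numbers (only r = -0.1 comes from the slide; the step size, discount, and current estimates are illustrative): with $\alpha = 0.5$, $\gamma = 0.9$, $Q(s, a) = 0$ and $\max_{a'} Q(s', a') = 0.2$, the Q-learning update gives $Q(s, a) \leftarrow 0 + 0.5 \left( -0.1 + 0.9 \cdot 0.2 - 0 \right) = 0.04$.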

slide-88
SLIDE 88

After 10 minutes of training using Q-Learning:

88

slide-89
SLIDE 89

Q-Learning with Function Approximation

  • To generalize over states and actions, parameterize Q with a function approximator, e.g., a deep neural net
  • The TD error serves as loss:

$J(\theta) = \left( r_t + \gamma \max_{a \in \mathcal{A}} Q(s_{t+1}, a; \theta') - Q(s_t, a_t; \theta) \right)^2$

89

slide-90
SLIDE 90

Q-Learning with Function Approximation

  • To generalize over states and actions, parameterize Q with a function approximator, e.g., a deep neural net
  • The TD error serves as loss:

$J(\theta) = \left( r_t + \gamma \max_{a \in \mathcal{A}} Q(s_{t+1}, a; \theta') - Q(s_t, a_t; \theta) \right)^2$

  • And is optimized using gradient descent

90
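A sketch of this loss with a small neural Q-function in PyTorch (network sizes, the separately-copied parameter set θ′, and all hyperparameters are illustrative assumptions, not the exact DQN setup):

```python
import torch
import torch.nn as nn

n_state, n_actions, gamma = 4, 3, 0.99
q_net = nn.Sequential(nn.Linear(n_state, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(n_state, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())      # theta' <- theta (copied periodically)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_loss(s, a, r, s_next, done):
    # Q(s_t, a_t; theta) for the actions actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # r_t + gamma * max_a Q(s_{t+1}, a; theta') as the bootstrapped target
        target = r + gamma * target_net(s_next).max(dim=1).values * (1 - done)
    return ((target - q_sa) ** 2).mean()

# one gradient-descent step on an (artificial) batch of transitions
s = torch.randn(32, n_state); a = torch.randint(0, n_actions, (32,))
r = torch.randn(32); s_next = torch.randn(32, n_state); done = torch.zeros(32)
loss = td_loss(s, a, r, s_next, done)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```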

slide-91
SLIDE 91

Case Study: Human-level control through deep reinforcement learning

  • Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). Nature, 518(7540).

91

slide-92
SLIDE 92

Learning to navigate Minecraft from pixels using DQN

92

slide-93
SLIDE 93

Learning to navigate Minecraft from pixels using DQN

93

slide-94
SLIDE 94

Learning to navigate Minecraft from pixels using DQN

94

slide-95
SLIDE 95

Further Reading

  • Sutton, R. S., & Barto, A. G. (2017). Reinforcement learning: An introduction. MIT Press, 2nd Edition. http://incompleteideas.net/book/the-book-2nd.html Chapters 6, 9-11
  • Abbeel, P., Coates, A., Quigley, M., & Ng, A. Y. (2007). An application of reinforcement learning to aerobatic helicopter flight. NIPS 2007. Project homepage: http://heli.stanford.edu/
  • Edwards, A. L., Dawson, M. R., Hebert, J. S., Sherstan, C., Sutton, R. S., Chan, K. M., & Pilarski, P. M. (2016). Application of real-time machine learning to myoelectric prosthesis control: A case series in adaptive switching. Prosthetics and Orthotics International, 40(5).

95

slide-96
SLIDE 96

Next Lecture: Embodied Vision

96