INF3490 - Biologically Inspired Computing: Reinforcement Learning


SLIDE 1

INF3490 - Biologically inspired computing Reinforcement Learning

Weria Khaksar

October 10, 2018

SLIDE 2

Ghostbusters (1984), The Commuter (2018)

SLIDE 3

"It would be in vain for one Intelligent Being, to set a Rule to the Actions of another, if he had not in his Power, to reward the compliance with, and punish deviation from his Rule, by some Good and Evil, that is not the natural product and consequence of the action itself." (Locke, "Essay", 2.28.6)

"The use of punishments and rewards can at best be a part of the teaching process. Roughly speaking, if the teacher has no other means of communicating to the pupil, the amount of information which can reach him does not exceed the total number of rewards and punishments applied." (Turing (1950), "Computing Machinery and Intelligence")

SLIDE 4

Applications of RL:

From: ”Deconstructing Reinforcement Learning” ICML 2009

SLIDE 5

Examples:

• Barrett WAM robot learning to flip pancakes by reinforcement learning
• Socially Aware Motion Planning with Deep Reinforcement Learning
• Hierarchical Reinforcement Learning for Robot Navigation
• Google DeepMind's Deep Q-learning playing Atari Breakout

SLIDE 6

Last time: Supervised learning

[Diagram: an untrained classifier sees an image and outputs "CAT"; the supervisor answers "No, it was a dog", and the classifier parameters are adjusted.]

SLIDE 7

Supervised learning: Weight updates

Inputs $x_1, x_2, \dots, x_n$ arrive with weights $w_1, w_2, \dots, w_n$.

Activation: $a = \sum_{i=1}^{n} w_i x_i$

Output: $y = \begin{cases} 1 & \text{if } a \ge \theta \\ 0 & \text{if } a < \theta \end{cases}$

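The slide shows only the forward pass. As a reminder of how the known error drives learning here (the point of contrast with RL on the next slides), below is a minimal Python sketch, assuming the standard perceptron rule $w_i \leftarrow w_i + \eta\,(t - y)\,x_i$ from the earlier lectures; the function names and the learning rate default are illustrative.

```python
import numpy as np

def output(x, w, theta):
    """Threshold neuron from the slide: y = 1 if a >= theta else 0."""
    a = np.dot(w, x)                 # activation a = sum_i w_i * x_i
    return 1 if a >= theta else 0

def train_step(x, target, w, theta, eta=0.1):
    """One supervised update (standard perceptron rule, assumed here):
    the known error (target - y) tells each weight which way to move."""
    y = output(x, w, theta)
    return w + eta * (target - y) * x
```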

SLIDE 8

Reinforcement Learning: Infrequent Feedback

[Diagram: the agent plays chess; 50 moves later the only feedback is "You lost", and the chess-playing strategy must be updated.]

SLIDE 9

How do we update our system now? We don’t know the error!


SLIDES 10-12

[Images only.]

SLIDES 13-19

The reinforcement learning problem: State, Action and Reward

[Diagram, built up over several slides: the agent observes the state, chooses an action ("Move piece from J1 to H1"), and receives a reward ("You took an opponent's piece. Reward = 1").]

SLIDE 20

The reinforcement learning problem: State, Action and Reward

• Learning is guided by the reward: an infrequent numerical feedback indicating how well we are doing.
• Problems:
– The reward does not tell us what we should have done!
– The reward may be delayed, and does not always indicate when we made a mistake.

SLIDE 21

The reinforcement learning problem: The Reward Function

• Corresponds to the fitness function of an evolutionary algorithm.
• The reward $r_{t+1}$ is a function of $(s_t, a_t)$.
• The reward is a numeric value. Can be negative ("punishment").
• Can be given throughout the learning episode, or only in the end.
• Goal: Maximize total reward.

SLIDE 22

The reinforcement learning problem: Maximizing total reward

▪ Total reward: $R = \sum_{t=0}^{N-1} r_{t+1}$

▪ Future rewards may be uncertain, and we might care more about rewards that come soon. Therefore, we discount future rewards:

$R = \sum_{t=0}^{\infty} \gamma^{t} r_{t+1}, \qquad 0 \le \gamma \le 1$

or

$R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}, \qquad 0 \le \gamma \le 1$

SLIDE 23

The reinforcement learning problem: Maximizing total reward

▪ Future reward:

$R = r_1 + r_2 + r_3 + \dots + r_N$

$R_t = r_t + r_{t+1} + r_{t+2} + \dots + r_N$

▪ Discount future rewards (the environment is stochastic):

$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{N-t} r_N = r_t + \gamma (r_{t+1} + \gamma (r_{t+2} + \dots)) = r_t + \gamma R_{t+1}$

▪ A good strategy for an agent is to always choose an action that maximizes the (discounted) future reward; the recursion above is exercised in the sketch below.

SLIDE 24

The reinforcement learning problem: Discounted rewards example

t    0.99^t     0.95^t     0.50^t     0.05^t
1    0.990000   0.950000   0.500000   0.050000
2    0.980100   0.902500   0.250000   0.002500
4    0.960596   0.814506   0.062500   0.000006
8    0.922745   0.663420   0.003906   0.000000
16   0.851458   0.440127   0.000015   0.000000
32   0.724980   0.193711   0.000000   0.000000
64   0.525596   0.037524   0.000000   0.000000
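For reference, the table can be reproduced with a couple of lines of Python:

```python
# Reprint the table of gamma**t for the discount factors on the slide.
for t in (1, 2, 4, 8, 16, 32, 64):
    print(t, *(f"{gamma ** t:.6f}" for gamma in (0.99, 0.95, 0.50, 0.05)))
```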

SLIDE 25

The reinforcement learning problem: Discounted rewards example

[Plot: $\gamma^t$ versus time $t$ (up to 60 steps) for $\gamma$ = 0.99, 0.95, 0.50, 0.05.]

SLIDE 26

The reinforcement learning problem: Action Selection

• At each learning stage, the RL algorithm looks at the possible actions and calculates the expected average reward, $Q_{s,t}(a)$.
• Based on $Q_{s,t}(a)$, an action is selected using one of the strategies below (see the sketch that follows):
➢ Greedy strategy: pure exploitation
➢ ε-greedy strategy: exploitation with a little exploration
➢ Soft-max strategy: $P(Q_{s,t}(a)) = \dfrac{e^{Q_{s,t}(a)/\tau}}{\sum_{b} e^{Q_{s,t}(b)/\tau}}$
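A minimal Python sketch of the three strategies over a vector of Q-values (the function names and the defaults for epsilon and tau are illustrative):

```python
import numpy as np

rng = np.random.default_rng()

def greedy(q):
    """Pure exploitation: always pick the action with the highest Q-value."""
    return int(np.argmax(q))

def epsilon_greedy(q, epsilon=0.1):
    """Exploit, but with probability epsilon pick a random action (explore)."""
    return int(rng.integers(len(q))) if rng.random() < epsilon else greedy(q)

def softmax(q, tau=1.0):
    """P(a) proportional to exp(Q(a)/tau); tau is the temperature."""
    p = np.exp((q - np.max(q)) / tau)   # subtract max for numerical stability
    return int(rng.choice(len(q), p=p / p.sum()))
```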

SLIDE 27

The reinforcement learning problem: Policy ($\pi$) and Value ($V$)

▪ The set of actions we take defines our policy ($\pi$).
▪ The expected rewards we get in return define our value ($V$).

SLIDE 28

The reinforcement learning problem: Markov Decision Process

• If we only need to know the current state, the problem has the Markov property.
• Without the Markov property:

$\Pr(r_t = r', s_{t+1} = s' \mid s_t, a_t, r_{t-1}, \dots, r_1, s_1, a_1, s_0, a_0)$

• With the Markov property:

$\Pr(r_t = r', s_{t+1} = s' \mid s_t, a_t)$

SLIDE 29

The reinforcement learning problem: Markov Decision Process

[Figure: a simple example of a Markov Decision Process.]

SLIDE 30

The reinforcement learning problem: Value

• The expected future reward is known as the value.
• Two ways to compute the value:
– The value of a state, $V(s)$, averaged over all possible actions in that state (state-value function):
$V(s) = E[r_t \mid s_t = s] = E\left[\sum_{i=0}^{\infty} \gamma^i r_{t+i+1} \,\middle|\, s_t = s\right]$
– The value of a state/action pair, $Q(s, a)$ (action-value function):
$Q(s, a) = E[r_t \mid s_t = s, a_t = a] = E\left[\sum_{i=0}^{\infty} \gamma^i r_{t+i+1} \,\middle|\, s_t = s, a_t = a\right]$
• $Q$ and $V$ are initially unknown, and are learned iteratively as we gain experience (a sampling-based sketch follows below).
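One simple way to estimate a value from experience is Monte Carlo averaging (one option among several; the slides turn to temporal-difference methods next): average the discounted return over sampled episodes. A sketch, assuming a hypothetical environment interface with reset(state) and step(action) -> (next_state, reward, done):

```python
def mc_state_value(env, s0, policy, gamma=0.9, n_episodes=1000):
    """Estimate V(s0) = E[sum_i gamma^i r_{t+i+1} | s_t = s0] by averaging
    the discounted return over many episodes started in s0."""
    total = 0.0
    for _ in range(n_episodes):
        s, done, rewards = env.reset(s0), False, []
        while not done:
            s, r, done = env.step(policy(s))    # hypothetical interface
            rewards.append(r)
        G = 0.0
        for r in reversed(rewards):             # G = r + gamma * G_next
            G = r + gamma * G
        total += G
    return total / n_episodes
```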

SLIDES 31-32

The reinforcement learning problem: The Q-Learning Algorithm

[The algorithm listing is shown as images on these slides; a minimal sketch follows below.]
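A minimal tabular Q-learning sketch in the standard form, with update $Q(s,a) \leftarrow Q(s,a) + \eta\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$; the environment interface (reset/step) and the hyperparameter defaults are assumptions:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               eta=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning (off-policy): bootstrap from the *greedy* next action,
    max_a' Q(s', a'), no matter which action the behaviour policy takes."""
    rng = np.random.default_rng()
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy
            a = (int(rng.integers(n_actions)) if rng.random() < epsilon
                 else int(np.argmax(Q[s])))
            s2, r, done = env.step(a)            # hypothetical interface
            target = r if done else r + gamma * np.max(Q[s2])
            Q[s, a] += eta * (target - Q[s, a])  # move Q(s,a) toward the target
            s = s2
    return Q
```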

SLIDE 33

The reinforcement learning problem: The SARSA Algorithm

[The algorithm listing is shown as an image on this slide; a minimal sketch follows below.]
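A matching sketch of tabular SARSA (same assumptions as the Q-learning sketch above). The one substantive change is that the update bootstraps from $Q(s', a')$ for the action $a'$ the policy actually takes next, which is what makes it on-policy:

```python
import numpy as np

def sarsa(env, n_states, n_actions, episodes=500,
          eta=0.1, gamma=0.9, epsilon=0.1):
    """Tabular SARSA (on-policy): exploration noise enters the targets,
    so the learned values reflect the policy actually being followed."""
    rng = np.random.default_rng()
    Q = np.zeros((n_states, n_actions))

    def policy(s):   # the same epsilon-greedy policy acts and is evaluated
        return (int(rng.integers(n_actions)) if rng.random() < epsilon
                else int(np.argmax(Q[s])))

    for _ in range(episodes):
        s, done = env.reset(), False
        a = policy(s)
        while not done:
            s2, r, done = env.step(a)            # hypothetical interface
            a2 = policy(s2)
            target = r if done else r + gamma * Q[s2, a2]
            Q[s, a] += eta * (target - Q[s, a])
            s, a = s2, a2
    return Q
```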

SLIDE 34

Q-learning example

• Credits: Arjun Chandra

[Figure: the example environment, including a "home" state; details are shown as an image.]
SLIDES 35-59

[Image-only slides: a step-by-step walkthrough of the Q-learning example.]

SLIDE 60

Action selection

[Image only.]

SLIDE 61

On-policy vs off-policy learning

[Figure: the cliff-walking grid, with Start, The Cliff, and Goal.]

• Reward structure: each move: -1; move to cliff: -100.
• Policy: 90% chance of choosing the best action (exploit), 10% chance of choosing a random action (explore).

SLIDE 62

On-policy vs off-policy learning: Q-learning

[Figure: the cliff-walking grid, with Start, The Cliff, and Goal.]

• Always assumes the optimal action -> does not visit the cliff often while learning. Therefore, it does not learn that the cliff is dangerous.
• The resulting path is efficient, but risky.

SLIDE 63

On-policy vs off-policy learning: SARSA

[Figure: the cliff-walking grid, with Start, The Cliff, and Goal.]

• During learning, we more frequently end up in the cliff (due to the 10% chance of exploring in our policy).
• That information propagates to all states, generating a safer plan.

SLIDE 64

Which plan is better?

[Figure: the two learned paths on the cliff-walking grid (Start, The Cliff, Goal); a runnable comparison is sketched below.]

• SARSA (on-policy)
• Q-learning (off-policy)
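To see the difference concretely, here is a minimal cliff-world sketch using the reward structure from slide 61 (each move: -1; falling into the cliff: -100 and back to Start). The 4x12 layout follows Sutton and Barto's classic cliff-walking example, which these slides appear to be based on; it plugs into the hypothetical q_learning and sarsa sketches above:

```python
class CliffWorld:
    """4x12 grid: Start at bottom-left, Goal at bottom-right,
    and the bottom-row cells in between are the cliff."""
    ROWS, COLS = 4, 12
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]      # up, down, left, right

    def reset(self):
        self.pos = (self.ROWS - 1, 0)               # Start
        return self._state()

    def step(self, a):
        dr, dc = self.MOVES[a]
        r = min(max(self.pos[0] + dr, 0), self.ROWS - 1)
        c = min(max(self.pos[1] + dc, 0), self.COLS - 1)
        if r == self.ROWS - 1 and 0 < c < self.COLS - 1:   # fell into the cliff
            self.pos = (self.ROWS - 1, 0)                  # back to Start
            return self._state(), -100, False
        self.pos = (r, c)
        done = (r, c) == (self.ROWS - 1, self.COLS - 1)    # reached the Goal
        return self._state(), -1, done

    def _state(self):
        return self.pos[0] * self.COLS + self.pos[1]

env = CliffWorld()
Q_off = q_learning(env, 48, 4, episodes=2000)   # tends to learn the risky edge path
Q_on = sarsa(env, 48, 4, episodes=2000)         # tends to learn the safer detour
```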

SLIDE 65

Using evolution and neural networks in reinforcement learning

MarI/O - Machine Learning for Video Games


SLIDE 66

[Image only.]