Reinforcement learning - Applied artificial intelligence (EDA132) - PowerPoint PPT Presentation

SLIDE 1

Reinforcement learning

Applied artificial intelligence (EDA132), Lecture 13, 2012-04-26, Elin A. Topp

Material based on course book, chapter 21 (17), and on lecture “Belöningsbaserad inlärning / Reinforcement learning” by Örjan Ekeberg, CSC/Nada, KTH, autumn term 2006 (in Swedish)

SLIDE 2

Outline

  • Reinforcement learning (chapter 21, with some references to 17)
  • Problem definition
  • Learning situation
  • Role of the reward
  • Simplified assumptions
  • Central concepts and terms
  • Known environment
  • Bellman’s equation
  • Approaches to solutions
  • Unknown environment
  • Monte-Carlo method
  • Temporal-Difference learning
  • Q-Learning
  • Sarsa-Learning
  • Improvements
  • The usefulness of making mistakes
  • Eligibility Trace

SLIDE 7

Reinforcement learning

Learning of a behaviour (a strategy, a skill) without access to a right/wrong measure for the actions and decisions taken.

With the help of a reward, a measure is given of how well things are going.

Note: The reward is not given in direct connection with a good choice of action (temporal credit assignment).

Note: The reward does not tell what exactly it was that made the action “good” (structural credit assignment).

SLIDE 9

Real life examples

  • Riding a bicycle
  • Powder skiing

SLIDE 10

Learning situation: A model

An agent interacts with its environment. The agent performs actions. Actions influence the environment’s state. The agent observes the environment’s state and receives a reward from the environment.

(Diagram: agent and environment in a loop; the agent sends action a to the environment, which returns state s and reward r.)

SLIDE 11

Learning situation: The agent’s task

The task: Find a behaviour (action sequence) that maximises the overall reward. How far into the future should we look?

Finite time horizon:

max E[ ∑_{t=0}^{h} r_t ]

Infinite time horizon:

max E[ ∑_{t=0}^{∞} γ^t r_t ]

with γ being a discount factor for future rewards (0 < γ < 1)
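To make the two return definitions concrete, here is a minimal Python sketch (not from the slides; the reward sequence is made up) that computes the finite-horizon and the discounted return for a recorded list of rewards.

```python
# Illustrative sketch: computing the two kinds of return for rewards r_0..r_h.

def finite_horizon_return(rewards):
    """Sum of rewards r_t for t = 0..h (no discounting)."""
    return sum(rewards)

def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^t * r_t for t = 0, 1, 2, ... (0 < gamma < 1)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [-1, -1, -1, 0, 10]          # made-up reward sequence
print(finite_horizon_return(rewards))  # 6
print(discounted_return(rewards))      # late rewards count less than early ones
```
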

SLIDE 15

The reward function’s role

The reward function depends on the type of task:

  • Game (Chess, Backgammon): Reward is given only at the end of the game, +1 for “win”, -1 for “lose”
  • Avoid mistakes (Riding a bike, Learning to fly according to the hitchhiker’s guide): Reward -1 when failing (falling)
  • Find the shortest / cheapest / fastest path to a goal: Reward -1 for each step

SLIDE 19

A classic example: Grid World

Simplified “Wumpus world” with just two gold pieces:

  • Every state sj is represented by a field in the grid
  • An action a the agent can choose consists of moving one step to a neighbouring field
  • Reward: -1 in every step until one of the goals (G) is reached.

(Figure: a grid with the two goal fields marked G.)
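As an illustration only, a grid world of this kind can be modelled in a few lines of Python; the grid size and goal positions below are invented, while the reward of -1 per step follows the slide.

```python
# Illustrative grid world sketch: 4x4 grid, goals in two corners (made-up layout).
GOALS = {(0, 0), (3, 3)}
SIZE = 4
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Deterministic transition delta(s, a) and reward r(s, a)."""
    if state in GOALS:                      # terminal: no further reward
        return state, 0
    dr, dc = ACTIONS[action]
    r, c = state
    nr, nc = r + dr, c + dc
    if not (0 <= nr < SIZE and 0 <= nc < SIZE):
        nr, nc = r, c                       # bumping into the border: stay
    return (nr, nc), -1                     # every step costs -1

print(step((1, 1), "up"))    # ((0, 1), -1)
print(step((0, 1), "left"))  # ((0, 0), -1)  -- reaches a goal
```
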

SLIDE 26

Simplifying assumptions

We assume for now:

  • Discrete time (steps over time)
  • Finite number of possible actions ai: ai ∈ {a1, a2, a3, ... , an}
  • Finite number of states sj: sj ∈ {s1, s2, s3, ... , sm}
  • The context is a constant MDP (Markov Decision Process), where the reward and the new state s’ only depend on s, a, and (random) noise
  • The environment is observable

SLIDE 29

The agent’s internal representation

  • An agent’s policy π is the “rule” by which the agent chooses its action a in a given state s: π(s) ⟼ a
  • An agent’s utility function U describes the expected future reward given s, when following policy π: Uπ(s) ⟼ ℜ

SLIDE 32

Grid World: A state’s value

A state’s value depends on the chosen policy

(Figure: the grid world shown twice, annotated with each state’s value: “U with optimal policy” (values of magnitude 1-3 per field) and “U with random policy” (values of magnitude 14-22 per field).)


SLIDE 39

A 4x3 world

  • Fixed policy - passive learning.
  • Always start in state (1,1).
  • Do trials: observe until a terminal state is reached, then update the utilities.
  • Eventually, the agent learns how good the policy is - it can evaluate the policy and test different ones.
  • The policy described in the left grid is optimal with rewards of -0.04 for all reachable, nonterminal states, and without discounting.

Policy (left grid):        Utility estimates (right grid):

  R  R  R  +1              0.812  0.868  0.918   +1
  U  .  U  -1              0.762    .    0.660   -1
  U  L  L  L               0.705  0.655  0.611  0.388

(“.” marks the field without an entry in the original figure.)

SLIDE 40

Outline

  • Reinforcement learning (chapter 21, with some references to 17)
  • Problem definition
  • Learning situation
  • Role of the reward
  • Simplified assumptions
  • Central concepts and terms
  • Known (observable) environment
  • Bellman’s equation
  • Approaches to solutions
  • Unknown environment
  • Monte-Carlo method
  • Temporal-Difference learning
  • Q-Learning
  • Sarsa-Learning
  • Improvements
  • The usefulness of making mistakes
  • Eligibility Trace

SLIDE 45

Environment model

  • Where do we get to in each step?  δ(s, a) ⟼ s’
  • What will the reward be?  r(s, a) ⟼ ℜ

The utility values of different states obey Bellman’s equation, given a fixed policy π:

Uπ(s) = r(s, π(s)) + γ·Uπ(δ(s, π(s)))

SLIDE 48

Solving the equation

There are two ways of solving Bellman’s equation

Uπ(s) = r(s, π(s)) + γ·Uπ(δ(s, π(s)))

  • Directly: Uπ(s) = r(s, π(s)) + γ·∑_{s’} P(s’ | s, π(s)) Uπ(s’)

SLIDE 49

Recap: Random policy

(Figure: the grid world annotated with each state’s value under the random policy, magnitudes 14-22.)

Uπ(s) = r(s, π(s)) + γ·∑_{s’} P(s’ | s, π(s)) Uπ(s’)

SLIDE 50

Solving the equation

There are two ways of solving (this “optimal” version of) Bellman’s equation

Uπ(s) = r(s, π(s)) + γ·Uπ(δ(s, π(s)))

  • Directly: Uπ(s) = r(s, π(s)) + γ·∑_{s’} P(s’ | s, π(s)) Uπ(s’)
  • Iteratively (value / utility iteration), stop when equilibrium is reached, i.e., “nothing happens”:

Uπ_{k+1}(s) ⟵ r(s, π(s)) + γ·Uπ_k(δ(s, π(s)))
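The iterative variant can be sketched in a few lines of Python. This is an illustration only: the three-state chain, its rewards and the stopping threshold are invented, and the policy is taken to be fixed and deterministic as in the equation above.

```python
# Iterative evaluation of a fixed policy on a tiny, made-up deterministic MDP.
# delta[s] is the successor reached under pi, r[s] the reward for that step.
delta = {"A": "B", "B": "C", "C": "C"}   # C is absorbing
r     = {"A": -1.0, "B": -1.0, "C": 0.0}
gamma = 0.9

U = {s: 0.0 for s in delta}              # arbitrary initialisation
for _ in range(1000):                    # iterate until (almost) nothing happens
    U_new = {s: r[s] + gamma * U[delta[s]] for s in delta}
    if max(abs(U_new[s] - U[s]) for s in delta) < 1e-9:
        U = U_new
        break
    U = U_new

print(U)   # roughly {'A': -1.9, 'B': -1.0, 'C': 0.0}
```
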

SLIDE 55

Bayesian reinforcement learning

A remark: One form of reinforcement learning integrates Bayesian learning into the process to obtain the transition model, i.e., P(s’ | s, π(s)).

This means assuming a prior probability for each hypothesis on what the model might look like and then applying Bayes’ rule to obtain the posterior. We are not going into details here!

SLIDE 60

Finding optimal policy and value function

How can we find an optimal policy π*?

That would be easy if we had the optimal value / utility function U*:

π*(s) = argmax_a ( r(s, a) + γ·U*(δ(s, a)) )

Apply this to the “optimal version” of Bellman’s equation:

U*(s) = max_a ( r(s, a) + γ·U*(δ(s, a)) )

Tricky to solve ... but possible: combine policy and value iteration by switching in each iteration step.

SLIDE 63

Policy iteration

Policy iteration provides exactly this switch. For each iteration step k:

πk(s) = argmax_a ( r(s, a) + γ·Uk(δ(s, a)) )

Uk+1(s) = r(s, πk(s)) + γ·Uk(δ(s, πk(s)))
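A minimal Python sketch of this alternation (illustration only; the two-state, two-action MDP below is invented just to make the two update lines concrete):

```python
# Policy iteration sketch on a made-up deterministic MDP.
# delta[(s, a)] = successor state, r[(s, a)] = reward.
states, actions, gamma = ["A", "B"], ["stay", "go"], 0.9
delta = {("A", "stay"): "A", ("A", "go"): "B",
         ("B", "stay"): "B", ("B", "go"): "A"}
r     = {("A", "stay"): 0.0, ("A", "go"): 5.0,
         ("B", "stay"): 1.0, ("B", "go"): 0.0}

U  = {s: 0.0 for s in states}
pi = {s: "stay" for s in states}
for k in range(100):
    # pi_k(s) = argmax_a ( r(s,a) + gamma * U_k(delta(s,a)) )
    pi = {s: max(actions, key=lambda a: r[(s, a)] + gamma * U[delta[(s, a)]])
          for s in states}
    # U_{k+1}(s) = r(s, pi_k(s)) + gamma * U_k(delta(s, pi_k(s)))
    U = {s: r[(s, pi[s])] + gamma * U[delta[(s, pi[s])]] for s in states}

print(pi, U)   # both states end up choosing "go": cycling between A and B pays best here
```
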

SLIDE 64

Outline

  • Reinforcement learning (chapter 21, with some references to 17)
  • Problem definition
  • Learning situation
  • Role of the reward
  • Simplified assumptions
  • Central concepts and terms
  • Known environment
  • Bellman’s equation
  • Approaches to solutions
  • Unknown environment
  • Monte-Carlo method
  • Temporal-Difference learning
  • Q-Learning
  • Sarsa-Learning
  • Improvements
  • The usefulness of making mistakes
  • Eligibility Trace

SLIDE 70

Monte Carlo approach

Usually the reward r(s, a) and the state transition function δ(s, a) are unknown to the learning agent. (What does that mean for learning to ride a bike?)

Still, we can estimate U* from experience, as a Monte Carlo approach does:

  • Start with a randomly chosen s
  • Follow a policy π, store the rewards and the state st for the step at time t
  • When the goal is reached, update the Uπ(st) estimate for all visited states st with the future reward that was given when reaching the goal
  • Start over with a randomly chosen s ...

Converges slowly...
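A sketch of this Monte Carlo estimation in Python, as an illustration only: the one-dimensional chain environment, the random policy and the number of episodes are all invented, but the update follows the procedure above (store the visited states, then back up the observed future reward).

```python
# Monte Carlo estimation of U^pi by running whole episodes (made-up 1-D chain).
import random

N_STATES, GOAL, gamma = 5, 4, 1.0            # states 0..4, state 4 is the goal
U      = [0.0] * N_STATES
visits = [0] * N_STATES

for episode in range(5000):
    s = random.randrange(N_STATES - 1)       # start in a random non-goal state
    trajectory, rewards = [], []
    while s != GOAL:
        trajectory.append(s)
        a = random.choice([-1, +1])           # random policy: step left or right
        s = min(max(s + a, 0), N_STATES - 1)
        rewards.append(-1)                    # -1 per step until the goal
    # walk backwards: G is the future reward observed from each visited state
    G = 0.0
    for t in reversed(range(len(trajectory))):
        G = rewards[t] + gamma * G
        st = trajectory[t]
        visits[st] += 1
        U[st] += (G - U[st]) / visits[st]     # running average of observed returns

print([round(u, 1) for u in U])               # estimated U^pi for states 0..4
```
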

SLIDE 78

Temporal Difference learning

Temporal Difference learning uses the fact that there are two estimates for the value of a state: before and after visiting the state.

Or: what the agent believes before acting, Uπ(st), and after acting, rt+1 + γ·Uπ(st+1).

SLIDE 84

Applying the estimates

The second estimate in the Temporal Difference learning approach is obviously “better”, hence we update the overall approximation of a state’s value towards the more accurate estimate:

Uπ(st) ⟵ Uπ(st) + α[ rt+1 + γ·Uπ(st+1) - Uπ(st) ]

The term in brackets gives us a measure of the “surprise” or “disappointment” about the outcome of an action.

Converges significantly faster than the pure Monte Carlo approach.
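As a sketch, the update above is essentially a one-liner in Python (illustrative helper, not from the slides; the state names and parameter values are arbitrary):

```python
# TD(0) update for the state-value estimate, following the formula above.
def td_update(U, s_t, r_next, s_next, alpha=0.1, gamma=0.9):
    """U[s_t] <- U[s_t] + alpha * (r_{t+1} + gamma * U[s_{t+1}] - U[s_t])."""
    td_error = r_next + gamma * U[s_next] - U[s_t]   # the "surprise"
    U[s_t] += alpha * td_error
    return td_error

U = {"A": 0.0, "B": 0.0}
print(td_update(U, "A", r_next=-1.0, s_next="B"))    # td_error = -1.0
print(U)                                             # {'A': -0.1, 'B': 0.0}
```
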

SLIDE 94

Q-learning

Problem: even if U is appropriately estimated, it is not possible to compute π, as the agent has no knowledge about δ and r, i.e., it would need to learn those as well.

Solution (trick): Estimate Q(s, a) instead of U(s):

Q(s, a): expected total reward when choosing a in s

π(s) = argmax_a Q(s, a)

U*(s) = max_a Q*(s, a)

SLIDE 101

Learning Q

How can we learn Q?

The Q-function, too, can be learned using the Temporal Difference approach:

Q(s, a) ⟵ Q(s, a) + α[ r + γ max_{a’} Q(s’, a’) - Q(s, a) ]

with s’ being the next state that is reached after choosing action a in s.

Again, a problem: the max operator obviously requires a search through all possible actions that can be taken in the next step...

SLIDE 107

SARSA-learning

SARSA-learning works similarly to Q-learning, but it is the currently active policy that controls the actually taken action a’:

Q(s, a) ⟵ Q(s, a) + α[ r + γ Q(s’, a’) - Q(s, a) ]

It got its name from the “experience tuples” having the form State-Action-Reward-State-Action:

< s, a, r, s’, a’ >
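A compact Python sketch of a SARSA episode loop (illustration only; the env_step and choose_action callables are assumed to be supplied by the surrounding program, e.g. an ε-greedy choice over Q and a simulator of the task):

```python
from collections import defaultdict

# SARSA episode sketch: the experience tuple <s, a, r, s', a'> drives each update.
def sarsa_episode(Q, s0, choose_action, env_step, alpha=0.1, gamma=0.9, max_steps=1000):
    s = s0
    a = choose_action(Q, s)
    for _ in range(max_steps):
        s_next, r, done = env_step(s, a)          # black box: next state and reward
        a_next = choose_action(Q, s_next)         # the current policy also picks a'
        target = r if done else r + gamma * Q[(s_next, a_next)]
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, a = s_next, a_next
        if done:
            break
    return Q

# Q as a defaultdict lets unseen (s, a) pairs start at value 0.
Q = defaultdict(float)
```
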

SLIDE 108

Outline

  • Reinforcement learning (chapter 21, with some references to 17)
  • Problem definition
  • Learning situation
  • Role of the reward
  • Simplified assumptions
  • Central concepts and terms
  • Known environment
  • Bellman’s equation
  • Approaches to solutions
  • Unknown environment
  • Monte-Carlo method
  • Temporal-Difference learning
  • Q-Learning
  • Sarsa-Learning
  • Improvements
  • The usefulness of making mistakes
  • Eligibility Trace

SLIDE 109

Improvements and adaptations

What can we do when ...

  • ... the environment is not fully observable?
  • ... there are too many states?
  • ... the states are not discrete?
  • ... the agent is acting in continuous time?

SLIDE 114

Allowing to be wrong sometimes

Exploration - exploitation dilemma: When following one policy based on the current estimate of Q, it is not guaranteed that Q actually converges to Q* (the optimal Q).

A simple solution: Use a policy that has a certain probability of “being wrong” once in a while, to explore better.

  • ε-greedy: Will sometimes (with probability ε) pick a random action instead of the one that looks best (greedy)
  • Softmax: Weighs the probability of choosing different actions according to how “good” they appear to be.
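Both selection rules can be sketched as small Python helpers (illustration only; the temperature parameter tau in the softmax version is an assumed, common parameterisation that the slide does not spell out):

```python
import math
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability eps pick a random action, otherwise the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def softmax(Q, s, actions, tau=1.0):
    """Choose actions with probability proportional to exp(Q / tau)."""
    prefs = [math.exp(Q[(s, a)] / tau) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs])[0]

Q = {("s0", "left"): 1.0, ("s0", "right"): 2.0}
print(epsilon_greedy(Q, "s0", ["left", "right"]))
print(softmax(Q, "s0", ["left", "right"]))
```
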

SLIDE 121

ε-greedy Q-learning

A suggested algorithm (ε-greedy implementation, given some “black box” that produces r and s’, given s and a):

  • Initialise Q(s, a) arbitrarily ∀s, a; choose learning rate α and discount factor γ
  • Initialise s
  • Repeat for each step (until T steps have been taken):
    • Choose a from s using the ε-greedy policy based on Q(s, a)
    • Take action a, observe reward r and next state s'
    • Update Q(s, a) ← Q(s, a) + α[ r + γ max_{a'} Q(s', a') - Q(s, a) ]
    • Replace s with s'
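A Python sketch of the algorithm above (illustration only; the little chain of states used as the “black box”, and all parameter values, are invented for the example):

```python
import random
from collections import defaultdict

# "Black box": a little chain of states 0..5; moving right from state 4 reaches
# the goal (state 5). Every step costs -1 until the goal is reached.
ACTIONS = [-1, +1]
def black_box(s, a):
    s_next = min(max(s + a, 0), 5)
    done = (s_next == 5)
    return -1, s_next, done

alpha, gamma, eps, T = 0.5, 0.9, 0.1, 20000
Q = defaultdict(float)                     # Q(s, a), initialised to 0 for all s, a

s = 0                                      # initialise s
for step in range(T):                      # repeat for T steps
    # choose a from s using the epsilon-greedy policy based on Q(s, a)
    if random.random() < eps:
        a = random.choice(ACTIONS)
    else:
        a = max(ACTIONS, key=lambda act: Q[(s, act)])
    r, s_next, done = black_box(s, a)      # take action a, observe r and s'
    best_next = max(Q[(s_next, act)] for act in ACTIONS)
    target = r if done else r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])   # the Q-learning update
    s = 0 if done else s_next              # replace s with s' (restart after the goal)

# learned state values: least negative closest to the goal
print({state: max(Q[(state, a)] for a in ACTIONS) for state in range(5)})
```

Replacing the max over a' by the action the policy actually takes next turns this sketch into the SARSA variant from the earlier slide.
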

SLIDE 122

Speeding up the process

Idea: the Temporal Difference (TD) updates can also be used to improve the estimates of states where the agent has been earlier:

∀s, a : Q(s, a) ⟵ Q(s, a) + α[ rt+1 + γ Q(st+1, at+1) - Q(st, at) ] · e(s, a)

with e the eligibility trace, telling how long ago the agent visited s and chose action a.

Often called TD(λ), with λ being the time constant that describes the “annealing rate” of the trace.
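One such trace-based update step can be sketched in Python as follows (illustration only; decaying the trace by γ·λ each step is a common convention for SARSA(λ) and is an assumption here, not something spelled out on the slide):

```python
from collections import defaultdict

def sarsa_lambda_update(Q, e, s, a, r, s_next, a_next,
                        alpha=0.1, gamma=0.9, lam=0.8):
    """One step of SARSA(lambda): update all (s, a) pairs weighted by their trace."""
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]   # TD error
    e[(s, a)] += 1.0                                      # mark the visited pair
    for key in list(e):
        Q[key] += alpha * delta * e[key]                  # larger update if visited recently
        e[key] *= gamma * lam                             # let the trace fade away
    return Q, e

Q, e = defaultdict(float), defaultdict(float)
Q, e = sarsa_lambda_update(Q, e, s=0, a=+1, r=-1, s_next=1, a_next=+1)
print(dict(Q), dict(e))
```
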

SLIDE 123

Application examples

  • Game playing.
    • A. Samuel’s checkers program (1959). Remarkable: did not use any rewards... but managed to converge anyhow...
    • G. Tesauro’s backgammon program from 1992, first introduced as Neurogammon, with a neural network representation of Q(s, a). Required an expert for tedious training ;-) The newer version TD-Gammon learned from self-play and rewards at the end of the game according to generalised TD-learning. Played quite well after two weeks of computing time ...
  • Robotics
    • Classic example: the inverted pendulum (cart-pole). Two actions: jerk right or jerk left (bang-bang control). A learning algorithm was first applied to this problem in 1968 (Michie and Chambers), using a real cart!
    • More recently: pancake flipping ;-)

SLIDE 124

Flipping ... a piece of (pan)cake?


Video from programming-by-demonstration.org (Dr. Sylvain Calinon & Dr. Petar Kormushev)

SLIDE 125

Homework for Machine Learning

  • Homework 3 is related to machine learning, announced on the course page
  • Choose between 3a, 3b, 3c (or do several), but only one (the best) will contribute in the end as homework 3
  • 3c is in the area of today’s lecture (slides will be provided after the lecture ;-)
  • The task: get a little two-legged agent (“robot”) to learn to “walk”
  • Some programming effort is involved (instructions provided)
  • The main idea is to explore different reinforcement learning approaches, compare their effect on the agent’s success (or failure...), and report on the experience
  • A series of images for “animation” of the agent is provided
  • Support methods for the “animation” of the agent’s walk are provided in Matlab and Python (transferring to Java should also be easily possible; the Matlab code is less than 30 lines long)

SLIDE 126

Homework for Machine Learning cont’d

  • Seemingly “simple” task - just doing it gives a grade 3 at maximum.
  • BUT: the important part of this task is the INTERPRETATION and DISCUSSION of results, which should be done in a thoroughly prepared and written REPORT. Please make sure you have read the instructions carefully before starting the work!
  • Deadline for handing in: May 10, 2012.