SLIDE 1

Reinforcement Learning (RL)

CE-717: Machine Learning

Sharif University of Technology

M. Soleymani

Fall 2018

Most slides have been taken from Klein and Abbeel, CS188, UC Berkeley.

SLIDE 2

Reinforcement Learning (RL)

• Learning as a result of interaction with an environment, to improve the agent's ability to behave optimally in the future and to achieve its goal.

• The first idea that comes to mind when we think about the nature of learning. Examples:

  • Baby movements
  • Learning to drive a car

• The environment's response affects our subsequent actions; we find out the effects of our actions later.

SLIDE 3

Paradigms of learning

• Supervised learning

  • Training data: features and labels for N samples

{(x_n, y_n)}_{n=1}^N

• Unsupervised learning

  • Training data: only features for N samples

{x_n}_{n=1}^N

• Reinforcement learning

  • Training data: a sequence of (s, a, r) triples

  • (state, action, reward)

  • The agent acts on its environment and receives some evaluation of its action via a reinforcement signal

  • It is not told which action is the correct one to achieve its goal

SLIDE 4

Reinforcement Learning (RL)

• S: set of states
• A: set of actions
• Goal: learning an optimal policy (a mapping from states to actions) in order to maximize long-term reward

• The agent's objective is to maximize the amount of reward it receives over time.

SLIDE 5

Environment properties

• Deterministic vs. stochastic

  • Stochastic: stochastic reward and transition functions

• Known vs. unknown

  • Unknown: the agent doesn't know the precise results of its actions before doing them

• Fully observable vs. partially observable

  • Observable (accessible): the percept identifies the state
  • Partially observable: the agent doesn't necessarily know everything about the current state

• [We discuss only fully observable environments.]

SLIDE 6

Reinforcement Learning: Example

• Chess game (a deterministic game)

  • Learning task: choose moves at arbitrary board states
  • Training signal: final win or loss
  • Training: e.g., play n games against itself

SLIDE 7

Non-deterministic world: Example

• What is the policy to achieve the maximum reward?

[Figure: grid world; reward for other (non-terminal) actions: −0.04; Russell, AIMA, 2010]

SLIDE 8

Main characteristics and applications of RL

• Main characteristics of RL

  • Learning is a multistage decision-making process

  • Actions influence later perceptions (inputs)
  • Delayed reward: actions may affect not only the immediate reward but also subsequent rewards

  • The agent must learn from interactions with the environment

  • The agent must be able to learn from its own experience

  • Not entirely supervised, but interactive

  • By trial and error

  • Opportunity for active exploration

  • Needs a trade-off between exploration and exploitation

• Goal: map states to actions so as to maximize reward over time (optimal policy)

SLIDE 9

Main elements of RL

• Policy: a map from state space to action space.

  • May be stochastic.

• A reward function

  • It maps each state (or state-action pair) to a real number, called the reward.

• A value function

  • The value of a state (or state-action pair) is the total expected reward, starting from that state (or state-action pair).

SLIDE 10

Popular applications

• Robotics

  • Learn to optimize the performance of agents

• Control

  • Automate the motion of a vehicle

• Game playing

  • Good board-game-playing programs

SLIDE 11

RL deterministic world: Example

• Example: robot grid world

  • Deterministic and known reward and transition functions

SLIDE 12

Optimal policy

[Figures: optimal policies for the grid world when the reward for other (non-terminal) actions is r(s) = −0.04, r(s) = −4, and r(s) = −0.4; Russell, AIMA 2010]

SLIDE 13

RL problem: deterministic environment

• Deterministic

  • Transition and reward functions

• At time t:

  • The agent observes state s_t ∈ S
  • Then chooses action a_t ∈ A
  • Then receives reward r_t, and the state changes to s_{t+1}

• Learn to choose the action a_t in state s_t that maximizes the discounted return:

R_t = r_t + γ r_{t+1} + γ² r_{t+2} + ⋯ = Σ_{k=0}^∞ γ^k r_{t+k},  0 < γ ≤ 1

Upon visiting the sequence of states s_t, s_{t+1}, … under actions a_t, a_{t+1}, …, R_t gives the total payoff.
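As a quick illustration, here is a minimal sketch (a hypothetical helper, not from the slides) of computing this discounted return in Python:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_{t+k} over an observed reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: three steps of reward 1 with gamma = 0.9 -> 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1, 1, 1], 0.9))
```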

SLIDE 14

RL problem: stochastic environment

• Stochastic environment

  • Stochastic transition and/or reward function

• Learn to choose a policy that maximizes the expected discounted return, starting from state s_t:

E[R_t] = E[r_t + γ r_{t+1} + γ² r_{t+2} + ⋯]

where R_t = r_t + γ r_{t+1} + γ² r_{t+2} + ⋯ = Σ_{k=0}^∞ γ^k r_{t+k}

SLIDE 15

Markov Decision Process (RL Setting)

• We encounter a multistage decision-making process.
• Markov assumption:

P(s_{t+1}, r_t | s_t, a_t, r_{t−1}, s_{t−1}, a_{t−1}, r_{t−2}, …) = P(s_{t+1}, r_t | s_t, a_t)

• Markov property: transition probabilities depend on the state only, not on the path to the state.

• Goal: for every possible state s ∈ S, learn a policy π for choosing actions that maximizes the infinite-horizon discounted reward:

E[r_t + γ r_{t+1} + γ² r_{t+2} + ⋯ | s_t = s, π],  0 < γ ≤ 1

SLIDE 16

Markov Decision Process

• Components:

  • States
  • Actions
  • State transition probabilities
  • Reward function
  • Discount factor

• Deterministic vs. nondeterministic MDP
SLIDE 17

[Diagram: expectimax-style tree showing a state s, its q-states (s, a), transitions (s, a, s′), and successor states s′]

SLIDE 18

Example

[Diagram: racing-car MDP with states Cool, Warm, Overheated; actions Slow (reward +1) and Fast (reward +2); transition probabilities 0.5 and 1.0; overheating yields −10]
SLIDE 19

MDP: Recycling Robot example

• S = {high, low}
• A = {search, wait, recharge}

  • A(high) = {search, wait}   ← available actions in the 'high' state
  • A(low) = {search, wait, recharge}

• R_search > R_wait   (expected reward)

P(s_{t+1} = high | s_t = high, a_t = search)

SLIDE 20

RL: Autonomous Agent

• Execute actions in the environment, observe the results, and learn

• Learn a (perhaps stochastic) policy that maximizes

E[Σ_{k=0}^∞ γ^k r_{t+k} | s_t = s, π] for every state s ∈ S

• The function to be learned is the policy π: S × A → [0, 1]

  (when the policy is deterministic, π: S → A)

• Training examples in supervised learning: (s, a) pairs
• RL training data instead shows the amount of reward for a pair (s, a):

  • training data are of the form ((s, a), r)

SLIDE 21

State-value function for policy π

• Given a policy π, define the value function

V^π(s) = E[Σ_{k=0}^∞ γ^k r_{t+k} | s_t = s, π]

• V^π(s): how good it is for the agent to be in state s when its policy is π

• It is simply the expected sum of discounted rewards upon starting in state s and taking actions according to π

SLIDE 22

Approaches to solve RL problems

• Three fundamental classes of methods for solving RL problems:

  • Dynamic programming
  • Monte Carlo methods
  • Temporal-difference learning

• Main approaches

  • Value iteration and policy iteration are two classic approaches to this problem.

  • They are dynamic programming approaches.

  • Q-learning is a more recent approach to this problem.

  • It is a temporal-difference method.

SLIDE 23

Recursive definition for V^π(s)

V^π(s) = E[Σ_{k=0}^∞ γ^k r_{t+k} | s_t = s, π]
  = E[r_t + γ Σ_{k=1}^∞ γ^{k−1} r_{t+k} | s_t = s, π]
  = E[r_t + γ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, π]
  = Σ_a π(s, a) Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ E[Σ_{k=0}^∞ γ^k r_{t+k+1} | s_{t+1} = s′, π] ]

where the inner expectation is V^π(s′), and

P^a_{ss′} = P(s_{t+1} = s′ | s_t = s, a_t = a)
R^a_{ss′} = E[r_t | s_t = s, a_t = a, s_{t+1} = s′]

Bellman equations:

V^π(s) = Σ_a π(s, a) Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]

This is the base equation for dynamic programming approaches.

SLIDE 24

State-action value function for policy π

Q^π(s, a) = E[Σ_{k=0}^∞ γ^k r_{t+k} | s_t = s, a_t = a, π]
  = Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ E[Σ_{k=0}^∞ γ^k r_{t+k+1} | s_{t+1} = s′, π] ]

where the inner expectation is V^π(s′), so

Q^π(s, a) = Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]

SLIDE 25

State-value function for policy π: example

• Deterministic example

[Grid diagram with values 100, 90, 81, 73, 66 along the path to the goal; in the deterministic case V^π(s) = R^{π(s)}_{ss′} + γ V^π(s′)]

SLIDE 26

Grid-world: value function example

[Russell, AIMA 2010]

SLIDE 27

Optimal policy

• Policy π is better than (or equal to) π′ (i.e., π ≥ π′) iff

V^π(s) ≥ V^{π′}(s), ∀s ∈ S

• The optimal policy π* is better than (or equal to) all other policies (∀π: π* ≥ π)

• Optimal policy π*:

π* = argmax_π V^π(s), ∀s ∈ S

SLIDE 28

MDP: optimal policy, state-value, and action-value functions

• Optimal policies share the same optimal state-value function (V^{π*}(s) will be abbreviated as V*(s)):

V*(s) = max_π V^π(s), ∀s ∈ S

• And the same optimal action-value function:

Q*(s, a) = max_π Q^π(s, a), ∀s ∈ S, a ∈ A(s)

• For any MDP, a deterministic optimal policy exists!

SLIDE 29

Optimal policy

• If we have V*(s) and P(s_{t+1} | s_t, a_t), we can compute π*(s):

π*(s) = argmax_a Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V*(s′) ]

• It can also be computed as:

π*(s) = argmax_{a ∈ A(s)} Q*(s, a)

• The optimal policy has the interesting property that it is the optimal policy for all states.

  • All states share the same optimal state-value function
  • It does not depend on the initial state
  • Use the same policy no matter what the initial state of the MDP is

SLIDE 30

Bellman optimality equations

V*(s) = max_{a ∈ A(s)} Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V*(s′) ]

Q*(s, a) = Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ max_{a′} Q*(s′, a′) ]

V*(s) = max_{a ∈ A(s)} Q*(s, a)

Q*(s, a) = Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V*(s′) ]

SLIDE 31

Optimal Quantities

• The value (utility) of a state s:

V*(s) = expected utility starting in s and acting optimally

• The value (utility) of a q-state (s, a):

Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally

• The optimal policy:

π*(s) = optimal action from state s

[Diagram: s is a state; (s, a) is a q-state; (s, a, s′) is a transition]

SLIDE 32

Snapshot of Demo – Gridworld V Values

Noise = 0.2, Discount = 0.9, Living reward = 0

SLIDE 33

Snapshot of Demo – Gridworld Q Values

Noise = 0.2, Discount = 0.9, Living reward = 0

SLIDE 34

Value Iteration algorithm

Consider only MDPs with finite state and action spaces:

1) Initialize all V(s) to zero
2) Repeat until convergence

  for s ∈ S:
   V(s) ← max_a Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V(s′) ]

3) for s ∈ S:
   π(s) ← argmax_a Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V(s′) ]

V(s) converges to V*(s).

Asynchronous variant: instead of updating the values of all states at once in each iteration, update them state by state, or update some states more often than others.
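The update above maps directly onto code. Below is a minimal sketch of tabular value iteration for a known MDP; the transition model format (P[s][a] as a list of (prob, next_state, reward) triples) and the callable actions(s) are assumptions for illustration:

```python
def value_iteration(states, actions, P, gamma=0.9, tol=1e-6):
    """P[s][a] is a list of (prob, next_state, reward) triples."""
    V = {s: 0.0 for s in states}                      # 1) initialize V(s) = 0
    while True:                                       # 2) repeat until convergence
        delta = 0.0
        for s in states:
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in actions(s)]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # 3) extract the greedy policy from the converged values
    pi = {s: max(actions(s),
                 key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
          for s in states}
    return V, pi
```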

SLIDE 35

Value Iteration

• Bellman equations characterize the optimal values
• Value iteration computes them
• Value iteration is just a fixed-point solution method

  • … though the V_k vectors are also interpretable as time-limited values

[Diagram: one-step expectimax from V(s) through (s, a) and (s, a, s′) to V(s′)]

SLIDE 36

Racing Search Tree

[Diagram: expectimax tree relating V_{k+1}(s) to V_k(s′) through (s, a) and (s, a, s′)]

SLIDE 37

Racing Search Tree

SLIDE 38

Time-Limited Values

• Key idea: time-limited values

  • Define V_k(s) to be the optimal value of s if the game ends in k more time steps

SLIDE 39

Value Iteration

• Start with V_0(s) = 0: no time steps left means an expected reward sum of zero
• Given a vector of V_k(s) values, do one ply of expectimax from each state:

V_{k+1}(s) ← max_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V_k(s′) ]

• Repeat until convergence
• Complexity of each iteration: O(S²A)
• Theorem: will converge to unique optimal values

  • Basic idea: approximations get refined towards optimal values
  • The policy may converge long before the values do

SLIDE 40

Example: Value Iteration

Assume no discount!

[Table of iterates for the racing MDP, states (Cool, Warm, Overheated):
V_0 = (0, 0, 0), V_1 = (2, 1, 0), V_2 = (3.5, 2.5, 0)]

SLIDE 41

Computing Time-Limited Values

SLIDES 42–55

k = 0, 1, 2, …, 12, and k = 100

[Snapshots of Gridworld values after k iterations of value iteration; Noise = 0.2, Discount = 0.9, Living reward = 0]

SLIDE 56

Computing Actions from Values

• Let's imagine we have the optimal values V*(s)
• How should we act?

  • It's not obvious!
  • We need to do a one-step expectimax:

π*(s) = argmax_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V*(s′) ]

• This is called policy extraction, since it gets the policy implied by the values

SLIDE 57

Computing Actions from Q-Values

• Let's imagine we have the optimal q-values Q*(s, a)
• How should we act?

  • Completely trivial to decide!

π*(s) = argmax_a Q*(s, a)

• Important lesson: actions are easier to select from q-values than from values!

SLIDE 58

Convergence*

• How do we know the V_k vectors are going to converge?

• Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values

• Case 2: If the discount is less than 1

  • Sketch: for any state, V_k and V_{k+1} can be viewed as depth-(k+1) expectimax results in nearly identical search trees
  • The difference is that on the bottom layer, V_{k+1} has actual rewards while V_k has zeros
  • That last layer is at best all R_MAX and at worst all R_MIN
  • But everything is discounted by γ^k that far out
  • So V_k and V_{k+1} are at most γ^k max|R| different
  • So as k increases, the values converge

SLIDE 59

Value Iteration

• Value iteration works even if we randomly traverse the environment instead of looping through each state and action (asynchronous updates)

  • but we must still visit each state infinitely often

• Value iteration is time- and memory-expensive

SLIDE 60

Problems with Value Iteration

• Value iteration repeats the Bellman updates:

V_{k+1}(s) ← max_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V_k(s′) ]

• Problem 1: It's slow – O(S²A) per iteration
• Problem 2: The "max" at each state rarely changes
• Problem 3: The policy often converges long before the values

SLIDE 61

Convergence

[Russell, AIMA, 2010]

SLIDE 62

Main steps in solving Bellman optimality equations

• Two kinds of steps, which are repeated in some order for all the states until no further changes take place:

π(s) = argmax_a Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]

V^π(s) = Σ_{s′} P^{π(s)}_{ss′} [ R^{π(s)}_{ss′} + γ V^π(s′) ]

SLIDE 63

Policy Iteration algorithm

1) Initialize π(s) arbitrarily
2) Repeat until convergence

  • Compute the value function for the current policy π (i.e., V^π)
  • V ← V^π
  • for s ∈ S:
   π(s) ← argmax_a Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V(s′) ]

π(s) converges to π*(s).

The last step updates the policy (greedily) using the current value function.
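A minimal sketch of this loop, reusing the same hypothetical P[s][a] transition-model format as the value-iteration sketch (actions(s) is assumed to return a list of legal actions); policy evaluation is done iteratively here:

```python
def policy_iteration(states, actions, P, gamma=0.9, tol=1e-8):
    """Alternate policy evaluation and greedy policy improvement."""
    pi = {s: actions(s)[0] for s in states}           # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: iterate the Bellman expectation update for pi
        while True:
            delta = 0.0
            for s in states:
                v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][pi[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # Policy improvement: greedy one-step look-ahead
        stable = True
        for s in states:
            best = max(actions(s),
                       key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
            if best != pi[s]:
                pi[s] = best
                stable = False
        if stable:
            return pi, V
```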

SLIDE 64

Policy Iteration

• Repeat steps until the policy converges

  • Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence

  • Step 2: Policy improvement: update the policy using one-step look-ahead with the resulting converged (but not optimal!) utilities as future values

• This is policy iteration

  • It's still optimal!
  • Can converge (much) faster under some conditions

SLIDE 65

Fixed Policies: Evaluation

• Expectimax trees max over all actions to compute the optimal values
• If we fix some policy π(s), the tree becomes simpler – only one action per state

[Diagrams: left tree does the optimal action; right tree does what π says to do]
slide-66
SLIDE 66

Utilities for a Fixed Policy

 Another basic operation: compute the utility of a state s

under a fixed (generally non-optimal) policy

 Recursive relation (one-step look-ahead / Bellman equation):

(s) s s, (s) s, (s),s’ s’

66

V(s) = expected total discounted rewards starting in s and following 

SLIDE 67

Policy Evaluation

• How do we calculate the V's for a fixed policy π?

• Idea 1: Turn the recursive Bellman equations into updates (like value iteration)

V^π_{k+1}(s) ← Σ_{s′} T(s, π(s), s′) [ R(s, π(s), s′) + γ V^π_k(s′) ]

  • Efficiency: O(S²) per iteration

• Idea 2: Without the maxes, the Bellman equations are just a linear system, which a standard linear-algebra solver can handle directly
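Idea 2 can be made concrete in a few lines. A sketch, assuming numpy and that the dense transition matrix T_pi[s, s′] and expected-reward vector r_pi[s] have already been computed for the fixed policy π; it solves (I − γT)v = r:

```python
import numpy as np

def evaluate_policy_exactly(T_pi, r_pi, gamma=0.9):
    """Solve V = r_pi + gamma * T_pi @ V, i.e. (I - gamma*T_pi) V = r_pi.

    T_pi[s, s2]: transition probabilities under the fixed policy.
    r_pi[s]:     expected one-step reward under the fixed policy.
    """
    n = T_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * T_pi, r_pi)

# Tiny 2-state example (hypothetical numbers):
T_pi = np.array([[0.9, 0.1],
                 [0.0, 1.0]])
r_pi = np.array([1.0, 0.0])
print(evaluate_policy_exactly(T_pi, r_pi))
```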

SLIDE 68

Policy Iteration

• Evaluation: for the fixed current policy π, find the values with policy evaluation; iterate until the values converge:

V^π_{k+1}(s) ← Σ_{s′} T(s, π(s), s′) [ R(s, π(s), s′) + γ V^π_k(s′) ]

• Improvement: for fixed values, get a better policy using policy extraction (one-step look-ahead):

π_new(s) = argmax_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V^π(s′) ]

SLIDE 69

When to stop iterations

[Russell, AIMA 2010]

SLIDE 70

Comparison

• Both value iteration and policy iteration compute the same thing (all optimal values)

• In value iteration:

  • Every iteration updates both the values and (implicitly) the policy
  • We don't track the policy, but taking the max over actions implicitly recomputes it

• In policy iteration:

  • We do several passes that update utilities with a fixed policy (each pass is fast because we consider only one action, not all of them)
  • After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
  • The new policy will be better (or we're done)

• Both are dynamic programs for solving MDPs

SLIDE 71

MDP Algorithms: Summary

• So you want to…

  • Compute optimal values: use value iteration or policy iteration
  • Compute values for a particular policy: use policy evaluation
  • Turn your values into a policy: use policy extraction (one-step look-ahead)
slide-72
SLIDE 72

Unknown transition model

72

 So far: learning optimal policy when we know 𝒬𝑡𝑡′

𝑏

(i.e.

T(s,a,s’)) and ℛ𝑡𝑡′

𝑏  it requires prior knowledge of the environment's dynamics

 If a model is not available, then it is particularly useful to

estimate action values rather than state values

SLIDE 73

Reinforcement Learning

• Still assume a Markov decision process (MDP):

  • A set of states s ∈ S
  • A set of actions (per state) A
  • A model T(s, a, s′)
  • A reward function R(s, a, s′)

• Still looking for a policy π(s)
• New twist: we don't know T or R

  • i.e., we don't know which states are good or what the actions do
  • Must actually try actions and states out to learn

SLIDE 74

Reinforcement Learning

• Basic idea:

  • Receive feedback in the form of rewards
  • The agent's utility is defined by the reward function
  • Must (learn to) act so as to maximize expected rewards
  • All learning is based on observed samples of outcomes!

[Diagram: agent–environment loop; the agent sends actions a, the environment returns state s and reward r]

SLIDE 75

Applications

• Control & robotics

  • Autonomous helicopter
  • A self-reliant agent must learn from its own experiences
  • Eliminates hand-coding of control strategies

• Board games
• Resource (time, memory, channel, …) allocation
slide-76
SLIDE 76

Double Bandits

76

SLIDE 77

Let's Play!

Sample draws: $2 $2 $0 $2 $2 $2 $2 $0 $0 $0

SLIDE 78

What Just Happened?

• That wasn't planning, it was learning!

  • Specifically, reinforcement learning
  • There was an MDP, but you couldn't solve it with just computation
  • You needed to actually act to figure it out

• Important ideas in reinforcement learning that came up:

  • Exploration: you have to try unknown actions to get information
  • Exploitation: eventually, you have to use what you know
  • Regret: even if you learn intelligently, you make mistakes
  • Sampling: because of chance, you have to try things repeatedly
  • Difficulty: learning can be much harder than solving a known MDP

SLIDE 79

Offline (MDPs) vs. Online (RL)

[Images: Offline Solution | Online Learning]

SLIDE 80

RL algorithms

• Model-based (passive)

  • Learn a model of the environment (transition and reward probabilities)
  • Then run the value iteration or policy iteration algorithms

• Model-free (active)

SLIDE 81

Model-Based Learning

• Model-based idea:

  • Learn an approximate model based on experiences
  • Solve for values as if the learned model were correct

• Step 1: Learn the empirical MDP model

  • Count outcomes s′ for each (s, a)
  • Normalize to give an estimate of T(s, a, s′)
  • Discover each R(s, a, s′) when we experience (s, a, s′)

• Step 2: Solve the learned MDP

  • For example, use value iteration, as before

SLIDE 82

Example: Model-Based Learning

Input policy π on states A, B, C, D, E. Assume γ = 1.

Observed Episodes (Training):

Episode 1: B, east, C, −1; C, east, D, −1; D, exit, x, +10
Episode 2: B, east, C, −1; C, east, D, −1; D, exit, x, +10
Episode 3: E, north, C, −1; C, east, D, −1; D, exit, x, +10
Episode 4: E, north, C, −1; C, east, A, −1; A, exit, x, −10

Learned model:

T(s, a, s′): T(B, east, C) = 1.00; T(C, east, D) = 0.75; T(C, east, A) = 0.25; …
R(s, a, s′): R(B, east, C) = −1; R(C, east, D) = −1; R(D, exit, x) = +10; …
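A sketch of Step 1, estimating T and R by counting over observed transitions. The (s, a, s′, r) sample format mirrors the episodes above; the function name is illustrative, and rewards are assumed deterministic per (s, a, s′):

```python
from collections import Counter, defaultdict

def learn_model(transitions):
    """transitions: iterable of (s, a, s2, r) samples."""
    counts = defaultdict(Counter)    # (s, a) -> Counter of next states s2
    rewards = {}                     # (s, a, s2) -> observed r (assumed deterministic)
    for s, a, s2, r in transitions:
        counts[(s, a)][s2] += 1
        rewards[(s, a, s2)] = r
    T = {}
    for (s, a), c in counts.items():
        total = sum(c.values())
        for s2, n in c.items():
            T[(s, a, s2)] = n / total   # normalize counts into probabilities
    return T, rewards

samples = [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10),
           ("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10),
           ("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10),
           ("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)]
T, R = learn_model(samples)
print(T[("C", "east", "D")])   # 0.75, as in the slide
```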

SLIDE 83

Model-Free Learning

SLIDE 84

Reinforcement Learning

• We still assume an MDP:

  • A set of states s ∈ S
  • A set of actions (per state) A
  • A model T(s, a, s′)
  • A reward function R(s, a, s′)

• Still looking for a policy π(s)
• New twist: we don't know T or R, so we must try out actions
• Big idea: compute all averages over T using sample outcomes

SLIDE 85

Direct Evaluation of a Policy

• Goal: compute values for each state under π
• Idea: average together observed sample values

  • Act according to π
  • Every time you visit a state, write down what the sum of discounted rewards turned out to be
  • Average those samples

• This is called direct evaluation

SLIDE 86

Example: Direct Evaluation

Input policy π on states A, B, C, D, E. Assume γ = 1.

Observed Episodes (Training):

Episode 1: B, east, C, −1; C, east, D, −1; D, exit, x, +10
Episode 2: B, east, C, −1; C, east, D, −1; D, exit, x, +10
Episode 3: E, north, C, −1; C, east, D, −1; D, exit, x, +10
Episode 4: E, north, C, −1; C, east, A, −1; A, exit, x, −10

Output values: A = −10, B = +8, C = +4, D = +10, E = −2

SLIDE 87

Monte Carlo methods

• Do not assume complete knowledge of the environment
• Require only experience

  • Sample sequences of states, actions, and rewards from online or simulated interaction with an environment

• Are based on averaging sample returns
• Are defined for episodic tasks
slide-88
SLIDE 88

A Monte Carlo control algorithm using exploring starts

88

1)

Initialize 𝑅 and 𝜌 arbitrarily and 𝑆𝑓𝑢𝑣𝑠𝑜𝑡 to empty lists

2)

Repeat

 Generate an episode using 𝜌 and exploring starts  for each pair of 𝑡 and 𝑏 appearing in the episode

 𝑆 ←return following the first occurrence of 𝑡, 𝑏  Append 𝑆 to 𝑆𝑓𝑢𝑣𝑠𝑜𝑡(𝑡, 𝑏)  𝑅 𝑡, 𝑏 ← 𝑏𝑤𝑓𝑠𝑏𝑕𝑓 𝑆𝑓𝑢𝑣𝑠𝑜𝑡(𝑡, 𝑏)

for each 𝑡 in the episode 

𝜌(𝑡) ← argmax

𝑏

𝑅(𝑡, 𝑏)
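A sketch of this loop, assuming a hypothetical episodic simulator run_episode(pi, start_s, start_a) that forces the given first state-action pair and returns a list of (s, a, r) steps; first-visit returns with γ = 1 for brevity:

```python
import random
from collections import defaultdict

def mc_control_exploring_starts(states, actions, run_episode, n_iters=10000):
    Q = defaultdict(float)
    returns = defaultdict(list)
    pi = {s: random.choice(actions(s)) for s in states}   # arbitrary initial policy
    for _ in range(n_iters):
        # Exploring starts: random initial state-action pair
        s0 = random.choice(states)
        a0 = random.choice(actions(s0))
        episode = run_episode(pi, s0, a0)                 # [(s, a, r), ...]
        G, first_visit_return = 0.0, {}
        for s, a, r in reversed(episode):                 # accumulate returns (gamma = 1)
            G += r
            first_visit_return[(s, a)] = G                # earliest occurrence wins
        for (s, a), g in first_visit_return.items():
            returns[(s, a)].append(g)
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
        for s, _, _ in episode:                           # greedy policy improvement
            pi[s] = max(actions(s), key=lambda a: Q[(s, a)])
    return Q, pi
```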

SLIDE 89

Problems with Direct Evaluation

• What's good about direct evaluation?

  • It's easy to understand
  • It doesn't require any knowledge of T, R
  • It eventually computes the correct average values, using just sample transitions

• What's bad about it?

  • It wastes information about state connections
  • Each state must be learned separately
  • So, it takes a long time to learn

Output values (from the previous example): A = −10, B = +8, C = +4, D = +10, E = −2

If B and E both go to C under this policy, how can their values be different?
slide-90
SLIDE 90

Connections between states

 Simplified Bellman updates calculate V for a fixed policy:

Each round, replace V with a one-step-look-ahead layer over V

This approach fully exploited the connections between the states

Unfortunately, we need T and R to do it!

 Key question: how can we do this update to V without knowing T and R?

In other words, how to we take a weighted average without knowing the weights?

(s) s s, (s) s, (s),s’ s’

90

SLIDE 91

Connections between states

• We want to improve our estimate of V by computing these averages
• Idea: take samples of outcomes s′ (by doing the action!) and average

Almost! But we can't rewind time to get sample after sample from state s.

SLIDE 92

Temporal Difference Learning

• Big idea: learn from every experience!

  • Update V(s) each time we experience a transition (s, a, s′, r)
  • Likely outcomes s′ will contribute updates more often

• Temporal-difference learning of values

  • The policy is still fixed; we are still doing evaluation!
  • Move values toward the value of whatever successor occurs: running average

Sample of V(s): sample = r + γ V^π(s′)
Update to V(s): V^π(s) ← (1 − α) V^π(s) + α · sample
Same update: V^π(s) ← V^π(s) + α (sample − V^π(s))

SLIDE 93

Exponential Moving Average

• Exponential moving average

  • The running interpolation update: x̄_n = (1 − α) · x̄_{n−1} + α · x_n
  • Makes recent samples more important
  • Forgets about the past (distant past values were wrong anyway)

• A decreasing learning rate (α) can give converging averages

SLIDE 94

Example: Temporal Difference Learning

Assume: γ = 1, α = 1/2

Observed transitions: B, east, C, −2 and then C, east, D, −2

[Grid snapshots over states A–E: initially V(D) = 8 and the rest 0; after the first transition V(B) becomes −1; after the second, V(C) becomes 3]

SLIDE 95

Temporal difference methods

• TD learning is a combination of MC and DP (i.e., Bellman equation) ideas.

• Like MC methods, TD can learn directly from raw experience without a model of the environment's dynamics.

• Like DP, TD updates estimates based in part on other learned estimates, without waiting for a final outcome.

SLIDE 96

Temporal difference on value function

V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

1) Initialize V(s) arbitrarily
2) Repeat (for each episode)

  • Initialize s
  • Repeat (for each step of the episode):

   • a ← action given by policy π for s (π: the policy to be evaluated)
   • Take action a; observe reward r and next state s′
   • V(s) ← V(s) + α [ r + γ V(s′) − V(s) ]
   • s ← s′

  • until s is terminal

This works in a fully incremental fashion.
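A sketch of TD(0) policy evaluation, assuming a hypothetical environment interface env.step(s, a) that returns (reward, next_state, done):

```python
from collections import defaultdict

def td0_evaluate(pi, env, start_state, n_episodes=1000, alpha=0.1, gamma=0.9):
    """Estimate V^pi with the TD(0) update, one transition at a time."""
    V = defaultdict(float)                    # arbitrary init: V(s) = 0
    for _ in range(n_episodes):
        s, done = start_state, False
        while not done:
            a = pi[s]                         # the policy being evaluated
            r, s2, done = env.step(s, a)      # act, observe r and s'
            V[s] += alpha * (r + gamma * V[s2] - V[s])
            s = s2
    return V
```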

SLIDE 97

Problems with TD Value Learning

• TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages

• However, if we want to turn values into a (new) policy, we're sunk: the one-step look-ahead needs T and R

SLIDE 98

Unknown transition model: New policy

• With a model, state values alone are sufficient to determine a policy

  • Simply look ahead one step and choose whichever action leads to the best combination of reward and next state

• Without a model, state values alone are not sufficient.

  • However, if the agent knows Q(s, a), it can choose the optimal action without knowing T and R:

π*(s) = argmax_a Q(s, a)

SLIDE 99

Unknown transition model: New policy

• Idea: learn Q-values, not state values
• Makes action selection model-free too!

SLIDE 100

Detour: Q-Value Iteration

• Value iteration: find successive (depth-limited) values

  • Start with V_0(s) = 0, which we know is right
  • Given V_k, calculate the depth-(k+1) values for all states:

V_{k+1}(s) ← max_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V_k(s′) ]

• But Q-values are more useful, so compute them instead

  • Start with Q_0(s, a) = 0, which we know is right
  • Given Q_k, calculate the depth-(k+1) q-values for all q-states:

Q_{k+1}(s, a) ← Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ max_{a′} Q_k(s′, a′) ]

SLIDE 101

Q-Learning

• We'd like to do Q-value updates to each Q-state:

Q_{k+1}(s, a) ← Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ max_{a′} Q_k(s′, a′) ]

  • But we can't compute this update without knowing T, R

• Instead, compute the average as we go

  • Receive a sample transition (s, a, r, s′)
  • This sample suggests Q(s, a) ≈ r + γ max_{a′} Q(s′, a′)
  • But we want to average over results from (s, a) (why?)
  • So keep a running average

SLIDE 102

Q-Learning

• Learn Q(s, a) values as you go

  • Receive a sample (s, a, s′, r)
  • Consider your old estimate: Q(s, a)
  • Consider your new sample estimate: sample = r + γ max_{a′} Q(s′, a′)
  • Incorporate the new estimate into a running average:

Q(s, a) ← (1 − α) Q(s, a) + α · sample

SLIDE 103

Model-Free Learning

• Model-free (temporal difference) learning

  • Experience the world through episodes
  • Update estimates on each transition
  • Over time, the updates will mimic Bellman updates

[Diagram: trajectory s, a, r, s′, a′, s″, …]

SLIDE 104

Active Reinforcement Learning

• Full reinforcement learning: optimal policies (like value iteration)

  • You don't know the transitions T(s, a, s′)
  • You don't know the rewards R(s, a, s′)
  • You choose the actions now
  • Goal: learn the optimal policy / values

• In this case:

  • The learner makes choices!
  • Fundamental tradeoff: exploration vs. exploitation
  • This is NOT offline planning! You actually take actions in the world and find out what happens…

SLIDE 105

Exploration/exploitation tradeoff

• Exploitation: get high rewards by repeating previously well-rewarded actions

• Exploration: which actions are best?

  • Must try ones not tried before

SLIDE 106

How to Explore?

• Several schemes for forcing exploration

• Simplest: random actions (ε-greedy)

  • Every time step, flip a coin
  • With (small) probability ε, act randomly
  • With (large) probability 1 − ε, act on the current policy

• Problems with random actions?

  • You do eventually explore the space, but you keep thrashing around once learning is done
  • One solution: lower ε over time
  • Another solution: exploration functions

SLIDE 107

Q-learning: Policy

• Greedy action selection:

π(s) = argmax_a Q̂(s, a)

• ε-greedy: greedy most of the time, occasionally take a random action

• Softmax policy: give a higher probability to the actions that currently have better utility, e.g.,

π(s, a) = e^{Q̂(s, a)} / Σ_{a′} e^{Q̂(s, a′)}

• After learning Q*, is the policy greedy?
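A sketch of these selection rules as helper functions (the function names and the temperature parameter are illustrative additions; the slide's softmax corresponds to temperature = 1; Q is a dict mapping (state, action) pairs to estimated values):

```python
import math
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Random action with probability eps, else greedy w.r.t. Q."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def softmax_action(Q, s, actions, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    prefs = [math.exp(Q.get((s, a), 0.0) / temperature) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs])[0]
```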

SLIDE 108

Q-learning algorithm: Non-deterministic environments

• Initialize Q̂(s, a) arbitrarily
• Repeat (for each episode):

  • Initialize s
  • Repeat (for each step of the episode):

   • Choose a from s using a policy derived from Q̂ (e.g., greedy, ε-greedy)
   • Take action a, receive reward r, observe the new state s′

   Q̂(s, a) ← Q̂(s, a) + α [ r + γ max_{a′} Q̂(s′, a′) − Q̂(s, a) ]

   • s ← s′

  • until s is terminal
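The algorithm in code. A minimal sketch under the same assumed env.step(s, a) -> (reward, next_state, done) interface, with an ε-greedy behavior policy inlined:

```python
import random
from collections import defaultdict

def q_learning(env, start_state, actions, n_episodes=5000,
               alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    Q = defaultdict(float)                         # arbitrary init: Q(s, a) = 0
    for _ in range(n_episodes):
        s, done = start_state, False
        while not done:
            if random.random() < eps:              # epsilon-greedy action choice
                a = random.choice(actions(s))
            else:
                a = max(actions(s), key=lambda a_: Q[(s, a_)])
            r, s2, done = env.step(s, a)
            # Off-policy target: max over next actions, regardless of what we do next
            target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in actions(s2))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```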

SLIDE 109

Q-learning convergence

• Q-learning converges to the optimal Q-values if

  • Every state is visited infinitely often
  • The policy for action selection becomes greedy as time approaches infinity
  • The step-size parameter is chosen appropriately

SLIDE 110

Step size parameter

• Stochastic approximation conditions

  • The learning rate is decreased fast enough, but not too fast

• One possible choice for α_n:

α_n = 1 / visits_n(s, a)

where visits_n(s, a) is the number of times (s, a) has been visited up to iteration n.

SLIDE 111

Regret

• Even if you learn the optimal policy, you still make mistakes along the way!

• Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal (expected) rewards

• Minimizing regret goes beyond learning to be optimal – it requires optimally learning to be optimal

• Example: random exploration and exploration functions both end up optimal, but random exploration has higher regret

SLIDE 112

Q-Learning Properties

• Amazing result: Q-learning converges to the optimal policy – even if you're acting suboptimally!

  • This is called off-policy learning

• Caveats:

  • You have to explore enough
  • You have to eventually make the learning rate small enough
  • … but not decrease it too quickly
  • Basically, in the limit, it doesn't matter how you select actions (!)

SLIDE 113

Video of Demo Q-Learning – Gridworld

SLIDE 114

Video of Demo Q-learning – Epsilon-Greedy – Crawler

SLIDE 115

Video of Demo Q-Learning – Crawler

SLIDE 116

SARSA (State-Action-Reward-State-Action)

• Initialize Q̂(s, a)
• Repeat (for each episode):

  • Initialize s
  • Choose a from s using a policy derived from Q̂ (e.g., greedy, ε-greedy)
  • Repeat (for each step of the episode):

   • Take action a, receive reward r, observe the new state s′
   • Choose a′ from s′ using a policy derived from Q̂

   Q̂(s, a) ← Q̂(s, a) + α [ r + γ Q̂(s′, a′) − Q̂(s, a) ]

   • s ← s′
   • a ← a′

  • until s is terminal
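For comparison with Q-learning, a SARSA sketch under the same assumed env.step interface; note the target uses the action actually chosen next, not the max:

```python
import random
from collections import defaultdict

def sarsa(env, start_state, actions, n_episodes=5000,
          alpha=0.1, gamma=0.9, eps=0.1):
    """On-policy TD control: the update uses the next action actually chosen."""
    Q = defaultdict(float)

    def select(s):                                 # epsilon-greedy behavior policy
        if random.random() < eps:
            return random.choice(actions(s))
        return max(actions(s), key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s, done = start_state, False
        a = select(s)
        while not done:
            r, s2, done = env.step(s, a)
            if done:
                target = r
            else:
                a2 = select(s2)
                target = r + gamma * Q[(s2, a2)]   # uses the chosen a', not the max
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            if not done:
                s, a = s2, a2
    return Q
```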

SLIDE 117

Q-learning vs. SARSA

• Q-learning is an off-policy learning algorithm
• SARSA is an on-policy learning algorithm

SLIDE 118

The Story So Far: MDPs and RL

Known MDP: Offline Solution

• Compute V*, Q*, π* → value / policy iteration
• Evaluate a fixed policy π → policy evaluation

Unknown MDP: Model-Based

• Compute V*, Q*, π* → VI/PI on the approximate MDP
• Evaluate a fixed policy π → PE on the approximate MDP

Unknown MDP: Model-Free

• Compute V*, Q*, π* → Q-learning
• Evaluate a fixed policy π → value learning

SLIDE 119

Tabular methods: Problem

• All of the methods introduced so far maintain a table
• The table size can be very large for complex environments

  • Too many states to visit them all in training

   • We may not even visit some states

  • Too many states to hold the Q-tables in memory

• So this is both a computation and a memory problem

SLIDE 120

Generalizing Across States

• Basic Q-learning keeps a table of all Q-values
• In realistic situations, we cannot possibly learn about every single state!

• Instead, we want to generalize:

  • Learn about some small number of training states from experience
  • Generalize that experience to new, similar situations
  • This is a fundamental idea in machine learning, and we'll see it over and over again

SLIDE 121

Example: Pacman

• Let's say we discover through experience that this state is bad
• In naïve Q-learning, we know nothing about this (similar) state
• Or even this one!

SLIDE 122

Feature-Based Representations

• Solution: describe a state using a vector of features (properties)

  • Features are functions from states to real numbers (often 0/1) that capture important properties of the state

• Example features:

  • Distance to closest ghost
  • Distance to closest dot
  • Number of ghosts
  • 1 / (distance to dot)²
  • Is Pacman in a tunnel? (0/1)
  • … etc.
  • Is it the exact state on this slide?

• Can also describe a q-state (s, a) with features (e.g., "action moves closer to food")

SLIDE 123

Linear Value Functions

• Using a feature representation, we can write a q function (or value function) for any state using a few weights:

V(s) = w₁ f₁(s) + ⋯ + w_n f_n(s)
Q(s, a) = w₁ f₁(s, a) + ⋯ + w_n f_n(s, a)

• Advantage: our experience is summed up in a few powerful numbers
• Disadvantage: states may share features but actually be very different in value!

SLIDE 124

Function Approximation

• We approximate V(s) and Q(s, a) with any supervised learning method:

V_w(s) = w₁ f₁(s) + ⋯ + w_n f_n(s)

or

Q_w(s, a) = w₁ f₁(s, a) + ⋯ + w_n f_n(s, a)

• We can generalize from visited states to unvisited ones

  • In addition to the smaller space requirement

SLIDE 125

Optimization: Least Squares

[Plot: least-squares regression showing observations, the prediction line, and the error or "residual" between them]

SLIDE 126

Least squares error: minimization

• The approximate Q update, explained: imagine we had only one point x, with features f(x), target value y, and weights w:

error(w) = ½ (y − Σ_k w_k f_k(x))²

∂error/∂w_m = −(y − Σ_k w_k f_k(x)) f_m(x)

w_m ← w_m + α (y − Σ_k w_k f_k(x)) f_m(x)

Here y is the "target" and Σ_k w_k f_k(x) is the "prediction".

SLIDE 127

Approximate Q-Learning

• Q-learning with linear Q-functions:

difference = [ r + γ max_{a′} Q(s′, a′) ] − Q(s, a)

Exact Q's:  Q(s, a) ← Q(s, a) + α · difference
Approximate Q's:  w_m ← w_m + α · difference · f_m(s, a)

• Intuitive interpretation:

  • Adjust the weights of active features
  • E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features
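A sketch of one step of this feature-based update, assuming a hypothetical feature function features(s, a) that returns a dict of feature values, with weights w stored as a dict:

```python
def approx_q_update(w, features, s, a, r, s2, next_actions,
                    alpha=0.05, gamma=0.9):
    """One approximate Q-learning step on the weight dict w."""
    def q(state, action):
        # Linear Q-function: sum of weight * feature over active features
        return sum(w.get(k, 0.0) * v for k, v in features(state, action).items())

    target = r + gamma * max((q(s2, a2) for a2 in next_actions), default=0.0)
    difference = target - q(s, a)
    for k, v in features(s, a).items():        # adjust weights of active features
        w[k] = w.get(k, 0.0) + alpha * difference * v
    return w
```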

SLIDE 128

Example: Q-Pacman

SLIDE 129

Least squares error: minimization

• In the general case, minimize the total squared error over all observations:

Σ_i (y_i − Σ_k w_k f_k(x_i))²

SLIDE 130

Adjusting function weights

w ← w + α [ r + γ V̂_w(s′) − V̂_w(s) ] ∇_w V̂_w(s)

or

w ← w + α [ r + γ max_{a′} Q̂_w(s′, a′) − Q̂_w(s, a) ] ∇_w Q̂_w(s, a)

Tesauro used function approximation in his backgammon-playing temporal-difference learning research: TD-Gammon plays at the level of the best human players (learning through self-play).

SLIDE 131

Deep RL

• Q(s, a; w): a neural network with parameters w
• A single feedforward pass computes the Q-values of all actions for the current state s

SLIDE 132

Playing Atari Games

• State: raw pixels of images
• Action: e.g., R, L, U, D
• Reward: score
• The last layer has a 4-d output (if there are 4 actions):

  • Q(s_t, a₁), Q(s_t, a₂), Q(s_t, a₃), Q(s_t, a₄)

[Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]

SLIDE 133

References

• T. Mitchell, Machine Learning, McGraw-Hill, 1997. [Chapter 13]
• R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998. [Chapters 1, 3, 4, 6]