

slide-1
SLIDE 1

Machine Learning and Data Mining Reinforcement Learning Markov Decision Processes

Kalev Kask


slide-2
SLIDE 2

Overview

  • Intro
  • Markov Decision Processes
  • Reinforcement Learning

– Sarsa
– Q-learning

  • Exploration vs Exploitation tradeoff

2

slide-3
SLIDE 3

Resources

  • Book: Reinforcement Learning: An Introduction

Richard S. Sutton and Andrew G. Barto

  • UCL Course on Reinforcement Learning

David Silver

– https://www.youtube.com/watch?v=2pWv7GOvuf0
– https://www.youtube.com/watch?v=lfHX2hHRMVQ
– https://www.youtube.com/watch?v=Nd1-UUMVfz4
– https://www.youtube.com/watch?v=PnHCvfgC_ZA
– https://www.youtube.com/watch?v=0g4j2k_Ggc4
– https://www.youtube.com/watch?v=UoPei5o4fps

3

slide-4
SLIDE 4

Lecture 1: Introduction to Reinforcement Learning About RL

Branches of Machine Learning

[Figure: branches of machine learning – supervised learning, unsupervised learning, and reinforcement learning within machine learning]

4

slide-5
SLIDE 5

Why is it different

  • No target values to predict
  • Feedback in the form of rewards

– May be delayed, not instantaneous

  • Have a goal: maximize reward
  • Have a timeline: actions along the arrow of time
  • Actions affect what data it will receive

5

slide-6
SLIDE 6

Agent-Environment

6

slide-7
SLIDE 7

Lecture 1: Introduction to Reinforcement Learning The RL Problem Environments

Agent and Environment

[Figure: agent–environment loop with observation Ot, reward Rt, action At]

At each step t the agent:

Executes action At
Receives observation Ot
Receives scalar reward Rt

The environment:

Receives action At
Emits observation Ot+1
Emits scalar reward Rt+1

t increments at env. step
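This interaction loop is easy to state in code. Below is a minimal illustrative sketch; `env` and `agent` are assumed objects with hypothetical reset/step/act methods, not an API defined in these slides.

```python
# Hypothetical agent-environment loop; object names and methods are assumptions.
def run_episode(env, agent, max_steps=1000):
    obs = env.reset()          # initial observation O_0
    reward = 0.0
    for t in range(max_steps):
        action = agent.act(obs, reward)        # agent executes action A_t
        obs, reward, done = env.step(action)   # env emits O_{t+1} and R_{t+1}
        if done:
            break
```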

7

slide-8
SLIDE 8

Sequential Decision Making

  • Actions have long term consequences
  • Goal: maximize cumulative (long term) reward

– Rewards may be delayed
– May need to sacrifice short term reward

  • Devise a plan to maximize cumulative reward

8

slide-9
SLIDE 9

Lecture 1: Introduction to Reinforcement Learning The RL Problem Reward

Sequential Decision Making Examples:

A financial investment (may take months to mature)
Refuelling a helicopter (might prevent a crash in several hours)
Blocking opponent moves (might help winning chances many moves from now)

9

slide-10
SLIDE 10

Reinforcement Learning

10

Learn a behavior strategy (policy) that maximizes the long-term sum of rewards in an unknown and stochastic environment (Emma Brunskill)

Planning under Uncertainty

Learn a behavior strategy (policy) that maximizes the long-term sum of rewards in a known stochastic environment (Emma Brunskill)

slide-11
SLIDE 11

11

slide-12
SLIDE 12

Lecture 1: Introduction to Reinforcement Learning Problems within RL

Atari Example: Reinforcement Learning

[Figure: agent–environment loop with observation Ot, reward Rt, action At]

Rules of the game are unknown
Learn directly from interactive game-play
Pick actions on joystick, see pixels and scores

12

slide-13
SLIDE 13

13

Demos

Some videos

  • https://www.youtube.com/watch?v=V1eYniJ0Rnk
  • https://www.youtube.com/watch?v=CIF2SBVY-J0
  • https://www.youtube.com/watch?v=I2WFvGl4y8c
slide-14
SLIDE 14

14

Markov Property

slide-15
SLIDE 15

15

State Transition

slide-16
SLIDE 16

16

Markov Process

slide-17
SLIDE 17

17

Student Markov Chain

slide-18
SLIDE 18

18

Student MC : Episodes

slide-19
SLIDE 19

19

Student MC : Transition Matrix

slide-20
SLIDE 20

20

Return

slide-21
SLIDE 21

21

Value

slide-22
SLIDE 22

22

Student MRP

slide-23
SLIDE 23

23

Student MRP : Returns

slide-24
SLIDE 24

24

Student MRP : Value Function

slide-25
SLIDE 25

25

Student MRP : Value Function

slide-26
SLIDE 26

26

Student MRP : Value Function

slide-27
SLIDE 27

27

Bellman Equation for MRP

slide-28
SLIDE 28

28

Backup Diagrams for MRP

slide-29
SLIDE 29

29

Bellman Eq: Student MRP

slide-30
SLIDE 30

30

Bellman Eq: Student MRP

slide-31
SLIDE 31

Lecture 2: Markov Decision Processes Markov Reward Processes Bellman Equation

Solving the Bellman Equation

The Bellman equation is a linear equation, so it can be solved directly:

v = R + γPv
(I − γP) v = R
v = (I − γP)⁻¹ R

Computational complexity is O(n³) for n states
Direct solution only possible for small MRPs
There are many iterative methods for large MRPs, e.g.

Dynamic programming
Monte-Carlo evaluation
Temporal-Difference learning
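As a minimal sketch of the direct solution, assuming a small hand-made 3-state MRP (the transition matrix and rewards below are illustrative, not from the slides):

```python
import numpy as np

gamma = 0.9
# Assumed 3-state MRP: rows of P sum to 1, R holds the expected immediate rewards.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([-1.0, -2.0, 0.0])

# v = (I - gamma*P)^(-1) R, computed by solving the linear system instead of inverting.
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```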

31

slide-32
SLIDE 32

32

Markov Decision Process

slide-33
SLIDE 33

33

Student MDP

slide-34
SLIDE 34

34

Policies

slide-35
SLIDE 35

35

MP → MRP → MDP

slide-36
SLIDE 36

36

Value Function

slide-37
SLIDE 37

37

Bellman Eq for MDP

Evaluating the Bellman equation translates into a 1-step lookahead.

slide-38
SLIDE 38

38

Bellman Eq, V

slide-39
SLIDE 39

39

Bellman Eq, q

slide-40
SLIDE 40

40

Bellman Eq, V

slide-41
SLIDE 41

41

Bellman Eq, q

slide-42
SLIDE 42

42

Student MDP : Bellman Eq

slide-43
SLIDE 43

43

Bellman Eq : Matrix Form

slide-44
SLIDE 44

44

Optimal Value Function

slide-45
SLIDE 45

45

Student MDP : Optimal V

slide-46
SLIDE 46

46

Student MDP : Optimal Q

slide-47
SLIDE 47

Lecture 2: Markov Decision Processes Markov Decision Processes Optimal Value Functions

Optimal Policy

Define a partial ordering over policies: π ≥ π′ if vπ(s) ≥ vπ′(s), ∀s

Theorem. For any Markov Decision Process:

There exists an optimal policy π∗ that is better than or equal to all other policies, π∗ ≥ π, ∀π
All optimal policies achieve the optimal value function, vπ∗(s) = v∗(s)
All optimal policies achieve the optimal action-value function, qπ∗(s, a) = q∗(s, a)

47

slide-48
SLIDE 48

Lecture 2: Markov Decision Processes Markov Decision Processes Optimal Value Functions

Finding an Optimal Policy

An optimal policy can be found by maximising over q∗(s, a)
There is always a deterministic optimal policy for any MDP
If we know q∗(s, a), we immediately have the optimal policy
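As a minimal sketch (the tabular array Q of shape [n_states, n_actions] is an assumed representation, not the slides' notation): once q∗ is known, the greedy action in each state gives the optimal policy.

```python
import numpy as np

def greedy_policy(Q):
    # pi[s] = argmax_a Q[s, a]; deterministic, one action per state.
    return np.argmax(Q, axis=1)
```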

48

slide-49
SLIDE 49

49

Student MDP : Optimal Policy

slide-50
SLIDE 50

50

Bellman Optimality Eq, V

slide-51
SLIDE 51

51

Student MDP : Bellman Optimality

slide-52
SLIDE 52

Lecture 2: Markov Decision Processes Markov Decision Processes Bellman Optimality Equation

Solving the Bellman Optimality Equation

Bellman Optimality Equation is non-linear
No closed-form solution (in general)
Many iterative solution methods:

Value Iteration
Policy Iteration
Q-learning
Sarsa

52

Not easy

slide-53
SLIDE 53

Lecture 1: Introduction to Reinforcement Learning Inside An RL Agent

Maze Example

Start Goal

Rewards: −1 per time-step
Actions: N, E, S, W
States: Agent’s location

53

slide-54
SLIDE 54

Lecture 1: Introduction to Reinforcement Learning Inside An RL Agent

Maze Example: Policy

Start Goal

Arrows represent policy π(s) for each state s

54

slide-55
SLIDE 55

Lecture 1: Introduction to Reinforcement Learning Inside An RL Agent

Maze Example: Value Function


Start Goal

Numbers represent value vπ (s) of each state s

55

slide-56
SLIDE 56

Lecture 1: Introduction to Reinforcement Learning Inside An RL Agent

Maze Example: Model

Start Goal

Agent may have an internal model of the environment
Dynamics: how actions change the state
Rewards: how much reward from each state
The model may be imperfect

Grid layout represents the transition model P^a_{ss′}
Numbers represent the immediate reward R^a_s from each state s (same for all a)

56

slide-57
SLIDE 57

Algorithms for MDPs

57

slide-58
SLIDE 58

Lecture 1: Introduction to Reinforcement Learning Inside An RL Agent

Model

58

slide-59
SLIDE 59

Algorithms cont.

59

Prediction Control

slide-60
SLIDE 60

Lecture 1: Introduction to Reinforcement Learning Problems within RL

Learning and Planning

Two fundamental problems in sequential decision making

Reinforcement Learning:

The environment is initially unknown
The agent interacts with the environment
The agent improves its policy

Planning:

A model of the environment is known
The agent performs computations with its model (without any external interaction)
The agent improves its policy
a.k.a. deliberation, reasoning, introspection, pondering, thought, search

60

slide-61
SLIDE 61

Lecture 1: Introduction to Reinforcement Learning Inside An RL Agent

Major Components of an RL Agent

An RL agent may include one or more of these components:

Policy: agent’s behaviour function
Value function: how good is each state and/or action
Model: agent’s representation of the environment

61

slide-62
SLIDE 62

Dynamic Programming

62

slide-63
SLIDE 63

Requirements for DP

63

slide-64
SLIDE 64

Applications for DPs

64

slide-65
SLIDE 65

Lecture 3: Planning by Dynamic Programming Introduction

Planning by Dynamic Programming

Dynamic programming assumes full knowledge of the MDP It is used for planning in an MDP For prediction:

Input: MDP (S, A, P, R, γ) and policy π

or: MRP (S, Pπ, Rπ, γ)

Output: value function vπ

Or for control:

Input: MDP (S, A, P, R, γ)
Output: optimal value function v∗
and: optimal policy π∗

65

slide-66
SLIDE 66

Lecture 3: Planning by Dynamic Programming Policy Evaluation Iterative Policy Evaluation

Policy Evaluation (Prediction)

Problem: evaluate a given policy π
Solution: iterative application of the Bellman expectation backup
v1 → v2 → ... → vπ

Using synchronous backups:

At each iteration k + 1
For all states s ∈ S
Update vk+1(s) from vk(s′), where s′ is a successor state of s

We will discuss asynchronous backups later
Convergence to vπ can be proven
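A minimal sketch of iterative policy evaluation with synchronous sweeps over a small tabular MDP. The encoding is an assumption for illustration: P[s][a] is a list of (prob, next_state, reward) triples and pi[s][a] is the policy's probability of taking action a in state s.

```python
import numpy as np

def policy_evaluation(P, pi, n_states, n_actions, gamma=1.0, theta=1e-6):
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        new_v = np.zeros(n_states)
        for s in range(n_states):
            # Bellman expectation backup: 1-step lookahead over actions and successors.
            new_v[s] = sum(pi[s][a] * sum(p * (r + gamma * v[s2])
                                          for p, s2, r in P[s][a])
                           for a in range(n_actions))
            delta = max(delta, abs(new_v[s] - v[s]))
        v = new_v  # synchronous update: every state backed up from the old v
        if delta < theta:
            return v
```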

66

slide-67
SLIDE 67

Iterative Policy Evaluation

68

slide-68
SLIDE 68

Lecture 3: Planning by Dynamic Programming Policy Evaluation Example: Small Gridworld

Evaluating a Random Policy in the Small Gridworld

Undiscounted episodic MDP (γ = 1)
Nonterminal states 1, ..., 14
One terminal state (shown twice as shaded squares)
Actions leading out of the grid leave the state unchanged
Reward is −1 until the terminal state is reached
Agent follows the uniform random policy π(n|·) = π(e|·) = π(s|·) = π(w|·) = 0.25

69

slide-69
SLIDE 69

Policy Evaluation : Grid World

70

slide-70
SLIDE 70

Policy Evaluation : Grid World

71

slide-71
SLIDE 71

Policy Evaluation : Grid World

72

slide-72
SLIDE 72

Policy Evaluation : Grid World

73

slide-73
SLIDE 73

74

Most of the story in a nutshell:

slide-74
SLIDE 74

Finding Best Policy

75

slide-75
SLIDE 75

Lecture 3: Planning by Dynamic Programming Policy Iteration

Policy Improvement

Given a policy π

Evaluate the policy π:
vπ(s) = E[Rt+1 + γRt+2 + ... | St = s]

Improve the policy by acting greedily with respect to vπ:
π′ = greedy(vπ)

In the Small Gridworld the improved policy was optimal, π′ = π∗
In general, need more iterations of improvement / evaluation
But this process of policy iteration always converges to π∗
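A minimal sketch of this evaluate/improve loop, reusing the policy_evaluation sketch above and the same assumed MDP encoding:

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=1.0):
    # Start from the uniform random policy.
    pi = np.full((n_states, n_actions), 1.0 / n_actions)
    while True:
        v = policy_evaluation(P, pi, n_states, n_actions, gamma)
        stable = True
        for s in range(n_states):
            # Greedy improvement: act greedily w.r.t. v_pi via a 1-step lookahead.
            q = [sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            best = int(np.argmax(q))
            if pi[s][best] != 1.0:
                stable = False
            pi[s] = np.eye(n_actions)[best]
        if stable:
            return pi, v
```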

76

slide-76
SLIDE 76

Policy Iteration

77

slide-77
SLIDE 77

Lecture 3: Planning by Dynamic Programming Policy Iteration

Policy Iteration

Policy evaluation: estimate vπ (iterative policy evaluation)
Policy improvement: generate π′ ≥ π (greedy policy improvement)

78

slide-78
SLIDE 78

Jack’s Car Rental

79

slide-79
SLIDE 79

Policy Iteration in Car Rental

80

slide-80
SLIDE 80

Lecture 3: Planning by Dynamic Programming Policy Iteration Policy Improvement

Policy Improvement

81

slide-81
SLIDE 81

Lecture 3: Planning by Dynamic Programming Policy Iteration Policy Improvement

Policy Improvement (2)

If improvements stop,

qπ(s, π′(s)) = max_{a∈A} qπ(s, a) = qπ(s, π(s)) = vπ(s)

then the Bellman optimality equation has been satisfied:

vπ(s) = max_{a∈A} qπ(s, a)

Therefore vπ(s) = v∗(s) for all s ∈ S, so π is an optimal policy

82

slide-82
SLIDE 82

Lecture 3: Planning by Dynamic Programming Contraction Mapping

Some Technical Questions

How do we know that value iteration converges to v∗?
Or that iterative policy evaluation converges to vπ?
And therefore that policy iteration converges to v∗?
Is the solution unique?
How fast do these algorithms converge?
These questions are resolved by the contraction mapping theorem

83

slide-83
SLIDE 83

Lecture 3: Planning by Dynamic Programming Contraction Mapping

Value Function Space

Consider the vector space V over value functions
There are |S| dimensions
Each point in this space fully specifies a value function v(s)
What does a Bellman backup do to points in this space?
We will show that it brings value functions closer
And therefore the backups must converge on a unique solution

84

slide-84
SLIDE 84

Lecture 3: Planning by Dynamic Programming Contraction Mapping

Value Function ∞-Norm

We will measure distance between state-value functions u and v by the ∞-norm, i.e. the largest difference between state values:

||u − v||∞ = max_{s∈S} |u(s) − v(s)|
85

slide-85
SLIDE 85

Lecture 3: Planning by Dynamic Programming Contraction Mapping

Bellman Expectation Backup is a Contraction

86

slide-86
SLIDE 86

Lecture 3: Planning by Dynamic Programming Contraction Mapping

Contraction Mapping Theorem

Theorem (Contraction Mapping Theorem)

For any metric space V that is complete (i.e. closed) under an operator T(v), where T is a γ-contraction:

T converges to a unique fixed point
At a linear convergence rate of γ

87

slide-87
SLIDE 87

Lecture 3: Planning by Dynamic Programming Contraction Mapping

Convergence of Iterative Policy Evaluation and Policy Iteration

The Bellman expectation operator Tπ has a unique fixed point
vπ is a fixed point of Tπ (by the Bellman expectation equation)
By the contraction mapping theorem:
Iterative policy evaluation converges on vπ
Policy iteration converges on v∗

88

slide-88
SLIDE 88

Lecture 3: Planning by Dynamic Programming Contraction Mapping

Bellman Optimality Backup is a Contraction

Define the Bellman optimality backup operator T∗:

T∗(v) = max_{a∈A} (R^a + γP^a v)

This operator is a γ-contraction, i.e. it makes value functions closer by at least γ (similar to the previous proof):

||T∗(u) − T∗(v)||∞ ≤ γ ||u − v||∞

89

slide-89
SLIDE 89

Lecture 3: Planning by Dynamic Programming Contraction Mapping

Convergence of Value Iteration

The Bellman optimality operator T∗ has a unique fixed point
v∗ is a fixed point of T∗ (by the Bellman optimality equation)
By the contraction mapping theorem:
Value iteration converges on v∗

90

slide-90
SLIDE 90

91

Most of the story in a nutshell:

slide-91
SLIDE 91

92

Most of the story in a nutshell:

slide-92
SLIDE 92

93

Most of the story in a nutshell:

slide-93
SLIDE 93

Lecture 3: Planning by Dynamic Programming Policy Iteration Extensions to Policy Iteration

Modified Policy Iteration

Does policy evaluation need to converge to vπ?
Or should we introduce a stopping condition,
e.g. ε-convergence of the value function?

Or simply stop after k iterations of iterative policy evaluation?
For example, in the small gridworld k = 3 was sufficient to achieve the optimal policy
Why not update the policy every iteration? i.e. stop after k = 1

This is equivalent to value iteration (next section)

94

slide-94
SLIDE 94

Lecture 3: Planning by Dynamic Programming Policy Iteration Extensions to Policy Iteration

Generalised Policy Iteration

Policy evaluation: estimate vπ (any policy evaluation algorithm)
Policy improvement: generate π′ ≥ π (any policy improvement algorithm)

95

slide-95
SLIDE 95

Lecture 3: Planning by Dynamic Programming Value Iteration Value Iteration in MDPs

Value Iteration

Problem: find the optimal policy π∗
Solution: iterative application of the Bellman optimality backup
v1 → v2 → ... → v∗

Using synchronous backups:

At each iteration k + 1
For all states s ∈ S
Update vk+1(s) from vk(s′)

Convergence to v∗ will be proven later

Unlike policy iteration, there is no explicit policy
Intermediate value functions may not correspond to any policy
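A minimal sketch of value iteration under the same assumed MDP encoding as the earlier sketches; as the slide notes, no explicit policy is maintained until the end.

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=1.0, theta=1e-6):
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality backup: max over actions of the 1-step lookahead
            # (done in place, in the spirit of the in-place DP idea discussed later).
            best = max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                       for a in range(n_actions))
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            break
    # Extract a greedy policy only after the values have converged.
    pi = [int(np.argmax([sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                         for a in range(n_actions)]))
          for s in range(n_states)]
    return v, pi
```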

96

slide-96
SLIDE 96

Lecture 3: Planning by Dynamic Programming Value Iteration Value Iteration in MDPs

Value Iteration (2)

97

slide-97
SLIDE 97

Lecture 3: Planning by Dynamic Programming Extensions to Dynamic Programming Asynchronous Dynamic Programming

Asynchronous Dynamic Programming

DP methods described so far used synchronous backups, i.e. all states are backed up in parallel
Asynchronous DP backs up states individually, in any order
For each selected state, apply the appropriate backup
Can significantly reduce computation
Guaranteed to converge if all states continue to be selected

99

slide-98
SLIDE 98

Lecture 3: Planning by Dynamic Programming Extensions to Dynamic Programming Asynchronous Dynamic Programming

Asynchronous Dynamic Programming

Three simple ideas for asynchronous dynamic programming:

In-place dynamic programming
Prioritised sweeping
Real-time dynamic programming

100

slide-99
SLIDE 99

Lecture 3: Planning by Dynamic Programming Extensions to Dynamic Programming Asynchronous Dynamic Programming

In-Place Dynamic Programming

101

slide-100
SLIDE 100

Lecture 3: Planning by Dynamic Programming Extensions to Dynamic Programming Asynchronous Dynamic Programming

Prioritised Sweeping

102

slide-101
SLIDE 101

Lecture 3: Planning by Dynamic Programming Extensions to Dynamic Programming Asynchronous Dynamic Programming

Real-Time Dynamic Programming

Idea: back up only states that are relevant to the agent
Use the agent’s experience to guide the selection of states
After each time-step St, At, Rt+1
Back up the state St

103

slide-102
SLIDE 102

Lecture 3: Planning by Dynamic Programming Extensions to Dynamic Programming Full-width and sample backups

Full-Width Backups

DP uses full-width backups
For each backup (sync or async):

Every successor state and action is considered
Using knowledge of the MDP transitions and reward function

DP is effective for medium-sized problems (millions of states)
For large problems DP suffers Bellman’s curse of dimensionality:

Number of states n = |S| grows exponentially with number of state variables

Even one backup can be too expensive

104

slide-103
SLIDE 103

Lecture 3: Planning by Dynamic Programming Extensions to Dynamic Programming Full-width and sample backups

Sample Backups

In subsequent lectures we will consider sample backups
Using sample rewards and sample transitions (S, A, R, S′)
Instead of the reward function R and transition dynamics P

Advantages:

Model-free: no advance knowledge of the MDP required
Breaks the curse of dimensionality through sampling
Cost of a backup is constant, independent of n = |S|
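As a rough illustrative sketch of the contrast (assuming a TD(0)-style update, an algorithm only introduced later in the course): a sampled backup touches a single transition (S, A, R, S′) rather than every successor.

```python
def sampled_backup(v, s, r, s_next, alpha=0.1, gamma=1.0):
    # Move v[s] toward the sampled 1-step target r + gamma * v[s_next];
    # cost is O(1), independent of the number of states.
    v[s] += alpha * (r + gamma * v[s_next] - v[s])
    return v
```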

105

slide-104
SLIDE 104

Lecture 3: Planning by Dynamic Programming Extensions to Dynamic Programming Approximate Dynamic Programming

Approximate Dynamic Programming

106

slide-105
SLIDE 105

Monte Carlo Learning

107

slide-106
SLIDE 106

Lecture 4: Model-Free Prediction Monte-Carlo Learning

Monte-Carlo Reinforcement Learning

MC methods learn directly from episodes of experience
MC is model-free: no knowledge of MDP transitions / rewards
MC learns from complete episodes: no bootstrapping
MC uses the simplest possible idea: value = mean return
Caveat: can only apply MC to episodic MDPs

All episodes must terminate

MC methods can solve the RL problem by averaging sample returns
MC is incremental episode by episode, but not step by step
Approach: adapt general policy iteration to sample returns
First policy evaluation, then policy improvement, then control

108

slide-107
SLIDE 107

Lecture 4: Model-Free Prediction Monte-Carlo Learning

Monte-Carlo Policy Evaluation

Goal: learn vπ from episodes of experience under policy π:
S1, A1, R2, ..., Sk ∼ π

Recall that the return is the total discounted reward:
Gt = Rt+1 + γRt+2 + ... + γ^(T−1) RT

Recall that the value function is the expected return:
vπ(s) = Eπ[Gt | St = s]

Monte-Carlo policy evaluation uses the empirical mean return instead of the expected return, because we do not have the model
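A minimal sketch of every-visit Monte-Carlo policy evaluation. The episode format is an assumption for illustration: each episode is a list of (state, reward) pairs, where reward is the reward received after leaving that state.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=1.0):
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:
        g = 0.0
        # Walk backwards so g accumulates the discounted return G_t.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns_sum[state] += g
            returns_cnt[state] += 1
    # Empirical mean return as the value estimate for each visited state.
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```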

109

slide-108
SLIDE 108

Every Visit MC Policy Evaluation

110

slide-109
SLIDE 109

111

slide-110
SLIDE 110

112

slide-111
SLIDE 111

113

slide-112
SLIDE 112

114

slide-113
SLIDE 113

115

slide-114
SLIDE 114

116

slide-115
SLIDE 115

117

slide-116
SLIDE 116

118

slide-117
SLIDE 117

119

slide-118
SLIDE 118

120

slide-119
SLIDE 119

121

slide-120
SLIDE 120

122

slide-121
SLIDE 121

123

slide-122
SLIDE 122

124

slide-123
SLIDE 123

125

slide-124
SLIDE 124

126

slide-125
SLIDE 125

127

slide-126
SLIDE 126

128

slide-127
SLIDE 127

129

slide-128
SLIDE 128

130

slide-129
SLIDE 129

131

slide-130
SLIDE 130

132

slide-131
SLIDE 131

SARSA

133

slide-132
SLIDE 132

134

slide-133
SLIDE 133

135

slide-134
SLIDE 134

136

slide-135
SLIDE 135

Q-Learning

137

slide-136
SLIDE 136

Q-Learning vs. Sarsa
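The comparison slide itself did not survive extraction; as a rough illustrative sketch (an assumption based on the standard tabular algorithms, not the slide's own content), the two updates differ only in the bootstrap target: Sarsa uses the action actually taken next (on-policy), while Q-learning uses the greedy action (off-policy).

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
    # On-policy target: bootstrap from the next action actually taken.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=1.0):
    # Off-policy target: bootstrap from the greedy action in the next state.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```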

138

slide-137
SLIDE 137

139

slide-138
SLIDE 138

140

slide-139
SLIDE 139

142

slide-140
SLIDE 140

143

slide-141
SLIDE 141

Monte Carlo Tree Search

144

slide-142
SLIDE 142

145

slide-143
SLIDE 143

146

slide-144
SLIDE 144

147

slide-145
SLIDE 145

148

slide-146
SLIDE 146

149