

SLIDE 1

Intro to AI: Lecture 12

Machine Learning: Reinforcement Learning

Volker Sorge

SLIDE 2

Basics

◮ Reinforcement learning is an area concerned with how an agent ought to take actions in an environment so as to maximize some notion of reward.

◮ “A way of programming agents by reward and punishment without needing to specify how the task is to be achieved.”

◮ Specify what to do, but not how to do it.

◮ Only formulate the reward function.

◮ Learning “fills in the details”.

◮ Compute better final solutions for a task.

◮ Based on actual experiences, not on programmer assumptions.

◮ Less (human) time needed to find a good solution.

SLIDE 3

Main Notions: Policies

◮ Policy: The function that allows us to compute the next action for a particular state.

◮ An optimal policy is a policy that maximizes the expected reward/reinforcement/feedback of a state.

◮ Thus, the task of RL is to use observed rewards to find an optimal policy for the environment (a small code sketch follows below).
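As a concrete illustration (the representation and names below are assumptions for this sketch, not fixed by the slides), a deterministic policy can be stored as a plain mapping from states to actions, and a greedy policy can be derived from a learned value table:

```python
# A deterministic policy: a plain mapping from states to actions.
policy = {"s0": "a1", "s1": "a0", "s2": "a0"}

def next_action(state):
    """Compute the next action for a particular state."""
    return policy[state]

def greedy_policy(Q, state, actions):
    """Derive a policy from a value table Q[(state, action)]:
    in each state, pick the action with the highest learned value."""
    return max(actions, key=lambda a: Q[(state, a)])
```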

SLIDE 4

Main Notions: Modes of Learning

◮ Passive Learning: the agent's policy is fixed and our task is to learn how good the policy is.

◮ Active Learning: agents must learn what actions to take.

◮ Off-policy learning: learn the value of the optimal policy independently of the agent's actions.

◮ On-policy learning: learn the value of the policy the agent actually follows.

SLIDE 5

Main Notions: Exploration vs. Exploitation

◮ Exploitation: use the knowledge already learned about what the next best action is in the current state.

◮ Exploration: in order to improve policies the agent must explore a number of states, i.e., select an action different from the one that it currently thinks is best (see the sketch after this list).
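A minimal sketch of one standard compromise, ε-greedy action selection (the function and parameter names are illustrative assumptions, not from the slides): with probability ε the agent explores a random action, otherwise it exploits the action it currently values highest.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action (exploration),
    otherwise pick the currently best-valued action (exploitation)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```

Decaying ε over time is a common way to explore a lot early on and exploit more as the policy improves.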

SLIDE 6

Difficulties of Reinforcement Learning

◮ Blame attribution problem: the problem of determining which action was responsible for a reward or punishment.

◮ The responsible action may have occurred a long time before the reward was received.

◮ A combination of actions might have led to a reward.

◮ Recognising delayed rewards: what seem to be poor actions now might lead to much greater rewards in the future than what appear to be good actions.

◮ Future rewards need to be recognised and back-propagated.

◮ Problem complexity increases if the world is dynamic.

◮ Explore-exploit dilemma: if the agent has worked out a good course of actions, should it continue to follow these actions or should it explore to find better actions?

◮ An agent that never explores cannot improve its policy.

◮ An agent that only explores never uses what it has learned.

SLIDE 7

Some Algorithms

◮ Temporal Difference learning

◮ Q-learning

◮ SARSA

◮ Monte Carlo Method

◮ Evolutionary Algorithms

SLIDE 8

Q-Learning Basics

◮ Off-policy learning technique.

◮ The environment is typically formulated as a Markov Decision Process. (See reading assignment on Markov Processes.)

◮ Finite sets of states S = {s0, . . . , sn} and actions A = {a0, . . . , am}.

◮ Probabilities Pa(s, s′) for transitions from state s to s′ with action a.

◮ A reward function R that assigns an immediate reward to each transition.

◮ Goal is to learn the optimal action-value function Q∗(s, a), from which an optimal policy can be read off. (A possible encoding of such an MDP is sketched below.)
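To make the formulation concrete, here is a minimal sketch of how such an MDP could be encoded. The three-state layout mirrors the example on the next slide, but all transition probabilities and rewards below are made-up illustrative values, not taken from the lecture.

```python
# Finite sets of states and actions.
states = ["S0", "S1", "S2"]
actions = ["a0", "a1"]

# P[(s, a)] lists (s', probability) pairs, i.e. the transition
# probabilities Pa(s, s').  All numbers are illustrative only.
P = {
    ("S0", "a0"): [("S0", 0.5), ("S2", 0.5)],
    ("S0", "a1"): [("S2", 1.0)],
    ("S1", "a0"): [("S0", 0.7), ("S1", 0.1), ("S2", 0.2)],
    ("S1", "a1"): [("S1", 0.95), ("S2", 0.05)],
    ("S2", "a0"): [("S0", 0.4), ("S2", 0.6)],
    ("S2", "a1"): [("S0", 0.3), ("S1", 0.3), ("S2", 0.4)],
}

# R[(s, a, s')] is the immediate reward for taking a in s and landing in s'.
R = {(s, a, s2): 0.0 for (s, a), succs in P.items() for (s2, _) in succs}
R[("S1", "a0", "S0")] = 5.0    # illustrative positive reward
R[("S2", "a1", "S0")] = -1.0   # illustrative punishment
```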

SLIDE 9

MDP Example

◮ Here’s a simple example of an MDP with three states S0, S1, S2 and two actions a0, a1.

[Figure: state-transition diagram of the three-state MDP. Source: http://wikipedia.org/]

SLIDE 10

Q-Learning

◮ Learn the quality of state-action combinations: Q : S × A → R

◮ We learn Q over a (possibly infinite) sequence of discrete time events s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, a3, r4, s4, . . . where the si are states, the ai actions, and the ri rewards.

◮ We learn from one experience at a time, i.e., from a single tuple (s, a, r, s′).

SLIDE 11

Q-Learning: Algorithm

◮ Maintain a table for Q with an entry for each valid state-action pair (s, a).

◮ Initialise the table with some uniform value.

◮ Update the values over time points t ≥ 0 according to the following formula:

Q(st, at) = Q(st, at) + α [rt+1 + γ max_a Q(st+1, a) − Q(st, at)]

where α is the learning rate and γ is the discount factor.
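As code, one update step could look like the following minimal sketch (the table layout, with Q as a dictionary keyed by (state, action) pairs, is an assumption of this sketch, not from the slides):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step for the experience (s, a, r, s_next).

    Off-policy: the target maximises over all actions in s_next,
    regardless of which action the agent will actually take there.
    """
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```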
SLIDE 12

Learning Rate

◮ Models how forgetful or stubborn an agent is.

◮ The learning rate α (small Greek letter alpha) determines to what extent the newly acquired information will override the old information.

◮ α = 0 will make the agent not learn anything.

◮ α = 1 will make the agent consider only the most recent information (see the numeric sketch below).
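A small numeric sketch of these extremes (the numbers are made up for illustration): suppose the current estimate is Q(s, a) = 2 and the update target rt+1 + γ max_a Q(st+1, a) works out to 10.

```python
def blend(q_old, target, alpha):
    """The Q update with the target already computed:
    move q_old towards target by a fraction alpha."""
    return q_old + alpha * (target - q_old)

print(blend(2.0, 10.0, 0.0))  # 2.0  -- alpha = 0: nothing is learned
print(blend(2.0, 10.0, 1.0))  # 10.0 -- alpha = 1: only the newest information counts
print(blend(2.0, 10.0, 0.5))  # 6.0  -- in between: old and new are blended
```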

SLIDE 13

Discounted Reward

◮ The basic idea is to weigh rewards differently over time and to model the idea that earlier experiences are more relevant than later ones.

◮ E.g., when a child learns to walk, the rewards and punishments are pretty high. But the older we get, the less we have to actually adjust our walking behaviour.

◮ One can model this by including a factor that decreases over time, i.e., it more and more discounts an experience.

◮ The discount is normally expressed by a multiplicative factor γ (small Greek letter gamma), with 0 ≤ γ < 1.

◮ γ = 0 will make the agent “opportunistic” by only considering current rewards.

◮ A γ value closer to 1 will make the agent strive for long-term reward (see the sketch below).
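A minimal sketch of how γ weighs a stream of rewards (the reward sequence below is made up for illustration): the discounted return from time t is rt+1 + γ rt+2 + γ² rt+3 + . . .

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting a reward k steps ahead by gamma**k."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 10.0]          # illustrative reward stream
print(discounted_return(rewards, 0.0))   # 1.0   -- opportunistic: only the next reward
print(discounted_return(rewards, 0.9))   # ~10.0 -- far-sighted: the late reward dominates
```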

SLIDE 14

SARSA Learning

◮ On-policy learning technique.

◮ Very similar to Q-learning.

◮ SARSA stands for “State-Action-Reward-State-Action”.

◮ Learns the quality of the next move by actually carrying out the next move. Hence we no longer maximise over the possible next Q values.

SLIDE 15

SARSA: Algorithm

◮ As in Q-learning, we initialise and maintain the table for Q with an entry for each valid state-action pair (s, a).

◮ The updating formula is similar to Q-learning, with the exception that we can only take the actual next state and action into account:

Q(st, at) = Q(st, at) + α [rt+1 + γ Q(st+1, at+1) − Q(st, at)]

where α is the learning rate and γ is the discount factor.
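Under the same illustrative table layout as the Q-learning sketch above, one SARSA step could look like this; note that the update needs at+1, the action the agent actually takes in the next state:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One SARSA step for the experience (s, a, r, s_next, a_next).

    On-policy: the target uses the action actually taken in s_next,
    instead of maximising over all possible actions as Q-learning does.
    """
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
```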