Intro to AI: Lecture 12 Volker Sorge Introduction Q-Learning SARSA
Machine Learning: Reinforcement Learning
Basics
◮ Reinforcement learning is an area concerned with how
an agent ought to take actions in an environment so as to maximize some notion of reward.
◮ “A way of programming agents by reward and
punishment without needing to specify how the task is to be achieved.”
◮ Specify what to do, but not how to do it.
◮ Only formulate the reward function.
◮ Learning “fills in the details”.
◮ Compute better final solutions for a task.
◮ Based on actual experiences, not on programmer
assumptions.
◮ Less (human) time needed to find a good solution.
Main Notions: Policies
◮ Policy: The function that allows us to compute the next
action for a particular state.
◮ An optimal policy is a policy that maximizes the
expected reward/reinforcement/feedback of a state.
◮ Thus, the task of RL is to use observed rewards to find
an optimal policy for the environment.
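As a minimal sketch, a policy over a finite state set can be represented as a plain mapping from states to actions (the state and action names below are illustrative placeholders, not the lecture's example):

```python
# A policy maps each state to the action to take in that state.
# States "s0".."s2" and actions "a0"/"a1" are hypothetical.
policy = {"s0": "a1", "s1": "a0", "s2": "a1"}

def next_action(policy, state):
    """Compute the next action for a particular state under the policy."""
    return policy[state]
```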
Main Notions: Modes of Learning
◮ Passive Learning: The agent’s policy is fixed and our task is
to learn how good the policy is.
◮ Active Learning: Agents must learn what actions to
take.
◮ Off-policy learning: learn the value of the optimal policy
independently of the agent’s actions.
◮ On-policy learning: learn the value of the policy the
agent actually follows.
Main Notions: Exploration vs. Exploitation
◮ Exploitation: Use the knowledge already learned about what
the best next action is in the current state.
◮ Exploration: In order to improve policies the agent must
explore a number of states, i.e., select an action different from the one it currently thinks is best.
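A common way to trade off the two is ε-greedy action selection. The sketch below assumes Q-value estimates stored in a dict; the parameter name `epsilon` and the dict layout are my assumptions, not from the slides:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """With probability epsilon pick a random action (exploration),
    otherwise pick the currently best-looking action (exploitation).
    q_values: dict mapping action -> current value estimate."""
    if rng.random() < epsilon:
        return rng.choice(sorted(q_values))   # explore: random action
    return max(q_values, key=q_values.get)    # exploit: greedy action
```

With `epsilon = 0` the agent never explores; with `epsilon = 1` it never exploits, mirroring the dilemma described above.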
Difficulties of Reinforcement learning
◮ Blame attribution problem: The problem of determining
which action was responsible for a reward or punishment.
◮ Responsible action may have occurred a long time
before the reward was received.
◮ A combination of actions might have led to a reward.
◮ Recognising delayed rewards: Actions that seem poor
now might lead to much greater rewards in the future than actions that appear to be good.
◮ Future rewards need to be recognised and
back-propagated.
◮ Problem complexity increases if the world is dynamic.
◮ Explore-exploit dilemma: If the agent has worked out a
good course of actions, should it continue to follow these actions or should it explore to find better actions?
◮ An agent that never explores cannot improve its policy.
◮ An agent that only explores never uses what it has learned.
Some Algorithms
◮ Temporal Difference learning
◮ Q-learning
◮ SARSA
◮ Monte Carlo Method
◮ Evolutionary Algorithms
Q-Learning Basics
◮ Off-policy learning technique.
◮ The environment is typically formulated as a Markov
Decision Process. (See reading assignment on Markov Processes.)
◮ Finite sets of states S = {s0, . . . , sn} and actions
A = {a0, . . . , am}.
◮ Probabilities Pa(s, s′) for transitions from state s to s′
with action a.
◮ Reward function R(s, a, s′) giving the immediate reward received after each transition.
◮ Goal is to learn an optimal policy function Q∗(s, a).
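The ingredients above can be sketched as plain Python data structures. This is a tiny hypothetical MDP for illustration only (states, actions, probabilities, and rewards are made up, not the lecture's example):

```python
# Finite state and action sets.
S = ["s0", "s1"]
A = ["a0", "a1"]

# P[(s, a)] maps each successor state s' to the probability Pa(s, s').
P = {
    ("s0", "a0"): {"s0": 0.7, "s1": 0.3},
    ("s0", "a1"): {"s1": 1.0},
    ("s1", "a0"): {"s0": 1.0},
    ("s1", "a1"): {"s0": 0.4, "s1": 0.6},
}

# R[(s, a, s')]: immediate reward for taking a in s and landing in s'.
R = {(s, a, s2): 0.0 for (s, a), dist in P.items() for s2 in dist}
R[("s0", "a1", "s1")] = 5.0  # one rewarding transition

# Sanity check: for every (s, a), outgoing probabilities sum to 1.
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in P.values())
```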
MDP Example
◮ Here’s a simple example of a MDP with three states
S0, S1, S2 and two actions a0, a1.
Source: http://wikipedia.org/
Q-Learning
◮ Learn quality of state-action combinations:
Q : S × A → R
◮ We learn Q over a (possibly infinite) sequence of discrete
time steps s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, a3, r4, s4 . . ., where si are states, ai actions, and ri rewards.
◮ Each step learns the quality of a single experience, i.e., of a
tuple s, a, r, s′.
Q-Learning: Algorithm
◮ Maintain a table for Q with an entry for each valid
state-action pair (s, a).
◮ Initialise the table with some uniform value.
◮ Update the values over time points t ≥ 0 according to
the following formula:

Q(st, at) ← Q(st, at) + α × [rt+1 + γ max_a Q(st+1, a) − Q(st, at)]

where α is the learning rate and γ is the discount factor.
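The update rule translates almost directly into code. A minimal sketch, using a `defaultdict` as the table initialised to a uniform value of 0.0 (function and variable names are my own, not from the slides):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update for the experience (s, a, r, s').
    Off-policy: it bootstraps from max over Q(s', a') regardless of
    which action the agent actually takes next."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

# Table with uniform initial value 0.0 for every (state, action) entry.
Q = defaultdict(float)
q_update(Q, "s0", "a0", 1.0, "s1", actions=["a0", "a1"])
```

With all entries initially 0, the first update moves Q(s0, a0) by α × r = 0.1 towards the observed reward.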
Learning Rate
◮ Models how forgetful or stubborn an agent is.
◮ The learning rate α (small Greek letter alpha) determines to
what extent the newly acquired information will
override the old information.
◮ With α = 0 the agent will not learn anything.
◮ With α = 1 the agent will consider only the most
recent information.
Discounted Reward
◮ The basic idea is to weigh rewards differently over time
and to model the idea that earlier experiences are more relevant than later ones.
◮ E.g., when a child learns to walk, the rewards and
punishments are quite strong. But the older we get, the less we have to adjust our walking behaviour.
◮ One can model this by including a factor that decreases
over time, i.e., it discounts an experience more and more.
◮ The discount is normally expressed by a multiplicative
factor γ (small Greek letter gamma), with 0 ≤ γ < 1.
◮ γ = 0 will make the agent “opportunistic” by only
considering current rewards.
◮ A γ value closer to 1 will make the agent strive for
long-term reward.
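The effect of γ can be seen on a simple reward sequence. A small sketch (the function name is my own):

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a reward sequence: G = sum over t of gamma**t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# gamma = 0: "opportunistic", only the immediate reward counts.
# gamma near 1: future rewards weigh almost as much as current ones.
```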
SARSA Learning
◮ On-policy learning technique.
◮ Very similar to Q-learning.
◮ SARSA stands for “State-Action-Reward-State-Action”.
◮ Learns the quality of the next move by actually carrying
out the next move. Hence we no longer maximise over
the possible next Q values.
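The contrast with Q-learning shows up in a one-line change to the update. A minimal sketch (function and variable names are my own, not from the slides):

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One SARSA update for the tuple (s, a, r, s', a').
    On-policy: it bootstraps from Q(s', a') for the action a' the
    agent actually carries out next, rather than maximising over
    all possible next actions as Q-learning does."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)
sarsa_update(Q, "s0", "a0", 1.0, "s1", "a1")
```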