
Machine Learning: Reinforcement Learning (Volker Sorge, Intro to AI: Lecture 12)



  1. Machine Learning: Reinforcement Learning
     Intro to AI: Lecture 12. Volker Sorge.

  2. Basics
     ◮ Reinforcement learning is the area concerned with how an agent ought to take actions in an environment so as to maximize some notion of reward.
     ◮ “A way of programming agents by reward and punishment without needing to specify how the task is to be achieved.”
     ◮ Specify what to do, but not how to do it.
     ◮ Only formulate the reward function.
     ◮ Learning “fills in the details”.
     ◮ Compute better final solutions for a task.
     ◮ Based on actual experiences, not on programmer assumptions.
     ◮ Less (human) time needed to find a good solution.

  3. Main Notions: Policies
     ◮ Policy: The function that allows us to compute the next action for a particular state.
     ◮ An optimal policy is a policy that maximizes the expected reward/reinforcement/feedback of a state.
     ◮ Thus, the task of RL is to use observed rewards to find an optimal policy for the environment.

  4. Main Notions: Modes of Learning
     ◮ Passive learning: The agent’s policy is fixed and our task is to learn how good the policy is.
     ◮ Active learning: The agent must learn what actions to take.
     ◮ Off-policy learning: learn the value of the optimal policy independently of the agent’s actions.
     ◮ On-policy learning: learn the value of the policy the agent actually follows.

  5. Main Notions: Exploration vs. Exploitation
     ◮ Exploitation: Use the knowledge already learned about what the best next action is in the current state.
     ◮ Exploration: In order to improve its policy the agent must explore a number of states, i.e., select an action different from the one that it currently thinks is best.
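As a small illustration of this trade-off (not part of the slides), one common selection rule is ε-greedy: explore with a small probability ε, otherwise exploit the current best estimate. The dict-based Q table and the function name below are assumptions made for this sketch.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Choose an action for `state` from a tabular Q estimate.

    Q is assumed to be a dict mapping (state, action) pairs to values.
    With probability epsilon the agent explores (random action);
    otherwise it exploits the action with the highest current Q value.
    """
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(state, a)])  # exploit
```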

  6. Difficulties of Reinforcement Learning
     ◮ Blame attribution problem: The problem of determining which action was responsible for a reward or punishment.
     ◮ The responsible action may have occurred a long time before the reward was received.
     ◮ A combination of actions might have led to a reward.
     ◮ Recognising delayed rewards: Actions that seem poor now might lead to much greater rewards in the future than actions that appear good now.
     ◮ Future rewards need to be recognised and back-propagated.
     ◮ Problem complexity increases if the world is dynamic.
     ◮ Explore-exploit dilemma: If the agent has worked out a good course of actions, should it continue to follow these actions or should it explore to find better ones?
     ◮ An agent that never explores cannot improve its policy.
     ◮ An agent that only explores never uses what it has learned.

  7. Some Algorithms
     ◮ Temporal Difference learning
     ◮ Q-learning
     ◮ SARSA
     ◮ Monte Carlo Method
     ◮ Evolutionary Algorithms

  8. Q-Learning Basics
     ◮ Off-policy learning technique.
     ◮ The environment is typically formulated as a Markov Decision Process. (See reading assignment on Markov Processes.)
     ◮ Finite sets of states S = {s_0, ..., s_n} and actions A = {a_0, ..., a_m}.
     ◮ Probabilities P_a(s, s′) for transitions from state s to s′ with action a.
     ◮ Reward function R that assigns an immediate reward to each transition.
     ◮ The goal is to learn the optimal state-action value function Q∗(s, a), which determines an optimal policy.

  9. MDP Example
     ◮ Here is a simple example of an MDP with three states S_0, S_1, S_2 and two actions a_0, a_1. (Figure not reproduced; source: http://wikipedia.org/)
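The referenced figure is not reproduced here. As a rough sketch of how such a three-state, two-action MDP could be written down in code, the dictionaries below use made-up transition probabilities and rewards rather than the exact values from the figure.

```python
# Hypothetical three-state, two-action MDP (all numbers are placeholders,
# not the values from the Wikipedia figure).
states = ["S0", "S1", "S2"]
actions = ["a0", "a1"]

# P[(s, a)] lists (next_state, probability) pairs for taking action a in s.
P = {
    ("S0", "a0"): [("S0", 0.5), ("S2", 0.5)],
    ("S0", "a1"): [("S2", 1.0)],
    ("S1", "a0"): [("S0", 0.7), ("S1", 0.1), ("S2", 0.2)],
    ("S1", "a1"): [("S1", 0.95), ("S2", 0.05)],
    ("S2", "a0"): [("S0", 0.4), ("S2", 0.6)],
    ("S2", "a1"): [("S0", 0.3), ("S1", 0.3), ("S2", 0.4)],
}

# R[(s, a, s_next)] is the immediate reward; unlisted transitions give 0.
R = {("S1", "a0", "S0"): 5.0, ("S2", "a1", "S0"): -1.0}
```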

  10. Q-Learning
     ◮ Learn the quality of state-action combinations: Q : S × A → ℝ
     ◮ We learn Q over a (possibly infinite) sequence of discrete time events ⟨s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, s_3, a_3, r_4, s_4, ...⟩, where the s_i are states, the a_i actions, and the r_i rewards.
     ◮ The learning step uses a single experience at a time, i.e., one tuple ⟨s, a, r, s′⟩.

  11. Q-Learning: Algorithm
     ◮ Maintain a table for Q with an entry for each valid state-action pair (s, a).
     ◮ Initialise the table with some uniform value.
     ◮ Update the values over time points t ≥ 0 according to the following formula:
       Q(s_t, a_t) = Q(s_t, a_t) + α × [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
       where α is the learning rate and γ is the discount factor.
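A minimal sketch of this tabular update, assuming Q is stored as a dict keyed by (state, action) pairs and initialised to a uniform value (e.g. 0.0) beforehand; the function and variable names are illustrative, not from the slides.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update for the experience (s, a, r, s_next).

    Off-policy: the target uses the best action available in s_next,
    regardless of which action the agent will actually take there.
    """
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    target = r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q

# Example of the uniform initialisation mentioned above:
# Q = {(s, a): 0.0 for s in states for a in actions}
```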

  12. Learning Rate
     ◮ Models how forgetful or stubborn an agent is.
     ◮ The learning rate α (small Greek letter alpha) determines to what extent the newly acquired information will override the old information.
     ◮ With α = 0 the agent does not learn anything.
     ◮ With α = 1 the agent considers only the most recent information.
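As a made-up numerical illustration (not from the slides): suppose Q(s_t, a_t) = 2 and the bracketed target r_{t+1} + γ max_a Q(s_{t+1}, a) equals 5. With α = 0 the value stays at 2 (nothing is learned); with α = 1 it jumps straight to 5 (the old estimate is discarded); with α = 0.5 it moves halfway, to 3.5.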

  13. Discounted Reward
     ◮ The basic idea is to weigh rewards differently over time and to model the idea that earlier experiences are more relevant than later ones.
     ◮ E.g., when a child learns to walk, the rewards and punishments are pretty high. But the older we get, the less we have to actually adjust our walking behaviour.
     ◮ One can model this by including a factor that decreases over time, i.e., it discounts an experience more and more.
     ◮ The discount is normally expressed by a multiplicative factor γ (small Greek letter gamma), with 0 ≤ γ < 1.
     ◮ γ = 0 will make the agent “opportunistic” by only considering current rewards.
     ◮ A γ value closer to 1 will make the agent strive for long-term reward.
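The slides use γ only inside the update rule; as an extra sketch (an assumption of this note, not part of the slides), the same discounting effect can be seen directly in a discounted sum over a reward sequence.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards r_1, r_2, ... weighted by 1, gamma, gamma^2, ...

    A smaller gamma shrinks the influence of rewards further in the future.
    """
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# For the same reward sequence [1, 1, 1, 1]:
#   discounted_return([1, 1, 1, 1], 0.0)  -> 1.0    ("opportunistic")
#   discounted_return([1, 1, 1, 1], 0.9)  -> ~3.44  (long-term oriented)
```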

  14. SARSA Learning
     ◮ On-policy learning technique.
     ◮ Very similar to Q-learning.
     ◮ SARSA stands for “State-Action-Reward-State-Action”.
     ◮ Learns the quality of the next move by actually carrying out the next move. Hence we no longer maximise over the possible next Q values.

  15. SARSA: Algorithm
     ◮ As in Q-learning, we initialise and maintain the table for Q with an entry for each valid state-action pair (s, a).
     ◮ The update formula is similar to Q-learning, with the exception that we can only take the actually chosen next action into account:
       Q(s_t, a_t) = Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]
       where α is the learning rate and γ is the discount factor.
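For comparison with the Q-learning sketch above, a minimal tabular SARSA update under the same assumed dict-based Q table; the extra argument a_next is the action the agent actually takes in s_next.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One tabular SARSA update for the experience (s, a, r, s_next, a_next).

    On-policy: the target uses the action a_next actually chosen by the
    current policy in s_next, instead of maximising over all actions.
    """
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q
```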
