Today's Outline: Reinforcement Learning
CSE 473: Artificial Intelligence, Dan Weld (5/7/2012)
1. Today's Outline (CSE 473: Artificial Intelligence, Dan Weld)

   - Reinforcement learning
   - Q-value iteration
   - Q-learning
   - Exploration / exploitation
   - Linear function approximation

   Many slides adapted from Dan Klein, Stuart Russell, Luke Zettlemoyer, or Andrew Moore.

   Recap: MDPs and Bellman Equations

   - Markov decision processes:
     - States S
     - Actions A
     - Transitions T(s,a,s'), aka P(s'|s,a)
     - Rewards R(s,a,s') (and discount γ)
     - Start state s_0 (or distribution P_0)
   - Algorithms: value iteration, Q-value iteration
   - Quantities:
     - Policy = map from states to actions
     - Utility = sum of discounted future rewards
     - Q-value Q*(s,a) = expected utility from a q-state, i.e. from a state/action pair
   [Portrait: Andrey Markov, 1856-1922.]

   Bellman Backup

   [Figure: a one-step lookahead from state s_0 with actions a_1, a_2, a_3 to successor
   states s_1, s_2, s_3 whose current values are V_4 = 0, 1, 2. The backups are
   Q_5(s_0,a_1) = 2 + γ·0 ≈ 2, Q_5(s_0,a_2) = 5 + γ(0.9·1 + 0.1·2) ≈ 6.1,
   Q_5(s_0,a_3) = 4.5 + γ·2 ≈ 6.5, so V_5(s_0) = max ≈ 6.5.]

   Q-Value Iteration

   - Regular value iteration: find successive approximations of the optimal values
     - Start with V_0*(s) = 0
     - Given V_i*, calculate the values for all states at depth i+1:
       V_{i+1}(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_i(s') ]
   - Storing Q-values is more useful!
     - Start with Q_0*(s,a) = 0
     - Given Q_i*, calculate the q-values for all q-states at depth i+1
       (see the code sketch after this slide):
       Q_{i+1}(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q_i(s',a') ]
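A minimal Python sketch of the Q-value iteration backup above, assuming a tabular MDP stored as plain dictionaries. The toy two-state MDP (T, R, gamma) below is a made-up illustration, not from the slides:

```python
def q_backup(Q, T, R, gamma):
    """One backup: Q_{i+1}(s,a) = sum_s' T(s,a,s') [ R(s,a,s') + gamma * max_a' Q_i(s',a') ]."""
    new_Q = {}
    for (s, a), successors in T.items():
        total = 0.0
        for s_next, prob in successors.items():
            # max over the successor state's q-values (its current V_i)
            best_next = max(v for (s2, a2), v in Q.items() if s2 == s_next)
            total += prob * (R[(s, a, s_next)] + gamma * best_next)
        new_Q[(s, a)] = total
    return new_Q

# Hypothetical two-state MDP: 'stay' keeps you put, 'go' moves A -> B.
T = {('A', 'stay'): {'A': 1.0}, ('A', 'go'): {'B': 1.0},
     ('B', 'stay'): {'B': 1.0}, ('B', 'go'): {'B': 1.0}}
R = {('A', 'stay', 'A'): 0.0, ('A', 'go', 'B'): 5.0,
     ('B', 'stay', 'B'): 1.0, ('B', 'go', 'B'): 1.0}

Q = {sa: 0.0 for sa in T}              # Q_0(s,a) = 0
for _ in range(50):                    # successive approximations
    Q = q_backup(Q, T, R, gamma=0.9)
```

Repeating the backup until the q-values stop changing gives the convergence test stated on the next slide.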

2. Q-Value Iteration (algorithm)

   - Initialize each q-state: Q_0(s,a) = 0
   - Repeat:
     - For all q-states (s,a), compute Q_{i+1}(s,a) from Q_i by the Bellman backup at (s,a)
   - Until max_{s,a} |Q_{i+1}(s,a) - Q_i(s,a)| < ε

   Reinforcement Learning

   - Markov decision processes: states S, actions A, transitions T(s,a,s') aka P(s'|s,a),
     rewards R(s,a,s') (and discount γ), start state s_0 (or distribution P_0)
   - Algorithms:
     - Q-value iteration
     - Q-learning
   - Approaches for mixing exploration and exploitation:
     - ε-greedy
     - Exploration functions

   Applications

   [Video: Stanford Autonomous Helicopter, http://heli.stanford.edu/]
   - Robotic control: helicopter maneuvering, autonomous vehicles
   - Mars rover: path planning, oversubscription planning
   - Elevator planning
   - Game playing: backgammon, tetris, checkers
   - Neuroscience
   - Computational finance, sequential auctions
   - Assisting the elderly in simple tasks
   - Spoken dialog management
   - Communication networks: switching, routing, flow control
   - War planning, evacuation planning

   Two Main Reinforcement Learning Approaches

   - Model-based approaches:
     - Explore the environment and learn a model, T = P(s'|s,a) and R(s,a);
       learn T + R (almost) everywhere: |S|²|A| + |S||A| parameters (40,000)
       (a counting-based sketch follows this slide)
     - Use the model to plan a policy, MDP-style
     - Leads to the strongest theoretical results
     - Often works well when the state space is manageable
   - Model-free approaches:
     - Don't learn a model; learn the value function or policy directly
       (learn Q): |S||A| parameters (400)
     - Weaker theoretical results
     - Often works better when the state space is large
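As a rough illustration of the model-based recipe (learn T and R from experience, then plan MDP-style), here is a counting-based estimator. The `estimate_model` helper and the sample transitions are hypothetical, not code from the course:

```python
from collections import defaultdict

def estimate_model(transitions):
    """transitions: list of (s, a, s_next, r) tuples gathered while exploring.
    Returns maximum-likelihood T_hat[(s,a)][s'] and mean-reward R_hat[(s,a,s')]."""
    counts = defaultdict(lambda: defaultdict(int))   # (s,a) -> s' -> count
    reward_sum = defaultdict(float)                  # (s,a,s') -> summed reward
    for s, a, s_next, r in transitions:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a, s_next)] += r
    T_hat, R_hat = {}, {}
    for sa, nexts in counts.items():
        n = sum(nexts.values())
        T_hat[sa] = {s_next: c / n for s_next, c in nexts.items()}
        for s_next, c in nexts.items():
            R_hat[sa + (s_next,)] = reward_sum[sa + (s_next,)] / c
    return T_hat, R_hat

# Hypothetical experience; the learned model can then be planned over,
# e.g. with the q_backup sketch from the previous slide.
samples = [('A', 'go', 'B', 5.0), ('A', 'go', 'B', 5.0), ('B', 'stay', 'B', 1.0)]
T_hat, R_hat = estimate_model(samples)
```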

3. Recap: Sampling Expectations

   - Want to compute an expectation weighted by P(x): E[f(x)] = Σ_x P(x) f(x)
   - Model-based: estimate P(x) from samples, then compute the expectation
   - Model-free: estimate the expectation directly from samples
   - Why does this work? Because samples appear with the right frequencies!

   Recap: Exponential Moving Average

   - Running average with exponential weighting: x̄_n = (1-α)·x̄_{n-1} + α·x_n
   - Makes recent samples more important
   - Forgets about the past (distant past values were wrong anyway)
   - Easy to compute from the running average
   - A decreasing learning rate can give converging averages

   Q-Learning Update

   - Q-learning = sample-based Q-value iteration
   - How do we learn the Q*(s,a) values?
     - Receive a sample (s, a, s', r)
     - Consider your old estimate: Q(s,a)
     - Consider your new sample estimate: sample = r + γ max_{a'} Q(s',a')
     - Incorporate the new estimate into a running average
       (sketched in code after this slide):
       Q(s,a) ← (1-α) Q(s,a) + α [sample]

   Exploration-Exploitation Tradeoff

   - You have visited part of the state space and found a reward of 100;
     is this the best you can hope for?
   - Exploitation: should I stick with what I know and find a good policy
     w.r.t. this knowledge? At the risk of missing out on a better reward somewhere.
   - Exploration: should I look for states with more reward? At the risk of
     wasting time and getting some negative reward.

   Q-Learning: ε-Greedy Exploration / Exploitation

   - Several schemes for action selection
   - Simplest: random actions (ε-greedy)
     - Every time step, flip a coin
     - With probability ε, act randomly
     - With probability 1-ε, act according to the current policy
   - Problems with random actions?
     - You do explore the space, but you keep thrashing around once learning is done
     - One solution: lower ε over time
     - Another solution: exploration functions
   [Video clip.]
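A short sketch of the tabular Q-learning update with ε-greedy action selection described above. The dictionary layout and the `actions` argument are assumptions made for illustration:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon act randomly; otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def q_update(Q, s, a, r, s_next, next_actions, alpha, gamma):
    """Running-average update: Q(s,a) <- (1-alpha) Q(s,a) + alpha [r + gamma max_a' Q(s',a')]."""
    sample = r + gamma * max((Q.get((s_next, a2), 0.0) for a2 in next_actions),
                             default=0.0)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
```

Because the update uses max over next actions rather than the action actually taken, this is the off-policy property mentioned on the next slide.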

4. Exploration Functions

   - When to explore?
     - Random actions: explore a fixed amount
     - Better idea: explore areas whose badness is not (yet) established
   - Exploration function: takes a value estimate u and a visit count n, and
     returns an optimistic utility, e.g. f(u,n) = u + k/n (exact form not important)
   - Exploration policy: pick the action that maximizes f of the q-value
     estimate and visit count, instead of the raw q-value
     (see the sketch after this slide)

   Q-Learning: Final Solution

   - Q-learning produces tables of q-values
   [Figure: a table of learned q-values over a small gridworld.]

   Q-Learning Properties

   - Amazing result: Q-learning converges to the optimal policy
     - If you explore enough
     - If you make the learning rate small enough
     - ... but don't decrease it too quickly!
     - Not too sensitive to how you select actions (!)
   - Neat property: off-policy learning
     - Learn the optimal policy without following it (some caveats)

   Q-Learning: A Small Problem

   - It doesn't work at scale: in realistic situations, we can't possibly learn
     about every single state!
     - Too many states to visit them all in training
     - Too many states to hold the q-tables in memory
   - Instead, we need to generalize:
     - Learn about a few states from experience
     - Generalize that experience to new, similar states
       (a fundamental idea in machine learning)

   Example: Pacman

   [Figure: three nearly identical Pacman states.]
   - Let's say we discover through experience that this state is bad.
   - In naive Q-learning, we know nothing about related states and their
     Q-values: not the nearly identical second state, or even the third one!

   Feature-Based Representations

   - Solution: describe a state using a vector of features (properties)
     - Features are functions from states to real numbers (often 0/1) that
       capture important properties of the state
   - Example features:
     - Distance to the closest ghost
     - Distance to the closest dot
     - Number of ghosts
     - 1 / (distance to dot)²
     - Is Pacman in a tunnel? (0/1)
     - ... etc.
   - Can also describe a q-state (s, a) with features
     (e.g., "action moves closer to food")
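One plausible way to code the exploration-function idea; the slide stresses that the exact form is not important, so the choice f(u,n) = u + k/(n+1) and the visit-count table `N` are assumptions for illustration:

```python
def exploration_value(u, n, k=2.0):
    """Optimistic utility: rarely visited q-states look better than their raw estimate."""
    return u + k / (n + 1)

def pick_action(Q, N, s, actions):
    """Choose the action maximizing f(Q(s,a), N(s,a)) instead of the raw q-value.
    Q maps (s,a) -> value estimate; N maps (s,a) -> visit count."""
    return max(actions,
               key=lambda a: exploration_value(Q.get((s, a), 0.0),
                                               N.get((s, a), 0)))
```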

5. Linear Feature Functions

   - Using a feature representation, we can write a q-function (or value
     function) for any state as a linear combination of a few weights:
     Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + ... + w_n f_n(s,a)
   - Advantage: our experience is summed up in a few powerful numbers
   - Disadvantage: states may share features but actually be very different in value!

   Function Approximation

   - Q-learning with linear q-functions: exact Q's (|S|²|A|? |S||A|?) vs.
     approximate Q's (a handful of weights)
   - Intuitive interpretation:
     - Adjust the weights of active features
     - E.g., if something unexpectedly bad happens, disprefer all states with
       that state's features
   - Formal justification: online least squares

   Example: Q-Pacman
   [Figure: a Q-Pacman example using a linear q-function over a few features.]

   Linear Regression
   [Figure: fitting a line (and a plane) to data points by linear regression.]

   Ordinary Least Squares (OLS)
   [Figure: observations vs. predictions; the error, or "residual", is the
   distance from each observation to the prediction.]

   Minimizing Error

   - Imagine we had only one point x with features f(x): minimizing the squared
     error between the observation y and the prediction Σ_m w_m f_m(x) gives the
     update w_m ← w_m + α [ y - Σ_k w_k f_k(x) ] f_m(x)
   - Approximate q update, with "target" r + γ max_{a'} Q(s',a') and
     "prediction" Q(s,a) (sketched in code after this slide):
     w_m ← w_m + α [ r + γ max_{a'} Q(s',a') - Q(s,a) ] f_m(s,a)
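A sketch of approximate Q-learning with a linear q-function, matching the weight update above. The `features(s, a)` function, which returns a dict of feature name to value, is a hypothetical stand-in for the Pacman features listed earlier:

```python
def q_value(weights, features, s, a):
    """Linear q-function: Q(s,a) = sum_m w_m * f_m(s,a)."""
    return sum(weights.get(name, 0.0) * value
               for name, value in features(s, a).items())

def approx_q_update(weights, features, s, a, r, s_next, next_actions, alpha, gamma):
    """w_m <- w_m + alpha * [ target - prediction ] * f_m(s,a)."""
    prediction = q_value(weights, features, s, a)
    target = r + gamma * max((q_value(weights, features, s_next, a2)
                              for a2 in next_actions), default=0.0)
    difference = target - prediction
    for name, value in features(s, a).items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
```

Note how a bad surprise (large negative difference) pushes down the weight of every feature active in that state, which is exactly the "disprefer all states with that state's features" behavior described above.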

6. Overfitting

   [Figure: a degree-15 polynomial fit that passes through every data point but
   oscillates wildly between them.]

   Which Algorithm?

   [Video: Q-learning, no features, 50 learning trials.]
   [Video: Q-learning, no features, 1000 learning trials.]
   [Video: Q-learning, simple features, 50 learning trials.]

   Partially Observable MDPs

   - Markov decision processes:
     - States S
     - Actions A
     - Transitions P(s'|s,a) (or T(s,a,s'))
     - Rewards R(s,a,s') (and discount γ)
     - Start state distribution b_0 = P(s_0)
   - POMDPs just add:
     - Observations O
     - Observation model P(o|s,a) (or O(s,a,o))
   [Figure: belief-state transition, b and a lead to an observation o and a
   new belief b'. A belief-update sketch follows this slide.]

   A POMDP: Ghost Hunter
   [Video: the Ghost Hunter POMDP demo.]
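The slide only names the observation model, but the belief update b to b' is the natural next step it implies. This sketch uses hypothetical dictionaries `T[(s, a)][s_next]` and `O[(s_next, a)][o]` for the transition and observation models:

```python
def belief_update(b, a, o, T, O):
    """New belief: b'(s') is proportional to O[(s',a)][o] * sum_s T[(s,a)][s'] * b(s)."""
    b_new = {}
    successors = {sp for s in b for sp in T[(s, a)]}
    for s_next in successors:
        # Predict where we might be after action a, then weight by the observation.
        predicted = sum(T[(s, a)].get(s_next, 0.0) * p for s, p in b.items())
        b_new[s_next] = O[(s_next, a)].get(o, 0.0) * predicted
    total = sum(b_new.values())
    return {s: p / total for s, p in b_new.items()} if total > 0 else b_new
```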
