Approximate Q-Learning (3/23/18) - PowerPoint PPT Presentation



SLIDE 1

Approximate Q-Learning

3/23/18

SLIDE 2

On-Policy Learning (SARSA)

Instead of updating based on the best action from the next state, update based on the action your current policy actually takes from the next state.

SARSA update: Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right], where a' is the action the policy actually takes in s'.

When would this be better or worse than Q-learning?
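The two updates differ only in the bootstrap term. A minimal tabular sketch (the Q mapping, alpha, and gamma are illustrative names, not the lecture's code; Q maps (state, action) pairs to values and should default to 0, e.g. a defaultdict):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Off-policy: bootstrap from the best action available in s_next."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy: bootstrap from the action actually taken in s_next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

Under an exploring policy the targets differ: SARSA's target includes the risk of exploratory actions, which is why the two agents can learn different paths.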

SLIDE 3

Demo: Q-learning vs SARSA

https://studywolf.wordpress.com/2013/07/01/reinforcement-learning-sarsa-vs-q-learning/

SLIDE 4

Which exploration policy is better?

ε-greedy or UCB

  • For Q-learning?
  • For SARSA?
  • What information do they need?
  • What else does the choice depend on?
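One way to compare what each policy needs (an illustrative sketch, not the lab's code): ε-greedy needs only the Q estimates, while UCB additionally needs per-pair visit counts and the total step count.

```python
import math
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Needs only Q estimates; explores uniformly with probability eps."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def ucb(Q, s, actions, counts, t, c=1.0):
    """Also needs visit counts: rarely tried actions get an exploration bonus."""
    def score(a):
        n = counts[(s, a)]
        if n == 0:
            return float('inf')  # always try untried actions first
        return Q[(s, a)] + c * math.sqrt(math.log(t) / n)
    return max(actions, key=score)
```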
SLIDE 5

What will Q-learning do here?

SLIDE 6

Problem: Large State Spaces

If the state space is large, several problems arise.

  • The table of Q-value estimates becomes enormous.
  • Q-value updates can be slow to propagate.
  • High-reward states can be hard to find.

The state space grows exponentially with the number of relevant features in the environment.

SLIDE 7

Reward Shaping

Idea: give some small intermediate rewards that help the agent learn.

  • Like a heuristic, this can guide the search in the right direction.
  • Rewarding novelty can encourage exploration.

Disadvantages:

  • Requires intervention by the designer to add domain-specific knowledge.
  • If reward/discount are not balanced right, the agent might prefer accumulating the small rewards to actually solving the problem.

  • Doesn’t reduce the size of the Q-table.
SLIDE 8

PacMan State Space

  • PacMan’s location: ~100 possibilities
  • The ghosts’ locations: ~100^2 possibilities
  • Locations still containing food: ~2^100 possibilities
  • Pills remaining: 4 possibilities
  • Ghost scared timers: ~20^2 possibilities

The state space is the cross product of these feature sets.

  • So there are ~100^3 · 2^100 · 4 · 20^2 states.
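The slide's count can be checked directly (a sketch using the slide's rough estimates for each feature set):

```python
pacman_positions = 100          # ~100 board cells
ghost_positions = 100 ** 2      # two ghosts, ~100 cells each
food_configs = 2 ** 100         # each of ~100 cells has food or not
pill_configs = 4                # pills remaining
scared_timers = 20 ** 2         # two timers, ~20 values each

total = (pacman_positions * ghost_positions * food_configs
         * pill_configs * scared_timers)
print(f"{total:.2e}")  # roughly 2e+39 states -- far too many for a table
```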
SLIDE 9

Feature Vector Representation

We can represent a PacMan state as a big vector with the following dimensions:

  • PacMan x
  • PacMan y
  • ghost 1 x
  • ghost 1 y
  • ghost 1 scared
  • ghost 2 x
  • ghost 2 y
  • ghost 2 scared
  • food 1
  • food 2
  • …
  • food 100
  • power-up 1
  • power-up 2

What is the domain of each of these features?

SLIDE 10

Q-Learning Hypothesis Space

Q-learning produces a function that maps states to values.

  • Input: feature vector
  • Output: value
  • Are there any restrictions on the function that can be learned?

SLIDE 11

Function Approximation

Key Idea: learn a value function as a linear combination of features.

  • For each state/action pair encountered, determine its representation in terms of features.
  • Perform a Q-learning update on each feature.
  • The Q-value estimate is a weighted sum over the state/action pair’s features.

This is our first instance of a change of basis. We will see this idea many more times.

SLIDE 12

Simple Extractor Features from Lab

  • "bias": always 1.0
  • "#-of-ghosts-1-step-away": the number of ghosts (regardless of whether they are safe or dangerous) that are 1 step away from Pac-Man
  • "closest-food": the distance in Pac-Man steps to the closest food pellet (does take into account walls that may be in the way)
  • "eats-food": either 1 or 0, depending on whether Pac-Man will eat a pellet of food by taking the given action in the given state
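A self-contained sketch of such an extractor. The state layout (a dict with "pacman", "ghosts", and "food" keys) is an assumption for illustration, not the lab's actual API, and plain Manhattan distance stands in for the lab's wall-aware maze distance:

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def simple_features(state, action):
    """Hypothetical extractor mirroring the four lab features above."""
    dx, dy = MOVES[action]
    x, y = state["pacman"]
    nxt = (x + dx, y + dy)  # where Pac-Man lands after taking the action
    feats = {
        "bias": 1.0,
        "#-of-ghosts-1-step-away":
            float(sum(manhattan(nxt, g) == 1 for g in state["ghosts"])),
        "eats-food": 1.0 if nxt in state["food"] else 0.0,
    }
    if state["food"]:
        # Manhattan distance here; the lab uses a wall-aware maze distance.
        feats["closest-food"] = min(manhattan(nxt, f) for f in state["food"])
    return feats
```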
SLIDE 13

Extract features from neighbor states:

  • Each of these states has two legal actions.

Describe each (s,a) pair in terms of the basic features:

  • bias
  • #-of-ghosts-1-step-away
  • closest-food
  • eats-food
SLIDE 14

Approximate Q-Learning Update

Initialize the weight for each feature to 0. Every time we take an action, perform this update on every weight:

w_i \leftarrow w_i + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] f_i(s, a)

The Q-value estimate for (s, a) is the weighted sum of its features:

Q(s, a) = \sum_{i=1}^{n} f_i(s, a) \, w_i
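The weighted sum and the weight update can be sketched together (a minimal version; the weight dict is illustrative, and the alpha and gamma defaults mirror the exercise slide's values):

```python
def q_value(weights, feats):
    """Q(s,a) = sum_i f_i(s,a) * w_i -- a dot product of features and weights."""
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

def aql_update(weights, feats, reward, next_value, alpha=0.3, gamma=0.95):
    """One shared correction term, scaled per-weight by how active the feature was."""
    correction = (reward + gamma * next_value) - q_value(weights, feats)
    for f, v in feats.items():
        weights[f] = weights.get(f, 0.0) + alpha * correction * v
```

Note that a single experience moves every weight at once, which is how learning generalizes across states that share features.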
SLIDE 15

AQL Update Details

  • The weighted sum of features is equivalent to a dot product between the feature and weight vectors:

Q(s, a) = \sum_{i=1}^{n} f_i(s, a) \, w_i = \vec{w} \cdot \vec{f}(s, a)

  • The correction term is the same for all features.
  • The correction to each feature is weighted by how “active” that feature was.
SLIDE 16

Exercise: Feature Q-Update

  • Suppose PacMan takes the up action.
  • The experienced next state is random, because the ghosts’ movements are random.
  • Suppose one ghost moves right and the other moves down.

Old weight values: w_bias = 1, w_ghosts = −20, w_food = 2, w_eats = 4
Reward for eating food: +10
Reward for losing: −500
Discount: 0.95; learning rate: 0.3

SLIDE 17

Notes on Approximate Q-Learning

  • Learns weights for a tiny number of features.
  • Every feature’s weight is updated every step.
  • No longer tracking values for individual (s,a) pairs.
  • (s,a) value estimates are calculated from features.
  • The weight update is a form of gradient descent.
  • We’ve seen this before.
  • We’re performing a variant of linear regression.
  • Feature extraction is a type of basis change.
  • We’ll see these again.
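To make the gradient-descent claim concrete, here is the standard derivation (not reproduced from the slides): hold the TD target fixed and take one gradient step on the squared error.

```latex
% Squared error with the TD target held fixed:
E(w) = \tfrac{1}{2}\Bigl(\underbrace{r + \gamma \max_{a'} Q(s',a')}_{\text{target}}
       - \sum_i w_i f_i(s,a)\Bigr)^2

% Gradient with respect to one weight:
\frac{\partial E}{\partial w_i} = -\bigl(\text{target} - Q(s,a)\bigr)\, f_i(s,a)

% A gradient-descent step recovers the approximate Q-learning weight update:
w_i \leftarrow w_i - \alpha \frac{\partial E}{\partial w_i}
    = w_i + \alpha \bigl(\text{target} - Q(s,a)\bigr)\, f_i(s,a)
```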
SLIDE 18

What’s the hypothesis space?

  • Approximate Q-learning learns something different from Q-learning.
  • What is AQL’s hypothesis space?
  • Inputs
  • Outputs
  • Restrictions on the learned function
SLIDE 19

Plusses and Minuses of Approximation

+ Dramatically reduces the size of the Q-table.
+ States will share many features.
+ Allows generalization to unvisited states.
+ Makes behavior more robust: making similar decisions in similar states.
+ Handles continuous state spaces!
− Requires feature selection (often must be done by hand).
− Restricts the accuracy of the learned rewards.
− The true reward function may not be linear in the features.