Approximate Q-Learning (2/24/17)



SLIDE 1

Approximate Q-Learning

2/24/17

SLIDE 2

State Value V(s) vs. Action Value Q(s,a)

Either way, value is the sum of future discounted rewards (assuming the agent behaves optimally):

  • V(s): after being in state s
  • Value Iteration
  • Q(s,a): after being in state s and taking action a
  • Q-Learning

    V = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]
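A quick sketch of this sum for a fixed reward sequence; the function name is mine, not from the slides:

```python
# Discounted return: V = sum over t of gamma^t * r_t.
# For a known, finite reward sequence the expectation disappears and
# the sum can be computed directly.
def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return([1, 1, 1], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```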

SLIDE 3

State Value vs. Action Value

  • These concepts are closely tied.
  • Both algorithms implicitly compute the other value.

Value Iteration update:

    V(s) \leftarrow R(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a) V(s')

Q-Learning update:

    Q(s, a) \leftarrow \alpha \left[ R(s) + \gamma \max_{a'} Q(s', a') \right] + (1 - \alpha) Q(s, a)
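The contrast is easier to see in code. A minimal sketch with dict-based tables; the representation and function names are mine, not from the slides:

```python
# Value Iteration needs the transition model P(s'|s,a); Q-learning does not.

def value_iteration_update(V, R, P, s, actions, gamma):
    # V(s) <- R(s) + gamma * max_a sum_{s'} P(s'|s,a) * V(s')
    return R[s] + gamma * max(
        sum(p * V[s2] for s2, p in P[(s, a)].items()) for a in actions
    )

def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma):
    # Q(s,a) <- alpha * [r + gamma * max_{a'} Q(s',a')] + (1 - alpha) * Q(s,a)
    sample = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    return alpha * sample + (1 - alpha) * Q[(s, a)]
```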

SLIDE 4

Converting Between V(s) and Q(s,a)

If you know Q(s,a), you can calculate V(s) directly by taking a max:

    V(s) = \max_a Q(s, a)

If you know V(s) and the transition probabilities, you can calculate Q(s,a) by taking an expected value:

    Q(s, a) = \sum_{s'} P(s' \mid s, a) V(s')
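Both conversions are one-liners over a table. A sketch with dict lookups; names are illustrative:

```python
# Converting between V(s) and Q(s,a).

def v_from_q(Q, s, actions):
    # V(s) = max_a Q(s, a)
    return max(Q[(s, a)] for a in actions)

def q_from_v(V, P, s, a):
    # Q(s,a) = sum_{s'} P(s'|s,a) * V(s'), as written on the slide
    return sum(p * V[s2] for s2, p in P[(s, a)].items())
```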

SLIDE 5

On-Policy Learning (SARSA)

Instead of updating based on the best action from the next state, update based on the action your current policy actually takes from the next state.

SARSA update:

    Q(s, a) \leftarrow \alpha \left[ R(s) + \gamma Q(s', a') \right] + (1 - \alpha) Q(s, a)

where a' is the action the current policy actually takes in s'.

When would this be better or worse than Q-learning?
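The two updates are identical except for the next-state value used in the target. A minimal sketch; names are illustrative:

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
    # on-policy: a2 is the action the current policy actually takes in s2
    sample = r + gamma * Q[(s2, a2)]
    return alpha * sample + (1 - alpha) * Q[(s, a)]

def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma):
    # off-policy: assumes the best action will be taken from s2
    sample = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    return alpha * sample + (1 - alpha) * Q[(s, a)]
```

When the exploration policy takes risky actions (as near the cliff in the demo below), SARSA's target reflects what the agent will actually do, while Q-learning's reflects the best it could do.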

SLIDE 6

Demo: Q-learning vs SARSA

https://studywolf.wordpress.com/2013/07/01/reinforcement-learning-sarsa-vs-q-learning/

SLIDE 7

What will Q-learning do here?

SLIDE 8

Problem: Large State Spaces

If the state space is large, several problems arise.

  • The table of Q-value estimates becomes enormous.
  • Q-value updates can be slow to propagate.
  • High-reward states can be hard to find.

The state space grows exponentially with the number of relevant features in the environment.

SLIDE 9

Reward Shaping

Idea: give some small intermediate rewards that help the agent learn.

  • Like a heuristic, this can guide the search in the right direction.
  • Rewarding novelty can encourage exploration.

Disadvantages:

  • Requires intervention by the designer to add domain-specific knowledge.
  • If reward/discount are not balanced right, the agent might prefer accumulating the small rewards to actually solving the problem.
  • Doesn’t reduce the size of the Q-table.
SLIDE 10

PacMan State Space

  • PacMan’s location: ~100 possibilities
  • The ghosts’ locations: ~100^2 possibilities
  • Locations still containing food: an enormous number of combinations
  • Pills remaining: 4 possibilities
  • Ghost scared timers: ~40^2 possibilities

The state space is the cross product of these feature sets.

  • So there are ~100^3 * 4 * 40^2 * (food configs) states.
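The slide's arithmetic, with food configurations left symbolic:

```python
# State count from the cross product of the feature sets above.
pacman_positions = 100        # ~100 board squares
ghost_positions = 100 ** 2    # two ghosts, ~100 squares each
pills = 4                     # 4 possibilities, per the slide
scared_timers = 40 ** 2       # two timers, ~40 values each

states_ignoring_food = pacman_positions * ghost_positions * pills * scared_timers
print(states_ignoring_food)   # then multiply by the number of food configurations
```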
SLIDE 11

Function Approximation

Key Idea: learn a value function as a linear combination of features.

  • For each state encountered, determine its representation in terms of features.
  • Perform a Q-learning update on each feature.
  • The value estimate is a sum over the state’s features.

This is our first real foray into machine learning. Many methods we see later are related to this idea.
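A minimal sketch of a linear value estimate, assuming each (s, a) pair is described by a dictionary of feature values; the names and example weights are mine:

```python
# Q(s,a) = sum_i w_i * f_i(s,a): the table of Q-values is replaced
# by one weight per feature.
def q_value(weights, features):
    return sum(weights[name] * value for name, value in features.items())

w = {"bias": 1.0, "closest-food": -0.5}   # illustrative weights
f = {"bias": 1.0, "closest-food": 4.0}    # features for one (s, a) pair
print(q_value(w, f))  # 1.0 + (-0.5 * 4.0) = -1.0
```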

SLIDE 12

PacMan Features from Lab

  • "bias" always 1.0
  • "#-of-ghosts-1-step-away" the number of ghosts

(regardless of whether they are safe or dangerous) that are 1 step away from Pac-Man

  • "closest-food" the distance in Pac-Man steps to the

closest food pellet (does take into account walls that may be in the way)

  • "eats-food" either 1 or 0 if Pac-Man will eat a pellet
  • f food by taking the given action in the given state
SLIDE 13

Extract features from neighbor states:

  • Each of these states has two legal actions.

Describe each (s,a) pair in terms of the basic features:

  • bias
  • #-of-ghosts-1-step-away
  • closest-food
  • eats-food
SLIDE 14

Approximate Q-Learning Update

Initialize the weight for each feature to 0. The Q-value estimate for (s, a) is the weighted sum of its features:

    Q(s, a) = \sum_i w_i f_i(s, a)

Every time we take an action, perform this update on each weight:

    \text{difference} = \left[ R(s) + \gamma \max_{a'} Q(s', a') \right] - Q(s, a)

    w_i \leftarrow w_i + \alpha \cdot \text{difference} \cdot f_i(s, a)
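The weight update can be sketched as follows; this follows the standard linear-feature update, and the function names are mine:

```python
def q_value(weights, features):
    # Q(s,a) = sum_i w_i * f_i(s,a)
    return sum(weights[k] * v for k, v in features.items())

def update_weights(weights, features, r, next_q_max, alpha, gamma):
    # difference = (r + gamma * max_a' Q(s',a')) - Q(s,a)
    # w_i <- w_i + alpha * difference * f_i(s,a)
    diff = (r + gamma * next_q_max) - q_value(weights, features)
    for k, v in features.items():
        weights[k] += alpha * diff * v
    return weights
```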

SLIDE 15

Exercise: Feature Q-Update

  • Suppose PacMan takes the up action.
  • The experienced next state is random, because the ghosts’ movements are random.
  • Suppose one ghost moves right and the other moves down.

Old weight values: w_bias = 1, w_ghosts = −20, w_food = 2, w_eats = 4

Reward for eating food: +10
Reward for losing: −500
Discount: 0.95
Learning rate: 0.3
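The slide's board diagram did not survive extraction, so the feature values and next-state value below are hypothetical stand-ins; the weights, rewards, discount, and learning rate are the slide's:

```python
alpha, gamma = 0.3, 0.95
w = {"bias": 1.0, "ghosts": -20.0, "food": 2.0, "eats": 4.0}
f = {"bias": 1.0, "ghosts": 1.0, "food": 2.0, "eats": 0.0}  # HYPOTHETICAL

q_sa = sum(w[k] * f[k] for k in w)          # current estimate: -15.0
r = 10.0                                    # reward for eating food
next_q_max = 0.0                            # HYPOTHETICAL next-state value
diff = (r + gamma * next_q_max) - q_sa      # 25.0
for k in w:                                 # w_i += alpha * diff * f_i
    w[k] += alpha * diff * f[k]
print(w)
```

Note that w_eats does not change here: its feature value was 0, so the update leaves it alone.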

SLIDE 16

Notes on Approximate Q-Learning

  • Learns weights for a tiny number of features.
  • Every feature’s weight is updated every step.
  • No longer tracking values for individual (s,a) pairs.
  • (s,a) value estimates are calculated from features.
  • The weight update is a form of gradient descent.
  • We’ve seen this before.
  • We’re performing a variant of linear regression.
  • Feature extraction is a type of basis change.
  • We’ll see these again.
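The gradient-descent claim can be made precise: the weight update is one gradient step on the squared error against the sample target. A sketch of the derivation:

```latex
y = r + \gamma \max_{a'} Q(s', a'), \qquad
E(w) = \tfrac{1}{2}\Bigl(y - \sum_i w_i f_i(s,a)\Bigr)^2
\\
\frac{\partial E}{\partial w_i} = -\bigl(y - Q(s,a)\bigr)\, f_i(s,a)
\quad\Longrightarrow\quad
w_i \leftarrow w_i - \alpha \frac{\partial E}{\partial w_i}
          = w_i + \alpha \bigl(y - Q(s,a)\bigr)\, f_i(s,a)
```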
SLIDE 17

Plusses and Minuses of Approximation

+ Dramatically reduces the size of the Q-table.
+ States will share many features.
+ Allows generalization to unvisited states.
+ Makes behavior more robust: making similar decisions in similar states.
+ Handles continuous state spaces!

− Requires feature selection (often must be done by hand).
− Restricts the accuracy of the learned rewards.
− The true reward function may not be linear in the features.