Approximate Q-Learning (3/23/18) - PowerPoint PPT Presentation



SLIDE 1

Approximate Q-Learning

3/23/18

SLIDE 2

On-Policy Learning (SARSA)

Instead of updating based on the best action from the next state, update based on the action your current policy actually takes from the next state.

SARSA update: Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right], where a' is the action the policy actually takes in s'.

When would this be better or worse than Q-learning?
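The two updates differ only in the bootstrap term. A minimal tabular sketch (the Q mapping, alpha, and gamma are illustrative names, not the lecture's code; Q maps (state, action) pairs to values and should default to 0, e.g. a defaultdict):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Off-policy: bootstrap from the best action available in s_next."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy: bootstrap from the action actually taken in s_next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

Under an exploring policy the targets differ: SARSA's target includes the risk of exploratory actions, which is why the two agents can learn different paths.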

SLIDE 3

Demo: Q-learning vs SARSA

https://studywolf.wordpress.com/2013/07/01/reinforcement-learning-sarsa-vs-q-learning/

SLIDE 4

Which exploration policy is better?

ε-greedy or UCB

  • For Q-learning?
  • For SARSA?
  • What information do they need?
  • What else does the choice depend on?
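One way to compare what each policy needs (an illustrative sketch, not the lab's code): ε-greedy needs only the Q estimates, while UCB additionally needs per-pair visit counts and the total step count.

```python
import math
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Needs only Q estimates; explores uniformly with probability eps."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def ucb(Q, s, actions, counts, t, c=1.0):
    """Also needs visit counts: rarely tried actions get an exploration bonus."""
    def score(a):
        n = counts[(s, a)]
        if n == 0:
            return float('inf')  # always try untried actions first
        return Q[(s, a)] + c * math.sqrt(math.log(t) / n)
    return max(actions, key=score)
```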
SLIDE 5

What will Q-learning do here?

SLIDE 6

Problem: Large State Spaces

If the state space is large, several problems arise.

  • The table of Q-value estimates becomes enormous.
  • Q-value updates can be slow to propagate.
  • High-reward states can be hard to find.

The state space grows exponentially with the number of relevant features in the environment.

SLIDE 7

Reward Shaping

Idea: give some small intermediate rewards that help the agent learn.

  • Like a heuristic, this can guide the search in the right direction.
  • Rewarding novelty can encourage exploration.

Disadvantages:

  • Requires intervention by the designer to add domain-specific knowledge.
  • If reward/discount are not balanced right, the agent might prefer accumulating the small rewards to actually solving the problem.

  • Doesn’t reduce the size of the Q-table.
SLIDE 8

PacMan State Space

  • PacMan’s location: ~100 possibilities
  • The ghosts’ locations: ~100^2 possibilities
  • Locations still containing food: ~2^100 possibilities
  • Pills remaining: 4 possibilities
  • Ghost scared timers: ~20^2 possibilities

The state space is the cross product of these feature sets.

  • So there are ~100^3 · 2^100 · 4 · 20^2 states.
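The slide's count can be checked directly (a sketch using the slide's rough estimates for each feature set):

```python
pacman_positions = 100          # ~100 board cells
ghost_positions = 100 ** 2      # two ghosts, ~100 cells each
food_configs = 2 ** 100         # each of ~100 cells has food or not
pill_configs = 4                # pills remaining
scared_timers = 20 ** 2         # two timers, ~20 values each

total = (pacman_positions * ghost_positions * food_configs
         * pill_configs * scared_timers)
print(f"{total:.2e}")  # roughly 2e+39 states -- far too many for a table
```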
SLIDE 9

Feature Vector Representation

We can represent a PacMan state as a big vector with the following dimensions:

  • PacMan x
  • PacMan y
  • ghost 1 x
  • ghost 1 y
  • ghost 1 scared
  • ghost 2 x
  • ghost 2 y
  • ghost 2 scared
  • food 1
  • food 2
  • …
  • food 100
  • power-up 1
  • power-up 2

What is the domain of each of these features?

SLIDE 10

Q-Learning Hypothesis Space

Q-learning produces a function that maps states to values.

  • Input: feature vector
  • Output: value
  • Are there any restrictions on the function that can be learned?

SLIDE 11

Function Approximation

Key Idea: learn a value function as a linear combination of features.

  • For each state/action pair encountered, determine its representation in terms of features.
  • Perform a Q-learning update on each feature.
  • The Q-value estimate is a weighted sum over the state/action pair’s features.

This is our first instance of a change of basis. We will see this idea many more times.

SLIDE 12

Simple Extractor Features from Lab

  • "bias": always 1.0
  • "#-of-ghosts-1-step-away": the number of ghosts (regardless of whether they are safe or dangerous) that are 1 step away from Pac-Man
  • "closest-food": the distance in Pac-Man steps to the closest food pellet (does take into account walls that may be in the way)
  • "eats-food": either 1 or 0, depending on whether Pac-Man will eat a pellet of food by taking the given action in the given state
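A self-contained sketch of such an extractor. The state layout (a dict with "pacman", "ghosts", and "food" keys) is an assumption for illustration, not the lab's actual API, and plain Manhattan distance stands in for the lab's wall-aware maze distance:

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def simple_features(state, action):
    """Hypothetical extractor mirroring the four lab features above."""
    dx, dy = MOVES[action]
    x, y = state["pacman"]
    nxt = (x + dx, y + dy)  # where Pac-Man lands after taking the action
    feats = {
        "bias": 1.0,
        "#-of-ghosts-1-step-away":
            float(sum(manhattan(nxt, g) == 1 for g in state["ghosts"])),
        "eats-food": 1.0 if nxt in state["food"] else 0.0,
    }
    if state["food"]:
        # Manhattan distance here; the lab uses a wall-aware maze distance.
        feats["closest-food"] = min(manhattan(nxt, f) for f in state["food"])
    return feats
```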
SLIDE 13

Extract features from neighbor states:

  • Each of these states has two legal actions.

Describe each (s,a) pair in terms of the basic features:

  • bias
  • #-of-ghosts-1-step-away
  • closest-food
  • eats-food
SLIDE 14

Approximate Q-Learning Update

Initialize the weight for each feature to 0. Every time we take an action, perform this update on every weight:

w_i \leftarrow w_i + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] f_i(s, a)

The Q-value estimate for (s, a) is the weighted sum of its features:

Q(s, a) = \sum_{i=1}^{n} f_i(s, a) \, w_i
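The weighted sum and the weight update can be sketched together (a minimal version; the weight dict is illustrative, and the alpha and gamma defaults mirror the exercise slide's values):

```python
def q_value(weights, feats):
    """Q(s,a) = sum_i f_i(s,a) * w_i -- a dot product of features and weights."""
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

def aql_update(weights, feats, reward, next_value, alpha=0.3, gamma=0.95):
    """One shared correction term, scaled per-weight by how active the feature was."""
    correction = (reward + gamma * next_value) - q_value(weights, feats)
    for f, v in feats.items():
        weights[f] = weights.get(f, 0.0) + alpha * correction * v
```

Note that a single experience moves every weight at once, which is how learning generalizes across states that share features.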
SLIDE 15

AQL Update Details

  • The weighted sum of features is equivalent to a dot product between the feature and weight vectors:

Q(s, a) = \sum_{i=1}^{n} f_i(s, a) \, w_i = \vec{w} \cdot \vec{f}(s, a)

  • The correction term is the same for all features.
  • The correction to each feature is weighted by how “active” that feature was.
SLIDE 16

Exercise: Feature Q-Update

  • Suppose PacMan takes the up action.
  • The experienced next state is random, because the ghosts’ movements are random.
  • Suppose one ghost moves right and the other moves down.

Old weight values: w_bias = 1, w_ghosts = −20, w_food = 2, w_eats = 4
Reward for eating food: +10
Reward for losing: −500
Discount: 0.95; learning rate: 0.3

SLIDE 17

Notes on Approximate Q-Learning

  • Learns weights for a tiny number of features.
  • Every feature’s weight is updated every step.
  • No longer tracking values for individual (s,a) pairs.
  • (s,a) value estimates are calculated from features.
  • The weight update is a form of gradient descent.
  • We’ve seen this before.
  • We’re performing a variant of linear regression.
  • Feature extraction is a type of basis change.
  • We’ll see these again.
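To make the gradient-descent claim concrete, here is the standard derivation (not reproduced from the slides): hold the TD target fixed and take one gradient step on the squared error.

```latex
% Squared error with the TD target held fixed:
E(w) = \tfrac{1}{2}\Bigl(\underbrace{r + \gamma \max_{a'} Q(s',a')}_{\text{target}}
       - \sum_i w_i f_i(s,a)\Bigr)^2

% Gradient with respect to one weight:
\frac{\partial E}{\partial w_i} = -\bigl(\text{target} - Q(s,a)\bigr)\, f_i(s,a)

% A gradient-descent step recovers the approximate Q-learning weight update:
w_i \leftarrow w_i - \alpha \frac{\partial E}{\partial w_i}
    = w_i + \alpha \bigl(\text{target} - Q(s,a)\bigr)\, f_i(s,a)
```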
SLIDE 18

What’s the hypothesis space?

  • Approximate Q-learning learns something different from Q-learning.
  • What is AQL’s hypothesis space?
  • Inputs
  • Outputs
  • Restrictions on the learned function
SLIDE 19

Plusses and Minuses of Approximation

+ Dramatically reduces the size of the Q-table.
+ States will share many features.
+ Allows generalization to unvisited states.
+ Makes behavior more robust: making similar decisions in similar states.
+ Handles continuous state spaces!
− Requires feature selection (often must be done by hand).
− Restricts the accuracy of the learned rewards.
− The true reward function may not be linear in the features.