SLIDE 1

Approximate Q-Learning Update

Initialize the weight for each feature to 0. Every time we take an action, perform this update:

w_i ← w_i + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ] f_i(s, a)

The Q-value estimate for (s, a) is the weighted sum of its features:

Q(s, a) = Σ_{i=1}^{n} f_i(s, a) w_i
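As a sketch in Python (the feature values, reward, and next-state value here are hypothetical; α = 0.3 and γ = 0.95 match the exercise settings later in the deck):

```python
def q_value(weights, features):
    # Q(s, a): weighted sum of the feature values for (s, a)
    return sum(w * f for w, f in zip(weights, features))

def q_update(weights, features, reward, max_next_q, alpha, gamma):
    # The correction (TD error) is the same for all features;
    # each weight's change is scaled by how active its feature was.
    correction = (reward + gamma * max_next_q) - q_value(weights, features)
    return [w + alpha * correction * f for w, f in zip(weights, features)]

weights = [0.0, 0.0]                 # initialize each weight to 0
features = [1.0, 0.5]                # hypothetical feature values f(s, a)
weights = q_update(weights, features, reward=10.0, max_next_q=0.0,
                   alpha=0.3, gamma=0.95)
```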
SLIDE 2

AQL Update Details

  • The weighted sum of features is equivalent to a dot product between the feature and weight vectors:
  • The correction term is the same for all features.
  • The correction to each feature is weighted by how “active” that feature was.

Q(s, a) = Σ_{i=1}^{n} f_i(s, a) w_i = w · f(s, a)
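A quick check of this equivalence in NumPy (the feature and weight values are illustrative):

```python
import numpy as np

f = np.array([1.0, 0.0, 0.5, 1.0])    # hypothetical feature vector f(s, a)
w = np.array([1.0, -20.0, 2.0, 4.0])  # weight vector

q_sum = sum(w[i] * f[i] for i in range(len(f)))  # explicit weighted sum
q_dot = float(w @ f)                             # dot-product form
# both forms give the same Q(s, a)
```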
SLIDE 3

Exercise: Feature Q-Update

  • Suppose PacMan is considering the up action.

Old weights: w_bias = 1, w_ghosts = −20, w_food = 2, w_eats = 4
Reward for eating food: +10
Reward for losing: −500
Discount: 0.95
Learning rate: 0.3

Features:

  • bias
  • #-of-ghosts-1-step-away
  • closest-food
  • eats-food
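One way to work through an update like this in Python; the feature values for (s, up) and the best next-state value are assumptions, since the slide leaves them to the exercise:

```python
alpha, gamma = 0.3, 0.95
w = {"bias": 1.0, "ghosts": -20.0, "food": 2.0, "eats": 4.0}
f = {"bias": 1.0, "ghosts": 0.0, "food": 0.5, "eats": 1.0}  # assumed f(s, up)

q_s_up = sum(w[k] * f[k] for k in w)   # current estimate Q(s, up)
reward = 10.0                          # PacMan eats food on this step
max_next_q = 0.0                       # assumed value of the best next action
correction = (reward + gamma * max_next_q) - q_s_up
for k in w:                            # each weight moves by alpha *
    w[k] += alpha * correction * f[k]  # correction * feature activity
```

Note that the ghost weight is untouched: its feature was inactive (0), so it receives none of the correction.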
SLIDE 4

Notes on Approximate Q-Learning

  • Learns weights for a tiny number of features.
  • Every feature’s weight is updated every step.
  • No longer tracking values for individual (s,a) pairs.
  • (s,a) value estimates are calculated from features.
  • The weight update is a form of gradient descent.
  • We’ve seen this before.
  • We’re performing a variant of linear regression.
  • Feature extraction is a type of basis change.
  • We’ll see these again.
SLIDE 5

Hypothesis Spaces

[Diagram: tabular Q-learning maps the raw state (PacMan x/y, each ghost’s x/y and scared flag, food 1 … food 100, power-ups) plus an action directly to a value ∈ ℝ; approximate Q-learning instead applies feature extraction (bias, # ghosts, closest food, eats food, action) followed by a linear function, also producing a value ∈ ℝ.]

SLIDE 6

Plusses and Minuses of Approximation

+ Dramatically reduces the size of the Q-table.
+ States will share many features.
+ Allows generalization to unvisited states.
+ Makes behavior more robust: making similar decisions in similar states.
+ Handles continuous state spaces!
− Requires feature selection (often must be done by hand).
− Restricts the accuracy of the learned rewards.
− The true reward function may not be linear in the features.

SLIDE 7

Linear Regression

3/26/18

SLIDE 8

This Week: Supervised Learning

  • Data consists of (input, output) pairs.
  • We know the right output for each example input.

Sub-categories of supervised learning:

  • Regression
      • Continuous outputs
  • Classification
      • Discrete outputs (labels)
SLIDE 9

Linear Regression Hypothesis Space

Supervised learning

  • For every input in the data set, we know the output.

Regression

  • Outputs are continuous
  • A number, not a category label

The learned model:

  • A linear function mapping input to output
  • A weight for each feature (including bias)
SLIDE 10

Linear Models

In two dimensions:

f(x) = wx + b

In d dimensions:

x ≡ (x_0, x_1, ..., x_d)

f(x) = (w_b, w_0, ..., w_d) · (1, x_0, ..., x_d)

We want to find the linear model that fits our data best. When have we seen a model like this before?
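A minimal sketch of evaluating such a model (the weights here are arbitrary example values):

```python
import numpy as np

def linear_model(w, x):
    # f(x) = (w_b, w_0, ..., w_d) . (1, x_0, ..., x_d):
    # the bias weight pairs with a constant 1 prepended to the input.
    return float(np.dot(w, np.concatenate(([1.0], x))))

w = np.array([0.5, 2.0, -1.0])   # (bias, w_0, w_1)
x = np.array([3.0, 1.0])
y = linear_model(w, x)           # 0.5 + 2.0*3.0 - 1.0*1.0
```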