CS 4803 / 7643: Deep Learning. Topics: Policy Gradients, Actor-Critic

SLIDE 1

CS 4803 / 7643: Deep Learning

Ashwin Kalyan Georgia Tech

Topics:

  • Policy Gradients
  • Actor-Critic

SLIDE 2

Topics we’ll cover

  • Overview of RL
    • RL vs other forms of learning
    • RL “API”
    • Applications
  • Framework: Markov Decision Processes (MDPs)
    • Definitions and notations
    • Policies and Value Functions
  • Solving MDPs
    • Value Iteration (recap)
    • Q-Value Iteration (new)
    • Policy Iteration
  • Reinforcement learning
    • Value-based RL (Q-learning, Deep Q-Learning)
    • Policy-based RL (Policy gradients)
    • Actor-Critic

SLIDE 3

Recap: MDPs

  • Markov Decision Processes (MDP):
    • States: $\mathcal{S}$
    • Actions: $\mathcal{A}$
    • Rewards: $R(s, a, s')$
    • Transition Function: $T(s, a, s') = p(s' \mid s, a)$
    • Discount Factor: $\gamma$

SLIDE 4

Recap: Optimal Value Function

The optimal Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and acting optimally thereafter.

SLIDE 5

Recap: Optimal Value Function

The optimal Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and acting optimally thereafter.

Optimal policy: $\pi^*(s) = \arg\max_{a \in \mathcal{A}} Q^*(s, a)$

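To make the recap concrete, here is a minimal sketch of Q-value iteration followed by greedy policy extraction. The two-state MDP, its reward function, the discount value and all function names are invented for illustration and do not come from the lecture.

```python
# Hypothetical 2-state, 2-action MDP, invented purely for illustration.
STATES = [0, 1]
ACTIONS = [0, 1]
GAMMA = 0.9  # discount factor (assumed value)

# T[s][a] lists (next_state, probability) pairs, i.e. p(s' | s, a).
# Here action a deterministically moves the agent to state a.
T = {
    0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)], 1: [(1, 1.0)]},
}


def R(s, a, s_next):
    """Reward R(s, a, s'): +1 for landing in state 1, otherwise 0."""
    return 1.0 if s_next == 1 else 0.0


def q_value_iteration(num_iters=500):
    """Repeatedly apply the Bellman optimality backup to a Q-table."""
    q = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    for _ in range(num_iters):
        new_q = {}
        for s in STATES:
            new_q[s] = {}
            for a in ACTIONS:
                # Expected reward plus discounted value of acting optimally after.
                new_q[s][a] = sum(
                    p * (R(s, a, s2) + GAMMA * max(q[s2].values()))
                    for s2, p in T[s][a]
                )
        q = new_q
    return q


def greedy_policy(q):
    """Optimal policy: pick the action with the largest Q-value in each state."""
    return {s: max(ACTIONS, key=lambda a: q[s][a]) for s in STATES}


q_star = q_value_iteration()
pi_star = greedy_policy(q_star)
```

In this toy problem the agent should always move to state 1, so the greedy policy selects action 1 in both states.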
SLIDE 6

Recap: Learning Based Methods

  • Typically, we don’t know the environment
    • Unknown: how actions affect the environment.
    • Unknown: what/when are the good actions?

SLIDE 7

Recap: Learning Based Methods

  • Typically, we don’t know the environment
    • Unknown: how actions affect the environment.
    • Unknown: what/when are the good actions?
  • But, we can learn by trial and error.
    • Gather experience (data) by performing actions.
    • Approximate unknown quantities from data.

SLIDE 8

Recap: Deep Q-Learning

  • Collect a dataset
  • Loss for a single data point: compares the target Q-value with the predicted Q-value
  • Act optimally according to the learnt Q function (pick the action with the best Q-value):

$$\pi(s) = \arg\max_{a \in \mathcal{A}} Q(s, a)$$

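The loss equation itself did not survive in the slide text, so the following sketch assumes the standard squared temporal-difference error between the target and the predicted Q-value. The function names, the `done` flag and the discount value are illustrative assumptions.

```python
GAMMA = 0.99  # discount factor (assumed value)


def td_target(reward, next_q_values, done, gamma=GAMMA):
    """Target Q-value: r + gamma * max_a' Q(s', a'), or just r at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def dqn_loss(predicted_q, reward, next_q_values, done, gamma=GAMMA):
    """Squared error between the target and the predicted Q-value (assumed form)."""
    target = td_target(reward, next_q_values, done, gamma)
    return (target - predicted_q) ** 2


def greedy_action(q_values):
    """pi(s) = argmax_a Q(s, a): index of the action with the best Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# One hypothetical data point (s, a, r, s'): Q(s, a) = 1.0, r = 0.5,
# and Q(s', .) = [0.0, 2.0], using gamma = 0.5 to keep the arithmetic simple.
loss = dqn_loss(predicted_q=1.0, reward=0.5, next_q_values=[0.0, 2.0],
                done=False, gamma=0.5)
best = greedy_action([0.0, 2.0])
```

When the Q function is a neural network, the target is usually treated as a constant during backpropagation, so only the predicted Q-value receives gradients.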
SLIDE 9

Getting to the optimal policy

  • Transition function $T$ and reward function $R$ known: use value / policy iteration to obtain the “optimal” policy

SLIDE 10

Getting to the optimal policy

  • Transition function $T$ and reward function $R$ known: use value / policy iteration to obtain the “optimal” policy
  • $T$ and $R$ unknown: estimate Q-values from data, then obtain the “optimal” policy (previous class: Q-learning)

SLIDE 11

Getting to the optimal policy

  • Transition function $T$ and reward function $R$ known: use value / policy iteration to obtain the “optimal” policy
  • $T$ and $R$ unknown: estimate $T$ and $R$ from data, then use value / policy iteration (Homework!)
  • $T$ and $R$ unknown: estimate Q-values from data, then obtain the “optimal” policy

SLIDE 12

Getting to the optimal policy

  • Transition function $T$ and reward function $R$ known: use value / policy iteration to obtain the “optimal” policy
  • $T$ and $R$ unknown: estimate $T$ and $R$ from data, then use value / policy iteration
  • $T$ and $R$ unknown: estimate Q-values from data, then obtain the “optimal” policy
  • $T$ and $R$ unknown: this class!

SLIDE 13

Learning the optimal policy

  • Class of policies defined by parameters $\theta$: $\pi_\theta(a \mid s) : \mathcal{S} \to \mathcal{A}$
  • E.g., $\theta$ can be the parameters of a linear transformation, deep network, etc.

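As one possible instance of such a policy class, the sketch below uses a linear transformation of the state followed by a softmax over actions. The softmax choice, the feature vector and the parameter values are assumptions made for illustration; the slide only says that the parameters can belong to a linear map or a deep network.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of real-valued scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_probs(theta, state):
    """pi_theta(. | s) for a linear-softmax policy.

    theta[a] is the weight vector for action a; the score of action a
    is the dot product of theta[a] with the state features.
    """
    logits = [sum(w * x for w, x in zip(theta[a], state))
              for a in range(len(theta))]
    return softmax(logits)


def sample_action(theta, state, rng=random):
    """Draw an action a ~ pi_theta(. | s)."""
    probs = policy_probs(theta, state)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical parameters: two actions, two state features.
theta = [[0.0, 0.0], [1.0, -1.0]]
state = [1.0, 0.0]
probs = policy_probs(theta, state)
action = sample_action(theta, state, random.Random(0))
```

Changing the entries of `theta` changes the action probabilities, which is what makes the policy learnable by gradient methods.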
SLIDE 14

Learning the optimal policy

  • Class of policies defined by parameters $\theta$: $\pi_\theta(a \mid s) : \mathcal{S} \to \mathcal{A}$
  • E.g., $\theta$ can be the parameters of a linear transformation, deep network, etc.
  • Want to maximize:

$$J(\pi) = \mathbb{E}\left[\sum_{t=1}^{T} R(s_t, a_t)\right]$$

  • In other words,

$$\pi^* = \arg\max_{\pi : \mathcal{S} \to \mathcal{A}} \mathbb{E}\left[\sum_{t=1}^{T} R(s_t, a_t)\right]$$

    becomes

$$\theta^* = \arg\max_{\theta} \mathbb{E}\left[\sum_{t=1}^{T} R(s_t, a_t)\right]$$

SLIDE 15

Learning the optimal policy

  • Class of policies defined by parameters $\theta$: $\pi_\theta(a \mid s) : \mathcal{S} \to \mathcal{A}$
  • E.g., $\theta$ can be the parameters of a linear transformation, deep network, etc.
  • Want to maximize:

$$J(\pi) = \mathbb{E}\left[\sum_{t=1}^{T} R(s_t, a_t)\right]$$

  • In other words,

$$\pi^* = \arg\max_{\pi : \mathcal{S} \to \mathcal{A}} \mathbb{E}\left[\sum_{t=1}^{T} R(s_t, a_t)\right]$$

    becomes

$$\theta^* = \arg\max_{\theta} \mathbb{E}\left[\sum_{t=1}^{T} R(s_t, a_t)\right]$$

SLIDE 16

Learning the optimal policy

  • Slightly rewriting the notation:
  • Let $\tau = (s_0, a_0, \ldots, s_T, a_T)$ denote the trajectory

$$p_\theta(\tau) = p_\theta(s_0, a_0, \ldots, s_T, a_T) = \prod_{t=0}^{T} p_\theta(a_t \mid s_t) \cdot p(s_{t+1} \mid s_t, a_t)$$

$$\arg\max_{\theta} \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[R(\tau)\right]$$

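The factorisation of the trajectory probability can be sketched in code by summing log-probabilities instead of multiplying probabilities. The callables `policy_prob` and `transition_prob` and the toy dynamics are hypothetical stand-ins. This sketch only includes transition terms whose next state is part of the stored trajectory, which is an assumption about how the final step is handled.

```python
import math


def trajectory_log_prob(states, actions, policy_prob, transition_prob):
    """log p_theta(tau) as a sum of policy and transition log-probabilities.

    policy_prob(a, s) stands in for p_theta(a | s).
    transition_prob(s_next, s, a) stands in for p(s_next | s, a).
    """
    log_p = 0.0
    for t, (s, a) in enumerate(zip(states, actions)):
        log_p += math.log(policy_prob(a, s))
        # Only add a transition term when the next state was recorded.
        if len(states) > t + 1:
            log_p += math.log(transition_prob(states[t + 1], s, a))
    return log_p


def uniform_policy(action, state):
    """Hypothetical policy: two actions, each chosen with probability 0.5."""
    return 0.5


def deterministic_transition(next_state, state, action):
    """Hypothetical dynamics: the next state is always state + action."""
    return 1.0 if next_state == state + action else 0.0


states = [0, 1, 1]
actions = [1, 0, 1]
log_p = trajectory_log_prob(states, actions, uniform_policy,
                            deterministic_transition)
prob = math.exp(log_p)  # three policy choices at 0.5 each
```

Only the policy terms depend on $\theta$; the transition terms come from the environment, which matters when differentiating $\log p_\theta(\tau)$.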
SLIDE 17

Learning the optimal policy

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[R(\tau)\right]$$

$$= \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim p(\cdot \mid s_t, a_t)}\left[\sum_{t=0}^{T} R(s_t, a_t)\right]$$

$$\approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} r(s_t^i, a_t^i)$$

Sample a few trajectories $\{\tau_i\}_{i=1}^{N}$ by acting according to $\pi_\theta$.

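The sample-based approximation of $J(\theta)$ can be illustrated with a toy rollout loop. The environment, the one-parameter policy (`p_right`) and the horizon below are invented for this sketch. In this toy setting the exact objective is the horizon times `p_right`, which gives something to compare the estimate against.

```python
import random

HORIZON = 5  # number of steps per trajectory (assumed value)


def policy_sample(p_right, rng):
    """Stand-in for pi_theta(a | s): choose action 1 with probability p_right."""
    return 1 if p_right > rng.random() else 0


def step(state, action):
    """Hypothetical dynamics and reward: action 1 moves right and pays 1."""
    next_state = state + action
    reward = float(action)
    return next_state, reward


def rollout(p_right, rng):
    """Sample one trajectory and return its total reward R(tau)."""
    state, total = 0, 0.0
    for _ in range(HORIZON):
        action = policy_sample(p_right, rng)
        state, reward = step(state, action)
        total += reward
    return total


def estimate_objective(p_right, num_trajectories, seed=0):
    """Monte Carlo estimate of J: average total reward over N trajectories."""
    rng = random.Random(seed)
    returns = [rollout(p_right, rng) for _ in range(num_trajectories)]
    return sum(returns) / num_trajectories


j_hat = estimate_objective(p_right=0.7, num_trajectories=20000)
```

With more trajectories the estimate concentrates around the exact value (3.5 for this toy setting), at the cost of more interaction with the environment.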
SLIDE 18

REINFORCE

  1. Sample trajectories $\tau_i = \{s_1, a_1, \ldots, s_T, a_T\}_i$ by acting according to $\pi_\theta$
  2. Compute policy gradient as

$$\nabla_\theta J(\theta) \approx \sum_{i} \left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i) \cdot \sum_{t=1}^{T} R(s_t^i, a_t^i)\right]$$

  3. Update policy

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

Loop: run the policy and sample trajectories, compute the policy gradient, update the policy, and repeat.

Slide credit: Sergey Levine

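The three REINFORCE steps can be sketched on a toy problem. Everything specific here is an assumption made for illustration: the state-independent softmax policy over two actions, the reward of 1 for action 1, the step size, and the batch size. The sketch also averages the gradient over the sampled trajectories, which differs from the plain sum on the slide only by a constant factor.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_trajectory(theta, horizon, rng):
    """Step 1: act according to pi_theta and record actions and rewards."""
    probs = softmax(theta)
    actions, rewards = [], []
    for _ in range(horizon):
        a = rng.choices([0, 1], weights=probs, k=1)[0]
        actions.append(a)
        rewards.append(1.0 if a == 1 else 0.0)  # hypothetical reward
    return actions, rewards


def grad_log_prob(theta, action):
    """Gradient of log softmax(theta)[action]: one-hot(action) minus probs."""
    probs = softmax(theta)
    return [(1.0 if k == action else 0.0) - probs[k]
            for k in range(len(theta))]


def reinforce_gradient(theta, trajectories):
    """Step 2: (sum of grad log-probs) times (sum of rewards), averaged."""
    grad = [0.0] * len(theta)
    for actions, rewards in trajectories:
        total_reward = sum(rewards)
        score = [0.0] * len(theta)
        for a in actions:
            g = grad_log_prob(theta, a)
            score = [s + gi for s, gi in zip(score, g)]
        grad = [gr + s * total_reward for gr, s in zip(grad, score)]
    return [g / len(trajectories) for g in grad]


def train(num_iters=200, batch_size=20, horizon=5, alpha=0.1, seed=0):
    """Repeat: sample trajectories, compute the gradient, update theta."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(num_iters):
        trajs = [sample_trajectory(theta, horizon, rng)
                 for _ in range(batch_size)]
        grad = reinforce_gradient(theta, trajs)
        # Step 3: gradient ascent on the expected reward.
        theta = [t + alpha * g for t, g in zip(theta, grad)]
    return theta


theta = train()
final_probs = softmax(theta)
```

After training, the policy should place most of its probability on the rewarded action. The estimate is noisy because every log-probability gradient is weighted by the reward of the whole trajectory.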
SLIDE 19

Policy Gradients

$$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[R(\tau)\right]$$

Expand expectation:

$$= \nabla_\theta \int \pi_\theta(\tau) R(\tau) \, d\tau$$

Here $\pi_\theta(\tau)$ denotes the trajectory distribution $p_\theta(\tau)$.

SLIDE 20

20

Policy Gradients

$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)]$

$= \nabla_\theta \int \pi_\theta(\tau) R(\tau) \, d\tau$ (expand expectation)

$= \int \nabla_\theta \pi_\theta(\tau) R(\tau) \, d\tau$ (exchange gradient and integral)

slide-21
SLIDE 21

21

Policy Gradients

$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)]$

$= \nabla_\theta \int \pi_\theta(\tau) R(\tau) \, d\tau$ (expand expectation)

$= \int \nabla_\theta \pi_\theta(\tau) R(\tau) \, d\tau$ (exchange gradient and integral)

$= \int \nabla_\theta \pi_\theta(\tau) \cdot \frac{\pi_\theta(\tau)}{\pi_\theta(\tau)} \cdot R(\tau) \, d\tau$
slide-22
SLIDE 22

22

Policy Gradients

$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)]$

$= \nabla_\theta \int \pi_\theta(\tau) R(\tau) \, d\tau$ (expand expectation)

$= \int \nabla_\theta \pi_\theta(\tau) R(\tau) \, d\tau$ (exchange gradient and integral)

$= \int \nabla_\theta \pi_\theta(\tau) \cdot \frac{\pi_\theta(\tau)}{\pi_\theta(\tau)} \cdot R(\tau) \, d\tau$

$\nabla_\theta \log \pi(\tau) = \frac{\nabla_\theta \pi(\tau)}{\pi(\tau)}$
slide-23
SLIDE 23

23

Policy Gradients

$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)]$

$= \nabla_\theta \int \pi_\theta(\tau) R(\tau) \, d\tau$ (expand expectation)

$= \int \nabla_\theta \pi_\theta(\tau) R(\tau) \, d\tau$ (exchange gradient and integral)

$= \int \nabla_\theta \pi_\theta(\tau) \cdot \frac{\pi_\theta(\tau)}{\pi_\theta(\tau)} \cdot R(\tau) \, d\tau$

$= \int \pi_\theta(\tau) \nabla_\theta \log \pi_\theta(\tau) R(\tau) \, d\tau$

$\nabla_\theta \log \pi(\tau) = \frac{\nabla_\theta \pi(\tau)}{\pi(\tau)}$
slide-24
SLIDE 24

24

Policy Gradients

$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)]$

$= \nabla_\theta \int \pi_\theta(\tau) R(\tau) \, d\tau$ (expand expectation)

$= \int \nabla_\theta \pi_\theta(\tau) R(\tau) \, d\tau$ (exchange gradient and integral)

$= \int \nabla_\theta \pi_\theta(\tau) \cdot \frac{\pi_\theta(\tau)}{\pi_\theta(\tau)} \cdot R(\tau) \, d\tau$

$= \int \pi_\theta(\tau) \nabla_\theta \log \pi_\theta(\tau) R(\tau) \, d\tau$

$= \mathbb{E}_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log \pi_\theta(\tau) R(\tau)]$

$\nabla_\theta \log \pi(\tau) = \frac{\nabla_\theta \pi(\tau)}{\pi(\tau)}$
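The derivation can be sanity-checked numerically on a small discrete case, where the integral over trajectories becomes a finite sum. The sketch below is an illustration under assumptions rather than part of the lecture: it treats each of four "trajectories" as one outcome of a softmax distribution, and the names `finite_difference_gradient` and `score_function_gradient`, as well as the parameter and return values, are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def finite_difference_gradient(theta, returns, eps=1e-6):
    """Gradient of J(theta) = sum_tau pi_theta(tau) R(tau) by central differences."""
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        d = np.zeros_like(theta)
        d[k] = eps
        grad[k] = (softmax(theta + d) @ returns
                   - softmax(theta - d) @ returns) / (2 * eps)
    return grad

def score_function_gradient(theta, returns):
    """E_{tau ~ pi_theta}[grad log pi_theta(tau) * R(tau)], summed exactly over tau."""
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    for tau in range(theta.size):
        grad_log_pi = -pi.copy()
        grad_log_pi[tau] += 1.0      # gradient of log-softmax at outcome tau
        grad += pi[tau] * grad_log_pi * returns[tau]
    return grad

theta = np.array([0.2, -0.5, 1.0, 0.0])     # hypothetical parameters
returns = np.array([1.0, 0.0, 3.0, -2.0])   # hypothetical R(tau) for each outcome
```

Both functions should agree up to finite-difference error, which is what the last line of the derivation claims.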
slide-25
SLIDE 25

25

Policy Gradients

$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log \pi_\theta(\tau) R(\tau)]$

$p_\theta(\tau) = p_\theta(s_0, a_0, \ldots, s_T, a_T) = p(s_0) \prod_{t=0}^{T} p_\theta(a_t \mid s_t) \cdot p(s_{t+1} \mid s_t, a_t)$

$\nabla_\theta \log \pi_\theta(\tau) = \nabla_\theta \left[ \log p(s_0) + \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t) + \sum_{t=1}^{T} \log p(s_{t+1} \mid s_t, a_t) \right]$
slide-26
SLIDE 26

26

Policy Gradients

$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log \pi_\theta(\tau) R(\tau)]$

$\nabla_\theta \log \pi_\theta(\tau) = \nabla_\theta \left[ \log p(s_0) + \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t) + \sum_{t=1}^{T} \log p(s_{t+1} \mid s_t, a_t) \right]$

The $\log p(s_0)$ and $\log p(s_{t+1} \mid s_t, a_t)$ terms do not involve $\theta$, so the gradient doesn't depend on the transition probabilities!
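A quick numerical check of this claim is sketched below. It is a hedged illustration, not course code: the tabular softmax policy, the fixed example trajectory, and the two randomly drawn transition tables `P_a` and `P_b` are assumptions, and the function names are hypothetical. The final transition out of the last recorded state is omitted because the example trajectory does not record a next state for it.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def log_traj_prob(theta, states, actions, p0, P):
    """log p(s_first) + sum_t log pi(a_t|s_t) + sum_t log p(s_{t+1}|s_t, a_t)."""
    logp = np.log(p0[states[0]])
    for t, (s, a) in enumerate(zip(states, actions)):
        logp += np.log(softmax(theta[s])[a])
        if t + 1 < len(states):                 # no recorded successor for the last step
            logp += np.log(P[s, a, states[t + 1]])
    return logp

def grad_log_traj_prob(theta, states, actions, p0, P, eps=1e-6):
    """Central finite-difference gradient of log p_theta(tau) w.r.t. theta."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        d = np.zeros_like(theta)
        d[idx] = eps
        grad[idx] = (log_traj_prob(theta + d, states, actions, p0, P)
                     - log_traj_prob(theta - d, states, actions, p0, P)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
theta = rng.normal(size=(2, 2))                 # logits per (state, action)
states, actions = [0, 1, 1, 0], [1, 0, 1, 1]    # one fixed example trajectory
p0 = np.array([0.5, 0.5])
P_a = rng.dirichlet(np.ones(2), size=(2, 2))    # one hypothetical dynamics model
P_b = rng.dirichlet(np.ones(2), size=(2, 2))    # a different dynamics model
g_a = grad_log_traj_prob(theta, states, actions, p0, P_a)
g_b = grad_log_traj_prob(theta, states, actions, p0, P_b)
```

`g_a` and `g_b` should match up to numerical error even though the two dynamics models differ, which is why the estimator can be computed without knowing the transition function.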

slide-27
SLIDE 27

27

Policy Gradients

$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log \pi_\theta(\tau) R(\tau)]$

with $\nabla_\theta \log \pi_\theta(\tau) = \nabla_\theta \left[ \log p(s_0) + \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t) + \sum_{t=1}^{T} \log p(s_{t+1} \mid s_t, a_t) \right]$

$= \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \sum_{t=1}^{T} R(s_t, a_t) \right]$
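To make the per-step term $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ concrete, here is a short worked case. It assumes a tabular softmax policy with one logit $\theta_{s,a}$ per state-action pair; this parameterization is an illustrative assumption and is not specified on the slides.

```latex
\pi_\theta(a \mid s) = \frac{\exp(\theta_{s,a})}{\sum_{a'} \exp(\theta_{s,a'})}
\quad\Longrightarrow\quad
\frac{\partial}{\partial \theta_{s',a'}} \log \pi_\theta(a \mid s)
  = \mathbb{1}[s' = s] \left( \mathbb{1}[a' = a] - \pi_\theta(a' \mid s) \right)
```

Under this assumption, each visited pair $(s_t, a_t)$ raises the logit of the action taken and lowers the logits of the other actions in that state, and the whole sum is then scaled by the trajectory's total reward.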
slide-28
SLIDE 28

28

Policy Gradients

$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log \pi_\theta(\tau) R(\tau)]$

with $\nabla_\theta \log \pi_\theta(\tau) = \nabla_\theta \left[ \log p(s_0) + \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t) + \sum_{t=1}^{T} \log p(s_{t+1} \mid s_t, a_t) \right]$

$= \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \sum_{t=1}^{T} R(s_t, a_t) \right]$
slide-29
SLIDE 29

29

REINFORCE

  • 1. Sample trajectories $\tau^i = \{s_1, a_1, \ldots, s_T, a_T\}^i$ by acting according to $\pi_\theta$
  • 2. Compute policy gradient as

$$\nabla_\theta J(\theta) \approx \sum_i \left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i) \cdot \sum_{t=1}^{T} R(s_t^i, a_t^i)\right]$$

  • 3. Update policy: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

Diagram: Run the policy and sample trajectories → Compute policy gradient → Update policy

Slide credit: Sergey Levine

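
The three REINFORCE steps can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the course's reference implementation: the environment is a made-up single-state task in which action 1 pays reward 1 and action 0 pays 0, the policy is a two-logit softmax, and the sum over trajectories is divided by the batch size (the slide writes a plain sum). All function names here are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample_trajectory(theta, horizon, rng):
    """Step 1: act according to pi_theta; returns [(action, reward), ...]."""
    probs = softmax(theta)
    traj = []
    for _ in range(horizon):
        a = 0 if rng.random() < probs[0] else 1
        r = 1.0 if a == 1 else 0.0  # toy reward (assumption): action 1 pays 1
        traj.append((a, r))
    return traj

def policy_gradient(theta, trajectories):
    """Step 2: average over i of [sum_t grad log pi(a_t) * sum_t R_t]."""
    probs = softmax(theta)
    grad = [0.0, 0.0]
    for traj in trajectories:
        total_reward = sum(r for _, r in traj)
        for a, _ in traj:
            for k in range(2):
                # For a softmax policy: d log pi(a) / d theta_k = 1[a == k] - pi(k)
                grad[k] += ((1.0 if a == k else 0.0) - probs[k]) * total_reward
    n = len(trajectories)
    return [g / n for g in grad]

def reinforce(steps=200, batch=16, horizon=5, alpha=0.05, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(steps):
        trajs = [sample_trajectory(theta, horizon, rng) for _ in range(batch)]
        grad = policy_gradient(theta, trajs)
        # Step 3: gradient ascent, theta <- theta + alpha * grad J(theta)
        theta = [t + alpha * g for t, g in zip(theta, grad)]
    return theta

theta = reinforce()
print(softmax(theta))  # probability mass should shift toward action 1
```

Because the whole-trajectory reward multiplies every action's log-probability gradient, each update is noisy; the batch average only partly hides that.
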
slide-30
SLIDE 30

Pong from pixels

30

Image Credit: http://karpathy.github.io/2016/05/31/rl/

slide-33
SLIDE 33

Intuition

(C) Dhruv Batra 33

slide-34
SLIDE 34

34

Policy Gradients

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\right]$$

$$\nabla_\theta \left[\log p(s_0) + \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t) + \sum_{t=1}^{T} \log p(s_{t+1} \mid s_t, a_t)\right]$$

$$= \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \sum_{t=1}^{T} R(s_t, a_t)\right]$$


Formalizes the notion of “trial and error”:

  • If the reward is high, the probability of the actions seen is increased
  • If the reward is low, the probability of the actions seen is reduced
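
The “trial and error” reading can be checked on a single update. The sketch below assumes a three-action softmax policy and a hypothetical `step` helper that applies one gradient-ascent update for one sampled action. It also assumes rewards can be negative (for instance after centring), so that "low" reward means a negative multiplier.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def step(theta, action, reward, alpha=0.1):
    """One update theta <- theta + alpha * grad log pi(action) * reward."""
    probs = softmax(theta)
    return [t + alpha * ((1.0 if k == action else 0.0) - probs[k]) * reward
            for k, t in enumerate(theta)]

theta = [0.0, 0.0, 0.0]
p_before = softmax(theta)[1]
p_high = softmax(step(theta, action=1, reward=+2.0))[1]  # action 1 was rewarded
p_low = softmax(step(theta, action=1, reward=-2.0))[1]   # action 1 was penalised
print(p_before, p_high, p_low)
```

With strictly non-negative rewards every sampled action would be pushed up, only by different amounts, which is one way to see why subtracting a reference value from the reward matters.
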
slide-37
SLIDE 37

37

Issues with Policy Gradients

  • Credit assignment is hard!
  • Which specific action led to the increase in reward?
  • Suffers from high variance → leading to unstable training
  • How to reduce the variance?
  • Subtract a constant from the reward!
  • Why does it work?
  • What is the best choice of b?

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \left(\sum_{t=1}^{T} R(s_t, a_t) - b\right)\right]$$

Homework!
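
A small numeric check can build intuition before attempting the derivation. The sketch assumes a one-step problem with two actions, a softmax policy with probability `p1` of picking action 1, and made-up rewards of 10 and 11. The helper `grad_stats` is hypothetical; it enumerates both actions to get the exact mean and variance of the single-sample estimator with respect to the logit of action 1. The baseline value 10.5 is just the average reward here, chosen for illustration and not claimed to be the best choice.

```python
def grad_stats(p1, rewards, b):
    """Exact mean and variance of g = d/dtheta_1 log pi(a) * (R(a) - b)
    for a two-action softmax policy, by enumerating both actions."""
    probs = [1.0 - p1, p1]
    # For a softmax policy: d log pi(a) / d theta_1 = 1[a == 1] - p1
    samples = [((1.0 if a == 1 else 0.0) - p1) * (rewards[a] - b) for a in (0, 1)]
    mean = sum(p * g for p, g in zip(probs, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, var

rewards = [10.0, 11.0]  # hypothetical rewards: both large, nearly equal
mean0, var0 = grad_stats(0.5, rewards, b=0.0)
mean_b, var_b = grad_stats(0.5, rewards, b=10.5)
print("no baseline:  ", mean0, var0)
print("with baseline:", mean_b, var_b)
```

In this toy case the expected gradient is the same with and without the baseline, while the variance drops sharply once the common offset in the rewards is removed.
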

slide-38
SLIDE 38

38

Taking a step back

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \sum_{t=1}^{T} R(s_t, a_t)\right]$$

Policy Evaluation (Recall Policy iteration)

  • REINFORCE: Evaluate and update policy based on Monte-Carlo estimates of the total reward – very noisy!
  • Other ways of policy evaluation?
  • If we had the Q function, we could have used it!
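
To see how noisy a single Monte-Carlo return is, consider a hypothetical rollout in which each of 20 steps pays reward 1 with probability 0.5. The true expected total reward from the start (what a Q function would report) is 10, but individual rollouts scatter widely around it. The environment and the name `rollout_return` are assumptions made for this sketch.

```python
import random
import statistics

def rollout_return(horizon, rng):
    """Total reward of one rollout: each step pays 1 with probability 0.5."""
    return sum(1.0 if rng.random() < 0.5 else 0.0 for _ in range(horizon))

rng = random.Random(0)
horizon = 20
returns = [rollout_return(horizon, rng) for _ in range(2000)]

q_true = 0.5 * horizon                   # exact expected return for this toy process
mc_mean = statistics.mean(returns)       # average of many rollouts, close to q_true
mc_spread = statistics.pstdev(returns)   # spread of a single-rollout estimate
print(q_true, mc_mean, mc_spread)
```

REINFORCE multiplies the log-probability gradient by one such sampled return, so the spread of a single rollout carries straight into the gradient estimate.
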
slide-41
SLIDE 41

41

Actor-Critic

  • Learn both policy and Q function
  • Use the “actor” to sample trajectories
  • Use the Q function to “evaluate” or “critique” the policy
  • REINFORCE: $\nabla_\theta J(\pi_\theta) = \mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, R(s, a)\right]$
  • Actor-critic: $\nabla_\theta J(\pi_\theta) = \mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$
  • Q function is unknown too! Update using $R(s, a)$
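
A minimal sketch of the idea, under strong simplifying assumptions: a single state, one-step episodes, a two-logit softmax actor, and a tabular critic `q` that is nudged toward each sampled reward. With one-step episodes the critic's target is just R(s, a). In a multi-step MDP the target would also involve the value of the next state, which this excerpt does not spell out. The reward means and the function name `actor_critic` are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def actor_critic(steps=3000, alpha_actor=0.05, alpha_critic=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]          # actor parameters (policy logits)
    q = [0.0, 0.0]              # critic: tabular estimate of Q(s, a)
    mean_reward = [0.2, 0.8]    # hypothetical environment: action 1 is better
    for _ in range(steps):
        probs = softmax(theta)
        a = 0 if rng.random() < probs[0] else 1     # actor samples an action
        r = mean_reward[a] + rng.gauss(0.0, 0.5)    # noisy reward R(s, a)
        # Critic update: move Q(s, a) toward the sampled reward
        q[a] += alpha_critic * (r - q[a])
        # Actor update: grad log pi(a|s) * Q(s, a) instead of the raw reward
        for k in range(2):
            theta[k] += alpha_actor * ((1.0 if k == a else 0.0) - probs[k]) * q[a]
    return theta, q

theta, q = actor_critic()
print(softmax(theta), q)
```

The actor never sees the raw noisy reward directly; it is scaled by the critic's running estimate, which averages over many samples of the same action.
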
slide-45
SLIDE 45

45

  • Initialize s, (policy network) and (Q network)
  • sample action
  • For each step:
  • Sample reward and next state
  • evaluate “actor” using “critic”

Actor-Critic

θ

<latexit sha1_base64="2zAaSTtsz8LmjwqlL7nkIJ/gjBM=">AHEHicjVLb9NAEHYLCSU82sKRy4oqKFRFRckuESqQEiIQ9Wi9CFlE2u92Sr+qXdNbRy9ydw4a9w4QBCXDly498waztYqdVLSWanZmd+eabGduNPC5Vu/1vafnW7Ur1zsrd2r37Dx6urq0/OpRhLCg7oKEXimOXSObxgB0orjx2HAlGfNdjR+7JW2M/+sSE5GHQVWcR6/tkHPARp0SBylmvPKujDsI+URPXTd5pJyGOwpL7OINTIehOpeOaraQdBK1aWtjQtGlpQXuTY09NlI9LGMfvDptPehmISnxko+6MfXDgo8nqo8wrtXRhwZWE6ZIs5AfKxJnWZzMoWE0Fzlmwqb6aUyIiEkUifAU4ZEgNLF1sqszSLxj68HuFJ5dhpfwltItRDKhWatnmQfPDTYixuB9apAZpb4EO1+2iXtN1QZgxPOazCEgrkfy8zwbc5abcFOmBdJBIB6oYrQLCDl9pEZYUWTLmsy72KuofmvpXOUJizcKiLwnEJxVUhryHp5iMz25/ZeAUk0H2lz82c65yWKxsLq3UHYqbdrdUQxF+jgaSwgJp91Em2guhbHM4cl2b5Fb1Jjuo8+HyEwaKoyaYqcqfT0kg2hlYt6kPenk29MaVJm5iAXdRqaxA43txLp2MCD3UIY6JKAowvHLtYOT/3yXFA2ESL8PNVAPcSLJuSK5i5q+wUrxdaT/F3lTJczI/m6Np/LZpGd/cHMrEODWwSGJ+cyUzprG+2tdvqgsmDnwoaVP3vO2l/ghMY+CxT1iJQ9ux2pfkKE4tRjuoZjySJCT8iY9UAMiM9kP0kxaVQHzRCNQgE/2KhUO3sjIb6UZ74LnoYLWbQZ5SJbL1aj1/2EB1GsWECzRKPYQypE5uAhlwqrwzEAgVHLAiOiEwHAq+ITUgwS6WXBYOt7fsF1vb+y83dt7kdKxYT6ynVsOyrVfWjvXe2rMOLFr5UvlW+VH5Wf1a/V79Vf2duS4v5XceW3NP9c9/BzNxnA=</latexit>

R(s, a)

<latexit sha1_base64="P29ktxwYJQfTrBZxHPb58etxQ20=">AHGnicjVLb9NAEHYLCSW8WjhyWVFSmhU2QUJLpEqEBLiULUofUjZ1FpvNsmqfsm7hlbu/g4u/BUuHECIG+LCv2HW3rSJnVa1lGh3Znbm+btb3Y50La9r+l5Vu3a/U7K3cb9+4/ePhode3xgYjShLJ9GvlRcuQRwXwesn3Jpc+O4oSRwPZoXfyVvsP7FE8CjsybOYDQIyDvmIUyLB5K7V7CbqIhwQOfG87J1yM+JKLHiAY97CdBjJc+HKdgcJN5MbjtIuF96OhDeVthnI9nHIg0gqmur416RkhI/+6ha0zic8PFEDhDGjSb60MJywiRpl+pjSdKilsEtLTlosZM2tw+zQkZMYnjJDpFeJQmjkq21EFJN51PHOFJ5ThZfxjlQdRIpFu9EsKh8/19hIMoboU41MG9Ul2Pm2d5rutYAY2560puQeD4x+3k25jw34aZKC5SDRDyU5WwXEAx9WsiCsLJLVS1FdLnWUP838jnKC5ZOlRH40biC4qU15B085GZ1Wc2XwkJqC/VuZ5zZWi5UlgI6eTh0NxU3UoPZfgGDRSFCyRcu4020FwJ7ZnDU9y9RWFxa3ofAz5EetJQadQkO5X56yFL2BCkXKSB0adrbkxlUmbmwCxVnpqkLtenMuE6wIPTQRjoEoCjB9seVi7P40wtaJskSfR5aoF+iB9PyBXiLpL9gpWy9MS8q9zp5SxIvk7mc9Eus7N3PDPrIHCHwPAYLueUB4e7um5v2vmDqgvHLNYt8+y6q3+AHZoGLJTUJ0L0HTuWg4wklOfqQZOBYsJPSFj1odlSAImBlmOTqEmWIZoFCXwg7uVW2dPZCQ4izwIFLjFGWfNi7y9VM5ej3IeBinkoW0KDRKfSQjpL8TaMgTRqV/BgtCEw5YEZ0QGBMJX5MGkOCUW64uDrY2nRebW3sv17fGDpWrKfWM6tlOdYra9t6b+1a+xatfal9q/2o/ax/rX+v/6r/LkKXl8yZJ9bcU/7H7vgdV8=</latexit>

$s' \sim p(s' \mid s, a)$

$Q_\beta(s, a)$

$\beta$

$a \sim \pi_\theta(\cdot \mid s)$
slide-46
SLIDE 46

Actor-Critic

  • Initialize s, $\theta$ (policy network) and $\beta$ (Q network)
  • Sample action $a \sim \pi_\theta(\cdot \mid s)$
  • For each step:
    • Sample reward $R(s, a)$ and next state $s' \sim p(s' \mid s, a)$
    • Evaluate “actor” using “critic” $Q_\beta(s, a)$ and update policy:
      $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a \mid s)\, Q_\beta(s, a)$
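
The policy update on this slide can be sketched for a tabular softmax policy. Everything concrete here is an assumption made for illustration: the slide uses networks, while this sketch stores $\theta$ as a table of logits `theta[s, a]` and the critic as a table `q_values[s, a]`; the name `actor_step` and the step size are hypothetical. For a softmax policy, the score $\nabla_\theta \log \pi_\theta(a \mid s)$ with respect to the logits of state s is the one-hot vector of a minus $\pi_\theta(\cdot \mid s)$.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

def actor_step(theta, q_values, s, a, alpha=0.1):
    """One update: theta <- theta + alpha * grad log pi(a|s) * Q(s, a).

    theta and q_values are (n_states, n_actions) tables (an assumption;
    the slide uses a policy network and a Q network).
    """
    pi = softmax(theta[s])
    grad_log_pi = -pi              # negation allocates a new array
    grad_log_pi[a] += 1.0          # one-hot(a) - pi
    new_theta = theta.copy()
    new_theta[s] += alpha * grad_log_pi * q_values[s, a]
    return new_theta

theta = np.zeros((2, 3))                           # uniform initial policy
q = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])   # hypothetical critic values
theta = actor_step(theta, q, s=0, a=0)
print(softmax(theta[0]))   # probability of action 0 in state 0 has increased
```

Only the logits of the visited state change, and the size of the change scales with the critic's estimate of the chosen action.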
slide-47
SLIDE 47

Actor-Critic

  • Initialize s, $\theta$ (policy network) and $\beta$ (Q network)
  • Sample action $a \sim \pi_\theta(\cdot \mid s)$
  • For each step:
    • Sample reward $R(s, a)$ and next state $s' \sim p(s' \mid s, a)$
    • Evaluate “actor” using “critic” $Q_\beta(s, a)$ and update policy:
      $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a \mid s)\, Q_\beta(s, a)$
    • Update “critic”:
      • Recall Q-learning
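
As a reminder of the Q-learning rule the critic update refers to, here is a minimal tabular sketch. It is an illustration under assumptions, not the slide's implementation: the critic is a table rather than a network with parameters $\beta$, and the name `critic_step`, the discount `gamma`, and the step size `lr` are hypothetical.

```python
import numpy as np

def critic_step(q, s, a, r, s_next, gamma=0.9, lr=0.5):
    """Tabular Q-learning update for one transition (s, a, r, s_next).

    Moves Q(s, a) toward the target r + gamma * max_a' Q(s_next, a').
    Returns the updated table and the TD error.
    """
    td_target = r + gamma * q[s_next].max()
    td_error = td_target - q[s, a]
    new_q = q.copy()
    new_q[s, a] += lr * td_error
    return new_q, td_error

q = np.zeros((2, 2))
q[1] = [0.0, 2.0]            # hypothetical current estimates for state 1
q, delta = critic_step(q, s=0, a=1, r=1.0, s_next=1)
print(delta, q[0, 1])        # TD error and the updated entry
```

With a Q network, the same TD error would instead drive a gradient step on $\beta$.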
slide-48
SLIDE 48

Actor-Critic

  • Initialize s, $\theta$ (policy network) and $\beta$ (Q network)
  • Sample action $a \sim \pi_\theta(\cdot \mid s)$
  • For each step:
    • Sample reward $R(s, a)$ and next state $s' \sim p(s' \mid s, a)$
    • Evaluate “actor” using “critic” $Q_\beta(s, a)$ and update policy:
      $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a \mid s)\, Q_\beta(s, a)$
    • Update “critic”:
      • Recall Q-learning
      • Update $\beta$ accordingly
    • $a \leftarrow a'$, $s \leftarrow s'$
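
The full loop on this slide can be sketched end to end on a toy problem. Several pieces are assumptions added for illustration and are not on the slide: the two-state MDP in `env_step`, the tabular actor and critic, the hyperparameters, sampling the next action $a'$ from the current policy at $s'$, and a critic target of $r + \gamma Q(s', a')$ (replacing `q[s_next, a_next]` with `q[s_next].max()` gives a Q-learning-style target instead).

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
gamma, alpha, lr = 0.5, 0.1, 0.1   # discount, actor step size, critic step size

def policy(theta, s):
    """Softmax policy pi_theta(. | s) over a row of tabular logits."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def env_step(s, a):
    """Hypothetical toy MDP: action 1 pays +1, action 0 pays -1; next state is uniform."""
    r = 1.0 if a == 1 else -1.0
    return r, int(rng.integers(n_states))

theta = np.zeros((n_states, n_actions))   # actor parameters
q = np.zeros((n_states, n_actions))       # critic, a table standing in for Q_beta

s = 0
a = int(rng.choice(n_actions, p=policy(theta, s)))
for _ in range(2000):
    # Sample reward and next state, then the next action a' (an assumption).
    r, s_next = env_step(s, a)
    a_next = int(rng.choice(n_actions, p=policy(theta, s_next)))

    # Actor: theta <- theta + alpha * grad log pi(a|s) * Q(s, a)
    grad_log_pi = -policy(theta, s)
    grad_log_pi[a] += 1.0
    theta[s] += alpha * grad_log_pi * q[s, a]

    # Critic: move Q(s, a) toward r + gamma * Q(s', a')
    q[s, a] += lr * (r + gamma * q[s_next, a_next] - q[s, a])

    # a <- a', s <- s'
    s, a = s_next, a_next

pi = np.array([policy(theta, state) for state in range(n_states)])
print(pi)   # rows are states; the second column is pi(action 1 | state)
```

In this toy setup the critic learns that action 1 is worth more than action 0 in both states, and the actor shifts probability toward action 1 accordingly.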
slide-49
SLIDE 49

Actor-critic

  • In general, replacing the policy evaluation or the “critic” leads to different flavors of the actor-critic
  • REINFORCE:
    $\nabla_\theta J(\pi_\theta) = \mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, R(s, a)\right]$
  • Q Actor-Critic:
    $\nabla_\theta J(\pi_\theta) = \mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$
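
A quick numerical check of the Q Actor-Critic expression, in the simplest setting where it can be computed exactly: a single state with a softmax policy over three actions. This is a sketch under assumptions; the values in `q`, the function names, and the single-state simplification are all hypothetical. Here $J(\pi_\theta) = \sum_a \pi_\theta(a) Q(a)$, so the score-function expectation should match a finite-difference gradient of J.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def expected_return(theta, q):
    """J(pi_theta) = sum_a pi_theta(a) * Q(a) for a single-state problem."""
    return float(softmax(theta) @ q)

def score_function_gradient(theta, q):
    """E_{a ~ pi}[grad log pi(a) * Q(a)], computed exactly by summing over actions."""
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    for a in range(len(theta)):
        grad_log_pi = -pi.copy()
        grad_log_pi[a] += 1.0              # one-hot(a) - pi
        grad += pi[a] * grad_log_pi * q[a]
    return grad

def finite_difference_gradient(theta, q, eps=1e-6):
    """Central-difference approximation of dJ/dtheta."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (expected_return(theta + step, q)
                   - expected_return(theta - step, q)) / (2 * eps)
    return grad

theta = np.array([0.2, -0.5, 1.0])
q = np.array([1.0, 3.0, 2.0])              # hypothetical critic values Q(a)
g_score = score_function_gradient(theta, q)
g_fd = finite_difference_gradient(theta, q)
print(g_score, g_fd)                       # the two should agree closely
```

Swapping `q[a]` for a sampled reward gives the REINFORCE-style weight; only the quantity multiplying the score changes.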
slide-50
SLIDE 50

Actor-critic

rθJ(πθ) = Ea∼πθ [rθ log πθ(a|s)R(s, a)]

<latexit sha1_base64="st/dtwMZce7i5vgRwutBqN5vog0=">AG83icjVLb9NAEHYLCW14tXDksqKlNCoigsSXCJVICTEoSofUjZ1FpvNsmqfml3Da3c/RtcOIAQV/4MN/4Ns360sd1WtZRodmZ25ptvZmw38rhU/f6/peU7dxvNeyurfsPHj56vLb+5ECGsaBsn4ZeKI5cIpnHA7avuPLYUSQY8V2PHbon74z98AsTkofBUJ1FbOyTWcCnBIFKme9sdpGA4R9ouaum7zXTkIchSX3cQ7mE5CdS4d1e0h6SRq09bGhKJLSw/cuxp7bKpGWMY+eA36+niYhaTESz7rTuGHBZ/N1Rh3Gqjx2s5kyRbiU/ViTOsjiZQ8doLnIshE31RUyIiEkUifAU4akgNLF1sqszSHxg6+PdAp5dh5fwntI9RDKh2pnmY9fGxEzMD71CAzSn0Jtly2iXtD1QZgxPOazCEgrkfyc5mNkuU23NRpgXQiAeqGu0CQk6faWRGWNWk65rMu5prYv5b6RylCSu3qgi8cFZDcV3IG0i6/cgs9mcxXgUJdF/pczPnOqfl2saCSy91h+K7tZqMLP0UBSWCDp9LtoE5VSGEsJT7Z7V7lFnWIfT5BZtJQZdQUO1Xp6yERbAKtvKoHeX8G+cbUJmVhDnJRp6FJ7HBzK5GODTzYPYSBLgk4hnAcYu3w1C/PBWUTIcKvhQbqIV40J9c0t1XTX3BSbTzJ31ROsZoZxTc1+VyWpk32SMGas7bR3+qnD6oLdi5sWPmz56z9hbJp7LNAUY9IObL7kRonRChOPaZbOJYsIvSEzNgIxID4TI6TtCkatUEzQdNQwA+WJtUu3kiIL+WZ74KngSurNqO8yjaK1fTNOFBFCsW0CzRNPaQCpH5AKAJF4wq7wEQgUHrIjOCfRfwWeiBSTY1ZLrwsH2lv1ya/vTq42dtzkdK9Yz67nVsWzrtbVjfbD2rH2LNqLGt8aPxs9m3Pze/NX8nbkuL+V3nlqlp/nPyDZSE=</latexit>

rθJ(πθ) = Ea∼πθ [rθ log πθ(a|s)Qπθ(s, a)]

<latexit sha1_base64="h9wAiZS/Hu8y95WfLcZV9OYpXmU=">AHB3icjVLb9NAEHYLCSW8WjgipBVpASiyi5IcIlUgZAQh6pF6UPKJtZ6s0lW9Uu7a2jl7o0Lf4ULBxDiyl/gxr9h1nbaxG6rWko0OzM7803M7YX+1wq2/63tHzjZq1+a+V2487de/cfrK493JdRIijbo5EfiUOPSObzkO0prnx2GAtGAs9nB97RW2M/+MSE5FHYUycxGwRkEvIxp0SByl2rPWmiLsIBUVPS9pNyWuwpIHOYtTEeROpWuaneQdFP13NHGhOJzSwfc2xr7bKz6WCYBeHVtPezlISnx04+6NfPDgk+maoAwbjTRhxZWU6ZIu5QfK5LkWdzcoWU0Zznmwmb6WUyIiEkci+gY4bEgNHV0uq1zSLzr6OH2DJ5ThZfyjtIdRHKh3WjmYfPDYiJuB9bJAZpT4Hu1i2iXtF1QZgzIuazCEknk+K8yIbC5brcFOlBdJBIB6qcrQzCAV9pE5YWTrmpy73KukflvZHOUJSzdKiPwo0kFxWUhryDp+iMz35/5eCUk0H2lT82c64KWSxsLp3MHYqbdbdSQxl+gQaSwgJ126j52ghbEs4Ml37yK3uDXbx4CPkJk0VBo1xY5V9npIBRtBKy/qQdGfbrExlUmZm4NC1Flokrjc3Eql6wAPTgdhoEsCjh4ce1i7PMrckHZRIjo80wD9RA/npJLmtuo6M84KTeFG8qd7aOcVXNflUtsvc7A7nJh3a2yEwOjmT7uq6vWFnD6oKTiGsW8Wz467+BSpoErBQUZ9I2XfsWA1SIhSnPtMNnEgWE3pEJqwPYkgCJgdpBkajJmhGaBwJ+MEiZdr5GykJpDwJPA0JMiyzSgvsvUTNX49SHkYJ4qFNE80TnykImQ+CmjEBaPKPwGBUMEBK6JTAjOh4NPRABKcslVYX9zw3mxsbn7cn3rTUHivXYemq1LMd6ZW1Z760da8+itS+1b7UftZ/1r/Xv9V/137nr8lJx5G18NT/Ae3vG6B</latexit>
  • In general, replacing the policy evaluation or the “critic” leads to

different flavors of the actor-critic

  • REINFORCE:
  • Q Actor-Critic:
  • Advantage Actor-Critic:

$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s, a)\right]$

where $A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$
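A short derivation may help show why subtracting $V^{\pi_\theta}(s)$ leaves the gradient unchanged in expectation. It is a sketch assuming a discrete action space (for continuous actions the sum becomes an integral) and uses only the fact that the action probabilities sum to one:

```latex
\begin{aligned}
\mathbb{E}_{a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, V^{\pi_\theta}(s)\right]
  &= V^{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) \\
  &= V^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s) \\
  &= V^{\pi_\theta}(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
   = V^{\pi_\theta}(s)\, \nabla_\theta 1 = 0
\end{aligned}
```

So the Q-weighted and advantage-weighted expressions have the same expectation. The baseline term only affects the variance of the sampled estimate.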

“How much better is an action than expected?”
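The difference between weighting by Q and weighting by the advantage can be checked numerically. The sketch below is illustrative only: it assumes a single state with three actions and a softmax policy over logits, and the logits and Q-values are hypothetical numbers, not taken from the lecture. It computes the exact expected gradient with respect to the logits under both weightings, along with the variance of the single-sample estimator.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score(probs, a):
    # Gradient of log pi(a|s) w.r.t. the logits of a softmax policy:
    # onehot(a) - probs
    return [(1.0 if i == a else 0.0) - p for i, p in enumerate(probs)]

def expected_gradient(probs, weights):
    # Exact E_{a ~ pi}[ grad log pi(a|s) * weights[a] ], summing over all actions
    grad = [0.0] * len(probs)
    for a, p_a in enumerate(probs):
        s = score(probs, a)
        for i in range(len(probs)):
            grad[i] += p_a * s[i] * weights[a]
    return grad

def estimator_variance(probs, weights):
    # Total variance (summed over components) of the one-sample estimator
    mean = expected_gradient(probs, weights)
    var = 0.0
    for a, p_a in enumerate(probs):
        s = score(probs, a)
        for i in range(len(probs)):
            var += p_a * (s[i] * weights[a] - mean[i]) ** 2
    return var

logits = [0.5, 0.0, -0.5]      # hypothetical policy parameters for one state
q_values = [10.0, 11.0, 9.0]   # hypothetical critic estimates Q(s, a)

probs = softmax(logits)
v = sum(p * q for p, q in zip(probs, q_values))   # V(s) = E_a[Q(s, a)]
advantages = [q - v for q in q_values]            # A(s, a) = Q(s, a) - V(s)

grad_q = expected_gradient(probs, q_values)       # Q actor-critic weighting
grad_a = expected_gradient(probs, advantages)     # advantage actor-critic weighting
var_q = estimator_variance(probs, q_values)
var_a = estimator_variance(probs, advantages)

print("expected gradient (Q):", grad_q)
print("expected gradient (A):", grad_a)
print("estimator variance (Q):", var_q)
print("estimator variance (A):", var_a)
```

With these numbers the two expected gradients agree up to floating-point error, while the advantage-weighted estimator has a much smaller variance, because the large common offset in the Q-values is removed by subtracting V(s).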

slide-51
SLIDE 51

51

  • Policy Learning:
  • Policy gradients
  • REINFORCE
  • Reducing Variance (Homework!)
  • Actor-Critic:
  • Other ways of performing “policy evaluation”
  • Variants of Actor-Critic

Summary