SLIDE 1

Reinforcement Learning

George Konidaris gdk@cs.brown.edu

Fall 2019

SLIDE 2

Machine Learning

Subfield of AI concerned with learning from data. Broadly, using:

  • Experience
  • To Improve Performance
  • On Some Task

(Tom Mitchell, 1997)

SLIDE 3

vs …

ML vs. Statistics vs. Data Mining

SLIDE 4

Why?

Developing effective learning methods has proved difficult. Why bother?

Autonomous discovery

  • We don’t know something, want to find out.

Hard to program

  • Easier to specify the task and collect data.

Adaptive behavior

  • Our agents should adapt to new data, unforeseen circumstances.

SLIDE 5

Types of Machine Learning

Depends on the feedback available:

Labeled data:

  • Supervised learning.

No feedback, just data:

  • Unsupervised learning.

Sequential data, weak labels:

  • Reinforcement learning.
SLIDE 6

Supervised Learning

Input: training data consisting of inputs X = {x1, …, xn} and labels Y = {y1, …, yn}.

Learn to predict new labels. Given x: y?

SLIDE 7

Unsupervised Learning

Input: X = {x1, …, xn} (inputs only, no labels).

Try to understand the structure of the data. E.g., how many types of cars are there? How can they vary?

SLIDE 8

Reinforcement Learning

Learning counterpart of planning: find a policy π : S → A that maximizes the discounted return,

max_π R = Σ_{t=0}^∞ γ^t r_t

SLIDE 9

MDPs

Agent interacts with an environment. At each time t:

  • Receives sensor signal s_t
  • Executes action a_t
  • Transition:
  • new sensor signal s_{t+1}
  • reward r_t

Goal: find policy π that maximizes expected return (sum of discounted future rewards):

max_π E[R] = E[Σ_{t=0}^∞ γ^t r_t]

SLIDE 10

Markov Decision Processes

An MDP is a tuple ⟨S, A, γ, R, T⟩:

  • S: set of states.
  • A: set of actions.
  • γ: discount factor.
  • R: reward function; R(s, a, s′) is the reward received for taking action a in state s and transitioning to state s′.
  • T: transition function; T(s′ | s, a) is the probability of transitioning to state s′ after taking action a in state s.
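To make the tuple concrete, here is a minimal sketch of a two-state MDP written out in Python; the states, actions, and numbers are illustrative, not taken from the slides.

```python
# A tiny MDP <S, A, gamma, R, T>, written out explicitly.
# All states, actions, and values here are illustrative.
S = ["s0", "s1"]
A = ["stay", "go"]
gamma = 0.9

# T[(s, a)][s_next] = T(s' | s, a): probability of landing in s_next.
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s1": 0.8, "s0": 0.2},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 1.0},
}

# R[(s, a, s_next)] = R(s, a, s'): reward for that transition.
R = {
    ("s0", "go", "s1"): 1.0,  # reaching s1 is rewarded
}

def reward(s, a, s_next):
    """Look up R(s, a, s'); unlisted transitions earn zero."""
    return R.get((s, a, s_next), 0.0)
```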

SLIDE 11

RL vs Planning

In planning:

  • Transition function (T) known.
  • Reward function (R) known.
  • Computation “offline”.

In reinforcement learning:

  • One or both of T, R unknown.
  • Action in the world only source of data.
  • Transitions are executed, not simulated.
SLIDE 12

Reinforcement Learning

SLIDE 13

RL

This formulation is general enough to encompass a wide variety of learned control problems.

SLIDE 14

MDPs

As before, our target is a policy:

π : S → A

A policy maps states to actions. The optimal policy maximizes the return from every state:

max_π E[R(s)] = E[Σ_{t=0}^∞ γ^t r_t | s_0 = s], ∀s

This means that we wish to find a policy that maximizes the return from every state.
SLIDE 15

Planning via Policy Iteration

In planning, we used policy iteration to find an optimal policy:

  • 1. Start with a policy π
  • 2. Estimate V^π
  • 3. Improve π:
  •    a. π(s) = argmax_a E[r + γV^π(s′)], ∀s

Repeat.

More precisely, we use a value function:

V^π(s) = E[Σ_{i=0}^∞ γ^i r_i]

… then we would update π by computing:

π(s) = argmax_a Σ_{s′} T(s, a, s′) [r(s, a, s′) + γV^π(s′)]

We can’t do this anymore: the update requires T and R, which are unknown in RL.

SLIDE 16

Value Functions

For learning, we use a state-action value function:

Q^π(s, a) = E[Σ_{i=0}^∞ γ^i r_i | s_0 = s, a_0 = a]

This is the value of executing a in state s, then following π. Note that V^π(s) = Q^π(s, π(s)).

SLIDE 17

Policy Iteration

This leads to a general policy improvement framework:

  • 1. Start with a policy π
  • 2. Learn Q^π
  • 3. Improve π:
  •    a. π(s) = argmax_a Q(s, a), ∀s

Repeat.

Steps 2 and 3 can be interleaved as rapidly as you like. Usually, step 3a is performed every time step.
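Step 3a is just an argmax over the learned values. A minimal sketch, assuming Q is stored as a nested dict mapping states to per-action values (an illustrative layout, not the slides’ notation):

```python
def improve_policy(Q):
    """Greedy improvement: pi(s) = argmax_a Q(s, a) for every state."""
    return {s: max(action_values, key=action_values.get)
            for s, action_values in Q.items()}

# Example: the improved policy picks "go" in s0 and "stay" in s1.
Q = {"s0": {"stay": 0.1, "go": 0.5}, "s1": {"stay": 0.3, "go": 0.2}}
pi = improve_policy(Q)   # {"s0": "go", "s1": "stay"}
```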

SLIDE 18

Value Function Learning

Learning proceeds by gathering samples of Q(s, a). Methods differ by:

  • How you get the samples.
  • How you use them to update Q.

SLIDE 19

Monte Carlo

Simplest thing you can do: sample the return R(s). Do this repeatedly and average the values:

Q(s, a) = (R_1(s) + R_2(s) + … + R_n(s)) / n
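In practice the average is usually maintained incrementally rather than by storing every sampled return. A minimal sketch, with illustrative names:

```python
from collections import defaultdict

Q = defaultdict(float)      # running average of sampled returns per (s, a)
counts = defaultdict(int)   # number of returns observed per (s, a)

def mc_update(s, a, sampled_return):
    """Fold one sampled return into the running average for (s, a).

    Q <- Q + (R - Q) / n is algebraically the mean of all n samples."""
    counts[(s, a)] += 1
    Q[(s, a)] += (sampled_return - Q[(s, a)]) / counts[(s, a)]
```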

SLIDE 20

Temporal Difference Learning

Where can we get more (immediate) samples? Idea: use the Bellman equation, which relates the value of this state to the reward plus the discounted value of the next state:

Q^π(s, a) = E_{s′}[r(s, a, s′) + γQ^π(s′, π(s′))]
SLIDE 21

TD Learning

Ideally, and in expectation:

r_t + γQ(s_{t+1}, a_{t+1}) − Q(s_t, a_t) = 0

Q is correct if this holds in expectation for all states. When it does not, the difference is the temporal difference (TD) error, and we move Q(s_t, a_t) toward the sampled target:

Q(s_t, a_t) ← r_t + γQ(s_{t+1}, a_{t+1})

SLIDE 22

Sarsa

Sarsa: a very simple algorithm.

  • 1. Initialize Q[s][a] = 0
  • 2. For n episodes
  • observe state s
  • select a = argmax_a Q[s][a]
  • observe transition (s, a, r, s′, a′)
  • compute TD error δ = r + γQ(s′, a′) − Q(s, a)
  • update Q: Q(s, a) = Q(s, a) + αδ
  • if not end of episode, repeat

The γQ(s′, a′) term is zero by definition if s′ is absorbing.
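Translated into Python, the loop is only a few lines. This is a sketch under assumptions the slides don’t specify: a tabular Q stored as a flat dict keyed by (s, a), and an environment with reset(), step(a) returning (s′, r, done), and actions(s).

```python
from collections import defaultdict

def sarsa(env, n_episodes, alpha=0.1, gamma=0.99):
    """Tabular Sarsa following the pseudocode above (env interface assumed)."""
    Q = defaultdict(float)                     # 1. Q[s][a] = 0 for all (s, a)

    def greedy(s):
        return max(env.actions(s), key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):                # 2. for n episodes
        s = env.reset()                        # observe state s
        a = greedy(s)                          # select a = argmax_a Q[s][a]
        done = False
        while not done:
            s_next, r, done = env.step(a)      # observe transition
            a_next = None if done else greedy(s_next)
            bootstrap = 0.0 if done else gamma * Q[(s_next, a_next)]
            delta = r + bootstrap - Q[(s, a)]  # TD error (bootstrap is zero
            Q[(s, a)] += alpha * delta         #   by def. if s' is absorbing)
            s, a = s_next, a_next              # if not end of episode, repeat
    return Q
```

With the ε-greedy selection from Slide 26 substituted for the pure argmax, this becomes the exploring version of the algorithm.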

SLIDE 23

Sarsa

SLIDE 24

Sarsa

SLIDE 25

Exploration vs. Exploitation

Always take max_a Q(s, a)?

  • Exploit current knowledge.

What if your current knowledge is wrong? How are you going to find out?

  • Explore to gain new knowledge.

Exploration is mandatory if you want to find the optimal solution, but every exploratory action may sacrifice reward.

Exploration vs. exploitation (when to try new things?) is a consistent theme of RL.

SLIDE 26

Exploration vs. Exploitation

How to balance? Simplest, most popular approach: instead of always being greedy (max_a Q(s, a)), explore with probability ε:

  • max_a Q(s, a) with probability (1 − ε).
  • random action with probability ε.

ε-greedy exploration (ε ≈ 0.1):

  • Very simple.
  • Ensures asymptotic coverage of the state space.
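A sketch of the selection rule (the function name and Q layout are illustrative); dropping it into the Sarsa loop above in place of the pure argmax gives the exploring algorithm:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Random action with probability epsilon, else argmax_a Q(s, a)."""
    if random.random() < epsilon:
        return random.choice(list(actions))       # explore
    return max(actions, key=lambda a: Q[(s, a)])  # exploit
```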

SLIDE 27

TD vs. MC

TD and MC are two extremes of obtaining samples of Q:

  • TD backs up after a single step: the sample is r + γV, one observed reward plus the current value estimate of the next state.
  • MC waits until the episode ends at t = L: the sample is Σ_i γ^i r_i, the full discounted return with no bootstrapping.

(Figure: backup diagrams for each, over time steps t = 1, 2, 3, 4, …, L.)

SLIDE 28

Generalizing TD

We can generalize this to the idea of an n-step rollout:

R^(1)_t = r_t + γQ(s_{t+1}, a_{t+1})

R^(2)_t = r_t + γr_{t+1} + γ²Q(s_{t+2}, a_{t+2})

…

R^(n)_t = Σ_{i=0}^{n−1} γ^i r_{t+i} + γ^n Q(s_{t+n}, a_{t+n})

Each tells us something about the value function.

  • We can combine all n-step rollouts.
  • This is known as a complex backup.
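A sketch of computing R^(n)_t from a recorded trajectory; rewards[i] holds r_{t+i} and q_bootstrap stands in for Q(s_{t+n}, a_{t+n}) (illustrative names):

```python
def n_step_return(rewards, q_bootstrap, gamma, n):
    """R^(n)_t = sum_{i=0}^{n-1} gamma^i * r_{t+i} + gamma^n * Q(s_{t+n}, a_{t+n})."""
    discounted_rewards = sum(gamma**i * rewards[i] for i in range(n))
    return discounted_rewards + gamma**n * q_bootstrap
```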

SLIDE 29

TD(λ)

Take a weighted sum of the n-step returns, with weights 1, λ, λ², …, λ^{n−1}, …:

R^(1)_t = r_t + γQ(s_{t+1}, a_{t+1})

R^(2)_t = r_t + γr_{t+1} + γ²Q(s_{t+2}, a_{t+2})

…

R^(n)_t = Σ_{i=0}^{n−1} γ^i r_{t+i} + γ^n Q(s_{t+n}, a_{t+n})

Estimator:

R^λ_t = (1 − λ) Σ_{n=1}^∞ λ^{n−1} R^(n)_t
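Given the n-step returns for one time step, this weighted combination is straightforward to compute. A sketch over a finite episode, where the last entry is the full Monte Carlo return and absorbs the leftover weight λ^{L−1} so that all weights sum to one:

```python
def lambda_return(n_step_returns, lam):
    """Weight R^(n) by (1 - lam) * lam**(n - 1); the final (Monte Carlo)
    return takes the remaining weight lam**(L - 1)."""
    L = len(n_step_returns)
    weighted = sum((1 - lam) * lam**(n - 1) * n_step_returns[n - 1]
                   for n in range(1, L))
    return weighted + lam**(L - 1) * n_step_returns[-1]
```

At lam = 0 only R^(1) survives; at lam = 1 only the final return does, matching the next slide.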
SLIDE 30

Sarsa(λ)

This is called the λ-return.

  • At λ = 0 we get Sarsa; at λ = 1 we get MC.
  • Intermediate values of λ are usually best.
  • This yields the TD(λ) family of algorithms.
SLIDE 31

Sarsa(λ): Implementation

Each (s, a) pair has an eligibility trace e(s, a).

At time t:

  • e(s_t, a_t) = 1
  • e(s, a) = γλe(s, a), for all other (s, a) pairs.

At end of episode: e(s, a) = 0, for all (s, a) pairs.

When updating:

  • Compute the TD error δ as before.
  • Q(s, a) = Q(s, a) + αδe(s, a), for each (s, a) pair.
SLIDE 32

Sarsa(λ): Implementation

  • 1. Initialize Q[s][a] = 0, for all (s, a)
  • 2. Initialize e[s][a] = 0, for all (s, a)
  • 3. For n episodes
  • observe state s_t
  • select a_t = argmax_a Q(s_t, a)
  • observe transition (s_t, a_t, r_t, s_{t+1}, a_{t+1})
  • compute TD error δ = r_t + γQ(s_{t+1}, a_{t+1}) − Q(s_t, a_t)
  • set e(s_t, a_t) = 1; decay all other e(s, a) = γλe(s, a) (don’t forget!)
  • update Q: Q(s, a) = Q(s, a) + αδe(s, a), for all (s, a)
  • if not end of episode, repeat
  • if end of episode, reset e[s][a] = 0 for all (s, a)

The γQ(s_{t+1}, a_{t+1}) term is zero by definition if s_{t+1} is absorbing.
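A sketch of one step of this update, matching the pseudocode above; the flat dict of traces and the argument list are assumptions, not the slides’ notation:

```python
def sarsa_lambda_step(Q, e, s, a, r, s_next, a_next, done, alpha, gamma, lam):
    """One Sarsa(lambda) step: compute delta, refresh traces, update all pairs."""
    bootstrap = 0.0 if done else gamma * Q[(s_next, a_next)]  # zero if absorbing
    delta = r + bootstrap - Q[(s, a)]      # TD error, as before
    for key in e:                          # decay every trace -- don't forget!
        e[key] *= gamma * lam
    e[(s, a)] = 1.0                        # the visited pair gets a full trace
    for key, trace in e.items():
        Q[key] += alpha * delta * trace    # update each (s, a) by its trace
    if done:
        for key in e:                      # reset traces between episodes
            e[key] = 0.0
```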
SLIDE 33

Sarsa(λ)

SLIDE 34

Sarsa(λ)

SLIDE 35

Next Week: More Realism