Eligibility Traces
Chapter 12
Eligibility traces are:
- Another way of interpolating between MC and TD methods
- A way of implementing compound λ-return targets
- A basic mechanistic idea: a short-term, fading memory
- A new style of algorithm development/analysis: the forward-view ⇔ backward-view transformation
  - Forward view: conceptually simple; good for theory and intuition
  - Backward view: computationally congenial implementation of the forward view
Unified View
(Figure: the space of methods arranged by backup width (sample backups vs. full backups) and backup height/depth (one-step bootstrapping vs. full returns); temporal-difference learning, dynamic programming, Monte Carlo, and exhaustive search occupy the corners.)
Multi-step bootstrapping
Recall n-step targets
For example, in the episodic case, with linear function approximation:

2-step target:
$$G^{(2)}_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 \theta_{t+1}^\top \phi_{t+2}$$

n-step target:
$$G^{(n)}_t \doteq R_{t+1} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \theta_{t+n-1}^\top \phi_{t+n}$$

with everything after termination taken as zero, and the n-step return becoming the full return near the end of the episode ($G^{(n)}_t \doteq G_t$ if $t + n \geq T$).
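A minimal NumPy sketch (not from the slides; the function name and arguments are illustrative) of how such an n-step target could be computed in the linear case:

```python
import numpy as np

def n_step_target(rewards, phi_tn, theta, gamma, n):
    """n-step bootstrapped target G_t^(n) with linear function approximation.

    rewards : the n rewards observed after time t, [R_{t+1}, ..., R_{t+n}]
    phi_tn  : feature vector of the state reached n steps later, phi_{t+n}
    theta   : weights used to bootstrap (theta_{t+n-1} in the slides)
    """
    g = sum((gamma ** k) * r for k, r in enumerate(rewards[:n]))  # discounted reward sum
    g += (gamma ** n) * theta.dot(phi_tn)                         # bootstrap from v-hat
    return g
```

If fewer than n steps remain before termination, one would instead return the full Monte Carlo return, matching the convention above.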
Any set of update targets can be averaged to produce new compound update targets
- For example, half a 2-step plus half a 4-step
- Called a compound backup
- Draw each component
- Label with the weights for that component

A compound backup:
$$U_t = \tfrac{1}{2} G^{(2)}_t + \tfrac{1}{2} G^{(4)}_t$$
The λ-return is a compound update target
The λ-return is a target that averages all n-step targets, each weighted by $\lambda^{n-1}$:
$$G^{\lambda}_t \doteq (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G^{(n)}_t$$
(Backup diagram for TD(λ), the λ-return: the 1-step, 2-step, 3-step, ... components get weights $1-\lambda$, $(1-\lambda)\lambda$, $(1-\lambda)\lambda^2, \ldots$, and the final full-return component gets weight $\lambda^{T-t-1}$.)
λ-return Weighting Function
Until termination the n-step returns are weighted by $(1-\lambda)\lambda^{n-1}$; after termination all remaining weight goes to the actual return:
$$G^{\lambda}_t = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G^{(n)}_t + \lambda^{T-t-1} G_t$$
(Figure: the weight given to each n-step return as a function of time between $t$ and $T$; the weight given to the 3-step return is $(1-\lambda)\lambda^2$, weights decay by $\lambda$ per step, the weight given to the actual, final return is $\lambda^{T-t-1}$, and the total area is 1.)
Relation to TD(0) and MC
The λ-return can be rewritten as:
$$G^{\lambda}_t = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G^{(n)}_t + \lambda^{T-t-1} G_t$$
If λ = 1, you get the MC target:
$$G^{\lambda}_t = (1 - 1) \sum_{n=1}^{T-t-1} 1^{n-1} G^{(n)}_t + 1^{T-t-1} G_t = G_t$$
If λ = 0, you get the TD(0) target:
$$G^{\lambda}_t = (1 - 0) \sum_{n=1}^{T-t-1} 0^{n-1} G^{(n)}_t + 0^{T-t-1} G_t = G^{(1)}_t$$
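A small Python sketch (illustrative names, not from the slides) that builds the episodic λ-return from a state's n-step returns; the λ = 0 and λ = 1 limits above fall out directly:

```python
def lambda_return(n_step_returns, full_return, lam):
    """Episodic lambda-return G_t^lambda for one state.

    n_step_returns : [G_t^(1), ..., G_t^(T-t-1)], the bootstrapped n-step targets
    full_return    : G_t, the actual return through termination
    """
    g = 0.0
    for n, g_n in enumerate(n_step_returns, start=1):
        g += (1.0 - lam) * (lam ** (n - 1)) * g_n    # weight (1 - lambda) * lambda^(n-1)
    g += (lam ** len(n_step_returns)) * full_return  # remaining weight lambda^(T-t-1)
    return g

# lam = 0.0 returns n_step_returns[0], the TD(0) target; lam = 1.0 returns full_return, the MC target.
```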
The off-line λ-return “algorithm”
- Wait until the end of the episode (offline)
- Then go back over the time steps, updating:
$$\theta_{t+1} \doteq \theta_t + \alpha \left[ G^{\lambda}_t - \hat v(S_t, \theta_t) \right] \nabla \hat v(S_t, \theta_t), \qquad t = 0, \ldots, T-1$$
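A minimal sketch of that post-episode sweep for the linear case (assuming NumPy arrays; the helper name and arguments are illustrative, and the λ-returns are assumed to be precomputed, e.g. with `lambda_return` above):

```python
import numpy as np

def offline_lambda_return_updates(theta, alpha, lambda_returns, features):
    """Apply the off-line lambda-return updates after an episode ends (linear v-hat).

    lambda_returns : [G_0^lambda, ..., G_{T-1}^lambda]
    features       : [phi_0, ..., phi_{T-1}], feature vectors of the visited states
    """
    theta = np.array(theta, dtype=float, copy=True)
    for g_lam, phi in zip(lambda_returns, features):
        # semi-gradient update toward the lambda-return target
        theta += alpha * (g_lam - theta.dot(phi)) * phi
    return theta
```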
The λ-return algorithm performs similarly to the n-step algorithms.
(Figure: RMS error at the end of the first 10 episodes as a function of α; left panel, n-step TD methods from Chapter 7 with n = 1, 2, 4, 8, 16, 32, 64, 128, 256, 512; right panel, the off-line λ-return algorithm with λ = 0, .4, .8, .9, .95, .975, .99, 1.)
- Intermediate λ is best (just like intermediate n is best)
- The λ-return is slightly better than n-step returns
The forward view looks forward from the state being updated to future states and rewards
(Figure: the forward view; from the state $S_t$ being updated we look forward in time to the future states $S_{t+1}, S_{t+2}, S_{t+3}, \ldots$ and rewards $R_{t+1}, R_{t+2}, R_{t+3}, \ldots, R_T$.)
The backward view looks back to the recently visited states (marked by eligibility traces)
- Shout the TD error backwards
- The traces fade with temporal distance by γλ
(Figure: the backward view; at each step the TD error is sent back to the recently visited states $S_{t-1}, S_{t-2}, S_{t-3}, \ldots$, each marked with an eligibility trace $e_t$ that fades with temporal distance.)
Demo
Here we are marking state-action pairs with a replacing eligibility trace
Eligibility traces (mechanism)
- The forward view was for theory
- The backward view is for mechanism
- New memory vector called the eligibility trace
- On each step, decay each component by γλ and increment the trace for the current state by 1
- Accumulating trace
$$e_t \in \mathbb{R}^n, \; e_t \geq 0, \qquad e_0 \doteq 0, \qquad e_t \doteq \nabla \hat v(S_t, \theta_t) + \gamma\lambda e_{t-1}$$
(the trace has the same shape as θ)
(Figure: an accumulating eligibility trace rising at the times of visits to a state and fading in between.)
The Semi-gradient TD(λ) algorithm
$$\theta_{t+1} \doteq \theta_t + \alpha \delta_t e_t$$
$$\delta_t \doteq R_{t+1} + \gamma \hat v(S_{t+1}, \theta_t) - \hat v(S_t, \theta_t)$$
$$e_0 \doteq 0, \qquad e_t \doteq \nabla \hat v(S_t, \theta_t) + \gamma\lambda e_{t-1}$$
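A per-transition sketch of semi-gradient TD(λ) for the linear case (illustrative function name and signature; `e` is reset to zeros at the start of each episode):

```python
import numpy as np

def td_lambda_step(theta, e, phi_t, phi_next, reward, alpha, gamma, lam, terminal=False):
    """One step of semi-gradient TD(lambda) with an accumulating trace and linear v-hat."""
    e = gamma * lam * e + phi_t                         # fade the trace, add grad v-hat = phi_t
    v_next = 0.0 if terminal else theta.dot(phi_next)   # value of S_{t+1}, zero at termination
    delta = reward + gamma * v_next - theta.dot(phi_t)  # TD error
    theta = theta + alpha * delta * e                   # shout delta back along the trace
    return theta, e
```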
TD(λ) performs similarly to the off-line λ-return algorithm, but slightly worse, particularly at high α.
(Figure: RMS error at the end of the first 10 episodes as a function of α; left panel, the off-line λ-return algorithm from the previous section; right panel, TD(λ); curves for λ = 0, .4, .8, .9, .95, .975, .99, 1.)
Can we do better? Can we update online?
The online λ-return algorithm performs best of all.
(Figure 12.7: RMS error at the end of the first 10 episodes on the tabular 19-state random walk task, as a function of α; left panel, the off-line λ-return algorithm; right panel, the on-line λ-return algorithm, which is identical to true online TD(λ); curves for λ = 0, .4, .8, .9, .95, .975, .99, 1.)
The online λ-return algorithm uses a truncated λ-return as its target:
$$G^{\lambda|h}_t \doteq (1 - \lambda) \sum_{n=1}^{h-t-1} \lambda^{n-1} G^{(n)}_t + \lambda^{h-t-1} G^{(h-t)}_t, \qquad 0 \leq t < h \leq T$$
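A direct translation of that definition into code (an illustrative helper, not from the slides; the n-step returns are assumed to be available up to the horizon):

```python
def truncated_lambda_return(n_step_returns, lam, t, h):
    """Truncated lambda-return G_t^(lambda|h), bootstrapping no further than horizon h.

    n_step_returns : [G_t^(1), G_t^(2), ...] with at least h - t entries
    """
    k = h - t                                          # number of steps to the horizon
    g = 0.0
    for n in range(1, k):                              # n = 1, ..., h - t - 1
        g += (1.0 - lam) * (lam ** (n - 1)) * n_step_returns[n - 1]
    g += (lam ** (k - 1)) * n_step_returns[k - 1]      # final term uses the (h - t)-step return
    return g
```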
(Figure: the forward view truncated at the horizon, here h = t + 3; from $S_t$ we look forward only to $S_{t+1}, S_{t+2}, S_{t+3}$ and the rewards $R_{t+1}, R_{t+2}, R_{t+3}$.)
$$\theta^h_{t+1} \doteq \theta^h_t + \alpha \left[ G^{\lambda|h}_t - \hat v(S_t, \theta^h_t) \right] \nabla \hat v(S_t, \theta^h_t)$$
There is a separate θ sequence for each horizon h!
The online λ-return algorithm
$$\theta^h_{t+1} \doteq \theta^h_t + \alpha \left[ G^{\lambda|h}_t - \hat v(S_t, \theta^h_t) \right] \nabla \hat v(S_t, \theta^h_t)$$
There is a separate θ sequence for each horizon h!
$$\begin{array}{lllll}
\theta_0 & & & & \\
\theta^1_0 & \theta^1_1 & & & \\
\theta^2_0 & \theta^2_1 & \theta^2_2 & & \\
\theta^3_0 & \theta^3_1 & \theta^3_2 & \theta^3_3 & \\
\;\vdots & \;\vdots & \;\vdots & \;\vdots & \ddots \\
\theta^T_0 & \theta^T_1 & \theta^T_2 & \theta^T_3 & \cdots \;\; \theta^T_T
\end{array}$$
$$h = 1: \quad \theta^1_1 \doteq \theta^1_0 + \alpha \left[ G^{\lambda|1}_0 - \hat v(S_0, \theta^1_0) \right] \nabla \hat v(S_0, \theta^1_0)$$
$$h = 2: \quad \theta^2_1 \doteq \theta^2_0 + \alpha \left[ G^{\lambda|2}_0 - \hat v(S_0, \theta^2_0) \right] \nabla \hat v(S_0, \theta^2_0),$$
$$\qquad\quad\; \theta^2_2 \doteq \theta^2_1 + \alpha \left[ G^{\lambda|2}_1 - \hat v(S_1, \theta^2_1) \right] \nabla \hat v(S_1, \theta^2_1)$$
$$h = 3: \quad \theta^3_1 \doteq \theta^3_0 + \alpha \left[ G^{\lambda|3}_0 - \hat v(S_0, \theta^3_0) \right] \nabla \hat v(S_0, \theta^3_0),$$
$$\qquad\quad\; \theta^3_2 \doteq \theta^3_1 + \alpha \left[ G^{\lambda|3}_1 - \hat v(S_1, \theta^3_1) \right] \nabla \hat v(S_1, \theta^3_1),$$
$$\qquad\quad\; \theta^3_3 \doteq \theta^3_2 + \alpha \left[ G^{\lambda|3}_2 - \hat v(S_2, \theta^3_2) \right] \nabla \hat v(S_2, \theta^3_2)$$
…
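To make the cost concrete, here is a rough sketch (illustrative names, not from the slides; the truncated returns are assumed to come from a helper such as `truncated_lambda_return` above) of the pass the online λ-return algorithm performs at one horizon h. It redoes such a pass at every step, so its per-episode computation grows with episode length:

```python
def online_lambda_return_pass(theta_init, alpha, features, targets):
    """Redo all updates up to the current horizon h (linear v-hat).

    theta_init : weights from the start of the episode (theta_0^h for every h)
    features   : [phi_0, ..., phi_{h-1}]
    targets    : [G_0^(lambda|h), ..., G_{h-1}^(lambda|h)], truncated lambda-returns
    """
    theta = theta_init.copy()
    for phi, g in zip(features, targets):
        theta = theta + alpha * (g - theta.dot(phi)) * phi
    return theta                       # this is theta_h^h, the weights used at time h
```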
True online TD(λ) computes just the diagonal, cheaply (for linear FA)
True online TD(λ)
$$\theta_{t+1} \doteq \theta_t + \alpha \delta_t e_t + \alpha \left( \theta_t^\top \phi_t - \theta_{t-1}^\top \phi_t \right) (e_t - \phi_t)$$
$$e_t \doteq \gamma\lambda e_{t-1} + \left( 1 - \alpha\gamma\lambda\, e_{t-1}^\top \phi_t \right) \phi_t \qquad \text{(dutch trace)}$$
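A per-transition sketch of those equations (illustrative names; `v_old` carries $\theta_{t-1}^\top \phi_t$ forward and is 0 at the start of an episode, and `e` starts as zeros):

```python
import numpy as np

def true_online_td_lambda_step(theta, e, v_old, phi, phi_next, reward,
                               alpha, gamma, lam, terminal=False):
    """One step of true online TD(lambda) with a dutch trace (linear v-hat)."""
    v = theta.dot(phi)                                   # v-hat(S_t, theta_t)
    v_next = 0.0 if terminal else theta.dot(phi_next)    # v-hat(S_{t+1}, theta_t)
    delta = reward + gamma * v_next - v                  # TD error
    e = gamma * lam * e + (1.0 - alpha * gamma * lam * e.dot(phi)) * phi   # dutch trace
    theta = theta + alpha * delta * e + alpha * (v - v_old) * (e - phi)    # extra correction term
    return theta, e, v_next                              # v_next becomes v_old on the next step
```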
Accumulating, Dutch, and Replacing Traces
All traces fade the same way, but they increment differently!
(Figure: accumulating traces, dutch traces (α = 0.5), and replacing traces for the same sequence of visit times to a state.)
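The three increment rules, as a small sketch (an illustrative function, not from the slides; the replacing rule assumes binary features):

```python
import numpy as np

def update_trace(e, phi, alpha, gamma, lam, kind="accumulating"):
    """Fade the trace by gamma*lambda, then apply one of the three increments."""
    e = gamma * lam * e                                   # all traces fade the same way
    if kind == "accumulating":
        e = e + phi                                       # add the feature vector
    elif kind == "dutch":
        e = e + (1.0 - alpha * e.dot(phi)) * phi          # step-size-tempered increment
    elif kind == "replacing":
        e = np.where(phi > 0, phi, e)                     # reset visited components to 1
    return e
```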
The simplest example of deriving a backward view from a forward view
- Monte Carlo learning of a final target
- Will derive dutch traces
- Showing that dutch traces really are not about TD
- They are about efficiently implementing online algorithms

MC:
$$\theta_{t+1} \doteq \theta_t + \alpha_t \left( Z - \theta_t^\top \phi_t \right) \phi_t, \qquad t = 0, \ldots, T-1$$
(all updates done at time $T$; $\alpha_t$ is the step size)
The Problem: predict a final target $Z$ with linear function approximation.
(Figure: a timeline of one episode and the start of the next; the data are the feature vectors $\phi_0, \phi_1, \phi_2, \ldots, \phi_{T-1}$ arriving at times $1, 2, \ldots, T-1$, followed by the target $Z$ at time $T$; the weights are $\theta_0$ throughout the episode and $\theta_T$ afterwards; the predictions are $\theta_0^\top \phi_0, \theta_0^\top \phi_1, \theta_0^\top \phi_2, \ldots, \theta_0^\top \phi_{T-1}$, each meant to approximate $Z$.)
Computation per step (including memory) must be independent of the span. In general, the predictive span is the number of steps between making a prediction and observing its outcome.

MC:
$$\theta_{t+1} \doteq \theta_t + \alpha_t \left( Z - \theta_t^\top \phi_t \right) \phi_t, \qquad t = 0, \ldots, T-1 \quad \text{(all done at time } T\text{; } \alpha_t \text{ is the step size)}$$

- Is MC independent of span? No.
- What is the span? $T$.
- The computation and memory needed at step $T$ increase with $T$ ⇒ not independent of span (IoS).
Given: $\theta_0$; $\phi_0, \phi_1, \phi_2, \ldots, \phi_{T-1}$; $Z$

MC algorithm:
$$\theta_{t+1} \doteq \theta_t + \alpha_t \left( Z - \theta_t^\top \phi_t \right) \phi_t, \qquad t = 0, \ldots, T-1$$

Equivalent independent-of-span algorithm:
$$\theta_T \doteq a_{T-1} + Z e_{T-1}$$
$$a_0 \doteq \theta_0, \quad \text{then } a_t \doteq a_{t-1} - \alpha_t \phi_t \phi_t^\top a_{t-1}, \qquad t = 1, \ldots, T-1$$
$$e_0 \doteq \alpha_0 \phi_0, \quad \text{then } e_t \doteq e_{t-1} - \alpha_t \phi_t \phi_t^\top e_{t-1} + \alpha_t \phi_t, \qquad t = 1, \ldots, T-1$$
where $a_t \in \mathbb{R}^n$ and $e_t \in \mathbb{R}^n$ are auxiliary short-term-memory vectors.

Proved: both algorithms produce exactly the same final weight vector $\theta_T$.
Derivation (with $F_t \doteq I - \alpha_t \phi_t \phi_t^\top$):
$$\theta_{t+1} \doteq \theta_t + \alpha_t \left( Z - \theta_t^\top \phi_t \right) \phi_t, \qquad t = 0, \ldots, T-1 \tag{1}$$
$$\begin{aligned}
\theta_T &= \theta_{T-1} + \alpha_{T-1} \left( Z - \theta_{T-1}^\top \phi_{T-1} \right) \phi_{T-1} \\
&= \left( I - \alpha_{T-1} \phi_{T-1} \phi_{T-1}^\top \right) \theta_{T-1} + Z \alpha_{T-1} \phi_{T-1} \\
&= F_{T-1} \theta_{T-1} + Z \alpha_{T-1} \phi_{T-1} \\
&= F_{T-1} \left( F_{T-2} \theta_{T-2} + Z \alpha_{T-2} \phi_{T-2} \right) + Z \alpha_{T-1} \phi_{T-1} \\
&= F_{T-1} F_{T-2} \theta_{T-2} + Z \left( F_{T-1} \alpha_{T-2} \phi_{T-2} + \alpha_{T-1} \phi_{T-1} \right) \\
&= F_{T-1} F_{T-2} \left( F_{T-3} \theta_{T-3} + Z \alpha_{T-3} \phi_{T-3} \right) + Z \left( F_{T-1} \alpha_{T-2} \phi_{T-2} + \alpha_{T-1} \phi_{T-1} \right) \\
&\;\;\vdots \\
&= \underbrace{F_{T-1} F_{T-2} \cdots F_0 \theta_0}_{a_{T-1}} + Z \underbrace{\sum_{k=0}^{T-1} F_{T-1} F_{T-2} \cdots F_{k+1} \alpha_k \phi_k}_{e_{T-1}} = a_{T-1} + Z e_{T-1} \tag{2}
\end{aligned}$$
where $a_{T-1}, e_{T-1} \in \mathbb{R}^n$ are auxiliary short-term-memory vectors that can be computed incrementally:
$$\begin{aligned}
e_t &\doteq \sum_{k=0}^{t} F_t F_{t-1} \cdots F_{k+1} \alpha_k \phi_k, \qquad t = 0, \ldots, T-1 \\
&= \sum_{k=0}^{t-1} F_t F_{t-1} \cdots F_{k+1} \alpha_k \phi_k + \alpha_t \phi_t \\
&= F_t \sum_{k=0}^{t-1} F_{t-1} F_{t-2} \cdots F_{k+1} \alpha_k \phi_k + \alpha_t \phi_t \\
&= F_t e_{t-1} + \alpha_t \phi_t \\
&= e_{t-1} - \alpha_t \phi_t \phi_t^\top e_{t-1} + \alpha_t \phi_t, \qquad t = 1, \ldots, T-1 \tag{3}
\end{aligned}$$
$$a_t \doteq F_t F_{t-1} \cdots F_0 \theta_0 = F_t a_{t-1} = a_{t-1} - \alpha_t \phi_t \phi_t^\top a_{t-1}, \qquad t = 1, \ldots, T-1 \tag{4}$$
Given: $\theta_0$; $\phi_0, \phi_1, \phi_2, \ldots, \phi_{T-1}$; $Z$
MC: the forward update (1).
Equivalent independent-of-span algorithm: equations (2)-(4), maintaining only $a_t \in \mathbb{R}^n$ and $e_t \in \mathbb{R}^n$ during the episode.
Proved: both produce exactly the same final weights $\theta_T$.
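A quick numerical check of that equivalence (a standalone sketch with made-up data; nothing here comes from the slides beyond the two update rules):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 10, 4
phi = rng.normal(size=(T, n))            # features phi_0, ..., phi_{T-1}
alphas = np.full(T, 0.1)                 # per-step step sizes alpha_t
theta0 = rng.normal(size=n)
Z = 2.5                                  # final target, revealed at time T

# Forward (MC) version: all T updates done at time T, so span grows with T.
theta = theta0.copy()
for t in range(T):
    theta = theta + alphas[t] * (Z - theta.dot(phi[t])) * phi[t]

# Backward (independent-of-span) version: only a_t and e_t kept during the episode.
a, e = theta0.copy(), alphas[0] * phi[0]
for t in range(1, T):
    a = a - alphas[t] * phi[t] * phi[t].dot(a)
    e = e - alphas[t] * phi[t] * phi[t].dot(e) + alphas[t] * phi[t]
theta_ios = a + Z * e

print(np.allclose(theta, theta_ios))     # True: identical final weights
```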
Conclusions from the forward-backward derivation
- We have derived dutch eligibility traces from an MC update, without any TD learning
- Dutch traces, and in fact all eligibility traces, are not about TD; they are about efficient multi-step learning
- We can derive new, non-obvious algorithms that are equivalent to obvious algorithms but have better computational properties
- This is a different type of machine-learning result: an algorithm equivalence
Conclusions regarding Eligibility Traces
- Provide an efficient, incremental way to combine MC and TD
  - Includes advantages of MC (better when non-Markov)
  - Includes advantages of TD (faster; computationally congenial)
- True online TD(λ) is new and best
  - Is exactly equivalent to the online λ-return algorithm
- Three varieties of traces: accumulating, dutch, (replacing)
- Traces extend to control, in on-policy and off-policy forms
- Traces do have a small cost in computation (roughly 2×)