An Emphatic Approach to the Problem of Off-policy TD Learning



  1. R L & A I An Emphatic Approach to the Problem of Off-policy TD Learning Rich Sutton Rupam Mahmood Martha White Reinforcement Learning and Artificial Intelligence Laboratory Department of Computing Science University of Alberta, Canada

  2. Temporal-Difference Learning with Linear Function Approximation
     States $S_t \in \mathcal{S}$, actions $A_t \in \mathcal{A}$, rewards $R_{t+1} \in \mathbb{R}$.
     Policy: $\pi(a|s) \doteq \mathbb{P}\{A_t = a \mid S_t = s\}$.
     Transition probability matrix: $[P_\pi]_{ij} \doteq \sum_a \pi(a|i)\, p(j|i,a)$, where $p(j|i,a) \doteq \mathbb{P}\{S_{t+1} = j \mid S_t = i, A_t = a\}$.
     Ergodic stationary distribution: $[d_\pi]_s \doteq d_\pi(s) \doteq \lim_{t\to\infty} \mathbb{P}\{S_t = s\} > 0$; its defining property is $d_\pi^\top P_\pi = d_\pi^\top$.
     Return: $G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$, with $0 \le \gamma < 1$.
     Feature vectors $x(s) \in \mathbb{R}^n$, $\forall s \in \mathcal{S}$; weight vector $w_t \in \mathbb{R}^n$, with $n \ll |\mathcal{S}|$.
     Value function: $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s] \approx w_t^\top x(s)$.
     Linear TD(0):
       $w_{t+1} \doteq w_t + \alpha\big(R_{t+1} + \gamma w_t^\top x(S_{t+1}) - w_t^\top x(S_t)\big)\, x(S_t)$
       $\quad = w_t + \alpha\big(\underbrace{R_{t+1}\, x(S_t)}_{b_t \in \mathbb{R}^n} - \underbrace{x(S_t)\,(x(S_t) - \gamma x(S_{t+1}))^\top}_{A_t \in \mathbb{R}^{n\times n}}\, w_t\big)$
       $\quad = w_t + \alpha(b_t - A_t w_t) = (I - \alpha A_t)\, w_t + \alpha b_t$.
     Deterministic 'expected' update: $\bar{w}_{t+1} \doteq (I - \alpha A)\,\bar{w}_t + \alpha b$, where
       $A \doteq \lim_{t\to\infty} \mathbb{E}[A_t] = \lim_{t\to\infty} \mathbb{E}_\pi\big[x(S_t)\,(x(S_t) - \gamma x(S_{t+1}))^\top\big]$.
     Stable if $A$ is positive definite, i.e., if $y^\top A y > 0,\ \forall y \neq 0$; then converges to $\lim_{t\to\infty} \bar{w}_t = A^{-1} b$.
       $A = \sum_s d_\pi(s)\, x(s)\Big(x(s) - \gamma \sum_{s'} [P_\pi]_{ss'}\, x(s')\Big)^{\!\top} = X^\top \underbrace{D_\pi (I - \gamma P_\pi)}_{\text{key matrix}}\, X$,
     where $X$ is the $|\mathcal{S}| \times n$ matrix whose rows are the $x(s)^\top$, and $D_\pi \doteq \mathrm{diag}(d_\pi)$.
     If this "key matrix" $D_\pi(I - \gamma P_\pi)$ is positive definite, then $A$ is positive definite and everything is stable. I showed in 1988 that the key matrix is positive definite if its column sums are all > 0.
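     The update and the matrices above can be written out as a minimal numpy sketch; the function names, and the assumption that $P_\pi$, $d_\pi$, and the feature matrix $X$ are available as explicit arrays, are illustrative choices of mine, not part of the slides.

```python
import numpy as np

def td0_update(w, x_t, r_tp1, x_tp1, alpha, gamma):
    """One on-policy linear TD(0) update: w_{t+1} = w_t + alpha * delta_t * x(S_t)."""
    delta = r_tp1 + gamma * w @ x_tp1 - w @ x_t   # TD(0) error
    return w + alpha * delta * x_t

def key_matrix(d_pi, P_pi, gamma):
    """The 'key matrix' D_pi (I - gamma P_pi) appearing in the expected update."""
    return np.diag(d_pi) @ (np.eye(len(d_pi)) - gamma * P_pi)

def expected_A(X, d_pi, P_pi, gamma):
    """A = X^T D_pi (I - gamma P_pi) X; TD(0) is stable if A is positive definite."""
    return X.T @ key_matrix(d_pi, P_pi, gamma) @ X
```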

  3. (Setup repeated from the previous slide.) For the $j$th column of the key matrix $D_\pi(I - \gamma P_\pi)$, the sum is
       $\sum_i [D_\pi(I - \gamma P_\pi)]_{ij} = \sum_i \sum_k [D_\pi]_{ik}\, [I - \gamma P_\pi]_{kj}$
       $\quad = \sum_i [D_\pi]_{ii}\, [I - \gamma P_\pi]_{ij}$   (because $D_\pi$ is diagonal)
       $\quad = \sum_i d_\pi(i)\, [I - \gamma P_\pi]_{ij}$
       $\quad = [d_\pi^\top (I - \gamma P_\pi)]_j$
       $\quad = [d_\pi^\top - \gamma\, d_\pi^\top P_\pi]_j$
       $\quad = [d_\pi^\top - \gamma\, d_\pi^\top]_j$   (using $d_\pi^\top P_\pi = d_\pi^\top$)
       $\quad = (1 - \gamma)\, d_\pi(j) > 0$.
     So the key matrix's column sums are all positive, the key matrix is positive definite, and on-policy linear TD(0) is stable.
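     The column-sum identity is easy to check numerically; the 4-state chain below is randomly generated purely for illustration and is not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n = 0.9, 4

# Invented ergodic chain: random positive rows, normalized to sum to 1.
P_pi = rng.random((n, n)) + 0.1
P_pi /= P_pi.sum(axis=1, keepdims=True)

# Stationary distribution d_pi: left eigenvector of P_pi for eigenvalue 1.
vals, vecs = np.linalg.eig(P_pi.T)
d_pi = np.real(vecs[:, np.argmax(np.real(vals))])
d_pi /= d_pi.sum()

K = np.diag(d_pi) @ (np.eye(n) - gamma * P_pi)
print(K.sum(axis=0))          # column sums of the key matrix ...
print((1 - gamma) * d_pi)     # ... equal (1 - gamma) d_pi(j), all > 0
```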

  4. Two off-policy learning problems
     1. Correcting for the distribution of future returns.
        Solution: importance sampling (Sutton & Barto 1998, improved by Precup, Sutton & Singh, 2000), now used in GTD(λ) and GQ(λ).
     2. Correcting for the state-update distribution.
        Solution: none known, other than more importance sampling (Precup, Sutton & Dasgupta, 2001), which as proposed was of very high variance. The ideas of that work are strikingly similar to those of emphasis…

  5. Off-policy Temporal-Difference Learning with Linear Function Approximation
     States $S_t \in \mathcal{S}$, actions $A_t \in \mathcal{A}$, rewards $R_{t+1} \in \mathbb{R}$.
     The target policy $\pi(a|s)$ is no longer used to select actions; the behavior policy $\mu(a|s)$ is used to select actions.
     Assume coverage: $\forall s, a:\ \pi(a|s) > 0 \Rightarrow \mu(a|s) > 0$.
     New ergodic stationary distribution: $[d_\mu]_s \doteq d_\mu(s) \doteq \lim_{t\to\infty} \mathbb{P}\{S_t = s\} > 0$, $\forall s \in \mathcal{S}$.
     Old value function: $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s] \approx w_t^\top x(s)$.
     Importance sampling ratio: $\rho_t \doteq \dfrac{\pi(A_t|S_t)}{\mu(A_t|S_t)}$, with
       $\mathbb{E}_\mu[\rho_t \mid S_t = s] = \sum_a \mu(a|s)\, \dfrac{\pi(a|s)}{\mu(a|s)} = \sum_a \pi(a|s) = 1$.
     For any random variable $Z_{t+1}$:
       $\mathbb{E}_\mu[\rho_t Z_{t+1} \mid S_t = s] = \sum_a \mu(a|s)\, \dfrac{\pi(a|s)}{\mu(a|s)}\, Z_{t+1} = \sum_a \pi(a|s)\, Z_{t+1} = \mathbb{E}_\pi[Z_{t+1} \mid S_t = s]$.
     Linear off-policy TD(0), with $x_t \doteq x(S_t)$:
       $w_{t+1} \doteq w_t + \rho_t\, \alpha\big(R_{t+1} + \gamma w_t^\top x_{t+1} - w_t^\top x_t\big)\, x_t$
       $\quad = w_t + \alpha\big(\underbrace{\rho_t R_{t+1}\, x_t}_{b_t} - \underbrace{\rho_t\, x_t (x_t - \gamma x_{t+1})^\top}_{A_t}\, w_t\big)$,
     and its A matrix:
       $A = \lim_{t\to\infty} \mathbb{E}[A_t] = \lim_{t\to\infty} \mathbb{E}_\mu\big[\rho_t\, x_t (x_t - \gamma x_{t+1})^\top\big]$
       $\quad = \sum_s d_\mu(s)\, \mathbb{E}_\mu\big[\rho_t\, x_t (x_t - \gamma x_{t+1})^\top \,\big|\, S_t = s\big]$
       $\quad = \sum_s d_\mu(s)\, \mathbb{E}_\pi\big[x_t (x_t - \gamma x_{t+1})^\top \,\big|\, S_t = s\big]$
       $\quad = \sum_s d_\mu(s)\, x(s)\Big(x(s) - \gamma \sum_{s'} [P_\pi]_{ss'}\, x(s')\Big)^{\!\top}$
       $\quad = X^\top D_\mu (I - \gamma P_\pi)\, X$.
     The key matrix now has mismatched $D$ and $P$ matrices; it is not stable.
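     The same kind of sketch as before, now for the off-policy case; treating pi and mu as callables that return action probabilities is an illustrative choice of mine, not notation from the slides.

```python
import numpy as np

def off_policy_td0_update(w, s_t, a_t, x_t, r_tp1, x_tp1, alpha, gamma, pi, mu):
    """One off-policy linear TD(0) update: actions come from mu, values of pi are learned."""
    rho_t = pi(a_t, s_t) / mu(a_t, s_t)           # importance sampling ratio
    delta = r_tp1 + gamma * w @ x_tp1 - w @ x_t   # TD(0) error
    return w + alpha * rho_t * delta * x_t

def off_policy_expected_A(X, d_mu, P_pi, gamma):
    """A = X^T D_mu (I - gamma P_pi) X: mismatched D and P, not guaranteed positive definite."""
    return X.T @ np.diag(d_mu) @ (np.eye(len(d_mu)) - gamma * P_pi) @ X
```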

  6. Counterexample
     (Recap: off-policy TD(0)'s A matrix is $A = \lim_{t\to\infty} \mathbb{E}_\mu\big[\rho_t\, x_t (x_t - \gamma x_{t+1})^\top\big] = X^\top D_\mu (I - \gamma P_\pi)\, X$, whose key matrix has mismatched $D$ and $P$.)
     Two states with features $x(1) = 1$ and $x(2) = 2$ (estimated values $w$ and $2w$), so $X = \begin{bmatrix}1\\2\end{bmatrix}$; $\gamma = 0.9$, $\lambda = 0$; $\mu(\text{right}\,|\,\cdot) = 0.5$, $\pi(\text{right}\,|\,\cdot) = 1$.
     Transition probability matrix under $\pi$: $P_\pi = \begin{bmatrix}0 & 1\\ 0 & 1\end{bmatrix}$; behavior stationary distribution $D_\mu = \mathrm{diag}(0.5,\, 0.5)$.
     Key matrix: $D_\mu(I - \gamma P_\pi) = \begin{bmatrix}0.5 & 0\\ 0 & 0.5\end{bmatrix}\begin{bmatrix}1 & -0.9\\ 0 & 0.1\end{bmatrix} = \begin{bmatrix}0.5 & -0.45\\ 0 & 0.05\end{bmatrix}$; the second column sums to $-0.4 < 0$!
     Positive-definiteness test: $X^\top D_\mu(I - \gamma P_\pi)\, X = \begin{bmatrix}1 & 2\end{bmatrix}\begin{bmatrix}0.5 & -0.45\\ 0 & 0.05\end{bmatrix}\begin{bmatrix}1\\ 2\end{bmatrix} = \begin{bmatrix}1 & 2\end{bmatrix}\begin{bmatrix}-0.4\\ 0.1\end{bmatrix} = -0.2$.
     $A$ is not positive definite! Stability is not assured.
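     The counterexample's arithmetic can be checked in a few lines (all numbers taken from the slide):

```python
import numpy as np

gamma = 0.9
X = np.array([[1.0], [2.0]])            # features: x(1) = 1, x(2) = 2
P_pi = np.array([[0.0, 1.0],            # target policy always moves right, to state 2
                 [0.0, 1.0]])
d_mu = np.array([0.5, 0.5])             # behavior policy's stationary distribution

K = np.diag(d_mu) @ (np.eye(2) - gamma * P_pi)
print(K)                  # [[0.5, -0.45], [0., 0.05]]; second column sums to -0.4
print(X.T @ K @ X)        # [[-0.2]]: A is not positive definite
```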

  7. Two off-policy learning problems (revisited)
     1. Correcting for the distribution of future returns.
        Solution: importance sampling (Sutton & Barto 1998, improved by Precup, Sutton & Singh, 2000), now used in GTD(λ) and GQ(λ).
     2. Correcting for the state-update distribution.
        Solution: none known, other than more importance sampling (Precup, Sutton & Dasgupta, 2001), which as proposed was of very high variance. The ideas of that work are strikingly similar to those of emphasis…

  8. Geometric Insight
     [Figure: geometric picture of value-function approximation and the projected fixed point, with labels $\tilde{J}$, $\hat{J}$, $r$, $v$, $v_\pi$, $J^*$ (after Ben Van Roy, 2009).]

  9. Other Distribution
     [Figure: the same geometric picture under a different weighting distribution, with labels $\tilde{J}$, $\hat{J}$, $r$, $v$, $v_\pi$, $J^*$ (after Ben Van Roy, 2009).]

  10. Problem 2 of off-policy learning: Correcting for the state-update distribution
      • The distribution of updated states does not ‘match’ the target policy
      • Only a problem with function approximation, but that’s a show stopper
      • Precup, Sutton & Dasgupta (2001) treated the episodic case, used importance sampling to warp the state distribution from the behavior policy’s distribution to the target policy’s distribution, then did a future-reweighted update at each state
        • equivalent to emphasis = product of all i.s. ratios since the beginning of time
        • an OK algorithm, but severe variance problems in both theory and practice (illustrated in the sketch below)
        • performance assessed on whole episodes following the target policy
      • This ‘alternate life’ view of off-policy learning was then abandoned
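      The variance problem with the product-of-ratios emphasis can be illustrated with an invented ratio distribution (ratios of 0.5 or 1.5 with equal probability, so their mean is 1 as an importance sampling ratio requires); the numbers here are purely illustrative and not from the 2001 paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, runs = 20, 100_000

# Illustrative importance sampling ratios with mean 1 under the behavior policy.
rhos = rng.choice([0.5, 1.5], size=(runs, T))
emphasis = np.cumprod(rhos, axis=1)        # product of all ratios since the episode start

print(emphasis.mean(axis=0)[[0, 9, 19]])   # stays near 1 (the weighting is unbiased) ...
print(emphasis.var(axis=0)[[0, 9, 19]])    # ... but the variance grows as 1.25^t - 1
```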

  11. The excursion view of off-policy learning
      • In which we are following a (possibly changing) behavior policy forever, and are in its stationary distribution
      • We want to predict the consequences of deviating from it for a limited time with various target policies (e.g., options)
      • Error is assessed on these ‘excursions’ starting from states in the behavior distribution
      • A much more practical setting than ‘alternate life’
      • This setting was the basis for all the work with gradient-TD methods and the MSPBE

  12. Emphasis warping
      • The idea is that emphasis warps the distribution of updated states from the behavior policy’s stationary distribution to something like the ‘followon distribution’ of the target policy started in the behavior policy’s stationary distribution
      • From which future-reweighted updates will be stable in expectation; this follows from old results (Dayan 1992, Sutton 1988) on the convergence of TD(λ) in episodic MDPs
      • A new algorithm: Emphatic TD(λ), sketched below
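      A minimal sketch of the Emphatic TD(λ) update as given in the corresponding paper by Sutton, Mahmood and White, specialized (by me) to constant γ and λ; the class structure and variable names are mine, and the general algorithm also allows state-dependent γ, λ, and interest.

```python
import numpy as np

class EmphaticTD:
    """Emphatic TD(lambda) with linear features and constant gamma and lambda."""

    def __init__(self, n_features, alpha, gamma, lam):
        self.w = np.zeros(n_features)   # weight vector
        self.e = np.zeros(n_features)   # eligibility trace
        self.F = 0.0                    # followon trace
        self.rho_prev = 0.0             # rho_{t-1}, initialized so that F_0 = I_0
        self.alpha, self.gamma, self.lam = alpha, gamma, lam

    def update(self, x_t, r_tp1, x_tp1, rho_t, interest=1.0):
        """One step: x_t, x_tp1 are feature vectors, rho_t the i.s. ratio, interest is I_t."""
        delta = r_tp1 + self.gamma * self.w @ x_tp1 - self.w @ x_t     # TD error
        self.F = self.rho_prev * self.gamma * self.F + interest        # followon trace F_t
        M = self.lam * interest + (1 - self.lam) * self.F              # emphasis M_t
        self.e = rho_t * (self.gamma * self.lam * self.e + M * x_t)    # emphasized trace e_t
        self.w = self.w + self.alpha * delta * self.e                  # weight update
        self.rho_prev = rho_t
```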
