

  1. Definitions and the policy-gradient theorem (average-reward setting):

     Parameterized policy:
     $$\pi(a \mid s, \theta) \doteq \Pr\{A_t = a \mid S_t = s\}$$

     Average reward per step:
     $$r(\pi) \doteq \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \mathbb{E}_\pi[R_t]
       = \sum_s d_\pi(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r$$

     Stationary state distribution:
     $$\sum_s d_\pi(s) \sum_a \pi(a \mid s, \theta)\, p(s' \mid s, a) = d_\pi(s'),
       \qquad d_\pi(s) \doteq \lim_{t \to \infty} \Pr\{S_t = s\}$$

     Differential value functions:
     $$\tilde v_\pi(s) \doteq \sum_{k=1}^{\infty} \mathbb{E}_\pi\bigl[R_{t+k} - r(\pi) \mid S_t = s\bigr]$$
     $$\tilde q_\pi(s, a) \doteq \sum_{k=1}^{\infty} \mathbb{E}_\pi\bigl[R_{t+k} - r(\pi) \mid S_t = s, A_t = a\bigr]$$

     Gradient-ascent update of the policy parameters:
     $$\Delta\theta_t \doteq \alpha\, \widehat{\frac{\partial r(\pi)}{\partial \theta}} = \alpha\, \widehat{\nabla r(\pi)}$$

     $$\begin{aligned}
     \nabla r(\pi)
       &= \sum_s d_\pi(s) \sum_a \tilde q_\pi(s, a)\, \nabla\pi(a \mid s, \theta)
          && \text{(the policy-gradient theorem)} \\
       &= \sum_s d_\pi(s) \sum_a \bigl(\tilde q_\pi(s, a) - v(s)\bigr)\, \nabla\pi(a \mid s, \theta)
          && \text{(for any } v : \mathcal{S} \to \mathbb{R}\text{)} \\
       &= \sum_s d_\pi(s) \sum_a \pi(a \mid s, \theta)\,
          \bigl(\tilde q_\pi(s, a) - v(s)\bigr)\,
          \frac{\nabla\pi(a \mid s, \theta)}{\pi(a \mid s, \theta)} \\
       &= \mathbb{E}\!\left[ \bigl(\tilde q_\pi(S_t, A_t) - v(S_t)\bigr)\,
          \frac{\nabla\pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}
          \;\middle|\; S_t \sim d_\pi,\ A_t \sim \pi(\cdot \mid S_t, \theta) \right]
     \end{aligned}$$

     Forward view:
     $$\begin{aligned}
     \theta_{t+1} &\doteq \theta_t + \alpha\, \widehat{\nabla r(\pi)} \\
       &= \theta_t + \alpha \bigl(\tilde G_t^\lambda - \hat v(S_t, w)\bigr)\,
          \frac{\nabla\pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}
     \end{aligned}$$

     e.g., in the one-step linear case:
     $$\begin{aligned}
     \theta_{t+1} &= \theta_t + \alpha \bigl(R_{t+1} - \bar R_t
          + w_t^\top \phi_{t+1} - w_t^\top \phi_t\bigr)\,
          \frac{\nabla\pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)} \\
       &\doteq \theta_t + \alpha\, \delta_t\, e(S_t, A_t)
     \end{aligned}$$
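     As a concrete reading of the one-step linear case, here is a minimal Python (numpy) sketch of a single actor update; the function and argument names (one_step_actor_update, phi_s, grad_log_pi, the three step sizes) are illustrative assumptions, not from the source.

     ```python
     import numpy as np

     def one_step_actor_update(theta, w, r_bar, phi_s, phi_s_next, grad_log_pi,
                               reward, alpha_theta, alpha_w, alpha_rbar):
         """One step of the linear actor-critic update (hypothetical names).

         phi_s, phi_s_next : numpy feature vectors for S_t and S_{t+1}
         grad_log_pi       : the eligibility  grad pi(A_t|S_t,theta) / pi(A_t|S_t,theta)
         """
         # Differential TD error: delta_t = R_{t+1} - Rbar_t + w^T phi_{t+1} - w^T phi_t
         delta = reward - r_bar + w @ phi_s_next - w @ phi_s
         r_bar = r_bar + alpha_rbar * delta              # running estimate of r(pi)
         w = w + alpha_w * delta * phi_s                 # linear critic (state-value) update
         theta = theta + alpha_theta * delta * grad_log_pi  # actor: theta += alpha * delta * e
         return theta, w, r_bar
     ```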

  2. Deriving the policy-gradient theorem,
     $\nabla r(\pi) = \sum_s d_\pi(s) \sum_a \tilde q_\pi(s, a)\, \nabla\pi(a \mid s, \theta)$:

     $$\begin{aligned}
     \nabla \tilde v_\pi(s)
       &= \nabla \sum_a \pi(a \mid s, \theta)\, \tilde q_\pi(s, a) \\
       &= \sum_a \Bigl[ \nabla\pi(a \mid s, \theta)\, \tilde q_\pi(s, a)
          + \pi(a \mid s, \theta)\, \nabla \tilde q_\pi(s, a) \Bigr] \\
       &= \sum_a \Bigl[ \nabla\pi(a \mid s, \theta)\, \tilde q_\pi(s, a)
          + \pi(a \mid s, \theta)\, \nabla \sum_{s', r} p(s', r \mid s, a)
            \bigl(r - r(\pi) + \tilde v_\pi(s')\bigr) \Bigr] \\
       &= \sum_a \Bigl[ \nabla\pi(a \mid s, \theta)\, \tilde q_\pi(s, a)
          + \pi(a \mid s, \theta)\Bigl( -\nabla r(\pi)
            + \sum_{s'} p(s' \mid s, a)\, \nabla\tilde v_\pi(s') \Bigr) \Bigr]
     \end{aligned}$$

     Re-arranging terms:
     $$\nabla r(\pi) = \sum_a \Bigl[ \nabla\pi(a \mid s, \theta)\, \tilde q_\pi(s, a)
        + \pi(a \mid s, \theta) \sum_{s'} p(s' \mid s, a)\, \nabla\tilde v_\pi(s') \Bigr]
        - \nabla\tilde v_\pi(s)$$

     Summing both sides over $s$, weighted by $d_\pi(s)$:
     $$\sum_s d_\pi(s)\, \nabla r(\pi)
       = \sum_s d_\pi(s) \sum_a \nabla\pi(a \mid s, \theta)\, \tilde q_\pi(s, a)
       + \sum_s d_\pi(s) \sum_a \pi(a \mid s, \theta) \sum_{s'} p(s' \mid s, a)\, \nabla\tilde v_\pi(s')
       - \sum_s d_\pi(s)\, \nabla\tilde v_\pi(s)$$

     Since $\sum_s d_\pi(s) = 1$, the left-hand side is just $\nabla r(\pi)$:
     $$\begin{aligned}
     \nabla r(\pi)
       &= \sum_s d_\pi(s) \sum_a \nabla\pi(a \mid s, \theta)\, \tilde q_\pi(s, a)
        + \sum_{s'} \underbrace{\sum_s d_\pi(s) \sum_a \pi(a \mid s, \theta)\, p(s' \mid s, a)}_{d_\pi(s')}
          \nabla\tilde v_\pi(s')
        - \sum_s d_\pi(s)\, \nabla\tilde v_\pi(s) \\
       &= \sum_s d_\pi(s) \sum_a \nabla\pi(a \mid s, \theta)\, \tilde q_\pi(s, a)
        + \sum_{s'} d_\pi(s')\, \nabla\tilde v_\pi(s')
        - \sum_s d_\pi(s)\, \nabla\tilde v_\pi(s) \\
       &= \sum_s d_\pi(s) \sum_a \tilde q_\pi(s, a)\, \nabla\pi(a \mid s, \theta)
        \qquad \text{Q.E.D.}
     \end{aligned}$$
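     The theorem is easy to check numerically. Below is a small Python (numpy) sanity check, not from the source: on a randomly generated ergodic MDP with a tabular softmax policy, the right-hand side of the theorem is compared against a finite-difference estimate of $\nabla r(\pi)$. The MDP, the function names, and the constants are all illustrative assumptions.

     ```python
     import numpy as np

     rng = np.random.default_rng(0)
     nS, nA = 3, 2                                   # tiny, arbitrary MDP
     P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] = p(s' | s, a)
     R = rng.normal(size=(nS, nA))                   # R[s, a] = expected reward

     def policy(theta):
         """Softmax policy with one preference theta[s, a] per state-action pair."""
         prefs = theta - theta.max(axis=1, keepdims=True)
         e = np.exp(prefs)
         return e / e.sum(axis=1, keepdims=True)

     def average_reward(theta):
         pi = policy(theta)
         P_pi = np.einsum('sa,saz->sz', pi, P)       # state-to-state transition matrix
         evals, evecs = np.linalg.eig(P_pi.T)        # stationary distribution d_pi
         d = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
         d = d / d.sum()
         r_pi = (pi * R).sum(axis=1)                 # expected one-step reward per state
         return d @ r_pi, d, pi, P_pi, r_pi

     def pg_theorem_gradient(theta):
         r_bar, d, pi, P_pi, r_pi = average_reward(theta)
         # Differential values: (I - P_pi) v = r_pi - r_bar, solved up to a constant
         # (the constant cancels because sum_a grad pi(a|s) = 0).
         v = np.linalg.pinv(np.eye(nS) - P_pi) @ (r_pi - r_bar)
         q = R - r_bar + P @ v            # q~(s,a) = sum_{s'} p(s'|s,a)[r - r(pi) + v~(s')]
         # d/dtheta[s,b] r(pi) = d(s) pi(b|s) [ q~(s,b) - sum_a pi(a|s) q~(s,a) ]
         baseline = (pi * q).sum(axis=1, keepdims=True)
         return d[:, None] * pi * (q - baseline)

     theta = rng.normal(size=(nS, nA))
     analytic = pg_theorem_gradient(theta)

     eps, numeric = 1e-6, np.zeros_like(theta)       # central finite differences
     for s in range(nS):
         for a in range(nA):
             up, dn = theta.copy(), theta.copy()
             up[s, a] += eps
             dn[s, a] -= eps
             numeric[s, a] = (average_reward(up)[0] - average_reward(dn)[0]) / (2 * eps)

     print(np.max(np.abs(analytic - numeric)))       # should print a value near zero
     ```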

  3. Final, complete policy-gradient algorithm:

     Initialize the policy parameters $\theta \in \mathbb{R}^n$ and the state-value weights $w \in \mathbb{R}^m$
     Initialize the eligibility traces $z^\theta \in \mathbb{R}^n$ and $z^w \in \mathbb{R}^m$ to $0$
     Initialize $\bar R = 0$
     On each step, in state $S$:
         Choose $A$ according to $\pi(\cdot \mid S, \theta)$
         Take action $A$, observe $S'$, $R$
         $\delta \leftarrow R - \bar R + \hat v(S', w) - \hat v(S, w)$
         $\bar R \leftarrow \bar R + \alpha_1 \delta$
         $z^w \leftarrow \lambda z^w + \nabla_w \hat v(S, w)$
         $w \leftarrow w + \alpha_2 \delta\, z^w$
         $z^\theta \leftarrow \lambda z^\theta + \dfrac{\nabla\pi(A \mid S, \theta)}{\pi(A \mid S, \theta)}$
         $\theta \leftarrow \theta + \alpha_3 \delta\, z^\theta$

     For discrete actions, an exponential softmax parameterization:
     $$\pi(a \mid s, \theta) \doteq \frac{\exp\bigl(\theta^\top \phi(s, a)\bigr)}{\sum_b \exp\bigl(\theta^\top \phi(s, b)\bigr)},
       \qquad
       e(s, a) \doteq \frac{\nabla\pi(a \mid s, \theta)}{\pi(a \mid s, \theta)}
       = \phi(s, a) - \sum_b \pi(b \mid s, \theta)\, \phi(s, b)$$

     For continuous actions, a Gaussian parameterization with $\theta \doteq (\theta_\mu^\top;\, \theta_\sigma^\top)^\top$:
     $$\mu(s) \doteq \theta_\mu^\top \phi_\mu(s), \qquad
       \sigma(s) \doteq \exp\bigl(\theta_\sigma^\top \phi_\sigma(s)\bigr), \qquad
       \pi(a \mid s, \theta) \doteq \frac{1}{\sigma(s)\sqrt{2\pi}}
       \exp\!\left( \frac{-(a - \mu(s))^2}{2\,\sigma(s)^2} \right)$$
     $$\frac{\nabla_{\theta_\mu} \pi(a \mid s, \theta)}{\pi(a \mid s, \theta)}
       = \frac{a - \mu(s)}{\sigma(s)^2}\, \phi_\mu(s),
       \qquad
       \frac{\nabla_{\theta_\sigma} \pi(a \mid s, \theta)}{\pi(a \mid s, \theta)}
       = \left( \frac{(a - \mu(s))^2}{\sigma(s)^2} - 1 \right) \phi_\sigma(s)$$
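     Below is a minimal Python (numpy) sketch of the boxed algorithm for the discrete-action case, assuming a linear state-value function $\hat v(s, w) = w^\top x(s)$ and the exponential softmax policy defined above. The environment interface (env.reset(), env.step() returning the next state and reward), the feature functions phi and x, and the step-size names are illustrative assumptions, not part of the source.

     ```python
     import numpy as np

     def softmax_pi(theta, phi, s, actions):
         """pi(a|s,theta) = exp(theta^T phi(s,a)) / sum_b exp(theta^T phi(s,b))."""
         prefs = np.array([theta @ phi(s, a) for a in actions])
         prefs -= prefs.max()                       # numerical stability
         p = np.exp(prefs)
         return p / p.sum()

     def eligibility(phi, s, a, actions, pi):
         """grad pi(a|s,theta) / pi(a|s,theta) = phi(s,a) - sum_b pi(b|s) phi(s,b)."""
         expected_phi = sum(pi[i] * phi(s, b) for i, b in enumerate(actions))
         return phi(s, a) - expected_phi

     def actor_critic(env, phi, x, actions, n, m,
                      alpha1=0.01, alpha2=0.1, alpha3=0.1, lam=0.8, num_steps=100_000):
         theta, w = np.zeros(n), np.zeros(m)        # policy and state-value parameters
         z_theta, z_w = np.zeros(n), np.zeros(m)    # eligibility traces
         r_bar = 0.0                                # average-reward estimate
         rng = np.random.default_rng()
         s = env.reset()
         for _ in range(num_steps):
             pi = softmax_pi(theta, phi, s, actions)
             a_idx = rng.choice(len(actions), p=pi)
             s_next, reward = env.step(actions[a_idx])
             delta = reward - r_bar + w @ x(s_next) - w @ x(s)   # differential TD error
             r_bar += alpha1 * delta
             z_w = lam * z_w + x(s)                 # critic trace and update
             w += alpha2 * delta * z_w
             z_theta = lam * z_theta + eligibility(phi, s, actions[a_idx], actions, pi)
             theta += alpha3 * delta * z_theta      # actor trace and update
             s = s_next
         return theta, w, r_bar
     ```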
