

  1. Definitions and the policy-gradient theorem (average-reward setting):

     Parameterized policy:
     $$\pi(a \mid s, \theta) \doteq \Pr\{A_t = a \mid S_t = s\}$$

     Average reward per step:
     $$r(\pi) \doteq \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \mathbb{E}_\pi[R_t]
       = \sum_s d_\pi(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r$$

     Stationary state distribution:
     $$\sum_s d_\pi(s) \sum_a \pi(a \mid s, \theta)\, p(s' \mid s, a) = d_\pi(s'),
       \qquad d_\pi(s) \doteq \lim_{t \to \infty} \Pr\{S_t = s\}$$

     Differential value functions:
     $$\tilde v_\pi(s) \doteq \sum_{k=1}^{\infty} \mathbb{E}_\pi\bigl[R_{t+k} - r(\pi) \mid S_t = s\bigr]$$
     $$\tilde q_\pi(s, a) \doteq \sum_{k=1}^{\infty} \mathbb{E}_\pi\bigl[R_{t+k} - r(\pi) \mid S_t = s, A_t = a\bigr]$$

     Gradient-ascent update of the policy parameters:
     $$\Delta\theta_t \doteq \alpha\, \widehat{\frac{\partial r(\pi)}{\partial \theta}} = \alpha\, \widehat{\nabla r(\pi)}$$

     $$\begin{aligned}
     \nabla r(\pi)
       &= \sum_s d_\pi(s) \sum_a \tilde q_\pi(s, a)\, \nabla\pi(a \mid s, \theta)
          && \text{(the policy-gradient theorem)} \\
       &= \sum_s d_\pi(s) \sum_a \bigl(\tilde q_\pi(s, a) - v(s)\bigr)\, \nabla\pi(a \mid s, \theta)
          && \text{(for any } v : \mathcal{S} \to \mathbb{R}\text{)} \\
       &= \sum_s d_\pi(s) \sum_a \pi(a \mid s, \theta)\,
          \bigl(\tilde q_\pi(s, a) - v(s)\bigr)\,
          \frac{\nabla\pi(a \mid s, \theta)}{\pi(a \mid s, \theta)} \\
       &= \mathbb{E}\!\left[ \bigl(\tilde q_\pi(S_t, A_t) - v(S_t)\bigr)\,
          \frac{\nabla\pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}
          \;\middle|\; S_t \sim d_\pi,\ A_t \sim \pi(\cdot \mid S_t, \theta) \right]
     \end{aligned}$$

     Forward view:
     $$\begin{aligned}
     \theta_{t+1} &\doteq \theta_t + \alpha\, \widehat{\nabla r(\pi)} \\
       &= \theta_t + \alpha \bigl(\tilde G_t^\lambda - \hat v(S_t, w)\bigr)\,
          \frac{\nabla\pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}
     \end{aligned}$$

     e.g., in the one-step linear case:
     $$\begin{aligned}
     \theta_{t+1} &= \theta_t + \alpha \bigl(R_{t+1} - \bar R_t
          + w_t^\top \phi_{t+1} - w_t^\top \phi_t\bigr)\,
          \frac{\nabla\pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)} \\
       &\doteq \theta_t + \alpha\, \delta_t\, e(S_t, A_t)
     \end{aligned}$$
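     As a concrete reading of the one-step linear case, here is a minimal Python (numpy) sketch of a single actor update; the function and argument names (one_step_actor_update, phi_s, grad_log_pi, the three step sizes) are illustrative assumptions, not from the source.

     ```python
     import numpy as np

     def one_step_actor_update(theta, w, r_bar, phi_s, phi_s_next, grad_log_pi,
                               reward, alpha_theta, alpha_w, alpha_rbar):
         """One step of the linear actor-critic update (hypothetical names).

         phi_s, phi_s_next : numpy feature vectors for S_t and S_{t+1}
         grad_log_pi       : the eligibility  grad pi(A_t|S_t,theta) / pi(A_t|S_t,theta)
         """
         # Differential TD error: delta_t = R_{t+1} - Rbar_t + w^T phi_{t+1} - w^T phi_t
         delta = reward - r_bar + w @ phi_s_next - w @ phi_s
         r_bar = r_bar + alpha_rbar * delta              # running estimate of r(pi)
         w = w + alpha_w * delta * phi_s                 # linear critic (state-value) update
         theta = theta + alpha_theta * delta * grad_log_pi  # actor: theta += alpha * delta * e
         return theta, w, r_bar
     ```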

  2. Deriving the policy-gradient theorem,
     $\nabla r(\pi) = \sum_s d_\pi(s) \sum_a \tilde q_\pi(s, a)\, \nabla\pi(a \mid s, \theta)$:

     $$\begin{aligned}
     \nabla \tilde v_\pi(s)
       &= \nabla \sum_a \pi(a \mid s, \theta)\, \tilde q_\pi(s, a) \\
       &= \sum_a \Bigl[ \nabla\pi(a \mid s, \theta)\, \tilde q_\pi(s, a)
          + \pi(a \mid s, \theta)\, \nabla \tilde q_\pi(s, a) \Bigr] \\
       &= \sum_a \Bigl[ \nabla\pi(a \mid s, \theta)\, \tilde q_\pi(s, a)
          + \pi(a \mid s, \theta)\, \nabla \sum_{s', r} p(s', r \mid s, a)
            \bigl(r - r(\pi) + \tilde v_\pi(s')\bigr) \Bigr] \\
       &= \sum_a \Bigl[ \nabla\pi(a \mid s, \theta)\, \tilde q_\pi(s, a)
          + \pi(a \mid s, \theta)\Bigl( -\nabla r(\pi)
            + \sum_{s'} p(s' \mid s, a)\, \nabla\tilde v_\pi(s') \Bigr) \Bigr]
     \end{aligned}$$

     Re-arranging terms:
     $$\nabla r(\pi) = \sum_a \Bigl[ \nabla\pi(a \mid s, \theta)\, \tilde q_\pi(s, a)
        + \pi(a \mid s, \theta) \sum_{s'} p(s' \mid s, a)\, \nabla\tilde v_\pi(s') \Bigr]
        - \nabla\tilde v_\pi(s)$$

     Summing both sides over $s$, weighted by $d_\pi(s)$:
     $$\sum_s d_\pi(s)\, \nabla r(\pi)
       = \sum_s d_\pi(s) \sum_a \nabla\pi(a \mid s, \theta)\, \tilde q_\pi(s, a)
       + \sum_s d_\pi(s) \sum_a \pi(a \mid s, \theta) \sum_{s'} p(s' \mid s, a)\, \nabla\tilde v_\pi(s')
       - \sum_s d_\pi(s)\, \nabla\tilde v_\pi(s)$$

     Since $\sum_s d_\pi(s) = 1$, the left-hand side is just $\nabla r(\pi)$:
     $$\begin{aligned}
     \nabla r(\pi)
       &= \sum_s d_\pi(s) \sum_a \nabla\pi(a \mid s, \theta)\, \tilde q_\pi(s, a)
        + \sum_{s'} \underbrace{\sum_s d_\pi(s) \sum_a \pi(a \mid s, \theta)\, p(s' \mid s, a)}_{d_\pi(s')}
          \nabla\tilde v_\pi(s')
        - \sum_s d_\pi(s)\, \nabla\tilde v_\pi(s) \\
       &= \sum_s d_\pi(s) \sum_a \nabla\pi(a \mid s, \theta)\, \tilde q_\pi(s, a)
        + \sum_{s'} d_\pi(s')\, \nabla\tilde v_\pi(s')
        - \sum_s d_\pi(s)\, \nabla\tilde v_\pi(s) \\
       &= \sum_s d_\pi(s) \sum_a \tilde q_\pi(s, a)\, \nabla\pi(a \mid s, \theta)
        \qquad \text{Q.E.D.}
     \end{aligned}$$
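     The theorem is easy to check numerically. Below is a small Python (numpy) sanity check, not from the source: on a randomly generated ergodic MDP with a tabular softmax policy, the right-hand side of the theorem is compared against a finite-difference estimate of $\nabla r(\pi)$. The MDP, the function names, and the constants are all illustrative assumptions.

     ```python
     import numpy as np

     rng = np.random.default_rng(0)
     nS, nA = 3, 2                                   # tiny, arbitrary MDP
     P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] = p(s' | s, a)
     R = rng.normal(size=(nS, nA))                   # R[s, a] = expected reward

     def policy(theta):
         """Softmax policy with one preference theta[s, a] per state-action pair."""
         prefs = theta - theta.max(axis=1, keepdims=True)
         e = np.exp(prefs)
         return e / e.sum(axis=1, keepdims=True)

     def average_reward(theta):
         pi = policy(theta)
         P_pi = np.einsum('sa,saz->sz', pi, P)       # state-to-state transition matrix
         evals, evecs = np.linalg.eig(P_pi.T)        # stationary distribution d_pi
         d = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
         d = d / d.sum()
         r_pi = (pi * R).sum(axis=1)                 # expected one-step reward per state
         return d @ r_pi, d, pi, P_pi, r_pi

     def pg_theorem_gradient(theta):
         r_bar, d, pi, P_pi, r_pi = average_reward(theta)
         # Differential values: (I - P_pi) v = r_pi - r_bar, solved up to a constant
         # (the constant cancels because sum_a grad pi(a|s) = 0).
         v = np.linalg.pinv(np.eye(nS) - P_pi) @ (r_pi - r_bar)
         q = R - r_bar + P @ v            # q~(s,a) = sum_{s'} p(s'|s,a)[r - r(pi) + v~(s')]
         # d/dtheta[s,b] r(pi) = d(s) pi(b|s) [ q~(s,b) - sum_a pi(a|s) q~(s,a) ]
         baseline = (pi * q).sum(axis=1, keepdims=True)
         return d[:, None] * pi * (q - baseline)

     theta = rng.normal(size=(nS, nA))
     analytic = pg_theorem_gradient(theta)

     eps, numeric = 1e-6, np.zeros_like(theta)       # central finite differences
     for s in range(nS):
         for a in range(nA):
             up, dn = theta.copy(), theta.copy()
             up[s, a] += eps
             dn[s, a] -= eps
             numeric[s, a] = (average_reward(up)[0] - average_reward(dn)[0]) / (2 * eps)

     print(np.max(np.abs(analytic - numeric)))       # should print a value near zero
     ```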

  3. Final, complete policy-gradient algorithm:

     Initialize the policy parameters $\theta \in \mathbb{R}^n$ and the state-value weights $w \in \mathbb{R}^m$
     Initialize the eligibility traces $z^\theta \in \mathbb{R}^n$ and $z^w \in \mathbb{R}^m$ to $0$
     Initialize $\bar R = 0$
     On each step, in state $S$:
         Choose $A$ according to $\pi(\cdot \mid S, \theta)$
         Take action $A$, observe $S'$, $R$
         $\delta \leftarrow R - \bar R + \hat v(S', w) - \hat v(S, w)$
         $\bar R \leftarrow \bar R + \alpha_1 \delta$
         $z^w \leftarrow \lambda z^w + \nabla_w \hat v(S, w)$
         $w \leftarrow w + \alpha_2 \delta\, z^w$
         $z^\theta \leftarrow \lambda z^\theta + \dfrac{\nabla\pi(A \mid S, \theta)}{\pi(A \mid S, \theta)}$
         $\theta \leftarrow \theta + \alpha_3 \delta\, z^\theta$

     For discrete actions, an exponential softmax parameterization:
     $$\pi(a \mid s, \theta) \doteq \frac{\exp\bigl(\theta^\top \phi(s, a)\bigr)}{\sum_b \exp\bigl(\theta^\top \phi(s, b)\bigr)},
       \qquad
       e(s, a) \doteq \frac{\nabla\pi(a \mid s, \theta)}{\pi(a \mid s, \theta)}
       = \phi(s, a) - \sum_b \pi(b \mid s, \theta)\, \phi(s, b)$$

     For continuous actions, a Gaussian parameterization with $\theta \doteq (\theta_\mu^\top;\, \theta_\sigma^\top)^\top$:
     $$\mu(s) \doteq \theta_\mu^\top \phi_\mu(s), \qquad
       \sigma(s) \doteq \exp\bigl(\theta_\sigma^\top \phi_\sigma(s)\bigr), \qquad
       \pi(a \mid s, \theta) \doteq \frac{1}{\sigma(s)\sqrt{2\pi}}
       \exp\!\left( \frac{-(a - \mu(s))^2}{2\,\sigma(s)^2} \right)$$
     $$\frac{\nabla_{\theta_\mu} \pi(a \mid s, \theta)}{\pi(a \mid s, \theta)}
       = \frac{a - \mu(s)}{\sigma(s)^2}\, \phi_\mu(s),
       \qquad
       \frac{\nabla_{\theta_\sigma} \pi(a \mid s, \theta)}{\pi(a \mid s, \theta)}
       = \left( \frac{(a - \mu(s))^2}{\sigma(s)^2} - 1 \right) \phi_\sigma(s)$$
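     Below is a minimal Python (numpy) sketch of the boxed algorithm for the discrete-action case, assuming a linear state-value function $\hat v(s, w) = w^\top x(s)$ and the exponential softmax policy defined above. The environment interface (env.reset(), env.step() returning the next state and reward), the feature functions phi and x, and the step-size names are illustrative assumptions, not part of the source.

     ```python
     import numpy as np

     def softmax_pi(theta, phi, s, actions):
         """pi(a|s,theta) = exp(theta^T phi(s,a)) / sum_b exp(theta^T phi(s,b))."""
         prefs = np.array([theta @ phi(s, a) for a in actions])
         prefs -= prefs.max()                       # numerical stability
         p = np.exp(prefs)
         return p / p.sum()

     def eligibility(phi, s, a, actions, pi):
         """grad pi(a|s,theta) / pi(a|s,theta) = phi(s,a) - sum_b pi(b|s) phi(s,b)."""
         expected_phi = sum(pi[i] * phi(s, b) for i, b in enumerate(actions))
         return phi(s, a) - expected_phi

     def actor_critic(env, phi, x, actions, n, m,
                      alpha1=0.01, alpha2=0.1, alpha3=0.1, lam=0.8, num_steps=100_000):
         theta, w = np.zeros(n), np.zeros(m)        # policy and state-value parameters
         z_theta, z_w = np.zeros(n), np.zeros(m)    # eligibility traces
         r_bar = 0.0                                # average-reward estimate
         rng = np.random.default_rng()
         s = env.reset()
         for _ in range(num_steps):
             pi = softmax_pi(theta, phi, s, actions)
             a_idx = rng.choice(len(actions), p=pi)
             s_next, reward = env.step(actions[a_idx])
             delta = reward - r_bar + w @ x(s_next) - w @ x(s)   # differential TD error
             r_bar += alpha1 * delta
             z_w = lam * z_w + x(s)                 # critic trace and update
             w += alpha2 * delta * z_w
             z_theta = lam * z_theta + eligibility(phi, s, actions[a_idx], actions, pi)
             theta += alpha3 * delta * z_theta      # actor trace and update
             s = s_next
         return theta, w, r_bar
     ```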
