Steps to understanding Policy-gradient methods

SLIDE 1

Steps to understanding Policy-gradient methods

  • Policy approximation
  • The average-reward (reward rate) objective
  • Stochastic gradient ascent/descent
  • The policy-gradient theorem and its proof
  • Approximating the gradient
  • Eligibility functions for a few cases
  • A final algorithm

$$\pi(a|s, \theta) \qquad\qquad \Delta\theta_t \approx \alpha\,\frac{\partial \bar r(\theta)}{\partial \theta} \qquad\qquad \bar r(\theta)$$

SLIDE 2

Policy Approximation

  • Policy = a function from state to action
  • How does the agent select actions?
  • In such a way that it can be affected by learning?
  • In such a way as to assure exploration?
  • Approximation: there are too many states and/or actions to represent all policies

  • To handle large/continuous action spaces
SLIDE 3

What is learned and stored?

  • 1. Action-value methods: learn the value of each action; pick the max (usually)
  • 2. Policy-gradient methods: learn the parameters u of a stochastic policy, update by ∇u Performance
  • including actor-critic methods, which learn both value and policy parameters
  • 3. Dynamic Policy Programming
  • 4. Drift-diffusion models (Psychology)

SLIDE 4

Actor-critic architecture

(Diagram: actor and critic components interacting with the world)

SLIDE 5

Action-value methods

  • The value of an action in a state, given a policy, is the expected future reward starting from that state, taking that first action, and then following the policy thereafter
  • Policy: pick the max most of the time, but sometimes pick at random (ε-greedy)

$$q_\pi(s, a) = \mathbb{E}\!\left[\sum_{t=1}^{\infty} \gamma^{t-1} R_t \;\middle|\; S_0 = s,\, A_0 = a\right]$$

$$A_t = \arg\max_a \hat Q_t(S_t, a) \quad \text{(most of the time; otherwise a random action)}$$
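A minimal Python sketch of this kind of ε-greedy selection over estimated action values (the array of estimates and the value of ε are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_estimates, epsilon=0.1):
    """Pick argmax_a Q_t(S_t, a) most of the time; a uniformly random action with probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))   # explore: random action
    return int(np.argmax(q_estimates))               # exploit: the max

q_estimates = np.array([0.2, 1.5, -0.3, 0.9])        # hypothetical Q_t(S_t, .) for the current state
action = epsilon_greedy(q_estimates)
```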

SLIDE 6

Why approximate policies rather than values?

  • In many problems, the policy is simpler to approximate than the value function
  • In many problems, the optimal policy is stochastic

  • e.g., bluffing, POMDPs
  • To enable smoother change in policies
  • To avoid a search on every step (the max)
  • To better relate to biology
SLIDE 7

Gradient-bandit algorithm

  • Store action preferences H_t(a) rather than action-value estimates Q_t(a)
  • Instead of ε-greedy, pick actions by an exponential soft-max:
  • Also store the sample average of rewards as R̄_t
  • Then update:

$$\Pr\{A_t = a\} \doteq \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}} \doteq \pi_t(a)$$

$$H_{t+1}(a) = H_t(a) + \alpha\,\big(R_t - \bar R_t\big)\big(\mathbb{1}_{a=A_t} - \pi_t(a)\big)$$

where $\mathbb{1}_{a=A_t}$ is 1 or 0, depending on whether the predicate (subscript) is true.
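A minimal Python sketch of this bandit update; the bandit environment, horizon, and step size below are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

k = 10                                   # number of arms
q_star = rng.normal(4.0, 1.0, size=k)    # true action values (near +4, as in Figure 2.6)
H = np.zeros(k)                          # action preferences H_t(a)
avg_R = 0.0                              # sample average of rewards, R-bar
alpha = 0.1

for t in range(1, 1001):
    pi = np.exp(H - H.max())             # exponential soft-max (shifted for stability)
    pi /= pi.sum()
    a = rng.choice(k, p=pi)              # A_t ~ pi_t
    R = rng.normal(q_star[a], 1.0)       # observe reward
    avg_R += (R - avg_R) / t             # incremental sample average R-bar_t
    one_hot = np.zeros(k)
    one_hot[a] = 1.0
    # H_{t+1}(a) = H_t(a) + alpha * (R_t - R-bar_t) * (1[a=A_t] - pi_t(a)), for all a at once
    H += alpha * (R - avg_R) * (one_hot - pi)
```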

SLIDE 8

Gradient-bandit algorithms

  • On the 10-armed testbed

(Plot: % Optimal action vs. steps, for α = 0.1 and α = 0.4, with and without a reward baseline)

Figure 2.6: Average performance of the gradient-bandit algorithm with and without a reward baseline on the 10-armed testbed when the q∗(a) are chosen to be near +4 rather than near zero.

SLIDE 9

∂ ∂x f(x) g(x)

  • =

∂f(x) ∂x g(x) − f(x) ∂g(x) ∂x

g(x)2 .

∂πt(b) ∂Ht(a) = ∂ ∂Ht(a) πt(b) = ∂ ∂Ht(a) " eHt(b) Pk

c=1 eHt(c)

# =

∂eHt(b) ∂Ht(a)

Pk

c=1 eHt(c) − eHt(b) ∂ Pk

c=1 eHt(c)

∂Ht(a)

⇣Pk

c=1 eHt(c)

⌘2 (by the quotient rule) = 1a=beHt(a) Pk

c=1 eHt(c) − eHt(b)eHt(a)

⇣Pk

c=1 eHt(c)

⌘2 (because ∂ex

∂x = ex)

= 1a=beHt(b) Pk

c=1 eHt(c) −

eHt(b)eHt(a) ⇣Pk

c=1 eHt(c)

⌘2 = 1a=bπt(b) − πt(b)πt(a) = πt(b)

  • 1a=b − πt(a)
  • .

Q.E.D.
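As a quick numerical sanity check of the identity just derived (not part of the original slides), the closed form can be compared against a finite-difference approximation:

```python
import numpy as np

def softmax(h):
    """pi_t(a) = e^{H_t(a)} / sum_b e^{H_t(b)}."""
    z = np.exp(h - h.max())          # shift by max for numerical stability
    return z / z.sum()

h = np.array([0.5, -1.0, 2.0, 0.0])  # arbitrary preferences H_t
pi = softmax(h)
a, b = 1, 2                          # check d pi_t(b) / d H_t(a)

analytic = pi[b] * ((1.0 if a == b else 0.0) - pi[a])

eps = 1e-6                           # finite-difference step
h_bumped = h.copy()
h_bumped[a] += eps
numeric = (softmax(h_bumped)[b] - pi[b]) / eps

print(analytic, numeric)             # the two values agree to about 6 decimal places
```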

SLIDE 10

Steps to understanding Policy-gradient methods

  • Policy approximation
  • The average-reward (reward rate) objective
  • Stochastic gradient ascent/descent
  • The policy-gradient theorem and its proof
  • Approximating the gradient
  • Eligibility functions for a few cases
  • A complete algorithm

$$\pi(a|s, \theta) \qquad\qquad \Delta\theta_t \approx \alpha\,\frac{\partial \bar r(\theta)}{\partial \theta} \qquad\qquad \bar r(\theta)$$

SLIDE 11

e.g., linear-exponential policies (discrete actions)

  • The "preference" for action a in state s is linear in θ and a state-action feature vector φ(s, a)
  • The probability of action a in state s is exponential in its preference
  • Corresponding eligibility function:

$$\pi(a|s, \theta) \doteq \frac{\exp\!\big(\theta^\top \phi(s, a)\big)}{\sum_b \exp\!\big(\theta^\top \phi(s, b)\big)}$$

$$\frac{\nabla_\theta\, \pi(a|s, \theta)}{\pi(a|s, \theta)} = \phi(s, a) - \sum_b \pi(b|s, \theta)\,\phi(s, b)$$
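A minimal Python sketch of this policy and its eligibility function, assuming phi is a (num_actions × d) array whose rows are the feature vectors φ(s, a) for the current state (names and shapes are illustrative):

```python
import numpy as np

def softmax_policy(theta, phi):
    """pi(a|s,theta) proportional to exp(theta . phi(s,a)); phi[a] holds phi(s, a)."""
    prefs = phi @ theta            # preferences theta . phi(s, a)
    prefs = prefs - prefs.max()    # shift for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def eligibility(theta, phi, a):
    """grad_theta pi(a|s,theta) / pi(a|s,theta) = phi(s,a) - sum_b pi(b|s,theta) phi(s,b)."""
    pi = softmax_policy(theta, phi)
    return phi[a] - pi @ phi
```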

SLIDE 12

Policy-gradient setup

$$\pi(a|s, \theta) \doteq \Pr\{A_t = a \mid S_t = s\} \qquad \text{(parameterized policies)}$$

$$r(\pi) \doteq \lim_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n} \mathbb{E}_\pi[R_t] = \sum_s d_\pi(s) \sum_a \pi(a|s) \sum_{s', r} p(s', r|s, a)\, r \qquad \text{(average-reward objective)}$$

$$d_\pi(s) \doteq \lim_{t\to\infty} \Pr\{S_t = s\} \qquad \text{(steady-state distribution)}$$

$$\tilde v_\pi(s) \doteq \sum_{k=1}^{\infty} \mathbb{E}_\pi\big[R_{t+k} - r(\pi) \mid S_t = s\big] \qquad \text{(differential state-value fn)}$$

$$\tilde q_\pi(s, a) \doteq \sum_{k=1}^{\infty} \mathbb{E}_\pi\big[R_{t+k} - r(\pi) \mid S_t = s,\, A_t = a\big] \qquad \text{(differential action-value fn)}$$

$$\Delta\theta_t \approx \alpha\,\frac{\partial r(\pi)}{\partial \theta} \doteq \alpha\,\nabla r(\pi) \qquad \text{(stochastic gradient ascent)}$$

$$\nabla r(\pi) = \sum_s d_\pi(s) \sum_a \tilde q_\pi(s, a)\,\nabla \pi(a|s, \theta) \qquad \text{(the policy-gradient theorem)}$$
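To make these definitions concrete, here is a minimal sketch that computes the steady-state distribution d_π and the average reward r(π) of a fixed policy on a small two-state MDP; the MDP and policy are hypothetical examples, not from the slides:

```python
import numpy as np

# Illustrative 2-state, 2-action MDP: P[s, a, s'] = p(s'|s, a), R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5], [0.5, 0.5]])    # a fixed stochastic policy pi(a|s)

# State-to-state transition matrix under pi, then its stationary distribution d_pi.
P_pi = np.einsum('sa,sat->st', pi, P)
evals, evecs = np.linalg.eig(P_pi.T)
d = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
d_pi = d / d.sum()                          # d_pi(s) = lim_t Pr{S_t = s}

# Average reward r(pi) = sum_s d_pi(s) sum_a pi(a|s) r(s, a).
r_pi = np.einsum('s,sa,sa->', d_pi, pi, R)
print(d_pi, r_pi)
```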

SLIDE 13

$$\tilde q_\pi(s, a) \doteq \sum_{k=1}^{\infty} \mathbb{E}_\pi\big[R_{t+k} - r(\pi) \mid S_t = s,\, A_t = a\big] \qquad \text{(differential action-value fn)}$$

$$\Delta\theta_t \approx \alpha\,\frac{\partial r(\pi)}{\partial \theta} \doteq \alpha\,\nabla r(\pi) \qquad \text{(stochastic gradient ascent)}$$

$$\begin{aligned}
\nabla r(\pi) &= \sum_s d_\pi(s) \sum_a \tilde q_\pi(s, a)\,\nabla\pi(a|s, \theta) && \text{(the policy-gradient theorem)} \\
&= \mathbb{E}\!\left[\big(\tilde q_\pi(S_t, A_t) - v(S_t)\big)\,\frac{\nabla\pi(A_t|S_t, \theta)}{\pi(A_t|S_t)}\right] && S_t \sim d_\pi,\ A_t \sim \pi(\cdot|S_t, \theta) \\
&= \mathbb{E}\!\left[\big(\tilde G^\lambda_t - \hat v(S_t, w)\big)\,\frac{\nabla\pi(A_t|S_t, \theta)}{\pi(A_t|S_t)}\right] && S_t \sim d_\pi,\ A_{t:\infty} \sim \pi \\
&\approx \big(\tilde G^\lambda_t - \hat v(S_t, w)\big)\,\frac{\nabla\pi(A_t|S_t, \theta)}{\pi(A_t|S_t)} && \text{(by sampling under } \pi\text{)}
\end{aligned}$$

$$\theta_{t+1} \doteq \theta_t + \alpha\,\big(\tilde G^\lambda_t - \hat v(S_t, w)\big)\,\frac{\nabla\pi(A_t|S_t, \theta)}{\pi(A_t|S_t)}$$

e.g., in the one-step linear case:

$$\theta_{t+1} = \theta_t + \alpha\,\big(R_{t+1} - \bar R_t + w_t^\top\phi_{t+1} - w_t^\top\phi_t\big)\,\frac{\nabla\pi(A_t|S_t, \theta)}{\pi(A_t|S_t)} \doteq \theta_t + \alpha\,\delta_t\, e(A_t, S_t)$$
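A minimal sketch of the one-step linear update above, with every quantity (features, weights, reward, eligibility vector) filled in with hypothetical values just to make the shapes and the update explicit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and per-step quantities (illustrative only).
w        = rng.normal(size=4)        # critic weights
theta    = rng.normal(size=6)        # actor (policy) weights
phi_t    = rng.normal(size=4)        # critic features phi_t for S_t
phi_next = rng.normal(size=4)        # critic features phi_{t+1} for S_{t+1}
elig     = rng.normal(size=6)        # grad pi(A_t|S_t,theta) / pi(A_t|S_t), e.g. the soft-max eligibility
R, avg_R, alpha = 1.0, 0.2, 0.1      # reward R_{t+1}, average-reward estimate, step size

# delta_t = R_{t+1} - Rbar_t + w.phi_{t+1} - w.phi_t, then theta_{t+1} = theta_t + alpha * delta_t * e(A_t, S_t)
delta = R - avg_R + w @ phi_next - w @ phi_t
theta = theta + alpha * delta * elig
```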

SLIDE 14

Deriving the policy-gradient theorem, $\nabla r(\pi) = \sum_s d_\pi(s) \sum_a \tilde q_\pi(s, a)\,\nabla\pi(a|s, \theta)$:

$$\begin{aligned}
\nabla \tilde v_\pi(s) &= \nabla \sum_a \pi(a|s, \theta)\,\tilde q_\pi(s, a) \\
&= \sum_a \Big[ \nabla\pi(a|s, \theta)\,\tilde q_\pi(s, a) + \pi(a|s, \theta)\,\nabla\tilde q_\pi(s, a) \Big] \\
&= \sum_a \Big[ \nabla\pi(a|s, \theta)\,\tilde q_\pi(s, a) + \pi(a|s, \theta)\,\nabla \sum_{s', r} p(s', r|s, a)\big[r - r(\pi) + \tilde v_\pi(s')\big] \Big] \\
&= \sum_a \Big[ \nabla\pi(a|s, \theta)\,\tilde q_\pi(s, a) + \pi(a|s, \theta)\Big[ -\nabla r(\pi) + \sum_{s'} p(s'|s, a)\,\nabla\tilde v_\pi(s') \Big] \Big]
\end{aligned}$$

$$\therefore\quad \nabla r(\pi) = \sum_a \Big[ \nabla\pi(a|s, \theta)\,\tilde q_\pi(s, a) + \pi(a|s, \theta) \sum_{s'} p(s'|s, a)\,\nabla\tilde v_\pi(s') \Big] - \nabla\tilde v_\pi(s)$$

$$\therefore\quad \sum_s d_\pi(s)\,\nabla r(\pi) = \sum_s d_\pi(s) \sum_a \nabla\pi(a|s, \theta)\,\tilde q_\pi(s, a) + \sum_s d_\pi(s) \sum_a \pi(a|s, \theta) \sum_{s'} p(s'|s, a)\,\nabla\tilde v_\pi(s') - \sum_s d_\pi(s)\,\nabla\tilde v_\pi(s)$$

SLIDE 15

Continuing:

$$\begin{aligned}
\sum_s d_\pi(s)\,\nabla r(\pi)
&= \sum_s d_\pi(s) \sum_a \nabla\pi(a|s, \theta)\,\tilde q_\pi(s, a)
 + \sum_s d_\pi(s) \sum_a \pi(a|s, \theta) \sum_{s'} p(s'|s, a)\,\nabla\tilde v_\pi(s')
 - \sum_s d_\pi(s)\,\nabla\tilde v_\pi(s) \\
&= \sum_s d_\pi(s) \sum_a \nabla\pi(a|s, \theta)\,\tilde q_\pi(s, a)
 + \sum_{s'} \underbrace{\sum_s d_\pi(s) \sum_a \pi(a|s, \theta)\, p(s'|s, a)}_{d_\pi(s')}\,\nabla\tilde v_\pi(s')
 - \sum_s d_\pi(s)\,\nabla\tilde v_\pi(s)
\end{aligned}$$

The last two terms cancel, and since $\sum_s d_\pi(s) = 1$:

$$\nabla r(\pi) = \sum_s d_\pi(s) \sum_a \nabla\pi(a|s, \theta)\,\tilde q_\pi(s, a). \qquad \text{Q.E.D.}$$

SLIDE 16

Complete PG algorithm

Initialize parameters of the policy, θ ∈ ℝⁿ, and of the state-value function, w ∈ ℝᵐ
Initialize eligibility traces e_θ ∈ ℝⁿ and e_w ∈ ℝᵐ to 0
Initialize R̄ = 0

On each step, in state S:
    Choose A according to π(·|S, θ)
    Take action A, observe S′, R
    δ ← R − R̄ + v̂(S′, w) − v̂(S, w)              (form TD error from critic)
    R̄ ← R̄ + α_θ δ                                (update average-reward estimate)
    e_w ← λ e_w + ∇_w v̂(S, w)                     (update eligibility trace for critic)
    w ← w + α_w δ e_w                             (update critic parameters)
    e_θ ← λ e_θ + ∇_θ π(A|S, θ) / π(A|S, θ)       (update eligibility trace for actor)
    θ ← θ + α_θ δ e_θ                             (update actor parameters)
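A minimal Python sketch of this complete algorithm on a small synthetic MDP, using one-hot state features for the critic and a linear-exponential (soft-max) policy for the actor; the environment, features, and step sizes are assumptions made for the sketch, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy continuing MDP (hypothetical): 3 states, 2 actions.
n_states, n_actions = 3, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                        # expected reward r(s,a)

def x(s):                      # state features for the critic (one-hot)
    v = np.zeros(n_states); v[s] = 1.0; return v

def phi(s, a):                 # state-action features for the actor (one-hot)
    v = np.zeros(n_states * n_actions); v[s * n_actions + a] = 1.0; return v

def pi(s, theta):              # linear-exponential (soft-max) policy pi(.|s,theta)
    prefs = np.array([theta @ phi(s, a) for a in range(n_actions)])
    prefs -= prefs.max()
    e = np.exp(prefs)
    return e / e.sum()

theta   = np.zeros(n_states * n_actions)   # actor parameters
w       = np.zeros(n_states)               # critic parameters
e_theta = np.zeros_like(theta)             # actor eligibility trace
e_w     = np.zeros_like(w)                 # critic eligibility trace
avg_R   = 0.0                              # average-reward estimate R-bar
alpha_theta, alpha_w, lam = 0.01, 0.05, 0.8

s = 0
for step in range(20_000):
    probs = pi(s, theta)
    a = rng.choice(n_actions, p=probs)            # choose A ~ pi(.|S,theta)
    r = R[s, a] + rng.normal(scale=0.1)           # take A, observe R
    s_next = rng.choice(n_states, p=P[s, a])      # and S'

    delta = r - avg_R + w @ x(s_next) - w @ x(s)  # form TD error from critic
    avg_R += alpha_theta * delta                  # update average-reward estimate
    e_w = lam * e_w + x(s)                        # update eligibility trace for critic
    w += alpha_w * delta * e_w                    # update critic parameters
    elig = phi(s, a) - sum(probs[b] * phi(s, b) for b in range(n_actions))
    e_theta = lam * e_theta + elig                # update eligibility trace for actor
    theta += alpha_theta * delta * e_theta        # update actor parameters
    s = s_next
```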

SLIDE 17

The generality of the policy-gradient strategy

  • Can be applied whenever we can compute the effect of parameter changes on the action probabilities
  • E.g., has been applied to spiking neuron models
  • There are many possibilities other than linear-exponential and linear-gaussian
  • e.g., mixture of random, argmax, and fixed-width gaussian; learn the mixing weights; drift/diffusion models


SLIDE 18

e.g., linear-gaussian policies (continuous actions)

(Plot: action probability density over the action, with μ and σ linear in the state)

SLIDE 19

e.g., linear-gaussian policies (continuous actions)

  • The mean and std. dev. for the action taken in state s are linear and linear-exponential in the state-feature vector φ(s)
  • The probability density function for the action taken in state s is gaussian

$$\mu(s) \doteq \theta_\mu^\top \phi(s) \qquad\quad \sigma(s) \doteq \exp\!\big(\theta_\sigma^\top \phi(s)\big) \qquad\quad \theta \doteq \big(\theta_\mu^\top;\ \theta_\sigma^\top\big)^\top$$

$$\pi(a|s, \theta) \doteq \frac{1}{\sigma(s)\sqrt{2\pi}}\,\exp\!\left(-\frac{(a - \mu(s))^2}{2\sigma(s)^2}\right)$$
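A minimal sketch of sampling an action from this linear-gaussian policy; phi_s stands for the state-feature vector φ(s), and the parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gaussian_action(theta_mu, theta_sigma, phi_s):
    """Draw a ~ N(mu(s), sigma(s)^2) with mu(s) = theta_mu . phi(s), sigma(s) = exp(theta_sigma . phi(s))."""
    mu = theta_mu @ phi_s
    sigma = np.exp(theta_sigma @ phi_s)
    return rng.normal(mu, sigma)

phi_s = np.array([1.0, 0.5, -0.2])                      # hypothetical phi(s)
a = sample_gaussian_action(np.array([0.3, 0.1, 0.0]),   # hypothetical theta_mu
                           np.array([-0.5, 0.0, 0.2]),  # hypothetical theta_sigma
                           phi_s)
```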

SLIDE 20

Gaussian eligibility functions

$$\frac{\nabla_{\theta_\mu}\, \pi(a|s, \theta)}{\pi(a|s, \theta)} = \frac{1}{\sigma(s)^2}\big(a - \mu(s)\big)\,\phi_\mu(s)$$

$$\frac{\nabla_{\theta_\sigma}\, \pi(a|s, \theta)}{\pi(a|s, \theta)} = \left(\frac{(a - \mu(s))^2}{\sigma(s)^2} - 1\right)\phi_\sigma(s)$$
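A matching sketch of the two eligibility vectors above, assuming (as a simplification) that the same feature vector φ(s) is used for both the μ and σ parts:

```python
import numpy as np

def gaussian_eligibilities(a, theta_mu, theta_sigma, phi_s):
    """Return grad_{theta_mu} pi / pi and grad_{theta_sigma} pi / pi for the linear-gaussian policy."""
    mu = theta_mu @ phi_s
    sigma = np.exp(theta_sigma @ phi_s)
    elig_mu = (a - mu) / sigma**2 * phi_s                  # (1/sigma(s)^2)(a - mu(s)) phi(s)
    elig_sigma = ((a - mu)**2 / sigma**2 - 1.0) * phi_s    # ((a - mu(s))^2/sigma(s)^2 - 1) phi(s)
    return elig_mu, elig_sigma
```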

SLIDE 21

The generality of the policy-gradient strategy (2)

  • Can be applied whenever we can compute the effect of parameter changes on the action probabilities
  • Can we apply PG when outcomes are viewed as actions?
  • e.g., the action of "turning on the light" or the action of "going to the bank"
  • Is this an alternate strategy for temporal abstraction?
  • We would need to learn, not compute, the gradient of these states w.r.t. the policy parameter


SLIDE 22

Have we eliminated action?

  • If any state can be an action, then what is still special about actions?
  • The parameters/weights are what we can really, directly control
  • We have always, in effect, "sensed" our actions (even in the ε-greedy case)
  • Perhaps actions are just sensory signals that we can usually control easily
  • Perhaps there is no longer any need for a special concept of action in the RL framework