Policy Approximation



1. Policy Approximation
• Policy = a function from state to action
• How does the agent select actions?
• In such a way that it can be affected by learning?
• In such a way as to assure exploration?
• Approximation: there are too many states and/or actions to represent all policies
• To handle large/continuous action spaces

2. What is learned and stored?
1. Action-value methods: learn the value of each action; pick the max (usually)
2. Policy-gradient methods: learn the parameters $u$ of a stochastic policy; update by $\nabla_u \text{Performance}$
   • including actor-critic methods, which learn both value and policy parameters
3. Dynamic Policy Programming
4. Drift-diffusion models (psychology)

3. Actor-critic architecture
[diagram: actor-critic agent interacting with the world]
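The slide's diagram is not recoverable here, but the architecture it names is standard: an actor (the parameterized policy) and a critic (a learned state-value function) both see the state, and the critic's TD error drives updates to both. Below is a minimal one-step actor-critic sketch, assuming a softmax actor over linear preferences and a linear critic; the feature arrays, step sizes `alpha_u`/`alpha_w`, and the toy call at the end are illustrative assumptions, not content from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(prefs):
    """Numerically stable softmax over action preferences."""
    z = prefs - prefs.max()
    e = np.exp(z)
    return e / e.sum()

def select_action(u, x_sa):
    """Actor: sample an action from a softmax over linear preferences u . x_sa[:, a]."""
    probs = softmax(x_sa.T @ u)
    return rng.choice(len(probs), p=probs), probs

def actor_critic_update(u, w, x_sa, a, probs, phi_s, phi_s_next, reward,
                        alpha_u=0.05, alpha_w=0.1, gamma=1.0):
    """One actor-critic update after observing (state, action, reward, next state).

    u : actor (policy) parameters        w : critic (state-value) weights
    x_sa : (n_features, n_actions) state-action features for the current state
    phi_s, phi_s_next : state features for the current and next state
    """
    # Critic: one-step TD error using the learned state values w . phi.
    delta = reward + gamma * (w @ phi_s_next) - (w @ phi_s)
    w = w + alpha_w * delta * phi_s

    # Actor: move the policy parameters along delta * grad log pi(a|s);
    # for a softmax over linear preferences this gradient is x_sa - E_pi[x_sb].
    grad_log_pi = x_sa[:, a] - x_sa @ probs
    u = u + alpha_u * delta * grad_log_pi
    return u, w

# Toy call with random features, just to show the shapes involved.
n_features, n_actions = 4, 3
u, w = np.zeros(n_features), np.zeros(n_features)
x_sa = rng.normal(size=(n_features, n_actions))
phi_s, phi_s_next = rng.normal(size=n_features), rng.normal(size=n_features)
a, probs = select_action(u, x_sa)
u, w = actor_critic_update(u, w, x_sa, a, probs, phi_s, phi_s_next, reward=1.0)
```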

4. Action-value methods
• The value of an action in a state, given a policy, is the expected future reward starting from that state, taking that first action, then following the policy thereafter:
  $q_\pi(s, a) = \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} R_t \,\middle|\, S_0 = s, A_0 = a\right]$
• Policy: pick the max most of the time,
  $\hat{A}_t = \arg\max_a Q_t(S_t, a)$,
  but sometimes pick at random (ε-greedy)
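A minimal sketch of the ε-greedy rule just described: greedy with respect to the current action values most of the time, uniformly random with probability ε. The array `q` of action-value estimates and the rate `epsilon` are illustrative names, not from the slide.

```python
import numpy as np

def epsilon_greedy(q, epsilon=0.1, rng=np.random.default_rng()):
    """Pick argmax_a Q(S_t, a) most of the time; with probability epsilon pick uniformly."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))   # explore: random action
    return int(np.argmax(q))               # exploit: greedy action

# e.g. with estimated values for three actions
q_values = np.array([0.2, 1.5, -0.3])
action = epsilon_greedy(q_values, epsilon=0.1)
```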

5. We should never discount when approximating policies!
• γ is OK if there is a start state/distribution

6. Average reward setting
• All rewards are compared to the average reward:
  $q_\pi(s, a) = \mathbb{E}\left[\sum_{t=1}^{\infty} \big(R_t - \bar{r}(\pi)\big) \,\middle|\, S_0 = s, A_0 = a\right]$
• where
  $\bar{r}(\pi) = \lim_{t\to\infty} \frac{1}{t}\, \mathbb{E}\left[R_1 + R_2 + \cdots + R_t \mid A_{0:t-1} \sim \pi\right]$
• and we learn an approximation $\bar{R}_t \approx \bar{r}(\pi_t)$
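In practice the limit defining $\bar{r}(\pi)$ is not computed directly; consistent with "we learn an approximation $\bar{R}_t \approx \bar{r}(\pi_t)$", a common choice is an incremental running estimate. A sketch, with the step size `beta` as an assumed hyperparameter:

```python
def update_average_reward(r_bar, reward, beta=0.01):
    """Incrementally track the average reward: R_bar <- R_bar + beta * (R_t - R_bar)."""
    return r_bar + beta * (reward - r_bar)

# Differential returns then use (reward - r_bar) in place of discounting.
r_bar = 0.0
for reward in [1.0, 0.0, 2.0, 1.0]:
    r_bar = update_average_reward(r_bar, reward)
```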

7. Why approximate policies rather than values?
• In many problems, the policy is simpler to approximate than the value function
• In many problems, the optimal policy is stochastic
  • e.g., bluffing, POMDPs
• To enable smoother change in policies
• To avoid a search on every step (the max)
• To better relate to biology

8. Policy-gradient methods
• The policy itself is learned and stored
  • the policy is parameterized by $u \in \mathbb{R}^n$
  • we learn and store $u$:
    $\Pr[A_t = a] = \pi_{u_t}(a \mid S_t)$
• $u$ is updated by approximate gradient ascent:
  $u_{t+1} = u_t + \alpha\, \widehat{\nabla_u \bar{r}(\pi_u)}$
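The ascent step only needs some sample estimate of the performance gradient. A minimal sketch of that update, assuming (as one common REINFORCE-style choice, not necessarily the estimator intended on the slide) that the sample gradient is an advantage-like signal times the score function $\nabla_u \log \pi_u(A_t \mid S_t)$; all names below are illustrative.

```python
import numpy as np

def policy_gradient_step(u, grad_log_pi, advantage, alpha=0.01):
    """Stochastic gradient ascent on performance:
    u <- u + alpha * advantage * grad_u log pi_u(A_t | S_t)."""
    return u + alpha * advantage * grad_log_pi

u = np.zeros(4)
grad_log_pi = np.array([0.5, -0.2, 0.1, 0.0])   # from the chosen policy form
advantage = 1.3                                  # e.g. return minus a baseline
u = policy_gradient_step(u, grad_log_pi, advantage)
```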

9. e.g., linear-exponential policies (discrete actions)
• The "preference" for action $a$ in state $s$ is linear in $u$:
  $u^\top x_{sa} \equiv \sum_i u(i)\, x_{sa}(i)$, where $x_{sa} \in \mathbb{R}^n$ is a feature vector
• The probability of action $a$ in state $s$ is exponential in its preference:
  $\pi_u(a \mid s) = \dfrac{e^{u^\top x_{sa}}}{\sum_b e^{u^\top x_{sb}}}$
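A direct transcription of the linear-exponential (softmax) policy above, assuming the state-action feature vectors $x_{sa}$ are stored as the columns of a matrix. The second function is the standard closed-form score for this policy family, useful in the gradient update of the previous slide.

```python
import numpy as np

def linear_exponential_policy(u, x_s):
    """pi_u(a|s) = exp(u^T x_sa) / sum_b exp(u^T x_sb).

    x_s: (n_features, n_actions) matrix whose column a is the feature vector x_sa.
    """
    prefs = x_s.T @ u                # preference u^T x_sa for each action
    prefs -= prefs.max()             # subtract the max for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def score(u, x_s, a):
    """grad_u log pi_u(a|s) = x_sa - sum_b pi_u(b|s) x_sb for the softmax policy."""
    probs = linear_exponential_policy(u, x_s)
    return x_s[:, a] - x_s @ probs
```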

10. e.g., linear-gaussian policies (continuous actions)
[figure: Gaussian probability density over the action, with μ and σ determined by the state]

11. e.g., linear-gaussian policies (continuous actions)
• The mean and std. dev. for the action taken in state $s$ are linear and linear-exponential in $u_\mu$, $u_\sigma$:
  $\mu(s) = u_\mu^\top \phi_s, \qquad \sigma(s) = e^{u_\sigma^\top \phi_s}$
• The probability density function for the action taken in state $s$ is Gaussian:
  $\pi_u(a \mid s) = \dfrac{1}{\sigma(s)\sqrt{2\pi}}\, e^{-\frac{(a-\mu(s))^2}{2\sigma(s)^2}}$
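A direct sketch of the policy above: mean linear in $u_\mu$, standard deviation linear-exponential in $u_\sigma$, and the action sampled from the resulting normal density. The state features `phi_s` and the random generator are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_gaussian_policy(u_mu, u_sigma, phi_s):
    """Return (mu, sigma) with mu(s) = u_mu . phi_s and sigma(s) = exp(u_sigma . phi_s)."""
    mu = u_mu @ phi_s
    sigma = np.exp(u_sigma @ phi_s)
    return mu, sigma

def sample_action(u_mu, u_sigma, phi_s):
    """Draw an action from the Gaussian density pi_u(.|s)."""
    mu, sigma = linear_gaussian_policy(u_mu, u_sigma, phi_s)
    return rng.normal(mu, sigma)

def log_density(a, mu, sigma):
    """log pi_u(a|s) for the Gaussian density on the slide."""
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
```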

12. The generality of the policy-gradient strategy
• Can be applied whenever we can compute the effect of parameter changes on the action probabilities, $\nabla_u \pi(a \mid s)$
• E.g., has been applied to spiking neuron models
• There are many possibilities other than linear-exponential and linear-gaussian
  • e.g., mixture of random, argmax, and fixed-width gaussian; learn the mixing weights
  • drift/diffusion models?
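One concrete instance of the "mixture" idea above, as a sketch rather than the slides' construction: a discrete-action policy that mixes a uniform-random component with an argmax component, where only the mixing weight is learned. The point is that $\nabla_\theta \pi(a \mid s)$ is easy to compute for the mixing parameter, so the policy-gradient strategy still applies; `theta` and `q` are illustrative names.

```python
import numpy as np

def mixture_policy(theta, q):
    """pi(a|s) = c * uniform(a) + (1 - c) * greedy(a), with c = sigmoid(theta).

    theta: scalar mixing parameter (learned); q: action-value estimates for state s.
    """
    n = len(q)
    c = 1.0 / (1.0 + np.exp(-theta))
    greedy = np.zeros(n)
    greedy[np.argmax(q)] = 1.0
    return c * np.ones(n) / n + (1.0 - c) * greedy

def d_pi_d_theta(theta, q):
    """Gradient of each action probability w.r.t. the mixing parameter theta."""
    n = len(q)
    c = 1.0 / (1.0 + np.exp(-theta))
    greedy = np.zeros(n)
    greedy[np.argmax(q)] = 1.0
    return c * (1.0 - c) * (np.ones(n) / n - greedy)   # dc/dtheta * (uniform - greedy)
```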

13. Can policy-gradient methods solve the twitching problem? (the problem of decisiveness in adaptive behavior)
• The problem:
  • we need stochastic policies to get exploration
  • but all of our policies have been i.i.d. (independent, identically distributed)
  • if the time step is small, the robot just twitches!
  • really, no aspect of behavior should depend on the length of the time step

14. Can we design a parameterized policy whose stochasticity is independent of the time step?
• let a "noise" variable take a random walk, drifting but tending back to zero
• add it to the action, and adapt its parameters by the PG algorithm (or have several such noise variables)
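The "random walk, drifting but tending back to zero" can be read as an Ornstein-Uhlenbeck-style process; that reading, along with the parameters `theta_pull`, `sigma_noise`, and `dt`, is an assumption for illustration. A sketch in which the noise is added to the deterministic part of the action, so exploration is correlated across steps and does not collapse into twitching as the time step shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)

def step_noise(z, dt, theta_pull=1.0, sigma_noise=0.3):
    """Random walk that drifts back toward zero (Ornstein-Uhlenbeck-style update)."""
    return z - theta_pull * z * dt + sigma_noise * np.sqrt(dt) * rng.normal()

def act(mean_action, z):
    """Add the slowly varying noise to the deterministic part of the action."""
    return mean_action + z

# The noise is correlated across steps, so shrinking dt does not
# turn the behavior into independent per-step twitching.
z, dt = 0.0, 0.01
for _ in range(5):
    z = step_noise(z, dt)
    a = act(mean_action=0.5, z=z)
```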

15. The generality of the policy-gradient strategy (2)
• Can be applied whenever we can compute the effect of parameter changes on the action probabilities, $\nabla_u \pi(a \mid s)$
• Can we apply PG when outcomes are viewed as actions?
  • e.g., the action of "turning on the light" or the action of "going to the bank"
  • is this an alternate strategy for temporal abstraction?
• We would need to learn, not compute, the gradient of these states w.r.t. the policy parameters

16. Have we eliminated action?
• If any state can be an action, then what is still special about actions?
• The parameters/weights are what we can really, directly control
• We have always, in effect, "sensed" our actions (even in the ε-greedy case)
• Perhaps actions are just sensory signals that we can usually control easily
• Perhaps there is no longer any need for a special concept of action in the RL framework
