Trust Region Policy Optimization

Yixin Lin

Duke University yixin.lin@duke.edu

March 28, 2017


Overview

1. Preliminaries
   - Markov Decision Processes
   - Policy iteration
   - Policy gradients

2. TRPO
   - Kakade & Langford
   - Natural policy gradients
   - Overview of TRPO
   - Practice
   - Experiments

3. References


Introduction

- Reinforcement learning is the problem of sequential decision making in a dynamic environment
- Goal: capture the most important aspects of an agent making decisions
  - Input (sensing the state of the environment)
  - Action (choosing how to affect the environment)
  - Goal (preferring some states of the environment over others)
- This is incredibly general
- Examples
  - Robots (and their components)
  - Games
  - Better A/B testing


The Markov Decision Process (MDP)

- S: set of possible states of the environment
- p(s_init), s_init ∈ S: a distribution over the initial state
- Markov property: we assume that the current state summarizes everything we need to remember
- A: set of possible actions
- P(s_new | s_old, a): state transition probabilities, for each state s and action a
- R: S → ℝ: reward function
- γ ∈ [0, 1]: discount factor
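To make these ingredients concrete, here is a minimal sketch of a tabular MDP in Python; the field names and the two-state toy example are illustrative assumptions, not anything from the slides.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TabularMDP:
    p_init: np.ndarray       # (|S|,) distribution over the initial state
    transitions: np.ndarray  # (|S|, |A|, |S|) array: P(s_new | s_old, a)
    rewards: np.ndarray      # (|S|,) array: R(s)
    gamma: float             # discount factor in [0, 1]

# Toy example with two states and two actions.
mdp = TabularMDP(
    p_init=np.array([1.0, 0.0]),
    transitions=np.array([[[0.9, 0.1], [0.2, 0.8]],
                          [[0.5, 0.5], [0.1, 0.9]]]),
    rewards=np.array([0.0, 1.0]),
    gamma=0.99,
)
```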


Policies and value functions

- π: a policy (what action to do, given a state)
- Return: (possibly discounted) sum of future rewards, r_t + γ r_{t+1} + γ² r_{t+2} + ...
- Performance of a policy: η(π) = E[return]
- V^π(s) = E_π[return | s]
  - How good is a state, given a policy?
- Q^π(s, a) = E_π[return | s, a]
  - How good is an action at a state, given a policy?
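A minimal sketch of these quantities in Python: a discounted return from a reward sequence, and a Monte Carlo estimate of V^π(s) by averaging returns over rollouts. The `sample_rewards` argument is a hypothetical helper that rolls out the policy from s and returns the observed reward sequence; it is assumed purely for illustration.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

def mc_value_estimate(sample_rewards, policy, s, gamma, n_episodes=1000):
    """Monte Carlo estimate of V^pi(s): average the return over rollouts starting at s."""
    returns = [discounted_return(sample_rewards(policy, s), gamma)
               for _ in range(n_episodes)]
    return float(np.mean(returns))
```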


Policy iteration

- Assume a perfect model of the MDP
- Alternate between the following until convergence (a tabular sketch follows below)
  - Evaluating the policy (computing V^π)
    - For each state s, V(s) = E_{s', r}[r + γ V(s')]
    - Repeat until convergence (or just once for value iteration)
  - Setting the policy to be greedy: π(s) = arg max_a E[r + γ V^π(s')]
- Guaranteed convergence (for both policy and value iteration)
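Continuing the toy TabularMDP sketch above (an illustration under the same assumed conventions, not code from the slides), policy iteration alternates evaluation sweeps with greedy improvement:

```python
import numpy as np

def policy_iteration(mdp, n_eval_sweeps=100, tol=1e-8):
    n_states, n_actions = mdp.transitions.shape[:2]
    policy = np.zeros(n_states, dtype=int)   # start from an arbitrary policy
    V = np.zeros(n_states)
    while True:
        # Policy evaluation: sweep V(s) = E[r + gamma * V(s')] under the current policy
        for _ in range(n_eval_sweeps):
            P_pi = mdp.transitions[np.arange(n_states), policy]   # (|S|, |S|)
            V_new = mdp.rewards + mdp.gamma * P_pi @ V
            converged = np.max(np.abs(V_new - V)) < tol
            V = V_new
            if converged:
                break
        # Policy improvement: act greedily with respect to the current V
        Q = mdp.rewards[:, None] + mdp.gamma * mdp.transitions @ V   # (|S|, |A|)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```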


Policy gradients

- Policy iteration scales very badly: we have to repeatedly evaluate the policy on all states
- Parameterize the policy: a ∼ π(· | s; θ)
- We can sample instead


Policy gradients

- Sample a lot of trajectories under the current policy (simulate your environment using the policy)
- Make good actions more probable
  - Specifically, estimate the gradient using the score function gradient estimator (sketched in code below)
  - For each trajectory τ_i, ĝ_i = R(τ_i) ∇_θ log p(τ_i | θ)
  - Intuitively, take the gradient of the log probability of the trajectory, then weight it by the final reward
  - Reduce variance by exploiting temporal structure and other tricks (e.g. a baseline)
- Replace the reward by the advantage function A^π(s, a) = Q^π(s, a) − V^π(s)
  - Intuitively, how much better is the action we picked than the average action?
- Repeat
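A minimal numpy sketch of the score-function estimator for a linear-softmax policy over discrete actions; the feature-vector states and the trajectory format are assumptions for illustration. It uses the fact that ∇_θ log p(τ | θ) = Σ_t ∇_θ log π(a_t | s_t; θ), since the dynamics terms do not depend on θ.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_softmax_policy(theta, s, a):
    """∇_θ log π(a|s; θ) for a linear-softmax policy; theta has shape (|A|, d), s has shape (d,)."""
    probs = softmax(theta @ s)            # (|A|,)
    grad = -np.outer(probs, s)            # -π(a'|s) * s for every action a'
    grad[a] += s                          # plus s for the action actually taken
    return grad

def score_function_estimate(theta, trajectory, total_reward):
    """ĝ_i = R(τ_i) * Σ_t ∇_θ log π(a_t|s_t; θ); trajectory is a list of (s, a) pairs."""
    g = sum(grad_log_softmax_policy(theta, s, a) for s, a in trajectory)
    return total_reward * g
```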


Vanilla policy gradient algorithm/REINFORCE

Initialize policy π_θ
while gradient estimate has not converged do
    Sample trajectories using π
    for each timestep do
        Compute return and advantage estimate
    end for
    Refit optimal baseline
    Update the policy using gradient estimate ĝ
end while
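A compact sketch of this loop, reusing the linear-softmax policy helpers from the previous sketch. The environment interface (`env.reset()` returning a state, `env.step(a)` returning `(next_state, reward, done)`) and the constant baseline are assumptions made for illustration, not details from the slides.

```python
import numpy as np

def reinforce(env, theta, n_iters=200, n_trajs=20, gamma=0.99, lr=0.01):
    for _ in range(n_iters):
        grads, returns = [], []
        for _ in range(n_trajs):
            s, done, traj, rewards = env.reset(), False, [], []
            while not done:
                a = np.random.choice(theta.shape[0], p=softmax(theta @ s))
                s_next, r, done = env.step(a)
                traj.append((s, a)); rewards.append(r); s = s_next
            returns.append(sum(gamma**k * r for k, r in enumerate(rewards)))
            grads.append(sum(grad_log_softmax_policy(theta, s, a) for s, a in traj))
        baseline = np.mean(returns)                  # simple constant baseline
        g_hat = np.mean([(R - baseline) * g for R, g in zip(returns, grads)], axis=0)
        theta += lr * g_hat                          # gradient ascent on expected return
    return theta
```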


Connection to supervised learning

- Minimizing L(θ) = Σ_t log π(a_t | s_t; θ) Â_t
- In the paper, they use cost functions instead of reward functions
- Intuitively, we have some parameterized policy ("model") giving us a distribution over actions
- We don't have the correct action ("label"), so we just use the reward at the end as our label
- We can do better. How do we do credit assignment?
  - Baseline (roughly encourage half of the actions, not just all of them)
  - Discounted future reward (actions affect the near-term future), etc.
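A minimal sketch of this supervised-learning view, reusing the softmax policy above: the per-timestep log-probabilities play the role a log-likelihood of "labels" would, weighted by precomputed advantage estimates Â_t (assumed given here). Its gradient with respect to θ is exactly the policy gradient estimator from the previous slides.

```python
import numpy as np

def surrogate_objective(theta, states, actions, advantages):
    """L(θ) = Σ_t log π(a_t | s_t; θ) * Â_t."""
    total = 0.0
    for s, a, adv in zip(states, actions, advantages):
        total += np.log(softmax(theta @ s)[a]) * adv
    return total
```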


Kakade & Langford: conservative policy iteration

- "A useful identity" for η(π̃), the expected discounted cost of a new policy π̃:
  η(π̃) = η(π) + E_π̃[ Σ_{t=0}^∞ γ^t A^π(s_t, a_t) ]
        = η(π) + Σ_s ρ_π̃(s) Σ_a π̃(a|s) A^π(s, a)
- Intuitively: the expected return of a new policy is the expected return of the old policy, plus how much better the new policy is at each state
- Local approximation: switch out ρ_π̃ for ρ_π, since we only have the state visitation frequency for the old policy, not the new policy:
  L_π(π̃) = η(π) + Σ_s ρ_π(s) Σ_a π̃(a|s) A^π(s, a)
- Kakade & Langford proved that optimizing this local approximation is good for small step sizes, but for mixture policies only
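A sketch of why the identity holds, written in the return convention used earlier in the deck (a standard reconstruction, not copied from the slides): expand A^π(s_t, a_t) = E[r_t + γ V^π(s_{t+1})] − V^π(s_t) and let the value terms telescope.

```latex
\begin{align*}
\mathbb{E}_{\tau \sim \tilde\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t, a_t)\right]
  &= \mathbb{E}_{\tau \sim \tilde\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t
       \big(r_t + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t)\big)\right] \\
  &= \mathbb{E}_{\tau \sim \tilde\pi}\!\left[-V^{\pi}(s_0) + \sum_{t=0}^{\infty} \gamma^t r_t\right]
     && \text{(the $V^{\pi}$ terms telescope)} \\
  &= -\eta(\pi) + \eta(\tilde\pi)
\end{align*}
```

since s_0 is drawn from the same initial-state distribution under both policies, so E[V^π(s_0)] = η(π); rearranging gives the identity above.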


Natural policy gradients

- In this paper, they prove that η(π̃) ≤ L_π(π̃) + C · D_KL^max(π, π̃), where C is a constant that depends on γ
- Intuitively, we optimize the approximation, but regularize with the KL divergence between the old and new policy
- This algorithm is called the natural policy gradient
- Problem: choosing the penalty coefficient C is difficult


Overview of TRPO

- Instead of adding the KL divergence as a cost, simply use it as an optimization constraint
- TRPO algorithm: minimize L_π(π̃) subject to the constraint D_KL^max(π, π̃) ≤ δ, for some easily-picked hyperparameter δ
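To connect the constraint to the update that is actually computed (a standard reconstruction written in the maximize-the-surrogate convention, not taken verbatim from the slides): linearize the surrogate around the current parameters θ_old and take a quadratic approximation to the KL constraint, where g is the surrogate gradient and F is the Fisher information matrix (the Hessian of the KL divergence at θ_old).

```latex
\max_{\theta}\; g^\top (\theta - \theta_{\mathrm{old}})
\quad \text{s.t.} \quad
\tfrac{1}{2}\, (\theta - \theta_{\mathrm{old}})^\top F\, (\theta - \theta_{\mathrm{old}}) \le \delta,
\qquad \text{with solution} \quad
\theta = \theta_{\mathrm{old}} + \sqrt{\tfrac{2\delta}{\,g^\top F^{-1} g\,}}\; F^{-1} g .
```

This is why the constrained form only needs δ: the step direction is the natural gradient F⁻¹g, and δ directly fixes its length.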


Practical considerations

- How do we sample trajectories?
  - Single-path: simply run each sample to completion
  - "Vine": for each sampled trajectory, pick random states along the trajectory and perform small rollouts
- How do we compute the gradient?
  - Use the conjugate gradient algorithm followed by a line search (sketched below)
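A minimal sketch of the conjugate gradient solver used to approximate F⁻¹g without ever forming F; `fisher_vector_product` stands for a user-supplied function that returns F·v (an assumed interface, not code from the paper).

```python
import numpy as np

def conjugate_gradient(fisher_vector_product, g, n_iters=10, tol=1e-10):
    """Approximately solve F x = g, given only a function v -> F v."""
    x = np.zeros_like(g)
    r = g.copy()            # residual g - F x (x = 0 initially)
    p = g.copy()            # search direction
    r_dot = r @ r
    for _ in range(n_iters):
        Fp = fisher_vector_product(p)
        alpha = r_dot / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x
```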


Algorithm

while gradient not converged do
    Collect trajectories (either single-path or vine)
    Estimate advantage function
    Compute policy gradient estimator
    Solve the quadratic approximation to L(π_θ) using CG
    Rescale using line search
    Apply update
end while
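And a sketch of the "rescale using line search" step, combining the CG direction with the step length from the quadratic approximation above (written in the maximize-the-surrogate convention); `surrogate` and `kl_divergence` are assumed callables evaluated at candidate parameters, illustrative rather than the paper's actual code.

```python
import numpy as np

def trpo_step(theta_old, g, fisher_vector_product, surrogate, kl_divergence,
              delta=0.01, backtrack_ratio=0.5, max_backtracks=10):
    """Natural-gradient step rescaled by a backtracking line search on the KL constraint."""
    step_dir = conjugate_gradient(fisher_vector_product, g)   # ≈ F^{-1} g, from the sketch above
    step_size = np.sqrt(2 * delta / (step_dir @ fisher_vector_product(step_dir)))
    full_step = step_size * step_dir
    for i in range(max_backtracks):
        theta_new = theta_old + (backtrack_ratio ** i) * full_step
        # Accept the first candidate that improves the surrogate and respects the constraint
        if surrogate(theta_new) > surrogate(theta_old) and kl_divergence(theta_old, theta_new) <= delta:
            return theta_new
    return theta_old   # no acceptable step found; keep the old parameters
```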


Experiments - MuJoCo robotic locomotion

- Link to demonstration
- Same δ hyperparameter across experiments


Experiments - MuJoCo learning curves

- Link to demonstration
- Same δ hyperparameter across experiments


Experiments - Atari

- Not always better than previous techniques, but consistently decent
- Very little problem-specific engineering


Takeaway

- TRPO is a good default policy gradient technique: it scales well and requires minimal hyperparameter tuning
- Just use a KL constraint when optimizing the local approximation to the objective


References

- R. S. Sutton. Introduction to Reinforcement Learning
- Kakade and Langford. Approximately Optimal Approximate Reinforcement Learning
- Schulman et al. Trust Region Policy Optimization
- Schulman, Levine, Finn. Deep Reinforcement Learning course: link
- Andrej Karpathy. Deep Reinforcement Learning: From Pong to Pixels: link
- Trust Region Policy Optimization summary: link


Thanks!

Link to presentation: yixinlin.net/trpo
