Deep Reinforcement Learning through Policy Optimization
Pieter Abbeel, John Schulman – OpenAI / Berkeley AI Research Lab
Reinforcement Learning
[Figure source: Sutton & Barto, 1998]
- Consider a control policy parameterized by a parameter vector θ: maximize E[ Σ_{t=0}^{H} R(s_t, u_t) | π_θ ]
- Often a stochastic policy class (smooths out the optimization problem): π_θ(u|s) is the probability of action u in state s
[Figure source: Sutton & Barto, 1998]
- Often can be simpler than Q or V
  - E.g., robotic grasp
- V: doesn't prescribe actions
  - Would need a dynamics model (+ compute 1 Bellman back-up)
- Q: need to be able to efficiently solve max_u Q(s, u)
  - Challenge for continuous / high-dimensional action spaces*
*Some recent work (partially) addressing this:
NAF: Gu, Lillicrap, Sutskever, Levine, ICML 2016; Input Convex NNs: Amos, Xu, Kolter, arXiv 2016
- Kohl and Stone, 2004
- Tedrake et al, 2005
- Kober and Peters, 2009
- Ng et al, 2004
- Silver et al, 2014 (DPG)
- Lillicrap et al, 2015 (DDPG)
- Schulman et al, 2016 (TRPO + GAE)
- Levine*, Finn*, et al, 2016 (GPS)
- Mnih et al, 2015 (A3C)
- Silver*, Huang*, et al, 2016 (AlphaGo**)
DQN: Mnih et al, Nature 2015; Double DQN: Van Hasselt et al, AAAI 2015; Dueling Architecture: Wang et al, ICML 2016; Prioritized Replay: Schaul et al, ICLR 2016; David Silver, ICML 2016 tutorial
Outline:
- Derivative-free methods
  - Cross Entropy Method (CEM) / Finite Differences / Fixing Random Seed
- Likelihood Ratio (LR) Policy Gradient
  - Derivation / Connection w/ Importance Sampling
- Natural Gradient / Trust Regions (-> TRPO)
- Variance Reduction using Value Functions (Actor-Critic) (-> GAE, A3C)
- Pathwise Derivatives (PD) (-> DPG, DDPG, SVG)
- Stochastic Computation Graphs (generalizes LR / PD)
- Guided Policy Search (GPS)
- Inverse Reinforcement Learning
Up next: Derivative-free methods (Cross Entropy Method, Finite Differences, Fixing Random Seed).
Cross Entropy Method (CEM):
- Views U(θ), the expected return of π_θ over an episode, as a black box
- Ignores all other information collected during the episode
Pseudocode:
  for iteration i = 1, 2, ...
      for population member e = 1, 2, ...
          sample θ(e) ∼ P_µ(i)(θ)
          execute roll-outs under π_θ(e)
          store (θ(e), U(e))
      endfor
      µ(i+1) = arg max_µ Σ_ē log P_µ(θ(ē)), where ē indexes over the top p% of population members
  endfor
= an evolutionary algorithm with population P_µ(i)(θ)
- Can work embarrassingly well
[NIPS 2013]
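To make the loop above concrete, here is a minimal numpy sketch of CEM on a black-box utility; the diagonal-Gaussian sampling distribution, the toy quadratic objective, and all names are illustrative assumptions rather than anything prescribed by the slides.

```python
import numpy as np

def cem(utility, dim, iters=50, pop_size=100, elite_frac=0.2, init_std=1.0):
    """Cross Entropy Method: refit a diagonal Gaussian P_mu to the top p% of sampled thetas."""
    mu, std = np.zeros(dim), np.full(dim, init_std)
    n_elite = int(pop_size * elite_frac)
    for _ in range(iters):
        thetas = mu + std * np.random.randn(pop_size, dim)       # theta(e) ~ P_mu(theta)
        scores = np.array([utility(th) for th in thetas])        # U(e) from roll-outs
        elite = thetas[np.argsort(scores)[-n_elite:]]            # top p% population members
        mu, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6   # refit the sampling distribution
    return mu

# Toy black-box objective standing in for the expected return of roll-outs.
print(cem(lambda th: -np.sum((th - 3.0) ** 2), dim=5))
```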
Connection with well-known black-box methods (each fits a sampling distribution to (θ(e), U(e)) pairs):
- Reward Weighted Regression (RWR): µ(i+1) = arg max_µ Σ_e exp(λ U(e)) log P_µ(θ(e))
  [Dayan & Hinton, NC 1997; Peters & Schaal, ICML 2007]
- Policy Improvement with Path Integrals (PI2): µ(i+1) = arg max_µ Σ_e q(U(e), P_µ(θ(e))) log P_µ(θ(e))
  [PI2: Theodorou, Buchli, Schaal, JMLR 2010; Kappen, 2007; PI2-CMA: Stulp & Sigaud, ICML 2012]
- Covariance Matrix Adaptation Evolutionary Strategy (CMA-ES): (µ(i+1), Σ(i+1)) = arg max_{µ,Σ} Σ_ē w(U(ē)) log N(θ(ē); µ, Σ)
  [CMA: Hansen & Ostermeier, 1996; CMA-ES: Hansen, Muller, Koumoutsakos, 2003]
- PoWER: µ(i+1) = µ(i) + ( Σ_e (θ(e) − µ(i)) U(e) ) / ( Σ_e U(e) )
  [Kober & Peters, NIPS 2007; also applies importance sampling for sample re-use]
Covariance Matrix Adaptation (CMA) has become standard in graphics [Hansen, Ostermeier, 1996]
PoWER [Kober & Peters, MLJ 2011]
- Full episode evaluation, parameter perturbation
- Simple
- Main caveat: works best when the number of parameters is relatively small, i.e., the number of population members is comparable to or larger than the number of (effective) parameters
  → In practice OK if θ is low-dimensional and you are willing to do many runs
  → Easy-to-implement baseline, great for comparisons!
Fixing the random seed:
- Randomness enters through both the policy and the dynamics
- But we can often only control the randomness in the policy...
- Example: wind influence on a helicopter is stochastic, but if the same wind sequence could be replayed for every candidate policy, the comparison between policies would be much less noisy
- Note: equally applicable to evolutionary methods
[Ng & Jordan, 2000] provide a theoretical analysis of the gains from fixing randomness ("PEGASUS")
Up next: Likelihood Ratio (LR) Policy Gradient: derivation / connection with importance sampling.
Likelihood Ratio (LR) Policy Gradient
Let τ denote a state-action trajectory and R(τ) = Σ_{t=0}^{H} R(s_t, u_t) its total reward, so the objective is U(θ) = E[R(τ); π_θ] = Σ_τ P(τ; θ) R(τ).
The likelihood ratio trick gives
  ∇_θ U(θ) = Σ_τ ∇_θ P(τ; θ) R(τ) = Σ_τ P(τ; θ) ∇_θ log P(τ; θ) R(τ) = E[ ∇_θ log P(τ; θ) R(τ) ],
which we approximate with m sample paths under policy π_θ:
  ∇_θ U(θ) ≈ ĝ = (1/m) Σ_{i=1}^{m} ∇_θ log P(τ^(i); θ) R(τ^(i))
[Aleksandrov, Sysoyev & Shemeneva, 1968] [Rubinstein, 1969] [Glynn, 1986] [REINFORCE: Williams, 1992] [GPOMDP: Baxter & Bartlett, 2001]
Derivation from Importance Sampling:
  U(θ) = E_{τ∼θ_old}[ (P(τ|θ) / P(τ|θ_old)) R(τ) ]
  ∇_θ U(θ) = E_{τ∼θ_old}[ (∇_θ P(τ|θ) / P(τ|θ_old)) R(τ) ]
  ∇_θ U(θ)|_{θ_old} = E_{τ∼θ_old}[ (∇_θ P(τ|θ)|_{θ_old} / P(τ|θ_old)) R(τ) ] = E_{τ∼θ_old}[ ∇_θ log P(τ|θ)|_{θ_old} R(τ) ]
Note: this suggests we can also look at more than just the gradient. E.g., we can use the importance sampled objective as a "surrogate loss" (locally).
[Tang & Abbeel, NIPS 2011]
- Valid even if R is discontinuous and/or unknown, or the sample space of paths is a discrete set
- The gradient tries to:
  - Increase the probability of paths with positive R
  - Decrease the probability of paths with negative R
- As formulated thus far: unbiased but very noisy
- Fixes that lead to real-world practicality:
  - Baseline
  - Temporal structure
  - Also: KL-divergence trust region / natural gradient (a general trick, equally applicable to perturbation analysis and finite differences)
- To build intuition, let's assume R > 0. Then
    ∇U(θ) ≈ ĝ = (1/m) Σ_{i=1}^{m} ∇_θ log P(τ^(i); θ) R(τ^(i))
  tries to increase the probabilities of all sampled paths.
- Consider subtracting a baseline b:
    ∇U(θ) ≈ ĝ = (1/m) Σ_{i=1}^{m} ∇_θ log P(τ^(i); θ) (R(τ^(i)) − b)
  This is still unbiased [Williams 1992]:
    E[ ∇_θ log P(τ; θ) b ] = Σ_τ P(τ; θ) ∇_θ log P(τ; θ) b = Σ_τ ∇_θ P(τ; θ) b = ∇_θ ( Σ_τ P(τ; θ) ) b = ∇_θ(1) b = 0
- Good choice for b? The expected return: b = E[R(τ)] ≈ (1/m) Σ_{i=1}^{m} R(τ^(i))
[See Greensmith, Bartlett, Baxter, JMLR 2004 for variance reduction techniques.]
- Current estimate:
    ĝ = (1/m) Σ_{i=1}^{m} ∇_θ log P(τ^(i); θ) (R(τ^(i)) − b)
      = (1/m) Σ_{i=1}^{m} ( Σ_{t=0}^{H−1} ∇_θ log π_θ(u_t^(i) | s_t^(i)) ) ( Σ_{t=0}^{H−1} R(s_t^(i), u_t^(i)) − b )
- Future actions do not depend on past rewards, hence we can lower variance by instead using:
    ĝ = (1/m) Σ_{i=1}^{m} Σ_{t=0}^{H−1} ∇_θ log π_θ(u_t^(i) | s_t^(i)) ( Σ_{k=t}^{H−1} R(s_k^(i), u_k^(i)) − b(s_t^(i)) )
- Good choice for b? The expected return from time t onward:
    b(s_t) = E[ r_t + r_{t+1} + r_{t+2} + ... + r_{H−1} ]
  → Increase the logprob of an action proportionally to how much its returns are better than the expected return under the current policy
[Policy Gradient Theorem: Sutton et al, NIPS 1999; GPOMDP: Bartlett & Baxter, JAIR 2001; Survey: Peters & Schaal, IROS 2006]
~ [Williams, 1992]
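As a concrete illustration of the estimator above, here is a minimal numpy sketch for a tabular softmax policy; the (s_t, u_t, r_t) trajectory format, the per-state baseline array, and all names are assumptions made for the example rather than anything specified in the slides.

```python
import numpy as np

def softmax_probs(theta, s):
    """Action probabilities of a tabular softmax policy; theta has shape [n_states, n_actions]."""
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def policy_gradient(theta, trajectories, baseline):
    """Likelihood ratio gradient with temporal structure and a state-dependent baseline b(s_t).
    Each trajectory is a list of (s_t, u_t, r_t) tuples; baseline is an array indexed by state."""
    grad = np.zeros_like(theta)
    for traj in trajectories:
        rewards = np.array([r for (_, _, r) in traj])
        for t, (s, u, _) in enumerate(traj):
            return_from_t = rewards[t:].sum()                  # sum_{k >= t} R(s_k, u_k)
            p = softmax_probs(theta, s)
            dlogp = -p                                         # d log pi(u|s) / d theta[s, :]
            dlogp[u] += 1.0
            grad[s] += dlogp * (return_from_t - baseline[s])   # advantage-weighted score
    return grad / len(trajectories)
```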
Up next: Natural Gradient / Trust Regions (-> TRPO), Variance Reduction using Value Functions (Actor-Critic) (-> GAE, A3C), Pathwise Derivatives (-> DPG, DDPG, SVG), and Stochastic Computation Graphs.
Trust Region Policy Optimization
Desiderata
Desiderata for policy optimization method:
- Stable, monotonic improvement. (How to choose stepsizes?)
- Good sample efficiency
Step Sizes
Why are step sizes a big deal in RL?
- Supervised learning
  - Step too far → next updates will fix it
- Reinforcement learning
  - Step too far → bad policy
  - Next batch: collected under bad policy
  - Can't recover, collapse in performance!
Surrogate Objective
- Let η(π) denote the expected return of π
- We collect data with π_old. Want to optimize some objective to get a new policy π
- Define L_{π_old}(π) to be the "surrogate objective"¹:
    L_{π_old}(π) = E_{π_old}[ (π(a | s) / π_old(a | s)) A_{π_old}(s, a) ],
  whose gradient at π = π_old is the policy gradient
- Local approximation to the performance of the policy; does not depend on the parameterization of π
¹ Kakade and Langford, "Approximately Optimal Approximate Reinforcement Learning". In: ICML. vol. 2. 2002, pp. 267–274.
Improvement Theory
- Theory: bound the difference between L_{π_old}(π) and η(π), the performance of the policy
- Result: η(π) ≥ L_{π_old}(π) − C · max_s KL[π_old(· | s), π(· | s)], where C = 2εγ/(1 − γ)²
- Monotonic improvement guaranteed (MM algorithm)
Practical Algorithm: TRPO
- Constrained optimization problem:
    maximize_π L(π)  subject to  KL[π_old, π] ≤ δ,
    where L(π) = E_{π_old}[ (π(a | s) / π_old(a | s)) A_{π_old}(s, a) ]
- In practice, use the empirical estimate over N sampled timesteps:
    L̂(π) = Σ_{n=1}^{N} (π(a_n | s_n) / π_old(a_n | s_n)) Â_n
- Make a quadratic approximation and solve with the conjugate gradient algorithm
[Schulman et al., "Trust Region Policy Optimization". In: ICML 2015]
- Pseudocode:
  for iteration = 1, 2, ... do
      Run policy for T timesteps or N trajectories
      Estimate advantage function at all timesteps
      Compute policy gradient g
      Use CG (with Hessian-vector products) to compute F⁻¹ g
      Do line search on surrogate loss and KL constraint
  end for
[Schulman et al., ICML 2015]
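Here is a minimal numpy sketch of the computational core just described: conjugate gradient using only Fisher-vector products, followed by a backtracking line search on the surrogate and the KL constraint. The interfaces (`fvp`, `surrogate_and_kl`) and the step-halving schedule are assumptions for illustration, not the exact recipe from the paper.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Solve F x = g using only Fisher-vector products fvp(v) = F v."""
    x = np.zeros_like(g)
    r, p = g.copy(), g.copy()
    r_dot = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = r_dot / (p @ Fp + 1e-12)
        x += alpha * p
        r -= alpha * Fp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

def trpo_step(theta, grad, fvp, surrogate_and_kl, delta=0.01, backtracks=10):
    """One trust-region update: natural step direction from CG, then a backtracking
    line search that enforces KL <= delta and improvement of the surrogate."""
    step_dir = conjugate_gradient(fvp, grad)
    # Scale so the quadratic approximation of the KL equals delta at the full step.
    step_size = np.sqrt(2.0 * delta / (step_dir @ fvp(step_dir) + 1e-12))
    old_surr, _ = surrogate_and_kl(theta)
    for i in range(backtracks):
        candidate = theta + (0.5 ** i) * step_size * step_dir
        surr, kl = surrogate_and_kl(candidate)
        if kl <= delta and surr > old_surr:
            return candidate
    return theta  # no acceptable step found; keep the old parameters
```

Here `grad` is the sampled policy gradient ĝ, `fvp` would compute Fisher-vector products (e.g., by double backprop through the mean KL), and `surrogate_and_kl` evaluates L̂ and the KL on the sampled batch.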
Applied to:
- Locomotion controllers in 2D
- Atari games with pixel input
[Schulman et al., ICML 2015]
"Proximal" Policy Optimization
- Use a penalty instead of a constraint:
    maximize_θ  Σ_{n=1}^{N} (π_θ(a_n | s_n) / π_θold(a_n | s_n)) Â_n  −  β KL[π_θold, π_θ]
- Pseudocode:
  for iteration = 1, 2, ... do
      Run policy for T timesteps or N trajectories
      Estimate advantage function at all timesteps
      Do SGD on the above objective for some number of epochs
      If KL too high, increase β. If KL too low, decrease β.
  end for
- ≈ same performance as TRPO, but only first-order optimization
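A minimal PyTorch sketch of this adaptive-KL-penalty update follows; the framework choice, the `policy` interface returning a torch distribution, and the β adaptation thresholds are all assumptions, not something the slides specify.

```python
import torch

def ppo_penalty_update(policy, optimizer, states, actions, advantages,
                       old_log_probs, beta, epochs=10, kl_target=0.01):
    """Several epochs of SGD on (surrogate - beta * KL), then adapt beta."""
    for _ in range(epochs):
        dist = policy(states)                                  # torch.distributions object
        log_probs = dist.log_prob(actions)
        ratio = torch.exp(log_probs - old_log_probs)           # pi_theta / pi_theta_old
        surrogate = (ratio * advantages).mean()
        kl = (old_log_probs - log_probs).mean()                # sample-based KL estimate
        loss = -(surrogate - beta * kl)                        # maximize surrogate - beta*KL
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Adapt the penalty coefficient based on the measured KL.
    with torch.no_grad():
        kl = (old_log_probs - policy(states).log_prob(actions)).mean().item()
    if kl > 1.5 * kl_target:
        beta *= 2.0
    elif kl < kl_target / 1.5:
        beta /= 2.0
    return beta
```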
Variance Reduction Using Value Functions
Variance Reduction
- Now, we have the following policy gradient formula:
    ∇_θ E_τ[R] = E_τ[ Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) A^π(s_t, a_t) ]
- A^π is not known, but we can plug in Â_t, an advantage estimator
- Previously, we showed that taking Â_t = r_t + r_{t+1} + r_{t+2} + ... − b(s_t), for any function b(s_t), gives an unbiased policy gradient estimator. b(s_t) ≈ V^π(s_t) gives variance reduction.
The Delayed Reward Problem
- With policy gradient methods, we are confounding the effect of multiple actions:
    Â_t = r_t + r_{t+1} + r_{t+2} + ... − b(s_t) mixes the effects of a_t, a_{t+1}, a_{t+2}, ...
- The SNR of Â_t scales roughly as 1/T
- Only a_t contributes to the signal A^π(s_t, a_t), but a_{t+1}, a_{t+2}, ... contribute to the noise.
Variance Reduction with Discounts
- Discount factor γ, 0 < γ < 1, downweights the effect of rewards that are far in the future (ignores long-term dependencies)
- We can form an advantage estimator using the discounted return:
    Â_t^γ = r_t + γ r_{t+1} + γ² r_{t+2} + ...  (discounted return)  − b(s_t),
  which reduces to our previous estimator when γ = 1.
- So that the advantage has expectation zero, we should fit the baseline to be the discounted value function:
    V^{π,γ}(s) = E_τ[ r_0 + γ r_1 + γ² r_2 + ... | s_0 = s ]
- Discount γ is similar to using a horizon of 1/(1 − γ) timesteps
- Â_t^γ is a biased estimator of the advantage function
Value Functions in the Future
- The baseline accounts for and removes the effect of past actions
- Can also use the value function to estimate future rewards:
    r_t + γ V(s_{t+1})                        cut off at one timestep
    r_t + γ r_{t+1} + γ² V(s_{t+2})           cut off at two timesteps
    ...
    r_t + γ r_{t+1} + γ² r_{t+2} + ...        ∞ timesteps (no V)
- Subtracting out baselines, we get advantage estimators:
    Â_t^(1) = r_t + γ V(s_{t+1}) − V(s_t)
    Â_t^(2) = r_t + γ r_{t+1} + γ² V(s_{t+2}) − V(s_t)
    ...
    Â_t^(∞) = r_t + γ r_{t+1} + γ² r_{t+2} + ... − V(s_t)
- Â_t^(1) has low variance but high bias; Â_t^(∞) has high variance but low bias.
- Using an intermediate k (say, 20) gives an intermediate amount of bias and variance
Finite-Horizon Methods: Advantage Actor-Critic
- A2C / A3C uses this fixed-horizon advantage estimator
- Pseudocode:
  for iteration = 1, 2, ... do
      Agent acts for T timesteps (e.g., T = 20)
      For each timestep t, compute
          R̂_t = r_t + γ r_{t+1} + ... + γ^{T−1−t} r_{T−1} + γ^{T−t} V(s_T)
          Â_t = R̂_t − V(s_t)
      R̂_t is the target value in the regression problem; Â_t is the estimated advantage
      Compute the loss gradient g = ∇_θ Σ_{t=1}^{T} [ −log π_θ(a_t | s_t) Â_t + c (V(s_t) − R̂_t)² ]
      g is plugged into a stochastic gradient descent variant, e.g., Adam
  end for
[A3C: Mnih et al. In: ICML 2016]
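A minimal numpy sketch of the fixed-horizon targets R̂_t and advantages Â_t above; the array shapes, the `dones` mask for episode ends, and all names are assumptions made for illustration.

```python
import numpy as np

def a2c_targets(rewards, values, dones, gamma=0.99):
    """rewards, dones have length T; values has length T+1 (includes bootstrap V(s_T))."""
    T = len(rewards)
    returns = np.zeros(T)
    running = values[-1]                      # bootstrap from V(s_T)
    for t in reversed(range(T)):
        running = rewards[t] + gamma * (1.0 - dones[t]) * running
        returns[t] = running                  # R_hat_t
    advantages = returns - values[:-1]        # A_hat_t = R_hat_t - V(s_t)
    return returns, advantages

# Example with 5 dummy timesteps
rew = np.array([0.0, 0.0, 1.0, 0.0, 1.0])
vals = np.array([0.1, 0.2, 0.5, 0.4, 0.6, 0.3])
done = np.zeros(5)
print(a2c_targets(rew, vals, done))
```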
A3C Video
A3C Results
TD(λ) Methods: Generalized Advantage Estimation
- Recall the finite-horizon advantage estimators:
    Â_t^(k) = r_t + γ r_{t+1} + ... + γ^{k−1} r_{t+k−1} + γ^k V(s_{t+k}) − V(s_t)
- Define the TD error δ_t = r_t + γ V(s_{t+1}) − V(s_t)
- By a telescoping sum,
    Â_t^(k) = δ_t + γ δ_{t+1} + ... + γ^{k−1} δ_{t+k−1}
- Take an exponentially weighted average of the finite-horizon estimators:
    Â_t^λ = (1 − λ)( Â_t^(1) + λ Â_t^(2) + λ² Â_t^(3) + ... )
- We obtain
    Â_t^λ = δ_t + (γλ) δ_{t+1} + (γλ)² δ_{t+2} + ...
- This scheme is named generalized advantage estimation (GAE) in [1], though versions have appeared earlier, e.g., [2]. Related to TD(λ).
[1] Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation"
[2] Kimura and Kobayashi. In: ICML. 1998, pp. 278–286
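A minimal numpy sketch of the Â_t^λ recursion above; the `values` array of length T+1 (with a bootstrap value for the final state) and the `dones` mask are assumptions about the data layout.

```python
import numpy as np

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimation: A_hat^lambda_t = sum_l (gamma*lam)^l * delta_{t+l}."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]   # TD error
        running = delta + gamma * lam * nonterminal * running
        adv[t] = running
    return adv
```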
Choosing parameters γ, λ
Performance as γ, λ are varied
TRPO+GAE Video
Pathwise Derivative Policy Gradient Methods
Deriving the Policy Gradient, Reparameterized
- Episodic MDP, viewed as a computation graph over θ, s_1, ..., s_T, a_1, ..., a_T, and R_T.
  Want to compute ∇_θ E[R_T]. So far we have used ∇_θ log π(a_t | s_t; θ).
- Reparameterize: a_t = π(s_t, z_t; θ), where z_t is noise from a fixed distribution (the graph gains noise nodes z_1, ..., z_T).
- Differentiating through the graph this way only works if P(s_2 | s_1, a_1) is known :(
Using a Q-function
(Same reparameterized computation graph, with noise nodes z_1, ..., z_T.)
  d/dθ E[R_T] = E[ Σ_{t=1}^{T} (dR_T/da_t) (da_t/dθ) ]
              = E[ Σ_{t=1}^{T} (d/da_t E[R_T | a_t]) (da_t/dθ) ]
              = E[ Σ_{t=1}^{T} (dQ(s_t, a_t)/da_t) (da_t/dθ) ]
              = E[ Σ_{t=1}^{T} d/dθ Q(s_t, π(s_t, z_t; θ)) ]
SVG(0) Algorithm
- Learn Q_φ to approximate Q^{π,γ}, and use it to compute gradient estimates.
- Pseudocode:
  for iteration = 1, 2, ... do
      Execute policy π_θ to collect T timesteps of data
      Update π_θ using g ∝ ∇_θ Σ_{t=1}^{T} Q(s_t, π(s_t, z_t; θ))
      Update Q_φ using g ∝ ∇_φ Σ_{t=1}^{T} (Q_φ(s_t, a_t) − Q̂_t)², e.g., with TD(λ)
  end for
[SVG: Heess et al. In: NIPS 2015]
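A minimal PyTorch sketch of the two updates above (the framework choice and all module/variable names are assumptions): the policy step differentiates Q through a reparameterized action, and the Q step is a regression toward the targets Q̂_t.

```python
import torch

def svg0_policy_update(policy, q_net, policy_opt, states, noise):
    """Pathwise policy update: gradient flows through a_t = pi(s_t, z_t; theta) into Q."""
    actions = policy(states, noise)              # differentiable, reparameterized actions
    loss = -q_net(states, actions).sum()         # ascend on Q(s_t, pi(s_t, z_t; theta))
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()                            # policy_opt holds only the policy parameters

def q_update(q_net, q_opt, states, actions, q_targets):
    """Regress Q_phi(s_t, a_t) toward targets Q_hat_t (e.g., from TD(lambda))."""
    loss = ((q_net(states, actions) - q_targets) ** 2).mean()
    q_opt.zero_grad()
    loss.backward()
    q_opt.step()
```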
SVG(1) Algorithm
(Same reparameterized computation graph as above.)
- Instead of learning Q, we learn:
  - A state-value function V ≈ V^{π,γ}
  - A dynamics model f, approximating s_{t+1} = f(s_t, a_t) + ζ_t
- Given a transition (s_t, a_t, s_{t+1}), infer ζ_t = s_{t+1} − f(s_t, a_t)
- Q(s_t, a_t) = E[ r_t + γ V(s_{t+1}) ] = E[ r_t + γ V(f(s_t, a_t) + ζ_t) ], and a_t = π(s_t, θ, ζ_t)
SVG(∞) Algorithm
(Same reparameterized computation graph as above.)
- Just learn the dynamics model f
- Given the whole trajectory, infer all noise variables
- Freeze all policy and dynamics noise, differentiate through the entire deterministic computation graph
SVG Results
- Applied to 2D robotics tasks
- Overall: different gradient estimators behave similarly
[Heess et al. In: NIPS 2015]
Deterministic Policy Gradient
- For Gaussian actions, the variance of the score function policy gradient estimator goes to infinity as the action variance goes to zero
- But the SVG(0) gradient is fine as σ → 0:
    ∇_θ Σ_t Q(s_t, π(s_t, θ, ζ_t))
- Problem: there's no exploration.
- Solution: add noise to the policy, but estimate Q with TD(0), so it's valid off-policy
- The policy gradient is a little biased (even with Q = Q^π), but only because the state distribution is off; it gets the right gradient at every state
[DPG: Silver et al. In: ICML 2014]
Deep Deterministic Policy Gradient
- Incorporate the replay buffer and target network ideas from DQN for increased stability
- Use lagged (Polyak-averaged) versions of Q_φ and π_θ for fitting Q_φ (towards Q^{π,γ}) with TD(0):
    Q̂_t = r_t + γ Q_φ′(s_{t+1}, π(s_{t+1}; θ′))
- Pseudocode:
  for iteration = 1, 2, ... do
      Act for several timesteps, add data to the replay buffer
      Sample a minibatch
      Update π_θ using g ∝ ∇_θ Σ_{t=1}^{T} Q(s_t, π(s_t, z_t; θ))
      Update Q_φ using g ∝ ∇_φ Σ_{t=1}^{T} (Q_φ(s_t, a_t) − Q̂_t)²
  end for
[DDPG: Lillicrap et al., 2015]
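A minimal PyTorch sketch (the framework, tensor shapes, and names are assumptions) of the two DDPG pieces described above: TD(0) targets from the lagged target networks, and Polyak averaging of the target parameters.

```python
import torch

def ddpg_targets(target_q, target_pi, rewards, next_states, dones, gamma=0.99):
    """Q_hat_t = r_t + gamma * Q_phi'(s_{t+1}, pi(s_{t+1}; theta')), zeroed at terminal steps."""
    with torch.no_grad():
        next_actions = target_pi(next_states)
        q_next = target_q(next_states, next_actions)
        return rewards + gamma * (1.0 - dones) * q_next

def polyak_update(target_net, net, tau=0.005):
    """Slowly track the learned network: theta' <- tau * theta + (1 - tau) * theta'."""
    with torch.no_grad():
        for p_target, p in zip(target_net.parameters(), net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p)
```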
DDPG Results
Applied to 2D and 3D robotics tasks and driving with pixel input
[DDPG: Lillicrap et al., 2015]
Policy Gradient Methods: Comparison
- Two kinds of policy gradient estimator:
  - REINFORCE / score function estimator: ∇ log π(a | s) Â
    - Learn Q or V for variance reduction, to estimate Â
  - Pathwise derivative estimators (differentiate wrt the action)
    - SVG(0) / DPG: d/da Q(s, a) (learn Q)
    - SVG(1): d/da (r + γ V(s′)) (learn f, V)
    - SVG(∞): d/da_t (r_t + γ r_{t+1} + γ² r_{t+2} + ...) (learn f)
- Pathwise derivative methods are more sample-efficient when they work (maybe), but work less generally due to high bias
Policy Gradient Methods: Comparison
[Empirical comparison figure; ICML 2016]
Stochastic Computation Graphs
Gradients of Expectations
Want to compute ∇_θ E[F]. Where's θ?
- In the distribution, e.g., E_{x∼p(·|θ)}[F(x)]
  - ∇_θ E_x[f(x)] = E_x[ f(x) ∇_θ log p_x(x; θ) ]
  - Score function estimator
  - Example: REINFORCE policy gradients, where x is the trajectory
- Outside the distribution: E_{z∼N(0,1)}[F(θ, z)]
  - ∇_θ E_z[f(x(z, θ))] = E_z[ ∇_θ f(x(z, θ)) ]
  - Pathwise derivative estimator
  - Example: SVG policy gradient
- Often, we can reparametrize, to change from one form to the other
- What if F depends on θ in a complicated way, affecting both the distribution and F itself?
[Fu. "Gradient Estimation". In: Handbooks in Operations Research and Management Science 13 (2006), pp. 575–616]
Stochastic Computation Graphs
- A stochastic computation graph is a DAG; each node corresponds to a deterministic or stochastic operation
- Can automatically derive unbiased gradient estimators, with variance reduction
[Figure: deterministic computation graphs vs. stochastic computation graphs, with stochastic nodes feeding into a loss L]
[Schulman, Heess, Weber, Abbeel. "Gradient Estimation Using Stochastic Computation Graphs". In: NIPS 2015]
Worked Example
[Figure: stochastic computation graph with parameter nodes θ, φ, stochastic nodes b, d, and cost nodes c, e]
- L = c + e. Want to compute d/dθ E[L] and d/dφ E[L].
- Treat stochastic nodes (b, d) as constants, and introduce losses logprob ∗ (future cost) at each stochastic node
- Obtain an unbiased gradient estimate by differentiating the surrogate:
    Surrogate(θ, φ) = (c + e)  (1)  +  log p(b̂ | a, d) ĉ  (2)
  (1): how parameters influence cost through deterministic dependencies
  (2): how parameters affect the distribution over random variables
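A minimal PyTorch sketch of the surrogate-loss trick on a tiny graph (the framework and this toy graph are assumptions): θ parameterizes the distribution of a stochastic node x, and the cost depends on θ only through x, so the score-function term carries the whole gradient.

```python
import torch

theta = torch.tensor([0.5], requires_grad=True)

# Stochastic node: x ~ N(theta, 1). sample() returns a value detached from theta.
dist = torch.distributions.Normal(theta, 1.0)
x = dist.sample()

cost = (x - 2.0) ** 2            # downstream cost (constant w.r.t. theta, since x is detached)
log_prob = dist.log_prob(x)      # how theta affects the distribution of x

# Surrogate = deterministic cost terms + logprob * (future cost, treated as a constant)
surrogate = cost + log_prob * cost.detach()
surrogate.sum().backward()
print(theta.grad)                # single-sample unbiased estimate of d/dtheta E[cost]
```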
Up next: Guided Policy Search (GPS).
- Find a parameterized policy π_θ that optimizes the expected total reward E[ Σ_{t=1}^{T} R(x_t, u_t) | π_θ ]
- Notation: τ = {x_1, u_1, ..., x_T, u_T}
- RL takes lots of data... Can we reduce it to supervised learning?
- Step 1:
  - Consider sampled problem instances i = 1, 2, ..., I
  - Find a trajectory-centric controller p_i for each problem instance
- Step 2:
  - Supervised training of a neural net policy to match all controllers:
      π_θ ← arg min_θ Σ_i D_KL( p_i(τ) || π_θ(τ) )
- Issues:
  - Compounding error (Ross, Gordon, Bagnell, JMLR 2011, "DAgger")
  - Mismatch between train and test distributions. E.g., blind peg insertion, vision, ...
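A minimal PyTorch sketch of the supervised step above (the framework, the Gaussian representation of the controllers, and all names are assumptions): the neural-net policy is regressed toward the trajectory-centric controllers by minimizing a KL divergence at states they visited.

```python
import torch

def gps_supervised_step(policy, optimizer, states, ctrl_means, ctrl_stds):
    """One SGD step on sum_i KL(p_i(u|x) || pi_theta(u|x)) over sampled states.
    ctrl_means / ctrl_stds are the controllers' Gaussian action distributions at `states`."""
    mean, std = policy(states)                       # pi_theta(u|x), assumed Gaussian
    p = torch.distributions.Normal(ctrl_means, ctrl_stds)
    q = torch.distributions.Normal(mean, std)
    loss = torch.distributions.kl_divergence(p, q).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```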
- Optimization formulation: alternate between optimizing the trajectory-centric controllers and the neural net policy, under a constraint that they agree. The particular form of the constraint varies depending on the specific method:
  - Dual gradient descent: Levine and Abbeel, NIPS 2014
  - Penalty methods: Mordatch, Lowrey, Andrew, Popovic, Todorov, NIPS 2016
  - ADMM: Mordatch and Todorov, RSS 2014
  - Bregman ADMM: Levine, Finn, Darrell, Abbeel, JMLR 2016
  - Mirror Descent: Montgomery, Levine, NIPS 2016
[Videos/results: Levine & Abbeel, NIPS 2014]
[Videos/results: Levine, Wagener, Abbeel, ICRA 2015]
[Videos/results: Levine*, Finn*, Darrell, Abbeel, JMLR 2016]
- Uses PI2 (rather than iLQG) as the trajectory optimizer
- In these experiments:
  - PI2 optimizes over a sequence of linear feedback controllers
  - PI2 is initialized from demonstrations
  - Neural net architecture: [figure omitted]
[Chebotar, Kalakrishnan, Yahya, Li, Schaal, Levine, arXiv 2016]
Up next: Current Frontiers.
- Off-policy Policy Gradients / Off-policy Actor-Critic / connections with Q-Learning
  - DDPG [Lillicrap et al, 2015]; Q-Prop [Gu et al, 2016]; Doubly Robust [Dudik et al, 2011], ...
  - PGQ [O'Donoghue et al, 2016]; ACER [Wang et al, 2016]; Q(lambda) [Harutyunyan et al, 2016]; Retrace(lambda) [Munos et al, 2016], ...
- Exploration
  - VIME [Houthooft et al, 2016]; Count-Based Exploration [Bellemare et al, 2016]; #Exploration [Tang et al, 2016]; Curiosity [Schmidhuber, 1991]; ...
- Auxiliary objectives
  - Learning to Navigate [Mirowski et al, 2016]; RL with Unsupervised Auxiliary Tasks [Jaderberg et al, 2016], ...
- Multi-task and transfer (incl. sim2real)
  - DeepDriving [Chen et al, 2015]; Progressive Nets [Rusu et al, 2016]; Flight without a Real Image [Sadeghi & Levine, 2016]; Sim2Real Visuomotor [Tzeng et al, 2016]; Sim2Real Inverse Dynamics [Christiano et al, 2016]; Modular NNs [Devin*, Gupta*, et al, 2016]
- Language
  - Learning to Communicate [Foerster et al, 2016]; Multitask RL w/ Policy Sketches [Andreas et al, 2016]; Learning Language through Interaction [Wang et al, 2016]
- Meta-RL
  - RL²: Fast RL through Slow RL [Duan et al, 2016]; Learning to Reinforcement Learn [Wang et al, 2016]; Learning to Experiment [Denil et al, 2016]; Learning to Learn for Black-Box Opt. [Chen et al, 2016], ...
- 24/7 Data Collection
  - Learning to Grasp from 50K Tries [Pinto & Gupta, 2015]; Learning Hand-Eye Coordination [Levine et al, 2016]; Learning to Poke by Poking [Agrawal et al, 2016]
- Safety
  - Survey: Garcia and Fernandez, JMLR 2015
- Architectures
  - Memory, Active Perception in Minecraft [Oh et al, 2016]; DRQN [Hausknecht & Stone, 2015]; Dueling Networks [Wang et al, 2016]; ...
- Inverse RL
  - Generative Adversarial Imitation Learning [Ho et al, 2016]; Guided Cost Learning [Finn et al, 2016]; MaxEnt Deep RL [Wulfmeier et al, 2016]; ...
- Model-based RL
  - Deep Visual Foresight [Finn & Levine, 2016]; Embed to Control [Watter et al, 2015]; Spatial Autoencoders for Visuomotor Learning [Finn et al, 2015]; PILCO [Deisenroth et al, 2015]
- Hierarchical RL
  - Modulated Locomotor Controllers [Heess et al, 2016]; STRAW [Vezhnevets et al, 2016]; Option-Critic [Bacon et al, 2016]; h-DQN [Kulkarni et al, 2016]; Hierarchical Lifelong Learning in Minecraft [Tessler et al, 2016]
(1) Deep RL Courses
- CS294-112 Deep Reinforcement Learning (UC Berkeley): http://rll.berkeley.edu/deeprlcourse/ by Sergey Levine, John Schulman, Chelsea Finn
- COMPM050/COMPGI13 Reinforcement Learning (UCL): http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html by David Silver
(2) Deep RL Code Bases
- rllab: https://github.com/openai/rllab by Duan, Chen, Houthooft, Schulman et al
- RLPy: https://rlpy.readthedocs.io/en/latest/ by Geramifard, Klein, Dann, Dabney, How
- GPS: http://rll.berkeley.edu/gps/ by Finn, Zhang, Fu, Tan, McCarthy, Scharff, Stadie, Levine
(3) Environments
- Arcade Learning Environment (ALE) (Bellemare et al, JAIR 2013)
- MuJoCo: http://mujoco.org (Todorov)
- Minecraft (Microsoft)
- DeepMind Lab / Labyrinth (DeepMind)
- OpenAI Gym: https://gym.openai.com/
- Universe: https://universe.openai.com/
...
https://universe.openai.com
The release consists of a thousand environments including Flash games, browser tasks, and games like slither.io, StarCraft, and GTA V.
Opportunities:
- Train agents on Universe tasks
- Grant us permission to use your game, program, website, or app
- Integrate new environments
- Contribute demonstrations
Recap:
- Derivative-free methods: Cross Entropy Method (CEM) / Finite Differences / Fixing Random Seed
- Likelihood Ratio (LR) Policy Gradient: derivation / connection with importance sampling
- Natural Gradient / Trust Regions (-> TRPO)
- Actor-Critic (-> GAE, A3C)
- Path Derivatives (PD) (-> DPG, DDPG, SVG)
- Stochastic Computation Graphs (generalizes LR / PD)
- Guided Policy Search (GPS)
- Current Frontiers

John Schulman & Pieter Abbeel – OpenAI + UC Berkeley