  1. Deep Reinforcement Learning. John Schulman, Berkeley Artificial Intelligence Research Lab. MLSS, May 2016, Cadiz

  2. Agenda ◮ Introduction and Overview ◮ Markov Decision Processes ◮ Reinforcement Learning via Black-Box Optimization ◮ Policy Gradient Methods ◮ Variance Reduction for Policy Gradients ◮ Trust Region and Natural Gradient Methods ◮ Open Problems. Course materials: goo.gl/5wsgbJ

  3. Introduction and Overview

  4. What is Reinforcement Learning? ◮ Branch of machine learning concerned with taking sequences of actions ◮ Usually described in terms of an agent interacting with a previously unknown environment, trying to maximize cumulative reward [Diagram: the agent sends actions to the environment; the environment returns observations and rewards]

  5. Motor Control and Robotics Robotics: ◮ Observations: camera images, joint angles ◮ Actions: joint torques ◮ Rewards: stay balanced, navigate to target locations, serve and protect humans

  6. Business Operations ◮ Inventory Management ◮ Observations: current inventory levels ◮ Actions: number of units of each item to purchase ◮ Rewards: profit ◮ Resource allocation: who to provide customer service to first ◮ Routing problems: in management of shipping fleet, which trucks / truckers to assign to which cargo

  7. Games A different kind of optimization problem (min-max), but still considered to be RL. ◮ Go (complete information, deterministic) – AlphaGo [2] ◮ Backgammon (complete information, stochastic) – TD-Gammon [3] ◮ Stratego (incomplete information, deterministic) ◮ Poker (incomplete information, stochastic) [2] David Silver, Aja Huang, et al. “Mastering the game of Go with deep neural networks and tree search”. In: Nature 529.7587 (2016), pp. 484–489. [3] Gerald Tesauro. “Temporal difference learning and TD-Gammon”. In: Communications of the ACM 38.3 (1995), pp. 58–68.

  8. Approaches to RL [Taxonomy diagram: Policy Optimization (DFO / Evolution, Policy Gradients) and Dynamic Programming (Policy Iteration, Value Iteration, related via modified policy iteration; Q-Learning), with Actor-Critic Methods in the overlap of the two families]

  9. What is Deep RL? ◮ RL using nonlinear function approximators ◮ Usually, updating parameters with stochastic gradient descent

  10. What’s Deep RL? Whatever the front half of the cerebral cortex does (motor and executive cortices)

  11. Markov Decision Processes

  12. Definition ◮ Markov Decision Process (MDP) defined by (S, A, P), where ◮ S: state space ◮ A: action space ◮ P(r, s′ | s, a): a transition probability distribution ◮ Extra objects defined depending on problem setting ◮ μ: initial state distribution ◮ γ: discount factor
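
To make the tuple concrete, here is a minimal Python sketch of a tabular MDP; the two states, two actions, transition probabilities, and reward values are invented purely for illustration:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[(s, a)] is a list of
# (probability, next_state, reward) triples, mu is the initial
# state distribution, gamma the discount factor.
P = {
    (0, 0): [(0.9, 0, 0.0), (0.1, 1, 1.0)],
    (0, 1): [(0.5, 0, 0.0), (0.5, 1, 1.0)],
    (1, 0): [(1.0, 1, 2.0)],
    (1, 1): [(0.3, 0, 0.0), (0.7, 1, 2.0)],
}
mu = np.array([1.0, 0.0])   # always start in state 0
gamma = 0.99

def step(s, a, rng):
    """Sample (s', r) from the transition distribution P(r, s' | s, a)."""
    probs, next_states, rewards = zip(*P[(s, a)])
    i = rng.choice(len(probs), p=probs)
    return next_states[i], rewards[i]
```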

  13. Episodic Setting ◮ In each episode, the initial state is sampled from µ , and the process proceeds until the terminal state is reached. For example: ◮ Taxi robot reaches its destination (termination = good) ◮ Waiter robot finishes a shift (fixed time) ◮ Walking robot falls over (termination = bad) ◮ Goal: maximize expected reward per episode

  14. Policies ◮ Deterministic policies: a = π(s) ◮ Stochastic policies: a ∼ π(a | s) ◮ Parameterized policies: π_θ

  15. Episodic Setting
      s_0 ∼ μ(s_0)
      a_0 ∼ π(a_0 | s_0)
      s_1, r_0 ∼ P(s_1, r_0 | s_0, a_0)
      a_1 ∼ π(a_1 | s_1)
      s_2, r_1 ∼ P(s_2, r_1 | s_1, a_1)
      . . .
      a_{T−1} ∼ π(a_{T−1} | s_{T−1})
      s_T, r_{T−1} ∼ P(s_T, r_{T−1} | s_{T−1}, a_{T−1})
      Objective: maximize η(π), where η(π) = E[r_0 + r_1 + · · · + r_{T−1} | π]
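
The sampling process above can be written as a short rollout loop. This sketch assumes an environment and policy with a classic Gym-style interface (env.reset() returning s_0, env.step(a) returning (s′, r, done, info)); these names are assumptions, not part of the slides. η(π) is then estimated by averaging episode returns over many rollouts.

```python
def rollout(env, policy, max_steps=1000):
    """Sample one episode s0, a0, r0, ..., s_T and return its
    states, actions, and rewards."""
    states, actions, rewards = [], [], []
    s = env.reset()                       # s_0 ~ mu
    for _ in range(max_steps):
        a = policy(s)                     # a_t ~ pi(a_t | s_t)
        s_next, r, done, _ = env.step(a)  # s_{t+1}, r_t ~ P(. | s_t, a_t)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
        if done:
            break
    return states, actions, rewards

# eta(pi) is estimated by averaging sum(rewards) over many rollouts.
```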

  16. Episodic Setting [Diagram: the agent (policy π) and the environment (dynamics P, initial state distribution μ_0) interact, producing states s_0, s_1, . . . , s_T, actions a_0, . . . , a_{T−1}, and rewards r_0, . . . , r_{T−1}] Objective: maximize η(π), where η(π) = E[r_0 + r_1 + · · · + r_{T−1} | π]

  17. Parameterized Policies ◮ A family of policies indexed by parameter vector θ ∈ R^d ◮ Deterministic: a = π(s, θ) ◮ Stochastic: π(a | s, θ) ◮ Analogous to classification or regression with input s, output a. E.g. for neural network stochastic policies: ◮ Discrete action space: network outputs vector of probabilities ◮ Continuous action space: network outputs mean and diagonal covariance of Gaussian (see the sketch below)
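
A minimal sketch of the two output parameterizations above, with a single linear layer standing in for a deeper network; the class names and initialization scale are illustrative choices, not from the slides.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class DiscretePolicy:
    """pi(a | s, theta): linear layer followed by a softmax over actions."""
    def __init__(self, obs_dim, n_actions, rng):
        self.W = 0.01 * rng.standard_normal((n_actions, obs_dim))
    def sample(self, s, rng):
        probs = softmax(self.W @ s)
        return rng.choice(len(probs), p=probs)

class GaussianPolicy:
    """pi(a | s, theta): network outputs the mean; a state-independent
    log-std parameterizes the diagonal covariance."""
    def __init__(self, obs_dim, act_dim, rng):
        self.W = 0.01 * rng.standard_normal((act_dim, obs_dim))
        self.log_std = np.zeros(act_dim)
    def sample(self, s, rng):
        mean = self.W @ s
        return mean + np.exp(self.log_std) * rng.standard_normal(len(mean))
```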

  18. Reinforcement Learning via Black-Box Optimization

  19. Derivative Free Optimization Approach ◮ Objective: maximize E[R | π(·, θ)] ◮ View the mapping θ → episode → R as a black box ◮ Ignore all information other than R collected during the episode

  20. Cross-Entropy Method ◮ Evolutionary algorithm ◮ Works embarrassingly well. István Szita and András Lőrincz. “Learning Tetris using the noisy cross-entropy method”. In: Neural Computation 18.12 (2006), pp. 2936–2941. Victor Gabillon, Mohammad Ghavamzadeh, and Bruno Scherrer. “Approximate Dynamic Programming Finally Performs Well in the Game of Tetris”. In: Advances in Neural Information Processing Systems. 2013.

  21. Cross-Entropy Method ◮ Evolutionary algorithm ◮ Works embarrassingly well ◮ A similar algorithm, Covariance Matrix Adaptation, has become standard in graphics:

  22. Cross-Entropy Method
      Initialize μ ∈ R^d, σ ∈ R^d
      for iteration = 1, 2, . . . do
          Collect n samples of θ_i ∼ N(μ, diag(σ))
          Perform a noisy evaluation R_i ∼ θ_i
          Select the top p% of samples (e.g. p = 20), which we'll call the elite set
          Fit a Gaussian distribution, with diagonal covariance, to the elite set, obtaining a new μ, σ
      end for
      Return the final μ
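
A direct NumPy transcription of the procedure above; the `evaluate` callback, sample count, and elite fraction defaults are assumptions for illustration.

```python
import numpy as np

def cross_entropy_method(evaluate, dim, n_samples=100, elite_frac=0.2,
                         n_iters=50, seed=0):
    """Sample parameter vectors from a diagonal Gaussian, keep the top
    p% by noisy return, and refit the Gaussian to that elite set.
    `evaluate(theta)` is assumed to run one (noisy) episode with the
    policy parameters theta and return its total reward R."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = int(np.round(elite_frac * n_samples))
    for _ in range(n_iters):
        thetas = mu + sigma * rng.standard_normal((n_samples, dim))
        returns = np.array([evaluate(th) for th in thetas])
        elite = thetas[np.argsort(returns)[-n_elite:]]   # top p% of samples
        mu = elite.mean(axis=0)
        sigma = elite.std(axis=0) + 1e-6                 # keep sigma > 0
    return mu
```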

  23. Cross-Entropy Method ◮ Analysis: a very similar algorithm is a minorization-maximization (MM) algorithm, guaranteed to monotonically increase expected reward ◮ Recall that the Monte-Carlo EM algorithm collects samples, reweights them, and then maximizes their log-probability ◮ We can derive an MM algorithm where at each iteration you maximize Σ_i R_i log p(θ_i) (a sketch of this weighted fit follows below)
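
As a sketch of that weighted objective: maximizing Σ_i R_i log p(θ_i) for a diagonal Gaussian reduces to a return-weighted maximum-likelihood fit. This assumes the returns are non-negative so they can be used directly as weights (shift them if not); the function name is illustrative.

```python
import numpy as np

def reward_weighted_refit(thetas, returns):
    """One MM-style update: argmax over (mu, sigma) of
    sum_i R_i log N(theta_i; mu, diag(sigma^2)),
    i.e. a return-weighted maximum-likelihood Gaussian fit."""
    w = returns / returns.sum()
    mu = (w[:, None] * thetas).sum(axis=0)
    sigma = np.sqrt((w[:, None] * (thetas - mu) ** 2).sum(axis=0)) + 1e-6
    return mu, sigma
```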

  24. Policy Gradient Methods

  25. Policy Gradient Methods: Overview Problem: maximize E [ R | π θ ] Intuitions: collect a bunch of trajectories, and ... 1. Make the good trajectories more probable 2. Make the good actions more probable (actor-critic, GAE) 3. Push the actions towards good actions (DPG, SVG)

  26. Score Function Gradient Estimator ◮ Consider an expectation E_{x ∼ p(x | θ)}[f(x)]. Want to compute the gradient wrt θ:
      ∇_θ E_x[f(x)] = ∇_θ ∫ dx p(x | θ) f(x)
                    = ∫ dx ∇_θ p(x | θ) f(x)
                    = ∫ dx p(x | θ) (∇_θ p(x | θ) / p(x | θ)) f(x)
                    = ∫ dx p(x | θ) ∇_θ log p(x | θ) f(x)
                    = E_x[f(x) ∇_θ log p(x | θ)]
      ◮ The last expression gives us an unbiased gradient estimator. Just sample x_i ∼ p(x | θ) and compute ĝ_i = f(x_i) ∇_θ log p(x_i | θ). ◮ Need to be able to compute and differentiate the density p(x | θ) wrt θ
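
A small numerical check of the estimator, assuming a 1-D Gaussian p(x | θ) = N(θ, 1), for which ∇_θ log p(x | θ) = x − θ; the test function f and the value of θ are arbitrary choices, and f is deliberately discontinuous.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
f = lambda x: (x > 2.0).astype(float)    # discontinuous f is fine

# Sample x_i ~ p(x | theta) = N(theta, 1); grad_theta log p = (x - theta).
x = theta + rng.standard_normal(200_000)
g_hat = f(x) * (x - theta)               # per-sample estimates g_i

# Empirical mean approaches the true derivative
# d/dtheta E[f(x)] = phi(2 - theta) ≈ 0.352.
print(g_hat.mean())
```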

  27. Derivation via Importance Sampling ◮ Alternate derivation using importance sampling:
      E_{x ∼ θ}[f(x)] = E_{x ∼ θ_old}[ (p(x | θ) / p(x | θ_old)) f(x) ]
      ∇_θ E_{x ∼ θ}[f(x)] = E_{x ∼ θ_old}[ (∇_θ p(x | θ) / p(x | θ_old)) f(x) ]
      ∇_θ E_{x ∼ θ}[f(x)] |_{θ = θ_old} = E_{x ∼ θ_old}[ (∇_θ p(x | θ)|_{θ = θ_old} / p(x | θ_old)) f(x) ]
                                        = E_{x ∼ θ_old}[ ∇_θ log p(x | θ)|_{θ = θ_old} f(x) ]

  28. Score Function Gradient Estimator: Intuition ĝ_i = f(x_i) ∇_θ log p(x_i | θ) ◮ Let's say that f(x) measures how good the sample x is. ◮ Moving in the direction ĝ_i pushes up the logprob of the sample, in proportion to how good it is ◮ Valid even if f(x) is discontinuous, or unknown, or the sample space (containing x) is a discrete set

  29. Score Function Gradient Estimator: Intuition ĝ_i = f(x_i) ∇_θ log p(x_i | θ)

  30. Score Function Gradient Estimator: Intuition ĝ_i = f(x_i) ∇_θ log p(x_i | θ)

  31. Score Function Gradient Estimator for Policies ◮ Now the random variable x is a whole trajectory τ = (s_0, a_0, r_0, s_1, a_1, r_1, . . . , s_{T−1}, a_{T−1}, r_{T−1}, s_T)
      ∇_θ E_τ[R(τ)] = E_τ[∇_θ log p(τ | θ) R(τ)]
      ◮ Just need to write out p(τ | θ):
      p(τ | θ) = μ(s_0) ∏_{t=0}^{T−1} [π(a_t | s_t, θ) P(s_{t+1}, r_t | s_t, a_t)]
      log p(τ | θ) = log μ(s_0) + Σ_{t=0}^{T−1} [log π(a_t | s_t, θ) + log P(s_{t+1}, r_t | s_t, a_t)]
      ∇_θ log p(τ | θ) = Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ)
      ∇_θ E_τ[R] = E_τ[ R Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) ]
      ◮ Interpretation: using good trajectories (high R) as supervised examples in classification / regression (a minimal sketch follows below)
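
A sketch of this whole-trajectory estimator for the hypothetical linear-softmax policy from the earlier sketch; `states`, `actions`, and `rewards` are assumed to come from a rollout like the one above.

```python
import numpy as np

def grad_log_pi_discrete(W, s, a):
    """grad_W log pi(a | s, W) for a linear-softmax policy:
    grad = outer(onehot(a) - softmax(W @ s), s)."""
    z = W @ s
    probs = np.exp(z - z.max()); probs /= probs.sum()
    onehot = np.zeros(len(probs)); onehot[a] = 1.0
    return np.outer(onehot - probs, s)

def trajectory_gradient(W, states, actions, rewards):
    """Estimator from slide 31: R(tau) * sum_t grad log pi(a_t | s_t, theta)."""
    R = sum(rewards)
    score_sum = sum(grad_log_pi_discrete(W, s, a)
                    for s, a in zip(states, actions))
    return R * score_sum
```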

  32. Policy Gradient: Slightly Better Formula ◮ Previous slide:
      ∇_θ E_τ[R] = E_τ[ (Σ_{t=0}^{T−1} r_t) (Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ)) ]
      ◮ But we can cut the trajectory to t′ steps and derive a gradient estimator for one reward term r_{t′}:
      ∇_θ E[r_{t′}] = E[ r_{t′} Σ_{t=0}^{t′} ∇_θ log π(a_t | s_t, θ) ]
      ◮ Sum this formula over t′, obtaining
      ∇_θ E[R] = E[ Σ_{t′=0}^{T−1} r_{t′} Σ_{t=0}^{t′} ∇_θ log π(a_t | s_t, θ) ]
               = E[ Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) Σ_{t′=t}^{T−1} r_{t′} ]
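
The improved formula corresponds to weighting each score term by the reward-to-go rather than the whole-episode return. A sketch, reusing the hypothetical grad_log_pi_discrete helper from the previous block:

```python
import numpy as np

def rewards_to_go(rewards):
    """Return the vector whose t-th entry is sum_{t' >= t} r_{t'}."""
    return np.cumsum(rewards[::-1])[::-1]

def trajectory_gradient_to_go(W, states, actions, rewards):
    """Estimator from slide 32:
    sum_t grad log pi(a_t | s_t, theta) * sum_{t' >= t} r_{t'}."""
    rtg = rewards_to_go(np.asarray(rewards, dtype=float))
    return sum(rtg[t] * grad_log_pi_discrete(W, s, a)
               for t, (s, a) in enumerate(zip(states, actions)))
```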

  33. Adding a Baseline ◮ Suppose f(x) ≥ 0 for all x ◮ Then for every x_i, the gradient estimator ĝ_i tries to push up its density ◮ We can derive a new unbiased estimator that avoids this problem, and only pushes up the density for better-than-average x_i:
      ∇_θ E_x[f(x)] = ∇_θ E_x[f(x) − b] = E_x[∇_θ log p(x | θ) (f(x) − b)]
      ◮ A near-optimal choice of b is always E[f(x)] (which must be estimated)
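
A sketch of the baseline version over a batch of trajectories, using the batch mean return as the estimate of b = E[f(x)] (again reusing the hypothetical grad_log_pi_discrete helper; `trajectories` is assumed to be a list of (states, actions, rewards) tuples):

```python
import numpy as np

def batch_gradient_with_baseline(W, trajectories):
    """Average the per-trajectory estimators with b = mean return:
    g = mean_i [ (R_i - b) * sum_t grad log pi(a_t | s_t, theta) ]."""
    returns = np.array([sum(rewards) for _, _, rewards in trajectories])
    b = returns.mean()                        # estimate of E[f(x)] = E[R]
    grads = []
    for (states, actions, rewards), R in zip(trajectories, returns):
        score_sum = sum(grad_log_pi_discrete(W, s, a)
                        for s, a in zip(states, actions))
        grads.append((R - b) * score_sum)
    return np.mean(grads, axis=0)
```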
