Deep Reinforcement Learning
John Schulman¹
MLSS, May 2016, Cadiz
¹Berkeley Artificial Intelligence Research Lab
Agenda
◮ Introduction and Overview
◮ Markov Decision Processes
◮ Reinforcement Learning via Black-Box Optimization
◮ Policy Gradients
◮ Branch of machine learning concerned with taking sequences of actions
◮ Usually described in terms of an agent interacting with a previously unknown environment, trying to maximize cumulative reward
◮ Robotics example:
  ◮ Observations: camera images, joint angles
  ◮ Actions: joint torques
  ◮ Rewards: stay balanced, navigate to target locations
◮ Inventory management
  ◮ Observations: current inventory levels
  ◮ Actions: number of units of each item to purchase
  ◮ Rewards: profit
◮ Resource allocation: who to provide customer service to first
◮ Routing problems: in management of a shipping fleet, which trucks / truckers to assign to which cargo
◮ Go (complete information, deterministic) – AlphaGo²
◮ Backgammon (complete information, stochastic) – TD-Gammon³
◮ Stratego (incomplete information, deterministic)
◮ Poker (incomplete information, stochastic)
²David Silver, Aja Huang, et al. “Mastering the game of Go with deep neural networks and tree search”. In: Nature 529.7587 (2016), pp. 484–489.
³Gerald Tesauro. “Temporal difference learning and TD-Gammon”. In: Communications of the ACM 38.3 (1995), pp. 58–68.
[Taxonomy of RL methods: Policy Optimization (DFO / Evolution, Policy Gradients) and Dynamic Programming (Policy Iteration, Value Iteration, modified policy iteration, Q-Learning), with Actor-Critic Methods at the intersection of the two families.]
◮ RL using nonlinear function approximators
◮ Usually, updating parameters with stochastic gradient descent
◮ Markov Decision Process (MDP) defined by (S, A, P), where
  ◮ S: state space
  ◮ A: action space
  ◮ P(r, s′ | s, a): a transition probability distribution
◮ Extra objects defined depending on problem setting
  ◮ µ: initial state distribution
  ◮ γ: discount factor
◮ In each episode, the initial state is sampled from µ, and the agent acts until the terminal state is reached. For example:
  ◮ Taxi robot reaches its destination (termination = good)
  ◮ Waiter robot finishes a shift (fixed time)
  ◮ Walking robot falls over (termination = bad)
◮ Goal: maximize expected reward per episode (see the rollout sketch below)
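To make the episodic objective concrete, here is a minimal Python sketch of a single rollout, assuming an environment object with reset() and step(a) methods; the environment, policy, and all names here are illustrative, not from the slides:

```python
import numpy as np

def run_episode(env, policy, max_steps=1000):
    """Roll out one episode: sample s0 ~ mu via env.reset(), act until the
    terminal state is reached, and return the total reward R of the episode."""
    s = env.reset()                      # s0 ~ mu
    total_reward = 0.0
    for _ in range(max_steps):
        a = policy(s)                    # a = pi(s) or a ~ pi(a | s)
        s, r, done = env.step(a)         # (s', r) ~ P(r, s' | s, a)
        total_reward += r
        if done:                         # terminal state reached
            break
    return total_reward

def expected_reward(env, policy, n_episodes=100):
    """Monte-Carlo estimate of the objective: expected reward per episode."""
    return np.mean([run_episode(env, policy) for _ in range(n_episodes)])
```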
◮ Deterministic policies: a = π(s)
◮ Stochastic policies: a ∼ π(a | s)
◮ Parameterized policies: πθ
  ◮ A family of policies indexed by parameter vector θ ∈ R^d
  ◮ Deterministic: a = π(s, θ)
  ◮ Stochastic: π(a | s, θ)
◮ Analogous to classification or regression with input s, output a
  ◮ Discrete action space: network outputs a vector of action probabilities
  ◮ Continuous action space: network outputs the mean and diagonal covariance of a Gaussian (both cases are sketched below)
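As a concrete illustration of the two output parameterizations, here is a minimal numpy sketch that uses linear function approximators in place of deep networks; the class names, initialization, and sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

class DiscretePolicy:
    """Discrete action space: outputs a vector of action probabilities."""
    def __init__(self, obs_dim, n_actions):
        self.W = 0.01 * rng.standard_normal((n_actions, obs_dim))
        self.b = np.zeros(n_actions)

    def act(self, s):
        probs = softmax(self.W @ s + self.b)        # pi(a | s, theta)
        return rng.choice(len(probs), p=probs)      # sample a ~ pi(a | s, theta)

class GaussianPolicy:
    """Continuous action space: outputs the mean (and a diagonal std) of a Gaussian."""
    def __init__(self, obs_dim, act_dim):
        self.W = 0.01 * rng.standard_normal((act_dim, obs_dim))
        self.b = np.zeros(act_dim)
        self.log_std = np.zeros(act_dim)

    def act(self, s):
        mean = self.W @ s + self.b
        return mean + np.exp(self.log_std) * rng.standard_normal(len(mean))
```

A deep policy would replace the linear map W s + b with a neural network, but the interface (state in, action distribution parameters out) is the same.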
◮ Objective: maximize E[R | πθ], where R = r₀ + r₁ + · · · + r_{T−1} is the total reward of an episode
◮ View θ → R as a black box
◮ Ignore all other information other than R collected during the episode
◮ Evolutionary algorithm
◮ Works embarrassingly well, e.g., on Tetris:
István Szita and András Lőrincz. “Learning Tetris using the noisy cross-entropy method”. In: Neural Computation 18.12 (2006).
Victor Gabillon, Mohammad Ghavamzadeh, and Bruno Scherrer. “Approximate Dynamic Programming Finally Performs Well in the Game of Tetris”. In: Advances in Neural Information Processing Systems. 2013.
◮ A similar algorithm, Covariance Matrix Adaptation (CMA-ES), has become standard in graphics (a minimal sketch of the cross-entropy method follows)
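Below is a minimal numpy sketch of the cross-entropy method as a black-box optimizer over θ; the function name, hyperparameters, and the diagonal-Gaussian sampling distribution are illustrative assumptions, with f(θ) standing for the (noisy) episode return obtained by running the policy with parameters θ:

```python
import numpy as np

def cross_entropy_method(f, dim, n_iters=50, pop_size=100, elite_frac=0.2, seed=0):
    """Maximize a black-box function f(theta): sample a population of parameter
    vectors from a Gaussian, keep the top elite_frac by score, and refit the
    Gaussian to the elite samples."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = int(pop_size * elite_frac)
    for _ in range(n_iters):
        thetas = mean + std * rng.standard_normal((pop_size, dim))   # sample population
        scores = np.array([f(th) for th in thetas])                  # evaluate returns
        elites = thetas[np.argsort(scores)[-n_elite:]]               # keep the best
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3   # refit distribution
    return mean
```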
◮ Analysis: a very similar algorithm is a minorize-maximization (MM) algorithm
◮ Recall that the Monte-Carlo EM algorithm collects samples, reweights them, and then refits the model to the weighted samples
◮ We can derive an MM algorithm where each iteration you maximize Σᵢ log p(θᵢ) Rᵢ
◮ Consider an expectation E_{x∼p(x | θ)}[f(x)]. Want to compute the gradient wrt θ:
  ∇θ E_x[f(x)] = ∇θ ∫ dx p(x | θ) f(x)
               = ∫ dx ∇θ p(x | θ) f(x)
               = ∫ dx p(x | θ) (∇θ p(x | θ) / p(x | θ)) f(x)
               = ∫ dx p(x | θ) ∇θ log p(x | θ) f(x)
               = E_x[f(x) ∇θ log p(x | θ)]
◮ The last expression gives us an unbiased gradient estimator. Just sample xᵢ ∼ p(x | θ), and compute ĝᵢ = f(xᵢ) ∇θ log p(xᵢ | θ)
◮ Need to be able to compute and differentiate the density p(x | θ) with respect to θ
◮ Let's say that f(x) measures how good the sample x is
◮ Moving in the direction ĝᵢ pushes up the logprob of the sample, in proportion to how good it is
◮ Valid even if f(x) is discontinuous or unknown, or the sample space is a discrete set (see the numerical example below)
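Here is a tiny numpy example of the estimator, assuming x ∼ N(θ, 1) and a deliberately discontinuous f, to emphasize that only the sampling density (not f) needs to be differentiable:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.0
f = lambda x: (x > 2.0).astype(float)       # discontinuous "score" of the sample

x = rng.normal(theta, 1.0, size=200_000)    # x_i ~ p(x | theta) = N(theta, 1)
grad_log_p = x - theta                      # d/dtheta log N(x | theta, 1)
g_hat = np.mean(f(x) * grad_log_p)          # unbiased estimate of d/dtheta E[f(x)]

# Sanity check: E[f(x)] = P(x > 2) = 1 - Phi(2 - theta), whose derivative in theta
# is the standard normal density at (2 - theta), roughly 0.2420 for theta = 1.
print(g_hat)
```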
◮ Now the random variable x is a whole trajectory τ = (s₀, a₀, r₀, s₁, a₁, r₁, . . . , s_{T−1}, a_{T−1}, r_{T−1}, s_T)
◮ Just need to write out p(τ | θ):
  p(τ | θ) = µ(s₀) Π_{t=0}^{T−1} [π(a_t | s_t, θ) P(s_{t+1}, r_t | s_t, a_t)]
  log p(τ | θ) = log µ(s₀) + Σ_{t=0}^{T−1} [log π(a_t | s_t, θ) + log P(s_{t+1}, r_t | s_t, a_t)]
  ∇θ log p(τ | θ) = Σ_{t=0}^{T−1} ∇θ log π(a_t | s_t, θ)   (the dynamics terms do not depend on θ)
  ∇θ Eτ[R] = Eτ[ R · Σ_{t=0}^{T−1} ∇θ log π(a_t | s_t, θ) ]
◮ Previous slide:
  ∇θ Eτ[R] = Eτ[ (Σ_{t=0}^{T−1} r_t) · (Σ_{t=0}^{T−1} ∇θ log π(a_t | s_t, θ)) ]
◮ We can also use the temporal structure: the reward r_{t′} does not depend on actions taken after time t′, so
  ∇θ Eτ[R] = Eτ[ Σ_{t=0}^{T−1} ∇θ log π(a_t | s_t, θ) · Σ_{t′=t}^{T−1} r_{t′} ]
◮ Suppose f(x) ≥ 0 for all x
◮ Then for every xᵢ, the gradient estimator ĝᵢ tries to push up its density
◮ We can derive a new unbiased estimator that avoids this problem, and only pushes up the density for better-than-average xᵢ:
  ĝᵢ = (f(xᵢ) − b) ∇θ log p(xᵢ | θ), unbiased for any constant b because E_x[∇θ log p(x | θ)] = 0
◮ A near-optimal choice of b is always E[f(x)] (which must itself be estimated)
◮ Recall
  ∇θ Eτ[R] = Eτ[ Σ_{t=0}^{T−1} ∇θ log π(a_t | s_t, θ) · Σ_{t′=t}^{T−1} r_{t′} ]
◮ Using the fact that E_{a_t}[∇θ log π(a_t | s_t, θ)] = 0, we can subtract a baseline b(s_t) without changing the expectation:
  ∇θ Eτ[R] = Eτ[ Σ_{t=0}^{T−1} ∇θ log π(a_t | s_t, θ) · (Σ_{t′=t}^{T−1} r_{t′} − b(s_t)) ]
◮ Interpretation: increase the logprob of action a_t proportionally to how much the returns Σ_{t′=t}^{T−1} r_{t′} are better than expected (a single-trajectory computation is sketched after the references below)
◮ Later: use value functions to further isolate the effect of each action
◮ For a more general picture of score function gradient estimators, see the stochastic computation graph framework⁴
⁴John Schulman, Nicolas Heess, et al. “Gradient Estimation Using Stochastic Computation Graphs”. In: Advances in Neural Information Processing Systems. 2015, pp. 3510–3522.
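To connect the formula to code, here is a sketch of the single-trajectory estimate ĝ = Σ_t ∇θ log π(a_t | s_t, θ) · (Σ_{t′≥t} r_{t′} − b(s_t)), assuming the per-timestep score vectors ∇θ log π(a_t | s_t, θ) have already been computed (e.g., by an autodiff framework); the function name and array layout are illustrative:

```python
import numpy as np

def policy_gradient_estimate(grad_log_pi, rewards, baselines=None):
    """Single-trajectory policy gradient estimate.
    grad_log_pi: (T, d) array, the score grad_theta log pi(a_t | s_t, theta) per step.
    rewards:     (T,) array of rewards r_t.
    baselines:   optional (T,) array b(s_t), e.g. a value-function estimate."""
    rewards = np.asarray(rewards, dtype=float)
    returns_to_go = np.cumsum(rewards[::-1])[::-1]        # sum_{t' >= t} r_t'
    if baselines is not None:
        returns_to_go = returns_to_go - np.asarray(baselines, dtype=float)
    return (np.asarray(grad_log_pi) * returns_to_go[:, None]).sum(axis=0)
```

In practice this estimate is averaged over a batch of trajectories before taking a stochastic gradient step on θ.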
◮ Process for generating a trajectory τ = (s₀, a₀, r₀, s₁, a₁, r₁, . . . , s_T): sample s₀ ∼ µ(s₀), then repeatedly sample a_t ∼ π(a_t | s_t, θ) and (s_{t+1}, r_t) ∼ P(s_{t+1}, r_t | s_t, a_t)
◮ Given parameterized policy π(a | s, θ), the optimization problem is
  maximize_θ Eτ[R | π(· | ·, θ)]
◮ In general, we can compute gradients of expectations with the score function gradient estimator
◮ We derived a formula for the policy gradient:
  ∇θ Eτ[R] = Eτ[ Σ_{t=0}^{T−1} ∇θ log π(a_t | s_t, θ) · (Σ_{t′=t}^{T−1} r_{t′} − b(s_t)) ]
◮ The state-value function V^π is defined as
  V^π(s) = E[r₀ + r₁ + r₂ + . . . | s₀ = s], the expected return when starting in s and acting under π
◮ The state-action value function Q^π is defined as
  Q^π(s, a) = E[r₀ + r₁ + r₂ + . . . | s₀ = s, a₀ = a]
◮ The advantage function A^π is
  A^π(s, a) = Q^π(s, a) − V^π(s), which measures how much better action a is than what π would do on average in s
◮ Recall
  ∇θ Eτ[R] = Eτ[ Σ_{t=0}^{T−1} ∇θ log π(a_t | s_t, θ) · (Σ_{t′=t}^{T−1} r_{t′} − b(s_t)) ]
◮ From the previous slide, a near-optimal choice of baseline is the state-value function, b(s) ≈ V^π(s), so the term multiplying each score is an estimate of the advantage A^π(s_t, a_t)⁵
⁵Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. “Variance reduction techniques for gradient estimates in reinforcement learning”. In: The Journal of Machine Learning Research 5 (2004), pp. 1471–1530.
◮ Now, we have the following policy gradient formula:
  ∇θ Eτ[R] = Eτ[ Σ_{t=0}^{T−1} ∇θ log π(a_t | s_t, θ) · (Σ_{t′=t}^{T−1} r_{t′} − b(s_t)) ]
◮ Previously, we showed that taking b(s_t) ≈ V^π(s_t) is a near-optimal choice of baseline
◮ One reason RL is difficult is the long delay between an action and its effect on the reward
◮ With policy gradient methods, we are confounding the effect of an action with the effects of all subsequent actions
◮ The signal-to-noise ratio of ĝ therefore degrades as the horizon grows:
  ◮ only a_t contributes to the signal A^π(s_t, a_t), but all subsequent actions contribute noise
◮ Discount factor γ, 0 < γ < 1, downweights the effect of rewards that are far in the future
◮ We can form an advantage estimator using the discounted return:
  Â_t = r_t + γ r_{t+1} + γ² r_{t+2} + . . . − b(s_t)
◮ So that the advantage has expectation zero, we should fit the baseline to be the discounted value function:
  V^{π,γ}(s) = E[r₀ + γ r₁ + γ² r₂ + . . . | s₀ = s]
◮ Â_t is a biased estimator of the advantage function, since discounting ignores long-delayed effects of a_t
◮ Another approach for variance reduction is to use a learned value function V(s) in place of the empirical future returns
◮ Can combine discounts and value functions of future states, e.g.,
  Â_t = r_t + γ V(s_{t+1}) − V(s_t)
◮ The above formula is called an actor-critic method, where the actor is the policy π and the critic is the value function V⁶
◮ Going further, the generalized advantage estimator⁷
  Â_t = δ_t + (γλ) δ_{t+1} + (γλ)² δ_{t+2} + . . . , where δ_t = r_t + γ V(s_{t+1}) − V(s_t)
◮ Interpolates between the two previous estimators:
  ◮ λ = 0: Â_t = r_t + γ V(s_{t+1}) − V(s_t) (low variance, but higher bias from the error of V)
  ◮ λ = 1: Â_t = r_t + γ r_{t+1} + γ² r_{t+2} + . . . − V(s_t) (low bias, but higher variance)
  (the computation is sketched in code after the references below)
⁶Vijay R. Konda and John N. Tsitsiklis. “Actor-Critic Algorithms”. In: Advances in Neural Information Processing Systems. Vol. 13. 1999, pp. 1008–1014.
⁷John Schulman, Philipp Moritz, et al. “High-dimensional continuous control using generalized advantage estimation”. In: arXiv preprint arXiv:1506.02438 (2015).
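A compact numpy sketch of computing the generalized advantage estimator from one trajectory of rewards and value predictions; the function signature and the default γ, λ values are illustrative assumptions:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation:
        delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        A_t     = sum_l (gamma * lam)^l * delta_{t+l}
    `values` has length T+1 (it includes V(s_T), which is 0 for a terminal state).
    lam = 0 recovers the one-step actor-critic estimator; lam = 1 recovers the
    discounted return minus the value-function baseline."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```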
◮ Suppose the problem has a continuous action space, a ∈ R^d
◮ Then ∇_a Q^π(s, a) tells us how to improve our action
◮ We can use the reparameterization trick, so that a is a deterministic function of θ and the noise: a = f(s, z, θ)
◮ This method is called the deterministic policy gradient⁸ (an illustrative actor update is sketched after the references below)
◮ A generalized version, which also uses a dynamics model, is described as the stochastic value gradient⁹
⁸David Silver, Guy Lever, et al. “Deterministic policy gradient algorithms”. In: ICML. 2014; Timothy P. Lillicrap et al. “Continuous control with deep reinforcement learning”. In: arXiv preprint arXiv:1509.02971 (2015).
⁹Nicolas Heess et al. “Learning continuous control policies by stochastic value gradients”. In: Advances in Neural Information Processing Systems. 2015, pp. 2926–2934.
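As an illustration of the deterministic policy gradient actor update, here is a short PyTorch sketch (the network sizes are arbitrary, and the critic training with TD targets and experience replay, as in DDPG, is omitted): the actor is improved by following ∇_a Q(s, a) back through a = π(s, θ).

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2   # illustrative dimensions
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

def actor_update(states):
    """One deterministic policy gradient step: maximize Q(s, pi(s, theta))
    with respect to the actor parameters theta, i.e. follow dQ/da through pi."""
    actions = actor(states)                                  # a = pi(s, theta), differentiable
    q = critic(torch.cat([states, actions], dim=-1))
    loss = -q.mean()                                         # ascend E[Q(s, pi(s))]
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()

actor_update(torch.randn(32, obs_dim))                       # a batch of 32 sampled states
```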
◮ Hard to choose a reasonable stepsize that works for the whole optimization process
  ◮ we have a gradient estimate, but no objective for a line search
  ◮ statistics of the data (observations and rewards) change during learning
◮ They make inefficient use of data: each experience is used for only one (or a few) gradient steps
◮ Given a batch of trajectories, what's the most we can do with it?
◮ Let η(π) denote the performance of policy π, i.e., its expected total reward per episode
◮ The following neat identity holds:
  η(π̃) = η(π) + E_{τ∼π̃}[A^π(s₀, a₀) + A^π(s₁, a₁) + A^π(s₂, a₂) + . . . ]
◮ Proof: consider nonstationary policies that follow π̃ for the first t timesteps and π afterwards; as t increases their performance telescopes from η(π) to η(π̃), and the tth difference term equals the expected advantage A^π(s_t, a_t) under π̃'s state-action distribution
◮ We just derived an expression for the performance of a policy π̃ in terms of the advantages of π:
  η(π̃) = η(π) + E_{τ∼π̃}[A^π(s₀, a₀) + A^π(s₁, a₁) + . . . ]
        = η(π) + E_{s₀:∞∼π̃}[E_{a₀:∞∼π̃}[A^π(s₀, a₀) + A^π(s₁, a₁) + . . . ]]
◮ Can't use this directly to optimize π̃, because the state distribution depends on the unknown π̃
◮ Let's define Lπ, the local approximation, which ignores the change in state distribution:
  Lπ(π̃) = η(π) + E_{s₀:∞∼π}[E_{a₀:∞∼π̃}[A^π(s₀, a₀) + A^π(s₁, a₁) + . . . ]]
        = η(π) + Σ_t E_{s_t∼π}[E_{a_t∼π̃}[A^π(s_t, a_t)]]
◮ Now let's consider a parameterized policy, π(a | s, θ). Sample data with θold, and we want to choose a new θ that improves performance
◮ Theorem (ignoring some details)¹⁰:
  η(θ) ≥ Lθold(θ) − C · max_s KL[π(· | s, θold), π(· | s, θ)]
  where Lθold(θ) is the local approximation to η and C is a constant
◮ If θold → θnew improves this lower bound, it's guaranteed to improve η
¹⁰John Schulman, Sergey Levine, et al. “Trust Region Policy Optimization”. In: arXiv preprint arXiv:1502.05477 (2015).
◮ Want to optimize η(θ). Collected data with policy parameter θold, want to improve
◮ Derived local approximation Lθold(θ)
◮ Optimizing the KL-penalized local approximation gives a guaranteed improvement to η
◮ More approximations give a practical algorithm, called trust region policy optimization (TRPO)
◮ Steps:
  ◮ Instead of the max over the state space, take the mean KL divergence
  ◮ Linear approximation to L, quadratic approximation to the KL divergence
  ◮ Use a hard constraint on the KL divergence instead of a penalty
◮ Solve the following problem approximately:
  maximize_θ Lθold(θ) subject to KL[θold, θ] ≤ δ
◮ Solve approximately through a line search in the natural gradient direction s = F⁻¹g, where F is the Fisher information matrix (schematic sketch below)
◮ The resulting algorithm is a refined version of the natural policy gradient¹¹
¹¹Sham Kakade. “A Natural Policy Gradient”. In: NIPS. Vol. 14. 2001, pp. 1531–1538.
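Schematically, one TRPO-style update can be sketched as follows in numpy, assuming the surrogate objective L, its gradient g, and a Fisher-matrix estimate F are already available; real TRPO computes F⁻¹g with conjugate gradients (never forming F explicitly) and also checks the KL constraint during the backtracking line search, so this is only an illustrative outline:

```python
import numpy as np

def trpo_style_step(theta_old, g, F, surrogate, delta=0.01, backtrack=0.5, max_tries=10):
    """Line search along the natural gradient direction s = F^{-1} g, with the
    initial step scaled so the quadratic approximation of the KL equals delta."""
    s = np.linalg.solve(F, g)                      # natural gradient direction
    beta = np.sqrt(2.0 * delta / (s @ F @ s))      # full step: (1/2)(beta*s)' F (beta*s) = delta
    L_old = surrogate(theta_old)
    for i in range(max_tries):
        theta_new = theta_old + (backtrack ** i) * beta * s
        if surrogate(theta_new) > L_old:           # accept the first improving step
            return theta_new
    return theta_old                               # no improvement found; keep old parameters
```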
◮ TRPO, with neural network policies, was applied to learn controllers for 2D robotic locomotion and for Atari games from raw pixel input¹²
◮ Used TRPO along with generalized advantage estimation to learn locomotion controllers for 3D simulated robots¹³
¹²John Schulman, Sergey Levine, et al. “Trust Region Policy Optimization”. In: arXiv preprint arXiv:1502.05477 (2015).
¹³John Schulman, Philipp Moritz, et al. “High-dimensional continuous control using generalized advantage estimation”. In: arXiv preprint arXiv:1506.02438 (2015).
◮ Policy gradient methods
  ◮ TRPO + GAE
  ◮ Standard policy gradient (no trust region) + deep nets
  ◮ Reparameterization trick¹⁵
◮ Q-learning¹⁶ and modifications¹⁷
◮ Combining search + supervised learning¹⁸
¹⁵Nicolas Heess et al. “Learning continuous control policies by stochastic value gradients”. In: Advances in Neural Information Processing Systems. 2015, pp. 2926–2934; Timothy P. Lillicrap et al. “Continuous control with deep reinforcement learning”. In: arXiv preprint arXiv:1509.02971 (2015).
¹⁶Volodymyr Mnih et al. “Playing Atari with Deep Reinforcement Learning”. In: arXiv preprint arXiv:1312.5602 (2013).
¹⁷Ziyu Wang, Nando de Freitas, and Marc Lanctot. “Dueling Network Architectures for Deep Reinforcement Learning”. In: arXiv preprint arXiv:1511.06581 (2015); Hado V. Hasselt. “Double Q-learning”. In: Advances in Neural Information Processing Systems. 2010, pp. 2613–2621.
¹⁸Xiaoxiao Guo et al. “Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning”. In: Advances in Neural Information Processing Systems. 2014, pp. 3338–3346; Sergey Levine et al. “End-to-end training of deep visuomotor policies”. In: arXiv preprint arXiv:1504.00702 (2015); Igor Mordatch et al. “Interactive Control of Diverse Complex Characters with Neural Networks”. In: Advances in Neural Information Processing Systems. 2015, pp. 3114–3122.
◮ Policy gradients (score function vs. reparameterization, trust regions and natural gradients)
◮ Desiderata for RL algorithms:
  ◮ scalable
  ◮ sample-efficient
  ◮ robust
  ◮ learns from off-policy data
◮ Exploration: actively encourage the agent to reach unfamiliar parts of the state space
◮ Can solve finite MDPs in polynomial time with exploration strategies based on¹⁹
  ◮ optimism about new states and actions
  ◮ maintaining a distribution over possible models, and planning with them (e.g., Thompson sampling)
◮ How to do exploration in the deep RL setting? Thompson sampling²⁰, exploration bonuses²¹
¹⁹Alexander L. Strehl et al. “PAC model-free reinforcement learning”. In: Proceedings of the 23rd International Conference on Machine Learning. ACM. 2006, pp. 881–888.
²⁰Ian Osband et al. “Deep Exploration via Bootstrapped DQN”. In: arXiv preprint arXiv:1602.04621 (2016).
²¹Bradly C. Stadie, Sergey Levine, and Pieter Abbeel. “Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models”. In: arXiv preprint arXiv:1507.00814 (2015).
[Figure: hierarchy of timescales for a robot: torque control at 100 Hz (~10⁷ timesteps/day); footstep planning at 1 Hz (~10⁵ timesteps/day); task-level decisions (“walk to x”, “fetch object y”, “say z”) at 0.01 Hz (~10³ timesteps/day); a day's agenda of tasks (task 1, task 2, . . . ) at ~10 timesteps/day.]
◮ Using learned models
◮ Learning from demonstrations