Deep Reinforcement Learning
- M. Soleymani
Sharif University of Technology, Fall 2017. Slides are based on lectures by Fei-Fei Li and colleagues, cs231n, Stanford 2017, and on lectures by Sergey Levine, cs294-112, Berkeley 2016.
Supervised Learning
– x is data, y is label
Unsupervised Learning
– Just data, no labels!
Reinforcement Learning
– Concerned with taking sequences of actions
– Problems involving an agent interacting with an environment, which provides numeric reward signals
– Goal: learn how to take actions that maximize reward
Example (robotics):
– Observations: camera images, joint angles
– Actions: joint torques
– Rewards: stay balanced, navigate to target locations, serve and protect humans
What makes RL different from other machine learning paradigms?
– You don't have full access to the function you're trying to optimize
– You are interacting with a stateful world: the input $x_t$ depends on your previous actions
At each time step $t$:
– Agent selects action $a_t$
– Environment samples reward $r_t \sim R(\cdot \mid s_t, a_t)$
– Environment samples next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$
– Agent receives reward $r_t$ and next state $s_{t+1}$
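As an illustration of this loop, here is a minimal self-contained Python sketch; the ToyEnv class and its dynamics are made up purely for illustration (they are not from the slides), and the "agent" simply acts at random.

```python
import random

# A tiny hypothetical 2-state environment, just to make the loop runnable.
class ToyEnv:
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        reward = 1.0 if (self.state == 0 and action == 1) else 0.0
        self.state = 1 - self.state if action == 1 else self.state
        done = random.random() < 0.1               # episodes end at random
        return self.state, reward, done

env = ToyEnv()
state, total_reward, done = env.reset(), 0.0, False
while not done:
    action = random.choice([0, 1])                 # agent selects action a_t
    next_state, reward, done = env.step(action)    # env samples r_t and s_{t+1}
    total_reward += reward                         # agent receives reward r_t ...
    state = next_state                             # ... and the next state s_{t+1}
print("episode return:", total_reward)
```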
Total discounted reward:
$$r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots = \sum_{t=0}^{\infty} \gamma^t r_t$$
– Maximize the expected sum of rewards!
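A quick sketch of computing this discounted sum for a finite list of sampled rewards (the reward values below are made up):

```python
# Minimal sketch: the discounted return r_0 + γ r_1 + γ² r_2 + ...
def discounted_return(rewards, gamma=0.99):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 0.0, 5.0], gamma=0.9))   # 1 + 0.9**3 * 5 = 4.645
```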
Value function: $V^\pi(s) = \mathbb{E}\left[\sum_{t\ge 0} \gamma^t r_t \mid s_0 = s, \pi\right]$
Q-function: $Q^\pi(s, a) = \mathbb{E}\left[\sum_{t\ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$
– It is simply the expected sum of discounted rewards upon starting in state $s$ (and taking action $a$) and then acting according to $\pi$
Bellman Equations
$$V^*(s) = \max_{a \in A(s)} \mathbb{E}\left[ r + \gamma V^*(s') \mid s, a \right]$$
$$Q^*(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]$$
$$\pi^*(s) = \arg\max_{a \in A(s)} Q^*(s, a)$$
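When the MDP model is known, the Bellman optimality equation can be applied directly as a fixed-point update. Below is a minimal Q-value iteration sketch in numpy; the transition and reward tables are random placeholders, not an example from the slides.

```python
import numpy as np

# Q-value iteration: repeatedly apply Q(s,a) <- E[r + γ max_a' Q(s',a')]
# on a small tabular MDP with known (made-up) dynamics.
n_states, n_actions, gamma = 3, 2, 0.9
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = np.random.rand(n_states, n_actions)                                  # R[s, a]

Q = np.zeros((n_states, n_actions))
for _ in range(200):
    Q = R + gamma * (P @ Q.max(axis=1))       # backup applied to every (s, a) at once
V = Q.max(axis=1)                             # V*(s)  = max_a Q*(s, a)
pi = Q.argmax(axis=1)                         # π*(s) = argmax_a Q*(s, a)
print("V*:", V, "π*:", pi)
```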
Q-learning (tabular):
– Initialize $Q(s, a)$ arbitrarily
– Update: $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
– Action selection: e.g., greedy, ε-greedy
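A minimal sketch of tabular Q-learning with ε-greedy action selection; the environment interface (reset()/step() returning state, reward, done) is an assumption, e.g. the toy environment sketched earlier.

```python
import random
import numpy as np

def q_learning(env, n_states=2, n_actions=2, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))               # initialize Q(s, a) arbitrarily
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # ε-greedy action selection
            if random.random() < epsilon:
                a = random.randrange(n_actions)       # explore
            else:
                a = int(np.argmax(Q[s]))              # exploit
            s_next, r, done = env.step(a)
            # Q(s,a) <- Q(s,a) + α [ r + γ max_a' Q(s',a') - Q(s,a) ]
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q

# Example usage with the hypothetical toy environment from earlier:
# Q = q_learning(ToyEnv())
```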
– Problem: not scalable, since we must compute Q(s,a) for every state-action pair
– Solution: use a function approximator to estimate Q(s,a), e.g. a neural network!
Training: iteratively try to make the Q-value close to the target value ($y_i$) it should have according to the Bellman equation. [Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]
Last FC layer has 4-d output (if 4 actions): $Q(s_t, a_1)$, $Q(s_t, a_2)$, $Q(s_t, a_3)$, $Q(s_t, a_4)$
Number of actions: between 4 and 18, depending on the Atari game
A single feedforward pass computes the Q-values for all actions from the current state => efficient!
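A minimal PyTorch sketch of such a Q-network, with one output per action so that a single forward pass yields all Q-values; the layer sizes are illustrative and only loosely follow the published architecture.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),               # one Q-value per action
        )

    def forward(self, x):                            # x: (batch, 4, 84, 84) frame stack
        return self.head(self.features(x))

q_net = QNetwork(n_actions=4)
q_values = q_net(torch.zeros(1, 4, 84, 84))          # Q(s, a_1..a_4) in one pass
print(q_values.shape)                                # torch.Size([1, 4])
```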
Learning from batches of consecutive samples is problematic:
– Samples are correlated => inefficient learning
– Current Q-network parameters determine the next training samples => can lead to bad feedback loops
Address these problems with experience replay:
– Continually update a replay memory table of transitions $(s_t, a_t, r_t, s_{t+1})$
– Train the Q-network on random minibatches of transitions drawn from the replay memory
– Each transition can contribute to multiple weight updates => greater data efficiency
– Random sampling smooths out learning and avoids oscillations or divergence in the parameters
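A minimal sketch of such a replay memory in Python; the capacity and the stored "done" flag are implementation choices, not details from the slides.

```python
import random
from collections import deque

class ReplayMemory:
    """Store transitions (s_t, a_t, r_t, s_{t+1}, done) and sample random minibatches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)          # oldest transitions are discarded

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)   # breaks sample correlations
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```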
Putting it together: deep Q-learning with experience replay [Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]
– Initialize replay memory and Q-network
– Play M episodes (full games)
– Initialize the state (starting game screen pixels) at the beginning of each episode
– For each time step of the game:
– With small probability select a random action (explore); otherwise select a greedy action from the current policy (exploit)
– Take the selected action and observe the reward and next state
– Store the transition in the replay memory
– Sample a random minibatch of transitions from the replay memory and perform a gradient descent step
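A minimal PyTorch sketch of the minibatch update at the heart of this loop, assuming the QNetwork and ReplayMemory sketched above; for brevity it omits the separate target network used in the Nature version, so treat it as illustrative rather than a faithful reproduction.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, optimizer, memory, batch_size=32, gamma=0.99):
    states, actions, rewards, next_states, dones = memory.sample(batch_size)
    states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s in states])
    next_states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s in next_states])
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # Bellman targets: y_i = r + γ max_a' Q(s', a')  (no bootstrap at terminal states)
    with torch.no_grad():
        targets = rewards + gamma * (1.0 - dones) * q_net(next_states).max(dim=1).values
    # Current estimates Q(s, a) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    loss = F.smooth_l1_loss(q_sa, targets)            # regress Q(s,a) toward the targets
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```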
[V. Mnih et al., Human-level control through deep reinforcement learning, Nature 2015]
Policy Gradients
– The Q-function can be very complicated!
– It can be much more complicated than the policy that must be learnt
– Idea: can we learn the policy directly?
– The policy gradient is an expectation, so we can estimate it with Monte Carlo sampling:
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left(\sum_{t\ge 0} r\big(s_t^{(i)}, a_t^{(i)}\big)\right)\left(\sum_{t\ge 0} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\right)$$
– Equivalently, in terms of whole sampled trajectories $\tau^{(i)}$:
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} r\big(\tau^{(i)}\big)\, \nabla_\theta \log p\big(\tau^{(i)}; \theta\big)$$
REINFORCE algorithm:
– Sample trajectories $\tau^{(i)}$ from $\pi_\theta(a \mid s)$ (run the policy)
– $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left(\sum_{t\ge 0} r\big(s_t^{(i)}, a_t^{(i)}\big)\right)\left(\sum_{t\ge 0} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\right)$
– $\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$
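A minimal PyTorch sketch of this update for a small discrete-action policy network; the network architecture and the trajectory format are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs):
        # returns a categorical distribution π_θ(a | s)
        return torch.distributions.Categorical(logits=self.net(obs))

def reinforce_update(policy, optimizer, trajectories):
    """trajectories: list of (states, actions, rewards) tensors, one per sampled episode."""
    loss = 0.0
    for states, actions, rewards in trajectories:
        log_probs = policy(states).log_prob(actions)   # log π_θ(a_t | s_t) for every t
        total_reward = rewards.sum()                   # r(τ^(i))
        loss = loss - total_reward * log_probs.sum()   # -r(τ) Σ_t log π_θ(a_t | s_t)
    loss = loss / len(trajectories)                    # average over the N trajectories
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                   # minimizing this loss ascends J(θ)
    return loss.item()
```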
[Figure: sampled trajectories of states $s_t$ and actions $a_t$ drawn from $\pi_\theta(a_t \mid s_t)$, each labeled with its total reward $r(\tau^{(i)})$.]
– Note that $\nabla_\theta \log p_\theta\big(\tau^{(i)}\big) = \sum_{t\ge 0} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)$, since the transition probabilities do not depend on $\theta$
– Comparison with maximum likelihood: the policy gradient
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} r\big(\tau^{(i)}\big)\, \nabla_\theta \log p_\theta\big(\tau^{(i)}\big)$$
is a reward-weighted version of the maximum likelihood gradient
$$\nabla_\theta J_{ML}(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \log p_\theta\big(\tau^{(i)}\big)$$
so trajectories with high reward are made more likely
Variance reduction
– Gradient estimator:
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left(\sum_{t\ge 0} r\big(s_t^{(i)}, a_t^{(i)}\big)\right)\left(\sum_{t\ge 0} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\right)$$
– First idea: push up the probability of an action only by the cumulative future reward from that state (reward to go):
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t\ge 0}\left(\sum_{t'\ge t} r\big(s_{t'}^{(i)}, a_{t'}^{(i)}\big)\right) \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)$$
– Second idea: use a discount factor $\gamma$ to ignore delayed effects:
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t\ge 0}\left(\sum_{t'\ge t} \gamma^{t'-t} r\big(s_{t'}^{(i)}, a_{t'}^{(i)}\big)\right) \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)$$
– Problem: the raw magnitude of the reward is not very meaningful; for example, if rewards are all positive, you keep pushing up the probabilities of all actions
– What matters is whether a reward is better or worse than what you expect to get
– Third idea: subtract a baseline $b\big(s_t^{(i)}\big)$ that depends on the state:
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t\ge 0}\left(\sum_{t'\ge t} \gamma^{t'-t} r\big(s_{t'}^{(i)}, a_{t'}^{(i)}\big) - b\big(s_t^{(i)}\big)\right) \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)$$
– Simple baseline: the average reward over sampled trajectories, $b = \frac{1}{N}\sum_{i=1}^{N} r\big(\tau^{(i)}\big)$
– The average reward is not the best baseline, but it's pretty good!
Policy gradients in practice:
– This isn't the same as supervised learning: the gradients will be really noisy!
– Adaptive step-size rules like ADAM can be OK-ish; there are also policy-gradient-specific learning rate adjustment methods
REINFORCE with discounted reward to go:
– Sample trajectories $\tau^{(i)}$ from $\pi_\theta(a_t \mid s_t)$ (run the policy)
– $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t\ge 0}\left(\sum_{t'\ge t} \gamma^{t'-t} r\big(s_{t'}^{(i)}, a_{t'}^{(i)}\big)\right) \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)$
– $\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t\ge 0}\underbrace{\left(\sum_{t'\ge t} \gamma^{t'-t} r\big(s_{t'}^{(i)}, a_{t'}^{(i)}\big)\right)}_{\hat{Q}_t^{(i)}:\ \text{reward to go}} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)$$
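A minimal numpy sketch of computing the discounted reward to go for one sampled episode, together with a simple constant baseline; the reward values are made up.

```python
import numpy as np

# Q̂_t = Σ_{t'≥t} γ^{t'-t} r_{t'}, accumulated from the end of the episode.
def reward_to_go(rewards, gamma=0.99):
    q_hat = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        q_hat[t] = running
    return q_hat

episode_rewards = [0.0, 0.0, 1.0]               # made-up rewards for one trajectory
q_hat = reward_to_go(episode_rewards, gamma=0.9)
print(q_hat)                                    # [0.81 0.9  1.  ]
baseline = q_hat.mean()                         # a simple (not the best) constant baseline
advantages = q_hat - baseline                   # weights for ∇ log π_θ(a_t | s_t)
```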
– Better idea: push up the probability of an action only if it was better than the expected value of what we should get from that state
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t\ge 0} \hat{Q}_t^{(i)}\, \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)$$
– $\hat{Q}_t^{(i)}$: a single-sample estimate of the expected reward if we take action $a_t^{(i)}$ in state $s_t^{(i)}$
– The true expected reward to go is the Q-function $Q^\pi\big(s_t^{(i)}, a_t^{(i)}\big)$
– The value function $V^\pi\big(s_t^{(i)}\big)$ is the total expected reward from state $s_t^{(i)}$ and makes a natural state-dependent baseline:
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t\ge 0}\left( Q^\pi\big(s_t^{(i)}, a_t^{(i)}\big) - V^\pi\big(s_t^{(i)}\big)\right) \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)$$
– Remark: the advantage function $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$ measures how much better an action $a$ is than expected from state $s$
– Instead of the unbiased but high-variance single-sample estimate $\sum_{t'\ge t} \gamma^{t'-t} r\big(s_{t'}^{(i)}, a_{t'}^{(i)}\big) - b$, use $A^\pi$, which is an estimate of the expectation:
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t\ge 0} A^\pi\big(s_t^{(i)}, a_t^{(i)}\big)\, \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)$$
– But we don't know $Q^\pi$ and $V^\pi$; can we learn them? Yes, like Q-learning!
Actor-Critic
– The actor (the policy) decides which action to take; the critic (the value function) tells the actor how good its action was and how it should adjust
– This also alleviates the task of the critic, as it only has to learn the values of (state, action) pairs generated by the policy
– Fit the critic by regression to bootstrapped targets: $y_t \approx r(s_t, a_t) + \gamma V_\phi^\pi(s_{t+1})$
– Critic loss: $\mathcal{L}(\phi) = \sum_t \big( V_\phi^\pi(s_t) - y_t \big)^2$
– Repeat: alternate between fitting the critic $V_\phi^\pi$ and taking a policy gradient step on the actor using the resulting advantage estimates
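A minimal PyTorch sketch of one actor-critic update along these lines; the network sizes, the single shared optimizer, and the batch format are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                   nn.Linear(64, n_actions))
        self.critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                    nn.Linear(64, 1))

def actor_critic_update(model, optimizer, states, actions, rewards,
                        next_states, dones, gamma=0.99):
    values = model.critic(states).squeeze(-1)                     # V_φ(s_t)
    with torch.no_grad():
        next_values = model.critic(next_states).squeeze(-1)       # V_φ(s_{t+1})
        targets = rewards + gamma * (1.0 - dones) * next_values   # y_t
    advantages = (targets - values).detach()                      # A_t ≈ Q - V

    critic_loss = F.mse_loss(values, targets)                     # Σ (V_φ(s_t) - y_t)²
    dist = torch.distributions.Categorical(logits=model.actor(states))
    actor_loss = -(advantages * dist.log_prob(actions)).mean()    # policy gradient with A_t

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()
```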
Advantages of policy-based RL:
– Better convergence properties
– Effective in high-dimensional or continuous action spaces
– Can learn stochastic policies
Disadvantages:
– Typically converges to a local rather than a global optimum
– Evaluating a policy is typically inefficient and high variance
Example RL formulations:
Hard attention for image classification:
– Observation: current image window
– Action: where to look
– Reward: classification result
Machine translation:
– Observations: words in the source language
– Actions: emit a word in the target language
– Rewards: a sentence-level metric, e.g. the BLEU score
[Ranzato et al., Sequence Level Training with Recurrent Neural Networks, 2015]
Hard attention: motivation
– Inspiration from human perception and eye movements
– Saves computational resources => scalability
– Able to ignore clutter / irrelevant parts of the image
[Mnih et al. 2014]
– Action: where to look next in the image
– Reward: 1 at the final time step if the image is correctly classified, 0 otherwise
– Glimpsing is a non-differentiable operation => learn the policy for how to take glimpse actions using REINFORCE
– RL is used because these evaluation metrics are not differentiable
– BLEU: compares the sequence of actions (emitted words) from the current policy against the reference translation
– At each point it pursues not only one but k next-word candidates (beam search)
– The greedy policy is obtained by maximum likelihood training on the training data
– $r_t$ is estimated by a linear regressor that takes as input the hidden states $h_t$ of the RNN
[Figure: results comparing the XENT and XE+R training settings.]
AlphaGo
– Featurize the board (stone color, move legality, bias, …)
– Initialize the policy network with supervised training from professional Go games, then continue training using policy gradient
– Also learn a value network (critic)
– Finally, combine the policy and value networks in a Monte Carlo Tree Search algorithm to select actions by lookahead search
Summary
– Policy gradients: very general, but suffer from high variance, so they require many samples. Challenge: sample efficiency.
– Q-learning: does not always work, but when it works it is usually more sample-efficient. Challenge: exploration.
Guarantees:
– Policy gradients: converge to a local optimum of $J(\theta)$, often good enough!
– Q-learning: zero guarantees, since you are approximating the Bellman equation with a complicated function approximator.