Lecture #12 – Deep Reinforcement Learning
Aykut Erdem // Hacettepe University // Spring 2018
CMP784
DEEP LEARNING
DeepLoco by X. B. Peng, G. Berseth & M. van de Panne
Previously on CMP784: Generative Adversarial Networks (GANs)
Neural Face by Taehoon Kim
Lecture overview
Disclaimer: Much of the material and slides for this lecture were borrowed from
—John Schulman’s talk on “Deep Reinforcement Learning: Policy Gradients and Q-Learning”
—David Silver’s tutorial on “Deep Reinforcement Learning”
—Lex Fridman’s MIT 6.S094 Deep Learning for Self-Driving Cars class
What is Reinforcement Learning?
Supervised Learning vs. Unsupervised Learning vs. Reinforcement Learning
Slide credit: Razvan Pascanu
What is Reinforcement Learning?
− The agent takes actions in an initially unknown environment, trying to maximize its cumulative reward
[Figure: agent-environment loop; the agent sends actions to the environment and receives reward back]
Motor Control and Robotics
Robotics:
Business Operations
Inventory Management:
Image Captioning
Hard Attention for Image Captioning:
Games
A different kind of optimization problem (min-max) but still considered to be RL.
− TD-Gammon (backgammon): Gerald Tesauro. “Temporal difference learning and TD-Gammon”. Communications of the ACM 38.3 (1995), pp. 58–68.
− AlphaGo (Go): David Silver, Aja Huang, et al. “Mastering the game of Go with deep neural networks and tree search”. Nature 529.7587 (2016), pp. 484–489.
− DeepStack (poker): Matej Moravčík et al. “DeepStack: Expert-level artificial intelligence in heads-up no-limit poker”. Science, 2 March 2017.
How does RL relate to Supervised Learning?
— Environment samples an input-output pair (xt, yt) ∼ ρ
— Agent predicts ŷt = f(xt)
— Agent receives loss ℓ(yt, ŷt)
— The environment asks the agent a question, and then tells her the right answer
How does RL relate to Supervised Learning?
− Environment samples input xt ∼ P(xt | xt−1, yt−1)
§ Input depends on your previous actions!
− Agent predicts ŷt = f(xt)
− Agent receives cost ct ∼ P(ct | xt, ŷt), where P is a probability distribution unknown to the agent
Reinforcement Learning in a nutshell
RL is a general-purpose framework for decision-making
− RL is for an agent with the capacity to act
− Each action influences the agent’s future state
− Success is measured by a scalar reward signal
− Goal: select actions to maximize future reward
Deep Learning in a nutshell
DL is a general-purpose framework for representation learning
− Given an objective
− Learn a representation that is required to achieve the objective
− Directly from raw inputs
Deep Reinforcement Learning: AI = RL + DL
We seek a single agent which can solve any human-level task
− Play games: Atari, poker, Go, ...
− Explore worlds: 3D worlds, Labyrinth, ...
− Control physical systems: manipulate, walk, swim, ...
− Interact with users: recommend, optimize, personalize, ...
Agent and Environment
[Figure: agent-environment loop; at each step t, the agent executes action at and receives reward rt]
Example Reinforcement Learning Problem
Atari Breakout
− The agent has the capacity to act
− Each action influences the future state of the game
− Success is measured by a scalar reward signal (the game score)
Goal is to select actions to maximize future reward
State
− The state is a summary of the agent’s experience:
st = f(o1, r1, a1, ..., at−1, ot, rt)
− In a fully observed environment:
st = f(ot)
Major Components of an RL Agent
An RL agent may include one or more of these components:
− Policy: Agent’s behavior function
− Value function: How good is each state and/or action
− Model: Agent’s representation of the environment
Policy
− A policy is the agent’s behavior: a map from state to action
− Deterministic policy: a = π(s)
− Stochastic policy: π(a|s) = P[a|s]
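A minimal sketch of the two policy types, assuming a tiny discrete problem and a hypothetical preference table `theta` (nothing here is from the lecture):

```python
import numpy as np

n_states, n_actions = 4, 3
rng = np.random.default_rng(0)
theta = rng.standard_normal((n_states, n_actions))  # hypothetical policy parameters

def deterministic_policy(s):
    # a = pi(s): always pick the highest-preference action
    return int(np.argmax(theta[s]))

def stochastic_policy(s):
    # pi(a|s) = P[a|s]: sample from a softmax over preferences
    logits = theta[s]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(n_actions, p=probs))

print(deterministic_policy(0), stochastic_policy(0))
```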
Value Function
− “How much reward will I get from action a in state s?”
− from state s and action a
− under policy π
− with discount factor γ
Qπ(s, a) = E[ rt+1 + γ rt+2 + γ² rt+3 + ... | s, a ]
− Value functions decompose into a Bellman equation:
Qπ(s, a) = Es′,a′[ r + γ Qπ(s′, a′) | s, a ]
− The optimal value function obeys the Bellman optimality equation:
Q*(s, a) = Es′[ r + γ max_a′ Q*(s′, a′) | s, a ]
− Q* is the maximum value achievable under any policy:
Q*(s, a) = max_π Qπ(s, a) = Qπ*(s, a)
− Once Q* is known, the optimal policy acts greedily:
π*(s) = argmax_a Q*(s, a)
− Unrolled recursively:
Q*(s, a) = rt+1 + γ max_at+1 rt+2 + γ² max_at+2 rt+3 + ...
         = rt+1 + γ max_at+1 Q*(st+1, at+1)
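To make the Bellman optimality backup concrete, here is a toy value-iteration sketch on an invented 2-state, 2-action MDP (the transition and reward tables are purely illustrative):

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s']: transition probabilities
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([[1.0, 0.0],                 # R[s, a]: expected immediate reward
              [0.0, 2.0]])

Q = np.zeros((2, 2))
for _ in range(500):
    # Q*(s,a) = E_{s'}[ r + gamma * max_{a'} Q*(s', a') ]
    Q = R + gamma * P @ Q.max(axis=1)

print("Q*:", Q)
print("greedy policy pi*(s) = argmax_a Q*(s,a):", Q.argmax(axis=1))
```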
Model
[Figure: agent-environment loop with actions at and rewards rt]
− A model predicts what the environment will do next
Approaches To Reinforcement Learning
Value-based RL
− Estimate the optimal value function Q*(s, a)
− This is the maximum value achievable under any policy
Policy-based RL
− Search directly for the optimal policy π*
− This is the policy achieving maximum future reward
Model-based RL
− Build a model of the environment
− Plan (e.g. by lookahead) using the model
Deep Reinforcement Learning
Use deep neural networks to represent:
− Value function
− Policy
− Model
Q-Networks
− Represent the value function by a Q-network with weights w:
Q(s, a, w) ≈ Q*(s, a)
[Figure: two Q-network architectures; one takes (s, a) and outputs Q(s, a, w), the other takes s and outputs Q(s, a1, w), ..., Q(s, am, w), one value per action]
− Optimal Q-values should obey the Bellman equation:
Q*(s, a) = Es′[ r + γ max_a′ Q*(s′, a′) | s, a ]
− Treat the right-hand side r + γ max_a′ Q(s′, a′, w) as a target, and minimize the MSE loss by stochastic gradient descent:
l = ( r + γ max_a′ Q(s′, a′, w) − Q(s, a, w) )²
− This converges to Q* with a table-lookup representation, but diverges with neural networks due to:
− Correlations between samples
− Non-stationary targets
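A minimal PyTorch sketch of this squared TD error; the network size and the toy batch are assumptions for illustration, not the lecture's setup:

```python
import torch
import torch.nn as nn

n_obs, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(n_obs, 32), nn.ReLU(), nn.Linear(32, n_actions))

def td_loss(s, a, r, s_next):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q(s, a, w)
    with torch.no_grad():                                      # treat target as fixed
        target = r + gamma * q_net(s_next).max(dim=1).values  # r + gamma * max_a' Q(s', a', w)
    return ((target - q_sa) ** 2).mean()

# toy batch of 8 transitions
s = torch.randn(8, n_obs); a = torch.randint(n_actions, (8,))
r = torch.randn(8); s_next = torch.randn(8, n_obs)
td_loss(s, a, r, s_next).backward()   # an SGD step on q_net would follow
```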
Deep Q-Networks (DQN): Experience Replay
− To remove correlations, build a dataset from the agent’s own experience:
s1, a1, r2, s2
s2, a2, r3, s3
s3, a3, r4, s4
...
st, at, rt+1, st+1
− Sample transitions (s, a, r, s′) from the dataset and apply the update:
l = ( r + γ max_a′ Q(s′, a′, w) − Q(s, a, w) )²
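A minimal replay-buffer sketch for storing and sampling such (s, a, r, s′) transitions; the capacity and batch size are illustrative choices:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall off

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # a uniform random minibatch breaks the correlation between
        # consecutive steps of the same episode
        return random.sample(self.buffer, batch_size)
```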
Deep Reinforcement Learning in Atari
[Figure: DQN in Atari; the agent observes state st (raw pixels), takes action at, and receives reward rt]
V. Mnih et al. “Human-level control through deep reinforcement learning”. Nature 518:529–533, 2015.
DQN in Atari
Network architecture and hyperparameters fixed across all games
DQN Results in Atari
Two-dimensional t-SNE embedding of the representations in the last hidden layer assigned by DQN to game states experienced while playing Space Invaders.
Improvements since Nature DQN
Double DQN: Remove the upward bias caused by max_a Q(s, a, w)
− Current Q-network w is used to select actions
− Older Q-network w⁻ is used to evaluate actions
l = ( r + γ Q(s′, argmax_a′ Q(s′, a′, w), w⁻) − Q(s, a, w) )²
Prioritized replay: Weight experience according to surprise
− Store experience in a priority queue according to the DQN error
| r + γ max_a′ Q(s′, a′, w⁻) − Q(s, a, w) |
Duelling network: Split the Q-network into two channels
− Action-independent value function V(s, v)
− Action-dependent advantage function A(s, a, w)
Q(s, a) = V(s, v) + A(s, a, w)
Combined algorithm: 3x mean Atari score vs. Nature DQN
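A sketch of the Double DQN target above: the online network w selects the action, the older target network w⁻ evaluates it (PyTorch; `q_net` and `target_net` are assumed to be defined elsewhere):

```python
import torch

def double_dqn_target(r, s_next, q_net, target_net, gamma=0.99):
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # select with w
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)  # evaluate with w-
    return r + gamma * q_eval
```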
Gorila (General Reinforcement Learning Architecture)
Asynchronous Reinforcement Learning
− Viable alternative to experience replay
Deep Policy Networks
− Represent the policy by a deep network with weights u:
a = π(a|s, u) or a = π(s, u)
− Define the objective function as the total discounted reward:
L(u) = E[ r1 + γ r2 + γ² r3 + ... | π(·, u) ]
− Optimize the objective end-to-end by SGD
Policy Gradients
How to make high-value actions more likely?
− The gradient of a stochastic policy π(a|s, u) is given by:
∂L(u)/∂u = E[ ∂log π(a|s, u)/∂u · Qπ(s, a) ]
− The gradient of a deterministic policy a = π(s) is given by:
∂L(u)/∂u = E[ ∂Qπ(s, a)/∂a · ∂a/∂u ]
− (usable if a is continuous and Q is differentiable)
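A sketch of the score-function (stochastic-policy) gradient for a linear softmax policy; the shapes and the update rule in the final comment are illustrative assumptions:

```python
import numpy as np

def grad_log_pi(theta, s, a):
    # d/d(theta) of log pi(a|s, theta) for pi = softmax(theta^T s)
    logits = theta.T @ s
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = -np.outer(s, probs)   # -pi(b|s) * s for every action b
    grad[:, a] += s              # + s for the action actually taken
    return grad

# one stochastic-gradient-ascent step, given an estimate q of Q^pi(s, a):
# theta += learning_rate * grad_log_pi(theta, s, a) * q
```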
Actor-Critic Algorithm
− Estimate the value function Q(s, a, w) ≈ Qπ(s, a)
− Update the policy parameters u by stochastic gradient ascent:
∂l/∂u = ∂log π(a|s, u)/∂u · Q(s, a, w)
or
∂l/∂u = ∂Q(s, a, w)/∂a · ∂a/∂u
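One actor update matching the first gradient above, as a PyTorch sketch; `policy_u` (the policy network with weights u) and `q_w` (a tensor of critic values Q(s, a, w) for the sampled actions) are assumed inputs:

```python
import torch

def actor_step(s, a, q_w, policy_u, opt_u):
    log_pi = torch.log_softmax(policy_u(s), dim=1)          # log pi(.|s, u)
    log_pi_a = log_pi.gather(1, a.unsqueeze(1)).squeeze(1)  # log pi(a|s, u)
    loss = -(log_pi_a * q_w.detach()).mean()  # ascend E[grad log pi * Q]
    opt_u.zero_grad(); loss.backward(); opt_u.step()
```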
Asynchronous Advantage Actor-Critic (A3C)
− Estimate the state-value function:
V(s, v) ≈ E[ rt+1 + γ rt+2 + ... | s ]
− The Q-value is estimated by an n-step sample:
qt = rt+1 + γ rt+2 + ... + γ^{n−1} rt+n + γ^n V(st+n, v)
− The actor is updated towards the target:
∂lu/∂u = ∂log π(at|st, u)/∂u · (qt − V(st, v))
− The critic is updated to minimize the MSE w.r.t. the target:
lv = (qt − V(st, v))²
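The n-step return qt can be computed backwards from the bootstrap value; a plain-Python sketch (`rewards` holds rt+1..rt+n and `v_boot` is V(st+n, v); the names are assumptions):

```python
def n_step_return(rewards, v_boot, gamma=0.99):
    q = v_boot
    for r in reversed(rewards):
        q = r + gamma * q   # q_t = r_{t+1} + gamma * q_{t+1}
    return q
```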
Deep Reinforcement Learning in Labyrinth
A3C in Labyrinth
[Figure: A3C unrolled over time steps st−1, st, st+1; pixels from the current frame feed a recurrent neural network (LSTM), which outputs a softmax policy over actions π(a|st) and a value estimate V(st)]
− End-to-end learning of a softmax policy π(a|s) from raw pixels
− Task: collect apples (+1 reward) and escape (+10 reward)
Deep Reinforcement Learning with Continuous Actions
How can we deal with high-dimensional continuous action spaces?
− Can't easily compute max_a Q(s, a) when a is high-dimensional and continuous
− Actor-critic algorithms learn without the max
− The Q-value gradient ∂Q(s, a)/∂a gives the direction in which to improve the policy
Deep DPG
− DPG is the continuous analogue of DQN
− The critic estimates the value of the current policy:
lw = ( r + γ Q(s′, π(s′, u⁻), w⁻) − Q(s, a, w) )²
− The actor updates the policy in the direction that most improves Q:
∂lu/∂u = ∂Q(s, a, w)/∂a · ∂a/∂u
− To deal with non-stationarity, the target networks u⁻, w⁻ are held fixed
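A sketch of the DDPG actor update above: backpropagate the critic's value through the deterministic policy (PyTorch; `policy_u` and `critic_w` are assumed networks, with the critic taking state and action concatenated):

```python
import torch

def ddpg_actor_step(s, policy_u, critic_w, opt_u):
    a = policy_u(s)                          # a = pi(s, u), continuous action
    q = critic_w(torch.cat([s, a], dim=1))   # Q(s, pi(s, u), w)
    loss = -q.mean()                         # ascend Q via dQ/da * da/du
    opt_u.zero_grad(); loss.backward(); opt_u.step()
```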
DPG in Simulated Physics
− Physics domains are simulated in MuJoCo
− End-to-end learning of the control policy from raw pixels s
− The input state s is a stack of raw pixels from the last 4 frames
− Two separate convolutional networks are used for Q and π
− The policy π is adjusted in the direction that most improves Q
[Figure: actor network π(s) → a and critic network Q(s, a)]
A3C in Simulated Physics Demo
Learning Models of the Environment
Planning with a learned model is challenging because of compounding errors:
− Errors in the transition model compound over the trajectory
− Planning trajectories differ from executed trajectories
− At the end of a long, unusual trajectory, rewards are totally wrong
Why is Go hard for computers to play?
Game tree complexity = b^d. Brute-force search is intractable:
1. The search space is huge
2. It is “impossible” for computers to evaluate who is winning
Convolutional Neural Network
Value Network
[Figure: value network; input: position s, output: a scalar evaluation of the position]
Policy Network
[Figure: policy network; input: position s, output: move probabilities p(a|s)]
Neural Network Training Pipeline
[Figure: training pipeline; human expert positions → supervised learning policy network → reinforcement learning policy network (via self-play) → self-play data → value network]
Supervised Learning of Policy Networks
Policy network: 12-layer convolutional neural network
Training data: 30M positions from human expert games (KGS 5+ dan)
Training algorithm: maximize likelihood by stochastic gradient descent
Training time: 4 weeks on 50 GPUs using Google Cloud
Results: 57% accuracy on held-out test data (state of the art was 44%)
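The “maximize likelihood by SGD” step can be sketched as cross-entropy training on expert moves; the single-plane board encoding and tiny network below are assumptions, far smaller than AlphaGo's real 12-layer network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

policy = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(16 * 19 * 19, 19 * 19))  # one logit per board point
opt = torch.optim.SGD(policy.parameters(), lr=0.01)

def sl_step(board, expert_move):
    # board: (B, 1, 19, 19) float, expert_move: (B,) long in [0, 361)
    loss = F.cross_entropy(policy(board), expert_move)  # -log p(expert move)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```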
Reinforcement Learning of Policy Networks
Policy network: 12-layer convolutional neural network
Training data: games of self-play between policy networks
Training algorithm: maximize wins z by policy gradient reinforcement learning
Training time: 1 week on 50 GPUs using Google Cloud
Results: 80% win rate vs. the supervised learning network; the raw network plays at ~3 amateur dan
Reinforcement Learning of Value Networks
Value network: 12-layer convolutional neural network
Training data: 30 million games of self-play
Training algorithm: minimize MSE by stochastic gradient descent
Training time: 1 week on 50 GPUs using Google Cloud
Results: the first strong position evaluation function (previously thought impossible)
Exhaustive Search
Reducing Depth with Value Network
Reducing Breadth with Policy Network
Evaluating AlphaGo Against Computers
[Figure: Elo ratings of Go programs (Gnu Go, Fuego, Pachi, Zen, Crazy Stone, AlphaGo Nature v13, AlphaGo Seoul v18) on a scale from beginner kyu (k) through amateur dan (d) to professional dan (p)]
[Figure: calibration against human players; AlphaGo (Oct 2015) beat Fan Hui (2p, 3-time reigning European Champion) 5-0 in the Nature match; AlphaGo (Mar 2016) beat Lee Sedol (9p, top player of the past decade) 4-1 in the DeepMind challenge match; Crazy Stone and Zen play at strong amateur level (KGS)]