CMP784 DEEP LEARNING Lecture #12: Deep Reinforcement Learning



slide-1
SLIDE 1

Lecture #12 – Deep Reinforcement Learning

Aykut Erdem // Hacettepe University // Spring 2018

CMP784

DEEP LEARNING

DeepLoco by X. B. Peng, G. Berseth & M. van de Panne

slide-2
SLIDE 2

Previously on CMP784

  • Generative Adversarial Networks (GANs)
  • How do GANs work
  • Conditional GAN
  • Tips and Tricks
  • Applications
2

Neural Face by Taehoon Kim

slide-3
SLIDE 3

Lecture overview

  • What is Reinforcement Learning?
  • Components of an RL problem
  • Markov Decision Processes
  • Value-Based Deep RL
  • Policy-Based Deep RL
  • Model-Based Deep RL

Disclaimer: Much of the material and slides for this lecture were borrowed from
− John Schulman’s talk on “Deep Reinforcement Learning: Policy Gradients and Q-Learning”
− David Silver’s tutorial on “Deep Reinforcement Learning”
− Lex Fridman’s MIT 6.S094 Deep Learning for Self-Driving Cars class

3
slide-4
SLIDE 4

What is Reinforcement Learning?

4
slide-5
SLIDE 5

What is Reinforcement Learning?

5

[Figure: supervised learning vs. unsupervised learning vs. reinforcement learning]

Slide credit: Razvan Pascanu

slide-6
SLIDE 6

What is Reinforcement Learning?

  • Branch of machine learning concerned with taking sequences of actions
  • Usually described in terms of an agent interacting with a previously unknown environment, trying to maximize cumulative reward

6

[Diagram: the agent sends an action to the environment and receives back an observation and a reward]
slide-7
SLIDE 7

Motor Control and Robotics

Robotics:

  • Observations: camera images, joint angles
  • Actions: joint torques
  • Rewards: stay balanced, navigate to target locations, serve and protect humans
7
slide-8
SLIDE 8

Business Operations

8

Inventory Management:

  • Observations: current inventory levels
  • Actions: number of units of each item to purchase
  • Rewards: profit
slide-9
SLIDE 9

Image Captioning

9

Hard Attention for Image Captioning:

  • Observations: current image window
  • Actions: where to look
  • Rewards: classification
slide-10
SLIDE 10

Games

A different kind of optimization problem (min-max) but still considered to be RL.

  • Go (complete information, deterministic): AlphaGo
  • Backgammon (complete information, stochastic): TD-Gammon
  • Stratego (incomplete information, deterministic)
  • Poker (incomplete information, stochastic)

10

Matej Moravčík et al. “DeepStack: Expert-level artificial intelligence in heads-up no-limit poker”. Science, 2 Mar 2017.
David Silver, Aja Huang, et al. “Mastering the game of Go with deep neural networks and tree search”. Nature 529.7587 (2016), pp. 484–489.
Gerald Tesauro. “Temporal difference learning and TD-Gammon”. Communications of the ACM 38.3 (1995), pp. 58–68.

slide-11
SLIDE 11

How does RL relate to Supervised Learning?

  • Supervised learning:
− Environment samples input-output pair $(x_t, y_t) \sim \rho$
− Agent predicts $\hat{y}_t = f(x_t)$
− Agent receives loss $\ell(y_t, \hat{y}_t)$
− Environment asks agent a question, and then tells her the right answer

11

slide-12
SLIDE 12

How does RL relate to Supervised Learning?

  • Reinforcement learning:
− Environment samples input $x_t \sim P(x_t \mid x_{t-1}, y_{t-1})$
  (input depends on your previous actions!)
− Agent predicts $\hat{y}_t = f(x_t)$
− Agent receives cost $c_t \sim P(c_t \mid x_t, \hat{y}_t)$, where P is a probability distribution unknown to the agent

12

slide-13
SLIDE 13

Reinforcement Learning in a nutshell

RL is a general-purpose framework for decision-making

  • RL is for an agent with the capacity to act
  • Each action influences the agent’s future state
  • Success is measured by a scalar reward signal
  • Goal: select actions to maximize future reward

14
slide-14
SLIDE 14

Deep Learning in a nutshell

DL is a general-purpose framework for representation learning

  • Given an objective
  • Learn representation that is required to achieve objective
  • Directly from raw inputs
  • Using minimal domain knowledge

15
slide-15
SLIDE 15

Deep Reinforcement Learning: AI = RL + DL

We seek a single agent which can solve any human-level task

  • RL defines the objective
  • DL gives the mechanism
  • RL + DL = general intelligence
  • Examples:

− Play games: Atari, poker, Go, ...
− Explore worlds: 3D worlds, Labyrinth, ...
− Control physical systems: manipulate, walk, swim, ...
− Interact with users: recommend, optimize, personalize, ...

16
slide-16
SLIDE 16

Agent and Environment

  • At each step t the agent:
− Executes action $a_t$
− Receives observation $o_t$
− Receives scalar reward $r_t$
  • The environment:
− Receives action $a_t$
− Emits observation $o_{t+1}$
− Emits scalar reward $r_{t+1}$

17

[Diagram: agent-environment loop with observation $o_t$, reward $r_t$, and action $a_t$]
slide-17
SLIDE 17

Example Reinforcement Learning Problem

  • An agent operates in an environment: Atari Breakout
  • An agent has the capacity to act
  • Each action influences the agent’s future state
  • Success is measured by a reward signal
  • Goal is to select actions to maximize future reward

18
slide-18
SLIDE 18

State

  • Experience is a sequence of observations, actions, rewards: $o_1, r_1, a_1, \ldots, a_{t-1}, o_t, r_t$
  • The state is a summary of experience: $s_t = f(o_1, r_1, a_1, \ldots, a_{t-1}, o_t, r_t)$
  • In a fully observed environment: $s_t = f(o_t)$

19
slide-19
SLIDE 19

Major Components of an RL Agent

  • An RL agent may include one or more of these components:

− Policy: Agent’s behavior function − Value function: How good is each state and/or action − Model: Agent’s representation of the environment

20
slide-20
SLIDE 20

Policy

  • A policy is the agent’s behavior
  • It is a map from state to action:
− Deterministic policy: $a = \pi(s)$
− Stochastic policy: $\pi(a \mid s) = \mathbb{P}[a \mid s]$

21

slide-21
SLIDE 21

Value Function

  • A value function is a prediction of future reward
− “How much reward will I get from action a in state s?”
  • Q-value function gives expected total reward
− from state s and action a
− under policy $\pi$
− with discount factor $\gamma$
$$Q^\pi(s, a) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots \mid s, a \right]$$
  • Value functions decompose into a Bellman equation
$$Q^\pi(s, a) = \mathbb{E}_{s', a'}\left[ r + \gamma Q^\pi(s', a') \mid s, a \right]$$

22

slide-22
SLIDE 22

Optimal Value Functions

  • An optimal value function is the maximum achievable value
$$Q^*(s, a) = \max_\pi Q^\pi(s, a) = Q^{\pi^*}(s, a)$$
  • Once we have Q* we can act optimally:
$$\pi^*(s) = \operatorname*{argmax}_a Q^*(s, a)$$
  • Optimal value maximizes over all decisions. Informally:
$$Q^*(s, a) = r_{t+1} + \gamma \max_{a_{t+1}} r_{t+2} + \gamma^2 \max_{a_{t+2}} r_{t+3} + \ldots = r_{t+1} + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1})$$
  • Formally, optimal values decompose into a Bellman equation
$$Q^*(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]$$

23
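To make the Bellman optimality backup concrete, below is a minimal sketch of value iteration applied to Q for a tiny, made-up MDP; the transition tensor P and reward matrix R are invented purely for illustration.

```python
import numpy as np

# Toy 2-state, 2-action MDP (numbers are made up for illustration only).
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(200):
    # Bellman optimality backup: Q*(s,a) = R(s,a) + gamma * E_s'[ max_a' Q*(s',a') ]
    Q = R + gamma * P @ Q.max(axis=1)

pi_star = Q.argmax(axis=1)   # greedy policy pi*(s) = argmax_a Q*(s,a)
print(Q)
print(pi_star)
```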

slide-23
SLIDE 23

Model

24

[Diagram: agent-environment loop with observation $o_t$, reward $r_t$, and action $a_t$]
slide-24
SLIDE 24

Model

  • Model is learnt from experience
  • Acts as proxy for environment
  • Planner interacts with model
  • e.g. using lookahead search
25
slide-25
SLIDE 25

Approaches To Reinforcement Learning

  • Value-based RL
− Estimate the optimal value function $Q^*(s, a)$
− This is the maximum value achievable under any policy
  • Policy-based RL
− Search directly for the optimal policy $\pi^*$
− This is the policy achieving maximum future reward
  • Model-based RL
− Build a model of the environment
− Plan (e.g. by lookahead) using model

26

slide-26
SLIDE 26

Deep Reinforcement Learning

  • Use deep neural networks to represent

− Value function − Policy − Model

  • Optimize loss function by stochastic gradient descent
27
slide-27
SLIDE 27

Value-Based Deep RL

28
slide-28
SLIDE 28

Q-Networks

  • Represent value function by Q-network with weights w
$$Q(s, a, w) \approx Q^*(s, a)$$

29

[Diagram: two network architectures: one takes (s, a) as input and outputs Q(s, a, w); the other takes s as input and outputs Q(s, a_1, w), ..., Q(s, a_m, w), one value per action]
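As a concrete illustration, here is a minimal sketch of the second architecture (one output per discrete action) in PyTorch; the layer sizes and names are assumptions, not the lecture's exact network.

```python
import torch
import torch.nn as nn

# Sketch of a Q-network: state vector in, one Q-value per discrete action out.
class QNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),   # Q(s, a_1, w), ..., Q(s, a_m, w)
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)                  # shape: (batch, n_actions)

# Greedy action selection: a = argmax_a Q(s, a, w)
q_net = QNetwork(state_dim=4, n_actions=2)
action = q_net(torch.randn(1, 4)).argmax(dim=1)
```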

slide-29
SLIDE 29

Q-Learning

  • Optimal Q-values should obey the Bellman equation
$$Q^*(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]$$
  • Treat the right-hand side $r + \gamma \max_{a'} Q(s', a', w)$ as a target
  • Minimize MSE loss by stochastic gradient descent
$$l = \left( r + \gamma \max_{a'} Q(s', a', w) - Q(s, a, w) \right)^2$$
  • Converges to Q* using table lookup representation
  • But diverges using neural networks due to:
− Correlations between samples
− Non-stationary targets

30
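A minimal sketch of this loss on a batch of transitions, assuming a QNetwork like the one above and a batch laid out as (s, a, r, s', done) with done a 0/1 float mask; both are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def q_learning_loss(q_net, batch, gamma: float = 0.99) -> torch.Tensor:
    s, a, r, s_next, done = batch                            # assumed field order
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s, a, w)
    with torch.no_grad():                                    # target treated as a constant
        target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)                          # (target - Q(s, a, w))^2
```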

slide-30
SLIDE 30

Deep Q-Networks (DQN): Experience Replay

  • To remove correlations, build a data-set of transitions $(s_t, a_t, r_{t+1}, s_{t+1})$ from the agent’s own experience
  • Sample transitions $(s, a, r, s')$ from the data-set and apply update
  • To deal with non-stationarity, target parameters $w^-$ are held fixed
$$l = \left( r + \gamma \max_{a'} Q(s', a', w^-) - Q(s, a, w) \right)^2$$

31
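A minimal sketch of such a replay buffer; the class and method names are illustrative, not a specific library's API.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)          # oldest transitions are evicted

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int):
        # Uniform sampling breaks the correlation between consecutive steps
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```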

slide-31
SLIDE 31

Deep Reinforcement Learning in Atari

32

[Diagram: DQN agent playing Atari; the emulator provides state $s_t$ and reward $r_t$, the agent selects action $a_t$]

Human level control through deep reinforcement learning, V. Mnih et al. Nature 518:529-533, 2015.

slide-32
SLIDE 32

DQN in Atari

  • End-to-end learning of values Q(s,a) from pixels s
  • Input state s is stack of raw pixels from last 4 frames
  • Output is Q(s,a) for 18 joystick/button positions
  • Reward is change in score for that step

Network architecture and hyperparameters fixed across all games

33
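Below is a minimal sketch of a Nature-DQN-style convolutional Q-network for Atari; the layer sizes follow the commonly cited architecture but should be read as an assumption here, not the lecture's exact specification.

```python
import torch.nn as nn

# 4 stacked 84x84 grayscale frames in, one Q-value per joystick/button action out.
class AtariQNetwork(nn.Module):
    def __init__(self, n_actions: int = 18):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, frames):                    # frames: (batch, 4, 84, 84)
        return self.head(self.features(frames))
```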
slide-33
SLIDE 33

DQN Results in Atari

34
slide-34
SLIDE 34 35
slide-35
SLIDE 35 36
slide-36
SLIDE 36 37


Two-dimensional t-SNE embedding of the representations in the last hidden layer assigned by DQN to game states experienced while playing Space Invaders.

slide-37
SLIDE 37

Improvements since Nature DQN

  • Double DQN: Remove upward bias caused by $\max_a Q(s, a, w)$
− Current Q-network w is used to select actions
− Older Q-network $w^-$ is used to evaluate actions
$$l = \left( r + \gamma\, Q\big(s', \operatorname*{argmax}_{a'} Q(s', a', w), w^-\big) - Q(s, a, w) \right)^2$$
  • Prioritized replay: Weight experience according to surprise
− Store experience in priority queue according to DQN error $r + \gamma \max_{a'} Q(s', a', w^-) - Q(s, a, w)$
  • Duelling network: Split Q-network into two channels
− Action-independent value function V(s, v)
− Action-dependent advantage function A(s, a, w)
$$Q(s, a) = V(s, v) + A(s, a, w)$$

Combined algorithm: 3x mean Atari score vs Nature DQN

38
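A minimal sketch of the Double DQN target, assuming an online network (weights w) and a frozen target network (weights w^-) with the QNetwork interface sketched earlier, and done as a 0/1 float mask.

```python
import torch

def double_dqn_target(q_net, q_target, r, s_next, done, gamma: float = 0.99):
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)       # select action with w
        q_eval = q_target(s_next).gather(1, best_a).squeeze(1)   # evaluate it with w^-
        return r + gamma * (1.0 - done) * q_eval
```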
slide-38
SLIDE 38

Gorila

  • 10x faster than Nature DQN on 38 out of 49 Atari games
  • Applied to recommender systems within Google
39
slide-39
SLIDE 39

Asynchronous Reinforcement Learning

  • Exploits multithreading of standard CPU
  • Execute many instances of agent in parallel
  • Network parameters shared between threads
  • Parallelism decorrelates data

− Viable alternative to experience replay

  • Similar speedup to Gorila - on a single machine!
40
slide-40
SLIDE 40

Policy-Based Deep RL

41
slide-41
SLIDE 41

Deep Policy Networks

  • Represent policy by deep network with weights u: $a = \pi(a \mid s, u)$ or $a = \pi(s, u)$
  • Define objective function as total discounted reward
$$L(u) = \mathbb{E}\left[ r_1 + \gamma r_2 + \gamma^2 r_3 + \ldots \mid \pi(\cdot, u) \right]$$
  • Optimize objective end-to-end by SGD
  • i.e. adjust policy parameters u to achieve more reward

42

slide-42
SLIDE 42

Policy Gradients

How to make high-value actions more likely:

  • The gradient of a stochastic policy $\pi(a \mid s, u)$ is given by
$$\frac{\partial L(u)}{\partial u} = \mathbb{E}\left[ \frac{\partial \log \pi(a \mid s, u)}{\partial u}\, Q^\pi(s, a) \right]$$
  • The gradient of a deterministic policy $a = \pi(s)$ is given by
$$\frac{\partial L(u)}{\partial u} = \mathbb{E}\left[ \frac{\partial Q^\pi(s, a)}{\partial a}\, \frac{\partial a}{\partial u} \right]$$
if a is continuous and Q is differentiable

43
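A minimal sketch of the stochastic (score-function) policy gradient written as a surrogate loss for a discrete softmax policy; policy_net and the use of sampled returns as the estimate of Q^pi(s, a) are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def policy_gradient_loss(policy_net, states, actions, returns):
    log_probs = F.log_softmax(policy_net(states), dim=1)
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)   # log pi(a|s,u)
    # Minimizing -E[log pi(a|s,u) * Q] performs gradient ascent on L(u)
    return -(log_pi_a * returns.detach()).mean()
```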
slide-43
SLIDE 43

Actor-Critic Algorithm

  • Estimate value function $Q(s, a, w) \approx Q^\pi(s, a)$
  • Update policy parameters u by stochastic gradient ascent
$$\frac{\partial l}{\partial u} = \frac{\partial \log \pi(a \mid s, u)}{\partial u}\, Q(s, a, w)$$
or
$$\frac{\partial l}{\partial u} = \frac{\partial Q(s, a, w)}{\partial a}\, \frac{\partial a}{\partial u}$$

44

slide-44
SLIDE 44

Asynchronous Advantage Actor-Critic (A3C)

  • Estimate state-value function
$$V(s, v) \approx \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \ldots \mid s \right]$$
  • Q-value estimated by an n-step sample
$$q_t = r_{t+1} + \gamma r_{t+2} + \ldots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n}, v)$$
  • Actor is updated towards target
$$\frac{\partial l_u}{\partial u} = \frac{\partial \log \pi(a_t \mid s_t, u)}{\partial u}\, \big( q_t - V(s_t, v) \big)$$
  • Critic is updated to minimize MSE w.r.t. target
$$l_v = \big( q_t - V(s_t, v) \big)^2$$
  • 4x mean Atari score vs Nature DQN

45
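A minimal sketch of how the n-step targets q_t can be computed for one rollout; the function name and rollout layout are illustrative.

```python
import torch

# rewards = [r_{t+1}, ..., r_{t+n}] as floats, bootstrap_value = V(s_{t+n}, v)
def n_step_returns(rewards, bootstrap_value, gamma: float = 0.99):
    q = bootstrap_value
    targets = []
    for r in reversed(rewards):         # work backwards through the rollout
        q = r + gamma * q               # q_k = r_{k+1} + gamma * q_{k+1}
        targets.append(q)
    return torch.tensor(list(reversed(targets)))   # [q_t, q_{t+1}, ...]

# Actor loss: -log pi(a_t|s_t, u) * (q_t - V(s_t, v));  critic loss: (q_t - V(s_t, v))^2
```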

slide-45
SLIDE 45

Deep Reinforcement Learning in Labyrinth

46
slide-46
SLIDE 46

A3C in Labyrinth

47

[Diagram: A3C network unrolled over time; at each step t the recurrent state $s_t$ produces a policy $\pi(a \mid s_t)$ and a value $V(s_t)$]

  • End-to-end learning of softmax policy $\pi(a \mid s_t)$ from pixels
  • Observations $o_t$ are raw pixels from current frame
  • State $s_t = f(o_1, \ldots, o_t)$ is a recurrent neural network (LSTM)
  • Outputs both value V(s) and softmax over actions $\pi(a \mid s)$
  • Task is to collect apples (+1 reward) and escape (+10 reward)

slide-47
SLIDE 47 48
slide-48
SLIDE 48 49
slide-49
SLIDE 49

Deep Reinforcement Learning with Continuous Actions

How can we deal with high-dimensional continuous action spaces?

  • Can’t easily compute $\max_a Q(s, a)$
− Actor-critic algorithms learn without max
  • Q-values are differentiable w.r.t. a
− Deterministic policy gradients exploit knowledge of $\frac{\partial Q}{\partial a}$

50

slide-50
SLIDE 50

Deep DPG

DPG is the continuous analogue of DQN

  • Experience replay: build data-set from agent’s experience
  • Critic estimates value of current policy by DQN
$$l_w = \left( r + \gamma\, Q(s', \pi(s', u^-), w^-) - Q(s, a, w) \right)^2$$
To deal with non-stationarity, targets $u^-$, $w^-$ are held fixed
  • Actor updates policy in direction that improves Q
$$\frac{\partial l_u}{\partial u} = \frac{\partial Q(s, a, w)}{\partial a}\, \frac{\partial a}{\partial u}$$
  • In other words, the critic provides the loss function for the actor

51
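A minimal sketch of the two DDPG updates, assuming actor/critic modules and frozen target copies (u^-, w^-); module names, call signatures, and the batch layout are illustrative, with done a 0/1 float mask.

```python
import torch

def ddpg_losses(actor, critic, actor_targ, critic_targ, batch, gamma: float = 0.99):
    s, a, r, s_next, done = batch
    with torch.no_grad():                                   # fixed targets u^-, w^-
        q_next = critic_targ(s_next, actor_targ(s_next)).squeeze(1)
        target = r + gamma * (1.0 - done) * q_next
    critic_loss = ((critic(s, a).squeeze(1) - target) ** 2).mean()
    # Actor ascends Q(s, pi(s, u), w): gradients flow through the action into u
    actor_loss = -critic(s, actor(s)).mean()
    return critic_loss, actor_loss
```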

slide-51
SLIDE 51

DPG in Simulated Physics

  • Physics domains are simulated in MuJoCo
  • End-to-end learning of control policy from raw pixels s
  • Input state s is stack of raw pixels from last 4 frames
  • Two separate convnets are used for Q(s, a) and π(s)
  • Policy is adjusted in direction that most improves Q

52

slide-52
SLIDE 52

A3C in Simulated Physics Demo

  • Asynchronous RL is viable alternative to experience replay
  • Train a hierarchical, recurrent locomotion controller
  • Retrain controller on more challenging tasks
53
slide-53
SLIDE 53 54
slide-54
SLIDE 54

Model-Based Deep RL

57
slide-55
SLIDE 55

Learning Models of the Environment

  • Challenging to plan due to compounding errors

− Errors in the transition model compound over the trajectory − Planning trajectories differ from executed trajectories − At end of long, unusual trajectory, rewards are totally wrong

58
slide-56
SLIDE 56

Learning Models of the Environment

  • Challenging to plan due to compounding errors

− Errors in the transition model compound over the trajectory − Planning trajectories differ from executed trajectories − At end of long, unusual trajectory, rewards are totally wrong

59
slide-57
SLIDE 57

Case Study: AlphaGo

60
slide-58
SLIDE 58 61
slide-59
SLIDE 59

Why is Go hard for computers to play?

Game tree complexity = $b^d$

Brute force search intractable:
1. Search space is huge
2. “Impossible” for computers to evaluate who is winning

62

slide-60
SLIDE 60

Convolutional Neural Network

63

Convolutional neural network

slide-61
SLIDE 61

Value Network

64

[Diagram: value network]

slide-62
SLIDE 62

Policy Network

65

[Diagram: policy network p(a|s), mapping a position s to move probabilities]

slide-63
SLIDE 63

Neural Network Training Pipeline

66

[Diagram: neural network training pipeline: human expert positions → supervised learning policy network → reinforcement learning policy network (self-play data) → value network]

slide-64
SLIDE 64

Supervised Learning of Policy Networks

  • Policy network: 12 layer convolutional neural network
  • Training data: 30M positions from human expert games (KGS 5+ dan)
  • Training algorithm: maximize likelihood by stochastic gradient descent
  • Training time: 4 weeks on 50 GPUs using Google Cloud
  • Results: 57% accuracy on held out test data (state-of-the-art was 44%)

67
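A minimal sketch of one supervised-learning step for such a policy network: maximize the likelihood of the expert move by minimizing cross-entropy. policy_net, the 19x19 = 361 move encoding, and the optimizer are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def sl_policy_step(policy_net, optimizer, boards, expert_moves):
    logits = policy_net(boards)                     # (batch, 361) logits over moves
    loss = F.cross_entropy(logits, expert_moves)    # -log p(expert move | position)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```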

slide-65
SLIDE 65

Reinforcement Learning of Policy Networks

  • Policy network: 12 layer convolutional neural network
  • Training data: games of self-play between policy network
  • Training algorithm: maximize wins z by policy gradient reinforcement learning
  • Training time: 1 week on 50 GPUs using Google Cloud
  • Results: 80% vs supervised learning. Raw network ~3 amateur dan.

68

slide-66
SLIDE 66

Reinforcement Learning of Value Networks

  • Value network: 12 layer convolutional neural network
  • Training data: 30 million games of self-play
  • Training algorithm: minimize MSE by stochastic gradient descent
  • Training time: 1 week on 50 GPUs using Google Cloud
  • Results: First strong position evaluation function, previously thought impossible

69

slide-67
SLIDE 67

Exhaustive Search

70

Exhaustive search

slide-68
SLIDE 68

Reducing Depth with Value Network

71
slide-69
SLIDE 69

Reducing Breadth with Policy Network

72
slide-70
SLIDE 70

Evaluating AlphaGo Against Computers

73

[Chart: Elo ratings of Go programs and human ranks, spanning beginner kyu (k), amateur dan (d), and professional dan (p). AlphaGo (Seoul v18) and AlphaGo (Nature v13) rate far above Crazy Stone, Zen, Pachi, Fuego, and Gnu Go.]
slide-71
SLIDE 71 74

[Diagram: calibration of computer programs against human players. Crazy Stone and Zen beat amateur humans (KGS); AlphaGo (Oct 2015) beats Fan Hui (2p), 3-times reigning Euro Champion, 5-0 in the Nature match; AlphaGo (Mar 2016) beats Lee Sedol (9p), top player of the past decade, 4-1 in the DeepMind challenge match.]