SLIDE 1

Deep Reinforcement Learning

Axel Perschmann

Supervisor: Ahmed Abdulkadir
Seminar: Current Works in Computer Vision
Research Group: Pattern Recognition and Image Processing
Albert-Ludwigs-Universität Freiburg

7 July 2016
SLIDE 2

Contents

1. Reinforcement Learning: Basics, Learning Methods
2. Asynchronous Reinforcement Learning: Related Work, Multi-threaded Learning
3. Experiments: Benchmark Environments, Network Architecture, Results
4. Conclusion


SLIDE 3

Motivation: Learning by experience

The agent interacts with the environment over a number of discrete time steps t. At each step it:
• chooses an action based on the current situation
• receives feedback
• updates its future choice of actions

Image: de.slideshare.net/ckmarkohchang/language-understanding-for-textbased-games-using-deep-reinforcement-learning

SLIDE 4

Motivation: Learning by experience (Environment)

A Markov decision process is a 5-tuple $(T, S, A, f, c)$:
• discrete decision points $t \in T = \{0, 1, \ldots, N\}$
• system state $s_t \in S$
• actions $a_t \in A$
• transition function $s_{t+1} = f(s_t, a_t, w_t)$, with transition probabilities $p_{ij}(a)$
• direct costs/rewards $c : S \times A \to \mathbb{R}$

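To make the 5-tuple concrete, here is a toy two-state sketch (all states, probabilities, and rewards are invented for illustration; the random policy is a placeholder):

```python
import numpy as np

# Toy MDP with S = {0, 1} and A = {0, 1} (numbers invented).
# P[s, a, s'] plays the role of the transition probabilities p_ij(a);
# R[s, a] is the direct reward c(s, a).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

rng = np.random.default_rng(0)
s = 0
for t in range(5):                      # discrete decision points t
    a = int(rng.integers(2))            # placeholder random policy
    s_next = int(rng.choice(2, p=P[s, a]))  # s_{t+1} = f(s_t, a_t, w_t)
    print(t, s, a, R[s, a], s_next)
    s = s_next
```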

SLIDE 5

Motivation: Learning by experience (Policy)

The agent's behavior is defined by its policy $\pi$, with $\pi_t(s_t) = a$. A policy can be stationary or non-stationary: $\hat{\pi} = (\pi, \pi, \ldots, \pi)$ or $\hat{\pi} = (\pi_1, \pi_2, \ldots, \pi_N)$, respectively.


SLIDE 6-7

Motivation: Learning by experience (Goal)

RL goal: maximize the long-term return, i.e.
• learn the optimal policy $\pi^*$
• $\pi^*$ always selects the action $a$ that maximizes the long-term return
• estimate the value of states and actions



SLIDE 8

Value functions

State value function: expected return when following $\pi$ from state $s$:

$$V^\pi(s) = \mathbb{E}_\pi[R_t \mid s_t = s] = \mathbb{E}_\pi\Big[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\Big]$$

Action value function: expected return starting from state $s$, taking action $a$, then following $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi[R_t \mid s_t = s, a_t = a] = \mathbb{E}_\pi\Big[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a\Big]$$

where $0 \le \gamma \le 1$ is the discount rate.

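The expectations above can be approximated empirically. A small sketch of the discounted return; the `rollout` helper in the comment is hypothetical:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """R_t = sum_{k>=0} gamma^k * r_{t+k+1} for one observed reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# V^pi(s) is an expectation, so it can be estimated by averaging returns over
# many episodes that start in s and follow pi. `rollout` is a hypothetical
# helper returning the reward sequence of one such episode:
# v_hat = np.mean([discounted_return(rollout(s)) for _ in range(1000)])
```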

SLIDE 9-11

Learning to choose ’good’ actions

Simply extract the optimal policy: $\pi^*(s) = \arg\max_a Q^*(s, a)$. Problems:

1. A table $Q(s, a)$ is only feasible for small environments.
⇒ replace $Q(s, a)$ by a function approximator $Q(s, a; \theta)$, e.g. using a neural network

2. Unknown environment: unknown transition functions, unknown rewards.
⇒ no model ⇒ learn the optimal $Q^*(s, a; \theta)$ by updating $\theta$



SLIDE 12

Learn optimal value functions

Source: Reinforcement Learning Lecture, ue01.pdf

’trial and error’ approach

• take action $a_t$ according to the $\epsilon$-greedy policy
• receive the new state $s_{t+1}$ and reward $r_t$
• continue until a terminal state is reached
• use the history to change $Q(s, a; \theta)$ or $V(s; \theta)$, then restart

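A minimal sketch of the $\epsilon$-greedy choice used in this loop (the function name is ours, not from the slides):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon explore uniformly, otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # random exploratory action
    return int(np.argmax(q_values))               # greedy action w.r.t. Q

rng = np.random.default_rng(0)
print(epsilon_greedy(np.array([0.1, 0.5, 0.2]), epsilon=0.1, rng=rng))
```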

SLIDE 13-16

Value-based model-free reinforcement learning methods

Strategies to update the parameters $\theta$ iteratively:

One-step Q-learning (off-policy technique):

$$L_i(\theta_i) = \mathbb{E}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\big)^2\Big] \quad (1)$$


One-step SARSA (on-policy technique):

$$L_i(\theta_i) = \mathbb{E}\Big[\big(r + \gamma Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\big)^2\Big] \quad (2)$$

⇒ obtaining a reward only affects the $(s, a)$-pair that led to the reward.

n-step Q-learning (off-policy technique), with target

$$r_t + \gamma r_{t+1} + \ldots + \gamma^{n-1} r_{t+n} + \gamma^n \max_a Q(s_{t+n+1}, a; \theta_i) \quad (3)$$
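The three targets differ only in how they bootstrap. A numpy sketch (function names are ours, not from the paper):

```python
import numpy as np

GAMMA = 0.99

def q_learning_target(r, q_next, terminal):
    """Eq. (1): y = r + gamma * max_a' Q(s', a'; theta_{i-1})."""
    return r if terminal else r + GAMMA * np.max(q_next)

def sarsa_target(r, q_next, a_next, terminal):
    """Eq. (2): y = r + gamma * Q(s', a'; theta_{i-1}), with a' the action
    the current policy actually takes in s' (on-policy)."""
    return r if terminal else r + GAMMA * q_next[a_next]

def n_step_targets(rewards, bootstrap):
    """Eq. (3): backward recursion R <- r + gamma * R, seeded with
    max_a Q(s_{t+n+1}, a) passed as `bootstrap`."""
    targets, R = [], bootstrap
    for r in reversed(rewards):
        R = r + GAMMA * R
        targets.append(R)
    return targets[::-1]
```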

SLIDE 17-18

Policy-based model-free reinforcement learning method

Alternative approach: parameterize the policy $\pi(s; \theta)$ and update the parameters $\theta$ towards the gradient

$$\nabla_\theta \log \pi(a_t \mid s_t; \theta)\,(R_t - b_t(s_t)) \quad (4)$$

i.e. scale the gradient by the estimate's certainty and by the advantage of taking action $a_t$ in state $s_t$:
• $R_t$ is an estimate of $Q^\pi(s_t, a_t)$
• $b_t$ is an estimate of $V^\pi(s_t)$


Source: Reinforcement Learning Lecture, ue07.pdf

⇒ actor-critic architecture:
• Actor: the policy $\pi$
• Critic: the baseline $b_t(s_t) \approx V^\pi(s_t)$

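A sketch of Eq. (4) as a PyTorch loss for a single $(s_t, a_t)$ pair; the 0.5 weight on the critic term is our assumption, not from the slides:

```python
import torch

def actor_critic_loss(log_prob_a, value, R):
    """log_prob_a: log pi(a_t | s_t; theta) from the actor,
    value: b_t(s_t) from the critic, R: estimated return R_t."""
    advantage = R - value.detach()           # (R_t - b_t(s_t)), critic frozen
    policy_loss = -log_prob_a * advantage    # actor: ascend Eq. (4)
    value_loss = (R - value) ** 2            # critic: regress b_t towards R_t
    return policy_loss + 0.5 * value_loss    # 0.5 weighting is our choice
```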

SLIDE 19

Contents

1. Reinforcement Learning: Basics, Learning Methods
2. Asynchronous Reinforcement Learning: Related Work, Multi-threaded Learning
3. Experiments: Benchmark Environments, Network Architecture, Results
4. Conclusion


SLIDE 20

Related Work (1)

Deep Q-Networks (DQN) [Mnih et al., 2015]: a deep neural network as non-linear function approximator. Techniques to avoid divergence:
• Experience replay: perform Q-learning updates on random samples of past experience.
• Target network: fix the network for several thousand iterations before updating its weights.
⇒ makes the training data less non-stationary and stabilizes training.

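A minimal sketch of an experience replay buffer (the capacity and batch size are illustrative, not the paper's values):

```python
import random
from collections import deque

replay = deque(maxlen=100_000)            # fixed-capacity replay memory

def store(s, a, r, s2, terminal):
    replay.append((s, a, r, s2, terminal))

def sample_minibatch(batch_size=32):
    """Q-learning updates are computed on random past transitions,
    which breaks the temporal correlation of consecutive samples."""
    return random.sample(replay, batch_size)
```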

SLIDE 21

Related Work (2)

General Reinforcement Learning Architecture (Gorila) of Nair et al. [2015]: asynchronous training in a distributed setting. Each process
• has its own environment copy
• uses experience replay
• updates a target copy of the model
• occasionally receives updated parameters from the target network
⇒ 130 CPU cores with 100 parallel actor-learners.


SLIDE 22-23

Asynchronous Reinforcement Learning

Deep neural network as non-linear function approximator; multiple threads on a single machine:
• minimize communication cost
• reduce training time (roughly linear in the number of threads)

Different exploration policies in each actor:
• maximize diversity and decorrelate updates in time
• no experience replay



Asynchronous variants of four standard RL algorithms:
• one-step Q-learning
• one-step SARSA
• n-step Q-learning
• advantage actor-critic


SLIDE 24

Asynchronous one-step Q-learning

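As a stand-in for the algorithm box from Mnih et al. [2016], here is a minimal runnable sketch of the per-thread loop, with a tabular Q on the toy MDP from Slide 4 replacing the network $Q(s, a; \theta)$. The paper accumulates gradients over several steps before applying them; this is simplified here to a locked in-place update:

```python
import threading
import numpy as np

# Toy MDP from Slide 4; the tabular Q stands in for Q(s, a; theta).
P = np.array([[[0.9, 0.1], [0.2, 0.8]], [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[0.0, 1.0], [2.0, 0.0]])
GAMMA, ALPHA, EPS = 0.99, 0.1, 0.1
Q = np.zeros((2, 2))          # shared parameters theta
Q_target = Q.copy()           # shared target parameters theta^-
step_count = [0]              # shared global step counter
lock = threading.Lock()

def worker(seed, steps=20000, target_every=1000):
    rng = np.random.default_rng(seed)     # different exploration per thread
    s = 0
    for _ in range(steps):
        a = int(rng.integers(2)) if rng.random() < EPS else int(np.argmax(Q[s]))
        s2 = int(rng.choice(2, p=P[s, a]))
        y = R[s, a] + GAMMA * np.max(Q_target[s2])   # one-step target, Eq. (1)
        with lock:                                   # async update of theta
            Q[s, a] += ALPHA * (y - Q[s, a])
            step_count[0] += 1
            if step_count[0] % target_every == 0:
                Q_target[:] = Q                      # refresh theta^-
        s = s2

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(Q)
```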

SLIDE 25

Asynchronous one-step SARSA

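The SARSA worker differs only in its on-policy bootstrap: the thread commits to the next action before updating. A sketch reusing `Q`, `Q_target`, `P`, `R`, `lock`, and the constants from the previous block:

```python
def sarsa_worker(seed, steps=20000):
    """On-policy variant: bootstrap on the action actually taken in s'."""
    rng = np.random.default_rng(seed)
    def pick(s):
        return int(rng.integers(2)) if rng.random() < EPS else int(np.argmax(Q[s]))
    s = 0
    a = pick(s)
    for _ in range(steps):
        s2 = int(rng.choice(2, p=P[s, a]))
        a2 = pick(s2)                              # a' chosen by the policy
        y = R[s, a] + GAMMA * Q_target[s2, a2]     # SARSA target, Eq. (2)
        with lock:
            Q[s, a] += ALPHA * (y - Q[s, a])
        s, a = s2, a2
```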

SLIDE 26

Asynchronous n-step Q-learning

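A sketch of the n-step variant: act for up to n steps, then update every visited pair with its own n-step return (same shared names as above; gradient accumulation again omitted):

```python
def n_step_q_worker(seed, n=5, segments=4000):
    rng = np.random.default_rng(seed)
    s = 0
    for _ in range(segments):
        traj = []
        for _ in range(n):                          # act for up to n steps
            a = int(rng.integers(2)) if rng.random() < EPS else int(np.argmax(Q[s]))
            s2 = int(rng.choice(2, p=P[s, a]))
            traj.append((s, a, R[s, a]))
            s = s2
        ret = np.max(Q_target[s])          # bootstrap: max_a Q(s_{t+n+1}, a)
        with lock:
            for (si, ai, ri) in reversed(traj):     # R <- r + gamma * R
                ret = ri + GAMMA * ret
                Q[si, ai] += ALPHA * (ret - Q[si, ai])
```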

SLIDE 27

Asynchronous advantage actor-critic (A3C)

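A3C replaces the Q-table with the actor-critic update of Eq. (4). A PyTorch sketch of one update from a short rollout of a single actor-learner; the entropy bonus follows the paper's description, while the exact coefficients are our assumptions:

```python
import torch

def a3c_update(model, optimizer, trajectory, bootstrap_value,
               gamma=0.99, value_coef=0.5, entropy_coef=0.01):
    """trajectory: list of (log_prob_a, value, entropy, reward) collected by
    one actor-learner over at most t_max steps; bootstrap_value: V(s_{t+n})
    or 0 at a terminal state. Gradients flow into the shared model."""
    R = bootstrap_value
    policy_loss = value_loss = entropy_bonus = 0.0
    for log_prob, value, entropy, reward in reversed(trajectory):
        R = reward + gamma * R                    # n-step return
        advantage = R - value.detach()
        policy_loss = policy_loss - log_prob * advantage
        value_loss = value_loss + (R - value).pow(2)
        entropy_bonus = entropy_bonus + entropy   # encourages exploration
    loss = policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```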

SLIDE 28

Contents

1. Reinforcement Learning: Basics, Learning Methods
2. Asynchronous Reinforcement Learning: Related Work, Multi-threaded Learning
3. Experiments: Benchmark Environments, Network Architecture, Results
4. Conclusion


SLIDE 29

Benchmark Environments (1)

Arcade Learning Environment [Bellemare et al., 2013]:
• simulator for 57 Atari 2600 games
• most commonly used benchmark for RL algorithms
• input: 160x210 frames with 128 colors
• actions: 18 (discrete joystick positions and buttons)
• rewards: game score (sparse)


SLIDE 30

Benchmark Environments (2)

The Open Racing Car Simulator (TORCS) [Wymann et al., 2014]:
• input: RGB frames
• actions: discretized
• rewards: the agent's speed (permanent)
• four settings: slow/fast car, with/without opponent bots


SLIDE 31

Benchmark Environments (3)

Physics simulator MuJoCo [Todorov et al., 2012]:
• motor control tasks
• input: physical state (joint positions, velocities), RGB frames
• challenge: continuous action space ⇒ policy-based method
• proof of concept: A3C was designed for discrete control problems


SLIDE 32

Benchmark Environments (4)

Labyrinth:
• randomly generated 3D mazes
• input: 84x84 RGB frames
• rewards: +1 (picking up fruit), +10 (entering a portal)
• challenge: exploration of random mazes within 60 seconds
• only used to evaluate the best-performing agent


SLIDE 33-34

Network Architecture

Network architecture from Mnih et al. [2013]:
• convolutional layer with 16 filters of size 8x8, stride 4
• convolutional layer with 32 filters of size 4x4, stride 2
• fully connected layer with 256 hidden units
(all hidden layers followed by a rectifier nonlinearity)

Output layer:
• value-based methods: linear output, the action value $Q(s, a)$
• actor-critic method: a linear output for the state value $V(s)$ and a softmax output for $\pi(a \mid s)$, the probability of selecting each action $a$

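A PyTorch sketch of this trunk with both heads, assuming the standard preprocessed 84x84x4 input of Mnih et al. [2013] (the preprocessing itself is not shown):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ACNetwork(nn.Module):
    def __init__(self, num_actions):
        super().__init__()
        self.conv1 = nn.Conv2d(4, 16, kernel_size=8, stride=4)   # 16 @ 20x20
        self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)  # 32 @ 9x9
        self.fc = nn.Linear(32 * 9 * 9, 256)
        self.policy_head = nn.Linear(256, num_actions)  # softmax -> pi(a|s)
        self.value_head = nn.Linear(256, 1)             # linear  -> V(s)

    def forward(self, x):                  # x: (batch, 4, 84, 84)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.fc(x.flatten(start_dim=1)))
        return F.softmax(self.policy_head(x), dim=-1), self.value_head(x)

# For the value-based methods, replace the two heads by a single
# nn.Linear(256, num_actions) emitting Q(s, a).
pi, v = ACNetwork(num_actions=18)(torch.zeros(1, 4, 84, 84))
```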

SLIDE 35

Experimental Setup

Single machine: 16 actor-learner threads, no GPU; updates after every 5 actions.
• Optimizer: shared RMSProp
• initial learning rate sampled from LogUniform($10^{-4}$, $10^{-2}$)
• learning rate annealed to 0 over the course of training

Value-based methods:
• shared target network, updated every 40,000 frames
• exploration rate $\epsilon$ decreasing over the first four million frames

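LogUniform sampling means the exponent, not the rate itself, is drawn uniformly; a one-line sketch:

```python
import numpy as np

rng = np.random.default_rng()
learning_rate = 10 ** rng.uniform(-4, -2)   # LogUniform(1e-4, 1e-2)
```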

SLIDE 36

Atari 2600 games: Training speed

DQN was trained on a single Nvidia K40 GPU; the proposed methods were trained on 16 CPU cores.
• asynchronous methods learn faster than DQN
• n-step methods learn faster than one-step methods
• winner: the policy-based actor-critic method


SLIDE 37

Atari 2600 games: Training speedup using threads

Time required to reach a fixed reference score, measured over seven Atari games:
• at least a 12x speedup with 16 worker threads
• superlinear speedup for one-step methods

Assumption: with multiple threads, one-step methods often require less data to achieve a particular score.


SLIDE 38

Atari 2600 games: Score comparison

A3C, LSTM: additionally used 256 LSTM cells after the final hidden layer.

A3C outperforms the current state-of-the-art methods in half the training time.


SLIDE 39

TORCS: Score vs. Training Time

A3C reached 75-90% of the score obtained by a human tester


SLIDE 40

MuJoCo: Score vs. Learning Rate

A wide range of learning rates leads to good performance.

https://youtu.be/Ajjc08-iPx8


SLIDE 41

Labyrinth: Score vs. Training Steps

Top 5 agents out of 50 random learning rates. Training took approximately 3 days.


SLIDE 42

Contents

1. Reinforcement Learning: Basics, Learning Methods
2. Asynchronous Reinforcement Learning: Related Work, Multi-threaded Learning
3. Experiments: Benchmark Environments, Network Architecture, Results
4. Conclusion


SLIDE 43

Conclusion

Mnih et al. [2016] presented asynchronous variants of four standard RL algorithms. Stable training through reinforcement learning is possible for:
• value-based and policy-based methods
• off-policy and on-policy methods
• discrete and continuous domains

Parallel actor-learners have a stabilizing effect: Q-learning is possible without experience replay, and the result is a new state of the art on the Atari domain.


SLIDE 44

References

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279.

Chang, M. (2015). Language understanding for text-based games using deep reinforcement learning. de.slideshare.net/ckmarkohchang/language-understanding-for-textbased-games-using-deep-reinforcement-learning.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529-533.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. CoRR, abs/1602.01783.

Nair, A., Srinivasan, P., Blackwell, S., et al. (2015). Massively parallel methods for deep reinforcement learning. CoRR, abs/1507.04296.

Sutton, R. (2015). Introduction to reinforcement learning. http://slideplayer.com/slide/7966867.

Todorov, E., Erez, T., and Tassa, Y. (2012). MuJoCo: A physics engine for model-based control. http://www.cs.washington.edu/homes/todorov/papers/mujoco.pdf.

Wymann, B., Espié, E., Guionneau, C., Dimitrakakis, C., Coulom, R., and Sumner, A. (2014). TORCS, The Open Racing Car Simulator. http://www.torcs.org.

SLIDE 45

Thank you for your attention! Any questions?