
SLIDE 1

Deep Reinforcement Learning Introduction and State-of-the-art

Arjun Chandra Research Scientist Telenor Research / Telenor-NTNU AI Lab arjun.chandra@telenor.com @boelger

24 October 2017 https://join.slack.com/t/deep-rl-tutorial/signup

SLIDE 2

The Plan

  • Some history
  • RL and Deep RL in a nutshell
  • Deep RL Toolbox
  • Challenges and State-of-the-art
    • Data Efficiency
    • Exploration
    • Temporal Abstractions
    • Generalisation
SLIDE 3

https://vimeo.com/20042665

SLIDE 4

Brief History

late 1980s: Rich Sutton et al. (temporal-difference learning)

1993: RL for robots using NNs, L.-J. Lin, PhD thesis, CMU

1995: Gerald Tesauro (TD-Gammon)

2004: Stanford (autonomous helicopter), http://heli.stanford.edu/

2013: Vlad Mnih et al.

2015: David Silver et al., Google DeepMind

SLIDE 5

Problem Characteristics

  • requires strategy
  • delayed consequences
  • dynamic
  • uncertainty/volatility
  • uncharted/unimagined/exception laden

Image credit: http://wonderfulengineering.com/inside-the-data-center-where-google-stores-all-its-data-pictures/
SLIDE 6

Solution

A machine with agency that learns, plans, and acts to find a strategy for solving the problem:

  • explores and exploits
  • probes and learns from feedback
  • autonomous to some extent
  • focuses on the long-term objective

SLIDE 7

Reinforcement Learning

[Agent-environment loop: the agent sends an action to the problem/environment and receives an observation and feedback on its actions. The agent has a model (dynamics model), a goal (maximise return E{R}), and a policy/value function (π/Q).]

SLIDE 8

interact to maximise long-term reward

Inspired by Prof. Rich Sutton's tutorial: https://www.youtube.com/watch?v=ggqnxyjaKe4

The MDP game!

[Same agent-environment loop diagram as before: observation and feedback on actions flow to the agent, actions flow to the problem/environment; the agent's model, goal (maximise return E{R}), and π/Q are in play.]

SLIDE 9

The MDP (S, A, P, R, γ)

https://github.com/traai/basic-rl

States: A, B; actions: 1, 2.

R: immediate reward function R(s, a)
P: state transition probability P(s'|s, a)

[Two-state MDP diagram: each (state, action) pair has a stochastic reward (R = -10±3, 10±3, 20±3, or 40±3) and transition probabilities (P = 0.99/0.01 or 1.00).]
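
A minimal sketch of simulating such an MDP, assuming an illustrative assignment of the rewards and probabilities above to (state, action) pairs (the exact mapping in the slide's diagram is not recoverable here):

import random

# mdp[state][action] -> list of (probability, next_state, mean_reward);
# this assignment is assumed for illustration, not taken from the slide
mdp = {
    "A": {1: [(0.99, "A", 10), (0.01, "B", -10)],
          2: [(1.00, "B", 20)]},
    "B": {1: [(0.99, "A", 40), (0.01, "B", 20)],
          2: [(1.00, "A", -10)]},
}

def step(state, action):
    """Sample a transition and a noisy reward (mean ± uniform(-3, 3))."""
    u, cum = random.random(), 0.0
    for p, nxt, mean_r in mdp[state][action]:
        cum += p
        if u <= cum:
            return nxt, mean_r + random.uniform(-3, 3)
    return nxt, mean_r + random.uniform(-3, 3)  # guard against rounding

state = "A"
for t in range(5):
    action = random.choice(list(mdp[state]))
    state, reward = step(state, action)
    print(t, action, state, round(reward, 2))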

SLIDE 10

Terminology

[Gridworld example with 'home' as the goal, introducing the terms: state- or action-value function, policy, dynamics model, goal, reward.]

SLIDE 11

Terminology

[Same gridworld, highlighting the value functions: action value Q(s, a) and state value V(s).]

SLIDE 12

Terminology

[Same gridworld, highlighting the policy: π(a|s) (stochastic) or π(s) (deterministic).]

SLIDE 13

Terminology

[Same gridworld, highlighting the dynamics model: "If I go South, I will meet …" (predicting what follows from an action).]

SLIDE 14

Terminology

[Same gridworld, highlighting the goal: reaching home.]

SLIDE 15

Terminology

[Same gridworld, highlighting the reward: e.g. 10 on reaching home.]

SLIDE 16

Deep Reinforcement Learning

[Same agent-environment loop, with deep neural networks representing the agent's policy/value function π/Q and dynamics model, mapping raw observations to actions.]

SLIDE 17

Deep Reinforcement Learning

Deep Neural Networks: abstractions/representations adapted to the task, learned end to end from sensors to action.

Hand-crafted abstractions ~ information loss.

[Classical pipeline: Sensors → Perception (vision/detection on pixels) → World Model (prediction/physics sim/kinematics) → Planning (motion planner) → Control (low-level controller, set torques) → Action (motor).]

SLIDE 18

SL + RL

Explaining How a Deep Neural Network Trained with End-to-End Learning Steers a Car, Bojarski et al., 2017, https://arxiv.org/pdf/1704.07911.pdf

https://www.youtube.com/watch?v=KnPiP9PkLAs https://www.youtube.com/watch?v=NJU9ULQUwng

data mismatch

SLIDE 19

Toolbox

Standard algorithms to give you a flavour of the norm!

SLIDE 20

DQN

[Agent-environment loop: the agent sees an image and the score change, takes an action, and stores experience in a replay buffer; a neural network approximates Q.]

Human-level control through deep reinforcement learning, Mnih et al., Nature 518, Feb 2015

SLIDE 21

experience replay buffer

save each transition (s_t, a_t, r_{t+1}, s_{t+1}) in memory; randomly sample from memory for training, making updates approximately i.i.d.
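
A minimal replay-buffer sketch (capacity and batch size are illustrative choices, not from the slides):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)  # oldest transitions fall off

    def save(self, s, a, r, s_next, done):
        # save transition (s_t, a_t, r_t+1, s_t+1) in memory
        self.memory.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # uniform random sample -> approximately i.i.d. minibatch
        return random.sample(self.memory, batch_size)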

SLIDE 22

freeze target

keep a frozen copy of the Q-network (the target network) for computing TD targets, and sync it with the live network only periodically; this stabilises training
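
A minimal sketch of the idea with a tabular Q as a stand-in for the network (all numbers illustrative):

import numpy as np

n_states, n_actions, gamma, lr = 5, 2, 0.99, 0.1
Q = np.zeros((n_states, n_actions))
Q_target = Q.copy()                        # frozen copy

rng = np.random.default_rng(0)
for step in range(1000):
    s, a = rng.integers(n_states), rng.integers(n_actions)
    r, s_next = rng.normal(), rng.integers(n_states)
    # TD target uses the *frozen* copy, not the live Q:
    y = r + gamma * Q_target[s_next].max()
    Q[s, a] += lr * (y - Q[s, a])
    if step % 100 == 0:                    # periodically sync the target
        Q_target = Q.copy()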

SLIDE 23

https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf

Human-level control through deep reinforcement learning, Mnih et al., Nature 518, Feb 2015

SLIDE 24

prioritised experience replay

sample from memory based on surprise (the magnitude of the TD error)

Prioritised Experience Replay, Schaul et al., ICLR 2016
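
A minimal sketch of proportional prioritisation, where sampling probability follows |TD error|^α (α and the errors below are illustrative):

import numpy as np

alpha = 0.6
td_errors = np.array([0.1, 2.0, 0.5, 4.0])   # one per stored transition
priorities = np.abs(td_errors) ** alpha
probs = priorities / priorities.sum()

rng = np.random.default_rng(0)
batch = rng.choice(len(td_errors), size=2, p=probs, replace=False)
print(batch)   # surprising transitions get replayed more often

The paper additionally corrects the bias this sampling introduces with importance-sampling weights, omitted here.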

SLIDE 25

dueling architecture

Q(s, a) = V(s) + A(s, a)

Dueling Network Architectures for Deep RL, Wang et al., ICML 2016
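
A minimal dueling-head sketch in PyTorch (layer sizes illustrative); the paper subtracts the mean advantage so that V and A are identifiable:

import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, n_features=64, n_actions=4):
        super().__init__()
        self.value = nn.Linear(n_features, 1)              # V(s)
        self.advantage = nn.Linear(n_features, n_actions)  # A(s, a)

    def forward(self, features):
        v, a = self.value(features), self.advantage(features)
        # Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')
        return v + a - a.mean(dim=1, keepdim=True)

q = DuelingHead()(torch.randn(1, 64))
print(q.shape)   # torch.Size([1, 4])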

SLIDE 26

however, training is

SLOOOOOOW…

SLIDE 27

Parallel Asynchronous Training

  • shared parameters
  • parallel agents
  • lock-free updates
  • value- and policy-based methods

Asynchronous Methods for Deep Reinforcement Learning, Mnih et al., ICML 2016

https://youtu.be/0xo1Ldx3L5Q https://youtu.be/Ajjc08-iPx8 https://youtu.be/nMR5mjCFZCw

SLIDE 28

shared params; parallel learners; HOGWILD! (lock-free) updates

[Diagram: many agent copies asynchronously pushing updates to a shared agent.]

https://github.com/traai/async-deep-rl
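
A minimal lock-free (HOGWILD!-style) sketch on a toy objective; everything here is an illustrative stand-in for the parallel actor-learner threads:

import threading
import numpy as np

shared_params = np.zeros(4)   # parameters shared by all learners

def learner(seed, steps=1000, lr=0.01):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        # noisy gradient of the toy objective ||params - 1||^2
        grad = 2 * (shared_params - 1.0) + rng.normal(0, 0.1, 4)
        shared_params[:] -= lr * grad   # no lock: updates may interleave

threads = [threading.Thread(target=learner, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_params)   # near the optimum despite racy updates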

SLIDE 29

So 2016… Can we train even faster?

SLIDE 30

PAAC (Parallel Advantage Actor-Critic)

Efficient Parallel Methods for Deep Reinforcement Learning, A. V. Clemente, H. N. Castejón, and A. Chandra, RLDM 2017

1 GPU/CPU; SOTA performance; reduced training time

https://github.com/alfredvc/paac Alfredo Clemente
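
A minimal sketch of the PAAC idea: a single learner steps a batch of environments in lockstep and queries the policy once per batch (toy environment and random policy as stand-ins):

import numpy as np

class ToyEnv:
    def reset(self):
        return np.zeros(4)
    def step(self, action):
        return np.random.randn(4), float(action == 1), False  # obs, r, done

envs = [ToyEnv() for _ in range(8)]              # n_e parallel environments
obs = np.stack([env.reset() for env in envs])

def policy(batch_obs):                           # stand-in for the shared NN
    return np.random.randint(0, 2, size=len(batch_obs))

for t in range(5):
    actions = policy(obs)                        # one forward pass per step
    steps = [env.step(a) for env, a in zip(envs, actions)]
    obs = np.stack([o for o, r, d in steps])     # batched next observations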

SLIDE 31

Challenges and SOTA

  • Data Efficiency
  • Exploration
  • Temporal Abstractions
  • Generalisation

SLIDE 32

Data Efficiency

SLIDE 33

Demonstrations

[Agent-environment loop augmented with a buffer of past demonstrations (observations, actions, feedback) that the NN-based agent learns from alongside its own experience.]

Learning from Demonstrations for Real World Reinforcement Learning, Hester et al., arXiv e-print, Jul 2017

SLIDE 34

https://www.youtube.com/watch?v=JR6wmLaYuu4

SLIDE 35

https://www.youtube.com/watch?v=1wsCZk0Im54

SLIDE 36

https://www.youtube.com/watch?v=B3pf7NJFtHE

SLIDE 37

Deep RL with Unsupervised Auxiliary Tasks

[Agent-environment loop with a replay buffer shared between the main RL task and auxiliary unsupervised tasks.]

Use the replay buffer wisely.

Reinforcement Learning with Unsupervised Auxiliary Tasks, Jaderberg et al., ICLR 2017

SLIDE 38

Reinforcement Learning with Unsupervised Auxiliary Tasks, Jaderberg et al., ICLR 2017

SLIDE 39

learn to act to affect pixels

e.g. if grabbing fruit makes it disappear, the agent learns to do it

SLIDE 40

predict short-term reward

e.g. replay the series of frames around picking up a key and predict the reward that immediately follows
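
A minimal sketch of such a reward-prediction head; the three-way sign classification over a few stacked frames follows the paper's setup, but the module itself is an illustrative stand-in:

import torch
import torch.nn as nn

frames = torch.randn(32, 3 * 84 * 84)       # 3 stacked, flattened frames
reward_sign = torch.randint(0, 3, (32,))    # 0: zero, 1: positive, 2: negative

head = nn.Sequential(nn.Linear(3 * 84 * 84, 128), nn.ReLU(),
                     nn.Linear(128, 3))
loss = nn.CrossEntropyLoss()(head(frames), reward_sign)
loss.backward()   # in UNREAL this head sits on the shared encoder,
                  # so these gradients also shape the agent's representation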

SLIDE 41

predict long-term reward

SLIDE 42

10x less data!


Reinforcement Learning with Unsupervised Auxiliary Tasks, Jaderberg et al., ICLR 2017

SLIDE 43

https://deepmind.com/blog/reinforcement-learning-unsupervised-auxiliary-tasks/

SLIDE 44

Distributional RL

[Same agent-environment loop with a replay buffer, but the agent learns a distribution over returns rather than a single expected value Q(s, a).]

A Distributional Perspective on Reinforcement Learning, Bellemare et al., ICML 2017

SLIDE 45

Normal DQN target: sampled reward after the step, plus the discounted return estimate from then on. Here instead: fuse the sampled reward R with the discounted distribution over returns from then on.

A Distributional Perspective on Reinforcement Learning, Bellemare et al., ICML 2017
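
A minimal sketch of the categorical (C51-style) projection behind this: shift the next-state return distribution by reward and discount, then project it back onto a fixed support of atoms (all sizes and values illustrative):

import numpy as np

n_atoms, v_min, v_max, gamma = 51, -10.0, 10.0, 0.99
support = np.linspace(v_min, v_max, n_atoms)
dz = support[1] - support[0]

next_probs = np.full(n_atoms, 1.0 / n_atoms)   # stand-in next-state dist.
reward = 1.0

target = np.zeros(n_atoms)
for p, z in zip(next_probs, support):
    tz = np.clip(reward + gamma * z, v_min, v_max)   # fuse R with γ·Z
    b = (tz - v_min) / dz                            # fractional atom index
    lo, hi = int(np.floor(b)), int(np.ceil(b))
    if lo == hi:
        target[lo] += p
    else:                                            # split mass linearly
        target[lo] += p * (hi - b)
        target[hi] += p * (b - lo)
print(target.sum())   # still a valid distribution (≈ 1.0)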

SLIDE 46

“If I shoot now, it is game over for me”

A Distributional Perspective on Reinforcement Learning, Bellemare et al., ICML 2017

SLIDE 47

A Distributional Perspective on Reinforcement Learning, Bellemare et al., ICML 2017

SLIDE 48

under pressure, wrong/fatal actions show up as bimodal return distributions

A Distributional Perspective on Reinforcement Learning, Bellemare et al., ICML 2017

SLIDE 49

Exploration

SLIDE 50

Curiosity Driven Exploration

[Agent-environment loop: the NN-based agent (with its model and goal) takes actions and receives observations and feedback on actions.]

SLIDE 51

Curiosity Driven Exploration

curiosity as next-state prediction error: a forward model predicts the next state from the current state and action, and an action-prediction (inverse) model keeps the learned features focused only on the parts of the state the agent can affect

Curiosity-driven Exploration by Self-supervised Prediction, Pathak, Agrawal et al., ICML 2017
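
A minimal curiosity sketch along these lines; the modules and sizes are illustrative stand-ins for the paper's feature encoder and forward model:

import torch
import torch.nn as nn

encode = nn.Linear(16, 8)              # feature encoder φ(s)
forward_model = nn.Linear(8 + 1, 8)    # predicts φ(s') from (φ(s), a)

s, s_next = torch.randn(1, 16), torch.randn(1, 16)
a = torch.tensor([[1.0]])

phi, phi_next = encode(s), encode(s_next)
phi_pred = forward_model(torch.cat([phi, a], dim=1))
# curiosity = next-state (feature) prediction error
intrinsic_reward = 0.5 * (phi_pred - phi_next.detach()).pow(2).sum()
print(float(intrinsic_reward))   # added to the extrinsic reward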

SLIDE 52

https://pathak22.github.io/noreward-rl/ https://github.com/pathak22/noreward-rl

SLIDE 53

Temporal Abstractions

SLIDE 54

HRL with pre-set Goals

Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation, T. D. Kulkarni, K. R. Narasimhan et al., NIPS 2016

meta-controller (MC) selects goals; controller (C) selects primitive actions to achieve the current goal, given the state
SLIDE 55

Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation, T. D. Kulkarni, K. R. Narasimhan et al., NIPS 2016

SLIDE 56

pre-defined goal selected by meta-controller

Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation, T. D. Kulkarni, K. R. Narasimhan et al., NIPS 2016
SLIDE 57

FeUdal Networks for HRL

manager (M) tries to find good directions; worker (W) tries to achieve them

M sets a direction; W takes primitive actions to follow the set direction, given the state

FeUdal Networks for Hierarchical Reinforcement Learning, Vezhnevets et al., ICML 2017

SLIDE 58

FeUdal Networks for Hierarchical Reinforcement Learning, Vezhnevets et al., ICML 2017

SLIDE 59

Generalisation

SLIDE 60

Meta-learning (Learn to Learn)

Versatile agents!

http://www.derinogrenme.com/2015/07/29/makale-imagenet-large-scale-visual-recognition-challenge/

Transfer learning works with images; what are good transferable features for decision making?

SLIDE 61

learn to go East vs. learn to reduce the time it takes to learn to go to any X

SLIDE 62

http://bair.berkeley.edu/blog/2017/07/18/learning-to-learn/ Code: https://github.com/cbfinn/maml_rl

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, C. Finn, P. Abbeel, S. Levine, ICML 2017

0 gradient/optimisation steps: a policy ready to learn; 1 gradient/optimisation step: learnt to achieve the goal

Videos: https://sites.google.com/view/maml
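
A minimal meta-learning sketch in the MAML spirit, on a toy family of 1-D regression tasks y = a·x. This uses the first-order approximation rather than full second-order MAML, and all numbers are illustrative:

import numpy as np

rng = np.random.default_rng(0)
theta = 5.0                    # meta-learned initialisation
alpha, beta = 0.1, 0.01        # inner / outer learning rates

def grad(theta, a, x):         # d/dθ of mean 0.5 * (θ·x - a·x)^2
    return (theta - a) * (x ** 2).mean()

for meta_step in range(2000):
    a = rng.uniform(-2.0, 2.0)                           # sample a task
    x_train, x_test = rng.normal(size=10), rng.normal(size=10)
    theta_task = theta - alpha * grad(theta, a, x_train)  # 1 inner step
    # outer step: improve how well the *adapted* parameter does
    theta -= beta * grad(theta_task, a, x_test)

print(theta)   # an initialisation from which one gradient step adapts fast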

SLIDE 63

Domain Randomisation: Generalising from Simulation

SLIDE 64

https://blog.openai.com/generalizing-from-simulation/

Sim-to-Real Transfer of Robotic Control with Dynamics Randomization, Peng et al., arXiv preprint, 18 Oct 2017
SLIDE 65

Generalisation via Self-play

SLIDE 66

Deep RL in AlphaGo Zero

Improve thinking and intuition with feedback from self-play [zero human game data]

[Self-play loop: two copies of Zero observe the game, act, and receive win/lose/draw feedback.]

Mastering the game of Go without human knowledge, Silver et al., Nature, Vol. 550, October 19, 2017

SLIDE 67

Very High Level Mechanics

[Network fθ: input planes [Xt, Yt, Xt-1, Yt-1, …, Xt-7, Yt-7, C] feed a tower of residual conv blocks (39 to 79 conv layers), topped by policy (p) and value (v) heads (2 and 3 layers). fθ guides tree search, which outputs move probabilities π; games are played to the end, yielding outcome z.]

SLIDE 68

Mastering the game of Go without human knowledge, Silver et al., Nature, Vol. 550, October 19, 2017

  • Self-play to the end of the game
  • NN training: learn to evaluate
  • Self-play step: select moves by simulation + evaluation
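
A minimal sketch of the training signal from the paper: push the network's policy p toward the search probabilities π and its value v toward the game outcome z (toy numbers; the paper's L2 regulariser is noted in the comment but omitted):

import numpy as np

pi = np.array([0.7, 0.2, 0.1])   # MCTS visit-count distribution over moves
p = np.array([0.5, 0.3, 0.2])    # network policy for the same moves
v, z = 0.3, 1.0                  # predicted value, actual game outcome

# l = (z - v)^2 - π · log p   (+ c·||θ||^2 in the paper)
loss = (z - v) ** 2 - np.sum(pi * np.log(p))
print(loss)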

SLIDE 69

https://deepmind.com/blog/alphago-zero-learning-scratch/

SLIDE 70

https://www.youtube.com/watch?v=WXHFqTvfFSw https://deepmind.com/blog/alphago-zero-learning-scratch/

SLIDE 71

Inspired to study RL much?

Next lecture: Building Blocks of (Deep) RL November 8, 2017

https://join.slack.com/t/deep-rl-tutorial/signup