

SLIDE 1

Human-level control through deep reinforcement learning

Liia Butler

SLIDE 2

But first... A quote

"The question of whether machines can think... is about as relevant as the question of whether submarines can swim" Edsger W. Dijkstra

SLIDE 3

Overview

  • 1. Introduction
  • 2. Reinforcement Learning
  • 3. Deep Neural Networks
  • 4. Markov Decision Process
  • 5. Algorithm Breakdown
  • 6. Evaluation and Conclusions
SLIDE 4

Introduction

Deep Q-network (DQN)

  • The agent: reinforcement learning plus deep neural networks
  • Goal: general artificial intelligence
  • How little do we have to know to be intelligent? Can we solve a wide range of challenging tasks?
  • Pixels and the game score as the only inputs
SLIDE 5

Reinforcement Learning

  • A theory of how software agents can optimize their control of an environment
  • Inspired by psychological and neuroscientific perspectives on animal behavior
  • One of the three main types of machine learning, alongside supervised and unsupervised learning

http://en.proft.me/media/science/ml_types.png
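As a minimal sketch of that agent-environment loop (a toy environment and random agent invented here for illustration; nothing from the paper):

```python
import random

class ToyEnv:
    """A trivial 'environment': reward 1 for action 1, episode lasts 10 steps."""
    def reset(self):
        self.t = 0
        return 0                      # single dummy state
    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= 10
        return 0, reward, done

class RandomAgent:
    """An 'agent' that explores by picking actions at random."""
    def select_action(self, state):
        return random.choice([0, 1])

env, agent = ToyEnv(), RandomAgent()
state, done, total = env.reset(), False, 0.0
while not done:
    action = agent.select_action(state)     # agent acts on the environment
    state, reward, done = env.step(action)  # environment responds with reward
    total += reward
print("episode return:", total)
```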

SLIDE 6

Space Invaders

SLIDE 7

Deep Neural Networks

  • A type of artificial neural network, the core architecture of deep learning
  • Artificial neural network: a network of highly connected processing nodes working together on a specific problem, as in a biological nervous system
  • Multiple layers of nodes, each with an increasingly abstract representation of the data
  • Extracts high-level representations from raw data
  • DQN uses a deep convolutional network:
  • Input: an 84 × 84 × 4 image produced by the preprocessing map
  • Three convolutional layers
  • Two fully connected layers
http://www.nature.com/nature/journal/v518/n7540/carousel/nature14236-f1.jpg
http://www.nature.com/nature/journal/v518/n7540/images/nature14236-f4.jpg
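A sketch of that architecture in PyTorch; the layer sizes follow the paper's description, while the framework and names here are illustrative choices, not the authors' original code:

```python
import torch
import torch.nn as nn

# Three convolutional layers followed by two fully connected layers,
# mapping 4 stacked 84x84 frames to one Q-value per action.
class DQNNetwork(nn.Module):
    def __init__(self, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 84x84 -> 20x20
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # 20x20 -> 9x9
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # 9x9 -> 7x7
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),                   # one Q-value per action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

q = DQNNetwork(n_actions=4)
print(q(torch.zeros(1, 4, 84, 84)).shape)  # torch.Size([1, 4])
```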

SLIDE 8
Markov Decision Process

  • State
  • Action
  • Reward

http://cse-wiki.unl.edu/wiki/images/5/58/ReinforJpeg.jpg

SLIDE 9

What these mean for DQN

  • State - What is going on? To stay universal, the state is represented by the raw screen pixels.
  • Action - What can we do? E.g., moving in a direction, pressing buttons.
  • Reward - What's our motivation? Points, lives, etc.

http://www.retrogamer.net/wp-content/uploads/2014/07/Top-10-Atari-Jaguar-Games-616x410.png
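In standard notation (my formalization of the diagram's loop, using the symbols from the key later in the deck):

```latex
% An MDP couples states s, actions a and rewards r. At each time step t
% the agent observes a state, picks an action, and receives a reward;
% it seeks to maximize the expected discounted return until the game
% terminates at time T (gamma is the reward discount factor):
R_t = \sum_{t'=t}^{T} \gamma^{\,t'-t} \, r_{t'}, \qquad 0 < \gamma < 1
```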

SLIDE 10

How is DQN going to do this?

  • Preprocessing - reduce input dimensionality, take the max pixel value over consecutive frames, remove flickering (see the sketch after this list)
  • ε-greedy policy - choosing the action
  • Bellman equation - optimal control of the environment via the action-value function
  • A function approximator to estimate the action-value function
  • Loss function and Q-learning gradient
  • Experience replay - building a data set from the agent's experience
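As a sketch of the preprocessing step (a minimal Python/NumPy version of what the paper describes; the function name and the use of OpenCV's `cv2.resize` are my choices, not the authors' code):

```python
import numpy as np
import cv2  # OpenCV, used here only for resizing

def preprocess(frame: np.ndarray, prev_frame: np.ndarray) -> np.ndarray:
    """Preprocess one 210x160x3 Atari frame: per-pixel max over two
    consecutive frames (removes sprite flicker), convert to luminance,
    and downsample to 84x84."""
    fused = np.maximum(frame, prev_frame)               # removes flickering
    # Luminance (Y) from RGB, collapsing 3 channels to 1.
    gray = (fused @ np.array([0.299, 0.587, 0.114])).astype(np.float32)
    small = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)
    return small.astype(np.uint8)

# The network input stacks the 4 most recent preprocessed frames: 84x84x4.
```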
SLIDE 11

Algorithm Breakdown

Key:
  • D = replay memory (the data set)
  • N = capacity of the replay memory (number of experience tuples)
  • Q = action-value ("quality") function
  • θ = weights of the function approximator (the Q-network)
  • M = number of episodes
  • s = sequence / state
  • x = observation (raw image)
  • φ = preprocessing map applied to a sequence
  • T = time step at which the game terminates
  • ε = exploration probability in the ε-greedy policy
  • a = action
  • y = target
  • r = reward
  • γ = reward discount factor
  • C = number of updates to Q between target-network resets
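To make the loop concrete, here is a runnable miniature of the paper's Algorithm 1 on a toy five-state chain environment of my own; a Q-table stands in for the deep network, but the ε-greedy choice, the replay memory D, the minibatch update, and the periodic target reset every C updates follow the algorithm:

```python
import random
from collections import deque

N, M, C = 1000, 200, 50            # memory capacity, episodes, target reset period
epsilon, gamma, alpha = 0.1, 0.9, 0.1

n_states, n_actions = 5, 2         # move left (0) or right (1); reward at the right end
Q = [[0.0] * n_actions for _ in range(n_states)]
Q_target = [row[:] for row in Q]   # θ⁻: periodically synced copy of Q
D = deque(maxlen=N)                # replay memory D
updates = 0

for episode in range(M):
    s = 0
    for t in range(20):                               # T: episode length cap
        # ε-greedy: explore with probability ε, otherwise exploit.
        if random.random() < epsilon:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda a_: Q[s][a_])
        s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        done = s_next == n_states - 1
        D.append((s, a, r, s_next, done))             # store transition in D
        s = s_next

        # Sample a random minibatch from D and take a gradient-like step.
        for (sj, aj, rj, sj2, term) in random.sample(D, min(8, len(D))):
            y = rj if term else rj + gamma * max(Q_target[sj2])   # target y
            Q[sj][aj] += alpha * (y - Q[sj][aj])
        updates += 1
        if updates % C == 0:                          # every C updates: θ⁻ ← θ
            Q_target = [row[:] for row in Q]
        if done:
            break

print("learned Q-values:", [[round(q, 2) for q in row] for row in Q])
```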

SLIDE 12

Algorithm Breakdown

(Key to the symbols as on Slide 11.)

ε-greedy policy

SLIDE 13

ε-greedy policy

How to choose the action a at time t:

  • Exploration: with probability ε, pick an action uniformly at random
  • Exploitation: otherwise, pick the action with the highest Q-value
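As a sketch in Python (the function name and `q_values` argument are mine):

```python
import random

def epsilon_greedy(q_values: list[float], epsilon: float) -> int:
    """Choose an action index: explore uniformly at random with
    probability epsilon, otherwise exploit the highest Q-value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploitation

epsilon_greedy([0.1, 0.9, 0.4], epsilon=0.05)  # usually returns 1
```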

SLIDE 14

Algorithm Breakdown

(Key to the symbols as on Slide 11.)

Experience Replay

SLIDE 15

Experience Replay

1. Take an action
2. Store the transition in memory D
3. Sample a random minibatch of transitions from D
4. Optimize using gradient descent on the target y and the Q-network
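A minimal replay memory in Python mirroring steps 2 and 3 (the class and method names are mine):

```python
import random
from collections import deque

class ReplayMemory:
    """Stores the agent's last N transitions and serves random
    minibatches, breaking correlations between consecutive samples."""
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory(capacity=10_000)
memory.store("s0", 1, 0.0, "s1", False)   # step 2: store a transition
# batch = memory.sample(32)               # step 3: random minibatch (once full enough)
```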

SLIDE 16

Optimizing the Q-Network

  • Bellman equation, which an optimal action-value function must satisfy: $Q^*(s,a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s',a') \mid s,a \right]$
  • The loss function we have at iteration $i$: $L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\left[ \left( y - Q(s,a;\theta_i) \right)^2 \right]$
  • From this, the target: $y = r + \gamma \max_{a'} Q(s',a';\theta_i^-)$, where $\theta_i^-$ are the target-network weights
  • Gives us the Q-learning gradient: $\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a,r,s'}\left[ \left( y - Q(s,a;\theta_i) \right) \nabla_{\theta_i} Q(s,a;\theta_i) \right]$
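A small numeric illustration of the target and loss, with invented numbers (not from the paper):

```python
import numpy as np

r, gamma = 1.0, 0.99
q_next_target = np.array([0.5, 2.0, 1.0])   # Q(s', a'; θ⁻) for each action a'
y = r + gamma * q_next_target.max()          # target: 1 + 0.99 * 2.0 = 2.98

q_sa = 2.5                                   # current estimate Q(s, a; θ)
loss = (y - q_sa) ** 2                       # squared TD error: 0.2304
print(y, loss)
```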
SLIDE 17

Algorithm Breakdown

(Key to the symbols as on Slide 11.)

SLIDE 18

Breakout!

SLIDE 19

Evaluation and Conclusions

  • Agents vs. a professional human games tester
  • Agents acted at 10 Hz (one action every 6th frame, i.e. every 0.1 s)
  • At 60 Hz (every frame, i.e. every 0.017 s), performance was more than 5% better in only 6 games
  • Humans played under controlled conditions
  • Out of the 49 games: 29 at or above human level, 20 below

http://www.nature.com/nature/journal/v518/n7540/images_article/nature14236-f2.jpg

SLIDE 20

http://www.nature.com/nature/journal/v518/n7540/images_article/nature14236-f3.jpg

29 out of 49 games at or above human level; 20 out of 49 below.

SLIDE 21

Questions and Discussion

  • What do you think are some non-gaming applications of deep reinforcement learning?
  • Do you think that comparing with the "professional human games tester" is a sufficient evaluation? Is there a better way?
  • Should we even have a general AI, or are we better off with domain-specific AIs?
  • Are there other consequences besides a computer beating your high score? (Have we doomed society?)