Human-level control through deep reinforcement learning
Liia Butler
But first... A quote
"The question of whether machines can think... is about as relevant as the question of whether submarines can swim" Edsger W. Dijkstra
Overview
- 1. Introduction
- 2. Reinforcement Learning
- 3. Deep neural networks
- 4. Markov Decision Process
- 5. Algorithm Breakdown
- 6. Evaluation and conclusions
Introduction
Deep Q-network (DQN)
- The agent
- Combines reinforcement learning with deep neural networks
- Goal: General artificial intelligence
- How little do we have to know to be intelligent? Can we solve a wide range of challenging tasks?
- Pixels and game score as input
Reinforcement Learning
- Theory of how software agents may optimize their control of the environment
- Inspired by psychological and neuroscientific perspectives on animal behavior
- One of the three main types of machine learning (alongside supervised and unsupervised learning)
http://en.proft.me/media/science/ml_types.png
Space Invaders
Deep Neural Networks
- An architecture in deep learning, type of artificial neural network
- Artificial neural network: a network of highly connected processing nodes that work together on specific problems, much like a biological nervous system
- Multiple layers of nodes with increasing abstraction of the data
- Extract high-level representations from raw data
- DQN uses "deep convolutional network"
- 84 × 84 × 4 image produced by the preprocessing map Φ
- Three convolutional layers
- Two fully connected layers
- http://www.nature.com/nature/journal/v518/n7540/carousel/nature14236-f1.jpg
- http://www.nature.com/nature/journal/v518/n7540/images/nature14236-f4.jpg
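As a rough sketch of the architecture described above, here is a minimal PyTorch version. The layer sizes follow the paper's description; the code itself and the class name are illustrative, not the authors' implementation:

```python
import torch.nn as nn

class DQN(nn.Module):
    """Deep convolutional Q-network: an 84x84x4 input, three
    convolutional layers, two fully connected layers, and one
    output unit per valid action."""

    def __init__(self, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),  # one Q value per action
        )

    def forward(self, x):
        return self.net(x)
```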
Markov Decision Process
- State
- Action
- Reward
http://cse-wiki.unl.edu/wiki/images/5/58/ReinforJpeg.jpg
What these mean for DQN
- State - What is going on?
- Because the goal is generality, the state is represented by raw screen pixels
- Action - What can we do?
- e.g., moving in a direction or pressing buttons
- Reward - What's our motivation?
- Points, lives, etc.
http://www.retrogamer.net/wp-content/uploads/2014/07/Top-10-Atari-Jaguar-Games-616x410.png
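For concreteness, one time step of the agent-environment interaction above can be recorded as a simple tuple; a minimal Python sketch (the field names are illustrative, not from the paper):

```python
from collections import namedtuple

# One step of interaction in the MDP: the preprocessed screen (state),
# the joystick/button input (action), the change in game score (reward),
# the resulting screen (next_state), and whether the game ended (done).
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"]
)
```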
How is DQN going to do this?
- Preprocessing - reduce input dimensionality; take the max pixel value over consecutive frames to remove flickering (sketched after this list)
- ε-greedy policy - choosing the action
- Bellman equation - defines the optimal action-value function for control of the environment
- Using a function approximator to estimate the action-value function
- Loss function and Q-learning gradient
- Experience replay - building a data set from agent's experience
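A minimal NumPy sketch of the preprocessing step named above (illustrative only; a real implementation would use a proper image-resizing routine):

```python
import numpy as np

def preprocess(frame, prev_frame):
    """Reduce an RGB Atari frame to a small grayscale image."""
    # Per-pixel max over two consecutive frames removes the flicker
    # caused by sprites that are only drawn on alternate frames.
    merged = np.maximum(frame, prev_frame)
    # Collapse the 3 color channels to luminance (grayscale).
    gray = merged @ np.array([0.299, 0.587, 0.114])
    # Crude nearest-neighbor downsample to 84x84.
    h, w = gray.shape
    rows = np.arange(84) * h // 84
    cols = np.arange(84) * w // 84
    return gray[np.ix_(rows, cols)]
```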
Algorithm Breakdown
Key
- D = replay memory (the data set)
- N = capacity of the replay memory, in experience tuples
- Q = action-value ("quality") function
- Θ = the network weights
- M = number of episodes
- s = state (a sequence of observations)
- x = observation/image
- Φ = preprocessing applied to a sequence
- T = time step at which the game terminates
- ε = exploration probability in the ε-greedy policy
- a = action
- y = target
- r = reward
- γ = reward discount factor
- C = number of steps between updates of the target network
ε-greedy policy
How to choose the action 'a' at time 't':
- Exploration: with probability ε, pick a random action
- Exploitation: otherwise, pick the best action according to the Q value
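A few lines of Python make the policy concrete (a sketch; `q_values`, the vector of estimated Q values for the current state, is an assumption here):

```python
import random

def epsilon_greedy_action(q_values, num_actions, epsilon):
    """With probability epsilon explore (random action); otherwise
    exploit (the action with the highest estimated Q value)."""
    if random.random() < epsilon:
        return random.randrange(num_actions)  # exploration
    return max(range(num_actions), key=lambda a: q_values[a])  # exploitation
```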
Experience Replay
1. Take an action
2. Store the transition in memory D
3. Sample a random minibatch of transitions from D
4. Optimize with gradient descent on the target 'y' and the Q-network
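A minimal replay-memory sketch in Python (the class and method names are illustrative, not from the paper's code):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer D of transitions (Φ_t, a_t, r_t, Φ_{t+1})."""

    def __init__(self, capacity):
        # Once full, the oldest transitions are discarded first.
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform random minibatch; sampling at random breaks the
        # strong correlations between consecutive transitions.
        return random.sample(self.buffer, batch_size)
```

Sampling uniformly from past experience, rather than learning from consecutive transitions, makes the updates behave more like updates on independent, identically distributed data.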
Optimizing the Q-Network
- Bellman equation (optimal action-value function):
  $Q^*(s,a) = \mathbb{E}_{s'}\left[\, r + \gamma \max_{a'} Q^*(s',a') \mid s,a \,\right]$
- The loss function we have, with target $y = r + \gamma \max_{a'} Q(s',a';\theta_i^{-})$:
  $L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\left[\left(y - Q(s,a;\theta_i)\right)^2\right]$
- From this, differentiating with respect to the weights gives us the Q-learning gradient:
  $\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\left[\left(y - Q(s,a;\theta_i)\right)\nabla_{\theta_i} Q(s,a;\theta_i)\right]$
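A small NumPy sketch of the target y and the TD error (y − Q(s,a;θ)) for a minibatch; the frozen target network producing `next_q_values` is an assumption of this sketch, and a real implementation would backpropagate the error through the network:

```python
import numpy as np

def td_targets(rewards, next_q_values, terminals, gamma=0.99):
    """y = r for terminal transitions, otherwise
    y = r + gamma * max_a' Q(s', a'; theta^-)."""
    # next_q_values: shape (batch, num_actions), from the frozen target net.
    max_next_q = next_q_values.max(axis=1)
    return rewards + gamma * (1.0 - terminals) * max_next_q

def td_errors(targets, q_values, actions):
    """y - Q(s, a; theta): multiplied by the gradient of Q, this is
    exactly the Q-learning gradient shown above."""
    chosen_q = q_values[np.arange(len(actions)), actions]
    return targets - chosen_q
```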
Breakout!
Evaluation and Conclusions
- Agents vs. Pro gamers
- Agents selected an action at 10 Hz (every 6th frame, one action per 0.1 seconds)
- At 60 Hz (every frame, one action per 0.017 seconds), performance was more than 5% better on only 6 games
- Humans played under controlled conditions
- Out of the 49 games
- 29 at human level or above
- 20 below
http://www.nature.com/nature/journal/v518/n7540/images_article/nature14236-f2.jpg
http://www.nature.com/nature/journal/v518/n7540/images_article/nature14236-f3.jpg
Questions and Discussion
- What do you think are some non-gaming applications of deep reinforcement learning?
- Do you think that comparing with the "professional human game tester" is a sufficient evaluation? Is there a better way?
- Should we even have a general AI, or are we better off with domain-specific AIs?
- Are there other consequences besides a computer beating your high score? (Have we doomed society?)