PLAYING ATARI WITH DEEP REINFORCEMENT LEARNING
NEURAL NETWORK VISION FOR ROBOT DRIVING
ARJUN CHANDRASEKARAN
DEEP LEARNING AND PERCEPTION (ECE 6504)
Attribution: Christopher T Cooper
NEURAL NETWORK VISION FOR ROBOT DRIVING
PLAYING ATARI WITH DEEP REINFORCEMENT LEARNING
OUTLINE
▸ Playing Atari with Deep Reinforcement Learning ▸ Motivation ▸ Intro to Reinforcement Learning (RL) ▸ Deep Q-Network (DQN) ▸ BroadMind ▸ Neural Network Vision for Robot Driving
PLAYING ATARI WITH DEEP REINFORCEMENT LEARNING
DEEPMIND
OUTLINE
▸ Playing Atari with Deep Reinforcement Learning ▸ Motivation ▸ Intro to Reinforcement Learning (RL) ▸ Deep Q-Network (DQN) ▸ BroadMind ▸ Neural Network Vision for Robot Driving
AUTOMATICALLY CONVERT UNSTRUCTURED INFORMATION INTO USEFUL, ACTIONABLE KNOWLEDGE.
Demis Hassabis
MOTIVATION
Source: Nikolai Yakovenko
CREATE AN AI SYSTEM THAT HAS THE ABILITY TO LEARN FOR ITSELF FROM EXPERIENCE.
Demis Hassabis
MOTIVATION
Source: Nikolai Yakovenko
CAN DO STUFF THAT MAYBE WE DON’T KNOW HOW TO PROGRAM.
Demis Hassabis
MOTIVATION
Source: Nikolai Yakovenko
CREATE ARTIFICIAL GENERAL INTELLIGENCE
In short,
MOTIVATION
WHY GAMES
▸ Complexity. ▸ Diversity. ▸ Easy to create more data. ▸ Meaningful reward signal. ▸ Can train and learn to transfer knowledge between similar tasks.
Adapted from Nikolai Yakovenko
OUTLINE
▸ Playing Atari with Deep Reinforcement Learning ▸ Motivation ▸ Intro to Reinforcement Learning (RL) ▸ Deep Q-Network (DQN) ▸ BroadMind ▸ Neural Network Vision for Robot Driving
AGENT AND ENVIRONMENT
[Diagram: agent-environment loop; at each step the agent observes state s_t and reward r_t and emits action a_t.]
▸ At every time step t, ▸ Agent executes action A_t ▸ Receives observation O_t ▸ Receives scalar reward R_t ▸ Environment ▸ Receives action A_t ▸ Emits observation O_{t+1} ▸ Emits reward R_{t+1}
Source: David Silver
REINFORCEMENT LEARNING
▸ RL is a general-purpose framework for artificial intelligence ▸ RL is for an agent with the capacity to act. ▸ Each action influences the agent’s future state. ▸ Success is measured by a scalar reward signal ▸ RL in a nutshell: ▸ Select actions to maximise future reward.
Source: David Silver
POLICY AND ACTION-VALUE FUNCTION
▸ Policy (π) is a behavior function selecting actions given states: a = π(s)
▸ Action-value function Q^π(s, a) is the expected total reward from state s and action a under policy π:
Q^π(s, a) = E[r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … | s, a]
▸ Indicates “how good is action a in state s”
Source: David Silver
MAZE EXAMPLE
[Figure: maze with Start and Goal cells.]
Source: David Silver
POLICY
[Figure: the maze with the policy marked in each cell, from Start to Goal.]
Source: David Silver
VALUE FUNCTION
[Figure: the maze with each cell labelled by its value, i.e. the negative number of steps remaining to reach the Goal (from −1 next to the Goal down to −24 near the Start).]
Source: David Silver
APPROACHES TO REINFORCEMENT LEARNING
▸ Policy-based RL ▸ Search directly for the optimal policy π* ▸ This is the policy achieving maximum future reward ▸ Value-based RL ▸ Estimate the optimal value function Q*(s, a) ▸ This is the maximum value achievable under any policy ▸ Model-based RL ▸ Build a transition model of the environment ▸ Plan (e.g. by lookahead) using the model
Source: David Silver
OUTLINE
▸ Playing Atari with Deep Reinforcement Learning ▸ Motivation ▸ Intro to Reinforcement Learning (RL) ▸ Deep Q-Network (DQN) ▸ BroadMind ▸ Neural Network Vision for Robot Driving
DEEP REINFORCEMENT LEARNING
▸ How to apply reinforcement learning to deep neural networks?
▸ Use a deep network to represent the value function / policy / model.
▸ Optimize this value function / policy / model end-to-end.
▸ Use SGD to learn the weights/parameters.
Source: David Silver
UNROLLING RECURSIVELY…
- The value function can be unrolled recursively:
  Q^π(s, a) = E[r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … | s, a] = E_{s′}[r + γ Q^π(s′, a′) | s, a]
- The optimal value function Q*(s, a) can be unrolled recursively:
  Q*(s, a) = E_{s′}[r + γ max_{a′} Q*(s′, a′) | s, a]
- Value iteration algorithms solve the Bellman equation (see the sketch below):
  Q_{i+1}(s, a) = E_{s′}[r + γ max_{a′} Q_i(s′, a′) | s, a]
Source: David Silver
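To make the value-iteration update concrete, here is a minimal tabular sketch. It assumes a small MDP whose transition probabilities P and expected rewards R are known arrays; both are hypothetical inputs for illustration, not part of the slides.

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.99, iters=500):
    """Tabular Q value iteration.

    P[s, a, s'] : probability of reaching s' after taking action a in state s
    R[s, a]     : expected immediate reward for taking action a in state s
    Returns Q with shape (num_states, num_actions).
    """
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        # Bellman backup: Q_{i+1}(s, a) = E_{s'}[r + gamma * max_{a'} Q_i(s', a')]
        V = Q.max(axis=1)          # V_i(s') = max_{a'} Q_i(s', a')
        Q = R + gamma * (P @ V)    # expectation over s' using P[s, a, :]
    return Q
```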
DEEP Q-LEARNING
▸ Represent the action-value function by a deep Q-network with weights w:
  Q(s, a, w) ≈ Q^π(s, a)
▸ Loss is the mean squared error in Q-values (a sketch follows below):
  L(w) = E[(r + γ max_{a′} Q(s′, a′, w⁻) − Q(s, a, w))²]
▸ Gradient:
  ∂L(w)/∂w = E[(r + γ max_{a′} Q(s′, a′, w⁻) − Q(s, a, w)) ∂Q(s, a, w)/∂w]
Source: David Silver
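A minimal PyTorch sketch of this loss, assuming q_net and target_net are two networks of the same architecture (the second holding the frozen weights w⁻) and that the batch tensors come from a replay memory; all names are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Mean-squared TD error: (r + gamma * max_a' Q(s', a', w-) - Q(s, a, w))^2."""
    states, actions, rewards, next_states, dones = batch  # dones: 1.0 if episode ended

    # Q(s, a, w) for the actions that were actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target uses the frozen weights w-; no gradient flows through it
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * max_next_q

    return F.mse_loss(q_sa, targets)
```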
STABILITY ISSUES WITH DEEP RL
- Naive Q-learning oscillates or diverges with neural nets
- Data is sequential
- Successive samples are correlated, non-iid
- Policy changes rapidly with slight changes to Q-values
- Policy may oscillate
- Distribution of data can swing from one extreme to another
- Scale of rewards and Q-values is unknown
- Naive Q-learning gradients can be large and unstable when backpropagated
Source: David Silver
DEEP Q-NETWORKS
- DQN provides a stable solution to deep value-based RL
- Use experience replay
- Break correlations in data, bring us back to iid setting
- Learn from all past policies
- Freeze target Q-network
- Avoid oscillations
- Break correlations between Q-network and target
- Clip rewards or normalize network adaptively to sensible range
- Robust gradients
Source: David Silver
TRICK 1 - EXPERIENCE REPLAY
- To remove correlations, build a dataset from the agent’s own experience (see the sketch below)
- Take action a_t according to an ε-greedy policy
- Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory D
- Sample a random mini-batch of transitions (s, a, r, s′) from D
- Minimize MSE between Q-network and Q-learning targets
Source: David Silver
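A minimal sketch of such a replay memory D; the capacity and batch size are illustrative choices, not values from the slides.

```python
import random
from collections import deque

class ReplayBuffer:
    """Replay memory D: storing transitions and sampling them uniformly
    breaks the temporal correlations in the agent's experience."""

    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        # Store transition (s_t, a_t, r_{t+1}, s_{t+1})
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random mini-batch of transitions (s, a, r, s'), approximately i.i.d.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```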
TRICK 2 - FIXED TARGET Q-NETWORK
- To avoid oscillations, fix the parameters used in the Q-learning target
- Compute Q-learning targets w.r.t. old, fixed parameters w⁻:
  r + γ max_{a′} Q(s′, a′, w⁻)
- Minimize MSE between Q-network and Q-learning targets:
  L(w) = E_{s,a,r,s′~D}[(r + γ max_{a′} Q(s′, a′, w⁻) − Q(s, a, w))²]
- Periodically update the fixed parameters: w⁻ ← w (see the sketch below)
Source: David Silver
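A minimal sketch of the periodic update w⁻ ← w, assuming q_net and target_net are PyTorch modules; the update interval is an illustrative value.

```python
def sync_target(q_net, target_net, step, update_every=10_000):
    """Copy the online weights w into the frozen target weights w- every
    `update_every` steps; between updates the target stays fixed."""
    if step % update_every == 0:
        target_net.load_state_dict(q_net.state_dict())
```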
TRICK 3 - REWARD/VALUE RANGE
- Advantages
- DQN clips the rewards to [−1,+1]
- This prevents Q-values from becoming too large
- Ensures gradients are well-conditioned
- Disadvantages
- Can’t tell the difference between small and large rewards
Source: David Silver
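The clipping itself is a one-liner; a minimal sketch:

```python
import numpy as np

def clip_reward(reward):
    """Clip the raw change in game score to the range [-1, +1]."""
    return float(np.clip(reward, -1.0, 1.0))
```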
BACK TO BROADMIND
[Diagram: agent-environment loop; at each step the agent observes state s_t and reward r_t and emits action a_t.]
Source: David Silver
INTRODUCTION - ATARI AGENT (AKA BROADMIND)
▸ Aim: create a single neural-network agent that is able to successfully learn to play as many of the games as possible.
▸ Agent plays 49 Atari 2600 arcade games. ▸ Learns strictly from experience - no pre-training. ▸ Inputs: game screen + score. ▸ No game-specific tuning.
INTRODUCTION - ATARI AGENT (AKA BROADMIND)
▸ State — a sequence of 4 consecutive screen frames (a frame-stacking sketch follows below). ▸ Screen is 210×160 pixels with a 128-color palette. ▸ Actions — 18, corresponding to: ▸ 9 directions of the joystick (including no input). ▸ 9 directions + button. ▸ Reward — game score.
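A minimal sketch of assembling the 4-frame state; the class and its interface are illustrative, and the paper additionally grayscales and downsamples each 210×160 frame before stacking.

```python
from collections import deque
import numpy as np

class FrameStack:
    """Keep the last 4 preprocessed frames; their stack is the agent's state."""

    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return self.state()

    def push(self, frame):
        self.frames.append(frame)
        return self.state()

    def state(self):
        # Shape (4, H, W): four consecutive frames stacked along a new axis
        return np.stack(self.frames, axis=0)
```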
SCHEMATIC OF NETWORK
[Figure: schematic of the network; the stacked input frames pass through two convolutional layers followed by two fully connected layers.]
Mnih et al.
NETWORK ARCHITECTURE
Mnih et al.
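The architecture image does not survive in this transcript, but the schematic above shows two convolutional layers followed by two fully connected layers. Below is a PyTorch sketch using roughly the layer sizes reported in the NIPS 2013 version of the paper (input: a stack of four 84×84 preprocessed frames; output: one Q-value per action); treat it as an illustration, not the authors' exact implementation.

```python
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """Sketch of the DQN: conv -> conv -> fully connected -> fully connected."""

    def __init__(self, num_actions=18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4x84x84 -> 16x20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 16x20x20 -> 32x9x9
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # one Q-value per action
        )

    def forward(self, x):
        return self.net(x)
```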
EVALUATION
[Plot: average reward per episode vs. training epochs, on Breakout and on Seaquest.]
Mnih et al.
EVALUATION
[Plot: average predicted action value (Q) vs. training epochs, on Breakout and on Seaquest.]
Mnih et al.
EVALUATION
Mnih et al.
VISUALIZATION OF GAME STATES IN LAST HIDDEN LAYER
Mnih et al.
AVERAGE TOTAL REWARD / SINGLE BEST PERFORMING EPISODE

Average total reward:

                  B. Rider  Breakout  Enduro   Pong  Q*bert  Seaquest  S. Invaders
Random                 354       1.2       0  −20.4     157       110          179
Sarsa [3]              996       5.2     129    −19     614       665          271
Contingency [4]       1743         6     159    −17     960       723          268
DQN                   4092       168     470     20    1952      1705          581
Human                 7456        31     368     −3   18900     28010         3690

Single best performing episode:

                  B. Rider  Breakout  Enduro   Pong  Q*bert  Seaquest  S. Invaders
HNeat Best [8]        3616        52     106     19    1800       920         1720
HNeat Pixel [8]       1332         4      91    −16    1325       800         1145
DQN Best              5184       225     661     21    4500      1740         1075

Mnih et al.
DQN PERFORMANCE
[Chart: DQN score on each of the 49 games as a percentage of a professional human tester’s score (100% = human level), with a best linear learner shown for comparison. Video Pinball, Boxing and Breakout sit far above human level; Ms. Pac-Man, Asteroids, Frostbite, Gravitar, Private Eye and Montezuma’s Revenge sit well below.]
Mnih et al.
TEXT
BROADMIND LEARNS OPTIMAL STRATEGY
https://www.youtube.com/watch?v=rbsqaJwpu6A
VISUALIZATION OF VALUE FUNCTION
Mnih et al.
STRENGTHS AND WEAKNESSES
- Good at:
- Quick-moving, complex, short-horizon games
- Semi-independent trials within the game
- Negative feedback on failure
- Pinball
- Bad at:
- Long-horizon games that don’t converge
- Any “walking around” game
- Pac-Man
Source: Nikolai Yakovenko
FAILURE CASES
▸ Montezuma’s Revenge ▸ Single reward at the end of the level; no intermediate rewards. ▸ Worldly knowledge helps humans play these games relatively easily.
▸ https://www.youtube.com/watch?v=1rwPI3RG-lU
JUERGEN SCHMIDHUBER’S TEAM
Evolving Large-Scale Neural Networks for Vision-Based Reinforcement Learning
▸ Evolutionary Computation based deep NN for RL ▸ Learns to play a car-racing video game ▸ No pre-training or hand-coding of features ▸ Video
RELATED TOPICS/PAPERS
▸ Universal Value Function Approximators, DeepMind ▸ http://jmlr.org/proceedings/papers/v37/schaul15.pdf
- Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning, UMich
- http://papers.nips.cc/paper/5421-deep-learning-for-real-time-atari-game-play-using-offline-monte-carlo-tree-search-planning
- On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models, Juergen Schmidhuber
- http://arxiv.org/abs/1511.09249
NEURAL NETWORK VISION FOR ROBOT DRIVING
DEAN POMERLEAU
ALVINN - AUTONOMOUS DRIVING SYSTEM
▸ ALVINN has successfully driven autonomously at speeds of up to 70 mph, and for distances of over 90 miles on a public highway north of Pittsburgh.
▸ Multiple NNs trained to handle: single-lane dirt roads, single-lane paved bike paths, two-lane suburban neighborhood streets, and lined two-lane highways.
SCHEMATIC OF LEARNING ON THE FLY
[Figure: the sensor image feeds the input retina; the network’s output units are trained against the person’s steering direction.]
NN ARCHITECTURE FOR AUTONOMOUS DRIVING
[Figure: 30×32 sensor input retina, 4 hidden units, and 30 output units spanning steering directions from sharp left through straight ahead to sharp right.]
ARCHITECTURE
▸ NN with a single hidden layer (a sketch follows below). ▸ Input layer: 960 neurons (the 30×32 retina). ▸ Hidden layer: 4 neurons. ▸ Output layer: 30 neurons.
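A minimal NumPy sketch of a network with this shape; the sigmoid activations and weight initialization are assumptions for illustration, not details from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class AlvinnSizedNet:
    """960 inputs (30x32 retina) -> 4 hidden units -> 30 output units."""

    def __init__(self, seed=0):
        rng = np.random.default_rng(seed)
        self.W1, self.b1 = rng.normal(0, 0.1, (960, 4)), np.zeros(4)
        self.W2, self.b2 = rng.normal(0, 0.1, (4, 30)), np.zeros(30)

    def forward(self, retina):
        x = np.asarray(retina, dtype=float).reshape(-1)  # flatten 30x32 -> 960
        h = sigmoid(x @ self.W1 + self.b1)               # 4 hidden units
        return sigmoid(h @ self.W2 + self.b2)            # 30 steering outputs
```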
INPUT
▸ Input “retina” of size 30×32 takes down-sampled input from a video camera / scanning laser.
▸ These days, LIDAR is commonly used to generate a 3D point cloud of the observed environment.
▸ Affine transforms of the input image are used to augment the training set.
DATA AUGMENTATION
▸ Potential issues: ▸ Misalignment errors never seen during training. ▸ Lack of diversity in training set.
▸ Solution: transform the original image to augment the training set.
[Figure: original image vs. transformed image, with the area to fill in after the shift.]
EXTRAPOLATION
[Figure: original vs. improved extrapolation scheme, showing the camera’s field of view, the original and transformed vehicle positions, and the road boundaries.]
OUTPUT
▸ Each network outputs the correct direction to steer, and a confidence score.
▸ The output from the network with the highest confidence is chosen.
▸ The direction to steer is the center of mass of the “hill of activation” (see the sketch below).
STEERING ERROR
[Plot: activation of the 30 output units with a best-fit Gaussian; the Gaussian peak gives the network’s steering direction, and its offset from the person’s steering direction is the steering error (~3.5 units here).]
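A minimal sketch of reading a steering direction off the 30 output activations as a center of mass around the peak; this is a simplified stand-in for the best-fit-Gaussian scheme in the figure, and the window size is an arbitrary choice.

```python
import numpy as np

def steering_direction(activations, half_window=4):
    """Return a fractional output-unit index in [0, 29]: the center of mass
    of the hill of activation around the most active unit."""
    acts = np.asarray(activations, dtype=float)
    peak = int(acts.argmax())
    lo, hi = max(0, peak - half_window), min(len(acts), peak + half_window + 1)
    idx = np.arange(lo, hi)
    weights = acts[lo:hi] - acts[lo:hi].min()   # shift so weights are non-negative
    if weights.sum() == 0:
        return float(peak)
    return float((idx * weights).sum() / weights.sum())
```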
TRAINING
▸ The original sensor image is shifted and rotated to create 14 training exemplars.
▸ A buffer of 200 exemplar patterns is used to train the network (a buffer sketch follows below).
▸ Each exemplar is replaced with another with a constant probability, to ensure diversity.
▸ 2.5 sec per training cycle. Total training time = 4 min.
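A minimal sketch of such an exemplar buffer; the capacity of 200 comes from the slide, while the replacement probability is an illustrative value.

```python
import random

class ExemplarBuffer:
    """Buffer of training exemplars; stored patterns are swapped out with a
    constant probability so the buffer stays diverse over time."""

    def __init__(self, capacity=200, replace_prob=0.1):
        self.capacity = capacity
        self.replace_prob = replace_prob
        self.exemplars = []

    def add(self, new_exemplars):
        # e.g. the 14 shifted/rotated copies of the latest sensor image
        for ex in new_exemplars:
            if len(self.exemplars) < self.capacity:
                self.exemplars.append(ex)
            elif random.random() < self.replace_prob:
                self.exemplars[random.randrange(self.capacity)] = ex
```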
[Plot: displacement from road center (cm) vs. distance travelled (m), comparing training without transforms and without the buffer (−trans −buff), with transforms only (+trans −buff), and with both (+trans +buff).]
PERFORMANCE
Low value throughout is better.
ALVINN
▸ ALVINN (1995) ▸ https://www.youtube.com/watch?v=ilP4aPDTBPE
TAKEAWAY
▸ Creating AGI is hard.
▸ RNNAIs, Memory Networks + RL, etc. promise an exciting future.
▸ A tangible first step.