
PLAYING ATARI WITH DEEP REINFORCEMENT LEARNING / NEURAL NETWORK VISION FOR ROBOT DRIVING - PowerPoint PPT Presentation



  1. PLAYING ATARI WITH DEEP REINFORCEMENT LEARNING NEURAL NETWORK VISION FOR ROBOT DRIVING ARJUN CHANDRASEKARAN DEEP LEARNING AND PERCEPTION (ECE 6504)

  2. PLAYING ATARI WITH DEEP REINFORCEMENT LEARNING NEURAL NETWORK VISION FOR ROBOT DRIVING Attribution: Christopher T Cooper

  3. OUTLINE ▸ Playing Atari with Deep Reinforcement Learning ▸ Motivation ▸ Intro to Reinforcement Learning (RL) ▸ Deep Q-Network (DQN) ▸ BroadMind ▸ Neural Network Vision for Robot Driving

  4. DEEPMIND PLAYING ATARI WITH DEEP REINFORCEMENT LEARNING

  5. OUTLINE ▸ Playing Atari with Deep Reinforcement Learning ▸ Motivation ▸ Intro to Reinforcement Learning (RL) ▸ Deep Q-Network (DQN) ▸ BroadMind ▸ Neural Network Vision for Robot Driving

  6. MOTIVATION AUTOMATICALLY CONVERT UNSTRUCTURED INFORMATION INTO USEFUL, ACTIONABLE KNOWLEDGE. Demis Hassabis Source: Nikolai Yakovenko

  7. MOTIVATION CREATE AN AI SYSTEM THAT HAS THE ABILITY TO LEARN FOR ITSELF FROM EXPERIENCE. Demis Hassabis Source: Nikolai Yakovenko

  8. MOTIVATION CAN DO STUFF THAT MAYBE WE DON’T KNOW HOW TO PROGRAM. Demis Hassabis Source: Nikolai Yakovenko

  9. MOTIVATION In short, CREATE ARTIFICIAL GENERAL INTELLIGENCE

  10. WHY GAMES ▸ Complexity. ▸ Diversity. ▸ Easy to create more data. ▸ Meaningful reward signal. ▸ Can train and learn to transfer knowledge between similar tasks. Adapted from Nikolai Yakovenko

  11. OUTLINE ▸ Playing Atari with Deep Reinforcement Learning ▸ Motivation ▸ Intro to Reinforcement Learning (RL) ▸ Deep Q-Network (DQN) ▸ BroadMind ▸ Neural Network Vision for Robot Driving

  12. AGENT AND ENVIRONMENT [Figure: agent-environment loop with state s_t, action a_t, reward r_t] ▸ At every time step t ▸ Agent ▸ Executes action A_t ▸ Receives observation O_t ▸ Receives scalar reward R_t ▸ Environment ▸ Receives action A_t ▸ Emits observation O_{t+1} ▸ Emits reward R_{t+1} Source: David Silver

  13. REINFORCEMENT LEARNING ▸ RL is a general-purpose framework for artificial intelligence ▸ RL is for an agent with the capacity to act. ▸ Each action influences the agent’s future state. ▸ Success is measured by a scalar reward signal ▸ RL in a nutshell: ▸ Select actions to maximise future reward. Source: David Silver

  14. POLICY AND ACTION-VALUE FUNCTION ▸ Policy π is a behavior function selecting actions given states: a = π(s) ▸ Action-value function Q^π(s, a) is the expected total reward from state s and action a under policy π: ▸ Q^π(s, a) = E[r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … | s, a] ▸ Indicates “how good action a is in state s” Source: David Silver

  15. Q FUNCTION / ACTION-VALUE FUNCTION Q^π(s, a) = E[r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … | s, a]
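
To make the expectation concrete, here is a minimal Python sketch (mine, not from the slides): it computes the discounted return of one sampled reward sequence, and Q^π(s, a) is the expectation of this quantity over trajectories that start with (s, a) and then follow π. The function name and the example rewards are purely illustrative.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ... for one trajectory."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# The third future reward is discounted by gamma**2.
print(discounted_return([0.0, 0.0, 1.0]))  # 0.9801 with gamma = 0.99
```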

  16. MAZE EXAMPLE [Figure: grid maze with Start and Goal squares] Source: David Silver

  17. POLICY [Figure: the same maze with arrows in each square showing the policy from Start to Goal] Source: David Silver

  18. VALUE FUNCTION [Figure: the same maze with a value in each square, from -24 down to -1, i.e. minus the number of steps remaining to the Goal] Source: David Silver

  19. APPROACHES TO REINFORCEMENT LEARNING ▸ Policy-based RL ▸ Search directly for the optimal policy π* ▸ This is the policy achieving maximum future reward ▸ Value-based RL ▸ Estimate the optimal value function Q*(s, a) ▸ This is the maximum value achievable under any policy ▸ Model-based RL ▸ Build a transition model of the environment ▸ Plan (e.g. by lookahead) using the model Source: David Silver

  20. APPROACHES TO REINFORCEMENT LEARNING ▸ Policy-based RL ▸ Search directly for the optimal policy π* ▸ This is the policy achieving maximum future reward ▸ Value-based RL ▸ Estimate the optimal value function Q*(s, a) ▸ This is the maximum value achievable under any policy ▸ Model-based RL ▸ Build a transition model of the environment ▸ Plan (e.g. by lookahead) using the model

  21. OUTLINE ▸ Playing Atari with Deep Reinforcement Learning ▸ Motivation ▸ Intro to Reinforcement Learning (RL) ▸ Deep Q-Network (DQN) ▸ BroadMind ▸ Neural Network Vision for Robot Driving

  22. DEEP REINFORCEMENT LEARNING ▸ How do we combine reinforcement learning with deep neural networks? ▸ Use a deep network to represent the value function/policy/model. ▸ Optimize this value function/policy/model end-to-end. ▸ Use SGD to learn the weights/parameters. Source: David Silver

  23. UNROLLING RECURSIVELY… ‣ Value function can be unrolled recursively: Q^π(s, a) = E[r + γ r_{t+1} + γ² r_{t+2} + … | s, a] = E_{s'}[r + γ Q^π(s', a') | s, a] ‣ Optimal value function Q*(s, a) can be unrolled recursively: Q*(s, a) = E_{s'}[r + γ max_{a'} Q*(s', a') | s, a] ‣ Value iteration algorithms solve the Bellman equation: Q_{i+1}(s, a) = E_{s'}[r + γ max_{a'} Q_i(s', a') | s, a] Source: David Silver
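
As a deliberately tiny illustration of the value iteration update above, here is a Python sketch on a made-up two-state, two-action MDP; the transition matrix, rewards, and iteration count are arbitrary choices of mine, not anything from the slides.

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
P = np.zeros((n_states, n_actions, n_states))  # P[s, a, s'] = transition probability
R = np.zeros((n_states, n_actions))            # R[s, a]     = expected immediate reward
P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 0] = P[1, 1, 1] = 1.0
R[0, 1] = 1.0                                  # action 1 in state 0 pays a reward of 1

Q = np.zeros((n_states, n_actions))
for _ in range(100):
    # Q_{i+1}(s, a) = E_{s'}[ r + gamma * max_{a'} Q_i(s', a') ]
    Q = R + gamma * P @ Q.max(axis=1)

print(Q)  # approaches the fixed point Q* of the Bellman equation
```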

  24. DEEP Q-LEARNING ▸ Represent the action-value function by a deep Q-network with weights w: Q(s, a, w) ≈ Q^π(s, a) ▸ Loss is the mean squared error in Q-values: L(w) = E[(r + γ max_{a'} Q(s', a', w⁻) − Q(s, a, w))²] ‣ Gradient: ∂L(w)/∂w = E[(r + γ max_{a'} Q(s', a', w⁻) − Q(s, a, w)) ∂Q(s, a, w)/∂w] Source: David Silver
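
A minimal PyTorch sketch of the squared TD loss above (my own, not the authors' code), assuming q_net and target_net are any nn.Module that maps a batch of states to one Q-value per action; the frozen weights w⁻ correspond to target_net being excluded from the gradient.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, s, a, r, s_next, gamma=0.99):
    """L(w) = E[(r + gamma * max_a' Q(s', a', w^-) - Q(s, a, w))^2] over a mini-batch."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)           # Q(s, a, w)
    with torch.no_grad():                                          # treat w^- as fixed
        target = r + gamma * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)                                # SGD then follows the gradient above
```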

  25. STABILITY ISSUES WITH DEEP RL ‣ Naive Q-learning oscillates or diverges with neural nets ‣ Data is sequential ‣ Successive samples are correlated, non-iid ‣ Policy changes rapidly with slight changes to Q-values ‣ Policy may oscillate ‣ Distribution of data can swing from one extreme to another ‣ Scale of rewards and Q-values is unknown ‣ Naive Q-learning gradients can be large and unstable when backpropagated Source: David Silver

  26. DEEP Q-NETWORKS ‣ DQN provides a stable solution to deep value-based RL ‣ Use experience replay ‣ Break correlations in data, bring us back to the iid setting ‣ Learn from all past policies ‣ Freeze target Q-network ‣ Avoid oscillations ‣ Break correlations between Q-network and target ‣ Clip rewards or normalize network adaptively to a sensible range ‣ Robust gradients Source: David Silver

  27. TRICK 1 - EXPERIENCE REPLAY ‣ To remove correlations, build a dataset from the agent’s own experience ‣ Take action a_t according to an ε-greedy policy ‣ Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory D ‣ Sample a random mini-batch of transitions (s, a, r, s') from D ‣ Minimize MSE between the Q-network and the Q-learning targets Source: David Silver
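
A sketch of the replay memory D described on this slide; the capacity and mini-batch size are arbitrary illustrative values, not numbers from the paper.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores transitions (s_t, a_t, r_{t+1}, s_{t+1}) and serves uniform random mini-batches."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are discarded once full

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between successive transitions.
        return random.sample(self.buffer, batch_size)
```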

  28. TRICK 2 - FIXED TARGET Q-NETWORK ‣ To avoid oscillations, fix the parameters used in the Q-learning target ‣ Compute Q-learning targets w.r.t. old, fixed parameters w⁻: r + γ max_{a'} Q(s', a', w⁻) ‣ Minimize MSE between the Q-network and the Q-learning targets: L(w) = E_{s,a,r,s'~D}[(r + γ max_{a'} Q(s', a', w⁻) − Q(s, a, w))²] ‣ Periodically update the fixed parameters: w⁻ ← w Source: David Silver
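
A sketch of the periodic update w⁻ ← w, assuming q_net and target_net are two torch.nn.Module instances with identical architecture; the update interval is an arbitrary illustrative choice, not a value from the paper.

```python
TARGET_UPDATE_INTERVAL = 10_000   # illustrative value

def maybe_update_target(step, q_net, target_net):
    # Copy the online weights w into the frozen target weights w^-.
    if step % TARGET_UPDATE_INTERVAL == 0:
        target_net.load_state_dict(q_net.state_dict())
```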

  29. TRICK 3 - REWARD/VALUE RANGE ‣ Advantages ‣ DQN clips the rewards to [−1, +1] ‣ This prevents Q-values from becoming too large ‣ Ensures gradients are well-conditioned ‣ Disadvantages ‣ Can’t tell the difference between small and large rewards Source: David Silver
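
The clipping step itself is a one-liner; a minimal sketch following this slide's description (the paper's exact scheme should be checked against the original):

```python
def clip_reward(r):
    """Clamp a raw game-score change to the range [-1, +1]."""
    return max(-1.0, min(1.0, r))
```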

  30. BACK TO BROADMIND [Figure: agent-environment loop with state s_t, action a_t, reward r_t] Source: David Silver

  31. INTRODUCTION - ATARI AGENT (AKA BROADMIND) ▸ Aim to create a single neural network agent that is able to successfully learn to play as many of the games as possible. ▸ Agent plays 49 Atari 2600 arcade games. ▸ Learns strictly from experience - no pre-training. ▸ Inputs: game screen + score. ▸ No game-specific tuning.

  32. INTRODUCTION - ATARI AGENT (AKA BROADMIND) ▸ State — a sequence of 4 consecutive screen frames. ▸ Screen is 210×160 pixels with a 128-color palette. ▸ Actions — 18, corresponding to: ▸ 9 directions of the joystick (including no input). ▸ The same 9 directions + button. ▸ Reward — game score.
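
A sketch of assembling the state from the last four screens as described on this slide; any per-frame preprocessing of the 210×160 image (grayscale conversion, downsampling) is omitted here and should be taken from the paper, not from this sketch.

```python
from collections import deque
import numpy as np

class FrameStack:
    """Keeps the most recent k screens and exposes them as one stacked state."""

    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def push(self, frame):
        self.frames.append(frame)        # frame: one 210x160 Atari screen as an array

    def state(self):
        # Shape (k, H, W): the channel axis the Q-network sees as its input.
        return np.stack(self.frames, axis=0)
```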


  33. SCHEMATIC OF NETWORK [Figure: network schematic with layers labelled Convolution, Convolution, Fully connected, Fully connected] Mnih et al.

  34. NETWORK ARCHITECTURE Mnih et al.
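
A rough PyTorch sketch of a Q-network in the spirit of the architecture shown here (two convolutional layers followed by two fully connected layers, one output per action); the filter counts, the 84×84 preprocessed input size, and the hidden width are assumptions based on the 2013 paper's description and should be checked against it rather than treated as definitive.

```python
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_actions=18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),   # 4 stacked frames in
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),                  # assumes 84x84 preprocessed input
            nn.Linear(256, n_actions),                              # one Q-value per legal action
        )

    def forward(self, x):
        return self.net(x)
```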

  35. EVALUATION [Plots: average reward per episode vs. training epochs (0-100) on Breakout (left, up to ~250) and Seaquest (right, up to ~1800)] Mnih et al.

  36. EVALUATION [Plots: average action value (Q) vs. training epochs (0-100) on Breakout (left, up to ~4) and Seaquest (right, up to ~9)] Mnih et al.

  37. EVALUATION [Figure with panels A, B, C] Mnih et al.

  38. VISUALIZATION OF GAME STATES IN THE LAST HIDDEN LAYER V. Mnih et al.

  39. AVERAGE TOTAL REWARD
                    B. Rider  Breakout  Enduro   Pong  Q*bert  Seaquest  S. Invaders
  Random                 354       1.2       0  -20.4     157       110          179
  Sarsa [3]              996       5.2     129    -19     614       665          271
  Contingency [4]       1743         6     159    -17     960       723          268
  DQN                   4092       168     470     20    1952      1705          581
  Human                 7456        31     368     -3   18900     28010         3690

  SINGLE BEST PERFORMING EPISODE
                    B. Rider  Breakout  Enduro   Pong  Q*bert  Seaquest  S. Invaders
  Random                 354       1.2       0  -20.4     157       110          179
  HNeat Best [8]        3616        52     106     19    1800       920         1720
  HNeat Pixel [8]       1332         4      91    -16    1325       800         1145
  DQN Best              5184       225     661     21    4500      1740         1075

  Mnih et al.
