

  1. Asynchronous Methods for Deep Reinforcement Learning Dominik Winkelbauer

  2. State s, action a, reward r, policy π, value v, action value q.
     Value function: V^π(s) = E[R_t | s_t = s]
     Action value function: Q^π(s, a) = E[R_t | s_t = s, a]
     Example (tree diagram): under π, action 1 is taken with probability 0.8 and action 2 with probability 0.2; the environment then branches with probabilities 0.1/0.9 onto rewards -1 and 2, and 0.5/0.5 onto rewards 0 and 1. The action values are Q^π(s, action 1) = 1.7 and Q^π(s, action 2) = 0.5, so
     V^π(s) = 0.8 * 0.1 * (-1) + 0.8 * 0.9 * 2 + 0.2 * 0.5 * 0 + 0.2 * 0.5 * 1 = 1.46
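
     The 1.46 in the example can be reproduced in a few lines. A minimal sketch, assuming the branch probabilities and rewards shown in the tree diagram (none of these names come from a library or from the slides):

```python
# Worked example from the slide: evaluate V^pi(s) for the two-step tree.
# Each entry: (probability of taking the action under pi, [(env probability, reward), ...]).
tree = {
    "action 1": (0.8, [(0.1, -1.0), (0.9, 2.0)]),
    "action 2": (0.2, [(0.5,  0.0), (0.5, 1.0)]),
}

# Action value: Q^pi(s, a) = sum over environment branches of p * reward.
q = {a: sum(p * r for p, r in branches) for a, (_, branches) in tree.items()}
print(q)  # {'action 1': 1.7, 'action 2': 0.5}

# State value: V^pi(s) = sum over actions of pi(a|s) * Q^pi(s, a).
v = sum(p_a * q[a] for a, (p_a, _) in tree.items())
print(v)  # 0.8 * 1.7 + 0.2 * 0.5 = 1.46
```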

  3. Value function: V^π(s) = E[R_t | s_t = s]
     Action value function: Q^π(s, a) = E[R_t | s_t = s, a]
     Optimal action value function: Q*(s, a) = max_π Q^π(s, a)
     => Q*(s, a) implicitly describes an optimal policy
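
     A minimal sketch of why Q* implicitly describes a policy: acting greedily with respect to it picks the optimal action in each state. The q_star table below is a hypothetical example, not data from the slides:

```python
# Q*(s, a) implicitly defines an optimal policy via greedy action selection.
q_star = {
    "s0": {"left": 2.0, "right": -1.0},
    "s1": {"left": 0.5, "right": 1.7},
}

def optimal_action(state):
    # pi*(s) = argmax_a Q*(s, a)
    actions = q_star[state]
    return max(actions, key=actions.get)

print(optimal_action("s0"))  # left
print(optimal_action("s1"))  # right
```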

  4. Value-based algorithms: try to approximate V*(s) or Q*(s, a); the policy is learned implicitly.
     Policy-based algorithms: learn the policy directly.

  5. Q-Learning
     • Try to iteratively calculate Q*(s, a):
       Q(s, a) <- r + γ max_{a'} Q(s', a')
       (Diagram: reward r = 0.5, next-state values 1.2, -1 and 0.5, so with γ = 1 the update gives Q(s, a) <- 0.5 + 1.2 = 1.7.)
     • Idea: use a neural network to approximate Q, trained on the loss
       L(θ) = E[(r + γ max_{a'} Q(s', a'; θ⁻) - Q(s, a; θ))²]
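
     A minimal tabular sketch of the update rule above; the environment interface, alpha, gamma and epsilon are assumptions for illustration. The neural-network variant replaces the table with Q(s, a; θ) and minimizes the squared TD error instead:

```python
import random
from collections import defaultdict

# Tabular Q-learning sketch:
#   Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
# `env` is a hypothetical environment with reset() -> state and step(a) -> (state, reward, done).
def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    q = defaultdict(float)  # q[(state, action)], initialised to 0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy (Q-learning is off-policy, see slide 9)
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])
            next_state, reward, done = env.step(action)
            target = reward if done else reward + gamma * max(q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state
    return q
```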

  6. How to traverse through the environment
     • We follow an ε-greedy policy with ε ∈ [0, 1]
     • In every state:
       • Sample a random number x ∈ [0, 1]
       • If x > ε => choose the action with the maximum q value
       • else => choose a random action
     • Exploration vs. exploitation
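
     A minimal sketch of this selection rule (the q_values dict is a hypothetical stand-in for the agent's current value estimates):

```python
import random

# Epsilon-greedy action selection: with probability epsilon choose a random action
# (exploration), otherwise choose the action with the maximum q value (exploitation).
def epsilon_greedy(q_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(q_values))     # explore
    return max(q_values, key=q_values.get)       # exploit

# Example: epsilon_greedy({"left": 0.5, "right": 1.2}) usually returns "right".
```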

  7. Q-Learning with Neural Networks
     Agent and neural network approximating Q*(s, a): use the network to traverse through the environment and train it with the generated data.
     => The data is non-stationary => training the neural network is unstable

  8. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D. & Riedmiller, M.: Playing Atari with Deep Reinforcement Learning
     Agent and neural network approximating Q*(s, a): use the network to traverse through the environment, store the new data in a replay memory, and train the network with data sampled randomly from that memory.
     => The training data is stationary => training the neural network is stable
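
     A minimal sketch of the replay-memory idea (the class and parameter names are assumptions, not the paper's code): new transitions are pushed in, and training batches are drawn uniformly at random, which decorrelates consecutive samples:

```python
import random
from collections import deque

# Replay memory sketch: store transitions, sample random minibatches for training.
class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between consecutive transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```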

  9. On-policy vs. off-policy
     • On-policy: the data used to train our policy has to be generated by exactly the same policy. => Example: REINFORCE
     • Off-policy: the data used to train our policy can also be generated by another policy. => Example: Q-Learning

  10. Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D. & Kavukcuoglu, K. (ICML 2016): Asynchronous Methods for Deep Reinforcement Learning
      • Alternative method to make RL work better together with neural networks
      • Traditional way, one agent: data generation -> gradient computation -> weight update, repeated sequentially.
      • Asynchronous way: several agents (#1 to #4) generate data and compute gradients in parallel and apply their updates to shared weights.
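
      A minimal sketch of the asynchronous pattern in the diagram. All names here (make_env, compute_gradients, apply_gradients, global_params) are hypothetical helpers, not an API from the paper: every worker generates its own data, computes gradients locally, and applies them to the shared parameters without waiting for the other workers.

```python
import threading

lock = threading.Lock()

def worker(global_params, make_env, compute_gradients, apply_gradients, iterations=10_000):
    env = make_env()                       # each agent gets its own environment instance
    for _ in range(iterations):
        snapshot = dict(global_params)     # sync: copy the current shared weights
        grads = compute_gradients(env, snapshot)   # data generation + gradient computation
        with lock:
            apply_gradients(global_params, grads)  # asynchronous weight update

# Hypothetical usage with four parallel agents:
# threads = [threading.Thread(target=worker, args=(global_params, make_env,
#                                                  compute_gradients, apply_gradients))
#            for _ in range(4)]
# for t in threads:
#     t.start()
```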

  11. Asynchronous Q-Learning
      • Combine this idea with Q-Learning
      • The data generated by the parallel agents is stationary => training is stable
      => No replay memory necessary => the data can be used directly while training remains stable

  12. Value-based algorithms: try to approximate V*(s) or Q*(s, a); the policy is learned implicitly.
      Policy-based algorithms: learn the policy directly.

  13. REINFORCE: ∇_θ log π(a_t | s_t; θ) R_t
      Sample trajectories and reinforce actions that lead to high rewards.
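
      A minimal PyTorch-flavoured sketch of this update, under the assumption that `policy_net` is a torch.nn.Module mapping states to action logits and `trajectories` holds (states, actions, returns) tensors collected with the current policy (these names are illustrative, not from the slides):

```python
import torch

# REINFORCE sketch: ascend E[ log pi(a_t | s_t; theta) * R_t ].
def reinforce_update(policy_net, optimizer, trajectories):
    loss = 0.0
    for states, actions, returns in trajectories:
        logits = policy_net(states)                           # shape (T, num_actions)
        dist = torch.distributions.Categorical(logits=logits)
        log_probs = dist.log_prob(actions)                    # log pi(a_t | s_t; theta)
        loss = loss - (log_probs * returns).sum()             # minus sign: optimizer minimizes
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```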

  14. REINFORCE (continued): ∇_θ log π(a_t | s_t; θ) R_t (animation step of the trajectory diagram)

  15. REINFORCE (continued): ∇_θ log π(a_t | s_t; θ) R_t (animation step of the trajectory diagram)

  16. REINFORCE: Problem: high variance
      ∇_θ log π(a_t | s_t; θ) R_t
      Subtract a baseline: ∇_θ log π(a_t | s_t; θ) (R_t - b_t(s_t))
      (Example from the diagram: three trajectories, each sampled with probability 0.33, have returns of 500, 501 and 499; multiplied by log-probability gradients of 0.9, 0.2 and -0.3 they give update contributions of 450, 100.2 and -149.7, which are large and noisy.)

  17. REINFORCE: Problem: high variance
      Subtract a baseline: ∇_θ log π(a_t | s_t; θ) (R_t - b_t(s_t))
      (With a baseline of 500 the same example contributions shrink to 0.9 * 0 = 0, 0.2 * 1 = 0.2 and -0.3 * (-1) = 0.3.)
      Use the value function as baseline: ∇_θ log π(a_t | s_t; θ) (R_t - V(s_t; θ_v))
      (R_t - V(s_t; θ_v)) can be seen as an estimate of the advantage: A(a_t, s_t) = Q(a_t, s_t) - V(s_t)
      Actor: policy network. Critic: value network.
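
      A minimal sketch of the baseline-subtracted update, again PyTorch-flavoured and with assumed module names (`policy_net`, `value_net`): the critic's estimate V(s_t; θ_v) is subtracted from the return, and the critic itself is trained on the squared error.

```python
import torch

# Actor-critic loss sketch: log-probabilities weighted by the advantage estimate (R_t - V(s_t)).
def actor_critic_loss(policy_net, value_net, states, actions, returns):
    logits = policy_net(states)
    dist = torch.distributions.Categorical(logits=logits)
    values = value_net(states).squeeze(-1)                       # V(s_t; theta_v)
    advantages = returns - values.detach()                       # baseline lowers variance, not bias
    policy_loss = -(dist.log_prob(actions) * advantages).sum()   # actor (policy network)
    value_loss = ((returns - values) ** 2).sum()                 # critic (value network)
    return policy_loss + value_loss
```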

  18. Update interval
      REINFORCE: an update is only possible after a complete episode, once the return R_t is known.
      Actor-critic with advantage: updates are possible every few steps, since the value network can bootstrap the remaining return.

  19. Asynchronous advantage actor-critic (A3C)
      • Update the local parameters from the global shared parameters
      • Explore the environment according to the policy π(a_t | s_t; θ) for N steps
      • Compute gradients for every visited state:
        • Policy network: ∇_θ log π(a_t | s_t; θ) (R_t - V(s_t; θ_v))
        • Value network: ∇_{θ_v} (R_t - V(s_t; θ_v))²
      • Update the global shared parameters with the computed gradients
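
      A minimal sketch of one worker iteration following the steps above. The environment API, the module names and the shared optimizer are assumptions, and the synchronisation with the global parameters is simplified compared to the paper:

```python
import torch

# One A3C worker iteration: act for N steps, build n-step returns, push gradients to the shared model.
def a3c_worker_step(env, state, policy_net, value_net, global_optimizer, n_steps=5, gamma=0.99):
    log_probs, values, rewards = [], [], []
    for _ in range(n_steps):
        logits = policy_net(state)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        next_state, reward, done = env.step(action.item())   # hypothetical env API
        log_probs.append(dist.log_prob(action))
        values.append(value_net(state).squeeze(-1))           # V(s_t; theta_v)
        rewards.append(reward)
        state = next_state
        if done:
            break

    # n-step return R_t; bootstrap from V(s) if the episode did not end.
    R = 0.0 if done else value_net(state).squeeze(-1).detach()
    policy_loss, value_loss = 0.0, 0.0
    for log_prob, value, reward in zip(reversed(log_probs), reversed(values), reversed(rewards)):
        R = reward + gamma * R
        advantage = R - value
        policy_loss = policy_loss - log_prob * advantage.detach()  # policy-network gradient
        value_loss = value_loss + advantage ** 2                   # value-network gradient
    global_optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    global_optimizer.step()   # update of the shared parameters
    return state
```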

  20. Disadvantage of A3C
      The agents run out of sync against the global network: while agent #1 is still performing its steps, agent #2 may already push its gradients, so gradients are often computed with parameters that have meanwhile changed.

  21. Synchronous version of A3C => A2C
      All agents perform their steps first; then all gradients are computed and applied to the global network together before the next round starts.
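
      A minimal sketch of one synchronous round (every name here is a hypothetical helper, not an API from the paper): all agents step in lock-step on the same parameter snapshot, their gradients are averaged, and a single combined update is applied.

```python
# One A2C round, contrasted with the asynchronous updates of A3C.
def a2c_round(envs, compute_gradients, apply_update, global_params):
    # Every agent steps and computes gradients on the *same* parameter snapshot.
    all_grads = [compute_gradients(env, global_params) for env in envs]
    # Average the gradients elementwise (assuming they are dicts of floats for simplicity).
    averaged = {k: sum(g[k] for g in all_grads) / len(all_grads) for k in all_grads[0]}
    apply_update(global_params, averaged)   # one synchronous update of the shared weights
```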

  22. Advantages of "asynchronous methods"
      • Simple extension
      • Can be applied to a large variety of algorithms
      • Makes robust NN training possible
      • Linear speedup

  23. Advantages of "asynchronous methods" (continued)
      Same list as slide 22, plus a figure on the consumed data.
