  1. Asynchronous Methods for Deep Reinforcement Learning (Dominik Winkelbauer)

  2. State s, action a, reward r, policy π, value v, action value q. Value function: V^π(s) = E[R_t | s_t = s]. Action value function: Q^π(s, a) = E[R_t | s_t = s, a_t = a]. Example (probabilities and rewards from the diagram): V^π(s_t) = 0.8 · 0.1 · (−1) + 0.8 · 0.9 · 2 + 0.2 · 0.5 · 0 + 0.2 · 0.5 · 1 = 1.46. [Diagram: small example MDP annotated with policy probabilities, transition probabilities, rewards, values and action values]
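
To make the worked example concrete, here is a minimal Python sketch that only evaluates the expectation defining V^π(s_t); the probability/reward triples are read off the slide's diagram, and the variable names are mine.

```python
# Each term of the expectation: P(action | policy) * P(next state | state, action) * reward.
terms = [
    (0.8, 0.1, -1),
    (0.8, 0.9,  2),
    (0.2, 0.5,  0),
    (0.2, 0.5,  1),
]
value = sum(p_action * p_next * reward for p_action, p_next, reward in terms)
print(round(value, 2))  # 1.46
```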

  3. Value function: V^π(s) = E[R_t | s_t = s]. Action value function: Q^π(s, a) = E[R_t | s_t = s, a_t = a]. Optimal action value function: Q*(s, a) = max_π Q^π(s, a) => Q*(s, a) implicitly describes an optimal policy. [Diagram: the same example MDP with unknown values marked by question marks]

  4. Value-based algorithms: try to approximate V*(s) or Q*(s, a) and thereby implicitly learn a policy. Policy-based algorithms: directly learn a policy.

  5. Q-Learning • Try to iteratively calculate Q*(s, a) via the update Q(s, a) ← r + γ · max_a' Q(s', a') • Idea: use a neural network Q(s, a; θ) to approximate Q and minimize the loss L(θ) = E[(r + γ · max_a' Q(s', a'; θ) − Q(s, a; θ))²] [Diagram: transition from s via action a to s', with example q values]
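
As a concrete illustration of the update rule on this slide, here is a minimal tabular Q-learning sketch; the state/action counts, learning rate and discount factor are assumptions made for illustration. The neural-network variant replaces the table with Q(s, a; θ) and minimizes the squared difference between the same target and the network's prediction.

```python
import numpy as np

# Illustrative sizes and hyperparameters (not from the talk).
n_states, n_actions = 10, 4
gamma, alpha = 0.99, 0.1
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    # Move Q(s, a) towards the target r + gamma * max_a' Q(s', a').
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
```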

  6. How to traverse through the environment • We follow an ε-greedy policy with ε ∈ [0, 1] • In every state: sample a random number l ∈ [0, 1]; if l > ε, choose the action with the maximum q value, else choose a random action • Exploration vs. exploitation
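
A minimal sketch of the ε-greedy rule as stated on this slide; q_values is assumed to be the list of Q estimates for the current state's actions.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Sample l in [0, 1]; if l > epsilon exploit, otherwise explore."""
    l = random.random()
    if l > epsilon:
        # Exploitation: action with the maximum q value.
        return max(range(len(q_values)), key=lambda a: q_values[a])
    # Exploration: uniformly random action.
    return random.randrange(len(q_values))
```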

  7. Q-Learning with Neural Networks: the agent uses the network approximating Q*(s, a) to traverse the environment and trains the network with the generated data => the data is non-stationary => training with a neural network is unstable

  8. Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Graves, Alex; Antonoglou, Ioannis; Wierstra, Daan; Riedmiller, Martin: Playing Atari with Deep Reinforcement Learning. The agent uses the network approximating Q*(s, a) to traverse the environment, stores the new data in a replay memory, and trains the network on data randomly sampled from that memory => the data is stationary => training with a neural network is stable
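
A minimal replay-memory sketch along the lines of this slide; the capacity and batch size are illustrative assumptions.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores transitions and returns uniformly sampled mini-batches."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall out automatically

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive transitions, which is what stabilizes training.
        return random.sample(self.buffer, batch_size)
```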

  9. On-policy vs. off-policy • On-policy: the data used to train our policy has to be generated by exactly the same policy. Example: REINFORCE • Off-policy: the data used to train our policy can also be generated by another policy. Example: Q-Learning

  10. Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K.: Asynchronous Methods for Deep Reinforcement Learning (ICML 2016) • An alternative way to make RL work well with neural networks • Traditional way, one agent: data generation, gradient computation and weight update run one after the other • Asynchronous way: several agents (#1 to #4) generate data and compute gradients in parallel, and the weight updates are applied to shared parameters

  11. Asynchronous Q-Learning • Combine this idea with Q-Learning • The generated data is stationary => training is stable => no replay memory necessary => the data can be used directly while training stays stable

  12. Value-based algorithms: try to approximate V*(s) or Q*(s, a) and thereby implicitly learn a policy. Policy-based algorithms: directly learn a policy.

  13. REINFORCE: ∇_θ log π(a_t | s_t, θ) · R_t. Sample trajectories and reinforce actions that lead to high rewards. [Diagram: the example MDP with a sampled trajectory]
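
A sketch of the REINFORCE gradient for a linear softmax policy π(a | s, θ) = softmax(θᵀs); the linear parameterization and the input format are assumptions made so the example stays self-contained.

```python
import numpy as np

def reinforce_gradient(theta, trajectory):
    """Sum of grad_theta log pi(a_t | s_t, theta) * R_t over one sampled trajectory.

    theta: (n_features, n_actions) parameters of a linear softmax policy.
    trajectory: list of (state_features, action_index, return_R_t) tuples.
    """
    grad = np.zeros_like(theta)
    for s, a, R in trajectory:
        logits = theta.T @ s
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        one_hot = np.zeros_like(probs)
        one_hot[a] = 1.0
        # d/d theta of log softmax(theta^T s)[a] is outer(s, one_hot(a) - probs).
        grad += np.outer(s, one_hot - probs) * R
    return grad
```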

  14. REINFORCE: ∇_θ log π(a_t | s_t, θ) · R_t [Diagram: the example MDP with a sampled trajectory]

  15. REINFORCE: ∇_θ log π(a_t | s_t, θ) · R_t [Diagram: the example MDP with another sampled trajectory]

  16. REINFORCE: ∇_θ log π(a_t | s_t, θ) · R_t. Problem: high variance. Example with three actions (probability 0.33 each) and returns around 500: the updates Δθ_i = ∇_θi log π · R_t come out as 0.9 · 500 = 450, 0.2 · 501 = 100.2 and −0.3 · 499 = −149.7. Remedy: subtract a baseline: ∇_θ log π(a_t | s_t, θ) · (R_t − b_t(s_t))

  17. REINFORCE with a baseline: ∇_θ log π(a_t | s_t, θ) · (R_t − b_t(s_t)). With a baseline of 500 the same updates become 0.9 · 0 = 0, 0.2 · 1 = 0.2 and −0.3 · (−1) = 0.3. Use the value function as the baseline: ∇_θ log π(a_t | s_t, θ) · (R_t − V(s_t; θ_v)). The term R_t − V(s_t; θ_v) can be seen as an estimate of the advantage A(a_t, s_t) = Q(a_t, s_t) − V(s_t). Actor: policy network. Critic: value network.
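
The effect of the baseline can be reproduced directly with the slide's numbers; the pairing of gradient entries and returns below is my reading of the example.

```python
grad_log_pi = [0.9, 0.2, -0.3]   # per-parameter entries of grad log pi
returns     = [500, 501, 499]    # sampled returns R_t
baseline    = 500                # value estimate V(s_t; theta_v)

without_baseline = [g * R for g, R in zip(grad_log_pi, returns)]
with_baseline    = [g * (R - baseline) for g, R in zip(grad_log_pi, returns)]
print(without_baseline)  # [450.0, 100.2, -149.7]
print(with_baseline)     # [0.0, 0.2, 0.3]
```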

  18. Update interval: REINFORCE updates only after a complete trajectory, because it needs the full return R_t; actor-critic with advantage can update every few steps, because the critic's value estimate stands in for the rest of the return. [Diagram: comparison of the update timelines of the two methods]

  19. Asynchronous advantage actor-critic (A3C) • Update the local parameters from the global shared parameters • Explore the environment according to the policy π(a_t | s_t; θ) for N steps • Compute gradients for every visited state: policy network ∇_θ log π(a_t | s_t, θ) · (R_t − V(s_t; θ_v)), value network ∇_θv (R − V(s_i; θ_v))² • Update the global shared parameters with the computed gradients
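
A heavily reduced sketch of the update loop on this slide. The environment interaction and the real policy/value gradients are stubbed out with a toy quadratic objective, and all names (worker, global_params, N_STEPS, ...) are mine; only the structure (sync a local copy, act and accumulate gradients for N steps, push them back to the shared parameters) follows the algorithm.

```python
import threading
import numpy as np

global_params = np.zeros(4)      # shared policy/value parameters (toy size)
lock = threading.Lock()          # protects the shared parameter update
N_STEPS, N_UPDATES, LR = 5, 200, 0.01

def worker(seed):
    rng = np.random.default_rng(seed)
    for _ in range(N_UPDATES):
        local_params = global_params.copy()   # 1. sync local copy from global parameters
        grad = np.zeros_like(local_params)
        for _ in range(N_STEPS):              # 2. "explore" for N steps
            target = rng.normal(size=local_params.shape)
            grad += local_params - target     #    stand-in for policy/value gradients
        with lock:                            # 3. apply gradients to the shared parameters
            global_params[:] -= LR * grad

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```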

  20. Disadvantage of A3C: the agents fetch parameters and push gradients at different times, so gradients may be applied to parameters that have already changed. [Diagram: agents #1 and #2 performing steps and computing gradients asynchronously against the global network]

  21. Synchronous version of A3C => A2C: all agents perform their steps and compute their gradients on the same parameter snapshot, and the global network is updated once with all of them before the next round. [Diagram: agents #1 and #2 performing steps and computing gradients in lockstep against the global network]
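
For contrast, a sketch of the synchronous variant: every worker's gradient is computed on the identical parameter snapshot and one combined update is applied per round; the per-worker gradients are again stand-ins.

```python
import numpy as np

params = np.zeros(4)                      # shared parameters (toy size)
N_WORKERS, N_UPDATES, LR = 4, 200, 0.01
rng = np.random.default_rng(0)

for _ in range(N_UPDATES):
    # Every "worker" gradient is computed against the same snapshot of params.
    grads = [params - rng.normal(size=params.shape) for _ in range(N_WORKERS)]
    params -= LR * np.mean(grads, axis=0)  # single synchronized update per round
```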

  22. Advantages of „Asynchronous methods“ • Simple extension • Can be applied to a big variety of algorithms • Makes robust NN training possible • Linear speedup

  23. Advantages of „Asynchronous methods“ • Simple extension • Can be applied to a big variety of algorithms • Makes robust NN training possible • Linear speedup [Plot: performance over consumed data]
