Asynchronous Methods for Deep Reinforcement Learning
Dominik Winkelbauer
Basics: state s, action a, reward r, policy π, value v, action value q
• Value function: V^π(s) = E[R_t | s_t = s]
• Action value function: Q^π(s, a) = E[R_t | s_t = s, a_t = a]
• Example (diagram: small MDP with policy probabilities 0.8 / 0.2 and transition probabilities / rewards per action):
  V^π(s_t) = 0.8 · 0.1 · (-1) + 0.8 · 0.9 · 2 + 0.2 · 0.5 · 0 + 0.2 · 0.5 · 1 = 1.46
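The example expectation is just a sum over the policy's action probabilities and the transition probabilities. A minimal Python sketch of that computation; the action labels and the one-step reading of the diagram's numbers are assumptions of the sketch:

```python
# Minimal sketch of the example value computation above.  The action names and
# the probability/reward pairs mirror the slide's diagram and are assumptions.

policy = {"a1": 0.8, "a2": 0.2}                 # pi(a | s)
outcomes = {                                    # (transition prob, reward) per action
    "a1": [(0.1, -1.0), (0.9, 2.0)],
    "a2": [(0.5, 0.0), (0.5, 1.0)],
}

v = sum(policy[a] * sum(p * r for p, r in outcomes[a]) for a in policy)
print(v)  # 0.8*(0.1*-1 + 0.9*2) + 0.2*(0.5*0 + 0.5*1) = 1.46
```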
• Value function: V^π(s) = E[R_t | s_t = s]
• Action value function: Q^π(s, a) = E[R_t | s_t = s, a_t = a]
• Optimal action value function: Q*(s, a) = max_π Q^π(s, a)
  => Q*(s, a) implicitly describes an optimal policy: in every state pick a = argmax_a Q*(s, a)
  (Diagram: the same example MDP with the unknown values marked "?")
Value-based algorithms:
- Try to approximate V*(s) or Q*(s, a)
- Implicitly learn the policy
Policy-based algorithms:
- Directly learn the policy
Q-Learning
• Try to iteratively calculate Q*(s, a) with the update
  Q(s, a) ← r + γ · max_{a'} Q(s', a')
  (Diagram: transition s → s' with example action values q = 0.5, q = 1.2, q = -1)
• Idea: use a neural network for approximating Q and minimize the loss
  L(θ) = E[(r + γ · max_{a'} Q(s', a'; θ) - Q(s, a; θ))²]
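Before moving to the neural-network version, a tabular sketch of the update rule above can serve as a reference point. The learning rate and the dictionary representation of Q are additions of this sketch, not part of the slide:

```python
from collections import defaultdict

# Tabular sketch of the one-step Q-learning update
# Q(s, a) <- r + gamma * max_a' Q(s', a').
# States and actions are assumed to be hashable; alpha is an assumed learning rate.

GAMMA, ALPHA = 0.99, 0.1
Q = defaultdict(float)                       # Q[(state, action)] -> estimate of Q*(s, a)

def q_update(s, a, r, s_next, actions):
    """Move Q(s, a) towards the bootstrapped target."""
    target = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])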
How to traverse through the environment
• We follow an ε-greedy policy with ε ∈ [0, 1]
• In every state:
  • Sample a random number p ∈ [0, 1]
  • If p > ε => choose the action with the maximum q value
  • Else => choose a random action
• Exploration vs. exploitation
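A minimal sketch of this rule; `q_values` is a hypothetical mapping from actions to their current Q estimates for the current state:

```python
import random

# Epsilon-greedy action selection as described above.

def epsilon_greedy(q_values, epsilon):
    p = random.random()                              # sample p in [0, 1]
    if p > epsilon:
        return max(q_values, key=q_values.get)       # exploit: action with maximum q value
    return random.choice(list(q_values))             # explore: random action

# Example: epsilon_greedy({"left": 0.5, "right": 1.2}, epsilon=0.1) usually returns "right".
```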
Q-Learning with Neural Networks
• Agent: a neural network approximating Q*(s, a)
• Use the network to traverse through the environment
• Train the network with the generated data
  => Data is non-stationary => training with a NN is unstable
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D. & Riedmiller, M. (2013): Playing Atari with Deep Reinforcement Learning
• Agent: a neural network approximating Q*(s, a); use the network to traverse through the environment
• Store new data in a replay memory
• Train the network with data randomly sampled from the replay memory
  => Data is stationary => training with a NN is stable
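The replay memory itself is a small mechanism: store transitions while acting, train on randomly sampled minibatches so that the samples are no longer temporally correlated. A sketch, with capacity and batch size chosen arbitrarily:

```python
import random
from collections import deque

# Sketch of the replay-memory mechanism described above.

replay_memory = deque(maxlen=100_000)            # drop the oldest transitions when full

def store(s, a, r, s_next, done):
    replay_memory.append((s, a, r, s_next, done))

def sample_minibatch(batch_size=32):
    return random.sample(replay_memory, batch_size)   # random sampling breaks correlation
```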
On-policy vs. off-policy
• On-policy: the data used to train our policy has to be generated by the exact same policy. => Example: REINFORCE
• Off-policy: the data used to train our policy can also be generated by another policy. => Example: Q-Learning
Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D. & Kavukcuoglu, K. (ICML, 2016): Asynchronous Methods for Deep Reinforcement Learning
• Alternative method to make RL work better together with neural networks
• Traditional way (one agent): data generation => gradient computation => weight update
• Asynchronous way: several agents (#1 to #4) each generate data and compute gradients in parallel, and their updates are applied to shared weights
Asynchronous Q-Learning
• Combine this idea with Q-Learning
• The generated data is stationary => training is stable
  => No replay memory necessary
  => Data can be used directly while training remains stable
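A toy, runnable sketch of this setup, with a tiny hand-made chain environment and a tabular Q function standing in for the network (both are inventions of the sketch): each worker acts in its own environment copy and writes its one-step updates directly into the shared parameters, with no replay memory.

```python
import random
import threading
from collections import defaultdict

# Toy sketch of asynchronous one-step Q-learning.  Environment, table and
# hyperparameters are illustrative assumptions; the real method uses a network.

GAMMA, ALPHA, EPSILON, N_STATES = 0.9, 0.1, 0.1, 5
Q = defaultdict(float)                                  # shared parameters (here: a table)

def env_step(s, a):
    """Toy chain: action 1 moves right, action 0 moves left; reward 1 at the right end."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

def worker(n_steps=5000):
    s = 0
    for _ in range(n_steps):
        if random.random() < EPSILON:                   # epsilon-greedy exploration
            a = random.choice([0, 1])
        else:
            a = max((0, 1), key=lambda act: Q[(s, act)])
        s_next, r, done = env_step(s, a)
        target = r + GAMMA * max(Q[(s_next, 0)], Q[(s_next, 1)])
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])       # apply the update directly, no replay
        s = 0 if done else s_next

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```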
Value-based algorithms:
- Try to approximate V*(s) or Q*(s, a)
- Implicitly learn the policy
Policy-based algorithms:
- Directly learn the policy
REINFORCE
• Gradient: ∇_θ log π(a_t | s_t, θ) · R_t
• Sample trajectories and reinforce actions which lead to high rewards
  (Diagram: an example trajectory through the MDP from the earlier slides)
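The gradient above can be written out explicitly for a simple parameterization. A sketch, assuming a linear softmax policy, a one-hot-style feature vector per state, a discount factor, and a trajectory stored as (features, action, reward) tuples; all of these representational choices are assumptions of the sketch:

```python
import numpy as np

# REINFORCE gradient: grad_theta log pi(a_t | s_t, theta) * R_t, for a linear softmax policy.

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_gradient(theta, trajectory, gamma=0.99):
    """theta: (n_actions, n_features); trajectory: list of (features, action, reward)."""
    grad = np.zeros_like(theta)
    rewards = [r for _, _, r in trajectory]
    for t, (phi, a, _) in enumerate(trajectory):
        R_t = sum(gamma ** k * r for k, r in enumerate(rewards[t:]))  # return from step t
        probs = softmax(theta @ phi)
        dlog = -np.outer(probs, phi)                  # d log pi(a_t | s_t) / d theta
        dlog[a] += phi                                # = (one_hot(a_t) - pi(. | s_t)) phi^T
        grad += dlog * R_t
    return grad                                       # gradient ascent: theta += lr * grad
```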
REINFORCE: Problem: high variance
• ∇_θ log π(a_t | s_t, θ) · R_t
• Subtract a baseline: ∇_θ log π(a_t | s_t, θ) · (R_t - b_t(s_t))
  (Diagram: with returns around 500 the per-step gradient terms Δθ are huge and noisy, e.g. 0.9 · 500 = 450, 0.2 · 501 = 100.2, -0.3 · 499 = -149.7)
• Use the value function as the baseline: ∇_θ log π(a_t | s_t, θ) · (R_t - V(s_t; θ_v))
• (R_t - V(s_t; θ_v)) can be seen as an estimate of the advantage: A(s_t, a_t) = Q(s_t, a_t) - V(s_t)
• Actor: policy network; critic: value network
  (Diagram: with the baseline the same terms shrink to 0.9 · 0 = 0, 0.2 · 1 = 0.2, -0.3 · (-1) = 0.3)
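Relative to the REINFORCE sketch, only two things change: the return is replaced by the advantage estimate, and the critic gets its own gradient. A sketch for a single (s_t, a_t) pair, assuming a linear critic V(s; w) = w · φ(s) and the same linear softmax actor as before (the paper uses deep networks for both):

```python
import numpy as np

# Actor and critic gradient terms with a learned value baseline, as on the slide.

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def actor_critic_grads(theta, w, phi, a, R_t):
    """One (s_t, a_t) pair; R_t is the observed return from step t onwards."""
    advantage = R_t - w @ phi                         # R_t - V(s_t; w)
    probs = softmax(theta @ phi)
    dlog = -np.outer(probs, phi)
    dlog[a] += phi
    actor_grad = dlog * advantage                     # grad_theta log pi(a_t | s_t) * advantage
    critic_grad = 2 * (w @ phi - R_t) * phi           # grad_w (R_t - V(s_t; w))^2
    return actor_grad, critic_grad                    # ascend with actor_grad, descend with critic_grad
```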
Update interval
• REINFORCE: update only after a complete episode (the full return R_t is needed)
• Actor-critic with advantage: update every few steps, bootstrapping the return with the critic
  (Diagram: timelines comparing the two update intervals)
Asynchronous advantage actor-critic (A3C)
• Update local parameters from the global shared parameters
• Explore the environment according to the policy π(a_t | s_t; θ) for N steps
• Compute gradients for every visited state:
  • Policy network: ∇_θ log π(a_t | s_t, θ) · (R_t - V(s_t; θ_v))
  • Value network: ∇_{θ_v} (R_t - V(s_t; θ_v))²
• Update the global shared parameters with the computed gradients
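A structural, runnable sketch of one A3C worker following the steps above. The toy chain environment, the one-hot features, the linear actor/critic, the hyperparameters, and the lock are all stand-ins invented for illustration; the paper uses deep networks, RMSProp, and lock-free (Hogwild-style) updates.

```python
import threading
import numpy as np

N_STATES, N_ACTIONS, GAMMA, LR, T_MAX = 5, 2, 0.99, 0.01, 5
global_theta = np.zeros((N_ACTIONS, N_STATES))         # shared policy parameters
global_w = np.zeros(N_STATES)                          # shared value parameters
lock = threading.Lock()

def phi(s):                                            # one-hot state features
    f = np.zeros(N_STATES)
    f[s] = 1.0
    return f

def env_step(s, a):                                    # toy chain: reward 1 at the right end
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def worker(n_updates=500):
    global global_theta, global_w
    s = 0
    for _ in range(n_updates):
        with lock:                                     # 1. copy the global shared parameters
            theta, w = global_theta.copy(), global_w.copy()
        rollout, done = [], False
        for _ in range(T_MAX):                         # 2. act according to pi(a|s; theta) for N steps
            a = np.random.choice(N_ACTIONS, p=softmax(theta @ phi(s)))
            s_next, r, done = env_step(s, a)
            rollout.append((s, a, r))
            s = 0 if done else s_next
            if done:
                break
        R = 0.0 if done else float(w @ phi(s))         # bootstrap from V(s; w) if not terminal
        d_theta, d_w = np.zeros_like(theta), np.zeros_like(w)
        for s_t, a_t, r_t in reversed(rollout):        # 3. accumulate gradients per visited state
            R = r_t + GAMMA * R
            adv = R - w @ phi(s_t)
            probs = softmax(theta @ phi(s_t))
            dlog = -np.outer(probs, phi(s_t))
            dlog[a_t] += phi(s_t)
            d_theta += dlog * adv                      # policy gradient term
            d_w += 2 * adv * phi(s_t)                  # descent direction for (R - V(s_t; w))^2
        with lock:                                     # 4. apply to the global shared parameters
            global_theta += LR * d_theta
            global_w += LR * d_w

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```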
Disadvantage of A3C
• Workers pull from and push to the global network at different times, so gradients are often computed with already-outdated parameters
  (Diagram: global network with agents #1 and #2 performing steps and computing gradients out of sync)
Synchronous version of A3C => A2C
• All agents perform their steps and compute their gradients in lockstep; the global network then applies one combined update
  (Diagram: global network with agents #1 and #2 alternating between performing steps and computing gradients in sync)
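The structural difference is easiest to see in code: gather all rollouts first, then apply a single averaged update. The two stub functions below are hypothetical placeholders for the per-worker steps sketched in the A3C example; everything here is illustrative.

```python
import numpy as np

# Structural sketch of the synchronous variant (A2C).

N_WORKERS, N_PARAMS, LR = 4, 10, 0.01
params = np.zeros(N_PARAMS)

def collect_rollout(worker_id, params):
    """Placeholder: act for N steps in this worker's environment copy."""
    return {"worker": worker_id}

def compute_gradients(params, rollout):
    """Placeholder: policy/value gradients for one rollout."""
    return np.zeros_like(params)

for iteration in range(100):
    rollouts = [collect_rollout(i, params) for i in range(N_WORKERS)]   # workers step in lockstep
    grads = [compute_gradients(params, r) for r in rollouts]
    params += LR * np.mean(grads, axis=0)                               # one combined update
```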
Advantages of "asynchronous methods"
• Simple extension
• Can be applied to a big variety of algorithms
• Makes robust NN training possible
• Linear speedup