Asynchronous Methods for Deep Reinforcement Learning
Dominik Winkelbauer
Notation: state s, action a, reward r, policy π, value v, action value q

Value function: v_π(s) = E[R_t | s_t = s]
Action value function: q_π(s, a) = E[R_t | s_t = s, a]

Example (two actions with policy probabilities 0.8 and 0.2; the first action yields reward −1 with probability 0.1 and reward 2 with probability 0.9, the second yields reward 0 or 1 with probability 0.5 each):
q_π(s_t, a_1) = 0.1 · (−1) + 0.9 · 2 = 1.7
q_π(s_t, a_2) = 0.5 · 0 + 0.5 · 1 = 0.5
v_π(s_t) = 0.8 · 0.1 · (−1) + 0.8 · 0.9 · 2 + 0.2 · 0.5 · 0 + 0.2 · 0.5 · 1 = 1.46
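A minimal sketch of how the numbers in this example come about; the dictionary names (policy, outcomes) are just illustrative, the probabilities and rewards are the ones from the example above.

```python
# Worked example: v_pi(s) as the expectation of the return under policy pi.
policy = {"a1": 0.8, "a2": 0.2}                     # pi(a | s)
outcomes = {
    "a1": [(0.1, -1.0), (0.9, 2.0)],                # (outcome probability, reward)
    "a2": [(0.5,  0.0), (0.5, 1.0)],
}

# Action value: q_pi(s, a) = E[R_t | s_t = s, a]
q = {a: sum(p * r for p, r in outs) for a, outs in outcomes.items()}

# State value: v_pi(s) = sum_a pi(a | s) * q_pi(s, a)
v = sum(policy[a] * q[a] for a in policy)

print(q)   # {'a1': 1.7, 'a2': 0.5}
print(v)   # 1.46
```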
Optimal action value function: q*(s, a) = max_π q_π(s, a)
=> q*(s, a) implicitly describes an optimal policy: in every state, take the action with the highest q*-value.
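The greedy policy implied by q* can be read off with an argmax; a small sketch with a purely hypothetical q-table.

```python
import numpy as np

# Hypothetical tabular q*-values: q_star[state, action]
q_star = np.array([[1.7, 0.5],
                   [0.2, 2.0]])

def greedy_policy(state: int) -> int:
    """The policy implied by q*: in every state pick the action with the highest q*-value."""
    return int(np.argmax(q_star[state]))

print(greedy_policy(0))  # 0
print(greedy_policy(1))  # 1
```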
Two families of RL algorithms:
Value-based algorithms: learn the optimal action value function q*(s, a)
Policy-based algorithms: learn the policy π(a | s; θ) directly
Q-learning:
Loss: L(θ) = E[(r + γ · max_{a'} q(s', a'; θ) − q(s, a; θ))²]
Update target: q(s, a) ← r + γ · max_{a'} q(s', a')
Example: r = 0.5 and successor action values q = 0.5, q = 1.2, q = −1, so with γ = 1: q(s, a) ← 0.5 + 1.2 = 1.7
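A tabular sketch of this update; the learning rate alpha and the small q-table are assumptions for illustration.

```python
import numpy as np

def q_learning_update(q, s, a, r, s_next, alpha=1.0, gamma=1.0):
    """One-step Q-learning: move q[s, a] towards the target r + gamma * max_a' q[s_next, a'].
    With alpha = 1.0 this is exactly the assignment q(s, a) <- r + gamma * max_a' q(s', a')."""
    target = r + gamma * np.max(q[s_next])
    q[s, a] += alpha * (target - q[s, a])
    return target

# Example from the slide: r = 0.5, successor action values 0.5, 1.2 and -1, gamma = 1
q = np.zeros((2, 3))
q[1] = [0.5, 1.2, -1.0]
print(q_learning_update(q, s=0, a=0, r=0.5, s_next=1))   # 1.7
```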
Deep Q-learning without replay memory:
A neural network approximates q*(s, a); the agent uses the network to traverse the environment, and the network is trained on the generated data as it arrives.
=> The training data is non-stationary => training the neural network is unstable.
Deep Q-learning with replay memory:
A neural network approximates q*(s, a); the agent uses the network to traverse the environment, new transitions are stored in a replay memory, and the network is trained on data randomly sampled from that memory.
=> The training data is (approximately) stationary => training the neural network is stable.
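A minimal sketch of such a replay memory; the class name, capacity and batch size are arbitrary assumptions, not an API from any particular library.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores transitions and returns uniformly sampled mini-batches for training."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random sampling breaks the correlation between consecutive transitions,
        # which is what makes training the network stable.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```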
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D. & Riedmiller, M.
Traditional way (one agent): data generation, gradient computation, weight update, repeated sequentially.
Asynchronous way: several agents (Agent #1 … Agent #4) each perform data generation and gradient computation in parallel and contribute to the weight updates.
Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D. & Kavukcuoglu, K. (ICML, 2016)
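A rough illustration of the asynchronous scheme with Python threads and a shared parameter vector; the toy gradient function, the learning rate and all names are placeholders, not the implementation used in the paper.

```python
import threading
import numpy as np

shared_weights = np.zeros(4)          # parameters shared by all agents
lock = threading.Lock()

def toy_gradient(weights):
    # Placeholder for "generate data with the current weights, then compute gradients".
    return np.random.randn(*weights.shape)

def worker(steps=100, lr=0.01):
    for _ in range(steps):
        grad = toy_gradient(shared_weights)     # data generation + gradient computation
        with lock:                              # asynchronous weight update
            shared_weights[:] -= lr * grad      # in-place update of the shared parameters

threads = [threading.Thread(target=worker) for _ in range(4)]   # Agent #1 .. Agent #4
for t in threads:
    t.start()
for t in threads:
    t.join()
```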
Recap: value-based algorithms learn q*(s, a). Next: policy-based algorithms, which learn the policy directly.
REINFORCE: sample trajectories and reinforce actions that lead to high rewards.
Gradient estimate: ∇_θ log π(a_t | s_t; θ) · R_t
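A minimal sketch of one REINFORCE update for a single-state softmax policy; the three-action setup, the learning rate and the placeholder return are illustrative assumptions.

```python
import numpy as np

theta = np.zeros(3)                      # logits of a softmax policy over 3 actions

def policy(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_log_pi(theta, a):
    # d/dtheta log pi(a) for a softmax policy: one-hot(a) - pi
    return np.eye(len(theta))[a] - policy(theta)

def reinforce_step(theta, a, R, lr=0.01):
    # Reinforce the sampled action in proportion to the return R_t it received.
    return theta + lr * grad_log_pi(theta, a) * R

# Sample an action, observe a return, update the policy
a = np.random.choice(3, p=policy(theta))
R = 1.0                                  # placeholder return from a sampled trajectory
theta = reinforce_step(theta, a, R)
```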
Problem with REINFORCE: the update Δθ_i = ∇_{θ_i} log π(a_t | s_t; θ) · R_t is scaled by the absolute size of the return.
Example: two sampled actions with gradient magnitudes 0.9 and 0.2 receive returns of 500 and 501, giving updates of 0.9 · 500 = 450 and 0.2 · 501 = 100.2; both actions are reinforced heavily although they are almost equally good.
Subtract a baseline: ∇_θ log π(a_t | s_t; θ) · (R_t − b_t(s_t))
Example: with three actions chosen with probability 0.33 each and returns of 499, 500 and 501, a baseline of roughly 500 turns the returns into advantages of about −1, 0 and +1; the update for the return of 501 becomes 0.2 · 1 = 0.2 instead of 100.2.
REINFORCE: ∇_θ log π(a_t | s_t; θ) · R_t
Subtract baseline: ∇_θ log π(a_t | s_t; θ) · (R_t − b_t(s_t))
Use the value function as baseline: ∇_θ log π(a_t | s_t; θ) · (R_t − V(s_t; θ_v))
R_t − V(s_t; θ_v) can be seen as an estimate of the advantage: A(a_t, s_t) = Q(a_t, s_t) − V(s_t)
Actor: policy network. Critic: value network.
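A sketch of the actor-critic update with the value function as baseline, again for a single-state softmax actor and a scalar critic; the names, learning rates and the example return are assumptions for illustration.

```python
import numpy as np

theta = np.zeros(3)      # actor: softmax policy parameters
v_s = 0.0                # critic: value estimate V(s; theta_v) for the single state

def policy(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def actor_critic_step(theta, v_s, a, R, lr_pi=0.01, lr_v=0.1):
    advantage = R - v_s                                   # estimate of A(a_t, s_t) = Q - V
    grad_log_pi = np.eye(len(theta))[a] - policy(theta)
    theta = theta + lr_pi * grad_log_pi * advantage       # actor: policy gradient with advantage
    v_s = v_s + lr_v * (R - v_s)                          # critic: move V(s) towards the observed return
    return theta, v_s

a = np.random.choice(3, p=policy(theta))
R = 501.0                                                 # e.g. a return like in the slide's example
theta, v_s = actor_critic_step(theta, v_s, a, R)
```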
REINFORCE … actor-critic with advantage
A3C architecture: a global network and several agents (Agent #1, Agent #2, …). Each agent repeatedly performs steps in its own copy of the environment, computes gradients, and applies them asynchronously to the global network.
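Putting the pieces together, a high-level sketch of what one such agent does; this is pseudocode-style Python, and every object and method here (global_net, env, actor_critic_gradients, …) is a placeholder rather than an API from the paper or any library.

```python
def a3c_worker(global_net, env, t_max=5, gamma=0.99):
    local_net = global_net.copy()                         # agent keeps its own copy of the network
    while True:
        local_net.load_weights(global_net.weights())      # pull the latest global weights
        trajectory = []
        state = env.observation()
        for _ in range(t_max):                            # perform up to t_max steps
            action = local_net.sample_action(state)
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state
            if done:
                break
        # n-step return, bootstrapped from the critic if the episode did not end
        R = 0.0 if done else local_net.value(state)
        grads = local_net.zero_grads()
        for s, a, r in reversed(trajectory):
            R = r + gamma * R
            grads += local_net.actor_critic_gradients(s, a, R)   # uses advantage R - V(s)
        global_net.apply_gradients(grads)                 # asynchronous update of the global network
```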
[Results plot: performance as a function of consumed data]