SLIDE 1

Asynchronous Methods for Deep Reinforcement Learning

Dominik Winkelbauer

SLIDE 2

State $s$, Action $a$, Reward $r$, Policy $\pi$, Value $v$, Action value $q$

Value function: $V^\pi(s) = \mathbb{E}[R_t \mid s_t = s]$

Action value function: $Q^\pi(s,a) = \mathbb{E}[R_t \mid s_t = s, a_t = a]$

[Figure: a two-state example MDP; the policy picks its two actions with probabilities 0.8 and 0.2, the transitions occur with probabilities 0.1/0.9 and 0.5/0.5, and the resulting values are 1.46, 1.7, 0.5 and 2]

Example: $V^\pi(s_t) = 0.8 \cdot 0.1 \cdot (-1) + 0.8 \cdot 0.9 \cdot 2 + 0.2 \cdot 0.5 \cdot 0 + 0.2 \cdot 0.5 \cdot 1 = 1.46$
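The example can be checked directly: the value is the probability-weighted sum of rewards over actions and transitions. A minimal sketch (the state and action names are illustrative assumptions):

```python
# V(s) = sum over actions a and outcomes of pi(a|s) * p(outcome) * reward
action_probs = {"a1": 0.8, "a2": 0.2}     # pi(a|s) from the slide
outcomes = {
    "a1": [(0.1, -1.0), (0.9, 2.0)],      # (transition prob, reward)
    "a2": [(0.5, 0.0), (0.5, 1.0)],
}

v = sum(
    p_a * p * r
    for a, p_a in action_probs.items()
    for p, r in outcomes[a]
)
print(v)  # ~1.46, matching the slide
```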

SLIDE 3

State $s$, Action $a$, Reward $r$, Policy $\pi$, Value $v$, Action value $q$

Value function: $V^\pi(s) = \mathbb{E}[R_t \mid s_t = s]$

Action value function: $Q^\pi(s,a) = \mathbb{E}[R_t \mid s_t = s, a_t = a]$

Optimal action value function: $Q^*(s,a) = \max_\pi Q^\pi(s,a)$

[Figure: the same two-state example MDP, with the action values still unknown]

=> $Q^*(s,a)$ implicitly describes an optimal policy
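Acting greedily with respect to $Q^*$ recovers that policy. A minimal sketch (the state and action names and the values are illustrative assumptions):

```python
# The optimal policy just picks the action with the highest Q* value.
def greedy_action(q_star: dict, state):
    return max(q_star[state], key=q_star[state].get)

q_star = {"s1": {"left": 0.5, "right": 1.7}}   # made-up values
print(greedy_action(q_star, "s1"))             # -> "right"
```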
SLIDE 4

Value-based algorithms

  • Try to approximate $V^*(s)$ or $Q^*(s,a)$
  • Implicitly learn the policy

Policy-based algorithms

  • Directly learn the policy
SLIDE 5

Q-Learning

  • Try to iteratively calculate $Q^*(s,a)$
  • Idea: use a neural network to approximate Q

$L(\theta) = \mathbb{E}\big[\big(r + \gamma \max_{a'} Q(s',a';\theta) - Q(s,a;\theta)\big)^2\big]$

$Q(s,a) \longleftarrow r + \gamma \max_{a'} Q(s',a')$

[Figure: a transition from state s via action a to state s'; the successor actions have q = 0.5, q = 1.2 and q = −1, and Q(s,a) is updated toward the target 1.7]
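A minimal tabular sketch of this update rule (the learning rate and the environment interface are assumptions; the slide shows only the pure target):

```python
from collections import defaultdict

GAMMA = 0.99   # discount factor (the gamma in the formula above)
ALPHA = 0.1    # learning rate; assumed, not on the slide

Q = defaultdict(float)  # Q[(state, action)], zero-initialized

def q_update(s, a, r, s_next, actions):
    """Move Q(s, a) toward the target r + gamma * max_a' Q(s', a')."""
    target = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```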

SLIDE 6

How to traverse the environment

  • We follow an $\epsilon$-greedy policy with $\epsilon \in [0,1]$ (transcribed in the sketch below)
  • In every state:
  • Sample a random number $k \in [0,1]$
  • If $k > \epsilon$ => choose the action with the maximum q value
  • else => choose a random action
  • Exploration vs. Exploitation
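A direct transcription of this rule (the mapping from actions to q values is an assumed interface):

```python
import random

def epsilon_greedy(q_values: dict, epsilon: float):
    """With probability 1 - epsilon exploit, otherwise explore."""
    if random.random() > epsilon:
        return max(q_values, key=q_values.get)   # exploitation: best known action
    return random.choice(list(q_values))         # exploration: uniform random action

print(epsilon_greedy({"left": 0.5, "right": 1.7}, epsilon=0.1))
```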
SLIDE 7

Q-Learning with Neural Networks

[Diagram: a neural network approximates Q*(s,a); the agent uses the network to traverse the environment and trains it directly on the data it generates]

=> The data is non-stationary => training the neural network is unstable

SLIDE 8

Playing Atari with Deep Reinforcement Learning

[Diagram: the agent uses the network to traverse the environment, stores each new transition in a replay memory, and trains the network on data randomly sampled from that memory]

=> The sampled data is stationary => training the neural network is stable

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D. & Riedmiller, M. (2013)
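A minimal replay-memory sketch (capacity and batch size are illustrative assumptions):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall out first

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size: int = 32):
        # Uniform sampling breaks the temporal correlations that make
        # naive online training unstable.
        return random.sample(self.buffer, batch_size)
```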

SLIDE 9

On-policy vs. off-policy

  • On-policy: the data used to train our policy has to be generated by that exact same policy. => Example: REINFORCE
  • Off-policy: the data used to train our policy can also be generated by a different policy. => Example: Q-Learning

SLIDE 10

Asynchronous Methods for Deep RL

  • An alternative way to make RL work well together with neural networks

[Diagram: the traditional way uses one agent in a loop of data generation, gradient computation and weight update; the asynchronous way runs agents #1-#4 in parallel, each generating data and computing gradients, feeding a shared weight update]

Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D. & Kavukcuoglu, K. (ICML, 2016)

SLIDE 11

Asynchronous Q-Learning

  • Combine the asynchronous idea with Q-Learning
  • The generated data is stationary

=> Training is stable
=> No replay memory necessary
=> Data can be used directly while training remains stable (see the sketch below)
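A condensed sketch of the asynchronous pattern with a shared tabular Q (the toy environment stub, thread count and hyperparameters are assumptions; the actual method uses neural networks with shared weights):

```python
import random
import threading
from collections import defaultdict

GAMMA, ALPHA, EPSILON = 0.99, 0.1, 0.1
ACTIONS = [0, 1]
Q = defaultdict(float)              # shared parameters, updated lock-free

def env_step(state, action):
    """Toy environment stub (assumption): random reward and next state."""
    return random.random(), random.randrange(4)

def worker(n_steps: int = 1000):
    state = 0
    for _ in range(n_steps):
        # Each thread follows its own epsilon-greedy trajectory...
        if random.random() > EPSILON:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        else:
            action = random.choice(ACTIONS)
        reward, next_state = env_step(state, action)
        # ...and applies Q-learning updates directly to the shared table.
        target = reward + GAMMA * max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
        state = next_state

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```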

SLIDE 12

Value-based algorithms

  • Try to approximate $V^*(s)$ or $Q^*(s,a)$
  • Implicitly learn the policy

Policy-based algorithms

  • Directly learn the policy
SLIDE 13

[Figure: the two-state example MDP from slide 2]

REINFORCE: $\nabla_\theta \log \pi(a_t \mid s_t, \theta)\, R_t$

Sample trajectories and reinforce actions that lead to high rewards.
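A sketch of this estimator for a tabular softmax policy (numpy; the two-state, two-action setup is an illustrative assumption):

```python
import numpy as np

theta = np.zeros((2, 2))   # one logit per (state, action); pi(a|s) = softmax(theta[s])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_grad(state: int, action: int, ret: float):
    """Gradient of log pi(a|s) scaled by the sampled return R_t."""
    probs = softmax(theta[state])
    grad_log = -probs
    grad_log[action] += 1.0       # d log softmax / d logits
    return grad_log * ret

theta[0] += 0.01 * reinforce_grad(state=0, action=1, ret=2.0)  # gradient ascent step
```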


SLIDE 16

Problem: High Variance

βˆ‡πœ„π‘— log 𝜌 𝑏𝑒 𝑑𝑒, πœ„ 𝑆𝑒 Ξ”πœ„π‘— 0.9 500 450 0.2 501 100,2

  • 0.3

499

  • 149,7

499 500 501

βˆ‡πœ„ log 𝜌 𝑏𝑒 𝑑𝑒, πœ„ 𝑆𝑒

REINFORCE:

βˆ‡πœ„ log 𝜌 𝑏𝑒 𝑑𝑒, πœ„ (𝑆𝑒 βˆ’ 𝑐𝑒(𝑑𝑒))

Substract baseline:

0.33 0.33 0.33
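The variance reduction can be checked numerically (using the mean return as baseline, an assumption; the numbers are from the table above):

```python
import numpy as np

grads   = np.array([0.9, 0.2, 0.3])      # d log pi / d theta_j per sample
returns = np.array([500.0, 501.0, 499.0])

raw      = grads * returns               # plain REINFORCE updates: 450, 100.2, 149.7
baseline = returns.mean()                # 500.0
centered = grads * (returns - baseline)  # 0.0, 0.2, -0.3

print(raw.var(), centered.var())         # variance drops by several orders of magnitude
```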

SLIDE 17

Problem: High Variance

βˆ‡πœ„π‘— log 𝜌 𝑏𝑒 𝑑𝑒, πœ„ 𝐡𝑒 Ξ”πœ„π‘— 0.9 0.2 1 0.2

  • 0.3
  • 1

0.3 499 0.33 500 501 0.33 0.33

βˆ‡πœ„ log 𝜌 𝑏𝑒 𝑑𝑒, πœ„ 𝑆𝑒 βˆ‡πœ„ log 𝜌 𝑏𝑒 𝑑𝑒, πœ„ (𝑆𝑒 βˆ’ 𝑐𝑒(𝑑𝑒)) βˆ‡πœ„ log 𝜌 𝑏𝑒 𝑑𝑒, πœ„ (𝑆𝑒 βˆ’ π‘Š(𝑑𝑒, πœ„π‘€))

REINFORCE: Substract baseline: Use value function as baseline: Can be seen as estimate of advantage: 𝐡 𝑏𝑒, 𝑑𝑒 = 𝑅 𝑏𝑒, 𝑑𝑒 βˆ’ π‘Š(𝑑𝑒) Actor: policy network Critic: value network

500

(π‘†π‘’βˆ’π‘Š(𝑑𝑒, πœ„π‘€))

SLIDE 18

Update interval

[Figure: REINFORCE only updates after a complete trajectory; actor-critic with advantage can update every few steps along the way (example rewards 1, 0.1, 0.1, 0.6)]

SLIDE 19

Asynchronous advantage actor-critic (A3C)

  • Update local parameters from the global shared parameters
  • Explore the environment according to the policy $\pi(a_t \mid s_t; \theta)$ for N steps
  • Compute gradients for every visited state:
  • Policy network: $\nabla_\theta \log \pi(a_t \mid s_t, \theta)\,(R_t - V(s_t, \theta_v))$
  • Value network: $\nabla_{\theta_v} \big(R - V(s_j; \theta_v)\big)^2$
  • Update the global shared parameters with the computed gradients (sketched below)
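A condensed single-worker sketch of this loop, reusing the tabular softmax setup from above (the environment stub, N, step budget and learning rates are assumptions; the actual method runs several such workers in parallel with neural networks):

```python
import random
import numpy as np

N, GAMMA = 5, 0.99
global_theta   = np.zeros((4, 2))    # shared policy parameters
global_theta_v = np.zeros(4)         # shared value parameters

def env_step(s, a):
    """Toy environment stub (assumption)."""
    return random.random(), random.randrange(4), random.random() < 0.05

def worker(n_updates: int = 100):
    s = 0
    for _ in range(n_updates):
        theta, theta_v = global_theta.copy(), global_theta_v.copy()  # sync local copy
        traj, done = [], False
        for _ in range(N):                                # explore for N steps
            p = np.exp(theta[s] - theta[s].max()); p /= p.sum()
            a = int(np.random.choice(2, p=p))
            r, s2, done = env_step(s, a)
            traj.append((s, a, r)); s = s2
            if done: break
        R = 0.0 if done else float(theta_v[s])            # bootstrap from V(s)
        for (si, ai, ri) in reversed(traj):               # gradients for each state
            R = ri + GAMMA * R
            adv = R - theta_v[si]
            p = np.exp(theta[si] - theta[si].max()); p /= p.sum()
            g = -p; g[ai] += 1.0                          # grad of log pi(a|s)
            global_theta[si]   += 0.01 * g * adv          # update global policy params
            global_theta_v[si] += 0.1 * adv               # update global value params

worker()
```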
SLIDE 20

Disadvantage of A3C

[Diagram: agents #1 and #2 perform steps and compute gradients on their own schedules and send them to the global network one at a time; each gradient may therefore have been computed from weights that are already outdated]

SLIDE 21

Synchronous version of A3C => A2C

[Diagram: all agents perform their steps in lockstep; their gradients are computed together and applied to the global network as one combined update]
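A minimal sketch of the synchronous variant's update step (names and learning rate are assumptions; the point is a single batched update built from all agents' gradients):

```python
import numpy as np

def a2c_update(global_params, agent_grads, lr=0.01):
    """Average the gradients from all agents, then update the global network once."""
    mean_grad = np.mean(agent_grads, axis=0)
    return global_params + lr * mean_grad

params = np.zeros(8)                             # stand-in global weights
grads  = [np.random.randn(8) for _ in range(4)]  # stand-in agent gradients
params = a2c_update(params, grads)
```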

SLIDE 22

Advantages of "Asynchronous methods"

  • Simple extension
  • Can be applied to a wide variety of algorithms
  • Makes robust NN training possible
  • Linear speedup
SLIDE 23

Advantages of "Asynchronous methods"

  • Simple extension
  • Can be applied to a wide variety of algorithms
  • Makes robust NN training possible
  • Linear speedup

[Chart: consumed data]