SLIDE 1

Exploration Conscious Reinforcement Learning Revisited

Lior Shani*, Yonathan Efroni*, Shie Mannor (Technion - Israel Institute of Technology)

SLIDE 2

Shani, Efroni & Mannor

I LOVE ε-GREEDY

Why?

12-Jun-19 Exploration Conscious Reinforcement Learning revisited

  • To learn a good policy, an RL agent must explore!
  • However, it can cause hazardous behavior during training.
SLIDE 3


I LOVE ε-GREEDY. Damn you, Exploration!

Why?


  • To learn a good policy, an RL agent must explore!
  • However, it can cause hazardous behavior during training.
SLIDE 4


  • Objective: Find the optimal policy knowing that exploration might occur
  • For example: ε-greedy exploration (α = ε)

π†πœ·

βˆ— ∈ argmaxπ†βˆˆπœ¬π”½ πŸβˆ’πœ· 𝝆+πœ·π†πŸ ෍ 𝒖=𝟏 ∞

πœΉπ’–π’” 𝒕𝒖, 𝒃𝒖

Exploration Conscious Reinforcement Learning


SLIDE 5


  • Objective: Find the optimal policy knowing that exploration might occur
  • For example: ε-greedy exploration (α = ε)

π†πœ·

βˆ— ∈ argmaxπ†βˆˆπœ¬π”½ πŸβˆ’πœ· 𝝆+πœ·π†πŸ ෍ 𝒖=𝟏 ∞

πœΉπ’–π’” 𝒕𝒖, 𝒃𝒖

  • Solving the Exploration-Conscious problem = Solving an MDP
  • We describe a bias-error sensitivity tradeoff in α

Exploration Conscious Reinforcement Learning


SLIDE 6


  • Objective: Find the optimal policy knowing that exploration might occur

Exploration Conscious Reinforcement Learning


I’m Exploration Conscious

SLIDE 7


Fixed Exploration Schemes (e.g. ε-greedy)


Choose Greedy Action a_greedy

  • a_greedy ∈ argmax_a Q^{π_α}(s, a)
SLIDE 8


Choose Greedy Action a_greedy → Draw Exploratory Action a_act

Fixed Exploration Schemes (e.g. ε-greedy)


  • a_greedy ∈ argmax_a Q^{π_α}(s, a)
  • For α-greedy: a_act = a_greedy w.p. 1 − α, otherwise a_act ∼ π⁰
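As a concrete sketch, the α-greedy draw above might look as follows in a tabular setting. The Q-table layout, the uniform exploration policy π⁰, and all names here are illustrative assumptions, not from the slides:

```python
import numpy as np

def alpha_greedy_action(Q, s, alpha, rng):
    """Draw (a_greedy, a_act) under alpha-greedy exploration.

    a_greedy maximizes Q(s, .); with probability 1 - alpha the agent
    acts greedily, otherwise a_act is drawn from a uniform pi^0.
    """
    a_greedy = int(np.argmax(Q[s]))
    if rng.random() < 1.0 - alpha:
        return a_greedy, a_greedy          # act greedily
    a_act = int(rng.integers(Q.shape[1]))  # exploratory draw from pi^0
    return a_greedy, a_act
```

Note that both actions are returned: the later slides show that an exploration-conscious learner uses the greedy action as well as the acted one.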

SLIDE 9


Choose Greedy Action a_greedy → Draw Exploratory Action a_act → Act a_act

Fixed Exploration Schemes (e.g. ε-greedy)


  • a_greedy ∈ argmax_a Q^{π_α}(s, a)
  • For α-greedy: a_act = a_greedy w.p. 1 − α, otherwise a_act ∼ π⁰

SLIDE 10


Choose Greedy Action a_greedy → Draw Exploratory Action a_act → Act a_act → Receive r, s′

Fixed Exploration Schemes (e.g. ε-greedy)


  • a_greedy ∈ argmax_a Q^{π_α}(s, a)
  • For α-greedy: a_act = a_greedy w.p. 1 − α, otherwise a_act ∼ π⁰

SLIDE 11


Choose Greedy Action a_greedy → Draw Exploratory Action a_act → Act a_act → Receive r, s′

Fixed Exploration Schemes (e.g. ε-greedy)


  • Normally used information: (s, a_act, r, s′)
SLIDE 12


Choose Greedy Action a_greedy → Draw Exploratory Action a_act → Act a_act → Receive r, s′

Fixed Exploration Schemes (e.g. ε-greedy)


  • Normally used information: (s, a_act, r, s′)
  • Using information about the exploration process: (s, a_greedy, a_act, r, s′)
SLIDE 13


Two Approaches – Expected approach


1. Update Q^{π_α}(s_t, a_t^act)

2. Expect that the agent might explore in the next state:

Q^{π_α}(s_t, a_t^act) += η [ r_t + γ 𝔼^{(1−α)π + απ⁰}[Q^{π_α}(s_{t+1}, a)] − Q^{π_α}(s_t, a_t^act) ]

SLIDE 14


Two Approaches – Expected approach


1. Update Q^{π_α}(s_t, a_t^act)

2. Expect that the agent might explore in the next state:

Q^{π_α}(s_t, a_t^act) += η [ r_t + γ 𝔼^{(1−α)π + απ⁰}[Q^{π_α}(s_{t+1}, a)] − Q^{π_α}(s_t, a_t^act) ]

  • Calculating expectations can be hard.
  • Requires sampling in the continuous case!
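Assuming π_α mixes a greedy policy with a uniform π⁰, the expected update above can be sketched like this in a discrete action space; the function name and Q-table layout are hypothetical:

```python
import numpy as np

def expected_update(Q, s, a_act, r, s_next, alpha, gamma, eta):
    """Expected-approach backup for Q^{pi_alpha}(s, a_act).

    The next-state value is the expectation of Q(s', .) under the
    mixture (1 - alpha) * greedy + alpha * uniform. In discrete
    spaces this is a closed-form sum over actions; in continuous
    spaces the expectation must itself be estimated by sampling.
    """
    greedy_val = np.max(Q[s_next])       # greedy part of the mixture
    uniform_val = np.mean(Q[s_next])     # uniform pi^0 part
    expected_next = (1.0 - alpha) * greedy_val + alpha * uniform_val
    Q[s, a_act] += eta * (r + gamma * expected_next - Q[s, a_act])
```

The closed-form mixture over `Q[s_next]` is exactly the piece that becomes intractable for continuous actions, which is the caveat the slide raises.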
SLIDE 15


  • Exploration is incorporated into the environment!

1. Update Q^{π_α}(s_t, a_t^greedy)

2. The reward and next state r_t, s_{t+1} are given by the acted action a_t^act:

Q^{π_α}(s_t, a_t^greedy) += η [ r_t + γ Q^{π_α}(s_{t+1}, a_{t+1}^greedy) − Q^{π_α}(s_t, a_t^greedy) ]

Two Approaches – Surrogate approach


SLIDE 16


  • Exploration is incorporated into the environment!

1. Update Q^{π_α}(s_t, a_t^greedy)

2. The reward and next state r_t, s_{t+1} are given by the acted action a_t^act:

Q^{π_α}(s_t, a_t^greedy) += η [ r_t + γ Q^{π_α}(s_{t+1}, a_{t+1}^greedy) − Q^{π_α}(s_t, a_t^greedy) ]

  • NO NEED TO SAMPLE!
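Because exploration is folded into the surrogate environment, the update reduces to a SARSA-style backup on the greedy actions, while r and s′ come from whatever action was actually executed. A minimal tabular sketch (names and table layout are illustrative assumptions):

```python
def surrogate_update(Q, s, a_greedy, r, s_next, a_greedy_next, gamma, eta):
    """Surrogate-approach backup for Q^{pi_alpha}(s, a_greedy).

    r and s_next were produced by the acted action a_act (possibly
    exploratory), but the entry updated and bootstrapped from is the
    greedy action, so no expectation over pi^0 is ever computed.
    """
    target = r + gamma * Q[s_next][a_greedy_next]
    Q[s][a_greedy] += eta * (target - Q[s][a_greedy])
```

Nothing here depends on the form of the exploration policy, which is why no sampling is needed.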

Two Approaches – Surrogate approach


SLIDE 17


Deep RL Experimental Results


[Plots: Training and Evaluation learning curves]

SLIDE 18


  • We define Exploration Conscious RL and analyze its properties.
  • Exploration Conscious RL can improve performance over both the training and evaluation regimes.

  • Conclusion: Exploration-Conscious RL, and specifically the Surrogate approach, can easily help to improve a variety of RL algorithms.

Summary


SLIDE 19


  • We define Exploration Conscious RL and analyze its properties.
  • Exploration Conscious RL can improve performance over both the training and evaluation regimes.
  • Conclusion: Exploration-Conscious RL, and specifically the Surrogate approach, can easily help to improve a variety of RL algorithms.

SEE YOU AT POSTER #90

Summary
