SLIDE 1

The Option-Critic Architecture

Pierre-Luc Bacon, Jean Harb, Doina Precup

Reasoning and Learning Lab McGill University, Montreal, Canada

AAAI 2017

SLIDE 2

Intelligence:

the ability to generalize and adapt efficiently to new and uncertain situations

  • Having good representations is key

“[...] solving a problem simply means representing it so as to make the solution transparent.”

— Simon, 1969

SLIDE 3

Reinforcement Learning: a general framework for AI

Equipped with a good state representation, RL has led to impressive results:

  • Tesauro’s TD Gammon (1995),
  • Watson’s Daily-Double Wagering in Jeopardy! (2013),
  • Human-level video game play in the Atari games (2013),
  • AlphaGo (2016)...

The ability to abstract knowledge temporally over many different time scales is still missing.

SLIDE 4

Temporal abstraction

Higher level steps: choosing the type of coffee maker, type of coffee beans

Medium level steps: grind the beans, measure the right quantity of water, boil the water

Lower level steps: wrist and arm movements while adding coffee to the filter, ...

SLIDE 5

Temporal abstraction in AI

A cornerstone of AI planning since the 1970’s:

  • Fikes et al. (1972), Newell (1972), Kuipers (1979), Korf (1985), Laird (1986), Iba (1989), Drescher (1991), etc.

It has been shown to:

  • Generate shorter plans
  • Reduce the complexity of choosing actions
  • Provide robustness against model misspecification
  • Improve exploration by taking shortcuts in the environment

SLIDE 6

Temporal abstraction in RL

Options (Sutton, Singh, Precup 2000) can represent courses of action at variable time scales:

[Figure: a trajectory over time, segmented into high-level options composed of low-level actions]

SLIDE 7

Options framework

An option ω is a triple:

  1. initiation set: Iω
  2. internal policy: πω
  3. termination condition: βω

Example

Robot navigation: if there is no obstacle in front (Iω), go forward (πω) until you get too close to another object (βω).

We can derive a policy over options πΩ that maximizes the expected discounted sum of rewards:

E[ ∑_{t=0}^{∞} γ^t r(st, at) | s0, ω0 ]
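To make the triple concrete, here is a minimal Python sketch of an option and its call-and-return execution: the `Option` container, `run_option`, the gym-style `env.step` interface, and the discount value are illustrative assumptions, not the paper's code.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative sketch: an option as a triple (I_w, pi_w, beta_w).
@dataclass
class Option:
    initiation: Callable[[object], bool]    # I_w: may the option start in state s?
    policy: Callable[[object], object]      # pi_w: internal policy, state -> action
    termination: Callable[[object], float]  # beta_w: probability of stopping in s

def run_option(env, state, option, rng, gamma=0.99):
    """Run the internal policy until beta_w fires (call-and-return execution).
    Assumes a gym-style env returning (state, reward, done, info)."""
    total, discount = 0.0, 1.0
    while True:
        state, reward, done, _ = env.step(option.policy(state))
        total += discount * reward
        discount *= gamma
        if done or rng.random() < option.termination(state):
            return state, total, done
```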
SLIDE 8

Contribution of this work

The problem of constructing/discovering good options has been a challenge for more than 15 years. Option-critic is a scalable solution to this problem:

  • Online, continual, and model-free (but models can be used if desired)
  • Requires no a priori domain knowledge, decomposition, or human intervention
  • Learns in a single task, at least as fast as other methods which do not use temporal abstraction

  • Applies to general continuous state and action spaces

SLIDE 9

Actor-Critic Architecture (Sutton 1984)

[Diagram: actor-critic loop; the actor (policy) emits at to the environment, the critic (value function) computes a TD error from rt, and the TD error drives gradient updates to the actor]

  • The policy (actor) is decoupled from its value function.
  • The critic provides feedback to improve the actor
  • Learning is fully online
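For reference, a one-step tabular actor-critic update might look like the sketch below (softmax actor; all names, shapes, and learning rates are illustrative assumptions):

```python
import numpy as np

def actor_critic_step(V, theta, s, a, r, s_next, gamma, alpha_v, alpha_pi):
    """One fully online actor-critic update: the critic's TD error
    improves both the value estimate and the actor."""
    td_error = r + gamma * V[s_next] - V[s]      # critic feedback
    V[s] += alpha_v * td_error                   # update the critic
    pi = np.exp(theta[s] - theta[s].max())       # softmax policy in state s
    pi /= pi.sum()
    grad_log = -pi                               # d log softmax / d preferences
    grad_log[a] += 1.0
    theta[s] += alpha_pi * td_error * grad_log   # update the actor
```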

SLIDE 10

Option-Critic Architecture

[Diagram: option-critic architecture; a policy over options πΩ selects ωt, the active option's πω emits at and βω decides termination, while the critic estimates QU and AΩ from rt and sends gradients and TD errors back to the options]

  • Parameterize internal policies and termination conditions
  • Policy over options is computed by a separate process
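That separate process can be as simple as ε-greedy over the option values; a sketch under that assumption (`q_omega` as a |S|×|Ω| array is illustrative):

```python
import numpy as np

def choose_option(q_omega, s, epsilon, rng):
    """Pick the next option: epsilon-greedy over Q_Omega(s, .)."""
    if rng.random() < epsilon:
        return int(rng.integers(q_omega.shape[1]))  # explore: random option
    return int(np.argmax(q_omega[s]))               # exploit: best option in s
```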

SLIDE 11

Main result: Gradient updates

  • The gradient w.r.t. the internal policy parameters θ is given by:

    E[ ∂ log πω,θ(a|s) / ∂θ · QU(s, ω, a) ]

  • This has the usual interpretation: take better primitives more often inside the option.
  • The gradient w.r.t. the termination parameters ν is given by:

    E[ −∂βω,ν(s′) / ∂ν · AπΩ(s′, ω) ]

    where AπΩ = QπΩ − VπΩ is the advantage function. This means that we want to lengthen options that have a large advantage.
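In code, the two updates might look like this sketch (tabular parameters, softmax internal policies, sigmoid terminations; array shapes and learning rates are assumptions):

```python
import numpy as np

def intra_option_policy_update(theta, s, w, a, q_u, lr):
    """Ascend E[ d log pi_{w,theta}(a|s)/d theta * Q_U(s,w,a) ]
    for a softmax internal policy with preferences theta[s, w]."""
    prefs = theta[s, w]
    pi = np.exp(prefs - prefs.max())
    pi /= pi.sum()
    grad_log = -pi                           # d log pi(a|s) / d preferences
    grad_log[a] += 1.0
    theta[s, w] += lr * q_u * grad_log       # take better primitives more often

def termination_update(nu, s_next, w, advantage, lr):
    """Ascend E[ -d beta_{w,nu}(s')/d nu * A(s',w) ]: a large advantage
    lowers beta, lengthening the option."""
    beta = 1.0 / (1.0 + np.exp(-nu[s_next, w]))            # sigmoid termination
    nu[s_next, w] -= lr * advantage * beta * (1.0 - beta)  # sigmoid gradient
```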

SLIDE 12

Results: Options transfer

[Figure: gridworld with walls and hallways, showing the initial goal and a goal moved at random after 1000 episodes]

SLIDE 13

Results: Options transfer

[Plot: steps per episode vs. episodes (up to 2000), comparing primitive-action baselines SARSA(0) and AC-PG against option-critic (OC) with 4 and 8 options; the goal moves randomly after episode 1000, where OC relies on the temporal abstractions it discovered]

  • Learning in the first task is no slower than with primitive actions
  • Once the goal is moved, learning is faster with the options

SLIDE 14

Results: Learned options are intuitive

Probability of terminating in a particular state, for each option: [Figure: termination probability maps for Options 1–4]

  • Terminations are more likely near hallways (although there are no pseudo-rewards provided)

SLIDE 15

Results: Nonlinear function approximation

[Diagram: last 4 frames → convolutional layers → shared representation, feeding three heads: internal policies, termination functions, and the policy over options]

Same architecture as DQN (Mnih et al., 2013) for the first four layers, but hybridized with options and the policy over them.
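A sketch of such a hybrid network, written in PyTorch purely for illustration (the trunk's layer sizes follow Mnih et al. (2013); the head names and shapes are assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn

class OptionCriticNet(nn.Module):
    """DQN-style shared trunk with three option-critic heads (illustrative)."""
    def __init__(self, n_options: int, n_actions: int):
        super().__init__()
        self.trunk = nn.Sequential(      # shared representation over the last 4 frames
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
        )
        self.q_omega = nn.Linear(256, n_options)         # values for the policy over options
        self.beta = nn.Linear(256, n_options)            # termination functions
        self.pi = nn.Linear(256, n_options * n_actions)  # internal policy logits

    def forward(self, frames: torch.Tensor):             # frames: (batch, 4, 84, 84)
        h = self.trunk(frames)
        return self.q_omega(h), torch.sigmoid(self.beta(h)), self.pi(h)
```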

SLIDE 16

Performance matching or better than DQN

[Plots: average score per training epoch (200 epochs) for Option-Critic vs. DQN on (a) Asterix, (b) Ms. Pacman, (c) Seaquest, (d) Zaxxon]

SLIDE 17

Interpretable and specialized options in Seaquest

[Figure: Seaquest action trajectory over time, colored by active option (white: option 1, black: option 2); option 1 specializes in downward shooting sequences and option 2 in upward shooting sequences, with visible transitions from option 1 to 2]

SLIDE 18

Conclusion

Our results seem to be the first to be:

  • fully end-to-end
  • within a single task
  • at a speed comparable to or better than methods using only primitive actions

Using ideas from policy gradient methods, option-critic:

  • provides continual option construction
  • can be used with nonlinear function approximators
  • can incorporate regularizers or pseudo-rewards easily

SLIDE 19

Future work

  • Learn initiation sets:
      ◮ Would require a new notion of stochastic initiation functions
  • More empirical results!

Try our code: https://github.com/jeanharb/option_critic
