Per-Decision Option Discounting
Anna Harutyunyan, Peter Vrancx, Philippe Hamel, Ann Nowe, Doina Precup
Motivation: Agents that reason over long temporal horizons
Horizon depends on discount γ
Larger grid requires a larger γ
Large values of γ are inefficient in practice :(
Temporal abstraction? Options still tied to γ!
Contribution: Generalize the options framework to let it extend the agent’s horizon.
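To make the link between grid size and γ concrete: a reward k steps away is weighted by γ^k, and a common rule of thumb puts the effective horizon at roughly 1/(1-γ) steps. The sketch below is only an illustration under that rule of thumb; it is not taken from the slides, and the function names (effective_horizon, gamma_for_weight) and the 0.5 weight threshold are hypothetical choices.

```python
def effective_horizon(gamma: float) -> float:
    """Rule-of-thumb horizon 1 / (1 - gamma) for a discount in [0, 1)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    return 1.0 / (1.0 - gamma)


def gamma_for_weight(steps: int, weight: float = 0.5) -> float:
    """Discount at which a reward `steps` away is weighted exactly `weight`.

    Solves gamma ** steps == weight; the 0.5 default is an arbitrary
    illustrative threshold, not a value from the presentation.
    """
    if steps < 1 or not 0.0 < weight < 1.0:
        raise ValueError("need steps >= 1 and 0 < weight < 1")
    return weight ** (1.0 / steps)


if __name__ == "__main__":
    # The farther away the goal, the closer gamma must be to 1.
    for steps in (10, 100, 1000):
        g = gamma_for_weight(steps)
        print(f"goal {steps:>4} steps away: gamma = {g:.4f}, "
              f"horizon ~ {effective_horizon(g):.0f} steps")
```

The required γ approaches 1 as the distance to the goal grows, which is the regime the slide calls inefficient in practice.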
Reward model: Transition model:
(1) decouple (2) per-decision
γp controls how much we care about option duration (pseudo-primitive when γp=1)
Key intuition: Insulate option time from global time
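The equations for the two models did not survive in this text, so the following is a sketch under stated assumptions rather than the authors' definitions. The classical part follows the standard options framework, where the reward model discounts each step inside the option by γ and the transition model discounts by γ^k for an option lasting k steps. The modified part is one possible reading of "decouple" and "per-decision": an inner discount γp applied to steps inside the option, and a separate discount applied once per option decision (called gamma_d here, a hypothetical name). The form gamma_d * gamma_p ** (k - 1) is an assumption chosen because it reduces to the classical case when both discounts equal γ, and to a duration-independent discount when γp=1.

```python
def classical_option_model(rewards, gamma):
    """Discounted reward and continuation discount for one option run.

    `rewards` holds the per-step rewards collected while the option ran,
    so the option lasted len(rewards) steps.
    """
    k = len(rewards)
    if k < 1:
        raise ValueError("an option must run for at least one step")
    reward = sum(gamma ** t * r for t, r in enumerate(rewards))
    continuation = gamma ** k  # shrinks with option duration
    return reward, continuation


def per_decision_option_model(rewards, gamma_p, gamma_d):
    """ASSUMED form of the modified models (not confirmed by the slides).

    gamma_p discounts steps inside the option; gamma_d is applied once
    per option decision. Both names are illustrative.
    """
    k = len(rewards)
    if k < 1:
        raise ValueError("an option must run for at least one step")
    reward = sum(gamma_p ** t * r for t, r in enumerate(rewards))
    continuation = gamma_d * gamma_p ** (k - 1)
    return reward, continuation


if __name__ == "__main__":
    run = [1.0] * 5  # a five-step option with reward 1 per step
    print("classical       :", classical_option_model(run, 0.9))
    print("gamma_p = 1     :", per_decision_option_model(run, 1.0, 0.9))
    print("gamma_p = gamma :", per_decision_option_model(run, 0.9, 0.9))
```

With gamma_p set to 1 the continuation discount no longer depends on how long the option ran, so the option is discounted like a single primitive action. This matches the "pseudo-primitive" remark and the idea of insulating option time from global time.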
Figure panels: Ours, Classical
Figure panels: Analytical variance bound, Empirical error (Four Rooms)
Larger γp can induce less variance!