AIXI: Universal Optimal Sequential Decision Making
Marcus Hutter (2005)
Reinforcement Learning
State space 𝒮, action space 𝒜, policy π, reward r(s, a).
Goal: find the policy π that maximizes the expected cumulative reward.
The value of a policy π in environment μ is its expected cumulative reward:

V_μ^π := E_μ^π [ Σ_{k=1}^m r_k ]

The true environment μ belongs to a model class ℳ = {ν_1, ν_2, ν_3, …} and is unknown to the agent.
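As a concrete illustration of expected cumulative reward, here is a minimal sketch (the two-state MDP, its transitions, and its rewards are invented for this example, not from the slides) that evaluates V^π exactly by recursion over a finite horizon:

```python
# Toy 2-state MDP (all numbers invented for illustration).
P = {  # P[(state, action)] -> list of (next_state, probability)
    (0, "stay"): [(0, 1.0)],
    (0, "go"):   [(1, 0.8), (0, 0.2)],
    (1, "stay"): [(1, 1.0)],
    (1, "go"):   [(0, 1.0)],
}
R = {(0, "stay"): 0.0, (0, "go"): 0.0, (1, "stay"): 1.0, (1, "go"): 0.0}

def V(state, policy, horizon):
    """Exact expected cumulative reward of `policy` over `horizon` steps."""
    if horizon == 0:
        return 0.0
    a = policy(state)
    return R[(state, a)] + sum(p * V(s2, policy, horizon - 1)
                               for s2, p in P[(state, a)])

go_then_stay = lambda s: "go" if s == 0 else "stay"
print(V(0, go_then_stay, 3))  # 0.8*2 + 0.2*0.8 = 1.76
```

The recursion just unrolls the expectation E[Σ r_k] term by term; an agent that also has to *learn* P and R is the harder problem the rest of the deck addresses.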
An environment μ must be chronologically consistent:

∀ a_{1:t} ∀ y_{1:t−1}:  μ_{t−1}(y_{1:t−1} | a_{1:t−1}) = Σ_{y_t ∈ 𝒴} μ_t(y_{1:t} | a_{1:t})

This is marginalization of μ_t over the current observation-reward y_t: the left side is conditioned on all actions up to t − 1, the right side on all actions up to t.
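A toy numeric check of this consistency condition (the two-action coin environment and its biases are invented): summing over the newest percept y_t recovers the probability of the prefix.

```python
def mu(ys, acts):
    """Probability of percept sequence ys given action sequence acts.
    Each percept y_k is a coin flip whose bias depends on action a_k."""
    bias = {"L": 0.3, "R": 0.7}      # invented action-dependent biases
    p = 1.0
    for y, a in zip(ys, acts):
        p *= bias[a] if y == 1 else 1.0 - bias[a]
    return p

acts = ("L", "R", "L")               # actions a_{1:3}
prefix = (1, 0)                      # percepts y_{1:2}
marginal = sum(mu(prefix + (y,), acts) for y in (0, 1))
print(abs(marginal - mu(prefix, acts[:2])) < 1e-12)  # True
```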
Each model ν ∈ ℳ is assigned a weight that represents the agent's confidence that ν is the true environment, i.e. its belief about the underlying environment: w_ν^0 > 0 for each ν ∈ ℳ, with Σ_{ν∈ℳ} w_ν^0 = 1. The Bayesian mixture predictor is

ξ(y_{1:t} | a_{1:t}) := Σ_{ν∈ℳ} w_ν^0 · ν(y_{1:t} | a_{1:t})
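A minimal sketch of the mixture idea, assuming an invented two-model class of Bernoulli percept predictors: ξ predicts with the weighted models, and the Bayesian posterior update shifts weight toward whichever model explains the observed percepts.

```python
# Invented model class: each model gives Pr(next percept = 1).
models = {"fair": 0.5, "biased": 0.9}
w = {name: 0.5 for name in models}   # prior weights w0, summing to 1

def xi_next(w):
    """Mixture probability that the next percept is 1."""
    return sum(w[m] * models[m] for m in models)

def update(w, y):
    """Posterior weights after observing percept y (Bayes' rule)."""
    post = {m: w[m] * (models[m] if y == 1 else 1 - models[m]) for m in models}
    z = sum(post.values())
    return {m: p / z for m, p in post.items()}

for y in [1, 1, 1, 1]:               # a run of 1s favours the biased model
    w = update(w, y)
print(w["biased"] > 0.9)             # True: belief concentrates on "biased"
```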
Kolmogorov complexity: the length of the shortest program p that specifies an object x when run on a universal Turing machine U:

K(x) := min_p { length(p) : U(p) = x }
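K(x) itself is incomputable, so practical systems bound it from above. A common proxy (my illustration, not from the slides) is compressed length: a highly regular string admits a much shorter description than a less regular one.

```python
import zlib

def compressed_len(s: bytes) -> int:
    """Upper-bound proxy for description length: zlib-compressed size."""
    return len(zlib.compress(s, 9))

regular = b"ab" * 500                 # 1000 bytes, trivially regular
irregular = bytes(range(256)) * 4     # 1024 bytes, far less regular
print(compressed_len(regular) < compressed_len(irregular))  # True
```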
The universal prior 2^(−K(ν)) is used to compute the mixture over all possible environments, weighting the value of policy π in each environment ν:

Υ(π) := Σ_{ν ∈ ℳ_U} 2^(−K(ν)) · V_ν^π
M. Hutter. Universal Algorithmic Intelligence: A Mathematical Top-Down Approach. In Artificial General Intelligence, Cognitive Technologies, pages 227-290. Springer, Berlin, 2007. ISBN 3-540-23733-X. URL http://www.hutter1.net/ai/aixigentle.htm.
AIXItl: a computable variant of AIXI that outperforms any other agent with the same time and space constraints.
MC-AIXI-CTW: approximates AIXI using a smaller model class of environments, with a surrogate for Kolmogorov complexity.
ρUCT: a variant of the classic selection-expansion-rollout-backprop MCTS algorithm that plans against the learned mixture ρ(·) instead of a simulator of the true environment. It requires: the number of possible actions; a policy that balances exploration and exploitation; a value estimate at each node; and the environment model, ρ(· | ha), that returns a percept conditioned on the history, which is then treated as if it were the percept actually received.
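The exploration-exploitation balance in this style of MCTS is typically the UCB rule: pick the action maximizing estimated value plus an exploration bonus. A minimal sketch (the statistics, action names, and constant C are invented for illustration):

```python
import math

def ucb_select(stats, C=1.4):
    """stats: action -> (visit_count, mean_value). Return the UCB-best action."""
    total = sum(n for n, _ in stats.values())
    def score(a):
        n, v = stats[a]
        if n == 0:
            return float("inf")       # unvisited actions are tried first
        return v + C * math.sqrt(math.log(total) / n)
    return max(stats, key=score)

stats = {"left": (10, 0.4), "right": (2, 0.5), "up": (0, 0.0)}
print(ucb_select(stats))              # "up": never tried, so explored first
```

Once every action has been sampled, the bonus term shrinks with visit count, so well-explored but mediocre actions gradually lose out to promising ones.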
Γ_D(S) = # nodes in the prediction suffix tree (PST) S: the complexity surrogate used by CTW, which weights each tree model by 2^(−Γ_D(S)).
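At its leaves, CTW predicts with the Krichevsky-Trofimov (KT) estimator; a minimal sketch of its sequential probability assignment for a binary sequence:

```python
def kt_prob(bits):
    """KT estimate of Pr(bits): sequentially Pr(next = 1) = (ones + 1/2)/(n + 1)."""
    p, zeros, ones = 1.0, 0, 0
    for b in bits:
        n = zeros + ones
        p *= (ones + 0.5) / (n + 1) if b == 1 else (zeros + 0.5) / (n + 1)
        zeros, ones = zeros + (1 - b), ones + b
    return p

print(kt_prob([0, 1]))  # 0.5 * 0.25 = 0.125
```

The add-half counts make the estimator robust on short samples, which matters because each PST node sees only the subsequence matching its context.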
Experiments: Cheese Maze and Partially Observable Pacman
Cheese Maze: the agent (a mouse) must find a piece of cheese while perceiving only the walls around its current cell, not its location or the layout of the maze.
Partially Observable Pacman: Pacman perceives only food in its direct line of sight.
A. K. McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester, 1996.
Universal reinforcement learning. IEEE Transactions on Information Theory, 56(5):2441-2454, May 2010.
Timeline
- Solomonoff Induction: Ray Solomonoff, 1960s
- Kolmogorov Complexity: Andrey Kolmogorov, 1963
- Context Tree Weighting: Willems, Shtarkov & Tjalkens, 1995
- AIXI: Marcus Hutter, 2005
- MCTS, "Bandit based MC Planning": Kocsis & Szepesvari, 2006
- AIXItl: Marcus Hutter, 2007
- MC-AIXI-CTW: Veness et al., 2010