SLIDE 1

AIXI: Universal Optimal Sequential Decision Making

Marcus Hutter (2005)

SLIDE 2

Reinforcement Learning

  • State space S, action space A, policy π, reward function R(s, a)
  • Goal: find a policy that maximizes expected cumulative reward.
  • Challenge: the environment the agent interacts with is unknown
  • The agent must explore and approximate the environment
  • It is hard to balance exploration against exploitation
  • AIXI: why approximate one environment? Consider them all!
SLIDE 3

Optimal Agents in Known Environments

  • 𝒜, 𝒪, ℛ = (action, observation, reward) spaces
  • a_k = action at time k; x_k = o_k r_k = percept at time k
  • Agent follows a policy π: (𝒜 × 𝒪 × ℛ)* → 𝒜
  • Environment reacts with μ: (𝒜 × 𝒪 × ℛ)* × 𝒜 → 𝒪 × ℛ (a minimal interaction loop is sketched below)
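
To make the interface concrete, here is a minimal Python sketch of this interaction loop; the particular policy and environment below are made-up stand-ins, not part of AIXI:

```python
import random

def policy(history):
    # pi: (A x O x R)* -> A ; stand-in: act uniformly at random
    return random.choice(["left", "right"])

def environment(history, action):
    # mu: (A x O x R)* x A -> O x R ; stand-in: a trivial stochastic world
    observation = random.choice([0, 1])
    reward = 1.0 if action == "right" else 0.0
    return observation, reward

history = []                      # the (a, o, r)* interaction string
for k in range(5):
    a = policy(history)
    o, r = environment(history, a)
    history.append((a, o, r))
print(history)
```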
SLIDE 4

Agent-Environment Visualization

SLIDE 5

Optimal Agents in Known Environments

  • Performance of π is its expected cumulative reward:

    $V_\mu^\pi := \mathbb{E}_\mu^\pi\left[\sum_{k=1}^{m} r_k\right]$

  • If μ is the true environment, the optimal policy is $\pi^\mu := \arg\max_\pi V_\mu^\pi$ (a brute-force illustration follows below)
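
As a toy illustration of these two definitions, the following sketch evaluates $V_\mu^\pi$ by brute force for a tiny assumed environment; rewards are taken to equal the percepts, and open-loop action sequences stand in for policies (both simplifying assumptions):

```python
import itertools

# Brute-force V_mu^pi for a tiny assumed environment mu, maximizing over
# open-loop action plans as a stand-in for the optimal policy. The
# environment, reward (= percept), and horizon are all toy assumptions.

ACTIONS, PERCEPTS, HORIZON = [0, 1], [0, 1], 2

def mu(x, history, action):
    # P(percept x | history, action): percept copies the action w.p. 0.8
    return 0.8 if x == action else 0.2

def value(history, plan):
    # expected cumulative reward of following `plan` from `history`
    if not plan:
        return 0.0
    a, rest = plan[0], plan[1:]
    return sum(mu(x, history, a) * (x + value(history + [(a, x)], rest))
               for x in PERCEPTS)

best = max(itertools.product(ACTIONS, repeat=HORIZON),
           key=lambda plan: value([], plan))
print(best, value([], best))   # (1, 1) with expected reward 1.6
```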

SLIDE 6

Definition of the Environment

  • An environment ρ is a sequence of conditional probability functions {ρ_0, ρ_1, ρ_2, …} and is unknown to the agent
  • Each element of the sequence satisfies the “chronological condition”:

    $\forall a_{1:n} \, \forall x_{1:n-1}: \quad \rho_{n-1}(x_{1:n-1} \mid a_{1:n-1}) = \sum_{x_n \in \mathcal{X}} \rho_n(x_{1:n} \mid a_{1:n})$

SLIDE 7

Definition of the Environment

βˆ€π‘@:Qβˆ€π‘¦@:QR@: 𝜍QR@(𝑦@:QR@ 𝑏@:QR@ = =

ST∈V

𝜍Q (𝑦@:Q|𝑏@:Q)

Marginalization of 𝜍Q over the current observation- reward Conditioned on all actions up to π‘œ βˆ’ 1 Conditioned

  • n all actions

up to π‘œ
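
A quick numerical check of the condition on a made-up chronological environment (the independence structure here is an assumption chosen to keep the example short):

```python
# Verify that marginalizing rho_n over the current percept x_n
# recovers rho_{n-1}, for an assumed toy environment.

PERCEPTS = [0, 1]

def rho(xs, acts):
    # rho_n(x_{1:n} | a_{1:n}): each percept matches its action w.p. 0.7,
    # independently of the past (a made-up environment)
    p = 1.0
    for x, a in zip(xs, acts):
        p *= 0.7 if x == a else 0.3
    return p

acts = [1, 0, 1]
lhs = rho([1, 0], acts[:2])                        # rho_{n-1}(x_{<n} | a_{<n})
rhs = sum(rho([1, 0, x], acts) for x in PERCEPTS)  # sum over current x_n
assert abs(lhs - rhs) < 1e-12
print(lhs, rhs)
```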

SLIDE 8

Dealing with the Unknown Environment

  • The idea is to maintain a mixture of environment models, where each model is assigned a weight representing the agent’s confidence that it is the true environment
  • As the agent gains experience, it updates the weights and thus its belief about the underlying environment
  • Reminiscent of a Bayesian agent
SLIDE 9

Mixture Model

  • ℳ ≜ {ρ_1, ρ_2, …, ρ_n} is a countable class of environments
  • w_0^ρ > 0 is the weight assigned to each ρ ∈ ℳ, with $\sum_{\rho \in \mathcal{M}} w_0^\rho = 1$
  • The mixture environment (sketched in code below) is

    $\xi(x_{1:n} \mid a_{1:n}) \triangleq \sum_{\rho \in \mathcal{M}} w_0^\rho \, \rho(x_{1:n} \mid a_{1:n})$
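
The mixture and its Bayesian weight update can be sketched directly from these definitions; the two-model class below is an assumed toy example:

```python
# Sketch of the mixture environment xi and the Bayesian weight update.
# The two-environment class is a made-up assumption.

class Bernoulli:
    def __init__(self, p):
        self.p = p
    def prob(self, x, history, action):   # rho(x | ...); action ignored here
        return self.p if x == 1 else 1 - self.p

models = [Bernoulli(0.2), Bernoulli(0.9)]
weights = [0.5, 0.5]                      # prior w_0^rho, sums to 1

def xi(x, history, action):
    return sum(w * m.prob(x, history, action)
               for w, m in zip(weights, models))

for x in [1, 1, 0, 1, 1]:                 # observed percepts
    total = xi(x, [], None)
    weights = [w * m.prob(x, [], None) / total
               for w, m in zip(weights, models)]
print(weights)   # mass shifts toward the model that explains the data
```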

SLIDE 10

Selecting a Universal Prior

  • Occam’s Razor: the simplest explanation is the most likely
  • Formalized as Kolmogorov Complexity

SLIDE 11

Kolmogorov Complexity

  • The length of the shortest program on a universal Turing machine which specifies an object
  • In our domain: the shortest program which produces environment ρ

    $K(\rho) := \min_p \{\, \ell(p) : U(p) = \rho \,\}$

  • Advantage: completely independent of prior assumptions
  • Problem: incomputable due to the halting problem
  • A naïve search over all inputs would include programs with infinite loops (the toy search below sidesteps this by bounding program length)
  • Berry-style paradox: “the shortest object not describable in fewer than N bits” would itself be a description of fewer than N bits
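
For intuition only, here is a brute-force “shortest program” search over a toy space of short arithmetic expressions, with `eval` standing in for the universal machine $U$; true $K(\rho)$ is incomputable, and this search terminates only because the toy space is finite:

```python
import itertools, string

# Illustration only: brute-force "shortest program" search over a toy
# program space. All of this is an assumed simplification of K.

def toy_K(target, max_len=5):
    alphabet = string.digits + "+*"
    for n in range(1, max_len + 1):                 # shortest length first
        for prog in itertools.product(alphabet, repeat=n):
            src = "".join(prog)
            try:
                if eval(src) == target:             # "run" the program
                    return n, src
            except Exception:                       # invalid toy program
                continue
    return None

print(toy_K(1048576))   # finds the 5-symbol program "2**20", not the decimal
```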
SLIDE 12

Solomonoff Prior

  • Key idea: use the inverse of Kolmogorov Complexity as the environmental prior to compute a mixture over all possible environments:

    $\Upsilon(\pi) = \sum_{\rho \in \mathcal{M}} 2^{-K(\rho)} \, V_\rho^\pi$

  • Υ(π) measures the agent’s ability to perform across all possible environments (toy rendering below)
  • Hutter describes Υ(π) as Universal Intelligence
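
A toy rendering of this measure, with made-up complexities and values standing in for $K(\rho)$ and $V_\rho^\pi$:

```python
# Toy universal-intelligence score: a 2^(-K) weighted sum of the agent's
# value over a tiny enumerable class. Names, complexities, and values
# are all assumptions for illustration.

# (environment, stand-in K(rho) in bits, stand-in V_rho^pi)
envs = [("all-zeros", 2, 1.0), ("alternating", 5, 0.8), ("noisy-maze", 9, 0.3)]

upsilon = sum(2 ** (-K) * V for _, K, V in envs)
print(upsilon)   # performance in simple environments dominates the score
```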
SLIDE 13

AIXI

  • Expectimax over the Solomonoff prior (see the expression below)
  • ℳ is the class of all chronologically conditional environments
  • Converges to an agent acting with knowledge of the true environment
  • This convergence is mathematically proven
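
For reference, the expectimax expression this slide alludes to can be written (in the notation used by Veness et al.) as:

$a_t = \arg\max_{a_t} \sum_{x_t} \cdots \max_{a_m} \sum_{x_m} \left[\sum_{k=t}^{m} r_k\right] \sum_{q \,:\, U(q,\, a_{1:m}) \,=\, x_{1:m}} 2^{-\ell(q)}$

where $U$ is a universal Turing machine, $q$ ranges over its programs, and $\ell(q)$ is program length.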
SLIDE 14

Evaluation: Pros and Cons

  • Theoretically optimal decision making
    • Proven to converge to the optimal agent acting in the true environment
  • Universal
    • Prior completely independent of the actual environment’s behavior
    • “Reduces any conceptual AI problem to a computation problem”
  • Incomputable and intractable
    • Kolmogorov Complexity cannot be computed
  • Reward function?
    • Unclear how to define a reward function which is also independent of the problem
SLIDE 15

Related Works: Approximations

  • Work on AIXI has mainly focused on approximating the theoretical framework.
  • AIXItl
    • Marcus Hutter. Universal algorithmic intelligence: A mathematical top→down approach. In B. Goertzel and C. Pennachin, editors, Artificial General Intelligence, Cognitive Technologies, pages 227–290. Springer, Berlin, 2007. ISBN 3-540-23733-X. URL http://www.hutter1.net/ai/aixigentle.htm.
    • Summary: a computable approximation of AIXI that provably performs at least as well as any other RL agent under the same time and space constraints.
  • MC-AIXI (next!)
    • Summary: a Monte Carlo approximation of AIXI.
SLIDE 16

MC-AIXI CTW

  • “Monte Carlo AIXI with Context Tree Weighting”
  • Veness et al., 2011
  • Addresses the two main barriers to applying AIXI:
    1. Expectimax is intractable → estimate it with Monte Carlo tree search (MCTS)
    2. Kolmogorov Complexity is incomputable → replace the universe of all environments with a smaller model class that carries a surrogate measure of complexity

SLIDE 17

Part 1: MCTS

  • ρUCT is used to estimate the AIXI expectimax by adapting the classic selection-expansion-rollout-backpropagation MCTS algorithm
  • Decision node (circle):
    • Contains a history h and a value-function estimate $\hat{V}(h)$
    • Has children (called “chance nodes”) corresponding to the possible actions
    • An action a is selected by the UCB action-selection policy, which balances exploration and exploitation (see the sketch below)
  • Chance node (star):
    • Follows a decision node
    • Contains the history ha; an estimate of the future value, $\hat{V}(ha)$; and the environment model ρ(· | ha), which returns a percept conditioned on the history
    • A new child of the chance node is added when a new percept is received
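
A minimal sketch of the UCB selection step at a decision node; the node layout and exploration constant are illustrative assumptions, not the exact construction of Veness et al.:

```python
import math, random
from dataclasses import dataclass

# UCB action selection at a rhoUCT decision node (assumed node layout).

@dataclass
class ChanceNode:
    visits: int = 0
    value_sum: float = 0.0

C = math.sqrt(2)   # exploration constant (an assumed choice)

def select_action(children):
    """children: dict mapping each action to its ChanceNode child."""
    unvisited = [a for a, c in children.items() if c.visits == 0]
    if unvisited:
        return random.choice(unvisited)   # try every action at least once
    total = sum(c.visits for c in children.values())
    def ucb(c):   # exploitation term + exploration bonus
        return c.value_sum / c.visits + C * math.sqrt(math.log(total) / c.visits)
    return max(children, key=lambda a: ucb(children[a]))

# usage: after a few simulations, the higher-value arm is preferred
kids = {"left": ChanceNode(3, 1.2), "right": ChanceNode(5, 4.0)}
print(select_action(kids))
```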

SLIDE 18

Part 2: Approximating the Solomonoff Prior

  • The Solomonoff prior $\sum_\rho 2^{-K(\rho)}$ is incomputable
  • Solution: replace it with a smaller class of environments
  • Variable-order Markov processes:
    • Compute the probability of the next observation from the last k observations
    • Replace the entire universe of environments with a mixture of Markov processes
SLIDE 19

Prediction Suffix Tree

  • A representation of a sequence of binary events
  • Able to encode every variable-order Markov model up to depth D
  • Represents a space of $2^{2^D}$ models (a minimal sketch follows below)
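
A minimal sketch of depth-D context prediction with KT-estimator counts; storing every context in a flat table is a simplifying assumption (a real PST keeps only the suffixes it needs):

```python
# Depth-D context prediction with Krichevsky-Trofimov counts.

D = 2

def kt_prob(zeros, ones, bit):
    # KT estimate of the next bit given the counts seen in this context
    return ((ones if bit == 1 else zeros) + 0.5) / (zeros + ones + 1)

counts = {}                                   # context -> (zeros, ones)
history = [0, 1, 1, 0, 1, 1, 0, 1]
for i in range(D, len(history)):
    ctx = tuple(history[i - D:i])
    z, o = counts.get(ctx, (0, 0))
    counts[ctx] = (z + (history[i] == 0), o + (history[i] == 1))

z, o = counts.get(tuple(history[-D:]), (0, 0))
print(kt_prob(z, o, 1))   # P(next bit = 1 | context (0, 1)) = 0.833...
```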
SLIDE 20

Context Tree Weighting

  • Provides a method to evaluate PSTs in linear time
  • Naively computable in $O(2^{2^D})$; the CTW algorithm reduces this to $O(D)$
  • Smaller trees represent simpler Markov models
  • Evaluates the prior probability under Occam’s razor via the size of the tree: $\Gamma_D(M)$ = number of nodes in the PST $M$
  • Replace the Kolmogorov prior with this CTW prior (see the recursion sketched below)
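
The heart of CTW is a one-line recursion: a node’s weighted probability mixes its local KT estimate with the product of its children’s weighted probabilities. The sketch below assumes a fixed tree of 0/1 counts and omits incremental updating:

```python
# Context-tree weighting recursion over a small assumed tree of counts.

def kt(zeros, ones):
    # KT block probability of observing `zeros` 0s and `ones` 1s
    p, z, o = 1.0, 0, 0
    for _ in range(zeros):
        p *= (z + 0.5) / (z + o + 1); z += 1
    for _ in range(ones):
        p *= (o + 0.5) / (z + o + 1); o += 1
    return p

def ctw(node, depth):
    # node: (zeros, ones, child0, child1); None = no data (prob 1.0)
    if node is None:
        return 1.0
    zeros, ones, c0, c1 = node
    p_local = kt(zeros, ones)
    if depth == 0:
        return p_local
    return 0.5 * p_local + 0.5 * ctw(c0, depth - 1) * ctw(c1, depth - 1)

leaf = (3, 1, None, None)
root = (4, 2, leaf, (1, 1, None, None))   # children's counts sum to root's
print(ctw(root, 2))
```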
SLIDE 21

Context Tree Weighting: Updated Formula

  • Original intractable prior: $\xi(x_{1:n} \mid a_{1:n}) = \sum_{\rho} 2^{-K(\rho)} \, \rho(x_{1:n} \mid a_{1:n})$
  • MC-AIXI with CTW: $\xi(x_{1:n} \mid a_{1:n}) = \sum_{M} 2^{-\Gamma_D(M)} \Pr(x_{1:n} \mid M, a_{1:n})$
SLIDE 22

Algorithm Performance

  • The agent must navigate to a piece of cheese
  • −1 for entering an open cell
  • −10 for hitting a wall
  • +10 for finding the cheese
  • The agent is unaware of the monsters’ locations and of the maze layout
  • It can only “smell” food and observe food in its direct line of sight

[Figures: the Cheese Maze and Partially Observable Pac-Man domains]

SLIDE 23

Performance on Cheese Maze

SLIDE 24

Performance on PO-Pacman

SLIDE 25

Related Work

  • Andrew Kachites McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester, 1996 ⟶ “Utile Suffix Memory”
  • V. F. Farias, C. C. Moallemi, B. Van Roy, and T. Weissman. Universal reinforcement learning. IEEE Transactions on Information Theory, 56(5):2441–2454, May 2010 ⟶ “Active-LZ”

SLIDE 26

Timeline

  • Solomonoff Induction: Ray Solomonoff, 1960s
  • Kolmogorov Complexity: Andrey Kolmogorov, 1963
  • Context Tree Weighting: Willems, Shtarkov & Tjalkens, 1995
  • AIXI: Marcus Hutter, 2005
  • MCTS (“Bandit Based Monte-Carlo Planning”): Kocsis & Szepesvári, 2006
  • AIXItl: Marcus Hutter, 2007
  • MC-AIXI-CTW: Veness et al., 2010

SLIDE 27

MC-AIXI-CTW Playing Pac-Man

  • jveness.info/publications/pacman_jair_2010.wmv