AIXI Tutorial Part II
Intuitions, Approximations, and the Real World™
John Aslanides and Tom Everitt
July 10, 2018
1. Short Recap
2. Approximations
3. (Break)
4. Variants of AIXI
AIXI [1] proposes an answer to the following question: What is optimal behavior in general unknown environments?
In this part we’ll give some scaled-down examples and conceptual intuitions about what this means.
These slides can be found at aslanides.io/docs/aixi_tutorial.pdf.
The environment is an unknown, non-ergodic, partially observable process. Notation:

Symbol               Description    Example
a ∈ A                Action         {↑, ↓, ←, →, …}, ℕ, …
o ∈ O                Observation    ℝ^N, B*, …
r ∈ R                Reward         ℝ, ℤ, …
e ∈ E                Percept        O × R (definition)
µ ∈ M                Environment    gridworld, robotics, …
π ∈ Δ(A)             Policy         ε-greedy, random, …
æ_{<t} ∈ (A × E)*    History        a_1 o_1 r_1 … a_{t−1} o_{t−1} r_{t−1}
Agent and environment interact using the standard RL setup: at each time step the agent sends an action a_t to the environment, and the environment responds with a percept e_t.

[Diagram: Agent ⇄ Environment, exchanging a_t and e_t]
The optimal state-action value in environment µ at time t given history æ_{<t} is

    Q*_µ(a_t | æ_{<t}) = sup_π E_µ[ Σ_{k=t}^∞ γ_k r_k | π, æ_{<t} a_t ],

and the optimal value is

    V*_µ(æ_{<t}) = max_{a_t ∈ A} Q*_µ(a_t | æ_{<t}).

The optimal policy is greedy, breaking ties at random:

    π*_µ(a_t | æ_{<t}) = arg max_a Q*_µ(a | æ_{<t}).
Written out explicitly, the optimal value in environment µ at time t given history æ_{<t} is given by

    V*_µ(æ_{<t}) = lim_{m→∞} max_{a_t} Σ_{e_t} ··· max_{a_{t+m}} Σ_{e_{t+m}} [ Σ_{k=t}^{t+m} γ_k r_k ] Π_{j=t}^{t+m} µ(e_j | æ_{<j} a_j),

where
- Π_j µ(e_j | æ_{<j} a_j) is the likelihood of percepts e_{t:k} given the action sequence a_{1:k},
- Σ_k γ_k r_k is the discounted return realized by the trajectory e_{t:t+m},
- the alternating max/Σ structure is an expectimax up to horizon m.
Truncating gives the optimal value up to horizon m:

    V*_{µ,m}(æ_{<t}) = max_{a_t} Σ_{e_t} ··· max_{a_{t+m}} Σ_{e_{t+m}} [ Σ_{k=t}^{t+m} γ_k r_k ] Π_{j=t}^{t+m} µ(e_j | æ_{<j} a_j).
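To make this concrete, here is a minimal recursive expectimax sketch. It assumes geometric discounting (γ_k = γ^k) and a hypothetical `env` interface exposing `actions` and a `percepts(history, a)` enumerator of (percept, probability, reward) triples; none of these names come from a particular implementation.

```python
def expectimax(env, history, m, gamma=0.99):
    """Naive computation of V*_{mu,m}: alternate a max over actions
    with an expectation over percepts, down to depth m.
    Cost is O((|A|*|E|)^m), which is why we approximate with MCTS."""
    if m == 0:
        return 0.0
    best = float("-inf")
    for a in env.actions:  # max over actions
        value = 0.0
        for percept, prob, reward in env.percepts(history, a):  # expectation over percepts
            future = expectimax(env, history + [(a, percept)], m - 1, gamma)
            value += prob * (reward + gamma * future)
        best = max(best, value)
    return best
```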
We can approximate the expectimax computation of V*_{µ,m} with a variant of Monte-Carlo Tree Search (MCTS). Example use: playing Chess, Go, and Shogi (AlphaZero) [2].
Algorithm: ρUCT [3], an extension of UCT [4] to histories.
Idea: only expand subtrees that show promising rewards and/or high uncertainty.
Trade off reward against uncertainty using a tree-based variant of the UCB algorithm [5]:

    a_UCT ∈ arg max_{a ∈ A} [ Q̂(a | æ_{<t}) + C √( log T(æ_{<t}) / T(æ_{<t} a) ) ],

where T(·) is the number of times a sequence has been visited.
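As a rough illustration, the tree-policy action selection might look like the sketch below; `node.visits` (standing in for T(æ_{<t})), `node.q[a]` (the normalized value estimate Q̂) and `node.n[a]` (T(æ_{<t}a)) are hypothetical names, not from a specific ρUCT implementation.

```python
import math
import random

def ucb_action(node, actions, C=1.0):
    """UCB1-style selection: prefer actions with high estimated value
    or few visits, expanding unvisited actions first."""
    unvisited = [a for a in actions if node.n[a] == 0]
    if unvisited:
        return random.choice(unvisited)
    return max(
        actions,
        key=lambda a: node.q[a] + C * math.sqrt(math.log(node.visits) / node.n[a]),
    )
```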
The agent doesn’t know µ a priori. Recall the incomputable Solomonoff mixture

    M(e_{<t} | a_{<t}) = Σ_{p : p(a_{<t}) = e_{<t}} 2^{−ℓ(p)}.

Instead, introduce a finite model class M:

    ξ(e_t | æ_{<t} a_t) = Σ_{ν ∈ M} w_ν ν(e_t | æ_{<t} a_t).

Update the posterior w_ν with Bayes’ rule:

    w_ν ← (ν(e_t) / ξ(e_t)) w_ν   ∀ν ∈ M.

For very small M we can compute this exactly (see the sketch below). Let’s look at this with some toy examples.
Consider a class of gridworlds:
- The world is a procedurally generated N × N maze.
- The agent is a robot with A = {←, →, ↑, ↓, ∅}.
- The grey tiles are walls that yield −5 reward if hit.
- The white tiles are empty, but moving costs −1.
- The orange circle looks like an empty tile, but randomly dispenses +100 each step with some fixed probability θ.
- The agent has O(N²) steps to live, e.g. 200 steps on a 10 × 10 grid.
- The observations consist of just four bits, O = B⁴.
This is a stochastic & partially observable environment with simple & easy-to-understand dynamics [3].
Let the agent know:
- the maze layout,
- the dispenser probability θ,
- the environment dynamics.

Let it be uncertain about where the only dispenser is:

    M = {Gridworld with dispenser at (x, y)}_{(x,y)=(1,1)}^{(N,N)}

There are at most |M| ≤ N² ‘legal’ dispenser positions. Let the agent have a uniform prior w_ν = |M|⁻¹ ∀ν ∈ M. Each ν is a complete gridworld simulator, and µ ∈ M; a construction sketch follows.
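For instance, the class and prior might be built as in this sketch; `Gridworld` stands in for a complete simulator class and is an assumption of the example, not a real library:

```python
def make_model_class(maze, theta):
    """Enumerate M: one gridworld simulator per legal (non-wall)
    dispenser position, with a uniform prior over M."""
    positions = [(x, y)
                 for x, row in enumerate(maze)
                 for y, tile in enumerate(row)
                 if tile != "wall"]                    # |M| <= N^2
    # `Gridworld` is an assumed simulator class.
    models = [Gridworld(maze, dispenser=pos, theta=theta) for pos in positions]
    weights = [1.0 / len(models)] * len(models)        # w_nu = |M|^-1
    return models, weights
```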
Enough talk. Let’s see an example!
What did we just see? Let’s visualize the agent’s uncertainty as it learns. Initially, the agent has a uniform prior, shown in green.
Let’s visualize the agent’s uncertainty as it learns. After exploring a little, the agent’s beliefs have changed. Lighter green corresponds to less probability mass.
Let’s visualize the agent’s uncertainty as it learns. After discovering the dispenser, the agent’s posterior concentrates on µ. This concentration is immediate: a global ‘collapse’.
The previous model class was limited. Here’s a more interesting one: model each tile independently with a categorical distribution over tile types, under a Dirichlet prior:

    ρ(e_t | …) = Dirichlet(p | α_{s′})

The joint distribution factorizes over the grid, so the agent learns about state dynamics only locally, rather than globally. Using this model, the agent is uncertain about:
- the maze layout,
- the location, number, and payout probabilities θ_i of the dispenser(s).
A per-tile update sketch follows.
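A sketch of the per-tile conjugate update; the tile-type alphabet and function name are illustrative:

```python
def dirichlet_update(alpha, outcome):
    """Update one tile's Dirichlet pseudo-counts after observing
    `outcome` (e.g. "wall", "empty", "dispenser") and return the
    posterior predictive over tile types. Each tile keeps its own
    alpha, so the joint factorizes over the grid and learning stays
    local to the tiles the agent actually visits."""
    alpha = dict(alpha)                         # copy: functional update
    alpha[outcome] = alpha.get(outcome, 0.0) + 1.0
    total = sum(alpha.values())
    predictive = {k: v / total for k, v in alpha.items()}
    return alpha, predictive
```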
What did we just see? Let’s visualize the agent’s uncertainty as it learns. Initially the agent knows nothing about the layout. There are two dispensers, visualized for our benefit.
Let’s visualize the agent’s uncertainty as it learns. Tiles that the agent knows are walls are shown in blue. Purple tiles show the agent’s belief about θ.
Let’s visualize the agent’s uncertainty as it learns. Note: the smaller dispenser has a lower θ than the larger one. The agent explores efficiently and learns quickly.
Let’s visualize the agent’s uncertainty as it learns. Even so, the agent settles for a locally optimal policy. Due to its short horizon m, it can’t see the value in exploring further.
Here we see the classic exploration/exploitation dilemma. Bayesian agents are not immune to this! The choices of:
- model class,
- priors,
- discount function,
- planning horizon
are all significant! Corollary: AIξ is not asymptotically optimal.
We’ve demonstrated Bayesian RL on gridworlds using very domain-oriented model classes. Is there something more general that is still tractable? Yes! The Context-Tree Weighting (CTW) algorithm:
- A data compressor with good theoretical guarantees.
- Mixes over all Markov models of order < k (in bits).
- Automatically weights models by complexity (tree depth).
- Model updates in time linear in k.
- Based on the KT estimator (similar to a Beta distribution); see the sketch below.
- Can model any sequential density up to a given finite context/history length.
- Learns to play PacMan, Tic-Tac-Toe, Kuhn Poker, and Rock/Paper/Scissors tabula rasa [3].
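For intuition, the KT estimator at the heart of CTW is tiny. A sketch for a single binary context:

```python
def kt_predict(zeros, ones):
    """Krichevsky-Trofimov estimate that the next bit is 1, given the
    counts seen so far in this context; it is the posterior predictive
    of a Beta(1/2, 1/2) prior."""
    return (ones + 0.5) / (zeros + ones + 1.0)

def kt_block_prob(bits):
    """Sequential KT probability of a whole bit string."""
    prob, zeros, ones = 1.0, 0, 0
    for b in bits:
        p1 = kt_predict(zeros, ones)
        prob *= p1 if b else (1.0 - p1)
        zeros, ones = zeros + (b == 0), ones + (b == 1)
    return prob
```

CTW then mixes these per-context estimates over all prunings of a depth-k context tree.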
We’ll discuss several variants of AIXI and their links with ‘model-free’/‘deep RL’ algorithms:
- MDL Agent
- Thompson Sampling
- Knowledge-Seeking Agents
- BayesExp
Minimum Description Length (MDL) principle: prefer simple models. Another take on the ‘Occam principle’:

    ρ = arg min_{ν ∈ M} [ K(ν) − λ Σ_{k=1}^{t} log ν(e_k | æ_{<k} a_k) ].

In deterministic environments: “use the simplest yet-unfalsified hypothesis”.
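Model selection under this criterion is then a one-liner, assuming each ν exposes hypothetical `complexity()` (a stand-in for K(ν), e.g. code length in bits) and `logprob(history)` (the summed log-likelihood) methods:

```python
def mdl_select(models, history, lam=1.0):
    """Pick the simplest model that still explains the data:
    argmin_nu [ K(nu) - lambda * sum_k log nu(e_k | ae_<k a_k) ]."""
    return min(models,
               key=lambda nu: nu.complexity() - lam * nu.logprob(history))
```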
Recall that the Bayes-optimal agent (AIξ) maximizes the ξ-expected return:

    a^{AIξ} = arg max_a Q*_ξ(a | æ_{<t}) = arg max_a max_π E^π_ξ [ Σ_{k=t}^∞ γ_k r_k ].

Idea: instead of maximizing the ξ-expected return,
- maximize the ρ-expected return, with ρ drawn from the posterior w(· | æ_{<t}), and
- resample ρ every ‘effective horizon’ given by the discount γ.

This has good regret guarantees in finite MDPs [1] and is asymptotically optimal in general environments [2].

Intuition: sampling ‘commits’ the agent to a given belief/policy for a significant amount of time; this encourages ‘deep’ exploration. A sketch of the agent loop follows.
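A sketch of the resulting agent loop, reusing the `bayes_update` sketch from earlier; `plan(rho, history)` stands in for a ρ-optimal planner (e.g. expectimax or MCTS against ρ) and `horizon(t)` for the effective horizon of the discount γ, all assumed interfaces:

```python
import random

def thompson_agent(models, weights, env, plan, horizon, steps):
    """Thompson sampling over a finite model class: draw rho from the
    posterior, act greedily w.r.t. rho for one effective horizon, then
    resample. The commitment to rho is what drives deep exploration."""
    history, t = [], 0
    while t < steps:
        rho = random.choices(models, weights=weights)[0]  # rho ~ w(.|ae_<t)
        for _ in range(horizon(t)):                       # commit to rho
            a = plan(rho, history)
            percept = env.step(a)                         # e_t = (o_t, r_t)
            _, weights = bayes_update(models, weights, history, a, percept)
            history.append((a, percept))
            t += 1
    return history
```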
‘Deep RL’ version: Deep Exploration via Bootstrapped DQN [2]. Idea:
- Maintain an ensemble of value functions {Q_k(s, a)}.
- Train these with e.g. DQN, using the statistical bootstrap.
- Thompson sampling: draw a Q-function at random each episode and follow the greedy policy.
- Exhibits much better exploration properties than many alternatives.
It has long been thought that some form of intrinsic motivation, surprise, or curiosity is necessary for effective exploration and learning [5]. Knowledge-seeking agents (KSA) take this to the extreme:
- Fully unsupervised (no extrinsic rewards).
- The utility function depends on the agent’s beliefs about the world.
- Exploration ≡ Exploitation.

Two forms (both sketched below):
- Shannon KSA (“surprise”): U(e_t | æ_{<t} a_t) = − log ξ(e_t | æ_{<t} a_t)
- Kullback-Leibler KSA (“information gain”): U(e_t | æ_{<t} a_t) = Ent(w | æ_{<t} a_t) − Ent(w | æ_{1:t})
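Both utilities are short to write down. A sketch, where `xi_prob` is the mixture likelihood of the observed percept and the weight vectors are the posterior before and after the Bayes update:

```python
import math

def shannon_utility(xi_prob):
    """Shannon KSA ("surprise"): U = -log xi(e_t | ae_<t a_t); high for
    percepts the mixture finds improbable."""
    return -math.log(xi_prob)

def kl_utility(w_before, w_after):
    """KL KSA ("information gain"): entropy of the posterior weights
    before seeing e_t minus after."""
    def ent(w):
        return -sum(p * math.log(p) for p in w if p > 0.0)
    return ent(w_before) - ent(w_after)
```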
Kullback-Leibler (“information-seeking”) is superior to Shannon & Rényi (“entropy-seeking”).
‘Deep RL’ version: Variational Information Maximization for Exploration (VIME) [1]. Idea:
- Learn a forward dynamics model in tandem with model-free RL.
- Use a variational approximation to compute the information gain in closed form.
- Use this as an ‘exploration bonus’, or intrinsic reward.
Downside: only works well when learning from ‘states’, not pixels (wrong loss).
Combine the best of both worlds:
- a Bayes-optimal reinforcement learner (AIξ), with
- information-seeking (KL-KSA).
Idea: switch between the RL and KSA policies depending on the relative sizes of V_KSA and V_RL, as in the sketch below.
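A sketch of the switching rule; the value estimators, planners, and threshold ε are assumed interfaces, and the precise switching condition varies across formulations of BayesExp:

```python
def bayesexp_action(history, v_rl, v_ksa, plan_rl, plan_ksa, epsilon):
    """BayesExp-style switching: follow the information-seeking
    (KL-KSA) policy while exploration still looks valuable relative
    to exploitation; otherwise follow the Bayes-optimal RL policy."""
    if v_ksa(history) > epsilon * v_rl(history):  # much left to learn
        return plan_ksa(history)
    return plan_rl(history)                       # beliefs settled: exploit
```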
References

Marcus Hutter (2005): Universal Artificial Intelligence.
David Silver et al. (2017): Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm.
Joel Veness et al. (2011): A Monte-Carlo AIXI Approximation.
Levente Kocsis and Csaba Szepesvári (2006): Bandit Based Monte-Carlo Planning.
Peter Auer (2002): Using Confidence Bounds for Exploitation-Exploration Trade-offs.
Shipra Agrawal and Randy Jia (2017): Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds.
Jan Leike et al. (2016): Thompson Sampling is Asymptotically Optimal in General Environments.
John Aslanides, Jan Leike, and Marcus Hutter (2017): Universal Reinforcement Learning Algorithms: Survey and Experiments.
Ian Osband, John Aslanides, and Albin Cassirer (2018): Randomized Prior Functions for Deep Reinforcement Learning.
Jürgen Schmidhuber (2008): Driven by Compression Progress.
Rein Houthooft et al. (2016): VIME: Variational Information Maximization for Exploration.