AIXI Tutorial Part II, John Aslanides and Tom Everitt. PowerPoint PPT Presentation.



SLIDE 1


AIXI Tutorial Part II

Intuitions, Approximations, and the Real World™
John Aslanides and Tom Everitt
July 10, 2018

SLIDE 2


Contents

1. Short Recap
2. Approximations
3. (Break)
4. Variants of AIXI

SLIDE 5


Why are we here?

AIXI [1] proposes an answer to the following question: What is optimal behavior in general unknown environments? In this part we'll give some scaled-down examples and conceptual intuitions about what this means. These slides can be found at aslanides.io/docs/aixi_tutorial.pdf.

SLIDE 6


RL Setting & Notation

The environment is an unknown, non-ergodic, partially observable MDP.

Symbol | Description | Example
a ∈ A | Action | {↑, ↓, ←, →, ...}, N, ...
o ∈ O | Observation | R^N, B⋆, ...
r ∈ R | Reward | R, Z, ...
e ∈ E | Percept | O × R (definition)
µ ∈ M | Environment | gridworld, robotics, ...
π ∈ ∆(A) | Policy | ε-greedy, random, ...
æ<t ∈ (A × E)⋆ | History | a₁o₁r₁ ... a_{t−1}o_{t−1}r_{t−1}

SLIDE 7


RL Setting & Notation

Agent and environment interact using the standard RL setup. [Diagram: the agent sends action a_t to the environment, which returns percept e_t.]

SLIDE 10


Optimal policy (“Just do the best thing”)

The optimal state-action value in environment µ at time t, given history æ<t, is

$$Q^*_\mu(a_t \mid æ_{<t}) = \sup_{\pi} \mathbb{E}_\mu\!\left[ \sum_{k=t}^{\infty} \gamma_k r_k \,\middle|\, \pi, æ_{<t} a_t \right]$$

The optimal value:

$$V^*_\mu(æ_{<t}) = \max_{a_t \in \mathcal{A}} Q^*_\mu(a_t \mid æ_{<t})$$

The optimal policy is greedy, breaking ties at random:

$$\pi^*_\mu(æ_{<t}) = \arg\max_{a} Q^*_\mu(a \mid æ_{<t})$$

SLIDE 13


Optimal value

The optimal value in environment µ at time t, given history æ<t, is

$$V^*_\mu(æ_{<t}) = \lim_{m\to\infty} \max_{a_t} \sum_{e_t} \cdots \max_{a_{t+m}} \sum_{e_{t+m}} \sum_{k=t}^{t+m} \gamma_k r_k \prod_{j=t}^{k} \mu(e_j \mid æ_{<j} a_j)$$

- The product $\prod_{j=t}^{k} \mu(e_j \mid æ_{<j} a_j)$ is the likelihood of percepts e_{t:k} given the action sequence a_{1:k}.
- The sum $\sum_{k=t}^{t+m} \gamma_k r_k$ is the discounted return realized by the trajectory e_{t:t+m}.
- The alternating max/sum is expectimax up to horizon m.

SLIDE 15


Optimal value

Optimal value up to horizon m:

$$V^*_{\mu,m}(æ_{<t}) = \underbrace{\max_{a_t} \sum_{e_t} \cdots \max_{a_{t+m}} \sum_{e_{t+m}} \sum_{k=t}^{t+m} \gamma_k r_k}_{\text{``Planning''}} \, \underbrace{\prod_{j=t}^{k} \mu(e_j \mid æ_{<j} a_j)}_{\text{``Learning''}}.$$
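To make the "planning" term concrete, here is a minimal recursive expectimax sketch in Python. It is illustrative only, not the tutorial's implementation: the model interface (`actions`, `percepts`, `prob`), the percept-as-(observation, reward) encoding, and the `gamma` discount function are all assumptions.

```python
# Illustrative finite-horizon expectimax (a sketch, not the authors' code).
# Assumes a hypothetical model exposing .actions, .percepts, and
# .prob(e, history, a) = mu(e | ae_<t a); percepts are (obs, reward) pairs.

def expectimax_value(model, history, t, m, gamma):
    """V*_{mu,m}: alternate max over actions and sum over percepts."""
    if m < 0:
        return 0.0
    best = float("-inf")
    for a in model.actions:
        total = 0.0
        for e in model.percepts:
            p = model.prob(e, history, a)  # likelihood mu(e | ae_<t a)
            if p == 0.0:
                continue
            _, reward = e
            future = expectimax_value(model, history + [(a, e)], t + 1, m - 1, gamma)
            total += p * (gamma(t) * reward + future)  # reward + continuation
        best = max(best, total)
    return best
```

The cost grows as (|A|·|E|)^(m+1), which is why the MCTS approximation on the next slides matters.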

SLIDE 17


Planning

We can approximate the expectimax computation of V*_{µ,m} with a variant of Monte-Carlo Tree Search (MCTS). Example use: playing Chess, Go, and Shogi (AlphaZero) [2].

SLIDE 20


Planning

Algorithm: ρUCT [3], an extension of UCT [4] to histories. Idea: only expand subtrees that show promising rewards and/or high uncertainty. Trade off reward with uncertainty using a tree-based variant of the UCB algorithm [5]:

$$a_{UCT} \in \arg\max_{a \in \mathcal{A}} \Bigg[ \underbrace{\hat{Q}(a \mid æ_{<t})}_{\text{Value estimate}} + \underbrace{C \sqrt{\frac{\log T(æ_{<t})}{T(æ_{<t} a)}}}_{\text{Exploration bonus}} \Bigg],$$

where T(·) is the number of times a sequence has been visited.
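As a toy illustration of the selection rule above, here is a sketch of UCB action choice at a single tree node; the `stats` layout and the constant `c` are invented for the example.

```python
import math

# Illustrative UCB action selection at one ruUCT tree node (a sketch, not
# the tutorial's implementation). `stats` maps action -> (visits, q_estimate);
# `total_visits` is T(ae_<t); per-action visits are T(ae_<t a).

def ucb_action(stats, total_visits, c=1.4):
    best_a, best_score = None, float("-inf")
    for a, (visits, q) in stats.items():
        if visits == 0:
            return a  # expand unvisited actions first
        bonus = c * math.sqrt(math.log(total_visits) / visits)
        score = q + bonus  # value estimate + exploration bonus
        if score > best_score:
            best_a, best_score = a, score
    return best_a

# Example: 'up' has a higher value estimate, but 'down' is under-explored.
print(ucb_action({"up": (10, 5.0), "down": (1, 4.0)}, total_visits=11))
```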

SLIDE 26


Learning

The agent doesn't know µ a priori. Recall the incomputable Solomonoff model class:

$$M(e_{<t} \mid a_{<t}) = \sum_{p} 2^{-\ell(p)} [\![ p(a_{<t}) = e_{<t} ]\!]$$

Introduce a finite model class M:

$$\xi(e_t \mid æ_{<t} a_t) = \sum_{\nu \in \mathcal{M}} w_\nu \, \nu(e_t \mid æ_{<t} a_t)$$

Update the posterior w_ν with Bayes' rule:

$$w_\nu \leftarrow \frac{\nu(e_t)}{\xi(e_t)} w_\nu \quad \forall \nu \in \mathcal{M}$$

For very small M we can compute this exactly. Let's look at this with some toy examples.
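Here is a minimal sketch of this exact mixture update over a toy two-model class; the coin-flip models and their interface are made up for illustration.

```python
# Illustrative exact Bayesian mixture over a finite model class (a sketch).
# Each model nu maps a percept to nu(e_t | ae_<t a_t); conditioning on the
# history is elided for brevity.

def mixture(models, weights, e):
    """xi(e) = sum over nu of w_nu * nu(e)."""
    return sum(w * nu(e) for nu, w in zip(models, weights))

def bayes_update(models, weights, e):
    """w_nu <- nu(e) / xi(e) * w_nu, for all nu in M."""
    xi = mixture(models, weights, e)
    return [w * nu(e) / xi for nu, w in zip(models, weights)]

# Toy example: two coin models (theta = 0.9 vs 0.1), uniform prior.
models = [lambda e: 0.9 if e == 1 else 0.1,
          lambda e: 0.1 if e == 1 else 0.9]
weights = [0.5, 0.5]
for e in [1, 1, 1]:          # observe three heads
    weights = bayes_update(models, weights, e)
print(weights)               # posterior concentrates on the theta=0.9 model
```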

SLIDE 31


Gridworld example

Consider a class of gridworlds:
- The world is a procedurally generated N × N maze.
- The agent is a robot with A = {←, →, ↑, ↓, ∅}.
- The grey tiles are walls that yield −5 reward if hit.
- The white tiles are empty, but moving costs −1.

SLIDE 36


Gridworld example

- The orange circle looks like an empty tile, but randomly dispenses +100 each step with some fixed probability θ.
- The agent has O(N²) steps to live, e.g. 200 steps on a 10 × 10 grid.
- The observations consist of just four bits, O = B⁴.

This is a stochastic and partially observable environment with simple, easy-to-understand dynamics [3].

SLIDE 44


Simple model class

Let the agent know:
- Maze layout
- Dispenser probability θ
- Environment dynamics

Let it be uncertain about where the only dispenser is:

$$\mathcal{M} = \{\text{Gridworld with dispenser at } (x, y)\}_{(x,y)}^{(N,N)}$$

There are at most |M| ≤ N² 'legal' dispenser positions. Let the agent have a uniform prior w_ν = |M|⁻¹ ∀ν ∈ M. Each ν is a complete gridworld simulator, and µ ∈ M.
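A sketch of how such a model class and uniform prior might be built; the `DispenserModel` stand-in is hypothetical (in the real demo each ν is a full gridworld simulator).

```python
# Illustrative model class for the dispenser gridworld (a sketch; in AIXIjs
# each model is a full simulator). `maze[x][y]` is True where there is a wall.

from dataclasses import dataclass

@dataclass
class DispenserModel:
    """Hypothetical stand-in: one environment hypothesis per dispenser site."""
    dispenser: tuple  # (x, y) of the single dispenser
    theta: float      # known payout probability

def build_model_class(maze, theta):
    n = len(maze)
    legal = [(x, y) for x in range(n) for y in range(n) if not maze[x][y]]
    models = [DispenserModel((x, y), theta) for (x, y) in legal]
    prior = [1.0 / len(models)] * len(models)  # uniform: w_nu = 1/|M|
    return models, prior

maze = [[False, True], [False, False]]  # 2x2 grid with one wall tile
models, prior = build_model_class(maze, theta=0.75)
print(len(models), prior)  # at most N^2 hypotheses, uniform prior
```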

SLIDE 46


AIXIjs

Enough talk. Let's see an online web demo: aslanides.io/aixijs

SLIDE 48


Simple model class

What did we just see? Let’s visualize the agent’s uncertainty as it learns. Initially, the agent has a uniform prior, shown in green.

SLIDE 49


Simple model class

Let’s visualize the agent’s uncertainty as it learns. After exploring a little, the agent’s beliefs have changed. Lighter green corresponds to less probability mass.

SLIDE 50


Simple model class

Let's visualize the agent's uncertainty as it learns. After discovering the dispenser, the agent's posterior concentrates on µ. This concentration is immediate: a global 'collapse'.

SLIDE 51


A more general model class

The previous model class was limited. Here's a more interesting one: model each tile independently with a categorical/Dirichlet distribution over the tile types (wall, empty, dispenser):

$$\rho(e_t \mid \dots) = \prod_{s' \in \mathrm{ne}(s_t)} \mathrm{Dirichlet}(p \mid \alpha_{s'}).$$

The joint distribution factorizes over the grid, so the agent learns about state dynamics only locally, rather than globally. Using this model, the agent is uncertain about:
- Maze layout
- Location, number, and payout probabilities θ_i of the dispenser(s).
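A sketch of one per-tile Dirichlet belief with the standard conjugate count update; the three-way tile-type encoding is an assumption made for illustration, not AIXIjs's exact representation.

```python
import numpy as np

# Illustrative per-tile Dirichlet belief over tile types (a sketch).
# Categories are assumed to be [wall, empty, dispenser]; observing a tile
# increments the corresponding Dirichlet count (conjugate update).

class TileBelief:
    def __init__(self, alpha=(1.0, 1.0, 1.0)):
        self.alpha = np.array(alpha, dtype=float)  # Dirichlet parameters

    def mean(self):
        """Expected categorical distribution over tile types."""
        return self.alpha / self.alpha.sum()

    def observe(self, tile_type: int):
        """Bayesian update after observing this tile's type once."""
        self.alpha[tile_type] += 1.0

# The grid factorizes: one independent belief per tile.
grid = [[TileBelief() for _ in range(3)] for _ in range(3)]
grid[0][1].observe(0)           # bumped into a wall at (0, 1)
print(grid[0][1].mean())        # belief shifts toward 'wall'
```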

SLIDE 52


A more general model class

What did we just see? Let’s visualize the agent’s uncertainty as it learns. Initially the agent knows nothing about the layout. There are two dispensers, visualized for our benefit.

SLIDE 53


A more general model class

Let's visualize the agent's uncertainty as it learns. Tiles that the agent knows are walls are shown in blue. Purple tiles show the agent's belief about θ.

SLIDE 54


A more general model class

Let's visualize the agent's uncertainty as it learns. Note: the smaller dispenser has a lower θ than the larger one. The agent explores efficiently and learns quickly.

SLIDE 55


A more general model class

Let’s visualize the agent’s uncertainty as it learns. Even so, the agent settles for a locally optimal policy. Due to its short horizon m, it can’t see the value in exploring further.

SLIDE 56


Exploration/exploitation trade-off

Here we see the classic exploration/exploitation dilemma. Bayesian agents are not immune to this! The choices of:
- model class
- priors
- discount function
- planning horizon

are all significant! Corollary: AIξ is not asymptotically optimal.

SLIDE 66


(Aside) An even more general model class

We've demonstrated Bayesian RL on gridworlds using very domain-oriented model classes. Is there something more general that is still tractable? Yes! The Context-Tree Weighting (CTW) algorithm:
- A data compressor with good theoretical guarantees.
- Mixes over all Markov models of order less than k (in bits).
- Automatically weights models by complexity (tree depth).
- Model updates in time linear in k.
- Based on the KT estimator (similar to a Beta distribution).
- Can model any sequential density up to a given finite context/history length.
- Learns to play PacMan, Tic-Tac-Toe, Kuhn Poker, and Rock/Paper/Scissors tabula rasa [3].
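For a flavor of CTW's building block, here is a sketch of the KT estimator run on a bit sequence; a real CTW implementation maintains one of these at every context-tree node and mixes their predictions recursively.

```python
# Illustrative KT (Krichevsky-Trofimov) estimator, the per-node predictor
# inside CTW; a sketch, not a full context-tree implementation. The
# probability of the next bit given counts is (ones + 1/2) / (total + 1).

def kt_prob_one(zeros: int, ones: int) -> float:
    return (ones + 0.5) / (zeros + ones + 1.0)

def kt_sequence_prob(bits) -> float:
    """Probability KT assigns to a whole bit sequence, sequentially."""
    p, zeros, ones = 1.0, 0, 0
    for bit in bits:
        p1 = kt_prob_one(zeros, ones)
        p *= p1 if bit == 1 else (1.0 - p1)
        if bit == 1:
            ones += 1
        else:
            zeros += 1
    return p

print(kt_sequence_prob([1, 1, 1, 0]))  # biased strings get high probability
```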

SLIDE 67


Break Time

Let’s take a tea/coffee break! (See you again in 30 mins)

SLIDE 71


Variants of AIξ

We'll discuss several variants of AIXI and their links with 'model-free'/'deep RL' algorithms:
- MDL Agent
- Thompson Sampling
- Knowledge-Seeking Agents
- BayesExp

SLIDE 74


MDL Agent

Minimum Description Length (MDL) principle: prefer simple models. Another take on the 'Occam principle':

$$\rho = \arg\min_{\nu \in \mathcal{M}} \Big[ K(\nu) - \lambda \underbrace{\sum_{k=1}^{t} \log \nu(e_k \mid æ_{<k} a_k)}_{\text{Log-likelihood}} \Big]$$

In deterministic environments: "use the simplest yet-unfalsified hypothesis".
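A sketch of the selection rule, with the incomputable K(ν) replaced by hand-assigned description lengths; the hypothesis records are invented for illustration.

```python
import math

# Illustrative MDL model selection (a sketch). K(nu) is incomputable, so a
# hand-assigned description length (in bits) stands in per hypothesis.
# `loglik` is the accumulated sum of log nu(e_k | history) on the data.

def mdl_pick(hypotheses, lam=1.0):
    """Return the hypothesis minimizing K(nu) - lambda * log-likelihood."""
    return min(hypotheses, key=lambda h: h["K"] - lam * h["loglik"])

hypotheses = [
    {"name": "simple",  "K": 10.0, "loglik": math.log(0.30)},
    {"name": "complex", "K": 50.0, "loglik": math.log(0.35)},
]
# The slightly better fit doesn't pay for 40 extra bits of complexity.
print(mdl_pick(hypotheses)["name"])  # -> 'simple'
```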

SLIDE 83


Thompson Sampling

Recall that the Bayes-optimal agent (AIξ) maximizes the ξ-expected return:

$$a^{\mathrm{AI}\xi}_t = \arg\max_{a} Q^\star_\xi(a \mid æ_{<t}) = \arg\max_{a} \max_{\pi} \mathbb{E}_\xi\!\left[ \sum_{k=t}^{\infty} \gamma_k r_k \,\middle|\, æ_{<t} a \right]$$

A related algorithm is Thompson sampling. Idea: instead of maximizing the ξ-expected return,
- maximize the ρ-expected return, with ρ drawn from w(· | æ<t);
- resample ρ every 'effective horizon', given by the discount γ.

Good regret guarantees in finite MDPs [1]; asymptotically optimal in general environments [2]. Intuition: sampling 'commits' the agent to a given belief/policy for a significant amount of time, and this encourages 'deep' exploration.
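A sketch of the resampling loop, with two bandit-style "environments" standing in for the model class and trivial greedy planning; all names here are illustrative.

```python
import random

# Illustrative Thompson-sampling agent loop (a sketch). `posterior` is a
# list of (model, weight); planning under the sampled model rho would
# normally be expectimax -- here models are bandit arm means, so the
# model-optimal action is just the best arm.

def sample_model(posterior):
    models, weights = zip(*posterior)
    return random.choices(models, weights=weights, k=1)[0]

def thompson_loop(posterior, steps, effective_horizon):
    actions = []
    rho = sample_model(posterior)
    for t in range(steps):
        if t % effective_horizon == 0:      # resample every effective horizon
            rho = sample_model(posterior)
        # act greedily w.r.t. the sampled model (trivial 'planning' here)
        actions.append(max(range(len(rho)), key=lambda a: rho[a]))
    return actions

posterior = [((0.9, 0.1), 0.7), ((0.2, 0.8), 0.3)]  # two bandit hypotheses
print(thompson_loop(posterior, steps=6, effective_horizon=3))
```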

SLIDE 87


Thompson Sampling

'Deep RL' version: Deep Exploration via Bootstrapped DQN [2]. Idea:
- Maintain an ensemble of value functions {Q_k(s, a)}.
- Train these with e.g. DQN, using the statistical bootstrap.
- Thompson sampling: draw a Q-function at random each episode and follow its greedy policy.
- This exhibits much better exploration properties than many alternatives.
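A sketch of the per-episode sampling step with tabular Q-functions standing in for the trained networks; the bootstrapped training itself is elided.

```python
import random

# Illustrative bootstrapped-ensemble action selection (a sketch; real
# Bootstrapped DQN uses K neural Q-heads trained on bootstrapped data).
# Each ensemble member maps (state, action) -> estimated value.

def start_episode(ensemble):
    """Thompson step: commit to one Q-function for the whole episode."""
    return random.choice(ensemble)

def greedy_action(q, state, actions):
    return max(actions, key=lambda a: q.get((state, a), 0.0))

ensemble = [
    {("s0", "left"): 1.0, ("s0", "right"): 0.2},
    {("s0", "left"): 0.1, ("s0", "right"): 0.9},
]
q = start_episode(ensemble)                 # sampled once per episode
print(greedy_action(q, "s0", ["left", "right"]))
```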

SLIDE 95


Knowledge-Seeking Agents

It has long been thought that some form of intrinsic motivation, surprise, or curiosity is necessary for effective exploration and learning [5]. Knowledge-seeking agents (KSA) take this to the extreme:
- Fully unsupervised (no extrinsic rewards)
- The utility function depends on the agent's beliefs about the world
- Exploration ≡ Exploitation

Two forms:
- Shannon KSA ("surprise"): $U(e_t \mid æ_{<t} a_t) = -\log \xi(e_t \mid æ_{<t} a_t)$
- Kullback-Leibler KSA ("information gain"): $U(e_t \mid æ_{<t} a_t) = \mathrm{Ent}(w \mid æ_{<t} a_t) - \mathrm{Ent}(w \mid æ_{1:t})$
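A sketch computing both utilities for a finite mixture, reusing the exact Bayes update from earlier; the toy models are invented for illustration.

```python
import math

# Illustrative KSA utilities over a finite mixture (a sketch). `models` are
# percept -> probability functions; `weights` are the current posterior w_nu.

def entropy(ws):
    return -sum(w * math.log(w) for w in ws if w > 0)

def shannon_utility(models, weights, e):
    """Shannon KSA ('surprise'): U = -log xi(e)."""
    xi = sum(w * nu(e) for nu, w in zip(models, weights))
    return -math.log(xi)

def kl_utility(models, weights, e):
    """KL KSA ('information gain'): entropy of w before minus after e."""
    xi = sum(w * nu(e) for nu, w in zip(models, weights))
    posterior = [w * nu(e) / xi for nu, w in zip(models, weights)]
    return entropy(weights) - entropy(posterior)

models = [lambda e: 0.9 if e else 0.1, lambda e: 0.1 if e else 0.9]
weights = [0.5, 0.5]
print(shannon_utility(models, weights, True))  # surprise of percept e=True
print(kl_utility(models, weights, True))       # information gained from it
```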

SLIDE 96


Knowledge-Seeking Agents

Kullback-Leibler ("information-seeking") is superior to Shannon & Renyi ("entropy-seeking"): in stochastic environments, entropy-seeking agents can be drawn to sources of pure noise, while the information-seeking agent is not.

SLIDE 101


Knowledge-Seeking Agents

'Deep RL' version: Variational Information Maximization for Exploration (VIME) [1]. Idea:
- Learn a forward dynamics model in tandem with model-free RL.
- Use a variational approximation to compute the information gain in closed form.
- Use this as an 'exploration bonus', or intrinsic reward.

Downside: it only works well when learning from 'states', not pixels (wrong loss).
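A sketch of the resulting reward shaping, assuming the information gain has already been computed by the variational dynamics model; `eta` is a hypothetical bonus scale.

```python
# Illustrative intrinsic-reward shaping in the style of VIME (a sketch).
# `info_gain` would come from a variational posterior update of the learned
# dynamics model; here it is simply a number supplied by the caller.

def shaped_reward(extrinsic_reward: float, info_gain: float, eta: float = 0.1):
    """r' = r + eta * information gain: explore where the model learns most."""
    return extrinsic_reward + eta * info_gain

print(shaped_reward(1.0, info_gain=0.37))  # -> 1.037
```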

SLIDE 104


BayesExp

Combine the best of both worlds: the Bayes-optimal reinforcement learner (AIξ) with information-seeking (KL-KSA). Idea: switch between the RL and KSA policies depending on the relative sizes of V_KSA and V_RL.
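A sketch of one plausible reading of the switching rule, with the value functions passed in as stand-ins and a hypothetical threshold ε; the deck does not spell out the exact condition.

```python
# Illustrative BayesExp-style policy switching (a sketch). v_ksa(h) and
# v_rl(h) would be computed by expectimax under the mixture xi; here they
# are supplied by the caller. If the value of information is large, explore;
# otherwise exploit with the Bayes-optimal policy.

def bayesexp_action(history, v_ksa, v_rl, act_ksa, act_rl, epsilon=0.05):
    if v_ksa(history) > epsilon:     # enough information left to gain:
        return act_ksa(history)      # follow the knowledge-seeking policy
    return act_rl(history)           # otherwise act Bayes-optimally

# Toy stand-ins: exploration is still valuable early in this fake history.
action = bayesexp_action(
    history=[],
    v_ksa=lambda h: 0.4, v_rl=lambda h: 1.0,
    act_ksa=lambda h: "explore", act_rl=lambda h: "exploit",
)
print(action)  # -> 'explore'
```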

SLIDE 105


Thanks!


SLIDE 106


References

Marcus Hutter (2005): Universal Artificial Intelligence.
David Silver et al. (2017): Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm.
Joel Veness et al. (2011): A Monte-Carlo AIXI Approximation.
Levente Kocsis and Csaba Szepesvári (2006): Bandit Based Monte-Carlo Planning.
Peter Auer (2002): Using Confidence Bounds for Exploitation-Exploration Trade-offs.
Shipra Agrawal and Randy Jia (2017): Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds.
Jan Leike et al. (2016): Thompson Sampling is Asymptotically Optimal in General Environments.
John Aslanides, Jan Leike, and Marcus Hutter (2017): Universal Reinforcement Learning Algorithms: Survey and Experiments.
Ian Osband, John Aslanides, and Albin Cassirer (2018): Randomized Prior Functions for Deep Reinforcement Learning.
Juergen Schmidhuber (2008): Driven by Compression Progress.
Rein Houthooft et al. (2016): VIME: Variational Information Maximization for Exploration.