AIXI Tutorial John Aslanides and Tom Part II Everitt Short Recap - PowerPoint PPT Presentation

Learning AIXI Tutorial Agent doesn’t know µ a priori. Part II John Aslanides Recall the incomputable Solomonoff model class and Tom Everitt � 2 − ℓ ( p ) � p ( a < t ) = e < t � M ( e < t | a < t ) = Short Recap p Approximations Introduce a finite model class M : (Break) Variants of AIXI � ξ ( e t | æ < t a t ) = w ν ν ( e t | æ < t a t ) ν ∈M Update posterior w ν with Bayes rule: w ν ← ν ( e t ) ξ ( e t ) w ν ∀ ν ∈ M For very small M we can compute this exactly. Let’s look at this with some toy examples. 14/41

Gridworld example AIXI Tutorial Part II Consider a class of gridworlds : John Aslanides and Tom Everitt Short Recap Approximations (Break) Variants of AIXI 15/41

Gridworld example AIXI Tutorial Part II Consider a class of gridworlds : John Aslanides and Tom The world is a procedurally generated N × N maze: Everitt Short Recap Approximations (Break) Variants of AIXI 15/41

Gridworld example AIXI Tutorial Part II Consider a class of gridworlds : John Aslanides and Tom The world is a procedurally generated N × N maze: Everitt Short Recap Approximations (Break) Variants of AIXI with A = {← , → , ↑ , ↓ , ∅} . The agent is a robot 15/41

Gridworld example AIXI Tutorial Part II Consider a class of gridworlds : John Aslanides and Tom The world is a procedurally generated N × N maze: Everitt Short Recap Approximations (Break) Variants of AIXI with A = {← , → , ↑ , ↓ , ∅} . The agent is a robot The grey tiles are walls that yield − 5 reward if hit. 15/41

Gridworld example AIXI Tutorial Part II Consider a class of gridworlds : John Aslanides and Tom The world is a procedurally generated N × N maze: Everitt Short Recap Approximations (Break) Variants of AIXI with A = {← , → , ↑ , ↓ , ∅} . The agent is a robot The grey tiles are walls that yield − 5 reward if hit. are empty, but moving costs − 1. The white tiles 15/41

Gridworld example AIXI Tutorial Part II The orange circle looks like an empty tile, but John Aslanides and Tom randomly dispenses +100 each step with some fixed Everitt probability θ . Short Recap Approximations (Break) Variants of AIXI 16/41

Gridworld example AIXI Tutorial Part II The orange circle looks like an empty tile, but John Aslanides and Tom randomly dispenses +100 each step with some fixed Everitt probability θ . Short Recap � N 2 � steps to live. The agent has O Approximations (Break) Variants of AIXI 16/41

Gridworld example AIXI Tutorial Part II The orange circle looks like an empty tile, but John Aslanides and Tom randomly dispenses +100 each step with some fixed Everitt probability θ . Short Recap � N 2 � steps to live. The agent has O Approximations e.g. 200 steps on 10 × 10 grid. (Break) Variants of AIXI 16/41

Gridworld example AIXI Tutorial Part II The orange circle looks like an empty tile, but John Aslanides and Tom randomly dispenses +100 each step with some fixed Everitt probability θ . Short Recap � N 2 � steps to live. The agent has O Approximations e.g. 200 steps on 10 × 10 grid. (Break) The observations consist of just four bits , O = B 4 : Variants of AIXI 16/41

Gridworld example AIXI Tutorial Part II The orange circle looks like an empty tile, but John Aslanides and Tom randomly dispenses +100 each step with some fixed Everitt probability θ . Short Recap � N 2 � steps to live. The agent has O Approximations e.g. 200 steps on 10 × 10 grid. (Break) The observations consist of just four bits , O = B 4 : Variants of AIXI This is a stochastic & partially observable environment with simple & easy-to-understand dynamics [3]. 16/41

Simple model class AIXI Tutorial Part II John Aslanides and Tom Let the agent know : Everitt Short Recap Approximations (Break) Variants of AIXI 17/41

Simple model class AIXI Tutorial Part II John Aslanides and Tom Let the agent know : Everitt Maze layout Short Recap Approximations (Break) Variants of AIXI 17/41

Simple model class AIXI Tutorial Part II John Aslanides and Tom Let the agent know : Everitt Maze layout Short Recap Dispenser probability θ Approximations (Break) Variants of AIXI 17/41

Simple model class AIXI Tutorial Part II John Aslanides and Tom Let the agent know : Everitt Maze layout Short Recap Dispenser probability θ Approximations Environment dynamics. (Break) Variants of AIXI 17/41

Simple model class AIXI Tutorial Part II John Aslanides and Tom Let the agent know : Everitt Maze layout Short Recap Dispenser probability θ Approximations Environment dynamics. (Break) Let it be uncertain about where the only dispenser is: Variants of AIXI M = { Gridworld with dispenser at ( x , y ) } ( N , N ) ( x , y ) 17/41

Simple model class AIXI Tutorial Part II John Aslanides and Tom Let the agent know : Everitt Maze layout Short Recap Dispenser probability θ Approximations Environment dynamics. (Break) Let it be uncertain about where the only dispenser is: Variants of AIXI M = { Gridworld with dispenser at ( x , y ) } ( N , N ) ( x , y ) There are at most |M| ≤ N 2 ‘legal’ dispenser positions. 17/41

Simple model class AIXI Tutorial Part II John Aslanides and Tom Let the agent know : Everitt Maze layout Short Recap Dispenser probability θ Approximations Environment dynamics. (Break) Let it be uncertain about where the only dispenser is: Variants of AIXI M = { Gridworld with dispenser at ( x , y ) } ( N , N ) ( x , y ) There are at most |M| ≤ N 2 ‘legal’ dispenser positions. Let the agent have a uniform prior w ν = |M| − 1 ∀ ν ∈ M . 17/41

Simple model class AIXI Tutorial Part II John Aslanides and Tom Let the agent know : Everitt Maze layout Short Recap Dispenser probability θ Approximations Environment dynamics. (Break) Let it be uncertain about where the only dispenser is: Variants of AIXI M = { Gridworld with dispenser at ( x , y ) } ( N , N ) ( x , y ) There are at most |M| ≤ N 2 ‘legal’ dispenser positions. Let the agent have a uniform prior w ν = |M| − 1 ∀ ν ∈ M . Each ν is a complete gridworld simulator, and µ ∈ M . 17/41

AIXIjs AIXI Tutorial Part II Enough talk. Let’s see an John Aslanides and Tom Everitt Short Recap Approximations (Break) Variants of AIXI 18/41

AIXIjs AIXI Tutorial Part II Enough talk. Let’s see an John Aslanides and Tom Everitt Short Recap Online web demo Approximations (Break) Variants of AIXI aslanides.io/aixijs 18/41

Simple model class What did we just see? AIXI Tutorial Part II Let’s visualize the agent’s uncertainty as it learns. John Aslanides and Tom Everitt Short Recap Approximations (Break) Variants of AIXI 19/41

Simple model class What did we just see? AIXI Tutorial Part II Let’s visualize the agent’s uncertainty as it learns. John Aslanides and Tom Everitt Short Recap Approximations (Break) Variants of AIXI Initially, the agent has a uniform prior, shown in green. 19/41

Simple model class Let’s visualize the agent’s uncertainty as it learns. AIXI Tutorial Part II John Aslanides and Tom Everitt Short Recap Approximations (Break) Variants of AIXI After exploring a little, the agent’s beliefs have changed. Lighter green corresponds to less probability mass. 20/41

Simple model class Let’s visualize the agent’s uncertainty as it learns. AIXI Tutorial Part II John Aslanides and Tom Everitt Short Recap Approximations (Break) Variants of AIXI After discovering the dispenser, the agent’s posterior concentrates on µ . This concentration is immediate – global ‘collapse’. 21/41

A more general model class The previous model class was limited. Here’s a more AIXI Tutorial Part II interesting one. John Aslanides and Tom Model each tile independently with a categorical/Dirichlet Everitt � � distribution over , , : Short Recap Approximations (Break) � ρ ( e t | . . . ) = Dirichlet ( p | α s ′ ) . Variants of AIXI s ′ ∈ ne( s t ) Joint distribution factorizes over the grid. The agent learns about state dynamics only locally , rather than globally . Using this model, the agent is uncertain about: Maze layout Location, number and payout probabilities θ i of each dispenser(s). 22/41

A more general model class What did we just see? AIXI Tutorial Part II Let’s visualize the agent’s uncertainty as it learns. John Aslanides and Tom Everitt Short Recap Approximations (Break) Variants of AIXI Initially the agent knows nothing about the layout. There are two dispensers, visualized for our benefit. 23/41

A more general model class Let’s visualize the agent’s uncertainty as it learns. AIXI Tutorial Part II John Aslanides and Tom Everitt Short Recap Approximations (Break) Variants of AIXI Tiles that the agent knows are walls are blue . Purple tiles show the agent’s belief of θ . 24/41

A more general model class Let’s visualize the agent’s uncertainty as it learns. AIXI Tutorial Part II John Aslanides and Tom Everitt Short Recap Approximations (Break) Variants of AIXI Note: the smaller has lower θ than the larger . The agent explores efficiently and learns quickly. 25/41

A more general model class Let’s visualize the agent’s uncertainty as it learns. AIXI Tutorial Part II John Aslanides and Tom Everitt Short Recap Approximations (Break) Variants of AIXI Even so, the agent settles for a locally optimal policy. Due to its short horizon m , it can’t see the value in exploring further. 26/41

Exploration/exploitation trade-off AIXI Tutorial Part II John Aslanides and Tom Everitt Here we see the classic exploration/exploitation dilemma. Short Recap Bayesian agents are not immune to this! Approximations Choices of: (Break) model class Variants of AIXI priors discount function planning horizon are all significant! Corollary: AI ξ is not asymptotically optimal . 27/41

(Aside) An even more general model class AIXI Tutorial Part II John Aslanides We’ve demonstrated Bayesian RL on gridworlds using and Tom Everitt very domain-oriented model classes. Short Recap Approximations (Break) Variants of AIXI 28/41

(Aside) An even more general model class AIXI Tutorial Part II John Aslanides We’ve demonstrated Bayesian RL on gridworlds using and Tom Everitt very domain-oriented model classes. Short Recap Is there something more general that is still tractable? Approximations (Break) Variants of AIXI 28/41

(Aside) An even more general model class AIXI Tutorial Part II John Aslanides We’ve demonstrated Bayesian RL on gridworlds using and Tom Everitt very domain-oriented model classes. Short Recap Is there something more general that is still tractable? Approximations Yes! The Context-Tree Weighting (CTW) algorithm: (Break) Variants of AIXI 28/41

(Aside) An even more general model class AIXI Tutorial Part II John Aslanides We’ve demonstrated Bayesian RL on gridworlds using and Tom Everitt very domain-oriented model classes. Short Recap Is there something more general that is still tractable? Approximations Yes! The Context-Tree Weighting (CTW) algorithm: (Break) A data compressor with good theoretical guarantees. Variants of AIXI 28/41

(Aside) An even more general model class AIXI Tutorial Part II John Aslanides We’ve demonstrated Bayesian RL on gridworlds using and Tom Everitt very domain-oriented model classes. Short Recap Is there something more general that is still tractable? Approximations Yes! The Context-Tree Weighting (CTW) algorithm: (Break) A data compressor with good theoretical guarantees. Variants of AIXI Mixes over all < k th -order (in bits) Markov models. 28/41

(Aside) An even more general model class AIXI Tutorial Part II John Aslanides We’ve demonstrated Bayesian RL on gridworlds using and Tom Everitt very domain-oriented model classes. Short Recap Is there something more general that is still tractable? Approximations Yes! The Context-Tree Weighting (CTW) algorithm: (Break) A data compressor with good theoretical guarantees. Variants of AIXI Mixes over all < k th -order (in bits) Markov models. Automatically weights models by complexity (tree depth). 28/41

(Aside) An even more general model class AIXI Tutorial Part II John Aslanides We’ve demonstrated Bayesian RL on gridworlds using and Tom Everitt very domain-oriented model classes. Short Recap Is there something more general that is still tractable? Approximations Yes! The Context-Tree Weighting (CTW) algorithm: (Break) A data compressor with good theoretical guarantees. Variants of AIXI Mixes over all < k th -order (in bits) Markov models. Automatically weights models by complexity (tree depth). Model updates in time linear in k . 28/41

(Aside) An even more general model class AIXI Tutorial Part II John Aslanides We’ve demonstrated Bayesian RL on gridworlds using and Tom Everitt very domain-oriented model classes. Short Recap Is there something more general that is still tractable? Approximations Yes! The Context-Tree Weighting (CTW) algorithm: (Break) A data compressor with good theoretical guarantees. Variants of AIXI Mixes over all < k th -order (in bits) Markov models. Automatically weights models by complexity (tree depth). Model updates in time linear in k . Based on the KT estimator (similar to Beta distribution). 28/41

(Aside) An even more general model class AIXI Tutorial Part II John Aslanides We’ve demonstrated Bayesian RL on gridworlds using and Tom Everitt very domain-oriented model classes. Short Recap Is there something more general that is still tractable? Approximations Yes! The Context-Tree Weighting (CTW) algorithm: (Break) A data compressor with good theoretical guarantees. Variants of AIXI Mixes over all < k th -order (in bits) Markov models. Automatically weights models by complexity (tree depth). Model updates in time linear in k . Based on the KT estimator (similar to Beta distribution). Can model any sequential density up to a finite given context/history length. 28/41

(Aside) An even more general model class AIXI Tutorial Part II John Aslanides We’ve demonstrated Bayesian RL on gridworlds using and Tom Everitt very domain-oriented model classes. Short Recap Is there something more general that is still tractable? Approximations Yes! The Context-Tree Weighting (CTW) algorithm: (Break) A data compressor with good theoretical guarantees. Variants of AIXI Mixes over all < k th -order (in bits) Markov models. Automatically weights models by complexity (tree depth). Model updates in time linear in k . Based on the KT estimator (similar to Beta distribution). Can model any sequential density up to a finite given context/history length. Learns to play PacMan, Tic-Tac-Toe, Kuhn Poker, and Rock/Paper/Scissors tabula rasa [3]. 28/41

Break Time AIXI Tutorial Part II John Aslanides and Tom Everitt Short Recap Let’s take a tea/coffee break! Approximations (Break) Variants of AIXI (See you again in 30 mins) 29/41

Variants of AI ξ AIXI Tutorial Part II John Aslanides and Tom Everitt We’ll discuss various variants of AIXI and their links with Short Recap ‘model-free’/‘deep RL’ algorithms: Approximations (Break) MDL Agent Variants of AIXI 30/41

Variants of AI ξ AIXI Tutorial Part II John Aslanides and Tom Everitt We’ll discuss various variants of AIXI and their links with Short Recap ‘model-free’/‘deep RL’ algorithms: Approximations (Break) MDL Agent Variants of AIXI Thompson Sampling 30/41

Variants of AI ξ AIXI Tutorial Part II John Aslanides and Tom Everitt We’ll discuss various variants of AIXI and their links with Short Recap ‘model-free’/‘deep RL’ algorithms: Approximations (Break) MDL Agent Variants of AIXI Thompson Sampling Knowledge-Seeking Agents 30/41

Variants of AI ξ AIXI Tutorial Part II John Aslanides and Tom Everitt We’ll discuss various variants of AIXI and their links with Short Recap ‘model-free’/‘deep RL’ algorithms: Approximations (Break) MDL Agent Variants of AIXI Thompson Sampling Knowledge-Seeking Agents BayesExp 30/41

MDL Agent AIXI Tutorial Minimum Description Length (MDL) principle: prefer Part II simple models John Aslanides and Tom Everitt   Short Recap   t   Approximations �   ρ = arg min K ( ν ) − λ log log ν ( e k | æ < k a k )   (Break) ν ∈M   k =1   Variants of AIXI � �� Log-likelihood 31/41

MDL Agent AIXI Tutorial Minimum Description Length (MDL) principle: prefer Part II simple models John Aslanides and Tom Another take on the ‘Occam principle’: Everitt   Short Recap   t   Approximations �   ρ = arg min K ( ν ) − λ log log ν ( e k | æ < k a k )   (Break) ν ∈M   k =1   Variants of AIXI � �� Log-likelihood 31/41

MDL Agent AIXI Tutorial Minimum Description Length (MDL) principle: prefer Part II simple models John Aslanides and Tom Another take on the ‘Occam principle’: Everitt   Short Recap   t   Approximations �   ρ = arg min K ( ν ) − λ log log ν ( e k | æ < k a k )   (Break) ν ∈M   k =1   Variants of AIXI � �� Log-likelihood In deterministic environments: “use the simplest yet-unfalsified hypothesis” 31/41

Thompson Sampling AIXI Tutorial Recall the Bayes-optimal agent (AI ξ ) maximizes Part II ξ -expected return: John Aslanides and Tom Everitt Q ⋆ ξ ( a | æ < t ) a AI ξ = arg max a Short Recap � ∞ � � � � Approximations E π � = arg max max γ k r k � æ < t a ξ � a π (Break) k = t Variants of AIXI 32/41

Thompson Sampling AIXI Tutorial Recall the Bayes-optimal agent (AI ξ ) maximizes Part II ξ -expected return: John Aslanides and Tom Everitt Q ⋆ ξ ( a | æ < t ) a AI ξ = arg max a Short Recap � ∞ � � � � Approximations E π � = arg max max γ k r k � æ < t a ξ � a π (Break) k = t Variants of AIXI A related algorithm is Thompson sampling). 32/41

Thompson Sampling AIXI Tutorial Recall the Bayes-optimal agent (AI ξ ) maximizes Part II ξ -expected return: John Aslanides and Tom Everitt Q ⋆ ξ ( a | æ < t ) a AI ξ = arg max a Short Recap � ∞ � � � � Approximations E π � = arg max max γ k r k � æ < t a ξ � a π (Break) k = t Variants of AIXI A related algorithm is Thompson sampling). Idea: Instead of maximizing the ξ -expected return: 32/41

Thompson Sampling AIXI Tutorial Recall the Bayes-optimal agent (AI ξ ) maximizes Part II ξ -expected return: John Aslanides and Tom Everitt Q ⋆ ξ ( a | æ < t ) a AI ξ = arg max a Short Recap � ∞ � � � � Approximations E π � = arg max max γ k r k � æ < t a ξ � a π (Break) k = t Variants of AIXI A related algorithm is Thompson sampling). Idea: Instead of maximizing the ξ -expected return: maximize the ρ -expected return, ρ drawn from w ( ·| æ < t ). 32/41

Thompson Sampling AIXI Tutorial Recall the Bayes-optimal agent (AI ξ ) maximizes Part II ξ -expected return: John Aslanides and Tom Everitt Q ⋆ ξ ( a | æ < t ) a AI ξ = arg max a Short Recap � ∞ � � � � Approximations E π � = arg max max γ k r k � æ < t a ξ � a π (Break) k = t Variants of AIXI A related algorithm is Thompson sampling). Idea: Instead of maximizing the ξ -expected return: maximize the ρ -expected return, ρ drawn from w ( ·| æ < t ). resample ρ every ‘effective horizon’ given by discount γ . 32/41

Thompson Sampling AIXI Tutorial Recall the Bayes-optimal agent (AI ξ ) maximizes Part II ξ -expected return: John Aslanides and Tom Everitt Q ⋆ ξ ( a | æ < t ) a AI ξ = arg max a Short Recap � ∞ � � � � Approximations E π � = arg max max γ k r k � æ < t a ξ � a π (Break) k = t Variants of AIXI A related algorithm is Thompson sampling). Idea: Instead of maximizing the ξ -expected return: maximize the ρ -expected return, ρ drawn from w ( ·| æ < t ). resample ρ every ‘effective horizon’ given by discount γ . Good regret guarantees in finite MDPs [1] 32/41

Thompson Sampling AIXI Tutorial Recall the Bayes-optimal agent (AI ξ ) maximizes Part II ξ -expected return: John Aslanides and Tom Everitt Q ⋆ ξ ( a | æ < t ) a AI ξ = arg max a Short Recap � ∞ � � � � Approximations E π � = arg max max γ k r k � æ < t a ξ � a π (Break) k = t Variants of AIXI A related algorithm is Thompson sampling). Idea: Instead of maximizing the ξ -expected return: maximize the ρ -expected return, ρ drawn from w ( ·| æ < t ). resample ρ every ‘effective horizon’ given by discount γ . Good regret guarantees in finite MDPs [1] Asymptotically optimal in general environments [2]. 32/41

Thompson Sampling AIXI Tutorial Recall the Bayes-optimal agent (AI ξ ) maximizes Part II ξ -expected return: John Aslanides and Tom Everitt Q ⋆ ξ ( a | æ < t ) a AI ξ = arg max a Short Recap � ∞ � � � � Approximations E π � = arg max max γ k r k � æ < t a ξ � a π (Break) k = t Variants of AIXI A related algorithm is Thompson sampling). Idea: Instead of maximizing the ξ -expected return: maximize the ρ -expected return, ρ drawn from w ( ·| æ < t ). resample ρ every ‘effective horizon’ given by discount γ . Good regret guarantees in finite MDPs [1] Asymptotically optimal in general environments [2]. Intuition: ‘commits’ the agent to a given belief/policy for a significant amount of time, 32/41

Thompson Sampling AIXI Tutorial Recall the Bayes-optimal agent (AI ξ ) maximizes Part II ξ -expected return: John Aslanides and Tom Everitt Q ⋆ ξ ( a | æ < t ) a AI ξ = arg max a Short Recap � ∞ � � � � Approximations E π � = arg max max γ k r k � æ < t a ξ � a π (Break) k = t Variants of AIXI A related algorithm is Thompson sampling). Idea: Instead of maximizing the ξ -expected return: maximize the ρ -expected return, ρ drawn from w ( ·| æ < t ). resample ρ every ‘effective horizon’ given by discount γ . Good regret guarantees in finite MDPs [1] Asymptotically optimal in general environments [2]. Intuition: ‘commits’ the agent to a given belief/policy for a significant amount of time, this encourages ‘deep’ exploration. 32/41

Thompson Sampling ‘Deep RL’ version: Deep Exploration via Bootstrapped AIXI Tutorial Part II DQN [2]. John Aslanides and Tom Idea: Maintain an ensemble of value functions Everitt { Q k ( s , a ) } . Short Recap Approximations (Break) Variants of AIXI 33/41

Thompson Sampling ‘Deep RL’ version: Deep Exploration via Bootstrapped AIXI Tutorial Part II DQN [2]. John Aslanides and Tom Idea: Maintain an ensemble of value functions Everitt { Q k ( s , a ) } . Short Recap Train these using e.g. DQN using the statistical Approximations bootstrap. (Break) Variants of AIXI 33/41

Thompson Sampling ‘Deep RL’ version: Deep Exploration via Bootstrapped AIXI Tutorial Part II DQN [2]. John Aslanides and Tom Idea: Maintain an ensemble of value functions Everitt { Q k ( s , a ) } . Short Recap Train these using e.g. DQN using the statistical Approximations bootstrap. (Break) Thompson sampling: draw a Q -function at random each Variants of AIXI episode and use a greedy policy. 33/41

Thompson Sampling ‘Deep RL’ version: Deep Exploration via Bootstrapped AIXI Tutorial Part II DQN [2]. John Aslanides and Tom Idea: Maintain an ensemble of value functions Everitt { Q k ( s , a ) } . Short Recap Train these using e.g. DQN using the statistical Approximations bootstrap. (Break) Thompson sampling: draw a Q -function at random each Variants of AIXI episode and use a greedy policy. Exhibits much better exploration properties than many alternatives 33/41

Knowledge-Seeking Agents AIXI Tutorial It has long been thought that some form of intrinsic Part II motivation , surprise , or curiosity is necessary for John Aslanides and Tom effective exploration and learning [5]. Everitt Short Recap Approximations (Break) Variants of AIXI 34/41

Knowledge-Seeking Agents AIXI Tutorial It has long been thought that some form of intrinsic Part II motivation , surprise , or curiosity is necessary for John Aslanides and Tom effective exploration and learning [5]. Everitt Knowledge-seeking agents (KSA) take to this to the Short Recap extreme: Approximations (Break) Variants of AIXI 34/41

Knowledge-Seeking Agents AIXI Tutorial It has long been thought that some form of intrinsic Part II motivation , surprise , or curiosity is necessary for John Aslanides and Tom effective exploration and learning [5]. Everitt Knowledge-seeking agents (KSA) take to this to the Short Recap extreme: Approximations Fully unsupervised (no extrinsic rewards) (Break) Variants of AIXI 34/41

Knowledge-Seeking Agents AIXI Tutorial It has long been thought that some form of intrinsic Part II motivation , surprise , or curiosity is necessary for John Aslanides and Tom effective exploration and learning [5]. Everitt Knowledge-seeking agents (KSA) take to this to the Short Recap extreme: Approximations Fully unsupervised (no extrinsic rewards) (Break) Utility function depends on agent beliefs about the world Variants of AIXI 34/41

Knowledge-Seeking Agents AIXI Tutorial It has long been thought that some form of intrinsic Part II motivation , surprise , or curiosity is necessary for John Aslanides and Tom effective exploration and learning [5]. Everitt Knowledge-seeking agents (KSA) take to this to the Short Recap extreme: Approximations Fully unsupervised (no extrinsic rewards) (Break) Utility function depends on agent beliefs about the world Variants of AIXI Exploration ≡ Exploitation 34/41

Knowledge-Seeking Agents AIXI Tutorial It has long been thought that some form of intrinsic Part II motivation , surprise , or curiosity is necessary for John Aslanides and Tom effective exploration and learning [5]. Everitt Knowledge-seeking agents (KSA) take to this to the Short Recap extreme: Approximations Fully unsupervised (no extrinsic rewards) (Break) Utility function depends on agent beliefs about the world Variants of AIXI Exploration ≡ Exploitation Two forms: 34/41

Knowledge-Seeking Agents AIXI Tutorial It has long been thought that some form of intrinsic Part II motivation , surprise , or curiosity is necessary for John Aslanides and Tom effective exploration and learning [5]. Everitt Knowledge-seeking agents (KSA) take to this to the Short Recap extreme: Approximations Fully unsupervised (no extrinsic rewards) (Break) Utility function depends on agent beliefs about the world Variants of AIXI Exploration ≡ Exploitation Two forms: Shannon KSA (“surprise”): U ( e t | æ < t a t ) = − log ξ ( e t | æ < t a t ) 34/41

Knowledge-Seeking Agents AIXI Tutorial It has long been thought that some form of intrinsic Part II motivation , surprise , or curiosity is necessary for John Aslanides and Tom effective exploration and learning [5]. Everitt Knowledge-seeking agents (KSA) take to this to the Short Recap extreme: Approximations Fully unsupervised (no extrinsic rewards) (Break) Utility function depends on agent beliefs about the world Variants of AIXI Exploration ≡ Exploitation Two forms: Shannon KSA (“surprise”): U ( e t | æ < t a t ) = − log ξ ( e t | æ < t a t ) Kullback-Leibler KSA (“information gain”): U ( e t | æ < t a t ) = Ent ( w | æ < t a t ) − Ent ( w | æ 1: t ) 34/41

Knowledge-Seeking Agents AIXI Tutorial Kullback Leibler (“information-seeking”) is superior to Part II John Aslanides Shannon & Renyi (“entropy-seeking”): and Tom Everitt Short Recap Approximations (Break) Variants of AIXI 35/41

Knowledge-Seeking Agents AIXI Tutorial Part II John Aslanides and Tom ‘Deep RL’ version: Variational Information Maximization Everitt for Exploration (VIME) [1]. Short Recap Idea: Approximations (Break) Variants of AIXI 36/41

Knowledge-Seeking Agents AIXI Tutorial Part II John Aslanides and Tom ‘Deep RL’ version: Variational Information Maximization Everitt for Exploration (VIME) [1]. Short Recap Idea: Approximations Learn a forward dynamics model in tandem with (Break) model-free RL Variants of AIXI 36/41

Knowledge-Seeking Agents AIXI Tutorial Part II John Aslanides and Tom ‘Deep RL’ version: Variational Information Maximization Everitt for Exploration (VIME) [1]. Short Recap Idea: Approximations Learn a forward dynamics model in tandem with (Break) model-free RL Variants of AIXI Use a variational approximation to compute the information gain in closed form 36/41

Knowledge-Seeking Agents AIXI Tutorial Part II John Aslanides and Tom ‘Deep RL’ version: Variational Information Maximization Everitt for Exploration (VIME) [1]. Short Recap Idea: Approximations Learn a forward dynamics model in tandem with (Break) model-free RL Variants of AIXI Use a variational approximation to compute the information gain in closed form Use this as an ‘exploration bonus’, or intrinsic reward 36/41

AIXI Tutorial John Aslanides and Tom Part II Everitt Short Recap - PowerPoint PPT Presentation

AIXI Tutorial Part II AIXI Tutorial John Aslanides and Tom Part II Everitt Short Recap Intuitions, Approximations, and the Real World Approximations (Break) John Aslanides and Tom Everitt Variants of AIXI July 10, 2018 1/41

Tutorial Tutorial A2 is out, its called Inpainting Tutorial Tutorial A2 is out, its called

AIXI: Universal Optimal Sequential Decision Making Marcus Hutter (2005) Reinforcement Learning

A GAMS TUTORIAL A GAMS TUTORIAL A GAMS TUTORIAL WHAT IS GAMS ? General Algebraic Modeling

Excel Tutorial 1 Getting Started with Excel Tutorial 2 Formatting a Workbook Tutorial 3

PROGRAMMING TUTORIAL Thierry Lepley, April 4 th 2016 TUTORIAL GOAL Intermediate Tutorial for

Do Fifty- Two Motivation Overview of the Language

UPPAAL Tutorial UPPAAL Tutorial UPPAAL Tutorial Introduction Introduction Alexandre David

PowerPoint Tutorial 1 Creating a Presentation Tutorial 2 Applying and Modifying Text and

Tutorial: TF-Ranking for sparse features Tutorial: TF-Ranking for sparse features This tutorial

Comp 1402 Winter 2008 Tutorial #1 Tutorial 1 The objectives of this tutorial will be:

XDP hands-on tutorial Jesper Dangaard Brouer Toke Hiland-Jrgensen Bornhack Gelsted, August

Prose tutorial Edit New Page Sumit Gulwani edited this page 9 minutes ago 60 revisions

Tutorial on using the Google Cloud Platform (GCP) Tutorial on using the Google Cloud Platform

CS 525M Mobile and Ubiquitous Computing Tutorial 1: Introduction by Bucky Roberts (thenewboston)

CAVE2 Unity Tutorial CAVE2 unity tutorial on github Omicron Cave example unity scene Cave2

NLP Programming Tutorial 0 - Programming Basics Graham Neubig Nara Institute of Science and

Machine Learning: Course Overview Yingyu Liang Computer Sciences 760 Fall 2017

Computational Learning Theory [read Chapter 7] [Suggested exercises: 7.1, 7.2, 7.5, 7.8]

Instance Based Learning [Read Ch. 8] k -Nearest Neigh b or Lo cally w eigh

Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University

Lecture 15: Inheritance 2/27/2015 Guest Lecturer: Marvin Zhang Some

Omer Boyaci, Victoria Beltran and Henning Schulzrinne IRT Pizza Talk Nov 2010 SECE allows

Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Stenography

ProtoDUNE SP TPC ADC Calibration Linearity and NL Measurements Note: Slides updated since