SLIDE 1

Provably Efficient RL via Latent State Decoding

Simon S. Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miro Dudík, John Langford

SLIDE 5

RL theory vs practice

Theory

  • Simple tabular environments
  • No generalization

Practice

  • Complex rich-observation environments
  • Generalization via function approximation

Can we design provably sample-efficient RL algorithms for rich-observation environments?

SLIDE 11

Block MDPs

A structured model for rich observation RL

  • Agent only observes rich context (visual signal)
  • Environment summarized by small hidden state space (agent location)
  • State can be decoded from observation

[Diagram: hidden state s emits context x; agent takes action a (e.g. Left); repeated for H steps]
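As a rough illustration of the model above, here is a minimal Block MDP simulator sketch. The class name, transition table, and emission scheme are hypothetical, not from the paper; the property shown is that each context's block identifies the hidden state, so the state is decodable from the observation.

```python
import random

class BlockMDP:
    """Minimal Block MDP sketch: a small hidden state space where each
    hidden state emits rich contexts from its own disjoint block."""

    def __init__(self, num_states=3, num_actions=2, horizon=5, seed=0):
        self.rng = random.Random(seed)
        self.M, self.K, self.H = num_states, num_actions, horizon
        # Hypothetical deterministic transition table over (state, action).
        self.transition = {(s, a): (s + a + 1) % num_states
                           for s in range(num_states)
                           for a in range(num_actions)}

    def emit(self, s):
        # Rich context: the first coordinate is the block id (equal to the
        # hidden state), padded with noise -- so the state is decodable.
        return (s, self.rng.random(), self.rng.random())

    def rollout(self, policy):
        # The agent sees only contexts, never the hidden state directly.
        s, trajectory = 0, []
        for _ in range(self.H):
            x = self.emit(s)
            a = policy(x)
            trajectory.append((x, a))
            s = self.transition[(s, a)]
        return trajectory
```

A policy here is any function from contexts to actions; e.g. `BlockMDP().rollout(lambda x: 0)` yields a length-H list of (context, action) pairs.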

SLIDE 15

Objective: Find a Decoder

Idea: Find a function that decodes hidden states from contexts.

f(context) = state

This reduces the problem to a tabular one.

Main Challenge: There are no labels (we cannot observe hidden states).
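To make the reduction concrete, here is a hedged sketch: given any decoder f, replace each context by its decoded state and run an ordinary tabular method. The environment interface (`reset`/`step`) and the Q-learning variant are illustrative assumptions, not the paper's algorithm.

```python
import random
from collections import defaultdict

def q_learning_with_decoder(env, decoder, num_actions,
                            episodes=200, alpha=0.5, gamma=0.9,
                            eps=0.2, seed=0):
    """Tabular Q-learning over decoded states: each rich context is
    mapped to a latent-state label by `decoder`, so the Q-table is
    indexed by (decoded state, action) instead of raw contexts."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        x, done = env.reset(), False
        while not done:
            s = decoder(x)
            if rng.random() < eps:  # epsilon-greedy exploration
                a = rng.randrange(num_actions)
            else:
                a = max(range(num_actions), key=lambda b: Q[(s, b)])
            x, r, done = env.step(a)
            s2 = decoder(x)
            target = r if done else r + gamma * max(
                Q[(s2, b)] for b in range(num_actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q
```

With a perfect decoder, this learns the same table that tabular Q-learning would learn on the hidden MDP directly.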

SLIDE 19

Approach

Our Approach: Learn a function that predicts the conditional probability of (previous state, action) pairs from contexts. (We assume access to a regression oracle to learn this function.)

f(context) = distribution over (s1,a1), (s1,a2), (s2,a1), (s2,a2)

Different conditional probabilities correspond to different states, so clustering these predictions classifies contexts by state.

[Diagram: states at level h: s1, s2; actions: a1, a2; states at level h+1: s3, s4]
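A toy sketch of this step, under heavy simplifying assumptions: contexts are discrete tokens, the "regression oracle" is plain empirical averaging, and clustering is a greedy L1-threshold pass. None of these choices are the paper's; they only illustrate how matching conditional probabilities groups contexts into latent states.

```python
from collections import defaultdict

def learn_decoder(samples, tol=0.25):
    """samples: list of (context, (prev_state, prev_action)) pairs.
    Step 1: estimate P(prev_state, prev_action | context) by counting
    (standing in for the regression oracle).
    Step 2: greedily cluster contexts whose estimated conditionals are
    within L1 distance `tol`; each cluster is one decoded latent state."""
    counts = defaultdict(lambda: defaultdict(int))
    for x, sa in samples:
        counts[x][sa] += 1
    probs = {x: {sa: n / sum(c.values()) for sa, n in c.items()}
             for x, c in counts.items()}

    centers, label = [], {}
    for x, p in probs.items():
        for i, q in enumerate(centers):
            keys = set(p) | set(q)
            if sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys) < tol:
                label[x] = i
                break
        else:
            label[x] = len(centers)
            centers.append(p)
    return label  # decoded state id per context
```

With enough samples and well-separated backward distributions, contexts emitted from the same latent state collapse to one label, while contexts from different states get different labels.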

SLIDE 25

Guarantees

Theorem: Our algorithm finds a near-optimal decoder with poly(M, K, H) samples in polynomial time, using H calls to the supervised-learning black box.

M = number of hidden states, K = number of actions, H = time horizon

This gives statistical efficiency, computational efficiency, and rich observations simultaneously.

Assumptions

  • The supervised learner is expressive enough
  • Latent states are reachable and identifiable
SLIDE 26

Algorithm details and experiments @ Poster #208