

SLIDE 1

Information-Theoretic Considerations in Batch RL

Jinglin Chen, Nan Jiang, University of Illinois at Urbana-Champaign

SLIDE 2

What we study: theory of batch RL (ADP)—backbone for “deep RL”

SLIDE 3

Setting: learn a good policy from batch data {(s, a, r, s')} plus a value-function approximator F (to model Q*)

SLIDE 4

Central question: When is sample-efficient (poly(log|F|, H)) learning guaranteed?
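To make the ADP setting concrete, here is a minimal sketch of Fitted Q-Iteration (FQI), the prototypical batch-RL / ADP procedure this line of analysis studies. The regression oracle `fit`, the demo oracle `tabular_fit`, and the transition format are illustrative assumptions of this sketch, not the talk's notation; horizon and discounting are simplified.

    from collections import defaultdict

    def fitted_q_iteration(data, fit, actions, num_iters):
        # data: list of (s, a, r, s_next) transitions sampled from the
        #       data distribution mu (batch setting: no further interaction).
        # fit:  least-squares regression oracle over the class F;
        #       fit(xs, ys) returns a callable q(s, a).
        q = lambda s, a: 0.0                       # initial iterate f_0 = 0
        for _ in range(num_iters):                 # e.g. H rounds for horizon H
            xs, ys = [], []
            for s, a, r, s_next in data:
                # Regression target = sampled Bellman backup (Tq)(s, a).
                ys.append(r + max(q(s_next, b) for b in actions))
                xs.append((s, a))
            q = fit(xs, ys)                        # "project" Tq back onto F
        return lambda s: max(actions, key=lambda a: q(s, a))   # greedy policy

    # Illustrative regression oracle for the tabular class (F = all tables):
    def tabular_fit(xs, ys):
        sums, counts = defaultdict(float), defaultdict(int)
        for x, y in zip(xs, ys):
            sums[x] += y
            counts[x] += 1
        return lambda s, a: sums[(s, a)] / counts[(s, a)] if (s, a) in counts else 0.0

The projection step q = fit(xs, ys) is where the assumption on F enters: if F is (approximately) closed under the Bellman update, each regression problem is well-specified; otherwise this is exactly where FQI-style guarantees can break down.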

SLIDE 5

[Figure: the space S × A, the data distribution μ(s, a), and the distribution over S × A induced by an arbitrary policy π]

Assumption on data | Assumption on F

[Munos’03]
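A standard way to formalize "exploratory data" is a concentrability coefficient. Variants differ across papers, so the following LaTeX sketch (notation ours) shows one common form rather than the talk's exact definition:

    % Concentrability: mu covers the occupancy distribution of every policy.
    C \;=\; \sup_{\pi}\;\max_{(s,a)\in S\times A}\ \frac{d^{\pi}(s,a)}{\mu(s,a)} \;<\; \infty

Batch-RL error bounds then typically scale polynomially with C.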

SLIDE 6

Assumption on F: inherent Bellman error ‖Tf − Π_F Tf‖ small

[Munos'03] [Munos & Szepesvári'05]
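In LaTeX form, the assumptions on F discussed here can be sketched as follows (notation ours; ‖·‖_μ is a μ-weighted norm):

    \text{Realizability:}\quad Q^{*}\in\mathcal{F}
    \qquad
    \text{Inherent Bellman error:}\quad
    \sup_{f\in\mathcal{F}}\ \inf_{f'\in\mathcal{F}}\
      \bigl\| f' - \mathcal{T}f \bigr\|_{\mu}\ \text{small}

When the inherent Bellman error is exactly zero, F is closed under the Bellman update T; this completeness-type condition is strictly stronger than realizability alone.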

SLIDE 7

Are these assumptions necessary? (hardness results) Do they hold in interesting scenarios?

SLIDE 8
  • Intuition: data should be exploratory


SLIDE 9
  • We show: it is also about the MDP dynamics!
  • Unrestricted dynamics cause an exponential lower bound even with the most exploratory data distribution
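For intuition only (an illustration we add, not necessarily the talk's construction): in a depth-H binary tree MDP with a single rewarding leaf, even the most exploratory data distribution spreads its mass over exponentially many leaves:

    % 2^H leaves, exactly one carrying reward; data uniform over leaves:
    \Pr\bigl[\text{a sample hits the rewarding leaf}\bigr] \;=\; 2^{-H}
    \quad\Longrightarrow\quad
    n \;=\; \Omega(2^{H}) \ \text{samples before the reward is seen at all.}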

SLIDE 10
Similar to Jiang et al. [2017]

SLIDE 11

  • Conjecture: realizability alone is insufficient


SLIDE 12
  • Algorithm-specific lower bounds have existed for decades
  • Is there an information-theoretic lower bound?
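Stated loosely (our paraphrase, not a verbatim statement from the paper):

    \textbf{Conjecture (paraphrase).}\ \text{Under realizability } (Q^{*}\in\mathcal{F})
    \text{ and bounded concentrability alone,}
    \text{no batch algorithm attains } \mathrm{poly}(H,\log|\mathcal{F}|,C,1/\varepsilon)
    \text{ sample complexity.}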

SLIDE 13
  • Negative results: two general styles of hardness proofs are excluded
  • e.g., constructing an exponentially large MDP family fails!

SLIDE 14
F piecewise constant + F closed under Bellman update ⇔ bisimulation [Givan et al.'03]
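For reference, bisimulation in the sense of Givan et al. [2003] can be sketched as follows: an equivalence relation ~ on states is a bisimulation when equivalent states agree on rewards and on transition probabilities into every equivalence class E:

    s_1 \sim s_2 \;\Longrightarrow\;
    \forall a:\ R(s_1,a) = R(s_2,a)
    \ \text{ and }\
    P(E \mid s_1,a) = P(E \mid s_2,a)\ \text{for every class } E.

A piecewise-constant F that is closed under the Bellman update induces exactly such a state partition, which gives one direction of the equivalence on the slide.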

SLIDE 15

Implications and the Bigger Picture

RL is intractable in general; tabular RL is tractable. When is RL with function approximation tractable?

                | Batch                                                   | Online (exploration)
    value-based | nice dynamics & exploratory data + realizability + ???  | nice dynamics (low Bellman rank; Jiang et al.'17) + realizability
    model-based | nice dynamics & exploratory data + realizability        | nice dynamics (low witness rank; Sun et al.'18) + realizability

Between these sufficient conditions and known hardness results: Gap? Gap? Gap confirmed.

Poster: Tue evening, Pacific Ballroom #209