Information-Theoretic Considerations in Batch RL


  1. Information-Theoretic Considerations in Batch RL. Jinglin Chen, Nan Jiang, University of Illinois at Urbana-Champaign.

  2. What we study: the theory of batch RL (ADP), the backbone of “deep RL”. Setting: learn a good policy from batch data {(s, a, r, s’)} plus a value-function approximator class F (used to model Q*). Central question: when is sample-efficient learning, i.e., with poly(log|F|, H) samples, guaranteed? (A sketch of the canonical algorithm, FQI, follows.)
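FQI repeatedly regresses Bellman backup targets onto the class F. Below is a minimal sketch, not the exact procedure from the talk: the dictionary representation of each candidate f in F, the dataset layout, and the helper names are illustrative assumptions.

# Minimal Fitted Q-Iteration (FQI) sketch for the batch setting above.
# Hypothetical representation: each candidate f in F is a dict mapping (s, a) -> value;
# dataset is a list of (s, a, r, s_next) tuples, with s_next = None for terminal transitions.

def bellman_target(f_prev, r, s_next, actions, gamma=1.0):
    """Regression target r + gamma * max_a' f_prev(s', a') for one transition."""
    if s_next is None:
        return r
    return r + gamma * max(f_prev.get((s_next, a), 0.0) for a in actions)

def fitted_q_iteration(dataset, F, actions, horizon):
    """Each round, pick the f in F with the smallest squared error against the
    Bellman targets computed from the previous iterate (least squares restricted to F)."""
    f_cur = {key: 0.0 for key in F[0]}  # start from the all-zero Q-function
    for _ in range(horizon):
        targets = [(s, a, bellman_target(f_cur, r, s_next, actions))
                   for (s, a, r, s_next) in dataset]
        f_cur = min(F, key=lambda f: sum((f.get((s, a), 0.0) - y) ** 2
                                         for (s, a, y) in targets))
    return f_cur

def greedy_policy(f, actions):
    """Act greedily with respect to the learned Q-function."""
    return lambda s: max(actions, key=lambda a: f.get((s, a), 0.0))

The two assumptions discussed next map directly onto this loop: an exploratory data distribution controls how regression error on the batch transfers to the state-action distributions the greedy policy actually visits, and closure of F under the Bellman update keeps each round's regression target representable within F.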

  3. Two standard assumptions (formalized after this item), and two questions about them: do they hold in interesting scenarios, and are they necessary (hardness results)?
  • Assumption on the data [Munos '03]: the data distribution μ(s, a) over S × A covers the state-action distribution induced by any policy π (small concentratability coefficient).
    - Intuition: the data should be exploratory.
    - We show: it is also about the MDP dynamics! Unrestricted dynamics cause an exponential lower bound even under the most exploratory data distribution, similar to [Jiang et al. '17].
  • Assumption on F [Munos & Szepesvári '05]: beyond realizability (Q* ∈ F), the inherent Bellman error of F is small, i.e., F is (approximately) closed under the Bellman update; when F is piecewise constant over a state partition, this corresponds to bisimulation [Givan et al. '03].
    - Conjecture: realizability alone is insufficient.
    - Algorithm-specific lower bounds have existed for decades; is there an information-theoretic one?
    - Negative result: two general proof styles are excluded, e.g., constructing an exponentially large MDP family fails.
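A standard formalization of the two assumptions (the exact norms and constants used in the talk may differ):

\[
\text{(Concentratability [Munos '03])}\qquad
\Big\|\tfrac{d\nu}{d\mu}\Big\|_\infty \le C
\quad\text{for every state-action distribution } \nu \text{ induced by a policy,}
\]
\[
\text{(Realizability)}\qquad Q^* \in \mathcal{F},
\qquad\quad
\text{(Completeness)}\qquad
\max_{f \in \mathcal{F}} \ \min_{g \in \mathcal{F}} \ \| g - \mathcal{T} f \|_{2,\mu} \approx 0,
\]

where \(\mathcal{T}\) is the Bellman optimality operator, \((\mathcal{T} f)(s,a) = \mathbb{E}\big[r + \gamma \max_{a'} f(s',a') \,\big|\, s,a\big]\), and the completeness quantity is the inherent Bellman error of \(\mathcal{F}\) [Munos & Szepesvári '05].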

  4. Implications and the Bigger Picture
  • Tabular RL: tractable.
  • Batch RL with function approximation: tractable given nice dynamics & exploratory data + realizability + ???; whether nice dynamics & exploratory data + realizability alone already suffice is open (gap?).
  • Online RL (exploration): tractable given nice dynamics (low Bellman rank [Jiang et al. '17]) + realizability (value-based), or nice dynamics (low witness rank [Sun et al. '18]) + realizability (model-based); realizability alone is intractable (gap confirmed).
  Poster: Tue Evening, Pacific Ballroom #209
