Sample-Optimal Parametric Q-Learning Using Linearly Additive Features - PowerPoint PPT Presentation



SLIDE 1

Sample-Optimal Parametric Q-Learning Using Linearly Additive Features

Lin F. Yang, Mengdi Wang

SLIDE 2

A Basic RL Model: Markov Decision Process

  • States: S; Actions: A
  • Reward: r(s, a) ∈ [0, 1]
  • State transition: s′ ∼ P(· | s, a)
  • Policy: π : S → A
  • Optimal policy & value: V*(s) = max_π V^π(s)
  • Optimal policy: π*(s) = argmax_a Q*(s, a)

Effective horizon: the discount γ acts like a random horizon H ∼ Geometric(1 − γ), so E[H] = 1/(1 − γ).
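The MDP ingredients above can be made concrete with a small numerical sketch; the two-state chain and all of its numbers below are invented for illustration, and value iteration (a standard method, not specific to this talk) computes V* and π*:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; all numbers are invented.
gamma = 0.9                      # discount; effective horizon ~ 1/(1-gamma) = 10
P = np.array([                   # P[a, s, s'] = transition probability
    [[0.9, 0.1], [0.2, 0.8]],    # action 0
    [[0.5, 0.5], [0.0, 1.0]],    # action 1
])
r = np.array([[0.0, 1.0],        # r[a, s] = reward for taking action a in state s
              [0.2, 0.5]])

# Value iteration: V(s) <- max_a [ r(s,a) + gamma * E_{s'~P(.|s,a)} V(s') ]
V = np.zeros(2)
for _ in range(500):
    Q = r + gamma * (P @ V)      # Q[a, s]
    V = Q.max(axis=0)
pi_star = Q.argmax(axis=0)       # greedy/optimal policy: one action per state
```

After convergence V satisfies the Bellman optimality equation to numerical precision, and π* is the greedy policy with respect to it.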

SLIDE 3
  • Optimal sample complexity (with a generative model): Θ̃(|S||A| / ((1 − γ)³ ε²))

Too many states for most cases …

Go: |S| = 3^361    Atari: |S| ≥ 256^(256×240)

How to optimally reduce dimensions? Exploiting structures!

Curse of Dimensionality
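For scale, the two state counts above can be checked in a couple of lines (using the usual interpretations: 3 markings on 361 Go board points, and 256 pixel values on a 256×240 Atari screen):

```python
import math

log10_go = 361 * math.log10(3)              # |S| = 3^361: about 172 digits
log10_atari = 256 * 240 * math.log10(256)   # |S| = 256^(256*240)
print(round(log10_go), round(log10_atari))  # number of decimal digits, roughly
```

Both counts dwarf anything a tabular method could enumerate, which is the point of the slide.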

SLIDE 4

Parametric Q-Learning On Feature-Based MDP

  • Transition is decomposable: P(s′ | s, a) = Σ_{k=1..L} φ_k(s, a) ψ_k(s′)

In matrix form, P ∈ ℝ^{(S×A)×S} factors as P = Φ Ψ

Φ: known (features)    Ψ: unknown
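A minimal numpy sketch of this factorization (all dimensions and random features are invented toy choices): rows of Ψ are L basis distributions over next states, and φ(s, a) mixes them; drawing φ from the probability simplex is just one convenient way to keep P stochastic.

```python
import numpy as np

rng = np.random.default_rng(0)
L, nS, nA = 3, 5, 2                             # feature dim, states, actions (toy)

# Unknown part Psi: L basis distributions over next states (rows sum to 1).
Psi = rng.dirichlet(np.ones(nS), size=L)        # shape (L, nS)
# Known part Phi: one feature vector phi(s,a) per state-action pair.
Phi = rng.dirichlet(np.ones(L), size=nS * nA)   # shape (nS*nA, L)

P = Phi @ Psi                                   # decomposable transition matrix
assert np.allclose(P.sum(axis=1), 1.0)          # each P(.|s,a) is a distribution
assert np.linalg.matrix_rank(P) <= L            # rank bounded by feature dimension
```

The rank bound is what the algorithm exploits: the |S|-dimensional next-state distribution is determined by only L numbers per state-action pair.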

SLIDE 5

Parametric Q-Learning On Feature-Based MDP

  • Transition is decomposable
SLIDE 6

Parametric Q-Learning On Feature-Based MDP

(Figure: example next-state probabilities 0.2, 0.11, 0.3, 0.5, 0.01 illustrating a feature-based transition.)

SLIDE 7

A Simple Regression Based Algorithm

  • Generative model: we are able to sample s′ ∼ P(· | s, a) from any (s, a)
  • Learn 𝑥 with modified Q-learning

Represent the Q-function with parameter x ∈ ℝᴸ:

Q_x(s, a) := r(s, a) + γ · φ(s, a)ᵀ x

V_x(s) := max_{a∈A} Q_x(s, a)

π_x(s) := argmax_{a∈A} Q_x(s, a)

Sample complexity (L: feature dimension): Õ(L / (ε² (1 − γ)⁷))
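The regression-based idea can be sketched as follows. This is my own simplified Monte-Carlo version, not the paper's exact procedure (the sample count, iteration count, and plain least-squares solve are invented): estimate the Bellman backup of V_x at every (s, a) from the generative model, then regress the targets onto the features to update x.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, L, nS, nA = 0.9, 3, 5, 2

# Ground-truth decomposable model to sample from (invented toy instance).
Psi = rng.dirichlet(np.ones(nS), size=L)        # unknown basis distributions
Phi = rng.dirichlet(np.ones(L), size=nS * nA)   # known features, row = phi(s,a)
P = Phi @ Psi                                   # true transitions, (nS*nA, nS)
r = rng.uniform(0, 1, size=nS * nA)             # rewards, one per (s,a) pair

def V_of(x):
    """V_x(s) = max_a Q_x(s,a), with Q_x(s,a) = r(s,a) + gamma * phi(s,a)^T x."""
    return (r + gamma * Phi @ x).reshape(nS, nA).max(axis=1)

x = np.zeros(L)
n = 2000                                        # generative-model samples per (s,a)
for _ in range(50):
    V = V_of(x)
    # Monte-Carlo estimate of E[V_x(s') | s, a] for every state-action pair.
    targets = np.array([V[rng.choice(nS, size=n, p=P[i])].mean()
                        for i in range(nS * nA)])
    # Regression step: fit phi(s,a)^T x to the estimated backup targets.
    x, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
```

At convergence, φ(s, a)ᵀx ≈ E[V_x(s′) | s, a] up to sampling noise, which is exactly what makes Q_x an approximate parametric fixed point of the Bellman operator under the linear model.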

SLIDE 8

Sample Optimality?

  • Anchor condition: every transition distribution lies in the convex hull of the distributions at a set of anchor state-action pairs:

P(· | s, a) ∈ conv{ P(· | s₁, a₁), …, P(· | s_L, a_L) }

Sample complexity: Θ̃(L / (ε² (1 − γ)³))

ArXiv: 1902.04779. Poster: 117
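The anchor condition is a convex-hull statement, so it can be checked numerically. A minimal sketch using nonnegative least squares (`scipy.optimize.nnls`), where the anchor distributions and the mixing weights are invented toy values:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)
nS, L = 6, 3

# L anchor transition distributions P(. | s_k, a_k) (invented toy numbers).
anchors = rng.dirichlet(np.ones(nS), size=L)    # shape (L, nS)

# Anchor condition: every P(.|s,a) is a convex combination of the anchors.
lam = np.array([0.5, 0.3, 0.2])                 # convex weights, sum to 1
p_sa = lam @ anchors                            # transition row built from anchors

# Membership check: nonnegative weights, near-zero residual, weights sum to 1.
w, resid = nnls(anchors.T, p_sa)
assert resid < 1e-8
assert abs(w.sum() - 1.0) < 1e-6
```

Since the anchors and p_sa are all probability distributions, any exact nonnegative combination automatically has weights summing to one, so a near-zero residual alone certifies convex-hull membership.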