On Conservative Policy Iteration
Bruno Scherrer
INRIA Lorraine, LORIA
ICML 2014
Motivation / Context
Large Markov Decision Process
A policy space
A reference policy
On-Policy data from
Can we compute a
$\ldots\, d_{\nu,\pi}(T_{\pi'} v_\pi - T_\pi v_\pi) + o(\alpha^2)$
$d_{\nu,\pi} = (1 - \gamma)\,\nu^T (I - \gamma P_\pi)^{-1}.$
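The expression $(1 - \gamma)\,\nu^T (I - \gamma P_\pi)^{-1}$ above can be checked numerically on a small example. The following is a minimal sketch, not taken from the slides: the function name `occupancy`, the two-state transition matrix `P_pi`, and the start distribution `nu` are all hypothetical. Instead of inverting the matrix, it iterates the equivalent fixed-point equation $d = (1 - \gamma)\nu + \gamma\, d P_\pi$, which has that expression as its unique solution when $\gamma < 1$.

```python
def occupancy(nu, P, gamma, tol=1e-12, max_iter=100_000):
    """Return d = (1 - gamma) * nu^T (I - gamma P)^{-1} as a list.

    Uses the fixed-point iteration d <- (1 - gamma) * nu + gamma * d P,
    a gamma-contraction whose unique fixed point is the expression above.
    """
    n = len(nu)
    d = list(nu)
    for _ in range(max_iter):
        new = [
            (1 - gamma) * nu[j] + gamma * sum(d[i] * P[i][j] for i in range(n))
            for j in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new, d)) < tol:
            return new
        d = new
    return d


# Hypothetical 2-state Markov chain induced by some policy pi (illustration only).
P_pi = [[0.9, 0.1],
        [0.5, 0.5]]
nu = [1.0, 0.0]   # assumed start distribution
gamma = 0.9

d = occupancy(nu, P_pi, gamma)
```

Because $P_\pi$ is row-stochastic, the result is itself a probability distribution over states, which is a quick sanity check on any implementation.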
$\ldots\,(T_{\pi_{k+1}} v_{\pi_k} - T_{\pi_k} v_{\pi_k})$
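The term $T_{\pi_{k+1}} v_{\pi_k} - T_{\pi_k} v_{\pi_k}$ measures, state by state, how much one application of the next policy's Bellman operator gains over the current policy's value. As a rough illustration only, the sketch below evaluates it on a tiny tabular MDP with deterministic policies. The helper names `bellman` and `evaluate`, the MDP arrays `P` and `r`, and both policies are hypothetical and do not come from the slides.

```python
def bellman(policy, v, P, r, gamma):
    """Apply T_pi to v for a deterministic policy:
    (T_pi v)(s) = r[s][a] + gamma * sum_t P[s][a][t] * v[t], with a = policy[s]."""
    return [
        r[s][policy[s]]
        + gamma * sum(p * v[t] for t, p in enumerate(P[s][policy[s]]))
        for s in range(len(v))
    ]


def evaluate(policy, P, r, gamma, tol=1e-12):
    """Compute v_pi as the fixed point of T_pi by repeated application."""
    v = [0.0] * len(P)
    while True:
        new = bellman(policy, v, P, r, gamma)
        if max(abs(a - b) for a, b in zip(new, v)) < tol:
            return new
        v = new


# Hypothetical 2-state, 2-action MDP: action 0 leads to state 0, action 1 to state 1.
# P[s][a] is the next-state distribution, r[s][a] the reward (illustration only).
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
r = [[0.0, 0.0],
     [0.0, 1.0]]
gamma = 0.9

pi_k = [0, 0]     # current policy: always action 0
pi_next = [1, 1]  # candidate next policy: always action 1

v_k = evaluate(pi_k, P, r, gamma)
gain = [
    a - b
    for a, b in zip(bellman(pi_next, v_k, P, r, gamma),
                    bellman(pi_k, v_k, P, r, gamma))
]
```

In this toy example the gain is zero in state 0 and positive in state 1, so the candidate policy is at least as good as the current one in every state.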
Figure: $\mu(v_{\pi_*} - v_{\pi_k})$ (vertical axis, 0.0 to 0.6) against Iterations (horizontal axis, 20 to 100), one panel per algorithm: API, API(0.1), CPI+ (line search), CPI(0.1), NSPI(5), NSPI(10), NSPI(30), PSDP1.