SLIDE 10
An algorithm: Q-learning

- Initialise Q(s, a) arbitrarily
- Repeat (for each episode):
  - Initialise s
  - Repeat (for each step of episode):
    1. Choose a from s (ε-greedy policy derived from Q):
       a = argmax_a Q(s, a) with probability 1 − ε (exploit), a random action with probability ε (explore)
    2. Take action a, observe r, s′
    3. Update estimate of Q (learn):
       Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]
       (r: immediate reward; γ max_{a′} Q(s′, a′): estimated future reward)
    - s ← s′
  - Until s is terminal
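The loop above can be sketched in code. This is a minimal illustration of tabular Q-learning with an ε-greedy policy; the toy chain environment, reward of 1 at the terminal state, and the hyperparameter values (alpha, gamma, eps, episode count) are illustrative assumptions, not taken from the slide.

```python
import random

N_STATES = 5          # states 0..4; state 4 is terminal (assumed toy environment)
ACTIONS = [0, 1]      # 0 = move left, 1 = move right

def step(s, a):
    """Deterministic chain: reward 1 only on reaching the terminal state."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return r, s2

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    random.seed(seed)
    # Initialise Q(s, a) arbitrarily (here: all zeros)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s = 0  # initialise s
        while s != N_STATES - 1:  # until s is terminal
            # 1. Choose a from s: epsilon-greedy policy derived from Q
            if random.random() < eps:
                a = random.choice(ACTIONS)                    # explore
            else:
                a = max(ACTIONS, key=lambda a: Q[(s, a)])     # exploit
            # 2. Take action a, observe r, s'
            r, s2 = step(s, a)
            # 3. Learn: Q(s,a) <- Q(s,a) + alpha [r + gamma max_a' Q(s',a') - Q(s,a)]
            best_next = max(Q[(s2, a2)] for a2 in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2  # s <- s'
    return Q

Q = q_learning()
# After learning, the greedy policy should move right in every non-terminal state.
policy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)]
print(policy)
```

Note how the update combines the immediate reward `r` with the estimated future reward `gamma * best_next`, exactly the two terms labelled on the slide.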