MVA-RL Course
A Quick Look at the “Reinforcement Learning” course
- A. LAZARIC (SequeL Team @INRIA-Lille)
ENS Cachan - Master 2 MVA
SequeL – INRIA Lille
Sept 27, 2013 - 2/24
◮ Autonomous robotics
  ◮ Elder care
  ◮ Exploration of unknown/dangerous environments
  ◮ Robotics for entertainment
◮ Financial applications
  ◮ Trading execution algorithms
  ◮ Portfolio management
  ◮ Option pricing
◮ Energy management
  ◮ Energy grid integration
  ◮ Maintenance scheduling
  ◮ Energy market regulation
  ◮ Energy production management
◮ Recommender systems
  ◮ Web advertising
  ◮ Product recommendation
  ◮ Date matching
◮ Social applications
  ◮ Bike sharing optimization
  ◮ Election campaigns
  ◮ ER service optimization
  ◮ Resource distribution
◮ And many more...
“Reinforcement Learning: An Introduction”, Sutton and Barto (1998).
[Diagram: the agent–environment loop. The agent sends an action to the environment (actuation); the environment returns a state to the agent (perception).]
A Markov decision process (MDP) is represented by the tuple M = ⟨X, A, r, p⟩, where X is the state space, A is the action space, r : X × A → [0, B] is the reward function, and p(y|x, a) is the transition dynamics. At time t ∈ N, a decision rule πt : X → A is a mapping from states to actions, and a policy (strategy, plan) is a sequence of decision rules π = (π0, π1, π2, . . .).
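For concreteness, the tuple ⟨X, A, r, p⟩ of a small finite MDP can be stored directly as arrays; the two-state, two-action numbers below are invented for illustration, not taken from the course.

```python
import numpy as np

# Hypothetical two-state, two-action MDP: X = {0, 1}, A = {0, 1}
# (all numbers are made up for illustration)
r = np.array([[0.0, 0.5],                 # r(x, a) in [0, B], here B = 1
              [1.0, 0.0]])
p = np.array([[[0.9, 0.1], [0.2, 0.8]],   # p(y | x, a), indexed [x, a, y]
              [[0.0, 1.0], [0.6, 0.4]]])

# A stationary deterministic decision rule pi: X -> A as a lookup table
pi = {0: 1, 1: 0}

# Sanity check: every p[x, a, :] is a probability distribution over next states
assert np.allclose(p.sum(axis=2), 1.0)
```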
V^π(x) = r(x, π(x)) + γ ∑_{y∈X} p(y|x, π(x)) V^π(y)

V*(x) = max_{a∈A} [ r(x, a) + γ ∑_{y∈X} p(y|x, a) V*(y) ]
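On a finite MDP the first equation is linear in V^π, so it can be solved exactly; a minimal sketch on an invented two-state MDP, using a plain linear solver:

```python
import numpy as np

gamma = 0.9
r = np.array([[0.0, 0.5],                    # invented r(x, a)
              [1.0, 0.0]])
p = np.array([[[0.9, 0.1], [0.2, 0.8]],      # invented p(y | x, a)
              [[0.0, 1.0], [0.6, 0.4]]])
pi = np.array([1, 0])                        # deterministic policy pi(x)

# Restrict r and p to the actions chosen by pi
r_pi = r[np.arange(2), pi]                   # r(x, pi(x))
P_pi = p[np.arange(2), pi]                   # p(y | x, pi(x))

# V^pi = r_pi + gamma P_pi V^pi is linear: solve (I - gamma P_pi) V = r_pi
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Fixed-point check: V^pi satisfies its own Bellman equation
assert np.allclose(V, r_pi + gamma * P_pi @ V)
```

State 1 is absorbing under this policy with reward 1, so V(1) = 1/(1 − γ) = 10, a quick consistency check on the solver.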
Value Iteration: V_{k+1} = T V_k

Policy Iteration:
◮ Evaluate: given π_k, compute V^{π_k}.
◮ Improve: given V^{π_k}, compute π_{k+1} = greedy(V^{π_k}).
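Value iteration can be sketched in a few lines on a finite MDP; the dynamics below are invented, and the code iterates the optimality operator T to (near) convergence and then extracts the greedy policy.

```python
import numpy as np

gamma = 0.9
r = np.array([[0.0, 0.5],                    # invented r(x, a)
              [1.0, 0.0]])
p = np.array([[[0.9, 0.1], [0.2, 0.8]],      # invented p(y | x, a)
              [[0.0, 1.0], [0.6, 0.4]]])

def bellman_optimal(V):
    """Optimality operator T: (TV)(x) = max_a [ r(x,a) + gamma * sum_y p(y|x,a) V(y) ]."""
    return (r + gamma * p @ V).max(axis=1)

# Value iteration: V_{k+1} = T V_k; T is a gamma-contraction in sup norm
V = np.zeros(2)
for _ in range(500):
    V_next = bellman_optimal(V)
    if np.max(np.abs(V_next - V)) < 1e-10:
        break
    V = V_next

# Greedy policy with respect to the (near-)optimal value function
pi_star = (r + gamma * p @ V).argmax(axis=1)
```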
Given an observed transition (x, a, x′, r), update

Q_{k+1}(x, a) = (1 − α) Q_k(x, a) + α [ r + γ max_{a′∈A} Q_k(x′, a′) ].
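A tabular sketch of this update, run with a uniformly random behavior policy on the same kind of invented two-state MDP (α is kept constant here, so convergence is only approximate):

```python
import random

gamma, alpha = 0.9, 0.1
# Invented MDP: reward r(x, a) and next-state distribution p(. | x, a)
r = {(0, 0): 0.0, (0, 1): 0.5, (1, 0): 1.0, (1, 1): 0.0}
p = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.0, 1.0], (1, 1): [0.6, 0.4]}

Q = {(x, a): 0.0 for x in (0, 1) for a in (0, 1)}
random.seed(0)
x = 0
for _ in range(20000):
    a = random.choice((0, 1))                       # exploratory behavior policy
    x_next = random.choices((0, 1), weights=p[x, a])[0]
    # Q-learning update: Q <- (1 - alpha) Q + alpha [ r + gamma max_a' Q(x', a') ]
    target = r[x, a] + gamma * max(Q[x_next, 0], Q[x_next, 1])
    Q[x, a] = (1 - alpha) * Q[x, a] + alpha * target
    x = x_next
```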
Given K arms, we define the regret over n rounds of a bandit strategy as

R_n = ∑_{t=1}^{n} X_{i*,t} − ∑_{t=1}^{n} X_{I_t,t}.

For the UCB strategy we can prove

R_n ≤ O( ∑_{i : Δ_i > 0} (b² / Δ_i) log(n) ).
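The index behind this bound is a single line of code: pull the arm maximizing the empirical mean plus a √(2 log t / T_i(t)) bonus. A sketch of UCB1 on invented Bernoulli arms (the means are made up for illustration):

```python
import math
import random

random.seed(1)
means = [0.2, 0.5, 0.8]            # invented Bernoulli arms; arm 2 is optimal
n_rounds = 5000
counts = [0] * len(means)          # T_i(t): number of pulls of arm i
sums = [0.0] * len(means)          # cumulative reward of arm i

for t in range(1, n_rounds + 1):
    if t <= len(means):
        i = t - 1                  # initialization: pull each arm once
    else:
        # UCB index: empirical mean + exploration bonus
        i = max(range(len(means)),
                key=lambda k: sums[k] / counts[k]
                + math.sqrt(2 * math.log(t) / counts[k]))
    reward = 1.0 if random.random() < means[i] else 0.0
    counts[i] += 1
    sums[i] += reward
```

Consistent with the logarithmic regret bound, the suboptimal arms end up pulled only a small fraction of the time.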
Approximate Value Iteration: V̂_{k+1} = T V̂_k

Approximate Policy Iteration:
◮ Evaluate: given π_k, compute V̂^{π_k}.
◮ Improve: given V̂^{π_k}, compute π̂_{k+1} ≈ greedy(V̂^{π_k}).
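A minimal sketch of approximate value iteration with a linear space F (one invented feature per state; least-squares projection is the approximation step). Note that γ is kept small here: with coarse features the projected iteration need not contract, and this invented example actually diverges for γ = 0.9.

```python
import numpy as np

gamma = 0.5                                    # small gamma keeps this projected iteration stable
r = np.array([[0.0, 0.5], [1.0, 0.0]])         # invented r(x, a)
p = np.array([[[0.9, 0.1], [0.2, 0.8]],        # invented p(y | x, a)
              [[0.0, 1.0], [0.6, 0.4]]])
phi = np.array([[1.0],                         # one linear feature per state,
                [2.0]])                        # so F = { w * phi : w in R }

w = np.zeros(1)
for _ in range(200):
    V = phi @ w                                # current iterate, an element of F
    targets = (r + gamma * p @ V).max(axis=1)  # T V_k, computed state by state
    w, *_ = np.linalg.lstsq(phi, targets, rcond=None)  # project T V_k back onto F

V_hat = phi @ w                                # final approximation of V*
```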
||V^{π_K} − V*||_{2,ρ} ≤ inf_{f∈F} ||V* − f||_{2,ρ} + (C_ρ / (1 − γ)) · O(1/√n).
Lectures: Alessandro LAZARIC (SequeL Team, INRIA-Lille Nord Europe)
alessandro.lazaric@inria.fr
researchers.lille.inria.fr/~lazaric/

Practical Sessions: Emilie KAUFMANN (Telecom ParisTech)
emilie.kaufmann@telecom-paristech.fr
perso.telecom-paristech.fr/~kaufmann/
Date   Topic                          Classroom
01/10  Intro/MDP                      C103
08/10  Dynamic Programming            C103
15/10  RL Algorithms                  C103
22/10  TP on DP and RL                C109
29/10  Multi-arm Bandit (1)           C103
05/11  TP on Bandit                   C109
12/11  Multi-arm Bandit (2)           C103
19/11  TP on Bandit                   C109
26/11  Approximate DP                 C103
03/12  Sample Complexity of ADP       C103
10/12  TP on ADP                      C109
17/12  Guest lectures + Internships   C103 (TBC)
14/01  Evaluation                     C103 (TBC)

Lectures are from 11am to 1pm; TP sessions are from 11am to 1:15pm.
◮ Paper review + oral presentation
◮ Projects
◮ Internships (stages)
◮ PhD
Alessandro Lazaric alessandro.lazaric@inria.fr sequel.lille.inria.fr