Batch Policy Learning under Constraints
Hoang M. Le, Cameron Voloshin, Yisong Yue
California Institute of Technology
Motivation: Can we learn a better policy from data under multiple constraints? Can we learn a policy under new constraints? (Setting: MDP, no exploration.) A sub-optimal behavior policy πD generates the historical data.
<latexit sha1_base64="9XsIEUOuoSr0nF/REMeIeWgdpgU=">AB+XicbVDLSsNAFL2pr1pfUZduBovgqiQq6EoKunBZwT6gCWEynbZDZ5IwMymUkD9x40IRt/6JO/GSZuFth4YOJxzL/fMCRPOlHacb6uytr6xuVXdru3s7u0f2IdHRWnktA2iXkseyFWlLOItjXTnPYSbEIOe2Gk7vC706pVCyOnvQsob7Ao4gNGcHaSIFtewkLMk9gPZYiu8/zwK47DWcOtErcktShRCuwv7xBTFJBI04VqrvOon2Myw1I5zmNS9VNMFkgke0b2iEBV+Nk+eozOjDNAwluZFGs3V3xsZFkrNRGgmi4hq2SvE/7x+qoc3fsaiJNU0IotDw5QjHaOiBjRgkhLNZ4ZgIpnJisgYS0y0KatmSnCXv7xKOhcN97LhPF7Vm7dlHVU4gVM4BxeuoQkP0I2EJjCM7zCm5VZL9a79bEYrVjlzjH8gfX5AyHTk/Y=</latexit>Given: n tuples data set Goal: find m constraints (vector-valued in ) Rm
<latexit sha1_base64="+arSI5hmKYB6CsKAQ7L2L6WEV/c=">ACHnicdVBLSwMxGMz6rPV9eglWAQPsmys1XqRghePVewDdteSTbNtaPZBkhXKsr/Ei3/FiwdFBE/6b0y3LajoQGCYmS/5Ml7MmVSW9WnMzS8sLi0XVoqra+sbm6Wt7ZaMEkFok0Q8Eh0PS8pZSJuKU47saA48Dhte8OLsd+o0KyKLxRo5i6Ae6HzGcEKy1S9XUyS+xRd9zU8usWujsxDq0TCtHTmqogjInwGrgel1dhtk3VJ5FoWzKJxFIZoqZTBFo1t6d3oRSQIaKsKxlDayYuWmWChGOM2KTiJpjMkQ96mtaYgDKt03yD+1rpQT8S+oQK5ur3iRQHUo4CTyfHS8rf3lj8y7MT5dfclIVxomhIJg/5CYcqguOuYI8JShQfaYKJYHpXSAZYKJ0o0Vdwuyn8H/SOjJRxURXx+X6+bSOAtgFe+AIHAK6uASNEATEHAPHsEzeDEejCfj1XibROeM6cwO+AHj4wuUAp8P</latexit>D = {(state, action, next state, cost)} ∼ πD C(π) = 𝔽 [∑ c(state, action)] G(π) = 𝔽 [∑ g(state, action)] g = [g1 g2 … gm]
⊤
min
π
C(π) s.t. G(π) ≤ 0 π
Given: n tuples data set Goal: find m constraints (vector-valued in ) Rm
<latexit sha1_base64="+arSI5hmKYB6CsKAQ7L2L6WEV/c=">ACHnicdVBLSwMxGMz6rPV9eglWAQPsmys1XqRghePVewDdteSTbNtaPZBkhXKsr/Ei3/FiwdFBE/6b0y3LajoQGCYmS/5Ml7MmVSW9WnMzS8sLi0XVoqra+sbm6Wt7ZaMEkFok0Q8Eh0PS8pZSJuKU47saA48Dhte8OLsd+o0KyKLxRo5i6Ae6HzGcEKy1S9XUyS+xRd9zU8usWujsxDq0TCtHTmqogjInwGrgel1dhtk3VJ5FoWzKJxFIZoqZTBFo1t6d3oRSQIaKsKxlDayYuWmWChGOM2KTiJpjMkQ96mtaYgDKt03yD+1rpQT8S+oQK5ur3iRQHUo4CTyfHS8rf3lj8y7MT5dfclIVxomhIJg/5CYcqguOuYI8JShQfaYKJYHpXSAZYKJ0o0Vdwuyn8H/SOjJRxURXx+X6+bSOAtgFe+AIHAK6uASNEATEHAPHsEzeDEejCfj1XibROeM6cwO+AHj4wuUAp8P</latexit>D = {(state, action, next state, c, g)} ∼ πD C(π) = 𝔽 [∑ c(state, action)] G(π) = 𝔽 [∑ g(state, action)] g = [g1 g2 … gm]
⊤
min
π
C(π) s.t. G(π) ≤ 0 π
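To make C(π) and G(π) concrete, a minimal sketch of the discounted sums inside those expectations; the discount factor and the (c, g) trajectory format are assumptions, not from the talk:

```python
import numpy as np

def discounted_costs(trajectory, gamma=0.95):
    """Discounted primary cost and constraint-cost vector of one trajectory.

    `trajectory` is a sequence of (c, g) pairs: the scalar cost c(state, action)
    and the constraint-cost vector g(state, action) in R^m collected at each
    step. C(pi) and G(pi) are the expectations of these sums over trajectories
    generated by pi.
    """
    C = sum(gamma**t * c for t, (c, _) in enumerate(trajectory))
    G = sum(gamma**t * np.asarray(g) for t, (_, g) in enumerate(trajectory))
    return C, G
```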
Examples:
- Counterfactual & safe policy learning: g(x) = 1[x = x_avoid] (sketched below)
- Multi-criteria, value-based constraints: min_π travel time  s.t.  lane centering, smooth driving
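For the safety example, a minimal sketch of the indicator constraint cost; the helper name and the equality test on states are assumptions:

```python
import numpy as np

def g_avoid(state, x_avoid):
    """Constraint cost g(x) = 1[x = x_avoid]: 1 on the avoid state, else 0.

    Constraining G(pi) then caps the expected (discounted) number of visits
    the learned policy makes to x_avoid; any tolerance threshold can be
    folded into g so the constraint reads G(pi) <= 0.
    """
    return float(np.array_equal(state, x_avoid))
```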
Lagrangian: L(π, λ) = C(π) + λ⊤G(π)
(P) min_π max_{λ≥0} L(π, λ)
(D) max_{λ≥0} min_π L(π, λ)
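A one-line note on why this pair matters (standard weak duality, not spelled out on the slide): for any candidate policy π′ and any λ′ ≥ 0,

```latex
% Weak duality sandwich behind the algorithm's stopping condition.
\[
\underbrace{\min_{\pi} L(\pi, \lambda')}_{L_{\min}}
\;\le\;
\max_{\lambda \ge 0} \min_{\pi} L(\pi, \lambda)
\;\le\;
\min_{\pi} \max_{\lambda \ge 0} L(\pi, \lambda)
\;\le\;
\underbrace{\max_{\lambda \ge 0} L(\pi', \lambda)}_{L_{\max}} .
\]
% Hence a gap L_max - L_min <= omega certifies that (pi', lambda') is an
% omega-approximate saddle point.
```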
Proposed Approach:
Multiple reductions to supervised learning and online learning
Algorithm (rough sketch). Iteratively:
1: π ← Best-response(λ)
2: Lmax ← evaluate (D) fixing π
3: Lmin ← evaluate (P) fixing λ
4: if Lmax − Lmin ≤ ω:
5:   stop
6: new λ ← Online-algorithm(all previous π)

Step 4 stops once the duality gap certifies an ω-approximate saddle point (see the weak-duality note above). The λ update is driven by the estimated amount of constraint violation, Ĝ(π) ≈ G(π):
  λ ← [λ + η Ĝ(π)]+   (projected dual gradient step; λ grows on violated constraints)
Online regret of O(√T) for the λ-player ⟹ convergence in O(1/ω²) iterations.
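A hedged end-to-end rendering of this loop. The subroutine names (`best_response`, `C_hat`, `G_hat`), the multiplier bound B, and the use of averaged play to measure the gap are assumptions filling in details the sketch leaves abstract, not the authors' exact algorithm:

```python
import numpy as np

def primal_dual_loop(best_response, C_hat, G_hat, m,
                     B=10.0, eta=0.1, omega=1e-2, max_iters=1000):
    """Sketch of the meta-algorithm: a best-response primal player against an
    online dual player, stopping once the empirical duality gap is <= omega.

    `best_response` maps lambda to a policy (a supervised-learning/RL
    subroutine); C_hat and G_hat are off-policy estimates (e.g., via FQE);
    B bounds ||lambda||_1.
    """
    lam = np.zeros(m)
    policies, lambdas = [], []
    for _ in range(max_iters):
        pi = best_response(lam)                 # 1: best response to lambda
        policies.append(pi)
        lambdas.append(lam.copy())

        # Average play of both players so far.
        C_bar = np.mean([C_hat(p) for p in policies])
        G_bar = np.mean([G_hat(p) for p in policies], axis=0)
        lam_bar = np.mean(lambdas, axis=0)

        # 2: L_max -- best dual response to the fixed policy mixture:
        # put all multiplier mass B on the most violated constraint.
        L_max = C_bar + B * max(G_bar.max(), 0.0)

        # 3: L_min -- best primal response to the averaged lambda.
        pi_br = best_response(lam_bar)
        L_min = C_hat(pi_br) + lam_bar @ G_hat(pi_br)

        # 4-5: the duality gap certifies near-optimality of the mixture.
        if L_max - L_min <= omega:
            break

        # 6: online projected-gradient step on lambda, driven by the
        # estimated amount of constraint violation G_hat(pi).
        lam = np.clip(lam + eta * G_hat(pi), 0.0, B)

    return policies  # returned policy: uniform mixture over these
```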
Key subroutine, off-policy evaluation: given D = {(state, action, next state, g)} ∼ πD, estimate Ĝ(π) ≈ G(π).

New approach: model-free function approximation via Fitted Q Evaluation (simplified). For K iterations:
1: Solve for Q: (state, action) ↦ y = g + Qprev(next state, π(next state))
2: Qprev ← Q
Return the value of QK.

Guarantee for FQE: the estimation error is controlled by a distribution-shift coefficient of the MDP.
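A minimal sketch of this simplified FQE loop; the regressor choice, the explicit discount, and the use of logged states as evaluation points are assumptions (the slide's target carries no explicit gamma):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fitted_q_evaluation(D, pi, K=50, gamma=0.95):
    """Estimate the value of policy `pi` for one cost component from batch
    data D = [(state, action, next_state, g), ...] logged by pi_D.

    Each round fits a regressor Q to the one-step bootstrapped targets
    y = g + gamma * Q_prev(next_state, pi(next_state)).
    """
    states = np.array([s for s, _, _, _ in D], dtype=float)
    actions = np.array([a for _, a, _, _ in D], dtype=float)
    next_states = np.array([s2 for _, _, s2, _ in D], dtype=float)
    g = np.array([gi for _, _, _, gi in D], dtype=float)

    X = np.column_stack([states, actions])
    X_next = np.column_stack(
        [next_states, np.array([pi(s2) for s2 in next_states], dtype=float)])

    Q = None
    for _ in range(K):
        # 1: regression target bootstraps from the previous fit (0 at start).
        y = g + gamma * (Q.predict(X_next) if Q is not None else 0.0)
        # 2: Q_prev <- Q (refit on the same fixed batch).
        Q = RandomForestRegressor(n_estimators=50).fit(X, y)

    # Value of Q_K, averaged over evaluation states (here the logged states
    # stand in for the start-state distribution).
    X0 = np.column_stack([states, np.array([pi(s) for s in states], dtype=float)])
    return float(Q.predict(X0).mean())
```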
End-to-end Performance Guarantee: combining the FQE guarantee with the ω stopping condition yields a performance guarantee for the policy returned by the algorithm.

Results (driving domain): minimize travel time s.t. smooth driving cost ≤ 1/2 and distance to lane center ≤ 1/2; the learned policy is compared against the behavior policy πD and the online RL optimum (without constraints).
Takeaways:
- Data efficiency from offline policy learning and counterfactual cost-function modification.
- Value-based constraint specification is flexible enough to encode domain knowledge.