Batch Policy Learning under Constraints (PowerPoint presentation)
Hoang M. Le, Cameron Voloshin, Yisong Yue, California Institute of Technology


SLIDE 1

Batch Policy Learning under Constraints

Hoang M. Le, Cameron Voloshin, Yisong Yue
California Institute of Technology

SLIDE 2

Learning from off-line, off-policy data

Can we learn a better policy from the data, under multiple constraints? Can we learn a policy under new constraints? (Setting: MDP, no exploration.) A behavior policy πD generates the historical (sub-optimal) data.

SLIDE 3

Given: a data set of n tuples D = {(state, action, next state, cost)} ∼ πD
Goal: find

min_π C(π)   s.t.   G(π) ≤ 0

where C(π) = 𝔼[∑ c(state, action)], G(π) = 𝔼[∑ g(state, action)], and g = [g1, g2, …, gm] stacks the m constraints (vector-valued, so G(π) ∈ ℝ^m).

SLIDE 4

Given: a data set of n tuples D = {(state, action, next state, c, g)} ∼ πD
Goal: find

min_π C(π)   s.t.   G(π) ≤ 0

where C(π) = 𝔼[∑ c(state, action)], G(π) = 𝔼[∑ g(state, action)], and g = [g1, g2, …, gm] stacks the m constraints (vector-valued, so G(π) ∈ ℝ^m).
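C(π) and G(π) above are expected cumulative costs. As a concrete reading of the definitions, a Monte Carlo estimate from on-policy rollouts might look like the sketch below; the function name, the trajectory format, and the optional discount factor are illustrative assumptions, not part of the talk.

```python
import numpy as np

def empirical_objectives(trajectories, c, g, gamma=1.0):
    """Monte Carlo estimates of C(pi) and the vector-valued G(pi)
    from trajectories rolled out under pi. c maps (state, action)
    to a scalar cost; g maps (state, action) to a constraint vector."""
    C_vals, G_vals = [], []
    for traj in trajectories:  # traj: list of (state, action) pairs
        disc = 1.0
        C_sum = 0.0
        G_sum = np.zeros(np.shape(g(*traj[0])), dtype=float)
        for s, a in traj:
            C_sum += disc * c(s, a)
            G_sum += disc * np.asarray(g(s, a), dtype=float)
            disc *= gamma
        C_vals.append(C_sum)
        G_vals.append(G_sum)
    return float(np.mean(C_vals)), np.mean(np.asarray(G_vals), axis=0)
```

Averaging over many rollouts gives the plug-in estimates of C(π) and G(π); the batch setting of the talk is harder precisely because such on-policy rollouts are unavailable.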

SLIDE 5

Examples:
  • Counterfactual & safe policy learning: g(x) = 1[x = x_avoid]
  • Multi-criteria value-based constraints: min_π travel time   s.t.   lane centering, smooth driving

Given: a data set of n tuples D = {(state, action, next state, c, g)} ∼ πD
Goal: find min_π C(π)   s.t.   G(π) ≤ 0

SLIDE 6

Lagrangian: L(π, λ) = C(π) + λ⊤G(π)

(P)  min_π max_{λ≥0} L(π, λ)        (D)  max_{λ≥0} min_π L(π, λ)

Proposed Approach:

Multiple reductions to supervised learning and online learning
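The primal (P) and dual (D) always satisfy max-min ≤ min-max (weak duality), which is what makes the duality gap a meaningful stopping criterion later. A toy numerical check over a made-up three-policy class (all C and G values below are invented for illustration):

```python
import numpy as np

# Toy check of weak duality: max_lam min_pi L  <=  min_pi max_lam L.
C = np.array([1.0, 0.4, 0.7])              # C(pi) for three policies
G = np.array([[-0.2], [0.3], [0.0]])       # G(pi), one constraint each
lams = np.linspace(0.0, 5.0, 51)[:, None]  # grid of lambda >= 0

L = C[None, :] + lams @ G.T     # L[i, j] = C_j + lam_i * G_j
dual = L.min(axis=1).max()      # (D): max over lam of min over pi
primal = L.max(axis=0).min()    # (P): min over pi of max over lam
```

On this toy class the two values coincide at 0.7; in general only dual ≤ primal is guaranteed for deterministic policies, and the reductions below work with mixtures of policies to close the gap.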

SLIDE 7

Algorithm (rough sketch). Iteratively:
1: π ← Best-response(λ)
  • off-line RL w.r.t. c + λ⊤g

Lagrangian: L(π, λ) = C(π) + λ⊤G(π)
(P)  min_π max_{λ≥0} L(π, λ)        (D)  max_{λ≥0} min_π L(π, λ)
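The best-response step reduces constrained learning to ordinary off-line RL on a single scalarized cost. A one-line sketch of that reduction (the function name is illustrative):

```python
import numpy as np

def scalarized_cost(c, g, lam):
    """Fold the constraint costs into one scalar cost c + lam^T g,
    so any off-line RL method can be used as the best-response oracle."""
    return lambda state, action: c(state, action) + lam @ np.asarray(
        g(state, action), dtype=float)
```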
SLIDE 8

Algorithm (rough sketch). Iteratively:
1: π ← Best-response(λ)
2: Lmax = evaluate (D) fixing π
3: Lmin = evaluate (P) fixing λ

Lagrangian: L(π, λ) = C(π) + λ⊤G(π)
(P)  min_π max_{λ≥0} L(π, λ)        (D)  max_{λ≥0} min_π L(π, λ)

SLIDE 9

Algorithm (rough sketch). Iteratively:
1: π ← Best-response(λ)
2: Lmax = evaluate (D) fixing π
3: Lmin = evaluate (P) fixing λ
4: if Lmax − Lmin ≤ ω:
5:   stop

Lagrangian: L(π, λ) = C(π) + λ⊤G(π)
(P)  min_π max_{λ≥0} L(π, λ)        (D)  max_{λ≥0} min_π L(π, λ)

SLIDE 10

Algorithm (rough sketch). Iteratively:
1: π ← Best-response(λ)
2: Lmax = evaluate (D) fixing π
3: Lmin = evaluate (P) fixing λ
4: if Lmax − Lmin ≤ ω:
5:   stop
6: new λ ← Online-algorithm(all previous π)

Lagrangian: L(π, λ) = C(π) + λ⊤G(π)
(P)  min_π max_{λ≥0} L(π, λ)        (D)  max_{λ≥0} min_π L(π, λ)

Regret = O(√T)  ⟹  convergence in O(1/ω²) iterations
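The claimed rate follows from a standard online-learning argument (constants omitted): a no-regret λ-player makes the average duality gap shrink like Regret(T)/T, so

```latex
\frac{\mathrm{Regret}(T)}{T} \;=\; \frac{O(\sqrt{T})}{T} \;=\; O\!\Big(\frac{1}{\sqrt{T}}\Big)
\;\le\; \omega
\quad\Longleftrightarrow\quad
T \;=\; O\!\Big(\frac{1}{\omega^{2}}\Big).
```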

SLIDE 11

Algorithm (rough sketch). Iteratively:
1: π ← Best-response(λ)
2: Lmax = evaluate (D) fixing π
3: Lmin = evaluate (P) fixing λ
4: if Lmax − Lmin ≤ ω:
5:   stop
6: new λ ← Online-algorithm(all previous π)

Lagrangian: L(π, λ) = C(π) + λ⊤G(π)
(P)  min_π max_{λ≥0} L(π, λ)        (D)  max_{λ≥0} min_π L(π, λ)

Regret = O(√T)  ⟹  convergence in O(1/ω²) iterations

λ update based on the amount of constraint violation (raise λ where Ĝ(π) > 0):
λ ← λ + η Ĝ(π)
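The whole loop can be sketched as below. This is a minimal sketch, assuming three black-box oracles in the spirit of the talk: `best_response(lam)` stands in for off-line RL on the cost c + λ⊤g, and `C_hat`/`G_hat` stand in for off-policy estimates (e.g. FQE). The bound `B` on λ, the averaging of iterates, and all names are assumptions of this sketch, not the paper's exact algorithm.

```python
import numpy as np

def lagrangian(C, G, lam):
    """L(pi, lam) = C(pi) + lam^T G(pi) for scalar C and vector G."""
    return C + lam @ G

def constrained_batch_learn(best_response, C_hat, G_hat, m,
                            eta=1.0, omega=0.1, B=2.0, max_iters=100):
    """Meta-algorithm sketch: alternate a best-response policy player
    with an online gradient update on lambda, stopping when the
    estimated duality gap falls below omega."""
    lam = np.zeros(m)
    policies, lambdas = [], []
    gap = np.inf
    for _ in range(max_iters):
        pi = best_response(lam)                       # step 1
        policies.append(pi)
        lambdas.append(lam.copy())
        # Evaluate the averaged (mixed) iterates of both players.
        C_avg = float(np.mean([C_hat(p) for p in policies]))
        G_avg = np.mean([G_hat(p) for p in policies], axis=0)
        lam_avg = np.mean(lambdas, axis=0)
        # Step 2: (D) fixing the mixed policy; the maximizing lambda
        # puts all of its budget B on the most-violated constraint.
        Lmax = C_avg + B * max(float(G_avg.max()), 0.0)
        # Step 3: (P) fixing lam_avg, via another best-response call.
        pi_br = best_response(lam_avg)
        Lmin = lagrangian(C_hat(pi_br), G_hat(pi_br), lam_avg)
        gap = Lmax - Lmin
        if gap <= omega:                              # steps 4-5
            break
        # Step 6: raise lambda where constraints are violated.
        lam = np.clip(lam + eta * G_hat(pi), 0.0, B)
    return policies, lambdas, gap
```

On a toy finite policy class with known (C, G) per policy, the loop drives the gap below ω in a handful of iterations and settles on the feasible policy.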

SLIDE 12

Off-policy evaluation

Given D = {(state, action, next state, g)} ∼ πD, estimate Ĝ(π) ≈ G(π).

SLIDE 13

Off-policy evaluation

New approach: model-free function approximation.
Given D = {(state, action, next state, g)} ∼ πD, estimate Ĝ(π) ≈ G(π).

Fitted Q Evaluation (simplified). For K iterations:
1: Solve the regression Q : (state, action) ↦ y = g + Qprev(next state, π(next state))
2: Qprev ← Q
Return the value of QK.
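The two FQE steps above can be sketched in a tabular special case, where the regression is just an average of targets per (state, action) cell. The tabular setting and the explicit discount factor are assumptions of this sketch; the talk's method uses general function approximation.

```python
import numpy as np

def fitted_q_evaluation(dataset, policy, n_states, n_actions,
                        gamma=0.9, K=50):
    """Fitted Q Evaluation sketch: each round fits Q to the one-step
    targets y = g + gamma * Q_prev(s', pi(s'))."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(K):
        targets = np.zeros((n_states, n_actions))
        counts = np.zeros((n_states, n_actions))
        for s, a, s_next, g in dataset:
            targets[s, a] += g + gamma * Q[s_next, policy(s_next)]
            counts[s, a] += 1
        # Keep the old value where the data set has no (s, a) samples.
        Q = np.where(counts > 0, targets / np.maximum(counts, 1), Q)
    return Q
```

Ĝ(π) is then read off as the average of QK(s0, π(s0)) over start states.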

SLIDE 14

Off-policy evaluation

New approach: model-free function approximation.
Given D = {(state, action, next state, g)} ∼ πD, estimate Ĝ(π) ≈ G(π).

Fitted Q Evaluation (simplified). For K iterations:
1: Solve the regression Q : (state, action) ↦ y = g + Qprev(next state, π(next state))
2: Qprev ← Q
Return the value of QK.

Guarantee for FQE: the error bound depends on a distribution shift coefficient of the MDP.

SLIDE 15

End-to-end performance guarantee, as a function of the stopping condition ω.

SLIDE 16

Experiment: minimize travel time   s.t.   smooth driving cost ≤ 1/2   and   distance to lane center ≤ 1/2.
Compared: the returned policy, the behavior policy πD, and the online RL optimal policy (w/o constraints).

Results:
  • both constraints satisfied
  • travel time still matches the online RL optimal
SLIDE 17

More details in the paper…

  • Data efficiency from off-line policy learning and counterfactual cost function modification
  • Value-based constraint specification: flexible to encode domain knowledge