Batch Policy Learning under Constraints
Hoang M. Le, Cameron Voloshin, Yisong Yue (California Institute of Technology)
Setting: learning from off-line, off-policy data. A behavior policy π_D generates historical (sub-optimal) data. Can we learn a better policy from this data under multiple constraints? Can we learn a policy under new constraints? (Setting: MDP, no exploration.)
Given: a data set of n tuples D = {(state, action, next state, c, g)} ∼ π_D.
Goal: find π that solves
    min_π C(π)  s.t.  G(π) ≤ 0   (m constraints, vector-valued in R^m)
where C(π) = E[Σ c(state, action)], G(π) = E[Σ g(state, action)], and g = [g_1 g_2 … g_m]^⊤.
Examples:
- Counterfactual & safe policy learning: avoid designated states via g(x) = 1[x = x_avoid].
- Multi-criteria, value-based constraints: minimize travel time s.t. lane centering and smooth driving.
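For concreteness, the indicator constraint cost from the safe-learning example can be sketched in a few lines. The state representation, `x_avoid`, and the `tolerance` offset are illustrative assumptions; the expectation over trajectories that defines G(π) is taken by the off-policy evaluator, not here.

```python
import numpy as np

def safety_cost(x, x_avoid):
    """Per-step constraint cost g(x) = 1[x = x_avoid]: 1 when the
    state matches the state to be avoided, 0 otherwise."""
    return float(np.array_equal(x, x_avoid))

def trajectory_constraint_cost(states, x_avoid, tolerance=0.0):
    """Accumulated constraint cost along one trajectory; requiring its
    expectation to be <= 0 bounds visits to x_avoid by `tolerance`."""
    return sum(safety_cost(x, x_avoid) for x in states) - tolerance
```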
Proposed approach: multiple reductions to supervised learning and online learning.
Lagrangian: L(π, λ) = C(π) + λ^⊤ G(π), with primal problem (P) min_π max_{λ≥0} L(π, λ) and dual problem (D) max_{λ≥0} min_π L(π, λ).
Algorithm (rough sketch). Iteratively:
1: π ← Best-response(λ)   (off-line RL w.r.t. the combined cost c + λ^⊤ g)
2: L̂_max ← evaluate (D) fixing π
3: L̂_min ← evaluate (P) fixing λ
4: if L̂_max − L̂_min ≤ ω:
5:     stop
6: λ ← Online-algorithm(all previous π), e.g. λ ← λ + η G(π): λ is updated in proportion to the amount of constraint violation.
An online algorithm with regret O(1/√T) gives convergence in O(1/ω²) iterations.
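The sketch above can be written as a small primal-dual loop. This is only a sketch under stated assumptions: `best_response` (off-line RL against cost c + λ^⊤g), `evaluate_C`, and `evaluate_G` (off-policy estimates) are hypothetical caller-supplied subroutines, the duality-gap evaluation uses averaged (mixed) iterates, and the simple box projection of λ onto [0, B] is a simplification of the online algorithm's projection step.

```python
import numpy as np

def constrained_policy_learning(best_response, evaluate_C, evaluate_G,
                                m, eta=0.05, B=10.0, omega=1e-3,
                                max_iters=200):
    """Primal-dual sketch of the meta-algorithm.
    best_response(lam): policy (approximately) minimizing C + lam . G.
    evaluate_C / evaluate_G: off-policy estimates of C(pi), G(pi) in R^m."""
    lam = np.zeros(m)
    C_hist, G_hist, lam_hist = [], [], []
    for _ in range(max_iters):
        pi = best_response(lam)                       # step 1: primal best response
        C_hist.append(evaluate_C(pi))
        G_hist.append(evaluate_G(pi))
        lam_hist.append(lam.copy())
        # Mixed (averaged) iterates approximate the saddle point.
        C_bar = float(np.mean(C_hist))
        G_bar = np.mean(G_hist, axis=0)
        lam_bar = np.mean(lam_hist, axis=0)
        # step 2: (D) fixing the mixed policy, best lambda in the box [0, B]^m
        L_max = C_bar + B * max(float(G_bar.max()), 0.0)
        # step 3: (P) fixing the mixed lambda, one more best response
        pi_min = best_response(lam_bar)
        L_min = evaluate_C(pi_min) + float(lam_bar @ evaluate_G(pi_min))
        if L_max - L_min <= omega:                    # steps 4-5: duality gap small
            break
        # step 6: projected gradient step on lambda (constraint violation)
        lam = np.clip(lam + eta * G_hist[-1], 0.0, B)
    return C_hist, G_hist, lam
```

A toy instance (one-dimensional policies with C(p) = p and a single constraint G(p) = 0.5 − p) shows the intended behavior: λ grows while the constraint is violated, and the averaged constraint value of the mixed policy drifts toward feasibility.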
Off-policy evaluation. Given D = {(state, action, next state, g)} ∼ π_D, estimate Ĝ(π) ≈ G(π).
New approach: model-free function approximation via Fitted Q Evaluation (simplified).
For K iterations:
1: Solve the regression Q: (state, action) ↦ y = g + Q_prev(next state, π(next state))
2: Q_prev ← Q
Return the value of Q_K.
Guarantee for FQE: the error bound depends on a distribution-shift coefficient of the MDP.
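The FQE loop above can be sketched in tabular form, with an exact per-(state, action) average standing in for the supervised regression step. This is a minimal sketch, not the paper's implementation: it assumes discrete states and actions, adds a discount factor γ not shown in the simplified recursion, and all names are illustrative.

```python
import numpy as np

def fitted_q_evaluation(data, policy, n_states, n_actions, gamma=0.95, K=50):
    """Simplified tabular FQE for a fixed policy.
    data: list of (s, a, s_next, g) transitions collected by pi_D.
    policy: maps a state index to the action the evaluated policy takes.
    Each round fits Q to the one-step bootstrap target
    y = g + gamma * Q_prev(s_next, policy(s_next)); the 'fit' here is an
    exact per-(s, a) average, a stand-in for any regression learner."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(K):
        targets = np.zeros((n_states, n_actions))
        counts = np.zeros((n_states, n_actions))
        for s, a, s_next, g in data:
            y = g + gamma * Q[s_next, policy(s_next)]
            targets[s, a] += y
            counts[s, a] += 1
        Q = np.divide(targets, np.maximum(counts, 1))  # unvisited pairs stay 0
    return Q

def fqe_value(Q, initial_states, policy):
    """Estimated value of the policy: average Q_K over initial states."""
    return float(np.mean([Q[s, policy(s)] for s in initial_states]))
```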
End-to-end performance guarantee, stated in terms of the stopping condition ω.
Experiment (driving): minimize travel time s.t. smooth-driving cost ≤ 1 and distance to lane center ≤ 1.
[Plots: smooth-driving cost and distance to lane center for the returned policy, π_D, and the online RL optimum trained without constraints.]
Results:
- both constraints satisfied
- travel time still matches the online RL optimum
More details in the paper:
- Value-based constraint specification: flexible enough to encode domain knowledge.
- Data efficiency from off-line policy learning and counterfactual cost-function modification.