Batch Policy Learning under Constraints



  1. Batch Policy Learning under Constraints. Hoang M. Le, Cameron Voloshin, Yisong Yue. California Institute of Technology.

  2. Learning from off-line, off-policy data: a behavior policy π_D generates historical (sub-optimal) data. Can we learn a better policy from this data under multiple constraints? Can we learn a policy under new constraints? (Setting: MDP, no exploration.)

  3. Given: a data set of n tuples D = {(state, action, next state, cost)} ∼ π_D. Goal: find π minimizing C(π) subject to G(π) ≤ 0, a set of m constraints (vector-valued in ℝ^m), where C(π) = 𝔼[Σ c(state, action)], G(π) = 𝔼[Σ g(state, action)], and g = [g_1 g_2 … g_m]^⊤.

  4. Given: a data set of n tuples D = {(state, action, next state, c, g)} ∼ π_D. Goal: find π minimizing C(π) subject to G(π) ≤ 0, a set of m constraints (vector-valued in ℝ^m), where C(π) = 𝔼[Σ c(state, action)], G(π) = 𝔼[Σ g(state, action)], and g = [g_1 g_2 … g_m]^⊤.

  5. Given: a data set of n tuples D = {(state, action, next state, c, g)} ∼ π_D; goal: find π minimizing C(π) subject to G(π) ≤ 0. Examples: counterfactual & safe policy learning, e.g. g(x) = 1[x = x_avoid]; multi-criteria value-based constraints, e.g. minimize travel time subject to lane centering and smooth driving.
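To make the objective concrete, the sketch below spells out C(π) and G(π) as expected cumulative costs, assuming trajectories sampled from π are available; in the batch setting of these slides no such rollouts exist, which is why later slides estimate these quantities off-policy. The function name, tuple layout, and the discount factor are illustrative assumptions, not from the slides.

```python
import numpy as np

def monte_carlo_objectives(trajectories, gamma=0.99):
    """Estimate C(pi) and G(pi) from sampled trajectories of pi.

    Each trajectory is a list of (state, action, c, g) steps, where c is the
    scalar primary cost and g is the length-m vector of constraint costs.
    Returns (C_hat, G_hat), with G_hat a length-m array; the constraint
    G(pi) <= 0 is then checked componentwise on G_hat.
    """
    C_returns, G_returns = [], []
    for traj in trajectories:
        discount = 1.0
        c_total = 0.0
        g_total = np.zeros(len(traj[0][3]))
        for _, _, c, g in traj:
            c_total += discount * c
            g_total += discount * np.asarray(g, dtype=float)
            discount *= gamma
        C_returns.append(c_total)
        G_returns.append(g_total)
    return float(np.mean(C_returns)), np.mean(G_returns, axis=0)
```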

  6. Lagrangian: L(π, λ) = C(π) + λ^⊤ G(π). Primal problem (P): min_π max_{λ ≥ 0} L(π, λ); dual problem (D): max_{λ ≥ 0} min_π L(π, λ). Proposed approach: multiple reductions to supervised learning and online learning.
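The same Lagrangian game written out as display math, purely a restatement of the slide (λ ranges over the nonnegative vectors in ℝ^m):

```latex
L(\pi,\lambda) = C(\pi) + \lambda^{\top} G(\pi), \qquad \lambda \in \mathbb{R}^{m}_{\ge 0} \\[4pt]
\text{(P)}:\ \min_{\pi}\ \max_{\lambda \ge 0}\ L(\pi,\lambda)
\qquad\qquad
\text{(D)}:\ \max_{\lambda \ge 0}\ \min_{\pi}\ L(\pi,\lambda)
```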

  7. Algorithm (rough sketch) for the Lagrangian game above. Iteratively: 1: π ← Best-response(λ), i.e. off-line RL with respect to the combined cost c + λ^⊤ g.

  8. Algorithm (rough sketch), iteratively: 1: π ← Best-response(λ); 2: L_max = evaluate (D) fixing π; 3: L_min = evaluate (P) fixing λ.

  9. Algorithm (rough sketch), iteratively: 1: π ← Best-response(λ); 2: L_max = evaluate (D) fixing π; 3: L_min = evaluate (P) fixing λ; 4: if L_max − L_min ≤ ω: 5: stop.

  10. Algorithm (rough sketch), iteratively: 1: π ← Best-response(λ); 2: L_max = evaluate (D) fixing π; 3: L_min = evaluate (P) fixing λ; 4: if L_max − L_min ≤ ω: 5: stop; 6: new λ ← Online-algorithm(all previous π). Regret = O(√T) ⟹ convergence in O(1/ω²) iterations.

  11. Algorithm (rough sketch), iteratively: 1: π ← Best-response(λ); 2: L_max = evaluate (D) fixing π; 3: L_min = evaluate (P) fixing λ; 4: if L_max − L_min ≤ ω: 5: stop; 6: new λ ← Online-algorithm(all previous π), e.g. a gradient step λ ← λ + η Ĝ(π), so λ is updated in proportion to the amount of constraint violation. Regret = O(√T) ⟹ convergence in O(1/ω²) iterations.
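The loop below is a minimal sketch of this primal-dual procedure. The helpers `best_response` (off-line RL on the cost c + λ^⊤ g), `evaluate_C`, and `evaluate_G` (off-policy estimates, e.g. via FQE on the later slides) are assumed black boxes, and the projected-gradient λ-update and duality-gap estimates are one simple instantiation for illustration, not necessarily the paper's exact choices.

```python
import numpy as np

def constrained_batch_policy_learning(best_response, evaluate_C, evaluate_G,
                                      m, eta=0.01, omega=0.01,
                                      lambda_bound=10.0, max_iters=100):
    """Primal-dual loop from the rough sketch on these slides.

    best_response(lmbda) -> policy trained by off-line RL on cost c + lmbda @ g
    evaluate_C(policy)   -> off-policy estimate of C(policy), a scalar
    evaluate_G(policy)   -> off-policy estimate of G(policy), shape (m,)
    Returns the list of policies produced; their uniform mixture is the output.
    """
    lmbda = np.zeros(m)
    policies, lambdas = [], []
    C_vals, G_vals = [], []

    for _ in range(max_iters):
        # 1: primal player best-responds to the current lambda (off-line RL)
        pi = best_response(lmbda)
        policies.append(pi)
        lambdas.append(lmbda.copy())
        C_vals.append(evaluate_C(pi))
        G_vals.append(evaluate_G(pi))

        # 2: L_max -- best bounded lambda against the current policy mixture
        C_mix, G_mix = np.mean(C_vals), np.mean(G_vals, axis=0)
        L_max = C_mix + lambda_bound * max(0.0, float(np.max(G_mix)))

        # 3: L_min -- best-response policy against the averaged lambda
        lam_avg = np.mean(lambdas, axis=0)
        pi_br = best_response(lam_avg)
        L_min = evaluate_C(pi_br) + lam_avg @ evaluate_G(pi_br)

        # 4-5: stop once the empirical duality gap is within omega
        if L_max - L_min <= omega:
            break

        # 6: online (projected gradient) update of lambda, driven by the
        #    estimated constraint violation of the latest policy
        lmbda = np.clip(lmbda + eta * G_vals[-1], 0.0, lambda_bound)

    return policies
```

The returned mixture of policies is what the convergence statement on the slide refers to: with a no-regret λ-player, the empirical duality gap drops below ω after O(1/ω²) iterations.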

  12. Off-policy evaluation: given D = {(state, action, next state, g)} ∼ π_D, estimate Ĝ(π) ≈ G(π).

  13. Off-policy evaluation: given D = {(state, action, next state, g)} ∼ π_D, estimate Ĝ(π) ≈ G(π). New approach: model-free function approximation with Fitted Q Evaluation (simplified). For K iterations: 1: solve the regression Q: (state, action) ↦ y = g + Q_prev(next state, π(next state)); 2: Q_prev ← Q. Return the value of Q_K.

  14. Off-policy evaluation via Fitted Q Evaluation (simplified), as on the previous slide. Guarantee for FQE: the error bound involves a distribution-shift coefficient of the MDP.
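A minimal sketch of the simplified FQE loop from these slides, assuming states and actions are fixed-length numeric feature vectors, a scikit-learn regressor as the function class, and a discount factor; the regressor choice, the `policy` callable, and all names are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_evaluation(dataset, policy, gamma=0.95, K=50):
    """Fitted Q Evaluation (simplified, per the slide).

    dataset: list of (state, action, next_state, g) transitions, where
             state/action are 1-D feature arrays and g is the scalar
             per-step cost being evaluated.
    policy:  maps a state (1-D array) to the action the evaluated policy
             pi would take in that state.
    Returns the fitted Q_K.
    """
    sa = np.array([np.concatenate([s, a]) for s, a, _, _ in dataset])
    g = np.array([g_i for _, _, _, g_i in dataset], dtype=float)
    next_sa = np.array([np.concatenate([s2, policy(s2)])
                        for _, _, s2, _ in dataset])

    Q_prev = None
    for _ in range(K):
        # 1: regression targets bootstrap from the previous Q estimate
        #    (Q_0 is taken to be identically zero)
        y = g if Q_prev is None else g + gamma * Q_prev.predict(next_sa)
        # 2: fit a fresh Q on (state, action) -> y and roll it forward
        Q_prev = ExtraTreesRegressor(n_estimators=50).fit(sa, y)
    return Q_prev
```

The estimate Ĝ(π) is then read off by averaging Q_K(s₀, π(s₀)) over the initial states s₀ in the data set.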

  15. End-to-end performance guarantee (stated in terms of the stopping condition ω).

  16. Experiment: minimize travel time subject to a smooth-driving cost ≤ 1 and distance to lane center ≤ 1. (Plots compare the returned policy against π_D and the online-RL optimal trained without constraints.) Results: both constraints are satisfied, and travel time still matches the online-RL optimal.

  17. More details in the paper: value-based constraint specification, which is flexible for encoding domain knowledge; data efficiency from off-line policy learning and counterfactual cost-function modification.
