Batch Policy Learning under Constraints
Hoang M. Le, Cameron Voloshin, Yisong Yue
California Institute of Technology
Motivation: Can we learn a better policy from data under multiple constraints? Can we learn a policy under new constraints? (Setting: MDP, no exploration.) A sub-optimal behavior policy πD generates the historical data.
<latexit sha1_base64="9XsIEUOuoSr0nF/REMeIeWgdpgU=">AB+XicbVDLSsNAFL2pr1pfUZduBovgqiQq6EoKunBZwT6gCWEynbZDZ5IwMymUkD9x40IRt/6JO/GSZuFth4YOJxzL/fMCRPOlHacb6uytr6xuVXdru3s7u0f2IdHRWnktA2iXkseyFWlLOItjXTnPYSbEIOe2Gk7vC706pVCyOnvQsob7Ao4gNGcHaSIFtewkLMk9gPZYiu8/zwK47DWcOtErcktShRCuwv7xBTFJBI04VqrvOon2Myw1I5zmNS9VNMFkgke0b2iEBV+Nk+eozOjDNAwluZFGs3V3xsZFkrNRGgmi4hq2SvE/7x+qoc3fsaiJNU0IotDw5QjHaOiBjRgkhLNZ4ZgIpnJisgYS0y0KatmSnCXv7xKOhcN97LhPF7Vm7dlHVU4gVM4BxeuoQkP0I2EJjCM7zCm5VZL9a79bEYrVjlzjH8gfX5AyHTk/Y=</latexit>Given: n tuples data set Goal: find m constraints (vector-valued in ) Rm
<latexit sha1_base64="+arSI5hmKYB6CsKAQ7L2L6WEV/c=">ACHnicdVBLSwMxGMz6rPV9eglWAQPsmys1XqRghePVewDdteSTbNtaPZBkhXKsr/Ei3/FiwdFBE/6b0y3LajoQGCYmS/5Ml7MmVSW9WnMzS8sLi0XVoqra+sbm6Wt7ZaMEkFok0Q8Eh0PS8pZSJuKU47saA48Dhte8OLsd+o0KyKLxRo5i6Ae6HzGcEKy1S9XUyS+xRd9zU8usWujsxDq0TCtHTmqogjInwGrgel1dhtk3VJ5FoWzKJxFIZoqZTBFo1t6d3oRSQIaKsKxlDayYuWmWChGOM2KTiJpjMkQ96mtaYgDKt03yD+1rpQT8S+oQK5ur3iRQHUo4CTyfHS8rf3lj8y7MT5dfclIVxomhIJg/5CYcqguOuYI8JShQfaYKJYHpXSAZYKJ0o0Vdwuyn8H/SOjJRxURXx+X6+bSOAtgFe+AIHAK6uASNEATEHAPHsEzeDEejCfj1XibROeM6cwO+AHj4wuUAp8P</latexit>D = {(state, action, next state, cost)} ∼ πD C(π) = 𝔽 [∑ c(state, action)] G(π) = 𝔽 [∑ g(state, action)] g = [g1 g2 … gm]
⊤
min
π
C(π) s.t. G(π) ≤ 0 π
Given: n tuples data set Goal: find m constraints (vector-valued in ) Rm
<latexit sha1_base64="+arSI5hmKYB6CsKAQ7L2L6WEV/c=">ACHnicdVBLSwMxGMz6rPV9eglWAQPsmys1XqRghePVewDdteSTbNtaPZBkhXKsr/Ei3/FiwdFBE/6b0y3LajoQGCYmS/5Ml7MmVSW9WnMzS8sLi0XVoqra+sbm6Wt7ZaMEkFok0Q8Eh0PS8pZSJuKU47saA48Dhte8OLsd+o0KyKLxRo5i6Ae6HzGcEKy1S9XUyS+xRd9zU8usWujsxDq0TCtHTmqogjInwGrgel1dhtk3VJ5FoWzKJxFIZoqZTBFo1t6d3oRSQIaKsKxlDayYuWmWChGOM2KTiJpjMkQ96mtaYgDKt03yD+1rpQT8S+oQK5ur3iRQHUo4CTyfHS8rf3lj8y7MT5dfclIVxomhIJg/5CYcqguOuYI8JShQfaYKJYHpXSAZYKJ0o0Vdwuyn8H/SOjJRxURXx+X6+bSOAtgFe+AIHAK6uASNEATEHAPHsEzeDEejCfj1XibROeM6cwO+AHj4wuUAp8P</latexit>D = {(state, action, next state, c, g)} ∼ πD C(π) = 𝔽 [∑ c(state, action)] G(π) = 𝔽 [∑ g(state, action)] g = [g1 g2 … gm]
⊤
min
π
C(π) s.t. G(π) ≤ 0 π
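To make C(π) and G(π) concrete, a minimal sketch of the discounted sums inside those expectations; the discount factor and the (c, g) trajectory format are assumptions, not from the talk:

```python
import numpy as np

def discounted_costs(trajectory, gamma=0.95):
    """Discounted primary cost and constraint-cost vector of one trajectory.

    `trajectory` is a sequence of (c, g) pairs: the scalar cost c(state, action)
    and the constraint-cost vector g(state, action) in R^m collected at each
    step. C(pi) and G(pi) are the expectations of these sums over trajectories
    generated by pi.
    """
    C = sum(gamma**t * c for t, (c, _) in enumerate(trajectory))
    G = sum(gamma**t * np.asarray(g) for t, (_, g) in enumerate(trajectory))
    return C, G
```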
Examples:
- Counterfactual & safe policy learning: g(x) = 1[x = x_avoid] (sketched below)
- Multi-criteria, value-based constraints: min_π travel time  s.t.  lane centering, smooth driving
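For the safety example, a minimal sketch of the indicator constraint cost; the helper name and the equality test on states are assumptions:

```python
import numpy as np

def g_avoid(state, x_avoid):
    """Constraint cost g(x) = 1[x = x_avoid]: 1 on the avoid state, else 0.

    Constraining G(pi) then caps the expected (discounted) number of visits
    the learned policy makes to x_avoid; any tolerance threshold can be
    folded into g so the constraint reads G(pi) <= 0.
    """
    return float(np.array_equal(state, x_avoid))
```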
Lagrangian: L(π, λ) = C(π) + λ⊤G(π)
(P) min_π max_{λ≥0} L(π, λ)
(D) max_{λ≥0} min_π L(π, λ)
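A one-line note on why this pair matters (standard weak duality, not spelled out on the slide): for any candidate policy π′ and any λ′ ≥ 0,

```latex
% Weak duality sandwich behind the algorithm's stopping condition.
\[
\underbrace{\min_{\pi} L(\pi, \lambda')}_{L_{\min}}
\;\le\;
\max_{\lambda \ge 0} \min_{\pi} L(\pi, \lambda)
\;\le\;
\min_{\pi} \max_{\lambda \ge 0} L(\pi, \lambda)
\;\le\;
\underbrace{\max_{\lambda \ge 0} L(\pi', \lambda)}_{L_{\max}} .
\]
% Hence a gap L_max - L_min <= omega certifies that (pi', lambda') is an
% omega-approximate saddle point.
```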
Proposed Approach:
Multiple reductions to supervised learning and online learning
Algorithm (rough sketch). Iteratively:
1: π ← Best-response(λ)
2: Lmax ← evaluate (D) fixing π
3: Lmin ← evaluate (P) fixing λ
4: if Lmax − Lmin ≤ ω:
5:   stop
6: new λ ← Online-algorithm(all previous π)

Step 4 stops once the duality gap certifies an ω-approximate saddle point (see the weak-duality note above). The λ update is driven by the estimated amount of constraint violation, Ĝ(π) ≈ G(π):
  λ ← [λ + η Ĝ(π)]+   (projected dual gradient step; λ grows on violated constraints)
Online regret of O(√T) for the λ-player ⟹ convergence in O(1/ω²) iterations.
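A hedged end-to-end rendering of this loop. The subroutine names (`best_response`, `C_hat`, `G_hat`), the multiplier bound B, and the use of averaged play to measure the gap are assumptions filling in details the sketch leaves abstract, not the authors' exact algorithm:

```python
import numpy as np

def primal_dual_loop(best_response, C_hat, G_hat, m,
                     B=10.0, eta=0.1, omega=1e-2, max_iters=1000):
    """Sketch of the meta-algorithm: a best-response primal player against an
    online dual player, stopping once the empirical duality gap is <= omega.

    `best_response` maps lambda to a policy (a supervised-learning/RL
    subroutine); C_hat and G_hat are off-policy estimates (e.g., via FQE);
    B bounds ||lambda||_1.
    """
    lam = np.zeros(m)
    policies, lambdas = [], []
    for _ in range(max_iters):
        pi = best_response(lam)                 # 1: best response to lambda
        policies.append(pi)
        lambdas.append(lam.copy())

        # Average play of both players so far.
        C_bar = np.mean([C_hat(p) for p in policies])
        G_bar = np.mean([G_hat(p) for p in policies], axis=0)
        lam_bar = np.mean(lambdas, axis=0)

        # 2: L_max -- best dual response to the fixed policy mixture:
        # put all multiplier mass B on the most violated constraint.
        L_max = C_bar + B * max(G_bar.max(), 0.0)

        # 3: L_min -- best primal response to the averaged lambda.
        pi_br = best_response(lam_bar)
        L_min = C_hat(pi_br) + lam_bar @ G_hat(pi_br)

        # 4-5: the duality gap certifies near-optimality of the mixture.
        if L_max - L_min <= omega:
            break

        # 6: online projected-gradient step on lambda, driven by the
        # estimated amount of constraint violation G_hat(pi).
        lam = np.clip(lam + eta * G_hat(pi), 0.0, B)

    return policies  # returned policy: uniform mixture over these
```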
Key subroutine, off-policy evaluation: given D = {(state, action, next state, g)} ∼ πD, estimate Ĝ(π) ≈ G(π).

New approach: model-free function approximation via Fitted Q Evaluation (simplified). For K iterations:
1: Solve for Q: (state, action) ↦ y = g + Qprev(next state, π(next state))
2: Qprev ← Q
Return the value of QK.

Guarantee for FQE: the estimation error is controlled by a distribution-shift coefficient of the MDP.
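A minimal sketch of this simplified FQE loop; the regressor choice, the explicit discount, and the use of logged states as evaluation points are assumptions (the slide's target carries no explicit gamma):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fitted_q_evaluation(D, pi, K=50, gamma=0.95):
    """Estimate the value of policy `pi` for one cost component from batch
    data D = [(state, action, next_state, g), ...] logged by pi_D.

    Each round fits a regressor Q to the one-step bootstrapped targets
    y = g + gamma * Q_prev(next_state, pi(next_state)).
    """
    states = np.array([s for s, _, _, _ in D], dtype=float)
    actions = np.array([a for _, a, _, _ in D], dtype=float)
    next_states = np.array([s2 for _, _, s2, _ in D], dtype=float)
    g = np.array([gi for _, _, _, gi in D], dtype=float)

    X = np.column_stack([states, actions])
    X_next = np.column_stack(
        [next_states, np.array([pi(s2) for s2 in next_states], dtype=float)])

    Q = None
    for _ in range(K):
        # 1: regression target bootstraps from the previous fit (0 at start).
        y = g + gamma * (Q.predict(X_next) if Q is not None else 0.0)
        # 2: Q_prev <- Q (refit on the same fixed batch).
        Q = RandomForestRegressor(n_estimators=50).fit(X, y)

    # Value of Q_K, averaged over evaluation states (here the logged states
    # stand in for the start-state distribution).
    X0 = np.column_stack([states, np.array([pi(s) for s in states], dtype=float)])
    return float(Q.predict(X0).mean())
```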
End-to-end Performance Guarantee: combining the FQE guarantee with the ω stopping condition yields a performance guarantee for the policy returned by the algorithm.

Results (driving domain): minimize travel time s.t. smooth driving cost ≤ 1/2 and distance to lane center ≤ 1/2; the learned policy is compared against the behavior policy πD and the online RL optimum (without constraints).
Takeaways:
- Data efficiency from offline policy learning and counterfactual cost-function modification.
- Value-based constraint specification is flexible enough to encode domain knowledge.