Regret-Based Reward Elicitation for Markov Decision Processes
Kevin Regan, University of Toronto



  1. Research Proposal: Regret-Based Reward Elicitation for Markov Decision Processes. Kevin Regan, University of Toronto, Department of Computer Science.

  2. Introduction: Motivation. Setting: computational approaches to sequential decision making under uncertainty, specifically MDPs. These approaches require a model of dynamics and a model of rewards.

  3. Introduction: Motivation. Except in some simple cases, the specification of rewards is problematic: preferences about which states/actions are "good" and "bad" must be translated into precise numerical rewards; it is time consuming to specify a reward for every state/action; and rewards can vary from user to user.

  4. Introduction: Motivation. "My research goal is to develop a minimax regret-based framework for the incremental elicitation of reward functions for MDPs that is cognitively and computationally effective."

  5. Introduction: Reward Elicitation. The elicitation loop: given the current reward knowledge, compute a robust policy for the MDP; if a regret-based measure indicates the user is satisfied, we are done. Otherwise, select a query, pose it to the user, and use the response to refine the reward knowledge before recomputing the policy.
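As a rough sketch of this loop (purely illustrative; compute_minimax_regret_policy, select_query, ask_user, and the refine method are hypothetical placeholders, not the presented system):

```python
# Illustrative sketch of the elicitation loop; all helper functions below
# are hypothetical placeholders, not part of the presented work.

def elicit_reward(mdp, reward_polytope, tolerance):
    while True:
        # Compute a policy that is robust to the remaining reward uncertainty.
        policy, minimax_regret = compute_minimax_regret_policy(mdp, reward_polytope)

        # Stop once the worst-case (minimax) regret is acceptably small.
        if minimax_regret <= tolerance:
            return policy

        # Otherwise select an informative query, ask the user, and use the
        # response to shrink the feasible reward set.
        query = select_query(mdp, reward_polytope, policy)
        response = ask_user(query)
        reward_polytope = reward_polytope.refine(query, response)
```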

  6. Introduction: Considerations. (1) The efficient computation of robust policies using minimax regret.

  7. Introduction: Considerations. (2) The effective selection of queries, in terms of query type and query parameters.

  8. Outline 1. MMR Computation 2. Reward Elicitation 3. IRMDP Applications 4. Research Plan

  9. MMR Computation: Exact (MIP). We use constraint generation, solving max regret with a mixed integer program (MIP) [UAI09]. The MIP explicitly encodes the adversary's policy choices using binary indicator variables, requires O(|S||A|) constraints and variables, and is effective for small MDPs (fewer than 10 states).
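A hedged sketch of the constraint generation scheme follows; solve_master (a linear program over the candidate policy and the constraints generated so far) and solve_max_regret_mip (standing in for the MIP above) are hypothetical placeholders:

```python
# Sketch only: solve_master and solve_max_regret_mip are hypothetical
# placeholders for the master LP and the max-regret MIP.

def minimax_regret_by_constraint_generation(mdp, reward_polytope, eps=1e-6):
    witnesses = []      # generated (adversary policy, reward) pairs
    while True:
        # Master problem: best candidate policy f (and regret bound delta)
        # against the adversarial witnesses generated so far.
        f, delta = solve_master(mdp, witnesses)

        # Subproblem: true max regret of f over the full reward polytope.
        g, r, mr = solve_max_regret_mip(mdp, reward_polytope, f)

        if mr <= delta + eps:
            return f, mr             # delta has converged to minimax regret
        witnesses.append((g, r))     # add the most violated constraint
```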

  10. MMR Computation: Approximations. Alternating optimization: compute the regret-maximizing adversary policy for a fixed reward, then the regret-maximizing reward for a fixed adversary policy, and repeat. Linear relaxation [UAI09]: we remove the integrality constraints on the binary variables encoding the policy in the MIP; the resulting adversarial reward is used to construct an approximation to max regret. Both approximations exhibit low empirical error; however, we have no bound on this error.
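A rough sketch of the alternating optimization heuristic for approximating the max regret of a fixed policy, assuming hypothetical helpers optimal_policy (returns the occupancy frequencies of an optimal policy for a fixed reward) and best_reward_for (an LP over the reward polytope):

```python
# Sketch of alternating optimization; optimal_policy and best_reward_for
# are hypothetical placeholders. The result is a lower bound on max regret,
# consistent with the slide's note that no error bound is available.

def alternating_max_regret(mdp, reward_polytope, f, r_init, max_iters=50):
    r, regret = r_init, float("-inf")
    for _ in range(max_iters):
        # Fixed reward r: the regret-maximizing adversary is an optimal
        # policy for r (represented by its occupancy frequencies g).
        g = optimal_policy(mdp, r)
        # Fixed adversary g: find the reward in the polytope that
        # maximizes the regret g.r - f.r.
        r, new_regret = best_reward_for(reward_polytope, g, f)
        if new_regret <= regret + 1e-9:
            break                    # local convergence
        regret = new_regret
    return regret
```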

  11. MMR Computation: Nondominated Policies. Given the set Γ of policies that are nondominated w.r.t. the feasible rewards, we gain significant computational leverage for computing MMR: in the constraint generation framework, max regret is computed by finding the regret-maximizing reward for each nondominated policy [AAAI10]. To generate Γ, we leverage similarities to POMDP solution techniques and develop two generation algorithms: πWitness [AAAI10] and Linear Support [TechRep].
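To make this concrete, here is a minimal sketch assuming each policy is represented by its occupancy-frequency vector (so its value is linear in the reward, V(r) = f·r) and the feasible rewards form the polytope {r : A_ub r ≤ b_ub}; these representations are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch: max regret of a candidate policy (frequency vector f)
# given a set of nondominated policies, via one LP per adversary policy.
import numpy as np
from scipy.optimize import linprog

def max_regret(f, nondominated, A_ub, b_ub):
    worst = 0.0
    for g in nondominated:
        # Regret-maximizing reward for adversary g: maximize (g - f).r over
        # the reward polytope, written as a minimization for linprog.
        res = linprog(c=-(g - f), A_ub=A_ub, b_ub=b_ub, bounds=(None, None))
        if res.success:
            worst = max(worst, -res.fun)
    return worst
```

Minimax regret is then the minimum of this quantity over the candidate policies (for example, over the nondominated set itself).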

  12. MMR Computation: Nondominated Policies. Exact: (1) we show that, given a polynomial number of nondominated policies, minimax regret can be computed in polynomial time [AAAI10]; (2) this empirically outperforms the related approach of [XM09]; (3) the bottleneck is the size of the set of nondominated policies. Approximation: a partial set of nondominated policies produces a lower bound on minimax regret.

  13. MMR Computation: Generating Γ with πWitness [AAAI10]. Idea: given a partial set Γ of nondominated policies, construct the "local adjustments" of each f ∈ Γ and look for reward "witnesses" that testify to a new nondominated policy; find the optimal policy at the witness and add it to the set of nondominated policies. Properties: runtime is polynomial in the number of nondominated policies generated; no anytime error guarantee.

  14. MMR Computation: Generating Γ with Linear Support [TechRep10]. Idea: each generated nondominated policy induces a convex "nondominated region" (w.r.t. Γ); the maximal error occurs at the vertices of these regions, so we find the optimal policy at each vertex and add the policy with maximal error to Γ. Properties: relies on vertex enumeration (exponential in the worst case); provides an anytime error bound; empirically, error drops quickly.

  15. MMR Computation: Online Adjustment of Γ [TechRep10]. Idea: a small Γ leads to efficient minimax regret computation. During elicitation the volume of the reward polytope is reduced, so policies in Γ can become dominated; we can eliminate these dominated policies and add new nondominated policies, reducing approximation error. Empirically, elicitation that begins with a Γ with high error quickly sees the error reduced to zero.
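A sketch of the dominance test behind this pruning step, under the same illustrative assumptions as before (policies as occupancy-frequency vectors, rewards constrained by {r : A_ub r ≤ b_ub}): a policy remains nondominated only if some feasible reward makes it at least as good as every other policy in Γ.

```python
# Sketch of a dominance check via a single LP: maximize the margin eps such
# that (f - g).r >= eps for every other policy g, subject to r lying in the
# current reward polytope. If the optimal eps is nonnegative, f is still
# optimal for some feasible reward and stays in Gamma.
import numpy as np
from scipy.optimize import linprog

def still_nondominated(f, others, A_ub, b_ub):
    n = len(f)
    c = np.zeros(n + 1)
    c[-1] = -1.0                                    # maximize eps
    # (f - g).r >= eps  <=>  -(f - g).r + eps <= 0
    rows = [np.append(-(f - g), 1.0) for g in others]
    rows += [np.append(a, 0.0) for a in A_ub]       # reward polytope rows
    res = linprog(c, A_ub=np.array(rows),
                  b_ub=np.concatenate([np.zeros(len(others)), np.asarray(b_ub)]),
                  bounds=(None, None))
    return res.success and -res.fun >= 0.0
```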

  16. MMR Computation: Summary. Generating Γ: πWitness; Linear Support. Computing MMR: constraint generation, with max regret computed using the MIP, the linear relaxation, or alternating optimization, over either the full set Γ of nondominated policies or an approximate set; Γ can also be adjusted online.

  17. MMR Computation: Characterizing the Set of Nondominated Policies. Question: is there a simple characterization of an MDP that allows us to quantify (1) the number of nondominated policies, and (2) whether the set of nondominated policies is easy to approximate?

  18. MMR Computation: Fully Factored MDPs. It is desirable to adapt our computational approaches to incorporate structure in the transition model. This is important because (1) many existing MDP models encode this structure, and (2) leveraging transition structure could speed up our algorithms. However, realizing these benefits requires reimplementing most of our algorithms.

  19. MMR Computation: Prior on Reward, Constructing Γ. Given a prior over feasible reward functions, adaptations of point-based POMDP algorithms may allow us to generate very small approximate sets Γ with low expected error. However, such sets will offer no guarantee on worst-case error.

  20. Outline 1. MMR Computation 2. Reward Elicitation 3. IRMDP Applications 4. Research Plan

  21. Reward Elicitation: Simple Elicitation Strategies [UAI09]. We used bound queries of the form "Is r(s,a) ≥ b?" with the following parameter selection strategies. (1) Halve the Largest Gap: select the state-action pair with the largest "gap" between its upper and lower reward bounds, Δ(s,a) = max_{r'∈R} r'(s,a) − min_{r∈R} r(s,a), i.e. choose argmax_{s*∈S, a*∈A} Δ(s*,a*). (2) Current Solution: weight each gap by the occupancy frequencies (f and g) found in the solution to minimax regret, and choose the state-action pair with the largest weighted gap, argmax_{s*∈S, a*∈A} max{ f(s*,a*) Δ(s*,a*), g(s*,a*) Δ(s*,a*) }.
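A minimal sketch of these two strategies, assuming reward uncertainty is stored as per state-action upper and lower bound arrays, and taking f and g to be the occupancy frequencies of the minimax-regret policy and its adversarial witness (one reading of "the occupancy frequencies found in the solution to minimax regret"):

```python
# Sketch of the two bound-query selection strategies; upper, lower, f, g
# are numpy arrays of shape [num_states, num_actions].
import numpy as np

def halve_largest_gap(upper, lower):
    gap = upper - lower
    s, a = np.unravel_index(np.argmax(gap), gap.shape)
    b = (upper[s, a] + lower[s, a]) / 2.0
    return (s, a), b                      # ask "Is r(s,a) >= b?"

def current_solution(upper, lower, f, g):
    gap = upper - lower
    weighted = np.maximum(f * gap, g * gap)
    s, a = np.unravel_index(np.argmax(weighted), weighted.shape)
    b = (upper[s, a] + lower[s, a]) / 2.0
    return (s, a), b
```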

  22. Reward Elicitation: Simple Elicitation Strategies, Assessment. We analyzed the effectiveness of elicitation on randomly generated IRMDPs and on the autonomic computing domain [UAI09]. Results: (1) both strategies performed well, but Current Solution reduced regret to zero in half as many queries; (2) fewer than 2 queries per reward parameter were required to find a provably optimal policy; (3) Current Solution effectively reduces regret while leaving a large amount of reward uncertainty.

  23. Reward Elicitation: Sequential Queries. The sequential nature of the MDP motivates novel modes of interacting with the user, including queries about: (1) policies, π : S → A; (2) trajectories, s_1, a_1, s_2, a_2, ..., a_{n-1}, s_n; (3) occupancy frequencies, f(s,a). It is unreasonable to expect a user to specify a numerical value for a policy, trajectory or occupancy frequency; comparisons may be more manageable.

  24. Reward Elicitation: Sequential Queries. Policy comparison: "Is policy π preferred to policy π'?" Trajectory comparison: "Is the potential trajectory s_1, a_1, ..., a_{n-1}, s_n preferred to trajectory s'_1, a'_1, ..., a'_{n-1}, s'_n?" Responses to both types of query imply linear constraints on reward; however, trajectory comparisons may be easier to reason about than policy comparisons.
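As an illustration of how a comparison response yields a linear constraint, here is a small sketch for trajectory comparisons, assuming trajectories are given as sequences of (state, action) indices into a flat reward vector and that trajectory value is the discounted sum of rewards (the discounting is an assumption, not stated on the slide):

```python
# Sketch: turn "trajectory A preferred to trajectory B" into a linear
# constraint w.r >= 0 on the reward vector r.
import numpy as np

def trajectory_constraint(traj_a, traj_b, num_sa, gamma=0.95):
    w = np.zeros(num_sa)
    for t, sa in enumerate(traj_a):
        w[sa] += gamma ** t
    for t, sa in enumerate(traj_b):
        w[sa] -= gamma ** t
    return w          # add w.r >= 0 (i.e. -w.r <= 0) to the reward polytope
```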

  25. Reward Elicitation: Occupancy Frequency Comparisons. "Are occupancy frequencies f preferred to g?" A response imposes a linear constraint on reward; "yes" implies: ∑_{i=0}^{n} f(s_i, a_i) r(s_i, a_i) ≥ ∑_{i=0}^{n} g(s_i, a_i) r(s_i, a_i). Given a factored model with state variables x = x_1, x_2, ..., x_n and additively decomposed reward r(x) = r_1(x_1) + r_2(x_2) + ..., such queries allow a user to directly specify trade-offs pertaining to portions of the policy.

  26. Reward Elicitation: Occupancy Frequency Comparisons. [Example query; the specific values of x_1 are shown graphically on the slide:] "Do you prefer payoff r(x_1) with frequency f(x_1) for each value of x_1, or payoff r(x_1) with frequency g(x_1) for each value of x_1?" A "yes" response implies ∑_{x_1} f(x_1) r(x_1) ≥ ∑_{x_1} g(x_1) r(x_1), where f(x_1) = ∑_{x_2} ... ∑_{x_n} f(x_1, ..., x_n) and g(x_1) = ∑_{x_2} ... ∑_{x_n} g(x_1, ..., x_n).
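A tiny sketch of the marginalization this query relies on, assuming (purely for illustration) that joint occupancy frequencies are stored as a numpy array with one axis per state variable x_1, ..., x_n:

```python
# Sketch: marginal occupancy frequency over x1 and the induced comparison.
import numpy as np

def marginal_frequency(joint, axis=0):
    # f(x1) = sum over x2,...,xn of f(x1,...,xn)
    others = tuple(i for i in range(joint.ndim) if i != axis)
    return joint.sum(axis=others)

def yes_implies(joint_f, joint_g, r1):
    # "Yes" implies sum_x1 f(x1) r1(x1) >= sum_x1 g(x1) r1(x1)
    return marginal_frequency(joint_f) @ r1 >= marginal_frequency(joint_g) @ r1
```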

  27. Reward Elicitation: Optimal Query Selection. The "value" of a query can be measured by the expected value of information (EVOI). For query parameters φ and response ρ: with a prior, EVOI(φ, R) = MMR(R) − ∑_ρ Pr(ρ | φ) MMR(R_ρ); without a prior, EVOI(φ, R) = min_ρ [ MMR(R) − MMR(R_ρ) ], where R_ρ is the reward set refined by response ρ. We can attempt to select queries that (1) myopically maximize EVOI for a single query, or (2) maximize EVOI for an entire sequence of queries.
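A hedged sketch of myopic EVOI-based query selection, where mmr, possible_responses, and refine are hypothetical placeholders (mmr computes minimax regret over a reward set, refine returns the set updated with a response):

```python
# Sketch of myopic query selection by EVOI; mmr, possible_responses, and
# reward_set.refine are hypothetical placeholders.

def myopic_evoi(reward_set, query, response_prior=None):
    base = mmr(reward_set)
    if response_prior is not None:
        # With a prior: expected reduction in minimax regret.
        return base - sum(response_prior[rho] * mmr(reward_set.refine(query, rho))
                          for rho in possible_responses(query))
    # Without a prior: worst-case reduction over possible responses.
    return min(base - mmr(reward_set.refine(query, rho))
               for rho in possible_responses(query))

def select_query(reward_set, candidate_queries, response_prior=None):
    return max(candidate_queries,
               key=lambda q: myopic_evoi(reward_set, q, response_prior))
```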
