Regret-Based Reward Elicitation for Markov Decision Processes
Kevin Regan, University of Toronto



  1. Research Proposal: Regret-Based Reward Elicitation for Markov Decision Processes. Kevin Regan, University of Toronto, Department of Computer Science.

  2. Introduction: Motivation. Setting: computational approaches to sequential decision making under uncertainty, specifically MDPs. These approaches require a model of dynamics and a model of rewards.

  3. Introduction: Motivation. Except in some simple cases, the specification of rewards is problematic: preferences about which states/actions are "good" and "bad" must be translated into precise numerical rewards; it is time consuming to specify a reward for every state/action; and rewards can vary from user to user.

  4. Introduction: Motivation. "My research goal is to develop a minimax regret-based framework for the incremental elicitation of reward functions for MDPs that is cognitively and computationally effective."

  5. Introduction: Reward Elicitation. The elicitation loop: given the current reward knowledge, compute a robust policy for the MDP; if a regret-based measure indicates the user is satisfied, we are done. Otherwise, select a query, pose it to the user, and use the response to refine the reward knowledge before recomputing the policy.
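As a rough sketch of this loop (purely illustrative; compute_minimax_regret_policy, select_query, ask_user, and the refine method are hypothetical placeholders, not the presented system):

```python
# Illustrative sketch of the elicitation loop; all helper functions below
# are hypothetical placeholders, not part of the presented work.

def elicit_reward(mdp, reward_polytope, tolerance):
    while True:
        # Compute a policy that is robust to the remaining reward uncertainty.
        policy, minimax_regret = compute_minimax_regret_policy(mdp, reward_polytope)

        # Stop once the worst-case (minimax) regret is acceptably small.
        if minimax_regret <= tolerance:
            return policy

        # Otherwise select an informative query, ask the user, and use the
        # response to shrink the feasible reward set.
        query = select_query(mdp, reward_polytope, policy)
        response = ask_user(query)
        reward_polytope = reward_polytope.refine(query, response)
```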

  6. Introduction: Considerations. (1) The efficient computation of robust policies using minimax regret.

  7. Introduction: Considerations. (2) The effective selection of queries, in terms of query type and query parameters.

  8. Outline 1. MMR Computation 2. Reward Elicitation 3. IRMDP Applications 4. Research Plan

  9. MMR Computation: Exact (MIP). We use constraint generation, solving max regret with a mixed integer program (MIP) [UAI09]. The MIP explicitly encodes the adversary's policy choices using binary indicator variables, requires O(|S||A|) constraints and variables, and is effective for small MDPs (fewer than 10 states).
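A hedged sketch of the constraint generation scheme follows; solve_master (a linear program over the candidate policy and the constraints generated so far) and solve_max_regret_mip (standing in for the MIP above) are hypothetical placeholders:

```python
# Sketch only: solve_master and solve_max_regret_mip are hypothetical
# placeholders for the master LP and the max-regret MIP.

def minimax_regret_by_constraint_generation(mdp, reward_polytope, eps=1e-6):
    witnesses = []      # generated (adversary policy, reward) pairs
    while True:
        # Master problem: best candidate policy f (and regret bound delta)
        # against the adversarial witnesses generated so far.
        f, delta = solve_master(mdp, witnesses)

        # Subproblem: true max regret of f over the full reward polytope.
        g, r, mr = solve_max_regret_mip(mdp, reward_polytope, f)

        if mr <= delta + eps:
            return f, mr             # delta has converged to minimax regret
        witnesses.append((g, r))     # add the most violated constraint
```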

  10. MMR Computation: Approximations. Alternating optimization: compute the regret-maximizing adversary policy for a fixed reward, then the regret-maximizing reward for a fixed adversary policy, and repeat. Linear relaxation [UAI09]: we remove the integrality constraints on the binary variables encoding the policy in the MIP; the resulting adversarial reward is used to construct an approximation to max regret. Both approximations exhibit low empirical error; however, we have no bound on this error.
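A rough sketch of the alternating optimization heuristic for approximating the max regret of a fixed policy, assuming hypothetical helpers optimal_policy (returns the occupancy frequencies of an optimal policy for a fixed reward) and best_reward_for (an LP over the reward polytope):

```python
# Sketch of alternating optimization; optimal_policy and best_reward_for
# are hypothetical placeholders. The result is a lower bound on max regret,
# consistent with the slide's note that no error bound is available.

def alternating_max_regret(mdp, reward_polytope, f, r_init, max_iters=50):
    r, regret = r_init, float("-inf")
    for _ in range(max_iters):
        # Fixed reward r: the regret-maximizing adversary is an optimal
        # policy for r (represented by its occupancy frequencies g).
        g = optimal_policy(mdp, r)
        # Fixed adversary g: find the reward in the polytope that
        # maximizes the regret g.r - f.r.
        r, new_regret = best_reward_for(reward_polytope, g, f)
        if new_regret <= regret + 1e-9:
            break                    # local convergence
        regret = new_regret
    return regret
```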

  11. MMR Computation: Nondominated Policies. Given the set Γ of policies that are nondominated w.r.t. the feasible rewards, we gain significant computational leverage for computing MMR: in the constraint generation framework, max regret is computed by finding the regret-maximizing reward for each nondominated policy [AAAI10]. To generate Γ, we leverage similarities to POMDP solution techniques and develop two generation algorithms: πWitness [AAAI10] and Linear Support [TechRep].
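To make this concrete, here is a minimal sketch assuming each policy is represented by its occupancy-frequency vector (so its value is linear in the reward, V(r) = f·r) and the feasible rewards form the polytope {r : A_ub r ≤ b_ub}; these representations are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch: max regret of a candidate policy (frequency vector f)
# given a set of nondominated policies, via one LP per adversary policy.
import numpy as np
from scipy.optimize import linprog

def max_regret(f, nondominated, A_ub, b_ub):
    worst = 0.0
    for g in nondominated:
        # Regret-maximizing reward for adversary g: maximize (g - f).r over
        # the reward polytope, written as a minimization for linprog.
        res = linprog(c=-(g - f), A_ub=A_ub, b_ub=b_ub, bounds=(None, None))
        if res.success:
            worst = max(worst, -res.fun)
    return worst
```

Minimax regret is then the minimum of this quantity over the candidate policies (for example, over the nondominated set itself).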

  12. MMR Computation: Nondominated Policies. Exact: (1) we show that, given a polynomial number of nondominated policies, minimax regret can be computed in polynomial time [AAAI10]; (2) this empirically outperforms the related approach of [XM09]; (3) the bottleneck is the size of the set of nondominated policies. Approximation: a partial set of nondominated policies produces a lower bound on minimax regret.

  13. MMR Computation: Generating Γ with πWitness [AAAI10]. Idea: given a partial set Γ of nondominated policies, construct the "local adjustments" of each f ∈ Γ and look for reward "witnesses" that testify to a new nondominated policy; find the optimal policy at the witness and add it to the set of nondominated policies. Properties: runtime is polynomial in the number of nondominated policies generated; no anytime error guarantee.

  14. MMR Computation: Generating Γ with Linear Support [TechRep10]. Idea: each generated nondominated policy induces a convex "nondominated region" (w.r.t. Γ); the maximal error occurs at the vertices of these regions, so we find the optimal policy at each vertex and add the policy with maximal error to Γ. Properties: relies on vertex enumeration (exponential in the worst case); provides an anytime error bound; empirically, error drops quickly.

  15. MMR Computation: Online Adjustment of Γ [TechRep10]. Idea: a small Γ leads to efficient minimax regret computation. During elicitation the volume of the reward polytope is reduced, so policies in Γ can become dominated; we can eliminate these dominated policies and add new nondominated policies, reducing approximation error. Empirically, elicitation that begins with a Γ with high error quickly sees the error reduced to zero.
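A sketch of the dominance test behind this pruning step, under the same illustrative assumptions as before (policies as occupancy-frequency vectors, rewards constrained by {r : A_ub r ≤ b_ub}): a policy remains nondominated only if some feasible reward makes it at least as good as every other policy in Γ.

```python
# Sketch of a dominance check via a single LP: maximize the margin eps such
# that (f - g).r >= eps for every other policy g, subject to r lying in the
# current reward polytope. If the optimal eps is nonnegative, f is still
# optimal for some feasible reward and stays in Gamma.
import numpy as np
from scipy.optimize import linprog

def still_nondominated(f, others, A_ub, b_ub):
    n = len(f)
    c = np.zeros(n + 1)
    c[-1] = -1.0                                    # maximize eps
    # (f - g).r >= eps  <=>  -(f - g).r + eps <= 0
    rows = [np.append(-(f - g), 1.0) for g in others]
    rows += [np.append(a, 0.0) for a in A_ub]       # reward polytope rows
    res = linprog(c, A_ub=np.array(rows),
                  b_ub=np.concatenate([np.zeros(len(others)), np.asarray(b_ub)]),
                  bounds=(None, None))
    return res.success and -res.fun >= 0.0
```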

  16. MMR Computation: Summary. Generating Γ: πWitness; Linear Support. Computing MMR: constraint generation, with max regret computed using the MIP, the linear relaxation, or alternating optimization, over either the full set Γ of nondominated policies or an approximate set; Γ can also be adjusted online.

  17. MMR Computation: Characterizing the Set of Nondominated Policies. Question: is there a simple characterization of an MDP that allows us to quantify (1) the number of nondominated policies, and (2) whether the set of nondominated policies is easy to approximate?

  18. MMR Computation: Fully Factored MDPs. It is desirable to adapt our computational approaches to incorporate structure in the transition model. This is important because (1) many existing MDP models encode this structure, and (2) leveraging transition structure could speed up our algorithms. However, realizing these benefits requires reimplementing most of our algorithms.

  19. MMR Computation: Prior on Reward, Constructing Γ. Given a prior over feasible reward functions, adaptations of point-based POMDP algorithms may allow us to generate very small approximate sets Γ with low expected error. However, such sets will offer no guarantee on worst-case error.

  20. Outline 1. MMR Computation 2. Reward Elicitation 3. IRMDP Applications 4. Research Plan

  21. Reward Elicitation: Simple Elicitation Strategies [UAI09]. We used bound queries of the form "Is r(s,a) ≥ b?" with the following parameter selection strategies. (1) Halve the Largest Gap: select the state-action pair with the largest "gap" between its upper and lower reward bounds, Δ(s,a) = max_{r'∈R} r'(s,a) − min_{r∈R} r(s,a), i.e. choose argmax_{s*∈S, a*∈A} Δ(s*,a*). (2) Current Solution: weight each gap by the occupancy frequencies (f and g) found in the solution to minimax regret, and choose the state-action pair with the largest weighted gap, argmax_{s*∈S, a*∈A} max{ f(s*,a*) Δ(s*,a*), g(s*,a*) Δ(s*,a*) }.
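A minimal sketch of these two strategies, assuming reward uncertainty is stored as per state-action upper and lower bound arrays, and taking f and g to be the occupancy frequencies of the minimax-regret policy and its adversarial witness (one reading of "the occupancy frequencies found in the solution to minimax regret"):

```python
# Sketch of the two bound-query selection strategies; upper, lower, f, g
# are numpy arrays of shape [num_states, num_actions].
import numpy as np

def halve_largest_gap(upper, lower):
    gap = upper - lower
    s, a = np.unravel_index(np.argmax(gap), gap.shape)
    b = (upper[s, a] + lower[s, a]) / 2.0
    return (s, a), b                      # ask "Is r(s,a) >= b?"

def current_solution(upper, lower, f, g):
    gap = upper - lower
    weighted = np.maximum(f * gap, g * gap)
    s, a = np.unravel_index(np.argmax(weighted), weighted.shape)
    b = (upper[s, a] + lower[s, a]) / 2.0
    return (s, a), b
```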

  22. Reward Elicitation: Simple Elicitation Strategies, Assessment. We analyzed the effectiveness of elicitation on randomly generated IRMDPs and on the autonomic computing domain [UAI09]. Results: (1) both strategies performed well, but Current Solution reduced regret to zero in half as many queries; (2) fewer than 2 queries per reward parameter were required to find a provably optimal policy; (3) Current Solution effectively reduces regret while leaving a large amount of reward uncertainty.

  23. Reward Elicitation: Sequential Queries. The sequential nature of the MDP motivates novel modes of interacting with the user, including queries about: (1) policies, π : S → A; (2) trajectories, s_1, a_1, s_2, a_2, ..., a_{n-1}, s_n; (3) occupancy frequencies, f(s,a). It is unreasonable to expect a user to specify a numerical value for a policy, trajectory or occupancy frequency; comparisons may be more manageable.

  24. Reward Elicitation: Sequential Queries. Policy comparison: "Is policy π preferred to policy π'?" Trajectory comparison: "Is the potential trajectory s_1, a_1, ..., a_{n-1}, s_n preferred to trajectory s'_1, a'_1, ..., a'_{n-1}, s'_n?" Responses to both types of query imply linear constraints on reward; however, trajectory comparisons may be easier to reason about than policy comparisons.
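As an illustration of how a comparison response yields a linear constraint, here is a small sketch for trajectory comparisons, assuming trajectories are given as sequences of (state, action) indices into a flat reward vector and that trajectory value is the discounted sum of rewards (the discounting is an assumption, not stated on the slide):

```python
# Sketch: turn "trajectory A preferred to trajectory B" into a linear
# constraint w.r >= 0 on the reward vector r.
import numpy as np

def trajectory_constraint(traj_a, traj_b, num_sa, gamma=0.95):
    w = np.zeros(num_sa)
    for t, sa in enumerate(traj_a):
        w[sa] += gamma ** t
    for t, sa in enumerate(traj_b):
        w[sa] -= gamma ** t
    return w          # add w.r >= 0 (i.e. -w.r <= 0) to the reward polytope
```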

  25. Reward Elicitation: Occupancy Frequency Comparisons. "Are occupancy frequencies f preferred to g?" A response imposes a linear constraint on reward; "yes" implies: ∑_{i=0}^{n} f(s_i, a_i) r(s_i, a_i) ≥ ∑_{i=0}^{n} g(s_i, a_i) r(s_i, a_i). Given a factored model with state variables x = x_1, x_2, ..., x_n and additively decomposed reward r(x) = r_1(x_1) + r_2(x_2) + ..., such queries allow a user to directly specify trade-offs pertaining to portions of the policy.

  26. Reward Elicitation: Occupancy Frequency Comparisons. [Example query; the specific values of x_1 are shown graphically on the slide:] "Do you prefer payoff r(x_1) with frequency f(x_1) for each value of x_1, or payoff r(x_1) with frequency g(x_1) for each value of x_1?" A "yes" response implies ∑_{x_1} f(x_1) r(x_1) ≥ ∑_{x_1} g(x_1) r(x_1), where f(x_1) = ∑_{x_2} ... ∑_{x_n} f(x_1, ..., x_n) and g(x_1) = ∑_{x_2} ... ∑_{x_n} g(x_1, ..., x_n).
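A tiny sketch of the marginalization this query relies on, assuming (purely for illustration) that joint occupancy frequencies are stored as a numpy array with one axis per state variable x_1, ..., x_n:

```python
# Sketch: marginal occupancy frequency over x1 and the induced comparison.
import numpy as np

def marginal_frequency(joint, axis=0):
    # f(x1) = sum over x2,...,xn of f(x1,...,xn)
    others = tuple(i for i in range(joint.ndim) if i != axis)
    return joint.sum(axis=others)

def yes_implies(joint_f, joint_g, r1):
    # "Yes" implies sum_x1 f(x1) r1(x1) >= sum_x1 g(x1) r1(x1)
    return marginal_frequency(joint_f) @ r1 >= marginal_frequency(joint_g) @ r1
```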

  27. Reward Elicitation: Optimal Query Selection. The "value" of a query can be measured by the expected value of information (EVOI). For query parameters φ and response ρ: with a prior, EVOI(φ, R) = MMR(R) − ∑_ρ Pr(ρ | φ) MMR(R_ρ); without a prior, EVOI(φ, R) = min_ρ [ MMR(R) − MMR(R_ρ) ], where R_ρ is the reward set refined by response ρ. We can attempt to select queries that (1) myopically maximize EVOI for a single query, or (2) maximize EVOI for an entire sequence of queries.
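A hedged sketch of myopic EVOI-based query selection, where mmr, possible_responses, and refine are hypothetical placeholders (mmr computes minimax regret over a reward set, refine returns the set updated with a response):

```python
# Sketch of myopic query selection by EVOI; mmr, possible_responses, and
# reward_set.refine are hypothetical placeholders.

def myopic_evoi(reward_set, query, response_prior=None):
    base = mmr(reward_set)
    if response_prior is not None:
        # With a prior: expected reduction in minimax regret.
        return base - sum(response_prior[rho] * mmr(reward_set.refine(query, rho))
                          for rho in possible_responses(query))
    # Without a prior: worst-case reduction over possible responses.
    return min(base - mmr(reward_set.refine(query, rho))
               for rho in possible_responses(query))

def select_query(reward_set, candidate_queries, response_prior=None):
    return max(candidate_queries,
               key=lambda q: myopic_evoi(reward_set, q, response_prior))
```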
