SLIDE 1

Off-Policy Evaluation via Off-Policy Classification

Alex Irpan, Kanishka Rao, Konstantinos Bousmalis, Chris Harris, Julian Ibarz, Sergey Levine
Topic: Imitation - Inverse RL
Presenter: Ning (Angela) Ye

SLIDE 2

Overview

  • Motivation
  • Contributions
  • Background
  • Method
  • Results
  • Limitations
SLIDE 3

Overview

  • Motivation
  • Contributions
  • Background
  • Method
  • Results
  • Limitations
SLIDE 4

Motivation

  • Typically, performance of deep RL algorithms is evaluated via on-policy interactions
  • But comparing models in a real-world environment is costly
  • Examines off-policy policy evaluation (OPE) for value-based methods
SLIDE 5

Motivation (cont.)

  • Existing OPE metrics rely on either a model of the environment or importance sampling (IS)
  • OPE is most useful in the off-policy RL setting, where we expect to use real-world data as a "validation set"
  • Such data is hard to use with IS, since the behavior policy is often unknown
  • For high-dimensional observations, models of the environment can be difficult to fit

SLIDE 6

Overview

  • Motivation
  • Contributions
  • Background
  • Method
  • Results
  • Limitations
SLIDE 7

Contributions

  • Framed OPE as a positive-unlabeled (PU) classification problem and developed two scores: OPC and SoftOPC
  • Rely on neither IS nor model learning
  • Correlate well with performance (on both simulated and real-world tasks)
  • Can be used with complex data to evaluate the expected performance of off-policy RL methods
  • The proposed metrics outperform a variety of baseline methods, including in a simulation-to-reality transfer scenario

SLIDE 8

Overview

  • Motivation
  • Contributions
  • Background
  • Method
  • Results
  • Limitations
SLIDE 9

General Background (MDP)

  • Focus on finite-horizon Markov decision processes (MDPs)
  • Assume a binary reward MDP, which satisfies:
  • Discount factor $\gamma = 1$
  • Reward $r_t = 0$ at all intermediate steps
  • Final reward $r_T \in \{0, 1\}$
  • Learn Q-functions $Q(s, a)$ to evaluate policies: $\pi(s) = \operatorname{argmax}_{a} Q(s, a)$ (a minimal sketch of this greedy policy follows below)

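A minimal sketch (not from the paper; `q_net` and `env` are hypothetical placeholders) of the greedy policy induced by a learned Q-function, and of a finite-horizon rollout whose return is the final binary reward:

```python
import numpy as np

# Assumptions: `q_net(state)` returns a vector of Q-values, one per discrete
# action; `env` has reset() -> state and step(action) -> (state, reward, done).

def greedy_action(q_net, state):
    """pi(s) = argmax_a Q(s, a) for a discrete action space."""
    return int(np.argmax(q_net(state)))

def episode_return(env, q_net, horizon):
    """Finite-horizon rollout. With gamma = 1 and r_t = 0 at intermediate
    steps, the undiscounted return equals the final reward r_T in {0, 1}."""
    state = env.reset()
    total = 0.0
    for _ in range(horizon):
        state, reward, done = env.step(greedy_action(q_net, state))
        total += reward  # only the final step can contribute
        if done:
            break
    return total
```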
SLIDE 10

General Background (Positive-Unlabeled Learning)

  • Positive-unlabeled (PU) learning learns binary classification from partially labeled data
  • Sufficient to learn a binary classifier if the positive class prior $p(y = 1)$ is known
  • Loss over negatives can be indirectly estimated from $p(y = 1)$
SLIDE 11

General Background (Positive-Unlabeled Learning)

  • Want to evaluate the loss $\ell(g(x), y)$ of a classifier $g$ over negative examples $(x, y = 0)$

$p(x) = p(x \mid y = 1)\, p(y = 1) + p(x \mid y = 0)\, p(y = 0)$

  • Using $\mathbb{E}_X[f(x)] = \int_x p(x)\, f(x)\, dx$:

$\mathbb{E}_X[f(x)] = p(y = 1)\, \mathbb{E}_{X \mid Y = 1}[f(x)] + p(y = 0)\, \mathbb{E}_{X \mid Y = 0}[f(x)]$

  • Letting $f(x) = \ell(g(x), 0)$ isolates the loss on negatives:

$p(y = 0)\, \mathbb{E}_{X \mid Y = 0}[\ell(g(x), 0)] = \mathbb{E}_X[\ell(g(x), 0)] - p(y = 1)\, \mathbb{E}_{X \mid Y = 1}[\ell(g(x), 0)]$

(a small numeric check of this decomposition follows below)
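As a sanity check on this decomposition, here is a small synthetic experiment (illustrative only; the distributions, the prior, and the function `f` are arbitrary choices, not from the paper). It shows that the term over negatives can be recovered from unlabeled data plus positives once the prior $p(y = 1)$ is known:

```python
import numpy as np

# Check: E_X[f(x)] = p(y=1) E_{X|Y=1}[f(x)] + p(y=0) E_{X|Y=0}[f(x)]
rng = np.random.default_rng(0)
p_pos = 0.3                                     # known positive class prior p(y=1)
y = rng.random(200_000) < p_pos                 # latent labels (never observed in PU)
x = np.where(y, rng.normal(2.0, 1.0, y.size),   # positives
                rng.normal(-1.0, 1.0, y.size))  # negatives

f = lambda v: 1.0 / (1.0 + np.exp(-v))          # any bounded function of x

# Negative-class term computed directly (requires the hidden labels) ...
neg_term_direct = (1 - p_pos) * f(x[~y]).mean()
# ... versus the PU estimate, which only needs unlabeled data, positives, and p(y=1).
neg_term_pu = f(x).mean() - p_pos * f(x[y]).mean()
print(neg_term_direct, neg_term_pu)             # the two values agree closely
```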
SLIDE 12

General Background (Definitions)

  • In a binary reward MDP, $(s_t, a_t)$ is feasible if an optimal $\pi^*$ has non-zero probability of achieving success after taking $a_t$ in $s_t$
  • $(s_t, a_t)$ is catastrophic if even an optimal $\pi^*$ has zero probability of succeeding after $a_t$ is taken
  • Therefore, the return of a trajectory $\tau$ is 1 only if all $(s_t, a_t)$ in $\tau$ are feasible

SLIDE 13

Overview

  • Motivation
  • Contributions
  • Background
  • Method
  • Results
  • Limitations
SLIDE 14

OPE Method (Theorem)

  • Theorem: $R(\pi) \ge 1 - T(\epsilon + c)$
  • $\epsilon = \frac{1}{T} \sum_{t=1}^{T} \epsilon_t$ is the average error over all $(s_t, a_t)$, with $\epsilon_t = \mathbb{E}_{\rho^{+}_{t,\pi}}\big[\sum_{a \in \mathcal{C}(s_t)} \pi(a \mid s_t)\big]$
  • $\mathcal{C}(s)$: set of catastrophic actions at state $s$
  • $\rho^{+}_{t,\pi}$: state distribution at time $t$, given that $\pi$ was followed, all of its previous actions were feasible, and $s_t$ is feasible
  • $c(s_t, a_t)$: probability that the stochastic dynamics bring a feasible $(s_t, a_t)$ to a catastrophic $s_{t+1}$, with $c = \max_{s, a} c(s, a)$ (a worked numeric example of the bound follows below)

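To make the bound concrete, a worked example with illustrative numbers (not taken from the paper): with horizon $T = 15$, average per-step error $\epsilon = 0.01$, and dynamics slack $c = 0.005$,

$R(\pi) \ge 1 - T(\epsilon + c) = 1 - 15 \times 0.015 = 0.775,$

so the policy succeeds in at least 77.5% of episodes, and the guarantee degrades linearly with the horizon.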
SLIDE 15

OPE Method (Missing negative labels)

  • Estimate $\epsilon$, the probability that $\pi$ takes a catastrophic action, i.e., that $(s, \pi(s))$ is a false positive: $\epsilon = p(y = 0)\, \mathbb{E}_{X \mid Y = 0}[\ell(g(x), 0)]$
  • Recall

$p(y = 0)\, \mathbb{E}_{X \mid Y = 0}[\ell(g(x), 0)] = \mathbb{E}_{X, Y}[\ell(g(x), 0)] - p(y = 1)\, \mathbb{E}_{X \mid Y = 1}[\ell(g(x), 0)]$

  • We obtain

$\epsilon = \mathbb{E}_{(s, a)}[\ell(Q(s, a), 0)] - p(y = 1)\, \mathbb{E}_{(s, a), y = 1}[\ell(Q(s, a), 0)]$

(a code sketch of this estimator follows below)

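A sketch of this estimator in code (the array names, the loss argument, and the 0-1 threshold are assumptions of this sketch, not the paper's notation): `q_all` holds $Q(s, a)$ over all validation transitions (treated as unlabeled), `q_pos` holds $Q(s, a)$ over transitions known to be feasible, e.g. those from successful episodes, and `p_pos` is the class prior $p(y = 1)$:

```python
import numpy as np

def estimate_epsilon(q_all, q_pos, p_pos, loss):
    """PU estimate of the probability that the policy picks a catastrophic action:
    eps = E_{(s,a)}[loss(Q, 0)] - p(y=1) * E_{(s,a), y=1}[loss(Q, 0)]."""
    return loss(q_all, 0).mean() - p_pos * loss(q_pos, 0).mean()

def zero_one_loss(q, label, threshold=0.5):
    """0-1 loss of the thresholded classifier 1[Q(s, a) > threshold]."""
    pred = (q > threshold).astype(float)
    return np.abs(pred - label)
```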
SLIDE 16

OPE Method (Off-policy classification)

  • Off-policy classification (OPC) score: the negative loss when $\ell$ is the 0-1 loss
  • SoftOPC: the negative loss when $\ell$ is the soft loss function

$\ell(Q(s, a), y) = (1 - 2y)\, Q(s, a)$

(both scores are sketched in code below)

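Putting the last two slides together, a sketch of both scores as negative estimated losses (helper names are mine; `q_all`, `q_pos`, and `p_pos` are as in the previous sketch, and the 0-1 threshold is an assumed hyperparameter):

```python
def opc_score(q_all, q_pos, p_pos, threshold=0.5):
    """OPC: negative PU-estimated 0-1 loss of the thresholded Q 'classifier'."""
    false_pos_all = (q_all > threshold).mean()   # l(Q, 0) averaged over unlabeled data
    false_pos_pos = (q_pos > threshold).mean()   # l(Q, 0) averaged over positives
    return -(false_pos_all - p_pos * false_pos_pos)

def soft_opc_score(q_all, q_pos, p_pos):
    """SoftOPC: the same estimate with the soft loss l(Q, y) = (1 - 2y) * Q."""
    return -(q_all.mean() - p_pos * q_pos.mean())
```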
SLIDE 17

OPE Method (Evaluating OPE metrics)

  • Standard method: report MSE with respect to the true episode return
  • Our metrics do not estimate episode return directly
  • Instead, train many Q-functions with different learning algorithms
  • Evaluate true return of the equivalent argmax policy for each Q-function
  • Compare correlation of the metric to true return
  • Report the coefficient of determination $R^2$ of the line of best fit and the Spearman rank correlation $\rho$ (see the sketch below)

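A sketch of this evaluation protocol (a hypothetical helper; assumes one metric value and one measured true return per trained Q-function):

```python
import numpy as np
from scipy import stats

def correlation_summary(metric_values, true_returns):
    """How well a metric predicts and ranks true returns across Q-functions."""
    metric_values = np.asarray(metric_values)
    true_returns = np.asarray(true_returns)
    fit = stats.linregress(metric_values, true_returns)    # line of best fit
    rho, _ = stats.spearmanr(metric_values, true_returns)  # rank correlation
    return {"r_squared": fit.rvalue ** 2, "spearman_rho": rho}
```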
SLIDE 18

Baseline Metrics

  • Temporal-difference (TD) error: the standard Q-learning training loss
  • Discounted sum of advantages: $\sum_t \gamma^t A^{\pi}(s_t, a_t)$; relates $V^{\pi_b}(s) - V^{\pi}(s)$ to the sum of discounted advantages over data from the behavior policy $\pi_b$ (simplified sketches of the TD error and this quantity follow below)
  • Monte Carlo corrected (MCC) error: rearranges the discounted sum of advantages into a squared error

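Simplified sketches of the TD error and the discounted sum of advantages (interfaces as in the earlier sketches; the exact forms used in the paper are not reproduced on this slide):

```python
import numpy as np

def td_error(q_net, transitions, gamma=1.0):
    """Mean squared one-step Bellman (TD) error over validation transitions,
    where `transitions` is a hypothetical list of (s, a, r, s_next, done)."""
    errors = []
    for s, a, r, s_next, done in transitions:
        target = r + (0.0 if done else gamma * np.max(q_net(s_next)))
        errors.append((q_net(s)[a] - target) ** 2)
    return float(np.mean(errors))

def discounted_sum_of_advantages(q_net, episode, gamma=1.0):
    """sum_t gamma^t A(s_t, a_t), with A(s, a) = Q(s, a) - max_a' Q(s, a')
    for the greedy policy; `episode` is a list of (s, a) pairs."""
    total = 0.0
    for t, (s, a) in enumerate(episode):
        q = q_net(s)
        total += gamma ** t * (q[a] - np.max(q))
    return total
```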
SLIDE 19

Overview

  • Motivation
  • Contributions
  • Background
  • Method
  • Results
  • Limitations
SLIDE 20

Experimental Results (Simple Environments)

  • Performance against stochastic dynamics
SLIDE 21

Experimental Results (Vision-Based Robotic Grasping)

  • Performance on simulated and real versions of a vision-based grasping task

SLIDE 22

Discussion of results

  • OPC and SoftOPC consistently outperformed the baselines
  • SoftOPC ranks policies by real-world performance more reliably than the baselines
  • SoftOPC performs slightly better than OPC

SLIDE 23

Overview

  • Motivation
  • Contributions
  • Background
  • Method
  • Results
  • Limitations
SLIDE 24

Limitations

  • Key limitation: restricted task domain
  • Assumes an agent either succeeds or fails
  • This is difficult to model for complicated tasks with a long time horizon
  • Could not compare against many OPE baselines that use IS and model-learning techniques
  • High correlation on the real-world robotic grasping task, but only comparable to the sum of discounted advantages in simulation

SLIDE 25

Contributions (Recap)

  • It is difficult and expensive to evaluate performance in real-world environments
  • Many off-policy RL methods are value-based and do not require any knowledge of the policy that generated the real-world training data
  • Such methods are hard to use with IS and model learning
  • Treated evaluation as a classification problem and proposed OPC and SoftOPC, negative classification losses to be used with off-policy Q-learning algorithms
  • These scores can predict the relative performance of different policies in generalization scenarios
  • The proposed OPE metrics outperform a variety of baseline methods, including in a simulation-to-reality transfer scenario

SLIDE 26

Take Home Questions

  • What conditions must the MDP satisfy for OPE via OPC to apply?
  • What is a natural choice for the decision function?
  • How are the classification scores determined? Which losses are used?
  • Which two correlations are used to evaluate the metrics?