SLIDE 1

Off-Policy Evaluation via Off-Policy Classification

Alex Irpan, Kanishka Rao, Konstantinos Bousmalis, Chris Harris, Julian Ibarz, Sergey Levine
Topic: Imitation - Inverse RL
Presenter: Ning (Angela) Ye

SLIDE 2

Overview

  • Motivation
  • Contributions
  • Background
  • Method
  • Results
  • Limitations
SLIDE 3

Overview

  • Motivation
  • Contributions
  • Background
  • Method
  • Results
  • Limitations
SLIDE 4

Motivation

  • Typically, performance of deep RL algorithms is evaluated via on-policy interactions
  • But comparing models in a real-world environment is costly
  • Examines off-policy policy evaluation (OPE) for value-based methods
SLIDE 5

Motivation (cont.)

  • Existing OPE metrics rely on either a model of the environment or importance sampling (IS)
  • OPE is most useful in the off-policy RL setting, where we expect to use real-world data as a "validation set"
  • Such data is hard to use with IS, since the behavior policy is often unknown
  • For high-dimensional observations, models of the environment can be difficult to fit

SLIDE 6

Overview

  • Motivation
  • Contributions
  • Background
  • Method
  • Results
  • Limitations
SLIDE 7

Contributions

  • Framed OPE as a positive-unlabeled (PU) classification problem and developed two scores: OPC and SoftOPC
  • Rely on neither IS nor model learning
  • Correlate well with performance (on both simulated and real-world tasks)
  • Can be used with complex data to evaluate the expected performance of off-policy RL methods
  • The proposed metrics outperform a variety of baseline methods, including in a simulation-to-reality transfer scenario

SLIDE 8

Overview

  • Motivation
  • Contributions
  • Background
  • Method
  • Results
  • Limitations
SLIDE 9

General Background (MDP)

  • Focus on finite-horizon Markov decision processes (MDPs)
  • Assume a binary reward MDP, which satisfies:
  • Discount factor $\gamma = 1$
  • Reward $r_t = 0$ at all intermediate steps
  • Final reward $r_T \in \{0, 1\}$
  • Learn Q-functions $Q(s, a)$ to evaluate policies: $\pi(s) = \operatorname{argmax}_{a} Q(s, a)$ (a minimal sketch of this greedy policy follows below)

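A minimal sketch (not from the paper; `q_net` and `env` are hypothetical placeholders) of the greedy policy induced by a learned Q-function, and of a finite-horizon rollout whose return is the final binary reward:

```python
import numpy as np

# Assumptions: `q_net(state)` returns a vector of Q-values, one per discrete
# action; `env` has reset() -> state and step(action) -> (state, reward, done).

def greedy_action(q_net, state):
    """pi(s) = argmax_a Q(s, a) for a discrete action space."""
    return int(np.argmax(q_net(state)))

def episode_return(env, q_net, horizon):
    """Finite-horizon rollout. With gamma = 1 and r_t = 0 at intermediate
    steps, the undiscounted return equals the final reward r_T in {0, 1}."""
    state = env.reset()
    total = 0.0
    for _ in range(horizon):
        state, reward, done = env.step(greedy_action(q_net, state))
        total += reward  # only the final step can contribute
        if done:
            break
    return total
```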
SLIDE 10

General Background (Positive-Unlabeled Learning)

  • Positive-unlabeled (PU) learning learns binary classification from partially labeled data
  • Sufficient to learn a binary classifier if the positive class prior $p(y = 1)$ is known
  • Loss over negatives can be indirectly estimated from $p(y = 1)$
SLIDE 11

General Background (Positive-Unlabeled Learning)

  • Want to evaluate the loss $\ell(g(x), y)$ of a classifier $g$ over negative examples $(x, y = 0)$

$p(x) = p(x \mid y = 1)\, p(y = 1) + p(x \mid y = 0)\, p(y = 0)$

  • Using $\mathbb{E}_X[f(x)] = \int_x p(x)\, f(x)\, dx$:

$\mathbb{E}_X[f(x)] = p(y = 1)\, \mathbb{E}_{X \mid Y = 1}[f(x)] + p(y = 0)\, \mathbb{E}_{X \mid Y = 0}[f(x)]$

  • Letting $f(x) = \ell(g(x), 0)$ isolates the loss on negatives:

$p(y = 0)\, \mathbb{E}_{X \mid Y = 0}[\ell(g(x), 0)] = \mathbb{E}_X[\ell(g(x), 0)] - p(y = 1)\, \mathbb{E}_{X \mid Y = 1}[\ell(g(x), 0)]$

(a small numeric check of this decomposition follows below)
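As a sanity check on this decomposition, here is a small synthetic experiment (illustrative only; the distributions, the prior, and the function `f` are arbitrary choices, not from the paper). It shows that the term over negatives can be recovered from unlabeled data plus positives once the prior $p(y = 1)$ is known:

```python
import numpy as np

# Check: E_X[f(x)] = p(y=1) E_{X|Y=1}[f(x)] + p(y=0) E_{X|Y=0}[f(x)]
rng = np.random.default_rng(0)
p_pos = 0.3                                     # known positive class prior p(y=1)
y = rng.random(200_000) < p_pos                 # latent labels (never observed in PU)
x = np.where(y, rng.normal(2.0, 1.0, y.size),   # positives
                rng.normal(-1.0, 1.0, y.size))  # negatives

f = lambda v: 1.0 / (1.0 + np.exp(-v))          # any bounded function of x

# Negative-class term computed directly (requires the hidden labels) ...
neg_term_direct = (1 - p_pos) * f(x[~y]).mean()
# ... versus the PU estimate, which only needs unlabeled data, positives, and p(y=1).
neg_term_pu = f(x).mean() - p_pos * f(x[y]).mean()
print(neg_term_direct, neg_term_pu)             # the two values agree closely
```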
SLIDE 12

General Background (Definitions)

  • In a binary reward MDP, $(s_t, a_t)$ is feasible if an optimal $\pi^*$ has non-zero probability of achieving success after taking $a_t$ in $s_t$
  • $(s_t, a_t)$ is catastrophic if even an optimal $\pi^*$ has zero probability of succeeding after $a_t$ is taken
  • Therefore, the return of a trajectory $\tau$ is 1 only if all $(s_t, a_t)$ in $\tau$ are feasible

SLIDE 13

Overview

  • Motivation
  • Contributions
  • Background
  • Method
  • Results
  • Limitations
SLIDE 14

OPE Method (Theorem)

  • Theorem: $R(\pi) \ge 1 - T(\epsilon + c)$
  • $\epsilon = \frac{1}{T} \sum_{t=1}^{T} \epsilon_t$ is the average error over all $(s_t, a_t)$, with $\epsilon_t = \mathbb{E}_{\rho^{+}_{t,\pi}}\big[\sum_{a \in \mathcal{C}(s_t)} \pi(a \mid s_t)\big]$
  • $\mathcal{C}(s)$: set of catastrophic actions at state $s$
  • $\rho^{+}_{t,\pi}$: state distribution at time $t$, given that $\pi$ was followed, all of its previous actions were feasible, and $s_t$ is feasible
  • $c(s_t, a_t)$: probability that the stochastic dynamics bring a feasible $(s_t, a_t)$ to a catastrophic $s_{t+1}$, with $c = \max_{s, a} c(s, a)$ (a worked numeric example of the bound follows below)

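To make the bound concrete, a worked example with illustrative numbers (not taken from the paper): with horizon $T = 15$, average per-step error $\epsilon = 0.01$, and dynamics slack $c = 0.005$,

$R(\pi) \ge 1 - T(\epsilon + c) = 1 - 15 \times 0.015 = 0.775,$

so the policy succeeds in at least 77.5% of episodes, and the guarantee degrades linearly with the horizon.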
SLIDE 15

OPE Method (Missing negative labels)

  • Estimate $\epsilon$, the probability that $\pi$ takes a catastrophic action, i.e., that $(s, \pi(s))$ is a false positive: $\epsilon = p(y = 0)\, \mathbb{E}_{X \mid Y = 0}[\ell(g(x), 0)]$
  • Recall

$p(y = 0)\, \mathbb{E}_{X \mid Y = 0}[\ell(g(x), 0)] = \mathbb{E}_{X, Y}[\ell(g(x), 0)] - p(y = 1)\, \mathbb{E}_{X \mid Y = 1}[\ell(g(x), 0)]$

  • We obtain

$\epsilon = \mathbb{E}_{(s, a)}[\ell(Q(s, a), 0)] - p(y = 1)\, \mathbb{E}_{(s, a), y = 1}[\ell(Q(s, a), 0)]$

(a code sketch of this estimator follows below)

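A sketch of this estimator in code (the array names, the loss argument, and the 0-1 threshold are assumptions of this sketch, not the paper's notation): `q_all` holds $Q(s, a)$ over all validation transitions (treated as unlabeled), `q_pos` holds $Q(s, a)$ over transitions known to be feasible, e.g. those from successful episodes, and `p_pos` is the class prior $p(y = 1)$:

```python
import numpy as np

def estimate_epsilon(q_all, q_pos, p_pos, loss):
    """PU estimate of the probability that the policy picks a catastrophic action:
    eps = E_{(s,a)}[loss(Q, 0)] - p(y=1) * E_{(s,a), y=1}[loss(Q, 0)]."""
    return loss(q_all, 0).mean() - p_pos * loss(q_pos, 0).mean()

def zero_one_loss(q, label, threshold=0.5):
    """0-1 loss of the thresholded classifier 1[Q(s, a) > threshold]."""
    pred = (q > threshold).astype(float)
    return np.abs(pred - label)
```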
SLIDE 16

OPE Method (Off-policy classification)

  • Off-policy classification (OPC) score: the negative loss when $\ell$ is the 0-1 loss
  • SoftOPC: the negative loss when $\ell$ is the soft loss function

$\ell(Q(s, a), y) = (1 - 2y)\, Q(s, a)$

(both scores are sketched in code below)

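Putting the last two slides together, a sketch of both scores as negative estimated losses (helper names are mine; `q_all`, `q_pos`, and `p_pos` are as in the previous sketch, and the 0-1 threshold is an assumed hyperparameter):

```python
def opc_score(q_all, q_pos, p_pos, threshold=0.5):
    """OPC: negative PU-estimated 0-1 loss of the thresholded Q 'classifier'."""
    false_pos_all = (q_all > threshold).mean()   # l(Q, 0) averaged over unlabeled data
    false_pos_pos = (q_pos > threshold).mean()   # l(Q, 0) averaged over positives
    return -(false_pos_all - p_pos * false_pos_pos)

def soft_opc_score(q_all, q_pos, p_pos):
    """SoftOPC: the same estimate with the soft loss l(Q, y) = (1 - 2y) * Q."""
    return -(q_all.mean() - p_pos * q_pos.mean())
```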
SLIDE 17

OPE Method (Evaluating OPE metrics)

  • Standard method: report MSE with respect to the true episode return
  • Our metrics do not estimate episode return directly
  • Instead, train many Q-functions with different learning algorithms
  • Evaluate true return of the equivalent argmax policy for each Q-function
  • Compare correlation of the metric to true return
  • Report the coefficient of determination $R^2$ of the line of best fit and the Spearman rank correlation $\rho$ (see the sketch below)

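A sketch of this evaluation protocol (a hypothetical helper; assumes one metric value and one measured true return per trained Q-function):

```python
import numpy as np
from scipy import stats

def correlation_summary(metric_values, true_returns):
    """How well a metric predicts and ranks true returns across Q-functions."""
    metric_values = np.asarray(metric_values)
    true_returns = np.asarray(true_returns)
    fit = stats.linregress(metric_values, true_returns)    # line of best fit
    rho, _ = stats.spearmanr(metric_values, true_returns)  # rank correlation
    return {"r_squared": fit.rvalue ** 2, "spearman_rho": rho}
```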
SLIDE 18

Baseline Metrics

  • Temporal-difference (TD) error: the standard Q-learning training loss
  • Discounted sum of advantages: $\sum_t \gamma^t A^{\pi}(s_t, a_t)$; relates $V^{\pi_b}(s) - V^{\pi}(s)$ to the sum of discounted advantages over data from the behavior policy $\pi_b$ (simplified sketches of the TD error and this quantity follow below)
  • Monte Carlo corrected (MCC) error: rearranges the discounted sum of advantages into a squared error

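Simplified sketches of the TD error and the discounted sum of advantages (interfaces as in the earlier sketches; the exact forms used in the paper are not reproduced on this slide):

```python
import numpy as np

def td_error(q_net, transitions, gamma=1.0):
    """Mean squared one-step Bellman (TD) error over validation transitions,
    where `transitions` is a hypothetical list of (s, a, r, s_next, done)."""
    errors = []
    for s, a, r, s_next, done in transitions:
        target = r + (0.0 if done else gamma * np.max(q_net(s_next)))
        errors.append((q_net(s)[a] - target) ** 2)
    return float(np.mean(errors))

def discounted_sum_of_advantages(q_net, episode, gamma=1.0):
    """sum_t gamma^t A(s_t, a_t), with A(s, a) = Q(s, a) - max_a' Q(s, a')
    for the greedy policy; `episode` is a list of (s, a) pairs."""
    total = 0.0
    for t, (s, a) in enumerate(episode):
        q = q_net(s)
        total += gamma ** t * (q[a] - np.max(q))
    return total
```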
SLIDE 19

Overview

  • Motivation
  • Contributions
  • Background
  • Method
  • Results
  • Limitations
SLIDE 20

Experimental Results (Simple Environments)

  • Performance against stochastic dynamics
SLIDE 21

Experimental Results (Vision-Based Robotic Grasping)

  • Performance on simulated and real versions of a vision-based grasping task

SLIDE 22

Discussion of results

  • OPC and SoftOPC consistently outperformed the baselines
  • SoftOPC ranks policies by real-world performance more reliably than the baselines
  • SoftOPC performs slightly better than OPC

SLIDE 23

Overview

  • Motivation
  • Contributions
  • Background
  • Method
  • Results
  • Limitations
SLIDE 24

Limitations

  • Key limitation: restricted task domain
  • Assumes an agent either succeeds or fails
  • This is difficult to model for complicated tasks with a long time horizon
  • Could not compare against many OPE baselines that use IS and model-learning techniques
  • High correlation on the real-world robotic grasping task, but only comparable to the sum of discounted advantages in simulation

SLIDE 25

Contributions (Recap)

  • It is difficult and expensive to evaluate performance in real-world environments
  • Many off-policy RL methods are value-based and do not require any knowledge of the policy that generated the real-world training data
  • Such methods are hard to use with IS and model learning
  • Treated evaluation as a classification problem and proposed OPC and SoftOPC, negative classification losses to be used with off-policy Q-learning algorithms
  • These scores can predict the relative performance of different policies in generalization scenarios
  • The proposed OPE metrics outperform a variety of baseline methods, including in a simulation-to-reality transfer scenario

SLIDE 26

Take Home Questions

  • What conditions must the MDP satisfy for OPE via OPC to apply?
  • What is a natural choice for the decision function?
  • How are the classification scores determined? Which losses are used?
  • Which two correlations are used to evaluate the metrics?