Off-Policy Evaluation via Off-Policy Classification
Alex Irpan, Kanishka Rao, Konstantinos Bousmalis, Chris Harris, Julian Ibarz, Sergey Levine. Topic: Imitation - Inverse RL. Presenter: Ning (Angela) Ye
Overview: Motivation, Contributions
Motivation
Evaluate policies without policy interactions with the environment or importance sampling (IS).
Expect to use real-world data as a "validation set".
A model of the environment can be difficult to fit.
Contributions
Developed two scores, OPC and SoftOPC, for off-policy evaluation.
Validated them in a simulation-to-reality transfer scenario.
Setup: binary-reward MDPs, where r = 0 at all intermediate steps and the final reward is 1 on success, 0 on failure.
The evaluated policy is the greedy Q-learning policy π(s) = argmax_a Q(s, a).
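To keep the later derivations concrete, the setup can be written out in one place (a restatement of the two points above, using the paper's R(π) notation for average return):

```latex
% Binary-reward MDP: all intermediate rewards are zero.
r_t = 0 \quad (t < T), \qquad r_T = \mathbf{1}\{\text{success}\} \in \{0, 1\}

% The average return therefore equals the success probability:
R(\pi) \;=\; \mathbb{E}_{\pi}\Big[\sum_{t=1}^{T} r_t\Big] \;=\; \Pr(\text{success under } \pi)

% Policy under evaluation: greedy with respect to the learned Q-function.
\pi(s) \;=\; \arg\max_{a} Q(s, a)
```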
Key idea: treat Q(s, a) as a binary classifier trained on partially labeled data.
(s, a) pairs from successful episodes are known positives (feasible); all other pairs are unlabeled.
Decompose the data distribution over the label z:
p(y) = p(y | z = 1) p(z = 1) + p(y | z = 0) p(z = 0)
Multiplying both sides by a test function f(y) and integrating over y:
E_p[f(y)] = p(z = 1) E_{y|z=1}[f(y)] + p(z = 0) E_{y|z=0}[f(y)]
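A quick numeric check of this mixture identity; the distributions and test function below are invented purely for illustration:

```python
# Sanity check: E_p[f(y)] = p(z=1) E[f(y)|z=1] + p(z=0) E[f(y)|z=0].
import numpy as np

rng = np.random.default_rng(0)
p1 = 0.3                                   # class prior p(z = 1)
z = rng.random(1_000_000) < p1             # latent binary labels
y = np.where(z, rng.normal(2.0, 1.0, z.size),   # y | z = 1
                rng.normal(0.0, 1.0, z.size))   # y | z = 0
f = lambda v: 1.0 / (1.0 + np.exp(-v))     # any bounded test function

lhs = f(y).mean()                                        # E_p[f(y)]
rhs = p1 * f(y[z]).mean() + (1 - p1) * f(y[~z]).mean()   # mixture form
print(lhs, rhs)  # the two estimates agree up to Monte Carlo noise
```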
Catastrophic (s_t, a_t): zero probability of achieving success after taking a_t in s_t.
Feasible (s_t, a_t): nonzero probability of succeeding after a_t is taken.
Theorem 1: R(π) ≥ 1 − Tε, where ε = (1/T) Σ_{t=1}^{T} ε_t is the average error over all (s_t, a_t), with ε_t = E_{s_t ∼ ρ_t^+}[π(catastrophic a_t | s_t)]. (A toy check of this bound follows below.)
ρ_t^+: state distribution at time t, given that π was followed, all its previous actions were feasible, and s_t is feasible.
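A toy Monte Carlo illustration of the bound: each step independently commits a catastrophic mistake with rate ε_t, and an episode succeeds iff no step does. All numbers here are made up:

```python
# Toy check of R(pi) >= 1 - T * eps_bar for binary-reward episodes.
import numpy as np

rng = np.random.default_rng(1)
T = 20
eps_t = rng.uniform(0.0, 0.02, T)   # per-step catastrophic-action rates

# An episode succeeds iff the policy never picks a catastrophic action.
n = 200_000
alive = np.ones(n, dtype=bool)
for e in eps_t:
    alive &= rng.random(n) >= e     # step t fails with probability eps_t
success_rate = alive.mean()         # empirical R(pi) = prod_t (1 - eps_t)

eps_bar = eps_t.mean()
print(success_rate, 1 - T * eps_bar)  # success_rate >= 1 - T * eps_bar
```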
Stochastic dynamics can also send a feasible (s_t, a_t) to a catastrophic s_{t+1}; letting c = max_{s,a} c(s, a) bound the probability of such transitions, the guarantee loosens by a corresponding T·c term.
Probability that (s, π(s)) is a false positive: ε = p(z = 0) E_{y|z=0}[ℓ(f(y), 0)].
By the decomposition above: p(z = 0) E_{y|z=0}[ℓ(f(y), 0)] = E_{p(y)}[ℓ(f(y), 0)] − p(z = 1) E_{y|z=1}[ℓ(f(y), 0)].
Setting y = (s, a) and f = Q: ε = E_{s,a}[ℓ(Q(s, a), 0)] − p(z = 1) E_{(s,a), z=1}[ℓ(Q(s, a), 0)].
OPC uses the 0-1 loss 1{Q(s, a) > b} at a threshold b; SoftOPC uses the soft loss ℓ(Q(s, a), 0) = Q(s, a). Negating the estimated error yields the score: higher is better.
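A minimal sketch of both scores from logged validation data, assuming SoftOPC = p(z=1)·E[Q | feasible] − E[Q] (the negated error above) and OPC as the same quantity under the 0-1 loss scanned over candidate thresholds b; the array names and the threshold scan are illustrative assumptions, not the paper's exact procedure:

```python
# Score a fixed Q-function from off-policy validation data alone.
import numpy as np

def soft_opc(q_all, q_pos, p1):
    """q_all: Q(s,a) on all logged pairs; q_pos: on known-feasible pairs."""
    return p1 * q_pos.mean() - q_all.mean()

def opc(q_all, q_pos, p1, n_thresholds=1000):
    """0-1 version, scanning candidate thresholds b over the observed range."""
    bs = np.linspace(q_all.min(), q_all.max(), n_thresholds)
    scores = [p1 * (q_pos > b).mean() - (q_all > b).mean() for b in bs]
    return max(scores)

# Usage: rank candidate Q-functions; higher should mean a better policy.
rng = np.random.default_rng(2)
q_pos = rng.uniform(0.5, 1.0, 500)    # Q on (s, a) from successful episodes
q_all = rng.uniform(0.0, 1.0, 2000)   # Q on all logged (s, a)
print(soft_opc(q_all, q_pos, p1=0.25), opc(q_all, q_pos, p1=0.25))
```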
Metrics are compared by their correlation with true policy performance.
Baselines: the sum of discounted advantages Σ_t γ^t A(s_t, a_t) computed over data from the behavior policy π_b, and the MCC error, which turns this sum into a squared error (see the sketch below).
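A hedged sketch of the advantage-based baseline, with A(s, a) = Q(s, a) − max_a′ Q(s, a′); the stand-in Q-function and episode are hypothetical, and the squared variant is only one plausible reading of "turns the sum of advantages into a squared error":

```python
# Baseline: discounted sum of advantages over one logged episode.
import numpy as np

def sum_discounted_advantages(q, states, actions, n_actions, gamma=0.99):
    """Sum_t gamma^t * (Q(s_t, a_t) - max_a' Q(s_t, a'))."""
    total = 0.0
    for t, (s, a) in enumerate(zip(states, actions)):
        v = max(q(s, a2) for a2 in range(n_actions))  # greedy state value
        total += gamma**t * (q(s, a) - v)
    return total

def mcc_style_error(q, states, actions, n_actions, gamma=0.99):
    """Squared version of the sum above (one plausible reading of MCC error)."""
    return sum_discounted_advantages(q, states, actions, n_actions, gamma) ** 2

# Hypothetical usage on a 3-step episode with 2 actions:
q = lambda s, a: 0.4 * s + 0.2 * a   # stand-in Q, not a trained model
states, actions = [0.0, 1.0, 2.0], [1, 0, 1]
print(sum_discounted_advantages(q, states, actions, n_actions=2))
print(mcc_style_error(q, states, actions, n_actions=2))
```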
Experiments: simulated and real versions of a vision-based grasping task.
OPC and SoftOPC rank policies better than the baselines with respect to real-world performance.
SoftOPC generally works better than OPC.
Conclusions
Reducing OPE to classification lets it benefit from supervised learning techniques.
SoftOPC is comparable with the sum of discounted advantages in simulation, but is more predictive of real-world performance.
Evaluation needs neither interaction with real environments nor any knowledge of the policy that generated the real-world training data.
SoftOPC is derived from negated losses and can be used with standard off-policy Q-learning algorithms.
All of this was validated in a simulation-to-reality transfer scenario.