Maximum Entropy Framework: Inverse RL, Soft Optimality, and More
Chelsea Finn and Sergey Levine UC Berkeley
5/20/2017
Introductions
Chelsea Finn, PhD student
Sergey Levine, assistant professor
Mnih et al. ’15 video from Montessori New Zealand
what is the reward?
[Figure: reinforcement learning loop: agent and reward]
In the real world, humans don’t get a score.
a reward function is essential for RL; in real-world domains, the reward/cost is often difficult to specify
Kohl & Stone, ’04 Mnih et al. ’15 Silver et al. ‘16 Tesauro ’95
One approach: Mimic actions of human expert
Can we reason about human decision-making?
behavioral cloning: + simple, sometimes works well (a minimal sketch follows below)
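A minimal sketch of behavioral cloning as supervised regression on demonstrated state-action pairs; the linear least-squares policy and the array names S, A are illustrative stand-ins for a learned network and a real dataset:

```python
import numpy as np

# Hypothetical demonstration data: states S (N x d_s), expert actions A (N x d_a).
# Behavioral cloning reduces imitation to supervised learning: fit a policy
# that maps demonstrated states to the expert's actions.
def behavioral_cloning(S, A):
    # Least-squares fit of a linear policy a = W^T s; a stand-in for
    # training a neural network with mean-squared error.
    W, *_ = np.linalg.lstsq(S, A, rcond=None)
    return lambda s: s @ W
```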
Mombaur et al. ‘09 Muybridge (c. 1870) Ziebart ‘08 Li & Todorov ‘06
some mistakes matter more than others!
behavior is stochastic, but good behavior is still the most likely
no assumption of optimal behavior!
how to do inference?
“optimistic” transition (not a good idea!)
Ziebart et al. ’10, “Modeling Interaction via the Principle of Maximum Causal Entropy”
summary: probabilistic model for optimal control (similar to HMM, EKF, etc.)
inference: dynamic programming, value iteration (a sketch follows below)
variants: …
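A tabular sketch of that inference step, assuming a discounted MDP with known rewards R and dynamics P (array shapes are illustrative); the only change from standard value iteration is that the hard max over actions becomes a log-sum-exp:

```python
import numpy as np

def soft_value_iteration(R, P, gamma=0.99, iters=200):
    """Soft (maximum-entropy) value iteration.

    R: rewards, shape (S, A); P: dynamics, shape (S, A, S).
    The log-sum-exp backup is the backward inference pass of the
    probabilistic model of optimal control sketched above.
    """
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        Q = R + gamma * (P @ V)          # Q(s,a) = r(s,a) + γ E[V(s')]
        m = Q.max(axis=1)                # stabilize the log-sum-exp
        V = m + np.log(np.exp(Q - m[:, None]).sum(axis=1))
    return V, Q
```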
Inverse Optimal Control / Inverse Reinforcement Learning: infer cost/reward function from demonstrations
Challenges:
- underdefined problem
- difficult to evaluate a learned reward
- demonstrations may not be precisely optimal
given: expert demonstrations
goal: recover the cost/reward function
(IOC/IRL; Kalman ’64, Ng & Russell ’00)
Ng & Russell ’00: expert actions should have higher value than other actions
Abbeel & Ng ’04: policy optimized w.r.t. the learned cost should match feature counts of expert trajectories (see the sketch below)
Ratliff et al. ’06: max-margin formulation between value of expert actions and other actions
How to handle ambiguity and suboptimality?
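The feature-matching idea admits a one-line justification: with a linear cost, matching expected feature counts matches the expert’s expected cost for every possible weight vector (notation ψ, f as in the slides that follow):

```latex
% linear cost c_\psi(\tau) = \psi^\top f(\tau); feature matching implies cost matching:
\mathbb{E}_{\pi}\big[f(\tau)\big] = \mathbb{E}_{\pi_{\mathrm{expert}}}\big[f(\tau)\big]
\;\Longrightarrow\;
\mathbb{E}_{\pi}\big[c_\psi(\tau)\big] = \mathbb{E}_{\pi_{\mathrm{expert}}}\big[c_\psi(\tau)\big]
\quad \text{for all } \psi .
```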
Whiteboard
(Ziebart et al. ’08)
Notation:
r_ψ(τ): reward with parameters ψ [linear case: r_ψ(τ) = ψ^T f(τ)]
D = {τ_i}: dataset of demonstrations
handle ambiguity using probabilistic model of behavior
(Ziebart et al. ’08)
Whiteboard
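A sketch of the whiteboard derivation in the notation above, following the MaxEnt model of Ziebart et al. ’08: demonstrations are exponentially more likely the higher their reward, and the maximum-likelihood gradient contrasts demonstration features with the model’s own expectations:

```latex
p(\tau \mid \psi) = \frac{1}{Z(\psi)} \exp\big(r_\psi(\tau)\big),
\qquad
Z(\psi) = \int \exp\big(r_\psi(\tau)\big)\, d\tau

\mathcal{L}(\psi) = \frac{1}{N}\sum_{i=1}^{N} r_\psi(\tau_i) - \log Z(\psi),
\qquad
\nabla_\psi \mathcal{L}
= \mathbb{E}_{\mathcal{D}}\big[\nabla_\psi r_\psi(\tau)\big]
- \mathbb{E}_{p(\tau \mid \psi)}\big[\nabla_\psi r_\psi(\tau)\big]
```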
Goals:
Guided Cost Learning (Finn et al., ICML 2016)
guided cost learning algorithm:
initial policy π0
→ generate policy samples from π
→ update reward using samples & demos
→ update π w.r.t. reward
(a sketch of this loop follows below)
[Figure: neural-network cost c_θ(x): inputs x_1, …, x_n mapped through hidden layers h^(1), h^(2), h^(3); loop between policy π and reward r]
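A minimal sketch of this alternating loop; the helpers passed in (sample_fn, reward_step, policy_step) are hypothetical stand-ins for trajectory sampling, the sample-and-demo reward update, and a partial policy-optimization step:

```python
def guided_cost_learning(demos, policy, reward,
                         sample_fn, reward_step, policy_step, n_iters=100):
    """Alternate between updating the reward and the policy."""
    for _ in range(n_iters):
        samples = sample_fn(policy)                   # generate policy samples from π
        reward = reward_step(reward, demos, samples)  # update reward using samples & demos
        policy = policy_step(policy, reward)          # update π w.r.t. reward
    return policy, reward
```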
guided cost learning as a GAN:
initial policy π0
→ generate policy samples from π (generator)
→ update reward using samples & demos (discriminator)
→ update π w.r.t. reward (partially optimize)
update reward in inner loop of policy optimization
[Figure: same cost network c_θ(x); policy π plays the generator, reward r the discriminator]
same structure in Ho et al., ICML ’16, NIPS ’16:
initial policy π0
→ generate policy samples from π (generator)
→ update reward using samples & demos (discriminator)
→ update π w.r.t. reward (partially optimize)
[Figure: cost network c_θ(x); generator/discriminator loop as in guided cost learning]
Real-world Tasks: dish placement, pouring almonds
dish placement: state includes goal plate pose; pouring: state includes unsupervised visual features [Finn et al. ’16]; action: joint torques
Comparisons: Relative Entropy IRL (Boularias et al. ’11), Path Integral IRL (Kalakrishnan et al. ’13)
update reward using samples & demos; generate policy samples from q (importance-sampling sketch below)
[Figure: neural-network cost c_θ(x); reward r]
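Like guided cost learning, these comparison methods handle the intractable partition function with importance sampling under the sampling distribution q; a sketch in the notation above:

```latex
Z(\psi) \;\approx\; \frac{1}{M} \sum_{j=1}^{M}
\frac{\exp\big(r_\psi(\tau_j)\big)}{q(\tau_j)},
\qquad \tau_j \sim q .
```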
Conclusion: We can recover successful policies for new positions. Is the reward function also useful for new scenarios?
Note: normally the GAN discriminator is discarded
Strengths
Limitations
teaching or teleoperation (first person)
Similarly, GANs learn an objective for generative modeling. (Goodfellow et al. ’14)
Zhu et al. ’17; Isola et al. ’17; Arjovsky et al. ’17
[Figure: GAN diagram: noise → generator → generated samples; discriminator compares real vs. generated]
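For reference, the minimax objective of Goodfellow et al. ’14, matching the noise / real / generated labels in the figure:

```latex
\min_G \max_D \;
\mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big]
+ \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]
```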
Connection between Inverse RL and GANs
correspondence (Finn*, Christiano*, Abbeel, Levine, arXiv ’16): policy π ~ q(τ) ↔ generator G; reward r ↔ discriminator D; trajectory τ ↔ sample x
the discriminator only needs to learn the data distribution: θ is independent of the generator density
Connection between Inverse RL and GANs
correspondence (Finn*, Christiano*, Abbeel, Levine, arXiv ’16): policy π ~ q(τ) ↔ generator G; cost c ↔ discriminator D; trajectory τ ↔ sample x
the generator objective is entropy-regularized RL
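A sketch of why: with the discriminator parameterized as on the next slide, the generator’s GAN loss reduces, up to constants, to entropy-regularized RL on the learned cost (as argued in Finn*, Christiano* et al.):

```latex
\mathcal{L}_{\mathrm{gen}}(q)
\;=\; \mathbb{E}_{\tau \sim q}\big[c_\theta(\tau)\big]
\;-\; \mathcal{H}\big(q(\tau)\big) \;+\; \mathrm{const}
```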
GANs for training EBMs
(Finn*, Christiano*, Abbeel, Levine, arXiv ’16)
correspondence: energy E ↔ discriminator D; sampler q(x) ↔ generator G
MaxEnt IRL is an energy-based model: use the generator’s density q(x) to form a consistent estimator of the energy function
D(x) = \frac{\frac{1}{Z}\exp(-E_\theta(x))}{\frac{1}{Z}\exp(-E_\theta(x)) + q(x)}
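A numerical sketch of this discriminator, computed in log space for stability; energy and log_q are hypothetical callables standing in for E_θ(x) and log q(x), and log_Z is the current estimate of log Z:

```python
import numpy as np

def ebm_discriminator(x, energy, log_q, log_Z):
    """D(x) = p_θ(x) / (p_θ(x) + q(x)), where p_θ(x) = exp(-E_θ(x)) / Z.

    Since D = 1 / (1 + q/p), this is a sigmoid of log p_θ(x) - log q(x).
    """
    log_p = -energy(x) - log_Z
    return 1.0 / (1.0 + np.exp(-(log_p - log_q(x))))
```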
Dai et al., ICLR submission ‘17
Kim & Bengio, ICLR Workshop ’16; Zhao et al., arXiv ’16; Zhai et al., ICLR submission ’17
hypotheses?
Stochastic energy-based policies
Tuomas Haarnoja, Haoran Tang
Wang & Liu, ‘17
Stochastic energy-based policies aid exploration
Stochastic energy-based policies provide pretraining
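A tabular sketch of such a policy, π(a|s) ∝ exp(Q(s,a)), with soft Q-values assumed given; because every reasonably valued action keeps probability mass, the policy stays multimodal, which is what aids exploration and pretraining:

```python
import numpy as np

def sample_soft_policy(Q_s, rng=None):
    """Sample an action from π(a|s) ∝ exp(Q(s,a)) for a single state.

    Q_s: soft Q-values for one state, shape (A,).
    """
    rng = rng or np.random.default_rng()
    logits = Q_s - Q_s.max()                       # stabilize the softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(Q_s), p=probs)
```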
Sallans & Hinton. Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task. 2000.
Nachum et al. Bridging the Gap Between Value and Policy Based Reinforcement Learning. 2017.
Peters et al. Relative Entropy Policy Search. 2010.
O’Donoghue et al. Combining Policy Gradient and Q-Learning. 2017.
(beyond robotic manipulation and control)
Kitani et al. ’14: Model human pedestrian interactions
Ziebart et al. ’08: Predict taxi driver route preferences
Dragan et al. ’13: Generating human-legible motion
Li et al. ’17: Learn objective for dialog generation