Maximum Entropy Framework: Inverse RL, Soft Optimality, and More


SLIDE 1

Maximum Entropy Framework: Inverse RL, Soft Optimality, and More

Chelsea Finn and Sergey Levine UC Berkeley

5/20/2017

SLIDE 2

Introductions

Chelsea Finn PhD student

Sergey Levine assistant professor

SLIDE 3
  • 1. A World without Rewards
  • 2. A Probabilistic Model of Behavior
  • 3. Application: Inverse RL
  • 4. GANs and Energy-Based Models
  • 5. Application: Soft-Q Learning

Outline

SLIDE 4
  • 1. A World without Rewards
  • 2. A Probabilistic Model of Behavior
  • 3. Application: Inverse RL
  • 4. GANs and Energy-Based Models
  • 5. Application: Soft-Q Learning

Outline

SLIDE 5

what is the reward?

[Images: Atari agent (Mnih et al. ’15); video from Montessori New Zealand. Diagram: a reinforcement learning agent receiving a reward.]

In the real world, humans don’t get a score.

SLIDE 6

The reward function is essential for RL; in real-world domains, the reward/cost is often difficult to specify:

  • robotic manipulation
  • autonomous driving
  • dialog systems
  • virtual assistants
  • and more…

(Kohl & Stone ’04; Tesauro ’95; Mnih et al. ’15; Silver et al. ’16)

SLIDE 7

One approach: mimic the actions of the human expert (“behavioral cloning”)

+ simple, sometimes works well

  • but no reasoning about outcomes or dynamics
  • the expert might have different degrees of freedom
  • the expert might not always be optimal

Can we reason about human decision-making?

SLIDE 8
  • 1. A World without Rewards
  • 2. A Probabilistic Model of Behavior
  • 3. Application: Inverse RL
  • 4. GANs and Energy-Based Models
  • 5. Application: Soft-Q Learning

Outline

SLIDE 9

Optimal Control as a Model of Human Behavior

Mombaur et al. ‘09 Muybridge (c. 1870) Ziebart ‘08 Li & Todorov ‘06

  • optimize this to explain the data
SLIDE 10

What if the data is not optimal?

some mistakes matter more than others! behavior is stochastic, but good behavior is still the most likely

SLIDE 11

A probabilistic graphical model of decision making

no assumption of optimal behavior!
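Concretely, the model attaches a binary optimality variable O_t to each time step (a sketch in the standard control-as-inference notation; the slide itself shows only the graphical model):

p(\mathcal{O}_t = 1 \mid s_t, a_t) \propto \exp\big(r(s_t, a_t)\big)

p(\tau \mid \mathcal{O}_{1:T}) \propto p(s_1) \prod_t p(s_{t+1} \mid s_t, a_t)\, \exp\Big(\textstyle\sum_t r(s_t, a_t)\Big)

Conditioning on optimality makes high-reward trajectories the most likely, without assuming the agent always takes the best action.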

SLIDE 12

Inference = planning

how to do inference?
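For reference, the backward pass computes messages β over the optimality variables (a sketch, assuming a uniform action prior):

\beta_T(s_T, a_T) = \exp\big(r(s_T, a_T)\big), \qquad
\beta_t(s_t, a_t) = \exp\big(r(s_t, a_t)\big)\, \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)}\big[\beta_{t+1}(s_{t+1})\big]

\beta_t(s_t) = \textstyle\sum_{a_t} \beta_t(s_t, a_t), \qquad
p(a_t \mid s_t, \mathcal{O}_{t:T}) = \beta_t(s_t, a_t) / \beta_t(s_t)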

SLIDE 13

A closer look at the backward pass

“optimistic” transition (not a good idea!)

Ziebart et al. ’10, “Modeling Interaction via the Principle of Maximum Causal Entropy”
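In log space (with V = log β(s) and Q = log β(s, a)) the recursion becomes a “soft” value iteration, which makes the problem explicit (a sketch following Ziebart et al. ’10):

V(s_t) = \log \textstyle\sum_{a_t} \exp\big(Q(s_t, a_t)\big)

Q(s_t, a_t) = r(s_t, a_t) + \log \mathbb{E}_{s_{t+1}}\big[\exp\big(V(s_{t+1})\big)\big] \quad \text{(“optimistic” over the dynamics)}

Q(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1}}\big[V(s_{t+1})\big] \quad \text{(MaxCausalEnt backup)}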

SLIDE 14

Stochastic optimal control (MaxCausalEnt) summary

variants:

summary:

  • 1. Probabilistic graphical model for optimal control
  • 2. Control = inference (similar to HMM, EKF, etc.)
  • 3. Very similar to dynamic programming, value iteration, etc. (but “soft”)
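A minimal tabular sketch of this “soft” dynamic program (illustrative code, not from the talk; assumes a finite-horizon MDP with known dynamics P and reward R):

import numpy as np

def soft_value_iteration(P, R, T):
    """Soft (MaxCausalEnt) value iteration on a tabular MDP.

    P: dynamics, shape (S, A, S); R: reward, shape (S, A); T: horizon.
    Returns lists (length T) of soft Q-values and soft policies.
    """
    S, A = R.shape
    V = np.zeros(S)                        # soft value beyond the horizon
    Qs, pis = [], []
    for t in reversed(range(T)):
        # MaxCausalEnt backup: plain expectation over the dynamics
        # (a log-E-exp over the dynamics would be the "optimistic" backup)
        Q = R + P @ V                      # Q[s, a] = R[s, a] + E_{s'}[V(s')]
        # soft max over actions replaces the hard max of value iteration
        m = Q.max(axis=1)
        V = m + np.log(np.exp(Q - m[:, None]).sum(axis=1))
        pi = np.exp(Q - V[:, None])        # pi(a|s) = exp(Q(s,a) - V(s))
        Qs.append(Q); pis.append(pi)
    return Qs[::-1], pis[::-1]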
SLIDE 15
  • 1. A World without Rewards
  • 2. A Probabilistic Model of Behavior
  • 3. Application: Inverse RL
  • 4. GANs and Energy-Based Models
  • 5. Application: Soft-Q Learning

Outline

SLIDE 16

Given a reward, we can model how a human might sub-optimally maximize it. How can this help us with learning?

SLIDE 17

Inverse Optimal Control / Inverse Reinforcement Learning (IOC/IRL): infer the cost/reward function from demonstrations (Kalman ’64; Ng & Russell ’00)

given:

  • state & action space
  • roll-outs from π*
  • dynamics model [sometimes]

goal:

  • recover reward function
  • then use reward to get policy

Challenges:

  • underdefined problem
  • difficult to evaluate a learned reward
  • demonstrations may not be precisely optimal

SLIDE 18

Early IRL Approaches

  • deterministic MDP
  • alternate between solving the MDP & updating the reward
  • heuristics for handling sub-optimality

Ng & Russell ’00: expert actions should have higher value than other actions; a larger gap is better

Abbeel & Ng ’04: the policy w.r.t. the learned cost should match the feature counts of the expert trajectories

Ratliff et al. ’06: max margin formulation between the value of expert actions and other actions

How to handle ambiguity and suboptimality?

SLIDE 19

Whiteboard

Maximum Entropy Inverse RL

(Ziebart et al. ’08)

Notation:

r_ψ(s, a): reward with parameters ψ [linear case: r_ψ(s, a) = ψᵀ f(s, a)]
D = {τ_i}: dataset of demonstrations

handle ambiguity using probabilistic model of behavior
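The whiteboard derivation fits in two lines in this notation (a sketch of the standard MaxEnt IRL objective and its gradient):

\max_\psi \; \mathcal{L}(\psi) = \sum_{\tau_i \in \mathcal{D}} \log p(\tau_i \mid \psi),
\qquad
p(\tau \mid \psi) = \frac{1}{Z(\psi)} \exp\big(r_\psi(\tau)\big)

\nabla_\psi \mathcal{L} = \mathbb{E}_{\tau \sim \mathcal{D}}\big[\nabla_\psi r_\psi(\tau)\big] \;-\; \mathbb{E}_{\tau \sim p(\tau \mid \psi)}\big[\nabla_\psi r_\psi(\tau)\big]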

SLIDE 20

Maximum Entropy Inverse RL

(Ziebart et al. ’08)
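Putting the pieces together, a compact tabular sketch of the algorithm (illustrative names; assumes known dynamics, a linear state reward r_ψ = ψᵀ f, and the soft_value_iteration routine sketched earlier):

import numpy as np

def maxent_irl(P, F, demos, T, lr=0.1, iters=100):
    """Tabular MaxEnt IRL sketch in the spirit of Ziebart et al. '08.

    P: dynamics (S, A, S); F: state features (S, K);
    demos: list of state sequences (each of length T); T: horizon.
    """
    S, A, _ = P.shape
    psi = np.zeros(F.shape[1])
    # empirical feature counts from the demonstrations
    f_demo = np.mean([F[traj].sum(axis=0) for traj in demos], axis=0)
    for _ in range(iters):
        R = np.repeat((F @ psi)[:, None], A, axis=1)  # linear state reward
        _, pis = soft_value_iteration(P, R, T)        # backward pass
        # forward pass: state visitation under the soft policy
        d = np.zeros(S)
        np.add.at(d, [traj[0] for traj in demos], 1.0 / len(demos))
        mu = d.copy()
        for t in range(T - 1):
            d = np.einsum('s,sa,sap->p', d, pis[t], P)
            mu += d
        f_model = F.T @ mu                            # expected feature counts
        psi += lr * (f_demo - f_model)                # ML gradient ascent
    return psi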

SLIDE 21

Whiteboard

What about unknown dynamics?

SLIDE 22

Goals:

  • remove the need to solve the MDP in the inner loop
  • be able to handle unknown dynamics
  • handle continuous state & action spaces

Case Study: Guided Cost Learning

ICML 2016

SLIDE 23

[Diagram: the guided cost learning algorithm. Starting from an initial policy π0, iterate: generate policy samples from π; update the reward (a neural-network cost c_θ(x)) using samples & demos; update π w.r.t. the reward.]

SLIDE 24

[Diagram: guided cost learning as an adversarial game. The policy π plays the role of a generator and the reward r (cost c_θ(x)) the role of a discriminator. Iterate: generate policy samples from π; update the reward using samples & demos, in the inner loop of policy optimization; update π w.r.t. the reward (partially optimize).]

SLIDE 25

[Diagram: the corresponding generator/discriminator view of Ho et al., ICML ’16, NIPS ’16: generate policy samples from π; update the reward using samples & demos; update π w.r.t. the reward (partially optimize).]
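In pseudocode, the alternating loop looks roughly like this (a sketch only; sample_rollouts, importance_weights, reward_update, and policy_update are hypothetical stand-ins, not the paper's actual API):

def guided_cost_learning(demos, pi, reward, n_iters):
    """Sketch of the GCL outer loop (after Finn et al., ICML '16).

    All four helpers used below are illustrative placeholders.
    """
    for _ in range(n_iters):
        samples = sample_rollouts(pi)          # generate policy samples from π
        # discriminator step: push demo reward up, sample reward down,
        # importance-weighting samples since π only partially optimizes reward
        w = importance_weights(samples, pi, reward)
        reward = reward_update(reward, demos, samples, w)
        # generator step: improve π w.r.t. the current reward (partially optimize)
        pi = policy_update(pi, reward, samples)
    return pi, reward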

SLIDE 26

GCL Experiments

Real-world tasks: dish placement, pouring almonds

  • dish placement: state includes the goal plate pose
  • pouring: state includes unsupervised visual features [Finn et al. ’16]
  • action: joint torques

SLIDE 27

Comparisons: Relative Entropy IRL (Boularias et al. ’11), Path Integral IRL (Kalakrishnan et al. ’13)

[Diagram: these baselines update the reward using samples & demos, but generate samples from a fixed distribution q rather than an optimized policy.]

SLIDE 28

Dish placement, demos

SLIDE 29

Dish placement, standard cost

SLIDE 30

Dish placement, RelEnt IRL

  • video of dish baseline method
SLIDE 31

Dish placement, GCL policy

  • video of dish our method - samples & reoptimizing
SLIDE 32

Pouring, demos

  • video of pouring demos
SLIDE 33

Pouring, RelEnt IRL

  • video of pouring baseline method
SLIDE 34

Pouring, GCL policy

  • video of pouring our method - samples
SLIDE 35

Conclusion: We can recover successful policies for new positions. Is the reward function also useful for new scenarios?

SLIDE 36

Dish placement - GCL reopt.

  • video of dish our method - samples & reoptimizing
SLIDE 37

Pouring - GCL reopt.

  • video of pouring our method - reoptimization

Note: normally the GAN discriminator is discarded

SLIDE 38

Guided Cost Learning & Generative Adversarial Imitation Learning

Strengths

  • can handle unknown dynamics
  • scales to neural net rewards
  • efficient enough for real robots

Limitations

  • adversarial optimization is hard
  • can’t scale to raw pixel observations of demos
  • demonstrations typically collected with kinesthetic teaching or teleoperation (first person)

SLIDE 39
  • 1. A World without Rewards
  • 2. A Probabilistic Model of Behavior
  • 3. Application: Inverse RL
  • 4. GANs and Energy-Based Models
  • 5. Application: Soft-Q Learning

Outline

SLIDE 40

Generative Adversarial Networks

(Goodfellow et al. ’14)

Similarly, GANs learn an objective for generative modeling.

[Diagram: noise → generator G → generated samples; real and generated samples → discriminator D.]

Zhu et al. ’17; Isola et al. ’17; Arjovsky et al. ’17

SLIDE 41

Connection between Inverse RL and GANs

[Diagram: trajectory τ ↔ sample x; policy π ~ q(τ) ↔ generator G; reward r ↔ discriminator D.]

The discriminator only needs to learn the data distribution, independent of the generator density.

Finn*, Christiano*, Abbeel, Levine, arXiv ’16

SLIDE 42

Connection between Inverse RL and GANs

[Diagram: trajectory τ ↔ sample x; policy π ~ q(τ) ↔ generator G; cost c ↔ discriminator D.]

The generator objective is entropy-regularized RL.

Finn*, Christiano*, Abbeel, Levine, arXiv ’16
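Why the generator objective is entropy-regularized RL (a sketch of the identity from the paper): give the discriminator the form below; then the generator's GAN objective decomposes into expected reward plus entropy.

D_\theta(\tau) = \frac{\tfrac{1}{Z} \exp\big(r_\theta(\tau)\big)}{\tfrac{1}{Z} \exp\big(r_\theta(\tau)\big) + q(\tau)}

\log D_\theta(\tau) - \log\big(1 - D_\theta(\tau)\big) = r_\theta(\tau) - \log Z - \log q(\tau)

\mathbb{E}_{\tau \sim q}\big[\log D_\theta(\tau) - \log\big(1 - D_\theta(\tau)\big)\big] = \mathbb{E}_{q}\big[r_\theta(\tau)\big] + \mathcal{H}(q) - \log Z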

SLIDE 43

GANs for training EBMs

MaxEnt IRL is an energy-based model. Use the generator’s density q(x) to form a consistent estimator of the energy function:

D_θ(x) = (1/Z) exp(−E_θ(x)) / [ (1/Z) exp(−E_θ(x)) + q(x) ]

[Diagram: energy E ↔ discriminator D; sampler q(x) ↔ generator G.]

Finn*, Christiano*, Abbeel, Levine, arXiv ’16

Dai et al., ICLR submission ‘17

Kim & Bengio ICLR Workshop ’16; Zhao et al. arXiv ’16; Zhai et al. ICLR sub ‘17
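A numerically stable way to evaluate this discriminator is through its log-odds (a minimal sketch; energy, log_q, and log_Z are assumed inputs, with log_Z typically a learned scalar):

import numpy as np

def ebm_discriminator(x, energy, log_q, log_Z):
    """D_θ(x) for the EBM discriminator above, computed via log-odds.

    logit = log p̃(x) - log q(x) = -energy(x) - log_Z - log_q(x),
    so D = sigmoid(logit). energy and log_q are assumed callables.
    """
    logit = -energy(x) - log_Z - log_q(x)
    return 1.0 / (1.0 + np.exp(-logit))   # sigmoid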

SLIDE 44
  • 1. A World without Rewards
  • 2. A Probabilistic Model of Behavior
  • 3. Application: Inverse RL
  • 4. GANs and Energy-Based Models
  • 5. Application: Soft-Q Learning

Outline

SLIDE 45

Stochastic models for learning control

  • How can we track both hypotheses?

SLIDE 46

Stochastic energy-based policies

Tuomas Haarnoja Haoran Tang

SLIDE 47

Soft Q-learning
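For reference, the soft Q-learning equations this slide refers to (following Haarnoja et al. ’17, with temperature α):

V(s) = \alpha \log \int \exp\!\Big(\tfrac{1}{\alpha} Q(s, a)\Big)\, da
\qquad
Q(s, a) \leftarrow r(s, a) + \gamma\, \mathbb{E}_{s'}\big[V(s')\big]
\qquad
\pi(a \mid s) \propto \exp\!\Big(\tfrac{1}{\alpha}\big(Q(s, a) - V(s)\big)\Big)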

SLIDE 48

Tractable amortized inference for continuous actions

Wang & Liu, ‘17
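The idea, roughly (a sketch; the details follow Haarnoja et al. ’17, which amortizes the Stein variational sampler of Wang & Liu): train a sampling network f_φ so that its outputs match the energy-based policy, making action selection a single forward pass.

a = f_\phi(s, \xi), \quad \xi \sim \mathcal{N}(0, I)
\qquad
\min_\phi\; \mathbb{E}_s\Big[ D_{\mathrm{KL}}\big( q_\phi(\cdot \mid s)\,\big\|\, \pi(\cdot \mid s)\big) \Big],
\quad \pi(a \mid s) \propto \exp\!\Big(\tfrac{1}{\alpha} Q(s, a)\Big)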

SLIDE 49

Stochastic energy-based policies aid exploration

SLIDE 50

Stochastic energy-based policies provide pretraining

SLIDE 51

Stochastic Optimal Control & MaxEnt in RL

Sallans & Hinton. Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task. 2000.
Peters et al. Relative Entropy Policy Search. 2010.
Nachum et al. Bridging the Gap Between Value and Policy Based Reinforcement Learning. 2017.
O’Donoghue et al. Combining Policy Gradient and Q-Learning. 2017.

SLIDE 52

Applications of inverse reinforcement learning

(beyond robotic manipulation and control)

  • Ziebart et al. ’08: predict taxi driver route preferences
  • Dragan et al. ’13: generating human-legible motion
  • Kitani et al. ’14: model human pedestrian interactions
  • Li et al. ’17: learn an objective for dialog generation

SLIDE 53

Concluding remarks