

SLIDE 1

Inverse Reinforcement Learning

CS 285

Instructor: Sergey Levine, UC Berkeley

SLIDE 2

Today’s Lecture

  • 1. So far: manually design reward function to define a task
  • 2. What if we want to learn the reward function from observing an expert, and then use reinforcement learning?
  • 3. Apply approximate optimality model from last time, but now learn the reward!
  • Goals:
      • Understand the inverse reinforcement learning problem definition
      • Understand how probabilistic models of behavior can be used to derive inverse reinforcement learning algorithms
      • Understand a few practical inverse reinforcement learning algorithms we can use

SLIDE 3

Optimal Control as a Model of Human Behavior

(Figure credits: Mombaur et al. ’09; Muybridge, c. 1870; Ziebart ’08; Li & Todorov ’06)

  • optimize this to explain the data
SLIDE 4

Why should we worry about learning rewards?

The imitation learning perspective

Standard imitation learning:

  • copy the actions performed by the expert
  • no reasoning about outcomes of actions

Human imitation learning:

  • copy the intent of the expert
  • might take very different actions!
SLIDE 5

Why should we worry about learning rewards?

The reinforcement learning perspective

what is the reward?

SLIDE 6

Inverse reinforcement learning

Infer reward functions from demonstrations

By itself, this is an underspecified problem: many reward functions can explain the same behavior.

SLIDE 7

A bit more formally

“forward” reinforcement learning vs. inverse reinforcement learning (which learns the reward parameters ψ)
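
Written out, the two problem statements (a reconstruction of the standard formulation; ψ denotes the reward parameters):

```latex
% "Forward" RL: given the reward, find the optimal policy.
\text{given } \mathcal{S},\ \mathcal{A},\ (\text{sometimes } p(s' \mid s, a)),\ r(s, a)
\quad\Longrightarrow\quad \text{learn } \pi^*(a \mid s)

% Inverse RL: given expert samples instead of a reward, recover the reward first.
\text{given } \mathcal{S},\ \mathcal{A},\ (\text{sometimes } p(s' \mid s, a)),\ \{\tau_i\} \sim \pi^*(\tau)
\quad\Longrightarrow\quad \text{learn } r_\psi(s, a),\ \text{then use it to learn } \pi^*(a \mid s)
```

A common choice is a reward linear in known features, r_ψ(s, a) = ψ^⊤ f(s, a), or a neural network mapping (s, a) to a scalar.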

SLIDE 8

Feature matching IRL

still ambiguous!
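
The matching condition itself, written out (π^{r_ψ} denotes the policy that is optimal under the learned reward r_ψ, and the right-hand side is estimated from expert samples):

```latex
% If important features f are matched in expectation, the rewards "agree":
E_{\pi^{r_\psi}}\!\left[ f(s, a) \right] = E_{\pi^*}\!\left[ f(s, a) \right]
```

Many different ψ satisfy this equality, hence “still ambiguous!” above.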

SLIDE 9

Feature matching IRL & maximum margin
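
A sketch of the max-margin resolution of that ambiguity, in the spirit of the papers cited below (first the naive margin, then the SVM-style version, where the required margin grows with a dissimilarity D(π, π*), e.g. the difference in feature expectations):

```latex
% Naive version: the expert must beat every other policy by margin m.
\max_{\psi, m}\ m \quad \text{s.t.} \quad
\psi^\top E_{\pi^*}[f(s,a)] \;\geq\; \max_{\pi \in \Pi} \psi^\top E_{\pi}[f(s,a)] + m

% SVM-style version: fix the margin, minimize the norm of psi instead.
\min_{\psi}\ \tfrac{1}{2}\|\psi\|^2 \quad \text{s.t.} \quad
\psi^\top E_{\pi^*}[f(s,a)] \;\geq\; \max_{\pi \in \Pi}\left[ \psi^\top E_{\pi}[f(s,a)] + D(\pi, \pi^*) \right]
```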

Issues:

  • Maximizing the margin is a bit arbitrary
  • No clear model of expert suboptimality (can add slack variables…)
  • Messy constrained optimization problem – not great for deep learning!

Further reading:

  • Abbeel & Ng: Apprenticeship learning via inverse reinforcement learning
  • Ratliff et al: Maximum margin planning
SLIDE 10

Optimal Control as a Model of Human Behavior

(Figure credits: Mombaur et al. ’09; Muybridge, c. 1870; Ziebart ’08; Li & Todorov ’06)

SLIDE 11

A probabilistic graphical model of decision making

no assumption of optimal behavior!
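
Recalling the model from the previous lecture, with a binary optimality variable O_t attached to every step:

```latex
% Steps are "optimal" with probability that grows exponentially in reward:
p(O_t \mid s_t, a_t) \propto \exp\!\left( r(s_t, a_t) \right)

% Conditioning trajectories on optimality at every step:
p(\tau \mid O_{1:T}) \propto p(\tau)\, \exp\!\left( \sum_t r(s_t, a_t) \right)
```

Suboptimal trajectories keep nonzero probability, just exponentially less, which is why the model needs no assumption of perfectly optimal behavior.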

SLIDE 12

Learning the Reward Function

SLIDE 13

Learning the optimality variable

reward parameters
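
With the reward parameterized by ψ, IRL becomes maximum likelihood estimation on the demonstrations (writing r_ψ(τ) = Σ_t r_ψ(s_t, a_t) and dropping ψ-independent terms):

```latex
p(O_t \mid s_t, a_t, \psi) \propto \exp\!\left( r_\psi(s_t, a_t) \right)

\max_{\psi}\ \frac{1}{N} \sum_{i=1}^{N} \log p(\tau_i \mid O_{1:T}, \psi)
\;=\; \max_{\psi}\ \frac{1}{N} \sum_{i=1}^{N} r_\psi(\tau_i) - \log Z
```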

SLIDE 14

The IRL partition function
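
The partition function Z is what makes the likelihood hard: it integrates over all trajectories. Differentiating gives a difference of two expectations:

```latex
Z = \int p(\tau)\, \exp\!\left( r_\psi(\tau) \right) d\tau

% Gradient: expert expectation minus soft-optimal-policy expectation.
\nabla_\psi \mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \nabla_\psi r_\psi(\tau_i)
\;-\; E_{\tau \sim p(\tau \mid O_{1:T}, \psi)}\!\left[ \nabla_\psi r_\psi(\tau) \right]
```

The first term pushes reward up on demonstrated trajectories; the second pushes it down wherever the current reward (via its soft-optimal policy) thinks behavior should go.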

SLIDE 15

Estimating the expectation
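
The intractable expectation decomposes over time into per-step state-action marginals under the soft-optimal policy:

```latex
E_{\tau \sim p(\tau \mid O_{1:T}, \psi)}\!\left[ \nabla_\psi r_\psi(\tau) \right]
= \sum_{t=1}^{T} E_{(s_t, a_t) \sim p(s_t, a_t \mid O_{1:T}, \psi)}\!\left[ \nabla_\psi r_\psi(s_t, a_t) \right]
```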

SLIDE 16

Estimating the expectation
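
Those marginals come from the backward message β and forward message α of the previous lecture. Writing μ_t(s_t, a_t) ∝ β_t(s_t, a_t) α_t(s_t) for the state-action visitation probability, the full gradient is:

```latex
\nabla_\psi \mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\psi r_\psi(s_{i,t}, a_{i,t})
\;-\; \sum_{t=1}^{T} \sum_{s_t, a_t} \mu_t(s_t, a_t)\, \nabla_\psi r_\psi(s_t, a_t)
```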

SLIDE 17

The MaxEnt IRL algorithm

Why MaxEnt? In the linear-reward case, this procedure maximizes policy entropy subject to matching the expert’s feature expectations.

Ziebart et al. 2008: Maximum Entropy Inverse Reinforcement Learning
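
A minimal tabular sketch of the resulting algorithm, assuming known dynamics, a finite horizon, a linear reward r_ψ = ψ^⊤ f(s, a), and a uniform initial state distribution; the function name, argument layout, and hyperparameters are illustrative, not from the slides:

```python
import numpy as np
from scipy.special import logsumexp

def maxent_irl(features, transitions, demos, horizon, lr=0.1, iters=200):
    """Tabular MaxEnt IRL sketch (hypothetical helper; reward r = features @ psi).

    features:    (S, A, K) feature vectors f(s, a)
    transitions: (S, A, S) dynamics p(s' | s, a)
    demos:       list of expert trajectories [(s_0, a_0), ..., (s_{T-1}, a_{T-1})]
    """
    S, A, K = features.shape
    psi = np.zeros(K)

    # Expert feature expectations per trajectory (first term of the gradient);
    # assumes every demo has length `horizon`.
    f_expert = np.mean(
        [features[s, a] for traj in demos for (s, a) in traj], axis=0) * horizon

    for _ in range(iters):
        r = features @ psi  # (S, A) reward table

        # 1. Backward pass: soft value iteration -> soft-optimal policy.
        V = np.zeros(S)
        policy = np.zeros((horizon, S, A))
        for t in reversed(range(horizon)):
            Q = r + transitions @ V           # soft Q-values, (S, A)
            V = logsumexp(Q, axis=1)          # soft value function
            policy[t] = np.exp(Q - V[:, None])

        # 2. Forward pass: state-action visitation under that policy.
        d = np.full(S, 1.0 / S)               # uniform initial states (assumption)
        mu = np.zeros((S, A))
        for t in range(horizon):
            sa = d[:, None] * policy[t]       # mu_t(s, a)
            mu += sa
            d = np.einsum('sa,sap->p', sa, transitions)

        # 3. Gradient step: expert features minus model features.
        grad = f_expert - np.einsum('sa,sak->k', mu, features)
        psi += lr * grad
    return psi
```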

SLIDE 18

SLIDE 19

Approximations in High Dimensions

SLIDE 20
What’s missing so far?

  • MaxEnt IRL so far requires…
      • Solving for the (soft) optimal policy in the inner loop
      • Enumerating all state-action tuples for visitation frequency and gradient
  • To apply this in practical problem settings, we need to handle…
      • Large and continuous state and action spaces
      • States obtained via sampling only
      • Unknown dynamics

SLIDE 21

Unknown dynamics & large state/action spaces

Assume we don’t know the dynamics, but we can sample, like in standard RL
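
One sample-based estimator: learn the soft-optimal policy under the current reward with any max-ent RL method, then replace the intractable expectation with its samples:

```latex
\nabla_\psi \mathcal{L} \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\psi r_\psi(\tau_i)
\;-\; \frac{1}{M} \sum_{j=1}^{M} \nabla_\psi r_\psi(\tau_j),
\qquad \tau_j \sim \text{soft-optimal policy under the current } r_\psi
```

The catch: this nests a full (soft) RL problem inside every gradient step on ψ, which is exactly what the next slides relax.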

SLIDE 22

More efficient sample-based updates

SLIDE 23

Importance sampling
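
Instead of re-solving for the soft-optimal policy at every step, sample from a partially trained policy π and correct the bias with importance weights (the initial-state and dynamics terms cancel in the ratio):

```latex
\nabla_\psi \mathcal{L} \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\psi r_\psi(\tau_i)
\;-\; \frac{1}{\sum_j w_j} \sum_{j=1}^{M} w_j\, \nabla_\psi r_\psi(\tau_j),
\qquad
w_j = \frac{p(\tau_j) \exp(r_\psi(\tau_j))}{\pi(\tau_j)}
= \frac{\exp\!\left( \sum_t r_\psi(s_t, a_t) \right)}{\prod_t \pi(a_t \mid s_t)}
```

Each policy update w.r.t. the current reward brings π closer to the target distribution, making the weights less degenerate over time.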

SLIDE 24

Guided cost learning algorithm (Finn et al. ICML ’16), a loop over three steps:

  • generate policy samples from π
  • update the reward r using the samples & demos
  • update the policy π w.r.t. the reward

slides adapted from C. Finn
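
The same loop in code form; a hypothetical sketch of the alternation described above, assuming `policy` and `reward` objects that expose the small interface used here (the method names are illustrative, not from the authors’ implementation):

```python
import numpy as np

def guided_cost_learning(env, policy, reward, demos, num_iterations, num_samples):
    """Hypothetical sketch of the guided cost learning alternation (Finn et al. '16)."""
    for _ in range(num_iterations):
        # 1. Generate policy samples from the current (partially trained) policy pi.
        samples = [policy.rollout(env) for _ in range(num_samples)]

        # 2. Update the reward using demos and importance-weighted samples:
        #    w_j = exp(r_psi(tau_j)) / pi(tau_j), as on the previous slide.
        weights = [np.exp(reward.total(tau)) / policy.likelihood(tau)
                   for tau in samples]
        reward.gradient_step(demos, samples, weights)

        # 3. Improve pi with one (or a few) RL steps w.r.t. the current reward.
        policy.rl_step(reward, samples)
    return reward, policy
```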

SLIDE 25

IRL and GANs

SLIDE 26

It looks a bit like a game…


SLIDE 27

Generative Adversarial Networks

Goodfellow et al. ’14; Isola et al. ’17; Arjovsky et al. ’17; Zhu et al. ’17

SLIDE 28

Inverse RL as a GAN

Finn*, Christiano* et al. “A Connection Between Generative Adversarial Networks, Inverse Reinforcement Learning, and Energy-Based Models.”
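
The key construction in the cited paper: parameterize the discriminator through the reward, with the dynamics terms cancelling between numerator and denominator:

```latex
D_\psi(\tau) = \frac{\frac{1}{Z} \exp\!\left( r_\psi(\tau) \right)}
{\frac{1}{Z} \exp\!\left( r_\psi(\tau) \right) + \prod_t \pi(a_t \mid s_t)}
```

Training D_ψ with the usual GAN cross-entropy loss then recovers the MaxEnt IRL objective for ψ, and training π to fool this discriminator recovers the (soft) policy objective.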

SLIDE 29

Inverse RL as a GAN (continued)

SLIDE 30

Generalization via inverse RL

Given a demonstration, we want to reproduce the behavior under different conditions. What can we learn from the demonstration to enable better transfer? We need to decouple the goal from the dynamics: policy = reward + dynamics.

Fu et al. Learning Robust Rewards with Adversarial Inverse Reinforcement Learning

SLIDE 31

Can we just use a regular discriminator?

Ho & Ermon. Generative adversarial imitation learning.

Pros & cons:

+ often simpler to set up optimization, fewer moving parts

  − discriminator knows nothing at convergence
  − generally cannot reoptimize the “reward” (see the note below)
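
Concretely, with a regular discriminator D(s, a) the policy trains against a surrogate reward; one common choice (sign conventions vary across papers) is shown below:

```latex
\tilde{r}(s, a) = \log D(s, a)
% At convergence D(s, a) -> 1/2 everywhere, so the surrogate carries
% no information and cannot be re-optimized as a standalone reward.
```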
SLIDE 32

IRL as adversarial optimization

Two lines of work, one underlying idea:

  • Generative Adversarial Imitation Learning: robot attempt → classifier (Ho & Ermon, NIPS 2016; Hausman, Chebotar, Schaal, Sukhatme, Lim; Peng, Kanazawa, Toyer, Abbeel, Levine)
  • Guided Cost Learning: robot attempt → reward function (Finn et al., ICML 2016)

These are actually the same thing!

SLIDE 33

Suggested Reading on Inverse RL

Classic Papers:

  • Abbeel & Ng, ICML ’04. Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement learning.
  • Ziebart et al., AAAI ’08. Maximum Entropy Inverse Reinforcement Learning. Introduction to the probabilistic method for inverse reinforcement learning.

Modern Papers:

  • Finn et al., ICML ’16. Guided Cost Learning. Sampling-based method for MaxEnt IRL that handles unknown dynamics and deep reward functions.
  • Wulfmeier et al., arXiv ’16. Deep Maximum Entropy Inverse Reinforcement Learning. MaxEnt inverse RL using deep reward functions.
  • Ho & Ermon, NIPS ’16. Generative Adversarial Imitation Learning. Inverse RL method using generative adversarial networks.
  • Fu, Luo, Levine, ICLR ’18. Learning Robust Rewards with Adversarial Inverse Reinforcement Learning.