Learning Novel Policies For Tasks Yunbo Zhang, Wenhao Yu, Greg Turk - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Learning Novel Policies For Tasks

Yunbo Zhang, Wenhao Yu, Greg Turk

slide-2
SLIDE 2

Motivation

  • Want more than one solution (i.e. novel solutions) to a problem.
  • E.g. Different Locomotion styles for legged robots.

Style 1 Style 2 Style 3

slide-3
SLIDE 3

Key Aspects

  • Novelty measurement function
  • Measures the novelty of a trajectory compared with trajectories from other policies
  • Policy Gradient Update
  • Make sure final gradient compromises between task and novelty
  • Task-Novelty Bisector (TNB)
slide-4
SLIDE 4

Method Overview

  • Define a separate novelty reward function apart from task reward.
  • Train a policy using Task-Novelty Bisector (TNB) to balance the optimization of task and novelty.
  • Update novelty measurement function.
  • Repeat
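
The steps above can be sketched as a loop. This is a minimal outline under assumptions, not the authors' code: `train_with_tnb` and `fit_autoencoder` are hypothetical callables supplied by the caller, and "update novelty measurement function" is read here as adding one autoencoder for the newly trained policy.

```python
def train_novel_policies(num_policies, train_with_tnb, fit_autoencoder):
    """Train policies one after another, each pushed away from the earlier ones.

    train_with_tnb(autoencoders) -> policy   (hypothetical: TNB training step)
    fit_autoencoder(policy) -> autoencoder   (hypothetical: fit on the policy's state sequences)
    """
    policies = []
    autoencoders = []  # defines the current novelty measurement
    for _ in range(num_policies):
        # Train against the novelty measurement built from all earlier policies.
        policy = train_with_tnb(list(autoencoders))
        policies.append(policy)
        # Update the novelty measurement with an autoencoder for the new policy.
        autoencoders.append(fit_autoencoder(policy))
    return policies, autoencoders
```

The first policy is trained with an empty autoencoder set, so how novelty is defined in that first round is left to the hypothetical `train_with_tnb`.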
slide-5
SLIDE 5

Novelty Measurement

  • Use autoencoder reconstruction error of state sequences to compute novelty.
  • One autoencoder for each policy.
  • For the set of autoencoders $\boldsymbol{E} = \{E_1, \dots, E_n\}$, the novelty reward function is:

$$r_{novel} = -\exp\left(-w_{novel} \min_{E \in \boldsymbol{E}} \lVert E(\boldsymbol{s}) - \boldsymbol{s} \rVert^2\right)$$
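
As a small illustration, the novelty reward can be computed as below. This sketch reads the slide's formula as r_novel = -exp(-w_novel * min over autoencoders of the squared reconstruction error of the state sequence). The function names, the flat-list representation of a state sequence, and the default weight are assumptions made for the example.

```python
import math

def reconstruction_error(autoencoder, states):
    """Squared Euclidean error between a state sequence and its reconstruction."""
    recon = autoencoder(states)
    return sum((r - s) ** 2 for r, s in zip(recon, states))

def novelty_reward(autoencoders, states, w_novel=1.0):
    """Reward in [-1, 0): -1 if some autoencoder reconstructs the sequence perfectly."""
    # The best-fitting autoencoder decides how familiar the sequence is.
    min_err = min(reconstruction_error(ae, states) for ae in autoencoders)
    return -math.exp(-w_novel * min_err)

# Toy stand-ins for trained autoencoders (assumptions for illustration only).
identity = lambda seq: list(seq)           # reconstructs everything perfectly
zeros = lambda seq: [0.0 for _ in seq]     # reconstructs nothing

familiar = novelty_reward([identity, zeros], [1.0, 2.0])  # min error is 0
novel = novelty_reward([zeros], [1.0, 2.0])               # min error is 5
```

A sequence that any existing autoencoder reconstructs well gets a reward near -1, while a sequence that none of them can reconstruct gets a reward near 0.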

slide-6
SLIDE 6

Task-Novelty Bisector (TNB)

  • Compute policy gradients for task reward and novelty reward:

$$g_{task} = \frac{\partial J_{task}}{\partial \theta}, \qquad g_{novel} = \frac{\partial J_{novel}}{\partial \theta}$$

  • Compute the final policy gradient using the following rules:

  • r
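
The combination rules themselves did not survive in this transcript, so the following is only one possible reading of "bisector", not the slide's actual rules. It assumes the update direction is the angular bisector of the task and novelty gradients (the normalized sum of their unit vectors) and, as a further assumption, scales it by the mean projection of the two gradients onto that direction. How the method treats strongly conflicting gradients is not covered here.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _norm(v):
    return math.sqrt(_dot(v, v))

def bisector_gradient(g_task, g_novel):
    """Combine two gradients along their angular bisector (illustrative assumption)."""
    n_task, n_novel = _norm(g_task), _norm(g_novel)
    if n_task == 0.0 or n_novel == 0.0:
        raise ValueError("both gradients must be non-zero")
    # Sum of unit vectors points along the bisector of the angle between them.
    b = [t / n_task + n / n_novel for t, n in zip(g_task, g_novel)]
    n_b = _norm(b)
    if n_b == 0.0:
        raise ValueError("gradients are exactly opposite; bisector undefined")
    b = [x / n_b for x in b]
    # Assumed magnitude: mean of the two projections onto the bisector.
    scale = 0.5 * (_dot(g_task, b) + _dot(g_novel, b))
    return [scale * x for x in b]

g_final = bisector_gradient([1.0, 0.0], [0.0, 1.0])
```

Because the direction makes the same angle with both gradients, a step along it improves the task objective and the novelty objective together to first order, as long as the two gradients are not exactly opposite.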
slide-7
SLIDE 7

Multiple Solutions

PPO Policy

Target End-Effector

slide-8
SLIDE 8

Multiple Solutions

TNB Policies

slide-9
SLIDE 9

Deceptive Reward Problems

  • Our methods could be further extended to solve tasks with deceptive reward signals.

  • E.g. Deceptive Reacher

Target End-Effector

slide-10
SLIDE 10

Deceptive Reward Problems

TNB Policies

slide-11
SLIDE 11

Thank You!

Poster: Pacific Ballroom #37