Learning Novel Policies For Tasks Yunbo Zhang, Wenhao Yu, Greg Turk - - PowerPoint PPT Presentation

Sep 07, 2022

SLIDE 1

Learning Novel Policies For Tasks

Yunbo Zhang, Wenhao Yu, Greg Turk

SLIDE 2

Motivation

Want more than one solution (i.e. novel solutions) to a problem.
E.g. Different Locomotion styles for legged robots.

[Figure panels: Style 1, Style 2, Style 3]

SLIDE 3

Key Aspects

Novelty measurement function
Measures the novelty of a trajectory compared with trajectories from other policies.

Policy Gradient Update
Make sure the final gradient compromises between task and novelty.

Task-Novelty Bisector (TNB)

SLIDE 4

Method Overview

Define a separate novelty reward function apart from task reward.
Train a policy using Task-Novelty Bisector (TNB) to balance the optimization of task and novelty.
Update novelty measurement function.
Repeat
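
The loop above can be sketched as follows. This is a minimal outline under stated assumptions, not the authors' implementation: `train_policy` and `fit_autoencoder` are hypothetical callables standing in for TNB training and autoencoder fitting, and `num_policies` is an assumed stopping criterion that the slide does not specify.

```python
from typing import Any, Callable, List, Tuple


def learn_novel_policies(
    num_policies: int,
    train_policy: Callable[[List[Any]], Any],   # hypothetical: trains one policy given existing autoencoders
    fit_autoencoder: Callable[[Any], Any],      # hypothetical: fits an autoencoder to one policy's trajectories
) -> Tuple[List[Any], List[Any]]:
    """Train policies one at a time, each pushed away from the earlier ones."""
    policies: List[Any] = []
    autoencoders: List[Any] = []
    for _ in range(num_policies):
        # Train against the novelty measure built from all earlier policies.
        # For the first policy the list is empty, so train_policy must
        # handle that case (for example by using the task reward only).
        policy = train_policy(list(autoencoders))
        # Update the novelty measurement: one autoencoder per policy.
        autoencoders.append(fit_autoencoder(policy))
        policies.append(policy)
    return policies, autoencoders
```

Passing the two steps in as callables keeps the outline independent of any particular reinforcement learning library.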

SLIDE 5

Novelty Measurement

Use autoencoder reconstruction error of state sequences to compute novelty.

One autoencoder for each policy.
For the set of autoencoders $\mathbf{D} = \{D_1, \dots, D_n\}$, the novelty reward function is:

$$r_{\text{novel}} = -\exp\left(-w_{\text{novel}} \min_{D \in \mathbf{D}} \lVert D(\mathbf{s}) - \mathbf{s} \rVert^2\right)$$
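
As an illustration, the novelty reward can be computed as below, reading the formula as r_novel = -exp(-w_novel * min over autoencoders D of ||D(s) - s||^2), where s is a state sequence. This is a sketch only: the `Autoencoder` type is a hypothetical stand-in for a trained model, state sequences are assumed to be flattened into plain lists of floats, and the default `w_novel = 1.0` is an arbitrary choice.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]
# Hypothetical interface: maps a flattened state sequence to its reconstruction.
Autoencoder = Callable[[Vector], Vector]


def squared_error(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def novelty_reward(s: Vector, autoencoders: Sequence[Autoencoder],
                   w_novel: float = 1.0) -> float:
    """Return -exp(-w_novel * min_D ||D(s) - s||^2).

    The reward is -1 when some existing autoencoder reconstructs s
    perfectly (nothing new) and approaches 0 as the smallest
    reconstruction error grows (very novel).
    """
    min_error = min(squared_error(d(s), s) for d in autoencoders)
    return -math.exp(-w_novel * min_error)
```

Taking the minimum over autoencoders means a state sequence only counts as novel if every earlier policy's autoencoder reconstructs it poorly.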

SLIDE 6

Task-Novelty Bisector (TNB)

Compute policy gradients for task reward and novelty reward
Compute the final policy gradient using the following rules:

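$$g_{\text{task}} = \frac{\partial J_{\text{task}}}{\partial \theta}, \qquad g_{\text{novel}} = \frac{\partial J_{\text{novel}}}{\partial \theta}$$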
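
The combination rules themselves did not survive in this transcript, so the following is only one plausible sketch of a bisector-style update, not the rule from the slide. It assumes that when the task and novelty gradients form an angle of at most 90 degrees, the update follows their angular bisector, scaled by the average of the two gradients' projections onto it. The handling of the obtuse case (projecting the novelty gradient onto the plane perpendicular to the task gradient) and of zero gradients are further assumptions. The function name `tnb_gradient` is hypothetical.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: Sequence[float]) -> float:
    return math.sqrt(dot(a, a))


def tnb_gradient(g_task: Sequence[float], g_novel: Sequence[float],
                 eps: float = 1e-12) -> List[float]:
    """Combine task and novelty gradients (assumed bisector-style rule)."""
    n_task, n_novel = norm(g_task), norm(g_novel)
    # Assumption: if one gradient vanishes, fall back to the other.
    if n_task < eps:
        return list(g_novel)
    if n_novel < eps:
        return list(g_task)

    if dot(g_task, g_novel) >= 0.0:
        # Angle <= 90 degrees: unit vector along the angular bisector.
        b = [x / n_task + y / n_novel for x, y in zip(g_task, g_novel)]
        n_b = norm(b)
        b = [x / n_b for x in b]
        # Step size: average of the two projections onto the bisector.
        scale = 0.5 * (dot(g_task, b) + dot(g_novel, b))
        return [scale * x for x in b]

    # Angle > 90 degrees (assumed rule): remove from g_novel its component
    # along g_task, so the update is orthogonal to the task gradient.
    coef = dot(g_novel, g_task) / (n_task ** 2)
    return [y - coef * x for x, y in zip(g_task, g_novel)]
```

For example, with `g_task = [1.0, 0.0]` and `g_novel = [0.0, 1.0]` the two gradients are perpendicular, and the sketch returns a step along the diagonal that improves both objectives equally.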

SLIDE 7

Multiple Solutions

PPO Policy

[Figure labels: Target, End-Effector]

SLIDE 8

Multiple Solutions

TNB Policies

SLIDE 9

Deceptive Reward Problems

Our methods could be further extended to solve tasks with deceptive reward signals.

E.g. Deceptive Reacher

[Figure labels: Target, End-Effector]

SLIDE 10

Deceptive Reward Problems

TNB Policies

SLIDE 11

Thank You!

Poster: Pacific Ballroom #37