SLIDE 1

Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods

Yaakov Engel

Joint work with Peter Szabo and Dmitry Volkinshtein (ex. Technion)

slide-2
SLIDE 2

Gaussian Processes in Practice Workshop

Why use GPs in RL?

  • A Bayesian approach to value estimation
  • Forces us to make our assumptions explicit
  • Non-parametric – priors are placed and inference is performed directly in function space (kernels)
  • Domain knowledge intuitively coded into priors
  • Provides full posterior, not just point estimates
  • Efficient, on-line implementation, suitable for large problems

SLIDE 3

Markov Decision Processes

X: state space, U: action space
Transition model: p: X × X × U → [0, 1], xt+1 ∼ p(·|xt, ut)
Reward model: q: ℝ × X × U → [0, 1], R(xt, ut) ∼ q(·|xt, ut)
Stationary policy: µ: U × X → [0, 1], ut ∼ µ(·|xt)
Discounted return: Dµ(x) = Σ_{i=0}^∞ γ^i R(xi, ui) | (x0 = x)
Value function: V µ(x) = Eµ[Dµ(x)]
Goal: Find a policy µ∗ maximizing V µ(x) for all x ∈ X
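As a quick illustration of the discounted return defined above, a Monte-Carlo estimate of Dµ(x0) from one sampled reward sequence can be sketched as follows (the function name and γ value are illustrative, not from the talk):

```python
def discounted_return(rewards, gamma=0.9):
    """Monte-Carlo estimate of D(x0) = sum_i gamma^i * R(x_i, u_i)
    from a single sampled trajectory's reward sequence."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total
```

The value function V µ(x) is then the expectation of this quantity over trajectories starting at x and following µ.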

SLIDE 4

Bellman’s Equation

For a fixed policy µ:
V µ(x) = E_{x′,u|x}[R(x, u) + γV µ(x′)]

Optimal value and policy:
V ∗(x) = max_µ V µ(x),  µ∗ = argmax_µ V µ(x)

How to solve it?
  • Methods based on Value Iteration (e.g. Q-learning)
  • Methods based on Policy Iteration (e.g. SARSA, OPI, Actor-Critic)
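A minimal tabular sketch of the first family, value iteration, which repeatedly applies the Bellman optimality backup. The toy MDP layout (P[a, s, s′] transition probabilities, R[a, s] expected rewards) is an assumption for illustration, not the representation used for the arm:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, iters=500):
    """Tabular value iteration.
    P: (nA, nS, nS) array, P[a, s, s'] = transition probability.
    R: (nA, nS) array, R[a, s] = expected reward for action a in state s."""
    nA, nS, _ = P.shape
    V = np.zeros(nS)
    for _ in range(iters):
        Q = R + gamma * (P @ V)   # Q[a, s] = Bellman backup per action
        V = Q.max(axis=0)         # greedy improvement
    return V, Q.argmax(axis=0)    # value function and greedy policy

# Toy 2-state MDP: action 0 stays put, action 1 swaps states; reward 1 in state 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, 0, 1] = P[1, 1, 0] = 1.0
R = np.array([[0.0, 1.0], [0.0, 1.0]])
V, policy = value_iteration(P, R)
```

With γ = 0.9 the rewarding state converges to V ≈ 10 and the greedy policy stays there.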

SLIDE 5

Solution Method Taxonomy

RL algorithms divide into:
  • Value-function based
      – Value Iteration type (Q-Learning)
      – Policy Iteration type (Actor-Critic, OPI, SARSA)
  • Purely policy based (Policy Gradient)

PI methods need a “subroutine” for policy evaluation

SLIDE 6

Gaussian Process Temporal Difference Learning

Model equations: R(xi) = V (xi) − γV (xi+1) + N(xi)

Or, in compact form: Rt = Ht+1 Vt+1 + Nt, where

Ht = ⎡ 1  −γ           ⎤
     ⎢     1  −γ       ⎥
     ⎢        ⋱    ⋱   ⎥
     ⎣           1  −γ ⎦

Our (Bayesian) goal: Find the posterior distribution of V (·), given a sequence of observed states and rewards.

SLIDE 7

The Posterior

General noise covariance: Cov[Nt] = Σt

Joint distribution:

⎡ Rt−1 ⎤        ⎛ ⎡0⎤   ⎡ HtKtHt⊤ + Σt   Htkt(x) ⎤ ⎞
⎣ V (x) ⎦  ∼  N ⎝ ⎣0⎦ , ⎣ kt(x)⊤Ht⊤      k(x, x)  ⎦ ⎠

Invoke Bayes’ Rule:
E[V (x)|Rt−1] = kt(x)⊤αt
Cov[V (x), V (x′)|Rt−1] = k(x, x′) − kt(x)⊤Ctkt(x′)
where kt(x) = (k(x1, x), . . . , k(xt, x))⊤
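The posterior above can be computed directly in batch form; a minimal NumPy sketch for illustration only. The RBF kernel, noise level, and toy scalar states are assumptions, and the actual algorithm uses efficient on-line recursive updates with sparsification rather than this O(t³) batch solve:

```python
import numpy as np

def gptd_posterior(states, rewards, kernel, gamma=0.9, sigma2=1.0):
    """Batch GPTD posterior: V ~ GP(0, k), R(x_i) = V(x_i) - gamma*V(x_{i+1}) + N_i.
    states has t+1 entries, rewards has t entries (one per transition)."""
    t = len(rewards)
    K = np.array([[kernel(a, b) for b in states] for a in states])  # (t+1, t+1) Gram matrix
    H = np.zeros((t, t + 1))                                        # the (1, -gamma) band matrix
    for i in range(t):
        H[i, i], H[i, i + 1] = 1.0, -gamma
    Sigma = sigma2 * np.eye(t)                                      # white-noise covariance (a choice)
    G = np.linalg.solve(H @ K @ H.T + Sigma, np.eye(t))             # (H K H^T + Sigma)^{-1}
    alpha = H.T @ G @ np.asarray(rewards)                           # posterior mean weights
    C = H.T @ G @ H                                                 # posterior covariance matrix
    def value(x):
        kx = np.array([kernel(xi, x) for xi in states])             # k_t(x)
        return kx @ alpha, kernel(x, x) - kx @ C @ kx               # mean, variance
    return value

# Toy usage with an RBF kernel on scalar states:
rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
V = gptd_posterior([0.0, 0.5, 1.0], [1.0, -1.0], rbf, gamma=0.9)
mean, var = V(0.25)
```

The returned `value` function gives exactly the posterior mean kt(x)⊤αt and variance k(x, x) − kt(x)⊤Ctkt(x′) from the slide.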

SLIDE 8

SLIDE 9

The Octopus Arm

  • Can bend and twist at any point
  • Can do this in any direction
  • Can be elongated and shortened
  • Can change cross section
  • Can grab using any part of the arm
  • Virtually infinitely many DOF

SLIDE 10

The Muscular Hydrostat Mechanism

A constraint: muscle fibers can only contract (actively). In vertebrate limbs, two separate muscle groups - agonists and antagonists - are used to control each DOF of every joint, by exerting opposite torques.

But the octopus has no skeleton! (Think of a balloon.) Muscle tissue is incompressible; therefore, if muscles are arranged such that different muscle groups are interleaved in perpendicular directions in the same region, contraction in one direction will result in extension in at least one of the other directions. This is the muscular hydrostat mechanism.

SLIDE 11

Octopus Arm Anatomy 101

SLIDE 12

Our Arm Model

[Figure: schematic of the arm model – compartments C1 … CN between muscle pairs #1 (arm base) and #N+1 (arm tip), with longitudinal muscles along the dorsal and ventral sides and transverse muscles spanning between them.]

SLIDE 13

The Muscle Model

f(a) = (k0 + a(kmax − k0)) (ℓ − ℓ0) + β dℓ/dt,  a ∈ [0, 1]
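The muscle model is a linear spring-damper whose stiffness is interpolated between k0 and kmax by the activation a; a direct transcription of the formula (all parameter values here are placeholders, not those of the simulator):

```python
def muscle_force(a, length, velocity, k0=1.0, kmax=10.0, rest_length=1.0, beta=0.5):
    """Muscle force f(a) = (k0 + a*(kmax - k0)) * (l - l0) + beta * dl/dt.
    Activation a in [0, 1] linearly interpolates stiffness from k0 (passive) to kmax."""
    assert 0.0 <= a <= 1.0
    stiffness = k0 + a * (kmax - k0)
    return stiffness * (length - rest_length) + beta * velocity
```

Note that with a = 0 the muscle still acts as a passive spring with stiffness k0, and the damping term β dℓ/dt acts regardless of activation.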

SLIDE 14

Other Forces

  • Gravity
  • Buoyancy
  • Water drag
  • Internal pressures (maintain constant compartmental volume)

Dimensionality

10 compartments ⇒ 22 point masses × (x, y, ẋ, ẏ) = 88 state variables

SLIDE 15

The Control Problem

Starting from a random position, bring {any part, tip} of the arm into contact with a goal region, optimally.

Optimality criteria: time, energy, obstacle avoidance
Constraint: we only have access to sampled trajectories
Our approach: define the problem as an MDP and apply Reinforcement Learning algorithms

SLIDE 16

The Task

[Figure: snapshot of the simulated arm and the goal region at t = 1.38]

SLIDE 17

Actions

Each action specifies a set of fixed activations – one for each muscle in the arm.

[Figure: the six basic activation patterns, Actions #1–#6]

Base rotation adds duplicates of actions 1, 2, 4 and 5, with positive and negative torques applied to the base.

SLIDE 18

Rewards

Deterministic rewards:
  • +10 for a goal state
  • a large negative value for hitting an obstacle
  • −1 otherwise

Energy economy: a constant multiple of the energy expended by the muscles in each action interval was deducted from the reward.
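The reward scheme above can be transcribed as a small function; the obstacle penalty magnitude and the energy coefficient are assumptions (the slide only says "large negative value" and "a constant multiple"):

```python
def reward(at_goal, hit_obstacle, energy, energy_cost=0.01, obstacle_penalty=-100.0):
    """Deterministic reward per action interval, following the slide's scheme:
    +10 at the goal, a large penalty for obstacles, -1 otherwise,
    minus a constant multiple of the muscle energy expended."""
    if at_goal:
        r = 10.0
    elif hit_obstacle:
        r = obstacle_penalty   # "large negative value" - the exact number is a placeholder
    else:
        r = -1.0
    return r - energy_cost * energy
```

The −1 step cost encourages reaching the goal quickly, while the energy term trades off speed against economical muscle use.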

SLIDE 19

Fixed Base Task I

SLIDE 20

Fixed Base Task II

SLIDE 21

Rotating Base Task I

SLIDE 22

Rotating Base Task II

SLIDE 23

To Wrap Up

  • There’s more to GPs than regression and classification
  • Online sparsification works

Challenges

  • How to use value uncertainty?
  • What’s a disciplined way to select actions?
  • What’s the best noise covariance?
  • More complicated tasks
