Stochastic Optimal Control – part 4: research issues, robotics applications


SLIDE 1

Stochastic Optimal Control – part 4: research issues, robotics applications

Marc Toussaint

Machine Learning & Robotics Group – TU Berlin mtoussai@cs.tu-berlin.de ICML 2008, Helsinki

  • challenges in stochastic optimal control
  • probabilistic inference approaches to control
  • robotics
  • model learning


SLIDE 2

challenges in stochastic optimal control

  • often said: “scale up”
  • Efficient Application in Real Systems!

→ try to extract the fundamental problems


SLIDE 3

research issues 1/3: structured state

  • notion of state (i.e., having one big state space)

– curse of dimensionality
– real systems are typically decomposed / modular / hierarchical / structured → exploit this!

  • interesting lines of work

– Carlos Guestrin (PhD thesis)
– probabilistic inference methods (in graphical models: belief propagation, etc.)
– probabilistic inference for computing optimal policies
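A back-of-the-envelope sketch of why exploiting structure matters (the numbers and function names below are illustrative, not from the slides): a flat model over n binary state variables needs a table exponential in n, while a factored model, where each variable depends on at most k parents, only stores small local tables.

```python
# Hypothetical illustration: flat vs. factored representation size
# for a system of n binary state variables, each depending on at
# most k parents (as in a dynamic Bayesian network).

def flat_table_size(n):
    # one entry per (joint state, joint successor state) pair
    return (2 ** n) * (2 ** n)

def factored_table_size(n, k):
    # one conditional table per variable: 2^k parent settings x 2 values
    return n * (2 ** k) * 2

print(flat_table_size(20))         # on the order of 10^12 entries
print(factored_table_size(20, 3))  # a few hundred entries
```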


SLIDE 4

research issues 2/3: learning

  • learning

– want to learn models from experience

  • interesting lines of work

– ML for model learning in robotics


SLIDE 5

research issues 3/3: integration

  • integration

– complex systems (e.g., robots) collect state information from many different modalities (sensors)
– many subsystems (e.g., vision, position, haptics)
– delayed/partial information
– integration is hard


SLIDE 6

probabilistic inference approach

  • general idea:

decision making, motion control and planning can be viewed as a problem of inferring a posterior over unknown variables (actions, control signals, whole trajectories) conditioned on available information (targets, goals, constraints)
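A minimal sketch of this view (the numbers are hypothetical, not from the slides): treat the decision as a latent variable, condition on the goal being reached, and read off the posterior over actions by Bayes' rule.

```python
# Minimal sketch: decision making as inference (hypothetical model).
# Latent action a; we condition on the event "goal reached".
prior = {"left": 0.5, "right": 0.5}   # P(a)
p_goal = {"left": 0.2, "right": 0.7}  # P(goal | a)

# Bayes' rule: P(a | goal) is proportional to P(goal | a) P(a)
joint = {a: prior[a] * p_goal[a] for a in prior}
Z = sum(joint.values())
posterior = {a: p / Z for a, p in joint.items()}
print(posterior)  # the posterior favors "right"
```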


SLIDE 7

probabilistic inference approach

  • given some model of the future:

[figure: graphical model of an MDP with states x0, x1, x2, actions a0, a1, a2, and rewards r0, r1, r2; policy π]

(here a Markov Decision Process with P(x0), P(x′ | a, x), P(r | a, x) given, and the policy π(a | x) = P(a | x) unknown)

  • condition it on something you want to see in the future
  • compute the posterior over actions/decisions to get there
  • Toussaint & Storkey (ICML 2006): proof that maximization of expected future rewards reduces to a likelihood maximization problem (EM algorithm) [fwd-bwd video]
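To make the "condition on the future" step concrete, here is a tiny enumeration sketch on a hypothetical two-state, two-action MDP (states, transitions, and reward probabilities are invented for illustration): condition on receiving reward at the final time step and compute the posterior over the first action under a uniform policy prior.

```python
# Sketch (hypothetical MDP): infer the posterior over the first
# action a0, conditioned on obtaining reward at the final step.
import itertools

# P(x' | a, x): action 1 tends to move toward state 1, action 0 stays
P = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.2, 1: 0.8},
     (1, 0): {0: 0.1, 1: 0.9}, (1, 1): {0: 0.1, 1: 0.9}}
R = {(0, 0): 0.0, (0, 1): 0.1, (1, 0): 0.8, (1, 1): 0.9}  # P(r=1 | x, a)

x0, T = 0, 2
post = {a: 0.0 for a in (0, 1)}           # unnormalized P(a0 | r_T = 1)
for a in itertools.product((0, 1), repeat=T):
    for xs in itertools.product((0, 1), repeat=T - 1):
        traj = (x0,) + xs
        w = 0.5 ** T                       # uniform policy prior over actions
        for t in range(T - 1):
            w *= P[(traj[t], a[t])][traj[t + 1]]
        w *= R[(traj[-1], a[-1])]          # condition on reward at the end
        post[a[0]] += w
Z = sum(post.values())
post = {k: v / Z for k, v in post.items()}
print(post)  # moving toward state 1 first is more likely a posteriori
```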


SLIDE 8

probabilistic inference approach

[details: Toussaint, Storkey, ICML 2006]

  • problem:

Find a policy π that maximizes

    V^π = E{ Σ_{t=0}^∞ γ^t r_t ; π }

with discount factor γ ∈ [0, 1)

  • Theorem:

Maximizing the likelihood

    L^π = P(r̂ = 1; π)

in the mixture of finite-time MDPs (with time prior P(T) = γ^T (1 − γ)) is equivalent to maximizing

    V^π = E{ Σ_{t=0}^∞ γ^t r_t ; π }

in the original MDP.

  • problem of optimal policy → problem of likelihood maximization (EM algorithm) [demo]
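A quick numerical illustration of the time-prior construction (a hypothetical two-state Markov chain with the policy already folded into the transitions, rewards in [0, 1] read as reward probabilities): the mixture likelihood with P(T) = γ^T (1 − γ) comes out as (1 − γ) V^π, so maximizing one maximizes the other.

```python
# Hypothetical two-state chain under a fixed policy.
gamma = 0.9
P = [[0.8, 0.2],   # P(x' | x) with the policy folded in
     [0.3, 0.7]]
r = [0.1, 0.9]     # E[r | x] = P(r=1 | x), rewards in [0, 1]
p = [1.0, 0.0]     # start distribution P(x0)

V, L, disc = 0.0, 0.0, 1.0
for T in range(500):                          # truncated infinite sums
    Er = sum(p[x] * r[x] for x in range(2))   # E[r_T] under the chain
    V += disc * Er                            # V^pi = sum_t gamma^t E[r_t]
    L += disc * (1 - gamma) * Er              # mixture with P(T)=gamma^T(1-gamma)
    p = [sum(p[x] * P[x][y] for x in range(2)) for y in range(2)]
    disc *= gamma
print(L, (1 - gamma) * V)  # the two quantities agree
```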


SLIDE 9

POMDP application

  • in POMDPs the agent needs some kind of memory

[figure: graphical model of a POMDP with hidden states x0, x1, x2, observations y0, y1, y2, memory nodes b0, b1, b2, actions a0, a1, a2, and rewards r0, r1, r2]

  • mazes: T-junctions, halls & corridors (379 locations, 1516 states)

(Toussaint, Harmeling & Storkey, 2006)
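One standard way to give the agent memory is a finite-state controller, which is the role the b-nodes play in the model; a minimal sketch with invented action and node-update tables:

```python
# Minimal sketch (hypothetical tables): a finite-state controller
# gives a POMDP agent memory. The internal node b selects the action;
# b is updated from the observation y, never from the hidden state x.
action = {0: "forward", 1: "turn"}              # pi(a | b)
node_update = {(0, "wall"): 1, (0, "open"): 0,  # b' = f(b, y), deterministic here
               (1, "wall"): 1, (1, "open"): 0}

def act(observations, b=0):
    """Replay a sequence of observations through the controller."""
    taken = []
    for y in observations:
        taken.append(action[b])    # act based on internal memory only
        b = node_update[(b, y)]    # memory update from the observation
    return taken

print(act(["open", "wall", "open"]))  # ['forward', 'forward', 'turn']
```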


SLIDE 10

POMDP application

  • UAI paper presented on Friday:

Marc Toussaint, Laurent Charlin, Pascal Poupart: Hierarchical POMDP Controller Optimization by Likelihood Maximization

[figure: DBN factorizations of the hierarchical controller (nodes N0, N1, N2, N′0, N′1, N′2, S, S′, O, O′, A, E0, E1) and the corresponding junction trees]

Problem         | |S|, |A|, |O| | V*    | HSVI2 | best results from [1]: V, nodes, t(s) | ML approach (avg. over 10 runs): V, nodes, t(s)
paint           | 4, 4, 2       | 3.28  | 3.29  | 3.29±0.04, (1,3), <1                  | 3.26±0.004, (5,3), 0.96±0.3
shuttle         | 8, 3, 5       | 32.7  | 31.87 | 32.9±0.8, (1,3), 2                    | 31.6±0.5, (5,3), 2.81±0.2
4x4 maze        | 16, 4, 2      | 3.7   | 3.73  | 3.75±0.1, (1,2), 30                   | 3.72±8e−5, (3,3), 2.8±0.8
chain-of-chains | 10, 4, 1      | 157.1 | 0.0   | 157.1±0, (3,3), 10                    | 151.6±2.6, (10,3), 6.4±0.2
handwashing     | 84, 7, 12     | 1052  | N/A   | N/A                                   | 984±1, (10,5), 655±2
cheese-taxi     | 33, 7, 10     | 5.3   | N/A   | 2.53±0.3                              | −9±11 (2.25*), (10,3), 311±14

SLIDE 11

robotic motion inference application

four task variables:
– position of right finger
– collision with objects
– balance
– comfortableness

[plot: (MAP) cost vs. time (sec) for bayes (repeats), bayes (fwd-bwd), gradient (direct), gradient (spline)]

(Toussaint & Goerick, IROS 2007)


SLIDE 12
  • on Asimo
  • Toussaint, Gienger & Goerick (Humanoids 2007): Optimization of sequential attractor-based movement for compact behavior generation (a technique other than inference)

[videos:
– Time: 3s, Control points: 8, Controlled: both hands position and attitude
– Time: 3s, Control points: 4, Controlled: left hand position and attitude
– Time: 4s, Control points: 10, Controlled: both hands position and attitude]

SLIDE 13

model learning

  • Control of a dynamic robot

– system dynamics: f : (x, ẋ, u) → ẍ
– learning the inverse model: φ : (x, ẋ, ẍ*) → u [learn] [pole]
– (methods: A. Moore, C. Atkeson, S. Schaal, S. Vijayakumar, et al.)
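A toy sketch of inverse model learning (a hypothetical 1-D linear plant, not a method from the slides): generate (x, ẋ, ẍ, u) data from the plant, then fit u as a linear function of (x, ẋ, ẍ*) by least squares, recovering the exact inverse model.

```python
# Toy sketch of inverse model learning (hypothetical 1-D plant).
# Plant: xdd = -k*x - d*xd + u, so the exact inverse model is
# u = xdd* + k*x + d*xd; we recover it by regression.
import numpy as np

rng = np.random.default_rng(0)
k, d = 4.0, 0.5                       # "unknown" plant parameters

x, xd, u = rng.normal(size=(3, 200))  # random states and controls
xdd = -k * x - d * xd + u             # observed accelerations

# learn phi: (x, xd, xdd*) -> u as a linear map
features = np.stack([x, xd, xdd], axis=1)
coef, *_ = np.linalg.lstsq(features, u, rcond=None)
print(coef)  # approximately [k, d, 1]
```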


SLIDE 14

conclusions

[diagram: the core of optimal control (Bellman, DP, HJB, LQG, Value Iteration, MDPs) linked on one side to RL (E^3, TD, Bayesian RL, Q-learning) and on the other to inference (likelihood max, path integral, posterior trajectories/control, graphical models), alongside sensor processing and state estimation]

  • exciting potential for Machine Learning methods

– structured state, abstraction, learning, integration

  • integrative view from ML perspective possible
