

ControlBasis-III

45

automating sequential control composition

  • reinforcement learning

example: hierarchical walking

  • Dynamic Movement Primitives (DMPs)
  • policy search

46

action set A

  double recommended_setpoint[NACTIONS][NDOF];   /* scope is a “project”, i.e., a task supervisor */

  /* inside all supervisors: collect each action's return status;
     new recommended setpoints are a byproduct */
  state = 0;
  for (i = 0, p = 1; i < k; i++, p *= 3) {
      di = action[i];        /* return status of action i: 0, 1, or 2 */
      state += di * p;       /* base-3 encoding of the composite state */
  }
  switch (state) {
      case 0:   /* submit recommended_setpoint[ACTION] to motor units */
          break;
      case 1:   /* … */
          break;
      /* … */
      case K:
          break;
  }

Recommended Supervisor Structure

47

Convention: every action should return its status and pass a recommended setpoint back through an argument; actions no longer write directly to motor units.

Standardized actions: Search(), Track(), SearchTrack(), Chase(), Touch(), ChaseTouch(), FClosure()

Goals: evaluate the impact of implicit knowledge in terms of learning performance (all actions, just primitives, just macros). Define training episodes; pause every N episodes and write out the greedy policy; in post-processing, run M greedy trials and compute the mean/variance of performance for plotting.

Reward: squared wrench residual (sparse); possibly other signals.
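As a C signature, the convention might look like this minimal sketch (the Status enum, its value names, NDOF's value, and the Track() prototype are illustrative assumptions, not the lab's actual API):

  #define NDOF 7   /* illustrative */

  /* each action returns its discrete status and passes its recommended
     setpoint back through an out-argument; it never writes motor units */
  typedef enum { NO_REFERENCE = 0, TRANSIENT = 1, CONVERGED = 2 } Status;

  Status Track(double recommended_setpoint[NDOF]);

The supervisor then folds each returned Status into its base-3 composite state, as in the structure above.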

Recommended Supervisor Structure

48

Conditioned Response

Pavlov, I. P. (1927). Conditioned Reflexes: An Investigation of the Physiological Activity of the Cerebral Cortex. Translated and edited by G. V. Anrep. London: Oxford University Press.


49

A Computational Model for Conditioned Response

[figure: states, actions]

Reinforcement Learning - value iteration

  • diffusion processes
  • curse of dimensionality diminished by exploiting neurological structure

value functions - a generalization of the potential field
50

Markov Decision Processes


Markov decision processes describe a memoryless stochastic process: the conditional probability distribution of future states depends only on the current state, not on how the process got to that state.

M = ⟨S, A, ·, P, R⟩

  • S: set of system states
  • A: set of available actions
  • ·: the subset of actions allowed from each state
  • P: probability that (s_k, a_k) transitions to state s_{k+1}
  • R: real-valued reward for each (state, action) pair
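For concreteness, a tabular MDP of this kind fits in a small C struct; a minimal sketch, where NSTATES, NACTIONS, and all field names are illustrative assumptions:

  #define NSTATES  32
  #define NACTIONS 16

  typedef struct {
      int    allowed[NSTATES][NACTIONS];      /* subset of actions allowed from each state */
      double P[NSTATES][NACTIONS][NSTATES];   /* P[s][a][s']: transition probability       */
      double R[NSTATES][NACTIONS];            /* real-valued reward for each (s, a) pair   */
  } MDP;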

51

Markov Decision Processes

52

The Bellman Equation

Define a policy, π(s, a), to be a function that returns the probability of selecting action a ∈ A from state s ∈ S. The value of state s under policy π, denoted Vπ(s), is the expected sum of discounted future rewards when policy π is executed from state s, where 0.0 < γ ≤ 1.0 is a discount factor per decision and the scalar r_t is the reward received at time t.
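The equation itself did not survive extraction; the standard discounted-return form this definition describes is presumably

  V^\pi(s) = \mathbb{E}_\pi\Big[ \sum_{t=0}^{\infty} \gamma^{t}\, r_t \;\Big|\; s_0 = s \Big]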


53

The Bellman Equation
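The derivation on this slide and the next survived only as images; the Bellman consistency constraint they build toward is, in standard form, presumably

  V^\pi(s) = \sum_{a \in A} \pi(s, a) \sum_{s'} P(s' \mid s, a)\,\big[ R(s, a) + \gamma\, V^\pi(s') \big]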

54

The Bellman Equation

55

Value Iteration


Dynamic Programming (DP) algorithms compute optimal policies from complete knowledge of the underlying MDP. Reinforcement Learning (RL) algorithms are an important subset of DP algorithms that do not require prior knowledge of the transition probabilities in the MDP. The Bellman equation provides the basis for a numerical iteration that incorporates the Bellman consistency constraints to estimate Vπ(s): a recursive numerical technique that converges to Vπ as k → ∞.
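A minimal sketch of one sweep of this iteration in C, reusing the hypothetical MDP struct from earlier (this is the Bellman optimality backup; the Vπ estimate described above replaces the max with an expectation under π):

  #include <math.h>

  /* one synchronous sweep of V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ];
     returns the largest change so the caller can iterate to a tolerance */
  double value_iteration_sweep(const MDP *m, double V[NSTATES], double gamma) {
      double delta = 0.0;
      for (int s = 0; s < NSTATES; s++) {
          double best = -INFINITY;
          for (int a = 0; a < NACTIONS; a++) {
              if (!m->allowed[s][a]) continue;
              double q = m->R[s][a];
              for (int sp = 0; sp < NSTATES; sp++)
                  q += gamma * m->P[s][a][sp] * V[sp];
              if (q > best) best = q;
          }
          if (best == -INFINITY) continue;   /* no action allowed in state s */
          if (fabs(best - V[s]) > delta) delta = fabs(best - V[s]);
          V[s] = best;
      }
      return delta;
  }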

56

Q-learning


Typically, DP employs a full backup: a comprehensive sweep through the entire state-action space using numerical relaxation techniques (Appendix C). RL techniques generally estimate Vπ(s) using sampled backups, at the expense of optimality guarantees.

This is attractive in robotics because it focuses exploration on the portions of the state/action space most relevant to the reward/task; greedy ascent of the converged value function is an optimal policy for accumulating reward.


57

Q-learning


Quality function: the value function written in state/action form. Policy improvement: the policy improvement theorem guarantees that a procedure like this leads monotonically toward optimal policies.
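The update rule on this slide was an image; the standard one-step Q-learning backup (Watkins) it presumably showed is

  Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,\big[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big]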

58

Q-learning


Q-learning is a natural paradigm for composing skills using control basis actions because it can construct policies from sequences of actions by exploring control interactions in situ.

The policy π(s) = argmax_{a_i} Q(s, a_i) maps states to optimal actions by greedy ascent of the value function.
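Both pieces fit in a few lines of C; a minimal tabular sketch, reusing NSTATES and NACTIONS from the earlier MDP sketch (alpha, gamma, and the Q-table layout are illustrative assumptions):

  double Q[NSTATES][NACTIONS];

  /* one-step Q-learning update after observing (s, a, r, s') */
  void q_update(int s, int a, double r, int sp, double alpha, double gamma) {
      double best = Q[sp][0];
      for (int ap = 1; ap < NACTIONS; ap++)
          if (Q[sp][ap] > best) best = Q[sp][ap];
      Q[s][a] += alpha * (r + gamma * best - Q[s][a]);
  }

  /* greedy policy: pi(s) = argmax_a Q(s, a) */
  int pi(int s) {
      int best_a = 0;
      for (int a = 1; a < NACTIONS; a++)
          if (Q[s][a] > Q[s][best_a]) best_a = a;
      return best_a;
  }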

59

Example: Learning to Walk (ca. 1996)

Resource Model

sensor resources -

  • configuration of legs {0123}
  • configuration of body (x, y, θ)

effector resources -

  • configuration of legs {0123}
  • configuration of body (x, y, θ)

control types -

  • moment control
  • kinematic conditioning

THING quadruped: four coordinated robots; 2^13 states × 1885 actions


13 controllers; a total of 1885 concurrent control options. Discrete events:

60

moment control: Φ1_{abc}, abc ∈ {0123}
kinematic conditioning: Φ2_{ϕ}, over legs 0123

Example: Walking Gaits

p0 ← Φ*_{012}
p1 ← Φ*_{023}
p2 ← Φ*_{123}
p3 ← Φ*_{013}
p4 ← Φϕ_{0123}
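Read as a gait, each p_i is a stance controller over one leg triple (p4 conditions all four legs). A hypothetical supervisor in the style of the Recommended Supervisor Structure above can sequence them; run_action() and the CONVERGED status are assumptions carried over from the earlier sketches:

  /* hypothetical crawl-gait supervisor: advance to the next stance
     controller whenever the current one reports convergence */
  Status run_action(int i);                   /* assumed dispatcher over p0..p3 */

  static const int gait[] = { 0, 1, 2, 3 };   /* indices into p0..p3 */
  static int phase = 0;

  void gait_step(void) {
      if (run_action(gait[phase]) == CONVERGED)
          phase = (phase + 1) % 4;
  }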


61


Example: Behavioral Logic for Development

Platform stability constraints

  • at least 1 of the 4 stable tripod stances must be true at all times
  • kinematic constraints

propositions that constrain patterns of discrete events in the dynamical system. Reduced model (see the sketch after this list):

  • 32 states × 157 actions
  • reduced by 99.94%
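A minimal sketch of how such propositions prune the model, reusing the hypothetical MDP struct from earlier (stable_tripod() stands in for the four tripod-stance propositions and is not the lab's actual predicate):

  #include <stdbool.h>

  bool stable_tripod(int s, int i);   /* hypothetical stance proposition i in state s */

  /* at least one of the four stable tripod stances must hold in state s */
  bool stable(int s) {
      return stable_tripod(s, 0) || stable_tripod(s, 1) ||
             stable_tripod(s, 2) || stable_tripod(s, 3);
  }

  /* admit action a from state s only if every reachable state satisfies
     the stability propositions */
  bool admissible(const MDP *m, int s, int a) {
      for (int sp = 0; sp < NSTATES; sp++)
          if (m->P[s][a][sp] > 0.0 && !stable(sp)) return false;
      return true;
  }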

62

Example: ROTATE schema

63

Schemas from Sensorimotor Programming

Generating the control programs that evaluate context efficiently is the subject of on-going work on inferential perception.

64

Transfer

schemas written by one robot, ported to another robot



65

Implications of Developmental Hierarchy

66

“Objects” - Fully-Observable Case (Aspect Transition Graph)

…at least one stable grasp must exist at all times… (g(Fg1) ∨ g(Fg2) ∨ g(Fg3))

Rob Platt

68

Hierarchical Commonsense Control Knowledge

g(prone) ∨ g(4-point) ∨ g(balance)

prone, three-point, four-point, balancing

69

grasp “flip”

[figure: SF = SM = 0 (arms) → SF = ±Mx (arms)]

models record how SearchTrack options affect visual/haptic tracking events over time

“Objects” - Model-Referenced Aspect Transitions



70

“Objects” - Model-Referenced Aspect Transitions

in general, multiple options exist for transforming existing sensor geometries into new sensor geometries (visual cues ⇒ viewing angles ⇒ new visual cues); options vary in cost and in information content


Affordance Modeling - Three Objects

exploration habituates when the model stops changing; visual hue tracker, grasp, pick-and-place

71

Stephen Hart


Modeling Simple Assemblies

72

stable multi-body relations

73

Human Tracking

  • disambiguate human structure against cluttered backgrounds
  • references (hands/face) for control behavior and modeling


74

Expressive Communicative Actions

learning about kinodynamic and intentional agents: unreachable objects can become reachable; in the presence of a human being, the dynamics of the world change

75

Receptive Pointing Skills

Policy Iteration - Dynamic Movement Primitives

landscapes of attractors

  • optimal control framework

generates sequences of kinematic goals for a fixed plant (sensory and motor resources), a particular task, and a fully-observable environment; a PD controller plus a nonlinear function approximator f(x) = bᵀϕ(x) as a coupling term
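For reference, the equations this slide summarizes are presumably the standard DMP transformation and canonical systems (Ijspeert, Nakanishi & Schaal), in the usual notation:

  \tau \dot{z} = \alpha_z\big(\beta_z (g - y) - z\big) + f(x), \qquad
  \tau \dot{y} = z, \qquad
  \tau \dot{x} = -\alpha_x x, \qquad
  f(x) = \frac{\sum_i \psi_i(x)\, w_i\, x}{\sum_i \psi_i(x)}

where the first term acts as the PD controller and the learned f(x) is the nonlinear coupling term.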

76

Schaal (2003)

Policy Search - Variable Risk Control

  • Goal: high performance, dynamic behavior

  • The ability to adapt dynamic policies to a variety of operating contexts: low battery, insufficient time, dangerous obstacles, …
  • We expect different contexts to demand different sensitivities to risk (i.e., performance variance); risk-sensitivity must be adjustable on the fly. A common mean-variance formulation is sketched below.
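One common way to make that sensitivity explicit (a plausible reading of "performance variance" here, not necessarily the formulation used in this work) is a mean-variance objective over policy parameters θ:

  J_\kappa(\theta) = \mu_R(\theta) - \kappa\, \sigma_R(\theta)

where μ_R and σ_R are the mean and standard deviation of return, and κ sets the risk attitude (κ > 0 risk-averse, κ = 0 risk-neutral).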


Example: Risk-Averse


Example: Risk-Neutral


Off-Line Policy Selection



  • Learning rapid arm motions for postural stabilization
  • Biomechanics literature suggests that arm motions play a significant functional role in postural recovery
  • Impact recovery experiments with the uBot-5
  • Learned arm motions lead to more robust stabilization
  • Using arms to recover significantly reduces total energy expenditure

Whole-Body Postural Stability

[video: “2011-09-16 Punchulum Large Impact Comparison.mp4”]

learned after 35 training episodes using Bayesian optimization, a global policy search algorithm
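Bayesian optimization treats the episodic return as a black-box function of the policy parameters: it fits a probabilistic surrogate (typically a Gaussian process) to the episodes seen so far and proposes the next parameters by maximizing an acquisition function. A common choice, though not necessarily the one used here, is expected improvement:

  EI(\theta) = \mathbb{E}\big[ \max\big( 0,\; f(\theta) - f(\theta^{+}) \big) \big]

where θ⁺ is the best parameter setting evaluated so far.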

Learned Policy


[video: “2011-09-16 Arm Responses in the Wild.mp4”]

Learned Policy

45 training episodes; different risk characteristics selected offline using the variational Bayesian optimization algorithm; recovery from significantly larger impact forces, approximately equal to the robot’s weight

Learning Variable Risk Upper-Body Postural Stabilizing Policies

Dynamic Manipulation

exploit non-linear dynamics using very little interaction with an object