ControlBasis-III



  1. Recommended Supervisor Structure

     Scope is a "project", i.e. a task supervisor; the aim is automating sequential control composition. Candidate methods: reinforcement learning (byproduct example: hierarchical walking), Dynamic Motion Primitives (DMPs), and policy search.

     Inside all supervisors (the slide's C-like pseudocode, lightly cleaned up):

         action set A
         double recommended_setpoint[NACTIONS][NDOF];

         state = 0;
         /* collect return status & new recommended setpoints */
         for (i = 0; i < k; i++) {
             d_i = action[i];          /* status returned by action i */
             state += d_i * 3^i;       /* trinary encoding of the statuses */
         }
         switch (state) {
             case 0: /* submit recommended_setpoint[ACTION] to motor units */ break;
             case 1: ...
             ...
             case K: ...
         }

     Convention: every action should return its status and pass a recommended setpoint back through an argument---actions don't write directly to the motor units anymore. (A minimal C sketch of this convention follows this item.)

     Standardized actions: Search(), Track(), SearchTrack(), Chase(), Touch(), ChaseTouch(), FClosure().

     Goals: evaluate the impact of implicit knowledge in terms of learning performance (all actions, just primitives, just macros). Define training episodes, pause every N episodes and write out the greedy policy; in post-processing, run M greedy trials and compute the mean/variance of performance for plotting. Reward: squared wrench residual (sparse), possibly other signals.

     Conditioned Response

     Pavlov, I. P. (1927), Conditioned Reflexes: An Investigation of the Physiological Activity of the Cerebral Cortex. Translated and edited by G. V. Anrep. London: Oxford University Press.
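     Below is a small, self-contained C sketch of this convention: each action returns a trinary status and passes its recommended setpoint back through an argument, and the supervisor folds the statuses into a trinary state index. The names (Status, action_fn, the stub actions, NDOF, NACTIONS) are illustrative assumptions, not the course code.

         #include <stdio.h>

         #define NDOF     7
         #define NACTIONS 3

         /* trinary action status: 0 = no reference, 1 = transient, 2 = converged */
         typedef enum { UNDEFINED = 0, TRANSIENT = 1, CONVERGED = 2 } Status;

         /* every action writes its recommended setpoint through the argument */
         typedef Status (*action_fn)(double setpoint[NDOF]);

         /* stand-ins for standardized actions such as Search(), Track(), Touch() */
         static Status search_stub(double setpoint[NDOF]) { setpoint[0] = 0.1; return TRANSIENT; }
         static Status track_stub (double setpoint[NDOF]) { setpoint[0] = 0.2; return CONVERGED; }
         static Status touch_stub (double setpoint[NDOF]) { setpoint[0] = 0.3; return UNDEFINED; }

         int main(void)
         {
             action_fn actions[NACTIONS] = { search_stub, track_stub, touch_stub };
             double recommended_setpoint[NACTIONS][NDOF] = { { 0.0 } };

             /* collect return statuses and build the trinary supervisor state */
             int state = 0, weight = 1;
             for (int i = 0; i < NACTIONS; i++) {
                 Status d_i = actions[i](recommended_setpoint[i]);
                 state += (int)d_i * weight;      /* state += d_i * 3^i */
                 weight *= 3;
             }

             /* a real supervisor would switch(state) here and forward the chosen
                action's recommended_setpoint to the motor units */
             printf("supervisor state = %d\n", state);
             return 0;
         }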

  2. A Computational Model for Conditioned Response

     Markov Decision Processes (MDPs) describe a memory-less stochastic process: the conditional probability distribution over future states depends only on the current state---not on how the process got to that state. Value functions here are a generalization of the potential-field value functions of the control basis, and reinforcement learning estimates them by value iteration ("diffusion" processes over states and actions). The curse of dimensionality is diminished by exploiting neurological structure.

     Markov Decision Processes: M = <S, A, P, R>
     - S : the set of system states
     - A : the set of available actions (a subset of actions is allowed from each state)
     - P : the probability that (s_k, a_k) transitions to state s_{k+1}
     - R : a real-valued reward for each (state, action) pair

     The Bellman Equation
     Define a policy, π(s, a), to be a function that returns the probability of selecting action a ∈ A from state s ∈ S. The value of state s under policy π, denoted V^π(s), is the expected sum of discounted future rewards when policy π is executed from state s, where 0.0 < γ ≤ 1.0 is a discounting factor per decision and the scalar r_t is the reward received at time t (the equation is reconstructed below).
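     A reconstruction of the value function referred to above, in a standard form consistent with these definitions (the slide's own equation did not survive extraction):

         \[
           V^{\pi}(s) = E_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s\right]
         \]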

  3. The Bellman Equation

     (Slides 53-54 present the Bellman equation itself; standard forms are reconstructed after this item.)

     Value Iteration
     Dynamic Programming (DP) algorithms compute optimal policies from complete knowledge of the underlying MDP. Typically, DP employs a full backup---a comprehensive sweep through the entire state-action space using numerical relaxation techniques (Appendix C). The Bellman consistency constraints provide the basis for a numerical iteration that estimates V^π(s): a recursive numerical technique that converges to V^π as k → ∞. Greedy ascent of the converged value function is an optimal policy for accumulating reward.

     Q-learning
     Reinforcement Learning (RL) algorithms are an important subset of DP algorithms that do not require prior knowledge of the transition probabilities in the MDP. RL techniques generally estimate V^π(s) using sampled backups, at the expense of optimality guarantees. RL is attractive in robotics because it focuses exploration on the portions of the state/action space most relevant to the reward/task.
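     Standard forms of the Bellman equation and the value-iteration backup, consistent with the definitions above (a reconstruction; the original equations appeared on the slides as images):

         \[
           V^{\pi}(s) = \sum_{a \in A} \pi(s,a) \sum_{s'} P(s' \mid s, a)\left[ R(s,a) + \gamma V^{\pi}(s') \right]
         \]

         \[
           V_{k+1}(s) = \max_{a \in A} \sum_{s'} P(s' \mid s, a)\left[ R(s,a) + \gamma V_{k}(s') \right]
         \]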

  4. Q-learning

     Quality function: the value function written in state/action form. Q-learning is a natural paradigm for composing skills using the control basis actions because it can construct policies over sequences of actions by exploring control interactions in situ.

     Policy improvement: π(s) = argmax_{a_i} Q(s, a_i) maps states to optimal actions by greedy ascent of the value function. The policy improvement theorem guarantees that a procedure like this leads monotonically toward optimal policies. (A tabular Q-learning sketch follows this item.)

     Example: Learning to Walk (ca. 1996) / Walking Gaits
     - Resource model: sensor resources are the configuration of the legs {0,1,2,3} and the configuration of the body (x, y, θ); effector resources are the same leg and body configurations.
     - Control types: moment control Φ_1^{abc}, abc ∈ {0123}, and kinematic conditioning Φ_2^φ.
     - 13 controllers in total, for 1885 concurrent control options; the quadruped (four coordinated robots) therefore has 2^13 states × 1885 actions.
     - Discrete events: p_0 ← Φ*_{012}, p_1 ← Φ*_{023}, p_2 ← Φ*_{123}, p_3 ← Φ*_{013}, p_4 ← Φ^φ_{0123}.
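     A minimal tabular Q-learning sketch in C, illustrating the sampled backup and greedy policy improvement described above. The tiny state/action counts, the epsilon-greedy schedule, and the environment stub are illustrative assumptions (the walking example actually used 2^13 states × 1885 actions).

         #include <stdio.h>
         #include <stdlib.h>

         #define NSTATES  16
         #define NACTIONS 4
         #define ALPHA    0.1      /* learning rate */
         #define GAMMA    0.9      /* discount factor, 0 < gamma <= 1 */
         #define EPSILON  0.1      /* exploration probability */

         static double Q[NSTATES][NACTIONS];

         /* greedy policy improvement: pi(s) = argmax_a Q(s, a) */
         static int greedy_action(int s)
         {
             int best = 0;
             for (int a = 1; a < NACTIONS; a++)
                 if (Q[s][a] > Q[s][best]) best = a;
             return best;
         }

         /* placeholder environment: returns the next state and writes the reward */
         static int step(int s, int a, double *reward)
         {
             int s_next = (s + a + 1) % NSTATES;            /* arbitrary transition stub */
             *reward = (s_next == NSTATES - 1) ? 1.0 : 0.0; /* sparse reward stub */
             return s_next;
         }

         int main(void)
         {
             for (int episode = 0; episode < 1000; episode++) {
                 int s = 0;
                 for (int t = 0; t < 50; t++) {
                     /* epsilon-greedy exploration over the action set */
                     int a = ((double)rand() / RAND_MAX < EPSILON)
                                 ? rand() % NACTIONS
                                 : greedy_action(s);

                     double r;
                     int s_next = step(s, a, &r);

                     /* sampled backup: Q(s,a) += alpha*(r + gamma*max_a' Q(s',a') - Q(s,a)) */
                     double target = r + GAMMA * Q[s_next][greedy_action(s_next)];
                     Q[s][a] += ALPHA * (target - Q[s][a]);

                     s = s_next;
                 }
             }
             printf("greedy action from state 0: %d\n", greedy_action(0));
             return 0;
         }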

  5. Example: ROTATE Schema / Behavioral Logic for Development

     Behavioral logic: propositions that constrain patterns of discrete events in the dynamical system. (A small predicate sketch follows this item.)
     - Platform stability constraints: at least 1 of the 4 stable tripod stances must be true at all times.
     - Kinematic constraints.
     - Reduced model: 32 states × 157 actions, reduced by 99.94%.

     Schemas from Sensorimotor Programming - Transfer
     A schema "written" by one robot can be ported to another robot; generating the control programs that evaluate context efficiently is the subject of on-going work on inferential perception.
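     A hedged C sketch of the kind of proposition described above: the platform-stability constraint expressed as a predicate over the four tripod-stance events, used to rule out inadmissible actions. The predicted_p values and the admissibility test are illustrative assumptions, not the lab's code.

         #include <stdbool.h>
         #include <stdio.h>

         /* p[i] is true when tripod stance i (a three-leg support set) is stable */
         static bool platform_stable(const bool p[4])
         {
             /* "at least 1 of 4 stable tripod stances must be true at all times" */
             return p[0] || p[1] || p[2] || p[3];
         }

         int main(void)
         {
             /* stance events predicted to hold after some hypothetical action */
             bool predicted_p[4] = { false, true, false, false };

             /* the action is admissible only if the predicted stance pattern
                still satisfies the stability proposition */
             printf("action admissible: %s\n",
                    platform_stable(predicted_p) ? "yes" : "no");
             return 0;
         }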

  6. Implications of Developmental Hierarchy

     “Objects” - Fully-Observable Case (Aspect Transition Graph) [Rob Platt]
     …at least one stable grasp must exist at all times… ( g(F_g1) ∨ g(F_g2) ∨ g(F_g3) )

     Hierarchical Commonsense Control Knowledge
     [Figure: force/moment balance conditions and postural transitions among prone, four-point, and balancing configurations via grasp, "flip", and arm actions.]
     g(prone) ∨ g(4-point) ∨ g(balance)

     “Objects” - Model-Referenced Aspect Transitions
     Models record how SearchTrack options affect visual/haptic tracking events over time.

  7. “Objects” - Model-Referenced Aspect Transitions

     In general, multiple options exist for transforming existing sensor geometries into new sensor geometries (visual cues => viewing angles => new visual cues); options vary in cost and in information content. (A data-structure sketch follows this item.)

     Affordance Modeling - Three Objects [Stephen Hart]
     Exploration habituates when the model stops changing; example options include a visual hue tracker, grasp, and pick-and-place.

     Modeling Simple Assemblies
     Modeling stable multi-body relations.

     Human Tracking
     - Disambiguate human structure against cluttered backgrounds.
     - References (hands/face) for control behavior and modeling.
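     A hedged C sketch of one way to represent the aspect-transition model described above: nodes are aspects (the currently observable sensor geometry), edges are options (e.g. SearchTrack, flip, grasp) annotated with cost and expected information content, and exploration habituates when no edge promises further information gain. The struct layout and selection rule are illustrative assumptions, not the lab's implementation.

         #include <stdio.h>

         #define MAX_EDGES 8

         typedef struct Aspect Aspect;

         typedef struct {
             const char *action;      /* e.g. "SearchTrack(hue)", "flip", "grasp"      */
             double      cost;        /* expected cost of exercising this option        */
             double      info_gain;   /* expected information content of the outcome    */
             Aspect     *next_aspect; /* model-referenced prediction of the new aspect   */
         } AspectTransition;

         struct Aspect {
             const char       *name;  /* e.g. "top face visible" */
             int               n_edges;
             AspectTransition  edges[MAX_EDGES];
         };

         /* pick the outgoing option with the best information-per-cost ratio;
            NULL means the model has stopped changing, so exploration habituates */
         static const AspectTransition *select_option(const Aspect *a)
         {
             const AspectTransition *best = NULL;
             for (int i = 0; i < a->n_edges; i++) {
                 const AspectTransition *e = &a->edges[i];
                 if (e->info_gain <= 0.0) continue;
                 if (!best || e->info_gain / e->cost > best->info_gain / best->cost)
                     best = e;
             }
             return best;
         }

         int main(void)
         {
             Aspect side = { .name = "side face visible", .n_edges = 0 };
             Aspect top  = { .name = "top face visible",  .n_edges = 2,
                             .edges = { { "SearchTrack(hue)", 1.0, 0.5, &side },
                                        { "flip",             3.0, 2.0, &side } } };

             const AspectTransition *choice = select_option(&top);
             printf("chosen option: %s\n", choice ? choice->action : "habituate");
             return 0;
         }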
