Two Perspectives on Representation Learning
Joseph Modayil, Reinforcement Learning and Artificial Intelligence Laboratory, University of Alberta


  1. Two Perspectives on Representation Learning. Joseph Modayil, Reinforcement Learning and Artificial Intelligence Laboratory, University of Alberta.

  2. Reasoning & Learning: Two perspectives on knowledge representation ‣ For reasoning with a model: • Expressiveness of the model (e.g. space, objects, ...) • Planning with the model is useful for a robot ‣ For learning to predict the consequences of a robot's behaviour: • Semantics defined by the robot's future experience • Online, scalable learning during normal robot operation

  3. An Analogy with Scientific Knowledge ‣ Reasoning and learning have complementary strengths that are analogous to scientific theories and experiments. • Scientific theories enable broad generalization within a limited domain. Scientific theories enable effective reasoning even when inaccurate. • Experiments measure the world without needing model assumptions. Many experiments are needed to understand the world. ‣ Two approaches for connecting theories and experiments. • Top-down: theories have experimentally verifiable predictions. • Bottom-up: many verifiable predictions can generalize to a single theory. • Note: a single prediction is a (very) partial model of the world.

  4. Rich representations that support reasoning

  5. Reasoning with rich representations ‣ Useful analogs to human-scale abstractions can be constructed from robot experience. • The robot constructs models from its sensorimotor experience by searching for particular statistical structures. • The models describe spaces and objects. • The robot reasons within these models to achieve goals.

  6. Representing sensor configurations (Modayil, 2010) ‣ Sensors in similar physical configurations yield highly correlated time-series data (e.g. a Gaussian process assumption). ‣ Invert this: use time-series data to construct a manifold of sensor configurations. [Figure: pipeline from the original sensors, to gathered experience, to time-series analysis, to constructed sensor geometry.]

  7. Learned geometry from real robot data (CoSy Localization Database) Method: 1. Define local distances between strongly correlated sensors. 2. Use the fast maximum variance unfolding algorithm to construct a manifold. Conclusion: A robot's experience can contain enough information to recover approximate local sensor geometry (and perhaps global geometry).
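The construction on this slide fits in a few lines. The sketch below substitutes classical multidimensional scaling (scikit-learn's MDS) for the fast maximum variance unfolding algorithm named on the slide, since MVU has no standard library implementation; the `readings` array and the correlation-to-distance mapping are illustrative assumptions, not the paper's exact method.

```python
import numpy as np
from sklearn.manifold import MDS  # stand-in for maximum variance unfolding

def recover_sensor_geometry(readings, n_dims=2):
    """Embed sensors in a low-dimensional space from time-series correlations.

    readings: (n_timesteps, n_sensors) array of raw sensor values.
    Strongly correlated sensors get small local distances, mirroring the
    assumption that physically nearby sensors produce similar data.
    """
    corr = np.corrcoef(readings.T)                   # (n_sensors, n_sensors)
    dist = np.sqrt(np.clip(1.0 - corr, 0.0, None))   # correlation -> dissimilarity
    embedding = MDS(n_components=n_dims, dissimilarity="precomputed")
    return embedding.fit_transform(dist)             # (n_sensors, n_dims) layout

# Hypothetical usage: an hour of range scans sampled at 10 Hz.
# coords = recover_sensor_geometry(np.load("scans.npy"))
```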

  8. Representing Objects (Modayil & Kuipers, 2007) ‣ Intuition: Moving objects can be distinguished from a static world. ‣ Approach: Use violations of a stationary background model to perceive moving objects.

  9. Objects: Background Model The agent has a model of the static environment ‣ Occupancy grid ‣ Observation model: (pose, map) → observation ‣ Operators to move the robot to a target pose ‣ Update of the map and robot pose at each time-step

  10. Objects: Perception Method 1. Consider sensor readings that violate expectations of a static model. 2. Cluster them in space and then in time. 3. Compute new perceptual features from the clusters: distance = average sensor reading, angle = average sensor location.
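A minimal sketch of steps 1 and 2 for a single range scan, assuming the background model can be ray-cast to produce expected ranges. Clustering here is only spatial (adjacent beams); the full method also clusters over time. The function names and the 0.3 m violation threshold are hypothetical.

```python
import numpy as np
from scipy.ndimage import label

def perceive_moving_objects(expected, observed, threshold=0.3):
    """Cluster range readings that violate a static background model.

    expected, observed: (n_beams,) range scans in metres; `expected` comes
    from ray-casting the occupancy grid at the current robot pose.
    Returns per-cluster (distance, angle) features as on the slide:
    distance = average sensor reading, angle = average sensor location.
    """
    angles = np.linspace(-np.pi / 2, np.pi / 2, len(observed))
    violating = observed < expected - threshold   # nearer than the map predicts
    clusters, n = label(violating)                # group adjacent violating beams
    features = []
    for c in range(1, n + 1):
        beams = clusters == c
        features.append((observed[beams].mean(),  # distance feature
                         angles[beams].mean()))   # angle feature
    return features
```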

  11. Objects: Learned Shapes Note: shape models have size information.

  12. Objects: Learning Operators Method: 1. Perform motor babbling to collect data. 2. Use batch learning to find contexts and motor outputs that reliably change an attribute every time step (one-second time steps). 3. Evaluate the learned operators. Example, Operator 4 (decrease distance to object): Description: distance(τ) decrease, δ < -0.19. Context: distance(τ) ≥ 0.43, 69 ≤ angle(τ) ≤ 132. Motor outputs: (0.2 m/s, 0.0 rad/s).
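A learned operator can be read as a simple rule: a context (bounds on perceptual attributes), a motor output, and a reliable per-step effect on one attribute. Below is a sketch of that data structure, populated with the slide's Operator 4; the class itself is an illustrative assumption, not the paper's representation.

```python
from dataclasses import dataclass

@dataclass
class Operator:
    """When `context` holds, issuing `motor_output` reliably changes
    `attribute` by at least `delta` per one-second time step."""
    attribute: str
    delta: float
    context: dict        # attribute -> (low, high) bounds
    motor_output: tuple  # (linear m/s, angular rad/s)

    def applicable(self, percept):
        return all(lo <= percept[k] <= hi
                   for k, (lo, hi) in self.context.items())

# Operator 4 from the slide: decrease distance to the object.
op4 = Operator(attribute="distance", delta=-0.19,
               context={"distance": (0.43, float("inf")), "angle": (69, 132)},
               motor_output=(0.2, 0.0))

percept = {"distance": 1.2, "angle": 90}
if op4.applicable(percept):
    linear, angular = op4.motor_output   # send to the robot base
```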

  13. Objects: Using Operators [Figure: operators drive perceptual attributes relative to the robot heading, e.g. angle(τ) increasing or decreasing, distance(τ) and location(τ) decreasing.]

  14. Learning models that support reasoning ‣ Representations that support human-scale abstract reasoning can be learned from sensorimotor experience. • Is a robot's sensorimotor stream sufficient for learning all useful knowledge? ‣ How can the learning process be improved? • Simple unified semantics with broad applicability • Clarify assumptions • Incremental learning algorithms • Remove the need for human oversight

  15. Rich representations that support learning

  16. Learning to make predictions ‣ A prediction is a claim about a robot's future experience. • Predictions verified by experiments are the foundation of scientific knowledge. • Thus, the semantics of experimentally verifiable predictions could be a useful foundation for a robot's knowledge. • An efficient online, incremental algorithm would enable the robot to make and learn many such predictions in parallel. • e.g. Temporal-difference reinforcement learning algorithms.

  17. General value functions (GVF) V_{π,γ,r,z}(s) = E[r(s_1) + ... + r(s_k) + z(s_k) | s_0 = s, a_{0:k} ∼ π, k ∼ γ]. These four functions define the semantics of an experimentally verifiable prediction: • policy π : A × S → [0, 1] • pseudo reward r : S → ℝ • termination γ : S → [0, 1] • terminal reward z : S → ℝ. The experimental question: by selecting actions with the policy, how much reward will be received before termination? Note 1: A GVF is a value function, but with a generic reward and termination. Note 2: A constant termination probability corresponds to a timescale.
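The GVF's semantics are literally executable: run the experiment that the four functions describe and average the outcomes. A Monte-Carlo sketch, assuming a hypothetical `env.step` one-step model; note that with a constant γ the expected termination step is 1/(1−γ), which is the timescale of Note 2.

```python
import random

def gvf_return(env, s0, policy, r, gamma, z, max_steps=10_000):
    """One sample of the GVF question: follow `policy` from s0, summing
    pseudo reward r(s) each step; terminate with probability 1 - gamma(s)
    and add the terminal reward z(s). Averaging many samples estimates
    V_{pi,gamma,r,z}(s0)."""
    s, total = s0, 0.0
    for _ in range(max_steps):
        s = env.step(s, policy(s))        # hypothetical one-step model
        total += r(s)                     # accumulate pseudo reward r(s_t)
        if random.random() > gamma(s):    # terminate with prob 1 - gamma(s_t)
            return total + z(s)           # terminal reward z(s_k)
    return total                          # truncated (gamma near 1)
```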

  18. The Horde Architecture (Sutton et al., 2011) GVF predictions can be learned in parallel and online. A sparse re-coder (e.g. tile coding) maps sensorimotor data into a non-linear, sparse, mostly-binary feature representation φ_t (#active << #features). Each demon is a full RL agent estimating a general value function, and each computed prediction p is a linear function of the features: p = ⟨θ_t, φ_t⟩. (Predictions can themselves be fed back in as features, as in a PSR.) The weights θ can be learned incrementally in O(#features) time per step by TD(λ) or related algorithms.
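A single demon's update fits in a few lines. The sketch below is the on-policy TD(λ) case with accumulating traces over a shared feature vector φ; the class name and step sizes are assumptions (with sparse binary features, α is typically divided by the number of active features), and Horde's demons actually use the off-policy gradient-TD updates sketched further on.

```python
import numpy as np

class Demon:
    """One Horde demon: a linear GVF learner, p_t = <theta, phi_t>,
    trained by TD(lambda) with accumulating eligibility traces."""
    def __init__(self, n_features, alpha=0.1, lam=0.9):
        self.theta = np.zeros(n_features)   # prediction weights
        self.e = np.zeros(n_features)       # eligibility trace
        self.alpha, self.lam = alpha, lam

    def predict(self, phi):
        return self.theta @ phi

    def update(self, phi, reward, gamma, phi_next):
        # TD error for this demon's own pseudo reward and termination
        delta = reward + gamma * (self.theta @ phi_next) - self.theta @ phi
        self.e = gamma * self.lam * self.e + phi
        self.theta += self.alpha * delta * self.e   # O(#features) per step
```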

  19. The firehose of experience [Figure: normalized sensor values over time, with timesteps of 0.1 second.]

  20. Predictions of a light sensor. GVF question: r = Light3, γ = 0.9875, π = robot behaviour, z = 0. A constant γ = 0.9875 with 0.1 s timesteps gives a timescale of 1/(1−γ) = 80 steps, i.e. 8 seconds. [Figure: the Light3 pseudo reward (right scale) against the ideal 8 s prediction and the TD(λ) prediction (left scale), over 120 seconds.] The predictions learned online by TD(λ) are comparable to the ideal predictions and approach the accuracy of the best weight vector computed offline (shown after 3 hours of experience).

  21. Scales to thousands of predictions (Modayil, White, Sutton, 2012) The 2000+ predictions use 6000+ shared features and shared parameters, cover all sensors & many state bits (Acceleration, MotorTemperature, OverheatingFlag, Light, MotorSpeed, IR, MotorCurrent, IRLight, Thermal, LastAction, RotationalVelocity, Magnetic, MotorCommand), cover 4 timescales (0.1, 0.5, 2, and 8 seconds), and update every 55 ms. [Figure: cumulative mean squared error, normalized by dataset sample variance, falls below unit variance in both mean and median over 180 minutes.] All experience & learning performed within hours!
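The per-step cost that makes this scale is easiest to see when the demons are stacked into matrices over one shared feature vector. A vectorized sketch, written with dense NumPy arrays for readability; with the sparse, mostly-binary features above, the same update costs O(#demons × #active features) per step.

```python
import numpy as np

def update_all_demons(Theta, E, phi, rewards, gammas, phi_next,
                      alpha=0.1, lam=0.9):
    """One TD(lambda) step for n_demons predictions sharing features.

    Theta, E: (n_demons, n_features) weight and trace matrices.
    rewards, gammas: (n_demons,) per-demon pseudo rewards and terminations.
    """
    deltas = rewards + gammas * (Theta @ phi_next) - Theta @ phi  # (n_demons,)
    E *= (gammas * lam)[:, None]       # decay each demon's trace
    E += phi[None, :]                  # accumulate the current features
    Theta += alpha * deltas[:, None] * E
    return Theta, E
```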

  22. Learning predictions about different policies ‣ Off-policy learning enables the robot to learn the consequences of following different policies from a single stream of experience. ‣ Gradient temporal-difference algorithms provide stable, incremental, off-policy learning. (Maei & Sutton, 2009) ‣ Works at scale with robots. (White, Modayil, Sutton, 2012)
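A sketch of one gradient-TD update in the style of GTD(λ)/TDC(λ) (after Maei & Sutton); the exact trace and step-size details vary across published variants, so treat this as an assumption-laden outline rather than the algorithm used in the cited experiments. The importance-sampling ratio ρ = π(a|s)/b(a|s) reweights experience generated by the behaviour policy b toward the demon's target policy π.

```python
import numpy as np

class GTDLambda:
    """Off-policy linear GVF learner with a gradient-TD style update.
    theta holds the prediction; w is the auxiliary weight vector that
    keeps the update stable under off-policy sampling."""
    def __init__(self, n_features, alpha=0.05, beta=0.005, lam=0.9):
        self.theta = np.zeros(n_features)
        self.w = np.zeros(n_features)
        self.e = np.zeros(n_features)
        self.alpha, self.beta, self.lam = alpha, beta, lam

    def update(self, phi, rho, reward, gamma, phi_next):
        # rho = pi(a|s) / b(a|s): importance-sampling correction
        delta = reward + gamma * (self.theta @ phi_next) - self.theta @ phi
        self.e = rho * (self.lam * gamma * self.e + phi)
        self.theta += self.alpha * (delta * self.e
                       - gamma * (1 - self.lam) * (self.e @ self.w) * phi_next)
        self.w += self.beta * (delta * self.e - (self.w @ phi) * phi)
```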

  23. Summary ‣ Abstract models can be learned from sensorimotor experience. • Learned models of sensor space and objects that support goal-directed planning. ‣ A broad class of predictive knowledge can be learned at scale. • General value function predictions express an expected consequence of a precise experiment. • Temporal-difference algorithms can learn to make such predictions incrementally during normal robot experience. ‣ Robots could benefit from a tighter integration between learning from experience and reasoning with models.
