

SLIDE 1

Two Perspectives on Representation Learning

Joseph Modayil

Reinforcement Learning and Artificial Intelligence Laboratory University of Alberta


SLIDE 2

Reasoning & Learning:

Two perspectives on knowledge representation

  • For reasoning with a model:
    • Expressiveness of the model (e.g. space, objects, ...)
    • Planning with the model is useful for a robot
  • For learning to predict the consequences of a robot's behaviour:
    • Semantics defined by the robot's future experience
    • Online, scalable learning during normal robot operation


SLIDE 3

An Analogy with Scientific Knowledge

  • Reasoning and learning have complementary strengths, analogous to scientific theories and experiments.
  • Scientific theories enable broad generalization within a limited domain, and they enable effective reasoning even when inaccurate.
  • Experiments measure the world without needing model assumptions, but many experiments are needed to understand the world.
  • There are two approaches for connecting theories and experiments:
    • Top-down: theories have experimentally verifiable predictions.
    • Bottom-up: many verifiable predictions can generalize to a single theory.
  • Note: A single prediction is a (very) partial model of the world.


SLIDE 4

Rich representations that support reasoning


SLIDE 5

Reasoning with rich representations

  • Useful analogs to human-scale abstractions can be constructed from robot experience.
  • The robot constructs models from its sensorimotor experience by searching for particular statistical structures.
  • The models describe spaces and objects.
  • The robot reasons within these models to achieve goals.


SLIDE 6

Representing sensor configurations (Modayil, 2010)

  • Sensors in similar physical configurations yield highly correlated time-series data (e.g. under a GP assumption).
  • Invert this: use time-series data to construct a manifold of sensor configurations.

[Figure: pipeline from sensor readings over time. Original Sensors → Gather Experience → Analyze Time-series → Construct Sensor Geometry.]

SLIDE 7

Learned geometry from real robot data


Cosy Localization Database

Method:
  1. Define local distances between strongly correlated sensors.
  2. Use the fast maximum variance unfolding algorithm to construct a manifold.

Conclusion: A robot's experience can contain enough information to recover approximate local sensor geometry (and perhaps global geometry).
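To make the method concrete, here is a minimal sketch of the correlation-to-geometry idea. It is not the paper's implementation: classical multidimensional scaling stands in for fast maximum variance unfolding, distances are taken between all sensor pairs rather than only strongly correlated ones, and the function name and array shapes are illustrative assumptions.

```python
# Sketch: recover an approximate sensor layout from time-series data.
# MDS stands in here for the fast maximum variance unfolding algorithm.
import numpy as np
from sklearn.manifold import MDS

def sensor_geometry(readings, n_dims=2):
    """readings: (T, n_sensors) array of sensor values over time."""
    corr = np.corrcoef(readings.T)    # pairwise sensor correlations
    # correlation -> distance; clip guards against tiny negative values
    dist = np.sqrt(np.clip(2.0 * (1.0 - corr), 0.0, None))
    mds = MDS(n_components=n_dims, dissimilarity="precomputed",
              random_state=0)
    return mds.fit_transform(dist)    # one embedded point per sensor
```

For a ring of range sensors, the recovered embedding should approximate the physical ring up to rotation, reflection, and scale.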

SLIDE 8

Representing Objects (Modayil & Kuipers, 2007)

  • Intuition: Moving objects can be distinguished from a static world.
  • Approach: Use violations of a stationary background model to perceive moving objects.

SLIDE 9

Objects: Background Model

The agent has a model of the static environment:
  • Occupancy grid
  • Observation model: (pose, map) → observation
  • Operators to move the robot to a target pose
  • Update of the map and robot pose at each time-step
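A minimal sketch of the (pose, map) → observation mapping for a single range beam, assuming a 2-D occupancy grid; the grid resolution, occupancy threshold, and function name are illustrative assumptions, not details from the paper.

```python
# Sketch: expected range reading given a robot pose and an occupancy grid.
import numpy as np

def expected_range(grid, pose, bearing, resolution=0.05, max_range=8.0):
    """grid: 2-D array of occupancy probabilities; pose: (x, y, heading)."""
    x, y, heading = pose
    for d in np.arange(resolution, max_range, resolution):
        cx = int((x + d * np.cos(heading + bearing)) / resolution)
        cy = int((y + d * np.sin(heading + bearing)) / resolution)
        if not (0 <= cx < grid.shape[0] and 0 <= cy < grid.shape[1]):
            break                        # the beam left the mapped area
        if grid[cx, cy] > 0.5:           # an occupied cell stops the beam
            return d
    return max_range                     # nothing hit within sensor range
```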

SLIDE 10

Objects: Perception

Method (sketched in code below):
  1. Consider sensor readings that violate expectations of the static model.
  2. Cluster them in space and then in time.
  3. Compute new perceptual features from the clusters: distance = average sensor reading, angle = average sensor location.
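A sketch of steps 1 and 2 under simplifying assumptions: scan endpoints are given in 2-D, expected ranges come from a background model like the one sketched on the previous slide, single-linkage clustering stands in for the paper's spatial clustering, the temporal step is omitted, and both thresholds are illustrative.

```python
# Sketch: moving-object candidates as spatial clusters of readings that
# violate the static background model.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def perceive_violations(scan_xy, ranges, expected,
                        violation_m=0.5, cluster_m=0.3):
    """scan_xy: (N, 2) beam endpoints; ranges, expected: (N,) distances."""
    points = scan_xy[expected - ranges > violation_m]  # beams stopped short
    if len(points) < 2:
        return []
    labels = fcluster(linkage(points, method="single"),
                      t=cluster_m, criterion="distance")
    # summarize each cluster by its centroid (one candidate object each)
    return [points[labels == k].mean(axis=0) for k in np.unique(labels)]
```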

SLIDE 11

Objects: Learned Shapes


Note: shape models have size information

SLIDE 12

Objects: Learning Operators


Method:
  1. Perform motor babbling to collect data.
  2. Use batch learning to find contexts and motor outputs that reliably change an attribute every time step (one-second time steps).
  3. Evaluate the learned operators.

Example (Operator 4): decrease distance to object.
  • Description: distance(τ), decrease, δ < -0.19
  • Context: distance(τ) ≥ 0.43, 69 ≤ angle(τ) ≤ 132
  • Motor outputs: (0.2 m/s, 0.0 rad/s)
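The learned operator above can be read as a small data structure: a context that must hold, a motor command, and the change it reliably produces. A hypothetical rendering (the dataclass layout is illustrative, not the paper's representation):

```python
# Sketch: a learned operator as context + motor output + reliable effect,
# instantiated with the values of Operator 4 above.
from dataclasses import dataclass

@dataclass
class Operator:
    attribute: str     # attribute the operator reliably changes
    delta: float       # per-step change bound (here: decrease by 0.19)
    context: dict      # feature name -> (low, high) preconditions
    motor: tuple       # motor outputs (linear m/s, angular rad/s)

    def applicable(self, features):
        """True when every context precondition holds."""
        return all(lo <= features[name] <= hi
                   for name, (lo, hi) in self.context.items())

op4 = Operator(attribute="distance", delta=-0.19,
               context={"distance": (0.43, float("inf")),
                        "angle": (69.0, 132.0)},
               motor=(0.2, 0.0))
```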

SLIDE 13

Objects: Using Operators


[Figure: using the learned operators; location(τ) shown relative to the robot heading, with segments labelled angle(τ) increasing, distance(τ) decreasing, and angle(τ) decreasing.]

SLIDE 14

Learning models that support reasoning

  • Representations that support human-scale abstract reasoning can be learned from sensorimotor experience.
  • Is a robot's sensorimotor stream sufficient for learning all useful knowledge?
  • How can the learning process be improved?
    • Simple unified semantics with broad applicability
    • Clarify assumptions
    • Incremental learning algorithms
    • Remove need for human oversight


SLIDE 15

Rich representations that support learning


SLIDE 16

Learning to make predictions

  • A prediction is a claim about a robot's future experience.
  • Predictions verified by experiments are the foundation of scientific knowledge.
  • Thus, the semantics of experimentally verifiable predictions could be a useful foundation for a robot's knowledge.
  • An efficient online, incremental algorithm would enable the robot to make and learn many such predictions in parallel, e.g. temporal-difference reinforcement learning algorithms.


SLIDE 17

General value functions (GVFs)

These four functions define the semantics of an experimentally verifiable prediction:

  • policy: π : A × S → [0, 1]
  • pseudo reward: r : S → ℝ
  • termination: γ : S → [0, 1]
  • terminal reward: z : S → ℝ

V^{π,γ,r,z}(s) = E[ r(s₁) + ... + r(s_k) + z(s_k) | s₀ = s, a₀:k ∼ π, k ∼ γ ]

The experimental question: by selecting actions with the policy, how much reward will be received before termination?

Note 1: A GVF is a value function, but with a generic reward and termination.
Note 2: A constant termination probability corresponds to a timescale: continuing with probability γ per step gives an expected horizon of 1/(1 − γ) steps, so γ = 0.9875 at 0.1 s per step corresponds to roughly 8 seconds.
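The expectation above can be estimated by sampling. A minimal Monte-Carlo sketch, assuming hypothetical env.step, policy, r, gamma, and z callables (these interfaces are assumptions, not from the slides):

```python
# Sketch: one Monte-Carlo sample of the GVF return
# r(s_1) + ... + r(s_k) + z(s_k), where the prediction continues from
# each visited state s with probability gamma(s).
import random

def gvf_sample(env, s, policy, r, gamma, z):
    total = 0.0
    while True:
        a = policy(s)                    # a ~ pi(.|s)
        s = env.step(s, a)               # advance to the next state
        total += r(s)                    # pseudo reward at every state
        if random.random() > gamma(s):   # terminate with prob 1 - gamma(s)
            return total + z(s)          # terminal reward at s_k
```

Averaging gvf_sample over many rollouts approximates V^{π,γ,r,z}(s); the temporal-difference methods on the next slides estimate the same quantity incrementally, without rollouts.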

SLIDE 18

The Horde Architecture

(Sutton et al, 2011)

[Figure: the Horde architecture. Sensorimotor data feeds a non-linear sparse re-coder (e.g., tile coding), which produces a sparse, mostly-binary feature representation shared by many demons; each demon emits a prediction.]

Each demon is a full RL agent estimating a general value function.

  • Sparsely activated binary features φ_t (#active ≪ #features).
  • Each computed prediction p is a linear function of the features: p = ⟨θ_t, φ_t⟩.
  • The weights θ can be learned incrementally in O(#features) time per step by TD(λ) or related algorithms.
  • GVF predictions can be learned in parallel and online (a sketch follows below).
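Here is a sketch of one demon's TD(λ) step with binary features represented by their active indices, matching the O(#features) per-step cost stated above (the trace decay touches the full vector). For simplicity it assumes a constant continuation probability γ and z = 0, as in the light-sensor example on slide 20; the names are illustrative.

```python
# Sketch: linear TD(lambda) with accumulating traces for one demon.
# phi and phi_next are index arrays of the active binary features, so
# the prediction p = <theta, phi> reduces to a sparse sum.
import numpy as np

def td_lambda_step(theta, e, phi, phi_next, reward, gamma, lam, alpha):
    delta = reward + gamma * theta[phi_next].sum() - theta[phi].sum()
    e *= gamma * lam               # decay every trace: O(#features)
    e[phi] += 1.0                  # accumulate on the active features
    theta += alpha * delta * e     # TD(lambda) weight update
    return theta, e                # prediction: p = theta[phi].sum()
```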

SLIDE 19

The firehose of experience

[Figure: normalized sensor values streaming over timesteps of 0.1 second each.]

SLIDE 20

Predictions of a Light Sensor

[Figure: light-sensor predictions over three hours of experience. Curves: the online TD(λ) prediction, the ideal 8 s Light3 prediction (left scale), the offline prediction at the best weight vector, and the Light3 pseudo reward (right scale).]

The predictions learned online by TD(λ) are comparable to the ideal predictions and approach the accuracy of the best weight vector (shown after 3 hours of experience).

GVF specification: r = Light3, γ = 0.9875, π = robot behaviour, z = 0.

SLIDE 21

Scales to thousands of predictions

(Modayil, White, Sutton, 2012)

The 2000+ predictions use 6000+ shared features and shared parameters, cover all sensors and many state bits, span 4 timescales (0.1, 0.5, 2, and 8 seconds), and update every 55 ms.

[Figure: cumulative mean squared error, normalized by the dataset sample variance, over 180 minutes; median and mean curves shown per sensor group (Acceleration, MotorTemperature, OverheatingFlag, Light, MotorSpeed, IR, MotorCurrent, IRLight, Thermal, LastAction, RotationalVelocity, Magnetic, MotorCommand), with a unit-variance reference.]

All experience & learning performed within hours!

SLIDE 22

Learning predictions about different policies

  • Off-policy learning enables the robot to learn the consequences of following different policies from a single stream of experience.
  • Gradient temporal-difference algorithms provide stable, incremental, off-policy learning (Maei & Sutton, 2009); a sketch follows below.
  • These methods work at scale with robots (White, Modayil, Sutton, 2012).
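As referenced above, here is a sketch of one gradient-TD update in the TDC/GQ(0) style, simplified to λ = 0, z = 0, and dense feature vectors; ρ is the importance-sampling ratio π(a|s)/b(a|s) between the target and behaviour policies. This compressed form is a paraphrase, not the pseudocode of Maei & Sutton's GQ(λ) (see the bibliography).

```python
# Sketch: a TDC-style gradient-TD step for off-policy GVF learning.
# theta: prediction weights; w: secondary weights that correct the
# gradient; phi, phi_next: dense feature vectors for s and s'.
import numpy as np

def gtd_step(theta, w, phi, phi_next, reward, gamma, rho, alpha, beta):
    delta = reward + gamma * (theta @ phi_next) - theta @ phi
    theta += alpha * rho * (delta * phi - gamma * (w @ phi) * phi_next)
    w += beta * rho * (delta - w @ phi) * phi
    return theta, w
```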


SLIDE 23

Summary

  • Abstract models can be learned from sensorimotor experience.
    • Learned models of sensor space and objects support goal-directed planning.
  • A broad class of predictive knowledge can be learned at scale.
    • General value function predictions express an expected consequence of a precise experiment.
    • Temporal-difference algorithms can learn to make such predictions incrementally during normal robot experience.
  • Robots could benefit from a tighter integration between learning from experience and reasoning with models.

SLIDE 24

Bibliography

Model Learning
  • Modayil, J., and Kuipers, B. J. 2007. Autonomous development of a grounded object ontology by a learning robot. In Proc. 22nd National Conf. on Artificial Intelligence (AAAI-2007).
  • Modayil, J. 2010. Discovering sensor space: Constructing spatial embeddings that explain sensor correlations. In IEEE 9th International Conference on Development and Learning (ICDL).

Prediction Learning
  • Maei, H. R., and Sutton, R. S. 2010. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence.
  • Modayil, J., White, A., and Sutton, R. S. 2012. Multi-timescale nexting in a reinforcement learning robot. In SAB 2012, Springer, 299–309.
  • Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., and Precup, D. 2011. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems (AAMAS).
  • White, A., Modayil, J., and Sutton, R. S. 2012. Scaling life-long off-policy learning. In Second Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics (ICDL-EpiRob).