Two Perspectives on Representation Learning
Joseph Modayil
Reinforcement Learning and Artificial Intelligence Laboratory University of Alberta
Reasoning & Learning: Two perspectives on knowledge representation
An analogy to scientific theories and experiments:
Scientific theories enable effective reasoning even when inaccurate.
Many experiments are needed to understand the world.
Representations can be constructed from robot experience by searching for particular statistical structures.
[Figure: sensor readings plotted over time.]
Pipeline: Original Sensors → Gather Experience → Analyze Time-series → Construct Sensor Geometry
Method:
1. Define local distances between strongly correlated sensors.
2. Use the fast maximum variance unfolding algorithm to construct a manifold.
Conclusion: A robotʼs experience can contain enough information to recover approximate local sensor geometry (and perhaps global geometry).
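The two steps can be sketched in a few lines. Maximum variance unfolding itself requires a semidefinite-program solver, so this sketch (my own, not the paper's implementation) substitutes classical MDS for step 2, and uses the correlation distance d = √(2(1 − ρ)) as one illustrative way to define step 1's local distances.

```python
import numpy as np

def sensor_geometry(readings):
    """Approximate sensor layout from a (T, n_sensors) time-series.

    Step 1: turn pairwise correlations into distances, so strongly
    correlated sensors end up close together.
    Step 2: embed those distances in 2-D; classical MDS stands in here
    for the fast maximum variance unfolding algorithm on the slide.
    """
    corr = np.corrcoef(readings.T)                      # pairwise correlations
    dist = np.sqrt(np.maximum(2.0 * (1.0 - corr), 0.0)) # correlation distance
    n = dist.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n                 # centering matrix
    B = -0.5 * J @ (dist ** 2) @ J                      # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:2]                    # two largest eigenvalues
    coords = vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))
    return coords

# Sensors on a line: cumulative sums make neighbouring columns correlate
# more strongly than distant ones, so the embedding recovers a 1-D layout.
rng = np.random.default_rng(0)
readings = np.cumsum(rng.normal(size=(2000, 5)), axis=1)
coords = sensor_geometry(readings)
print(coords.shape)                                     # (5, 2)
```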
(Modayil & Kuipers, 2007)
The agent has a model of the static environment: (pose, map) → observation. The agent is given a target pose and estimates its pose at each time-step.
Method:
1. Find the sensor readings that violate the expectations of the static model.
2. Cluster the violating readings and track them over time.
3. Construct object attributes from the clusters: distance = average sensor reading; angle = average sensor location.
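A minimal sketch of that grouping step, with my own function name and threshold; it finds the readings that violate the static-model expectation and summarizes them with the slide's two attributes:

```python
import numpy as np

def detect_object(expected, observed, angles, threshold=0.5):
    """Group range readings that violate the static model's expectations.

    expected: readings predicted from (pose, map); observed: actual
    readings; angles: each reading's bearing in degrees. Returns the
    slide's object attributes, or None if nothing violates the model.
    """
    violating = np.abs(observed - expected) > threshold
    if not violating.any():
        return None
    return {
        "distance": float(observed[violating].mean()),  # average sensor reading
        "angle": float(angles[violating].mean()),       # average sensor location
    }

# A static map predicts 5 m everywhere; something stands 1 m away near 90°.
expected = np.full(180, 5.0)
observed = expected.copy()
observed[85:95] = 1.0
obj = detect_object(expected, observed, np.arange(180.0))
print(obj)                                  # {'distance': 1.0, 'angle': 89.5}
```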
Note: shape models have size information
Method:
1. Perform motor babbling to collect data.
2. Use batch learning to find contexts and motor outputs that reliably change an attribute every time step (or every second time step).
3. Evaluate the learned operators.

Operator 4: Decrease distance to object
Description: distance(τ), decrease, δ < -0.19
Context: distance(τ) ≥ 0.43, angle(τ) ≤ 132, angle(τ) ≥ 69
Motor outputs: (0.2 m/s, 0.0 rad/s)
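Step 2 of the method can be sketched as a simple reliability filter over logged data. The function name, thresholds, and data layout below are my own illustration, not the paper's code:

```python
import numpy as np

def find_reliable_operators(in_context, motors, deltas,
                            delta_threshold=-0.19, reliability=0.95):
    """Return motor outputs that reliably change the attribute in context.

    in_context: boolean array, True where the candidate context held;
    motors: the motor output issued at each step; deltas: the attribute
    change observed on the following step.
    """
    reliable = []
    for m in sorted(set(motors)):
        mask = np.array([mo == m for mo in motors]) & in_context
        if not mask.any():
            continue
        if (deltas[mask] < delta_threshold).mean() >= reliability:
            reliable.append(m)
    return reliable

# Driving forward decreases distance(τ); turning in place does not.
motors = [(0.2, 0.0), (0.0, 0.5)] * 50
in_context = np.ones(100, dtype=bool)
deltas = np.where([m == (0.2, 0.0) for m in motors], -0.25, 0.0)
print(find_reliable_operators(in_context, motors, deltas))  # [(0.2, 0.0)]
```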
[Figure: a learned operator shown relative to the robot heading, with regions around location(τ) where angle(τ) is increasing, distance(τ) is decreasing, and angle(τ) is decreasing.]
Conclusion: Objects and operators that support reasoning can be learned from sensorimotor experience. How can a robot learn other useful knowledge?
Predictions are central to scientific knowledge. Experimentally verifiable predictions could be a useful foundation for a robotʼs knowledge. This requires the robot to make and learn many such predictions in parallel.
These four functions define the semantics of an experimentally verifiable prediction:
policy π : A × S → [0, 1]
pseudo reward r : S → ℝ
termination γ : S → [0, 1]
terminal reward z : S → ℝ

Vπ,γ,r,z(s) = E[r(s₁) + … + r(s_k) + z(s_k) | s₀ = s, a₀:k ∼ π, k ∼ γ]
The experimental question: By selecting actions with the policy, how much reward will be received before termination?
Note 1: A GVF is a value function, but with a generic reward and termination.
Note 2: A constant termination probability corresponds to a timescale.
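The definition above can be checked directly by Monte Carlo rollouts, which makes the "experimental question" literal. This is my own sketch; the callables mirror the four functions of a GVF, and the one-state environment is a stand-in:

```python
import random

def gvf_value(s0, step, pi, r, gamma, z, n_rollouts=20000, seed=0):
    """Monte Carlo estimate of V^{pi,gamma,r,z}(s0).

    step(s, a) -> next state; pi(s) -> action; r and z map states to
    rewards; gamma(s) is the probability of continuing from state s.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rollouts):
        s, g = s0, 0.0
        while True:
            s = step(s, pi(s))
            g += r(s)                     # accumulate r(s1) ... r(sk)
            if rng.random() > gamma(s):   # terminate with probability 1 - gamma(s)
                g += z(s)                 # add the terminal reward z(sk)
                break
        total += g
    return total / n_rollouts

# Sanity check of Note 2: with r = 1, z = 0 and constant gamma = 0.9, the
# prediction is the expected number of steps to termination, 1/(1-0.9) = 10.
v = gvf_value(s0=0, step=lambda s, a: s, pi=lambda s: 0,
              r=lambda s: 1.0, gamma=lambda s: 0.9, z=lambda s: 0.0)
```

Here v comes out near 10, the expected termination time under constant γ = 0.9.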
(Sutton et al, 2011)
[Architecture: sensorimotor data feeds a non-linear sparse re-coder (e.g., tile coding), producing a sparse, mostly-binary feature representation; banks of demons compute the predictions (PSR). Each demon is a full RL agent estimating a general value function.]
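Tile coding, named in the diagram as an example re-coder, can be sketched for a single scalar input. The grid sizes and offsets below are illustrative choices, not the deck's:

```python
def tile_code(x, n_tilings=8, tiles_per_dim=4, lo=0.0, hi=1.0):
    """Return the active feature indices for a scalar x in [lo, hi].

    Each tiling is a uniform grid shifted by a different offset, so x
    activates exactly one tile per tiling: a sparse, binary feature
    vector with #active = n_tilings << #features.
    """
    width = (hi - lo) / tiles_per_dim
    active = []
    for t in range(n_tilings):
        offset = t * width / n_tilings                # stagger the tilings
        idx = int((x - lo + offset) / width)
        idx = min(idx, tiles_per_dim)                 # clamp the top edge
        active.append(t * (tiles_per_dim + 1) + idx)  # unique index per tiling
    return active

features = tile_code(0.5)
print(len(features))        # 8 active features out of 8 * 5 = 40
```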
Sparsely activated binary features φt (#active ≪ #features). Each computed prediction p is a linear function: p = ⟨θt, φt⟩. The weights θ can be learned incrementally in O(#features) time per step by TD(λ) or related algorithms, so GVF predictions can be learned in parallel and online.
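A minimal sketch of one such learner, with illustrative hyper-parameters. Indexing by the active features keeps the prediction itself at O(#active), while the dense trace update gives the O(#features) per-step bound stated above:

```python
import numpy as np

class TDLambdaDemon:
    """Linear TD(lambda) over sparse binary features (an illustrative sketch)."""

    def __init__(self, n_features, alpha=0.1, gamma=0.9875, lam=0.9):
        self.theta = np.zeros(n_features)   # prediction weights
        self.e = np.zeros(n_features)       # eligibility trace
        self.alpha, self.gamma, self.lam = alpha, gamma, lam

    def predict(self, active):
        return self.theta[active].sum()     # p = <theta, phi>, phi binary

    def update(self, active, reward, next_active):
        delta = (reward + self.gamma * self.predict(next_active)
                 - self.predict(active))    # TD error
        self.e *= self.gamma * self.lam     # decay traces (O(#features))
        self.e[active] += 1.0               # accumulate on active features
        self.theta += self.alpha * delta * self.e

# Toy check: one always-active feature and a constant pseudo reward of 1.
# The prediction converges toward 1 / (1 - gamma) = 2 for gamma = 0.5.
demon = TDLambdaDemon(n_features=1, alpha=0.1, gamma=0.5, lam=0.0)
for _ in range(2000):
    demon.update(active=[0], reward=1.0, next_active=[0])
print(round(demon.predict([0]), 2))         # 2.0
```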
[Figure: online TD(λ) predictions of the Light3 pseudo reward over seconds of experience, for 512 and 1024 features, compared with the ideal 8s Light3 prediction (left scale) and the Light3 pseudo reward (right scale); the ideal offline 8s Light3 prediction is also shown.]
The predictions learned online by TD(λ) are comparable to the ideal predictions and approach the accuracy of the best offline solution (shown after 3 hours of experience).
r = Light3, γ = 0.9875, π = robot behaviour, z = 0
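The γ above is Note 2's timescale in disguise. A quick check, assuming a 0.1 s time step (my assumption, consistent with the "8s" label on the prediction):

```python
# Expected number of steps before termination under a constant
# continuation probability gamma is 1 / (1 - gamma).
gamma = 0.9875
dt = 0.1                        # assumed step duration in seconds
timescale = dt / (1 - gamma)    # 0.1 / 0.0125 = 8.0 seconds
print(round(timescale, 3))      # 8.0
```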
(Modayil, White, Sutton, 2012)
The 2000+ predictions use 6000+ shared features and shared parameters, cover all sensors and many state bits, span 4 timescales (0.1, 0.5, 2, and 8 seconds), and update every 55 ms.
[Figure: cumulative mean squared error, normalized by dataset sample variance, over 180 minutes of experience; curves grouped by sensor type (Acceleration, MotorTemperature, OverheatingFlag, Light, MotorSpeed, IR, MotorCurrent, IRLight, Thermal, LastAction, RotationalVelocity, Magnetic, MotorCommand), with the median, mean, and a unit-variance reference shown.]
Conclusions:
Learned predictions can support goal-directed planning.
Each prediction is the consequence of a precise experiment.
A robot can learn many predictions incrementally during normal robot experience.
Together, the two perspectives connect learning from experience and reasoning with models.
References
Modayil, J., & Kuipers, B. (2007). Autonomous development of a grounded object ontology by a learning robot. In Proc. 22nd National Conf. on Artificial Intelligence (AAAI-2007).
prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence.
Sutton, R. S., et al. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proc. 10th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS).
In Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics (ICDL-EpiRob).