Learning Skills from Play: Artificial Curiosity on a Katana Robot Arm

Hung Ngo, Matthew Luciw, Alexander Forster, and Juergen Schmidhuber

IDSIA-SUPSI-USI, Lugano, Switzerland {hung, matthew, alexander, juergen}@idsia.ch

IEEE International Joint Conference on Neural Networks (June 15, 2012)


Outline

Introduction
Progress-Based Artificial Curiosity
System Architecture
Experiments and Results
Conclusion

Introduction

Learning from Play

Developmental robotics: lessons from children.
Intrinsically motivated playing: no external rewards!!!
Constructive play with manipulation skills.

(image from www.safekidscanada.ca)

Artificial Curiosity? A Theory: Compression Progress


Juergen Schmidhuber (1990-now): A creative agent needs two learning components: a Reinforcement Learner R and a Predictor P.
The learning progress, or expected improvement, of P becomes an intrinsic reward for R.
Hence, to achieve high intrinsic reward, R is motivated to create new experiences such that P makes quick progress.

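To make the loop concrete, here is a minimal, self-contained Python sketch of progress-based intrinsic reward: the predictor P is updated on each experience, and the drop in its prediction error is the only reward handed to the reinforcement learner R. The toy table-based predictor and all names are illustrative assumptions, not the paper's implementation.

    import numpy as np

    class Predictor:
        """Toy predictor P: running estimate of the outcome probability
        for each (state, action) pair."""
        def __init__(self, n_states, n_actions):
            self.p = np.full((n_states, n_actions), 0.5)  # initial guess
            self.n = np.zeros((n_states, n_actions))      # visit counts

        def error(self, s, a, outcome):
            return (outcome - self.p[s, a]) ** 2

        def update(self, s, a, outcome):
            self.n[s, a] += 1
            self.p[s, a] += (outcome - self.p[s, a]) / self.n[s, a]

    def intrinsic_reward(P, s, a, outcome):
        """Learning progress of P on one experience: error before the
        update minus error after it. This is R's only reward signal."""
        before = P.error(s, a, outcome)
        P.update(s, a, outcome)
        after = P.error(s, a, outcome)
        return max(0.0, before - after)

    P = Predictor(n_states=3, n_actions=6)
    r = intrinsic_reward(P, s=1, a=3, outcome=1.0)  # large while P is still learning

The reward fades as P masters a situation, so R is pushed on toward experiences it can still learn from.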


Details of the System

Top-down image of the workspace

Innate Knowledge and Skills

Pick a selected block.
Place it at a selected location (X, Y coordinates & height).
Outcome concepts “Stable/Unstable”.
But, what to pick and where to place?



Top-down image of the workspace – Cropped


Workspace – Boundaries Extraction


Receptive Field: Observation Feature Extraction


A receptive field is centered on each potential placement location, and a binary feature vector F is filled in cell by cell:

F=(0,.,.,.,.) → F=(0,0,.,.,.) → F=(0,0,1,.,.) → F=(0,0,1,0,.) → F=(0,0,1,0,0)

s=1: height 1; a=1: F has 1 bit set.

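As a rough illustration of this step, the sketch below builds F from a grid height map. The 5-cell layout and the occupancy test are assumptions for illustration; the paper's exact receptive-field geometry is not reproduced here.

    import numpy as np

    # Assumed 5-cell receptive field: the candidate cell plus 4 neighbours.
    OFFSETS = ((0, 0), (-1, 0), (1, 0), (0, -1), (0, 1))

    def receptive_field_features(height_map, x, y, h):
        """Binary feature vector F for placing a block at (x, y) at height h:
        each bit says whether the corresponding cell already holds a block
        at height h or above (semantics assumed for illustration)."""
        F = []
        for dx, dy in OFFSETS:
            nx, ny = x + dx, y + dy
            inside = (0 <= nx < height_map.shape[0]
                      and 0 <= ny < height_map.shape[1])
            F.append(1 if inside and height_map[nx, ny] >= h else 0)
        s = h                    # state label: candidate height
        a = int(np.sum(F))       # action label: number of set bits
        return np.array(F), s, a

For the slide's example, a single occupied neighbour at height 1 yields F=(0,0,1,0,0) with s=1 and a=1.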

Receptive Field: Fovea-like Subimage Observations

Example fovea-like subimage observations with their state-action labels: (s0,a0), (s1,a5), (s2,a1), (s2,a5).


Predictors Pi: Learning how the world works


Each Pi predicts a basic physical concept: whether the placed block will stay there after being released.
Self-generated labels (Stable ≡ +1, Unstable ≡ −1).
Implemented as RLS¹-based online linear classifiers [4].
Extended to also give a confidence interval for each prediction.
The learning progress, calculated as confidence improvement, is used as the intrinsic reward.

¹Regularized Least Squares
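A minimal sketch of one such predictor, assuming a standard recursive least-squares update (Sherman-Morrison) and an ellipsoidal confidence width; the actual estimator follows [4] and may differ in detail.

    import numpy as np

    class RLSClassifier:
        """Online RLS-based linear classifier with a simple confidence
        term; a sketch in the spirit of [4], not the exact implementation."""
        def __init__(self, dim, reg=1.0):
            self.w = np.zeros(dim)          # weight vector
            self.A_inv = np.eye(dim) / reg  # inverse of regularized Gram matrix

        def update(self, x, y):
            """Rank-one update for one example x with label y in {-1, +1}."""
            Ax = self.A_inv @ x
            k = Ax / (1.0 + x @ Ax)         # gain vector (Sherman-Morrison)
            self.w += k * (y - self.w @ x)
            self.A_inv -= np.outer(k, Ax)

        def predict(self, x):
            """Return the score (sign = Stable/Unstable) and a confidence
            width that shrinks as similar inputs accumulate."""
            return float(self.w @ x), float(np.sqrt(x @ self.A_inv @ x))

The decrease of the confidence width after an update is one natural reading of the "confidence improvement" used as intrinsic reward.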


Reinforcement Learner R: Planning Exploration


R creates policies on the learned MDP, using the approximated transition probabilities as the model P and the current learning progress associated with each state-action pair as the expected reward R.
The policy is updated through least-squares policy iteration (LSPI).
This curiosity-driven exploration policy tries to improve the agent's knowledge, i.e., to improve the predictors' performance as quickly as possible.
As a byproduct, the agent also improves its skills.
Again: no external rewards!!!

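The paper plans with LSPI; as a simpler stand-in for that planning step, here is a sketch of value iteration on the learned MDP, with smoothed transition counts as the model and per-(s, a) learning progress as the reward. The count-based model and all names are assumptions.

    import numpy as np

    def plan_exploration_policy(T_counts, progress, gamma=0.9, iters=100):
        """Greedy curiosity-driven policy on the learned MDP.
        T_counts[s, a, s']: observed transition counts.
        progress[s, a]: current learning-progress estimate, used as reward.
        Note: the paper uses LSPI here; plain value iteration is shown
        as a simpler stand-in for the planning step."""
        T = T_counts + 1.0                    # smoothed counts
        T = T / T.sum(axis=2, keepdims=True)  # transition probabilities
        V = np.zeros(T.shape[0])
        for _ in range(iters):
            Q = progress + gamma * (T @ V)    # Q[s, a] under current V
            V = Q.max(axis=1)
        return Q, Q.argmax(axis=1)            # Q-values and greedy policy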


Where to Place: Most Informative Action Selection

(s*, a*) = argmax Q(s, a), subject to (s, a) being contained in the world model.

Example: the search over candidate placements settles on (s1, a3).
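A sketch of this constrained selection, assuming the Q-values from the planner above and a list of the state-action pairs currently realizable in the world model:

    def most_informative_placement(Q, available):
        """(s*, a*) = argmax Q(s, a) over the placements that are actually
        realizable in the current block configuration."""
        return max(available, key=lambda sa: Q[sa])

    # e.g. most_informative_placement(Q, [(1, 0), (1, 1), (1, 2), (1, 3)])
    # could return (1, 3), the slide's (s1, a3).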

Where to Place: Action Selection Example

The fovea scans the candidate placements in state s1, evaluating (s1,a0), (s1,a1), (s1,a2), ... while searching for (s1,a3). Once it reaches (s1,a3): best placement location found!


Experiments and Results

Simulated Block World

[Figure: deviation from the true model (squared error) vs. interactions with the environment (up to 2000); max height = 8; methods: Random Actions, Try New Things (Optimistic), Curiosity.]

We compare the exploration efficiency of our method to random action selection and “optimistic initialization” [5].


Real World Experiments


“High impact” demo video.


[Figure: approximate probability of stability (“Apx. Prob. Stability”) over ~40 interactions, one panel per action (Actions 1-6), with separate curves for Height 1 and Height 2.]

Predictive knowledge gained through playing experiences.

(Initially the estimates change rapidly as new data comes in, but most seem to converge to a sensible result.)

[Figure: Katana robot experience; height placed upon (1 or 2) vs. interactions with the environment (up to 60).]

Developmental stages.
Tower building as emergent behavior.

Conclusion

Learning from Play


Progress-based exploration may lead to efficient knowledge acquisition (without external rewards).
Skills can be accumulated as a by-product.
The agent progresses toward increasingly complex knowledge and skills.

Main References

[1] J. Schmidhuber. Developmental robotics, optimal artificial curiosity, creativity, music, and the fine arts. Connection Science, 18(2):173–187, 2006.
[2] J. Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990-2010). IEEE TAMD, 2(3):230–247, 2010.
[3] J. Storck, S. Hochreiter, and J. Schmidhuber. Reinforcement driven information acquisition in non-deterministic environments. In ICANN, volume 2, pages 159–164, 1995.
[4] V. Vovk. Competitive on-line statistics. International Statistical Review, 69(2):213–248, 2001.
[5] R.I. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. JMLR, 3:213–231, 2003.


Thank you!!!

More questions: hung@idsia.ch


Backup Slides
