Research @ Vicarious AI: toward data efficiency, task generality and conceptual understanding


SLIDE 1
  • Research @ Vicarious AI: toward data efficiency, task generality and conceptual understanding
  • Huayan Wang, huayan@vicarious.com

SLIDE 2
  • Breakout: Human vs. A3C (Mnih et al., 2016), state-of-the-art Deep RL

SLIDE 3
  • When playing the game, we understand it by concepts, causes, and effects.
SLIDE 4
  • Do deep reinforcement learning agents understand concepts, causes, and effects?

SLIDE 5
  • Generalization tests: paddle shifted up, random target, center wall
  • A3C (Mnih et al., 2016), state-of-the-art Deep RL

SLIDE 6
  • Schema networks (ICML ’17): paddle shifted up, random target, center wall

SLIDE 7
  • Vicarious AI research themes
  • Strong inductive bias and data efficiency
  • Task generality
  • Conceptual understanding / model-based approaches
  • Neuro & cognitive sciences
SLIDE 8
  • Outline
  • Vicarious AI research overview
  • Schema networks (ICML ’17)
  • Teaching compositionality to CNNs (CVPR ’17)
SLIDE 9
  • Schema networks: the problem we want to solve
  1. Learn a causal model of an environment
  2. Use that model to make a plan
  3. Generalize to new environments where causation is preserved

SLIDE 10
  • Trained on MiniBreakout
SLIDE 11
  • The model had to learn:
  • What causes rewards? Does color matter?
  • Which movements are caused by actions?
  • Why does the ball change direction?
  • Why can’t the paddle move through a wall?
  • Why does the ball bounce differently depending on where it hits the paddle, but not for bricks or walls?

SLIDE 12
  • Learning efficiency on MiniBreakout* (perfect score = 30)
  • * Best of 5 training runs for A3C. Mean of all 5 training runs for schemas.

SLIDE 13
  • Zero-shot transfer: paddle shifted up, center wall, standard

SLIDE 14
  • Entity Representation
  • An entity is any trackable visual feature with associated attributes, represented as random variables.
  • Typical entities: objects, parts of objects, object boundaries, surfaces & contours

SLIDE 15
  • Entity Representation
  • All entities share the same set of attributes (a minimal sketch follows below).
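To make this representation concrete, here is a minimal Python sketch (the attribute names and fields are illustrative placeholders, not taken from the slides or the paper): every entity carries the same fixed set of binary attributes, plus a position so it can be tracked over time.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Illustrative attribute set; the actual attributes used by schema networks may differ.
ATTRIBUTE_NAMES = ["present", "moving_up", "moving_down", "touching_wall"]

@dataclass
class Entity:
    """A trackable visual feature whose attributes are (here Boolean) random variables."""
    entity_id: int
    position: Tuple[int, int]  # (row, col) of the tracked feature in the frame
    attributes: Dict[str, bool] = field(
        default_factory=lambda: {name: False for name in ATTRIBUTE_NAMES}
    )

# Because all entities share the same attribute set, a frame is simply a list of entities.
frame = [
    Entity(entity_id=0, position=(10, 4)),   # e.g. the ball
    Entity(entity_id=1, position=(20, 7)),   # e.g. a paddle segment
]
```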

SLIDE 16
  • Schema Definition

A schema describes how the future value of an entity’s attribute depends on the current values of that entity’s attributes and possibly other nearby entities.

SLIDE 17
  • Model Definition
  • Schemas are ORed together to predict a single variable, and self-transition factors carry over states unaffected by any schema (a minimal sketch follows below).
  • Factor graph legend: blue = schema, yellow = self-transition (ST), red = OR
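To illustrate the "OR of schemas plus self-transition" structure, here is a minimal sketch (illustrative names and a single-timestep, single-attribute simplification; not the authors' factor-graph implementation): each schema is an AND over current binary values, schemas predicting the same attribute are ORed, and the self-transition carries the old value forward when no schema fires.

```python
# Minimal sketch of the OR-of-ANDs prediction rule with a self-transition fallback.
# A schema is an AND over (variable name -> required value) preconditions.

def schema_active(schema, state):
    """A schema fires iff all of its binary preconditions hold in the current state."""
    return all(state.get(var, False) == value for var, value in schema["preconditions"].items())

def predict_next_value(target_var, schemas, state):
    """OR the schemas that predict `target_var`; fall back to self-transition."""
    relevant = [s for s in schemas if s["predicts"] == target_var]
    if any(schema_active(s, state) for s in relevant):
        return True                      # some active schema causes the attribute to become true
    return state.get(target_var, False)  # self-transition: carry over the unaffected value

# Example: the ball starts moving up when it is moving down and touches the paddle.
schemas = [{"predicts": "ball_moving_up",
            "preconditions": {"ball_moving_down": True, "ball_touching_paddle": True}}]
state = {"ball_moving_down": True, "ball_touching_paddle": True, "ball_moving_up": False}
print(predict_next_value("ball_moving_up", schemas, state))  # True
```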

SLIDE 18
  • Model Definition
  • An ungrounded schema is “convolved” to construct a factor graph of grounded schemas, which are bound to specific entities, positions, and times.
  • Factor graph legend: blue = schema, yellow = self-transition (ST), red = OR

SLIDE 19
  • Learning Strategy
  • For each entity, record all other entity states within a given neighborhood at all times.
  • Convert each neighborhood state into a binary vector.
  • Greedily learn one schema at a time using LP, removing all correctly predicted timesteps before learning the next schema (a simplified sketch follows below).
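A simplified sketch of the greedy loop above (hedged: it seeds each schema from one uncovered positive example and generalizes it by dropping literals, rather than solving the LP relaxation used in the actual method):

```python
import numpy as np

def learn_one_schema(X, remaining, y):
    """Seed a conjunction from one still-uncovered positive example, then greedily drop
    literals as long as the conjunction never fires on any negative example.
    X: (T, D) boolean matrix of neighborhood states; remaining, y: (T,) boolean labels."""
    seeds = np.flatnonzero(remaining)
    if len(seeds) == 0:
        return None
    schema = X[seeds[0]].copy()            # full precondition set of one positive example
    negatives = X[~y]
    for j in np.flatnonzero(schema):
        trial = schema.copy()
        trial[j] = False                   # try removing this literal
        if not negatives[:, trial].all(axis=1).any():
            schema = trial                 # keep the more general schema if still consistent
    return schema

def learn_schemas(X, y, max_schemas=10):
    """Greedy loop from the slide: learn one schema, remove the timesteps it already
    predicts correctly, and repeat on what remains."""
    schemas, remaining = [], y.copy()
    while remaining.any() and len(schemas) < max_schemas:
        schema = learn_one_schema(X, remaining, y)
        if schema is None:
            break
        schemas.append(schema)
        covered = X[:, schema].all(axis=1)   # timesteps the new schema already predicts
        remaining &= ~covered
    return schemas
```

Applied to the binary neighborhood vectors from the previous bullet, each learned schema is a conjunction of attribute bits that predicts the target variable at the next timestep.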

SLIDE 20
  • Inference Method
  • Perform max-prop forward in time until reaching a positive reward.
  • Recursively clamp the conditions of schemas to achieve desired states in the next timestep.
  • If clamping leads to an inconsistency, backtrack and try a different schema to cause a desired state (a toy sketch of this backward chaining follows below).
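A toy sketch of the backward-chaining idea (hedged: it ignores max-prop, time indexing, and entity grounding, and only chains positive preconditions; it is not the authors' inference code): pick a schema that causes the goal, recursively achieve its preconditions, and backtrack to a different schema when a branch fails.

```python
def achieve(goal_var, schemas, state, visited=frozenset(), plan=None):
    """Backward chaining over schemas: return an ordered list of schemas whose
    preconditions can be satisfied from `state`, or None if the goal is unreachable."""
    plan = plan if plan is not None else []
    if state.get(goal_var, False):
        return plan                                   # already true: nothing to do
    if goal_var in visited:
        return None                                   # avoid circular subgoals
    for schema in (s for s in schemas if s["predicts"] == goal_var):
        subplan = plan
        for var, value in schema["preconditions"].items():
            if not value:
                continue                              # sketch only chains positive preconditions
            subplan = achieve(var, schemas, state, visited | {goal_var}, subplan)
            if subplan is None:
                break                                 # inconsistency: backtrack, try next schema
        if subplan is not None:
            return subplan + [schema]                 # clamp this schema after its preconditions
    return None

# Example: reach a reward by first getting the paddle under the ball.
schemas = [
    {"predicts": "reward", "preconditions": {"ball_touching_brick": True}},
    {"predicts": "ball_touching_brick", "preconditions": {"paddle_under_ball": True}},
]
print([s["predicts"] for s in achieve("reward", schemas, {"paddle_under_ball": True})])
# -> ['ball_touching_brick', 'reward']
```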

SLIDE 21
  • Visualization of Max-Prop
SLIDE 22
  • Visualization of Max-Prop
SLIDE 23
  • Zero-shot transfer to Middle-Wall Breakout
  • Mean score per episode*: A3C (image only) 9.55 ± 17.44; A3C (image + entities) 8.00 ± 14.61; Schema Networks 35.22 ± 12.23
  • * Mean of best 2 of 5 training runs for A3C. Mean of all 5 training runs for schemas.

SLIDE 24
  • With additional training on Middle-Wall Breakout

* Best of 5 training runs for A3C. Mean of all 5 training runs for schemas.

SLIDE 25
  • Zero-shot transfer to Offset Paddle
  • Mean score per episode*: A3C (image only) 0.60 ± 20.05; A3C (image + entities) 11.10 ± 17.44; Schema Networks 41.42 ± 6.29

SLIDE 26
  • Zero-shot transfer to Random Target
  • Mean score per episode*: A3C (image only) 6.83 ± 5.02; A3C (image + entities) 6.88 ± 6.19; Schema Networks 21.38 ± 5.02

SLIDE 27
  • Zero-shot transfer to Juggling
  • Mean score per episode*: A3C (image only) 39.35 ± 14.57; A3C (image + entities) 17.52 ± 17.39; Schema Networks 0.11 ± 0.34
SLIDE 28
  • [Post-publication]: Predicting collisions with obstacles
SLIDE 29
  • [Post-publication]: Other games where we can learn the dynamics, but planning is tricky.
  • Our blog post: https://www.vicarious.com/schema-nets

SLIDE 30
  • Future work
  • Better learning methods needed for:
    • non-binary attributes
    • inherently stochastic dynamics
  • Real-world applications require working with visual representations from raw sensory inputs.

SLIDE 31
  • Conclusions
  • Model-based causal inference enables zero-shot transfer.
  • A compositional representation (entities, attributes) enabled flexible cause-and-effect modeling.
  • The schema network itself is compositional too, with ungrounded schemas as basic building blocks.
  • To perform causal inference with the same flexibility in the real world, we need to learn a compositional visual representation from raw inputs.

SLIDE 32
  • Next topic: compositionality in visual representation learning
SLIDE 33
  • Our representation of visual knowledge is compositional.
  • Example task: count the triangles.

SLIDE 34
  • Compositional visual representations
  • (Z.W. Tu et al 2005)
  • (S.-C. Zhu and D. Mumford, 2006)
  • (Z. Si and S.-C. Zhu, 2013)
  • (L. Zhu and A. Yuille, 2005)
  • (I. Kokkinos and A. Yuille, 2011)
  • (M. Lazaro-Gredilla et al, 2017)
  • …….
SLIDE 35
  • Hierarchical compositional feature learning (M. Lazaro-Gredilla et al., 2017)
  • https://arxiv.org/abs/1611.02252
  • Discovers natural building blocks of images as features
  • Learns using loopy BP (without an EM-like procedure)
SLIDE 36
  • The success / hype of deep learning
  • Conv-nets (CNNs) have become the “standard” representation in many vision applications:
  • Segmentation (J. Long, E. Shelhamer et al. 2015; P. O. Pinheiro et al. 2015)
  • Detection (R. Girshick et al. 2014; S. Ren et al. 2015)
  • Image description (A. Karpathy and L. Fei-Fei, 2015)
  • Image retrieval (J. Johnson et al. 2015)
  • 3D representations (C. B. Choy et al. 2016; H. Su et al. 2017)
  • ……
SLIDE 37
  • Is the CNN representation compositional?
SLIDE 38
  • How to test compositionality of CNN feature maps?
  • Compositionality: the representation of the whole should be composed of the representations of its parts.

SLIDE 39
  • Define compositionality for CNN feature maps
  • “Object” can be any primitive visual entity that we expect to re-use and recombine with other entities.

SLIDE 40
  • Define compositionality for CNN feature maps
  • Diagram: input image and its feature map; mask of the visual entity and its projected mask; feature map of the masked image vs. the masked feature map.
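As a concrete reading of the diagram, here is a hedged sketch of the compositionality check (the layer cut, nearest-neighbor mask projection, and MSE discrepancy are illustrative choices, not necessarily the paper's exact formulation): compare the feature map of the masked image with the projected-mask-weighted feature map of the full image.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

cnn = vgg16().features.eval()   # VGG16 convolutional trunk (load pretrained weights in practice)

def compositionality_gap(image, mask, num_layers=17):
    """image: (1, 3, H, W); mask: (1, 1, H, W) binary mask of the visual entity.
    Returns a scalar discrepancy; 0 would mean the feature map is perfectly compositional."""
    trunk = cnn[:num_layers]                               # an intermediate conv layer (illustrative cut)
    with torch.no_grad():
        feat_full = trunk(image)                           # feature map of the input image
        feat_of_masked_image = trunk(image * mask)         # feature map of the masked image
    proj_mask = F.interpolate(mask, size=feat_full.shape[-2:], mode="nearest")
    masked_feature_map = feat_full * proj_mask             # masked feature map
    return F.mse_loss(feat_of_masked_image, masked_feature_map)

# Shape-only example with random data:
img = torch.rand(1, 3, 224, 224)
msk = (torch.rand(1, 1, 224, 224) > 0.5).float()
print(float(compositionality_gap(img, msk)))
```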

SLIDE 41
  • CNN (VGG16, K. Simonyan and A. Zisserman, 2015)
  • Figure: input frames; feature map on a high conv layer; activation difference (from that of an isolated plane) in the plane region.

SLIDE 42
  • Outline
  • Vicarious AI research overview
  • Schema networks (ICML ’17)
  • Teaching compositionality to CNNs (CVPR ’17)
SLIDE 43
  • Motivations
  • Strong inductive bias that leads to data efficiency.
  • Robust to re-combination and less prone to focusing on discriminative but irrelevant background features.
  • In line with findings from neuroscience that suggest separate processing of figure and ground regions in the visual cortex.

SLIDE 44
  • Teaching compositionality to CNNs
SLIDE 45
  • Teaching compositionality to CNNs
SLIDE 46
  • Training objective

cost = classification costs + compositionality cost
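A hedged sketch of how such a combined cost could be wired up in PyTorch (the weighting lambda, the per-layer MSE penalty, and the choice to classify both the masked and unmasked inputs are illustrative assumptions, not necessarily the paper's exact objective):

```python
import torch
import torch.nn.functional as F

def combined_cost(conv_blocks, classifier, image, mask, label, lam=1.0):
    """cost = classification costs + lambda * compositionality cost (illustrative formulation).
    conv_blocks: list of nn.Modules applied in sequence; classifier maps final features to logits."""
    comp_cost = image.new_zeros(())
    feat_full, feat_masked = image, image * mask
    for block in conv_blocks:
        feat_full, feat_masked = block(feat_full), block(feat_masked)
        proj = F.interpolate(mask, size=feat_full.shape[-2:], mode="nearest")
        # Penalize mismatch between the feature map of the masked image and the
        # projected-mask-weighted feature map of the full image.
        comp_cost = comp_cost + F.mse_loss(feat_masked, feat_full * proj)
    cls_cost = F.cross_entropy(classifier(feat_full), label) + \
               F.cross_entropy(classifier(feat_masked), label)
    return cls_cost + lam * comp_cost
```

Additionally penalizing activations outside the projected mask would drive background activations toward zero, which is the distinction between the COMP-FULL and COMP-OBJ-ONLY variants on the next slide.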

SLIDE 47
  • Compare object recognition accuracy of the following methods.
  • Variants of our method:
    • COMP-FULL: also penalizing activations in the background
    • COMP-OBJ-ONLY: not penalizing activations in the background
    • COMP-NO-MASK: not applying masks to activation maps
  • Baselines:
    • BASELINE: training a CNN with unmasked inputs only
    • BASELINE-AUG: using masked + unmasked inputs of the same object
    • BASELINE-REG: dropout + L2 regularization
    • BASELINE-AUG-REG: combining the above two

SLIDE 48
  • Rendered single object on random background
  • Charts: test on seen instances vs. unseen instances (blue: variants of our method; red: baselines)
  • 12 classes
  • ~20 3D models per class
  • 50 viewpoints
  • sampled 1,600 images, 80% for training
SLIDE 49
  • Rendered multiple objects on random background
  • Charts: seen instances vs. unseen instances (blue: variants of our method; red: baselines)
  • 12 classes
  • ~20 3D models per class
  • 50 viewpoints
  • sampled 800 images, 80% for training
SLIDE 50
  • MNIST digits with clutter
  • Charts: single digit vs. multiple digits (blue: variants of our method; red: baselines)

SLIDE 51
  • MS-COCO subset (blue: variants of our method; red: baselines)
  • 20 classes
  • filtered for object instances with at least 7,000 pixels
  • 22,476 training images
  • 12,254 test images
SLIDE 52
SLIDE 53
  • Figure: inputs; without compositionality; with compositionality

SLIDE 54
  • Backtracing and localization
  • Figure columns: input, comp-full (ours), baseline-aug, VGG16
  • Backtracing the classification decision to the input image using guided back-propagation (Springenberg et al., ICLR-WS 2015)

SLIDE 55

Backtracing classification decision to input image using guided back-propagation (Springenberg et al ICLR-WS 2015)

SLIDE 56

Backtracing classification decision to input image using guided back-propagation (Springenberg et al ICLR-WS 2015)

SLIDE 57
  • Quantitative measure
  • Percentage of the “mass” of the back-trace heat-map that falls inside the ground-truth mask of the back-traced category (a sketch of this measure follows below).
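One way to compute this measure (a sketch; clipping negative heat-map values and the normalization are my assumptions):

```python
import numpy as np

def mass_inside_mask(heatmap, gt_mask):
    """Fraction of the back-trace heat-map's total 'mass' that falls inside the
    ground-truth mask of the back-traced category. Both arrays have shape (H, W)."""
    heat = np.maximum(heatmap, 0.0)        # assume only positive evidence counts as mass
    total = heat.sum()
    if total == 0:
        return 0.0
    return float(heat[gt_mask.astype(bool)].sum() / total)

# Toy example: three quarters of the mass lies inside the mask.
heat = np.array([[1.0, 1.0], [0.0, 2.0]])
mask = np.array([[1, 0], [0, 1]])
print(mass_inside_mask(heat, mask))  # 0.75
```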

SLIDE 58
  • Quantitative measure
  • Percentage of the “mass” of the back-trace heat-map inside the ground-truth mask of the back-traced category.
  • Comp-Full and Baseline-Aug were trained on 22k images; VGG was trained on 1.2 million images.
  • Chart annotations: same training data, same training objective.
SLIDE 59
  • For in-context and out-of-context objects, the improvements are both significant (and similar).

SLIDE 60
  • Future work
  • Explicit context modeling in a compositional representation
  • Re-combination of learned representations and learning-to-learn
  • ……
SLIDE 61
  • Acknowledgements
  • Austin Stone, Tom Silver, Ken Kansky, David A. Mely, Szymon Sidor, John Bauer, Michael Stark, Robert Hafner, Yi Liu, Miguel Lazaro Gredilla, Xinghua Lou, Nimrod Dorfman, Mohamed Eldawy, D. Scott Phoenix, Dileep George, Eric Purdy

SLIDE 62
  • check out our research blog at vicarious.com

Thank you!