SLIDE 1

Probing Emergent Semantics in Predictive Agents via Question Answering

Abhishek Das*¹, Fede Carnevale*, Hamza Merzic, Laura Rimell, Rosalia Schneider, Josh Abramson, Alden Hung, Arjun Ahuja, Stephen Clark, Greg Wayne, Felix Hill

* Denotes equal contribution.
¹ Now at Facebook AI Research. Work done during an internship at DeepMind.

Link to slides with playable videos: bit.ly/3iKYJd3

SLIDE 2

Self-supervised representation learning

Across language, vision, and reinforcement learning:

  • Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv:1810.04805 (2018).
  • Pathak, Deepak, et al. "Context encoders: Feature learning by inpainting." CVPR 2016.
  • Pathak, Deepak, et al. "Learning features by watching objects move." CVPR 2017.
  • Wayne, Greg, et al. "Unsupervised predictive memory in a goal-directed agent." arXiv:1803.10760 (2018).
  • Ha, David, and Jürgen Schmidhuber. "World models." arXiv:1803.10122 (2018).
  • http://jalammar.github.io/illustrated-bert

SLIDE 3

How much objective knowledge about the external world can be learned through egocentric prediction?

SLIDE 4

Question-answering (in English) as an evaluation tool for investigating how much environment knowledge is encoded in an agent's internal representation:

  • Intuitive: simply ask an agent what it knows about its world and get an answer back
  • Open-ended: pose arbitrarily complex questions to an agent

SLIDE 5

Environment

SLIDE 6

Environment
  • Unity-based; runs at 30 fps
  • 96 x 72 RGB first-person view
  • 50 object types, 10 colors, 3 sizes

Agent
  • First-person view
  • 8-D action space: Move-{forward, backward, left, right}, Look-{up, down, left, right}
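The 8-D discrete action space listed above can be sketched as a simple lookup table; the action names follow the slide, while the encoding helper and its name are illustrative, not the environment's actual API:

```python
# Illustrative sketch of the 8-D discrete action space from the slide.
ACTIONS = [
    "move_forward", "move_backward", "move_left", "move_right",
    "look_up", "look_down", "look_left", "look_right",
]

def action_to_onehot(name):
    """Encode an action name as an 8-D one-hot vector."""
    vec = [0.0] * len(ACTIONS)
    vec[ACTIONS.index(name)] = 1.0
    return vec
```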

SLIDE 7

Training task: Exploration
  • +1 reward for an unvisited object
  • 0 reward for a visited object
  • Rewards refresh once all objects are visited

Top-down view shown for illustration purposes. The agent only has access to first-person observations.
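The exploration reward rule on this slide can be sketched in a few lines; the class name and object identifiers are illustrative, not the environment's actual interface:

```python
class ExplorationReward:
    """Toy sketch of the exploration reward: +1 for the first visit to an
    object, 0 for revisits; once every object has been visited, the visited
    set resets so all objects become rewarding again."""

    def __init__(self, object_ids):
        self.object_ids = set(object_ids)
        self.visited = set()

    def step(self, obj):
        if obj in self.visited:
            return 0                     # already visited: no reward
        self.visited.add(obj)
        if self.visited == self.object_ids:
            self.visited = set()         # rewards refresh once all visited
        return 1                         # first visit: +1
```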

SLIDE 8

Training task: Exploration (continued from the previous slide).

SLIDE 9

Evaluation probe: Question-answering

What is the color of the bed?
How many wardrobes are there?
What is the object near the bed?
Is there a basketball in the room?
...

Top-down view shown for illustration purposes. The agent only has access to first-person observations.

SLIDE 10

Evaluation probe: Question-answering

Questions are programmatically generated, in a manner similar to CLEVR (Johnson et al., 2017).
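A minimal sketch of CLEVR-style programmatic question generation, assuming simple string templates; the templates and attribute lists here are illustrative and not the paper's exact set:

```python
import random

# Illustrative object and template sets (not the paper's full vocabulary).
SHAPES = ["bed", "wardrobe", "basketball", "grinder"]

TEMPLATES = [
    "What is the color of the {shape}?",
    "How many {shape}s are there?",
    "Is there a {shape} in the room?",
]

def generate_question(rng):
    """Instantiate a random template with a random shape."""
    template = rng.choice(TEMPLATES)
    return template.format(shape=rng.choice(SHAPES))

rng = random.Random(0)
print(generate_question(rng))
```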

SLIDE 11

Evaluation probe: Question-answering

Gradients from question-answering are not backpropagated into the agent.

Top-down view shown for illustration purposes. The agent only has access to first-person observations.
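The stop-gradient can be illustrated with hand-computed scalar gradients: the QA loss updates the decoder, but contributes nothing to the agent's parameters. The toy linear model and all names here are illustrative, not the actual architecture:

```python
# Hand-derived gradients on a toy scalar model, to illustrate the
# stop-gradient between the agent and the QA decoder.

def grads(w_agent, w_decoder, obs, target, stop_gradient=True):
    """Return (d_loss/d_w_agent, d_loss/d_w_decoder) for
    loss = (w_decoder * (w_agent * obs) - target)**2."""
    state = w_agent * obs                # agent's internal representation
    pred = w_decoder * state             # QA decoder reads the state
    d_pred = 2 * (pred - target)
    d_decoder = d_pred * state           # the decoder always trains
    # With the stop-gradient, the QA loss is blocked from the agent.
    d_agent = 0.0 if stop_gradient else d_pred * w_decoder * obs
    return d_agent, d_decoder
```

In frameworks this is a one-liner (e.g. detaching the state before feeding the decoder); the point is that probing accuracy then reflects what the agent learned for its own objectives, not what the probe taught it.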

SLIDE 12

Setup

(i) During training, the agent explores and learns to build representations from egocentric observations.
(ii) During evaluation, we probe the agent's internal representations on a question-answering task:

How many wardrobes are there?
What is the color of the bed?
What is the object near the bed?
Is there a basketball in the room?
...

Top-down view shown for illustration purposes. The agent only has access to first-person observations.

SLIDES 13–24

Agent architecture

(Progressive build-up of the architecture diagram across these slides.)

  • Action-conditioned forward prediction
  • Multiple steps into the future
  • Self-supervised

Gradients from the question-answering decoder are not backpropagated into the agent.

SLIDES 25–29

Baselines and Oracle

Baselines
  • Question-only: no vision
  • LSTM: no auxiliary predictive loss

Predictive losses
  • CPC|A (Guo et al., 2018)
  • SimCore (Gregor et al., 2019)

Oracle
  • No SG: QA decoder without stop-gradient, similar to Embodied / Interactive Question Answering (Das et al., 2018; Gordon et al., 2018)

SLIDE 30

Results: shape questions

[Plot: QA accuracy over training steps for SimCore, Question-only, Oracle, CPC|A, and LSTM.]

SLIDE 31

Results: overall

Top-1 QA accuracy: 29% / 31% / 32% (Question-only, LSTM, CPC|A baselines) vs. 60% (SimCore) and 63% (Oracle, No SG).

SLIDE 32

Results

SLIDE 33

Q: What is the aquamarine object? A: Grinder

[Video: top-3 answer predictions and the answer probability P("Grinder") over the episode.]
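Reading off the "top-3 answer predictions" shown in these videos from a distribution over the answer vocabulary can be sketched as follows; the probabilities below are made up for illustration:

```python
# Sketch: extract the k highest-probability answers from a distribution
# over the answer vocabulary (values here are illustrative).

def top_k_answers(probs, k=3):
    """Return the k highest-probability (answer, prob) pairs."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

probs = {"grinder": 0.71, "bed": 0.12, "chair": 0.09, "ball": 0.08}
print(top_k_answers(probs))   # "grinder" ranks first
```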

SLIDE 34

Q: How many blue objects are there? A: One

[Video: top-3 answer predictions and the answer probabilities P("One"), P("Three") over the episode.]

SLIDE 35

Q: How many yellow objects are there? A: Four

[Video: top-3 answer predictions and the answer probabilities P("Four"), P("Two"), P("Three") over the episode.]

SLIDE 36

Compositional generalization

Train-test split over (color, shape) combinations: colors {blue, red, purple, green} × shapes {ball, grinder, chair, bed}, with some combinations held out of training.

Question → SimCore's answer:
Seen: What shape is the blue object? → Bed
Seen: What shape is the green object? → Ball
Unseen: What shape is the green object? → Bed
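The compositional train-test split can be sketched as holding out attribute combinations whose individual attributes are each seen in training; the held-out pair mirrors the slide's unseen "green bed" example:

```python
import itertools

# Sketch of a compositional split: every color and every shape appears in
# training, but the held-out (color, shape) pair does not.
COLORS = ["blue", "red", "purple", "green"]
SHAPES = ["ball", "grinder", "chair", "bed"]
HELD_OUT = {("green", "bed")}            # illustrative held-out pair

def split_pairs():
    """Partition all (color, shape) pairs into train and test sets."""
    all_pairs = set(itertools.product(COLORS, SHAPES))
    return all_pairs - HELD_OUT, HELD_OUT
```

Answering correctly on the held-out pair requires composing "green" and "bed" representations never seen together during decoder training.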

SLIDES 37–39

Top-down map prediction

SLIDE 40

Conclusions

  • Question-answering to probe internal representations, enabling evaluation of agents using natural linguistic interactions.
  • Self-supervised predictive agents, such as SimCore, capture decodable knowledge about the environment, while non-predictive agents and CPC|A don't.
  • Generalization of the decoder suggests some degree of compositionality in internal representations.
  • Paper: arxiv.org/abs/2006.01016