Geometry-Aware Deep Visual Learning, Katerina Fragkiadaki (PowerPoint presentation transcript)


SLIDE 1

Geometry-Aware Deep Visual Learning

Katerina Fragkiadaki

SLIDE 2

How this talk fits the workshop

  • We will discuss new neural architectures for video understanding and feature learning without human annotations
  • We will still use SGD to train the models
SLIDE 3

What is the goal of computer vision?

SLIDE 4

label image pixels, detect and segment objects

Image from Bruno Olshausen

SLIDE 5

  • K. He et al., Mask R-CNN, 2017

label image pixels, detect and segment objects

SLIDE 6

Registration against known HD maps, 3D object detection, 3D motion forecasting

SLIDE 7

Image Understanding as Inverse Graphics

SLIDE 8

A reasonable answer: the goal of computer vision is task specific

SLIDE 9

SLIDE 10

Internet Vision: photos taken by people (and uploaded on the Internet)
Mobile (Embodied) Computer Vision: photos taken by a NAO robot during a robot soccer game

Our detectors may not work very well here…

SLIDE 11

SLIDE 12

Internet Vision: photos taken by people (and uploaded on the Internet)
Mobile (Embodied) Computer Vision: photos taken by a NAO robot during a robot soccer game

Our detectors may not work very well here… Do we have more suitable models for this domain?

SLIDE 13

Why Embodied Computer Vision Matters

1. Agents that move around in the world, perceive it, and accomplish tasks are (close to) the goal of AI research
2. It may be the key towards unsupervised visual feature learning

“We must perceive in order to move, but we must also move in order to perceive” (J.J. Gibson)

The Ecological Approach to Visual Perception, Gibson, 1979

SLIDE 14

Internet and Mobile Perception have developed independently and have each made great progress

  • Internet vision has trained great DeepNets for image labelling and object detection + segmentation
  • Mobile computer vision has produced great SLAM (Simultaneous Localization and Mapping) methods

SLIDE 15

Image Understanding as Inverse Graphics

Should we be engineering a different model for every domain?

SLIDE 16

Blocks world

Larry Roberts: input image → image gradient → computed 3D model rendered from a new viewpoint

Machine Perception of Three-Dimensional Solids, MIT, 1965

Image Understanding as Inverse Graphics

SLIDE 17

David Marr 1982

Image Understanding as Inverse Graphics

SLIDE 18

“Intelligence without reason”, IJCAI, Rodney Brooks (1991)

3D models are impossible and unnecessary

“Internal world models which are complete representations of the external environment, besides being impossible to obtain, are not at all necessary for agents to act in a competent manner.”

“…(1) eventually computer vision will catch up and provide such world models — I don't believe this based on the biological evidence presented below, or (2) complete objective models of reality are unrealistic and hence the methods of Artificial Intelligence that rely on such models are unrealistic.”


SLIDE 19

25 years later

iRobot vacuum cleaner is building a map!

(Rodney Brooks co-founded iRobot in 1990)

SLIDE 20

To 3D or not to 3D?

SLIDE 21

And if to 3D, what 3D representation to use?

depth map, surface normals, 3D mesh, 3D point cloud, 3D voxel occupancy

SLIDE 22

H × W × D × C

3 spatial dimensions, 1 feature dimension

This talk: To 3D using 3D feature tensors
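To make the H × W × D × C shape concrete, a minimal sketch (sizes are invented for illustration, not taken from the talk):

```python
import numpy as np

# A 4D scene feature tensor with 3 spatial dimensions (height, width, depth)
# and 1 feature dimension (channels), i.e. H x W x D x C as on the slide.
H, W, D, C = 32, 32, 32, 16
feature_volume = np.zeros((H, W, D, C), dtype=np.float32)

# Every spatial cell (i, j, k) stores a C-dimensional feature vector:
# a feature-valued voxel grid of the scene rather than a raw point cloud.
cell_feature = feature_volume[0, 0, 0]
```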

SLIDE 23

SLIDE 24

SLIDE 25

SLIDE 26

Geometry-Aware Recurrent Networks

1. Hidden state: a 4D deep feature tensor, akin to a 3D map of the scene (a feature map, as opposed to a point cloud)
2. Egomotion-stabilized hidden state updates

SLIDE 27

2D recurrent networks: LSTMs, ConvLSTMs, …

[Diagram: a CNN encodes each frame and updates a 2D hidden state h_t → h_t+1 → h_t+2 over time T]
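For contrast with the 3D memory introduced next, a plain 2D recurrent update over per-frame CNN features can be sketched as below. This is a toy stand-in: per-pixel channel mixing (1×1 "convolutions") replaces real ConvLSTM convolutions and gates, and all names and sizes are invented.

```python
import numpy as np

def convrnn_step(h, x, w_h, w_x):
    """One step of a plain 2D recurrent update (toy sketch): the hidden
    state h is an H x W x C feature map mixed with the current frame's
    CNN features x, squashed through tanh."""
    return np.tanh(h @ w_h + x @ w_x)

rng = np.random.default_rng(0)
H, W, C = 8, 8, 4
h = np.zeros((H, W, C))                    # 2D hidden state
w_h = rng.normal(scale=0.1, size=(C, C))   # recurrent channel mixing
w_x = rng.normal(scale=0.1, size=(C, C))   # input channel mixing
for _ in range(3):                         # unroll over 3 frames
    x = rng.random((H, W, C))              # CNN features of the frame
    h = convrnn_step(h, x, w_h, w_x)
```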

SLIDE 28

[Diagram: a CNN encodes each frame into a 4D latent state h_t, updated under the camera egomotion (R, t)]

SLIDE 29

[Diagram: the same recurrence unrolled one step: h_t → h_t+1, each update stabilized by the egomotion (R, t)]

SLIDE 30

[Diagram: the recurrence unrolled over time T: h_t → h_t+1 → h_t+2, every update stabilized by the egomotion (R, t)]

SLIDE 31

Geometry-Aware Recurrent Networks (GRNNs)

H × W × D × C

SLIDE 32

SLIDE 33
  • A set of differentiable neural modules that learn to go from 2D to 3D and back
  • Many SLAM ideas are built into the neural modules

GRNNs

SLIDE 34

Unprojection (2D to 3D)
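A minimal sketch of the unprojection idea, under a strong simplification: each pixel's feature is replicated along its viewing ray, and the ray is taken to be axis-aligned (orthographic) rather than sampled along perspective rays with the camera intrinsics as the actual module would. Names and sizes are illustrative.

```python
import numpy as np

def unproject(feat2d, depth_bins):
    """Lift an H x W x C feature map into an H x W x D x C volume by
    replicating each pixel's feature along its (assumed axis-aligned)
    viewing ray. The true module samples along perspective rays using
    the camera intrinsics."""
    H, W, C = feat2d.shape
    return np.broadcast_to(feat2d[:, :, None, :], (H, W, depth_bins, C)).copy()

feat2d = np.random.rand(8, 8, 4).astype(np.float32)
volume = unproject(feat2d, depth_bins=16)   # H x W x D x C
```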

SLIDE 35

SLIDE 36

SLIDE 37

SLIDE 38

SLIDE 39

Rotation (azimuth, elevation)
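The rotation step can be illustrated with a 90° azimuth rotation of the feature volume; arbitrary azimuth/elevation changes would instead trilinearly resample the grid. A hypothetical sketch:

```python
import numpy as np

def rotate_volume_90(volume):
    """Rotate an H x W x D x C feature volume by 90 degrees in azimuth,
    i.e. within the width-depth plane (axes 1 and 2). Illustrative only:
    general rotations require trilinear resampling of the grid."""
    return np.rot90(volume, k=1, axes=(1, 2))

vol = np.random.rand(4, 4, 4, 2)
rot = rotate_volume_90(vol)
```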

SLIDE 40

Egomotion-stabilized memory update

[Diagram: each view is unprojected, rotated by the relative rotation R, and fused into the 3D feature memory via cross convolution]
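A toy sketch of the fusion step, assuming the incoming view has already been unprojected and rotated into the memory's canonical frame. A running average stands in for the learned recurrent update used in the actual model:

```python
import numpy as np

def update_memory(memory, view_volume, num_views):
    """Fuse a new view into the 3D feature memory with an incremental
    mean. view_volume is assumed already unprojected and rotated into
    the memory's canonical frame; the real GRNN uses a learned update
    rather than plain averaging."""
    return memory + (view_volume - memory) / num_views

memory = np.zeros((8, 8, 8, 4))             # egomotion-stabilized memory
for t in range(1, 4):                       # three incoming views
    view = np.random.rand(8, 8, 8, 4)       # unprojected + rotated features
    memory = update_memory(memory, view, t)
```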

SLIDE 41

Hidden state update

[Diagram: the egomotion-stabilized memory update h_t → h_t+1: unproject the new view, rotate by the inverse egomotion −R, and fuse into the hidden state]

SLIDE 42

Projection (3D to 2D)
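A minimal sketch of projection, collapsing the depth axis d by max-pooling; the actual module can instead learn a weighting along each ray. Names and sizes are illustrative:

```python
import numpy as np

def project(volume):
    """Project an H x W x D x C feature volume to an H x W x C map by
    max-pooling along the depth axis (a hard stand-in for a learned
    weighting along each ray)."""
    return volume.max(axis=2)

vol = np.random.rand(8, 8, 16, 4)
feat2d = project(vol)   # H x W x C
```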

SLIDE 43

SLIDE 44

SLIDE 45

SLIDE 46

SLIDE 47

Training GRNNs

1. Self-supervised, via predicting the images the agent will see under novel viewpoints
2. Supervised, for 3D object detection
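A toy sketch of the self-supervised objective (option 1): rotate the 3D memory to the query viewpoint, project it to 2D, decode, and penalize the difference from the image actually seen there. The decoder is omitted (identity), and the rotation is a 90° grid rotation; both are illustrative simplifications, not the talk's actual modules.

```python
import numpy as np

def view_prediction_loss(memory, query_rot_k, target):
    """Toy view-prediction loss: rotate the 3D feature memory to the
    query view (here only 90-degree azimuth steps), project to 2D by
    max-pooling over depth, and compare to the target with L2.
    A learned image generator would replace the identity 'decoder'."""
    rotated = np.rot90(memory, k=query_rot_k, axes=(1, 2))  # rotate to query view
    pred = rotated.max(axis=2)                              # project 3D -> 2D
    return float(np.mean((pred - target) ** 2))             # self-supervised loss

memory = np.random.rand(8, 8, 8, 3)   # H x W x D x C feature memory
target = np.random.rand(8, 8, 3)      # image seen from the query view
loss = view_prediction_loss(memory, query_rot_k=1, target=target)
```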

SLIDE 48

View prediction

[Diagram: rotate the 3D memory to the query view, project to 2D, and decode with an image generator]

SLIDE 49

[Qualitative view-prediction results from 3 input views: GRNN vs. 2D RNN [1]]

[1] Neural scene representation and rendering, DeepMind, Science, 2018

SLIDE 50

[Qualitative view-prediction results from 3 input views: GRNN vs. 2D RNN [1]]

[1] Neural scene representation and rendering, DeepMind, Science, 2018

Testing on scenes with more objects than at training time

SLIDE 51

View prediction

[Results: geometry-aware RNN vs. 2D RNN [1]]

SLIDE 52

3D Object Detection

A 3D version of Mask R-CNN, with a region proposal network (RPN)

SLIDE 53

Results - 3D object detection

SLIDE 54

3D object detection

[Figure: input views, predicted boxes, predicted segmentations; front view and bird's-eye view; ground truth vs. prediction]

Object detections learn to persist in time; they do not switch on and off from frame to frame

SLIDE 55
  • Differentiable SLAM for better space-aware deep feature learning
  • A generative model of scenes with a 3D bottleneck when trained from view prediction
  • Generalize better than 2D models

GRNNs

SLIDE 56
  • Use GRNNs for tracking, dynamics learning, and as a perceptual front-end for RL and robotic learning

What’s next?

SLIDE 57

Ziyan Wang, Ricson Chen, Fish Tung

Thank you!

  • Learning Spatial Common Sense with Geometry-Aware Recurrent Networks, F. Tung, R. Cheng, K. Fragkiadaki, arXiv
  • Geometry-Aware Recurrent Neural Networks for Active Visual Recognition, R. Cheng, Z. Wang, K. Fragkiadaki, NeurIPS 2018