SLIDE 1 3D Attention-Driven Depth Acquisition for Object Identification
Kai Xu, Yifei Shi, Lintao Zheng, Junyu Zhang, Min Liu, Hui Huang, Hao Su, Daniel Cohen-Or and Baoquan Chen
National University of Defense Technology, Shandong University, Shenzhen University, SIAT, Stanford University, Tel-Aviv University
SLIDE 2 Background & motivation
- Robotic indoor scene modeling
- Perception of objects
SLIDE 3 Background & motivation
- Indoor environment acquisition and modeling
Dense Reconstruction [Nießner et al. 2013] | Object Extraction [Xu et al. 2015]
SLIDE 4
Background & motivation
What are these objects?
SLIDE 5
SLIDE 6
Active object recognition
SLIDE 7
Active object recognition
SLIDE 8 Problem setting
- A robot actively acquires new observations to gradually increase the confidence of object recognition
- Object classification: estimate the object class based on the observations acquired so far
- View planning: predict the next-best view (NBV) to maximize its information gain
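As a concrete illustration of "information gain" for view planning, one common choice (an assumption here, not necessarily the paper's exact definition) is the reduction in entropy of the class belief after a new observation:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete class belief."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return float(-np.sum(p * np.log(p)))

def information_gain(belief_before, belief_after):
    """How much a new observation reduced the class uncertainty."""
    return entropy(belief_before) - entropy(belief_after)

# A view that makes the belief more peaked yields a large gain.
before = [0.25, 0.25, 0.25, 0.25]   # fully uncertain over 4 classes
after  = [0.85, 0.05, 0.05, 0.05]   # mostly confident after observing
gain = information_gain(before, after)
```

The NBV is then the candidate view whose (expected) gain is largest.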
SLIDE 9 The main challenge
Observation is partial and progressive
- Shape description and matching with partial data is hard
- Observations come from varying views
SLIDE 10 The main challenge
Observation is partial and progressive
[Diagram: one observed view vs. several unobserved candidate views, each marked "?"]
How can you know which view is better without knowing its observation?
SLIDE 11 The main challenge
Real indoor scenes are often cluttered
- Clutter degrades recognition accuracy
- Clutter invalidates the offline-learned viewing policy
SLIDE 12
Related work
SLIDE 13 Related work
Online scene analysis and modeling
Plane/Object Extraction [Zhang et al. 2014] SemanticPaint [Valentin et al. 2015]
SLIDE 14 Related work
Active reconstruction and recognition
Next-best-view for reconstruction [Wu et al. 2014] Next-best-view for recognition [Wu et al. 2015]
SLIDE 15
Method
SLIDE 16
The general framework
SLIDE 17
The general framework
[Diagram: a closed loop driven by the goal. The agent observes, recognition updates its belief, and view planning selects the next action.]
SLIDE 18 An attentional formulation
"Humans focus attention selectively on parts of the visual space to acquire information when and where it is needed, and combine information from different fixations over time to build up an internal representation of the scene" – Ronald Rensink
Handwriting recognition [Mnih et al. 2014] | Image caption generation [Xu et al. 2015]
Internal representation
SLIDE 19 Recurrent Attention Model
- Recurrent Neural Networks (RNN)
[Diagram: RNN unrolled over time. Hidden states h_{t-1}, h_t, h_{t+1} are connected by W_hh; input x_t enters via W_xh; output y_t is emitted via W_hy.]
Aggregate information
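The unrolled recurrence above can be sketched minimally; the layer sizes and random weights below are illustrative, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_h, D_out = 8, 16, 4                      # illustrative sizes
W_xh = rng.standard_normal((D_h, D_in)) * 0.1    # input-to-hidden
W_hh = rng.standard_normal((D_h, D_h)) * 0.1     # hidden-to-hidden (the recurrence)
W_hy = rng.standard_normal((D_out, D_h)) * 0.1   # hidden-to-output

def rnn_step(h_prev, x_t):
    """One unrolled step: fold the new input into the hidden state."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t          # per-step output (e.g. class scores)
    return h_t, y_t

h = np.zeros(D_h)                      # initial internal state
for t in range(3):                     # three successive observations
    x = rng.standard_normal(D_in)      # stand-in for one view's feature vector
    h, y = rnn_step(h, x)              # h aggregates information over time
```

The hidden state h is the "internal representation" that accumulates evidence across fixations.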
SLIDE 20
View-based observation
[Diagram: depth images d^(0), ..., d^(t) acquired from sampled views v^(0), ..., v^(t)]
SLIDE 21 3D Recurrent Attention Model
[Diagram: the model unrolled over time, starting from an initial view. At each step t, the depth image d^(t-1) acquired from view v^(t-1) goes through feature extraction into hidden layer h_1^(t), which aggregates views and feeds a classifier; a second hidden layer h_2^(t) performs NBV emission, selecting the next view v^(t). The loop repeats: view selection, view aggregation, classification.]
SLIDE 22 3D Recurrent Attention Model
- Feature extraction uses a Multi-View CNN [Su et al. 2015]: per-view features from CNN1 are combined by view pooling (element-wise max-pooling) and passed to CNN2
[Diagram: the same unrolled model as before, with the multi-view CNN as the feature-extraction module at each step]
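View pooling in the Multi-View CNN is an element-wise max over the per-view feature vectors; a minimal sketch (the 3-view feature matrix is made up for illustration):

```python
import numpy as np

def view_pooling(per_view_features):
    """Element-wise max over views -> one shape descriptor (MVCNN-style)."""
    return np.max(per_view_features, axis=0)

# 3 views, each already encoded by the first CNN stage into a 5-D feature.
feats = np.array([
    [0.1, 0.9, 0.2, 0.0, 0.3],
    [0.4, 0.1, 0.8, 0.2, 0.1],
    [0.0, 0.3, 0.1, 0.7, 0.2],
])
pooled = view_pooling(feats)
```

Max-pooling makes the descriptor invariant to the order of views and to how many views have been seen so far, which suits a progressively growing set of observations.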
SLIDE 23 Network training
- The CNN (feature extraction and classification) is trained with back-propagation
- NBV emission and depth rendering are non-differentiable, so the view-planning pathway is trained with reinforcement learning
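A REINFORCE-style policy gradient is the standard way to train through a non-differentiable sampling step; the toy 4-view bandit and its rewards below are assumptions for illustration, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max()                    # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy policy over 4 candidate views, parameterized by logits theta.
theta = np.zeros(4)
lr = 0.5
view_rewards = np.array([0.1, 0.9, 0.2, 0.3])  # hypothetical per-view rewards

for _ in range(200):
    probs = softmax(theta)
    a = rng.choice(4, p=probs)         # sampling a view: the non-differentiable step
    r = view_rewards[a]
    # REINFORCE: grad of log pi(a) w.r.t. the logits is one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += lr * r * grad_log_pi      # push logits toward high-reward views

best = int(np.argmax(theta))           # index of the view the policy favors
```

Scaling the log-probability gradient by the reward is what lets the classifier-style back-propagation coexist with a learned, sampled view policy.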
SLIDE 24 Reinforcement learning
[Diagram: agent-environment loop. The agent's action is a depth acquisition; the environment returns a state and a reward measuring how good the acquired depth is; the loop continues until a stop criterion is met.]
SLIDE 25 Reward

r_t = I_t(q_t, q*) + J_t(q_t, q_{t-1}) - D_t

I_t: prediction accuracy | J_t: information gain | D_t: movement cost
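A sketch of a reward with this shape; the concrete choices for the three terms (0/1 accuracy, entropy reduction between successive beliefs, distance-proportional cost) are illustrative assumptions, not the paper's exact definitions:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete class belief."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def reward(q_t, q_prev, true_class, moved_distance, w_move=0.1):
    """r_t = I_t (accuracy) + J_t (information gain) - D_t (movement cost)."""
    I_t = 1.0 if int(np.argmax(q_t)) == true_class else 0.0   # prediction accuracy
    J_t = entropy(q_prev) - entropy(q_t)                      # info gain between beliefs
    D_t = w_move * moved_distance                             # movement cost
    return I_t + J_t - D_t

# A correct, confidence-increasing view earns a positive reward.
q_prev = [0.25, 0.25, 0.25, 0.25]
q_t    = [0.7, 0.1, 0.1, 0.1]
r = reward(q_t, q_prev, true_class=0, moved_distance=1.0)
```

The movement penalty discourages the policy from making long camera moves for marginal gains in confidence.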
SLIDE 26 Part-level attention
How to distinguish these two chairs?
Informative parts
SLIDE 27 Attention extraction
[Diagram: mid-level kernels in the Convolutional Neural Network act as part detectors]
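One simple way to turn mid-level kernel responses into a spatial part-attention map is to take the strongest response per location and normalize (an illustrative aggregation, not necessarily the paper's exact scheme):

```python
import numpy as np

def part_attention(feature_maps):
    """Aggregate mid-level feature maps into one spatial attention map.

    feature_maps: (K, H, W) responses of K convolutional kernels.
    Returns an (H, W) map scaled to [0, 1]; locations where some kernel
    fires strongly are treated as informative parts.
    """
    att = np.max(feature_maps, axis=0)   # strongest kernel response per location
    att = att - att.min()
    if att.max() > 0:
        att = att / att.max()
    return att

# Toy maps: one kernel fires strongly on a distinctive part.
maps = np.zeros((3, 4, 4))
maps[0, 1, 2] = 2.0
maps[1, 3, 0] = 0.5
att = part_attention(maps)
```

Such a map can highlight discriminative regions (e.g. the chair parts that separate two visually similar classes).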
SLIDE 28
Attention extraction
One wing Two wings
SLIDE 29
Results and evaluation
SLIDE 30 Database
ShapeNet:   57,452 models, 57 categories; rendered per model from 52 sampled views
ModelNet40: 12,311 models, 40 categories; rendered per model from 260 sampled views, with jittering
SLIDE 31
Timing

Database    MV-RNN train   MV-RNN test
ShapeNet    49 hr.         0.1 sec.
ModelNet40  22 hr.         0.1 sec.
SLIDE 32
Visualization of attentions
Part-level attention | View sequence
SLIDE 33 NBV estimation
[Plot: classification accuracy over 40 classes]
SLIDE 34
NBV estimation under occlusion
[Plot: classification accuracy under occlusion]
SLIDE 35
Results on real scenes
SLIDE 36
Results on real scenes
SLIDE 37
Results on real scenes
SLIDE 38 Limitations
- Handles only recognizable objects (those covered by the training data)
- Does not exploit contextual information
SLIDE 39
Future work: Multi-modal recognition
What is this?
Image database Shape database
SLIDE 40 Future: Multi-robot scene reconstruction & understanding
Turtlebot PR2 AscTec Pelican
SLIDE 41 Future: Multi-robot attention model
Attention based on shared internal representation?
SLIDE 42
Thank you Q & A
More details: kevinkaixu.net & yifeishi.net