SLIDE 1 3D Attention-Driven Depth Acquisition for Object Identification
Kai Xu, Yifei Shi, Lintao Zheng, Junyu Zhang, Min Liu, Hui Huang, Hao Su, Daniel Cohen-Or and Baoquan Chen
National University of Defense Technology, Shandong University, Shenzhen University, SIAT, Stanford University, Tel-Aviv University
SLIDE 2 Background & motivation
- Robotic indoor scene modeling
- Perception of objects
SLIDE 3 Background & motivation
- Indoor environment acquisition and modeling
Dense Reconstruction [Nießner et al. 2013] | Object Extraction [Xu et al. 2015]
SLIDE 4
Background & motivation
What are these objects?
SLIDE 5
SLIDE 6
Active object recognition
SLIDE 7
Active object recognition
SLIDE 8 Problem setting
- A robot actively acquires new observations to gradually increase the confidence of object recognition
- Object classification: estimate the object class based on the observations acquired so far
- View planning: predict the next-best view (NBV) to maximize its information gain
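As a concrete illustration of "information gain" for view planning, one common choice (an assumption here, not necessarily the paper's exact definition) is the reduction in entropy of the class belief after a new observation:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete class belief."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return float(-np.sum(p * np.log(p)))

def information_gain(belief_before, belief_after):
    """How much a new observation reduced the class uncertainty."""
    return entropy(belief_before) - entropy(belief_after)

# A view that makes the belief more peaked yields a large gain.
before = [0.25, 0.25, 0.25, 0.25]   # fully uncertain over 4 classes
after  = [0.85, 0.05, 0.05, 0.05]   # mostly confident after observing
gain = information_gain(before, after)
```

The NBV is then the candidate view whose (expected) gain is largest.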
SLIDE 9 The main challenge
Observation is partial and progressive
- Shape description and matching with partial data is hard
- Observations come from varying views
SLIDE 10 The main challenge
Observation is partial and progressive
[Diagram: one observed view vs. several unobserved candidate views, each marked "?"]
How can you know which view is better without knowing its observation?
SLIDE 11 The main challenge
Real indoor scenes are often cluttered
- Clutter degrades recognition accuracy
- Clutter invalidates the offline-learned viewing policy
SLIDE 12
Related work
SLIDE 13 Related work
Online scene analysis and modeling
Plane/Object Extraction [Zhang et al. 2014] SemanticPaint [Valentin et al. 2015]
SLIDE 14 Related work
Active reconstruction and recognition
Next-best-view for reconstruction [Wu et al. 2014] Next-best-view for recognition [Wu et al. 2015]
SLIDE 15
Method
SLIDE 16
The general framework
SLIDE 17
The general framework
[Diagram: a closed loop driven by the goal. The agent observes, recognition updates its belief, and view planning selects the next action.]
SLIDE 18 An attentional formulation
"Humans focus attention selectively on parts of the visual space to acquire information when and where it is needed, and combine information from different fixations over time to build up an internal representation of the scene" – Ronald Rensink
Handwriting recognition [Mnih et al. 2014] | Image caption generation [Xu et al. 2015]
Internal representation
SLIDE 19 Recurrent Attention Model
- Recurrent Neural Networks (RNN)
[Diagram: RNN unrolled over time. Hidden states h_{t-1}, h_t, h_{t+1} are connected by W_hh; input x_t enters via W_xh; output y_t is emitted via W_hy.]
Aggregate information
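The unrolled recurrence above can be sketched minimally; the layer sizes and random weights below are illustrative, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_h, D_out = 8, 16, 4                      # illustrative sizes
W_xh = rng.standard_normal((D_h, D_in)) * 0.1    # input-to-hidden
W_hh = rng.standard_normal((D_h, D_h)) * 0.1     # hidden-to-hidden (the recurrence)
W_hy = rng.standard_normal((D_out, D_h)) * 0.1   # hidden-to-output

def rnn_step(h_prev, x_t):
    """One unrolled step: fold the new input into the hidden state."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t          # per-step output (e.g. class scores)
    return h_t, y_t

h = np.zeros(D_h)                      # initial internal state
for t in range(3):                     # three successive observations
    x = rng.standard_normal(D_in)      # stand-in for one view's feature vector
    h, y = rnn_step(h, x)              # h aggregates information over time
```

The hidden state h is the "internal representation" that accumulates evidence across fixations.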
SLIDE 20
View-based observation
[Diagram: depth images d^(0), ..., d^(t) acquired from sampled views v^(0), ..., v^(t)]
SLIDE 21 3D Recurrent Attention Model
[Diagram: the model unrolled over time, starting from an initial view. At each step t, the depth image d^(t-1) acquired from view v^(t-1) goes through feature extraction into hidden layer h_1^(t), which aggregates views and feeds a classifier; a second hidden layer h_2^(t) performs NBV emission, selecting the next view v^(t). The loop repeats: view selection, view aggregation, classification.]
SLIDE 22 3D Recurrent Attention Model
- Feature extraction uses a Multi-View CNN [Su et al. 2015]: per-view features from CNN1 are combined by view pooling (element-wise max-pooling) and passed to CNN2
[Diagram: the same unrolled model as before, with the multi-view CNN as the feature-extraction module at each step]
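View pooling in the Multi-View CNN is an element-wise max over the per-view feature vectors; a minimal sketch (the 3-view feature matrix is made up for illustration):

```python
import numpy as np

def view_pooling(per_view_features):
    """Element-wise max over views -> one shape descriptor (MVCNN-style)."""
    return np.max(per_view_features, axis=0)

# 3 views, each already encoded by the first CNN stage into a 5-D feature.
feats = np.array([
    [0.1, 0.9, 0.2, 0.0, 0.3],
    [0.4, 0.1, 0.8, 0.2, 0.1],
    [0.0, 0.3, 0.1, 0.7, 0.2],
])
pooled = view_pooling(feats)
```

Max-pooling makes the descriptor invariant to the order of views and to how many views have been seen so far, which suits a progressively growing set of observations.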
SLIDE 23 Network training
- The CNN (feature extraction and classification) is trained with back-propagation
- NBV emission and depth rendering are non-differentiable, so the view-planning pathway is trained with reinforcement learning
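A REINFORCE-style policy gradient is the standard way to train through a non-differentiable sampling step; the toy 4-view bandit and its rewards below are assumptions for illustration, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max()                    # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy policy over 4 candidate views, parameterized by logits theta.
theta = np.zeros(4)
lr = 0.5
view_rewards = np.array([0.1, 0.9, 0.2, 0.3])  # hypothetical per-view rewards

for _ in range(200):
    probs = softmax(theta)
    a = rng.choice(4, p=probs)         # sampling a view: the non-differentiable step
    r = view_rewards[a]
    # REINFORCE: grad of log pi(a) w.r.t. the logits is one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += lr * r * grad_log_pi      # push logits toward high-reward views

best = int(np.argmax(theta))           # index of the view the policy favors
```

Scaling the log-probability gradient by the reward is what lets the classifier-style back-propagation coexist with a learned, sampled view policy.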
SLIDE 24 Reinforcement learning
[Diagram: agent-environment loop. The agent's action is a depth acquisition; the environment returns a state and a reward measuring how good the acquired depth is; the loop continues until a stop criterion is met.]
SLIDE 25 Reward

r_t = I_t(q_t, q*) + J_t(q_t, q_{t-1}) - D_t

I_t: prediction accuracy | J_t: information gain | D_t: movement cost
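A sketch of a reward with this shape; the concrete choices for the three terms (0/1 accuracy, entropy reduction between successive beliefs, distance-proportional cost) are illustrative assumptions, not the paper's exact definitions:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete class belief."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def reward(q_t, q_prev, true_class, moved_distance, w_move=0.1):
    """r_t = I_t (accuracy) + J_t (information gain) - D_t (movement cost)."""
    I_t = 1.0 if int(np.argmax(q_t)) == true_class else 0.0   # prediction accuracy
    J_t = entropy(q_prev) - entropy(q_t)                      # info gain between beliefs
    D_t = w_move * moved_distance                             # movement cost
    return I_t + J_t - D_t

# A correct, confidence-increasing view earns a positive reward.
q_prev = [0.25, 0.25, 0.25, 0.25]
q_t    = [0.7, 0.1, 0.1, 0.1]
r = reward(q_t, q_prev, true_class=0, moved_distance=1.0)
```

The movement penalty discourages the policy from making long camera moves for marginal gains in confidence.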
SLIDE 26 Part-level attention
How to distinguish these two chairs?
Informative parts
SLIDE 27 Attention extraction
[Diagram: mid-level kernels in the Convolutional Neural Network act as part detectors]
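One simple way to turn mid-level kernel responses into a spatial part-attention map is to take the strongest response per location and normalize (an illustrative aggregation, not necessarily the paper's exact scheme):

```python
import numpy as np

def part_attention(feature_maps):
    """Aggregate mid-level feature maps into one spatial attention map.

    feature_maps: (K, H, W) responses of K convolutional kernels.
    Returns an (H, W) map scaled to [0, 1]; locations where some kernel
    fires strongly are treated as informative parts.
    """
    att = np.max(feature_maps, axis=0)   # strongest kernel response per location
    att = att - att.min()
    if att.max() > 0:
        att = att / att.max()
    return att

# Toy maps: one kernel fires strongly on a distinctive part.
maps = np.zeros((3, 4, 4))
maps[0, 1, 2] = 2.0
maps[1, 3, 0] = 0.5
att = part_attention(maps)
```

Such a map can highlight discriminative regions (e.g. the chair parts that separate two visually similar classes).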
SLIDE 28
Attention extraction
One wing Two wings
SLIDE 29
Results and evaluation
SLIDE 30 Database
ShapeNet:   57,452 models, 57 categories; rendered per model from 52 sampled views
ModelNet40: 12,311 models, 40 categories; rendered per model from 260 sampled views, with jittering
SLIDE 31
Timing

Database    MV-RNN train   MV-RNN test
ShapeNet    49 hr.         0.1 sec.
ModelNet40  22 hr.         0.1 sec.
SLIDE 32
Visualization of attentions
Part-level attention | View sequence
SLIDE 33 NBV estimation
[Plot: classification accuracy over 40 classes]
SLIDE 34
NBV estimation under occlusion
[Plot: classification accuracy under occlusion]
SLIDE 35
Results on real scenes
SLIDE 36
Results on real scenes
SLIDE 37
Results on real scenes
SLIDE 38 Limitations
- Handles only recognizable objects (those covered by the training data)
- Does not exploit contextual information
SLIDE 39
Future work: Multi-modal recognition
What is this?
Image database Shape database
SLIDE 40 Future: Multi-robot scene reconstruction & understanding
Turtlebot PR2 AscTec Pelican
SLIDE 41 Future: Multi-robot attention model
Attention based on shared internal representation?
SLIDE 42
Thank you Q & A
More details: kevinkaixu.net & yifeishi.net