Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation



SLIDE 1

Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation

Edoardo Remelli, Shangchen Han, Sina Honari, Pascal Fua, Robert Wang

SLIDE 2

Motivation

Multi-view input from synchronized and calibrated cameras.

State-of-the-art multi-view pose estimation solutions project 2D detections onto 3D volumetric grids and reason jointly across views through computationally intensive 3D CNNs or Pictorial Structures.

Can we fuse features both effectively and efficiently in latent space instead?

SLIDE 3

Problem Setting

Given:

  • Multi-view input crops $\{I_i\}_{i=1}^{N}$
  • Camera projection matrices $\{P_i\}_{i=1}^{N}$

Find:

  • 3D articulated pose $\mathbf{p}$ in world coordinates

Pinhole camera model: $\mathbf{u} = P\,\mathbf{p} = K[R \mid t]\,\mathbf{p} = K(R\mathbf{p} + t)$
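A minimal NumPy sketch of the pinhole model above; the helper name `project` and the example intrinsics are illustrative, not from the talk:

```python
import numpy as np

def project(p_world, K, R, t):
    """Project a 3D world point to pixel coordinates: u = K(Rp + t)."""
    p_cam = R @ p_world + t        # world -> camera coordinates
    uvw = K @ p_cam                # homogeneous image coordinates
    return uvw[:2] / uvw[2]        # dehomogenize to pixels

# Example: a point 2 m in front of an identity camera, f = 500 px.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
print(project(np.array([0.1, 0.0, 2.0]), K, np.eye(3), np.zeros(3)))
# -> [345. 240.]
```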

SLIDE 4

Lightweight pose estimation

[Figure: per-view pipeline – ResNet → shallow decoder → soft integration → 2D detections → triangulation → 3D pose]

Real-time multi-view 3D pose estimation methods:

  • Do not share information between features, although they represent the same pose in different coordinate systems
  • Do not supervise for the metric of interest, and use triangulation as a post-processing step

SLIDE 5

Our Baseline [Fusion]

[Figure: per-view ResNet features (512×8×8) are concatenated (1024×8×8), fused by conv layers, and split back into per-view 3D pose embeddings in camera coordinates]

How to reason jointly about pose across views? Let the network do all the hard work...

Pros: simple to implement, effective. Cons: overfits by design to the camera setting, does not exploit camera transforms explicitly.
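As a concrete illustration, a minimal PyTorch sketch of this fusion baseline, assuming two views and the 512×8×8 latent maps from the figure; the number, width, and kernel size of the fusion conv layers are assumptions, not the authors' exact architecture:

```python
import torch
import torch.nn as nn

class NaiveFusion(nn.Module):
    """[Fusion] baseline sketch: concatenate per-view latent maps and let
    plain conv layers reason jointly across views."""

    def __init__(self, channels=512, n_views=2):
        super().__init__()
        c = n_views * channels
        self.fuse = nn.Sequential(          # mixes information across views
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1),
        )
        self.n_views = n_views

    def forward(self, feats):               # feats: list of (B, 512, 8, 8)
        x = torch.cat(feats, dim=1)         # (B, 1024, 8, 8) for two views
        x = self.fuse(x)
        return torch.chunk(x, self.n_views, dim=1)  # per-view embeddings
```

Each output chunk then goes through its view's shallow decoder; nothing in this design tells the network how the cameras are related, which is why it overfits to the camera setting.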

SLIDE 6

Can we do better?

[Figure: per-view features in camera coordinates are mapped to world coordinates, fused by conv layers (Canonical Fusion), and mapped back to camera coordinates]

If we could map features to a common frame of reference before fusing them, jointly reasoning about views would become much easier for the network. How can we apply such a transformation to feature maps without resorting to 3D volume aggregation?

Pinhole camera model: $\mathbf{u} = P\,\mathbf{p} = K[R \mid t]\,\mathbf{p} = K(R\mathbf{p} + t)$

SLIDE 7

Review of Transforming Auto-Encoders

Given a representation learning task and a known source of variation, [1] proposes to learn equivariance with respect to the source of variation by conditioning the latent code on it.

[Figure: Auto-Encoder vs. Transforming Auto-Encoder; the latter applies $z_{c_2} = T_{c_1 \to c_2}[z_{c_1}]$ in latent space]

How to choose the transform $T_{c_1 \to c_2}$?

[1]: Worrall, Garbin, Turmukhambetov, and Brostow. Interpretable Transformations with Encoder-Decoder Networks

SLIDE 8

Review of Transforming Auto-Encoders [1]

How to choose transformation T?

  • Linear
  • Invertible
  • Norm preserving

ROTATIONS

Feature transform layer, with $z_{c_1} \in \mathbb{R}^{C \times H \times W}$:

$z_{c_1} \leftarrow z_{c_1}.\mathrm{reshape}(3, K), \qquad z_{c_2} = R_{c_1 \to c_2}\, z_{c_1}, \qquad z_{c_2} \leftarrow z_{c_2}.\mathrm{reshape}(C, H, W)$

We can use feature transform layers to map features between frames of reference.

[1]: Worrall, Garbin, Turmukhambetov, and Brostow. Interpretable Transformations with Encoder-Decoder Networks
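A minimal PyTorch sketch of the feature transform layer, following the reshape–rotate–reshape pattern above; the exact packing of channels into 3-vectors is an assumption and may differ from the authors' implementation:

```python
import torch

def feature_transform(z, R):
    """Feature transform layer (FTL) sketch: rotate a latent map between
    camera frames by treating its channels as packed 3-vectors.

    z: (B, C, H, W) latent code, with C divisible by 3
    R: (B, 3, 3) rotation from the source to the target camera frame
    """
    B, C, H, W = z.shape
    pts = z.reshape(B, 3, (C // 3) * H * W)  # pack latent into 3D "points"
    pts = torch.bmm(R, pts)                  # linear, invertible, norm-preserving
    return pts.reshape(B, C, H, W)           # unpack back to a latent map
```

Because R is a rotation, the layer is linear, invertible, and norm-preserving, exactly the three requirements listed above.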

SLIDE 9

Our architecture [Canonical Fusion]

[Figure: Canonical Fusion – per-view features (512×8×8) in camera coordinates are mapped to world coordinates via feature transform layers, fused by conv layers (1024×8×8 → 512×8×8), and mapped back to each camera frame]

Canonical Fusion

  • Makes use of camera information (Flexible)
  • Lightweight (Does not rely on volumetric aggregation)
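Putting the pieces together, a sketch of how Canonical Fusion might be wired, assuming only the rotation part of each camera transform is applied in latent space (so the FTL stays norm-preserving) and illustrative layer shapes; this is not the authors' exact code:

```python
import torch
import torch.nn as nn

def ftl(z, R):
    """Feature transform layer: rotate packed 3-vectors in the latent map."""
    B, C, H, W = z.shape
    pts = torch.bmm(R, z.reshape(B, 3, (C // 3) * H * W))
    return pts.reshape(B, C, H, W)

class CanonicalFusion(nn.Module):
    """Map per-view features to the world frame, fuse them into one shared
    (camera-disentangled) code, then map it back to each camera frame."""

    def __init__(self, channels=510):  # channels must be divisible by 3
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, 1)  # illustrative

    def forward(self, z1, z2, R1, R2):  # R_i: world -> camera_i rotations
        w1 = ftl(z1, R1.transpose(1, 2))            # camera_1 -> world
        w2 = ftl(z2, R2.transpose(1, 2))            # camera_2 -> world
        shared = self.fuse(torch.cat([w1, w2], 1))  # fuse in world frame
        return ftl(shared, R1), ftl(shared, R2)     # world -> each camera
```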
SLIDE 10

Now that we have computed 2D detections, how can we lift them to 3D differentiably? Direct Linear Transform (DLT).

$\{u_i\}_{i=1}^{N} \;\longrightarrow\; \mathbf{p}$

SLIDE 11

Review of DLT

From epipolar geometry, for each view $i$: $\lambda_i u_i = P_i\,\mathbf{p}$, i.e.

$\lambda_i x_i = P_i^{1\top}\mathbf{p}, \qquad \lambda_i y_i = P_i^{2\top}\mathbf{p}, \qquad \lambda_i = P_i^{3\top}\mathbf{p}$

Eliminating $\lambda_i$:

  • $(x_i P_i^{3\top} - P_i^{1\top})\,\mathbf{p} = 0$
  • $(y_i P_i^{3\top} - P_i^{2\top})\,\mathbf{p} = 0$

Accumulating over the available N views:

$A\mathbf{p} = 0, \qquad A \in \mathbb{R}^{2N \times 4}$

This admits a non-trivial solution only if the $u_i$ and $P_i$ are not noisy; therefore we must solve a relaxed version:

$\min_{\mathbf{p}} \|A\mathbf{p}\| \quad \text{s.t.} \quad \|\mathbf{p}\| = 1$

Equivalent to finding the eigenvector of $A^\top A$ associated with the smallest eigenvalue, $\sigma_{\min}(A^\top A)$.
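A minimal NumPy sketch of the DLT construction above; for clarity the smallest eigenvector is found here with a dense eigendecomposition, which is exactly the step the talk goes on to replace:

```python
import numpy as np

def dlt_triangulate(us, Ps):
    """Triangulate one 3D point from noisy 2D detections via DLT.

    us: (N, 2) detections (x_i, y_i), one per view
    Ps: (N, 3, 4) camera projection matrices
    """
    rows = []
    for (x, y), P in zip(us, Ps):
        rows.append(x * P[2] - P[0])   # x_i * P_i^{3T} - P_i^{1T}
        rows.append(y * P[2] - P[1])   # y_i * P_i^{3T} - P_i^{2T}
    A = np.stack(rows)                 # (2N, 4)
    w, V = np.linalg.eigh(A.T @ A)     # eigenvalues in ascending order
    p = V[:, 0]                        # eigenvector for smallest eigenvalue
    return p[:3] / p[3]                # dehomogenize the 3D point
```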

SLIDE 12

How to solve it?

!"#$ %&% ?

In the literature, the smallest eigenvalue is found by computing a Singular Value Decomposition (SVD) of the matrix A [2]. We argue that this is sub-optimal because:

  • we need only the smallest eigenvalue, not the full SVD factorization
  • SVD is not a GPU-friendly algorithm [3]

[2]: Hartley and Zisserman, Multiple view geometry in computer vision [3]: Dongarra, Gates, Haidar, Kurzak, Luszczek, Tomov, and Yamazaki, Accelerating numerical dense linear algebra calculations with GPUs

SLIDE 13

How to solve it?

Step 1: derive a bound for the smallest singular value of the matrix A.

Step 2: use it to estimate the smallest singular value, then refine the estimate iteratively using the Shifted Power Iteration method. Algorithm 1 is guaranteed to converge to the desired singular value because of the bound above.
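The slides do not spell out Algorithm 1, but the named ingredients (a bound on the smallest singular value, a shift, a small matrix inversion) suggest shifted inverse power iteration on the 4×4 matrix $A^\top A$. A sketch under that assumption; the shift and iteration count here are illustrative:

```python
import numpy as np

def smallest_singular_vector(A, shift, n_iters=2):
    """Shifted (inverse) power iteration sketch.

    With a shift close to, but not equal to, the smallest eigenvalue of
    B = A^T A, iterating with (B - shift*I)^{-1} converges to the associated
    eigenvector: one 4x4 inversion plus a few matrix-vector products,
    far cheaper on GPU than a full SVD of A.
    """
    B = A.T @ A                                  # 4x4 in the DLT setting
    M = np.linalg.inv(B - shift * np.eye(B.shape[0]))
    x = np.ones(B.shape[0])
    for _ in range(n_iters):                     # slide 14: ~2 iterations
        x = M @ x
        x /= np.linalg.norm(x)                   # keep the iterate normalized
    return x
```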
SLIDE 14

Quantitative Evaluation – Direct Linear Triangulation

For reasonably accurate 2D detections, our algorithm converges in as few as 2 iterations to the desired eigenvalue. Since it requires only a small matrix inversion and a few matrix multiplications, it is much faster than performing a full SVD factorization, especially on GPUs.

SLIDE 15

Quantitative Evaluation – H36M

[Tables: results without and with additional training data]

SLIDE 16

Quantitative Evaluation – Total Capture

[Tables: results on seen and unseen cameras]

SLIDE 17

Contributions

[Figure: full pipeline – per-view ResNet → shallow decoder → soft integration → 2D detections → differentiable, GPU-friendly triangulation → 3D pose]

  • A novel multi-camera fusion technique that exploits 3D geometry in latent space to jointly reason about different views efficiently
  • A new GPU-friendly differentiable algorithm for solving Direct Linear Triangulation, which is up to 3 orders of magnitude faster than SVD-based implementations while allowing us to supervise directly for the metric of interest


Camera-Disentangled Representation

SLIDE 18

Please refer to the video for qualitative results and visualizations! For any questions, feel free to reach out to

edoardo.remelli@epfl.ch

Thank you!