

SLIDE 1

DEPTH, HUMAN POSE, AND CAMERA POSE

JAMIE SHOTTON

SLIDE 2

Kinect Adventures

  • Depth sensing camera
  • Tracks 20 body joints in real time
  • Recognises your face and voice
SLIDE 3

SLIDE 4

What the Kinect Sees

(Views: top view, side view, and the depth image as seen from the camera.)

SLIDE 5

Structured light

(Diagram: x, y, z axes; the baseline along the imaging plane separates the two optical centres.)

  • optic centre of camera
  • optic centre of IR laser
  • object at depth d1
  • object at depth d2
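For reference, the geometry in this diagram follows the standard structured-light/stereo triangulation relation. A minimal sketch in assumed notation (the slide does not name these symbols):

```latex
% Depth from disparity in a structured-light setup (standard relation, assumed notation):
%   z      -- depth of the object point
%   f      -- focal length of the camera
%   b      -- baseline between the IR laser and camera optic centres
%   \delta -- disparity: the observed shift of the projected pattern
z = \frac{f\,b}{\delta}
% A nearer object (depth d1) shifts the pattern more than a farther one (depth d2),
% which is how the sensor recovers z per pixel.
```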

SLIDE 6

Depth Makes Vision That Little Bit Easier

RGB

  • Only works well lit
  • Background clutter
  • Scale unknown
  • Color and texture variation

DEPTH

  • Works in low light
  • Background removal easier
  • Calibrated depth readings
  • Uniform texture

SLIDE 7

SLIDE 8

Joint work with Shahram Izadi, Richard Newcombe, David Kim, Otmar Hilliges, David Molyneaux, Pushmeet Kohli, Steve Hodges, Andrew Davison, Andrew Fitzgibbon. SIGGRAPH, UIST and ISMAR 2011.

KinectFusion

SLIDE 9

KinectFusion

Camera drift

SLIDE 10

ROADMAP

THE VITRUVIAN MANIFOLD [CVPR 2012]
SCENE COORDINATE REGRESSION [CVPR 2013]

SLIDE 11

THE VITRUVIAN MANIFOLD

Jonathan Taylor, Jamie Shotton, Toby Sharp, Andrew Fitzgibbon. CVPR 2012

SLIDE 12

Human Pose Estimation

Given some image input, recover the 3D human pose: joint positions and angles.

In this work:

  • Single frame at a time (no tracking)
  • Kinect depth image as input (background removed)

SLIDE 13

Why is Pose Estimation Hard?

SLIDE 14

A Few Approaches

Regress directly to pose?

e.g. [Gavrila ’00] [Agarwal & Triggs ’04]

Per-Pixel Body Part Classification

[Shotton et al. ‘11]

Per-Pixel Joint Offset Regression

[Girshick et al. ‘11]

Detect and assemble parts?

e.g. [Felzenszwalb & Huttenlocher ’00] [Ramanan & Forsyth ’03] [Sigal et al. ’04]

Detect parts?

e.g. [Bourdev & Malik ‘09] [Plagemann et al. ‘10] [Kalogerakis et al. ‘10]

SLIDE 15

Background: Learning Body Parts for Kinect [Shotton et al. CVPR 2011]

(Pipeline: input depth image → body parts (BPC) → clustering → body joint hypotheses, shown in front, side, and top views.)

SLIDE 16

Synthetic Training Data

  • Record mocap: 100,000s of poses [Vicon]
  • Retarget to varied body shapes
  • Render (depth, body parts) pairs
  • Train invariance to this variation

SLIDE 17

Depth Image Features

  • Depth comparisons

– very fast to compute

(Figure: input depth image with example probe pixels x and offsets Δ.)

f(x; v) = d(x) − d(x + Δ),   with Δ = v / d(x)

  • x: depth image coordinate
  • Δ: offset; scales inversely with depth via Δ = v / d(x)
  • d: depth; background pixels take a large constant depth
  • f: feature response
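A minimal sketch of this feature in Python; the array layout, the probe rounding, and the BG_DEPTH constant are illustrative assumptions, not from the slides:

```python
import numpy as np

BG_DEPTH = 10.0  # background pixels set to a large constant depth (value assumed)

def depth_feature(depth, x, v):
    """Depth-comparison feature f(x; v) = d(x) - d(x + v / d(x)).

    The offset v is divided by the depth at x so the probe covers the same
    *world* distance regardless of how far the person stands from the camera.
    """
    d_x = depth[x[0], x[1]]
    # Depth-normalized offset, rounded to the nearest pixel.
    probe = (x[0] + int(round(v[0] / d_x)), x[1] + int(round(v[1] / d_x)))
    h, w = depth.shape
    # Probes falling outside the image behave like background.
    if not (0 <= probe[0] < h and 0 <= probe[1] < w):
        return d_x - BG_DEPTH
    return d_x - depth[probe]
```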

SLIDE 18

Decision tree classification

(Figure: an image window centred at x is passed down a binary tree; each split node tests f(x; v_n) > θ_n and branches no/yes; each leaf stores a learned distribution P(c) over body parts.)
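A sketch of classifying one pixel at test time, reusing the depth_feature sketch above; the node structure (is_split, v, theta, left, right, p_c) is an assumption for illustration:

```python
def classify_pixel(depth, x, node):
    """Descend a trained decision tree for the pixel at x.

    Each split node tests f(x; v) > theta and branches; each leaf
    carries P(c), a distribution over body parts.
    """
    while node.is_split:
        if depth_feature(depth, x, node.v) > node.theta:
            node = node.right  # "yes" branch
        else:
            node = node.left   # "no" branch
    return node.p_c  # leaf distribution P(c) over body parts
```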

SLIDE 19

Training Decision Trees

A candidate split (v, θ) partitions the set S_n of pixels reaching node n:

S_r = { x ∈ S_n : f(x; v) > θ }  (yes),   S_l = S_n \ S_r  (no)

with body-part distributions P_l(c) and P_r(c) at the children and P_n(c) at the node.

Take the (v, θ) that maximises the information gain [Breiman et al. ’84]:

ΔE = − (|S_l| / |S_n|) E(S_l) − (|S_r| / |S_n|) E(S_r)

where E(S) is the entropy of the body-part distribution over the pixels in S (the parent term E(S_n) is constant for the node, so it is dropped).

Goal: reduce entropy with each split, driving entropy at the leaf nodes to zero.
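A sketch of scoring one candidate split against this objective; the label/response array conventions are assumptions:

```python
import numpy as np

def entropy(labels, n_classes):
    """Shannon entropy of the empirical body-part distribution."""
    counts = np.bincount(labels, minlength=n_classes)
    p = counts / max(len(labels), 1)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def split_gain(labels, responses, theta, n_classes):
    """Information gain of the split f(x; v) > theta, up to the constant E(S_n).

    labels: body-part label per pixel in S_n; responses: f(x; v) per pixel.
    Returns -|S_l|/|S_n| E(S_l) - |S_r|/|S_n| E(S_r), as on the slide;
    training greedily picks the (v, theta) maximising this quantity.
    """
    right = responses > theta
    n = len(labels)
    gain = 0.0
    for side in (right, ~right):
        if side.any():
            gain -= side.sum() / n * entropy(labels[side], n_classes)
    return gain
```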

SLIDE 20

Decision Forests Book

  • Theory – Tutorial & Reference
  • Practice – Invited Chapters
  • Software and Exercises
  • Tricks of the Trade
SLIDE 21

(Video: input depth and inferred body parts; no tracking or smoothing.)

SLIDE 22

(Pipeline recap: input depth image → body parts (BPC) → clustering → body joint hypotheses, shown in front, side, and top views.)

SLIDE 23

(Video: input depth, inferred body parts, and inferred joint position hypotheses in front, top, and side views; no tracking or smoothing.)

SLIDE 24

Body Part Recognition in Kinect

  • Single frame at a time → robust
  • Large training corpus → invariant
  • Fast, parallel implementation
  • Skeleton does not explain the depth data
  • Limited ability to cope with hard poses

SLIDE 25

A Few Approaches: Explain the Data Directly with a Mesh Model

e.g. [Ballan et al. ’08] [Baak et al. ’11]

  • GOOD: Full skeleton
  • GOOD: Kinematic constraints enforced from the outset
  • GOOD: Able to cope with occlusion and cropping
  • BAD: Many local minima
  • BAD: Highly sensitive to initial guess
  • BAD: Potentially slow

SLIDE 26

Human Skeleton Model

  • Mesh is attached to a hierarchical skeleton
  • Each limb l has a transformation matrix T_l(θ) relating its local coordinate system to the world:

T_root(θ) = R_global(θ)
T_l(θ) = T_parent(l)(θ) R_l(θ)

  • R_global(θ) encodes a global scaling, translation and rotation
  • R_l(θ) encodes a rotation and a fixed translation relative to its parent
  • 13 parameterized joints using quaternions to represent unconstrained rotations
  • This gives θ a total of 1 + 3 + 4 + 4 × 13 = 60 degrees of freedom

SLIDE 27

Linear Blend Skinning

Each vertex u of the mesh in base pose θ_0:

  • has position p in the base pose θ_0
  • is attached to K limbs {l_k}_{k=1}^{K} with weights {α_k}_{k=1}^{K}

In a new pose θ, the skinned position of u is:

M(u; θ) = Σ_{k=1}^{K} α_k T_{l_k}(θ) T_{l_k}^{−1}(θ_0) p

(T_{l_k}^{−1}(θ_0) p takes the base-pose world position into limb l_k's coordinate system; T_{l_k}(θ) maps it back into world coordinates in the new pose.)
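A minimal numpy sketch of this skinning equation; the homogeneous 4-vector and dense transform-array conventions are assumptions:

```python
import numpy as np

def skinned_position(p, limb_ids, weights, T_pose, T_base):
    """Linear blend skinning: M(u; theta) = sum_k alpha_k T_{l_k}(theta) T_{l_k}^{-1}(theta_0) p.

    Assumed data layout:
      p        : (4,) homogeneous base-pose position of the vertex
      limb_ids : the K limb indices l_k the vertex is attached to
      weights  : the K blend weights alpha_k (summing to 1)
      T_pose   : (L, 4, 4) limb-to-world transforms T_l(theta) in the new pose
      T_base   : (L, 4, 4) limb-to-world transforms T_l(theta_0) in the base pose
    """
    out = np.zeros(4)
    for l, alpha in zip(limb_ids, weights):
        # Map the base-pose world position into limb l's local frame,
        # then back out to world coordinates in the new pose.
        out += alpha * T_pose[l] @ np.linalg.inv(T_base[l]) @ p
    return out
```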

SLIDE 28

Test Time Model Fitting

What pose is the model in?

  • Assume each observed 3D point x_j is generated by a (predicted) point u_j on our model: x_j ≈ M(u_j; θ)
  • Optimize:

min_θ min_{u_1 … u_n} Σ_j d(x_j, M(u_j; θ))

(Figure labels: corresponding model points u_1, … u_n; observed points x_1, …, x_n.)

Note: simplified energy; more details to come.

SLIDE 29

Optimizing

min_θ min_{u_1 … u_n} Σ_j d(x_j, M(u_j; θ))

  • Alternate between pose θ and correspondences u_1, … u_n
    – this is articulated Iterative Closest Point (ICP)
  • Traditionally, start from an initial θ:
    – manual initialization
    – track from previous frame
  • Could we instead infer the initial correspondences u_1, … u_n discriminatively?
  • And, do we even need to iterate? (See the sketch below.)
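A schematic of the alternation; closest_points and optimize_pose are assumed stand-ins for the two sub-problems, not the paper's API:

```python
def fit_pose(x_obs, theta0, closest_points, optimize_pose, n_iters=10):
    """Articulated-ICP-style alternation for min_theta min_u sum_j d(x_j, M(u_j; theta)).

    Assumed helpers:
      closest_points(x_obs, theta) -> model points u_j minimizing d(x_j, M(u_j; theta))
      optimize_pose(x_obs, u, theta) -> theta minimizing sum_j d(x_j, M(u_j; theta))
    """
    theta = theta0
    for _ in range(n_iters):
        u = closest_points(x_obs, theta)        # fix theta, update correspondences
        theta = optimize_pose(x_obs, u, theta)  # fix correspondences, update pose
    # The Vitruvian Manifold question: if the initial u come from a per-pixel
    # regression forest, do we even need this loop?
    return theta
```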

SLIDE 30

One-Shot Pose Estimation: An Early Result

Can we achieve a good result without iterating between pose θ and correspondences u_1, … u_n?

(Video: test depth image, ground truth correspondences, and a convergence visualization.)

SLIDE 31

From Body Parts to Dense Correspondences

(Figure: from body parts, through an increasing number of parts, from classification to regression: the “Vitruvian Manifold”. Texture is mapped consistently across body shapes and poses.)

SLIDE 32

The “Vitruvian Manifold”: Embedding in 3D [L. Da Vinci, 1487]

(Figure: the model surface coloured by embedding coordinates u, v, w, each ranging over [−1, 1].)

Geodesic surface distances approximated by Euclidean distance

SLIDE 33

Overview

(Pipeline: training images → regression forest → inferred dense correspondences on test images → energy function → optimization of model parameters θ → final optimized poses, shown in front, right, and top views.)

SLIDE 34

Discriminative Model: Predicting Correspondences

(Pipeline: training images → regression forest; input images → inferred dense correspondences.)

SLIDE 35

Learning the Correspondences

  • How to learn the mapping from depth pixels x_j to correspondences u_j?
  • Render a synthetic training set: mocap → render characters
  • Train regression forest
SLIDE 36

Learning a Regression Model at the Leaf Nodes

Each pixel-correspondence pair descends to a leaf in the tree; mean shift mode detection summarizes the correspondences that reach each leaf.

SLIDE 37

Inferring Correspondences

SLIDE 38

Given a test image: infer correspondences V, then optimize the model parameters:

min_θ E(θ, V)

SLIDE 39

Full Energy

E(θ, V) = λ_vis E_vis(θ, V) + λ_prior E_prior(θ) + λ_int E_int(θ)

  • Term E_vis approximates hidden surface removal and uses a robust error
  • Gaussian prior term E_prior
  • Self-intersection prior term E_int approximates the interior volume

Energy is robust to noisy correspondences:

  • Correspondences far from their image points are “ignored”
  • Correspondences facing away from the camera are “ignored”
    – avoids the model getting stuck in front of the image pixels

(Plot: robust error function ρ(e).)
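For intuition, one common robust kernel with the saturating shape the plot describes; Geman-McClure and the kappa scale are illustrative assumptions here, not necessarily the paper's choice:

```python
def rho(e, kappa=0.05):
    """A robust error kernel: small residuals count roughly quadratically,
    large ones saturate toward 1, so bad correspondences are "ignored".
    """
    e2 = e * e
    return e2 / (e2 + kappa * kappa)
```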

SLIDE 40

SLIDE 41

SLIDE 42

“Easy” Metric: Average Joint Accuracy

(Plot: joints average accuracy (% of joints within distance D) against D, the max allowed distance to ground truth, from 0.05 to 0.2 m. Curves: our algorithm; given ground-truth correspondences u; optimal θ.)

Results on 5000 synthetic images

SLIDE 43

Hard Metric: “Perfect” Frame Accuracy

(Plot: worst-case accuracy (% of frames with all joints within distance D) against D, the max allowed distance to ground truth, from 0.05 to 0.3 m. Curves: our algorithm; given ground-truth correspondences u; optimal θ. Example frames shown at D = 0.09 m, 0.11 m, 0.17 m, 0.21 m, 0.45 m.)

Results on 5000 synthetic images

SLIDE 44

Comparison

(Plot: worst-case accuracy (% of frames with all joints within distance D) against D, from 0.05 to 0.3 m, on 5000 synthetic images. Achievable algorithms: our algorithm (Vitruvian Manifold), [Shotton et al. ’11] (top hypothesis), [Girshick et al. ’11] (top hypothesis). Requiring an oracle: [Shotton et al. ’11] (best of top 5), [Girshick et al. ’11] (best of top 5).)

SLIDE 45

Sequence Result

Each frame fit independently: no temporal information used

SLIDE 46

SLIDE 47
Generalization to Multiple 3D/2D Views

  • Easily extended to R views, where each view r has
    – n_r correspondences
    – a viewing matrix P_r to register the scene

min_θ Σ_{r=1}^{R} Σ_{j=1}^{n_r} d(x_{jr}, P_r M(u_{jr}; θ))

  • Can also extend to 2D silhouette views
    – let data points x_{jr} be 2D image coordinates
    – let P_r include a projection to 2D
    – minimize re-projection error

SLIDE 48

Silhouette Experiment

(Plot: worst-case accuracy (% of frames with all joints within distance D) against D, from 0.02 to 0.2 m. Curves: 2, 3, and 5 silhouette views; 1, 2, and 5 depth views.)

SLIDE 49

Vitruvian Manifold: Summary

  • Predict per-pixel image-to-model correspondences

– train invariance to body shape, size, and pose

  • “One-shot” pose estimation

– fast, accurate
– auto-initializes using correspondences

SLIDE 50

SCENE COORDINATE REGRESSION FORESTS FOR CAMERA RELOCALIZATION IN RGB-D IMAGES

JAMIE SHOTTON, BEN GLOCKER, CHRISTOPHER ZACH, SHAHRAM IZADI, ANTONIO CRIMINISI, ANDREW FITZGIBBON [CVPR 2013]

SLIDE 51

Know this: a world scene. Observe this: a single RGB-D frame. Where is this? Recover the 6D camera pose H (the camera-to-scene transformation).

slide-52
SLIDE 52

APPLICATIONS

  • Lost or kidnapped robots
  • Improving KinectFusion
  • Augmented reality

SLIDE 53

TYPICAL APPROACHES TO CAMERA LOCALIZATION

  • Tracking – alignment relative to previous frame
    e.g. [Besl & McKay ’92]
  • Key point detection → local descriptors → matching → geometric verification
    e.g. [Holzer et al. ’12], [Winder & Brown ’07], [Lepetit & Fua ’06], [Irschara et al. ’09]
  • Whole key-frame matching
    e.g. [Klein & Murray 2008] [Gee & Mayol-Cuevas 2012]
  • Epitomic location recognition
    [Ni et al. 2009]

(Slide axis: these approaches range from precise to approximate.)

SLIDE 54

PROBLEMS IN REAL WORLD CAMERA LOCALIZATION

  • The real world is less exciting than vision researchers might like
    – sparse interest points can fail
  • The real world is big

SLIDE 55

KEY IDEA: SCENE COORDINATE REGRESSION

(Visualization: scene coordinate XYZ → RGB color space.)

SLIDE 56

KEY IDEA: SCENE COORDINATE REGRESSION

  • Let each pixel predict a direct correspondence to a 3D point in scene coordinates

(Figure: input RGB, input depth, and desired correspondences for three example pixels A, B, C. Scene coordinate XYZ → RGB color space; 3D model from KinectFusion, used only for visualization.)

SLIDE 57

SCENE COORDINATE REGRESSION

  • Offline approach to relocalization
    – observe a scene
    – train a regression forest
    – revisit the scene
  • Aim for really precise localization
    – e.g. suitable for AR overlays
    – from a single frame
    – without an explicit 3D model

[Bunny: Stanford]

SLIDE 58

SCENE COORDINATE REGRESSION (SCORE) FORESTS

(Figure: for a pixel p of the RGB-D input, each tree t of the SCoRe forest applies depth & RGB difference features with depth-normalized offsets δ/D(p), routing p to a leaf with prediction M_{l_t(p)}. Leaf predictions are small sets M_l ⊂ ℝ³; the forest prediction is their union: M(p) = ∪_t M_{l_t(p)}.)
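A minimal sketch of this forest prediction; the tree interface is an assumption for illustration:

```python
def forest_predict(pixel, trees):
    """SCoRe forest prediction: the union of the leaf modes from every tree.

    Assumed interface: tree.leaf(pixel) descends the tree using its depth/RGB
    difference features and returns the reached leaf's small set of 3D
    scene-coordinate modes (a list of XYZ points).
    """
    modes = []
    for tree in trees:
        modes.extend(tree.leaf(pixel))  # M_{l_t(p)}, a subset of R^3
    return modes  # M(p) = union over trees
```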

SLIDE 59

TRAINING A SCORE FOREST

Training Data

  • RGB-D frames with known camera poses H
  • Generate 3D pixel labels automatically: m = H x
    (x: the pixel's 3D point in camera space, from the depth map; m: its scene coordinate; a label-generation sketch follows)

Learning (standard)

  • Greedily train tree
  • Reduction-in-spatial-variance objective: regression, not classification
  • Mean shift to summarize the distribution at leaf l into a small set M_l ⊂ ℝ³
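A minimal sketch of the automatic label generation m = H x; the intrinsics handling and array conventions are assumptions:

```python
import numpy as np

def scene_coordinate_labels(depth, K_inv, H):
    """Generate per-pixel scene-coordinate labels m = H x for one training frame.

    Assumed conventions: depth is an (h, w) array in metres, K_inv the inverse
    camera intrinsics, H the 4x4 camera-to-scene pose.
    """
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project every pixel to a 3D point x in camera space.
    rays = np.stack([us, vs, np.ones_like(us)], axis=-1) @ K_inv.T
    x = rays * depth[..., None]
    # Map camera-space points through the known pose: m = H x.
    x_h = np.concatenate([x, np.ones((h, w, 1))], axis=-1)
    m = x_h @ H.T
    return m[..., :3]  # (h, w, 3) scene coordinates
```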

SLIDE 60

ROBUST CAMERA POSE OPTIMIZATION

Energy function:

E(H) = Σ_j ρ( min_{m ∈ M(p_j)} ‖ m − H x_j ‖ )

(j: pixel index; H: camera pose; ρ: robust error function; M(p_j): correspondences predicted by the forest at pixel j.)

Optimization:

  • Preemptive RANSAC [Nistér ICCV 2003]
  • With pose refinement [Chum et al. DAGM 2003]
    – efficient updates to the means & covariances used by Kabsch SVD
  • Only a small subset of pixels used
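The refinement step named on the slide solves for the best rigid pose from inlier correspondences via Kabsch SVD. A minimal batch sketch (the efficient incremental mean/covariance updates are omitted; array conventions are assumptions):

```python
import numpy as np

def kabsch(x_cam, m_scene):
    """Best-fit rigid pose H = (R, t) aligning camera-space points x to their
    scene coordinates m, minimizing sum_j ||m_j - (R x_j + t)||^2.

    x_cam, m_scene: (n, 3) arrays of inlier correspondences.
    """
    x_bar, m_bar = x_cam.mean(axis=0), m_scene.mean(axis=0)
    C = (x_cam - x_bar).T @ (m_scene - m_bar)  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(C)
    # Reflection guard keeps R a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = m_bar - R @ x_bar
    H = np.eye(4)
    H[:3, :3], H[:3, 3] = R, t
    return H
```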

SLIDE 61

INLYING FOREST PREDICTIONS

(Figure: test images with ground truth vs. inferred camera pose, and the inlying forest predictions for six pose hypotheses H_1 … H_6.)

SLIDE 62

PREEMPTIVE RANSAC OPTIMIZATION

SLIDE 63

THE 7SCENES DATASET

(Example scenes: Heads, Pumpkin, RedKitchen, Stairs.) Dataset available from the authors.

SLIDE 64

BASELINES FOR COMPARISON

Sparse Key-Points (RGB only):

  • ORB matching [Rublee et al. ICCV 2011]
    – FAST detector
    – rotation-aware BRIEF descriptor
    – hashing for matching
  • Geometric verification
    – RANSAC & perspective 3-point
    – final refinement given inliers

Tiny-Image Key-Frames (RGB & Depth) [Klein & Murray ECCV 2008] [Gee & Mayol-Cuevas BMVC 2012]:

  • Downsample to 40x30 pixels
  • Blur
  • Normalized Euclidean distance
  • Brute-force search
  • Interpolation of 100 closest poses

SLIDE 65

QUANTITATIVE COMPARISON

Metric: proportion of test frames with < 0.05 m translational error and < 5° angular error.
Results: comparison across the choice of different image features.

SLIDE 66

QUALITATIVE COMPARISON

(Figure: ground truth; DA-RGB SCoRe forest; sparse baseline; closest training pose.)

SLIDE 67

QUALITATIVE COMPARISON

(Figure: ground truth; DA-RGB SCoRe forest; sparse baseline; closest training pose.)

SLIDE 68

TRACK VISUALIZATION VIDEOS

(Videos: ground truth; DA-RGB SCoRe forest; RGB sparse baseline. Single frame at a time, no tracking.)

SLIDE 69

AR VISUALIZATION

(Video: RGB input + AR overlay; depth input + AR overlay; rendering of the model from the inferred pose. Single frame at a time, no tracking.)

[Bunny: Stanford]

SLIDE 70

SIMPLE ROBUST TRACKING

  • Add a single extra hypothesis to the optimization: the result from the previous frame

(Video comparison against the single-frame result.)

SLIDE 71

AR VISUALIZATION WITH TRACKING

(Video: RGB input + AR overlay; depth input + AR overlay; rendering of the model from the inferred pose. Simple robust frame-to-frame tracking enabled.)

[Bunny: Stanford]

SLIDE 72

MODEL-BASED REFINEMENT

  • Model-based refinement [Besl & McKay PAMI 1992]
    – requires a 3D model of the scene
    – run rigid ICP between the observed image and the model, starting from our inferred pose

(Chart: proportion of frames correct, per scene: Chess, Fire, Heads, Office, Pumpkin, RedKitchen, Stairs. Methods: Baseline Tiny-Image Depth, Tiny-Image RGB, Tiny-Image RGB-D, Sparse RGB; Ours: Depth, DA-RGB, DA-RGB + D.)

SLIDE 73

AR VISUALIZATION WITH TRACKING AND REFINEMENT

(Video: RGB input + AR overlay; depth input + AR overlay; rendering of the model from the inferred pose. Simple robust frame-to-frame tracking and ICP-based model refinement enabled.)

[Bunny: Stanford]

SLIDE 74

Fire Scene

(Video: RGB input + AR overlay; depth input + AR overlay; rendering of the model from the inferred pose. Shown with: SCoRe forest (single frame at a time); + simple robust frame-to-frame tracking; + ICP refinement to the 3D model.)

[Bunny: Stanford]

SLIDE 75

Pumpkin Scene

(Video: RGB input + AR overlay; depth input + AR overlay; rendering of the model from the inferred pose. Shown with: SCoRe forest (single frame at a time); + simple robust frame-to-frame tracking; + ICP refinement to the 3D model.)

[Bunny, Armadillo: Stanford]

SLIDE 76

SCENE RECOGNITION

Confusion matrix (rows: true scene; columns: predicted scene):

              Chess   Fire   Heads  Office  Pumpkin  RedKitchen  Stairs
  Chess      100.0%   0.0%   0.0%    0.0%    0.0%      0.0%       0.0%
  Fire         2.0%  98.0%   0.0%    0.0%    0.0%      0.0%       0.0%
  Heads        0.0%   0.0% 100.0%    0.0%    0.0%      0.0%       0.0%
  Office       0.0%   0.5%   4.0%   95.5%    0.0%      0.0%       0.0%
  Pumpkin      0.0%   0.0%   0.0%    0.0%  100.0%      0.0%       0.0%
  RedKitchen   2.8%   1.2%   3.6%    0.0%    0.0%     92.4%       0.0%
  Stairs       0.0%   0.0%  10.0%    0.0%    0.0%      0.0%      90.0%

  • Train one SCoRe Forest per scene
  • Test frame against all scenes
  • Scene with lowest energy wins
  • Single frame only
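A minimal sketch of this lowest-energy-wins rule; forests and pose_energy are assumed placeholders, not the paper's API:

```python
def recognize_scene(frame, forests, pose_energy):
    """Scene recognition by relocalization: run each scene's SCoRe forest on
    the frame and pick the scene whose optimized camera pose has the lowest
    final energy E(H).

    forests: dict mapping scene name -> trained forest.
    pose_energy(frame, forest): assumed helper running the robust pose
    optimization and returning the resulting energy.
    """
    return min(forests, key=lambda scene: pose_energy(frame, forests[scene]))
```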

SLIDE 77

SCENE COORDINATE REGRESSION - SUMMARY

  • Scene coordinate regression forests
    – provide a single-step alternative to the detection/description/matching pipeline
    – can be applied at any valid pixel, not just at interest points
    – allow accurate relocalization without an explicit 3D model
  • Tracking-by-detection is approaching temporal tracking accuracy

SLIDE 78

WRAP UP

Unifying principle: per-pixel regression and per-image model fitting

  • Depth cameras are having huge impact
  • Decision forests + big data

SLIDE 79

Thank you!

With thanks to:

Andrew Fitzgibbon, Jon Taylor, Ross Girshick, Mat Cook, Andrew Blake, Toby Sharp, Pushmeet Kohli, Ollie Williams, Sebastian Nowozin, Antonio Criminisi, Mihai Budiu, Duncan Robertson, John Winn, Shahram Izadi. The whole Kinect team, especially: Alex Kipman, Mark Finocchio, Ryan Geiss, Richard Moore, Robert Craig, Momin Al-Ghosien, Matt Bronder, Craig Peeper.