Deep neural nets for human pose estimation in videos

slide-1
SLIDE 1

Deep neural nets for human pose estimation in videos

Tomas Pfister, James Charles, Andrew Zisserman Department of Engineering Science University of Oxford http://www.robots.ox.ac.uk/~vgg

slide-2
SLIDE 2

Aim:

Estimate 2D upper body joint positions (wrist, elbow, shoulder, head) with high accuracy in real-time

slide-3
SLIDE 3

Outline

  • Two types of loss functions for pose estimation
  • Coordinate net
  • Heatmap net
  • Optical flow for pose estimation in videos
  • Results (cf. state of the art)
slide-4
SLIDE 4

Method overview: single frame learning

  • 1. Coordinate Net (e.g. DeepPose CVPR14, Pfister et al ACCV14)
  • 2. Heatmap Net (e.g. Jain et al ICLR14, Tompson et al CVPR15)

slide-5
SLIDE 5

Coordinate Net: regress joint positions

Training loss: L2 on joint positions

OverFeat-like architecture
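
As an illustration of an L2 loss on joint positions, here is a minimal NumPy sketch. The array layout (batch, joints, (x, y)), the function name `coordinate_l2_loss`, and the averaging over the batch are assumptions made for the example; the slides do not specify how the loss is normalised.

```python
import numpy as np

def coordinate_l2_loss(pred, target):
    """Mean over the batch of the summed squared joint-position error.

    pred, target: arrays of shape (batch, num_joints, 2) holding (x, y).
    The normalisation is an assumption, not taken from the slides.
    """
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(np.sum(diff ** 2, axis=(1, 2))))

# Toy example: 1 frame, 7 joints, every prediction off by (3, 4) pixels.
target = np.zeros((1, 7, 2))
pred = target + np.array([3.0, 4.0])
loss = coordinate_l2_loss(pred, target)  # 7 joints * (3^2 + 4^2)
```

The network output is a single (x, y) pair per joint, so the loss can only say how far that one guess is from the annotation.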

slide-6
SLIDE 6

Heatmap Net: regress heatmap for each joint

Input: 256 x 256 frame

Output: 64 x 64 heatmap for each of the 7 joints

Represent joint position by Gaussian

Training loss: L2 on pixels
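
A minimal sketch of how such a regression target could be built, assuming one 64 x 64 map per joint with an unnormalised Gaussian at the annotated position. The value of `sigma`, the function names, and the example joint coordinates are hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_heatmap(x, y, size=64, sigma=1.5):
    """Return a (size, size) map with an unnormalised Gaussian centred at (x, y).

    sigma is an assumed value; the slides do not give it.
    """
    cols, rows = np.meshgrid(np.arange(size), np.arange(size))
    return np.exp(-((cols - x) ** 2 + (rows - y) ** 2) / (2.0 * sigma ** 2))

def heatmap_l2_loss(pred, target):
    """Sum of squared per-pixel differences over all joints."""
    return float(np.sum((pred - target) ** 2))

# Hypothetical joint positions in 64 x 64 heatmap coordinates, 7 joints.
joints = [(10, 12), (20, 30), (32, 8), (40, 40), (50, 22), (5, 60), (33, 33)]
target = np.stack([gaussian_heatmap(x, y) for x, y in joints])  # (7, 64, 64)
```

Each map is indexed as `target[joint, row, col]`, so the peak for a joint at (x, y) sits at `target[joint, y, x]`.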

slide-7
SLIDE 7

Comparison

Regression target:

  • Coordinate Net: coordinates
  • Heatmap Net: heatmap

slide-8
SLIDE 8

BBC sign language videos data set

Training set

Training: 15 videos, each 0.5-1hr long, all frames annotated

Testing: 5 videos, 200 annotated frames per video

Extended Training: 72 videos with noisy automated annotations

slide-9
SLIDE 9

Results on architecture comparison

[Plot, evaluated on BBC Pose. Groups: CoordinateNets, HeatmapNets. Curves: CoordinateNet; CoordinateNet - more data; HeatmapNet; HeatmapNet - more data; HeatmapNet - data+flow. Annotation: More training data]

  • Heatmap net superior to coordinate net
  • Performance of coordinate net saturates with more training data
slide-10
SLIDE 10

Why is the heatmap network superior?

Regression target:

  • Coordinate Net: coordinates
  • Heatmap Net: heatmap

  • 1. Can represent multimodal estimates, so can model uncertainty/confidence
  • 2. In training there is an error signal from every pixel, so better smoothing for back propagation

Also, it is easier to visualize (and understand) what is being learnt
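
To make the first point concrete, here is a small sketch of reading a joint position and a confidence value off a heatmap. Taking the global maximum as the estimate and its value as the confidence is an assumption for illustration; the slides do not state the read-out rule. The function name and the toy values are hypothetical.

```python
import numpy as np

def decode_heatmap(heatmap):
    """Return ((x, y), confidence) for the strongest peak of a 2D heatmap."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return (int(col), int(row)), float(heatmap[row, col])

# Toy heatmap with two modes: a strong one at (x=40, y=20)
# and a weaker one at (x=10, y=50).
heatmap = np.zeros((64, 64))
heatmap[20, 40] = 0.9
heatmap[50, 10] = 0.6
position, confidence = decode_heatmap(heatmap)
```

The weaker mode stays available in the map even though it is not selected, which a single regressed coordinate pair cannot express.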

slide-11
SLIDE 11

Timelapse of training

slide-12
SLIDE 12

Multiple modes example

[Figure panels: early in training; late in training]

slide-13
SLIDE 13

What do the layers learn?

Three randomly selected activations from each layer

[Figure labels: Input frame; Edges; Body parts (some)]

slide-14
SLIDE 14

Learning from videos

  • Temporal information

– How do we learn from temporal information with a ConvNet?

Hand moving in x direction

slide-15
SLIDE 15

Late fusion using flow

Warp the heatmaps from previous/next frames & combine

Cf. S. Zuffi et al., Estimating human pose with flowing puppets, Proc. ICCV, 2013

Charles et al., Upper Body Pose Estimation with Temporal Sequential Forests, BMVC 2014
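
The warp-and-combine step can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes a dense flow field that points from each pixel of frame t to its match in the neighbouring frame, uses nearest-neighbour sampling, and combines by plain averaging. The function names and the toy data are hypothetical.

```python
import numpy as np

def warp_heatmap(heatmap, flow):
    """Warp a neighbouring frame's heatmap into frame t.

    flow[r, c] = (dx, dy) is assumed to point from pixel (c, r) in frame t
    to the matching pixel in the neighbouring frame (backward warping,
    nearest-neighbour sampling, coordinates clipped at the border).
    """
    h, w = heatmap.shape
    cols, rows = np.meshgrid(np.arange(w), np.arange(h))
    src_c = np.clip(np.rint(cols + flow[..., 0]).astype(int), 0, w - 1)
    src_r = np.clip(np.rint(rows + flow[..., 1]).astype(int), 0, h - 1)
    return heatmap[src_r, src_c]

def late_fusion(heatmap_t, neighbours, flows):
    """Average frame t's heatmap with the warped neighbouring heatmaps."""
    warped = [warp_heatmap(hm, fl) for hm, fl in zip(neighbours, flows)]
    return np.mean([heatmap_t] + warped, axis=0)

# Toy case: a hand moving in the x direction, at x=8 in frame t-1
# and at x=10 in frame t.
prev = np.zeros((64, 64))
prev[30, 8] = 1.0
curr = np.zeros((64, 64))
curr[30, 10] = 1.0
flow = np.zeros((64, 64, 2))
flow[..., 0] = -2.0  # frame t -> frame t-1: 2 px back in x
fused = late_fusion(curr, [prev], [flow])
```

After warping, the peak from frame t-1 lands on the same pixel as the peak in frame t, so the two estimates reinforce each other.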

slide-16
SLIDE 16

Example: Heatmap Net & optical flow

Tracks for optical flow for wrist positions

Flow: Brox et al GPU flow from OpenCV, or FastDeepFlow

slide-17
SLIDE 17

Example: Heatmap Net & optical flow

Warping heatmaps to frame t

slide-18
SLIDE 18

Flowing ConvNets

  • Learn the pooling of the warped heatmaps
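
One way to picture learnt pooling is as a weighted sum over the warped heatmaps, with one weight per joint and per frame. The sketch below assumes that form, which is equivalent to a 1x1 convolution across the frame channels; the slides say only that the pooling is learnt. The shapes, the number of frames, and the function name are hypothetical, and the weights here are fixed rather than trained.

```python
import numpy as np

def learned_pooling(warped, weights):
    """Weighted sum over frames, with one weight per (joint, frame).

    warped:  (num_frames, num_joints, H, W) heatmaps already warped to frame t.
    weights: (num_joints, num_frames) pooling weights (learnt in practice).
    """
    return np.einsum("fjhw,jf->jhw", warped, weights)

num_frames, num_joints = 5, 7
rng = np.random.default_rng(0)
warped = rng.random((num_frames, num_joints, 64, 64))

# Uniform weights reproduce plain average pooling.
uniform = np.full((num_joints, num_frames), 1.0 / num_frames)
pooled = learned_pooling(warped, uniform)
```

Training would replace the uniform weights with values fitted to the data, so different joints can weight neighbouring frames differently.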
slide-19
SLIDE 19

Results: with/without optical flow

slide-20
SLIDE 20

Results: Comparison of pooling types

slide-21
SLIDE 21

Results: Learnt optical flow pooling weights

[Figure panels: elbow; wrist]

slide-22
SLIDE 22

Results: Comparison to the state of the art

Poses in the Wild

12% improvement at d = 10px
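
The slide reports the improvement "at d = 10px" without defining d. The sketch below assumes d is a distance threshold in pixels and that a predicted joint counts as correct when it lies within d pixels of the ground truth. The function name, the array layout, and the use of "less than or equal" are assumptions for illustration.

```python
import numpy as np

def accuracy_at_distance(pred, gt, d=10.0):
    """Fraction of joint predictions within d pixels of the ground truth.

    pred, gt: arrays of shape (num_frames, num_joints, 2) holding (x, y).
    """
    diff = np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)
    dist = np.linalg.norm(diff, axis=-1)
    return float(np.mean(dist <= d))

# Toy data: 2 frames, 2 joints, ground truth at the origin.
gt = np.zeros((2, 2, 2))
pred = np.array([[[3.0, 4.0], [6.0, 8.0]],     # distances 5 and 10
                 [[9.0, 12.0], [0.0, 0.0]]])   # distances 15 and 0
acc = accuracy_at_distance(pred, gt, d=10.0)   # 3 of 4 within 10 px
```

Sweeping d over a range of thresholds gives an accuracy curve, from which a single operating point such as d = 10px can be read.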

slide-23
SLIDE 23

Results: Example pose estimation

50fps on 1 GPU without optical flow, 5fps with optical flow

slide-24
SLIDE 24

Results: Failure cases

Main failure case: picking the wrong mode (examples from BBC Pose and ChaLearn)

Correctable with a spatial model

slide-25
SLIDE 25

Additional Pooling Fusion Layers

256 x 256

  • Conv A: 8x8x64
  • Conv B: 13x13x64
  • Conv C: 15x15x64
  • Conv D: 1x1x128
  • Conv E: 1x1x7

Implicit spatial model

slide-26
SLIDE 26

Results: Additional Pooling Fusion Layers

Poses in the Wild

Heat map CNNs:

  • Original
  • With fusion
  • With fusion and flow

slide-27
SLIDE 27

Results: Additional Pooling Fusion Layers

FLIC: single image predictions

slide-28
SLIDE 28

Summary

  • Deep Heatmap ConvNet achieves state of the art with implicit spatial models
  • Performance improved by optical flow pooling
  • Future work:

– Robust regression
– Data dependent flow channel pooling
– More training data