
Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses

Maryam Daneshi, Konstantin Bayandin, May 28th, 2013


slide-1
SLIDE 1

1

Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses

Maryam Daneshi, Konstantin Bayandin May 28th, 2013

slide-2
SLIDE 2

2

Agenda

  • Introduction & Motivation
  • Dataset description
  • Model
  • Training
  • Inference
  • Results
slide-3
SLIDE 3

3

Human visual system uses context for recognition

Context and Recognition

slide-4
SLIDE 4

4

Human Object Interaction (HOI)

slide-5
SLIDE 5

5

Human Poses and Objects

Human pose estimation is challenging:

  • Unusual part appearances
  • Self-occlusion
  • Image patches that look like body parts

slide-6
SLIDE 6

6

Human Poses and Objects

Given the object is detected.

slide-7
SLIDE 7

7

Human Poses and Objects

Object detection is challenging:

  • Small, low-resolution, or partially occluded objects
  • Image regions similar to the detection target

slide-8
SLIDE 8

8

Given the pose is estimated.

Human Poses and Objects

slide-9
SLIDE 9

9

Datasets - Sports

Images of six sports activities

slide-10
SLIDE 10

10

Datasets - PPMI

People interacting with 12 classes of musical instruments

slide-11
SLIDE 11

11

Atomic poses – pose dictionary

slide-12
SLIDE 12

12

Mutual Context Model

  • Goal: estimate the human pose and detect the objects that the human interacts with
      - Occluded or small objects
      - Articulated human poses
      - Variation of poses within one class of activity
  • Conditional random field model
  • Human interacting with any number of objects
slide-13
SLIDE 13

13

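Model

$$\Psi(A, O, H, I) = \phi_1(A, O, H) + \phi_2(H, O) + \phi_3(O, I) + \phi_4(H, I) + \phi_5(A, I)$$

  • $\phi_1$: co-occurrence context
  • $\phi_2$: spatial context
  • $\phi_3$: modeling objects
  • $\phi_4$: modeling human pose
  • $\phi_5$: modeling activity

[Figure: graphical model with nodes for the activity A, the human pose H, the body parts P1, P2, ..., PL, the objects O1, ..., OM, and the image I of the human-object interaction]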

slide-14
SLIDE 14

14

Model: Co-occurrence Context

Compatibility between actions, objects, and human poses

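$$\phi_1(A, O, H) = \sum_{i=1}^{N_h} \sum_{m=1}^{M} \sum_{j=1}^{N_o} \sum_{k=1}^{N_a} \mathbf{1}(H = h_i) \cdot \mathbf{1}(O^m = o_j) \cdot \mathbf{1}(A = a_k) \cdot \zeta_{i,j,k}$$

[Figure: graphical model over the activity A, human pose H, body parts P1, ..., PL, objects O1, ..., OM, and image I]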

slide-15
SLIDE 15

15

Model: Co-occurrence Context

$$\phi_1(A, O, H) = \sum_{i=1}^{N_h} \sum_{m=1}^{M} \sum_{j=1}^{N_o} \sum_{k=1}^{N_a} \mathbf{1}(H = h_i) \cdot \mathbf{1}(O^m = o_j) \cdot \mathbf{1}(A = a_k) \cdot \zeta_{i,j,k}$$

  • $N_h$: total number of atomic poses; $h_i$: the $i$th atomic pose
  • $N_o$: total number of objects; $o_j$: the $j$th object
  • $N_a$: total number of activities; $a_k$: the $k$th activity
  • $\zeta_{i,j,k}$: strength of the co-occurrence interaction
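
Because each indicator product is nonzero for exactly one (pose, object, activity) triple per object box, the co-occurrence term amounts to one table lookup per detected object. A minimal Python sketch of that reading follows; the function name, the dictionary layout, and the example weights are illustrative assumptions, not part of the presented model.

```python
from typing import Dict, List, Tuple

# zeta[(i, j, k)]: co-occurrence strength of atomic pose h_i, object o_j, activity a_k.
# Storing it as a dict with a default of 0.0 is an assumption made for this sketch.
Zeta = Dict[Tuple[int, int, int], float]


def cooccurrence_potential(pose: int, objects: List[int], activity: int, zeta: Zeta) -> float:
    """Co-occurrence term: one zeta lookup for each of the M object boxes.

    The indicators 1(H = h_i), 1(O^m = o_j), 1(A = a_k) select a single
    (i, j, k) entry per box m, so the four nested sums collapse to M lookups.
    """
    return sum(zeta.get((pose, obj, activity), 0.0) for obj in objects)


# Hypothetical weights: pose 0, objects 0 and 1, activities 0 and 1.
zeta = {(0, 0, 0): 1.5, (0, 1, 0): 0.5, (0, 0, 1): -0.2}
score = cooccurrence_potential(pose=0, objects=[0, 1], activity=0, zeta=zeta)
print(score)
```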

slide-16
SLIDE 16

16

Model: Spatial Context

Spatial relationship between object and different body parts of the human

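$$\phi_2(H, O) = \sum_{m=1}^{M} \sum_{i=1}^{N_h} \sum_{j=1}^{N_o} \sum_{l=1}^{L} \mathbf{1}(H = h_i) \cdot \mathbf{1}(O^m = o_j) \cdot \lambda_{i,j,l}^{T} \cdot b(x_I^l, O^m)$$

[Figure: graphical model over the activity A, human pose H, body parts P1, ..., PL, objects O1, ..., OM, and image I]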

slide-17
SLIDE 17

17

Model: Spatial Context

$$\phi_2(H, O) = \sum_{m=1}^{M} \sum_{i=1}^{N_h} \sum_{j=1}^{N_o} \sum_{l=1}^{L} \mathbf{1}(H = h_i) \cdot \mathbf{1}(O^m = o_j) \cdot \lambda_{i,j,l}^{T} \cdot b(x_I^l, O^m)$$

  • $x_I^l$: location of the center of the human’s $l$th body part in image $I$
  • $b(x_I^l, O^m)$: spatial relationship between $x_I^l$ and the $m$th object bounding box, encoded as a sparse binary vector with a single 1
  • $\lambda_{i,j,l}$: weight for the relationship
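
To make the sparse binary encoding concrete, here is a small Python sketch of a one-hot spatial relation vector and its dot product with a weight vector. The five relation bins, the box format, and the function names are assumptions chosen for illustration; the slides do not specify how the relations are discretized.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), y grows downward

# Hypothetical discretization of "where is the object relative to the body part".
RELATIONS = ("inside", "left", "right", "above", "below")


def spatial_relation(part_xy: Tuple[float, float], box: Box) -> List[int]:
    """Return a binary vector with exactly one 1, marking the relation bin."""
    x, y = part_xy
    x_min, y_min, x_max, y_max = box
    if x_min <= x <= x_max and y_min <= y <= y_max:
        rel = "inside"
    else:
        # Offset of the box center from the body part center.
        dx = (x_min + x_max) / 2 - x
        dy = (y_min + y_max) / 2 - y
        if abs(dx) >= abs(dy):
            rel = "right" if dx > 0 else "left"
        else:
            rel = "below" if dy > 0 else "above"
    return [1 if r == rel else 0 for r in RELATIONS]


def spatial_term(weights: Sequence[float], b: Sequence[int]) -> float:
    """Dot product of a weight vector with the binary vector b."""
    return sum(w * v for w, v in zip(weights, b))


b = spatial_relation((50.0, 50.0), (80.0, 40.0, 120.0, 60.0))
print(b, spatial_term([0.1, 0.2, 0.3, 0.4, 0.5], b))
```

Since only one entry of the vector is 1, the dot product simply picks out the weight of the active relation bin.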

slide-18
SLIDE 18

18

Model: Objects

Modeling objects using the detection scores in all the object bounding boxes and the spatial relationship between these boxes.

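$$\phi_3(O, I) = \sum_{m=1}^{M} \sum_{j=1}^{N_o} \mathbf{1}(O^m = o_j) \cdot \gamma_j^{T} \cdot g(O^m) + \sum_{m=1}^{M} \sum_{m'=1}^{M} \sum_{j=1}^{N_o} \sum_{j'=1}^{N_o} \mathbf{1}(O^m = o_j) \cdot \mathbf{1}(O^{m'} = o_{j'}) \cdot \gamma_{j,j'}^{T} \cdot b(O^m, O^{m'})$$

[Figure: graphical model over the activity A, human pose H, body parts P1, ..., PL, objects O1, ..., OM, and image I]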

slide-19
SLIDE 19

19

Model: Objects

$$\phi_3(O, I) = \sum_{m=1}^{M} \sum_{j=1}^{N_o} \mathbf{1}(O^m = o_j) \cdot \gamma_j^{T} \cdot g(O^m) + \sum_{m=1}^{M} \sum_{m'=1}^{M} \sum_{j=1}^{N_o} \sum_{j'=1}^{N_o} \mathbf{1}(O^m = o_j) \cdot \mathbf{1}(O^{m'} = o_{j'}) \cdot \gamma_{j,j'}^{T} \cdot b(O^m, O^{m'})$$

[Desai et al, 2009]

  • $g(O^m)$: vector of scores of all detected objects in the $m$th box
  • $\gamma_j$: detection score weight for the $j$th object
  • $b(O^m, O^{m'})$: binary vector of the spatial relationship between a pair of objects
  • $\gamma_{j,j'}$: weight for the geometric configuration between $o_j$ and $o_{j'}$

slide-20
SLIDE 20

20

Model: Human Pose

Likelihood of observing image I given the atomic pose hi

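$$\phi_4(H, I) = \sum_{i=1}^{N_h} \sum_{l=1}^{L} \mathbf{1}(H = h_i) \cdot \left( \alpha_{i,l}^{T} \cdot p(x_I^l \mid x_{h_i}^l) + \beta_{i,l}^{T} \cdot f^l(I) \right)$$

[Figure: graphical model over the activity A, human pose H, body parts P1, ..., PL, objects O1, ..., OM, and image I]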

slide-21
SLIDE 21

21

Model: Human Pose

$$\phi_4(H, I) = \sum_{i=1}^{N_h} \sum_{l=1}^{L} \mathbf{1}(H = h_i) \cdot \left( \alpha_{i,l}^{T} \cdot p(x_I^l \mid x_{h_i}^l) + \beta_{i,l}^{T} \cdot f^l(I) \right)$$

  • $p(x_I^l \mid x_{h_i}^l)$: Gaussian likelihood of observing $x_I^l$, given the standard joint location of the $l$th body part in pose $h_i$
  • $f^l(I)$: the $l$th body part detection output
  • $\alpha_{i,l}$: location weight for the $l$th body part in pose $h_i$
  • $\beta_{i,l}$: appearance weight for the $l$th body part in pose $h_i$
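
The pose term combines a location likelihood with a detector score for each body part. The Python sketch below illustrates that structure under simplifying assumptions that are not stated in the slides: an isotropic 2D Gaussian with a shared standard deviation `sigma`, and scalar weights per part in place of weight vectors.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def gaussian_likelihood(x: Point, mean: Point, sigma: float) -> float:
    """Isotropic 2D Gaussian density of location x around the template location."""
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    return norm * math.exp(-(dx * dx + dy * dy) / (2.0 * sigma ** 2))


def pose_potential(
    parts: Sequence[Point],       # observed part centers in the image
    template: Sequence[Point],    # standard part locations of one atomic pose
    part_scores: Sequence[float], # body part detector outputs
    alpha: Sequence[float],       # location weights (scalars in this sketch)
    beta: Sequence[float],        # appearance weights (scalars in this sketch)
    sigma: float = 0.2,
) -> float:
    """Sum over parts of alpha * location likelihood + beta * detector score."""
    total = 0.0
    for x, mu, f, a, b in zip(parts, template, part_scores, alpha, beta):
        total += a * gaussian_likelihood(x, mu, sigma) + b * f
    return total


template = [(0.0, 0.0), (0.5, -0.5)]
near = pose_potential([(0.05, 0.0), (0.5, -0.45)], template, [1.2, 0.9], [1.0, 1.0], [0.5, 0.5])
far = pose_potential([(0.8, 0.6), (-0.4, 0.3)], template, [1.2, 0.9], [1.0, 1.0], [0.5, 0.5])
print(near > far)
```

With identical detector scores, the configuration closer to the atomic pose template receives the higher potential.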

slide-22
SLIDE 22

22

Model: Activities

Activity classifier to model HOI activity

$$\phi_5(A, I) = \sum_{k=1}^{N_a} \mathbf{1}(A = a_k) \cdot \eta_k^{T} \cdot s(I)$$

[Figure: graphical model over the activity A, human pose H, body parts P1, ..., PL, objects O1, ..., OM, and image I]

slide-23
SLIDE 23

23

Model: Activities

$$\phi_5(A, I) = \sum_{k=1}^{N_a} \mathbf{1}(A = a_k) \cdot \eta_k^{T} \cdot s(I)$$

  • $\eta_k$: feature weight for activity $a_k$
  • $s(I)$: output of the one-versus-all discriminative classifier
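
For a fixed activity label, the activity term reduces to a dot product between that activity's weight vector and the classifier output. A short Python sketch, with made-up weights and scores used purely for illustration:

```python
from typing import Sequence


def activity_potential(activity: int, eta: Sequence[Sequence[float]], s: Sequence[float]) -> float:
    """Dot product of the weight vector of the given activity with the classifier output s(I)."""
    return sum(w * v for w, v in zip(eta[activity], s))


# Hypothetical weights for two activities and a two-dimensional score vector.
eta = [[1.0, -0.5], [-0.5, 1.0]]
s = [0.7, 0.2]
potentials = [activity_potential(k, eta, s) for k in range(len(eta))]
best = max(range(len(eta)), key=lambda k: potentials[k])
print(potentials, best)
```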

slide-24
SLIDE 24

24

Training: Atomic Poses

Hierarchical clustering from a given set of poses on training images:

  • Position and orientation of parts, compared with the distance $w^T \cdot |x_i^l - x_j^l|$ between the $l$th parts of two poses $i$ and $j$
  • Normalization to the same position/size of torso (sports) or head (music)
  • Variations in position and orientation are normalized to [-1,1]
  • Missing parts are filled from the image’s nearest neighbor
  • Atomic poses are shared by all activities
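
The clustering step above can be sketched in plain Python as bottom-up agglomerative clustering with a weighted absolute-difference distance. The choice of complete (maximum) linkage, the weight values, and the fixed number of clusters are assumptions made for this sketch; the slide only states that hierarchical clustering is used.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Part = Tuple[float, float, float]  # (x, y, orientation), each normalized to [-1, 1]
Pose = Sequence[Part]


def pose_distance(p: Pose, q: Pose, w: Sequence[float]) -> float:
    """Weighted sum of absolute differences over all parts and features."""
    return sum(
        wk * abs(a - b)
        for part_p, part_q in zip(p, q)
        for wk, a, b in zip(w, part_p, part_q)
    )


def cluster_poses(poses: Sequence[Pose], w: Sequence[float], n_clusters: int) -> List[List[int]]:
    """Merge the two closest clusters until n_clusters remain (complete linkage)."""
    clusters = [[i] for i in range(len(poses))]

    def linkage(c1: List[int], c2: List[int]) -> float:
        # Complete linkage: distance of the two most dissimilar members.
        return max(pose_distance(poses[i], poses[j], w) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters


# Four toy single-part poses forming two obvious groups.
poses = [
    [(0.0, 0.0, 0.0)],
    [(0.1, 0.0, 0.0)],
    [(1.0, 1.0, 0.5)],
    [(0.9, 1.0, 0.5)],
]
clusters = cluster_poses(poses, w=(1.0, 1.0, 1.0), n_clusters=2)
print(clusters)
```

Each resulting cluster would correspond to one atomic pose, for example represented by the mean part layout of its members.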
slide-25
SLIDE 25

25

Training: Objects and Part Detectors

Deformable Parts Model with SVM on HOG feature detectors:

  • One mixture component per body part
  • Two mixture components per object unless aspect ratios do not change
  • $g(O^m)$: value of the object detection score divided by the threshold
  • $f^l(I)$: value of the body part detection score divided by the threshold
slide-26
SLIDE 26

26

Training: Activity Classifier

Spatial Pyramid Matching method:

  • Sparse SIFT features on three layers
  • $s(I)$: a vector of confidence scores obtained from an SVM classifier
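
As an illustration of the three-level pyramid, the following Python sketch builds a spatial pyramid histogram from already quantized local features. It assumes features are given as `(x, y, word)` with coordinates in [0, 1], and it uses the level weights of the standard spatial pyramid match kernel; the slides do not state the exact weighting or vocabulary used, and the SVM stage is omitted.

```python
from typing import List, Sequence, Tuple

Feature = Tuple[float, float, int]  # (x, y, visual word index), x and y in [0, 1]


def spatial_pyramid_histogram(features: Sequence[Feature], vocab_size: int, levels: int = 3) -> List[float]:
    """Concatenate weighted visual-word histograms over 1x1, 2x2, 4x4, ... grids."""
    top = levels - 1
    hist: List[float] = []
    for level in range(levels):
        cells = 2 ** level
        # Standard pyramid match weights: coarser levels count less.
        weight = 1.0 / 2 ** top if level == 0 else 1.0 / 2 ** (top - level + 1)
        level_hist = [0.0] * (cells * cells * vocab_size)
        for x, y, word in features:
            cx = min(int(x * cells), cells - 1)  # clamp x == 1.0 into the last cell
            cy = min(int(y * cells), cells - 1)
            level_hist[(cy * cells + cx) * vocab_size + word] += weight
        hist.extend(level_hist)
    return hist


features = [(0.1, 0.1, 0), (0.9, 0.9, 1), (0.6, 0.2, 0)]
hist = spatial_pyramid_histogram(features, vocab_size=2)
print(len(hist), sum(hist))
```

The resulting vector would then be fed to a classifier that produces the per-activity confidence scores.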
slide-27
SLIDE 27

27

Training: Estimating Model Parameters

Conditional Random Field with no hidden variables:

  • Model parameters: $\zeta$, $\lambda$, $\gamma$, $\alpha$, $\beta$, $\eta$
  • Maximum likelihood approach
  • Zero-mean Gaussian priors
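
For reference, a fully observed conditional random field trained by maximum likelihood with zero-mean Gaussian priors is commonly written as the objective below. This is the generic textbook form, not a formula taken from the slides; $\theta$ stands for all model weights stacked together, $t$ indexes training images, and the prior variance $\sigma^2$ is an assumed hyperparameter.

$$\theta^{*} = \arg\max_{\theta} \sum_{t} \log p(A_t, O_t, H_t \mid I_t, \theta) - \frac{\lVert \theta \rVert^{2}}{2\sigma^{2}}, \qquad p(A, O, H \mid I, \theta) = \frac{\exp \Psi(A, O, H, I)}{\sum_{A', O', H'} \exp \Psi(A', O', H', I)}$$

The Gaussian prior appears as an L2 penalty on the weights, which discourages overfitting when training images are few.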
slide-28
SLIDE 28

28

Inference: Iterative Process

Initialization:

  • Activity classification with the SPM classifier
  • Object bounding boxes from independent object detectors (scores >0.9)
  • Initial pose from a pictorial structure model trained on all training images

Two Iterations:

  • Updating the layout of human body parts: update the Gaussian priors for part locations using the marginal probabilities of the atomic poses
  • Updating object detection results: greedy forward search
  • Updating the activity and atomic pose labels: maximize the overall sum by enumerating all possible values for actions and human poses
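
Two of the update steps above lend themselves to a short Python sketch: a greedy forward search over candidate object detections, and exhaustive enumeration of activity and atomic pose labels. The scoring functions here are toy stand-ins (hypothetical) for the full sum of potentials with the other variables held fixed, and the candidate names and table values are invented for illustration.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple


def greedy_forward_search(candidates: Sequence[str], score: Callable[[List[str]], float]) -> List[str]:
    """Start from an empty selection and keep adding the detection with the largest positive gain."""
    selected: List[str] = []
    current = score(selected)
    remaining = list(candidates)
    while remaining:
        gain, best = max(
            ((score(selected + [c]) - current, c) for c in remaining),
            key=lambda g: g[0],
        )
        if gain <= 0:
            break  # no remaining candidate improves the score
        selected.append(best)
        remaining.remove(best)
        current += gain
    return selected


def best_activity_and_pose(n_activities: int, n_poses: int, score: Callable[[int, int], float]) -> Tuple[int, int]:
    """Enumerate every (activity, pose) pair and return the highest-scoring one."""
    return max(product(range(n_activities), range(n_poses)), key=lambda ah: score(ah[0], ah[1]))


# Toy object score: sum of per-detection scores (no pairwise terms in this sketch).
unary = {"racket": 0.8, "ball": 0.3, "bat": -0.4}
objects = greedy_forward_search(list(unary), lambda sel: sum(unary[o] for o in sel))

# Toy joint score table indexed as table[activity][pose].
table = [[0.2, 1.4, 0.1], [0.9, 0.3, 0.5]]
activity, pose = best_activity_and_pose(2, 3, lambda a, h: table[a][h])
print(objects, activity, pose)
```

Enumeration is affordable in this setting because the numbers of activities and atomic poses are small, whereas the set of candidate object boxes is handled greedily.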

slide-29
SLIDE 29

29

Results: Examples for Testing Images

slide-30
SLIDE 30

30

Results: Sports – Object Detection

  • Better overall performance across all objects
  • Better discrimination of similar objects (cricket ball vs. croquet ball)
slide-31
SLIDE 31

31

Results: Sports – Human Pose Estimation

  • Better overall performance across all poses
  • Outperforms even the pictorial structure model trained on separate classes
slide-32
SLIDE 32

32

Results: Sports – Activity Classification

  • Better overall performance
  • Outperforms SPM alone by about 4%
slide-33
SLIDE 33

33

Results: Music – Object Detection

  • Better overall performance across all objects
  • Larger improvement in “playing instrument” situations, where context plays a more important role

slide-34
SLIDE 34

34

Results: Music – Object Detection

  • Demonstration of the importance of human poses for object detection
slide-35
SLIDE 35

35

Results: Music – Human Pose Estimation

  • Better performance for poses with “playing instrument”
  • Only marginally better for poses with “not playing instrument”
  • No significant improvement as compared to Pictorial Structure model
slide-36
SLIDE 36

36

Results: Music – Activity Classification

  • Better overall performance as compared to SPM and grouplet approach