 
Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses
Maryam Daneshi, Konstantin Bayandin
May 28th, 2013
Agenda
• Introduction & Motivation
• Dataset description
• Model
• Training
• Inference
• Results
Context and Recognition: The human visual system uses context for recognition.
Human-Object Interaction (HOI)
Human Poses and Objects: Human pose estimation is challenging.
• Unusual part appearances
• Self-occlusion
• Image patches that look like body parts
Human Poses and Objects: Given the object is detected.
Human Poses and Objects: Object detection is challenging.
• Objects are small, low-resolution, or partially occluded
• Image regions can look similar to the detection target
Human Poses and Objects: Given the pose is estimated.
Datasets - Sports: Images of six sports activities
Datasets - PPMI: People interacting with 12 classes of musical instruments
Atomic poses – pose dictionary
Mutual Context Model
• Goal: Estimate the human pose and detect the objects that the human interacts with
  – Occluded or small objects
  – Articulated human poses
  – Variation of poses within one class of activity
• Conditional random field model
• Human interacting with any number of objects
Model
$$\psi(A, O, H, I) = \phi_1(A, O, H) + \phi_2(H, O) + \phi_3(O, I) + \phi_4(H, I) + \phi_5(A, I)$$
• $\phi_1$: co-occurrence context
• $\phi_2$: spatial context
• $\phi_3$: modeling objects
• $\phi_4$: modeling human pose
• $\phi_5$: modeling activity
[Figure: graphical model linking the activity $A$, the human pose $H$, the objects $O^1, \dots, O^M$, the body parts $P^1, \dots, P^L$, and the image $I$ of the human-object interaction]
Model: Co-occurrence Context
Compatibility between actions, objects, and human poses:
$$\phi_1(A, O, H) = \sum_{i=1}^{N_h} \sum_{m=1}^{M} \sum_{j=1}^{N_o} \sum_{k=1}^{N_a} \mathbf{1}(H = h_i) \cdot \mathbf{1}(O^m = o_j) \cdot \mathbf{1}(A = a_k) \cdot \zeta_{i,j,k}$$
Model: Co-occurrence Context
$$\phi_1(A, O, H) = \sum_{i=1}^{N_h} \sum_{m=1}^{M} \sum_{j=1}^{N_o} \sum_{k=1}^{N_a} \mathbf{1}(H = h_i) \cdot \mathbf{1}(O^m = o_j) \cdot \mathbf{1}(A = a_k) \cdot \zeta_{i,j,k}$$
• $N_h$: total number of atomic poses; $h_i$: the $i$th atomic pose
• $N_o$: total number of objects; $o_j$: the $j$th object
• $N_a$: total number of activities; $a_k$: the $k$th activity
• $\zeta_{i,j,k}$: strength of the co-occurrence interaction
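Because each indicator selects exactly one pose, one activity, and one object label per bounding box, the co-occurrence term reduces to a table lookup summed over the boxes. The following is a minimal Python sketch of that reading; the function name, the nested-list layout of the weights, and all numeric values are illustrative assumptions, not taken from the slides.

```python
def cooccurrence_potential(pose, objects, activity, zeta):
    """Sum zeta[i][j][k] over the object labels of all bounding boxes.

    pose     -- index i of the atomic pose assigned to H
    objects  -- list of object indices j, one per bounding box
    activity -- index k of the activity assigned to A
    zeta     -- nested list zeta[i][j][k] (assumed storage layout)
    """
    return sum(zeta[pose][obj][activity] for obj in objects)


# Toy weights: 2 atomic poses x 2 objects x 2 activities (made-up values).
zeta = [
    [[0.9, 0.1], [0.2, 0.0]],
    [[0.0, 0.3], [0.1, 0.8]],
]
score = cooccurrence_potential(pose=0, objects=[0, 1], activity=0, zeta=zeta)
```

Here `score` adds the entries for (pose 0, object 0, activity 0) and (pose 0, object 1, activity 0).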
Model: Spatial Context
Spatial relationship between the object and different body parts of the human:
$$\phi_2(H, O) = \sum_{m=1}^{M} \sum_{i=1}^{N_h} \sum_{j=1}^{N_o} \sum_{l=1}^{L} \mathbf{1}(H = h_i) \cdot \mathbf{1}(O^m = o_j) \cdot \lambda_{i,j,l}^{T} \cdot b(x_I^l, O^m)$$
Model: Spatial Context
$$\phi_2(H, O) = \sum_{m=1}^{M} \sum_{i=1}^{N_h} \sum_{j=1}^{N_o} \sum_{l=1}^{L} \mathbf{1}(H = h_i) \cdot \mathbf{1}(O^m = o_j) \cdot \lambda_{i,j,l}^{T} \cdot b(x_I^l, O^m)$$
• $x_I^l$: location of the center of the human's $l$th body part in image $I$
• $b(x_I^l, O^m)$: spatial relationship between $x_I^l$ and the $m$th object bounding box; a sparse binary vector with a single 1
• $\lambda_{i,j,l}$: weight for the relationship
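One way to picture the sparse binary relationship vector is as a one-hot code over a small set of relative positions. The sketch below assumes a 3x3 grid of offsets between a body part and the center of an object box; the grid, the `cell` size in pixels, the function names, and the weights are all hypothetical choices made for illustration, and the slides do not specify the binning.

```python
def spatial_bin_vector(part_xy, box_center_xy, cell=50.0):
    """One-hot vector over a 3x3 grid of relative positions (assumed binning)."""
    dx = box_center_xy[0] - part_xy[0]
    dy = box_center_xy[1] - part_xy[1]

    def bin_index(d):
        # 0: clearly left/above, 1: roughly aligned, 2: clearly right/below
        if d < -cell / 2:
            return 0
        if d > cell / 2:
            return 2
        return 1

    one_hot = [0] * 9
    one_hot[3 * bin_index(dy) + bin_index(dx)] = 1
    return one_hot


def spatial_term(weights, b):
    """Inner product of one weight vector with the binary relationship vector."""
    return sum(w * x for w, x in zip(weights, b))


# Object center 60 px to the right of the body part, 5 px above it.
b = spatial_bin_vector(part_xy=(100.0, 100.0), box_center_xy=(160.0, 95.0))
weights = [0.0, 0.0, 0.0, 0.0, 0.0, 0.7, 0.0, 0.0, 0.0]  # made-up values
term = spatial_term(weights, b)
```

Since the vector contains a single 1, the inner product simply picks out the weight of the active spatial bin.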
Model: Objects
Modeling objects using the detection scores in all the object bounding boxes and the spatial relationships between these boxes:
$$\phi_3(O, I) = \sum_{m=1}^{M} \sum_{j=1}^{N_o} \mathbf{1}(O^m = o_j) \cdot \gamma_j^{T} \cdot g(O^m) + \sum_{m=1}^{M} \sum_{m'=1}^{M} \sum_{j=1}^{N_o} \sum_{j'=1}^{N_o} \mathbf{1}(O^m = o_j) \cdot \mathbf{1}(O^{m'} = o_{j'}) \cdot \gamma_{j,j'}^{T} \cdot b(O^m, O^{m'})$$
Model: Objects
$$\phi_3(O, I) = \sum_{m=1}^{M} \sum_{j=1}^{N_o} \mathbf{1}(O^m = o_j) \cdot \gamma_j^{T} \cdot g(O^m) + \sum_{m=1}^{M} \sum_{m'=1}^{M} \sum_{j=1}^{N_o} \sum_{j'=1}^{N_o} \mathbf{1}(O^m = o_j) \cdot \mathbf{1}(O^{m'} = o_{j'}) \cdot \gamma_{j,j'}^{T} \cdot b(O^m, O^{m'})$$
• $g(O^m)$: vector of scores of all detected objects in the $m$th box
• $\gamma_j$: the detection score weight for the $j$th object
• $b(O^m, O^{m'})$: binary vector of the spatial relationship between pairs of objects
• $\gamma_{j,j'}$: weight for the geometric configuration between $o_j$ and $o_{j'}$ [Desai et al., 2009]
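For a fixed labeling of the boxes, the object term is a sum of per-box detection scores plus a sum of pairwise geometric scores. A minimal Python sketch under stated assumptions: the data layout, the function names, and the numbers are hypothetical, and self-pairs (a box paired with itself) are skipped here because they carry no geometric information, which is a simplification rather than something the slides state.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def object_potential(labels, scores, pair_feats, gamma_unary, gamma_pair):
    """Unary detection term plus pairwise geometry term for labeled boxes.

    labels[m]         -- object index assigned to box m
    scores[m]         -- detection score vector of box m
    pair_feats[m][m2] -- binary spatial-relationship vector for boxes m, m2
    gamma_unary[j]    -- weights on the score vector for object j
    gamma_pair[j][j2] -- weights on the relationship vector for objects j, j2
    """
    num_boxes = len(labels)
    total = 0.0
    for m in range(num_boxes):
        total += dot(gamma_unary[labels[m]], scores[m])
    for m in range(num_boxes):
        for m2 in range(num_boxes):
            if m == m2:
                continue  # assumption: ignore a box paired with itself
            total += dot(gamma_pair[labels[m]][labels[m2]], pair_feats[m][m2])
    return total


# Two boxes, two object classes, two-dimensional relationship vectors.
labels = [0, 1]
scores = [[0.8, 0.1], [0.2, 0.9]]
pair_feats = [[None, [1, 0]], [[0, 1], None]]
gamma_unary = [[1.0, 0.0], [0.0, 1.0]]
gamma_pair = [
    [[0.0, 0.0], [0.5, 0.0]],
    [[0.0, 0.25], [0.0, 0.0]],
]
total = object_potential(labels, scores, pair_feats, gamma_unary, gamma_pair)
```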
Model: Human Pose
Likelihood of observing image $I$ given the atomic pose $h_i$:
$$\phi_4(H, I) = \sum_{i=1}^{N_h} \sum_{l=1}^{L} \mathbf{1}(H = h_i) \cdot \left( \alpha_{i,l}^{T} \cdot p(x_I^l \mid x_{h_i}^l) + \beta_{i,l}^{T} \cdot f^l(I) \right)$$
Model: Human Pose
$$\phi_4(H, I) = \sum_{i=1}^{N_h} \sum_{l=1}^{L} \mathbf{1}(H = h_i) \cdot \left( \alpha_{i,l}^{T} \cdot p(x_I^l \mid x_{h_i}^l) + \beta_{i,l}^{T} \cdot f^l(I) \right)$$
• $p(x_I^l \mid x_{h_i}^l)$: Gaussian likelihood of observing $x_I^l$, given the standard joint location of the $l$th body part in pose $h_i$
• $f^l(I)$: the $l$th body part detection output
• $\alpha_{i,l}$: location weight for the $l$th body part in pose $h_i$
• $\beta_{i,l}$: appearance weight for the $l$th body part in pose $h_i$
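The human pose term combines, for each body part, a location score (how likely the observed part location is under the atomic pose) and an appearance score (the part detector output). The Python sketch below is a simplified illustration: it assumes an isotropic 2-D Gaussian with a hand-picked `sigma` and treats the location and appearance weights as scalars per part. The function names and all values are hypothetical.

```python
import math


def gaussian_likelihood(x, mean, sigma):
    """Isotropic 2-D Gaussian density of location x around mean."""
    d2 = (x[0] - mean[0]) ** 2 + (x[1] - mean[1]) ** 2
    return math.exp(-d2 / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)


def pose_potential(part_locs, part_scores, pose_means, alpha, beta, sigma=1.0):
    """Sum over body parts of location and appearance scores for one pose.

    part_locs[l]   -- observed (x, y) of body part l in the image
    part_scores[l] -- detector output for body part l
    pose_means[l]  -- standard (x, y) of body part l in the atomic pose
    alpha[l]       -- location weight (scalar here, an assumption)
    beta[l]        -- appearance weight (scalar here, an assumption)
    """
    total = 0.0
    for l, (x, f) in enumerate(zip(part_locs, part_scores)):
        total += alpha[l] * gaussian_likelihood(x, pose_means[l], sigma)
        total += beta[l] * f
    return total


# One body part observed exactly at its standard location (toy values).
value = pose_potential(
    part_locs=[(0.0, 0.0)],
    part_scores=[2.0],
    pose_means=[(0.0, 0.0)],
    alpha=[1.0],
    beta=[0.5],
)
```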
Model: Activities
Activity classifier to model the HOI activity:
$$\phi_5(A, I) = \sum_{k=1}^{N_a} \mathbf{1}(A = a_k) \cdot \eta_k^{T} \cdot s(I)$$
Model: Activities
$$\phi_5(A, I) = \sum_{k=1}^{N_a} \mathbf{1}(A = a_k) \cdot \eta_k^{T} \cdot s(I)$$
• $\eta_k$: feature weight for activity $a_k$
• $s(I)$: output of the one-versus-all discriminative classifier
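The activity term selects the weight vector of the hypothesized activity and takes its inner product with the classifier outputs. A minimal Python sketch with made-up numbers; `activity_potential` and the toy weights are illustrative names and values, not taken from the slides.

```python
def activity_potential(activity, eta, s):
    """Inner product of the chosen activity's weights with the classifier output."""
    return sum(w * x for w, x in zip(eta[activity], s))


# One-versus-all classifier confidences for 3 activities (made-up values).
s = [0.25, 1.5, -0.5]
# eta[k]: feature weights for activity k (made-up, here the identity matrix).
eta = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

scores = [activity_potential(k, eta, s) for k in range(3)]
best_activity = max(range(3), key=lambda k: scores[k])
```

With identity weights the term simply returns the classifier confidence of the hypothesized activity; learned weights would let the model re-weight confusable activities.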
Training: Atomic Poses
Hierarchical clustering of a given set of poses from the training images:
• Position and orientation of parts, compared with the distance $w^T \cdot |x_i^l - x_j^l|$ for the $l$th part of poses $i$ and $j$
• Normalization to the same position/size of torso (sports) or head (music)
• Variations in position and orientation are normalized to [-1, 1]
• Missing parts are filled in from the image's nearest neighbor
• Atomic poses are shared by all activities
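The clustering step can be sketched as a weighted absolute-difference distance between normalized poses followed by agglomerative merging. The Python example below is a naive illustration only: average linkage, the weight vector `w`, the (x, y, orientation) part encoding, and the toy poses are assumptions, and the slides do not say which linkage or stopping rule was used.

```python
def pose_distance(pose_a, pose_b, w=(1.0, 1.0, 0.5)):
    """Weighted absolute difference summed over body parts.

    Each pose is a list of parts; each part is (x, y, orientation),
    assumed to be already normalized to [-1, 1].
    """
    return sum(
        sum(wk * abs(a - b) for wk, a, b in zip(w, part_a, part_b))
        for part_a, part_b in zip(pose_a, pose_b)
    )


def cluster_poses(poses, num_clusters):
    """Naive agglomerative clustering with average linkage (assumed choice)."""
    clusters = [[i] for i in range(len(poses))]

    def linkage(c1, c2):
        total = sum(pose_distance(poses[a], poses[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > num_clusters:
        # Find the closest pair of clusters and merge them.
        _, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


# Four single-part poses forming two obvious groups (toy data).
poses = [
    [(0.0, 0.0, 0.0)],
    [(0.1, 0.0, 0.0)],
    [(0.9, 0.9, 0.0)],
    [(1.0, 1.0, 0.0)],
]
atomic_pose_groups = cluster_poses(poses, num_clusters=2)
```

Each resulting group of pose indices would correspond to one atomic pose in the dictionary.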
Training: Objects and Part Detectors
Deformable part model detectors, trained with SVMs on HOG features:
• One mixture component per body part
• Two mixture components per object unless aspect ratios do not change
• $g(O^m)$: value of the object detection score divided by the threshold
• $f^l(I)$: value of the body part detection score divided by the threshold
Training: Activity Classifier
Spatial Pyramid Matching (SPM) method:
• Sparse SIFT features on three layers
• $s(I)$: a vector of confidence scores obtained from an SVM classifier
Training: Estimating Model Parameters
Conditional random field with no hidden variables:
• $\zeta, \lambda, \gamma, \alpha, \beta, \eta$: model parameters
• Maximum likelihood approach
• Zero-mean Gaussian priors
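Written out, maximum likelihood with a zero-mean Gaussian prior typically takes the form below. This is a sketch of the standard regularized CRF objective rather than a formula taken from the slides; $\theta$ (all weights stacked into one vector), $\sigma$ (the prior's standard deviation), the training-image index $n$, and $\psi$ (the overall model score) are notational assumptions.

```latex
\theta^{*} = \arg\max_{\theta} \sum_{n} \log p(A_n, O_n, H_n \mid I_n; \theta)
             - \frac{\lVert \theta \rVert^{2}}{2\sigma^{2}},
\qquad
p(A, O, H \mid I; \theta) \propto \exp\bigl(\psi(A, O, H, I)\bigr)
```

The Gaussian prior appears as the quadratic penalty on $\theta$, and since all labels are observed during training, no marginalization over hidden variables is needed inside the log-likelihood.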
Inference: Iterative Process
Initialization:
• Action classification with the SPM classifier
• Object bounding boxes from independent object detectors (scores > 0.9)
• Initial pose from a pictorial structure model trained on all training images
Two iterations:
• Updating the layout of human body parts: update the Gaussian priors for part locations with the marginal probabilities of the poses
• Updating object detection results: greedy forward search
• Updating the activity and atomic pose labels: maximize the overall score by enumerating all possible values for actions and human poses
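Two of the update steps lend themselves to a short illustration: the greedy forward search over object detections and the exhaustive enumeration of activity and pose labels. The Python sketch below shows generic versions of both. The scoring callables, the candidate names, and the numbers are hypothetical stand-ins for the full model score, and the stopping rule (stop when no candidate improves the score) is an assumption.

```python
from itertools import product


def greedy_forward_search(candidates, score_fn):
    """Add one candidate detection at a time while the total score improves."""
    selected = []
    best = score_fn(selected)
    remaining = list(candidates)
    while remaining:
        scored = [(score_fn(selected + [c]), c) for c in remaining]
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score <= best:
            break  # assumption: stop when no candidate improves the score
        selected.append(top)
        remaining.remove(top)
        best = top_score
    return selected, best


def best_activity_and_pose(num_activities, num_poses, score_fn):
    """Exhaustively maximize score_fn(activity, pose) over all label pairs."""
    return max(
        product(range(num_activities), range(num_poses)),
        key=lambda pair: score_fn(*pair),
    )


# Toy object scores: two helpful detections and one harmful one.
unary = {"ball": 0.6, "bat": 0.4, "racket": -0.3}
detections, detection_score = greedy_forward_search(
    list(unary), lambda chosen: sum(unary[c] for c in chosen)
)

# Toy joint score table indexed [activity][pose].
table = [[0.2, 0.5], [0.9, 0.1]]
labels = best_activity_and_pose(2, 2, lambda a, h: table[a][h])
```

Enumeration stays cheap because the number of activities and atomic poses is small, whereas the set of possible object configurations grows combinatorially, which is why a greedy search is used for that step.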
Results: Examples on Test Images
Results: Sports – Object Detection
• Better overall performance across all objects
• Better discrimination of similar objects (cricket ball vs. croquet ball)
Results: Sports – Human Pose Estimation
• Better overall performance across all poses
• Outperforms even the pictorial structure model trained on separate classes
Results: Sports – Activity Classification
• Better overall performance
• About 4% better than SPM alone
Results: Music – Object Detection
• Better overall performance across all objects
• Larger improvement for “playing instrument” situations, where context plays a more important role
Results: Music – Object Detection
• Demonstration of the importance of human poses for object detection
Results: Music – Human Pose Estimation
• Better performance for poses with “playing instrument”
• Only marginally better for poses with “not playing instrument”
• No significant improvement compared to the pictorial structure model
Results: Music – Activity Classification
• Better overall performance compared to SPM and the grouplet approach