[PPT] - and more Yang Wu Nara Institute of Science and Technology PowerPoint Presentation

SLIDE 1

Understanding humans: identity, communication, state, and more

Yang Wu （伍洋） Nara Institute of Science and Technology 奈良先端科学技术大学院大学

1

SLIDE 2

For helping a person

Robot Society Service

NAIST International Collaborative Laboratory for Robotics Vision

2

SLIDE 3

the system needs to understand the person

Robot Society Service

NAIST International Collaborative Laboratory for Robotics Vision

3

SLIDE 4

Computer Vision Action/Intension Understanding

E.g. Progress of cooking, busyness

Robots Proper Supporting Actions

E.g. Directly doing it, or asking for help

Augmented Reality Guidance, Info., and Showing Robots’ Intension

E.g. Choosing what to show and how to show it.

NAIST International Collaborative Laboratory for Robotics Vision

A possible application scenario

4

SLIDE 5

Communication

(What [does he/she want]? How [does he/she feel]?)

Identity

(Who?)

NAIST International Collaborative Laboratory for Robotics Vision

State, Action, ...

(What [is he/she doing]? How [does he/she do it]?)

Explicit expression Implicit expression

5

SLIDE 6

Head Gesture Recognition 3D Hand Tracking Across-camera Person Re-identification NAIST International Collaborative Laboratory for Robotics Vision

6

SLIDE 7

Person re-identification (Re-ID)

7

Identity: in-a-distance and unobtrusive

To look for a specific person in a camera network

?

SLIDE 8

8

Re-ID in the Context of Video Surveillance

Camera Sensors Multi-camera tracking (Across-camera tracing) Multi-camera activity analysis Storage and Networking Person/Object re-identification Monitoring System Summarization Camera Tampering Motion Face Detection Human/Object Detection People Counting Tailgating Left Behind Human/Object Tracking Compressing, Enhancing, and Irregularity Detection Superresolution Intrusion Loitering Personalized Services Statistics and ROIs Infrastructure Single Camera Applications - Camera Network Applications -

Intelligent Video Surveillance Industry Development - Figure 1. Position of re-identification in the intelligent video surveillance industry.

. . . . . . . . .

SLIDE 9

9

Single-shot

(a) Two camera views

Multiple-shot “multiple-shot” is more generic and useful Problem Introduction: Subtypes and Our Focus

... ...

(b) Images of sampled individual persons

Our main interest!

SLIDE 10

Single-shot: Looking at the “Pose”

Pose Normalization

[ECCV 2018] Qian et. al., “Pose-Normalized Image Generation for Person Re-identification”.

Pose Adaptation

[submitted to AAAI 2019] Qiu et. al., “Pose-adaptive Image Generation for Person Re- identification”.

10

SLIDE 11

Key challenges

Environmental challenges Others

Body movements Camera viewpoints Occlusions Background Illumination

Clothes

Accessories

pose variations

Challenges

11

SLIDE 12

Motivation

1. Lack of cross-view paired training data

12

SLIDE 13

Motivation

2. Identity-sensitive and View-invariant representation

Identity A Identity B

Same ID Same ID

Diff. IDs

One example

13

SLIDE 14

Key idea: Eliminating the pose differences

(may be a little difficult)

?

imagine (may be easier)

Proposal

[ECCV 2018] Qian et. al., “Pose-Normalized Image Generation for Person Re-identification”.

14

SLIDE 15

Network (PN-GAN)

[ECCV 2018] Qian et. al., “Pose-Normalized Image Generation for Person Re-identification”.

15

SLIDE 16

1. Pose estimation – OpenPose [1]; 2. Feature extraction – ResNet-50;
3. Pose clustering – K-means.

[1] Cao,Z.,Simon,T.,Wei,S.E.,Sheikh,Y.:Realtimemulti-person2dposeestimation using part affinity fields. In: CVPR (2017)

Network (eight canonical poses)

[ECCV 2018] Qian et. al., “Pose-Normalized Image Generation for Person Re-identification”.

16

SLIDE 17

Network (framework)

Image Generation Feature Extraction Feature Fusion

Features from original images Features from generated images

17

[ECCV 2018] Qian et. al., “Pose-Normalized Image Generation for Person Re-identification”.

SLIDE 18

Visualization

[ECCV 2018] Qian et. al., “Pose-Normalized Image Generation for Person Re-identification”. 18

SLIDE 19

Visualization

[ECCV 2018] Qian et. al., “Pose-Normalized Image Generation for Person Re-identification”.

19

SLIDE 20

Code

https://github.com/naiq/PN_GAN

20

[ECCV 2018] Qian et. al., “Pose-Normalized Image Generation for Person Re-identification”.

SLIDE 21

Generating data with an arbitrary pose for any specific person.
Enhancing the generation with Re-ID specific losses
Forcing the ReID model to be pose invariant.

Another Strategy: Pose Adaptation

21

Conditioned Image Conditioned Image SG-DGAN Results SG-DGAN Results Conditioned Image SG-DGAN Results

[submitted to AAAI 2019] Qiu et. al., “Pose-adaptive Image Generation for Person Re-identification”.

SLIDE 22

Video-based ReID:

Perspectives of Set and Sequence

Set Sequence

22

… …

SLIDE 23

Set: Robustness and Flexibility of Geometry

23

SLIDE 24

24

Multiple-shot Re-ID: A Set-based Perspective

… … …

…

1 g

S

2 g

S

c g

S

Training

…

1 p

S

c p

S

Probe Gallery

SLIDE 25

25

Multiple-shot Re-ID: A Set-based Perspective

Who?

… … …

…

1 g

S

2 g

S

n g

S

Testing

…

i p

S

Probe Gallery

SLIDE 26

Set-to-set distance + metric learning

26

One direction  Parametric methods

[ECCV 2012] Yang Wu, et al., "Set based discriminative ranking for recognition".

SLIDE 27

27

Q

i

X

j

X Q

j

X Q

i

X

j

X Q

i

X

j

X

 

,

S S W j

d Q X



 

,

S S W i

d Q X



 

,

S S i W

d Q X



 

,

S S j W

d Q X



(a) Original query and gallery sets (b) Between-set geometric distance finding (c) Metric (space) learning (d) Learned space and distances

i

X

Q

i

X

i

X

n

X

1

X

i

X

n

X

1

X

n

X

1

X

Q Q

(1) Original query and gallery sets (2) Mapped sets in the learned metric space (3) Between-set distance based classification/ranking

Q

i

X

Match!

i

X

n

X

1

X

Ranking

Testing Stage: Testing Stage: Training Stage: Training Stage:

W

Metric

Set-to-set distance + metric learning

One direction  Parametric methods

[ECCV 2012] Yang Wu, et al., "Set based discriminative ranking for recognition".

SLIDE 28

Collaborative representation

(a) Set-to-set distances (b) Set-to-sets distance

Y

1

X

i

X

n

X Y

i

X

n

X

1

X Y Y

28

Another direction  Nonparametric methods (MPD, AHISD/CHISD, SANP/KSANP , RNP) (CSA, CRNP , LCSA, LCRNP ,CMA)

- Query/Probe Set

, {1, , }

i i

n   X

- Gallery Sets

Y

[AVSS 2012] [BMVC 2013] [ACPR 2014] [FCV 2014] [MIRU 2014] Yang Wu, et al.

SLIDE 29

29

Sparse representation based classification

J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, and Y. Ma., Robust Face Recognition

via Sparse Representation. IEEE TPAMI, 31(2):210–227, 2009. Collaborative Representation for Re-ID  Related Work

2 1 2 1

ˆ argmin + .   

α

α y Xα α  

2 2

ˆ , ,

i i i

r i    y y X α

   

arg min .

i i

C r  y y

SLIDE 30

30

Results (Sparse model)

Collaborative Representation for Re-ID  Sparse CR

Yang Wu, et al, "Collaborative Sparse Approximation for Multiple-shot Across-camera Person Re-identification", 9th IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS), 2012.

SLIDE 31

31

Face recognition accuracy (%) comparison on the Honda/UCSD dataset. Face recognition accuracy (%) comparison on the CMU MoBo dataset. Performance comparison for person re-identification on three benchmark datasets. Collaborative Representation for Re-ID  Non-sparse CR

Yang Wu, Michihiko Minoh, Masayuki Mukunoki, "Collaboratively Regularized Nearest Points for Set Based Recognition", InProc. of The 24th British Machine Vision Conference (BMVC), 2013.

Results (Non-sparse model)

SLIDE 32

32

Results (Non-sparse model)

Computational cost

For those methods which can have (parts of) their models pre-computed using the training data, the total pre-computation time (in seconds) is listed for comparison. Computational cost comparison with all the related methods on all of the recognition tasks (in the ``milliseconds per sample'' manner, excluding the time for feature extraction).

Collaborative Representation for Re-ID  Non-sparse CR

Yang Wu, Michihiko Minoh, Masayuki Mukunoki, "Collaboratively Regularized Nearest Points for Set Based Recognition", InProc. of The 24th British Machine Vision Conference (BMVC), 2013.

SLIDE 33

33

Collaboratively Regularized Nearest Points

Distance finding optimization

 

2 2 2 1 2 2 2 , 2

min ,      

α β

z Qα Xβ α β

Iterative Optimization:

Fix , and optimize :

β α

Fix , and optimize :

β α

*

( ),

q

  α P z Xβ

with

1 1 )

. (

T T q





  P Q Q I Q

*

( ),

x

  β P z Qα

with

1 2 )

. (

T T x





  P X X I X

One-step closed- form solution?

Yes!

But,

- it is expensive,
- the whole
ptimization is needed

for each query/probe set.

Collaborative Representation for Re-ID  Non-sparse CR

Yang Wu, Michihiko Minoh, Masayuki Mukunoki, "Collaboratively Regularized Nearest Points for Set Based Recognition", InProc. of The 24th British Machine Vision Conference (BMVC), 2013.

SLIDE 34

34

Collaboratively Regularized Nearest Points

Classification

* * * 1

[ , , ]

n

  β β β

Like sparse/collaborative representation models for single-instance based recognition, here the set-specific coefficients is implicitly made to have some discrimination power. Therefore, we design our classification model as follows.

 

2 2 * * * * * 2 2

· . /

i CRNP i i i i

d    Q X Qα X β β

 

 

arg min ,

i CRNP i

C d  Q

where Recall that RNP doesn’t directly use the coefficients themselves which are actually also discriminative.

 

2 * * * * 2

· ,

i RNP i i

d    Q X Qα X β

Collaborative Representation for Re-ID  Non-sparse CR

Yang Wu, Michihiko Minoh, Masayuki Mukunoki, "Collaboratively Regularized Nearest Points for Set Based Recognition", InProc. of The 24th British Machine Vision Conference (BMVC), 2013.

SLIDE 35

LCSA (Locality-constrained Collaborative Sparse Approximation)

(a) SANP (b) CSA (c) LCSAwNN (d) LCSAwMPD p

X

1 g

X

g i

X

g n

X

p

X

1 g

X

g i

X

g n

X

p

X

1 g

X

g i

X

g n

X

p

X

p

X

p

X

1 g

X

g i

X

g n

X

Collaborative Representation for Re-ID  Sparse CR

Yang Wu, Michihiko Minoh, Masayuki Mukunoki, "Locality-constrained Collaborative Sparse Approximation for Multiple-shot Person Re-identification", In Proc. of The Asian Conference on Pattern Recognition (ACPR), 2013.

35

SLIDE 36

36

Experimental Results

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0.45 0.5 0.55 0.6 0.65 0.7 Locality ratio Accuracy at rank top 10%

Performance changes on the "iLIDS-AA" dataset

LCSAwNN, N=10 LCSAwNN, N=23 LCSAwNN, N=46 LCSAwMPD, N=10 LCSAwMPD, N=23 LCSAwMPD, N=46

Collaborative Representation for Re-ID  Sparse CR

Yang Wu, Michihiko Minoh, Masayuki Mukunoki, "Locality-constrained Collaborative Sparse Approximation for Multiple-shot Person Re-identification", In Proc. of The Asian Conference on Pattern Recognition (ACPR), 2013.

SLIDE 37

37

LCRNP (Locality-constrained Collaboratively Regularized Nearest Points) Collaborative Representation for Re-ID  Non-sparse CR

(a) LCSAwNN (b) LCSAwMPD p

X

1 g

X

g i

X

g n

X

p

X

1 g

X

g i

X

g n

X

(c) LCRNPwNN (d) LCRNPwMPD p

X

1 g

X

g i

X

g n

X

p

X

1 g

X

g i

X

g n

X

Yang Wu, et al., "Locality-constrained Collaboratively Regularized Nearest Points for Multiple-shot Person Re-identification", FCV 2014.

Sparse Non-sparse

SLIDE 38

38

Experimental results for LCRNP , in comparison with the others

Yang Wu, et al., "Locality-constrained Collaboratively Regularized Nearest Points for Multiple-shot Person Re-identification", FCV 2014.

Collaborative Representation for Re-ID  Non-sparse CR

1 2 3 4 0.4 0.5 0.6 0.7 0.8 0.9 Rank Recognition percentage

CMC on the "iLIDS-MA" dataset with N=10

CSA (0.700) LCSAwNN (0.750) LCSAwMPD (0.780) CRNP (0.777) LCRNPwNN (0.787) LCRNPwMPD (0.798) 1 2 3 4 0.4 0.5 0.6 0.7 0.8 0.9 Rank Recognition percentage

CMC on the "iLIDS-MA" dataset with N=23

CSA (0.732) LCSAwNN (0.768) LCSAwMPD (0.787) CRNP (0.790) LCRNPwNN (0.815) LCRNPwMPD (0.838) 1 2 3 4 0.4 0.5 0.6 0.7 0.8 0.9 Rank Recognition percentage

CMC on the "iLIDS-MA" dataset with N=46

CSA (0.725) LCSAwNN (0.800) LCSAwMPD (0.825) CRNP (0.775) LCRNPwNN (0.850) LCRNPwMPD (0.875) 1 2 3 4 5 6 7 8 9 10 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Rank Recognition percentage

CMC on the "iLIDS-AA" dataset with N=10

CSA (0.554) LCSAwNN (0.655) LCSAwMPD (0.604) CRNP (0.707) LCRNPwNN (0.722) LCRNPwMPD (0.721) 1 2 3 4 5 6 7 8 9 10 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Rank Recognition percentage

CMC on the "iLIDS-AA" dataset with N=23

CSA (0.613) LCSAwNN (0.694) LCSAwMPD (0.676) CRNP (0.734) LCRNPwNN (0.745) LCRNPwMPD (0.737) 1 2 3 4 5 6 7 8 9 10 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Rank Recognition percentage

CMC on the "iLIDS-AA" dataset with N=46

CSA (0.578) LCSAwNN (0.688) LCSAwMPD (0.673) CRNP (0.713) LCRNPwNN (0.759) LCRNPwMPD (0.714) 1 2 3 4 5 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Rank Recognition percentage

CMC on the "CAVIAR4REID" dataset with N=5

CSA (0.446) LCSAwNN (0.588) LCSAwMPD (0.544) CRNP (0.624) LCRNPwNN (0.642) LCRNPwMPD (0.638) 1 2 3 4 5 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Rank Recognition percentage

CMC on the "CAVIAR4REID" dataset with N=10

CSA (0.540) LCSAwNN (0.720) LCSAwMPD (0.660) CRNP (0.700) LCRNPwNN (0.740) LCRNPwMPD (0.700) 1 2 3 4 5 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Rank Recognition percentage

CMC on the "CAVIAR4REID" dataset with N=10, unspecified

CSA (0.652) LCSAwNN (0.760) LCSAwMPD (0.704) CRNP (0.674) LCRNPwNN (0.734) LCRNPwMPD (0.734)

SLIDE 39

39

1. Parametric (Set-to-set distance + metric learning)
2. Non-parametric (Collaborative representation)

How about combining them?

SLIDE 40

Background: Dictionary Learning

40

… …

... ... ... ... ... ... ...

… … …

... ... ... ... ...

… … … … … …

... ... ... ... ...

…

X

Samples

 

Dictionary

  D

Coefficients

  α d

 

Feature vector Regularizer (e.g. Discrimination) Regularizer (e.g. Sparsity) Related work  Parametric methods

N k

 

k N 

Training …

SLIDE 41

Discriminative Collaborative Representation (DCR)

41

… …

... ... ... ... ...

…

… … … … … ... ... ... ... ... …

X

Samples p

 

Dictionary

  D

Coefficients p

  α

d

 

… … … … … ... ... ... ... ... …



g

  α

p

N

… …

... ... ... ... ...

…

d

g

N



… …

... ... ... ... ... ... ...

…

Strong and costly regularization terms were used.

X

g

 

Parametric (Collaborative representation + Dictionary learning)

Yang Wu, et al., "Discriminative Collaborative Representation for Classification", ACCV 2014.

…

… …

…

… …

SLIDE 42

Dictionary Collaborative Learning (DCL)

42

… …

... ... ... ... ... ... ...

…

… …

... ... ... ...

…

… … … … … ... ... ... ... …

X

Samples g

 

Dictionary g

  D

Coefficients g

  β

d

 

… …

... ... ... ... ... ... ...

…

… … … … … ... ... ... ... …



p

  D

p

  β

c

… …

... ... ... ...

…

d c



New proposal: dictionary co-learning

,1

, 1 .

i g

i i g g i i g g N

N  X α α

,1

, 1 .

i p

i i p p i i p p N

N  X α α

Learning Camera-specific Dictionaries Collaboratively

X p

…

… …

…

… …

SLIDE 43

Experimental results: Effectiveness

43

Experiments  Results Rank 1 accuracy Parametric Nonparametric

SLIDE 44

Experimental results: Efficiency

Running time in milliseconds/person, using matlab with a normal

CPU.

44

Experiments  Results Parametric Nonparametric 10-100x Speedup

SLIDE 45

Set Sequence

45

… …

Video-based ReID:

Perspectives of Set and Sequence

SLIDE 46

Sequence: the order matters!

46

SLIDE 47

Proposal: Temporal Convolution

[AAAI 2018] Wu et. al., “Temporal-Enhanced Convolutional Network for Person Re-identification”.

47

SLIDE 48

Communication

(What [does he/she want]? How [does he/she feel]?)

Identity

(Who?)

NAIST International Collaborative Laboratory for Robotics Vision

State, Action, ...

(What [is he/she doing]? How [does he/she do it]?)

Explicit expression Implicit expression

48

SLIDE 49

People communicate to understand each other

What if machines understand them?

49

SLIDE 50

Our goal: automatic recognition of spontaneous head gestures

50

SLIDE 51

Targeted head gestures

51

Nod Ticks Jerk Up Down Tilt Shake Turn Forward Backward

SLIDE 52

Benefits of understanding communication

[Maatman et al. 2005]

Human-robot interaction Communication assistance

[Asakawa 2015]

52

SLIDE 53

Importance of non-verbal information

Non-verbal information influences significantly e.g.) Mehrabian’s rule (Rule of 7%-38%-55%)

Communication

Verbal information Non-verbal information Audio information Visual information Expression Hand gesture Head gesture

⋮

We focus on head gesture detection

Appears frequently

[Hadar et al. 1983]

Takes important role [Kousidis et al. 2013, McClave 2000]

53 Verbal Audio Visual

SLIDE 54

Our contributions and novelties

 Contributions

 Built a novel dataset  Evaluated representative automatic recognition models

Novelties (with comparison to existing work)

 Dataset:

closer to real applications
better for deeper and further researches

 Solution:

a general hand-crafted feature
a comparative study of representative recognition algorithms

54

SLIDE 55

Only Nod and Shake are widely handled gestures. Nod is commonly concerned.

55

Recognized head gestures

Nod [Morency et al. 2007] [Nakamura et al. 2013] [Chen et al. 2015] Nod, Shake [Kawato et al. 2000] [Kapoor et al. 2001] [Tan et al. 2003] [Morency et al. 2005] [Wei et al. 2013] Nod, Shake, Turn [Saiga et al. 2010] Nod, Shake, Tilt, Still [Fujie et al. 2004]

Previous studies on head gesture detection

P r e v i o u s s t u d i e s

SLIDE 56

Recording conditions

No interlocutors [Kawato et al. 2000] [Kapoor et al. 2001] [Tan et al. 2003] [Wei et al. 2013] Against a robot [Fujie et al. 2004] [Morency et al. 2005] [Morency et al. 2007] Speaker-listener style [Nakamura et al. 2013] Mutual conversations [Chen et al. 2015] [Saiga et al. 2010]

Few people have worked on spontaneous head gestures in human conversations

56

P r e v i o u s s t u d i e s

Previous studies on head gesture detection

SLIDE 57

57

Dataset Construction

SLIDE 58

Recording

30 sequences of approx. 10 min. from 15 participant
Includes familiar/unfamiliar pairs, indoor/outdoor records
Conversations with topics chosen beforehand
Purpose of the recording is announced

Wearable camera Fixed camera Microphone Microphone Fixed camera Wearable camera 58

SLIDE 59

Annotation

A freeware Anvil5 [Kipp 2014] was used for manual annotation.

(up to 3 overlapping gestures were allowed)

3 naive annotators annotated all the data independently, after a quick training with guideline and examples. 59

SLIDE 60

Ground-Truth Inference

IoU: Interaction over Union

60

Interval 1: 𝑇1 Interval 2: 𝑇2 Intersection: 𝐽1,2 Union: 𝑉1,2 time

𝐽𝑝𝑉 𝑇1, 𝑇2 = 𝑚𝑓𝑜𝑕𝑢ℎ(𝐽1,2) 𝑚𝑓𝑜𝑕𝑢ℎ(𝑉1,2)

SLIDE 61

Ground-Truth Inference

61

Nod, 2

Shake, 3

Nod, 3 Nod, 2 Down, 3 Nod, 2.5

Annotator A: Annotator B: Annotator C: Inferred: Shake, 2

Up, 2 Turn, 1 Shake, 3 Tilt, 1 Suppose IoU_th = 0.5

Shake, 3

A&B: IoU=0.6 > IoU_th A&B: IoU=0.65 > IoU_th A&C: IoU=0.8 > IoU_th B&C: IoU=0.6 > IoU_th A&B: IoU=0.65 > IoU_th A&C: IoU=0.8 > IoU_th B&C: IoU=0.6 > IoU_th

Gesture Type Strength

Non- maximum Suppression

SLIDE 62

Statistics (Inferred Ground-truth with IoU=0.5)

62

Total No. of Samples: 4147

SLIDE 63

Type Distribution per Subject

63

SLIDE 64

Strength Distribution per Subject

64

SLIDE 65

Familiar vs. Unfamiliar

65

Ticks Nod

SLIDE 66

Length Distribution

66

Median

SLIDE 67

Recognition tasks

Detection：Given a sequence, to infer when and which gestures appear. To understand the problem better, we also work on the task of Classification：Given a segmented gesture clip, to infer which type it

belongs to.

To detect varied head gestures from spontaneous conversations

Nod Shake Nod

67

? ? ?

Tilt Shake classifier Nod Turn

SLIDE 68

General framework

Features Head pose

68

Classifier or Detector

SLIDE 69

Head pose estimation

Head pose (and position) were estimated with ZFace [Jeni et al.

2015]

69

Pitch Roll Yaw X Y Scale Frame number

SLIDE 70

A general hand-crafted feature Histogram of Velocity and Acceleration (HoVA)

70

Original 1st derivative

⌇ ⌇ ⌇ ⌇ ⌇ ⌇ ⌇ ⌇

SLIDE 71

Histogram of Velocity and Acceleration (HoVA)

71

Original 1st derivative

＋：2.4 －：2.6 ＋：4.3 －：1.8 ＋：1.4 －：2.0

SLIDE 72

72

Original 2nd derivative

＋：2.4 －：2.6 ＋：4.3 －：1.8 ＋：1.4 －：2.0 ＋：2.4 －：2.6 ＋：4.3 －：1.8 ＋：2.2 －：2.0

Histogram of Velocity and Acceleration (HoVA)

SLIDE 73

Existing classification models

Learning model

(rule-base) [Kawato et al. 2000] [Saiga et al. 2010] [Nakamura et al. 2013] SVM [Morency et al. 2005] [Chen et al. 2015] HMM [Kapoor et al. 2001] [Tan et al. 2003] [Fujie et al. 2004] [Wei et al. 2013] LDCRF [Morency et al. 2007] 73

P r e v i o u s s t u d i e s

SLIDE 74

We evaluate the following models

 Non-graphical

SVM

 Graphical

Hidden-state Conditional Random Field (HCRF) for classification
Latent-Dynamic Conditional Random Field (LDCRF) for detection
Long-Short Term Memory (LSTM)

74

LDCRF

SLIDE 75

LDCRF (Latent-Dynamic Conditional Random Field)

[Morency et al. 2007]

Conditional Random Field enhanced for action detection Learn weights between each and Optimize the order of hidden states throughout temporal data

75

A1 A1 A2 B2 B1 A1 A2 −2 −1 3 10 12 −1 −2 A A A B B A A

Label Hidden state Data

Hidden state

Data

Hidden state Hidden state

SLIDE 76

76

LSTM Bidirectional LSTM Max pooling ⋰ ⋮ ⋱ Dense + ReLU ⋰ ⋮ ⋱ Dense + Softmax Input temporal data n x 24 n x 64 n x 64 n x 64 192 32 10 Output

LSTMs (Long-Short Term Memory)

SLIDE 77

Results – Classification (Accuracy, F-score)

Accuracy (Averaged)
F-score (Averaged)

77

Method

Training Set Training-Val Set Validation Set Test Set

SVM 0.68±0.02 0.74±0.04 0.62±0.11 0.60±0.12 SVM_weighted 0.65±0.02 0.76±0.01 0.59±0.11 0.57±0.13 HCRF 0.88±0.04 0.83±0.03 0.66±0.14 0.64±0.10 LSTMs 0.79±0.02 0.84±0.06 0.63±0.14 0.61±0.15 Method

Training Set Training-Val Set Validation Set Test Set

SVM 0.483 0.318 0.387 0.307 SVM_weighted 0.493 0.324 0.408 0.388 HCRF 0.799 0.386 0.433 0.382 LSTMs 0.600 0.394 0.386 0.391

SLIDE 78

Results – Classification (Confusion Matrix)

78

SVM SVM_weighted HCRF

Test set only, overall accumulation

LSTMs

SLIDE 79

Results – Classification (Class-specific)

79

SLIDE 80

Simulated Human Performance -- Classification

Frame-wise confusion matrix (with “None” class) Frame-wise confusion matrix (without “None” class)

80

SLIDE 81

Results – Detection (PR-curve, AP)

81

SLIDE 82

Poorer results when fewer samples are available
LDCRF can better model classes with more diversities, e.g. Ticks.

82

Results – Detection (AP)

0.25 0.5 0.75 1

SVM LDCRF

Nod Jerk Up Down Ticks Tilt Shake Turn Forward Backward 全体

SLIDE 83

Conclusions and discussions

 Spontaneous head gesture recognition is a hard problem

Hard for humans, but even harder for automatic recognition

 Gestures types are not equally hard for automatic recognition  Larger model is stronger  Deep learning is more promising, but more data is needed.

83

SLIDE 84

Communication

(What [does he/she want]? How [does he/she feel]?)

Identity

(Who?)

NAIST International Collaborative Laboratory for Robotics Vision

State, Action, ...

(What [is he/she doing]? How [does he/she do it]?)

Explicit expression Implicit expression

84

SLIDE 85

85

Proposal of a Wrist-mounted Depth Camera for Finger Gesture Recognition

Kai Akiyama, Yang Wu Nara Institute of Science and Technology

Time-of-Flight camera Retrieved depth images AR/VR controller Daily activity recognition

SLIDE 86

Hand pose estimation - Applications

Driving assistant Surgery assistant Playing games etc.

86

SLIDE 87

(S. Yuan, et al. 2017)

Background – Depth-based 3D hand pose estimation benchmark

Hands In the Million Challenge (HIM2017)

Training data (957K) Testing data Single frame (296K) Tracking (295K) Interaction (2K)

Pose estimator Hand detector + Pose estimator

87

SLIDE 88

88

Proposed 3D hand pose estimator architecture (1)

1024 dense Block 2 Block 3 Block 4 Block 1 27 24 24

24 24 63

Thickened cloud points

f hand

3D coordinates of hand joints

Output_T Output_I Output_M Output_R Output_P Output_hand

SLIDE 89

89

1024 dense Block 2 Block 3 Block 4 Block 1 27 24 24

24 24 63

Thickened cloud points

f hand

3D coordinates of hand joints

Proposed 3D hand pose estimator architecture (2)

SLIDE 90

Pipeline of Pose estimator

Single frame pose estimation

Pose Estimato r

Extracting hand based by given bounding box Represent data by 50x50x50 volume Estimate 3D hand pose and transform back to original coordinates

90

SLIDE 91

Qualitative results of 3D hand pose estimator

91

SLIDE 92

Evaluation on the 3D hand pose estimation task of HIM2017 benchmark

92

SLIDE 93

Utilizing a hand detector for tracking and interaction task

Testing data Single frame Tracking Interaction Pose estimator Hand detector + Pose estimator We need a hand detector to find where is the hand in real application

93

SLIDE 94

Architecture of the 3D hand pose tracking system

Hand Detector

X

Hand Verifier

Pose Estimator

Taking pose from previous frame Success Fail

Hand detector + Hand verifier + Pose estimator

Hand verifier:

1. Comparing with the previous frame, whether the center of detected hand area shift more

than 150 mm;

2. Whether the number of pixels for detected hand area is more than 1000.

94

SLIDE 95

Qualitative results of 3D hand pose tracking

… … … … … … … … …

Depth image Hand mask Estimated hand pose and depth image

Sequential Frames

95

SLIDE 96

Evaluation on the 3D hand tracking task of HIM2017 benchmark

96

SLIDE 97

Applying modified tracking system on Hand object interaction

Hand Detector

X

Pose Estimator

97

SLIDE 98

Qualitative results of 3D hand-object interaction pose estimation

Depth image Hand mask Estimated hand pose and depth image

98

SLIDE 99

Evaluation on the hand object interaction task of HIM2017 benchmark

99

SLIDE 100

Evaluation results on all tasks of HIM2017 benchmark

100

SLIDE 101

Who is Doing What in Drone-recorded WAMI

101

[Submitted to AAAI 2019]

SLIDE 102

Results – Region Proposal

102

SLIDE 103

Results – Tracking

103

SLIDE 104

Results – Action Recognition

104

SLIDE 105

105

About NAIST

SLIDE 106

Osaka Kyoto

NAIST Location

Nara

106

SLIDE 107

Kansai Science City (Keihanna)

Research park in the Kansai Hills area, extending to three prefectures, Kyoto, Osaka and Nara, and covering about 150

km2. More than 110 companies and institutes such as:

Kyocera Panasonic ATR (Advanced Telecommunications Research Institute International) NICT (National Institute of Information and Communications Technology) RITE (Research Institute of Innovative Technology for the Earth)

107

SLIDE 108

NAIST Campus

Administrative Offices Student & Staff Dormitories Graduate School of Biological Sciences Graduate School of Materials Science Graduate School of Information Science Interdisciplinary/Integrated Research Building Interdisciplinary/Integrated Research Building

108

SLIDE 109

GSIS: Core Laboratories

Computing Architecture Dependable System Ubiquitous Computing System Mobile Computing Software Engineering Software Design and Analysis Internet Engineering Internet Architecture and Systems Computational Linguistics Augmented Human Communication Network Systems Vision and Media Computing Interactive Media Design Optical Media Interface Ambient Intelligence Robotics Intelligent System Control Large-Scale Systems Management Mathematical Informatics Imaging-based Computational Biomedicine Computational Systems Biology Robotics Vision

Computer Science Applied Informatics Media Informatics

109

SLIDE 110

NAIST External Evaluation

Ranked 1st in Japan Revenue for research expenses (per faculty member) Number of Grants-in-Aid for scientific research (per faculty member) Allotment of Grants-in-Aid for Scientific Research (per faculty member) Revenue from patent implementation (per faculty member) Number of university business ventures (per faculty member) Percentage of Young Faculty (Younger than 37 years old)

The 87th Session of the Council for Science and Technology Policy

Ranked 1st Citation Index of ISI (overall) among Japanese National Universities

Ranking 2013 by Asahi Shimbun

110

SLIDE 111

111

SLIDE 112

NAIST elected for major university programs by MEXT

 2014 Top Global University Project  2013 The Program for Promoting the Enhancement

f Research Universities

112

SLIDE 113

113

About My Lab

SLIDE 114

NAIST International Collaborative Laboratory for Robotics Vision

114

SLIDE 115

Established in Dec., 2014

NAIST International Collaborative Laboratory for Robotics Vision

115

SLIDE 116

We won

The Best International Collaborative Lab of NAIST, 2017

NAIST International Collaborative Laboratory for Robotics Vision

116

Best Student Paper Award: The Piero Zamperoni Best Student Paper Award of ICPR 2018 (Global)
Best Paper Award: The AutoML 2018 workshop @ ICML/IJCAI-ECAI 2018 (Global)
Winner: The 2017 Hands in the Million Challenge (Hand-Object Interaction Task) (Global)
Winner: ISMAR 2015 Tracking Competition (Off-Site Category: Level 1) (Global)
Excellent Demo: IPSJ Distributed Processing System Workshop 2016 (Japan)
Excellent Award: Creative and International Competitiveness Project 2017 (NAIST)
Excellent Student Award: 2018 Excellent Student Award (NAIST)
Excellent Student Award: 2017 Excellent Student Award (NAIST)