SLIDE 1

When We First Met: Visual-Inertial Person Localization for Co-Robot Rendezvous

Xi Sun, Xinshuo Weng, Kris Kitani
Robotics Institute, Carnegie Mellon University
IROS 2020

SLIDE 2

Motivation

Use case 1: Autonomous vehicle identifies and locates its user


Use case 2: Assistive robot tries to locate the target person for the first time

SLIDE 3

Proposed Task


Given a query IMU sequence from a person's smartphone, locate in the video the person that the IMU data comes from.

SLIDE 4

Why IMU?


  • Inertial measurements (accelerometer and gyroscope readings) provide rich relative 3D motion information
  • People often carry smart devices (smartphones and smartwatches) equipped with inertial sensors
  • IMU data can be easily and selectively transmitted at low cost
  • IMU data contains minimal biometric or privacy-sensitive information
SLIDE 5

Prior Work on Visual-Inertial Person Identification and Tracking

Prior work poses a graph optimization problem:

  • Node: predict the person's orientation with respect to the camera from a single image using a VGG16-based network
  • Edge: estimate the person's 3D foot position at each frame to compute the 3D velocity between pairs of node images
  • Hand-crafted inertial features are matched with the visual data

Henschel, Roberto, Timo von Marcard, and Bodo Rosenhahn. "Simultaneous identification and tracking of multiple people using video and IMUs." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2019.

SLIDE 6

Proposed Approach


For the query IMU data from a person labeled by index n ∈ [N], where N is the number of people in the video, extract the inertial feature g_IMU. For each candidate person in the video, extract the visual feature g_VIS. Learn two mappings, H_VIS : g_VIS → f and H_IMU : g_IMU → f, such that the transformed features from the same person lie close together in the joint feature space.

SLIDE 7

Framework

[Framework diagram: Feature Extraction → Feature Encoding → Joint Feature Space; the IMU input is highlighted at this stage]

SLIDE 8

Inertial Feature Extraction

IMU data: 3D linear acceleration and angular velocity in the smartphone's local frame.

Pre-processing (a minimal sketch follows this list):

  • 1. Uniform sampling to a fixed number of samples synchronized with the video frames
  • 2. Low-pass filtering for smoothing
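A minimal NumPy/SciPy sketch of these two steps; the interpolation-based resampling, the filter order, and the 5 Hz cutoff are our assumptions, since the slides do not give exact parameters.

import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_imu(imu, imu_times, frame_times, fps=30.0, cutoff_hz=5.0):
    # imu: (N, 6) accelerometer + gyroscope samples in the phone's local frame
    # 1. Uniform sampling: interpolate each channel at the video frame
    #    timestamps so the IMU sequence is synchronized with the frames.
    resampled = np.stack(
        [np.interp(frame_times, imu_times, imu[:, c]) for c in range(imu.shape[1])],
        axis=1)
    # 2. Low-pass filtering: zero-phase Butterworth smoothing (the order
    #    and cutoff are assumptions, not values from the slides).
    b, a = butter(4, cutoff_hz, fs=fps, btype="low")
    return filtfilt(b, a, resampled, axis=0)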

SLIDE 9

Framework

[Framework diagram: Feature Extraction → Feature Encoding → Joint Feature Space; the visual cues extracted from the positive person are highlighted: TSP optical flow, bounding box size trajectory, and human pose keypoint trajectory]

SLIDE 10

Visual Feature Extraction

1. Person detection with YOLOv3 and tracking with DeepSORT
2. Decompose person tracklets into Temporal Super-Pixels (TSPs)
3. Compute the average optical flow for each TSP (a sketch of this pipeline follows the citations below)

[Figure: person tracklets decomposed into temporal super-pixels with person labels]

Redmon, Joseph, and Ali Farhadi. "YOLOv3: An incremental improvement." arXiv preprint arXiv:1804.02767 (2018).
Wojke, Nicolai, Alex Bewley, and Dietrich Paulus. "Simple online and realtime tracking with a deep association metric." 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017.
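A Python sketch of this pipeline; yolo_detect, deepsort_track, temporal_superpixels, and mean_flow are hypothetical stand-ins for YOLOv3, DeepSORT, TSP segmentation, and an optical flow estimator, which the slides name only at the method level.

def extract_visual_features(frames):
    # 1. Detect people per frame and link detections into tracklets.
    detections = [yolo_detect(f) for f in frames]        # hypothetical YOLOv3 wrapper
    tracklets = deepsort_track(frames, detections)       # hypothetical DeepSORT wrapper
    features = {}
    for person_id, boxes in tracklets.items():
        # 2. Decompose each person tracklet into Temporal Super-Pixels.
        tsps = temporal_superpixels(frames, boxes)       # hypothetical TSP segmentation
        # 3. One motion descriptor per TSP: its average optical flow.
        features[person_id] = [mean_flow(frames, tsp) for tsp in tsps]
    return features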

SLIDE 11

Temporal Super-Pixels as Visual Representation of Human Motion

Chang, Jason, Donglai Wei, and John W. Fisher. "A video representation using temporal superpixels." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013.

[Figure: TSP decomposition of a person, from mask and raw pixels to coarse and fine temporal super-pixels]

SLIDE 12

Ambiguity in 2D and 3D Feature Matching

Optical flow captures only a 2D projection of the person's 3D motion. Two factors cause ambiguity when matching the 3D inertial feature to optical flow (a toy pinhole-camera example follows):

  • 1. The person's distance to the camera
  • 2. The person's orientation to the camera

[Figure: two people with similar inertial measurements (velocity v) but different optical flow]
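To see why distance matters, here is a toy pinhole-camera computation (the focal length and velocities are illustrative assumptions): the same lateral 3D velocity produces optical flow that is inversely proportional to depth.

import numpy as np

def image_flow(v_lateral, depth, focal_px=1000.0):
    # Pinhole camera: a point at distance `depth` (m) moving with lateral
    # 3D velocity `v_lateral` (m/s) yields image velocity f * v / Z (px/s).
    return focal_px * np.asarray(v_lateral) / depth

# Identical 3D motion at different distances gives different optical flow:
print(image_flow([1.0, 0.0], depth=2.0))  # [500.   0.] px/s
print(image_flow([1.0, 0.0], depth=4.0))  # [250.   0.] px/s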

SLIDE 13

Additional Visual Cues to Address Ambiguity

  • 1. Bounding box size trajectory (generated from the YOLO person detector)
  • 2. Human pose keypoint trajectory (generated from AlphaPose)
    • Left and right shoulder keypoint positions relative to the center of the bounding box (a per-frame sketch follows the citation below)

Xiu, Yuliang, et al. "Pose flow: Efficient online pose tracking." BMVC 2018.
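A minimal NumPy sketch of these two cues for a single frame; normalizing by the box size is our assumption, not stated on the slide, and the function names are illustrative.

import numpy as np

def box_size_cue(box):
    # Bounding box size for one frame; stacking over frames gives the trajectory.
    x1, y1, x2, y2 = box
    return np.array([x2 - x1, y2 - y1])

def shoulder_cue(left_shoulder, right_shoulder, box):
    # Left/right shoulder keypoints expressed relative to the bounding box
    # center, normalized by box size (the normalization is our assumption).
    x1, y1, x2, y2 = box
    center = np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])
    size = np.array([x2 - x1, y2 - y1])
    rel_left = (np.asarray(left_shoulder) - center) / size
    rel_right = (np.asarray(right_shoulder) - center) / size
    return np.concatenate([rel_left, rel_right])  # 4-D pose cue per frame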

SLIDE 14

Framework

[Framework diagram: feature encoding stage highlighted; LSTM-OpticalFlow, LSTM-Box, and LSTM-Pose encode the visual cues, while a Conv1D, LSTM-IMU, and FC layer encode the IMU]

SLIDE 15

Framework

[Framework diagram, complete: visual cues (TSP optical flow, bounding box size trajectory, human pose keypoint trajectory) from the positive and negative person each pass through weight-sharing LSTM-OpticalFlow, LSTM-Box, and LSTM-Pose encoders and an FC layer; the IMU passes through a Conv1D, LSTM-IMU, and FC layer; all embeddings meet in the joint feature space, where a triplet loss is applied. A PyTorch sketch of the encoder branches follows.]
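To make the diagram concrete, here is a minimal PyTorch sketch of the two encoder branches. The layer sizes, kernel size, and late-fusion FC are our assumptions; the slides only name the module types (LSTM-OpticalFlow, LSTM-Box, LSTM-Pose, Conv1D, LSTM-IMU, FC).

import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    # One LSTM per visual cue, matching LSTM-OpticalFlow, LSTM-Box, and
    # LSTM-Pose in the diagram; hidden/output sizes are assumptions.
    def __init__(self, cue_dims=None, hidden=64, out=128):
        super().__init__()
        cue_dims = cue_dims or {"flow": 2, "box": 2, "pose": 4}
        self.lstms = nn.ModuleDict(
            {k: nn.LSTM(d, hidden, batch_first=True) for k, d in cue_dims.items()})
        self.fc = nn.Linear(hidden * len(cue_dims), out)

    def forward(self, cues):  # cues: {"flow"|"box"|"pose": (B, T, dim) tensors}
        last = [self.lstms[k](x)[0][:, -1] for k, x in cues.items()]  # final states
        return self.fc(torch.cat(last, dim=-1))

class InertialEncoder(nn.Module):
    # Conv1D over the 6 IMU channels, then LSTM-IMU and an FC projection.
    def __init__(self, in_ch=6, conv_ch=32, hidden=64, out=128):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, conv_ch, kernel_size=5, padding=2)
        self.lstm = nn.LSTM(conv_ch, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, out)

    def forward(self, imu):  # imu: (B, T, 6)
        x = self.conv(imu.transpose(1, 2)).transpose(1, 2)  # (B, T, conv_ch)
        return self.fc(self.lstm(x)[0][:, -1])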

SLIDE 16

Learning the Visual-Inertial Feature Space

  • LSTMs encode the inertial and visual features
  • A triplet loss uses the Euclidean distance between the visual (positive and negative person) and inertial (query) feature embeddings

With visual and inertial feature encoders H_VIS and H_IMU, the visual feature g^+_VIS(ξ_i) extracted from TSP ξ_i of the positive person, the visual feature g^-_VIS(ξ_j) extracted from TSP ξ_j of the negative person, and margin κ, the triplet loss is

L(g^n_IMU, g^+_VIS(ξ_i), g^-_VIS(ξ_j)) = max(‖H_VIS(g^+_VIS(ξ_i)) − H_IMU(g^n_IMU)‖₂ − ‖H_VIS(g^-_VIS(ξ_j)) − H_IMU(g^n_IMU)‖₂ + κ, 0)

The predicted IMU source person in the video is the candidate with the smallest average feature distance between all of their TSPs and the query IMU. (A PyTorch sketch of the loss and this prediction rule follows.)
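A minimal PyTorch sketch of this loss and the prediction rule; the margin value and tensor shapes are our assumptions, and triplet_loss / predict_source are illustrative names, not the authors' code.

import torch

def triplet_loss(f_imu, f_pos, f_neg, kappa=0.5):
    # f_imu = H_IMU(g^n_IMU); f_pos, f_neg = H_VIS of positive/negative TSP features
    d_pos = torch.norm(f_pos - f_imu, dim=-1)  # distance to the true person's TSP
    d_neg = torch.norm(f_neg - f_imu, dim=-1)  # distance to another person's TSP
    return torch.clamp(d_pos - d_neg + kappa, min=0).mean()  # hinge with margin kappa

def predict_source(f_imu, tsp_embeddings):
    # tsp_embeddings: {person_id: (num_TSPs, D) tensor of encoded TSP features}
    # Predicted IMU source = candidate with the smallest average distance
    # between all of their TSP embeddings and the query IMU embedding.
    avg = {pid: torch.norm(emb - f_imu, dim=-1).mean().item()
           for pid, emb in tsp_embeddings.items()}
    return min(avg, key=avg.get)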

SLIDE 17

Experimental Setup

  • We collect our own dataset for implementation and evaluation.
  • We evaluate our framework on test videos with different numbers of people (from 2 to 5) in the scene.
  • We expect task complexity to increase with the number of people in the video, since there are more potential false positives.

SLIDE 18

Data Collection

Video

  • Recorded with an uncalibrated webcam mounted about 1 meter above the ground
  • 30 fps, 1080p

IMU

  • Recorded with a hand-held iPhone
  • 100 Hz

[Table: total video length by number of people in the scene]

SLIDE 19

Results Compared to Baseline Methods

The baselines fall into three categories: non-learning-based methods, methods that transform one modality into the other, and methods that learn a joint feature space.

[Table: prediction accuracy of the proposed method and the baselines]

SLIDE 20

Optimal Temporal Window Size

A longer time window carries more motion information, which helps discriminate similar motions, but it reduces the number of usable samples because longer windows are more likely to be interrupted by occlusions.

SLIDE 21

Ablation Study — inertial feature extraction

  • Raw IMU accelerometer and gyroscope readings
  • Estimated velocity from integrating acceleration, either concatenated with or replacing the raw readings
  • Low-pass filtering (smoothing)

[Table: prediction accuracy with different IMU features]

SLIDE 22

Ablation Study — additional visual cues

[Table: prediction accuracy with different visual features]

SLIDE 23

Visualizing Results

[Figure: qualitative results; green boxes mark the true IMU source, white boxes mark the predicted IMU source, colored by feature distance from small to large; the last example is a failure case]

SLIDE 24

Conclusions

Summary

  • A visual-inertial dataset with common pedestrian activities
  • The proposed framework identifies the IMU source in the video with 80.7% accuracy across varying numbers of people in the scene, without strict constraints on IMU placement

Future work

  • Collect more data: more people, more variation in people's motion, and more background scenes
  • Extend to video recorded from a dynamic camera, for deployment on autonomous vehicles or mobile robots

SLIDE 25

Thank you!
