When We First Met: Visual-Inertial Person Localization for Co-Robot Rendezvous
Xi Sun, Xinshuo Weng, Kris Kitani Robotics Institute, Carnegie Mellon University IROS 2020
Motivation
Use case 1: Autonomous vehicle identifies and locates its user
Use case 2: Assistive robot tries to locate the target person for the first time
Problem Statement
Given a query IMU sequence from a person's smartphone, locate the person in the video whose motion the IMU data came from.
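To make the task interface concrete, here is a minimal sketch, assuming hypothetical `encode_imu` and `encode_person` functions that map each modality into a shared embedding space (the learned encoders are described on later slides):

```python
import numpy as np

def localize_imu_source(imu_seq, person_tracklets, encode_imu, encode_person):
    """Return the index of the tracked person whose visual motion best
    matches the query IMU sequence.

    imu_seq:          (T, 6) accelerometer + gyroscope samples
    person_tracklets: list of per-person visual feature sequences
    encode_imu / encode_person: hypothetical functions mapping a sequence
                      to a (D,) embedding in a shared feature space
    """
    query = encode_imu(imu_seq)
    dists = [np.linalg.norm(query - encode_person(t))  # L2 distance in
             for t in person_tracklets]                # the joint space
    return int(np.argmin(dists))                       # closest candidate
```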
Related Work (Henschel et al., 2019)
- Matches motion information between video and IMU data, assuming people are equipped with inertial sensors
- Formulated as a graph optimization problem
- Estimates 3D pose relative to the camera from a single image using a VGG16-based network
- Uses the IMU data per frame to compute 3D velocity between pairs of node images
Henschel, Roberto, Timo von Marcard, and Bodo Rosenhahn. "Simultaneous identification and tracking of multiple people using video and IMUs." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2019.
Limitation: hand-crafted inertial features are used to match with visual data.
Approach
For the query IMU data from a person labeled by index $n \in [N]$, where $N$ is the number of people in the video, extract the inertial feature $g_{\mathrm{IMU}}$. For each candidate person in the video, extract the visual feature $g_{\mathrm{VIS}}$. Learn two mappings $H_{\mathrm{VIS}} : g_{\mathrm{VIS}} \to f$ and $H_{\mathrm{IMU}} : g_{\mathrm{IMU}} \to f$ such that the transformed features from the same person lie close together in the joint feature space.
[Pipeline: IMU and video inputs → Feature Extraction → Feature Encoding → Joint Feature Space]
IMU Feature Extraction
- IMU data: 3D linear acceleration and angular velocity in the smartphone's local frame
- Pre-processing: IMU samples synchronized with video frames
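A minimal pre-processing sketch for the synchronization step, assuming timestamped raw IMU samples are linearly interpolated onto the video frame timestamps (the interpolation scheme and array shapes are assumptions, not from the paper):

```python
import numpy as np

def sync_imu_to_frames(imu_t, imu_data, frame_t):
    """Resample IMU measurements onto video frame timestamps.

    imu_t:    (M,) increasing IMU sample timestamps, in seconds
    imu_data: (M, 6) columns = 3D linear acceleration + angular velocity
              in the smartphone's local frame
    frame_t:  (F,) video frame timestamps, in seconds
    Returns:  (F, 6) array with one IMU sample per video frame
    """
    # Linearly interpolate each of the 6 channels independently.
    return np.stack(
        [np.interp(frame_t, imu_t, imu_data[:, c]) for c in range(6)],
        axis=1,
    )
```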
[Architecture overview: IMU, TSP optical flow, bounding box size trajectory, and human pose keypoint trajectory feed Feature Extraction → Feature Encoding → Joint Feature Space]
Visual Feature Extraction (see the sketch below)
1. Person detection with YOLOv3 and tracking with DeepSORT
2. Decompose person tracklets into Temporal Super-Pixels (TSPs)
3. Compute the average optical flow for each TSP
[Figure: person tracklets decomposed into temporal super-pixels with person labels]
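A sketch of step 3 only, assuming detection/tracking (YOLOv3 + DeepSORT) and the TSP decomposition (Chang et al.) have already produced a per-pixel TSP label map; Farneback flow is used here as a stand-in, since the slides do not name the exact optical flow method:

```python
import cv2
import numpy as np

def average_flow_per_tsp(prev_gray, next_gray, tsp_labels):
    """Compute the mean optical flow inside each temporal super-pixel.

    prev_gray, next_gray: consecutive grayscale frames, (H, W) uint8
    tsp_labels: (H, W) int array assigning each pixel to a TSP id
                (assumed precomputed, e.g. by Chang et al.'s TSP code)
    Returns: dict mapping TSP id -> (mean_dx, mean_dy)
    """
    # Dense optical flow between the two frames (Farneback stand-in).
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

    means = {}
    for tsp_id in np.unique(tsp_labels):
        mask = tsp_labels == tsp_id
        means[int(tsp_id)] = (float(flow[..., 0][mask].mean()),
                              float(flow[..., 1][mask].mean()))
    return means
```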
Redmon, Joseph, and Ali Farhadi. "YOLOv3: An incremental improvement." arXiv preprint arXiv:1804.02767 (2018).
Wojke, Nicolai, Alex Bewley, and Dietrich Paulus. "Simple online and realtime tracking with a deep association metric." 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017.
Chang, Jason, Donglai Wei, and John W. Fisher. "A video representation using temporal superpixels." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013.
[Figure: TSP mask pixels at coarse vs. fine granularity]
Optical flow only captures a 2D projection of the person's 3D motion. Factors that cause ambiguity when matching a 3D inertial feature to optical flow:
- Similar inertial measurements but different optical flow (the same 3D velocity $v$ can project to different image motion depending on viewpoint and depth)
Additional visual features to resolve the ambiguity:
- Bounding box size trajectory (from YOLO person detection)
- Human pose keypoint trajectory (from AlphaPose), with each keypoint position taken relative to the center of the bounding box
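A minimal sketch of these two features, assuming (x1, y1, x2, y2) boxes and AlphaPose-style keypoints; dividing by box size is an added assumption for scale invariance, beyond what the slide states:

```python
import numpy as np

def bbox_size_trajectory(boxes):
    """boxes: (T, 4) per-frame (x1, y1, x2, y2).
    Returns the (T, 2) width/height trajectory."""
    return np.stack([boxes[:, 2] - boxes[:, 0],
                     boxes[:, 3] - boxes[:, 1]], axis=1)

def relative_keypoints(keypoints, boxes):
    """keypoints: (T, K, 2) image-space joints (e.g. from AlphaPose).
    boxes:        (T, 4) matching person boxes.
    Returns (T, K, 2) keypoints relative to the bounding-box center,
    divided by the box size (the scaling is an assumption)."""
    center = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2.0,
                       (boxes[:, 1] + boxes[:, 3]) / 2.0], axis=1)
    size = np.stack([boxes[:, 2] - boxes[:, 0],
                     boxes[:, 3] - boxes[:, 1]], axis=1)
    return (keypoints - center[:, None, :]) / size[:, None, :]
```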
Xiu, Yuliang, et al. "Pose flow: Efficient online pose tracking." BMVC 2018.
[Network architecture: the three visual streams (TSP optical flow, bounding box size trajectory, human pose keypoint trajectory) are encoded by LSTM-OpticalFlow, LSTM-Box, and LSTM-Pose; the IMU stream is encoded by a Conv1D front-end followed by LSTM-IMU. FC layers map the encoder outputs into the joint feature space. The visual encoders share weights between the positive and negative samples, and the network is trained with a triplet loss.]
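A hedged PyTorch sketch of encoders matching the diagram's labels; all layer sizes, the keypoint count (17 joints assumed), and the concatenate-then-FC fusion are assumptions, since the slides do not give the exact dimensions:

```python
import torch
import torch.nn as nn

class SeqEncoder(nn.Module):
    """LSTM encoder used for each visual stream (box / pose / flow)."""
    def __init__(self, in_dim, hidden=128, out_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, out_dim)

    def forward(self, x):                  # x: (B, T, in_dim)
        _, (h, _) = self.lstm(x)           # final hidden state: (1, B, H)
        return self.fc(h[-1])              # (B, out_dim)

class IMUEncoder(nn.Module):
    """Conv1D front-end followed by LSTM-IMU, as labeled in the diagram."""
    def __init__(self, in_dim=6, conv_ch=32, hidden=128, out_dim=64):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, conv_ch, kernel_size=5, padding=2)
        self.lstm = nn.LSTM(conv_ch, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, out_dim)

    def forward(self, x):                  # x: (B, T, 6)
        z = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (B, T, conv_ch)
        _, (h, _) = self.lstm(z)
        return self.fc(h[-1])

class VisualEncoder(nn.Module):
    """Fuses the three visual streams into one joint-space embedding.
    A single instance is reused for positive and negative samples,
    which realizes the diagram's weight sharing."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.box = SeqEncoder(2, out_dim=out_dim)    # width/height per frame
        self.pose = SeqEncoder(34, out_dim=out_dim)  # 17 keypoints x 2 (assumed)
        self.flow = SeqEncoder(2, out_dim=out_dim)   # mean TSP flow (dx, dy)
        self.fc = nn.Linear(3 * out_dim, out_dim)    # fusion layer (assumed)

    def forward(self, box_seq, pose_seq, flow_seq):
        f = torch.cat([self.box(box_seq), self.pose(pose_seq),
                       self.flow(flow_seq)], dim=1)
        return self.fc(f)
```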
Feature Encoding and Matching
- Visual and inertial feature encoders $H_{\mathrm{VIS}}$ and $H_{\mathrm{IMU}}$ map the extracted features into the joint feature space
- $g^{+}_{\mathrm{VIS}}(\xi_i)$: visual feature extracted with TSP $\xi_i$ from the positive person; $g^{-}_{\mathrm{VIS}}(\xi_j)$: visual feature extracted with TSP $\xi_j$ from a negative person
- Triplet loss on the inertial (query) feature embedding $g^{n}_{\mathrm{IMU}}$:

$$L(g^{n}_{\mathrm{IMU}}, g^{+}_{\mathrm{VIS}}(\xi_i), g^{-}_{\mathrm{VIS}}(\xi_j)) = \max\left(\lVert H_{\mathrm{VIS}}(g^{+}_{\mathrm{VIS}}(\xi_i)) - H_{\mathrm{IMU}}(g^{n}_{\mathrm{IMU}}) \rVert_2 - \lVert H_{\mathrm{VIS}}(g^{-}_{\mathrm{VIS}}(\xi_j)) - H_{\mathrm{IMU}}(g^{n}_{\mathrm{IMU}}) \rVert_2 + \kappa,\; 0\right)$$

- Predicted IMU source person in the video: the candidate with the smallest average feature distance between all of that person's TSPs and the query IMU embedding
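A minimal sketch of the loss and the prediction rule above, assuming the inputs are embeddings already mapped by $H_{\mathrm{VIS}}$ / $H_{\mathrm{IMU}}$; the margin value is an assumption:

```python
import torch

def triplet_loss(f_imu, f_vis_pos, f_vis_neg, margin=0.2):
    """Triplet loss from the slide: pull the positive person's TSP
    embedding toward the IMU embedding, push the negative one away.
    Inputs: (B, D) embeddings; margin plays the role of kappa."""
    d_pos = torch.norm(f_vis_pos - f_imu, dim=1)
    d_neg = torch.norm(f_vis_neg - f_imu, dim=1)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

def predict_imu_source(f_imu, tsp_embeddings_per_person):
    """f_imu: (D,) query IMU embedding.
    tsp_embeddings_per_person: entry n is an (M_n, D) tensor holding
    embeddings of all TSPs belonging to candidate person n.
    Returns the candidate with the smallest average TSP-to-IMU distance."""
    avg_dists = torch.stack(
        [torch.norm(tsps - f_imu, dim=1).mean()
         for tsps in tsp_embeddings_per_person])
    return int(torch.argmin(avg_dists))
```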
Experiments
- Prediction accuracy is evaluated with different numbers of people (from 2 to 5) in the scene
- Accuracy drops as the number of people in the video increases (there are more potential false positives)
Dataset
- Video: camera mounted 1 meter above the ground
- IMU: recorded from each participant's smartphone
[Table: total length of the video with respect to the number of people in the scene]
Baselines compared:
- Non-learning based
- Transform one modality to the other
- Learning a joint feature space
A longer time window includes more feature information, which helps discriminate similar motions, but it reduces the number of usable samples because longer windows are more often interrupted by occlusions.
Prediction Accuracy with Different IMU Features
Prediction Accuracy with Different Visual Features
[Qualitative results: green box marks the IMU source person, white box the predicted IMU source; feature distance is shown on a color scale from small to large. One failure case is included.]
Summary
- Locates the IMU source person among multiple people in the scene, without strict constraints on IMU placement
Future work
- Deployment on mobile robots