When We First Met: Visual-Inertial Person Localization for Co-Robot Rendezvous
Xi Sun, Xinshuo Weng, Kris Kitani Robotics Institute, Carnegie Mellon University IROS 2020
Motivation
Use case 1: Autonomous vehicle identifies and locates its user
Use case 2: Assistive robot tries to locate the target person for the first time
Problem Statement
Given a query IMU sequence from a person's smartphone, locate the person in the video whose motion the IMU data came from.
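To make the task interface concrete, here is a minimal sketch, assuming hypothetical `encode_imu` and `encode_person` functions that map each modality into a shared embedding space (the learned encoders are described on later slides):

```python
import numpy as np

def localize_imu_source(imu_seq, person_tracklets, encode_imu, encode_person):
    """Return the index of the tracked person whose visual motion best
    matches the query IMU sequence.

    imu_seq:          (T, 6) accelerometer + gyroscope samples
    person_tracklets: list of per-person visual feature sequences
    encode_imu / encode_person: hypothetical functions mapping a sequence
                      to a (D,) embedding in a shared feature space
    """
    query = encode_imu(imu_seq)
    dists = [np.linalg.norm(query - encode_person(t))  # L2 distance in
             for t in person_tracklets]                # the joint space
    return int(np.argmin(dists))                       # closest candidate
```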
Related Work (Henschel et al., 2019)
- Matches motion information between video and IMU data, assuming people are equipped with inertial sensors
- Formulated as a graph optimization problem
- Estimates 3D pose relative to the camera from a single image using a VGG16-based network
- Uses the IMU data per frame to compute 3D velocity between pairs of node images
Henschel, Roberto, Timo von Marcard, and Bodo Rosenhahn. "Simultaneous identification and tracking of multiple people using video and IMUs." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2019.
Limitation: hand-crafted inertial features are used to match with visual data.
Approach
For the query IMU data from a person labeled by index $n \in [N]$, where $N$ is the number of people in the video, extract the inertial feature $g_{\mathrm{IMU}}$. For each candidate person in the video, extract the visual feature $g_{\mathrm{VIS}}$. Learn two mappings $H_{\mathrm{VIS}} : g_{\mathrm{VIS}} \to f$ and $H_{\mathrm{IMU}} : g_{\mathrm{IMU}} \to f$ such that the transformed features from the same person lie close together in the joint feature space.
[Pipeline: IMU and video inputs → Feature Extraction → Feature Encoding → Joint Feature Space]
IMU Feature Extraction
- IMU data: 3D linear acceleration and angular velocity in the smartphone's local frame
- Pre-processing: IMU samples synchronized with video frames
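A minimal pre-processing sketch for the synchronization step, assuming timestamped raw IMU samples are linearly interpolated onto the video frame timestamps (the interpolation scheme and array shapes are assumptions, not from the paper):

```python
import numpy as np

def sync_imu_to_frames(imu_t, imu_data, frame_t):
    """Resample IMU measurements onto video frame timestamps.

    imu_t:    (M,) increasing IMU sample timestamps, in seconds
    imu_data: (M, 6) columns = 3D linear acceleration + angular velocity
              in the smartphone's local frame
    frame_t:  (F,) video frame timestamps, in seconds
    Returns:  (F, 6) array with one IMU sample per video frame
    """
    # Linearly interpolate each of the 6 channels independently.
    return np.stack(
        [np.interp(frame_t, imu_t, imu_data[:, c]) for c in range(6)],
        axis=1,
    )
```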
[Architecture overview: IMU, TSP optical flow, bounding box size trajectory, and human pose keypoint trajectory feed Feature Extraction → Feature Encoding → Joint Feature Space]
Visual Feature Extraction (see the sketch below)
1. Person detection with YOLOv3 and tracking with DeepSORT
2. Decompose person tracklets into Temporal Super-Pixels (TSPs)
3. Compute the average optical flow for each TSP
[Figure: person tracklets decomposed into temporal super-pixels with person labels]
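A sketch of step 3 only, assuming detection/tracking (YOLOv3 + DeepSORT) and the TSP decomposition (Chang et al.) have already produced a per-pixel TSP label map; Farneback flow is used here as a stand-in, since the slides do not name the exact optical flow method:

```python
import cv2
import numpy as np

def average_flow_per_tsp(prev_gray, next_gray, tsp_labels):
    """Compute the mean optical flow inside each temporal super-pixel.

    prev_gray, next_gray: consecutive grayscale frames, (H, W) uint8
    tsp_labels: (H, W) int array assigning each pixel to a TSP id
                (assumed precomputed, e.g. by Chang et al.'s TSP code)
    Returns: dict mapping TSP id -> (mean_dx, mean_dy)
    """
    # Dense optical flow between the two frames (Farneback stand-in).
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

    means = {}
    for tsp_id in np.unique(tsp_labels):
        mask = tsp_labels == tsp_id
        means[int(tsp_id)] = (float(flow[..., 0][mask].mean()),
                              float(flow[..., 1][mask].mean()))
    return means
```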
Redmon, Joseph, and Ali Farhadi. "YOLOv3: An incremental improvement." arXiv preprint arXiv:1804.02767 (2018).
Wojke, Nicolai, Alex Bewley, and Dietrich Paulus. "Simple online and realtime tracking with a deep association metric." 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017.
Chang, Jason, Donglai Wei, and John W. Fisher. "A video representation using temporal superpixels." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013.
[Figure: TSP mask pixels at coarse vs. fine granularity]
Optical flow only captures a 2D projection of the person's 3D motion. Factors that cause ambiguity when matching a 3D inertial feature to optical flow:
- Similar inertial measurements but different optical flow (the same 3D velocity $v$ can project to different image motion depending on viewpoint and depth)
Additional visual features to resolve the ambiguity:
- Bounding box size trajectory (from YOLO person detection)
- Human pose keypoint trajectory (from AlphaPose), with each keypoint position taken relative to the center of the bounding box
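A minimal sketch of these two features, assuming (x1, y1, x2, y2) boxes and AlphaPose-style keypoints; dividing by box size is an added assumption for scale invariance, beyond what the slide states:

```python
import numpy as np

def bbox_size_trajectory(boxes):
    """boxes: (T, 4) per-frame (x1, y1, x2, y2).
    Returns the (T, 2) width/height trajectory."""
    return np.stack([boxes[:, 2] - boxes[:, 0],
                     boxes[:, 3] - boxes[:, 1]], axis=1)

def relative_keypoints(keypoints, boxes):
    """keypoints: (T, K, 2) image-space joints (e.g. from AlphaPose).
    boxes:        (T, 4) matching person boxes.
    Returns (T, K, 2) keypoints relative to the bounding-box center,
    divided by the box size (the scaling is an assumption)."""
    center = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2.0,
                       (boxes[:, 1] + boxes[:, 3]) / 2.0], axis=1)
    size = np.stack([boxes[:, 2] - boxes[:, 0],
                     boxes[:, 3] - boxes[:, 1]], axis=1)
    return (keypoints - center[:, None, :]) / size[:, None, :]
```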
Xiu, Yuliang, et al. "Pose flow: Efficient online pose tracking." BMVC 2018.
[Network architecture: the three visual streams (TSP optical flow, bounding box size trajectory, human pose keypoint trajectory) are encoded by LSTM-OpticalFlow, LSTM-Box, and LSTM-Pose; the IMU stream is encoded by a Conv1D front-end followed by LSTM-IMU. FC layers map the encoder outputs into the joint feature space. The visual encoders share weights between the positive and negative samples, and the network is trained with a triplet loss.]
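A hedged PyTorch sketch of encoders matching the diagram's labels; all layer sizes, the keypoint count (17 joints assumed), and the concatenate-then-FC fusion are assumptions, since the slides do not give the exact dimensions:

```python
import torch
import torch.nn as nn

class SeqEncoder(nn.Module):
    """LSTM encoder used for each visual stream (box / pose / flow)."""
    def __init__(self, in_dim, hidden=128, out_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, out_dim)

    def forward(self, x):                  # x: (B, T, in_dim)
        _, (h, _) = self.lstm(x)           # final hidden state: (1, B, H)
        return self.fc(h[-1])              # (B, out_dim)

class IMUEncoder(nn.Module):
    """Conv1D front-end followed by LSTM-IMU, as labeled in the diagram."""
    def __init__(self, in_dim=6, conv_ch=32, hidden=128, out_dim=64):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, conv_ch, kernel_size=5, padding=2)
        self.lstm = nn.LSTM(conv_ch, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, out_dim)

    def forward(self, x):                  # x: (B, T, 6)
        z = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (B, T, conv_ch)
        _, (h, _) = self.lstm(z)
        return self.fc(h[-1])

class VisualEncoder(nn.Module):
    """Fuses the three visual streams into one joint-space embedding.
    A single instance is reused for positive and negative samples,
    which realizes the diagram's weight sharing."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.box = SeqEncoder(2, out_dim=out_dim)    # width/height per frame
        self.pose = SeqEncoder(34, out_dim=out_dim)  # 17 keypoints x 2 (assumed)
        self.flow = SeqEncoder(2, out_dim=out_dim)   # mean TSP flow (dx, dy)
        self.fc = nn.Linear(3 * out_dim, out_dim)    # fusion layer (assumed)

    def forward(self, box_seq, pose_seq, flow_seq):
        f = torch.cat([self.box(box_seq), self.pose(pose_seq),
                       self.flow(flow_seq)], dim=1)
        return self.fc(f)
```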
Feature Encoding and Matching
- Visual and inertial feature encoders $H_{\mathrm{VIS}}$ and $H_{\mathrm{IMU}}$ map the extracted features into the joint feature space
- $g^{+}_{\mathrm{VIS}}(\xi_i)$: visual feature extracted with TSP $\xi_i$ from the positive person; $g^{-}_{\mathrm{VIS}}(\xi_j)$: visual feature extracted with TSP $\xi_j$ from a negative person
- Triplet loss on the inertial (query) feature embedding $g^{n}_{\mathrm{IMU}}$:

$$L(g^{n}_{\mathrm{IMU}}, g^{+}_{\mathrm{VIS}}(\xi_i), g^{-}_{\mathrm{VIS}}(\xi_j)) = \max\left(\lVert H_{\mathrm{VIS}}(g^{+}_{\mathrm{VIS}}(\xi_i)) - H_{\mathrm{IMU}}(g^{n}_{\mathrm{IMU}}) \rVert_2 - \lVert H_{\mathrm{VIS}}(g^{-}_{\mathrm{VIS}}(\xi_j)) - H_{\mathrm{IMU}}(g^{n}_{\mathrm{IMU}}) \rVert_2 + \kappa,\; 0\right)$$

- Predicted IMU source person in the video: the candidate with the smallest average feature distance between all of that person's TSPs and the query IMU embedding
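A minimal sketch of the loss and the prediction rule above, assuming the inputs are embeddings already mapped by $H_{\mathrm{VIS}}$ / $H_{\mathrm{IMU}}$; the margin value is an assumption:

```python
import torch

def triplet_loss(f_imu, f_vis_pos, f_vis_neg, margin=0.2):
    """Triplet loss from the slide: pull the positive person's TSP
    embedding toward the IMU embedding, push the negative one away.
    Inputs: (B, D) embeddings; margin plays the role of kappa."""
    d_pos = torch.norm(f_vis_pos - f_imu, dim=1)
    d_neg = torch.norm(f_vis_neg - f_imu, dim=1)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

def predict_imu_source(f_imu, tsp_embeddings_per_person):
    """f_imu: (D,) query IMU embedding.
    tsp_embeddings_per_person: entry n is an (M_n, D) tensor holding
    embeddings of all TSPs belonging to candidate person n.
    Returns the candidate with the smallest average TSP-to-IMU distance."""
    avg_dists = torch.stack(
        [torch.norm(tsps - f_imu, dim=1).mean()
         for tsps in tsp_embeddings_per_person])
    return int(torch.argmin(avg_dists))
```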
Experiments
- Prediction accuracy is evaluated with different numbers of people (from 2 to 5) in the scene
- Accuracy drops as the number of people in the video increases (there are more potential false positives)
Dataset
- Video: camera mounted 1 meter above the ground
- IMU: recorded from each participant's smartphone
[Table: total length of the video with respect to the number of people in the scene]
Baselines compared:
- Non-learning based
- Transform one modality to the other
- Learning a joint feature space
A longer time window includes more feature information, which helps discriminate similar motions, but it reduces the number of usable samples because longer windows are more often interrupted by occlusions.
Prediction Accuracy with Different IMU Features
Prediction Accuracy with Different Visual Features
[Qualitative results: green box marks the IMU source person, white box the predicted IMU source; feature distance is shown on a color scale from small to large. One failure case is included.]
Summary
- Locates the IMU source person among multiple people in the scene, without strict constraints on IMU placement
Future work
- Deployment on mobile robots