3D Vision, Torsten Sattler and Martin Oswald, Spring 2018
3D Vision
- Understanding geometric relations
- between images and the 3D world
- between images
- Obtaining 3D information describing
  our 3D world
- from images
- from dedicated sensors
3D Vision
- Extremely important in robotics and
AR / VR
- Visual navigation
- Sensing / mapping the environment
- Obstacle detection, …
- Many further application areas
- A few examples …
Google Tango
(officially discontinued, lives on as ARCore)
Google Tango
Image-Based Localization
Geo-Tagging Holiday Photos
(Li et al. ECCV 2012)
Augmented Reality
(Middelberg et al. ECCV 2014)
Video credit: Johannes Schönberger
Large-Scale Structure-from-Motion
Virtual Tourism
UNC/UKY UrbanScape project
3D Urban Modeling
3D Urban Modeling
Mobile Phone 3D Scanner
Mobile Phone 3D Scanner
Self-Driving Cars
Self-Driving Cars
Self-Driving Cars
Micro Aerial Vehicles
Microsoft HoloLens
Mixed Reality
Virtual Reality
Raw Kinect Output: Color + Depth
http://grouplab.cpsc.ucalgary.ca/cookbook/index.php/Technologies/Kinect
Human-Machine Interface
3D Video with Kinect
Autonomous Micro-Helicopter Navigation
Use Kinect to map out obstacles and avoid collisions
Dynamic Reconstruction
Performance Capture
Performance Capture
(Oswald et al. ECCV 14)
Performance Capture
Motion Capture
Interactive 3D Modeling
(Sinha et al. Siggraph Asia 08) collaboration with Microsoft Research (and licensed to MS)
Scanning Industrial Sites
as-built 3D model of an off-shore oil platform
Scanning Cultural Heritage
Cultural Heritage
Stanford’s Digital Michelangelo
- Digital archive
- Art historic studies
- Accuracy ~1/500 from DV video (i.e., 140 KB JPEGs, 576x720)
Archaeology
Forensics
- Crime scene recording and analysis
Forensics
Sports
Surgery
Johannes Schönberger CAB G 85.1
jsch@inf.ethz.ch
Martin Oswald CNB G103.2
martin.oswald@inf.ethz.ch
Torsten Sattler CNB 104
torsten.sattler@inf.ethz.ch
Federico Camposeco CAB G 86.3
federico.camposeco@inf.ethz.ch
Peidong Liu CAB G 84.2
peidong.liu@inf.ethz.ch
Nikolay Savinov CAB G 81.1
nikolay.savinov@inf.ethz.ch
3D Vision Course Team
Katarina Tóthóva CAB G 102.2
katarina.tothova@inf.ethz.ch
- To understand the concepts that relate
  images to the 3D world and images to
  other images
- Explore the state of the art in 3D vision
- Implement a 3D vision system/algorithm
Course Objectives
Learning Approach
- Introductory lectures:
- Cover basic 3D vision concepts and approaches.
- Further lectures:
- Short introduction to topic
- Paper presentations (you)
(seminal papers and state of the art, related to your projects)
- 3D vision project:
- Choose topic, define scope (by week 4)
- Implement algorithm/system
- Presentation/demo and paper report
Grade distribution
- Paper presentation & discussions: 25%
- 3D vision project & report: 75%
Slides and more
http://www.cvg.ethz.ch/teaching/3dvision/
Also check out the on-line "shape-from-video" tutorial:
http://www.cs.unc.edu/~marc/tutorial.pdf http://www.cs.unc.edu/~marc/tutorial/
Textbooks:
- Hartley & Zisserman, Multiple View Geometry
- Szeliski, Computer Vision: Algorithms and Applications
Materials
Feb 19  Introduction
Feb 26  Geometry, Camera Model, Calibration
Mar 5   Features, Tracking / Matching
Mar 12  Project Proposals by Students
Mar 19  Structure from Motion (SfM) + papers
Mar 26  Dense Correspondence (stereo / optical flow) + papers
Apr 2   Bundle Adjustment & SLAM + papers
Apr 9   Student Midterm Presentations
Apr 16  Easter break
Apr 23  Multi-View Stereo & Volumetric Modeling + papers
Apr 30  Whitsuntide
May 7   3D Modeling with Depth Sensors + papers
May 14  3D Scene Understanding + papers
May 21  4D Video & Dynamic Scenes + papers
May 28  Student Project Demo Day = Final Presentations
Schedule
Fast Forward
- Quick overview of what is coming…
- Pinhole camera
- Geometric transformations in 2D and 3D
Camera Models and Geometry
- Known 2D/3D correspondences:
  compute the projection matrix
- Also radial distortion (non-linear)
Camera Calibration
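The linear part of this idea (known 2D/3D correspondences, solve for the projection matrix) can be sketched with the Direct Linear Transform. The NumPy code below is an illustrative sketch only, ignoring the non-linear radial distortion mentioned above; function names are our own:

```python
import numpy as np

def dlt_projection_matrix(points_3d, points_2d):
    """Estimate a 3x4 projection matrix P from n >= 6 2D/3D
    correspondences with the Direct Linear Transform (DLT)."""
    A = []
    for (X, Y, Z), (u, v) in zip(points_3d, points_2d):
        # Each correspondence contributes two linear equations in the
        # 12 entries of P (stacked row-wise into the vector p).
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u*X, -u*Y, -u*Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v*X, -v*Y, -v*Z, -v])
    # The solution (up to scale) is the right singular vector with the
    # smallest singular value.
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    return Vt[-1].reshape(3, 4)

def project(P, X):
    """Project a 3D point with P and dehomogenize."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]
```

With noise-free synthetic correspondences, the recovered matrix reprojects the 3D points onto the original image points; real calibration additionally normalizes coordinates and refines non-linearly.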
Harris corners, KLT features, SIFT features
Key concepts: invariance of extraction and descriptors to viewpoint, exposure, and illumination changes
Feature Tracking and Matching
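As a small illustration of the matching side, here is a sketch of nearest-neighbor descriptor matching with Lowe's ratio test, the standard acceptance criterion for SIFT matches; the random vectors below merely stand in for real descriptors:

```python
import numpy as np

def match_descriptors(desc1, desc2, ratio=0.8):
    """Nearest-neighbor matching with Lowe's ratio test: accept a match
    only if the best distance is clearly below the second-best one."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)
        order = np.argsort(dists)
        if dists[order[0]] < ratio * dists[order[1]]:
            matches.append((i, int(order[0])))
    return matches
```

The ratio test discards ambiguous matches, which is what makes descriptor matching usable for the robust estimation steps later in the course.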
Triangulation
(figure: cameras C1 and C2 observe image points m1 and m2; the corresponding viewing rays L1 and L2 intersect in the 3D point M)
- calibration
- correspondences
3D from Images
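Given calibration and correspondences, the triangulation step reduces to a small linear least-squares problem. A minimal NumPy sketch of the DLT-style formulation (names are our own):

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear triangulation: recover the 3D point M whose projections
    through the camera matrices P1 and P2 are the image points x1, x2."""
    # Each view gives two equations of the form x * (p3 . M) - (p1 . M) = 0.
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # Homogeneous solution = right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    M = Vt[-1]
    return M[:3] / M[3]
```

With noisy correspondences the rays no longer intersect exactly; the SVD then returns the algebraic least-squares point, which bundle adjustment later refines geometrically.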
- Fundamental matrix
- Essential matrix
- Also: how to robustly compute them from images
Epipolar Geometry
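For illustration, the (unnormalized) eight-point algorithm below estimates the fundamental matrix from noise-free synthetic matches; on real images one adds Hartley normalization and wraps the estimator in RANSAC for the robustness the slides mention. This is a sketch, not reference code:

```python
import numpy as np

def eight_point(x1, x2):
    """Unnormalized eight-point algorithm: estimate F from n >= 8
    matches satisfying x2^T F x1 = 0, then enforce rank 2."""
    A = np.array([[u2 * u1, u2 * v1, u2, v2 * u1, v2 * v1, v2, u1, v1, 1.0]
                  for (u1, v1), (u2, v2) in zip(x1, x2)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # A valid fundamental matrix has rank 2: zero the smallest singular value.
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0.0
    return U @ np.diag(S) @ Vt
```

Each match contributes one row of the epipolar constraint, so eight or more non-degenerate matches determine F up to scale.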
1. Initialize motion (P1, P2 compatible with F)
2. Initialize structure (minimize reprojection error)
3. Extend motion (compute pose from matches seen in 2 or more previous views)
4. Extend structure (initialize new structure, refine existing structure)
Structure from Motion
- Visual Simultaneous Localization and Mapping
Visual SLAM
(Clipp et al. ICCV’09)
Stereo and Rectification
- Warp images to simplify the epipolar geometry
- Compute correspondences for all pixels
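After rectification, correspondences lie on the same scanline, so a brute-force block matcher only has to search horizontally. A deliberately simple NumPy sketch (real stereo methods add normalization, sub-pixel refinement, and smoothness priors):

```python
import numpy as np

def block_match_disparity(left, right, max_disp, win=3):
    """Brute-force block matching on rectified images: for each pixel,
    find the horizontal shift minimizing the SAD cost over a window."""
    h, w = left.shape
    pad = win // 2
    disp = np.zeros((h, w), dtype=int)
    for y in range(pad, h - pad):
        for x in range(pad + max_disp, w - pad):
            patch = left[y - pad:y + pad + 1, x - pad:x + pad + 1]
            costs = [np.abs(patch - right[y - pad:y + pad + 1,
                                          x - d - pad:x - d + pad + 1]).sum()
                     for d in range(max_disp + 1)]
            disp[y, x] = int(np.argmin(costs))
    return disp
```

On a synthetic pair where the right image is the left one shifted by two pixels, the matcher recovers a constant disparity of 2 in the valid region.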
Multi-View Stereo
Joint 3D Reconstruction and Class Segmentation
(Haene et al CVPR13)
joint reconstruction and segmentation
(ground, building, vegetation, stuff)
reconstruction only
(isotropic smoothness prior)
■ Building ■ Ground ■ Vegetation ■ Clutter
Structured Light
- Projector = camera
- Use specific patterns to obtain correspondences
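As an illustrative sketch of the "specific patterns" idea, binary Gray-code stripes encode each projector column across a stack of images, so the bit sequence a camera pixel observes identifies its corresponding projector column (the implementation details here are our own, not a specific course implementation):

```python
import numpy as np

def gray_code_patterns(width):
    """Binary Gray-code stripe patterns: one image row per bit; projector
    column c is encoded by the Gray code of c across the stack."""
    n_bits = max(1, (width - 1).bit_length())
    cols = np.arange(width)
    gray = cols ^ (cols >> 1)             # binary -> Gray code
    return np.array([(gray >> b) & 1 for b in range(n_bits - 1, -1, -1)])

def decode_gray(stack):
    """Recover the projector column for each pixel from the observed bits."""
    gray = np.zeros(stack.shape[1:], dtype=int)
    for bits in stack:                    # MSB first
        gray = (gray << 1) | bits
    # Gray -> binary: XOR of all right shifts of the Gray value.
    binary = gray.copy()
    shift = gray >> 1
    while shift.any():
        binary ^= shift
        shift >>= 1
    return binary
```

Gray codes are preferred over plain binary stripes because adjacent columns differ in only one bit, so a decoding error near a stripe boundary displaces the correspondence by at most one column.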
Papers and Discussion
- Will cover recent state of the art
- Each student team will present a paper (5 min per team member), followed by a discussion
- An "adversary" team leads the discussion
- Papers will be related to projects/topics
- Will distribute papers later
(depending on chosen projects)
Projects and reports
- Project on 3D Vision-related topic
- Implement algorithm / system
- Evaluate it
- Write a report about it
- 3 Presentations / Demos:
- Project Proposal Presentation (week 4)
- Midterm Presentation (week 8)
- Project Demos (week 15)
- Ideally: Groups of 3 students
Course project example: Build your own 3D scanner!
Example: Bouguet ICCV’98
Project Topics
DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional Neural Networks
Goal: Implement a deep recurrent convolutional neural network for end-to-end visual odometry [1]
Description:
Most existing VO algorithms follow a standard pipeline of feature extraction, feature matching, motion estimation, local optimization, etc. Although some have demonstrated superior performance, they usually need to be carefully designed and specifically fine-tuned to work well in different environments. Some prior knowledge is also required to recover an absolute scale for monocular VO. This project is to implement a novel end-to-end framework for monocular VO using deep Recurrent Convolutional Neural Networks (RCNNs). Since it is trained and deployed end-to-end, it infers poses directly from a sequence of raw RGB images (videos) without adopting any module of the conventional VO pipeline. Based on RCNNs, it not only automatically learns effective feature representations for the VO problem through Convolutional Neural Networks, but also implicitly models sequential dynamics and relations using deep Recurrent Neural Networks. Extensive experiments on the KITTI VO dataset show performance competitive with state-of-the-art methods, verifying that end-to-end deep learning can be a viable complement to traditional VO systems.
[1] Wang et al., DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional Neural Networks, ICRA 2017
Recommended: Python and prior knowledge in machine learning
Peidong Liu, CNB D102 peidong.liu@inf.ethz.ch
Deep Relative Pose Estimation for a Stereo Camera
Goal: Design a neural network to estimate the relative pose between two frames of a stereo camera.
Description: Recently, there has been work on neural network-based relative pose estimation between two images/frames, aimed at applications in autonomous driving. However, compared to traditional geometric methods (e.g., the 5-point algorithm), these methods are much less accurate. With a stereo camera we obtain two frames captured at the same time and can recover the depth of each frame without scale ambiguity, which helps the pose estimation. This project aims to design a neural network that estimates the relative pose between two frames of a stereo camera. The students will start by studying existing neural networks for disparity/depth estimation and for pose estimation with a monocular camera. They will then focus on the design of a neural network for the stereo camera.
[1] Zhou T, Brown M, Snavely N, Lowe DG. Unsupervised learning of depth and ego-motion from video. In CVPR 2017.
[2] Ummenhofer B, Zhou H, Uhrig J, Mayer N, Ilg E, Dosovitskiy A, Brox T. DeMoN: Depth and motion network for learning monocular stereo. In CVPR 2017.
[3] Mayer N, Ilg E, Hausser P, Fischer P, Cremers D, Dosovitskiy A, Brox T. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR 2016.
Required: Python, Linux
Recommended: Experience with TensorFlow, PyTorch or other deep learning frameworks
Zhaopeng Cui zhaopeng.cui@inf.ethz.ch CNB G104
Differential Rolling-Shutter SfM
Goal: Model the rolling shutter (RS) to create RS-artifact-free images
Description: It is well known that moving RS cameras create distorted images. The effect is typically visible when vertical structures appear slanted. In this work we want to model the RS effect and compensate for it. The input to the algorithm is a short image burst, from which we will first compute the optical flow, then estimate the camera pose and camera motion parameters. Finally, we want to create a global shutter image by warping the RS image over the estimated depth into a global shutter reference frame.
[1] Zhuang et al., "Rolling-Shutter-Aware Differential SfM and Image Rectification", ICCV 2017
Required: C++, some experience with image processing
Recommended: Experience with OpenCV
Olivier Saurer <saurero@inf.ethz.ch>
3DScanBox
Goal: Build a multi-camera 3D scan box
Description: Implement a simple 3D scanner using an aluminum frame and a set of cameras. The necessary material is provided. Depending on the group size and interest, the focus of the project can be put on different aspects, ranging from multi-camera online calibration to multi-view stereo or fusion.
Required: C++, some experience with image processing Recommended: OpenCV, maybe Google Ceres
Petri Tanskanen, petri.tanskanen@inf.ethz.ch
Camera Pose Estimation for Artistic Purposes
Goal: Create a Blender plugin that finds the poses and focal lengths (extrinsic and intrinsic parameters) of a set of reference images.
Description: 3D artists take reference images of scenes and objects they want to model. For modeling, it is helpful to align the virtual cameras in the modeling tool of choice (Blender, Maya, ...) to the reference images. There exists a Blender plugin [1] that uses two-point perspective and user input to find the pose and focal length of a single reference image. This project aims to implement an SfM plugin that finds the relative poses and focal lengths of a set of reference images. Open questions: rely on user input for point matches, or use SIFT + feature matching? Images might need to be undistorted, since Blender's camera does not model distortion.
[1] Per Gantelius, "BLAM", https://github.com/stuffmatic/blam
Required: Python, C++
Recommended: Blender, OpenCV or COLMAP
Daniel Thul daniel.thul@inf.ethz.ch
Transfer from Recognition to Optical Flow by Matching Neural Paths
Goal: Implement an optical flow method by matching neural paths
Description: The goal is to extend the stereo method of Savinov et al. [1] to optical flow. The main challenge is handling the large memory requirements by passing only a restricted subset of the most probable labels during the back-propagation phase of the label likelihoods. The method can be implemented in any deep learning framework.
[1] Savinov et al., "Matching Neural Paths: Transfer from Recognition to Correspondence Search", NIPS 2017
Required: C++, CUDA, any deep learning framework
Lubor Ladicky, lubor.ladicky@inf.ethz.ch
Navigation by Reinforcement Learning
Goal: Benchmark different RL algorithms on their ability to learn to navigate to a goal
Description: You will take one of the popular RL libraries, such as Tensorforce [2] or OpenAI Baselines [3], and benchmark them on the 3D navigation tasks proposed in [1]. The tasks are implemented as maps in a ViZDoom environment. The agent is given a high reward for reaching the image-specified goal and a small reward for collecting items like health kits (which should ignite its curiosity and make it explore). Its objective is to maximize rewards. You will compare the following RL methods: A3C, A2C, PPO.
[1] Savinov et al., "Semi-parametric topological memory for navigation", ICLR 2018, https://openreview.net/pdf?id=SygwwGbRW
[2] https://github.com/reinforceio/tensorforce
[3] https://github.com/openai/baselines
Required: Python
Recommended: Knowledge in machine learning, experience with TensorFlow and RL
Nikolay Savinov nikolay.savinov@inf.ethz.ch
Appearance Representation based on Auto-Encoders
Goal: Improve the appearance model with a deep auto-encoder
Description: This project aims to build efficient appearance representations of shapes observed from multiple viewpoints and over time. Recent work [1] addressed this using Principal Component Analysis (PCA). The goal of this project is to explore deep auto-encoders as an alternative for dimensionality reduction. The students will build on existing tools using MATLAB and Python/TensorFlow to explore appearance representations obtained from auto-encoders and compare the results to [1].
[1] Boukhayma et al., "Eigen appearance maps of dynamic shapes", ECCV 2016
Required: Python and MATLAB, some experience with image processing
Recommended: Some experience with deep learning / TensorFlow
- Dr. Vagia Tsiminaki (vagia.tsiminaki@inf.ethz.ch)
- Dr. Lisa Koch (lisa.koch@inf.ethz.ch)
Auto-encoder for dimensionality reduction
SuperPoint: Self-Supervised Interest Point Detection and Description
Goal: Implement a self-supervised fully convolutional neural network for interest point detection and description [1]
Description: This project is to implement a self-supervised framework for training interest point detectors and descriptors suitable for a large number of multiple-view geometry problems in computer vision. As opposed to patch-based neural networks, this fully convolutional model operates on full-sized images and jointly computes pixel-level interest point locations and associated descriptors in one forward pass. The authors introduce Homographic Adaptation, a multi-scale, multi-homography approach for boosting interest point detection repeatability and performing cross-domain adaptation (e.g., synthetic-to-real). Their model, when trained on the MS-COCO generic image dataset using Homographic Adaptation, repeatedly detects a much richer set of interest points than the initial pre-adapted deep model and any other traditional corner detector. The final system achieves state-of-the-art homography estimation results on HPatches when compared to LIFT, SIFT, and ORB.
[1] DeTone et al., SuperPoint: Self-Supervised Interest Point Detection and Description, arXiv 2017
Recommended: Python and prior knowledge in machine learning
Peidong Liu, CNB D102 peidong.liu@inf.ethz.ch
Real-Time Surface Reconstruction
Goal: Perform depth-map fusion directly into a mesh
Description: Traditional approaches rely on volumetric representations or point clouds to represent the environment and fuse the different depth measurements. In this work we want to use a mesh representation to model the environment. The goal is to fuse new depth estimates directly into the mesh. Adaptive tessellation is used to represent different levels of geometric detail in the scene. The input to the algorithm is a set of calibrated RGBD images. The focus of the project is the implementation of the fusion algorithm. For this we will closely follow Zienkiewicz et al. [1].
[1] Zienkiewicz et al., “Monocular, Real-Time Surface Reconstruction using Dynamic Level of Detail”, 3DV 2016 Required: C++, some experience with image processing Recommended: Experience with OpenCV
Olivier Saurer <saurero@inf.ethz.ch>
Data Generation with a Virtual Simulator for Autonomous Driving
Goal: Generate 3D training data with a recent open urban driving simulator for autonomous driving.
Description: Recently, an open-source simulator for autonomous driving research named CARLA [1] was released. The simulator supports real-time acquisition of RGB images, semantic segmentations, and depth maps, which can be used as training data for deep learning methods. This project aims to use the simulator to generate additional kinds of training data, including 2D/3D instance-level segmentations, 3D bounding boxes, and 3D shapes and poses of vehicles, for the training of deep 3D detection methods.
[1] Dosovitskiy A, Ros G, Codevilla F, López A, Koltun V. CARLA: An open urban driving simulator. Conference on Robot Learning (CoRL), 2017
Required: C++, Python
Recommended: Familiarity with UE4 programming
Zhaopeng Cui zhaopeng.cui@inf.ethz.ch CNB G104
Data Fusion for Semantic 3D Reconstruction
Goal: Improve the data fusion pipeline for semantic 3D reconstruction using a learning approach
Description: Semantic 3D reconstruction is the task of jointly reconstructing and segmenting a 3D model. It has been shown in [1] that both tasks can benefit from each other: the 3D structure offers grounds for regularizing the segmentation, while the semantic information gives access to shape priors (e.g., the ground is flat and horizontal). Methods such as [1] or [2] take as input multiple depth maps and corresponding 2D semantic segmentations, and fuse them into a modified Truncated Signed Distance Function (TSDF) [3]. Though very efficient, this fusion could be improved in order to obtain better input for the methods. In this project we propose to leverage the availability of semantic 3D data, and machine learning libraries such as TensorFlow, in order to learn a method for fusing the data used for semantic 3D segmentation.
[1] Dense Semantic 3D Reconstruction, Häne et al., TPAMI 2017
[2] Learning Priors for Semantic 3D Reconstruction, Cherabier et al., unpublished 2018
[3] A Volumetric Method for Building Complex Models from Range Images, Curless and Levoy, SIGGRAPH 1996
Required: Python, Tensorflow Recommended: Optimization (Maths), C++
Ian Cherabier (ian.cherabier@inf.ethz.ch) Martin Oswald (martin.oswald@inf.ethz.ch)
Surface Reconstruction in Medical Imaging: Data and CNNs
Goal: Create synthetic MR datasets and use them to test various surface reconstruction architectures
Description: Reconstruction of organ surfaces is an important task in medical image analysis, especially in cardiac and neuro-imaging. Besides their significance in diagnosis and surgical planning, high-quality organ surface models provide powerful measures for statistical analysis or disease tracking. Thanks to recent advances in machine learning, we are devising a deep neural network-based approach for direct organ surface reconstruction from MRI data. To test the efficacy of the proposed network architectures, it is necessary to design and produce relevant synthetic MR data.
Required: Python, Matlab
Recommended: Experience with machine learning and TensorFlow
Katarina Tothova katarina.tothova@inf.ethz.ch
3D Appearance Super-Resolution Benchmark
Goal: Generate appearance super-resolution benchmark datasets
Description: This project aims to generate a super-resolution appearance dataset and provide a systematic benchmark for evaluation. Previous work [1] presents a framework for synthetic generation of realistic benchmarks for 3D reconstruction from images. The ETH3D benchmark [2] covers a variety of indoor and outdoor scenes. The goal of this project is to build on these works and generate a super-resolved appearance dataset for the multi-view case.
[1] A. Ley et al., "SyB3R: A Realistic Synthetic Benchmark for 3D Reconstruction from Images", ECCV 2016
[2] T. Schöps et al., "A Multi-View Stereo Benchmark with High-Resolution Images and Multi-Camera Videos", CVPR 2017
Required: Matlab/Python, some experience with image processing Recommended: Experience with C++, scripting language
- Dr. Vagia Tsiminaki (vagia.tsiminaki@inf.ethz.ch)
Super-Resolving Appearance of 3D Faces for a Dermatology App
Goal: Super-resolve appearance for mobile phone applications
Description: This project aims to implement a super-resolution algorithm for appearance representations of 3D faces. Previous work [1] presented a method to retrieve high-resolution textures of objects observed in multiple videos under small object deformations. The goal of this project is to implement the proposed method for mobile phone applications, where performance in terms of time and memory is important. The students will implement the super-resolution framework [1] using C++. The project can be built upon an existing C++/CUDA implementation of [2].
[1] Tsiminaki et al., "High Resolution 3D Shape Texture from Multiple Videos", CVPR 2014
[2] D. Mitzel, T. Pock, T. Schoenemann, and D. Cremers, "Video Super Resolution using Duality based TV-L1 Optical Flow", DAGM 2009, pages 432-441
Required: C++, some experience with image processing Recommended: Experience with Matlab
- Dr. Vagia Tsiminaki (vagia.tsiminaki@inf.ethz.ch)
- Dr. Martin Oswald (martin.oswald@inf.ethz.ch)
Motion Blur Aware Camera Pose Tracking
Goal: Implement a camera pose tracker for motion-blurred images
Description: A camera pose tracker is usually the front-end of a visual odometry (VO) algorithm. Most existing works assume that the input images to VO are sharp. However, images are easily blurred if the camera moves too fast with a long exposure time, which can make VO fail. In this project, we plan to investigate and implement a motion blur aware camera pose tracker. To make the problem tractable, we assume that the reference image is sharp and only the current image is motion-blurred. Furthermore, we assume that the depth map corresponding to the reference image is already known. All required datasets can be generated with a simulation tool, which is already set up for you.
Required: Good programming skills in C++
Peidong Liu and Vagia Tsiminaki, CNB D102 peidong.liu@inf.ethz.ch vagia.tsiminaki@inf.ethz.ch
3D Vision, Spring Semester 2018
Your Own Project
Goal: Learn about the techniques presented in the lecture
Description: Choose your own topic!
Available hardware: Google Tango Tablets, Microsoft HoloLens, GoPro Cameras, Intel RealSense Sensor
Supervisor: We find one for you
Required: Related to 3D Vision / topics of the lecture
Your Next Steps
- Find a group (ideally: groups of 3)
- Find a project (one of ours or your own)
- Topic subscription via doodle in a few days:
- For questions, contact us via the lecture Moodle
  (preferred) or contact Nikolay by email
- First come, first served!
- Do not contact supervisors directly!
- After topic assignment: talk with your supervisor
- Write a project proposal
- Don’t worry: You’ll get reminders!
Feb 19  Introduction
Feb 26  Geometry, Camera Model, Calibration
Mar 5   Features, Tracking / Matching
Mar 12  Project Proposals by Students
Mar 19  Structure from Motion (SfM) + papers
Mar 26  Dense Correspondence (stereo / optical flow) + papers
Apr 2   Easter break
Apr 9   Bundle Adjustment & SLAM + papers
Apr 16  Student Midterm Presentations
Apr 23  Multi-View Stereo & Volumetric Modeling + papers
Apr 30  3D Modeling with Depth Sensors + papers
May 7   3D Scene Understanding + papers
May 14  4D Video & Dynamic Scenes + papers
May 21  Whitsuntide
May 28  Student Project Demo Day = Final Presentations