
GPU Computing Projects
E. Carlinet, J. Chazalon {firstname.lastname@lrde.epita.fr}
Sept. 2020
EPITA Research & Development Laboratory (LRDE)
Slides generated on September 15, 2020

Instructions for the Project

Objectives
The goals of the project are to:
• apply data-parallelism concepts
• practice with CUDA
• set up a benchmark with a sound evaluation procedure
• present your results in a clear and convincing way

Possible Subjects
Standard Option
We propose one subject, which most of you should work on:
Implementation and performance analysis of the Iterative Closest Point algorithm in CUDA
This is an important point cloud registration algorithm; it is described briefly in later slides.
Special Option
For students who are at ease with CUDA and want to investigate a particular question:
Implementation and performance analysis of SOME INTERESTING algorithm in YOUR PARALLEL PROGRAMMING TECHNOLOGY OF CHOICE
If you choose this option, you must validate your subject with us before Sept. 25th. Contact us by email.

Our Expectations
We expect your implementation to be:
• running on GPU;
• correct, i.e. it produces an acceptable result.
Do not try to make it fast at first; just make it work. Then apply NVIDIA's Assess, Parallelize, Optimize, Deploy (APOD) design cycle, as described in their CUDA C++ Best Practices Guide:
1. identify the part of the code that is responsible for the bulk of the execution time;
2. use all available weapons (CUDA API, libraries, research papers) to obtain a parallel version of the code (assumed to be sequential at first);
3. use all available weapons (CUDA API, libraries, research papers) to optimize the performance of the parallel code;
4. measure the performance of the new code.

Project Outline for Standard Performance Analysis
Broad Outline | Concrete Example
Choose an application | Mandelbrot
Determine the most time-consuming part of the app | Global atomics
Determine one or more data-parallel approaches to solving the problem | Tiling...
Create multiple implementations of the approach | One naïve version, one version with shared memory...
Benchmark the implementations | Record memory transfer time, kernel time, utilization, FLOPS, etc.
Relate results to course concepts | Identify the cause of the bottleneck (memory-bound or compute-bound)
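The "Benchmark the implementations" step can be made concrete with a small timing harness. What follows is only a minimal sketch in C++, assuming that wall-clock timing with std::chrono is acceptable for the CPU reference. The helper names median_time_ms and speedup are hypothetical and do not come from the course material.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Runs `work` a few times without timing it (warm-up), then `repetitions`
// timed runs, and returns the median wall-clock time in milliseconds.
// The median is less sensitive to outliers than the mean.
double median_time_ms(const std::function<void()>& work,
                      int repetitions = 10, int warmup = 2) {
    assert(repetitions > 0);
    for (int i = 0; i < warmup; ++i) work();

    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(repetitions));
    for (int i = 0; i < repetitions; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }

    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    return (n % 2 == 1) ? samples[n / 2]
                        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}

// Speedup of an optimized version relative to a baseline (both in ms).
double speedup(double baseline_ms, double optimized_ms) {
    assert(optimized_ms > 0.0);
    return baseline_ms / optimized_ms;
}
```

For GPU code, kernel launches are asynchronous, so the timed callable would have to include a synchronization such as cudaDeviceSynchronize(), or the measurement could rely on CUDA events instead.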

Teams
Teams of 4 (at most 1 special group per class). Everyone must select a group on the course page in Moodle before Sunday.

Final Deliverables (1/3)
1. Implementation
• Source code for C++ CPU reference
• Source code for CUDA implementation(s)
• Source code for benchmark tools
• Build scripts (GNU Make, CMake...)

Final Deliverables (2/3)
2. Report
• Description of the problem
  • Detailed if custom subject
  • Quick summary otherwise
• Quick description of the baseline CPU implementation (paper reference, parallel or not, etc.)
• Quick description of the baseline GPU implementation (same as CPU baseline)
• Justification of the performance indicators you have used
• Analysis of performance bottlenecks (with measured indicators, graphs, etc.)
• For each improvement over the GPU baseline (implementations):
  • justification of this work regarding performance analysis
  • description of the improvement (e.g. used output privatization instead of global atomics)
  • comparison of the performance with and without this implementation
• Table with summary of the benchmark
• Summary of who did what (contribution of each team member)

Final Deliverables (3/3)
3. A live lecture / defense
• 15-minute presentation
• 10-minute discussion
Submit implementation + report + slides on Moodle before October 31st.

Grade Sheet Used for Last Session

Defenses
Defenses will be held at the beginning of November (exact date TBA). We will use Teams to meet if need be. The participation of all team members is required in all cases.

Moodle Links
Course page for GISTRE: https://moodle.cri.epita.fr/course/view.php?id=325
Course page for SCIA: https://moodle.cri.epita.fr/course/view.php?id=326

Summary of Tasks and Deadlines
What | Deadline | Who
Register on Moodle | Sept. 15th | Everyone
Choose a group | Sept. 20th | Everyone
(opt.) Complete the feedback form | Sept. 30th | Everyone
... Work on your project... | |
Submit code + report + presentation slides | Oct. 31st | 1 person/team
Defend your project (live presentation) | Beginning of Nov. | Everyone

About Iterative Closest Point (ICP)

Iterative Closest Point Overview
A point cloud processing algorithm used in many applications: medical image registration, LIDAR frame registration, SLAM...

Iterative Closest Point Algorithm
1. For each point in the source point cloud, find the closest point in the reference point cloud.
2. Estimate the combination of rotation and translation that best aligns the matched pairs, using a root-mean-square point-to-point distance minimization technique.
3. Transform the source points using the obtained transformation.
4. Loop until the iteration limit or a distance threshold is reached, or...
Do not use complex versions unless everything else is perfect!
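Step 1 is usually the most expensive part, so it helps to see its simplest form. The sketch below is a brute-force C++ version, assuming 3D points stored as an array of structures. Point3, squared_distance and match_closest are hypothetical names chosen for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical 3D point type used only for this sketch.
struct Point3 { float x, y, z; };

// Squared Euclidean distance: the square root is not needed to rank candidates.
inline float squared_distance(const Point3& a, const Point3& b) {
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// For each source point, returns the index of its closest reference point.
// Brute force: |source| * |reference| distance evaluations.
std::vector<std::size_t> match_closest(const std::vector<Point3>& source,
                                       const std::vector<Point3>& reference) {
    assert(!reference.empty());
    std::vector<std::size_t> matches(source.size());
    for (std::size_t i = 0; i < source.size(); ++i) {
        float best = std::numeric_limits<float>::max();
        std::size_t best_j = 0;
        for (std::size_t j = 0; j < reference.size(); ++j) {
            const float d = squared_distance(source[i], reference[j]);
            if (d < best) { best = d; best_j = j; }
        }
        matches[i] = best_j;
    }
    return matches;
}
```

The iterations of the outer loop are independent of each other, which makes this step a natural first candidate for a one-thread-per-source-point GPU port.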

Pseudo code
algorithm ICP(M, S)
    θ := θ_0
    while not registered:
        X := ∅
        for m_i ∈ transform(M, θ):
            ŝ_j := closest point in S to m_i
            X := X + ⟨m_i, ŝ_j⟩
        θ := least_squares(X)
    return θ
Source: en.wikipedia.org/wiki/Point_set_registration#Iterative_closest_point
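To make the loop concrete, here is a minimal C++ sketch of the pseudo code. It deliberately assumes 2D points, because the least-squares rigid transform then has a closed form (one atan2); the 3D case typically needs an SVD-based or quaternion-based estimate instead. The pairs passed to least_squares are taken as the original points of M and their matches in S, which is one possible reading of the pseudo code. All names (Point2, Rigid2, apply, closest_index, least_squares, icp) and the default stopping parameters are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Point2 { double x, y; };
struct Rigid2 { double angle, tx, ty; };  // rotate by `angle` (radians), then translate

Point2 apply(const Rigid2& t, const Point2& p) {
    const double c = std::cos(t.angle), s = std::sin(t.angle);
    return {c * p.x - s * p.y + t.tx, s * p.x + c * p.y + t.ty};
}

// Brute-force search for the index of the point of `cloud` closest to `p`.
std::size_t closest_index(const Point2& p, const std::vector<Point2>& cloud) {
    std::size_t best_j = 0;
    double best = std::numeric_limits<double>::max();
    for (std::size_t j = 0; j < cloud.size(); ++j) {
        const double dx = p.x - cloud[j].x, dy = p.y - cloud[j].y;
        const double d = dx * dx + dy * dy;
        if (d < best) { best = d; best_j = j; }
    }
    return best_j;
}

// Closed-form 2D rigid transform minimizing sum_i |R a_i + t - b_i|^2.
Rigid2 least_squares(const std::vector<Point2>& a, const std::vector<Point2>& b) {
    assert(a.size() == b.size() && !a.empty());
    const double n = static_cast<double>(a.size());
    double ax = 0, ay = 0, bx = 0, by = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        ax += a[i].x; ay += a[i].y; bx += b[i].x; by += b[i].y;
    }
    ax /= n; ay /= n; bx /= n; by /= n;  // centroids of both sets
    double cos_sum = 0, sin_sum = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double ux = a[i].x - ax, uy = a[i].y - ay;
        const double vx = b[i].x - bx, vy = b[i].y - by;
        cos_sum += ux * vx + uy * vy;  // sum of dot products
        sin_sum += ux * vy - uy * vx;  // sum of 2D cross products
    }
    const double angle = std::atan2(sin_sum, cos_sum);
    const double c = std::cos(angle), s = std::sin(angle);
    return {angle, bx - (c * ax - s * ay), by - (s * ax + c * ay)};
}

// ICP loop: stops after max_iterations or when the mean squared distance
// no longer decreases by more than `tolerance` (the "registered" test here).
Rigid2 icp(const std::vector<Point2>& M, const std::vector<Point2>& S,
           int max_iterations = 50, double tolerance = 1e-12) {
    assert(!M.empty() && !S.empty());
    Rigid2 theta{0.0, 0.0, 0.0};  // theta_0: identity
    double previous_error = std::numeric_limits<double>::max();
    std::vector<Point2> matched(M.size());
    for (int it = 0; it < max_iterations; ++it) {
        double error = 0.0;
        for (std::size_t i = 0; i < M.size(); ++i) {
            const Point2 m = apply(theta, M[i]);
            matched[i] = S[closest_index(m, S)];
            const double dx = m.x - matched[i].x, dy = m.y - matched[i].y;
            error += dx * dx + dy * dy;
        }
        error /= static_cast<double>(M.size());
        if (previous_error - error < tolerance) break;
        previous_error = error;
        theta = least_squares(M, matched);  // re-estimate from the original points
    }
    return theta;
}
```

Calling icp(M, S) returns the transform that maps M onto S. Like any ICP variant, this sketch can only be expected to converge when the initial misalignment is small enough for the closest-point matches to be mostly right.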

A short video
https://www.youtube.com/watch?v=QWDM4cFdKrE

Recommended Resources
A detailed Python implementation (Jupyter notebook)
https://github.com/niosus/notebooks/blob/master/icp.ipynb
A great visualization tool for 3D point clouds
https://www.paraview.org/
The Point Cloud Library (PCL)
https://pointclouds.org/
PCL Tutorial: How to use iterative closest point
https://pcl.readthedocs.io/projects/tutorials/en/latest/iterative_closest_point.html

Recommended implementation and grading
1. Get a working CPU version
2. Identify which parts you will port to GPU
3. Get a basic GPU port
4. Get an optimized GPU port
5. Benchmark
(Steps 1 to 5: minimum expected work)
6. Add some point indexing structure (kd-tree, octree)
7. Perform more CUDA optimizations
8. Experiment with algorithm variants (point to plane, outlier rejection, etc.)
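Item 6 mentions a kd-tree as a possible index for the closest-point search. As a rough illustration only, here is a compact CPU-side C++ sketch that stores the tree implicitly in a reordered array (the median of each range is the node). The class name KdTree and its interface are hypothetical, and a GPU-friendly layout may look quite different.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

using Point3 = std::array<float, 3>;

class KdTree {
public:
    explicit KdTree(std::vector<Point3> points) : pts_(std::move(points)) {
        assert(!pts_.empty());
        build(0, pts_.size(), 0);
    }

    // Returns the stored point closest to `query` (points are reordered
    // internally, so the point itself is returned rather than an index).
    Point3 nearest(const Point3& query) const {
        std::size_t best = 0;
        float best_d2 = std::numeric_limits<float>::max();
        search(0, pts_.size(), 0, query, best, best_d2);
        return pts_[best];
    }

private:
    static float dist2(const Point3& a, const Point3& b) {
        float d = 0.0f;
        for (std::size_t k = 0; k < 3; ++k) d += (a[k] - b[k]) * (a[k] - b[k]);
        return d;
    }

    // Places the median of pts_[lo, hi) along `axis` at the middle of the
    // range, then recurses on both halves with the next axis.
    void build(std::size_t lo, std::size_t hi, std::size_t axis) {
        if (hi - lo <= 1) return;
        const std::size_t mid = lo + (hi - lo) / 2;
        const auto first = pts_.begin();
        std::nth_element(first + static_cast<std::ptrdiff_t>(lo),
                         first + static_cast<std::ptrdiff_t>(mid),
                         first + static_cast<std::ptrdiff_t>(hi),
                         [axis](const Point3& a, const Point3& b) {
                             return a[axis] < b[axis];
                         });
        build(lo, mid, (axis + 1) % 3);
        build(mid + 1, hi, (axis + 1) % 3);
    }

    // Visits the half containing the query first, and the other half only
    // if the splitting plane is closer than the best distance found so far.
    void search(std::size_t lo, std::size_t hi, std::size_t axis,
                const Point3& q, std::size_t& best, float& best_d2) const {
        if (lo >= hi) return;
        const std::size_t mid = lo + (hi - lo) / 2;
        const float d2 = dist2(pts_[mid], q);
        if (d2 < best_d2) { best_d2 = d2; best = mid; }
        const float delta = q[axis] - pts_[mid][axis];
        const std::size_t next = (axis + 1) % 3;
        if (delta < 0.0f) {
            search(lo, mid, next, q, best, best_d2);
            if (delta * delta < best_d2) search(mid + 1, hi, next, q, best, best_d2);
        } else {
            search(mid + 1, hi, next, q, best, best_d2);
            if (delta * delta < best_d2) search(lo, mid, next, q, best, best_d2);
        }
    }

    std::vector<Point3> pts_;
};
```

A reasonable way to validate such a structure is to compare its answers against the brute-force search of the reference implementation on the same queries.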

Dataset for testing
We will provide you with test data in a few days. To begin with, try your algorithm on the simplest possible data.

Implementation Hints (Final Reminders)
• Have a working (slow) C++ reference implementation first (and keep it forever)
• Tag (git tag) the versions of your program before any optimization (useful to track and benchmark ideas)
• Try optimizations step by step so that you can tell which ones are the most important
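One practical reason to keep the reference implementation is that every optimized version can be checked against it. The helper below is a minimal sketch of such a check in C++, assuming both versions export their result as a flat array of floats. The name all_close and the default tolerances are arbitrary choices for this example. An exact comparison is usually too strict, because floating-point results can differ slightly between CPU and GPU (for instance when the order of a reduction changes).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns true when both results have the same length and every pair of
// values agrees within an absolute plus relative tolerance.
bool all_close(const std::vector<float>& reference,
               const std::vector<float>& candidate,
               float abs_tol = 1e-4f, float rel_tol = 1e-4f) {
    assert(abs_tol >= 0.0f && rel_tol >= 0.0f);
    if (reference.size() != candidate.size()) return false;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const float diff = std::fabs(reference[i] - candidate[i]);
        // Written as a negated "<=" so that a NaN in either input fails the check.
        if (!(diff <= abs_tol + rel_tol * std::fabs(reference[i]))) return false;
    }
    return true;
}
```

Running this check on each tagged version, before timing it, avoids benchmarking a fast but incorrect variant.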
