GPU Computing Projects (E. Carlinet, J. Chazalon)


  1. GPU Computing Projects
     E. Carlinet, J. Chazalon {firstname.lastname@lrde.epita.fr}
     Sept. 2020
     EPITA Research & Development Laboratory (LRDE)
     Slides generated on September 15, 2020

  2. Instructions for the Project

  3. Objectives
     The goals of the project are to:
     • apply data-parallelism concepts
     • practice with CUDA
     • set up a benchmark with a sound evaluation procedure
     • present your results in a clear and convincing way

  4. Possible Subjects
     Standard Option
     We propose one subject, which most of you should work on:
       Implementation and performance analysis of the Iterative Closest Point algorithm in CUDA
     This is an important point cloud registration algorithm; it is described briefly in later slides.
     Special Option
     For students who are at ease with CUDA and want to investigate a particular question:
       Implementation and performance analysis of SOME INTERESTING algorithm in YOUR PARALLEL PROGRAMMING TECHNOLOGY OF CHOICE
     If you choose this option, you must validate your subject with us before Sept. 25th. Contact us by email.

  5. Our Expectations
     We expect your implementation to be:
     • running on GPU;
     • correct, i.e. producing an acceptable result.
     Do not try to make it fast at first, just make it work.
     Then, try to apply NVIDIA's Assess, Parallelize, Optimize, Deploy (APOD) design cycle, as described in their CUDA C++ Best Practices Guide:
     1. identify the part of the code which is responsible for the bulk of the execution time;
     2. use all available weapons (CUDA API, libraries, research papers) to obtain a parallel version of the code (assumed to be sequential at first);
     3. use all available weapons (CUDA API, libraries, research papers) to optimize the performance of the parallel code;
     4. measure the performance of the new code.

  6. Project Outline for Standard Performance Analysis
     Broad outline, with a concrete example for each step:
     • Choose an application. Example: Mandelbrot
     • Determine the most time-consuming part of the app. Example: global atomics
     • Determine one or more data-parallel approaches to solving the problem. Example: tiling...
     • Create multiple implementations of the approach. Example: one naïve version, one version with shared memory...
     • Benchmark the implementations. Example: record memory transfer time, kernel time, utilization, FLOPS, etc.
     • Relate results to course concepts. Example: identify the cause of the bottleneck (memory-bound or compute-bound)
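     The "benchmark the implementations" step can be sketched on the host side as follows. This is a minimal illustration, not a prescribed tool: the names `median_time_ms` and `speedup`, the repeat count and the warm-up count are all assumptions. When timing CUDA kernels with a host clock, the device must be synchronized before the clock is read (or CUDA events used instead), since kernel launches are asynchronous.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// Hypothetical helper: runs `work` a few times untimed (warm-up), then
// `repeats` timed runs, and returns the median wall-clock time in ms.
// The median is less sensitive to outliers than the mean.
double median_time_ms(const std::function<void()>& work,
                      int repeats = 11, int warmup = 2) {
    assert(repeats > 0);
    for (int i = 0; i < warmup; ++i) work();

    std::vector<double> samples;
    samples.reserve(repeats);
    for (int i = 0; i < repeats; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    // Partially sort so that the middle element is the median.
    std::nth_element(samples.begin(),
                     samples.begin() + samples.size() / 2, samples.end());
    return samples[samples.size() / 2];
}

// Hypothetical helper: how many times faster the candidate is.
double speedup(double baseline_ms, double candidate_ms) {
    assert(candidate_ms > 0.0);
    return baseline_ms / candidate_ms;
}
```

     A typical use would time the CPU reference and each GPU version on the same input, then report `speedup(cpu_ms, gpu_ms)` next to the raw times.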

  7. Teams
     Teams of 4 (at most 1 special group per cohort).
     Everyone must select a group on the course page in Moodle before Sunday, Sept. 20th.

  8. Final Deliverables (1/3)
     1. Implementation
        • Source code for C++ CPU reference
        • Source code for CUDA implementation(s)
        • Source code for benchmark tools
        • Build scripts (GNU Make, CMake...)

  9. Final Deliverables (2/3)
     2. Report
        • Description of the problem
          • Detailed if custom subject
          • Quick summary otherwise
        • Quick description of the baseline CPU implementation (paper reference, parallel or not, etc.)
        • Quick description of the baseline GPU implementation (same as CPU baseline)
        • Justification of the performance indicators you have used
        • Analysis of performance bottlenecks (with measured indicators, graphs, etc.)
        • For each improvement over the GPU baseline (implementations):
          • justification of this work regarding performance analysis
          • description of the improvement (e.g. used output privatization instead of global atomics)
          • comparison of the performance with and without this implementation
        • Table with summary of the benchmark
        • Summary of who did what (contribution of each team member)

  10. Final Deliverables (3/3)
      3. A live lecture / defense
         • 15-minute presentation
         • 10-minute discussion
      Submit implementation + report + slides on Moodle before October 31st.

  11. Grade Sheet Used for Last Session

  12. Defenses
      Defenses will be held at the beginning of November (exact date TBA).
      We will use Teams to meet if need be.
      The participation of all team members is required in all cases.

  13. Moodle Links
      Course page for GISTRE: https://moodle.cri.epita.fr/course/view.php?id=325
      Course page for SCIA: https://moodle.cri.epita.fr/course/view.php?id=326

  14. Summary of Tasks and Deadlines
      What                                        Deadline           Who
      Register on Moodle                          Sept. 15th         Everyone
      Choose a group                              Sept. 20th         Everyone
      (opt.) Complete the feedback form           Sept. 30th         Everyone
      ... Work on your project ...
      Submit code + report + presentation slides  Oct. 31st          1 person/team
      Defend your project (live presentation)     Beginning of Nov.  Everyone

  15. About Iterative Closest Points

  16. Iterative Closest Points Overview
      A point cloud processing algorithm used in many applications: medical image registration, LIDAR frame registration, SLAM...

  17. Iterative Closest Points Algorithm
      1. For each point in the source point cloud, match the closest point in the reference point cloud.
      2. Estimate the combination of rotation and translation using a root mean square point-to-point distance metric minimization technique.
      3. Transform the source points using the obtained transformation.
      4. Loop until an iteration limit or a distance threshold is reached, or...
      Do not use complex versions unless everything else is perfect!
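      Step 1 (closest-point matching) can be sketched as a brute-force search in a C++ CPU reference. This is only an illustration under assumptions: the type `Point3` and the function `match_closest` are hypothetical names, and the O(N·M) double loop is the simplest possible approach, not an optimized one.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical point type for this sketch.
struct Point3 { float x, y, z; };

// Squared Euclidean distance: enough for comparisons, avoids a sqrt.
inline float squared_distance(const Point3& a, const Point3& b) {
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// For every source point, returns the index of the closest reference point.
// Brute force: every source point is compared with every reference point.
std::vector<std::size_t> match_closest(const std::vector<Point3>& source,
                                       const std::vector<Point3>& reference) {
    assert(!reference.empty());
    std::vector<std::size_t> match(source.size());
    for (std::size_t i = 0; i < source.size(); ++i) {
        float best = std::numeric_limits<float>::max();
        std::size_t best_j = 0;
        for (std::size_t j = 0; j < reference.size(); ++j) {
            const float d = squared_distance(source[i], reference[j]);
            if (d < best) { best = d; best_j = j; }
        }
        match[i] = best_j;
    }
    return match;
}
```

      Each iteration of the outer loop is independent of the others, which is why this step is a natural first candidate for a data-parallel port.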

  18. Pseudo code
          algorithm ICP(M, S)
              θ := θ_0
              while not registered:
                  X := ∅
                  for m_i ∈ transform(M, θ):
                      ŝ_j := closest point in S to m_i
                      X := X + ⟨m_i, ŝ_j⟩
                  θ := least_squares(X)
              return θ
      Source: en.wikipedia.org/wiki/Point_set_registration#Iterative_closest_point
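      The loop above can be sketched in C++ for the 2D case, where the least-squares rigid fit has a simple closed form (centroids plus one `atan2`). This is a deliberate simplification and an assumption of this sketch: the project data are point clouds that are presumably 3D, where the rotation is usually obtained differently (for instance via an SVD). The names `Vec2`, `Rigid2`, `fit_rigid` and `icp`, the iteration limit and the tolerance are all hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Vec2 { double x, y; };
// 2D rigid transform: rotation stored as (cos, sin), then translation.
struct Rigid2 { double c = 1, s = 0, tx = 0, ty = 0; };

Vec2 apply(const Rigid2& t, Vec2 p) {
    return {t.c * p.x - t.s * p.y + t.tx, t.s * p.x + t.c * p.y + t.ty};
}

// Returns the transform "a first, then b".
Rigid2 compose(const Rigid2& b, const Rigid2& a) {
    Rigid2 r;
    r.c = b.c * a.c - b.s * a.s;
    r.s = b.s * a.c + b.c * a.s;
    const Vec2 t = apply(b, {a.tx, a.ty});
    r.tx = t.x; r.ty = t.y;
    return r;
}

// Least-squares rigid fit of p[i] onto q[i] (closed form, 2D only).
Rigid2 fit_rigid(const std::vector<Vec2>& p, const std::vector<Vec2>& q) {
    assert(p.size() == q.size() && !p.empty());
    const double n = static_cast<double>(p.size());
    Vec2 pc{0, 0}, qc{0, 0};
    for (std::size_t i = 0; i < p.size(); ++i) {
        pc.x += p[i].x / n; pc.y += p[i].y / n;
        qc.x += q[i].x / n; qc.y += q[i].y / n;
    }
    double dot = 0, cross = 0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        const double px = p[i].x - pc.x, py = p[i].y - pc.y;
        const double qx = q[i].x - qc.x, qy = q[i].y - qc.y;
        dot += px * qx + py * qy;
        cross += px * qy - py * qx;
    }
    const double angle = std::atan2(cross, dot);
    Rigid2 t;
    t.c = std::cos(angle); t.s = std::sin(angle);
    t.tx = qc.x - (t.c * pc.x - t.s * pc.y);
    t.ty = qc.y - (t.s * pc.x + t.c * pc.y);
    return t;
}

// Estimates theta such that apply(theta, M[i]) lies close to S.
Rigid2 icp(const std::vector<Vec2>& M, const std::vector<Vec2>& S,
           int max_iter = 50, double tol = 1e-12) {
    assert(!M.empty() && !S.empty());
    Rigid2 theta;  // identity as initial guess (theta_0)
    double prev_err = std::numeric_limits<double>::max();
    for (int it = 0; it < max_iter; ++it) {
        std::vector<Vec2> moved(M.size()), matched(M.size());
        double err = 0;
        for (std::size_t i = 0; i < M.size(); ++i) {
            moved[i] = apply(theta, M[i]);
            double best = std::numeric_limits<double>::max();
            for (const Vec2& s : S) {  // brute-force closest point
                const double dx = moved[i].x - s.x, dy = moved[i].y - s.y;
                const double d = dx * dx + dy * dy;
                if (d < best) { best = d; matched[i] = s; }
            }
            err += best;
        }
        err /= static_cast<double>(M.size());
        if (prev_err - err < tol) break;  // mean squared error stopped improving
        prev_err = err;
        theta = compose(fit_rigid(moved, matched), theta);
    }
    return theta;
}
```

      Here "registered" is approximated by the mean squared matching error no longer decreasing. Like any ICP variant, this only converges to the right answer when the initial misalignment is small enough for the closest-point matches to be mostly correct.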

  19. A short video
      https://www.youtube.com/watch?v=QWDM4cFdKrE

  20. Recommended Resources
      • A detailed Python implementation (Jupyter notebook): https://github.com/niosus/notebooks/blob/master/icp.ipynb
      • A great visualization tool for 3D point clouds: https://www.paraview.org/
      • The Point Cloud Library (PCL): https://pointclouds.org/
      • PCL tutorial, "How to use iterative closest point": https://pcl.readthedocs.io/projects/tutorials/en/latest/iterative_closest_point.html

  21. Recommended implementation and grading
      1. Get a working CPU version
      2. Identify which parts you will port to GPU
      3. Get a basic GPU port
      4. Get an optimized GPU port
      5. Benchmark
      Minimum expected work: steps 1 to 5. Going further:
      6. Add some point indexation structure (kd-tree, octree)
      7. Perform more CUDA optimizations
      8. Experiment with algorithm variants (point to plane, outlier rejection, etc.)
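      For step 6, a point indexation structure can be sketched on the CPU as a small kd-tree stored implicitly in a sorted array. This is one possible layout among many, shown under assumptions: the class name `KdTree`, the `Point` alias and the median-split strategy are choices made for this sketch, and a GPU version would need a traversal that avoids recursion.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

using Point = std::array<float, 3>;

// Hypothetical minimal kd-tree: the node of a range [lo, hi) is its middle
// element, so no explicit node structure is needed.
class KdTree {
public:
    explicit KdTree(std::vector<Point> points) : pts_(std::move(points)) {
        build(0, pts_.size(), 0);
    }

    // Returns the stored point closest to q (Euclidean distance).
    Point nearest(const Point& q) const {
        assert(!pts_.empty());
        std::size_t best = 0;
        float best_d2 = std::numeric_limits<float>::max();
        search(0, pts_.size(), 0, q, best, best_d2);
        return pts_[best];
    }

private:
    // Puts the median along `axis` in the middle of [lo, hi), smaller
    // coordinates before it and larger ones after it, then recurses.
    void build(std::size_t lo, std::size_t hi, int axis) {
        if (hi - lo <= 1) return;
        const std::size_t mid = lo + (hi - lo) / 2;
        std::nth_element(pts_.begin() + lo, pts_.begin() + mid,
                         pts_.begin() + hi,
                         [axis](const Point& a, const Point& b) {
                             return a[axis] < b[axis];
                         });
        build(lo, mid, (axis + 1) % 3);
        build(mid + 1, hi, (axis + 1) % 3);
    }

    void search(std::size_t lo, std::size_t hi, int axis, const Point& q,
                std::size_t& best, float& best_d2) const {
        if (lo >= hi) return;
        const std::size_t mid = lo + (hi - lo) / 2;
        float d2 = 0;
        for (int k = 0; k < 3; ++k) {
            const float d = q[k] - pts_[mid][k];
            d2 += d * d;
        }
        if (d2 < best_d2) { best_d2 = d2; best = mid; }

        // Visit the side containing q first; visit the other side only if
        // the splitting plane is closer than the best distance found so far.
        const float delta = q[axis] - pts_[mid][axis];
        const int next = (axis + 1) % 3;
        if (delta < 0) {
            search(lo, mid, next, q, best, best_d2);
            if (delta * delta < best_d2) search(mid + 1, hi, next, q, best, best_d2);
        } else {
            search(mid + 1, hi, next, q, best, best_d2);
            if (delta * delta < best_d2) search(lo, mid, next, q, best, best_d2);
        }
    }

    std::vector<Point> pts_;
};
```

      Building the tree once on the reference cloud and querying it for every source point replaces the brute-force inner loop. Whether it pays off on GPU depends on the data and on thread divergence during traversal, which is exactly what the benchmark should measure.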

  22. Dataset for testing
      We will provide you with testing data in a few days.
      To begin, try your algorithm on the simplest possible data.

  23. Implementation Hints (Final Reminders)
      • Have a working (slow) C++ reference implementation first (and keep it forever)
      • Tag (git tag) the versions of your program before any optimization (useful to track and benchmark ideas)
      • Try optimizations step by step so that you can tell which ones are the most important
