GPU Computing Projects, by E. Carlinet and J. Chazalon

SLIDE 1

GPU Computing Projects

  • E. Carlinet, J. Chazalon {firstname.lastname@lrde.epita.fr}
  • Sept. 2020

EPITA Research & Development Laboratory (LRDE)

Slides generated on September 15, 2020

SLIDE 2

Instructions for the Project

SLIDE 3

Objectives

The goals of the project are to:

  • apply data-parallelism concepts
  • practice with CUDA
  • set up a benchmark with a sound evaluation procedure
  • present your results in a clear and convincing way

SLIDE 4

Possible Subjects

Standard Option

We propose one subject, which most of you should work on:

  • Implementation and performance analysis of the Iterative Closest Point algorithm in CUDA

This is an important point cloud registration algorithm; it is described briefly in later slides.

Special Option

For students who are at ease with CUDA and want to investigate a particular question:

  • Implementation and performance analysis of SOME INTERESTING algorithm in YOUR PARALLEL PROGRAMMING TECHNOLOGY OF CHOICE

If you choose this option, you must validate your subject with us before Sept. 25th. Contact us by email.

SLIDE 5

Our Expectations

We expect your implementation to be:

  • running on GPU;
  • correct, i.e., it produces an acceptable result.

Do not try to make it fast at first; just make it work. Then apply NVIDIA’s Assess, Parallelize, Optimize, Deploy (APOD) design cycle, as described in their CUDA C++ Best Practices Guide:

  • 1. identify the part of the code which is responsible for the bulk of the execution time;
  • 2. use all available weapons (CUDA API, libraries, research papers) to obtain a parallel version of the code (assumed to be sequential at first);
  • 3. use all available weapons (CUDA API, libraries, research papers) to optimize the performance of the parallel code;
  • 4. measure the performance of the new code.

SLIDE 6

Project Outline for Standard Performance Analysis

Broad outline, with a concrete example for each step:

  • Choose an application (example: Mandelbrot)
  • Determine the most time-consuming part of the app (example: global atomics)
  • Determine one or more data-parallel approaches to solving the problem (example: tiling...)
  • Create multiple implementations of the approach (example: one naïve version, one version with shared memory...)
  • Benchmark the implementations (example: record memory transfer time, kernel time, utilization, FLOPS, etc.)
  • Relate results to course concepts (example: identify the cause of the bottleneck, memory-bound or compute-bound)
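As an illustration of the benchmarking step, here is a minimal timing sketch in C++. The helper name `median_time_ms` and its default parameters are assumptions made for this example, not part of the project requirements. It runs a few untimed warm-up calls, then reports the median of several timed runs, which is less sensitive to outliers than a single measurement.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the slides): calls `fn` `warmup` times without
// timing it, then `runs` more times with timing, and returns the median
// wall-clock duration of the timed calls in milliseconds.
template <typename F>
double median_time_ms(F&& fn, std::size_t runs = 11, std::size_t warmup = 2) {
    if (runs == 0) return 0.0;  // nothing to measure
    for (std::size_t i = 0; i < warmup; ++i) fn();

    std::vector<double> samples;
    samples.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }

    // nth_element puts the median sample at the middle position.
    const auto mid = samples.begin() + static_cast<std::ptrdiff_t>(samples.size() / 2);
    std::nth_element(samples.begin(), mid, samples.end());
    return *mid;
}
```

When the timed function launches CUDA kernels, it should synchronize with the device (for example with `cudaDeviceSynchronize()`) before returning, because kernel launches are asynchronous and the host clock would otherwise stop too early.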

SLIDE 7

Teams

Teams of 4 (at most 1 special-option group per promo). Everyone must select a group on the course page in Moodle before Sunday.

SLIDE 8

Final Deliverables (1/3)

1. Implementation

  • Source code for C++ CPU reference
  • Source code for CUDA implementation(s)
  • Source code for benchmark tools
  • Build scripts (GNU Make, CMake...)

SLIDE 9

Final Deliverables (2/3)

2. Report

  • Description of the problem
      • Detailed if custom subject
      • Quick summary otherwise
  • Quick description of the baseline CPU implementation (paper reference, parallel or not, etc.)
  • Quick description of the baseline GPU implementation (same items as for the CPU baseline)
  • Justification of the performance indicators you have used
  • Analysis of performance bottlenecks (with measured indicators, graphs, etc.)
  • For each improvement over the GPU baseline (implementations):
      • justification of this work regarding performance analysis
      • description of the improvement (e.g., used output privatization instead of global atomics)
      • comparison of the performance with and without this improvement
  • Table with summary of the benchmark
  • Summary of who did what (contribution of each team member)

SLIDE 10

Final Deliverables (3/3)

3. A live lecture / defense

  • 15-minute presentation
  • 10-minute discussion

Submit implementation + report + slides on Moodle before October 31st.

SLIDE 11

Grade Sheet Used for Last Session

SLIDE 12

Defenses

Defenses will be held at the beginning of November (exact date TBA). We will use Teams to meet if need be. The participation of all team members is required in all cases.

SLIDE 13

Moodle Links

  • Course page for GISTRE: https://moodle.cri.epita.fr/course/view.php?id=325
  • Course page for SCIA: https://moodle.cri.epita.fr/course/view.php?id=326

SLIDE 14

Summary of Tasks and Deadlines

What                                          Deadline            Who
Register on Moodle                            Sept. 15th          Everyone
Choose a group                                Sept. 20th          Everyone
(opt.) Complete the feedback form             Sept. 30th          Everyone
...Work on your project...
Submit code + report + presentation slides    Oct. 31st           1 person/team
Defend your project (live presentation)       Beginning of Nov.   Everyone

SLIDE 15

About Iterative Closest Points

SLIDE 16

Iterative Closest Points Overview

A point cloud processing algorithm used in many applications: medical image registration, LIDAR frames registration, SLAM. . .

SLIDE 17

Iterative Closest Points Algorithm

  • 1. For each point in the source point cloud, match the closest point in the reference point cloud.
  • 2. Estimate the combination of rotation and translation that minimizes a root-mean-square point-to-point distance metric.
  • 3. Transform the source points using the obtained transformation.
  • 4. Loop until iteration limit or distance threshold or...

Do not use complex versions unless everything else is perfect!
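Step 1 above can be sketched as a brute-force search in C++. This is only an illustrative CPU baseline under stated assumptions: the names `Point3`, `match_closest`, and `rms_error` are hypothetical, and the reference cloud is assumed to be non-empty. Each source point is processed independently of the others, which is what makes this step a natural candidate for data parallelism.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical 3D point type used only for this sketch.
struct Point3 { float x, y, z; };

inline float squared_distance(const Point3& a, const Point3& b) {
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// For each source point, returns the index of the closest reference point.
// Brute force: O(|source| * |reference|). Assumes `reference` is non-empty.
std::vector<std::size_t> match_closest(const std::vector<Point3>& source,
                                       const std::vector<Point3>& reference) {
    std::vector<std::size_t> matches(source.size(), 0);
    for (std::size_t i = 0; i < source.size(); ++i) {
        float best = std::numeric_limits<float>::max();
        for (std::size_t j = 0; j < reference.size(); ++j) {
            const float d = squared_distance(source[i], reference[j]);
            if (d < best) { best = d; matches[i] = j; }
        }
    }
    return matches;
}

// Root-mean-square point-to-point distance over the matched pairs.
double rms_error(const std::vector<Point3>& source,
                 const std::vector<Point3>& reference,
                 const std::vector<std::size_t>& matches) {
    if (source.empty()) return 0.0;
    double sum = 0.0;
    for (std::size_t i = 0; i < source.size(); ++i)
        sum += squared_distance(source[i], reference[matches[i]]);
    return std::sqrt(sum / static_cast<double>(source.size()));
}
```

The RMS value can serve as the distance threshold test in step 4, and as a quick sanity check that successive iterations do not make the alignment worse.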

SLIDE 18

Pseudo code

algorithm ICP(M, S)
    θ := θ0
    while not registered:
        X := ∅
        for mi ∈ transform(M, θ):
            ŝj := closest point in S to mi
            X := X + (mi, ŝj)
        θ := least_squares(X)
    return θ

Source: en.wikipedia.org/wiki/Point_set_registration#Iterative_closest_point
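To make the loop above concrete, here is a small C++ sketch restricted to 2D, which is a deliberate simplification: in 2D the least-squares rotation has a closed form (an `atan2` of two sums), whereas the 3D case typically needs a matrix decomposition. The types `Point2` and `Rigid2`, the function names, and the stopping rule (iteration limit or negligible change in total squared error) are all assumptions chosen for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Point2 { double x, y; };
struct Rigid2 { double angle, tx, ty; };  // rotate by `angle`, then translate

Point2 apply(const Rigid2& t, const Point2& p) {
    const double c = std::cos(t.angle), s = std::sin(t.angle);
    return {c * p.x - s * p.y + t.tx, s * p.x + c * p.y + t.ty};
}

// Closed-form least-squares rigid fit mapping from[i] onto to[i] (2D only).
Rigid2 least_squares(const std::vector<Point2>& from, const std::vector<Point2>& to) {
    const double n = static_cast<double>(from.size());
    Point2 cf{0.0, 0.0}, ct{0.0, 0.0};  // centroids
    for (std::size_t i = 0; i < from.size(); ++i) {
        cf.x += from[i].x / n; cf.y += from[i].y / n;
        ct.x += to[i].x / n;   ct.y += to[i].y / n;
    }
    double sin_sum = 0.0, cos_sum = 0.0;
    for (std::size_t i = 0; i < from.size(); ++i) {
        const double ax = from[i].x - cf.x, ay = from[i].y - cf.y;
        const double bx = to[i].x - ct.x,   by = to[i].y - ct.y;
        cos_sum += ax * bx + ay * by;
        sin_sum += ax * by - ay * bx;
    }
    const double angle = std::atan2(sin_sum, cos_sum);
    const double c = std::cos(angle), s = std::sin(angle);
    return {angle, ct.x - (c * cf.x - s * cf.y), ct.y - (s * cf.x + c * cf.y)};
}

// Returns the transform that aligns M onto S, starting from the identity.
Rigid2 icp(const std::vector<Point2>& M, const std::vector<Point2>& S,
           int max_iterations = 50, double tolerance = 1e-12) {
    Rigid2 theta{0.0, 0.0, 0.0};
    if (M.empty() || S.empty()) return theta;
    std::vector<Point2> moved = M;          // transform(M, theta)
    std::vector<Point2> targets(M.size());  // closest point in S for each moved point
    double previous_error = std::numeric_limits<double>::max();
    for (int it = 0; it < max_iterations; ++it) {
        double error = 0.0;
        for (std::size_t i = 0; i < moved.size(); ++i) {
            double best = std::numeric_limits<double>::max();
            for (const Point2& s : S) {
                const double dx = moved[i].x - s.x, dy = moved[i].y - s.y;
                const double d = dx * dx + dy * dy;
                if (d < best) { best = d; targets[i] = s; }
            }
            error += best;
        }
        if (std::fabs(previous_error - error) < tolerance) break;  // registered
        previous_error = error;
        const Rigid2 step = least_squares(moved, targets);
        for (Point2& p : moved) p = apply(step, p);
        // Compose the incremental step with the accumulated transform.
        const double c = std::cos(step.angle), s = std::sin(step.angle);
        theta = {theta.angle + step.angle,
                 c * theta.tx - s * theta.ty + step.tx,
                 s * theta.tx + c * theta.ty + step.ty};
    }
    return theta;
}
```

In this sketch the nested closest-point loop dominates the cost for large clouds, so it is the first part one would profile when deciding what to port to the GPU.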

SLIDE 19

A short video

https://www.youtube.com/watch?v=QWDM4cFdKrE

SLIDE 20

Recommended Resources

  • A detailed Python implementation (Jupyter notebook): https://github.com/niosus/notebooks/blob/master/icp.ipynb
  • A great visualization tool for 3D point clouds: https://www.paraview.org/
  • The Point Cloud Library (PCL): https://pointclouds.org/
  • PCL Tutorial: How to use iterative closest point: https://pcl.readthedocs.io/projects/tutorials/en/latest/iterative_closest_point.html

SLIDE 21

Recommended implementation and grading

  • 1. Get a working CPU version
  • 2. Identify which parts you will port to GPU
  • 3. Get a basic GPU port
  • 4. Get an optimized GPU port
  • 5. Benchmark (steps 1 to 5 are the minimum expected work)
  • 6. Add some point indexation structure (kd-tree, octree)
  • 7. Perform more CUDA optimizations
  • 8. Experiment with algorithm variants (point to plane, outlier rejection, etc.)

SLIDE 22

Dataset for testing

We will provide you with testing data in a few days. To begin, try your algorithm on the simplest possible data.

SLIDE 23

Implementation Hints (Final Reminders)

  • Have a working (slow) C++ reference implementation first (and keep it forever)
  • Tag (git tag) the versions of your program before any optimization (useful to track and benchmark ideas)
  • Try optimizations step by step so that you can tell which ones are the most important
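One reason to keep the reference implementation is to check every optimized version against it. The sketch below is one possible way to do that in C++. The helper name `close_to_reference` and the default tolerances are assumptions to adapt to your data. An exact equality test is usually too strict, because floating-point addition is not associative and a parallel version may accumulate sums in a different order.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns true when both results have the same size and
// every element of `actual` is within abs_tol + rel_tol * |expected| of the
// reference value. NaN values in either input make the check fail.
bool close_to_reference(const std::vector<float>& expected,
                        const std::vector<float>& actual,
                        float abs_tol = 1e-5f, float rel_tol = 1e-4f) {
    if (expected.size() != actual.size()) return false;
    for (std::size_t i = 0; i < expected.size(); ++i) {
        const float diff = std::fabs(expected[i] - actual[i]);
        // Written as !(diff <= bound) so that a NaN difference is rejected.
        if (!(diff <= abs_tol + rel_tol * std::fabs(expected[i]))) return false;
    }
    return true;
}
```

Running such a check after each tagged version makes it easier to tell a real speed-up from an optimization that silently broke the result.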
