
Register Efficient Memory Allocator for GPUs Marek Vinkler - PowerPoint PPT Presentation



  1. Register Efficient Memory Allocator for GPUs ● Marek Vinkler, Masaryk University in Brno, Czech Republic ● Vlastimil Havran, Czech Technical University in Prague, Czech Republic

  2. Motivation ● ray tracing ● data structure (kd-tree) – GPU memory allocation – existing allocators slow Presentation: Marek Vinkler 2/27

  3. Motivation contd. ● sophisticated algorithms – complex memory usage ● kernel launch overhead ● CPU allocators inefficient ● http://decibel.fi.muni.cz/~xvinkl/CMalloc

  4. State-of-the-art ● AtomicMalloc [Tzeng et al., 2010] (hinted) ● CudaMalloc (CUDA built-in, 2010) ● ScatterAlloc [Steinberger et al., 2012] ● FDGMalloc [Widmer et al., 2013]

  5. AtomicMalloc [Tzeng et al., 2010] ● simplest: one atomic operation ● very fast ● serialization of execution ● cannot free memory

  6. AtomicMalloc contd. (diagram; labels: mem, memOffset)
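The "one atomic operation" idea can be sketched on the host side as follows. This is a minimal C++ model, not the implementation from [Tzeng et al., 2010]: `BumpPool` and `alloc` are hypothetical names, `mem` and `memOffset` follow the diagram labels, and `std::atomic::fetch_add` stands in for a GPU atomic add.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side model of a bump allocator: one shared offset,
// advanced by a single atomic add per request.
struct BumpPool {
    std::uint8_t* mem;                      // start of the preallocated pool
    std::size_t size;                       // pool size in bytes
    std::atomic<std::size_t> memOffset{0};  // first byte not yet handed out

    BumpPool(std::uint8_t* m, std::size_t s) : mem(m), size(s) {}

    // Returns nullptr once the pool is exhausted. There is no way to give
    // memory back, which matches the "cannot free memory" limitation.
    void* alloc(std::size_t bytes) {
        std::size_t start = memOffset.fetch_add(bytes);  // the only atomic
        if (start + bytes > size) return nullptr;
        return mem + start;
    }
};
```

Because every request goes through the same counter, concurrent callers are serialized on that one atomic, which is the "serialization of execution" cost named on the slide.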

  7. CudaMalloc [2010] ● CUDA 3.2 and higher built-in ● unpublished algorithm ● very slow

  8. CudaMalloc contd. (diagram; labels: CudaMalloc, ?)

  9. ScatterAlloc [Steinberger et al., 2012] ● hashing – many parallel allocations – roughly the same size of requests ● very fast on small allocations ● very slow on large allocations

  10. ScatterAlloc contd. (diagram; labels: mem, page)

  11. FDGMalloc [Widmer et al., 2013] ● CudaMalloc for superblocks ● local management in superblocks – low overhead ● very fast subsequent allocations ● cannot free memory separately

  12. FDGMalloc contd. (diagram; labels: CudaMalloc, offset, offset)
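The superblock scheme can be illustrated with a small host-side C++ sketch. It is not the FDGMalloc code from [Widmer et al., 2013]: `SuperblockAllocator`, `kSuperblockSize` and `freeAll` are hypothetical, the 4096-byte superblock size is an arbitrary assumption, and `std::malloc` stands in for the slow general-purpose allocator (CudaMalloc on the slide).

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical model: large superblocks come from a slow allocator,
// individual requests are served by bumping a local offset.
class SuperblockAllocator {
    static constexpr std::size_t kSuperblockSize = 4096;  // assumed size
    std::vector<void*> superblocks;
    std::size_t offset = kSuperblockSize;  // forces a superblock on first use

public:
    SuperblockAllocator() = default;
    SuperblockAllocator(const SuperblockAllocator&) = delete;
    SuperblockAllocator& operator=(const SuperblockAllocator&) = delete;
    ~SuperblockAllocator() { freeAll(); }

    void* alloc(std::size_t bytes) {
        if (bytes > kSuperblockSize) return nullptr;  // not handled in this sketch
        if (offset + bytes > kSuperblockSize) {
            void* sb = std::malloc(kSuperblockSize);  // the slow path
            if (!sb) return nullptr;
            superblocks.push_back(sb);
            offset = 0;
        }
        void* p = static_cast<char*>(superblocks.back()) + offset;
        offset += bytes;  // the fast path: plain local arithmetic
        return p;
    }

    // Memory is only released all at once, never per allocation.
    void freeAll() {
        for (void* sb : superblocks) std::free(sb);
        superblocks.clear();
        offset = kSuperblockSize;
    }

    std::size_t superblockCount() const { return superblocks.size(); }
};
```

Only the first request in each superblock pays for the slow allocator, which is why subsequent allocations are fast and why single allocations cannot be freed on their own.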

  13. Limitations ● slow on large/variable sized allocations – CudaMalloc (CUDA built-in, 2010) – ScatterAlloc [Steinberger et al., 2012] ● cannot reuse memory – AtomicMalloc [Tzeng et al., 2010] – FDGMalloc [Widmer et al., 2013]

  14. Two Proposed GPU Allocators ● AWMalloc – variant of AtomicMalloc – can reuse memory ● CMalloc – practical list-based allocator

  15. AWMalloc ● circular memory pool ● used memory can be overwritten ● large memory pool and short-lived allocations

  16. AWMalloc contd. (diagram; labels: mem, memOffset)
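A circular pool whose offset wraps around can be sketched as below. This is a host-side C++ model under assumptions, not the authors' AWMalloc: `WrapPool` is a hypothetical name, and the wrap is handled here with a compare-and-swap loop, which may differ from how the real allocator updates `memOffset`.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical model of a wrapping bump allocator. When a request does not
// fit before the end of the pool, it restarts at the beginning and may
// overwrite memory handed out earlier.
struct WrapPool {
    std::uint8_t* mem;
    std::size_t size;
    std::atomic<std::size_t> memOffset{0};

    WrapPool(std::uint8_t* m, std::size_t s) : mem(m), size(s) {}

    void* alloc(std::size_t bytes) {
        if (bytes > size) return nullptr;  // can never fit
        std::size_t old = memOffset.load();
        for (;;) {
            // Wrap to the pool start if the request would cross the end.
            std::size_t start = (old + bytes > size) ? 0 : old;
            // On failure compare_exchange_weak reloads `old` and we retry.
            if (memOffset.compare_exchange_weak(old, start + bytes))
                return mem + start;
        }
    }
};
```

Nothing tracks whether the overwritten region is still in use, so this is only safe under the conditions named on the slide: a large pool and short-lived allocations.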

  17. CMalloc ● circular memory pool ● header for individual allocations – correctness – split / merge ● initial splitting:

  18. CMalloc contd. (diagram; labels: mem, memOffset)
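The header-based split and merge steps can be illustrated with a single-threaded C++ model. It is a sketch under assumptions rather than the authors' CMalloc: `ListPool`, the two-word `{flag, next}` header layout and the word-indexed pool are hypothetical, and the atomic operations a GPU version needs for concurrent threads are omitted.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical single-threaded model. Every chunk starts with a two-word
// header {flag, next}; its payload is the words between header and `next`.
struct ListPool {
    static constexpr std::uint32_t kFree = 0, kUsed = 1, kHeader = 2;
    std::vector<std::uint32_t> mem;  // pool measured in 32-bit words
    std::uint32_t memOffset = 0;     // chunk where the next search starts

    explicit ListPool(std::uint32_t words) : mem(words, 0) {
        mem[0] = kFree;  // one free chunk spanning the whole pool
        mem[1] = words;  // next == pool size marks the last chunk
    }

    // Returns the payload index, or 0 on failure (index 0 is a header word).
    std::uint32_t alloc(std::uint32_t words) {
        const std::uint32_t size = static_cast<std::uint32_t>(mem.size());
        std::uint32_t pos = memOffset, visited = 0;
        while (visited < size) {
            std::uint32_t next = mem[pos + 1];
            if (mem[pos] == kFree) {
                // Merge: absorb free chunks that directly follow this one.
                while (next < size && mem[next] == kFree) next = mem[next + 1];
                mem[pos + 1] = next;
                std::uint32_t payload = next - pos - kHeader;
                if (payload >= words) {
                    // Split: turn the unused tail into a new free chunk.
                    if (payload - words > kHeader) {
                        std::uint32_t rest = pos + kHeader + words;
                        mem[rest] = kFree;
                        mem[rest + 1] = next;
                        mem[pos + 1] = rest;
                    }
                    mem[pos] = kUsed;
                    memOffset = (mem[pos + 1] == size) ? 0 : mem[pos + 1];
                    return pos + kHeader;
                }
            }
            visited += next - pos;
            pos = (next == size) ? 0 : next;  // circular walk over the pool
        }
        memOffset = 0;  // restart from a header that is always valid
        return 0;
    }

    void dealloc(std::uint32_t payload) { mem[payload - kHeader] = kFree; }
};
```

In this model deallocation only flips the flag; neighbouring free chunks are merged lazily during a later search, so a request larger than any single freed chunk can still succeed.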

  19. CMalloc Variants ● CFMalloc – fused flag and next pointer – single word read/write ● CMMalloc & CFMMalloc – multiple offsets into memory pool – fewer conflicts – fused variant
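Fusing the flag and the next pointer into one word can be sketched as simple bit packing. The encoding below is an assumption for illustration (top bit for the flag, low 31 bits for the next-chunk offset); the slides do not state which bits CFMalloc actually uses, and all names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical encoding: top bit = used flag, low 31 bits = next offset.
constexpr std::uint32_t kUsedBit = 0x80000000u;

constexpr std::uint32_t packHeader(bool used, std::uint32_t next) {
    return (used ? kUsedBit : 0u) | (next & ~kUsedBit);
}

constexpr bool isUsed(std::uint32_t header) {
    return (header & kUsedBit) != 0;
}

constexpr std::uint32_t nextOffset(std::uint32_t header) {
    return header & ~kUsedBit;
}
```

With both fields in one word, a header can be read or updated with a single memory access, which is the "single word read/write" property named on the slide.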

  20. Four Evaluation Tests ● Alloc Dealloc [Huang et al., 2010] – allocation followed by deallocation ● Alloc Cycle Dealloc [Widmer et al., 2013] – many allocations followed by deallocation of all memory ● Probability [Steinberger et al., 2012] – multiple kernels – allocation with probability pAlloc, deallocation with probability pFree ● Data Structure Build – kd-tree build – allocation of children index arrays

  21. Alloc Dealloc – Allocation Size (plot)

  22. Alloc Cycle Dealloc – Allocations per Thread (plot)

  23. Probability – Number of CUDA Blocks (plot)

  24. Data Structure Build (Slowdown) ● Kd-tree build time (the lower the better); the AtomicMalloc row gives absolute times, the other rows give slowdown factors relative to it

      Allocator      #Registers  Sponza   Crytek Sponza  Dragon    Sodahall
      AtomicMalloc*  4           18.4 ms  43.2 ms        378.9 ms  325.5 ms
      CudaMalloc     6           2.99x    3.82x          20.95x    3.38x
      ScatterAlloc   38          1.74x    2.01x          1.78x     5.08x
      AWMalloc*      6           0.88x    0.96x          0.90x     0.95x
      CMalloc        16          1.04x    1.14x          1.27x     1.14x
      CFMalloc       10          1.20x    1.08x          1.17x     1.17x

      * Not a full dynamic memory allocator

  25. Conclusions ● four evaluation tests (new kd-tree build) ● AtomicMalloc / AWMalloc – no deallocation ● ScatterAlloc – similarly sized allocations, enough registers ● FDGMalloc – successive allocations ● CMalloc / CFMalloc – variably sized allocations / unknown allocation properties – up to 5.08 / 1.14 = 4.46x faster than ScatterAlloc
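The 4.46x figure is a ratio of two slowdown factors measured against the same baseline, so the baseline cancels out. A minimal C++ sketch of that arithmetic, with `relativeSpeedup` as a hypothetical helper name:

```cpp
// Both slowdowns are relative to the same baseline, so dividing them gives
// how many times faster "ours" is than "other".
constexpr double relativeSpeedup(double slowdownOther, double slowdownOurs) {
    return slowdownOther / slowdownOurs;
}

// Values quoted on the slide: ScatterAlloc 5.08x and CMalloc 1.14x.
constexpr double kCMallocVsScatterAlloc = relativeSpeedup(5.08, 1.14);
```

The quotient is about 4.456, which the slide rounds to 4.46.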

  26. Project Webpage ● http://decibel.fi.muni.cz/~xvinkl/CMalloc
