Register Efficient Memory Allocator for GPUs
Marek Vinkler, Masaryk University in Brno, Czech Republic
Vlastimil Havran, Czech Technical University in Prague, Czech Republic
Presentation: Marek Vinkler 2/27
Motivation
- ray tracing
- data structure (kd-tree)
– GPU memory allocation
– existing allocators slow
Motivation contd.
- sophisticated algorithms
– complex memory usage
- kernel launch overhead
- CPU allocators inefficient
- http://decibel.fi.muni.cz/~xvinkl/CMalloc
State-of-the-art
- AtomicMalloc [Tzeng et al., 2010] (hinted)
- CudaMalloc (CUDA built-in, 2010)
- ScatterAlloc [Steinberger et al., 2012]
- FDGMalloc [Widmer et al., 2013]
AtomicMalloc [Tzeng et al., 2010]
- simplest: one atomic operation
- very fast
- serialization of execution
- cannot free memory
AtomicMalloc contd.
[Figure: memory pool with labels "mem" and "memOffset"]
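The single-atomic scheme described for AtomicMalloc can be sketched on the CPU as a bump allocator. This is only an illustrative analogue, not the code of Tzeng et al.: `AtomicPool`, `alloc` and `reset` are hypothetical names, `std::atomic::fetch_add` stands in for a GPU `atomicAdd`, and `mem` / `memOffset` follow the figure labels.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU analogue of a bump allocator: one shared offset that is
// advanced by a single atomic add per request.
struct AtomicPool {
    std::vector<std::uint8_t> mem;          // backing storage
    std::atomic<std::size_t> memOffset{0};  // first byte not yet handed out

    explicit AtomicPool(std::size_t bytes) : mem(bytes) { assert(bytes > 0); }

    // Returns `size` bytes, or nullptr once the pool is exhausted.
    void* alloc(std::size_t size) {
        std::size_t old = memOffset.fetch_add(size);  // the only atomic operation
        if (old + size > mem.size()) return nullptr;  // out of memory
        return mem.data() + old;
    }

    // There is no per-allocation free; the whole pool is reset at once.
    void reset() { memOffset.store(0); }
};
```

Every request goes through the same counter, which is why the approach is both very fast and a point of serialization, and why individual blocks cannot be returned.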
CudaMalloc [2010]
- CUDA 3.2 and higher built-in
- unpublished algorithm
- very slow
CudaMalloc contd.
[Figure: CudaMalloc shown with a question mark]
ScatterAlloc [Steinberger et al., 2012]
- hashing
– many parallel allocations
– roughly the same size of requests
- very fast on small allocations
- very slow on large allocations
ScatterAlloc contd.
[Figure: memory pool with labels "mem" and "page"]
FDGMalloc [Widmer et al., 2013]
- CudaMalloc for superblocks
- local management in superblocks
– low overhead
- very fast subsequent allocations
- cannot free memory separately
FDGMalloc contd.
[Figure: labels "CudaMalloc" and two "offset" markers]
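The superblock idea listed for FDGMalloc can be illustrated with a small host-side sketch. It is an assumption-laden analogue rather than the implementation of Widmer et al.: `SuperblockHeap` and its members are hypothetical names, `std::malloc` stands in for the slow general-purpose allocator (CudaMalloc on the GPU), and the superblock size is chosen by the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical sketch: large superblocks come from a slow general allocator,
// individual requests are served by bumping an offset inside the newest one.
class SuperblockHeap {
public:
    explicit SuperblockHeap(std::size_t superblockBytes)
        : superblockBytes_(superblockBytes) { assert(superblockBytes > 0); }
    ~SuperblockHeap() { freeAll(); }
    SuperblockHeap(const SuperblockHeap&) = delete;
    SuperblockHeap& operator=(const SuperblockHeap&) = delete;

    void* alloc(std::size_t size) {
        if (size > superblockBytes_) return nullptr;  // too large for this sketch
        if (blocks_.empty() || offset_ + size > superblockBytes_) {
            void* block = std::malloc(superblockBytes_);  // slow path
            if (block == nullptr) return nullptr;
            blocks_.push_back(static_cast<std::uint8_t*>(block));
            offset_ = 0;
        }
        void* p = blocks_.back() + offset_;  // fast path: local bump
        offset_ += size;
        return p;
    }

    // Individual allocations cannot be freed; everything is released together.
    void freeAll() {
        for (std::uint8_t* b : blocks_) std::free(b);
        blocks_.clear();
        offset_ = 0;
    }

    std::size_t superblockCount() const { return blocks_.size(); }

private:
    std::size_t superblockBytes_;
    std::size_t offset_ = 0;
    std::vector<std::uint8_t*> blocks_;
};
```

Only the first request into a fresh superblock pays for the general allocator, which matches the "very fast subsequent allocations" bullet; the price is that memory is returned only in bulk.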
Limitations
- slow on large/variable sized allocations
– CudaMalloc (CUDA built-in, 2010)
– ScatterAlloc [Steinberger et al., 2012]
- cannot reuse memory
– AtomicMalloc [Tzeng et al., 2010]
– FDGMalloc [Widmer et al., 2013]
Two Proposed GPU Allocators
- AWMalloc
– variant of AtomicMalloc
– can reuse memory
- CMalloc
– practical list-based allocator
AWMalloc
- circular memory pool
- used memory can be overwritten
- large memory pool and short-lived allocations
AWMalloc contd.
[Figure: memory pool with labels "mem" and "memOffset"]
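A circular pool in the spirit of AWMalloc can be sketched by letting the offset wrap back to the start of the pool. The names `WrapPool` and `alloc` are hypothetical, and the compare-and-swap loop is one possible way to express the wrap on the CPU; the slides do not say how the authors implement it on the GPU.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU analogue of a wrapping bump allocator. Old allocations are
// never tracked: after a wrap, new requests simply reuse (overwrite) them.
struct WrapPool {
    std::vector<std::uint8_t> mem;          // backing storage
    std::atomic<std::size_t> memOffset{0};  // next position to hand out

    explicit WrapPool(std::size_t bytes) : mem(bytes) { assert(bytes > 0); }

    void* alloc(std::size_t size) {
        if (size > mem.size()) return nullptr;  // can never fit
        std::size_t old = memOffset.load();
        std::size_t start;
        do {
            // Restart from the beginning when the request would run off the end.
            start = (old + size > mem.size()) ? 0 : old;
        } while (!memOffset.compare_exchange_weak(old, start + size));
        return mem.data() + start;
    }
};
```

The sketch is only safe under the conditions named on the slide: the pool must be large and allocations short-lived, so that a block is no longer in use by the time the offset comes around again.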
CMalloc
- circular memory pool
- header for individual allocations
– correctness
– split / merge
- initial splitting:
CMalloc contd.
[Figure: memory pool with labels "mem" and "memOffset"]
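The header-based, list-style organisation of CMalloc can be illustrated with a single-threaded sketch. Everything here is an assumption made for illustration: the class name `ListPool`, the two-word header layout (flag, next), sizes counted in 32-bit words, and merging lazily during allocation. A GPU version would additionally need atomic operations on the headers, which this sketch leaves out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical single-threaded sketch of a circular, list-based allocator.
// Each chunk starts with a two-word header: [flag, index of next header].
class ListPool {
public:
    static constexpr std::uint32_t kFree = 0, kUsed = 1, kHeader = 2;
    static constexpr std::uint32_t kInvalid = 0xFFFFFFFFu;

    explicit ListPool(std::uint32_t words) : mem_(words, 0) {
        assert(words > kHeader);
        mem_[0] = kFree;   // one free chunk spanning the whole pool
        mem_[1] = words;   // next == pool size marks the last chunk
    }

    // Returns the word index of the payload, or kInvalid if nothing fits.
    std::uint32_t alloc(std::uint32_t words) {
        const std::uint32_t n = static_cast<std::uint32_t>(mem_.size());
        const std::uint32_t start = memOffset_;
        std::uint32_t cur = start;
        bool wrapped = false;
        while (true) {
            if (mem_[cur] == kFree) {
                // Merge: absorb free chunks that directly follow this one.
                while (mem_[cur + 1] < n && mem_[mem_[cur + 1]] == kFree)
                    mem_[cur + 1] = mem_[mem_[cur + 1] + 1];
                const std::uint32_t payload = mem_[cur + 1] - cur - kHeader;
                if (payload >= words) {
                    // Split: keep the rest as a free chunk if it can hold a
                    // header plus at least one payload word.
                    if (payload - words > kHeader) {
                        const std::uint32_t rest = cur + kHeader + words;
                        mem_[rest] = kFree;
                        mem_[rest + 1] = mem_[cur + 1];
                        mem_[cur + 1] = rest;
                    }
                    mem_[cur] = kUsed;
                    memOffset_ = (mem_[cur + 1] == n) ? 0 : mem_[cur + 1];
                    return cur + kHeader;
                }
            }
            cur = mem_[cur + 1];                   // follow the list
            if (cur == n) { cur = 0; wrapped = true; }
            if (wrapped && cur >= start) break;    // searched the full circle
        }
        memOffset_ = 0;  // index 0 is always a valid header
        return kInvalid;
    }

    // Marks a chunk free; the index must come from alloc(). Merging with
    // neighbours happens later, during the next search.
    void free(std::uint32_t payloadIndex) { mem_[payloadIndex - kHeader] = kFree; }

    std::uint32_t* ptr(std::uint32_t payloadIndex) { return &mem_[payloadIndex]; }

private:
    std::vector<std::uint32_t> mem_;
    std::uint32_t memOffset_ = 0;  // roving start position for the search
};
```

With a 16-word pool, two 4-word allocations land at payload indices 2 and 8; an 8-word request then fails, and succeeds at index 2 again once both blocks are freed and merged during the next search.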
CMalloc Variants
- CFMalloc
– fused flag and next pointer
– single word read/write
- CMMalloc & CFMMalloc
– multiple offsets into memory pool
– fewer conflicts
– fused variant
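The "fused flag and next pointer" idea of CFMalloc can be illustrated by packing both fields into one 32-bit word. The bit layout below (flag in bit 0, next offset in the remaining 31 bits) and the helper names are assumptions for illustration; the slides do not specify the actual encoding.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical layout: bit 0 = used flag, bits 1..31 = offset of next header.
inline std::uint32_t packHeader(std::uint32_t next, bool used) {
    assert(next < (1u << 31));  // the offset must fit into 31 bits
    return (next << 1) | (used ? 1u : 0u);
}
inline std::uint32_t headerNext(std::uint32_t h) { return h >> 1; }
inline bool headerUsed(std::uint32_t h) { return (h & 1u) != 0; }

// Claims a free chunk with one compare-and-swap on the fused word.
// Returns false if the chunk is already in use.
inline bool tryClaim(std::atomic<std::uint32_t>& header) {
    std::uint32_t h = header.load();
    while (!headerUsed(h)) {
        // On failure, h is refreshed with the current value and re-checked.
        if (header.compare_exchange_weak(h, h | 1u)) return true;
    }
    return false;
}
```

Because flag and link live in the same word, a single read yields both and a single atomic write updates them consistently, at the cost of one bit of addressable range.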
Four Evaluation Tests
- Alloc Dealloc [Huang et al., 2010]
– allocation followed by deallocation
- Alloc Cycle Dealloc [Widmer et al., 2013]
– many allocations followed by deallocation of all memory
- Probability [Steinberger et al., 2012]
– multiple kernels
– allocation with probability pAlloc, deallocation with probability pFree
- Data Structure Build
– kd-tree build
– allocation of children index arrays
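The Probability test can be pictured with a small host-side harness. This is a loose sketch under stated assumptions, not the benchmark of Steinberger et al.: each slot plays the role of one GPU thread, each round the role of one kernel launch, `std::malloc` stands in for the allocator under test, and `runProbabilityTest` with its parameters other than `pAlloc` and `pFree` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <random>
#include <vector>

// Hypothetical harness: in every round, a slot without memory allocates with
// probability pAlloc, and a slot holding memory frees it with probability pFree.
inline std::size_t runProbabilityTest(std::size_t threads, int rounds,
                                      double pAlloc, double pFree,
                                      std::size_t bytes, unsigned seed) {
    std::mt19937 rng(seed);
    std::bernoulli_distribution doAlloc(pAlloc), doFree(pFree);
    std::vector<void*> held(threads, nullptr);
    for (int r = 0; r < rounds; ++r) {
        for (void*& p : held) {
            if (p == nullptr) {
                if (doAlloc(rng)) p = std::malloc(bytes);
            } else if (doFree(rng)) {
                std::free(p);
                p = nullptr;
            }
        }
    }
    // Count what is still held after the last round, then clean up.
    std::size_t live = 0;
    for (void* p : held) {
        if (p != nullptr) { ++live; std::free(p); }
    }
    return live;
}
```

Varying `pAlloc` and `pFree` changes how fragmented the pool becomes across rounds, which is the property this kind of test is meant to stress.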
Alloc Dealloc – Allocation Size
Alloc Cycle Dealloc – Allocations per Thread
Probability – Number of CUDA Blocks
Data Structure Build (Slowdown)
Kd-tree build time (the lower the better)

Allocator      #Registers  Sponza   Crytek Sponza  Dragon    Sodahall
AtomicMalloc*  4           18.4 ms  43.2 ms        378.9 ms  325.5 ms
CudaMalloc     6           2.99x    3.82x          20.95x    3.38x
ScatterAlloc   38          1.74x    2.01x          1.78x     5.08x
AWMalloc*      6           0.88x    0.96x          0.90x     0.95x
CMalloc        16          1.04x    1.14x          1.27x     1.14x
CFMalloc       10          1.20x    1.08x          1.17x     1.17x

* Not full dynamic memory allocator
Conclusions
- four evaluation tests (new kd-tree build)
- AtomicMalloc / AWMalloc
– no deallocation
- ScatterAlloc
– similarly sized allocations, enough registers
- FDGMalloc
– successive allocations
- CMalloc / CFMalloc
– variably sized allocations / unknown allocation properties
– up to 5.08 / 1.14 = 4.46x faster than ScatterAlloc (Sodahall kd-tree build)
Project Webpage
- http://decibel.fi.muni.cz/~xvinkl/CMalloc