Register Efficient Memory Allocator for GPUs, Marek Vinkler



SLIDE 1

Register Efficient Memory Allocator for GPUs

Marek Vinkler

Masaryk University in Brno, Czech Republic

Vlastimil Havran

Czech Technical University in Prague, Czech Republic

SLIDE 2

Presentation: Marek Vinkler 2/27

Motivation

  • ray tracing
  • data structure (kd-tree)

– GPU memory allocation
– existing allocators slow

SLIDE 3

Motivation contd.

  • sophisticated algorithms

– complex memory usage

  • kernel launch overhead
  • CPU allocators inefficient
  • http://decibel.fi.muni.cz/~xvinkl/CMalloc
SLIDE 4

State-of-the-art

  • AtomicMalloc [Tzeng et al., 2010] (hinted)
  • CudaMalloc (CUDA built-in, 2010)
  • ScatterAlloc [Steinberger et al., 2012]
  • FDGMalloc [Widmer et al., 2013]
SLIDE 5

AtomicMalloc [Tzeng et al., 2010]

  • simplest: one atomic operation
  • very fast
  • serialization of execution
  • cannot free memory
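
The single-atomic-operation idea can be illustrated with a host-side C++ sketch, where std::atomic stands in for a GPU atomic add. This is only an illustration under that assumption, not the code from Tzeng et al.; the names BumpPool and alloc are hypothetical, while mem and memOffset follow the labels on the next slide.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bump allocator: one atomic add per allocation, no free().
struct BumpPool {
    std::vector<unsigned char> mem;      // backing memory pool ("mem")
    std::atomic<std::size_t> memOffset;  // first unreserved byte ("memOffset")

    explicit BumpPool(std::size_t bytes) : mem(bytes), memOffset(0) {
        assert(bytes > 0);
    }

    // Returns `size` bytes from the pool, or nullptr once the pool is exhausted.
    void* alloc(std::size_t size) {
        // The single atomic operation: reserve the range [old, old + size).
        std::size_t old = memOffset.fetch_add(size);
        if (old + size > mem.size()) return nullptr;  // out of memory
        return mem.data() + old;
    }
};
```

All callers contend on the one counter, which matches the "serialization of execution" point, and nothing ever moves the counter back, which is why memory cannot be freed.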
SLIDE 6

AtomicMalloc contd.

[Figure: memory pool (mem) and allocation offset (memOffset)]

SLIDE 7

CudaMalloc [2010]

  • CUDA 3.2 and higher built-in
  • unpublished algorithm
  • very slow
SLIDE 8

CudaMalloc contd.

[Figure: CudaMalloc, labeled "?"]

SLIDE 9

ScatterAlloc [Steinberger et al., 2012]

  • hashing

– many parallel allocations
– requests of roughly the same size

  • very fast on small allocations
  • very slow on large allocations
SLIDE 10

ScatterAlloc contd.

[Figure: diagram with labels mem and page]

SLIDE 11

FDGMalloc [Widmer et al., 2013]

  • CudaMalloc for superblocks
  • local management in superblocks

– low overhead

  • very fast subsequent allocations
  • cannot free memory separately
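
The superblock scheme can be sketched on the host as follows. This is a hedged illustration, not the FDGMalloc implementation: std::malloc stands in for the slow general allocator (CudaMalloc), a single owner replaces the GPU-side owner of the superblocks, alignment is ignored, and the names SuperblockHeap, alloc, freeAll and superblockCount are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical superblock heap for a single owner.
class SuperblockHeap {
    std::vector<void*> blocks_;  // every superblock obtained so far
    std::size_t blockSize_;      // size of a regular superblock
    std::size_t capacity_ = 0;   // size of the current superblock
    std::size_t used_ = 0;       // bytes handed out from the current superblock

public:
    explicit SuperblockHeap(std::size_t blockSize) : blockSize_(blockSize) {
        assert(blockSize > 0);
    }
    ~SuperblockHeap() { freeAll(); }
    SuperblockHeap(const SuperblockHeap&) = delete;
    SuperblockHeap& operator=(const SuperblockHeap&) = delete;

    void* alloc(std::size_t size) {
        if (blocks_.empty() || used_ + size > capacity_) {
            // Slow path: fetch a new superblock from the general allocator.
            std::size_t cap = size > blockSize_ ? size : blockSize_;
            void* block = std::malloc(cap);
            if (block == nullptr) return nullptr;
            blocks_.push_back(block);
            capacity_ = cap;
            used_ = 0;
        }
        // Fast path: bump the offset inside the current superblock.
        void* p = static_cast<unsigned char*>(blocks_.back()) + used_;
        used_ += size;
        return p;
    }

    // Individual allocations cannot be freed; everything is released at once.
    void freeAll() {
        for (void* block : blocks_) std::free(block);
        blocks_.clear();
        capacity_ = 0;
        used_ = 0;
    }

    std::size_t superblockCount() const { return blocks_.size(); }
};
```

Only the first allocation in each superblock pays for the general allocator, which is consistent with the "very fast subsequent allocations" point above.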
SLIDE 12

FDGMalloc contd.

[Figure: diagram with labels CudaMalloc, offset, offset]
SLIDE 13

Limitations

  • slow on large/variable sized allocations

– CudaMalloc (CUDA built-in, 2010)
– ScatterAlloc [Steinberger et al., 2012]

  • cannot reuse memory

– AtomicMalloc [Tzeng et al., 2010]
– FDGMalloc [Widmer et al., 2013]

SLIDE 14

Two Proposed GPU Allocators

  • AWMalloc

– variant of AtomicMalloc
– can reuse memory

  • CMalloc

– practical list-based allocator

SLIDE 15

AWMalloc

  • circular memory pool
  • used memory can be overwritten
  • large memory pool and short-lived allocations
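
A circular pool of this kind can be sketched on the host as below. The sketch assumes a compare-and-swap loop on the offset, which may differ from how AWMalloc actually wraps. The names WrapPool and alloc are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical wrapping bump allocator over a circular pool.
struct WrapPool {
    std::vector<unsigned char> mem;      // circular memory pool
    std::atomic<std::size_t> memOffset;  // first byte of the next allocation

    explicit WrapPool(std::size_t bytes) : mem(bytes), memOffset(0) {
        assert(bytes > 0);
    }

    // Fails only if `size` exceeds the whole pool. After a wrap the returned
    // range may overlap older allocations, which are silently overwritten.
    void* alloc(std::size_t size) {
        if (size > mem.size()) return nullptr;
        std::size_t old = memOffset.load();
        std::size_t start;
        do {
            // Restart at the pool beginning if the request does not fit in the tail.
            start = (old + size > mem.size()) ? 0 : old;
        } while (!memOffset.compare_exchange_weak(old, start + size));
        return mem.data() + start;
    }
};
```

Nothing checks whether the overwritten memory is still in use, so the scheme is only safe when the pool is large relative to the lifetime of allocations, as the last bullet states.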
SLIDE 16

AWMalloc contd.

[Figure: circular memory pool (mem) and allocation offset (memOffset)]

SLIDE 17

CMalloc

  • circular memory pool
  • header for individual allocations

– correctness
– split / merge

  • initial splitting:
SLIDE 18

CMalloc contd.

[Figure: circular memory pool (mem) and search offset (memOffset)]
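
The combination of a circular pool, per-allocation headers, and split / merge can be sketched as a single-threaded host program. This is an assumption-laden illustration, not the CMalloc source: it omits the atomic locking of headers that concurrent GPU threads would need and the initial splitting of the pool, it merges only with the following chunk, and the layout (two 32-bit header words, word indices instead of pointers) and the names CircularHeap, alloc and dealloc are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical single-threaded sketch of a circular, header-based heap.
// Every chunk starts with a two-word header: [flag, index of next header].
struct CircularHeap {
    enum : std::uint32_t { FREE = 0, USED = 1, HDR = 2, NONE = 0xFFFFFFFFu };

    std::vector<std::uint32_t> mem;  // pool of 32-bit words ("mem")
    std::uint32_t memOffset = 0;     // header where the next search starts

    explicit CircularHeap(std::uint32_t words) : mem(words) {
        assert(words > HDR);
        mem[0] = FREE;  // one free chunk covering the whole pool
        mem[1] = 0;     // its successor is itself (circular list)
    }

    // End of the chunk at `cur`: the next header, or the pool end for the last chunk.
    std::uint32_t endOf(std::uint32_t cur) const {
        std::uint32_t next = mem[cur + 1];
        return next > cur ? next : static_cast<std::uint32_t>(mem.size());
    }

    // Returns the payload word index, or NONE if no free chunk is large enough.
    std::uint32_t alloc(std::uint32_t words) {
        std::uint32_t cur = memOffset;
        do {
            std::uint32_t next = mem[cur + 1];
            std::uint32_t payload = endOf(cur) - cur - HDR;
            if (mem[cur] == FREE && payload >= words) {
                if (payload - words > HDR) {  // split: the rest keeps >= 1 word
                    std::uint32_t rest = cur + HDR + words;
                    mem[rest] = FREE;
                    mem[rest + 1] = next;
                    mem[cur + 1] = rest;
                }
                mem[cur] = USED;
                memOffset = mem[cur + 1];  // next search starts after this chunk
                return cur + HDR;
            }
            cur = next;
        } while (cur != memOffset);
        return NONE;
    }

    // Marks the chunk free and merges it with a free successor, if any.
    void dealloc(std::uint32_t payloadIndex) {
        std::uint32_t cur = payloadIndex - HDR;
        mem[cur] = FREE;
        std::uint32_t next = mem[cur + 1];
        if (next > cur && mem[next] == FREE) {       // never merge across the wrap
            if (memOffset == next) memOffset = cur;  // keep the search start valid
            mem[cur + 1] = mem[next + 1];
        }
    }
};
```

Unlike the AtomicMalloc-style allocators, each chunk here can be released on its own, at the cost of the header words and the list walk.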

SLIDE 19

CMalloc Variants

  • CFMalloc

– fused flag and next pointer
– single word read/write

  • CMMalloc & CFMMalloc

– multiple offsets into memory pool
– fewer conflicts
– fused variant
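
The fused header of CFMalloc can be illustrated by packing the flag and the next offset into one 32-bit word. The encoding below (flag in bit 0, even offsets only) and the names packHeader, nextOf, isUsed and tryClaim are assumptions made for this sketch. The slides do not give the actual bit layout.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical fused header: bit 0 is the used flag, the remaining bits hold
// the offset of the next chunk, which must therefore be even.
constexpr std::uint32_t kUsedBit = 1u;

inline std::uint32_t packHeader(std::uint32_t next, bool used) {
    assert((next & kUsedBit) == 0);  // the offset must leave bit 0 free
    return next | (used ? kUsedBit : 0u);
}

inline std::uint32_t nextOf(std::uint32_t header) { return header & ~kUsedBit; }
inline bool isUsed(std::uint32_t header) { return (header & kUsedBit) != 0; }

// Tries to claim a free chunk with one compare-and-swap on the fused word.
// On success stores the chunk's next offset in `next` and returns true.
inline bool tryClaim(std::atomic<std::uint32_t>& header, std::uint32_t& next) {
    std::uint32_t seen = header.load();
    if (isUsed(seen)) return false;
    if (!header.compare_exchange_strong(seen, seen | kUsedBit)) return false;
    next = nextOf(seen);
    return true;
}
```

Because flag and pointer share one word, a single read yields both and a single atomic write updates both, which is the "single word read/write" property listed above.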

SLIDE 20

Four Evaluation Tests

  • Alloc Dealloc [Huang et al., 2010]

– allocation followed by deallocation

  • Alloc Cycle Dealloc [Widmer et al., 2013]

– many allocations followed by deallocation of all memory

  • Probability [Steinberger et al., 2012]

– multiple kernels
– allocation with probability pAlloc, deallocation with probability pFree

  • Data Structure Build

– kd-tree build
– allocation of children index arrays

SLIDE 21

Alloc Dealloc – Allocation Size

SLIDE 22

Alloc Cycle Dealloc – Allocations per Thread

SLIDE 23

Probability – Number of CUDA Blocks

SLIDE 24

Data Structure Build (Slowdown)

Kd-tree build time (the lower the better)

Allocator      #Registers  Sponza   Crytek Sponza  Dragon    Sodahall
AtomicMalloc*  4           18.4 ms  43.2 ms        378.9 ms  325.5 ms
CudaMalloc     6           2.99x    3.82x          20.95x    3.38x
ScatterAlloc   38          1.74x    2.01x          1.78x     5.08x
AWMalloc*      6           0.88x    0.96x          0.90x     0.95x
CMalloc        16          1.04x    1.14x          1.27x     1.14x
CFMalloc       10          1.20x    1.08x          1.17x     1.17x

* Not full dynamic memory allocator

SLIDE 25

Conclusions

  • four evaluation tests (new kd-tree build)
  • AtomicMalloc / AWMalloc

– no deallocation

  • ScatterAlloc

– similarly sized allocations, enough registers

  • FDGMalloc

– successive allocations

  • CMalloc / CFMalloc

– variably sized allocations / unknown allocation properties
– up to 5.08 / 1.14 = 4.46x faster than ScatterAlloc (kd-tree build, Sodahall scene)

SLIDE 26

Project Webpage

  • http://decibel.fi.muni.cz/~xvinkl/CMalloc