Register Efficient Memory Allocator for GPUs
Marek Vinkler, Masaryk University in Brno, Czech Republic
Vlastimil Havran, Czech Technical University in Prague, Czech Republic

Motivation
● ray tracing
● data structure (kd-tree)
  – GPU memory allocation
  – existing allocators slow
Presentation: Marek Vinkler

Motivation contd.
● sophisticated algorithms
  – complex memory usage
● kernel launch overhead
● CPU allocators inefficient
● http://decibel.fi.muni.cz/~xvinkl/CMalloc

State-of-the-art
● AtomicMalloc [Tzeng et al., 2010] (hinted)
● CudaMalloc (CUDA built-in, 2010)
● ScatterAlloc [Steinberger et al., 2012]
● FDGMalloc [Widmer et al., 2013]

AtomicMalloc [Tzeng et al., 2010]
● simplest: one atomic operation
● very fast
● serialization of execution
● cannot free memory
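The "one atomic operation" idea can be illustrated with a host-side sketch. This is not the code from Tzeng et al.: `BumpPool` and its members are hypothetical names, and `std::atomic::fetch_add` stands in for the GPU's `atomicAdd`. Each call reserves a range by bumping a shared offset, which is also why individual chunks can never be returned.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side stand-in for a GPU bump allocator.
struct BumpPool {
    std::uint8_t* mem;                      // start of the preallocated pool
    std::size_t capacity;                   // pool size in bytes
    std::atomic<std::size_t> memOffset{0};  // first unreserved byte

    BumpPool(std::uint8_t* m, std::size_t cap) : mem(m), capacity(cap) {
        assert(m != nullptr);
    }

    // A single atomic add reserves [old, old + size).
    // Returns nullptr once the pool is exhausted; nothing is ever freed.
    void* alloc(std::size_t size) {
        std::size_t old = memOffset.fetch_add(size, std::memory_order_relaxed);
        if (old > capacity || size > capacity - old) return nullptr;
        return mem + old;
    }
};
```

All threads contend on the same counter, which matches the "serialization of execution" bullet above.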

AtomicMalloc contd. (diagram; labels: mem, memOffset)

CudaMalloc [2010]
● CUDA 3.2 and higher built-in
● unpublished algorithm
● very slow

CudaMalloc contd. (diagram; labels: CudaMalloc, ?)

ScatterAlloc [Steinberger et al., 2012]
● hashing
  – many parallel allocations
  – roughly the same size of requests
● very fast on small allocations
● very slow on large allocations

ScatterAlloc contd. (diagram; labels: mem, page)

FDGMalloc [Widmer et al., 2013]
● CudaMalloc for superblocks
● local management in superblocks
  – low overhead
● very fast subsequent allocations
● cannot free memory separately
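A rough host-side sketch of the superblock idea follows. It is an illustration under assumptions, not the FDGMalloc implementation: `SuperblockArena` is a hypothetical name, `std::malloc` stands in for the slow general-purpose allocator (CudaMalloc on the slide), and the per-warp organization of the original is ignored. Only the first allocation in a superblock pays for the slow call; later ones just bump a local offset, and memory is released all at once.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical sketch: std::malloc plays the role of the slow allocator.
class SuperblockArena {
public:
    explicit SuperblockArena(std::size_t superblockSize) : sbSize_(superblockSize) {}
    ~SuperblockArena() { freeAll(); }
    SuperblockArena(const SuperblockArena&) = delete;
    SuperblockArena& operator=(const SuperblockArena&) = delete;

    // Fast path: bump the offset inside the current superblock.
    // Slow path: fetch a fresh superblock from the general allocator.
    void* alloc(std::size_t size) {
        assert(size <= sbSize_);
        if (blocks_.empty() || offset_ + size > sbSize_) {
            void* sb = std::malloc(sbSize_);
            if (sb == nullptr) return nullptr;
            blocks_.push_back(static_cast<char*>(sb));
            offset_ = 0;
        }
        void* p = blocks_.back() + offset_;
        offset_ += size;
        return p;
    }

    // Individual chunks cannot be freed; everything goes at once.
    void freeAll() {
        for (char* sb : blocks_) std::free(sb);
        blocks_.clear();
        offset_ = 0;
    }

    std::size_t superblockCount() const { return blocks_.size(); }

private:
    std::size_t sbSize_;
    std::size_t offset_ = 0;
    std::vector<char*> blocks_;
};
```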

FDGMalloc contd. (diagram; labels: CudaMalloc, offset, offset)

Limitations
● slow on large/variable sized allocations
  – CudaMalloc (CUDA built-in, 2010)
  – ScatterAlloc [Steinberger et al., 2012]
● cannot reuse memory
  – AtomicMalloc [Tzeng et al., 2010]
  – FDGMalloc [Widmer et al., 2013]

Two Proposed GPU Allocators
● AWMalloc
  – variant of AtomicMalloc
  – can reuse memory
● CMalloc
  – practical list-based allocator

AWMalloc
● circular memory pool
● used memory can be overwritten
● large memory pool and short-lived allocations
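The wrap-around behaviour can be sketched on the host as below. The slides do not give the actual AWMalloc code, so this is only one way to get a circular pool: `WrapPool` is a hypothetical name, and a compare-and-swap loop is swapped in for whatever atomic scheme the authors use. When a request does not fit before the end of the pool, the offset restarts at zero, so old chunks are silently overwritten. That is only safe under the slide's assumption of a large pool and short-lived allocations.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical circular bump allocator; there is no free() at all.
struct WrapPool {
    std::uint8_t* mem;
    std::size_t capacity;
    std::atomic<std::size_t> memOffset{0};  // always stays within [0, capacity]

    WrapPool(std::uint8_t* m, std::size_t cap) : mem(m), capacity(cap) {
        assert(m != nullptr && cap > 0);
    }

    void* alloc(std::size_t size) {
        if (size == 0 || size > capacity) return nullptr;
        std::size_t old = memOffset.load(std::memory_order_relaxed);
        for (;;) {
            // Wrap to the start when the chunk would cross the pool end.
            std::size_t start = (size <= capacity - old) ? old : 0;
            // On failure `old` is refreshed with the current offset and we retry.
            if (memOffset.compare_exchange_weak(old, start + size,
                                                std::memory_order_relaxed)) {
                return mem + start;
            }
        }
    }
};
```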

AWMalloc contd. (diagram; labels: mem, memOffset)

CMalloc
● circular memory pool
● header for individual allocations
  – correctness
  – split / merge
● initial splitting:
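The header, split and merge bullets can be made concrete with a single-threaded sketch. This is an assumption-laden illustration and not the CMalloc source: `ListPool`, the two-word header layout (flag, next offset) and the first-fit walk are hypothetical choices, and the atomic operations a GPU version needs are left out entirely. A next offset of 0 means "wrap to the start of the pool".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical single-threaded model; offsets are indices of 32-bit words.
class ListPool {
public:
    static constexpr std::uint32_t kFree = 0, kUsed = 1, kHeader = 2;
    static constexpr std::uint32_t kFail = 0xFFFFFFFFu;

    // Zero-initialized pool: header at 0 reads {kFree, next = 0}, one big chunk.
    explicit ListPool(std::uint32_t words) : w_(static_cast<std::size_t>(words), 0u) {
        assert(words > kHeader);
    }

    // First fit starting at memOffset_; returns the payload index or kFail.
    std::uint32_t alloc(std::uint32_t words) {
        const std::uint32_t start = memOffset_;
        std::uint32_t cur = start;
        bool wrapped = false;
        while (!(wrapped && cur >= start)) {
            if (w_[cur] == kFree) {
                mergeFollowing(cur);
                std::uint32_t payload = endOf(cur) - cur - kHeader;
                if (payload >= words) {
                    if (payload - words > kHeader) split(cur, words);
                    w_[cur] = kUsed;
                    memOffset_ = w_[cur + 1];
                    return cur + kHeader;
                }
            }
            if (w_[cur + 1] == 0) wrapped = true;
            cur = w_[cur + 1];
        }
        return kFail;
    }

    // Freeing only flips the flag; merging happens lazily during alloc.
    void release(std::uint32_t payloadIndex) { w_[payloadIndex - kHeader] = kFree; }

private:
    std::uint32_t endOf(std::uint32_t c) const {
        return w_[c + 1] == 0 ? static_cast<std::uint32_t>(w_.size()) : w_[c + 1];
    }
    // Absorb free chunks that directly follow chunk c (never across the wrap).
    void mergeFollowing(std::uint32_t c) {
        std::uint32_t n = w_[c + 1];
        while (n != 0 && w_[n] == kFree) {
            if (n == memOffset_) memOffset_ = c;  // keep the start hint on a live header
            n = w_[n + 1];
            w_[c + 1] = n;
        }
    }
    // Carve a new free chunk out of the tail of chunk c.
    void split(std::uint32_t c, std::uint32_t words) {
        std::uint32_t rest = c + kHeader + words;
        w_[rest] = kFree;
        w_[rest + 1] = w_[c + 1];
        w_[c + 1] = rest;
    }

    std::vector<std::uint32_t> w_;
    std::uint32_t memOffset_ = 0;
};
```

Unlike the bump allocators, a released chunk becomes reusable here, at the cost of a header per allocation and a list walk.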

CMalloc contd. (diagram; labels: mem, memOffset)

CMalloc Variants
● CFMalloc
  – fused flag and next pointer
  – single word read/write
● CMMalloc & CFMMalloc
  – multiple offsets into memory pool
  – fewer conflicts
  – fused variant
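One plausible way to fuse the flag and the next pointer into a single word is to keep the flag in the lowest bit. The exact bit layout of CFMalloc is not given on the slides, so the layout and the helper names `pack`, `nextOf`, `isUsed` and `tryAcquire` below are assumptions. The point of the sketch is that a chunk can be inspected with one load and claimed with one compare-and-swap.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Assumed layout: bit 0 = used flag, bits 1..31 = offset of the next header.
inline std::uint32_t pack(std::uint32_t next, bool used) {
    assert(next < (1u << 31));  // the offset must fit in 31 bits
    return (next << 1) | (used ? 1u : 0u);
}
inline std::uint32_t nextOf(std::uint32_t header) { return header >> 1; }
inline bool isUsed(std::uint32_t header) { return (header & 1u) != 0; }

// Try to claim a free chunk. On success the next offset is returned
// through nextOut without a second memory read.
inline bool tryAcquire(std::atomic<std::uint32_t>& header, std::uint32_t& nextOut) {
    std::uint32_t h = header.load(std::memory_order_relaxed);
    if (isUsed(h)) return false;
    if (!header.compare_exchange_strong(h, h | 1u, std::memory_order_acq_rel)) {
        return false;  // another thread changed the header first
    }
    nextOut = nextOf(h);
    return true;
}
```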

Four Evaluation Tests
● Alloc Dealloc [Huang et al., 2010]
  – allocation followed by deallocation
● Alloc Cycle Dealloc [Widmer et al., 2013]
  – many allocations followed by deallocation of all memory
● Probability [Steinberger et al., 2012]
  – multiple kernels
  – allocation with probability pAlloc, deallocation with probability pFree
● Data Structure Build
  – kd-tree build
  – allocation of children index arrays

Alloc Dealloc – Allocation Size (plot)

Alloc Cycle Dealloc – Allocations per Thread (plot)

Probability – Number of CUDA Blocks (plot)

Data Structure Build (Slowdown)
Kd-tree build time (the lower the better)

Allocator      #Registers  Sponza   Crytek Sponza  Dragon    Sodahall
AtomicMalloc*  4           18.4 ms  43.2 ms        378.9 ms  325.5 ms
CudaMalloc     6           2.99x    3.82x          20.95x    3.38x
ScatterAlloc   38          1.74x    2.01x          1.78x     5.08x
AWMalloc*      6           0.88x    0.96x          0.90x     0.95x
CMalloc        16          1.04x    1.14x          1.27x     1.14x
CFMalloc       10          1.20x    1.08x          1.17x     1.17x

* Not full dynamic memory allocator

Conclusions
● four evaluation tests (new kd-tree build)
● AtomicMalloc / AWMalloc
  – no deallocation
● ScatterAlloc
  – similarly sized allocations, enough registers
● FDGMalloc
  – successive allocations
● CMalloc / CFMalloc
  – variably sized allocations / unknown allocation properties
  – up to 5.08 / 1.14 = 4.46x faster than ScatterAlloc

Project Webpage
● http://decibel.fi.muni.cz/~xvinkl/CMalloc
