Memory Management Tips, Tricks & Techniques Stephen Jones, - - PowerPoint PPT Presentation

memory management
SMART_READER_LITE
LIVE PREVIEW

Memory Management Tips, Tricks & Techniques Stephen Jones, - - PowerPoint PPT Presentation

Memory Management Tips, Tricks & Techniques Stephen Jones, SpaceX, GTC 2015 Conclusion 1. Wrap malloc/cudaMalloc with your own allocator Non-Blockin, Host/Device Data Leak Detection, High Performance Management Debugging &


slide-1
SLIDE 1

Memory Management

Tips, Tricks & Techniques

Stephen Jones, SpaceX, GTC 2015

slide-2
SLIDE 2

Conclusion

  • 1. Wrap malloc/cudaMalloc with your own allocator

Non-Blockin, High Performance Sub-Allocation Host/Device Data Management Leak Detection, Debugging & Profiling

slide-3
SLIDE 3

Conclusion

  • 2. There are three types of memory allocation

For allocations spanning multiple program iterations

  • main data storage
  • C++ objects & configuration data

Persistent, Long-Lived Storage For data which does not persist outside of one iteration

  • per-iteration derived quantities
  • peration working space, double buffers, etc.

Working Space, Lifetime Of Single Iteration For transient allocations with single-procedure lifetime

  • local queues, stacks & objects
  • function-scope working space

Temporary, Local Allocation

slide-4
SLIDE 4

Take Control Of Memory Allocation

slide-5
SLIDE 5

Take Control Of Memory Allocation

Define your own allocate/free functions Overload new & delete for all classes Never call native malloc()

  • r free()

Debug & Leak Detection Non-Blocking Allocation Lightweight Allocators Host/Device Management

slide-6
SLIDE 6

It’s Easy!

// Please don’t name these “malloc” & “free” void *hostAlloc(size_t len) { return malloc(len); } void freeMem(void *ptr) { free(ptr); } // Every class should be “public AllocBase” class AllocBase { public: void *operator new(size_t len) { return hostAlloc(len); } void operator delete(void *ptr) { freeMem(ptr); } };

C malloc() & free() C++ new & delete

slide-7
SLIDE 7

Also Control Device Allocation

// Please don’t name these “malloc” & “free” void *hostAlloc(size_t len) { return malloc(len); } void freeMem(void *ptr) { free(ptr); } void *deviceAlloc(size_t len) { void *ptr; cudaMalloc(&ptr, len); return ptr; } // Every class should be “public AllocBase” class AllocBase { public: void *operator new(size_t len) { return hostAlloc(len); } void operator delete(void *ptr) { freeMem(ptr); } };

C malloc() & free() C++ new & delete

slide-8
SLIDE 8

Allocation Tracking, Leak Detection & Profiling

slide-9
SLIDE 9

Memory Leak Detection

Track each allocation with unique identifier

  • Allocate extra space for tracking ID
  • Store ID in front of allocation
  • Record IDs assigned & released

requested space ID

char *ptr = (char *)hostAlloc(1000);

1008 bytes 1000 bytes allocation counter Return offset address

actual allocation start

Allocation ID Record

slide-10
SLIDE 10

1 2 3 4 5 6 7 8 9

Allocate

Memory Leak Detection

slide-11
SLIDE 11

Memory Leak Detection

1 3 4 5 8 9

Free

2 6 7

slide-12
SLIDE 12

Memory Leak Detection

2 6 7

Identify Memory Leaks

slide-13
SLIDE 13

Memory Leak Detection

// Use a C++11 atomic to count up allocation ownership static std::atomic<long long>alloc_id = 0; static std::vector<long long>allocationList; void *hostAlloc(size_t len) { long long id = alloc_id++; // Count up allocation ID allocationList[id] = 1; // Record ID as “allocated” // Store allocation ID in front of returned memory void *ptr = malloc(len + 8); *(int *)ptr = id; return (char *)ptr + 8; } void freeMem(void *ptr) { // Extract allocation ID from front of allocation id = *(long long *)((char *)ptr – 8); allocationList[id] = 0; // Record ID as “released” free((char *)ptr - 8); }

slide-14
SLIDE 14

Displaying Unreleased Allocations

For global-scope objects:

  • Constructor called before main()
  • Destructor called after main() exits

WARNING

  • Order of static object construction

& destruction is undefined

  • Tracking objects should not

interact

class TrackingObject { public: // Set up initial data in constructor TrackingObject() { InitTrackingData(); } // Analyse tracking data in destructor virtual ~TrackingObject() { ProcessTrackingData(); } virtual void InitTrackingData() {} virtual void ProcessTrackingData() {} }; // Create global-scope static object. Destructor // is called automatically when program exits. static TrackingObject dataTracker;

slide-15
SLIDE 15

Displaying Unreleased Allocations

// Walks the allocation list looking for unallocated data class AllocationTracker : public TrackingObject { public: void ProcessTrackingData() { for( long long i=0; i<alloc_id; i++ ) { if( allocationList[i] != 0 ) { printf(“Allocation %d not freed\n”, i); } } } } // Creates a tracker which will be called on program shutdown static AllocationTracker __allocationTracker;

slide-16
SLIDE 16

Complete Leak Tracking Code

// Auto display of memory leaks static std::atomic<long long>alloc_id = 0; static std::vector<long long>allocationList; class AllocationTracker { public: void ~AllocationTracker() { for( long long i=0; i<alloc_id; i++ ) { if( allocationList[i] != 0 ) { printf(“Allocation %d not freed\n”, i); } } } } static AllocationTracker __allocationTracker; // Allocator with leak tracking void *hostAlloc(size_t len) { long long id = alloc_id++; allocationList[id] = 1; void *ptr = malloc(len + 8); *ptr = id; return (char *)ptr + 8; } void freeMem(void *ptr) { id = *(long long *)((char *)ptr – 8); allocationList[id] = 0; free((char *)ptr - 8); }

slide-17
SLIDE 17

Host / Device Data Management

slide-18
SLIDE 18

Managing Data Movement

Minimise Code Impact

  • Use managed memory
  • C++ operator & casting shenanigans
  • Focus on memory layout

Large Separate GPU & CPU Code Sections

Explicit Locality Control

  • Streams & copy/compute overlap
  • Carefully managed memory

Interleaved CPU & GPU Execution

No One-Size-Fits-All

  • Fine-grained memory regions
  • Signaling between host & device
  • Consider zero-copy memory

Concurrent CPU & GPU Execution

slide-19
SLIDE 19

Always Use Streams

slide-20
SLIDE 20

Always Use Streams

Whenever you launch a kernel Whenever you copy data Whenever you synchronize

slide-21
SLIDE 21

Streams & Copy/Compute Overlap

Copy Up Copy Back

Tesla & Quadro GPUs support bi-directional copying

slide-22
SLIDE 22

Streams & Copy/Compute Overlap

CPU GPU

slide-23
SLIDE 23

Streams & Copy/Compute Overlap

CPU GPU

slide-24
SLIDE 24

Streams & Copy/Compute Overlap

CPU GPU

Step 1

slide-25
SLIDE 25

Streams & Copy/Compute Overlap

CPU GPU

Step 2

slide-26
SLIDE 26

Streams & Copy/Compute Overlap

CPU GPU

Step 3

slide-27
SLIDE 27

Streams & Copy/Compute Overlap

CPU GPU

Step 4

slide-28
SLIDE 28

Streams & Copy/Compute Overlap

CPU GPU

Step 5

slide-29
SLIDE 29

Streams & Copy/Compute Overlap

CPU GPU

Step 6

slide-30
SLIDE 30

Streams & Copy/Compute Overlap

CPU GPU

Step 7

slide-31
SLIDE 31

Streams & Copy/Compute Overlap

CPU GPU

Step 8

slide-32
SLIDE 32

Streams & Copy/Compute Overlap

CPU GPU

Step 9

slide-33
SLIDE 33

Streams & Copy/Compute Overlap

CPU GPU

slide-34
SLIDE 34

Streams & Copy/Compute Overlap

CPU GPU copy up

Step 1

slide-35
SLIDE 35

Streams & Copy/Compute Overlap

CPU GPU copy up compute

Step 2

slide-36
SLIDE 36

Streams & Copy/Compute Overlap

CPU GPU copy up copy back compute

Step 3

slide-37
SLIDE 37

Streams & Copy/Compute Overlap

CPU GPU compute copy back

Step 4

slide-38
SLIDE 38

Streams & Copy/Compute Overlap

CPU GPU copy back

Step 5

slide-39
SLIDE 39

Streams & Copy/Compute Overlap

CPU GPU

Three Simultaneous Operations

1 2 3

copy up copy back compute

slide-40
SLIDE 40

Overlapping Copy & Compute

copy compute copy time start finish

slide-41
SLIDE 41

Overlapping Copy & Compute

time start finish non-overlapped finish time saved

slide-42
SLIDE 42

In More Detail...

copy up

Stream 3 Stream 1

compute copy back

Stream 2

time start finish

slide-43
SLIDE 43

Compute/Copy Overlap, in Code

// Convert cats to dogs in “N” chunks void catsToDogs(char *cat, char *dog, int width, int height, int N) { // Loop copy+compute+copy for each chunk for( int h=0; h<height; h+=(height/N) ) { // Create a stream for this iteration cudaStream_t s; cudaStreamCreate( &s ); // Allocate device data for this chunk char *deviceData; cudaMalloc( &deviceData, width * (height/N) ); // Copy up then convert then copy back, in our stream cudaMemcpyAsync( deviceData, cat+h*width, ...hostToDevice, s ); convert<<< width, height/N, 0, s >>>( deviceData ); cudaMemcpyAsync( dog+h*width, deviceData, ...deviceToHost, s ); // Free up this iteration’s resources cudaStreamDestroy( s ); cudaFree( deviceData ); } }

slide-44
SLIDE 44

Managed Memory

Very convenient for minimising code impact

  • Can access same pointer from CPU & GPU, directly
  • Data moves automatically
  • Allows full-bandwidth access from GPU
  • Tricky to use because of concurrency constraints (see next slides)

int *data; cudaMallocManaged( &data, 10000000 ); data[100] = 1234; // Access on CPU first launch<<< 1, 1 >>>( data ); // Access on GPU second

slide-45
SLIDE 45

Drawback Of Managed Memory

CPU cannot touch managed memory while the GPU is active

  • “active” means any launch or copy since last synchronize()

int *data; cudaMallocManaged( &data, 10000000 ); launch<<< 1, 1 >>>( data ); // Access on GPU first data[100] = 1234; // CPU access fails // because GPU is busy

slide-46
SLIDE 46

Drawback Of Managed Memory

CPU cannot touch managed memory while the GPU is active

  • “active” means any launch or copy since last synchronize()
  • Even if the GPU kernel is not actually using the data

int *data; cudaMallocManaged( &data, 10000000 ); launch<<< 1, 1 >>>(); // GPU does not touch data data[100] = 1234; // CPU access still fails // because GPU is busy!

slide-47
SLIDE 47

“Attaching” Managed Memory

“Attach” reduces constraint to ’while a specific stream is active’

  • Allows CPU to touch some data while GPU is busy with other data

// Assume streams “s1” & “s2” exist int *data; cudaMallocManaged( &data, 10000000 ); // Associate “data” with stream s1 cudaStreamAttachMemAsync( s1, data ); // Launch GPU work on stream s2 launch<<< 1, 1, 0, s2 >>>(); data[100] = 1234; // Access on CPU succeeds

slide-48
SLIDE 48

Managed Memory Attach Tricks

Trick: You can attach just a part of an allocation

  • Allows heterogeneous access to different parts of the same allocation

// Assume streams “s1” & “s2” exist int *data; cudaMallocManaged( &data, 65536); // Associate half of “data” with stream s1 // and half with stream s2 cudaStreamAttachMemAsync( s1, data, 32768 ); cudaStreamAttachMemAsync( s2, data+32768, 32768 ); // Launch on stream s2 is fine, but you are // responsible for not touching top half of data launch<<< 1, 1, 0, s2 >>>( data ); data[100] = 1234; // Access on CPU succeeds

Attached to stream s2 Attached to stream s1

Single managed allocation

Device access here Host access here

slide-49
SLIDE 49

Managed Memory Attach Tricks

Dirty Trick: Attach memory not used by a kernel to the CPU

  • You can tell CUDA that you know best, and CPU-access is safe
  • Must re-attach to a stream to use it on the device
  • WARNING: Memory will not be shared with GPU while host-attached

// Assume stream s1 exists int *data; cudaMallocManaged( &data, 10000000 ); // Associate data with the CPU cudaStreamAttachMemAsync( s1, data, 0, cudaMemAttachHost ); // Launch GPU work that doesn’t use “data” launch<<< 1, 1, 0, s1 >>>(); data[100] = 1234; // Access on CPU succeeds

slide-50
SLIDE 50

Owning GPU Memory Allocation

slide-51
SLIDE 51

CUDA Memory Allocation Issues

GPU memory allocation can and does synchronize all streams

slide-52
SLIDE 52

cudaMalloc() Behaviour Varies Widely

Repeated allocation of fixed size

slide-53
SLIDE 53

cudaMalloc() Behaviour Varies Widely

Repeated allocation of fixed size

slide-54
SLIDE 54

cudaMalloc() Behaviour Varies Widely

Repeated allocation of fixed size

slide-55
SLIDE 55

cudaMalloc() Behaviour Varies Widely

Repeated allocations of increasing size

slide-56
SLIDE 56

cudaMalloc() Behaviour Varies Widely

Repeated allocations of increasing size

slide-57
SLIDE 57

cudaMalloc() Behaviour Varies Widely

Mixed allocation & free, increasing size

slide-58
SLIDE 58

cudaMalloc() Behaviour Varies Widely

Time variation when allocating small blocks of data

slide-59
SLIDE 59

cudaMalloc() Behaviour Varies Widely

Time variation when allocating larger blocks of data

slide-60
SLIDE 60

Roll-Your-Own Allocators

slide-61
SLIDE 61

Sub-Allocators

  • 1. Pre-allocate a one or more large chunks of memory

Pre-allocated large chunk

Entire GPU memory

slide-62
SLIDE 62

Sub-Allocators

  • 1. Pre-allocate a one or more large chunks of memory
  • 2. Allocation requests then carve out pieces without having to

touch the hardware

Pre-allocated large chunk Suballocation

Entire GPU memory

slide-63
SLIDE 63

Heap Allocators

Memory

Allocation may go anywhere in memory

New Allocation

slide-64
SLIDE 64

Heap Allocators

New Allocation

Memory

No fit Allocation may go anywhere in memory

slide-65
SLIDE 65

Heap Allocators

New Allocation

Memory

No fit Allocation may go anywhere in memory Fit Fit

slide-66
SLIDE 66

Heap Allocators

  • New allocation must find next and/or best-fit free space in memory
  • Free releases block in-place (fragmentation)
  • Complex; countless approaches: SLAB, SLUB, red-black trees, etc.

Memory

Allocation may go anywhere in memory

New Allocation

slide-67
SLIDE 67

Ring-Buffer Allocators

Memory

New Allocation

Allocate from head Free from tail

tail head

slide-68
SLIDE 68

Ring-Buffer Allocators

Memory

New Allocation

Allocate from head Free from tail Always placed at head

tail head

slide-69
SLIDE 69

Ring-Buffer Allocators

  • New allocations always adds to head of buffer
  • Free only permitted from tail – (out-of-order free = fragmentation)
  • Fast, fairly simple, but long-lived allocations will block allocator

Memory

Free from tail

tail head

New Allocation

slide-70
SLIDE 70

Ring-Buffer Allocators

  • New allocations always adds to head of buffer
  • Free only permitted from tail – (out-of-order free = fragmentation)
  • Fast, fairly simple, but long-lived allocations will block allocator

Memory

Free from tail

tail

New Allocation

head

slide-71
SLIDE 71

Ring-Buffer Allocators

  • New allocations always adds to head of buffer
  • Free only permitted from tail – (out-of-order free = fragmentation)
  • Fast, fairly simple, but long-lived allocations will block allocator

Memory

Free from tail

tail

New Allocation

head

slide-72
SLIDE 72

Ring-Buffer Allocators

  • New allocations always adds to head of buffer
  • Free only permitted from tail – (out-of-order free = fragmentation)
  • Fast, fairly simple, but long-lived allocations will block allocator

Memory

tail

New Allocation

head

slide-73
SLIDE 73

Ring-Buffer Allocators

  • New allocations always adds to head of buffer - wraps around at end
  • Free only permitted from tail – (out-of-order free = fragmentation)
  • Fast, fairly simple, but long-lived allocations will block allocator

Memory

tail head

slide-74
SLIDE 74

Stack Allocators

Memory

New Allocation

Allocate up from stack top

Top of stack

Free down from stack top

slide-75
SLIDE 75

Stack Allocators

Memory

New Allocation

Allocate up from stack top

Top of stack

Free down from stack top

slide-76
SLIDE 76

Stack Allocators

  • New allocations always grows top of stack
  • Free always shrinks top of stack – no fragmentation
  • Very fast & simple, but requires free in reverse allocation order

Memory

New Allocation

Top of stack

Free down from stack top

slide-77
SLIDE 77

Sub-Allocator Goals

  • 1. Fast, consistent allocation time
  • 2. Non-blocking (i.e. no implicit synchronization)
  • 3. Efficient – low fragmentation
  • 4. Simple & parallelizable
slide-78
SLIDE 78

Using The Right Tool For The Job

Heap Allocators

  • Memory efficient
  • Slow & serial
  • Wide size range

Persistent, long-lived storage

Ring Buffer Allocators

  • Perf > Efficiency
  • Fast & parallel
  • Fixed block sizes

Iteration-lifetime working space

Stack Allocators

  • Memory efficient
  • Very fast & parallel
  • Small blocks only

Temporary, local allocations

slide-79
SLIDE 79

Implementing Custom Allocators

When allocating memory, record the size so you can free it Also: Pay attention to alignment of returned pointer

8 bytes on CPU, 256 bytes on GPU

requested space ID

char *ptr = (char *)hostAlloc(1000);

1000 bytes Size 1016 bytes

slide-80
SLIDE 80

Conclusion

slide-81
SLIDE 81

Wrap new, malloc & cudaMalloc

High Performance, Non-Blocking Sub-Allocation Host/Device Data Management Leak Detection, Debugging & Profiling

(even if internally you just call malloc/free directly)

slide-82
SLIDE 82

Use The Right Type Of Allocator

For allocations spanning multiple program iterations

  • main data storage
  • C++ objects & configuration data

Persistent, Long-Lived Storage For data which does not persist outside of one iteration

  • per-iteration derived quantities
  • peration working space, double buffers, etc.

Working Space, Lifetime Of Single Iteration For transient allocations with single-procedure lifetime

  • local queues, stacks & objects
  • function-scope working space

Temporary, Local Allocation

(but just use malloc() for persistent heap storage)