Memory Management Tips, Tricks & Techniques Stephen Jones, - PowerPoint PPT Presentation

Memory Management Tips, Tricks & Techniques Stephen Jones, SpaceX, GTC 2015

Conclusion 1. Wrap malloc/cudaMalloc with your own allocator Non-Blockin, Host/Device Data Leak Detection, High Performance Management Debugging & Sub-Allocation Profiling

Conclusion 2. There are three types of memory allocation Persistent, For allocations spanning multiple program iterations Long-Lived main data storage  C++ objects & configuration data  Storage Working Space, For data which does not persist outside of one iteration Lifetime Of per-iteration derived quantities  operation working space, double buffers, etc. Single Iteration  For transient allocations with single-procedure lifetime Temporary, local queues, stacks & objects  Local Allocation  function-scope working space

Take Control Of Memory Allocation

Take Control Of Memory Allocation Debug & Leak Lightweight Define your own Detection Allocators allocate/free functions Overload new & delete for all classes Never call native malloc() Non-Blocking Host/Device or free() Allocation Management

It’s Easy! C++ new & delete C malloc() & free() // Please don’t name these “ malloc ” & “free” // Every class should be “public AllocBase ” void *hostAlloc(size_t len) { class AllocBase { return malloc(len); public: } void *operator new(size_t len) { return hostAlloc(len); void freeMem(void *ptr) { } free(ptr); } void operator delete(void *ptr) { freeMem(ptr); } };

Also Control Device Allocation C++ new & delete C malloc() & free() // Please don’t name these “ malloc ” & “free” // Every class should be “public AllocBase ” void *hostAlloc(size_t len) { class AllocBase { return malloc(len); public: } void *operator new(size_t len) { return hostAlloc(len); void freeMem(void *ptr) { } free(ptr); } void operator delete(void *ptr) { freeMem(ptr); void *deviceAlloc(size_t len) { } void *ptr; }; cudaMalloc(&ptr, len); return ptr; }

Allocation Tracking, Leak Detection & Profiling

Memory Leak Detection Track each allocation with unique identifier  Allocate extra space for tracking ID  Store ID in front of allocation  Record IDs assigned & released Allocation allocation ID counter Record 1008 bytes char *ptr = (char *)hostAlloc(1000); ID requested space 1000 bytes Return offset address actual allocation start

Memory Leak Detection Allocate 0 1 2 3 4 5 6 7 8 9

Memory Leak Detection Free 0 1 2 3 4 5 6 7 8 9

Memory Leak Detection Identify Memory Leaks 2 6 7

Memory Leak Detection // Use a C++11 atomic to count up allocation ownership static std::atomic<long long>alloc_id = 0; static std::vector<long long>allocationList; void *hostAlloc(size_t len) { long long id = alloc_id++; // Count up allocation ID allocationList[id] = 1; // Record ID as “allocated” // Store allocation ID in front of returned memory void *ptr = malloc(len + 8); *(int *)ptr = id; return (char *)ptr + 8; } void freeMem(void *ptr) { // Extract allocation ID from front of allocation id = *(long long *)((char *)ptr – 8); allocationList[id] = 0; // Record ID as “released” free((char *)ptr - 8); }

Displaying Unreleased Allocations class TrackingObject { For global-scope objects: public: // Set up initial data in constructor  TrackingObject() { Constructor called before main() InitTrackingData();  } Destructor called after main() exits // Analyse tracking data in destructor virtual ~TrackingObject() { ProcessTrackingData(); WARNING } virtual void InitTrackingData() {}  Order of static object construction virtual void ProcessTrackingData() {} }; & destruction is undefined  // Create global-scope static object. Destructor Tracking objects should not // is called automatically when program exits. static TrackingObject dataTracker; interact

Displaying Unreleased Allocations // Walks the allocation list looking for unallocated data class AllocationTracker : public TrackingObject { public: void ProcessTrackingData() { for( long long i=0; i<alloc_id; i++ ) { if( allocationList[i] != 0 ) { printf (“Allocation %d not freed \ n”, i); } } } } // Creates a tracker which will be called on program shutdown static AllocationTracker __allocationTracker;

Complete Leak Tracking Code // Auto display of memory leaks // Allocator with leak tracking static std::atomic<long long>alloc_id = 0; void *hostAlloc(size_t len) { static std::vector<long long>allocationList; long long id = alloc_id++; allocationList[id] = 1; class AllocationTracker { public: void *ptr = malloc(len + 8); void ~AllocationTracker() { *ptr = id; for( long long i=0; i<alloc_id; i++ ) { return (char *)ptr + 8; if( allocationList[i] != 0 ) { } printf (“Allocation %d not freed \ n”, i); } void freeMem(void *ptr) { } id = *(long long *)((char *)ptr – 8); } allocationList[id] = 0; } free((char *)ptr - 8); static AllocationTracker __allocationTracker; }

Host / Device Data Management

Managing Data Movement Minimise Code Impact Large Separate Use managed memory  GPU & CPU C++ operator & casting shenanigans  Code Sections Focus on memory layout  Explicit Locality Control Interleaved Streams & copy/compute overlap  CPU & GPU Carefully managed memory  Execution No One-Size-Fits-All Concurrent Fine-grained memory regions  CPU & GPU Signaling between host & device  Execution Consider zero-copy memory 

Always Use Streams

Always Use Streams Whenever you launch a kernel Whenever you copy data Whenever you synchronize

Streams & Copy/Compute Overlap Copy Up Copy Back Tesla & Quadro GPUs support bi-directional copying

Streams & Copy/Compute Overlap CPU GPU

Streams & Copy/Compute Overlap CPU GPU Step 1

Streams & Copy/Compute Overlap CPU GPU

Streams & Copy/Compute Overlap copy up CPU GPU Step 1

Streams & Copy/Compute Overlap compute copy up CPU GPU Step 2

Streams & Copy/Compute Overlap copy back compute copy up CPU GPU Step 3

Streams & Copy/Compute Overlap copy back compute CPU GPU Step 4

Streams & Copy/Compute Overlap copy back CPU GPU Step 5

Streams & Copy/Compute Overlap 1 copy back 2 compute 3 copy up CPU GPU Three Simultaneous Operations

Overlapping Copy & Compute copy compute copy time start finish

Overlapping Copy & Compute time saved time start finish non-overlapped finish

In More Detail... copy up Stream 1 compute copy back Stream 2 Stream 3 time start finish

Compute/Copy Overlap, in Code // Convert cats to dogs in “N” chunks void catsToDogs(char *cat, char *dog, int width, int height, int N) { // Loop copy+compute+copy for each chunk for( int h=0; h<height; h+=(height/N) ) { // Create a stream for this iteration cudaStream_t s; cudaStreamCreate( &s ); // Allocate device data for this chunk char *deviceData; cudaMalloc( &deviceData, width * (height/N) ); // Copy up then convert then copy back, in our stream cudaMemcpyAsync( deviceData, cat+h*width, ...hostToDevice, s ); convert<<< width, height/N, 0, s >>>( deviceData ); cudaMemcpyAsync( dog+h*width, deviceData, ...deviceToHost, s ); // Free up this iteration’s resources cudaStreamDestroy( s ); cudaFree( deviceData ); } }

Managed Memory Very convenient for minimising code impact  Can access same pointer from CPU & GPU, directly  Data moves automatically  Allows full-bandwidth access from GPU  Tricky to use because of concurrency constraints (see next slides) int *data; cudaMallocManaged( &data, 10000000 ); data[100] = 1234; // Access on CPU first launch<<< 1, 1 >>>( data ); // Access on GPU second

Drawback Of Managed Memory CPU cannot touch managed memory while the GPU is active  “active” means any launch or copy since last synchronize() int *data; cudaMallocManaged( &data, 10000000 ); launch<<< 1, 1 >>>( data ); // Access on GPU first data[100] = 1234; // CPU access fails // because GPU is busy

Drawback Of Managed Memory CPU cannot touch managed memory while the GPU is active  “active” means any launch or copy since last synchronize()  Even if the GPU kernel is not actually using the data int *data; cudaMallocManaged( &data, 10000000 ); launch<<< 1, 1 >>>(); // GPU does not touch data data[100] = 1234; // CPU access still fails // because GPU is busy!

Memory Management Tips, Tricks & Techniques Stephen Jones, - PowerPoint PPT Presentation

Memory Management Tips, Tricks & Techniques Stephen Jones, SpaceX, GTC 2015 Conclusion 1. Wrap malloc/cudaMalloc with your own allocator Non-Blockin, Host/Device Data Leak Detection, High Performance Management Debugging &

Memory Management Memory Management 5A. Memory Management and Address Spaces 1. allocate/assign

Memory Management Memory Management 5A. Memory Management and Address Spaces 1. allocate/assign

Lecture 5: Memory Management 1 / 54 Memory Management Administrivia Assignment 1 is due on

28.05.04 09:50 Memory Management The computer memory is a limited resource so the Memory

Memory Management Memory Management Memory a linear array of bits, bytes, words, pages ... -

Kernel level memory management 1. The very base on boot vs memory management 2. Memory Nodes

Kernel level memory management 1. The very base on boot vs memory management 2. Memory Nodes

Kernel level memory management 1. The very base on boot vs memory management 2. Memory Nodes

CS333 Intro to Operating Systems Jonathan Walpole Memory Management Memory Management Memory

Memory Management Outline Processes (done) Memory Management Basic (done)

Chapter 4: Memory Management Part 1: Mechanisms for Managing Memory Memory management n Basic

Memory Management Ideally programmers want memory that is large fast non

Memory Management 1 Overview Basic memory management Address Spaces Virtual

Dynamic Memory Management 333 Dynamic Memory Management Process Memory Layout Process Memory

Memory Management 5A. Memory Management and Address Spaces 5B. Allocation Algorithms Operating

Chapter 4: Memory Management Part 1: Mechanisms for Managing Memory Memory management Basic

Memory Management Chapter 7 1 Memory Management Subdividing memory to accommodate multiple

Garbage collection Strategies for automatic memory management Memory management Explicit

Memory Management 54 Memory Management Programs expand to fill the memory that holds them.

Memory Management Outline Processes (done) Memory Management Basic (done)

Memory management programs use a lot of memory over their lifetime, but not all of it at the same

Memory Management Memory Manager Requirements Minimize primary memory access time

Last class: Memory Management Today: Paging Memory Management Basics Allocation

Last Class: Memory Management Allocating memory to processes Limited physical memory,