Memory Management Tips, Tricks & Techniques, Stephen Jones (PowerPoint PPT presentation)
  1. Memory Management Tips, Tricks & Techniques Stephen Jones, SpaceX, GTC 2015

  2. Conclusion 1. Wrap malloc/cudaMalloc with your own allocator: non-blocking allocation, host/device data management, leak detection, debugging & profiling, high-performance sub-allocation.

  3. Conclusion 2. There are three types of memory allocation. Persistent (long-lived storage): for allocations spanning multiple program iterations; main data storage; C++ objects & configuration data. Working space (lifetime of a single iteration): for data which does not persist outside of one iteration; per-iteration derived quantities; operation working space; double buffers, etc. Temporary (local allocation): for transient allocations with single-procedure lifetime; local queues, stacks & objects; function-scope working space.

  4. Take Control Of Memory Allocation

  5. Take Control Of Memory Allocation: define your own allocate/free functions; overload new & delete for all classes; never call native malloc() or free(). Benefits: debug & leak detection, lightweight allocators, non-blocking allocation, host/device management.

  6. It's Easy!

     C malloc() & free():

         // Please don't name these "malloc" & "free"
         void *hostAlloc(size_t len) {
             return malloc(len);
         }

         void freeMem(void *ptr) {
             free(ptr);
         }

     C++ new & delete:

         // Every class should be "public AllocBase"
         class AllocBase {
         public:
             void *operator new(size_t len) {
                 return hostAlloc(len);
             }
             void operator delete(void *ptr) {
                 freeMem(ptr);
             }
         };

  7. Also Control Device Allocation: the same hostAlloc/freeMem and AllocBase as the previous slide, plus a device-side allocator:

         void *deviceAlloc(size_t len) {
             void *ptr;
             cudaMalloc(&ptr, len);
             return ptr;
         }

  8. Allocation Tracking, Leak Detection & Profiling

  9. Memory Leak Detection: track each allocation with a unique identifier. Allocate extra space for the tracking ID; store the ID in front of the allocation; record IDs assigned & released in an allocation record, using an allocation ID counter. Example: char *ptr = (char *)hostAlloc(1000); the requested space is 1000 bytes, the actual allocation is 1008 bytes (the ID sits at the actual allocation start), and the returned pointer is offset past the ID.

  10. Memory Leak Detection: Allocate (animation: IDs 0-9 are handed out as allocations are made)

  11. Memory Leak Detection: Free (animation: IDs are recorded as released when allocations are freed)

  12. Memory Leak Detection: Identify Memory Leaks (animation: IDs 2, 6 & 7 were never released)

  13. Memory Leak Detection

         // Use a C++11 atomic to count up allocation ownership
         static std::atomic<long long> alloc_id(0);
         static std::vector<long long> allocationList;  // must be sized large enough before use

         void *hostAlloc(size_t len) {
             long long id = alloc_id++;     // Count up allocation ID
             allocationList[id] = 1;        // Record ID as "allocated"

             // Store allocation ID in front of returned memory
             void *ptr = malloc(len + 8);
             *(long long *)ptr = id;
             return (char *)ptr + 8;
         }

         void freeMem(void *ptr) {
             // Extract allocation ID from front of allocation
             long long id = *(long long *)((char *)ptr - 8);
             allocationList[id] = 0;        // Record ID as "released"
             free((char *)ptr - 8);
         }

  14. Displaying Unreleased Allocations

     For global-scope objects: the constructor is called before main(), and the destructor is called after main() exits. Set up initial data in the constructor; analyse tracking data in the destructor. WARNING: the order of static object construction & destruction is undefined, so tracking objects should not interact.

         class TrackingObject {
         public:
             // Set up initial data in constructor
             TrackingObject() {
                 InitTrackingData();
             }
             // Analyse tracking data in destructor
             virtual ~TrackingObject() {
                 ProcessTrackingData();
             }
             virtual void InitTrackingData() {}
             virtual void ProcessTrackingData() {}
         };

         // Create global-scope static object. Destructor
         // is called automatically when program exits.
         static TrackingObject dataTracker;

  15. Displaying Unreleased Allocations

         // Walks the allocation list looking for unreleased data
         class AllocationTracker : public TrackingObject {
         public:
             // Run the walk from this class's own destructor: a base-class
             // destructor cannot dispatch virtually to a derived override
             ~AllocationTracker() { ProcessTrackingData(); }
             void ProcessTrackingData() {
                 for( long long i = 0; i < alloc_id; i++ ) {
                     if( allocationList[i] != 0 ) {
                         printf("Allocation %lld not freed\n", i);
                     }
                 }
             }
         };

         // Creates a tracker which will be called on program shutdown
         static AllocationTracker __allocationTracker;

  16. Complete Leak Tracking Code

         // Allocator with leak tracking
         static std::atomic<long long> alloc_id(0);
         static std::vector<long long> allocationList;  // must be sized large enough before use

         void *hostAlloc(size_t len) {
             long long id = alloc_id++;
             allocationList[id] = 1;
             void *ptr = malloc(len + 8);
             *(long long *)ptr = id;
             return (char *)ptr + 8;
         }

         void freeMem(void *ptr) {
             long long id = *(long long *)((char *)ptr - 8);
             allocationList[id] = 0;
             free((char *)ptr - 8);
         }

         // Auto display of memory leaks
         class AllocationTracker {
         public:
             ~AllocationTracker() {
                 for( long long i = 0; i < alloc_id; i++ ) {
                     if( allocationList[i] != 0 ) {
                         printf("Allocation %lld not freed\n", i);
                     }
                 }
             }
         };
         static AllocationTracker __allocationTracker;

  17. Host / Device Data Management

  18. Managing Data Movement

     Minimise code impact: use managed memory; C++ operator & casting shenanigans; focus on memory layout.

     Large separate GPU & CPU code sections: explicit locality control; streams & copy/compute overlap.

     Interleaved CPU & GPU execution: carefully managed memory.

     Concurrent CPU & GPU execution: no one-size-fits-all; fine-grained memory regions; signaling between host & device; consider zero-copy memory.

  19. Always Use Streams

  20. Always Use Streams: whenever you launch a kernel, whenever you copy data, whenever you synchronize.

  21. Streams & Copy/Compute Overlap (diagram: copy up and copy back lanes). Tesla & Quadro GPUs support bi-directional copying.

  22-33. Streams & Copy/Compute Overlap (animation: CPU/GPU activity shown step by step, Steps 1-9)

  34-39. Streams & Copy/Compute Overlap (animation, Steps 1-5: copy up, compute, and copy back for successive chunks overlap, ending with three simultaneous operations: 1 copy back, 2 compute, 3 copy up)

  40. Overlapping Copy & Compute (timeline: copy, compute, copy between start and finish)

  41. Overlapping Copy & Compute (timeline: time saved relative to the non-overlapped finish)

  42. In More Detail... (timeline: copy up in Stream 1, compute in Stream 2, copy back in Stream 3)

  43. Compute/Copy Overlap, in Code

         // Convert cats to dogs in "N" chunks
         void catsToDogs(char *cat, char *dog, int width, int height, int N) {
             // Loop copy+compute+copy for each chunk
             for( int h=0; h<height; h+=(height/N) ) {
                 // Create a stream for this iteration
                 cudaStream_t s;
                 cudaStreamCreate( &s );

                 // Allocate device data for this chunk
                 char *deviceData;
                 cudaMalloc( &deviceData, width * (height/N) );

                 // Copy up then convert then copy back, in our stream
                 cudaMemcpyAsync( deviceData, cat+h*width, ...hostToDevice, s );
                 convert<<< width, height/N, 0, s >>>( deviceData );
                 cudaMemcpyAsync( dog+h*width, deviceData, ...deviceToHost, s );

                 // Free up this iteration's resources
                 cudaStreamDestroy( s );
                 cudaFree( deviceData );
             }
         }

  44. Managed Memory: very convenient for minimising code impact. Can access the same pointer from CPU & GPU directly; data moves automatically; allows full-bandwidth access from the GPU. Tricky to use because of concurrency constraints (see next slides).

         int *data;
         cudaMallocManaged( &data, 10000000 );
         data[100] = 1234;            // Access on CPU first
         launch<<< 1, 1 >>>( data );  // Access on GPU second

  45. Drawback Of Managed Memory: the CPU cannot touch managed memory while the GPU is active, where "active" means any launch or copy since the last synchronize().

         int *data;
         cudaMallocManaged( &data, 10000000 );
         launch<<< 1, 1 >>>( data );  // Access on GPU first
         data[100] = 1234;            // CPU access fails
                                      // because GPU is busy

  46. Drawback Of Managed Memory: the CPU cannot touch managed memory while the GPU is active, even if the GPU kernel is not actually using the data.

         int *data;
         cudaMallocManaged( &data, 10000000 );
         launch<<< 1, 1 >>>();  // GPU does not touch data
         data[100] = 1234;      // CPU access still fails
                                // because GPU is busy!
