
ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors

Weifeng Liu and Brian Vinter

Niels Bohr Institute University of Copenhagen Denmark {weifeng, vinter}@nbi.dk

March 1, 2014

Weifeng Liu and Brian Vinter (NBI) ad-heap (GPGPU-7, Salt Lake City) March 1, 2014 1 / 52


First Section: Heap Data Structure Review

Binary heap

Figure: The layout of a binary heap (2-heap) of size 12.

Given a node at storage position i, its parent node is at ⌊(i − 1)/2⌋, its child nodes are at 2i + 1 and 2i + 2.
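As a quick sanity check, the index arithmetic can be written directly (a minimal sketch; the function names are ours):

```cpp
#include <cassert>

// Index arithmetic for a 0-based binary heap (2-heap) stored in an array.
int parent2(int i)     { return (i - 1) / 2; }  // parent of node i
int leftChild2(int i)  { return 2 * i + 1; }    // first child of node i
int rightChild2(int i) { return 2 * i + 2; }    // second child of node i
```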


d-heaps [Johnson, 1975]

Figure: The layout of a 4-heap of size 12.

For node i, its parent node is at ⌊(i − 1)/d⌋, its child nodes begin from di + 1 and end up at di + d.
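The d-ary generalization in code (a sketch; names are ours, with d passed explicitly):

```cpp
#include <cassert>

// Index arithmetic for a 0-based d-heap: node i has parent floor((i-1)/d)
// and children d*i + 1 through d*i + d.
int parentD(int i, int d)     { return (i - 1) / d; }
int firstChildD(int i, int d) { return d * i + 1; }
int lastChildD(int i, int d)  { return d * i + d; }
```

With d = 2 these reduce to the binary-heap formulas on the previous slide.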


Cache-aligned d-heaps [LaMarca and Ladner, 1996]

Figure: The layout of a cache-aligned 4-heap of size 12.

For node i, its parent node is at ⌊(i − 1)/d⌋ + offset, its child nodes begin from di + 1 + offset and end up at di + d + offset, where offset = d − 1 is the padded head size.
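A sketch of the aligned indexing, under our reading that i is the logical node index and the returned values are padded storage positions (so every d-child group starts on a d-aligned boundary):

```cpp
#include <cassert>

// Cache-aligned d-heap: the array is padded with offset = d - 1 head
// slots, so logical node i lives at storage position i + offset.
int alignedParentPos(int i, int d)     { return (i - 1) / d + (d - 1); }
int alignedFirstChildPos(int i, int d) { return d * i + 1 + (d - 1); }
int alignedLastChildPos(int i, int d)  { return d * i + d + (d - 1); }
```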


Operations on the d-heaps

insert adds a new node at the end of the heap, increases the heap size to n + 1, and takes O(log_d n) worst-case time to reconstruct the heap property.

delete-max copies the last node to the position of the root node, decreases the heap size to n − 1, and takes O(d log_d n) worst-case time to reconstruct the heap property.

update-key updates a node, keeps the heap size unchanged, and takes O(d log_d n) worst-case time to reconstruct the heap property.


Update-key operation on the root node (step 0)

Figure: Initial status.


Update-key operation on the root node (step 1)

Figure: Update the value of the root node. The heap property on levels 1 and 2 might then be broken.


Update-key operation on the root node (step 2)

Figure: Find the maximum child node of the updated parent node.


Update-key operation on the root node (step 3)

Figure: Compare, and swap if the max child node is larger than its parent node. The heap property on levels 2 and 3 might then be broken.


Update-key operation on the root node (step 4)

Figure: Find the maximum child node of the updated parent node.


Update-key operation on the root node (step 5)

Figure: Compare, and swap if the max child node is larger than its parent node. Then, with no more child nodes, heap property reconstruction is done.


Update-key operation on the root node (step 6)

Figure: Final status.


Unroll the above update-key operation

Step 1: update the root node
Step 2: find-maxchild
Step 3: compare-and-swap
Step 4: find-maxchild
Step 5: compare-and-swap
Step 6: heap property satisfied, return


Second Section: When Heaps Met GPUs

Running d-heaps on GPUs?


The above update-key operation on GPUs

Given a 32-heap running in a thread-block (or work-group) of 32 threads (or work-items):
Step 1: update the root node
Step 2: find-maxchild (parallel reduction)
Step 3: compare-and-swap
Step 4: find-maxchild (parallel reduction)
Step 5: compare-and-swap
Step 6: heap property satisfied, return


Pros and Cons

Pros – why we want GPUs?
- much faster find-maxchild using parallel reduction
- contiguous child nodes loaded with few memory transactions (coalesced memory access)
- a shallow heap accelerates the insert operation

Cons – why we hate them?
- slow compare-and-swap executed by one single weak thread
- the other threads wait for a long time on that thread's high-latency off-chip memory access


Third Section: Asymmetric Multicore Processors

Emerging Asymmetric Multicore Processors (AMPs)


The block diagram of an AMP used in this work

The chip consists of four major parts:
- a group of Latency Compute Units (LCUs) with caches,
- a group of Throughput Compute Units (TCUs) with shared command processors, scratchpad memory and caches,
- a shared memory management unit, and
- a shared global DRAM.


Heterogeneous System Architecture (HSA): a step forward

Main features in the current HSA design:
- the two types of compute units share a unified memory address space
  - no data transfer through the PCIe link
  - large pageable memory for the TCUs
  - much more efficient LCU-TCU interaction due to coherency
- fast LCU-TCU synchronization mechanism
  - user-mode queueing system
  - shared memory signal object
  - much lighter driver overhead


Leveraging the AMPs?

A direct way is to exploit task, data and pipeline parallelism across the two types of cores. But two questions remain: Can the AMPs expose fine-grained parallelism in fundamental data structure and algorithm design? And can the new designs outperform their conventional counterparts combined with coarse-grained (task, data and pipeline) parallelization?


Fourth Section: ad-heap

ad-heap data structure

We propose ad-heap (asymmetric d-heap), a new heap data structure that obtains performance benefits from both types of cores. The ad-heap introduces a new component – a bridge structure – located in the originally empty head part of the cache-aligned d-heap. The bridge consists of one node counter and one sequence of size 2h, where h is the height of the heap.

Figure: The layout of the ad-heap data structure.
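A sketch of the resulting layout, under our reading that the 2h bridge slots hold up to h (position, value) pairs; all field names are ours, not the paper's:

```cpp
#include <cassert>
#include <vector>

// ad-heap storage: [counter | 2h bridge slots | heap nodes ...]
// The bridge occupies the formerly empty padded head of the d-heap.
struct ADHeapLayout {
    int d, h;
    std::vector<int> storage;

    explicit ADHeapLayout(int d_, int h_)
        : d(d_), h(h_), storage(1 + 2 * h_, 0) {}

    int  bridgeSlots() const { return 2 * h; }    // h (pos, value) pairs
    int& counter()           { return storage[0]; }
    int& bridgePos(int k)    { return storage[1 + 2 * k]; }
    int& bridgeVal(int k)    { return storage[2 + 2 * k]; }
    int  heapStart() const   { return 1 + 2 * h; } // nodes follow the bridge
};
```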


Update-key operation on the root node (step 0)

Figure: Initial status.


Update-key operation on the root node (step 1)

Figure: An LCU updates the value of the root node. The heap property on levels 1 and 2 might then be broken, so we issue an LCU → TCU call.


Update-key operation on the root node (step 2)

Figure: An invoked TCU initializes the bridge in its scratchpad memory and finds the maximum child node of the updated parent node.


Update-key operation on the root node (step 3)

Figure: The TCU compares the nodes; if the max child node is larger than its parent, it updates the node counter and saves the parent node position and the max child node value to the on-chip bridge. The heap property on levels 2 and 3 might then be broken.


Update-key operation on the root node (step 4)

Figure: The TCU finds the maximum child node of the updated parent node.


Update-key operation on the root node (step 5)

Figure: The TCU compares the nodes; if the max child node is larger than its parent, it updates the node counter and saves the parent node position and the max child node value to the bridge.


Update-key operation on the root node (step 6)

Figure: The TCU updates the node counter and saves the child node position and the parent node value to the bridge; since there are no more child nodes, heap property reconstruction is done.


Update-key operation on the root node (step 7)

Figure: The TCU dumps the bridge from the scratchpad memory to the global memory. Then we issue a TCU → LCU call.


Update-key operation on the root node (step 8)

Figure: An invoked LCU reads each key-value pair and saves each value to its final position. Then all entries are up to date.


Update-key operation on the root node (step 9)

Figure: Final status.
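The whole walkthrough can be condensed into two serial stand-ins: a "TCU" that sift-downs from the root while logging its intended writes into the bridge, and an "LCU" that replays them onto the heap array (our own sketch, not the paper's code):

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct Bridge {
    int count = 0;                             // node counter
    std::vector<std::pair<int, int>> pairs;    // (position, value) records
};

// "TCU": sift-down from the root of a max d-heap, recording each write
// into the bridge instead of touching the heap itself.
void tcuUpdateRoot(const std::vector<int>& heap, int d, int newKey, Bridge& b) {
    int i = 0, key = newKey;
    while (true) {
        int first = d * i + 1, best = -1;
        for (int c = first; c < first + d && c < (int)heap.size(); ++c)
            if (best < 0 || heap[c] > heap[best]) best = c;  // find-maxchild
        if (best >= 0 && heap[best] > key) {
            b.pairs.push_back({i, heap[best]});  // max child moves up to i
            b.count++;
            i = best;
        } else {
            b.pairs.push_back({i, key});         // key's final position
            b.count++;
            return;
        }
    }
}

// "LCU": replay the bridge records onto the heap array.
void lcuApply(std::vector<int>& heap, const Bridge& b) {
    for (int k = 0; k < b.count; ++k)
        heap[b.pairs[k].first] = b.pairs[k].second;
}
```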


Largely reduced TCU off-chip cost

Using the ad-heap, TCU off-chip memory access needs hd/w + (2h + 1)/w transactions, instead of the d-heap's h(d/w + 1), where h is the heap height and w is the warp size. For example, given a 7-level 32-heap and w = 32, the d-heap needs 14 off-chip memory transactions while the ad-heap needs only 8.
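A worked check of these counts, assuming ceiling division for partial transactions:

```cpp
#include <cassert>

int ceilDiv(int a, int b) { return (a + b - 1) / b; }

// d-heap: per level, one transaction group for d children plus one
// single-thread write-back.
int dHeapTransactions(int h, int d, int w) {
    return h * (ceilDiv(d, w) + 1);
}

// ad-heap: coalesced child reads over all h levels, plus one dump of
// the (2h + 1)-entry bridge (counter + h position/value pairs).
int adHeapTransactions(int h, int d, int w) {
    return ceilDiv(h * d, w) + ceilDiv(2 * h + 1, w);
}
```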


Fifth Section: Performance Evaluation

ad-heap simulator (before the HSA tools are ready)

The simulator pre-executes a d-heap based workload, counts the numbers of all kinds of operations, then runs the same workload on real CPU-GPU systems. The AMP queueing system is simulated with the DKit C++ Library and the Boost C++ Libraries.


Testbeds

System              Machine 1            Machine 2
CPU                 AMD A6-1450 APU      Intel Core i7-3770
CPU cores           4 cores/1.0 GHz      4 cores/3.4 GHz
GPU                 AMD Radeon HD 8250   nVidia GeForce GTX 680
GPU SIMD units      128 Radeon cores     1536 CUDA cores
ad-heap simulator   C++ and OpenCL       C++ and CUDA

Table: The Machines Used in Our Experiments


Benchmark and Datasets

Benchmark: a heap-based batch k-selection algorithm that finds the kth smallest entry of each of the sub-lists in parallel. One of its applications is batch kNN search over large-scale concurrent queries. We set the sizes of the list sets to 2^25 and 2^28 on the two machines, respectively, the data type to 32-bit integers (randomly generated), each sub-list to the same length l (from 2^11 to 2^21), and k to 0.1l.
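The serial core of k-selection on one sub-list can be sketched with a size-k max-heap whose root is the kth smallest entry seen so far (the paper batches many sub-lists in parallel; this shows the single-list logic only):

```cpp
#include <cassert>
#include <queue>
#include <vector>

// kth smallest entry of one sub-list (k >= 1, list.size() >= k):
// keep the k smallest entries in a max-heap; the root is the answer.
int kthSmallest(const std::vector<int>& list, int k) {
    std::priority_queue<int> heap;            // max-heap
    for (int v : list) {
        if ((int)heap.size() < k)
            heap.push(v);
        else if (v < heap.top()) {            // v displaces current kth
            heap.pop();
            heap.push(v);
        }
    }
    return heap.top();
}
```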


Performance results on Machine 1

Figure: d = 8
Figure: d = 16
Figure: d = 32
Figure: d = 64
Figure: aggregated results


Performance results on Machine 2

Figure: d = 8
Figure: d = 16
Figure: d = 32
Figure: d = 64
Figure: aggregated results


Sixth Section: Conclusion

Conclusion

We proposed ad-heap, a new efficient heap data structure for the AMPs, and obtained up to 1.5x and 3.6x the performance of the optimal scheduling method on the two representative machines, respectively. The performance numbers also show that redesigning data structures and algorithms is necessary for exposing more of the computational power of the AMPs. We look forward to running ad-heap on real HSA programming tools rather than simulators.


Thanks! Questions?
