
ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors

Weifeng Liu and Brian Vinter

Niels Bohr Institute University of Copenhagen Denmark {weifeng, vinter}@nbi.dk

March 1, 2014

Weifeng Liu and Brian Vinter (NBI) ad-heap (GPGPU-7, Salt Lake City) March 1, 2014 1 / 52


First Section: Heap Data Structure Review

Binary heap

Figure: The layout of a binary heap (2-heap) of size 12.

Given a node at storage position i, its parent node is at ⌊(i − 1)/2⌋, its child nodes are at 2i + 1 and 2i + 2.
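As a quick sanity check, the index arithmetic can be written directly (a minimal sketch; the function names are ours):

```cpp
#include <cassert>

// Index arithmetic for a 0-based binary heap (2-heap) stored in an array.
int parent2(int i)     { return (i - 1) / 2; }  // parent of node i
int leftChild2(int i)  { return 2 * i + 1; }    // first child of node i
int rightChild2(int i) { return 2 * i + 2; }    // second child of node i
```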


d-heaps [Johnson, 1975]

Figure: The layout of a 4-heap of size 12.

For node i, its parent node is at ⌊(i − 1)/d⌋, its child nodes begin from di + 1 and end up at di + d.
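The d-ary generalization in code (a sketch; names are ours, with d passed explicitly):

```cpp
#include <cassert>

// Index arithmetic for a 0-based d-heap: node i has parent floor((i-1)/d)
// and children d*i + 1 through d*i + d.
int parentD(int i, int d)     { return (i - 1) / d; }
int firstChildD(int i, int d) { return d * i + 1; }
int lastChildD(int i, int d)  { return d * i + d; }
```

With d = 2 these reduce to the binary-heap formulas on the previous slide.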


Cache-aligned d-heaps [LaMarca and Ladner, 1996]

Figure: The layout of a cache-aligned 4-heap of size 12.

For node i, its parent node is at ⌊(i − 1)/d⌋ + offset, its child nodes begin from di + 1 + offset and end up at di + d + offset, where offset = d − 1 is the padded head size.
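A sketch of the aligned indexing, under our reading that i is the logical node index and the returned values are padded storage positions (so every d-child group starts on a d-aligned boundary):

```cpp
#include <cassert>

// Cache-aligned d-heap: the array is padded with offset = d - 1 head
// slots, so logical node i lives at storage position i + offset.
int alignedParentPos(int i, int d)     { return (i - 1) / d + (d - 1); }
int alignedFirstChildPos(int i, int d) { return d * i + 1 + (d - 1); }
int alignedLastChildPos(int i, int d)  { return d * i + d + (d - 1); }
```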


Operations on the d-heaps

insert adds a new node at the end of the heap, increases the heap size to n + 1, and takes O(log_d n) worst-case time to reconstruct the heap property.

delete-max copies the last node to the position of the root node, decreases the heap size to n − 1, and takes O(d log_d n) worst-case time to reconstruct the heap property.

update-key updates a node, keeps the heap size unchanged, and takes O(d log_d n) worst-case time to reconstruct the heap property.


Update-key operation on the root node (step 0)

Figure: Initial status.


Update-key operation on the root node (step 1)

Figure: Update the value of the root node. The heap property on levels 1 and 2 might then be broken.


Update-key operation on the root node (step 2)

Figure: Find the maximum child node of the updated parent node.


Update-key operation on the root node (step 3)

Figure: Compare, and swap if the max child node is larger than its parent node. The heap property on levels 2 and 3 might then be broken.


Update-key operation on the root node (step 4)

Figure: Find the maximum child node of the updated parent node.


Update-key operation on the root node (step 5)

Figure: Compare, and swap if the max child node is larger than its parent node. Then, with no more child nodes, heap property reconstruction is done.


Update-key operation on the root node (step 6)

Figure: Final status.


Unroll the above update-key operation

Step 1: update the root node
Step 2: find-maxchild
Step 3: compare-and-swap
Step 4: find-maxchild
Step 5: compare-and-swap
Step 6: heap property satisfied, return


Second Section: When Heaps Met GPUs

Running d-heaps on GPUs?


The above update-key operation on GPUs

Given a 32-heap running in a thread-block (or work-group) of 32 threads (or work-items):
Step 1: update the root node
Step 2: find-maxchild (parallel reduction)
Step 3: compare-and-swap
Step 4: find-maxchild (parallel reduction)
Step 5: compare-and-swap
Step 6: heap property satisfied, return


Pros and Cons

Pros – why we want GPUs?
- much faster find-maxchild using parallel reduction
- contiguous child nodes loaded with few memory transactions (coalesced memory access)
- a shallow heap accelerates the insert operation

Cons – why we hate them?
- slow compare-and-swap executed by one single weak thread
- the other threads wait for a long time on that thread's high-latency off-chip memory access


Third Section: Asymmetric Multicore Processors

Emerging Asymmetric Multicore Processors (AMPs)


The block diagram of an AMP used in this work

The chip consists of four major parts:
- a group of Latency Compute Units (LCUs) with caches,
- a group of Throughput Compute Units (TCUs) with shared command processors, scratchpad memory and caches,
- a shared memory management unit, and
- a shared global DRAM.


Heterogeneous System Architecture (HSA): a step forward

Main features in the current HSA design:
- the two types of compute units share a unified memory address space
  - no data transfer through the PCIe link
  - large pageable memory for the TCUs
  - much more efficient LCU-TCU interaction due to coherency
- fast LCU-TCU synchronization mechanism
  - user-mode queueing system
  - shared memory signal object
  - much lighter driver overhead


Leveraging the AMPs?

A direct way is to exploit task, data and pipeline parallelism across the two types of cores. But two questions remain: Can the AMPs expose fine-grained parallelism in fundamental data structure and algorithm design? And can the new designs outperform their conventional counterparts combined with coarse-grained (task, data and pipeline) parallelization?


Fourth Section: ad-heap

ad-heap data structure

We propose ad-heap (asymmetric d-heap), a new heap data structure that obtains performance benefits from both types of cores. The ad-heap introduces a new component – a bridge structure – located in the originally empty head part of the cache-aligned d-heap. The bridge consists of one node counter and one sequence of size 2h, where h is the height of the heap.

Figure: The layout of the ad-heap data structure.
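A sketch of the resulting layout, under our reading that the 2h bridge slots hold up to h (position, value) pairs; all field names are ours, not the paper's:

```cpp
#include <cassert>
#include <vector>

// ad-heap storage: [counter | 2h bridge slots | heap nodes ...]
// The bridge occupies the formerly empty padded head of the d-heap.
struct ADHeapLayout {
    int d, h;
    std::vector<int> storage;

    explicit ADHeapLayout(int d_, int h_)
        : d(d_), h(h_), storage(1 + 2 * h_, 0) {}

    int  bridgeSlots() const { return 2 * h; }    // h (pos, value) pairs
    int& counter()           { return storage[0]; }
    int& bridgePos(int k)    { return storage[1 + 2 * k]; }
    int& bridgeVal(int k)    { return storage[2 + 2 * k]; }
    int  heapStart() const   { return 1 + 2 * h; } // nodes follow the bridge
};
```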


Update-key operation on the root node (step 0)

Figure: Initial status.


Update-key operation on the root node (step 1)

Figure: An LCU updates the value of the root node. The heap property on levels 1 and 2 might then be broken, so we issue an LCU → TCU call.


Update-key operation on the root node (step 2)

Figure: An invoked TCU initializes the bridge in its scratchpad memory and finds the maximum child node of the updated parent node.


Update-key operation on the root node (step 3)

Figure: The TCU compares the nodes; if the max child node is larger than its parent, it updates the node counter and saves the parent node position and the max child node value to the on-chip bridge. The heap property on levels 2 and 3 might then be broken.


Update-key operation on the root node (step 4)

Figure: The TCU finds the maximum child node of the updated parent node.


Update-key operation on the root node (step 5)

Figure: The TCU compares the nodes; if the max child node is larger than its parent, it updates the node counter and saves the parent node position and the max child node value to the bridge.


Update-key operation on the root node (step 6)

Figure: The TCU updates the node counter and saves the child node position and the parent node value to the bridge; since there are no more child nodes, heap property reconstruction is done.


Update-key operation on the root node (step 7)

Figure: The TCU dumps the bridge from the scratchpad memory to the global memory. Then we issue a TCU → LCU call.


Update-key operation on the root node (step 8)

Figure: An invoked LCU reads each key-value pair and saves each value to its final position. Then all entries are up to date.


Update-key operation on the root node (step 9)

Figure: Final status.
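The whole walkthrough can be condensed into two serial stand-ins: a "TCU" that sift-downs from the root while logging its intended writes into the bridge, and an "LCU" that replays them onto the heap array (our own sketch, not the paper's code):

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct Bridge {
    int count = 0;                             // node counter
    std::vector<std::pair<int, int>> pairs;    // (position, value) records
};

// "TCU": sift-down from the root of a max d-heap, recording each write
// into the bridge instead of touching the heap itself.
void tcuUpdateRoot(const std::vector<int>& heap, int d, int newKey, Bridge& b) {
    int i = 0, key = newKey;
    while (true) {
        int first = d * i + 1, best = -1;
        for (int c = first; c < first + d && c < (int)heap.size(); ++c)
            if (best < 0 || heap[c] > heap[best]) best = c;  // find-maxchild
        if (best >= 0 && heap[best] > key) {
            b.pairs.push_back({i, heap[best]});  // max child moves up to i
            b.count++;
            i = best;
        } else {
            b.pairs.push_back({i, key});         // key's final position
            b.count++;
            return;
        }
    }
}

// "LCU": replay the bridge records onto the heap array.
void lcuApply(std::vector<int>& heap, const Bridge& b) {
    for (int k = 0; k < b.count; ++k)
        heap[b.pairs[k].first] = b.pairs[k].second;
}
```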


Largely reduced TCU off-chip cost

Using the ad-heap, TCU off-chip memory access needs hd/w + (2h + 1)/w transactions, instead of the d-heap's h(d/w + 1), where h is the heap height and w is the warp size. For example, given a 7-level 32-heap and w = 32, the d-heap needs 14 off-chip memory transactions while the ad-heap needs only 8.
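A worked check of these counts, assuming ceiling division for partial transactions:

```cpp
#include <cassert>

int ceilDiv(int a, int b) { return (a + b - 1) / b; }

// d-heap: per level, one transaction group for d children plus one
// single-thread write-back.
int dHeapTransactions(int h, int d, int w) {
    return h * (ceilDiv(d, w) + 1);
}

// ad-heap: coalesced child reads over all h levels, plus one dump of
// the (2h + 1)-entry bridge (counter + h position/value pairs).
int adHeapTransactions(int h, int d, int w) {
    return ceilDiv(h * d, w) + ceilDiv(2 * h + 1, w);
}
```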


Fifth Section: Performance Evaluation

ad-heap simulator (before the HSA tools are ready)

The simulator pre-executes a d-heap based workload, counts the numbers of all kinds of operations, then runs the same workload on real CPU-GPU systems. The AMP queueing system is simulated with the DKit C++ Library and the Boost C++ Libraries.


Testbeds

System              Machine 1            Machine 2
CPU                 AMD A6-1450 APU      Intel Core i7-3770
CPU cores           4 cores/1.0 GHz      4 cores/3.4 GHz
GPU                 AMD Radeon HD 8250   nVidia GeForce GTX 680
GPU SIMD units      128 Radeon cores     1536 CUDA cores
ad-heap simulator   C++ and OpenCL       C++ and CUDA

Table: The Machines Used in Our Experiments


Benchmark and Datasets

Benchmark: a heap-based batch k-selection algorithm that finds the kth smallest entry of each of the sub-lists in parallel. One of its applications is batch kNN search over large-scale concurrent queries. We set the sizes of the list sets to 2^25 and 2^28 on the two machines, respectively, the data type to 32-bit integers (randomly generated), each sub-list to the same length l (from 2^11 to 2^21), and k to 0.1l.
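The serial core of k-selection on one sub-list can be sketched with a size-k max-heap whose root is the kth smallest entry seen so far (the paper batches many sub-lists in parallel; this shows the single-list logic only):

```cpp
#include <cassert>
#include <queue>
#include <vector>

// kth smallest entry of one sub-list (k >= 1, list.size() >= k):
// keep the k smallest entries in a max-heap; the root is the answer.
int kthSmallest(const std::vector<int>& list, int k) {
    std::priority_queue<int> heap;            // max-heap
    for (int v : list) {
        if ((int)heap.size() < k)
            heap.push(v);
        else if (v < heap.top()) {            // v displaces current kth
            heap.pop();
            heap.push(v);
        }
    }
    return heap.top();
}
```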


Performance results on Machine 1

Figure: d = 8
Figure: d = 16
Figure: d = 32
Figure: d = 64
Figure: aggregated results


Performance results on Machine 2

Figure: d = 8
Figure: d = 16
Figure: d = 32
Figure: d = 64
Figure: aggregated results


Sixth Section: Conclusion

Conclusion

We proposed ad-heap, a new efficient heap data structure for the AMPs, and obtained up to 1.5x and 3.6x the performance of the optimal scheduling method on the two representative machines, respectively. The performance numbers also show that redesigning data structures and algorithms is necessary for exposing more of the computational power of the AMPs. We look forward to running ad-heap on real HSA programming tools rather than simulators.


Thanks! Questions?
