Design and Evaluation of Scalable Concurrent Queues for Many-Core Architectures
ICPE 2015, February 2nd, 2015
Thomas R. W. Scogland, Wu-chun Feng (LLNL-PRES-666776)


SLIDE 1

Design and Evaluation of Scalable Concurrent Queues for Many-Core Architectures

ICPE 2015, February 2nd, 2015
Thomas R. W. Scogland, Wu-chun Feng

LLNL-PRES-666776. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

SLIDE 2

Why another concurrent queue?

SLIDE 3

Heterogeneity and many-core are a fact of life in modern computing

SLIDE 4

Everything from cell phones…

[Photo: Zach Vega (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons]

SLIDE 5

…to supercomputers

[Photo courtesy of Oak Ridge National Laboratory, U.S. Dept. of Energy]

SLIDE 6

Why not existing lock-free queues?

  • Traditional lock-free queues focus on progress over throughput
  • Perfect for over-subscribed systems, but they do not scale

[Plots: operations per millisecond vs. number of independent threads, on four AMD Opteron 6134 CPUs and on one NVIDIA K20c GPU]

SLIDE 7

Outline

  • Definitions and abstractions
  • Building blocks: Evaluating atomic operations
  • Queue types and modeling
  • Our queue design
  • Performance evaluation
  • Conclusions

SLIDE 8

Definitions: What is a “thread”?

  • Work-item: The basic unit of work in OpenCL
    — Groups of work-items execute in lock-step
    — Work-items are not threads
  • Thread: An independently schedulable entity
    — An OS thread on CPUs
    — In OpenCL, defined as a group of work-items of size “PREFERRED_WORK_GROUP_SIZE_MULTIPLE”

SLIDE 9

Abstractions

  • All operations are defined in terms of atomics
  • On CPU:
    — Add: atomic fetch-and-add (FAA)
    — Read: normal load
    — Write: normal store
    — CAS: atomic compare-and-swap
  • In OpenCL:
    — Add: atomic fetch-and-add (FAA)
    — Read: atomic fetch-and-add of 0, or atomic_read, or a regular read after a flush if available
    — Write: atomic exchange
    — CAS: atomic compare-and-swap
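The CPU-side mapping is close to a direct rename of the C11 `<stdatomic.h>` primitives. A minimal sketch of such an abstraction layer (the `q_*` names are ours, not from the talk):

```c
#include <stdatomic.h>

/* Hypothetical names: one possible CPU mapping of the four primitives. */
static inline unsigned q_add(atomic_uint *p, unsigned v) {
    return atomic_fetch_add(p, v);            /* Add: fetch-and-add (FAA) */
}
static inline unsigned q_read(atomic_uint *p) {
    return atomic_load(p);                    /* Read: normal (atomic) load */
}
static inline void q_write(atomic_uint *p, unsigned v) {
    atomic_store(p, v);                       /* Write: normal (atomic) store */
}
static inline int q_cas(atomic_uint *p, unsigned expected, unsigned desired) {
    /* CAS: compare-and-swap; nonzero on success */
    return atomic_compare_exchange_strong(p, &expected, desired);
}
```

On OpenCL the same four names would instead map onto `atomic_add`, `atomic_add(p, 0)` (or `atomic_read`), `atomic_xchg`, and `atomic_cmpxchg`, as the slide lists.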

SLIDE 10

Experimental setup: Hardware: CPUs

Device             | Num. devices | Cores/device | Threads/core | Max. threads | Max. achieved
AMD Opteron 6134   | 4 | 8  | 1 | 32 | 32
AMD Opteron 6272   | 2 | 16 | 1 | 32 | 32
Intel Xeon E5405   | 2 | 4  | 1 | 8  | 8
Intel Xeon X5680   | 1 | 12 | 2 | 24 | 24
Intel Core i5-3400 | 1 | 4  | 1 | 4  | 4

SLIDE 11

Experimental setup: Hardware: GPUs/Co-processors

Device               | Num. devices  | Cores/device | Threads/core | Max. threads | Max. achieved
AMD HD5870           | 1             | 20 | 24 | 496  | 140
AMD HD7970           | 1             | 32 | 40 | 1280 | 386
AMD HD7990           | 1 (of 2 dies) | 32 | 40 | 1280 | 1020
Intel Xeon Phi P1750 | 1             | 61 | 4  | 244  | 244
NVIDIA GTX 280       | 1             | 30 | 32 | 960  | 960
NVIDIA Tesla C2070   | 1             | 14 | 32 | 448  | 448
NVIDIA Tesla K20c    | 1             | 13 | 64 | 832  | 832

SLIDE 12

Experimental setup: Software

  • Debian Wheezy Linux, 64-bit, kernel version 3.2
  • NVIDIA driver v. 313.3 with CUDA SDK 5.0
  • AMD fglrx driver v. 9.1.11 and APP SDK v. 2.8
  • Intel Xeon Phi driver MPSS gold 3
  • CPU and Phi OpenMP use Intel ICC v. 13.0.1

SLIDE 13

Experimental setup: Detecting the real number of threads

void test(unsigned *num_threads, unsigned *present) {
    if (atomic_read(num_threads) != 0)  // check if the kernel is complete
        return;
    atomic_fetch_and_add(present, 1);   // count this thread in; the old value is its TID
    run_benchmark();
    // the first thread to finish marks the kernel complete, publishing the count
    atomic_compare_and_swap(num_threads, 0, atomic_read(present));
}
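The same detection idiom translates to plain C11 atomics. A sketch, assuming each launched thread calls `detect()` exactly once (`detect` is our name; `run_benchmark` is elided):

```c
#include <stdatomic.h>

/* If an earlier wave of threads has already published its count,
 * late-starting threads bail out immediately, so num_threads ends up
 * holding only the number of threads that truly ran concurrently. */
static void detect(atomic_uint *num_threads, atomic_uint *present) {
    if (atomic_load(num_threads) != 0)      /* kernel already complete */
        return;
    atomic_fetch_add(present, 1);           /* count myself in; old value = my TID */
    /* run_benchmark() would go here */
    unsigned expected = 0;
    /* first finisher publishes the number of threads that were present */
    atomic_compare_exchange_strong(num_threads, &expected,
                                   atomic_load(present));
}
```

Once `num_threads` is nonzero, every later call returns immediately, which is how the benchmark measures the threads the device can actually run at once rather than the number launched.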

SLIDE 14

Outline

  • Definitions, abstractions and experimental setup
  • Building blocks: Evaluating atomic operations
  • Queue types and modeling
  • Our queue design
  • Performance evaluation
  • Conclusions

SLIDE 15

Atomic performance test

kernel void cas_test(__global unsigned *in, __global unsigned *out,
                     unsigned iterations) {
    const unsigned tid = (get_local_id(1) * get_local_size(0)) + get_local_id(0);
    const unsigned gid = (get_group_id(1) * get_local_size(0)) + get_group_id(0);
    unsigned my_success = 0;

    if (tid == 0) {
        unsigned prev = 0;
        for (size_t i = 0; i < iterations; ++i) {
            prev = atomic_add(in, 0);  // atomic read
            my_success += (atomic_cmpxchg(in, prev, prev + 1) == prev) ? 1 : 0;
        }
        out[gid] = my_success;
    }
}

SLIDE 16

Atomic operation performance

[Plots: throughput in million operations per second vs. number of independent threads for FAA, READ, WRITE, XCHG, attempted CAS, and successful CAS, on an AMD HD7970, an Intel Xeon Phi, an NVIDIA Tesla K20c, and two AMD Opteron 6272s; for more architectures, see the paper]

  • The successful-CAS rate decreases with the number of threads
  • The other atomic operations scale up with the thread count
  • Example: CAS at 1,478 ops/ms vs. FAA at 859,524 ops/ms; FAA is 581 times faster

SLIDE 17

Outline

  • Definitions and abstractions
  • Building blocks: Evaluating atomic operations
  • Queue types and modeling
  • Our queue design
  • Performance evaluation
  • Conclusions

SLIDE 18

General modeling of queues

  • All concurrent queues require either:
    — Locks, or
    — Atomic operations
  • Model result: Throughput (T) for a given number of threads (t)
  • Terms, average latency of constituent atomics at t threads:
    — Add (FAA): a
    — Read: r
    — Write: w
    — Successful contended CAS: c
    — Attempted CAS: C

SLIDE 19

Queue types

  • Contended CAS (MS queue and TZ queue):
        T_t = 2 / ((r_t·2 + c_t) + (r_t + w_t + c_t))
  • Un-contended CAS (LCRQ):
        T_t = 1 / (a_t + r_t + C_t)
  • Combining (FC queue; a single worker thread, so single-thread latencies):
        T_t = 2 / ((r_1 + w_1·2) + (r_1·2 + w_1))
  • FAA or blocking array (CB queue and our queue):
        T_t = 2 / ((a_t + r_t + w_t) + (a_t + w_t·2))

In each model the numerator is the number of operations completed and the denominator sums the atomic latencies of an enqueue and a dequeue at t threads.
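Reading the four models off this slide, each throughput is an operation count divided by the summed per-operation latencies. A sketch that evaluates them for arbitrary latencies (the function names and any inputs are ours; units are whatever the latencies are measured in):

```c
/* Modeled throughput for the four queue classes.  r, w, c, C, a are the
 * average latencies of read, write, successful contended CAS, attempted
 * CAS, and FAA at the current thread count; r1/w1 are single-thread
 * latencies, since the combining queue serializes on one worker. */
static double t_contended_cas(double r, double w, double c) {
    return 2.0 / ((r * 2 + c) + (r + w + c));
}
static double t_uncontended_cas(double a, double r, double C) {
    return 1.0 / (a + r + C);
}
static double t_combining(double r1, double w1) {
    return 2.0 / ((r1 + w1 * 2) + (r1 * 2 + w1));
}
static double t_faa(double a, double r, double w) {
    return 2.0 / ((a + r + w) + (a + w * 2));
}
```

Feeding in measured latencies for each thread count produces the modeled curves on the next slide: only the contended-CAS model has a term (c) that grows sharply with contention, so only it degrades as threads are added.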

SLIDE 20

Modeled queue throughput

[Plots: modeled throughput in million operations per second vs. number of concurrent threads for the Combining, Contended-CAS, Un-contended-CAS, and FAA queues, on two AMD Opteron 6272s, an Intel Xeon Phi, an NVIDIA K20c, and an AMD HD7970; for more architectures, see the paper]

  • Combining-queue performance is independent of thread count
  • Contended-CAS queue performance degrades as threads increase
  • Un-contended-CAS and FAA queues scale with additional threads

SLIDE 21

Outline

  • Definitions and abstractions
  • Building blocks: Evaluating atomic operations
  • Queue types and modeling
  • Our queue design
  • Performance evaluation
  • Conclusions

SLIDE 22

Our queue design: Goals

  • Scale well on many-core architectures
    — Avoid contended CAS!
  • Maintain linearizability and FIFO ordering
  • Allow the status of the queue to be inspected

SLIDE 23

Our queue design: Solution, divide the interfaces

  • Blocking interface: The fast, concurrent interface
    — enqueue(q, data) -> success or closed
    — dequeue(q, &data) -> success or closed
  • Non-waiting interface:
    — enqueue_nw(q, data) -> success, not_ready or closed
    — dequeue_nw(q, &data) -> success, not_ready or closed
  • Status inspection interface:
    — distance(q) -> the distance between head and tail, corrected for rollover
    — waiting_enqueuers(q) -> number of enqueuers blocking
    — waiting_dequeuers(q) -> number of dequeuers blocking
    — is_full(q) -> true if full, else false
    — is_empty(q) -> true if empty, else false
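In C, the divided interface could be declared roughly as follows. This is a hypothetical sketch; the paper's actual types and signatures may differ:

```c
/* Hypothetical status codes covering the three documented outcomes. */
typedef enum { Q_SUCCESS, Q_NOT_READY, Q_CLOSED } q_status;

typedef struct queue queue;   /* opaque; see the paper for the layout */

/* Blocking interface: the fast, concurrent path. */
q_status enqueue(queue *q, unsigned data);
q_status dequeue(queue *q, unsigned *data);

/* Non-waiting interface: may return Q_NOT_READY instead of blocking. */
q_status enqueue_nw(queue *q, unsigned data);
q_status dequeue_nw(queue *q, unsigned *data);

/* Status inspection. */
long distance(queue *q);             /* head-to-tail distance, rollover-corrected */
unsigned waiting_enqueuers(queue *q);
unsigned waiting_dequeuers(queue *q);
int is_full(queue *q);
int is_empty(queue *q);
```

The split lets callers that cannot afford to block (e.g. GPU work-groups polling for work) use the `_nw` variants, while throughput-critical paths stay on the blocking interface.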

SLIDE 24

Our queue’s blocking behavior: Enqueue example: Get targets with FAA

[Diagram: head and tail indices into a slot array and a value array; tail at 3, three enqueuing threads about to claim targets]

SLIDE 25

Our queue’s blocking behavior: Enqueue example: Get targets with FAA

[Diagram: threads 1–3 each issue an FAA on the tail]

SLIDE 26

Our queue’s blocking behavior: Enqueue example: Get targets with FAA

[Diagram: tail now at 6; threads 1–3 hold targets 4, 5, and 6]

SLIDE 27

Our queue’s blocking behavior: Enqueue example: Write values

[Diagram: threads 1–3 prepare to write into their claimed entries of the value array]

SLIDE 28

Our queue’s blocking behavior: Enqueue example: Write values

[Diagram: the writes proceed in parallel]

SLIDE 29

Our queue’s blocking behavior: Enqueue example: Write values

[Diagram: values 4, 5, and 6 now written into the value array]

SLIDE 30

Our queue’s blocking behavior: Enqueue example: Update slots

[Diagram: threads 1–3 prepare to mark their slots in the slot array]

SLIDE 31

Our queue’s blocking behavior: Enqueue example: Update slots

[Diagram: the slot updates proceed in parallel]

SLIDE 32

Our queue’s blocking behavior: Enqueue example: Update slots

[Diagram: all three slot flags set; the new entries are visible to dequeuers]

SLIDE 33

Our queue’s blocking behavior: Enqueue example: Update slots

[Diagram: final state, with all three enqueues published]

Safe and parallel! For complete implementation details, see the paper.
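The three phases walked through above (claim a target with FAA, write the value, publish through the slot array) can be sketched as a simplified C11 ring buffer. This illustrates only the FAA pattern, not the paper's implementation: it spin-waits instead of blocking properly and omits closing, the non-waiting interface, and status inspection (all names are ours):

```c
#include <stdatomic.h>

#define QSIZE 8  /* capacity; hypothetical */

typedef struct {
    atomic_uint tail;          /* next enqueue index, claimed via FAA */
    atomic_uint head;          /* next dequeue index, claimed via FAA */
    atomic_uint slot[QSIZE];   /* 0 = slot empty, 1 = value published */
    unsigned    value[QSIZE];
} faa_queue;

static void faa_enqueue(faa_queue *q, unsigned v) {
    unsigned idx = atomic_fetch_add(&q->tail, 1) % QSIZE;  /* step 1: claim target */
    while (atomic_load(&q->slot[idx]) != 0)                /* wait until slot is free */
        ;
    q->value[idx] = v;                                     /* step 2: write value */
    atomic_store(&q->slot[idx], 1);                        /* step 3: publish slot */
}

static unsigned faa_dequeue(faa_queue *q) {
    unsigned idx = atomic_fetch_add(&q->head, 1) % QSIZE;  /* claim a target */
    while (atomic_load(&q->slot[idx]) != 1)                /* wait for a published value */
        ;
    unsigned v = q->value[idx];
    atomic_store(&q->slot[idx], 0);                        /* hand the slot back */
    return v;
}
```

Because each thread gets a distinct target from the FAA, the value writes and slot updates never contend with each other, which is exactly why this pattern scales where contended CAS does not.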

SLIDE 34

Outline

  • Definitions and abstractions
  • Building blocks: Evaluating atomic operations
  • Queue types and modeling
  • Our queue design
  • Performance evaluation
  • Conclusions

SLIDE 35

Evaluation: Queues under consideration

  • Michael & Scott (MS) queue: Contended CAS
    — Storage: unbounded linked list
    — Progress guarantee: lock-free
    — Coherence mechanism: CAS on head and tail
  • Tsigas & Zhang (TZ) queue: Contended CAS
    — Storage: bounded array
    — Progress guarantee: lock-free
    — Coherence mechanism: CAS on head and tail
  • Flat-combining (FC) queue: Combining
    — Storage: unbounded linked list
    — Progress guarantee: lock-free, *blocking*
    — Coherence mechanism: serialization, a single worker thread at a time
  • Linked Concurrent Ring Queue (LCRQ): Un-contended CAS
    — Storage: unbounded linked list of blocking array-based queues
    — Progress guarantee: lock-free
    — Coherence mechanism: double-wide CAS (precludes implementation on AMD GPUs)

SLIDE 36

Evaluation: Test loops

  • Matching enqueue/dequeue, all threads:
    — Dequeue a value
    — Work on the value for 100 iterations
    — Enqueue the new value
    — Work out-of-band for 100 iterations
  • Producer/consumer:
    — 25% of all threads: enqueue a value, then work for 100 iterations
    — The remaining 75%: dequeue a value, then work for 100 iterations
  • All tests run for 5 seconds and are self-stopped on the device

SLIDE 37

Evaluation: CPU performance

[Plots: operations in millions per second vs. number of independent threads for the Flat-Combining queue, LCRQ-32bit, the Michael and Scott queue, the Tsigas and Zhang queue, and our new queue in three variants (blocking Enq&Deq; non-waiting Deq with blocking Enq; non-waiting Enq&Deq), under matching and producer/consumer tests on an Intel Xeon X5680, four AMD Opteron 6134s, and two Intel Xeon E5405s]

SLIDE 38

Evaluation: CPU performance: Oversubscribing

[Plots: operations in millions per second vs. number of independent threads for the same queues (Flat-Combining, LCRQ-32bit, Michael and Scott, Tsigas and Zhang, and the three variants of our new queue), under matching and producer/consumer tests with the thread count oversubscribed]

SLIDE 39

Evaluation: Accelerator performance: Current generation: Matching benchmark

[Plots: operations in millions per second (log2) vs. number of independent threads for the same queues on an AMD HD7990, an NVIDIA Tesla K20c, and an Intel Xeon Phi]

  • LCRQ lags behind by only 17%
  • 1,408 times speedup from the MS queue to the blocking queue

SLIDE 40

Evaluation: Accelerator performance: Current generation: Producer/Consumer

[Plots: operations in millions per second (log2) vs. number of independent threads for the same queues on an Intel Xeon Phi, an AMD HD7990, and an NVIDIA Tesla K20c]

  • LCRQ drops to 2,700 ops/second
  • 33,043.91 times more operations with blocking than with LCRQ in this case

SLIDE 41

Conclusions

  • Designing concurrent data structures for throughput is important on modern architectures
  • CAS can be dangerous with enough threads
  • Our queue shows between a 1.5x and 1000x speedup over the state of the practice for many-core architectures
  • Allowing blocking can be beneficial!