KEY-VALUE RATE KEYS-ONLY RATE DEVICE (10 6 pairs / - PowerPoint PPT Presentation

– All threads run the same program (kernel)
  • SIMD + SMT
– Explicit control over memory storage hierarchy
  • Registers, fast local shared per core, global DRAM
– Excels at:
  • Flat data-parallelism (i.e., data-independent and statically-known data dependences)
– Needs work:
  • Dynamic, irregular, and nested parallelism

                            KEY-VALUE RATE (10^6 pairs / sec)     KEYS-ONLY RATE (10^6 keys / sec)
DEVICE Name                 CUDPP Radix   Our SRTS Radix          CUDPP Radix   Our SRTS Radix
                                          (speedup)                             (speedup)
NVIDIA GTX 480              -             775                     -             1005
NVIDIA Tesla C2050          -             581                     -             742
NVIDIA GTX 285              134           490 (3.7x)              199           615 (2.8x)
NVIDIA GTX 280              117           449 (3.8x)              184           534 (2.6x)
NVIDIA Tesla C1060          111           333 (3.0x)              176           524 (2.7x)
NVIDIA 9800 GTX+            82            189 (2.0x)              111           265 (2.0x)
NVIDIA 8800 GT              63            129 (2.1x)              83            171 (2.1x)
NVIDIA Quadro FX5600        55            110 (2.0x)              66            147 (2.2x)
Single reported rate for the Intel devices:
Intel Knight's Ferry MIC 32-core**    560
Intel Core i7 quad-core**             240
Intel Core-2 quad-core**              138
**Satish et al., "Fast Sort on CPUs, GPUs and Intel MIC Architectures," Tech Report 2010.


[Figure: Sorting rate (10^6 keys/sec, 0 to 1100) vs. problem size (millions, 0 to 272). Series: GTX 480, C2050 (no ECC), GTX 285, C2050 (ECC), GTX 280, C1060, 9800 GTX+]

– Design patterns and idioms for program composition
– Burdens these techniques place upon the programming model / toolkit

[Diagram: Input, four Threads, Output]
– Each output has a dependence upon a single input element
  • Threads are decomposed by output element
  • Input and output indices are static functions of thread-id
– E.g., scalar operations
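A minimal sketch of this one-to-one pattern, written as sequential Python in which each loop iteration stands in for one GPU thread. The names `scale_kernel` and `launch` are hypothetical and chosen only for illustration; they do not come from the slides.

```python
def scale_kernel(thread_id, inp, out, factor):
    """Body of one logical thread: read one input, write one output."""
    # Both the input index and the output index are static functions
    # of thread_id (here, the identity function).
    out[thread_id] = factor * inp[thread_id]


def launch(kernel, num_threads, *args):
    """Sequential stand-in for a GPU launch: run the kernel once per thread id."""
    for thread_id in range(num_threads):
        kernel(thread_id, *args)


inp = [1.0, 2.0, 3.0, 4.0]
out = [0.0] * len(inp)
launch(scale_kernel, len(inp), inp, out, 2.0)
```

Because no thread reads or writes another thread's element, the iterations could run in any order, which is what makes this pattern a good fit for flat data-parallelism.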

[Diagram: Input, four Threads, Output]
– Each output has dependences upon a bounded subset of the input
  • Threads are decomposed by output element
  • The output (and at least one input) index is a static function of thread-id
– E.g., matrix / vector multiply
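The matrix / vector multiply case can be sketched the same way: one logical thread per output element, each reading a bounded, statically known slice of the input. This is sequential Python standing in for a kernel, and `matvec_kernel` is a hypothetical name.

```python
def matvec_kernel(thread_id, matrix, vector, out):
    """One logical thread computes one element of the product matrix * vector."""
    # The output index and the matrix row are static functions of thread_id.
    row = matrix[thread_id]
    acc = 0.0
    for col, coefficient in enumerate(row):
        acc += coefficient * vector[col]
    out[thread_id] = acc


matrix = [[1.0, 2.0],
          [3.0, 4.0],
          [5.0, 6.0]]
vector = [10.0, 1.0]
out = [0.0] * len(matrix)
for thread_id in range(len(matrix)):  # stand-in for a parallel launch
    matvec_kernel(thread_id, matrix, vector, out)
```

Each thread touches one full row plus the whole vector, so the input subset is larger than in the one-to-one case but its size is still known before launch.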

[Diagram: Input, Output]
– Each output element has dependences upon any / all input elements
– E.g., sorting, reduction, compaction, duplicate removal, histogram generation, etc.

[Diagram: rows of Threads, one row per pass]
– (c) globally-dependent transformations must be constructed from multiple passes of neighborhood transformations
– Threads are decomposed by output element
– Repeatedly iterate over recycled input streams
– Output stream size is statically known before each pass

[Diagram: tree of pairwise additions]
– O(n) global work from passes of pairwise-neighbor-reduction
– Static dependences, uniform output
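A small sketch of reduction by repeated pairwise-neighbor passes, assuming a non-empty input. It is sequential Python with one loop iteration per logical thread, and `reduce_by_passes` is a hypothetical name. The function also counts additions so that the O(n) work bound can be checked directly.

```python
def reduce_by_passes(values):
    """Sum a non-empty sequence using repeated pairwise-neighbor passes.

    Returns (total, number_of_passes, number_of_additions).
    """
    stream = list(values)
    passes = 0
    additions = 0
    while len(stream) > 1:
        # The output size of this pass is known before the pass starts.
        out_size = (len(stream) + 1) // 2
        out = [0] * out_size
        for thread_id in range(out_size):  # one logical thread per output
            left = stream[2 * thread_id]
            if 2 * thread_id + 1 < len(stream):
                out[thread_id] = left + stream[2 * thread_id + 1]
                additions += 1
            else:
                out[thread_id] = left  # odd element is carried forward
        stream = out  # recycle the output as the next pass's input
        passes += 1
    return stream[0], passes, additions


total, passes, additions = reduce_by_passes(range(1, 9))
```

For 8 inputs this takes 3 passes and 7 additions: the pass count grows as log2(n) while the total number of additions stays at n - 1.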

Sorting:
– Repeated, deterministic pairwise compare-smem
  • Bubble sort is O(n^2)
  • Bitonic sort is O(n log^2 n)
  • Want O(n log n) comparison or O(kn) radix sorting
– Need partitioning: dynamic, cooperative allocation
Breadth-first search:
– Repeatedly check each vertex or edge
  • Such breadth-first search is O(V^2)
  • Want O(V + E) BFS
– Need queue: dynamic, cooperative allocation
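To illustrate why the O(V + E) breadth-first search needs a queue, here is a sequential sketch of frontier-based BFS. The `append` on `next_frontier` is a stand-in for the dynamic, cooperative allocation that parallel threads would have to perform. `bfs_levels` and the adjacency-list layout are assumptions made for this example.

```python
def bfs_levels(adjacency, source):
    """Return the BFS distance of every vertex from `source` (-1 if unreachable).

    `adjacency[v]` is the list of neighbors of vertex v.
    """
    dist = [-1] * len(adjacency)
    dist[source] = 0
    frontier = [source]
    while frontier:
        next_frontier = []
        for vertex in frontier:  # one logical thread per frontier vertex
            for neighbor in adjacency[vertex]:
                if dist[neighbor] == -1:
                    dist[neighbor] = dist[vertex] + 1
                    # Variable output per thread: each vertex contributes
                    # zero or more entries to the next queue.
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return dist


graph = [[1, 2], [3], [3], [4], [], [0]]
distances = bfs_levels(graph, 0)
```

Each vertex enters a frontier at most once and each edge is inspected once, which gives the O(V + E) bound. The quadratic variant instead rechecks every vertex or edge on every pass.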

[Diagram: threads emitting varying numbers of output items]
– Variable output per thread
– Need dynamic, cooperative allocation


1. Work-optimal implementations for problems with dynamic dependences...
2. ...that fit the machine model well
– Input-centric decomposition
  • Input indices are a static function of thread-id, but output indices are completely dynamic
– A generalized allocation problem
  • "I may write zero or more output items, and I need to cooperate with everyone to figure out where they go"
– Need efficient means of reservation/allocation
  • Parallel prefix scan (and relaxations / generalizations)
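The allocation idea can be sketched in three steps: each logical thread decides how many items it will write, an exclusive prefix sum over those counts reserves a private output range for each thread, and every thread then scatters into its range. This is sequential Python under those assumptions; `expand` and `produce` are hypothetical names, and `itertools.accumulate` stands in for a parallel scan.

```python
from itertools import accumulate


def expand(inputs, produce):
    """Write a variable number of outputs per input into one dense array.

    `produce(x)` returns the (possibly empty) list of items for input x.
    Returns (output, offsets).
    """
    # Step 1: every logical thread determines its own allocation requirement.
    per_thread = [produce(x) for x in inputs]
    counts = [len(items) for items in per_thread]

    # Step 2: exclusive prefix sum turns counts into starting offsets.
    offsets = list(accumulate(counts, initial=0))[:-1]

    # Step 3: each thread scatters into its reserved range; ranges never overlap.
    out = [None] * sum(counts)
    for thread_id, items in enumerate(per_thread):
        for k, item in enumerate(items):
            out[offsets[thread_id] + k] = item
    return out, offsets


letters, offsets = expand(["ab", "", "cde"], list)
```

Input indices stay a static function of the thread id, while the output positions are known only after the scan, which matches the input-centric decomposition described above.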

Input (& allocation requirement):   2 1 0 3 2
Result of prefix scan (sum):        0 2 3 3 6
Output:                             0 1 2 3 4 5 6 7
– O(n) work
– For allocation: use scan results as a scattering vector
– Origins in adder circuitry, popularized as a parallel primitive by Blelloch et al. in the '90s
– Merrill et al. Parallel Scan for Stream Architectures. Technical Report CS2009-14, Department of Computer Science, University of Virginia. 2009
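One common way to compute an exclusive prefix sum in O(n) work is the up-sweep / down-sweep formulation. The sketch below runs it sequentially and pads the input to a power of two. It is offered as an illustration of the primitive, not as the scan used in the implementation the slides describe, and `exclusive_scan` is a hypothetical name.

```python
def exclusive_scan(values):
    """Exclusive prefix sum using an up-sweep followed by a down-sweep."""
    n = len(values)
    if n == 0:
        return []
    size = 1
    while size < n:
        size *= 2
    tree = list(values) + [0] * (size - n)  # pad to a power of two

    # Up-sweep: build partial sums in place, doubling the stride each pass.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):
            tree[i] += tree[i - stride]
        stride *= 2

    # Down-sweep: clear the root, then push prefixes back down the tree.
    tree[size - 1] = 0
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left = tree[i - stride]
            tree[i - stride] = tree[i]
            tree[i] += left
        stride //= 2
    return tree[:n]


offsets = exclusive_scan([2, 1, 0, 3, 2])
```

With the allocation requirements 2 1 0 3 2 from the example, the result is 0 2 3 3 6: the third input reserves nothing, and the last input writes into output slots 6 and 7.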


Position:                 0     1     2     3     4     5     6     7
Key sequence:             1110  0011  1010  0111  1100  1000  0101  0001   (current digit: least-significant bit)
Flag vectors:
  0s:                     1     0     1     0     1     1     0     0
  1s:                     0     1     0     1     0     0     1     1
Compacted flag vectors (relocation offsets):
  0s:                     0     1     1     2     2     3     4     4
  1s:                     4     4     5     5     6     6     6     7
Output key sequence:      1110  1010  1100  1000  0011  0111  0101  0001
– 0/1-flag each key as having a digit of 0, 1, 2, 3, etc.
– Scan flag vectors for radix r digits
– Relocate keys into bins for each digit
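The three steps above can be sketched for 1-bit digits as follows. The code is sequential Python, `split_by_bit` and `radix_sort` are hypothetical names, and `itertools.accumulate` stands in for the parallel scan. The flag vectors for digit 0 and digit 1 are concatenated and scanned as one sequence, as in the example.

```python
from itertools import accumulate


def split_by_bit(keys, bit):
    """Stable partition of `keys` by one bit: all 0-digit keys, then all 1-digit keys.

    Returns (output, flags, offsets) so the intermediate vectors can be inspected.
    """
    n = len(keys)
    digits = [(key >> bit) & 1 for key in keys]
    # One 0/1 flag vector per digit value, laid out back to back.
    flags = [1 - d for d in digits] + digits
    # Exclusive scan over the concatenated flags gives relocation offsets.
    offsets = list(accumulate(flags, initial=0))[:-1]
    out = [None] * n
    for thread_id, (key, digit) in enumerate(zip(keys, digits)):
        out[offsets[digit * n + thread_id]] = key
    return out, flags, offsets


def radix_sort(keys, num_bits):
    """Least-significant-digit radix sort built from repeated stable splits."""
    for bit in range(num_bits):
        keys = split_by_bit(keys, bit)[0]
    return keys


keys = [0b1110, 0b0011, 0b1010, 0b0111, 0b1100, 0b1000, 0b0101, 0b0001]
out, flags, offsets = split_by_bit(keys, 0)
```

Because each split is stable, applying it once per bit from least to most significant leaves the keys fully sorted, which is the O(kn) radix sorting cost for k-bit keys.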

[Diagram: un-fused vs. fused pipelines between the host program and GPU global device memory. Un-fused stages: determine allocation size, CUDPP scan, distribute output. Fused stages: determine allocation, scan, distribute output.]
1. Propagate live data between orthogonal steps in fast registers / smem
2. Use scan (or variant) as a "runtime" for everything.
3. Heavy SMT (over-threading) yields usable "bubbles" of free computation
