BY THEIR FRUITS SHALL YE KNOW THEM: A DATA ANALYST’S PERSPECTIVE ON MASSIVELY PARALLEL SYSTEM DESIGN (PowerPoint presentation transcript)

Holger Pirk, Sam Madden, Mike Stonebraker


SLIDE 1

BY THEIR FRUITS SHALL YE KNOW THEM

A DATA ANALYST’S PERSPECTIVE ON MASSIVELY PARALLEL SYSTEM DESIGN

Holger Pirk, Sam Madden, Mike Stonebraker

SLIDE 2

SLIDE 3

A CRUCIAL DISTINCTION

SLIDE 4

INSPIRATION

SLIDE 5

MY PLEDGE OF LOYALTY

SLIDE 6

SCIENTIFIC RATIONALE

SLIDE 7

[Chart: processed instructions per second (1T, 1P, 1E) vs. processed bytes per instruction (1, 10, 100); a diagonal marks 500 GB/s of processing]

GENE AMDAHL TAUGHT US THAT SYSTEMS NEED TO BE BALANCED

SLIDE 8

[Chart: processed instructions per second vs. processed bytes per instruction, now with AMD and Nvidia plotted against the 500 GB/s processing diagonal]

NVIDIA AND AMD PROCESS MANY SMALL DATA WORDS

SLIDE 9

SIMT

[Diagram: SIMT architecture; an instruction scheduler dispatching to SIMT cores backed by memory]

SLIDE 10

[Chart: processed instructions per second vs. processed bytes per instruction; AMD, Nvidia and Intel plotted against the 500 GB/s processing diagonal]

INTEL PROCESSES FEWER, LARGER DATA WORDS

SLIDE 11

MANY-CORE SIMD

[Diagram: many 512-bit SIMD cores (Pentium-derived) attached to memory]

SLIDE 12

[Diagram: SIMD cores plus a scatter/gather unit attached to memory]

SIMD WITH SCATTER/GATHER

SLIDE 13

[Chart: processed instructions per second vs. processed bytes per instruction; AMD, Nvidia and Intel all sit well above the 500 GB/s processing diagonal]

ALL OF THEM CAN PROCESS WAY MORE DATA THAN THEY CAN LOAD

SLIDE 14

SPEC BANDWIDTH-WISE, PHI OUTPERFORMS CURRENT GPUS

[Bar chart: spec memory bandwidth in GB/s (100-400), Xeon Phi vs. GTX 780]
SLIDE 15

[Chart: processed instructions per second vs. processed bytes per instruction; AMD, Nvidia and Intel against the 500 GB/s processing diagonal]

OUR QUESTION: DOES IT MATTER? DOES PHI CHANGE ANYTHING?

SLIDE 16

THE OBSTACLE COURSE

SLIDE 17

[Query plan: Facts joined with Dimension, then projection (π) and aggregation (Ɣ)]

DATA-CENTRIC APPLICATIONS HAVE TYPICAL CHOKEPOINTS

Chokepoints: Bandwidth, Computation, Synchronization, Capacity

SLIDE 18

DATA-CENTRIC APPLICATIONS HAVE TYPICAL CHOKEPOINTS

[Query plan: Facts joined with Dimension, then projection (π) and aggregation (Ɣ)]

Tuple Width, # of Conflicts, Hash Complexity, Access Locality

SLIDE 19

PHI VS. GTX 780

SLIDE 20

[Query plan: Facts joined with Dimension, then projection (π) and aggregation (Ɣ)]

Bandwidth

FIRST CHOKEPOINT

SLIDE 21

[Plot: time per access in ns (0.04-1.28) vs. stride in bytes (4-512); GTX 780 vs. Xeon Phi]

BANDWIDTH OF PHI LOOKS SIMILAR TO GPU AT FIRST GLANCE

SLIDE 22

[Plot: time per access vs. stride in bytes, GTX 780 vs. Xeon Phi (same data as the previous slide)]

A SECOND GLANCE REVEALS SOMETHING ODD…

A Non-Linear Cost Function

SLIDE 23

[Plot: time per access vs. stride in bytes, GTX 780 vs. Xeon Phi (same data as the previous slide)]

A SECOND GLANCE REVEALS SOMETHING ODD…

Not Dominated (only) by Cache Misses

SLIDE 24

[Query plan: Facts joined with Dimension, then projection (π) and aggregation (Ɣ)]

Capacity

SECOND CHOKEPOINT

SLIDE 25

[Plot: time per access in ns vs. size of lookup table in bytes (64-16M); GTX 780 and Xeon Phi with their respective lower bounds]

PHI BENEFITS FROM LARGER CACHES
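The lookup-table experiment can be sketched as pseudo-random reads into a table whose size sweeps from cache-resident to memory-resident. A hypothetical sketch (the function name and LCG are assumptions, not the authors' code):

```c
#include <stdint.h>
#include <stddef.h>

/* Pseudo-random lookups into a table of n_entries 32-bit values. The
 * cheap LCG keeps index generation off the critical path so the table
 * access dominates the measured time. */
uint64_t lookup_sum(const uint32_t *table, size_t n_entries, size_t n_lookups) {
    uint64_t sum = 0;
    uint32_t x = 12345u;                   /* fixed seed: reproducible */
    for (size_t i = 0; i < n_lookups; i++) {
        x = x * 1664525u + 1013904223u;    /* Numerical Recipes LCG */
        sum += table[x % n_entries];
    }
    return sum;
}
```

Timing `lookup_sum` for table sizes from 64 B to 16 MB and dividing by `n_lookups` reproduces the shape of the experiment; the returned sum doubles as a compiler-visible side effect.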

SLIDE 26

[Query plan: Facts joined with Dimension, then projection (π) and aggregation (Ɣ)]

Computation

THIRD CHOKEPOINT

SLIDE 27

[Plot: time per hash in ns (0.05-0.80) vs. number of Murmur rehashes (1-32); Xeon Phi vs. GTX 780]

COMPUTATION PERFORMANCE IS VERY SIMILAR…
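One way to read the "number of Murmur rehashes" knob: apply MurmurHash3's 64-bit finalizer (fmix64, whose constants are public) repeatedly to dial compute cost per tuple up or down. A scalar sketch; the vectorized variants on Phi and the GPU would apply the same arithmetic per lane:

```c
#include <stdint.h>

/* MurmurHash3 64-bit finalizer (fmix64). */
uint64_t fmix64(uint64_t k) {
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33;
    k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return k;
}

/* Chain the finalizer `rounds` times to scale compute per element. */
uint64_t murmur_rehash(uint64_t k, int rounds) {
    for (int i = 0; i < rounds; i++)
        k = fmix64(k);
    return k;
}
```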

SLIDE 28

[Query plan: Facts joined with Dimension, then projection (π) and aggregation (Ɣ)]

Synchronization

FOURTH CHOKEPOINT

SLIDE 29

[Plot: time per access in ns (0-15) vs. number of values per bucket (1-31); GTX 780 vs. Xeon Phi]

…AND SO IS HASH-BUILDING
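Hash-building under contention can be sketched with fixed-size buckets and an atomic fill counter; the "values per bucket" axis then corresponds to how many insertions collide on one bucket's counter. A hypothetical C11 sketch, not the authors' actual build loop:

```c
#include <stdatomic.h>
#include <stdint.h>

#define SLOTS_PER_BUCKET 32

typedef struct {
    atomic_uint fill;                   /* contended under skew */
    uint64_t values[SLOTS_PER_BUCKET];
} bucket_t;

/* Claim a slot with an atomic fetch-and-increment; the more values hash
 * to one bucket, the more threads serialize on its counter. */
int bucket_insert(bucket_t *b, uint64_t v) {
    unsigned slot = atomic_fetch_add(&b->fill, 1u);
    if (slot >= SLOTS_PER_BUCKET)
        return 0;                       /* bucket overflow */
    b->values[slot] = v;
    return 1;
}
```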

SLIDE 30

RECAP

  • Phi & GPU mostly on par in:
  • Computation
  • Synchronization
  • Cache utilization
  • But what is up with the memory access?
SLIDE 31

PHI IN DEPTH

SLIDE 32

SCATTER/GATHER

SLIDE 33

[Screenshot: Intel Xeon Phi instruction reference, Chapter 6, "VGATHERDPD - Gather Float64 Vector With Signed Dword Indices" (Reference Number 327364-001, p. 297)]

LET’S LOOK AT THE DOCUMENTATION

SLIDE 34

CHAPTER 6. INSTRUCTION DESCRIPTIONS

VGATHERDPD - Gather Float64 Vector With Signed Dword Indices

Opcode Instruction Description MVEX.512.66.0F38.W1 92 /r /vsib vgatherdpd zmm1 {k1}, Uf64(mvt) Gather float64 vector Uf64(mvt) into float64 vector zmm1 using doubleword indices and k1 as completion mask.

Description

A set of 8 memory locations pointed to by base address BASE_ADDR and doubleword index vector VINDEX with scale SCALE are converted to a float64 vector. The result is written into float64 vector zmm1. Note the special mask behavior, as only a subset of the active elements of write mask k1 are actually operated on (as denoted by function SELECT_SUBSET). There are only two guarantees about the function: (a) the destination mask is a subset of the source mask (identity is included), and (b) on a given invocation of the instruction, at least one element (the least significant enabled mask bit) will be selected from the source mask. Programmers should always enforce the execution of a gather/scatter instruction to be re-executed (via a loop) until the full completion of the sequence (i.e. all elements of the gather/scatter sequence have been loaded/stored and hence the write-mask bits are all zero). Note that each accessed element will always access 64 bytes of memory. The memory region accessed by each element will always be between element_linear_address & (~0x3F) and (element_linear_address & (~0x3F)) + 63 boundaries. This instruction has special disp8*N and alignment rules; N is considered to be the size of a single vector element before up-conversion.

Note also the special mask behavior, as the corresponding bits in write mask k1 are reset with each destination element being updated according to the subset of write mask k1. This is useful to allow conditional re-trigger of the instruction until all the elements from a given write mask have been successfully loaded. The instruction will #GP fault if the destination vector zmm1 is the same as index vector VINDEX.

Operation

Reference Number: 327364-001

297

LET’S LOOK AT THE DOCUMENTATION

???
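What the manual mandates is a retry loop: one gather instruction may service only a subset of the enabled lanes, so software must loop until the completion mask is zero. A scalar C emulation of that protocol (hypothetical; uses the GCC/Clang builtin `__builtin_ctz`, and services the documented minimum of one lane per invocation):

```c
#include <stdint.h>

/* Emulate the completion-mask protocol: each invocation services at least
 * the least-significant enabled lane and clears its mask bit; software
 * loops until the mask reaches zero. Here each invocation services
 * exactly one lane, the legal minimum. */
void gather_f64(double *dst, const double *base, const int32_t *index,
                uint8_t mask, int *invocations) {
    *invocations = 0;
    while (mask) {
        int lane = __builtin_ctz(mask);      /* least-significant enabled bit */
        dst[lane] = base[index[lane]];
        mask = (uint8_t)(mask & (mask - 1)); /* clear the serviced lane */
        (*invocations)++;
    }
}
```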

SLIDE 35

[Screenshot: the same VGATHERDPD documentation page, shown again with the key passages highlighted]

LET’S LOOK AT THE DOCUMENTATION

??? ☠

SLIDE 36

[Plot: time per access in ns vs. size of lookup table in bytes (8-16M); scalar vs. vectorized, plus their ratio]

GATHER-LOADING YIELDS ONLY A MODERATE LOOKUP IMPROVEMENT…

SLIDE 37

[Plot: time per access in ns vs. stride in bytes (4-512); scalar vs. vectorized, plus their ratio]

…SAME FOR PROJECTIONS

SLIDE 38

PREFETCHING

SLIDE 39

[Plot: overhead (time per access) vs. stride in bytes (4-4K); with prefetcher, bypassing prefetcher, and expected behavior]

THE PHI PREFETCHER SEEMS OVERLY AGGRESSIVE

SLIDE 40

[Plot: cache-overhead adjusted transfer rate in GB/s (20-320) vs. stride in bytes (4-4K)]

ONLY WHEN FACTORING IN TRANSFER OVERHEAD IS THE NOMINAL PHI BANDWIDTH ACHIEVED
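The adjustment is simple cache-line arithmetic; here is a sketch under the assumption of a 64-byte line (the helper name is mine, not the authors'): a miss always moves a whole line, so the bus traffic per access is min(stride, 64) bytes, since for sub-line strides one fetched line amortizes over 64/stride accesses.

```c
#include <stddef.h>

/* Bytes actually transferred per access: a miss always moves a full
 * 64-byte line. For stride < 64, one line serves 64/stride accesses
 * (stride bytes per access amortized); for stride >= 64, every access
 * pulls its own line. With time in ns, bytes per ns equals GB/s. */
double adjusted_rate_gbps(double ns_per_access, size_t stride_bytes) {
    double moved = stride_bytes < 64 ? (double)stride_bytes : 64.0;
    return moved / ns_per_access;
}
```

For example, at a measured 0.5 ns per access, a 128-byte stride moves 64 bytes per access, i.e. an adjusted 128 GB/s even though only 128 useful bytes per two lines were requested.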

SLIDE 41

TAKEAWAY

  • Phi is on par with mid-level GPUs for compute-intensive applications
  • Data-intensive performance is weird, though:
  • The prefetcher seems overly aggressive
  • The gather implementation seems half-baked: too few cache ports?
SLIDE 42

THANK YOU