Algorithm Engineering (aka. How to Write Fast Code) CS260 Lecture 9: An Overview of Computer Architecture – PowerPoint PPT Presentation



slide-1
SLIDE 1

Algorithm Engineering

(aka. How to Write Fast Code)

An Overview of Computer Architecture

CS260 – Lecture 9 – Yan Gu

Many slides in this lecture are borrowed from the first and second lecture in Stanford CS149 Parallel Computing. The credit is to Prof. Kayvon Fatahalian, and the instructor appreciates the permission to use them in this course.

slide-2
SLIDE 2

Lecture Overview

  • In this lecture you will learn a brief history of the evolution of computer architecture:
    • Instruction-level parallelism (ILP)
    • Multiple processing cores
    • Vector (superscalar, SIMD) processing
    • Multi-threading (hyper-threading)
    • Already covered in previous lectures: caching
  • What we cover: the programming perspective
  • What we do not cover: how these features are implemented at the hardware level (CMU 15-742 / Stanford CS149)
slide-3
SLIDE 3

Moore’s law: #transistors doubles every 18 months

[Figure: normalized transistor count, clock speed (MHz), and processor cores vs. year, 1970–2015. Source: Stanford’s CPU DB [DKM12]]

slide-4
SLIDE 4

Key question for computer architecture research: How to use the increasing number of transistors for better performance?

slide-5
SLIDE 5

Until ~15 years ago: two significant reasons for processor performance improvement

  • Increasing CPU clock frequency
  • Exploiting instruction-level parallelism (superscalar execution)

slide-6
SLIDE 6

What is a computer program?

int main(int argc, char** argv) {
  int x = 1;
  for (int i = 0; i < 10; i++) {
    x = x + x;
  }
  printf("%d\n", x);
  return 0;
}


slide-7
SLIDE 7

_main:
100000f10: pushq  %rbp
100000f11: movq   %rsp, %rbp
100000f14: subq   $32, %rsp
100000f18: movl   $0, -4(%rbp)
100000f1f: movl   %edi, -8(%rbp)
100000f22: movq   %rsi, -16(%rbp)
100000f26: movl   $1, -20(%rbp)
100000f2d: movl   $0, -24(%rbp)
100000f34: cmpl   $10, -24(%rbp)
100000f38: jge    23 <_main+0x45>
100000f3e: movl   -20(%rbp), %eax
100000f41: addl   -20(%rbp), %eax
100000f44: movl   %eax, -20(%rbp)
100000f47: movl   -24(%rbp), %eax
100000f4a: addl   $1, %eax
100000f4d: movl   %eax, -24(%rbp)
100000f50: jmp    -33 <_main+0x24>
100000f55: leaq   58(%rip), %rdi
100000f5c: movl   -20(%rbp), %esi
100000f5f: movb   $0, %al
100000f61: callq  14
100000f66: xorl   %esi, %esi
100000f68: movl   %eax, -28(%rbp)
100000f6b: movl   %esi, %eax
100000f6d: addq   $32, %rsp
100000f71: popq   %rbp
100000f72: retq

Review: what is a program?

From a processor’s perspective, a program is a sequence of instructions.

slide-8
SLIDE 8

It runs programs!

Processor executes the instruction referenced by the program counter (PC)
(executing the instruction will modify machine state: contents of registers, memory, CPU state, etc.)

Move to the next instruction…

Then execute it… And so on…

_main:
100000f10: pushq  %rbp
100000f11: movq   %rsp, %rbp
100000f14: subq   $32, %rsp
100000f18: movl   $0, -4(%rbp)
100000f1f: movl   %edi, -8(%rbp)
100000f22: movq   %rsi, -16(%rbp)
100000f26: movl   $1, -20(%rbp)
100000f2d: movl   $0, -24(%rbp)
100000f34: cmpl   $10, -24(%rbp)
100000f38: jge    23 <_main+0x45>
100000f3e: movl   -20(%rbp), %eax
100000f41: addl   -20(%rbp), %eax
100000f44: movl   %eax, -20(%rbp)
100000f47: movl   -24(%rbp), %eax
100000f4a: addl   $1, %eax
100000f4d: movl   %eax, -24(%rbp)
100000f50: jmp    -33 <_main+0x24>
100000f55: leaq   58(%rip), %rdi
100000f5c: movl   -20(%rbp), %esi
100000f5f: movb   $0, %al
100000f61: callq  14
100000f66: xorl   %esi, %esi
100000f68: movl   %eax, -28(%rbp)
100000f6b: movl   %esi, %eax
100000f6d: addq   $32, %rsp
100000f71: popq   %rbp
100000f72: retq

Review: what does a processor do?

PC

slide-9
SLIDE 9

Instruction level parallelism (ILP)

mul r1, r0, r0
mul r1, r1, r1
st  r1, mem[r2]
...
add r0, r0, r3
add r1, r4, r5
...
...

Independent instructions Dependent instructions

  • Processors did in fact leverage parallel execution to make programs run faster; it was just invisible to the programmer
  • Instruction-level parallelism (ILP)
    • Idea: Instructions must appear to be executed in program order. BUT independent instructions can be executed simultaneously by a processor without impacting program correctness
    • Superscalar execution: the processor dynamically finds independent instructions in an instruction sequence and executes them in parallel

slide-10
SLIDE 10

ILP example

a = x*x + y*y + z*z

// assume r0=x, r1=y, r2=z
mul r0, r0, r0
mul r1, r1, r1
mul r2, r2, r2
add r0, r0, r1
add r3, r0, r2
// now r3 stores value of program variable ‘a’

Consider the following program:

This program has five instructions, so it will take five clocks to execute, correct? Can we do better?

slide-11
SLIDE 11

ILP example

a = x*x + y*y + z*z

slide-12
SLIDE 12

ILP example

a = x*x + y*y + z*z

// assume r0=x, r1=y, r2=z

  • 1. mul r0, r0, r0
  • 2. mul r1, r1, r1
  • 3. mul r2, r2, r2
  • 4. add r0, r0, r1
  • 5. add r3, r0, r2

// now r3 stores value of program variable ‘a’

Superscalar execution: the processor automatically finds independent instructions in an instruction sequence and executes them in parallel on multiple execution units! In this example, instructions 1, 2, and 3 can be executed in parallel (on a superscalar processor that can determine that no dependencies exist). But instruction 4 must come after instructions 1 and 2, and instruction 5 must come after instructions 3 and 4.
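To see the same dependency structure at the source level, here is a minimal sketch (my own illustration, not from the slides; the variable names are made up) of a = x*x + y*y + z*z written so the independent and dependent operations are explicit:

#include <cstdio>

int main() {
    float x = 1.0f, y = 2.0f, z = 3.0f;

    // Three independent multiplies: none reads a result produced by the
    // others, so a superscalar core may issue them in the same clock.
    float xx = x * x;
    float yy = y * y;
    float zz = z * z;

    // The adds form a dependence chain: the second add must wait for the first.
    float partial = xx + yy;       // depends on xx and yy
    float a       = partial + zz;  // depends on partial and zz

    std::printf("a = %f\n", a);
    return 0;
}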

slide-13
SLIDE 13

A more complex example

Program (sequence of instructions), PC 00–10; each comment shows the value produced during execution:

a = 2
b = 4
tmp2 = a + b        // 6
tmp3 = tmp2 + a     // 8
tmp4 = b + b        // 8
tmp5 = b * b        // 16
tmp6 = tmp2 + tmp4  // 14
tmp7 = tmp5 + tmp6  // 30
if (tmp3 > 7)
  print tmp3
else
  print tmp7

[Figure: instruction dependency graph over the program counters 00–10]

What does it mean for a superscalar processor to “respect program order”?

slide-14
SLIDE 14

Diminishing returns of superscalar execution

Most available ILP is exploited by a processor capable of issuing four instructions per clock (Little performance benefit from building a processor that can issue more)

[Figure: speedup vs. instruction issue capability of the processor (instructions/clock). Source: Culler & Singh (data from Johnson 1991)]

slide-15
SLIDE 15

Until ~15 years ago: two significant reasons for processor performance improvement

  • Increasing CPU clock frequency
  • Exploiting instruction-level parallelism (superscalar execution)

slide-16
SLIDE 16

Part 1: Parallel Execution

slide-17
SLIDE 17

Example program

void sinx(int N, int terms, float* x, float* result) {
  for (int i = 0; i < N; i++) {
    float value = x[i];
    float numer = x[i] * x[i] * x[i];
    int denom = 6;  // 3!
    int sign = -1;

    for (int j = 1; j <= terms; j++) {
      value += sign * numer / denom;
      numer *= x[i] * x[i];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }

    result[i] = value;
  }
}

Compute sin(x) using Taylor expansion: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ... for each element of an array of N floating-point numbers

slide-18
SLIDE 18

Compile program

void sinx(int N, int terms, float* x, float* result) { for (int i=0; i<N; i++) { float value = x[i]; float numer = x[i] * x[i] * x[i]; int denom = 6; // 3! int sign = -1; for (int j=1; j<=terms; j++) { value += sign * numer / denom; numer *= x[i] * x[i]; denom *= (2*j+2) * (2*j+3); sign *= -1; } result[i] = value; } } ld r0, addr[r1] mul r1, r0, r0 mul r1, r1, r0 ... ... ... ... ... ... st addr[r2], r0

x[i]

result[i]

slide-19
SLIDE 19

Execute program

Fetch/ Decode Execution Context Execution Unit (ALU)

ld r0, addr[r1] mul r1, r0, r0 mul r1, r1, r0 ... ... ... ... ... ... st addr[r2], r0

x[i]

result[i]

My very simple processor: executes one instruction per clock

slide-20
SLIDE 20

Execute program

PC

Fetch/ Decode Execution Context Execution Unit (ALU)

ld r0, addr[r1] mul r1, r0, r0 mul r1, r1, r0 ... ... ... ... ... ... st addr[r2], r0

x[i]

result[i]

My very simple processor: executes one instruction per clock

slide-21
SLIDE 21

Execute program

PC

Fetch/ Decode Execution Context Execution Unit (ALU)

ld r0, addr[r1] mul r1, r0, r0 mul r1, r1, r0 ... ... ... ... ... ... st addr[r2], r0

x[i]

result[i]

My very simple processor: executes one instruction per clock

slide-22
SLIDE 22

Execute program

My very simple processor: executes one instruction per clock

PC

Fetch/ Decode Execution Context Execution Unit (ALU)

ld r0, addr[r1] mul r1, r0, r0 mul r1, r1, r0 ... ... ... ... ... ... st addr[r2], r0

x[i]

result[i]

slide-23
SLIDE 23

Fetch/ Decode Execution Context

Superscalar processor

Fetch/ Decode 1

Exec 1

Recall from earlier: instruction-level parallelism (ILP). Decode and execute two instructions per clock (if possible).

Fetch/ Decode 2

Exec 2

Note: No ILP exists in this region of the program

ld r0, addr[r1] mul r1, r0, r0 mul r1, r1, r0 ... ... ... ... ... ... st addr[r2], r0

x[i]

result[i]

slide-24
SLIDE 24

Aside: Pentium 4

Image credit: http://ixbtlabs.com/articles/pentium4/index.html

slide-25
SLIDE 25

Processor: pre multi-core era

Fetch/ Decode Execution Context Exec Unit (ALU) Data cache (a big one) Out-of-order control logic Fancy branch predictor Memory pre-fetcher

Majority of chip transistors used to perform operations that help a single instruction stream run fast

More transistors = larger cache, smarter out-of-order logic, smarter branch predictor, etc. (Also: more transistors → smaller transistors → higher clock frequencies)

slide-26
SLIDE 26

Processor: multi-core era (since 2005)

Idea #1: Use increasing transistor count to add more cores to the processor Rather than use transistors to increase sophistication of processor logic that accelerates a single instruction stream (e.g., out-of-order and speculative operations)

Fetch/ Decode Execution Context Execution Unit (ALU)

slide-27
SLIDE 27

Two cores: compute two elements in parallel

Fetch/ Decode Execution Context Exec (ALU) Fetch/ Decode Execution Context Exec (ALU)

Simpler cores: each core is slower at running a single instruction stream than our original “fancy” core (e.g., 25% slower). But there are now two cores: 2 × 0.75 = 1.5 (potential for speedup!)

slide-28
SLIDE 28

But our program expresses no parallelism

void sinx(int N, int terms, float* x, float* result) { for (int i=0; i<N; i++) { float value = x[i]; float numer = x[i] * x[i] * x[i]; int denom = 6; // 3! int sign = -1; for (int j=1; j<=terms; j++) { value += sign * numer / denom; numer *= x[i] * x[i]; denom *= (2*j+2) * (2*j+3); sign *= -1; } result[i] = value; } }

This C program, compiled with gcc, will run as one thread on one of the processor cores. If each of the simpler processor cores is 25% slower than the original single complicated one, our program now runs 25% slower. :-(

slide-29
SLIDE 29

Using Cilk to provide parallelism

void sinx(int N, int terms, float* x, float* result) {
  cilk_for (int i = 0; i < N; i++) {
    float value = x[i];
    float numer = x[i] * x[i] * x[i];
    int denom = 6;  // 3!
    int sign = -1;

    for (int j = 1; j <= terms; j++) {
      value += sign * numer / denom;
      numer *= x[i] * x[i];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }

    result[i] = value;
  }
}

Loop iterations are declared by the programmer to be independent. With this information, you could imagine how a compiler might automatically generate parallel threaded code (a rough sketch follows below).
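As a rough illustration only (not the actual Cilk runtime, which schedules iterations with work stealing), a hypothetical lowering of the independent loop onto worker threads using C++ std::thread; the static chunking scheme and default thread count are arbitrary choices for the sketch:

#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical static lowering of the independent cilk_for loop onto threads.
void sinx_threads(int N, int terms, const float* x, float* result,
                  int num_threads = 4) {
    std::vector<std::thread> workers;
    int chunk = (N + num_threads - 1) / num_threads;

    for (int t = 0; t < num_threads; t++) {
        int begin = t * chunk;
        int end = std::min(N, begin + chunk);
        workers.emplace_back([=]() {
            for (int i = begin; i < end; i++) {
                float value = x[i];
                float numer = x[i] * x[i] * x[i];
                int denom = 6;  // 3!
                int sign = -1;
                for (int j = 1; j <= terms; j++) {
                    value += sign * numer / denom;
                    numer *= x[i] * x[i];
                    denom *= (2*j+2) * (2*j+3);
                    sign *= -1;
                }
                result[i] = value;
            }
        });
    }
    for (auto& w : workers) w.join();
}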

slide-30
SLIDE 30

Four cores: compute four elements in parallel

Fetch/ Decode Execution Context Exec (ALU) Fetch/ Decode Execution Context Exec (ALU) Fetch/ Decode Execution Context Exec (ALU) Fetch/ Decode Execution Context Exec (ALU)

slide-31
SLIDE 31

Sixteen cores, sixteen simultaneous instruction streams

Sixteen cores: compute sixteen elements in parallel

slide-32
SLIDE 32

Multi-core examples

Intel “Skylake” Core i7 quad-core CPU (2015) NVIDIA GP104 (GTX 1080) GPU 20 replicated (“SM”) cores (2016)

slide-33
SLIDE 33

More multi-core examples

Intel Xeon Phi “Knights Corner” 72-core CPU (2016) Apple A9 dual-core CPU (2015)

A9 image credit: Chipworks (obtained via Anandtech) http://www.anandtech.com/show/9686/the-apple-iphone-6s-and-iphone-6s-plus-review/3

Core 1 Core 2

slide-34
SLIDE 34

Data-parallel expression

Another interesting property of this code: Parallelism is across iterations of the loop. All the iterations of the loop carry out the exact same sequence of instructions, but on different input data (to compute the sine of the input number)

void sinx(int N, int terms, float* x, float* result) { cilk_for (int i=0; i<N; i++) { float value = x[i]; float numer = x[i] * x[i] * x[i]; int denom = 6; // 3! int sign = -1; for (int j=1; j<=terms; j++) { value += sign * numer / denom; numer *= x[i] * x[i]; denom *= (2*j+2) * (2*j+3); sign *= -1; } result[i] = value; } }

slide-35
SLIDE 35

Add ALUs to increase compute capability

Idea #2: Amortize cost/complexity of managing an instruction stream across many ALUs

SIMD processing

Single instruction, multiple data Same instruction broadcast to all ALUs Executed in parallel on all ALUs

Fetch/ Decode

ALU 0 ALU 1 ALU 2 ALU 3 ALU 4 ALU 5 ALU 6 ALU 7

Execution Context

slide-36
SLIDE 36

Add ALUs to increase compute capability

Recall original compiled program: Instruction stream processes one array element at a time using scalar instructions on scalar registers (e.g., 32-bit floats)

Fetch/ Decode

ALU 0 ALU 1 ALU 2 ALU 3 ALU 4 ALU 5 ALU 6 ALU 7

Execution Context

ld r0, addr[r1] mul r1, r0, r0 mul r1, r1, r0 ... ... ... ... ... ... st addr[r2], r0

slide-37
SLIDE 37

Scalar program

void sinx(int N, int terms, float* x, float* result) { cilk_for (int i=0; i<N; i++) { float value = x[i]; float numer = x[i] * x[i] * x[i]; int denom = 6; // 3! int sign = -1; for (int j=1; j<=terms; j++) { value += sign * numer / denom; numer *= x[i] * x[i]; denom *= (2*j+2) * (2*j+3); sign *= -1; } result[i] = value; } }

Original compiled program: Processes one array element using scalar instructions on scalar registers (e.g., 32-bit floats)

ld r0, addr[r1] mul r1, r0, r0 mul r1, r1, r0 ... ... ... ... ... ... st addr[r2], r0

slide-38
SLIDE 38

Vector program (using AVX intrinsics)

#include <immintrin.h>

void sinx(int N, int terms, float* x, float* result) {
  float three_fact = 6;  // 3!
  for (int i = 0; i < N; i += 8) {
    __m256 origx = _mm256_load_ps(&x[i]);
    __m256 value = origx;
    __m256 numer = _mm256_mul_ps(origx, _mm256_mul_ps(origx, origx));
    __m256 denom = _mm256_broadcast_ss(&three_fact);
    int sign = -1;

    for (int j = 1; j <= terms; j++) {
      // value += sign * numer / denom
      __m256 tmp = _mm256_div_ps(_mm256_mul_ps(_mm256_set1_ps((float)sign), numer), denom);
      value = _mm256_add_ps(value, tmp);
      numer = _mm256_mul_ps(numer, _mm256_mul_ps(origx, origx));
      float fact = (2*j+2) * (2*j+3);
      denom = _mm256_mul_ps(denom, _mm256_broadcast_ss(&fact));
      sign *= -1;
    }
    _mm256_store_ps(&result[i], value);
  }
}

Intrinsics available to C programmers

slide-39
SLIDE 39

Vector program (using AVX intrinsics)

vloadps  xmm0, addr[r1]
vmulps   xmm1, xmm0, xmm0
vmulps   xmm1, xmm1, xmm0
...
vstoreps addr[xmm2], xmm0

Compiled program: Processes eight array elements simultaneously using vector instructions on 256-bit vector registers

#include <immintrin.h>

void sinx(int N, int terms, float* x, float* result) {
  float three_fact = 6;  // 3!
  for (int i = 0; i < N; i += 8) {
    __m256 origx = _mm256_load_ps(&x[i]);
    __m256 value = origx;
    __m256 numer = _mm256_mul_ps(origx, _mm256_mul_ps(origx, origx));
    __m256 denom = _mm256_broadcast_ss(&three_fact);
    int sign = -1;

    for (int j = 1; j <= terms; j++) {
      // value += sign * numer / denom
      __m256 tmp = _mm256_div_ps(_mm256_mul_ps(_mm256_set1_ps((float)sign), numer), denom);
      value = _mm256_add_ps(value, tmp);
      numer = _mm256_mul_ps(numer, _mm256_mul_ps(origx, origx));
      float fact = (2*j+2) * (2*j+3);
      denom = _mm256_mul_ps(denom, _mm256_broadcast_ss(&fact));
      sign *= -1;
    }
    _mm256_store_ps(&result[i], value);
  }
}

slide-40
SLIDE 40

16 SIMD cores: 128 elements in parallel

16 cores, 128 ALUs, 16 simultaneous instruction streams

slide-41
SLIDE 41

Data-parallel expression

The compiler understands that loop iterations are independent and that the same loop body will be executed on a large number of data elements. This abstraction facilitates automatic generation of both multi-core parallel code and vector instructions that make use of the SIMD processing capabilities within a core (see the sketch after the code below).

void sinx(int N, int terms, float* x, float* result) { cilk_for (int i=0; i<N; i++) { float value = x[i]; float numer = x[i] * x[i] * x[i]; int denom = 6; // 3! int sign = -1; for (int j=1; j<=terms; j++) { value += sign * numer / denom; numer *= x[i] * x[i]; denom *= (2*j+2) * (2*j+3); sign *= -1; } result[i] = value; } }
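As an illustrative alternative to Cilk (not used in this course's examples), OpenMP lets the programmer convey the same independence with one annotation, and a conforming compiler may then both distribute iterations across cores and vectorize the loop body. A minimal sketch with a simple element-wise kernel, assuming the code is compiled with OpenMP enabled (e.g., -fopenmp):

// The pragma asserts that iterations are independent, so the compiler may
// split them across threads ("parallel for") and pack several iterations
// into one SIMD instruction ("simd").
void scale_add(int n, const float* x, const float* y, float* out) {
#pragma omp parallel for simd
    for (int i = 0; i < n; i++) {
        out[i] = 2.0f * x[i] + y[i];
    }
}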

slide-42
SLIDE 42

What about conditional execution?

[Figure: execution of the code below on ALUs 1–8 over time (clocks)]

(assume the logic below is to be executed for each element in input array ‘A’, producing output into the array ‘result’)

float x = A[i];          // <unconditional code>
if (x > 0) {
  float tmp = exp(x, 5.f);
  tmp *= kMyConst1;
  x = tmp + kMyConst2;
} else {
  float tmp = kMyConst1;
  x = 2.f * tmp;
}
result[i] = x;           // <resume unconditional code>

slide-43
SLIDE 43

What about conditional execution?

[Figure: same code and setup as the previous slide; per-lane outcome of the branch across ALUs 1–8: T T T F F F F F (lanes where x > 0 take the “if” path)]

slide-44
SLIDE 44

Mask (discard) output of ALU

Not all ALUs do useful work! Worst case: 1/8 peak performance

[Figure: same code and setup as the previous slides; both branch paths are issued to all ALUs 1–8, and each lane’s output is masked (discarded) on the path it did not take (mask: T T T F F F F F)]

slide-45
SLIDE 45

After branch: continue at full performance

[Figure: same code and setup as the previous slides; after the branch, all lanes of ALUs 1–8 execute the unconditional code again at full rate]
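To make the masking idea concrete, here is a hedged sketch (my own illustration, not from the slides) of how an 8-wide AVX implementation can evaluate both sides of the branch for every lane and then blend the results with a per-lane mask; the exp-based “if” path is replaced by a simpler expression so the example stays self-contained, and kMyConst1/kMyConst2 are given arbitrary values:

#include <immintrin.h>

const float kMyConst1 = 1.5f;  // illustrative value
const float kMyConst2 = 0.5f;  // illustrative value

// Processes 8 floats of A per iteration; assumes n is a multiple of 8 and
// that A and result are 32-byte aligned.
void branchy_kernel(int n, const float* A, float* result) {
    __m256 c1   = _mm256_set1_ps(kMyConst1);
    __m256 c2   = _mm256_set1_ps(kMyConst2);
    __m256 zero = _mm256_setzero_ps();

    for (int i = 0; i < n; i += 8) {
        __m256 x = _mm256_load_ps(&A[i]);

        // “if” path, computed for every lane (stand-in for the exp-based code).
        __m256 if_val = _mm256_add_ps(_mm256_mul_ps(x, c1), c2);

        // “else” path, also computed for every lane: 2.f * kMyConst1.
        __m256 else_val = _mm256_mul_ps(_mm256_set1_ps(2.0f), c1);

        // Per-lane mask: all bits set in lanes where x > 0.
        __m256 mask = _mm256_cmp_ps(x, zero, _CMP_GT_OS);

        // Keep if_val where the mask is set, else_val elsewhere.
        __m256 out = _mm256_blendv_ps(else_val, if_val, mask);
        _mm256_store_ps(&result[i], out);
    }
}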

slide-46
SLIDE 46

SIMD execution on modern CPUs

  • SSE instructions: 128-bit operations: 4x32 bits or 2x64 bits (4-wide float vectors)
  • AVX2 instructions: 256-bit operations: 8x32 bits or 4x64 bits (8-wide float vectors)
  • AVX512 instructions: 512-bit operations: 16x32 bits…
  • Instructions are generated by the compiler
    • Parallelism explicitly requested by the programmer using intrinsics
    • Parallelism conveyed using parallel language semantics (e.g., forall example)
    • Parallelism inferred by dependency analysis of loops (hard problem; even the best compilers are not great on arbitrary C/C++ code)
  • Terminology: “explicit SIMD”: SIMD parallelization is performed at compile time
    • Can inspect the program binary and see vector instructions (vstoreps, vmulps, etc.), as in the sketch below
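As an illustration of explicit SIMD produced by the compiler (the flags and kernel are my own choices, not prescribed by the course), gcc and clang will typically auto-vectorize a simple independent loop at -O3; compiling with -mavx2 and disassembling the binary (e.g., with objdump -d) should then show packed instructions such as vmulps and vaddps:

// Independent iterations, contiguous accesses, and no aliasing (__restrict),
// so the loop vectorizer can emit 8-wide AVX2 instructions for this loop.
// Compile with: -O3 -mavx2 (add -fopt-info-vec on gcc or
// -Rpass=loop-vectorize on clang to see the vectorization report).
void scale(int n, float a, const float* __restrict x, float* __restrict y) {
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}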
slide-47
SLIDE 47

SIMD execution on many modern GPUs

  • “Implicit SIMD”
  • Compiler generates a scalar binary (scalar instructions)
  • But N instances of the program are *always run* together on the processor

execute(my_function, N) // execute my_function N times

  • In other words, the interface to the hardware itself is data parallel
  • Hardware (not compiler) is responsible for simultaneously executing the same instruction from multiple instances on different data on SIMD ALUs

  • SIMD width of most modern GPUs ranges from 8 to 32
    • Divergence can be a big issue (poorly written code might execute at 1/32 the peak capability of the machine!)

slide-48
SLIDE 48

Example: eight-core Intel Xeon E5-1660 v4

8 cores 8 SIMD ALUs per core (AVX2 instructions) 490 GFLOPs (@3.2 GHz) (140 Watts)

* Showing only AVX math units, and fetch/decode unit for AVX (additional capability for integer math)

slide-49
SLIDE 49

Example: NVIDIA GTX 1080

20 cores (“SMs”) 128 SIMD ALUs per core (@1.6 GHz) = 8.1 TFLOPs (180 Watts)

slide-50
SLIDE 50

Summary: parallel execution

  • Several forms of parallel execution in modern processors
  • Multi-core: use multiple processing cores
    • Provides thread-level parallelism: simultaneously execute a completely different instruction stream on each core
    • Software/algorithms decide when to create threads (e.g., via cilk_spawn, cilk_for)
  • SIMD: use multiple ALUs controlled by the same instruction stream (within a core)
    • Efficient design for data-parallel workloads: control amortized over many ALUs
    • Vectorization can be done by the compiler (explicit SIMD) or at runtime by hardware
    • [Lack of] dependencies is known prior to execution (usually declared by the programmer, but can be inferred by loop analysis in an advanced compiler)
  • Superscalar: exploit ILP within an instruction stream. Process different instructions from the same instruction stream in parallel (within a core)
    • Parallelism automatically and dynamically discovered by the hardware during execution (not programmer visible)

slide-51
SLIDE 51

Part 2: Accessing Memory

slide-52
SLIDE 52

Terminology

  • Memory latency
    • The amount of time for a memory request (e.g., load, store) from a processor to be serviced by the memory system
    • Example: 100 cycles, 100 nsec
  • Memory bandwidth
    • The rate at which the memory system can provide data to a processor
    • Example: 20 GB/s
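A hedged illustration of the difference (the data structures are my own choices for the sketch): a dependent pointer chase pays roughly one full memory latency per step, while a streaming sum with independent accesses is limited mainly by memory bandwidth:

#include <cstddef>
#include <numeric>
#include <vector>

// Latency-bound: each load depends on the previous one, so the core mostly
// waits about one memory latency per step of the chase.
std::size_t chase(const std::vector<std::size_t>& next, std::size_t start,
                  std::size_t steps) {
    std::size_t i = start;
    for (std::size_t s = 0; s < steps; s++) {
        i = next[i];  // cannot issue this load until the previous one completes
    }
    return i;
}

// Bandwidth-bound: the loads are independent, so prefetching and out-of-order
// execution keep many requests in flight; throughput is limited by how fast
// memory can deliver bytes (e.g., the 20 GB/s figure above).
double stream_sum(const std::vector<double>& a) {
    return std::accumulate(a.begin(), a.end(), 0.0);
}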
slide-53
SLIDE 53

Stalls

  • A processor “stalls” when it cannot run the next instruction in an instruction stream because of a dependency on a previous instruction
  • Accessing memory is a major source of stalls

ld  r0, mem[r2]
ld  r1, mem[r3]
add r0, r0, r1

  • Memory access times ~ 100’s of cycles
  • Memory “access time” is a measure of latency

Dependency: cannot execute ‘add’ instruction until data at mem[r2] and mem[r3] have been loaded from memory

slide-54
SLIDE 54

[Figure: memory hierarchy — cores 1 … P, each with an L1 cache (32 KB) and L2 cache (256 KB); a shared L3 cache (8 MB); and DDR4 DRAM memory (gigabytes) reached at 38 GB/sec]

Review: why do modern processors have caches?

slide-55
SLIDE 55

Processors run efficiently when data is resident in caches. Caches reduce memory access latency.*

* Caches also provide high-bandwidth data transfer to the CPU

[Figure: the same memory hierarchy as the previous slide — per-core L1 (32 KB) and L2 (256 KB) caches, shared L3 (8 MB), DDR4 DRAM at 38 GB/sec]

Caches reduce length of stalls (reduce latency)

slide-56
SLIDE 56

Prefetching reduces stalls (hides latency)

  • All modern CPUs have logic for prefetching data into caches
    • Dynamically analyze the program’s access patterns, predict what it will access soon
  • Reduces stalls since data is resident in cache when accessed

predict value of r2, initiate load
predict value of r3, initiate load
...
ld  r0, mem[r2]   (these loads are cache hits: the data arrived in cache earlier)
ld  r1, mem[r3]
add r0, r0, r1

Note: Prefetching can also reduce performance if the guess is wrong (hogs bandwidth, pollutes caches) (more detail later in the course)
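Hardware prefetching is automatic, but as a hedged aside (not covered on the slide), gcc and clang also expose a software prefetch hint, __builtin_prefetch, that a programmer can issue for access patterns the hardware may not predict; the lookahead distance of 16 below is an arbitrary illustrative choice:

// Ask the cache hierarchy to start fetching data we expect to need about
// 16 iterations from now. The hint may be ignored by the hardware.
float sum_with_prefetch(int n, const float* a) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n) {
            __builtin_prefetch(&a[i + 16]);
        }
        sum += a[i];
    }
    return sum;
}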

slide-57
SLIDE 57

Multi-threading reduces stalls

  • Idea: interleave processing of multiple threads on the same core to hide stalls
  • Like prefetching, multi-threading is a latency-hiding, not a latency-reducing, technique

slide-58
SLIDE 58

Hiding stalls with multi-threading

[Figure: one core (Fetch/Decode, ALUs 0–7) with four hardware threads; threads 1–4 process elements 0…7, 8…15, 16…23, 24…31. Over time each thread runs until it stalls, the core switches to another runnable thread, and the stalled thread resumes when its data arrives.]

slide-59
SLIDE 59

Throughput computing trade-off

Key idea of throughput-oriented systems: Potentially increase time to complete work by any one thread, in order to increase overall system throughput when running multiple threads.

During this time, this thread is runnable, but it is not being executed by the processor. (The core is running some other thread.)

[Figure: timeline for threads 1–4 (elements 0…7, 8…15, 16…23, 24…31); thread 1 stalls while the other threads run, so it finishes later than it would running alone]

slide-60
SLIDE 60

Storing execution contexts

Fetch/ Decode

ALU 0 ALU 1 ALU 2 ALU 3 ALU 4 ALU 5 ALU 6 ALU 7

Context storage (or L1 cache)

Consider on-chip storage of execution contexts a finite resource.

slide-61
SLIDE 61

Fetch/ Decode

ALU 0 ALU 1 ALU 2 ALU 3 ALU 4 ALU 5 ALU 6 ALU 7

Many small contexts (high latency hiding ability)

1 core (16 hardware threads, storage for small working set per thread)

slide-62
SLIDE 62

Four large contexts (low latency hiding ability)

1 core (4 hardware threads, storage for larger working set per thread)

Fetch/ Decode

ALU 0 ALU 1 ALU 2 ALU 3 ALU 4 ALU 5 ALU 6 ALU 7

slide-63
SLIDE 63

Hardware-supported multi-threading

  • Core manages execution contexts for multiple threads
    • Runs instructions from runnable threads (the processor makes the decision about which thread to run each clock, not the operating system)
    • Core still has the same number of ALU resources: multi-threading only helps use them more efficiently in the face of high-latency operations like memory access
  • Interleaved multi-threading (a.k.a. temporal multi-threading)
    • What I described on the previous slides: each clock, the core chooses a thread and runs an instruction from that thread on the ALUs
  • Simultaneous multi-threading (SMT)
    • Each clock, the core chooses instructions from multiple threads to run on the ALUs
    • Extension of superscalar CPU design
    • Example: Intel Hyper-threading (2 threads per core); a small query sketch follows below
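As a hedged aside (not from the slides), the total number of hardware threads the OS exposes — cores × SMT threads per core — can be queried portably from C++:

#include <cstdio>
#include <thread>

int main() {
    // On a 4-core CPU with 2-way Hyper-threading this typically prints 8.
    // The value is only a hint and may be 0 if it cannot be determined.
    unsigned int hw_threads = std::thread::hardware_concurrency();
    std::printf("hardware threads: %u\n", hw_threads);
    return 0;
}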
slide-64
SLIDE 64

Multi-threading summary

  • Benefit: use a core’s execution resources (ALUs) more efficiently
  • Hide memory latency
  • Fill multiple functional units of a superscalar architecture (when one thread has insufficient ILP)
  • Costs
  • Requires additional storage for thread contexts
  • Increases run time of any single thread

(often not a problem, we usually care about throughput in parallel apps)

  • Requires additional independent work in a program (more independent work than ALUs!)

  • Relies heavily on memory bandwidth
  • More threads → larger working set → less cache space per thread
  • May go to memory more often, but can hide the latency
slide-65
SLIDE 65

A fictitious multi-core chip

16 cores
8 SIMD ALUs per core (128 total)
4 threads per core
16 simultaneous instruction streams
64 total concurrent instruction streams
512 independent pieces of work (16 cores × 4 threads × 8 SIMD lanes) are needed to run the chip with maximal latency-hiding ability

slide-66
SLIDE 66

= SIMD function unit, control shared across 32 units (1 MUL-ADD per clock)

“Shared” memory (96 KB) Execution contexts (registers) (256 KB)

Instructions operate on 32 pieces of data at a time (instruction streams called “warps”).

Think: warp = thread issuing 32-wide vector instructions

Different instructions from up to four warps can be executed simultaneously (simultaneous multi-threading)

Up to 64 warps are interleaved on the SM (interleaved multi-threading)

Over 2,048 elements can be processed concurrently by a core

NVIDIA GTX 1080 core (“SM”)

Source: NVIDIA Pascal Tuning Guide

GPUs: extreme throughput-oriented processors

Fetch/ Decode Fetch/ Decode Fetch/ Decode Fetch/ Decode Fetch/ Decode Fetch/ Decode Fetch/ Decode Fetch/ Decode

slide-67
SLIDE 67

NVIDIA GTX 1080

There are 20 SM cores on the GTX 1080: That’s 40,960 pieces of data being processed concurrently to get maximal latency hiding!

slide-68
SLIDE 68

CPU vs. GPU memory hierarchies

[Figure, CPU side: cores 1 … 8, each with an L1 cache (32 KB) and L2 cache (256 KB); shared L3 cache (20 MB); DDR4 DRAM (hundreds of GB to TB) at 76 GB/sec]

CPU:

Big caches, few threads per core, modest memory BW Rely mainly on caches and prefetching (automatic)

GPU:

Small caches, many threads, huge memory BW Rely heavily on multi-threading for performance (manual)

[Figure, GPU side: cores (“SMs”) 1 … 20, each with execution contexts (256 KB), an L1 cache, and a 64 KB scratchpad; shared L2 cache (2 MB); GDDR5 DRAM (4-12 GB) at 320 GB/sec]

slide-69
SLIDE 69

Bandwidth limited!

If processors request data at too high a rate, the memory system cannot keep up. No amount of latency hiding helps this. Overcoming bandwidth limits is a common challenge for application developers on throughput-optimized systems.

slide-70
SLIDE 70

Bandwidth is a critical resource

Performant parallel programs will:

  • Organize computation to fetch data from memory less often
    • Reuse data previously loaded by the same thread (traditional intra-thread temporal locality optimizations)
    • Share data across threads (inter-thread cooperation)
  • Request data less often (instead, do more arithmetic: it’s “free”)
    • Useful term: “arithmetic intensity” — the ratio of math operations to data access operations in an instruction stream
    • Main point: programs must have high arithmetic intensity to utilize modern processors efficiently (see the worked sketch below)
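As a hedged worked example (the kernel is my own illustrative choice, not from the slides): a saxpy-style loop performs 2 floating-point operations per element while moving 12 bytes (two 4-byte loads and one 4-byte store), giving an arithmetic intensity of about 0.17 flops/byte; at the 38 GB/sec memory bandwidth quoted earlier that sustains only roughly 6 GFLOP/s, far below the ~490 GFLOP/s peak of the example Xeon, which is why such kernels are bandwidth bound.

// saxpy: y[i] = a * x[i] + y[i]
// Per element: 2 flops (one multiply, one add), 12 bytes of memory traffic
// (load x[i], load y[i], store y[i]), assuming no cache reuse.
// Arithmetic intensity ≈ 2 / 12 ≈ 0.17 flops per byte.
void saxpy(int n, float a, const float* x, float* y) {
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}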

slide-71
SLIDE 71

Summary

  • Three major ideas that all modern processors employ to varying degrees
    • Provide multiple processing cores
      • Simpler cores (embrace thread-level parallelism over instruction-level parallelism)
    • Amortize instruction stream processing over many ALUs (SIMD)
      • Increase compute capability with little extra cost
    • Use multi-threading to make more efficient use of processing resources (hide latencies, fill all available resources)
  • Due to high arithmetic capability on modern chips, many parallel applications (on both CPUs and GPUs) are bandwidth bound
  • GPU architectures use the same throughput computing ideas as CPUs, but GPUs push these concepts to extreme scales

slide-72
SLIDE 72

Review slides

(additional examples for review and to check our understanding)

slide-73
SLIDE 73

Putting together the concepts from this lecture:

(if you understand the following sequence you understand this lecture)

slide-74
SLIDE 74

Running code on a simple processor

My very simple program: compute sin(x) using Taylor expansion

Fetch/ Decode Execution Context ALU (Execute)

My very simple processor: completes one instruction per clock

void sinx(int N, int terms, float* x, float* result) { for (int i=0; i<N; i++) { float value = x[i]; float numer = x[i] * x[i] * x[i]; int denom = 6; // 3! int sign = -1; for (int j=1; j<=terms; j++) { value += sign * numer / denom; numer *= x[i] * x[i]; denom *= (2*j+2) * (2*j+3); sign *= -1; } result[i] = value; } }

slide-75
SLIDE 75

void sinx(int N, int terms, float* x, float* result) { for (int i=0; i<N; i++) { float value = x[i]; float numer = x[i] * x[i] * x[i]; int denom = 6; // 3! int sign = -1; for (int j=1; j<=terms; j++) { value += sign * numer / denom; numer *= x[i] * x[i]; denom *= (2*j+2) * (2*j+3); sign *= -1; } result[i] = value; } }

Unmodified program

Execution Context

My single core, superscalar processor: executes up to two instructions per clock from a single instruction stream.

Fetch/ Decode Exec 1 Fetch/ Decode Exec 2

Independent operations in instruction stream (They are detected by the processor at run-time and may be executed in parallel on execution units 1 and 2)

Review: superscalar execution

slide-76
SLIDE 76

Modify the program to create two threads of control (two instruction streams)

My dual-core processor: executes one instruction per clock from an instruction stream on each core.

Fetch/ Decode Execution Context ALU (Execute) Fetch/ Decode Execution Context ALU (Execute)

void sinx(int N, int terms, float* x, float* result) { cilk_for (int i=0; i<N; i++) { float value = x[i]; float numer = x[i] * x[i] * x[i]; int denom = 6; // 3! int sign = -1; for (int j=1; j<=terms; j++) { value += sign * numer / denom; numer *= x[i] * x[i]; denom *= (2*j+2) * (2*j+3); sign *= -1; } result[i] = value; } }

Review: multi-core execution (two cores)

slide-77
SLIDE 77

Modify the program to create two threads of control (two instruction streams)

My superscalar dual-core processor: executes up to two instructions per clock from an instruction stream on each core.

Execution Context

Fetch/ Decode Exec 1 Fetch/ Decode Exec 2

Execution Context

Fetch/ Decode Exec 1 Fetch/ Decode Exec 2

void sinx(int N, int terms, float* x, float* result) { cilk_for (int i=0; i<N; i++) { float value = x[i]; float numer = x[i] * x[i] * x[i]; int denom = 6; // 3! int sign = -1; for (int j=1; j<=terms; j++) { value += sign * numer / denom; numer *= x[i] * x[i]; denom *= (2*j+2) * (2*j+3); sign *= -1; } result[i] = value; } }

Review: multi-core + superscalar execution

slide-78
SLIDE 78

Modify the program to create many threads of control

My quad-core processor: executes one instruction per clock from an instruction stream on each core.

Fetch/ Decode Execution Context ALU (Execute) Fetch/ Decode Execution Context ALU (Execute) Fetch/ Decode Execution Context ALU (Execute) Fetch/ Decode Execution Context ALU (Execute)

void sinx(int N, int terms, float* x, float* result) { cilk_for (int i=0; i<N; i++) { float value = x[i]; float numer = x[i] * x[i] * x[i]; int denom = 6; // 3! int sign = -1; for (int j=1; j<=terms; j++) { value += sign * numer / denom; numer *= x[i] * x[i]; denom *= (2*j+2) * (2*j+3); sign *= -1; } result[i] = value; } }

Review: multi-core (four cores)

slide-79
SLIDE 79

Observation: the program must execute many iterations of the same loop body.
Optimization: share an instruction stream across the execution of multiple iterations (single instruction, multiple data = SIMD)

My SIMD quad-core processor: executes one 8-wide SIMD instruction per clock from an instruction stream on each core.

Fetch/ Decode Execution Context Fetch/ Decode Execution Context Fetch/ Decode Execution Context Fetch/ Decode Execution Context

void sinx(int N, int terms, float* x, float* result) { cilk_for (int i=0; i<N; i++) { float value = x[i]; float numer = x[i] * x[i] * x[i]; int denom = 6; // 3! int sign = -1; for (int j=1; j<=terms; j++) { value += sign * numer / denom; numer *= x[i] * x[i]; denom *= (2*j+2) * (2*j+3); sign *= -1; } result[i] = value; } }

Review: four, 8-wide SIMD cores

slide-80
SLIDE 80

void sinx(int N, int terms, float* x, float* result) { cilk_for (int i=0; i<N; i++) { float value = x[i]; float numer = x[i] * x[i] * x[i]; int denom = 6; // 3! int sign = -1; for (int j=1; j<=terms; j++) { value += sign * numer / denom; numer *= x[i] * x[i]; denom *= (2*j+2) * (2*j+3); sign *= -1; } result[i] = value; } }

Review: four SIMD, multi-threaded cores

Observation: memory operations have very long latency
Solution: hide the latency of loading data for one iteration by executing arithmetic instructions from other iterations

Fetch/ Decode

Memory load Memory store

Execution Context Execution Context

Fetch/ Decode

Execution Context Execution Context

Fetch/ Decode

Execution Context Execution Context

Fetch/ Decode

Execution Context Execution Context

My multi-threaded, SIMD quad-core processor: executes one SIMD instruction per clock from one instruction stream on each core. But can switch to processing the other instruction stream when faced with a stall.

slide-81
SLIDE 81

Summary: four superscalar, SIMD, multi-threaded cores

Execution Context Execution Context

Fetch/ Decode Fetch/ Decode

SIMD Exec 2

Exec 1

Execution Context Execution Context

Fetch/ Decode Fetch/ Decode

SIMD Exec 2

Exec 1

Execution Context Execution Context

Fetch/ Decode Fetch/ Decode

SIMD Exec 2

Exec 1

Execution Context Execution Context

Fetch/ Decode Fetch/ Decode

SIMD Exec 2

Exec 1

My multi-threaded, superscalar, SIMD quad-core processor: executes up to two instructions per clock from one instruction stream on each core (in this example: one SIMD instruction + one scalar instruction). Processor can switch to execute the other instruction stream when faced with stall.

slide-82
SLIDE 82

Connecting it all together

A simple quad-core processor:

Execution Context Execution Context

Fetch/ Decode Fetch/ Decode

SIMD Exec 2

Exec 1 L1 Cache L2 Cache

Execution Context Execution Context

Fetch/ Decode Fetch/ Decode

SIMD Exec 2

Exec 1 L1 Cache L2 Cache

Execution Context Execution Context

Fetch/ Decode Fetch/ Decode

SIMD Exec 2

Exec 1 L1 Cache L2 Cache

Execution Context Execution Context

Fetch/ Decode Fetch/ Decode

SIMD Exec 2

Exec 1 L1 Cache L2 Cache L3 Cache Memory Controller

Memory Bus (to DRAM) On-chip interconnect

Four cores, two-way multi-threading per core (max eight threads active on chip at once), up to two instructions per clock per core (one of those instructions is 8-wide SIMD)