CPU-specific optimization



1

CPU-specific optimization. Example of a target CPU core: ARM Cortex-M4F core inside the LM4F120H5QR microcontroller in the Stellaris LM4F120 Launchpad.

2

Example of a function that we want to optimize: adding 1000 integers mod 2^32. Reference implementation:

int sum(int *x)
{
  int result = 0;
  int i;
  for (i = 0;i < 1000;++i)
    result += x[i];
  return result;
}


3

Counting cycles:

static volatile unsigned int *const DWT_CYCCNT = (void *) 0xE0001004;
...
int beforesum = *DWT_CYCCNT;
int result = sum(x);
int aftersum = *DWT_CYCCNT;
UARTprintf("sum %d %d\n", result, aftersum - beforesum);

Output shows 8012 cycles. Change 1000 to 500: 4012.
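
The slides assume the cycle counter is already running. On Cortex-M4 parts it must be switched on first via the standard ARMv7-M debug registers; a minimal setup sketch (register addresses taken from the ARMv7-M memory map; verify against your board's documentation before relying on them):

```c
/* Hardware-specific sketch: enable the DWT cycle counter on a Cortex-M4. */
static volatile unsigned int *const DEMCR      = (void *) 0xE000EDFC;
static volatile unsigned int *const DWT_CTRL   = (void *) 0xE0001000;
static volatile unsigned int *const DWT_CYCCNT = (void *) 0xE0001004;

void cyccnt_enable(void)
{
  *DEMCR |= 1u << 24;   /* TRCENA: turn on the DWT/ITM blocks */
  *DWT_CYCCNT = 0;      /* start counting from zero */
  *DWT_CTRL |= 1u;      /* CYCCNTENA: enable the cycle counter */
}
```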


4

“Okay, 8 cycles per addition. Um, are microcontrollers really this slow at addition?”

Bad approach: Apply random “optimizations” (and tweak compiler options) until you get bored/frustrated. Keep the fastest results.

Try -Os: 8012 cycles. Try -O1: 8012 cycles. Try -O2: 8012 cycles. Try -O3: 8012 cycles.


5

Try moving the pointer:

int sum(int *x)
{
  int result = 0;
  int i;
  for (i = 0;i < 1000;++i)
    result += *x++;
  return result;
}

8010 cycles.


6

Try counting down:

int sum(int *x)
{
  int result = 0;
  int i;
  for (i = 1000;i > 0;--i)
    result += *x++;
  return result;
}

8010 cycles.


7

Try using an end pointer:

int sum(int *x)
{
  int result = 0;
  int *y = x + 1000;
  while (x != y)
    result += *x++;
  return result;
}

8010 cycles.


8

Back to original. Try unrolling:

int sum(int *x)
{
  int result = 0;
  int i;
  for (i = 0;i < 1000;i += 2) {
    result += x[i];
    result += x[i + 1];
  }
  return result;
}

5016 cycles.


9

int sum(int *x)
{
  int result = 0;
  int i;
  for (i = 0;i < 1000;i += 5) {
    result += x[i];
    result += x[i + 1];
    result += x[i + 2];
    result += x[i + 3];
    result += x[i + 4];
  }
  return result;
}


10

4016 cycles. Are we done now?

Most random “optimizations” that we tried seem useless. Can spend time trying more. Does frustration level tell us that we’re close to optimal?

Good approach: Figure out lower bound for cycles spent on arithmetic etc. Understand gap between lower bound and observed time. Let’s try this approach.


11

Find “ARM Cortex-M4 Processor Technical Reference Manual”. Rely on Wikipedia comment that M4F = M4 + floating-point unit.

Manual says that Cortex-M4 “implements the ARMv7E-M architecture profile”. Points to the “ARMv7-M Architecture Reference Manual”, which defines instructions: e.g., “ADD” for 32-bit addition.

First manual says that ADD takes just 1 cycle.


12

Inputs and output of ADD are “integer registers”. ARMv7-M has 16 integer registers, including special-purpose “stack pointer” and “program counter”.

Each element of the x array needs to be “loaded” into a register. Basic load instruction: LDR. Manual says 2 cycles but adds a note about “pipelining”. Then more explanation: if the next instruction is also LDR (with address not based on first LDR) then it saves 1 cycle.


13

n consecutive LDRs take only n + 1 cycles (“multiple LDRs can be pipelined together”). Can achieve this speed in other ways (LDRD, LDM) but nothing seems faster.

Lower bound for n LDR + n ADD: 2n + 1 cycles, including n cycles of arithmetic. Why observed time is higher: non-consecutive LDRs; costs of manipulating i.


14

2281 cycles using ldr.w:

y = x + 4000
p = x
result = 0
loop:
xi9 = *(uint32 *) (p + 76)
xi8 = *(uint32 *) (p + 72)
xi7 = *(uint32 *) (p + 68)
xi6 = *(uint32 *) (p + 64)
xi5 = *(uint32 *) (p + 60)
xi4 = *(uint32 *) (p + 56)
xi3 = *(uint32 *) (p + 52)
xi2 = *(uint32 *) (p + 48)

slide-61
SLIDE 61

13

consecutive LDRs

  • nly n + 1 cycles

re multiple LDRs can be elined together”). achieve this speed

  • ther ways (LDRD, LDM)

nothing seems faster. bound for n LDR + n ADD: cycles, including cycles of arithmetic.

  • bserved time is higher:

non-consecutive LDRs;

  • f manipulating i.

14

2281 cycles using ldr.w:

y = x + 4000 p = x result = 0 loop: xi9 = *(uint32 *) (p + 76) xi8 = *(uint32 *) (p + 72) xi7 = *(uint32 *) (p + 68) xi6 = *(uint32 *) (p + 64) xi5 = *(uint32 *) (p + 60) xi4 = *(uint32 *) (p + 56) xi3 = *(uint32 *) (p + 52) xi2 = *(uint32 *) (p + 48) xi1 = xi0 = result result result result result result result result result result xi9 = xi8 = xi7 =

slide-62
SLIDE 62

13

Rs cycles LDRs can be together”). speed (LDRD, LDM) seems faster. n LDR + n ADD: including rithmetic. time is higher: LDRs; manipulating i.

14

2281 cycles using ldr.w:

y = x + 4000 p = x result = 0 loop: xi9 = *(uint32 *) (p + 76) xi8 = *(uint32 *) (p + 72) xi7 = *(uint32 *) (p + 68) xi6 = *(uint32 *) (p + 64) xi5 = *(uint32 *) (p + 60) xi4 = *(uint32 *) (p + 56) xi3 = *(uint32 *) (p + 52) xi2 = *(uint32 *) (p + 48) xi1 = *(uint32 xi0 = *(uint32 result += xi9 result += xi8 result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 result += xi1 result += xi0 xi9 = *(uint32 xi8 = *(uint32 xi7 = *(uint32

slide-63
SLIDE 63

13

be LDM) n ADD: higher:

14

2281 cycles using ldr.w:

y = x + 4000 p = x result = 0 loop: xi9 = *(uint32 *) (p + 76) xi8 = *(uint32 *) (p + 72) xi7 = *(uint32 *) (p + 68) xi6 = *(uint32 *) (p + 64) xi5 = *(uint32 *) (p + 60) xi4 = *(uint32 *) (p + 56) xi3 = *(uint32 *) (p + 52) xi2 = *(uint32 *) (p + 48) xi1 = *(uint32 *) (p + xi0 = *(uint32 *) (p + result += xi9 result += xi8 result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 result += xi1 result += xi0 xi9 = *(uint32 *) (p + xi8 = *(uint32 *) (p + xi7 = *(uint32 *) (p +

slide-64
SLIDE 64

14

2281 cycles using ldr.w:

y = x + 4000
p = x
result = 0
loop:
xi9 = *(uint32 *) (p + 76)
xi8 = *(uint32 *) (p + 72)
xi7 = *(uint32 *) (p + 68)
xi6 = *(uint32 *) (p + 64)
xi5 = *(uint32 *) (p + 60)
xi4 = *(uint32 *) (p + 56)
xi3 = *(uint32 *) (p + 52)
xi2 = *(uint32 *) (p + 48)

15

xi1 = *(uint32 *) (p + 44)
xi0 = *(uint32 *) (p + 40)
result += xi9
result += xi8
result += xi7
result += xi6
result += xi5
result += xi4
result += xi3
result += xi2
result += xi1
result += xi0
xi9 = *(uint32 *) (p + 36)
xi8 = *(uint32 *) (p + 32)
xi7 = *(uint32 *) (p + 28)

slide-68
SLIDE 68

16

xi6 = *(uint32 *) (p + 24)
xi5 = *(uint32 *) (p + 20)
xi4 = *(uint32 *) (p + 16)
xi3 = *(uint32 *) (p + 12)
xi2 = *(uint32 *) (p + 8)
xi1 = *(uint32 *) (p + 4)
xi0 = *(uint32 *) p; p += 160
result += xi9
result += xi8
result += xi7
result += xi6
result += xi5
result += xi4
result += xi3
result += xi2

slide-72
SLIDE 72

17

result += xi1
result += xi0
xi9 = *(uint32 *) (p - 4)
xi8 = *(uint32 *) (p - 8)
xi7 = *(uint32 *) (p - 12)
xi6 = *(uint32 *) (p - 16)
xi5 = *(uint32 *) (p - 20)
xi4 = *(uint32 *) (p - 24)
xi3 = *(uint32 *) (p - 28)
xi2 = *(uint32 *) (p - 32)
xi1 = *(uint32 *) (p - 36)
xi0 = *(uint32 *) (p - 40)
result += xi9
result += xi8
result += xi7

slide-76
SLIDE 76

18

result += xi6
result += xi5
result += xi4
result += xi3
result += xi2
result += xi1
result += xi0
xi9 = *(uint32 *) (p - 44)
xi8 = *(uint32 *) (p - 48)
xi7 = *(uint32 *) (p - 52)
xi6 = *(uint32 *) (p - 56)
xi5 = *(uint32 *) (p - 60)
xi4 = *(uint32 *) (p - 64)
xi3 = *(uint32 *) (p - 68)
xi2 = *(uint32 *) (p - 72)

slide-80
SLIDE 80

19

xi1 = *(uint32 *) (p - 76)
xi0 = *(uint32 *) (p - 80)
result += xi9
result += xi8
result += xi7
result += xi6
result += xi5
result += xi4
result += xi3
result += xi2
result += xi1
result += xi0
=? p - y
goto loop if !=

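The unrolled listing above can be approximated in plain C. This is a sketch of the technique, not the slides' exact machine code: each loop body loads consecutive words into separate variables (so the LDRs can pipeline), and the counter i is replaced by an end-pointer comparison, mirroring the "=? p - y / goto loop if !=" test. The function name sum_unrolled is mine, not from the slides.

```c
#include <stdint.h>

/* Sum 1000 32-bit words, unrolled by 10: consecutive loads can be
   pipelined, and the per-element cost of manipulating a counter i
   is amortized away.  The end test compares p against y = x + 1000,
   as the slides do with "=? p - y". */
uint32_t sum_unrolled(const uint32_t *x)
{
    const uint32_t *p = x;
    const uint32_t *y = x + 1000;   /* one past the last element */
    uint32_t result = 0;
    do {
        uint32_t xi0 = p[0], xi1 = p[1], xi2 = p[2], xi3 = p[3], xi4 = p[4];
        uint32_t xi5 = p[5], xi6 = p[6], xi7 = p[7], xi8 = p[8], xi9 = p[9];
        result += xi0; result += xi1; result += xi2; result += xi3;
        result += xi4; result += xi5; result += xi6; result += xi7;
        result += xi8; result += xi9;
        p += 10;
    } while (p != y);               /* goto loop if != */
    return result;
}
```

A compiler may or may not emit the consecutive-LDR pattern from this source; the slides' point is precisely that the hand-scheduled version is what guarantees it.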
slide-85
SLIDE 85

20

Wikipedia: “By the late 1990s for even performance sensitive code, optimizing compilers exceeded the performance of human experts.”

Reality: The fastest software today relies on human experts understanding the CPU.

  • Cannot trust compiler to optimize instruction selection.
  • Cannot trust compiler to optimize instruction scheduling.
  • Cannot trust compiler to optimize register allocation.
slide-90
SLIDE 90

21

The big picture: CPUs are evolving farther and farther away from naive models of CPUs.

Minor optimization challenges:

  • Pipelining.
  • Superscalar processing.

Major optimization challenges:

  • Vectorization.
  • Many threads; many cores.
  • The memory hierarchy; the ring; the mesh.
  • Larger-scale parallelism.
  • Larger-scale networking.
slide-94
SLIDE 94

22

CPU design in a nutshell

[Circuit diagram: inputs f0, f1, g0, g1 feed a network of ∧ gates whose outputs are h0, h1, h2, h3.]

Gates ∧ : a, b → 1 − ab computing product h0 + 2h1 + 4h2 + 8h3 of integers f0 + 2f1, g0 + 2g1.
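The slide's single gate type, ∧ : a, b → 1 − ab on bits, is a NAND. As a sanity check, here is the 2-bit multiplier expressed in C using only that gate; the internal wiring (a standard 2×2 array multiplier with one half-adder stage) is my own reconstruction, since the slide omits the details.

```c
/* The slide's only gate: a, b -> 1 - ab on bits, i.e. NAND. */
int nand(int a, int b) { return 1 - a * b; }
int not_(int a)        { return nand(a, a); }
int and_(int a, int b) { return not_(nand(a, b)); }
int xor_(int a, int b) {
    int n = nand(a, b);
    return nand(nand(a, n), nand(b, n));
}

/* Multiply f = f0 + 2*f1 by g = g0 + 2*g1, producing
   h0 + 2*h1 + 4*h2 + 8*h3, using NAND-derived gates only. */
int mul2(int f, int g)
{
    int f0 = f & 1, f1 = (f >> 1) & 1;
    int g0 = g & 1, g1 = (g >> 1) & 1;
    int p00 = and_(f0, g0), p01 = and_(f0, g1);   /* partial products */
    int p10 = and_(f1, g0), p11 = and_(f1, g1);
    int h0 = p00;
    int h1 = xor_(p01, p10);
    int c  = and_(p01, p10);      /* carry out of the middle column */
    int h2 = xor_(p11, c);
    int h3 = and_(p11, c);
    return h0 + 2*h1 + 4*h2 + 8*h3;
}
```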
slide-99
SLIDE 99

23

Electricity takes time to percolate through wires and gates. If f0, f1, g0, g1 are stable then h0, h1, h2, h3 are stable a few moments later.

Build circuit with more gates to multiply (e.g.) 32-bit integers:

[Circuit diagram of a 32-bit multiplier. (Details omitted.)]

slide-105
SLIDE 105

24

Build circuit to compute 32-bit integer ri given 4-bit integer i and 32-bit integers r0, r1, …, r15: “register read”.

Build circuit for “register write”: r0, …, r15, s, i → r′0, …, r′15 where r′j = rj except r′i = s.

Build circuit for addition. Etc.

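Behaviourally, the two circuits are an indexed read and an indexed write of a 16-entry register file: the read is a 16-way multiplexer, and the write recomputes all sixteen outputs, changing only entry i. A C model of what the circuits compute (function names are mine):

```c
#include <stdint.h>

/* "register read": given 4-bit i and r0..r15, produce ri.
   Hardware realizes this as a 16-way multiplexer on i. */
uint32_t reg_read(const uint32_t r[16], unsigned i)
{
    return r[i & 15];
}

/* "register write": r0..r15, s, i -> r'0..r'15
   where r'j = rj except r'i = s.  Every output is driven;
   only the selected entry takes the new value s. */
void reg_write(uint32_t rp[16], const uint32_t r[16],
               uint32_t s, unsigned i)
{
    for (unsigned j = 0; j < 16; ++j)
        rp[j] = (j == (i & 15)) ? s : r[j];
}
```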
slide-109
SLIDE 109

25

r0, …, r15, i, j, k → r′0, …, r′15 where r′ℓ = rℓ except r′i = rj rk:

[Circuit diagram: register read, register read → multiplier → register write.]

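The whole datapath above — two register reads feeding a multiplier, whose product goes into a register write — is one pure function on the register state. A C model of that step (the name step_mul is mine):

```c
#include <stdint.h>

/* r0..r15, i, j, k -> r'0..r'15 where r'l = rl except r'i = rj*rk.
   Two register reads feed the multiplier; the product is written
   back through the register-write circuit. */
void step_mul(uint32_t rp[16], const uint32_t r[16],
              unsigned i, unsigned j, unsigned k)
{
    uint32_t a = r[j & 15];          /* register read */
    uint32_t b = r[k & 15];          /* register read */
    for (unsigned l = 0; l < 16; ++l)
        rp[l] = r[l];
    rp[i & 15] = a * b;              /* register write */
}
```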
slide-113
SLIDE 113

25

r0; : : : ; r15; i; j; k → r′

0; : : : ; r′ 15

where r′

‘ = r‘ except r′ i = rjrk:

register read register read

⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄

register write

26

Add more flexibility. More arithmetic: replace (i; j; k) with (“×”; i; j; k) and (“+”; i; j; k) and more options.

slide-114
SLIDE 114

25

r0; : : : ; r15; i; j; k → r′

0; : : : ; r′ 15

where r′

‘ = r‘ except r′ i = rjrk:

register read register read

⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄

register write

26

Add more flexibility. More arithmetic: replace (i; j; k) with (“×”; i; j; k) and (“+”; i; j; k) and more options. More (but slower) storage: “load” from and “store” to larger “RAM” arrays.

slide-115
SLIDE 115

25

r0; : : : ; r15; i; j; k → r′

0; : : : ; r′ 15

where r′

‘ = r‘ except r′ i = rjrk:

register read register read

⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄

register write

26

Add more flexibility. More arithmetic: replace (i; j; k) with (“×”; i; j; k) and (“+”; i; j; k) and more options. More (but slower) storage: “load” from and “store” to larger “RAM” arrays. “Instruction fetch”: p → op; ip; jp; kp; p′.

slide-116
SLIDE 116

25

r0; : : : ; r15; i; j; k → r′

0; : : : ; r′ 15

where r′

‘ = r‘ except r′ i = rjrk:

register read register read

⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄

register write

26

Add more flexibility. More arithmetic: replace (i; j; k) with (“×”; i; j; k) and (“+”; i; j; k) and more options. More (but slower) storage: “load” from and “store” to larger “RAM” arrays. “Instruction fetch”: p → op; ip; jp; kp; p′. “Instruction decode”: decompression of compressed format for op; ip; jp; kp; p′.

slide-117
SLIDE 117

25

; r15; i; j; k → r′

0; : : : ; r′ 15

r′

‘ = r‘ except r′ i = rjrk:

[circuit diagram: two register-read ports feed a multiplier; the product goes to the register-write port]

26

Add more flexibility. More arithmetic: replace (i, j, k) with (“×”, i, j, k) and (“+”, i, j, k) and more options. More (but slower) storage: “load” from and “store” to larger “RAM” arrays. “Instruction fetch”: p → op, ip, jp, kp, p′. “Instruction decode”: decompression of compressed format for op, ip, jp, kp, p′. Build “flip-flops” storing (p, r0, …, r15). Hook (p, r0, …, r15) flip-flops into circuit inputs. Hook outputs (p′, r′0, …, r′15) into the same flip-flops. At each “clock tick”, flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops.

slide-118
SLIDE 118

25

r0, …, r15, i, j, k → r′0, …, r′15, where r′ℓ = rℓ except r′i = rj·rk:

[circuit diagram: two register-read ports feed a multiplier; the product goes to the register-write port]

26

Add more flexibility. More arithmetic: replace (i, j, k) with (“×”, i, j, k) and (“+”, i, j, k) and more options. More (but slower) storage: “load” from and “store” to larger “RAM” arrays. “Instruction fetch”: p → op, ip, jp, kp, p′. “Instruction decode”: decompression of compressed format for op, ip, jp, kp, p′. Build “flip-flops” storing (p, r0, …, r15). Hook (p, r0, …, r15) flip-flops into circuit inputs. Hook outputs (p′, r′0, …, r′15) into the same flip-flops. At each “clock tick”, flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops.

slide-119
SLIDE 119

25

r0, …, r15, i, j, k → r′0, …, r′15, where r′ℓ = rℓ except r′i = rj·rk:

[circuit diagram: two register-read ports feed a multiplier; the product goes to the register-write port]

26

Add more flexibility. More arithmetic: replace (i, j, k) with (“×”, i, j, k) and (“+”, i, j, k) and more options. More (but slower) storage: “load” from and “store” to larger “RAM” arrays. “Instruction fetch”: p → op, ip, jp, kp, p′. “Instruction decode”: decompression of compressed format for op, ip, jp, kp, p′. Build “flip-flops” storing (p, r0, …, r15). Hook (p, r0, …, r15) flip-flops into circuit inputs. Hook outputs (p′, r′0, …, r′15) into the same flip-flops. At each “clock tick”, flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops.

slide-120
SLIDE 120

26

Add more flexibility. More arithmetic: replace (i, j, k) with (“×”, i, j, k) and (“+”, i, j, k) and more options. More (but slower) storage: “load” from and “store” to larger “RAM” arrays. “Instruction fetch”: p → op, ip, jp, kp, p′. “Instruction decode”: decompression of compressed format for op, ip, jp, kp, p′.

27

Build “flip-flops” storing (p, r0, …, r15). Hook (p, r0, …, r15) flip-flops into circuit inputs. Hook outputs (p′, r′0, …, r′15) into the same flip-flops. At each “clock tick”, flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops.

slide-121
SLIDE 121

26

Add more flexibility. More arithmetic: replace (i, j, k) with (“×”, i, j, k) and (“+”, i, j, k) and more options. More (but slower) storage: “load” from and “store” to larger “RAM” arrays. “Instruction fetch”: p → op, ip, jp, kp, p′. “Instruction decode”: decompression of compressed format for op, ip, jp, kp, p′.

27

Build “flip-flops” storing (p, r0, …, r15). Hook (p, r0, …, r15) flip-flops into circuit inputs. Hook outputs (p′, r′0, …, r′15) into the same flip-flops. At each “clock tick”, flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops.

Now have semi-flexible CPU:

[diagram: flip-flops feed insn fetch → insn decode → two register reads → multiplier → register write]

Further flexibility is useful: e.g., logic instructions.

slide-122
SLIDE 122

26

Add more flexibility. More arithmetic: replace (i, j, k) with (“×”, i, j, k) and (“+”, i, j, k) and more options. More (but slower) storage: “load” from and “store” to larger “RAM” arrays. “Instruction fetch”: p → op, ip, jp, kp, p′. “Instruction decode”: decompression of compressed format for op, ip, jp, kp, p′.

27

Build “flip-flops” storing (p, r0, …, r15). Hook (p, r0, …, r15) flip-flops into circuit inputs. Hook outputs (p′, r′0, …, r′15) into the same flip-flops. At each “clock tick”, flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops.

Now have semi-flexible CPU:

[diagram: flip-flops feed insn fetch → insn decode → two register reads → multiplier → register write]

Further flexibility is useful: e.g., logic instructions.

slide-123
SLIDE 123

26


27

Build “flip-flops” storing (p, r0, …, r15). Hook (p, r0, …, r15) flip-flops into circuit inputs. Hook outputs (p′, r′0, …, r′15) into the same flip-flops. At each “clock tick”, flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops.

Now have semi-flexible CPU:

[diagram: flip-flops feed insn fetch → insn decode → two register reads → multiplier → register write]

Further flexibility is useful: e.g., logic instructions.

slide-124
SLIDE 124

27

Build “flip-flops” storing (p, r0, …, r15). Hook (p, r0, …, r15) flip-flops into circuit inputs. Hook outputs (p′, r′0, …, r′15) into the same flip-flops. At each “clock tick”, flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops.

28

Now have semi-flexible CPU:

[diagram: flip-flops feed insn fetch → insn decode → two register reads → multiplier → register write]

Further flexibility is useful: e.g., logic instructions.

slide-125
SLIDE 125

27

Build “flip-flops” storing (p, r0, …, r15). Hook (p, r0, …, r15) flip-flops into circuit inputs. Hook outputs (p′, r′0, …, r′15) into the same flip-flops. At each “clock tick”, flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops.

28

Now have semi-flexible CPU:

flip-flops insn fetch insn decode register read register read

⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ❄ ❄ ❄ ❄ ❄ ❄ ❄

register write

Further flexibility is useful: e.g., logic instructions. “Pipelining” allows faster clock:

[diagram: pipelined CPU with flip-flops between stages — insn fetch (stage 1), insn decode (stage 2), register read (stage 3), multiply (stage 4), register write (stage 5)]

slide-126
SLIDE 126

27

Build “flip-flops” storing (p, r0, …, r15). Hook (p, r0, …, r15) flip-flops into circuit inputs. Hook outputs (p′, r′0, …, r′15) into the same flip-flops. At each “clock tick”, flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops.

28

Now have semi-flexible CPU:

flip-flops insn fetch insn decode register read register read

⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ❄ ❄ ❄ ❄ ❄ ❄ ❄

register write

Further flexibility is useful: e.g., logic instructions. “Pipelining” allows faster clock:

[diagram: pipelined CPU with flip-flops between stages — insn fetch (stage 1), insn decode (stage 2), register read (stage 3), multiply (stage 4), register write (stage 5)]

slide-127
SLIDE 127

27

Build “flip-flops” storing (p, r0, …, r15). Hook (p, r0, …, r15) flip-flops into circuit inputs. Hook outputs (p′, r′0, …, r′15) into the same flip-flops. At each “clock tick”, flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops.

28

Now have semi-flexible CPU:

flip-flops insn fetch insn decode register read register read

⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ❄ ❄ ❄ ❄ ❄ ❄ ❄

register write

Further flexibility is useful: e.g., logic instructions. “Pipelining” allows faster clock:

[diagram: pipelined CPU with flip-flops between stages — insn fetch (stage 1), insn decode (stage 2), register read (stage 3), multiply (stage 4), register write (stage 5)]

slide-128
SLIDE 128

28

Now have semi-flexible CPU:

flip-flops insn fetch insn decode register read register read

⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ❄ ❄ ❄ ❄ ❄ ❄ ❄

register write

Further flexibility is useful: e.g., logic instructions.

29

“Pipelining” allows faster clock:

[diagram: pipelined CPU with flip-flops between stages — insn fetch (stage 1), insn decode (stage 2), register read (stage 3), multiply (stage 4), register write (stage 5)]

slide-129
SLIDE 129

28

Now have semi-flexible CPU:

[diagram: flip-flops feed insn fetch → insn decode → two register reads → multiplier → register write]

Further flexibility is useful: e.g., logic instructions.

29

“Pipelining” allows faster clock:

[diagram: pipelined CPU with flip-flops between stages — insn fetch (stage 1), insn decode (stage 2), register read (stage 3), multiply (stage 4), register write (stage 5)]

Goal: Stage n handles instruction one tick after stage n − 1. Instruction fetch reads next instruction, feeds p′ back, sends instruction. After next clock tick, instruction decode uncompresses this instruction, while instruction fetch reads another instruction. Some extra flip-flop area. Also extra area to preserve instruction semantics: e.g., stall on read-after-write.

slide-130
SLIDE 130

28

Now have semi-flexible CPU:

[diagram: flip-flops feed insn fetch → insn decode → two register reads → multiplier → register write]

Further flexibility is useful: e.g., logic instructions.

29

“Pipelining” allows faster clock:

[diagram: pipelined CPU with flip-flops between stages — insn fetch (stage 1), insn decode (stage 2), register read (stage 3), multiply (stage 4), register write (stage 5)]

Goal: Stage n handles instruction one tick after stage n − 1. Instruction fetch reads next instruction, feeds p′ back, sends instruction. After next clock tick, instruction decode uncompresses this instruction, while instruction fetch reads another instruction. Some extra flip-flop area. Also extra area to preserve instruction semantics: e.g., stall on read-after-write.

slide-131
SLIDE 131

28


29

“Pipelining” allows faster clock:

[diagram: pipelined CPU with flip-flops between stages — insn fetch (stage 1), insn decode (stage 2), register read (stage 3), multiply (stage 4), register write (stage 5)]

Goal: Stage n handles instruction one tick after stage n − 1. Instruction fetch reads next instruction, feeds p′ back, sends instruction. After next clock tick, instruction decode uncompresses this instruction, while instruction fetch reads another instruction. Some extra flip-flop area. Also extra area to preserve instruction semantics: e.g., stall on read-after-write.

slide-132
SLIDE 132

29

“Pipelining” allows faster clock:

[diagram: pipelined CPU with flip-flops between stages — insn fetch (stage 1), insn decode (stage 2), register read (stage 3), multiply (stage 4), register write (stage 5)]

30

Goal: Stage n handles instruction one tick after stage n − 1. Instruction fetch reads next instruction, feeds p′ back, sends instruction. After next clock tick, instruction decode uncompresses this instruction, while instruction fetch reads another instruction. Some extra flip-flop area. Also extra area to preserve instruction semantics: e.g., stall on read-after-write.
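The cost of a read-after-write stall can be illustrated with a toy timing model. This is a hypothetical sketch (forwarding paths are ignored, stage numbers are assumptions, and `pipeline_cycles` is an invented name), not a model of any particular CPU:

```python
# Toy model of a 5-stage pipeline: an instruction issues one tick after its
# predecessor, unless it reads a register that an in-flight predecessor
# writes, in which case it stalls until that write stage has finished.
def pipeline_cycles(instrs, depth=5, read_stage=3, write_stage=5):
    """instrs: list of (dest_reg, (src_regs, ...)). Returns total clock ticks."""
    issue = []                                   # tick at which each insn issues
    for n, (dest, srcs) in enumerate(instrs):
        t = issue[-1] + 1 if issue else 0
        for m in range(max(0, n - depth), n):    # only recent insns are in flight
            if instrs[m][0] in srcs:
                # stall: wait until insn m's register write has completed
                t = max(t, issue[m] + write_stage - read_stage + 1)
        issue.append(t)
    return issue[-1] + depth                     # last insn drains the pipeline

independent = [(0, (1, 2)), (3, (4, 5)), (6, (7, 8))]   # no data dependencies
dependent   = [(0, (1, 2)), (3, (0, 4)), (5, (3, 6))]   # chain of RAW hazards
print(pipeline_cycles(independent), pipeline_cycles(dependent))   # 7 11
```

The dependent chain takes noticeably longer, which is why instruction scheduling (interleaving independent work) matters for pipelined targets.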

slide-133
SLIDE 133

29

“Pipelining” allows faster clock:

[diagram: pipelined CPU with flip-flops between stages — insn fetch (stage 1), insn decode (stage 2), register read (stage 3), multiply (stage 4), register write (stage 5)]

30

Goal: Stage n handles instruction one tick after stage n − 1. Instruction fetch reads next instruction, feeds p′ back, sends instruction. After next clock tick, instruction decode uncompresses this instruction, while instruction fetch reads another instruction. Some extra flip-flop area. Also extra area to preserve instruction semantics: e.g., stall on read-after-write.

“Superscalar” processing:

[diagram: superscalar CPU with two insn-fetch units, two insn-decode units, four register-read ports, two multipliers, and two register-write ports]

slide-134
SLIDE 134

29

“Pipelining” allows faster clock:

[diagram: pipelined CPU with flip-flops between stages — insn fetch (stage 1), insn decode (stage 2), register read (stage 3), multiply (stage 4), register write (stage 5)]

30

Goal: Stage n handles instruction one tick after stage n − 1. Instruction fetch reads next instruction, feeds p′ back, sends instruction. After next clock tick, instruction decode uncompresses this instruction, while instruction fetch reads another instruction. Some extra flip-flop area. Also extra area to preserve instruction semantics: e.g., stall on read-after-write.

“Superscalar” processing:

[diagram: superscalar CPU with two insn-fetch units, two insn-decode units, four register-read ports, two multipliers, and two register-write ports]

slide-135
SLIDE 135

29

“Pipelining” allows faster clock:

[diagram: pipelined CPU with flip-flops between stages — insn fetch (stage 1), insn decode (stage 2), register read (stage 3), multiply (stage 4), register write (stage 5)]

30

Goal: Stage n handles instruction one tick after stage n − 1. Instruction fetch reads next instruction, feeds p′ back, sends instruction. After next clock tick, instruction decode uncompresses this instruction, while instruction fetch reads another instruction. Some extra flip-flop area. Also extra area to preserve instruction semantics: e.g., stall on read-after-write.

“Superscalar” processing:

[diagram: superscalar CPU with two insn-fetch units, two insn-decode units, four register-read ports, two multipliers, and two register-write ports]

slide-136
SLIDE 136

30

Goal: Stage n handles instruction one tick after stage n − 1. Instruction fetch reads next instruction, feeds p′ back, sends instruction. After next clock tick, instruction decode uncompresses this instruction, while instruction fetch reads another instruction. Some extra flip-flop area. Also extra area to preserve instruction semantics: e.g., stall on read-after-write.

31

“Superscalar” processing:

[diagram: superscalar CPU with two insn-fetch units, two insn-decode units, four register-read ports, two multipliers, and two register-write ports]

slide-137
SLIDE 137

30

Goal: Stage n handles instruction one tick after stage n − 1. Instruction fetch reads next instruction, feeds p′ back, sends instruction. After next clock tick, instruction decode uncompresses this instruction, while instruction fetch reads another instruction. Some extra flip-flop area. Also extra area to preserve instruction semantics: e.g., stall on read-after-write.

31

“Superscalar” processing:

[diagram: superscalar CPU with two insn-fetch units, two insn-decode units, four register-read ports, two multipliers, and two register-write ports]

“Vector” processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM “NEON” has n = 4; Intel “AVX2” has n = 8; Intel “AVX-512” has n = 16; GPUs have larger n.

slide-138
SLIDE 138

30


31

“Superscalar” processing:

[diagram: superscalar CPU with two insn-fetch units, two insn-decode units, four register-read ports, two multipliers, and two register-write ports]

“Vector” processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM “NEON” has n = 4; Intel “AVX2” has n = 8; Intel “AVX-512” has n = 16; GPUs have larger n.

slide-139
SLIDE 139

30


31

“Superscalar” processing:

[diagram: superscalar CPU with two insn-fetch units, two insn-decode units, four register-read ports, two multipliers, and two register-write ports]

“Vector” processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM “NEON” has n = 4; Intel “AVX2” has n = 8; Intel “AVX-512” has n = 16; GPUs have larger n.

slide-140
SLIDE 140

31

“Superscalar” processing:

[diagram: superscalar CPU with two insn-fetch units, two insn-decode units, four register-read ports, two multipliers, and two register-write ports]

32

“Vector” processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM “NEON” has n = 4; Intel “AVX2” has n = 8; Intel “AVX-512” has n = 16; GPUs have larger n.

slide-141
SLIDE 141

31

“Superscalar” processing:

[diagram: superscalar CPU with two insn-fetch units, two insn-decode units, four register-read ports, two multipliers, and two register-write ports]

32

“Vector” processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM “NEON” has n = 4; Intel “AVX2” has n = 8; Intel “AVX-512” has n = 16; GPUs have larger n. n× speedup if n× arithmetic circuits, n× read/write circuits. Benefit: Amortizes insn circuits.

slide-142
SLIDE 142

31

“Superscalar” processing:

[diagram: superscalar CPU with two insn-fetch units, two insn-decode units, four register-read ports, two multipliers, and two register-write ports]

32

“Vector” processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM “NEON” has n = 4; Intel “AVX2” has n = 8; Intel “AVX-512” has n = 16; GPUs have larger n. n× speedup if n× arithmetic circuits, n× read/write circuits. Benefit: Amortizes insn circuits. Huge effect on higher-level algorithms and data structures.
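The amortization idea can be sketched in plain Python. This is an illustrative model of n-lane vector execution, not actual NEON or AVX code; the names `vadd` and `vector_sum` are invented for this sketch:

```python
# Model of n-lane vector execution: one "instruction" applies the same
# operation to n lanes at once, so fetch/decode work is amortized over
# n arithmetic results.
N = 4                                   # lane count, as in ARM NEON
MASK = 2**32 - 1                        # lanes hold 32-bit words

def vadd(a, b):
    """One vector instruction: N independent 32-bit additions."""
    return [(x + y) & MASK for x, y in zip(a, b)]

def vector_sum(xs):
    """Sum len(xs) integers using len(xs)//N vector adds plus a final reduce."""
    acc = [0] * N
    for off in range(0, len(xs), N):
        acc = vadd(acc, xs[off:off + N])    # one insn performs N additions
    return sum(acc) & MASK                  # horizontal reduction at the end

print(vector_sum(list(range(1, 1001))))     # 1 + 2 + ... + 1000 = 500500
```

With real SIMD hardware the inner loop issues one add instruction per N elements, which is exactly the amortization of instruction-fetch/decode circuits the slide describes.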

slide-143
SLIDE 143

31

“Superscalar” processing:

[diagram: superscalar CPU with two insn-fetch units, two insn-decode units, four register-read ports, two multipliers, and two register-write ports]

32

“Vector” processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM “NEON” has n = 4; Intel “AVX2” has n = 8; Intel “AVX-512” has n = 16; GPUs have larger n. n× speedup if n× arithmetic circuits, n× read/write circuits. Benefit: Amortizes insn circuits. Huge effect on higher-level algorithms and data structures.

Network on chip: the mesh. How expensive is sorting? Input: array of n numbers. Each number in {1, 2, …, n²}, represented in binary. Output: array of n numbers, in increasing order, represented in binary; same multiset as input.

slide-144
SLIDE 144

31


32

“Vector” processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM “NEON” has n = 4; Intel “AVX2” has n = 8; Intel “AVX-512” has n = 16; GPUs have larger n. n× speedup if n× arithmetic circuits, n× read/write circuits. Benefit: Amortizes insn circuits. Huge effect on higher-level algorithms and data structures.

Network on chip: the mesh. How expensive is sorting? Input: array of n numbers. Each number in {1, 2, …, n²}, represented in binary. Output: array of n numbers, in increasing order, represented in binary; same multiset as input.

slide-145
SLIDE 145

31


32

“Vector” processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM “NEON” has n = 4; Intel “AVX2” has n = 8; Intel “AVX-512” has n = 16; GPUs have larger n. n× speedup if n× arithmetic circuits, n× read/write circuits. Benefit: Amortizes insn circuits. Huge effect on higher-level algorithms and data structures.

Network on chip: the mesh. How expensive is sorting? Input: array of n numbers. Each number in {1, 2, …, n²}, represented in binary. Output: array of n numbers, in increasing order, represented in binary; same multiset as input.

slide-146
SLIDE 146

32

“Vector” processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM “NEON” has n = 4; Intel “AVX2” has n = 8; Intel “AVX-512” has n = 16; GPUs have larger n. n× speedup if n× arithmetic circuits, n× read/write circuits. Benefit: Amortizes insn circuits. Huge effect on higher-level algorithms and data structures.

33

Network on chip: the mesh. How expensive is sorting? Input: array of n numbers. Each number in {1, 2, …, n²}, represented in binary. Output: array of n numbers, in increasing order, represented in binary; same multiset as input.

slide-147
SLIDE 147

32

“Vector” processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM “NEON” has n = 4; Intel “AVX2” has n = 8; Intel “AVX-512” has n = 16; GPUs have larger n. n× speedup if n× arithmetic circuits, n× read/write circuits. Benefit: Amortizes insn circuits. Huge effect on higher-level algorithms and data structures.

33

Network on chip: the mesh. How expensive is sorting? Input: array of n numbers. Each number in {1, 2, …, n²}, represented in binary. Output: array of n numbers, in increasing order, represented in binary; same multiset as input. Metric: seconds used by circuit of area n^(1+o(1)). For simplicity assume n = 4^k.

slide-148
SLIDE 148

32

“Vector” processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM “NEON” has n = 4; Intel “AVX2” has n = 8; Intel “AVX-512” has n = 16; GPUs have larger n. n× speedup if n× arithmetic circuits, n× read/write circuits. Benefit: Amortizes insn circuits. Huge effect on higher-level algorithms and data structures.

33

Network on chip: the mesh. How expensive is sorting? Input: array of n numbers. Each number in {1, 2, …, n²}, represented in binary. Output: array of n numbers, in increasing order, represented in binary; same multiset as input. Metric: seconds used by circuit of area n^(1+o(1)). For simplicity assume n = 4^k.

Spread array across square mesh of n small cells, each of area n^(o(1)), with near-neighbor wiring:

[figure: √n × √n grid of cells with near-neighbor wires]

slide-149
SLIDE 149

32


33

Network on chip: the mesh. How expensive is sorting? Input: array of n numbers. Each number in {1, 2, …, n²}, represented in binary. Output: array of n numbers, in increasing order, represented in binary; same multiset as input. Metric: seconds used by circuit of area n^(1+o(1)). For simplicity assume n = 4^k.

Spread array across square mesh of n small cells, each of area n^(o(1)), with near-neighbor wiring:

[figure: √n × √n grid of cells with near-neighbor wires]

slide-150
SLIDE 150

32


33

Network on chip: the mesh. How expensive is sorting? Input: array of n numbers. Each number in {1, 2, …, n²}, represented in binary. Output: array of n numbers, in increasing order, represented in binary; same multiset as input. Metric: seconds used by circuit of area n^(1+o(1)). For simplicity assume n = 4^k.

Spread array across square mesh of n small cells, each of area n^(o(1)), with near-neighbor wiring:

[figure: √n × √n grid of cells with near-neighbor wires]

slide-151
SLIDE 151

33

Network on chip: the mesh. How expensive is sorting? Input: array of n numbers. Each number in {1, 2, …, n²}, represented in binary. Output: array of n numbers, in increasing order, represented in binary; same multiset as input. Metric: seconds used by circuit of area n^(1+o(1)). For simplicity assume n = 4^k.

34

Spread array across square mesh of n small cells, each of area n^(o(1)), with near-neighbor wiring:

[figure: √n × √n grid of cells with near-neighbor wires]

slide-152
SLIDE 152

33

Network on chip: the mesh. How expensive is sorting? Input: array of n numbers. Each number in {1, 2, …, n²}, represented in binary. Output: array of n numbers, in increasing order, represented in binary; same multiset as input. Metric: seconds used by circuit of area n^(1+o(1)). For simplicity assume n = 4^k.

34

Spread array across square mesh of n small cells, each of area n^(o(1)), with near-neighbor wiring:

[figure: √n × √n grid of cells with near-neighbor wires]

Sort row of n^(0.5) cells in n^(0.5+o(1)) seconds:

  • Sort each pair in parallel.

3 1 4 1 5 9 2 6 → 1 3 1 4 5 9 2 6

  • Sort alternate pairs in parallel.

1 3 1 4 5 9 2 6 → 1 1 3 4 5 2 9 6

  • Repeat until number of steps equals row length.

slide-153
SLIDE 153

33

Network on chip: the mesh. How expensive is sorting? Input: array of n numbers. Each number in {1, 2, …, n²}, represented in binary. Output: array of n numbers, in increasing order, represented in binary; same multiset as input. Metric: seconds used by circuit of area n^(1+o(1)). For simplicity assume n = 4^k.

34

Spread array across square mesh of n small cells, each of area n^(o(1)), with near-neighbor wiring:

[figure: √n × √n grid of cells with near-neighbor wires]

Sort row of n^(0.5) cells in n^(0.5+o(1)) seconds:

  • Sort each pair in parallel.

3 1 4 1 5 9 2 6 → 1 3 1 4 5 9 2 6

  • Sort alternate pairs in parallel.

1 3 1 4 5 9 2 6 → 1 1 3 4 5 2 9 6

  • Repeat until number of steps equals row length.

slide-154
SLIDE 154

33


34

Spread array across square mesh of n small cells, each of area n^(o(1)), with near-neighbor wiring:

[figure: √n × √n grid of cells with near-neighbor wires]

Sort row of n^(0.5) cells in n^(0.5+o(1)) seconds:

  • Sort each pair in parallel.

3 1 4 1 5 9 2 6 → 1 3 1 4 5 9 2 6

  • Sort alternate pairs in parallel.

1 3 1 4 5 9 2 6 → 1 1 3 4 5 2 9 6

  • Repeat until number of steps

equals row length.

slide-155
SLIDE 155

34

Spread array across square mesh of n small cells, each of area n^(o(1)), with near-neighbor wiring:

[figure: √n × √n grid of cells with near-neighbor wires]

35

Sort row of n^(0.5) cells in n^(0.5+o(1)) seconds:

  • Sort each pair in parallel.

3 1 4 1 5 9 2 6 → 1 3 1 4 5 9 2 6

  • Sort alternate pairs in parallel.

1 3 1 4 5 9 2 6 → 1 1 3 4 5 2 9 6

  • Repeat until number of steps

equals row length.
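The pair/alternate-pair scheme above is odd-even transposition sort; after as many steps as the row length, the row is sorted, and every step touches only adjacent cells, so a mesh does all compare-exchanges of a step in parallel. A minimal sequential sketch:

```python
# Odd-even transposition sort of one row. Even steps compare pairs
# (0,1),(2,3),...; odd steps compare alternate pairs (1,2),(3,4),...
# After n steps the row of n cells is sorted.
def oetsort(row):
    row = list(row)
    n = len(row)
    for step in range(n):
        start = step % 2
        for i in range(start, n - 1, 2):
            if row[i] > row[i + 1]:            # each compare-exchange is
                row[i], row[i + 1] = row[i + 1], row[i]   # independent → parallel on a mesh
    return row

print(oetsort([3, 1, 4, 1, 5, 9, 2, 6]))       # [1, 1, 2, 3, 4, 5, 6, 9]
```

The first two steps on [3, 1, 4, 1, 5, 9, 2, 6] reproduce exactly the two intermediate rows shown on the slide.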

slide-156
SLIDE 156

34

Spread array across square mesh of n small cells, each of area n^(o(1)), with near-neighbor wiring:

[figure: √n × √n grid of cells with near-neighbor wires]

35

Sort row of n^(0.5) cells in n^(0.5+o(1)) seconds:

  • Sort each pair in parallel.

3 1 4 1 5 9 2 6 → 1 3 1 4 5 9 2 6

  • Sort alternate pairs in parallel.

1 3 1 4 5 9 2 6 → 1 1 3 4 5 2 9 6

  • Repeat until number of steps

equals row length. Sort each row, in parallel, in a total of n^(0.5+o(1)) seconds.

slide-157
SLIDE 157

34

Spread array across square mesh of n small cells, each of area n^(o(1)), with near-neighbor wiring:

[figure: √n × √n grid of cells with near-neighbor wires]

35

Sort row of n^(0.5) cells in n^(0.5+o(1)) seconds:

  • Sort each pair in parallel.

3 1 4 1 5 9 2 6 → 1 3 1 4 5 9 2 6

  • Sort alternate pairs in parallel.

1 3 1 4 5 9 2 6 → 1 1 3 4 5 2 9 6

  • Repeat until number of steps

equals row length. Sort each row, in parallel, in a total of n^(0.5+o(1)) seconds.

Sort all n cells in n^(0.5+o(1)) seconds:

  • Recursively sort quadrants in parallel, if n > 1.
  • Sort each column in parallel.
  • Sort each row in parallel.
  • Sort each column in parallel.
  • Sort each row in parallel.

With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array.

slide-158
SLIDE 158

34

Spread array across square mesh of n small cells, each of area n^(o(1)), with near-neighbor wiring:

[figure: √n × √n grid of cells with near-neighbor wires]

35

Sort row of n^(0.5) cells in n^(0.5+o(1)) seconds:

  • Sort each pair in parallel.

3 1 4 1 5 9 2 6 → 1 3 1 4 5 9 2 6

  • Sort alternate pairs in parallel.

1 3 1 4 5 9 2 6 → 1 1 3 4 5 2 9 6

  • Repeat until number of steps

equals row length. Sort each row, in parallel, in a total of n^(0.5+o(1)) seconds.

Sort all n cells in n^(0.5+o(1)) seconds:

  • Recursively sort quadrants in parallel, if n > 1.
  • Sort each column in parallel.
  • Sort each row in parallel.
  • Sort each column in parallel.
  • Sort each row in parallel.

With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array.

slide-159
SLIDE 159

34


35

Sort row of n^(0.5) cells in n^(0.5+o(1)) seconds:

  • Sort each pair in parallel.

3 1 4 1 5 9 2 6 → 1 3 1 4 5 9 2 6

  • Sort alternate pairs in parallel.

1 3 1 4 5 9 2 6 → 1 1 3 4 5 2 9 6

  • Repeat until number of steps

equals row length. Sort each row, in parallel, in a total of n^(0.5+o(1)) seconds.

Sort all n cells in n^(0.5+o(1)) seconds:

  • Recursively sort quadrants in parallel, if n > 1.
  • Sort each column in parallel.
  • Sort each row in parallel.
  • Sort each column in parallel.
  • Sort each row in parallel.

With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array.

slide-160
SLIDE 160

35

Sort row of n^(0.5) cells in n^(0.5+o(1)) seconds:

  • Sort each pair in parallel.

3 1 4 1 5 9 2 6 → 1 3 1 4 5 9 2 6

  • Sort alternate pairs in parallel.

1 3 1 4 5 9 2 6 → 1 1 3 4 5 2 9 6

  • Repeat until number of steps

equals row length. Sort each row, in parallel, in a total of n^(0.5+o(1)) seconds.

36

Sort all n cells in n^(0.5+o(1)) seconds:

  • Recursively sort quadrants in parallel, if n > 1.

  • Sort each column in parallel.
  • Sort each row in parallel.
  • Sort each column in parallel.
  • Sort each row in parallel.

With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array.
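The recursive quadrant/column/row scheme above resembles Schimmler's mesh sort. As a runnable stand-in, here is the closely related shearsort, which likewise uses only whole-row and whole-column sorts and provably sorts an m × m mesh into boustrophedon (“snake”) order in O(log m) row/column phases; `shearsort` is an invented name for this sketch:

```python
# Shearsort: alternate (a) sorting rows in snake order (even rows left-to-
# right, odd rows right-to-left) and (b) sorting every column top-to-bottom.
# ceil(log2(m)) + 1 phases plus a final row phase sort the whole m x m mesh.
import math

def shearsort(grid):
    grid = [list(row) for row in grid]
    m = len(grid)
    for _ in range(math.ceil(math.log2(m)) + 1):
        for r in range(m):                     # rows: alternate directions
            grid[r].sort(reverse=(r % 2 == 1))
        for c in range(m):                     # columns: always increasing downward
            col = sorted(grid[r][c] for r in range(m))
            for r in range(m):
                grid[r][c] = col[r]
    for r in range(m):                         # final row phase finishes the snake
        grid[r].sort(reverse=(r % 2 == 1))
    return grid

pi_grid = [[3, 1, 4, 1], [5, 9, 2, 6], [5, 3, 5, 8], [9, 7, 9, 3]]
print(shearsort(pi_grid))
```

On the mesh, each row/column sort runs as a parallel odd-even transposition pass, so the whole thing fits the n^(0.5+o(1))-second budget; the quadrant recursion on the slide reduces the number of full-mesh phases further.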

slide-161
SLIDE 161

35

Sort row of n^(0.5) cells in n^(0.5+o(1)) seconds:

  • Sort each pair in parallel.

3 1 4 1 5 9 2 6 → 1 3 1 4 5 9 2 6

  • Sort alternate pairs in parallel.

1 3 1 4 5 9 2 6 → 1 1 3 4 5 2 9 6

  • Repeat until number of steps equals row length.

Sort each row, in parallel, in a total of n^(0.5+o(1)) seconds.

36

Sort all n cells in n^(0.5+o(1)) seconds:

  • Recursively sort quadrants in parallel, if n > 1.

  • Sort each column in parallel.
  • Sort each row in parallel.
  • Sort each column in parallel.
  • Sort each row in parallel.

With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array. For example, assume that this 8 × 8 array is in cells:

3 1 4 1 5 9 2 6
5 3 5 8 9 7 9 3
2 3 8 4 6 2 6 4
3 3 8 3 2 7 9 5
0 2 8 8 4 1 9 7
1 6 9 3 9 9 3 7
5 1 0 5 8 2 0 9
7 4 9 4 4 5 9 2

slide-162
SLIDE 162

35

Sort row of n^(0.5) cells in n^(0.5+o(1)) seconds:

  • Sort each pair in parallel.

3 1 4 1 5 9 2 6 → 1 3 1 4 5 9 2 6

  • Sort alternate pairs in parallel.

1 3 1 4 5 9 2 6 → 1 1 3 4 5 2 9 6

  • Repeat until number of steps equals row length.

Sort each row, in parallel, in a total of n^(0.5+o(1)) seconds.

36

Sort all n cells in n0:5+o(1) seconds:

  • Recursively sort quadrants

in parallel, if n > 1.

  • Sort each column in parallel.
  • Sort each row in parallel.
  • Sort each column in parallel.
  • Sort each row in parallel.

With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array. For example, assume that this 8 × 8 array is in cells:

3 1 4 1 5 9 2 6
5 3 5 8 9 7 9 3
2 3 8 4 6 2 6 4
3 3 8 3 2 7 9 5
0 2 8 8 4 1 9 7
1 6 9 3 9 9 3 7
5 1 0 5 8 2 0 9
7 4 9 4 4 5 9 2

slide-163
SLIDE 163

35

rallel. parallel. steps seconds.

36

Sort all n cells in n0:5+o(1) seconds:

  • Recursively sort quadrants

in parallel, if n > 1.

  • Sort each column in parallel.
  • Sort each row in parallel.
  • Sort each column in parallel.
  • Sort each row in parallel.

With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array. For example, assume that this 8 × 8 array is in cells:

3 1 4 1 5 9 2 6
5 3 5 8 9 7 9 3
2 3 8 4 6 2 6 4
3 3 8 3 2 7 9 5
0 2 8 8 4 1 9 7
1 6 9 3 9 9 3 7
5 1 0 5 8 2 0 9
7 4 9 4 4 5 9 2

slide-164
SLIDE 164

36

Sort all n cells in n^(0.5+o(1)) seconds:

  • Recursively sort quadrants in parallel, if n > 1.

  • Sort each column in parallel.
  • Sort each row in parallel.
  • Sort each column in parallel.
  • Sort each row in parallel.

With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array.

37

For example, assume that this 8 × 8 array is in cells:

3 1 4 1 5 9 2 6
5 3 5 8 9 7 9 3
2 3 8 4 6 2 6 4
3 3 8 3 2 7 9 5
0 2 8 8 4 1 9 7
1 6 9 3 9 9 3 7
5 1 0 5 8 2 0 9
7 4 9 4 4 5 9 2

slide-165
SLIDE 165

36

Sort all n cells in n^(0.5+o(1)) seconds:

  • Recursively sort quadrants in parallel, if n > 1.
  • Sort each column in parallel.
  • Sort each row in parallel.
  • Sort each column in parallel.
  • Sort each row in parallel.

With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array.

37

For example, assume that this 8 × 8 array is in cells:

3 1 4 1 5 9 2 6
5 3 5 8 9 7 9 3
2 3 8 4 6 2 6 4
3 3 8 3 2 7 9 5
0 2 8 8 4 1 9 7
1 6 9 3 9 9 3 7
5 1 0 5 8 2 0 9
7 4 9 4 4 5 9 2

Recursively sort quadrants, top →, bottom ←:

1 1 2 3 2 2 2 3
3 3 3 3 4 5 5 6
3 4 4 5 6 6 7 7
5 8 8 8 9 9 9 9
1 1 0 0 2 2 1 0
4 4 3 2 5 4 4 3
7 6 5 5 9 8 7 7
9 9 8 8 9 9 9 9

slide-166
SLIDE 166

36

Sort all n cells in n^(0.5+o(1)) seconds:

  • Recursively sort quadrants in parallel, if n > 1.
  • Sort each column in parallel.
  • Sort each row in parallel.
  • Sort each column in parallel.
  • Sort each row in parallel.

With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array.

37

For example, assume that this 8 × 8 array is in cells:

3 1 4 1 5 9 2 6
5 3 5 8 9 7 9 3
2 3 8 4 6 2 6 4
3 3 8 3 2 7 9 5
0 2 8 8 4 1 9 7
1 6 9 3 9 9 3 7
5 1 0 5 8 2 0 9
7 4 9 4 4 5 9 2

Recursively sort quadrants, top →, bottom ←:

1 1 2 3 2 2 2 3
3 3 3 3 4 5 5 6
3 4 4 5 6 6 7 7
5 8 8 8 9 9 9 9
1 1 0 0 2 2 1 0
4 4 3 2 5 4 4 3
7 6 5 5 9 8 7 7
9 9 8 8 9 9 9 9

slide-167
SLIDE 167

36


37

For example, assume that this 8 × 8 array is in cells:

3 1 4 1 5 9 2 6
5 3 5 8 9 7 9 3
2 3 8 4 6 2 6 4
3 3 8 3 2 7 9 5
0 2 8 8 4 1 9 7
1 6 9 3 9 9 3 7
5 1 0 5 8 2 0 9
7 4 9 4 4 5 9 2

Recursively sort quadrants, top →, bottom ←:

1 1 2 3 2 2 2 3
3 3 3 3 4 5 5 6
3 4 4 5 6 6 7 7
5 8 8 8 9 9 9 9
1 1 0 0 2 2 1 0
4 4 3 2 5 4 4 3
7 6 5 5 9 8 7 7
9 9 8 8 9 9 9 9

slide-168
SLIDE 168

37

For example, assume that this 8 × 8 array is in cells:

3 1 4 1 5 9 2 6
5 3 5 8 9 7 9 3
2 3 8 4 6 2 6 4
3 3 8 3 2 7 9 5
0 2 8 8 4 1 9 7
1 6 9 3 9 9 3 7
5 1 0 5 8 2 0 9
7 4 9 4 4 5 9 2

38

Recursively sort quadrants, top →, bottom ←:

1 1 2 3 2 2 2 3
3 3 3 3 4 5 5 6
3 4 4 5 6 6 7 7
5 8 8 8 9 9 9 9
1 1 0 0 2 2 1 0
4 4 3 2 5 4 4 3
7 6 5 5 9 8 7 7
9 9 8 8 9 9 9 9

slide-172
SLIDE 172

38

Recursively sort quadrants, top →, bottom ←:

1 1 2 3 2 2 2 3
3 3 3 3 4 5 5 6
3 4 4 5 6 6 7 7
5 8 8 8 9 9 9 9
1 1 0 0 2 2 1 0
4 4 3 2 5 4 4 3
7 6 5 5 9 8 7 7
9 9 8 8 9 9 9 9

39

Sort each column in parallel:

1 1 0 0 2 2 1 0
1 1 2 2 2 2 2 3
3 3 3 3 4 4 4 3
3 4 3 3 5 5 5 6
4 4 4 5 6 6 7 7
5 6 5 5 9 8 7 7
7 8 8 8 9 9 9 9
9 9 8 8 9 9 9 9
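These first two intermediate grids can also be reproduced directly, with each quadrant's values simply placed in sorted order as a stand-in for the recursion (a sketch; the helper names `quadrant` and `fill` are just for this illustration):

```python
# First 64 digits of pi, as in the slides (zeros included).
pi_grid = [
    [3, 1, 4, 1, 5, 9, 2, 6],
    [5, 3, 5, 8, 9, 7, 9, 3],
    [2, 3, 8, 4, 6, 2, 6, 4],
    [3, 3, 8, 3, 2, 7, 9, 5],
    [0, 2, 8, 8, 4, 1, 9, 7],
    [1, 6, 9, 3, 9, 9, 3, 7],
    [5, 1, 0, 5, 8, 2, 0, 9],
    [7, 4, 9, 4, 4, 5, 9, 2],
]

def quadrant(g, r0, c0):
    """Values of the 4x4 block with top-left corner (r0, c0)."""
    return [v for row in g[r0:r0 + 4] for v in row[c0:c0 + 4]]

def fill(vals, left_to_right):
    """Lay sorted values into a 4x4 block, rows read -> or <-."""
    rows = [sorted(vals)[i * 4:i * 4 + 4] for i in range(4)]
    return rows if left_to_right else [r[::-1] for r in rows]

# Recursively sort quadrants, top ->, bottom <-:
tl = fill(quadrant(pi_grid, 0, 0), True)
tr = fill(quadrant(pi_grid, 0, 4), True)
bl = fill(quadrant(pi_grid, 4, 0), False)
br = fill(quadrant(pi_grid, 4, 4), False)
step1 = [tl[i] + tr[i] for i in range(4)] + [bl[i] + br[i] for i in range(4)]

# Sort each column in parallel:
step2 = [list(r) for r in zip(*(sorted(c) for c in zip(*step1)))]
```

Here `step1` and `step2` match the two grids shown on this slide.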

slide-176
SLIDE 176

39

Sort each column in parallel:

1 1 0 0 2 2 1 0
1 1 2 2 2 2 2 3
3 3 3 3 4 4 4 3
3 4 3 3 5 5 5 6
4 4 4 5 6 6 7 7
5 6 5 5 9 8 7 7
7 8 8 8 9 9 9 9
9 9 8 8 9 9 9 9

40

Sort each row in parallel, alternately →, ←:

0 0 0 1 1 1 2 2
3 2 2 2 2 2 1 1
3 3 3 3 3 4 4 4
6 5 5 5 4 3 3 3
4 4 4 5 6 6 7 7
9 8 7 7 6 5 5 5
7 8 8 8 9 9 9 9
9 9 9 9 9 9 8 8

slide-180
SLIDE 180

40

Sort each row in parallel, alternately →, ←:

0 0 0 1 1 1 2 2
3 2 2 2 2 2 1 1
3 3 3 3 3 4 4 4
6 5 5 5 4 3 3 3
4 4 4 5 6 6 7 7
9 8 7 7 6 5 5 5
7 8 8 8 9 9 9 9
9 9 9 9 9 9 8 8

41

Sort each column in parallel:

0 0 0 1 1 1 1 1
3 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3
4 4 4 5 4 4 4 4
6 5 5 5 6 5 5 5
7 8 7 7 6 6 7 7
9 8 8 8 9 9 8 8
9 9 9 9 9 9 9 9

slide-184
SLIDE 184

41

Sort each column in parallel:

0 0 0 1 1 1 1 1
3 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3
4 4 4 5 4 4 4 4
6 5 5 5 6 5 5 5
7 8 7 7 6 6 7 7
9 8 8 8 9 9 8 8
9 9 9 9 9 9 9 9

42

Sort each row in parallel, ← or → as desired:

0 0 0 1 1 1 1 1
2 2 2 2 2 2 2 3
3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 5
5 5 5 5 5 5 6 6
6 6 7 7 7 7 7 8
8 8 8 8 8 9 9 9
9 9 9 9 9 9 9 9

slide-190
SLIDE 190

42

Sort each row in parallel, ← or → as desired:

0 0 0 1 1 1 1 1
2 2 2 2 2 2 2 3
3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 5
5 5 5 5 5 5 6 6
6 6 7 7 7 7 7 8
8 8 8 8 8 9 9 9
9 9 9 9 9 9 9 9

43

Chips are in fact evolving towards having this much parallelism and communication. GPUs: parallel + global RAM. Old Xeon Phi: parallel + ring. New Xeon Phi: parallel + mesh. Algorithm designers don’t even get the right exponent without taking this into account. Shock waves from subroutines into high-level algorithm design.