CPU-specific optimization



1

CPU-specific optimization. Example of a target CPU core: ARM Cortex-M4F core inside the LM4F120H5QR microcontroller in the Stellaris LM4F120 Launchpad.

2

Example of a function that we want to optimize: adding 1000 integers mod 2^32. Reference implementation:

int sum(int *x)
{
  int result = 0;
  int i;
  for (i = 0;i < 1000;++i)
    result += x[i];
  return result;
}


3

Counting cycles:

static volatile unsigned int *const DWT_CYCCNT = (void *) 0xE0001004;
...
int beforesum = *DWT_CYCCNT;
int result = sum(x);
int aftersum = *DWT_CYCCNT;
UARTprintf("sum %d %d\n", result, aftersum - beforesum);

Output shows 8012 cycles. Change 1000 to 500: 4012.
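
The slides assume the cycle counter is already running. On Cortex-M4 parts it must be switched on first via the standard ARMv7-M debug registers; a minimal setup sketch (register addresses taken from the ARMv7-M memory map; verify against your board's documentation before relying on them):

```c
/* Hardware-specific sketch: enable the DWT cycle counter on a Cortex-M4. */
static volatile unsigned int *const DEMCR      = (void *) 0xE000EDFC;
static volatile unsigned int *const DWT_CTRL   = (void *) 0xE0001000;
static volatile unsigned int *const DWT_CYCCNT = (void *) 0xE0001004;

void cyccnt_enable(void)
{
  *DEMCR |= 1u << 24;   /* TRCENA: turn on the DWT/ITM blocks */
  *DWT_CYCCNT = 0;      /* start counting from zero */
  *DWT_CTRL |= 1u;      /* CYCCNTENA: enable the cycle counter */
}
```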


4

“Okay, 8 cycles per addition. Um, are microcontrollers really this slow at addition?”

Bad approach: Apply random “optimizations” (and tweak compiler options) until you get bored/frustrated. Keep the fastest results.

Try -Os: 8012 cycles. Try -O1: 8012 cycles. Try -O2: 8012 cycles. Try -O3: 8012 cycles.


5

Try moving the pointer:

int sum(int *x)
{
  int result = 0;
  int i;
  for (i = 0;i < 1000;++i)
    result += *x++;
  return result;
}

8010 cycles.


6

Try counting down:

int sum(int *x)
{
  int result = 0;
  int i;
  for (i = 1000;i > 0;--i)
    result += *x++;
  return result;
}

8010 cycles.


7

Try using an end pointer:

int sum(int *x)
{
  int result = 0;
  int *y = x + 1000;
  while (x != y)
    result += *x++;
  return result;
}

8010 cycles.


8

Back to original. Try unrolling:

int sum(int *x)
{
  int result = 0;
  int i;
  for (i = 0;i < 1000;i += 2) {
    result += x[i];
    result += x[i + 1];
  }
  return result;
}

5016 cycles.


9

int sum(int *x)
{
  int result = 0;
  int i;
  for (i = 0;i < 1000;i += 5) {
    result += x[i];
    result += x[i + 1];
    result += x[i + 2];
    result += x[i + 3];
    result += x[i + 4];
  }
  return result;
}


10

4016 cycles. Are we done now?

Most random “optimizations” that we tried seem useless. Can spend time trying more. Does frustration level tell us that we’re close to optimal?

Good approach: Figure out lower bound for cycles spent on arithmetic etc. Understand gap between lower bound and observed time. Let’s try this approach.


11

Find “ARM Cortex-M4 Processor Technical Reference Manual”. Rely on Wikipedia comment that M4F = M4 + floating-point unit.

Manual says that Cortex-M4 “implements the ARMv7E-M architecture profile”. Points to the “ARMv7-M Architecture Reference Manual”, which defines instructions: e.g., “ADD” for 32-bit addition.

First manual says that ADD takes just 1 cycle.


12

Inputs and output of ADD are “integer registers”. ARMv7-M has 16 integer registers, including special-purpose “stack pointer” and “program counter”.

Each element of the x array needs to be “loaded” into a register. Basic load instruction: LDR. Manual says 2 cycles but adds a note about “pipelining”. Then more explanation: if the next instruction is also LDR (with address not based on first LDR) then it saves 1 cycle.


13

n consecutive LDRs take only n + 1 cycles (“multiple LDRs can be pipelined together”). Can achieve this speed in other ways (LDRD, LDM) but nothing seems faster.

Lower bound for n LDR + n ADD: 2n + 1 cycles, including n cycles of arithmetic. Why observed time is higher: non-consecutive LDRs; costs of manipulating i.


14

2281 cycles using ldr.w:

y = x + 4000
p = x
result = 0
loop:
xi9 = *(uint32 *) (p + 76)
xi8 = *(uint32 *) (p + 72)
xi7 = *(uint32 *) (p + 68)
xi6 = *(uint32 *) (p + 64)
xi5 = *(uint32 *) (p + 60)
xi4 = *(uint32 *) (p + 56)
xi3 = *(uint32 *) (p + 52)
xi2 = *(uint32 *) (p + 48)

slide-61
SLIDE 61

13

consecutive LDRs

  • nly n + 1 cycles

re multiple LDRs can be elined together”). achieve this speed

  • ther ways (LDRD, LDM)

nothing seems faster. bound for n LDR + n ADD: cycles, including cycles of arithmetic.

  • bserved time is higher:

non-consecutive LDRs;

  • f manipulating i.

14

2281 cycles using ldr.w:

y = x + 4000 p = x result = 0 loop: xi9 = *(uint32 *) (p + 76) xi8 = *(uint32 *) (p + 72) xi7 = *(uint32 *) (p + 68) xi6 = *(uint32 *) (p + 64) xi5 = *(uint32 *) (p + 60) xi4 = *(uint32 *) (p + 56) xi3 = *(uint32 *) (p + 52) xi2 = *(uint32 *) (p + 48) xi1 = xi0 = result result result result result result result result result result xi9 = xi8 = xi7 =

slide-62
SLIDE 62

13

Rs cycles LDRs can be together”). speed (LDRD, LDM) seems faster. n LDR + n ADD: including rithmetic. time is higher: LDRs; manipulating i.

14

2281 cycles using ldr.w:

y = x + 4000 p = x result = 0 loop: xi9 = *(uint32 *) (p + 76) xi8 = *(uint32 *) (p + 72) xi7 = *(uint32 *) (p + 68) xi6 = *(uint32 *) (p + 64) xi5 = *(uint32 *) (p + 60) xi4 = *(uint32 *) (p + 56) xi3 = *(uint32 *) (p + 52) xi2 = *(uint32 *) (p + 48) xi1 = *(uint32 xi0 = *(uint32 result += xi9 result += xi8 result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 result += xi1 result += xi0 xi9 = *(uint32 xi8 = *(uint32 xi7 = *(uint32

slide-63
SLIDE 63

13

be LDM) n ADD: higher:

14

2281 cycles using ldr.w:

y = x + 4000 p = x result = 0 loop: xi9 = *(uint32 *) (p + 76) xi8 = *(uint32 *) (p + 72) xi7 = *(uint32 *) (p + 68) xi6 = *(uint32 *) (p + 64) xi5 = *(uint32 *) (p + 60) xi4 = *(uint32 *) (p + 56) xi3 = *(uint32 *) (p + 52) xi2 = *(uint32 *) (p + 48) xi1 = *(uint32 *) (p + xi0 = *(uint32 *) (p + result += xi9 result += xi8 result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 result += xi1 result += xi0 xi9 = *(uint32 *) (p + xi8 = *(uint32 *) (p + xi7 = *(uint32 *) (p +

slide-64
SLIDE 64

14

2281 cycles using ldr.w:

y = x + 4000
p = x
result = 0
loop:
xi9 = *(uint32 *) (p + 76)
xi8 = *(uint32 *) (p + 72)
xi7 = *(uint32 *) (p + 68)
xi6 = *(uint32 *) (p + 64)
xi5 = *(uint32 *) (p + 60)
xi4 = *(uint32 *) (p + 56)
xi3 = *(uint32 *) (p + 52)
xi2 = *(uint32 *) (p + 48)

15

xi1 = *(uint32 *) (p + 44)
xi0 = *(uint32 *) (p + 40)
result += xi9
result += xi8
result += xi7
result += xi6
result += xi5
result += xi4
result += xi3
result += xi2
result += xi1
result += xi0
xi9 = *(uint32 *) (p + 36)
xi8 = *(uint32 *) (p + 32)
xi7 = *(uint32 *) (p + 28)

slide-68
SLIDE 68

16

xi6 = *(uint32 *) (p + 24)
xi5 = *(uint32 *) (p + 20)
xi4 = *(uint32 *) (p + 16)
xi3 = *(uint32 *) (p + 12)
xi2 = *(uint32 *) (p + 8)
xi1 = *(uint32 *) (p + 4)
xi0 = *(uint32 *) p; p += 160
result += xi9
result += xi8
result += xi7
result += xi6
result += xi5
result += xi4
result += xi3
result += xi2

slide-72
SLIDE 72

17

result += xi1
result += xi0
xi9 = *(uint32 *) (p - 4)
xi8 = *(uint32 *) (p - 8)
xi7 = *(uint32 *) (p - 12)
xi6 = *(uint32 *) (p - 16)
xi5 = *(uint32 *) (p - 20)
xi4 = *(uint32 *) (p - 24)
xi3 = *(uint32 *) (p - 28)
xi2 = *(uint32 *) (p - 32)
xi1 = *(uint32 *) (p - 36)
xi0 = *(uint32 *) (p - 40)
result += xi9
result += xi8
result += xi7

slide-76
SLIDE 76

18

result += xi6
result += xi5
result += xi4
result += xi3
result += xi2
result += xi1
result += xi0
xi9 = *(uint32 *) (p - 44)
xi8 = *(uint32 *) (p - 48)
xi7 = *(uint32 *) (p - 52)
xi6 = *(uint32 *) (p - 56)
xi5 = *(uint32 *) (p - 60)
xi4 = *(uint32 *) (p - 64)
xi3 = *(uint32 *) (p - 68)
xi2 = *(uint32 *) (p - 72)

slide-80
SLIDE 80

19

xi1 = *(uint32 *) (p - 76)
xi0 = *(uint32 *) (p - 80)
result += xi9
result += xi8
result += xi7
result += xi6
result += xi5
result += xi4
result += xi3
result += xi2
result += xi1
result += xi0
=? p - y
goto loop if !=

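The unrolled listing above can be approximated in plain C. This is a sketch of the technique, not the slides' exact machine code: each loop body loads consecutive words into separate variables (so the LDRs can pipeline), and the counter i is replaced by an end-pointer comparison, mirroring the "=? p - y / goto loop if !=" test. The function name sum_unrolled is mine, not from the slides.

```c
#include <stdint.h>

/* Sum 1000 32-bit words, unrolled by 10: consecutive loads can be
   pipelined, and the per-element cost of manipulating a counter i
   is amortized away.  The end test compares p against y = x + 1000,
   as the slides do with "=? p - y". */
uint32_t sum_unrolled(const uint32_t *x)
{
    const uint32_t *p = x;
    const uint32_t *y = x + 1000;   /* one past the last element */
    uint32_t result = 0;
    do {
        uint32_t xi0 = p[0], xi1 = p[1], xi2 = p[2], xi3 = p[3], xi4 = p[4];
        uint32_t xi5 = p[5], xi6 = p[6], xi7 = p[7], xi8 = p[8], xi9 = p[9];
        result += xi0; result += xi1; result += xi2; result += xi3;
        result += xi4; result += xi5; result += xi6; result += xi7;
        result += xi8; result += xi9;
        p += 10;
    } while (p != y);               /* goto loop if != */
    return result;
}
```

A compiler may or may not emit the consecutive-LDR pattern from this source; the slides' point is precisely that the hand-scheduled version is what guarantees it.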
slide-85
SLIDE 85

20

Wikipedia: “By the late 1990s for even performance sensitive code, optimizing compilers exceeded the performance of human experts.”

Reality: The fastest software today relies on human experts understanding the CPU.

  • Cannot trust compiler to optimize instruction selection.
  • Cannot trust compiler to optimize instruction scheduling.
  • Cannot trust compiler to optimize register allocation.
slide-90
SLIDE 90

21

The big picture: CPUs are evolving farther and farther away from naive models of CPUs.

Minor optimization challenges:

  • Pipelining.
  • Superscalar processing.

Major optimization challenges:

  • Vectorization.
  • Many threads; many cores.
  • The memory hierarchy; the ring; the mesh.
  • Larger-scale parallelism.
  • Larger-scale networking.
slide-94
SLIDE 94

22

CPU design in a nutshell

[Circuit diagram: inputs f0, f1, g0, g1 feed a network of ∧ gates whose outputs are h0, h1, h2, h3.]

Gates ∧ : a, b → 1 − ab computing product h0 + 2h1 + 4h2 + 8h3 of integers f0 + 2f1, g0 + 2g1.
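The slide's single gate type, ∧ : a, b → 1 − ab on bits, is a NAND. As a sanity check, here is the 2-bit multiplier expressed in C using only that gate; the internal wiring (a standard 2×2 array multiplier with one half-adder stage) is my own reconstruction, since the slide omits the details.

```c
/* The slide's only gate: a, b -> 1 - ab on bits, i.e. NAND. */
int nand(int a, int b) { return 1 - a * b; }
int not_(int a)        { return nand(a, a); }
int and_(int a, int b) { return not_(nand(a, b)); }
int xor_(int a, int b) {
    int n = nand(a, b);
    return nand(nand(a, n), nand(b, n));
}

/* Multiply f = f0 + 2*f1 by g = g0 + 2*g1, producing
   h0 + 2*h1 + 4*h2 + 8*h3, using NAND-derived gates only. */
int mul2(int f, int g)
{
    int f0 = f & 1, f1 = (f >> 1) & 1;
    int g0 = g & 1, g1 = (g >> 1) & 1;
    int p00 = and_(f0, g0), p01 = and_(f0, g1);   /* partial products */
    int p10 = and_(f1, g0), p11 = and_(f1, g1);
    int h0 = p00;
    int h1 = xor_(p01, p10);
    int c  = and_(p01, p10);      /* carry out of the middle column */
    int h2 = xor_(p11, c);
    int h3 = and_(p11, c);
    return h0 + 2*h1 + 4*h2 + 8*h3;
}
```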
slide-99
SLIDE 99

23

Electricity takes time to percolate through wires and gates. If f0, f1, g0, g1 are stable then h0, h1, h2, h3 are stable a few moments later.

Build circuit with more gates to multiply (e.g.) 32-bit integers:

[Circuit diagram of a 32-bit multiplier. (Details omitted.)]

slide-105
SLIDE 105

24

Build circuit to compute 32-bit integer ri given 4-bit integer i and 32-bit integers r0, r1, …, r15: “register read”.

Build circuit for “register write”: r0, …, r15, s, i → r′0, …, r′15 where r′j = rj except r′i = s.

Build circuit for addition. Etc.

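Behaviourally, the two circuits are an indexed read and an indexed write of a 16-entry register file: the read is a 16-way multiplexer, and the write recomputes all sixteen outputs, changing only entry i. A C model of what the circuits compute (function names are mine):

```c
#include <stdint.h>

/* "register read": given 4-bit i and r0..r15, produce ri.
   Hardware realizes this as a 16-way multiplexer on i. */
uint32_t reg_read(const uint32_t r[16], unsigned i)
{
    return r[i & 15];
}

/* "register write": r0..r15, s, i -> r'0..r'15
   where r'j = rj except r'i = s.  Every output is driven;
   only the selected entry takes the new value s. */
void reg_write(uint32_t rp[16], const uint32_t r[16],
               uint32_t s, unsigned i)
{
    for (unsigned j = 0; j < 16; ++j)
        rp[j] = (j == (i & 15)) ? s : r[j];
}
```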
slide-109
SLIDE 109

25

r0, …, r15, i, j, k → r′0, …, r′15 where r′ℓ = rℓ except r′i = rj rk:

[Circuit diagram: register read, register read → multiplier → register write.]

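The whole datapath above — two register reads feeding a multiplier, whose product goes into a register write — is one pure function on the register state. A C model of that step (the name step_mul is mine):

```c
#include <stdint.h>

/* r0..r15, i, j, k -> r'0..r'15 where r'l = rl except r'i = rj*rk.
   Two register reads feed the multiplier; the product is written
   back through the register-write circuit. */
void step_mul(uint32_t rp[16], const uint32_t r[16],
              unsigned i, unsigned j, unsigned k)
{
    uint32_t a = r[j & 15];          /* register read */
    uint32_t b = r[k & 15];          /* register read */
    for (unsigned l = 0; l < 16; ++l)
        rp[l] = r[l];
    rp[i & 15] = a * b;              /* register write */
}
```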
slide-113
SLIDE 113

25

r0; : : : ; r15; i; j; k → r′

0; : : : ; r′ 15

where r′

‘ = r‘ except r′ i = rjrk:

register read register read

⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄

register write

26

Add more flexibility. More arithmetic: replace (i; j; k) with (“×”; i; j; k) and (“+”; i; j; k) and more options.

slide-114
SLIDE 114

25

r0; : : : ; r15; i; j; k → r′

0; : : : ; r′ 15

where r′

‘ = r‘ except r′ i = rjrk:

register read register read

⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄

register write

26

Add more flexibility. More arithmetic: replace (i; j; k) with (“×”; i; j; k) and (“+”; i; j; k) and more options. More (but slower) storage: “load” from and “store” to larger “RAM” arrays.

slide-115
SLIDE 115

25

r0; : : : ; r15; i; j; k → r′

0; : : : ; r′ 15

where r′

‘ = r‘ except r′ i = rjrk:

register read register read

⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄

register write

26

Add more flexibility. More arithmetic: replace (i; j; k) with (“×”; i; j; k) and (“+”; i; j; k) and more options. More (but slower) storage: “load” from and “store” to larger “RAM” arrays. “Instruction fetch”: p → op; ip; jp; kp; p′.

slide-116
SLIDE 116

25

r0; : : : ; r15; i; j; k → r′

0; : : : ; r′ 15

where r′

‘ = r‘ except r′ i = rjrk:

register read register read

⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄ ❄

register write

26

Add more flexibility. More arithmetic: replace (i; j; k) with (“×”; i; j; k) and (“+”; i; j; k) and more options. More (but slower) storage: “load” from and “store” to larger “RAM” arrays. “Instruction fetch”: p → op; ip; jp; kp; p′. “Instruction decode”: decompression of compressed format for op; ip; jp; kp; p′.

slide-117
SLIDE 117

25

; r15; i; j; k → r′

0; : : : ; r′ 15

r′

‘ = r‘ except r′ i = rjrk:

[circuit diagram: two register-read ports feed a multiplier; the product goes to the register-write port]

26

Add more flexibility. More arithmetic: replace (i, j, k) with (“×”, i, j, k) and (“+”, i, j, k) and more options. More (but slower) storage: “load” from and “store” to larger “RAM” arrays. “Instruction fetch”: p → op, ip, jp, kp, p′. “Instruction decode”: decompression of compressed format for op, ip, jp, kp, p′. Build “flip-flops” storing (p, r0, …, r15). Hook (p, r0, …, r15) flip-flops into circuit inputs. Hook outputs (p′, r′0, …, r′15) into the same flip-flops. At each “clock tick”, flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops.

slide-118
SLIDE 118

25

r0, …, r15, i, j, k → r′0, …, r′15, where r′ℓ = rℓ except r′i = rj·rk:

[circuit diagram: two register-read ports feed a multiplier; the product goes to the register-write port]

26

Add more flexibility. More arithmetic: replace (i, j, k) with (“×”, i, j, k) and (“+”, i, j, k) and more options. More (but slower) storage: “load” from and “store” to larger “RAM” arrays. “Instruction fetch”: p → op, ip, jp, kp, p′. “Instruction decode”: decompression of compressed format for op, ip, jp, kp, p′. Build “flip-flops” storing (p, r0, …, r15). Hook (p, r0, …, r15) flip-flops into circuit inputs. Hook outputs (p′, r′0, …, r′15) into the same flip-flops. At each “clock tick”, flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops.

slide-119
SLIDE 119

25

r0, …, r15, i, j, k → r′0, …, r′15, where r′ℓ = rℓ except r′i = rj·rk:

[circuit diagram: two register-read ports feed a multiplier; the product goes to the register-write port]

26

Add more flexibility. More arithmetic: replace (i, j, k) with (“×”, i, j, k) and (“+”, i, j, k) and more options. More (but slower) storage: “load” from and “store” to larger “RAM” arrays. “Instruction fetch”: p → op, ip, jp, kp, p′. “Instruction decode”: decompression of compressed format for op, ip, jp, kp, p′. Build “flip-flops” storing (p, r0, …, r15). Hook (p, r0, …, r15) flip-flops into circuit inputs. Hook outputs (p′, r′0, …, r′15) into the same flip-flops. At each “clock tick”, flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops.

slide-120
SLIDE 120

26

Add more flexibility. More arithmetic: replace (i, j, k) with (“×”, i, j, k) and (“+”, i, j, k) and more options. More (but slower) storage: “load” from and “store” to larger “RAM” arrays. “Instruction fetch”: p → op, ip, jp, kp, p′. “Instruction decode”: decompression of compressed format for op, ip, jp, kp, p′.

27

Build “flip-flops” storing (p, r0, …, r15). Hook (p, r0, …, r15) flip-flops into circuit inputs. Hook outputs (p′, r′0, …, r′15) into the same flip-flops. At each “clock tick”, flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops.

slide-121
SLIDE 121

26

Add more flexibility. More arithmetic: replace (i, j, k) with (“×”, i, j, k) and (“+”, i, j, k) and more options. More (but slower) storage: “load” from and “store” to larger “RAM” arrays. “Instruction fetch”: p → op, ip, jp, kp, p′. “Instruction decode”: decompression of compressed format for op, ip, jp, kp, p′.

27

Build “flip-flops” storing (p, r0, …, r15). Hook (p, r0, …, r15) flip-flops into circuit inputs. Hook outputs (p′, r′0, …, r′15) into the same flip-flops. At each “clock tick”, flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops.

Now have semi-flexible CPU:

[diagram: flip-flops feed insn fetch → insn decode → two register reads → multiplier → register write]

Further flexibility is useful: e.g., logic instructions.

slide-122
SLIDE 122

26

Add more flexibility. More arithmetic: replace (i, j, k) with (“×”, i, j, k) and (“+”, i, j, k) and more options. More (but slower) storage: “load” from and “store” to larger “RAM” arrays. “Instruction fetch”: p → op, ip, jp, kp, p′. “Instruction decode”: decompression of compressed format for op, ip, jp, kp, p′.

27

Build “flip-flops” storing (p, r0, …, r15). Hook (p, r0, …, r15) flip-flops into circuit inputs. Hook outputs (p′, r′0, …, r′15) into the same flip-flops. At each “clock tick”, flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops.

Now have semi-flexible CPU:

[diagram: flip-flops feed insn fetch → insn decode → two register reads → multiplier → register write]

Further flexibility is useful: e.g., logic instructions.

slide-123
SLIDE 123

26


27

Build “flip-flops” storing (p, r0, …, r15). Hook (p, r0, …, r15) flip-flops into circuit inputs. Hook outputs (p′, r′0, …, r′15) into the same flip-flops. At each “clock tick”, flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops.

Now have semi-flexible CPU:

[diagram: flip-flops feed insn fetch → insn decode → two register reads → multiplier → register write]

Further flexibility is useful: e.g., logic instructions.

slide-124
SLIDE 124

27

Build “flip-flops” storing (p, r0, …, r15). Hook (p, r0, …, r15) flip-flops into circuit inputs. Hook outputs (p′, r′0, …, r′15) into the same flip-flops. At each “clock tick”, flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops.

28

Now have semi-flexible CPU:

[diagram: flip-flops feed insn fetch → insn decode → two register reads → multiplier → register write]

Further flexibility is useful: e.g., logic instructions.

slide-125
SLIDE 125

27

Build “flip-flops” storing (p, r0, …, r15). Hook (p, r0, …, r15) flip-flops into circuit inputs. Hook outputs (p′, r′0, …, r′15) into the same flip-flops. At each “clock tick”, flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops.

28

Now have semi-flexible CPU:

flip-flops insn fetch insn decode register read register read

⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ❄ ❄ ❄ ❄ ❄ ❄ ❄

register write

Further flexibility is useful: e.g., logic instructions. “Pipelining” allows faster clock:

[diagram: pipelined CPU with flip-flops between stages — insn fetch (stage 1), insn decode (stage 2), register read (stage 3), multiply (stage 4), register write (stage 5)]

slide-126
SLIDE 126

27

Build “flip-flops” storing (p, r0, …, r15). Hook (p, r0, …, r15) flip-flops into circuit inputs. Hook outputs (p′, r′0, …, r′15) into the same flip-flops. At each “clock tick”, flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops.

28

Now have semi-flexible CPU:

flip-flops insn fetch insn decode register read register read

⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ❄ ❄ ❄ ❄ ❄ ❄ ❄

register write

Further flexibility is useful: e.g., logic instructions. “Pipelining” allows faster clock:

[diagram: pipelined CPU with flip-flops between stages — insn fetch (stage 1), insn decode (stage 2), register read (stage 3), multiply (stage 4), register write (stage 5)]

slide-127
SLIDE 127

27

Build “flip-flops” storing (p, r0, …, r15). Hook (p, r0, …, r15) flip-flops into circuit inputs. Hook outputs (p′, r′0, …, r′15) into the same flip-flops. At each “clock tick”, flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops.

28

Now have semi-flexible CPU:

flip-flops insn fetch insn decode register read register read

⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ❄ ❄ ❄ ❄ ❄ ❄ ❄

register write

Further flexibility is useful: e.g., logic instructions. “Pipelining” allows faster clock:

[diagram: pipelined CPU with flip-flops between stages — insn fetch (stage 1), insn decode (stage 2), register read (stage 3), multiply (stage 4), register write (stage 5)]

slide-128
SLIDE 128

28

Now have semi-flexible CPU:

flip-flops insn fetch insn decode register read register read

⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ⑧ ❄ ❄ ❄ ❄ ❄ ❄ ❄

register write

Further flexibility is useful: e.g., logic instructions.

29

“Pipelining” allows faster clock:

[diagram: pipelined CPU with flip-flops between stages — insn fetch (stage 1), insn decode (stage 2), register read (stage 3), multiply (stage 4), register write (stage 5)]

slide-129
SLIDE 129

28

Now have semi-flexible CPU:

[diagram: flip-flops feed insn fetch → insn decode → two register reads → multiplier → register write]

Further flexibility is useful: e.g., logic instructions.

29

“Pipelining” allows faster clock:

[diagram: pipelined CPU with flip-flops between stages — insn fetch (stage 1), insn decode (stage 2), register read (stage 3), multiply (stage 4), register write (stage 5)]

Goal: Stage n handles instruction one tick after stage n − 1. Instruction fetch reads next instruction, feeds p′ back, sends instruction. After next clock tick, instruction decode uncompresses this instruction, while instruction fetch reads another instruction. Some extra flip-flop area. Also extra area to preserve instruction semantics: e.g., stall on read-after-write.

slide-130
SLIDE 130

28

Now have semi-flexible CPU:

[diagram: flip-flops feed insn fetch → insn decode → two register reads → multiplier → register write]

Further flexibility is useful: e.g., logic instructions.

29

“Pipelining” allows faster clock:

[diagram: pipelined CPU with flip-flops between stages — insn fetch (stage 1), insn decode (stage 2), register read (stage 3), multiply (stage 4), register write (stage 5)]

Goal: Stage n handles instruction one tick after stage n − 1. Instruction fetch reads next instruction, feeds p′ back, sends instruction. After next clock tick, instruction decode uncompresses this instruction, while instruction fetch reads another instruction. Some extra flip-flop area. Also extra area to preserve instruction semantics: e.g., stall on read-after-write.

slide-131
SLIDE 131

28


29

“Pipelining” allows faster clock:

[diagram: pipelined CPU with flip-flops between stages — insn fetch (stage 1), insn decode (stage 2), register read (stage 3), multiply (stage 4), register write (stage 5)]

Goal: Stage n handles instruction one tick after stage n − 1. Instruction fetch reads next instruction, feeds p′ back, sends instruction. After next clock tick, instruction decode uncompresses this instruction, while instruction fetch reads another instruction. Some extra flip-flop area. Also extra area to preserve instruction semantics: e.g., stall on read-after-write.

slide-132
SLIDE 132

29

“Pipelining” allows faster clock:

[diagram: pipelined CPU with flip-flops between stages — insn fetch (stage 1), insn decode (stage 2), register read (stage 3), multiply (stage 4), register write (stage 5)]

30

Goal: Stage n handles instruction one tick after stage n − 1. Instruction fetch reads next instruction, feeds p′ back, sends instruction. After next clock tick, instruction decode uncompresses this instruction, while instruction fetch reads another instruction. Some extra flip-flop area. Also extra area to preserve instruction semantics: e.g., stall on read-after-write.
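The cost of a read-after-write stall can be illustrated with a toy timing model. This is a hypothetical sketch (forwarding paths are ignored, stage numbers are assumptions, and `pipeline_cycles` is an invented name), not a model of any particular CPU:

```python
# Toy model of a 5-stage pipeline: an instruction issues one tick after its
# predecessor, unless it reads a register that an in-flight predecessor
# writes, in which case it stalls until that write stage has finished.
def pipeline_cycles(instrs, depth=5, read_stage=3, write_stage=5):
    """instrs: list of (dest_reg, (src_regs, ...)). Returns total clock ticks."""
    issue = []                                   # tick at which each insn issues
    for n, (dest, srcs) in enumerate(instrs):
        t = issue[-1] + 1 if issue else 0
        for m in range(max(0, n - depth), n):    # only recent insns are in flight
            if instrs[m][0] in srcs:
                # stall: wait until insn m's register write has completed
                t = max(t, issue[m] + write_stage - read_stage + 1)
        issue.append(t)
    return issue[-1] + depth                     # last insn drains the pipeline

independent = [(0, (1, 2)), (3, (4, 5)), (6, (7, 8))]   # no data dependencies
dependent   = [(0, (1, 2)), (3, (0, 4)), (5, (3, 6))]   # chain of RAW hazards
print(pipeline_cycles(independent), pipeline_cycles(dependent))   # 7 11
```

The dependent chain takes noticeably longer, which is why instruction scheduling (interleaving independent work) matters for pipelined targets.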

slide-133
SLIDE 133

29

“Pipelining” allows faster clock:

[diagram: pipelined CPU with flip-flops between stages — insn fetch (stage 1), insn decode (stage 2), register read (stage 3), multiply (stage 4), register write (stage 5)]

30

Goal: Stage n handles instruction one tick after stage n − 1. Instruction fetch reads next instruction, feeds p′ back, sends instruction. After next clock tick, instruction decode uncompresses this instruction, while instruction fetch reads another instruction. Some extra flip-flop area. Also extra area to preserve instruction semantics: e.g., stall on read-after-write.

“Superscalar” processing:

[diagram: superscalar CPU with two insn-fetch units, two insn-decode units, four register-read ports, two multipliers, and two register-write ports]

slide-134
SLIDE 134

29

“Pipelining” allows faster clock:

[diagram: pipelined CPU with flip-flops between stages — insn fetch (stage 1), insn decode (stage 2), register read (stage 3), multiply (stage 4), register write (stage 5)]

30

Goal: Stage n handles instruction one tick after stage n − 1. Instruction fetch reads next instruction, feeds p′ back, sends instruction. After next clock tick, instruction decode uncompresses this instruction, while instruction fetch reads another instruction. Some extra flip-flop area. Also extra area to preserve instruction semantics: e.g., stall on read-after-write.

“Superscalar” processing:

[diagram: superscalar CPU with two insn-fetch units, two insn-decode units, four register-read ports, two multipliers, and two register-write ports]

slide-135
SLIDE 135

29

“Pipelining” allows faster clock:

[diagram: pipelined CPU with flip-flops between stages — insn fetch (stage 1), insn decode (stage 2), register read (stage 3), multiply (stage 4), register write (stage 5)]

30

Goal: Stage n handles instruction one tick after stage n − 1. Instruction fetch reads next instruction, feeds p′ back, sends instruction. After next clock tick, instruction decode uncompresses this instruction, while instruction fetch reads another instruction. Some extra flip-flop area. Also extra area to preserve instruction semantics: e.g., stall on read-after-write.

“Superscalar” processing:

[diagram: superscalar CPU with two insn-fetch units, two insn-decode units, four register-read ports, two multipliers, and two register-write ports]

slide-136
SLIDE 136

30

Goal: Stage n handles instruction one tick after stage n − 1. Instruction fetch reads next instruction, feeds p′ back, sends instruction. After next clock tick, instruction decode uncompresses this instruction, while instruction fetch reads another instruction. Some extra flip-flop area. Also extra area to preserve instruction semantics: e.g., stall on read-after-write.

31

“Superscalar” processing:

[diagram: superscalar CPU with two insn-fetch units, two insn-decode units, four register-read ports, two multipliers, and two register-write ports]

slide-137
SLIDE 137

30

Goal: Stage n handles instruction one tick after stage n − 1. Instruction fetch reads next instruction, feeds p′ back, sends instruction. After next clock tick, instruction decode uncompresses this instruction, while instruction fetch reads another instruction. Some extra flip-flop area. Also extra area to preserve instruction semantics: e.g., stall on read-after-write.

31

“Superscalar” processing:

[diagram: superscalar CPU with two insn-fetch units, two insn-decode units, four register-read ports, two multipliers, and two register-write ports]

“Vector” processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM “NEON” has n = 4; Intel “AVX2” has n = 8; Intel “AVX-512” has n = 16; GPUs have larger n.

slide-138
SLIDE 138

30


31

“Superscalar” processing:

[diagram: superscalar CPU with two insn-fetch units, two insn-decode units, four register-read ports, two multipliers, and two register-write ports]

“Vector” processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM “NEON” has n = 4; Intel “AVX2” has n = 8; Intel “AVX-512” has n = 16; GPUs have larger n.

slide-139
SLIDE 139

30


31

“Superscalar” processing:

[diagram: superscalar CPU with two insn-fetch units, two insn-decode units, four register-read ports, two multipliers, and two register-write ports]

“Vector” processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM “NEON” has n = 4; Intel “AVX2” has n = 8; Intel “AVX-512” has n = 16; GPUs have larger n.

slide-140
SLIDE 140

31

“Superscalar” processing:

[diagram: superscalar CPU with two insn-fetch units, two insn-decode units, four register-read ports, two multipliers, and two register-write ports]

32

“Vector” processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM “NEON” has n = 4; Intel “AVX2” has n = 8; Intel “AVX-512” has n = 16; GPUs have larger n.

slide-141
SLIDE 141

31

“Superscalar” processing:

[diagram: superscalar CPU with two insn-fetch units, two insn-decode units, four register-read ports, two multipliers, and two register-write ports]

32

“Vector” processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM “NEON” has n = 4; Intel “AVX2” has n = 8; Intel “AVX-512” has n = 16; GPUs have larger n. n× speedup if n× arithmetic circuits, n× read/write circuits. Benefit: Amortizes insn circuits.

slide-142
SLIDE 142

31

“Superscalar” processing:

[diagram: superscalar CPU with two insn-fetch units, two insn-decode units, four register-read ports, two multipliers, and two register-write ports]

32

“Vector” processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM “NEON” has n = 4; Intel “AVX2” has n = 8; Intel “AVX-512” has n = 16; GPUs have larger n. n× speedup if n× arithmetic circuits, n× read/write circuits. Benefit: Amortizes insn circuits. Huge effect on higher-level algorithms and data structures.
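The amortization idea can be sketched in plain Python. This is an illustrative model of n-lane vector execution, not actual NEON or AVX code; the names `vadd` and `vector_sum` are invented for this sketch:

```python
# Model of n-lane vector execution: one "instruction" applies the same
# operation to n lanes at once, so fetch/decode work is amortized over
# n arithmetic results.
N = 4                                   # lane count, as in ARM NEON
MASK = 2**32 - 1                        # lanes hold 32-bit words

def vadd(a, b):
    """One vector instruction: N independent 32-bit additions."""
    return [(x + y) & MASK for x, y in zip(a, b)]

def vector_sum(xs):
    """Sum len(xs) integers using len(xs)//N vector adds plus a final reduce."""
    acc = [0] * N
    for off in range(0, len(xs), N):
        acc = vadd(acc, xs[off:off + N])    # one insn performs N additions
    return sum(acc) & MASK                  # horizontal reduction at the end

print(vector_sum(list(range(1, 1001))))     # 1 + 2 + ... + 1000 = 500500
```

With real SIMD hardware the inner loop issues one add instruction per N elements, which is exactly the amortization of instruction-fetch/decode circuits the slide describes.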

slide-143
SLIDE 143

31

“Superscalar” processing:

[diagram: superscalar CPU with two insn-fetch units, two insn-decode units, four register-read ports, two multipliers, and two register-write ports]

32

“Vector” processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM “NEON” has n = 4; Intel “AVX2” has n = 8; Intel “AVX-512” has n = 16; GPUs have larger n. n× speedup if n× arithmetic circuits, n× read/write circuits. Benefit: Amortizes insn circuits. Huge effect on higher-level algorithms and data structures.

Network on chip: the mesh. How expensive is sorting? Input: array of n numbers. Each number in {1, 2, …, n²}, represented in binary. Output: array of n numbers, in increasing order, represented in binary; same multiset as input.

slide-144
SLIDE 144

31


32

“Vector” processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM “NEON” has n = 4; Intel “AVX2” has n = 8; Intel “AVX-512” has n = 16; GPUs have larger n. n× speedup if n× arithmetic circuits, n× read/write circuits. Benefit: Amortizes insn circuits. Huge effect on higher-level algorithms and data structures.

Network on chip: the mesh. How expensive is sorting? Input: array of n numbers. Each number in {1, 2, …, n²}, represented in binary. Output: array of n numbers, in increasing order, represented in binary; same multiset as input.

slide-145
SLIDE 145

31


32

“Vector” processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM “NEON” has n = 4; Intel “AVX2” has n = 8; Intel “AVX-512” has n = 16; GPUs have larger n. n× speedup if n× arithmetic circuits, n× read/write circuits. Benefit: Amortizes insn circuits. Huge effect on higher-level algorithms and data structures.

Network on chip: the mesh. How expensive is sorting? Input: array of n numbers. Each number in {1, 2, …, n²}, represented in binary. Output: array of n numbers, in increasing order, represented in binary; same multiset as input.

slide-146
SLIDE 146

32

“Vector” processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM “NEON” has n = 4; Intel “AVX2” has n = 8; Intel “AVX-512” has n = 16; GPUs have larger n. n× speedup if n× arithmetic circuits, n× read/write circuits. Benefit: Amortizes insn circuits. Huge effect on higher-level algorithms and data structures.

33

Network on chip: the mesh. How expensive is sorting? Input: array of n numbers. Each number in {1, 2, …, n²}, represented in binary. Output: array of n numbers, in increasing order, represented in binary; same multiset as input.

slide-147
SLIDE 147

32

“Vector” processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM “NEON” has n = 4; Intel “AVX2” has n = 8; Intel “AVX-512” has n = 16; GPUs have larger n. n× speedup if n× arithmetic circuits, n× read/write circuits. Benefit: Amortizes insn circuits. Huge effect on higher-level algorithms and data structures.

33

Network on chip: the mesh. How expensive is sorting? Input: array of n numbers. Each number in {1, 2, …, n²}, represented in binary. Output: array of n numbers, in increasing order, represented in binary; same multiset as input. Metric: seconds used by circuit of area n^(1+o(1)). For simplicity assume n = 4^k.

slide-148
SLIDE 148

32

“Vector” processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM “NEON” has n = 4; Intel “AVX2” has n = 8; Intel “AVX-512” has n = 16; GPUs have larger n. n× speedup if n× arithmetic circuits, n× read/write circuits. Benefit: Amortizes insn circuits. Huge effect on higher-level algorithms and data structures.

33

Network on chip: the mesh. How expensive is sorting? Input: array of n numbers. Each number in {1, 2, …, n²}, represented in binary. Output: array of n numbers, in increasing order, represented in binary; same multiset as input. Metric: seconds used by circuit of area n^(1+o(1)). For simplicity assume n = 4^k.

Spread array across square mesh of n small cells, each of area n^(o(1)), with near-neighbor wiring:

[figure: √n × √n grid of cells with near-neighbor wires]

slide-149
SLIDE 149

32


33

Network on chip: the mesh. How expensive is sorting? Input: array of n numbers. Each number in {1, 2, …, n²}, represented in binary. Output: array of n numbers, in increasing order, represented in binary; same multiset as input. Metric: seconds used by circuit of area n^(1+o(1)). For simplicity assume n = 4^k.

Spread array across square mesh of n small cells, each of area n^(o(1)), with near-neighbor wiring:

[figure: √n × √n grid of cells with near-neighbor wires]

slide-150
SLIDE 150

32


33

Network on chip: the mesh. How expensive is sorting? Input: array of n numbers. Each number in {1, 2, …, n²}, represented in binary. Output: array of n numbers, in increasing order, represented in binary; same multiset as input. Metric: seconds used by circuit of area n^(1+o(1)). For simplicity assume n = 4^k.

Spread array across square mesh of n small cells, each of area n^(o(1)), with near-neighbor wiring:

[figure: √n × √n grid of cells with near-neighbor wires]

slide-151
SLIDE 151

33

Network on chip: the mesh. How expensive is sorting? Input: array of n numbers. Each number in {1, 2, …, n²}, represented in binary. Output: array of n numbers, in increasing order, represented in binary; same multiset as input. Metric: seconds used by circuit of area n^(1+o(1)). For simplicity assume n = 4^k.

34

Spread array across square mesh of n small cells, each of area n^(o(1)), with near-neighbor wiring:

[figure: √n × √n grid of cells with near-neighbor wires]

slide-152
SLIDE 152

33

Network on chip: the mesh. How expensive is sorting? Input: array of n numbers. Each number in {1, 2, …, n²}, represented in binary. Output: array of n numbers, in increasing order, represented in binary; same multiset as input. Metric: seconds used by circuit of area n^(1+o(1)). For simplicity assume n = 4^k.

34

Spread array across square mesh of n small cells, each of area n^(o(1)), with near-neighbor wiring:

[figure: √n × √n grid of cells with near-neighbor wires]

Sort row of n^(0.5) cells in n^(0.5+o(1)) seconds:

  • Sort each pair in parallel.

3 1 4 1 5 9 2 6 → 1 3 1 4 5 9 2 6

  • Sort alternate pairs in parallel.

1 3 1 4 5 9 2 6 → 1 1 3 4 5 2 9 6

  • Repeat until number of steps equals row length.

slide-153
SLIDE 153

33

Network on chip: the mesh. How expensive is sorting? Input: array of n numbers. Each number in {1, 2, …, n²}, represented in binary. Output: array of n numbers, in increasing order, represented in binary; same multiset as input. Metric: seconds used by circuit of area n^(1+o(1)). For simplicity assume n = 4^k.

34

Spread array across square mesh of n small cells, each of area n^(o(1)), with near-neighbor wiring:

[figure: √n × √n grid of cells with near-neighbor wires]

Sort row of n^(0.5) cells in n^(0.5+o(1)) seconds:

  • Sort each pair in parallel.

3 1 4 1 5 9 2 6 → 1 3 1 4 5 9 2 6

  • Sort alternate pairs in parallel.

1 3 1 4 5 9 2 6 → 1 1 3 4 5 2 9 6

  • Repeat until number of steps equals row length.

slide-154
SLIDE 154

33


34

Spread array across square mesh of n small cells, each of area n^(o(1)), with near-neighbor wiring:

[figure: √n × √n grid of cells with near-neighbor wires]

Sort row of n^(0.5) cells in n^(0.5+o(1)) seconds:

  • Sort each pair in parallel.

3 1 4 1 5 9 2 6 → 1 3 1 4 5 9 2 6

  • Sort alternate pairs in parallel.

1 3 1 4 5 9 2 6 → 1 1 3 4 5 2 9 6

  • Repeat until number of steps

equals row length.

slide-155
SLIDE 155

34

Spread array across square mesh of n small cells, each of area n^(o(1)), with near-neighbor wiring:

[figure: √n × √n grid of cells with near-neighbor wires]

35

Sort row of n^(0.5) cells in n^(0.5+o(1)) seconds:

  • Sort each pair in parallel.

3 1 4 1 5 9 2 6 → 1 3 1 4 5 9 2 6

  • Sort alternate pairs in parallel.

1 3 1 4 5 9 2 6 → 1 1 3 4 5 2 9 6

  • Repeat until number of steps

equals row length.
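The pair/alternate-pair scheme above is odd-even transposition sort; after as many steps as the row length, the row is sorted, and every step touches only adjacent cells, so a mesh does all compare-exchanges of a step in parallel. A minimal sequential sketch:

```python
# Odd-even transposition sort of one row. Even steps compare pairs
# (0,1),(2,3),...; odd steps compare alternate pairs (1,2),(3,4),...
# After n steps the row of n cells is sorted.
def oetsort(row):
    row = list(row)
    n = len(row)
    for step in range(n):
        start = step % 2
        for i in range(start, n - 1, 2):
            if row[i] > row[i + 1]:            # each compare-exchange is
                row[i], row[i + 1] = row[i + 1], row[i]   # independent → parallel on a mesh
    return row

print(oetsort([3, 1, 4, 1, 5, 9, 2, 6]))       # [1, 1, 2, 3, 4, 5, 6, 9]
```

The first two steps on [3, 1, 4, 1, 5, 9, 2, 6] reproduce exactly the two intermediate rows shown on the slide.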

slide-156
SLIDE 156

34

Spread array across square mesh of n small cells, each of area n^(o(1)), with near-neighbor wiring:

[figure: √n × √n grid of cells with near-neighbor wires]

35

Sort row of n^(0.5) cells in n^(0.5+o(1)) seconds:

  • Sort each pair in parallel.

3 1 4 1 5 9 2 6 → 1 3 1 4 5 9 2 6

  • Sort alternate pairs in parallel.

1 3 1 4 5 9 2 6 → 1 1 3 4 5 2 9 6

  • Repeat until number of steps

equals row length. Sort each row, in parallel, in a total of n^(0.5+o(1)) seconds.

slide-157
SLIDE 157

34

Spread array across square mesh of n small cells, each of area n^(o(1)), with near-neighbor wiring:

[figure: √n × √n grid of cells with near-neighbor wires]

35

Sort row of n^(0.5) cells in n^(0.5+o(1)) seconds:

  • Sort each pair in parallel.

3 1 4 1 5 9 2 6 → 1 3 1 4 5 9 2 6

  • Sort alternate pairs in parallel.

1 3 1 4 5 9 2 6 → 1 1 3 4 5 2 9 6

  • Repeat until number of steps

equals row length. Sort each row, in parallel, in a total of n^(0.5+o(1)) seconds.

Sort all n cells in n^(0.5+o(1)) seconds:

  • Recursively sort quadrants in parallel, if n > 1.
  • Sort each column in parallel.
  • Sort each row in parallel.
  • Sort each column in parallel.
  • Sort each row in parallel.

With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array.

slide-158
SLIDE 158

34

Spread array across square mesh of n small cells, each of area n^(o(1)), with near-neighbor wiring:

[figure: √n × √n grid of cells with near-neighbor wires]

35

Sort row of n^(0.5) cells in n^(0.5+o(1)) seconds:

  • Sort each pair in parallel.

3 1 4 1 5 9 2 6 → 1 3 1 4 5 9 2 6

  • Sort alternate pairs in parallel.

1 3 1 4 5 9 2 6 → 1 1 3 4 5 2 9 6

  • Repeat until number of steps

equals row length. Sort each row, in parallel, in a total of n^(0.5+o(1)) seconds.

Sort all n cells in n^(0.5+o(1)) seconds:

  • Recursively sort quadrants in parallel, if n > 1.
  • Sort each column in parallel.
  • Sort each row in parallel.
  • Sort each column in parallel.
  • Sort each row in parallel.

With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array.

slide-159
SLIDE 159

34


35

Sort row of n^(0.5) cells in n^(0.5+o(1)) seconds:

  • Sort each pair in parallel.

3 1 4 1 5 9 2 6 → 1 3 1 4 5 9 2 6

  • Sort alternate pairs in parallel.

1 3 1 4 5 9 2 6 → 1 1 3 4 5 2 9 6

  • Repeat until number of steps

equals row length. Sort each row, in parallel, in a total of n^(0.5+o(1)) seconds.

Sort all n cells in n^(0.5+o(1)) seconds:

  • Recursively sort quadrants in parallel, if n > 1.
  • Sort each column in parallel.
  • Sort each row in parallel.
  • Sort each column in parallel.
  • Sort each row in parallel.

With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array.

slide-160
SLIDE 160

35

Sort row of n^(0.5) cells in n^(0.5+o(1)) seconds:

  • Sort each pair in parallel.

3 1 4 1 5 9 2 6 → 1 3 1 4 5 9 2 6

  • Sort alternate pairs in parallel.

1 3 1 4 5 9 2 6 → 1 1 3 4 5 2 9 6

  • Repeat until number of steps

equals row length. Sort each row, in parallel, in a total of n^(0.5+o(1)) seconds.

36

Sort all n cells in n^(0.5+o(1)) seconds:

  • Recursively sort quadrants in parallel, if n > 1.

  • Sort each column in parallel.
  • Sort each row in parallel.
  • Sort each column in parallel.
  • Sort each row in parallel.

With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array.
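The recursive quadrant/column/row scheme above resembles Schimmler's mesh sort. As a runnable stand-in, here is the closely related shearsort, which likewise uses only whole-row and whole-column sorts and provably sorts an m × m mesh into boustrophedon (“snake”) order in O(log m) row/column phases; `shearsort` is an invented name for this sketch:

```python
# Shearsort: alternate (a) sorting rows in snake order (even rows left-to-
# right, odd rows right-to-left) and (b) sorting every column top-to-bottom.
# ceil(log2(m)) + 1 phases plus a final row phase sort the whole m x m mesh.
import math

def shearsort(grid):
    grid = [list(row) for row in grid]
    m = len(grid)
    for _ in range(math.ceil(math.log2(m)) + 1):
        for r in range(m):                     # rows: alternate directions
            grid[r].sort(reverse=(r % 2 == 1))
        for c in range(m):                     # columns: always increasing downward
            col = sorted(grid[r][c] for r in range(m))
            for r in range(m):
                grid[r][c] = col[r]
    for r in range(m):                         # final row phase finishes the snake
        grid[r].sort(reverse=(r % 2 == 1))
    return grid

pi_grid = [[3, 1, 4, 1], [5, 9, 2, 6], [5, 3, 5, 8], [9, 7, 9, 3]]
print(shearsort(pi_grid))
```

On the mesh, each row/column sort runs as a parallel odd-even transposition pass, so the whole thing fits the n^(0.5+o(1))-second budget; the quadrant recursion on the slide reduces the number of full-mesh phases further.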

slide-161
SLIDE 161

35

Sort row of n^(0.5) cells in n^(0.5+o(1)) seconds:

  • Sort each pair in parallel.

3 1 4 1 5 9 2 6 → 1 3 1 4 5 9 2 6

  • Sort alternate pairs in parallel.

1 3 1 4 5 9 2 6 → 1 1 3 4 5 2 9 6

  • Repeat until number of steps equals row length.

Sort each row, in parallel, in a total of n^(0.5+o(1)) seconds.

36

Sort all n cells in n^(0.5+o(1)) seconds:

  • Recursively sort quadrants in parallel, if n > 1.

  • Sort each column in parallel.
  • Sort each row in parallel.
  • Sort each column in parallel.
  • Sort each row in parallel.

With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array. For example, assume that this 8 × 8 array is in cells:

3 1 4 1 5 9 2 6
5 3 5 8 9 7 9 3
2 3 8 4 6 2 6 4
3 3 8 3 2 7 9 5
0 2 8 8 4 1 9 7
1 6 9 3 9 9 3 7
5 1 0 5 8 2 0 9
7 4 9 4 4 5 9 2

slide-162
SLIDE 162

35

Sort row of n^(0.5) cells in n^(0.5+o(1)) seconds:

  • Sort each pair in parallel.

3 1 4 1 5 9 2 6 → 1 3 1 4 5 9 2 6

  • Sort alternate pairs in parallel.

1 3 1 4 5 9 2 6 → 1 1 3 4 5 2 9 6

  • Repeat until number of steps equals row length.

Sort each row, in parallel, in a total of n^(0.5+o(1)) seconds.

36

Sort all n cells in n0:5+o(1) seconds:

  • Recursively sort quadrants

in parallel, if n > 1.

  • Sort each column in parallel.
  • Sort each row in parallel.
  • Sort each column in parallel.
  • Sort each row in parallel.

With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array. For example, assume that this 8 × 8 array is in cells:

3 1 4 1 5 9 2 6
5 3 5 8 9 7 9 3
2 3 8 4 6 2 6 4
3 3 8 3 2 7 9 5
0 2 8 8 4 1 9 7
1 6 9 3 9 9 3 7
5 1 0 5 8 2 0 9
7 4 9 4 4 5 9 2

slide-163
SLIDE 163

35

rallel. parallel. steps seconds.

36

Sort all n cells in n0:5+o(1) seconds:

  • Recursively sort quadrants

in parallel, if n > 1.

  • Sort each column in parallel.
  • Sort each row in parallel.
  • Sort each column in parallel.
  • Sort each row in parallel.

With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array. For example, assume that this 8 × 8 array is in cells:

3 1 4 1 5 9 2 6
5 3 5 8 9 7 9 3
2 3 8 4 6 2 6 4
3 3 8 3 2 7 9 5
0 2 8 8 4 1 9 7
1 6 9 3 9 9 3 7
5 1 0 5 8 2 0 9
7 4 9 4 4 5 9 2

slide-164
SLIDE 164

36

Sort all n cells in n^(0.5+o(1)) seconds:

  • Recursively sort quadrants in parallel, if n > 1.

  • Sort each column in parallel.
  • Sort each row in parallel.
  • Sort each column in parallel.
  • Sort each row in parallel.

With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array.

37

For example, assume that this 8 × 8 array is in cells:

3 1 4 1 5 9 2 6
5 3 5 8 9 7 9 3
2 3 8 4 6 2 6 4
3 3 8 3 2 7 9 5
0 2 8 8 4 1 9 7
1 6 9 3 9 9 3 7
5 1 0 5 8 2 0 9
7 4 9 4 4 5 9 2

slide-165
SLIDE 165

36

Sort all n cells in n^(0.5+o(1)) seconds:

  • Recursively sort quadrants in parallel, if n > 1.
  • Sort each column in parallel.
  • Sort each row in parallel.
  • Sort each column in parallel.
  • Sort each row in parallel.

With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array.

37

For example, assume that this 8 × 8 array is in cells:

3 1 4 1 5 9 2 6
5 3 5 8 9 7 9 3
2 3 8 4 6 2 6 4
3 3 8 3 2 7 9 5
0 2 8 8 4 1 9 7
1 6 9 3 9 9 3 7
5 1 0 5 8 2 0 9
7 4 9 4 4 5 9 2

Recursively sort quadrants, top →, bottom ←:

1 1 2 3 2 2 2 3
3 3 3 3 4 5 5 6
3 4 4 5 6 6 7 7
5 8 8 8 9 9 9 9
1 1 0 0 2 2 1 0
4 4 3 2 5 4 4 3
7 6 5 5 9 8 7 7
9 9 8 8 9 9 9 9

slide-166
SLIDE 166

36

Sort all n cells in n^(0.5+o(1)) seconds:

  • Recursively sort quadrants in parallel, if n > 1.
  • Sort each column in parallel.
  • Sort each row in parallel.
  • Sort each column in parallel.
  • Sort each row in parallel.

With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array.

37

For example, assume that this 8 × 8 array is in cells:

3 1 4 1 5 9 2 6
5 3 5 8 9 7 9 3
2 3 8 4 6 2 6 4
3 3 8 3 2 7 9 5
0 2 8 8 4 1 9 7
1 6 9 3 9 9 3 7
5 1 0 5 8 2 0 9
7 4 9 4 4 5 9 2

Recursively sort quadrants, top →, bottom ←:

1 1 2 3 2 2 2 3
3 3 3 3 4 5 5 6
3 4 4 5 6 6 7 7
5 8 8 8 9 9 9 9
1 1 0 0 2 2 1 0
4 4 3 2 5 4 4 3
7 6 5 5 9 8 7 7
9 9 8 8 9 9 9 9

slide-167
SLIDE 167

36


37

For example, assume that this 8 × 8 array is in cells:

3 1 4 1 5 9 2 6
5 3 5 8 9 7 9 3
2 3 8 4 6 2 6 4
3 3 8 3 2 7 9 5
0 2 8 8 4 1 9 7
1 6 9 3 9 9 3 7
5 1 0 5 8 2 0 9
7 4 9 4 4 5 9 2

Recursively sort quadrants, top →, bottom ←:

1 1 2 3 2 2 2 3
3 3 3 3 4 5 5 6
3 4 4 5 6 6 7 7
5 8 8 8 9 9 9 9
1 1 0 0 2 2 1 0
4 4 3 2 5 4 4 3
7 6 5 5 9 8 7 7
9 9 8 8 9 9 9 9

slide-168
SLIDE 168

37

For example, assume that this 8 × 8 array is in cells:

3 1 4 1 5 9 2 6
5 3 5 8 9 7 9 3
2 3 8 4 6 2 6 4
3 3 8 3 2 7 9 5
0 2 8 8 4 1 9 7
1 6 9 3 9 9 3 7
5 1 0 5 8 2 0 9
7 4 9 4 4 5 9 2

38

Recursively sort quadrants, top →, bottom ←:

1 1 2 3 2 2 2 3
3 3 3 3 4 5 5 6
3 4 4 5 6 6 7 7
5 8 8 8 9 9 9 9
1 1 0 0 2 2 1 0
4 4 3 2 5 4 4 3
7 6 5 5 9 8 7 7
9 9 8 8 9 9 9 9

slide-172
SLIDE 172

38

Recursively sort quadrants, top →, bottom ←:

1 1 2 3 2 2 2 3
3 3 3 3 4 5 5 6
3 4 4 5 6 6 7 7
5 8 8 8 9 9 9 9
1 1 0 0 2 2 1 0
4 4 3 2 5 4 4 3
7 6 5 5 9 8 7 7
9 9 8 8 9 9 9 9

39

Sort each column in parallel:

1 1 0 0 2 2 1 0
1 1 2 2 2 2 2 3
3 3 3 3 4 4 4 3
3 4 3 3 5 5 5 6
4 4 4 5 6 6 7 7
5 6 5 5 9 8 7 7
7 8 8 8 9 9 9 9
9 9 8 8 9 9 9 9
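These first two intermediate grids can also be reproduced directly, with each quadrant's values simply placed in sorted order as a stand-in for the recursion (a sketch; the helper names `quadrant` and `fill` are just for this illustration):

```python
# First 64 digits of pi, as in the slides (zeros included).
pi_grid = [
    [3, 1, 4, 1, 5, 9, 2, 6],
    [5, 3, 5, 8, 9, 7, 9, 3],
    [2, 3, 8, 4, 6, 2, 6, 4],
    [3, 3, 8, 3, 2, 7, 9, 5],
    [0, 2, 8, 8, 4, 1, 9, 7],
    [1, 6, 9, 3, 9, 9, 3, 7],
    [5, 1, 0, 5, 8, 2, 0, 9],
    [7, 4, 9, 4, 4, 5, 9, 2],
]

def quadrant(g, r0, c0):
    """Values of the 4x4 block with top-left corner (r0, c0)."""
    return [v for row in g[r0:r0 + 4] for v in row[c0:c0 + 4]]

def fill(vals, left_to_right):
    """Lay sorted values into a 4x4 block, rows read -> or <-."""
    rows = [sorted(vals)[i * 4:i * 4 + 4] for i in range(4)]
    return rows if left_to_right else [r[::-1] for r in rows]

# Recursively sort quadrants, top ->, bottom <-:
tl = fill(quadrant(pi_grid, 0, 0), True)
tr = fill(quadrant(pi_grid, 0, 4), True)
bl = fill(quadrant(pi_grid, 4, 0), False)
br = fill(quadrant(pi_grid, 4, 4), False)
step1 = [tl[i] + tr[i] for i in range(4)] + [bl[i] + br[i] for i in range(4)]

# Sort each column in parallel:
step2 = [list(r) for r in zip(*(sorted(c) for c in zip(*step1)))]
```

Here `step1` and `step2` match the two grids shown on this slide.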

slide-176
SLIDE 176

39

Sort each column in parallel:

1 1 0 0 2 2 1 0
1 1 2 2 2 2 2 3
3 3 3 3 4 4 4 3
3 4 3 3 5 5 5 6
4 4 4 5 6 6 7 7
5 6 5 5 9 8 7 7
7 8 8 8 9 9 9 9
9 9 8 8 9 9 9 9

40

Sort each row in parallel, alternately →, ←:

0 0 0 1 1 1 2 2
3 2 2 2 2 2 1 1
3 3 3 3 3 4 4 4
6 5 5 5 4 3 3 3
4 4 4 5 6 6 7 7
9 8 7 7 6 5 5 5
7 8 8 8 9 9 9 9
9 9 9 9 9 9 8 8

slide-180
SLIDE 180

40

Sort each row in parallel, alternately →, ←:

0 0 0 1 1 1 2 2
3 2 2 2 2 2 1 1
3 3 3 3 3 4 4 4
6 5 5 5 4 3 3 3
4 4 4 5 6 6 7 7
9 8 7 7 6 5 5 5
7 8 8 8 9 9 9 9
9 9 9 9 9 9 8 8

41

Sort each column in parallel:

0 0 0 1 1 1 1 1
3 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3
4 4 4 5 4 4 4 4
6 5 5 5 6 5 5 5
7 8 7 7 6 6 7 7
9 8 8 8 9 9 8 8
9 9 9 9 9 9 9 9

slide-184
SLIDE 184

41

Sort each column in parallel:

0 0 0 1 1 1 1 1
3 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3
4 4 4 5 4 4 4 4
6 5 5 5 6 5 5 5
7 8 7 7 6 6 7 7
9 8 8 8 9 9 8 8
9 9 9 9 9 9 9 9

42

Sort each row in parallel, ← or → as desired:

0 0 0 1 1 1 1 1
2 2 2 2 2 2 2 3
3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 5
5 5 5 5 5 5 6 6
6 6 7 7 7 7 7 8
8 8 8 8 8 9 9 9
9 9 9 9 9 9 9 9

slide-190
SLIDE 190

42

Sort each row in parallel, ← or → as desired:

0 0 0 1 1 1 1 1
2 2 2 2 2 2 2 3
3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 5
5 5 5 5 5 5 6 6
6 6 7 7 7 7 7 8
8 8 8 8 8 9 9 9
9 9 9 9 9 9 9 9

43

Chips are in fact evolving towards having this much parallelism and communication. GPUs: parallel + global RAM. Old Xeon Phi: parallel + ring. New Xeon Phi: parallel + mesh. Algorithm designers don’t even get the right exponent without taking this into account. Shock waves from subroutines into high-level algorithm design.