std::map<Code,Performance> myMCU{?}: The mapping between Code & Performance (@DanielPenning)


May 27, 2023


SLIDE 1

@DanielPenning www.embeff.com

std::map<Code,Performance> myMCU{?}

The mapping between Code & Performance

SLIDE 2

World Map (1459)

SLIDE 3

World Map (1525)

SLIDE 4

People admitted they don’t know.

SLIDE 5

The Beginning of Modern Science

1. Admit ignorance
2. Observations

- Measure and gather data.
- Connect data into comprehensive theories.

SLIDE 6

SLIDE 7

Embedded & Ignorance

Code -> ? -> Performance

- Target Architecture
- Target Speed
- Compiler
- Compiler Settings
- Target Cache

Possibly a highly complex and interdependent mapping!

SLIDE 8

SLIDE 9

Consequences

- Prejudices prevail
- Mistrust of libraries
- Low code quality
- Performance suffers

SLIDE 10

Let’s admit our ignorance.

SLIDE 11

Observations in Embedded

Profiling

- Top-down process.
- Great for identifying bottlenecks.
- Bad for building specific understanding.

Build knowledge bottom up

- Start with small code blocks.
- Observe performance.
- Create heuristics.

SLIDE 12

Code Performance for armv7m

- Architecture widely used (Cortex-M3/M4)
- Provides Data Watchpoint and Trace Unit

CMSIS Register   Description
DWT_CYCCNT       Cycle Count Register
DWT_CPICNT       CPI Count Register
DWT_EXCCNT       Exception Overhead Count Register
DWT_SLEEPCNT     Sleep Count Register
DWT_LSUCNT       LSU Count Register
DWT_FOLDCNT      Folded-instruction Count Register
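
As a rough illustration of how the registers listed above are laid out and how the cycle counter is typically switched on, here is a minimal sketch. The struct and the helper `EnableCycleCounter` are hypothetical names, not taken from the slides or from the CMSIS headers. The offsets and the two enable bits (DEMCR.TRCENA, DWT_CTRL.CYCCNTENA) follow the armv7m architecture. The helper takes references so that it can be exercised against a fake register block on a PC.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// First registers of the Data Watchpoint and Trace unit, in address order.
// On armv7m hardware this block starts at address 0xE0001000.
struct DwtRegisters {
    volatile std::uint32_t CTRL;      // 0x00 Control Register
    volatile std::uint32_t CYCCNT;    // 0x04 Cycle Count Register
    volatile std::uint32_t CPICNT;    // 0x08 CPI Count Register
    volatile std::uint32_t EXCCNT;    // 0x0C Exception Overhead Count Register
    volatile std::uint32_t SLEEPCNT;  // 0x10 Sleep Count Register
    volatile std::uint32_t LSUCNT;    // 0x14 LSU Count Register
    volatile std::uint32_t FOLDCNT;   // 0x18 Folded-instruction Count Register
};
static_assert(offsetof(DwtRegisters, FOLDCNT) == 0x18, "unexpected register layout");

constexpr std::uint32_t kDemcrTrcena  = 1u << 24;  // DEMCR.TRCENA: enables the DWT unit
constexpr std::uint32_t kDwtCyccntena = 1u << 0;   // DWT_CTRL.CYCCNTENA: starts CYCCNT

// Hypothetical helper: enable tracing, reset the counter, start counting.
inline void EnableCycleCounter(DwtRegisters& dwt, volatile std::uint32_t& demcr) {
    demcr = demcr | kDemcrTrcena;
    dwt.CYCCNT = 0;
    dwt.CTRL = dwt.CTRL | kDwtCyccntena;
}
```

On a real target, `dwt` would refer to the block at 0xE0001000 and `demcr` to the register at 0xE000EDFC. On a PC, a zero-initialized `DwtRegisters` object and a plain variable can stand in for both.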

SLIDE 13

Measure Cycles

OpenOCD (PC) <-- JTAG --> STM32F4 (DWT)

    BKPT                          //< Read CYCCNT
    CodeUnderTest(<Parameter>)
    BKPT                          //< Read CYCCNT
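
The read, run, read pattern above can be sketched in code. This is a simplified stand-in, not the setup from the slides: there the debugger reads CYCCNT at each breakpoint, whereas this sketch reads the counter from software through a caller-supplied function. `ElapsedCycles` and `MeasureCycles` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Cycles between two CYCCNT readings. Unsigned subtraction stays correct
// even if the 32-bit counter wraps around once between the readings.
constexpr std::uint32_t ElapsedCycles(std::uint32_t before, std::uint32_t after) {
    return after - before;
}

// Hypothetical harness: read_counter stands in for reading CYCCNT,
// code_under_test is the small code block being observed.
template <typename ReadCounter, typename Code>
std::uint32_t MeasureCycles(ReadCounter read_counter, Code code_under_test) {
    const std::uint32_t before = read_counter();
    code_under_test();
    const std::uint32_t after = read_counter();
    return ElapsedCycles(before, after);
}
```

Reading the counter in software adds a few cycles of its own to every result. Measuring an empty code block first and subtracting that value is one way to account for it.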

SLIDE 14

Let’s make observations.

SLIDE 15

Example 1: Basic Optimization

int square(int x) {
    return x*x;
}

No optimization (-O0):

square(int):
    push {r7}
    sub  sp, sp, #12
    add  r7, sp, #0
    str  r0, [r7, #4]
    ldr  r3, [r7, #4]
    ldr  r2, [r7, #4]
    mul  r3, r2, r3
    mov  r0, r3
    adds r7, r7, #12
    mov  sp, r7
    ldr  r7, [sp], #4
    bx   lr

Minimal optimization (-Og):

square(int):
    mul r0, r0, r0
    bx  lr

[Bar chart: Cycles (axis 5 to 30) for Minimal (-Og) and No (-O0) optimization]

SLIDE 16

Heuristic #1 The difference between minimal and no optimization is huge.

SLIDE 17

Example 2: Pipeline

int DependentOps(int x) {
    int tmp = x/3;
    int tmp2 = x/7;
    return tmp+tmp2;
}

DependentOps_O1(int):
    ldr   r3, .L2
    smull r2, r3, r3, r0
    asrs  r1, r0, #31
    subs  r3, r3, r1
    ldr   r2, .L2+4
    smull ip, r2, r2, r0
    add   r0, r0, r2
    rsb   r0, r1, r0, asr #2
    add   r0, r0, r3
    bx    lr
.L2:
    .word 1431655766
    .word -1840700269

DependentOps_O2(int):
    ldr   r3, .L3
    ldr   r1, .L3+4
    smull r2, r3, r3, r0
    add   r3, r3, r0
    asrs  r2, r0, #31
    smull r1, r0, r1, r0
    rsb   r3, r2, r3, asr #2
    subs  r0, r0, r2
    add   r0, r0, r3
    bx    lr
.L3:
    .word -1840700269
    .word 1431655766

[Bar chart: Cycles, axis 10 to 17]
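
Neither listing contains a divide instruction: the compiler replaces both divisions with a multiplication by a "magic" constant (the two .word values) followed by shifts and a sign correction. The sketch below replays those steps in C++ to show why the sequence yields x/3 and x/7. The function names are hypothetical. It assumes that right-shifting a negative signed value is an arithmetic shift, which C++20 guarantees and mainstream compilers did before that.

```cpp
#include <cassert>
#include <cstdint>

// Upper 32 bits of the signed 64-bit product, i.e. what smull leaves in its
// high result register. Assumes arithmetic right shift of negative values.
inline std::int32_t MulHigh(std::int32_t a, std::int32_t b) {
    return static_cast<std::int32_t>((static_cast<std::int64_t>(a) * b) >> 32);
}

// x / 3: multiply by 1431655766 (0x55555556), then subtract the sign mask.
inline std::int32_t DivideBy3(std::int32_t x) {
    const std::int32_t sign = x >> 31;  // asrs: -1 for negative x, else 0
    return MulHigh(x, 1431655766) - sign;
}

// x / 7: multiply by -1840700269 (0x92492493), add x, shift right by 2,
// then subtract the sign mask (the add and rsb ..., asr #2 instructions).
inline std::int32_t DivideBy7(std::int32_t x) {
    const std::int32_t sign = x >> 31;
    const std::int32_t high = MulHigh(x, -1840700269) + x;
    return (high >> 2) - sign;
}

// Same result as DependentOps, built from the hand-written steps above.
inline std::int32_t DependentOpsByHand(std::int32_t x) {
    return DivideBy3(x) + DivideBy7(x);
}
```

The two multiplications do not depend on each other, which leaves the compiler free to reorder them, as the differing instruction order in the two listings shows.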

SLIDE 18

Heuristic #2 In low-level assembly, the compiler is probably smarter than you.

SLIDE 19

Example 3: FPU vs Soft-FPU

int MultiplyWithPi(int input) {
    return input * 3.14159265359f;
}

MultiplyWithPi_FPU(int):
    vmov         s15, r0  @ int
    vldr.32      s14, .L3
    vcvt.f32.s32 s15, s15
    vmul.f32     s15, s15, s14
    vcvt.s32.f32 s15, s15
    vmov         r0, s15  @ int
    bx           lr
.L3:
    .word 1078530011

MultiplyWithPi_SoftFPU(int):
    push {r3, lr}
    bl   __aeabi_i2f
    ldr  r1, .L4
    bl   __aeabi_fmul
    bl   __aeabi_f2iz
    pop  {r3, pc}
.L4:
    .word 1078530011
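
Both listings perform the same three steps: convert the int to float, multiply, and convert back with truncation. The literal .word 1078530011 is the IEEE 754 bit pattern of the float constant. The sketch below spells this out; `FloatBits` and `MultiplyWithPiSteps` are hypothetical names, and the code assumes a 32-bit IEEE 754 float.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Raw bit pattern of a float, copied out with memcpy to avoid aliasing issues.
inline std::uint32_t FloatBits(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// MultiplyWithPi split into the steps visible in both listings.
inline int MultiplyWithPiSteps(int input) {
    const float as_float = static_cast<float>(input);  // vcvt.f32.s32 / __aeabi_i2f
    const float product = as_float * 3.14159265359f;   // vmul.f32     / __aeabi_fmul
    return static_cast<int>(product);                  // vcvt.s32.f32 / __aeabi_f2iz (truncates toward zero)
}
```

With a hardware FPU each step is a single instruction. Without one, each step becomes a call into a runtime library routine.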

SLIDE 20

Example 3: FPU vs Soft-FPU

int MultiplyWithPi(int input) {
    return input * 3.14159265359f;
}

[Chart: Cycles (axis 10 to 110) over Input Value (-9 to 9), series FPU and Soft-FPU]

SLIDE 21

Heuristic #3 Software FPU is ~6x slower and not deterministic.

SLIDE 22

Example 4: CRC Computation

Cyclic Redundancy Check

- Direct Computation
- Lookup-Table
- Hardware-Support

Online Benchmarking

- Execute on real hardware.
- Technical Preview Stage.
- https://barebench.com
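
To make the first two CRC variants concrete, here is a minimal sketch of direct (bit-by-bit) and lookup-table computation. The slides do not say which CRC was benchmarked; the widely used reflected CRC-32 (polynomial 0xEDB88320) is assumed here, and all function names are hypothetical. The hardware-supported variant depends on the MCU's CRC peripheral and is not shown.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uint32_t kPolynomial = 0xEDB88320u;  // reflected CRC-32 polynomial (assumed)

// Direct computation: eight shift/xor steps per input byte, no table memory.
inline std::uint32_t Crc32Direct(const std::uint8_t* data, std::size_t size) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < size; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc & 1u) ? (crc >> 1) ^ kPolynomial : (crc >> 1);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Precompute the effect of the eight shift/xor steps for every byte value.
inline std::array<std::uint32_t, 256> MakeCrc32Table() {
    std::array<std::uint32_t, 256> table{};
    for (std::uint32_t n = 0; n < 256; ++n) {
        std::uint32_t value = n;
        for (int bit = 0; bit < 8; ++bit) {
            value = (value & 1u) ? (value >> 1) ^ kPolynomial : (value >> 1);
        }
        table[n] = value;
    }
    return table;
}

// Lookup-table computation: one table access per input byte, 1 KiB of table.
inline std::uint32_t Crc32Table(const std::uint8_t* data, std::size_t size) {
    static const std::array<std::uint32_t, 256> table = MakeCrc32Table();
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < size; ++i) {
        crc = table[(crc ^ data[i]) & 0xFFu] ^ (crc >> 8);
    }
    return crc ^ 0xFFFFFFFFu;
}
```

Both functions return 0xCBF43926 for the ASCII string "123456789", the standard check value for this CRC. The table variant trades 1 KiB of memory for fewer operations per byte, which makes its speed sensitive to where the table is stored.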

SLIDE 23

barebench.com

Demo

SLIDE 24

Heuristic #4 Performance may be dependent on clock speed.

SLIDE 25

Heuristic #5 Caching is essential for high clock speeds.

SLIDE 26

Conclusion

- Admit lack of knowledge.
- Measure performance.
- Use measurements to form heuristics.
- Share heuristics.
- Use heuristics instead of prejudices.

Let’s make embedded systems better!
