
std::map<Code,Performance> myMCU{?}: The mapping between Code & Performance (@DanielPenning)



  1. std::map<Code,Performance> myMCU{?} @DanielPenning The mapping between Code & Performance www.embeff.com

  2. World Map (1459)

  3. World Map (1525)

  4. People admitted they don’t know.

  5. The Beginning of Modern Science
     1. Admit ignorance.
     2. Observations:
        - Measure and gather data.
        - Connect data into comprehensive theories.

  6. Embedded & Ignorance
     Code → ? → Performance, where the mapping is shaped by the compiler, the compiler settings, the target architecture, the target cache, and the target speed.
     Possibly a highly complex and interdependent mapping!

  7. Consequences
     - Prejudices prevail.
     - Mistrust of libraries.
     - Low code quality.
     - Performance suffers.

  8. Let’s admit our ignorance.

  9. Observations in Embedded
     - Profiling is a top-down process:
       - Great for identifying bottlenecks.
       - Poor at building specific understanding.
     - Build knowledge bottom-up:
       - Start with small code blocks.
       - Observe performance.
       - Create heuristics.

  10. Code Performance for ARMv7-M
      - Architecture widely used (Cortex-M3/M4).
      - Provides the Data Watchpoint and Trace (DWT) unit:

      | CMSIS Register | Description |
      |---|---|
      | DWT_CYCCNT | Cycle Count Register |
      | DWT_CPICNT | CPI Count Register |
      | DWT_EXCCNT | Exception Overhead Count Register |
      | DWT_SLEEPCNT | Sleep Count Register |
      | DWT_LSUCNT | LSU Count Register |
      | DWT_FOLDCNT | Folded-instruction Count Register |

  11. Measure Cycles
      Setup: a PC running openocd, connected via JTAG to an STM32F4, with cycle counts taken from the DWT.

      ```
      BKPT                          //< Read CYCCNT
      CodeUnderTest(<Parameter>)
      BKPT                          //< Read CYCCNT
      ```

  12. Let’s make observations.

  13. Example 1: Basic Optimization

      ```cpp
      int square(int x) {
          return x*x;
      }
      ```

      Minimal optimization (-Og):

      ```
      square(int):
          mul r0, r0, r0
          bx lr
      ```

      No optimization (-O0):

      ```
      square(int):
          push {r7}
          sub sp, sp, #12
          add r7, sp, #0
          str r0, [r7, #4]
          ldr r3, [r7, #4]
          ldr r2, [r7, #4]
          mul r3, r2, r3
          mov r0, r3
          adds r7, r7, #12
          mov sp, r7
          ldr r7, [sp], #4
          bx lr
      ```

      [Chart: cycles (axis 0 to 30) for Minimal (-Og) vs. No (-O0) optimization]

  14. Heuristic #1: The difference between minimal and no optimization is huge.

  15. Example 2: Pipeline

      ```cpp
      int DependentOps(int x) {
          int tmp = x/3;
          int tmp2 = x/7;
          return tmp+tmp2;
      }
      ```

      -O2:

      ```
      DependentOps_O2(int):
          ldr r3, .L3
          ldr r1, .L3+4
          smull r2, r3, r3, r0
          add r3, r3, r0
          asrs r2, r0, #31
          smull r1, r0, r1, r0
          rsb r3, r2, r3, asr #2
          subs r0, r0, r2
          add r0, r0, r3
          bx lr
      .L3:
          .word -1840700269
          .word 1431655766
      ```

      -O1:

      ```
      DependentOps_O1(int):
          ldr r3, .L2
          smull r2, r3, r3, r0
          asrs r1, r0, #31
          subs r3, r3, r1
          ldr r2, .L2+4
          smull ip, r2, r2, r0
          add r0, r0, r2
          rsb r0, r1, r0, asr #2
          add r0, r0, r3
          bx lr
      .L2:
          .word 1431655766
          .word -1840700269
      ```

      [Chart: cycles (axis 10 to 17) for -O1 vs. -O2]

  16. Heuristic #2: In low-level assembly, the compiler is probably smarter than you.

  17. Example 3: FPU vs Soft-FPU

      ```cpp
      int MultiplyWithPi(int input) {
          return input * 3.14159265359f;
      }
      ```

      With FPU:

      ```
      MultiplyWithPi_FPU(int):
          vmov s15, r0 @ int
          vldr.32 s14, .L3
          vcvt.f32.s32 s15, s15
          vmul.f32 s15, s15, s14
          vcvt.s32.f32 s15, s15
          vmov r0, s15 @ int
          bx lr
      .L3:
          .word 1078530011
      ```

      With Soft-FPU:

      ```
      MultiplyWithPi_SoftFPU(int):
          push {r3, lr}
          bl __aeabi_i2f
          ldr r1, .L4
          bl __aeabi_fmul
          bl __aeabi_f2iz
          pop {r3, pc}
      .L4:
          .word 1078530011
      ```

  18. Example 3: FPU vs Soft-FPU, measured cycles of `MultiplyWithPi(input)`
      [Chart: cycles (axis 10 to 110) vs. input value (-9 to 9), FPU vs. Soft-FPU]

  19. Heuristic #3: Software FPU is ~6x slower and not deterministic (its cycle count varies with the input value).

  20. Example 4: CRC Computation
      - Cyclic Redundancy Check:
        - Direct computation
        - Lookup table
        - Hardware support
      - Online benchmarking:
        - Execute on real hardware.
        - Technical preview stage.
        - https://barebench.com

  21. barebench.com: Demo

  22. Heuristic #4: Performance may depend on clock speed.

  23. Heuristic #5: Caching is essential for high clock speeds.

  24. Conclusion
      - Admit lack of knowledge.
      - Measure performance.
      - Use measurements to form heuristics.
      - Share heuristics.
      - Use heuristics instead of prejudices.
      - Let's make embedded systems better!
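
Slide 10 lists the DWT counters but does not show how to start them. The sketch below assumes the ARMv7-M layout:

- The DWT registers start at 0xE0001000 in the order CTRL, CYCCNT, CPICNT, EXCCNT, SLEEPCNT, LSUCNT, FOLDCNT.
- DEMCR sits at 0xE000EDFC, with TRCENA in bit 24.
- DWT_CTRL holds the enable bits: CYCCNTENA in bit 0, and CPIEVTENA through FOLDEVTENA in bits 17 to 21.

`DwtRegs` and `enable_dwt_counters` are hypothetical names. Real firmware would normally use the CMSIS `DWT` and `CoreDebug` definitions instead. Passing the registers by reference lets the same logic run against an ordinary struct on a host.

```cpp
#include <cstdint>

// Hypothetical overlay of the first DWT registers (ARMv7-M, base 0xE0001000).
struct DwtRegs {
    volatile std::uint32_t CTRL;      // 0x000 control
    volatile std::uint32_t CYCCNT;    // 0x004 cycle count (32 bit)
    volatile std::uint32_t CPICNT;    // 0x008 CPI count (8 bit, wraps)
    volatile std::uint32_t EXCCNT;    // 0x00C exception overhead count (8 bit)
    volatile std::uint32_t SLEEPCNT;  // 0x010 sleep count (8 bit)
    volatile std::uint32_t LSUCNT;    // 0x014 LSU count (8 bit)
    volatile std::uint32_t FOLDCNT;   // 0x018 folded-instruction count (8 bit)
};
static_assert(sizeof(DwtRegs) == 7 * sizeof(std::uint32_t), "unexpected padding");

constexpr std::uint32_t kDemcrTrcena     = 1u << 24;  // DEMCR.TRCENA: enable DWT/ITM
constexpr std::uint32_t kCtrlCyccntena   = 1u << 0;   // count every core cycle
constexpr std::uint32_t kCtrlCpievtena   = 1u << 17;
constexpr std::uint32_t kCtrlExcevtena   = 1u << 18;
constexpr std::uint32_t kCtrlSleepevtena = 1u << 19;
constexpr std::uint32_t kCtrlLsuevtena   = 1u << 20;
constexpr std::uint32_t kCtrlFoldevtena  = 1u << 21;

// Enables trace, zeroes all counters, then starts them.
// "x = x | m" instead of "x |= m": compound assignment to volatile is deprecated in C++20.
void enable_dwt_counters(volatile std::uint32_t& demcr, DwtRegs& dwt) {
    demcr = demcr | kDemcrTrcena;  // DWT is inaccessible until TRCENA is set
    dwt.CYCCNT = 0;
    dwt.CPICNT = 0;
    dwt.EXCCNT = 0;
    dwt.SLEEPCNT = 0;
    dwt.LSUCNT = 0;
    dwt.FOLDCNT = 0;
    dwt.CTRL = dwt.CTRL | kCtrlCyccntena | kCtrlCpievtena | kCtrlExcevtena |
               kCtrlSleepevtena | kCtrlLsuevtena | kCtrlFoldevtena;
}

// On a target, the real registers would be bound roughly like this:
//   auto& dwt   = *reinterpret_cast<DwtRegs*>(0xE0001000u);
//   auto& demcr = *reinterpret_cast<volatile std::uint32_t*>(0xE000EDFCu);
//   enable_dwt_counters(demcr, dwt);
```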
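
Slide 11 reads CYCCNT from the PC through openocd at two breakpoints. A common alternative, which is not the setup shown on the slide, reads the counter from firmware around the code under test. The sketch below assumes a `read_counter` callable that returns the 32-bit CYCCNT value. `measure_cycles`, `calibrate_overhead` and `net_cycles` are hypothetical helpers. Two design points:

- Unsigned wrap-around makes a single counter overflow between the two readings harmless.
- The cost of the measurement itself, taken once with an empty body, is subtracted. On a real core that overhead is only approximately constant.

```cpp
#include <cstdint>
#include <utility>

// Difference of two readings of a free-running 32-bit counter.
// Unsigned arithmetic wraps modulo 2^32, so one overflow between
// start and end still yields the correct count.
constexpr std::uint32_t elapsed_cycles(std::uint32_t start, std::uint32_t end) {
    return static_cast<std::uint32_t>(end - start);
}

// Read the counter, run the code, read the counter again.
// On an ARMv7-M target, read_counter would return DWT->CYCCNT (CMSIS).
template <typename ReadCounter, typename Code>
std::uint32_t measure_cycles(ReadCounter&& read_counter, Code&& code_under_test) {
    const std::uint32_t start = read_counter();
    std::forward<Code>(code_under_test)();
    const std::uint32_t end = read_counter();
    return elapsed_cycles(start, end);
}

// Cost of the measurement itself: two counter reads around an empty body.
template <typename ReadCounter>
std::uint32_t calibrate_overhead(ReadCounter&& read_counter) {
    return measure_cycles(read_counter, [] {});
}

// Cycles of the code under test with the measurement overhead removed.
template <typename ReadCounter, typename Code>
std::uint32_t net_cycles(ReadCounter&& read_counter, Code&& code_under_test) {
    const std::uint32_t overhead = calibrate_overhead(read_counter);
    const std::uint32_t gross =
        measure_cycles(read_counter, std::forward<Code>(code_under_test));
    return gross >= overhead ? gross - overhead : 0u;
}
```

Injecting the counter keeps the arithmetic testable on a host with a fake counter. On the target, only the lambda that reads CYCCNT changes.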
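
When measuring something as small as `square()` from slide 13 at -Og or higher, the optimizer may inline the call and fold a constant argument, leaving nothing between the two breakpoints. One way to keep the function body intact is to mark it `noinline` and feed it a value the compiler cannot see through, such as a `volatile` read. This is an assumption about the measurement harness, not something shown on the slides. `__attribute__((noinline))` is a GCC/Clang extension, and `run_square_once`, `g_input` and `g_sink` are hypothetical names.

```cpp
// GCC/Clang extension: keep square() as a real call so its body can be measured.
__attribute__((noinline)) int square(int x) {
    return x * x;
}

// Volatile objects are re-read and re-written on every access, so the
// compiler can neither constant-fold the argument nor drop the result.
volatile int g_input = 7;
volatile int g_sink = 0;

// Hypothetical harness step: the code placed between the two BKPTs.
void run_square_once() {
    g_sink = square(g_input);
}
```

With this harness, the -Og and -O0 builds both execute the listing of `square(int)` itself rather than a precomputed constant.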
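
Neither listing on slide 15 contains a divide instruction. Both replace `x/3` and `x/7` with three steps:

- a multiply-high by a "magic" constant from the literal pool: 1431655766 (0x55555556) for 3 and -1840700269 (0x92492493) for 7;
- a shift;
- a sign correction.

The C++ below mirrors those instruction sequences step by step so the trick can be checked on a host. The function names are hypothetical. The code assumes arithmetic right shift of negative signed values, which C++20 guarantees and GCC/Clang already provided before.

```cpp
#include <cstdint>

// High 32 bits of the signed 64-bit product, like RdHi of SMULL.
inline std::int32_t mul_hi(std::int32_t a, std::int32_t b) {
    return static_cast<std::int32_t>((static_cast<std::int64_t>(a) * b) >> 32);
}

// -1 for negative x, 0 otherwise (ASRS rX, r0, #31).
inline std::int32_t sign_mask(std::int32_t x) { return x >> 31; }

// x / 3: multiply-high, then subtract the sign mask so negative
// results round toward zero like C++ division.
inline std::int32_t div3(std::int32_t x) {
    return mul_hi(x, 1431655766) - sign_mask(x);               // 0x55555556
}

// x / 7: the magic constant is negative as a signed value, so x is
// added back (ADD) before the shift (RSB ..., ASR #2).
inline std::int32_t div7(std::int32_t x) {
    return ((mul_hi(x, -1840700269) + x) >> 2) - sign_mask(x);  // 0x92492493
}

// Same result as DependentOps(x) from slide 15.
inline std::int32_t DependentOpsMagic(std::int32_t x) {
    return div3(x) + div7(x);
}
```

The two listings compute exactly these quotients. The -O2 version loads both constants up front and interleaves the two chains, presumably so that no instruction has to wait on the result of the one directly before it.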
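
Both builds on slide 17 load the same literal, `.word 1078530011`. That value is 0x40490FDB, the IEEE-754 single-precision bit pattern of the `float` nearest to pi. The listings therefore differ only in where the arithmetic happens:

- in FPU registers (`vcvt`, `vmul`), or
- in the run-time helper calls `__aeabi_i2f`, `__aeabi_fmul` and `__aeabi_f2iz`.

The host-side check below confirms the constant and pins down the conversion semantics: the float-to-int step truncates toward zero. `float_bits` is a hypothetical helper, and the compiler flags in the comment are an example GCC configuration for a Cortex-M4, not taken from the slides.

```cpp
#include <cstdint>
#include <cstring>

// The function from slide 17. Which code is generated depends on the build,
// e.g. GCC with -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=hard (FPU
// instructions) versus -mfloat-abi=soft (calls into the float helpers).
int MultiplyWithPi(int input) {
    // int -> float, float multiply, float -> int (truncates toward zero)
    return input * 3.14159265359f;
}

// Raw IEEE-754 bits of a float; memcpy is the portable pre-C++20 way.
std::uint32_t float_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}
```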
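
Heuristic #3 condenses the chart on slide 18 into two statements: a slowdown factor, and the observation that soft-float timing is not constant. In the spirit of the talk's title, such observations can be kept in a `std::map` from a code-variant label to its per-input cycle counts, then reduced to a mean and a spread. `CycleStats`, `summarize` and `slowdown` are hypothetical names. The cycle counts must come from your own measurements; none are included here.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <numeric>
#include <string>
#include <vector>

struct CycleStats {
    std::uint32_t min = 0;
    std::uint32_t max = 0;
    double mean = 0.0;
    // Zero spread means the timing did not depend on the input.
    std::uint32_t spread() const { return max - min; }
};

// Reduce one variant's per-input cycle counts to min / max / mean.
CycleStats summarize(const std::vector<std::uint32_t>& cycles) {
    CycleStats s;
    if (cycles.empty()) return s;
    const auto range = std::minmax_element(cycles.begin(), cycles.end());
    s.min = *range.first;
    s.max = *range.second;
    s.mean = std::accumulate(cycles.begin(), cycles.end(), 0.0) /
             static_cast<double>(cycles.size());
    return s;
}

// Ratio of mean cycle counts, e.g. slowdown(observations, "Soft-FPU", "FPU").
// Throws std::out_of_range if a label has not been recorded.
double slowdown(const std::map<std::string, std::vector<std::uint32_t>>& observations,
                const std::string& slow, const std::string& fast) {
    return summarize(observations.at(slow)).mean /
           summarize(observations.at(fast)).mean;
}
```

Keeping the raw per-input counts, rather than only the averages, preserves the information needed for the "not deterministic" half of the heuristic.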
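
Slide 20 names direct computation as the first way to compute a CRC but does not fix a specific CRC. As an assumption for illustration, the sketch below uses the common reflected CRC-32: polynomial 0xEDB88320 in reflected form, with initial value and final XOR both 0xFFFFFFFF. Its standard check value for the ASCII string "123456789" is 0xCBF43926. The direct form processes one bit per inner-loop iteration and needs no table. `crc32_bitwise` is a hypothetical name. A hardware CRC unit may implement a different CRC variant, so results are only comparable when all parameters match.

```cpp
#include <cstddef>
#include <cstdint>

// Direct computation: shift one bit at a time and XOR in the polynomial
// whenever a 1 falls out of the low end (reflected CRC-32).
std::uint32_t crc32_bitwise(const std::uint8_t* data, std::size_t len) {
    std::uint32_t crc = 0xFFFFFFFFu;  // initial value
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            // (0u - (crc & 1u)) is all ones if the low bit is set, else zero,
            // so the polynomial is XORed in without a data-dependent branch.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;  // final XOR
}
```

The branch-free inner step keeps the instruction sequence the same for every input byte, which makes the cycle count easier to reason about on cores without a data cache.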
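
The lookup-table approach from slide 20 trades memory for speed. A 256-entry table (1 KiB of `uint32_t`) replaces the eight bitwise steps per byte with a single table access. Whether that pays off on a microcontroller depends, among other things, on which memory holds the table and how fast that memory is at the chosen clock, which ties in with Heuristics #4 and #5.

As an assumption for illustration, the sketch uses the common reflected CRC-32: polynomial 0xEDB88320 in reflected form, with initial value and final XOR both 0xFFFFFFFF, and check value 0xCBF43926 for "123456789". `make_crc32_table` and `crc32_table_driven` are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Entry n is the CRC register after shifting byte value n through eight
// bitwise steps of the reflected CRC-32 polynomial.
std::array<std::uint32_t, 256> make_crc32_table() {
    std::array<std::uint32_t, 256> table{};
    for (std::uint32_t n = 0; n < 256; ++n) {
        std::uint32_t c = n;
        for (int bit = 0; bit < 8; ++bit) {
            c = (c & 1u) ? (c >> 1) ^ 0xEDB88320u : (c >> 1);
        }
        table[n] = c;
    }
    return table;
}

// Lookup-table computation: one table access per input byte.
std::uint32_t crc32_table_driven(const std::uint8_t* data, std::size_t len) {
    // Built once on first use here; on a microcontroller one would more
    // likely precompute it so it can live in flash as const data.
    static const std::array<std::uint32_t, 256> table = make_crc32_table();
    std::uint32_t crc = 0xFFFFFFFFu;  // initial value
    for (std::size_t i = 0; i < len; ++i) {
        crc = (crc >> 8) ^ table[(crc ^ data[i]) & 0xFFu];
    }
    return crc ^ 0xFFFFFFFFu;  // final XOR
}
```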
