Programming Tools for Embedded Multicore
Jakob Engblom, Technical Marketing Manager – Simics, Wind River
jakob.engblom@windriver.com | http://blogs.windriver.com/engblom/
Disclaimer: These are my personal views on multicore and embedded
Wind River - Programming Embedded Multicore - ICES Seminar 2010-11-24 2
Some Advantages
[Figure: "Software-dominated systems industry". Log-scale axis from 1 to 10^12, time axis from 1960 to 2020. Labels: Gates/chip 2x / 18 months; SW/chip 2x / 10 months; SW Productivity 2x / 5 years; HW; Lines of Code.]
– Embedded debug tools tend to be better at dealing with timing errors and at debugging low-level code
– Operating-system/application interfaces have better debug support
– Hardware-supported debug far beyond what desktops and servers can do
– OS awareness in external debug tools
– But it gets pretty complex pretty quickly…
Multiple Targets
instance
connections including shared connections
simultaneously
supported simultaneously
[Figure labels: Function Processors, Control Processors]
Multiple Contexts
Processes/Threads
specific thread
individual thread
Target boards may be any mix of physical, logical, or virtual boards and any mix of uniprocessors and multicore running SMP or UP with Hypervisor, VxWorks, Wind River Linux and bare metal software.
Host System Target System
– Added feature of hardware
– Costs some chip area; some designers, and some customers, do not consider it worth the cost
– Mostly for processors and their buses
– Being added for other parts
become more important
common in complex devices today
– Interface bandwidth limitations can put a limit
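[Figure: Board 1 with Flash, DDR RAM, and an SoC containing two CPUs (each with an L1 cache), an L2 cache, a memory interface, and peripherals (two Eth, PIC, Timer, Serial, PCIe). Units in the diagram are marked T and P.]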
– Coordination across the chip
– Cause action in one place based on events occurring elsewhere in the system
tracing, stop tracing, interrupt, ...
programmable little supervisor processor
[Figure: the same Board 1 diagram (Flash, DDR RAM, SoC with two CPUs, L1 and L2 caches, memory interface, and peripherals), with units marked B.]
– Software stacks adding tracing as a feature
– Hardware support for extracting traces from software
– Hardware actually tracing its own operation
– Simulator hooks for getting data and key points out
– Trace processing and analysis of the data stream is a key technology for the future; manual inspection does not suffice
http://jakob.engbloms.se/archives/1251
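As a hedged illustration of what automated trace analysis can look like, the sketch below scans a stream of lock events and reports the longest hold time per lock. The event format (timestamp, thread id, action, lock name) and the function name are assumptions made up for this example; they do not come from any particular trace tool.

```python
from collections import defaultdict


def longest_lock_holds(events):
    """Return {lock_name: longest hold time} from a stream of trace events.

    Each event is a tuple (timestamp, thread_id, action, lock_name), where
    action is "acquire" or "release". This layout is hypothetical.
    """
    acquired_at = {}            # (thread_id, lock_name) -> acquire timestamp
    longest = defaultdict(int)  # lock_name -> longest hold seen so far
    for timestamp, thread_id, action, lock_name in events:
        key = (thread_id, lock_name)
        if action == "acquire":
            acquired_at[key] = timestamp
        elif action == "release" and key in acquired_at:
            held_for = timestamp - acquired_at.pop(key)
            longest[lock_name] = max(longest[lock_name], held_for)
    return dict(longest)


# A tiny hand-written trace; real traces are far too long to read by eye.
example_trace = [
    (0, 1, "acquire", "queue"),
    (5, 1, "release", "queue"),
    (6, 2, "acquire", "queue"),
    (7, 3, "acquire", "log"),
    (8, 3, "release", "log"),
    (20, 2, "release", "queue"),
]
print(longest_lock_holds(example_trace))
```

The same single-pass structure works on a live stream, since only the currently held locks are kept in memory.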
Cary Millsap, "Thinking Clearly about Performance, Part 2", CACM, Oct 2010, http://mags.acm.org/communications/201010#pg40
Software Architecture and Hypervisors
[Figure: primary configurations. Single core: "Traditional" (one OS on one core) and Virtualization (two OSes on a hypervisor on one core). Multi-core: AMP (one OS per core on Core 1 and Core 2) and SMP (one OS across Core 1 and Core 2).]
OS: Could be VxWorks, Wind River Linux, or other executive or OS
Combinations of these primary configurations can be used to create more advanced configurations.
[Figure: three separate units (Unit 1: single-core, OS 1 with App 1; Unit 2: single-core, bare-metal application; Unit 3: multicore, OS 3 with App 3) are merged into one consolidated unit, where OS 1 with App 1, the bare-metal application, and OS 3 with App 3 run on Wind River Hypervisor on multicore hardware.]
– Single-core apps keep running as single-core, avoiding the risk of breakage due to true concurrency
– Single hardware = easier to manage, reduced manufacturing cost, more units fit in the same
– multicore gain with very limited pain!
– Hypervisor provides isolation between guests, virtual boards keep running as-is
[Figure: Wind River Hypervisor on multicore hardware. A control-plane OS handles management and control, while three WRE instances, each running a network stack, run on cores of their own.]
WRE – Wind River Executive.
Clear trend to provide sub-RTOS “executives” to provide very high performance for applications with no need for a full OS. Typically per-core.
Hypervisor can simplify the coordination between OS instances and provide a simpler programming interface for a WRE:
Debug
Wind River Simics
Virtual Platform
An adaptive virtual platform that enables customers to define, develop, and deploy electronics systems more efficiently
Aerospace and Defense; Industrial and Medical; Mobile and Consumer; Network Equipment; Automotive
– Checkpoint and restore
– Multicore, processor, board
– Real-world connections
– Repeatable fault injection on any system component
– Scripting
– Mixed endianness, word sizes, heterogeneity
con0.wait-for-string "$"
con0.record-start
con0.input "./ptest.elf 5\n"
con0.wait-for-string "."
$r = con0.record-stop
if ($r == "fail.") { echo "test failed" }
[Figure: Simics running on a 32/64-bit PC (Linux, Windows) and simulating the multicore hardware, on which Wind River Hypervisor hosts OS 1 with App 1, a bare-metal application, and OS 3 with App 3.]
– Synchronous stop for entire system
– Determinism and repeatability
– Reverse execution
– Unlimited and powerful breakpoints
– Trace anything
– Insight into all devices
break -x 0x0000->0x1F00
break-io uart0
break-exception int13
– No need to rerun and hope for bug to reoccur
– No rerunning program from start
– Breakpoints and watchpoints backward in time
– Investigate exactly what happened this time
This control and reliable repeatability is very powerful for parallel code.
On hardware, only some runs reproduce an error: discover bug -> rerun, bug doesn’t show up -> rerun, bug doesn’t show up -> rerun, different bug -> rerun, initial bug occurs.
On virtual hardware, debugging is much easier: discover bug -> reverse execute and find source of bug.
http://blogs.windriver.com/engblom/2010/09/deterministic-but-unpredictable.html
Virtual platform
[Figure legend: checkpoint; software package, load, or configuration; hardware configuration or reconfiguration. Nodes in the diagram are labeled P, R, and D.]
– A developer, D, creates a piece of software and passes it on for testing and use.
– The software user finds a bug and needs to report it to the developer. This makes him or her the reporter, R.
– The developer and reporter are both using a virtual platform to run software.
– The reporter uses virtual platform checkpointing to pass the bug to the developer. This ensures perfect replication and that the complete target state is communicated.
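[Figure: checkpoint-based bug reporting. The reporter's run goes through Boot..., Configure..., and Run tests..., with stimuli applied along the way; checkpoints in the diagram are labeled RC, R0, R1, R2, ... Rn, and the run ends at "Bug!".]
– Note that many different tests can be started from this checkpoint
– Inputs occurring after the last checkpoint was taken, but before the bug hits, are recorded (recording of last few inputs)
– Checkpoint merge
– Merged checkpoint and the recording is the bug report contents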
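The idea of "checkpoint plus recording equals bug report" can be sketched in a few lines. This is a toy model under stated assumptions, not Simics code: `ToyTarget`, `make_bug_report`, and `replay` are hypothetical names, and the only property relied on is that the target is deterministic, so the same state fed the same inputs ends in the same state.

```python
import copy


class ToyTarget:
    """Deterministic stand-in for a simulated target system (hypothetical)."""

    def __init__(self, state=None):
        # Copy the state so a restored target never aliases the checkpoint.
        self.state = copy.deepcopy(state) if state is not None else {"acc": 0, "steps": 0}

    def apply(self, stimulus):
        # Any deterministic update works; this one just mixes the input in.
        self.state["acc"] = (self.state["acc"] * 31 + stimulus) % 65521
        self.state["steps"] += 1

    def checkpoint(self):
        return copy.deepcopy(self.state)


def make_bug_report(checkpoint, recorded_inputs):
    """Bundle the last checkpoint with the inputs that arrived after it."""
    return {"checkpoint": copy.deepcopy(checkpoint), "recording": list(recorded_inputs)}


def replay(report):
    """Developer side: restore the checkpoint, then feed the recorded inputs."""
    target = ToyTarget(report["checkpoint"])
    for stimulus in report["recording"]:
        target.apply(stimulus)
    return target


# Reporter side: run, checkpoint, keep running while recording inputs.
reporter = ToyTarget()
for stimulus in [3, 1, 4]:
    reporter.apply(stimulus)
last_checkpoint = reporter.checkpoint()
inputs_after_checkpoint = [1, 5, 9]
for stimulus in inputs_after_checkpoint:
    reporter.apply(stimulus)

report = make_bug_report(last_checkpoint, inputs_after_checkpoint)
developer = replay(report)
print(developer.state == reporter.state)
```

The report stays small because only the inputs since the last checkpoint need to be recorded, not the whole run from boot.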
– Change of OS exposed a latent bug in the code
Configuration 1: MPC8641 8 core, Glibc 2.5.1, Linux 2.6.23, Rule30_threaded.elf
Configuration 2: MPC8641 8 core, Glibc 2.5.1, Linux 2.6.27 (WR Linux 3.0), Rule30_threaded.elf
Simics
breakpoints inside target program
– On data accesses to shared work queue used by all threads
– Unintrusive: does not change the behavior of the target system in any way
– Diagnostics: state of queue (read target memory, perform calculations), queue control variable being accessed, source line, thread ID
– For both successful and failing runs -> spotted the difference
[Figure: the developer's Simics setup. Target: MPC8641 8 core, Glibc 2.5.1, Linux 2.6.27 (WR Linux 3.0), Rule30_threaded.elf. Debug components: OS awareness, source code debug, custom script, and debug information for the binary program, kept outside the target.]
Example Diagnostic Output
[bp] Thread 918, writing variable empty with value 1. At rule30_packet_queue_get, line 157
[bp] Thread 918, writing variable full with value 0. At rule30_packet_queue_get, line 158
...
[bp] Thread 921, writing variable done with value 1. At rule30_packet_queue_signal_done, line 62
The Bug
68 // - It only wakes up one thread...
69 pthread_cond_signal (&(q->notEmpty));
70 // To be correct:
71 //pthread_cond_broadcast (&(q->notEmpty));
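The difference between waking one waiter and waking all of them can be reproduced with a small, hedged Python analogue. In Python's `threading.Condition`, `notify()` plays the role of `pthread_cond_signal` and `notify_all()` the role of `pthread_cond_broadcast`. The function name `count_woken` and the timeout are choices made for this sketch, not part of the Rule30 program; the timeout exists only so that the demo terminates instead of hanging the way the real program did.

```python
import threading
import time


def count_woken(workers, broadcast, timeout=0.5):
    """Start `workers` waiting threads, signal once, count those woken by it."""
    cond = threading.Condition()
    state = {"waiting": 0, "woken": 0}

    def worker():
        with cond:
            state["waiting"] += 1
            # Production code would loop on a predicate. Here we only record
            # whether wait() ended by notification (True) or timeout (False).
            if cond.wait(timeout):
                state["woken"] += 1

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()

    # Signal only once every worker is blocked in wait(). A worker increments
    # the counter while holding the lock and releases it inside wait(), so
    # seeing the full count under the lock means all of them are waiting.
    signalled = False
    while not signalled:
        with cond:
            if state["waiting"] == workers:
                if broadcast:
                    cond.notify_all()  # analogue of pthread_cond_broadcast
                else:
                    cond.notify()      # analogue of pthread_cond_signal
                signalled = True
        if not signalled:
            time.sleep(0.001)

    for thread in threads:
        thread.join()
    return state["woken"]


print("woken with notify():    ", count_woken(4, broadcast=False))
print("woken with notify_all():", count_woken(4, broadcast=True))
```

With a single notification, the remaining waiters only get out through the timeout; without that timeout they would stay blocked, which matches the hang described on the slides.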
– Nice speedup with 1 to 3 worker threads
– With four worker threads, the program uses only two cores
– With five worker threads, the efficiency is horrible and two
hanging!
Evaluating Software Scalability on Flexible Virtual Hardware
Scalable virtual Power Architecture multicore machine
[Figure: several nodes, each with a PPC440 core, local memory, an interrupt controller, and a serial port, connected through an interrupt network and sharing a global memory. Configurable parameters: clock frequency, memory sizes, number of cores, access delay, and contention to global memory.]
– Shared memory restricted to single access, with high latencies
– Testing two different transfer modes, 1 packet and 4 packets per transmission
– Scalability quite different
Scaling as Worker Nodes are Added
[Chart: performance relative to one worker node (vertical axis, 0.00 to 10.00) versus number of worker nodes (horizontal axis, 1 to 9). Series: perfect memory; 100 cycles, single port; 200 cycles, single port; 500 cycles, single port; and the same four memory configurations with 4 packets/trans.]
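Why the two transfer modes scale so differently can be illustrated with a toy saturation model. This is a back-of-the-envelope sketch, not the Simics model and not the data behind the chart: it assumes each packet needs a fixed amount of local compute, that a single-port shared memory serves one transfer at a time, and that a transfer can carry several packets. All parameter names and numbers are made up for illustration.

```python
def relative_performance(workers, compute_cycles, latency_cycles, packets_per_transfer=1):
    """Throughput with `workers` nodes relative to one worker (toy model).

    One worker spends compute_cycles + m cycles per packet, where
    m = latency_cycles / packets_per_transfer is the memory cost per packet.
    The single memory port can deliver at most one packet every m cycles,
    so the speedup saturates at (compute_cycles + m) / m.
    """
    if workers < 1 or compute_cycles <= 0 or packets_per_transfer < 1:
        raise ValueError("need workers >= 1, compute_cycles > 0, packets_per_transfer >= 1")
    if latency_cycles == 0:
        return float(workers)  # "perfect memory": no contention at all
    memory_per_packet = latency_cycles / packets_per_transfer
    saturation = 1.0 + compute_cycles / memory_per_packet
    return min(float(workers), saturation)


# Hypothetical numbers: 1000 compute cycles per packet, 500-cycle memory.
for n in range(1, 10):
    one = relative_performance(n, 1000, 500, packets_per_transfer=1)
    four = relative_performance(n, 1000, 500, packets_per_transfer=4)
    print(n, one, four)
```

With these made-up numbers, sending one packet per transfer saturates at 3 workers, while batching four packets per transfer moves the saturation point to 9, because the per-packet memory cost drops by a factor of four.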
Speeding up development by smart tricks
consistent time across all cores
– First task in development is to establish such a timebase
– On hardware, tricky timing loops are needed
using scripting
– Mark time sync point with a magic instruction
– Script triggers on the magic instructions, and resets all local times to the same time
added task becomes trivial
– Shorten time to interesting experiments using Simics
http://www.tentech.ca/index.php/2010/09/easy-multi-core-powerpc-timebase-synchronization-with-simics/
The Code: OS and Script
OS (C):
static void __VBOOT synchronize_clocks(void)
{
    /* Core 0 marks the time sync point with a magic instruction */
    if (0 == GET_CPU_ID()) {
        MAGIC(4);
    }
    BarrierWait(&g_smpPartitionInitBarrier);
}

Script (Python, run inside Simics):
def synchronize_ppc_timebase():
    # Get number of CPUs from system 0.
    # Using some assumptions
    num_cpus = conf.sim.cpu_info[0][1]
    # Iterate through all the cores
    for cpu_id in range(num_cpus):
        cpu = getattr(conf, "cpu%d" % cpu_id)
        # Simply reset the timebase
        cpu.tbu = 0
        cpu.tbl = 0
    print("Synchronized the CPU timebases at cpu0 cycle count %d" % SIM_cycle_count(conf.cpu0))
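For readers without a Simics installation, the effect of the script can be mimicked in plain Python. The sketch below is a standalone model and uses no Simics API: `Core`, `Machine`, and `skew` are hypothetical names, the 64-bit timebase is split into upper and lower 32-bit halves named after the `tbu` and `tbl` registers in the script, and magic number 4 is taken from the C code above.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


class Core:
    """Toy core with a 64-bit timebase split into upper and lower halves."""

    def __init__(self, start_ticks=0):
        self.tbu = (start_ticks >> 32) & MASK32
        self.tbl = start_ticks & MASK32

    def timebase(self):
        return (self.tbu << 32) | self.tbl

    def tick(self, ticks=1):
        value = (self.timebase() + ticks) & MASK64
        self.tbu = value >> 32
        self.tbl = value & MASK32


class Machine:
    """Toy machine that reacts to a magic instruction like the script does."""

    def __init__(self, cores, sync_magic=4):
        self.cores = cores
        self.sync_magic = sync_magic

    def magic(self, number):
        # Stand-in for the simulator hook that fires on a magic instruction.
        if number == self.sync_magic:
            for core in self.cores:
                core.tbu = 0
                core.tbl = 0

    def run(self, ticks):
        for core in self.cores:
            core.tick(ticks)


def skew(cores):
    """Largest difference between any two core timebases."""
    values = [core.timebase() for core in cores]
    return max(values) - min(values)


# Cores come out of reset at slightly different times, so their timebases differ.
machine = Machine([Core(1000), Core(1007), Core(1450)])
skew_before = skew(machine.cores)
machine.magic(4)   # core 0 executes MAGIC(4); the hook resets every timebase
machine.run(100)   # all cores advance in lockstep afterwards
skew_after = skew(machine.cores)
print(skew_before, skew_after)
```

In the simulator the reset happens between two target instructions, which is why no timing loop is needed; the toy model reflects that by resetting all cores in a single call.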