SLIDE 1

Programming Tools for Embedded Multicore

Jakob Engblom Technical Marketing Manager – Simics Wind River

jakob.engblom@windriver.com | http://blogs.windriver.com/engblom/

SLIDE 2

Disclaimer

  • These are my personal views on multicore and embedded
  • Nothing in this presentation should be interpreted as indicating the plan (or lack of plan) for products and product features in Wind River products

Wind River - Programming Embedded Multicore - ICES Seminar 2010-11-24 2

SLIDE 3

Embedded Multicore

Some Advantages

SLIDE 4

Software Dominates Development

[Chart: "Software-dominated systems industry". X axis: 1960 to 2020. Y axis: log scale from 1 to 10^12, showing No. Gates and Lines of Code. Gates/chip: 2x / 18 months. SW/chip: 2x / 10 months. SW productivity: 2x / 5 years.]
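The doubling periods on the chart can be turned into growth factors to see how quickly the gap opens. This is a minimal sketch, assuming pure exponential growth at exactly the rates quoted on the slide; the function name and the 10-year horizon are illustrative choices, not part of the talk.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Growth over `years` for a quantity that doubles every `doubling_months`."""
    return 2.0 ** (years * 12.0 / doubling_months)


YEARS = 10                               # illustrative horizon (assumption)
gates = growth_factor(YEARS, 18)         # gates per chip: 2x / 18 months
software = growth_factor(YEARS, 10)      # software per chip: 2x / 10 months
productivity = growth_factor(YEARS, 60)  # software productivity: 2x / 5 years

print(f"gates/chip    x{gates:.0f}")
print(f"SW/chip       x{software:.0f}")
print(f"productivity  x{productivity:.0f}")
# Required effort grows as software content divided by productivity.
print(f"effort        x{software / productivity:.0f}")
```

Under these assumptions, software content per chip outgrows both the hardware and the productivity of the people writing it, which is the point of the slide title.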

SLIDE 5

Embedded Multicore Advantage

  • When it comes to multicore, the embedded tools field has certain advantages
    – Embedded debug tools tend to be better at dealing with timing errors and at debugging low-level code
    – Operating system/application interfaces have better debug support
    – Hardware-supported debug goes far beyond what desktops and servers can do
    – OS awareness in external debug tools
  • Debuggers and tools are starting to catch up, including awareness of cores, systems, threads, domains, …
    – But it gets pretty complex pretty quickly…

SLIDE 6

Multiple Context Debugging

Multiple Targets

  • One Wind River Workbench instance
  • Target manager
  • Multiple simultaneous connections including shared connections
  • Multiple OS types supported simultaneously
  • Multiple target processors supported simultaneously

Multiple Contexts

  • Core, process, or thread
  • Each context has a set of views:
    – Source
    – Stack
    – Registers

Processes/Threads

  • Qualify breakpoints on a process or specific thread
  • Stop the entire process or an individual thread

Target boards may be any mix of physical, logical, or virtual boards and any mix of uniprocessors and multicore running SMP or UP with Hypervisor, VxWorks, Wind River Linux and bare metal software.

[Diagram: Host System connected to a Target System with Function Processors and Control Processors]

SLIDE 7

Hardware Trace

  • On-Chip Trace
    – Added feature of hardware
    – Costs some chip area; some designers, and some customers, do not consider it worth the cost
    – Mostly for processors and their buses
    – Being added for other parts of the system, as they become more important
      • Performance counters common in complex devices today
    – Interface bandwidth limitations can put a limit on effectiveness

[Diagram: Board 1 with Flash, DDR RAM, and an SoC containing two CPUs with L1 caches, an L2 cache, a memory interface, and peripherals (Eth, Eth, PIC, Timer, Serial, PCIe); components are tagged T and P]

SLIDE 8

Hardware Triggering

  • Cross-triggering
    – Coordination across the chip
    – Cause action in one place based on events occurring elsewhere in the system
      • Stop execution, start tracing, stop tracing, interrupt, ...
  • Requires logic on the chip
  • Basically, it is a little programmable on-chip supervisor processor
  • Conclusion: wise users buy hardware with good debug support

[Diagram: Board 1 with Flash, DDR RAM, and an SoC containing two CPUs with L1 caches, an L2 cache, a memory interface, and peripherals (Eth, Eth, PIC, Timer, Serial, PCIe); components are tagged B]
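Cross-triggering can be pictured as a routing table from events to actions on other components. The sketch below is a toy software model only; the class, event names, and component names are hypothetical and do not correspond to any real on-chip debug interface.

```python
from collections import defaultdict


class CrossTriggerMatrix:
    """Toy model: routes named events to actions on other components."""

    def __init__(self):
        self._routes = defaultdict(list)   # event name -> [(target, action)]
        self.log = []                      # actions performed, in order

    def connect(self, event, target, action):
        """When `event` fires anywhere, perform `action` on `target`."""
        self._routes[event].append((target, action))

    def fire(self, source, event):
        """Signal that `source` raised `event` and run all connected actions."""
        for target, action in self._routes[event]:
            self.log.append((source, event, target, action))


ctm = CrossTriggerMatrix()
ctm.connect("breakpoint", "cpu1", "stop execution")
ctm.connect("breakpoint", "trace_unit", "stop tracing")
ctm.connect("dma_error", "trace_unit", "start tracing")

ctm.fire("cpu0", "breakpoint")             # a breakpoint on cpu0 ...
for entry in ctm.log:                      # ... stops cpu1 and the trace unit
    print(entry)
```

In real hardware the routing is done by dedicated logic so that the reaction happens within a few cycles, which is why the slide notes that it requires logic on the chip.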

SLIDE 9

Trace, Trace, Trace

  • There seems to be a growing consensus that trace is a key tool for debugging large-scale multicore software
    – Software stacks adding tracing as a feature
    – Hardware support for extracting traces from software
    – Hardware actually tracing its own operation
    – Simulator hooks for getting data and key points out
  • Only way to get an overview of the system
  • Trace long runs…
    – Processing and analyzing the trace data stream is a key technology for the future; manual inspection does not suffice
  • And drop back to a debugger around a problem

http://jakob.engbloms.se/archives/1251
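As an illustration of why automated trace processing matters, here is a minimal sketch that scans a trace stream for cores that go silent for too long. The record format `(time, core, event)` and the gap threshold are assumptions made for this example, not a real trace format.

```python
from collections import Counter


def summarize(trace, gap_limit):
    """Count events per core and flag cores silent for more than gap_limit."""
    counts = Counter()
    last_seen = {}                  # core -> time of its most recent event
    gaps = []                       # (core, start, end) of long silences
    for time, core, event in trace:
        counts[core] += 1
        if core in last_seen and time - last_seen[core] > gap_limit:
            gaps.append((core, last_seen[core], time))
        last_seen[core] = time
    return counts, gaps


trace = [
    (0, 0, "lock"), (1, 1, "lock_wait"), (2, 0, "unlock"),
    (3, 1, "lock"), (4, 0, "work"), (5, 0, "work"),
    (50, 1, "unlock"),              # core 1 was silent for a long time
]
counts, gaps = summarize(trace, gap_limit=10)
print(dict(counts), gaps)
```

Because the function consumes the trace one record at a time, it works equally well on a generator reading a very long run; the flagged interval is then the place to drop back into a debugger.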

SLIDE 10

Overhead vs Efficiency

  • Common complaint about debug hooks in hardware and software: it costs too much power / performance / throughput / chip area / money / …

Cary Millsap, Thinking Clearly about Performance 2, CACM Oct 2010 http://mags.acm.org/communications/201010#pg40

SLIDE 11

Embedded Multicore

Software Architecture and Hypervisors

SLIDE 12

More Than Just SMP

[Diagram: four primary configurations]

  • Single core, "Traditional": one OS on one core
  • Single core, Core Virtualization: several OSes on a hypervisor on one core
  • Multi-core, SMP: one OS across Core 1 and Core 2
  • Multi-core, AMP: one OS on Core 1 and another OS on Core 2

OS: Could be VxWorks, Wind River Linux, or other executive or OS

Combinations of these primary configurations can be used to create more advanced configurations.

SLIDE 13

Example: Consolidation with Hypervisor

[Diagram: three separate units (Unit 1: single-core hardware running OS 1 and App 1; Unit 2: single-core hardware running a bare-metal application; Unit 3: multicore hardware running OS 3 and App 3) become one consolidated unit, in which Wind River Hypervisor on multicore hardware hosts OS 1 with App 1, the bare-metal application, and OS 3 with App 3]

  • Single-core apps keep running as single-core, avoiding the risk of breakage due to true concurrency
  • Single hardware = easier to manage, reduced manufacturing cost, more units fit in the same space. Most of the multicore gain with very limited pain!
  • Hypervisor provides isolation between guests; virtual boards keep running as-is

SLIDE 14

Example: Back to Basics

[Diagram: Wind River Hypervisor on multicore hardware. A control-plane OS (management, control) runs next to three network stacks, each on a WRE; every guest sits on its own core or cores]

WRE – Wind River Executive. Clear trend to provide sub-RTOS "executives" that give very high performance for applications with no need for a full OS. Typically per-core. Hypervisor can simplify the coordination between OS instances and provide a simpler programming interface for a WRE.

SLIDE 15

Simics and Multicore

Debug

SLIDE 16


Wind River Simics: Full Simulation of Any Electronic System

Virtual Platform

An adaptive virtual platform that enables customers to define, develop, and deploy electronics systems more efficiently

  • Aerospace and Defense
  • Industrial and Medical
  • Mobile and Consumer
  • Network Equipment
  • Automotive

SLIDE 17

System-Level Features

  • Checkpoint and restore
  • Multicore, processor, board
  • Real-world connections
  • Repeatable fault injection on any system component
  • Scripting
  • Mixed endianness, word sizes, heterogeneity

con0.wait-for-string "$"
con0.record-start
con0.input "./ptest.elf 5\n"
con0.wait-for-string "."
$r = con0.record-stop
if ($r == "fail.") { echo "test failed" }

SLIDE 18

Full-System Insight

SLIDE 19

Hypervisor is just any Software Stack

[Diagram: Simics, running on a 32/64-bit PC (Linux, Windows), simulates the multicore hardware. On the simulated hardware, Wind River Hypervisor hosts OS 1 with App 1, a bare-metal application, and OS 3 with App 3]

SLIDE 20

Simics Debugging Features

  • Synchronous stop for entire system
  • Determinism and repeatability
  • Reverse execution
  • Unlimited and powerful breakpoints
  • Trace anything
  • Insight into all devices

break -x 0x0000->0x1F00
break-io uart0
break-exception int13

SLIDE 21


Repeatability and Reverse Debugging

  • Repeat any run trivially
    – No need to rerun and hope for bug to reoccur
  • Stop and go back in time
    – No rerunning program from start
    – Breakpoints and watchpoints backward in time
    – Investigate exactly what happened this time

This control and reliable repeatability is very powerful for parallel code.

On hardware, only some runs reproduce an error:
Discover bug -> Rerun, bug doesn't show up -> Rerun, bug doesn't show up -> Rerun, different bug -> Rerun, initial bug occurs

On virtual hardware, debugging is much easier:
Discover bug -> Reverse execute and find source of bug

http://blogs.windriver.com/engblom/2010/09/deterministic-but-unpredictable.html

SLIDE 22

Transporting Bugs

  • A developer (D) creates a piece of software and passes it on for testing and use
  • The software user finds a bug and needs to report it to the developer. This makes him or her the reporter (R)
  • The developer and reporter are both using a virtual platform to run software
  • The reporter uses virtual platform checkpointing to pass the bug to the developer. This ensures perfect replication and that the complete target state is communicated.

[Diagram legend: virtual platform; checkpoint; software package, load, or configuration; hardware configuration or reconfiguration]

SLIDE 23

Replaying Target Stimuli

[Diagram: timeline on the reporter's (R) virtual platform: Boot... -> Configure... -> Run tests..., with checkpoints labeled RC, R0, R1, R2, ... Rn, ending in "Bug!"; the result is passed to the developer (D)]

  • Note that many different tests can be started from this checkpoint
  • Stimuli: inputs occurring after the last checkpoint was taken, but before the bug hits
  • Checkpoint merge
  • Recording of last few inputs
  • The merged checkpoint and the recording are the bug report contents
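The idea of shipping a checkpoint plus the inputs recorded after it can be sketched with a toy deterministic machine. Everything here (the `Machine` class, its state, and the artificial bug condition) is hypothetical and stands in for a full virtual platform; checkpoint merging is not modeled.

```python
import copy


class Machine:
    """Toy deterministic target: state changes only through step(value)."""

    def __init__(self):
        self.state = {"count": 0, "sum": 0}

    def step(self, value):
        self.state["count"] += 1
        self.state["sum"] += value
        if self.state["sum"] == 13:            # artificial bug for the demo
            raise RuntimeError("bug hit")

    def checkpoint(self):
        return copy.deepcopy(self.state)

    @classmethod
    def restore(cls, snapshot):
        machine = cls()
        machine.state = copy.deepcopy(snapshot)
        return machine


def run_and_report(inputs, checkpoint_every):
    """Reporter side: return (checkpoint, recorded inputs) when the bug hits."""
    machine = Machine()
    snapshot, recording = machine.checkpoint(), []
    for i, value in enumerate(inputs):
        if i % checkpoint_every == 0:          # new checkpoint, new recording
            snapshot, recording = machine.checkpoint(), []
        recording.append(value)
        try:
            machine.step(value)
        except RuntimeError:
            return snapshot, recording
    return None                                # no bug observed


def replay(snapshot, recording):
    """Developer side: reproduce the failure from the bug report alone."""
    machine = Machine.restore(snapshot)
    try:
        for value in recording:
            machine.step(value)
    except RuntimeError:
        return machine.state                   # state at the moment of failure
    return None


report = run_and_report([4, 1, 3, 2, 3, 9], checkpoint_every=3)
print(report, replay(*report))
```

The replay only works because the toy machine is deterministic: the same checkpoint and the same inputs always lead to the same failure, which is the property the slide relies on.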

SLIDE 24

Debug Multicore Hang: Problem

  • Multithreaded program, stable on existing system
  • OS changed; hardware and the rest of the software stack not changed
  • Started to freeze occasionally (1 run in 20)

– Change of OS exposed a latent bug in the code

  • Reporter captured bug as a checkpoint + script
  • Passed checkpoint and script to developer for analysis

Existing system: MPC8641 8 core, Glibc 2.5.1, Linux 2.6.23, Rule30_threaded.elf
Changed system: MPC8641 8 core, Glibc 2.5.1, Linux 2.6.27 (WR Linux 3.0), Rule30_threaded.elf

SLIDE 25

Debug Multicore Hang: Debug

  • Reproduction of bug trivial with checkpoint and script
  • Developer used OS awareness and source code debug to set breakpoints inside target program
    – On data accesses to shared work queue used by all threads
    – Unintrusive: does not change the behavior of the target system in any way
  • Custom script catches breakpoints
    – Diagnostics: state of queue (read target memory, perform calculations), queue control variable being accessed, source line, thread ID
    – For both successful and failing runs -> spotted the difference

[Diagram: the developer (D) works in Simics with OS awareness, source code debug, a custom script, and debug information for the binary program, all outside the target. Target: MPC8641 8 core, Glibc 2.5.1, Linux 2.6.27 (WR Linux 3.0), Rule30_threaded.elf]

SLIDE 26

Debug Multicore Hang

Example Diagnostic Output

[bp] Thread 918, writing variable empty with value 1. At rule30_packet_queue_get, line 157
     Prev. state: Done: 1 Empty: 0 Full: 0 Tail: 0 Head: 0 Elems: 0
[bp] Thread 918, writing variable full with value 0. At rule30_packet_queue_get, line 158
     Prev. state: Done: 1 Empty: 1 Full: 0 Tail: 0 Head: 0 Elems: 0
...
[bp] Thread 921, writing variable done with value 1. At rule30_packet_queue_signal_done, line 62
     Prev. state: Done: 0 Empty: 0 Full: 0 Tail: 0 Head: 98 Elems: 2

The Bug (source lines 68 to 71)

// - It only wakes up one thread...
pthread_cond_signal (&(q->notEmpty));
// To be correct:
//pthread_cond_broadcast (&(q->notEmpty));
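The difference between waking one waiter and waking all of them can be reproduced with Python's `threading.Condition`, where `notify()` plays the role of `pthread_cond_signal` and `notify_all()` that of `pthread_cond_broadcast`. This is a sketch of the general pattern, not the Rule30 program; the function name is made up, and a timeout stands in for "hangs forever" so that the demo terminates.

```python
import threading
import time


def count_notified(n_workers, broadcast, timeout=0.5):
    """Return how many waiting workers are woken by the 'done' signal."""
    cond = threading.Condition()
    state = {"waiting": 0, "notified": 0}

    def worker():
        with cond:
            state["waiting"] += 1
            # wait() returns True if notified, False if it timed out.
            if cond.wait(timeout):
                state["notified"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()

    # Signal only once every worker is blocked in wait().
    while True:
        with cond:
            if state["waiting"] == n_workers:
                if broadcast:
                    cond.notify_all()   # wakes every waiter
                else:
                    cond.notify()       # wakes a single waiter
                break
        time.sleep(0.001)

    for t in threads:
        t.join()
    return state["notified"]


print(count_notified(4, broadcast=False), count_notified(4, broadcast=True))
```

With a single notification, the remaining workers are never told that the queue is done; without the timeout they would wait indefinitely, which matches the hang described in the diagnostic output.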

SLIDE 27

Analyzer Looking at the Program

  • Nice speedup with 1 to 3 worker threads
  • With four worker threads, the program uses only two cores
  • With five worker threads, the efficiency is horrible and two of the worker threads are left hanging!

SLIDE 28

Simics and Multicore

Evaluating Software Scalability on Flexible Virtual Hardware

SLIDE 29

Scalable AMP Hardware

  • Scales to any number of cores
  • Configurable in several dimensions

[Diagram: scalable virtual Power Architecture multicore machine. Each node has a PPC440 core, local memory, an interrupt controller, and a serial port; the nodes are connected by an interrupt network and a global shared memory. Configurable: clock frequency, number of cores, memory sizes, access delay, and contention to global memory]

SLIDE 30

Memory Speed Impact

  • Varying memory latency of shared memory
  • Parallel processing benchmark
    – Shared memory restricted to single access and high latencies
    – Testing two different transfer modes, 1 packet and 4 packets per transmission
    – Scalability quite different

[Chart: "Scaling as Worker Nodes are Added". X axis: number of worker nodes (1 to 9). Y axis: performance relative to one worker node (0.00 to 10.00). Series: perfect memory; 100, 200, and 500 cycles, single port; and the same four memory settings with 4 packets/trans]
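Why batching packets changes scalability can be illustrated with a deliberately simple model. This is a toy sketch under stated assumptions, not the benchmark from the talk: each packet costs `compute` cycles of local work, each shared-memory transaction occupies the single port for `latency` cycles regardless of how many packets it carries, and queuing effects are ignored. The value of 1000 compute cycles is invented for the example.

```python
def speedup(workers, compute, latency, packets_per_trans=1):
    """Throughput relative to one worker in a toy single-port memory model."""
    # Port cycles charged per packet when one transaction carries a batch.
    mem = latency / packets_per_trans
    if mem == 0:
        return float(workers)           # perfect memory: linear scaling
    # The single port serves one transaction at a time, so it caps throughput.
    port_limit = (compute + mem) / mem
    return min(float(workers), port_limit)


for latency in (0, 100, 200, 500):
    row = [speedup(n, compute=1000, latency=latency) for n in (1, 3, 9)]
    batched = [speedup(n, 1000, latency, packets_per_trans=4) for n in (1, 3, 9)]
    print(latency, row, batched)
```

In this model, scaling is linear until the memory port saturates, and sending four packets per transaction moves the saturation point further out. The numbers are not meant to match the chart, only to show the shape of the effect.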

SLIDE 31

Simics and Multicore

Speeding up development by smart tricks

SLIDE 32

OS Prototyping: Xtratum Timebase

  • A multicore OS needs consistent time across all cores
    – First task in development is to establish such a timebase
    – On hardware, tricky timing loops are needed
  • With Simics, we can prototype using scripting
    – Mark time sync point with a magic instruction
    – Script triggers on the magic instruction, and resets all local times to the same time
  • A complex but non-value-added task becomes trivial
    – Shorten time to interesting experiments using Simics

http://www.tentech.ca/index.php/2010/09/easy-multi-core-powerpc-timebase-synchronization-with-simics/

The Code: OS and Script

static void __VBOOT synchronize_clocks(void)
{
    if (0 == GET_CPU_ID())
    {
        MAGIC(4);
    }
    BarrierWait(&g_smpPartitionInitBarrier);
}

def synchronize_ppc_timebase():
    # Get number of CPUs from system 0.
    # Using some assumptions
    num_cpus = conf.sim.cpu_info[0][1]

    # Iterate through all the cores
    for cpu_id in range(num_cpus):
        cpu = getattr(conf, "cpu%d" % cpu_id)
        # Simply reset the timebase
        cpu.tbu = 0
        cpu.tbl = 0

    print "Synchronized the CPU timebases at cpu0 cycle count %ld" % SIM_cycle_count(conf.cpu0)
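The mechanism can be tried out without a simulator using a small stand-in model. The sketch below is not the Simics API: the `Core` and `Simulator` classes, the hook registration, and the boot offsets are all hypothetical, and only the idea (a script reacting to a magic instruction by resetting every core's timebase) is taken from the slide.

```python
class Core:
    """Toy core with a free-running local timebase."""

    def __init__(self, core_id, start_offset):
        self.core_id = core_id
        self.timebase = start_offset    # cores start out of sync

    def tick(self, cycles):
        self.timebase += cycles


class Simulator:
    """Toy simulator that calls registered hooks on magic instructions."""

    def __init__(self, cores):
        self.cores = cores
        self.hooks = {}                 # magic number -> callback

    def on_magic(self, number, callback):
        self.hooks[number] = callback

    def magic(self, number):
        if number in self.hooks:
            self.hooks[number](self)


def synchronize_timebases(sim):
    # From outside the target, every timebase can be set in one step.
    for core in sim.cores:
        core.timebase = 0


sim = Simulator([Core(i, start_offset=17 * i) for i in range(4)])
sim.on_magic(4, synchronize_timebases)

before = [c.timebase for c in sim.cores]
sim.magic(4)                            # core 0 executes MAGIC(4)
for core in sim.cores:
    core.tick(100)                      # all cores then advance together
after = [c.timebase for c in sim.cores]
print(before, after)
```

On real hardware the same result needs carefully timed loops on each core, because no single agent can write all timebases at the same instant; in the simulator the script runs while target time stands still.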

SLIDE 33