Observation of Multi-Core SoCs Alexander Weiss Garching, 15.11.2013 - - PowerPoint PPT Presentation



SLIDE 1

Observation of Multi-Core SoCs

Alexander Weiss

Garching, 15.11.2013 Accemic GmbH & Co. KG

SLIDE 2

Garching, 15.11.2013, MAD 2013

Multi-Core SoC Observation

You can design it, but can you debug it?

[Martin, Grant; Mayer, Albrecht; “The challenges of heterogeneous multicore debug”, Design, Automation & Test in Europe Conference & Exhibition (DATE), 2010]

SLIDE 3

Multi-Core SoC Observation

Observation for

  • Debug
  • Tests / coverage analysis
  • WCET measurement
  • Detection of race conditions
  • Profiling / optimization

[Figure: distribution of execution times. Axes: execution time (ET) vs. probability. Labels: measured ET, possible ET, measured BCET, measured WCET, possible WCET, computed BCET and computed WCET (static analysis); overestimation = avoidable costs.]

Thread A:
    AcquireLock(A);
    ...
    AcquireLock(B);
    ...
    ReleaseLock(B);
    ...
    ReleaseLock(A);

Thread B:
    AcquireLock(B);
    ...
    AcquireLock(A);
    ...
    ReleaseLock(A);
    ...
    ReleaseLock(B);
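The two threads above take the same locks in opposite order, which can deadlock depending on timing. A minimal sketch of how such a lock-order inversion could be found in a recorded trace, assuming a hypothetical `LockEvent` record (thread, lock, acquire/release) that is not part of the presentation:

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_THREADS 4   /* assumption: small fixed limits for the sketch */
#define MAX_LOCKS   8

/* Hypothetical trace record: which thread acquired or released which lock. */
typedef struct {
    int  thread;
    int  lock;
    bool acquire;   /* true = AcquireLock, false = ReleaseLock */
} LockEvent;

/*
 * Returns true if some thread acquires lock X while holding lock Y,
 * and a different thread acquires Y while holding X. The observed run
 * may have completed without deadlock; the inversion shows that a
 * deadlock is possible under a different interleaving.
 */
bool has_lock_order_inversion(const LockEvent *trace, size_t n)
{
    bool     held[MAX_THREADS][MAX_LOCKS] = {{false}};
    /* order[a][b]: bit t is set if thread t acquired b while holding a */
    unsigned order[MAX_LOCKS][MAX_LOCKS] = {{0}};

    for (size_t i = 0; i < n; ++i) {
        const LockEvent *e = &trace[i];
        if (e->thread < 0 || e->thread >= MAX_THREADS ||
            e->lock < 0 || e->lock >= MAX_LOCKS) {
            continue;   /* ignore records outside the sketch's limits */
        }
        if (!e->acquire) {
            held[e->thread][e->lock] = false;
            continue;
        }
        for (int h = 0; h < MAX_LOCKS; ++h) {
            if (!held[e->thread][h]) {
                continue;
            }
            /* Did another thread take these two locks the other way round? */
            if (order[e->lock][h] & ~(1u << e->thread)) {
                return true;
            }
            order[h][e->lock] |= 1u << e->thread;
        }
        held[e->thread][e->lock] = true;
    }
    return false;
}
```

Feeding it the serialized events of Thread A followed by Thread B reports an inversion, even though that particular interleaving runs to completion.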

[Figure: defect classification. Deterministic defects: Bohrbugs. Non-deterministic defects: Mandelbugs (Heisenbugs, aging-related bugs). Further labels: disturbances, controllable, challenge.]

SLIDE 4

Multi-Core SoC Observation

[Figure: observation and analysis flow. Observation: the MPSoC (4 CPUs, interconnect) exposes a subset of its internal states as trace data, limited by the available bandwidth. Analysis: the analyzer turns the trace data into results. Questions: WCET, profiling, race conditions?, defects?]

Multi-Core observation challenges

  • Make internal states visible
  • Analyze trace data
SLIDE 5

What we want to know

  • CPUs
      • Executed instructions
      • Clock cycles / instruction
      • Data access (value, address, direction)
      • Cache
      • CPU register
      • Events
  • Bus system / bus master peripherals
      • Bus master data access (value, address, direction)
      • Events (timeouts, concurrent access, split transfers, errors)
  • Vector clock
      • Temporal assignment of CPU and bus master operations
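A vector clock, as named in the list above, gives a partial order over operations on different CPUs. The following is a minimal sketch of the textbook data structure, not the presenter's implementation; `NUM_CPUS` and all function names are assumptions made for illustration:

```c
#include <stdbool.h>

#define NUM_CPUS 4   /* assumption: one clock component per CPU */

typedef struct {
    unsigned c[NUM_CPUS];
} VClock;

/* Local operation on `cpu`: advance that CPU's own component. */
void vc_tick(VClock *v, int cpu)
{
    v->c[cpu]++;
}

/* Synchronization (e.g. lock hand-over): take the component-wise
 * maximum of both clocks, then count the receiving operation itself. */
void vc_merge(VClock *dst, const VClock *src, int cpu)
{
    for (int i = 0; i < NUM_CPUS; ++i) {
        if (src->c[i] > dst->c[i]) {
            dst->c[i] = src->c[i];
        }
    }
    dst->c[cpu]++;
}

/* a happened before b iff every component of a is <= the one in b
 * and at least one component is strictly smaller. */
bool vc_happened_before(const VClock *a, const VClock *b)
{
    bool strictly_less = false;
    for (int i = 0; i < NUM_CPUS; ++i) {
        if (a->c[i] > b->c[i]) {
            return false;
        }
        if (a->c[i] < b->c[i]) {
            strictly_less = true;
        }
    }
    return strictly_less;
}

/* True if neither clock happened before the other. Two identical
 * clocks are also reported as unordered by this simple check. */
bool vc_unordered(const VClock *a, const VClock *b)
{
    return !vc_happened_before(a, b) && !vc_happened_before(b, a);
}
```

Two accesses to the same address whose clocks are unordered, with at least one of them a write, are candidates for a race condition.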

Multi-Core SoC Observation - Requirements

UNDESIRED MECHANISMS AFFECTING THE TEMPORAL DETERMINISM from: Kotaba et al, “Multicore In Real-Time Systems – Temporal Isolation Challenges Due To Shared Resources”, 2013

SLIDE 6

Other requirements

  • Real-time capability
  • Non-intrusiveness
  • Concurrent observation of multiple CPUs / buses / peripherals
  • State-specific observation focus
  • Observation of mass-produced SoCs
  • Unlimited time observation
  • Low latency
  • Intuitive to use

Multi-Core SoC Observation - Requirements

SLIDE 7

Software Instrumentation

Multi-Core SoC Observation – State of the Art

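Original source code:

    #include <stdbool.h>
    #include <stdio.h>

    void foo() {
        bool found = false;
        for (int i = 0; (i < 100) && (!found); ++i) {
            if (i == 50) break;
            if (i == 20) found = true;
            if (i == 30) found = true;
        }
        printf("foo\n");
    }

Instrumented code (statement / condition coverage):

    #include <stdbool.h>
    #include <stdio.h>

    char inst[15];  /* one flag per statement or condition outcome */

    void foo() {
        bool found = false;
        /* Each condition records its outcome and yields its original value. */
        for (int i = 0;
             ((i < 100) ? (inst[0] = 1, 1) : (inst[1] = 1, 0)) &&
             ((!found) ? (inst[2] = 1, 1) : (inst[3] = 1, 0));
             ++i) {
            if ((i == 50) ? (inst[4] = 1, 1) : (inst[5] = 1, 0)) { inst[6] = 1; break; }
            if ((i == 20) ? (inst[7] = 1, 1) : (inst[8] = 1, 0)) { inst[9] = 1; found = true; }
            if ((i == 30) ? (inst[10] = 1, 1) : (inst[11] = 1, 0)) { inst[12] = 1; found = true; }
            inst[13] = 1;
        }
        printf("foo\n");
        inst[14] = 1;
    }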

  + Easy to implement / low cost for tests
  - Additional resources / different behavior
  - Changing the observation focus requires recompiling the code
  - Temporal assignment of processes on different CPUs is very limited
  - Questionable approach: removing instrumentation from production code
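To show what the instrumentation flags deliver after a run, here is a small sketch that executes an instrumented `foo` and evaluates the `inst[]` array. The helper names `covered_probes` and `report_uncovered` are illustrative assumptions, not part of any coverage tool:

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_PROBES 15

char inst[NUM_PROBES];  /* one flag per statement or condition outcome */

/* Instrumented function: every condition records which outcome occurred
 * and still evaluates to its original truth value. */
void foo(void)
{
    bool found = false;
    for (int i = 0;
         ((i < 100) ? (inst[0] = 1, 1) : (inst[1] = 1, 0)) &&
         ((!found) ? (inst[2] = 1, 1) : (inst[3] = 1, 0));
         ++i) {
        if ((i == 50) ? (inst[4] = 1, 1) : (inst[5] = 1, 0)) { inst[6] = 1; break; }
        if ((i == 20) ? (inst[7] = 1, 1) : (inst[8] = 1, 0)) { inst[9] = 1; found = true; }
        if ((i == 30) ? (inst[10] = 1, 1) : (inst[11] = 1, 0)) { inst[12] = 1; found = true; }
        inst[13] = 1;
    }
    printf("foo\n");
    inst[14] = 1;
}

/* Number of probes that were hit at least once. */
int covered_probes(void)
{
    int hit = 0;
    for (int i = 0; i < NUM_PROBES; ++i) {
        if (inst[i]) {
            ++hit;
        }
    }
    return hit;
}

/* Lists the probes that were never hit, i.e. the coverage gaps. */
void report_uncovered(void)
{
    for (int i = 0; i < NUM_PROBES; ++i) {
        if (!inst[i]) {
            printf("probe %d not covered\n", i);
        }
    }
}
```

After one call to `foo()`, the loop ends at `i == 20` via `found`, so the probes for `i >= 100`, for the `i == 50` branch and for the `i == 30` branch stay unset. These are exactly the gaps a coverage report would flag.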

SLIDE 8

State of the Art: Embedded trace

Multi-Core SoC Observation – State of the Art

[Figure: MPSoC (4 CPUs). Each of CPU 0 .. CPU 3 emits time stamps, instructions, data read and data written.]

Bandwidth (average):

  • 0.3 .. 4 bit / instruction (non-cycle-accurate)
  • 5 .. 8 bit / instruction (cycle-accurate)
  • 8 bit / instruction (data trace)

Example: MPSoC, 4 CPUs, 1 GHz, cycle-accurate instruction + data trace:
4 x 14 bit x 1 GHz => approx. 28 Gbit/s (+ timestamps) (+ bus master trace) (+ peak bandwidth)

[Figure: embedded-trace-based emulation system. Target: MPSoC with CPUs, peripherals and embedded trace. The trace data goes to the emulator for offline analysis.]
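The bandwidth estimate above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the function name and the `ipc` parameter (average instructions per clock cycle) are introduced here and do not appear on the slide. The plain product 4 x 14 bit x 1 GHz is 56 Gbit/s; the slide's figure of approx. 28 Gbit/s results when an average of 0.5 instructions per cycle is assumed.

```c
/*
 * Average trace bandwidth in Gbit/s.
 *   cpus           number of traced CPUs
 *   bits_per_instr trace bits generated per executed instruction
 *   clock_ghz      CPU clock in GHz
 *   ipc            average instructions per clock cycle (assumption,
 *                  not given on the slide)
 */
double trace_bandwidth_gbit(int cpus, double bits_per_instr,
                            double clock_ghz, double ipc)
{
    return cpus * bits_per_instr * clock_ghz * ipc;
}
```

For example, `trace_bandwidth_gbit(4, 14.0, 1.0, 0.5)` yields 28.0, and the same call with `ipc = 1.0` yields 56.0. Timestamps, bus master trace and peak load come on top of either value.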

SLIDE 9

Multi-Core SoC Observation – Academic Research

  • Huang/Kao/Yang (National Sun Yat-Sen University, Taiwan)
  • SYS-SIP SoC Development Infrastructure
  • Three-stage lossless instruction trace compression

Fu-Ching Yang, "SYS-SIP SoC Development Infrastructure", Dissertation, National Sun Yat-Sen University, 2009
Fu-Ching Yang; Yi-Ting Lin; Chung-Fu Kao; Ing-Jer Huang, "An On-Chip AHB Bus Tracer With Real-Time Compression and Dynamic Multiresolution Supports for SoC", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19, no. 4, pp. 571-584, April 2011

  Stage                        Hardware cost   Trace volume after stage
  (uncompressed trace)         -               100%
  Branch / target filtering    ~1k gates       20%
  Slicing & differential       ~2k gates       10%
  LZ-based compression         ~120k gates     0.3%
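The idea behind a branch / target filtering stage can be illustrated as follows. This is a simplified sketch and not the algorithm of the cited tracer: it assumes fixed 4-byte instructions and a hypothetical `BranchRecord` type, and it keeps only the points where the program counter does not continue sequentially.

```c
#include <stddef.h>
#include <stdint.h>

#define INSTR_SIZE 4u   /* assumption: fixed-length 4-byte instructions */

/* Hypothetical record for one control-flow discontinuity. */
typedef struct {
    uint32_t from;   /* address of the last sequential instruction */
    uint32_t to;     /* address execution continued at */
} BranchRecord;

/*
 * Scans a full program counter stream and stores only the
 * discontinuities. Sequential addresses in between can be
 * regenerated from the program image, so dropping them loses
 * no information.
 *
 * Returns the number of discontinuities found; at most `out_cap`
 * of them are written to `out`.
 */
size_t filter_branches(const uint32_t *pc, size_t n,
                       BranchRecord *out, size_t out_cap)
{
    size_t count = 0;
    for (size_t i = 1; i < n; ++i) {
        if (pc[i] != pc[i - 1] + INSTR_SIZE) {
            if (count < out_cap) {
                out[count].from = pc[i - 1];
                out[count].to = pc[i];
            }
            ++count;
        }
    }
    return count;
}
```

For the address stream 0x100, 0x104, 0x108, 0x200, 0x204, 0x100, only two records remain: 0x108 -> 0x200 and 0x204 -> 0x100.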

SLIDE 10

hidICE (hidden ICE)

  • Emulate the SoC core region and access trace data from there
  • Easy access to full trace data

Multi-Core SoC Observation – hidICE

[Figure: hidICE-based emulator. Device under test: SoC with CPU core(s), peripherals and hidICE IP, sending sync data. Trace tool: emulation CPU core(s) with hidICE IP, delivering full trace data to the trace recorder and to processing (code coverage, benchmarking, profiling).]

SLIDE 11

Multi-Core SoC Observation – hidICE

[Figure: hidICE architecture. SoC: CPU1, CPU2, DMA, ROM, RAM, (high-performance) bus system, bus bridge, periphery bus system with periphery units, external bus interface, interrupt controller, clock, Sync TX IP. Emulator (development system): CPU1, CPU2, DMA, ROM, RAM, (high-performance) bus system, Sync RX IP, bus OCD, trace data interface feeding trace analysis (instructions, CPU register, data, time).]

Steps: signals to transmit, serialization, deserialization, synchronization

SLIDE 12

Multi-Core SoC Observation – hidICE

[Figure: hidICE architecture with integrity check. SoC: CPU1, CPU2, DMA, ROM, RAM, (high-performance) bus system, bus bridge, periphery bus system with periphery units, external bus interface, hash IP, interrupt controller, clock, Sync TX IP. Emulator (development system): CPU1, CPU2, DMA, ROM, RAM, hash IP, comparator, (high-performance) bus system, Sync RX IP, bus OCD, trace data interface feeding trace analysis (instructions, CPU register, data, time, system integrity).]

Steps: signals to transmit, serialization, deserialization, hash calculation, hash check, synchronization, system integrity control
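The hash calculation and hash check steps let the emulator confirm that it still mirrors the SoC. A minimal sketch of that comparison, assuming both sides hash the same sequence of 32-bit state words in fixed windows; FNV-1a is used here only as a simple stand-in, since the presentation does not name the hash function, and all identifiers are illustrative:

```c
#include <stddef.h>
#include <stdint.h>

/* One FNV-1a step over the four bytes of a 32-bit word. */
static uint32_t hash_update(uint32_t h, uint32_t word)
{
    for (int i = 0; i < 4; ++i) {
        h ^= (word >> (8 * i)) & 0xFFu;
        h *= 16777619u;             /* FNV prime */
    }
    return h;
}

/* Hash over `n` state words, starting from the FNV offset basis. */
uint32_t hash_states(const uint32_t *words, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; ++i) {
        h = hash_update(h, words[i]);
    }
    return h;
}

/*
 * Compares SoC and emulation window by window, as a comparator in
 * the emulator might do with hashes received from the SoC.
 * Returns the index of the first window whose hashes differ,
 * or -1 if all windows match (or `window` is 0).
 */
long first_divergence(const uint32_t *soc, const uint32_t *emu,
                      size_t n, size_t window)
{
    if (window == 0) {
        return -1;
    }
    long index = 0;
    for (size_t start = 0; start < n; start += window, ++index) {
        size_t len = (n - start < window) ? (n - start) : window;
        if (hash_states(soc + start, len) != hash_states(emu + start, len)) {
            return index;
        }
    }
    return -1;
}
```

Only the hash per window has to cross the synchronization link, so the check costs little bandwidth compared with transmitting the states themselves.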
SLIDE 13

Multi-Core SoC Observation - hidICE

[Figure: hidICE-based emulation on two ML507 boards. Board #1, FPGA #1 (SoC, target): MPSoC with 3 CPUs, peripherals and hidICE IP (TX). Board #2, FPGA #2 (emulation): emulation of the 3 CPUs with hidICE IP (RX). The boards are linked by the synchronisation channel; the emulator outputs trace data for trace data computation, recording and analysis.]

MPSoC implementation (3 x SPARC V8 / LEON3)

SLIDE 14

Multi-Core SoC Observation - hidICE

Comparison for an MPSoC with 4 CPUs at 1 GHz:

Embedded trace

  • Transmitted per CPU (CPU 0 .. CPU 3): time stamps, instructions, data read, data written
  • Cycle-accurate instruction + data trace: 4 x 14 bit x 1 GHz => approx. 28 Gbit/s (+ timestamps) (+ bus master trace) (+ peak bandwidth)

hidICE

  • Transmitted: data read (I/O) and events (CPU0 .. CPU3)
  • Peripherals: 1 x USB 2.0, 2 x Gbit Ethernet, some low-speed peripherals
  • Synchronization bandwidth: < 4 Gbit/s
      • timestamps included
      • bus master trace included
      • peak bandwidth included
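Because hidICE transmits the data read from I/O plus events instead of a full trace, its link load is bounded by the input rate of the peripherals. A rough sketch of that bound, assuming every peripheral delivers input at its nominal line rate (480 Mbit/s for USB 2.0, 1000 Mbit/s per Gigabit Ethernet port); the function name is illustrative and the event stream is not modeled:

```c
/* Sum of the nominal peripheral input rates in Mbit/s. */
double io_input_rate_mbit(const double *rates_mbit, int count)
{
    double total = 0.0;
    for (int i = 0; i < count; ++i) {
        total += rates_mbit[i];
    }
    return total;
}
```

With the rates `{480.0, 1000.0, 1000.0}` the sum is 2480 Mbit/s, which leaves headroom below the stated 4 Gbit/s for events and low-speed peripherals.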

SLIDE 15

Multi-Core SoC Observation - hidICE

[Figure: two variants of a quad-core SoC block diagram. Blocks: Core0 .. Core3 with ETM, AXI, DMA, display, Ethernet, USB, APB bridge, APB peripherals, bus trace, system trace, hidICE Sync TX, hidICE Sync RX. Labels: 284 signal pins, extended TPIU, 8 x 32 bit trace port, 28 bit sync, 28 bit sync input, 28 bit display.]

Draft: hidICE for quad core SoC

SLIDE 16

hidICE summary

  + Cycle-accurate instruction and data trace from all CPUs
  + Cycle-accurate data trace from all bus masters
  + Long-time observation
  + Low latency
  + Low intrusiveness (port replacement)
  - Implementation effort (e.g. clock domain synchronization, correct implementation of the emulation)
  - “All or none”: no partial trace
  - Not applicable for SoCs with high I/O bandwidth

hidICE - New Observation Approach

SLIDE 17

Multi-Core SoC Observation

[Figure: observation and analysis flow. Observation: the MPSoC (4 CPUs, interconnect) exposes a subset of its internal states as trace data, limited by the available bandwidth. Analysis: the analyzer turns the trace data into results. Questions: WCET, profiling, race conditions?, defects?]

Multi-Core observation challenges

  • Make internal states visible
  • Analyze trace data
SLIDE 18

State of the Art: Trace data recording and offline processing

Multi-Core SoC Observation – Trace data analysis

[Figure: offline processing chain. SoC (CPU0 .. CPU3, peripherals, memories, trace port; SoC vendor's responsibility) -> trace tool with SDRAM -> host with hard disc and offline trace data processing -> results (tool vendor's responsibility).]

Limitations:

  • Trace bandwidth >> processing bandwidth
  • High latency in detecting SoC internal states
  • Limited support for multiple observation focuses
SLIDE 19

Multi-Core SoC Observation - Trace data analysis

Infineon MCDS

  • Combine SoC and trace tool

Mayer, A.; Siebert, H.; McDonald-Maier, K.D., "Boosting Debugging Support for Complex Systems on Chip", Computer, vol. 40, no. 4, pp. 76-81, April 2007

Source: www.ipextreme.com

SLIDE 20

[Figure: real-time processing trace tool. The SoC (CPU0 .. CPU3, peripherals, memories) delivers trace data through its trace port (CoreSight / Nexus via parallel, Aurora or PCIe lanes, up to 20 Gbit/s) to the trace tool for online processing. Stages: message processing (instrumentation trace, data trace, ownership trace, watchpoint trace, instruction trace), instruction reconstruction, event processing (configurable generic event processing structures, custom programmable FPGA fabric, custom programmable processor), results. Configuration: HLL propositions compiler.]


Online trace data processing

Multi-Core SoC Observation – Online trace processing

  • Processing bandwidth >= trace bandwidth
  • µs latency in detecting SoC internal states
  • Multiple observation focuses
SLIDE 21

Online trace processing implementation example

P4080 (8 cores e500mc @1.3 GHz)

Multi-Core SoC Observation – Online trace processing

[Figure: implementation block diagram. QorIQ P4080, connected with up to 8 channels (Aurora). Xilinx Virtex 7 FPGA: PCIe endpoint, data acquisition, message processing (ownership, watchpoints, instruction trace, data trace (W) (core and NPC), data trace (R) (NPC only), errors), timestamp generator, trace memory (ring buffer), instruction reconstruction, event processing (sequencer, counter), external interface. Xilinx Zynq: debug server (JTAG), user-programmable dual-core Cortex-A9, user-programmable FPGA fabric (Verilog, VHDL, SystemC).]
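The trace memory in this setup is organized as a ring buffer, so the most recent messages stay available while older ones are overwritten. A minimal sketch of that behavior; the capacity, the 32-bit message type and the function names are assumptions for illustration:

```c
#include <stddef.h>
#include <stdint.h>

#define RING_CAP 4   /* assumption: tiny capacity to keep the example short */

typedef struct {
    uint32_t data[RING_CAP];
    size_t   head;    /* next slot to write */
    size_t   count;   /* number of valid messages, at most RING_CAP */
} TraceRing;

/* Stores a message; once the buffer is full the oldest one is overwritten. */
void ring_push(TraceRing *r, uint32_t msg)
{
    r->data[r->head] = msg;
    r->head = (r->head + 1) % RING_CAP;
    if (r->count < RING_CAP) {
        r->count++;
    }
}

/* Returns the i-th retained message, where i = 0 is the oldest.
 * The caller must ensure i < r->count. */
uint32_t ring_get(const TraceRing *r, size_t i)
{
    size_t start = (r->head + RING_CAP - r->count) % RING_CAP;
    return r->data[(start + i) % RING_CAP];
}
```

After pushing the messages 1 to 6 into an empty ring, the buffer holds 3, 4, 5 and 6. When an event of interest fires, the history leading up to it can be read out in order.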

SLIDE 22

Online trace processing

  • High implementation effort (algorithms, FPGA resources) for
      • Transport protocol decoding (ARM CoreSight TPIU)
      • Detection of message boundaries (ARM CoreSight)
      • Online reconstruction of direct branches (Nexus / ARM)
      • Runtime verification infrastructure
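Reconstruction of direct branches is needed because the trace reports only whether a branch was taken; the branch target has to come from the program image. The sketch below shows the principle on a deliberately simplified model. The `Instr` type, index-based addresses and one taken / not-taken flag per branch are assumptions and do not reflect the actual Nexus or CoreSight message formats.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical program image entry: is it a direct conditional
 * branch, and if so, which instruction index does it jump to? */
typedef struct {
    bool   is_branch;
    size_t target;
} Instr;

/*
 * Replays the program from `start` and consumes one taken/not-taken
 * flag per branch. Writes the executed instruction indices to `out`.
 * Stops when execution leaves the program, when `out` is full, or
 * when a branch is reached and no flag is left to resolve it.
 * Returns the number of indices written.
 */
size_t reconstruct(const Instr *prog, size_t prog_len, size_t start,
                   const bool *taken, size_t n_taken,
                   size_t *out, size_t out_cap)
{
    size_t pc = start;
    size_t used = 0;
    size_t count = 0;

    while (pc < prog_len && count < out_cap) {
        if (prog[pc].is_branch && used == n_taken) {
            break;   /* the trace does not cover this branch yet */
        }
        out[count++] = pc;
        if (prog[pc].is_branch) {
            pc = taken[used++] ? prog[pc].target : pc + 1;
        } else {
            pc = pc + 1;
        }
    }
    return count;
}
```

For a four-instruction program whose third instruction branches back to the first, the flags "taken, not taken" reconstruct the sequence 0, 1, 2, 0, 1, 2, 3. Doing this online at trace speed is where the algorithmic and FPGA effort arises.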

Our wish to SoC vendors: More trace bandwidth!

Multi-Core SoC Observation – Online trace processing

SLIDE 23

Online trace processing advantages

  + Real-time capability / unlimited observation time
  + Low latency of state-specific changes of observation focus
  + Parallel observation of multiple hotspots
  + Low latency of state-specific triggering of the SoC (e.g. tests)
  + Immediate access to the runtime information by the developer
  + Instruction / basic block level continuous profiling
  + Basic block level continuous WCET measurement
  + Automatic detection of race conditions (e.g. “happened before” algorithm)
  + Extension of SoC internal debug / profiling structures
  + Combination of trace debugging and run/stop debugging
  + The “Swiss army knife” in debugging: online runtime verification
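Continuous WCET measurement at basic block level can be sketched as a running minimum and maximum per block, updated for every block entry seen in the trace. The record layout and all names below are assumptions for illustration. The values are measured execution times, so they give no guaranteed bound.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_BLOCKS 16   /* assumption: small fixed number of basic blocks */

/* Hypothetical trace record: basic block `block` was entered at `cycle`. */
typedef struct {
    int      block;
    uint64_t cycle;
} BlockEntry;

typedef struct {
    uint64_t      min;    /* measured best-case execution time in cycles */
    uint64_t      max;    /* measured worst-case execution time in cycles */
    unsigned long runs;   /* number of measured executions */
} BlockStats;

/*
 * Updates the per-block statistics from a stream of block entries.
 * The duration of a block is taken as the time until the next block
 * is entered, so the last record only closes its predecessor.
 */
void update_block_stats(BlockStats *stats, const BlockEntry *trace, size_t n)
{
    for (size_t i = 0; i + 1 < n; ++i) {
        int b = trace[i].block;
        if (b < 0 || b >= MAX_BLOCKS) {
            continue;   /* ignore block ids outside the sketch's limits */
        }
        uint64_t duration = trace[i + 1].cycle - trace[i].cycle;
        BlockStats *s = &stats[b];
        if (s->runs == 0 || duration < s->min) {
            s->min = duration;
        }
        if (s->runs == 0 || duration > s->max) {
            s->max = duration;
        }
        s->runs++;
    }
}
```

Because the statistics are updated as the trace arrives, the observation can run for an unlimited time without storing the trace itself.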

Multi-Core SoC Observation – Online trace processing

SLIDE 24

Questions?

In cooperation with

Contact:
Alexander Weiss
Accemic GmbH & Co. KG
Franz-Huber-Str. 39
83088 Kiefersfelden
aweiss@accemic.com
+49 8033 6039790
