Observation of Multi-Core SoCs
Alexander Weiss, Accemic GmbH & Co. KG
Garching, 15.11.2013
Garching, 15.11.2013 MAD 2013 2
Multi-Core SoC Observation
You can design it, but can you debug it?
[Martin, Grant; Mayer, Albrecht; “The challenges of heterogeneous multicore debug”, Design, Automation & Test in Europe Conference & Exhibition (DATE), 2010]
Multi-Core SoC Observation
Observation for
- Debug
- Tests / coverage analysis
- WCET measurement
- Detection of race conditions
- Profiling / optimization
[Figure: probability over execution time (ET). Measured ET lies between measured BCET and measured WCET; possible ET extends up to the possible WCET. Static analysis yields computed BCET and computed WCET; the overestimation of the computed WCET leads to avoidable costs.]
Thread A: AcquireLock(A); ... AcquireLock(B); ... ReleaseLock(B); ... ReleaseLock(A);
Thread B: AcquireLock(B); ... AcquireLock(A); ... ReleaseLock(A); ... ReleaseLock(B);
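The two threads above acquire locks A and B in opposite nesting order, which can deadlock. A minimal sketch of how an analyzer might flag this pattern from a trace of lock events follows. The `lock_event` structure, the size limits and the function name are hypothetical and not taken from the presentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LOCKS   8   /* assumption: small fixed number of locks   */
#define MAX_THREADS 8   /* assumption: small fixed number of threads */

/* Hypothetical trace event: a thread acquires or releases a lock. */
typedef struct {
    int  thread;   /* thread id, 0 <= thread < MAX_THREADS */
    int  lock;     /* lock id,   0 <= lock   < MAX_LOCKS   */
    bool acquire;  /* true = AcquireLock, false = ReleaseLock */
} lock_event;

/* Returns true if two different threads were observed nesting the same
 * pair of locks in opposite order (a potential deadlock). */
bool has_lock_order_inversion(const lock_event *ev, size_t n)
{
    bool held[MAX_THREADS][MAX_LOCKS] = {{false}};
    /* nested[a][b]: bit t is set if thread t acquired b while holding a */
    unsigned nested[MAX_LOCKS][MAX_LOCKS] = {{0}};

    for (size_t i = 0; i < n; ++i) {
        int t = ev[i].thread, l = ev[i].lock;
        assert(t >= 0 && t < MAX_THREADS && l >= 0 && l < MAX_LOCKS);
        if (ev[i].acquire) {
            for (int a = 0; a < MAX_LOCKS; ++a)
                if (held[t][a])
                    nested[a][l] |= 1u << t;
            held[t][l] = true;
        } else {
            held[t][l] = false;
        }
    }

    for (int a = 0; a < MAX_LOCKS; ++a)
        for (int b = a + 1; b < MAX_LOCKS; ++b)
            for (int t1 = 0; t1 < MAX_THREADS; ++t1)
                for (int t2 = 0; t2 < MAX_THREADS; ++t2)
                    if (t1 != t2 &&
                        ((nested[a][b] >> t1) & 1u) &&
                        ((nested[b][a] >> t2) & 1u))
                        return true;
    return false;
}
```

Fed with the eight events of Thread A and Thread B (A = lock 0, B = lock 1), the function reports an inversion even if the deadlock did not occur in the observed run.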
[Figure: classification of defects. Deterministic: Bohrbugs. Non-deterministic: Mandelbugs, Heisenbugs, aging-related bugs. Further labels: disturbances, controllable, challenge.]
Multi-Core SoC Observation
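[Figure: observation and analysis chain. The MPSoC (four CPUs and an interconnect) has internal states; the available bandwidth limits observation to a subset of internal states, delivered as trace data. The analyzer turns trace data into results that answer questions about WCET, profiling, race conditions and defects.]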
Multi-Core observation challenges
- Make internal states visible
- Analyze trace data
What we want to know
- CPUs
  - Executed instructions
  - Clock cycles / instruction
  - Data access (value, address, direction)
  - Cache
  - CPU register
  - Events
- Bus system / bus master peripherals
  - Bus master data access (value, address, direction)
  - Events (timeouts, concurrent access, split transfers, errors)
- Vector clock
  - Temporal assignment of CPUs and bus master operations
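If "vector clock" is read in its usual sense, the temporal assignment of operations from different CPUs and bus masters can be sketched as below. The number of sources, the type and the function names are assumptions for illustration only; the slides do not specify an implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define N_SOURCES 4  /* assumption: four observed CPUs / bus masters */

/* One logical counter per observed source. */
typedef struct { unsigned c[N_SOURCES]; } vclock;

/* A local event on source `self` advances its own counter. */
void vc_tick(vclock *v, int self)
{
    assert(self >= 0 && self < N_SOURCES);
    v->c[self]++;
}

/* A synchronising event (e.g. reading a value another source wrote):
 * take the component-wise maximum, then count the local event. */
void vc_merge(vclock *v, const vclock *other, int self)
{
    for (int i = 0; i < N_SOURCES; ++i)
        if (other->c[i] > v->c[i])
            v->c[i] = other->c[i];
    vc_tick(v, self);
}

/* a happened before b: a <= b in every component and a != b. */
bool vc_happened_before(const vclock *a, const vclock *b)
{
    bool strictly_less = false;
    for (int i = 0; i < N_SOURCES; ++i) {
        if (a->c[i] > b->c[i]) return false;
        if (a->c[i] < b->c[i]) strictly_less = true;
    }
    return strictly_less;
}

/* Neither event is ordered before the other. */
bool vc_concurrent(const vclock *a, const vclock *b)
{
    return !vc_happened_before(a, b) && !vc_happened_before(b, a);
}
```

Two accesses to the same address whose clocks are concurrent, with at least one of them a write, are the candidates a race detector would report.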
Multi-Core SoC Observation - Requirements
UNDESIRED MECHANISMS AFFECTING THE TEMPORAL DETERMINISM from: Kotaba et al, “Multicore In Real-Time Systems – Temporal Isolation Challenges Due To Shared Resources”, 2013
Other requirements
- Real-time capability
- Non-intrusiveness
- Concurrent observation of multiple CPUs / buses / peripherals
- State-specific observation focus
- Observation of mass-produced SoCs
- Unlimited time observation
- Low latency
- Intuitive to use
Multi-Core SoC Observation - Requirements
Software Instrumentation
Multi-Core SoC Observation – State of the Art
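Original source code:

#include <stdbool.h>
#include <stdio.h>

void foo(void) {
    bool found = false;
    for (int i = 0; (i < 100) && (!found); ++i) {
        if (i == 50) break;
        if (i == 20) found = true;
        if (i == 30) found = true;
    }
    printf("foo\n");
}

Instrumented code (statement / condition coverage):

#include <stdbool.h>
#include <stdio.h>

char inst[15];  /* one coverage flag per probe */

void foo(void) {
    bool found = false;
    /* Each condition probe sets its flag and then yields the original
       truth value, so the control flow stays unchanged. */
    for (int i = 0;
         ((i < 100) ? (inst[0] = 1, 1) : (inst[1] = 1, 0)) &&
         ((!found)  ? (inst[2] = 1, 1) : (inst[3] = 1, 0));
         ++i) {
        if ((i == 50) ? (inst[4] = 1, 1) : (inst[5] = 1, 0)) { inst[6] = 1; break; }
        if ((i == 20) ? (inst[7] = 1, 1) : (inst[8] = 1, 0)) { inst[9] = 1; found = true; }
        if ((i == 30) ? (inst[10] = 1, 1) : (inst[11] = 1, 0)) { inst[12] = 1; found = true; }
        inst[13] = 1;  /* end of loop body reached */
    }
    printf("foo\n");
    inst[14] = 1;      /* end of function reached */
}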
+ Easy to implement / low cost for tests
- Additional resources / different behavior
- Changing observation focus requires code recompilation
- Temporal assignment of processes on different CPUs is very limited
- Questionable approach: removing instrumentation from production code
State of the Art: Embedded trace
Multi-Core SoC Observation – State of the Art
[Figure: embedded trace of an MPSoC (4 CPUs). Each of CPU 0 .. CPU 3 emits time stamps, instructions, data read and data written.]
Bandwidth (average):
- 0.3 .. 4 bit / instruction (non-cycle-accurate)
- 5 .. 8 bit / instruction (cycle-accurate)
- 8 bit / instruction (data trace)
MPSoC, 4 CPUs, 1 GHz, cycle-accurate instruction + data trace:
4 x 14 bit x 1 GHz => approx. 28 Gbit/s (+ timestamps) (+ bus master trace) (+ peak bandwidth)
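The estimate above can be reproduced with a small calculation. The 14 bit per instruction presumably combine a cycle-accurate instruction trace from the 5 .. 8 bit range with 8 bit of data trace. Multiplying 4 x 14 bit x 1 GHz directly gives 56 Gbit/s at one instruction per cycle; the stated 28 Gbit/s matches an average of 0.5 instructions per cycle. That factor is an assumption made here and does not appear on the slide, and the function name is hypothetical.

```c
#include <assert.h>

/* Average trace bandwidth in Gbit/s.
 * ipc (average instructions per cycle) is an assumption of this sketch;
 * the slide only states the result of approx. 28 Gbit/s. */
double trace_bandwidth_gbit(int cpus, double bits_per_instr,
                            double clock_ghz, double ipc)
{
    assert(cpus > 0 && bits_per_instr >= 0.0 && clock_ghz >= 0.0 && ipc >= 0.0);
    return cpus * bits_per_instr * clock_ghz * ipc;
}
```

For example, `trace_bandwidth_gbit(4, 14.0, 1.0, 0.5)` yields 28, and the same call with `ipc = 1.0` yields 56. Timestamps, bus master trace and peak load come on top of either figure.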
[Figure: embedded-trace-based emulation system. The target MPSoC (CPUs, peripherals, embedded trace) delivers trace data to the emulator for offline analysis.]
Multi-Core SoC Observation – Academic Research
- Huang / Kao / Yang (National Sun Yat-Sen University, Taiwan)
  - SYS-SIP SoC Development Infrastructure
  - Three-stage lossless instruction trace compression
Fu-Ching Yang, "SYS-SIP SoC Development Infrastructure", Dissertation, National Sun Yat-Sen University, 2009
Fu-Ching Yang; Yi-Ting Lin; Chung-Fu Kao; Ing-Jer Huang, "An On-Chip AHB Bus Tracer With Real-Time Compression and Dynamic Multiresolution Supports for SoC", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19, no. 4, pp. 571-584, April 2011
Remaining trace data volume after each compression stage:
- Uncompressed trace: 100%
- Branch / Target filtering (~1k gates): 20%
- Slicing & Differential (~2k gates): 10%
- LZ-based compression (~120k gates): 0.3%
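The idea behind the first stage can be illustrated as follows: addresses that simply follow their predecessor carry no information and can be dropped, so only branch targets remain. This is a simplified sketch and not the tracer design from the cited work. It assumes fixed-width 4-byte instructions, and the function name is hypothetical; a real tracer also has to record enough information (for example the branch source) to restore the full instruction stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INSTR_SIZE 4u  /* assumption: fixed-width 32-bit instructions */

/* Copies to `out` only the addresses that cannot be inferred from their
 * predecessor: the first address and every non-sequential address.
 * `out` must provide room for n entries. Returns the number kept. */
size_t filter_branch_targets(const uint32_t *pc, size_t n, uint32_t *out)
{
    assert(n == 0 || (pc != NULL && out != NULL));
    size_t kept = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i == 0 || pc[i] != pc[i - 1] + INSTR_SIZE)
            out[kept++] = pc[i];
    }
    return kept;
}
```

The achievable reduction depends on the branch density of the traced code; the 20% on the slide is the figure reported for the cited tracer, not a property of this sketch.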
hidICE (hidden ICE)
- Emulate the SoC core region and access trace data from there
- Easy access to full trace data
Multi-Core SoC Observation – hidICE
[Figure: hidICE-based emulator. The device under test (SoC with CPU core(s), peripherals and hidICE IP) sends sync data to the emulation (CPU core(s) and hidICE IP). The emulation provides full trace data to the trace tool (trace recorder; processing for code coverage, benchmarking and profiling).]
Multi-Core SoC Observation – hidICE
[Figure: hidICE architecture. SoC: CPU1, CPU2, DMA, ROM and RAM on a (high-performance) bus system; bus bridge to a periphery bus system with periphery units, external bus interface and interrupt controller; clock; Sync TX IP. Emulator: CPU1, CPU2, DMA, ROM and RAM on a (high-performance) bus system; Sync RX IP; bus OCD; trace data interface. Development system: trace analysis of instructions, CPU register, data and time.]
Figure labels: signals to transmit, serialization, deserialization, synchronization
Multi-Core SoC Observation – hidICE
[Figure: hidICE architecture with integrity check. SoC: CPU1, CPU2, DMA, ROM and RAM on a (high-performance) bus system; bus bridge to a periphery bus system with periphery units, external bus interface and interrupt controller; clock; Hash IP; Sync TX IP. Emulator: CPU1, CPU2, DMA, ROM and RAM on a (high-performance) bus system; Hash IP; comparator; Sync RX IP; bus OCD; trace data interface. Development system: trace analysis of instructions, CPU register, data, time and system integrity.]
Figure labels: signals to transmit, serialization, deserialization, hash calculation, hash check, synchronization, system integrity control
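The hash check in the figure can be pictured like this: SoC and emulator each fold the same internal activity into a hash, and a comparator raises a system integrity error when the two values differ. The sketch below uses a 32-bit FNV-1a hash over bus words purely as a stand-in; the presentation does not say which hash function or which signals the Hash IP uses, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FNV_OFFSET 2166136261u  /* 32-bit FNV-1a offset basis */
#define FNV_PRIME  16777619u    /* 32-bit FNV-1a prime */

/* Folds one 32-bit word into a running FNV-1a hash, byte by byte. */
uint32_t hash_update(uint32_t h, uint32_t word)
{
    for (int i = 0; i < 4; ++i) {
        h ^= (word >> (8 * i)) & 0xFFu;
        h *= FNV_PRIME;
    }
    return h;
}

/* Hash over the first n words observed on one side. */
uint32_t hash_words(const uint32_t *w, size_t n)
{
    uint32_t h = FNV_OFFSET;
    for (size_t i = 0; i < n; ++i)
        h = hash_update(h, w[i]);
    return h;
}

/* Comparator: true if SoC and emulation produced the same hash, i.e. saw
 * the same word sequence up to hash collisions. */
bool integrity_ok(const uint32_t *soc, const uint32_t *emu, size_t n)
{
    assert(n == 0 || (soc != NULL && emu != NULL));
    return hash_words(soc, n) == hash_words(emu, n);
}
```

Only the hash value has to cross the chip boundary, which keeps the check cheap compared with transmitting the hashed signals themselves.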
Multi-Core SoC Observation - hidICE
[Figure: hidICE-based emulation on two ML507 boards. Board #1, FPGA #1 (SoC / target): MPSoC with 3 CPUs, peripherals and hidICE IP (TX). Board #2, FPGA #2 (emulation): 3 CPUs and hidICE IP (RX). The boards are linked by the synchronisation; the emulator delivers trace data for trace data computation, recording and analysis.]
MPSoC implementation (3 x SPARC V8 / LEON3)
Multi-Core SoC Observation - hidICE
hidICE:
[Figure: MPSoC (4 CPUs) emitting data read (IO) and events of CPU0 .. CPU3.]
MPSoC, 4 CPUs, 1 GHz, 1 x USB 2.0, 2 x Gbit Ethernet, some low-speed peripherals
Synchronization bandwidth: < 4 Gbit/s
- timestamps included
- bus master trace included
- peak bandwidth included

Embedded trace:
[Figure: MPSoC (4 CPUs). Each of CPU 0 .. CPU 3 emits time stamps, instructions, data read and data written.]
MPSoC, 4 CPUs, 1 GHz, cycle-accurate instruction + data trace
4 x 14 bit x 1 GHz => approx. 28 Gbit/s (+ timestamps) (+ bus master trace) (+ peak bandwidth)
Multi-Core SoC Observation - hidICE
[Figure: quad-core SoC shown twice. Common blocks: Core0 .. Core3 (each with ETM), AXI, DMA, display, 2 x Ethernet, USB, APB bridge and APB with peripherals, bus trace, system trace, hidICE Sync TX and hidICE Sync RX. Labels in the second view: extended TPIU, 8 x 32 bit trace port, 28 bit display, 28 bit sync, 28 bit sync input, 284 signal pins.]
Draft: hidICE for quad core SoC
hidICE summary
+ Cycle-accurate instruction and data trace from all CPUs
+ Cycle-accurate data trace from all bus masters
+ Long-time observation
+ Low latency
+ Low intrusiveness (port replacement)
- Implementation effort (e.g. clock domain synchronization, correct implementation of the emulation)
- “All or none” – no partial trace
- Not applicable for SoCs with high I/O bandwidth
hidICE - New Observation Approach
Multi-Core SoC Observation
Multi-Core observation challenges
- Make internal states visible
- Analyze trace data
State of the Art: Trace data recording and offline processing
Multi-Core SoC Observation – Trace data analysis
[Figure: SoC (CPU0 .. CPU3, peripherals, memories, trace port; SoC vendor's responsibility) feeds a trace tool with SDRAM and a host with hard disc, where trace data processing runs offline and produces the results (tool vendor's responsibility).]
Limitations:
- Trace bandwidth >> processing bandwidth
- High latency in detection of SoC internal states
- Limited support for multiple observation focuses
Multi-Core SoC Observation - Trace data analysis
Infineon MCDS
- Combine SoC and trace tool
Mayer, A.; Siebert, H.; McDonald-Maier, K.D., "Boosting Debugging Support for Complex Systems on Chip", Computer, vol. 40, no. 4, pp. 76-81, April 2007
Source: www.ipextreme.com
[Figure: SoC (CPU0 .. CPU3, peripherals, memories, trace port) connected to a trace tool with online trace data processing that produces the results.]
[Figure: real-time processing trace tool. The SoC trace port (CoreSight / Nexus via parallel, Aurora or PCIe lanes, up to 20 Gbit/s) feeds message processing (instrumentation trace, data trace, ownership trace, watchpoint trace, instruction trace), instruction reconstruction and event processing. Event processing uses configurable generic event processing structures, custom programmable FPGA fabric and a custom programmable processor; the configuration comes from an HLL propositions compiler.]
Online trace data processing
Multi-Core SoC Observation – Online trace processing
- Processing bandwidth >= trace bandwidth
- µs latency in detection of SoC internal states
- Multiple observation focuses
Online trace processing implementation example
P4080 (8 cores e500mc @1.3 GHz)
Multi-Core SoC Observation – Online trace processing
[Figure: online trace processing setup. QorIQ P4080 connected via Aurora (up to 8 ch) to a Xilinx Virtex 7 FPGA. FPGA blocks: data acquisition; message processing (ownership, watchpoints, instruction trace, data trace (W) (core and NPC), data trace (R) (NPC only), errors); timestamp generator; trace memory (ring buffer); instruction reconstruction; event processing (sequencer, counter); external interface; PCIe endpoint. Xilinx Zynq: debug server, JTAG, user-programmable dual-core Cortex-A9, user-programmable FPGA fabric (Verilog, VHDL, SystemC).]
Online trace processing
- High implementation effort (algorithms, FPGA resources) for
- Transport protocol decoding (ARM CoreSight TPIU)
- Detection of message boundaries (ARM CoreSight)
- Online Reconstruction of direct branches (Nexus / ARM)
- Runtime Verification infrastructure
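The principle behind reconstructing direct branches can be sketched as follows: the trace only reports whether each conditional branch was taken, and the analyzer replays the static program image to recover every executed instruction. The program model, the one-bit-per-branch trace and all names below are simplifying assumptions; real Nexus and CoreSight messages are encoded differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical static program model, derived from the binary image. */
typedef struct {
    bool   is_branch;  /* direct conditional branch? */
    size_t target;     /* index of the branch target if is_branch */
} instr;

/* Replays the program from `start`, consuming one taken / not-taken bit
 * per branch. Writes the executed instruction indices to `out` (capacity
 * `cap`) and returns their count. Stops at the program end, when `out`
 * is full, or at a branch for which no trace bit is available yet. */
size_t reconstruct(const instr *prog, size_t prog_len, size_t start,
                   const bool *taken, size_t n_bits,
                   size_t *out, size_t cap)
{
    size_t pc = start, bit = 0, n = 0;
    while (pc < prog_len && n < cap) {
        if (prog[pc].is_branch && bit == n_bits)
            break;                      /* wait for the next trace message */
        out[n++] = pc;
        if (prog[pc].is_branch && taken[bit++]) {
            assert(prog[pc].target < prog_len);
            pc = prog[pc].target;       /* branch taken: jump to target */
        } else {
            pc++;                       /* sequential flow or not taken */
        }
    }
    return n;
}
```

In an online setting this replay has to keep pace with the trace port, which is one reason the slide lists it under high implementation effort.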
Our wish to SoC vendors: More trace bandwidth!
Multi-Core SoC Observation – Online trace processing
Online trace processing advantages
+ Real-time capability / unlimited observation time
+ Low latency of state-specific changes of observation focus
+ Parallel observation of multiple hotspots
+ Low latency of state-specific triggering of the SoC (e.g. tests)
+ Immediate access to the runtime information by the developer
+ Instruction / basic block level continuous profiling
+ Basic block level continuous WCET measurement
+ Automatic detection of race conditions (e.g. “happened before” algorithm)
+ Extension of SoC internal debug / profiling structures
+ Combination of trace debugging and run/stop debugging
+ The “Swiss Army knife” in debugging: online runtime verification
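Continuous WCET measurement at basic block level can be pictured as a running maximum per block, updated for every execution reconstructed from the timestamped trace. The sketch below is an illustration under assumptions: the block numbering, the fixed table size and the function name are hypothetical, and the result is only the measured WCET, which may lie below the possible WCET.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BLOCKS 64  /* assumption: small, fixed number of basic blocks */

typedef struct {
    uint64_t max_cycles[MAX_BLOCKS];  /* longest observed duration per block */
    uint64_t count[MAX_BLOCKS];       /* number of observed executions */
} wcet_stats;

/* Records one execution of `block`, given its entry and exit timestamps
 * in clock cycles as reconstructed from the trace. */
void wcet_observe(wcet_stats *s, unsigned block,
                  uint64_t t_entry, uint64_t t_exit)
{
    assert(block < MAX_BLOCKS && t_exit >= t_entry);
    uint64_t duration = t_exit - t_entry;
    if (duration > s->max_cycles[block])
        s->max_cycles[block] = duration;
    s->count[block]++;
}
```

Because only a maximum and a counter are kept per block, the memory demand stays constant regardless of the observation time.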
Multi-Core SoC Observation – Online trace processing
Questions?
In cooperation with
Contact:
Alexander Weiss
Accemic GmbH & Co. KG
Franz-Huber-Str. 39
83088 Kiefersfelden
aweiss@accemic.com
+49 8033 6039790