1
Statically Calculating Secondary Thread Performance in ASTI Systems
Siddhartha Shivshankar, Sunil Vangara and Alex Guimares Dean
alex_dean@ncsu.edu
Center for Embedded
2
Overview
- ASTI: Asynchronous Software Thread Integration
- Register File Partitioning
- Experiments
- Conclusions
3
Basic Idea of ASTI
- Goal: recover fine-grain idle time for use by other threads
- Examine program to find a function f with significant internal idle time
- Idle time is imposed by instruction-level timing requirements (e.g. for input, output instructions)
- If an idle time piece n is coarse-grain (TIdle(f,n) >> 2*TContextSwitch), then we can recover it efficiently with context switching
- If it is fine-grain (TIdle(f,n) not >> 2*TContextSwitch), then apply ASTI (Asynchronous Software Thread Integration)
- Details of ASTI in LCTES 2004, CASES 2004
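The coarse-grain versus fine-grain rule above can be sketched as a small check. This is only an illustration: the `margin` parameter is a hypothetical stand-in for ">>" (the slides give no number), and the cycle counts in the example are made up.

```python
def recovery_method(t_idle, t_context_switch, margin=10):
    """Pick how to recover one idle-time piece, following the rule on the slide.

    margin is an assumed stand-in for ">>": the idle piece must be at least
    margin times the cost of two context switches to count as coarse-grain.
    """
    if t_idle >= margin * 2 * t_context_switch:
        return "context switch"
    return "ASTI"


# Made-up example: a 132-cycle context switch and three idle pieces (in cycles).
pieces = [5000, 400, 60]
methods = [recovery_method(t, 132) for t in pieces]
```

With these numbers only the 5000-cycle piece is large enough for plain context switching; the shorter pieces fall to ASTI.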
4
ASTI Applied to Communication Protocols
[Figure: call structure before and after integration. Primary thread: Executive, ReceiveMessage, ReceiveBit, linked by subroutine calls. ReceiveBit steps: prepare message buffer; read bit from bus 3 times and vote; idle time; check for errors, save bit, update CRC; sample bus for resynchronization; return. Secondary thread reached through coroutine calls. Callouts: wasted idle time, too short for cocall; integrated secondary thread needs only first and last coroutine calls; recover even short idle time.]
5
Protocol Controller Options
[Figure: three ways to connect analog & digital I/O to the communication network through a physical layer transceiver]
- Discrete protocol controller with MCU (system MCU, I/O expander)
- MCU with on-board protocol controller (system MCU, I/O expander)
- Generic MCU with ASTI SW protocol controller (optional system MCU)
6
But what about…
- Caches?
- Deep instruction pipelines?
- Branch prediction?
- Superscalar instruction execution?
- Speculative execution?
- The reorder buffer?
- Page faults?
- Forwarding paths?
- Load queues?
- Data prefetching?
- Predicated execution?
- Branch delay slots?
- Instruction prefetching?
- Store forwarding?
- R-ops
- Dynamic optimization
- The phase of the moon
- Wind direction
- Et cetera, et cetera
7
Register File Partitioning
- Single register file must support primary and secondary threads
- Three ways to use a register
– For primary thread exclusively
– For secondary thread exclusively
– Shared between the two, swapped on coroutine calls
- Register file may not be homogeneous
– Pointer/address registers
– Immediate-operand capable registers
– ... so need to pick best partition for each type. How?
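To get a feel for the size of the design space, the sketch below counts the ways to split each register class among the three uses. It assumes the class sizes reported for the AVR in the Experiments slide (6 pointer, 10 immediate, 16 other), assumes every register is assigned to exactly one use, and counts only how many registers go to each use, not which ones.

```python
def splits(n):
    """All (primary, secondary, swapped) counts that use exactly n registers."""
    return [(p, s, n - p - s) for p in range(n + 1) for s in range(n + 1 - p)]


# Class sizes assumed from the AVR register file: pointer, immediate, other.
class_sizes = {"pointer": 6, "immediate": 10, "other": 16}

per_class = {name: len(splits(n)) for name, n in class_sizes.items()}

# A full partition picks one split per class, so the counts multiply.
total = 1
for count in per_class.values():
    total *= count
```

Under these assumptions the space has 28 * 66 * 153 = 282,744 candidate partitions, which suggests why a search strategy matters.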
8
Primary and Secondary Thread Performance
- Impact of register file partitioning
– More registers for primary thread
- Less spilling and filling -> primary code takes fewer cycles
- More idle time -> more cycles for secondary thread
- Fewer registers for secondary thread -> more spilling and filling -> secondary thread requires more cycles, response time rises
– More registers for secondary thread
- Similar case
– More registers swapped
- Both threads require fewer cycles to execute
- Coroutine call takes longer
– -> More cycles wasted switching between threads
– -> Now coroutine call doesn’t fit into shorter idle time fragments, reducing cycles available for secondary thread
- How do we find the best register file partitioning?
– Too complex to compute everything analytically
– Instead compile and analyze iteratively to perform design space exploration
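The iterative exploration can be pictured as the loop below. It is a sketch under loud assumptions: `toy_evaluate` is a made-up cost model standing in for the real compile-and-analyze step, and all of its constants (base cycles, spill penalty, cocall cost, idle fragment length) are hypothetical.

```python
def explore(candidates, evaluate):
    """Return (partition, cycles) with the fewest cycles, skipping infeasible ones.

    evaluate(partition) returns a cycle count, or None if not schedulable.
    """
    best = None
    for part in candidates:
        cycles = evaluate(part)
        if cycles is None:
            continue
        if best is None or cycles < best[1]:
            best = (part, cycles)
    return best


def toy_evaluate(part, idle_fragment=48, cocalls=4):
    """Hypothetical stand-in for compiling and timing one partition."""
    primary, secondary, swapped = part
    cocall = 6 * swapped                  # assumed: swap cost grows per register
    if cocall > idle_fragment:            # cocall no longer fits in idle time
        return None
    if primary + swapped < 4:             # assumed: primary misses its deadlines
        return None
    spill = 40 * max(0, 8 - (secondary + swapped))  # assumed spill penalty
    return 1000 + spill + cocalls * cocall


# One register class of 10 registers: every (primary, secondary, swapped) split.
candidates = [(p, s, 10 - p - s) for p in range(11) for s in range(11 - p)]
best = explore(candidates, toy_evaluate)
```

In the toy model the winner balances the same forces as the slide: enough registers to avoid spilling, but few enough swapped registers that the cocall stays short.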
9
Thrint
[Figure: Thrint tool flow with files foo.s, foo.int.s and foo.id; stages: control-flow analysis, data-flow analysis, static timing analysis, integration analysis, integration; supporting tools: GProf, XVCG, GnuPlot]
- Thread Integrating Compiler Back-End: Thrint
- Have enhanced Thrint to
– Integrate threads using ASTI methods (was just STI)
– Measure best, worst case performance for secondary thread
10
Iterative Partition Analysis Toolchain
[Figure: toolchain flow. Primary thread sources (s_m.c, r_m.c, s_b.c, r_b.c) pass through gcc and Thrint: ICTA to give TSegmentIdle. Thrint register file partitioning decisions feed gcc. Secondary thread passes through gcc and Thrint to give TSec (original performance of secondary thread) and TSec-Seg-Part (performance of segmented, partitioned secondary thread). Final stage: performance comparison, slowdown vs. dedicated MCU.]
11
Experiments
- Atmel AVR
– 8-bit load/store architecture for microcontrollers
– Register File (32)
- Pointer + immediate (6)
- Immediate (10)
- Other (16)
- Protocol controllers in C
– CAN: 62.5 kbps
– MIL-STD-1553: 1 Mbps
- Secondary threads in C
– Network-RS232 bridge
– PID controller
- Compiled with AVR-GCC, -O3
[Figure: bridge MCU running ASTI software, with primary thread (J1850) and secondary thread (interface) connected by message queues; external connections: bus, digital in, digital out, and UART link to the system MCU]
12
Performance Evaluation
- Measure slowdown of integrated secondary thread (worst-case execution path) with partitioned register file, compared with original full-register file performance
– Need to evaluate and schedule for worst-case to ensure system always meets its deadlines
– How much performance do we give up by partitioning the register file?
- Not all partitions are schedulable
– Not enough time for coroutine call
– Not enough time for primary thread to meet its I/O instruction deadlines
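The slowdown metric and the schedulability filter can be sketched as follows. The function names and the cycle counts are hypothetical; the slides do not give the exact formula, so slowdown is assumed here to be the relative increase in worst-case cycles over the full-register-file build.

```python
def slowdown(t_partitioned, t_full):
    """Fractional slowdown of the partitioned build versus the full register file."""
    return (t_partitioned - t_full) / t_full


def summarize(results, t_full):
    """results: list of (wcet_cycles, schedulable) pairs, one per partition.

    Returns (average_slowdown, best_slowdown) over schedulable partitions only,
    or None when no partition is schedulable.
    """
    feasible = [slowdown(t, t_full) for t, ok in results if ok]
    if not feasible:
        return None
    return sum(feasible) / len(feasible), min(feasible)


# Made-up worst-case cycle counts; the third partition is not schedulable.
average, best = summarize([(1200, True), (1010, True), (990, False)], t_full=1000)
```

Unschedulable partitions are dropped before averaging, so an infeasible partition that happens to look fast never counts as the best result.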
13
Results I: Average Performance
- Average performance for all feasible partitioning approaches
[Chart: slowdown (axis 0% to 60%) for primary threads 1553 Send, 1553 Receive, CAN Send, CAN Receive; series: Bridge (Host Interface), PID Controller]
14
Results II: Best Performance
- Find best (least slowdown) of all feasible partitionings
- AVR register file is adequate to handle register pressure for both threads, or idle time is adequate for coroutine calls
[Chart: slowdown (axis -0.5% to 2.0%) for primary threads 1553 Send, 1553 Receive, CAN Send, CAN Receive; series: Bridge (Host Interface), PID Controller]
15
Detailed Analysis Example
- Primary: 1553 send
- Secondary: PID controller
- Immediate registers
- Secondary is sensitive to # of immediate registers
- Primary: CAN send
- Secondary: RS232-CAN Bridge
- Other registers
- Cocall must be brief for schedulability
- Best is 1.5% slowdown: 10,6 to 14,2 with no swapped registers
16
Conclusions and Future Work
- Conclusions
– Performance varies significantly for AVR architecture
- Average case bad
- Best case close to non-partitioned register file
- Future Work
– Derive and evaluate heuristics to search efficiently through partitioning design space
– Replace coroutine call with dispatcher to support multiple secondary threads
17
Questions?
- Have you applied this to SPEC?
– No, that’s not representative of embedded software-implemented communication protocols
- Don’t caches break the timing predictability you need?
– The processors we use run at under 50 MHz, so we don’t have a memory wall
- Why not use a multithreaded processor?
– They’re too expensive, too rare, and businesses prefer familiar processors
- Why not just design an ASIC to do it?
– Too expensive to get the first one
- Why not program an FPGA to do it?
– Too expensive to get the rest of them
18
Appendices
19
Why Network Communication Protocol Controllers?
- Multiple threads must be able to make progress, even with fully-loaded bus
- Idle time is very fine grain (under one bit time)
- Each application domain customizes its protocols
– Wireless sensor networks tweak the medium access control, etc. for minimal energy use
– Automotive: optimize for guaranteed (hard real-time) delivery
- Chicken and egg problem
– Protocol controller chip won’t appear until adequate market anticipated
– Chip costs remain high until volumes amortize design costs
– Delay until protocol controller appears as peripheral on cheap MCUs
- MCUs are good fit for many embedded protocols, if concurrency is cheap
– 10 to 200 cycles of processing needed per bit
– 1 kbps to 1 Mbps bus speed
– Temporally predictable MCUs are cheap and flexible
- 1 MHz for $0.25
- 100 MHz for $5-$10 (but you pay in increased energy use and other issues)
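The cycles-per-bit and bus-speed figures above translate into a clock requirement by simple multiplication. A minimal sketch follows; the 16 MHz clock in the example is an assumption, not a figure from the slides.

```python
def min_clock_hz(cycles_needed_per_bit, bit_rate_bps):
    """Lowest clock rate that supplies the needed cycles within each bit time."""
    return cycles_needed_per_bit * bit_rate_bps


def cycles_per_bit(clock_hz, bit_rate_bps):
    """Cycles available during one bit time at a given clock rate."""
    return clock_hz / bit_rate_bps


# Corners of the ranges on the slide: 10 to 200 cycles per bit, 1 kbps to 1 Mbps.
easiest = min_clock_hz(10, 1_000)        # 10 kHz
hardest = min_clock_hz(200, 1_000_000)   # 200 MHz

# Assumed 16 MHz MCU on a 62.5 kbps bus.
budget = cycles_per_bit(16_000_000, 62_500)
```

The spread between the easiest and hardest corner shows why only part of the range fits a cheap, low-clock MCU.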
20
Assumptions - Small Embedded Systems
- Processors
– Not practical to design a custom processor
– Not practical to use fast processor (e.g. raise clock speed by 10x or more)
– Can handle some code explosion (e.g. up to 3x)
– Using a generic microcontroller (e.g. 4, 8 or 16 bit) without memory protection, virtual memory, caches.
- Workload
– At most a few threads need to make asynchronous progress, others can wait
– One hard-real-time thread with tight deadlines
– Other threads may have deadlines which are significantly longer
– Interrupts are delayed or handled with polling servers (CASES 2003)
– Subroutine calls are cloned or inlined (CASES 2004)
[Pie chart: 8 Bit 57%, 4 Bit 12%, 16 Bit 12%, DSP 11%, 32 Bit 8%]
21
Control-Flow View of ASTI
[Figure: control flow of a primary function containing idle time, alongside a secondary function]
- Break secondary thread into segments lasting approximately the total idle time
- Integrate intervening primary code into each segment
- Insert coroutine calls at start of idle time and end of each segment
22
Big Picture
- How do we efficiently allocate N concurrent (potentially real-time) tasks onto fewer than N processors?
– Compilation and scheduling for concurrent/parallel/distributed systems
– Real-time systems
– Hardware/software cosynthesis
- Bottlenecks
– Scheduling each context switch
– Performing each context switch
- We focus on 1 processor, and that processor is generic (low-cost) with no special features for accelerating context switch bottlenecks
- Note: threads must be able to make independent (asynchronous) progress
23
Steps in STI: Source Code Preparation
- Structure program (C) to accumulate work to perform in integrated functions
- Write functions (C) to be integrated
- Compile to assembly code, partitioning register file for functions to be integrated (-ffixed)
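As an illustration of the last step, the sketch below builds a compiler command line that withholds a set of registers from the register allocator using GCC's `-ffixed-reg` option. The file name `secondary.c` and the chosen registers are hypothetical, and the exact command used in the original work is not given in the slides.

```python
def ffixed_flags(reserved):
    """One -ffixed-rN flag per AVR register number withheld from the allocator."""
    return [f"-ffixed-r{n}" for n in sorted(reserved)]


def compile_command(source, reserved):
    """Command line that compiles source to assembly (-S) with reserved registers."""
    return ["avr-gcc", "-O3", "-S", *ffixed_flags(reserved), source]


# Hypothetical example: keep r2 and r3 out of the secondary thread's code.
command = compile_command("secondary.c", {3, 2})
```

The resulting list can be passed to `subprocess.run` on a machine with avr-gcc installed; a different register set per thread yields one partition of the register file.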
24
Thread Representation
- CDG’s hierarchical structure simplifies integration
– Vertical = conditional nesting, Horizontal = execution order
– Summary information at each level
- Our Thrint back-end compiler operates on CDGs of host, guest threads
– Annotates host with execution time predictions
– Moves guest code into host, enforcing control/data/time dependencies
- Find gap, or else descend into subgraph
- Have code transformations to handle conditionals & loops
[Figure: Control Dependence Graph; node types: procedure, code, conditional, loop; vertical axis: conditional nesting; horizontal axis: execution order]
25
Thrint Overview
[Figure: Thrint passes: Parse Asm; Form CFG/CDG; Read Integration Directives; Static Timing Analysis; Node Labelling; Idle Time Analysis; Data-Flow Analysis; Register Virtualization; Integration; Register Reallocation; Static Timing Analysis; Timing Verification; Code Regeneration; Temporal Determinism Analysis]
STI
Pad timing jitter in secondary thread; plan integration; pad excess timing jitter; for each guest: pad excess timing jitter; do host loop transformations; clone and insert guest node(s); for each host, for each guest: if fused loop, add fused loop control test code; for each host: delete original guests
ASTI
Pad timing jitter in message level function; delete original guests; pad timing jitter in bit level function; plan integration in secondary thread; pad jitter in predicate nodes and blocking I/O loops; integrate cocalls within the secondary thread; integrate intervening guest code at appropriate locations
26
Steps in STI: Analysis and Integration Planning
- Parse assembly code to form CFG and then CDG
- Perform tree-based static timing analysis
- Pad away timing variations from conditionals with nops or nop loops (example)
- Perform basic data-flow analysis to identify loop-control variables and possibly iteration counts
- Compare duration of primary functions with maximum allowed latency for ISRs and other short-laxity tasks
– Create polling servers to handle these as needed
- Compare duration of secondary functions with amount of idle time in primary functions, considering minimum period for primary function
– Break long secondary functions into segments which fit into primary functions’ idle time minus polling servers minus two context switch times
– Also end segments when reaching a loop with an unknown iteration count
- Define target times for regions in primary code which are time-critical
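The segment-sizing rule above (idle time minus polling servers minus two context switch times) can be sketched as a budget plus a packing step. The greedy packing shown here is an assumption for illustration, not Thrint's documented algorithm, and it ignores the rule about ending segments at loops with unknown iteration counts. All cycle counts are made up.

```python
def segment_budget(idle_time, polling_server_time, context_switch_time):
    """Cycles left for secondary code in one segment."""
    return idle_time - polling_server_time - 2 * context_switch_time


def split_into_segments(block_cycles, budget):
    """Greedily pack consecutive secondary code blocks into segments within budget."""
    segments, current, used = [], [], 0
    for cycles in block_cycles:
        if cycles > budget:
            raise ValueError("block does not fit in any segment")
        if used + cycles > budget:
            # Current segment is full: close it and start a new one.
            segments.append(current)
            current, used = [], 0
        current.append(cycles)
        used += cycles
    if current:
        segments.append(current)
    return segments


# Made-up numbers: 500 idle cycles, 50 for polling servers, 100 per context switch.
budget = segment_budget(500, 50, 100)
segments = split_into_segments([100, 120, 90, 200, 40], budget)
```

A block larger than the budget raises an error, which corresponds to a secondary function that cannot be integrated without further restructuring.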
27
Steps in STI: Integration
- Note: conditionals have been padded away previously
- Single primary events
– Move primary code to execute at proper times within secondary code
- Replicate primary code into conditionals
- Split and peel loops and insert primary code
- Guard primary code within loop to trigger on given iteration
- Looping primary events
– Peel off primary function loop iterations which don’t overlap with secondary loops
- Integrate as single primary events
– Fuse loop iterations which do overlap
- Fuse loop control tests
– Unroll loop to match idle time in primary loop to work in secondary loop
- Create clean-up loops to perform remaining iterations
- Redo static timing analysis and verify correct timing
- Recreate assembly file
- Compile, link, download and run!
28
Protocol Software Structure
[Figure: call structure with protocol_executive(), send_message(), receive_message(), send_bit() and receive_bit(); states: send, idle, receive. Callout: most idle time is located in these functions.]
29
What about Interrupts? What about Frequent Primaries and Long Secondaries?
- Interrupts?
– STI disables interrupts while integrated threads run
– STIGLitz: ints. disabled for one field of video (16.167 ms)
- Frequent primaries and long secondaries?
– Primary thread needs to run again before integrated version would finish
- Solutions
– Use polling servers to service each non-deferrable thread (e.g. UART)
– Break secondary into segments and integrate primary in multiple times
[Figure: timeline labels: worst case execution time of integrated thread; WCET for secondary thread; laxity for secondary thread (max. latency allowed); minimum primary thread period minus maximum primary thread work]
30
Detail: Register File Partitioning vs. Performance
- Problem: STI requires that integrated threads share the register file
- Trade-off:
- Code compiled to fit into fewer registers switches contexts faster
– Dispatcher switches contexts roughly every 900 cycles
– Two context switches for one register take 12 cycles
- Code compiled to fit into fewer registers runs slower
– More variables must remain in memory
- Goal: Squeeze pre-integrated threads into as few registers as practical
- Method: Determine sensitivity of the host threads’ execution time to the number of registers available
– Divide AVR registers into three classes:
- Pointer registers (r26-r31)
- Immediate-operand capable registers (r16-r25)
- Other registers (r0-r15)
– Analyze DrawSprite, DrawLine, DrawCircle functions
– Limit registers available to the register allocator through gcc’s -ffixed option.
– Measure execution time using an on-chip timer/counter
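The context-switch side of the trade-off follows from the figures above: 12 cycles for two switches of one register gives 6 cycles per register per switch. The sketch below applies that rate; treating overhead as one switch per 900-cycle dispatch interval is an assumption made for illustration.

```python
# From the slide: two context switches for one register take 12 cycles.
CYCLES_PER_REGISTER = 12 // 2


def context_switch_cycles(registers_in_use):
    """Cost of one context switch when this many registers must be swapped."""
    return CYCLES_PER_REGISTER * registers_in_use


def switch_overhead(registers_in_use, dispatch_interval=900):
    """Fraction of each dispatch interval spent on one switch (assumed model)."""
    return context_switch_cycles(registers_in_use) / dispatch_interval


cost_22 = context_switch_cycles(22)
cost_29 = context_switch_cycles(29)
overhead_22 = switch_overhead(22)
```

At 22 and 29 registers this rate gives 132 and 174 cycles per switch, which agrees with the per-switch figures reported on the Results slide that follows.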
31
Results
- Measurements
– DrawLine and DrawCircle not very sensitive
– DrawSprite very sensitive
– Strange speed-up when excluding one pointer register
- Design decisions
– DrawLine and DrawCircle
- Exclude eight “other” registers and two pointer registers
- Use 22 registers
- Each context switch: 132 cycles
– DrawSprite
- Exclude only one “other” register and two pointer registers
- Use 29 registers
- Each context switch: 174 cycles
[Charts: Draw_Circle, Draw_Line and Draw_Sprite sensitivity to register exclusion; x-axis: total registers excluded; y-axis: normalized run time; series: immediate, pointer and other registers]
32
To Do
- Remove intervening code from primary code - animate