Statically Calculating Secondary Thread Performance in ASTI Systems - PowerPoint PPT Presentation



SLIDE 1

Statically Calculating Secondary Thread Performance in ASTI Systems

Siddhartha Shivshankar, Sunil Vangara and Alex Guimarães Dean

alex_dean@ncsu.edu

Center for Embedded Systems Research, Department of Electrical and Computer Engineering, North Carolina State University, www.cesr.ncsu.edu/agdean

SLIDE 2

Overview

  • ASTI: Asynchronous Software Thread Integration
  • Register File Partitioning
  • Experiments
  • Conclusions
SLIDE 3

Basic Idea of ASTI

  • Goal: recover fine-grain idle time for use by other threads
  • Examine the program to find a function f with significant internal idle time
  • Idle time is imposed by instruction-level timing requirements (e.g. for input and output instructions)
  • If an idle-time piece n is coarse-grain (TIdle(f,n) >> 2*TContextSwitch), we can recover it efficiently with context switching
  • If it is fine-grain (TIdle(f,n) !>> 2*TContextSwitch), then apply ASTI (Asynchronous Software Thread Integration)
  • Details of ASTI in LCTES 2004, CASES 2004
SLIDE 4

ASTI Applied to Communication Protocols

[Diagram: the primary thread's executive calls ReceiveMessage and ReceiveBit via subroutine calls — prepare the message buffer, read each bit from the bus three times and vote, check for errors, save the bit, update the CRC, sample the bus for resynchronization — with idle time between operations. With plain context switching, idle time too short for a cocall is wasted; the integrated secondary thread uses coroutine calls, needs only the first and last of them, and recovers even short idle time.]

SLIDE 5

Protocol Controller Options

[Diagram: three protocol controller options, each attached to the communication network through a physical-layer transceiver and to analog & digital I/O: (1) a discrete protocol controller with a system MCU and I/O expander, (2) a system MCU with an on-board protocol controller, (3) a generic MCU running an ASTI software protocol controller, with an optional system MCU.]

SLIDE 6

But what about…

  • Caches?
  • Deep instruction pipelines?
  • Branch prediction?
  • Superscalar instruction execution?
  • Speculative execution?
  • The reorder buffer?
  • Page faults?
  • Forwarding paths?
  • Load queues?
  • Data prefetching?
  • Predicated execution?
  • Branch delay slots?
  • Instruction prefetching?
  • Store forwarding?
  • R-ops?
  • Dynamic optimization?
  • The phase of the moon?
  • Wind direction?
  • Et cetera, et cetera
SLIDE 7

Register File Partitioning

  • A single register file must support both primary and secondary threads
  • Three ways to use a register:
    – For the primary thread exclusively
    – For the secondary thread exclusively
    – Shared between the two, swapped on coroutine calls
  • The register file may not be homogeneous:
    – Pointer/address registers
    – Immediate-operand-capable registers
    – ... so we need to pick the best partition for each type. How?

SLIDE 8

Primary and Secondary Thread Performance

  • Impact of register file partitioning
    – More registers for the primary thread:
      • Less spilling and filling -> primary code takes fewer cycles
      • More idle time -> more cycles for the secondary thread
      • Fewer registers for the secondary thread -> more spilling and filling -> the secondary thread requires more cycles, and response time rises
    – More registers for the secondary thread: similar case
    – More registers swapped:
      • Both threads require fewer cycles to execute
      • The coroutine call takes longer
        -> More cycles are wasted switching between threads
        -> The coroutine call no longer fits into the shorter idle-time fragments, reducing the cycles available to the secondary thread
  • How do we find the best register file partitioning?
    – Too complex to compute everything analytically
    – Instead, compile and analyze iteratively to perform design-space exploration

SLIDE 9

Thrint

[Diagram: Thrint pipeline — foo.s and foo.id in, foo.int.s out, via data-flow analysis, control-flow analysis, static timing analysis, and integration analysis; interfaces with GProf, XVCG, and GnuPlot.]

  • Thread-integrating compiler back-end: Thrint
  • We have enhanced Thrint to:
    – Integrate threads using ASTI methods (was STI only)
    – Measure best- and worst-case performance for the secondary thread

SLIDE 10

Iterative Partition Analysis Toolchain

[Diagram: iterative toolchain. Secondary-thread sources (s_m.c, s_b.c) and primary-thread sources (r_m.c, r_b.c) are compiled with gcc. Thrint's ICTA stage produces TSegmentIdle; Thrint makes register file partitioning decisions; gcc recompiles the secondary thread to obtain TSec; Thrint then produces TSec-Seg-Part, the performance of the segmented, partitioned secondary thread, which is compared against the original secondary-thread performance as slowdown vs. a dedicated MCU.]

SLIDE 11

Experiments

  • Atmel AVR
    – 8-bit load/store architecture for microcontrollers
    – Register file (32 registers):
      • Pointer + immediate (6)
      • Immediate (10)
      • Other (16)
  • Protocol controllers in C
    – CAN: 62.5 kbps
    – MIL-STD-1553: 1 Mbps
  • Secondary threads in C
    – Network-RS232 bridge
    – PID controller
  • Compiled with AVR-GCC, -O3

[Diagram: bridge MCU running ASTI software — the primary thread (J1850) and secondary thread (interface) communicate through message queues; the MCU connects to the bus, digital inputs and outputs, and a UART linking it to the system MCU.]

SLIDE 12

Performance Evaluation

  • Measure the slowdown of the integrated secondary thread (worst-case execution path) with a partitioned register file, compared with its original full-register-file performance
    – We must evaluate and schedule for the worst case to ensure the system always meets its deadlines
    – How much performance do we give up by partitioning the register file?
  • Not all partitions are schedulable
    – Not enough time for the coroutine call
    – Not enough time for the primary thread to meet its I/O instruction deadlines

SLIDE 13

Results I: Average Performance

  • Average performance for all feasible partitioning approaches

[Chart: slowdown (0-60%) for 1553 Send, 1553 Receive, CAN Send, and CAN Receive, for the Bridge (host interface) and PID controller secondary threads.]

SLIDE 14

Results II: Best Performance

  • Find the best (least slowdown) of all feasible partitionings
  • The AVR register file is adequate to handle register pressure for both threads, or the idle time is adequate for coroutine calls
  • Best case: 0.5% slowdown

[Chart: slowdown (0-2%) for 1553 Send, 1553 Receive, CAN Send, and CAN Receive, for the Bridge (host interface) and PID controller secondary threads.]

SLIDE 15

Detailed Analysis Example

  • Immediate registers: primary = 1553 send, secondary = PID controller
    – The secondary is sensitive to the number of immediate registers
  • Other registers: primary = CAN send, secondary = RS232-CAN bridge
    – The cocall must be brief for schedulability
    – Best is 1.5% slowdown (partition 10,6 to 14,2) with no swapped registers

SLIDE 16

Conclusions and Future Work

  • Conclusions
    – Performance varies significantly for the AVR architecture
      • The average case is bad
      • The best case is close to a non-partitioned register file
  • Future Work
    – Derive and evaluate heuristics to search the partitioning design space efficiently
    – Replace the coroutine call with a dispatcher to support multiple secondary threads

SLIDE 17

Questions?

  • Have you applied this to SPEC?
    – No; SPEC is not representative of embedded software-implemented communication protocols
  • Don't caches break the timing predictability you need?
    – The processors we use run at under 50 MHz, so we don't face a memory wall
  • Why not use a multithreaded processor?
    – They're too expensive, too rare, and businesses prefer familiar processors
  • Why not just design an ASIC to do it?
    – Too expensive to get the first one
  • Why not program an FPGA to do it?
    – Too expensive to get the rest of them

SLIDE 18

Appendices

SLIDE 19

Why Network Communication Protocol Controllers?

  • Multiple threads must be able to make progress, even with a fully-loaded bus
  • Idle time is very fine-grain (under one bit time)
  • Each application domain customizes its protocols
    – Wireless sensor networks tweak the medium access control, etc., for minimal energy use
    – Automotive: optimize for guaranteed (hard real-time) delivery
  • Chicken-and-egg problem
    – A protocol controller chip won't appear until an adequate market is anticipated
    – Chip costs remain high until volumes amortize design costs
    – There is a delay until the protocol controller appears as a peripheral on cheap MCUs
  • MCUs are a good fit for many embedded protocols, if concurrency is cheap
    – 10 to 200 cycles of processing needed per bit
    – 1 kbps to 1 Mbps bus speed
    – Temporally predictable MCUs are cheap and flexible
      • 1 MHz for $0.25
      • 100 MHz for $5-$10 (but you pay in increased energy use and other issues)

SLIDE 20

Assumptions: Small Embedded Systems

  • Processors
    – Not practical to design a custom processor
    – Not practical to use a fast processor (e.g. raise clock speed by 10x or more)
    – Can handle some code explosion (e.g. up to 3x)
    – Using a generic microcontroller (e.g. 4-, 8- or 16-bit) without memory protection, virtual memory, or caches
  • Workload
    – At most a few threads need to make asynchronous progress; others can wait
    – One hard-real-time thread with tight deadlines
    – Other threads may have significantly longer deadlines
    – Interrupts are delayed or handled with polling servers (CASES 2003)
    – Subroutine calls are cloned or inlined (CASES 2004)

[Chart: MCU market by word size — 8-bit 57%, 4-bit 12%, 16-bit 12%, DSP 11%, 32-bit 8%.]

SLIDE 21

Control-Flow View of ASTI

[Diagram: break the secondary thread into segments, each lasting approximately as long as the available idle time; integrate the intervening primary code into each segment; insert coroutine calls at the start of each idle-time region and at the end of each segment.]

SLIDE 22

Big Picture

  • How do we efficiently allocate N concurrent (potentially real-time) tasks onto fewer than N processors?
    – Compilation and scheduling for concurrent/parallel/distributed systems
    – Real-time systems
    – Hardware/software cosynthesis
  • Bottlenecks
    – Scheduling each context switch
    – Performing each context switch
  • We focus on one processor, and that processor is generic (low-cost) with no special features for accelerating context-switch bottlenecks
  • Note: threads must be able to make independent (asynchronous) progress

SLIDE 23

Steps in STI: Source Code Preparation

  • Structure the program (C) to accumulate work to perform in integrated functions
  • Write the functions (C) to be integrated
  • Compile to assembly code, partitioning the register file for the functions to be integrated (gcc's -ffixed option)

SLIDE 24

Thread Representation

  • The CDG's hierarchical structure simplifies integration
    – Vertical = conditional nesting, horizontal = execution order
    – Summary information at each level
  • Our Thrint back-end compiler operates on the CDGs of host and guest threads
    – Annotates the host with execution-time predictions
    – Moves guest code into the host, enforcing control/data/time dependencies
      • Find a gap, or else descend into a subgraph
      • Code transformations handle conditionals and loops

[Diagram: control dependence graph with procedure, code, conditional, and loop nodes; the vertical axis is conditional nesting, the horizontal axis execution order.]

SLIDE 25

Thrint Overview

[Diagram: Thrint pipeline. STI path: parse assembly; form CFG/CDG; read integration directives; static timing analysis; node labelling; idle-time analysis; data-flow analysis; register virtualization; integration; register reallocation; static timing analysis; timing verification; code regeneration; temporal determinism analysis. The integration step pads timing jitter in the secondary thread, plans the integration, performs host loop transformations, clones and inserts guest nodes for each host/guest pair, adds fused-loop control test code where loops are fused, and deletes the original guests. ASTI path: pad timing jitter in the message-level and bit-level functions, plan integration in the secondary thread, pad jitter in predicate nodes and blocking I/O loops, integrate cocalls within the secondary thread, integrate intervening guest code at the appropriate locations, and delete the original guests.]

SLIDE 26

Steps in STI: Analysis and Integration Planning

  • Parse assembly code to form the CFG and then the CDG
  • Perform tree-based static timing analysis
  • Pad away timing variations from conditionals with nops or nop loops
  • Perform basic data-flow analysis to identify loop-control variables and, where possible, iteration counts
  • Compare the duration of primary functions with the maximum allowed latency for ISRs and other short-laxity tasks
    – Create polling servers to handle these as needed
  • Compare the duration of secondary functions with the amount of idle time in primary functions, considering the minimum period of the primary function
    – Break long secondary functions into segments which fit into the primary functions' idle time, minus polling-server time, minus two context-switch times
    – Also end a segment when reaching a loop with an unknown iteration count
  • Define target times for time-critical regions in primary code
SLIDE 27

Steps in STI: Integration

  • Note: conditionals have been padded away previously
  • Single primary events
    – Move primary code to execute at the proper times within secondary code
      • Replicate primary code into conditionals
      • Split and peel loops and insert primary code
      • Guard primary code within a loop to trigger on a given iteration
  • Looping primary events
    – Peel off primary-function loop iterations which don't overlap with secondary loops
      • Integrate these as single primary events
    – Fuse the loop iterations which do overlap
      • Fuse loop control tests
      • Unroll the loop to match idle time in the primary loop to work in the secondary loop
      • Create clean-up loops to perform the remaining iterations
  • Redo static timing analysis and verify correct timing
  • Recreate the assembly file
  • Compile, link, download and run!
SLIDE 28

Protocol Software Structure

[Diagram: protocol_executive() calls send_message() and receive_message(), which call send_bit() and receive_bit(); most idle time is located in these bit-level send, idle, and receive functions.]

SLIDE 29

What about Interrupts? What about Frequent Primaries and Long Secondaries?

  • Interrupts?
    – STI disables interrupts while integrated threads run
    – STIGLitz: interrupts disabled for one field of video (16.167 ms)
  • Frequent primaries and long secondaries?
    – The primary thread needs to run again before the integrated version would finish
  • Solutions
    – Use polling servers to service each non-deferrable thread (e.g. UART)
    – Break the secondary into segments and integrate the primary in multiple times

[Diagram: the worst-case execution time of the integrated thread is bounded by the secondary thread's WCET, its laxity (maximum latency allowed), and the minimum primary-thread period minus the maximum primary-thread work.]

SLIDE 30

Detail: Register File Partitioning vs. Performance

  • Problem: STI requires that integrated threads share the register file
  • Trade-off:
    – Code compiled to fit into fewer registers switches contexts faster
      • The dispatcher switches contexts roughly every 900 cycles
      • Two context switches for one register take 12 cycles
    – Code compiled to fit into fewer registers runs slower
      • More variables must remain in memory
  • Goal: squeeze pre-integrated threads into as few registers as practical
  • Method: determine the sensitivity of the host threads' execution time to the number of registers available
    – Divide the AVR registers into three classes:
      • Pointer registers (r26-r31)
      • Immediate-operand-capable registers (r16-r25)
      • Other registers (r0-r15)
    – Analyze the DrawSprite, DrawLine, and DrawCircle functions
    – Limit the registers available to the register allocator through gcc's -ffixed option
    – Measure execution time using an on-chip timer/counter

SLIDE 31

Results

  • Measurements
    – DrawLine and DrawCircle are not very sensitive
    – DrawSprite is very sensitive
    – A strange speed-up occurs when excluding one pointer register
  • Design decisions
    – DrawLine and DrawCircle
      • Exclude eight "other" registers and two pointer registers
      • Use 22 registers
      • Each context switch: 132 cycles
    – DrawSprite
      • Exclude only one "other" register and two pointer registers
      • Use 29 registers
      • Each context switch: 174 cycles

[Charts: normalized run time (0.9-1.3) vs. total registers excluded (2-12) for DrawCircle and DrawLine, and normalized run time (1-5) vs. registers removed (5-15) for DrawSprite, each plotted for the immediate, pointer, and other register classes.]

SLIDE 32

To Do

  • Remove intervening code from primary code - animate