Statically Calculating Secondary Thread Statically Calculating - PowerPoint PPT Presentation

Statically Calculating Secondary Thread Statically Calculating Secondary Thread Performance in ASTI Systems Performance in ASTI Systems Siddhartha Shivshankar, Sunil Vangara and Alex Guimarães Dean alex_dean@ncsu.edu Center for Embedded Systems Research Department of Electrical and Computer Engineering North Carolina State University www.cesr.ncsu.edu/ agdean 1

Overview Overview • ASTI: Asynchronous Software Thread Integration • Register File Partitioning • Experiments • Conclusions 2

Basic Idea of ASTI Basic Idea of ASTI • Goal: recover fine-grain idle-time for use by other threads • Examine program to find a function f with significant internal idle time • Idle time is imposed by instruction-level timing requirements (e.g. for input, output instructions) • If an idle time piece n is coarse-grain ( T Idle (f,n) >> 2*T ContextSwitch ), then we can recover it efficiently with context switching • If it is fine grain ( T Idle (f,n) !>> 2*T ContextSwitch ), then apply ASTI (Asynchronous Software Thread Integration) • Details of ASTI in LCTES 2004, CASES 2004 3

ASTI Applied to Communication Protocols ASTI Applied to Communication Protocols Check for errors, Subroutine Idle time save bit, calls Return update CRC Executive ReceiveMessage ReceiveBit Prepare Sample bus for Read bit from bus message buffer resynchronization 3 times and vote Need only first and last coroutine calls Coroutine calls Executive Primary ReceiveMessage Thread ReceiveBit Secondary Integrated Thread Secondary Wasted idle time, too Recover even Thread short for cocall short idle time 4

Protocol Controller Options Protocol Controller Options Analog & Analog & Analog & Digital I/O Digital I/O Digital I/O Analog & Optional System System Digital I/O System MCU MCU MCU Generic MCU Discrete On-Board with ASTI S/W I/O Expander Protocol Protocol Protocol Controller Controller Controller Physical Layer Physical Layer Physical Layer Physical Layer Transceiver Transceiver Transceiver Transceiver Communication Network I/O Expander Discrete MCU with Generic MCU protocol on-board with ASTI SW controller protocol protocol with MCU controller controller 5

But what about… … But what about • Caches? • Deep instruction pipelines? • Branch prediction? • Superscalar instruction execution? • Speculative execution? • The reorder buffer? • Page faults? • Forwarding paths? • Load queues? • Data prefetching? • Predicated execution? • Branch delay slots? • Instruction prefetching? • Store forwarding? • R-ops • Dynamic optimization • The phase of the moon • Wind direction • Et cetera, et cetera 6

Register File Partitioning Register File Partitioning • Single register file must support primary and secondary threads • Three ways to use a register – For primary thread exclusively – For secondary thread exclusively – Shared between the two, swapped on coroutine calls • Register file may not be homogeneous – Pointer/ address registers – Immediate-operand capable registers – ... so need to pick best partition for each type. How? 7

Primary and Secondary Thread Performance Primary and Secondary Thread Performance • Impact of register file partitioning – More registers for primary thread • Less spilling and filling -> primary code takes fewer cycles • More idle time -> more cycles for secondary thread • Fewer registers for secondary thread -> more spilling and filling - > secondary thread requires more cycles, response time rises – More registers for secondary thread • Similar case – More registers swapped • Both threads require fewer cycles to execute • Coroutine call takes longer – -> More cycles wasted switching between threads – -> Now coroutine call fits doesn’t fit into shorter idle time fragments, reducing cycles available for secondary thread • How do we find the best register file partitioning? – Too complex to compute everything analytically – Instead compile and analyze iteratively to perform design space exploration 8

Thrint Thrint foo.s Data-flow Control-flow Static Timing Analysis Analysis Analysis GProf Integration Integration foo.int.s Analysis foo.id XVCG GnuPlot • Thread Integrating Compiler Back-End: Thrint • Have enhanced Thrint to – Integrate threads using ASTI methods (was just STI) – Measure best, worst case performance for secondary thread 9

Iterative Partition Analysis Iterative Partition Analysis Toolchain Toolchain Register File Partitioning Decisions s_m.c r_m.c Primary Thrint: T SegmentIdle gcc Thread s_b.c ICTA r_b.c Original Performance of Secondary Thread Secondary T Sec Performance Comparison: gcc Thrint Thread Slowdown vs. Dedicated MCU Thrint gcc T Sec-Seg-Part Performance of Segmented, Partitioned Secondary Thread 10

Experiments Experiments • Atmel AVR – 8-bit load/ store architecture for microcontrollers – Register File (32) • Pointer + immediate (6) • Immediate (10) • Other (16) • Protocol controllers in C – CAN: 62.5 kbps – MIL-STD-1553: 1 Mbps Bridge MCU ASTI Software • Secondary threads in C Primary Secondary Thread Thread System MCU (J1850) (Interface) – Network-RS232 bridge Dig. UART UART Out Message – PID controller Queues Dig. • Compiled with AVR-GCC, -O3 In Bus 11

Performance Evaluation Performance Evaluation • Measure slowdown of integrated secondary thread (worst-case execution path) with partitioned register file, compared with original full-register file performance – Need to evaluate and schedule for worst-case to ensure system always meets its deadlines – How much performance do we give up by partitioning the register file? • Not all partitions are schedulable – Not enough time for coroutine call – Not enough time for primary thread to meet its I/ O instruction deadlines 12

Results I: Average Performance Results I: Average Performance 60% Bridge (Host Interface) PID Controller 40% Slowdown 20% 0% d d e n e n e v v e i i S e S e c N c 3 e e 5 A R R 5 C N 1 3 A 5 5 C 1 • Average performance for all feasible partitioning approaches 13

Results II: Best Performance Results II: Best Performance 2.0% Bridge (Host Interface) PID Controller 1.5% Slowdown 1.0% 0.5% 0.0% d d e n e n v e v e i S e i S e c -0.5% N c 3 e e A 5 R R 5 C N 1 3 A 5 5 C 1 • Find best (least slow-down) of all feasible partitionings • AVR register file is adequate to handle register pressure for both threads, or idle time is adequate for coroutine calls 14

Detailed Analysis Example Detailed Analysis Example • Primary: 1553 send • Primary: CAN send • Secondary: PID controller • Secondary: RS232-CAN Bridge • Immediate registers • Other registers • Secondary is sensitive to # of • Cocall must be brief for immediate registers schedulability • Best is 1.5% slowdown: 10,6 to 14,2 with no swapped registers 15

Conclusions and Future Work Conclusions and Future Work • Conclusions – Performance varies significantly for AVR architecture • Average case bad • Best case close to non-partitioned register file • Future Work – Derive and evaluate heuristics to search efficiently through partitioning design space – Replace coroutine call with dispatcher to support multiple secondary threads 16

Questions? Questions? • Have you applied this to SPEC? – No, that’s not representative of embedded software-implemented communication protocols • Don’t caches break the timing predictability you need? – The processors we use run at under 50 MHz, so we don’t have a memory wall • Why not use a multithreaded processor? – They’re too expensive, too rare, and businesses prefer familiar processors • Why not just design an ASIC to do it? – Too expensive to get the first one • Why not program an FPGA to do it? – Too expensive to get the rest of them 17

Appendices Appendices 18

Why Network Communication Protocol Controllers? Why Network Communication Protocol Controllers? • Multiple threads must be able to make progress, even with fully- loaded bus • Idle time is very fine grain (under one bit time) • Each application domain customizes its protocols – Wireless sensor networks tweak the medium access control, etc. for minimal energy use – Automotive: optimize for guaranteed (hard real-time) delivery • Chicken and egg problem – Protocol controller chip won’t appear until adequate market anticipated – Chip costs remain high until volumes amortize design costs – Delay until protocol controller appears as peripheral on cheap MCUs • MCUs are good fit for many embedded protocols, if concurrency is cheap – 10 to 200 cycles of processing needed per bit – 1 kbps – 1 Mbps bus speed – Temporally predictable MCUs are cheap and flexible • 1 MHz for $0.25 • 100 MHz for $5-$10 (but you pay in increased energy use and other issues) 19

Statically Calculating Secondary Thread Statically Calculating - PowerPoint PPT Presentation

Statically Calculating Secondary Thread Statically Calculating Secondary Thread Performance in ASTI Systems Performance in ASTI Systems Siddhartha Shivshankar, Sunil Vangara and Alex Guimares Dean alex_dean@ncsu.edu Center for Embedded

Secondary Framing Secondary Framing Secondary Framing Secondary Framing 1 1 Secondary Framing

13 IN THIS CHAPTER Benefits of Thread Pooling 308 Considerations and Costs of Thread

Introduction to Statically Introduction to Statically Indeterminate Indeterminate Analysis

To thread or not to thread? Why PETSc favors MPI-only Plenary Discussion PETSc User Meeting 2016

Calculating distributions Chung-chieh Shan Indiana University 2018-09-21 Calculating

Calculating Derivatives There are two types of formulas for calculating derivatives, which we may

Calculating Derivatives There are two types of formulas for calculating derivatives, which we may

Approximate Analysis of pproximate Analysis of Statically Indeterminate Statically Indeterminate

Introduction to Introduction to Statically Statically c ompatibility relationships ,

Exercise and Secondary Exercise and Secondary Exercise and Secondary Exercise and Secondary

Design of Thread-Safe Classes 1 Topic Outline Thread-Safe Classes Principles Confinement

Synthesizing Commutativity Conditions Kshitij Bansal Eric Koskinen Omer Tripp New York

Roadmap for Section 4.3. Windows Process and Thread Internals Thread Block, Process Block Flow

Directive-Based Programming with OpenMP Shared Memory Programming Explicit thread creation

CPL 2016, week 3 Thread management: execution and shutdown Oleg Batrashev Institute of Computer

CPL 2016, week 5 Inter-thread collaboration Oleg Batrashev Institute of Computer Science, Tartu,

TDDC17 Computer systems Robotic architectures Robotics/Perception II Navigation:

Linear Slides(Lead screws) MOONS Linear Slides are designed to meet the needs of customers'

Source level debugging Handel-C Debug the FPGA, Herman Roebbers 07-Sep-2008 Introduction

System-Level Design Tools and RTOS for Multiprocessor SoCs Hiroyuki Tomiyama Shinya Honda

How reversing the COMBUS protocol resulted in breaking security Hacking COMBUS in a of a

Weather Station PHYSICS PROJECT WITH A RESEACH BASIS 5c Frida Bfverfeldt Supervisor: Volker

Performance Bounds in a Switched Aircraft Cabin Emanuel Heidinger Supervision TUM: Prof. Carle

A new cluster mass proxy and galaxy evolution studies in clusters from the Dark Energy Survey