SLIDE 1
SLIDE 2
SLIDE 3
Intel 48-core SCC processor Tilera 100-core processor
SLIDE 4
- Introduction
- Parallel program structure and behavior
– Case study of fluidanimate
– Thread criticality problem
– Communication impact on thread criticality
- Thread-criticality support in on-chip network
– Bypass flow control
– Priority-based arbitration
- Methodology & results
- Conclusion
SLIDE 5
- fluidanimate in PARSEC
- Particle-based fluid simulation that solves the Navier-Stokes equations
- Particles are spatially sorted in a uniform grid, and each thread covers subgrids in the entire simulation domain.
- Each thread executes the AdvanceFrameMT() function.
- 8 sub-functions with 8 barriers
- Provided input sets process 5 frames.
void AdvanceFrameMT(int i) {
    ClearParticlesMT(i);
    pthread_barrier_wait(&barrier);
    RebuildGridMT(i);
    pthread_barrier_wait(&barrier);
    InitDensitiesAndForcesMT(i);
    pthread_barrier_wait(&barrier);
    ComputeDensitiesMT(i);
    pthread_barrier_wait(&barrier);
    ComputeDensities2MT(i);
    pthread_barrier_wait(&barrier);
    ComputeForcesMT(i);
    pthread_barrier_wait(&barrier);
    ProcessCollisionsMT(i);
    pthread_barrier_wait(&barrier);
    AdvanceParticlesMT(i);
    pthread_barrier_wait(&barrier);
}
SLIDE 6
[Figure: execution timeline of N threads (one thread per core) running AdvanceFrameMT, separated by Barrier 0, Barrier 1, Barrier 2, ..., Barrier 7]
SLIDE 7
[Figure: per-thread execution timeline (10^8 cycles) of ComputeDensitiesMT and ComputeForcesMT between Barrier 3 and Barrier 5]
If we accelerate half of the threads, the execution time of ComputeForcesMT can be reduced by 29%.
SLIDE 8
- Variation of executed instructions
– Different execution paths in the same control flow graph ⇒ Different computation time
- Variation of memory accesses
– Different cache behavior on L2 cache
– Thread criticality predictor based on per-core L2 hits and misses (Bhattacharjee et al., ISCA '09)
– Larger total L1 miss penalties ⇒ higher thread criticality
⇒ Different memory stall time
SLIDE 9
- A large portion of the cache miss penalty is communication latency incurred by the on-chip network.
– L2 cache is distributed and interconnected with multiple banks.
- L2 cache access latency in 8x8 mesh network
– 3-cycle hop latency, 6-cycle bank access latency, 12 hops for round trip (uniform random)
– 36 cycles (86%) of the total 42-cycle latency are communication latency.
- Our work aims at reducing the communication latency of high-criticality threads to accelerate their execution.
SLIDE 10
- Low latency support
– Express virtual channel (ISCA '07)
- Router pipeline skipping by flow control
– Single-cycle router (ISCA '04)
- All dependent operations through speculation are handled in a single cycle.
- Quality of service support
– Globally synchronized frames (ISCA '09)
- Per-flow bandwidth guarantee within a time window
– Application-aware prioritization (MICRO '09)
- High system throughput across many single-threaded applications by exploiting different stall cycles per packet in each application
SLIDE 11
- Bypass flow control
– Reduce per-hop latency for critical threads.
– Preserve internal router state to skip router pipelines.
– Find a state that maximizes bypassing opportunities.
- Priority-based arbitration
– Reduce stall time caused by router resource arbitration for critical threads.
– Assign high priority to critical threads and low priority to non-critical threads.
– Allocate VCs and switch ports based on priority-based arbitration.
SLIDE 12
- The router preserves a bypass (default) state between input ports and output ports.
- When a packet follows the same path as the router's bypass state, it bypasses the router pipeline and goes directly to the link.
- Bypass state corresponds to preserved router resources.
– Bypass VC
- Preserved VC for bypass
– State-preserving switch crossbar
- Preserved switch input/output ports for bypass
SLIDE 13
[Figure: router microarchitecture with routing logic, VC allocator, switch allocator, port state table, bypass VC, and state-preserving crossbar switch connecting inputs 0-3 to outputs 0-3]
SLIDE 14
[Figure: port state table driving a 2x4 decoder that configures the crossbar switch between inputs 0-3 and outputs 0-3]
State is preserved when switch allocation does not occur in the previous cycle.
SLIDE 15
- Each router has switch usage counters.
– Each counter is incremented on a packet basis only for critical threads.
– Each counter tracks usage of one input port and one output port of the switch.
- n² counters for an n × n switch
- Trade-off: more (monitoring) resources for improved performance
- These counters are used to periodically update the port state table.
- Each port state table represents switch usage patterns for critical threads during the previous time interval.
SLIDE 16
- When multiple packets request the same resource, arbitration is necessary.
– VC arbitration, switch arbitration, speculative-switch arbitration
- Higher-priority packets win arbitration over lower-priority packets.
– This priority is the same as the level of thread criticality.
- Aging for starvation freedom
SLIDE 17
- 64-core system modeled by SIMICS
– 8x8 mesh network
– 2-stage pipeline router + 1-cycle link
- 3-cycle hop latency (no bypass)
- 1-cycle hop latency (bypass)
– 6-cycle bank access for 16MB L2 cache
- PARSEC benchmarks
- Thread criticality predictor based on accumulated L1 miss penalty
– Switch usage counters are updated only for the top four critical threads.
SLIDE 18
SLIDE 19
SLIDE 20
SLIDE 21
SLIDE 22
- Each thread can have different performance due to different memory behavior.
- Accelerating the slowest (critical) threads reduces the execution time of parallel applications.
- The on-chip network is designed to support thread criticality through bypass flow control and priority-based arbitration techniques.
SLIDE 23