
Intel 48-core SCC processor, Tilera 100-core processor (PowerPoint presentation)




  2. Intel 48-core SCC processor; Tilera 100-core processor

  3. • Introduction
     • Parallel program structure and behavior
       – Case study of fluidanimate
       – Thread criticality problem
       – Communication impact on thread criticality
     • Thread-criticality support in the on-chip network
       – Bypass flow control
       – Priority-based arbitration
     • Methodology & results
     • Conclusion

  4. fluidanimate in PARSEC
     • Particle-based fluid simulation that solves the Navier-Stokes equations
     • Particles are spatially sorted in a uniform grid, and each thread covers subgrids of the entire simulation domain.
     • Each thread executes the AdvanceFrameMT() function.
     • 8 sub-functions with 8 barriers
     • The provided input sets process 5 frames.

     void AdvanceFrameMT(int i)
     {
         ClearParticlesMT(i);
         pthread_barrier_wait(&barrier);
         RebuildGridMT(i);
         pthread_barrier_wait(&barrier);
         InitDensitiesAndForcesMT(i);
         pthread_barrier_wait(&barrier);
         ComputeDensitiesMT(i);
         pthread_barrier_wait(&barrier);
         ComputeDensities2MT(i);
         pthread_barrier_wait(&barrier);
         ComputeForcesMT(i);
         pthread_barrier_wait(&barrier);
         ProcessCollisionsMT(i);
         pthread_barrier_wait(&barrier);
         AdvanceParticlesMT(i);
         pthread_barrier_wait(&barrier);
     }

  5. (Figure: N threads, one thread per core, each executing AdvanceFrameMT(). Execution time advances through Barrier 0, Barrier 1, Barrier 2, ..., Barrier 7; all threads synchronize at each of the eight barriers between sub-functions.)

  6. (Figure: per-thread execution time, in 10^8 cycles, of ComputeDensitiesMT and ComputeForcesMT between Barrier 3 and Barrier 5.) If we accelerate half of the threads, the execution time of ComputeForcesMT can be reduced by 29%.

  7. • Variation of executed instructions
       – Different execution paths in the same control flow graph ⇒ different computation time
     • Variation of memory accesses
       – Different cache behavior in the L2 cache ⇒ different memory stall time
       – Thread criticality predictor based on per-core L2 hits and misses (Bhattacharjee et al., ISCA '09)
         • Larger total L1 miss penalties ⇒ higher thread criticality
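The miss-penalty idea above can be sketched as follows. This is a minimal illustration, not the ISCA '09 design: the penalty weights, function names, and core count are all assumptions.

```c
#include <assert.h>

#define NCORES 4

/* Illustrative penalty weights in cycles; a real predictor would
   calibrate these against the actual L2 hit/miss latencies. */
#define L2_HIT_PENALTY   10
#define L2_MISS_PENALTY 100

/* Accumulated L1-miss penalty for one core: L1 misses that hit in L2
   are cheap, L1 misses that also miss in L2 are expensive. */
static long miss_penalty(long l2_hits, long l2_misses) {
    return l2_hits * L2_HIT_PENALTY + l2_misses * L2_MISS_PENALTY;
}

/* The core with the largest accumulated penalty is predicted to run
   the most critical thread. */
static int most_critical_core(const long penalty[NCORES]) {
    int critical = 0;
    for (int c = 1; c < NCORES; c++)
        if (penalty[c] > penalty[critical])
            critical = c;
    return critical;
}
```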

  8. • A large portion of the cache miss penalty is communication latency incurred by the on-chip network.
       – The L2 cache is distributed across multiple banks interconnected by the network.
     • L2 cache access latency in an 8x8 mesh network
       – 3-cycle hop latency, 6-cycle bank access latency, 12 hops for a round trip (uniform random traffic)
       – 36 cycles (86%) of the total 42-cycle latency are communication latency.
     • Our work aims at reducing the communication latency of high-criticality threads to accelerate their execution.
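The slide's latency breakdown can be reproduced directly from its stated parameters (3-cycle hops, 12 hops round trip, 6-cycle bank access); the function name here is illustrative:

```c
#include <assert.h>

/* Round-trip L2 access latency = network traversal + one bank access.
   With 3-cycle hops, 12 hops round trip, and a 6-cycle bank access:
   3 * 12 + 6 = 42 cycles total, of which 36 (about 86%) is network time. */
static int l2_round_trip_cycles(int hop_cycles, int hops, int bank_cycles) {
    return hop_cycles * hops + bank_cycles;
}
```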

  9. • Low-latency support
       – Express virtual channels (ISCA '07)
         • Router pipeline skipping by flow control
       – Single-cycle router (ISCA '04)
         • All dependent operations are handled in a single cycle through speculation.
     • Quality-of-service support
       – Globally synchronized frames (ISCA '09)
         • Per-flow bandwidth guarantee within a time window
       – Application-aware prioritization (MICRO '09)
         • High system throughput across many single-threaded applications by exploiting the different stall cycles per packet in each application

 10. • Bypass flow control
       – Reduce per-hop latency for critical threads.
       – Preserve internal router state to skip router pipeline stages.
       – Find a state that maximizes bypassing opportunities.
     • Priority-based arbitration
       – Reduce the stall time caused by router resource arbitration for critical threads.
       – Assign high priority to critical threads and low priority to non-critical threads.
       – Allocate VCs and switch ports based on priority-based arbitration.

 11. • The router preserves a bypass (default) state between input ports and output ports.
     • When a packet follows the same path as the router's bypass state, it skips the router pipeline and goes directly to the link.
     • The bypass state corresponds to preserved router resources.
       – Bypass VC: a VC reserved for bypassing
       – State-preserving crossbar switch: reserved switch input/output ports for bypassing
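The bypass check itself is simple to sketch: a packet skips the pipeline only when its input-to-output path matches the preserved state. The data layout and names below are assumptions, not taken from the paper.

```c
#include <assert.h>

#define PORTS 4

/* Preserved bypass state: for each input port, the output port it is
   currently wired to in the state-preserving crossbar (-1 = none). */
struct bypass_state {
    int out_for_in[PORTS];
};

/* A packet arriving on input port `in` and routed to output port `out`
   can skip the router pipeline only if the preserved crossbar state
   already connects that exact pair. */
static int can_bypass(const struct bypass_state *s, int in, int out) {
    return s->out_for_in[in] == out;
}
```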

 12. (Figure: router datapath with routing, VC allocator, and switch allocator stages; a port state table maps input ports to output ports, e.g. 0⇒1, 1⇒0, 2⇒2, 3⇒3; a bypass VC at each input and a state-preserving crossbar switch connect Input 0–3 to Output 0–3.)

 13. (Figure: the port state table drives a 2x4 decoder that configures the crossbar switch connecting Input 0–3 to Output 0–3.) The crossbar state is preserved when no switch allocation occurred in the previous cycle.

 14. • Each router has switch usage counters.
       – Each counter is incremented on a per-packet basis, only for critical threads.
       – Each counter tracks the usage of one input port and one output port of the switch.
         • n^2 counters for an n x n switch
         • Trades additional (monitoring) resources for improved performance
     • These counters are used to update the port state table periodically.
     • Each port state table represents the switch usage patterns of critical threads during the previous time interval.
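The periodic update might look like the sketch below: per-pair counters pick, for each input port, the output port most used by critical packets in the last interval. The structure and names are assumptions; notably, real hardware would also have to keep the table a valid crossbar configuration (no two inputs claiming the same output), which this sketch omits.

```c
#include <string.h>
#include <assert.h>

#define PORTS 4

/* count[in][out]: packets from critical threads that crossed the
   in -> out pair this interval (n^2 counters for an n x n switch). */
struct switch_usage {
    int count[PORTS][PORTS];
};

/* At the end of each interval, set each input port's preserved output
   to its most-used pairing, then reset the counters for the next one. */
static void update_port_state(struct switch_usage *u, int out_for_in[PORTS]) {
    for (int in = 0; in < PORTS; in++) {
        int best = 0;
        for (int out = 1; out < PORTS; out++)
            if (u->count[in][out] > u->count[in][best])
                best = out;
        out_for_in[in] = best;
    }
    memset(u->count, 0, sizeof u->count);
}
```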

 15. • When multiple packets request the same resource, arbitration is necessary.
       – VC arbitration, switch arbitration, speculative-switch arbitration
     • Higher-priority packets win arbitration over lower-priority packets.
       – A packet's priority equals its thread's criticality level.
     • Aging guarantees starvation freedom.
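A priority arbiter with aging can be sketched as follows. The slide only states that aging prevents starvation; the additive effective-priority formula and all names below are illustrative assumptions.

```c
#include <assert.h>

struct request {
    int priority; /* thread criticality level (higher = more critical) */
    int age;      /* cycles this request has spent waiting at the router */
};

/* Winner = highest effective priority. Adding age means a low-priority
   packet that keeps losing eventually outranks fresh high-priority
   arrivals, so no request starves indefinitely. */
static int arbitrate(const struct request *reqs, int n) {
    int winner = 0;
    for (int i = 1; i < n; i++) {
        int eff_i = reqs[i].priority + reqs[i].age;
        int eff_w = reqs[winner].priority + reqs[winner].age;
        if (eff_i > eff_w)
            winner = i;
    }
    return winner;
}
```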

 16. • 64-core system modeled with SIMICS
       – 8x8 mesh network
       – 2-stage pipelined router + 1-cycle link
         • 3-cycle hop latency (no bypass)
         • 1-cycle hop latency (bypass)
       – 6-cycle bank access for a 16MB L2 cache
     • PARSEC benchmarks
     • Thread criticality predictor based on accumulated L1 miss penalty
       – Switch usage counters are updated only for the top four critical threads.

 17.–20. (Results figures; data not recoverable from the transcript.)

 21. • Each thread can exhibit different performance due to different memory behavior.
     • Accelerating the slowest (critical) threads reduces the execution time of parallel applications.
     • The on-chip network is designed to support thread criticality through bypass flow control and priority-based arbitration.
