SLIDE 1
SLIDE 2
SLIDE 3
Intel 48-core SCC processor Tilera 100-core processor
SLIDE 4
- Introduction
- Parallel program structure and behavior
– Case study of fluidanimate
– Thread criticality problem
– Communication impact on thread criticality
- Thread-criticality support in on-chip network
– Bypass flow control
– Priority-based arbitration
- Methodology & results
- Conclusion
SLIDE 5
- fluidanimate in PARSEC
- Particle-based fluid simulation that solves the Navier-Stokes equations
- Particles are spatially sorted in a uniform grid, and each thread covers subgrids in the entire simulation domain.
- Each thread executes the AdvanceFrameMT() function.
- 8 sub-functions with 8 barriers
- Provided input sets process 5 frames.
void AdvanceFrameMT(int i) {
    ClearParticlesMT(i);
    pthread_barrier_wait(&barrier);
    RebuildGridMT(i);
    pthread_barrier_wait(&barrier);
    InitDensitiesAndForcesMT(i);
    pthread_barrier_wait(&barrier);
    ComputeDensitiesMT(i);
    pthread_barrier_wait(&barrier);
    ComputeDensities2MT(i);
    pthread_barrier_wait(&barrier);
    ComputeForcesMT(i);
    pthread_barrier_wait(&barrier);
    ProcessCollisionsMT(i);
    pthread_barrier_wait(&barrier);
    AdvanceParticlesMT(i);
    pthread_barrier_wait(&barrier);
}
SLIDE 6
[Figure: execution timeline of N threads (one thread per core) running AdvanceFrameMT, separated by Barrier 0, Barrier 1, Barrier 2, ..., Barrier 7]
SLIDE 7
[Figure: per-thread execution timeline (10^8 cycles) of ComputeDensitiesMT and ComputeForcesMT between Barrier 3 and Barrier 5]
If we accelerate half of the threads, the execution time of ComputeForcesMT can be reduced by 29%.
SLIDE 8
- Variation of executed instructions
– Different execution paths in the same control flow graph ⇒ Different computation time
- Variation of memory accesses
– Different cache behavior on L2 cache
– Thread criticality predictor based on per-core L2 hits and misses (Bhattacharjee et al., ISCA '09)
– Larger total L1 miss penalties ⇒ higher thread criticality
⇒ Different memory stall time
SLIDE 9
- A large portion of the cache miss penalty is communication latency incurred by the on-chip network.
– L2 cache is distributed and interconnected with multiple banks.
- L2 cache access latency in 8x8 mesh network
– 3-cycle hop latency, 6-cycle bank access latency, 12 hops for round trip (uniform random)
– 36 cycles (86%) of the total 42-cycle latency are communication latency.
- Our work aims at reducing the communication latency of high-criticality threads to accelerate their execution.
SLIDE 10
- Low latency support
– Express virtual channel (ISCA '07)
- Router pipeline skipping by flow control
– Single-cycle router (ISCA '04)
- All dependent operations through speculation are handled in a single cycle.
- Quality of service support
– Globally synchronized frames (ISCA '09)
- Per-flow bandwidth guarantee within a time window
– Application-aware prioritization (MICRO '09)
- High system throughput across many single-threaded applications by exploiting different stall cycles per packet in each application
SLIDE 11
- Bypass flow control
– Reduce per-hop latency for critical threads.
– Preserve internal router state to skip router pipelines.
– Find a state that maximizes bypassing opportunities.
- Priority-based arbitration
– Reduce stall time caused by router resource arbitration for critical threads.
– Assign high priority to critical threads and low priority to non-critical threads.
– Allocate VCs and switch ports based on priority-based arbitration.
SLIDE 12
- The router preserves a bypass (default) state between input ports and output ports.
- When a packet follows the same path as the router's bypass state, it bypasses the router pipeline and goes directly to the link.
- Bypass state corresponds to preserved router resources.
– Bypass VC
- Preserved VC for bypass
– State-preserving switch crossbar
- Preserved switch input/output ports for bypass
SLIDE 13
[Figure: router microarchitecture with routing logic, VC allocator, switch allocator, port state table, bypass VC, and state-preserving crossbar switch connecting inputs 0-3 to outputs 0-3]
SLIDE 14
[Figure: port state table driving a 2x4 decoder that configures the crossbar switch between inputs 0-3 and outputs 0-3]
State is preserved when switch allocation does not occur in the previous cycle.
SLIDE 15
- Each router has switch usage counters.
– Each counter is incremented on a packet basis only for critical threads.
– Each counter tracks usage of one input port and one output port of the switch.
- n² counters for an n × n switch
- Trade-off: more (monitoring) resources for improved performance
- These counters are used to periodically update the port state table.
- Each port state table represents switch usage patterns for critical threads during the previous time interval.
SLIDE 16
- When multiple packets request the same resource, arbitration is necessary.
– VC arbitration, switch arbitration, speculative-switch arbitration
- Higher-priority packets win arbitration over lower-priority packets.
– This priority is the same as the level of thread criticality.
- Aging for starvation freedom
SLIDE 17
- 64-core system modeled by SIMICS
– 8x8 mesh network
– 2-stage pipeline router + 1-cycle link
- 3-cycle hop latency (no bypass)
- 1-cycle hop latency (bypass)
– 6-cycle bank access for 16MB L2 cache
- PARSEC benchmarks
- Thread criticality predictor based on accumulated L1 miss penalty
– Switch usage counters are updated only for the top four critical threads.
SLIDE 18
SLIDE 19
SLIDE 20
SLIDE 21
SLIDE 22
- Each thread can have different performance due to different memory behavior.
- Accelerating the slowest (critical) threads reduces the execution time of parallel applications.
- The on-chip network is designed to support thread criticality through bypass flow control and priority-based arbitration techniques.
SLIDE 23