

SLIDE 1

SLIDE 2

SLIDE 3

  • Intel 48-core SCC processor
  • Tilera 100-core processor


SLIDE 4
  • Introduction
  • Parallel program structure and behavior
    – Case study of fluidanimate
    – Thread criticality problem
    – Communication impact on thread criticality
  • Thread-criticality support in on-chip network
    – Bypass flow control
    – Priority-based arbitration
  • Methodology & results
  • Conclusion


SLIDE 5
  • fluidanimate in PARSEC
  • Particle-based fluid simulation that solves the Navier-Stokes equations
  • Particles are spatially sorted in a uniform grid, and each thread covers subgrids of the entire simulation domain.
  • Each thread executes the AdvanceFrameMT() function.
  • 8 sub-functions with 8 barriers
  • The provided input sets process 5 frames.


void AdvanceFrameMT(int i) {
  ClearParticlesMT(i);
  pthread_barrier_wait(&barrier);
  RebuildGridMT(i);
  pthread_barrier_wait(&barrier);
  InitDensitiesAndForcesMT(i);
  pthread_barrier_wait(&barrier);
  ComputeDensitiesMT(i);
  pthread_barrier_wait(&barrier);
  ComputeDensities2MT(i);
  pthread_barrier_wait(&barrier);
  ComputeForcesMT(i);
  pthread_barrier_wait(&barrier);
  ProcessCollisionsMT(i);
  pthread_barrier_wait(&barrier);
  AdvanceParticlesMT(i);
  pthread_barrier_wait(&barrier);
}

SLIDE 6

[Figure: the AdvanceFrameMT() code from the previous slide, alongside an execution timeline of N threads (one thread per core) proceeding through Barrier 0, Barrier 1, Barrier 2, ..., Barrier 7]


SLIDE 7

[Figure: per-thread execution time (10^8 cycles) between Barrier 3 and Barrier 5, covering ComputeDensitiesMT and ComputeForcesMT]

If we accelerate half of the threads, the execution time of ComputeForcesMT can be reduced by 29%.

SLIDE 8
  • Variation of executed instructions
    – Different execution paths in the same control flow graph ⇒ different computation time
  • Variation of memory accesses
    – Different cache behavior in the L2 cache ⇒ different memory stall time
    – Thread criticality predictor based on per-core L2 hits and misses (Bhattacharjee et al., ISCA ’09)
  • Larger total L1 miss penalties ⇒ higher thread criticality


SLIDE 9
  • A large portion of the cache miss penalty is communication latency incurred by the on-chip network.
    – The L2 cache is distributed and interconnected as multiple banks.
  • L2 cache access latency in an 8x8 mesh network
    – 3-cycle hop latency, 6-cycle bank access latency, 12 hops for a round trip (uniform random traffic)
    – 36 cycles (86%) of the 42-cycle total latency are communication latency.
  • Our work aims at reducing the communication latency of high-criticality threads to accelerate their execution.


SLIDE 10
  • Low-latency support
    – Express virtual channels (ISCA ‘07)
      • Router pipeline skipping by flow control
    – Single-cycle router (ISCA ‘04)
      • All dependent operations are handled in a single cycle through speculation.
  • Quality-of-service support
    – Globally synchronized frames (ISCA ‘09)
      • Per-flow bandwidth guarantee within a time window
    – Application-aware prioritization (MICRO ‘09)
      • High system throughput across many single-threaded applications by exploiting the different stall cycles per packet in each application


SLIDE 11
  • Bypass flow control
    – Reduce per-hop latency for critical threads.
    – Preserve internal router state to skip router pipelines.
    – Find a state that maximizes bypassing opportunities.
  • Priority-based arbitration
    – Reduce stall time caused by router resource arbitration for critical threads.
    – Assign high priority to critical threads and low priority to non-critical threads.
    – Allocate VCs and switch ports based on priority-based arbitration.


SLIDE 12
  • The router preserves a bypass (default) state between input ports and output ports.
  • When a packet follows the same path as the router's bypass state, it bypasses the router pipeline and goes directly to the link.
  • The bypass state corresponds to preserved router resources.
    – Bypass VC
      • A VC preserved for bypass
    – State-preserving crossbar switch
      • Switch input/output ports preserved for bypass


SLIDE 13

[Figure: router microarchitecture with routing logic, VC allocator, and switch allocator in front of the crossbar switch, extended with a bypass VC, a port state table (Input 0 ... Input 3 by Output 0 ... Output 3), and a state-preserving crossbar switch]

SLIDE 14

[Figure: the port state table (Input 0 ... Input 3, Output 0 ... Output 3) driving the crossbar switch through a 2x4 decoder]

The switch state is preserved when no switch allocation occurred in the previous cycle.

SLIDE 15
  • Each router has switch usage counters.
    – Each counter is incremented on a packet basis, only for critical threads.
    – Each counter tracks the usage of one input port and one output port of the switch.
      • n^2 counters for an n × n switch
      • Trades more (monitoring) resources for improved performance
  • These counters are used to update the port state table periodically.
  • Each port state table represents the switch usage patterns of critical threads during the previous time interval.


SLIDE 16
  • When multiple packets request the same resource, arbitration is necessary.
    – VC arbitration, switch arbitration, speculative-switch arbitration
  • Higher-priority packets win arbitration over lower-priority packets.
    – A packet's priority is the criticality level of its thread.
  • Aging for starvation freedom


SLIDE 17
  • 64-core system modeled in SIMICS
    – 8x8 mesh network
    – 2-stage pipelined router + 1-cycle link
      • 3-cycle hop latency (no bypass)
      • 1-cycle hop latency (bypass)
    – 6-cycle bank access for a 16MB L2 cache
  • PARSEC benchmarks
  • Thread criticality predictor based on accumulated L1 miss penalty
    – Switch usage counters are updated only for the top four critical threads.


SLIDE 18

SLIDE 19

SLIDE 20

SLIDE 21

SLIDE 22
  • Each thread can have different performance due to different memory behavior.
  • Accelerating the slowest (critical) threads reduces the execution time of parallel applications.
  • The on-chip network is designed to support thread criticality through bypass flow control and priority-based arbitration.


SLIDE 23