
GLOBALLY-SYNCHRONIZED FRAMES FOR GUARANTEED QUALITY-OF-SERVICE IN ON-CHIP NETWORKS
Jae W. Lee (MIT), Man Cheuk Ng (MIT), Krste Asanovic (UC Berkeley)
June 23rd, 2008, ISCA-35, Beijing, China


SLIDE 1

GLOBALLY-SYNCHRONIZED FRAMES FOR GUARANTEED QUALITY-OF-SERVICE IN ON-CHIP NETWORKS

Jae W. Lee (MIT), Man Cheuk Ng (MIT), Krste Asanovic (UC Berkeley)

June 23rd, 2008

ISCA-35, Beijing, China

SLIDE 2

Resource sharing increases performance variation

[Figure: 16 processors (P) connected through a multi-hop on-chip network to shared L2$ banks and memory controllers]

Resource sharing
  • (+) reduces hardware cost
  • (-) increases performance variation

This performance variation grows as the number of sharers (cores) increases.

SLIDE 3

Desired quality-of-service from shared resources

Performance isolation (fairness)

[Figure: 16 processors (P) sharing a multi-hop on-chip network, L2$ banks, and memory controllers; bar chart of accepted throughput [MB/s] per processor ID (0-F) for traffic to a hotspot, with a minimum guaranteed BW line]

SLIDE 4

Desired quality-of-service from shared resources

Performance isolation (fairness)

Differentiated services (flexibility)

[Figure: 16 processors (P) sharing a multi-hop on-chip network, L2$ banks, and memory controllers; two bar charts of accepted throughput [MB/s] per processor ID (0-F) for traffic to a hotspot, each with a minimum guaranteed BW line; one chart is labeled "differentiated allocation"]

SLIDE 5

Resources w/ centralized arbitration are well investigated

[Figure: processors with L1 caches (P + L1$) attached to on-chip routers (R), together with an L2$ bank and a memory controller]

Resources with centralized arbitration
  • SDRAM controllers
  • L2 cache banks

Cited work: [MICRO ’06] [PACT ’07] [USENIX sec. ’07] [IBM ’07] [MICRO ’07] [ISCA ’08] ... [HPCA ’02] [ICS ’04] [ISCA ’07] …

They have a single entry point for all requests. → QoS is relatively easier and well investigated.

SLIDE 6

QoS from on-chip networks is a challenge

[Figure: processors with L1 caches (P + L1$) attached to on-chip routers (R), together with an L2$ bank and a memory controller]

Resources with distributed arbitration
  • multi-hop on-chip networks

They have distributed arbitration points. → QoS is more difficult.

Off-chip solutions cannot be directly applied because of resource constraints.

SLIDE 7

We guarantee QoS for flows

Flow: a sequence of packets between a unique pair of end nodes (src and dest)
  • physical links shared by flows
  • multiple stages of arbitration for each packet

We provide guaranteed QoS to each flow with:
  • minimum bandwidth guarantees
  • bounded maximum delay

[Figure: 16 routers (R) with a hotspot resource; one physical link is shared by 3 flows]

SLIDE 8

Locally fair ⇏ globally fair

With locally fair round-robin (RR) arbitration:
  • Throughput (Flow A) = (0.5) C
  • Throughput (Flow B) = (0.5)^2 C
  • Throughput (Flow C) = Throughput (Flow D) = (0.5)^3 C

→ Throughput of a flow decreases exponentially as its distance to the destination (hotspot) increases.

[Figure: sources SRC A, SRC B, SRC C, and SRC D send to one destination (DEST) through arbitration points 1, 2, and 3; channel rate = C [Gb/s]]
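The exponential decay on this slide can be reproduced with a small sketch. It assumes a chain of two-input round-robin arbiters in which every source is always backlogged; the function name and the list ordering (source nearest to the destination first) are illustrative choices, not taken from the talk.

```python
def round_robin_chain_throughput(num_sources, channel_rate):
    """Per-source throughput in a chain of 2-input round-robin arbiters.

    Index 0 is the source nearest to the destination. Assumes every
    source always has traffic to send (hypothetical saturated load).
    """
    if num_sources < 1:
        raise ValueError("need at least one source")
    shares = []
    remaining = channel_rate
    # Each arbiter gives half of its output bandwidth to the local source
    # and half to everything arriving from farther upstream.
    for _ in range(num_sources - 1):
        remaining /= 2
        shares.append(remaining)
    # The farthest source receives whatever the last arbiter passed upstream.
    shares.append(remaining)
    return shares


# Flows A, B, C, D with channel rate C = 1.0
print(round_robin_chain_throughput(4, 1.0))  # [0.5, 0.25, 0.125, 0.125]
```

The shares always sum to the channel rate, so the link stays fully used while the farthest sources are starved.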

SLIDE 9

Motivational simulation

In 8x8 mesh network with RR arbitration (hotspot at (8, 8))

[Figure: two plots of accepted throughput [flits/cycle/node] versus node index (X) and node index (Y) on the 8x8 2D mesh, one w/ dimension-ordered routing and one w/ minimal-adaptive routing]

locally-fair round-robin scheduling → globally unfair bandwidth usage

SLIDE 10

Desired bandwidth allocation: an example

Taken from simulation results with GSF:

[Figure: two plots of accepted throughput [flits/cycle/node] versus node index (X) and node index (Y): fair allocation and differentiated allocation]

SLIDE 11

Globally Synchronized Frames (GSF)

provide guaranteed QoS, with minimum bandwidth guarantees and bounded maximum delay, to each flow in multi-hop on-chip networks:
  • with high network utilization comparable to a best-effort virtual-channel router
  • with minimal area/energy overhead by avoiding per-flow queues/structures in on-chip routers → scalable to # of concurrent flows

SLIDE 12

Outline of this talk

  • Motivation
  • Globally-Synchronized Frames: a step-by-step development of mechanism
  • Implementation of GSF router
  • Evaluation
  • Related work
  • Conclusion

SLIDE 13

GSF takes a frame-based approach

[Figure: 16 routers (R) with a shared physical link; a timeline divided into frames 1-4]

Frame is a coarse quantization of time.
  • The network can transport a finite number of flits during this interval.
  • We constrain each flow source to inject a certain number of flits per frame.
  • shorter frames → coarser BW control but lower maximum delay
  • typically 1-100s Kflits / frame (over all flows) in 8x8 mesh network

SLIDE 14

Admission control of flows

[Figure: 16 routers (R) with a shared physical link; a timeline divided into frames 1-4]

Admission control: reject a new flow if it would make the network unable to transport all the injected flits within a frame interval
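A minimal sketch of such an admission check follows. It assumes that every link can carry at most `frame_size_flits` flits per frame and that a flow reserves the same number of flits per frame on every link of its path; the class name, method names, and link labels are hypothetical.

```python
from collections import defaultdict


class AdmissionController:
    """Per-link bookkeeping of flits reserved per frame (illustrative only)."""

    def __init__(self, frame_size_flits):
        self.frame_size = frame_size_flits
        self.reserved = defaultdict(int)  # link id -> flits reserved per frame

    def admit(self, path_links, flits_per_frame):
        """Admit the flow only if no link on its path would be oversubscribed."""
        links = set(path_links)
        if any(self.reserved[link] + flits_per_frame > self.frame_size
               for link in links):
            return False  # reject: some link could not drain within a frame
        for link in links:
            self.reserved[link] += flits_per_frame
        return True


ac = AdmissionController(frame_size_flits=1000)
print(ac.admit(["a", "b"], 600))  # True
print(ac.admit(["b", "c"], 500))  # False, link "b" would need 1100 flits
```

A rejected flow leaves the reservations untouched, so later requests are judged against admitted flows only.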

SLIDE 15

Single frame does not service bursty traffic well

Both traffic sources have the same long-term rate: 2 flits / frame.

Allocating 2 flits / frame penalizes the bursty source.

[Figure: timeline of frames 1-5 comparing a regulated source and a bursty source]

SLIDE 16

Overlapping multiple frames to help bursty traffic

Overlapping multiple frames to multiply injection slots
  • Sources can inject flits into future frames (w/ separate per-frame buffers)
  • Older frames have higher priorities for contended channels.
  • Drain time of head frame does not change.
  • Future frames can use BW left unclaimed by older frames.
  • Maximum network delay < 3 * (frame interval)

Best-effort traffic: always lowest priority (throughput ↑)

[Figure: timeline of overlapping frames 1-7; at any time the window holds a head frame and 2 future frames]
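The effect of overlapping frames on a bursty source can be sketched as follows. The function is an illustration under simple assumptions: the source starts with a fresh window (no injection slots used yet), and a flit that does not fit in the window waits, which is shown as `None`. All names are hypothetical.

```python
def assign_frames(num_flits, quota_per_frame, window_size):
    """Map each flit of a burst to a frame offset (0 = head frame).

    A source may use `quota_per_frame` injection slots in each of the
    `window_size` active frames; flits beyond that must wait (None).
    """
    assignments = []
    for i in range(num_flits):
        offset = i // quota_per_frame  # fill the oldest frame first
        assignments.append(offset if offset < window_size else None)
    return assignments


# Burst of 5 flits at 2 flits/frame
print(assign_frames(5, 2, 1))  # [0, 0, None, None, None]  single frame
print(assign_frames(5, 2, 3))  # [0, 0, 1, 1, 2]           3 overlapping frames
```

With a single frame the burst is throttled after two flits, while a three-frame window absorbs the whole burst at the same long-term rate.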

SLIDE 17

Reclamation of frame buffers

Per-frame buffers (at each node) = virtual channels

At every frame window shift, frame buffers (or VCs) associated with the earliest frame in the previous epoch are reclaimed for the new futuremost frame.

[Figure: timeline of frames 1-7 mapped onto VC0, VC1, and VC2; each frame window shift starts a new epoch (epochs 1-5)]
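The window shift can be sketched with a small class. The modulo mapping from frame number to VC is an assumption; it is consistent with the two-router example in this talk, where VC 0 holds Fr0 before the shift and Fr3 after it. Class and method names are hypothetical.

```python
class FrameWindow:
    """Window of W consecutive active frames, one VC per frame."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.head_frame = 0

    def active_frames(self):
        return list(range(self.head_frame, self.head_frame + self.window_size))

    def vc_for(self, frame):
        # Assumed mapping: frame k uses VC (k mod W).
        return frame % self.window_size

    def shift(self):
        """Retire the head frame; return (reclaimed VC, new futuremost frame)."""
        reclaimed_vc = self.vc_for(self.head_frame)
        self.head_frame += 1
        new_frame = self.head_frame + self.window_size - 1
        return reclaimed_vc, new_frame


window = FrameWindow(3)
print(window.active_frames())  # [0, 1, 2]
print(window.shift())          # (0, 3): VC 0 is reclaimed for frame 3
print(window.active_frames())  # [1, 2, 3]
```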

SLIDE 18

Early reclamation improves network throughput

Observation: Head frame usually drains much earlier than frame interval → low buffer utilization

Terminate head frame early if empty
  • Use a global barrier network to confirm no pending packet in router or source queue belongs to head frame.
  • Empty buffers are reclaimed much faster and overall throughput increases. (by >30% for hotspot traffic pattern)

[Figure: timelines of frames 1-7 with frame window shifts, shown first over epochs 1-5 and then, with early reclamation, over epochs e0-e7]
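The barrier condition can be sketched as a global AND over per-node status. This is only a software illustration of the check; the data layout (one set of pending frame numbers per node, covering both router buffers and the source queue) and the function name are assumptions.

```python
def try_early_shift(head_frame, pending_frames_per_node):
    """Advance the head frame if no node still holds a head-frame packet.

    pending_frames_per_node: one set per node with the frame numbers that
    still have packets in that node's router or source queue.
    """
    head_empty_everywhere = all(
        head_frame not in pending for pending in pending_frames_per_node
    )
    return head_frame + 1 if head_empty_everywhere else head_frame


print(try_early_shift(0, [{1, 2}, set(), {2}]))  # 1: frame 0 has drained
print(try_early_shift(0, [{0, 1}, set()]))       # 0: node 0 still holds frame 0
```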

SLIDE 19

GSF in action: two-router network example (3 VCs)

[Figure: two routers, each with VC 0 (Fr0), VC 1 (Fr1), and VC 2 (Fr2), carrying packets of Flow A, Flow B, Flow C, and Flow D; a timeline of Frames 0-5 marks the active frame window]

SLIDE 20

GSF in action: two-router network example (3 VCs)

[Figure: next step of the same example, with VC 0 (Fr0), VC 1 (Fr1), and VC 2 (Fr2) in both routers and the active frame window marked on the Frames 0-5 timeline]

SLIDE 21

GSF in action: two-router network example (3 VCs)

[Figure: next step of the same example, with VC 0 (Fr0), VC 1 (Fr1), and VC 2 (Fr2) in both routers and the active frame window marked on the Frames 0-5 timeline]

SLIDE 22

GSF in action: two-router network example (3 VCs)

[Figure: next step of the same example, with VC 0 (Fr0), VC 1 (Fr1), and VC 2 (Fr2) in both routers and the active frame window marked on the Frames 0-5 timeline]

SLIDE 23

GSF in action: two-router network example (3 VCs)

[Figure: next step of the same example, with an "empty!" annotation; VC 0 (Fr0), VC 1 (Fr1), and VC 2 (Fr2) in both routers]

SLIDE 24

GSF in action: two-router network example (3 VCs)

[Figure: frame window shift; VC 0 now holds Fr3 in both routers while VC 1 (Fr1) and VC 2 (Fr2) are unchanged, and the active frame window advances on the Frames 0-5 timeline]

SLIDE 25

Carpool lane sharing

Buffers are expensive in on-chip environment.

With a fixed frame-VC mapping, a flit cannot be transported even if there is an empty slot in other frame buffers.

Carpool lane sharing: relaxing frame-VC mapping to improve buffer utilization
  • Reserve one frame buffer (VC0) for head frame only → does not increase the drain time of head frame
  • The other buffers are now colorless and can be used by any frame.
  • Head-of-line (HoL) blocking prevented by not allowing two packets to occupy a VC simultaneously (OK for shallow buffers).

[Figure: with the fixed mapping, VC 0 (Fk), VC 1 (F(k+1)), and VC2 (F(k+2)) each serve one frame, illustrated with a flit in Frame (k+1); with carpool lane sharing, VC 0 (Fk) is the carpool lane reserved for head frame only and VC 1&2 serve Fk, F(k+1), and F(k+2)]
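The relaxed frame-VC mapping can be sketched as a filter over free VCs. This assumes VC 0 is the carpool lane and that a VC counts as busy while any packet occupies it; the function and parameter names are hypothetical.

```python
def eligible_vcs(packet_frame, head_frame, vc_busy):
    """Return the VCs a packet may be allocated under carpool lane sharing.

    vc_busy[i] is True while a packet occupies VC i. A busy VC is never
    shared, which is how head-of-line blocking is avoided.
    """
    vcs = []
    for vc, busy in enumerate(vc_busy):
        if busy:
            continue  # at most one packet per VC
        if vc == 0 and packet_frame != head_frame:
            continue  # carpool lane: head frame only
        vcs.append(vc)
    return vcs


# Head frame is 3; VC 1 is occupied.
print(eligible_vcs(3, 3, [False, True, False]))  # [0, 2]
print(eligible_vcs(4, 3, [False, True, False]))  # [2]
```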

SLIDE 26

Baseline virtual channel (VC) router

  • Best-effort router
  • Three-stage pipeline with look-ahead routing: VA/NRC-SA-ST
  • Credit-based flow control
  • VC, SW allocators: iSlip
  • uses round-robin arbiters (locally fair)
  • updates the priority of each arbiter only when that arbiter generates a winning grant

[Figure: Baseline router for 2D mesh networks, with flit buffers (VC0-VC3) at the input ports, VC allocation, SW allocation, next route computation, and a 5x5 crossbar with outputs to P, N, E, S, and W]

SLIDE 27

GSF router

  • VC0: carpool lane reserved for head frame only
  • New registers: head_frame (HF) (per node), frame_num (per VC)
  • NRC: priority precalculation, (frame_num-HF) (mod W) (0 is the highest priority.)
  • VC and SW allocation: priority enforcement
  • Global barrier network for frame window shifting

[Figure: GSF router, showing the baseline router (flit buffers VC0-VC3, next route computation, VC allocation, SW allocation, 5x5 crossbar to P, N, E, S, and W) extended with a frame_num register per VC, a head_frame register (head_frame=2), and a global barrier network with the signals head_frame_empty_at_node_X and increment_head_frame]
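The priority precalculation on this slide is a one-line modular subtraction. The sketch below also shows how an allocator might pick the oldest frame among competing requests; the tie-breaking rule (first request in the list) is a placeholder assumption, and the function names are hypothetical.

```python
def priority(frame_num, head_frame, window_size):
    """(frame_num - HF) mod W; 0 is the highest priority (head frame)."""
    return (frame_num - head_frame) % window_size


def arbitrate(requests, head_frame, window_size):
    """Pick the requester whose packet belongs to the oldest frame.

    requests: list of (requester_id, frame_num) pairs.
    Ties go to the earliest entry in the list (placeholder policy).
    """
    return min(
        requests, key=lambda r: priority(r[1], head_frame, window_size)
    )[0]


print(priority(2, 2, 6))  # 0: head frame
print(priority(1, 5, 6))  # 2: frame number has wrapped around the window
print(arbitrate([("N", 3), ("E", 2), ("W", 4)], head_frame=2, window_size=6))  # E
```

Because the subtraction is taken modulo W, the comparison stays correct when frame numbers wrap around.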

SLIDE 28

Simulation setup

Network simulator: Booksim
  • 0.5 M cycles with 50K-cycle warm-up

Network configuration
  • 8x8 2D mesh, dimension-ordered routing, 1 flit/cycle link capacity

Four traffic patterns
  • one QoS traffic pattern: hotspot
  • three best-effort traffic patterns: uniform random, transpose, nearest neighbor
  • packet size is either 1 or 9 flits (with 50-50 chance)

Baseline VC router
  • 3-stage pipeline (VA/NRC-SA-ST), 2-cycle credit pipeline delay
  • 6 VCs/physical link, buffer depth is 5 flits/VC

GSF parameters
  • frame window size = 6 [frames], frame size = 1,000 [flits]
  • global barrier latency = 16 [cycles] (conservative)

SLIDE 29

Flexible guaranteed QoS provided

All flows receive more than their minimum guaranteed bandwidth (Ri/eMAX) in accessing hotspot.
  • Ri: # of flit injection slots for Flow i
  • eMAX: maximum epoch interval.

Example: 8x8 mesh network

[Figure: accepted throughput [flits/cycle/node] versus node index (X) and node index (Y): (a) fair allocation, (b) differentiated allocation]
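The bound Ri/eMAX is simple arithmetic, shown here for concreteness. The numbers in the example are hypothetical and are not taken from the talk's results; only the formula comes from the slide.

```python
def min_guaranteed_bandwidth(injection_slots, max_epoch_cycles):
    """Minimum guaranteed bandwidth Ri / eMAX in flits per cycle."""
    if max_epoch_cycles <= 0:
        raise ValueError("eMAX must be positive")
    return injection_slots / max_epoch_cycles


# Hypothetical values: Ri = 15 flits per frame, eMAX = 1500 cycles
print(min_guaranteed_bandwidth(15, 1500))  # 0.01 flits/cycle
```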

SLIDE 30

Flexible guaranteed QoS provided

All flows receive more than their minimum guaranteed bandwidth (Ri/eMAX) in accessing hotspot.
  • Ri: # of flit injection slots for Flow i
  • eMAX: maximum epoch interval.

Example: 16x16 torus network with 4 hotspot nodes

[Figure: accepted throughput [flits/cycle/node] versus node index (X) and node index (Y), both ranging over 0-F: (c) differentiated allocation]

SLIDE 31

Small throughput degradation for best-effort traffic

Network behavior with non-QoS traffic
  • no latency increase in uncongested region
  • at most 12 % degradation of network saturation throughput → can be reduced with larger frame (at the cost of delay bound increase)

[Figure: three plots of average delay [cycles] versus offered load [flits/cycle/node] for iSlip, GSF/1, GSF/8, and GSF/16 under uniform random, transpose, and nearest neighbor traffic; saturation throughput degradations of 12 % and 7 % are annotated]

SLIDE 32

Related work

QoS support in IP or multiprocessor networks
  • Fair Queueing [SIGCOMM ’89], Virtual Clock [SIGCOMM ’90]
  • Multi-rate channel switching [IEEE Comm ’86]
  • Source throttling [HPCA ’01]
  • Age-based arbitration [IEEE TPADS ’92, SC ’07]
  • Rotating Combined Queueing (RCQ) [ISCA ’96]

→ expensive, inflexible, and/or without guaranteed QoS

QoS on-chip networks
  • AEthereal (strict TDM; exp. channel setup) [IEEE Design & Test ’05]
  • SonicsMX (per-thread queues at each node) [DATE ’05]
  • MANGO clockless NoC (partitioning GS and BE VCs) [DATE ’05]
  • Nostrum (routes fixed at design time) [DATE ’04]

SLIDE 33

Conclusion

The GSF network is

guaranteed QoS-capable
  • with minimum bandwidth guarantees and bounded maximum delay

flexible
  • fair and differentiated bandwidth allocation
  • no explicit channel setup required along the path

robust
  • <5 % throughput degradation on average (12 % in the worst case) for four traffic patterns in 8x8 mesh network
  • fairness vs overall throughput tradeoff with frame size

simple
  • no per-flow queues/structures in on-chip routers → scalable
  • relatively small modifications to a conventional VC router