GLOBALLY-SYNCHRONIZED FRAMES FOR GUARANTEED QUALITY-OF-SERVICE IN ON-CHIP NETWORKS
Jae W. Lee (MIT), Man Cheuk Ng (MIT), Krste Asanovic (UC Berkeley)
June 23rd, 2008, ISCA-35, Beijing, China
Jae W. Lee (2 / 33)
Resource sharing increases performance variation
[Figure: 16 processors (P) sharing a multi-hop on-chip network, L2$ banks, and memory controllers]
Resource sharing
(+) reduces hardware cost
(-) increases performance variation
This performance variation becomes larger and larger as the number of sharers (cores) increases.
Jae W. Lee (3 / 33)
Desired quality-of-service from shared resources
Performance isolation (fairness)
[Figure: 16 processors (P) sharing a multi-hop on-chip network, L2$ banks, and memory controllers; plot of accepted throughput [MB/s] versus processor ID (0-F) for a hotspot, with the minimum guaranteed BW marked]
Jae W. Lee (4 / 33)
Desired quality-of-service from shared resources
Performance isolation (fairness)
Differentiated services (flexibility)
[Figure: two plots of accepted throughput [MB/s] versus processor ID (0-F) for a hotspot, each with the minimum guaranteed BW marked; the second plot shows differentiated allocation]
Jae W. Lee (5 / 33)
Resources w/ centralized arbitration are well investigated
[Figure: processors with L1 caches (P + L1$) attached to on-chip routers (R), which connect to L2$ banks and memory controllers]
Cited work: [MICRO ’06] [PACT ’07] [USENIX sec. ’07] [IBM ’07] [MICRO ’07] [ISCA ’08] ... [HPCA ’02] [ICS ’04] [ISCA ’07] …
Resources with centralized arbitration: SDRAM controllers, L2 cache banks
They have a single entry point for all requests.
→ QoS is relatively easier and well investigated.
Jae W. Lee (6 / 33)
QoS from on-chip networks is a challenge
[Figure: processors with L1 caches (P + L1$) attached to on-chip routers (R), which connect to L2$ banks and memory controllers]
Resources with distributed arbitration: multi-hop on-chip networks
They have distributed arbitration points.
→ QoS is more difficult.
Off-chip solutions cannot be directly applied because of resource constraints.
Jae W. Lee (7 / 33)
We guarantee QoS for flows
Flow: a sequence of packets between a unique pair of end nodes (src and dest)
We provide guaranteed QoS to each flow with:
minimum bandwidth guarantees
bounded maximum delay
[Figure: 16 routers (R) connecting the end nodes; physical links are shared by flows (one link is shared by 3 flows on the way to a hotspot resource), and each packet passes through multiple stages of arbitration]
Jae W. Lee (8 / 33)
Locally fair ⇏ globally fair
With locally fair round-robin (RR) arbitration:
Throughput (Flow A) = (0.5) C
Throughput (Flow B) = (0.5)^2 C
Throughput (Flow C) = Throughput (Flow D) = (0.5)^3 C
→ Throughput of a flow decreases exponentially as its distance to the destination (hotspot) increases.
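The exponential decay above can be reproduced with a short calculation. This is a minimal sketch, assuming every source is saturated and every arbitration point is a two-input round-robin arbiter that splits its output evenly; the function name `rr_chain_throughput` is hypothetical and not from the talk.

```python
from fractions import Fraction

def rr_chain_throughput(num_sources, channel_rate=Fraction(1)):
    """Per-source throughput in a chain of two-input round-robin arbiters.

    Index 0 is the source that merges at the last arbitration point
    (closest to the destination). All sources are assumed saturated.
    """
    if num_sources < 1:
        raise ValueError("need at least one source")
    shares = []
    remaining = Fraction(channel_rate)
    for _ in range(num_sources - 1):
        remaining /= 2            # each arbiter splits its output evenly
        shares.append(remaining)  # the locally merging flow keeps one half
    shares.append(remaining)      # the farthest source takes what is left
    return shares

# Flows A, B, C, D as fractions of the channel rate C
shares = rr_chain_throughput(4)
```

With four sources, `shares` holds 1/2, 1/4, 1/8 and 1/8 of the channel rate for Flows A, B, C and D, matching the expressions on the slide.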
[Figure: sources A, B, C and D feeding a chain of arbitration points 1, 2 and 3 toward DEST; channel rate = C [Gb/s]]
Jae W. Lee (9 / 33)
Motivational simulation
In an 8x8 mesh network with RR arbitration (hotspot at (8, 8))
[Figure: accepted throughput [flits/cycle/node] versus node index (X, Y), in two panels: w/ dimension-ordered routing and w/ minimal-adaptive routing]
locally-fair round-robin scheduling → globally unfair bandwidth usage
[Figure: 8x8 2D mesh with node indices 1-8 on each axis and the hotspot marked]
Jae W. Lee (10 / 33)
Desired bandwidth allocation: an example
Taken from simulation results with GSF:
[Figure: accepted throughput [flits/cycle/node] versus node index (X, Y), in two panels: fair allocation and differentiated allocation]
Jae W. Lee (11 / 33)
Globally Synchronized Frames (GSF)
provide guaranteed QoS, with minimum bandwidth guarantees and bounded maximum delay, to each flow in multi-hop on-chip networks:
with high network utilization comparable to a best-effort virtual-channel router
with minimal area/energy overhead by avoiding per-flow queues/structures in on-chip routers
→ scalable to # of concurrent flows
Jae W. Lee (12 / 33)
Outline of this talk
Motivation
Globally-Synchronized Frames: a step-by-step development of mechanism
Implementation of GSF router
Evaluation
Related work
Conclusion
Jae W. Lee (13 / 33)
GSF takes a frame-based approach
[Figure: 16 routers (R); the timeline of a shared physical link is divided into frames 1-4]
Frame is a coarse quantization of time.
The network can transport a finite number of flits during this interval.
We constrain each flow source to inject a certain number of flits per frame.
shorter frames → coarser BW control but lower maximum delay
typically 1-100s Kflits / frame (over all flows) in 8x8 mesh network
Jae W. Lee (14 / 33)
Admission control of flows
[Figure: 16 routers (R); the timeline of a shared physical link is divided into frames 1-4]
Admission control: reject a new flow if it would make the network unable to transport all the injected flits within a frame interval
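The admission rule can be sketched as a per-link budget check. This is a minimal sketch under one stated assumption that is not on the slide: each link can carry at most `frame_size` flits per frame interval, so a flow is rejected if its reservation would push any link on its path past that budget. The class and parameter names are hypothetical.

```python
class AdmissionController:
    """Tracks flits reserved per frame on every link (hypothetical model)."""

    def __init__(self, frame_size):
        self.frame_size = frame_size  # assumed per-link capacity in flits per frame
        self.reserved = {}            # link id -> flits already reserved per frame

    def admit(self, path_links, flits_per_frame):
        """Admit the flow only if every link on its path stays within budget."""
        for link in path_links:
            if self.reserved.get(link, 0) + flits_per_frame > self.frame_size:
                return False          # reject without changing any reservation
        for link in path_links:
            self.reserved[link] = self.reserved.get(link, 0) + flits_per_frame
        return True

# Usage: two flows share link "b"; the second request is too large.
ac = AdmissionController(frame_size=10)
first = ac.admit(["a", "b"], 6)
second = ac.admit(["b", "c"], 5)
third = ac.admit(["b", "c"], 4)
```

Here the first and third requests are admitted and the second is rejected, because link "b" would otherwise carry 11 flits in a 10-flit frame.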
Jae W. Lee (15 / 33)
Single frame does not service bursty traffic well
Both traffic sources have the same long-term rate: 2 flits / frame.
Allocating 2 flits / frame penalizes the bursty source.
[Figure: injection timelines over frames 1-5 for a regulated source and a bursty source]
Jae W. Lee (16 / 33)
Overlapping multiple frames to help bursty traffic
Overlapping multiple frames to multiply injection slots
Sources can inject flits into future frames (w/ separate per-frame buffers)
Older frames have higher priorities for contended channels.
Drain time of head frame does not change.
Future frames can use BW left unclaimed by older frames.
Maximum network delay < 3 * (frame interval)
Best-effort traffic: always lowest priority (throughput ↑)
[Figure: timeline of overlapping frames 1-7; at any time the head frame and 2 future frames are active]
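The effect of overlapping frames on a bursty source can be illustrated with a toy model of the source side only. It is a sketch under simplifying assumptions that are not from the talk: time advances one frame interval per step, the window shifts exactly once per step, and a flit is placed in the oldest active frame that still has an unused injection slot. The function name `simulate_injection` is hypothetical.

```python
def simulate_injection(arrivals, quota, window):
    """Return the source backlog (flits still waiting) after each frame interval.

    arrivals[t] is the number of flits generated during interval t,
    quota is the number of injection slots per frame for this flow,
    window is the number of simultaneously active frames.
    """
    used = {}       # frame number -> injection slots already claimed
    backlog = 0
    history = []
    for head, new_flits in enumerate(arrivals):
        backlog += new_flits
        for frame in range(head, head + window):   # oldest frame first
            free = quota - used.get(frame, 0)
            take = min(free, backlog)
            used[frame] = used.get(frame, 0) + take
            backlog -= take
        used.pop(head, None)   # the head frame retires when the window shifts
        history.append(backlog)
    return history

bursty = [6, 0, 0, 6, 0, 0]    # long-term rate: 2 flits / frame
single = simulate_injection(bursty, quota=2, window=1)
overlapped = simulate_injection(bursty, quota=2, window=3)
```

In this model a single frame leaves the bursty source with a backlog after each burst, while a window of three frames lets the whole burst claim slots in future frames immediately. The model says nothing about delivery time inside the network, where older frames still have priority.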
Jae W. Lee (17 / 33)
Reclamation of frame buffers
Per-frame buffers (at each node) = virtual channels
At every frame window shift, frame buffers (or VCs) associated with the earliest frame in the previous epoch are reclaimed for the new futuremost frame.
[Figure: timeline of frames 1-7 over epochs 1-5; at each frame window shift, one of VC0, VC1 and VC2 is reassigned to the new futuremost frame]
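One way to picture the reclamation is a modulo mapping from frame number to VC. This is a sketch, assuming frame f always uses VC (f mod W) for a window of W frames; the slides do not state the mapping explicitly, and the function names are hypothetical.

```python
def vc_for_frame(frame_num, window_size):
    """Assumed fixed mapping: frame f is buffered in VC (f mod W)."""
    return frame_num % window_size

def shift_window(head_frame, window_size):
    """Retire the head frame and hand its VC to the new futuremost frame.

    Returns (new_head_frame, new_futuremost_frame, reclaimed_vc).
    """
    reclaimed_vc = vc_for_frame(head_frame, window_size)
    new_head = head_frame + 1
    futuremost = new_head + window_size - 1
    # The futuremost frame maps onto exactly the VC that was just freed.
    assert vc_for_frame(futuremost, window_size) == reclaimed_vc
    return new_head, futuremost, reclaimed_vc

result = shift_window(head_frame=0, window_size=3)
```

With three VCs, retiring frame 0 frees VC0, which then serves frame 3.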
Jae W. Lee (18 / 33)
Early reclamation improves network throughput
Observation: Head frame usually drains much earlier than frame interval
→ low buffer utilization
Terminate head frame early if empty
Use a global barrier network to confirm no pending packet in router or source queue belongs to head frame.
Empty buffers are reclaimed much faster and overall throughput increases (by >30% for hotspot traffic pattern).
[Figure: two timelines of frames 1-7 with frame window shifts marked, one over epochs 1-5 and one over epochs e0-e7]
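The barrier condition for early termination amounts to a global AND over all nodes. The sketch below assumes a software view in which each node reports the set of frame numbers present in its router buffers and in its source queue; this data layout and the function names are hypothetical, and a hardware barrier network would compute the same condition with per-node signals.

```python
def head_frame_drained(nodes, head_frame):
    """True if no router buffer or source queue holds a head-frame packet."""
    return all(
        head_frame not in node["router_frames"]
        and head_frame not in node["source_queue_frames"]
        for node in nodes
    )

def next_head_frame(nodes, head_frame):
    """Advance the head frame early only when the barrier condition holds."""
    return head_frame + 1 if head_frame_drained(nodes, head_frame) else head_frame

nodes = [
    {"router_frames": {3, 4}, "source_queue_frames": {4}},
    {"router_frames": {4}, "source_queue_frames": set()},
]
```

With these two nodes, a head frame of 2 has fully drained and can be terminated early, while a head frame of 3 is still pending in the first router.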
Jae W. Lee (19 / 33)
GSF in action: two-router network example (3 VCs)
[Figure: two routers, each with VC 0 (Fr0), VC 1 (Fr1) and VC 2 (Fr2), carrying packets of Flows A, B, C and D; frames 0-5 are shown with the active frame window marked]
Jae W. Lee (20 / 33)
GSF in action: two-router network example (3 VCs)
[Figure: next step of the same example; packets of Flows A, B, C and D occupy VC 0 (Fr0), VC 1 (Fr1) and VC 2 (Fr2) at the two routers; frames 0-5 with the active frame window marked]
Jae W. Lee (21 / 33)
GSF in action: two-router network example (3 VCs)
[Figure: next step of the same example; packets of Flows A, B, C and D occupy VC 0 (Fr0), VC 1 (Fr1) and VC 2 (Fr2) at the two routers; frames 0-5 with the active frame window marked]
Jae W. Lee (22 / 33)
GSF in action: two-router network example (3 VCs)
[Figure: next step of the same example; packets of Flows A, B, C and D occupy VC 0 (Fr0), VC 1 (Fr1) and VC 2 (Fr2) at the two routers; frames 0-5 with the active frame window marked]
Jae W. Lee (23 / 33)
GSF in action: two-router network example (3 VCs)
[Figure: next step of the same example; a frame buffer is marked "empty!" at both routers; frames 0-5 with the active frame window marked]
Jae W. Lee (24 / 33)
GSF in action: two-router network example (3 VCs)
[Figure: frame window shift; at both routers VC 0 is reassigned from Fr0 to Fr3 while VC 1 (Fr1) and VC 2 (Fr2) are unchanged; frames 0-5 with the new active frame window marked]
Jae W. Lee (25 / 33)
Carpool lane sharing
[Figure: with fixed mapping, VC 0, VC 1 and VC 2 hold frames F_k, F_(k+1) and F_(k+2) respectively; with carpool lane sharing, VC 0 is the carpool lane reserved for the head frame F_k only, while VC 1 and VC 2 are shared by F_k, F_(k+1) and F_(k+2)]
Buffers are expensive in an on-chip environment.
With a fixed frame-VC mapping, a flit (e.g. a flit in Frame (k+1)) cannot be transported even if there is an empty slot in other frame buffers.
Carpool lane sharing: relaxing frame-VC mapping to improve buffer utilization
Reserve one frame buffer (VC0) for head frame only
→ does not increase the drain time of head frame
The other buffers are now colorless and can be used by any frame.
Head-of-line (HoL) blocking is prevented by not allowing two packets to occupy a VC simultaneously (OK for shallow buffers).
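The carpool lane rule can be sketched as a VC selection function. This is an illustrative sketch, not the router's allocator: it assumes that a VC counts as available only when it holds no packet, and that a head-frame packet tries VC0 before the shared VCs. That search order is an assumption, and the function name `pick_vc` is hypothetical.

```python
def pick_vc(packet_frame, head_frame, vc_busy):
    """Return the index of a VC the packet may occupy, or None if none is free.

    vc_busy[i] is True when VC i already holds a packet; a VC is never
    shared by two packets, which avoids head-of-line blocking.
    """
    if packet_frame == head_frame:
        candidates = range(0, len(vc_busy))   # carpool lane (VC0) plus shared VCs
    else:
        candidates = range(1, len(vc_busy))   # shared ("colorless") VCs only
    for vc in candidates:
        if not vc_busy[vc]:
            return vc
    return None

head_choice = pick_vc(packet_frame=5, head_frame=5, vc_busy=[False, True, True])
future_choice = pick_vc(packet_frame=6, head_frame=5, vc_busy=[False, True, True])
```

In the usage above, the head-frame packet still finds VC0, while the future-frame packet must wait because both shared VCs are occupied.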
Jae W. Lee (26 / 33)
Baseline virtual channel (VC) router
Best-effort router
Three-stage pipeline with look-ahead routing: VA/NRC-SA-ST
Credit-based flow control
VC, SW allocators: iSlip
uses round-robin arbiters (locally fair)
updates the priority of each arbiter only when that arbiter generates a winning grant
[Figure: baseline router for 2D mesh networks, with flit buffers organized as VC0-VC3, a 5x5 crossbar with outputs to P, N, E, S and W, and VC allocation, SW allocation and next route computation logic]
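The pointer-update rule of the round-robin arbiters can be sketched as follows. This is a simplified single arbiter, not the simulator's allocator code; it assumes that "winning grant" means the grant was accepted in the overall allocation, in which case the caller invokes `commit`. The class and method names are hypothetical.

```python
class RoundRobinArbiter:
    """Single round-robin arbiter with an iSlip-style pointer update."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.pointer = 0   # input that currently has the highest priority

    def grant(self, requests):
        """Return the first requesting input at or after the pointer, or None."""
        for offset in range(self.num_inputs):
            idx = (self.pointer + offset) % self.num_inputs
            if requests[idx]:
                return idx
        return None

    def commit(self, winner):
        """Rotate priority past the winner; call only for a winning grant."""
        self.pointer = (winner + 1) % self.num_inputs

arb = RoundRobinArbiter(4)
requests = [True, False, True, False]
first = arb.grant(requests)     # input 0 has priority initially
repeat = arb.grant(requests)    # no commit yet, so the pointer has not moved
arb.commit(first)
second = arb.grant(requests)    # priority has rotated past input 0
```

Because the pointer moves only on `commit`, an input whose grant did not win keeps its place in the rotation.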
Jae W. Lee (27 / 33)
GSF router
VC0: carpool lane reserved for head frame only
New registers: head_frame (HF) (per node), frame_num (per VC)
NRC: priority precalculation
(frame_num-HF) (mod W) (0 is the highest priority.)
VC and SW allocation: priority enforcement
Global barrier network for frame window shifting
[Figure: GSF router, i.e. the baseline router (flit buffers VC0-VC3, 5x5 crossbar to P, N, E, S and W, next route computation, VC allocation, SW allocation) extended with a frame_num register per VC, a head_frame register (head_frame=2 in the example), and a global barrier network that receives head_frame_empty_at_node_X and drives increment_head_frame]
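The priority precalculation is small enough to write out. The sketch below computes (frame_num - HF) mod W and picks the request with the lowest value; the tie-breaking (first request wins) is an assumption made for brevity, and the function names are hypothetical.

```python
def frame_priority(frame_num, head_frame, window_size):
    """Distance of a frame from the head frame; 0 is the highest priority."""
    return (frame_num - head_frame) % window_size

def arbitrate(requests, head_frame, window_size):
    """Pick the requester whose packet belongs to the oldest frame.

    requests is a list of (requester_id, frame_num) pairs. Ties go to the
    first entry here; a real allocator would need its own tie-breaking rule.
    """
    if not requests:
        return None
    best = min(requests, key=lambda r: frame_priority(r[1], head_frame, window_size))
    return best[0]

# head_frame = 2 with a window of 3 frames, frame numbers stored modulo 3
winner = arbitrate([("N", 1), ("E", 2), ("W", 0)], head_frame=2, window_size=3)
```

With head_frame = 2 and W = 3, frame 2 has priority 0, frame 0 (the wrapped-around next frame) has priority 1, and frame 1 has priority 2, so the request from "E" wins.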
Jae W. Lee (28 / 33)
Simulation setup
Network simulator: Booksim
0.5 M cycles with 50K-cycle warming up
Network configuration
8x8 2D mesh, dimension-ordered routing, 1 flit/cycle link capacity
Four traffic patterns
one QoS traffic pattern: hotspot
three best-effort traffic patterns: uniform random, transpose, nearest neighbor
packet size is either 1 or 9 flits (with 50-50 chance)
Baseline VC router
3-stage pipeline (VA/NRC-SA-ST), 2-cycle credit pipeline delay
6 VCs/physical link, buffer depth is 5 flits/VC
GSF parameters
frame window size = 6 [frames], frame size = 1,000 [flits]
global barrier latency = 16 [cycles] (conservative)
Jae W. Lee (29 / 33)
Flexible guaranteed QoS provided
All flows receive more than their minimum guaranteed bandwidth (Ri/eMAX) in accessing hotspot.
Ri: # of flit injection slots for Flow i
eMAX: maximum epoch interval.
Example: 8x8 mesh network
[Figure: accepted throughput [flits/cycle/node] versus node index (X, Y): (a) fair allocation, (b) differentiated allocation]
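The guarantee Ri/eMAX is a simple ratio, shown here as a worked example. The numbers below are hypothetical and chosen only for illustration (they are not results from the talk), and the function name is likewise hypothetical.

```python
from fractions import Fraction

def min_guaranteed_bw(injection_slots, max_epoch_interval):
    """Minimum guaranteed bandwidth Ri / eMAX, in flits per cycle."""
    return Fraction(injection_slots, max_epoch_interval)

# Hypothetical values: Ri = 10 injection slots, eMAX = 1,000 cycles
bw = min_guaranteed_bw(injection_slots=10, max_epoch_interval=1000)
```

With these illustrative values the flow is guaranteed at least 1/100 flit per cycle; doubling Ri doubles the guarantee, which is how differentiated allocation is expressed.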
Jae W. Lee (30 / 33)
Flexible guaranteed QoS provided
All flows receive more than their minimum guaranteed bandwidth (Ri/eMAX) in accessing hotspot.
Ri: # of flit injection slots for Flow i
eMAX: maximum epoch interval.
Example: 16x16 torus network with 4 hotspot nodes
[Figure: accepted throughput [flits/cycle/node] versus node index (X, Y): (c) differentiated allocation]
Jae W. Lee (31 / 33)
Small throughput degradation for best-effort traffic
[Figure: average delay [cycles] versus offered load [flits/cycle/node] for iSlip, GSF/1, GSF/8 and GSF/16 under uniform random, transpose and nearest neighbor traffic; saturation throughput differences of 12 % and 7 % are annotated]
Network behavior with non-QoS traffic
no latency increase in uncongested region
at most 12 % degradation of network saturation throughput
→ can be reduced with larger frame (at the cost of delay bound increase)
Jae W. Lee (32 / 33)
Related work
QoS support in IP or multiprocessor networks
Fair Queueing [SIGCOMM ’89], Virtual Clock [SIGCOMM ’90]
Multi-rate channel switching [IEEE Comm ’86]
Source throttling [HPCA ’01]
Age-based arbitration [IEEE TPADS ’92, SC ’07]
Rotating Combined Queueing (RCQ) [ISCA ’96]
→ expensive, inflexible, and/or without guaranteed QoS
QoS on-chip networks
AEthereal (strict TDM; exp. channel setup) [IEEE Design & Test ’05]
SonicsMX (per-thread queues at each node) [DATE ’05]
MANGO clockless NoC (partitioning GS and BE VCs) [DATE ’05]
Nostrum (routes fixed at design time) [DATE ’04]
Jae W. Lee (33 / 33)
Conclusion
The GSF network is
guaranteed QoS-capable
with minimum bandwidth guarantees and maximum delay
flexible
fair and differentiated bandwidth allocation
no explicit channel setup required along the path