globally-synchronized frames for guaranteed quality-of-service




  1. GLOBALLY-SYNCHRONIZED FRAMES FOR GUARANTEED QUALITY-OF-SERVICE IN ON-CHIP NETWORKS. Jae W. Lee (MIT), Man Cheuk Ng (MIT), Krste Asanovic (UC Berkeley). June 23, 2008. ISCA-35, Beijing, China.

  2. Resource sharing increases performance variation
- Resource sharing (+) reduces hardware cost but (-) increases performance variation.
- This performance variation grows larger and larger as the number of sharers (cores) increases.
(Figure: 16 processors sharing L2$ banks and memory controllers over a multi-hop on-chip network.)

  3. Desired quality-of-service from shared resources
- Performance isolation (fairness): every processor receives at least a minimum guaranteed bandwidth, even in the presence of a hotspot.
(Figure: accepted throughput [MB/s] for processor IDs 0-F, with a minimum-guaranteed-BW line and a hotspot bank/controller.)

  4. Desired quality-of-service from shared resources
- Performance isolation (fairness).
- Differentiated services (flexibility): bandwidth can be allocated unequally across processors on top of the minimum guarantee.
(Figure: accepted throughput [MB/s] for processor IDs 0-F, showing both the minimum guaranteed BW and a differentiated allocation.)

  5. Resources w/ centralized arbitration are well investigated
- Resources with centralized arbitration: SDRAM controllers, L2 cache banks.
- They have a single entry point for all requests → QoS is relatively easier and well investigated. [MICRO '06] [PACT '07] [USENIX Sec. '07] [IBM '07] [MICRO '07] [HPCA '02] [ICS '04] [ISCA '07] [ISCA '08] ...
(Figure: tiled processors with L1$s connected through on-chip routers to an L2$ bank and a memory controller.)

  6. QoS from on-chip networks is a challenge
- Resources with distributed arbitration: multi-hop on-chip networks (on-chip routers).
- They have distributed arbitration points → QoS is more difficult.
- Off-chip solutions cannot be directly applied because of resource constraints.
(Figure: the same tiled layout, highlighting the on-chip routers.)

  7. We guarantee QoS for flows
- Flow: a sequence of packets between a unique pair of end nodes (source and destination).
- Physical links are shared by flows, and each packet goes through multiple stages of arbitration.
- We provide guaranteed QoS to each flow with: minimum bandwidth guarantees and a bounded maximum delay.
(Figure: 4x4 router grid with one physical link shared by 3 flows converging on a hotspot resource.)

  8. Locally fair ⇏ globally fair
- Flows A, B, C, and D merge toward the destination through arbitration points 1, 2, and 3; channel rate = C [Gb/s].
- With locally fair round-robin (RR) arbitration:
  Throughput (Flow A) = (0.5) C
  Throughput (Flow B) = (0.5)^2 C
  Throughput (Flow C) = Throughput (Flow D) = (0.5)^3 C
→ Throughput of a flow decreases exponentially as its distance to the destination (hotspot) increases.
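The round-robin arithmetic on this slide can be checked with a small sketch (the model and names are illustrative, not taken from the paper):

```python
# Model the chain of 2-input round-robin arbiters from the slide: each arbiter
# splits its output channel 50/50, so a flow's share of the channel is halved
# at every arbitration point it crosses on the way to the hotspot.
C = 1.0  # channel rate [Gb/s]

def flow_rate(arbiters_crossed, channel_rate=C):
    # Each 2-input RR arbiter halves the flow's remaining share.
    return channel_rate * 0.5 ** arbiters_crossed

# Flow A joins at the last arbiter; flows C and D merge at the first one.
rates = {"A": flow_rate(1), "B": flow_rate(2), "C": flow_rate(3), "D": flow_rate(3)}
assert rates == {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
assert sum(rates.values()) == C  # the channel is saturated, yet shared unfairly
```

The sketch makes the slide's point concrete: local fairness at every arbiter still yields an exponentially unfair global allocation.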

  9. Motivational simulation
- 8x8 2D mesh network with RR arbitration; hotspot at node (8, 8).
(Figure: accepted throughput [flits/cycle/node] over node indices (X, Y), shown with minimal-adaptive routing and with dimension-ordered routing; nodes far from the hotspot receive far less bandwidth.)
→ Locally fair round-robin scheduling yields globally unfair bandwidth usage.

  10. Desired bandwidth allocation: an example
- Taken from simulation results with GSF.
(Figure: accepted throughput [flits/cycle/node] over the 8x8 mesh, shown for both fair allocation and differentiated allocation.)

  11. Globally Synchronized Frames (GSF) provide guaranteed QoS, with minimum bandwidth guarantees and a bounded maximum delay, to each flow in multi-hop on-chip networks:
- with high network utilization, comparable to a best-effort virtual-channel router
- with minimal area/energy overhead, by avoiding per-flow queues/structures in on-chip routers → scalable in the number of concurrent flows

  12. Outline of this talk
- Motivation
- Globally-Synchronized Frames: a step-by-step development of the mechanism
- Implementation of the GSF router
- Evaluation
- Related work
- Conclusion

  13. GSF takes a frame-based approach
- A frame is a coarse quantization of time; the network can transport a finite number of flits during each frame interval.
- We constrain each flow source to inject at most a certain number of flits per frame.
- Shorter frames → coarser BW control but lower maximum delay.
- Typically 1 to 100s of Kflits per frame (over all flows) in an 8x8 mesh network.
(Figure: a shared physical link with frames 0-4 laid out along the time axis.)
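The per-frame injection constraint can be sketched as a simple source-side regulator (class and method names are hypothetical, not the paper's hardware):

```python
# Source regulator enforcing an injection quota of `rho` flits per frame.
# The source may inject only while it has credits for the current frame;
# credits replenish at each frame boundary.
class FrameRegulator:
    def __init__(self, rho):
        self.rho = rho          # reserved flits per frame for this flow
        self.credits = rho      # remaining injection slots in the current frame

    def try_inject(self):
        if self.credits == 0:
            return False        # quota exhausted; the source must stall
        self.credits -= 1
        return True

    def new_frame(self):
        self.credits = self.rho # quota replenishes at the frame boundary

reg = FrameRegulator(rho=2)
sent = [reg.try_inject() for _ in range(3)]
assert sent == [True, True, False]  # third flit must wait for the next frame
reg.new_frame()
assert reg.try_inject()
```

This is the "single frame" version of the mechanism; the later slides relax it by letting sources borrow slots from future frames.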

  14. Admission control of flows
- Admission control: reject a new flow if it would make the network unable to transport all the injected flits within a frame interval.
(Figure: the same shared physical link and frame timeline as the previous slide.)
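One way to state the admission check, assuming per-link reservations measured in flits per frame (the formulation and all names here are an assumption, not taken verbatim from the slides):

```python
# Admit a new flow only if, on every link along its route, the total reserved
# flits per frame stay within the link's capacity for one frame interval.
def admit(flow_route, rho, reserved, frame_capacity):
    """flow_route: list of link ids; rho: requested flits/frame;
    reserved: dict mapping link id -> flits/frame already reserved."""
    if any(reserved.get(link, 0) + rho > frame_capacity for link in flow_route):
        return False                      # some link would be oversubscribed
    for link in flow_route:               # commit the reservation
        reserved[link] = reserved.get(link, 0) + rho
    return True

reserved = {}
assert admit(["l1", "l2"], rho=600, reserved=reserved, frame_capacity=1000)
assert not admit(["l2", "l3"], rho=500, reserved=reserved, frame_capacity=1000)
assert reserved == {"l1": 600, "l2": 600}  # the rejected flow reserved nothing
```

Under this invariant, every admitted flit can be drained within one frame interval, which is what the bounded-delay guarantee rests on.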

  15. A single frame does not service bursty traffic well
- Both traffic sources have the same long-term rate: 2 flits / frame.
- Allocating exactly 2 flits / frame penalizes the bursty source.
(Figure: a regulated source and a bursty source injecting across frames 0-5.)

  16. Overlapping multiple frames to help bursty traffic
- Overlapping multiple frames multiplies the injection slots: sources can inject flits into future frames (with separate per-frame buffers).
- Older frames have higher priority for contended channels.
- Drain time of the head frame does not change, and future frames can use bandwidth unclaimed by older frames.
- Maximum network delay < 3 * (frame interval).
- Best-effort traffic always gets the lowest priority (throughput ↑).
(Figure: a frame window of the head frame plus 2 future frames, sliding over frames 0-7.)
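The oldest-frame-first arbitration rule might be sketched like this (illustrative only; it ignores frame-number wraparound, which a real router must handle):

```python
# Channel arbitration: among competing requests, the one belonging to the
# oldest active frame wins; best-effort traffic (frame = None) only wins
# when no framed request is present.
def grant(requests):
    """requests: list of (flit_id, frame_number or None). Returns the winner."""
    framed = [r for r in requests if r[1] is not None]
    if framed:
        return min(framed, key=lambda r: r[1])  # oldest frame has priority
    return requests[0] if requests else None    # only best-effort traffic left

assert grant([("x", 2), ("y", 0), ("z", None)]) == ("y", 0)
assert grant([("z", None)]) == ("z", None)
```

Because the head frame always wins contended channels, its drain time is unaffected by traffic injected into future frames, which is what keeps the delay bound intact.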

  17. Reclamation of frame buffers
- Per-frame buffers (at each node) = virtual channels (VC0-VC2).
- At every frame window shift, the frame buffers (VCs) associated with the earliest frame in the previous epoch are reclaimed for the new futuremost frame.
(Figure: the frame window shifting across epochs 0-5, with VC0-VC2 cycling among frames.)
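The VC recycling implied by the window shift can be summarized as a modulo mapping (an assumption consistent with the slide's picture, not an exact description of the hardware):

```python
# With a window of W overlapping frames, per-frame buffers are recycled
# round-robin, so frame f always occupies virtual channel f mod W.
W = 3  # frame window size: number of VCs = number of overlapped frames

def vc_for_frame(frame_number, window=W):
    return frame_number % window

assert [vc_for_frame(f) for f in range(6)] == [0, 1, 2, 0, 1, 2]
# After a window shift from frames {0, 1, 2} to {1, 2, 3}, VC0 (frame 0's
# buffer) is reclaimed for the new futuremost frame 3.
```

This is why GSF needs no per-flow structures in the routers: buffer state is indexed by frame, and only W frames are ever live at once.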

  18. Early reclamation improves network throughput
- Observation: the head frame usually drains much earlier than the frame interval → low buffer utilization.
- Terminate the head frame early if it is empty.
- Use a global barrier network to confirm that no pending packet in any router or source queue belongs to the head frame.
- Empty buffers are reclaimed much faster and overall throughput increases (by >30% for the hotspot traffic pattern).
(Figure: window shifts occurring as soon as the head frame drains, across epochs e0-e7.)
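The global barrier check can be sketched as follows (a simplified software model; the paper uses a dedicated barrier network, and the names here are hypothetical):

```python
# The head frame may be retired early only when every router and source queue
# reports that it holds no flit belonging to that frame; otherwise the window
# shift waits until the full frame interval elapses.
def can_retire_head(head_frame, router_queues, source_queues):
    for q in list(router_queues) + list(source_queues):
        if any(frame == head_frame for (_flit, frame) in q):
            return False  # a flit of the head frame is still in flight
    return True           # barrier succeeds: shift the frame window now

routers = [[("a", 1)], []]
sources = [[("b", 2)]]
assert can_retire_head(0, routers, sources)      # no frame-0 flits remain
assert not can_retire_head(1, routers, sources)  # frame 1 still has a flit
```

The barrier turns the fixed frame interval into an upper bound: a lightly loaded head frame is retired as soon as it drains, which is where the >30% hotspot throughput gain comes from.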

  19. GSF in action: a two-router network example with 3 VCs
- Flows A, B, C, and D share the network.
- Each router provides per-frame buffers: VC0 for frame 0, VC1 for frame 1, VC2 for frame 2.
- The active frame window covers frames 0-2 out of frames 0-5.
(Figure: flits of each flow occupying the VCs of their respective frames in both routers.)

