I/O Congestion Avoidance via Routing and Object Placement

  1. I/O Congestion Avoidance via Routing and Object Placement
     David Dillow, Galen Shipman, Sarp Oral, and Zhe Zhang

  2. Motivation
     ● Goal: 240 GB/s routed
     ● Direct-attached vs. center-wide storage
     ● Limited allocations
       ● INCITE awards average 27 million hours
       ● Prefer to spend that time computing
     ● Performance issues at scale

  3. Spider Resources
     ● 48 DDN 9900 couplets
       ● 13,440 1 TB SATA hard drives
       ● DDR InfiniBand connectivity
     ● 192 Dell PowerEdge 1950 OSS nodes
       ● 16 GB memory
       ● 2x quad-core Xeons @ 2.3 GHz
     ● 4 Cisco 7024D 288-port DDR IB switches
     ● 48 Flextronics 24-port DDR IB switches

  4. Wiring up SION (diagram of SION cabling; labeled link counts: 32, 96, 96, 64, 64, 8, 96, and 96 links)

  5. Direct-attached Traffic Flow (diagram: clients → fabric → OSS → storage)

  6. Direct-attached Raw I/O Baseline

  7. Direct-attached Lustre Baseline

  8. Writer Skew

  9. SeaStar Bandwidth

  10. Link Oversubscription
      ● Each link can sustain ~3100 MB/s (unidirectional)
      ● Each OST can contribute 180 MB/s with a balanced load presented to the DDN 9900
        ● 260 MB/s when driven individually
      ● Therefore, each link can support 17 client-OST pairs at saturation
        ● 11 client-OST pairs @ 260 MB/s
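The arithmetic behind these per-link limits, as a quick sanity check (the bandwidth figures are the ones quoted on the slide):

```python
# Back-of-the-envelope check of the oversubscription math quoted above.
# Figures come from the slide: ~3100 MB/s per SeaStar link (unidirectional),
# 180 MB/s per OST under a balanced DDN 9900 load, 260 MB/s for a lone OST.

LINK_BW_MBS = 3100          # sustained unidirectional SeaStar link bandwidth
OST_BW_BALANCED_MBS = 180   # per-OST bandwidth with a balanced couplet load
OST_BW_SOLO_MBS = 260       # per-OST bandwidth when driven individually

pairs_at_saturation = LINK_BW_MBS // OST_BW_BALANCED_MBS   # -> 17 client-OST pairs
pairs_at_solo_rate = LINK_BW_MBS // OST_BW_SOLO_MBS        # -> 11 client-OST pairs

print(f"{pairs_at_saturation} pairs/link at {OST_BW_BALANCED_MBS} MB/s each")
print(f"{pairs_at_solo_rate} pairs/link at {OST_BW_SOLO_MBS} MB/s each")
```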

  11. Link Oversubscription
      ● 70% of tests had more than one link with 18 client-OST pairs
      ● 42% had more than 34 pairs
      ● 21% had more than 60 pairs
      ● 3% had more than 70 pairs
      But that's only part of the issue

  12. Imbalanced Sharing

  13. Placing I/O in the Torus
      ● We want to minimize link congestion
        ● Prefer no more than 11 client-OST pairs per link
      ● Easiest method is to place active clients topologically close to servers
      ● Use hop count as our metric

  14. Placing I/O in the Torus
      ● For each OST to be used:
        ● Calculate the hop count to its OSS from each client
        ● Pick the client with the lowest count
        ● Remove that client from further consideration
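A minimal sketch of this greedy placement, assuming node positions on a 3-D torus are known; the coordinate maps, helper names, and toy topology below are illustrative, not taken from the slides:

```python
# Greedy client-to-OST placement by hop count, as described on the slide.
# Coordinates, torus dimensions, and data structures here are illustrative
# assumptions; the real system derives them from the machine's 3-D torus.

def torus_hops(a, b, dims):
    """Minimal hop count between two nodes on a 3-D torus."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def place_clients(osts, ost_to_oss, node_coord, clients, dims):
    """For each OST, pick the unused client closest to that OST's OSS."""
    placement = {}
    available = set(clients)
    for ost in osts:
        oss = ost_to_oss[ost]                     # OST -> serving OSS
        best = min(available,
                   key=lambda c: torus_hops(node_coord[c], node_coord[oss], dims))
        placement[ost] = best
        available.remove(best)                    # no further consideration
    return placement

# Toy example on a 4x4x4 torus with two OSTs and four candidate clients.
dims = (4, 4, 4)
coords = {"oss0": (0, 0, 0), "oss1": (2, 2, 2),
          "c0": (0, 0, 1), "c1": (3, 3, 3), "c2": (2, 2, 1), "c3": (1, 1, 1)}
print(place_clients(["ost0", "ost1"], {"ost0": "oss0", "ost1": "oss1"},
                    coords, ["c0", "c1", "c2", "c3"], dims))
```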

  15. Placing I/O in the Torus (diagram: clients → fabric → OSS → storage)

  16. Placement Results

  17. Improved Writer Skew

  18. Does it work in a smaller space?

  19. LNET Routing
      ● Allows us to separate storage from the compute platform
      ● Very simple in nature
        ● List of routers for each remote LNET
        ● Routers can have different weights
      ● 1024-character maximum for the route option
        ● Use lctl add_route for larger configurations

  20. Simple LNET Routing
      ● 196 routers in the torus
      ● Client uses each router in a weight class in a round-robin manner
        ● 8 back-to-back messages to a single destination will use 8 different routers
      ● Congestion in both the torus and the InfiniBand fabric
      ● No opportunity to improve placement to control congestion
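A minimal sketch of the round-robin behavior described here, assuming a single weight class containing all 196 routers; the router names are placeholders:

```python
# Illustration of simple LNET routing: the client walks one list of routers
# round-robin, so back-to-back messages to a single destination fan out
# across the torus. Router naming is an illustrative placeholder.

from itertools import count

ROUTERS = [f"router{i}" for i in range(196)]   # 196 routers, one weight class
_next = count()

def pick_router():
    """Round-robin over every router in the weight class."""
    return ROUTERS[next(_next) % len(ROUTERS)]

# Eight back-to-back messages to the same OSS land on eight different routers,
# so the client cannot keep its traffic topologically close to the target.
print([pick_router() for _ in range(8)])
```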

  21. Simple LNET Routing (diagram: client → fabric → router → storage)

  22. InfiniBand Congestion

  23. Improved Routing Configurations
      ● Aim to eliminate InfiniBand congestion
      ● Aim to reduce torus congestion
      ● Provide the ability for an application to determine which router will be used for a particular OST
        ● given the OST-to-OSS mapping
        ● given the client-to-router mappings

  24. Nearest Neighbor
      ● 32 sets of routers
        ● one set for each leaf module
        ● 6 OSS servers in each set
        ● 6 routers in each set
      ● Each client chooses the nearest router to talk to the OSSes in a set
      ● Variable performance
        ● by job size
        ● by job location
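A minimal sketch of the per-set choice, reusing the hop-count idea from the placement sketch above; the torus dimensions, router coordinates, and set membership are illustrative assumptions:

```python
# Nearest-neighbor configuration: for each of the 32 router sets, a client
# talks to whichever of that set's six routers is fewest torus hops away.
# Coordinates, dimensions, and set contents are illustrative assumptions.

def torus_hops(a, b, dims):
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def nearest_router(client_xyz, router_set, coords, dims):
    """Choose the router in this set closest to the client."""
    return min(router_set, key=lambda r: torus_hops(client_xyz, coords[r], dims))

dims = (8, 16, 24)                                  # illustrative torus dimensions
coords = {f"rtr{i}": (i, 2 * i, 3 * i) for i in range(6)}   # one six-router set
print(nearest_router((3, 5, 7), list(coords), coords, dims))
```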

  25. Nearest Neighbor (diagram: clients → fabric → routers and storage, Groups A and B)

  26. Round Robin
      ● Again, 32 sets of routers
      ● Ordered list of routers for each set
      ● Client chooses router (nid % 6) for the set
      ● Throws I/O traffic around the torus
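A minimal sketch of the selection rule, assuming each client holds the ordered six-router list for the set it is targeting; the router names and NIDs are illustrative:

```python
# Round-robin configuration: a client picks a router from the target set's
# ordered list using its own NID, so different clients spread across the set.
# Router names and NID values are illustrative placeholders.

def choose_router(client_nid, router_set):
    """Deterministic per-client choice: router index = nid % 6."""
    return router_set[client_nid % len(router_set)]

group_a = [f"rtrA{i}" for i in range(6)]   # ordered list for one of the 32 sets
for nid in (100, 101, 102, 103):
    print(nid, "->", choose_router(nid, group_a))
```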

  27. Round Robin (diagram: clients → fabric → routers and storage, Groups A and B)

  28. Projection
      ● 192 LNET networks
        ● one for each OSS
      ● One primary router for each LNET
        ● backup routers added at higher weights
      ● Clients experience variable latency to OSSes based on location
      ● Placement calculations similar to direct-attached
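A minimal sketch of how the projected configuration could be modeled: one LNET network per OSS, a single primary router, and backups listed at higher (less-preferred) weights. The network names, backup assignments, and weight values are illustrative assumptions, not the production settings:

```python
# Projected routing: one LNET network per OSS, one primary router, and
# backups at higher weight so they are only used if the primary is down.
# Network names, backup choices, and weights are illustrative assumptions.

routes = {}
for oss in range(192):
    net = f"o2ib{oss}"                       # one LNET network per OSS
    primary = f"rtr{oss:03d}"                # the router "projecting" this OSS
    backups = [f"rtr{(oss + k) % 192:03d}" for k in (64, 128)]
    routes[net] = [(primary, 1)] + [(b, 10) for b in backups]   # lower weight wins

def router_for(net, alive):
    """Pick the lowest-weight reachable router for a destination network."""
    candidates = [(w, r) for r, w in routes[net] if r in alive]
    return min(candidates)[1] if candidates else None

alive = {f"rtr{i:03d}" for i in range(192)} - {"rtr005"}   # pretend rtr005 is down
print(router_for("o2ib5", alive))   # falls back to a higher-weight backup router
```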

  29. Projection (diagram: client → fabric → router → storage)

  30. Routed Placement Results (IOR)

  31. Conclusions
      ● Goals exceeded: 244 GB/s on routed storage
      ● “Projected” configuration in production
      ● Working with library developers to bring this to users

  32. Questions?
      ● Contact info: David Dillow, 865-241-6602, dillowda@ornl.gov
      This research used resources of the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-AC05-00OR22725.
      Notice: This manuscript has been authored by UT-Battelle, LLC, under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.
