I/O Congestion Avoidance via Routing and Object Placement
David Dillow, Galen Shipman, Sarp Oral, and Zhe Zhang
Motivation
- Goal: 240 GB/s routed
- Direct-attached vs. Center-wide
- Limited allocations
- INCITE averages 27 million hours
- Prefer to spend time computing
- Performance issues at scale
Spider Resources
- 48 DDN 9900 Couplets
- 13,440 SATA 1 TB hard drives
- DDR InfiniBand connectivity
- 192 Dell PowerEdge 1950 OSS
- 16 GB memory
- 2x Quad-core Xeons @ 2.3 GHz
- 4 Cisco 7024D 288 port DDR IB switches
- 48 Flextronics 24 port DDR IB switches
Wiring up SION
[Figure: SION wiring diagram; link bundles of 64, 32, 96, 96, 96, 96, 64, and 8 links]
Direct-attached Traffic Flow
[Diagram: direct-attached traffic flow: client, fabric, OSS, storage]
Direct-attached Raw I/O Baseline
Direct-attached Lustre Baseline
Writer Skew
SeaStar Bandwidth
Link Oversubscription
- Each link can sustain ~3100 MB/s (unidirectional)
- Each OST can contribute 180 MB/s with a balanced load presented to the DDN 9900
- 260 MB/s individually
- Therefore, each link can support 17 client-OST pairs at saturation
- 11 client-OST pairs @ 260 MB/s
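The saturation figures above follow directly from the per-link and per-OST rates; a quick check of the arithmetic:

```python
# Back-of-the-envelope check of the link saturation figures above.
LINK_BW = 3100    # MB/s sustained per link (unidirectional)
BALANCED = 180    # MB/s per OST with a balanced load on the DDN 9900
INDIVIDUAL = 260  # MB/s per OST when accessed individually

print(LINK_BW // BALANCED)    # 17 client-OST pairs at the balanced rate
print(LINK_BW // INDIVIDUAL)  # 11 client-OST pairs at the individual rate
```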
Link Oversubscription
- 70% of tests had more than one link with 18 client-OST pairs
- 42% had more than 34 pairs
- 21% had more than 60
- 3% had more than 70
But that's only part of the issue
Imbalanced Sharing
Placing I/O in the Torus
- We want to minimize link congestion
- Prefer no more than 11 client-OST pairs
- Easiest method is to place active clients
topologically close to servers
- Use hop count as our metric
Placing I/O in the Torus
- For each OST to be used
- Calculate the hop count to the OSS from each client
- Pick the client with the lowest count
- Remove that client from further consideration
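The greedy pass above can be sketched as follows; this is an illustrative sketch, not the actual placement tool. `hop_count` is a stand-in metric (plain Manhattan distance on torus coordinates, ignoring wraparound), and all names are assumptions.

```python
# Sketch of the greedy client placement pass (illustrative names).

def hop_count(client, oss):
    """Hop distance between a client and an OSS.
    Simplified stand-in: Manhattan distance, ignoring torus wraparound."""
    return sum(abs(a - b) for a, b in zip(client, oss))

def place_clients(osts, oss_of_ost, clients):
    """For each OST in use, pick the closest remaining client,
    then remove that client from further consideration."""
    placement = {}
    available = set(clients)
    for ost in osts:
        best = min(available, key=lambda c: hop_count(c, oss_of_ost[ost]))
        placement[ost] = best
        available.remove(best)
    return placement

# Toy example: two OSTs served by OSSes at opposite ends of a line of clients.
print(place_clients([0, 1], {0: (0, 0), 1: (5, 0)}, [(1, 0), (4, 0)]))
# → {0: (1, 0), 1: (4, 0)}
```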
Placing I/O in the Torus
[Diagram: placement-aware traffic flow: client, fabric, OSS, storage]
Placement Results
Improved Writer Skew
Does it work in a smaller space?
LNET Routing
- Allows us to separate storage from the compute platform
- Very simple in nature
- List of routers for each remote LNET
- Routers can have different weights
- 1024-character max for the route option
- Use lctl add_route for larger configs
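For reference, the two configuration paths above look roughly like this; the net names, NIDs, and ranges are placeholders, not Spider's actual configuration:

```shell
# Static route list via the LNET module option (subject to the 1024
# character limit noted above), e.g. in /etc/modprobe.d/lustre.conf:
#   options lnet routes="o2ib 10.10.0.[1-196]@ptl"

# Larger configurations can instead be loaded at runtime, one route at
# a time (remote net, then gateway NID, per the Lustre manual):
lctl --net o2ib add_route 10.10.0.1@ptl
```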
Simple LNET Routing
- 196 routers in the torus
- Client uses each router in a weight class in a round-robin manner
- 8 back-to-back messages to a single destination will use 8 different routers
- Congestion in both torus and InfiniBand
- No opportunity to improve placement to control congestion
Simple LNET Routing
[Diagram: routed traffic flow: client, fabric, router, storage]
InfiniBand Congestion
Improved Routing Configurations
- Aim to eliminate InfiniBand congestion
- Aim to reduce torus congestion
- Provide the ability for an application to determine which router will be used for a particular OST
- given the OST-to-OSS mapping
- given client-to-router mappings
Nearest Neighbor
- 32 sets of routers
- one set for each leaf module
- 6 OSS servers in each set
- 6 routers in each set
- Each client chooses the nearest router to talk to the OSSes in a set
- Variable performance
- by job size
- by job location
Nearest Neighbor
[Diagram: nearest-neighbor routing; each client is labeled 1 or 2 for the router group (A or B) it uses, with traffic crossing the fabric to that group's routers and storage]
Round Robin
- Again, 32 sets of routers
- Ordered list of routers for each set
- Client chooses router (nid % 6) for the set
- Throws I/O traffic around the torus
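The nid-based choice above can be sketched as follows (illustrative names, not the actual client code):

```python
# Sketch of the per-set round-robin choice: a client indexes the set's
# ordered router list by its nid modulo the set size.

ROUTERS_PER_SET = 6

def choose_router(client_nid, ordered_routers):
    """Return the router this client uses for a given router set."""
    return ordered_routers[client_nid % ROUTERS_PER_SET]

routers = ["rtr0", "rtr1", "rtr2", "rtr3", "rtr4", "rtr5"]
print(choose_router(8, routers))  # → rtr2  (8 % 6 == 2)
```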
Round Robin
[Diagram: round-robin routing; clients labeled 1 or 2 by assigned router group, scattering traffic across the torus to both groups' routers and storage]
Projection
- 192 LNET networks
- one for each OSS
- One primary router for each LNET
- add higher weights for backup routers
- Clients experience variable latency to OSSes based on location
- Placement calculations similar to direct-attached
Projection
[Diagram: projected routing: client, fabric, router, storage]
Routed Placement Results (IOR)
Conclusions
- Goals exceeded: 244 GB/s on routed storage
- “Projected” configuration in production
- Working with library developers to bring this to users
Questions?
- Contact info:
David Dillow 865-241-6602 dillowda@ornl.gov
This research used resources of the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-AC05-00OR22725. Notice: This manuscript has been authored by UT-Battelle, LLC, under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.