SLIDE 1

I/O Congestion Avoidance via Routing and Object Placement

David Dillow, Galen Shipman, Sarp Oral, and Zhe Zhang

SLIDE 2

Motivation

  • Goal: 240 GB/s routed
  • Direct-attached vs. Center-wide
  • Limited allocations
  • INCITE allocations average 27 million hours
  • Prefer to spend time computing
  • Performance issues at scale

SLIDE 3

Spider Resources

  • 48 DDN 9900 Couplets
    • 13,440 SATA 1 TB hard drives
    • DDR InfiniBand connectivity
  • 192 Dell PowerEdge 1950 OSS nodes
    • 16 GB memory
    • 2x quad-core Xeons @ 2.3 GHz
  • 4 Cisco 7024D 288-port DDR IB switches
  • 48 Flextronics 24-port DDR IB switches

SLIDE 4

Wiring up SION

[Diagram: SION InfiniBand wiring, with link counts between stages of 64, 32, 96, 96, 96, 96, 64, and 8 links]

SLIDE 5

Direct-attached Traffic Flow

[Diagram: direct-attached traffic flow: Client -> Fabric -> OSS -> Storage]

SLIDE 6

Direct-attached Raw I/O Baseline

SLIDE 7

Direct-attached Lustre Baseline

SLIDE 8

Writer Skew

SLIDE 9

SeaStar Bandwidth

SLIDE 10

Link Oversubscription

  • Each link can sustain ~3100 MB/s (unidirectional)
  • Each OST can contribute 180 MB/s with a balanced load presented to the DDN 9900
    • 260 MB/s individually
  • Therefore, each link can support 17 client-OST pairs at saturation
    • 11 client-OST pairs @ 260 MB/s
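
The saturation counts above follow from dividing the link rate by the per-OST rate; a minimal sketch of the arithmetic in Python (all values taken from this slide):

    # Rates from the slide, in MB/s
    link_bw = 3100          # sustained unidirectional link bandwidth
    ost_balanced = 180      # per-OST rate under a balanced DDN 9900 load
    ost_individual = 260    # per-OST rate when driven individually

    # Client-OST pairs one link can carry before saturating
    pairs_balanced = link_bw // ost_balanced      # 17
    pairs_individual = link_bw // ost_individual  # 11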

SLIDE 11

Link Oversubscription

  • 70% of tests had more than one link with 18 client-OST pairs
  • 42% had more than 34 pairs
  • 21% had more than 60
  • 3% had more than 70

But that's only part of the issue.

SLIDE 12

Imbalanced Sharing

SLIDE 13

Placing I/O in the Torus

  • We want to minimize link congestion
    • Prefer no more than 11 client-OST pairs per link
  • Easiest method is to place active clients topologically close to the servers
  • Use hop count as our metric

SLIDE 14

Placing I/O in the Torus

  • For each OST to be used (a minimal sketch follows this list):
    • Calculate the hop count from its OSS to each client
    • Pick the client with the lowest count
    • Remove that client from further consideration
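
A minimal sketch of this greedy assignment in Python, assuming a hop_count() helper that returns the torus distance between two nodes and an oss_of mapping from each OST to its serving OSS (the names and data structures are illustrative, not the production implementation):

    def place_clients(osts, clients, oss_of, hop_count):
        """Greedy placement: for each OST, pick the closest unused client."""
        available = set(clients)
        placement = {}
        for ost in osts:
            oss = oss_of[ost]
            # Client with the lowest hop count to this OST's OSS
            best = min(available, key=lambda c: hop_count(c, oss))
            placement[ost] = best
            # Remove it from further consideration
            available.remove(best)
        return placement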

SLIDE 15

Placing I/O in the Torus

[Diagram: traffic flow with placed clients: Client -> Fabric -> OSS -> Storage]

SLIDE 16

Placement Results

SLIDE 17

Improved Writer Skew

SLIDE 18

Does it work in a smaller space?

SLIDE 19

LNET Routing

  • Allows us to separate storage from the compute platform
  • Very simple in nature
    • List of routers for each remote LNET
    • Routers can have different weights
  • 1024 character max for the routes option (example below)
    • Use lctl add_route for larger configs
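
For illustration only: the NIDs, network names, and weight values below are invented, and the exact syntax should be checked against the Lustre manual for the deployed version. A routes entry of this general shape lists the routers a client may use to reach the remote LNET, with the numeric field acting as the weight:

    options lnet routes="o2ib0 1 10.10.10.[1-4]@ptl0; o2ib0 2 10.10.10.[5-8]@ptl0"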

SLIDE 20

Simple LNET Routing

  • 196 routers in the torus
  • Client uses each router in a weight class in a round-robin manner
    • 8 back-to-back messages to a single destination will use 8 different routers
  • Congestion in both torus and InfiniBand
  • No opportunity to improve placement to control congestion

SLIDE 21

Simple LNET Routing

[Diagram: routed traffic flow: Client -> Router -> Fabric -> Storage]

SLIDE 22

InfiniBand Congestion

SLIDE 23

Improved Routing Configurations

  • Aim to eliminate InfiniBand congestion
  • Aim to reduce torus congestion
  • Provide the ability for an application to determine which router will be used for a particular OST
    • given the OST-to-OSS mapping
    • given the client-to-router mappings

SLIDE 24

Nearest Neighbor

  • 32 sets of routers
    • One set for each leaf module
    • 6 OSS servers in each set
    • 6 routers in each set
  • Each client chooses the nearest router to talk to the OSSes in a set (sketched below)
  • Variable performance
    • by job size
    • by job location
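
A minimal sketch of this router choice in Python, assuming the same hop_count() helper as before plus illustrative router_sets and set_of_oss tables (not the production implementation):

    def nearest_router(client, oss, router_sets, set_of_oss, hop_count):
        """Pick the router in this OSS's set that is closest to the client."""
        routers = router_sets[set_of_oss[oss]]   # the 6 routers serving this OSS's set
        return min(routers, key=lambda r: hop_count(client, r))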

SLIDE 25

Nearest Neighbor

[Diagram: nearest-neighbor routing: a row of torus clients labeled 1 or 2 by the router group each uses; the labels cluster into long contiguous runs. Traffic flows from Clients to Router (Group A or B), across the Fabric, to Storage (Group A or B).]

SLIDE 26

Round Robin

  • Again, 32 sets of routers
  • Ordered list of routers for each set
  • Client chooses router (nid % 6) for the set
  • Throws I/O traffic around the torus
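
A minimal sketch of this rule in Python, treating the client NID as an integer and reusing the illustrative router_sets and set_of_oss tables from the nearest-neighbor sketch (the slide specifies only the nid % 6 selection):

    def round_robin_router(client_nid, oss, router_sets, set_of_oss):
        """Spread clients across the set's ordered router list by NID."""
        routers = router_sets[set_of_oss[oss]]      # ordered list of 6 routers
        return routers[client_nid % len(routers)]   # nid % 6, per the slide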

SLIDE 27

Round Robin

[Diagram: round-robin routing: the same row of torus clients, now with router-group labels (1 or 2) interleaved rather than clustered, spreading I/O traffic around the torus. Traffic flows from Clients to Router (Group A or B), across the Fabric, to Storage (Group A or B).]

SLIDE 28

Projection

  • 192 LNET networks
    • One for each OSS
  • One primary router for each LNET
    • Backup routers added with higher weights
  • Clients experience variable latency to OSSes based on location
  • Placement calculations similar to direct-attached

SLIDE 29

Projection

[Diagram: projected routing: Client -> primary Router -> Fabric -> Storage]

SLIDE 30

Routed Placement Results (IOR)

SLIDE 31

Conclusions

  • Goals exceeded: 244 GB/s on routed storage
  • “Projected” configuration in production
  • Working with library developers to bring this to users

SLIDE 32

Questions?

  • Contact info: David Dillow, 865-241-6602, dillowda@ornl.gov

This research used resources of the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-AC05-00OR22725. Notice: This manuscript has been authored by UT-Battelle, LLC, under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.