Automatically Encapsulating HPC Best Practices Into Data Transfers


SLIDE 1

National Aeronautics and Space Administration www.nasa.gov

Automatically Encapsulating HPC Best Practices Into Data Transfers

Paul Z. Kolano
NASA Advanced Supercomputing Division
paul.kolano@nasa.gov

SLIDE 2

NASA High End Computing Capability

Outline of Presentation

  • Introduction
  • Transport tuning and selection
  • Global resource management
  • File system optimization
  • Conclusions

SLIDE 3

Introduction

  • Data transfers are part of life in HPC environments
  • Finite storage capacity
  • Transfer to cheaper tape storage
  • Back up existing data
  • Make room for new data
  • Transfer from tape storage to reprocess old data
  • Transfer between file systems to fix imbalances
  • Finite computational capacity
  • Transfer from off-site systems with cheaper pre-processing
  • Transfer to off-site systems for cheaper post-processing

SLIDE 4

Introduction (cont.)

  • User transfer concerns
  • Ease of use, integrity, turnaround time
  • Administrator and owner transfer concerns
  • Environment stability, cost effectiveness
  • These concerns can conflict with each other
  • Easy-to-use tools, or those ensuring integrity, may not be fast
  • Easiest file structure may degrade tape performance
  • Fastest turnaround time may lead to resource exhaustion
  • It takes an HPC expert to reconcile the conflicts
  • Understands and applies accepted best practices to achieve fast and efficient verified transfers without impact on stability

SLIDE 5

Goal

  • Let scientists focus on science without wading through documentation on transfer best practices
  • Specify transfers in the simplest, naive fashion
  • Source and destination
  • Provide a tool that performs the transfer as if the scientist were an HPC expert
  • Choose appropriate tools and optimize for best performance
  • Fully utilize available resources without starving other users
  • Manage files appropriately by file system type to ensure efficient access by later and/or behind-the-scenes processes

SLIDE 6

Shift: Self-Healing Independent File Transfer

  • Satisfies user requirements
  • Simple cp/scp syntax for local/remote transfers
  • End-to-end integrity via checksums and sanity checks
  • High speed via transport selection/tuning and automatic parallelization
  • Satisfies administrator and owner requirements
  • Helps prevent resource exhaustion that leads to environment instability
  • Global throttling to allocate resources fairly
  • Load balancing to avoid highly loaded hosts
  • Automatic striping to avoid imbalanced disk utilization
  • Helps prevent wasted resources that impact cost effectiveness
  • Allows easy utilization of idle resources
  • Reduces wasted CPU cycles during jobs due to inefficient disk I/O
  • Prevents issues leading to inefficient tape I/O

SLIDE 7

Shift: Self-Healing Independent File Transfer (cont.)

  • Used in production at NASA's Advanced Supercomputing Facility for over 3.5 years
  • User transfers across local/LAN/WAN
  • Disaster recovery backups to/from remote organizations
  • Rebalancing entire multi-PB Lustre file systems
  • etc.
  • Facilitated transfers of over 14 PB in past year
  • 8 PB local transfers
  • 4 PB LAN transfers
  • 2 PB WAN transfers
  • "I used to hate archiving data - now I almost look for a reason to archive something" –Shift user

SLIDE 8

Shift Interface

    > shiftc --create-tar /nobackup/user1/dataset1 archive1:dataset1.tar
    Shift id is 36
    Detaching process (use --status option to monitor progress)
    > shiftc --status

     id | state |    dirs     |    files    |   file size   | date  |    run     |   rate
        |       |    sums     |    attrs    |   sum size    | time  |    left    |
    ----+-------+-------------+-------------+---------------+-------+------------+----------
     34 | error |     0/0     | 23121/23121 | 39.5TB/39.5TB | 10/02 | 2d14h32m5s | 175MB/s
        |       | 46222/46242 | 23111/23121 |   79TB/79TB   | 10:26 |            |
     35 | done  |     1/1     |  5131/5131  |  303GB/303GB  | 10/05 |   1m35s    | 3.19GB/s
        |       | 10262/10262 |  5132/5132  |  605GB/605GB  | 12:28 |            |
     36 | run   |    24/24    | 26656/26656 | 1.78TB/1.78TB | 10/06 |  2h48m37s  | 176MB/s
        |       | 15463/53312 |  10/26684   | 1.02TB/3.56TB | 12:11 |  1h47m55s  |

SLIDE 9

Shift Components


  • Command-line client
  • Performs file operations and reports results to manager
  • Command-line manager
  • Invoked by clients to track operations and parcel out work

[Architecture diagram: the Shift client runs alongside applications (C1..Cj) on the client host; applications (R1..Rk) run on the remote host; each host's OS fronts its file system, and the Shift manager(s) coordinate with clients across the interconnect]

SLIDE 10

Outline of Presentation

  • Introduction
  • Transport tuning and selection
  • Global resource management
  • File system optimization
  • Conclusions

SLIDE 11

Transport Tuning and Selection

  • Shift includes built-in local/remote transports and checksum capabilities
  • Fully functional out of the box
  • Perl-based equivalents of cp, sftp, fish, m(d5)sum
  • Shift calls higher performance tools when available
  • bbcp, bbftp, gridftp, mcp, rsync, msum
  • Knows how to construct command-lines and parse output
  • Tune transports for optimal performance
  • Select transports based on transfer characteristics

SLIDE 12

Transport Tuning

  • TCP-based transports
  • bbcp, bbftp, gridftp
  • Choose TCP window size
  • Transports with internal parallelism
  • TCP streams (bbcp, bbftp, gridftp) or threads (mcp, msum)
  • Choose appropriate level of parallelism
  • SSH-based transports
  • fish, rsync, sftp-perl
  • Choose fastest SSH cipher and MAC algorithm

SLIDE 13

TCP Window Size Tuning

  • TCP window is the amount of data a sender or receiver is willing to buffer while waiting for acknowledgment
  • Optimal value is bandwidth delay product (BDP)
  • bandwidth * round-trip time
  • Constrained by configured operating system limits
  • e.g. Linux net.core.[wr]mem_max
  • Single stream only achieves bandwidth if limit at least BDP
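The window calculation above can be sketched as follows (a minimal illustration, not Shift's actual code; Shift additionally guesses bandwidth from network type and client hardware):

```python
def tcp_window_size(bandwidth_bps, rtt_s, os_limit_bytes):
    """Target the bandwidth-delay product, capped by the OS buffer limit
    (e.g. Linux net.core.rmem_max / net.core.wmem_max)."""
    bdp = int(bandwidth_bps / 8 * rtt_s)  # BDP in bytes
    return min(bdp, os_limit_bytes)

# 10 Gb/s over a 50 ms round trip needs a ~62.5 MB window; a 16 MiB
# OS limit caps it, so a single stream cannot reach full bandwidth.
```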

SLIDE 14

TCP Window Size Tuning (cont.)

[Charts: LAN and WAN transfer performance (MB/s) vs. TCP window size (default, 1-100 MB) for bbcp, bbftp, and gridftp]

  • Shift determines latency using icmp/echo/syn ping
  • Shift guesses bandwidth based on network type and client hardware if not given via --bandwidth
  • Bandwidth difficult to compute a priori
  • Chooses window size up to operating system limit
SLIDE 15

Transport Parallelism Tuning

  • Number of streams in TCP-based transports
  • Overcome improperly configured TCP window maximums
  • Overcome improperly specified TCP window
  • Overcome interference by cross traffic
  • Number of threads in mcp and msum
  • Take advantage of excess resource capacity on one host

SLIDE 16

Transport Parallelism Tuning (cont.)

[Charts: LAN and WAN transfer performance (MB/s) vs. number of TCP streams (1-8) for bbcp, bbftp, and gridftp]

  • Shift chooses streams based on bandwidth available beyond operating system window limit
  • A minimum value can be configured for LAN/WAN to help overcome cross traffic
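The stream-count choice above can be sketched as opening enough streams that their combined windows cover the BDP (illustrative only; the function name and signature are assumptions, not Shift's code):

```python
import math

def tcp_streams(bandwidth_bps, rtt_s, os_limit_bytes, min_streams=1):
    """Enough parallel streams that streams * per-stream window
    (capped at the OS limit) covers the bandwidth-delay product."""
    bdp = bandwidth_bps / 8 * rtt_s  # bytes in flight at full rate
    return max(min_streams, math.ceil(bdp / os_limit_bytes))

# 10 Gb/s at 50 ms RTT with a 16 MiB OS cap -> 4 streams
```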

SLIDE 17

Transport Parallelism Tuning (cont.)

[Chart: local copy performance (MB/s) vs. mcp threads (1-64) on Haswell (12-core, 4x FDR), Ivy Bridge (10-core, 4x FDR), Sandy Bridge (8-core, 4x FDR), and Westmere (6-core, 4x QDR) hosts]

  • Threads can be centrally configured on the manager
  • High thread counts can induce high load on shared resources
  • Intentionally set lower than optimal at NAS due to high load on archive front-ends

SLIDE 18

SSH Cipher and MAC Algorithm Tuning

  • SSH-based transports use SSH pipe to communicate
  • Performance directly correlated to SSH performance
  • SSH does not expose TCP window settings
  • HPN SSH patches can be used for better window handling
  • Main SSH tuning parameters available
  • Encryption algorithm
  • Message authentication code (MAC) algorithm

SLIDE 19

SSH Cipher and MAC Algorithm Tuning (cont.)

[Charts: LAN transfer performance (MB/s) of fish by SSH cipher (aes128-cbc, aes192-cbc, aes256-cbc, aes128-ctr, aes192-ctr, aes256-ctr, arcfour, arcfour128, arcfour256, blowfish-cbc, cast128-cbc, chacha20, 3des-cbc; with umac-64 MAC) and by SSH MAC algorithm (hmac-md5, hmac-md5-96, hmac-ripemd160, hmac-sha1, hmac-sha1-96, hmac-sha2-256, hmac-sha2-512, umac-64, umac-128; with arcfour256 cipher)]

  • Shift allows preferred cipher/mac order to be centrally configured

  • Availability checked on client host before transfer
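A minimal sketch of that availability check (the preferred-order list below is made up for illustration; the real order is site configuration). Modern OpenSSH exposes its supported ciphers via `ssh -Q cipher`:

```python
import subprocess

def ssh_ciphers():
    """Ciphers supported by the local ssh client (OpenSSH 6.3+)."""
    out = subprocess.run(["ssh", "-Q", "cipher"],
                         capture_output=True, text=True)
    return set(out.stdout.split())

def pick_cipher(preferred, available):
    """First centrally preferred cipher the client actually supports."""
    for cipher in preferred:
        if cipher in available:
            return cipher
    return None  # fall back to ssh defaults
```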
SLIDE 20

Transport Selection

  • Different transports have different strengths and weaknesses
  • Supporting multiple transports provides opportunity to vary transport across each batch of files within transfer

SLIDE 21

Transport Selection (cont.)

[Charts: LAN and WAN transfer performance (MB/s) vs. file size (1 MB - 64 GB) for bbcp, bbftp, fish, gridftp, rsync, and sftp-perl, with the SSH/non-SSH break-even point marked]

  • Using a single transport does not achieve maximum performance
  • Shift's support of multiple transports allows it to use the optimum transport for each batch of files

SLIDE 22

Transport Selection (cont.)

[Chart: local transfer performance (MB/s) vs. file size (1 MB - 64 GB) for bbcp, bbftp, cp-perl, fish, gridftp, mcp, and rsync]

  • Transports traditionally used for remote transfers also perform well for local transfers

SLIDE 23

Transport Selection (cont.)

[Charts: 64 x 1 GB and 1000 x 4 MB transfer performance (MB/s) by transport (bbcp, bbftp, cp/sftp-perl, fish, gridftp, mcp, rsync) for local, LAN, and WAN transfers]

  • Shift has configurable small file thresholds
  • Preferred local/LAN/WAN transports configured above/below each threshold
  • Transport chosen using avg. size of each batch
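The selection logic can be sketched as a table lookup keyed on scope and average batch size. The thresholds and transport preferences below are hypothetical stand-ins; the real values live in Shift's central configuration:

```python
# Hypothetical thresholds (bytes) and preferences, for illustration only.
SMALL_FILE_THRESHOLD = {"local": 256 << 20, "lan": 64 << 20, "wan": 64 << 20}
PREFERRED = {
    "local": {"small": "cp-perl", "large": "mcp"},
    "lan":   {"small": "fish",    "large": "bbftp"},
    "wan":   {"small": "fish",    "large": "bbftp"},
}

def pick_transport(scope, batch_sizes_bytes):
    """Choose a transport from the average file size of one batch."""
    avg = sum(batch_sizes_bytes) / len(batch_sizes_bytes)
    bucket = "small" if avg < SMALL_FILE_THRESHOLD[scope] else "large"
    return PREFERRED[scope][bucket]
```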
SLIDE 24

Outline of Presentation

  • Introduction
  • Transport tuning and selection
  • Global resource management
  • File system optimization
  • Conclusions

SLIDE 25

Global Resource Management

  • Transfer parallelization
  • The only way to maximize performance in HPC environments is to use multiple resources at once
  • Global throttling
  • If everyone tries to run at the maximum rate possible, everyone loses
  • Load balancing
  • Avoid resource exhaustion on individual hosts

SLIDE 26

Transfer Parallelization

  • Single host may have excess capacity
  • Transport does not have its own parallelism
  • Single client cannot fully utilize host's resources
  • Full HPC environment may have excess capacity
  • Single system bottlenecks
  • Aggregate resources of many hosts
  • Two complementary forms of parallelization
  • Multiple clients running on a single host
  • One or more clients running on multiple hosts

SLIDE 27

Transfer Parallelization (cont.)

[Charts: local transfer performance (MB/s) vs. parallel clients (1-16) and vs. parallel hosts (1-32) for bbcp, bbftp, cp-perl, fish, gridftp, mcp, and rsync]

  • Shift derives file system equivalency from mount information to determine parallelization candidates
  • Shift can automatically parallelize all transfers using centrally configured clients/hosts values
  • Can trivially run across allocated compute nodes
  • e.g. --host-file=$PBS_NODEFILE
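Deriving file system equivalency from mount information can be sketched as mapping each path to its longest matching mount point (a simplification of what a real implementation must handle, e.g. bind mounts and network aliases):

```python
def mount_table(mounts_text):
    """Mount point -> device, parsed from /proc/mounts-style lines."""
    table = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            table[fields[1]] = fields[0]
    return table

def same_filesystem(path_a, path_b, table):
    """Two paths are parallelization candidates when their longest
    matching mount points map to the same device."""
    def device(path):
        mnt = max((m for m in table
                   if path == m or path.startswith(m.rstrip("/") + "/")),
                  key=len)
        return table[mnt]
    return device(path_a) == device(path_b)
```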
SLIDE 28

Global Throttling

  • Support transfers at full speed when excess capacity exists
  • Prevent resource exhaustion when systems are busy

[Chart: local write performance (MB/s) over time, unthrottled (5.34 GB/s) vs. throttled to a configured limit (3.62 GB/s)]

  • Shift supports throttling limits on CPU %, disk usage, I/O rate, and network rate
  • Limits can be global, per user, per host, per file system
  • Transfers directed to sleep until average reaches (transfer's fair share of) user's fair share of limit
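The sleep-until-average-meets-share policy can be sketched as below; the equal division of the limit across users and transfers is an assumption for illustration:

```python
import time

def throttled_copy(chunks, write, limit_bps, active_users=1, user_transfers=1):
    """Write chunks, sleeping whenever the running average rate exceeds
    this transfer's fair share of its user's fair share of the limit."""
    share = limit_bps / (active_users * user_transfers)  # assumed equal split
    start = time.monotonic()
    sent = 0
    for chunk in chunks:
        write(chunk)
        sent += len(chunk)
        target = sent / share             # earliest time `sent` bytes are allowed
        elapsed = time.monotonic() - start
        if elapsed < target:
            time.sleep(target - elapsed)  # drop the average back to the share
    return sent
```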

SLIDE 29

Load Balancing

  • Single host resource limitations impact maximum transfer speed of all processes on that host
  • Less loaded hosts have more resources available for new transfers

[Chart: I/O bandwidth utilization (MB/s) over time for 32-way dd write and iperf over IPoIB across 1-32 hosts]

SLIDE 30

Load Balancing (cont.)


  • Shift load balances during host parallelization
  • Picks least loaded hosts when spawning clients
  • Shift load balances during remote transfers
  • Remote host switched to least loaded
  • Overlap prevented between parallel clients

[Chart: LAN transfer performance (MB/s) vs. client hosts (1-32), disjoint remote hosts vs. same remote host]
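Picking the least loaded host when spawning clients can be sketched as a per-core load comparison (the load/core metric is an assumption; one-minute load averages are obtainable via `os.getloadavg()` on each host):

```python
def least_loaded(hosts):
    """hosts: {name: (one_minute_loadavg, cores)}.
    Pick the host with the lowest per-core load for the next client."""
    return min(hosts, key=lambda h: hosts[h][0] / hosts[h][1])
```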

SLIDE 31

Outline of Presentation

  • Introduction
  • Transport tuning and selection
  • Global resource management
  • File system optimization
  • Conclusions

SLIDE 32

File System Optimization

  • Different file systems have different idiosyncrasies
  • Can impact performance, stability, and operating cost
  • Not always directly visible to users
  • Tape optimization
  • File extents impact tape write speed
  • Sequential retrieval results in inefficient tape movement
  • Tar creation/extraction
  • Tape I/O is inefficient with small files
  • Lustre striping
  • Too few stripes creates later I/O inefficiencies

SLIDE 33

Tape Optimization

  • Tape-backed file systems have limited parallelism due to large slow-moving physical components
  • Inefficient access by one user can have major impact on all others
  • Large number of file extents degrades write speed
  • Sequential file retrieval inhibits minimization of internal tape movement

SLIDE 34

Tape Optimization (cont.)

[Charts: tape write and buffered/direct dd read performance (MB/s) vs. source file extents (1-65536); file extents produced vs. file size (1-64 GB) by transport]

  • Shift can be configured to preallocate files below a given sparsity
  • Minimize extents for most common regular files
  • Minimize disk usage for large sparse files
SLIDE 35

Tape Optimization (cont.)

[Chart: tape retrieval time (s) for 1-64 1 GB files, sequential vs. batched]

  • Files automatically retrieved from tape if offline when accessed
  • May be accessed in different order than stored on tape
  • Shift automatically initiates a retrieval of all source files on SGI DMF file systems
  • Retrieval initiated again for files in each batch in case files pushed offline before being transferred
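Batched prestaging can be sketched as below; `dmget` is DMF's file-recall command, but the batch size and the injectable runner are illustrative choices, not Shift's implementation:

```python
import subprocess

def prestage(paths, batch_size=500, runner=subprocess.Popen):
    """Start asynchronous DMF retrievals in batches so later files are
    coming back from tape while earlier batches transfer."""
    procs = []
    for i in range(0, len(paths), batch_size):
        procs.append(runner(["dmget"] + paths[i:i + batch_size]))
    return procs
```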

SLIDE 36

Tar Creation/Extraction

[Chart: tape write and read performance (MB/s) vs. source file size (1 MB - 64 GB)]

  • Users prefer archiving data in original hierarchy
  • Each file accessed as needed
  • Small files impact stability
  • Not migrated below certain size so consume limited disk
  • Tape I/O is inefficient
  • Normally use tar files
  • Tar is very slow (100 MB/s)
  • Must retrieve to view contents
  • No assurance of integrity
  • Retrieve entire tar for one file
  • May not have quota to create
SLIDE 37

Tar Creation/Extraction (cont.)

[Charts: tar creation and extraction performance (MB/s) vs. parallel hosts (1-16) for local (mcp), LAN (fish), WAN (fish), and GNU tar]

  • Shift has built-in tar creation/extraction
  • Uses high performance transports and parallelism
  • Automatic creation of index files to see contents
  • Integrity verification of every file added/extracted
  • Automatic split of tar files at configurable size
  • Direct creation/extraction over network without use of quota
SLIDE 38

Lustre Striping

  • Striping must be set before a file is first written
  • Higher stripe count
  • More I/O bandwidth available for large files
  • More balanced distribution of large files over OSTs
  • Lower stripe count
  • Less contention during metadata operations
  • Less wasted space for small files
  • Striping can only be changed by copying file

SLIDE 39

Lustre Striping (cont.)

[Charts: noncontiguous write and read performance (MB/s) vs. Lustre stripes (1-64) across 2-64 nodes]

  • Shift automatically stripes files according to size
  • Increases write performance during parallel transfers
  • Increases read performance during later job access
  • Reduces wasted CPU cycles due to I/O
  • Preserves non-default striping when applicable
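Size-based striping can be sketched as below. The one-stripe-per-GiB mapping and the 64-stripe cap are illustrative assumptions; Shift's actual policy is site-configurable. `lfs setstripe -c N path` is Lustre's standard way to set the stripe count before first write:

```python
import subprocess

def stripe_count(size_bytes, max_stripes=64):
    """More stripes for bigger files: one stripe per GiB, capped;
    small files get a single stripe. (Illustrative mapping.)"""
    return max(1, min(max_stripes, size_bytes >> 30))

def create_striped(path, size_bytes):
    """Striping must be set before the file is first written."""
    subprocess.run(
        ["lfs", "setstripe", "-c", str(stripe_count(size_bytes)), path],
        check=True)
```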
SLIDE 40

Conclusion

  • Shift is an automated transfer tool that encapsulates HPC best practices to achieve better performance while preserving stability of HPC environments
  • Centralized configuration in manager component allows policies to be changed globally across all clients
  • New best practices can be incorporated transparently without user in the loop
  • Shift is open source and available for download
  • http://shiftc.sourceforge.net
  • Contact info
  • paul.kolano@nasa.gov
  • Questions?
