National Aeronautics and Space Administration www.nasa.gov
Automatically Encapsulating HPC Best Practices Into Data Transfers
Paul Z. Kolano
NASA Advanced Supercomputing Division
paul.kolano@nasa.gov
NASA High End Computing Capability
Outline of Presentation
- Introduction
- Transport tuning and selection
- Global resource management
- File system optimization
- Conclusions
Introduction
- Data transfers are part of life in HPC environments
- Finite storage capacity
- Transfer to cheaper tape storage
- Back up existing data
- Make room for new data
- Transfer from tape storage to reprocess old data
- Transfer between file systems to fix imbalances
- Finite computational capacity
- Transfer from off-site systems with cheaper pre-processing
- Transfer to off-site systems for cheaper post-processing
Introduction (cont.)
- User transfer concerns
- Ease of use, integrity, turnaround time
- Administrator and owner transfer concerns
- Environment stability, cost effectiveness
- These items can conflict with each other
- Easy to use tools or those ensuring integrity may not be fast
- Easiest file structure may degrade tape performance
- Fastest turnaround time may lead to resource exhaustion
- Takes HPC expert to reconcile conflicts
- Understands and applies accepted best practices to achieve
fast and efficient verified transfers without impact on stability
Goal
- Let scientists focus on science without wading
through documentation on transfer best practices
- Specify transfers in simplest, naive fashion
- Source and destination
- Provide tool to perform transfer as if scientist were
HPC expert
- Choose appropriate tools and optimize for best performance
- Fully utilize available resources without starving other users
- Manage files appropriately by file system type to ensure
efficient access by later and/or behind the scenes processes
Shift: Self-Healing Independent File Transfer
- Satisfies user requirements
- Simple cp/scp syntax for local/remote transfers
- End-to-end integrity via checksums and sanity checks
- High speed via transport selection/tuning and automatic parallelization
- Satisfies administrator and owner requirements
- Helps prevent resource exhaustion that leads to environment instability
- Global throttling to allocate resources fairly
- Load balancing to avoid highly loaded hosts
- Automatic striping to avoid imbalanced disk utilization
- Helps prevent wasted resources that impact cost effectiveness
- Allows easy utilization of idle resources
- Reduces wasted CPU cycles during jobs due to inefficient disk I/O
- Prevents issues leading to inefficient tape I/O
Shift: Self-Healing Independent File Transfer (cont.)
- Used in production at NASA's Advanced
Supercomputing Facility for over 3.5 years
- User transfers across local/LAN/WAN
- Disaster recovery backups to/from remote organizations
- Rebalancing entire multi-PB Lustre file systems
- etc.
- Facilitated transfers of over 14 PB in past year
- 8 PB local transfers
- 4 PB LAN transfers
- 2 PB WAN transfers
- "I used to hate archiving data - now I almost look for
a reason to archive something" –Shift user
Shift Interface
> shiftc --create-tar /nobackup/user1/dataset1 archive1:dataset1.tar
Shift id is 36
Detaching process (use --status option to monitor progress)
> shiftc --status
id | state | dirs        | files       | file size     | date  | run        | rate
   |       | sums        | attrs       | sum size      | time  | left       |
---+-------+-------------+-------------+---------------+-------+------------+---------
34 | error | 0/0         | 23121/23121 | 39.5TB/39.5TB | 10/02 | 2d14h32m5s | 175MB/s
   |       | 46222/46242 | 23111/23121 | 79TB/79TB     | 10:26 |            |
35 | done  | 1/1         | 5131/5131   | 303GB/303GB   | 10/05 | 1m35s      | 3.19GB/s
   |       | 10262/10262 | 5132/5132   | 605GB/605GB   | 12:28 |            |
36 | run   | 24/24       | 26656/26656 | 1.78TB/1.78TB | 10/06 | 2h48m37s   | 176MB/s
   |       | 15463/53312 | 10/26684    | 1.02TB/3.56TB | 12:11 | 1h47m55s   |
Shift Components
- Command-line client
- Performs file operations and reports results to manager
- Command-line manager
- Invoked by clients to track operations and parcel out work
[Figure: Shift components. Diagram labels: Client Host (Shift Client, App C1, App C2, ..., App Cj, OS, Client File System); Interconnect; Remote Host (App R1, App R2, ..., App Rk, Shift Manager(s), OS, Remote File System).]
Outline of Presentation
- Introduction
- Transport tuning and selection
- Global resource management
- File system optimization
- Conclusions
Transport Tuning and Selection
- Shift includes built-in local/remote transports and
checksum capabilities
- Fully functional out of the box
- Perl-based equivalents of cp, sftp, fish, m(d5)sum
- Shift calls higher performance tools when available
- bbcp, bbftp, gridftp, mcp, rsync, msum
- Knows how to construct command-lines and parse output
- Tune transports for optimal performance
- Select transports based on transfer characteristics
Transport Tuning
- TCP-based transports
- bbcp, bbftp, gridftp
- Choose TCP window size
- Transports with internal parallelism
- TCP streams (bbcp, bbftp, gridftp) or threads (mcp, msum)
- Choose appropriate level of parallelism
- SSH-based transports
- fish, rsync, sftp-perl
- Choose fastest SSH cipher and MAC algorithm
TCP Window Size Tuning
- TCP window is amount of data sender or receiver
willing to buffer while waiting for acknowledgment
- Optimal value is bandwidth delay product (BDP)
- bandwidth * round-trip time
- Constrained by configured operating system limits
- e.g. Linux net.core.[wr]mem_max
- Single stream only achieves bandwidth if limit at least BDP
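The bandwidth delay product and the operating system cap can be worked through numerically. This is a minimal sketch, not Shift's code: the function names and the 10 Gb/s, 70 ms, and 64 MiB figures are assumptions chosen for illustration.

```python
def bdp_bytes(bandwidth_bits_per_s: float, rtt_s: float) -> int:
    """Bandwidth delay product in bytes: bandwidth * round-trip time."""
    return int(bandwidth_bits_per_s * rtt_s / 8)


def choose_window(bandwidth_bits_per_s: float, rtt_s: float,
                  os_limit_bytes: int) -> int:
    """Use the BDP as the window, capped at the operating system limit."""
    return min(bdp_bytes(bandwidth_bits_per_s, rtt_s), os_limit_bytes)


# Hypothetical path: 10 Gb/s with a 70 ms round trip, 64 MiB OS limit.
# The BDP (87.5 MB) exceeds the limit, so one stream cannot fill the path.
window = choose_window(10e9, 0.070, os_limit_bytes=64 * 1024 * 1024)
print(window)
```

When the chosen window is smaller than the BDP, as in this hypothetical case, a single stream is limited to roughly window / round-trip time.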
TCP Window Size Tuning (cont.)
[Figure: LAN and WAN transfer performance (MB/s) versus TCP window size (MB: Default, 1, 4, 16, 64, 100). LAN series: bbcp, bbftp, gridftp. WAN series: bbcp (get), bbftp (get), gridftp (put).]
- Shift determines latency using icmp/echo/syn ping
- Shift guesses bandwidth based on network type
and client hardware if not given via --bandwidth
- Bandwidth difficult to compute a priori
- Chooses window size up to operating system limit
Transport Parallelism Tuning
- Number of streams in TCP-based transports
- Overcome improperly configured TCP window maximums
- Overcome improperly specified TCP window
- Overcome interference by cross traffic
- Number of threads in mcp and msum
- Take advantage of excess resource capacity on one host
Transport Parallelism Tuning (cont.)
[Figure: LAN and WAN transfer performance (MB/s) versus TCP streams (1, 2, 4, 8). LAN series: bbcp, bbftp, gridftp. WAN series: bbcp (get), bbftp (get), gridftp (put).]
- Shift chooses streams based on bandwidth
available beyond operating system window limit
- A minimum value can be configured for LAN/WAN
to help overcome cross traffic
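One plausible reading of "bandwidth available beyond operating system window limit" is to use enough streams that their combined windows cover the bandwidth delay product. The sketch below illustrates that reading only; the formula, function name, and example values are assumptions, not Shift's actual rule.

```python
import math


def choose_streams(bdp_bytes: int, os_window_limit_bytes: int,
                   min_streams: int = 1) -> int:
    """Streams needed so the combined windows cover the BDP (assumed rule)."""
    needed = math.ceil(bdp_bytes / os_window_limit_bytes)
    # A configured minimum can raise the count to help with cross traffic.
    return max(needed, min_streams)


# Hypothetical: 87.5 MB BDP with a 16 MiB per-stream window limit.
print(choose_streams(87_500_000, 16 * 1024 * 1024))
```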
Transport Parallelism Tuning (cont.)
[Figure: Local copy performance (MB/s) versus mcp threads (1 to 64). Series: Haswell (12-core, 4x FDR), Ivy Bridge (10-core, 4x FDR), Sandy Bridge (8-core, 4x FDR), Westmere (6-core, 4x QDR).]
- Threads can be centrally
configured on the manager
- High thread counts can
induce high load on shared resources
- Intentionally set lower than
optimal at NAS due to high
load on archive front-ends
SSH Cipher and MAC Algorithm Tuning
- SSH-based transports use SSH pipe to communicate
- Performance directly correlated to SSH performance
- SSH does not expose TCP window settings
- HPN SSH patches can be used for better window handling
- Main SSH tuning parameters available
- Encryption algorithm
- Message authentication code (MAC) algorithm
SSH Cipher and MAC Algorithm Tuning (cont.)
[Figure: LAN transfer performance (MB/s) by SSH cipher (fish with umac-64 MAC). Ciphers: aes128-cbc, aes192-cbc, aes256-cbc, aes128-ctr, aes192-ctr, aes256-ctr, arcfour, arcfour128, arcfour256, blowfish-cbc, cast128-cbc, chacha20, 3des-cbc.]
[Figure: LAN transfer performance (MB/s) by SSH MAC algorithm (fish with arcfour256 cipher). MACs: hmac-md5, hmac-md5-96, hmac-ripemd160, hmac-sha1, hmac-sha1-96, hmac-sha2-256, hmac-sha2-512, umac-64, umac-128.]
- Shift allows preferred cipher/mac order to be
centrally configured
- Availability checked on client host before transfer
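Picking from a preferred order while checking availability can be sketched as a first-match search. This is an illustration under assumptions, not Shift's implementation: the function name and the preference list are hypothetical. On OpenSSH releases that support the -Q option, `ssh -Q cipher` and `ssh -Q mac` print the algorithms a client supports, which is one way the available list could be obtained.

```python
from typing import Optional, Sequence


def pick_first_available(preferred: Sequence[str],
                         available: Sequence[str]) -> Optional[str]:
    """Return the first preferred algorithm the client supports, else None."""
    available_set = set(available)
    for name in preferred:
        if name in available_set:
            return name
    return None


# Hypothetical centrally configured order and client capabilities.
preferred_ciphers = ["arcfour256", "aes128-ctr", "aes256-ctr"]
client_ciphers = ["aes128-ctr", "aes256-ctr", "chacha20-poly1305@openssh.com"]
print(pick_first_available(preferred_ciphers, client_ciphers))
```

Returning None lets the caller fall back to the SSH default when nothing in the preferred list is available.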
Transport Selection
- Different transports have different strengths and
weaknesses
- Supporting multiple transports provides opportunity
to vary transport across each batch of files within transfer
Transport Selection (cont.)
[Figure: LAN and WAN transfer performance (MB/s) versus file size (1 MB to 65536 MB) for bbcp, bbftp, fish, gridftp, rsync, sftp-perl, with the SSH/non-SSH break-even point marked on each plot.]
- Using a single transport does not achieve
maximum performance
- Shift's support of multiple transports allows it to
use the optimum transport for each batch of files
Transport Selection (cont.)
[Figure: Local transfer performance (MB/s) versus file size (1 MB to 65536 MB) for bbcp, bbftp, cp-perl, fish, gridftp, mcp, rsync.]
- Traditionally remote
transports also perform well for local transfers
Transport Selection (cont.)
[Figure: Performance (MB/s) by transport (bbcp, bbftp, cp/sftp-perl, fish, gridftp, mcp, rsync) for local, LAN, and WAN transfers. Panels: 64 x 1GB files and 1000 x 4MB files.]
- Shift has configurable small file thresholds
- Preferred local/LAN/WAN transports above/below
- Transport chosen using avg. size of each batch
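Threshold-based selection on the average file size of a batch can be sketched as follows. The function name, the 64 MiB threshold, and the transports assigned to each side of it are assumptions for illustration; the presentation does not state Shift's configured values.

```python
from typing import Sequence


def choose_transport(file_sizes: Sequence[int], small_file_threshold: int,
                     small_transport: str, large_transport: str) -> str:
    """Pick a transport from the average file size of one batch."""
    if not file_sizes:
        raise ValueError("batch must contain at least one file")
    average = sum(file_sizes) / len(file_sizes)
    return small_transport if average < small_file_threshold else large_transport


MIB = 1 << 20
# Hypothetical policy: "fish" below 64 MiB average, "bbftp" at or above it.
print(choose_transport([4 * MIB] * 1000, 64 * MIB, "fish", "bbftp"))
print(choose_transport([1024 * MIB] * 64, 64 * MIB, "fish", "bbftp"))
```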
Outline of Presentation
- Introduction
- Transport tuning and selection
- Global resource management
- File system optimization
- Conclusions
Global Resource Management
- Transfer parallelization
- The only way to maximize performance in HPC
environments is to use multiple resources at once
- Global throttling
- If everyone tries to run at the maximum rate possible,
everyone loses
- Load balancing
- Avoid resource exhaustion on individual hosts
Transfer Parallelization
- Single host may have excess capacity
- Transport does not have its own parallelism
- Single client cannot fully utilize host's resources
- Full HPC environment may have excess capacity
- Single system bottlenecks
- Aggregate resources of many hosts
- Two complementary forms of parallelization
- Multiple clients running on a single host
- One or more clients running on multiple hosts
Transfer Parallelization (cont.)
[Figure: Local transfer performance (MB/s) versus parallel clients (1 to 16) and versus parallel hosts (1 to 32) for bbcp, bbftp, cp-perl, fish, gridftp, mcp, rsync.]
- Shift derives file system equivalency from mount
information to determine parallelization candidates
- Shift can automatically parallelize all transfers using
centrally configured clients/hosts values
- Can trivially run across allocated compute nodes
- e.g. --host-file=$PBS_NODEFILE
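Spreading work over the hosts named in a host file can be sketched as deduplicating the host list and dealing batches out in turn. This is a simplified illustration, not how Shift parcels out work: the function names and the round-robin policy are assumptions, and a scheduler node file may list a host more than once.

```python
from typing import Dict, List, Sequence


def unique_hosts(host_file_lines: Sequence[str]) -> List[str]:
    """Return host names in first-seen order, dropping blanks and repeats."""
    seen = set()
    hosts = []
    for line in host_file_lines:
        host = line.strip()
        if host and host not in seen:
            seen.add(host)
            hosts.append(host)
    return hosts


def assign_round_robin(batches: Sequence[str],
                       hosts: Sequence[str]) -> Dict[str, List[str]]:
    """Deal batches out to hosts in turn (assumed policy)."""
    plan: Dict[str, List[str]] = {host: [] for host in hosts}
    for i, batch in enumerate(batches):
        plan[hosts[i % len(hosts)]].append(batch)
    return plan


# Hypothetical node file contents and batch names.
hosts = unique_hosts(["n1\n", "n1\n", "n2\n", "\n"])
print(assign_round_robin(["batch0", "batch1", "batch2"], hosts))
```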
Global Throttling
- Support transfers at
full speed when excess capacity exists
- Prevent resource
exhaustion when systems are busy
[Figure: Local write performance (MB/s) over time, with the configured limit marked. Series: unthrottled (5.34 GB/s) and throttled (3.62 GB/s).]
- Shift supports throttling limits on CPU %, disk
usage, I/O rate, and network rate
- Limits can be global, per user, per host, per file system
- Transfers directed to sleep until average reaches
(transfer's fair share of) user's fair share of limit
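Sleeping until an average rate falls to a limit reduces to simple arithmetic: a transfer that has moved B bytes must have been running for at least B / limit seconds. The sketch below shows that calculation and an equal-split fair share; the function names and the equal-split policy are assumptions, not Shift's actual accounting.

```python
def fair_share(limit: float, active_users: int, user_transfers: int) -> float:
    """Equal split of a limit across users, then across one user's transfers."""
    return limit / active_users / user_transfers


def throttle_sleep_seconds(bytes_done: int, elapsed_s: float,
                           limit_bytes_per_s: float) -> float:
    """Seconds to sleep so the average rate since start drops to the limit."""
    if limit_bytes_per_s <= 0:
        raise ValueError("limit must be positive")
    target_elapsed = bytes_done / limit_bytes_per_s
    return max(0.0, target_elapsed - elapsed_s)


# Hypothetical: a 4000 MB/s limit shared by 4 users with 2 transfers each.
share = fair_share(4000.0, active_users=4, user_transfers=2)
# A transfer that moved 10000 MB in 10 s is over its share and must pause.
print(share, throttle_sleep_seconds(10000, 10.0, share))
```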
Load Balancing
- Single host resource
limitations impact maximum transfer speed of all processes on that host
- Less loaded hosts have
more resources available for new transfers
[Figure: I/O bandwidth utilization (MB/s) over time. Labels: 32-way dd write, iperf IPoIB. Series: 1, 2, 4, 8, 16, and 32 hosts.]
Load Balancing (cont.)
- Shift load balances during host parallelization
- Picks least loaded hosts when spawning clients
- Shift load balances during remote transfers
- Remote host switched to least loaded
- Overlap prevented between parallel clients
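Choosing the least loaded hosts while keeping parallel clients on disjoint remote hosts can be sketched as a smallest-k selection with an exclusion set. The function name, the load metric, and the host names are hypothetical; this is not Shift's selection code.

```python
import heapq
from typing import Dict, Iterable, List


def pick_least_loaded(loads: Dict[str, float], count: int,
                      exclude: Iterable[str] = ()) -> List[str]:
    """Return up to `count` hosts with the lowest load, skipping excluded ones."""
    excluded = set(exclude)
    candidates = [(load, host) for host, load in loads.items()
                  if host not in excluded]
    return [host for _, host in heapq.nsmallest(count, candidates)]


# Hypothetical load averages reported for four hosts.
loads = {"h1": 3.0, "h2": 0.5, "h3": 1.2, "h4": 0.9}
print(pick_least_loaded(loads, 2))
# Exclude a host already used by another parallel client to avoid overlap.
print(pick_least_loaded(loads, 2, exclude={"h2"}))
```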
[Figure: LAN transfer performance (MB/s) versus client hosts (1 to 32). Series: disjoint remote hosts and same remote host.]
Outline of Presentation
- Introduction
- Transport tuning and selection
- Global resource management
- File system optimization
- Conclusions
File System Optimization
- Different file systems have different idiosyncrasies
- Can impact performance, stability, and operating cost
- Not always directly visible to users
- Tape optimization
- File extents impact tape write speed
- Sequential retrieval results in inefficient tape movement
- Tar creation/extraction
- Tape I/O is inefficient with small files
- Lustre striping
- Too few stripes creates later I/O inefficiencies
Tape Optimization
- Tape-backed file systems have limited parallelism
due to large slow-moving physical components
- Inefficient access by one user can have major impact
on all others
- Large number of file extents degrades write speed
- Sequential file retrieval inhibits minimization of internal tape
movement
Tape Optimization (cont.)
[Figure: Performance (MB/s) versus source file extents (1 to 65536). Series: tape write, buffered dd read, direct dd read.]
[Figure: Extents produced versus file size (GB: 1, 4, 16, 64) for bbcp, bbftp, cp-perl, fish, gridftp, mcp, rsync, with the level at which tape writes degrade marked.]
- Shift can be configured to preallocate files below
a given sparsity
- Minimize extents for most common regular files
- Minimize disk usage for large sparse files
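A sparsity-based preallocation decision can be sketched by comparing allocated blocks with the apparent file size. The sparsity definition, the function names, and the 0.5 threshold are assumptions for illustration rather than Shift's definitions; `st_blocks` is Unix-only and counts 512-byte units.

```python
import os


def sparsity_from_stat(size_bytes: int, blocks_512: int) -> float:
    """Fraction of the apparent size that has no blocks allocated."""
    if size_bytes == 0:
        return 0.0
    allocated = blocks_512 * 512
    return max(0.0, 1.0 - allocated / size_bytes)


def file_sparsity(path: str) -> float:
    """Sparsity of an existing file, read from os.stat (Unix only)."""
    st = os.stat(path)
    return sparsity_from_stat(st.st_size, st.st_blocks)


def should_preallocate(src_sparsity: float, max_sparsity: float) -> bool:
    """Preallocate dense files; leave sparse ones alone to save disk."""
    return src_sparsity < max_sparsity


# Hypothetical 1 MiB files: fully allocated versus one quarter allocated.
print(should_preallocate(sparsity_from_stat(1_048_576, 2048), 0.5))
print(should_preallocate(sparsity_from_stat(1_048_576, 512), 0.5))
```

On Unix, a destination chosen for preallocation could then be reserved with `os.posix_fallocate` before any data is written.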
Tape Optimization (cont.)
[Figure: Tape retrieval time (s) versus number of 1GB files (1 to 64). Series: sequential and batched.]
- Files automatically
retrieved from tape if
offline when accessed
- May be accessed in
different order than stored on tape
- Shift automatically initiates a retrieval of all source
files on SGI DMF file systems
- Retrieval initiated again for files in each batch in
case files pushed offline before being transferred
Tar Creation/Extraction
[Figure: Tape performance (MB/s) versus source file size (1 MB to 65536 MB). Series: write and read.]
- Users prefer archiving data
in original hierarchy
- Each file accessed as needed
- Small files impact stability
- Not migrated below certain
size so consume limited disk
- Tape I/O is inefficient
- Normally use tar files
- Tar is very slow (100 MB/s)
- Must retrieve to view contents
- No assurance of integrity
- Retrieve entire tar for one file
- May not have quota to create
Tar Creation/Extraction (cont.)
[Figure: Tar creation and tar extraction performance (MB/s) versus parallel hosts (1 to 16). Series: local (mcp), LAN (fish), WAN (fish), GNU tar.]
- Shift has built-in tar creation/extraction
- Uses high performance transports and parallelism
- Automatic creation of index files to see contents
- Integrity verification of every file added/extracted
- Automatic split of tar files at configurable size
- Direct creation/extraction over network without use of quota
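Splitting at a configurable size can be sketched as greedily grouping files, in order, into parts that stay under the limit. This is an illustration only: the function name and policy are assumptions, tar header overhead is ignored, and Shift's own splitting may differ (for example, by splitting within a file).

```python
from typing import List, Sequence, Tuple


def plan_tar_parts(files: Sequence[Tuple[str, int]],
                   split_size: int) -> List[List[str]]:
    """Group (name, size) pairs into parts of at most split_size bytes.

    A single file larger than split_size gets a part of its own.
    """
    parts: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        # Start a new part when adding this file would exceed the limit.
        if current and current_size + size > split_size:
            parts.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        parts.append(current)
    return parts


# Hypothetical file list with sizes in bytes and an 8-byte split size.
print(plan_tar_parts([("a", 4), ("b", 4), ("c", 4)], 8))
```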
Lustre Striping
- Striping must be set before a file is first written
- Higher stripe count
- More I/O bandwidth available for large files
- More balanced distribution of large files over OSTs
- Lower stripe count
- Less contention during metadata operations
- Less wasted space for small files
- Striping can only be changed by copying file
Lustre Striping (cont.)
[Figure: Noncontiguous write and read performance (MB/s) versus Lustre stripes (1 to 64). Series: 2, 4, 8, 16, 32, and 64 nodes.]
- Shift automatically stripes files according to size
- Increases write performance during parallel transfers
- Increases read performance during later job access
- Reduces wasted CPU cycles due to I/O
- Preserves non-default striping when applicable
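Striping according to size can be sketched as a stripe count that grows with file size up to a cap. The one-stripe-per-GiB rule, the cap of 64, and the function name are assumptions for illustration; the presentation does not give Shift's actual policy. Because striping must be set before a file is first written, the resulting count would be applied to the empty destination, for example with `lfs setstripe -c <count> <file>`.

```python
import math


def stripe_count(size_bytes: int, bytes_per_stripe: int = 1 << 30,
                 max_stripes: int = 64) -> int:
    """One stripe per bytes_per_stripe of data, between 1 and max_stripes."""
    wanted = math.ceil(size_bytes / bytes_per_stripe)
    return max(1, min(max_stripes, wanted))


GIB = 1 << 30
# Hypothetical sizes: a small file, a 5 GiB file, and a 1000 GiB file.
print(stripe_count(10), stripe_count(5 * GIB), stripe_count(1000 * GIB))
```

Small files keep a single stripe, which matches the lower-stripe-count benefits listed for metadata operations and wasted space.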
Conclusion
- Shift is an automated transfer tool that encapsulates
HPC best practices to achieve better performance while preserving stability of HPC environments
- Centralized configuration in manager component allows
policies to be changed globally across all clients
- New best practices can be incorporated transparently
without user in the loop
- Shift is open source and available for download
- http://shiftc.sourceforge.net
- Contact info
- paul.kolano@nasa.gov
- Questions?