Cluster 2010 Presentation: Optimization Techniques at the I/O Forwarding Layer

SLIDE 1

Optimization Techniques at the I/O Forwarding Layer

Kazuki Ohta (presenter): Preferred Infrastructure, Inc., University of Tokyo
Dries Kimpe, Jason Cope, Kamil Iskra, Robert Ross: Argonne National Laboratory
Yutaka Ishikawa: University of Tokyo
Contact: kazuki.ohta@gmail.com

Cluster 2010 Presentation

SLIDE 2

Background: Compute and Storage Imbalance

  • Leadership-class computational scale:
  • 100,000+ processes
  • Advanced multi-core architectures, compute-node OSs
  • Leadership-class storage scale:
  • 100+ servers
  • Commercial storage hardware, cluster file systems
  • Current leadership-class machines supply only 1 GB/s of storage throughput for every 10 TF of compute performance, and this gap has grown by a factor of 10 in recent years.
  • Bridging this imbalance between compute and storage is a critical problem for large-scale computation.

SLIDE 3

Previous Studies: Current I/O Software Stack

[Stack diagram: Parallel/Serial Applications → High-Level I/O Libraries → MPI-IO → POSIX I/O (VFS, FUSE) → Parallel File Systems → Storage Devices]

  • High-level I/O libraries: storage abstraction, data portability (HDF5, NetCDF, ADIOS)
  • MPI-IO: organizing accesses from many clients (ROMIO)
  • Parallel file systems: logical file system over many storage devices (PVFS2, Lustre, GPFS, PanFS, Ceph, etc.)
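To make the MPI-IO layer of this stack concrete, here is an illustrative C sketch (not taken from the slides; the file name "shared.dat" and block size are arbitrary): each process writes its own block of a shared file, and because the write is collective, the MPI-IO layer (ROMIO) is free to reorganize the per-process accesses before they reach the parallel file system.

#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each process prepares a 1 MiB block of its own. */
    static char block[1 << 20];
    memset(block, 'a' + (rank % 26), sizeof block);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write at rank-specific offsets: the MPI-IO layer (ROMIO)
       may aggregate and reorder these accesses before they hit the
       parallel file system. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * sizeof block,
                          block, sizeof block, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}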

SLIDE 4

Challenge: Millions of Concurrent Clients

  • 1,000,000+ concurrent clients present a challenge to the current I/O stack
  • e.g. metadata performance, locking, the network incast problem, etc.
  • The I/O forwarding layer is introduced to address this.
  • All I/O requests are delegated to a dedicated I/O forwarder process (see the sketch below the figure).
  • The I/O forwarder reduces the number of clients seen by the file system for all applications, without requiring collective I/O.

[Figure: compute processes send requests along the I/O path to an I/O forwarder, which accesses the parallel file system (PVFS2 servers and their disks)]
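The following toy C program models the delegation idea only (it is an assumed sketch, not the Blue Gene/P ciod or IOFSL code; the file name and request layout are made up): several compute "clients" enqueue their write requests, and a single forwarder drains the queue through one file descriptor, so the storage side only ever sees one client.

#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NCLIENTS 8
#define RECLEN   32

typedef struct { off_t offset; char data[RECLEN]; } request_t;

static request_t queue[NCLIENTS];
static int qlen = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Compute side: instead of touching the file system, each client only
   enqueues its request for the forwarder. */
static void *client(void *arg)
{
    long rank = (long)arg;
    request_t r = { .offset = rank * RECLEN };
    snprintf(r.data, RECLEN, "data from rank %ld\n", rank);

    pthread_mutex_lock(&lock);
    queue[qlen++] = r;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid[NCLIENTS];
    int fd = open("forwarded.out", O_CREAT | O_WRONLY, 0644);

    for (long i = 0; i < NCLIENTS; i++)
        pthread_create(&tid[i], NULL, client, (void *)i);
    for (int i = 0; i < NCLIENTS; i++)
        pthread_join(tid[i], NULL);

    /* Forwarder side: every delegated request reaches storage through this
       single descriptor, regardless of how many clients produced it. */
    for (int i = 0; i < qlen; i++)
        pwrite(fd, queue[i].data, RECLEN, queue[i].offset);

    close(fd);
    return 0;
}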

SLIDE 5

I/O Software Stack with I/O Forwarding

[Stack diagram: Parallel/Serial Applications → High-Level I/O Libraries → MPI-IO → POSIX I/O (VFS, FUSE) → I/O Forwarding → Parallel File Systems → Storage Devices]

Bridge Between Compute Process and Storage System (IBM ciod, Cray DVS, IOFSL)

SLIDE 6

Example I/O System: Blue Gene/P Architecture

SLIDE 7

I/O Forwarding Challenges

  • Large requests
  • Latency of the forwarding
  • Memory limit of the I/O forwarding node
  • Variety of backend file system node performance
  • Small requests
  • The current I/O forwarding mechanism reduces the number of clients, but does not reduce the number of requests.
  • Request processing overheads at the file systems
  • We propose two optimization techniques for the I/O forwarding layer:
  • Out-Of-Order I/O Pipelining, for large requests.
  • I/O Request Scheduler, for small requests.

SLIDE 8

Out-Of-Order I/O Pipelining

  • Split large I/O requests into small fixed-size chunks (see the sketch below).
  • These chunks are forwarded in an out-of-order way.
  • Good points:
  • Reduces forwarding latency by overlapping the I/O requests and the network transfer.
  • I/O sizes are not limited by the memory size at the forwarding node.
  • Little effect from the slowest file system node.

[Figure: client/IOFSL/file-system timelines comparing no pipelining with out-of-order pipelining across multiple threads]
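The C sketch below shows the chunking idea under stated assumptions (it is a simplification, not the IOFSL implementation; in IOFSL the overlap is between network receives and file-system writes, whereas here the chunks are simply written by a thread pool): a large buffer is cut into fixed-size chunks, each worker claims the next chunk and writes it at its own offset, so chunks complete in any order and no single slow chunk stalls the rest.

#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK   (4 * 1024 * 1024)   /* fixed pipeline buffer size (assumed) */
#define NTHREAD 4

typedef struct {
    int fd;
    const char *buf;
    size_t total;
    size_t next;                    /* next offset nobody has claimed yet */
    pthread_mutex_t lock;
} pipeline_t;

/* Each worker repeatedly claims the next chunk and writes it at its own
   file offset, so chunks reach the file system out of order. */
static void *worker(void *arg)
{
    pipeline_t *p = arg;
    for (;;) {
        pthread_mutex_lock(&p->lock);
        size_t off = p->next;
        p->next += CHUNK;
        pthread_mutex_unlock(&p->lock);

        if (off >= p->total)
            return NULL;
        size_t len = p->total - off < CHUNK ? p->total - off : CHUNK;
        pwrite(p->fd, p->buf + off, len, (off_t)off);
    }
}

int main(void)
{
    size_t total = 64UL * 1024 * 1024;          /* one "large request" */
    char *buf = malloc(total);
    memset(buf, 'x', total);

    pipeline_t p;
    p.fd = open("big.out", O_CREAT | O_WRONLY, 0644);
    p.buf = buf;
    p.total = total;
    p.next = 0;
    pthread_mutex_init(&p.lock, NULL);

    pthread_t tid[NTHREAD];
    for (int i = 0; i < NTHREAD; i++)
        pthread_create(&tid[i], NULL, worker, &p);
    for (int i = 0; i < NTHREAD; i++)
        pthread_join(tid[i], NULL);

    close(p.fd);
    free(buf);
    return 0;
}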

SLIDE 9

I/O Request Scheduler

  • Schedules and merges the small requests at the forwarder (see the sketch below)
  • Reduces the number of seeks
  • Reduces the number of requests the file system actually sees
  • Scheduling overhead must be minimal
  • Handle-Based Round-Robin algorithm for fairness between files
  • Ranges are managed by an interval tree
  • Contiguous requests are merged

[Figure: per-handle (H) queues of read/write requests in queue Q; the scheduler picks N requests and issues the I/O]
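A minimal C sketch of handle-based round-robin with range merging follows (an assumed simplification: IOFSL tracks pending ranges in an interval tree, whereas this sketch just sorts a small per-handle array; the handles and offsets are made up): requests are grouped per file handle, contiguous ranges are merged, and the scheduler takes a bounded number of merged ranges from each handle in turn before issuing the I/O.

#include <stdio.h>
#include <stdlib.h>

typedef struct { long offset, length; } range_t;

typedef struct {
    int     handle;        /* file handle this queue belongs to */
    range_t reqs[64];      /* pending requests for that handle */
    int     count;
} queue_t;

static int cmp_offset(const void *a, const void *b)
{
    const range_t *x = a, *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/* Sort one handle's requests by offset and merge contiguous ranges, so the
   file system sees fewer, larger accesses. */
static int merge_ranges(range_t *r, int n)
{
    int out = 0;
    qsort(r, n, sizeof *r, cmp_offset);
    for (int i = 0; i < n; i++) {
        if (out > 0 && r[out - 1].offset + r[out - 1].length == r[i].offset)
            r[out - 1].length += r[i].length;     /* contiguous: extend */
        else
            r[out++] = r[i];                      /* gap: start a new range */
    }
    return out;
}

int main(void)
{
    enum { NQ = 2, NPICK = 2 };   /* NPICK ranges issued per handle per turn */
    queue_t q[NQ] = {
        { 1, { {0, 4096}, {8192, 4096}, {4096, 4096} }, 3 },
        { 2, { {0, 65536}, {65536, 65536} }, 2 },
    };
    int merged[NQ], pos[NQ] = {0}, remaining = 0;

    for (int i = 0; i < NQ; i++) {
        merged[i] = merge_ranges(q[i].reqs, q[i].count);
        remaining += merged[i];
    }

    /* Handle-based round-robin: visit each handle in turn and issue at most
       NPICK of its merged ranges, so no single file starves the others. */
    while (remaining > 0)
        for (int i = 0; i < NQ; i++)
            for (int k = 0; k < NPICK && pos[i] < merged[i];
                 k++, pos[i]++, remaining--)
                printf("handle %d: issue offset=%ld length=%ld\n",
                       q[i].handle, q[i].reqs[pos[i]].offset,
                       q[i].reqs[pos[i]].length);
    return 0;
}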

SLIDE 10

I/O Forwarding and Scalability Layer (IOFSL)

  • IOFSL Project [Nawab 2009]
  • Open-source I/O forwarding implementation
  • http://www.iofsl.org/
  • Portable across most HPC environments
  • Network independent
  • All network communication is done by BMI [Carns 2005]
  • TCP/IP, InfiniBand, Myrinet, Blue Gene/P Tree, Portals, etc.
  • File system independent
  • MPI-IO (ROMIO) / FUSE client

SLIDE 11

IOFSL Software Stack

[Diagram: applications use the POSIX interface (FUSE) or the MPI-IO interface (ROMIO-ZOIDFS) on top of the ZOIDFS client API; I/O requests travel over the ZOIDFS protocol via BMI (TCP, InfiniBand, Myrinet, etc.) to the IOFSL server, whose file system dispatcher talks to PVFS2 or libsysio]

  • Out-Of-Order I/O Pipelining and the I/O request scheduler have been implemented in IOFSL and evaluated in two environments:
  • T2K Tokyo (Linux cluster) and ANL Surveyor (Blue Gene/P)

SLIDE 12

Evaluation on T2K: Spec

  • T2K Open Supercomputer, Tokyo site
  • http://www.open-supercomputer.org/
  • 32-node research cluster
  • 16 cores per node: 2.3 GHz quad-core Opteron × 4
  • 32 GB memory
  • 10 Gb/s Myrinet network
  • SATA disk (read: 49.52 MB/s, write: 39.76 MB/s)
  • One IOFSL server, four PVFS2 servers, 128 MPI processes
  • Software
  • MPICH2 1.1.1p1
  • PVFS2 CVS (almost 2.8.2)

SLIDE 13

Evaluation on T2K: IOR Benchmark

  • Each process issues the same amount of I/O
  • The message size is gradually increased to observe the change in bandwidth
  • Note: IOR was modified to call fsync() for MPI-IO
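Roughly, the measurement looks like the C/MPI-IO loop below (an illustrative sketch, not the actual modified IOR source; the file name, sizes, and the use of MPI_File_sync() as the fsync step are assumptions): every process writes the same total amount at its own offset, the message size doubles each round, and the aggregate bandwidth is reported.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const MPI_Offset total = 64 << 20;           /* bytes written per process */
    char *buf = malloc(total);
    memset(buf, 'x', total);

    for (MPI_Offset msg = 4 << 10; msg <= total; msg *= 2) {
        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "ior.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        double t0 = MPI_Wtime();
        for (MPI_Offset off = 0; off < total; off += msg)
            MPI_File_write_at(fh, (MPI_Offset)rank * total + off,
                              buf + off, (int)msg, MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_sync(fh);                       /* the fsync() the slide mentions */
        MPI_Barrier(MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        MPI_File_close(&fh);
        if (rank == 0)
            printf("msg=%lld KB  bandwidth=%.1f MB/s\n",
                   (long long)(msg >> 10),
                   (double)total * nprocs / (t1 - t0) / (1 << 20));
    }
    free(buf);
    MPI_Finalize();
    return 0;
}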


SLIDE 14

Evaluation on T2K: IOR Benchmark, 128procs

[Plot: bandwidth (MB/s) vs. message size (KB) for ROMIO PVFS2, IOFSL none, and IOFSL hbrr]

SLIDE 15

Evaluation on T2K: IOR Benchmark, 128procs

[Plot: bandwidth (MB/s) vs. message size (KB) for ROMIO PVFS2, IOFSL none, and IOFSL hbrr]

Out-Of-Order Pipelining Improvements (~29.5%)

SLIDE 16

Evaluation on T2K: IOR Benchmark, 128procs

[Plot: bandwidth (MB/s) vs. message size (KB) for ROMIO PVFS2, IOFSL none, and IOFSL hbrr]

I/O Scheduler Improvement (~40.0%) Out-Of-Order Pipelining Improvements (~29.5%)

SLIDE 17

Evaluation on Blue Gene/P: Spec

  • Argonne National Laboratory BG/P “Surveyor”
  • Blue Gene/P platform for research and development
  • 1,024 nodes, 4,096 cores
  • Four PVFS2 servers
  • DataDirect Networks S2A9550 SAN
  • 256 compute nodes with 4 I/O nodes were used.

[Figure labels: Node Card: 4 cores, Node Board: 128 cores, Rack: 4,096 cores]

SLIDE 18

Evaluation on BG/P: BMI PingPong

[Plot: bandwidth (MB/s) vs. buffer size (KB); series: BMI TCP/IP, BMI ZOID, CNK BG/P Tree Network]
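For context on what the plot measures, a ping-pong bandwidth loop looks roughly like the following (sketched here with MPI point-to-point calls purely for illustration; the slide's numbers were gathered through BMI's own API, not MPI): rank 0 sends a buffer to rank 1 and receives it back, and the round-trip time per buffer size gives the bandwidth.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (long size = 1 << 10; size <= 4L << 20; size *= 2) {
        char *buf = malloc(size);
        const int iters = 100;

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {        /* ping: send then wait for the echo */
                MPI_Send(buf, (int)size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, (int)size, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) { /* pong: echo the buffer back */
                MPI_Recv(buf, (int)size, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, (int)size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("%ld KB: %.1f MB/s\n", size >> 10,
                   2.0 * size * iters / (t1 - t0) / (1 << 20));
        free(buf);
    }
    MPI_Finalize();
    return 0;
}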

SLIDE 19

Evaluation on BG/P: IOR Benchmark, 256nodes

[Plot: bandwidth (MiB/s) vs. message size (KB) for CIOD and IOFSL FIFO (32 threads)]

SLIDE 20

Evaluation on BG/P: IOR Benchmark, 256nodes

[Plot: bandwidth (MiB/s) vs. message size (KB) for CIOD and IOFSL FIFO (32 threads)]

Performance Improvements (~42.0%)

SLIDE 21

Evaluation on BG/P: IOR Benchmark, 256nodes

[Plot: bandwidth (MiB/s) vs. message size (KB) for CIOD and IOFSL FIFO (32 threads)]

Performance Drop (~ -38.5%) Performance Improvements (~42.0%)

SLIDE 22

Evaluation on BG/P: Thread Count Effect

[Plot: bandwidth (MiB/s) vs. message size (KB) for IOFSL FIFO with 16 threads and 32 threads]

SLIDE 23

Evaluation on BG/P: Thread Count Effect

[Plot: bandwidth (MiB/s) vs. message size (KB) for IOFSL FIFO with 16 threads and 32 threads]

16 threads > 32 threads

SLIDE 24

Related Work

  • Computational Plant Project @ Sandia National Laboratory
  • First introduced the I/O forwarding layer
  • IBM Blue Gene/L, Blue Gene/P
  • All I/O requests are forwarded to I/O nodes
  • The compute node OS can be stripped down to minimal functionality, which reduces OS noise
  • ZOID: I/O Forwarding Project [Kamil 2008]
  • Only on Blue Gene
  • Lustre Network Request Scheduler (NRS) [Qian 2009]
  • Request scheduler at the parallel file system nodes
  • Only simulation results

SLIDE 25

Future Work

  • Event-driven server architecture
  • Reduced thread contention
  • Collaborative caching at the I/O forwarding layer
  • Multiple I/O forwarders work collaboratively to cache data and also metadata
  • Hints from MPI-IO
  • Better cooperation with collective I/O
  • Evaluation on other leadership-scale machines
  • ORNL Jaguar, Cray XT4, XT5 systems

SLIDE 26

Conclusions

  • Implementation and evaluation of two optimization techniques at the I/O forwarding layer:
  • I/O pipelining that overlaps the file system requests and the network communication.
  • I/O scheduler that reduces the number of independent, non-contiguous file system accesses.
  • Demonstrated a portable I/O forwarding layer and a performance comparison with the existing HPC I/O software stack.
  • Two environments:
  • T2K Tokyo Linux cluster
  • ANL Blue Gene/P Surveyor
  • First I/O forwarding evaluations on a Linux cluster and Blue Gene/P
  • First comparison of the BG/P IBM stack with an open-source stack

SLIDE 27

Thanks!

Kazuki Ohta (presenter): Preferred Infrastructure, Inc., University of Tokyo
Dries Kimpe, Jason Cope, Kamil Iskra, Robert Ross: Argonne National Laboratory
Yutaka Ishikawa: University of Tokyo
Contact: kazuki.ohta@gmail.com
